Repository Fringe 2017 (#rfringe17) – Day One Liveblog

Welcome – Janet Roberts, Director of EDINA

My colleagues were explaining to me that this event came from an idea from Les Carr that should be not just one repository conference, but also a fringe – and here were are at the 10th Repository Fringe on the cusp of the Edinburgh Fringe.

So, this week we celebrate ten years of repository fringe, the progress we have made over the last 10 years to share content beyond borders. It is a space for debating future trends and challenges.

At EDINA we established the OpenDepot to provide a space for those without a repository… That has now migrated to Zenodo… and the challenges are changing, around the size of data, how we store and access that data, and what those next generation repositories will look like.

Over the next few days we have some excellent speakers as well as some fringe events, including the Wiki Datathon – so I hope you have all brought your laptops!

Thank you to our organising team from EDINA, DCC and the University of Edinburgh. Thank you also to our sponsors: Atmire; FigShare; Arkivum; ePrints; and Jisc!

Opening Keynote – Kathleen Shearer, Executive Director COARRaising our game – repositioning repositories as the foundation for sustainable scholarly communication

Theo Andrew: I am delighted to introduce Kathleen, who has been working in digital libraries and repositories for years. COAR is an international organisation of repositories, and I’m pleased to say that Edinburgh has been a member for some time.

Kathleen: Thank you so much for inviting me. It’s actually my first time speaking in the UK and it’s a little bit intimidating as I know that you folks are really ahead here.

COAR is now about 120 members. Our activities fall into four areas: presenting an international voice so that repositories are part of a global community with diverse perspective. We are being more active in training for repository managers, something which is especially important in developing countries. And the other area is value added services, which is where today’s talk on the repository of the future comes in. The vision here is about

But first, a rant… The international publishing system is broken! And it is broken for a number of reasons – there is access, and the cost of access. The cost of scholarly journals goes up far beyond the rate of inflation. That touches us in Canada – where I am based, in Germany, in the UK… But much more so in the developing world. And then we have the “Big Deal”. A study of University of Montreal libraries by Stephanie Gagnon found that of 50k subscribed-to journals, really there were only 5,893 unique essential titles. But often those deals aren’t opted out of as the key core journals separately cost the same as that big deal.

We also have a participation problem… Juan Pablo Alperin’s map of authors published in Web of Science shows a huge bias towards the US and the UK, a seriously reduced participation in Africa and parts of Asia. Why does that happen? The journals are operated from the global North, and don’t represent the kinds of research problems in the developing world. And one Nobel Prize winner notes that the pressure to publish in “luxury” journals encourages researchers to cut corners and pursue trendy fields rather than areas where there are those research gaps. That was the cake with Zika virus – you could hardly get research published on that until a major outbreak brought it to the attention of the dominant publishing cultures, then there was huge appetite to publish there.

Timothy Gowers talks about “perverse incentives” which are supporting the really high costs of journals. It’s not just a problem for researchers and how they publish, its also a problem of how we incentivise researchers to publish. So, this is my goats in trees slide… It doesn’t feel like goats should be in trees… Moroccan tree goats are taught to climb the trees when there isn’t food on the ground… I think of the researchers able to publish in these high end journals as being the lucky goats in the tree here…

In order to incentivise participation in high end journals we have created a lucrative publishing industry. I’m sure you’ve seen the recent Guardian article: “is the staggeringly profitable business of science publishing bad for science”. Yes. For those reasons of access and participation. We see very few publishers publishing the majority of titles, and there is a real

My colleague Leslie Chan, funded by the International Development Council, talked about openness not just being about gaining access to knowledge but also about having access to participate in the system.

On the positive side… Open access has arrived. A recent study (Piwowar et al 2017) found that about 45% of articles published in 2015 were open access. And that is increasing every year. And you have probably seen the May 27th 2016 statement from the EU that all research they fund must be open by 2020.

It hasn’t been a totally smooth transition… APCs (Article Processing Charges) are very much in the mix and part of the picture… Some publishers are trying to slow the growth of access, but they can see that it’s coming and want to retain their profit margins. And they want to move to all APCs. There is discussion here… There is a project called OA2020 which wants to flip from subscription based to open access publishing. It has some traction but there are concerns here, particularly about sustainability of scholarly comms in the long term. And we are not syre that publishers will go for it… Particularly one of them (Elsevier) which exited talks in The Netherlands and Germany. In Germany the tap was turned off for a while for Elsevier – and there wasn’t a big uproar from the community! But the tap has been turned back on…

So, what will the future be around open access? If you look across APCs and the average value… If you think about the relative value of journals, especially the value of high end journals… I don’t think we’ll see lesser increases in APCs in the future.

At COAR we have a different vision…

Lorcan Dempsey talked about the idea of the “inside out” library. Similarly a new MIT Future of Libraries Report – published by a broad stakeholder group that had spent 6 months working on a vision – came up with the need for libraries to be open, trusted, durable, interdisciplinary, interoperable content platform. So, like the inside out library, it’s about collecting the output of your organisation and making is available to the world…

So, for me, if we collect articles… We just perpetuate the system and we are not in a position to change the system. So how do we move forward at the same time as being kind of reliant on that system.

Eloy Rodrigues, at Open Repository earlier this year, asked whether repositories are a success story. They are ubiquitous, they are adopted and networked… But then they are also using old, pre-web technologies; mostly passive recipients; limited interoperability making value added systems hard; and not really embedded in researcher workflows. These are the kinds of challenges we need to address in next generation of repositories…

So we started a working group on Next Generation Repositories to define new technologies for repositories. We want to position repositories as the foundation for a distributed, globally networked infrastructure for scholarly communication. And on top of which we want to be able to add layers of value added services. Our principles include distributed control to guard againts failure, change, etc. We want this to be inclusive, and reflecting the needs of the research communities in the global south. We want intelligent openness – we know not everything can be open.

We also have some design assumptions, with a focus on the resources themselves, not just associated metadata. We want to be pragmatic, and make use of technologies we have…

To date we have identified major use cases and user stories, and shared those. We determined functionality and behaviours; and a conceptual models. At the moment we are defining specific technologies and architectures. We will publish recommendations in September 2017. We then need to promote it widely and encourages adoption and implementation, as well as the upgrade of repositories around the world (a big challenge).

You can view our user stories online. But I’d like to talk about a few of these… We would like to enable peer review on top of repositories… To slowly incrementally replace what researchers do. That’s not building peer review in repositories, but as a layer on top. We also want some social functionalities like recommendations. And we’d like standard usage metrics across the world to understand what is used and hw.. We are looking to the UK and the IRUS project there as that has already been looked at here. We also need to address discovery… Right now we use metadata, rather than indexing full text content… So contat can be hard to get to unless the metadata is obvious. We also need data syncing in hubs, indexing systems, etc. reflect changes in the repositories. And we also want to address preservation – that’s a really important role that we should do well, and it’s something that can set us apart from the publishers – preservation is not part of their business model.

So, this is a slide from Peter Knoth at CORE – a repository aggregator – who talks about expanding the repository, and the potential to layer all of these additional services on top.

To make this happen we need to improve the functionality of repositories: to be of and not just on the web. But we also need to step out of the article paradigm… The whole system is set up around the article, but we need to think beyond that, deposit other content, and ensure those research outputs are appropriately recognised.

So, we have our (draft) conceptual model… It isn’t around siloed individual repositories, but around a whole network. And some of our draft recommendations for technologies for next generation repositories. These are a really early view… These are things like: ResourceSync; Signposting; Messaging protocols; Message queue; IIIF presentation API; AOAuth; Webmention; and more…

Critical to the widespread adoption of this process is the widespread adoption of the behaviours and functionalities for next generation repositories. It won’t be a success if only one software or approach takes these on. So I’d like to quote a Scottish industrialist, Andrew Carnegie: “strength is derived from unity…. “. So we need to coalesce around a common vision.

Ad it isn’t just about a common vision, science is global and networked and our approach has to reflect and connect with that. Repositories need to balance a dual mission to (1) showcase and provide access to institutional research and (2) be nodes in a global research network.

To support better networking in repositories and in Venice, in May we signed an International Accord for Repository Networks, with networks from Australasia, Canada, China, Europe, Japan, Latin America, South Africa, United States. For us there is a question about how best we work with the UK internationally. We work with with OpenAIRE but maybe we need something else as well. The networks across those areas are advancing at different paces, but have committed to move forward.

There are three areas of that international accord:

  1. Strategic coordination – to have a shared vision and a stronger voice for the repository community
  2. Interoperability and common “behaviours” for repositories – supporting the development of value added services
  3. Data exchange and cross regional harvesting – to ensure redundancy and preservation. This has started but there is a lot to do here still, especially as we move to harvesting full text, not just metadata. And there is interest in redundancy for preservation reasons.

So we need to develop the case for a distributed community-managed infrastructure, that will better support the needs of diverse regions, disciplines and languages. Redundancy will safeguard against failure. With less risk of commercial buy out. Places the library at the centre… But… I appreciate it is much harder to sell a distributed system… We need branding that really attracts researchers to take part and engage in †he system…

And one of the things we want to avoid… Yesterday it was announced that Elsevier has acquired bepress. bepress is mainly used in the US and there will be much thinking about the implications for their repositories. So not only should institutional repositories be distributed, but they should be different platforms, and different open source platforms…

Concluding thoughts here… Repositories are a technology and technologies change. What its really promoting is a vision in which institutions, universities and their libraries are the foundational nodes in a global scholarly communication system. This is really the future of libraries in the scholarly communication community. This is what libraries should be doing. This is what our values represent.

And this is urgent. We see Elsevier consolidating, buying platforms, trying to control publishers and the research cycle, we really have to move forward and move quickly. I hope the UK will remain engaged with this. And i look forward to your participation in our ongoing dialogue.


Q1 – Les Carr) I was very struck by that comment about the need to balance the local and the global I think that’s a really major opportunity for my university. Everyone is obsessed about their place in the global university ranking, their representation as a global university. This could be a real opportunity, led by our libraries and knowledge assets, and I’m really excited about that!

A1) I think the challenge around that is trying to support common values… If you are competing with other institutions it’s not always an incentive to adopt systems with common technologies, measures, approaches. So there needs to be a benefit for institutions in joining this network. It is a huge opportunity, but we have to show the value of joining that network It’s maybe easier in the UK, Europe, Canada. In the US they don’t see that value as much… They are not used to collaborating in this way and have been one of the hardest regions to bring onboard.

Q2 – Adam ?) Correct me if I’m wrong… You are talking about a Commons… In some way the benefits are watered down as part of the Commons, so how do we pay for this system, how do we make this benefit the organisation?

A2) That’s where I see that challenge of the benefit. There has to be value… That’s where value added systems come in… So a recommender system is much more valuable if it crosses all of the repositories… That is a benefit and allows you to access more material and for more people to access yours. I know CORE at the OU are already building a recommender system in their own aggregated platform.

Q3 – Anna?) At the sharp end this is not a problem for libraries, but a problem for academia… If we are seen as librarians doing things to or for academics that won’t have as much traction… How do we engage academia…

A3) There are researchers keen to move to open access… But it’s hard to represent what we want to do at a global level when many researchers are focused on that one journal or area and making that open access… I’m not sure what the elevator pitch should be here. I think if we can get to that usage statistics data there, that will help… If we can build an alternative system that even research administrators can use in place of impact factor or Web of Science, that might move us forward in terms of showing this approach has value. Administrators are still stuck in having to evaluate the quality of research based on journals and impact factors. This stuff won’t happen in a day. But having standardised measures across repositories will help.

So, one thing we’ve done in Canada with the U15 (top 15 universities in Canada)… They are at the top of what they can do in terms of the cost of scholarly journals so they asked us to produce a paper for them on how to address that… I think that issue of cost could be an opportunity…

Q4) I’m an academic and we are looking for services that make our life better… Here at Edinburgh we can see that libraries are the naturally the consistent point of connection with repository. Does that translate globally?

A4) It varies globally. Libraries are fairly well recognised in Western countries. In developing world there are funding and capacity challenges that makes that harder… There is also a question of whether we need repositories for every library.. Can we do more consortia repositories or similar.

Q5 – Chris) You talked about repository supporting all kinds of materials… And how they can “wag the dog” of the article

A5) I think with research data there is so much momentum there around making data available… But I don’t know how well we are set up with research data management to ensure data can be found and reused. We need to improve the technology in repositories. And we need more resources too…

Q6) Can we do more to encourage academics, researchers, students to reuse data and content as part of their practice?

A6) I think the more content we have at Commons level, the more it can be reused. We have to improve discoverability, and improve the functionality to help that content to be reused… There is huge use of machine reuse of content – I was speaking with Peter Knoth about this – but that isn’t easy to do with repositories…

Theo) It would be really useful to see Open Access buttons more visible, using repositories for document delivery, etc.

Chris Banks, Director of Library Services, Imperial CollegeFocusing upstream: supporting scholarly communication by academics

10×10 presentations (Chair: Ianthe Sutherland, University Library & Collections)

  1. v2.juliet – A Model For SHERPA’s Mid-Term Infrastructure. Adam Field, Jisc
  1. CORE Recommender: a plug in suggesting open access content. Nancy Pontika, CORE
  1. Enhancing Two workflows with RSpace & Figshare: Active Data to Archival Data and Research to Publication. Rory Macneil, Research Space and Megan Hardeman of Figshare
  1. Thesis digitisation project. Gavin Willshaw, University of Edinburgh
  1. Weather Cloudy & Cool Harvest Begun’: St Andrews output usage beyond the repository. Michael Bryce, University of St Andrews

Impact and the REF panel session

Brief for this session: How are institutions preparing for the next round of the Research Excellence Framework #REF2021, and how do repositories feature in this? What lessons can we learn from the last REF and what changes to impact might we expect in 2021? How can we improve our repositories and associated services to support researchers to achieve and measure impact with a view to the REF? In anticipation of the forthcoming announcement by HEFCE later this year of the details of how #REF2021 will work, and how impact will be measured, our panel will discuss all these issues and answer questions from RepoFringers.

Pauline Jones, REF Manager and Head of Strategic Performance and Research Policy, University of Edinburgh

Anne-Sofie Laegran, Knowledge Exchange Manager, College of Arts, Humanities and Social Sciences, University of Edinburgh

Catriona Firth, REF Deputy Manager, HEFCE

Chair: Keith McDonald, Assistant Director, Research and Innovation Directorate, Scottish Funding Council

10×10 presentations

  1. National Open Data and Open Science Policies in Europe. Martin Donnelly, DCC
  1. IIIF: you can keep your head while all around are losing theirs! Scott Renton, University of Edinburgh
  1. Reference Rot in theses: a HiberActive pilot. Nicola Osborne, EDINA
  1. Lifting the lid on global research impact: implementation and analysis of a Request a Copy service. Dimity Flanagan, London School of Economics and Political Science
  1. What RADAR did next: developing a peer review process for research plans. Nicola Siminson, Glasgow School of Art
  1. Edinburgh DataVault: Local implementation of Jisc DataVault: the value of testing. Pauline Ward, EDINA
  1. Data Management & Preservation using PURE and Archivematica at Strathclyde. Alan Morrisson, University of Strathclyde
  1. Open Access… From Oblivion… To the Spotlight? Dawn Hibbert, University of Northampton
  1. Automated metadata collection from the researcher CV Lattes Platform to aid IR ingest. Chloe Furnival, Universidade Federal de São Carlos
  1. The Changing Face of Goldsmiths Research Online. Jeremiah Spillane, Goldsmiths, University of London

Chair: Ianthe Sutherland, University Library & Collections


Reflecting on my Summer Blockbusters and Forthcoming Attractions (including #codi17)

As we reach the end of the academic year, and I begin gearing up for the delightful chaos of the Edinburgh Fringe and my show, Is Your Online Reputation Hurting You?, I thought this would be a good time to look back on a busy recent few months of talks and projects (inspired partly by Lorna Campbell’s post along the same lines!).

This year the Managing Your Digital Footprint work has been continuing at a pace…

We began the year with funding from the Principal’s Teaching Award Scheme for a new project, led by Prof. Sian Bayne: “A Live Pulse”: Yik Yak for Teaching, Learning and Research at Edinburgh. Sian, Louise Connelly (PI for the original Digital Footprint research), and I have been working with the School of Informatics and a small team of fantastic undergraduate student research associates to look at Yik Yak and anonymity online. Yik Yak closed down this spring which has made this even more interesting as a cutting edge research project. You can find out more on the project blog – including my recent post on addressing ethics of research in anonymous social media spaces; student RA Lilinaz’s excellent post giving her take on the project; and Sian’s fantastic keynote from#CALRG2017, giving an overview of the challenges and emerging findings from this work. Expect more presentations and publications to follow over the coming months.

Over the last year or so Louise Connelly and I have been busy developing a Digital Footprint MOOC building on our previous research, training and best practice work and share this with the world. We designed a three week MOOC (Massive Open Online Course) that runs on a rolling basis on Coursera – a new session kicks off every month. The course launched this April and we were delighted to see it get some fantastic participant feedback and some fantastic press coverage (including a really positive experience of being interviewed by The Sun).

The MOOC has been going well and building interest in the consultancy and training work around our Digital Footprint research. Last year I received ISG Innovation Fund support to pilot this service and the last few months have included great opportunities to share research-informed expertise and best practices through commissioned and invited presentations and sessions including those for Abertay University, University of Stirling/Peer Review Project Academic Publishing Routes to Success event, Edinburgh Napier University, Asthma UK’s Patient Involvement Fair, CILIPS Annual Conference, CIGS Web 2.0 & Metadata seminar, and ReCon 2017. You can find more details of all of these, and other presentations and workshops on the Presentations & Publications page.

In June an unexpected short notice invitation came my way to do a mini version of my Digital Footprint Cabaret of Dangerous Ideas show as part of the Edinburgh International Film Festival. I’ve always attended EIFF films but also spent years reviewing films there so it was lovely to perform as part of the official programme, working with our brilliant CODI compare Susan Morrison and my fellow mini-CODI performer, mental health specialist Professor Steven Lawrie. We had a really engaged audience with loads of questions – an excellent way to try out ideas ahead of this August’s show.

Also in June, Louise and I were absolutely delighted to find out that our article (in Vol. 11, No. 1, October 2015) for ALISS Quarterly, the journal of the Association of Librarians and Information Professionals in the Social Sciences, had been awarded Best Article of the Year. Huge thanks to the lovely folks at ALISS – this was lovely recognition for our article, which can read in full in the ALISS Quarterly archive.

In July I attended the European Conference on Social Media (#ecsm17) in Vilnius, Lithuania. In addition to co-chairing the Education Mini Track with the lovely Stephania Manca (Italian National Research Council), I was also there to present Louise and my Digital Footprint paper, “Exploring Risk, Privacy and the Impact of Social Media Usage with Undergraduates“, and to present a case study of the EDINA Digital Footprint consultancy and training service for the Social Media in Practice Excellence Awards 2017. I am delighted to say that our service was awarded 2nd place in those awards!

Social Media in Practice Excellence Award 2017 - 2nd place - certificate

My Social Media in Practice Excellence Award 2017 2nd place certificate (still awaiting a frame).

You can read more about the awards – and my fab fellow finalists Adam and Lisa – in this EDINA news piece.

On my way back from Lithuania I had another exciting stop to make at the Palace of Westminster. The lovely folk at the Parliamentary Digital Service invited me to give a talk, “If I Googled you, what would I find? Managing your digital footprint” for their Cyber Security Week which is open to members, peers, and parliamentary staff. I’ll have a longer post on that presentation coming very soon here. For now I’d like to thank Salim and the PDS team for the invitation and an excellent experience.

The digital flyer for my CODI 2017 show - huge thanks to the CODI interns for creating this.

The digital flyer for my CODI 2017 show (click to view a larger version) – huge thanks to the CODI interns for creating this.

The final big Digital Footprint project of the year is my forthcoming Edinburgh Fringe show, Is Your Online Reputation Hurting You? (book tickets here!). This year the Cabaret of Dangerous Ideas has a new venue – the New Town Theatre – and two strands of events: afternoon shows; and “Cabaret of Dangerous Ideas by Candlelight”. It’s a fantastic programme across the Fringe and I’m delighted to be part of the latter strand with a thrilling but challengingly competitive Friday night slot during peak fringe! However, that evening slot also means we can address some edgier questions so I will be talking about how an online reputation can contribute to fun, scary, weird, interesting experiences, risks, and opportunities – and what you can do about it.

QR code for CODI17 Facebook Event

Help spread the word about my CODI show by tweeting with #codi17 or sharing the associated Facebook event.

To promote the show I will be doing a live Q&A on YouTube on Saturday 5th August 2017, 10am. Please do add your questions via Twitter (#codi17digifoot) or via this anonymous survey and/or tune in on Saturday (the video below will be available on the day and after the event).

So, that’s been the Digital Footprint work this spring/summer… What else is there to share?

Well, throughout this year I’ve been working on a number of EDINA’s ISG Innovation Fund projects…

The Reference Rot in Theses: a HiberActive Pilot project has been looking at how to develop the fantastic prior work undertaken during the Andrew W. Mellon-funded Hiberlink project (a collaboration between EDINA, Los Alamos National Laboratory, and the University of Edinburgh School of Informatics), which investigated “reference rot” (where URLs cease to work) and “content drift” (where URLs work but the content changes over time) in scientific scholarly publishing.

For our follow up work the focus has shifted to web citations – websites, reports, etc. – something which has become a far more visible challenge for many web users since January. I’ve been managing this project, working with developer, design and user experience colleagues to develop a practical solution around the needs of PhD students, shaped by advice from Library and University Collections colleagues.

If you are familiar with the Memento standard, and/or follow Herbert von de Sompel and Martin Klein’s work you’ll be well aware of how widespread the challenge of web citations changing over time can be, and the seriousness of the implications. The Internet Archive might be preserving all the (non-R-rated) gifs from Geocities but without preserving government reports, ephemeral content, social media etc. we would be missing a great deal of the cultural record and, in terms of where our project comes in, crucial resources and artefacts in many modern scholarly works. If you are new the issue of web archiving I would recommend a browse of my notes from the IIPC Web Archiving Week 2017 and papers from the co-located RESAW 2017 conference.

A huge part of the HiberActive project has been working with five postgraduate student interns to undertake interviews and usability work with PhD students across the University. My personal and huge thanks to Clarissa, Juliet, Irene, Luke and Shiva!

Still from the HiberActive gif featuring Library Cat.

A preview of the HiberActive gif featuring Library Cat.

You can see the results of this work at our demo site,, and we would love your feedback on what we’ve done. You’ll find an introductory page on the project as well as three tools for archiving websites and obtaining the appropriate information to cite – hence adopting the name one our interviewees suggested, Site2Cite. We are particularly excited to have a tool which enables you to upload a Word or PDF document, have all URLs detected, and which then returns a list of URLs and the archived citable versions (as a csv file).

Now that the project is complete, we are looking at what the next steps may be so if you’d find these tools useful for your own publications or teaching materials, we’d love to hear from you.  I’ll also be presenting this work at Repository Fringe 2017 later this week so, if you are there, I’ll see you in the 10×10 session on Thursday!

To bring the HiberActive to life our students suggested something fun and my colleague Jackie created a fun and informative gif featuring Library Cat, Edinburgh’s world famous sociable on-campus feline. Library Cat has also popped up in another EDINA ISG Innovation-Funded project, Pixel This, which my colleagues James Reid and Tom Armitage have been working on. This project has been exploring how Pixel Sticks could be used around the University. To try them out properly I joined the team for fun photography night in George Square with Pixel Stick loaded with images of notable University of Edinburgh figures. One of my photos from that night, featuring the ghostly image of the much missed Library Cat (1.0) went a wee bit viral over on Facebook:

James Reid and I have also been experimenting with Tango-capable phone handsets in the (admittedly daftly named) Strictly Come Tango project. Tango creates impressive 3D scans of rooms and objects and we have been keen to find out what one might do with that data, how it could be used in buildings and georeferenced spaces. This was a small exploratory project but you can see a wee video on what we’ve been up to here.

In addition to these projects I’ve also been busy with continuing involvement in the Edinburgh Cityscope project, which I sit on the steering group for. Cityscope provided one of our busiest events for this spring’s excellent Data Festread more about EDINA’s participation in this new exciting event around big data, data analytics and data driven innovation, here.

I have also been working on two rather awesome Edinburgh-centric projects. Curious Edinburgh officially launched for Android, and released an updated iOS app, for this year’s Edinburgh International Science Festival in April. The app includes History of Science; Medicine; Geosciences; Physics; and a brand new Biotechnology tours that led you explore Edinburgh’s fantastic scientific legacy. The current PTAS-funded project is led by Dr Niki Vermeulen (Science, Technology & Innovation Studies), with tours written by Dr Bill Jenkins, and will see the app used in teaching around 600 undergraduate students this autumn. If you are curious about the app (pun entirely intended!), visiting Edinburgh – or just want to take a long distance virtual tour – do download the app, rate and review it, and let us know what you think!

Image of the Curious Edinburgh History of Biotechnology and Genetics Tour.

A preview of the new Curious Edinburgh History of Biotechnology and Genetics Tour.

The other Edinburgh project which has been progressing at a pace this year is LitLong: Word on the Street, an AHRC-funded project which builds on the prior LitLong project to develop new ways to engage with Edinburgh’s rich literary heritage. Edinburgh was the first city in the world to be awarded UNESCO City of Literature status (in 2008) and there are huge resources to draw upon. Prof. James Loxley (English Literature) is leading this project, which will be showcased in some fun and interesting ways at the Edinburgh International Book Festival this August. Keep an eye on for updates or follow @litlong.

And finally… Regular readers here will be aware that I’m Convener for eLearning@ed (though my term is up and I’ll be passing the role onto a successor later this year – nominations welcomed!), a community of learning technologists and academic and support staff working with technologies in teaching and learning contexts. We held our big annual conference, eLearning@ed 2017: Playful Learning this June and I was invited to write about it on the ALTC Blog. You can explore a preview and click through to my full article below.

Playful Learning: the eLearning@ed Conference 2017

Phew! So, it has been a rather busy few months for me, which is why you may have seen slightly fewer blog posts and tweets from me of late…

In terms of the months ahead there are some exciting things brewing… But I’d also love to hear any ideas you may have for possible collaborations as my EDINA colleagues and I are always interested to work on new projects, develop joint proposals, and work in new innovative areas. Do get in touch!

And in the meantime, remember to book those tickets for my CODI 2017 show if you can make it along on 11th August!


ReCon 2017 – Liveblog

Today I’m at ReCon 2017, giving a presentation later (flying the flag for the unconference sessions!) today but also looking forward to a day full of interesting presentations on publishing for early careers researchers.

I’ll be liveblogging (except for my session) and, as usual, comments, additions, corrections, etc. are welcomed. 

Jo Young, Director of the Scientific Editing Company, is introducing the day and thanking the various ReCon sponsors. She notes: ReCon started about five years ago (with a slightly different name). We’ve had really successful events – and you can explore them all online. We have had a really stellar list of speakers over the years! And on that note…

Graham Steel: We wanted to cover publishing at all stages, from preparing for publication, submission, journals, open journals, metrics, alt metrics, etc. So our first speakers are really from the mid point in that process.

SESSION ONE: Publishing’s future: Disruption and Evolution within the Industry

100% Open Access by 2020 or disrupting the present scholarly comms landscape: you can’t have both? A mid-way update – Pablo De Castro, Open Access Advocacy Librarian, University of Strathclyde

It is an honour to be at this well attended event today. Thank you for the invitation. It’s a long title but I will be talking about how are things are progressing towards this goal of full open access by 2020, and to what extent institutions, funders, etc. are being able to introduce disruption into the industry…

So, a quick introduction to me. I am currently at the University of Strathclyde library, having joined in January. It’s quite an old university (founded 1796) and a medium size university. Previous to that I was working at the Hague working on the EC FP7 Post-Grant Open Access Pilot (Open Aire) providing funding to cover OA publishing fees for publications arising from completed FP7 projects. Maybe not the most popular topic in the UK right now but… The main point of explaining my context is that this EU work was more of a funders perspective, and now I’m able to compare that to more of an institutional perspective. As a result o of this pilot there was a report commissioned b a British consultant: “Towards a competitive and sustainable open access publishing market in Europe”.

One key element in this open access EU pilot was the OA policy guidelines which acted as key drivers, and made eligibility criteria very clear. Notable here: publications to hybrid journals would not be funded, only fully open access; and a cap of no more than €2000 for research articles, €6000 for monographs. That was an attempt to shape the costs and ensure accessibility of research publications.

So, now I’m back at the institutional open access coalface. Lots had changed in two years. And it’s great to be back in this spaces. It is allowing me to explore ways to better align institutional and funder positions on open access.

So, why open access? Well in part this is about more exposure for your work, higher citation rates, compliant with grant rules. But also it’s about use and reuse including researchers in developing countries, practitioners who can apply your work, policy makers, and the public and tax payers can access your work. In terms of the wider open access picture in Europe, there was a meeting in Brussels last May where European leaders call for immediate open access to all scientific papers by 2020. It’s not easy to achieve that but it does provide a major driver… However, across these countries we have EU member states with different levels of open access. The UK, Netherlands, Sweden and others prefer “gold” access, whilst Belgium, Cyprus, Denmark, Greece, etc. prefer “green” access, partly because the cost of gold open access is prohibitive.

Funders policies are a really significant driver towards open access. Funders including Arthritis Research UK, Bloodwise, Cancer Research UK, Breast Cancer Now, British Heard Foundation, Parkinsons UK, Wellcome Trust, Research Councils UK, HEFCE, European Commission, etc. Most support green and gold, and will pay APCs (Article Processing Charges) but it’s fair to say that early career researchers are not always at the front of the queue for getting those paid. HEFCE in particular have a green open access policy, requiring research outputs from any part of the university to be made open access, you will not be eligible for the REF (Research Excellence Framework) and, as a result, compliance levels are high – probably top of Europe at the moment. The European Commission supports green and gold open access, but typically green as this is more affordable.

So, there is a need for quick progress at the same time as ongoing pressure on library budgets – we pay both for subscriptions and for APCs. Offsetting agreements are one way to do this, discounting subscriptions by APC charges, could be a good solutions. There are pros and cons here. In principal it will allow quicker progress towards OA goals, but it will disproportionately benefit legacy publishers. It brings publishers into APC reporting – right now sometimes invisible to the library as paid by researchers, so this is a shift and a challenge. It’s supposed to be a temporary stage towards full open access. And it’s a very expensive intermediate stage: not every country can or will afford it.

So how can disruption happen? Well one way to deal with this would be the policies – suggesting not to fund hybrid journals (as done in OpenAire). And disruption is happening (legal or otherwise) as we can see in Sci-Hub usage which are from all around the world, not just developing countries. Legal routes are possible in licensing negotiations. In Germany there is a Projekt Deal being negotiated. And this follows similar negotiations by open At the moment Elsevier is the only publisher not willing to include open access journals.

In terms of tools… The EU has just announced plans to launch it’s own platform for funded research to be published. And Wellcome Trust already has a space like this.

So, some conclusions… Open access is unstoppable now, but still needs to generate sustainable and competitive implementation mechanisms. But it is getting more complex and difficult to disseminate to research – that’s a serious risk. Open Access will happen via a combination of strategies and routes – internal fights just aren’t useful (e.g. green vs gold). The temporary stage towards full open access needs to benefit library budgets sooner rather than later. And the power here really lies with researchers, which OA advocates aren’t always able to get informed. It is important that you know which are open and which are hybrid journals, and why that matters. And we need to think if informing authors on where it would make economic sense to publish beyond the remit of institutional libraries?

To finish, some recommended reading:

  • “Early Career Researchers: the Harbingers of Change” – Final report from Ciber, August 2016
  • “My Top 9 Reasons to Publish Open Access” – a great set of slides.


Q1) It was interesting to hear about offsetting. Are those agreements one-off? continuous? renewed?

A1) At the moment they are one-off and intended to be a temporary measure. But they will probably mostly get renewed… National governments and consortia want to understand how useful they are, how they work.

Q2) Can you explain green open access and gold open access and the difference?

A2) In Gold Open Access, the author pays to make your paper open on the journal website. If that’s a hybrid – so subscription – journal you essentially pay twice, once to subscribe, once to make open. Green Open Access means that your article goes into your repository (after any embargo), into the world wide repository landscape (see:

Q3) As much as I agree that choices of where to publish are for researchers, but there are other factors. The REF pressures you to publish in particular ways. Where can you find more on the relationships between different types of open access and impact? I think that can help?

A3) Quite a number of studies. For instance is APC related to Impact factor – several studies there. In terms of REF, funders like Wellcome are desperate to move away from the impact factor. It is hard but evolving.

Inputs, Outputs and emergent properties: The new Scientometrics – Phill Jones, Director of Publishing Innovation, Digital Science

Scientometrics is essentially the study of science metrics and evaluation of these. As Graham mentioned in his introduction, there is a whole complicated lifecycle and process of publishing. And what I will talk about spans that whole process.

But, to start, a bit about me and Digital Science. We were founded in 2011 and we are wholly owned by Holtzbrink Publishing Group, they owned Nature group. Being privately funded we are able to invest in innovation by researchers, for researchers, trying to create change from the ground up. Things like labguru – a lab notebook (like rspace); Altmetric; Figshare; readcube; Peerwith; transcriptic – IoT company, etc.

So, I’m going to introduce a concept: The Evaluation Gap. This is the difference between the metrics and indicators currently or traditionally available, and the information that those evaluating your research might actually want to know? Funders might. Tenure panels – hiring and promotion panels. Universities – your institution, your office of research management. Government, funders, policy organisations, all want to achieve something with your research…

So, how do we close the evaluation gap? Introducing altmetrics. It adds to academic impact with other types of societal impact – policy documents, grey literature, mentions in blogs, peer review mentions, social media, etc. What else can you look at? Well you can look at grants being awarded… When you see a grant awarded for a new idea, then publishes… someone else picks up and publishers… That can take a long time so grants can tell us before publications. You can also look at patents – a measure of commercialisation and potential economic impact further down the link.

So you see an idea germinate in one place, work with collaborators at the institution, spreading out to researchers at other institutions, and gradually out into the big wide world… As that idea travels outward it gathers more metadata, more impact, more associated materials, ideas, etc.

And at Digital Science we have innovators working across that landscape, along that scholarly lifecycle… But there is no point having that much data if you can’t understand and analyse it. You have to classify that data first to do that… Historically we did that was done by subject area, but increasingly research is interdisciplinary, it crosses different fields. So single tags/subjects are not useful, you need a proper taxonomy to apply here. And there are various ways to do that. You need keywords and semantic modeling and you can choose to:

  1. Use an existing one if available, e.g. MeSH (Medical Subject Headings).
  2. Consult with subject matter experts (the traditional way to do this, could be editors, researchers, faculty, librarians who you’d just ask “what are the keywords that describe computational social science”).
  3. Text mining abstracts or full text article (using the content to create a list from your corpus with bag of words/frequency of words approaches, for instance, to help you cluster and find the ideas with a taxonomy emerging

Now, we are starting to take that text mining approach. But to use that data needs to be cleaned and curated to be of use. So we hand curated a list of institutions to go into GRID: Global Research Identifier Database, to understand organisations and their relationships. Once you have that all mapped you can look at Isni, CrossRef databases etc. And when you have that organisational information you can include georeferences to visualise where organisations are…

An example that we built for HEFCE was the Digital Science BrainScan. The UK has a dual funding model where there is both direct funding and block funding, with the latter awarded by HEFCE and it is distributed according to the most impactful research as understood by the REF. So, our BrainScan, we mapped research areas, connectors, etc. to visualise subject areas, their impact, and clusters of strong collaboration, to see where there are good opportunities for funding…

Similarly we visualised text mined impact statements across the whole corpus. Each impact is captured as a coloured dot. Clusters show similarity… Where things are far apart, there is less similarity. And that can highlight where there is a lot of work on, for instance, management of rivers and waterways… And these weren’t obvious as across disciplines…


Q1) Who do you think benefits the most from this kind of information?

A1) In the consultancy we have clients across the spectrum. In the past we have mainly worked for funders and policy makers to track effectiveness. Increasingly we are talking to institutions wanting to understand strengths, to predict trends… And by publishers wanting to understand if journals should be split, consolidated, are there opportunities we are missing… Each can benefit enormously. And it makes the whole system more efficient.

Against capital – Stuart Lawson, Birkbeck University of London

So, my talk will be a bit different. The arguements I will be making are not in opposition to any of the other speakers here, but is about critically addressing our current ways we are working, and how publishing works. I have chosen to speak on this topic today as I think it is important to make visible the political positions that underly our assumptions and the systems we have in place today. There are calls to become more efficient but I disagree… Ownership and governance matter at least as much as the outcome.

I am an advocate for open access and I am currently undertaking a PhD looking at open access and how our discourse around this has been coopted by neoliberal capitalism. And I believe these issues aren’t technical but social and reflect inequalities in our society, and any company claiming to benefit society but operating as commercial companies should raise questions for us.

Neoliberalism is a political project to reshape all social relations to conform to the logic of capital (this is the only slide, apparently a written and referenced copy will be posted on Stuart’s blog). This system turns us all into capital, entrepreneurs of our selves – quantification, metricification whether through tuition fees that put a price on education, turn students into consumers selecting based on rational indicators of future income; or through pitting universities against each other rather than collaboratively. It isn’t just overtly commercial, but about applying ideas of the market in all elements of our work – high impact factor journals, metrics, etc. in the service of proving our worth. If we do need metrics, they should be open and nuanced, but if we only do metrics for people’s own careers and perform for careers and promotion, then these play into neoliberal ideas of control. I fully understand the pressure to live and do research without engaging and playing the game. It is easier to choose not to do this if you are in a position of privelege, and that reflects and maintains inequalities in our organisations.

Since power relations are often about labour and worth, this is inevitably part of work, and the value of labour. When we hear about disruption in the context of Uber, it is about disrupting rights of works, labour unions, it ignores the needs of the people who do the work, it is a neo-liberal idea. I would recommend seeing Audrey Watters’ recent presentation for University of Edinburgh on the “Uberisation of Education”.

The power of capital in scholarly publishing, and neoliberal values in our scholarly processes… When disruptors align with the political forces that need to be dismantled, I don’t see that as useful or properly disruptive. Open Access is a good thing in terms of open access. But there are two main strands of policy… Research Councils have spent over £80m to researchers to pay APCs. Publishing open access do not require payment of fees, there are OA journals who are funded other ways. But if you want the high end visible journals they are often hybrid journals and 80% of that RCUK has been on hybrid journals. So work is being made open access, but right now this money flows from public funds to a small group of publishers – who take a 30-40% profit – and that system was set up to continue benefitting publishers. You can share or publish to repositories… Those are free to deposit and use. The concern of OA policy is the connection to the REF, it constrains where you can publish and what they mean, and they must always be measured in this restricted structure. It can be seen as compliance rather than a progressive movement toward social justice. But open access is having a really positive impact on the accessibility of research.

If you are angry at Elsevier, then you should also be angry at Oxford University and Cambridge University, and others for their relationships to the power elite. Harvard made a loud statement about journal pricing… It sounded good, and they have a progressive open access policy… But it is also bullshit – they have huge amounts of money… There are huge inequalities here in academia and in relationship to publishing.

And I would recommend strongly reading some history on the inequalities, and the racism and capitalism that was inherent to the founding of higher education so that we can critically reflect on what type of system we really want to discover and share scholarly work. Things have evolved over time – somewhat inevitably – but we need to be more deliberative so that universities are more accountable in their work.

To end on a more positive note, technology is enabling all sorts of new and inexpensive ways to publish and share. But we don’t need to depend on venture capital. Collective and cooperative running of organisations in these spaces – such as the cooperative centres for research… There are small scale examples show the principles, and that this can work. Writing, reviewing and editing is already being done by the academic community, lets build governance and process models to continue that, to make it work, to ensure work is rewarded but that the driver isn’t commercial.


Comment) That was awesome. A lot of us here will be to learn how to play the game. But the game sucks. I am a professor, I get to do a lot of fun things now, because I played the game… We need a way to have people able to do their work that way without that game. But we need something more specific than socialism… Libraries used to publish academic data… Lots of these metrics are there and useful… And I work with them… But I am conscious that we will be fucked by them. We need a way to react to that.

Redesigning Science for the Internet Generation – Gemma Milne, Co-Founder, Science Disrupt

Science Disrupt run regular podcasts, events, a Slack channel for scientists, start ups, VCs, etc. Check out our website. We talk about five focus areas of science. Today I wanted to talk about redesigning science for the internet age. My day job is in journalism and I think a lot about start ups, and to think about how we can influence academia, how success is manifests itself in the internet age.

So, what am I talking about? Things like Pavegen – power generating paving stones. They are all over the news! The press love them! BUT the science does not work, the physics does not work…

I don’t know if you heard about Theranos which promised all sorts of medical testing from one drop of blood, millions of investments, and it all fell apart. But she too had tons of coverage…

I really like science start ups, I like talking about science in a different way… But how can I convince the press, the wider audience what is good stuff, and what is just hype, not real… One of the problems we face is that if you are not engaged in research you either can’t access the science, and can’t read it even if they can access the science… This problem is really big and it influences where money goes and what sort of stuff gets done!

So, how can we change this? There are amazing tools to help (Authorea, overleaf,, figshare, publons, labworm) and this is great and exciting. But I feel it is very short term… Trying to change something that doesn’t work anyway… Doing collaborative lab notes a bit better, publishing a bit faster… OK… But is it good for sharing science? Thinking about journalists and corporates, they don’t care about academic publishing, it’s not where they go for scientific information. How do we rethink that… What if we were to rethink how we share science?

AirBnB and Amazon are on my slide here to make the point of the difference between incremental change vs. real change. AirBnB addressed issues with hotels, issues of hotels being samey… They didn’t build a hotel, instead they thought about what people want when they traveled, what mattered for them… Similarly Amazon didn’t try to incrementally improve supermarkets.. They did something different. They dug to the bottom of why something exists and rethought it…

Imagine science was “invented” today (ignore all the realities of why that’s impossible). But imagine we think of this thing, we have to design it… How do we start? How will I ask questions, find others who ask questions…

So, a bit of a thought experiment here… Maybe I’d post a question on reddit, set up my own sub-reddit. I’d ask questions, ask why they are interested… Create a big thread. And if I have a lot of people, maybe I’ll have a Slack with various channels about all the facets around a question, invite people in… Use the group to project manage this project… OK, I have a team… Maybe I create a Meet Up Group for that same question… Get people to join… Maybe 200 people are now gathered and interested… You gather all these folk into one place. Now we want to analyse ideas. Maybe I share my question and initial code on GitHub, find collaborators… And share the code, make it open… Maybe it can be reused… It has been collaborative at every stage of the journey… Then maybe I want to build a microscope or something… I’d find the right people, I’d ask them to join my Autodesk 360 to collaboratively build engineering drawings for fabrication… So maybe we’ve answered our initial question… So maybe I blog that, and then I tweet that…

The point I’m trying to make is, there are so many tools out there for collaboration, for sharing… Why aren’t more researchers using these tools that are already there? Rather than designing new tools… These are all ways to engage and share what you do, rather than just publishing those articles in those journals…

So, maybe publishing isn’t the way at all? I get the “game” but I am frustrated about how we properly engage, and really get your work out there. Getting industry to understand what is going on. There are lots of people inventing in new ways.. YOu can use stuff in papers that isn’t being picked up… But see what else you can do!

So, what now? I know people are starved for time… But if you want to really make that impact, that you think is more interested… I undesrtand there is a concern around scooping… But there are ways to do that… And if you want to know about all these tools, do come talk to me!


Q1) I think you are spot on with vision. We want faster more collaborative production. But what is missing from those tools is that they are not designed for researchers, they are not designed for publishing. Those systems are ephemeral… They don’t have DOIs and they aren’t persistent. For me it’s a bench to web pipeline…

A1) Then why not create a persistent archived URI – a webpage where all of a project’s content is shared. 50% of all academic papers are only read by the person that published them… These stumbling blocks in the way of sharing… It is crazy… We shouldn’t just stop and not share.

Q2) Thank you, that has given me a lot of food for thought. The issue of work not being read, I’ve been told that by funders so very relevant to me. So, how do we influence the professors… As a PhD student I haven’t heard about many of those online things…

A2) My co-founder of Science Disrupt is a computational biologist and PhD student… My response would be about not asking, just doing… Find networks, find people doing what you want. Benefit from collaboration. Sign an NDA if needed. Find the opportunity, then come back…

Q3) I had a comment and a question. Code repositories like GitHub are persistent and you can find a great list of code repositories and meta-articles around those on the Journal of Open Research Software. My question was about AirBnB and Amazon… Those have made huge changes but I think the narrative they use now is different from where they started – and they started more as incremental change… And they stumbled on bigger things, which looks a lot like research… So… How do you make that case for the potential long term impact of your work in a really engaging way?

A3) It is the golden question. Need to find case studies, to find interesting examples… a way to showcase similar examples… and how that led to things… Forget big pictures, jump the hurdles… Show that bigger picture that’s there but reduce the friction of those hurdles. Sure those companies were somewhat incremental but I think there is genuinely a really different mindset there that matters.

And we now move to lunch. Coming up…


This will be me, so don’t expect an update for the moment…

SESSION TWO: The Early Career Researcher Perspective: Publishing & Research Communication

Getting recognition for all your research outputs – Michael Markie

Make an impact, know your impact, show your impact – Anna Ritchie

How to share science with hard to reach groups and why you should bother – Becky Douglas

What helps or hinders science communication by early career researchers? – Lewis MacKenzie



SESSION THREE: Raising your research profile: online engagement & metrics

Green, Gold, and Getting out there: How your choice of publisher services can affect your research profile and engagement – Laura Henderson

What are all these dots and what can linking them tell me? – Rachel Lammey

The wonderful world of altmetrics: why researchers’ voices matter – Jean Liu

How to help more people find and understand your work – Charlie Rapple




IIPC WAC / RESAW Conference 2017 – Day Two (Technical Strand) Liveblog

I am again at the IIPC WAC / RESAW Conference 2017 and, for today I am

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then inget usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as neccassary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawl targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essemtial metaddta
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as neccassry

This is better but there are challenges. Firstly, what is a “publication?”. With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, publications… An original report will has an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, what can frustrate technical folk… You take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to proide an initial guess of the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the names of authors’ names etc. in the document metadata is rarely correct. We use the worse extractor, and layer up so we have the best shot. What works best is extracting the HTML. is a big and consistent publishing space so it’s worth us working on extracting that.

What works even better is the API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solaar searches correctly it should be easy so will be correcting this…

We do need to do more future experimentation.. Multiple workflows brings synchronisation problems. We need to ensure documents are accessible when discocerale. Need to be able to re-run automated extraction.

We want to iteractively ipmprove automated metadat extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension between what a publication is… A tension between established practice and publisher output Need to trial different approaches with catalogues and users… Close that whole loop.


Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert von de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues in accessing it they were, etc. would extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – thats a big high level challenge so that’s quite amenable to machine learning. We have a massive golden data set… There’s at least a masters theory in there, right! And if we built something, then ran it over the 3 million ish items with little metadata could be incredibly useful. In my 0pinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at – a new journal from Google and Y combinator – and that’s the future of these sorts of formats, it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have a fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t acessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-it API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most use the Archive-It service. But, if you are a researcher… to use web archives you really have to interested and engaged, you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, SImon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that post docs can look at this

We are using Warcbase ( and command line tools, we transferred data from internet archive, generate checksums; we generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisation of the Canadan political parties and political interest group web crawls which track changes, although that may include crawler issues.

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tightknit, and I like how we build and share and develop on each others work. Last year we presented We’ve indexed 10 TB of warcs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.

Then we have also dealt with scaling issues… 30-40Gb to 1Tb sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on an Open Stack… But right now it isn’t at production scale, but getting there. We are looking at SolrCloud and potentially using a Shard2 per collection.

Last year we had a solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is BlackLight which is in wide use in libraries. There is a bigger community, better APIs, bug fixees, etc… So we have set up a prototype called WARCLight. It does almost all that Shine does, except the tree structure and the advanced searching..

Ian spoke about dericative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.

So, that goal Ian talked about: one central hub for archived data and derivatives…


Q1) Do you plan to make graphs interactive, by using Kebana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kabana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in few years time

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder,,,,

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Backlight than Shine. People respond poorly to WARC. But they can deal with PDFs with CSV, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..



IIPC WAC / RESAW Conference 2017 – Day One Liveblog

From today until Friday I will be at the International Internet Preservation Coalition (IIPC) Web Archiving Conference 2017, which is being held jointly with the second RESAW: Research Infrastructure for the Study of Archived Web Materials Conference. I’ll be attending the main strand at the School of Advanced Study, University of London, today and Friday, and at the technical strand (at the British Library) on Thursday.

I’m here wearing my “Reference Rot in Theses: A HiberActive Pilot” – aka “HiberActive” – hat. HiberActive is looking at how we can better enable PhD candidates to archive web materials they are using in their research, and citing in their thesis. I’m managing the project and working with developers, library and information services stakeholders, and a fab team of five postgraduate interns who are, whilst I’m here, out and about around the University of Edinburgh talking to PhD students to find out how they collect, manage and cite their web references, and what issues they may be having with “reference rot” – content that changes, decays, disappears, etc. We will have a webpage for the project and some further information to share soon but if you are interested in finding out more, leave me a comment below or email me:

These notes are being taken live so, as usual for my liveblogs, I welcome corrections, additions, comment etc. (and, as usual, you’ll see the structure of the day appearing below with notes added at each session). 

Opening remarks: Jane Winters and Nicholas Taylor

Opening plenary: Leah Lievrouw – Web history and the landscape of communication/media research Chair: Nicholas Taylor