Taming the beast – working with Hydra together

Another guest blog post for you from Chris Awre, University of Hull:

It seems at first slightly self-serving to use a conference blog to highlight further some of the points I made in my conference presentation.  I’m sure it is.  Nevertheless, I’d like to do so in the context of the workshop and roundtable I led on ‘Getting to the Repository of the Future’ and the conference as a whole, which I hope puts any platform plug in its proper place.

To re-cap, Hydra is a project initiated in 2008 to create a flexible framework that could be applied to a variety of repository needs and solutions.  Hydra today is a repository solution that can be implemented (based around Fedora), a technical framework that can be adapted to meet specific local needs, and, of equal if not greater importance, a community of partners and users who wish to work together to create the full range of solutions that may be required.  There are, as of August 2013, 19 current partners, with a number of others considering joining: in Europe the University of Hull is joined by LSE and the Royal Library of Denmark as partners, with use also taking place at Glasgow Caledonian, Oxford, Trinity College Dublin (for the Digital Repository of Ireland), and the Theatre Museum of Barcelona.  Not a large number as it stands, perhaps, but each exploiting what Hydra can provide to meet varied needs, sharing experiences and ideas, and demonstrating how a flexible platform can be adapted.

I’d like to pick out three main themes:

  • What is Hydra?

In the Repository of the Future workshop one of the main points raised was about clarifying the purpose of a repository.  This allows it to be situated in a broader institutional context without necessarily competing with other systems.  In doing so, it suggests that repositories should focus their activity rather than suffer mission creep and dilute their core offering.  I was conscious of this as I described Hydra in my presentation as being able to manage any digital content the University of Hull wished us to.  Contradiction?  On one level, yes, and I am all too well aware of the need to clarify what the repository is actually doing so as to strengthen the message we give out: I am defining our repository service more succinctly for this reason, for the University and for library staff.  But that doesn’t mean the repository infrastructure shouldn’t be capable of managing different types of content so that when a use case arises the repository can offer that capability to address it.  Clarifying our repository’s purpose is thus emphasising that it is a service capable of managing structured digital content of all sorts, with foci around specific known collections.  Other Hydra partners have focused their developments on more specific use cases (e.g., Avalon for multimedia, or Sufia for self-deposit), albeit recognising that Hydra provides them with the wherewithal to expand this if they need to.  And if we can share the capability between us as part of a community, then we can expand functionality and purpose as we need to.

  • Repository as infrastructure

I mentioned repository infrastructure in the last paragraph.  A challenge I threw out to the workshop, and throw out here, is to go into an institutional IT department and ask if the repository is infrastructure or an application.  These are treated very differently, with the former often given more weight than the latter, I would argue.  I would also suggest (and I’d welcome feedback on this) that repositories are more often considered an application.  However, if we are to take the management of digital collections seriously then they need to be treated as infrastructure, and the purpose of a repository built up from there.  A lot of the thinking behind Hydra is based on the repository underpinning other services by making content available in flexible ways so that it can be used as appropriate.  Someone at the workshop referred to a repository as a ‘lake of content’.  Whatever the scope and purpose of a repository, managing that lake is an infrastructural role akin to managing the water supply network as opposed to focusing on the bathroom fittings.

  • Technical support

Key to Hydra’s evolution has been the dedication of many software developers, who contribute from the various institutions that employ them – a classic open source model in many ways.  I was asked following my presentation how Hydra had been successful in getting such commitment.  One part of the answer was the one I gave, that the software choice, Ruby on Rails, had proved very amenable to agile development and frequent input, and that the developers liked using it.  Another is the further point I made, that the US libraries can sustain such projects as Hydra because they recognise the value of technology to their libraries, and are prepared in many cases to back that up with specific staffing resource.  Certainly this is most evident at the larger institutions, but it goes beyond this as well: not for nothing is there a technically oriented national Digital Library Federation through which digital library initiatives can be showcased and shared, and the developer-focused Code4Lib community.  Developer staffing within libraries in the UK is there in some cases, but is not widespread.  If we consider repositories as being part of a library’s future, do we need the technical commitment to ensure they can do the job they need to?  At Hull we rely on IT department staffing, as many do.  Perfectly adequate for managing an application, but is it an indication of real commitment?  Where it is not feasible to have local technical staff, is there a model that supports dedicated developer input as part of a collaboration?  Of course, even with dedicated technical resource it may not be feasible to do everything alone – hence the Hydra model of doing things together that partners the size of Stanford and Virginia continue to value.

<setting out stall>

Of course, at the University of Hull we view Hydra as being a route down which we can get to the repository of the future.  It provides us with the infrastructure we need to establish our repository’s purpose, but adapt and grow from this as the University requires.  It also allows us to say ‘yes’ when we are asked about the ability to manage different content, even if there may be associated staffing resource issues that need resolving.  We think this will stand us in good stead moving forward.  Hydra won’t necessarily be right for others as a technology, but I hope that the community aspects of working together technically can be adapted to suit regardless of technical platform.  If interested in pursuing more about Hydra as a technical solution, though, let me know ;-)

</setting out stall>


Getting to the Repository of the Future – reflections

A week after the Getting to the Repository of the Future workshop, it is useful to reflect on what thoughts emerged from the event that we can take forward.  The workshop itself was very helpfully blogged by resident RepoFringe bloggers Rocio and Nancy, whose posts capture many of the points raised.  There was also a follow-on round table discussion held the day after, from which additional ideas and suggestions emerged.  All contributions are being written up into a document to inform Jisc in their planning, but will also be shared openly to inform conversations back home within institutions and elsewhere.

By way of continuing the discussion online, I set out here my own initial thoughts and conclusions from the discussion.  Feedback is very welcome.

  • Repositories will become capable of dealing with content types according to their needs

Repositories have been established to manage many different types of material, with probably the largest focus being around research articles.  Nonetheless, with digital content collections of all sorts growing and needing better management, can repositories cope with this?  Discussion suggested that we have a technology available to us that can be used for a variety of use cases, and so can usefully be exploited in this way.  In doing so, though, it was recognised that we need to better understand what it means to manage different types of material so this exploitation can take place effectively and add to the value of the content.  As to type of repository, it should be recognised where materials benefit from being managed through specific repositories rather than a local repository, e.g., managing software code through GitHub or BitBucket, or holding datasets in specific data centres.  Overall message emerging: understand more how to deal with different types of content, be realistic about where they are best managed as part of this.

  • Repositories will move beyond being a store of PDFs to enable re-use to a greater extent

One very specific comment at the workshop highlighted that many repositories are simply a store of PDF files (there was also a debate about whether repositories holding only metadata are real repositories, but that’s another discussion).  PDF files can be re-usable if generated in the right way (i.e., are not just page images), but are never ideal.  Part of the added value that repositories can bring is facilitating re-use, and enabling the benefits that come from this.  To do this we need to move to a position where we can either store non-PDF versions instead of or alongside the PDFs, or identify ways of storing non-PDF files by default.  The view expressed was that if we don’t address this we risk our repositories becoming silos of content with limited use.

  • Repositories will benefit greatly from linked data, but we need persistent identifiers to be better established and standardised

There is a chicken and egg aspect to this, as there is with a lot of linked data activity.  Content is exposed as linked data, but is not then consumed as much as might be anticipated, in part because the linked data doesn’t use recognised standards, and in particular standard identifiers, in its expression.  These aren’t used because there hasn’t been enough activity within the community to inform a standard to use, or because there are a number of different standards but no authoritative one.  One example is a standard list of organisational identifiers: there are a few in existence, but there is a need to bring these together, a task that Jisc is currently investigating.  Repositories could make use of linked data if the standards existed, but where is the impetus to create them?  An opposing view is that the standards pretty much do exist, and it is more a matter of raising awareness of the options and opportunities in how these can be used effectively within repositories, e.g., ORCID, which is now starting to gain traction, or the Library of Congress subject headings.  Whichever view you take, linked data screams ‘potential’, and there was little doubt that it will become part of the repository landscape in a far greater way than it is today.

  • Repositories will focus on holding material and preserving it, leaving all other functions to services built around the repository / Repositories will become invisibly integrated within user-facing services

At first sight this theme appears to suggest that we reduce the scope of a repository, which seems to contradict the benefits that the previous statements suggest.  Discussion at the workshop, though, saw this more as getting repositories to play to their strengths; we need somewhere to store and preserve digital ‘stuff’, using a digital repository as the equivalent of print repositories.  Of course it can be held in a way that allows it to be exploited through other services, but should we not focus on what a repository does really well rather than become application managers as well?  Discuss.  In taking this line, we enable content to be made available from the repository (a ‘lake of content’ as expressed by one workshop attendee) wherever it is needed; do users need to know where it came from?  Issues of perceived value clearly rear their head here given the battles to establish repositories in the first place, and moving in the suggested direction will certainly require attention to this with budget-holders.  But for users this was felt to make sense.  One approach suggested was to consider the repository as infrastructure rather than application, as this may change views of the support required.

  • Repositories will be challenged by other systems offering similar capability / Repositories will develop ways of demonstrating their impact

This theme was a natural follow-on to the previous one.  The debate about CRISs storing content, or VLEs for that matter, seems high on the agenda in affected institutions, and will no doubt continue.  This suggests a need for clarity in the role of each system, and an understanding of their respective benefit and impact for the institution in how they work together.  We cannot take repositories for granted, though the general perception at the workshop was that they have huge value (biased audience I know, but one with experience) and we need to continue identifying how we demonstrate that to best serve our institutional needs.

So, a full afternoon.  No blinding flashes of inspiration, perhaps, but some useful staging posts against which we can plot the future course of repositories over the next 2, 5, 10, etc. years.  Repositories will only be what they are then because of what we choose to do now.

My main general takeaways from the workshop:

  • The role of, and need for, a repository as a place to manage digital ‘stuff’ seems well accepted and here to stay
  • There is a need to re-state and clarify the purpose of our individual repositories, and to take ownership/leadership of how they develop
  • No specific gaps were perceived – we know what we wish to achieve with repositories, we just need a way of doing it
  • We need to identify the barriers getting in the way and look at ways of overcoming them

What are your thoughts?  Or, indeed, what processes would work best to address these points (both institutionally and across the community)?