P6B: Digital Preservation LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot anything that needs correcting and we will be happy to update the post.

Topic: Digital Preservation Network, Saving the Scholarly Record Together
Speaker(s): Michele Kimpton, Robin Ruggaber

Michele is CEO of DuraSpace. Robin and I are going to be talking about a new initiative in the US. This initiative wasn’t born out of grant funding but by university librarians and CIOs who wanted to ensure persistent access to scholarly materials and knew that something needed to be done at scale, and now. Many of you will be well aware that libraries are being asked to preserve digitised and born-digital materials and there are no good solutions to do that at scale. Many of us have repositories in place. Typically there is an online or regular backup, but that isn’t preservation at scale.

So about a year ago a group of us met to talk about how we might approach this problem. And from this DPN – the Digital Preservation Network (d-p-n.org) – was born. DPN is not just a technical architecture. It’s an approach that requires replication of the complete scholarly record across nodes with diverse architectures, without single points of failure. It’s a federation. And it is a community, allowing this to work at mass scale.

At the core of DPN are a number of replicating nodes. There is a minimum of three, and up to five. The role of the nodes is to have complete copies of content – full replications of each replicating node. This is a full content object store, not just a metadata node. And this model can work with multiple contributing nodes in different institutions – the nodes replicate across architectures, geographic locations and institutions.
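
As an aside for readers: one way to picture that replication requirement is as an invariant the federation keeps checking – every object must have full copies at a minimum number of nodes. A toy sketch of such a check (the node and object names are invented, not DPN's):

```python
MIN_COPIES = 3   # per the talk: a minimum of three, up to five, full copies

def under_replicated(holdings: dict[str, set[str]]) -> list[str]:
    """Given a map of object id -> set of nodes holding a full copy,
    return the objects that fall below the minimum replication level."""
    return [obj for obj, nodes in holdings.items() if len(nodes) < MIN_COPIES]

# Invented example data: which nodes hold which objects.
holdings = {
    "collection-a/item-1": {"node-east", "node-west", "node-central"},
    "collection-a/item-2": {"node-east", "node-west"},
}
print(under_replicated(holdings))   # ['collection-a/item-2']
```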

DPN Principle 1: Owned by the community

DPN Principle 2: Geographical diversity of nodes

DPN Principle 3: Diverse organisations – University of Michigan, Stanford, San Diego, Academic Preservation Trust, University of Virginia.

DPN Principle 4: Diverse Software Architectures – including iRODS, HathiTrust, Fedora Commons, Stanford Digital Library

DPN Principle 5: Diverse Political Environments – we’ve started in the US but the hope is to expand out to a more diverse global set of locations

So DPN will preserve scholarship for future generations, fund replicating nodes to ensure functional independence, audit and verify content, and provide a legal framework for holding succession rights – so if a node goes down the content will not be lost. And we have a diverse governance group taking responsibility for specific areas.

To date there are 54 partners and growing, about $1.5 million in funding – and this is not grant funding – and we now have a project manager in place.

Over to Robin…

Robin: Many of the partners in APTrust have also been looking at DPN. APTrust is a consortium committed to the creation and management of an aggregated preservation repository and, now that DPN is underway, to being a replicating node. APTrust was formed for reasons of community-building, economies of scale – things we could do together that we could not do alone – aggregated content, long-term preservation, and disaster recovery – particularly relevant given recent east coast storms.

The APTrust has several arms: business and marketing strategy; governance policy and legal framework; preservation and collection framework; and the repository implementation plan – the technical side of APTrust and being a DPN node. So we had to bring together university librarians, technology liaisons, and ingest/preservation staff. The APTrust services are the aggregation repository, the separate replicating node for DPN, and the access service – initially for administration but we are also thinking about more services for the future.

There’s been a lot of confusion as APTrust and DPN started emerging at about the same time. And we are doing work with DPN. So we tend to explain this as a winnowing of content: researchers’ files at the top, then local institutional repositories, then APTrust – preservation for our institutions that provides robustness for our content – and then DPN for long-term preservation. APTrust is preservation and access. DPN is about preservation only.

So the objectives of the initial phase of APTrust are engaging partners, defining a sustainable business model, hiring a project director, building the aggregation repository and setting up our DPN node. We have an advisory group for the project looking at governance. The service implementation is a phased approach building on experience, leveraging open source – cloud storage, compute nodes and DuraCloud all come into play – economies of scale, and TRAC, which we are using as a guideline for the architecture. APTrust will sit at the end of legacy workflows for ingest: it will take that data in, ingest to DuraCloud services, sync to the Fedora aggregation repository, and anything for long-term preservation will also move to the APTrust DPN node with DuraCloud OS via CloudSync.

In terms of the interfaces there will be a single administrative interface which gives access to the administration of DuraCloud, CloudSync and Fedora, allowing audit reports, functionality in each individual area etc. It uses the API for each of those services. We will have a proof of that architecture at the end of Q4 2012. Partners will feed back on that and we expect to deploy in 2013. Then we will be looking at disaster recovery access services, end-user access, format migration services – considered a difficult issue so very interesting – best practices for content types etc., coordinated collection development across services, and hosted repository services. Find out more at http://aptrust.org and http://d-p-n.org/

Q&A

Q1) In Denmark we are building our national repository, which is quite like DPN. Something in your presentation: it seems that everything is fully replicated to all nodes. In our organisation, services that want to preserve something can enter a contract with another service, and that’s an economic way to do things, but it seems that this model is everything for everyone.

A1 – Michele) Right now the principle is that every node gets a copy of everything. We may eventually have specialist centres for video, or for books etc. Those will probably be primarily access services. We do have a diverse ecosystem – backups across organisations in different ways. You can’t choose to put content in one node but not another.

Q2) This looks a lot like LOCKSS – what is the main difference between DPN and a private LOCKSS network?

A2) LOCKSS is a technology for preservation but it’s a single architecture. It is great at what it does so it will probably be part of the nodes here – probably Stanford will use this. But part of the point is to have multiple architectural systems so that if there is an attack on one architecture just one component of the whole goes down.

Q3) I understand the goal is replication but what about format obsolescence – will there be format audit and conversion etc?

A3 – Michele) I think who does this stuff – format emulation, translation etc. – has yet to be decided. That may be at node level not network level.

Topic: ISO 16363: Trustworthy Digital Repository Certification in Practice

Speaker(s): Matthew Kroll, David Minor, Bernie Reilly, Michael Witt

This is a panel session chaired by Michael Witt of Purdue University. This is about ISO 16363 and TRAC, the Trustworthy Repositories Audit & Certification checklist – how can a user trust that data is being stored correctly and securely, and that it is what it says it is?

Matthew: I am a graduate research assistant working with Michael Witt at Purdue. I’ve been preparing the Purdue Research Repository (PURR) for TRAC. We are a progressive repository, spanning an online workspace and data sharing platform, user archiving and access, and the preservation needs of Purdue University graduates, researchers and staff. So for today I will introduce you to ISO 16363 – this is the user’s guide that we are using to prepare ourselves – and I’ll give an example of trustworthiness. A necessary and valid question to ask ourselves is “what is ‘trustworthiness’ in this context?” – it’s a very vague concept and one that needs to grow as the digital preservation community and environment grows.

I’d like to offer three key qualities of trustworthiness: (1) integrity, (2) sustainability, (3) support. And I think it’s important to map these across your organisation and across the three sections of ISO 16363. So, for example, integrity might be that the organisation has sufficient staff and funding to work effectively. For the repository it might be that you do fixity checks, and have procedures and practices to ensure successful migration or translation; similarly, integrity in infrastructure may just be offsite backup. Sustainability might be about staff training being adequate to meet changing demands. These are open to interpretation, but useful to think about.
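
For readers unfamiliar with the term, the fixity checks mentioned above usually mean comparing a checksum recorded at ingest against a freshly computed one. A minimal sketch, assuming a simple manifest of relative paths and SHA-256 digests (the manifest format and paths are invented for illustration):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Compare checksums recorded at ingest against freshly computed ones.

    `manifest` maps relative file paths to their recorded checksums.
    Returns the paths whose current checksum no longer matches.
    """
    failures = []
    for rel_path, recorded in manifest.items():
        if sha256_of(root / rel_path) != recorded:
            failures.append(rel_path)
    return failures
```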

In ISO 16363 there are 3 sections of criteria (109 criteria in all): (3) Organizational Infrastructure; (4) Digital Object Management; (5) Infrastructure and Security Risk Management. There isn’t a one-to-one relationship with documentation here. One criterion might have multiple documents, and a document might support multiple criteria.

Michael and I created a PURR Gap Analysis Tool – we graded ourselves, brought in experts from the organisation in the appropriate areas and gave them a pop quiz. And we had an outsider read these things. This had great benefit – being prepared means you don’t overrate yourself. And secondly, doing it this way – as PURR was developing and deploying our processes – we gained a real understanding of the digital environment.
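
A gap analysis of this kind is essentially a score plus evidence per criterion, with low or unevidenced criteria flagged for follow-up. A toy sketch (the criterion IDs echo the ISO 16363 section numbering, but the 0–4 scale, threshold and document names are invented, not PURR's tool):

```python
# Toy gap-analysis record: criterion ID -> (self-assessed score 0-4, evidence documents)
assessment = {
    "3.1.1": (4, ["mission-statement.pdf"]),
    "4.2.3": (2, ["fixity-procedure.docx"]),
    "5.1.2": (1, []),
}

def gaps(assessment: dict, threshold: int = 3) -> list[str]:
    """Criteria scored below the threshold, or with no supporting documentation."""
    return [
        cid for cid, (score, docs) in assessment.items()
        if score < threshold or not docs
    ]

print(gaps(assessment))   # ['4.2.3', '5.1.2']
```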

David Minor, Chronopolis Program Manager, UC San Diego Libraries and San Diego Supercomputer Center: We completed the TRAC process this April. We did it through CRL. I want to give you an overview of what we did and what we learnt. So a bit about us first. Chronopolis is a digital preservation network based on geographic replication – UCSD/SDSC, NCAR, UMIACS. We were initially funded via the Library of Congress NDIIPP program. We recently spun out into a different type of organisation, a fee-for-service operation. Our management and finances are via UCSD. All nodes are independent entities – interesting questions arise from this for auditors.

So, why do TRAC? Well we wanted to do validation of our work – this was a last step in our NDIIPP process and an important follow-on for development. We wanted to learn about gaps, things we could do better. We wanted to hear what others in the community had to say – not just those we had worked for and served, but others. And finally, it sounds cynical but it was to bring in more business – to let us get out there and show what we could do, particularly as we moved into fee-for-service mode.

The process logistics were that we began in Summer 2010 and finished in Winter 2011. We were a slightly different model: a self-audit that then went to the auditors to follow up, ask questions and speak to customers. The auditors were three people who did a site visit. It’s a closed process except for that visit though. We had management, finances, metadata librarians, and data centre managers – security, system admin etc. – all involved, the equivalent of 3 FTE. We had discussions with users and customers. In the end we had hundreds of pages of documentation – some written by us, some log files etc.

Comments and issues raised by the auditors were that we were strong on technology (we expected this as we’d been funded for that purpose) and they spent time commenting on connections with participant data centres. They found we were less strong on the business plan – we had good data on costs and plans but needed better projections for future adoption. And we had discussion of preservation actions – the auditors asked if we were even doing preservation and what that might mean.

Our next steps and future plans based on this experience have been to implement the recommendations, working to better identify new users and communities and to improve working with other networks. How do changes impact the audit? We will “re-audit” in 18–24 months – what if we change technologies? What if management changes? And finally, we definitely have had people getting in touch specifically because they know we have been through TRAC. All of our audit and self-audit materials are on the web too, so do take a look.

Bernie from the Center for Research Libraries Global Resources Network: We do audits and certification of key repositories. We are one of the publishers of the TRAC checklist. We are a publisher, not an author, so I can say that it is a brilliant document! We also participated in the development of the recent ISO standard 16363. So, where do we get the standing to do audits, certification and involvement in standards? Well, we are a specialist centre…

We were started by the University of Chicago, Northwestern etc. and established in 1949. We are a group of 167 universities in the US, Canada and Hong Kong and we are about preserving key research information for the humanities and social sciences. Almost all of our funding comes from the research community – which is also where our stakeholders and governance sit. And the CRL certification program has the goal of supporting advanced research. We do audits of repositories and we do analysis and evaluations. We take part in information sharing and best practice. We aim to do landscape studies – recently we have been working on digital protest documentation.

We have audited Portico and Chronopolis, and are currently looking at PURR and the PTAB test audits. The process is much as described by my colleagues: the repository self-audits, then we request documentation, then there is a site visit, then the report is shared via the web. In the future we will be doing TRAC certification alongside ISO 16363 and we will really focus on humanities and social science data. We continue to have the same mission as when we were founded in 1949: to enable the resilience and durability of research information.

Q&A

Q1 – Askar, State University of Denmark) The finance and sustainability criteria for organisations in TRAC… they seem to be predicated on a single repository and that being the organisation’s only mission. But national archives are more “too big to fail”. Questioning long-term funding is almost insulting to managers…

A1) Certification is not just pass/fail. It’s about identifying potential weaknesses, flaws and points of failure for a repository. So a national library may be too big to fail, but the structure and support for the organisation may impact the future of the repository – cost volatility, decisions made over management and the scope of content preserved. So for a national institution we look at the finance for that – is it a line item in the national budget? And that comes out in the audit: the factors governing future development and sustainability.

Topic: Stewardship and Long Term Preservation of Earth Science Data by the ESIP Federation
Speaker(s): Nancy J. Hoebelheinrich

I am principal of knowledge management at Knowledge Motifs in California. And I want to talk to you about the preservation of earth science data by ESIP – the Earth Science Information Partners. My background is in repositories and metadata and I am relatively new to earth science data, and there are interesting similarities. We are also keen to build synergies with others, so I thought it would be interesting to talk about this today.

The ESIP Federation is a knowledge network for science data and technology practitioners – people who are building components for a science data infrastructure. It is distributed geographically and in terms of topic and interest. It’s about a community effort, with free-flowing ideas in a collaborative environment. It’s a membership organisation but you do not have to be a member to participate. It was started by NASA to support Earth Observation data work. The idea was to not just rely on NASA for environmental research data. They are interested in research, applications in education etc. The areas of interest include climate, ecology, hydrometry, carbon management, etc. Members are of four types: Type 4 are large organisations and sponsors including NOAA and NASA; Type 1 are data centres – similar to libraries but considered separate; Type 2 are researchers; and Type 3 are application developers. There is a real cross-sectoral grouping so really interesting discussion arises.

The types of things the group is working on are often in data informatics and data science. I’ll talk in more detail in a second, but it’s important to note that organisations are cross-functional as well – different stakeholders/focuses in each. We coordinate the community via in-person meetings, the ESIP Commons, telecons/WebEx, clusters, working groups and committees, and these all feed into making us interoperable. We are particularly focused on professional development, outreach and collaboration. We have a number of active groups, committees and clusters.

Our data and informatics area is about collaborative activities in data preservation and stewardship, the semantic web, etc. Data preservation and stewardship is very much about stewardship principles, citation guidelines, provenance, context and content standards, and linked data principles. Our Data Stewardship Principles are aimed at data creators, intermediaries and data users. So this is about data management plans, open exchange of data, metadata and progress etc. Our data citation guidelines were accepted by the ESIP Membership Assembly in January 2012. These are based on existing best practice from the International Polar Year citation guidelines. And this ties into geospatial data standards, and these will be used by tools like the new Thomson Reuters Data Citation Index.

Our Provenance, Context and Content Standard is about thinking about what you need to know about a data set to make it preservable for the long term. So this is about what you would want to collect and how you would collect it. It was initially based on content from NASA and NOAA and discussions associated with them, and it was developed and shared via the ESIP wiki. The initial version was in March 2011; the latest version is June 2011, but this will be updated regularly. The categories are focused mostly on satellite remote sensing – pre-flight/pre-operational instrument descriptions etc. And these are based on use cases – based on NASA work from 1998. What has happened as a result of that work is that NASA has come up with a specification for their data for earth sciences. They make a distinction between documentation and metadata, a bit differently from some others. The categories are in 8 areas – many technical but also rationale. And these categories help set a baseline etc.

Another project we are working on is identifiers for data objects. There was an abstract research project on use cases – unique identification, unique location, citable location, scientifically unique identification. They came up with categories, characteristics and questions to ask of each ID scheme. The recommended IDs ended up being DOI for a citable locator and UUID for a unique identifier, but we wanted to test this. We are in the process of looking at this at the moment, and the results will be compared against those recommendations.
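
To make the two recommended schemes concrete: the UUID serves as the opaque, globally unique identifier minted at creation, while the DOI gives a resolvable, citable locator once the data set is published. A minimal sketch (the DOI prefix and example values are placeholders, not ESIP's):

```python
import uuid

DOI_RESOLVER = "https://doi.org/"

def mint_unique_id() -> str:
    """A random UUID as the globally unique identifier, assigned at creation."""
    return str(uuid.uuid4())

def citable_locator(doi: str) -> str:
    """Turn a DOI into the resolvable URL that a citation would point at."""
    return DOI_RESOLVER + doi

dataset = {
    "unique_id": mint_unique_id(),
    "doi": "10.1234/example-dataset-v1",   # placeholder DOI, assigned at publication
}
print(citable_locator(dataset["doi"]))     # https://doi.org/10.1234/example-dataset-v1
```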

And one more thing the group is doing is Semantic Web Cluster activities – they are creating different ontologies for specific areas, such as SWEET – an ontology for environmental data. And there are services built on top of those ontologies (the Data Quality Screening Service for weather and climate data from space (AIRE), for instance) – both are available online. There are lots of applications for this sort of data.
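
For the curious, ontologies like SWEET are published as OWL files, so inspecting one takes only a few lines with a standard RDF library. A rough sketch, assuming a locally downloaded copy at a made-up path:

```python
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
# "sweet_local_copy.owl" is a placeholder path – the SWEET OWL files would
# need to be downloaded locally first.
g.parse("sweet_local_copy.owl", format="xml")

# List the classes the ontology defines, with labels where present.
for cls in g.subjects(RDF.type, OWL.Class):
    print(cls, "-", g.value(cls, RDFS.label))
```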

And finally we do education and outreach – data management training short courses. Given that it’s important that researchers know how to manage their data, we have come up with a short training course based on the Khan Academy model. That is being authored and developed by volunteers at the moment.

And we have associated activities and organisations – DataONE, the Data Conservancy, NSF’s EarthCube. If you are interested to work with ESIP please get in touch. If you want to join our meeting in Madison in 2 weeks’ time there’s still time/room!

Q&A

Q1 – Tom Kramer) It seems like ESIP is an e-research community really – is there a move towards mega nodes or repositories, or is it still the Wild West?

A1) It’s still a bit like the Wild West! Lots going on but less focus on distribution and preservation; the focus is much more on making sure data is ingested and made available – where the repositories community was a few years ago. ESIP is interested in the data being open but not all scientists agree about that, so again maybe at the same point as this community a few years ago.

Q2 – Tom) So how do we get more ESIP folk without a background in libraries to OR2012?

A2) Well, I’ll share my slides, and we probably all know people in this area. I know there are organisations like EDINA here, etc.

Q3) [didn’t hear]

A3) EarthCube is a place to talk about making data available. A lot of those issues are being discussed. They are working out common standards – OGC, ISO – and sharing ontologies, but not necessarily the preservation behind repositories. It’s sort of data centre by data centre.

Topic: Preservation in the Cloud: Three Ways (A DuraSpace Moderated Panel)
Speaker(s): Richard Rodgers, Mark Leggott, Simon Waddington, Michele Kimpton, Carissa Smith

Michele: DuraCloud was developed in the last few years. It’s software but also a SaaS (Software as a Service) offering. So we are going to talk about different usages etc.

Richard Rodgers, MIT: We at MIT Libraries participated in several pilot processes in which DuraCloud was defined and refined. The use case here was to establish geo-distributed replication of the repository. We had content in our IR that was very heterogeneous in type. Typical system administration practices only address hardware or admin failures – other errors go unaddressed. The service should be automatic yet visible. We developed a set of DSpace tools geared towards collection and administration. DuraCloud provided a single convenient point of service interoperation. Basically it gives you an abstraction over multiple backend services. That’s great as it decouples applications from providers and protects against lock-in. There are tools and APIs for DSpace integration, high-bandwidth access for developers, a platform for a preservation system, and institution-friendly service terms.

Challenges and solutions here… It’s not clear how the repository system should create and manage the files itself. Do all aspects need to have correlated archival units? So we decided to use AIPs as the units of replication, which package items together and gather loose files. There is repository manager involvement – admin UI, integration, batch tools. There is an issue of scale – big data files really don’t suit interactivity in the cloud; replication can be slow, queued rather than synchronous. And we had to design a system where any local error wouldn’t be replicated (e.g. a local deletion isn’t repeated in the replicated version). However, deletion is forever – you can remove content permanently if you need to. The code we did for the pilot has been refined somewhat and is available for DSpace as an add-on – we think it’s fairly widely used in the DSpace community.
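
A sketch of two of the points above – replication being queued rather than synchronous, and local deletions never propagating to the replica – might look like this (illustrative only, not the actual DSpace replication add-on or the DuraCloud API):

```python
import queue

class ReplicaStore:
    """Minimal in-memory stand-in for a remote object store (illustrative only)."""
    def __init__(self):
        self.objects = {}

    def put(self, name: str, data: bytes) -> None:
        self.objects[name] = data


class ReplicationQueue:
    """Queued (not synchronous) replication of AIP packages.

    There is deliberately no delete operation: a local deletion is never
    propagated, so a local mistake cannot destroy the replicated copy.
    Removing a replica would require a separate, explicit administrative step.
    """
    def __init__(self, store: ReplicaStore):
        self.store = store
        self.pending = queue.Queue()   # AIPs waiting to be replicated

    def schedule(self, aip_name: str, aip_bytes: bytes) -> None:
        self.pending.put((aip_name, aip_bytes))

    def drain(self) -> None:
        while not self.pending.empty():
            name, data = self.pending.get()
            self.store.put(name, data)   # add or overwrite, never remove
```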

Mark Leggott, University of PEI/DiscoveryGarden: I would echo the complicated issues you need to consider here. We had the same experience of a very responsive process with the DuraSpace team. Just a quick bit of info on Islandora: it is a Drupal + Fedora framework from UPEI, with a flexible UI and apps etc. We think of DuraCloud as a natural extension of what we do. The approach we have is to leverage DuraCloud and CloudSync. The idea is to maintain the context of individual objects and/or complete collections, to enable a single-button restore of damaged objects, and to integrate with a standard or private DuraCloud. We have an initial release coming. There is a new component in the Manage tab of the admin panel called “Vault”. It provides full access to DuraCloud and CloudSync services. It’s accessible through the Islandora admin panel – you can manage settings and integrate it with your DuraSpace-enabled service, or you can do this via DiscoveryGarden, where we manage DuraCloud on the client’s behalf. And in managing your materials you can access or restore at an item or collection level. You can sync to DuraCloud or restore from the cloud etc. You get reports on syncing, and reports on matches or mismatches, so that you can restore data from the cloud as needed. And you can then manually check the object.
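
The match/mismatch reporting Mark mentions boils down to comparing object checksums on each side and restoring whatever differs; a rough sketch of that idea (not CloudSync's actual report format):

```python
def compare_manifests(local: dict[str, str], cloud: dict[str, str]) -> dict[str, list[str]]:
    """Compare {object_id: checksum} maps for a collection held locally and in the cloud."""
    return {
        "missing_in_cloud": sorted(k for k in local if k not in cloud),
        "missing_locally": sorted(k for k in cloud if k not in local),
        "mismatched": sorted(k for k in local if k in cloud and local[k] != cloud[k]),
    }

def restore_candidates(report: dict[str, list[str]]) -> list[str]:
    """Objects missing or damaged locally are candidates for restore from the cloud copy."""
    return sorted(set(report["missing_locally"]) | set(report["mismatched"]))

report = compare_manifests(
    local={"obj-1": "abc", "obj-2": "def"},
    cloud={"obj-1": "abc", "obj-2": "fed", "obj-3": "ghi"},
)
print(restore_candidates(report))   # ['obj-2', 'obj-3']
```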

Our next steps are to provide tighter integration and more UI functions, to move to automated recovery, to enable full Fedora/collection restore, and to include support for private DuraCloud instances.

Simon: I will be talking about the Kindura project, funded by JISC, which was a KCL, STFC and ? initiative. The problem is that storage of research outputs (data, documents) is quite ad hoc, but it’s a changing landscape and UK funders can now require data to be kept for 10+ years, so it’s important. We were looking at hybrid cloud solutions – commercial cloud is very elastic, with rapid deployment and transparent costs, but risky in terms of data sensitivity, data protection law, and service availability and loss. In-house storage plus cloud storage seems like the best way to gain the benefits while mitigating the risks.

So Kindura was a proof-of-concept repository for research data combining commercial cloud and internal storage (iRODS), based on Fedora Commons. DuraCloud provides a common storage interface and we deployed from source code – we found Windows was best for this and have created some guidelines on this sort of set-up. And we developed a storage management framework based on policies, legal and technical constraints, as well as cost (including the cost of transmitting data into/out of storage). We tried to implement something as flexible as possible. We wanted automated decisions for storage and migration, content replication across storage providers for resilience, and storage providers that are transparent to users.
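
To illustrate the kind of placement decision such a framework automates, here is a toy policy rule; the provider names, thresholds and the sensitivity flag are invented for illustration, not Kindura's actual rules:

```python
from dataclasses import dataclass

@dataclass
class Deposit:
    size_gb: float
    sensitive: bool        # e.g. personal data subject to data protection law
    access_frequency: str  # "hot", "warm" or "cold"

def choose_storage(d: Deposit) -> list[str]:
    """Return the storage targets for a deposit under a simple, made-up policy.

    Toy rules: sensitive data stays in-house; everything else is also replicated
    to a commercial cloud; rarely accessed bulk data additionally goes to tape.
    """
    targets = ["irods-internal"]                 # in-house storage, always
    if not d.sensitive:
        targets.append("commercial-cloud")       # elastic, but only for non-sensitive data
    if d.access_frequency == "cold" and d.size_gb > 100:
        targets.append("tape-archive")           # e.g. a cheaper tape backend
    return targets

print(choose_storage(Deposit(size_gb=250, sensitive=False, access_frequency="cold")))
# ['irods-internal', 'commercial-cloud', 'tape-archive']
```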

The Kindura system is based on our Fedora repository feeding Azure, iRODS and CASTOR (another use case, for researchers to migrate to cheaper tape storage) as well as AWS and Rackspace; it also feeds DuraCloud. The repository is populated via a web browser, depositing into the management server and down into the Fedora repository and DuraCloud.

Q&A

Q1) For Richard – you were talking about deletions and how to deal with them?
A1 – Richard) There are a couple of ways to handle logically deleted items. You can automate based on a policy for garbage collection – e.g. anything deleted and not restored within a year. But you can also manually delete (you have to do it twice, but you can’t mitigate against that).
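
The garbage-collection policy Richard describes ("deleted and not restored within a year") is straightforward to express; a minimal sketch, with the one-year window and record fields as illustrative assumptions:

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=365)   # illustrative: one year before purge

def eligible_for_purge(items: list[dict], now: datetime) -> list[str]:
    """Items logically deleted more than GRACE_PERIOD ago and never restored."""
    return [
        item["id"]
        for item in items
        if item.get("deleted_at") is not None
        and item.get("restored_at") is None
        and now - item["deleted_at"] > GRACE_PERIOD
    ]

items = [
    {"id": "aip-001", "deleted_at": datetime(2011, 5, 1), "restored_at": None},
    {"id": "aip-002", "deleted_at": None, "restored_at": None},
]
print(eligible_for_purge(items, datetime(2012, 7, 10)))   # ['aip-001']
```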

Q2) Simon, I had a question. You integrated a rules engine and that’s quite interesting. It seems that rules probably add some significant flexibility.

A2 – Simon) We actually evaluated several different rules engines. Jules is easy and open source, and for this set-up it seemed quite logical to use it. It sits totally separate from the DuraCloud set-up at the moment, but it seemed like a logical extension.

 
