Pre-Raphaelites Invading London

Tate Britain will pay homage to the Pre-Raphaelite Brotherhood in a major exhibition running from 12 September 2012 to 13 January 2013.  This exhibition, called Pre-Raphaelites: Victorian Avant-Garde, is launched nearly three decades after a previous Pre-Raphaelite exhibition, held when the museum was still known as the Tate Gallery.

Art historian talking about a Tate exhibit

A 1984 news story about the Pre-Raphaelites and the Tate's last major exhibition of their work (PRE-RAPHAELITES ART. Channel 4 Early Evening News. 07-03-1984)

This Brotherhood of young artists – painters, sculptors, poets, designers – bemoaned the stagnation in the works of their contemporaries and their obsession with meticulous copying of the classics that ignored art’s purpose: making a statement.

Detailed portrait of a woman with fantastical elements

Some Pre-Raphaelite paintings were illustrations of the poetry that also came from within the movement (Jealousy. Art Online, Culture Grid. Painted in 1890)

John Ruskin, a powerful critic and great ally of the Brotherhood, did much to cement their legacy in the world of art history.  He once declared that Pre-Raphaelite doctrine stood against art only for the sake of aesthetic pleasures – beauty, he said, could only ever be subordinate to the message within a work of art.

Portrait of John Ruskin as a young man

Ruskin was a friend of Pre-Raphaelite brother John Millais, but the pair were involved in a love triangle with Ruskin's then-wife Effie, who went on to marry Millais (Portrait of John Ruskin as a young man. By George Richmond, Wellcome Images. 1900)

And send a message they did.  Their use of photographic realism in Christian scenes enraged critics, including Charles Dickens.  And Pre-Raphaelite women, especially those painted by Rossetti, were often derided for their ‘fleshy’ nature.

It wasn’t all about grandiose stabs at orthodoxy.  Their youthful vigour and passion for playful details make Pre-Raphaelite works favourites with the public to this day.

Painting of a woman clutching a pot of basil

This painting has it all: love, tragedy, basil (Isabella and the Pot of Basil. Art Online, Culture Grid. Painted in 1867)

The original Brotherhood was a relatively small group who worked for a short time as “brethren”, but the movement they started, the ideals they championed and the artistic styles they advocated reached far and wide in Britain and beyond.  Indeed, fans of the Pre-Raphaelite movement can be found around the world.

Export of Art Review Committee investigating

One of the Tate's Pre-Raphaelite pieces came into their hands during a bit of a scandal… (Art Deal. Channel 4 Early Evening News. 13-08-1998)

Further Links (will open in new tab/window):

OR2012 on YouTube

We’ve had some time to sort our videos out and get things organised. You may have already found our YouTube channel, but here’s a handful of useful links to get you browsing through what was said and done inside each session.

First, an apology. The recording for one of Tuesday’s talks, Research Data Management and Infrastructure (or P1A), didn’t end up working. Our AV team is still trying to salvage it, but for now we’re going to say there won’t be a video for that session. Fortunately, we’ve got a liveblog of P1A up, so you can refresh your memory on the subject or see what went on behind closed doors there. Also, session P3A on the same topic does have a video, which is embedded below.

Click here to view the embedded video.

Now on to the good stuff. We’re putting together playlists for each day of talks and for the Pecha Kucha sessions. We’ve also posted a bunch of new individual Pecha Kucha videos for your convenience. Check out the second RepoFringe Pecha Kucha session (RF5) below. If you just want to see the winner, Norman Grey’s first up.

Click here to view the embedded video.

At 65 uploaded videos and almost 2000 views so far, we think there’s something for pretty much all Open Repositories folk to enjoy!

Highlights (so far)

OR2012 has wrapped up, tweets are now just slowly fluttering in, and blog posts are popping up like new database entries in springtime. We wanted to gather together a sampling of the best stuff we’ve come across since last week and put it all in plain sight. We know you guys eat broken links and buried content for breakfast, but we figured this could be your pre-meal cup of coffee. …or something. Anyway, here’s what we’ve got.

Keita Bando was active throughout the conference. Here's a shot taken at the drinks and poster session. Click through to see the rest of Keita's lovely photos

Natasha Simons was one of our volunteer bloggers, and she did a fantastic job of it. Mixing summary, analysis, and flair into each post makes each and every one a pleasure to read. Here’s one on arriving in Edinburgh and hearing about the ‘Building a National Network’ workshop, one on conference day 2 (and haggis balls), and one with a sporran full of identifiers chat.

Rob Hilliker immortalized some of the software archiving workshop whiteboard notes for us. Linked to his Twitter post, which leads to a few more pictures and his epic stream of OR2012 tweets

Nick Sheppard, another of our volunteer bloggers, wrote up his reflections of the first two days of the conference on the train ride home. He was keen to write it, and you should be keen to read it. Trust us.

Owen Stephens put together some notes and commentary on repository services, and especially on ResourceSync for folks that are into that sort of thing.

We’re also pleased that discussing the Anthologizr project inspired an Edinburgh University MSc student to focus on that work for his e-Learning dissertation.

An amazing bit of #OR2012 activity analytics by Martin Hawkseye using Carrot2. Click through for full details on how it was made.

The JISC MRD folks took superb notes about the session on institutional perspectives in research data management and infrastructure.

Brian Kelly weighed in on Cameron Neylon’s opening plenary and the significance of connectedness, with particular focus on social media platforms. His site is always worth a browse, so keep tabs on it. View the plenary below.

The DevCSI developer challenge was quite a lively segment of the conference, no matter which side of the mic you were on. Stuart Lewis drummed up excitement about the collaboration between developers and managers that the challenge aimed for this year, and the result was more than we could hope for. The number of submissions was higher than ever. Check out the competition show and tell and read about the winners.

A mockup of Clang! It was the runner-up project in the DevCSI developer challenge. Click through for a post about the idea

That’s what we’ve gathered so far, but it isn’t enough to do you all justice. That’s why we want you to comment, write in, tweet, and photograph everything you think we missed. We need slide decks, papers, pictures, and everything else. Speakers, if you haven’t passed on slides to session chairs, don’t be shy. And everybody else, drop us a line. We’ll be sure to include whatever you’ve got.

"Coder we can believe in." Click through for Adam Field's first tweet of the image

All this work isn’t just for the website. Everything we gather up will be going into a repository of open repository conference content. What can we say, we’re pretty single-minded when it comes to keeping it all open access for you lot. Get sending, and we’ll share more soon.

Developer’s Challenge, Pecha Kucha Winners and Invitation to OR2013 LiveBlog

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin Ashley is introducing us to this final session…

How many of you managed to get along to a Pecha Kucha Session? It looks like pretty much all of you, that’s fantastic! So you will have had a chance to see these fun super short presentations. Now as very few will have seen all of these we are awarding winners for each session. And I understand that the prizes are on their way to us but may not be at the podium when you come up. So… for the first session RF1, and in the spirit of the ceilidh, I believe it has gone to a pair: Theo Andrew and Peter Burnhill! For the second stream, strand RF3 it’s Peter Sefton – and Anna! For RF3 it’s Peter Van de Joss! And for RF4 it’s Norman Grey!

And now over to Mahendra Mahey for the Developer Challenge winners…

The Developer Challenge has been run by my project, DevCSI: Developer Community Supporting Innovation and we are funded by JISC, which is funded by UK Government. The project’s aims it about highlighting the potential, value and impact of the work developers do in UK Universities in the area of technical innovation, this is through sharing experience, training each other and often on volunteer basis. It’s about using tecnology in new ways, breaking out of silos. And running challenges… so onto the winners of the Developers Challenge at DevCSI this year.

The challenge this year was “to show us something new and cool in the use of repositories”. First of all I’d like to thank Alex Wade of Microsoft Research for sponsoring the Developer Challenge and he’ll be up presenting their special prize later. This year we really encouraged non developers to get involved to, but also to chat and discuss those ideas with developers. We had 28 ideas from splinter apps, repositories that blow bubble, SWORD buttons.. .and mini challenege appeared – Rob Sanderson from Los Alamos put out a mini idea! That’s still open for you to work on!

And so.. the final decisions… We will award the prizes and redo the winning pitches! I’d like to also thank our judges (full list on DevCSI site) and our audience who voted!

First of all honourable mentions:

Mark McGillivray and Richard Jones – getting academics close to repositories or Getting Researchers SWORDable.

Ben O’Steen and Cameron Neylon – Is this research readable

And now the Microsoft Research Prize and also the runners up for the main prize as they are the same team.

Alex: What we really loved was you guys came here with an idea, you shared it, you changed it, you worked collaboratively on it and

Keith Gilmerton and Linda Newman for their mobile audio idea.

Alex: they win a .Net Gadgeteer rapid prototyping kit with motherboard, joystick, monitor, and if you take to Julie Allison she’ll tell you how to make it blow bubbles!

Peter Sefton will award the main prize…

Peter: Patricks visualisation engine won as we’re sick of him entering the developer challenge

The winners and runners up will share £1000 of Amazon Vouchers and the winning entry – the team of one – will be funded to develop the idea – 2 days development time. Patrick: I’m looking for collaborators and also an institution that may want to test it get in touch.

Linda and Keith first

Linda: In Ohio we have a network of DSpace repositories including the Digital Archive of Literacy Narratives – all written in real peoples voices and using audio files, a better way to handle these would be a boon! We also have an Elliston Poetry Curator – he collects audio on analogue devices, digital would be better. And in the field we are increasingly using mobile technologies and the ability to upload audioj or video at the point of creation with transcript would greatly increse the volume of contribution

MATS – Mobile AudioVisual Transcription Service

Our idea is to create an app to deposit and transcript audio – and also video – and we used SWORDShare, an idea from last years conference, as we weren’t hugely experienced in mobile development. We’ve done some mock ups here. You record, transcribe and submit all from your phone. But based on what we saw in last years app you should be able to record in any app as an alternative too. Transcription is hugely important as that makes your file indexable. And it provides access for those with hearing disabilities, and those that want to preview/read the file when listening isn’t an option. So when you have uploaded your file you request your transcription. You have two options. Default is Microsoft Mavis – mechanical transcription. But you can also pick Amazon Mechanical Turk – human transcription, and you might want that if the audio quality was very poor or not in English.

MAVIS allows some additional functionality – subtitling, the ability to jump to a specific place in the file from a transcript etc. And a company called GreenButton offers a webservices API to MAVIS. We think that even if your transcription isn’t finished you can still submit to the repository as new version of SWORD supports updating. That’s our idea! We were pitching this idea but now we really want to build it! We want your ideas, feedback, tech skills, input!

And now Patrick McSweeney and DataEngine.

My friend Dave generated 1TB data in every data run and the uni wouldnt host that. We found a way to get that data down to 10 GB for visualisation. It was back ups on a home machine. It’s not a good preservation strategy. You should educate and inform people and build solutions that work for them!

See: State of the Onion. A problem you see all the time… most science is long tail, and support is very poor in that long tail. You have MATLAB and Excel and that’s about it. Dave had all this stuff, he had trouble managing his data and graphs. So the idea is to import data straight from Dave’s kit to the repository. For Dave the files were CSV. And many tools will export to it, its super basic unit of data sharing – not exciting but it’s simple and scientists understand it.

So, at ingest you give your data provenance and you share your URIs, and you can share the tools you use. And then you have tools for merging and manipulation. the file is pushed into storage form where you can run SQL processing. I implemented this in an EPrints repository – with 6 visualisation but you could add any number. You can go from source data, replay experiment, and get to visualisations. Although rerunning experiments might be boring you can also reuse the workflow with new similar data. You can create a visualisation of that new data and compare it with your original visualisation and know that the process has been entirely the same.

It’s been a hectic two days. It’s a picture (of two bikers on a mountain) but it’s also a metaphor. There are mountains to climb. This idea is a transitional idea. There are semantic solutions, there are LHC type ideas that will appear eventually but there are scientists at the long tail that want support now!

And finally… thank you everyone! I meant what I said last night, all who presented yesterday I will buy a drink! Find me!

I think 28 ideas is brilliant! The environment was huge fun, the developers lounge were a lovely space to work in.

And finally a plug… I’ve got a session at 4pm in the EPrints track and that’s a real demonstration of why the Developer Challenge works as the EPrints Bazaar, now live, busy, changing how we (or at least I) think about repositories started out at one of these Developer Challenges!

At the dinner someone noted that there are very few girls! Half our user base are women but hardly any women presented at the challenge, Ladies, please reprasent.

And also… Dave Mills exist. It is not a joke! He reckons he generated 78 GB of data – not a lot, you could probably get it on a memory stick! Please let your researchers have that space centrally! I drink with reseachers and you should too!

And Ben, Ben O’Steen had tech problems yesterday but he’s always here and is brilliant. is live right now, rate a DOI for whether its working.

And that’s all I have to say.

And now over to Prince Edward Island – Proud Host of OR 2013

I’m John Eade, CEO of DiscoveryGarden and this is Mark Leggot. So, the first question I get is where are you? Well we are in Canada! We are tiny but we are there. Other common questions…

Can I walk from one end of the island to the other? Not in a day! And you wouldn’t enjoy it if you did

How many people live there? 145,000 much more than it was

Do Jellyfish sting? We have some of the warmest waters so bring your swimsuit to OR2013!

Can you fly there? Yes! Direct from Toronto, Montreal, Halifax, Ottawa,(via Air Canada and Westjet) and from New York City (via Delta). Book your flights early! And Air Canada will add flights if neccassary!

We will work diligently to get things up on line as early as possible to make sure you can book travel as soon as possible.

Alternatively you can drive – you won’t be landlocked – we are connected to mainland. Canada is connected to us. We have an 8 mile long bridge that took 2 and a half years to build and its 64 metres high – its the highest point in PEI and also the official rollercoaster!

We are a big tourism destination – agriculture, fishing, farming, software, aerospace, bioresources. We get 1 million tourists per year. That means we have way more things to do there than a place our size should – championship quality gold courses. Great restaurants and a culinary institute. We have live theatre and we are the home of Anne of Green Gables, that plucky redhead!

We may not have castles… but we have our own charms…!

Cue a short video…

Mark: free registration if you can tell me what the guy was doing?

Audience member: gathering oysters?

Mark: yes! See me later!

So come join us in Prince Edward Island. Drop by our booth in the Concourse in Appleton Tower concourse for another chance to win free registration to next years event. We’ve had lots of support locally and this shoudl be a great event!

P6B: Digital Preservation LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Digital Preservation Network, Saving the Scholarly Record Together
Speaker(s): Michele Kimpton, Robin Ruggaber

Michelle is CEO of DuraSpace. Myself and Robin are going to be talking about a new initiative in the US. This initiative wasn’t born out of grant funding but by university librarians and CIOs who wanted to think about making persistent access to scholarly materials and knew that something needed to be done at scale and now. Many of you will be well aware that libraries are being asked to preserve digital and born digital materials and there are not good solutions to do that in scale. Many of us have repositories in place. Typically there is an online or regular backup but these aren’t at preservation scale here.

So about a year ago a group of us met to talk about how we might be able to approach this problem. And from this – Digital Preservation and Network – was born. DPN is not just a technical architecture. It’s an approach that requires replication of complete scholarly record access nodes with diverse architectures without single points of failure. It’s a fedration. And it is a community allowing this to work at mass scale.

At the core of DPN are a number of replicated nodes. There are minimum of three but up to five here. The role of the nodes is to have complete copies of content, full replications of each replicating nodes. This is a full content object store, not just a metadata node. And this model can work with multiple contributing nodes in different institutions – so those nodes replicate across architectures, geographic locations, institutions.

DPN Principle 1: Owned by the community

DPN Principle 2: Geographical diversity of nodes

DPN Principle 3: Diverse organisations – Uof Michigan, Stanford, San Diego, Academic Presrvation Trust, University of Virginia.

DPN Principle 4: Diverse Software Architectores – including iRODS, HATHI Trust, FedoraCommons, Standford Digital Library

DPN Principle 5: Diverse Political Environments – we’ve started in the US but the hope it to expand out to a more diverse global set of locations

So DPN will preserve scholarship for future generations, fund replicating ndes to ensure functional independence, audit ad verify content, provide a legal framework for holding succession rights – so if a node goes down this means the content will not be lost. And we have a diverse governance group taking responsibility for specific areas.

To date 54 partners and growing, about $1.5 million in funding – and this is not grant funding – and we now have a project manager in place.

Over to Robin…

Robim: Many of the partners in the APTrust have also been looking at DPN. APTrust ia a consortium committed to creation and management of an aggregated preservation repository and, now that DPN underway, to be a replicating node. APTrust was formed for reasons of community-building, economies of scale – things we could do together that we could not do agin, aggregated content, long term preservation, disaster recovery – particularly relevent given recent east coast storms.

The APTrust has several arms: Business and marketing strategy; governance policy and legal framework; preservation and collection framework; repository implementation plan – the technical side of APTrust and being a DPN node. So we had to bring together University librarians, technology liaisons, ingest/preservation. The APTrust services are the aggregation repository, the separate replicating node for DPN, and the access service – initially for administaration but also thinking about more services for the future.

There’s been a lot of confusion as APTrust and DPN started emerging at about the same time. And we are doing work with DPN. So we tend to think of the explanation here being about Winnowing of Content with researchers repository of files at the top, then local institutions repositories, then AP trust – preservation for our institutions that provide robustness for our content, and DPN is then for long term preservation. APTrust is preservation and access. DPN is about preservation only.

So the objectives of the initial phase of the APTrust is engaging partners, defining sustainable business model, hiring a project director, building the aggregation repository and setting up our DPN node. We have an advisory group for the project looking at governance. The service implementation is a phased approach building on experience, leveraging open soure – cloud storage, compute notes, DuraCloud all come into play, economies of scale, TRAC – we are using as a guideline for architecture. APTrust will be sitting at the end of legacy workflows for ingest, it will take that data in, ingest to DuraCloud services, synching to Fedora aggregation repository, and anything for long term preservation will also move to the APTrust DPN Noder with DuraCloud OS via cloudsync.

In terms of the interfaces there will be a single administrative interface which gives access to admin of DuraCloud, CloudSync and Fedora. Which will allow audit reports, functionality in each individual area etc. And that uses the API for each of those services. We will have a proof of that architecture at end of Q4 2012. Partners will feedback on that and we expect to deploy in 2013. Then we will be looking at disaster recovery access services, end-user acces, format migration services – considered a difficult issue so very interesting, best practices fro content types etc., coordinated collection development – across services, hosted repository services. Find out more at and


Q1) In Denmark we are building our national repository which is quite like DPN. Something in your preserntation: it seems that everything is fully replicated to all nodes. In our organisation services that want to preserve something they can enter a contract with another/a service and that’s an economic way to do things but it seems that this model is everthing for everyone.

A1 – Michelle) Right now the principle is everyone gets a copy in everything. We may eventually have specialist centres for video, or for books etc. Those will probably be primarily access services. We do have a diverse ecosystem – back ups across organisations in different ways. You can’t choose stuff in one or another node.

Q2) This looks a lot like LOCKSS – what is the main difference between DPN and a private LOCKSS network.

A2) LOCKSS is a technology for preservation but it’s a single architecture. It is great at what it does so it will probably be part of the nodes here – probably Stanford will use this. But part of the point is to have multiple architectural systems so that if there is an attack on one architecture just one component of the whole goes down.

Q3) I understand the goal is replication but what about format obsolescence – will there be format audit and conversion etc?

A3 – Michelle) I think who does this stuff, format emulation, translation etc. has yet to be decided. That may be at node level not network level.

Topic: ISO : Trustworthy Digital Repository Certification in Practice

Speaker(s): Matthew Kroll, David Minor, Bernie Reilly, Michael Witt

This is a panel session chaired by Michael Witt of Purdue University. This is about ISO 16363 and TRAC, the Trustworthy Repository Audit Checklist – how can a user trust that data is being stored corrrectly and securely, that it is what it says it is.

Matthew: I am a graduate research assistant working with Micheal Witt at Purdue. I’ve been preparing the Purdue Research Repository (PURR) for TRAC. We are a progressive repository with online workspace and data sharing platform, to user archiving and access, to preservation needs of Purdue University graduates, researchers and staff. So for today I will introduce you to ISO 16363 – this is the users guide that we are using to prepare ourselves, I’ll give an example of trustworthiness. So a neccassary and valid question to ask ourselves is “what is “trustworthiness” in this context?” – it’s a very vague concept and one that needs to grow as the digital preservation community and environment grows.

I’d like to offer 3 key qualities of trustworthiness (1) integrity, (2) sustainability, (3) support. And I think it’s important to map these across your organisations and across the three sections of ISO 16363. So, for example, integrity might be that the organisation has sufficient staff and funding to work effectively. Or for the repository it might be that you do fixity checks, procedures and practices to ensure successful migration or translation, similarly integrity in infrastructure may just be offsite backup. Similarly sustainability might be about staff training being adequate to meet changing demands. These are open to interpretation here but useful to think about.

In ISO 16363 there are 3 sections of criteria (109 criteria in all): (3) Organizational Infrastructure; (4) Digital Object management; (5) Infrastructure and Security Risk Management. There isn’t a one-to-one relationship in documentation here. One criteria might have multiple documents, a document might support multiple criteria.

Myself and Micheal created a PURR Gap Analysis Tool – we graded ourselves and brought in experts from the organisation in the appropriate areas and we gave them a pop quiz. And we had an outsider read these things. This had great benefit – being prepared means you don’t overrate yourself. And secondly doing it this way – as PURR was developing and deploying our process here – we gained a real understanding of the digital environment

David Minor, Chronopolic Program Manager, UC San Diego Libraries and San Dieo Supercomputer Center: We completed the Trac process this April. We did it through the CDL organisation. We wanted to give you an overview of what we did, what we learnt. So a bit about us first. Chronopolis is a digital preservation network based on geographic replication – UCSD/SDSC, NCAR, UMIACS. We were initially funded vid the Livrary of Congress NDIIPP program. We spun out into a different type of organisation recently, a FIFA service. Our management and finances are via UCSD. All nodes are independent entities here – interesting questions arise from this for auditors.

So, why do TRAC? Well we wanted to do validation of our work – and this was a last step in our NDIIPP process – an important follow on for development. We wanted to learn about gaps, things we could do better. We wanted to hear what others in the community had to say – not just those we had worked for and served but others. And finally it sounds cyncial but it was to bring in more business – to let us get out there and show what we could do particularly as we moved into FIFA service mode.

The process logistics were that we began in Summer 2010 and finished Winter 2011. We were a slightly different model. We were a self-audit that then went to auditors to follow up, ask questions, speak to customers. The auditors were three people who did a site visit. It’s a closed process except for that visit though. We had management, finances, metadata librarians, and data centre managers – security, system admin etc all involved – equiverlent of 3 FTE. We had discussed with users and customers. IN the end we had hundreds of pages of documentation – some writen by us, some log files etc.

Comments and issues raised by auditors were that we were strong on technology (we expected this as we’d been funded for that purpose) and spent time commenting on connections with participant data centres. They found we were less strong on business plan – we had good data on costs and plans but needed better projections for future adoption. And we had discussion of preservation actions – auditors asked if we were even doing preservation and what that might mean.

Our next steps and future plans based on this experience has been to implement recommendations working to better identify new users and communities, improve working with other networks. How do changes impact audit – we will “re-audit|” in 18-24 months – what if we change technologies? What is management changes? And finally we definitely have had people getting in touch specifically because of knowing we have been through TRAC. All of our audit and self-audit materials are on the web too so do take a look.

Bernie from the Centre for Research Libraries Global Resources Network: We do audits and certification of key repositories. We are one of the publishers of the TRAC checklist. We are a publisher not an author so I can say that it is a brilliant document! We also participated in development of recent ISO standard 16363. So, where do we get the standing to do audits, certification and involvement in standards. Well we are a specialist centre in

We started in UofChichargo, Northwestern etc. established in the 1949. We are a group of 167 universities in US, Canada and Hong Kong and we are about preserving key research information for humanities and social science. Almost all of our funding comes from the research community – also where are stakeholders and governance sit. And the CRL Certification program has the goal to support advanced research. We do audits of repositories and we do analysis and evaluations. We take part in information sharing and best practice. We aim to do landscape studies – recently been working on digital protest and documentation

Portico, Cronopolic, currently looking at PURR and PTAB test audits. The process is much as described by my colleagues. The repository self-audits, then we request documentation, then a site visit, then report is shared via the web. In the future we will be doing TRAC certification alongside ISO 16363 and we will really focus on Humanities and social science data. We continue to have the same mission as when we were founded in 1949, to enable the resiliance and durability of research information.


Q1 Askar, State University of Denmark) The finance and sustainability for organisations in TRAC… it seemed to be predicated on a single repository and that being the only mission. But national archives are more “too big to fail”. Questionning long term funding is almost insulting to managers…

A1) Certification is not just pass/fail. It’s about identifying potential weakness, flaws, points of failure for a repository. So for a national library they are too big to fail perhaps but the structure and support for the organisation may impact the future of the repository – cost volitility, decisions made over management and scope of content preserved. So for a national institution we look at finance for that – is it a line item in national budget. And that comes out in the order, the factors governing future developments and sustainability.

Topic: Stewardship and Long Term Preservation of Earth Science Data by the ESIP Federation
Speaker(s): Nancy J. Hoebelheinrich

I am principle of knowledge management at Knowledge Motifs in California. And I want to talk to you about preservation of earth science data by ESIP – Earth Science Informaion Partners. My background is in repositories and metadata and I am relatively new to earth sciences data and there are interesting similarities. We are also keen to build synergies with others so I thought it would be interesting to talk about this today.

The ESIP Federation is a knowledge network for science data and technology practitions – people who are building component for a science data infrastructure. It is distributed geographically, in terms of topic, interest. It’s about a community effort, free flowing ideas in a collaborative environment. It’s a membership organisation but you do not have to be a member to participate. It was started by NASA to support Earth Obervation data work. The idea was to not just rely on NASA for environmental resewarch data. They are interested in research, application in education etc. The areas of interest include climate, ecology, hydrometry, carbon management, etc. Members are of four types: Type 4 are large organisations and sponsors including NOAA and NASA. Type 1 are data centres – similar to libraries but considered separate. Type 2 are researchers and Type 3 are Application developers. There is a real cross sectoral grouping so really interesting discussion arises.

The type of things the group is working on are often in data informatics and data science. I’ll talk in more detail in a second but it’s important to note that organisations are cross functional as well – different stakeholders/focuses in each. We coordinate the community via In Person Meetings, ESIP Commons, Telecons/WebEx, Clusters, Working Groups and Committes and these all feed into making us interoperable. We are particularly focused on Professional development, outreach and collaboration. We have a number of active groups, committees and clusters.

Our data and informatics area is about collaborative activities in data preservation and stewardship, semantic web, etc. Data preservation and stewardship is very much about stewardship principles, ditation guidelines, provenance context and content standards, and linked data principles. Our Data Stewardship Principles are hat they are for data creators, intermediaries and data users. So this is about data management plans, open exchange of data, metadata and progress etc. Our data citation guidelines were accepted by ESIP Membership Assembley in January 2012. These are based on existing best practice from International Polar Year citation guidelines. And this ties into geospatial data standards and these will be used by tools like the new Thomson Reuters new Data Citation Index.

Our Provenance, Context and Content Standard are about thinking about the data you need about a data set to make it preservable into the long term. So this is about what you would want to collect and how you would collect that. Initially based on content from NASA and NOAA and discussions associated to them. It was developed and shared via the ESIP wiki. The initial version was in March 2011. latest version is June 2011 but this will be updated regularly. The categories are focused mostly on satellite remote setting – preflight/preopertional instrument descriptions etc. And these are based on Use cases – based on NASA work from 1998. What has happened as a result of that work is that NASA has come up with a specification for their data for earth sciences. They make  a distinction betweeen documentation and metadata, a bit differently from some others. Categories here in 8 areas – many technical but also rationale. And these categories help set baseline etc.

Another project we are working on is Identifiers for data objects. There was an abstract research project on use cases – unique identification, unique location, citable location, scientifically unique identification. They came up with categories and characterstics and questions to ask each ID schemes. The recommended IDs ended up being DOI for a citable locator and UUID for unique identifier but we wanted to test this. We are in the process of looking at this at the moment. Questions and results will be compared again.

And one more thing the group is doing is Semantic Web Cluster Activities – they are creating different ontologies for specific areas such as SWEET – an ontology for environmental data. And there are services built on top of those ontologies (Data Quality Screening Service on weather and climade data from space (AIRE) for instance) – both are available online. Lots of applications for this sort of data.

And finally we do education and outreach – data management training short courses. given that it’s important that researchers know how to manage their data we have come up with a short training courses based on the Khan Acadaemy model. That is being authored and developed by volunteers at the moment.

And we have associated activities and organisations – DataOne, DataConservancy, NSF’s Earth Cube. If you are interested to work with ESIP please get in touch. If you want to join our meeting in Madison in 2 weeks time there’s still time/room!


Q1 – Tom Kramer) It seems like ESIP is an eresearch community really – is there a move towards mega nodes or repository or is it still the Wild West?

A1) It’s still a bit like the Wild West! Lots going on but les focus on distribution and preservation, the focus is much more about making data is ingested and made available – where the repositories community was a few years ago. ESIP is interested in the data being open but not all scientists agree about that, so again maybe at the same point as this community a few years ago.

Q2 – Tom) So how do we get more ESIP folk without a background in libraries to OR2012?

A2) Well I’ll share me slides, we probably all know people in this area. I know there are organisations like EDINA here. etc.

Q3) [didn’t hear]

A3) EarthCube area to talk about making data available. A lot of those issues are being discussed. They are working out the common standard OGC, ISO, sharing ontologies but not nessaccarily preservation behind repositories. It’s sort of data centre by data centre.

Topic: Preservation in the Cloud: Three Ways (A DuraSpace Moderated Panel)
Speaker(s): Richard Rodgers, Mark Leggott, Simon Waddington, Michele Kimtpon, Carissa Smith

Michelle: DuraCloud was developed in the last few years. It’s a software but also a SAAS (Software As A Service) service. So we are going to talk about different usage etc.

Richard Rodgers, MIT: We at MIT libraries participated in several pilot processes in which DuraCloud was defined and refined. The use case here was to establish a geo distributed replication of the repository. We had content in our IR that was very heterogenous in type. We wanted to ensure system administration practices only  address HW or admin failues – other errors unsecured. Service should be automatic yet visible.We developed a set of DSpace tools geared towards collection and administration. DuraCloud provided a single convenient point of service interoperation. Basically it gives you an abstractiojn to multiple backend services. That’s great as it means that applications and protects against lock-in. Tools ad APIs for DSpace integration. High bandwidth acces to developers. Platform for preservation system and institution-friendly service terms.

Challenges and solutions here… It’s not clear how the repository system should create and manae the files yourself. Do all aspects need to have correllated archival units. So we decided to use AIPs – units of replication which packages items together, they gather loose files. There is repository managere involveement – admin UI, integration, batch tools. There is an issue of scale – big data files really don’t suit interactivity in the cloud, replication can be slow, queued not synchronous. And we had to design a system were any local error wouldn’t be replicated (e.g. deletion locally isn’t repeated in replication version). However deletion is forever – you can remove content. The code we did for the pilot has been refined some what and is available for DSpace as an add on – we hink it’s fairly widely used in the DSpace community.

Mark Leggott, University of PEI/DiscoveryGarden: I would echo the complicated issues you need to consider here. We had the same experience in terms of very responsive process with DuraSpace team. Just a quick bit of info on Islandora. It is a Drupal + Fedora framework from UPEI. Flexible UI and apps etc. We think of DuraCloud as a natural extension of what we do. The approach we have is to leverage DuraCloud and CloudSync. The idea is to maintain the context of individual objects and;/or complete collections. To enable a single button restore of damaged edits. And it integrate with standard or private DC. We have an initial release coming. There is a new component in the Manage tab in the Admin panel called “Vault”. It provides full access to DuraCloud and CloudSync services. It’s accessible through Islandora Admin Panel – you can manage settings. you can integrate it with your DuraSpace enabled service. Or you can do this via DiscoveryGarden where we manage DuraCloud on client’s behalf. And in maaging youe maerials you can access or restore at an item or collection level. You can sync to DuraCloud or restore from the cloud etc. You get reports on synching etc. And reports on matches or mismatches so that you can restrore data from the cloud as needed. And you can then manually check the object.

Our next steps are to provide tihhter integratione nad more UI functions, to move to automated recovery, to enable full Fedora/Collection restore, and to include support for private DuraCloud  instances.

Simon: I will be talking about the Kindura project funded by JISC which was a KCL, STFC and ? initiative. The problem is that storage of research outputs (data, documents) is quite ad hoc but it’s a changing language and UK funders can now require data for 10 years+ so it’s important. We wer elooking at hybrid cloud solutions – commercial cloud is very elastic, rapid deployments, transparent cost, but risky in terms of data sensitivity, data protection law, service availablily and loss. In house storage and cloud storage seem like the best way to gain the benefits but mitigate risks.

So Kindura was a proof of concept repository for research data combining commercial cloud and internal storage (iRODS). Based on Fedora Commons. DuraCloud provides a common stoarge intereface and we deployed from source code – we found Windows was best for this and have created some guidelines on this sort of set up. And we developed a storage management framework based on policies, legal and technical constraints as well as cost (including cost of transmitting data in/out of storage) We tried to implement something as flexible as possible. We wanted automated decisions for storage and migration. Content replicaion across storage providers for resiliance. Storage providers transparant to users.

The Kindura system is based on our Fedora Repository feeding Azure, iRODS and Castor (another use case for researchers to migrate to cheaper tape storage) as well as AWS and Rackspace, it also feeds DuraCloud. The repository is populated via web browser depositing into the management server and down into Fedora Respoitory AND DuraCloud.


Q1) For Richard – you were talking about deletion and how to deal with them
A1 – Richard) There are a couple of ways to gather logically delete items. So you can automate based on a policy for garbage collection – e.g. anything deleted and not restored within a year. But you can also  manually delete (you have to do it twice but you can’t mitigate against that).

Q2) Simon, I had a question. You integrated a rules engine and that’s quite interesting. It seems that Rules probably adds some significant flexibility.

A2 – Simon) We actually evaluated several different sorts of rules engines. Jules is easy, open source and for this set up it seemed quite logical to do this. It sits totally separate to DuraCloud set up at the moment but it seemed like a logical extension


P6A: Non-traditional content LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Eating your own dog food: Building a repository with API-driven development
Speaker(s): Nick John Jackson, Joss Luke Winn

The team decided they wanted to build a wholly new RDM, with research data as a focus for the sake of building the best tool for that job. This repository was also designed to store data during research, not just after.

Old repositories work very well, but they assume the entry of a whole file (or a pointer), only retrievable in bulk and in oddly organized pieces. They have generally limited interface methods and capacities. These old repositories also focus on formats, not form (structure and content) unless there is fantastic metadata.

The team wanted to do something different, and built a great backend first. They were prepared to deal with raw data as raw data. The API was built first, not the UI. APIs are the important bit. And those APIs need to be built in a way that people will want to use them.

This is wear eating your own dog food comes in. The team used their own API to build the frontend of the system, and used their own documentation. Everything had to be done well because it was all used in house. Then, they pushed it out to some great users, and made them do what they wanted to do with the ‘minimum viable product’. It works, and you build from there.

Traditional repos have a database, application, users. They might tack an API on at the end for manual and bulk control, but it doesn’t even include all of the functionality of the website usually. That or you screen scrape, and that’s rough work. Instead, this repository builds an API and then interacts with that via the website.

Research tends to happen on a subset of any given data set, nobody wants that whole data set. So forget the containers that hold it all. Give researches shared, easily usable databases. APIs put stuff in and out automatically.

This was also made extensible from day one. Extensible and writeable by everybody to the very core. The team also encourages re-usable modularity. People do the same things to their data over and over – just share that bit of functionality at a low data level. And they rely on things to do things to get things done – in other words, there’s no sense in replicating other people’s work if it’s done well.

The team ended up building better stuff because it uses its own work – if it doesn’t do what it’s meant to, it annoys them and they have to fix it. All functionality is exposed so they can get their work done quick and easy. Consistent and clean error handling were baked in for the sake of their own sanity, but also for everybody else. Once it’s all good and easy for them, it will be easy for 3rd parties to use, whether or not they have a degree in repo magic. And security is forcibly implemented across the board. API-level authentication means that everything is safe and sound.

Improved visibility is another component. Database querying is very robust, and saves the users the trouble of hunting. Quantitative information is quick and easy because the API gives open access to all the data.

This can be scalable horizontally, to as many servers as needed. It doesn’t use server states.

There are some problems involved in eating your own dog food. It takes time to design a decent API first. You also end up doubling up some development, particularly for frontend post-API development. APIs also add overhead. But after some rejigging, it all works with thousands of points per second, and it’s humming nicely.

Q: Current challenges?

A: Resourcing the thing. Lots of cutting edge technology and dependence on cloud architecture. Even with money and demand, IT infrastructure aren’t keeping up just yet.

Q: How are you looking after external users? Is there a more discoverable way to use this thing?

A: The closest thing we have is continuous integration to build the API at multiple levels. A discovery description could be implemented.

Q: Can you talk about scalability? Limitations?

A: Researchers will sometimes not know how to store what they’ve got. They might put pieces of data on their own individual rows when they don’t need to be. That brings us closer to our limit. Scaling up is possible, and doing it beyond limits is possible, but it requires a server-understood format.

Q: Were there issues with developers changing schemas mysteriously? Is that a danger with MongoDB?

A: By using our own documentation, forcing ourselves to look at it when building and questioning. We’ve got a standard object with tracking fields, and  if a researcher starts to get adventurous with schemas it’s then on them.


Topic: Where does it go from here? The place of software in digital repositories
Speaker(s): Neil Chue Hong

Going to talk about the way that developers of software are getting overlapping concerns with the repository community. This isn’t software for implementing infrastructure, but software that will be stored in that infrastructure.

Software is pervasive in research now. It is in all elements of research.

The software sustainability institute does a number of things at strategic and tactical levels to help create best practices in research software development.

One question is the role of software in the longer term – five and ten years on? The differences between preservation and sustainability. The former holds onto things for use later on, while the latter keeps understanding in a particular domain. The understanding, the sustainability, is the more important part here.

Several purposes for sustaining and preserving software. For achieving legal compliances (architecture models ought to be kept for the life of a building). For creating heritage value (gaining an overall understanding of influences of a creator). For continued access to data (looking back, through the lens of the software). For software reuse (funders like this one).

There are several approaches. Preserving the technology, whether it’s physical hardware or an emulated environment. Migration from one piece of software to another over time while ensuring functionality, or transitioning to something that does similar. There’s also hibernation, just making sure it can be picked apart some day if need be.

Computational science itself needs to be studied to do a good job of this. Software carpentry teaches scientists basic programming to improve their science. One thing, using repositories, is an important skill. Teaching scientists the exploratory process of hacking together code is the fun part, so they should get to do it.

Re-something is the new black. Reuse, review, replay, rerun, repair. But also reward. How can people be rewarded for good software contributions, the ones that other people end up using. People get pats on the back, glowing blog posts, but really reward in software is in its infancy. That’s where repositories come in.

Rewarding good development often requires publication which requires mention of the developments. That ends up requiring a scientific breakthrough, not a developmental one. Software development is a big part of science and it should be viewed/treated as such.

Software is just data, sure, but along with the Beyond Impact team these guys have been looking at software in terms of preservation beyond just data. What needs to get kept in software and development? Workflows should, because they show the boundaries of using software in a study – the dependencies and outputs of the code. Looking at code on various levels is also important. On the library/software/suite level? The program or algorithm or function level. That decision is huge. The granularity of software needs to be considered.

Versioning is another question. It indicates change, allows sharing of software, and confers some sort of status. Which versions should go in which repositories, though? That decision is based on backup (github), sharing (DRYAD), archiving (DSpace). Different repositories do each.

One of the things being looked at in sustaining software are software metapapers. These are scholarly records including ‘standard’ publication, method, dataset and models, and software. This enables replay, reproduction, and reuse. It’s a pragmatic approach that bundles everything together, and peer review can scrutinize the metadata, not the software.

The Journal of Open Research Software allows for the submission of software metapapers. This leads to where the overlap in development and repositories occurred, and where it’s going.

The potential for confusion occurs when users are brought in and licensing occurs. It’s not CC BY, it’s OSI standard software licenses.

Researchers are developing more software than ever, and trying to do it better. They want to be rewarded for creating a complete scholarly record, which includes software. Infrastructure needs to enable that. And we still don’t know the best way to shift from one repository role to another when it comes to software – software repositories from backup to sharing to archival. The pieces between them need to be explored more.

Q: The inconsistency of licensing between software and data might create problems. Can you talk about that?

A: There is work being done on this, on licensing different parts of scholarly record. Looking at reward mechanisms and computability of licenses in data and software need to be explored – which ones are the same in spirit?


Topic: The UCLA Broadcast News Archive Makes News: A Transformative Approach to Using the News in Teaching, Research, and Publication
Speaker(s): Todd Grappone, Sharon Farb

UCLA has been developing an archive since the Watergate hearings. It was a series of broadcast television recordings for a while, but not it’s digital libraries of broadcast recordings. That content is being put into a searchable, browsable interface. It will be publicly available next year. It grows about a terabyte a month (150000+ programs and counting), which pushes the scope of infrastructure and legality.

It’s possible to do program-level metadata search. Facial recognition, OCR of text on screen, closed caption text, all searchable. And almost 10 billion images. This is a new way for the library to collect the news since papers are dying.

Why is this important? It’s about the mission of the university copyright department: public good, free expression, and the exchange of ideas. That’s critical to teaching and learning. The archive is a great way to fulfill that mission. This is quite different from the ideas of other Los Angeles organizations, the MPAA and RIAA.

The mission of higher education in general is about four principles. The advancement of knowledge through research, through teaching, and of preservation and diffusion of that knowledge.

About 100 news stations being captured so far. Primarily American. International collaborators are helping, too. Pulling all broadcast, under a schedule scheme with data. It’s encoded and analyzed, then pushed to low-latency storage in H.264 (250MB/hr). Metadata is captures automatically (timestamp, show, broadcast ID, duration, and full search by closed captioning). The user interface allows search and browse.

So, what is news? Definitions are really broad. Novelties, information, and a whole lot of other stuff. The scope of the project is equally broad. That means Comedy Central is in there – it’s part of the news record. Other people doing this work are getting no context, little metadata, less broadcasts. And it’s a big legal snafu that is slowly untangling.

Fortunately, this is more than just capturing the news. There’s lots of metadata – transformative levels of information. Higher education and libraries need these archives for the sake of knowledge and preservation.

Q: Contextual metadata is so hard to find, and knowing how to search is hard. How about explore? How about triangulating with textual news via that metadata you do have?

A: We’re pulling in everything we can. Some of the publishing from these archives use almost literally everything (court cases, Twitter, police data, CCTV, etc). We’re excited to bring it all together, and this linkage and exploration is the next thing.

Q: In terms of tech. development, how has this archive reflected trends in the moving image domain? Are you sharing and collaborating with the community?

A: An on-staff archivist is doing just that, but so far this is just for UCLA. It’s all standards-driven so far, and community discussion is the next step.


Topic: Variations on Video: Collaborating toward a robust, open system to provide access to library media collections
Speaker(s): Mark Notess, Jon W. Dunn, Claire Stewart

This project has roots in a project called Variations in 1996. It’s now in use at 20 different institutions, three versions. Variations on Video is a fresh start, coming from a background in media development. Everything is open source, working with existing technologies, and hopefully engaging with a very broad base of users and developers.

The needs that Variations on Video are trying to meet are archival preservation, access for all sorts of uses. Existing repositories aren’t designed for time-based media. Storage, streaming, transcoding, access and media control, and structure all need to be handled in new ways. Access control needs to be pretty sophisticated for copyright and sensitivity issues.

Existing solutions have been an insufficient fit. Variations on Video offers basic functionality that goes beyond them or does them better. File upload, transcoding, and descriptive metadata will let the repository stay clean. Navigation and structural metadata will allow users to find and actually use it all.

VoV is built on a Hydra framework, Opencast Matterhorn, and a streaming server that can serve up content to all sorts of devices.

PBCore was chosen for descriptive metadata, with an ‘Atomic’ content model: parent objects for intellectual descriptions, child objects for master files, children of these for derivatives. There’s ongoing investigation for annotation schemes.

Release 0 was this month (upload, simple metadata, conversion), and release one will come about in December 2012. Development will be funded through 2014.

Uses Backlight for discover, Strobe media player for now. Other media players with more capabilities are being considered.

Variations on Video is becoming AVALON (Audio Video Archives and Libraries Online).

Using the agile Scrum approach with a single team at the university for development. Other partners will install, test, provide feedback. All documentation, code, workflow is open, and there are regular public demos. Hopefully, as the software develops, additional community will get involved.

Q: Delivering to mobile devices?

A: Yes, the formats video will transcode into will be selectable, but most institutions will likely choose a mobile-appropriate format. The player will be able to deliver to any particular device (focusing on iOS and Android).

Q: Can your system cope with huge videos?

A: That’s the plan, but ingesting will take work. We anticipate working with very large stuff.

Q: How are you referencing files internally? Filenames? Checksums? Collisions of named entries?

A: Haven’t talked about identifiers yet. UUIDs generated would be best, since filenames are a fairly fragile method. Fedora is handling identifiers so far.

Q: Can URLs point to specific times or segments?

A: That is an aim, and the audio project already does that.

Developer’s Challenge: Show and Tell LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 1 (LT1), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Hi there, I’m Mahendra Mahey, I run the DevCSI project, my organisation is funded by JISC. This is the fifth Developer Challenge. This is the biggest to date! We had 28 ideas. We have 19 presentations, each gets 3 minutes to present! You all need a voting slip! At the end of all of the presentations we will bring up a table with all the entries. To vote write the number of your favourite pitch. If it’s a 6 or a 9 please underline to help us! We will take in the votes and collate them. The judges won’t see that. They will convene and pick their favourites and then we will see if they agree… there will then be a final judging process.

The overall prize and runner up shares £1000 in Amazon vouchers. The overall winner will be funded to develop the idea (depending on what’s logitically possible). And Microsoft research have a .Net gadgeteer prize for the best development featuring Microsoft technology. So we start with…

1 – Matt Taylor, University of Southampton – Splinter: Renegade Repositories on Demand

The idea is that you have a temporary offshoot of your repository, can be disposed or reabsorbed, ideal for conferences or workshops, reduces overhead, network of personal microrepositories – the idea is that you don’t have to make accounts for anyone temporarily using your repositoriy. It’s a network of personal microrepository, A lightweight standalone anotation system. Its independent of the main repository. Great for inexperienced users, particularly important if you are a high prestige university. And the idea is that it’s a pseudopersonal workspace – can be shared on the web but separate of your main repository. And it’s a simplified workflow – so if you make a splinter repository for an event you can use contextual information – conference date, location, etc. to populate metadata. Microrepository already in development and tech exists: Demo at Bazaar workshop tomorrow. Reabsorption trivial using SWORD.

2 – Keith Gilmerton and Linda Newman – MATS: Mobile Audio Transcription and Submission

The idea is that you submit audio to repositories from phones. You set up once. You record audio. You select media for transcription, you add simple metadata You can review audio. Can pick from Microsoft Research’s MAVIS or Amazon’s Mechanical Turk. When submission back you get transcription and media to look at, can pick which of those two – either or both – you upload. And even if transcript not back its OK – new SWORD protocol does updates. And this is all possible using Android devices and code reused from one of last years challenges! Use cases – digital archive of literacy studies seek audio files, elliston poetry curator make analogue recordings , tablets in the field – Pompeii Archeaological Research Project would greatly increase submissions of data from the field.

3 – Joonas Kesaniemi and Kevin Van de Velde – Dusting off the mothballs introducing duster

The idea is to dust off time series here.  The only thing constant is change (Heraclitus 500BC). I want to get all the articles from AAlto university. It’s quite a new university but there used to be three universities that merged together. It would help to describe that the institution changed over time. Useful to have a temporal change model. Duster (aka Query expansion service) takes a data source that is a complex data model and then makes that available. Makes a simple Solr document for use via API. An example Kevin made – searching for one uni searches for all…

4 – Thomas Rosek, Jakub Jurkiewicz [sorry names too fast and not on screen] – Additional text for repository entries

In our repository we have keywords on the deposits – we can use intertext to explain keywords. Polish keywords you may not know them – but we can see that in English. And we can transliterate cyrillic. The idea is to build a system from blogs – connected like lego bricks. Build a blog for transliteration, for translating, for wikipedia, blog for geonames and mapping. And these would be connected to repository and all work together. And it would show how powerful

5 – Asger Askov Blekinge – SVN based repositories 

Many repositories have their own versioning systems but there are already well established versioning systems for software development that are better (SVN, GIT) so I propose we use SVN as the back end for Fedora.

Mass processing on the repository dowsn’t work well. Checkout the repo to a hadoop cluster, run the hadoop job, and commit the changed objects back. If we used standardised back end to access repository we could use Gource – software version control visualisation. I have developed a proof of concept that will be on Github in next few days to prove that you can do this, you can have a Fedora like interace on top of SVN repository.

6. Patrick McSweeney, University of Southampton – DataEngine

This is a problem we encountered, me and my friend Dabe Mills. For his PhD he had 1 GB of data, too much for the uni. Had to do his own workaround to visualise the data. Most of our science is in tier 3 where some data, but we need support! So the idea is that you put data into repository, allows you to show provenance, can manipulate data in the repository, merge into smaller CSV files, create a visualisation of your choice. You store intermediary files, data and the visualisations. You could do loads of visualisations. Important as first step on road to proper data science. Turns repository into tool that engages researchers from day one. And full data trail is there and is reproducable. And more interesting than that. You can take similar data, use same workflow and compare visualisation. And you can actually compare them. And I did loads in 2 days, imagine what I could do in another 2!

7. Petr Knoth from the Open University –  Cross-repository mobile application 

I would like to propose an application for searching across all repositories. You wouldn’t care about which repository it’s in, you would just get search it, get it, using these apps. And these would be provided for Apple and Google devices. Available now! How do you do this? You use APIs to aggregate – we can use applications like CORE, can use perhaps Microsoft Academic Search API. The idea of this mobile app is that it’s innovation – it’s a novel app. The vision is your papers are everywhere through syncing and sharing. It’s relevance to user problems: WYFIWYD: What you find is what you download. It’s cool. It’s usable. Its plausible for adoption/tech implementation.

8. Richard Jones and Mark MacGillivray, Cottage Labs – Sword it!

Mark: I am also a PhD student here at Edinburgh. From that perspective I know nothing of repositories… I don’t know… I don’t care… maybe I should… so how do we fix it. How do we make me be bothered?! How do we make it relevent.

Richard: We wrote Sword it code this week. It’s a jQuery plugin – one line of javascript in your header – to turn the page into a deposit button. Could go in repository, library website, your researchers page… If you made a GreaseMonkey script – we could but we haven’t – we could turn ANY page into a deposit! Same with Google results. Let us give you a quick example…

Mark: This example is running on a website. Couldn’t do on Informatics page as I forgot my login in true researcher style!

Richard: Pick a file. Scrapes metadata from file. Upload. And I can embed that on my webpage with same line of code and show off my publications!

9. Ben O Steen –

Cameron Neylon came up to me yesterday saying that lots of researchers submit papers to repositories like PubMed but also to publishers… you get DOIs. But who can see your paper? How can you tell which libraries have access to your papers? I have built We can use CrossRef and a suitable size sample of DOIs to find out the bigger picture – I faked some sample numbers but CrossRef is down just now. Submit a DOI, see if it works, fill in links and submit. There you go.

10. Dave Tarrant – The Thing of Dreams: A time machine for linked data

This seemed less brave than kinect deposit! We typically publish data as triples… why aren’t people publishing this stuff when they could be… well because they are slightly lazy. Technology can solve problems I’ve created It’s very Sword, very CRUD, very Amazon webs services… So in a browser… I can look at a standard Graphite RDF document. But that information is provided by this endpoint, gets annotated automatically. Adds date submitted and who submitted it. So, the cool stuff… well you can click view doc history… it’s just like Apple time machine that you can browse through time! And cooler yet you can restore it and browse through time. Techy but cool! But what else does this mean… we want to get to semantic web, final frontier.. how many countries have capital cities with an airport and a population over 2 million… on 6th June 2006. Can do it using Memento. Time travel for the web + time travel for data! The final frontier.

11. Les Carr – Boastr – marshalling evidence for reposting outcomes

I have found as a researcher I have to report on outcomes. There is technology missing. Last month a PhD student tweeted that he’d won a prize for a competition from the world bank – with link to World bank page and image of him winning prize, and competition page. We released press release, told EPSRC, they press released. Lots of dissemination, some of that should have been planned in advance. All published on the web. And it disappears super fast. It just dissapates… we need to capture that stuff for 2 years time when we report that stuff! It all gets lost! We want to capture imagination while it happens. We want to put stuff together. Path is a great app for stuff like Twitter has a great interface – who, what, where. Tie to sources of open data, maybe Microsoft Academic Live API. Capture and send to repositories! So that’s it: Boastr!

12. Juagr Adam Bakluha? – Fedora Object Locking

The idea is to allow multiple Fedora webapps working together to allow multiheaded fedora working we can do mass processing like: Fedora object store on a Hadoop File System, one fedora head, means bottlenecks, multiple heads mean multiple apps. Some shared stat between webapps. Add new rest methods – 3 lines in some jaxrs.xml. Add the decorator – 3 lines in Fedora.fcfg and you have Fedora Object locking

13. Graham Triggs – SHIELD

Before the proposal lets talk SWORD… its great, but just for deposit. With SWORD2 you can edit but you get edit iri and you need those, what if you lose them. What if you want to change content in the repository? So, SWORD could be more widely used if edit iris were discoverable. I want an ATOM feed. I want it to support authentication. Better replacement for OMI-PMH. But I want more. I want it to complete non archived items, non complete items, things you may have deposited before. Most importantly I want the edit iri! So I said I have a name…. I want a Simple Harvest Interface for Edit Link Discovery!

14. Jimmy Tang, DRI – Redundancy at the file and network level to protect data

I wanted to talk about redundancy at file and network level to protect data. One of the problems is that people with multi-terabyte archives like to protect it. Storage costs money. Replicating data is wasteful and expensive I think. LOCKSS/Replicating data can be wasteful. Replication means N times cost and money. My idea is to take an alternative approach… Possible solutions is using forward error correcting or erasure codes to a persistant layer – like setting up a RAID disc. You keep pieces of files and you can reconstruct it – move complexity from hardware to software world and save money with the efficiency. There are open source libraries to do this, most are mash ups. Should be possible!

15. Jose Martin – Machine and user-friendly policifying

I am proposing a way to embed data from SHERPA ROMEO webservices into records waiting to be reviewed in a repository. Last week I heard how SHERPA/ROMEO receives over 250K requests for data, he was looking for a script to make that efficient, a script to run on a daily or weekly basis. Besides this task is often fairly manual. Why not put machines to work instead… so we have an ePrints repository with 10 items to be reviewed. We download SHERPA/ROMEO information here. We have the colour code that give a hint about policy. Script would go over all items looking for ISSN matches and find colour code. and let us code those submissions – nice for repository manager and means the items are coded by policy ready to go. And updated policy info done in just one request for, say, 10 items. More efficient and happier! And retrieve journal title whilst at it.

16. Petr Knoth – Repository ANalytics

Idea to make repository managers lives very easy. They want to know what is being harvested and if everything is correct in their system. It’s good if someone can check from the outside. The idea is that analytics sit outside repository, lets them see metadata harvested, if it works OK and also provides stats on content – harvesting of full text PDF files. Very important. even though we have OMI-PMH there are huge discrepancies between the files. I am a repository manager I can see that everything is fine, that it has been carried out etc.  So we can see a problem with an end point. I propose we use this to automatically notify repository manager that something is wrong. Why do we count metadata not PDFs – latter are much more important. Want to produce other detailed full text stats, eg citation levels!

17. Steffan Godskesen – Current and complete CRIS with Metadata of excellent quality 

Researchers don’t want to do thinsg with metadata but librarians do care. In many cases metadata is already available from other sources and in your DI. So When we query the discovery iunterface cleverly we can extract metadata inject into CRIS, have librarians quality check it and obtain excellent CRIS. Can we do this? We have done this between our own DI (discovery system) and CRIS. And again when we changed CRIS, again when we changed DI. Why do again and again… to some extent we want help from DI and CRIS developers to help make these systems extract data more easily!

18. Julie Allison and Ben O’Steen – Visualising Repositories in the Real World

We want to use .Net Gadgeteer or Arduino to visualise repository activity, WHy? to demonstrate in the real world what happens in the repository world. Screens showing issues maybe. A physical guage for hits for hourse – great demo tool. A bell that ring when met deposits per day target. Or blowing bubbles for each deposit. Maybe 3D printing of deposited items? Maybe online Chronozoom, PivotViewer – explore content, JavaScript InfoVis – set of visualisation tools. Repository would be mine – York University. Using query interface to return creation date etc. Use APIs etc. So for example a JSPN animation of publications and networks and links between objects.

19. Ben O’Steen – Raid the repositories!

Lots of repositories with one managers, no developers. Raid them! VM that pulls them all in, pull in text mining, analysis, stats, enhancer etc. Data. Sell as a PR tool £20/month as a demo. Tools for reuse.

Applause meter in the room was split between Patrick MacSweeney  and Richard Jones & Mark MacGillivray’s presentation.

P5A: Deposit, Discovery and Re-use LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Repositories and Microsoft Academic Search
Speaker(s): Alex D. Wade, Lee Dirks

MSResearch seeks out innovators from the worldwide academic community. Everything they produce is freely available, non-profit.

They produce research accelerators in the for of Layerscape (visualization, storytelling, sharing), DataUp (used to be called DataCuration for Excel), and Academic Search.

Layerscape provides desktop tools for geospatial data visualization. It’s an Excel add-in that creates live-updating earth-model visuals. It provides the tooling to create a tour/fly-through of the data a researcher is discussing. Finally, it allows people to share their tours online – they can be browsed, watched, commented on like movies. If you want to interact with the data you can download the tour with data and play with it.

DataUp aids scientific discovery by ensuring funding agency data management compliance and repository compliance of Excel data. It lets people go from spreadsheet data to repositories easily. This can be done through an add-in or via cloud service. The glue that sticks theses applications together is repository agnostic, with minimum requirements for ease of connection. It’s all open source, driven by DataOne and CDL. It is in closed beta now with a wide release later this summer.

Now, Academic Search. It started by bringing together several research projects in MSResearch. It’s a search engine for academic papers from the web, feeds, repositories. Part of the utility of it is a profile of information around each publication, possibly from several sources, coalesced together. As other full-text documents cite in, those can be shown in context. Keywords can be shown and linked to DOI, can be subscribed to for change alerts. These data profiles are generated automatically, and that can build automatic author profiles as well. Conferences and journals they’ve published in, associations, citation history, institution search.

The compare button lets users compare institutions by different publication topics – by the numbers, by keywords, and so on. Visualizations are also available to be played with. The Academic Map shows publications on a map.

Academic Search will also hopefully be used a bit more than as a search engine. It is a rich source of information that ranks journals, conferences, academics, all sortable in a multitude of ways.

Authors also have domain-specific H-Index numbers associated with them.

Anyone can edit author pages, submit new content, clean things up. Anyone can also embed real-time pulls of data from the site onto their own site.

With the Public API, and an API key, you can fetch information with an even broader pull. Example: give me all authors associated with University of Edinburgh, and all data associated with them (citations, ID number, publications, other others, etc). With a publication ID, a user could see all of the references included, or all of the documents that cite it.

Q: What protocol is pushing information into the repositories?

A: SWORD was being looked at, but I’m uncertain about the merit protocol right now. SWORD is in the spec, so it will be that eventually.

Q: Does Academic Search harvest from repositories worldwide?

A: We want to, but first we’re looking at aggregations (OCLC Oyster). We want to provide a self-service registration mechanism, plus scraping via Bing. Right now, it’s a cursory attempt, but we’re getting better.

Q: How is the domain hierarchy generated?

A: The Domain hierarchy is generated manually with ISI categories. It’s an area of debate: we want an automated system, but the challenge is that more dynamic systems make rank lists and comparison over time more difficult. It’s a manual list of categories (200 total, at the journal level).

Q: Should we be using a certain type of metadata in repos? OAIPMH?

A: We use OAIPMH now, but we’re working on analysis of all that now. It’s a long term conversation about the best match.


Topic: Enhancing and testing repository deposit interfaces
Speaker(s): Steve Hitchcock, David Tarrant, Les Carr

Institutional repositories are facing big challenges. How are they presenting a range of services to users? How is presentation of repositories being improved, made easier? The DepositMO project hopes to improve just that. It asks how we can reposition the deposit process in a workflow. SWORD and V2 enable this.

So, IRs are under pressure. The Finch report suggests a transition with clear policy direction toward open access. This will make institutional open access repositories for publication obsolete, but not for research data. Repositories are taking a bigger view of that, though. Even if publications are open access, they can still be part of IR stores.

DepositMO has been in Edinburgh before. It induced spontaneous applause. It was also at OR before, in 2010.

This talk was a borderline accepted talk, perhaps because there is not a statement included: few studies of user action with repositories.

There are many ways that users interact with repositories, which ought to be analyzed. SWORD for Facebook, for Word.

SWORD gives a great scope of use between the user and repository, especially with V2. V2 is native in many repositories now, partially because of DepositMO.

With convenient tools built into already used software, like Word, work can be saved into repositories as it is developed. Users can set up Watch Folders for adding data, either as a new record or an update to an older version if changed locally. The latter example is quite a bit like Dropbox or Skydrive, but repositories aren’t harddrives. They aren’t designed as storage devices. They are curation and presentation services. Depositing means presenting very soon. DepositMO is a bit of a hack to prevent presentation while iteratively adding to repository content. Save for later, effectively.

Real user tests of DepositMO have been done – set up some laptops running created services and inviting users to test in pairs. This wasn’t about download, installation, and setup, but actual use in a workflow. Is it useful in the first place? Can it fit into the process? Task completion and success rates of repository user tasks were collected as users did these things.

On average, Word and watch folder deposit tools improved deposit time amongst other things. However, these entries aren’t necessarily as well documented as is typically necessary. The overall summary suggests that while there is a wow-factor in terms of repository interaction, the anxiety level of users increases as the amount of information they have to deposit increases. Users sometimes had to retrace steps, or else put things in the wrong places as they worked. They needed some trail or metadata to locate deposit items and fix deposit errors.

There are cases for not adding metadata during initial entry, though, so low metadata might not be the worst thing.

Now it’s time to do more research, exploring the uses with real repositories. That project is called DepositMOre. Watch Folder, EasyChair one-click submission, and to an extent the word add in will be analyzed statistically as people actually deposit into real repositories. It’s time to accomodate new workflows, to accomodate new needs, and face down challenges of publishers offering open access.

Q: Have you looked into motivations for user deposit into repositories?

A: No, it was primarily a study of test users through partners in the project. The how and what of usage and action, but not the why. There was a wonder whether more data about the users would be useful. If more data was obtainable, the most interesting thing would be understanding user experience with repositories. But mandate motivation, no, not looking into that.

Q: You’ve identified a problem users have with depositing many things and tracking deposits. Did you identify a solution?

A: It’s more about dissuading people from reverting to previous environments and tools. There are more explicit metadata tools, and we could do a better job of showing trails of submission, so that will need to filter back in. Unlike cloud drives, losers use control of an object once they are submitted to a repository. So, suddenly something else is doing something, and the user it’s disconcerting.


Topic: OERPub API for Publishing Remixable Open Educational Resources (OER)
Speaker(s): Katherine Fletcher, Marvin Reimer

This talk is about a SWORD implementation and client. Most of this work has happened in the last year, very quick.

Remixable open education repositories target less academic and more multi-institution, open repos. Remixability lets users learn anywhere. It’s a ton of power. All these open resources can seed a developer community for authoring and creation, machine learning algorithms, and it all encourages lots of remixable creation.

Remixability can be hard to support, though. Connexions, and other organizations, had grand ambition but not a very large API. And you need an importer/editor that is easy to use. Something that can mash data up.

In looking at APIs needed for open education, discoverability is important, but making publishing easier is important, too. We need to close the loop so that we stop losing the remixed work externally. That’s where SWORD comes in. V2.

Why SWORD V2 for OER? It has support for workflow. The things being targeted are live edited objects, versioned. Those versions need to be permanent so that changes are nondestructive. Adapting, translating, deriving are great, but associating them with common objects helps tie it all together.

OERPub extends SWORD V2. It clarifies and adds specificity to metadata. Specificity is required for showing the difference between versions and derivatives, specifically. And documentation is improved. Default values, repository controlled and auto-generated values are all documented. Precedents have been made clear, that’s it.

OERPub also merges semantics header for PUT. It simplifies what’s going on. Also added a section on Transforms under packaging. If a repository will transform content, it has a space to explain its actions. It provides error handling improvements, particularly elaboration on things like transform and deposit fails.

This is the first tool to submit to Connexions from outside of Connexions.

Lessons learned? Specification detail was great. Good to model on top of and save work. Bug fixes also lead the project away from multiple metadata specifications – otherwise bugs will come up. Learned that you always need a deposit receipt, which is normally optional. Finally, auto-discovery – this takeaway suggests a protocol for accessing and editing public item URLs.

A client was built to work with this – a transform tool to remixable format in very clean HTML, fed into Connexions, and pushed to clients on various devices. A college chemistry textbook was already created using this client. And a developer sprint got three new developers fixing three bugs in a day – two hours to get started. This is really enabling people to get involved.

Many potential future uses are cropping up. And all this fits into curation and preservation – archival of academic outputs as an example.

Q: Instead of PUT, should you be using PATCH?

A: Clients aren’t likely to not know repositories, but it is potentially dangerous to ignore headers. Other solutions will be looked at.

Q: One lesson learned was to avoid multiple ways of specifying metadata. What ways?

A: DublinCore fields with attributes and added containers. That caused errors. XML was mixed in, but we had to eventually specify exactly which we wanted.


Soup Cans and Spray Cans

Like a word repeated too many times in succession, Andy Warhol’s exploration of mass-produced icons shook sturdy foundations and put a new spin on an old world. He was a master of estranging the familiar, of estranging everything.

a pop-art rendering of Marilyn Monroe's face

One of Warhol’s most recognizable pieces (A photo released 30 April 1998 by Sotheby's New York shows Andy Warhol's "Orange Marilyn". AFP, Getty Images. 30-04-1998)

Warhol’s ‘Campbell’s Soup Cans’ premiered 50 years ago on Sunday, effectively introducing the newfound pop art scene to the west coast from the Ferus Gallery, Los Angeles. The series is iconic of what a Warhol does to a viewer: love it or hate it, his work rarely escapes strong reaction.

Warhol's Soup Cans in an art print

The pioneer of pop art is still inspiring coming generations of artists (Pop Art- Warhol's art 20 years on. AFP Footage, Getty Moving Images. 20-02-2007)

Warhol’s background in graphic and product design strongly shaped his work. Whether it was cans, bottles, or a media icon’s face, he playfully instigated dialogue on aesthetic, expression, and commoditisation through repetitions of what we might come across several times a day in the real world.

a man talks emotively in his office

The modern world is not lacking for Warhol’s influence, and his friend says he would have felt right at home in this era (Pop art's children: Fashion star talks about Andy Warhol. AFP Footage, Getty Images. 17-03-2009)

Street art has been pushing the bounds of artistic license even within the experimental realm of modern pop art. From the 1970s graffiti movements of New York City this practice has developed into a debate about artistic license – and created a commodity in high demand.

a print of Marilyn Monroe's face glued to concrete, framed

In Hackney there was debate over whether graffiti is art or vandalism (Hackney Council to remove street art by graffiti artist Banksy. ITN. 28-10-2007)

Debates over graffiti’s classification as art or vandalism have come up again and again, especially around the works of the infamous and mysterious Banksy. This has created an uncomfortable boundary between art and vandalism that decides a works fate based on the quality of the work.

man wiping graffiti from a wall

A Westminster council member asks the same question (Westminster Council to paint over work by Banksy. ITN. 24-10-2008)

Designated graffiti areas and licenses for artists have begun to crop up. The documentary ‘Exit Through the Gift Shop’ brought discussion about the contested medium into the spotlight. It seems that what was subversive is slowly becoming legal, at least in the right places.

Wall painted with 'Designated Graffiti Area' and some graffiti

‘This wall is a designated graffiti area’ (Street-art in Hoxton_11. GovEd Communications. 2008)

It looks like the next set of soup cans have as good a chance of being sprayed onto old bricks and concrete as they do on canvas. Warhol probably would have got a kick out of that.

artist with a spray can

Artist or vandal? (Graffiti. By Naki, PYMCA. 2000)


Happy 50th Anniversary of Independence, Rwanda and Burundi!

The Kingdoms of Rwanda and Burundi have existed for centuries. This month they celebrate the 50th anniversary of their renewed independence, after subjugation after the First World War.

The two kingdoms’ fate became intertwined when Belgium won them from Germany in 1916. Then called Ruanda-Urundi, the single state was run as a plunder economy with Belgian-selected indigenous rule. These rulers were selected based on their position on either side of a racial divide, a decision that has had reverberating impacts all the way in to the present.

Earl Hurie speaking

Early talk of Europe and African Federation (EARL HURIE BACK FROM AFRICA. ITV Early Evening News. 04-05-1959)

In 1962, after decades that the League of Nations and United Nations had hoped would be spent investing in the area, Rwanda and Burundi were granted independence. Belgium was pressured to leave not just politically, but by conflict in the Belgian Congo.

Lord Gladwyn, Acting Secretary General of the United Nations 1945-46

Dag Hammarskjold was killed in a plane crash en-route to ceasefire talks in Katanga, which broke away from the newly independent Democratic Republic of Congo in 1960 (DAG HAMMARSKJOLD DEATH. ITV Late Evening News. 18-09-1961)

Times remained turbulent in Rwanda and Burundi, and both countries have been marked by tragedy even recently.

Cameraman on a camera-guiding track

Behind the scenes of 100 Days, the first fictional account of the Rwandan genocide (Rwanda: Film. GNS Weekly News, AP Archive. 10-03-1999)

While the lives they knew are gone, some survivors have finally been able to go home.

Refugees in the back of a canvas-topped truck

With help, refugees head home (BURUNDI / RWANDA: REFUGESS RETURN TO RWANDA FROM BURUNDI. Reuters TV. 21-02-1996)

Fortunately, from tragedy comes adversity. There has been progress over time, and even a few smiles.

Burundi dancer clapping

The 50th anniversary of the two countries independence allows an opportunity to reflect on the past, present and future of these young nations. Rwanda and Burundi have been tempered by their own trials, but that only illustrates their tenacity. This is not merely a century of renewed independence now half full, it is a time of optimism and a celebration of what is to come.

Pierre Nkurunziza