P6A: Non-traditional content LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Eating your own dog food: Building a repository with API-driven development
Speaker(s): Nick John Jackson, Joss Luke Winn

The team decided they wanted to build a wholly new RDM, with research data as a focus for the sake of building the best tool for that job. This repository was also designed to store data during research, not just after.

Old repositories work very well, but they assume the entry of a whole file (or a pointer), only retrievable in bulk and in oddly organized pieces. They have generally limited interface methods and capacities. These old repositories also focus on formats, not form (structure and content) unless there is fantastic metadata.

The team wanted to do something different, and built a great backend first. They were prepared to deal with raw data as raw data. The API was built first, not the UI. APIs are the important bit. And those APIs need to be built in a way that people will want to use them.

This is wear eating your own dog food comes in. The team used their own API to build the frontend of the system, and used their own documentation. Everything had to be done well because it was all used in house. Then, they pushed it out to some great users, and made them do what they wanted to do with the ‘minimum viable product’. It works, and you build from there.

Traditional repos have a database, application, users. They might tack an API on at the end for manual and bulk control, but it doesn’t even include all of the functionality of the website usually. That or you screen scrape, and that’s rough work. Instead, this repository builds an API and then interacts with that via the website.

Research tends to happen on a subset of any given data set, nobody wants that whole data set. So forget the containers that hold it all. Give researches shared, easily usable databases. APIs put stuff in and out automatically.

This was also made extensible from day one. Extensible and writeable by everybody to the very core. The team also encourages re-usable modularity. People do the same things to their data over and over – just share that bit of functionality at a low data level. And they rely on things to do things to get things done – in other words, there’s no sense in replicating other people’s work if it’s done well.

The team ended up building better stuff because it uses its own work – if it doesn’t do what it’s meant to, it annoys them and they have to fix it. All functionality is exposed so they can get their work done quick and easy. Consistent and clean error handling were baked in for the sake of their own sanity, but also for everybody else. Once it’s all good and easy for them, it will be easy for 3rd parties to use, whether or not they have a degree in repo magic. And security is forcibly implemented across the board. API-level authentication means that everything is safe and sound.

Improved visibility is another component. Database querying is very robust, and saves the users the trouble of hunting. Quantitative information is quick and easy because the API gives open access to all the data.

This can be scalable horizontally, to as many servers as needed. It doesn’t use server states.

There are some problems involved in eating your own dog food. It takes time to design a decent API first. You also end up doubling up some development, particularly for frontend post-API development. APIs also add overhead. But after some rejigging, it all works with thousands of points per second, and it’s humming nicely.

Q: Current challenges?

A: Resourcing the thing. Lots of cutting edge technology and dependence on cloud architecture. Even with money and demand, IT infrastructure aren’t keeping up just yet.

Q: How are you looking after external users? Is there a more discoverable way to use this thing?

A: The closest thing we have is continuous integration to build the API at multiple levels. A discovery description could be implemented.

Q: Can you talk about scalability? Limitations?

A: Researchers will sometimes not know how to store what they’ve got. They might put pieces of data on their own individual rows when they don’t need to be. That brings us closer to our limit. Scaling up is possible, and doing it beyond limits is possible, but it requires a server-understood format.

Q: Were there issues with developers changing schemas mysteriously? Is that a danger with MongoDB?

A: By using our own documentation, forcing ourselves to look at it when building and questioning. We’ve got a standard object with tracking fields, and  if a researcher starts to get adventurous with schemas it’s then on them.


Topic: Where does it go from here? The place of software in digital repositories
Speaker(s): Neil Chue Hong

Going to talk about the way that developers of software are getting overlapping concerns with the repository community. This isn’t software for implementing infrastructure, but software that will be stored in that infrastructure.

Software is pervasive in research now. It is in all elements of research.

The software sustainability institute does a number of things at strategic and tactical levels to help create best practices in research software development.

One question is the role of software in the longer term – five and ten years on? The differences between preservation and sustainability. The former holds onto things for use later on, while the latter keeps understanding in a particular domain. The understanding, the sustainability, is the more important part here.

Several purposes for sustaining and preserving software. For achieving legal compliances (architecture models ought to be kept for the life of a building). For creating heritage value (gaining an overall understanding of influences of a creator). For continued access to data (looking back, through the lens of the software). For software reuse (funders like this one).

There are several approaches. Preserving the technology, whether it’s physical hardware or an emulated environment. Migration from one piece of software to another over time while ensuring functionality, or transitioning to something that does similar. There’s also hibernation, just making sure it can be picked apart some day if need be.

Computational science itself needs to be studied to do a good job of this. Software carpentry teaches scientists basic programming to improve their science. One thing, using repositories, is an important skill. Teaching scientists the exploratory process of hacking together code is the fun part, so they should get to do it.

Re-something is the new black. Reuse, review, replay, rerun, repair. But also reward. How can people be rewarded for good software contributions, the ones that other people end up using. People get pats on the back, glowing blog posts, but really reward in software is in its infancy. That’s where repositories come in.

Rewarding good development often requires publication which requires mention of the developments. That ends up requiring a scientific breakthrough, not a developmental one. Software development is a big part of science and it should be viewed/treated as such.

Software is just data, sure, but along with the Beyond Impact team these guys have been looking at software in terms of preservation beyond just data. What needs to get kept in software and development? Workflows should, because they show the boundaries of using software in a study – the dependencies and outputs of the code. Looking at code on various levels is also important. On the library/software/suite level? The program or algorithm or function level. That decision is huge. The granularity of software needs to be considered.

Versioning is another question. It indicates change, allows sharing of software, and confers some sort of status. Which versions should go in which repositories, though? That decision is based on backup (github), sharing (DRYAD), archiving (DSpace). Different repositories do each.

One of the things being looked at in sustaining software are software metapapers. These are scholarly records including ‘standard’ publication, method, dataset and models, and software. This enables replay, reproduction, and reuse. It’s a pragmatic approach that bundles everything together, and peer review can scrutinize the metadata, not the software.

The Journal of Open Research Software allows for the submission of software metapapers. This leads to where the overlap in development and repositories occurred, and where it’s going.

The potential for confusion occurs when users are brought in and licensing occurs. It’s not CC BY, it’s OSI standard software licenses.

Researchers are developing more software than ever, and trying to do it better. They want to be rewarded for creating a complete scholarly record, which includes software. Infrastructure needs to enable that. And we still don’t know the best way to shift from one repository role to another when it comes to software – software repositories from backup to sharing to archival. The pieces between them need to be explored more.

Q: The inconsistency of licensing between software and data might create problems. Can you talk about that?

A: There is work being done on this, on licensing different parts of scholarly record. Looking at reward mechanisms and computability of licenses in data and software need to be explored – which ones are the same in spirit?


Topic: The UCLA Broadcast News Archive Makes News: A Transformative Approach to Using the News in Teaching, Research, and Publication
Speaker(s): Todd Grappone, Sharon Farb

UCLA has been developing an archive since the Watergate hearings. It was a series of broadcast television recordings for a while, but not it’s digital libraries of broadcast recordings. That content is being put into a searchable, browsable interface. It will be publicly available next year. It grows about a terabyte a month (150000+ programs and counting), which pushes the scope of infrastructure and legality.

It’s possible to do program-level metadata search. Facial recognition, OCR of text on screen, closed caption text, all searchable. And almost 10 billion images. This is a new way for the library to collect the news since papers are dying.

Why is this important? It’s about the mission of the university copyright department: public good, free expression, and the exchange of ideas. That’s critical to teaching and learning. The archive is a great way to fulfill that mission. This is quite different from the ideas of other Los Angeles organizations, the MPAA and RIAA.

The mission of higher education in general is about four principles. The advancement of knowledge through research, through teaching, and of preservation and diffusion of that knowledge.

About 100 news stations being captured so far. Primarily American. International collaborators are helping, too. Pulling all broadcast, under a schedule scheme with data. It’s encoded and analyzed, then pushed to low-latency storage in H.264 (250MB/hr). Metadata is captures automatically (timestamp, show, broadcast ID, duration, and full search by closed captioning). The user interface allows search and browse.

So, what is news? Definitions are really broad. Novelties, information, and a whole lot of other stuff. The scope of the project is equally broad. That means Comedy Central is in there – it’s part of the news record. Other people doing this work are getting no context, little metadata, less broadcasts. And it’s a big legal snafu that is slowly untangling.

Fortunately, this is more than just capturing the news. There’s lots of metadata – transformative levels of information. Higher education and libraries need these archives for the sake of knowledge and preservation.

Q: Contextual metadata is so hard to find, and knowing how to search is hard. How about explore? How about triangulating with textual news via that metadata you do have?

A: We’re pulling in everything we can. Some of the publishing from these archives use almost literally everything (court cases, Twitter, police data, CCTV, etc). We’re excited to bring it all together, and this linkage and exploration is the next thing.

Q: In terms of tech. development, how has this archive reflected trends in the moving image domain? Are you sharing and collaborating with the community?

A: An on-staff archivist is doing just that, but so far this is just for UCLA. It’s all standards-driven so far, and community discussion is the next step.


Topic: Variations on Video: Collaborating toward a robust, open system to provide access to library media collections
Speaker(s): Mark Notess, Jon W. Dunn, Claire Stewart

This project has roots in a project called Variations in 1996. It’s now in use at 20 different institutions, three versions. Variations on Video is a fresh start, coming from a background in media development. Everything is open source, working with existing technologies, and hopefully engaging with a very broad base of users and developers.

The needs that Variations on Video are trying to meet are archival preservation, access for all sorts of uses. Existing repositories aren’t designed for time-based media. Storage, streaming, transcoding, access and media control, and structure all need to be handled in new ways. Access control needs to be pretty sophisticated for copyright and sensitivity issues.

Existing solutions have been an insufficient fit. Variations on Video offers basic functionality that goes beyond them or does them better. File upload, transcoding, and descriptive metadata will let the repository stay clean. Navigation and structural metadata will allow users to find and actually use it all.

VoV is built on a Hydra framework, Opencast Matterhorn, and a streaming server that can serve up content to all sorts of devices.

PBCore was chosen for descriptive metadata, with an ‘Atomic’ content model: parent objects for intellectual descriptions, child objects for master files, children of these for derivatives. There’s ongoing investigation for annotation schemes.

Release 0 was this month (upload, simple metadata, conversion), and release one will come about in December 2012. Development will be funded through 2014.

Uses Backlight for discover, Strobe media player for now. Other media players with more capabilities are being considered.

Variations on Video is becoming AVALON (Audio Video Archives and Libraries Online).

Using the agile Scrum approach with a single team at the university for development. Other partners will install, test, provide feedback. All documentation, code, workflow is open, and there are regular public demos. Hopefully, as the software develops, additional community will get involved.

Q: Delivering to mobile devices?

A: Yes, the formats video will transcode into will be selectable, but most institutions will likely choose a mobile-appropriate format. The player will be able to deliver to any particular device (focusing on iOS and Android).

Q: Can your system cope with huge videos?

A: That’s the plan, but ingesting will take work. We anticipate working with very large stuff.

Q: How are you referencing files internally? Filenames? Checksums? Collisions of named entries?

A: Haven’t talked about identifiers yet. UUIDs generated would be best, since filenames are a fairly fragile method. Fedora is handling identifiers so far.

Q: Can URLs point to specific times or segments?

A: That is an aim, and the audio project already does that.

Developer’s Challenge: Show and Tell LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 1 (LT1), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Hi there, I’m Mahendra Mahey, I run the DevCSI project, my organisation is funded by JISC. This is the fifth Developer Challenge. This is the biggest to date! We had 28 ideas. We have 19 presentations, each gets 3 minutes to present! You all need a voting slip! At the end of all of the presentations we will bring up a table with all the entries. To vote write the number of your favourite pitch. If it’s a 6 or a 9 please underline to help us! We will take in the votes and collate them. The judges won’t see that. They will convene and pick their favourites and then we will see if they agree… there will then be a final judging process.

The overall prize and runner up shares £1000 in Amazon vouchers. The overall winner will be funded to develop the idea (depending on what’s logitically possible). And Microsoft research have a .Net gadgeteer prize for the best development featuring Microsoft technology. So we start with…

1 – Matt Taylor, University of Southampton – Splinter: Renegade Repositories on Demand

The idea is that you have a temporary offshoot of your repository, can be disposed or reabsorbed, ideal for conferences or workshops, reduces overhead, network of personal microrepositories – the idea is that you don’t have to make accounts for anyone temporarily using your repositoriy. It’s a network of personal microrepository, A lightweight standalone anotation system. Its independent of the main repository. Great for inexperienced users, particularly important if you are a high prestige university. And the idea is that it’s a pseudopersonal workspace – can be shared on the web but separate of your main repository. And it’s a simplified workflow – so if you make a splinter repository for an event you can use contextual information – conference date, location, etc. to populate metadata. Microrepository already in development and tech exists: RedFeather.ecs.soton.ac.uk. Demo at Bazaar workshop tomorrow. Reabsorption trivial using SWORD.

2 – Keith Gilmerton and Linda Newman – MATS: Mobile Audio Transcription and Submission

The idea is that you submit audio to repositories from phones. You set up once. You record audio. You select media for transcription, you add simple metadata You can review audio. Can pick from Microsoft Research’s MAVIS or Amazon’s Mechanical Turk. When submission back you get transcription and media to look at, can pick which of those two – either or both – you upload. And even if transcript not back its OK – new SWORD protocol does updates. And this is all possible using Android devices and code reused from one of last years challenges! Use cases – digital archive of literacy studies seek audio files, elliston poetry curator make analogue recordings , tablets in the field – Pompeii Archeaological Research Project would greatly increase submissions of data from the field.

3 – Joonas Kesaniemi and Kevin Van de Velde – Dusting off the mothballs introducing duster

The idea is to dust off time series here.  The only thing constant is change (Heraclitus 500BC). I want to get all the articles from AAlto university. It’s quite a new university but there used to be three universities that merged together. It would help to describe that the institution changed over time. Useful to have a temporal change model. Duster (aka Query expansion service) takes a data source that is a complex data model and then makes that available. Makes a simple Solr document for use via API. An example Kevin made – searching for one uni searches for all…

4 – Thomas Rosek, Jakub Jurkiewicz [sorry names too fast and not on screen] – Additional text for repository entries

In our repository we have keywords on the deposits – we can use intertext to explain keywords. Polish keywords you may not know them – but we can see that in English. And we can transliterate cyrillic. The idea is to build a system from blogs – connected like lego bricks. Build a blog for transliteration, for translating, for wikipedia, blog for geonames and mapping. And these would be connected to repository and all work together. And it would show how powerful

5 – Asger Askov Blekinge – SVN based repositories 

Many repositories have their own versioning systems but there are already well established versioning systems for software development that are better (SVN, GIT) so I propose we use SVN as the back end for Fedora.

Mass processing on the repository dowsn’t work well. Checkout the repo to a hadoop cluster, run the hadoop job, and commit the changed objects back. If we used standardised back end to access repository we could use Gource – software version control visualisation. I have developed a proof of concept that will be on Github in next few days to prove that you can do this, you can have a Fedora like interace on top of SVN repository.

6. Patrick McSweeney, University of Southampton – DataEngine

This is a problem we encountered, me and my friend Dabe Mills. For his PhD he had 1 GB of data, too much for the uni. Had to do his own workaround to visualise the data. Most of our science is in tier 3 where some data, but we need support! So the idea is that you put data into repository, allows you to show provenance, can manipulate data in the repository, merge into smaller CSV files, create a visualisation of your choice. You store intermediary files, data and the visualisations. You could do loads of visualisations. Important as first step on road to proper data science. Turns repository into tool that engages researchers from day one. And full data trail is there and is reproducable. And more interesting than that. You can take similar data, use same workflow and compare visualisation. And you can actually compare them. And I did loads in 2 days, imagine what I could do in another 2!

7. Petr Knoth from the Open University –  Cross-repository mobile application 

I would like to propose an application for searching across all repositories. You wouldn’t care about which repository it’s in, you would just get search it, get it, using these apps. And these would be provided for Apple and Google devices. Available now! How do you do this? You use APIs to aggregate – we can use applications like CORE, can use perhaps Microsoft Academic Search API. The idea of this mobile app is that it’s innovation – it’s a novel app. The vision is your papers are everywhere through syncing and sharing. It’s relevance to user problems: WYFIWYD: What you find is what you download. It’s cool. It’s usable. Its plausible for adoption/tech implementation.

8. Richard Jones and Mark MacGillivray, Cottage Labs – Sword it!

Mark: I am also a PhD student here at Edinburgh. From that perspective I know nothing of repositories… I don’t know… I don’t care… maybe I should… so how do we fix it. How do we make me be bothered?! How do we make it relevent.

Richard: We wrote Sword it code this week. It’s a jQuery plugin – one line of javascript in your header – to turn the page into a deposit button. Could go in repository, library website, your researchers page… If you made a GreaseMonkey script – we could but we haven’t – we could turn ANY page into a deposit! Same with Google results. Let us give you a quick example…

Mark: This example is running on a website. Couldn’t do on Informatics page as I forgot my login in true researcher style!

Richard: Pick a file. Scrapes metadata from file. Upload. And I can embed that on my webpage with same line of code and show off my publications!

9. Ben O Steen – isthisresearchreadable.org

Cameron Neylon came up to me yesterday saying that lots of researchers submit papers to repositories like PubMed but also to publishers… you get DOIs. But who can see your paper? How can you tell which libraries have access to your papers? I have built isthisresearchreadable.org. We can use CrossRef and a suitable size sample of DOIs to find out the bigger picture – I faked some sample numbers but CrossRef is down just now. Submit a DOI, see if it works, fill in links and submit. There you go.

10. Dave Tarrant – The Thing of Dreams: A time machine for linked data

This seemed less brave than kinect deposit! We typically publish data as triples… why aren’t people publishing this stuff when they could be… well because they are slightly lazy. Technology can solve problems I’ve created LDS3.org. It’s very Sword, very CRUD, very Amazon webs services… So in a browser… I can look at a standard Graphite RDF document. But that information is provided by this endpoint, gets annotated automatically. Adds date submitted and who submitted it. So, the cool stuff… well you can click view doc history… it’s just like Apple time machine that you can browse through time! And cooler yet you can restore it and browse through time. Techy but cool! But what else does this mean… we want to get to semantic web, final frontier.. how many countries have capital cities with an airport and a population over 2 million… on 6th June 2006. Can do it using Memento. Time travel for the web + time travel for data! The final frontier.

11. Les Carr – Boastr – marshalling evidence for reposting outcomes

I have found as a researcher I have to report on outcomes. There is technology missing. Last month a PhD student tweeted that he’d won a prize for a competition from the world bank – with link to World bank page and image of him winning prize, and competition page. We released press release, told EPSRC, they press released. Lots of dissemination, some of that should have been planned in advance. All published on the web. And it disappears super fast. It just dissapates… we need to capture that stuff for 2 years time when we report that stuff! It all gets lost! We want to capture imagination while it happens. We want to put stuff together. Path is a great app for stuff like Twitter has a great interface – who, what, where. Tie to sources of open data, maybe Microsoft Academic Live API. Capture and send to repositories! So that’s it: Boastr!

12. Juagr Adam Bakluha? – Fedora Object Locking

The idea is to allow multiple Fedora webapps working together to allow multiheaded fedora working we can do mass processing like: Fedora object store on a Hadoop File System, one fedora head, means bottlenecks, multiple heads mean multiple apps. Some shared stat between webapps. Add new rest methods – 3 lines in some jaxrs.xml. Add the decorator – 3 lines in Fedora.fcfg and you have Fedora Object locking

13. Graham Triggs – SHIELD

Before the proposal lets talk SWORD… its great, but just for deposit. With SWORD2 you can edit but you get edit iri and you need those, what if you lose them. What if you want to change content in the repository? So, SWORD could be more widely used if edit iris were discoverable. I want an ATOM feed. I want it to support authentication. Better replacement for OMI-PMH. But I want more. I want it to complete non archived items, non complete items, things you may have deposited before. Most importantly I want the edit iri! So I said I have a name…. I want a Simple Harvest Interface for Edit Link Discovery!

14. Jimmy Tang, DRI – Redundancy at the file and network level to protect data

I wanted to talk about redundancy at file and network level to protect data. One of the problems is that people with multi-terabyte archives like to protect it. Storage costs money. Replicating data is wasteful and expensive I think. LOCKSS/Replicating data can be wasteful. Replication means N times cost and money. My idea is to take an alternative approach… Possible solutions is using forward error correcting or erasure codes to a persistant layer – like setting up a RAID disc. You keep pieces of files and you can reconstruct it – move complexity from hardware to software world and save money with the efficiency. There are open source libraries to do this, most are mash ups. Should be possible!

15. Jose Martin – Machine and user-friendly policifying

I am proposing a way to embed data from SHERPA ROMEO webservices into records waiting to be reviewed in a repository. Last week I heard how SHERPA/ROMEO receives over 250K requests for data, he was looking for a script to make that efficient, a script to run on a daily or weekly basis. Besides this task is often fairly manual. Why not put machines to work instead… so we have an ePrints repository with 10 items to be reviewed. We download SHERPA/ROMEO information here. We have the colour code that give a hint about policy. Script would go over all items looking for ISSN matches and find colour code. and let us code those submissions – nice for repository manager and means the items are coded by policy ready to go. And updated policy info done in just one request for, say, 10 items. More efficient and happier! And retrieve journal title whilst at it.

16. Petr Knoth – Repository ANalytics

Idea to make repository managers lives very easy. They want to know what is being harvested and if everything is correct in their system. It’s good if someone can check from the outside. The idea is that analytics sit outside repository, lets them see metadata harvested, if it works OK and also provides stats on content – harvesting of full text PDF files. Very important. even though we have OMI-PMH there are huge discrepancies between the files. I am a repository manager I can see that everything is fine, that it has been carried out etc.  So we can see a problem with an end point. I propose we use this to automatically notify repository manager that something is wrong. Why do we count metadata not PDFs – latter are much more important. Want to produce other detailed full text stats, eg citation levels!

17. Steffan Godskesen – Current and complete CRIS with Metadata of excellent quality 

Researchers don’t want to do thinsg with metadata but librarians do care. In many cases metadata is already available from other sources and in your DI. So When we query the discovery iunterface cleverly we can extract metadata inject into CRIS, have librarians quality check it and obtain excellent CRIS. Can we do this? We have done this between our own DI (discovery system) and CRIS. And again when we changed CRIS, again when we changed DI. Why do again and again… to some extent we want help from DI and CRIS developers to help make these systems extract data more easily!

18. Julie Allison and Ben O’Steen – Visualising Repositories in the Real World

We want to use .Net Gadgeteer or Arduino to visualise repository activity, WHy? to demonstrate in the real world what happens in the repository world. Screens showing issues maybe. A physical guage for hits for hourse – great demo tool. A bell that ring when met deposits per day target. Or blowing bubbles for each deposit. Maybe 3D printing of deposited items? Maybe online Chronozoom, PivotViewer – explore content, JavaScript InfoVis – set of visualisation tools. Repository would be mine – York University. Using query interface to return creation date etc. Use APIs etc. So for example a JSPN animation of publications and networks and links between objects.

19. Ben O’Steen – Raid the repositories!

Lots of repositories with one managers, no developers. Raid them! VM that pulls them all in, pull in text mining, analysis, stats, enhancer etc. Data. Sell as a PR tool £20/month as a demo. Tools for reuse.

Applause meter in the room was split between Patrick MacSweeney  and Richard Jones & Mark MacGillivray’s presentation.

P5A: Deposit, Discovery and Re-use LiveBlog

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Repositories and Microsoft Academic Search
Speaker(s): Alex D. Wade, Lee Dirks

MSResearch seeks out innovators from the worldwide academic community. Everything they produce is freely available, non-profit.

They produce research accelerators in the for of Layerscape (visualization, storytelling, sharing), DataUp (used to be called DataCuration for Excel), and Academic Search.

Layerscape provides desktop tools for geospatial data visualization. It’s an Excel add-in that creates live-updating earth-model visuals. It provides the tooling to create a tour/fly-through of the data a researcher is discussing. Finally, it allows people to share their tours online – they can be browsed, watched, commented on like movies. If you want to interact with the data you can download the tour with data and play with it.

DataUp aids scientific discovery by ensuring funding agency data management compliance and repository compliance of Excel data. It lets people go from spreadsheet data to repositories easily. This can be done through an add-in or via cloud service. The glue that sticks theses applications together is repository agnostic, with minimum requirements for ease of connection. It’s all open source, driven by DataOne and CDL. It is in closed beta now with a wide release later this summer.

Now, Academic Search. It started by bringing together several research projects in MSResearch. It’s a search engine for academic papers from the web, feeds, repositories. Part of the utility of it is a profile of information around each publication, possibly from several sources, coalesced together. As other full-text documents cite in, those can be shown in context. Keywords can be shown and linked to DOI, can be subscribed to for change alerts. These data profiles are generated automatically, and that can build automatic author profiles as well. Conferences and journals they’ve published in, associations, citation history, institution search.

The compare button lets users compare institutions by different publication topics – by the numbers, by keywords, and so on. Visualizations are also available to be played with. The Academic Map shows publications on a map.

Academic Search will also hopefully be used a bit more than as a search engine. It is a rich source of information that ranks journals, conferences, academics, all sortable in a multitude of ways.

Authors also have domain-specific H-Index numbers associated with them.

Anyone can edit author pages, submit new content, clean things up. Anyone can also embed real-time pulls of data from the site onto their own site.

With the Public API, and an API key, you can fetch information with an even broader pull. Example: give me all authors associated with University of Edinburgh, and all data associated with them (citations, ID number, publications, other others, etc). With a publication ID, a user could see all of the references included, or all of the documents that cite it.

Q: What protocol is pushing information into the repositories?

A: SWORD was being looked at, but I’m uncertain about the merit protocol right now. SWORD is in the spec, so it will be that eventually.

Q: Does Academic Search harvest from repositories worldwide?

A: We want to, but first we’re looking at aggregations (OCLC Oyster). We want to provide a self-service registration mechanism, plus scraping via Bing. Right now, it’s a cursory attempt, but we’re getting better.

Q: How is the domain hierarchy generated?

A: The Domain hierarchy is generated manually with ISI categories. It’s an area of debate: we want an automated system, but the challenge is that more dynamic systems make rank lists and comparison over time more difficult. It’s a manual list of categories (200 total, at the journal level).

Q: Should we be using a certain type of metadata in repos? OAIPMH?

A: We use OAIPMH now, but we’re working on analysis of all that now. It’s a long term conversation about the best match.


Topic: Enhancing and testing repository deposit interfaces
Speaker(s): Steve Hitchcock, David Tarrant, Les Carr

Institutional repositories are facing big challenges. How are they presenting a range of services to users? How is presentation of repositories being improved, made easier? The DepositMO project hopes to improve just that. It asks how we can reposition the deposit process in a workflow. SWORD and V2 enable this.

So, IRs are under pressure. The Finch report suggests a transition with clear policy direction toward open access. This will make institutional open access repositories for publication obsolete, but not for research data. Repositories are taking a bigger view of that, though. Even if publications are open access, they can still be part of IR stores.

DepositMO has been in Edinburgh before. It induced spontaneous applause. It was also at OR before, in 2010.

This talk was a borderline accepted talk, perhaps because there is not a statement included: few studies of user action with repositories.

There are many ways that users interact with repositories, which ought to be analyzed. SWORD for Facebook, for Word.

SWORD gives a great scope of use between the user and repository, especially with V2. V2 is native in many repositories now, partially because of DepositMO.

With convenient tools built into already used software, like Word, work can be saved into repositories as it is developed. Users can set up Watch Folders for adding data, either as a new record or an update to an older version if changed locally. The latter example is quite a bit like Dropbox or Skydrive, but repositories aren’t harddrives. They aren’t designed as storage devices. They are curation and presentation services. Depositing means presenting very soon. DepositMO is a bit of a hack to prevent presentation while iteratively adding to repository content. Save for later, effectively.

Real user tests of DepositMO have been done – set up some laptops running created services and inviting users to test in pairs. This wasn’t about download, installation, and setup, but actual use in a workflow. Is it useful in the first place? Can it fit into the process? Task completion and success rates of repository user tasks were collected as users did these things.

On average, Word and watch folder deposit tools improved deposit time amongst other things. However, these entries aren’t necessarily as well documented as is typically necessary. The overall summary suggests that while there is a wow-factor in terms of repository interaction, the anxiety level of users increases as the amount of information they have to deposit increases. Users sometimes had to retrace steps, or else put things in the wrong places as they worked. They needed some trail or metadata to locate deposit items and fix deposit errors.

There are cases for not adding metadata during initial entry, though, so low metadata might not be the worst thing.

Now it’s time to do more research, exploring the uses with real repositories. That project is called DepositMOre. Watch Folder, EasyChair one-click submission, and to an extent the word add in will be analyzed statistically as people actually deposit into real repositories. It’s time to accomodate new workflows, to accomodate new needs, and face down challenges of publishers offering open access.

Q: Have you looked into motivations for user deposit into repositories?

A: No, it was primarily a study of test users through partners in the project. The how and what of usage and action, but not the why. There was a wonder whether more data about the users would be useful. If more data was obtainable, the most interesting thing would be understanding user experience with repositories. But mandate motivation, no, not looking into that.

Q: You’ve identified a problem users have with depositing many things and tracking deposits. Did you identify a solution?

A: It’s more about dissuading people from reverting to previous environments and tools. There are more explicit metadata tools, and we could do a better job of showing trails of submission, so that will need to filter back in. Unlike cloud drives, losers use control of an object once they are submitted to a repository. So, suddenly something else is doing something, and the user it’s disconcerting.


Topic: OERPub API for Publishing Remixable Open Educational Resources (OER)
Speaker(s): Katherine Fletcher, Marvin Reimer

This talk is about a SWORD implementation and client. Most of this work has happened in the last year, very quick.

Remixable open education repositories target less academic and more multi-institution, open repos. Remixability lets users learn anywhere. It’s a ton of power. All these open resources can seed a developer community for authoring and creation, machine learning algorithms, and it all encourages lots of remixable creation.

Remixability can be hard to support, though. Connexions, and other organizations, had grand ambition but not a very large API. And you need an importer/editor that is easy to use. Something that can mash data up.

In looking at APIs needed for open education, discoverability is important, but making publishing easier is important, too. We need to close the loop so that we stop losing the remixed work externally. That’s where SWORD comes in. V2.

Why SWORD V2 for OER? It has support for workflow. The things being targeted are live edited objects, versioned. Those versions need to be permanent so that changes are nondestructive. Adapting, translating, deriving are great, but associating them with common objects helps tie it all together.

OERPub extends SWORD V2. It clarifies and adds specificity to metadata. Specificity is required for showing the difference between versions and derivatives, specifically. And documentation is improved. Default values, repository controlled and auto-generated values are all documented. Precedents have been made clear, that’s it.

OERPub also merges semantics header for PUT. It simplifies what’s going on. Also added a section on Transforms under packaging. If a repository will transform content, it has a space to explain its actions. It provides error handling improvements, particularly elaboration on things like transform and deposit fails.

This is the first tool to submit to Connexions from outside of Connexions.

Lessons learned? Specification detail was great. Good to model on top of and save work. Bug fixes also lead the project away from multiple metadata specifications – otherwise bugs will come up. Learned that you always need a deposit receipt, which is normally optional. Finally, auto-discovery – this takeaway suggests a protocol for accessing and editing public item URLs.

A client was built to work with this – a transform tool to remixable format in very clean HTML, fed into Connexions, and pushed to clients on various devices. A college chemistry textbook was already created using this client. And a developer sprint got three new developers fixing three bugs in a day – two hours to get started. This is really enabling people to get involved.

Many potential future uses are cropping up. And all this fits into curation and preservation – archival of academic outputs as an example.

Q: Instead of PUT, should you be using PATCH?

A: Clients aren’t likely to not know repositories, but it is potentially dangerous to ignore headers. Other solutions will be looked at.

Q: One lesson learned was to avoid multiple ways of specifying metadata. What ways?

A: DublinCore fields with attributes and added containers. That caused errors. XML was mixed in, but we had to eventually specify exactly which we wanted.