From Project to Product

The OA-RJ aspect of the Linked Data Focus group has now finished… and finished with a real, honest-to-goodness product.

First, the OA-RJ project itself has finished; however, the two aspects of the service have continued as distinct & separate services: Organisation & Repository Identification (ORI) and Repository Junction Broker (RJB).

ORI is the discovery service, and is the service that the OA-RJ Linked Data work related to; the LDFocus work established Linked Data as a viable product for that discovery service.

In terms of Linked Data, ORI provides:

We are still discovering new ways to improve what we provide to consumers of the data, and would be delighted to hear any suggestions (contact us via the UK RepositoryNet+ Helpdesk contact form, or send email direct to support@repositorynet.ac.uk).

Possible future work based on OA-RJ Linked data

There is an agreed need for an international collection of identifiable organisations… be this collated from national or local levels, or globally assigned.

There are several such lists available: the UK has a list of educational establishments at data.gov.uk, and it is believed that the US has a similar one at data.gov. OA-RJ has an international list of institutions, derived from repositories and the organisations that run them.

There are issues that need to be addressed, however:

  • Anything that is UK-focused is, frankly, a waste of time: even on a global scale, having 196[1] countries each post their own list, with no consolidation of reference between them, gives rise to two main problems:
    1. Finding the lists becomes a problem: people need to know where each list is
    2. A significant number of organisations are not geographically restricted to a single country (or, indeed, geographically located at all!)
  • The history of organisations needs to be tracked: They are born, merge, split, rename, and even die.
  • They have complex parent/child relationships (there are research centres, funded by NERC, completely housed within larger organisations).
  • They have (multiple?) geo-spatial locations.
  • There needs to be some real-world examples of use for the data.

There have been a number of previous JISC (and other, overseas) projects in this area, which should be pulled together & combined into a greater whole.

[1] The 196 countries are the 195 independent [sovereign] states recognised by the US State Department, plus Taiwan. This ignores the very real situation where England produces its own list, with the expectation that Scotland, Ireland & Wales will produce similar lists (so that’s 199). Add in the expectation that the US will produce lists at (possibly) State level, and you’re up to 250 lists. Add in larger countries like Russia or China devolving lists down, and the problem of isolation and “un-discoverability” gets even worse! “Divide and Conquer” is definitely the way forward… but not by secession and independent action – that would [in my view] be ignored by the larger community.

Episode 2, SUNCAT: Notes from initial chat

Questions about use cases for the data; what the benefits are to end users or to libraries.
Something we have to work on formulating with data in hand. Morag mentions a 2005 document discussing SUNCAT use cases.

One suggested usage is deriving information about the focus and specialisms of libraries, by extending library subject metadata using journal/article subject metadata – so identifying the bent of universities through the holdings of their libraries.

Another immediate usage is linking bibliographic datasets of journal articles to journal issues and journal information found in SUNCAT. Medline is a useful example of a dataset that can be integrated – work on Linked, Open Medline metadata is happening through the OpenBiblio project.

SUNCAT holds a record for each institution and its library location; this could helpfully be linked to the OARJ Linked Data for institutions, and to the JISC CETIS PROD work collecting different sources of UK HE/FE information.

Sources in SUNCAT may have an OID, which could be re-used as part of a URI. Journals, both electronic and hardcopy, also (though not always) have ISSNs.
There are restrictions on re-use of data licensed from the ISSN Network, but one can get some of it from other sources – CONSER is a North-America-focused example, with a bit of a scientific bent (thus useful for Medline).
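
For illustration only (this URI pattern is invented here, not a SUNCAT decision), an ISSN could slot straight into an identifier such as:

  http://example.org/suncat/journal/issn/1234-5678

with the OID filling the same slot for journals that have no ISSN.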

SUNCAT uses OpenURL to search for journal articles and holdings data in institutional libraries. Libraries run an “OpenURL resolver” – often with a bit of proprietary software such as SFX – to map OpenURLs to stuff in their holdings. Would be interesting to find out more about the inside of an OpenURL resolver and how useful a Linked Data rendering of it would be…
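
For illustration only (the resolver host and the citation details below are invented), an OpenURL is essentially a base URL for the library’s resolver plus key/value pairs describing a citation – here, an article from a journal with a made-up ISSN:

  http://openurl.example.ac.uk/resolver?ctx_ver=Z39.88-2004
    &rft_val_fmt=info:ofi/fmt:kev:mtx:journal
    &rft.genre=article
    &rft.issn=1234-5678
    &rft.volume=12&rft.spage=34

The resolver’s job is then to turn that citation into a link to the most appropriate copy the library can offer.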

Surprised to learn that university libraries often don’t maintain their own subscription database; journals are bought in “bundles” whose contents are shifting, and libraries depend on vendors to sell them back their processed catalogue data.

SUNCAT contains a dataset describing libraries, their affiliations and locations, held in a set of text files. This would be a good place to start with a simple Linked Data model that we can link up to the outcome of the previous LDFocus sprint, and then work on connecting up the library holdings data.
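
As a very rough sketch of what that starting point might look like – the URI pattern and coordinates are invented, and FOAF plus the W3C geo vocabulary are merely plausible choices, not a decision – each library could be a resource along these lines:

  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .

  <http://example.org/suncat/library/example-university>
      a foaf:Organization ;
      foaf:name "Example University Library" ;
      geo:lat   "55.95" ;
      geo:long  "-3.19" .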

Starting a separate notepad for SUNCAT links. Should have done this earlier, but we’ve been busy with the new release of Unlock and the wrap-up of the Chalice project.

Preparing for Episode 2: SUNCAT

We’re preparing to start the second Linked Data Focus sprint next week (from May 16th) – working with the developers from the SUNCAT team, who are bibliographic data specialists.

Our notepad from the first sprint has a lot of links to relevant resources – introductions to RDF, tools in different languages, and descriptions of related work around academic institutions and communications.

This presentation by Jeni Tennison from the Pelagios workshop is also worth looking at for sensible advice about taking an existing data model into a Linked Data form. Ian took this sort of approach for the Open Access Repository Junction work – working through the different objects in a relational database model, thinking about how to decorate them with common RDF elements, then creating a vocabulary for the missing pieces. Some of the same questions about publishing and structuring Linked Data should come up; and in the middle of the sprint we’ll hold another Linked Data Learn-in at EDINA.

SUNCAT should have a fair bit more in common with existing Linked Data projects – particularly the JISC-supported OpenBiblio – and we’ll try to make links between SUNCAT-listed publications and some of their metadata. If we can then get as far as linking through to pre-prints in the institutional repositories found in OARJ, I’ll be entirely satisfied.

Resolvable IRIs

One of the tenets of Linked Data is that IRIs should be resolvable (it’s the 4th or 5th star, depending on which notation you are looking at).

There are two approaches to doing this:

  1. Create a server specifically to handle the linked data
    eg: http://opendata.opendepot.org/organisation/EDINA
  2. Create a resolver underneath an existing server
    eg: http://opendepot.org/opendata/organisation/EDINA

The main consideration is probably how many data sets you are resolving, and what association you want to promote. For example, the University of Southampton are exposing all their data at the University level – so having a central resolver (http://opendata.southampton.ac.uk) makes sense for them.

For OARJ, I can use the OpenDepot.org association… so it was easier for me to create a resolver within the opendepot.org server, and OARJ IRIs become something like http://opendepot.org/opendata/organisation/EDINA.

The resolver script is http://opendepot.org/opendata/ and the standard Apache environment variable ‘PATH_INFO’ contains the rest of the IRI.
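
As a minimal sketch of the dispatching end of such a script (the variable names are mine, not those of the production code):

  #!/usr/bin/perl
  # Apache sets PATH_INFO to everything after the script name, so a
  # request for /opendata/organisation/EDINA arrives here with
  # PATH_INFO = "/organisation/EDINA"
  my @parts = grep { length } split m{/}, ($ENV{PATH_INFO} || '');
  my ($class, $name) = @parts;   # e.g. ("organisation", "EDINA")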

The code for the resolver is remarkably simple:

  use XML::LibXML;

  ## define $host
  ## get the full RDF document from the server: $dom
  ## get an XML Document that contains the RDF root element (complete with namespaces): $rdf

  # Get the <rdf:RDF> root element from $rdf
  $child = $rdf->firstChild;

  # for all XPath stuff, we need to define the namespace
  $xpc = XML::LibXML::XPathContext->new;
  $xpc->registerNs('rdf',
                   'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

  # The output should always include the node describing the ontology itself
  $iri = "$host/reference/linked/1.0/oarj_ontology.rdf";

  # XPath queries go through the XML::LibXML::XPathContext object
  @nodes = $xpc->findnodes("/rdf:RDF/rdf:Description[\@rdf:about=\"$iri\"]", $dom);
  $child->appendChild($nodes[0]) if $nodes[0];

  # and now find the specific rdf:Description we want
  $class = &get_first_pathinfo_item;  # will be "organisation" or "network" or....
  $t =     &get_second_pathinfo_item; # will be the name of the record **IRI encoded!**

  $iri = "$host/opendata/$class/$t";

  @nodes=();
  @nodes = $xpc->findnodes("/rdf:RDF/rdf:Description[\@rdf:about=\"$iri\"]", $dom);
  $child->appendChild($nodes[0]) if $nodes[0];

  # output the newly-built document (not the original full dump)
  print $rdf->toString;

Obviously there are wrappers around that, but it’s a good basis.

Creating Turtle and RDF the easy way

One of the great things about Turtle format is that it is dead easy to write (see the blog post below on how easy it was.)

One of the great things about the RDF/XML format is that it is well known, rooted in XML, and very easily parsed… but it is not fun to create.
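
To see the contrast, here is the same single statement in each format (the subject URI is made up):

  # in Turtle:
  @prefix dc: <http://purl.org/dc/elements/1.1/> .
  <http://example.org/thing> dc:title "An example title" .

  <!-- and in RDF/XML: -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="http://example.org/thing">
      <dc:title>An example title</dc:title>
    </rdf:Description>
  </rdf:RDF>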

What is needed is an easy way to create RDF from Turtle… and there is one – Any23.org.

(I’m a Perl-man, so my example code is in Perl – YMMV)

  use File::Slurp;
  use LWP::UserAgent;
  use URI::Escape;

  ## create turtle text as before, in $t

  # Write the file into web-server space
  write_file("$datadir/$turtle_filename", $t);
  print "turtle written\n";

  # ping the whole thing off to any23.org to transform into RDF
  my $ua  = LWP::UserAgent->new();
  # URI-escape the target address so it survives being embedded in a query string
  my $target = "http://$host/$path/$filename";
  my $query  = "http://any23.org/?format=rdfxml&uri=" . uri_escape($target);
  my $res = $ua->get($query);
  my $content = "";
  if ($res->is_success) {
    $content = $res->content;
    write_file("$datadir/$rdf_filename", $content);
    print "rdf written\n";
  } ## end if ($res->is_success)
  else { print $res->status_line; }

Et voilà!