Episode 2: SUNCAT library information

Questions about use cases were not really answered during the sprint, but we decided to gather the information SUNCAT holds about contributing institutions together in a linked data format. Some more work on use cases for SUNCAT linked and/or open data is being done in the SUNCAT UK Discovery project.

SUNCAT uses the MARC organisational code for libraries when it is available. I was introduced to the work of Adrian Pohl and Felix Ostrowski from hbz in Germany, who have created an international directory of libraries and related organisations which covers the US codes from the Library of Congress and the German organisation codes. The information for the UK libraries is in a PDF http://www.bl.uk/bibliographic/pdfs/marc_codes.pdf at the moment, but it might be possible to collect the data from this format. Felix and Adrian presented their idea of adding RDFa to webpages containing information about libraries at ELAG2011, in “Your Website is your API – How to integrate your Library into the Web of data using RDFa”. A representative from OCLC who attended the presentation immediately started implementing this in the WorldCat registry.

The Talis Platform hosting and consultancy blog posts “Linking and Cleaning Data” were a very useful illustration of the use of org:Organization, org:hasSite, and v:VCard for specifying the links between an organisation and its sites and the site addresses.

An organisation ontology was used to describe SUNCAT contributing libraries. There was discussion about whether a “library” should be modelled to represent a single library in one building or be an umbrella term for all an institution’s libraries.
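
The organisation, site and address pattern described above might look something like the following. This is only a minimal sketch, not the SUNCAT model itself: the base URI, the library code and the address are made up, the org: namespace URI is an assumption (it is not in the vocabulary list below), and the helper names are hypothetical. It treats the "library" as the organisation and hangs each building off it as a site, which is one of the two modelling options discussed.

```python
# Namespaces: org: is assumed to be the Organization Ontology namespace.
# The base URI and every example value below are hypothetical.
PREFIXES = (
    "@prefix org: <http://www.w3.org/ns/org#> .\n"
    "@prefix v: <http://www.w3.org/2006/vcard/ns#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)
BASE = "http://example.org/suncat/library/"


def literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def library_to_turtle(code, name, street, locality, postcode):
    """Describe one organisation, one of its sites, and the site address."""
    org_uri = f"<{BASE}{code}>"
    site_uri = f"<{BASE}{code}/site/main>"
    lines = [
        PREFIXES,
        f"{org_uri} a org:Organization ;",
        f"    rdfs:label {literal(name)} ;",
        f"    org:hasSite {site_uri} .",
        "",
        f"{site_uri} a org:Site ;",
        "    org:siteAddress [",
        "        a v:VCard ;",
        "        v:adr [",
        f"            v:street-address {literal(street)} ;",
        f"            v:locality {literal(locality)} ;",
        f"            v:postal-code {literal(postcode)}",
        "        ]",
        "    ] .",
    ]
    return "\n".join(lines) + "\n"


print(library_to_turtle("ExL", "Example University Library",
                        "1 Example Street", "Exampleton", "EX1 1EX"))
```

An institution with several libraries would get one org:hasSite statement per building under this approach.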

I found the examples on “Howto – Describing libraries, their collections and services in RDF” on the hbz Semantic web wiki very helpful.

Vocabularies used:

The RDF Vocabulary (RDF):
http://www.w3.org/1999/02/22-rdf-syntax-ns#

The RDF Schema vocabulary (RDFS):
http://www.w3.org/2000/01/rdf-schema#

Friend of a Friend (FOAF):
http://xmlns.com/foaf/0.1/

DCMI Metadata Terms (DCT):
http://purl.org/dc/terms/

An Ontology for vCards (V) for representing address and contact information:
http://www.w3.org/2006/vcard/ns#

WGS84 Geo Positioning (GEO):
http://www.w3.org/2003/01/geo/wgs84_pos#

XML Schema (XSD):
http://www.w3.org/2001/XMLSchema#

Web Ontology Language (OWL):
http://www.w3.org/2002/07/owl#

Simple Knowledge Organization System (SKOS):
http://www.w3.org/2004/02/skos/core#

Ordnance Survey Postcode Ontology:
http://data.ordnancesurvey.co.uk/ontology/postcode/
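
When generating the data from a script, it can help to keep these namespaces in one place. The following is a small sketch of that idea, not the code used in the sprint: the prefix labels are lower-cased versions of the abbreviations above, the `postcode` label is my own choice, and the function names are hypothetical.

```python
# Namespace URIs copied from the vocabulary list; prefix labels are a choice.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dct": "http://purl.org/dc/terms/",
    "v": "http://www.w3.org/2006/vcard/ns#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "postcode": "http://data.ordnancesurvey.co.uk/ontology/postcode/",
}


def expand(curie):
    """Turn a prefixed name such as 'foaf:name' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown prefix in {curie!r}")
    return PREFIXES[prefix] + local


def turtle_header():
    """Return the @prefix declarations to put at the top of a Turtle file."""
    return "".join(f"@prefix {p}: <{ns}> .\n" for p, ns in PREFIXES.items())


print(turtle_header())
print(expand("geo:lat"))
```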

The rdf:about RDF/Turtle validator and Converter was useful for checking Turtle files.

There is a JISC MU list of organisations, which I used to enrich the SUNCAT institution data with JISC MU organisation identifiers by querying the SPARQL endpoint for JISC MU institutions, also using the Perl CPAN module RDF::Query::Client.
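
The lookup was done in Perl, but the same idea can be sketched in Python with only the standard library. Everything specific here is an assumption: the endpoint URL is a placeholder, matching on rdfs:label is a guess at how the institution names are exposed, and the function names are hypothetical. The request itself follows the standard SPARQL protocol (a GET with a `query` parameter and JSON results).

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"  # placeholder, not the real endpoint


def build_query(name):
    """SPARQL query for organisations whose label matches a name exactly."""
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?org ?label WHERE {\n"
        "  ?org rdfs:label ?label .\n"
        f'  FILTER (lcase(str(?label)) = lcase("{safe}"))\n'
        "}"
    )


def build_url(name, endpoint=ENDPOINT):
    """Encode the query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urllib.parse.urlencode({"query": build_query(name)})


def lookup(name, endpoint=ENDPOINT):
    """Run the query and return (organisation URI, label) pairs."""
    request = urllib.request.Request(
        build_url(name, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (row["org"]["value"], row["label"]["value"])
        for row in data["results"]["bindings"]
    ]


print(build_url("University of Edinburgh"))
```

The URIs returned by `lookup` could then be attached to the matching SUNCAT institution records, for example with owl:sameAs or a similar linking property.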

Transforming the SUNCAT institution data into linked data has helped SUNCAT clean our data. The linked data can be used as an internal source of data for various SUNCAT configuration files, web pages, and contact information.

Episode 2, SUNCAT: Notes from initial chat

Questions about use cases for the data; what the benefits are to end users or to libraries.
Something we have to work on formulating with data in hand. Morag mentions a 2005 document discussing SUNCAT use cases.

One suggested usage is deriving information about the focus and specialisms of libraries, by extending library subject metadata using journal/article subject metadata – so identifying the bent of universities through the holdings of their libraries.

Another immediate usage is linking bibliographic datasets of journal articles to journal issues and journal information found in SUNCAT. Medline is a useful example of a dataset that can be integrated – work on Linked, Open Medline metadata is happening through the OpenBiblio project.

SUNCAT holds a record for each institution and its library location; this could helpfully be linked to the OARJ Linked Data for institutions, and to the JISC CETIS PROD work collecting different sources of UK HE/FE information.

Sources in SUNCAT may have an OID which could be re-used as part of a URI. Journals, both electronic and hardcopy, also (though not always) have ISSNs.
There are restrictions on re-use of data licensed from the ISSN Network, but one can get some of it from other sources – CONSER is a North-America-focused example, with a bit of a scientific bent (thus useful for Medline).
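
A rough sketch of how identifiers like these could be turned into URIs: use the ISSN when there is a valid one, otherwise fall back to a local identifier. The base URI, the path layout and the function names are all hypothetical; the check digit calculation is the standard ISSN one (weights 8 down to 2, modulo 11, with X standing for 10).

```python
import re
from urllib.parse import quote

BASE = "http://example.org/suncat/journal/"  # hypothetical base URI


def issn_check_digit(first_seven):
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def normalise_issn(raw):
    """Return the ISSN as NNNN-NNNC, or None if malformed or invalid."""
    compact = raw.strip().upper().replace("-", "")
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    if issn_check_digit(compact[:7]) != compact[7]:
        return None
    return compact[:4] + "-" + compact[4:]


def journal_uri(issn=None, oid=None):
    """Prefer a valid ISSN; otherwise fall back to a local identifier."""
    normalised = normalise_issn(issn) if issn else None
    if normalised:
        return f"{BASE}issn/{normalised}"
    if oid:
        return f"{BASE}oid/{quote(oid, safe='')}"
    return None


print(journal_uri(issn="03785955"))
print(journal_uri(oid="local-123"))
```

Validating the check digit before minting a URI is a cheap way to catch typos in the source records.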

SUNCAT uses OpenURL to search for journal articles and holdings data in institutional libraries. Libraries run an “OpenURL resolver” – often with a bit of proprietary software such as SFX – to map OpenURLs to stuff in their holdings. Would be interesting to find out more about the inside of an OpenURL resolver and how useful a Linked Data rendering of it would be…
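
For reference, an OpenURL for a journal article is just a resolver address plus a set of key/value pairs describing the item. The sketch below builds one in the OpenURL 1.0 key/encoded-value style; the resolver address is a placeholder and the function name is hypothetical, and real resolvers may expect extra or different keys.

```python
from urllib.parse import urlencode

RESOLVER = "http://resolver.example.org/openurl"  # placeholder resolver address


def journal_article_openurl(issn, volume=None, issue=None, spage=None,
                            date=None, atitle=None, base=RESOLVER):
    """Build an OpenURL 1.0 (KEV) link for a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.issn", issn),
    ]
    optional = {
        "rft.volume": volume,
        "rft.issue": issue,
        "rft.spage": spage,
        "rft.date": date,
        "rft.atitle": atitle,
    }
    # Only include the fields we actually know about.
    params.extend((key, str(value)) for key, value in optional.items()
                  if value is not None)
    return base + "?" + urlencode(params)


print(journal_article_openurl("0378-5955", volume=12, spage=101, date="2011"))
```

Each library's resolver then decides what, if anything, in its holdings matches that description.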

Surprised to learn that university libraries often don’t maintain their own subscription database; journals are bought in “bundles” whose contents are shifting, and libraries depend on vendors to sell them back their processed catalogue data.

SUNCAT contains a dataset describing libraries, their affiliations and locations, held in a set of text files. This would be a good place to start with a simple Linked Data model that we can link up to the outcome of the previous LDFocus sprint, and then work on connecting up the library holdings data.

Starting a separate notepad for SUNCAT links. Should have done this earlier, but have been busy with the new release of Unlock and the wrap-up of the Chalice project.

Preparing for Episode 2: SUNCAT

We’re preparing to start the second Linked Data Focus sprint next week (from May 16th) – working with the developers from the SUNCAT team, who are bibliographic data specialists.

Our notepad from the first sprint has a lot of links to relevant resources – introductions to RDF, tools in different languages, and descriptions of related work around academic institutions and communications.

This presentation by Jeni Tennison from the Pelagios workshop is also worth looking at for sensible advice about taking an existing data model into a Linked Data form. Ian took this sort of approach for the Open Access Repository Junction work – working through the different objects in a relational database model, thinking about how to decorate them with common RDF elements, then creating a vocabulary for the missing pieces. Some of the same questions about publishing and structuring Linked Data should come up; and in the middle of the sprint we’ll hold another Linked Data Learn-in at EDINA.

SUNCAT should have a fair bit more in common with existing Linked Data projects – particularly the JISC-supported OpenBiblio – and we’ll try to make links between SUNCAT-listed publications and some of their metadata. If we can get as far as then linking through to pre-prints in the institutional repositories found in OARJ, then I’ll be entirely satisfied.