Updated data in SPARQL and new SRU target for SUNCAT open data

If you’re interested in SUNCAT data then you’ll know that there’s been a lot of activity with the new SUNCAT service interface (http://suncat.ac.uk/).

There’s also been a lot of activity with the underlying database and some improvements to the data sitting behind that too.

With new update processes in place, the Discover EDINA open data has benefited too: a new, up-to-date set of open data is now available on the SRU target, with a shiny new permanent base URL!

The DiscoverEDINA SUNCAT Open Data SRU target is now based at

http://m2m.edina.ac.uk/sru/de_suncat

All the same SRU and CQL queries should run as before, now against the updated data and with the addition of multiple records for the same SUNCAT ID (these represent the same journal as held by different libraries).
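For example, the sample searchRetrieve request from the original SRU post (below) now reads:

http://m2m.edina.ac.uk/sru/de_suncat?operation=searchRetrieve&version=1.1&startRecord=1&maximumRecords=1&query=sc.id%3DSC00374927310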

Hope you find it useful!

Also see:
SPARQL endpoint for SUNCAT
SUNCAT Open Data – SRU

SPARQL endpoint for SUNCAT

As we explored how to extend access to the metadata contributed by a set of libraries using the SUNCAT service, in order to promote discovery and reuse of the data, it soon became clear that Linked Data was one of the preferred formats for enabling this.

The previous phase of this project developed a transformation to express the holdings information in an RDF model. The resulting XSLT converts MARC-XML into RDF/XML. This XSLT transformation was used to process over 1,000,000 holdings records made available by the British Library, the National Library of Scotland, the University of Bristol Library, the University of Nottingham Library, the University of Glasgow Library and the library of the Society of Antiquaries of London, in order to make them available through a Linked Data SPARQL endpoint interface.

Setting up the Triplestore

We built on previous experience at EDINA of providing SPARQL endpoints to set up the interface for the SUNCAT Linked Data.

We chose the 4Store application, which is fully open source, efficient, scalable, and provides a stable RDF database. Our experience is that it is also simpler to install than other products. We installed 4Store on an independent host in order to keep this application separate from other services, for security and ease of maintenance.

Loading the data

The data contributed by each library was processed separately. First, the data was extracted from SUNCAT, following any restrictions placed by the specific library. It was then transformed into RDF/XML and finally loaded into the triplestore. Each of these steps can be fairly time-consuming, depending on the size of the data file. Once the data from each library has been added to the triplestore, queries can be made across the whole RDF database.
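To make the pipeline concrete, here is a minimal sketch of the transform-and-load steps, assuming lxml for the XSLT stage and 4Store's HTTP interface for loading (an HTTP PUT to /data/<graph URI> replaces that named graph). The file names, host, port and graph URI are illustrative, not those of the actual service.

from lxml import etree
import urllib.request

# 1. Transform MARC-XML into RDF/XML using the project's XSLT.
transform = etree.XSLT(etree.parse("marcxml-to-rdfxml.xsl"))   # illustrative file name
rdfxml = transform(etree.parse("library-records.marcxml"))     # illustrative file name

# 2. Load the RDF/XML into the triplestore. 4Store accepts an HTTP PUT
#    to /data/<graph-uri>, replacing that named graph with the payload.
graph = "http://example.org/graph/library"  # illustrative graph URI
req = urllib.request.Request(
    "http://localhost:8000/data/" + graph,  # illustrative host/port
    data=etree.tostring(rdfxml),
    headers={"Content-Type": "application/rdf+xml"},
    method="PUT",
)
urllib.request.urlopen(req)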

APIs

An HTTP server is required to provide external access and allow querying of the triplestore. 4Store includes a simple SPARQL HTTP protocol server which answers SPARQL 1.1 queries. Once the server is running, you can query the triplestore using:

  1. A machine-to-machine API at http://sparql1.edina.ac.uk:8181/sparql/.
  2. A basic GUI at http://sparql1.edina.ac.uk:8181/test/.

GUI

The functionality of the basic test GUI is rather limited and only enables SELECT, CONSTRUCT, ASK and DESCRIBE operations. In order to customise the interface and provide additional information, like example queries, we used an open source SPARQL frontend designed by Dave Challis, called SPARQLfront, which is available on GitHub. SPARQLfront is a PHP and JavaScript based frontend and can be installed on top of a default Apache2/PHP server. It supports SPARQL 1.0.

An improved GUI is available at: http://sparql1.edina.ac.uk:8181/endpoint/.

The DiscoverEDINA SUNCAT SPARQL endpoint GUI provides four sample queries to help the user with the format and syntax required to compose correct SPARQL queries. For example, one of the queries is:

Is the following title (i.e. archaeological reports) held anywhere in the UK? 

SELECT ?title ?holder
WHERE {
        ?j foaf:primaryTopic ?pt.
        ?pt dc:title ?title;
            lh:held ?h.
        ?h lh:holder ?holder.

        FILTER regex(str(?title), "archaeological reports", "i")
      }

The user is provided with a box in which to enter queries. Syntax highlighting is provided to help with composition. The user can also choose whether or not to display the namespaces in the box. There is a range of output formats that can be selected:

  • SPARQL XML (the default)
  • JSON
  • Plain text
  • Serialized PHP
  • Turtle
  • RDF/XML
  • Query structure
  • HTML table
  • Tab Separated Values (TSV)
  • Comma Separated Values (CSV)
  • SQLite database

The SPARQL endpoint GUI is ideal for running interactive queries, and for developing or troubleshooting queries to be run via the m2m SPARQL API or used in conjunction with the SRU target.
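As a worked example, here is a minimal sketch of running the sample query above through the m2m API, assuming the endpoint follows the standard SPARQL protocol (4Store's HTTP server does: the query goes in the query parameter, and the response format is negotiated via the Accept header). Note that outside the GUI the namespace prefixes must be declared explicitly; foaf: and dc: below are the standard FOAF and Dublin Core element namespaces, but the lh: URI is a placeholder; check the GUI's namespace box for the exact URI the endpoint uses.

import urllib.parse
import urllib.request

ENDPOINT = "http://sparql1.edina.ac.uk:8181/sparql/"

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX lh:   <http://example.org/lh#>

SELECT ?title ?holder
WHERE {
  ?j foaf:primaryTopic ?pt .
  ?pt dc:title ?title ;
      lh:held ?h .
  ?h lh:holder ?holder .
  FILTER regex(str(?title), "archaeological reports", "i")
}
"""

# The lh: prefix above is a placeholder URI; substitute the one shown in
# the endpoint GUI's namespace list.
url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+xml"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))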

Making records from the SUNCAT database openly available: the experience with licensing

The background is explained in an earlier post (July 10 2012). SUNCAT (Serials UNion CATalogue) aggregates the metadata (bibliographic and holdings information) for serials, regardless of physical format, held in (currently) 89 libraries, and it was planned (with the agreement of the Contributing Libraries) to make as much of this data openly available as possible.

It was decided to adopt an opt-in policy. This approach was taken since it was felt that Contributing Libraries (CLs) needed to be fully aware of the commitment they were making, and to have the opportunity to place any particular restrictions, such as limiting the data which could be made open or restricting the number of formats in which the data would be made available. In the event, most of the participants availed themselves of the opportunity to specify, unambiguously, which data they were agreeing to make open.

Legal advice was taken from the University solicitors and the licence format adopted was the Open Data Commons Public Domain Dedication and Licence, with reference to the ODC Attribution Share Alike Community Norms. Staff in quite a number of institutions expressed interest but, in the event, only staff in six institutions proceeded as far as signing a licence with EDINA. A copy of the standard agreement may be viewed here.

Since many libraries have acquired some of the metadata records they use in OPACs from one or more third party commercial suppliers, there were very understandable concerns about giving permission for EDINA to make records from these sources openly available.  Accordingly, it was necessary to add an Appendix to the individual Agreements, specifying what particular restrictions should be applied.

The situation applying to each of the libraries is as follows:

British Library: permission was given to publish all serials records, but they are not to be made available in MARCXML or MARC21 formats.

National Library of Scotland: permission was given to publish as open data any NLS record that has ‘StEdNL’ in MARC field 040$a, and to publish as open data the title, ISSN and holdings data for any serials record in their catalogue.

University of Bristol Library: permission was given to publish as open data any Bristol record that has “UkBrU-I” in MARC field 040$a, e.g.,

040   L $$aUkBrU-I

University of Nottingham Library: permission was given to publish as open data any Nottingham record that has “UkNtU” in MARC field 040 $a and $c, e.g.,

040   L $$aUkNtU$$cUkNtU

However, if there is an 035 tag identifying a different library, then the record is not used, e.g.,

035   L $$a(OCoLC)1754614
035   L $$a(SFX)954925250111
035   L $$a(CONSER)sc-84001881-
040   L $$aUkNtU$$cUkNtU

University of Glasgow Library: permission was given to publish as open data any Glasgow record that is not derived from Serials Solutions, as indicated in MARC field 035$a (WaSeSS).

The library of the Society of Antiquaries of London: permission was given to publish all serials records as open data.

As mentioned above, staff in quite a few other libraries expressed interest in becoming involved, but the short timescale of the project meant that effort had to be concentrated on those libraries able to sign the licence agreement quickly.

Subject to the availability of further funding it is planned to continue discussions with those libraries which have expressed interest but were not able to proceed to signing an agreement.

Negotiating the specific requirements for each of the libraries was a time-consuming, although necessary, process, and there are concerns about the resources which would be required to carry out such negotiations for a rather larger number of libraries than participated in this phase.

Taken together, the records which can be made openly available total in excess of 1,000,000: a considerable quantity of serials metadata. Once the data has been released it will be most interesting to monitor the uses made of it.

Details about making the data openly available and the ways in which developers and others can access it are outlined in a separate blog entry.

That library staff have concerns about making available metadata obtained from one or other third party has been well recognised for some time, but to date there has been very little progress on resolving these issues at either a national or an international level. In the earlier blog post it was stated that:

“A number of librarians said that it would be a good idea if JISC/EDINA could come to an agreement with organisations such as OCLC and RLUK rather than individual libraries needing to approach them; this is an idea certainly worth pursuing”.

JISC did commission work to be carried out in this area and there is a website available which provides guidance. Whilst, clearly, this is very helpful, the onus is placed upon staff in individual libraries to look carefully at their licence agreements with third party suppliers: even where this is done, what is often found is that the licence agreements are not necessarily clear and unambiguous about what is possible and what is not.

RLUK recently commissioned work to scope the parameters of making RLUK data openly available and the results of that work should make helpful reading even if the focus is just on material in the RLUK database.

It certainly would be of considerable benefit to the HE community as a whole if national bodies, including JISC, SCONUL and RLUK, could accept responsibility for initiating discussions with third party suppliers of records, with a view to negotiating the removal of all restrictions on making metadata openly available. Such an approach would remove the need for individual libraries to investigate their specific local circumstances and would be of enormous potential benefit to the user community.

SUNCAT Open Data – SRU

As part of the open data strand of the SUNCAT portion of Discover EDINA, we have made available the individual library records that we have agreement to release. At the time of writing this covers:

National Library of Scotland, Glasgow University, British Library, Bristol University, Society of Antiquaries of London, Nottingham University.

In order to make these records available, we’ve opted for an SRU target, which is RESTful. In the first instance we intend users to query the SPARQL interface (see the other post) and use the linked part of the data in the RDF incarnation of the records, and then use the SUNCAT ID to link through to the SRU target to extract the full MARC record (in most cases) should that be needed.

Since the target is a full-blown SRU server, there is actually a plethora of indexes built over the MARC-XML records, but the one we anticipate being used most is that for the SUNCAT ID. However, users are welcome to use the other indexes, which are detailed below.

In the first instance, the DiscoverEDINA SUNCAT SRU target can be found at

http://suncatdev.edina.ac.uk:31001/de_suncat

[EDIT 2014-05-13. The above URL should work but it is now preferred to use

http://m2m.edina.ac.uk/sru/de_suncat ]

so in order to get the MARC-XML format of a record with a SUNCAT ID of “SC00374927310”, you send a CQL query of sc.id=SC00374927310, which goes into an SRU searchRetrieve request as:

http://suncatdev.edina.ac.uk:31001/de_suncat?operation=searchRetrieve&version=1.1&startRecord=1&maximumRecords=1&query=sc.id%3DSC00374927310

Remember that the number of records released under the Open Data umbrella is limited, so you won’t find every SUNCAT ID here, but you will find every one that’s in the SPARQL endpoint.

The response will be an XML document forming an SRU response, and it may contain records (about the same item) from multiple libraries. These records can be found at the XPath zs:searchRetrieveResponse/zs:records/zs:record/zs:recordData. The number of records found is always sent in the zs:searchRetrieveResponse/zs:numberOfRecords element, and you can specify which and how many records to retrieve by varying the startRecord and maximumRecords parameters in the HTTP query string.
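Putting that together, a hedged sketch of fetching and parsing a response in Python (using the permanent base URL noted in the edit above; the zs: namespace URI is the standard SRU 1.1 one):

import urllib.request
import xml.etree.ElementTree as ET

URL = ("http://m2m.edina.ac.uk/sru/de_suncat"
       "?operation=searchRetrieve&version=1.1"
       "&startRecord=1&maximumRecords=10"
       "&query=sc.id%3DSC00374927310")

ZS = {"zs": "http://www.loc.gov/zing/srw/"}

tree = ET.parse(urllib.request.urlopen(URL))
print("records found:", tree.findtext("zs:numberOfRecords", namespaces=ZS))

# Each zs:recordData element wraps one library's record for this ID.
for data in tree.findall("zs:records/zs:record/zs:recordData", ZS):
    for payload in data:  # the record itself (e.g. MARC-XML) sits inside
        print(ET.tostring(payload, encoding="unicode")[:200])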

By default, records will be returned in MARC-XML, with the exception of British Library records, which (due to licensing issues) will always be returned in the RDF transformed version of the record.

Okay, so that’s the basics of grabbing a full MARC-XML record with a SUNCAT ID.  Now for the fun stuff (I’m using ‘fun’ in quite a broad sense of the word).

You can grab a (non-BL) record in five (yes, five) different XML schemata! To do so, just append the parameter recordSchema=X, where X is one of marc (the default), rdf, mods, mads or dc. This transforms the MARC-XML into one of the other formats using an XSLT transform. The rdf one was created in our previous project, and the mods, mads and dc ones are from Index Data’s Zebra software (freely available from http://www.indexdata.com/zebra). These are relatively simple but might be useful.
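For example (reusing the request above), asking for the same record in its RDF form is just a matter of appending the parameter:

http://m2m.edina.ac.uk/sru/de_suncat?operation=searchRetrieve&version=1.1&maximumRecords=1&query=sc.id%3DSC00374927310&recordSchema=rdf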

Even more fun: obviously we’re making the records search-and-retrievable on the SUNCAT ID, since the expected workflow is to query the SPARQL endpoint, obtain the links in the RDF records (including a SUNCAT ID), and then use that SUNCAT ID to obtain the full records of anything you’re interested in from the SRU server. However, since this is a full-blown SRU server, we’ve actually got a full set of indexes, and you can use any valid CQL query combining the lot of them!

These indexes are designed to be as close as possible to the existing SUNCAT service Z39.50 target indexes. In the SRU server some are prefixed with the “bib1” namespace and the rest with the “sc” namespace. Here is a table of the bib1 indexes and their equivalent Z39.50 BIB-1 index:

bib1.date/time-last-modified = Date/time-last-modified
bib1.lc-card-number = LC-card-number
bib1.isbn = ISBN
bib1.number-music-publisher = Number-music-publisher
bib1.name = Name
bib1.author = Author
bib1.author-name-personal = Author-name-personal
bib1.dewey-classification = Dewey-classification
bib1.issn = ISSN
bib1.lc-call-number = LC-call-number
bib1.nlm-call-number = NLM-call-number
bib1.place-publication = Place-publication
bib1.publisher = Publisher
bib1.title-series = Title-series
bib1.identifier-standard = Identifier-standard
bib1.subject-heading = Subject-heading
bib1.number-govt-pub = Number-govt-pub
bib1.title = Title
bib1.any = Any
bib1.server-choice = Server-choice
bib1.date = Date
bib1.date-of-publication = Date-of-publication
bib1.title-uniform = Title-uniform
bib1.code-institution = Code-institution
bib1.note = Note
bib1.code-language = Code-language
bib1.code-geographic = Code-geographic

The sc indexes map to their equivalent SUNCAT service indexes; they are not well documented here, and some duplicate the bib1 indexes, but you’re free to play! Almost certainly the two most useful are the SUNCAT ID index, SC_ID, and the contributing library code index, SC_WIS (see the example after the list of values below). The values for SC_WIS can be:

StEdNL (National Library of Scotland)
StGlU (Glasgow University)
Uk (British Library)
UkBrU-I (Bristol University)
UkLSAL (Society of Antiquaries of London)
UkNtU (Nottingham University)
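For example (an illustrative query, not one from the service documentation), to find open records contributed by the National Library of Scotland with “gazette” in the title, the two index sets can be combined in a single CQL query, sc.wis=StEdNL and bib1.title=gazette, which goes into the searchRetrieve request as:

http://m2m.edina.ac.uk/sru/de_suncat?operation=searchRetrieve&version=1.1&maximumRecords=10&query=sc.wis%3DStEdNL%20and%20bib1.title%3Dgazette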

Here are all the other sc indexes:

sc.id = SC_ID
sc.005 = SC_005
sc.010 = SC_010
sc.020 = SC_020
sc.022 = SC_022
sc.028 = SC_028
sc.035 = SC_035
sc.049 = SC_049
sc.aut = SC_AUT
sc.awt = SC_AWT
sc.ddc = SC_DDC
sc.gvd = SC_GVD
sc.ismn = SC_ISMN
sc.issn = SC_ISSN
sc.lcc = SC_LCC
sc.nlm = SC_NLM
sc.pla = SC_PLA
sc.pub = SC_PUB
sc.sbd = SC_SBD
sc.sgn = SC_SGN
sc.sici = SC_SICI
sc.sid = SC_SID
sc.srs = SC_SRS
sc.ssn = SC_SSN
sc.stidn = SC_STIDN
sc.stmd = SC_STMD
sc.sub = SC_SUB
sc.sud = SC_SUD
sc.sul = SC_SUL
sc.sum = SC_SUM
sc.tit = SC_TIT
sc.ttl = SC_TTL
sc.wrd = SC_WRD
sc.wyr = SC_WYR
sc.wti = SC_WTI
sc.wau = SC_WAU
sc.wut = SC_WUT
sc.wur = SC_WUR
sc.wnc = SC_WNC
sc.wfm = SC_WFM
sc.wtp = SC_WTP
sc.wgo = SC_WGO
sc.wct = SC_WCT
sc.wid = SC_WID
sc.wsd = SC_WSD
sc.ntl = SC_NTL
sc.wis = SC_WIS
sc.wst = SC_WST
sc.wuc = SC_WUC
sc.wucx = SC_WUCX
sc.wuco = SC_WUCO
sc.wno = SC_WNO
sc.wln = SC_WLN
sc.wpu = SC_WPU
sc.wpl = SC_WPL
sc.wsrs1 = SC_WSRS1
sc.wsrs2 = SC_WSRS2
sc.wga = SC_WGA
sc.wsu = SC_WSU
sc.wsm = SC_WSM

Tagger – Final Blog Post

As part of the UK vision for supporting open discovery principles in relation to education materials, JISC has sponsored a number of projects to assist in discovering and enriching existing resources. Tagger (variously referred to previously as GTR or geotagger) was one strand of the JISC-funded umbrella DiscoverEDINA project. For the two other strands of this work, see here.

The primary purpose of Tagger is to assist in enriching and exposing ‘hidden’ metadata within resources, primarily images and multimedia files. Images, for example, embed a great deal of descriptive and technical metadata within the file itself, and very often it is not obvious that the main focus of interest, the image, is carrying a ‘secret’ payload of information, some of which may be potentially compromising. Take, for example, the recent embarrassment suffered by Dell after a family member uploaded images to social media sites with embedded location information, frustrating the efforts of a multi-million pound security operation. Or take the case of the US military, when an innocently uploaded photograph of newly delivered Apache helicopters allowed insurgents to use the location information embedded in the image to pinpoint and destroy them.

There are many other instances of people being innocently or naively caught out by these ‘hidden’ signposts in resources that they distribute or curate. Tagger helps by providing tools to expose those hidden features and makes it easy to review, edit and manage the intrinsic metadata routinely bundled in resources. It has concentrated on, but not been limited to, geotags.

Tagger has delivered three main things:

  • A basic web service API, based around ExifTool, suitable for third-party use to geo-tag/geo-code image, audio and video metadata;
  • A demo web site enabling user upload, metadata parsing (from the resource) and metadata enrichment (map-based geo-tagging/geo-coding);
  • An Open Metadata corpus of geo-tagged/geo-coded enriched records with a REST-based query interface. Currently, this corpus consists of approximately a quarter of a million Creative Commons licensed geotagged images, mainly bootstrapped from Geograph.

Tagger supports the open discovery metadata principles and has made extensive use of open licensing models.

Along the way we started thinking about specific use cases. The ‘anonymise my location’ case seemed an obvious one, and Tagger’s API and website reflect that thinking. Additionally, in talking to colleagues involved in field trips, it was clear that there was potential in providing integrated tooling, and we experimented with Dropbox integration.
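This is not the Tagger API itself (its endpoints are not documented in this post), but a minimal sketch of the underlying ExifTool operations that an ‘anonymise my location’ step wraps: read the embedded GPS tags, then strip them. The file name is illustrative.

import json
import subprocess

IMAGE = "photo.jpg"  # illustrative file name

# Read the embedded GPS metadata as JSON (-j selects JSON output).
out = subprocess.run(
    ["exiftool", "-j", "-GPSLatitude", "-GPSLongitude", IMAGE],
    capture_output=True, text=True, check=True,
).stdout
print(json.loads(out))

# Strip every GPS tag in place; "-gps:all=" clears the whole GPS group.
subprocess.run(["exiftool", "-overwrite_original", "-gps:all=", IMAGE], check=True)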

Taking this further, and building on EDINA’s more general mobile development work, we then started to think about how Tagger could be used to assist and enrich in-field data capture and post-trip reflective learning. We continue to explore this beyond the project funding, as the enrichment facilities Tagger provides allow for flexible integration into third-party services and projects.

Of course, Tagger will never be a panacea for all the ills of metadata, nor should it aim to be. However, by building on best-of-breed open source tools (ExifTool), Tagger, or more accurately the Tagger API, provides a facility that other service providers and projects can use to enable better manipulation and management of that ‘hidden’ metadata.

Therein lies the rub: the perennial question of embedding and take-up.

That’s our next challenge.

SUNCAT open data

First problem: getting permission from contributing libraries to allow their data to be re-distributed.  Fortunately for me that’s not my problem, and some sterling work from other members of the team has allowed some data to be released without strings.

Libraries who allow some of their data out into the wild usually stipulate that it can be any record they’ve contributed that doesn’t originate from such-and-such a source, or that was created by them, or similar.

In practice, this means using records from particular libraries that have a particular library code in 040$a, or that don’t have a particular code in 035$a. These rules could be applied automatically at a live filtering stage, but in order to be utterly sure nothing untoward is being released, we have chosen to extract those records and build a separate database from them alone.
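As a rough illustration of the kind of filter involved (a sketch only; the codes below are examples quoted from the licence terms elsewhere on this blog, and the real extraction rules are negotiated per library), over MARC-XML with Python’s ElementTree:

import xml.etree.ElementTree as ET

M = {"m": "http://www.loc.gov/MARC21/slim"}  # standard MARC-XML namespace

def allowed(record, required_040a="StEdNL", banned_035a=None):
    """Keep a record whose 040$a matches, unless an 035$a rules it out."""
    codes = [sf.text for sf in record.findall(
        "m:datafield[@tag='040']/m:subfield[@code='a']", M)]
    if required_040a not in codes:
        return False
    if banned_035a:
        for sf in record.findall(
                "m:datafield[@tag='035']/m:subfield[@code='a']", M):
            if sf.text and banned_035a in sf.text:
                return False
    return True

records = ET.parse("contributed.marcxml").getroot()  # illustrative file name
open_records = [r for r in records.findall("m:record", M) if allowed(r)]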

So, once you get past the problem of libraries allowing their data to be distributed freely (which we haven’t ;) ), you then need to allow clients to usefully connect and retrieve the data. Two approaches are being taken to this.

The first is to produce an SRU target onto the database of (permitted) records. We have a lot of experience with Index Data’s open source Zebra product, which is a database and Z39.50/SRU frontend all in one. It can be quite fiddly to configure (which is where the experience comes in handy!) but its performance (speed and reliability) is excellent. It also allows multiple output formats for the records using XSLT.

One of the most useful outcomes from the Linked Data Focus project was an XSLT, produced by Will Waites, that converts MARC-XML into RDF/XML. We can use this as one of the outputs from the SRU target, alongside MARC-XML (although some libraries require that their records not be released in MARC-XML, in which case the XSLT simply blanks those records when they are requested in that format) and a rudimentary MODS transformation; a JSON transformation might be a possibility too.

Perhaps more usefully for the RDF/XML data, the second approach is to feed it into a SPARQL endpoint. This should allow anyone interested in the linked data RDF to query it in a language more familiar to the linked data world.

We’ll be providing more information on how to connect to the SRU target and the SPARQL endpoint once we’ve polished them up a bit for you.


Licensing SUNCAT serials’ records

The reasons for making bibliographic metadata openly available have been well put by JISC in the Open Bibliographic Data Guide and by the Open Knowledge Foundation, but whilst many librarians are keen to support making their institutional library metadata available, there are issues to be resolved. There can be copyright and contractual issues over records in library OPACs which inhibit the release of records. The records in many OPACs will have been obtained from one or more third party organisations (e.g. OCLC, British Library, Ex Libris, Serials Solutions), and even though the records received from these third parties will often have been modified, perhaps quite extensively, there are understandable concerns about the possible repercussions of making them available under an open licence.

SUNCAT is an aggregation of serials metadata from (currently) 86 libraries (referred to as Contributing Libraries (CLs)). Whilst much of the metadata will have been created by local library staff, and will therefore be ‘owned’ by the library, some of it will have been purchased from a third party supplier. The metadata is essentially supplied to EDINA on the basis of goodwill and a common understanding about how the data is used and made available. EDINA reached agreement with third party record suppliers that records in MARC21 format could be made available for downloading, but only to staff in CLs.

In the initial project, SUNCAT: exploring open metadata (funded under the JISC Capital funded RDTF participation), the decision was taken to adopt an ‘opt-in’ approach and, accordingly, an invitation was sent to all the CLs inviting them to participate in making their SUNCAT-contributed data openly available under an Open Data Commons Public Domain Dedication and Licence, with reference to the ODC Attribution Share Alike Community Norms. Considerable interest was expressed by CLs in becoming involved, but concerns, particularly to do with making third party records available, were raised. A number of librarians said that it would be a good idea if JISC/EDINA could come to an agreement with organisations such as OCLC and RLUK rather than individual libraries needing to approach them; this is an idea certainly worth pursuing.

Licences have now been signed by three organisations: the British Library (BL), the National Library of Scotland (NLS) and the Society of Antiquaries; discussions are well advanced with a number of additional organisations. After discussion with BL staff, it was agreed that it would be preferable to add an Appendix to an existing contract between EDINA and the BL, and this has been done. All the data supplied to EDINA by the BL can be made openly available, provided records are not made available in either MARC21 or MARCXML format. In the case of the National Library of Scotland, permission has been given to make available all fields of all records created by NLS (identified by the presence of ‘StEdNL’ in the 040$a field), or to make title, ISSN and holdings information available for the whole of its contribution to SUNCAT. The Society of Antiquaries has placed no restrictions on the use of its contributed records.

Glasgow University has asked for records from a third party supplier to be excluded from the records made available for open usage and this will be done.

Work is now being carried out to make the records from the initial three organisations freely available on the basis described in the licences, and as further licences are signed by additional organisations, more data will be published for open usage.

Enhancing JISC MediaHub metadata

Why do JISC MediaHub metadata need to be enhanced?

JISC MediaHub contains around 130,000 images, videos and audio clips licensed by JISC Collections, plus records of another 600,000 harvested from other providers.

A large proportion of these records contain important information relating to people, places and dates. Although some of these have been well catalogued, all too often the information exists only as descriptive plain text. This can happen for a variety of reasons: it may be that the metadata were created from older, text-based records that were not originally created with machine indexing in mind; that resources for cataloguing records were limited; or that ill-conceived merging into a common metadata schema destroyed valuable elements.

Whatever the cause, the discoverability of records is reduced. Advanced searching and results filtering based on geographic location or dates may fail to display relevant records where location and date are not indexed. Visualisations such as maps and timelines are also restricted to covering only a subset of all records.

It can also be very confusing to users, and dent their confidence in the service, if they see a date prominently displayed in the title or description of a record, and yet a subsequent search for that date fails to retrieve that record.

Methods for enhancing metadata

There are three obvious approaches that could be used:

  1. Have the records professionally catalogued.
  2. Use text processing software to parse the metadata and attempt to identify dates, locations and the names of people and places etc.
  3. Crowd sourcing: a large community of users is likely to contain individuals with the necessary knowledge to contribute information.

The first approach, professional cataloguing, is unfortunately an extremely expensive option.

The second approach, using text processing software, is an efficient method but unfortunately is prone to error. Some information may be missed, and some terms may be falsely marked up. This can make the resulting records confusing to ordinary users who, reasonably enough, expect cataloguing to be definitive rather than a probabilistic expression of what a record is likely to be about.

We think the third approach, crowd sourcing, has great potential with an academic user community. Compared to an average web site, our user base is likely to contain an unusually large number of experts, and of individuals who are motivated to share knowledge. This approach involves certain risks: if the process for contributing is complex or time consuming, fewer users will take part; user-contributed information cannot be assumed to be as trustworthy as “official” metadata, and is prone to malicious or frivolous use; and whatever review processes exist must be lightweight (otherwise, we might as well do all the cataloguing ourselves).

Our approach

We will combine two of the approaches described above: text processing software and crowd sourcing.

By processing the metadata, we can identify candidate elements for indexing. These will be treated as unreliable, and will not be indexed at this stage. To begin with, we will focus on location information and use EDINA Unlock (http://unlock.edina.ac.uk/texts/introduction) for text processing.

When we have candidate location elements for our records, we will present them to users in the JISC MediaHub user interface. This will be incorporated into the existing record display pages, so that users encounter the new functionality as part of their normal usage of the service. Users will be asked simply to confirm or reject each candidate element. This should make the process of contributing very simple and help to maximise users’ involvement. Also, since users are selecting from predefined options rather than entering their own values, it should reduce the scope for abuse. We intend that no complicated reviews should be needed, at least in the vast majority of cases, and instead that we can base the result on a poll of users’ opinions.
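As a sketch of how such a poll might work (illustrative only, not service code): each candidate element accumulates confirm/reject votes and is only promoted to the index once a simple threshold is met.

from dataclasses import dataclass

@dataclass
class Candidate:
    record_id: str
    place_name: str
    confirms: int = 0
    rejects: int = 0

    def verdict(self, min_votes: int = 5, min_ratio: float = 0.8) -> str:
        """Return 'accept' or 'reject' once enough votes are in, else 'pending'."""
        total = self.confirms + self.rejects
        if total < min_votes:
            return "pending"
        return "accept" if self.confirms / total >= min_ratio else "reject"

c = Candidate("rec-123", "Edinburgh", confirms=4, rejects=1)
print(c.verdict())  # 'accept': five votes cast, 80% of them confirmations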

When we have a corpus of new metadata, we can index the values to add value to the records. We do not intend that the new metadata should be merged with the original records, as provenance is important; however, within the user interface we can provide the option to include user-contributed location data in searches.