SPARQL – What’s up with that?

The title of this post is intended to convey a touch of bewilderment through use of a phrase from the Cliff Clavin school of observational comedy.

Linked data and SPARQL

In the linked data world, SPARQL (SPARQL Protocol and RDF Query Language) is touted as the preferred method for querying structured RDF data. In recent years several high profile institutions have worked very hard to structure and transform their data into appropriate formats for linked data discovery and sharing, and as part of this, many have produced RDF triple (or quadruple) stores accessible via SPARQL endpoints – usually a web interface where anyone can type and run a SPARQL query in order to retrieve some of that rich linked data goodness.

This is admirable, but I have to admit to having had little success getting something out of SPARQL endpoints that I would consider useful. Every time I try to use a SPARQL facility I find I do better by scraping data from search results in the main interface. I have also increasingly become aware that I am not the only one to find it difficult.

RDF stores are different to relational databases; they are not so amenable to performing a search over the values on a particular field. Nor are they as flexible as text search databases like Solr. Instead they record facts relating entities to other entities. So it is important that as consumers of the data we know what kind of questions make sense and how to ask them in a way that yields useful results for us and does not strain the SPARQL endpoint unduly. If these are not the kind of questions we want to ask then we might need to question the application of SPARQL as the de facto way of accessing RDF triple stores.

I’d like to point out that my aim here is not to complain or to disparage SPARQL in general or anybody’s data in particular; I think it is fantastic so many institutions with large archives are making efforts to open up their data in ways that are considered best practice for the web, and for good reasons. However if SPARQL endpoints turn out to be flawed or inadequately realised, they will not get used and both the opportunity to use the data, and the work to produce it, will be wasted.

Problems with SPARQL endpoints

These are the problems I have commonly experienced:

No documentation of available vocabularies.
No example queries.
No access to unique identifiers so we can search for something specific.
Slowness and timeouts due to writing inefficient queries (usually without using unique ids or URIs).
Limits on the number of records which can be returned (due to performance limits).

Paraphrasing Juliette Culver’s list of SPARQL Stumbling Blocks on the Pelagios blog, here are some of the problems she experienced:

No query examples for the given endpoint.
No summary of the data or the ontologies used to represent it.
Limited results or query timeouts.
SPARQL endpoints are not optimised for full-text searching or keyword search.
No link from records in the main interface to the RDF/JSON for the record. (This is mentioned in relation to the British Museum, who provide a very useful search interface to their collection, but don’t appear to link it to the structured data formats available through their SPARQL endpoint.)

Clearly we have experienced similar issues. Note that some of these are due to the nature of RDF and SPARQL, and require a reconception of how to find information. Others are instances of unhelpful presentation; SPARQL endpoints are generally pretty opaque, but this can be alleviated by providing more documentation. With the amount of work it takes to prepare the data, I am surprised by how few providers accompany their endpoints with a clear list of the ontologies they use to represent their data, and at least a handful of example queries. This takes a few minutes but is invaluable to anybody attempting to use the service.

Nature provide the best example I have seen of a SPARQL endpoint, providing a rich set of meaningful example queries. Note also the use of AJAX for (minimal) feedback while query is running, and to keep the query on the results page.

Confusion about Linked Data

A blog post by Andrew Beeken of the JISC CLOCK project reports dissatisfaction with SPARQL endpoints and linked data, and provoked responses from other users of linked data:

“What is simple in SQL is complex in SPARQL (or at least what I wanted to do was) … You see an announcement about Linked Data and don’t know whether to expect a SPARQL endpoint, or lots of embedded RDF.” Chris Keene

“SPARQL seems most useful for our use context as a tool to describe an entity rather than as a means of discovery.” Ed Chamberlain

Chris’ point gives another perspective on linked data in general – what does it mean to provide linked (or should that be linkable) data, and how do we use it? Embedded RDF (RDFa) is good in that it tends to provide structured data in context, enriching a term in a webpage in a way that is invisible by default but that people can consume if they choose to. Ed indicates a fact about RDF as a data storage medium: it is a method of representing facts about entities which are named in an esoteric way; it is not structured in a way that is ideal for the freer keyword searching or SQL-style queries that we are used to.

Owen Wilson suggests the Linked Data Book‘s section 6.3 which describes approaches to consuming linked data, describing three architectural patterns. It looks worth a read to get one thinking about linked data in the right way.

Unique identifiers

“My native English, now I must forego” Richard II, Act 1, Scene 3

One of the tenets of linked data is that each object has a unique identifier. If we are looking for “William Shakespeare” we must use the URI or other identifier that represents him in the given scheme, rather than using the string “William Shakespeare”. It is thus also necessary that we have an easy way to access the unique identifiers that are used in the data, so that we can ask questions about a specific entity without forming a fuzzy, complex and resource-consuming query. The British Museum publicises its controlled terms, that is the limited vocabulary that they use in describing their collection, along with authority files which provide the canonical versions of terms and names, standardised in terms of spelling, capitalisation and so on, and thesauri which map synonymous terms to their canonical equivalents. These terms are used in describing object types, place names and so on, supporting consistency in the collections data. They are all available via the page British Museum controlled terms and the BM object names thesaurus. Armed with knowledge of what words are used in particular fields to categorise or describe entities in the data, and similarly with a list of ids or canonical names for things, we can then start to form structured queries that will yield results.

Shakespeare and British Museum

I have looked in particular at the British Museum’s SPARQL endpoint as an example, as BM is a project partner and because it has several items germane to Will’s World. To start with, the endpoint gives some context; a basic template query is included in the search box, which can be run immediately and which implicitly documents all the relevant ontologies by pulling them in to define namespaces. There is a Help link with some idea of how data is represented and can be accessed/referenced using URIs. All of this is good and I found it easy to get started with the endpoint.

However before long I came up against the problem I’ve had with other endpoints, namely that it is difficult to perform a keyword search, or at least perform a multi-stage search in order to (a) resolve a unique identifier for a keyword or named thing and then (b) retrieve information about or related to that thing. In this case I found a way to achieve what I needed by supplementing my usage of the SPARQL endpoint with keyword searches of the excellent Collection database search – and with some help from the technical staff at BM to resolve a couple of mistakes in my queries, can now harvest metadata about objects related to the person “William Shakespeare”.

It is reassuring to find out I am not alone in having difficulty retrieving and using SPARQL data. I followed Owen Stephen’s blog post about the British Museum’s endpoint with interest. Owen found the CIDOC CRM data model hard to query due to its (rich, but thereby counter-intuitive) multi-level structure. Additionally, he encountered the common issue that it is very difficult to perform a search for data containing or “related to” a particular entity which to start with is represented merely by a string literal such as “William Shakespeare”:

The difficulty of exploring the British Museum data from a simple textual string became a real frustration as I explored the data – it made me realise that while the Linked Data/RDF concept of using URIs and not literals is something I understand and agree with, as people all we know is textual strings that describe things, so to make the data more immediately usable, supporting textual searches (e.g. via a solr index over the literals in the data) might be a good idea.

Admittedly RDF representations and SPARQL are not really intended to provide a “search interface” in the sense to which most users are accustomed. But from the user’s perspective, there must be an easy way to start identifying objects about which we want to ask questions, and this tends to start with performing some kind of keyword search. It is then necessary to identify the ids representing the resulting objects or records which are of interest. With the BM data this involves mapping a database id, which can be retrieved from the object URL, to the internal id used in the collections.

So what are the right questions?

Structured data requires a structured query – fair enough. However what sort of useful or meaningful query can we formulate when the data, the schema used to represent the data, the identifiers used within the data, are all specified internally? In order to construct an access point into the data, it is helpful to have not just a common language, but a common (or at least public) identifier scheme; canonical ways of referencing the entities in the data, such as “Shakespeare” or “the Rosetta Stone”. Without knowing the appropriate URI or the exact textual form (is it “Rosetta Stone”, “The Rosetta Stone”, “the Rosetta stone”? would we get more results for “Rosetta”?) it is nigh on impossible to easily ask questions about the entity of a SPARQL endpoint.

So how is one supposed to use a SPARQL endpoint? It is not a good medium for asking general questions or performing wide-ranging searches of the data. Instead it seems like a good way to link up records from different informational silos (BM, BL, NLS, RSC…) that share a common identifier scheme. If we know the canonical name of a work (“Macbeth”) or the ISBN of a particular edition, then we can start to link up these disparate sources of data on such entities.

But the variety of translations, the plurality of editions (which will only increase) and other degrees of freedom make it hard to perform an exhaustive analysis/usage of the data. In the case of the BM, who might have unique objects we don’t know we want to see, the way to find them is through keyword search. It seems that only by going first through a search interface or other secondary resource can we identify the items we want to know about and how to refer to them.

What we have in common between different sources is the language or ontologies used to describe the schema (foaf, dc, etc) – but this is syntax rather than semantics; structure rather than content. To echo Ed Chamberlain’s comment, we have access to how data is described, but not so much to the data itself.

British Museum data

The approach we will use to harvest British Museum metadata related to Shakespeare is outlined below. It is essentially the same approach that Owen Stephens found workable in his post on SPARQL, and involves reference to a secondary authority (the BM collection search interface) to establish identifiers.

Conduct a search for “Shakespeare” in the collections interface.
Extract an object id from each result. The Rosetta Stone has the id 117631.

Find the corresponding collection id from SPARQL with this query:

SELECT * WHERE { 
   ?s <http://www.w3.org/2002/07/owl#sameAs> 
      <http://collection.britishmuseum.org/id/codex/117631> 
}

The result should be a link to metadata describing the object, and the object’s collection id (in this case YCA62958) can be extracted for use in further searches.
```
http://collection.britishmuseum.org/id/object/YCA62958
```
If there is a result, retrieve metadata about the object from the URL: http://collection.britishmuseum.org/description/object/YCA62958.rdf (or .html, .json, .xml)
If there is no result, scrape metadata from the object’s description page in the collections interface. There is plenty of metadata available, but it is far less structured than RDF, being distributed through HTML.

This last step looks like it will be quite common as many of the Shakespeare-related results are portraits or book frontispieces which have no collection id. I am not sure whether this is an omission, or because they are part of another object, in which case it will require further querying to resolve the source object (if that is what we want to describe).

Another difficulty is that although Owen found a person-institution URI for Mozart, I cannot find one for Shakespeare. There is a rudimentary biography but little else, so we do not have a “Shakespeare” identifier for use in SPARQL searches.

Conclusion

Ultimately I am still finding it non-trivial and a bit hacky to identify, and ask questions about, the real Shakespeare through a SPARQL endpoint.

Click here to view the embedded video.

In summary:

SPARQL endpoint providers could provide more documentation and examples.
RDF stores allow us to ask structural questions, but semantic questions are much harder without knowing some URIs.
It is often necessary to make use of a secondary resource or authority in order to identify the entities we wish to ask about.

EDINA Blogs

A Blogs.edina.ac.uk weblog