Searching: Results Ranking

The new SUNCAT interface is available at http://suncat.ac.uk/ – this became the main SUNCAT interface in March 2014 but the old interface is still available. This new interface is built on a different platform and will therefore exhibit some differences in behaviour. We have discussed some of these in previous technical blog posts, and updated help and support documentation will also clarify the changes. In this post we will give some attention to how search results are ranked according to their relevance to the search terms.

Relevancy

One of the features of the Solr search server, which we use to query the data, is that the results returned for a search each include a relevancy score, or rank.

“Relevancy is the quality of results returned from a query, encompassing both what documents are found, and their relative ranking (the order that they are returned to the user).”

The scores are normalised to fall on a scale between 0 and 1, but you don’t need to worry about them as we don’t show them to you – we just use them to determine the order in which results are displayed. You can read more about relevancy scoring at http://wiki.apache.org/solr/SolrRelevancyFAQ.

By default the returned results are listed in order of relevance, with the most relevant first. This is what is reflected in the position column. Note that while the other sortable columns can be ordered in ascending or descending order, the position column cannot be reversed to show the least relevant results first: clicking on the position column header always orders the results from most to least relevant.

Results table header

Clicking on the position column orders results by relevance; they can only be ordered from most to least relevant.

We have defined relevancy so that things like punctuation and capitalisation don’t affect a result’s score.

Boosting

Boosting allows us to modify scores, so that matches on a particular search index (field) carry more weight than matches on others.

We can boost the importance to the search of a particular search field, or of particular documents when we put them into Solr, or of a particular clause within a query used to search the data. SUNCAT currently performs a variety of boosting:

  • Boost a result (significantly) if the search term matches exactly.
  • Boost a result where the search terms occur close together (within 3 words of each other).
  • When searches are made on the Title Keywords field, results are boosted if the search terms occur in the 245 MARC field (Title Statement), particularly any of the sub-fields $a (Title), $b (Remainder of title), $n (Number of part/section of a work) or $p (Name of part/section of a work).

So for example, searching for “Journal hellenic studies” in the Title Keywords field would produce results including Journal of Hellenic Studies as expected, and also Archaeological reports (which has “Journal of Hellenic Studies” in the Added Title field). However the former would appear higher up in the results because the search terms occur in the main title in the 245 field.
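
By way of illustration, this kind of field and phrase boosting can be expressed with Solr’s eDisMax query parser. The sketch below is purely illustrative, assuming a hypothetical Solr endpoint and field names (title_main, title_added); it is not SUNCAT’s actual configuration.

    import requests

    # Hypothetical Solr endpoint and field names -- illustrative only, not
    # SUNCAT's actual configuration.
    SOLR_URL = "http://localhost:8983/solr/suncat/select"

    params = {
        "defType": "edismax",
        "q": "journal hellenic studies",
        # qf: weight matches in the main title (245 $a/$b/$n/$p) more heavily
        # than matches in added/other title fields.
        "qf": "title_main^5 title_added^1",
        # pf: boost documents where the whole set of terms occurs as a phrase.
        "pf": "title_main^10",
        # ps: allow the terms to be up to 3 positions apart and still count as a phrase.
        "ps": 3,
        "fl": "id,title_main,score",
        "rows": 10,
        "wt": "json",
    }

    response = requests.get(SOLR_URL, params=params)
    for doc in response.json()["response"]["docs"]:
        print(doc["score"], doc.get("title_main"))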

Examples

Here are some sample searches for British trees; first, a search for records with any of the words “British” and “trees”.

Results for "British trees" (any)

Search results for any of the words “British trees”. There are over 36,000.

There are more than 36,000 results. From around the 300th result onwards you will see many results which have been returned simply because they contain the word “British”, and which have little to do with trees. This is possibly not what you were interested in, but much like using a web search engine, you can ignore the results at the end because the most relevant ones are shown first. You aren’t forced to refine your search, though you can if necessary.

If we search for records with all (both!) of the words “British” and “trees”, we will get fewer results:

Results for "British trees" (all)

Search results for all of the words “British trees”.

There are only two results that include both words. The Basic search feature uses this interpretation by default, searching for all the specified terms.

You could search for the quoted phrase “British trees”, but this produces no results as the exact phrase does not occur anywhere in the data:

Results for "British trees" (quoted phrase)

There are no search results for the exact phrase “British trees”.
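
For the curious, the three interpretations above (any, all, exact phrase) correspond to different query forms at the Solr level. A minimal sketch, assuming a hypothetical Solr endpoint and a title_keywords field rather than SUNCAT’s actual setup:

    import requests

    SOLR_URL = "http://localhost:8983/solr/suncat/select"  # hypothetical endpoint

    queries = {
        "any":    "title_keywords:(British OR trees)",    # records with either word
        "all":    "title_keywords:(British AND trees)",   # records with both words
        "phrase": 'title_keywords:"British trees"',       # records with the exact phrase
    }

    for mode, q in queries.items():
        resp = requests.get(SOLR_URL, params={"q": q, "rows": 0, "wt": "json"})
        print(mode, resp.json()["response"]["numFound"], "results")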

Another aspect which affects the scoring of results is whether a search term has been stemmed. For example, when you enter the word “British”, it will be stemmed so that Solr will look for variations on it, such as “Brit” and “Britain”. Matches on the variations will have less influence over the score than precise matches to “British”.

Conclusion

It can be hard to unravel exactly why one record scores higher than another, because of the variety of factors and weightings that go into the calculation. Relevancy can be affected by the exactness of word matches, their frequency, how similar the words in the record are to the search terms, how close together they occur, which fields they appear in, and various other factors that can be brought to bear on the scoring algorithm.

In deciding what aspects of the results should be considered most important, it is necessary to make trade-offs. The challenge is to make the results as intuitively sensible as possible, but it is not always possible to infer and reflect the exact intentions of the user – and sometimes particular combinations of boosting and searching on particular fields may give apparently counter-intuitive positioning to some results. Search algorithms are inherently heuristic and are an attempt to provide meaningful results to a simple query. In general, the more accurate and complete the underlying MARC records, the better the resulting scoring will be, much like trying to raise a website’s profile in a search engine.

The Advanced search feature provides more options, and more control over how search terms are interpreted, so that you can really pin down what you are searching for – but the basic search should in most cases provide a quick and effective doorway to the wealth of information in SUNCAT!

SPARQL – What’s up with that?

The title of this post is intended to convey a touch of bewilderment through use of a phrase from the Cliff Clavin school of observational comedy.

Linked data and SPARQL

In the linked data world, SPARQL (SPARQL Protocol and RDF Query Language) is touted as the preferred method for querying structured RDF data. In recent years several high profile institutions have worked very hard to structure and transform their data into appropriate formats for linked data discovery and sharing, and as part of this, many have produced RDF triple (or quadruple) stores accessible via SPARQL endpoints – usually a web interface where anyone can type and run a SPARQL query in order to retrieve some of that rich linked data goodness.

This is admirable, but I have to admit to having had little success getting something out of SPARQL endpoints that I would consider useful. Every time I try to use a SPARQL facility I find I do better by scraping data from search results in the main interface. I have also increasingly become aware that I am not the only one to find it difficult.

RDF stores are different from relational databases; they are not so amenable to performing a search over the values of a particular field. Nor are they as flexible as text search databases like Solr. Instead they record facts relating entities to other entities. So it is important that, as consumers of the data, we know what kinds of questions make sense and how to ask them in a way that yields useful results and does not strain the SPARQL endpoint unduly. If these are not the kinds of questions we want to ask, then we might need to question the use of SPARQL as the de facto way of accessing RDF triple stores.
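
One practical way to discover what questions make sense is to ask the endpoint which predicates its data actually uses, before attempting anything more ambitious. A minimal sketch, assuming a generic SPARQL endpoint (the URL is a placeholder) that accepts a query parameter and can return SPARQL JSON results:

    import requests

    ENDPOINT = "http://example.org/sparql"  # placeholder endpoint URL

    # List some of the predicates in use, to get a feel for the ontologies behind
    # the data. Even this can be slow on a very large store, so keep the LIMIT modest.
    QUERY = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100"

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["p"]["value"])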

I’d like to point out that my aim here is not to complain or to disparage SPARQL in general or anybody’s data in particular; I think it is fantastic so many institutions with large archives are making efforts to open up their data in ways that are considered best practice for the web, and for good reasons. However if SPARQL endpoints turn out to be flawed or inadequately realised, they will not get used and both the opportunity to use the data, and the work to produce it, will be wasted.

Problems with SPARQL endpoints

These are the problems I have commonly experienced:

  • No documentation of available vocabularies.
  • No example queries.
  • No access to unique identifiers so we can search for something specific.
  • Slowness and timeouts due to writing inefficient queries (usually without using unique ids or URIs).
  • Limits on the number of records which can be returned (due to performance limits).

Paraphrasing Juliette Culver’s list of SPARQL Stumbling Blocks on the Pelagios blog, here are some of the problems she experienced:

  • No query examples for the given endpoint.
  • No summary of the data or the ontologies used to represent it.
  • Limited results or query timeouts.
  • SPARQL endpoints are not optimised for full-text searching or keyword search.
  • No link from records in the main interface to the RDF/JSON for the record. (This is mentioned in relation to the British Museum, who provide a very useful search interface to their collection, but don’t appear to link it to the structured data formats available through their SPARQL endpoint.)

Clearly we have experienced similar issues. Note that some of these are due to the nature of RDF and SPARQL, and require a reconception of how to find information. Others are instances of unhelpful presentation; SPARQL endpoints are generally pretty opaque, but this can be alleviated by providing more documentation. With the amount of work it takes to prepare the data, I am surprised by how few providers accompany their endpoints with a clear list of the ontologies they use to represent their data, and at least a handful of example queries. This takes a few minutes but is invaluable to anybody attempting to use the service.

Nature provide the best example I have seen of a SPARQL endpoint, with a rich set of meaningful example queries. Note also the use of AJAX to give (minimal) feedback while a query is running, and to keep the query on the results page.

Confusion about Linked Data

A blog post by Andrew Beeken of the JISC CLOCK project reports dissatisfaction with SPARQL endpoints and linked data, and provoked responses from other users of linked data:

“What is simple in SQL is complex in SPARQL (or at least what I wanted to do was) … You see an announcement about Linked Data and don’t know whether to expect a SPARQL endpoint, or lots of embedded RDF.” Chris Keene

“SPARQL seems most useful for our use context as a tool to describe an entity rather than as a means of discovery.” Ed Chamberlain

Chris’ point gives another perspective on linked data in general – what does it mean to provide linked (or should that be linkable) data, and how do we use it? Embedded RDF (RDFa) is good in that it tends to provide structured data in context, enriching a term in a webpage in a way that is invisible by default but that people can consume if they choose to. Ed indicates a fact about RDF as a data storage medium: it is a method of representing facts about entities which are named in an esoteric way; it is not structured in a way that is ideal for the freer keyword searching or SQL-style queries that we are used to.

Owen Wilson suggests the Linked Data Book‘s section 6.3 which describes approaches to consuming linked data, describing three architectural patterns. It looks worth a read to get one thinking about linked data in the right way.

Unique identifiers

“My native English, now I must forego” Richard II, Act 1, Scene 3

One of the tenets of linked data is that each object has a unique identifier. If we are looking for “William Shakespeare” we must use the URI or other identifier that represents him in the given scheme, rather than the string “William Shakespeare”. It is thus also necessary that we have an easy way to access the unique identifiers used in the data, so that we can ask questions about a specific entity without forming a fuzzy, complex and resource-consuming query. The British Museum publicises its controlled terms, that is, the limited vocabulary it uses in describing its collection, along with authority files, which provide the canonical versions of terms and names (standardised in spelling, capitalisation and so on), and thesauri, which map synonymous terms to their canonical equivalents. These terms are used in describing object types, place names and so on, supporting consistency in the collections data. They are all available via the page British Museum controlled terms and the BM object names thesaurus. Armed with knowledge of what words are used in particular fields to categorise or describe entities in the data, and similarly with a list of ids or canonical names for things, we can then start to form structured queries that will yield results.
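
Once we do hold a URI for an entity, asking about it directly is straightforward. The sketch below shows the shape of such a query; the endpoint address is assumed and the entity URI is a placeholder, not a real British Museum identifier.

    import requests

    ENDPOINT = "http://collection.britishmuseum.org/sparql"  # endpoint address assumed

    # Placeholder URI standing in for an identifier taken from a controlled
    # vocabulary or authority file -- not a real British Museum identifier.
    ENTITY = "http://collection.britishmuseum.org/id/thing/EXAMPLE"

    QUERY = f"""
    SELECT ?p ?o
    WHERE {{ <{ENTITY}> ?p ?o }}
    LIMIT 100
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["p"]["value"], "->", row["o"]["value"])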

Shakespeare and the British Museum

I have looked in particular at the British Museum’s SPARQL endpoint as an example, because the BM is a project partner and because its collection has several items germane to Will’s World. To start with, the endpoint gives some context: a basic template query is included in the search box, which can be run immediately and which implicitly documents all the relevant ontologies by pulling them in to define namespaces. There is a Help link giving some idea of how the data is represented and how it can be accessed and referenced using URIs. All of this is good, and I found it easy to get started with the endpoint.

However, before long I came up against the problem I’ve had with other endpoints, namely that it is difficult to perform a keyword search, or at least a multi-stage search, in order to (a) resolve a unique identifier for a keyword or named thing and then (b) retrieve information about or related to that thing. In this case I found a way to achieve what I needed by supplementing my use of the SPARQL endpoint with keyword searches of the excellent Collection database search – and with some help from the technical staff at the BM to resolve a couple of mistakes in my queries, I can now harvest metadata about objects related to the person “William Shakespeare”.

It is reassuring to find out I am not alone in having difficulty retrieving and using SPARQL data. I followed Owen Stephens’ blog post about the British Museum’s endpoint with interest. Owen found the CIDOC CRM data model hard to query due to its (rich, but thereby counter-intuitive) multi-level structure. Additionally, he encountered the common issue that it is very difficult to search for data containing or “related to” a particular entity which, to start with, is represented merely by a string literal such as “William Shakespeare”:

The difficulty of exploring the British Museum data from a simple textual string became a real frustration as I explored the data – it made me realise that while the Linked Data/RDF concept of using URIs and not literals is something I understand and agree with, as people all we know is textual strings that describe things, so to make the data more immediately usable, supporting textual searches (e.g. via a solr index over the literals in the data) might be a good idea.

Admittedly, RDF representations and SPARQL are not really intended to provide a “search interface” in the sense to which most users are accustomed. But from the user’s perspective, there must be an easy way to start identifying objects about which we want to ask questions, and this tends to start with performing some kind of keyword search. It is then necessary to identify the ids representing the resulting objects or records which are of interest. With the BM data this involves mapping a database id, which can be retrieved from the object URL, to the internal id used in the collections.

So what are the right questions?

Structured data requires a structured query – fair enough. However, what sort of useful or meaningful query can we formulate when the data, the schema used to represent it, and the identifiers used within it are all specified internally? In order to construct an access point into the data, it is helpful to have not just a common language but a common (or at least public) identifier scheme: canonical ways of referencing the entities in the data, such as “Shakespeare” or “the Rosetta Stone”. Without knowing the appropriate URI or the exact textual form (is it “Rosetta Stone”, “The Rosetta Stone”, “the Rosetta stone”? would we get more results for “Rosetta”?) it is nigh on impossible to ask a SPARQL endpoint questions about the entity.
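
Without a known URI or canonical string, the usual fallback is a regular-expression filter over the literals in the store, which is exactly the kind of fuzzy, resource-consuming query described above. A sketch, with a placeholder endpoint and assuming labels are recorded with rdfs:label:

    import requests

    ENDPOINT = "http://example.org/sparql"  # placeholder endpoint URL

    # A case-insensitive regex over every label in the store: easy to write,
    # hard to get right ("Rosetta Stone"? "the Rosetta stone"?), and liable to
    # time out on a large triple store.
    QUERY = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label
    WHERE {
      ?s rdfs:label ?label .
      FILTER regex(str(?label), "rosetta", "i")
    }
    LIMIT 25
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["s"]["value"], "-", row["label"]["value"])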

So how is one supposed to use a SPARQL endpoint? It is not a good medium for asking general questions or performing wide-ranging searches of the data. Instead it seems like a good way to link up records from different informational silos (BM, BL, NLS, RSC…) that share a common identifier scheme. If we know the canonical name of a work (“Macbeth”) or the ISBN of a particular edition, then we can start to link up these disparate sources of data on such entities.

But the variety of translations, the plurality of editions (which will only increase) and other degrees of freedom make it hard to make exhaustive use of the data. In the case of the BM, which may hold unique objects we don’t yet know we want to see, the way to find them is through keyword search. It seems that only by going first through a search interface or other secondary resource can we identify the items we want to know about and how to refer to them.

What we have in common between different sources is the language or ontologies used to describe the schema (foaf, dc, etc) – but this is syntax rather than semantics; structure rather than content. To echo Ed Chamberlain’s comment, we have access to how data is described, but not so much to the data itself.

 

British Museum data

The approach we will use to harvest British Museum metadata related to Shakespeare is outlined below. It is essentially the same approach that Owen Stephens found workable in his post on SPARQL, and involves reference to a secondary authority (the BM collection search interface) to establish identifiers.

  1. Conduct a search for “Shakespeare” in the collections interface.
  2. Extract an object id from each result. The Rosetta Stone has the id 117631.
  3. Find the corresponding collection id from SPARQL with this query:
    SELECT * WHERE { 
       ?s <http://www.w3.org/2002/07/owl#sameAs> 
          <http://collection.britishmuseum.org/id/codex/117631> 
    }
  4. The result should be a link to metadata describing the object, and the object’s collection id (in this case YCA62958) can be extracted for use in further searches.
    http://collection.britishmuseum.org/id/object/YCA62958
  5. If there is a result, retrieve metadata about the object from the URL: http://collection.britishmuseum.org/description/object/YCA62958.rdf (or .html, .json, .xml)
  6. If there is no result, scrape metadata from the object’s description page in the collections interface. There is plenty of metadata available, but it is far less structured than RDF, being distributed through HTML.

This last step looks like it will be quite common as many of the Shakespeare-related results are portraits or book frontispieces which have no collection id. I am not sure whether this is an omission, or because they are part of another object, in which case it will require further querying to resolve the source object (if that is what we want to describe).
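
Steps 3 to 5 above can be strung together programmatically. Below is a rough sketch of that part of the process, using the endpoint, URI patterns and example ids given in the steps (the endpoint address is assumed, and the collection search and scraping in steps 1, 2 and 6 are omitted):

    import requests

    ENDPOINT = "http://collection.britishmuseum.org/sparql"  # endpoint address assumed

    def collection_uri_for(object_id):
        # Steps 3 and 4: resolve a collections-database object id to the
        # corresponding collection URI via owl:sameAs.
        query = f"""
        SELECT * WHERE {{
           ?s <http://www.w3.org/2002/07/owl#sameAs>
              <http://collection.britishmuseum.org/id/codex/{object_id}>
        }}
        """
        resp = requests.get(
            ENDPOINT,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
        )
        bindings = resp.json()["results"]["bindings"]
        return bindings[0]["s"]["value"] if bindings else None

    def object_rdf(collection_uri):
        # Step 5: retrieve RDF describing the object.
        collection_id = collection_uri.rsplit("/", 1)[-1]  # e.g. YCA62958
        url = f"http://collection.britishmuseum.org/description/object/{collection_id}.rdf"
        return requests.get(url).text

    uri = collection_uri_for("117631")  # object id extracted in step 2
    if uri:
        print(object_rdf(uri)[:500])
    else:
        print("No collection id found; fall back to scraping the description page (step 6).")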

Another difficulty is that although Owen found a person-institution URI for Mozart, I cannot find one for Shakespeare. There is a rudimentary biography but little else, so we do not have a “Shakespeare” identifier for use in SPARQL searches.

Conclusion

Ultimately I am still finding it non-trivial and a bit hacky to identify, and ask questions about, the real Shakespeare through a SPARQL endpoint.


In summary:

  • SPARQL endpoint providers could provide more documentation and examples.
  • RDF stores allow us to ask structural questions, but semantic questions are much harder without knowing some URIs.
  • It is often necessary to make use of a secondary resource or authority in order to identify the entities we wish to ask about.


New Title List Features in version 1.55

The Title List function in your LOCKSS box allows you to report on the holdings of your LOCKSS box or the content available in the LOCKSS network as a whole. The latest release of the LOCKSS software includes some new options, and this post provides a general guide to using the Title List feature.

Please see the post View your holdings with LOCKSS titles export for an introduction to the Title List feature.

Title List Main Screen

Title List main screen

Basic Report Options

The main Title List screen presents you with several options for the report, described below. Note that the defaults will produce a standard KBART report, but these options allow you to customise that to better suit your needs.

Scope

This option allows you to specify the scope of the data you want to report on:

  • Available – all titles available in the LOCKSS network. This is all the content LOCKSS users can select from.
  • Configured – the titles that you have configured for collection on your institution’s machine using Add Titles.
  • Collected – the titles that have been successfully collected on your machine. A title must be fully collected before it will appear in this list; partial completion is not enough.

Type

This option allows you to filter the titles to include only journals or books if you need to:

  • Journals – show only journal titles
  • Books – show only book titles (those that have an ISBN)
  • All – show all titles

Data Format

This option specifies the format of the data that is output from the reporting tool; the default Title Ranges (KBART) format displays a line for every complete range in a title, while the other options allow you to consolidate these into a single line per title.

The KBART coverage_notes field will contain a description of the complete ranges covered by each row, thereby documenting where the coverage gaps are in the consolidated formats.

  • Title Ranges – show a row for each unbroken range within a title (this is the default KBART format).
  • Titles – show a row for each title, consolidating all the individual unbroken ranges into a single range using the outermost start and end volumes. Note that the single range specified in this format may contain coverage gaps.
  • SFX DataLoader – show a row for each title, with the coverage_notes field listing coverage ranges in the SFX DataLoader format.

Output Format

This option specifies the digital format in which the report will be produced:

  • TSV – tab-separated format
  • CSV – comma-separated format
  • On-screen – the report will be turned into HTML and displayed on-screen

Having selected these options, press the List Titles button to generate the report.

A Title List report showing the customise option

The above shows the top of an HTML report. Notice there is a link to return to the main Title List page, and a button for customising the report you have generated.

Customisation Options

To provide finer-grained control over what gets into your report, there are customisation options allowing you to specify which fields to output and in what order.

Select the Customise Fields button either on the Title List screen or the HTML report screen. You will see these extra options appear at the bottom of the Title List screen:

Title List Customise

Title List customisation options

Coverage Range format

The default KBART output reports each unbroken range of a title on a separate line. Sometimes it is more useful to have a single line per title, with start and end points. The coverage_notes field is therefore used to elaborate which specific ranges are fully available within a title listed in a particular row. This field can display data in a number of formats:

  • Year Ranges – a comma-separated list of ranges, e.g. 1996-2002, 2004-2006, 2009-
  • Year(Volume) Ranges – a comma-separated list of ranges including volume in parentheses e.g. 1996(1)-2002(7), 2004(9)-2006(11), 2009(14)-
  • Year Summary – a concise format showing only the start and end of the whole range, e.g. 1996-
  • Year(Volume) Summary – a concise format showing only the start and end of the whole range, e.g. 1996(1)-
  • SFX DataLoader – this is the format defined by ExLibris for use with their SFX DataLoader utility, e.g. $obj->parsedDate(">=",1996,1,undef) && $obj->parsedDate("<=",2002,7,undef)

Note that year ranges can have an empty end point, meaning they extend to the present. The summary formats are for convenience and do not indicate where there are coverage gaps. The SFX format represents the same information as the year(volume) format but in a more verbose format.
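
For anyone consuming the report programmatically, the Year(Volume) Ranges format is straightforward to parse, for example to locate coverage gaps. A small illustrative sketch (not part of the LOCKSS software):

    import re

    def parse_coverage_notes(notes):
        # Parse a coverage_notes value in Year(Volume) Ranges format, e.g.
        # "1996(1)-2002(7), 2004(9)-2006(11), 2009(14)-", into a list of
        # (start_year, start_vol, end_year, end_vol) tuples; an open-ended
        # range has None for its end points.
        point = r"(\d{4})\((\d+)\)"
        ranges = []
        for chunk in notes.split(","):
            m = re.match(rf"^{point}-(?:{point})?$", chunk.strip())
            if not m:
                continue
            sy, sv, ey, ev = m.groups()
            ranges.append((int(sy), int(sv),
                           int(ey) if ey else None,
                           int(ev) if ev else None))
        return ranges

    print(parse_coverage_notes("1996(1)-2002(7), 2004(9)-2006(11), 2009(14)-"))
    # [(1996, 1, 2002, 7), (2004, 9, 2006, 11), (2009, 14, None, None)]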

Field Ordering

The field ordering box lists all the default KBART fields in their default order. You can edit the contents of this box to change which fields appear in the output. The ordering will also be reflected in the output report. On the right-hand side is a list of all the valid field names – be sure to include at least one of the identifying fields or there will be no way to organise the title records!

Note that the first two fields will be used to sort the resulting list.

Omit empty columns

Some columns have nothing in them due to a lack of appropriate data. Select this option to automatically omit such columns and simplify the report.

Reset/Cancel

If you decide you don’t want to customise the output after all, or have made a mistake, you can select either:

  • Reset – to reset the field ordering list to its original values
  • Cancel – to cancel customisation and go straight (back) to the report output

Finally, press List Titles to produce the report, or Hide Customise Fields to skip the customisation and use only the basic report options of scope, type, data format and output format.

Direct URL access to reports

It is possible to access a range of reports directly without manually selecting options on the Title List screen. This is useful for those applications where an automated update of the LOCKSS box’s holdings is required, for example when updating a knowledge base periodically. The default report can be retrieved by requesting the URL

http://lockss.box:8081/Titles?format=tsv

There are five configuration options for direct URL reports, which can be specified as URL parameters. The first four correspond to the basic report options described earlier, while the fifth corresponds to the Coverage Range format customisation option:

  • format – output format: tsv, csv, html
  • scope – data scope: available, configured, collected
  • type – title type: journals, books, all
  • report – data format: kbart, titles, sfx
  • coverageNotesFormat – format for the coverage_notes field: year, year_volume, year_summary, year_volume_summary, sfx

The bare minimum is a format parameter. So for example a TSV report, in SFX report format, on titles configured in your LOCKSS box, can be retrieved from:

http://lockss.box:8081/Titles?format=tsv&report=sfx&scope=configured
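
As a sketch of how such a report might be consumed automatically, the following fetches a TSV report and reads it with Python’s csv module. The host, credentials and KBART column names are assumptions to adapt to your own box:

    import csv
    import io
    import requests

    # Replace the host, port and credentials with those of your own LOCKSS box;
    # the web UI normally requires a username and password.
    BASE = "http://lockss.box:8081/Titles"
    params = {"format": "tsv", "report": "kbart", "scope": "collected"}

    resp = requests.get(BASE, params=params, auth=("username", "password"))
    resp.raise_for_status()

    # The TSV report uses KBART column names such as publication_title.
    reader = csv.DictReader(io.StringIO(resp.text), delimiter="\t")
    for row in reader:
        print(row.get("publication_title"), row.get("print_identifier"))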

Using the report

There are also several common uses for the reports generated by the Title List feature:

  • Reporting what is available on the LOCKSS network. This can be used to populate a knowledge base in a link resolver such as SFX, or cross-referenced with an institution’s subscriptions to produce a list of what should be added to LOCKSS (a small sketch of this kind of cross-referencing follows the list).
  • Reporting what titles an institution has configured its LOCKSS box to collect. This can be used in institutional reporting on LOCKSS, or cross-referenced with subscriptions and available titles to find out what is missing from your configuration.
  • Reporting what has been successfully collected by LOCKSS. This report can indicate where there might be problems, or be used in institutional reporting.
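
As an example of the cross-referencing mentioned above, the following sketch compares an exported Available report and a Configured report against a plain list of subscribed ISSNs. The file names and the use of print_identifier (the KBART print ISSN column) are assumptions:

    import csv

    def issns(kbart_path):
        # Collect print ISSNs from a TSV KBART report exported via the Title List.
        with open(kbart_path, newline="", encoding="utf-8") as f:
            return {row["print_identifier"]
                    for row in csv.DictReader(f, delimiter="\t")
                    if row.get("print_identifier")}

    available = issns("available_titles.tsv")    # hypothetical file names
    configured = issns("configured_titles.tsv")

    with open("subscriptions.txt", encoding="utf-8") as f:
        subscribed = {line.strip() for line in f if line.strip()}

    # Subscribed titles offered by the network but not yet configured for
    # collection on this box.
    print(sorted(subscribed & (available - configured)))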

We plan to produce format options for other link resolver products as necessary, though the standard KBART report should be sufficient for many applications. After the current phase of user interface development, the reports should also be useful in configuring your collection list with less manual effort.

If you have any further use cases for these reports do let us know at edina@ed.ac.uk.


Understanding the Status of Titles

This post is intended to clarify the different classifications you might see for titles in your box, or titles which have been committed to the LOCKSS network.

Available, Configured & Collected

When you use the Title List feature in your LOCKSS box to report on its holdings, there are three options available to you that affect the scope of the report:

  • Available will include all titles available in the LOCKSS network. This is all the content LOCKSS users can select from.
  • Configured will report the titles that you have configured for collection on your institution’s machine using Add Titles.
  • Collected will report on the titles that have been successfully collected on your machine.

Note that each scope is a subset of the previous one.

Committed Titles

The LOCKSS website provides a spreadsheet listing the publishers who have committed to providing their content in the LOCKSS network. This is available on the page at http://www.lockss.org/community/publishers-titles-gln and is regularly updated.

Note that content that has been committed is not instantly available in the LOCKSS network. It must be scheduled to go through a preparation and testing process that includes (a) writing a plugin which knows how to collect the content and (b) testing that plugin to make sure the content collects properly. Once testing is complete, the content is released to the network – the regular emails detailing new releases are the end result of this process, which is perpetually being undertaken by a team of content testers at Stanford.

The spreadsheet is made by combining committed publisher data with a KBART report on Available titles from the Title List. It lists each title with the fields Publisher, Title, ISSN, eISSN, and the following fields indicate the status of each title:

  • Preserved Volumes lists the volumes that are preserved in the LOCKSS network
  • Preserved Years lists the years that are preserved in the LOCKSS network
  • In Progress Volumes lists the volumes that are committed but have not yet cleared the testing process
  • In Progress Years lists the years that are committed but have not yet cleared the testing process

Years and volumes are listed as one or more ranges separated by semi-colons. In some cases the data are not available for one reason or another. The fact that data are missing does not necessarily indicate that the title is not released.

We hope this outline of title statuses will aid institutions in making appropriate use of the reports and data available from LOCKSS boxes and the lockss.org website. Please direct any questions to edina@ed.ac.uk.


LOCKSS User Interface Requirements document

We are pushing on with preparations for User Interface (UI) development, and are about to get started on the first stages of coding. The UKLA Members’ Workshop on Thursday will include a presentation on the content of the proposed developments, and also a session for discussion of the requirements that have been identified.

In advance of this we are sharing the draft requirements document, which describes the different types of users of LOCKSS, common use cases which they encounter, the main features identified for implementation, and the work packages which will be undertaken.

Please feel free to record your comments using the link below.
