Mapping metadata to a common schema

Posted on October 30, 2012 by neil.mayo

One of the challenges in aggregating metadata is to unify the different schemata of the data sources so that the data can be searched through a single common schema. There are advantages to maintaining the original schema (field names) of each data source, so that it can be searched in the same way as in its original context (i.e. under the same field or fields). However, this complicates searches of the aggregated metadata when each data source represents similar information under a differently-named field, because a search across a range of essentially synonymous fields in a variety of resources is more complex to specify. It therefore makes sense to map similar fields onto a single field, within a schema which is specific to the collection of aggregated data.
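To make the idea concrete, here is a minimal Python sketch of a many-to-one field mapping. The target names (dc:rights, dc:creator) and the choice of which source fields map to them are assumptions for illustration only, not the actual Will’s World mappings.

```python
# Hypothetical mapping from source-specific field names to common fields.
# The target names are illustrative assumptions, not the real WW schema.
FIELD_MAP = {
    "rights": "dc:rights",
    "availability": "dc:rights",
    "copyright": "dc:rights",
    "creator": "dc:creator",
    "credit": "dc:creator",
}


def to_common_schema(record):
    """Return a new record keyed by common field names.

    Values from several source fields that map to the same target are
    collected into one list; fields without a mapping are dropped.
    """
    common = {}
    for name, value in record.items():
        target = FIELD_MAP.get(name)
        if target is None:
            continue
        common.setdefault(target, []).append(value)
    return common
```

A search on dc:rights then covers values that one source called “copyright” and another called “availability”.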

This post is about creating a schema for Will’s World and mapping other metadata into it. Despite the technical detail, it may illustrate some common stumbling blocks in mapping disparate information to a single representation.

Developing a Schema

We had to develop a schema for Will’s World (WW) to represent the metadata which is being imported into the text database. This schema has several requirements:

Should be capable of representing/unifying metadata from several different sources.
Must unify some of the variations on similar pieces of data (for example there is a range of fields dealing with attribution and ownership – including rights/availability/copyright; owner/credit/creator/contributor).
Should use existing ontologies as far as is appropriate/possible (for example, Dublin Core, TEI).

Once the schema is designed, it is necessary to define how the metadata we collect from disparate sources will fit into the (probably more restrictive) schema that we have elected to use in our representations. This involves establishing a clear mapping from the source schemata to the target schema. Additionally, we need more than one schema for Will’s World:

A schema for metadata harvested from services, which represents content within those services.
A schema for descriptions of services themselves (including those from which we do not harvest metadata).
As a bonus, we might wish to define one or more schemata for texts relating to Shakespeare.

This post concentrates on the metadata schema, which is the most important and elaborate. We have defined a list of fields with namespaced identifiers, and for each field we define several properties:

Internal – whether a field is for internal use only, and is invisible to the user (because, for example, it is meaningless outside of the application).
Repeatable – whether a field can appear multiple times. For example, properties like “genre”, “subject” or “author” may be multi-valued (or rather have multiple instances), while we expect a metadata item to have a unique and canonical title (although we accept zero or more alternative titles).
Required – whether a field is required (because it acts as an identifier of some sort, or is essential to the meaning/categorisation of the record).

For the purposes of supporting text search, we need to answer a couple of questions for each field:

Should the field be Indexed (searchable through Solr queries)?
Should the field be Stored (available to return from Solr for display in the interface)?
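The five properties described above can be held in a small record per field. The following Python sketch shows one way to do that and to check a record against the Repeatable and Required properties. The field names, including ww:source, and the default values are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDef:
    """One schema field and the properties discussed in this post."""
    name: str
    internal: bool = False
    repeatable: bool = False
    required: bool = False
    indexed: bool = True
    stored: bool = True


# Illustrative field list; these definitions are assumptions.
SCHEMA = [
    FieldDef("dc:title", required=True),
    FieldDef("dc:subject", repeatable=True),
    FieldDef("ww:source", internal=True, required=True, stored=False),
]


def validate(record, schema=SCHEMA):
    """Return a list of problems in a record ({field name: list of values})."""
    problems = []
    for fdef in schema:
        values = record.get(fdef.name, [])
        if fdef.required and not values:
            problems.append(f"missing required field {fdef.name}")
        if not fdef.repeatable and len(values) > 1:
            problems.append(f"field {fdef.name} is not repeatable")
    return problems
```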

Schemata

For reference while reading this blog post, you can refer to both the metadata schema and metadata mappings we have developed.

The schema for representing metadata about services is much simpler and records identifying metadata such as id, name, description, URL, provenance and usage/licensing information, and more specific metadata about how to use the service.

Collecting Data

The next stage is to collect, transform and ingest the data into Solr. This is a multi-stage process:

Collect the data – either in a well-structured format like RDF/JSON/XML/CSV, or contained in HTML.
Extract the data (if contained in an unstructured or variable format like HTML).
Transform the data from the source format into a format matching the WW schema.
Transform the WW metadata into a format matching the Solr import document schema.
Ingest the data into Solr, by posting it to the server.

Due to size and memory restrictions, any of the later stages, particularly the transformations and ingest, may involve splitting the documents into multiple (well-formed) documents.
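As a sketch of the last transformation and of the splitting step, the Python below turns records into Solr XML update messages (add/doc/field elements), with a limit on the number of records per message. The record shape (a dict of field name to list of values) and the batch size are assumptions; posting the result to the server is left out.

```python
import xml.etree.ElementTree as ET


def solr_add_documents(records, batch_size=1000):
    """Yield Solr XML update messages of at most batch_size records each.

    Every yielded string is a complete, well-formed <add> document.
    """
    for start in range(0, len(records), batch_size):
        add = ET.Element("add")
        for record in records[start:start + batch_size]:
            doc = ET.SubElement(add, "doc")
            for name, values in record.items():
                for value in values:
                    field = ET.SubElement(doc, "field", name=name)
                    field.text = value
        yield ET.tostring(add, encoding="unicode")
```

Because each batch is built as its own tree, no message is ever cut in the middle of an element.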

We have to take a variety of approaches to collecting the data, depending on what types are available. Only a relatively small proportion of providers actually make their data available in easy to collect structured formats, while in other cases the data is present directly in HTML pages, or as the result of searches through an HTML web interface.

Scraping data, which may involve two stages (when the metadata itself is available directly from known URLs, the two stages collapse into one):
1. Scrape information from webpages (for example, the results of conducting a search through a search interface).
2. Extract links from that information and follow them to find the metadata.
Collecting data directly in a structured format.
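The second scraping stage, extracting links from a results page so that they can be followed, might look like the following Python sketch. It uses only the standard library; the base URL and the page content in any real use are whatever the provider serves, and nothing here reflects a specific provider’s markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the links found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```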

Depending on the source of the data, it may be processed by scripts which construct structured data from unstructured or less structured data by parsing and recombining it, or by an XSLT script which performs a simpler and more direct mapping of existing fields to the fields of the schema. The latter may involve many-to-one mappings in addition to direct one-to-one mappings. One-to-many mappings should, however, be avoided: they duplicate data and probably indicate that the schema is not well defined.

Issues in Schema Design

In designing a metadata schema which is intended to unify metadata from sources with different concerns, we inevitably encounter certain issues which demand to be resolved in a consistent way.

Identifying fields

Naturally we need an identifying field so that every record can be distinguished and uniquely identified. We have two options for this – generate unique id fields (the easiest way would be to ask Solr to impose them), or map external values (from each source’s data) into an id field. The first option requires us to maintain unique identifiers upon updates, when we re-index data. The second option runs the risk of clashes. Identifying tuples or values sourced externally will vary a great deal and there is always the danger of non-uniqueness unless we are very careful about how we define them. It might be best to let Solr create a unique field value for indexing, and record an alternative ‘primary’ identifier from the identifying metadata of each particular data source. This might be considered a unique identifier within the records of the particular source.
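A different technique from the two options above, offered only as a sketch, is to derive the identifier deterministically from the source name plus that source’s own identifier. This keeps the value stable across re-indexing and keeps identical local identifiers from different sources apart. The source names used with it are hypothetical.

```python
import uuid


def record_id(source, local_id):
    """Derive a stable identifier from a source name and that source's own id.

    The same (source, local_id) pair always yields the same value.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}/{local_id}"))
```

The source’s own identifier would still be recorded in a separate field, as suggested above.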

Required fields

If the anticipated metadata informing a required field are not present in the source data, a default or “unknown” value must be given. This can be achieved by explicitly producing a default value from a script, or by providing a template in an XSLT transform for each required field, which matches when the source field is not present in a record. Here is an example enforcing the presence of a required dc:title field for a Solr import doc:

<xsl:if test="not(dc:title)">
  <field name="dc:title">Unknown</field>
</xsl:if>
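The script-based alternative mentioned above could be as simple as the following Python sketch, which assumes records are dicts mapping a field name to a list of values.

```python
def apply_defaults(record, required_fields, default="Unknown"):
    """Return a copy of record in which every required field has a value.

    A required field that is absent or empty receives the default value.
    """
    completed = {name: list(values) for name, values in record.items()}
    for name in required_fields:
        if not completed.get(name):
            completed[name] = [default]
    return completed
```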

Field values with subelements

What should we do when the source data that we want to insert as a field value incorporates tag structure of its own? We would like to maintain XML substructure where available as part of the value of the field, rather than discarding it. However, the Solr import schema does not accept substructure in its “field” elements, so we must enclose the content in a CDATA section. This is not easy using XSLT, unless we are willing to write templates to match each of the elements we want to use as subelements in the defined fields of the schema, and encode them appropriately. The solution is to pass the node which contains substructure to a named template which wraps it in a CDATA section in the output, like this:

<!-- Copy the current element and subelements to the output in a CDATA section -->
<xsl:template name="xmlsection">
  <!-- Use the current node by default -->
  <xsl:param name="node" select="."/>
  <xsl:text disable-output-escaping="yes"><![CDATA[ <![CDATA[ ]]></xsl:text>
  <xsl:copy-of select="$node"/>
  <xsl:text disable-output-escaping="yes"><![CDATA[]]]]><![CDATA[>]]></xsl:text>
</xsl:template>
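When the transformation is done in a script rather than in XSLT, the same effect can be had without CDATA by serialising the substructure and storing it as ordinary text, which the XML writer then escapes. An XML parser reports the same character data for escaped text as for a CDATA section. The following Python sketch assumes this is acceptable for the import; the field and element names are only examples.

```python
import xml.etree.ElementTree as ET


def field_with_markup(name, node):
    """Build a Solr <field> whose text is the serialised XML of node."""
    field = ET.Element("field", name=name)
    # The serialised markup becomes plain text; ElementTree escapes the
    # angle brackets and ampersands when the field itself is written out.
    field.text = ET.tostring(node, encoding="unicode")
    return field


node = ET.fromstring("<summary>A <i>short</i> note</summary>")
print(ET.tostring(field_with_markup("dc:description", node), encoding="unicode"))
```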

Aggregating multiple XML fields

As indicated earlier, there are several cases where we wish to map many source fields from a provider into one Will’s World field as subelements. An example of matching multiple fields at the same level which will contribute to the content of a single field is in National Library of Scotland data, where we want to match and combine the XML elements “summary”, “commentarytitle” and “commentary” into a single “dc:description” field. There are a few options for performing this aggregation; if we know that one of the source fields (say, “summary”) will always be available, we can match on it and then pull in values from the other fields as necessary using xsl:copy-of with a relative path in the select attribute. However, if the “summary” field is missing, the template will not match and we will lose the values of the other fields.

If we cannot be sure that any one of these fields will be available, that is, they are all optional, then we need to match every possible combination. Doing this via a number of similar templates is unnecessarily verbose and will also lead to duplicate output fields, as every one of the fields which is present will match and cause the output field to be written. The solution is to provide an OR-ed list of the possible combinations of these elements, so it will only match once and produce one instance of the output field. In the following simplified example, we want to add elements f1 and f2, which occur at the same level, and either of which may or may not be present, to a single output element – and only do it once. We therefore need to check for the presence of both elements, or of either element without the other:

<!-- Match f1 and/or f2 -->
<xsl:template match="f1[../f2]|f1[not(../f2)]|f2[not(../f1)]">
  <combined-field>
    <xsl:copy-of select="../f1" />
    <xsl:copy-of select="../f2" />
  </combined-field>
</xsl:template>

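Note that the match expression becomes increasingly complicated if we combine more than two potentially absent elements, and enumerating every combination leads to a combinatorial explosion. One way to avoid this is to match only the first candidate that is present, so that each alternative excludes the ones before it; for three elements this gives f1|f2[not(../f1)]|f3[not(../f1|../f2)], which grows by one alternative per element.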
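In a script the problem does not arise, because we can simply collect whichever of the candidate elements are present and write the output element once. A minimal Python sketch, assuming the candidates are direct children of a record element:

```python
import copy
import xml.etree.ElementTree as ET


def combine_fields(record, source_names, target_name):
    """Copy whichever of source_names are present into one target element.

    Returns None when none of the source elements exists, so the target
    is produced at most once per record. Children keep document order.
    """
    present = [child for child in record if child.tag in source_names]
    if not present:
        return None
    combined = ET.Element(target_name)
    combined.extend([copy.deepcopy(child) for child in present])
    return combined
```

The cost is linear in the number of candidate fields, whatever combination of them a record happens to contain.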

Issues in Mapping Metadata

Scraping data

We scrape data at the granularity of a page. Sometimes this yields metadata at different levels. For example, the BBC’s “Shakespeare’s Restless World” is a radio programme produced in conjunction with the British Museum, exploring an event or period in history through a particular object from the museum’s collections. The pages describing each show tend to contain metadata about the show, and also metadata about the object around which the show is centred. These can either be mapped to a description of a single entity representing “the programme about the object”, or the metadata can be recorded independently for both the programme and the object.

How to record the various metadata at different levels?

In this case we conceptualise the object as the resource being described; metadata about the programme (title, description, URL to a podcast) will be recorded in a record describing the programme, while supplementary detail (background/object facts, quotes) will be incorporated into the text description (using appropriate tags) and also distributed where possible into a record that describes the object. A page listing objects by play is scraped separately in order to supply associated plays to each programme/object.

Summary

Aggregating a wide range of metadata by mapping its values to the fields of a single common target schema renders it searchable in a unified interface. The mapping effort does however throw up a number of issues, some of which have been described in this post along with potential solutions. In particular:

It is necessary to be clear on what one is describing with a particular schema – a record, or a service providing a description of a record.
Collection and transformation must be performed carefully, with sensitivity to the requirements of the schema; we cannot rely on receiving or extracting data of the correct type or structure.
We should try to maintain as much as we can of the structure of the source metadata, so as to minimise the loss of information inherent in shoehorning data into a different representation.

T-Rex and Utahraptor point out that “Shakespeare needs tons of notes to be readable today” – sometimes information and cultural artefacts need to be mapped into a single common language and supported by supplementary metadata. (Is that a weak segue into a Dinosaur Comics link?)

Creating an XML Schema from a Table Description

I found that developing the schema was most easily done in a spreadsheet or table format, so I wrote scripts to transform that specification (at least partially) into (a) an XML schema and (b) an HTML layout of the same. To produce a schema, we produce elements with minOccurs and maxOccurs attributes. If the field is required, minOccurs is 1, otherwise it is 0. If the field is repeatable, maxOccurs is “unbounded”, otherwise it is 1. Here is the script for generating a schema file from a CSV description of the WW fields.

Disclaimer: this script is quick and dirty, non-validating, and WW-specific. Use it at your own risk; or better, adapt it!
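For readers who want to see the minOccurs/maxOccurs rule in action without the original script, here is an independent Python sketch. It is not the script referred to above. It assumes a CSV with a header row and columns named name, required and repeatable, the last two holding “yes” or “no”, and it produces only the sequence of element declarations rather than a complete schema file.

```python
import csv
import io
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def schema_elements(csv_text):
    """Build an xs:sequence of element declarations from a CSV field table.

    Required fields get minOccurs 1, others 0; repeatable fields get
    maxOccurs "unbounded", others 1.
    """
    ET.register_namespace("xs", XS)
    sequence = ET.Element(f"{{{XS}}}sequence")
    for row in csv.DictReader(io.StringIO(csv_text)):
        ET.SubElement(
            sequence,
            f"{{{XS}}}element",
            name=row["name"],
            minOccurs="1" if row["required"] == "yes" else "0",
            maxOccurs="unbounded" if row["repeatable"] == "yes" else "1",
        )
    return sequence
```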

“I shall lose my life for want of language”: Shakespeare and digital formats

Posted on July 30, 2012 by neil.mayo

The Will’s World registry is intended to record three categories of data:

Metadata about services
Metadata harvested or contributed from those services
Annotated plays

The first will describe the services that are available, and will support searching for services that have specific features, and that produce particular types of data about aspects of Shakespeare’s work, history and contemporaries. These data will be contributed by a service during registration; and in the initial stage of the project will be added manually.

The second category will be metadata that we have retrieved from services through queries or scraping from HTML or search interfaces, and will be searchable as an aggregated metadata resource.

The third type of data is different because it involves storing target data rather than just metadata – there will be the actual structure and content of the literary resource. Several questions arise: How will the data be searched, retrieved and visualised? Will it be possible to attach metadata search results to it? Which language will be used to mark up the text?

Why store data as well?

Will’s World should be a useful access point for a variety of Shakespeare resources. It should be possible to run a search and retrieve metadata on items from a range of services. What will a user then do with this metadata? The answer to this question is something we don’t know and do not intend to dictate, but one likely use is to link the metadata to the text of the plays. A marked up electronic text can be seen as a backbone which a user might develop with multimedia resources from a number of services, to produce anything from an enriched script for theatre or teaching use, to an online application linking plays to performances, to a mobile phone app. Some examples of what developers and creative types can devise with little more than basic marked-up text were produced at the recent hackday.

What about formats and schemas?

There are a handful of marked up versions of Shakespeare’s plays:

Jon Bosak provides a full set of annotated Shakespeare as an XML test package, which is potentially a great starting point with a basic DTD.
Shakespeare play schema by Susan Kelsch. This appears to be quite comprehensive on representing aspects of Shakespeare plays (including the major play groupings), to the extent that it is not generic enough for any other application. I am not aware of the DTD being put into use in any data.
Open Source Shakespeare (OSS) provides comma-delimited, tilde-demarcated database fields representing the Globe Shakespeare. We converted this into XML using a simple Play-Act-Scene-Paragraph-Line hierarchical schema, which provides an easy-to-understand version of the play structure. The CharID element within each Paragraph indicates the speaker of the speech.
Perseus has plays from the Globe Shakespeare, encoded using TEI and available under Creative Commons 3.0.

Jon Bosak’s XML is useful to have but, as he says in his caveat, it “should not be relied upon for scholarly purposes” and is intended “purely as a learning exercise … a benchmark … and as a resource for testing”. The other schemas are somewhat ad hoc and very Shakespeare-specific. It is good that they are tailored closely to their purpose, but it also makes them non-transferable. Plays by different authors are marked up using subtly or markedly different schemes and become mutually non-comparable, or at least comparable only with difficulty. At the same time it makes it unlikely that such schemas will find a wider audience unless they really describe Shakespeare better than any other option out there.

The XML that we generated from the OSS texts retains typographical elements that should be encoded into TEI, such as square brackets indicating that a line is a stage direction. It fulfilled the requirements of the Culture Hack event and provided a usable baseline of marked up text for the participants, but by its nature it has limited usage unless people find it more generally useful.
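As an illustration of what encoding such typography would involve, the Python sketch below turns a line that is wholly enclosed in square brackets into a TEI-style stage element and anything else into an l (line) element. It assumes the brackets always wrap the entire line, which real texts will not always honour.

```python
import re
import xml.etree.ElementTree as ET

# A line consisting entirely of bracketed text is treated as a stage direction.
STAGE_DIRECTION = re.compile(r"^\[(.*)\]$")


def encode_line(text):
    """Return a <stage> element for a bracketed line, otherwise an <l>."""
    stripped = text.strip()
    match = STAGE_DIRECTION.match(stripped)
    if match:
        element = ET.Element("stage")
        element.text = match.group(1).strip()
    else:
        element = ET.Element("l")
        element.text = stripped
    return element
```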

The question: is it better to use a schema developed specifically for Shakespeare’s plays (or plays in general), or something more general like TEI?

Literature Markup Languages (LML and XLDL)

Literature Markup Language and its predecessor and foundation, Literature Description Language, are attempts to provide an XML markup specifically for literature. They are designed to cover literature at all levels: pamphlets, prose, poetry, plays, criticism. These MLs look promising because they are pitched at a level wider than the works of a specific author, but within a particular field of writing that exhibits a range of common properties, such as figurative language, a variety of structures (some very well defined, such as haiku; others looser) and a range of utterance.

It is encouraging that there are efforts to define the appropriate terminology and structural elements of literature. The LML definition provides for act, scene and genre elements, and the ability to specify the tone of utterances. It is intended to use RDFa to enhance the semantics available in XHTML elements. It should be of much wider application than Shakespeare, though it is not clear whether it is capable of representing complex non-hierarchical structures.

It looks like the creator of LML, Dr Olaf Hoffmann, has thought carefully about the diverse requirements for representing literary works. However it seems an omission that there is no mention of the Text Encoding Initiative (TEI), whose long-lived and widely-used standard makes several recommendations for its application to dramatic and literary texts. I must also admit that I don’t understand why the rationale for LML’s creation is so closely linked with the vector graphics format SVG – I’m not sure of Dr Hoffmann’s use case for marking up text in SVG files.

These schemas have apparently shown little development in the past few years and I am not aware of any projects that have applied the languages.

Is LML/XLDL more appropriate than TEI?

LML contains hard definitions of what it considers the appropriate hierarchical levels of plays to be – namely act, scene, speech and line. While these are more or less universal in drama and are certainly appropriate for Shakespeare’s plays, they are not essential. For example, the act is a typically western construct, and modern literature regularly plays with the structural elements of literature. Thus defining the levels as elements is potentially limiting the application of LML.

TEI does not define scene and act elements, but rather keeps the elements abstract and specifies their role through a type attribute. This allows for more possibilities but can render the markup somewhat verbose and less readable, and subsequent querying more convoluted. TEI is sprawling; it attempts to cover so many possibilities that its flexibility tends to render it too complex and too vague for many applications, and the ways it is applied so disparate that it can still be hard to compare marked up texts like-for-like.
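The effect on querying can be seen in a small Python sketch. The fragment is a hand-made, TEI-like example (no namespace, minimal content) rather than an excerpt from any real edition.

```python
import xml.etree.ElementTree as ET

# Hand-made TEI-like fragment: generic div elements typed by an attribute.
TEI_FRAGMENT = """
<body>
  <div type="act" n="1">
    <div type="scene" n="1"><sp><speaker>A</speaker><l>First line</l></sp></div>
    <div type="scene" n="2"><sp><speaker>B</speaker><l>Second line</l></sp></div>
  </div>
</body>
"""

body = ET.fromstring(TEI_FRAGMENT)
# Scenes are selected by attribute value rather than by element name.
scenes = body.findall(".//div[@type='scene']")
scene_numbers = [scene.get("n") for scene in scenes]
```

With dedicated elements the query would be just .//scene, which is the readability trade-off described above.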

Although TEI has its limitations, it does have provenance and a large and active community.

Text Encoding Initiative (TEI)

Electronic Textual Editing: Drama Case Study: The Cambridge Edition of the Works of Ben Jonson by David Gants provides a good discussion of some of the challenges involved in efforts to mark up dramatic texts, and how TEI attempts to handle them. See in particular the section on Encoding Drama. Quotes in this section are taken from this article.

TEI provides a set of guidelines for encoding texts at a semantic level, which can be turned into a contextually-appropriate representation when necessary. Within those guidelines it describes “encoding strategies designed specifically for drama”. This allows more detail to be encoded than the baseline of straight act/scene/line markup, for example the shifting perception of a character. Attributes of a speaker can be modified through the course of the play, and characters can share lines. Even Ben Jonson’s plays provide a wide range of complex structures which can be difficult to represent properly. Typographical schemes were devised in the sixteenth century for representing these features on the page, but they can be difficult to transform into a structural representation in XML, which is essentially hierarchical.

The broad hierarchical features common to many plays are pretty straightforward to represent – acts, scenes and lines; speeches and speakers; individuals and chorus. However sometimes these features can overlap or otherwise become complicated. A handful of real-life textual features that TEI aims to support are:

The difference between the character/actor speaking and who is perceived to be speaking. For example “plays that deal with English historical subjects will often alter the speech prefix assigned a character as the title and status of that character changes, such as Bolingbroke/Henry IV or Gloucester/Richard III.”
Simultaneity and interweaving of utterances. The case study describes markup allowing the reconstruction of a letter (a fictional entity in the imagined world of the play, or a prop on stage) whose contents are read out, distributed over several character utterances with speeches overlapping. With appropriate markup it is then possible to extract objects/structures from the text.
Rhyme scheme <rhyme> and meter (the met attribute).
Stage directions <stage> and character movement <move>.
Poetic sub-structures (stanzas, line groups, verse paragraphs).
The rend attribute describes how to render the text of an element, for inline encoding of typographical features.

Although a minimal set of tags tends to get used from any given TEI schema, it can be used to enrich the text with all sorts of supplementary metadata, to provide further semantic and analytical richness to the text (short of providing critical commentary), thus supporting a variety of literary analyses. This is not something we aim to represent, or have the knowledge or resources to produce, but it is one enrichment activity that someone might like to perform given the basic markup.

I’m not convinced about the use of number-suffixed div elements <div0>, <div1> which results in an arbitrary number of new element names. It seems more in the spirit of XML to put these indices in an attribute.

Editions of Shakespeare

There are several editions of Shakespeare, including the following well-known publisher editions:

The Arden Shakespeare is described as “the world’s most recognised scholarly edition”. It is commercial and comes at a cost.
The Riverside Shakespeare is a long-running series started at the tail end of the 19th century, and includes a scholarly edition. I believe it also established some typographical standards for representation of aspects of the text.
The Globe Shakespeare, a 19th century edition produced by Cambridge editors Clark and Wright. Some dispute their inclusion of particular textual variations.

Which of these is the best option for inclusion depends very much on which edition is available in a marked up form. Ultimately it would be great to be able to store several versions (with concordances) and provide parallel access to them with comparison tools.

Requirements

With such a variation in possible digital representations of the plays, we should consider what the purpose is of storing marked up texts, and what the possible applications are.

What is required in an electronic representation?

Searchability, the ability to distinguish between each component and reconstruct any hierarchy that inheres in the text.

Why is it important to agree on a scheme?

There is little need to stress the benefits of having a standardised representation. If a representation scheme is agreed and then adhered to by a number of people, the data of literary artefacts become well defined and their structure can be used predictably. Tools and interpretations can be built up around and from the common format.

What is required of a Shakespeare play schema?

Probably nothing too different to what is required of any other dramatic or literary schema.

Why is there no agreed format?

TEI is the most recognised format; the LML/XLDL effort is interesting but appears not to have been applied to anything as yet.

Conclusion

This post has looked at some of the options out there for Shakespeare markup and annotation. There are some interesting XML definitions, but apparently a lack of a core set of professionally-annotated texts employing a well-defined schema. We do not have the resources to mark up a text ourselves, but we can process whatever editions are available.

The Perseus markup seems to provide a mix of an appropriate edition, a well-established and flexible encoding scheme, and a richness of annotation, and will form the primary text of our registry. It is more complex than other schemas, and so may take more processing to render for presentation. However, it appears worthwhile in order to provide as much well-organised information as we can.

In the future, it would be good to see a single encoding scheme become the standard for Shakespeare’s plays, and thereby contribute to a unified effort combining crowd-sourcing, annotation, and a rich diversity of interpretations, transformations, visual representations and unforeseen recombinations of texts, metadata and other resources.

Good Grieffe, More Robot Suits

Finally, a word of warning from T-Rex, about ensuring the long term preservation of Shakespeare in appropriate formats.

EDINA Blogs

A Blogs.edina.ac.uk weblog
