Day Five of Will’s World Online Hack

Posted on December 9, 2012 by Muriel Mewissen

Will’s World Online Hack is now in full swing. Today’s daily check-in session on Google+ Hangout generated some interesting discussion around the various XML versions of the plays: those provided by Will’s World, those from Perseus, and the latest offering from the Folger Library, which includes a lot more detail as well as linked data, identifiers, keywords, interactive apps and geo-location. It is well worth viewing if you missed it earlier today, and it is available on YouTube and Google+.


The weekend seems to have been productive for Will’s World Online Hackers:

Victor is developing an exciting idea for an interactive app in which locations mentioned in Shakespeare’s work are shown on a map, such as Google Maps. The app could let you know which Shakespeare landmarks are close to your current location and provide information, background data, etc. It could even warn you when you are near a place of interest.
Jeffrey is making good progress and has been looking at how to present the data, how words are used and how to provide keywords.
Richard’s hack is also going well and he is now working on providing linked data for the characters of the plays. Unfortunately, DBpedia only has URLs for 93 out of the 1000-plus characters.

All of them are looking for some help:

Victor would like to hear from developers willing to help with the coding. Will’s World suggested Unlock to help scrape locations from the text.
Jeffrey is looking for input from the academic/literature side!
Richard is searching for suggestions for alternative sources of identifiers.

And we know Owen is still open to collaboration on his hack, and looking for some help with WordPress and Moodle. Check our current hack page and get in touch if you can help. Your involvement will be most welcome!

My hack collaborator has been of limited help… More on Pinterest!

Day 4 of Will’s World Hack

Posted on December 8, 2012 by Muriel Mewissen

The weekend has arrived and we are looking forward to an increased level of activity during this typical hack time. Will’s World is nearing the halfway mark but there is still plenty of time left to build some wonderful and creative hacks. There is also plenty of time left to join the hack. Like our latest participant who registered today, you too may have a good idea for a hack that you would like to explore with our team of hackers. Registration is still open. You will even get a goodie bag to support your efforts, which includes this great Will’s World Hack mug, already being put to good use by Owen:

More behind-the-scenes photos of Will’s World Hack can be seen on Pinterest.

Additional data and information on how to use the Registry were made available today.

The metadata has been reloaded, with significant improvements:

Values for more fields, including missing titles and descriptions, dc:type and the attribution fields ww:credit, dc:creator, dc:contributor, dc:publisher.
Consistent values for dc:source and ww:credit; supporting the retrieval of results by collection (faceting).
Persistent record dc:identifiers, supporting reliable referencing even as records are updated.

The interface to Solr has also been improved with the following features:

Easier, more reliable searching with simpler URLs.
Better results through more general phrase searching.
Faceting on type and attribution fields.
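
As a rough illustration of how a phrase search with faceting might be requested from Solr, here is a minimal Python sketch. The endpoint URL is a placeholder (the Registry's real search URL is not given in this post) and the helper name is invented; `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters, and `dc:type` and `ww:credit` are fields named above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Placeholder endpoint: an assumption, not the Registry's actual URL.
SOLR_SELECT_URL = "http://example.org/solr/select"

def build_facet_query(phrase, facet_fields, rows=10):
    """Build a Solr select URL for a phrase search with field faceting."""
    params = [
        ("q", '"%s"' % phrase),  # quoted so Solr treats it as a phrase
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field is repeated once for each field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_facet_query("Romeo and Juliet", ["dc:type", "ww:credit"])
print(url)
```

The facet counts in the response could then be used to narrow results by type or by collection.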

The recording of today’s check-in session on Google+ Hangouts, summarising these developments, is now available to view on YouTube and Google+.

We are looking forward to hearing more about this weekend’s hack efforts in tomorrow’s check-in session on Google+, and as usual we will post a summary here.

Day Three of the Will’s World Hack

Posted on December 7, 2012 by Nicola Osborne

Will’s World Hack is in its third day now and we are hoping to hear more from potential hacks as the weekend approaches and people can leave work behind for some hack time fun!

We held two drop-in sessions today on Google+ Hangout and as usual, they are now available for viewing on YouTube, Google+ and linked in our daily update post at:

In the first session (1pm GMT), we highlighted the addition of rules on the Prize & Rules wiki page. There is nothing new or unusual in these but they do formalise what we have been discussing earlier on. In particular, we would like people to register their hack or hack idea, preferably by Sunday evening, and be ready to present live at the closing session or provide a video ahead of time. Let us know if you have any queries.

You can watch the first check in session here:

Our 5pm Check in saw Jeffrey Kerzner join us from the US. Jeff has been looking at the Shakespeare data and how it might be accessed and analysed. He would love to hear from any academics about the real academic aims, the scholarly questions they want to ask of this data. Please do take a look at this video from the session and leave us a comment here, email us (willsworldhack@gmail.com) or contact Jeff directly via his details on the wiki if you’d like to get in touch with him.

In this check in our developer Neil also discussed various improvements and updates – including improvements to the Solr search that may require some to make small tweaks to already-developed code – that are being made to the Will’s World Registry tonight ready for your further hacking delight over the weekend.

In other news a tweet and email highlighted a great new Shakespeare resource from the Folger Library which provides the text for the plays rigorously encoded with every word, every punctuation mark, every space, within a sophisticated TEI-compliant XML structure, and you can download them!

Also released today were the new Google+ “Communities” – these are groups that allow you to share updates and discussion with other participants so do join ours and start chatting!

We have also had updates from various hackers so here’s what’s happening so far:

Jeffrey is looking for academic collaborators to work with – he’s got the coding skills so bring him the academic questions to answer!
Richard has been looking at URLs for Shakespeare characters (LCSH has been suggested) and has been considering the best approach for this – it will fit with the Linked Data texts he is working on and he would appreciate any advice or collaborators for this work.
Owen has been developing his ShakespearePress WordPress Plugin and has shared his code on Github: https://github.com/ostephens/shakespearepress. Owen would love to hear from potential collaborators so again do leave comments here, email us, or get directly in touch with Owen via the Participants Wiki page.

We will be supporting your hacks and working on a few of our own over the weekend so we shall see you online very soon and will provide another daily update here on the blog tomorrow.

The next check ins will be at 5pm on Saturday (8th December) and 1pm on Sunday (9th December) when the team will be available to answer your questions and you can continue to meet and work with each other.

And finally… a quick reminder of our last (but not least) prize category: Best set up for the hack. Send us a picture of the environment you are working in and we will showcase the best on our Pinterest page, with a prize for the very best!

Day Two of Will’s World Online Hack

Posted on December 6, 2012 by Muriel Mewissen

The Will’s World Online Hack has now been going for over 24 hours. While most traditional hacks would be over by now, Will’s World has only just started! And we are very pleased to see that the conversation is starting to flow and connections are being made.

We held our daily session on Google+ Hangout which was streamed live on YouTube and recorded. The videos for all sessions are made available on our YouTube channel and Google+ page, so feel free to catch up on any session you may have missed.


Today’s session brought a few questions about the Registry:

How do you find the relevance of records in the Registry? It is not always obvious why some records are returned as results for a specific search.
There is currently no faceted search available for the Registry, which would help with exploring the content and getting a feel for the data. This might help with understanding the relevance of some records.
The service directory is not yet on the Registry website.

The Registry is still under construction and we are looking into improving these. In addition, not all data available has been loaded in the Registry. We are hoping to address this within the next day (or two).

Connecting

Although some participants have reported that it was not obvious how to join a Google+ Hangout, we are delighted to see that communication between participants, and between participants and the project team, has started to happen on Twitter (@WillsWorldHack, #willhack) and IRC. These conversations yielded some useful questions about the data, which have been added to the FAQs on the wiki at: http://willsworld.edina.ac.uk/wiki/index.php/The_Data#Data_FAQs.

Track discussion around the Will’s World Hack with our #willhack Storify gathering tweets, videos, etc.

Two days, Two project ideas

We are delighted to have two project ideas already listed on the Current Hacks page and are hoping for more to be listed there in the coming days.

One of our hackers, Owen Stephens, has also blogged about his progress to date in this “To scrape or not to scrape” post about using ScraperWiki to create a queryable API form of performance cast lists for the hack he has built.

Day One of the Will’s World Hack

Posted on December 5, 2012 by Nicola Osborne

In readiness for the beginning of the Will’s World Hack we were delighted to launch the Will’s World Registry this morning. The Registry, the development of which we have been charting on this blog, includes the metadata from our fantastic project partners, information on the schema and mappings, and related resources including XML versions of all of Shakespeare’s plays. During the hack we will be continuing to finesse the Registry and we’d really welcome your comments and feedback on the version we have launched today.

Then, this afternoon (at 1pm GMT) saw the launch of the Will’s World Hack (#willhack) via a live Google+ chat streaming to YouTube!

We introduced ourselves and the Will’s World project as well as saying a bit about what we hope may come out of the hack (Will’s World Hack Event Presentation). It was brilliant to have four of our twenty or so registered hack participants joining us for this live portion of the event and we hope many more of you will be dropping into live sessions later in the week or able to catch up on YouTube. Today’s launch session can be viewed here:

We also had a check in session this afternoon and we got to hear about the first hack taking place: turning the Shakespeare plays we made available as XML for the hack into a Linked Data database for use in other hacks. Richard, who is working on this, would appreciate any pointers to existing ontologies for Shakespeare plays (or plays in general) so, if you have any suggestions, please leave a comment here or email the team. You can also find out more about that on our Current Hacks page on the Will’s World Hack Wiki – and you can add details of your own ideas, hacks and hack teams to the page while you’re there!

We plan to post a summary of the Hack each day so keep an eye here on the blog for updates all week of the Will’s World Hack. You can also join the conversation on Twitter with the #willhack hashtag.

And finally…

Remember that the key place for all information on the Hack event is the Will’s World Hack Wiki
Track discussion around the Will’s World Hack with our #willhack Storify gathering tweets, videos, etc.
And the Will’s World hack has been written up on the DMU Mashed Library blog – take a look here: http://librarymashups.our.dmu.ac.uk/2012/11/29/wills-world-online-hack-event/

Will’s World Hack (#willhack) Starts Soon!

Posted on December 4, 2012 by Nicola Osborne

Tomorrow will see the start of Will’s World Online Hack. The opening session will take place on Google+ hangout at 1pm GMT. Most participants are in the UK but we also have someone from Spain and folks from the USA. We’re hoping you can join us but it may be a bit early for California!

Join Us for the Live Launch!

If you have already registered then join us live on Google+ hangout or YouTube if you can (instructions later in this post), or check out the recording at a later time. If you haven’t registered to take part it’s not too late! You can still register here.

The Will’s World Google+ page

This live opening session will introduce:

The Will’s World team
The Will’s World project
The data
Our hopes for the hack
The prizes and judges
The practical organisation for the week ahead
And most important of all: you, the participants

The main aim of this session is to kick start the interaction and team formation! We will also be running a live check in at 5pm (GMT) which you are very welcome to join/look in on.

Joining Live Will’s World Hack Sessions

For the Live Launch we are using Google+ Hangouts. You can either join in via video or audio using Google+, or you can view the stream via our YouTube Channel. We will run other live sessions throughout the week, similarly, using Google+ and YouTube, but will also be available to support your fantastic hacking via Skype (WillsWorldWack), email (willsworldhack@gmail.com), Twitter (@willsworldhack) etc.

To join in the Google+ Hangouts:

Go to the Wills World Hack Google+ page
Look for the most recent update and click “Hang Out”. You may see a warning if our WillsWorldHack account is not already part of your circles – either add us to your circle or accept that warning to be added to the chat.
You will be taken to a new window where you will be added to the active Hang Out.
Your video camera and/or microphone will automatically be shown once you have joined the Hang Out. Usually when you speak you will appear onscreen. We will be managing the various cameras so if you wish to raise a question and your camera does not show in the main Hang Out window please just give us a wave!
Once in the hangout you can use the “chat” button (a grey or blue square speech bubble – the second link on the left hand menu) to raise questions via text chat. These may or may not be seen by others (depending on whether they have clicked the same link) but the Will’s World Team will be keeping an eye out for any questions.
To exit the call use the “hang up” symbol at the top right of the screen.

We have trialed this technology and although it sounds tricky, it’s actually pretty easy to get started with. If you have any questions please let us know (willsworldhack@gmail.com).

To view the live sessions on YouTube:

Go to the Will’s World Hack channel
Click on “Browse Videos” and the “Feed” link (or click here)
Look for the “Live” video – this should be the one closest to the top of the page
Click to access the video and view it live.

All videoed sessions will be recorded and made available on YouTube. If you would prefer not to appear in these we would recommend viewing the YouTube feed. If you are happy to appear in the video and/or want to ask questions in person or via text chat we would recommend joining the Google+ Hangout.

Schedule for Will’s World Hack

We have scheduled at least one live session per day but if you have any comments on the timing of these or would like to see additional times added to the schedule (available from the FAQ page on the wiki or via our Google Calendar) just let us know.

Follow the hack

Even if you are not taking part in the event itself, you can follow the development of the hack too!

The opening sessions will be streamed live on YouTube and made available later on the Will’s World Google+ page and on this blog. Check them out!

We also have a hashtag for the event so look out for Shakespeare hack chatter – and add your own ideas and advice – on #willhack. You can also follow our lovely participants’ tweets via the Wills World Hack Twitter List.

For all other information about the hack – which will be added to throughout the week – please do take a look at the Will’s World Hack Wiki: http://willsworld.edina.ac.uk/wiki/index.php/Main_Page

If you have any feedback on the event, the format, the tools we are using, etc. then please just let us know here as a comment or through any of the channels mentioned above.

Register for the Will’s World Online Hack

Posted on November 30, 2012 by Muriel Mewissen

The last week has been very exciting for the Will’s World team as we have been busy preparing for the Online Hack event taking place next week. By the way, there are still a few days left to register if you fancy joining us!

Taking the stage

I made my YouTube stage debut in this video introducing the Will’s World project and the motivation behind the online format of the hackathon.


My colleague Neil got into the creative spirit of the hack for this short video presenting the metadata that will be available in the Shakespeare Registry.


Data sources

The Shakespeare Registry will include metadata from:

historical records from EDINA Statistical Accounts of Scotland
multimedia items from JISC MediaHub
multimedia items from Culture Grid Europeana
digitised prompt books from National Library of Scotland
collection items from the British Museum
programmes from the BBC Shakespeare’s Restless World
plays in XML format from the Open Library

With additional sources of data listed and some hack participants bringing their own data to add to ours, this will truly ensure we have some rich data to work with!

Goodies

All the items for our goodie bags have now arrived and I’ve been enjoying putting the packs together. I will be sending them soon to the lucky 18 people already registered. We are hoping that these goodie bags will support your creativity and provide a little bit of fun. Here’s a sneaky peek at the smallest item!

Find out more about Will’s World Online Hack at http://willsworld.edina.ac.uk/wiki/.

Join Will’s World Online Hack 5-12 Dec 2012

Posted on November 22, 2012 by Muriel Mewissen

Are you interested in Shakespeare? Are you tempted to take part in a hackathon or know someone else who might be? Do you have a great idea for a new app? Do you want to mash your own data with ours? Then get involved in the Will’s World Online Hack! We are pleased to announce that registration for this event is now open at http://willsworld.edina.ac.uk/wiki.

This hackathon aims first to promote innovative use of the Shakespeare metadata registry, built by the Will’s World project to hold metadata describing online digital resources relating to Shakespeare, but also to explore an online format for hackathons.

How does an online hack work?

Well, like a traditional hackathon, technical and creative people with different expertise like software developers, graphic designers, domain experts and project managers, get together and collaborate to develop applications and explore concepts. But instead of getting physically together in one location, social media technologies are used to communicate and collaborate online. We used your very useful and positive feedback to our online survey to plan this event.

The event will take place over a week:

Opening session on Wednesday, 5th December, 1pm (GMT):
This will be a live and interactive session to present the data, the goal of the event, prize categories, the set of social media tools and technologies to be used during the hack and the Will’s World project itself. Participants will be able to introduce themselves and put forward ideas.

Hack, 24 hours spread over 6 days:
The participants will have six days to form teams and familiarise themselves with the data and code. Participants are free to organise when they spend their 24 hack hours over these six days. They will have the flexibility to work when it suits them. Teams can set their own schedule either for members to work concurrently or consecutively. The Will’s World project team will be on hand throughout to answer any questions and regular interactive drop-in sessions are planned.

Closing session on Wednesday, 12th December, 1pm (GMT):
This will be a live and interactive session where each team will present their hack either live or as a pre-recorded video and prizes will be awarded.

We are hoping to capture as much as possible of the communication taking place. In particular, the opening and closing sessions will be videoed. All recordings will be shared on the event wiki or the blog.

The use of technology and social media is at the core of this online hack. We will be using a wiki to act as a hub to support communication before, during and after the event. Mailing lists, Google+ hangouts, YouTube, Skype, Twitter, IRC, Github and Dropbox will all help the communication and creativity flow. You will find more information about the event, the data, the technologies and how to take part here.

Register now and receive a goodie bag!

If you fancy taking part in this exciting event and being one of the first pioneering online hackers, then please register on the Will’s World Online Hack wiki. Participation is free and the first 50 participants to register will receive a goodie bag!

Will’s World online hack survey results: Your Views!

Posted on November 13, 2012 by Muriel Mewissen

Over the last three weeks we have been drumming up interest for our idea of an online hack event. This twist on the traditional “in person” format has exciting potential to be more flexible and make great use of social media. It seemed like a very attractive idea to us but, we wondered, what did you think?

We drew up a short survey (15 questions) to capture your views, feedback and any experiences that would help us plan a great online hack. We spread the word through this blog, Twitter, mailing lists and websites, and asked others to do the same.

To date (the survey is still open) we have received 30 replies to the survey and many direct emails with further input. So a BIG THANK YOU to all! We are delighted that you found time to make this hugely valued contribution and we thought that the least we can do is share here what you told us.

A Good idea

In answer to that core question we found that 84% of our survey respondents felt that the online hack was a good idea, of whom: 57% of respondents thought that an online hack was a good idea and would be interested in taking part; a further 27% of respondents felt it was a good idea but they were not sure how it would work.

75% of respondents had attended hack events in the past, and interestingly 3 have already taken part in an online hack.
It is very encouraging to see that most people are supportive of the online format – only 10% would prefer an in-person event. Only one person doesn’t think it will work and another said they wouldn’t be interested in taking part.
Significantly, all three experienced online hackers think it’s a good idea, with two of them definitely interested in participating in this hack – this is really encouraging!

Timing – it’s all relative…

Opinions are divided over what format might work best. This is not surprising since most of our respondents had not been to an online hack before so were being asked to speculate on what might work. However, close to half of those who responded favour a week-long drop-in format. Others were split between weekends and weekdays – we had lots of conflicting comments about availability here.

We didn’t ask you where you were based – although we would if we did this again – but from your experience and email addresses we know we have respondents from both sides of the Atlantic, which further encourages us that any possible timings and format need to support an international hack attendance as elegantly as possible.

Participation

We were really pleased to see that you weren’t just being lovely in sharing your views, you were also really up for participating!

50% of respondents said they are definitely interested in taking part in this hack, with an additional 30% a “maybe”, and several others interested but unable to attend on the specific suggested dates in December.
A significant number of people (52%) indicated that they may be able to bring additional data to the hack. However, most note that it would depend on having enough time to prepare it and/or obtain approval for sharing the data.

Social Media Technologies

Social media tools are essential in supporting the communication required by an online hack. Many applications are popular and received support from the participants of the survey, as seen in the graph below:

Knowing what tools you already use means we now feel well informed to choose the right combination of social media and web technologies: one that you will feel comfortable and familiar with, and that provides the functionality we think we will need.

We need you… but what do you need?

We also asked you what you might need to be able to take part in an online hack. The main requests were for:

as much data as possible

information on the data available ahead of the event

easy access to the data

access to the APIs ahead of the event

We can definitely see from these responses and our word cloud for this question (below) that the data is crucial!

Team-building and help with communication tools ahead of the event were also highlighted. The importance of time, pizza, publicity, prizes and a greater technology know-how were also mentioned!

Participants

So who are you all?

50% of respondents work in the Higher Education sector. A further 18% are freelance and 11% work in the private sector.
Participants were from highly varied backgrounds, with different expertise and interests: from experienced developers, to artists, designers, managers, engineers, teachers, students, librarians – the only common characteristics seemed to be a passion for hacking, for Shakespeare or for both!
92% shared their email with us to be kept informed on the developments of our online hack event – thank you! We will be in touch with you soon!

If you’re not one of those who responded but would like to stay up to date on the hack event please either fill in the survey now or drop an email to edina@ed.ac.uk with “Wills World” in the subject line.

I Love Shakespeare

You have shared with us your wishes for playing with data, engaging with communication tools, supporting learning and producing creative material. You have encouraged us in our ambitious vision but warned us of the difficulties too.

Most of all, the word cloud for our additional comments section seems to indicate that you simply love Shakespeare!

Will’s World Online Hack is Coming Soon!

Following the positive responses we have received, we have looked further into the practicalities of organising an online hack event and are delighted to let you know that we will be going ahead with the event in early December! Further details and the official announcement will be out very soon… Watch this space!

Mapping metadata to a common schema

Posted on October 30, 2012 by neil.mayo

One of the challenges in aggregating metadata is to unify the different schemata of the data sources, to make the data searchable through a single common schema. There are advantages to maintaining the original schema (field names) of each data source so that it can be searched in the same way that would have been possible in its original context (i.e. searching under the same field or fields). However, this can complicate searches of the aggregated metadata when each data source represents similar information under a differently-named field, because a search across a range of essentially synonymous but differently-named fields in a variety of resources is more complex to specify. Therefore it makes sense to map similar fields onto a single field, within a schema which is specific to the collection of aggregated data.
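
To make the idea concrete, here is a minimal Python sketch of mapping differently-named source fields onto a single common field. The source names and source field names are invented for illustration; only the Dublin Core style target names reflect the kind of fields used in the Will’s World schema.

```python
# Hypothetical per-source mappings: the source names and the field names on
# the left are assumptions made up for this example.
FIELD_MAPPINGS = {
    "source_a": {"author": "dc:creator", "headline": "dc:title"},
    "source_b": {"maker": "dc:creator", "name": "dc:title"},
}

def to_common_schema(source, record):
    """Rename a source record's fields to the common schema's field names."""
    mapping = FIELD_MAPPINGS[source]
    common = {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            continue  # unmapped fields are simply dropped in this sketch
        common.setdefault(target, []).append(value)
    return common

records = [
    to_common_schema("source_a", {"author": "William Shakespeare", "headline": "Hamlet"}),
    to_common_schema("source_b", {"maker": "William Shakespeare", "name": "Macbeth", "shelf": "12"}),
]

# One search on a single field now covers both sources.
titles = [r["dc:title"][0] for r in records
          if "William Shakespeare" in r.get("dc:creator", [])]
print(titles)
```

Without the mapping, the same search would have to name both `author` and `maker`, and the list of synonyms would grow with every new source.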

This post is about creating a schema for Will’s World and mapping other metadata into it. Despite the technical detail, it may illustrate some common stumbling blocks encountered when mapping disparate information to a single representation.

Developing a Schema

We had to develop a schema for Will’s World (WW) to represent the metadata which is being imported into the text database. This schema has several requirements:

Should be capable of representing/unifying metadata from several different sources.
Must unify some of the variations on similar pieces of data (for example there is a range of fields dealing with attribution and ownership – including rights/availability/copyright; owner/credit/creator/contributor).
Should use existing ontologies as far as is appropriate/possible (for example, Dublin Core, TEI)

Once the schema is designed, it is necessary to define how the metadata we collect from disparate sources will fit into the (probably more restrictive) schema that we have elected to use in our representations. This involves establishing a clear mapping from the source schemata to the target schema. Additionally, we need more than one schema for Will’s World:

A schema for metadata harvested from services, which represents content within those services.
A schema for descriptions of services themselves (including those from which we do not harvest metadata).
As a bonus, we might wish to define one or more schemata for texts relating to Shakespeare.

This post concentrates on the metadata schema, which is the most important and elaborate. We have defined a list of fields with namespaced identifiers, and for each field we define several properties:

Internal – whether a field is for internal use only, and is invisible to the user (because, for example, it is meaningless outside of the application).
Repeatable – whether a field can appear multiple times. For example, properties like “genre”, “subject” or “author” may be multi-valued (or rather have multiple instances), while we expect a metadata item to have a unique and canonical title (although we accept zero or more alternative titles).
Required – whether a field is required (because it acts as an identifier of some sort, or is essential to the meaning/categorisation of the record).
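
The three properties above could be recorded per field and used to check records, roughly as in the following Python sketch. The particular property values assigned to each field here are assumptions for illustration, and `ww:import_batch` is a made-up internal field, not part of the published schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    name: str
    internal: bool = False    # hidden from users
    repeatable: bool = False  # may appear more than once
    required: bool = False    # must appear at least once

# Illustrative values only: the real schema may assign these differently.
SCHEMA = {
    spec.name: spec for spec in [
        FieldSpec("dc:identifier", required=True),
        FieldSpec("dc:title", required=True),
        FieldSpec("dc:subject", repeatable=True),
        FieldSpec("ww:import_batch", internal=True),  # hypothetical field
    ]
}

def validate(record):
    """Return a list of problems; record maps field name to a list of values."""
    problems = []
    for name, spec in SCHEMA.items():
        values = record.get(name, [])
        if spec.required and not values:
            problems.append("missing required field " + name)
        if not spec.repeatable and len(values) > 1:
            problems.append("field " + name + " is not repeatable")
    for name in record:
        if name not in SCHEMA:
            problems.append("unknown field " + name)
    return problems

def public_view(record):
    """Drop internal (and unknown) fields before showing a record to users."""
    return {k: v for k, v in record.items()
            if k in SCHEMA and not SCHEMA[k].internal}

print(validate({"dc:title": ["Hamlet", "The Tragedy of Hamlet"]}))
```

Keeping the properties in one table means the validation and display code does not need to know about individual fields.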

For the purposes of supporting text search, we need to answer a couple of questions for each field:

Should the field be Indexed (searchable through Solr queries)?
Should the field be Stored (available to return from Solr for display in the interface)?
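
In Solr these answers end up as attributes on each field definition in the schema. The Python sketch below generates such definitions. The attribute names (`indexed`, `stored`, `multiValued`, `required`) are standard Solr field attributes, while the field type names and the choices made for each field are assumptions rather than the project's actual configuration.

```python
import xml.etree.ElementTree as ET

def solr_field(name, field_type, indexed, stored, repeatable=False, required=False):
    """Render one Solr schema <field/> definition as an XML string."""
    attrs = {
        "name": name,
        "type": field_type,  # type names are assumed, e.g. "string"
        "indexed": str(indexed).lower(),
        "stored": str(stored).lower(),
        "multiValued": str(repeatable).lower(),
    }
    if required:
        attrs["required"] = "true"
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")

# Illustrative choices: a searchable, displayable title and an
# identifier that is stored but not searched as text.
title_xml = solr_field("dc:title", "text_general", indexed=True, stored=True, required=True)
ident_xml = solr_field("dc:identifier", "string", indexed=False, stored=True)
print(title_xml)
print(ident_xml)
```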

Schemata

For reference while reading this blog post, you can refer to both the metadata schema and metadata mappings we have developed.

The schema for representing metadata about services is much simpler and records identifying metadata such as id, name, description, URL, provenance and usage/licensing information, and more specific metadata about how to use the service.

Collecting Data

The next stage is to collect, transform and ingest the data into Solr. This is a multi-stage process:

Collect the data – either in a well-structured format like RDF/JSON/XML/CSV, or contained in HTML.
Extract the data (if contained in an unstructured or variable format like HTML).
Transform the data from the source format into a format matching the WW schema.
Transform the WW metadata into a format matching the Solr import document schema.
Ingest the data into Solr, by posting it to the server.
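
The fourth step, turning WW metadata into a Solr import document, might look something like this Python sketch. It uses Solr's XML update format (`<add>`, `<doc>`, `<field name="...">`); the record layout (field name mapped to a list of values), the function name and the identifier value are assumptions. Posting the result to the server is left out.

```python
import xml.etree.ElementTree as ET

def to_solr_add_xml(records):
    """Serialise records (field name -> list of values) as a Solr <add> document."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, values in record.items():
            # A repeatable field becomes one <field> element per value.
            for value in values:
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = value
    return ET.tostring(add, encoding="unicode")

xml_text = to_solr_add_xml([
    {"dc:identifier": ["example-1"],  # invented identifier
     "dc:title": ["Hamlet"],
     "dc:subject": ["Tragedy", "Denmark"]},
])
print(xml_text)
```

Using an XML library rather than string concatenation keeps the output well-formed even when values contain characters such as `&` or `<`.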

Due to size and memory restrictions, any of the later stages, particularly the transformations and ingest, may involve splitting the documents into multiple (well-formed) documents.
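
One simple way to do such splitting is sketched below in Python, under the assumption that the large document is a Solr `<add>` document and that a limit on the number of records per part is an adequate stand-in for a size limit. The function name is invented.

```python
import xml.etree.ElementTree as ET

def split_add_document(xml_text, max_docs):
    """Yield smaller well-formed documents, each with at most max_docs children."""
    root = ET.fromstring(xml_text)
    docs = list(root)
    for start in range(0, len(docs), max_docs):
        # Each part gets its own root element, so it parses on its own.
        part = ET.Element(root.tag, root.attrib)
        part.extend(docs[start:start + max_docs])
        yield ET.tostring(part, encoding="unicode")

big = "<add>" + "".join(
    '<doc><field name="id">%d</field></doc>' % i for i in range(5)
) + "</add>"

parts = list(split_add_document(big, 2))
print(len(parts))
```

This version still parses the whole input at once, so it helps with the size of each post to the server but not with the memory needed to read a very large source file.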

We have to take a variety of approaches to collecting the data, depending on what types are available. Only a relatively small proportion of providers actually make their data available in easy to collect structured formats, while in other cases the data is present directly in HTML pages, or as the result of searches through an HTML web interface.

Scraping data – this may involve two stages (where the metadata itself is available directly from known URLs, the two steps are combined):
1. Scrape information from web pages (for example, the results of conducting a search through a search interface).
2. Extract links from that information and follow them to find the metadata.
Collecting data directly in a structured format.
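The link-extraction step of scraping can be sketched with the Python standard library alone. The `LinkExtractor` class, the `extract_links` helper and the example URL are invented for illustration; a real scraper would also need to fetch the pages and cope with each provider's markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = '<ul><li><a href="/item/1">One</a></li><li><a href="item/2">Two</a></li></ul>'
print(extract_links(page, "http://example.org/search/"))
```

The returned URLs are the ones to follow in the second stage to find the metadata itself.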

Depending on the source, the data may be processed by scripts which construct structured data from unstructured or less structured data by parsing and recombining it, or by an XSLT script which performs a simpler and more direct mapping of existing fields to the fields of the schema. The latter may involve many-to-one mappings in addition to direct one-to-one mappings. One-to-many mappings should however be avoided – they would duplicate data and would likely indicate that the schema was not well defined.

Issues in Schema Design

In designing a metadata schema which is intended to unify metadata from sources with different concerns, we inevitably encounter certain issues which demand to be resolved in a consistent way.

Identifying fields

Naturally we need an identifying field so that every record can be distinguished and uniquely identified. We have two options for this – generate unique id fields (the easiest way would be to ask Solr to impose them), or map external values (from each source’s data) into an id field. The first option requires us to maintain unique identifiers upon updates, when we re-index data. The second option runs the risk of clashes. Identifying tuples or values sourced externally will vary a great deal and there is always the danger of non-uniqueness unless we are very careful about how we define them. It might be best to let Solr create a unique field value for indexing, and record an alternative ‘primary’ identifier from the identifying metadata of each particular data source. This might be considered a unique identifier within the records of the particular source.
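One variation worth noting, if the id is produced in a script rather than by Solr, is to derive it deterministically from the source name and the source's own identifier. The id then stays the same when data is re-indexed, and the 'primary' identifier is still recorded alongside it. This is a sketch of that alternative, not what WW does; the namespace URN and the field names `source` and `source_id` are made up for the example.

```python
import uuid

# Hypothetical namespace for WW records; any fixed UUID would do.
WW_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "urn:example:willsworld")


def make_ids(source: str, source_id: str) -> dict:
    """Return a stable index id plus the source's own identifier."""
    # The separator keeps ("ab", "c") distinct from ("a", "bc").
    key = f"{source}\x1f{source_id}"
    return {
        "id": str(uuid.uuid5(WW_NAMESPACE, key)),  # same input, same id
        "source": source,
        "source_id": source_id,  # unique only within this source
    }


print(make_ids("nls", "42"))
```

Because the id depends only on the pair of values, two sources that happen to reuse the same local identifier still receive different ids.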

Required fields

If the anticipated metadata informing a required field are not present in the source data, a default or “unknown” value must be given. This can be achieved by explicitly producing a default value from a script, or by providing a template in an XSLT transform for each required field, which matches when the source field is not present in a record. Here is an example enforcing the presence of a required dc:title field for a Solr import doc:

<xsl:if test="not(dc:title)">
  <field name="dc:title">Unknown</field>
</xsl:if>
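The script-based alternative is similarly short. In this sketch a record is assumed to be a mapping from field name to a list of values, and `REQUIRED_DEFAULTS` is a hypothetical table of required fields and their fallback values.

```python
# Hypothetical table of required fields and their default values.
REQUIRED_DEFAULTS = {"dc:title": "Unknown"}


def apply_required_defaults(record, defaults=REQUIRED_DEFAULTS):
    """Return a copy of the record with every required field filled in."""
    completed = {name: list(values) for name, values in record.items()}
    for name, default in defaults.items():
        # Treat a missing field and a blank value in the same way.
        if not any(value.strip() for value in completed.get(name, [])):
            completed[name] = [default]
    return completed


print(apply_required_defaults({"dc:creator": ["William Shakespeare"]}))
print(apply_required_defaults({"dc:title": ["Hamlet"]}))
```

Treating blank values as missing is a design choice; some sources supply empty elements rather than omitting them.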

Field values with subelements

What should we do when the source data that we want to insert as a field value incorporates tag structure of its own? We would like to maintain XML substructure where available as part of the value of the field, rather than discarding it. However, the Solr import schema does not accept substructure within its “field” elements, so we must enclose it in a CDATA section. This is not easy using XSLT, unless we are willing to write templates to match each of the elements we want to use as subelements in the defined fields of the schema, and encode them appropriately. The solution is to pass the node which contains substructure to a named template which wraps it in a CDATA section in the output, like this:

<!-- Copy the current element and subelements to the output in a CDATA section -->
<xsl:template name="xmlsection">
  <!-- Use the current node by default -->
  <xsl:param name="node" select="."/>
  <xsl:text disable-output-escaping="yes"><![CDATA[ <![CDATA[ ]]></xsl:text>
  <xsl:copy-of select="$node"/>
  <xsl:text disable-output-escaping="yes"><![CDATA[]]]]><![CDATA[>]]></xsl:text>
</xsl:template>
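For sources that are processed by script rather than XSLT, the same wrapping can be sketched in Python. The helper names are invented for the example. One detail matters when wrapping arbitrary strings: a literal `]]>` inside the content would end the CDATA section early, so it is split across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def cdata_wrap(markup: str) -> str:
    """Wrap a string in CDATA, splitting any "]]>" so it cannot close it early."""
    return "<![CDATA[" + markup.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def field_with_markup(name: str, node: ET.Element) -> str:
    """Serialise a node and its subelements as the CDATA value of a Solr field."""
    inner = ET.tostring(node, encoding="unicode")
    return f"<field name={quoteattr(name)}>{cdata_wrap(inner)}</field>"


node = ET.fromstring("<p>A <i>marked-up</i> description</p>")
print(field_with_markup("dc:description", node))
```

When the field is parsed again, its text value is the original markup as a string, which is what we want Solr to store.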

Aggregating multiple XML fields

As indicated earlier, there are several cases where we wish to map many source fields from a provider into one Will’s World field as subelements. An example of matching multiple fields at the same level which will contribute to the content of a single field is in National Library of Scotland data, where we want to match and combine the XML elements “summary”, “commentarytitle” and “commentary” into a single “dc:description” field. There are a few options for performing this aggregation; if we know that one of the source fields (say, “summary”) will always be available, we can match on it and then pull in values from the other fields as necessary using xsl:copy-of with a relative path in the select attribute. However, if the “summary” field is missing, the template will not match and we will lose the values of the other fields.

If we cannot be sure that any one of these fields will be available, that is, they are all optional, then we need to match every possible combination. Doing this via a number of similar templates is unnecessarily verbose and will also lead to duplicate output fields, as every one of the fields which is present will match and cause the output field to be written. The solution is to provide an OR-ed list of the possible combinations of these elements, so it will only match once and produce one instance of the output field. In the following simplified example, we want to add elements f1 and f2, which occur at the same level, and either of which may or may not be present, to a single output element – and only do it once. We therefore need to check for the presence of both elements, or of either element without the other:

<!-- Match f1 and/or f2 -->
<xsl:template match="f1[../f2]|f1[not(../f2)]|f2[not(../f1)]">
  <combined-field>
    <xsl:copy-of select="../f1" />
    <xsl:copy-of select="../f2" />
  </combined-field>
</xsl:template>

Note that the XPath expression will become increasingly complicated if we are combining more than two potentially absent elements. In such a situation, it would be desirable to find other ways to do this, to avoid a combinatorial explosion of match expressions.
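One such alternative, where the transformation is done by script instead of XSLT, is to collect whichever of the source elements are present and emit the combined element once. The amount of code then stays the same however many optional fields are involved. The sketch below uses Python's standard ElementTree; `combine_fields` and the unprefixed target name `description` are illustrative only.

```python
import copy
import xml.etree.ElementTree as ET


def combine_fields(record, source_names, target_name):
    """Copy whichever source elements exist into one new element.

    Returns None when none of them is present, so that no empty
    output field is written.
    """
    present = [child for name in source_names for child in record.findall(name)]
    if not present:
        return None
    combined = ET.Element(target_name)
    for child in present:
        combined.append(copy.deepcopy(child))  # keep any substructure intact
    return combined


record = ET.fromstring(
    "<record><commentary>Notes</commentary><summary>A play</summary></record>"
)
names = ["summary", "commentarytitle", "commentary"]
print(ET.tostring(combine_fields(record, names, "description"), encoding="unicode"))
```

The output follows the order of `source_names` rather than document order, which keeps the combined field consistent across records.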

Issues in Mapping Metadata

Scraping data

We scrape data at the granularity of a page. Sometimes this yields metadata at different levels. For example, the BBC’s “Shakespeare’s Restless World” is a radio programme produced in conjunction with the British Museum, exploring an event or period in history through a particular object from the museum’s collections. The pages describing each show tend to contain metadata about the show, and also metadata about the object around which the show is centred. These can either be mapped to a description of a single entity representing “the programme about the object”, or the metadata can be recorded independently for both the programme and the object.

How to record the various metadata at different levels?

In this case we conceptualise the object as the resource being described; metadata about the programme (title, description, URL to a podcast) will be recorded in a record describing the programme, while supplementary detail (background/object facts, quotes) will be incorporated into the text description (using appropriate tags) and also distributed where possible into a record that describes the object. A page listing objects by play is scraped separately in order to supply associated plays to each programme/object.

Summary

Aggregating a wide range of metadata by mapping its values to the fields of a single common target schema renders it searchable in a unified interface. The mapping effort does however throw up a number of issues, some of which have been described in this post along with potential solutions. In particular:

It is necessary to be clear on what one is describing with a particular schema – a record, or a service providing a description of a record.
Collection and transformation must be performed carefully, with sensitivity to the requirements of the schema; we cannot rely on receiving or extracting data of the correct type or structure.
We should try to maintain as much as we can of the structure of the source metadata, so as to minimise the loss of information inherent in shoehorning data into a different representation.

T-Rex and Utahraptor point out that “Shakespeare needs tons of notes to be readable today” – sometimes information and cultural artefacts need to be mapped into a single common language and supported by supplementary metadata. (Is that a weak segue into a Dinosaur Comics link?)

Creating an XML Schema from a Table Description

I found that developing the schema was most easily done in a spreadsheet or table format, so I wrote scripts to transform that specification (at least partially) into (a) an XML schema and (b) an HTML layout of the same. To produce a schema, we produce elements with minOccurs and maxOccurs attributes. If the field is required, minOccurs is 1, otherwise it is 0. If the field is repeatable, maxOccurs is “unbounded”, otherwise it is 1. Here is the script for generating a schema file from a CSV description of the WW fields.

Disclaimer: this script is quick and dirty, non-validating, and WW-specific. Use it at your own risk; or better, adapt it!
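For readers who only want the gist, the minOccurs/maxOccurs rule can be sketched in a few lines of Python. This is not the WW script. It assumes a CSV with the hypothetical columns `name`, `required` and `repeatable`, and it produces only an `xs:sequence` of element declarations, not a complete schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def is_yes(cell):
    """Interpret a table cell as a yes/no flag (assumed spellings)."""
    return cell.strip().lower() in {"y", "yes", "true", "1"}


def csv_to_sequence(csv_text):
    """Turn rows of name/required/repeatable into xs:element declarations."""
    sequence = ET.Element(f"{{{XS}}}sequence")
    for row in csv.DictReader(io.StringIO(csv_text)):
        ET.SubElement(sequence, f"{{{XS}}}element", {
            "name": row["name"].strip(),
            # Required fields must occur at least once.
            "minOccurs": "1" if is_yes(row["required"]) else "0",
            # Repeatable fields may occur any number of times.
            "maxOccurs": "unbounded" if is_yes(row["repeatable"]) else "1",
        })
    return sequence


table = "name,required,repeatable\ntitle,yes,no\ngenre,no,yes\n"
print(ET.tostring(csv_to_sequence(table), encoding="unicode"))
```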