List of outcomes of the Chalice project

We put together this list of outcomes of the Chalice project for our last bi-weekly project meeting, on 28th April 2011. The final product post offers an introduction to Chalice.

Tangibles

These are pieces of work completed to form the project:

  • Corrected OCR for 5 EPNS volumes (*not* open licensed)
  • Quality assessment of the OCR
  • Extracted data in XML
  • Report on the text-mining and georeferencing process
  • RDF representation of extracted data, Open Database License
  • Searchable JSON API for the extracted data
  • Two prototype visualisations
  • Source code for the preceding 4 items
  • Two use case assessments
  • Supporting material for the use case assessments
  • Simple web service for alt-names for ADS
  • Sample Integration with GBHGIS data

Intangibles

These are less concrete but equally valuable side-effects of the project work:

  • A set of sameAs assertions for Cheshire names between geonames and the Ordnance Survey 50K gazetteer, for submission to sameas.org
  • Historic place-name data that could potentially be used to enhance geonames.org
  • Improvements to the Edinburgh Geoparser and the Unlock Text service
  • Pushed forward open source release of the Geoparser
  • Refactoring of the Unlock Places service
  • Discussions and potential alignment with other projects (SPQR, Pleiades, GBHGIS)
  • Discussions with other place-name surveys (SPNS – Wales?)
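To make the first item above more concrete: a sameAs assertion is a single RDF statement saying that two URIs identify the same place. Here is a minimal sketch of writing such statements out as N-Triples. The `owl:sameAs` property URI is the standard one; the two place URIs under example.org are placeholders invented for illustration, not real geonames or Ordnance Survey identifiers, and `same_as_triple` is a hypothetical helper rather than anything from the Chalice code.

```python
# Standard OWL property used to state that two URIs denote the same thing.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(subject_uri: str, object_uri: str) -> str:
    """Return one N-Triples line asserting that the two URIs are the same."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."


# Placeholder URI pairs (assumptions for illustration only).
pairs = [
    ("http://example.org/geonames/0000001", "http://example.org/os50k/0000001"),
]

triples = [same_as_triple(s, o) for s, o in pairs]

for line in triples:
    print(line)
```

Each printed line is one assertion, so a file of them could be handed over as a plain list of matches.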

Talks / Dissemination

Linking historic places: looking at Victoria County History

Stuart Dunn mentioned the Victoria County History in his writeup of discussions with the Clergy of the Church of England Database project. Both resources are rich in place-name mentions and historic depth; as part of the Chalice project we’re investigating ways to make such resources more searchable by extracting historic place-names and linking them to our gazetteer.

Here’s a summary of an email conversation between Stuart, Claire Grover, Ross Drew at EDINA and myself while looking at some sample data from VCH.

The idea is to explore how Chalice data could enhance or complement semi-structured information like VCH (or more structured database-like sources such as CCED).

It would be very valuable, I think, to analyse how much effort and preparation of the (target) data is needed to link CHALICE to VCH, and to a more structured dataset like CCED. By providing georeferences and toponym links, we’re bringing all that EPNS documentary evidence to VCH, thus enriching it.

It would be very interesting if we were able to show how text-mining techniques could be used to add to the work of EPNS: extracting place references that aren’t listed, and suggesting them to editors along with suggested attestations (source and date).

In the more immediate future, this is about adding links from Chalice place-references to other resources, which would allow us to cross-reference them and search them in interesting ways.

Text mining isn’t absolutely necessary to map the EPNS place names to the VCH text. On the other hand, LTG have all the processing infrastructure to convert formats, tokenise the text etc. so we could put something in place very quickly. It wouldn’t be perfect but it would demonstrate the point. I’ve not seen the CCED data, so don’t know how complex that would be.
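As a rough illustration of the simpler, non-text-mining route mentioned above, here is a minimal sketch of a plain gazetteer lookup: take the known name forms for each place and search for them directly in a passage of text. This is not the LTG pipeline. The gazetteer entries and the sample sentence are made up for the example, and `find_mentions` is a hypothetical helper.

```python
import re

# Made-up gazetteer: modern headword -> known name forms (assumption, not EPNS data).
GAZETTEER = {
    "Exampleton": ["Exampleton", "Exempletune"],
    "Sample Magna": ["Sample Magna", "Sampel"],
}


def find_mentions(text, gazetteer):
    """Return (offset, matched form, headword) for each known form found in text."""
    # Map every name form back to the headword it belongs to.
    lookup = {form: head for head, forms in gazetteer.items() for form in forms}
    # Try longer forms first so multi-word names win over shorter ones.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(f) for f in ordered) + r")\b")
    return [(m.start(), m.group(1), lookup[m.group(1)]) for m in pattern.finditer(text)]


text = "The manor of Exempletune, later Exampleton, lay near Sample Magna."
mentions = find_mentions(text, GAZETTEER)

for offset, form, headword in mentions:
    print(offset, form, "->", headword)
```

Exact string matching like this will miss spelling variants that are not already in the gazetteer and cannot tell apart different places sharing a name, which is where the tokenising and text-mining infrastructure would earn its keep.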

Here’s a sample reference to a volume of VCH that may have some overlap with the Shropshire content we have in “born-digital” form from EPNS. There’s the intriguing prospect of adding historic place-name text mining/search in at the digitisation phase, so resources can be linked to other references as soon as they’re published.

Quality of text correction analysis from CDDA

The following post is by Elaine Yeates, project manager at the Centre for Data Digitisation and Analysis in Belfast. Elaine and her team have been responsible for taking scans of a selection of volumes of the English Place Name Survey and turning them into corrected OCR’d text, for later text mining to extract the data structures and republish them as Linked Data.

“I’ve worked up some figures based on an average character count from Cheshire, Buckinghamshire, Cambridgeshire and Derbyshire.

We had two levels of quality control:

1st QA Spelling and Font:- On completion of the OCR process and based on 40 pages averaging 4000 characters per page the error rate was 346 character errors (average per page 8.65) = 0.22

1st QA Unicode:- On completion of the OCR process and based on 40 pages averaging 4000 characters per page the error rate was 235 character errors (average per page 5.87)= 0.14.

TOTAL Error Rate 0.36

2nd QA – Encompasses all of 1st QA and based on 40 pages averaging 4000 characters per page the error rate was 18 character errors (average per page 0.45) = 0.01.

Through the pilot we identified that there are quite a few Unicodes unique to this material. CDDA developed an in-house online Unicode database for analysts; they can view, update the capture file and raise new codes when found. I think for a more substantial project we might direct our QA process through an online audit system, where we could identify issues with material, OCR of same, macros and the 1st and 2nd stages of quality control.

We are pleased with these figures and it looks encouraging for a larger scaled project.”

Elaine also wrote in response to some feedback on markup error rates from Claire Grover on behalf of the Language Technology Group:

“Thanks for these. Our QA team are primarily looking for spelling errors; from your list the few issues seem to be bold, spaces and small caps.

Of course when tagging, especially automated, you’re looking for certain patterns, however moving forward I feel this error rate is very encouraging and it helps our QA team to know what patterns might be searchable for future capture.

Looking at your issues so far on Part IV (5 issues e-mailed) against a total word count of 132,357, that gives an error rate of 0.00003.”

I am happy to have these numbers: they let us observe consistency of quality over iterations, as we find ways to work with more volumes of EPNS.
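As a quick check on the figures quoted above, here is a small sketch that recomputes them from the stated counts. It assumes the character error rates are percentages of the 160,000 characters sampled (40 pages at 4000 characters each), which is my reading rather than something the quoted text states, and that the markup figure is a plain fraction of the word count. The function name is invented for the example.

```python
def error_rate_percent(errors: int, pages: int = 40, chars_per_page: int = 4000) -> float:
    """Character errors as a percentage of all characters sampled."""
    return 100.0 * errors / (pages * chars_per_page)


# Counts taken from the quoted QA figures.
first_qa_spelling = error_rate_percent(346)  # spelling and font errors
first_qa_unicode = error_rate_percent(235)   # Unicode errors
first_qa_total = first_qa_spelling + first_qa_unicode
second_qa = error_rate_percent(18)

# Markup issues on Part IV, as a fraction of words (not a percentage).
markup_rate = 5 / 132357

print(f"1st QA spelling and font: {first_qa_spelling:.3f}%")
print(f"1st QA Unicode:           {first_qa_unicode:.3f}%")
print(f"1st QA total:             {first_qa_total:.3f}%")
print(f"2nd QA:                   {second_qa:.3f}%")
print(f"Part IV markup issues:    {markup_rate:.6f}")
```

Under those assumptions the recomputed values agree with the quoted ones to within one unit in the last quoted digit.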