List of outcomes of the Chalice project

We put together this long list of different things that happened during the Chalice project, for our last bi-weekly project meeting, on 28th April 2011. The final product post offers an introduction to Chalice.

Tangibles

These are pieces of work completed to form the project:

  • Corrected OCR for 5 EPNS volumes (*not* open licensed)
  • Quality assessment of the OCR
  • Extracted data in XML
  • Report on the text-mining and georeferencing process
  • RDF representation of extracted data, Open Database License
  • Searchable JSON API for the extracted data
  • Two prototype visualisations
  • Source code for the preceding 4 items
  • Two use case assessments
  • Supporting material for the use case assessments
  • Simple web service for alt-names for ADS
  • Sample Integration with GBHGIS data

Intangibles

These are less concrete but equally valuable side-effects of the project work:

  • A set of sameAs assertions for Cheshire names between geonames and Ordnance Survey 50K gazetteer to go to sameas.org
  • Historic place-name data to enhance geonames.org with, potentially.
  • Improvements to the Edinburgh Geoparser and the Unlock Text service
  • Pushed forward open source release of the Geoparser
  • Refactoring of the Unlock Places service
  • Discussions and potential alignment with other projects (SPQR, Pleiades, GBHGIS)
  • Discussions with other place-name surveys (SPNS – Wales?)

Talks / Dissemination

Chalice at WhereCamp

I was lucky enough to get to WhereCamp UK last Friday/Saturday, mainly because Jo couldn’t make it. I’ve never been to one of these unconferences before but was impressed by the friendly, anything-goes atmosphere, and emboldened to give an impromptu talk about CHALICE. I explained the project setup, its goals and some of the issues encountered, at least as I see them:

  • the URI minting question
  • the appropriateness (or lack of it) of only having points to represent regions instead of polygons
  • the scope for extending the nascent historical gazetteer we’re building and connecting it to others
  • how the results might be useful for future projects.

I was particularly looking for feedback on the last two points: ideas on how best to grow the historical gazetteer and who has good data or sources that should be included if and when we get funding for a wider project to carry on from CHALICE’s beginnings; and secondly, ideas about good use cases to show why it’s a good idea to do that.

We had a good discussion, with a supportive and interested audience. I didn’t manage to make very good notes, alas. Here’s a flavour of the discussion areas:

  • dealing with variant spellings in old texts – someone pointed out that the sound of a name tends to be preserved even though the spelling evolves, and maybe that can be exploited;
  • using crowd-sourcing to correct errors from the automatic processes, plus to gather further info on variant names;
  • copyright and IPR, and the fact that being out of print copyright doesn’t mean there won’t be issues around digital copyright in the scanned page images;
  • whether or not it would be possible – in a later project – to do useful things with the field names from EPNS;
  • the idea of parsing out the etymological references from EPNS, to build a database of derivations and sources;
  • using the gazetteer to link back to the scanned EPNS pages, to assist an online search application.
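The first point above, that the sound of a name tends to survive changes in spelling, can be sketched with a phonetic key. This is only an illustration: Soundex is swapped in here as one well-known phonetic algorithm, not something the discussion or the project specified, and it was designed for English surnames, so it is likely to be crude for medieval spellings. The variant spelling in the usage note is invented for the example and is not a real attestation.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex key for a name ("" if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    key = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Add a digit only when it differs from the previous coded sound.
        if code and code != prev:
            key += code
        # h and w do not separate repeated sounds; vowels do.
        if ch not in "hw":
            prev = code
    return (key + "000")[:4]


def same_sound(a, b):
    """True when two spellings share a Soundex key."""
    return soundex(a) == soundex(b)
```

For instance, `same_sound("Nantwich", "Nantwych")` is true because both spellings reduce to the same key, while `same_sound("Nantwich", "Chester")` is false. A match like this would only be a candidate for a human editor to confirm, which ties in with the crowd-sourcing point above.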

Plenty of use cases were suggested, and here are some that I remember, plus ideas about related projects that it might be good to tie up with:

  • a good gazetteer would aid research into the location of places that no longer exist, e.g. from the Domesday period – if you can locate historical placenames mentioned in the same text you can start narrowing down the likely area for the mystery places;
  • the library world is likely to be very interested in good historical gazetteers, a case mentioned being the Alexandria Library project sponsored by the Library of Congress amongst others;
  • there are overlaps and ideas to share with similar historical placename projects like Pleiades, Hestia and GAP (Google Ancient Places).

I mentioned that, being based in Edinburgh, we’re particularly keen to include Scottish historical placenames. There are quite a few sources and people who have been working for ages in this area – that’s probably one of the next things to take forward, to see if we can tie up with some of the existing experts for mutual benefit.

There were loads of other interesting presentations and talks at WhereCamp… but this post is already too long.

Linking historic places: looking at Victoria County History

Stuart Dunn mentioned the Victoria County History in his writeup of discussions with the Clergy of the Church of England Database project. Both resources are rich in place-name mentions and historic depth; as part of the Chalice project we’re investigating ways to make such resources more searchable by extracting historic place-names and linking them to our gazetteer.

Here’s a summary of some email conversation between Stuart, Claire Grover, Ross Drew at EDINA and myself while looking at some sample data from VCH.

The idea is to explore the possibilities in how Chalice data could enhance / complement semi-structured information like VCH (or more structured database-like sources such as CCED).

It would be very valuable, I think, to do an analysis of how much effort and preparation of the (target) data is needed to link CHALICE to VCH, and a more structured dataset like CCED. By providing georeferences and toponym links, we’re bringing all that EPNS documentary evidence to VCH, thus enriching it.

It would be very interesting if we were able to show how text-mining techniques could be used to add to the work of EPNS, by extracting place references that aren’t listed and suggesting them to editors along with suggested attestations (source and date).

In the more immediate future, this is about adding links from Chalice place-references to other resources, which would allow us to cross-reference them and search them in interesting ways.

Text mining isn’t absolutely necessary to map the EPNS place names to the VCH text. On the other hand, LTG have all the processing infrastructure to convert formats, tokenise the text etc. so we could put something in place very quickly. It wouldn’t be perfect but it would demonstrate the point. I’ve not seen the CCED data, so don’t know how complex that would be.
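As a rough sketch of the non-text-mining route mentioned above, place-name forms from a gazetteer can simply be looked up in the target text as whole-word strings. This is not the LTG pipeline; the function name, the gazetteer layout (a dict from name form to identifier) and the `ex:` identifiers are all assumptions made up for the example.

```python
import re


def find_place_mentions(text, gazetteer):
    """Return sorted (offset, name, identifier) tuples for names found in text.

    gazetteer is a hypothetical dict mapping a place-name form (modern or
    historic) to an identifier. Matching is case-insensitive, whole-word only.
    """
    taken = []     # character spans already claimed by a longer name
    mentions = []
    # Longest names first, so a two-word name beats the single word inside it.
    for name in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.start(), match.end()
            if any(s < end and start < e for s, e in taken):
                continue  # overlaps a longer match found earlier
            taken.append((start, end))
            mentions.append((start, name, gazetteer[name]))
    return sorted(mentions)


# Illustrative data only: the identifiers are invented.
sample_gazetteer = {
    "Stretton": "ex:1",
    "Church Stretton": "ex:2",
    "Wenlock": "ex:3",
}
sample_text = "The manor of Church Stretton lay near Wenlock."
sample_mentions = find_place_mentions(sample_text, sample_gazetteer)
```

Here `sample_mentions` contains "Church Stretton" and "Wenlock" but not the bare "Stretton", since the longer form claims that span first. Exact string lookup like this misses variant spellings and cannot tell apart places that share a name, which is where the Geoparser's infrastructure would earn its keep.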

Here’s a sample reference to a volume of VCH that may have some overlap with the Shropshire content we have in “born-digital” form from EPNS. There’s the intriguing prospect of adding historic place-name text mining/search in at the digitisation phase, so resources can be linked to other references as soon as they’re published.

Discussions with CCED (or how I learned to stop worrying about vagueness and love point data)

I met recently with Prof. Stephen Taylor of the University of Reading. Prof. Taylor is one of the investigators of the Clergy of the Church of England Database (CCED) project, whose backend development is the responsibility of the Centre for Computing in the Humanities (CCH). Like so many other online historical resources, CCED’s main motivation is to bring things together, in this case information about the CofE clergy between 1540 and 1835, just after which predecessors to the Crockford directory began to appear. There is, however, a certain divergence between what CCED does and what Crockford (simply a list of names of all clergy) does.

CCED started as a list of names, with the relatively straightforward ambition of documenting the name of every ordained person between those dates, drawing on a wide variety of historical sources. Two things fairly swiftly became apparent: that a digital approach was needed to cope with the sheer amount of information involved (CD-ROMs were mooted at first), and that a facility to build queries around location would be critical to the use historians make of the resource. There is therefore clearly scope for considering how Chalice and CCED might complement one another.

Even more importantly, however, some of the issues which CCED has come up against in terms of structure have a direct bearing on Chalice’s ambitions. What was most interesting from Chalice’s point of view was the great complexity which the geographic component contains. It is important to note that there was no definitive list of English ecclesiastical parish names prior to the CCED (crucially, what was needed was a list which also followed through the history of parishes, e.g. dates of creation, dissolution, merging, etc.), and this is a key thing that CCED provides, and is in and of itself of great benefit to the wider community.

Location in CCED is dealt with in two ways: jurisdictional and geographical (see this article). Contrary to popular opinion, which tends to perceive a neat cursus honorum descending from bishop to archdeacon to deacon to incumbent to curate etc, ecclesiastical hierarchies can be very complex. For example, a vicar might be geographically located within a diocese, and yet not report to the bishop responsible for that diocese (‘peculiar’ jurisdictions).

In the geographic sense, location is dealt with in two distinct ways – according to civil geographical areas, such as counties, and according to what might be described as a ‘popular understanding’ of religious geography, treating a diocese as a single geographic unit. Where known, each parish name has a date associated with it, and for the most part this remains constant throughout the period, although where a name has changed there are multiple records (a similar principle to the attestation value of Chalice names, but a rather different approach in terms of structure).

Sub-parish units are a major issue for CCED, and there are interesting comparisons with the issues this throws up for EPNS. Chapelries are a key example: these certainly existed, and are contained within CCED, but it is not always possible to assign them to a geographical footprint (I left my meeting with Prof. Taylor considerably less secure in my convictions about spatial footprints), at least beyond the fact that, almost by definition, they will have been associated with a building. Even then there are problems, however. One example comes from East Greenwich, where there is a record of a curate being appointed, but there is no record of where the chapel is or was, and no visible trace of it today.

Boundaries are particularly problematic. The phenomenon of ‘beating the bounds’ around parishes only occurred where there was an economic or social interest in doing this, e.g. when there was an issue of which jurisdiction tithes should be paid to. Other factors in determining these boundaries were folk memories, and the memories of the oldest people in the settlement. However, it is the case that, for a significant minority of parishes at least, before the Ordnance Survey there was very little formal/mapped conception of parish boundaries.

For this reason, many researchers consider that mapping based on points is more useful than boundaries. An exception is where boundaries followed natural features such as rivers. This is an important issue for Chalice to consider in its discussion about capturing and marking up natural features: where and how have these featured in the assignation and georeferencing of placenames, and when?

A similar issue is the development of urban centres in the late 18th and 19th centuries: in most cases these underwent rapid changes, and a system of ‘implied boundaries’ reflects the situation then more accurately than hard and fast geolocations.

Despite this, CCED reflects the formal structured entities of the parish lists. Its search facilities are excellent if you wish to search for information about specific parishes whose name(s) you know, but, for example, it would be very difficult to search for ‘parishes in the Thames Valley’; or (another example given in the meeting), to define all parishes within one day’s horse riding distance of Jane Austen’s home, thus allowing the user to explore the clerical circles she would have come into contact with but without knowing the names of the parishes involved.
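The ‘within one day’s ride’ query is the kind of thing point data makes straightforward. A minimal sketch, assuming each parish is reduced to a single latitude/longitude point: the function names, the parish labels, the coordinates and the 40 km radius are all placeholders made up for illustration, not CCED or Chalice data, and straight-line distance ignores roads and terrain.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def places_within(points, centre, radius_km):
    """Return the sorted names of points lying within radius_km of centre.

    points is a hypothetical dict of name -> (lat, lon); centre is (lat, lon).
    """
    lat0, lon0 = centre
    return sorted(
        name for name, (lat, lon) in points.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Placeholder data: not real parish locations.
sample_points = {
    "Parish A": (51.1, -1.0),
    "Parish B": (52.0, -1.0),
}
nearby = places_within(sample_points, centre=(51.0, -1.0), radius_km=40)
```

With these placeholder values `nearby` holds only "Parish A", which lies about 11 km from the centre, while "Parish B" is over 100 km away. The user never needs to know the parish names in advance, only a starting point and a distance.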

At sub-parish level, even the structured information is lacking. For example, there remains no definitive list of chapelries.  CCED has ‘created’ chapelries, where the records indicate that one is apparent (the East Greenwich example above is an instance of this). In such cases, a link with Chalice and/or Victoria County History (VCH) could help establish/verify such conjectured associations (posts on Chalice’s discussions with VCH will follow at some point).

When one dips below even the imperfect georeferencing of parishes, there are non-geographic, or semi-geographic, exceptions which need to be dealt with: chaplains of naval vessels are one example, as are cathedrals, which sit outside the system, and indeed maintain their own systems and hierarchies. In such cases, it is better to pinpoint the things that can be pinpointed, and leave it to the researcher to build their own interpretations around the resulting layers of fuzziness. One simple point layer that could be added to Chalice, for example, is data from Ordnance Survey describing the locations of churches: a set of simple points which would associate the name of a parish with a particular location, not worrying too much about the amorphous parish boundaries, and yet eminently connectible to the structure of a resource such as CCED.

In the main, the interests that CCED shares with Chalice are ones of structural association with geography. Currently, Chalice relies on point-based grid georeferencing, where that has been provided by county editors for the English Place Name Survey. However, the story is clearly far more complex than this. If placename history is also landscape history, one must also accept that it is intimately linked to Church history, since the Church exerted so much influence over all areas of life for so much of the period of history in question.

Therefore Chalice should consider two things:

  1. what visual interface/structure would work best to display complex layers of information
  2. how can the existing (limited) georeferencing of EPNS be enhanced by linking to it?

The association of (EPNS, placename, church, CCED, VCH) could allow historians to construct the kind of queries they have not been able to construct before.

CHALICE: Institutional and Collective Benefits

DRAFT – needs CeRch+CDDA detail plus specific end user engagements, though we can go on about the latter topic in later posts.

At this point we should talk a bit more about who is involved in CHALICE and what we’re hoping to gain from it.

The project is led by the EDINA National Datacentre at the University of Edinburgh. EDINA is almost entirely supported by JISC, and runs the flagship Digimap service which provides UK HE/FE access to national mapping data for the UK.

EDINA also maintains the Unlock service, which provides search across different placename gazetteers, and extraction of placenames from text using different gazetteers to “ground” references to place at definite locations. Unlock started life as the GeoCrossWalk project, and it was our involvement in the “Embedding GeoCrossWalk” project that sparked this interest in using text mining techniques to generate placename authority files from historic texts.

The Language Technology Group at the School of Informatics in Edinburgh were partners in this, and have moved on with us to CHALICE. They created the Edinburgh Geoparser that sits behind the Unlock Text web service. Their text mining magic extends much deeper than we’ve really made use of yet, as far as being able to extract events and relations from text, as well as references to people and concepts.

CHALICE should be a fun challenge in an as yet under-explored research area of historic text mining – tuning grammar rules to do markup that can then be used to train machine learning recognisers, and comparing the results. Through their work with CDDA we hope to gain insight into the best balance between manual annotation and manually-corrected automatic annotation, in terms of cost of work, cost savings for others’ future work, and benefits of the different approaches to named entity recognition.

CeRch

CDDA