Discussions with CCED (or how I learned to stop worrying about vagueness and love point data)

I met recently with Prof. Stephen Taylor of the University of Reading. Prof. Taylor is one of the investigators of the Clergy of the Church of England (CCED) database project, whose backend development is the responsibility of the Centre for Computing in the Humanities (CCH). Like so many other online historical resources, CCED’s main motivation is to bring things together, in this case information about the CofE clergy between 1540 and 1835, just after which predecessors to the Crockford directory began to appear. There is, however, a certain divergence between what CCED does and what Crockford (simply a list of names of all clergy) does.

CCED started as a list of names, with the relatively straightforward ambition of documenting the name of every ordained person between those dates, drawing on a wide variety of historical sources. Two things fairly swiftly became apparent: that a digital approach was needed to cope with the sheer amounts of information involved (CD-ROMs were mooted at first), and that a facility to build queries around location would be critical to the use historians make of the resource. There is therefore clearly scope for considering how Chalice and CCED might complement one another.

Even more importantly, however, some of the issues which CCED have come up against in terms of structure have a direct bearing on Chalice’s ambitions. What was most interesting from Chalice’s point of view was the great complexity which the geographic component contains. It is important to note that there was no definitive list of English ecclesiastical parish names prior to the CCED (crucially, what was needed was a list which also followed through the history of parishes – e.g. dates of creation, dissolution, merging, etc.). This is a key thing that CCED provides, and is in and of itself of great benefit to the wider community.

Location in CCED is dealt with in two ways: jurisdictional and geographical (see this article). Contrary to popular opinion, which tends to perceive a neat cursus honorum descending from bishop to archdeacon to deacon to incumbent to curate etc, ecclesiastical hierarchies can be very complex. For example, a vicar might be geographically located within a diocese, and yet not report to the bishop responsible for that diocese (‘peculiar’ jurisdictions).

In the geographic sense, location is dealt with in two distinct ways – according to civil geographical areas, such as counties, and according to what might be described as a ‘popular understanding’ of religious geography, treating a diocese as a single geographic unit. Where known, each parish name has a date associated with it, and for the most part this remains constant throughout the period, although where a name has changed there are multiple records (a similar principle to the attestation value of Chalice names, but a rather different approach in terms of structure).

Sub-parish units are a major issue for CCED, and there are interesting comparisons in the issues this throws up for EPNS. Chapelries are a key example: these existed for sure, and are contained within CCED, but it is not always possible to assign them to a geographical footprint (I left my meeting with Prof. Taylor considerably less secure in my convictions about spatial footprints), at least beyond the fact that, almost by definition, they will have been associated with a building. Even then there are problems, however. One example comes from East Greenwich, where there is a record of a curate being appointed, but there is no record of where the chapel is or was, and no visible trace of it today.

Boundaries are particularly problematic. The phenomenon of ‘beating the bounds’ around parishes only occurred where there was an economic or social interest in doing this, e.g. when there was an issue of which jurisdiction tithes should be paid to. Other factors in determining these boundaries were folk memories, and the memories of the oldest people in the settlement. However, it is the case that, for a significant minority of parishes at least, there was very little formal or mapped conception of parish boundaries before the Ordnance Survey.

For this reason, many researchers consider that mapping based on points is more useful than boundaries. An exception is where boundaries followed natural features such as rivers. This is an important issue for Chalice to consider in its discussion about capturing and marking up natural features: where and how have these featured in the assignation and georeferencing of placenames, and when?

A similar issue is the development of urban centres in the late 18th and 19th centuries: in most cases these underwent rapid changes, and a system of ‘implied boundaries’ reflects the situation then more accurately than hard and fast geolocations.

Despite this, CCED reflects the formal structured entities of the parish lists. Its search facilities are excellent if you wish to search for information about specific parishes whose name(s) you know, but, for example, it would be very difficult to search for ‘parishes in the Thames Valley’; or (another example given in the meeting), to define all parishes within one day’s horse riding distance of Jane Austen’s home, thus allowing the user to explore the clerical circles she would have come into contact with but without knowing the names of the parishes involved.
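To make the ‘one day’s ride’ idea concrete, here is a minimal sketch of a point-based radius search over church locations. Everything in it is an assumption for illustration: the parish names and coordinates are invented, the centre point is a placeholder, and the 40 km figure for a day’s ride is a guess rather than anything agreed in the meeting. It is not how CCED or Chalice search today.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def parishes_within(centre, churches, radius_km):
    """Names of parishes whose church point lies within radius_km, nearest first."""
    hits = []
    for name, (lat, lon) in churches.items():
        distance = haversine_km(centre[0], centre[1], lat, lon)
        if distance <= radius_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


# Hypothetical data: invented names and coordinates, not real parishes.
churches = {
    "Parish A": (51.25, -1.20),
    "Parish B": (51.40, -1.00),
    "Parish C": (52.50, -2.00),
}
home = (51.23, -1.22)   # placeholder centre point
DAY_RIDE_KM = 40.0      # assumed riding distance, purely illustrative

nearby = parishes_within(home, churches, DAY_RIDE_KM)
```

The point of the sketch is that nothing here needs a parish boundary: a single point per church is enough to answer the question without knowing any parish names in advance.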

At sub-parish level, even the structured information is lacking. For example, there remains no definitive list of chapelries. CCED has ‘created’ chapelries where the records indicate that one existed (the East Greenwich example above is an instance of this). In such cases, a link with Chalice and/or Victoria County History (VCH) could help establish or verify such conjectured associations (posts on Chalice’s discussions with VCH will follow at some point).

When one dips below even the imperfect georeferencing of parishes, there are non-geographic, or semi-geographic, exceptions which need to be dealt with: chaplains of naval vessels are one example, as are cathedrals, which sit outside the system, and indeed maintain their own systems and hierarchies. In such cases, it is better to pinpoint the things that can be pinpointed, and leave it to the researcher to build their own interpretations around the resulting layers of fuzziness. One simple point layer that could be added to Chalice, for example, is data from Ordnance Survey describing the locations of churches: a set of simple points which would associate the name of a parish with a particular location, not worrying too much about the amorphous parish boundaries, and yet eminently connectible to the structure of a resource such as CCED.

In the main, the interests that CCED share with Chalice are ones of structural association with geography. Currently, Chalice relies on point-based grid georeferencing, where that has been provided by county editors for the English Place Name Survey. However, the story is clearly far more complex than this. If placename history is also landscape history, one must also accept that it is intimately linked to Church history, since the Church exerted so much influence over all areas of life for so much of the period of history in question.

Therefore Chalice should consider two things:

  1. what visual interface/structure would work best to display complex layers of information?
  2. how can the existing (limited) georeferencing of EPNS be enhanced by linking it to resources such as CCED?

The association of (EPNS, placename, church, CCED, VCH) could allow historians to construct the kind of queries they have not been able to construct before.

Linked Data choices for historic places

We’ve had some fitful conversation about modelling historic place-names extracted from the English Place Name Survey as Linked Data, on the Chalice mailing list.
It would be great to get more feedback from others where we have common ground. Here’s a quick summary of the main issues we face and our key points of reference, to start discussion, and we can go into more detail on specific points as we work more with the EPNS data.

Re-use, reduce, recycle?

We should be making direct re-use of others’ vocabularies where we can. In some areas this is easy. For example, to represent the containment relations between places (a township contains a parish, a parish contains a sub-parish) we can re-use some of the Ordnance Survey Research work on linked data ontologies – specifically their vocabulary to describe “Mereological Relations” – where “mereological” is a fancy word for “containment relationships”.
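As a rough illustration of what containment relations look like as data, here is a minimal sketch using plain subject/predicate/object tuples. The `http://example.org/chalice/` namespace, the `contains` predicate and the place identifiers are all hypothetical stand-ins; in practice the predicate would come from the Ordnance Survey vocabulary rather than from us.

```python
# Hypothetical namespace and predicate, used only for this sketch.
EX = "http://example.org/chalice/"
CONTAINS = EX + "contains"

# Follows the example in the text: township > parish > sub-parish.
triples = {
    (EX + "place/township-1", CONTAINS, EX + "place/parish-1"),
    (EX + "place/parish-1", CONTAINS, EX + "place/sub-parish-1"),
}


def contained_in(place, triples):
    """All places that contain `place`, directly or through a chain of links."""
    found = set()
    frontier = [place]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if predicate == CONTAINS and obj == current and subject not in found:
                found.add(subject)
                frontier.append(subject)
    return found


containers = contained_in(EX + "place/sub-parish-1", triples)
```

Only the direct links are stored; the fact that the township also contains the sub-parish falls out of walking the chain, which is the main attraction of treating containment as a relation rather than as geometry.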

Adapting other schemas into a Linked Data model

One project which provides a great example of a more link-oriented, less geometry-oriented approach to describing ancient places is the Pleiades collection of geographic information about the Classical ancient world. Over the years, Pleiades has developed with scholars an interesting set of vocabularies, which don’t take a Linked Data approach but could be easily adapted to do so. They encounter issues to do with vagueness and uncertainty that geographical information systems concerning the contemporary world can overlook. For example, the Pleiades attestation/confidence vocabulary expresses the certainty of scholars about the conclusions they are drawing from evidence.

So an approach we can take is to build on work done in research partnerships by others, and try to build mind-share about Linked Data representations of existing work. Pleiades also use URIs for places…

Use URIs as names for things

One interesting feature of the English Place Name Survey is the index of sources for each set of volumes. Each different source which documents names (old archives, previous scholarship, historic maps) has an abbreviation, and every time a historic place-name is mentioned, it’s linked to one of the sources.

As well as creating a namespace for historic place-names, we’ll create one for the sources (centred on the five volumes covering Cheshire, which is where the bulk of work on text correction and data extraction has been done). Generally, if anything has a name, we should be looking to give it a URI.

Date ranges

Is there a rough consensus (based on volume of data published, or number of different data sources using the same namespace) on what namespace to use to describe dates and date ranges as Linked Data? At one point there were several different versions of iCal, hCal, xCal vocabularies all describing more or less the same thing.

We’ve also considered other ways to describe date ranges – talking to Pleiades about mereological relations between dates – and investigating the work of Common Eras on user-contributed tags representing date ranges. It would be hugely valuable to learn about, and converge on, others’ approaches here.

How same is the same?

We propose to mint a namespace for historic place-names documented by the English Place Name Survey. Each distinct place-name gets its own URI.

For some of the “major names”, we’ve been able to use the Language Technology Group’s georesolution tool to make a link between the place-name and the corresponding entry in geonames.org.

Some names can’t be found in geonames, but can be found, via Unlock Places gazetteer search, in some of the Ordnance Survey open data sources. Next week we’ll be looking at using Unlock to make explicit links to the Ordnance Survey Linked Data vocabularies. One interesting side-effect of this is that, via Chalice, we’ll create links between geonames and the OS Linked Data, that weren’t there before.

Kate Byrne raised an interesting question on the Chalice mailing list – is the ‘sameAs’ link redundant? For example, if we are confident that Bosley in geonames.org is the same as Bosley in the Cheshire volumes of English Place Name Survey, should we re-use the geonames URI rather than making a ‘sameAs’ link between the two?

How same, in this case, is the same? We may have two, or more, different sets of coordinates which approximately represent the location of Bosley. Is it “correct”, in Linked Data terms, to state that they are all “the same” when the locations are subtly different?
This is before we even get into the conceptual issues around whether a set of coordinates really has meaning as “the location” of a place. Geonames, in this sense, is a place to start working out towards more expressive descriptions of where a place is, rather than a conclusion.
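One way to keep this decision explicit, rather than asserting identity everywhere, is sketched below. It is only an illustration of the trade-off, not something we have settled on: the 500 m tolerance, the idea of falling back to `skos:closeMatch`, and the `example.org` URIs are all assumptions. The two predicate URIs themselves are the standard OWL and SKOS ones.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"


def link_for(epns_uri, other_uri, offset_m, tolerance_m=500.0):
    """Pick a link predicate from how far apart the two recorded locations are.

    offset_m is the distance in metres between the two coordinate sets;
    tolerance_m is an assumed cut-off, not an agreed project value.
    """
    if offset_m <= tolerance_m:
        predicate = OWL_SAME_AS
    else:
        predicate = SKOS_CLOSE_MATCH
    return (epns_uri, predicate, other_uri)


# Hypothetical URIs standing in for an EPNS name and a gazetteer entry.
epns = "http://example.org/epns/cheshire/bosley"
gazetteer = "http://example.org/gazetteer/bosley"

close = link_for(epns, gazetteer, offset_m=120.0)
distant = link_for(epns, gazetteer, offset_m=2500.0)
```

Whether coordinate distance should drive the choice at all is exactly the open question above; the sketch just shows that a weaker link is available when “the same” feels too strong.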

Long-term preservation

Finally, we want to make sure that any URIs we mint are going to be preserved on a really long time horizon. I discussed this briefly on the Unlock blog last year. University libraries, or cultural heritage memory institutions, may be able to delegate a sub-domain whose long-term persistence we can agree on – but the details of the agreement, and periodic renewal of it due to infrastructural, organisational and technological change, are a much bigger issue than I think we recognise.

Visualisation of some early results

Claire showed us some early results from the work of the Language Technology Group, text mining volumes of the English Place Name Survey to extract geographic names and relations between them.

LTG visualisation of some Chalice data

What you see here (or in the full-size visualisations – start with files *display.html) is the set of names extracted from an entry in EPNS (one town name, and associated names of related or contained places). Note that this is just a display; the data structures are not published here at the moment. We’ll talk about that next week.

The names are then looked up in the geonames place-name gazetteer, to get a set of likely locations; then the best-match locations are guessed at based on the relations of places in the document.

Looking at one sample, for Ellesmere – five names are found in geonames, five are not. Of the five that are found, only two are certainly located, i.e. we can tell that the place in EPNS and the place in geonames are the same, and establish a link.

What will help improve the quantity of samenesses that we can establish, is filtering searches to be limited by counties – either detailed boundaries or bounding boxes that will definitely contain the county. Contemporary data is now there for free re-use through Unlock Places, which is a place to start.
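A minimal sketch of the bounding-box version of this filter follows. The candidate records, the box and its coordinates are all invented for illustration, and this is not the LTG georesolver’s actual interface; it simply shows how a coarse box that is known to contain the county discards same-named candidates elsewhere.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One gazetteer hit for a place-name (hypothetical record shape)."""
    name: str
    lat: float
    lon: float


class BBox(NamedTuple):
    """A latitude/longitude box assumed to enclose a whole county."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def filter_by_county(candidates, bbox):
    """Keep only the candidates whose point falls inside the county box."""
    return [c for c in candidates if bbox.contains(c.lat, c.lon)]


# Invented values: a generous box and two same-named candidates.
county_box = BBox(min_lat=53.0, min_lon=-3.2, max_lat=53.5, max_lon=-1.9)
candidates = [
    Candidate("Example", 53.2, -2.1),   # inside the box
    Candidate("Example", 50.9, -0.3),   # same name, far outside it
]

kept = filter_by_county(candidates, county_box)
```

A box will let through some candidates just over the county line, which is the trade-off against using detailed boundaries, but it is cheap and cannot wrongly exclude a place that really is in the county.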

Note – the later volumes of EPNS do provide OS National Grid coordinates for town names; the earlier ones do not; we’re still not sure when this starts, and will have to check in with EPNS when we all meet there on September 3rd.

How does this fit expectations? We know from past investigations with mixed sets of user-contributed historic place-name data that geonames does well, but not typically above 50% of things located. Combining geonames with OS Open Data sources should help a bit.

The main thing I’m looking to find out now is what proportion of the set of all names will be left floating without a georeference, and how many hops or links we’ll have to traverse to connect floating place-names with something that does have a georeference. I also want to find out how important it will be to convey uncertainty about measurements, and what the cost/benefit will be of making interfaces allowing one to annotate and to correct the locations of place-names against different historic map data sources.
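The “how many hops” question can be sketched as a breadth-first search over whatever links we end up extracting. The names and links below are invented, and the shape of the data (a plain adjacency dictionary plus a set of georeferenced names) is an assumption made for the sake of the example.

```python
from collections import deque


def hops_to_georeference(start, links, georeferenced):
    """Fewest links from `start` to any georeferenced name, or None if unreachable."""
    if start in georeferenced:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        name, distance = queue.popleft()
        for neighbour in links.get(name, ()):
            if neighbour in seen:
                continue
            if neighbour in georeferenced:
                return distance + 1
            seen.add(neighbour)
            queue.append((neighbour, distance + 1))
    return None


# Invented example: a field-name linked to a hamlet, linked to a township.
links = {
    "field-name": ["hamlet"],
    "hamlet": ["field-name", "township"],
    "township": ["hamlet"],
    "isolated-name": [],
}
georeferenced = {"township"}

hops = {name: hops_to_georeference(name, links, georeferenced) for name in links}
```

Names that come back as `None` are the ones left floating; counting them, and looking at the distribution of hop counts for the rest, would give the proportions asked about above.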

Clearly the further back we go the squashier the data will be; some of the most interesting use cases that CeRch have been talking to people about involve Anglo-Saxon place references. No maps – not a bad thing – but potentially many hops to a “certain” reference. We are thinking about how we can re-use, or turn into RDF namespaces, some of the Pleiades Ancient World GIS work on attestation/confidence of place-names and locations.