Geo-linking EPNS to other sources

We’re wrapping up the loose ends on the Chalice project now, preparing to publish all the final material.


Claire Grover at LTG did some interesting map renderings of the English Place-Name Survey names that we’ve managed to link to names in geonames and the Ordnance Survey Linked Data.

Claire writes: Following last Thursday’s discussion, I’ve pulled out some figures about the georeferences in the Chalice data.

I’ve also mapped the georeferences for each of the files – see the .display.html files in
http://homepages.inf.ed.ac.uk/grover/chalicemaps/. The primary.display.html ones (example: Cheshire Vol. 44) contain only the places that were identified as primary-sub-townships, while the all.display.html ones (example: Cheshire Vol. 44) contain all the places that have at least one grid reference. Note that the colour of the grid references and markers in the display indicates source: green ones are from unlock, red ones are from geonames and blue ones were provided by EPNS (known-gridref – only in Cheshire and Shropshire).

It’s not easy to draw any firm conclusions from this, but I tend to agree with Paul [Ell, of CDDA] that it would be better not to georeference smaller places (secondary-sub-townships) but instead to assign them the grid reference of the larger place they are contained in or associated with.
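As a rough illustration of that suggestion, here is a minimal Python sketch. It assumes each place record carries its level, an optional grid reference and the key of its containing place; the record layout, the keys and the grid reference value are all made up for the example and are not taken from the Chalice data.

```python
# Hypothetical records: keys, field names and the grid reference are
# illustrative placeholders, not real Chalice or EPNS values.
places = {
    "larger-place": {
        "level": "primary-sub-township",
        "gridref": "SJ 000 000",
        "parent": None,
    },
    "smaller-place": {
        "level": "secondary-sub-township",
        "gridref": None,
        "parent": "larger-place",
    },
}


def assigned_gridref(key, places):
    """Return the grid reference to publish for a place.

    Secondary-sub-townships are not georeferenced themselves; they take the
    grid reference of the place that contains them.
    """
    place = places[key]
    if place["level"] == "secondary-sub-township" and place["parent"] is not None:
        return places[place["parent"]]["gridref"]
    return place["gridref"]


print(assigned_gridref("smaller-place", places))
```

The trade-off is that a smaller place loses any georeference of its own, in exchange for never being pinned to a wrong gazetteer match.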

Linked Data for places – any advice?

We’d really benefit from advice about what Linked Data namespaces to use to describe places and the relationships between them. We want to re-use as much of others’ work as possible, and use vocabularies which are likely to be well and widely understood.

Here’s a sample of a “vanilla” rendering of a record for a place-name in Cheshire as extracted from the English Place Name Survey – see this as a rough sketch.

<rdf:RDF>
<chalice:Place rdf:about="/place/cheshire/prestbury/bosley/bosley">
<rdfs:isDefinedBy>/doc/cheshire/prestbury/bosley/bosley
</rdfs:isDefinedBy>
<rdfs:label>Bosley</rdfs:label>
<chalice:parish rdf:resource="/place/cheshire/prestbury/bosley"/>
<chalice:parent rdf:resource="/place/cheshire/prestbury/bosley"/>
<chalice:parishname>Bosley</chalice:parishname>
<chalice:level>primary-sub-township</chalice:level>
<georss:point>53.1862392425537 -2.12721741199493</georss:point>
<owl:sameAs rdf:resource="http://data.ordnancesurvey.co.uk/doc/50kGazetteer/28360"/>
</chalice:Place>
</rdf:RDF>

GeoNames

We could re-use as much as we can of the geonames ontology. It defines a gn:Feature to indicate that a thing is a place, and gn:parentFeature to indicate that one place contains another.
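To make this concrete, here is a small Python sketch that writes out a place as Turtle using gn:Feature and gn:parentFeature. The function name and the choice of Turtle are assumptions made for illustration; the relative paths are the ones from the sample record above, and the label is assumed to contain no double quotes.

```python
GN = "http://www.geonames.org/ontology#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def place_to_turtle(uri, label, parent_uri=None):
    """Describe a place as a gn:Feature, optionally with a gn:parentFeature.

    Assumes the label needs no escaping (no double quotes or backslashes).
    """
    lines = [
        f"<{uri}> a <{GN}Feature> ;",
        f'    <{RDFS_LABEL}> "{label}"',
    ]
    if parent_uri is not None:
        lines[-1] += " ;"
        lines.append(f"    <{GN}parentFeature> <{parent_uri}>")
    lines[-1] += " ."
    return "\n".join(lines)


print(place_to_turtle(
    "/place/cheshire/prestbury/bosley/bosley",
    "Bosley",
    parent_uri="/place/cheshire/prestbury/bosley",
))
```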

Ordnance Survey

Ordnance Survey publish some geographic ontologies: there are some within data.ordnancesurvey.co.uk, and there’s some older work, including a vocabulary for mereological (i.e. containment) relations that includes isPartOf and hasPart. But the status of this vocabulary is unclear – is its use still advised?

The Administrative Geography ontology defines a ‘parish’ relation – this is the inverse of how we’re currently using ‘parish’ (i.e. Prestbury contains Bosley). (And our concepts of historic parish and sub-parish are terrifically vague…)

For place-names found in the 1:50K gazetteer the OS use the NamedPlace class – but it feels odd to be re-using a vocabulary explicitly designed for the 50K gazetteer.

Or…

Are there other wide-spread Linked Data vocabularies for places and their names which we could be re-using? Are there other ways in which we could improve the modelling? Comments and pointers to others’ work would be greatly appreciated.

Structuring a Linked Data namespace for places

Thoughts on structuring a namespace for historic English places, for our prototype Linked Data version of the English Place Name Survey; how do others do it? Our options seem to be:

  1. give each placename a numeric identifier that can be part of the link
  2. create a more human-readable identifier based on the name, to use as part of the link.

Numeric identifiers for places look like common practice. Geonames.org uses numbers to create links for places – so http://sws.geonames.org/2656197/ “is”, or refers to, Baschurch in Shropshire. Though the coordinates of the point may change, the number is associated with the name, and it remains the same.

Ordnance Survey Linked Data also uses a numeric ID to create its link that stands for (the same) Baschurch – http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354.

The Linked Data Patterns online book has a set of patterns for identifying URIs. The patterns are focused on use with systems that are already database-based, with some design thought having gone into how IDs look, how they can be looked up, and how their persistence is guaranteed.

The point here is that the numeric identifiers still need careful curation – an organisational guarantee that the identifiers will stay the same for the foreseeable future.

We’re using a relational database (PostGIS) rather than a triplestore, to hold the Chalice data (because the data model won’t really change or expand). We can’t just use IDs that are created automatically by the database when items are inserted into it, because those might change if the names are inserted in a different order.

During Chalice we’re not building a be-all-end-all system, but rather prototyping how an approach to text mining and georeferencing places can be used to turn an amazing hand-created resource into a 21st century Linked Data gazetteer, leaving behind open source tools to make sure the process can be repeated with more digitised text.

But we’re not building something to throw away; we want to make sure the links we create can be preserved – that they won’t be broken and won’t change their meanings. So it may be better for us to structure our namespace using the EPNS names themselves, and the order in which they occur in the printed volumes of EPNS.

The EPNS volumes are arranged county-by-county – each county has its own editor, and so may have different layout, style guidelines, level of detail for things like field-names, and the presence or absence of OS Grid coordinates, more or less according to the whims of the county editor. (We’ve focused on Cheshire, but LTG have been developing test parsers for samples of several different counties.)

So it makes sense to include the county name in our namespace. This also helps with disambiguation – which Walton is this Walton? But there will still be cases where several places, in quite different locations, but still within the same county, share a name. In this case, we’d also give the places a numeric identifier (Walton-1, Walton-2) in the order in which they appear in the EPNS text.
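A minimal Python sketch of that naming scheme follows. The /place/ prefix mirrors the sample record earlier in this post; the function names and the slug rules are assumptions. In particular, this simple slug would turn characters outside a-z and 0-9 (such as æ or þ in older forms) into hyphens, which a real scheme would need to handle more carefully.

```python
import re
from collections import Counter


def slugify(name):
    """Lower-case a name and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def mint_paths(county, names_in_print_order):
    """Return one path per name, in the order the names appear in the volume.

    Names that occur more than once in the county get -1, -2, ... appended,
    numbered by order of appearance; unique names get no suffix.
    """
    county_slug = slugify(county)
    slugs = [slugify(name) for name in names_in_print_order]
    totals = Counter(slugs)
    seen = Counter()
    paths = []
    for slug in slugs:
        if totals[slug] > 1:
            seen[slug] += 1
            paths.append(f"/place/{county_slug}/{slug}-{seen[slug]}")
        else:
            paths.append(f"/place/{county_slug}/{slug}")
    return paths


print(mint_paths("Cheshire", ["Walton", "Bosley", "Walton"]))
```

Because the suffix depends only on the printed order, re-running the extraction over the same volume should give the same identifiers, which is the property that database-generated IDs lack.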

Some volumes of EPNS give us OS National Grid coordinates for the “major names”, others don’t. Where the “major name” exists in one or more gazetteers (geonames, OS Open Data), the LTG’s georesolver tool can create some of the missing links using the Unlock Places gazetteer cross-search.

There is more potentially useful context in the work of the UK Location Programme on Linked Data namespaces for places – a recent Guide to Linked Data and the UK Location Strategy, and last year’s guidance on Designing URI sets for Location.

One more potential complication, which is a fairly subtle issue of semantics – does a link identify a place, or a description of a place? Ordnance Survey Research try to make the difference clear by using a different namespace for ‘IDs for places’ and ‘IDs for documents describing places’.
So http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354 “is” Baschurch; and http://data.ordnancesurvey.co.uk/doc/50kGazetteer/16354 “is” the description of Baschurch. To make sure we’re properly confused, when a human looks up the /id/ link using a web browser, the browser is redirected to the human-readable /doc/. To actually get hold of the Linked Data description of Baschurch (including the coordinates for it in the 50K gazetteer), one has to specifically request the machine-readable, rather than human-readable, version of the link, like this:

curl -L http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354 -H "Accept: application/rdf+xml"

:) - but now you know that!

This took me a little while, and some back-and-forth with John Goodwin from OS Research on “Twitter”, to figure out, which is why I thought it worth writing down here.
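The same content negotiation can be sketched in Python with the standard library. The function names here are my own, and whether the URL still resolves as it did when this was written is an assumption; only the Accept header and the redirect-following behaviour mirror the curl request above.

```python
import urllib.request


def rdf_request(uri):
    """Build (but do not send) a request that asks for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def fetch_rdf(uri):
    """Send the request and return the body as text.

    urlopen follows redirects by default, much as curl does with -L.
    """
    with urllib.request.urlopen(rdf_request(uri)) as response:
        return response.read().decode("utf-8")


request = rdf_request("http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354")
print(request.get_header("Accept"))
```

Calling fetch_rdf on the /id/ link would then return the machine-readable description rather than the human-readable page, provided the server honours the Accept header.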

Linked Data choices for historic places

We’ve had some fitful conversation about modelling historic place-names extracted from the English Place Name Survey as Linked Data, on the Chalice mailing list.
It would be great to get more feedback from others where we have common ground. Here’s a quick summary of the main issues we face and our key points of reference, to start discussion, and we can go into more detail on specific points as we work more with the EPNS data.

Re-use, reduce, recycle?

We should be making direct re-use of others’ vocabularies where we can. In some areas this is easy. For example, to represent the containment relations between places (a township contains a parish, a parish contains a sub-parish) we can re-use some of the Ordnance Survey Research work on linked data ontologies – specifically their vocabulary to describe “Mereological Relations” – where “mereological” is a fancy word for “containment relationships”.

Adapting other schemas into a Linked Data model

One project which provides a great example of a more link-oriented, less geometry-oriented approach to describing ancient places is the Pleiades collection of geographic information about the Classical ancient world. Over the years, Pleiades has developed with scholars an interesting set of vocabularies, which don’t take a Linked Data approach but could be easily adapted to do so. They encounter issues to do with vagueness and uncertainty that geographical information systems concerning the contemporary world can overlook. For example, the Pleiades attestation/confidence vocabulary expresses the certainty of scholars about the conclusions they are drawing from evidence.

So an approach we can take is to build on work done in research partnerships by others, and try to build mind-share about Linked Data representations of existing work. Pleiades also use URIs for places…

Use URIs as names for things

One interesting feature of the English Place Name Survey is the index of sources for each set of volumes. Each different source which documents names (old archives, previous scholarship, historic maps) has an abbreviation, and every time a historic place-name is mentioned, it’s linked to one of the sources.

As well as creating a namespace for historic place-names, we’ll create one for the sources (centred on the five volumes covering Cheshire, which is where the bulk of work on text correction and data extraction has been done). Generally, if anything has a name, we should be looking to give it a URI.

Date ranges

Is there a rough consensus (based on volume of data published, or number of different data sources using the same namespace) on what namespace to use to describe dates and date ranges as Linked Data? At one point there were several different versions of iCal, hCal, xCal vocabularies all describing more or less the same thing.

We’ve also considered other ways to describe date ranges – talking to Pleiades about mereological relations between dates – and investigating the work of Common Eras on user-contributed tags representing date ranges. It would be hugely valuable to learn about, and converge on, others’ approaches here.

How same is the same?

We propose to mint a namespace for historic place-names documented by the English Place Name Survey. Each distinct place-name gets its own URI.

For some of the “major names”, we’ve been able to use the Language Technology Group’s georesolution tool to make a link between the place-name and the corresponding entry in geonames.org.

Some names can’t be found in geonames, but can be found, via Unlock Places gazetteer search, in some of the Ordnance Survey open data sources. Next week we’ll be looking at using Unlock to make explicit links to the Ordnance Survey Linked Data vocabularies. One interesting side-effect of this is that, via Chalice, we’ll create links between geonames and the OS Linked Data, that weren’t there before.

Kate Byrne raised an interesting question on the Chalice mailing list – is the ‘sameAs’ link redundant? For example, if we are confident that Bosley in geonames.org is the same as Bosley in the Cheshire volumes of English Place Name Survey, should we re-use the geonames URI rather than making a ‘sameAs’ link between the two?

How same, in this case, is the same? We may have two, or more, different sets of coordinates which approximately represent the location of Bosley. Is it “correct”, in Linked Data terms, to state that all three are “the same” when the locations are subtly different?
This is before we even get into the conceptual issues around whether a set of coordinates really has meaning as “the location” of a place. Geonames, in this sense, is a place to start working out towards more expressive descriptions of where a place is, rather than a conclusion.
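One way to put a number on “subtly different” is to measure the distance between two candidate points. The Python sketch below uses the haversine formula on a spherical Earth. The first point is the georss:point from the sample Bosley record; the second point and the 1 km threshold are invented purely for illustration and are not a recommendation about when to assert sameness.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def plausibly_same(point_a, point_b, threshold_km=1.0):
    """True if two (lat, lon) points lie within an arbitrary threshold."""
    return haversine_km(*point_a, *point_b) <= threshold_km


bosley_sample = (53.1862392425537, -2.12721741199493)  # from the sample record
other_source = (53.1872, -2.1272)  # hypothetical second set of coordinates
print(plausibly_same(bosley_sample, other_source))
```

A distance check like this only says the points are close; it says nothing about whether the two sources mean the same place, which is the harder question.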

Long-term preservation

Finally, we want to make sure that any URIs we mint are going to be preserved on a really long time horizon. I discussed this briefly on the Unlock blog last year. University libraries, or cultural heritage memory institutions, may be able to delegate a sub-domain that we can agree to long-term persistence of – but the details of the agreement, and periodic renewal of it due to infrastructural, organisational and technological change, are a much bigger issue than I think we recognise.

Visualisation of some early results

Claire showed us some early results from the work of the Language Technology Group, text mining volumes of the English Place Name Survey to extract geographic names and relations between them.

LTG visualisation of some Chalice data

What you see here (or in the full-size visualisations – start with files *display.html) is the set of names extracted from an entry in EPNS (one town name, and associated names of related or contained places). Note that this is just a display; the data structures are not published here at the moment, and we’ll talk about that next week.

The names are then looked up in the geonames place-name gazetteer, to get a set of likely locations; then the best-match locations are guessed at based on the relations of places in the document.

Looking at one sample, for Ellesmere – five names are found in geonames, five are not. Of the five that are found, only two are certainly located, i.e. we can tell that the place in EPNS and the place in geonames are the same, and establish a link.

What will help improve the quantity of samenesses that we can establish is limiting searches by county – using either detailed boundaries or bounding boxes that will definitely contain the county. Contemporary data is now there for free re-use through Unlock Places, which is a place to start.
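A bounding-box filter of this kind could look like the Python sketch below. The box is a rough, illustrative rectangle drawn around Cheshire, not a real county boundary; the first candidate reuses the coordinates from the sample Bosley record, and the second is a made-up namesake elsewhere.

```python
# (min_lat, min_lon, max_lat, max_lon): a rough illustrative box, not an
# authoritative county boundary.
CHESHIRE_BBOX = (52.9, -3.2, 53.5, -1.9)


def in_bbox(lat, lon, bbox):
    """True if the point lies inside (or on the edge of) the bounding box."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def filter_candidates(candidates, bbox):
    """Keep only the gazetteer candidates that fall inside the county box."""
    return [c for c in candidates if in_bbox(c["lat"], c["lon"], bbox)]


candidates = [
    {"name": "Bosley", "lat": 53.1862392425537, "lon": -2.12721741199493},
    {"name": "Bosley", "lat": 51.5, "lon": -0.1},  # hypothetical namesake
]
print(filter_candidates(candidates, CHESHIRE_BBOX))
```

A box that is deliberately too generous is safer than one that is too tight: it lets through a few wrong candidates but never throws away the right one.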

Note – the later volumes of EPNS do provide OS National Grid coordinates for town names; the earlier ones do not; we’re still not sure when this starts, and will have to check in with EPNS when we all meet there on September 3rd.

How does this fit expectations? We know from past investigations with mixed sets of user-contributed historic place-name data that geonames does well, but typically locates no more than 50% of the names. Combining geonames with OS Open Data sources should help a bit.

The main thing I’m looking to find out now is what proportion of the set of all names will be left floating without a georeference, and how many hops or links we’ll have to traverse to connect floating place-names with something that does have a georeference. I’d also like to know how important it will be to convey uncertainty about measurements, and what the cost/benefit will be of making interfaces allowing one to annotate and to correct the locations of place-names against different historic map data sources.
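Counting hops is easy to sketch once the containment links exist. The Python example below assumes each place has an optional pair of coordinates and an optional parent key; the records and their names are invented for illustration and do not come from the EPNS data.

```python
# Hypothetical containment chain: only the parish has coordinates.
places = {
    "example-parish": {"coords": (53.0, -2.0), "parent": None},
    "example-township": {"coords": None, "parent": "example-parish"},
    "example-field": {"coords": None, "parent": "example-township"},
    "example-orphan": {"coords": None, "parent": None},
}


def hops_to_georeference(key, places):
    """Count parent links followed until a place with coordinates is reached.

    Returns 0 if the place itself is georeferenced, and None if no ancestor
    is (the name is left floating). The seen set guards against cycles.
    """
    hops = 0
    seen = set()
    while key is not None and key not in seen:
        seen.add(key)
        place = places[key]
        if place["coords"] is not None:
            return hops
        key = place["parent"]
        hops += 1
    return None


floating = [k for k in places if hops_to_georeference(k, places) is None]
print(hops_to_georeference("example-field", places), floating)
```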

Clearly the further back we go the squashier the data will be; some of the most interesting use cases that CeRch have been talking to people about involve Anglo-Saxon place references. No maps – not a bad thing – but potentially many hops to a “certain” reference. We’re thinking about how we can re-use, or turn into RDF namespaces, some of the Pleiades Ancient World GIS work on attestation/confidence of place-names and locations.