Geo-linking EPNS to other sources

We’re wrapping up the loose ends on the Chalice project now, preparing to publish all the final material.


Claire Grover at LTG did some interesting map renderings of the English Place-Name Survey names that we’ve managed to link to names in geonames and the Ordnance Survey Linked Data.

Claire writes: Following last Thursday’s discussion, I’ve pulled out some figures about the georeferences in the Chalice data.

I’ve also mapped the georeferences for each of the files – see the .display.html files in
http://homepages.inf.ed.ac.uk/grover/chalicemaps/. The primary.display.html ones (example: Cheshire Vol. 44) contain only the places that were identified as primary-sub-townships while the all.display.html ones (example: Cheshire Vol. 44) contain all the places that have at least one grid reference. Note that the colour of the grid references and markers in the display indicates source: green ones are from unlock, red ones are from geonames and blue ones were provided by EPNS (known-gridref – only in Cheshire and Shropshire).

It’s not easy to draw any firm conclusions from this, but I tend to agree with Paul [Ell, of CDDA] that it would be better not to georeference smaller places (secondary-sub-townships) but instead to assign them the grid reference of the larger place they are contained in/associated with.
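As a rough illustration of that suggestion, here is a minimal Python sketch in which a secondary sub-township takes the grid reference of the place that contains it. The `Place` record, the `kind` labels, the place names and the grid reference are all invented for the example; this is not the Chalice data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Place:
    name: str
    kind: str                       # e.g. "primary-sub-township"
    parent: Optional[str] = None    # name of the containing place, if any
    gridref: Optional[str] = None   # grid reference held for this place


def assigned_gridref(name: str, places: Dict[str, Place]) -> Optional[str]:
    """Grid reference to publish for a place.

    Secondary sub-townships inherit from the place that contains them;
    everything else keeps its own reference. Assumes the parent links
    contain no cycles.
    """
    place = places[name]
    if place.kind == "secondary-sub-township" and place.parent in places:
        return assigned_gridref(place.parent, places)
    return place.gridref


# Placeholder records: names and grid reference are made up.
places = {
    "Big Township": Place("Big Township", "primary-sub-township",
                          gridref="XX 000 000"),
    "Little Hamlet": Place("Little Hamlet", "secondary-sub-township",
                           parent="Big Township", gridref="XX 111 111"),
}
```

Here `assigned_gridref("Little Hamlet", places)` gives the township's reference rather than the hamlet's own, which is the behaviour Paul was arguing for.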

Linking historic places: looking at Victoria County History

Stuart Dunn mentioned the Victoria County History in his writeup of discussions with the Clergy of the Church of England Database project. Both resources are rich in place-name mentions and historic depth; as part of the Chalice project we’re investigating ways to make such resources more searchable by extracting historic place-names and linking them to our gazetteer.

Here’s a summary of some email conversation between Stuart, Claire Grover, Ross Drew at EDINA and myself while looking at some sample data from VCH.

The idea is to explore the possibilities in how Chalice data could enhance / complement semi-structured information like VCH (or more structured database-like sources such as CCED).

It would be very valuable, I think, to do an analysis of how much effort and preparation of the (target) data is needed to link CHALICE to VCH, and a more structured dataset like CCED. By providing georeferences and toponym links, we’re bringing all that EPNS documentary evidence to VCH, thus enriching it.

It would be very interesting if we were able to show how text-mining techniques could be used to add to the work of EPNS (extracting place references that aren’t listed, and suggesting them to editors along with suggested attestations: source and date).

In the more immediate future, this is about linking Chalice place-references to other resources, which would allow us to cross-reference them and search them in interesting ways.

Text mining isn’t absolutely necessary to map the EPNS place names to the VCH text. On the other hand, LTG have all the processing infrastructure to convert formats, tokenise the text etc. so we could put something in place very quickly. It wouldn’t be perfect but it would demonstrate the point. I’ve not seen the CCED data, so don’t know how complex that would be.

Here’s a sample reference to a volume of VCH that may have some overlap with the Shropshire content we have in “born-digital” form from EPNS. There’s the intriguing prospect of adding historic place-name text mining/search in at the digitisation phase, so resources can be linked to other references as soon as they’re published.

Structuring a Linked Data namespace for places

Thoughts on structuring a namespace for historic English places, for our prototype Linked Data version of the English Place Name Survey; how do others do it? Our options seem to be:

  1. give each placename a numeric identifier that can be part of the link
  2. create a more human-readable identifier based on the name, to use as part of the link.

Numeric identifiers for places look like common practice. Geonames.org uses numbers to create links for places – so http://sws.geonames.org/2656197/ “is”, or refers to, Baschurch in Shropshire. Though the coordinates of the point may change, the number is associated with the name, and it remains the same.

Ordnance Survey Linked Data also uses a numeric ID to create its link that stands for (the same) Baschurch – http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354.

The Linked Data Patterns online book has a set of patterns for identifying URIs. The patterns are focused on use with systems that are already database-based, with some design thought having gone into how IDs look, how they can be looked up, and how their persistence is guaranteed.

The point here is that the numeric identifiers still need careful curation – an organisational guarantee that the identifiers will stay the same for the foreseeable future.

We’re using a relational database (PostGIS) rather than a triplestore, to hold the Chalice data (because the data model won’t really change or expand). We can’t just use IDs that are created automatically by the database when items are inserted into it, because those might change if the names are inserted in a different order.

During Chalice we’re not building a be-all-end-all system, but rather prototyping how an approach to text mining and georeferencing places can be used to turn an amazing hand-created resource into a 21st century Linked Data gazetteer, leaving behind open source tools to make sure the process can be repeated with more digitised text.

But we’re not building something to throw away; we want to make sure the links we create can be preserved – that they won’t be broken and won’t change their meanings. So it may be better for us to structure our namespace using the EPNS names themselves, and the order in which they occur in the printed volumes of EPNS.

The EPNS volumes are arranged county-by-county – each county has its own editor, and so may have different layout, style guidelines, level of detail for things like field-names, and the presence or absence of OS Grid coordinates, more or less according to the whims of the county editor. (We’ve focused on Cheshire, but LTG have been developing test parsers for samples of several different counties.)

So it makes sense to include the county name in our namespace. This also helps with disambiguation – which Walton is this Walton? But there will still be cases where several places, in quite different locations, but still within the same county, share a name. In this case, we’d also give the places a numeric identifier (Walton-1, Walton-2) in the order in which they appear in the EPNS text.
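To make the scheme concrete, here is a minimal Python sketch of minting identifiers from a county plus a name, numbering repeated names in the order they appear in print. The base URI `http://example.org/epns/`, the lower-cased slug form and the function names are assumptions made up for illustration, not a namespace we have settled on.

```python
import re
from collections import Counter
from typing import List, Tuple

BASE = "http://example.org/epns/"  # hypothetical base, not a real namespace


def slug(text: str) -> str:
    """Lower-case a name and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def mint_ids(county: str, names_in_print_order: List[str]) -> List[Tuple[str, str]]:
    """Pair each name with a URI; number only the names that repeat."""
    totals = Counter(slug(n) for n in names_in_print_order)
    seen: Counter = Counter()
    minted = []
    for name in names_in_print_order:
        s = slug(name)
        seen[s] += 1
        # Suffix -1, -2, ... follows the order of appearance in the volume.
        local = f"{s}-{seen[s]}" if totals[s] > 1 else s
        minted.append((name, f"{BASE}{slug(county)}/{local}"))
    return minted


# Illustrative input order only, not taken from a real volume.
example = mint_ids("Cheshire", ["Walton", "Bosley", "Walton"])
```

Because the numbering depends only on the printed order, re-running the extraction over the same volume should give the same identifiers, which is the property that database-generated IDs lack.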

Some volumes of EPNS give us OS National Grid coordinates for the “major names”, others don’t. Where the “major name” exists in one or more gazetteers (geonames, OS Open Data), the LTG’s georesolver tool can create some of the missing links using the Unlock Places gazetteer cross-search.

More potentially useful context in the work of the UK Location Programme on Linked Data namespaces for places – a recent Guide to Linked Data and the UK Location Strategy, and last year’s guidance on Designing URI sets for Location.

One more potential complication, which is a fairly subtle issue of semantics – does a link identify a place, or a description of a place? Ordnance Survey Research try to make the difference clear by using a different namespace for ‘IDs for places’ and ‘IDs for documents describing places’.
So http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354 “is” Baschurch; and http://data.ordnancesurvey.co.uk/doc/50kGazetteer/16354 “is” the description of Baschurch. To make sure we’re properly confused, when a human looks up the /id/ link using a web browser, the browser is redirected to the human-readable /doc/. To actually get hold of the Linked Data description of Baschurch (including the coordinates for it in the 50K gazetteer), one has to specifically request the machine-readable, rather than human-readable, version of the link, like this:

curl -L http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354 -H "Accept: application/rdf+xml"

:) - but now you know that!

This took me a little while, and some back-and-forth with John Goodwin from OS Research on “Twitter”, to figure out, which is why I thought it worth writing down here.
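For anyone scripting this rather than using curl, here is a minimal Python sketch of the same request using only the standard library. The helper names `rdf_request` and `fetch_rdf` are made up for the example, and whether the URI still resolves will depend on the state of the OS service.

```python
import urllib.request

OS_ID = "http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354"


def rdf_request(uri: str, media_type: str = "application/rdf+xml") -> urllib.request.Request:
    """Build a GET request that asks for the machine-readable description."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri: str) -> bytes:
    """Fetch the description; urlopen follows redirects, much like curl -L."""
    with urllib.request.urlopen(rdf_request(uri), timeout=30) as response:
        return response.read()


# Building the request does not touch the network; fetch_rdf(OS_ID) does.
request = rdf_request(OS_ID)
```

The only thing that differs from a browser's request is the Accept header, which is the whole trick.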

Linked Data choices for historic places

We’ve had some fitful conversation about modelling historic place-names extracted from the English Place Name Survey as Linked Data, on the Chalice mailing list.
It would be great to get more feedback from others where we have common ground. Here’s a quick summary of the main issues we face and our key points of reference, to start discussion, and we can go into more detail on specific points as we work more with the EPNS data.

Re-use, reduce, recycle?

We should be making direct re-use of others’ vocabularies where we can. In some areas this is easy. For example, to represent the containment relations between places (a township contains a parish, a parish contains a sub-parish) we can re-use some of the Ordnance Survey Research work on linked data ontologies – specifically their vocabulary to describe “Mereological Relations” – where “mereological” is a fancy word for “containment relationships”.

Adapting other schemas into a Linked Data model

One project which provides a great example of a more link-oriented, less geometry-oriented approach to describing ancient places is the Pleiades collection of geographic information about the Classical ancient world. Over the years, Pleiades has developed with scholars an interesting set of vocabularies, which don’t take a Linked Data approach but could be easily adapted to do so. They encounter issues to do with vagueness and uncertainty that geographical information systems concerning the contemporary world can overlook. For example, the Pleiades attestation/confidence vocabulary expresses the certainty of scholars about the conclusions they are drawing from evidence.

So an approach we can take is to build on work done in research partnerships by others, and try to build mind-share about Linked Data representations of existing work. Pleiades also use URIs for places…

Use URIs as names for things

One interesting feature of the English Place Name Survey is the index of sources for each set of volumes. Each different source which documents names (old archives, previous scholarship, historic maps) has an abbreviation, and every time a historic place-name is mentioned, it’s linked to one of the sources.

As well as creating a namespace for historic place-names, we’ll create one for the sources (centred on the five volumes covering Cheshire, which is where the bulk of work on text correction and data extraction has been done). Generally, if anything has a name, we should be looking to give it a URI.

Date ranges

Is there a rough consensus (based on volume of data published, or number of different data sources using the same namespace) on what namespace to use to describe dates and date ranges as Linked Data? At one point there were several different versions of iCal, hCal, xCal vocabularies all describing more or less the same thing.

We’ve also considered other ways to describe date ranges – talking to Pleiades about mereological relations between dates – and investigating the work of Common Eras on user-contributed tags representing date ranges. It would be hugely valuable to learn about, and converge on, others’ approaches here.

How same is the same?

We propose to mint a namespace for historic place-names documented by the English Place Name Survey. Each distinct place-name gets its own URI.

For some of the “major names”, we’ve been able to use the Language Technology Group’s georesolution tool to make a link between the place-name and the corresponding entry in geonames.org.

Some names can’t be found in geonames, but can be found, via Unlock Places gazetteer search, in some of the Ordnance Survey open data sources. Next week we’ll be looking at using Unlock to make explicit links to the Ordnance Survey Linked Data vocabularies. One interesting side-effect of this is that, via Chalice, we’ll create links between geonames and the OS Linked Data, that weren’t there before.
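A minimal Python sketch of that side-effect, using the two Baschurch URIs quoted earlier and owl:sameAs as one candidate predicate. The EPNS URI under `example.org` is hypothetical, and whether owl:sameAs is the right predicate at all is an open question.

```python
from itertools import combinations
from typing import List, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one URI-to-URI statement as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def same_as_links(epns_uri: str, targets: List[str]) -> List[str]:
    """One sameAs statement from the EPNS name to each external URI."""
    return [ntriple(epns_uri, OWL_SAME_AS, target) for target in targets]


def implied_pairs(targets: List[str]) -> List[Tuple[str, str]]:
    """External URIs that become linked to each other via the EPNS name."""
    return list(combinations(targets, 2))


EPNS_BASCHURCH = "http://example.org/epns/shropshire/baschurch"  # hypothetical
TARGETS = [
    "http://sws.geonames.org/2656197/",
    "http://data.ordnancesurvey.co.uk/id/50kGazetteer/16354",
]

links = same_as_links(EPNS_BASCHURCH, TARGETS)
bridges = implied_pairs(TARGETS)
```

The `bridges` list holds the geonames-to-OS pairing that nobody stated directly but that falls out of the two links.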

Kate Byrne raised an interesting question on the Chalice mailing list – is the ‘sameAs’ link redundant? For example, if we are confident that Bosley in geonames.org is the same as Bosley in the Cheshire volumes of English Place Name Survey, should we re-use the geonames URI rather than making a ‘sameAs’ link between the two?

How same, in this case, is the same? We may have two, or more, different sets of coordinates which approximately represent the location of Bosley. Is it “correct”, in Linked Data terms, to state that all of them are “the same” when the locations are subtly different?
This is before we even get into the conceptual issues around whether a set of coordinates really has meaning as “the location” of a place. Geonames, in this sense, is a place to start working out towards more expressive descriptions of where a place is, rather than a conclusion.

Long-term preservation

Finally, we want to make sure that any URIs we mint are going to be preserved on a really long time horizon. I discussed this briefly on the Unlock blog last year. University libraries, or cultural heritage memory institutions, may be able to delegate a sub-domain that we can agree to long-term persistence of – but the details of the agreement, and periodic renewal of it due to infrastructural, organisational and technological change, are a much bigger issue than I think we recognise.

Connecting archives with linked geodata – Part I

This is the first half of the talk I gave at FOSS4G 2010 covering the Chalice project and the Unlock services. Part II to follow shortly…

My starting talk title, written in a rush, was “Georeferencing archives with Linked Open Geodata” – too many geos; though perhaps they cancel one another out, and just leave *stuff*.

In one sense this talk is just about place-name text mining. Haven’t we seen all this before? Didn’t Schuyler talk about Gutenkarte (extracting place-names from classical texts and exploring them using a map) in like, 2005, at OSGIS before it was FOSS4G? Didn’t Metacarta build a multi-million business on this stuff and succeed in getting bought out by Nokia? Didn’t Yahoo! do good-enough gazetteer search and place-name text mining with Placemaker? Weren’t *you*, Jo, talking about Linked Data models of place-names and relations between them in 2003? If you’re still talking about this, why do you still expect anyone to listen?

What’s different now? One word: recursion. Another word: potentiality. Two more words: more people.

Before I get too distracted, I want to talk about a couple of specific projects that I’m organising.

One of them is called Chalice, which stands for Connecting Historical Authorities with Linked Data, Contexts, and Entities. Chalice is a text-mining project, using a pipeline of Natural Language Processing and data munging techniques to take some semi-structured text and turn the core of it into data that can be linked to other data.

The target is a beautiful production called the English Place Name Survey. This is a definitive-as-possible guide to place-names in England, their origins, the names by which things were known, going back through a thousand years of documentary evidence, reflecting at least 1500 years of the movement of people and things around the geography of England. There are 82 volumes of the English Place Name Survey, which started in 1925, and is still being written (and once it’s finished, new generations of editors will go back to the beginning, and fill in more missing pieces).

Place-name scholars amaze me. Just by looking at words and thinking about breaking down their meanings, place-name scholars can tell you about drainage patterns, changes in the order of political society, why people were doing what they were doing, where. The evidence contained in place-names helps us cross the gap between the archaeological and the digital.

So we’re text mining EPNS and publishing the core (the place-name, the date of the source from which the name comes, a reference to the source, references to earlier and later names for “the same place”). But why? Partly because the subject matter, the *stuff*, is so very fascinating. Partly to make other, future historic text mining projects much more successful, to get a better yield of data from text, using the one to make more sense of the other. Partly just to make links to other *stuff*.

In newer volumes the “major names”, i.e. the contemporary names (or the last documented name for places that have become forgotten) have neat grid references, point-based, thus they come geocoded. The earliest works have no such helpful metadata. But we have the technology; we can infer it. Place-name text mining, as my collaborators at the Language Technology Group in the School of Informatics in Edinburgh would have it, is a two-phase process. First phase is “geo-tagging”, the extraction of the place-names themselves; using techniques that are either rule-based (“glorified regular expressions”) or machine-learning based (“neural networks” for pattern recognition, like spam filters, that need a decent volume of training data).
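To give a flavour of the rule-based end of that spectrum, here is a minimal Python sketch of a "glorified regular expression" tagger built from a list of known names. It is a toy under my own assumptions (the function names and the tiny name list are invented), not how the LTG tools work.

```python
import re
from typing import Iterable, List, Pattern, Tuple


def build_tagger(gazetteer_names: Iterable[str]) -> Pattern[str]:
    """Compile one alternation over the known names, longest first.

    Longest-first ordering lets a multi-word name win over a shorter name
    it contains. Assumes at least one name is supplied.
    """
    ordered = sorted(set(gazetteer_names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


def tag_places(text: str, tagger: Pattern[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, name) for every match, in reading order."""
    return [(m.start(), m.end(), m.group(0)) for m in tagger.finditer(text)]


tagger = build_tagger(["Walton", "Higher Walton", "Bosley"])
sample = "Bosley lies some way from Higher Walton."
tags = tag_places(sample, tagger)
```

A tagger like this only finds spellings it already knows, which is exactly why historic variants need either richer rules or training data.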

Second phase is “geo-resolution”; given a set of place-names and relations between them, figuring out where they are. The assumption is that places cluster together in space much as they do in words, and on the whole that works out better than other assumptions. As far as I can see, the state of the research art in Geographic Information Retrieval is still fairly limited to point-based data, projections onto a Cartesian plane. This is partly about data availability, in the sense of access to data (lots of research projects use geonames data for its global coverage, open license, and linked data connectivity). It’s partly about data availability in the sense of access to thinking. Place-name gazetteers look point-based, because the place-name on a flat map begins at a point on a Cartesian plane. (So many place-name gazetteers are derived visually from the location of strings of text on maps; they are for searching maps, not for searching *stuff*.)
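The clustering assumption can be sketched in a few lines of Python: given several candidate points for each name, pick the combination whose points sit closest together. This is a brute-force toy of my own, with invented planar coordinates and made-up names, and is not a description of the LTG georesolver.

```python
from itertools import combinations, product
from math import hypot
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def spread(points: Sequence[Point]) -> float:
    """Sum of pairwise straight-line distances between the chosen points."""
    return sum(hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in combinations(points, 2))


def resolve(candidates: Dict[str, List[Point]]) -> Dict[str, Point]:
    """Choose one candidate per name so that the chosen set is most compact.

    Tries every combination, so it is only usable for a handful of names.
    """
    names = list(candidates)
    best = min(product(*(candidates[name] for name in names)), key=spread)
    return dict(zip(names, best))


# Invented coordinates on an arbitrary plane, purely for illustration.
toy = {
    "A": [(0.0, 0.0), (100.0, 100.0)],
    "B": [(1.0, 1.0)],
    "C": [(2.0, 0.0), (-80.0, 50.0)],
}
resolved = resolve(toy)
```

Note that the sketch bakes in exactly the limitation described above: every place is a point on a plane.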

So next steps seem to involve

  • dissolving the difference between narrative, and data-driven, representations of the same thing
  • inferring things from mereological relations (containment-by, containment-of) rather than sequential or planar relations

On the former – data are documents, documents are data.

On the latter, this helps explain why I am still talking about this, because it’s still all about access to data. Amazing things, that I barely expected to see so quickly, have happened since I started along this path 8 years ago. We now have a significant amount of UK national mapping data available on properly open terms, enough to do 90% of things. OpenStreetMap is complete enough to base serious commercial activity on; MapQuest is investing itself in supporting and exploiting OSM. Ordnance Survey Open Data combines to add a lot of as yet hardly tapped potential…

Read more, if you like, in Connecting archives with linked geodata – Part II which covers the use of and plans for the Unlock service hosted at the EDINA data centre in Edinburgh.

Visiting the English Place Name Survey

I was in Nottingham for OSGIS at the Centre for Geospatial Sciences on Tuesday; skipped out between lunch and coffee break to visit the English Place Name Survey in the same leafy campus.

A card file at EPNS

Met with Paul Cavill, who dropped me right in to the heart of the operation – stacks of index cards in shoe boxes. Each major name has a set of annotation cards, describing different related names and their associations and sources – which range from Victorian maps to Anglo-Saxon chronicles.

The editing process takes the card sets and turns them right into print-ready manuscript. The manuscript then has very consistent layout conventions – capitalisation, indentation. This is going to make our work of structure mining a lot easier.

Another bonus I wasn’t expecting was the presence of OSGB grid references for all the major names. The task of making links becomes a snap – I was imagining a lot of iterative guesswork based on clustering and closeness to names in other sources. (There are four Waltons in the UK in geonames, dozens in the EPNS).

On this basis I reckon the entity recognition will be a breeze; LTG will hardly have to stretch their muscles, which means we can ask them to work on grammars and machine learning recognisers for parts of other related archives within the scope of CHALICE.

And we would have freedom in the EDINA team’s time to do more – specifically to look at using the National Map Library of Scotland’s map rectifier tools to correlate the gazetteer with detailed line-drawn maps also created by the late H. D. G. Foxall. Digitisations of these maps live in the Shropshire Records Office. We must talk with them about their plans (the Records Office holds copyright in the scans).

The eye-opener for me was the index of sources, or rather the bibliography. Each placename variant is marked with a few letters identifying the source of the name. So the index itself provides a key to old maps and gazetteers and archival records. To use Ant Beck’s phrase the EPNS looks like a “decoupled synthesis” of placename knowledge in all these sources. If we extract its structure, we are recoupling the synthesis and the sources, and now know where to look next to go text mining and digitising.

So we have the Shropshire Hundreds as a great place to start, as this is what the EPNS are working on now and the volumes are “born digital”. Back at CDDA, Paul Ell has some of the very earliest volumes digitised, and if we find a sample from the middle, we can produce grammar rules that we can be pretty confident will extract the right structure from the whole set, when the time comes to digitise and publish the entire 80+ volume, and growing, set.

But now I’m fascinated by the use of the EPNS derived data as a concordance to so many associated archives documenting historic social patterns. Towards the end of our chat Paul Cavill was speculating about reconstructing Anglo-Saxon England by means of text mining and georeferencing archives – we could provide a reference map to help archaeologists understand what they are finding, or even help them focus on where to look for interesting archaeology.

Paul had been visited by the time-travelling Mormons digitising everything a couple of weeks previously, and will hopefully offer an introduction – I would really, really like to meet them.