Connecting archives with linked geodata – Part I

This is the first half of the talk I gave at FOSS4G 2010 covering the Chalice project and the Unlock services. Part II to follow shortly…

My starting talk title, written in a rush, was “Georeferencing archives with Linked Open Geodata” – too many geos; though perhaps they cancel one another out, and just leave *stuff*.

In one sense this talk is just about place-name text mining. Haven’t we seen all this before? Didn’t Schuyler talk about Gutenkarte (extracting place-names from classical texts and exploring them using a map) in like, 2005, at OSGIS before it was FOSS4G? Didn’t Metacarta build a multi-million business on this stuff and succeed in getting bought out by Nokia? Didn’t Yahoo! do good-enough gazetteer search and place-name text mining with Placemaker? Weren’t *you*, Jo, talking about Linked Data models of place-names and relations between them in 2003? If you’re still talking about this, why do you still expect anyone to listen?

What’s different now? One word: recursion. Another word: potentiality. Two more words: more people.

Before i get too distracted, i want to talk about a couple of specific projects that i’m organising.

One of them is called Chalice, which stands for Connecting Historical Authorities with Linked Data, Contexts, and Entities. Chalice is a text-mining project, using a pipeline of Natural Language Processing and data munging techniques to take some semi-structured text and turn the core of it into data that can be linked to other data.

The target is a beautiful production called the English Place Name Survey. This is a definitive-as-possible guide to place-names in England, their origins, the names by which things were known, going back through a thousand years of documentary evidence, reflecting at least 1500 years of the movement of people and things around the geography of England. There are 82 volumes of the English Place Name Survey, which started in 1925, and is still being written (and once it’s finished, new generations of editors will go back to the beginning, and fill in more missing pieces).

Place-name scholars amaze me. Just by looking at words and thinking about breaking down their meanings, place-name scholars can tell you about drainage patterns, changes in the order of political society, why people were doing what they were doing, where. The evidence contained in place-names helps us cross the gap between the archaeological and the digital.

So we’re text mining EPNS and publishing the core (the place-name, the date of the source from which the name comes, a reference to the source, references to earlier and later names for “the same place”). But why? Partly because the subject matter, the *stuff*, is so very fascinating. Partly to make other, future historic text mining projects much more successful, to get a better yield of data from text, using the one to make more sense of the other. Partly just to make links to other *stuff*.
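To make that “core” a little more concrete, here is a minimal sketch of how one attested name might be held as data. The class name, field names and sample values are all my own placeholders for illustration, not the actual Chalice schema or real EPNS entries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity comparison, so records that point at each other
# never trigger a recursive field-by-field comparison.
@dataclass(eq=False)
class NameAttestation:
    """One attested form of a place-name (hypothetical structure)."""
    name: str                    # the place-name as written in the source
    source_date: Optional[int]   # year of the source document, if known
    source_ref: str              # citation of the source document
    earlier: List["NameAttestation"] = field(default_factory=list)
    later: List["NameAttestation"] = field(default_factory=list)


def link(earlier: NameAttestation, later: NameAttestation) -> None:
    """Record that two attestations refer to 'the same place'."""
    earlier.later.append(later)
    later.earlier.append(earlier)


# Placeholder values only; not taken from any EPNS volume.
old_form = NameAttestation("Exampletun", 1200, "hypothetical charter")
new_form = NameAttestation("Exampleton", 1900, "hypothetical map")
link(old_form, new_form)
```

Once names are linked like this, walking `earlier` and `later` gives the history of a name, and each record is a natural thing to hang a link to other *stuff* from.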

In newer volumes the “major names”, i.e. the contemporary names (or the last documented name for places that have become forgotten), have neat point-based grid references, so they come already geocoded. The earliest works have no such helpful metadata. But we have the technology; we can infer it. Place-name text mining, as my collaborators at the Language Technology Group in the School of Informatics in Edinburgh would have it, is a two-phase process. The first phase is “geo-tagging”, the extraction of the place-names themselves, using techniques that are either rule-based (“glorified regular expressions”) or machine-learning based (“neural networks” for pattern recognition, like spam filters, that need a decent volume of training data).
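For a flavour of the rule-based end of geo-tagging, here is a toy sketch in Python. The single rule (a locative preposition followed by capitalised words) and the sample sentence are my own assumptions for illustration; a real tagger, including the one used in Chalice, has far richer rules than this.

```python
import re

# Hypothetical rule: a locative preposition followed by one or more
# capitalised words is treated as a candidate place-name.
PLACE_PATTERN = re.compile(
    r"\b(?:in|at|near|from|to)\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*)"
)


def geotag(text):
    """Return (candidate place-name, character offset) pairs found in text."""
    return [(m.group(1), m.start(1)) for m in PLACE_PATTERN.finditer(text)]


sample = "The charter was witnessed at Market Harborough, near Leicester."
candidates = geotag(sample)
for name, offset in candidates:
    print(offset, name)
```

A rule this crude will also tag people and months (“in March”), which is one reason the second phase, and a decent gazetteer, matter so much.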

The second phase is “geo-resolution”: given a set of place-names and relations between them, figuring out where they are. The assumption is that places mentioned close together in a text also cluster together in space, and on the whole that works out better than other assumptions. As far as i can see, the state of the research art in Geographic Information Retrieval is still fairly limited to point-based data, projections onto a Cartesian plane. This is partly about data availability, in the sense of access to data (lots of research projects use GeoNames data for its global coverage, open license, and linked data connectivity). It’s partly about data availability in the sense of access to thinking. Place-name gazetteers look point-based, because the place-name on a flat map begins at a point on a Cartesian plane. (So many place-name gazetteers are derived visually from the location of strings of text on maps; they are for searching maps, not for searching *stuff*.)
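The clustering assumption can be sketched in a few lines. This is a brute-force toy of my own, not the resolver my collaborators use: for each ambiguous name it tries every candidate point from a made-up gazetteer and keeps the combination whose points lie closest together. The coordinates are rough values for illustration only.

```python
import itertools
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def resolve(candidates):
    """Pick one point per name so that total pairwise distance is smallest."""
    names = list(candidates)
    best, best_cost = None, math.inf
    # Try every combination of candidates; fine for a toy, exponential in general.
    for combo in itertools.product(*(candidates[n] for n in names)):
        cost = sum(haversine_km(p, q)
                   for p, q in itertools.combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(names, best))


# Toy gazetteer: approximate (lat, lon) values, for illustration only.
toy_gazetteer = {
    "Newport": [(51.59, -3.00), (50.70, -1.29)],  # south Wales / Isle of Wight
    "Cardiff": [(51.48, -3.18)],
}
resolved = resolve(toy_gazetteer)
```

Here “Newport” resolves to the south Wales candidate because it sits nearer to Cardiff. Note that everything is still a point; nothing in this sketch knows that one place can be inside another.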

So next steps seem to involve

  • dissolving the difference between narrative, and data-driven, representations of the same thing
  • inferring things from mereological relations (containment-by, containment-of) rather than sequential or planar relations

On the former – data are documents, documents are data.

On the latter, this helps explain why i am still talking about this, because it’s still all about access to data. Amazing things, that i barely expected to see so quickly, have happened since i started along this path 8 years ago. We now have a significant amount of UK national mapping data available on properly open terms, enough to do 90% of things. OpenStreetMap is complete enough to base serious commercial activity on; MapQuest is investing itself in supporting and exploiting OSM. Ordnance Survey Open Data combines to add a lot of as yet hardly tapped potential…
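As a rough illustration of what inferring from containment rather than from points might look like, here is a small sketch. The part-of table uses placeholder names of my own; the field, parish, hundred, county nesting is only an assumed example of such a hierarchy, not data from any real source.

```python
# Hypothetical "contained-by" relations, child -> parent.
PART_OF = {
    "Field X": "Parish A",
    "Parish A": "Hundred H",
    "Parish B": "Hundred H",
    "Hundred H": "County C",
}


def containers(place):
    """Yield every place that contains `place`, nearest first."""
    while place in PART_OF:
        place = PART_OF[place]
        yield place


def smallest_common_container(a, b):
    """Return the smallest place containing (or equal to) both a and b."""
    chain_b = {b, *containers(b)}
    for candidate in [a, *containers(a)]:
        if candidate in chain_b:
            return candidate
    return None  # the two places share no known container


print(smallest_common_container("Field X", "Parish B"))
```

Two names that share a small container are probably near one another, even when neither has a grid reference, which is the kind of inference a purely point-based gazetteer cannot offer.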

Read more, if you like, in Connecting archives with linked geodata – Part II which covers the use of and plans for the Unlock service hosted at the EDINA data centre in Edinburgh.
