Connecting archives with linked geodata – Part II

This is part two of a blog starting with a presentation about the Chalice project and our aim to create a 1000-year place-name gazetteer, available as linked data, text-mined from volumes of the English Place Name Survey.

Something else i’ve been organising is a web service called Unlock; it offers a gazetteer search service that searches with, and returns, shapes rather than just points for place-names. It has its origins in a 2001 project called GeoCrossWalk, extracting shapes from MasterMap and other Ordnance Survey data sources and making them available under a research-only license in the UK, available to subscribers to EDINA’s Digimap service.

Now that so much open geodata is out there, Unlock now contains an open data place search service, indexing and interconnecting the different sources of shapes that match up to names. It has geonames and the OS Open Data sources in it, adding search of Natural Earth data in short order, looking at ways to enhance what others (Nominatim, LinkedGeoData) are already doing with search and re-use of OpenStreetmap data.

The gazetteer search service sits alongside a placename text mining service. However, the text mining service is tuned to contemporary text (American news sources), and a lot of that also has to do with data availability and sharing of models, sets of training data. The more interesting use cases are in archive mining, of semi-unusual, semi-structured sets of documents and records (parliamentary proceedings, or historical population reports, parish and council records). Anything that is recorded will yield data, *is* data, back to the earliest written records we have.

Place-names can provide a kind of universal key to interpreting the written record. Social organisation may change completely, but the land remembers, and place-names remain the same. Through the prism of place-names one can glimpse pre-history; not just what remains of those people wealthy enough to create *stuff* that lasted, but of everybody who otherwise vanished without trace.

The other reason I’m here at FOSS4G; to ask for help. We (the authors of the text mining tools at the Language Technology Group, colleagues at EDINA, smart funders at JISC) want to put together a proper open source distribution of the core components of our work, for others to customise, extend, and work with us on.

We could use advice – the Software Sustainability Institute is one place we are turning for advice on managing an open source release and, hopefully, community. OSS Watch supported us in structuring an open source business case.

Transition to a world that is open by default turns out to be more difficult than one would think. It’s hard to get many minds to look in the same direction at the same time. Maybe legacy problems, kludges either technical, or social, or even emotional, arise to mess things up when we try to act in the clear.

We could use practical advice on managing an open source release of our work to make it as self-sustaining as possible. In the short term; how best to structure a repository for collaboration, for branching and merging; where we should most usefully focus efforts at documentation; how to automate the process of testing to free up effort where it can be more creative; how to find the benefits in moving the process of working, from a closed to an open world.

The Chalice project has a sourceforge repository where we’ve been putting the code the EDINA team has been working on; this includes an evolution of Unlock’s web service API, and user interface / annotation code from Addressing History. We’re now working on the best way to synchronise work-in-progress with currently published, GPL-licensed components from LTG, more pieces of the pipeline making up the “Edinburgh geoparser” and other things…

EDINA Blogs

A Blogs.edina.ac.uk weblog

Connecting archives with linked geodata – Part II