More on the use of Unlock Places by georeferencer.org

Some months back, Klokan Petr Pridal, who maintains OldMapsOnline.org and works with libraries and cartographic institutes across Europe, wrote with some questions about the Unlock Places service. We met at FOSS4G where I presented our work on the Chalice project and the Unlock services.
Petr writes about how Unlock is used in his applications, and what future requirements from the service may be:


“It was great to meet you at FOSS4G in Barcelona and to discuss the progress related to Unlock and possible cooperation with OldMapsOnline.org and its use in the Georeferencer.org services.

As you mentioned, the most important thing for us would be to have bounding boxes (or bounding polygons) for places in the Unlock API/database, as a direct part of the JSON response. We need this mostly for villages, towns and cities, and for areas such as districts or countries – all over the world. We need something like the “bounds” provided by the Google geocoding API.
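(To sketch what Petr is asking for: a client-side lookup might look like the following. The endpoint, parameters and field names here are illustrative assumptions, not the actual Unlock Places response format.)

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names, for illustration only.
SEARCH_URL = "http://unlock.edina.ac.uk/ws/nameSearch?name={name}&format=json"

def lookup_bounds(name):
    """Fetch the first matching feature and return its bounding box, if any."""
    url = SEARCH_URL.format(name=urllib.parse.quote(name))
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)
    feature = result["features"][0]
    bounds = feature.get("bounds")  # a Google-style south-west/north-east pair
    if bounds is None:
        return None
    return (bounds["southwest"]["lng"], bounds["southwest"]["lat"],
            bounds["northeast"]["lng"], bounds["northeast"]["lat"])
```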

The second most important feature is the chance to install the service on our own servers – especially in case you cannot guarantee its availability in the future.

It would also be great to have the chance to improve the service for non-English languages, but right now gazetteers and text processing are not the primary target of our research.

At the moment the Unlock API is in use:

As a standard gazetteer search service, to zoom the base maps to a place people type in the search box in our Georeferencer.org service – a collaborative online georeferencing service for scanned historical maps. It is in use by the National Library of Scotland and a couple of other libraries.

Here’s an example map (you need to register first).

The uniqueness of Unlock is in the openness of the licence (primarily GeoNames.org CC-BY, and also OS OpenData) and, so far, the very good availability of the online service (EDINA hardware and network?). We are missing the bounding box that would let us zoom our base maps to the correct area (determine the appropriate zoom level). The Unlock API replaced the Google Geocoder, which we cannot use because we also display non-Google maps (such as Ordnance Survey OpenData) and we are potentially deriving data from the gazetteer database (the control points on the old maps), which is against the Google TOS.
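(Petr’s zoom-level point is easy to make concrete: given a “bounds” box, choosing a web-mercator zoom level is a couple of lines of arithmetic. A minimal sketch, assuming 256-pixel tiles and, for simplicity, fitting only the longitude span:)

```python
import math

def zoom_for_bounds(west, east, viewport_px=1024, tile_px=256):
    """Deepest zoom level at which the box still fits the viewport width.
    At zoom z the whole world (360 degrees) is tile_px * 2**z pixels wide."""
    lon_span = max(east - west, 1e-9)
    world_px_needed = (viewport_px / lon_span) * 360
    z = math.log2(world_px_needed / tile_px)
    return max(0, min(19, int(z)))

# A village box ~0.05 degrees wide in a 1024px map window -> zoom 14.
print(zoom_for_bounds(-2.125, -2.075))
```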

In the future we are keen to extend the gazetteer with alternative historical toponyms (which people can identify on georeferenced old maps too), or to participate in such work.

The other use of the Unlock API is:

As a metadata text analyzer, in a service such as our http://geoparser.appspot.com/, where we automatically parse existing library textual metadata to identify place names and locate the described maps, including automatic approximation of their spatial coverage (by identifying the map scale and physical size in the text and doing simple maths on top of them). This service is in a prototype phase only; we are using Yahoo Placemaker, and I have been testing the Unlock Text API with it too.
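(The “simple maths” is worth spelling out: a sheet’s ground footprint is its scale denominator multiplied by its physical size. A small sketch, with hypothetical figures:)

```python
def ground_extent_km(scale_denominator, width_cm, height_cm):
    """Approximate ground coverage of a map sheet from scale and paper size.
    1 cm on a 1:50,000 map is 50,000 cm = 0.5 km on the ground."""
    km_per_cm = scale_denominator / 100000.0
    return width_cm * km_per_cm, height_cm * km_per_cm

# A 1:50,000 sheet measuring 40 x 30 cm covers roughly 20 x 15 km, so a
# geoparsed place name can be grown into an approximate bounding box.
print(ground_extent_km(50000, 40, 30))  # (20.0, 15.0)
```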

Here the huge advantage of Unlock would be primarily the possibility to add custom gazetteers (with GeoNames as the default one), language detection (for example via the Google Language API or otherwise), and also the possibility to add other tools into the workflow, such as a lemmatizer for a particular language – the simplest available via hunspell/ispell dictionary integration, or via existing rule-based morphological software.

The problem is that without returning the lemmatization of the text, the geoparser is almost unusable in non-English languages – especially Slavic ones.
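(For illustration, a lemmatization step in front of the gazetteer lookup could look roughly like the sketch below. It assumes the pyhunspell bindings and a system-installed Czech dictionary; paths and each dictionary’s behaviour vary, and this is not part of any current Unlock pipeline.)

```python
import hunspell  # pyhunspell bindings around the hunspell library

# Dictionary paths vary by system; cs_CZ here is just an example.
h = hunspell.HunSpell("/usr/share/hunspell/cs_CZ.dic",
                      "/usr/share/hunspell/cs_CZ.aff")

def candidate_toponyms(token):
    """Reduce an inflected token to dictionary stems before gazetteer lookup,
    so that (with a suitable dictionary) an inflected form such as 'Praze'
    can match the gazetteer entry 'Praha'."""
    stems = [s.decode("utf-8") for s in h.stem(token)]
    return stems or [token]
```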

We are very glad for the availability of your results and of the reliable online services you provide. We can concentrate on the problems we primarily need to solve (georeferencing, clipping, stitching and presentation of old maps for later analysis) and use the results of your research as components solving problems that we touch on and would otherwise have to solve somehow in practice.”


Very glad that Petr wrote at such length about this comprehensive use of Unlock, pushing the edges of what we are doing with the service.

We have some work in the pipeline adding bounding boxes for places worldwide by making Natural Earth Data searchable through Unlock Places. Natural Earth is a generalised dataset intended for use in cartography, but should also have quite a lot of re-use value for map search.

Connecting archives with linked geodata – Part II

This is part two of a blog post that starts with a presentation about the Chalice project and our aim to create a 1000-year place-name gazetteer, available as linked data, text-mined from volumes of the English Place Name Survey.

Something else I’ve been organising is a web service called Unlock; it offers a gazetteer search service that searches with, and returns, shapes rather than just points for place-names. It has its origins in a 2001 project called GeoCrossWalk, which extracted shapes from MasterMap and other Ordnance Survey data sources and made them available under a UK research-only licence to subscribers to EDINA’s Digimap service.

Now that so much open geodata is out there, Unlock contains an open data place search service, indexing and interconnecting the different sources of shapes that match up to names. It has GeoNames and the OS Open Data sources in it; we are adding search of Natural Earth data in short order, and looking at ways to enhance what others (Nominatim, LinkedGeoData) are already doing with search and re-use of OpenStreetMap data.

The gazetteer search service sits alongside a place-name text mining service. However, the text mining service is tuned to contemporary text (American news sources); a lot of that has to do with data availability and the sharing of models and sets of training data. The more interesting use cases are in archive mining of semi-unusual, semi-structured sets of documents and records (parliamentary proceedings, historical population reports, parish and council records). Anything that is recorded will yield data, *is* data, back to the earliest written records we have.


Place-names can provide a kind of universal key to interpreting the written record. Social organisation may change completely, but the land remembers, and place-names remain the same. Through the prism of place-names one can glimpse pre-history: not just what remains of those people wealthy enough to create *stuff* that lasted, but of everybody who otherwise vanished without trace.

The other reason I’m here at FOSS4G: to ask for help. We (the authors of the text mining tools at the Language Technology Group, colleagues at EDINA, smart funders at JISC) want to put together a proper open source distribution of the core components of our work, for others to customise, extend, and work with us on.

We could use advice – the Software Sustainability Institute is one place we are turning to for advice on managing an open source release and, hopefully, a community. OSS Watch supported us in structuring an open source business case.

The transition to a world that is open by default turns out to be more difficult than one would think. It’s hard to get many minds to look in the same direction at the same time. Maybe legacy problems, kludges technical, social, or even emotional, arise to mess things up when we try to act in the clear.

We could use practical advice on managing an open source release of our work to make it as self-sustaining as possible. In the short term: how best to structure a repository for collaboration, for branching and merging; where we should most usefully focus efforts at documentation; how to automate testing to free up effort where it can be more creative; how to find the benefits in moving the process of working from a closed to an open world.

The Chalice project has a SourceForge repository where we’ve been putting the code the EDINA team has been working on; this includes an evolution of Unlock’s web service API, and user interface / annotation code from Addressing History. We’re now working on the best way to synchronise work-in-progress with the currently published, GPL-licensed components from LTG – more pieces of the pipeline making up the “Edinburgh geoparser” – and other things…

Quality of text correction analysis from CDDA

The following post is by Elaine Yeates, project manager at the Centre for Data Digitisation and Analysis in Belfast. Elaine and her team have been responsible for taking scans of a selection of volumes of the English Place Name Survey and turning them into corrected OCR’d text, for later text mining to extract the data structures and republish them as Linked Data.

“I’ve worked up some figures based on an average character count from Cheshire, Buckinghamshire, Cambridgeshire and Derbyshire.

We had two levels of quality control:

1st QA, spelling and font: on completion of the OCR process, and based on 40 pages averaging 4,000 characters per page, the error rate was 346 character errors (an average of 8.65 per page) = 0.22%.

1st QA, Unicode: on completion of the OCR process, and based on 40 pages averaging 4,000 characters per page, the error rate was 235 character errors (an average of 5.87 per page) = 0.14%.

Total 1st QA error rate: 0.36%.

2nd QA: encompasses all of the 1st QA; based on 40 pages averaging 4,000 characters per page, the error rate was 18 character errors (an average of 0.45 per page) = 0.01%.
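(Reading these figures as percentages of character errors over 40 pages × 4,000 characters = 160,000 characters, they reproduce as below; a couple of the quoted values appear truncated rather than rounded.)

```python
PAGES, CHARS_PER_PAGE = 40, 4000
total_chars = PAGES * CHARS_PER_PAGE  # 160,000 characters

for label, errors in [("1st QA spelling/font", 346),
                      ("1st QA Unicode", 235),
                      ("2nd QA", 18)]:
    print(f"{label}: {errors / PAGES:.2f}/page, "
          f"{100 * errors / total_chars:.2f}%")
# -> 8.65/page 0.22%, 5.88/page 0.15%, 0.45/page 0.01%
# (the quoted 5.87 and 0.14 truncate 5.875 and 0.1469% instead of rounding)
```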

Through the pilot we identified quite a few Unicode characters unique to this material. CDDA developed an in-house online Unicode database for analysts; they can view and update the capture file, and raise new codes when found. I think for a more substantial project we might direct our QA process through an online audit system, where we could identify issues with the material, the OCR of same, the macros, and the 1st and 2nd stages of quality control.

We are pleased with these figures and it looks encouraging for a larger-scale project.”

Elaine also wrote in response to some feedback on markup error rates from Claire Grover on behalf of the Language Technology Group:

“Thanks for these. Our QA team are primarily looking for spelling errors; from your list, the few issues seem to be bold, spaces and small caps.

Of course when tagging, especially automated tagging, you’re looking for certain patterns; however, moving forward I feel this error rate is very encouraging, and it helps our QA team to know what patterns might be searchable for future capture.

Looking at your issues so far on Part IV (5 issues e-mailed), against a total word count of 132,357, that is an error rate of 0.00003.”

I am happy to have these numbers, as they let us observe the consistency of quality over iterations, as means are found to work with more volumes of EPNS.

Visiting the English Place Name Survey

I was in Nottingham for OSGIS at the Centre for Geospatial Sciences on Tuesday, and skipped out between lunch and the coffee break to visit the English Place Name Survey on the same leafy campus.

A card file at EPNS

I met with Paul Cavill, who dropped me right into the heart of the operation – stacks of index cards in shoe boxes. Each major name has a set of annotation cards describing different related names and their associations and sources, which range from Victorian maps to Anglo-Saxon chronicles.

The editing process takes the card sets and turns them into print-ready manuscript. The manuscript then has very consistent layout conventions – capitalisation, indentation. This is going to make our work of structure mining a lot easier.

Another bonus I wasn’t expecting was the presence of OSGB grid references for all the major names. The task of making links becomes a snap – I had been imagining a lot of iterative guesswork based on clustering and closeness to names in other sources. (There are four Waltons in the UK in geonames, and dozens in the EPNS.)
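Getting from an OSGB grid reference to coordinates comparable with geonames takes only a few lines. A minimal sketch, assuming the pyproj library (the grid reference shown is a made-up example):

```python
from pyproj import Transformer  # assumes pyproj is installed

_LETTERS = "ABCDEFGHJKLMNOPQRSTUVWXYZ"  # OSGB grid letters omit I
_to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

def osgb_to_lonlat(gridref):
    """Convert a grid reference such as 'SJ 4912 1243' to WGS84 (lon, lat)."""
    letters, digits = gridref[:2].upper(), gridref[2:].replace(" ", "")
    l1, l2 = _LETTERS.index(letters[0]), _LETTERS.index(letters[1])
    e100 = ((l1 - 2) % 5) * 5 + (l2 % 5)        # 100 km square easting
    n100 = (19 - (l1 // 5) * 5) - (l2 // 5)     # 100 km square northing
    half = len(digits) // 2
    scale = 10 ** (5 - half)                    # pad figures out to metres
    easting = e100 * 100_000 + int(digits[:half]) * scale
    northing = n100 * 100_000 + int(digits[half:]) * scale
    return _to_wgs84.transform(easting, northing)
```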

On this basis I reckon the entity recognition will be a breeze – LTG will hardly have to stretch their muscles – which means we can ask them to work on grammars and machine-learning recognisers for parts of other related archives within the scope of CHALICE.

And we would have freedom in the EDINA team’s time to do more – specifically, to look at using the National Map Library of Scotland’s map rectifier tools to correlate the gazetteer with detailed line-drawn maps also created by the late H. D. G. Foxall. Digitisations of these maps live in the Shropshire Records Office. We must talk with them about their plans (the Records Office holds copyright in the scans).

The eye-opener for me was the index of sources, or rather the bibliography. Each place-name variant is marked with a few letters identifying the source of the name, so the index itself provides a key to old maps, gazetteers and archival records. To use Ant Beck’s phrase, the EPNS looks like a “decoupled synthesis” of place-name knowledge in all these sources. If we extract its structure, we recouple the synthesis and the sources, and then know where to look next to go text mining and digitising.

So we have the Shropshire Hundreds as a great place to start, as this is where EPNS is working now and the volumes are “born digital”. Back at CDDA, Paul Ell has some of the very earliest volumes digitised, and if we find a sample from the middle, we can produce grammar rules that we can be pretty confident will extract the right structure from the whole set, when the time comes to digitise and publish the entire 80+ volume, and growing, set.

But now I’m fascinated by the use of the EPNS-derived data as a concordance to so many associated archives documenting historic social patterns. Towards the end of our chat Paul Cavill was speculating about reconstructing Anglo-Saxon England by means of text mining and georeferencing archives – we could provide a reference map to help archaeologists understand what they are finding, or even help them focus on where to look for interesting archaeology.

Paul had been visited by the time-travelling Mormons digitising everything a couple of weeks previously, and will hopefully offer an introduction – I would really, really like to meet them.