Next: Digitisation and Exposure of English Place-names

Chalice was a short project, funded by JISC, to extract a digital gazetteer, in Linked Data form, from selected volumes of the English Place-Name Survey.

Happily, the same group of partners, with the addition of the Institute for Name-Studies, secured significant funding from JISC to complete the scanning, OCR, error correction and text mining of all the existing published volumes of the Survey.

The project, known as DEEP – Digitisation and Exposure of English Place-names – will run until 2013, when the resulting data will be made available through the JISC-supported Unlock Places geographic search API.

Unlock Text: new API

Apologies for the long hiatus on the Unlock blog; we have been busy developing the service behind the scenes. The Unlock team is now pleased to present a new API for Unlock Text, the geoparsing (place-name text mining and mapping) service.

Read the Getting Started guide: http://unlock.edina.ac.uk/texts/getstarted or jump straight to the full Unlock Text API documentation

Project Bamboo - research apps, infrastructure
Meanwhile, we’ve been working with the US-based, Mellon-funded Bamboo project, a research consortium building a “Scholarly services platform” of which geoparsing is a part. We’ve been helping Bamboo to define an API which can be implemented by many different services.

Bamboo needed to be able to send off requests and get responses asynchronously, and to poll to tell whether a geoparsing task is done. The resulting approach feels much more robust than our previous geoparser API (which had long wait times on large documents and was fussy about document types), and is better suited to the batches or collections of texts that most people will in practice be working with.
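As a rough illustration of that submit-then-poll pattern, here is a minimal Python sketch; the endpoint paths and response fields are placeholders rather than the actual Unlock Text API, for which see the Getting Started guide and documentation linked above.

    import time
    import requests

    # Placeholder base URL and paths - illustrative only; consult the Unlock Text
    # documentation for the real endpoints and response format.
    BASE = "http://example.org/unlock-text-api"

    def geoparse(text, poll_interval=5, timeout=300):
        """Submit a text for geoparsing, then poll until the job completes."""
        # 1. Submit the document asynchronously; assume we get back a job id.
        submitted = requests.post(f"{BASE}/jobs", data=text.encode("utf-8"),
                                  headers={"Content-Type": "text/plain"})
        submitted.raise_for_status()
        job_id = submitted.json()["id"]          # hypothetical response field

        # 2. Poll until the geoparsing task reports that it is done.
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = requests.get(f"{BASE}/jobs/{job_id}").json()
            if status.get("state") == "complete":   # hypothetical status value
                # 3. Fetch the place names and locations the geoparser found.
                return requests.get(f"{BASE}/jobs/{job_id}/result").json()
            time.sleep(poll_interval)
        raise TimeoutError(f"geoparsing job {job_id} did not finish in time")

The point of the pattern is that a large document or a whole collection can be submitted in one go and collected later, rather than holding a connection open for the duration of the parse.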

We hope others find the new Unlock Text API useful. We have a bit of a sprint planned in the coming weeks and would be happy to accept requests for new features or improvements; please get in touch with the Unlock team if there’s something you’re really missing.

List of outcomes of the Chalice project

We put together this long list of different things that happened during the Chalice project, for our last bi-weekly project meeting, on 28th April 2011. The final product post offers an introduction to Chalice.

Tangibles

These are pieces of work completed to form the project:

  • Corrected OCR for 5 EPNS volumes (*not* open licensed)
  • Quality assessment of the OCR
  • Extracted data in XML
  • Report on the text-mining and georeferencing process
  • RDF representation of extracted data, Open Database License
  • Searchable JSON API for the extracted data
  • Two prototype visualisations
  • Source code for the preceding 4 items
  • Two use case assessments
  • Supporting material for the use case assessments
  • Simple web service providing alternative names for the Archaeology Data Service (ADS)
  • Sample Integration with GBHGIS data

Intangibles

These are less concrete but equally valuable side-effects of the project work:

  • A set of sameAs assertions for Cheshire names, between geonames and the Ordnance Survey 50K gazetteer, to go to sameas.org (a minimal sketch of the form these take follows this list)
  • Historic place-name data which could potentially be used to enhance geonames.org
  • Improvements to the Edinburgh Geoparser and the Unlock Text service
  • Pushed forward open source release of the Geoparser
  • Refactoring of the Unlock Places service
  • Discussions and potential alignment with other projects (SPQR, Pleiades, GBHGIS)
  • Discussions with other place-name surveys (SPNS – Wales?)
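As a minimal sketch of the form those sameAs assertions take (first bullet above), using rdflib; the identifiers below are invented for illustration, not the real Cheshire links.

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    # Example pair of matched identifiers: (geonames URI, Ordnance Survey 50K URI).
    # These particular IDs are made up; the real Chalice output pairs each
    # EPNS-derived Cheshire name with its geonames and OS 50K gazetteer equivalents.
    matches = [
        ("http://sws.geonames.org/0000000/",
         "http://data.ordnancesurvey.co.uk/id/50kGazetteer/000000"),
    ]

    g = Graph()
    g.bind("owl", OWL)
    for geonames_uri, os_uri in matches:
        # owl:sameAs states that the two URIs identify the same place; this is
        # the form of assertion that sameas.org aggregates.
        g.add((URIRef(geonames_uri), OWL.sameAs, URIRef(os_uri)))

    print(g.serialize(format="turtle"))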

Talks / Dissemination

Final Product Post: Chalice: past places and use cases

This is our “final product post” as required by the #jiscexpo project guidelines. Image links had somehow got broken; they are fixed now, please re-view.

Chalice – Past Places

Chalice is for anyone working with historic material – be that archives of records, objects, or ideas. Everything happens somewhere. We aimed to provide a historic place-name gazetteer covering a thousand years of history, linked to attestations in old texts and maps.

Place-name scholarship is fascinating; looking at names, a scholar can describe the lay of the land and trace political developments. We would like to pursue further funding to work with the English Place-Name Survey on an expert-crowdsourced service covering the other 80+ volumes and extracting their detailed information – etymologies, field-names.

Linked to other archival sources, the place-name record has the potential to reveal connections between them, and in turn feed into deeper coverage in the place-name survey.

There is a Past Places browser to help illustrate the data and provide a Linked Data view of it.

Stuart Dunn did a series of interviews and case studies with different archival sources, making suggestions for integration. The report on our use case for the Clergy of the Church of England Database may be found here; and that on our study of the Victoria County History is here. We also had valuable discussions with the Archaeology Data Service, which were reported in a previous post.

Rather than taking a classical ‘user needs’ approach, targeting groups such as historians, linguists and indeed place-name scholars, we decided to look in detail at other digital resources containing reference material. This allowed us to start considering various ways in which a digitized, linkable EPNS could be automatically related to such resources. The problems are not only the ones we anticipated, of usability and semantic crossover between the place-name variants listed in EPNS and elsewhere, but also ones of data structure, domain terminology and the relationship of secondary references across such corpora. We hope these considerations will help inform future development of place-name digitization.

Project blog

This covers the work of the four partners in the project.

CeRch at KCL developed use cases through interviews with maintainers of different historic sources; there are blog descriptions of the conversations with the Clergy of the Church of England Database, the Victoria County History and the Archaeology Data Service teams.

LTG did some visualisations for these use cases and, more substantially, text mined the semi-structured text of several sample volumes of the English Place-Name Survey.

The extraction of corrected text from previously digitised pages was done by CDDA in Belfast. There is a blog report on the final quality of the work; however, the full resulting text is neither openly licensed nor distributed through Chalice.

EDINA took care of project management and software development. We used the opportunity to try out a Scrum-style “sprint” way of working with a larger team.

Table of contents for the project blog: here is an Atom feed of all the project blog posts, which should be categorised by project partner.

Project tag: chaliced

Full project name: Connecting Historical Authorities with Links, Contexts and Entities

Short description: Creating and re-using a linked data historic gazetteer through text mining.

Longer description: Text mining volumes of the English Place Name Survey to produce a Linked Data historic gazetteer for areas of England, which can then be used to improve the quality of georeferencing other archives. The gazetteer is linked to other placename sources on the Linked Data web via geonames.org and Ordnance Survey Open Data. Intensive user engagement with archive projects that can benefit from the open data gazetteer and open source text mining tools.

Key deliverables: Open source tools for text mining archives; Linked open data gazetteer, searchable through JISC’s Unlock service; studies of further integration potential.

Lead Institution: University of Edinburgh

Person responsible for documentation: Jo Walsh

Project Team: EDINA: Jo Walsh (Project Manager), Joe Vernon (Software Developer), Jackie Clark (UI design), David Richmond (Infrastructure), CDDA: Paul Ell (WP1 Coordinator), Elaine Yates (Administration), David Hardy (Technician), Karleigh Kelso (Clerical), LTG: Claire Grover (Senior Researcher), Kate Byrne (Researcher), Richard Tobin (Researcher), CeRch: Stuart Dunn (WP3 Coordinator).

Project partners and roles: Centre for Data Digitisation and Analysis, Belfast – preparing digitised text, Centre for e-Research, Kings College London – user engagement and dissemination, Language Technology Group, School of Informatics, Edinburgh – text mining research and tools.

This is the Chalice project blog and you can follow an Atom feed of blog posts (there are more to come).

The code produced during the Chalice project is free software; it is available under the GNU Affero GPL v3 license. You can get the code from our project sourceforge repository. The text mining code is available from LTG – please contact Claire Grover for a distribution…

The Linked Data created by text mining volumes of the English Place Name Survey – mostly covering Cheshire – is available under the Open Database License, a share-alike license for data from Open Data Commons.

The contents of this blog itself are available under a Creative Commons Attribution-ShareAlike 3.0 Unported license.

 


Link to technical instructional documentation

Project started: July 15th 2010
Project ended: April 30th 2011
Project budget: £68054


Chalice was supported by JISC as a project in its #jiscexpo programme. See its PIMS project management record for information about where responsibility fits in at JISC.

Talk to us about JISC 06/11

Glad to hear that Unlock has been cited in the JISC 06/11 “eContent Capital” call for proposals.

The Unlock team would be very happy to help anyone fit a beneficial use of Unlock into their project proposal. This could feature the Unlock Places place-name and feature search; and/or the Unlock Text geoparser service which extracts place-names from text and tries to find their locations.

One could use Unlock Text to create Linked Data links to geonames.org or Ordnance Survey Open Data. Or use Unlock Places to find the locations of postcodes; or find places within a given county or constituency…

Please drop an email to jo.walsh@ed.ac.uk or look up metazool on Skype or Twitter to chat about how Unlock fits with your proposal for JISC 06/11 …

Episode 2, SUNCAT: Notes from initial chat

Questions about use cases for the data, and what the benefits are to end users or to libraries – something we have to work on formulating with data in hand. Morag mentioned a 2005 document discussing SUNCAT use cases.

One suggested usage is deriving information about the focus and specialisms of libraries, by extending library subject metadata using journal/article subject metadata – so identifying the bent of universities through the holdings of their libraries.

Another immediate usage is linking bibliographic datasets of journal articles to the journal issues and journal information found in SUNCAT. Medline is a useful example of a dataset that can be integrated; work on Linked Open Medline metadata is happening through the OpenBiblio project.

SUNCAT holds a record for each institution and its library location, and this could helpfully be linked to the OARJ Linked Data for institutions and to the JISC CETIS PROD work collecting different sources of UK HE/FE information.

Sources in SUNCAT may have an OID which could be re-used as part of a URI. Journals, both electronic and hardcopy, also (though not always) have ISSNs.
There are restrictions on re-use of data licensed from the ISSN Network, but one can get some of it from other sources – CONSER is a North-America-focused example, with a bit of a scientific bent (thus useful for Medline).

SUNCAT uses OpenURL to search for journal articles and holdings data in institutional libraries. Libraries run an “OpenURL resolver” – often a piece of proprietary software such as SFX – to map OpenURLs onto items in their holdings. It would be interesting to find out more about the inside of an OpenURL resolver and how useful a Linked Data rendering of it would be…
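For illustration, an OpenURL is essentially a key–value query string describing a citation, which the resolver maps onto local holdings. A rough Python sketch of building one follows; the resolver address is a placeholder, the citation values are examples, and the exact keys a given resolver accepts may vary.

    from urllib.parse import urlencode

    # Placeholder resolver address - each library runs its own (e.g. an SFX instance).
    RESOLVER = "http://resolver.example.ac.uk/openurl"

    # OpenURL 1.0 key/value (KEV) fields describing a journal article citation.
    citation = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.jtitle": "An Example Journal",
        "rft.issn": "1234-5678",   # example value
        "rft.volume": "42",
        "rft.spage": "1",
    }

    # The resolver inspects these fields and redirects to the matching holding.
    print(f"{RESOLVER}?{urlencode(citation)}")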

We were surprised to learn that university libraries often don’t maintain their own subscription database; journals are bought in “bundles” whose contents shift, and libraries depend on vendors to sell them back their processed catalogue data.

SUNCAT contains a dataset describing libraries, their affiliations and their locations, held in a set of text files. This would be a good place to start with a simple Linked Data model that we can link up to the outcome of the previous LDFocus sprint, and then work on connecting up the library holdings data.
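A very rough sketch of what such a simple model might look like in rdflib, reusing existing vocabularies rather than minting our own; the URI pattern, property choices and values below are assumptions for illustration, not a settled SUNCAT scheme.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF, RDF, RDFS

    # Hypothetical URI pattern for SUNCAT-described libraries - illustration only.
    SUNCAT = Namespace("http://suncat.example.ac.uk/id/library/")
    # W3C Basic Geo vocabulary, for the library locations held in the text files.
    GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
    # W3C Organization ontology, for the affiliation to a parent institution.
    ORG = Namespace("http://www.w3.org/ns/org#")

    g = Graph()
    g.bind("foaf", FOAF)
    g.bind("geo", GEO)
    g.bind("org", ORG)

    # One row from the libraries dataset: identifier, name, parent institution, lat/long.
    lib = SUNCAT["example-library"]                 # made-up identifier
    g.add((lib, RDF.type, FOAF.Organization))
    g.add((lib, RDFS.label, Literal("Example University Library")))
    g.add((lib, GEO.lat, Literal("55.9486")))
    g.add((lib, GEO.long, Literal("-3.1999")))
    # Affiliation to the parent institution - something we would later link to
    # the OARJ / JISC CETIS PROD identifiers mentioned above.
    g.add((lib, ORG.subOrganizationOf, URIRef("http://institution.example.ac.uk/")))

    print(g.serialize(format="turtle"))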

We’re starting a separate notepad for SUNCAT links. We should have done this earlier, but have been busy with the new release of Unlock and the wrap-up of the Chalice project.

Testing Unlock 3: new API, new features, soon even documentation

This week we are publicly testing version 3 of Unlock – a fairly deep rewrite including a new, simpler API and some more geometric query functions (searching inside shapes, searching using a buffer). There is also new data: search across Natural Earth Data, returning shapes for countries, regions and other features worldwide. So at last we can use Natural Earth for search and link it up to geonames point data for countries. Unlock Text also gains an upgraded version of the Edinburgh Geoparser, so it offers date and event information as well as place-name text mining.

The new search work is now on our replicated server at Appleton Tower, and in a week or two we’ll switch the main unlock.edina.ac.uk over to the new version (keeping the old API supported indefinitely too). Below are notes and example links from Joe Vernon. If you do any testing or experimentation with this we’d be very interested to hear how you get on. Note that you can add ‘format=json‘ to any of these links to get JavaScript-friendly results, ‘format=txt‘ to get a CSV, and so on; a short Python sketch using the JSON output follows the generic search examples below.

‘GENERIC’ SEARCHING

http://geoxwalk-at.edina.ac.uk/ws/search?name=sheffield

http://geoxwalk-at.edina.ac.uk/ws/search?name=wales&featureType=european

http://geoxwalk-at.edina.ac.uk/ws/search?featureType=hotel&name=Marriott&minx=-79&maxx=-78&miny=36&maxy=37&operator=within
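As promised above, a minimal Python sketch of calling the generic search with format=json; it prints the raw response rather than assuming particular field names, since those depend on the version of the API.

    import requests

    # The test server quoted in this post; the main service will move back to
    # unlock.edina.ac.uk once version 3 is switched over.
    SEARCH = "http://geoxwalk-at.edina.ac.uk/ws/search"

    # Same query as the first example link above, asking for JSON back.
    params = {"name": "sheffield", "format": "json"}
    response = requests.get(SEARCH, params=params)
    response.raise_for_status()

    # Inspect the raw structure rather than assuming particular field names.
    print(response.json())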

NATURAL EARTH GAZETTEER

http://geoxwalk-at.edina.ac.uk/ws/search?name=lake&gazetteer=naturalearth&country=canada

DISTANCE BETWEEN TWO FEATURES

Distance between Edinburgh and Glasgow (by feature ID):

http://geoxwalk-at.edina.ac.uk/ws/distanceBetween?idA=14131223&idB=11153386

SEARCHING WITHIN A FEATURE – ‘SPATIAL MASK’

United Kingdom’s feature ID is: 14127855

Searching for ‘Washington’ within the United Kingdom…

http://geoxwalk-at.edina.ac.uk/ws/search?name=Washington&spatialMask=14127855

Also, note the difference between searching within the bounding box of the UK and adding the ‘realSpatial‘ parameter, which uses the polygon of the feature concerned.

http://geoxwalk-at.edina.ac.uk/ws/search?name=Washington&spatialMask=14127855&format=txt&maxRows=100&realSpatial=no

http://geoxwalk-at.edina.ac.uk/ws/search?name=Washington&spatialMask=14127855&format=txt&maxRows=100&realSpatial=yes

In this case, it picks up entries in Ireland if using the bounding box rather than the UK’s footprint.

SPATIAL SEARCHING WITH A BUFFER

8 hotels around the Royal Mile
http://geoxwalk-at.edina.ac.uk/ws/search?featureType=hotel&minx=-3.2&maxx=-3.19&miny=55.94&maxy=55.95&operator=within

75 within 2km
http://geoxwalk-at.edina.ac.uk/ws/search?featureType=hotel&minx=-3.2&maxx=-3.19&miny=55.94&maxy=55.95&operator=within&buffer=2000

FOOTPRINTS & POSTCODES

…should still be there:
http://geoxwalk-at.edina.ac.uk/ws/footprintLookup?identifier=14131223
http://geoxwalk-at.edina.ac.uk/ws/postCodeSearch?postCode=eh91pr

IMPLICIT COUNTRY SEARCHING

http://geoxwalk-at.edina.ac.uk/ws/search?format=txt&gazetteer=geonames&featureType=populated place&name=louth
vs
http://geoxwalk-at.edina.ac.uk/ws/search?format=txt&gazetteer=geonames&featureType=populated place&name=louth, uk

TIME BOUNDED SEARCH (still in development)

http://geoxwalk-at.edina.ac.uk/ws/search?name=edinburgh&startYear=2000&endYear=2009

http://geoxwalk-at.edina.ac.uk/ws/search?name=edinburgh&startYear=2000&endYear=2010

We’re very happy with all this, as it brings the Unlock service back up to offering something usefully distinctive; I’m trying to restrain myself from asking “if X was so easy, why don’t we do Y?”

Geo-linking EPNS to other sources

We’re wrapping up the loose ends on the Chalice project now, preparing to publish all the final material.


Claire Grover at LTG did some interesting map renderings of the English Place-Name Survey names that we’ve managed to link to names in geonames and the Ordnance Survey Linked Data.

Claire writes: Following last Thursday’s discussion, I’ve pulled out some figures about the georeferences in the Chalice data.

I’ve also mapped the georeferences for each of the files – see the .display.html files in
http://homepages.inf.ed.ac.uk/grover/chalicemaps/. The primary.display.html ones (example: Cheshire Vol. 44) contain only the places that were identified as primary-sub-townships while the all.display.html ones (example: Cheshire Vol. 44) contain all the places that have at least one grid reference. Note that the colour of the gridreferences and markers in the display indicates source: green ones are from unlock, red ones are from geonames and blue ones were provided by EPNS (known-gridref – only in Cheshire and Shropshire).

It’s not easy to draw any firm conclusions from this, but I tend to agree with Paul [Ell, of CDDA] that it would be better not to georeference smaller places (secondary-sub-townships) but instead to assign them the grid reference of the larger place they are contained in or associated with.

Preparing for Episode 2: SUNCAT

We’re preparing to start the second Linked Data Focus sprint next week (from May 16th) – working with the developers from the SUNCAT team, who are bibliographic data specialists.

Our notepad from the first sprint has a lot of links to relevant resources – introductions to RDF, tools in different languages, and descriptions of related work around academic institutions and communications.

This presentation by Jeni Tennison from the Pelagios workshop is also worth looking at for sensible advice about taking an existing data model into a Linked Data form. Ian took this sort of approach for the Open Access Repository Junction work – working through the different objects in a relational database model, thinking about how to decorate them with common RDF elements, then creating a vocabulary for the missing pieces. Some of the same questions about publishing and structuring Linked Data should come up; and in the middle of the sprint we’ll hold another Linked Data Learn-in at EDINA.

SUNCAT should have a fair bit more in common with existing Linked Data projects – particularly the JISC-supported OpenBiblio – and we’ll try to make links between SUNCAT-listed publications and some of their metadata. If we can get as far as then linking through to pre-prints in the institutional repositories found in OARJ, then I’ll be entirely satisfied.

Notes for SUNCAT

http://schemas.library.nhs.uk/ApplicationProfile/JournalHolding/ – JournalHoldings within the NHS. Quite a highly engineered vocabulary, and too specific for us to use directly, but perhaps a useful example.

http://purl.org/spar/fabio – FaBiO, the “FRBR-aligned bibliographic ontology”, one of the SPAR (Semantic Publishing and Referencing) ontologies being developed through the JISC Open Citations project.

http://opencitations.wordpress.com/2010/10/14/introducing-the-semantic-publishing-and-referencing-spar-ontologies/

http://bibliontology.com/ | http://purl.org/ontology/bibo/ – BIBO, the bibliographic ontology. Generic and with reasonably wide use but criticised by specialists.

See Also

Discussion of some technical issues during Sprint 1

General advice from Jeni Tennison – start by thinking about your domain objects, e.g. all the things your system is currently modelling, then decide which of them should have a URI. Only then start to think about describing their relations, and use others’ work wherever possible – we should only be minting our own vocabularies as a last resort…
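As a small sketch of that approach in Python with rdflib: pick the domain objects, mint URIs for them, then describe them with existing vocabularies such as Dublin Core Terms and BIBO before inventing anything new. The URI base and example values below are invented for illustration.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    # Hypothetical URI base for the things our system models (journals, holdings...).
    BASE = Namespace("http://example.ac.uk/id/journal/")
    # BIBO, the bibliographic ontology mentioned above.
    BIBO = Namespace("http://purl.org/ontology/bibo/")

    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.bind("bibo", BIBO)

    # Step 1: decide which domain objects get URIs - here, a single journal.
    journal = BASE["example-journal"]

    # Step 2: describe it using others' vocabularies wherever possible.
    g.add((journal, RDF.type, BIBO.Journal))
    g.add((journal, DCTERMS.title, Literal("An Example Journal")))
    g.add((journal, BIBO.issn, Literal("1234-5678")))   # example value

    print(g.serialize(format="turtle"))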
