Final Product Post: Chalice: past places and use cases

This is our “final product post” as required by the #jiscexpo project guidelines. Image links somehow got broken; they are now fixed, so please take another look.

Chalice – Past Places

Chalice is for anyone working with historic material – be that archives of records, objects, or ideas. Everything happens somewhere. We aimed to provide a historic place-name gazetteer covering a thousand years of history, linked to attestations in old texts and maps.

Place-name scholarship is fascinating; looking at names, a scholar can describe the lay of the land and trace political developments. We would like to pursue further funding to work with the English Place-Name Survey on an expert-crowdsourced service consuming the other 80+ volumes and extracting the detailed information – etymology, field-names.

Linked to other archival sources, the place-name record has the potential to reveal connections between them, and in turn feed into deeper coverage in the place-name survey.

There is a Past Places browser to help illustrate the data and provide a Linked Data view of it.

Stuart Dunn did a series of interviews and case studies with different archival sources, making suggestions for integration. The report on our use case for the Clergy of the Church of England Database may be found here; and that on our study of the Victoria County History is here. We also had valuable discussions with the Archaeology Data Service, which were reported in a previous post.

Rather than a classical ‘user needs’ approach, targeting groups such as historians, linguists and indeed place-name scholars, it was decided to look in detail at other digital resources containing reference material. This allowed us to start considering various ways in which a digitized, linkable EPNS could be automatically related to such resources. The problems are not only the ones we anticipated, of usability and semantic crossover between the placename variants listed in EPNS and elsewhere, but also ones of data structure, domain terminology and the relationship of secondary references across such corpora. We hope these considerations will help inform future development of placename digitization.

Project blog

This covers the work of the four partners in the project.

CeRch at KCL developed use cases through interviews with maintainers of different historic sources. There are blog descriptions of conversations with the Clergy of the Church of England Database, the Victoria County History, and the Archaeology Data Service.

LTG did some visualisations for these use cases and, more substantially, text-mined the semi-structured text of different sample volumes of the English Place Name Survey.

The extraction of corrected text from previously digitised pages was done by CDDA in Belfast. There is a blog report on the final quality of this work; however, the full resulting text is neither openly licensed nor distributed through Chalice.

EDINA took care of project management and software development. We used the opportunity to try out a Scrum-style “sprint” way of working with a larger team.

TOC for the project blog: here is an Atom feed of all the project blog posts, categorised by project partner.

Project tag: chaliced

Full project name: Connecting Historical Authorities with Links, Contexts and Entities

Short description: Creating and re-using a linked data historic gazetteer through text mining.

Longer description: Text mining volumes of the English Place Name Survey to produce a Linked Data historic gazetteer for areas of England, which can then be used to improve the quality of georeferencing in other archives. The gazetteer is linked to other placename sources on the Linked Data web via geonames.org and Ordnance Survey Open Data. Intensive user engagement with archive projects that can benefit from the open data gazetteer and open source text mining tools.

Key deliverables: Open source tools for text mining archives; Linked open data gazetteer, searchable through JISC’s Unlock service; studies of further integration potential.

Lead Institution: University of Edinburgh

Person responsible for documentation: Jo Walsh

Project Team: EDINA: Jo Walsh (Project Manager), Joe Vernon (Software Developer), Jackie Clark (UI design), David Richmond (Infrastructure), CDDA: Paul Ell (WP1 Coordinator), Elaine Yates (Administration), David Hardy (Technician), Karleigh Kelso (Clerical), LTG: Claire Grover (Senior Researcher), Kate Byrne (Researcher), Richard Tobin (Researcher), CeRch: Stuart Dunn (WP3 Coordinator).

Project partners and roles: Centre for Data Digitisation and Analysis, Belfast – preparing digitised text, Centre for e-Research, Kings College London – user engagement and dissemination, Language Technology Group, School of Informatics, Edinburgh – text mining research and tools.

This is the Chalice project blog and you can follow an Atom feed of blog posts (there are more to come).

The code produced during the Chalice project is free software; it is available under the GNU Affero GPL v3 license. You can get the code from our project sourceforge repository. The text mining code is available from LTG – please contact Claire Grover for a distribution…

The Linked Data created by text mining volumes of the English Place Name Survey – mostly covering Cheshire – is available under the Open Database License, a share-alike license for data by Open Data Commons.

The contents of this blog itself are available under a Creative Commons Attribution-ShareAlike 3.0 Unported license.

 


Link to technical instructional documentation

Project started: July 15th 2010
Project ended: April 30th 2011
Project budget: £68054


Chalice was supported by JISC as a project in its #jiscexpo programme. See its PIMS project management record for information about where responsibility fits in at JISC.

Reflections on the second Chalice scrum

We had a second two-week Scrum session on code for the Chalice project. This was a followup to the first Chalice scrum during which we made solid progress.

During the second Scrum the team ran into some blocks and progress slowed. The following is quite a soul-searching post, in accordance with the project documentation instructions: “don’t forget to post the FAIL(s) as well: telling people where things went wrong so they don’t repeat mistakes is priceless for a thriving community.”

Our core problem was the relative inflexibility of the relational database backend. We’d chosen to use an RDBMS rather than an RDF triplestore mainly for the benefits of code-reuse and familiarity, as this enabled us to repurpose code from a couple of similar EDINA projects, Unlock and Addressing History.

However, when the time came to revise the model based on updated data extracted from EPNS volumes, this created a chain of dependencies – updates to the data model, then the API, then the prototype visualisation. Progress slowed, and not much changed in the course of the second sprint.

A second problem was the lack of really clearly defined use cases, especially for a visual interface to the Chalice data. Here we have a bit of a chicken-and-egg situation: the work exploring how different archive projects can re-use the Chalice data to enhance their collections is still going on. This is something we are putting more emphasis on during the latter part of the project.

So on the one hand there’s a need for a working prototype to be able to integrate Chalice data with other resources; and on the other, a need to know how those resources will re-use the Chalice data to inform the prototype.

So what would we do differently if we did it again?

  • More of a design phase before the Scrum proper starts – with time to experiment with different data storage backends
  • More work developing detailed use cases before software development starts
  • More active collaboration between people talking to end users and people developing the backend (made more difficult because the project partners are distributed in space)

Below are some detailed comments from two of the Scrum team members, Ross and Murray.

Ross: I found Scrum useful and efficient – great for noticing both what others are doing and when you’re heading down the wrong path, and for identifying when you need further meetings, as was the case a few times early in the process. The whiteboard idea developed later on was also very useful. I don’t think the bottlenecks were anything to do with the use of Scrum, just with the amount of information and the quality of data we had available to us; maybe this is due partially to the absence of requirements gathering in Scrum.

The data we received had to be reverse engineered to some extent. As well as figuring out what everything in the given format was for (such as regnal dates, alternative names, contained places and their location relative to the parent) and what parts were important to us (such as which of the many date formats we were going to store, i.e. start, end and/or approximations), we also had no direct control over it.

In order for the database, interface and API to work we had to decide on a structure quickly and get data into the database. Learning how to install and operate a triple store (the recommended method), or spending time figuring out how to get Hibernate (a more adaptable database access technology) to work with the decided structure, would have delayed everything, so a trade-off was made: we manually wrote code to parse the data from XML and enter it into a familiar relational database, which caused us more problems later on. One of these was that the data continued to change with every generation; elements being added, removed or completely changed meant changing the parsing, then the domain objects, then the database and lastly the database insertion code.

Lack of use cases: from the start we were developing an app without knowing what it should look like or how it should function. We were unsure as to what data we should or would need to store, and how much control users of the service would have over the data in the database. We were unsure how to query the database and display API request responses so as to best fit the needs of the intended users in an efficient, useful way. We are slightly clearer on this now, but more information on how the product will be used would be greatly helpful.

And as for future development: if we are sticking with the relational database model, I definitely think it’s wise to get rid of all the database reading/writing code in favour of a Hibernate solution. This would be tricky with our database structure, but more adaptable and symmetrical, so that changes to the input method are also made to the output and only one change needs to be made. Some sort of XML-to-POJO relational tool may also be useful to further improve adaptability, although it would make importing new datasets more complex (perhaps using XSLT). As well as that, some more specific use cases mentioning inputs and required outputs would be very useful.
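
To make the problem concrete, here is a minimal sketch of the kind of hand-written XML-to-relational loading Ross describes, and why each change to the extracted data cascaded through the code. The element names, table columns and connection URL are invented for illustration; this is not the actual Chalice schema or loader.

```java
// Hypothetical sketch of hand-rolled XML-to-RDBMS loading; all names are invented.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PlaceLoader {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("epns-sample.xml");   // assumed input file
        NodeList places = doc.getElementsByTagName("place");      // assumed element name
        try (Connection con = DriverManager.getConnection("jdbc:postgresql:chalice");
             PreparedStatement ins = con.prepareStatement(
                 "INSERT INTO place (name, attested_date, source) VALUES (?, ?, ?)")) {
            for (int i = 0; i < places.getLength(); i++) {
                Element p = (Element) places.item(i);
                // Every renamed or added attribute in the source XML means editing
                // this mapping, the table definition and the insert statement by hand.
                ins.setString(1, p.getAttribute("name"));
                ins.setString(2, p.getAttribute("date"));
                ins.setString(3, p.getAttribute("source"));
                ins.executeUpdate();
            }
        }
    }
}
```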

Murray: My comment would be that we possibly should have worked on a Hibernate ORM first, before creating the database. As soon as we had natural keys, triggers and stored procedures in the database, it became too cumbersome to reverse engineer them.

If we had created an ORM mapping first we could have generated the database schema from that automatically, rather than the other way round. I presume we could write the searches, even the spatial ones, in Hibernate rather than in stored procedures. Then it would be easier to cope with all the shifts in the XML structure: propagating changes through the tiers would be a case of regenerating the database and domain objects from the mappings, rather than doing it by hand.

The generated domain objects could be reused across the data loading, API and search. The default lazy loading in Hibernate would have been good enough to deal with the hierarchical nature of the data to an arbitrary depth.
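
As a rough illustration of the mapping-first approach Murray suggests, an annotated entity along these lines could drive schema generation (via Hibernate’s hbm2ddl tooling) and give lazy loading over the place hierarchy; the class and field names are invented for the sketch and are not the actual Chalice domain model.

```java
// Sketch of an ORM-first mapping; Hibernate can generate the schema from it
// (hibernate.hbm2ddl.auto=create) and lazy loading walks the hierarchy on demand.
import java.util.ArrayList;
import java.util.List;
import javax.persistence.CascadeType;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

@Entity
public class Place {
    @Id @GeneratedValue
    private Long id;

    private String name;
    private String attestedDate;   // e.g. "1086", or a regnal-date string
    private String source;         // abbreviation of the documentary source

    @ManyToOne(fetch = FetchType.LAZY)
    private Place parent;          // containing township / parish / hundred

    @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY, cascade = CascadeType.ALL)
    private List<Place> containedPlaces = new ArrayList<>();

    // getters and setters omitted for brevity
}
```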

Chalice at WhereCamp

I was lucky enough to get to WhereCamp UK last Friday/Saturday, mainly because Jo couldn’t make it. I’ve never been to one of these unconferences before but was impressed by the friendly, anything-goes atmosphere, and emboldened to give an impromptu talk about CHALICE. I explained the project setup, its goals and some of the issues encountered, at least as I see them –

  • the URI minting question
  • the appropriateness (or lack of it) of only having points to represent regions instead of polygons
  • the scope for extending the nascent historical gazetteer we’re building and connecting it to others
  • how the results might be useful for future projects.

I was particularly looking for feedback on the last two points: ideas on how best to grow the historical gazetteer and who has good data or sources that should be included if and when we get funding for a wider project to carry on from CHALICE’s beginnings; and secondly, ideas about good use cases to show why it’s a good idea to do that.

We had a good discussion, with a supportive and interested audience. I didn’t manage to make very good notes, alas. Here’s a flavour of the discussion areas:

  • dealing with variant spellings in old texts – someone pointed out that the sound of a name tends to be preserved even though the spelling evolves, and maybe that can be exploited;
  • using crowd-sourcing to correct errors from the automatic processes, plus to gather further info on variant names;
  • copyright and IPR, and the fact that being out of print copyright doesn’t mean there won’t be issues around digital copyright in the scanned page images;
  • whether or not it would be possible – in a later project – to do useful things with the field names from EPNS;
  • the idea of parsing out the etymological references from EPNS, to build a database of derivations and sources;
  • using the gazetteer to link back to the scanned EPNS pages, to assist an online search application.

Plenty of use cases were suggested, and here are some that I remember, plus ideas about related projects that it might be good to tie up with:

  • a good gazetteer would aid research into the location of places that no longer exist, eg from Domesday period – if you can locate historical placenames mentioned in the same text you can start narrowing down the likely area for the mystery places;
  • the library world is likely to be very interested in good historical gazetteers, a case mentioned being the Alexandria Library project sponsored by the Library of Congress amongst others;
  • there are overlaps and ideas to share with similar historical placename projects like Pleiades, Hestia and GAP (Google Ancient Places).

I mentioned that, being based in Edinburgh, we’re particularly keen to include Scottish historical placenames. There are quite a few sources and people who have been working for ages in this area – that’s probably one of the next things to take forward, to see if we can tie up with some of the existing experts for mutual benefit.

There were loads of other interesting presentations and talk at WhereCamp… but this post is already too long.

Linking historic places: looking at Victoria County History

Stuart Dunn mentioned the Victoria County History in his writeup of discussions with the Clergy of the Church of England Database project. Both resources are rich in place-name mentions and historic depth; as part of the Chalice project we’re investigating ways to make such resources more searchable by extracting historic place-names and linking them to our gazetteer.

Here’s a summary of some email conversation between Stuart, Claire Grover, Ross Drew at EDINA and myself while looking at some sample data from VCH.

The idea is to explore the possibilities in how Chalice data could enhance / complement semi-structured information like VCH (or more structured database-like sources such as CCED).

It would be very valuable, I think, to do an analysis of how much effort and preparation of the (target) data is needed to link CHALICE to VCH, and a more structured dataset like CCED. By providing georeferences and toponym links, we’re bringing all that EPNS documentary evidence to VCH, thus enriching it.

It would be very interesting if we were able to show how text-mining techniques could be used to add to the work of EPNS – extracting place references that aren’t listed, and suggesting them to editors along with suggested attestations (source and date).

In the more immediate future, this is about adding links between Chalice place-references and other resources, which would allow us to cross-reference them and search them in interesting ways.

Text mining isn’t absolutely necessary to map the EPNS place names to the VCH text. On the other hand, LTG have all the processing infrastructure to convert formats, tokenise the text etc. so we could put something in place very quickly. It wouldn’t be perfect but it would demonstrate the point. I’ve not seen the CCED data, so don’t know how complex that would be.

Here’s a sample reference to a volume of VCH that may have some overlap with the Shropshire content we have in “born-digital” form from EPNS. There’s the intriguing prospect of adding historic place-name text mining/search in at the digitisation phase, so resources can be linked to other references as soon as they’re published.

Connecting archives with linked geodata – Part I

This is the first half of the talk I gave at FOSS4G 2010 covering the Chalice project and the Unlock services. Part II to follow shortly…

My starting talk title, written in a rush, was “Georeferencing archives with Linked Open Geodata” – too many geos; though perhaps they cancel one another out, and just leave *stuff*.

In one sense this talk is just about place-name text mining. Haven’t we seen all this before? Didn’t Schuyler talk about Gutenkarte (extracting place-names from classical texts and exploring them using a map) in like, 2005, at OSGIS before it was FOSS4G? Didn’t Metacarta build a multi-million business on this stuff and succeed in getting bought out by Nokia? Didn’t Yahoo! do good-enough gazetteer search and place-name text mining with Placemaker? Weren’t *you*, Jo, talking about Linked Data models of place-names and relations between them in 2003? If you’re still talking about this, why do you still expect anyone to listen?

What’s different now? One word: recursion. Another word: potentiality. Two more words: more people.

Before i get too distracted, i want to talk about a couple of specific projects that i’m organising.

One of them is called Chalice, which stands for Connecting Historical Authorities with Linked Data, Contexts, and Entities. Chalice is a text-mining project, using a pipeline of Natural Language Processing and data munging techniques to take some semi-structured text and turn the core of it into data that can be linked to other data.

The target is a beautiful production called the English Place Name Survey. This is a definitive-as-possible guide to place-names in England, their origins, the names by which things were known, going back through a thousand years of documentary evidence, reflecting at least 1500 years of the movement of people and things around the geography of England. There are 82 volumes of the English Place Name Survey, which started in 1925 and is still being written (and once it’s finished, new generations of editors will go back to the beginning, and fill in more missing pieces).

Place-name scholars amaze me. Just by looking at words and thinking about breaking down their meanings, place-name scholars can tell you about drainage patterns, changes in the order of political society, why people were doing what they were doing, where. The evidence contained in place-names helps us cross the gap between the archaeological and the digital.

So we’re text mining EPNS and publishing the core (the place-name, the date of the source from which the name comes, a reference to the source, references to earlier and later names for “the same place”). But why? Partly because the subject matter, the *stuff*, is so very fascinating. Partly to make other, future historic text mining projects much more successful, to get a better yield of data from text, using the one to make more sense of the other. Partly just to make links to other *stuff*.

In newer volumes the “major names”, i.e. the contemporary names (or the last documented name for places that have become forgotten), have neat grid references, point-based, thus they come geocoded. The earliest works have no such helpful metadata. But we have the technology; we can infer it. Place-name text mining, as my collaborators at the Language Technology Group in the School of Informatics in Edinburgh would have it, is a two-phase process. The first phase is “geo-tagging”, the extraction of the place-names themselves, using techniques that are either rule-based (“glorified regular expressions”) or machine-learning based (“neural networks” for pattern recognition, like spam filters, that need a decent volume of training data).

The second phase is “geo-resolution”: given a set of place-names and relations between them, figuring out where they are. The assumption is that places cluster together in space much as they do in the text, and on the whole that works out better than other assumptions. As far as i can see, the state of the research art in Geographic Information Retrieval is still fairly limited to point-based data, projections onto a Cartesian plane. This is partly about data availability, in the sense of access to data (lots of research projects use geonames data for its global coverage, open license, and linked data connectivity). It’s partly about data availability in the sense of access to thinking. Place-name gazetteers look point-based, because the place-name on a flat map begins at a point on a Cartesian plane. (So many place-name gazetteers are derived visually from the location of strings of text on maps; they are for searching maps, not for searching *stuff*.)
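
For readers who haven’t met this before, here is a toy illustration of the two phases – in no way the LTG pipeline: a “glorified regular expression” doing the geo-tagging, and a naive geo-resolution step that prefers the candidate location closest to the centroid of all the candidates. The mini-gazetteer and its coordinates are invented for the example.

```java
// Toy two-phase geoparser: regex geo-tagging, then centroid-based geo-resolution.
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyGeoparser {

    record Coord(double lat, double lon) {}

    // Phase 1 ("geo-tagging"): capitalised tokens following a locational cue word.
    static final Pattern PLACE = Pattern.compile("\\b(?:at|in|near|of)\\s+([A-Z][a-z]+)");

    public static void main(String[] args) {
        String text = "The manor at Walton lay in Cheshire, near Runcorn.";
        // Invented mini-gazetteer: a name may have several candidate locations.
        Map<String, List<Coord>> gazetteer = Map.of(
            "Walton",   List.of(new Coord(53.37, -2.60), new Coord(51.39, -0.42)),
            "Cheshire", List.of(new Coord(53.21, -2.52)),
            "Runcorn",  List.of(new Coord(53.34, -2.73)));

        // Phase 2 ("geo-resolution"): assume the places in a text cluster together,
        // so pick each name's candidate nearest the centroid of all candidates.
        Coord centroid = centroid(gazetteer.values().stream().flatMap(List::stream).toList());

        Matcher m = PLACE.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            Coord best = gazetteer.getOrDefault(name, List.of()).stream()
                .min((a, b) -> Double.compare(dist(a, centroid), dist(b, centroid)))
                .orElse(null);
            System.out.println(name + " -> " + best);
        }
    }

    static Coord centroid(List<Coord> cs) {
        double lat = cs.stream().mapToDouble(Coord::lat).average().orElse(0);
        double lon = cs.stream().mapToDouble(Coord::lon).average().orElse(0);
        return new Coord(lat, lon);
    }

    static double dist(Coord a, Coord b) {
        double dLat = a.lat() - b.lat(), dLon = a.lon() - b.lon();
        return Math.sqrt(dLat * dLat + dLon * dLon);
    }
}
```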

So next steps seem to involve

  • dissolving the difference between narrative, and data-driven, representations of the same thing
  • inferring things from mereological relations (containment-by, containment-of) rather than sequential or planar relations

On the former – data are documents, documents are data.

On the latter, this helps explain why i am still talking about this, because it’s still all about access to data. Amazing things, that i barely expected to see so quickly, have happened since i started along this path 8 years ago. We now have a significant amount of UK national mapping data available on properly open terms, enough to do 90% of things. OpenStreetmap is complete enough to base serious commercial activity on; Mapquest is investing itself in supporting and exploiting OSM. Ordnance Survey Open Data combines with it to add a lot of as-yet hardly tapped potential…

Read more, if you like, in Connecting archives with linked geodata – Part II which covers the use of and plans for the Unlock service hosted at the EDINA data centre in Edinburgh.

Chalice poster from AHM 2010

Chalice had a poster presentation at the All Hands Meeting in Cardiff; the poster session was an evening over drinks in the National Museum of Wales, and all very pleasant.

Chalice poster

View the poster on scribd and download it from there if you like; be aware the full-size version is rather large.

I’ve found the poster very useful; I projected it instead of presentation slides while talking at FOSS4G and at the Place-Names workshop in Nottingham on September 3rd.

Quality of text correction analysis from CDDA

The following post is by Elaine Yeates, project manager at the Centre for Data Digitisation and Analysis in Belfast. Elaine and her team have been responsible for taking scans of a selection of volumes of the English Place Name Survey and turning them into corrected OCR’d text, for later text mining to extract the data structures and republish them as Linked Data.

“I’ve worked up some figures based on an average character count from Cheshire, Buckinghamshire, Cambridgeshire and Derbyshire.

We had two levels of quality control:

1st QA, spelling and font: on completion of the OCR process, and based on 40 pages averaging 4000 characters per page, the error rate was 346 character errors (an average of 8.65 per page) = 0.22%.

1st QA, Unicode: on completion of the OCR process, and based on 40 pages averaging 4000 characters per page, the error rate was 235 character errors (an average of 5.87 per page) = 0.14%.

Total error rate: 0.36%.

2nd QA: this encompasses all of the 1st QA; based on 40 pages averaging 4000 characters per page, the error rate was 18 character errors (an average of 0.45 per page) = 0.01%.

Through the pilot we identified that there are quite a few Unicode characters unique to this material. CDDA developed an in-house online Unicode database for analysts, who can view and update the capture file and raise new codes when found. I think for a more substantial project we might direct our QA process through an online audit system, where we could identify issues with material, the OCR of same, macros and the 1st and 2nd stages of quality control.

We are pleased with these figures and it looks encouraging for a larger scaled project.”

Elaine also wrote in response to some feedback on markup error rates from Claire Grover on behalf of the Language Technology Group:

“Thanks for these. Our QA team are primarily looking for spelling errors; from your list, the few issues seem to be bold, spaces and small caps.

Of course when tagging, especially automated tagging, you’re looking for certain patterns; however, moving forward I feel this error rate is very encouraging, and it helps our QA team to know what patterns might be searchable for future capture.

Looking at your issues so far, on Part IV (5 issues e-mailed) and a total word count of 132,357 (an error rate of 0.00003).”

I am happy to have these numbers, as one can observe consistency of quality over iterations, as means are found to work with more volumes of EPNS.

Musings on the first Chalice Scrum

For a while I’ve been hearing enthusiastic noises about how Scrum development practice can focus productivity and improve morale, and have been agitating within EDINA to try it out. So Chalice became the guinea-pig first project for a “Rapid Application Development” team; we did three weeks between September 20th and October 7th. In the rest of this post I’ll talk about what happened, what seemed to work, and what seemed soggy.

What happened?

  • We worked as a team 4 days a week, Monday-Thursday, with Fridays either to pick up pieces or to do support and maintenance work for other projects.
  • Each morning we met at 9:45 for 15 minutes to review what had happened the day before and what would happen that day
  • Each item of work-in-progress went on a post-it note in our meeting room
  • The team was 4+1 people – four software developers, with a database engineer consulting and sanity checking
  • We had three deliverables –
        a data store and data loading tools
        a RESTful API to query the data
        a user interface to visualise the data as a graph and map

In essence, this was it. We slacked on the full Scrum methodology in several ways:

  • No estimates.

Why no estimates? The positive reason: this sprint was mostly about code re-use and concept re-design; we weren’t building much from scratch. The data model design, and the API to query bounding boxes in time and space (there’s a rough sketch of such a query after this list), were plundered and evolved from Unlock. The code for visualising queries (and the basis for annotating results) was lifted from Addressing History. So we were working with mostly known quantities.

  • No product owner

This was mostly an oversight; we went into the process without much preparation time. I put myself in the “Scrum master” role by instinct, whereas other project managers might be more comfortable playing “product owner”. With hindsight, it would have been great to have a team member from a different institution (the user-facing folk at CeRch), or our JISC project officer, visit for a day and play product owner.
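
Since the bounding-box API is mentioned above, here is a hypothetical sketch of what a “query bounding boxes in time and space” endpoint might look like as a JAX-RS resource; the path, parameter names and return type are invented for illustration and are not the actual Chalice or Unlock API.

```java
// Hypothetical sketch of a bounding-box-plus-date-range query endpoint; all names invented.
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/places")
public class PlaceSearchResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<String> search(@QueryParam("minx") double minX,
                               @QueryParam("miny") double minY,
                               @QueryParam("maxx") double maxX,
                               @QueryParam("maxy") double maxY,
                               @QueryParam("from") String fromDate,  // e.g. "1086"
                               @QueryParam("to")   String toDate) {  // e.g. "1500"
        // A real implementation would return places whose location falls inside the
        // box and whose attestations overlap the date range; this placeholder just
        // shows the shape of the query.
        return List.of();
    }
}
```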

What seemed to work?

The “time-boxed” meeting (every morning for 15 minutes at 9:45) seemed to work very well. It helped keep the team focused and communicating. I was surprised that team members actually wanted to talk for longer, and broke up into smaller groups to discuss specific issues.

The team got to share knowledge on fundamentals that should be re-usable across many other projects and services – for example, the optimum use of Hibernate to move objects around in Java, decoupled from the original XML sources and the database implementation.

Emphasis on code re-use meant we could put together a lot of stuff in a compressed amount of time.

Where did things go soggy?

From this point we get into some collective soul-searching, in the hope that it’s helpful to others for future planning.

The start and end were both a bit halting – so out of the 12 days available, we were actually “on” for only 7 or 8 of them. The start went a bit awkwardly because:

      We didn’t have the full team available ’til day 3 – holidays scheduled before the Scrum was planned
      It wasn’t clear to other project managers that the team were exclusively working on something else; so a couple of team members were yanked off to do support work before we could clearly establish our rules (e.g. “you’ll get yours later”).

We could address the first problem through more upfront public planning. If the Scrum approach seems to work out and EDINA sticks with it for other projects and services, then a schedule of intense development periods can be published with a horizon of up to 6 months – team members know which times to avoid – and we can be careful about not clashing with school holidays.

We could address the second problem by broadcasting more, internally to the organisation, about what’s being worked on and why. Other project managers will hopefully feel happier with arrangements once they’ve had a chance to work with the team. It is a sudden adjustment in development practice, where the norm has been one or two people full-time for a longish stretch on one service or project.

The end went a bit awkwardly because:

    I didn’t pin down a definite end date – I wasn’t sure if we’d need two or three weeks to get enough done, and my own dates for the third week were uncertain
    Non-movable requirements for other project work came up right at the end, partly as a by-product of this

The first problem meant we didn’t really build to a crescendo, but rather turned up at the beginning of week 3 and looked at how much of the post-it-note map we still had to cover. Then we lost a team member, and the last couple of days turned into a fest of testing and documentation. This was great in the sense that one cannot overstate the importance of tests and documentation. This was less great in that the momentum somewhat trickled away.

On the basis of this, I imagine that we should:

  • Schedule up-front more, making sure that everyone involved has several months advance notice of upcoming sprints
  • Possibly leave more time than the one review week between sprints on different projects
  • Wait until everyone, or almost everyone, is available, rather than make a halting start with 2 or 3 people

We were operating in a bit of a vacuum as to end-user requirements, and we also had somewhat shifting data (changing in format and quality during the sprint). This was another scheduling fail for me – in an ideal world we would have waited another month, seen some in-depth use case interviews from CeRch and had a larger and more stable collection of data from LTG. But when the chance to kick off the Scrum process within the larger EDINA team came up so quickly, I just couldn’t postpone it.

We plan a follow-up sprint, with the intense development time between November 15th and 25th. The focuses here will be

  • adding annotation / correction to the user interface and API (the seeds already existing in the current codebase)
  • adding the ability to drop in custom map layers

Everything we built at EDINA during the sprint is in Chalice’s subversion repository on Sourceforge – which I’m rather happy with.

Posters and presentations

Happy to have had CHALICE accepted as a poster presentation for the e-Science All Hands Meeting in Cardiff this September. It will be good to have a glossy poster. Pleased to have been accepted at all, as the abstract was rather scrappy and last-minute. I had a chance to revise it, and have archived the PDF abstract.

I’m also doing a talk on CHALICE, related work and future dreams, at the FOSS4G 2010 conference in Barcelona a few days earlier. It’s going to be a good September, I hope.

Visiting the English Place Name Survey

I was in Nottingham for OSGIS at the Centre for Geospatial Sciences on Tuesday; skipped out between lunch and coffee break to visit the English Place Name Survey in the same leafy campus.

A card file at EPNS

Met with Paul Cavill, who dropped me right into the heart of the operation – stacks of index cards in shoe boxes. Each major name has a set of annotation cards, describing different related names and their associations and sources – which range from Victorian maps to Anglo-Saxon chronicles.

The editing process takes the card sets and turns them right into print-ready manuscript. The manuscript then has very consistent layout conventions – capitalisation, indentation. This is going to make our work of structure mining a lot easier.

Another bonus I wasn’t expecting was the presence of OSGB grid references for all the major names. The task of making links becomes a snap – I was imagining a lot of iterative guesswork based on clustering and closeness to names in other sources. (There are four Waltons in the UK in geonames, dozens in the EPNS).
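
To give a feel for how consistent layout conventions plus grid references turn into structure mining, here is a toy parser over an invented, EPNS-flavoured entry. The entry format, names and source abbreviations are made up for illustration; the real volumes are far richer, and Chalice uses LTG’s text-mining pipeline rather than a regex like this.

```java
// Toy sketch of layout-driven structure mining over an invented entry format.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyEntryParser {

    // Major name in capitals, an OS grid reference in brackets, then attestations
    // of the form "<historic form> <date> <source abbreviation>", comma-separated.
    static final Pattern ENTRY = Pattern.compile(
        "^([A-Z][A-Z ]+) \\(([A-Z]{2} \\d{4})\\) (.+)$");
    static final Pattern ATTESTATION = Pattern.compile(
        "([A-Za-z'\\-]+) (\\d{3,4}) ([A-Z][A-Za-z]*)");

    public static void main(String[] args) {
        String line = "WALTON (SJ 6035) Waletone 1086 DB, Waleton 1271 Ipm";
        Matcher e = ENTRY.matcher(line);
        if (e.matches()) {
            System.out.println("Major name: " + e.group(1) + ", grid ref: " + e.group(2));
            Matcher a = ATTESTATION.matcher(e.group(3));
            while (a.find()) {
                // Each attestation links a historic form to a date and a source
                // abbreviation that keys into the volume's bibliography.
                System.out.printf("  form=%s date=%s source=%s%n",
                        a.group(1), a.group(2), a.group(3));
            }
        }
    }
}
```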

On this basis I reckon the entity recognition will be a breeze; LTG will hardly have to stretch their muscles, which means we can ask them to work on grammars and machine-learning recognisers for parts of other related archives within the scope of CHALICE.

And we would have freedom in the EDINA team’s time to do more – specifically to look at using the National Map Library of Scotland’s map rectifier tools to correlate the gazetteer with detailed line-drawn maps also created by the late H. D. G. Foxall. Digitisations of these maps live in the Shropshire Records Office. We must talk with them about their plans (the Records Office holds copyright in the scans).

The eye-opener for me was the index of sources, or rather the bibliography. Each placename variant is marked with a few letters identifying the source of the name. So the index itself provides a key to old maps, gazetteers and archival records. To use Ant Beck’s phrase, the EPNS looks like a “decoupled synthesis” of placename knowledge in all these sources. If we extract its structure, we are recoupling the synthesis and the sources, and now know where to look next to go text mining and digitising.

So we have the Shropshire Hundreds as a great place to start, as this is where the EPNS are working now and the volumes are “born digital”. Back at CDDA, Paul Ell has some of the very earliest volumes digitised, and if we find a sample from the middle, we can produce grammar rules that we can be pretty confident will extract the right structure from the whole set, when the time comes to digitise and publish the entire 80+ volume, and growing, set.

But now I’m fascinated by the use of the EPNS-derived data as a concordance to so many associated archives documenting historic social patterns. Towards the end of our chat Paul Cavill was speculating about reconstructing Anglo-Saxon England by means of text mining and georeferencing archives – we could provide a reference map to help archaeologists understand what they are finding, or even help them focus on where to look for interesting archaeology.

Paul had been visited by the time-travelling Mormons digitising everything a couple of weeks previously, and will hopefully offer an introduction – I would really, really like to meet them.