Update on the automatic processing

We are making steady progress on the automatic processing pipeline that converts OCR output into heavily annotated XML texts. The pipeline makes successive passes over an EPNS volume, with each stage adding further layers of annotation.

The first step converts a Word file to XML, retaining the high-level structural mark-up that CDDA added by hand. In the input each line is a separate paragraph, so this step examines sequences of lines and creates proper paragraphs around running text, reversing line-end hyphenation at the same time.
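The paragraph-building part of this step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function name join_lines is hypothetical, and it assumes that every hyphen at the end of a line marks a broken word. A genuine hyphen that happens to fall at a line end would be merged wrongly, and the real step presumably treats that case with more care.

```python
def join_lines(lines):
    """Merge consecutive lines of running text into one paragraph string.

    Assumption: a line ending in "-" is a word broken at the line end,
    so the hyphen is dropped and the next line is attached without a space.
    """
    paragraph = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines inside the run
        if paragraph.endswith("-"):
            # Reverse line-end hyphenation: "Che-" + "shire" -> "Cheshire"
            paragraph = paragraph[:-1] + line
        elif paragraph:
            paragraph += " " + line
        else:
            paragraph = line
    return paragraph


if __name__ == "__main__":
    print(join_lines(["The township lies on the Che-", "shire plain near", "the river."]))
```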

The second step identifies word and punctuation tokens, retaining information about font style and weight as attributes on the token elements. Sentence splitting is applied in running text paragraphs at this stage.
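As a rough illustration of the tokenisation, the sketch below turns styled text runs into token elements. The element names w and pc, the style attribute, and the input format (a list of text and style pairs) are all assumptions made for this example, not the names used in the real pipeline, and sentence splitting is left out.

```python
import re
import xml.etree.ElementTree as ET

# A word is a run of word characters; any other non-space character
# is treated as a single punctuation token.
TOKEN_RE = re.compile(r"(\w+)|([^\w\s])")


def tokenise(runs):
    """Build a <p> element of tokens from (text, style) runs.

    Element and attribute names here are hypothetical.
    """
    paragraph = ET.Element("p")
    for text, style in runs:
        for match in TOKEN_RE.finditer(text):
            is_word = match.group(1) is not None
            token = ET.SubElement(paragraph, "w" if is_word else "pc")
            if style:
                token.set("style", style)  # keep the font information
            token.text = match.group(0)
    return paragraph


if __name__ == "__main__":
    p = tokenise([("Bromehale", "italic"), (" 1426 Plea,", "")])
    print(ET.tostring(p, encoding="unicode"))
```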

In the third step some of the higher level elements are segmented into smaller parts: field and street name sections are split into their individual members and long paragraphs containing minor place names, some with attestation or etymological information, some without, are split into individual entries. The name of each place is computed at this point and stored in an attribute on the relevant entry. While most of these names are straightforwardly recognisable, several variants can be compressed in quite complex ways (e.g. Woodhouse, (-End & -Green); The Dighills, Dighill Brook & Wood; Fanshawe (Lane), Fanshawe Brook (Fm); Whirleybarn, Whirley Cottages, Grove & Rd). At this stage just the extent of the name is found, leaving later stages to expand the names into all their variants.

The fourth step does finer-grained segmentation of the information associated with places. Attestations are place-name forms linked to the dates and sources in which they were attested. For example, for the township Bramhall in the parish of Stockport, an earlier form is shown like this:

Bromehale 1426 Plea, 1433 ChRR

meaning that the form Bromehale was recorded in 1426 in the source abbreviated ‘Plea’ (Plea Rolls of the County of Chester) and in 1433 in the source abbreviated ‘ChRR’ (Calendar of the Chester Recognizance Rolls). The etymology of Bramhall is glossed as ‘Broom-nook’ from the elements brōm and halh. Each of these pieces of information (form, date, source, gloss, place name element) is identified and marked up with an appropriate XML element. To identify the sources we use a lexicon derived from the abbreviations list for the relevant county and, once identified, we add the id of the source to provide a link to the full form of the abbreviation. For each place name element, we query a local copy of the Key to English Place Names database and add the relevant database id if it is found.
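To make the structure of an attestation concrete, here is a small Python sketch that splits the Bramhall example into form, date and source and looks each source up in a lexicon. It assumes the simple pattern of one form followed by comma-separated pairs of a four-digit year and an abbreviation. The function name, the dictionary standing in for the county lexicon and the output format are hypothetical; the real pipeline works on XML tokens rather than plain strings.

```python
import re

# Stand-in for the county abbreviations list (two entries from the example).
SOURCE_LEXICON = {
    "Plea": "Plea Rolls of the County of Chester",
    "ChRR": "Calendar of the Chester Recognizance Rolls",
}

# A four-digit year followed by a source abbreviation.
DATE_SOURCE_RE = re.compile(r"(\d{4})\s+([A-Za-z]+)")


def parse_attestation(text):
    """Return one record per (date, source) pair attached to the form."""
    form, _, rest = text.strip().partition(" ")
    records = []
    for year, abbrev in DATE_SOURCE_RE.findall(rest):
        records.append({
            "form": form,
            "date": int(year),
            "source": abbrev,
            # None when the abbreviation is not in the lexicon
            "source_name": SOURCE_LEXICON.get(abbrev),
        })
    return records


if __name__ == "__main__":
    for record in parse_attestation("Bromehale 1426 Plea, 1433 ChRR"):
        print(record)
```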

We are still working on the subsequent steps, with two main things left to do. The first is expanding the shorthand used to record alternative forms, e.g. expanding Fallibro(o)me, -y-, -bro(m) to give the forms Fallibrome, Fallibroome, Fallybrome, Fallibrom, Fallibro. The second is georeferencing the places. Here we need to convert any old OS map references provided by EPNS to latitude and longitude, and we will also georeference the parishes and major place names against the Unlock gazetteer using the Edinburgh Geoparser. Once the georeferences have been verified, only a small transformation will be needed to create entries in the DEEP historical gazetteer.
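One part of the expansion problem, the optional letters written in brackets, is simple enough to sketch in Python. The function name expand_optional is hypothetical and this covers only the bracket notation: hyphen-delimited fragments such as -y- or -bro(m) must additionally be aligned against the base form, which this sketch does not attempt.

```python
import re

# A bracketed group with no nested brackets, e.g. "(o)" in "Fallibro(o)me".
OPTIONAL_RE = re.compile(r"\(([^()]*)\)")


def expand_optional(form):
    """Return every spelling obtained by omitting or keeping each bracketed group."""
    match = OPTIONAL_RE.search(form)
    if match is None:
        return [form]
    before, after = form[:match.start()], form[match.end():]
    # Variant without the optional letters, then the variant with them;
    # recursion handles any further bracketed groups.
    without = expand_optional(before + after)
    with_letters = expand_optional(before + match.group(1) + after)
    return without + with_letters


if __name__ == "__main__":
    print(expand_optional("Fallibro(o)me"))
    print(expand_optional("-bro(m)"))
```

A form with several identical optional groups can yield the same spelling more than once, so a fuller version would probably remove duplicates before storing the variants.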

