Checking the OCR samples

Posted on May 9, 2012 by Stuart Dunn

Having just proof-read OCR samples from 6 county survey volumes (my first task on this project), I have been astonished at the sheer quality of the OCR output. Scanning technology has certainly come a long way since I first saw grainy scanned texts over 20 years ago. The samples have come from early volumes, with all the attendant issues of unavailability of certain fonts/characters and uneven inking on the page (which makes it hard to distinguish bolded from unbolded text). The OCR technology seems to have dealt very well with these difficulties and faithfully reproduced pretty much everything on the page – to the point where some things (such as the printers’ marks at the foot of some pages) will have to be removed. The main issues arising are missing macrons from place name elements; the treatment of footnotes; and how addenda/corrigenda will be incorporated. These issues are unrelated to the OCR process and will require an editorial decision.

DEEP in numbers

Posted on March 8, 2012 by Stuart Dunn

Following my talk at yesterday’s GeoCulture seminar in London, I thought I’d post some figures about the SEPN content which DEEP is digitising:

80+ years of scholarship
32 English counties
86 volumes
6157 elements
30,517 pages
c. 4,000,000 individual place-name forms
??? Bibliographic references (we will know soon – it’s quite a lot)

Digitisation as research

Posted on February 23, 2012 by Stuart Dunn

With the REF on the horizon, most academics are currently concerned with matters of impact and academic recognition. Therefore, getting academic recognition for a digitisation project, such as those funded under the JISC eContent programmes, is an important question. In order to receive JISC funding to digitise content, one has, of course, to demonstrate the academic value of the resource to be digitised, and to explain how making it available digitally will increase that value. The impact and value of digitisation outputs themselves, and how they fit into peer-review structures, have been the subject of previous studies, but the issue of getting credit for undertaking digitisation itself is less clear. This can cause problems when dealing with outside bodies concerned with the review or evaluation of research, or even with one’s own institution. In some cases, for example, digitisation activities might be interpreted as software development or IT support, thus preventing those involved from getting academic credit. How this classification is made varies from HEI to HEI. In some cases, an email from the PI or Co-I confirming that the project is ‘research’ will suffice; in others there is a questionnaire or some other pro forma. However they classify activities, most Higher Education Institutions adopt the principles of the Frascati Manual’s definition of research, or something very similar to them. These break research down into three headings:

Basic research is experimental or theoretical work undertaken primarily to acquire new knowledge of the underlying foundation of phenomena and observable facts, without any particular application or use in view.
Applied research is also original investigation undertaken in order to acquire new knowledge. It is, however, directed primarily towards a specific practical aim or objective.
Experimental development is systematic work, drawing on existing knowledge gained from research and/or practical experience, which is directed to producing new materials, products or devices, to installing new processes, systems and services, or to improving substantially those already produced or installed.

Most academic digitisation work is likely to fall into the third category, provided that making digital resources available is accompanied by some form of enhancement, such as machine-readable mark-up or a crowd-sourcing platform. This is especially so if it can be shown that the enhancement is drawn directly from the project team’s experience and expertise. Certainly in the context of the DEEP project, there are complicated questions of data structure, interpretation and mark-up, the exploration of which would appear as research questions to most scholars and deserving of recognition as such. Undoubtedly they require the extremely interdisciplinary skill set of all the partners.

Projects needing to make this argument may wish to consider the following suggestions:

1. Ensure the research question or questions that your resource will be addressing are clearly articulated, and that you have to hand a clear statement describing the unique knowledge needed to make it digitally available in the way you have chosen.

2. Refer to the Frascati guidelines, and any relevant institutional definitions of research and related activities.

3. Ensure you are talking to the right person. It may be the case that staff charged with classifying activities are not familiar with digitisation. This is especially so in departments or schools with little experience of such projects. In such cases, the decision on whether to classify the project as research may well need to be taken at a higher level than normal.

Both the Centre for Data Digitisation and Research at QUB and the Centre for e-Research in the Dept. of Digital Humanities at KCL have extensive experience in dealing with such projects, and would be happy to offer discussion and advice to any project which needs to make the argument that their work constitutes research.

Starting digitisation

Posted on January 12, 2012 by Stuart Dunn

The DEEP project started on time in November last year. Our project plan has been finalised, and will shortly be available from our page on the JISC website.

As it promised to be, the Survey of English Place-names (SEPN) is a complex and fascinating document. Produced by the English Place-Name Society (EPNS), the SEPN is a true community effort. Its 86 volumes document the names of some 40 English counties, and have been compiled by different place-name scholars over the years. Thus, a succession of different people have moulded the text itself to fit and reflect England’s ancient and rich toponymic landscape.

While this provides an unrivalled resource for the place-name scholar, the historian, the geographer and the linguist, it also makes digitising it a challenge. Our aim is to put the forms into a structured gazetteer, but the structure varies from county to county. The basic hierarchy goes from large units, such as counties and hundreds, to smaller units, such as parishes, townships, settlements and minor names. Some conventions persist. Parish names appear as headings, for example, followed by townships and settlements, but there are inevitable exceptions, which makes tagging these sections of text complex – we do not wish to impose artificial structures on anomalous portions of text, since they will all be anomalous for a reason.
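To make the tagging problem concrete, here is a minimal sketch of how such a hierarchy might be held in code. It is not the project's actual data model: the class, the field names and the placeholder place names are all assumptions made for illustration. The point is that an entry which skips levels is recorded and reported rather than rejected.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Levels named in the post, largest to smallest. The tuple itself is an assumption.
LEVELS = ("county", "hundred", "parish", "township", "settlement", "minor name")


@dataclass
class Place:
    """One unit in the gazetteer tree (hypothetical structure)."""
    name: str
    level: str
    children: List["Place"] = field(default_factory=list)

    def add(self, child: "Place") -> "Place":
        """Attach a child unit and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Place"]]:
        """Yield (depth, place) pairs, parents before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def skipped_levels(root: Place) -> List[Tuple[str, str]]:
    """Report parent/child pairs that are not exactly one level apart."""
    found = []
    for _, place in root.walk():
        for child in place.children:
            gap = LEVELS.index(child.level) - LEVELS.index(place.level)
            if gap != 1:
                found.append((place.name, child.name))
    return found


# Placeholder names, not taken from the Survey.
county = Place("Example County", "county")
hundred = county.add(Place("Example Hundred", "hundred"))
parish = hundred.add(Place("Example Parish", "parish"))
parish.add(Place("Example Township", "township"))
parish.add(Place("Example Field", "minor name"))  # anomalous: skips two levels

for depth, place in county.walk():
    print("  " * depth + f"{place.level}: {place.name}")
print(skipped_levels(county))
```

Flagging the anomaly instead of forcing the minor name under an invented township keeps the text's own structure intact, and leaves the decision to an editor.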

OCRing the text is the responsibility of CDDA. This process has thrown up problems: in some cases, for example, matching Anglo-Saxon characters to their supported Unicode equivalents requires expert input from the team at Nottingham. Sometimes Anglo-Saxon characters are simply hard to read due to printing issues; sometimes the problem is that the assigned Unicode code points themselves need correcting. For example, a character initially assigned U+E624 was misread and reassigned U+01ED (?).
Cheshire is now completed, and work is underway on Shropshire.
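
As an illustration of the code-point correction mentioned above, here is a rough sketch of how such reassignments could be applied and checked. It is not CDDA's actual workflow. U+E624 lies in Unicode's private-use area, so it has no standard meaning, while U+01ED is a standard character (o with ogonek and macron). The correction table holds only the one example from this post, and the function names and sample string are made up.

```python
import unicodedata

# Hypothetical correction table, seeded with the single example from the post:
# the private-use code point U+E624 reassigned to U+01ED.
CORRECTIONS = {0xE624: 0x01ED}


def is_private_use(ch: str) -> bool:
    """True when the character is in a Unicode private-use area (category Co)."""
    return unicodedata.category(ch) == "Co"


def correct(text: str) -> str:
    """Apply the known reassignments and leave everything else untouched."""
    return text.translate(CORRECTIONS)


def unresolved(text: str):
    """List (offset, 'U+XXXX') for private-use characters still in the text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


sample = "st\ue624n \ue000"  # made-up string, not a form from the Survey
fixed = correct(sample)
print(unicodedata.name(fixed[2]))  # standard name of the reassigned character
print(unresolved(fixed))           # anything left here still needs expert review
```

Anything still reported by `unresolved` would be a candidate for the kind of expert input described above.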

Updated project description

Posted on December 6, 2011 by Stuart Dunn

Here is our updated description for the DEEP project:

Place-names are not static. They change and evolve over time, in response to the development of language, wars and conquests, shifting administrative boundaries, or simply the vagaries of spelling in the days before dictionaries and atlases. They have complex etymologies derived from different languages, and they mean different things to different communities. Therefore, historical documents and archives, ephemera and sources, contain different spellings (forms) of place-names, depending on their date and context. However – and despite the fact that we now take for granted the ability to search geographic data using web services such as Google Maps and GeoNames.org – there is no gazetteer documenting these historic name forms. Therefore, there is no means of linking or cross-searching the geographic references they contain. In summary, a search using a modern place-name will not currently return results for that name in all its many variant forms. This has resulted in a major underutilisation of electronic resources.

Digitisation, however, offers a solution. In England, the historical developments of place-names over time have been systematically surveyed since 1922 by the specialists of the English Place-Name Society (EPNS). Examining an extensive range of documentary sources in local and national archives, and gathering the knowledge of local communities and experts, the EPNS has built up an 86-volume county-by-county survey of England’s place-names – detailing over four million variant forms, from classical sources, through the Anglo-Saxon period and into medieval England and beyond to the modern period. JISC’s Digital Exposure of English Place-Names (DEEP) project will digitise all these forms, and make them available as structured data. The corpus will comprise a gazetteer within JISC’s Unlock service, meaning that researchers will be able to cross-query the dataset, and use it to search their own digital documents and databases for any historic place-name form. The gazetteer data will also be made available in structured XML, meaning that it will be possible to experiment with methods of data mining and visualisation that are not possible with the paper volumes. In addition to the digitisation, a network of experts will be convened to correct and enhance the dataset.
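
To give a flavour of what searching by historic form could look like, here is a toy sketch. It does not use the Unlock service or the project's XML. The functions are hypothetical, and the sample entry uses well-known earlier names for York purely as illustration, so the Survey's own forms and structure will differ.

```python
from collections import defaultdict
import unicodedata


def normalise(form: str) -> str:
    """Case-fold and strip diacritics so that lookups tolerate spelling noise."""
    decomposed = unicodedata.normalize("NFKD", form.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(gazetteer):
    """Map every normalised form, modern or historic, to its modern name(s)."""
    index = defaultdict(set)
    for modern, forms in gazetteer.items():
        for form in (modern, *forms):
            index[normalise(form)].add(modern)
    return index


def find_places(text, index):
    """Return the modern names for any single-word form found in the text."""
    hits = set()
    for token in text.split():
        hits |= index.get(normalise(token.strip(".,;:()")), set())
    return sorted(hits)


# Illustrative data only: modern name mapped to some earlier forms.
gazetteer = {"York": ["Eboracum", "Eoforwic", "Jórvík"]}
index = build_index(gazetteer)

print(find_places("The army wintered at Eoforwic.", index))
```

A real gazetteer would also need dates and sources for each form, and multi-word names, none of which this sketch attempts.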

The completed resource will provide a key piece of electronic infrastructure for the discovery, clustering, use and analysis of e-content referenced by place. It will also be an important resource for scholars of place-names, and scholars in cognate disciplines such as history, linguistics, archaeology, and historical geography.

DEEP has had a long gestation period, and as such it is a logical extension of existing work. Its context is the significant existing investment which JISC has made in various forms of gazetteers and geospatial web services such as GeoCrosswalk, GeoDigRef, and Unlock. Principally, it grew from the Connecting Historical Authorities with Linked data, Contexts and Entities (CHALICE) project, funded in 2010 under JISC’s Information Environment Programme, and led by EDINA. In this exemplar project, the current project team carried out a full pilot demonstrator. This exemplar digitised the place-names of Cheshire, and a sample of those of Shropshire; extracted place-name, attestation and chronological data from them using the Edinburgh geoparser; and generated a gazetteer of historic place-names to link documents and authority files in Linked Data form. This proved the concept that is being rolled out under DEEP but, as an exemplar, was constrained by limitations on time and resources. As a result, methodological challenges have been resolved and the team has a proven track record of working together.

EDINA Blogs

A Blogs.edina.ac.uk weblog

Author Archives: Stuart Dunn
