The following post is by Elaine Yeates, project manager at the Centre for Data Digitisation and Analysis in Belfast. Elaine and her team have been responsible for taking scans of a selection of volumes of the English Place Name Survey and turning them into corrected OCR’d text, for later text mining to extract the data structures and republish them as Linked Data.
“I’ve worked up some figures based on an average character count from Cheshire, Buckinghamshire, Cambridgeshire and Derbyshire.
We had two levels of quality control:
1st QA, spelling and font: on completion of the OCR process, and based on 40 pages averaging 4,000 characters per page, there were 346 character errors (8.65 per page on average), an error rate of 0.22%.
1st QA, Unicode: on completion of the OCR process, and based on 40 pages averaging 4,000 characters per page, there were 235 character errors (5.87 per page on average), an error rate of 0.14%.
Total error rate at 1st QA: 0.36%.
2nd QA: this encompasses all of 1st QA. Based on 40 pages averaging 4,000 characters per page, there were 18 character errors (0.45 per page on average), an error rate of 0.01%.
Through the pilot we identified quite a few Unicode characters that are unique to this material. CDDA developed an in-house online Unicode database for analysts, where they can view and update the capture file and raise new codes when found. I think for a more substantial project we might direct our QA process through an online audit system, where we could identify issues with the material, its OCR, macros, and the 1st and 2nd stages of quality control.
We are pleased with these figures, and they look encouraging for a larger-scale project.”
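The figures quoted above appear to be percentages of the characters sampled (40 pages at roughly 4,000 characters each, so about 160,000 characters). As a minimal sketch for checking that arithmetic, the snippet below recomputes each rate; the function name and the dictionary labels are illustrative and not part of CDDA's tooling. Note that the Unicode figure works out to roughly 0.147%, which the email reports as 0.14.

```python
def error_rate_percent(errors: int, pages: int = 40, chars_per_page: int = 4000) -> float:
    """Character errors as a percentage of all characters sampled."""
    return 100 * errors / (pages * chars_per_page)


# Error counts as quoted in the email; the labels are ours.
stages = {
    "1st QA, spelling and font": 346,
    "1st QA, Unicode": 235,
    "2nd QA": 18,
}

for name, errors in stages.items():
    per_page = errors / 40
    print(f"{name}: {per_page:.3f} errors per page, {error_rate_percent(errors):.3f}%")

# Combined rate for the two 1st QA checks.
first_qa_total = error_rate_percent(346 + 235)
print(f"1st QA total: {first_qa_total:.3f}%")
```

Dividing by the sampled character count, rather than the page count, is what makes the three stages directly comparable.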
“Thanks for these. Our QA team are primarily looking for spelling errors; from your list, the few issues seem to be bold, spaces and small caps.
Of course, when tagging, especially automated tagging, you’re looking for certain patterns. Moving forward, however, I feel this error rate is very encouraging, and it helps our QA team to know what patterns might be searchable for future capture.
Looking at your issues so far on Part IV (5 issues e-mailed), against a total word count of 132,357, that is an error rate of 0.00003.”
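The Part IV figure is counted per word rather than per character, and it is stated as a fraction rather than a percentage, so it is not directly comparable with the character-level rates earlier in the post. A small sketch of the calculation, using a hypothetical helper name:

```python
def issue_rate(issues: int, words: int) -> float:
    """Reported issues as a fraction of the total word count."""
    return issues / words


issues, words = 5, 132_357
rate = issue_rate(issues, words)

print(f"fraction of words with a reported issue: {rate:.7f}")
print(f"as a percentage: {100 * rate:.4f}%")
print(f"roughly one issue per {words // issues:,} words")
```

The fraction comes to a little under 0.00004, which matches the quoted 0.00003 if the value is truncated rather than rounded.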
I am happy to have these numbers, since they let us observe the consistency of quality over iterations as means are found to work with more volumes of the EPNS.