Digitisation Workflows and Tools

During the final stages of the County Surveys project, we are shifting our attention to the process of digitising some volumes. From the outset, a key aim of the project has been to scope the resources required to bring a full set of the county surveys together in a convenient digital format. The creation of the online bibliographic search tool was the first step towards this long-term aim, as it allows us to assess the potential of printed books for digitisation and the quality and access conditions of extant digitised copies. A second step was to carry out trial digitisations of a few key surveys. As we have noted elsewhere, this is a pilot, through which we will explore the potential requirements of a fuller, high-quality, full-text online collection.

Using the information surfaced through the bibliography, we identified a number of candidate surveys and discovered their locations. We found that one of the rarest volumes was held right here in Edinburgh, in the collection of the Royal Botanic Garden Edinburgh (RBGE), who kindly allowed us to work with their copy and use their state-of-the-art digitisation equipment. The work was carried out by Phil Mellor, currently a PhD student at the University of Strathclyde, who did a great job of thoughtfully documenting the requirements and exploring different options for carrying out each step of the process.

The workflow that Phil sketched out involved four stages: locate, capture, edit and OCR.

The first step was to locate the material: this involves finding an accessible volume (which may be easier said than done in some cases), assessing its binding, and checking for folded plates and other potential issues such as uncut pages. We chose two volumes: the General View of the Agriculture of the Shetland Islands (1814) by John Shirreff, which is a revised survey and has 228 pages; and the General View of the Agriculture in the Southern Districts of the County of Perth (1794) by James Robertson, a first series survey with 140 pages, bound with other volumes and featuring the long ‘s’ (ſ), a letterform historically used for ‘s’ at the start and in the middle of words.

The capture stage, which we were expecting to be the most time-consuming, turned out to be fairly quick. Using the RBGE equipment, two pages at a time can be photographed, and once the initial set-up is done the cradle and cameras remain in place: Phil was able to capture an entire volume in under an hour. During the capture stage, we had the choice of creating RAW or JPEG files. There is an advantage to using RAW, as having a lossless format from the outset enables editing from a high-quality original at any point in the future. However, we also found that the quality of the JPEGs produced was high enough for good OCR, and the JPEG files were easier to edit and quicker to upload and transfer.

The next stage of the workflow was editing. Once the page images had been saved, they were uploaded into the editing software supplied with the capture software, where the skew of the pages was corrected and the images cropped where necessary. Skewing is an inevitable result of capturing a bound book, in which the pages will always be presented at an angle that increases and decreases depending on the page at which the book is opened. This can be addressed during the capture process by changing the camera angles, or more quickly during the editing phase. Editing can be done on a page-by-page basis or on a chapter-by-chapter basis, and we found the latter to be sufficient for our needs. This stage could be quite time-consuming, depending on the quality of image required and the quality of the images captured, but Phil found that editing a whole volume on a chapter-by-chapter basis took around two hours.
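To make the difference between page-by-page and chapter-by-chapter editing concrete, here is a minimal sketch in Python of how one skew correction per chapter could be looked up for any page. It is not how the editing software we used works internally. The chapter boundaries and angles are invented for illustration, and the function name is hypothetical.

```python
from bisect import bisect_right

# Hypothetical values for illustration only: the first page of each chapter,
# and the single skew correction (in degrees) chosen for that chapter.
CHAPTER_STARTS = [1, 25, 80]
CHAPTER_ANGLES = [-1.5, -0.5, 0.75]


def skew_correction(page, starts=CHAPTER_STARTS, angles=CHAPTER_ANGLES):
    """Return the chapter-level skew correction that applies to a page."""
    if page < starts[0]:
        raise ValueError(f"page {page} comes before the first chapter")
    # bisect_right finds how many chapters start at or before this page;
    # the last of those is the chapter the page belongs to.
    return angles[bisect_right(starts, page) - 1]


if __name__ == "__main__":
    for page in (1, 24, 25, 228):
        print(page, skew_correction(page))
```

With a table like this, only one angle per chapter has to be judged by eye, which is why the chapter-level approach is so much quicker than setting a correction for every page.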

The final stage of the process was OCR-ing the images to produce text files. We tried a couple of different software packages, including the open-source program Tesseract and a proprietary package called ABBYY FineReader. Tesseract performed well in general, and we were able to produce searchable PDFs and text files quickly and easily. However, it struggled with the historical print and the many unique Scottish names found in the surveys. ABBYY handled these comparatively well, and also automatically formatted the text to mimic the page image, which saved a considerable amount of editing work. Overall, then, the digitisation process was quicker and smoother than we had envisaged, which bodes well for future projects to complete the collection.
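For readers who want to try the open-source route, the following is a minimal sketch in Python of a batch OCR step using the Tesseract command line. It is not the exact procedure Phil followed. The folder layout, the function names and the choice of the `eng` language model are assumptions. It also assumes a recent Tesseract release is on the PATH, where the `txt` and `pdf` config names ask for a text file and a searchable PDF. The optional clean-up replaces the long ‘s’ character (ſ, U+017F) with a modern ‘s’ wherever the OCR engine emits it. It will not repair cases where the long ‘s’ has been misread as an ‘f’.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path

LONG_S = "\u017f"  # the long s character, as found in the 1794 Perth survey


def tesseract_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build a Tesseract call that writes <out_base>.txt and <out_base>.pdf."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "txt", "pdf"]


def normalise_long_s(text: str) -> str:
    """Replace the long s with a modern s so that full-text search matches."""
    return text.replace(LONG_S, "s")


def ocr_volume(image_dir: Path, out_dir: Path, lang: str = "eng") -> None:
    """OCR every JPEG in image_dir (hypothetical layout: one file per capture)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(image_dir.glob("*.jpg")):
        out_base = out_dir / image.stem
        subprocess.run(tesseract_command(image, out_base, lang), check=True)
        # Tesseract appends .txt to the output base; tidy that file in place.
        text_file = out_dir / (image.stem + ".txt")
        cleaned = normalise_long_s(text_file.read_text(encoding="utf-8"))
        text_file.write_text(cleaned, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever the page images were saved.
    ocr_volume(Path("captures/perth_1794"), Path("ocr/perth_1794"))
```

Keeping the command builder separate from the function that runs it makes it easy to check what will be executed before committing to a whole volume.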

The above gave us a workflow which we would use as a template for any future digitisation. However, for comparison’s sake we digitised two further volumes with equipment kindly made available to us by Edinburgh University Library Centre for Research Collections: a different copy of the 1814 John Shirreff survey of the Shetland Islands and the survey of Ayrshire from 1811 by William Aiton. It might at first seem strange to do the Shetland survey again, but this allows us to make a direct comparison in terms of both process and quality of result. The conclusion, in terms of the process, was that the equipment at RBGE made the work considerably quicker and easier. Specifically, the most important characteristics of the scanner were:

  • The ability to capture two pages at once reduces the capture time considerably.
  • A cradle which does not require the book to lie flat is both quicker and probably safer when handling fragile and older books.
  • Integrated editing software, which allows cropping and skew corrections to be applied at the same time, makes the workflow smoother.

The comparison not only gave us some more material which we can make available online but also confirmed that the process we outline above is a good way to proceed.

A final note on the Ayrshire survey: when this book was retrieved from the archives, the pages were still uncut. The library staff were able to assist in opening the book up, but there is something here which gives pause for thought: this was the first time this book had been read, and it was being read by an optical scanner for online processing, a ‘reader’ unimaginable to those people whose effort went into producing it.

The digitised volumes can be found within our service and below:

General View of the Agriculture of the County of Ayr, by William Aiton, 1811

General View of the Agriculture in the Southern Districts of the County of Perth, by James Robertson, 1794

General View of the Agriculture of the Shetland Islands, by John Shirreff, 1814

 
