The County Surveys Search Engine

One of our key aims in building the interface for our collection was to allow people to explore and “play with” the data. It’s hard to get a sense of the extent of the series and the relationships between the surveys without some kind of overview: once you can see the surveys all together and look at them in different ways, it’s much easier to grasp their logic. So we wanted a tool that would aggregate all of the information we have gathered and then allow people to look at that information in flexible ways, to filter and explore it according to their interests.

Flexibility was also a priority in technical terms: we’re making this data available for the first time in this format, so we are aware that we don’t really know what people will want to do with it. We don’t see the demonstrator as the last word but rather the first. Building on it, we can start to understand the data better and how people might want to access it. We expect to have to adapt the data and the ways of accessing it as we go along and learn what we can most usefully provide to the community.

The Data

The process of gathering data has been described in another post, but from the demonstrator’s point of view what was important was to keep things as general and adaptable as possible. Nevertheless, this kind of historical data presents certain peculiarities and challenges. One of the most obvious is how to present the survey data. The surveys are arranged by county, but the counties that were used are not the counties as they are today. Indeed, the counties used in the first and second phases of the county surveys are not the same. So we needed a mechanism which would allow people to make sense of the data without being restrictive. We’ve achieved this by providing a canonical list of counties taken from Ordnance Survey data from the early 19th century. We then map this to the actual counties as surveyed. There’s not a perfect match here, but we take a “permissive” view of the data – we’d rather show you slightly too much than too little. So the user is presented with the canonical list in the search facility and we then map that to the county data to decide what to show. The same holds for the author data: we hold a canonical list of authors and map these to the real authors. This allows us to adjust the data in future as we discover more about it.

The Data Model

This mapping then gives rise to the data model. We have surveys, each of which has a county associated with it. Then we have a list of counties which we present to the user, each of which may map to more than one of the underlying counties. That can get a bit confusing, but an example makes it clear. If we want to look at the surveys for Shetland, then in the filter list we have “Zetland or Shetland”, which is how it is listed by the Ordnance Survey. In the first phase of the surveys, Shetland was included under “Northern Counties and Islands”, but in the second phase it has a survey of its own. The implication for the data model is that we need a one-to-many mapping from entries in the search list to entries in the surveys. In fact, the same county survey might appear under more than one search term, e.g. the first phase “Northern Counties and Islands” needs to appear under Shetland, Orkney, Caithness and Sutherland. So we need a many-to-many mapping between the search counties in the interface and the counties as specified in the surveys themselves. To do this we adopt the standard database approach of having a mapping table, i.e. ccounty_county.

So ccounty is the list of counties as it appears in the search list, county is the list of counties as they appear in the surveys, and the mapping table allows us to relate the two in any way we want. Each survey can have many publications and each publication can be held in multiple places. This is why we have separated surveys, publications and holdings in the data model.
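To make the mapping concrete, here is a minimal sketch of the idea. It is written in Python with an in-memory SQLite database purely for illustration; it is not the demonstrator's Perl, Postgres and DBIx::Class code. Only the table names ccounty, county and ccounty_county come from the description above. The survey table, every column name, and the sample rows (including the surveyed county name "Shetland" and the search entry "Orkney") are assumptions, and the sample data is deliberately incomplete.

```python
import sqlite3

# Hypothetical, simplified schema. Only the table names ccounty, county and
# ccounty_county come from the post; columns and the survey table are assumed.
SCHEMA = """
CREATE TABLE ccounty (id INTEGER PRIMARY KEY, name TEXT NOT NULL);  -- search list
CREATE TABLE county  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);  -- as surveyed
CREATE TABLE ccounty_county (                                       -- mapping table
    ccounty_id INTEGER NOT NULL REFERENCES ccounty(id),
    county_id  INTEGER NOT NULL REFERENCES county(id),
    PRIMARY KEY (ccounty_id, county_id)
);
CREATE TABLE survey (
    id        INTEGER PRIMARY KEY,
    county_id INTEGER NOT NULL REFERENCES county(id),
    phase     INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding a tiny illustrative data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO ccounty VALUES (?, ?)",
        [(1, "Zetland or Shetland"), (2, "Orkney")],
    )
    conn.executemany(
        "INSERT INTO county VALUES (?, ?)",
        [(1, "Northern Counties and Islands"), (2, "Shetland")],
    )
    # Many-to-many: one search entry maps to several surveyed counties, and
    # one surveyed county appears under several search entries.
    conn.executemany(
        "INSERT INTO ccounty_county VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1)],
    )
    conn.executemany(
        "INSERT INTO survey VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 2, 2)],
    )
    return conn


def surveys_for(conn, search_name):
    """Return (surveyed county, phase) pairs for one search-list entry."""
    rows = conn.execute(
        """
        SELECT county.name, survey.phase
        FROM ccounty
        JOIN ccounty_county ON ccounty_county.ccounty_id = ccounty.id
        JOIN county ON county.id = ccounty_county.county_id
        JOIN survey ON survey.county_id = county.id
        WHERE ccounty.name = ?
        ORDER BY survey.phase
        """,
        (search_name,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for name, phase in surveys_for(demo, "Zetland or Shetland"):
        print(f"phase {phase}: {name}")
```

Because the lookup goes through the mapping table, changing which surveys appear under a search entry means editing rows in ccounty_county rather than touching the survey data itself.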

Figure: Database schema.

This model might seem a little complex but it gives us a great deal of flexibility in how we handle counties and authors and makes it fairly easy to add new information about publications and holdings as it becomes available to us.

The Technology Chosen

In line with the ethos of flexibility, we decided to work with standard technology components. At the back end is a relational database. Sitting on top of that is a Web Application built using a standard MVC framework. This approach has advantages in terms of the flexibility but also in terms of getting up and running quickly. The MVC approach (Model-View-Controller) separates out the storage of the data (the Model) from the logic of the application (the Controller) and how the data is displayed (the View). This means that changing one part of it has less impact because it is isolated from the other components. A good example of this flexibility is the change we made to the interface which was covered in a previous post.
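The separation described here can be sketched in a few lines. The example below is Python, not the Catalyst code the demonstrator uses, and every class and function name in it is hypothetical. It only aims to show why swapping the view leaves the model and controller untouched.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Survey:
    county: str
    phase: int


class SurveyModel:
    """Model: owns the data and knows how to query it."""

    def __init__(self, surveys):
        self._surveys = list(surveys)

    def by_phase(self, phase):
        return [s for s in self._surveys if s.phase == phase]


def render_text(surveys):
    """View: plain-text presentation of a list of surveys."""
    return "\n".join(f"{s.county} (phase {s.phase})" for s in surveys)


def render_html(surveys):
    """View: an alternative HTML presentation of the same data."""
    items = "".join(f"<li>{escape(s.county)}</li>" for s in surveys)
    return f"<ul>{items}</ul>"


class SurveyController:
    """Controller: takes a request, asks the model, hands the result to a view."""

    def __init__(self, model, view):
        self._model = model
        self._view = view

    def list_phase(self, phase):
        return self._view(self._model.by_phase(phase))


if __name__ == "__main__":
    model = SurveyModel(
        [Survey("Northern Counties and Islands", 1), Survey("Shetland", 2)]
    )
    print(SurveyController(model, render_text).list_phase(2))
    print(SurveyController(model, render_html).list_phase(2))
```

Changing the interface here means passing a different view function to the controller; the model and the query logic stay as they are.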

The MVC approach is one of the standard development techniques for web applications these days, and when it comes to implementing it you have a wide choice of languages and MVC frameworks. In our case, it’s all written in Perl, using Postgres for the database with a Catalyst application on top. The application takes the standard Catalyst approach of using DBIx::Class to implement the Model and interface to the database, and Template Toolkit for the front end. The choice of specific MVC implementation doesn’t matter so much – there are plenty to choose from! It’s really the flexibility this approach gives which is the main thing. Using standard technologies lets us make the data available quickly and adapt easily to whatever changes come out of that down the line.

Evolution by Use

So this demonstrator gives people access to the data. We’re hoping people will find it helpful in “playing with” the data. But it’s very much the first draft. We expect it to evolve over time as we and anyone else interested in the surveys get to know the data better and start to understand more about how to make it available to people.

Shetland

1893 map of Shetland, from Cassell’s Gazetteer of Great Britain and Ireland; Published by Cassell and Company Limited, London.

One of the aims of our current project is to establish the cost and workflow requirements for creating a complete virtual collection of the County Surveys. Many of the surveys are already available in various online archives, but discovering them is not as easy as it could be, and quality and accessibility remain quite variable. In the longer term, we hope to aggregate high-quality full-text files that we can use for research-led text mining. In order to establish the projected costs and labour involved in such a project, as part of the pilot we plan to identify one or two rare surveys and digitise them according to current best practices, documenting this process for ourselves and others. Clearly, as funds are limited, it makes sense to focus on volumes that are not already available in digital form and which are rare even in print.

One such candidate is the General view of the Agriculture of the Shetland Islands by John Shirreff, which was published in Edinburgh by Constable & Co in 1814. According to one early 19th-century reviewer, this volume was of special interest to contemporary readers, for it describes “a remote part of the British dominions, with which many readers are perhaps as little acquainted as with the Islands in the South Sea; and they exhibit a state of Society very different in several respects from that which prevails in the other provinces of Britain.” Indeed, comparing Orkney and Shetland to the wilds of the American frontier, he suggests the inhabitants of these northern islands belong to a different, less civilised time and “bring into view a stage in the progress of improvement at which the inhabitants of the South has arrived some centuries ago, and which had been long since passed over by the people of almost every other part of the Island.” (The Farmer’s Magazine 15 (Aug 1814): 343) The exoticism, snobbery and geo-political bias of these remarks seem almost comical today, but they suggest that the contents of the Shetland survey may be of particular importance to historians, given the apparently substantial differences from more ‘advanced’ mainland practices. Happily we will all be able to judge for ourselves soon, because a print copy of the Shetland survey is held here in Edinburgh at the Royal Botanic Gardens and they have kindly agreed to allow its digitisation: we’ll post about this process once it gets underway.

PhoneBooth Project Plan

The main goal of PhoneBooth is to repackage the Charles Booth Maps, Descriptive of London Poverty and selected notebooks from contemporary police observations for delivery to mobile devices. These materials are already available digitally through the Charles Booth Online Archive.

We will pilot the use of the mobilised maps and notebooks in a taught undergraduate course at LSE: London’s Geographies.

Project Outputs

The main objectives of the project are:

  • To enhance the existing Booth data to enable mobile delivery
  • To produce a model and technical capacity for the mobile delivery of Library-owned content
  • To engage with LSE academics and students involved in the London’s Geographies course to inform the development of mobile Library content
  • To evaluate the impact of mobilised content on teaching
  • To enhance the student experience of the course
  • To facilitate knowledge transfer within the professional community

The outputs from the project will be:

  • User/functional requirements for mobile content delivery
  • A revised course syllabus to include assessment of students’ use of the mobile content
  • Booth maps and notebooks in georeferenced preservation and delivery formats
  • Fedora/Hydra content models for geodata
  • Ingest of the Booth maps/notebooks into LSE Digital Library and an API for spatial query
  • A prototype web application for the delivery of Booth maps and notebooks to mobile devices
  • Knowledge transfer between EDINA and LSE
  • Report on the development of mobile content and its impact on teaching

Success measures:

  • A working prototype mobile web application that is used successfully by this year’s student cohort taking London’s Geographies
  • Increased knowledge of mobile and spatial technologies in the LSE Digital Library team
  • A positive impact on the teaching methodology of London’s Geographies and a demonstrated engagement between the Library and academic community

Project Team

The project partners are LSE and EDINA. LSE own the collection and are responsible for gathering user requirements and project management/direction. EDINA are responsible for developing the technical prototype and sharing knowledge of the implementation with the LSE Digital Library team.

Project board (LSE)

  • Nicola Wright (deputy director of library service)
  • Sue Donnelly (head of archives)
  • Sharad Chari (lecturer, London’s geographies)

LSE staff

  • Ed Fay [@digitalfay] (digital library manager) – project management
  • Andrea Gibbons [@changita] (GTA, London’s Geographies) – gathering requirements from students and co-ordinating the piloting work
  • Andrew Amato (digital library developer) – ingesting the content into our digital library infrastructure and providing it out through APIs
  • Peter Spring (metadata technical officer, LSE) – providing support for the metadata aspects of the project

EDINA staff

  • James Reid [@sixfootdestiny] (geoservices manager) – project co-ordination and technical architecture
  • George Hamilton (software engineer) – technical prototyping of the application
  • John Pinto (software engineer) – technical prototyping of the application
  • Lasma Sietinsone (gi analyst) – providing support for the geographic aspects of the project

Timeline and work packages

Each work package is listed with its owner and timescale, followed by its deliverables.

0 – Project management (owner: LSE; November 2011 – July 2012)
  • Project initiation (project plan and budget)
  • Project communications (blog, Twitter)
  • Reporting (JISC, LSE Digital Library Steering Board)

1 – User requirements analysis (owner: LSE; November 2011 – December 2011)
  • Report on user requirements
  • Revised course syllabus
  • Functional requirements

2 – Data preparation (owner: EDINA; January 2012 – March 2012)
  • Booth maps and notebooks in georeferenced preservation and delivery formats
  • Knowledge transfer to LSE

3 – Development: Digital Library (owner: LSE; February 2012 – April 2012)
  • Fedora/Hydra content models for geodata
  • Ingest of Booth content into LSE Digital Library
  • API for spatial query of Booth content

4 – Development: delivery prototype (owner: EDINA; February 2012 – June 2012)
  • Prototype web application (initial version delivered April 2012 for testing, refinements and knowledge transfer to complete by June 2012)
  • Knowledge transfer to LSE

5 – Piloting (owner: LSE; April 2012 – May 2012)
  • Use of the prototype in the 2011/12 ‘London’s Geographies’ course
  • Report on findings: pedagogical impact
  • Refinements to the course syllabus for 2012/13

6 – Reporting on findings (owner: LSE; May 2012 – July 2012)
  • Case study on the development of mobile content and its impact on teaching

Risk analysis

Each risk is scored for probability (1-5) and severity (1-5); impact is probability x severity. The planned actions follow each risk.

Project staff become unavailable (probability 3, severity 2, impact 6)
  • Project resources will be used to replace project team members. In all cases, the organisational context is such that project-specific knowledge and support can be provided by colleagues.
  • All project staff are existing, permanent staff of their respective institutions: the project does not require any additional recruitment.
  • The offer of extra hours to part-time staff to fill the project assistant post is an approach that has been used successfully at LSE on a number of digitisation projects.

Mobile prototype is not useful in course context (probability 4, severity 1, impact 4)
  • A thorough understanding of general user needs and requirements as well as those specific to the aims of the taught course will be the basis for development.
  • The Booth content is already used in the course and is suited to the course syllabus.

Booth content does not lend itself to geographic discovery or mobile delivery (probability 5, severity 1, impact 5)
  • Booth content is already provided for geo-discovery – giving high confidence that the remaining content can be delivered successfully. Comparison will be sought with other projects which have successfully delivered large format map and multi-page textual material to mobile devices.

The taught course does not run after the 2011/12 academic year (probability 1, severity 2, impact 2)
  • The prototype will be designed for a general audience as well as the specific needs of the taught course, making PhoneBooth available to existing, mature audiences of the Charles Booth Online Archive.
  • The prototype and service model will be designed to be useful for other content and delivery contexts, maximising reuse potential for other taught courses and collections.

The technical prototype and mobile access points are not maintained following project closure (probability 1, severity 3, impact 3)
  • The content and applications will be embedded into the infrastructure of the LSE Digital Library which is a major, strategic innovation and has a commitment of permanent, technical resource for its maintenance and development.

Smartphone ownership in the pilot cohort is not sufficient for use of content in teaching (probability 4, severity 2, impact 8)
  • The LSE Library Student Survey suggests that smartphone ownership will be sufficient for pairs of students to work together in the field – an approach acceptable to the course lecturer.
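
The impact score is simply probability multiplied by severity. As a small worked illustration, the Python sketch below recomputes the impact for each risk from the scores listed above and ranks the risks from highest to lowest. The function and variable names are hypothetical, and ranking by impact is our own illustration rather than something the plan itself prescribes.

```python
# Risk name, probability (1-5) and severity (1-5), as scored in the plan.
RISKS = [
    ("Project staff become unavailable", 3, 2),
    ("Mobile prototype is not useful in course context", 4, 1),
    ("Booth content does not lend itself to geographic discovery or mobile delivery", 5, 1),
    ("The taught course does not run after the 2011/12 academic year", 1, 2),
    ("The technical prototype and mobile access points are not maintained following project closure", 1, 3),
    ("Smartphone ownership in the pilot cohort is not sufficient for use of content in teaching", 4, 2),
]


def impact(probability, severity):
    """Impact as defined in the plan: probability x severity."""
    return probability * severity


def ranked(risks):
    """Return (risk, impact) pairs sorted from highest to lowest impact."""
    scored = [(name, impact(p, s)) for name, p, s in risks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in ranked(RISKS):
        print(f"{score:>2}  {name}")
```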

Budget

The total project budget is £85,659 (£43,206 from JISC). Most of the LSE staff time and nearly all of the indirect and estates costs are institutional contributions.
