DRAFT – needs CeRch and CDDA detail, plus specific end user engagements, though we can go into the latter topic in later posts.
At this point we should talk a bit more about who is involved in CHALICE and what we’re hoping to gain from it.
The project is led by the EDINA National Datacentre at the University of Edinburgh. EDINA is almost entirely supported by JISC, and runs the flagship Digimap service, which gives UK higher and further education (HE/FE) access to national mapping data.
EDINA also maintains the Unlock service, which provides search across different placename gazetteers, and extraction of placenames from text using different gazetteers to “ground” references to place at definite locations. Unlock started life as the GeoCrossWalk project, and it was our involvement in the “Embedding GeoCrossWalk” project that sparked this interest in using text mining techniques to generate placename authority files from historic texts.
The Language Technology Group at the School of Informatics in Edinburgh were partners in this, and have moved on with us to CHALICE. They created the Edinburgh Geoparser that sits behind the Unlock Text web service. Their text mining magic goes much deeper than we have made use of so far: it can extract events and relations from text, as well as references to people and concepts.
CHALICE should be a fun challenge in an as yet under-explored research area of historic text mining – tuning grammar rules to do markup that can then be used to train machine learning recognisers, and comparing the results. Through their work with CDDA we hope to gain insight into the best balance between manual annotation and manually-corrected automatic annotation, in terms of cost of work, cost savings for others’ future work, and benefits of the different approaches to named entity recognition.
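To make the "comparing the results" step a little more concrete, here is a minimal sketch of scoring rule-based markup against manual annotation. It is not the Edinburgh Geoparser or anything CHALICE actually runs: the regular-expression rule, the function names and the sample sentence are all invented for illustration, and a real grammar would be far richer.

```python
import re

# Toy rule (an assumption for illustration only): a capitalised word directly
# after a locative preposition is marked up as a placename.
PLACE_RULE = re.compile(r"\b(?:in|at|near|of)\s+([A-Z][a-z]+)")


def rule_annotate(text):
    """Return the set of (start, end, surface) spans matched by the toy rule."""
    return {(m.start(1), m.end(1), m.group(1)) for m in PLACE_RULE.finditer(text)}


def score(predicted, gold):
    """Precision, recall and F1 of predicted spans against manual (gold) spans."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def span(text, surface):
    """Helper for writing gold annotations: locate the first occurrence of a string."""
    start = text.index(surface)
    return (start, start + len(surface), surface)


# Invented sample sentence; the gold set stands in for a manual annotation.
TEXT = "The manor of Wimbish lies near Thaxted, held in the time of Edward."
GOLD = {span(TEXT, "Wimbish"), span(TEXT, "Thaxted")}

PREDICTED = rule_annotate(TEXT)
PRECISION, RECALL, F1 = score(PREDICTED, GOLD)
print(f"precision={PRECISION:.2f} recall={RECALL:.2f} f1={F1:.2f}")
```

On this sentence the toy rule finds both places but also marks up "Edward", a person, so recall is perfect while precision suffers. That is the kind of error a human corrector would fix before the markup is used as training data, and the same scoring can then be applied to a trained recogniser to compare the approaches.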