Official Launch of Trading Consequences!

Today we are delighted to officially announce the launch of Trading Consequences!

Over the last two years the project team has been hard at work using text mining, traditional and innovative historical research methods, and visualization techniques to turn digitized nineteenth-century papers and trading records (and their OCR’d text) into a unique database of commodities, together with engaging visualization and search interfaces for exploring that data.

Today we launch the database, searches and visualization tools alongside the Trading Consequences White Paper, which charts our work on the project, including our technical approaches, some of the challenges we faced, and what we achieved over the course of the project. The White Paper also discusses, in detail, how we built the tools we are launching today, and is therefore an essential point of reference for those wanting to better understand how data is presented in our interfaces, how these interfaces came to be, and how you might best use and interpret the data shared in these resources in your own historical research.

Find the Trading Consequences searches, visualizations and code via the panel on the top right hand side of the project website (outlined in orange).

There are four ways to explore the Trading Consequences database:

  1. Commodity Search. This searches the database table of unique commodities for commodities beginning with the search term entered. The returned list is sorted by two criteria: (1) whether the commodity is a “commodity concept” (where any one of several names known to be used for the same commodity returns aggregated data for that commodity), and then (2) alphabetically. A minimal sketch of this kind of prefix query appears after this list. Read more here.
  2. Location Search. This searches the database table of unique locations for locations beginning with the search term entered. The returned list of locations is sorted by how frequently each location is mentioned within the historical documents. Selecting a location displays: information about the location, such as which country it is in and its population; a map highlighting the location with a marker; and a list of historical documents with an indication of how many times the selected location is mentioned in each one. Read more here.
  3. Location Cloud Visualization. This shows the relationship between a selected commodity and its related locations. The visualization is based on over 170,000 documents from digital historical archives (see the list of archives below). Its purpose is to provide a general overview of how the importance of location mentions in relation to a particular commodity changed between 1800 and 1920. Read more here.
  4. Interlinked Visualization. This provides a general overview of how commodities were discussed between 1750 and 1950 along geographic and temporal dimensions. It gives an overview of commodity and location mentions extracted from 179,000 historical documents (drawn from the digital archives listed below). Read more here.
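
For readers curious about what the Commodity Search does under the hood, here is a minimal sketch of the kind of prefix query and ordering described in (1) above. The table layout, column names and example rows are illustrative assumptions for this post, not the project’s actual database schema.

```python
# Illustrative sketch only: a tiny in-memory table standing in for the
# commodities table, searched by prefix and ordered with "commodity concepts"
# first, then alphabetically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commodities (name TEXT, is_concept INTEGER)")
conn.executemany(
    "INSERT INTO commodities VALUES (?, ?)",
    [("cinchona", 1), ("cinchona bark", 0), ("cinnamon", 1)],
)

def commodity_search(term):
    # Commodities beginning with the search term; concept entries sort first.
    return conn.execute(
        "SELECT name, is_concept FROM commodities "
        "WHERE name LIKE ? || '%' "
        "ORDER BY is_concept DESC, name ASC",
        (term.lower(),),
    ).fetchall()

print(commodity_search("cin"))
```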

Please do try out these tools (please note that the two visualizations will only work with newer versions of the Chrome browser) and let us know what you think – we would love to know what other information or support might be useful, what feedback you have for the project team, and how you think you might be able to use these tools in your own research.

Start page of the Interlinked Visualization.

We are also very pleased to announce that we are sharing some of the code and resources behind Trading Consequences via GitHub. This includes a range of Lexical Resources that we think historians, and those undertaking historical text mining in related areas, may find particularly useful: the base lexicon of commodities created by hand for this project; the Trading Consequences SKOS ontology; and an aggregated gazetteer of ports and cities with ports.

Bea Alex shares text mining progress with the team at an early Trading Consequences meeting.

Acknowledgements

The Trading Consequences team would like to acknowledge and thank the project partners, funders and data providers that have made this work possible. We would particularly like to thank the Digging Into Data Challenge, and the international partners and funders of DiD, for making this fun, challenging and highly collaborative transatlantic project possible. We have hugely enjoyed working together and we have learned a great deal from the interdisciplinary and international exchanges that have been so central to this project.

We would also like to extend our thanks to all of those who have supported the project over the last few years with help, advice, opportunities to present and share our work, publicity for events and blog posts. Most of all we would like to thank all of those members of the historical research community who generously gave their time and perspectives to our historians, to our text mining experts, and particularly to our visualization experts to help us ensure that what we have created in this project meets genuine research needs and may have application in a range of historical research contexts.

The Trading Consequences project team at our original kick-off meeting.

What next?
Trading Consequences does not come to an end with this launch. Now that the search and visualization tools are live – and open for anyone to use freely on the web – our historians Professor Colin Coates (York University, Canada) and Dr Jim Clifford (University of Saskatchewan) will be continuing their research. We will continue to share their findings on historical trading patterns, and environmental history, via the Trading Consequences blog.

Over the coming months we will be continuing to update our publications page with the latest research and dissemination associated with the project, and we will also be sharing additional resources associated with the project via GitHub, so please do continue to keep an eye on this website for key updates and links to resources.

We value and welcome your feedback on the visualizations, search interfaces, the database, or any other aspect of the project, website or White Paper at any point. Indeed, if you do find Trading Consequences useful in your own research we would particularly encourage you to get in touch with us (via the comments here, or via Twitter) and consider writing a guest post for the blog. We also welcome mentions of the project or website in your own publications and we are happy to help you to publicize these.

Testing and feedback at CHESS’13.

Explore Trading Consequences

Text Mining 19th Century Place Names

By Jim Clifford

Nineteenth-century place names are a major challenge for the Trading Consequences project. The Edinburgh Geoparser uses the Geonames Gazetteer to supply crucial geographic information, including the place names themselves, their longitudes and latitudes, and population data that helps the algorithms determine which “Toronto” is most likely mentioned in the text (there are a lot of Torontos). Based on the first results from our tests, the Geoparser using Geonames works remarkably well. However, it often fails for historic place names that are not in the Geonames Gazetteer. Where is “Lower Canada” or the “Republic of New Granada”? What about all of the colonies created during the Scramble for Africa, but renamed after decolonization? Some of these terms, such as Ceylon, are in Geonames, while others, such as the Oil Rivers Protectorate, are not. Geonames also lacks many of the regional terms often used in historical documents, such as “West Africa” or “Western Canada”.

To help reduce the number of missed place names and errors in our text-mined results, we asked David Zylberberg, who did great work annotating our test samples, to help us solve many of the problems he identified. A draft of his new Gazetteer of missing 19th-century place names is displayed above. Some of these are place names David found in the 150-page test sample that the prototype system missed. This includes some common OCR errors and a few longer forms of place names that are found in Geonames, which don’t fit neatly within the 19th-century place name gazetteer but will still be helpful for our project. He also expanded beyond the place names he found in the annotation by identifying trends. Because our project focuses on commodities in the 19th-century British world, he worked to identify abandoned mining towns in Canada and Australia. He also did a lot of work identifying key place names in Africa, as he noticed that the system seemed to work a lot better in South Asia than it did in Africa. Finally, he worked on Eastern Europe, where many German place names changed in the aftermath of the Second World War. Unfortunately, some of these locations were alternate names in Geonames, and by changing the geoparser settings we solved this problem, making David’s work on Eastern Europe and a few other locations redundant. Nonetheless, we now have the beginnings of a database of place names and region names missing from the standard gazetteers; we plan to publish this database in the near future and invite others to use and add to it. This work is at an early stage, so we’d be very interested to hear from others about how they’ve dealt with similar issues related to text-mining historical documents.
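
To give a concrete (if toy) picture of how such a supplementary gazetteer can be used, here is a small sketch in which a hand-built table of 19th-century names is consulted whenever the standard Geonames-style lookup comes back empty. The entries, coordinates and lookup function below are illustrative assumptions, not David’s actual data or the Edinburgh Geoparser’s internals.

```python
# Illustrative sketch: fall back to a hand-built gazetteer of 19th-century
# place names when a name is missing from the standard (Geonames-style) lookup.
# The entries and coordinates below are examples only, not project data.
historical_gazetteer = {
    "lower canada": {"lat": 46.8, "lon": -71.2, "modern": "Quebec, Canada"},
    "republic of new granada": {"lat": 4.6, "lon": -74.1, "modern": "Colombia"},
    "west africa": {"lat": 10.0, "lon": 0.0, "modern": "West Africa (region)"},
}

def resolve_place(name, geonames_lookup):
    """Try the standard gazetteer first, then the historical supplement."""
    record = geonames_lookup(name)
    if record is not None:
        return record
    return historical_gazetteer.get(name.lower())

# Example with a stand-in for the Geonames lookup that finds nothing:
print(resolve_place("Lower Canada", lambda n: None))
```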

Invited talk on Digital History and Big Data

Last week I was invited to give a talk about Trading Consequences at the Digital Scholarship: day of ideas event 2 organised by Dr. Siân Bayne. If you are interested in my slides, you can look at them here on Slideshare.

Rather than give a summary talk about all the different things going on in the Edinburgh Language Technology Group at the School of Informatics, we decided that it would be more informative to focus on one specific project and provide a bit more detail without getting too technical. My aim was to raise our profile with attendees from the humanities and social sciences in Edinburgh and further afield who are interested in digital humanities research. They made up the majority of the audience, so this talk was a great opportunity.

My presentation on Trading Consequences at the Digital Scholarship workshop (photo taken by Ewan Klein).

Most of my previous presentations have been directed at people in my field, that is, experts in text mining and information extraction. So this talk would have to be completely different from how I normally present my work, which is to provide detailed information on methods and algorithms, their scientific evaluation, and so on. None of the attendees would be interested in such things, but I wanted them to know what sort of things our technology is capable of and at the same time let them understand some of the challenges we face.

I decided to focus the talk on the user-centric approach to our collaboration in Trading Consequences, explaining that our current users and collaborators (Prof. Colin Coates and Dr. Jim Clifford, environmental historians at York University, Toronto) and their research questions are key in all that we design and develop. Their comments and error analysis feed directly back into the technology, allowing us to improve the text mining and visualisation with every iteration. The other point I wanted to get across is that transparency about the quality of the text mining is crucial to our users, who want to know to what level they can trust the technology. Moreover, the output of our text mining tool in its raw XML format is not something that most historians would be able to understand and query easily. However, when text mining is combined with interesting types of visualisations, the data mined from all the historical document collections comes alive.

We are currently processing digitised versions of over 10 million scanned document images from 5 different collections, amounting to several hundred gigabytes of information. This is not big data in the computer science sense, where people talk about terabytes or petabytes. However, it is big data to historians, who in the best case have access to some of these collections online using keyword search but often have to visit libraries and archives and go through them manually. Even if a collection is available digitally and indexed, it does not mean that all the information relevant to a search term is easily accessible to users. In a large proportion of our data, the optical character recognised (OCRed) text contains a lot of errors and, unless corrected, those errors then find their way into the index. This means that searches for correctly spelled terms will not return any matches in sources which mention them with one or more errors.

The low text quality in large parts of our text collections is also one of our main challenges when it comes to mining this data. So, I summarised the types of text correction and normalisation steps we carry out in order to improve the input for our text mining component. However, there are cases when even we give up, that is, when the text quality is just so low that it is impossible even for a human being to read a document. I showed a real example of one of the documents in the collections, the textual equivalent of an upside-down image which was OCRed the wrong way round.
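
As a rough illustration of what such correction and normalisation steps can look like, here is a toy sketch that applies a handful of substitution rules to OCRed text. The rules themselves are invented examples of common OCR confusions, not the actual rule set used in the project.

```python
import re

# Toy sketch of OCR correction/normalisation: each rule fixes one kind of
# common OCR confusion. These rules are illustrative examples only.
OCR_FIXES = [
    (r"ſ", "s"),                        # long s misread
    (r"\bfhip\b", "ship"),              # long s rendered as f
    (r"\bcommoclity\b", "commodity"),   # "d" misread as "cl"
    (r"-\n", ""),                       # rejoin words hyphenated across lines
]

def normalise(text):
    for pattern, replacement in OCR_FIXES:
        text = re.sub(pattern, replacement, text)
    return text

print(normalise("The fhip carried a valuable commoclity acroſs the Atlantic."))
```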

At the end, I got the sense that my talk was well received. I got several interesting questions, including one asking whether we see that our users’ research questions are now shaped by the technology, when the initial idea was for the technology to be driven by their research. I also made some connections with people in literature, so there could be some exciting new collaborations on the horizon. Overall, the workshop was extremely interesting and very well organised, and I’m glad that I had the opportunity to present our work.

Putting it all together: first attempt

Within Trading Consequences, our intention is to develop a series of prototypes for the overall system. Initially, these will have limited functionality, but will then become increasingly powerful. We have just reached the point of building our first such prototype. Here’s a picture of the overall architecture:

Text Mining

Our project team has delivered the first prototype of the Trading Consequences system. The system takes in documents from a number of different collections. The Text Mining component consists of an initial preprocessing stage which converts each document to a consistent XML format. Depending on the corpus that we’re processing, a language identification step may be performed to ensure that the current document is in English. (We plan to also look at French documents later in the project.) The OCR-ed text is then automatically improved by correcting and normalising a number of issues.
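
As a rough sketch of what this preprocessing stage might look like, the snippet below wraps a raw OCRed document in a simple XML format and keeps only English text. The XML layout and the use of the langdetect package are illustrative stand-ins, not the actual pipeline components.

```python
# Sketch of a preprocessing stage: wrap a raw OCRed document in a simple,
# consistent XML format and keep only English documents for now. The XML
# layout and the langdetect package are illustrative choices only.
import xml.etree.ElementTree as ET
from langdetect import detect  # pip install langdetect

def preprocess(doc_id, raw_text):
    if detect(raw_text) != "en":
        return None  # skip non-English documents for now
    root = ET.Element("document", id=doc_id)
    body = ET.SubElement(root, "text")
    body.text = raw_text
    return ET.tostring(root, encoding="unicode")

print(preprocess("col1-0001", "Gutta-percha was shipped from Singapore to London."))
```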

The main processing of the TM component involves various types of shallow linguistic analysis of the text, lexicon and gazetteer lookup, named entity recognition and grounding, and relation extraction. We determine which commodities were traded when and in relation to which locations. We also determine whether locations are mentioned as points of origin, transit or destination and whether vocabulary relating to diseases and disasters appears in the text. All additional information which we mine from the text is added back into the XML document as different types of annotation.
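
The following toy sketch shows one way mined entities could be written back into a document’s XML as annotations. The element and attribute names are assumptions for illustration, not the project’s actual annotation scheme.

```python
# Sketch of writing mined entities back into the document XML as annotations.
# Element and attribute names are assumptions for illustration.
import xml.etree.ElementTree as ET

def add_entity_annotations(doc_xml, entities):
    root = ET.fromstring(doc_xml)
    standoff = ET.SubElement(root, "entities")
    for ent in entities:
        ET.SubElement(
            standoff, "entity",
            type=ent["type"], start=str(ent["start"]),
            end=str(ent["end"]), text=ent["text"],
        )
    return ET.tostring(root, encoding="unicode")

doc = '<document id="col1-0001"><text>Gutta-percha shipped from Singapore.</text></document>'
ents = [
    {"type": "commodity", "start": 0, "end": 12, "text": "Gutta-percha"},
    {"type": "location", "start": 26, "end": 35, "text": "Singapore"},
]
print(add_entity_annotations(doc, ents))
```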

Populating the Commodities Database

The entire annotated XML corpus is parsed to create a relational database (RDB). This stores not just metadata about the individual document, but also detailed information that results from the text mining, such as named entities, relations, and how these are expressed in the relevant document.
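
For illustration, here is a minimal sketch of the kind of relational schema such a parse might populate. The table and column names are invented for this post and are not the actual Trading Consequences schema.

```python
# Minimal sketch of a relational schema for the mined data. Table and column
# names are illustrative assumptions, not the project's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, collection TEXT, year INTEGER);
CREATE TABLE mentions (
    doc_id TEXT REFERENCES documents(id),
    entity_type TEXT,      -- 'commodity' or 'location'
    surface_form TEXT,     -- how the entity appears in the document
    normalised TEXT        -- grounded form, e.g. a lexicon or gazetteer id
);
CREATE TABLE relations (
    doc_id TEXT REFERENCES documents(id),
    commodity TEXT, location TEXT, role TEXT, year INTEGER  -- role: origin/transit/destination
);
""")
conn.execute("INSERT INTO documents VALUES ('col1-0001', 'demo', 1862)")
conn.execute("INSERT INTO relations VALUES ('col1-0001', 'gutta-percha', 'Singapore', 'origin', 1862)")
```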

Visualisation

Both the visualisation and the query interface access the database, so that users can either search the collections directly through textual queries or browse the data in a more exploratory manner through the visualisations. For the prototype, we have created a static web-based visualization that represents a subset of the data taken from the database. This visualization sketch is based on a map that shows the location of commodity mentions by country. We are currently setting up the query interface and are busy working on dynamic visualisations of the mined information.
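
As a hint of how a map-based view can be driven from such a database, here is a small standalone sketch of an aggregate query counting commodity mentions per country. The schema and data are, again, illustrative only.

```python
import sqlite3

# Standalone sketch: the kind of aggregate query a "commodity mentions by
# country" map might issue. Schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commodity_mentions (commodity TEXT, country TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO commodity_mentions VALUES (?, ?, ?)",
    [("gutta-percha", "Singapore", 1862),
     ("gutta-percha", "United Kingdom", 1865),
     ("gutta-percha", "Singapore", 1870)],
)
rows = conn.execute("""
    SELECT country, COUNT(*) AS mentions
    FROM commodity_mentions
    WHERE commodity = 'gutta-percha'
    GROUP BY country
    ORDER BY mentions DESC
""").fetchall()
print(rows)
```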

“Weevils”, “Vapours” and “Silver oics”: Finding Commodity Terms

One of the core tasks in Trading Consequences is being able to identify words in digitised texts which refer to commodities (as well as words which refer to places). Here’s a snippet of the kind of text we might be trying to analyse:

How do we know that gutta-percha in this text is a commodity name but, say, electricity is not? The simplest approach, and the one that we are adopting, is to use a big list of terms that we think could be names of commodities, and check against this list when we process our input texts. If we find gutta-percha in both our list of commodity terms and in the document that is being processed, then we add an annotation to the document that labels gutta-percha as a commodity name.
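
Here is a toy sketch of that lookup approach: scan the text for candidate terms and record a commodity annotation whenever a term appears in the list. The tiny term list and the annotation format are just for illustration.

```python
# Toy sketch of lexicon lookup: check each token-like span against a list of
# known commodity terms and record an annotation when it matches. The term
# list here is a tiny illustrative example.
import re

COMMODITY_TERMS = {"gutta-percha", "tin", "coal", "fur", "ice"}

def annotate_commodities(text):
    annotations = []
    for match in re.finditer(r"[A-Za-z][A-Za-z-]*", text):
        if match.group().lower() in COMMODITY_TERMS:
            annotations.append((match.start(), match.end(), match.group(), "commodity"))
    return annotations

print(annotate_commodities("Gutta-percha insulated the electricity cables."))
# 'Gutta-percha' is annotated; 'electricity' is not in the term list.
```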

In our first version of the text mining system, we derived the list of commodity terms from WordNet. WordNet is a big thesaurus or lexical database, and its terms are organised hierarchically. This means that, as a first approximation, we can guess that any lexical item in WordNet that is categorised as a subclass of Physical Matter, Plant Life, or Animal might be a commodity term. How well do we do with this? Not surprisingly, when we carried out some initial experiments at the very start of our work on the project, we found that there are some winners and some losers. Here are some of the terms that were plausibly labeled as commodities in a sample corpus of digitised text:

horse, tin, coal, seedlings, grains, crab, merino fleece, fur, cod-liver oil, ice, log, potatoes, liquor, lemons.

And here are some less plausible candidate commodity terms:

weevil, water frontage, vomit, vienna dejeuner, verde-antique, vapours, toucans, steam frigates, smut, simple question, silver oics.
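
For anyone wanting to experiment with a similar approach, here is a sketch of how a candidate list can be derived from WordNet hyponyms using NLTK. The choice of top-level synsets is an illustrative approximation of the Physical Matter, Plant Life and Animal categories mentioned above, not our exact configuration.

```python
# Sketch of deriving candidate commodity terms from WordNet hyponyms with NLTK.
# The top-level synsets below are illustrative approximations of the
# matter / plant / animal categories, not the project's exact setup.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def hyponym_lemmas(synset_name):
    root = wn.synset(synset_name)
    lemmas = set()
    for synset in root.closure(lambda s: s.hyponyms()):
        for lemma in synset.lemma_names():
            lemmas.add(lemma.replace("_", " ").lower())
    return lemmas

candidates = set()
for name in ("matter.n.03", "plant.n.02", "animal.n.01"):
    candidates |= hyponym_lemmas(name)

print(len(candidates), "candidate commodity terms")
print(sorted(candidates)[:10])
```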

There are a number of factors that conspire to produce these incorrect results. The first is that our list of terms is just too broad, and includes things that could never be commodities. The second is that, for now, we are not taking into account the context in which words occur in the text — this is computationally quite expensive, and not an immediate priority. The third is that the input to our text mining tools is not nice clean text such as we would get from ‘born-digital’ newswire. Instead, nineteenth-century books have been scanned and then turned into text by the process of Optical Character Recognition (OCR for short). As we’ll describe in future posts, OCR can sometimes produce bizarrely bad results, and this is probably responsible for our silver oics.

At the moment, we are working on generating a better list of commodity terms (as mentioned in a recent post by Jim Clifford). We’ll report back on progress soon.

Unlock Text: new API

With apologies for the long hiatus on the Unlock blog – we have been busy developing the service behind the scenes. The Unlock team are now pleased to present a new API for Unlock Text, the geoparsing (place name text mining and mapping) service.

Read the Getting Started guide: http://unlock.edina.ac.uk/texts/getstarted or jump straight to the full Unlock Text API documentation

Project Bamboo - research apps, infrastructure
Meanwhile, we’ve been working with the US-based, Mellon-funded Bamboo project, a research consortium building a “Scholarly services platform” of which geoparsing is a part. We’ve been helping Bamboo to define an API which can be implemented by many different services.

Bamboo needed to be able to send off requests and get responses asynchronously, and to be able to poll to tell whether a geoparsing task is done. The resulting approach feels much more robust than our previous API to the geoparser (which had long wait times on large documents and complained about document types), and it is better suited to batches or collections of texts, which is what most people will in practice be working with.
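
To make the submit-then-poll pattern concrete, here is a small sketch of an asynchronous client. The endpoint paths, parameters and response fields are placeholders invented for this post – see the Getting Started guide and API documentation linked above for the real interface.

```python
# Sketch of the asynchronous submit-then-poll pattern using the requests
# library. The base URL, endpoint paths and response fields are placeholders,
# NOT the actual Unlock Text API.
import time
import requests

BASE = "http://unlock.example/api"  # placeholder base URL

def geoparse(text):
    job = requests.post(f"{BASE}/jobs", json={"text": text}).json()
    job_id = job["id"]
    while True:
        status = requests.get(f"{BASE}/jobs/{job_id}").json()
        if status["state"] == "done":
            return status["places"]
        time.sleep(5)  # poll until the geoparsing task completes

# places = geoparse("The ship sailed from Singapore to London.")
```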

We hope others find the new Unlock Text API useful. We have a bit of a sprint planned in the coming weeks and would be happy to accept requests for new features or improvements – please get in touch with the Unlock team if there’s something you’re really missing.

Talk to us about JISC 06/11

Glad to hear that Unlock has been cited in the JISC 06/11 “eContent Capital” call for proposals.

The Unlock team would be very happy to help anyone fit a beneficial use of Unlock into their project proposal. This could feature the Unlock Places place-name and feature search; and/or the Unlock Text geoparser service which extracts place-names from text and tries to find their locations.

One could use Unlock Text to create Linked Data links to geonames.org or Ordnance Survey Open Data. Or use Unlock Places to find the locations of postcodes; or find places within a given county or constituency…

Please drop an email to jo.walsh@ed.ac.uk, or look up metazool on Skype or Twitter, to chat about how Unlock fits with your proposal for JISC 06/11.

Unlock in use

It would be great to hear from people about how they are using the Unlock place search services. So you’re encouraged to contact us and tell us how you’re making use of Unlock and what you want out of the service.
Screenshots from Molly and Georeferencer.
Here are some of the projects and services we’ve heard about that are making interesting use of Unlock in research applications.

The Molly project based at University of Oxford provides an open source mobile location portal service designed for campuses. Molly uses some Cloudmade services and employs Unlock for postcode searching.

Georeferencer.org uses Unlock Places to search old maps. The service is used by National Library of Scotland Map Library and other national libraries in Europe.
More on the use of Unlock Places by georeferencer.org.

CASOS at CMU has been experimenting with the Unlock Text service to geolocate social network information.

The Open Fieldwork project has been georeferencing educational resources: “In exploring how we could dynamically position links to fieldwork OER on a map, based on the location where the fieldwork takes place, one approach might be to resolve a position from the resource description or text in the resource. The OF project tried out the EDINA Unlock service – it looks like it could be very useful.”

We had several interesting entries to 2010’s dev8d developer challenge using Unlock:

Embedded GIS-lite Reporting Widget:
Duncan Davidson, Informatics Ventures, University of Edinburgh
“Adding data tables to content management systems and spreadsheet software packages is a fairly simple process, but statistics are easier to understand when the data is visual. Our widget takes geographic data – in this instance data on Scottish councils – passes it through EDINA’s API and then produces coordinates which are mapped onto Google. The end result is an annotated map which makes the data easier to access.”

Geoprints, which also works with the Yahoo Placemaker API, by
Marcus Ramsden at Southampton University.
“Geoprints is a plugin for EPrints. You can upload a pdf, Word document or Powerpoint file, and it will extract the plain text and send it to the EDINA API. GeoPrints uses the API to pull out the locations from that data and send them to the database. Those locations will then be plotted onto a map, which is a better interface for exploring documents.”

Point data in mashups: moving away from pushpins in maps:
Aidan Slingsby, City University London
“Displaying point data as density estimation services, chi surfaces and ‘tagmaps’. Using British placenames classified by generic form and linguistic origin, accessed through the Unlock Places API.”

The dev8d programme for 2011 is being finalised at the moment and should be published soon; the event this year runs over two days, and should definitely be worth attending for developers working in, or near, education and research.

Chalice at WhereCamp

I was lucky enough to get to WhereCamp UK last Friday/Saturday, mainly because Jo couldn’t make it. I’ve never been to one of these unconferences before, but I was impressed by the friendly, anything-goes atmosphere, and emboldened to give an impromptu talk about CHALICE. I explained the project setup, its goals and some of the issues encountered, at least as I see them –

  • the URI minting question
  • the appropriateness (or lack of it) of only having points to represent regions instead of polygons
  • the scope for extending the nascent historical gazetteer we’re building and connecting it to others
  • how the results might be useful for future projects.

I was particularly looking for feedback on the last two points: ideas on how best to grow the historical gazetteer and who has good data or sources that should be included if and when we get funding for a wider project to carry on from CHALICE’s beginnings; and secondly, ideas about good use cases to show why it’s a good idea to do that.

We had a good discussion, with a supportive and interested audience. I didn’t manage to make very good notes, alas. Here’s a flavour of the discussion areas:

  • dealing with variant spellings in old texts – someone pointed out that the sound of a name tends to be preserved even though the spelling evolves, and maybe that can be exploited (see the sketch after this list);
  • using crowd-sourcing to correct errors from the automatic processes, plus to gather further info on variant names;
  • copyright and IPR, and the fact that being out of print copyright doesn’t mean there won’t be issues around digital copyright in the scanned page images;
  • whether or not it would be possible – in a later project – to do useful things with the field names from EPNS;
  • the idea of parsing out the etymological references from EPNS, to build a database of derivations and sources;
  • using the gazetteer to link back to the scanned EPNS pages, to assist an online search application.
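
On the first point above, here is a toy illustration of how the “sound is preserved while spelling evolves” observation might be exploited: a basic Soundex encoding groups variant spellings of a name under the same phonetic key. Soundex is a crude stand-in here – a real gazetteer matcher would likely want something more refined – but it shows the idea.

```python
# Toy illustration: variant spellings of a place name often share a phonetic
# key even when the letters differ. Basic American Soundex is used here as a
# simple stand-in for more refined phonetic matching.
def soundex(name):
    name = "".join(c for c in name.upper() if c.isalpha())
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    first = name[0]
    encoded = [first]
    prev = codes.get(first, "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            encoded.append(code)
        if c not in "HW":          # H and W do not reset the previous code
            prev = code
    return "".join(encoded)[:4].ljust(4, "0")

for variant in ("Edenburg", "Edinburgh", "Edynburgh"):
    print(variant, soundex(variant))  # all three share the same code
```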

Plenty of use cases were suggested, and here are some that I remember, plus ideas about related projects that it might be good to tie up with:

  • a good gazetteer would aid research into the location of places that no longer exist, e.g. from the Domesday period – if you can locate historical placenames mentioned in the same text, you can start narrowing down the likely area for the mystery places;
  • the library world is likely to be very interested in good historical gazetteers, a case mentioned being the Alexandria Library project sponsored by the Library of Congress amongst others;
  • there are overlaps and ideas to share with similar historical placename projects like Pleiades, Hestia and GAP (Google Ancient Places).

I mentioned that, being based in Edinburgh, we’re particularly keen to include Scottish historical placenames. There are quite a few sources and people who have been working for ages in this area – that’s probably one of the next things to take forward, to see if we can tie up with some of the existing experts for mutual benefit.

There were loads of other interesting presentations and talk at WhereCamp… but this post is already too long.

Linking historic places: looking at Victoria County History

Stuart Dunn mentioned the Victoria County History in his writeup of discussions with the Clergy of the Church of England Database project. Both resources are rich in place-name mentions and historic depth; as part of the Chalice project we’re investigating ways to make such resources more searchable by extracting historic place-names and linking them to our gazetteer.

Here’s a summary of some email conversation between Stuart, Claire Grover, Ross Drew at EDINA and myself while looking at some sample data from VCH.

The idea is to explore the possibilities in how Chalice data could enhance / complement semi-structured information like VCH (or more structured database-like sources such as CCED).

It would be very valuable, I think, to do an analysis of how much effort and preparation of the (target) data is needed to link CHALICE to VCH, and to a more structured dataset like CCED. By providing georeferences and toponym links, we’re bringing all that EPNS documentary evidence to VCH, thus enriching it.

It would be very interesting if we were able to show how text-mining techniques could be used to add to the work of EPNS, extracting place references that aren’t listed and suggesting them to editors along with proposed attestations (source and date).

In the more immediate future, this is about adding links from Chalice place-references to other resources, which would allow us to cross-reference them and search them in interesting ways.

Text mining isn’t absolutely necessary to map the EPNS place names to the VCH text. On the other hand, LTG have all the processing infrastructure to convert formats, tokenise the text and so on, so we could put something in place very quickly. It wouldn’t be perfect, but it would demonstrate the point. I’ve not seen the CCED data, so I don’t know how complex that would be.

Here’s a sample reference to a volume of VCH that may have some overlap with the Shropshire content we have in “born-digital” form from EPNS. There’s the intriguing prospect of adding historic place-name text mining/search at the digitisation phase, so resources can be linked to other references as soon as they’re published.