Putting it all together: first attempt

Within Trading Consequences, our intention is to develop a series of prototypes for the overall system. Initially, these will have limited functionality, but will then become increasingly powerful. We have just reached the point of building our first such prototype. Here’s a picture of the overall architecture:

Text Mining

Our project team has delivered the first prototype of the Trading Consequences system. The system takes in documents from a number of different collections. The Text Mining component begins with a preprocessing stage which converts each document to a consistent XML format. Depending on the corpus that we’re processing, a language identification step may be performed to ensure that the current document is in English. (We also plan to look at French documents later in the project.) The OCR-ed text is then automatically cleaned up by correcting and normalising a number of recurrent errors.
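
As a rough illustration, a language-identification step can be approximated by counting stopword hits; the sketch below is a toy heuristic, not the component we actually use:

```python
# Tiny stopword lists; a real system would use a proper language
# identifier, so treat this as a toy heuristic only.
ENGLISH_STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "it"}
FRENCH_STOPWORDS = {"le", "la", "de", "et", "les", "des", "un", "une", "est", "que"}

def looks_english(text):
    """Crudely guess English vs French by counting stopword hits."""
    words = text.lower().split()
    english_hits = sum(w in ENGLISH_STOPWORDS for w in words)
    french_hits = sum(w in FRENCH_STOPWORDS for w in words)
    return english_hits > french_hits
```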

The main processing of the TM component involves various types of shallow linguistic analysis of the text, lexicon and gazetteer lookup, named entity recognition and grounding, and relation extraction. We determine which commodities were traded when and in relation to which locations. We also determine whether locations are mentioned as points of origin, transit or destination and whether vocabulary relating to diseases and disasters appears in the text. All additional information which we mine from the text is added back into the XML document as different types of annotation.
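
To give a flavour of what that annotation might look like, the sketch below uses Python’s ElementTree to attach a mined commodity entity to a document; the element and attribute names are invented for illustration, not our actual schema:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<document><text>Shipments of cinchona reached Liverpool in 1862."
    "</text></document>"
)

# Attach a mined commodity entity as stand-off annotation. The element
# and attribute names here are invented for this sketch.
entities = ET.SubElement(doc, "entities")
entity = ET.SubElement(entities, "entity",
                       type="commodity", start="13", end="21")
entity.text = "cinchona"

annotated = ET.tostring(doc, encoding="unicode")
```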

Populating the Commodities Database

The entire annotated XML corpus is parsed to create a relational database (RDB). This stores not just metadata about each individual document, but also detailed information that results from the text mining, such as named entities, relations, and how these are expressed in the relevant document.
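
A minimal sketch of what such a database might look like, using SQLite; the table layout and column names are illustrative guesses, not our actual schema:

```python
import sqlite3

# Illustrative schema only; the project's real database is richer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    collection TEXT,
    year INTEGER
);
CREATE TABLE mention (
    id INTEGER PRIMARY KEY,
    doc_id INTEGER REFERENCES document(id),
    entity_type TEXT,      -- e.g. 'commodity' or 'location'
    surface_form TEXT,     -- the string as it appears in the document
    start_offset INTEGER,
    end_offset INTEGER
);
""")
conn.execute("INSERT INTO document VALUES (1, 'sample-collection', 1862)")
conn.execute("INSERT INTO mention VALUES (1, 1, 'commodity', 'cinchona', 13, 21)")
row = conn.execute(
    "SELECT surface_form FROM mention WHERE entity_type = 'commodity'"
).fetchone()
```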

Visualisation

Both the visualisation and the query interface access the database, so that users can either search the collections directly through textual queries or browse the data in a more exploratory manner through the visualisations. For the prototype, we have created a static web-based visualisation that represents a subset of the data taken from the database. This visualisation sketch is based on a map that shows the location of commodity mentions by country. We are currently setting up the query interface and working on dynamic visualisation of the mined information.
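
The aggregation behind such a map view can be sketched in a few lines; the mention data below is made up for illustration:

```python
from collections import Counter

# Made-up (commodity, country) pairs standing in for database rows.
mentions = [
    ("gutta-percha", "Malaya"),
    ("timber", "Canada"),
    ("timber", "Canada"),
    ("guano", "Peru"),
    ("timber", "Norway"),
]

# The per-country totals that a map-based view would plot.
mentions_by_country = Counter(country for _, country in mentions)
```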

Trading Consequences at SICSA DemoFest 2012

Trading Consequences will be showing a poster at the SICSA DemoFest 2012 at the Informatics Forum, University of Edinburgh on Tuesday 6th November.
Trading Consequences poster for SICSA DEMOFest 12

According to the official publicity:

SICSA is the largest ICT research cluster in Europe and this year’s DEMOfest shows the best of Informatics and Computing Science state of the art research in Scotland. DEMOfest promotes research and encourages commercial collaboration between academia, business and industry. The event will exhibit over 50 presentations and demonstrations aimed at showing:

  • Research with commercial potential.
  • Opportunities for collaboration between university and industry.

We are looking forward to seeing who’s interested in our project!

The Boundaries of Commodities

Together with Jim Clifford and Uta Hinrichs, I was lucky enough to be able to attend the first Networking Workshop for the AHRC Commodity Histories project on 6–7 September. This was organised by Sandip Hazareesingh, Jean Stubbs and Jon Curry-Machado, who are also jointly responsible for the Commodities of Empire project. The main stated goal of the meeting was to design a collaborative research web space for the community of digital historians interested in tracing the origins and growth of the global trade in commodities. This aspect of the meeting was deftly coordinated by Mia Ridge, and also took inspiration from William Turkel’s analysis of designing and running a web portal for the NiCHE community of environmental historians in Canada.

Complementing the design and planning activity was an engaging programme of short talks, both by participants of Commodities of Empire and by people working on related initiatives. I won’t try to summarise the talks here; there are others who are much better qualified than me to do that. Instead, I want to mention a small idea about commodities that emerged from a discussion during the breaks.

A number of the workshop participants problematised the notion of ‘commodity’, and pointed out that it isn’t always possible or realistic to set sharp boundaries on what counts as a commodity. It’s certainly the case that we have tended to accept a simple reification of commodities within Trading Consequences. Tim Hitchcock argued that commodities are convenient fictions that abstract away from a complex chain of causes and effects. He gave guano as an example of such a commodity: it results from a collection of processes, during which fish are consumed by seabirds, digested and excreted, and the resulting accumulation of excrement is then harvested for subsequent trading. Of course, we can also think about the processes that guano undergoes after being transported, most obviously for use as a crop fertiliser that enters into further relations of production and trade. Here’s a picture that tries to capture this notion of a commodity being a transient spatio-temporal phase in a longer chain of processes, each of which takes place in a specific social/natural/technological environment.
Diagram of commodity as phase in a chain of processes
Although we have little access within the framework of Trading Consequences to these wider aspects of context, one idea that might be worth pursuing would be to annotate the plant-based commodities in our data with information about their preferred growing conditions. For example, it might be useful to know whether a given plant is limited to, say, tropical climate zones, and whether it grows in forested or open environments. Some of this data can probably be recovered from Wikipedia, but it would be nice if we could find a Linked Data set which could be more directly linked to from our current commodity vocabulary. One benefit of recording such information might be an additional sanity check that we have correctly geo-referenced locations that are associated with plants. Another line of investigation would be whether a particular plant is being cultivated on the margins of its environmental tolerance by colonists. Finally, data about climatic zone could play well with map-based visualisations of trading routes.
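
To make the sanity-check idea concrete, here is a toy sketch; the climate zones and latitude bands are invented placeholders, not curated botanical data:

```python
# Invented climate zones and latitude bands: placeholders only, not
# curated botanical data.
CLIMATE_ZONE = {"cinchona": "tropical", "flax": "temperate"}
ZONE_LAT_BAND = {"tropical": (0.0, 23.5), "temperate": (23.5, 66.5)}

def plausible_location(commodity, latitude):
    """Flag geo-references that fall outside a plant's expected zone."""
    zone = CLIMATE_ZONE.get(commodity)
    if zone is None:
        return True  # no data for this commodity: don't reject
    low, high = ZONE_LAT_BAND[zone]
    return low <= abs(latitude) <= high
```

A mention of cinchona geo-referenced to Edinburgh, say, would then be flagged for a second look.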

Building vocabulary with SPARQL

Judging from the Oxford Digital Humanities workshop on A Humanities Web of Data, and this related post on SPARQL queries by Jonathan Blaney, there is growing interest in using Semantic Web technologies for the digital humanities. Since the Trading Consequences digital historians are already perched on the edge of this particular bandwagon, I have written up a somewhat more technical post on how we’re using SPARQL and SKOS to develop the commodities vocabulary. You can find it here.
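
For readers curious about the flavour of this, the toy sketch below mimics a SPARQL property-path query (skos:narrower+) over a hand-made triple list; the concepts are invented, and the real work queries a proper SPARQL endpoint, as described in the linked post:

```python
# A hand-made triple list standing in for a SKOS vocabulary; a real
# query would use a SPARQL property path such as skos:narrower+.
TRIPLES = [
    ("commodity", "skos:narrower", "plant-product"),
    ("commodity", "skos:narrower", "animal-product"),
    ("plant-product", "skos:narrower", "gutta-percha"),
    ("animal-product", "skos:narrower", "cod-liver-oil"),
]

def narrower_transitive(concept):
    """Every concept reachable from `concept` via skos:narrower."""
    found, frontier = set(), [concept]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in TRIPLES:
            if subj == current and pred == "skos:narrower" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found
```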

“Weevils”, “Vapours” and “Silver oics”: Finding Commodity Terms

One of the core tasks in Trading Consequences is being able to identify words in digitised texts which refer to commodities (as well as words which refer to places). Here’s a snippet of the kind of text we might be trying to analyse:

How do we know that gutta-percha in this text is a commodity name but, say, electricity is not? The simplest approach, and the one that we are adopting, is to use a big list of terms that we think could be names of commodities, and check against this list when we process our input texts. If we find gutta-percha in both our list of commodity terms and in the document that is being processed, then we add an annotation to the document that labels gutta-percha as a commodity name.
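
A minimal sketch of this lexicon-lookup step (the term list here is a tiny sample, and our real pipeline is more elaborate):

```python
import re

COMMODITY_TERMS = {"gutta-percha", "tin", "coal"}  # tiny sample lexicon

def annotate_commodities(text):
    """Return (term, start, end) spans for lexicon terms found in text."""
    spans = []
    for term in COMMODITY_TERMS:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            spans.append((term, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])
```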

In our first version of the text mining system, we derived the list of commodity terms from WordNet. WordNet is a big thesaurus or lexical database, and its terms are organised hierarchically. This means that, as a first approximation, we can guess that any lexical item in WordNet that is categorised as a subclass of Physical Matter, Plant Life, or Animal might be a commodity term. How well do we do with this? Not surprisingly, when we carried out some initial experiments at the very start of our work on the project, we found that there are some winners and some losers. Here are some of the terms that were plausibly labelled as commodities in a sample corpus of digitised text:

horse, tin, coal, seedlings, grains, crab, merino fleece, fur, cod-liver oil, ice, log, potatoes, liquor, lemons.

And here are some less plausible candidate commodity terms:

weevil, water frontage, vomit, vienna dejeuner, verde-antique, vapours, toucans, steam frigates, smut, simple question, silver oics.
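
To illustrate the idea, the sketch below harvests every term under a category in a toy stand-in for WordNet’s hierarchy (real work would use a WordNet API, and these categories and members are invented):

```python
# Toy stand-in for WordNet's hypernym hierarchy; the categories and
# members here are illustrative, not real WordNet data.
HYPONYMS = {
    "physical matter": ["metal", "weevil"],
    "metal": ["tin", "silver"],
    "plant life": ["gutta-percha", "seedlings"],
}

def subtree_terms(root):
    """Collect every term under a category: our first-pass term list."""
    terms = []
    for child in HYPONYMS.get(root, []):
        terms.append(child)
        terms.extend(subtree_terms(child))
    return terms
```

Note how weevil creeps into the candidate list, mirroring the over-broad terms above.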

There are a number of factors that conspire to produce these incorrect results. The first is that our list of terms is just too broad, and includes things that could never be commodities. The second is that, for now, we are not taking into account the context in which words occur in the text; this is computationally quite expensive, and not an immediate priority. The third is that the input to our text mining tools is not nice clean text such as we would get from ‘born-digital’ newswire. Instead, nineteenth-century books have been scanned and then turned into text by the process of Optical Character Recognition (OCR for short). As we’ll describe in future posts, OCR can sometimes produce bizarrely bad results, and this is probably responsible for our silver oics.

At the moment, we are working on generating a better list of commodity terms (as mentioned in a recent post by Jim Clifford). We’ll report back on progress soon.
