Progress to date on Trading Consequences Visualizations

Up here in St Andrews we are exploring several routes to visualize the large volume of commodity data that our colleagues from the University of Edinburgh have extracted from the historical archives.

Research in environmental history can be an open-ended process where research questions are formed and refined as part of working with the available data (i.e. historic documents). Our goal is therefore to develop visualization concepts that reveal a range of temporal, geographic and content-related perspectives on the commodity data, and that highlight different conceptual angles and relations within the data. Such “interlinked” visualization perspectives can provide an overview of the entire dataset and, at the same time, act as probes to explore certain aspects of the commodity data in more detail. Using this approach we aim to support more open-ended explorations of the commodity data while also providing easy access to specific documents of interest.

Our design process so far has been driven by discussions with Jim and Colin, paper sketches to iterate on certain visualization ideas and some literature research on information visualization and digital humanities.

Discussions with Jim and Colin revealed that the temporal and geographic aspects of the data are central to their research but always in close combination with commodity types and their relations to each other. This resulted in several paper sketches, as you can see below, to explore how these particular aspects could be visually expressed and augmented with interactive features.

We also created (static) computational sketches (shown below) based on samples from the actual database. At the same time, our collaborators from EDINA created an interface to the database that allowed interrogating the data through textual queries and list views.

Both these approaches allowed us to explore the character of the data and the visualization challenges that it introduces.

The implementation of a web-based visualization prototype that combines the ideas from our early design explorations is currently in full swing. This prototype is based on the popular visualization library d3.js. We are collaborating closely with the teams from Toronto and Edinburgh, iterating on its design and implementation.

Moving from the questions and interests of researchers in environmental history to interactive visualizations that support digging into the data with fluid, commodity-oriented inquiries is a process of continual refinement, and of exploring interaction research questions both small and large.

Putting it all together: first attempt

Within Trading Consequences, our intention is to develop a series of prototypes for the overall system. Initially, these will have limited functionality, but will then become increasingly powerful. We have just reached the point of building our first such prototype. Here’s a picture of the overall architecture:

Text Mining

Our project team has delivered the first prototype of the Trading Consequences system. The system takes in documents from a number of different collections. The Text Mining component consists of an initial preprocessing stage which converts each document to a consistent XML format. Depending on the corpus that we’re processing, a language identification step may be performed to ensure that the current document is in English. (We plan to also look at French documents later in the project.) The OCR-ed text is then automatically improved by correcting and normalising a number of issues.
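To give a flavour of what an OCR clean-up step can look like, here is a minimal Python sketch. It is not the project's code, and the three rules it applies (rejoining words hyphenated across line breaks, replacing the long s, collapsing whitespace) are assumptions chosen for illustration; the function name `normalise_ocr` is hypothetical.

```python
import re


def normalise_ocr(text: str) -> str:
    """Apply a few illustrative clean-up rules to OCR-ed text.

    The rules are assumptions made for this sketch, not the
    project's actual correction steps.
    """
    # Rejoin words hyphenated across a line break: "Cin-\nchona" -> "Cinchona".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Replace the long s (U+017F), common in older print, with a modern "s".
    text = text.replace("\u017f", "s")
    # Collapse runs of whitespace, including remaining line breaks, to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalise_ocr("Cin-\nchona bark  was \u017fhipped\nfrom Peru."))
```

Note that the first rule would also join words that are genuinely hyphenated at a line end, so a real pipeline would probably check the joined form against a dictionary before accepting it.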

The main processing of the TM component involves various types of shallow linguistic analysis of the text, lexicon and gazetteer lookup, named entity recognition and grounding, and relation extraction. We determine which commodities were traded when and in relation to which locations. We also determine whether locations are mentioned as points of origin, transit or destination and whether vocabulary relating to diseases and disasters appears in the text. All additional information which we mine from the text is added back into the XML document as different types of annotation.
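The lexicon and gazetteer lookup, and the step of writing annotations back into the XML, can be sketched roughly as follows. This is a simplified illustration rather than the project's implementation: the term lists are toy examples, and the element and attribute names (`document`, `entities`, `ent`, `type`, `start`, `end`) are hypothetical. Real named entity recognition and grounding involve far more than string matching.

```python
import re
import xml.etree.ElementTree as ET

# Toy lexicon and gazetteer, assumed for illustration only.
COMMODITIES = {"cinchona bark", "timber", "sugar"}
LOCATIONS = {"peru", "liverpool", "jamaica"}


def find_mentions(text, terms, kind):
    """Return (start, end, kind, surface) for each whole-word match of a term."""
    mentions = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((m.start(), m.end(), kind, m.group(0)))
    return sorted(mentions)


def annotate(doc_id, text):
    """Build an XML document holding the text plus stand-off entity annotations."""
    doc = ET.Element("document", id=doc_id)
    ET.SubElement(doc, "text").text = text
    entities = ET.SubElement(doc, "entities")
    found = (find_mentions(text, COMMODITIES, "commodity")
             + find_mentions(text, LOCATIONS, "location"))
    # Emit annotations in order of their position in the text.
    for start, end, kind, surface in sorted(found):
        ent = ET.SubElement(entities, "ent", type=kind,
                            start=str(start), end=str(end))
        ent.text = surface
    return doc


if __name__ == "__main__":
    doc = annotate("d1", "Cinchona bark was shipped from Peru to Liverpool.")
    print(ET.tostring(doc, encoding="unicode"))
```

Keeping character offsets alongside each annotation means a later stage can point back to the exact place in the document where a commodity or location was mentioned.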

Populating the Commodities Database

The entire annotated XML corpus is parsed to create a relational database (RDB). This stores not just metadata about the individual document, but also detailed information that results from the text mining, such as named entities, relations, and how these are expressed in the relevant document.
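As a rough sketch of this step, the following Python loads one annotated document into a small SQLite database. The schema, the table and column names, and the XML layout are all assumptions made for the example; the project's actual database design is not described here and is certainly richer (it also stores relations, for instance).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical two-table schema: one row per document, one per entity mention.
SCHEMA = """
CREATE TABLE document (
    id   TEXT PRIMARY KEY,
    year INTEGER
);
CREATE TABLE mention (
    id           INTEGER PRIMARY KEY,
    document_id  TEXT REFERENCES document(id),
    type         TEXT,
    surface      TEXT,
    start_offset INTEGER,
    end_offset   INTEGER
);
"""

# Hypothetical annotated document, in the assumed XML layout.
SAMPLE = """<document id="d1" year="1850">
  <text>Cinchona bark was shipped from Peru.</text>
  <entities>
    <ent type="commodity" start="0" end="13">Cinchona bark</ent>
    <ent type="location" start="31" end="35">Peru</ent>
  </entities>
</document>"""


def load_document(conn, xml_string):
    """Parse one annotated XML document and insert it into the database."""
    doc = ET.fromstring(xml_string)
    doc_id = doc.get("id")
    conn.execute("INSERT INTO document (id, year) VALUES (?, ?)",
                 (doc_id, int(doc.get("year"))))
    for ent in doc.iter("ent"):
        conn.execute(
            "INSERT INTO mention "
            "(document_id, type, surface, start_offset, end_offset) "
            "VALUES (?, ?, ?, ?, ?)",
            (doc_id, ent.get("type"), ent.text,
             int(ent.get("start")), int(ent.get("end"))),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_document(conn, SAMPLE)
    for row in conn.execute("SELECT type, surface FROM mention ORDER BY id"):
        print(row)
```

Storing the surface form and offsets of each mention, not just a normalised name, is what lets the database record how an entity is expressed in the relevant document.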

Visualisation

Both the visualisation and the query interface access the database so that users can either search the collections directly through textual queries or browse the data in a more exploratory manner through the visualisations. For the prototype, we have created a static web-based visualization that represents a subset of the data taken from the database. This visualization sketch is based on a map that shows the location of commodity mentions by country. We are currently setting up the query interface and working on dynamic visualisation of the mined information.
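A map of commodity mentions by country needs, at a minimum, a count of mentions per country. The sketch below shows one way such counts could be prepared in Python and written out as JSON for a web page to load. The input format (pairs of commodity and country) and the function name `mentions_by_country` are assumptions for illustration, and the sample rows are made up rather than taken from the database.

```python
import json
from collections import Counter


def mentions_by_country(rows, commodity=None):
    """Count commodity mentions per country.

    rows is an iterable of (commodity, country) pairs; if commodity
    is given, only mentions of that commodity are counted.
    """
    counts = Counter()
    for row_commodity, country in rows:
        if commodity is None or row_commodity == commodity:
            counts[country] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample rows standing in for a database query result.
    sample_rows = [
        ("sugar", "Jamaica"),
        ("sugar", "Jamaica"),
        ("timber", "Canada"),
    ]
    # JSON keyed by country name, ready for a map to shade or label.
    print(json.dumps(mentions_by_country(sample_rows), indent=2))
```

Filtering by commodity before counting is the simplest route from a static overview towards the more dynamic, commodity-oriented views mentioned above.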
