Putting it all together: first attempt

Within Trading Consequences, we intend to develop a series of prototypes for the overall system. The early ones will have limited functionality, and later ones will become increasingly powerful. We have just reached the point of building our first such prototype. Here’s a picture of the overall architecture:

Text Mining

Our project team has delivered the first prototype of the Trading Consequences system. The system takes in documents from a number of different collections. The Text Mining component consists of an initial preprocessing stage which converts each document to a consistent XML format. Depending on the corpus that we’re processing, a language identification step may be performed to ensure that the current document is in English. (We plan to also look at French documents later in the project.) The OCR-ed text is then automatically improved by correcting and normalising a number of issues.
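To make the OCR clean-up step more concrete, here is a minimal sketch of the kind of normalisation that might be applied. The three rules shown (long s replacement, rejoining words hyphenated across line breaks, whitespace collapsing) and the function name `normalise_ocr` are illustrative assumptions; the post does not say which issues the actual system corrects.

```python
import re


def normalise_ocr(text: str) -> str:
    """Apply a few illustrative OCR fixes (assumed rules, not the project's own)."""
    # Replace the long s (U+017F), common in older print, with a modern "s".
    text = text.replace("\u017f", "s")
    # Rejoin words split by a hyphen at a line break: "Cey-\nlon" -> "Ceylon".
    # Note: this naive rule would also merge genuinely hyphenated compounds.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace (including line breaks) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

A real pipeline would most likely need a dictionary check before removing a line-end hyphen, since some hyphens are part of the word.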

The main processing of the TM component involves various types of shallow linguistic analysis of the text, lexicon and gazetteer lookup, named entity recognition and grounding, and relation extraction. We determine which commodities were traded, when, and in relation to which locations. We also determine whether locations are mentioned as points of origin, transit or destination, and whether vocabulary relating to diseases and disasters appears in the text. All the additional information that we mine from the text is added back into the XML document as different types of annotation.
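The lookup and relation extraction steps can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the tiny lexicon and gazetteer, the cue words used to guess a location's role, and the `ent` and `rel` element names. The real component uses much richer linguistic analysis and its own annotation scheme.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical resources; the real lexicon and gazetteer are far larger.
COMMODITIES = {"cinnamon", "timber"}
PLACES = {"Ceylon", "London", "Quebec"}
# Assumed cue words: the word just before a place name hints at its role.
ROLE_CUES = {"from": "origin", "via": "transit", "to": "destination"}


def annotate(sentence: str) -> ET.Element:
    """Return an <s> element carrying entity and relation annotations."""
    sent = ET.Element("s")
    sent.text = sentence
    tokens = re.findall(r"\w+", sentence)

    # Lexicon lookup for commodities (case-insensitive).
    commodities = [t for t in tokens if t.lower() in COMMODITIES]

    # Gazetteer lookup for locations, with a role guessed from the previous token.
    locations = []
    for i, tok in enumerate(tokens):
        if tok in PLACES:
            cue = tokens[i - 1].lower() if i > 0 else ""
            locations.append((tok, ROLE_CUES.get(cue, "unknown")))

    # Add the mined information back into the XML as annotations.
    for c in commodities:
        ET.SubElement(sent, "ent", {"type": "commodity"}).text = c
    for name, role in locations:
        ET.SubElement(sent, "ent", {"type": "location", "role": role}).text = name
    # Naive relation extraction: link every commodity to every location
    # that occurs in the same sentence.
    for c in commodities:
        for name, role in locations:
            ET.SubElement(sent, "rel", {"commodity": c, "location": name, "role": role})
    return sent
```

Calling `annotate("Cinnamon was shipped from Ceylon to London")` yields one commodity entity, two location entities, and two `rel` elements linking the commodity to each location with its guessed role.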

Populating the Commodities Database

The entire annotated XML corpus is parsed to create a relational database (RDB). This stores not just metadata about each individual document, but also detailed information that results from the text mining, such as named entities, relations, and how these are expressed in the relevant document.
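As a rough sketch of this step, the following parses one annotated document and loads it into SQLite. The XML layout, the `document` and `relation` tables, and their columns are all hypothetical; the post does not describe the real schema or which database engine is used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical annotated document; the project's XML format will differ.
SAMPLE = """<document id="doc1" title="Example report" year="1850">
  <s>Cinnamon was shipped from Ceylon to London
    <rel commodity="Cinnamon" location="Ceylon" role="origin"/>
    <rel commodity="Cinnamon" location="London" role="destination"/>
  </s>
</document>"""


def load_document(conn: sqlite3.Connection, xml_text: str) -> None:
    """Store document metadata and mined relations (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document "
        "(id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relation "
        "(doc_id TEXT REFERENCES document(id), commodity TEXT, "
        "location TEXT, role TEXT, sentence TEXT)"
    )
    doc = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?)",
        (doc.get("id"), doc.get("title"), int(doc.get("year"))),
    )
    for sent in doc.iter("s"):
        # Keep the sentence text so we record how the relation was expressed.
        text = (sent.text or "").strip()
        for rel in sent.findall("rel"):
            conn.execute(
                "INSERT INTO relation VALUES (?, ?, ?, ?, ?)",
                (doc.get("id"), rel.get("commodity"), rel.get("location"),
                 rel.get("role"), text),
            )
    conn.commit()
```

For a quick trial, `load_document(sqlite3.connect(":memory:"), SAMPLE)` builds the tables in an in-memory database.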

Visualisation

Both the visualisation and the query interface access the database, so that users can either search the collections directly through textual queries or browse the data in a more exploratory manner through the visualisations. For the prototype, we have created a static web-based visualisation that represents a subset of the data taken from the database. This visualisation sketch is based on a map that shows the location of commodity mentions by country. We are currently setting up the query interface and working on dynamic visualisation of the mined information.
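A map of commodity mentions by country needs little more than a grouped count from the database. The sketch below shows one way such a query could look; the `mention` table, its columns, and the sample rows are invented for illustration and are not taken from the project.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'mention' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mention (commodity TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO mention VALUES (?, ?)",
        [("cinnamon", "Sri Lanka"), ("cinnamon", "Sri Lanka"),
         ("cinnamon", "India"), ("timber", "Canada")],
    )
    conn.commit()
    return conn


def mentions_by_country(conn: sqlite3.Connection, commodity: str) -> dict:
    """Count mentions of one commodity per country, ready to plot on a map."""
    rows = conn.execute(
        "SELECT country, COUNT(*) FROM mention "
        "WHERE commodity = ? GROUP BY country ORDER BY country",
        (commodity,),
    )
    return dict(rows.fetchall())
```

The resulting dictionary maps each country name to a count, which a web front end could turn into shading or markers on a map.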
