The Boundaries of Commodities

Together with Jim Clifford and Uta Hinrichs, I was lucky enough to attend the first Networking Workshop for the AHRC Commodity Histories project on 6–7 September. This was organised by Sandip Hazareesingh, Jean Stubbs and Jon Curry-Machado, who are also jointly responsible for the Commodities of Empire project. The main stated goal of the meeting was to design a collaborative research web space for the community of digital historians interested in tracing the origins and growth of the global trade in commodities. This aspect of the meeting was deftly coordinated by Mia Ridge, and also took inspiration from William Turkel’s analysis of designing and running a web portal for the NiCHE community of environmental historians in Canada.

Complementing the design and planning activity was an engaging programme of short talks, both by participants of Commodities of Empire and by people working on related initiatives. I won’t try to summarise the talks here; there are others who are much better qualified than me to do that. Instead, I want to mention a small idea about commodities that emerged from a discussion during the breaks.

A number of the workshop participants problematized the notion of ‘commodity’, and pointed out that it isn’t always possible or realistic to set sharp boundaries on what counts as a commodity. It’s certainly the case that we have tended to accept a simple reification of commodities within Trading Consequences. Tim Hitchcock argued that commodities are convenient fictions that abstract away from a complex chain of causes and effects. He gave guano as an example of such a commodity: it results from a collection of processes, during which fish are consumed by seabirds, digested and excreted, and the resulting accumulation of excrement is then harvested for subsequent trading. Of course, we can also think about the processes that guano undergoes after being transported, most obviously for use as a crop fertiliser that enters into further relations of production and trade. Here’s a picture that tries to capture this notion of a commodity being a transient spatio-temporal phase in a longer chain of processes, each of which takes place in a specific social/natural/technological environment.
[Diagram: a commodity as a transient spatio-temporal phase in a longer chain of processes]
Although we have little access within the framework of Trading Consequences to these wider aspects of context, one idea that might be worth pursuing would be to annotate the plant-based commodities in our data with information about their preferred growing conditions. For example, it might be useful to know whether a given plant is limited to, say, tropical climate zones, and whether it grows in forested or open environments. Some of this data can probably be recovered from Wikipedia, but it would be nice if we could find a Linked Data set which could be more directly linked to from our current commodity vocabulary. One benefit of recording such information might be an additional sanity check that we have correctly geo-referenced locations that are associated with plants. Another line of investigation would be whether a particular plant is being cultivated on the margins of its environmental tolerance by colonists. Finally, data about climatic zone could play well with map-based visualisations of trading routes.
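
To make the idea a little more concrete, here is a minimal sketch in Python with rdflib of what such an annotation might look like. The tc: and grow: namespaces, the property names, and the commodity URI are all hypothetical placeholders standing in for our commodity vocabulary and a suitable Linked Data source; none of them is anything we have actually published.

from rdflib import Graph, Literal, Namespace

# Hypothetical namespaces: TC stands in for the Trading Consequences
# commodity vocabulary, GROW for growing-conditions properties; neither
# is a real, published vocabulary.
TC = Namespace("http://example.org/tc/commodity/")
GROW = Namespace("http://example.org/tc/growing/")

g = Graph()
g.bind("tc", TC)
g.bind("grow", GROW)

# Annotate a plant-based commodity with its preferred growing conditions.
g.add((TC["cinchona"], GROW["climateZone"], Literal("tropical")))
g.add((TC["cinchona"], GROW["habitat"], Literal("montane forest")))

print(g.serialize(format="turtle"))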

Using source identifiers to link data

In the Chalice project we’ve used Unlock Places to make links across the Linked Data web, using the source identifier which appears in the results of each place search. As this might be useful to others, it’s worth walking through an example.

This search for “Bosley” shows us results in the UK from geonames and from the Ordnance Survey 50K gazetteer: http://unlock.edina.ac.uk/ws/nameSearch?name=Bosley&country=uk

Here’s an extract of one of the results, the listing for Bosley in the Ordnance Survey 1:50K gazetteer:

<identifier>11083412</identifier>
<sourceIdentifier>28360</sourceIdentifier>
<name>Bosley</name>
<country>United Kingdom</country>
<custodian>Ordnance Survey</custodian>
<gazetteer>OS Open 1:50 000 Scale Gazetteer</gazetteer>

The sourceIdentifier shown here is the identifier published by each of the original data sources that Unlock Places cross-searches.

Ordnance Survey Research re-uses these identifiers to create its Linked Data namespace. For any place in the 50K gazetteer, we can reconstruct the link that refers to that place by appending the source identifier to this URL, which is the namespace for the 50K gazetteer: http://data.ordnancesurvey.co.uk/id/50kGazetteer/

So our reference to Bosley can be made by adding the source identifier to the namespace:

http://data.ordnancesurvey.co.uk/id/50kGazetteer/28360

The same goes for source identifiers for places found in the geonames.org place-name gazetteer; here’s the corresponding extract:

<sourceIdentifier>2655141</sourceIdentifier>
<name>Bosley</name>
<gazetteer>GeoNames</gazetteer>

Geonames uses http://sws.geonames.org/ as a namespace for its Linked Data links for places. So we can reconstruct the link for Bosley using the source identifier like this:

http://sws.geonames.org/2655141/
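
Putting the two patterns together, here’s a minimal Python sketch of reconstructing a Linked Data URI from an Unlock Places result. The function name and the <feature> wrapper element are my own assumptions; the field layout is taken from the extracts above.

import xml.etree.ElementTree as ET

# URI templates for the two gazetteers discussed above; note that the
# geonames form needs the trailing slash to work correctly.
TEMPLATES = {
    "OS Open 1:50 000 Scale Gazetteer":
        "http://data.ordnancesurvey.co.uk/id/50kGazetteer/{id}",
    "GeoNames": "http://sws.geonames.org/{id}/",
}

def linked_data_uri(result_xml):
    # Assumes each result's fields sit inside a single wrapping element.
    feature = ET.fromstring(result_xml)
    source_id = feature.findtext("sourceIdentifier")
    gazetteer = feature.findtext("gazetteer")
    return TEMPLATES[gazetteer].format(id=source_id)

print(linked_data_uri(
    "<feature><sourceIdentifier>28360</sourceIdentifier>"
    "<gazetteer>OS Open 1:50 000 Scale Gazetteer</gazetteer></feature>"
))
# -> http://data.ordnancesurvey.co.uk/id/50kGazetteer/28360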

Note that the link needs the forward slash on the end to work correctly. If you look at either of these links in a web browser, you are redirected to a human-readable page describing that place. To see the machine-readable, RDF version of the link’s contents, fetch it with a command-line program such as curl, asking to “Accept” the RDF version:

curl -L http://data.ordnancesurvey.co.uk/id/50kGazetteer/28360 -H "Accept: application/rdf+xml"
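
The same request can be made from Python with the requests library; a quick sketch:

import requests

response = requests.get(
    "http://data.ordnancesurvey.co.uk/id/50kGazetteer/28360",
    headers={"Accept": "application/rdf+xml"},  # ask for the RDF version
    allow_redirects=True,  # the equivalent of curl's -L flag
)
print(response.text)  # RDF/XML describing Bosley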

I hope this is useful to others. We could add the links directly into the default search results, but many users may not be that interested in seeing RDF links in place-name search results. Thoughts on how we could offer this as a more useful function would be much appreciated.

Connecting archives with linked geodata – Part II

This is part two of a blog post that began with a presentation about the Chalice project and our aim to create a 1000-year place-name gazetteer, available as linked data, text-mined from the volumes of the English Place Name Survey.

Something else I’ve been organising is a web service called Unlock; it offers a gazetteer search service that searches with, and returns, shapes rather than just points for place-names. It has its origins in a 2001 project called GeoCrossWalk, which extracted shapes from MasterMap and other Ordnance Survey data sources and made them available in the UK under a research-only licence, to subscribers to EDINA’s Digimap service.

Now that so much open geodata is out there, Unlock contains an open data place search service, indexing and interconnecting the different sources of shapes that match up to names. It currently covers geonames and the OS Open Data sources; we’ll be adding search of Natural Earth data in short order, and we’re looking at ways to enhance what others (Nominatim, LinkedGeoData) are already doing with search and re-use of OpenStreetMap data.
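
For example, here is a minimal sketch of querying the open place search from Python, using only the nameSearch endpoint and the parameters from the Bosley example in the previous section:

import requests

# Query Unlock's open data place search for a name, restricted to the UK.
resp = requests.get(
    "http://unlock.edina.ac.uk/ws/nameSearch",
    params={"name": "Bosley", "country": "uk"},
)
print(resp.text)  # XML results, including shapes where available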

The gazetteer search service sits alongside a place-name text mining service. However, the text mining service is tuned to contemporary text (American news sources), and a lot of that has to do with data availability and the sharing of models and training sets. The more interesting use cases are in archive mining of unusual, semi-structured sets of documents and records (parliamentary proceedings, historical population reports, parish and council records). Anything that is recorded will yield data, *is* data, back to the earliest written records we have.


Place-names can provide a kind of universal key to interpreting the written record. Social organisation may change completely, but the land remembers, and place-names remain the same. Through the prism of place-names one can glimpse pre-history; not just what remains of those people wealthy enough to create *stuff* that lasted, but of everybody who otherwise vanished without trace.

The other reason I’m here at FOSS4G: to ask for help. We (the authors of the text mining tools at the Language Technology Group, colleagues at EDINA, smart funders at JISC) want to put together a proper open source distribution of the core components of our work, for others to customise, extend, and work on with us.

We could use advice; the Software Sustainability Institute is one place we’re turning to for guidance on managing an open source release and, hopefully, a community. OSS Watch has supported us in structuring an open source business case.

The transition to a world that is open by default turns out to be more difficult than one would think. It’s hard to get many minds looking in the same direction at the same time. Legacy problems and kludges, whether technical, social, or even emotional, arise to mess things up when we try to act in the clear.

We could use practical advice on managing an open source release of our work so as to make it as self-sustaining as possible. In the short term: how best to structure a repository for collaboration, for branching and merging; where we should most usefully focus our documentation efforts; how to automate testing, freeing up effort for more creative work; and how to find the benefits in moving our way of working from a closed world to an open one.

The Chalice project has a SourceForge repository where we’ve been putting the code the EDINA team has been working on; this includes an evolution of Unlock’s web service API, and user interface / annotation code from Addressing History. We’re now working out the best way to synchronise work-in-progress with the currently published, GPL-licensed components from LTG, more pieces of the pipeline making up the “Edinburgh geoparser”, and other things…

OpenStreetMap and Linked Geodata

I’ve been travelling overmuch for the last six weeks, but have met lots of lovely people. Most recently, during a trip this week to discuss the Open Knowledge Foundation‘s part in the LOD2 consortium project, I had a long chat with Jens and Claus, the developers and academics behind LinkedGeoData, the Linked Data version of the OpenStreetMap data.

[Screenshot: the LinkedGeoData browser]

The most interesting bit for Unlock is the RESTful interface for searching the data: by point, radius, and bounding box, by feature class, and by the contents of labels assembled from tags. So it looks like OpenSearch Geo, much as Unlock’s place search API does.

Claus has made a mapping from tags, and clusters of tags, in OpenStreetMap to a simple linkedgeodata.org ontology. Here’s the mapping file (warning: it is quite large): OSM->linkedgeodata mapping rules. I pointed him at Jochen Topf’s new work on OSM tag analysis and clustering, Taginfo.

As well as the REST interface, there is a basic GeoSPARQL endpoint using Virtuoso as a Linked Data store; we ran containment queries for polygons, returning polygons, with reasonable performance. There is some fracturing in the GeoSPARQL world, both in proposed standards and in actual implementations.
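
As an illustration of the kind of containment query involved, here is a hedged sketch sent from Python over the standard SPARQL protocol. The endpoint URL is a placeholder, and the query uses the draft-standard geof:sfWithin function rather than Virtuoso’s own spatial built-ins, which is exactly the fracturing just mentioned.

import requests

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real endpoint

# Find geometries falling within a given polygon, in draft GeoSPARQL terms.
QUERY = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT ?feature ?wkt WHERE {
  ?feature geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt,
    "POLYGON((-2.2 53.1, -2.0 53.1, -2.0 53.2, -2.2 53.2, -2.2 53.1))"^^geo:wktLiteral))
}
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json())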

So we want to be able to return LinkedGeoData.org URLs in the results of our search. Right now, Unlock’s place search returns the original source identifiers (from geonames, etc.) as well as our local identifiers, for place-names and shapes. In fact, Unlock could help with mapping LinkedGeoData.org URLs across to geonames URLs, which are quite widely used and are an entry point into the bigger Linked Data web.

Another very interesting tool for making links between things on the Linked Data web is SILK, by Chris Bizer, Anja Jentsch and their research group at the Freie Universität Berlin. The latest (or still in testing?) release of SILK has some spatial inference capacity as well as structural inference. So we could try it out on, for example, the Chalice data, just to see what kinds of links can be made between URLs for linkedgeodata things and URLs for historic place-names.

We’ve recently been setting up an instance of OpenStreetMap at EDINA, for Unlock and other purposes. Our plan is to start working from Nominatim, which has a point-based gazetteer for place-names down to street address level, and to attempt to extract and/or generalise shapes, as well as points, corresponding to the names. We’re doing this to provide more and richer data search, rather than to republish the original datasets in some more, or differently, interpretable form. So there’s lots of common ground, and I hope to find ways to work together in future to make sure we complement and don’t duplicate each other’s efforts.