Data Visualisation Talk by Martin Hawksey

Today EDINA is hosting a talk by Martin Hawksey on data visualisation. He has posted a whole blog post on this, including his slides, so I won't be blogging verbatim but will try to catch the key aspects of his talk.

Martin will be talking about achievable and effective ways to visualise data. He's starting with John Snow's 1854 map of cholera deaths, which identified the epicentre of the outbreak by mapping where people died. And on an information literacy note, you do need to know how to find the story in the graphics. Visualisation takes data and stories and turns them into something of a narrative, explaining the data and enabling others to explore it.

Robin Wilton georeferenced the original Snow data, then Simon Rogers (formerly of the Guardian, latterly of Twitter) put the data into CartoDB. This reinterpretation of the data really makes the infected pump jump out at you; the different ways of visualising the data make the story even clearer.

Not all visualisations work; you may need narration. Graphics may not be meaningful to all people in the same way – e.g. the location of the pumps on these two maps. So this is where we get into theory. Jacques Bertin, a French cartographer, came up with his own system of points, lines, symbols etc. – not based on research, but his own cheat system. If you look at Gestalt psychology you get more research-based visualisations – the laws of similarity, proximity and continuity. There is something natural about where the eye is drawn, but there is theory behind that too.

John Snow's map was about explaining and investigating the data. His maps were explanatory visualisations, and we have that same idea in Simon Rogers' map, but it is also an exploratory visualisation: the reader/viewer can interact with and interrogate it. But there are limitations to both approaches. Both maps are essentially heat maps – more of something (in this case deaths). And in visualisations you often get heat maps that actually map population rather than trends. Tony Hirst says "all charts are lies": they are always an interpretation of the data from the creator's point of view…

So going back to Simon Rogers' map, we see that the radius of each dot is based on the number of deaths. Note from the crowd: "how to lie with statistics". Yes, a real issue is that a lot of the work to get to that map is hidden, with lots of room for error and confusion.
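
An aside on that dot-scaling point: the eye reads area, not radius, so scaling radius linearly with deaths exaggerates the big values. A minimal sketch, with made-up numbers:

```javascript
// Sketch: scale dot *area*, not radius, so a value twice as big
// reads as twice as big. Numbers here are made up for illustration.
const maxRadius = 20; // px, for the largest value

function radiusFor(value, maxValue) {
  // Area-true: r grows with sqrt(value). A linear radius exaggerates
  // big values, because the eye compares areas (which grow with r²).
  return maxRadius * Math.sqrt(value / maxValue);
}

console.log(radiusFor(8, 8)); // 20 – the largest dot
console.log(radiusFor(2, 8)); // 10 – a quarter of the area for a quarter of the deaths
```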

So, having flagged up some examples and pitfalls, I want to move on to the process of making data visualisations. Tools include Excel, CartoDB, Gephi, IBM Many Eyes, etc., but in addition to those tools and services you can also draw. Even now many visualisations are made via drawing, if only for final tweaking. Sometimes a sketch of a visualisation is the way to prototype ideas too. There are also code options: d3.js, sigma.js, R, ggplot, etc.
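
To give a flavour of the code route, here is a minimal d3.js sketch – my own illustrative example rather than anything from the talk – drawing one circle per data point with area scaled to value:

```javascript
// Minimal d3 (v7) sketch: one circle per data point, area scaled to value.
// Illustrative only – the data and positions are made up.
import * as d3 from "d3";

const deaths = [
  { x: 40,  y: 60, n: 8 },
  { x: 120, y: 90, n: 2 },
  { x: 200, y: 40, n: 5 },
];

// A sqrt scale keeps circle area proportional to n (see the aside above).
const r = d3.scaleSqrt()
  .domain([0, d3.max(deaths, d => d.n)])
  .range([0, 20]);

d3.select("body").append("svg")
    .attr("width", 300).attr("height", 150)
  .selectAll("circle")
  .data(deaths)
  .join("circle")
    .attr("cx", d => d.x)
    .attr("cy", d => d.y)
    .attr("r", d => r(d.n))
    .attr("fill", "crimson");
```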

Some issues around data: data access can be a problem – data can be hard to find, and it can be hard to identify the source data. Tony Hirst really recommends digging around for feeds, for RSS – finding the stuff that feeds and powers pages. There are tools for reshaping feeds and data, places like Yahoo Pipes, which lets you do drag-and-drop programming with input data. And that touches on data shapes: data may be provided in certain ways or shapes that don't suit your use. So a core skill is transforming data to reshape it, with tools like Yahoo Pipes and OpenRefine – which also lets you clean up data. I've tried OpenRefine with public JISCMail lists, to normalise entries for people with multiple user names (see the sketch below).
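
That name-normalising step can be sketched in a few lines of JavaScript – a rough approximation of OpenRefine's "fingerprint" clustering key, not its exact algorithm:

```javascript
// Simplified take on OpenRefine's "fingerprint" key: two strings that
// normalise to the same key are probably the same person.
function fingerprint(name) {
  const tokens = name
    .trim()
    .toLowerCase()
    .normalize("NFD").replace(/[\u0300-\u036f]/g, "") // strip accents
    .replace(/[^a-z0-9\s]/g, " ")                     // punctuation -> space
    .split(/\s+/)
    .filter(Boolean);
  return [...new Set(tokens)].sort().join(" ");       // dedupe + sort tokens
}

fingerprint("Hawksey, Martin");  // "hawksey martin"
fingerprint("martin  HAWKSEY");  // "hawksey martin" – same key, so cluster them
```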

So now the fun stuff…

For the Olympics last year – the Cultural Olympiad in Scotland – we had #citizenrelay, tracking the progress of the Olympic torch. So, lots of data to play with. First up: a Twitter (Topsy) media timeline, using TimelineJS from VéritéCo plus Topsy data. This was really easy to do. Data access was via Topsy, which pulls in data from Twitter to make its own archive and has an API that makes it easy to query for media against a hashtag. It can return data in XML, but I grabbed it in JSON. The output was then created with TimelineJS. You can also use the Google Spreadsheet template from TimelineJS (filled in manually or automatically). I used a spreadsheet here, with Yahoo Pipes to manipulate the data. You can pull data into Google Spreadsheets and, once you've created the formula, it will constantly refresh and update – so the timeline self-updates once published.
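
That self-updating formula pattern can be sketched as a Google Apps Script custom function. Topsy is long gone, so the endpoint and JSON field names below are hypothetical stand-ins:

```javascript
// Google Apps Script custom function: put =MEDIATWEETS("#citizenrelay")
// in a sheet cell and it returns a 2D array that fills the sheet, refreshing
// when the sheet recalculates. The endpoint and response shape below are
// hypothetical stand-ins for the old Topsy media search.
function MEDIATWEETS(hashtag) {
  const url = "https://api.example.com/media?q=" + encodeURIComponent(hashtag);
  const items = JSON.parse(UrlFetchApp.fetch(url).getContentText()).results;
  // One row per item: date, author, text, media URL.
  return items.map(i => [i.date, i.author, i.text, i.media_url]);
}
```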

Originally Topsy allowed data access without an API key, but now they require one. Google Apps Script – JavaScript-based, with a big Stack Overflow community – has a similar cURL-like function for fetching URLs and dumping the results back into a spreadsheet. I have also done this with Yahoo Pipes (using the Rate module for the API key aspect).
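
That fetch-and-dump pattern looks roughly like this in Apps Script – the URL, key and field names are placeholders, not a real API:

```javascript
// Apps Script sketch: fetch a keyed API and append rows to the active sheet.
// URL, API key and JSON field names are placeholders.
function pullData() {
  const url = "https://api.example.com/search?q=%23citizenrelay&apikey=YOUR_KEY";
  const data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  const sheet = SpreadsheetApp.getActiveSheet();
  data.results.forEach(r => sheet.appendRow([r.created_at, r.user, r.text]));
}
```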

Next: as the relay went around the country they used Audioboo. When you upload, Audioboo geolocates your Boos. Audioboo has an API (no key required) and you can filter by tag. You can get the data out as XML, JSON or CSV, but they also produce KML. If you take the URL of a public KML file and paste it into the Google Maps search box, it just gives you the map. You can then embed it, or share a link to that file. So, a super easy visualisation there. Disappointingly, though, it didn't embed the audio in the map pins – but that's a Google Maps limitation. Google Earth does let you do that…
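
And going the other way: if a service only gives you JSON, building a bare-bones KML file yourself takes only a few lines (field names here are illustrative; real KML usually carries more metadata):

```javascript
// Bare-bones KML from geotagged items. Field names are illustrative.
// Note KML coordinates go longitude,latitude – in that order.
function toKml(items) {
  const placemarks = items.map(i => `
    <Placemark>
      <name>${i.title}</name>
      <Point><coordinates>${i.lng},${i.lat}</coordinates></Point>
    </Placemark>`).join("");
  return `<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2"><Document>${placemarks}</Document></kml>`;
}

toKml([{ title: "Boo from Edinburgh", lat: 55.95, lng: -3.19 }]);
```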

So using Google Earth we only have a bit of work to do: we need to work out the embed code. Google now provides templates that let you bring in placemark data (placemark templates). You can easily make changes here, and you can choose how to format variables. You can fill the template in manually, but it can also be done automatically, so I use Google Apps Script here: it goes to the Audioboo API, grabs the data as JSON, parses it, and pushes each item to the spreadsheet. So for this kind of geodata these Google templates are really useful. Something else to mention: Google Spreadsheets are great, they sit in the cloud. But recently I was using Kasabi, and it went down… and everything relying on it went down with it. So it is sometimes useful to take a flat capture of a spreadsheet as a backup.
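
On that backup point, a values-only snapshot is a short Apps Script pattern – the sheet names here are whatever you happen to use:

```javascript
// Apps Script sketch: copy the live (formula-driven) sheet into a static
// values-only backup sheet, so the data survives if an upstream API dies.
function snapshot() {
  const ss = SpreadsheetApp.getActiveSpreadsheet();
  const live = ss.getSheetByName("live");          // the sheet full of formulas
  const backup = ss.insertSheet("backup " + new Date().toISOString());
  const values = live.getDataRange().getValues();  // values, not formulas
  backup.getRange(1, 1, values.length, values[0].length).setValues(values);
}
```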

So the next visualisation used NodeXL, a social network analysis tool. This is an open source plug-in for Excel. It has a number of data importers – including for Twitter, Facebook, MediaWiki, etc. – right from the menu. And it has lots of scope for reformatting etc., then gives you a grid view of the data.

And this is where we start chaining tools together. I had Twitter data, and I had NodeXL to identify the community (who follows whom, who is friends with whom), so I used Gephi, which lets you start using network graphs – a great way to see how nodes relate to each other. It is often used for social network analysis, but people have also used it for cocktail recipes (there's an academic paper on it), and there is a recipe site that lets you reform recipes using the same approach. Gephi is another tool where you spend an hour playing… and then wonder how to convey the result to others, and you can end up with a flat graphic. So I created something called TAGS Explorer to let anyone interact – and there are others who have done similar.
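
If you come to Gephi without NodeXL, the minimal input is just an edge list: a CSV with Source and Target columns. A sketch from hypothetical follower data:

```javascript
// Turn "who follows whom" data into a Source,Target edge list CSV that
// Gephi's spreadsheet importer understands. The data shape is hypothetical.
const follows = [
  { user: "alice", follows: ["bob", "carol"] },
  { user: "bob",   follows: ["carol"] },
];

const csv = ["Source,Target",
  ...follows.flatMap(u => u.follows.map(f => `${u.user},${f}`)),
].join("\n");

console.log(csv);
// Source,Target
// alice,bob
// alice,carol
// bob,carol
```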

Another example here: a network of those using the #ukoer hashtag, looking for the bridges in the community – the key people. This is an early visualisation I created. It was generated from Twitter connections and tag use with Gephi, but then combined and finished in a drawing package.

This is another example looking at different sources: a bubble chart of click-throughs for tweets. You can get a degree of that info from bit.ly, but if you use another shortening service it's hard to get click-throughs. However, you can see referrals in Google Analytics – each Twitter URL is unique to each person who tweets it, so you can see the click-through rate for an individual tweet. This is created in a Google Spreadsheet, which you can explore interactively and reshape for your own exploration. The spreadsheet uses the Google Analytics API and the Twitter API, then combines the two with some reshaping. One thing to be aware of is that spreadsheets have a duality of values and formulae, so when you call on APIs etc. it can get confusing. It is sometimes good to use two sheets, the second for manipulation. There's a great blog post on this duality – "spreadsheet addiction". If you are at IWMW next week I'm doing a whole session on Google Analytics data and reshaping.
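
The join behind that bubble chart can be sketched like this: Google Analytics gives you referral counts keyed by the per-tweet URL, Twitter gives you the tweets, and you match the two up. The data shapes below are illustrative, not the real API responses:

```javascript
// Sketch of the join: per-tweet click counts (from GA referral paths)
// matched to tweets by their unique t.co URL. Shapes are illustrative.
const clicksByUrl = { "http://t.co/abc123": 42, "http://t.co/def456": 7 };
const tweets = [
  { user: "@mhawksey", url: "http://t.co/abc123", followers: 3000 },
  { user: "@edina",    url: "http://t.co/def456", followers: 1200 },
];

const rows = tweets.map(t => ({
  user: t.user,
  clicks: clicksByUrl[t.url] ?? 0,
  // A crude per-tweet "click-through rate": clicks per follower reached.
  ctr: (clicksByUrl[t.url] ?? 0) / t.followers,
}));
```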

Q&A

Comment: we have a study/working group on social network analysis; some of these techniques could be built into our community of expertise here.

Comment: you would have to slow way down for me, but hopefully we can devise materials and workshops to make these techniques step by step.

Martin: But there are some really easy wins, like that Google Maps one. And there is a good community of support around these tools – with R, for instance, if I ask on Stack Overflow I will get an answer back.

Q: Is there a risk that if you start trying to visualise data you might miss out on proper statistical processes and rigour?

Martin: Yes, that is a risk. People tend to be specialists in one area rather than all of them. Manchester Metropolitan use R as part of their analysis of student surveys, recruitment etc. That came from an idea of Mark Stubbs, head of eLearning, raised through speaking to a specialist in Teridon flight. R is widely used in the sciences and increasingly in big data analysis. So there it started with an expert who did know what he was doing.

Q: Have you done much with data mining or analysis, like Google Ngram?

Martin: Not really. I have done some work on sentiment analysis and social network data, though.

