Digital Scholarship Day of Ideas 2014: “Data” – LiveBlog

Today I am at the University of Edinburgh Digital Humanities and Social Sciences Digital Scholarship Day of Ideas 2014, which is taking place at the Edinburgh Centre for Carbon Innovation, High Street Yards, Edinburgh. This year’s event takes, as its specialist focus, “data”. These notes have been taken live so my usual disclaimers apply and comments, questions and corrections are, as ever, very much welcomed.

Introduction: Prof Dorothy Miell, Head of College of Humanities and Social Science

I’m really pleased to welcome everybody here today. This is our third Digital Scholarship Day of Ideas; these days are an opportunity to bring in interesting outside speakers, but also for all of us interested in this area to come together, to network and build relationships, and to take work forward. Again today we have a mixture of international and local speakers, and this year we are keeping us all in one room so we can all hear from those speakers. I am really glad to see such popular take-up for the day, and mixing from across the College and Information Services.

Digital HSS, which organised this event, is a strand of work that Sian Bayne leads; there are a series of events throughout the year in that strand, as well as these annual days.

Today we are going to be talking about the idea of data, particularly what data means for scholars in the humanities, how can we understand the term Big Data that we hear in the Social Sciences, and how can we use these concepts in our own work.

Sian Bayne, Associate Dean (digital scholarship), is introducing our first speaker. Annette describes herself as an “itinerant researcher”. Annette’s work focuses on internet and qualitative research methods, and the ethical aspects of internet research. I think she has a real talent for great paper titles. One of my favourites is “Undermining Data” – which today’s talk is partially based on – but I also loved that she had a paper entitled “Fieldwork in Social Media: What Would Malinowski Do?”. Anyway, I am delighted to welcome Professor Annette Markham.

Can we get beyond ‘data’? Questioning the dominance of a core term in scientific inquiry - Prof Annette Markham, Department of Informatics, Umeå University, Sweden; Department of Aesthetics & Communication, Aarhus University, Denmark; School of Communication, Loyola University, Chicago (session chair: Dr Sian Bayne)

As Sian mentioned I have spent a lot of time… I was a professor for ten years before I quit in 2007 and pushed myself across other disciplines, to push forward some philosophical work on methods. For the last 5 years or so I’ve been thinking about innovative and creative ways to think of methods that resonate better with the complexity of modern life. I work with STS – Science and Technology Studies – scholars in Denmark, informatics scholars, machine learning scholars in Boston, language scholars in Helsinki… So a real range across the disciplines.

The work today is around methods work I’ve done with colleagues over the last few years, much of which is captured in a special issue of First Monday (Vol 18, No 10): Making Data – Big Data and Beyond. And this I’m doing from a post-humanist, STS, non-positivist sort of perspective, thinking about the way in which data can be used to indicate that we share an understanding when actually we are understanding the same information in very different ways. For some, data can be an easy term, consistent with your world view… a word that you understand in your own method of inquiry. Data and data sets might be familiar parts of your work. We all come from somewhere, we all do research… what I say may not be new, or may be totally new… it may resonate… or not at all… but I want this to be a provocation, to make you question and think about data and our methods.

So, why me? Well, mainly I guess because I know about methods… this entire talk is part of a bigger project where I look at method, at forms of inquiry… but looking at method directly isn’t quite right; I look at it from the side, from the corner of the eye… And to look at method is to look at the conditions in which we undertake inquiry in the 21st century. For many of us inquiry is shaped by funding, and funding privileges that which produces evidence, which can be archived. For many qualitative researchers this is unthinkable… a coffee stain on field notes might have meaning for you as an ethnographer, but how can that have meaning for anyone else? How can that be archivable or shareable or mineable?

And I think we also have to think about what it is that we do when we do inquiry, when we do research… to get rid of some of the baggage of inquiry – like collecting data, analysing and then writing up – as there are many forms of inquiry that don’t fit that linear approach. Another way to think of this is to think of frames, of how we frame our research. As an American scholar trained in the Chicago School of Sociology, I cannot help but cite Erving Goffman. Frames tell us to focus on some things, and to ignore others… So if I show you a picture of a frame here… if I say Mona Lisa you might think of that painting. If I tell you to look outside of the frame you might envision the wall, or the gallery, or what sits outside that frame. And if you change the frame it changes what you see, what you focus on… so if I show you a frame diagram of a sphere and say that is a frame, a frame for research, what do you see? (Some comment they see the globe, they see 3D techniques, they see movement.) The frame tells us to think about certain phenomena… and also not to think about others… if I say Mona Lisa now… we think of very different things… Similarly an atomic-structure type image works as a very different type of frame – no inside or outside but all interconnected nodes… But it’s almost impossible to easily frame, again, Mona Lisa…

So, another frame – a not-quite-closed drawn circle – and this is to say that frames don’t tell you a lot about what they do… and Goffman and others say that frames work best when they are almost invisible… like maps (except, say, McArthur’s Universal Corrective Map). So, by repositioning a map, or by standing in an elevator the wrong way and talking to people – as Harold Garfinkel had his students do – we have a frame that helps us look differently at what we do. “Data” can make us think we look at the same map, when we are not… Data need not be understood as a shortcut term or a metonym; it can be taken instead as pre-existing aspects of the phenomenon that have been filtered and created through a process, and organised in some way. Not the meaning I want for my work, but not good or bad…

So I want to come back to “How are our research sensibilities being framed?”. In order to understand inquiry we have to understand three other things. (1) How do we frame culture and experience in the 21st Century; (2) How do we frame objects and processes of inquiry; (3) How do we frame “what counts” as proper and legitimate inquiry?

For me (1), as someone focused on internet studies, I think about how our research context has shifted, and how our global society has shifted, since the internet – it’s networked, for instance. It is also interesting to note how this frame has shifted considerably since the early days of the internet… Take an image from the Atlas of Cyberspace – an image suggesting the internet as a tunnel. Cityscapes were also common ways to understand that world. MIT suggested different ways to understand a computer interface. This is about what happened, the interests, in the early days of the internet in the 90s. That playfulness and those radical ideas changed as commerce became a standard part of the internet. Skipping forward to Facebook, for instance… interfaces are easy to understand, friendly; almost all social media looks the same, almost all websites look the same… and Google is a real model for this, as their interface has always been so clean…

But I think the significant issue here is that socio-technical research and understanding have been shaped by these internet interfaces we encounter on a daily basis.

For me, frame (2) hasn’t changed that much… two slides… this, to me, represents any phenomenon or study – a whole series of different networks of nodes connected to the centre. There is no obvious starting point. It is not clear what belongs in the centre – a person, an event, a device – and there are all these entanglements characterising these relationships. And yet our methods were designed for, and work best in, traditional anthropological fieldwork conditions… And the process is still very linear in how we understand it – albeit with iterative cycles – but it’s still presented that way. And that matters, as it privileges the neat and tidy inquiry over the messy inquiry, the inquiry without clear conclusions… so how we frame inquiry hasn’t changed much in terms of inquiry methods.

Finally, and briefly, (3): my provocation is that I think we’ve gone backwards… you can go back to the 60s or earlier and look at feminist scholars and their radical rethinking of scientific method, and situated research. But as budgets tighten, as research is funded under more conservative conditions, this stuff that isn’t well understood isn’t as popular… so we’ve seen a return to evidence-based methods, to clear conclusions, to scientific process, particularly in media coverage of research. It’s still a dominant theme…

So… What is data?

I don’t want to be glib here. The word “data” is awfully easy to toss around. It is. In everyday life this term is a metonym for lots of stuff, highly specific but unspecified stuff. It is arguably quite a powerful rhetorical term. As Daniel Rosenberg says, the use of the term data has really shifted over the last few hundred years. It appeared in the 1760s or so. Many of those associated with the word only had it appear in translations posthumously. It is derived from Latin and, in the 1760s, it denoted conditions that exist before argument. Then, later, something that exists before analysis. And in that context data has no theoretical baggage. It cannot be questioned. It always exists… it has an incontrovertible it-ness. A “fact” can be proven false. But false data is still “data”. Over time and usage “data” has come to represent the entirety of what the researcher seeks and needs in pursuit of the goal of inquiry. To consider the word from my non-positivist stance, I ask “what is data within the more general idea of inquiry?”. In the mid 1980s I was taught not to use that word; we collect materials, we collect artefacts as ethnographers… and we construct… data… see, even I used it there, it’s so hard not to. It has been operationalised as discrete and incontrovertible.

Big data has brought out critical responses; they are timely and subtle… boyd and Crawford (2011) came up with six provocations for big data. And Nancy Baym (2013) talks about all social media metrics being a nonrepresentative, partial sample, and about the inherent ambiguity that arises from decontextualising a moment of clicking from a stream of activity and turning it into a standalone data point. Bruno Latour talked about this too, in discussing soil samples from the Amazon – removing something from its context.

And this idea disturbs me, particularly when understanding social life as represented in technology. Even outside the Western world, even if we don’t use technology, as Sonia Livingstone notes, we are all implicated in technology in our everyday life. So, I want to show you a very common metaphor for everyday life in the 21st century – a Samsung Galaxy SII ad. I love this ad – it’s low-hanging fruit for rhetorical critique! It flattens everything – your hopes and dreams offered at equal value to services or products you might buy… and flattens them as equal into infinitesimal bits that swirl around, can be transmitted, transformed, controlled – as long as we purchase that particular phone. An interesting depiction of life as data – and of humans and their data as new. It’s not unusual, and not a problem as long as we don’t buy into it as a notion uncritically.

This ad troubles me more. This is Global Pulse, an initiative under the UN, which distributes data on prices in the developing world. It follows the story of a woman affected by price shifts. So this ad… it has a lot of persuasive power, and I want to be careful about the argument that I make to conclude…

I really like what we get from many big data analyses. I have nothing against big data or computational analysis. Some of the work you’ll hear about today is extraordinary, powerful… I’m not making an argument against data, or against using data to solve certain problems. I want to talk about what Kate Crawford calls “big data fundamentalism”. I wouldn’t go that far… algorithms can be powerful, but not all human experience can be reduced to data points, and not everything can be framed by big data. Data can be hugely valuable, but it’s important to trouble what is included and what is missed by big data. That advert implies data can be understood as it happens. Data is always filtered, transformed, framed… and from that you draw conclusions. Data operates within the larger framework for inquiry. We have to remember that we have strong and robust models for inquiry that do not place data at the core. Data might be important – but it should be the chorus, not the main player on the stage. The focus of non-positivist research is upon collecting the messy stuff…

And I wanted to show a visualisation, created in Gephi by one of my colleagues, who looked at Arab Spring coverage in media and social media in Sweden… In doing this, as he shifts the algorithm he is manipulating data, changing how the data appears to us, changing variables to make his case… most of the algorithms in Gephi create neat, round visualisations. Alex Galloway critiques this by saying that some forms may not be representable, and this tool does not accommodate that – or rather, it encourages us to think that all networks can be visualised in that way. These visualisations and network analyses are about algorithms… So I sort of want to leave it there, to say that data functions very powerfully as a term… and that from a methodology perspective it creates a very particular frame that warrants concern, particularly when the dominant context tells us that data is the way to do inquiry.
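[To make that point concrete, here is a minimal sketch – in Python with networkx and matplotlib rather than Gephi, and with a stock example graph standing in for the real data – of how the same network “looks” quite different under different layout algorithms. Everything in it is illustrative, not anything from the talk.]

```python
# A sketch, not the speaker's Gephi workflow: the same network drawn with
# two different layout algorithms looks very different, showing how layout
# choices shape what a visualisation appears to say about the data.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.karate_club_graph()  # stock example graph, standing in for real data

layouts = {
    "spring (force-directed)": nx.spring_layout(G, seed=1),
    "circular": nx.circular_layout(G),
}
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, pos) in zip(axes, layouts.items()):
    nx.draw(G, pos=pos, ax=ax, node_size=40)
    ax.set_title(name)
plt.show()
```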

Q&A

Q: I enjoyed that, but I find you more pessimistic than I would be. That last visualisation shows how different understandings of that network are possible. It’s easy to create a strawman like this, but I’ve been reading papers where videos are included in papers… the audience can all think about different interpretations. We can click on a data point, to see that interview, to see that complex account of that point. There are many more opportunities to create richer entanglements of data… we should emphasise those, emphasise that complexity rather than hide the complexity of how that data is created.

A: Thanks for finishing my talk for me! If we consider the generative aspects of inquiry then we can use the tools to be transparent about the playfulness of interrogation, by offering multiple interpretations… I talk about a process of Borrow / Play / Move / Interrogate / Generate. So I was a bit pessimistic – that Global Pulse ad always depresses me. But I agree!

Q: I was taken by your argument that human experience cannot be reduced to a single data point… what else can it be reduced to… it implies an alternative to data… so what might that be?

A: I think that question is not one that I would ask. To me that is not the most important question. For me it’s about how we might make social change – how might I create interventions, how might I represent someone’s story. I’m not saying that there is an alternative… but that discussion of data in general puts us in that sort of terrain… and what is more interesting or important is to consider why we do research in the first place, why do we want to look for a particular phenomenon… to not let data overwhelm any other arguments.

Q: I think your talk noted that big data focuses on how people are similar and what similarities there are, whilst ethnography tends to be about difference. That makes data tracking that covers most people particularly depressing. Is that the distinction, though?

A: I think I would see it as simplification versus complexity… how do we envision inquiry in ways that try to explode the phenomenon into an even more complex set of entanglements and connections? It may be about differences but doesn’t have to be… it’s about what emerges from a more generative process… it’s an interesting reading though, I wouldn’t disagree.

Q: I wanted to share a story with you, of finishing my PhD, a study of social workers done when I was a social worker. I had an interview for a research post at the Scottish Government and one of the panel asked me “and how did you analyse your data?”, and I had never thought of my interviews and discussions as data… and since then I’ve been in academia for 20 years, but actually I’ve had to put that idea, that people are not data, aside to progress my career – holding onto the concept but learning to talk the talk…

A: I can relate to that. You hear that a lot – struggling to find the vocabulary to make your work credible and understandable to other people. With my students I help them see that the vocabulary of science is there, and has been dominant… and help them use other terms to replace the terms they use in their inquiry, in their method… these terms of mine (Borrow / Play / Move / Interrogate / Generate) get them thinking another way, make them look at their work in a different way from that dominant method. These become a way that people can talk about the same thing but with less weighty vocabulary, or terms that do not carry that baggage. So that’s one way I try to do that…

Crowd-sourced data coding for the social sciences: Massive non-expert coding of political texts - Prof Ken Benoit, Professor of Quantitative Social Research Methods, London School of Economics and Political Science (session chair: Prof John McInnes)

Professor John McInnes is introducing our next speaker, Professor Ken Benoit. Ken not only talks about big data but has the computational skills to work with it.

I will be showing you something very practical…. I had an idea that I’d do something live… so it could be an Epic Fail!

So I took the UKIP European Election Manifesto… converted it to plain text in my text editor. Made every sentence one line… put it into a spreadsheet… Then I’m running it on CrowdFlower with some test questions… So I’ll leave that to run…

So, back to my talk… the goal is to measure unobservable quantities… we want to understand ideology – “left-right” policy positions… we have theories of how people vote: that they vote for parties most proximate to their own positions. For political scientists this is a huge issue. We might also want to measure corruption, cultural values, power… but today I’m going to focus on those policy positions.

A lot of political science data is “created” by experts… a lot of it is, frankly, made up. A lot of it is about hand-coded text units – you take a text, you unitise it… e.g. immigration policy statements… (Comparative Manifesto Project, Policy Agendas Project). Another way is solicited expert opinion (Benoit and Laver, Chapel Hill, etc.) – I worked with Laver for years looking at understanding of the policies of each party. It’s expensive work, takes an expert an hour to fill out a form… a real headache… We have expert-completed checklists (Polity, Comparative Parliamentary Democracy Dataset, Freedom House, etc.). There are coded international events (KEDS, Penn State Event Data). And we have inductively scaled quantities (factor analysis, such as “Billy Joe Jimbob” factor analysis).

So what are some of the problems of coding using “experts”? Who are experts anyway? It is difficult to find coders who are suitably qualified – hard to find them AND hard to train them… most of the experts coding texts tend to be PhD students who find it a pleasing thing to do whilst avoiding finishing their thesis. There can be knowledge effects, since no text is ever anonymous to an expert coder with country knowledge. Human coders are unreliable – their codings of the same text unit will vary wildly. And even single coding is relatively costly and time-consuming, so only one coder codes each text. Even when you pay the experts, they are still doing you a favour!

So I will talk about an alternative solution to this problem, and the problem is about classifying text units. The idea is to observe a political party’s policy position by content analysis of its texts, and party manifestos are the most common texts. The idea behind content analysis is breaking text into small units and then using human judgement to apply pre-defined codes – e.g. coding something as right-wing policy. And usually that is done for LOTS of sentences by only ONE coder.

Tomorrow I’ll be in Berlin… the biggest (only?) game in town is the Comparative Manifesto Project (CMP). This is a huge project with 3,500 party manifestos from 55 countries, covering 1945-2010 and still going. Human coders are trained and have PhDs. They break manifestos into sentences and use human judgement to apply pre-defined codes, each sentence assigned to one of 56 policy categories. Category percentages of the total text are used to measure policy. And each manifesto is seen, and coded, by just one coder.

So… what could we do? Crowd-sourcing involves outsourcing a task by distributing it to an unspecified group, usually in parts… the basic idea of this, versus expert coding, is that it reduces the expertise of each of the coders but increases the number of coders. Distribute texts for coding partially and randomly. Increase the number of coders per sentence. Treat different coders as exchangeable – and anonymous; we don’t care if they are sitting in an internet cafe in Estonia in their underwear, or whether they engage on a day off from a bank…

The coding scheme here is a more simplified one. We applied it to 18 of the “big 3” British party manifestos from 1987 to 2010. A sentence can be coded as Economic, Social or neither… under either of the first two categories there are further options (anti, neutral or pro), scaled from “very left” to “very right”, or “very liberal” to “very conservative”. And there is a 10-question test to show correct codings, to guide the coder and to keep them on track.

So, to get this started we wanted a comparison we understood: we wanted to compare crowd coding to expert coding. So my colleague and I, and some graduate students, coded a total of 123,000 sentences between us, with between 4 and 6 coders per manifesto, using the same system to be deployed to the crowd. This was a benchmark for the crowd-sourcing end of things. It took ages to do… that’s a lot of expert coding, and in practice you wouldn’t get this happening… For the crowd-sourced codings we got almost twice as many codings…

We used an IRT-type scaling model to estimate position. We didn’t want to just take averages here… we used a multinomial method. We treat each sentence as an item, to which the manifesto is responding, and the left-ness or right-ness (etc.) as a quality they exhibit. Despite that complexity, we found that a mean-of-means approach led to very similar results. We are trying to simplify that multinomial method… but now, the results…
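[As a rough illustration of that “mean of means” aggregation: the sketch below averages the multiple crowd codings of each sentence, then averages the sentence means within each manifesto. The codings, manifesto names and the −2 “very left” to +2 “very right” scale are invented for illustration – this is not the project’s data, nor its IRT model.]

```python
# Hedged sketch of "mean of means" aggregation of crowd codings.
# Each coding is (manifesto, sentence_id, score) with score on an
# illustrative -2 (very left) .. +2 (very right) economic scale.
from collections import defaultdict
from statistics import mean

codings = [
    ("party_A_1997", 1, -2), ("party_A_1997", 1, -1), ("party_A_1997", 1, -2),
    ("party_A_1997", 2,  0), ("party_A_1997", 2, -1),
    ("party_B_1997", 1,  2), ("party_B_1997", 1,  1), ("party_B_1997", 2,  2),
]

# First average the multiple crowd codings of each sentence...
sentence_scores = defaultdict(list)
for manifesto, sent_id, score in codings:
    sentence_scores[(manifesto, sent_id)].append(score)
sentence_means = {k: mean(v) for k, v in sentence_scores.items()}

# ...then average the sentence means within each manifesto.
manifesto_sents = defaultdict(list)
for (manifesto, _), m in sentence_means.items():
    manifesto_sents[manifesto].append(m)
positions = {m: mean(v) for m, v in manifesto_sents.items()}
print(positions)  # e.g. {'party_A_1997': -1.08..., 'party_B_1997': 1.75}
```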

Comparing expert codings to expert surveys on economic and social positions looks pretty good… a good correlation, particularly for economic positions – which is what we’d expect, and what we see.

We tested to see how best to serve up sentences… we tried them in order and out of order, and found a .98 correlation, so order doesn’t matter…

For the crowd-sourcing we used CrowdFlower, a front end to many crowd-sourcing platforms, not just Mechanical Turk. It uses a quality monitoring system: you have to maintain an 80% “trust” score or you are rejected. Trust is maintained through “gold” questions, carefully selected and generated by experts…
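[A minimal sketch of how that gold-question “trust” gating might work. The 80% threshold comes from the talk; the gold answers and data structures here are assumed for illustration, not CrowdFlower’s actual implementation.]

```python
# Sketch: a coder's trust score is their accuracy on hidden gold questions;
# coders falling below the threshold are dropped and their codings discarded.
GOLD = {101: "economic", 102: "social", 103: "neither"}  # id -> correct code
THRESHOLD = 0.8  # from the talk

def trust_score(coder_answers):
    """coder_answers: dict of sentence_id -> code, for all sentences seen."""
    gold_seen = [sid for sid in coder_answers if sid in GOLD]
    if not gold_seen:
        return 1.0  # no gold questions encountered yet
    correct = sum(coder_answers[sid] == GOLD[sid] for sid in gold_seen)
    return correct / len(gold_seen)

def keep_coder(coder_answers):
    return trust_score(coder_answers) >= THRESHOLD
```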

So, we can go back to the live experiment… it’s 96% complete!

So, looking at results in two dimensions… if the Liberal Democrats were actually liberal they would be right on economics and left on social policy… but actually they are more left on economics. The Conservatives are on the right socially but getting nearer the left in some cases… but it’s not about the analysis so much as the comparison with the benchmark…

When we look at expert codings versus crowd coders… well, the points are all over the place, but we see correlations of 0.96 for the economic and 0.92 for the social dimension. So in both cases there isn’t total agreement – we either have a small crowd of experts or a bigger crowd of non-experts. It’s always an average, just a matter of scale…

So, how many coders do we need? There is no need for 20 codings of a sentence if it’s clearly not about immigration policy… we massively oversampled, then drew subsets to estimate standard errors… we saw that as coders are added the uncertainty starts to collapse. The rate of collapse for experts is substantially steeper… in aggregate you need roughly five times more non-expert coders than experts. But you can get good codings with five coders…
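[The oversample-then-subsample exercise can be sketched like this: simulate a large pool of codings per sentence, then watch the spread of the mean-of-means estimate collapse as the number of coders drawn per sentence grows. The data here are simulated, not the study’s codings or its scaling model.]

```python
# Sketch: standard error of an aggregate position estimate vs. coders per
# sentence, using a simulated oversampled pool of crowd codings.
import random
import statistics

random.seed(0)
# 20 noisy codings per sentence for 100 sentences, clamped to a -2..+2 scale.
true_scores = [random.uniform(-2, 2) for _ in range(100)]
pool = [[max(-2, min(2, round(t + random.gauss(0, 1)))) for _ in range(20)]
        for t in true_scores]

def position_estimate(k):
    """Mean of sentence means using k randomly drawn codings per sentence."""
    return statistics.mean(
        statistics.mean(random.sample(codes, k)) for codes in pool
    )

for k in (1, 3, 5, 10, 20):
    draws = [position_estimate(k) for _ in range(200)]
    print(k, round(statistics.stdev(draws), 4))  # spread shrinks as k grows
```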

So we did some tests for immigration policy… using the 2010 British manifestos, knowing that there were two expert surveys on this dimension (but no CMP measures). We coded only whether a sentence was about immigration or not, and whether its position on immigration was positive or not. It cost about $300. We ran it again, at the same cost, with extremely similar results…

Doing this we had a 0.96 correlation with the Benoit 2010 expert survey, a .94 correlation with the Chapel Hill survey, and between the two runs a correlation of around 0.94. It would have been higher, but the experts differentiated between the immigration policies of Labour and the Conservatives on the basis of positions that were not obvious in the text – positions that experts knew about…

So, who are these people? Who are these crowd coders? They are from all over the world… the top countries were the USA, Britain, India and Estonia. One person coded over 10,000 sentences! Crazy person loves coding! The mean trust score rarely drops below 0.8, as you’ll be booted off if it does… You don’t pay, or take data from, those that fail. Where are these jobs being sourced? We tried Mechanical Turk… we’ve used CrowdFlower… there are huge numbers of these sites – a student looked at about 40 of them… but trust scores are great no matter how these people are sourced… Techniques are not all ideal… but coders don’t stay in the system if their trust score drops. There is no relationship between coder quality and platform…

Conclusions here. Non-experts produce valid results; you just need a few more of them. Experts have variance, have noise, so experts are just another version of a crowd with higher expertise (lower variance). Repeat experiments show that the method is reliable (and replicable). Some places require your work to be replicable… is data plus script a good way to do that? Here you really can: you can replicate everything. You can redo in February what you did in December… with the right text you can reproduce the result. Why does this appeal? Well, it’s cheap, it’s flexible. Great for PhD students who lack expert access. And you can work independently of big organisations that have their own agenda for a study. You can try an idea, run again, tweak, see what works… and go back again… And this works for any data production job that is easily distributed into simple tasks… sign up for Mechanical Turk, be a worker, see what it’s like to actually do this… for instance transcriptions of audio tapes: a common job is that they upload 5-second clips and you transcribe one; that gives you pretty good human transcription, which timestamps weave back together. Better than the computer method…

So, we are 100% finished with our UKIP crowdsourcing experiment… Interestingly 40 negative, 48 positive… needs further analysis…

Q&A

Q: In terms of checking coders do the right thing – do you check them at the beginning, or during the process of coding?

A: Here I cheated a bit… used 126 gold questions from another experiment. You have to give a reason for each question about why it’s there – if the person doesn’t get it right then they get text to explain why that is the case… Very clear, unambiguous questions here. But when you deploy a job you can monitor how participants responded or whether they contested it… In a previous experiment we had so many contested responses to one question that I actually looked again and removed it…

Q: A very interesting talk… I am a computer scientist and I am interested in whether now you have that huge gold data set you have thought about using machine learning.

A: Yes, we won’t let that go to waste. The crowd data too…

Q: I am impressed but have two questions… you look at every sentence of every manifesto… they are funny things, as not every sentence is about the thing you are searching for – how do you deal with that? And a lot of what is in manifestos is sort of dog-whistle stuff – with subtexts that the reader will pick up; how do you deal with that in crowd-sourcing?

A: You get contextual sentences around the one you are coding; that helps indicate the relevance of that sentence, its context. In terms of the dog whistle question… people think that, but manifestos are not designed to be subtle. They actually tend to be very plain, very clear. It’s rare for that subtlety to be present. If you want a truly outrageous immigration policy, look at the BNP manifesto… every single area is about immigration, not subtle at all.

Q: I’m a linguist, and I find this very interesting… a question about the tasks appropriate to crowd-sourcing: those that can be broken down into small tasks, and that your participants can relate to their daily life. I am doing work on musical interpretation… I need experts, because I can’t see how to frame that in language in a way that is interpretable to non-experts…

A: You can’t give people something that’s complex… I couldn’t do your task… you can’t assume who your crowd is; we have very little information… we didn’t ask about language, but they wouldn’t retain that trust score without some good English language skills. And workers have a trust score across projects, so anything they can’t do they avoid, as losing that score is too costly… You could simplify the task with some sort of test of correct or incorrect interpretation… but we keep the task simple.

Q: A very interesting talk, I have a quick question about how you set the right price for these tasks… how do you do that? People come from different areas and different contexts.

A: Good question. We paid 2 US cents per sentence. We tried at 5 cents and it was done very fast but quality wasn’t better. A job at 1 cent didn’t happen fast at all. So it’s about timings and pricing of other jobs.

Q: Could you say something about the ethics of this kind of method… you are not giving much consideration to the production of these texts, so I wondered if you could talk about the ethics of this work and responsibilities as researchers.

A: Well, I didn’t ruin any rainforests, or ruin any summers. These people have signed up to terms and conditions. They are responsible for taxation in their jurisdiction. Our agreement with CrowdFlower gives them responsibility. And it’s voluntary. Hopefully no sweatshops for this… I’m receptive to the idea of what ethical concerns there could be… but I couldn’t see anything inherently wrong about the notion of crowd-sourcing that would be a concern. We did run it past the ethics committee at LSE. We didn’t directly contact people; they were completing tasks on the internet through a third-party supplier.

Q: You were showing public domain documents… but for research documents not in the public domain how would security be handled…

A: Generally transcriptions are private… but segments are usually 3 or 5 seconds… like reading a document from the shredder basket… the system has that data, but workers do not have access to that system.

Q: But the system does have that so you need trust in the platform…

A: Yes.

Comment from floor: Companies like CrowdFlower have convinced companies to give them data – doctors’ notes etc. – and they have had to work on making sure they can assure customers about the privacy of data… as a researcher, when you go in, you can consider what is being done in that business market by comparison.

Q: Have you compared volunteer coders to paid coders? I am thinking particularly about ethical side of things and motivations, particularly given how in political tasks participants often have their own agendas. Might be interesting to do.

A: Volunteer crowdsourcing? Yes, it would be interesting to compare that…

Reading Data: Experiments in the Generative Humanities – Dr Lisa Otty, Lecturer in English Literature and Digital Humanities, University of Edinburgh (session chair: Dr Tom Mole)

Dr Tom Mole is introducing our next speaker, Dr Lisa Otty, whose interests are in the relationship between reading, writing and the technologies of transcription. She will be talking about her work on reading poetry, and the process of what happens when we read a poem.

Now, to be a literature scholar speaking at an event like this, I have to acknowledge that data is not a term typically used in our field. The texts we are used to reading are often books, poems… but a text is not necessarily a traditional material; it may also be another linguistic unit, something more complex. The Open Archival Information System model (CCSDS 2002) describes data as “a reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing”. Interpretation being crucial there. Texts like books or poems are “cooked” – edited, curated, finished. Data is too often not seen as that.

Johanna Drucker – in Humanities Approaches to Graphical Display (DHQ 5.1, 2011) – talks about data as taken, not given, constructed from the phenomenological world. Data passes itself off as a priori conditions, as if it were the same as the phenomena observed, collapsing the critical gap between data collection and observation.

Some of these arguments gel with the arguments around close versus distant reading. And I think it can therefore be more productive to see data as a generative process…

Between 2009 and 2012 I was involved in the research project Poetry Beyond Text (University of Glasgow and University of Kent). This was a collaborative project, so inevitably some of my reflections and insights are also collaborative, and I would like to acknowledge my colleagues’ work here. The project was looking at the interpretation of poetry, and particularly visual forms of poetry such as artist books. What these works share is that they are deeply resistant to being shared as just information.

For example, Eugen Gomringer’s (1954) “silencio” is an example of how the space is more resonant than the words around it… So how do we interpret these texts? And how do our processes for interpretation affect our understanding? One method, popular in psychology, is eye tracking… a physical way of registering what you are doing. We combined eye-tracking with self-reporting. Eye tracking takes advantage of the movements of a small area of the retina. So a map of concentration shows those little jumps, those movements around the page. But it’s an odd process to be part of – you wear a head brace with a camera focused on your eye. You get a great deal of data from the process. Where there is more concentration, that usually indicates trickiness or challenge or interest in that section – particularly likely for challenging parts of a text. You can then generate visualisations from this data. (We are watching a video of the eye tracking process for poetry.)
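[For a sense of what the raw output involves, here is a toy sketch of one basic eye-tracking analysis step: turning a stream of timestamped gaze samples into dwell time per area of interest, e.g. per line of the poem. The regions and samples are invented; real trackers and the project’s own pipeline are far richer.]

```python
# Sketch: aggregate gaze samples (t, x, y) into dwell time per area of
# interest (AOI). AOIs and samples here are invented for illustration.
AOIS = {  # AOI name -> (x0, y0, x1, y1) in screen pixels
    "line_1": (100, 100, 500, 130),
    "blank_space": (100, 130, 500, 160),
    "line_2": (100, 160, 500, 190),
}
# Fake 100 Hz gaze stream: one (timestamp_ms, x, y) sample every 10 ms.
samples = [(t, 120 + t % 300, 115 + (t // 100) * 30) for t in range(0, 1000, 10)]

dwell_ms = {name: 0 for name in AOIS}
for _, x, y in samples:
    for name, (x0, y0, x1, y1) in AOIS.items():
        if x0 <= x < x1 and y0 <= y < y1:
            dwell_ms[name] += 10  # 10 ms per sample at 100 Hz
print(dwell_ms)  # heavy dwell on a region suggests difficulty or interest
```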

Doing this we found a lot of patterns. We saw that people did focus on and understand space, but only when that space has significance in the process – in poems where space is more conceptual than mimetic. Interestingly, people who recorded high confusion also reported liking the poems much more… Experiments with post-linear poems looked at cross-linear connections. All readers start with linear reading patterns before visual reading. And that reflects the colour Stroop test – the psychology test showing that visual information trumps linguistic information… so visual readings and habitual reading processes are hard to overcome. We are programmed to read in a certain way… our habits are only broken by obstacles or glitches in the text we are reading…

Now, in talking about this project, if I talk about “findings” I am back in those traditional research methods… and that would be misleading. We were a cross-disciplinary team and so I am particularly interested in focusing on that process, on how we worked on that. The eye tracking generates huge amounts of numerical data… we faced real challenges in understanding how to read this data… a useful reminder of the fact that data’s apparent neutrality has real repercussions. It’s one thing to make data open, another to enable people to work with it.

My colleagues in psychology didn’t understand our interest in visualisations of numerical eye tracking data; to them a visualisation is an abstraction… and you have to understand the software to understand how that abstraction works. Psychologists like to interpret the data through the numbers. They see visualisations, graphs etc. as having a rhetorical rather than analytical function. Our team were interested in that rhetorical function. We were humanists running an experiment – the framework was of hypotheses, of labs, of subjects… but the team came from a creative practice background so this sense of experiment was also in play. In its broadest terms an experiment is about setting something in process and seeing how it behaves; for scientists it is about testing hypotheses in this way, but creative experiments are rather different… For humanist analysis of these texts you have to deal with a huge number of variables, very much a contrast to traditional psychology experiments. For creative experiments there is a long tradition of work in Surrealism, Dadaism, etc. holding that poetry can unleash and disrupt our traditional reading of texts… they deliberately break our habits. The reader of the literary form is a potentially revolutionisable(?) subject.

In literary scholarship and the humanities, reading is a social, contextualised process. In psychology reading is a biomechanical process; my colleagues in this field collapse the human and machine. In a recent article Lutz Koepnick asked “Can Computers Read?” (2014) and discussed the different possible understandings of what reading is for… what our ideological framework of reading means to us… computational reading is less about what computers are, more about how we invest in them and envision them.

One of the things that came out of our project was the connections between poetry and psychology, and the connections to creative experiments.

To finish I want to talk about some examples of experiments around reading and what reading can mean.

The Readers Project – John Cayley and Daniel Howe (2009– ) – explores imaginative critiques of reading. Cayley is a literary scholar and has been working in digital production for some time. The Readers Project features “programmed autonomous entities”. Each reader moves through a text at different speeds and in different ways. For each part of the experiment projections are used, and they are often shown with books – a deliberate choice. A number of interfaces are available. But these readers move according to machine reading rather than biomechanical reading. Cayley terms this an exploration of vectors of reading… directions in which reading might take off. It explores and engages with new creative understandings of reading, and seems to be seen by Cayley in an avant-garde context, with emphasis on the constructed nature of the work.

“because the project’s readers move within and are thus composed by the words within which they move, they also, effectively, write. They generate texts and the traces of their writings are offered to the project’s human readers as such, as writing, as literary art.” (Cayley, The Readers Project website)

As someone engaging with these pieces the experience is of reading with, more than processing or consuming or analysing.

Tower – by Simon Biggs and Mark Shovman (2011), working at Hive – uses natural language processing to build visualisations. When the interactor speaks, their words spiral around them. And other texts are also present – the project is inspired by the Tower of Babel and builds up and up. Shovman’s previous work at Hive was on geometric structure. Biggs’ hope is that participants “will be enabled to reflect upon the inter-relations of the things that they are experiencing and their own contingency as part of that set of things.”

Michelle Kendrick talks about hybrids, that hybrid of human and machine interaction, the centrality of human investment in computer reading.

When I talk about this work I am overwhelmed by the rhetorical significance of words like “experiment” and the dominance of scientific research methods – the first interpretation of this work is often, wrongly, that it applies scientific methods to literary interpretation. But instead this work is about interpretation, and about exploring methods of understanding and interpretation.

Q&A

Q: You talked about different disciplines coming together. Do you think there is a need for humanities researchers to understand data and computational methods?

A: I think we would all benefit from a better understanding of data and analysis, particularly as we move more and more into using digital tools. I’m not sure if that needs to be in the curriculum but it’s certainly important.

Q: One of the interesting things about reading is the idea of it being a process of encoding and decoding… but the code shifts continuously… and a challenge in experimental reading or interpretation is that literature is always experimental to some extent, because the code always changes.

A: I think the idea of reading as always being experimental… I think that experimental writing is about disruption… less about process but more about creating challenge.

Q: I was very struck in what you were presenting there in the Poetry Beyond Text project about the importance of spatiality and space… so I was wondering about explicit spatial understandings – the eye tracking being a form of spatial understanding…

A: We were looking at the way that people had been interpreting those texts in the past, in the ways people had looked at that poetry in the past… they had talked about the structural work of the poets themselves… and we wanted to look beyond that…We wanted to find out people’s responses to some of these processes, and what the relationship was between that experience and those critical views of those texts.

Q: Did you do any work on different kinds of readers – expert readers or people who had studied these works?

A: It was quite a small group but we looked at the same people over time and we did see development over time. We worked mainly with students in literature or art and most hadn’t encountered this type of concrete poetry before but were well experienced with reading.

Q: I wanted to ask you about the ways in which we are trained to read… there are apps showing images of texts very, very quickly – are we developing skills to read quickly rather than reading more fully and understanding the text?

A: There was a process of showing text to the eye very rapidly – RSVP, rapid serial visual presentation, was the acronym – to allow you to absorb text more quickly, but in actual fact it was quite uncomfortable. We do see digital texts playing with those notions. I don’t think we will move away from slow reading, but we are seeing more of these rapid reading processes and technologies.

Chair: Kinetic Text project works in some of these ways, about focusing eye movement…

A: The text can also manipulate eye movement and therefore your reading and understanding of the text. Very interesting in that respect.

Algorithm, Data and Interpretation - Dr Stephen Ramsay, Associate Professor of English at the University of Nebraska; Fellow at the Center for Digital Research in the Humanities (session chair: Prof James Loxley)

James Loxley is introducing our next speaker, Dr Stephen Ramsay.

I want to say that my mother is from Ireland, a little place west of here, and she said that if she had ever been to University it would have been to University of Edinburgh which she felt was the best in the world.

Now, I was planning to give a technical talk – I teach computer science in an English faculty. But instead I’m going to talk about data. So I’m going to start with the 1965 blackout of New York. At the time the story was about disaster, groping in the dark, a city stranded. But then 9 months later they ran stories on the growth in birth rates, a sharp rise across hospitals across the state, all recording above-average numbers of births. Although one report noted that Jewish hospitals did not see an increase. Sociologists talked about the blackout as in some way responsible… Three years later a sociologist published a terse statement showing no increase in births after the Great Blackout. This work looked at the average gestation period, noting that births would have been elevated from June through to August, not just in August… but he found that 1966 was not unusual or remarkable. Blackout babies were a myth…

You could read this tale as a cautionary one about the misuse of data. But I think it can be read another way… the New York Times piece said something about human nature – people turning to each other when the power went out is a sad reflection on the place of television in our lives, but a hopeful narrative for humanity. Citing birth rates and data and using scientific language adds to that. And the comments about Jewish people show prejudice. But at the same time, the subsequent analysis frames the public as prone to fantasy, as uninformed, with the scholar overcoming this…

The idea of “lies, damned lies, and statistics” encourages us to always look for falsehood hiding behind truth… so we think of what stories we are being told, and what story we want to tell. It’s simple advice that is hard to follow. I want to give a different spin on this. I think that data becomes narrative automatically. The way we talk about data is instructive – we talk about lists, numbers… Pride and Prejudice does not seem to be a data set unless we convert it. It gains narrative in transformation. Data can be shown to mean things – like stories, stories waiting to be told… data doesn’t mean anything by itself; someone has to hear what it is saying…

What does data look like in its pre-interpretive state? There is an internet site called “Found” – collecting random items such as notes, cards, love letters, shopping lists. Materials without their context. Abandoned artefacts. All can be found there. But the great glorious treasure of Found is its lists…

[small pause here for technical difficulty reasons]

These lists are just abandoned slips of paper… one for instance says:

beer

neat

dogfoot

domestic

stenga

another:

roach spray

flashlight

watermellon

The spareness and absence of context turns these data-like lists, quickly, into narrative… not all are funny… one reads:

go out for a walk with someone

speak with someone

watch tv

go out to cemetry to speak to mom

go to my room

Have you ever wanted to give your data a hug? Bram Stoker said that in writing Dracula he just wanted to write something scary… his novel is far more interesting without him, as the interpretations of others are fascinating and intriguing… Do facts matter in the humanities? In some areas… who painted a picture, when a treaty was signed… these are not contingent truth claims… surely we can say fact is a good word for those things that are not subject to debate. Scholars can debate whether a painting is by Rembrandt or his school; that debate is about establishing a fact. But facts still matter…

If we look at Rembrandt’s Night Watch, the lighting of the girl equating to that of the captain is intriguing. If Rembrandt said it meant nothing we’d probably ignore him… The signing of a treaty may be a fact, but why it occurred is much more interesting. The humanities are about that category 1 inquiry more than the category 2 fact inquiries. Often this is the critique of the humanities and the digital humanities. Jonathan Gottschall insists that the humanities should embrace scientific approaches and their sense of optimism… and sees the sciences as doing a better job of this stuff, but says that “what makes literature special” should be retained… he doesn’t say what those things are. There are unsettled matters if one takes scientific approaches. Of course Gottschall’s nightmare is to understand data with the same criticality we apply to Bram Stoker, questioning its being and meaning… and I suggest we make that nightmare a reality!

[More technical issues… ]

What I wanted to show you was a list of English novels [being read to us]… It is a list, from Hoover, that organises novels in terms of the breadth of their vocabulary. I have shown this list to many people over the last few years, including many professors… they see Faulkner and Henry James at the top and approve of that, and of Mark Twain… and young adult novelists at the bottom… but actually I read you the list in ascending order… Faulkner and James are at the bottom. Kipling and Lewis are at the top. And there it starts… richness is questioned… people want to point out how clearly correct the answer is, despite having given the wrong answer; some explain that the methodology is flawed or misreported… these are category 1 people being annoyed by category 2 reality…
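[The kind of measurement behind such a list can be sketched simply: count distinct word types per equal-sized block of running text, so that novels of different lengths are comparable. This is one common recipe for vocabulary richness, not necessarily Hoover’s exact method.]

```python
# Sketch: vocabulary "richness" as mean distinct-type count per fixed-size
# block of word tokens, so texts of different lengths can be compared.
import re

def richness(text, block_size=50_000):
    """Mean number of distinct word types per block of block_size tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    blocks = [tokens[i:i + block_size]
              for i in range(0, len(tokens) - block_size + 1, block_size)]
    if not blocks:
        return len(set(tokens))  # text shorter than one block
    return sum(len(set(b)) for b in blocks) / len(blocks)

# Hypothetical usage with files you would supply yourself:
# richness(open("kim.txt").read()) vs richness(open("absalom.txt").read())
```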

But when we stop using it as a gotcha it is a more provocative question… each of these titles contains a thousand, a hundred thousand thoughts and connections… it is what we do… as humanists we make those connections… we ask questions of the narrative we have created… part of our problem is a general discomfort with letting the computer tell us what is so… but if we get past that we might see peculiar mappings of books as cultural objects… it might show us a way to a deeper understanding of reading itself… it raises any number of questions about the development of English style… and most of all it raises questions about our discursive paradigms.

That gives us narrative possibilities we could not see. We cannot think of a text as 50,000-word blocks; the computer can ONLY apprehend the text in such terms. To understand the computer as finding facts is to miss the point. It is about creating triggers to ask questions, to look at the text in new ways. This is something I came across working on Virginia Woolf’s The Waves. The structure is so orderly… and without traditional cultural narrative. And the characters speak in very similar styles, sentence structures, image patterns… some see some difference by gender or solidarity… but overall it is about unity… this is the sort of problem that attracts text analysis scholars like myself. I ran algorithmic clustering models looking for similarities unseen by scholars. On a lark we posed a simple question: “what are the words that the women in the novel use in common, that none of the men do?” It turns out that there are 9 such words. You could see that as a narrative – like a Found list – and then we did it with the men and found 120 words! Dramatic. So many words… Some critics found that disparity frightening… some think it backs up the sexism of the Western canon. Others see this as a chance to ask other questions… to try it with other authors, novels, characters… if you think this way, perhaps you’ve caught the DH bug – I welcome you. But do we think we’ll find an answer to questions of gender and isolation? Do we want to answer those? The humanities want a world that is more complex, deeper than we thought. That process is a conversation…
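[Computationally, the query Ramsay describes is a simple set operation. In the sketch below, the attribution of monologues to the novel’s six speakers is assumed to have been done already, and the texts are placeholders rather than the real data.]

```python
# Sketch: which word types do all the women speakers share that no man uses?
import re
from functools import reduce

def vocab(text):
    return set(re.findall(r"[a-z']+", text.lower()))

speeches = {  # speaker -> concatenated monologue text (placeholders here)
    "Susan": "...", "Jinny": "...", "Rhoda": "...",      # the women
    "Bernard": "...", "Neville": "...", "Louis": "...",  # the men
}
women = ["Susan", "Jinny", "Rhoda"]
men = ["Bernard", "Neville", "Louis"]

# Words every woman uses, minus words any man uses.
shared_by_women = reduce(set.intersection, (vocab(speeches[w]) for w in women))
used_by_any_man = set.union(*(vocab(speeches[m]) for m in men))
women_only = shared_by_women - used_by_any_man  # the talk reports 9 such words
print(sorted(women_only))
```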

In 2015 the Text project will release huge volumes of literature. Perseus contains most Greek texts… there are huge new resources. Almost all the questions we might ask of these corpora have not been asked before… we can say they will transform the humanities, but that may not be true… the limiting factor is whether we choose to remain humanists in the face of such abundance… perhaps we need to be programmers, tool builders, text engineers… many more of us need to invite the new texts – lists, ngrams, maps etc. – into our ongoing conversation. We are here to talk about philosophical issues of data, and these issues are critical… but we have to be engaging with these questions… Digital humanities means databases, mark up, watermelon…!

Q&A

Q: I am intrigued to think about how we design for the things we don’t yet know we need to know…

A: Sure, imagining what we don’t know… you inevitably build your own questions into the tools – ironically, an issue for scientific methods too. The nice thing about computers is that they are fast, obedient and stupid. They will do anything we ask them to, even our most stupid ideas – huge serendipity is just baked into that! It’s a problem, but it’s amazing how the computer does that job for me, surprisingly.

Q: That was a brilliant, fascinating talk. Part of the problem with digital humanities for literature right now is that it either tells us what we already know… or it tells us what we don’t know, but then we worry that it’s wrong… The description of the richness list was part of that. I really liked your call for an ongoing discussion that includes computer-generated data… but I don’t see how we get past the current situation. If all literary criticism says something is so, and expects “yes, but…”, I can see how computer-generated data sits in that… but how can data be a participant in that conversation – beyond ruling something out, or concurring with expectations?

A: Excellent point, and let’s not downplay at all the first part of your question. I saw Franco Moretti give a talk about titles getting shorter, for instance… who’d have thought?! But I think it has a lot to do with how we build our tools… I find it frustrating that we all use R, or tools designed for science or psychology… I want our tools to look more like the art-informed projects Lisa talked about. I think the humanities needs to do more like that, to generate the synergies. Tools that are more ludic.

Q: Maybe it’s about perceived barriers being quite high. An earlier speaker talked about the role of repeatability. The ambiguity of reading a poem is repeatable. If barriers to entry are low enough for repetition and for others to play, to ask new questions, maybe that brings the data in as part of the conversation…

A: There are tools that let you play with the text more ludically – Voyant, for instance. But we come with a lot of cultural baggage as humanists… there is a phenomenon where, no matter what someone is talking about, they give a literary critical reading of a text, but when they show a graph we all think we are scientists… there is so much cultural baggage. We haven’t learned how to be humanistic users of these tools, or to create our own tools.

Q: A question and an observation… There is a school of thought in cognitive psychology that humans are infinitely able to retrofit any narrative to any circumstances whatsoever, and that is very much what was coming through your data… Many humanities departments have become pseudo social sciences departments… but if you don’t have a clear distinction between category 1 and category 2 they can end up doing their own thing…

A: I don’t want that for the humanities. I resist the social-science-type study of literature, of the human record or of the human condition… In my own work I move between being a literary critic and being an engineer… when it comes to writing software that method definition is wrong, it doesn’t work… when I am a literary critic it is about all those shades of grey, those complexities… but those different states both seem important in pursuit of that end goal… if studying flu outbreaks, let’s not be ludic… but for Bram Stoker, then we should!

Q: In my own field of politics there was a particular set of work which gave statistical data a bad name… and I wonder whether the risk of the same is there in your field…

A: In digital literary studies this is sometimes seen as a 25-year project to get literary profs into the digital field… but I always say that that’s not true, there’ll always be things to be done. There was a book in the 70s that looked at slavery in an entirely quantitative way; it made the argument no one wanted to hear, that slavery had been extremely lucrative. Economists said that it was profitable. History fled from statistical methods for years after that… but historians do all agree that it was profitable. And there is quantitative work there again/still. If I had to predict, I’d say the same thing seems likely for digital literary studies…

Q: I can’t resist one here… I was following the blog exchange with Kirsch, in which you say that scholars should code, and I wanted to ask about that…

A: OK, well Kirsch lumps me in with the positivists… I’m not quite in the devil’s party. But I teach programming and software engineering to humanists. It’s extremely divisive… My views have softened over the years… for me programming is a magnificent intellectual exercise… knowing about it seems to help you understand the world. But also, if you want to do research in this area you need some technical skills. If that’s programming… well, learn what you need, whether that’s GIS, 3D graphics… if you want to build things you might need coding!

Big Data and the Co-Production of Social Scientific Knowledge - Prof Rob Procter, Professor of Social Informatics, University of Warwick (session chair: Prof Robin Williams)

Professor Robin Williams is now introducing Professor Rob Procter, our next speaker, who will be talking about his work around social informatics.

The eagle-eyed amongst you will spot my change of title – but digital is infinitely rewritable! I am working in the overlap of sociology and computational tools and methods. The second thing I want to talk about is sociology in the age of “big data”: the opportunities for sociology to respond in various ways to this big data, and the tools to interrogate that data. The evolution of tools and methods is a key thing to look at in this area. That brings me to the Collaborative Online Social Media Observatory (COSMOS) and the tools we are developing for understanding social media… and then I want to talk about sociology beyond the academy – the co-production of social scientific knowledge. There are other types of expertise being mobilised at the moment in the computational turn things are taking. Not always a comfortable thing for social scientists…

So firstly, social informatics. What is that? Well, to me it’s the interdisciplinary study of the factors that shape the adoption and use of ICTs. What gets me excited is how these then move into real processes. For me the emphasis is on innovation as a public, participatory process of experimentation and learning, where the meanings of technologies are collaboratively explored and co-produced. You can argue that social media is a large-scale experiment in social learning… Of course, as we witness the growing scale of adoption, more people experience those processes: how social media works, how they might adopt or use it… to me this is a fascinating area to study. And because it is public and involves social media, it is very easy to see what’s going on… to some extent. And generally that data is accessible for social research purposes. It is not quite that simple, but you can research without the barrier of having to pay for data if you do it in a careful way.

These developments have led me into social media as a prime area of my research. So, firstly, some work we did on the impact of Web 2.0 on scholarly communications – work with Robin Williams and James Stewart. Many of us will be part of this, many of us tweet our research… but many of us are not clear what that means, what the implications are. So we did some work, got some interesting demographic data… we also did interviews with people and got a sense of why they were, or were not, adopting… Some views were very polarised. In parallel we looked at how scholarly publishers incorporate social media tools into their work, in order to remain key players… they do lots of experiments, often focused on measuring impact and seeing the movement of their work to other audiences. Some try providing blogs on their content. But that is all with mixed success. One comment noted that it is easier to get comments on cricket reports than on research online… So it’s hard to understand and capture impact…

I’ll come back to that and to the co-creation of knowledge. But first I want to talk about the riots in England in 2011. This was work in conjunction with the Guardian newspaper. They had been given 2.5 million tweets directly by Twitter. They wanted to know whether social media was particularly vulnerable to the spread of false information, and whether that supported calls for shutting down social media at times of crisis. So we looked at a number of different rumours known about and present in the corpus: zoo animals on the loose; the London Eye on fire; Miss Selfridge on fire; rioters attack a children’s hospital in Birmingham. I will talk about that latter example. But we wanted to ask how people use, understand and interpret social media in these circumstances, how they engage with rumours…

So this is about sociology in the age of “big data”. It calls for interpretive methods, but we can’t apply those at scale easily… so we need computational methods to focus scarce human resources. We could crowdsource some of this, but at this scale even that would be a challenge…

So, first, Savage and Burrows (2007) talked about the “coming crisis of empirical sociology”: the best sociology, as they saw it, was conducted by private companies, who have the greatest and most useful data sets, which sociologists could not rival nor access. However, we might be more confident about the continuing relevance of the social sciences… social media provides a lot of born-digital data… maybe this should be called the “social data deluge”. There is a lot of data available, much of it freely. Meanwhile there are lots of policy initiatives to promote open data in government, for use by anyone with a legitimate purpose. Perhaps we can be more confident about the future of academic sociology…

But if you look at the purposes this data is put to, it’s a more mixed picture… we see analysis of social media for stock market prediction, where correlation is mistaken for causality. Perhaps more interesting are protest movements – like Occupy Wall Street – or the use of social media during the Egyptian revolution… Is it a tool for political change, a way for citizens to acquire more freedom? A way for movements to organise themselves? There is lots of discussion of these contexts. Methodologically it’s a challenge of quantity, and of methods that combine social science understanding with social media tools enabling analysis of large-scale data…

So, back to the riots and that rumour of a children’s hospital being attacked in Birmingham. This requires thorough work with the data, but focused where it counts.

So, what sparked this off was someone tweeting that the police were assembling in large numbers outside the hospital… therefore the hospital must be under threat. A reasonable inference.

So, methodologically, computational methods for analysing tweets are an active area of research: sentiment analysis; topic analysis. We combined a relatively simple tool looking at information flows – the flow from “opinion leaders” to others (e.g. retweets). Once that information flow analysis has been done, we can use the relative sizes of the flows to prioritise the data, size serving as a proxy for importance… this structure, we argue, is useful for focusing human effort. We then used conventional qualitative content analysis: inductively analysing information flow content to develop a “code frame” of topics; using that code frame to categorise information flows (e.g. agreement, disagreement, etc.); and then visualising the analysed information flows…
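
To make that pipeline concrete, here is a minimal sketch of the flow-ranking step – my own illustration over hypothetical tweet records, not the actual COSMOS implementation (the field names and the example code frame are invented):

```python
# A minimal sketch of the information-flow ranking described above.
# Hypothetical tweet records: {"id": ..., "user": ..., "text": ..., "retweet_of": ...}
from collections import defaultdict

def rank_information_flows(tweets):
    """Group retweets under the original ("opinion leader") tweet and
    rank the resulting flows by size, size being a proxy for importance."""
    originals = {t["id"]: t for t in tweets if t["retweet_of"] is None}
    flows = defaultdict(list)
    for t in tweets:
        if t["retweet_of"] in originals:
            flows[t["retweet_of"]].append(t)
    # Largest flows first, so scarce human coding effort goes where it counts
    return sorted(flows.items(), key=lambda kv: len(kv[1]), reverse=True)

# Human coders then apply an inductively developed "code frame" to the
# biggest flows, e.g. categories like:
CODE_FRAME = ["agreement", "disagreement", "common-sense reasoning", "rebuttal"]
```

The point of the ranking is exactly the one made in the talk: the qualitative content analysis stays tractable because humans work down the list from the largest flows.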

So here we see that original tweet… you see the rumour mushroom, versions appear… bounding circles reflect information flows… and individuals and their influence… Initially tweets agree with or repeat the rumour… then we start to see common-sense reasoning: those working at or near the hospital dispute the threat, others point out that the police station is next door to the hospital, providing an alternative explanation. People respond and do not just accept the rumour as true… So rumours do break quickly, BUT social media is not necessarily more vulnerable, as alternative versions and challenges quickly appear to suggest the likely truth. That process might be more rapid with authoritative sources – media or police in this case – adding their voice. But false information may persist longer, with potential risk to public safety – see the follow-on Pheme project.

But I want to return to authoritative sources: the police and the media, and how they use social media. The question is, what were the police doing on Twitter at that time? There is another interesting case here… the riots in Manchester led to people creating new accounts to draw the attention of public bodies like the police to what was going on, as an auxiliary service to raise awareness. Quite an interesting use of social media when people see something like this arising.

So what these examples demonstrate is innovation as co-production… lots of people collectively experimenting, trying things out, learning what social media can and cannot do. I think it’s a prime example for sociologists. We see that uses are emergent, people learn as they use… and it continues to change as people reinvent their own uses… And we all do this; we all have our own uses and agendas shaping our interactions.

This work led to the development of tools for use by social scientists… COSMOS involved James S, Ewan K, etc. from Edinburgh… It would be an error to assume social media can tell us everything that takes place in the world – this data goes alongside crime data, demographic data, etc. The aim of COSMOS is to forge interdisciplinary working between social and computing scientists; to provide an open, sustainable platform of interoperable social media analysis tools; and to refine and evolve its capabilities, providing service models compatible with the needs of diverse user communities.

There are existing tools out there for social media analysis… but many are black-box systems; it’s hard to understand the process that is taking place. We want those black-box processes to be opened up – they are complex, but they can be understood and explored…

So the COSMOS tools let you view timelines, look at rates and flows… select tweets by keywords and hashtags… view the networks of who is tweeting… and compare the data with demographic data.
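
As a rough illustration of what such tools do under the hood – not the COSMOS code itself; the record fields and date format here are assumptions – keyword selection and a timeline view might look like:

```python
# Illustrative only: keyword/hashtag selection and an hourly timeline
# over hypothetical tweet dicts with "text" and "created_at" fields.
from collections import Counter
from datetime import datetime

def select(tweets, terms):
    """Keep tweets mentioning any of the given keywords or hashtags."""
    terms = [term.lower() for term in terms]
    return [t for t in tweets if any(term in t["text"].lower() for term in terms)]

def hourly_timeline(tweets, fmt="%Y-%m-%d %H:%M:%S"):
    """Count tweets per hour, to look at rates and flows over time."""
    return Counter(
        datetime.strptime(t["created_at"], fmt).strftime("%Y-%m-%d %H:00")
        for t in tweets
    )
```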

There are also some experimental tools for geographical clustering: the way people use Twitter can show geographical patterns. Another strand is topic modelling, topic clustering… identifying tweets on the same topic. This is where NLP, and Ewan and his colleagues in Informatics, have become important.
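
For flavour, here is a hedged sketch of topic clustering using scikit-learn’s LDA – the real NLP pipeline for short, noisy tweets is considerably more sophisticated than this:

```python
# A toy illustration of topic clustering with LDA; this conveys the idea
# rather than a production pipeline for tweet text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_clusters(texts, n_topics=5):
    vectoriser = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectoriser.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    mixtures = lda.fit_transform(counts)  # per-tweet topic mixtures
    return mixtures.argmax(axis=1)        # dominant topic per tweet
```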

Current research is looking at: social media and civil society – social media as digital agora; “hate” speech and social media – understanding users, networks and information flows (there is a learning challenge here, about people not understanding the impact and implications of their comments, perhaps a misunderstanding of social media); citizen social science – harnessing volunteer effort; social media and prediction – crime sensing, data integration and statistical modelling; suicide clusters and social media; humanitarianism 2.0 – care for the future; BBC World Service – tweeting the Olympics. And we have a wide range of collaborators and community engagement.

Let me briefly talk about social media as a digital agora… it may sound implausible… many talk about social media as a force for change… opportunities to promote democracy… not just in less democratic countries, but also in democratic countries where processes don’t seem to work as well… So we are looking at social media as civic communication in smaller communities. We are also thinking about social resilience in a day-to-day, small-scale way… problems which, if not managed, may become bigger issues. For that we have studied Twitter in several locations, collected data, interviewed participants… and built up a network of communications. What is interesting, for instance, is that the non-governmental group @c3sc seems to have a big impact. We have to see how this all plays out… it deserves a longitudinal approach…

So, to conclude… the lessons for academic sociology… I think it’s about sociology beyond the academy and the role of wider players. Firstly, data journalism – I was interested in Steven’s point earlier about the 1965 press accounts of the blackout. Perhaps nowadays the way journalists are trained might change that… journalists are increasingly data savvy. We see this through Fact Check, through the RealityCheck blog… through sourcing from social media. So too citizen journalism, used to gather evidence of what is happening… tools like Ushahidi… and a sense of empowerment for these communities… it reminds me of the notion of sousveillance… and the possibility of greater accountability… And citizen journalism in the expenses scandal – the Guardian recruited people to look at the expense claims. The journalists couldn’t do all of that themselves… so they recruited others.

So, citizen social science… in various ways (see Harris 2012, “Oh man, the crowd is getting an F in social science”), and Ken Benoit’s work discussed earlier… we see more people coming into social science understanding…

So the boundaries of social science research production are becoming more porous; social scientific knowledge production is changing, potentially becoming more open. These developments create an opportunity to reinvigorate the project of a “public sociology” – as per Burawoy (2005) and his call “For public sociology” – to make sociology accountable to more people, to organisations, to those in power. Ethically, we need to ask what is needed and wanted, how the agenda is set, and how to deliver more meaningful and useful social science to the public.

How can we do that? New modes of scholarly communication and technology, but that’s not enough… We’ve also been working with a company on a possible programme for the BBC where social media is used to reflect on the week – a knowledge transfer concept. There is also knowledge transfer in the Pheme project – discriminating false from true information… All quite conventional… but we need other pathways to impact: with people as sensors and interpreters of social life, and with training and capacity building in ways we have not done before. Something that has emerged in citizen science is the notion of workshops, hackathons, getting people engaged in using mundane technologies for their own research (e.g. Public Lab). We need something similar for social media tools, so people can extract the data they want, for their purposes, for their agenda… to create a more public sociology that people can do themselves. And we need to have an open dialogue about research problems.

Q&A

Q: My question is about COSMOS and the riot rumours work… within COSMOS do you have space for formal input around ethics and law? You cut close to making people identifiable and locatable. And related to that… with police in those circles… it may arouse suspicions about motives… for instance, in Birmingham did the police just monitor, or did they tweet?

A: They did tweet, but not on that rumour. It is an understandable concern that collaborations make powerful state actors more powerful… for us, we want these technologies available for anyone to use… not some exclusive arrangement; they should be available to communities, third sector organisations… anyone who feels that social media may be important in their research.

Q: I was more concerned about self-led vigilantes, those who might gang up on others…

A: It is a responsibility of civil society to be aware of those dangers, and to have mechanisms to avoid harm. This does exist already… so if social media becomes an instrument of that, we have to respond and be aware – that is partly what the hate speech project is about… The bigger learning problem is about conduct in the social media space, and the problem that people don’t realise how quickly conduct becomes visible to a much bigger group of others… and that relates to ethics… Twitter is a public domain space, but when something is highlighted by others… we have to revisit the ethics issues time and again… for the riots study we went through the usual clearance process… Like Ken, we were told it was fine… but not to make people identifiable – and that is nearly impossible in social media. Not an easy thing to resolve.

Q: I’m curious about changes in social media platforms and how that affects us… moves from Facebook to Twitter to Snapchat to Instagram… how does that become apparent? It may be invisible; how do we track that…

A: There is a fundamental issue of sustainability of access to data from social media. It is not too much of a problem to gather data if you design your harvesting appropriately for the platforms’ rate limits. In terms of other platforms, people moving to them, and changes in the modality, observability and accessibility of data… what social research needs is agreement with the providers of data that, under certain conditions of access, their data is available for research… to make access for legitimate uses easy. There are efforts to archive data – the Library of Congress collects all tweets, and is likely to allow access under licence, I think – to ensure access as the use of platforms changes…
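
By way of illustration, “designing harvesting appropriately for rate limits” usually amounts to a polling loop that respects a request quota per time window. A generic sketch, with hypothetical endpoint, quota and window values rather than any real platform’s limits:

```python
# Generic rate-limit-aware harvesting loop; the quota and window values
# are hypothetical, not a particular platform's real limits.
import time
import requests

REQUESTS_PER_WINDOW = 15   # hypothetical quota
WINDOW_SECONDS = 15 * 60   # hypothetical 15-minute window

def harvest(endpoint, params, pages):
    results, calls = [], 0
    for page in range(pages):
        if calls == REQUESTS_PER_WINDOW:  # quota exhausted: wait out the window
            time.sleep(WINDOW_SECONDS)
            calls = 0
        resp = requests.get(endpoint, params={**params, "page": page})
        resp.raise_for_status()
        results.extend(resp.json())
        calls += 1
    return results
```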

Edinburgh Data Science Initiative – Prof Dave Robertson, Head of School of Informatics

Sian Bayne is quickly introducing Dave Robertson, who is providing a coda to today’s session.

I’m just briefly going to talk about the Edinburgh Data Science Initiative. The idea is data as a catalyst for change in multiple academic disciplines and business sectors.

So, firstly, the business side… big data can be very big and very fast… that can be off-putting in the humanities… But you don’t have to build something big to be part of this… I work in these areas, but my models are small… and there is a stack you never see – the economic and political side of this stuff.

And here’s the other one… this is about variety and velocity – a chart from IBM looking at predictions of the volume of data and, more interestingly, the uncertainty of data… The data sits in a few categories: enterprise data, loads of social media, and loads of sensors (the internet of things)… but uncertainty over aggregate data is getting hugely large… and that’s not in the sphere of traditional engineering, or traditional business…

The next slide here is about architectures… this is topical… it’s IBM’s Watson system, the one that won Jeopardy… it harvested loads of information and does hypothesis generation… This stack starts with very computational stuff, but the top layers look much more like humanities work and concepts…

Now, technology and society interact. Often technology pushes on society. For instance, look at Moore’s Law (the memory in your computer doubles every year) mapped against the cost of sequencing the human genome. It looks radically different: costs drop hugely in the late 2000s as a lot of effort is pushed in here. And that drop in cost, to $1000 per genome… that is socially important… I could sequence my genome… maybe I don’t want to. You can sequence at population scales… and the machines generate a TB of data a week too – huge amounts of data being generated! And this works the other way around… sometimes technology gives you an inflection point and you have to keep up; sometimes society pushes back. A lot of time online is spent on social networks (allegedly 1/7)… now a unified channel for discovery and interaction… And the number of connected devices is zooming up…

That’s the sort of thing that is pushing a lot of this… People have spoken to all the schools in the university… everyone reacts… you will find everyone recognising this… and you hear them saying “and it changes the way I think about my research”. It’s so unusual to have such a common response…

Why is this important at Edinburgh? We have many interdisciplinary foundations at Edinburgh… All disciplines are relevant, however data intensive their work, and we are well developed in interdisciplinary working…

And we have a whole data-driven start-up ecosystem in Edinburgh… we have Silicon Walk (miicard, zonefox, etc.), Waverley Gate (Amazon, Microsoft), Appleton Tower (Informatics Ventures, feusd, Disney Research, tigerface), Evo House (FlockEdu, Lucky Frame, etc.), Quartermile (Skyscanner, IBM), Informatics, Techcube (FanDuel, Outplay, CloudSoft, etc.). A huge ecosystem here!

So, I’ll leave it there but input, feedback welcomed, just speak to myself and/or Kevin.

And that was it for the day…
