Association of Internet Researchers AoIR 2016: Day 1 – Workshops

After a few weeks of leave I’m now back and spending most of this week at the Association of Internet Researchers (AoIR) Conference 2016. I’m hugely excited to be here as the programme looks excellent with a really wide range of internet research being presented and discussed. I’ll be liveblogging throughout the week starting with today’s workshops.

I am booked into the Digital Methods in Internet Research: A Sampling Menu workshop, although I may be switching session at lunchtime to attend the Internet rules… for Higher Education workshop this afternoon.

The Digital Methods workshop is being chaired by Patrik Wikstrom (Digital Media Research Centre, Queensland University of Technology, Australia) and the speakers are:

  • Erik Borra (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Axel Bruns (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Jean Burgess (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Carolin Gerlitz (University of Siegen, Germany),
  • Anne Helmond (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Ariadna Matamoros Fernandez (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Peta Mitchell (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Richard Rogers (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Fernando N. van der Vlist (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Esther Weltevrede (Digital Methods Initiative, University of Amsterdam, the Netherlands).

I’ll be taking notes throughout but the session materials are also available here:

Patrik: We are in for a long and exciting day! I won’t introduce all the speakers as we won’t have time!

Conceptual Introduction: Situating Digital Methods (Richard Rogers)

My name is Richard Rogers, I’m professor of new media and digital culture at the University of Amsterdam and I have the pleasure of introducing today’s session. So I’m going to do two things, I’ll be situating digital methods in internet-related research, and then taking you through some digital methods.

I would like to situate digital methods as a third era of internet research… I think all of these eras thrive and overlap but they are differentiated.

  1. Web of Cyberspace (1994-2000): Cyberstudies was an effort to see difference in the internet, the virtual as distinct from the real. I’d situate this largely in the 90’s and the work of Steve Jones and Steve (?).
  2. Web as Virtual Society? (2000-2007) saw virtual as part of the real. Offline as baseline and “virtual methods” with work around the digital economy, the digital divide…
  3. Web as societal data (2007-) is about “virtual as indication of the real”. Online as baseline.

Right now we use online data about society and culture to make “grounded” claims.

So, if we look at Thanksgiving recipe searches on a map we get some idea of regional preference, or we look at Google data in more depth, we get this idea of internet data as grounding for understanding culture, society, tastes.

So, we had this turn around 2008 to “web as data” as a concept. When this idea was first introduced not all were comfortable with the concept. Mike Thelwall et al (2005) talked about the importance of grounding the data from the internet. So, for instance, Google’s flu trends can be compared to Wikipedia traffic etc. And with these trends we also get the idea of “the internet knows first”, with the web predicting other sources of data.

Now I do want to talk about digital methods in the context of digital humanities data and methods. Lev Manovich talks about Cultural Analytics. It is concerned with digitised cultural materials, with materials clusterable in a sort of art historical way – by hue, style, etc. And so this is a sort of big data approach that substitutes “continuous change” for periodisation and categorisation for continuation. So, this approach can, for instance, be applied to Instagram (Selfiexploration), looking at mood, aesthetics, etc. And then we have Culturomics, mainly through the Google Ngram Viewer. A lot of linguists use this to understand subtle differences as part of distant reading of large corpora.

And I also want to talk about e-social sciences data and method. Here we have Webometrics (Thelwall et al) with links as reputational markers. The other tradition here is Altmetrics (Priem et al), which uses online data to do citation analysis, with social media data.

So, at least initially, the idea behind digital methods was to be in a different space. The study of online digital objects, and also natively online method – methods developed for the medium. And natively digital is meant in a computing sense here. In computing, software has a native mode when it is written for a specific processor, so these are methods specifically created for the digital medium. We also have digitized methods: those which have been imported and migrated, adapted slightly to the online.

Generally speaking there is a sort of protocol for digital methods: Which objects and data are available? (links, tags, timestamps); how do dominant devices handle them? etc.

I will talk about some methods here:

1. Hyperlink

For the hyperlink analysis there are several methods. The Issue Crawler software, still running and working, enables you to see links between pages, direction of linking, aspirational linking… For example a visualisation of an Armenian NGO shows the dynamics of an issue network, showing the politics of association.

The other method that can be used here takes a list of sensitive sites, gathered using Issue Crawler, then passes it through an internet censorship service. There are variations on this that indicate how successful attempts at internet censorship are. We do work on Iran and China and I should say that we are always quite thoughtful about how we publish these results because of their sensitivity.

2. The website as archived object

We have the Internet Archive and we have individual archived web sites. Both are useful but researcher use is not terribly significant so we have been doing work on this. See also a YouTube video called “Google and the politics of tabs” – a technique to create a movie of the evolution of a webpage in the style of timelapse photography. I will be publishing soon about this technique.

But we have also been looking at historical hyperlink analysis – giving you that context that you won’t see represented in archives directly. This shows the connections between sites at a previous point in time. We also discovered that the “Ghostery” plugin can also be used with archived websites – for trackers and for code. So you can see the evolution and use of trackers on any website/set of websites.

6. Wikipedia as cultural reference

Note: the numbering is from a headline list of 10, hence the odd numbering… 

We have been looking at the evolution of Wikipedia pages, understanding how they change. It seems that pages shift from neutral to national points of view… So we looked at Srebrenica and how that is represented. The pages here have different names, indicating difference in the politics of memory and reconciliation. We have developed a triangulation tool that grabs links and references and compares them across different pages. We also developed comparative image analysis that lets you see which images are shared across articles.
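Note: to make the triangulation idea concrete for myself, here is a minimal sketch in Python. It is not the DMI tool, just an illustration under my own assumptions: the references for each page have already been extracted into lists, and the function names and page labels are hypothetical.

```python
from collections import defaultdict


def triangulate(references_by_page):
    """Map each reference to the set of pages that cite it."""
    pages_by_ref = defaultdict(set)
    for page, refs in references_by_page.items():
        for ref in refs:
            pages_by_ref[ref].add(page)
    return pages_by_ref


def shared_and_unique(references_by_page):
    """Return (references cited by every page, references unique to each page)."""
    pages_by_ref = triangulate(references_by_page)
    total = len(references_by_page)
    shared = sorted(r for r, pages in pages_by_ref.items() if len(pages) == total)
    unique = {
        page: sorted(r for r in set(refs) if len(pages_by_ref[r]) == 1)
        for page, refs in references_by_page.items()
    }
    return shared, unique
```

References that appear on only one language version are the interesting ones here, as they hint at where the national points of view diverge.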

7. Facebook and other social networking sites

Facebook is, as you probably well know, a social media platform that is relatively difficult to pin down at a moment in time. Anyone trying to pin down the history of Facebook will find that very hard – it hasn’t been in the Internet Archive for four years, and the site changes all the time. We have developed two approaches: one uses social media profiles and interest data as a means of studying cultural taste and political preference, or “Postdemographics”; the other, “Networked content analysis”, uses social media activity data as a means of studying “most engaged with content” – that helps with the fact that profiles are no longer available via the API. To some extent the API drives the research, but in taking a digital methods approach we need to work with the medium and find which possibilities are there for research.

So, one of the projects undertaken within this space was elFriendo, a MySpace-based project which looked at the cultural tastes of “friends” of Obama and McCain during their presidential race. For instance Obama’s friends best liked Lost and The Daily Show on TV, McCain’s liked Desperate Housewives, America’s Next Top Model, etc. Very different cultures and interests.

Now the Networked Content Analysis approach, where you quantify and then analyse, works well with Facebook. You can look at pages and use data from the API to understand the pages and groups that liked each other, to compare memberships of groups etc. (at the time you were able to do this). In this process you could see specific administrator names, and we did this with right wing data working with a group called Hope not Hate, who recognised many of the names that emerged here. Looking at most liked content from groups you also see the shared values, cultural issues, etc.

So, you could see two areas of Facebook Studies: Facebook I (2006-2011) about presentation of self, with profiles and interests studies (with ethics); and Facebook II (2011-), which is more about social movements. I think many social media platforms are following this shift – or would like to. So in Instagram Studies, Instagram I (2010-2014) was about selfie culture, but this has shifted to Instagram II (2014-), concerned with antagonistic hashtag use for instance.

Twitter has done this and gone further… Twitter I (2006-2009) was about Twitter as an urban lifestyle tool (its origins) and “banal” lunch tweets – their own tagline of “what are you doing?”, a connectivist space; Twitter II (2009-2012) moved to elections, disasters and revolutions. The tagline is “what’s happening?” and we have metrics such as “trending topics”; Twitter III (2012-) sees this as a generic resource tool with commodification of data, stock market predictions, elections, etc.

So, I want to finish by talking about work on Twitter as a storytelling machine for remote event analysis. This is an approach we developed some years ago around the Iran event crisis. We made a tweet collection around a single Twitter hashtag – which is no longer done – and then ordered by most retweeted (top 3 for each day) and presented in chronological (not reverse) order. And we then showed those in huge displays around the world…

To take you back to June 2009… Mousavi holds an emergency press conference. Voter turn out is 80%. SMS is down. Mousavi’s website and Facebook are blocked. Police use pepper spray… The first 20 days of most popular tweets is a good succinct summary of the events.
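Note: the ordering described above (top 3 most retweeted per day, shown in chronological order) can be sketched roughly as follows. This is my own illustration, not the project’s code. It assumes each tweet is a dict with an ISO 8601 `created_at` string and a `text` field, and it treats tweets with identical text (after stripping a leading “RT @user:”) as retweets of one another.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Matches a manual retweet prefix such as "RT @someone: "
RT_PREFIX = re.compile(r"^RT @\w+:?\s*")


def top_retweets_per_day(tweets, n=3):
    """Return (day, text, count) tuples: the n most repeated texts per day,
    with days in chronological (not reverse) order."""
    counts_by_day = defaultdict(Counter)
    for tweet in tweets:
        day = datetime.fromisoformat(tweet["created_at"]).date()
        text = RT_PREFIX.sub("", tweet["text"])
        counts_by_day[day][text] += 1

    story = []
    for day in sorted(counts_by_day):
        for text, count in counts_by_day[day].most_common(n):
            story.append((day.isoformat(), text, count))
    return story
```

Reading the resulting list from top to bottom gives the day-by-day summary of the event.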

So, I’ve taken you on a whistle stop tour of methods. I don’t know if we are coming to the end of this. I was having a conversation the other day that the Web 2.0 days are over really, the idea that the web is readily accessible, that APIs and data are there to be scraped… That’s really changing. This is one of the reasons the app space is so hard to research. We are moving again to user studies to an extent. What the Chinese researchers are doing involves convoluted processes for getting the data, for instance. But there are so many areas of research that can still be done. Issue Crawler is still out there and other tools are available at

Twitter studies with DMI-TCAT (Erik Borra)

I’m going to be talking about how we can use the DMI-TCAT tool to do Twitter Studies. I am here with Emile den Tex, one of the original developers of this tool, alongside Erik Borra.

So, what is DMI-TCAT? It is the Digital Methods Initiative Twitter Capture and Analysis Toolset, a server side tool which tries to enable robust and reproducible data capture and analysis. The design is based on two ideas: that captured datasets can be refined in different ways; and that the datasets can be analysed in different ways. Although we developed this tool, it is also in use elsewhere, particularly in the US and Australia.

So, how do we actually capture Twitter data? Some of you will have some experience of trying to do this. As researchers we don’t just want the data, we also want to look at the platform in itself. If you are in industry you get Twitter data through a “data partner”, the biggest of which by far is GNIP – owned by Twitter as of the last two years – then you just pay for it. But it is pricey. If you are a researcher you can go to an academic data partner – DiscoverText or Hexagon – and they are also resellers but they are less costly. And then the third route is the publicly available data – REST APIs, Search API, Streaming APIs. These are, to an extent, the authentic user perspective as most people use these… We have built around these but the available data and APIs shape and constrain the design and the data.

For instance the “Search API” prioritises “relevance” over “completeness” – but as academics we don’t know how “relevance” is being defined here. If you want to do representative research then completeness may be most important. If you want to look at how Twitter prioritises the data, then that Search API may be most relevant. You also have to understand rate limits… This can constrain research, as different data has different rate limits.

So there are many layers of technical mediation here, across three big actors: Twitter platform – and the APIs and technical data interfaces; DMI-TCAT (extraction); Output types. And those APIs and technical data interfaces are significant mediators here, and important to understand their implications in our work as researchers.

So, onto the DMI-TCAT tool itself – more on this in Borra & Rieder (2014) (doi:10.1108/AJIM-09-2013-0094). They talk about “programmed method” and the idea of the methodological implications of the technical architecture.

What can one learn if one looks at Twitter through this “programmed method”? Well: (1) Twitter users can change their Twitter handle, but their ids will remain identical – sounds basic but it’s important to understand when collecting data; (2) the length of a tweet may vary beyond the maximum of 140 characters (mentions and URLs); (3) native retweets may have their top-level text property shortened; (4) there are unexpected limitations, for example support for new emoji characters can be problematic; (5) it is possible to retrieve a deleted tweet.

So, for example, a tweet can vary beyond 140 characters. The Retweet of an original post may be abbreviated… Now we don’t want that, we want it to look as it would to a user. So, we capture it in our tool in the non-truncated version.
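Note: a rough sketch of what handling these two quirks might look like. This is not TCAT’s actual code. It assumes tweets arrive as dicts shaped like the Twitter REST API v1.1 JSON, where a native retweet carries the original tweet under `retweeted_status` and each user has a stable `id_str` alongside a changeable `screen_name`; the function names are mine.

```python
from collections import defaultdict


def display_text(tweet):
    """Return the text as a user would see it, rebuilding a native retweet
    from the embedded original rather than the shortened top-level text."""
    original = tweet.get("retweeted_status")
    if original is None:
        return tweet["text"]
    return "RT @{}: {}".format(original["user"]["screen_name"], original["text"])


def handles_by_user_id(tweets):
    """Group every handle seen in the dataset under the stable user id."""
    handles = defaultdict(set)
    for tweet in tweets:
        user = tweet["user"]
        handles[user["id_str"]].add(user["screen_name"])
    return dict(handles)
```

Keying on the id rather than the handle means an account that renames itself mid-capture still counts as one user.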

And, on the issue of deletion and withholding: there are tweets deleted by users, and there are tweets which are withheld by the platform – and the withholding is a country by country issue. But you can see tweets only available in some countries. A project that uses this information is “Politwoops”, which captures tweets deleted by US politicians and lets you filter to specific states, party, position. Now there is an ethical discussion to be had here… We don’t know why tweets are deleted… We could at least talk about it.

So, the tool captures Twitter data in two ways. Firstly there are the direct capture capabilities (via web front-end), which allow tracking of users and capture of public tweets posted by these users; tracking of particular terms or keywords, including hashtags; and getting a small random sample (approx 1%) of all public statuses. Secondary capture capabilities (via scripts) allow further exploration, including user ids, deleted tweets etc.

Twitter as a platform has a very formalised idea of sociality, the types of connections, parameters, etc. When we use the term “user” we mean it in the platform defined object meaning of the word.

Secondary analytical capabilities, via script, also allow further work:

  1. support for geographical polygons to delineate geographical regions for tracking particular terms or keywords, including hashtags.
  2. Built-in URL expander, following shortened URLs to their destination. Allowing further analysis, including of which statuses are pointing to the same URLs.
  3. Download media (e.g. videos and images) attached to particular Tweets.
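Note: the URL expander idea in the list above could be sketched like this. Again this is my own illustration rather than how TCAT does it. `follow_redirects` makes a real network request using the standard library, and `statuses_by_destination` is a hypothetical helper that takes (status id, URL) pairs and accepts any resolver function, so it can be tried offline.

```python
from collections import defaultdict
from urllib.request import urlopen


def follow_redirects(url, timeout=10):
    """Return the final URL after HTTP redirects (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def statuses_by_destination(statuses, resolve=follow_redirects):
    """Group status ids by the destination their (possibly shortened) URL
    points to. statuses is an iterable of (status_id, url) pairs."""
    cache = {}  # resolve each distinct URL only once
    grouped = defaultdict(list)
    for status_id, url in statuses:
        if url not in cache:
            cache[url] = resolve(url)
        grouped[cache[url]].append(status_id)
    return dict(grouped)
```

Two statuses using different shorteners for the same article end up in the same group, which is what makes the “most referenced content” analysis possible.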

So, we have this tool but what sort of studies might we do with Twitter? Some ideas to get you thinking:

  1. Hashtag analysis – users, devices etc. Why? They are often embedded in social issues.
  2. Mentions analysis – users mentioned in contexts, associations, etc. allowing you to e.g. identify expertise.
  3. Retweet analysis – most retweeted per day.
  4. URL analysis – the content that is most referenced.
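Note: the first of these, hashtag analysis, often starts with a simple frequency list. A minimal sketch, assuming you only have the tweet texts as strings (the function name is hypothetical, and each hashtag is counted once per tweet, case-insensitively):

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_frequency(texts):
    """Count in how many tweets each hashtag appears (case-insensitive)."""
    counts = Counter()
    for text in texts:
        # A set, so a hashtag repeated within one tweet is counted once
        counts.update({tag.lower() for tag in HASHTAG.findall(text)})
    return counts
```

Calling `hashtag_frequency(texts).most_common(10)` then gives a top ten list to go back and query the dataset with.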

So Emile will now go through the tool and how you’d use it in this way…

Emile: I’m going to walk through some main features of the DMI TCAT tool. We are going to use a demo site and look at some Trump tweets…

Note: I won’t blog everything here as it is a walkthrough, but we are playing with timestamps (the tool uses UTC), search terms etc. We are exploring hashtag frequency… In that list you can see Benghazi, tpp, etc. Now, once you see a common hashtag, you can go back and query the dataset again for that hashtag/search terms… And you can filter down… And look at “identical tweets” to find the most retweeted content.

Emile: Erik called this a list making tool – it sounds dull but it is so useful… And you can then put the data through other tools. You can put tweets into Gephi. Or you can do exploration… We looked at Getty Parks project, scraped images, reverse Google image searched those images to find the originals, checked the metadata for the camera used, and investigated whether the cost of a camera was related to the success in distributing an image…

Richard: It was a critique of user generated content.

Analysing Social Media Data with TCAT and Tableau (Axel Bruns)

Analysing Network Dynamics with Agent Based Models (Patrik Wikström)

Tracking the Trackers (Anne Helmond, Carolin Gerlitz, Esther Weltevrede and Fernando van der Vlist)

Multiplatform Issue Mapping (Jean Burgess & Ariadna Matamoros Fernandez)

Analysing and visualising geospatial data (Peta Mitchell)