IIPC WAC / RESAW Conference 2017 – Day Three Liveblog

It’s the final day of the IIPC/RESAW conference in London. See my day one and day two post for more information on this. I’m back in the main track today and, as usual, these are live notes so comments, additions, corrections, etc. all welcome.

Collection development panel (Chair: Nicola Bingham)

James R. Jacobs, Pamela M. Graham & Kris Kasianovitz: What’s in your web archive? Subject specialist strategies for collection development

We’ve been archiving the web for many years but the need for web archiving really hit home for me in 2013 when NASA took down every one of their technical reports – for review on various grounds. And the web archiving community was very concerned. Michael Nelson said in a post “NASA information is too important to be left on nasa.gov computers”. And I wrote about when we rely on pointing not archiving.

So, as we planned for this panel we looked back on previous IIPC events and we didn’t see a lot about collection curation. We posed three topics all around these areas. So for each theme we’ll watch a brief screen cast by Kris to introduce them…

  1. Collection development and roles

Kris (via video): I wanted to talk about my role as a subject specialist and how collection development fits into that. AS a subject specialist that is a core part of the role, and I use various tools to develop the collection. I see web archiving as absolutely being part of this. Our collection is books, journals, audio visual content, quantitative and qualitative data sets… Web archives are just another piece of the pie. And when we develop our collection we are looking at what is needed now but in anticipation of what we be needed 10 or 20 years in the future, building a solid historical record that will persist in collections. And we think about how our archives fit into the bigger context of other archives around the country and around the world.

For the two web archives I work on – CA.gov and the Bay Area Governments archives – I am the primary person engaged in planning, collecting, describing and making available that content. And when you look at the web capture life cycle you need to ensure the subject specialist is included and their role understood and valued.

The CA.gov archive involves a group from several organisations including the government library. We have been archiving since 2007 in the California Digital Library initially. We moved into Archive-It in 2013.

The Bay Area Governments archives includes materials on 9 counties, but primarily and comprehensively focused on two key counties here. We bring in regional governments and special districts where policy making for these areas occur.

Archiving these collections has been incredibly useful for understanding government, their processes, how to work with government agencies and the dissemination of this work. But as the sole responsible person that is not ideal. We have had really good technical support from Internet Archive around scoping rules, problems with crawls, thinking about writing regular expressions, how to understand and manage what we see from crawls. We’ve also benefitted from working with our colleague Nicholas Taylor here at Stanford who wrote a great QA report which has helped us.

We are heavily reliant on crawlers, on tools and technologies created by you and others, to gather information for our archive. And since most subject selectors have pretty big portfolios of work – outreach, instruction, as well as collection development – we have to have good ties to developers, and to the wider community with whom we can share ideas and questions is really vital.

Pamela: I’m going to talk about two Columbia archives, the Human Rights Web Archive (HRWA) and Historic Preservation and Urban Planning. I’d like to echo Kris’ comments about the importance of subject specialists. The Historic Preservation and Urban Planning archive is led by our architecture subject specialist and we’d reached a point where we had to collect web materials to continue that archive – and she’s done a great job of bringing that together. Human Rights seems to have long been networked – using the idea of the “internet” long before the web and hypertext. We work closely with Alex Thurman, and have an additional specially supported web curator, but there are many more ways to collaborate and work together.

James: I will also reflect on my experience. And the FDLP – Federal Library Program – involves libraries receiving absolutely every government publications in order to ensure a comprehensive archive. There is a wider programme allowing selective collection. At Stanford we are 85% selective – we only weed out content (after five years) very lightly and usually flyers etc. As a librarian I curate content. As an FDLP library we have to think of our collection as part of the wider set of archives, and I like that.

As archivists we also have to understand provenance… How do we do that with the web archive. And at this point I have to shout out to Jefferson Bailey and colleagues for the “End of Term” collection – archiving all gov sites at the end of government terms. This year has been the most expansive, and the most collaborative – including FTP and social media. And, due to the Trump administration’s hostility to science and technology we’ve had huge support – proposals of seed sites, data capture events etc.

2. Collection Development approaches to web archives, perspectives from subject specialists

As subject specialists we all have to engage in collection development – there are no vendors in this space…

Kris: Looking again at the two government archives I work on there is are Depository Program Statuses to act as a starting point… But these haven’t been updated for the web. However, this is really a continuation of the print collection programme. And web archiving actually lets us collect more – we are no longer reliant on agencies putting content into the Depository Program.

So, for CA.gov we really treat this as a domain collection. And no-one really doing this except some UCs, myself, and state library and archives – not the other depository libraries. However, we don’t collect think tanks, or the not-for-profit players that influence policy – this is for clarity although this content provides important context.

We also had to think about granularity… For instance for the CA transport there is a top level domain and sub domains for each regional transport group, and so we treat all of these as seeds.

Scoping rules matter a great deal, partly as our resources are not unlimited. We have been fortunate that with the CA.gov archive that we have about 3TB space for this year, and have been able to utilise it all… We may not need all of that going forwards, but it has been useful to have that much space.

Pamela: Much of what Kris has said reflects our experience at Columbia. Our web archiving strengths mirror many of our other collection strengths and indeed I think web archiving is this important bridge from print to fully digital. I spent some time talking with our librarian (Chris) recently, and she will add sites as they come up in discussion, she monitors the news for sites that could be seeds for our collection… She is very integrated in her approach to this work.

For the human rights work one of the challenges is the time that we have to contribute. And this is a truly interdisciplinary area with unclear boundaries, and those are both challenging aspects. We do look at subject guides and other practice to improve and develop our collections. And each fall we sponsor about two dozen human rights scholars to visit and engage, and that feeds into what we collect… The other thing that I hope to do in the future is to do more assessment to look at more authoritative lists in order to compare with other places… Colleagues look at a site called ideallist which lists opportunities and funding in these types of spaces. We also try to capture sites that look more vulnerable – small activist groups – although it is nt clear if they actually are that risky.

Cost wise the expensive part of collecting is both human effort to catalogue, and the permission process in the collecting process. And yesterday’s discussion of possible need for ethics groups as part of the permissions prpcess.

In the web archiving space we have to be clearer on scope and boundaries as there is such a big, almost limitless, set of materials to pick from. But otherwise plenty of parallels.

James: For me the material we collect is in the public domain so permissions are not part of my challenge here. But there are other aspects of my work, including LOCKSS. In the case of Fugitive US Agencies Collection we take entire sites (e.g. CBO, GAO, EPA) plus sites at risk (eg Census, Current Industrial Reports). These “fugitive” agencies include publications should be in the depository programme but are not. And those lots documents that fail to make it out, they are what this collection is about. When a library notes a lost document I will share that on the Lost Docs Project blog, and then also am able to collect and seed the cloud and web archive – using the WordPress Amber plugin – for links. For instance the CBO looked at the health bill, aka Trump Care, was missing… In fact many CBO publications were missing so I have added it as a see for our Archive-it

3. Discovery and use of web archives

Discovery and use of web archives is becoming increasingly important as we look for needles in ever larger haystacks. So, firstly, over to Kris:

Kris: One way we get archives out there is in our catalogue, and into WorldCat. That’s one plae to help other libraries know what we are collecting, and how to find and understand it… So would be interested to do some work with users around what they want to find and how… I suspect it will be about a specific request – e.g. city council in one place over a ten year period… But they won’t be looking for a web archive per se… We have to think about that, and what kind of intermediaries are needed to make that work… Can we also provide better seed lists and documentation for this? In Social Sciences we have the Code Book and I think we need to share the equivalent information for web archives, to expose documentation on how the archive was built… And linking to seeds nad other parts of collections .

One other thing we have to think about is process and document ingest mechanism. We are trying to do this for CA.gov to better describe what we do… BUt maybe there is a standard way to produce that sort of documentation – like the Codebook…

Pamela: Very quickly… At Columbia we catalogue individual sites. We also have a customised portal for the Human Rights. That has facets for “search as research” so you can search and develop and learn by working through facets – that’s often more useful than item searches… And, in terms of collecting for the web we do have to think of what we collect as data for analysis as part of a larger data sets…

James: In the interests of time we have to wrap up, but there was one comment I wanted to make.which is that there are tools we use but also gaps that we see for subject specialists [see slide]… And Andrew’s comments about the catalogue struck home with me…


Q1) Can you expand on that issue of the catalogue?

A1) Yes, I think we have to see web archives both as bulk data AND collections as collections. We have to be able to pull out the documents and reports – the traditional materials – and combine them with other material in the catalogue… So it is exciting to think about that, about the workflow… And about web archives working into the normal library work flows…

Q2) Pamela, you commented about permissions framework as possibly vital for IRB considerations for web research… Is that from conversations with your IRB or speculative.

A2) That came from Matt Webber’s comment yesterday on IRB becoming more concerned about web archive-based research. We have been looking for faster processes… But I am always very aware of the ethical concern… People do wonder about ethics and permissions when they see the archive… Interesting to see how we can navigate these challenges going forward…

Q3) Do you use LCSH and are there any issues?

A3) Yes, we do use LCSH for some items and the collections… Luckily someone from our metadata team worked with me. He used Dublin Core, with LCSH within that. He hasn’t indicated issues. Government documents in the US (and at state level) typically use LCSH so no, no issues that I’m aware of.



IIPC WAC / RESAW Conference 2017 – Day Two (Technical Strand) Liveblog

I am again at the IIPC WAC / RESAW Conference 2017 and, for today I am

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then inget usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as neccassary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawl targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essemtial metaddta
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as neccassry

This is better but there are challenges. Firstly, what is a “publication?”. With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, gov.uk publications… An original report will has an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, what can frustrate technical folk… You take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to proide an initial guess of the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the names of authors’ names etc. in the document metadata is rarely correct. We use the worse extractor, and layer up so we have the best shot. What works best is extracting the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.

What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solaar searches correctly it should be easy so will be correcting this…

We do need to do more future experimentation.. Multiple workflows brings synchronisation problems. We need to ensure documents are accessible when discocerale. Need to be able to re-run automated extraction.

We want to iteractively ipmprove automated metadat extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension between what a publication is… A tension between established practice and publisher output Need to trial different approaches with catalogues and users… Close that whole loop.


Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert von de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues in accessing it they were, etc. would extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – thats a big high level challenge so that’s quite amenable to machine learning. We have a massive golden data set… There’s at least a masters theory in there, right! And if we built something, then ran it over the 3 million ish items with little metadata could be incredibly useful. In my 0pinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at Distil.pub – a new journal from Google and Y combinator – and that’s the future of these sorts of formats, it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have a fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t acessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-it API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most use the Archive-It service. But, if you are a researcher… to use web archives you really have to interested and engaged, you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, SImon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that post docs can look at this

We are using Warcbase (warcbase.org) and command line tools, we transferred data from internet archive, generate checksums; we generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisation of the Canadan political parties and political interest group web crawls which track changes, although that may include crawler issues.

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tightknit, and I like how we build and share and develop on each others work. Last year we presented webarchives.ca. We’ve indexed 10 TB of warcs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.

Then we have also dealt with scaling issues… 30-40Gb to 1Tb sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on an Open Stack… But right now it isn’t at production scale, but getting there. We are looking at SolrCloud and potentially using a Shard2 per collection.

Last year we had a solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is BlackLight which is in wide use in libraries. There is a bigger community, better APIs, bug fixees, etc… So we have set up a prototype called WARCLight. It does almost all that Shine does, except the tree structure and the advanced searching..

Ian spoke about dericative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.

So, that goal Ian talked about: one central hub for archived data and derivatives…


Q1) Do you plan to make graphs interactive, by using Kebana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kabana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in few years time

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder,,,,

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Backlight than Shine. People respond poorly to WARC. But they can deal with PDFs with CSV, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..



Association of Internet Researchers AoIR2016: Day 4

Today is the last day of the Association of Internet Researchers Conference 2016 – with a couple fewer sessions but I’ll be blogging throughout.

As usual this is a liveblog so corrections, additions, etc. are welcomed. 

PS-24: Rulemaking (Chair: Sandra Braman)

The DMCA Rulemaking and Digital Legal Vernaculars – Olivia G Conti, University of Wisconsin-Madison, United States of America

Apologies, I’ve joined this session late so you miss the first few minutes of what seems to have been an excellent presentation from Olivia. 

Property and ownership claims made of distinctly American values… Grounded in general ideals, evocations of the Bill of Rights. Or asking what Ben Franklin would say… Bringing the ideas of the DCMA as being contrary to the very foundations of the United Statements. Another them was the idea of once you buy something you should be able to edit as you like. Indeed a theme here is the idea of “tinkering and a liberatory endeavour”. And you see people claiming that it is a basic human right to make changes and tinker, to tweak your tractor (or whatever). Commentators are not trying to appeal to the nation state, they are trying to perform the state to make rights claims to enact the rights of the citizen in a digital world.

So, John Deere made a statement that tractro buyers have an “implied license” to their tractor, they don’t own it out right. And that raised controversies as well.

So, the final register rule was that the farmers won: they could repair their own tractors.

But the vernacular legal formations allow us to see the tensions that arise between citizens and the rights holders. And that also raises interesting issues of citizenship – and of citizenship of the state versus citizenship of the digital world.

The Case of the Missing Fair Use: A Multilingual History & Analysis of Twitter’s Policy Documentation – Amy Johnson, MIT, United States of America

This paper looks at the multilingual history and analysis of Twitter’s policy documentation. Or policies as uneven scalar tools of power alignment. And this comes from the idea of thinking of the Twitter as more than just the whole complete overarching platform. There is much research now on moderation, but understanding this type of policy allows you to understand some of the distributed nature of the platforms. Platforms draw lines when they decide which laws to tranform into policies, and then again when they think about which policies to translate.

If you look across at a list of Twitter policies, there is an English language version. Of this list it is only the Fair Use policy and the Twitter API limits that appear only in English. The API policy makes some sense, but the Fair Use policy does not. And Fair Use only appears really late – in 2014. It sets up in 2005, and many other policies come in in 2013… So what is going on?

So, here is the Twitter Fair Use Policy… Now, before I continue here, I want to say that this translation (and lack of) for this policy is unusual. Generally all companies – not just tech companies – translate into FIGS: French, Italian, German, Spanish languages. And Twitter does not do this. But this is in contrast to the translations of the platform itself. And I wanted to talk in particularly about translations into Japanese and Arabic. Now the Japanese translation came about through collaboration with a company that gave it opportunities to expand out into Japen. Arabic is not put in place until 2011, and around the Arab Spring. And the translation isn’t doen by Twitter itself but by another organisaton set up to do this. So you can see that there are other actors here playing into translations of platform and policies. So this iconic platforms are shaped in some unexpected ways.

So… I am not a lawyer but… Fair Use is a phenomenon that creates all sorts of internet lawyering. And typically there are four factors of fair use (Section 107 of US Copyright Act of 1976): purpose and character of use; nature of copyright work; amount and substantiality of portion used; effect of use on potential market for or value of copyright work. And this is very much an american law, from a legal-economic point of view. And the US is the only country that has Fair Use law.

Now there is a concept of “Fair Dealing” – mentioned in passing in Fair Use – which shares some characters. There are other countries with Fair Use law: Poland, Israel, South Korea… Well they point to the English language version. What about Japanese which has a rich reuse community on Twitter? It also points to the English policy.

So, policy are not equal in their policynesss. But why does this matter? Because this is where rule of law starts to break down… And we cannot assume that the same policies apply universally, that can’t be assumed.

But what about parody? Why bring this up? Well parody is tied up with the idea of Fair Use and creative transformation. Comedy is protected Fair Use category. And Twitter has a rich seam of parody. And indeed, if you Google for the fair use policy, the “People also ask” section has as the first question: “What is a parody account”.

Whilst Fair Use wasn’t there as a policy until 2014, parody unofficially had a policy in 2009, an official one in 2010, updates, another version in 2013 for the IPO. Biz Stone writes about, when at Google, lawyers saying about fake accounts “just say it is parody!” and the importance of parody. And indeed the parody policy has been translated much more widely than the Fair Use policy.

So, policies select bodies of law and align platforms to these bodies of law, in varying degree and depending on specific legitimation practices. Fair Use is strongly associated with US law, and embedding that in the translated policies aligns Twitter more to US law than they want to be. But parody has roots in free speech, and that is something that Twitter wishes to align itself with.

Visual Arts in Digital and Online Environments: Changing Copyright and Fair Use Practice among Institutions and Individuals Abstract – Patricia Aufderheide, Aram Sinnreich, American University, United States of America

Patricia: Aram and I have been working with the College Art Association and it brings together a wide range of professionals and practitioners in art across colleges in the US. They had a new code of conduct and we wanted to speak to them, a few months after that code of conduct was released, to see if that had changed practice and understanding. This is a group that use copyrighted work very widely. And indeed one-third of respondents avoid, abandon, or are delayed because of copyrighted work.

Aram: four-fifths of CAA members use copyrighted materials in their work, but only one fifth employ fair use to do that – most or always seek permission. And of those that use fair use there are some that always or usually use Fair Use. So there are real differences here. So, Fair Use are valued if you know about it and undestand it… but a quarter of this group aren’t sure if Fair Use is useful or not. Now there is that code of conduct. There is also some use of Creative Commons and open licenses.

Of those that use copyright materials… But 47% never use open licenses for their own work – there is a real reciprocity gap. Only 26% never use others openly licensed work. and only 10% never use others’ public domain work. Respondents value creative copying… 19 out of 20 CAA members think that creative appropriation can be “original”, and despite this group seeking permissions they also don’t feel that creative appropriation shouldn’t neccassarily require permission. This really points to an education gap within the community.

And 43% said that uncertainty about the law limits creativity. They think they would appropriate works more, they would public more, they would share work online… These mirror fair use usage!

Patricia: We surveyed this group twice in 2013 and in 2016. Much stays the same but there have been changes… In 2016, 2/3rd have heard about the code, and a third have shared that information – with peers, in teaching, with colleagues. Their associations with the concept of Fair Use are very positive.

Arem: The good news is that the code use does lead to change, even within 10 months of launch. This work was done to try and show how much impact a code of conduct has on understanding… And really there was a dramatic differences here. From the 2016 data, those who are not aware of the code, look a lot like those who are aware but have not used the code. But those who use the code, there is a real difference… And more are using fair use.

Patricia: There is one thing we did outside of the survey… There have been dramatic changes in the field. A number of universities have changed journal policies to be default Fair Use – Yale, Duke, etc. There has been a lot of change in the field. Several museums have internally changed how they create and use their materials. So, we have learned that education matters – behaviour changes with knowledge confidence. Peer support matters and validates new knowledge. Institutional action, well publicized, matters .The newest are most likely to change quickly, but the most veteran are in the best position – it is important to have those influencers on board… And teachers need to bring this into their teaching practice.

Panel Q&A

Q1) How many are artists versus other roles?

A1 – Patricia) About 15% are artists, and they tend to be more positive towards fair use.

Q2) I was curious about changes that took place…

A2 – Arem) We couldn’t ask whether the code made you change your practice… But we could ask whether they had used fair use before and after…

Q3) You’ve made this code for the US CAA, have you shared that more widely…

A3 – Patricia) Many of the CAA members work internationally, but the effectiveness of this code in the US context is that it is about interpreting US Fair Use law – it is not a legal document but it has been reviewed by lawyers. But copyright is territorial which makes this less useful internationally as a document. If copyright was more straightforward, that would be great. There are rights of quotation elsewhere, there is fair dealing… And Canadian law looks more like Fair Use. But the US is very litigious so if something passes Fair Use checking, that’s pretty good elsewhere… But otherwise it is all quite territorial.

A3 – Arem) You can see in data we hold that international practitioners have quite different attitudes to American CAA members.

Q4) You talked about the code, and changes in practice. When I talk to filmmakers and documentary makers in Germany they were aware of Fair Use rights but didn’t use them as they are dependent on TV companies buy them and want every part of rights cleared… They don’t want to hurt relationships.

A4 – Patricia) We always do studies before changes and it is always about reputation and relationship concerns… Fair Use only applies if you can obtain the materials independently… But then the question may be that will rights holders be pissed off next time you need to licence content. What everyone told me was that we can do this but it won’t make any difference…

Chair) I understand that, but that question is about use later on, and demonstration of rights clearance.

A4 – Patricia) This is where change in US errors and omissions insurance makes a difference – that protects them. The film and television makers code of conduct helped insurers engage and feel confident to provide that new type of insurance clause.

Q5) With US platforms, as someone in Norway, it can be hard to understand what you can and cannot access and use on, for instance, in YouTube. Also will algorithmic filtering processes of platforms take into account that they deal with content in different territories?

A5 – Arem) I have spoken to Google Council about that issue of filtering by law – there is no difference there… But monitoring

A5 – Amy) I have written about legal fictions before… They are useful for thinking about what a “reasonable person” – and that can be vulnerable by jury and location so writing that into policies helps to shape that.

A5 – Patricia) The jurisdiction is where you create, not where the work is from…

Q6) There is an indecency case in France which they want to try in French court, but Facebook wants it tried in US court. What might the impact on copyright be?

A6 – Arem) A great question but this type of jurisdictional law has been discussed for over 10 years without any clear conclusion.

A6 – Patricia) This is a European issue too – Germany has good exceptions and limitations, France has horrible exceptions and limitations. There is a real challenge for pan European law.

Q7) Did you look at all of impact on advocacy groups who encouraged writing in/completion of replies on DCMA. And was there any big difference between the farmers and car owners?

A7) There was a lot of discussion on the digital right to repair site, and that probably did have an impact. I did work on Net Neutrality before. But in any of those cases I take out boiler plate, and see what they add directly – but there is a whole other paper to be done on boiler plate texts and how they shape responses and terms of additional comments. It wasn’t that easy to distinguish between farmers and car owners, but it was interesting how individuals established credibility. For farmers they talked abot the value of fixing their own equipment, of being independent, of history of ownership. Car mechanics, by contrast, establish technical expertise.

Q8) As a follow up: farmers will have had a long debate over genetically modified seeds – and the right to tinker in different ways…

A8) I didn’t see that reflected in the comments, but there may well be a bigger issue around micromanagement of practices.

Q9) Olivia, I was wondering if you were considering not only the rhetorical arguements of users, what about the way the techniques and tactics they used are received on the other side… What are the effective tactics there, or locate the limits of the effectiveness of the layperson vernacular stategies?

A9) My goal was to see what frames of arguements looked most effective. I think in the case of the John Deere DCMA case that wasn’t that conclusive. It can be really hard to separate the NGO from the individual – especially when NGOs submit huge collections of individual responses. I did a case study on non-consensual pornography was more conclusive in terms of strategies that was effective. The discourses I look at don’t look like legal discourse but I look at the tone and content people use. So, on revenge porn, the law doesn’t really reflect user practice for instance.

Q10) For Amy, I was wondering… Is the problem that Fair Use isn’t translated… Or the law behind that?

A10 – Amy) I think Twitter in particular have found themselves in a weird middle space… Then the exceptions wouldn’t come up. But having it in English is the odd piece. That policy seems to speak specifically to Americans… But you could argue they are trying to impose (maybe that’s a bit too strong) on all English speaking territory. On YouTube all of the policies are translated into the same languages, including Fair Use.

Q11) I’m fascinated in vernacular understanding and then the experts who are in the round tables, who specialise in these areas. How do you see vernacular discourse use in more closed/smaller settings?

A11 – Olivia) I haven’t been able to take this up as so many of those spaces are opaque. But in the 2012 rule making there were some direct quotes from remixers. And there a suggestion around DVD use that people should videotape the TV screen… and that seemed unreasonably onorous…

Chair) Do you forsee a next stage where you get to be in those rooms and do more on that?

A11 – Olivia) I’d love to do some ethnographic studies, to get more involved.

A11 – Patricia) I was in Washington for the DMCA hearings and those are some of the most fun things I go to. I know that the documentary filmmakers have complained about cost of participating… But a technician from the industry gave 30 minutes of evidence on the 40 technical steps to handle analogue film pieces of information… And to show that it’s not actually broadcast quality. It made them gasp. It was devastating and very visual information, and they cited it in their ruling… And similarly in John Deere case the car technicians made impact. By contrast a teacher came in to explain why copying material was important for teaching, but she didn’t have either people or evidence of what the difference is in the classroom.

Q12) I have an interesting case if anyone wants to look at it, around Wikipedia’s Fair Use issues around multimedia. Volunteers take pre-emptively being stricter as they don’t want lawyers to come in on that… And the Wikipedia policies there. There is also automation through bots to delete content without clear Fair Use exception.

A12 – Arem) I’ve seen Fair Use misappropriated on Wikipedia… Copyright images used at low resolution and claimed as Fair Use…

A12- Patricia) Wikimania has all these people who don’t want to deal with law on copyright at all! Wikimedia lawyers are in an a really difficult position.


Association of Internet Researchers AoIR 2016: Day Two

Today I am again at the Association of Internet Researchers AoIR 2016 Conference in Berlin. Yesterday we had workshops, today the conference kicks off properly. Follow the tweets at: #aoir2016.

As usual this is a liveblog so all comments and corrections are very much welcomed. 

Platform Studies: The Rules of Engagement (Chair: Jean Burgess, QUT)

How affordances arise through relations between platforms, their different types of users, and what they do to the technology – Taina Bucher (University of Copenhagen) and Anne Helmond (University of Amsterdam)

Taina: Hearts on Twitter: In 2015 Twitter moved from stars to hearts, changing the affordances of the platform. They stated that they wanted to make the platform more accessible to new users, but that impacted on existing users.

Today we are going to talk about conceptualising affordances. In it’s original meaning an affordance is conceived of as a relational property (Gibson). For Norman perceived affordances were more the concern – thinking about how objects can exhibit or constrain particular actions. Affordances are not just the visual clues or possibilities, but can be felt. Gaver talks about these technology affordances. There are also social affordances – talked about my many – mainly about how poor technological affordances have impact on societies. It is mainly about impact of technology and how it can contain and constrain sociality. And finally we have communicative affordances (Hutchby), how technological affordances impact on communities and communications of practices.

So, what about platform changes? If we think about design affordances, we can see that there are different ways to understand this. The official reason for the design was given as about the audience, affording sociality of community and practices.

Affordances continues to play an important role in media and social media research. They tend to be conceptualised as either high-level or low-level affordances, with ontological and epistemological differences:

  • High: affordance in the relation – actions enabled or constrained
  • Low: affordance in the technical features of the user interface – reference to Gibson but they vary in where and when affordances are seen, and what features are supposed to enable or constrain.

Anne: We want to now turn to platform-sensitive approach, expanding the notion of the user –> different types of platform users, end-users, developers, researchers and advertisers – there is a real diversity of users and user needs and experiences here (see Gillespie on platforms. So, in the case of Twitter there are many users and many agendas – and multiple interfaces. Platforms are dynamic environments – and that differentiates social media platforms from Gibson’s environmental platforms. Computational systems driving media platforms are different, social media platforms adjust interfaces to their users through personalisation, A/B testing, algorithmically organised (e.g. Twitter recommending people to follow based on interests and actions).

In order to take a relational view of affordances, and do that justice, we also need to understand what users afford to the platforms – as they contribute, create content, provide data that enables to use and development and income (through advertisers) for the platform. Returning to Twitter… The platform affords different things for different people

Taking medium-specificity of platforms into account we can revisit earlier conceptions of affordance and critically analyse how they may be employed or translated to platform environments. Platform users are diverse and multiple, and relationships are multidirectional, with users contributing back to the platform. And those different users have different agendas around affordances – and in our Twitter case study, for instance, that includes developers and advertisers, users who are interested in affordances to measure user engagement.

How the social media APIs that scholars so often use for research are—for commercial reasons—skewed positively toward ‘connection’ and thus make it difficult to understand practices of ‘disconnection’ – Nicolas John (Hebrew University of Israel) and Asaf Nissenbaum (Hebrew University of Israel)

Consider this… On Facebook…If you add someone as a friend they are notified. If you unfriend them, they do not. If you post something you see it in your feed, if you delete it it is not broadcast. They have a page called World of Friends – they don’t have one called World of Enemies. And Facebook does not take kindly to app creators who seek to surface unfriending and removal of content. And Facebook is, like other social media platforms, therefore significantly biased towards positive friending and sharing actions. And that has implications for norms and for our research in these spaces.

One of our key questions here is what can’t we know about

Agnotology is defined as the study of ignorance. Robert Proctor talks about this in three terms: native state – childhood for instance; strategic ploy – e.g. the tobacco industry on health for years; lost realm – the knowledge that we cease to hold, that we loose.

I won’t go into detail on critiques of APIs for social science research, but as an overview the main critiques are:

  1. APIs are restrictive – they can cost money, we are limited to a percentage of the whole – Burgess and Bruns 2015; Bucher 2013; Bruns 2013; Driscoll and Walker
  2. APIs are opaque
  3. APIs can change with little notice (and do)
  4. Omitted data – Baym 2013 – now our point is that these platforms collect this data but do not share it.
  5. Bias to present – boyd and Crawford 2012

Asaf: Our methodology was to look at some of the most popular social media spaces and their APIs. We were were looking at connectivity in these spaces – liking, sharing, etc. And we also looked for the opposite traits – unliking, deletion, etc. We found that social media had very little data, if any, on “negative” traits – and we’ll look at this across three areas: other people and their content; me and my content; commercial users and their crowds.

Other people and their content – APIs tend to supply basic connectivity – friends/following, grouping, likes. Almost no historical content – except Facebook which shares when a user has liked a page. Current state only – disconnections are not accounted for. There is a reason to not know this data – privacy concerns perhaps – but that doesn’t explain my not being able to find this sort of information about my own profile.

Me and my content – negative traits and actions are hidden even from ourselves. Success is measured – likes and sharin, of you or by you. Decline is not – disconnections are lost connections… except on Twitter where you can see analytics of followers – but no names there, and not in the API. So we are losing who we once were but are not anymore. Social network sites do not see fit to share information over time… Lacking disconnection data is an idealogical and commercial issue.

Commercial users and their crowds – these users can see much more of their histories, and the negative actions online. They have a different regime of access in many cases, with the ups and downs revealed – though you may need to pay for access. Negative feedback receives special attention. Facebook offers the most detailed information on usage – including blocking and unliking information. Customers know more than users, or Pages vs. Groups.

Nicholas: So, implications. From what Asaf has shared shows the risk for API-based research… Where researchers’ work may be shaped by the affordances of the API being used. Any attempt to capture negative actions – unlikes, choices to leave or unfriend. If we can’t use APIs to measure social media phenomena, we have to use other means. So, unfriending is understood through surveys – time consuming and problematic. And that can put you off exploring these spaces – it limits research. The advertiser-friends user experience distorts the space – it’s like the stock market only reporting the rises except for a few super wealthy users who get the full picture.

A biography of Twitter (a story told through the intertwined stories of its key features and the social norms that give them meaning, drawing on archival material and oral history interviews with users) – Jean Burgess (Queensland University of Technology) and Nancy Baym (Microsoft Research)

I want to start by talking about what I mean by platforms, and what I mean by biographies. Here platforms are these social media platforms that afford particular possibilities, they enable and shape society – we heard about the platformisation of society last night – but their governance, affordances, are shaped by their own economic existance. They are shaping and mediating socio-cultural experience and we need to better to understand the values and socio-cultural concerns of the platforms. By platform studies we mean treating social media platforms as spaces to study in their own rights: as institutions, as mediating forces in the environment.

So, why “biography” here? First we argue that whilst biographical forms tend to be reserved for individuals (occasionally companies and race horses), they are about putting the subject in context of relationships, place in time, and that the context shapes the subject. Biographies are always partial though – based on unreliable interviews and information, they quickly go out of date, and just as we cannot get inside the heads of those who are subjects of biographies, we cannot get inside many of the companies at the heart of social media platforms. But (after Richard Rogers) understanding changes helps us to understand the platform.

So, in our forthcoming book, Twitter: A Biography (NYU 2017), we will look at competing and converging desires around e.g the @, RT, #. Twitter’s key feature set are key characters in it’s biography. Each has been a rich site of competing cultures and norms. We drew extensively on the Internet Archives, bloggers, and interviews with a range of users of the platform.

Nancy: When we interviewed people we downloaded their archive with them and talked through their behaviour and how it had changed – and many of those features and changes emerged from that. What came out strongly is that noone knows what Twitter is for – not just amongst users but also amongst the creators – you see that today with Jack Dorsey and Anne Richards. The heart of this issue is about whether Twitter is about sociality and fun, or is it a very important site for sharing important news and events. Users try to negotiate why they need this space, what is it for… They start squabling saying “Twitter, you are doing it wrong!”… Changes come with backlash and response, changed decisions from Twitter… But that is also accompanied by the media coverage of Twitter, but also the third party platforms build on Twitter.

So the “@” is at the heart of Twitter for sociality and Twitter for information distribution. It was imported from other spaces – IRC most obviously – as with other features. One of the earliest things Twitter incorporated was the @ and the links back.. You have things like originally you could see everyone’s @ replies and that led to feed clutter – although some liked seeing unexpected messages like this. So, Twitter made a change so you could choose. And then they changed again to automatically not see replies from those you don’t follow. So people worked around that with “.@” – which created conflict between the needs of the users, the ways they make it usable, and the way the platform wants to make the space less confusing to new users.

The “RT” gave credit to people for their words, and preserved integrity of words. At first this wasn’t there and so you had huge variance – the RT, the manually spelled out retweet, the hat tip (HT). Technical changes were made, then you saw the number of retweets emerging as a measure of success and changing cultures and practices.

The “#” is hugely disputed – it emerged through hashtag.org: you couldn’t follow them in Twitter at first but they incorporated it to fend off third party tools. They are beloved by techies, and hated by user experience designers. And they are useful but they are also easily coopted by trolls – as we’ve seen on our own hashtag.

Insights into the actual uses to which audience data analytics are put by content creators in the new screen ecology (and the limitations of these analytics) – Stuart Cunningham (QUT) and David Craig (USC Annenberg School for Communication and Journalism)

The algorithmic culture is well understood as a part of our culture. There are around 150 items on Tarleton Gillespie and Nick Seaver’s recent reading list and the literature is growing rapidly. We want to bring back a bounded sense of agency in the context of online creatives.

What do I mean by “online creatives”? Well we are looking at social media entertainment – a “new screen ecology” (Cunningham and Silver 2013; 2015) shaped by new online creatives who are professionalising and monetising on platforms like YouTube, as opposed to professional spaces, e.g. Netflix. YouTube has more than 1 billion users, with revenue in 2015 estimated at $4 billion per year. And there are a large number of online creatives earning significant incomes from their content in these spaces.

Previously online creatives were bound up with ideas of democratic participative cultures but we want to offer an immanent critique of the limits of data analytics/algorithmic culture in shaping SME from with the industry on both the creator (bottom up) and platform (top down) side. This is an approach to social criticism exposes the way reality conflicts not with some “transcendent” concept of rationality but with its own avowed norms, drawing on Foucault’s work on power and domination.

We undertook a large number of interviews and from that I’m going to throw some quotes at you… There is talk of information overload – of what one might do as an online creative presented with a wealth of data. Creatives talk about the “non-scalable practices” – the importance and time required to engage with fans and subscribers. Creatives talk about at least half of a working week being spent on high touch work like responding to comments, managing trolls, and dealing with challenging responses (especially with creators whose kids are engaged in their content).

We also see cross-platform engagement – and an associated major scaling in workload. There is a volume issue on Facebook, and the use of Twitter to manage that. There is also a sense of unintended consequences – scale has destroyed value. Income might be $1 or $2 for 100,000s or millions of views. There are inherent limits to algorithmic culture… But people enjoy being part of it and reflect a real entrepreneurial culture.

In one or tow sentences, the history of YouTube can be seen as a sort of clash of NoCal and SoCal cultures. Again, no-one knows what it is for. And that conflict has been there for ten years. And you also have the MCNs (Multi-Contact Networks) who are caught like the meat in the sandwich here.

Panel Q&A

Q1) I was wondering about user needs and how that factors in. You all drew upon it to an extent… And the dissatisfaction of users around whether needs are listened to or not was evident in some of the case studies here. I wanted to ask about that.

A1 – Nancy) There are lots of users, and users have different needs. When platforms change and users are angry, others are happy. We have different users with very different needs… Both of those perspectives are user needs, they both call for responses to make their needs possible… The conflict and challenges, how platforms respond to those tensions and how efforts to respond raise new tensions… that’s really at the heart here.

A1 – Jean) In our historical work we’ve also seen that some users voices can really overpower others – there are influential users and they sometimes drown out other voices, and I don’t want to stereotype here but often technical voices drown out those more concerned with relationships and intimacy.

Q2) You talked about platforms and how they developed (and I’m afraid I didn’t catch the rest of this question…)

A2 – David) There are multilateral conflicts about what features to include and exclude… And what is interesting is thinking about what ideas fail… With creators you see economic dependence on platforms and affordances – e.g. versus PGC (Professionally Generated Content).

A2 – Nicholas) I don’t know what user needs are in a broader sense, but everyone wants to know who unfriended them, who deleted them… And a dislike button, or an unlike button… The response was strong but “this post makes me sad” doesn’t answer that and there is no “you bastard for posting that!” button.

Q3) Would it be beneficial to expose unfriending/negative traits?

A3 – Nicholas) I can think of a use case for why unfriending would be useful – for instance wouldn’t it be useful to understand unfriending around the US elections. That data is captured – Facebook know – but we cannot access it to research it.

A3 – Stuart) It might be good for researchers, but is it in the public good? In Europe and with the Right to be Forgotten should we limit further the data availability…

A3 – Nancy) I think the challenge is that mismatch of only sharing good things, not sharing and allowing exploration of negative contact and activity.

A3 – Jean) There are business reasons for positivity versus negativity, but it is also about how the platforms imagine their customers and audiences.

Q4) I was intrigued by the idea of the “Medium specificity of platforms” – what would that be? I’ve been thinking about devices and interfaces and how they are accessed… We have what we think of as a range but actually we are used to using really one or two platforms – e.g. Apple iPhone – in terms of design, icons, etc. and the possibilities of interface is, and what happens when something is made impossible by the interface.

A4 – Anne) When the “medium specificity” we are talking about the platform itself as medium. Moving beyond end user and user experience. We wanted to take into account the role of the user – the platform also has interfaces for developers, for advertisers, etc. and we wanted to think about those multiple interfaces, where they connect, how they connect, etc.

A4 – Taina) It’s a great point about medium specitivity but for me it’s more about platform specifity.

A4 – Jean) The integration of mobile web means the phone iOS has a major role here…

A4 – Nancy) We did some work with couples who brought in their phones, and when one had an Apple and one had an Android phone we actually found that they often weren’t aware of what was possible in the social media apps as the interfaces are so different between the different mobile operating systems and interfaces.

Q5) Can you talk about algorithmic content and content innovation?

A5 – David) In our work with YouTube we see forms of innovation that are very platform specific around things like Vine and Instagram. And we also see counter-industrial forms and practices. So, in the US, we see blogging and first person accounts of lives… beauty, unboxing, etc. But if you map content innovation you see (similarly) this taking the form of gaps in mainstream culture – in India that’s stand up comedy for instance. Algorithms are then looking for qualities and connections based on what else is being accessed – creating a virtual circle…

Q6) Can we think of platforms as instable, about platforms having not quite such a uniform sense of purpose and direction…

A6 – Stuart) Most platforms are very big in terms of their finance… If you compare that to 20 years ago the big companies knew what they were doing! Things are much more volatile…

A6 – Jean) That’s very common in the sector, except maybe on Facebook… Maybe.


Association of Internet Researchers AoIR 2016 РDay 1 РJos̩ van Dijck Keynote

If you’ve been following my blog today you will know that I’m in Berlin for the Association of Internet Researchers AoIR 2016 (#aoir2016) Conference, at Humboldt University. As this first day has mainly been about workshops – and I’ve been in a full day long Digital Methods workshop – we do have our first conference keynote this evening. And as it looks a bit different to my workshop blog, I thought a new post was in order.

As usual, this is a live blog post so corrections, comments, etc. are all welcomed. This session is also being videoed so you will probably want to refer to that once it becomes available as the authoritative record of the session. 

Keynote: The Platform Society – José van Dijck (University of Amsterdam) with Session Chair: Jennifer Stromer-Galley



Association of Internet Researchers AoIR 2016: Day 1 – Workshops

After a few weeks of leave I’m now back and spending most of this week at the Association of Internet Researchers (AoIR) Conference 2016. I’m hugely excited to be here as the programme looks excellent with a really wide range of internet research being presented and discussed. I’ll be liveblogging throughout the week starting with today’s workshops.

I am booked into the Digital Methods in Internet Research: A Sampling Menu workshop, although I may be switching session at lunchtime to attend the Internet rules… for Higher Education workshop this afternoon.

The Digital Methods workshop is being chaired by Patrik Wikstrom (Digital Media Research Centre, Queensland University of Technology, Australia) and the speakers are:

  • Erik Borra (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Axel Bruns (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Jean Burgess (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Carolin Gerlitz (University of Siegen, Germany),
  • Anne Helmond (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Ariadna Matamoros Fernandez (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Peta Mitchell (Digital Media Research Centre, Queensland University of Technology, Australia),
  • Richard Rogers (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Fernando N. van der Vlist (Digital Methods Initiative, University of Amsterdam, the Netherlands),
  • Esther Weltevrede (Digital Methods Initiative, University of Amsterdam, the Netherlands).

I’ll be taking notes throughout but the session materials are also available here: http://tinyurl.com/aoir2016-digmethods/.

Patrik: We are in for a long and exciting day! I won’t introduce all the speakers as we won’t have time!

Conceptual Introduction: Situating Digital Methods (Richard Rogers)

My name is Richard Rogers, I’m professor of new media and digital culture at the University of Amsterdam and I have the pleasure of introducing today’s session. So I’m going to do two things, I’ll be situating digital methods in internet-related research, and then taking you through some digital methods.

I would like to situate digital methods as a third era of internet research… I think all of these eras thrive and overlap but they are differentiated.

  1. Web of Cyberspace (1994-2000): Cyberstudies was an effort to see difference in the internet, the virtual as distinct from the real. I’d situate this largely in the 90’s and the work of Steve Jones and Steve (?).
  2. Web as Virtual Society? (2000-2007) saw virtual as part of the real. Offline as baseline and “virtual methods” with work around the digital economy, the digital divide…
  3. Web as societal data (2007-) is about “virtual as indication of the real. Online as baseline.

Right now we use online data about society and culture to make “grounded” claims.

So, if we look at Allrecipes.com Thanksgiving recipe searches on a map we get some idea of regional preference, or we look at Google data in more depth, we get this idea of internet data as grounding for understanding culture, society, tastes.

So, we had this turn in around 2008 to “web as data” as a concept. When this idea was first introduced not all were comfortable with the concept. Mike Thelwell et al (2005) talked about the importance of grounding the data from the internet. So, for instance, Google’s flu trends can be compared to Wikipedia traffic etc. And with these trends we also get the idea of “the internet knows first”, with the web predicting other sources of data.

Now I do want to talk about digital methods in the context of digital humanities data and methods. Lev Manovich talks about Cultural Analytics. It is concerned with digitised cultural materials with materials clusterable in a sort of art historical way – by hue, style, etc. And so this is a sort of big data approach that substitutes “continuous change” for periodisation and categorisation for continuation. So, this approach can, for instance, be applied to Instagram (Selfiexploration), looking at mood, aesthetics, etc. And then we have Culturenomics, mainly through the Google Ngram Viewer. A lot of linguists use this to understand subtle differences as part of distance reading of large corpuses.

And I also want to talk about e-social sciences data and method. Here we have Webometrics (Thelwell et al) with links as reputational markers. The other tradition here is Altmetrics (Priem et al), which uses online data to do citation analysis, with social media data.

So, at least initially, the idea behind digital methods was to be in a different space. The study of online digital objects, and also natively online method – methods developed for the medium. And natively digital is meant in a computing sense here. In computing software has a native mode when it is written for a specific processor, so these are methods specifically created for the digital medium. We also have digitized methods, those which have been imported and migrated methods adapted slightly to the online.

Generally speaking there is a sort of protocol for digital methods: Which objects and data are available? (links, tags, timestamps); how do dominant devices handle them? etc.

I will talk about some methods here:

1. Hyperlink

For the hyperlink analysis there are several methods. The Issue Crawler software, still running and working, enable you to see links between pages, direction of linking, aspirational linking… For example a visualisation of an Armenian NGO shows the dynamics of an issue network showing politics of association.

The other method that can be used here takes a list of sensitive sites, using Issue Crawler, then parse it through an internet censorship service. And variations on this that indicate how successful attempts at internet censorship are. We do work on Iran and China and I should say that we are always quite thoughtful about how we publish these results because of their sensitivity.

2. The website as archived object

We have the Internet Archive and we have individual archived web sites. Both are useful but researcher use is not terribly signficant so we have been doing work on this. See also a YouTube video called “Google and the politics of tabs” – a technique to create a movie of the evolution of a webpage in the style of timelapse photography. I will be publishing soon about this technique.

But we have also been looking at historical hyperlink analysis – giving you that context that you won’t see represented in archives directly. This shows the connections between sites at a previous point in time. We also discovered that the “Ghostery” plugin can also be used with archived websites – for trackers and for code. So you can see the evolution and use of trackers on any website/set of websites.

6. Wikipedia as cultural reference

Note: the numbering is from a headline list of 10, hence the odd numbering… 

We have been looking at the evolution of Wikipedia pages, understanding how they change. It seems that pages shift from neutral to national points of view… So we looked at Srebenica and how that is represented. The pages here have different names, indicating difference in the politics of memory and reconciliation. We have developed a triangulation tool that grabs links and references and compares them across different pages. We also developed comparative image analysis that lets you see which images are shared across articles.

7. Facebook and other social networking sites

Facebook is, as you probably well know, is a social media platform that is relatively difficult to pin down at a moment in time. Trying to pin down the history of Facebook find that very hard – it hasn’t been in the Internet Archive for four years, the site changes all the time. We have developed two approaches: one for social media profiles and interest data as means of stufying cultural taste ad political preference or “Postdemographics”; And “Networked content analysis” which uses social media activity data as means of studying “most engaged with content” – that helps with the fact that profiles are no longer available via the API. To some extend the API drives the research, but then taking a digital methods approach we need to work with the medium, find which possibilities are there for research.

So, one of the projects undertaken with in this space was elFriendo, a MySpace-based project which looked at the cultural tastes of “friends” of Obama and McCain during their presidential race. For instance Obama’s friends best liked Lost and The Daily Show on TV, McCain’s liked Desperate Housewives, America’s Next Top Model, etc. Very different cultures and interests.

Now the Networked Content Analysis approach, where you quantify and then analyse, works well with Facebook. You can look at pages and use data from the API to understand the pages and groups that liked each other, to compare memberships of groups etc. (at the time you were able to do this). In this process you could see specific administrator names, and we did this with right wing data working with a group called Hope not Hate, who recognised many of the names that emerged here. Looking at most liked content from groups you also see the shared values, cultural issues, etc.

So, you could see two areas of Facebook Studies, Facebook I (2006-2011) about presentation of self: profiles and interests studies (with ethics); Facebook II (2011-) which is more about social movements. I think many social media platforms are following this shift – or would like to. So in Instagram Studies the Instagram I (2010-2014) was about selfie culture, but has shifed to Instagram II (2014-) concerned with antagonistic hashtag use for instance.

Twitter has done this and gone further… Twitter I (2006-2009) was about urban lifestyle tool (origins) and “banal” lunch tweets – their own tagline of “what are you doing?”, a connectivist space; Twitter II (2009-2012) has moved to elections, disasters and revolutions. The tagline is “what’s happening?” and we have metrics “trending topics”; Twitter III (2012-) sees this as a generic resource tool with commodification of data, stock market predictions, elections, etc.

So, I want to finish by talking about work on Twitter as a storytelling machine for remote event analysis. This is an approach we developed some years ago around the Iran event crisis. We made a tweet collection around a single Twitter hashtag – which is no longer done – and then ordered by most retweeted (top 3 for each day) and presented in chronological (not reverse) order. And we then showed those in huge displays around the world…

To take you back to June 2009… Mousavi holds an emergency press conference. Voter turn out is 80%. SMS is down. Mousavi’s website and Facebook are blocked. Police use pepper spray… The first 20 days of most popular tweets is a good succinct summary of the events.

So, I’ve taken you on a whistle stop tour of methods. I don’t know if we are coming to the end of this. I was having a conversation the other day that the Web 2.0 days are over really, the idea that the web is readily accessible, that APIs and data is there to be scraped… That’s really changing. This is one of the reasons the app space is so hard to research. We are moving again to user studies to an extent. What the Chinese researchers are doing involves convoluted processes to getting the data for instance. But there are so many areas of research that can still be done. Issue Crawler is still out there and other tools are available at tools.digitalmethods.net.

Twitter studies with DMI-TCAT (Erik Borra)

I’m going to be talking about how we can use the DMI-TCAT tool to do Twitter Studies. I am here with Emile den Tex, one of the original developers of this tool, alongside Eric Borra.

So, what is DMI-TCAT? It is the Digital Methods Initiative Twitter Capture and Analysis Toolset, a server side tool which tries to capture robust and reproducible data capture and analysis. The design is based on two ideas: that captured datasets can be refined in different ways; and that the datasets can be analysed in different ways. Although we developed this tool, it is also in use elsewhere, particularly in the US and Australia.

So, how do we actually capture Twitter data? Some of you will have some experience of trying to do this. As researchers we don’t just want the data, we also want to look at the platform in itself. If you are in industry you get Twitter data through a “data partner”, the biggest of which by far is GNIP – owned by Twitter as of the last two years – then you just pay for it. But it is pricey. If you are a researcher you can go to an academic data partner – DiscoverText or Hexagon – and they are also resellers but they are less costly. And then the third route is the publicly available data – REST APIs, Search API, Streaming APIs. These are, to an extent, the authentic user perspective as most people use these… We have built around these but the available data and APIs shape and constrain the design and the data.

For instance the “Search API” prioritises “relevance” over “completeness” – but as academics we don’t know how “relevance” is being defined here. If you want to do representative research then completeness may be most important. If you want to look at how Twitter prioritises the data, then that Search API may be most relevant. You also have to understand rate limits… This can constrain research, as different data has different rate limits.

So there are many layers of technical mediation here, across three big actors: Twitter platform – and the APIs and technical data interfaces; DMI-TCAT (extraction); Output types. And those APIs and technical data interfaces are significant mediators here, and important to understand their implications in our work as researchers.

So, onto the DMI-TCAT tool itself – more on this in Borra & Reider (2014) (doi:10.1108/AJIM-09-2013-0094). They talk about “programmed method” and the idea of the methodological implications of the technical architecture.

What can one learn if one looks at Twitter through this “programmed method”? Well (1) Twitter users can change their Twitter handle, but their ids will remain identical – sounds basic but its important to understand when collecting data. (2) the length of a Tweet may vary beyond maximum of 140 characters (mentions and urls); (3) native retweets may have their top level text property stortened. (4) Unexpected limitations  support for new emoji characters can be problematic. (5) It is possible to retrieve a deleted tweet.

So, for example, a tweet can vary beyond 140 characters. The Retweet of an original post may be abbreviated… Now we don’t want that, we want it to look as it would to a user. So, we capture it in our tool in the non-truncated version.

And, on the issue of deletion and witholding. There are tweets deleted by users, and their are tweets which are withheld by the platform – and the withholding is a country by country issue. But you can see tweets only available in some countries. A project that uses this information is “Politwoops” (http://politwoops.sunlightfoundation.com/) which captures tweets deleted by US politicians, that lets you filter to specific states, party, position. Now there is an ethical discussion to be had here… We don’t know why tweets are deleted… We could at least talk about it.

So, the tool captures Twitter data in two ways. Firstly there is the direct capture capabilities (via web front-end) which allows tracking of users and capture of public tweets posted by these users; tracking particular terms or keywords, including hashtags; get a small random (approx 1%) of all public statuses. Secondary capture capabilities (via scripts) allows further exploration, including user ids, deleted tweets etc.

Twitter as a platform has a very formalised idea of sociality, the types of connections, parameters, etc. When we use the term “user” we mean it in the platform defined object meaning of the word.

Secondary analytical capabilities, via script, also allows further work:

  1. support for geographical polygons to delineate geographical regions for tracking particular terms or keywords, including hashtags.
  2. Built-in URL expander, following shortened URLs to their destination. Allowing further analysis, including of which statuses are pointing to the same URLs.
  3. Download media (e.g. videos and images (attached to particular Tweets).

So, we have this tool but what sort of studies might we do with Twitter? Some ideas to get you thinking:

  1. Hashtag analysis – users, devices etc. Why? They are often embedded in social issues.
  2. Mentions analysis – users mentioned in contexts, associations, etc. allowing you to e.g. identify expertise.
  3. Retweet analysis – most retweeted per day.
  4. URL analysis – the content that is most referenced.

So Emile will now go through the tool and how you’d use it in this way…

Emile: I’m going to walk through some main features of the DMI TCAT tool. We are going to use a demo site (http://tcatdemo.emiledentex.nl/analysis/) and look at some Trump tweets…

Note: I won’t blog everything here as it is a walkthrough, but we are playing with timestamps (the tool uses UTC), search terms etc. We are exploring hashtag frequency… In that list you can see Bengazi, tpp, etc. Now, once you see a common hashtag, you can go back and query the dataset again for that hashtag/search terms… And you can filter down… And look at “identical tweets” to found the most retweeted content. 

Emile: Eric called this a list making tool – it sounds dull but it is so useful… And you can then put the data through other tools. You can put tweets into Gephi. Or you can do exploration… We looked at Getty Parks project, scraped images, reverse Google image searched those images to find the originals, checked the metadata for the camera used, and investigated whether the cost of a camera was related to the success in distributing an image…

Richard: It was a critique of user generated content.

Analysing Social Media Data with TCAT and Tableau (Axel Bruns)

Analysing Network Dynamics with Agent Based Models (Patrik Wikström)

Tracking the Trackers (Anne Helmond, Carolin Gerlitz, Esther Weltevrede and Fernando van der Vlist)

Multiplatform Issue Mapping (Jean Burgess & Ariadna Matamoros Fernandez)

Analysing and visualising geospatial data (Peta Mitchell)



Supporting Digital Scholarship within the College of Humanities and Social Sciences

Back on 2nd December 2015 I attended a Digital Scholarship event arranged by Anouk Lang, lecturer in Digital Humanities at the University of Edinburgh.

The event ran in two parts: the first section enabled those interested in digital humanities to hear about events, training opportunities, and experiences of others, mainly those based within the College of Humanities and Social Sciences; the second half of the event involved short presentations and group discussions on practical needs and resources available. My colleague Lisa Otty and I had been asked to present at the second half of the day, sharing the range of services, skills and expertise EDINA offer for digital scholarship (do contact us if you’d like to know more), and were delighted to be able to attend the full half day event.

My notes were captured live, so all the usual caveats about typos, corrections, additions, etc. apply despite the delay in me setting this live. 

The event is opening, after a wee intro from Anouk Lang, with a review of various events and sessions around Digital Humanities, starting with those who had addtended the DHOxSS: Digital Humanities Summer School at Oxford in summer 2016.

Continue reading