figshare

Peter Burnhill, Director of EDINA, is introducing our closing keynote speaker, something of a Repository Fringe frequent flyer. But he is also announcing that this year is the 30th birthday of the University of Edinburgh Data Library. There was a need for social scientists to store data and work with it. That has come a long way since. And we now face questions like curation, access, etc. Back to my first duty here… I had an email from Robin Rice in 2011 saying “we like FigShare”, and I wrote to the organising list “FigShare: could be the new data sharing killer app!”, a bit of an understatement there. So, let’s find out what’s happened in the last two days. So, over to Mark!

Mark Hahnel – FigShare

So I am doing this PK-style as it’s Friday afternoon and we have people on stilts going past! Here we have people from institutions, from libraries. I’m not. We have different ideas so I want your ideas and feedback!

So I’m going to talk about open and closed… We’ll see where we get.

So FigShare lets you upload your research. You can manage your research in the cloud. This has evolved since 2011. We can’t ignore why not all data can be open… So we have a private side now. Our core goal is still being Discoverable, Sharable (social media), Citable (DOI). Discoverable is tricky!

We are hosted on Amazon Web Services, we are an ORCID launch partner (the only one with non-article data I think), we are on COPE (the Committee on Publication Ethics), we are getting DOIs from DataCite AND we are backed up in LOCKSS.

We wanted dissemination of content on the internet – it’s a solved issue. Instead of going backwards… Let’s see how we go forward by copying this stuff. What services like Flickr, SoundCloud etc. have in common is that they visualise content in the browser – you don’t have to download to use it.

So live demo number 1. So we have a poster here. Content on the left. Author there. Simple metadata, DOI, and social media shares. We’ve just added embedding – upload content to FigShare and use it on your own site. Dataset views are custom built in the browser – you want to see your 2GB file before you download it. You shouldn’t even be downloading, it should all be on the web and will be. And we have author profiles. With stats including sharing stats. That is motivating. That rewards sharing. Think about who is involved in research. We try to do the other side of the incentives action here too! Metrics are good. So is doing something cool with it. So for instance here is a blog post with a CSV and a graph. So we have a PNG of the data… You can’t interact. But the CSV lets you create new interactive charts. And we also added in ways to filter data.

We are also looking at incentivising to give back – doing research like an instant T test. Moving towards the idea of interactive research. But this is something that allows you to make research more interactive.

Q – Pat McSweeney) is this live or forthcoming?

It’s live but manually done. A use case for groups that use FigShare the most, that need special interaction for journals.

We are a commercial company but you can upload data for free. We work with publishers. We visualise content really well. So this is additional materials for PLoS, these are all just here on FigShare – there’s a video? Play it! It’s how the internet works! Don’t download! We do this for publishers. Another thing we created for a publisher is that you click open a graph and you get a dataset. A researcher asked for it, we built it!

So, back off the internet…

So discoverable. What does that mean? Google finds us but… Well, is it hearsay? So DataCite started tracking our DOIs. For three months we were 8 out of the top ten, then 7 out of the top ten, then 9 out of the top ten for traffic. So hey, we are discoverable!

But the future of repositories… Who cares?

So who takes ownership of this problem now – funders, stakeholders, or academics? I think it’s institutions and more specifically librarians. Librarians are badass. They have taken ownership. They lead change, they try new things.

But the funders? Funders are really reacting to the fact that they want their data – it may be about what researchers want to reuse but really it’s about the impact of their spending. But they are owning that problem. NSF requires sharing with other researchers, and similarly for the humanities. The EU are also talking about this – but not owning the problem, just declaring it really.

So looking across funders… Some have policies… Some have stipulations… The Wellcome Trust withhold 10% of cash if you do not share data. That will make a difference. But what do you do with that data?

What about academics? Well they share data! I generated 9GB a year – probably in middle of the curve in terms of scale – in my PhD. So globally 3PB/year ish. But how much of my PhD is available? A few KB of data. My PhD is under embargo until later in the year, but it will be there.
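As a rough sanity check on that estimate (this was not shown in the talk), the two figures quoted can be divided to see how many "middle of the curve" researchers they imply. The decimal units and the function name are assumptions made for illustration.

```python
# Back-of-envelope check of "9GB a year each" versus "3PB/year globally".
# Assumption: decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes).
GB = 10**9
PB = 10**15


def implied_researchers(global_bytes: float, per_researcher_bytes: float) -> float:
    """Number of researchers implied if everyone produced the same volume."""
    return global_bytes / per_researcher_bytes


if __name__ == "__main__":
    n = implied_researchers(3 * PB, 9 * GB)
    print(f"Implied number of researchers: {n:,.0f}")
```

So the "3PB/year ish" figure corresponds to roughly a third of a million researchers each generating about 9GB a year.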

I felt there were moral and ethical obligations. Sharing detailed research data is associated with increased citation. Simplicity matters, visualisation is cool. I thought it was about an ego trip, academics have to disambiguate themselves…

Now two years after leaving I was asked to come back in and print Excel files for my data for a publication… I generated this without a research data plan. Two years after I left my boss thinks I still work for her. She will hand the next guy working for her… What does he do, copy them back in?

So there is so much more here. It is not just open or closed, it is about control. It’s the Cory Doctorow thing, the further you are from a problem, the more data you’ll give up, the Facebook issue. You do want control, it matters.

So what motivates academics? Being easy, being useful, and what funders want – we will jump through hoops for them.

So back to the web… My profile has new different stuff but you’ll see sharing folders – group projects and discussions, ways to reshare that data. Nudge your sharing. But you need the file uploaded now to share two years later. You can share otherwise closed things with colleagues, regardless of institution.

Btw on this slide we have our designer’s idea of an institutional library – looks a lot like a prison.

So back to those libraries. How much data does an institution generate? Very few know this; how do you assess it? Right now we are doing stuff for PLoS: we let them browse all their stuff. They can see what they produced. And this aggregation is great for SEO too. It makes it easy to Google and then find the research article from there. So from this aggregation we can filter by most viewed, or down to particular titles. Essentially this is a repository of research outputs, we take all formats. You can imagine that this could be there for any institution. And this has an API.

Institutions also want stats. See where traffic is from. Not just location but institutional IP ranges. So we can show where that item has impact, where viewers come from. But, at the same time, populating repositories is hard. But we have data from Nature, from PLoS. We can hand that data back to your repositories. We can find the association with the institution.
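To make the "institutional IP ranges" idea concrete, here is a minimal sketch of how views might be attributed to institutions. It is not FigShare's implementation: the institution names are hypothetical and the address ranges are reserved documentation ranges, used only as placeholders.

```python
import ipaddress
from collections import Counter

# Hypothetical mapping of institutions to their IP ranges. The networks below
# are reserved documentation ranges (RFC 5737), not real institutional ranges.
INSTITUTION_RANGES = {
    "Example University A": [ipaddress.ip_network("192.0.2.0/24")],
    "Example University B": [ipaddress.ip_network("198.51.100.0/24")],
}


def institution_for(ip: str) -> str:
    """Return the institution whose range contains `ip`, or 'Unknown'."""
    addr = ipaddress.ip_address(ip)
    for name, networks in INSTITUTION_RANGES.items():
        if any(addr in net for net in networks):
            return name
    return "Unknown"


def views_by_institution(visitor_ips):
    """Count views per institution for a list of visitor IP addresses."""
    return Counter(institution_for(ip) for ip in visitor_ips)


if __name__ == "__main__":
    log = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "203.0.113.9"]
    for name, views in views_by_institution(log).most_common():
        print(name, views)
```

A real service would need maintained range lists per institution; the lookup itself stays this simple.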

So it’s about control. It’s Research Data Management as well as Research Output Dissemination all in one.

So we have launched FigShare for institutions. We have heard concerns about metadata standards and how much metadata we have, so Henry Winlaker used our API to build a way to add more metadata to fit institutional needs. So if you share responsibility… Well what’s the point of the institutional repository? I would say that I think IRs are about to move fast. They have to, it was idealistic but now it’s mandated! Next year repositories will look very different. RDM plans say they have to. Funders say they have to.

This community is amazing! ResourceSync is great, I want to use it! PMR’s Dev challenge idea is great. We are commercial but we can work together!

Do we need to go back further? People use Dropbox, drag files in. We have a desktop app too. But maybe whenever you save a file you need to upload it then. So at projects.ac there is a project. A filesystem that nudges you to add metadata and do things as you are rewarded to do them. You can star things, it does version control. Digital Science created this. It’s kind of like it can do so much more. So they are releasing it to see what’s needed. What’s really cool… You can download this now… If you press save now it saves it to FigShare. That sync would be ideal. Trying it out now. I work in the same office but there is no reason why these shouldn’t all be connected up, to IRs, to FigShare, to all of these things…

And this is a slide specially for Peter Murray-Rust…

I know that openness is brilliant! But it’s also great to work with publishers. More files were made available for free, for academics, that’s great. Everything publicly available will ONLY be by CC0 and CC-BY. SHARE ALL THE DATA.

Q&A

Q1 – Paul) what is the business model?

A1) For PLoS it’s about visualisations and data. They pay us to do that. They have a business model for that. And FigShare for Institutions is coming, that’s also part of the model.

Q2 – Peter MR) I trust you completely but I do not trust Elsevier or Google… etc. So you have to build organisational DNA to prevent you becoming evil. If you left or died what would happen to FigShare, you see the point?

A2) I see that. But this is aimed at… this costs us money. We sell to institutions but there are economies of scale. Two institutions have built their own data repositories and they cost £1 million and £2 million. That’s a lot of money.

Q2) Mendeley have a copy of all the published scientific data these days. FigShare will have a massive value of data in it, huge worth; institutions may want to know what staff are doing, to spy on them. You have something of vast power, vast potential value. The time is now to create a governance structure to address that.

Peter Burnhill) there are some fundamental trust issues

Mark Hahnel) you can trust the internet to an extent. Make stuff available and it proliferates but you can reuse, you can sell it on etc.

Peter Burnhill) next year we need a discussion of ethics

Q3 – Kevin Ashley) FigShare for Institutions. Can you say anything about the background consultation around that? A contract is very different to free stuff.

A3) Sure, legally we have a lot of responsibility. We’ve been working with universities, individual ones, to see what the needs are. We spoke to lots of people. Mainly in London, but to see we didn’t tread on toes, that we didn’t risk their research leaking out. We spoke to institutions more globally. Digital Science is a good thing, this is where they come in.

Peter Burnhill) I am a member of the CLOCKSS board. There is a contract between all publishers that CLOCKSS ingests everything they make available, and it says that if a failure to deliver happens – for whatever reason – then CLOCKSS have the right to make that data available via platforms (one here at EDINA, one at Stanford). So in terms of assurance that what comes in goes out, joining CLOCKSS does that. The agreement is supra-governmental. You give up that right there, so that it will remain available.

Mark: absolutely. And all data is available via the API if you want to.

Final Wrap Up – Kevin Ashley

Thank you to Mark for a great final session. So, at an event like this we come here to share ideas, we come to share experience, we look for answers, we come to meet people and to make new connections. We come to learn. We may come with one or many objectives. We at the DCC certainly have been able to. Many of you are new here.

I have learnt lots of stuff. A few things stuck. A whole room of experts can’t put an object into an EPrints repository, there’s a lesson there somewhere about interfaces. And the other interesting idea I picked up was from Les Carr. Maintaining open access and having a business plan for what we do. So the DCC how to set up RDM licenses are free but limited edition leather-bound copies are to come – great idea Les!

I hope all of you did one or several of those things. Then share, tell us, this is an unconference! We want to keep making this event better every year. We see the event as being about you, about facilitating you to meet and connect.

There will be a Repository Fringe next year. One reason for that is that we have fantastic sponsors. All of whom put into this event. And hopefully we can extend that further next year. But thank you also to session chairs, the speakers, and to the organising committee here. I know how much work goes into this. And a great deal happens and happens smoothly because of that work.

Two people to thank specifically. Florance Kennedy of the DCC and our chair Nicola Osborne!

After a refreshing coffee we’re back and Robin Rice of EDINA is introducing our next speaker. All of the work in the Research Data Management strand is about long term cultural change and I think Mark’s approach here is really inspired.

Mark Hahnel (Imperial College London) – Figshare – Publish All Your Data

Don’t be mad at me for not having a guitar!

Basically this is a bit different to the other repositories in terms of what it does. One problem everyone seems to have is incentivising people to upload and share their data. This is about what would incentivise me as someone from a science background.

I was doing a PhD, generating data, then generated lots of data, charts, graphs, etc. Only a tiny percentage of what I produced will ever be written up but that other data is useful too. That smaller subset will get out there with traditional publication methods. What can I do so that others can use, cite or be aware of the rest? This was the whole idea behind FigShare. This was originally an idea selfishly for myself. It’s built on a MediaWiki base. Others said – well it’s useful for you but it might be useful to others too…

But why do this? Well within that data I have tested what x does to y. But I know that 20 other labs may fund the same research. There is this whole issue of negative data – it’s part of what is broken in the current publishing systems. In those 20 labs you can get 19 with negative results, 1 with a false positive but it’s much easier to publish that one result than those negative ones.
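The "20 labs, 1 false positive" point can be put in numbers with a small sketch. This is an illustration rather than anything from the talk: it assumes the effect being tested is not real, that the labs are independent, and that each uses the conventional significance threshold of 0.05.

```python
# Illustration of why unpublished negative results matter.
# Assumptions (not from the talk): the true effect is zero, labs are
# independent, and each lab tests at significance level alpha = 0.05.


def prob_at_least_one_false_positive(labs: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `labs` null experiments looks positive."""
    return 1.0 - (1.0 - alpha) ** labs


def expected_false_positives(labs: int, alpha: float = 0.05) -> float:
    """Average number of labs that will see a spurious positive result."""
    return labs * alpha


if __name__ == "__main__":
    print(f"P(at least one false positive): {prob_at_least_one_false_positive(20):.2f}")
    print(f"Expected false positives: {expected_false_positives(20):.1f}")
```

Under those assumptions, one spurious positive among 20 labs is the expected outcome, and if only that one is published the literature misleads everyone else.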

So FigShare comes in here. A very simple set of boxes – I won’t use a repository that I have to be trained in. No one would use Facebook if you needed training for it! And researchers want their data to be visualised – we are working on making that embeddable. Each set of data has a persistent URL (no matter where hosted). And this has clickable everything. You can also preview datasets on the page without having to download everything. And automatically a researcher profile collects their work.

And we also have space for videos – again not publishable but they show interesting things. You can link your theses to this permanent URL in the same way. One of the things I have learned is that if you build a platform for scientists they will do their own thing with it. I thought it would be great for disseminating data and finding stuff on Google. Others have said they want feedback on material for publication. People started sharing their research through different outputs. If you click on a person you can pull in an RSS feed of their research. So people have been plugging that RSS into FriendFeed to disseminate, and people have given great feedback, questioned methodology and started collaborating. You could also plug the RSS feed into a blog as an eLab Book.

And the permanent storage of something online – access your research anywhere, which means you can instantly show people what you are working on. In terms of permanence we are working on exports to EndNote and so on. The handles are similar to DOIs. Everything is listed by tags, searches etc. It is discoverable. You can search or browse by anything here.

I wanted to do this for selfish reasons. When I started my PhD (on mobilisation of MSCs) my lab had just had a huge paper released, reviewed in Nature, a feature on page 3 of the Guardian. If I search now, my own work – which is useful for others – on FigShare is the top result even though it will not be published in a journal. I am happy to see that it is working in terms of discoverability. So the thing about this is that the data is more discoverable, it’s disseminated, it’s available for sharing. We have done all this on a budget of zero and for that reason we are asking researchers to make their data open when they upload it here.

The thing about JISC is that they fund these amazing tools and resources but even as an interested researcher I don’t find things out. When I do I retweet, I get the word out. Retweet everything! Make the most of the amazing stuff that is being built.

In the first few months we had several hundred researchers and 700-ish data sets submitted. Even with 700 objects that’s not great to search. It was suggested that I seed the database. There is an open subset of PMC articles but finding the figures is tough, so this is about breaking figures out of repositories. About a month ago we began parsing the XML files and we have been pulling in about 2,000-3,000 figures per day. About 50,000 figures so far. We should make about half a million figures more discoverable in total in this process. The other thing is that if you publish in an open access journal you therefore may already have a profile and data available.
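For a flavour of what "parsing the XML files" to break figures out might look like, here is a minimal sketch. It is not the FigShare pipeline: it assumes the articles use JATS-style markup where each figure is a `<fig>` element with a `<label>`, `<caption>` and `<graphic xlink:href="...">`, and the sample document and file names are made up.

```python
import xml.etree.ElementTree as ET

# Attribute name for xlink:href in ElementTree's {namespace}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up sample standing in for one open access article's XML.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Cell counts after treatment.</p></caption>
      <graphic xlink:href="example-g001"/>
    </fig>
    <fig id="F2">
      <label>Figure 2</label>
      <caption><p>Dose response <italic>in vitro</italic>.</p></caption>
      <graphic xlink:href="example-g002"/>
    </fig>
  </body>
</article>"""


def extract_figures(xml_text: str) -> list:
    """Return one dict per <fig> element: id, label, caption text, image ref."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        graphic = fig.find("graphic")
        # Flatten nested markup (e.g. <italic>) and collapse whitespace.
        caption_text = " ".join("".join(caption.itertext()).split()) if caption is not None else ""
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": caption_text,
            "image": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


if __name__ == "__main__":
    for figure in extract_figures(SAMPLE):
        print(figure)
```

Each extracted record could then become its own discoverable object, which is the "breaking figures out" idea described above.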

We’ve been looking at what else might be needed…

We were asked to allow grouped files – for projects but also for complex 3D imaging objects. Researchers like to big themselves up. We have included altmetrics here – allowing new ways to boast about their work. Also graphical representations of page views – in a nice graph it’s quite appealing. And we also provide embed code for adding their data to their theses or papers etc.

So that is the long and short of the features as it is. And everyone I’ve talked to in science has an opinion – positive or negative. I am really pleased that so many repositories are educating researchers on depositing data and articles and on open access.

Q&A

Q1 – Les Carr) It’s just amazing what one can accomplish as a diversion almost from one’s PhD. Looking at all these figures from external data sources, the actual data sets – which are so important – you have a handful of dozens of those. Any sense of how you will increase this?

A1) I have an idea that when we’re doing journal clubs and things like that you can use the QR code to look at the figure, see the data, explore further. Some journals require you to upload all of your data. There are projects like Dryad. There are lots of datasets under CC0 – I could do that in the same way as we have for the figures but I’d prefer people to upload their own data.

Q2 – Peter Murray-Rust) I think this is fantastic. Have you had any interest from journals about this? For instance I work with BioMed Central and this would be trivial to link back and forth.

A2) BioMed Central have been in touch, mainly as we have been compiling a list of repositories to deposit specialist materials.

Q3 – Robin Rice) If journals and publishers are becoming dependent on figures being there, what do you see as the sustainability model for FigShare?

A3) In the first week of pre-beta a not-for-profit organisation offered to host FigShare indefinitely – at least 3 years, and it’s just had funding for at least the next 20 years.

EDINA Blogs

A Blogs.edina.ac.uk weblog

LiveBlog: Closing Keynote

LiveBlog: Mark Hahnel – Figshare: Publish All Your Data