Highlights from the RDM Programme Progress Report: August – October 2015

The RDM Roadmap 2.0 has been completed, approved, and published online and work has started on achieving the deliverables. A copy of the Roadmap is publicly available on the RDM webpages and can be downloaded from http://www.ed.ac.uk/files/atoms/files//uoe-rdm-roadmap_-_v2_0.pdf.

The RDM Services brochure has now been published in both paper and electronic form and is proving very popular with researchers. The electronic version can be downloaded from http://www.ed.ac.uk/files/atoms/files/rdm_service_a5_booklet_0.pdf

Work on DataVault is progressing well and an interim DataVault service is now nearly complete. The Software Sustainability Institute has worked with the DataVault team to road-test the interim solution; as a result, some optimisations to the process were identified and are now being implemented. DataVault user events have been held in both Manchester and Edinburgh; both were well attended, and the general impression of the current DataVault functionality was positive. Further (round three) funding will be sought from Jisc in December to continue this joint development effort.

Jisc has provided funding for up to nine PhD students to be employed one day per week for four months within their schools. Their role will be to help researchers within their school record their research data as Datasets in the PURE system, and to direct any RDM or DMP queries to the RDM team for further support. The Dataset records in PURE will form the University of Edinburgh's contribution to the national Research Data Discovery Service; this will increase the discoverability of Edinburgh data and ensure that more researchers meet their funders' requirements to make their data discoverable and reusable. Applications for the first set of three PhD student internships have been received and are currently being shortlisted; the successful applicants should be able to begin work before the end of 2015.

In October some minor questions were received about the DataShare application for the Data Seal of Approval (DSA); these were answered, and DataShare has now been awarded the DSA. This is a major achievement for the entire DataShare team, who have worked hard to make DataShare a Trusted Digital Repository.

Over the three-month period a total of 173 staff and PGRs attended an RDM course or workshop; an additional 20-25 staff attended research committee meetings or small-group presentations where RDM was on the agenda. Both regular and on-demand RDM sessions (courses, workshops, and presentations) will continue to be offered, and we are currently scheduling 30 courses and workshops for January to June 2016, as well as a number of presentations.

The “Data Management and Sharing” Coursera MOOC is well under way, with a December launch anticipated. Sarah Jones of the DCC is our video instructor, using scripts adapted from MANTRA.

National and International Engagement Activities

10th August: Meeting in London with other Alan Turing Institute members to discuss the RDM requirements to be provided by member institutions.

17th August: A one-day RDM event was organised for Danish visitors from the University of Copenhagen to present UoE RDM services, outreach activities, and ELNs.

31st August: Dealing with Data conference.

7th/8th September: Meeting with Göttingen University to discuss digital scholarship, including RDM.

7th October: DataVault engagement event at Manchester University.

29th October: Educause conference, Indianapolis. Robin Rice was on a panel with Jan Cheetham and Brianna Marshall (University of Wisconsin) and Rory Macneil (RSpace): “Drivers and responses toward research data management maturity: transatlantic perspectives.”

Kerry Miller

RDM Service Co-Ordinator


Jisc Data Vault update

Posted on behalf of Claire Knowles

Research data are being generated at an ever-increasing rate. This brings challenges in how to store, analyse, and care for the data. Part of this challenge is the long-term stewardship of researchers’ private data and associated files, which need a safe and secure home for the medium to long term.

The Data Vault project, funded by the Jisc #DataSpring programme, seeks to define and develop a Data Vault software platform that will allow data creators to describe and store their data safely in one of a growing number of options for archival storage. These may include cloud solutions, shared storage systems, or local infrastructure.

Future users of the Data Vault are invited to Edinburgh on 5th November to help shape the development work through discussions with the project team on use cases, example data, retention policies, and metadata.

Book your place at: https://www.eventbrite.co.uk/e/data-vault-community-event-edinburgh-tickets-18900011443

The aims of the second phase of the project are to deliver a first complete version of the platform by the end of November, including:

  • Authentication and authorisation
  • Integration with more storage options
  • Management / monitoring interface
  • Example interface to CRIS (PURE)
  • Development of retention and review policy
  • Scalability testing

Working towards these goals, the project team have held monthly face-to-face meetings, with regular Skype calls in between. The development work is progressing steadily, as you can see via the GitHub repository: https://github.com/DataVault, where there have now been over 300 commits. Progress is also tracked on the open Project Plan, where anyone can add comments.

So remember, remember the 5th November and book your ticket.

Claire Knowles, Library & University Collections, on behalf of the Jisc Data Vault Project Team


Edinburgh DataShare – new features for users and depositors

I was asked recently on Twitter whether our data library was still happily using DSpace for data – the topic of a 2009 presentation I gave at a DSpace User Group meeting. In responding (answer: yes!), I recalled that I’d intended to blog about some of the rich new features we’ve either adopted from the open source community or developed ourselves, to give our data users and depositors a better service and to fulfil deliverables in the University’s Research Data Management Roadmap.

Edinburgh DataShare was built as an output of the DISC-UK DataShare project (2007-2009), which explored pathways for academics to share their research data over the Internet at the Universities of Edinburgh, Oxford, and Southampton. The repository is based on DSpace, the most popular open source repository system in use globally. Managed by the Data Library team within Information Services, it is now a key component of the UoE’s Research Data Programme, endorsed by its academic-led steering group.

An open access, institutional data repository, Edinburgh DataShare currently holds 246 datasets across collections in 17 out of 22 communities (schools) of the University and is listed in the Re3data Registry of Research Data Repositories and indexed by Thomson-Reuters’ Data Citation Index.

Last autumn, the university joined DataCite, an international standards body that assigns persistent identifiers in the form of Digital Object Identifiers (DOIs) to datasets. DOIs are now assigned to every item in the repository, and are included in the citation that appears on each landing page. This helps to ensure that even if the DataShare system no longer exists, as long as the data have a home, the DOI will direct the user to the new location. Just as importantly, it helps data creators gain credit for their published data through proper data citation in textual publications, including their own journal articles that explain the results of their data analyses.
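For the curious, the citation on each landing page follows the general shape DataCite recommends: creator, year, title, publisher, and the DOI rendered as a resolvable URL. A rough sketch of assembling one, with illustrative values and a made-up function name rather than DataShare's actual code:

```python
def format_citation(creator, year, title, publisher, doi):
    """Build a DataCite-style data citation:
    Creator (Year): Title. Publisher. DOI rendered as a resolvable URL."""
    return f"{creator} ({year}): {title}. {publisher}. https://doi.org/{doi}"

# Example with illustrative values:
citation = format_citation(
    creator="Smith, J",
    year=2014,
    title="Example Survey Dataset",
    publisher="University of Edinburgh",
    doi="10.7488/ds/1234",
)
print(citation)
```

Because the DOI is the persistent part, the rest of the citation can be regenerated from repository metadata at any time.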

The autumn release also streamlined our batch ingest process to assist depositors with large or numerous data files by bypassing the web upload front-end. Currently we are able to accept files up to 10 GB in size, but we are being challenged to allow ever greater file sizes.

Making the most of metadata

[Screenshot: the Discover panel, from the Geosciences community]

Every landing page (home, community, collection) now has a ‘Discover’ panel giving top hits for each metadata field (such as subject classification, keyword, funder, data type, spatial coverage). The panel acts as a filter when drilling down to different levels,  allowing the most common values to be ‘discovered’ within each section.
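Behind such a panel, the 'top hits' are simply counts of each metadata field's values across the items currently in view. A toy sketch of the idea (the record structure here is hypothetical, not DSpace's internal model):

```python
from collections import Counter

def facet_counts(records, field, top_n=5):
    """Count the most common values of one metadata field across
    a set of item records; fields may hold multiple values."""
    counts = Counter()
    for record in records:
        for value in record.get(field, []):
            counts[value] += 1
    return counts.most_common(top_n)

# Illustrative item records:
items = [
    {"keyword": ["geology", "climate"]},
    {"keyword": ["climate"]},
    {"keyword": ["soil", "climate"]},
]
print(facet_counts(items, "keyword"))
```

Drilling down then just means re-running the same counts over the filtered subset of records, which is why the panel works at every level.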


The usage statistics at each level are now publicly viewable as well, so depositors and others can see how often an item is viewed or downloaded. This is useful for many reasons: users can see what is most useful in the repository; depositors can see whether their datasets are being used; stakeholders can compare the success of different communities. By being completely open and transparent, this is a step towards ‘altmetrics’, or alternative ways of measuring scholarly or scientific impact. The repository is now also part of IRUS-UK (Institutional Repository Usage Statistics UK), which uses the COUNTER standard to make repository usage statistics nationally comparable.

What’s coming?

Stay tuned for future improvements, including a new look and feel, preview and display by data type, streaming support, BitTorrent downloading, and Linked Open Data.

Robin Rice
EDINA and Data Library


New data analysis and visualisation service

Statistical Analysis without Statistical Software

The Data Library now has an SDA server (Survey Documentation and Analysis), and is ready to load numeric data files for access by either University of Edinburgh users only, or ‘the world’. The University of Edinburgh SDA server is available at: http://stats.datalib.edina.ac.uk/sdaweb/

SDA provides an interactive interface, allowing extensive data analysis with significance tests. It also offers the ability to download user-defined subsets with syntax files for further analysis on your platform of choice.

SDA can be used to teach statistics, in the classroom or via distance learning, without having to teach syntax. It will support most statistical techniques taught in the first year or two of applied statistics. There is no need for expensive statistical packages or long learning curves. SDA has won the American Political Science Association's Best Instructional Software award.

For data producers concerned about disclosure control, SDA provides the capability to define usage restrictions on a variable-by-variable basis: for example, restrictions on minimum cell sizes (weighted or unweighted), on the use of particular variables unless collapsed (recoded), or on particular bi- or multivariate combinations.
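A minimum cell size rule of this kind boils down to suppressing any cross-tabulated cell whose count falls below a threshold, so small groups of respondents cannot be singled out. A toy illustration of the principle (not SDA's actual implementation):

```python
from collections import Counter

def crosstab_with_suppression(rows, min_cell=5):
    """Cross-tabulate pairs of category values, replacing any
    cell count below min_cell with None (i.e. suppressed)."""
    counts = Counter(rows)
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}

# Illustrative microdata: (sex, response) pairs
data = [("male", "yes")] * 7 + [("male", "no")] * 2 + [("female", "yes")] * 6
table = crosstab_with_suppression(data)
print(table)  # the ('male', 'no') cell has only 2 respondents, so it is suppressed
```

Real disclosure control also weighs complementary suppression (hiding a second cell so the first cannot be derived from the margins), which is why per-variable rules like SDA's are valuable.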

For data managers and those concerned about data preservation, SDA can store data files in a generic, non-software-dependent format (fixed-field ASCII), and can produce the accompanying metadata in the emerging DDI-standard XML format.
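'Fixed-field format' simply means every variable occupies a known column range in each record, so the file stays readable by any software that can parse plain text. A minimal sketch of writing such records (the column widths and values are illustrative):

```python
def to_fixed_field(records, widths):
    """Render records as fixed-field ASCII lines: each value is
    truncated to, and right-aligned within, its declared column width."""
    lines = []
    for record in records:
        line = "".join(str(value)[:width].rjust(width)
                       for value, width in zip(record, widths))
        lines.append(line)
    return "\n".join(lines)

# e.g. respondent id (5 columns), age (3 columns), income band (2 columns)
print(to_fixed_field([(1, 34, 3), (2, 51, 5)], widths=(5, 3, 2)))
```

The column layout itself is what the accompanying DDI metadata records, so the data remain interpretable decades later without the original software.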

Data Library staff can mount data files very quickly if they are well documented with appropriate metadata (e.g. SAS or SPSS files), depending on the access restrictions pertaining to the datafile. To request that a datafile be made available in SDA, contact datalib@ed.ac.uk.

Laine Ruus
EDINA and Data Library


Leading a Digital Curation ‘Lifestyle’: First day reflections on IDCC15

[First published on the DCC Blog, republished here with permission.]

Okay, that title is a joke, but an apt one with which to open a brief reflection on this year’s International Digital Curation Conference in London this week, whose theme looked ten years back and ten years forward from the founding of the UK Digital Curation Centre.

The joke references an alleged written or spoken slip in referring to the Digital Curation lifecycle model, gleefully repeated on the conference tweetstream (#idcc15). The model itself, as with all great reference works, both builds on prior work and was a product of its time – helping to establish the DCC’s authority within and beyond the UK at a moment when people were casting about for a common language and understanding in the new terrain of digital preservation, data curation, and ‘digital curation’ – a perplexing combination of terms which perhaps still hasn’t quite taken off (at least not to the same extent as ‘research data management’). I still have my mouse-mat of the model and still regret that it was never made into a frisbee.

Digital Curation Lifecycle

They say about Woodstock that ‘if you remember it you weren’t really there’, so I don’t feel too bad that it took Tony Hey’s coherent opening plenary talk to remind me of where we started way back in 2004 when a small band under the directorship of Peter Burnhill (services) and Peter Buneman (research) set up the DCC with generous funding from Jisc and EPSRC. Former director Chris Rusbridge likes to talk about ‘standing on the shoulders of giants’ when describing long-term preservation, and Tony reminded us of the important, immediate predecessors of the UK e-Science Programme and the ground-breaking government investment in the Australian National Data Service (ANDS) that was already changing a lot of people’s lifestyles, behaviours and outlooks.

Traditionally the conference has a unique format: invited panels and talks on the first day, peer-reviewed research and practice papers on the second, interspersed with demos and posters of cutting-edge projects, and followed by workshops in the same week. So whilst I always welcome the erudite words of the first day’s contributors, at times there can be a sense of, ‘Wait – haven’t things moved on from there already?’ So it was with the protracted focus on academic libraries, and the rallying cries for them to rise to the ‘new’ challenges, during the first panel session chaired by Edinburgh’s Geoffrey Boulton, ostensibly focused on international comparisons. Librarians – making up only part of the diverse audience – were asking each other during the break and on Twitter: isn’t that exactly what they have been doing in recent years, since, for example, the NSF requirements in the States and the RCUK (especially EPSRC) rules in the UK for data management planning and data sharing? Certainly the education and skills of data curators as taught in iSchools (formerly library schools) have been a mainstay of IDCC topics in recent years, this one being no exception.

But has anything really changed significantly, either in libraries or, more importantly, across academia, since digital curation entered the namespace a decade ago? This was the focus of a panel led by the proudly impatient Carly Strasser, who has no time for ‘slow’ culture change and provocatively assumes ‘we’ must be doing something wrong. She may be right, but the panel was divided. Tim DiLauro observed that some disciplines are moving fast and some slow, depending on whether technology is helping them get the business of research done. And even within disciplines there are vast differences, perhaps proving the adage that ‘the future is here, it’s just not evenly distributed yet’.


Geoffrey Bilder spoke of tipping points, noting how recently DOIs (Digital Object Identifiers, used in journal publishing) meant nothing to researchers and how they have since caught on like wildfire. He also pointed blame at the funding system, which focuses on short-term projects and forces researchers to disguise their research bids as infrastructure bids – something they rightly don’t care much about in itself. My own view is that we’re lacking a killer app, probably because it’s not easy to make sustainable, robust digital curation affordable and time-rewarding, never mind profitable (Tim almost said this with his comparison to smartphone adoption). Only time will tell if one of the conference sponsors proves me wrong with its preservation product for institutions, Rosetta.

It took long-time friend of the DCC Clifford Lynch to remind us in the closing summary (day 1) of exactly where it was we wanted to get to, a world of useful, accessible and reproducible research that is efficiently solving humanity’s problems (not his words). Echoing Carly’s question, he admitted bafflement that big changes in scholarly communication always seem to be another five years away, deducing that perhaps the changes won’t be coming from the publishers after all. As ever, he shone a light on sticking points, such as the orthogonal push for human subject data protection, calling for ‘nuanced conversations at scale’ to resolve issues of data availability and access to such datasets.

Perhaps the UK, and Scotland in particular, are ahead in driving such conversations forward; researchers at the University of Edinburgh co-authored a report for the government two years ago on “Public Acceptability of Data Sharing Between the Public, Private and Third Sectors for Research Purposes”, a precursor to innovations in providing researchers with secure access to individual National Health Service records linked to other forms of administrative data when informed consent is not possible to achieve.

Given the weight of this societal and moral barrier to data sharing, and the spread of topics over the last 10 years of conferences, I quite agree with Laurence Horton, one of the panelists, who said that the DCC should give a particular focus to the Social Sciences at next year’s conference.

Robin Rice
Data Librarian (and former Project Coordinator, DCC)
University of Edinburgh


Open data repository – file size analysis

The University of Edinburgh’s open data sharing repository, DataShare, has been running since 2009. During this time, over 125 items of research data have been published online. This blog post provides a quick overview of the number, extent, and distribution of file sizes and file types held in the repository.

First, some high level statistics (as at March 2014):

  • Number of items: 125
  • Total number of files: 1946
  • Mean number of files per item: 16
  • Total disk space used by files: 76GB (0.074TB)

DataShare uses the open source DSpace repository platform. As well as storing the raw data files that are uploaded, it creates derivative files, such as thumbnails of images and plain-text versions of text documents (e.g. PDF or Word files), which are then used for full-text indexing. Of the files held within DataShare, about 80% are original files and 20% are derived files (including, for example, licence attachments).

[Chart: original vs derived file types in DataShare]

When considering capacity planning for repositories, it is useful to look at the likely sizes of the files that may be uploaded. Often with research data the assumption is that files will be quite large. Sometimes this is true, but the next graph shows the actual distribution of files by size. The largest proportion of files are under a tenth of a megabyte (100KB). Ignoring these small files, there is a roughly normal distribution peaking at about 100MB. The largest files are nearer to 2GB, but there are very few of these.

[Graph: distribution of files by size]
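Binning files by order of magnitude is a quick way to produce a distribution like this. A sketch over a list of sizes in bytes (the sizes shown are made up, not DataShare's actual figures):

```python
import math
from collections import Counter

def size_histogram(sizes_bytes):
    """Bucket file sizes by power of ten (in bytes): a file of
    3,500,000 bytes lands in the 10**6 (megabyte-scale) bucket."""
    buckets = Counter()
    for size in sizes_bytes:
        exponent = int(math.log10(size)) if size > 0 else 0
        buckets[10 ** exponent] += 1
    return dict(sorted(buckets.items()))

# Illustrative file sizes in bytes:
sizes = [50_000, 80_000, 3_500_000, 120_000_000, 1_900_000_000]
print(size_histogram(sizes))
```

Logarithmic buckets suit repository data well, since file sizes typically span many orders of magnitude, from kilobyte text files to multi-gigabyte media.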

Finally, it is interesting to look at the file formats stored in the repository. Unsurprisingly, the largest number of files are plain text, followed by a number of Wave audio files (from the Dinka Songs collection). Other common formats include XML files, ZIP files, and JPEG images.

[Chart: file formats held in DataShare]

Stuart Lewis
Head of Research and Learning Services, Library & University Collections

Data provided by the DataShare team.


Data journals – an open data story

Here at the Data Library we have been thinking about how we can encourage our researchers who deposit their research data in DataShare to also submit these for peer review.

Why? We hope the impact of the research can be enhanced by the recognised added value of peer review, regardless of whether there is a full-blown article to accompany the data.

We therefore decided recently to provide our depositors with a list of websites or organisations where they could do this.

I pulled a table together from colleagues’ suggestions, from the PREPARDE project, and from the latest RDM textbook. And, very much in the open data spirit, I then threw the question open on Twitter:

“[..]does anyone have an up-to-date list of journals providin peer review of datasets (without articles), other than PREPARDE? #opendata”

…and published the draft list for others to check or make comments on. This turned out to be a good move. The response from the Research Data Management community on Twitter was very heartening, and colleagues from across the globe provided some excellent enhancements for the list.

That process has given us the confidence to remove the word ‘Draft’ from the title. The list, this crowd-sourced resource, will need to be updated from time to time, but we are confident that we’ve achieved reasonable coverage of what we were looking for.

Another result of this search was the realisation that what we had gathered was in fact quite clearly a list of Data Journals. My colleague Robin Rice has now added a definition of that term to the list, and we will be providing all our depositors with a link to it:

https://www.wiki.ed.ac.uk/display/datashare/Sources+of+dataset+peer+review
