Minneapolis, MN, USA, 2 to 5 June 2015
Host institution: Minnesota Population Center at the University of Minnesota
The theme of the 2015 conference was Bridging the Data Divide: Data in the International Context with many of the sessions dedicated to research data management in academia, which of course is being embraced across a growing number of UK academic institutions. I seem to recall that about 20 percent of UK academic institutions have a research data management strategy in place, so these sessions were of considerable interest, and well attended.
Data Infrastructure and Applications sessions were also prominent at the conference, with some interesting presentations relevant to EDINA, and attendance quite good, especially for the Block 5, E1 session for Geospatial and Qualitative Data on Thursday, June 4, 13:30 to 15:30. My presentation on GoGeo was slotted into this session along with three others with those focussed more on qualitative data.Â http://iassist2015.pop.umn.edu/program/block5#a1
The first plenary session was interesting as Professor Steven Ruggles, from the Minnesota Population Center, provided an overview of the history of the US Census and how it was at the forefront with regards to data capture, process and dissemination. The second plenary speaker, Curtiss Cobb, from Facebook, tried to make the make the case that Facebook serves as a force of social good in the world, and Andrew Johnson, from the City of Minneapolis, spoke at the final plenary session on Friday with an overview of the Cityâ€™s open data policy.
Summaries of relevant presentations
3 June, Wednesday morning session:
A3: Enabling Public Use of Public Data
Mark Mitchell, from the Urban Big Data Centre (UBDC) at the University of Glasgow provided an interesting presentation titled And Data for all. The UBDC takes the Glasgow City Council’s urban open data that it has created, and makes it available to the public and to academia through its UBDC Data Portal (http://ubdc.gla.ac.uk/), which currently holds 934 datasets, primarily from the Glasgow City Council and Greater London Authority. MM noted the use of CKAN to build their data portal, and use R and QGISÂ at UBDC. He also noted that there are about 300+ data portal users and try to provide good metadata records and crosslink these with their datasets.
MM noted that there was a considerable degree of metadata quality, but indicated that the Glasgow City Council planned to mandate a minimum standard for metadata quality.
Some issues were revealed, most notably differences in projections between datasets where Transport Planning used British National Grid and Health Services used northing-easting.
He also pointed out an interesting result in a survey conducted in Glasgow which revealed support for the use of personal data for societal benefit, but not for commercial interest.
He touched on the ESRC-funded Integrated Multimedia City data (iMCD) project, which is intended to capture urban life through surveys, sensors and multimedia.
Then on that same strand, he made reference to the gamification of data, which would incorporate Minecraft server and Minecraft, an interactive block game, to introduce Glasgow open data to Glasgow primary school children to make geography and maps more engaging and interesting.
More about this can be found on the UBDC website via this link.
Someone noted during questions that the Australian Bureau of Statistics has created a mobile game called Run That Town. The ABS use data from every postal area in Australia and incorporate it into this mobile game.
Run That Town gives each player the ability to nominate any Australian town and take over as its virtual ruler. Players have to decide which local projects to approve and which to reject, with the real Census data of their town dictating how their population reacts. To win, players need to maintain their popularity, making Census data core to the gameplay and giving players the chance to use the data themselves.
Mark also mentioned about collaborative efforts between UBDC and the Glasgow School of Art to create noise and light maps for the City of Glasgow, then noted that housing charities were requesting more data from the Glasgow City Council as well.
Winny Akullo, from the Uganda Bureau of Statistics, delivered another presentation of this session, which provided an overview of the results of a quantitative study in Uganda that was carried out to investigate ways of improving the dissemination of statistical information there. The results indicated that the challenge remained, and one that required more resources to improve dissemination.
Margherita Ceraolo, from the UK Data Service, wrapped up the session with her presentation about the global momentum towards promoting open data including support from national governments and IGOs (e.g. IMF, World Bank and UN).
She made reference to macro data as well as boundary data, then made a reference to the UKDS building an open API for data re-use; release is scheduled for the end of 2015. She also made a reference to a map visualisation interface to display all data in their collection.
3 June, Wednesday afternoon session:
B5: Building on Common Ground: Integrating Principles, Practices, and Programs to support Research Data Management
Lizzy Rolando (Georgia Tech Library), and Kelly Chatain, from the Institute for Social Research (ISR) at the University of Michigan, gave interesting presentations on support for research data management at their respective institutions. Session Chair, Bethany Anderson, from University Archives at the University of Illinois-Urbana, also discussed ways of integrating the work of academic archives and research data services to appraise, manage and steward data.
Some key points that they noted during their presentations included the following:
- requiring a chain of custody for data to encourage collective ownership and responsibility;
- make data use a higher priority over preservation; and
- mentioned Purdue Universityâ€™s policy for data retention which requires a reappraisal of data every 10 years.
These are eminently sensible approaches to data management in academia. Granted, the first one faces resistance, but if data creators and users refuse to be accountable for data, then who assumes this responsibility? Ownership needs to be addressed if data are to be managed and shared, and when it becomes a collective responsibility, then perhaps there might be more willingness as a shared activity?
Data re-use ought to be prioritised as well, and periodically assessed rather than stored on various media to be forgotten. Itâ€™s become another of many classic excuses when terabytes of data are blamed for eschewing the responsibilities of data documentation/metadata creation.
Itâ€™s uncertain, but how many spatial datasets are worth a place in archival storage? If there are spatial datasets of no value, then they should be deleted rather than saved. Question is who makes these decisions, but could assume that it would be within each department?
3 June, Wednesday afternoon late session:
C5: No Tools, No Standard — Software from the DDI Community
Listened to a presentation about the Ontario Data Documentation, Extraction Service and Infrastructure (ODESI) and the Canadian Data Liberation Initiative (DLI), with reference to Nesstar. Nesstar is a software system for data publishing and online analysis. The Norwegian Social Science Data Services (NSD) owns it and recall it during the time I worked years ago at the UK Data Archive.
4 June, Thursday morning session:
D4: Minnesota Population Data Center (MPC) Data Infrastructure: Integration and Access
This session provided an overview of the Minnesota Population Center (MPC) project activities with most of the presentation about Integrated Public Use Microdata Series (IPUMS) (www.ipums.org), which is dedicated to collecting and distributing freeÂ and accessible census data, both US and international census data.
Interesting to note from the presentation, the number of users, with economists, the highest, at 31 percent; demographers and sociologists accounting for 16 percent; and journalists and government users at 15 percent. Only 8 percent of users were identified as geographers/GIS, though they indicated that their numbers were growing.
The North Atlantic Population Project (NAPP) was mentioned, which includes 19th and early 20th century census microdata from Canada, Great Britain, Germany, Iceland, Norway, Sweden, and the US, so worth noting that British census data available as well.
The Terra Populus project (http://www.terrapop.org/) was also covered and sounded quite interesting. The goal of the project is to integrate the worldâ€™s population (census) with environmental data (remotely-sensed land cover, land cover records and climate data).
There is also a temporal aspect to this which exams interactions over time between humans and environment to observe changes that take place between the two.
There is a TerraPop Data Finder being built, which is currently in beta. It holds census data, and land use, land cover and climate data.
The MPC has also been involved with the State Health Access Data Assistance Center (SHADAC) Data Center, doing analysis on estimates of health insurance coverage, health care use, access and affordability using data from the 2012 National Health Interview Survey (NHIS).
4 June, Thursday afternoon session:
E1: Geospatial and Qualitative Data
There was exceptionally good attendance for this session with most of the room filled. Amber Leahey, the Data Services Metadata Librarian at the University of Toronto, was Chair of our session. Had a chance to talk to her after the session, and learned about the Scholars GeoPortal, which is an online resource for Canadian academics and students to access licensed geospatial datasets through a subscription service, much like Digimap.Â Impressive portal, and data are free, though the portal provides a limited number of Canadian datasets. They encourage data creators to upload their datasets to the portal, much like Digimap ShareGeo, but face similar challenges as here.Â http://geo1.scholarsportal.info/
Andy Rutkowski (USC) started the session with his presentation on using qualitative data (social media, tweets, interviews, archived newspaper classifieds, photographs) to improve the understanding of quantitative data to produce more meaningful maps, maps as social objects, a move towards spatial humanities?
He alluded to skateboardersâ€™ information about pavement conditions at various locations in Los Angeles that led to a new skateboard park.
He also referred to Professor Nazgol Bagheri’s (UT San Antonio) work on mapping women’s socio-spatial behaviours in Tehran’s public spaces using photographs and narratives linked to GIS data from the Iranian Census, national GIS database and City of Tehran; all this to generate a qualitative GIS map that displays the gendering of spatial boundaries.
He concluded with a reference to the LA Times Mapping project, which started in 2009 and displays the neighbourhoods of Los Angeles, which have been redrawn using feedback from readers whose perceptions of boundaries differed from the original ones.Â http://maps.latimes.com/neighborhoods/
The next presentation (The Landscape of Geospatial Research: A Content Analysis of Recently Published Articles) was a joint collaboration with library staff at the University of Michigan reporting on their efforts and results to use geospatial research methods to capture information from the body of published literature. Samples of articles, from a selection of multi-disciplinary journals with spatial themes, were UID coded for content including spatial data cited, software used and research methodology; I assume with regards to software, this would be ArcGIS, ERDAS MapInfo, etc?
Metadata was also compiled for the articles, which included title, subject, author(s) subject affiliation, number of authors and their gender; this information extracted through multi-coding. Also reference toÂ geo co-ordinate analysis and building the schema to support this information extraction.
Certainly the Unlock geo-parser (http://edina.ac.uk/unlock/) comes to mind as being relevant to their project. Weâ€™ve already discussed the possibility of doing something similar using GoGeo to extract and harvest metadata from open access journals as publications represent the best sources for spatial data information with most publications peer-reviewed, and the data cited, so this should address data quality concerns, and the purpose for which the data were created. Each publication would also provide the author(s) name(s) and contact details for those interested in acquiring the data, which might in turn pressure researchers to release their data through GoGeo rather than face personal requests for their data.
My presentation followed and can be found on this EDINA page.
GoGeo: A Jisc-funded service to promote and support spatial data management and sharing across UK academia
One of my comments, and a photo of one of my slides, reached the IASSIST conferenceâ€™s Twitterland and went viral at the conference, though I noted as well that metadata creation was important, but the reality is that after 14 years of metadata coordination both in the public sector and academia, Iâ€™ve yet to meet anyone who has actually expressed any pleasure in creating metadata.
My presentation provided an overview of EDINA , Jisc and the GoGeo Spatial Data Infrastructure, then summarised the latterâ€™s successes and shortcoming, the former attributed to GoGeo users searching for data; the latter, GoGeo users unwilling to share their data. My presentation also offered to the audience, new approaches that would encourage spatial data management and sharing including a mandatory requirement for students to use Geodoc to document data cited in their dissertation and theses as a requirement for graduation; itâ€™s often easier for a department to impose this requirement on its students rather than its faculty, but if students document their data, future students can access the metadata records as part of their literature review, and access data that might complement their own research data; this in turn would require university departments to take ownership of their studentsâ€™ data and make available to others, so at least spatial data is shared internally. This could be restricted to the department or within a university if there is a data management policy and the infrastructure in place to support it, though if not GoGeo provides this.
The use of Geodoc and the GoGeo private catalogues was also presented as another approach to supporting spatial data information management with Geodoc used at the personal level where a researcher can document his/her spatial data, then use Geodoc to store and update those records. Then the option of exporting Geodoc records to attach to shared spatial datasets, which seems the preferred option as academics will entrust their data to colleagues rather than make them openly available; the data recipient can then import the metadata record into his/her own Geodoc to access for updating and editing. The other option is for Geodoc users, whether part of a research project group, a department, or university, to publish their metadata records to a GoGeo private catalogue, which only those with assigned usernames and passwords can access. As I manage these catalogues, I can assign these to those whoâ€™ve been granted permission to access the metadata records, and can be affiliated with the same project, but from different universities.
The hopeful outcome would be that after these records and their datasets have served their purpose, then the records would be published in GoGeoâ€™s open catalogue and the data uploaded to ShareGeo, or a GoGeo database as it would be better to have both the metadata and data in the GoGeo portal and not separate as it the case now between GoGeo and the ShareGeo data repository, which records from 500 to 3,000 downloads a month, so better to redirect those users to GoGeo.
My presentation noted as well the Jisc commitment to providing the resources to the UK academic community in support of research data management, then noted that about 20 percent of the UK universities have a data research management policy in place.
Also in line with the Landscape of Geospatial Research: A Content Analysis of Recently Published Articles presentation, the search interface in GoGeo could be updated to search and harvest metadata from peer-reviewed open access journal publications. It would also be an important step forwards if publishers would require authors to release their data, but there seems to be no movement on that front as it is in the financial interest of most publishers to publish more, and might see this as an imposition on researchers which would result in fewer publications?
If there was any consolation, there were other presentations at IASSIST that revealed similar experiences (see 5 June, Friday morning session), so academia represents a formidable challenge both here and the US, and probably in most other countries as well?
Mandy Swygart-Hobaugh (Georgia State University) concluded the session with her presentation on qualitative research. She asked if social sciences data services librarians devoted their primary attention to quantitative researchers to the detriment of qualitative researchers, and her survey indicated that it is overwhelmingly biased towards quantitative data researchers.
5 June, Friday morning session:
F5: Using data management plans as a research tool for improving data services in academic libraries
Â Amanda Whitmire (Oregon State University), Lizzy Rolando, Georgia Tech Library and Brian Westra and University of Oregon Libraries combined to offer interesting presentations.
AW talked about the DART Project (Data management plans as A Research Tool). This NSF-funded project is intended to facilitate a multi-university study to develop an analytic rubric to standardise the review of faculty data management plans for Oregon State University, the University of Michigan, the Georgia Institute of Technology and Penn State University.
This poster offers more insight about the Dart project.
She also talked about the Data Management Plan (DMP) tool, which can be used to provide a rich source of information about researchers and their research data management (RDM) knowledge, capabilities and practices. She revealed some information including the possibility of plagiarism with 40 percent of researchers sharing text and geographical research comprising only 8 percent of the RDM activities, so probably no different than here in the UK as the social sciences/geosciences seem more averse to data management and sharing. Only 10 percent of the researchers approached the RDM staff for assistance as well.
The DMP tool also has the functionality to see cross-disciplinary trends without engaging with the researchers, and with only 10 percent of the researchers approaching the RDM staff, this is probably good. She noted that the cross-disciplinary trends were high for the likes of Mathematics and Physics and low for geography, and really no surprise in this revelation.
Further assessment of information revealed that with eight research plans/practices(?) did not indicate any intent of releasing data; five plans indicated a selective release of â€˜relevant data, which she interpreted as suggesting it was to the researchersâ€™ discretion and just another way of saying â€˜noâ€™ to data sharing.
In addition, she reported that researchersâ€™ descriptions of data types was done well, but no mention of metadata creation or data protection and data archiving; some mention of data re-use.
Lizzy Rolando revealed similar results during her presentation which involved feedback from researchers at Georgia Tech.
Asked about their plans on how they would share their data, researchers indicated the following:
–Â Citation in journals: 22 percent
–Â Conferences: 10 percent
–Â Repository: 9 percent
–Â Other repository: 7 percent
In effect, most researchers perceived that the citation of their data in journals or at conferences was effectively data sharing; only a minority seemed inclined to share their data directly.
Also, results of the survey indicated that researchers werenâ€™t aware of metadata standards, or metadata at all, and expressed a willingness to share their data, but not willing to archive their data, again, their interpretation of data sharing seems to suggest only through citation.
LR suggested that one way to encourage researchers to create metadata was to do so informally through note taking, but then would researchers be willing to share their notes is the question I have, or would they allow librarians or others to reference their notes to create metadata?
Iâ€™ve offered my services to academics in academia, but no one has accepted the offer of providing their data for me to extract information to document their datasets, and this is a step further than asking researchers to take notes about their data.
Itâ€™s a good idea, can it succeed, though it should be a reasonable approach to data management, but without any formal structure, what will happen to the notes? Will those files be stored randomly in various media, accidentally deleted, or not properly updated to reflect changes made to the dataset?
Brian Westa from the University of Oregon, offered another summary of a similar survey conducted at his university; the survey targeted researchers in Chemistry, Â Biological Sciences and Mathematics.
Asked about data documentation/description and metadata standards, 51 researchers in Biological Sciences and Chemistry acknowledged the following:
–Â Data description: 14
–Â Could identify metadata standards: 10
– Making data public: 14
– Mentioned data formats: 12
The Dryad repository was mentioned amongst the 14 who responded to making data public, but again, with only 10 respondents acknowledging familiarity with metadata standards, there are RDM issues here as well.
Feedback also indicated that most researchers were concerned about trusting others with their data, and though there were 14 respondents who acknowledged that they shared their data, most indicated that they shared their data through citation in publications and their own website, so again, a reluctance to physically share their data, and if they did actually share the data, it can be inferred that it would have been one-to-one with colleagues they could trust?
Turning to the survey for researchers in Chemistry, much the same was suggested in the results. A majority indicated that they shared their data through citations in publications and only shared data through â€˜specific requestsâ€™, again trust comes into play here and assume these requests would be approved if from a close, or trusted colleague?
The respondents noted the following as methods of data sharing in this order:
– On request
– Personal website
– Data centre
None of the respondents made any reference to metadata or standards.
BW concluded with an overview of the National Science Foundationâ€™s (NSF) effort to encourage research data management and sharing, which basically requires the research community, who receives considerable NSF funding, to establish data management practices; however, BW noted that itâ€™s not happening, though said that there was one occasion recently where continued funding for a postgrad student was withheld until the student had submitted an RDM plan to the NSF, so there has been little progress there, even from a major funding body as the NSF, and this sounds similar to experiences at NERC where researchers saw funding as a one-off, so felt no obligation to submit their data to NERC after the project was finished, though I think they were to review this and try to find another strategy that would encourage better data management and sharing.
The resistance within academia to both data management and sharing is quite concerning as access to the data should be part of the peer-review process. In this Reutersâ€™ article, and others, itâ€™s noted that there are publications where the data donâ€™t hold up to scrutiny, and this is an alarming concern.
As governments continue to cut funding for research, this makes it increasingly more difficult for researchers to collect sufficient data for proper analysis, and less inclined to share their data, so will this only exacerbate the problem, or are there other issues as well, but certainly trust seems to be a key concern amongst researchers, and these presentations at the IASSIST conference reaffirm the reality here, and this reluctance to share data, and even data management seems to be too much to ask of most researchers to do. Metadata creation is so far removed from the actual data processing and analysis, and the publication of these results, hence, most researchers who would rather spend more time with their datasets than their descriptions, especially as most researchers have no intention of sharing their datasets publically, and only share it with those they trust; however, rather than taking questions about their datasets with each request, the Geodoc metadata editor tool would allow each researcher to document his/her datasets and bundle the corresponding metadata records with them to share both with their trusted colleagues.
Perhaps, over time, researchers will be willing to share both their metadata and data with the public, but that time still seems far in the future, but for now, the support must be made available to those who want to manage their data and share it with those that they can trust.
5 June, Friday afternoonÂ
I had planned to attend the G2 session on Planning Research Data Management Services, but had the fortunate opportunity to speak with Professor Bob Downs from Columbia University. GoGeo harvests metadata from the Socioeconomic Data and Applications Centerâ€™s (SEDAC) portal catalogue, which CIESIN hosts at Columbia University, so Professor Downs had asked me about this during question time after my presentation on Thursday.
We discussed both SEDAC and GoGeo, then he mentioned to me how DataCite was useful source for locating catalogues to harvest metadata, with SEDACâ€™s catalogue included on the website. Heâ€™d mentioned as well about tracing the use of SEDAC data in publications through citation, which was quite impressive as the number of times was more than 1,000, so clearly demonstrating the benefit of making their data open access, and the success of the SEDAC portal.
That was IASSIST 2015 in Minneapolis, Minnesota. The 2016 conference will be held in Bergen, Norway.