Last week I spent two days in a very sunny Southampton at RDMF8, an event from the DCC that drew together librarians, repository managers, researchers, data scientists, funders and publishers to look at issues around managing research data. The particular theme of this event (there have been 7 previous ones) was what publishers are planning to do with data. It’s an area I can’t claim a broad understanding of, and so it was an opportunity to find out as much as possible about the various activities and consider how they might impact on work here at Leicester.
Speakers from the publishing world included Ruth Wilson (Nature), David Tempest (Elsevier) and Rebecca Lawrence (Faculty 1000). From the scholarly society side of the fence were Brian McMahon (International Union of Crystallography) and Todd Vision (National Evolutionary Synthesis Centre). Finally, representing the researchers were Simon Coles (Southampton) and Christopher Gutteridge (Southampton).
Publishers and Data
For the publishers there is a strong awareness of the value of data as the principal output from research, and the potential value it has to other scholars. Elsevier noted that they were planning to incorporate data within “all their publications by the end of 2012”. However, when the issue of reuse came up, they seemed to indicate that while they wouldn’t be hosting the data, it would be reused under “their licence terms”. As one of my colleagues commented at the end, is this the beginning of another IPR land grab from the publishers? I know many researchers who are far more protective of access to their data than they are of their publications, and suspect this might be a hot topic in the not too distant future.
It was suggested that, unlike publications (which come with the potential for career-boosting citation impact), data offers no career advantage to researchers, and as such they would be disengaged from worrying about its storage and public sharing. There was also the issue of just who owned the rights to share the data, given that most projects are carried out across multiple institutions and often countries. Finally, it was suggested that many researchers might well be concerned that they would lose primacy of research impact if some other researcher were to come along and obtain more revelatory discoveries from their data.
The role of hosting the raw, and indeed structured, data came in for considerable discussion. Publisher viewpoints were very much that this wasn’t their role, although they’d be seeking to aggregate it through their data journals allied to their publications. Sadly, few of the speakers went so far as to detail quite who would be doing the data hosting, tagging and curation at a practical level. The exception was Brian McMahon, who showed that crystallography already has a data management and curation workflow in practice, albeit one that is not immediately replicable in other disciplines. For those less ICT-savvy researcher communities, the feeling seemed to be that there would be more need of national, regional or local support in these activities.
The presentation from Todd Vision focussed on Dryad, an international data repository for the biosciences. At the point of submission researchers are prompted to provide metadata to aid rediscovery and use. At the moment Dryad hosts around 15,000 “data packages”, with one package being the outputs from a single project, charged at £30/50 a deposit. Interestingly, he indicated that most were ~5-10MB, which seems suspiciously small. My own suspicion is that much of the data produced here locally would be many times that size. Todd’s talk also flagged up the idea of data citations being different to publication citations, in that different authors would be highlighted in the companion data and articles. This seemed to be the case for many of the objects that Dryad has ingested.
On the second day there was a series of breakout groups, and I attended the one focussing on national/institutional data repositories. One issue that arose was the risk of allowing researchers, rather than LIS or digital curation specialists, to set the policy on data curation within these. Many researchers, it was argued, would be heading on towards their next goal and may not recognise the potential value buried within their datasets; that they themselves could extract value and meaning from the data sets would be enough for them. It was agreed (among those in the room at least) that it was desirable for data to be both curated and tagged with appropriate metadata. In addition, the variance across disciplines in terms of believing in the value of data, or in its potential reuse, was touched on as a barrier, or at least an area for further exploration on how to overcome it.
One debate that raised my eyebrows was over subject versus institutional data repositories as the best long-term host. The first suggestion was that subject repositories (SRs) are more natural hosts, given their transnational capture and that in many cases they are ready, willing and able to perform data curation. On the other hand, a potential future war with Belgium or the USA (!) was cited as a strong argument against the international storage of data, with institutions seen as more natural and stable long-term hosts. It was agreed that repositories separate from publication-based ones would be needed, given the technical architecture for discovery and curation and the differing reuse policies likely to be applied.
I took away from the event a feeling that most of the focus and activity currently seems to be on STEM data management. I learned the phrase “data journals” as distinct from “research journals”. I also learned that what counts as a “trusted repository” is an issue for many in the publishing sector, with scepticism of both subject and institutional repositories as appropriate hosts, and yet a total reluctance from the publishing sector to proffer an alternative. It was refreshing to discover that, far from being in a room where everyone had the solution, most seemed to be still groping their way slowly forwards. Research data curation on a large and systematic scale is still in its infancy, but I would not be surprised to see things leap forward over the next 18 months as both the technology and the policy around it evolve.