DataCite and the Research Data Challenge

Posted by gazjjohnson on 30 May, 2012

Last Friday (25th May) I took my second trip of the week to London (having been at the Symplectic User Conference on Monday). This time it was the gentle stroll from St Pancras to the British Library Conference Centre to participate in the first JISC/BL DataCite workshop. Billed as an introduction to data citation and DataCite, this seemed an ideal follow-up to the Research Data Management Forum event in Southampton back in March. As the role of the LRA Manager migrates to look increasingly at how we will manage, share and curate research data outputs as well as publications, it was the sort of thing that I felt I really needed.

Data Citation

Following the housekeeping and welcome from the BL’s Lee-Ann Coleman and JISC’s Simon Hodson (owner of the finest waxed moustache I’ve seen in many a moon), Lee-Ann kicked off with an overview of Data Citation: what it is and why it is important. The fact that there is an expectation from the RCUK that research data will be shared, to assist in validation of research conducted by their funded investigators, is perhaps the biggest driver. At the same time HEIs want oversight of their research outputs, and as such the curation of their organisation’s data resource is important to them for building on earlier work and enabling collaborative research to evolve organically. Given that many academics in adjoining offices are often unaware of what colleagues are producing, increasing this transparency and accessibility to a rich, queryable and reusable research resource is believed to be of value in not only progressing collaboration but enabling genuinely novel research from preexisting work.

Lee-Ann cited some examples, including the importance of data sharing in speeding up the sequencing and generation of a vaccine for the African strain of Avian flu. Her other examples were also in the STEM field, which slightly concerned me, given that two-thirds of research here at Leicester is in disciplines outside this domain, where researchers in my experience often need greater assistance in capturing and sharing technological resource. Lee-Ann stressed that one question that needed to be addressed by HEIs was what is critical/worthy data to curate? A microbiologist might see all the raw data output from an instrument as worthy of this, and yet for many other people it would be the processed data given context and analysis that would be of value.

What is DataCite?

Next up was Elizabeth Newbold (British Library), who gave an overview of what DataCite is. Founded in 2009, it is a registration agency, effectively an allocating agent for DOIs (which I had never realised are based on the Handle system that I use daily in the LRA). However, it was made very plain that DataCite does not work directly with researchers; they are expected to deposit their data (in whatever way possible) to an appropriate data centre, and then come to DataCite to “mint” a DOI. Minting of DOIs was a new phrase for me, but clearly one that I can see slipping into my regular conversations about this subject here at Leicester.

It was noted that the UK Data Archive had a strong definition of data (termed data collections) as groups of all outputs from a single project source. It was commented that other data centres across the country were working along similar lines and methodologies.

[Photo: Biscuits - failed to picture lunch, but it was splendid]

DataCite Infrastructure & Working with DataCite

After an excellent lunch (BL London catering never fails to delight) Ed Zukowski (British Library) gave a very useful, if in part quite detailed and technical, overview of both DataCite and DOIs. Handles are the technology that underpins them, with DOI actually being a trademarked derivative. Importantly, DOIs point to landing pages, not to the objects themselves (akin to our implementation of Handles on the LRA), and in practice, using the DataCite front-end, they take around a minute to mint. He went on to detail how DataCite resolves contents from DOIs minted via them, but I think I’ll wait and link to the slides once available rather than try and make sense of my slightly confused notes. I was content to see that the service worked, rather than worry about the technicality.
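
To make the landing page point a little more concrete, here is a minimal sketch in Python of how a DOI name becomes a clickable link. It assumes the public resolver at https://doi.org/, which redirects to whatever landing page was registered for the DOI. The function name and the example DOI are hypothetical and not anything shown on the day.

```python
from urllib.parse import quote

# Public DOI resolver; it redirects to the registered landing page.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Build a resolver URL from a DOI name (hypothetical helper)."""
    doi = doi.strip()
    # Accept the common "doi:" label in front of the name.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # A DOI name is a prefix starting "10." and a suffix, joined by "/".
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode anything unusual, but keep the slashes intact.
    return DOI_RESOLVER + quote(doi, safe="/")


# The DOI below is invented purely for illustration.
print(doi_to_url("doi:10.1234/example-dataset"))
```

The point of the sketch is only that the link leads to the resolver rather than to a file; where a reader ends up is decided by what the data centre registered.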

Following this, Elizabeth Newbold returned to talk briefly about working with DataCite and the data client responsibilities. In terms of their metadata schema, only four required elements were needed to make it work. However, locally people may well augment this with many more fields as they felt appropriate for discovery and description. I confess one nagging worry I have is who will create this metadata? Is it a task we will anticipate a PI will perform at the conclusion of a project? Personally I have concerns over the quality, accuracy, uniformity and standardisation of such input, going on my experience of manually created records submitted to the LRA via IRIS. From the academics’ perspective I can see the challenge being that this will be seen as yet another piece of administrative trivia that they are expected to deal with, and achieving the cultural change to embed this into their standard workflows will be challenging, with some serious and time-consuming carrot-whipping. Given the struggle over the past four years to work the deposit of publications in our open access repository into their routine, it is a serious challenge and the scale of this should not be underestimated!
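
One modest way to take the edge off the quality worry would be an automated check at the point of deposit. The following is a rough sketch only: the four field names are placeholders of my own choosing, not taken from the DataCite schema, and the record is invented. The actual required elements should be checked against the DataCite metadata schema itself.

```python
# Placeholder names only; NOT the real DataCite required elements.
REQUIRED_FIELDS = ("identifier", "creator", "title", "publisher")


def missing_fields(record):
    """Return the required fields that are absent or blank in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# An invented record with a blank title, as a PI in a hurry might submit.
record = {
    "identifier": "10.1234/example-dataset",
    "creator": "A. Researcher",
    "title": "   ",
    "publisher": "Example University",
}
print(missing_fields(record))
```

A check like this only catches omissions, of course; it does nothing for accuracy or consistency, which is where the human effort would still be needed.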

Elizabeth noted that metadata created must be shared under a Creative Commons Zero licence, noting that for example the British Library OPAC makes data available for sharing and reuse in this way. There were some concerns from those present in the room that this might cause problems in cases where institutions, funders or even publishers laid claim to such data. Another speaker also highlighted the problem of having data (with a minted DOI) and then having a third party mint a different DOI for it, which could interfere with metrics of access as well as uniformity of reference. There didn’t appear to be a clear consensus or answer to these concerns, and the discussions broke up over tea.

Challenges Around Managing Research Data

The final session of the day was a workshop format where we were broken into small groups, and then smaller groups, and then finally into pairs (!) to discuss and document what we perceived as the challenges around managing research data. I think it was a shame we were so subdivided, since while I had a valuable chat with my counterpart I would have relished a broader chat with a slightly larger group. Given that there was a wide disparity between the roles of delegates (from publishers to project managers to editors to directors of service through to repository managers) I feel we lost some of the benefit that we could have achieved through putting more of these diverse heads together. I also sensed a slight bias in the broader discussion when each pair’s issues were categorised and resolutions discussed: it did feel like the expected answer to “How do we solve this problem?” was “DataCite”. It wasn’t in our room, although in at least one of the other two larger groups DataCite seemed ready to answer more of their challenges.


My slight concerns over the value of the final session aside, this was an eye-opening and valuable day. It has for me perhaps opened up more questions than answers, although some of those were provided as well. What I think it offered was a chance to gauge where other people are on the research data management question and, more importantly, it gave shape to the bigger operational and strategic questions that we need to be asking ourselves within our organisations. As such the day was most certainly worthwhile, and my thanks to all the speakers, organisers and delegates for a thought-provoking day.

