UoL Library Blog

Develop, debate, innovate.

Posts Tagged ‘data’

DataCite and the Research Data Challenge

Posted by gazjjohnson on 30 May, 2012

Last Friday (25th May) I took my second trip of the week to London (having been at the Symplectic User Conference on Monday). This time it was the gentle stroll from St Pancras to the British Library Conference Centre to participate in the first JISC/BL DataCite workshop. Billed as an introduction to data citation and DataCite, this seemed an ideal follow-up to the Research Data Management Forum event in Southampton back in March. As the role of the LRA Manager migrates to look increasingly at how we will manage, share and curate research data outputs as well as publications, it was the sort of thing that I felt I really needed.

Data Citation

Following the housekeeping and welcome from the BL’s Lee-Ann Coleman and JISC’s Simon Hodson (owner of the finest waxed moustache I’ve seen in many a moon), Lee-Ann kicked off with an overview of Data Citation: what it is and why it is important. The fact that there is an expectation from the RCUK that research data will be shared, to assist in validation of research conducted by their funded investigators, is perhaps the main driver. At the same time HEIs want oversight of their research outputs, and as such the curation of their organisation’s data resource is important to them for building on earlier work and enabling collaborative research to evolve organically. Given that many academics in adjoining offices are often unaware of what colleagues are producing, increasing this transparency and accessibility to a rich, queryable and reusable research resource is believed to be of value not only in progressing collaboration but in enabling genuinely novel research from preexisting work.

Lee-Ann cited some examples, including the importance of data sharing in speeding up the sequencing and generation of a vaccine for the African strain of Avian flu. Her other examples were also in the STEM field, which slightly concerned me, given that two-thirds of research here at Leicester is in disciplines outside this domain, whose researchers in my experience often need greater assistance in capturing and sharing technological resource. Lee-Ann stressed that one question HEIs needed to address was what counts as critical or worthy data to curate. A microbiologist might see all the raw data output from an instrument as worthy of this, and yet for many other people it would be the processed data given context and analysis that would be of value.

What is DataCite?

Next up was Elizabeth Newbold (British Library) who gave an overview of what DataCite is. Founded in 2009, it is a registration agency, effectively an allocating agent for DOIs (which I had never realised are based on the Handle system that I use daily in the LRA). However, it was made very plain that DataCite does not work directly with researchers; they are expected to deposit their data (in whatever way possible) with an appropriate data centre, and then come to DataCite to “mint” a DOI. Minting of DOIs was a new phrase for me, but clearly one that I can see slipping into my regular conversations about this subject here at Leicester.

It was noted that the UK Data Archive had a strong definition of what data was (termed data collections): groups of all outputs from a single project source. It was commented that other data centres across the country were working along similar lines and methodologies.

DataCite Infrastructure & Working with DataCite

After an excellent lunch (BL London catering never fails to delight) Ed Zukowski (British Library) gave a very useful, if in part quite detailed and technical, overview of both DataCite and DOIs. Handles are the technology that underpins them, with DOI actually being a trademarked derivative. DOIs importantly point to landing pages, not to the objects themselves (akin to our implementation of Handles on the LRA), and in practice they take around a minute to mint using the DataCite front-end. He went on to detail how DataCite resolves contents from DOIs minted via them, but I think I’ll wait and link to the slides once available rather than try and make sense of my slightly confused notes. I was content to see that the service worked, rather than worry about the technicality.

Following this Elizabeth Newbold returned to talk briefly about working with DataCite and the data client responsibilities. In terms of their metadata schema there were only 4 required elements needed to make it work. However, locally people may well augment this with many more fields as they feel appropriate for discovery and description. I confess one nagging worry I have is who will create this metadata? Is it a task we will anticipate a PI will perform at the conclusion of a project? Personally I have concerns over the quality, accuracy, uniformity and standardisation of such input, going on my experience of manually created records submitted to the LRA via IRIS. From the academics’ perspective I can see the challenge being that this will be seen as yet another piece of administration trivia that they are expected to deal with, and achieving the cultural change to embed this into their standard workflows will require some serious and time-consuming carrot-whipping. Given the struggle over the past four years to work the deposit of publications into our open access repository into their routine, it is a serious challenge and the scale of this should not be underestimated!

Elizabeth noted that metadata created must be shared under a Creative Commons Zero licence, noting that for example the British Library OPAC makes data available for sharing and reuse in this way. There were some concerns from those present in the room that this might cause problems in cases where institutions, funders or even publishers made claim over such data. Another speaker also highlighted the problem of having data (with a minted DOI) and then having a third party mint a different DOI for it, which could interfere with metrics of access as well as uniformity of reference. There didn’t appear to be a clear consensus or answer to these concerns, and the discussions broke up over tea.

Challenges Around Managing Research Data

The final session of the day was a workshop format where we were broken into small groups, and then smaller groups, and then finally into pairs (!) to discuss and document what we perceived as the challenges around managing research data. I think it was a shame we were so subdivided, since while I had a valuable chat with my counterpart I would have relished a broader chat with a slightly larger group. Given that there was a wide disparity between the roles of delegates (from publishers to project managers to editors to directors of service through to repository managers) I feel we lost some of the benefit that we could have achieved through putting more of these diverse heads together. I also sensed a slight bias in the broader discussion when each pair’s issues were categorised and resolutions discussed – it did feel like the expected answer to “How do we solve this problem?” was “DataCite”. It wasn’t in our room, although in at least one of the other two larger groups DataCite seemed ready to answer more of their challenges.


My slight concerns over the value of the final session aside, this was an eye-opening and valuable day.  It has for me perhaps opened up more questions than answers, although some of those were provided as well.  Importantly what I think it offered was a chance to gauge where other people are on the research data management question and more importantly it gave shape to the bigger operational and strategic questions that we need to be asking ourselves within our organisations.  As such the day was most certainly worthwhile, and my thanks to all the speakers, organisers and delegates for a thought-provoking day.

Further reading

A Twitter archive of discussions around the day is also available.

Posted in Leicester Research Archive, Research Support

Research Data Management Forum – Engaging with the Publishers

Posted by gazjjohnson on 3 April, 2012

Last week I spent two days in a very sunny Southampton at RDMF8, an event from the DCC that drew together librarians, repository managers, researchers, data scientists, funders and publishers and looked at issues around managing research data. The particular theme of this event (there have been 7 previous ones) was what publishers are planning to do with data. It’s an area I can’t claim a broad understanding of, and so it was an opportunity to find out as much as possible about the various activities and consider how they might impact on work here at Leicester.

Speakers from the publishing world included Ruth Wilson (Nature), David Tempest (Elsevier) and Rebecca Lawrence (Faculty 1000). From the scholarly society side of the fence were Brian McMahon (International Union of Crystallography) and Todd Vision (National Evolutionary Synthesis Centre). Finally, representing the researchers were Simon Coles (Southampton) and Christopher Gutteridge (Southampton).

Publishers and Data
For the publishers there is a strong awareness of the value of data as the principal output from research, and the potential value it has to other scholars. Elsevier noted that they were planning to incorporate data within “all their publications by the end of 2012”. However, when the issue of reuse came up, they seemed to indicate that while they wouldn’t be hosting the data, it would be reused under “their licence terms”. As one of my colleagues commented at the end, is this the beginning of another IPR land grab from the publishers? I know many researchers who are far more protective of access to their data than they are of their publications, and suspect this might be a hot topic in the not too distant future.

It was suggested that unlike publications (which come with the potential for career-boosting citation impact) data offers no career advantage to researchers, and as such they would be disengaged from worrying about its storage and public sharing. There was also the issue of just who owned the rights to share the data, given that most projects are carried out across multiple institutions and often countries. Finally it was suggested that many researchers might well be concerned that they would lose primacy of research impact if some other researcher were to come along and obtain more revelatory discoveries from their data.

Hosting Data
The role of hosting the raw, and indeed structured, data came in for considerable discussion. Publisher viewpoints were very much that this wasn’t their role, although they’d be seeking to aggregate it through their data journals allied to their publications. Sadly most of the speakers did not go so far as to detail quite who would be doing the data hosting, tagging and curation at a practical level, with the exception of Brian McMahon, who demonstrated that Crystallography has a data management and curation workflow already in practice, albeit one that is not immediately replicable in other disciplines. For those less ICT-savvy researcher communities, the feeling seemed to be that there would be more need of national, regional or local support in these activities.

The presentation from Todd Vision focussed on Dryad, an international data repository for the biosciences. At the point of submission researchers are prompted to provide metadata to augment rediscovery and use. At the moment Dryad hosts around 15,000 “data packages” – with one package being outputs from a single project; charged at £30/50 a deposit. Interestingly he indicated that most were ~5-10Mb in size, which seems suspiciously small. My own suspicion is that much of the data produced here locally would be many times that size. Todd’s talk also flagged up the idea of data citations being different to publication citations, in that different authors would be highlighted in the companion data and articles. This seemed to be the case for many of the objects that Dryad has ingested.

Breakout session
On the second day there was a series of breakout groups, and I attended the one focussing on national/institutional data repositories. One issue that arose was the risk of allowing researchers, rather than LIS or digital curation specialists, to set the policy on data curation within these. Many researchers, it was argued, would be heading on towards their next goal and may not recognise the potential value buried within their datasets; that they themselves could extract value and meaning from the data sets would be enough. It was agreed that it was desirable (among those in the room at least) that data was both curated and appropriately metadata tagged. In addition the variance across disciplines in terms of believing in the value of data, or in its potential reuse, was touched on as a barrier, or at least an area for further exploration on how to overcome.

One debate that raised my eyebrows was Subject vs Institutional data repositories as the best long-term host. The first suggestion was that SRs are more natural hosts, given their transnational capture and that in many cases they are ready, willing and able to perform data curation. On the other hand a potential future war with Belgium or the USA (!) was cited as a strong argument against the international storage of data, and that institutions would be seen as more natural and long-term stable hosts. It was agreed that separate repositories to publication-based ones would be needed, given the technical architecture for discovery, curation and differing reuse policies likely to be applied.

I took away from the event a feeling that most of the focus and activity currently seems to be on STEM data management. I learned the phrase “Data journals” as distinct from “research journals”. I also learned that what is a “trusted repository” is an issue for many in the publishing sector – with scepticism of both subject and institutional repositories as appropriate hosts; and yet a total reluctance from the publishing sector to proffer an alternative. It was refreshing to discover that, far from being in a room where everyone had the solution, most seemed to be still groping their way slowly forwards. Research data curation on a large and systematic scale is still in its infancy, but I would not be surprised to see things leap forward over the next 18 months as both technology and policy w.r.t. it evolve.

A Twitter stream with comments from the back channel is available.

Posted in Open Access

A little bit of an ILL Excel challenge

Posted by gazjjohnson on 23 August, 2010

I know there must be an elegant and clever solution to a problem I have.  I’ve acquired an Excel output with a long list of values – 8,000+ to be exact.  Each one is an instance when someone placed an interlibrary loan request last year for a book or a journal, and contains the title of a book or a journal.  Like so:

BOOKILL Sharing the Earth: the Rhetoric of Sustainable Development AC-STAFF
BOOKILL Shelley and Vitality RES-PG
BOOKILL Shelley’s satire: violence, exhortation, and authority RES-PG
BOOKILL Shipwreck Anthropology DL-TCPG
BOOKILL Shock, Memory and the Unconscious in Victorian Fiction TC-PG
BOOKILL Short Story Theory at a Crossroads AC-STAFF

 What I want to know:

  • Is there a way to group the list by most popular titles (e.g. those that appear more than once)?

You see in this way I can see books that are regularly requested for loan from the British Library, and likewise journals, that we could perhaps consider for purchasing. Something I’m sure the information librarians would find useful.

Using =COUNTIF() has been suggested, but to use this I’d need to already know which titles were the most popular. I fear currently this goes beyond my Excel skills to solve, but I can’t be the first person to try and find this sort of information out. Any and all suggestions are welcomed – otherwise I’m going to have to eyeball this list.
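One way to approach this outside Excel is to count the titles with a short script. The sketch below is only an illustration, not a tested solution for the real export: it assumes each row is plain text in which the first token is the request type (e.g. BOOKILL) and the last token is the user category (e.g. RES-PG), with the title in between. The function name repeated_titles and its min_count parameter are hypothetical, and the sample rows are made up from the titles shown above.

```python
from collections import Counter


def repeated_titles(rows, min_count=2):
    """Return (title, count) pairs for titles requested at least min_count times.

    Assumed row layout (not confirmed by the real export):
    first token = request type, last token = user category.
    """
    counts = Counter()
    for row in rows:
        parts = row.split()
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        # Lower-case the title so differences in capitalisation still match
        title = " ".join(parts[1:-1]).casefold()
        counts[title] += 1
    # most_common() orders titles from most to least requested
    return [(title, n) for title, n in counts.most_common() if n >= min_count]


# Made-up sample rows in the same shape as the list above
sample = [
    "BOOKILL Shelley and Vitality RES-PG",
    "BOOKILL Shipwreck Anthropology DL-TCPG",
    "BOOKILL Shelley and Vitality AC-STAFF",
    "BOOKILL Short Story Theory at a Crossroads AC-STAFF",
]

print(repeated_titles(sample))
```

Within Excel itself, a PivotTable that counts the title column and sorts by that count would probably give much the same grouping, without needing to know the popular titles in advance.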

Posted in Document Supply

Citing data

Posted by knockels on 22 June, 2010

Possibly two posts in one day breaks a blogging convention, but here is the second thing from Innovations in Reference Management 2, while I am thinking about it!  

Plenty to think about in Kevin Ashley’s talk about citing data.    In addition to talking about how to cite it, Kevin spent some time persuading us (if we needed to be persuaded) that we needed to care about data and its use.   I thought of  (my favourite!) critical appraisal.

If you have the underlying data, it makes it easier to see where authors have been selective, or used a particular technique to make the data “prove” something or other.   It adds new dimensions to the evaluation of a paper, or a figure or table in a paper.   Kevin’s examples included a survey that said that “nine out of ten cats preferred the Open University”, but where an examination of the underlying data showed that you could only say this if you sorted the data in a particular way.     There are all the ruses outlined in books like “How to lie with statistics”, and being aware of those ruses is part of critical appraisal, but having the data adds a new dimension to this.

Kevin also gave some instances of where data might be valuable on its own, without any accompanying papers.  A researcher might be interested in the data, and not the story being told about it by one particular person or group:

  • Researchers in one area might be able to use data gathered by others for other purposes;
  • Having access to all the separate data sets relating to, say, the distribution of individual species, might lead to work on biodiversity on a larger scale;
  • Having access to all the separate data sets on a subject could lead to being able to identify where the gaps are.

Posted in Research Support

Policy-making for Research Data in Repositories: A Guide

Posted by gazjjohnson on 15 May, 2009

This rather interesting policy guide has just been brought out by JISC and the DISC-UK DataShare project. For once, rather than being a lengthy report, it is actually a very useful tool kit for setting up the policies, workflows and the like for a research data repository. It’s been something that the LRAPG has touched on in discussions, and that I know the University will be keen to develop in the future. It is therefore useful to have a document like this, where a lot of the questions we need to ask and the decisions that have to be taken are laid out in a very thorough manner.

Posted in Leicester Research Archive, Research Support

Sharing data

Posted by knockels on 6 April, 2009

Something that always interested me when I was Leicester Research Archive Manager was using LRA as a means to share data.  I began to become aware that this raises all sorts of issues – space, of course, what qualifies as data worth archiving, issues of confidentiality relating to clinical data, and the matter of who actually owns the data (it might be the people who funded the research, although some funding bodies mandate sharing of data).  

A recent editorial in the BMJ is a good summary of all this. The BMJ now asks authors to indicate in a statement which data is available. Nature already does this. The editorial discusses what is already going on outside clinical medicine and some of the issues surrounding data sharing within clinical medicine, and gives a brief report from a recent UK Research Data Service meeting.

Posted in Digital Strategy & Website, Leicester Research Archive, Open Access, Research Support

Report: The demographics of social networkers

Posted by gazjjohnson on 16 January, 2009

This morning I’ve been reading a couple of reports. The first is by Amanda Lenhart for the Pew Internet & American Life Project, entitled Adults and Social Network Sites. According to their data:

  • The share of adults with online social networking profiles has gone up from 8% (2005) to 35% (Dec 2008)
  • 30% of adults 35-44 have a profile (the report covers other age ranges, but since this is my peer group…)

Of these adults who do use them

I wonder how those numbers would look from a UK audience?  I’d suspect MySpace wouldn’t be anything like as popular, at least that’s my perception of their market penetration over here – what do you think?

Personally I’ve profiles on all three, but really only use FB for my professional and personal networking.  LinkedIn just leaves me cold.  Then again the median age of the LinkedIn user in this report is 40; so I’m a fair bit below that demographic point.  Shockingly the report concludes that on the whole adults are less likely to have online social networking profiles (65% vs 35%); something I’m sure is replicated in order of magnitude over here if not the exact numbers.

One paragraph later on was quite interesting, following on from things Alan and others have talked about in SmallWorldz and elsewhere – that of maintaining multiple online identities

  • A user generally wants to be findable by the people they wish to add to their online network…but may not wish to be so visible as to be harassed or observed by people totally unknown to them.

Or, I’m sure, in some cases people who are known to them; I’m not quite sure I want everyone I’ve ever studied or worked with in my professional networks, and social networking security settings aren’t that customisable in many instances. Interestingly 29% of users discovered their friends’ political interests/affiliations through networking sites. Then again how many people list their real leanings on these sites?

The report concludes with the data and methodology of the work.  So well worth a read, the main text is only 10 pages long.

Posted in Web 2.0 & Emerging Technologies, Wider profession