Talk at OHSU, September 25, 2013
Transcript of Talk at OHSU, September 25, 2013
Making Research Data Discoverable and Usable (It’s the metadata, stupid!)
Anita de Waard, VP Research Data Collaborations
http://researchdata.elsevier.com/
Research data is the ‘new hotness’…
• Share research outputs • Demonstrate impact to public • Data availability drives growth
• Demonstrate impact • Guarantee permanence, discoverability • Avoid fraud
• Generate, track outputs • Comply with mandates • Ensure availability
• Archive, track, curate • Support researcher/institution
• Archive • Add curation • Allow reuse
Todd Vision, DataDryad, OAI8, 6/23/13: “We need to find a way to keep Dryad funded, and would love to hear your ideas about doing that.”
Phil Bourne, Associate Vice Chancellor, UCSD, 4/13: “We are thinking about the university as a digital enterprise.”
Mike Huerta, Assoc. Director NLM Office of Health Info at NIH, 6/13: “Today, the major public product of science are concepts, written down in papers. But tomorrow, data will be the main product of science…. We will require scientists to track and share their data at least as well, if not better, than they are sharing their ideas today.”
Mara Saule, Dean University Libraries/CIO, UVM, 5/13: “We need to do something about data.”
• Derive credit • Comply with mandates • Discover and use • Cite/acknowledge
Gov
Funding bodies
University management
Researchers
Librarians
Data Repositories
Nathan Urban, PI Urban Lab, CMU, 3/13: “If we can share our data, we can write a paper that will knock everybody’s socks off!”
Roles and needs wrt Research Data:
Barbara Ransom, NSF Program Director Earth Sciences, 2/13: “We’re not going to spend any more money for you to go out and get more data! We want you first to show us how you’re going to use all the data we paid y’all to collect in the past!”
Research data management today:
Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
Research today (in biology) is often quite insular:
[Diagram: two research cycles side by side, each with the steps Prepare, Observe, Analyze, Ponder, Communicate]
But life is VERY complicated:
http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg
• Interspecies variability: A specimen is not a species
• Gene expression variability: Knowing genes is not knowing how they are expressed
• Microbiome: An animal is an ecosystem
• Systems biology: A whole is more than the sum of its parts
Reductionist science does not work for living systems!
What if the data were connected?
[Diagram, repeated over three slides, with the labels Prepare, Analyze, Communicate, Observations and Think]
• Across labs, experiments: track reagents and how they are used
• Compare outcome of interactions with these entities
• Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments
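The idea of tracking reagents across labs and experiments, then comparing outcomes, can be sketched in a few lines of Python. This is only an illustration under assumptions not stated in the talk: the `Observation` record, its field names and the sample values are all hypothetical, and it presumes that every lab uses the same identifier for the same reagent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    lab: str          # which lab ran the experiment
    experiment: str   # hypothetical experiment label
    reagent_id: str   # shared reagent identifier (assumed to exist across labs)
    outcome: str      # what was observed


def group_by_reagent(observations):
    """Index observations by reagent so outcomes can be compared across labs."""
    by_reagent = defaultdict(list)
    for obs in observations:
        by_reagent[obs.reagent_id].append(obs)
    return dict(by_reagent)


def reagent_profile(observations):
    """For each reagent, count how often each outcome was seen."""
    profile = {}
    for reagent_id, group in group_by_reagent(observations).items():
        counts = defaultdict(int)
        for obs in group:
            counts[obs.outcome] += 1
        profile[reagent_id] = dict(counts)
    return profile


# Made-up sample data, for illustration only.
observations = [
    Observation("Lab A", "exp-1", "antibody-X", "binding"),
    Observation("Lab B", "exp-7", "antibody-X", "no binding"),
    Observation("Lab B", "exp-8", "antibody-X", "binding"),
    Observation("Lab A", "exp-2", "cell-line-Y", "growth"),
]

profile = reagent_profile(observations)
print(profile["antibody-X"])
```

The per-reagent outcome counts are a very rough stand-in for the ‘virtual reagent spectrogram’: once observations share an identifier, comparing how one entity behaved in different experiments becomes a simple grouping step.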
Where research data goes now:
> 50 M papers; 2 M scientists; 2 M papers/year
Majority of data (90%?) is stored on local hard drives
Some data (8%?) stored in large, generic data repositories:
• Dryad: 7,631 files
• Dataverse: 0.6 M
• Institutional Repositories
A small portion of data (1-2%?) stored in small, topic-focused data repositories:
• MiRB: 25 k
• PetDB: 1.5 k
• TAIR: 72.1 k
• PDB: 88.3 k
• SedDB: 0.6 k
1. How do we get researchers to curate, store and share their data?
2. How do we ensure long-term sustainability for high-end repositories?
3. What role do libraries/institutions play?
de Waard, A., Burton, S. et al., 2013
1.1. An attempt to get researchers to curate (but only partially share!) their data:
• In 220 publications only 40% of antibodies, 40% of cell lines and 25% of constructs can be manually identified (Vasilevsky et al., submitted)
• Proposal (with NIH/NIF and Force11 Group):
– Adding minimal data standards
– Tool extracts likely reagents / resources
– User interface asks author to confirm or select
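The proposed workflow (a tool extracts likely reagents, then the author confirms or selects) might look roughly like the sketch below. It is not the NIH/NIF or Force11 tool: the vocabulary `KNOWN_RESOURCES`, the function names and the sample sentence are hypothetical, and a dictionary of answers stands in for the user interface.

```python
import re

# Hypothetical vocabulary; a real tool would presumably query a resource registry.
KNOWN_RESOURCES = {
    "anti-GFP": "antibody",
    "HeLa": "cell line",
    "pcDNA3": "construct",
}


def extract_candidates(text, vocabulary=KNOWN_RESOURCES):
    """Return (name, kind) pairs for vocabulary terms mentioned in the text."""
    found = []
    for name, kind in vocabulary.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((name, kind))
    return found


def confirm(candidates, answers):
    """Keep only the candidates the author confirmed.

    `answers` maps a candidate name to True/False and stands in for the
    user interface; unanswered candidates are treated as unconfirmed.
    """
    return [(name, kind) for name, kind in candidates if answers.get(name, False)]


# Made-up methods sentence, for illustration only.
methods = "HeLa cells were transfected with pcDNA3 and stained with anti-GFP."
candidates = extract_candidates(methods)
confirmed = confirm(candidates, {"HeLa": True, "pcDNA3": True, "anti-GFP": False})
print(confirmed)
```

Keeping extraction and confirmation separate reflects the proposal's split: the tool only suggests, and the author has the final say on what is recorded.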
1.2. What to do in the meantime?
[Figure: chart with groups of 49, 193, 76, 214 and 210 publications]
Pilot project with IEDA:
– Build a database for lunar geochemistry
– Write joint report on building repository, curation, costs and challenges
2.2 How can research databases become long-term sustainable?
With WDS/RDA WG:
• Planning survey of cost recovery models for research databases
• Input/inspiration: ICPSR Sloan-funded project ‘Sustaining Domain Repositories for Digital Data’
• Developing overarching funding model:
2.2 Cost recovery questionnaire:
2.3. A first stab at a model:
[Diagram (DRAFT - CC-BY-NC 2013, Todd Vision & Anita de Waard): flow of funds from the data producer or sponsor and/or from the data user. Labels: Private store; Data publication; Service Collaboration; Conclave; Subscription content; Commercial overlay. Access: Closed; Public; Limited; Limited Academic Use/Limited. Examples: commercial and institutional storage; Dryad, arXiv, PDB; many small operations, e.g. try-db.org, plhdb.org; ICPSR, CERN-LHC; KEGG; GeoFacets; Reaxys]
3.1. Where do institutional repositories fit in?
• Local data repository
– Advantages: Easy! No one steals your data.
– Disadvantages: No one sees it. Not compliant with requirements.
• Generic data repository
– Advantages: Not very hard to do. Have complied!
– Disadvantages: Data can’t be easily reused. Credit?
• Institutional Repository
– Advantages: Can use existing IR? Tracking and compliance checks.
– Disadvantages: Data can’t easily be reused. Credit?
• Domain-specific data repository
– Advantages: Data can be reused. Credit!
– Disadvantages: Lot of work for curators. Long-term sustainable?
[Arrow labels beside the table: Effort, Reuse, Credit, Compliance; Habit, Ease, Privacy, Control; Higher quality metadata]
3.2. The poor researcher:
[Diagram labels: Funding Agency; University; Collaborators; Domain of study; Domain-Specific Data Repository; Local Data Repository; Institutional Data Repository; Generic Data Repository]
AND THEY ALL WANT DIFFERENT METADATA!!!!
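One way to picture the "different metadata" problem is a crosswalk: the researcher describes the dataset once, and a mapping renames the fields for each repository. The sketch below is only an illustration; the core record, the repository names and every field name in `CROSSWALKS` are hypothetical and do not correspond to any real repository's schema.

```python
# One core record, described once by the researcher (all values are made up).
core_record = {
    "title": "Example dataset",
    "creator": "A. Researcher",
    "year": 2013,
    "domain": "geochemistry",
}

# Hypothetical crosswalks: core field -> field name each repository expects.
CROSSWALKS = {
    "institutional": {"title": "dc_title", "creator": "dc_creator", "year": "dc_date"},
    "generic": {"title": "name", "creator": "author", "year": "published"},
    "domain": {"title": "dataset_title", "creator": "investigator", "domain": "discipline"},
}


def translate(record, crosswalk):
    """Rename the core fields a repository asks for; skip fields the record lacks."""
    return {
        target: record[source]
        for source, target in crosswalk.items()
        if source in record
    }


def missing_fields(record, crosswalk):
    """List core fields a repository wants that the record does not yet have."""
    return sorted(source for source in crosswalk if source not in record)


for repo, crosswalk in CROSSWALKS.items():
    print(repo, translate(core_record, crosswalk))
```

A crosswalk only helps where the repositories ask for the same information under different names; where they want genuinely different information, `missing_fields` shows what the researcher would still have to supply.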
3.3. Possible pilot project:
[Diagram labels: Domain repository; IR; Data; Metadata: What data was stored/viewed]
• Interview institutions
• Normalize reporting data
• Talking to:
– IQSS, Harvard
– ICPSR, U Mich
– DataDryad, UNC
– Pangaea, Germany
3.4. Institutional Pilot study:
• Planning series of interviews at key institutions:
– What role do libraries/institutions play wrt research data management?
– What tools/metadata standards are used?
– What aspects of data deposition are the Research Office/IR/Institution interested in?
– How does this compare with what scientists want and do in their labs?
• Outcomes:
– Share knowledge (within institution)
– Write joint report (anonymised)
– Establish joint plan of action
Elsevier Research Data Services:
• 2013/2013: Series of pilots, reviews, and reports:
– With CMU: Data/metadata entry and sharing
– With IEDA: Repository creation: feasibility study & report
– With RDA: Cost of Data Repositories questionnaire
– With series of institutes: Interviews re. role of institution
• Main questions:
– What are key needs?
– Can we play a role: skillsets, partnerships?
– Is there a (transparent) business model for this?
• Principles:
– Collaboration is tailored to partner’s needs, using local resources;
– Collaboration plan is MoU/Service-Level Agreement;
– At all times, all data, reports and software are open and shared.
In summary:
1. If researchers start to curate and share their data…
2. And research databases become long-term sustainable…
3. And libraries, data repositories and grid infrastructures start to work together…
We might enable a knowledge infrastructure that allows us to jointly tackle the questions of life!
Many questions remain:
? What carrots and sticks will make researchers share their data?
? How do we create interoperable metadata layers?
? What role would the institution/library play?
? What are sustainable models, moving forward?
? Is there a place for publishers, in all this?
Thank you! Collaborations and discussions gratefully acknowledged:
• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
• UCSD: Phil Bourne, Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky
• NIF: Maryann Martone, Anita Bandrowski
• MSU: Brian Bothner
• OHSU: Melissa Haendel, Nicole Vasilevsky
• California Digital Library: Carly Strasser, John Kunze, Stephen Abrams
• Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
• ICPSR: George Altman, Mary Vardigan
• CNI: Clifford Lynch
• Harvard: Michael Kurtz, Chris Erdmann
• MIT: Micah Altman
• UVM: Mara Saule
• RDA: Simon Hodson, Michael Diepenbroek
Your questions?
Anita de Waard, VP Research Data Collaborations,
Elsevier Research Data Services (VT) [email protected]
http://researchdata.elsevier.com/