Martone grethe

Methodologies for Long-Tail Data Sharing: What Have We Learned? Maryann E. Martone, Ph.D., University of California, San Diego and Hypothesis; Jeffrey S. Grethe, Ph.D., University of California, San Diego

Transcript of Martone grethe

Page 1: Martone grethe

Methodologies for Long-Tail Data Sharing: What Have We Learned?

Maryann E. Martone, Ph.D., University of California, San Diego and Hypothesis

Jeffrey S. Grethe, Ph.D., University of California, San Diego

Page 2: Martone grethe

[Figure: resource types tracked over the years: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource]

NIF is an initiative of the NIH Blueprint consortium of institutes. NIF has been tracking and cataloging the biomedical resource landscape since 2008.

Page 3: Martone grethe

The current “Addictome”

NIF searches across:

• Resource Registry (13,000+)

• > 200 deeply integrated data sources (>800 million records)

• Literature

Query: Addiction

Page 4: Martone grethe

e-Science Ecosystem

The digital world runs on globally unique and persistent identifiers (PIDs). A PID serves as a “key” for identifying the same entity across different contexts.

[Diagram labels: PID, ORCID, RRID, DOI; People, Research resources, Data, Concepts; Ontology, Metadata standards, Protocols, Minimal Information Models, CDE; Aggregator; Translation; Non-digital; Repositories and Registries, e.g. NIF, Monarch, NIH Data Discovery Index]

eScience goal: Make data Findable, Accessible, Interoperable, Re-usable (FAIR) for both humans and machines

Page 5: Martone grethe

Resource Identification Initiative: Supplying unique identifiers for key research resources

“The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich)…”

vs.

“The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich, RRID:AB_262137)…”

https://scicrunch.org/resolver/RRID:AB_262137
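A minimal sketch of why the added RRID matters for machines: once the identifier is in the methods text, it can be pulled out with a pattern match and turned into a resolver link. The resolver prefix is the one shown on the slide; the regular expression and the function names are illustrative assumptions, not an official RRID grammar.

```python
import re

# Resolver prefix as shown on the slide.
RESOLVER_PREFIX = "https://scicrunch.org/resolver/"

# Assumed pattern, for illustration only: "RRID:" followed by a registry
# prefix, an underscore, and an accession (e.g. RRID:AB_262137).
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9_-]+)")


def extract_rrids(text: str) -> list[str]:
    """Return the RRIDs found in text, in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}
    for match in RRID_PATTERN.finditer(text):
        seen.setdefault("RRID:" + match.group(1), None)
    return list(seen)


def resolver_url(rrid: str) -> str:
    """Build a resolver link by appending the RRID to the resolver prefix."""
    return RESOLVER_PREFIX + rrid


methods = (
    "The following antibodies were used for immunoblotting: "
    "actin mAb (1:10,000 dilution, Sigma-Aldrich, RRID:AB_262137)"
)
for rrid in extract_rrids(methods):
    print(rrid, "->", resolver_url(rrid))
```

The first version of the sentence on the slide contains nothing this function can match, which is the point of the comparison: without the identifier, the reagent can only be guessed at from free text.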

Page 6: Martone grethe

Minimal Information Standards

http://precedings.nature.com/documents/1720/version/1

http://precedings.nature.com/documents/1720/version/1/files/npre20081720-1.pdf

A set of guidelines for reporting data that ensures the data can be easily verified, analysed and clearly interpreted by the wider scientific community. The recommendations also provide a foundation for structured databases, public repositories and development of data analysis tools.

https://en.wikipedia.org/wiki/Minimum_Information_Standards

MINI: Minimum Information about a Neuroscience Investigation

[Diagram: MIM composed of CDE 1, CDE 2, ..., CDE N; Value Set]

Page 7: Martone grethe

Common Data Elements

https://cde.nlm.nih.gov/home

http://www.nlm.nih.gov/cde/

A data element that is common to multiple datasets and is used to improve data quality and promote data sharing. CDEs usually describe the following data element properties: Name, Definition, Instructions, Provenance, Value Set.
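The five properties listed above map naturally onto a small record type, and a minimal information model can then be treated as a list of such elements that a dataset is expected to fill in. The sketch below is illustrative only: the class, the field names, and the two example elements are hypothetical and do not reproduce the NLM CDE repository schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """One CDE, carrying the five properties named on the slide."""
    name: str
    definition: str
    instructions: str
    provenance: str
    value_set: tuple[str, ...] = ()


def missing_elements(model: list[CommonDataElement], record: dict[str, str]) -> list[str]:
    """Names of the CDEs in the model that the record does not report."""
    return [cde.name for cde in model if cde.name not in record]


# Hypothetical two-element model, invented for this example.
species = CommonDataElement(
    name="species",
    definition="Species of the experimental subject",
    instructions="Use the binomial name",
    provenance="hypothetical example",
    value_set=("Mus musculus", "Rattus norvegicus"),
)
sex = CommonDataElement(
    name="sex",
    definition="Sex of the experimental subject",
    instructions="Choose one value",
    provenance="hypothetical example",
    value_set=("female", "male", "unknown"),
)
model = [species, sex]

record = {"species": "Mus musculus"}
print(missing_elements(model, record))
```

Because every dataset that follows the model uses the same element names, records from different labs can be compared field by field.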

Page 8: Martone grethe

Value Sets

The set of possible values or responses. A Value Set often includes concepts from established Vocabularies, Ontologies or Data Standards. A value set may also include a range of permissible values and indicate the required units. For a survey question, the value set may be a list of possible responses.

http://neurolex.org/wiki/Category:Hippocampus_CA1_pyramidal_cell
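The two kinds of value set described on this slide, an enumerated list of concepts and a permissible range with required units, can be sketched as one small validator. All names, the second cell type entry, and the age range are assumptions made for illustration; only "Hippocampus CA1 pyramidal cell" comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueSet:
    """Either an enumerated set of permissible values or a numeric range with units."""
    permissible: frozenset = frozenset()
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    units: Optional[str] = None

    def accepts(self, value, units: Optional[str] = None) -> bool:
        # Enumerated value set: the value must be one of the listed concepts.
        if self.permissible:
            return value in self.permissible
        # Range value set: the units must match and the number must be in range.
        if self.units is not None and units != self.units:
            return False
        if not isinstance(value, (int, float)):
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# Enumerated value set; the second entry is a hypothetical addition.
cell_type = ValueSet(permissible=frozenset({
    "Hippocampus CA1 pyramidal cell",
    "Hippocampus CA3 pyramidal cell",
}))

# Hypothetical range value set: age between 0 and 1000 days.
age = ValueSet(minimum=0, maximum=1000, units="days")

print(cell_type.accepts("Hippocampus CA1 pyramidal cell"))
print(age.accepts(30, units="weeks"))
```

Drawing the enumerated values from an established vocabulary, as the slide suggests, is what lets two datasets agree that they describe the same cell type.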

Page 9: Martone grethe

Neuroscience Information Framework

“a tool for analyzing and structuring information”

“a reduction in uncertainty”

• Ontologies are the major way that NIF searches for and organizes information

• Aggregate of community ontologies, e.g., Gene Ontology, ChEBI, Protein Ontology

• Still significant gaps for behavioral and physiological concepts and techniques

• Available as services through NIF so they can be built into applications

[Diagram: NIFSTD modules: Organism, Molecule, Macromolecule, Gene, Molecule Descriptors, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure, NS Function, Subcellular structure, Investigation, Protocols, Reagent, Techniques]

Page 10: Martone grethe

Concept-based query

Remove synonyms

Ontologies and their relationships let us probe the data space for related concepts
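A minimal sketch of what a concept-based query does: the search term is expanded with its synonyms and with the concepts beneath it in the ontology before any records are matched. The toy ontology, the record titles, and the function names are hypothetical, and this is not NIF's implementation.

```python
from collections import deque

# Toy ontology, invented for illustration and far smaller than NIFSTD.
SYNONYMS = {
    "forebrain": ["prosencephalon"],
    "midbrain": ["mesencephalon"],
    "hindbrain": ["rhombencephalon"],
}
PARTS = {
    "forebrain": ["hippocampus"],
    "hippocampus": ["ca1"],
}


def expand_query(term: str) -> list[str]:
    """Return the term, its synonyms, and all of its parts (with their synonyms)."""
    expanded: list[str] = []
    seen: set[str] = set()
    queue = deque([term.lower()])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        expanded.append(current)
        expanded.extend(SYNONYMS.get(current, []))
        queue.extend(PARTS.get(current, []))
    return expanded


def search(records: list[str], term: str) -> list[str]:
    """Keep the records that mention any term in the expanded query."""
    terms = expand_query(term)
    return [r for r in records if any(t in r.lower() for t in terms)]


records = [
    "CA1 pyramidal cell recordings",
    "Prosencephalon atlas",
    "Cerebellum volume dataset",
]
print(expand_query("Forebrain"))
print(search(records, "forebrain"))
```

A plain keyword search for "forebrain" would return none of these records; the expanded query finds the first two because the ontology relates them to the concept.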

Page 11: Martone grethe

What have we learned?

• The landscape is vibrant, dynamic and growing, but also littered with abandoned and unrealized projects

• Data belongs in a data repository, not on your lab server

• People are important in this endeavor: leaders, curators, community engagement specialists

• Data and ontology resources become interesting when they are comprehensive: populate!!!

• Assume that you will be resource limited and plan accordingly: time, money, personnel

• Cost-benefit analysis: what to do now vs. later

• Technology will improve

• Don’t start from square one: resources exist to help; help support them

Page 12: Martone grethe

Extra Slides

Page 13: Martone grethe

Dimensions of FAIR data sharing

• Discoverability

– Data can be found

– Data set has an identifier and links are stable

• Accessibility

– Data can be accessed programmatically

– Access rights are clear

• Assessability

– Provenance is known

– Reliability can be determined

• Understandability

– The data can be understood

• Usability

– The data are actionable

– Data are not in a proprietary format

Goodman, A. et al. Ten simple rules for the care and feeding of scientific data. PLoS Comput Biol 10, e1003542, doi:10.1371/journal.pcbi.1003542 (2014)

Science as an open enterprise, Royal Society: https://royalsociety.org/policy/projects/science-public-enterprise/Report/
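Some of the dimensions listed on this slide can be approximated by simple checks on a dataset's metadata record. The sketch below is a rough proxy under stated assumptions: the metadata field names and the list of open formats are hypothetical, and criteria such as stable links or reliability cannot be judged from metadata alone.

```python
# Assumed list of non-proprietary formats, for illustration only.
OPEN_FORMATS = {"csv", "tsv", "json"}


def fair_report(metadata: dict[str, str]) -> dict[str, bool]:
    """Map each dimension from the slide to a crude pass/fail check."""

    def has(key: str) -> bool:
        # A field counts as present only if it is non-empty.
        return bool(metadata.get(key, "").strip())

    return {
        "Discoverability": has("identifier"),
        "Accessibility": has("api_url") and has("license"),
        "Assessability": has("provenance"),
        "Understandability": has("description"),
        "Usability": metadata.get("file_format", "").lower() in OPEN_FORMATS,
    }


# Hypothetical metadata record; every value here is invented.
dataset = {
    "identifier": "doi:10.0000/example",
    "api_url": "https://example.org/api/datasets/1",
    "license": "CC-BY-4.0",
    "description": "Example dataset description",
    "file_format": "CSV",
}
for dimension, passed in fair_report(dataset).items():
    print(f"{dimension}: {'ok' if passed else 'missing'}")
```

In this example the record has no provenance field, so Assessability is the one dimension reported as missing.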

Page 14: Martone grethe

FORCE11: Future of Research Communications and e-Scholarship

• Resource Identification Initiative: https://www.force11.org/group/resource-identification-initiative

• FAIR Data Guiding principles: https://www.force11.org/group/fairgroup/fairprinciples

• Data Citation Principles: https://www.force11.org/group/joint-declaration-data-citation-principles-final

• On creating machine-readable data citations: https://peerj.com/articles/cs-1/

• 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data: https://zenodo.org/record/18003#.VeOxxLQjvyA

FORCE11.org: Grassroots organization dedicated to transforming scholarship through technology

Page 15: Martone grethe

Mapping the data landscape: Anatomical framework

[Figure: data sources mapped onto brain regions (Forebrain, Midbrain, Hindbrain); legend "Data Sources": 0, 1-10, 11-100, >101]

~800 million records across ~200 databases or views