The researcher perspective, Jean-Fred Fontaine, MDC Berlin
Text and data mining for Biomedical Research
Dr. Jean-Fred Fontaine, Max Delbrück Center for Molecular Medicine, Berlin
Scientific project and biomedical literature
Project design
Analysis
Experiments
Communication
• Methods • Explanations • New hypotheses
• State of the art • Innovative ideas
• Technologies • State of the art • Explanations • Open hypotheses • Perspectives
Data growth
Literature growth
Molecular data growth
Accessibility
Krallinger et al. (2010) Methods Mol Biol.
18 M (all)
9.7 M - TEXT MINING OF ABSTRACTS
8.6 M
2.4 M - (freely readable)
1.8 M
0.2 M - TEXT MINING OF FULL TEXTS*
* PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
Document retrieval
Alzheimer’s disease?
By date
Fontaine et al. (2009) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/medlineranker/
By relevance
Medline Ranker
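To make the difference between "By date" and "By relevance" concrete, here is a minimal sketch of relevance ranking by word weights. It is a toy scheme written for illustration, not code taken from Medline Ranker; the function names, the smoothing, and the example abstracts are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Distinct lowercase word tokens of one abstract.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def train_weights(relevant, background):
    # Weight of a word = log ratio of its smoothed document frequency
    # in the relevant set versus the background set (hypothetical scheme).
    rel_df = Counter(w for doc in relevant for w in tokenize(doc))
    bg_df = Counter(w for doc in background for w in tokenize(doc))
    weights = {}
    for word in set(rel_df) | set(bg_df):
        p_rel = (rel_df[word] + 1) / (len(relevant) + 2)
        p_bg = (bg_df[word] + 1) / (len(background) + 2)
        weights[word] = math.log(p_rel / p_bg)
    return weights


def rank(candidates, weights):
    # Score = sum of the weights of the words present; unseen words count 0.
    scored = [(sum(weights.get(w, 0.0) for w in tokenize(doc)), doc)
              for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up abstracts standing in for a topic query such as "Alzheimer's disease".
relevant = ["amyloid beta plaques in alzheimer disease",
            "tau protein and alzheimer dementia"]
background = ["crop yield in wheat fields",
              "protein folding in yeast"]
candidates = ["wheat genome sequencing",
              "tau and amyloid in dementia patients"]

weights = train_weights(relevant, background)
ranked = rank(candidates, weights)
for score, doc in ranked:
    print(f"{score:+.2f}  {doc}")
```

Sorting by score instead of by publication date is what moves the topically closest abstracts to the top of the list.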
[Figure: Citations in PubMed®; x-axis 1940 to 2008, y-axis 0 to 25,000,000]
Discovery of gene-disease associations
Database mining
Fontaine et al. (2011) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/genie
Medline Ranker / Génie
Rank 20 000 genes
Discovery of gene- and drug-disease associations
Frijters et al. (2010) PLoS Comput Biol.
?
Before 2007
After 2007
Semantic analysis
Knowledge bases
Van Landeghem et al. (2013) PLoS One.
Network construction
Miljkovic et al. (2012) PLoS One.
Modelling Plant Defence Response
Trends
Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab.
http://www.ogic.ca/mltrends/
Surveillance of Surgical Site Infections
Campillo-Gimenez et al. (2013) Stud Health Technol Inform.
2008-2009 relevant records
Classification
2010 medical reports
                 Conventional surveillance   ICD10 codes   Full-text medical reports
TRUE positive                            3            11                          12
FALSE positive                           0           219                          18
FALSE negative                          10             2                           1
TRUE negative                         1212           993                        1194
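The counts above can be turned into the usual summary measures. The sketch below derives sensitivity, specificity and precision for each of the three approaches; these derived values are computed here from the slide's counts and are not quoted from the slide or the cited paper.

```python
# Confusion-matrix counts as given on the slide (1225 reports per approach).
COUNTS = {
    "Conventional surveillance": dict(tp=3, fp=0, fn=10, tn=1212),
    "ICD10 codes": dict(tp=11, fp=219, fn=2, tn=993),
    "Full-text medical reports": dict(tp=12, fp=18, fn=1, tn=1194),
}


def metrics(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were found
        "specificity": tn / (tn + fp),  # share of non-cases correctly rejected
        "precision": tp / (tp + fp),    # share of alerts that were true cases
    }


results = {name: metrics(**counts) for name, counts in COUNTS.items()}
for name, m in results.items():
    print(f"{name}: " + ", ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

Read this way, the full-text approach finds 12 of the 13 cases, against 3 of 13 for conventional surveillance, while raising far fewer false alerts than the ICD10-based approach (18 versus 219).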
University Hospital of Rennes, France
SSI secondary to neurosurgery
Electronic Patient Records
ICD10 codes; Free text
Disease Correlations from Electronic Patient Records
Avg. ICD10 codes: Manual 2.7; Text Mining 9.5
Roque et al. (2011) PLoS Comput Biol.
Patient records
ICD10 codes
Manual
Text Mining
Figure labels: Alopecia, Migraine, THRA, ESR1, HR
Co-morbidity: 93 / 802 unexpected
Ex. Alopecia and Migraine
Summary
• Computers and biomedical literature and data
  • Generation
  • Storage
  • Analysis
• Text and data mining
  • Useful from project start to finish
  • Broad and critical applications
  • Information retrieval
  • Information extraction
  • Knowledge databases
  • Knowledge discovery
• Limited by text availability
• Accuracy in some applications
  • Ambiguity, complex sentences, document context, novelty
  • “Protein A and its partners”
• From abstracts to full texts
  • Current methods optimized for short texts (abstracts)
  • Figures and tables
  • Supplementary information
• File format
  • The PDF problem
  • XML: structured format
    Abstract, Introduction, Results, Methods, Discussion, References, ...
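A short sketch of why a structured format helps: with XML, each section of an article can be addressed by name instead of being guessed from page layout. The tag names and the sample document below are hypothetical and chosen only for illustration; real publisher XML follows its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article; not a real publisher schema.
SAMPLE = """<article>
  <abstract>Short summary of the study.</abstract>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Methods</title><p>Protocol text.</p><p>More protocol.</p></sec>
    <sec><title>Results</title><p>Findings text.</p></sec>
  </body>
</article>"""


def sections(xml_text):
    # Map each section title to its plain text, in document order.
    root = ET.fromstring(xml_text)
    out = {}
    abstract = root.find("abstract")
    if abstract is not None:
        out["Abstract"] = "".join(abstract.itertext()).strip()
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="(untitled)").strip()
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        out[title] = " ".join(paragraphs)
    return out


by_section = sections(SAMPLE)
for title, text in by_section.items():
    print(f"{title}: {text}")
```

A mining tool can then, for example, restrict itself to Methods or skip References without any layout analysis, which is the step that makes PDF input hard.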
Challenges
Needs
• Copyright
  • Teach scientists
  • Unify licenses
• Availability
  • All significant documents
    Articles, reviews, case reports, letters
  • The main structured text (XML)
  • No figures (or optional)
  • texts mostly useless for readers
  • Supplements: optional
  • No fancy user interface or webservice
  • FTP/P2P + Compressed XML
• Communicating Research results
  • Open Access
  • As text
  • As data
    • standardized list of facts
    • standardized figures data and tables
# articles   Compressed file size*
1            13 KB
1M           12 GB
20M          250 GB
* Projections based on PMC Open Access 2012
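The projections above are roughly what simple linear scaling of the per-article size gives. The sketch below assumes that 13 KB means 13 × 1024 bytes and that size grows linearly with the number of articles; neither assumption is stated on the slide.

```python
KIB = 1024
GIB = 1024 ** 3
PER_ARTICLE_BYTES = 13 * KIB  # assumed reading of "13 KB" per compressed article


def projected_gib(n_articles):
    # Linear extrapolation: total size = articles x per-article size.
    return n_articles * PER_ARTICLE_BYTES / GIB


for n in (1_000_000, 20_000_000):
    print(f"{n:>10,} articles: about {projected_gib(n):.1f} GiB")
```

Under these assumptions, one million articles come to about 12.4 GiB and twenty million to about 248 GiB, in line with the 12 GB and 250 GB shown in the table.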