BiTeM / SIBTex @ TREC CDS 2014

Post on 10-Aug-2015


Full-texts representation with MeSH, co-citations network reranking

BiTeM/SIBtex group

J Gobeill (me), A Gaudinat, E Pasche and P Ruch

University of Applied Sciences, Swiss Institute of Bioinformatics, Hospitals and University of Geneva

The BiTeM / SIBtex group

• Text Mining and Bibliomics (P Ruch)
o Strong focus on clinical and biological data
o heg (training librarians) and SIB (assisting biocurators)

• Long history of participation in TREC campaigns
o Genomics, Chemical IR, Medical Records…

• Translational medicine projects (EU FP7 Programme)
o Khresmoi: multimodal medical search engine
o MD-Paedigree: retrieval of similar cases for clinicians

The CDS Track 2014

• Clinical Decision Support: « retrieval of biomedical articles relevant for answering generic clinical questions about medical records »

Ex. query: « 25-year-old woman with fatigue, hair loss, weight gain, and cold intolerance for 6 months »

Collection: subset of PubMed Central

Strategies for TREC CDS 2014

Document Representation

1. Classical document representation with text

2. Document representation with MeSH

3. Target-specific semantic enrichment with MeSH

Reranking

4. Boosting based on article types

5. Exploitation of the co-citations network

IR performed with the Okapi BM25 weighting scheme
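For readers unfamiliar with Okapi BM25, the weighting can be sketched in a few lines. This is only an illustration, not the engine configuration used for the runs: the parameter values `k1=1.2` and `b=0.75`, the non-negative idf variant, and the function name `bm25_scores` are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1, b and the idf variant are assumed defaults, not the slide's settings.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturated by k1 and normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document matching more query terms, or matching them in a shorter text, receives a higher score; a document with no query term scores 0.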

BiTeM official results

[Figure: official results charts, with « our baseline » marked]

Creating a baseline

1. Classical document representation with text

[Diagram: Text index, Search engine]

1. Classical document representation with text

• Two different indexing levels:
o Document
o Section
Run 2 vs run 4: document > section (+65%)

• Query representation (R-Prec):
o Numbers removed (no age)
o Only description: 0.169
o Only summaries: 0.170
o Both: 0.185 (+10%)
Signal/noise ratio: better with more information
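The query representation above (both topic fields, numbers removed) could be built along these lines. The slide does not say how numbers were stripped, so this sketch assumes that any token containing a digit is dropped, which also removes ages such as "25-year-old"; the function name `build_query` is hypothetical.

```python
import re

def build_query(description, summary):
    """Concatenate both topic fields and drop every token containing a digit.

    Dropping whole tokens such as "25-year-old" is an assumption of this sketch.
    """
    text = f"{description} {summary}".lower()
    # Keep hyphenated compounds together so that ages are removed as one token
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]
```

For example, `build_query("A 25-year-old woman with fatigue.", "Fatigue for 6 months")` keeps the clinical words from both fields and discards the age and the duration.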

Document

Sections

Creating a complementary view

2. Document representation with MeSH

[Diagram: MeSH index, Search engine]

MeSH for PMC 2649306:
D008569 Memory Disorders
D001921 Brain
D001284 Atrophy
D001706 Biopsy
D005911 Gliosis

2. Document representation with MeSH

• Two possible sources:
o Collected from MEDLINE when there is a PMID
o Extracted from documents with a categorizer (strict mapping)

• Two possible integrations between original text and MeSH:
o Building separate indexes then combining runs
o Merging both representations into one unique document
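A strict-mapping categorizer and the "one unique document" integration can be illustrated as follows. This is a minimal sketch, not the group's categorizer: `TOY_MESH` is a tiny hand-written lookup for this example (its IDs are those quoted on the slides, its term choices are assumptions), and both function names are hypothetical.

```python
import re

# Toy term -> MeSH ID lookup written for this sketch, not the real thesaurus
TOY_MESH = {
    "memory": "D008568",
    "memory loss": "D008569",
    "leg": "D007866",
    "brain": "D001921",
    "atrophy": "D001284",
    "biopsy": "D001706",
    "astrocytosis": "D005911",
}

def strict_mesh_mapping(text, lexicon=TOY_MESH, max_words=3):
    """Return the MeSH IDs of every lexicon entry found verbatim in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    # Try every phrase of 1..max_words consecutive words against the lexicon
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            mesh_id = lexicon.get(phrase)
            if mesh_id and mesh_id not in found:
                found.append(mesh_id)
    return found

def merge_text_and_mesh(text, mesh_ids):
    """Append the MeSH IDs to the text so a single index holds both views."""
    return text + " " + " ".join(mesh_ids)
```

Because the match is exact, "movements" would not map to a "movement" entry unless the lexicon lists that form, which is one way strict mapping misses concepts.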

Example of MeSH mapping

<topic number="8"><summary>62-year-old man with progressive memory loss and involuntary leg movements. Brain MRI reveals cortical atrophy, and cortical biopsy shows vacuolar gray matter changes with reactive astrocytosis.</summary>

MeSH concepts found:
D008568 Memory
D008569 Memory Disorders
D007866 Leg
D009068 Movement
D001921 Brain
D001284 Atrophy
D001706 Biopsy
D005911 Gliosis

• Some good (power of synonyms)
• Some broad
• Some missing (too ambiguous)

D013035: Muscular Spasm ?

D002540: Cerebral Cortex ?

D008279: MH = Magnetic Resonance Imaging ? Medical Research Institute ? Moderate Renal Insufficiency ?

Top 15 MeSH in benchmark

MEDLINE MeSH in docs      Extracted MeSH in docs      Extracted MeSH in topics
Humans                    Cells                       Women
Animals                   Ficus (because of «fig»)    History
Female                    Patients                    Pain
Male                      Time                        Blood
Adult                     Genes                       Physical Examination
Middle Aged               Therapeutics                Female
Mice                      Methods                     Blood Pressure
Aged                      Role                        Pressure
Adolescent                Humans                      Dyspnea
Molecular Sequence Data   Disease                     Family
Rats                      Volition                    Thorax
Young Adult               Mice                        Urine
Time Factors              Attention                   Fever
Child                     DNA                         Male
Signal Transduction       Population                  Emergencies

Results for MeSH representation

• Best R-Prec 0.143 for MeSH representation (vs 0.211 for text)
o MeSH concepts collected from MEDLINE not useful (best R-Prec 0.028)
o Only 53% of documents had MeSH terms in MEDLINE

• Complementarity for finding relevant documents (thanks to qrel):
o Low complementarity
o Combination: 0.211 -> 0.213
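The R-Prec figures and the complementarity check both rely on the qrel. As a reminder of what is being measured, here is a minimal sketch; the function names are hypothetical, and defining complementarity as "relevant documents found by one run and missed by the other" is an assumption about how it was assessed.

```python
def r_precision(ranked_doc_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:r] if doc_id in relevant_ids)
    return hits / r

def complementarity(run_a, run_b, relevant_ids):
    """Relevant documents retrieved by run_b that run_a missed."""
    return (set(run_b) & set(relevant_ids)) - set(run_a)
```

With `run_a` the text run and `run_b` the MeSH run, a small result set from `complementarity` is consistent with the marginal gain of the combined run.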

Favoring target types

MeSH for PMC 2649306:
D008569 Memory Disorders
D001921 Brain
D001284 Atrophy
D005911 Gliosis
D001706 Biopsy

Target words added: MeSHtargetDiagnosis, MeSHtargetDiagnosis, MeSHtargetTest

Do relevant documents for diagnosis deal more with diagnosis ?

3. Target-specific semantic enrichment with MeSH


• In UMLS, each MeSH term has Semantic Types (ex: T060 Diagnostic Procedure)

Focus on targets (diagnosis, treatments and tests)

• Specific words (ex: «MeSHtargetDiag») are added in docs and queries
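The enrichment step above can be pictured as a lookup from UMLS Semantic Types to marker words. In this sketch only T060 and the marker «MeSHtargetDiag» come from the slide; which Semantic Type feeds which target, the names `MeSHtargetTest` and `MeSHtargetTreat`, and the function name are assumptions made for illustration.

```python
# Semantic Type -> marker word. The pairings below are assumptions of this
# sketch; the slide only names T060 and the marker "MeSHtargetDiag".
TARGET_MARKERS = {
    "T047": "MeSHtargetDiag",   # Disease or Syndrome
    "T060": "MeSHtargetTest",   # Diagnostic Procedure
    "T059": "MeSHtargetTest",   # Laboratory Procedure
    "T061": "MeSHtargetTreat",  # Therapeutic or Preventive Procedure
}

def add_target_markers(tokens, semantic_types):
    """Append one marker word per target-related Semantic Type occurrence."""
    markers = [TARGET_MARKERS[t] for t in semantic_types if t in TARGET_MARKERS]
    return list(tokens) + markers
```

Applying the same function to documents and queries lets the search engine match on the marker words like on any other term.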

Target      % docs that have at least 1   Average number in documents
Test        83 %                          16
Diagnosis   86 %                          41
Treatment   86 %                          24

Small improvement only for section indexing

In the qrel…

Set                                               Aver. Diagnosis MeSH   Aver. Test MeSH   Aver. Treatment MeSH
All collection                                    41                     16                24
Relevant for diagnosis (1|2 for queries 1..10)    108                    41                41
Relevant for test (1|2 for queries 11..20)        107                    41                33
Relevant for treatment (1|2 for queries 21..30)   114                    47                52

All relevant documents:
o Are quite similar, with no distinction between targets
o But have 2 to 3 times more target MeSH terms
o ... but it’s also the case for documents with 0 in the qrel

4. Boosting based on article types

Promoting some article types

Are some article types more likely to be relevant ?

Distribution of article types

Article type       in docs   in qrel   in our runs
research-article   74.3 %    52.2 %    37.9 %
case-report        4.0 %     20.4 %    41.5 %
review-article     6.9 %     17.9 %    10.9 %
Other              2.6 %     3.2 %     3.6 %
brief-report       1.1 %     1.5 %     0.9 %

4. Boosting based on article types

• Strategy: to promote review and case-based articles (boosting)

• Intuition was good… but the strategy failed!

• In reality… the IR engine already promoted these types!
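A boosting step of this kind might look like the following sketch. The slides do not give the boost formula or values, so the multiplicative form, the `boosts` dictionary and the function name are assumptions.

```python
def boost_by_article_type(run, article_types, boosts):
    """Multiply each score by the boost of its article type, then re-sort.

    run: list of (doc_id, score); article_types: doc_id -> type label;
    boosts: type label -> factor (hypothetical values, default 1.0).
    """
    rescored = [
        (doc_id, score * boosts.get(article_types.get(doc_id), 1.0))
        for doc_id, score in run
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

If the engine already ranks case reports and reviews high, as observed here, such a boost mostly reshuffles documents that were already near the top.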

Top 5

5. Exploitation of the co-citations network

Promoting citations

Are citations of retrieved documents relevant ?

5. Exploitation of the co-citations network

• E is the set of retrieved documents

• RSV_e is the Retrieval Status Value of doc_e

• We boost each citation of doc_e by + α × RSV_e

• 50% of documents cite another one in the collection (avg 3.8 citations)
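The boosting rule above translates directly into code. In this sketch the data structures and the function name are hypothetical, and two points the slide leaves open are settled by assumption: boosts are computed from the initial RSV values only, and a cited document that was not retrieved enters the ranking with a starting score of 0.

```python
def boost_citations(run, citations, alpha=0.1):
    """Add alpha * RSV_e to every document cited by a retrieved document e.

    run: doc_id -> RSV from the initial retrieval (the set E);
    citations: doc_id -> list of cited doc_ids within the collection.
    """
    boosted = dict(run)
    for doc_e, rsv_e in run.items():
        for cited in citations.get(doc_e, ()):
            # Assumption: unretrieved cited documents start from a score of 0
            boosted[cited] = boosted.get(cited, 0.0) + alpha * rsv_e
    return sorted(boosted.items(), key=lambda pair: pair[1], reverse=True)
```

A document cited by several highly ranked results accumulates several boosts, which is how it can overtake documents that matched the query text better.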

Results

• With α = 0.1, slight improvement:
o +10% for R-Prec
o +20% for infNDCG

• In TREC Chem 2010 Prior Art task, +150% for MAP

Conclusions

“what is important is to have fought well”

Conclusions

• A lot of strategies, but not much better than Terrier baseline

• Section indexing: never again

• MeSH not complementary… Better when inferred by a k-NN?

• Relevant docs talk about test, diag and treatment altogether.

• Maybe we have to start working from the baseline run…