BiTeM / SIBTex @ TREC CDS 2014

Full-texts representation with MeSH, co-citations network reranking

BiTeM/SIBtex group

J Gobeill (me), A Gaudinat, E Pasche and P Ruch

University of Applied Sciences, Swiss Institute of Bioinformatics, Hospitals and University of Geneva


Page 2: BiTeM / SIBTex @ TREC CDS 2014

The BiTeM / SIBtex group

• Text Mining and Bibliomics (P Ruch)
o Strong focus on clinical and biological data
o heg (training librarians) and SIB (assisting biocurators)

• Long history of participation in TREC campaigns
o Genomics, Chemical IR, Medical Records…

• Translational medicine projects (EU FP7 Programme)
o Khresmoi: multimodal medical search engine
o MD-Paedigree: retrieval of similar cases for clinicians

Page 3: BiTeM / SIBTex @ TREC CDS 2014

The CDS Track 2014

• Clinical Decision Support: « retrieval of biomedical articles relevant for answering generic clinical questions about medical records »

Ex. query: « 25-year-old woman with fatigue, hair loss, weight gain, and cold intolerance for 6 months »

Collection: subset of PubMed Central

Page 4: BiTeM / SIBTex @ TREC CDS 2014

Strategies for TREC CDS 2014

Document Representation

1. Classical document representation with text

2. Document representation with MeSH

3. Target-specific semantic enrichment with MeSH

Reranking

4. Boosting based on article types

5. Exploitation of the co-citations network

IR performed with Okapi BM25
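To make the weighting scheme concrete, here is a minimal sketch of Okapi BM25 scoring. It is not the search engine used for the runs: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative); an assumed variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection (hypothetical), queried with words from the example topic.
docs = [
    "hypothyroidism causes fatigue weight gain and cold intolerance".split(),
    "hair loss in young women a review".split(),
    "knee surgery outcomes".split(),
]
query = "fatigue hair loss weight gain cold intolerance".split()
scores = bm25_scores(query, docs)
```

The first document matches the most query terms and gets the highest score; the third shares no term with the query and scores zero.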

Page 5: BiTeM / SIBTex @ TREC CDS 2014

BiTeM official results

[Figure: official results; the label "our baseline" appears twice]

Page 6: BiTeM / SIBTex @ TREC CDS 2014

Creating a baseline

1. Classical document representation with text

[Diagram: Text index, Search engine]

Page 7: BiTeM / SIBTex @ TREC CDS 2014

1. Classical document representation with text

• Two different indexing levels:
o Document
o Section
Run 2 vs run 4: document > section (+65%)

• Query representation (R-Prec):
o Numbers removed (no age)
o Only description: 0.169
o Only summaries: 0.170
o Both: 0.185 (+10%)
Signal/noise ratio: better with more information

[Diagram: Document, Sections]
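Since R-Prec is the measure quoted throughout these slides, a short sketch of how it is computed may help: precision at rank R, where R is the number of documents judged relevant for the topic. The function name and the toy run and qrel below are hypothetical.

```python
def r_precision(ranked_doc_ids, relevant_doc_ids):
    """R-Precision: precision at rank R, with R the number of relevant documents."""
    relevant = set(relevant_doc_ids)
    r = len(relevant)
    if r == 0:
        return 0.0
    top_r = ranked_doc_ids[:r]
    return sum(1 for doc_id in top_r if doc_id in relevant) / r

# Toy example: 3 relevant documents, 2 of them ranked in the top 3.
run = ["d3", "d7", "d1", "d9", "d4"]
qrel = {"d3", "d1", "d8"}
score = r_precision(run, qrel)
```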

Page 8: BiTeM / SIBTex @ TREC CDS 2014

Creating a complementary view

2. Document representation with MeSH

[Diagram: MeSH index, Search engine]

MeSH for PMC 2649306:
D008569 Memory Disorders
D001921 Brain
D001284 Atrophy
D001706 Biopsy
D005911 Gliosis

Page 9: BiTeM / SIBTex @ TREC CDS 2014

2. Document representation with MeSH

• Two possible sources:
o Collected from MEDLINE when there is a PMID
o Extracted from documents with a categorizer (strict mapping)

• Two possible integrations between original text and MeSH:
o Building separate indexes then combining runs
o Merging both representations into one unique document
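The two integration options can be sketched as follows. The slides do not say how the runs were combined, so the weighted sum in `combine_runs`, its `weight` value and both function names are assumptions made for illustration only.

```python
def merge_representations(text_tokens, mesh_ids):
    """Merging: one unique document holding both the words and the MeSH identifiers."""
    return list(text_tokens) + list(mesh_ids)

def combine_runs(text_run, mesh_run, weight=0.8):
    """Separate indexes: fuse two runs (doc_id -> score) by weighted sum (assumed scheme)."""
    doc_ids = set(text_run) | set(mesh_run)
    fused = {
        d: weight * text_run.get(d, 0.0) + (1 - weight) * mesh_run.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Toy runs (hypothetical scores).
text_run = {"PMC1": 2.0, "PMC2": 1.0}
mesh_run = {"PMC2": 3.0, "PMC3": 1.0}
fused = combine_runs(text_run, mesh_run)
merged = merge_representations(["memory", "loss"], ["D008569"])
```

In practice the scores of two different indexes are not directly comparable, so some normalization would probably be needed before summing them.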

Page 10: BiTeM / SIBTex @ TREC CDS 2014

Example of MeSH mapping

<topic number="8"><summary>62-year-old man with progressive memory loss and involuntary leg movements. Brain MRI reveals cortical atrophy, and cortical biopsy shows vacuolar gray matter changes with reactive astrocytosis.</summary>

MeSH concepts found:
D008568 Memory
D008569 Memory Disorders
D007866 Leg
D009068 Movement
D001921 Brain
D001284 Atrophy
D001706 Biopsy
D005911 Gliosis

Some good (power of synonyms)

Some broad

Some missing (too ambiguous)

D013035: Muscular Spasm ?

D002540: Cerebral Cortex ?

D008279: MH = Magnetic Resonance Imaging ? Medical Research Institute ? Moderate Renal Insufficiency ?
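A strict mapping of this kind can be approximated by a plain dictionary lookup over word n-grams. The sketch below is not the categorizer used for the runs: the lexicon is a toy one built from the identifiers on this slide, and the surface forms marked as assumed are guesses at which synonym or variant triggered each concept.

```python
import re

# Toy lexicon (assumption): lowercase surface forms mapped to MeSH identifiers.
LEXICON = {
    "memory": "D008568",
    "memory loss": "D008569",    # assumed synonym of Memory Disorders
    "leg": "D007866",
    "movements": "D009068",      # assumed plural variant of Movement
    "brain": "D001921",
    "atrophy": "D001284",
    "biopsy": "D001706",
    "astrocytosis": "D005911",   # assumed synonym of Gliosis
}

def strict_map(text, lexicon=LEXICON, max_len=3):
    """Return MeSH ids whose surface form appears verbatim in the text, in order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = []
    for i in range(len(tokens)):
        # Try every n-gram starting at position i, up to max_len words.
        for n in range(1, max_len + 1):
            mesh_id = lexicon.get(" ".join(tokens[i:i + n]))
            if mesh_id and mesh_id not in found:
                found.append(mesh_id)
    return found

summary = ("62-year-old man with progressive memory loss and involuntary leg "
           "movements. Brain MRI reveals cortical atrophy, and cortical biopsy "
           "shows vacuolar gray matter changes with reactive astrocytosis.")
concepts = strict_map(summary)
```

With no entry for "cortical" or for the acronym "MRI", those concepts stay unmapped, which mirrors the missing cases listed on the slide.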

Page 11: BiTeM / SIBTex @ TREC CDS 2014

Top 15 MeSH in benchmark

MEDLINE MeSH in docs       Extracted MeSH in docs          Extracted MeSH in topics
Humans                     Cells                           Women
Animals                    Ficus (because of «fig»)        History
Female                     Patients                        Pain
Male                       Time                            Blood
Adult                      Genes                           Physical Examination
Middle Aged                Therapeutics                    Female
Mice                       Methods                         Blood Pressure
Aged                       Role                            Pressure
Adolescent                 Humans                          Dyspnea
Molecular Sequence Data    Disease                         Family
Rats                       Volition                        Thorax
Young Adult                Mice                            Urine
Time Factors               Attention                       Fever
Child                      DNA                             Male
Signal Transduction        Population                      Emergencies

Page 12: BiTeM / SIBTex @ TREC CDS 2014

Results for MeSH representation

• Best R-Prec 0.143 for MeSH representation (vs 0.211 for text)
o MeSH concepts collected from MEDLINE not useful (best R-Prec 0.028)
o Only 53% of documents had MeSH terms in MEDLINE

• Complementarity for finding relevant documents (thanks to qrel):
o Low complementarity
o Combination: 0.211 -> 0.213

Page 13: BiTeM / SIBTex @ TREC CDS 2014

Favoring target types

3. Target-specific semantic enrichment with MeSH

Do relevant documents for diagnosis deal more with diagnosis ?

MeSH for PMC 2649306:
D008569 Memory Disorders
D001921 Brain
D001284 Atrophy
D005911 Gliosis
D001706 Biopsy

[Diagram labels: MeSHtargetDiagnosis, MeSHtargetDiagnosis, MeSHtargetTest]

Page 14: BiTeM / SIBTex @ TREC CDS 2014

3. Target-specific semantic enrichment with MeSH

• In UMLS, each MeSH term has Semantic Types (ex: T060 Diagnostic Procedure)

Focus on targets (diagnosis, treatments and tests)

• Specific words (ex: «MeSHtargetDiag») are added in docs and queries

Target       % docs that have at least 1    Average number in documents
Test         83 %                           16
Diagnosis    86 %                           41
Treatment    86 %                           24

Small improvement only for section indexing
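The enrichment step can be pictured as appending one marker word per target-bearing MeSH term. In the sketch below only T060 (Diagnostic Procedure) comes from the slides; the other semantic types, the type-to-target mapping, the token name `MeSHtargetTreatment`, the toy MeSH-to-type lookup and the function name `enrich` are all assumptions for illustration.

```python
# Assumed mapping from UMLS semantic type to target token.
TYPE_TO_TOKEN = {
    "T060": "MeSHtargetTest",        # Diagnostic Procedure
    "T047": "MeSHtargetDiagnosis",   # Disease or Syndrome
    "T061": "MeSHtargetTreatment",   # Therapeutic or Preventive Procedure
}

# Toy lookup (assumption): MeSH id -> semantic types, normally read from UMLS.
MESH_TO_TYPES = {
    "D001706": ["T060"],  # Biopsy
    "D001284": ["T047"],  # Atrophy; type chosen for illustration only
}

def enrich(tokens, mesh_ids):
    """Append one target token for each MeSH term that carries a target semantic type."""
    extra = []
    for mesh_id in mesh_ids:
        for sem_type in MESH_TO_TYPES.get(mesh_id, []):
            token = TYPE_TO_TOKEN.get(sem_type)
            if token:
                extra.append(token)
    return list(tokens) + extra

# D001921 (Brain) has no target type in the toy lookup, so it adds nothing.
doc = enrich(["cortical", "biopsy"], ["D001706", "D001284", "D001921"])
```

Applying the same function to queries lets the added tokens match at retrieval time like any other word.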

Page 15: BiTeM / SIBTex @ TREC CDS 2014

In the qrel…

Set                                                Aver. Diagnosis MeSH    Aver. Test MeSH    Aver. Treatment MeSH
All collection                                     41                      16                 24
Relevant for diagnosis (1|2 for queries 1..10)     108                     41                 41
Relevant for test (1|2 for queries 11..20)         107                     41                 33
Relevant for treatment (1|2 for queries 21..30)    114                     47                 52

All relevant documents:
o Are quite similar, with no distinction between targets
o But have 2 to 3 times more target MeSH terms
o ... but it’s also the case for documents with 0 in the qrel

Page 16: BiTeM / SIBTex @ TREC CDS 2014

4. Boosting based on article types

Promoting some article types

Are some article types more likely to be relevant ?

Page 17: BiTeM / SIBTex @ TREC CDS 2014

4. Boosting based on article types

Article type distribution (Top 5)

Article type        in docs    in qrel    in our runs
research-article    74.3 %     52.2 %     37.9 %
case-report         4.0 %      20.4 %     41.5 %
review-article      6.9 %      17.9 %     10.9 %
Other               2.6 %      3.2 %      3.6 %
brief-report        1.1 %      1.5 %      0.9 %

• Strategy: to promote review and case-based articles (boosting)

• Intuition was good… but the strategy failed !

• In reality… the IR engine already promoted these types !
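A boosting strategy of this kind might look like the sketch below. The slides do not give the boosting formula, so the multiplicative `factor`, its value and the function name `boost_by_type` are assumptions; only the article type labels come from the table.

```python
BOOSTED_TYPES = {"review-article", "case-report"}

def boost_by_type(run, article_types, factor=1.2):
    """Multiply the score of boosted article types, then re-sort the run."""
    boosted = [
        (doc_id, score * factor if article_types.get(doc_id) in BOOSTED_TYPES else score)
        for doc_id, score in run
    ]
    return sorted(boosted, key=lambda item: item[1], reverse=True)

# Toy run (hypothetical scores and types).
run = [("PMC1", 10.0), ("PMC2", 9.0), ("PMC3", 5.0)]
types = {"PMC1": "research-article", "PMC2": "case-report", "PMC3": "review-article"}
reranked = boost_by_type(run, types)
```

Here the case report overtakes the research article, which only helps if such types are under-represented at the top of the baseline ranking.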

Page 18: BiTeM / SIBTex @ TREC CDS 2014

5. Exploitation of the co-citations network

Promoting citations

Are citations of retrieved documents relevant ?

Page 19: BiTeM / SIBTex @ TREC CDS 2014

5. Exploitation of the co-citations network

• E is the set of retrieved documents

• RSV_e is the Retrieval Status Value of doc_e

• We boost each citation of doc_e by + α × RSV_e

• 50% of documents cite another one in the collection (avg 3.8 citations)
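The reranking rule above can be sketched in a few lines: every document cited by a retrieved document receives α times the Retrieval Status Value of the citing document. Whether cited documents outside the retrieved set are added to the ranking is not stated in the slides; letting them enter with a starting score of 0 is an assumption, as are the function name and the toy data.

```python
def cocitation_rerank(rsv, citations, alpha=0.1):
    """rsv: doc_id -> RSV for the retrieved set E; citations: doc_id -> cited doc_ids."""
    boosted = dict(rsv)
    for doc_e, rsv_e in rsv.items():
        for cited in citations.get(doc_e, []):
            # Cited documents outside E enter with a score of 0 (assumption).
            boosted[cited] = boosted.get(cited, 0.0) + alpha * rsv_e
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)

# Toy example: C is cited by both A and B, D only by B.
rsv = {"A": 10.0, "B": 9.5, "C": 9.0}
citations = {"A": ["C"], "B": ["C", "D"]}
ranking = cocitation_rerank(rsv, citations)
```

Boosts are computed from the original RSVs only, so the result does not depend on the order in which documents are visited.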

Page 20: BiTeM / SIBTex @ TREC CDS 2014

Results

• With α = 0.1, slight improvement:
o +10% for R-Prec
o +20% for infNDCG

• In TREC Chem 2010 Prior Art task, +150% for MAP

Page 21: BiTeM / SIBTex @ TREC CDS 2014

Conclusions

“what is important is to have fought well”

Page 22: BiTeM / SIBTex @ TREC CDS 2014

Conclusions

• A lot of strategies, but not much better than Terrier baseline

• Section indexing: never again

• MeSH not complementary… Better when inferred by a k-NN ?

• Relevant docs talk about test, diag and treatment all together.

• Maybe we have to start working from the baseline run…