Enhancing Biomedical Text Rankers by Term Proximity Information

Rey-Long Liu (劉瑞瓏), Department of Medical Informatics, Tzu Chi University

2012/06/13

Outline

• Background
– Text ranking
– Biomedical information needs

• An approach to enhancing text rankers in the biomedical domain

• Evaluation

• Conclusion

Research Background

Text Ranking

• Goal
– Given a query q and a set T of texts retrieved for q, rank the texts in T according to their degree of relevance to q

• Motivation
– Reducing information overload, since T is often very large even when a smart search engine is used
– Text ranking is a key issue in information retrieval, and often a “secret” component of search engines

An Example Ranker

Biomedical Information Need

• Biomedical research requires relevant evidence from the huge and ever-growing biomedical literature

• Retrieval of the evidence requires a system that
– Accepts a natural language query for a biomedical information need, and
– Ranks relevant texts higher for access or processing

An Example

• Query: urinary tract infection, criteria for treatment and admission (from OHSUMED)
– A disease as the target concept (i.e., urinary tract infection)
– Two concepts about the scenario of the information need (i.e., treatment and admission)
  • Neither is specific to, nor related to, any particular disease

Contextual Completeness

• Biomedical queries need to be well-formed, and so call for a retrieval system that considers the contextual completeness of each query concept t in the text d
– Contextual completeness of t in d is the extent to which the query concepts other than t appear in nearby areas in d

An Example

• In children with an acute febrile illness, what is the efficacy of single medication therapy with acetaminophen or ibuprofen in reducing fever?

[From Lin & Demner-Fushman, 2006]

[Figure labels: PICO, Task, Answer, Strength]

An Approach to Improving Rankers for Biomedical Info Needs

Goals

• An approach, PRE (Proximity-based Ranker Enhancer), that
– Measures contextual completeness of query concepts appearing in a nearby area in the text
– Serves as a supplement to improve existing rankers

Contrast with Related Work

• Biomedical text ranking
– Using synonyms and considering diversity of passages, without considering term proximity

• Text ranking
– Individual text scoring techniques (e.g., BM25) and learning-to-rank techniques (e.g., Ranking SVM), without considering term proximity

• Improving ranking by term proximity
– Term proximity is employed, but contextual completeness is not considered

System Overview

[Figure: system overview. Diagram labels: Text Ranker Development (Training), Text Ranking (Testing), Underlying Ranker, PRE, TF (Term Frequency) Assessment, TF in d, User, Query (q), Text (d), Training Data, Ranked Texts]
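One reading of the overview labels is that PRE's TF assessment supplies a term frequency for each query term, and the underlying ranker consumes that value in place of the raw count. The sketch below illustrates this reading with BM25 as the underlying ranker. It is an assumption-laden illustration, not the presented system: the function name `bm25_score`, its parameters, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all choices made here.

```python
import math


def bm25_score(query_terms, tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Score one text with BM25, reading term frequencies from `tf`.

    `tf` maps term -> frequency and may hold raw counts or revised
    (possibly fractional) frequencies from a TF-assessment step.
    `df` maps term -> number of texts containing the term.
    All names and defaults are assumptions for illustration.
    """
    score = 0.0
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    for term in set(query_terms):
        f = tf.get(term, 0.0)
        if f <= 0 or term not in df:
            continue
        # Smoothed IDF variant that stays positive for frequent terms.
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * f * (k1 + 1) / (f + length_norm)
    return score


# Hypothetical numbers: the same text scored with a raw and a revised TF.
stats = dict(doc_len=50, avg_doc_len=50, df={"infection": 10}, n_docs=100)
raw_score = bm25_score(["infection"], {"infection": 1.0}, **stats)
revised_score = bm25_score(["infection"], {"infection": 2.1}, **stats)
```

Under this reading the ranker itself is untouched; only the frequency it is given changes, which is what lets an enhancer act as a supplement to different rankers.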

TF Assessment

• Three types of term proximity
– Overall proximity (QtermTF)
– Individual proximity (IndiP)
– Collective proximity (CollP)

• A term t may get a large TF increment in d if
– Many query terms appear frequently in d,
– Query terms are individually near t at some places, and
– Query terms collectively appear at a place near t

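• $\mathrm{RTF}(t,d,q) = \mathrm{TF}(t,d) + \mathrm{TFincrement}(t,d,q)$
• $\mathrm{TFincrement}(t,d,q) = \mathrm{QtermTF}(d,q) \times \mathrm{IndiP}(t,d,q) \times \mathrm{CollP}(t,d,q)$
• $\mathrm{QtermTF}(d,q)$ = total TF of query terms in d
• $\mathrm{IndiP}(t,d,q) = \sum_{m \in M - \{t\}} \mathrm{SigmoidWeight}(\mathrm{Mindist}(t,m)) \,/\, \mathrm{MaxIndiP}$
• $\mathrm{Mindist}(x,y)$ = shortest distance between x and y in d
• $\mathrm{SigmoidWeight}(dt) = 1 / (1 + e^{-((|q|-1)-dt)})$
• $\mathrm{CollP}(t,d,q) = \max_{k \in K} \{ \sum_{m \in M - \{t\}} \mathrm{SigmoidWeight}(\mathrm{dist}(t,k,m)) \} \,/\, \mathrm{MaxCollP}$, where K is the set of positions at which t appears in d
• $\mathrm{dist}(t,k,m)$ = distance between t (at position k) and m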
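The definitions above can be sketched in Python as follows. The slide leaves several details open, so this is a minimal sketch under stated assumptions rather than the presented implementation: M is taken to be the set of distinct query terms, a query term absent from d contributes 0, distances are differences between token positions, dist(t,k,m) is measured to the nearest occurrence of m, |q| is the number of distinct query terms, the three factors of TFincrement are multiplied, and MaxIndiP and MaxCollP both default to |M| - 1. The function names are hypothetical.

```python
import math


def sigmoid_weight(dist, query_len):
    """SigmoidWeight(dt) = 1 / (1 + e^{-((|q| - 1) - dt)})."""
    return 1.0 / (1.0 + math.exp(-((query_len - 1) - dist)))


def revised_tf(t, doc_tokens, query_terms, max_indip=None, max_collp=None):
    """Return RTF(t, d, q) = TF(t, d) + QtermTF * IndiP * CollP."""
    # Token positions of every term in d.
    positions = {}
    for i, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(i)

    m_set = set(query_terms)              # assumption: M = distinct query terms
    k_positions = positions.get(t, [])    # K: positions at which t appears
    tf = len(k_positions)
    others = [m for m in m_set if m != t]
    if tf == 0 or not others:
        return float(tf)                  # no increment is possible

    n_q = len(m_set)                      # assumption: |q| counts distinct terms
    norm_i = max_indip if max_indip is not None else len(others)
    norm_c = max_collp if max_collp is not None else len(others)

    # QtermTF: total TF of query terms in d.
    qterm_tf = sum(len(positions.get(m, [])) for m in m_set)

    def nearest(k, m):
        """Distance from position k to the nearest occurrence of m."""
        return min(abs(k - j) for j in positions[m])

    present = [m for m in others if m in positions]  # absent terms add 0

    # IndiP: each other query term is weighted by its shortest distance to t.
    indip = sum(
        sigmoid_weight(min(nearest(k, m) for k in k_positions), n_q)
        for m in present
    ) / norm_i

    # CollP: the best single occurrence of t, judged by all other terms at once.
    collp = max(
        sum(sigmoid_weight(nearest(k, m), n_q) for m in present)
        for k in k_positions
    ) / norm_c

    return tf + qterm_tf * indip * collp


query = ["infection", "treatment", "admission"]
near = ["infection", "treatment", "admission"]
far = ["infection"] + ["x"] * 20 + ["treatment"] + ["x"] * 20 + ["admission"]
rtf_near = revised_tf("infection", near, query)
rtf_far = revised_tf("infection", far, query)
```

In this toy case both texts contain each query term once, but the text in which the terms are adjacent gives "infection" a visibly larger revised TF, while the text in which they are 21 or more tokens apart leaves it almost at the raw TF of 1.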

Empirical Evaluation

Experimental Data

• OHSUMED
– A popular database of biomedical queries and references
– 106 queries
– 348,566 references
– 16,140 query-reference pairs
  • Definitely relevant
  • Possibly relevant
  • Not relevant

• TREC Genomics 2006
– 28 queries (topics) and 27,999 query-passage pairs
  • Definitely relevant, possibly relevant, and not relevant
– 13,993 query-reference pairs

• TREC Genomics 2007
– 36 queries and 35,996 query-passage pairs
  • Relevant and not relevant
– 22,913 query-reference pairs

Underlying Rankers

Baseline Ranker Enhancers

• Three state-of-the-art techniques that enhance text rankers by term proximity
– The t-function: t() [Tao & Zhai, 2007]
– The p-function: p() [Cummins & O’Riordan, 2009]
– The proximity language model: PLM [Zhao & Yun, 2009]

Evaluation Criteria

• Evaluating whether relevant references are ranked higher for users to access
– Mean average precision (MAP)
– Normalized discounted cumulative gain at X (NDCG@X)
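For concreteness, the two criteria can be computed as in the sketch below. The slide does not state which variants were used, so several choices here are assumptions: relevance grades are integers (for example 2 = definitely relevant, 1 = possibly relevant, 0 = not relevant), any grade above 0 counts as relevant for MAP, each ranked list contains all judged texts for its query, and NDCG uses the common gain (2^rel - 1) with a log2(rank + 1) discount.

```python
import math


def average_precision(ranked_rels):
    """Average precision for one query; `ranked_rels` lists grades in rank order."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel > 0:                 # assumption: any positive grade is relevant
            hits += 1
            total += hits / rank    # precision at this relevant text
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)


def dcg_at(ranked_rels, x):
    """Discounted cumulative gain over the top x positions."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(ranked_rels[:x], start=1)
    )


def ndcg_at(ranked_rels, x):
    """NDCG@X: DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at(sorted(ranked_rels, reverse=True), x)
    return dcg_at(ranked_rels, x) / ideal if ideal > 0 else 0.0
```

MAP rewards placing every relevant text early, whereas NDCG@X looks only at the top X positions and gives more credit to higher relevance grades, so the two can disagree when a ranker improves only the top of the list.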

Results

Conclusion

• Contextual completeness of query concepts in the texts is essential in ranking biomedical texts

• To measure contextual completeness, it is helpful to integrate three types of term proximity
– Overall proximity
– Individual proximity
– Collective proximity

• Existing rankers may be comprehensively enhanced

Thank You!