Evaluation in Information Retrieval
Speaker: Ruihua Song
Web Data Management Group, MSR Asia
Outline
Basics of IR evaluation
Introduction to TREC (Text Retrieval Conference)
One selected paper: Select-the-Best-Ones: A new way to judge relative relevance
Motivating Examples
Which set is better?
S1={r, r, r, n, n} vs. S2={r, r, n, n, n}
S3={r} vs. S4={r, r, n}
Which ranking list is better?
L1=<r, r, r, n, n> vs. L2=<n, n, r, r, r>
L3=<r, n, r, n, h> vs. L4=<h, n, n, r, r>
(r: relevant, n: non-relevant, h: highly relevant)
Precision & Recall Precision is fraction of the retrieved document
which is relevant
Recall is fraction of the relevant document which has been retrieved
Let R be the set of relevant documents, A the answer set of retrieved documents, and Ra = R ∩ A.
Precision = |Ra| / |A|
Recall = |Ra| / |R|
Precision & Recall (cont.) Assume there are 10 relevant documents in judgments Example 1: S1={r, r, r, n, n} vs. S2={r, r, n, n, n}
P1= 3/5 = 0.6; R1= 3/10 = 0.3 P2= 2/5 = 0.4; R2= 2/10 = 0.2 S1 > S2
Example 2: S3={r} vs. S4={r, r, n} P3= 1/1 = 1; R3= 1/10 = 0.1 P4= 2/3 = 0.667; R4= 2/10 = 0.2 ? (F1-Measure)
Example 3: L1=<r, r, r, n, n> vs. L2=<n, n, r, r, r> ?
r: relevant n: non-relevant h: highly relevant
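A minimal Python sketch of these computations (the helper names precision_recall and f1 are illustrative; the assumption of 10 relevant documents comes from the examples above):

```python
def precision_recall(results, total_relevant):
    """results: list of 'r'/'n' labels for the retrieved documents."""
    hits = sum(1 for label in results if label == 'r')
    return hits / len(results), hits / total_relevant

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Example 2: S3 = {r} vs. S4 = {r, r, n}, assuming 10 relevant documents exist
p3, r3 = precision_recall(['r'], 10)             # P3 = 1.0,   R3 = 0.1
p4, r4 = precision_recall(['r', 'r', 'n'], 10)   # P4 = 0.667, R4 = 0.2
print(f1(p3, r3), f1(p4, r4))                    # ~0.18 vs. ~0.31, so S4 wins on F1
```

On Example 2, F1 prefers S4, resolving the tie that precision and recall alone leave open.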
Mean Average Precision (MAP)
Defined as the mean of Average Precision (AP) over a set of queries.
Example 3: L1=<r, r, r, n, n> vs. L2=<n, n, r, r, r>
AP1 = (1/1 + 2/2 + 3/3)/10 = 0.3
AP2 = (1/3 + 2/4 + 3/5)/10 = 0.143
L1 > L2
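A short sketch of Average Precision in the same spirit, dividing by the total number of relevant documents (10 here) as the slide does; MAP would simply average this value over all queries:

```python
def average_precision(ranking, total_relevant):
    """ranking: list of 'r'/'n' labels in rank order."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranking, start=1):
        if label == 'r':
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    return precision_sum / total_relevant  # MAP averages this over all queries

print(average_precision(list('rrrnn'), 10))  # L1: 0.3
print(average_precision(list('nnrrr'), 10))  # L2: ~0.143
```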
Other Metrics based on Binary Judgments
P@10 (Precision at 10) is the fraction of the top 10 documents in the ranked list returned for a topic that are relevant.
e.g., if 3 of the top 10 retrieved documents are relevant, P@10 = 0.3
MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank (RR) over a set of queries.
RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic.
e.g., if the first relevant document is ranked No. 4, RR = 1/4 = 0.25
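A sketch of P@k and Reciprocal Rank, assuming the same 'r'/'n' label lists as above (function names are illustrative):

```python
def precision_at_k(ranking, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for label in ranking[:k] if label == 'r') / k

def reciprocal_rank(ranking):
    """1 / rank of the first relevant document (0 if none is retrieved)."""
    for rank, label in enumerate(ranking, start=1):
        if label == 'r':
            return 1.0 / rank
    return 0.0

# e.g., the first relevant document appears at rank 4
print(reciprocal_rank(['n', 'n', 'n', 'r', 'n']))  # 0.25
# MRR is the mean of these reciprocal ranks over a set of queries
```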
Metrics based on Graded Relevance
Example 4: L3=<r, n, r, n, h> vs. L4=<h, n, n, r, r>
(r: relevant, n: non-relevant, h: highly relevant)
Which ranking list is better?
Cumulated Gains based metrics: CG, DCG, and nDCG
Two assumptions about the ranked result list:
Highly relevant documents are more valuable.
The greater the ranked position of a relevant document, the less valuable it is for the user.
CG: Cumulated Gains
From graded-relevance judgments to gain vectors
Example 4: L3=<r, n, r, n, h> vs. L4=<h, n, n, r, r>
G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1>
CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>
DCG: Discounted Cumulated Gains
Discounting function: the gain at rank i is divided by log2(i) for i >= 2 (no discount at rank 1).
Example 4: L3=<r, n, r, n, h> vs. L4=<h, n, n, r, r>
G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1>
DG3 = <1, 0, 0.63, 0, 0.86>, DG4 = <2, 0, 0, 0.5, 0.43>
DCG3 = <1, 1, 1.63, 1.63, 2.49>, DCG4 = <2, 2, 2, 2.5, 2.93>
CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>
nDCG: Normalized Discounted Cumulated Gains
Ideal (D)CG vector
Example 4: L3=<r, n, r, n, h> vs. L4=<h, n, n, r, r>
Lideal = <h, r, r, n, n>
Gideal = <2, 1, 1, 0, 0>; CGideal = <2, 3, 4, 4, 4>
DGideal = <2, 1, 0.63, 0, 0>; DCGideal = <2, 3, 3.63, 3.63, 3.63>
nDCG: Normalized Discounted Cumulated Gains (cont.)
Normalized (D)CG: each DCG value is divided by the ideal DCG at the same rank.
Example 4: L3=<r, n, r, n, h> vs. L4=<h, n, n, r, r>
DCGideal = <2, 3, 3.63, 3.63, 3.63>
nDCG3 = <1/2, 1/3, 1.63/3.63, 1.63/3.63, 2.49/3.63> = <0.5, 0.33, 0.45, 0.45, 0.69>
nDCG4 = <2/2, 2/3, 2/3.63, 2.5/3.63, 2.93/3.63> = <1, 0.67, 0.55, 0.69, 0.81>
L3 < L4
Something Important
Dealing with small data sets: cross validation
Significance testing: paired, two-tailed t-test
Is Green really worse than Yellow? Is the difference significant, or just caused by chance?
[Figure: score distributions p(score) of the two systems]
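A hedged sketch of the paired, two-tailed t-test using SciPy; the per-query scores below are made-up values purely for illustration:

```python
from scipy import stats

# Per-query scores of two systems; these numbers are made up for illustration only
green_scores  = [0.31, 0.42, 0.18, 0.55, 0.27, 0.49, 0.36, 0.22, 0.41, 0.33]
yellow_scores = [0.35, 0.44, 0.21, 0.58, 0.25, 0.53, 0.40, 0.24, 0.45, 0.37]

t_stat, p_value = stats.ttest_rel(green_scores, yellow_scores)  # two-tailed by default
print(t_stat, p_value)  # p < 0.05 suggests the difference is not just due to chance
```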
Any questions?
BY RUIHUA SONG, WEB DATA MANAGEMENT GROUP, MSR ASIA
MARCH 30, 2010
Introduction of TREC
Text Retrieval Conference
Homepage: http://trec.nist.gov/
Goals:
To encourage retrieval research based on large test collections
To increase communication among industry, academia, and government
To speed the transfer of technology from research labs into commercial products
To increase the availability of appropriate evaluation techniques for use by industry and academia
Yearly Cycle of TREC
The TREC Tracks
TREC 2009
Tracks: Blog track, Chemical IR track, Entity track, Legal track, “Million Query” track, Relevance Feedback track, Web track
Participants: 67 groups representing 19 different countries
TREC 2010
Schedule:
By Feb 18: submit your application to participate in TREC 2010
Beginning March 2
Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md., USA
What's new:
Session track
To test whether systems can improve their performance for a given query by using a previous query
To evaluate system performance over an entire query session instead of a single query
Track web page: http://ir.cis.udel.edu/sessions
Why TREC
To obtain public data sets (the most frequently used in IR papers); pooling makes judgments unbiased for participants
To exchange ideas in emerging areas: a strong Program Committee, a healthy comparison of approaches
To influence evaluation methodologies by feedback or proposals
TREC 2009 Program Committee
Ellen Voorhees (chair), James Allan, Chris Buckley, Gord Cormack, Sue Dumais, Donna Harman, Bill Hersh, David Lewis, Doug Oard, John Prager, Stephen Robertson, Mark Sanderson, Ian Soboroff, Richard Tong
Any questions?
SELECT-THE-BEST-ONES: A NEW WAY TO JUDGE RELATIVE RELEVANCE
Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon
Information Processing and Management, 2010
ABSOLUTE RELEVANCE JUDGMENTS
RELATIVE RELEVANCE JUDGMENTS
Problem formulation
Connections between Absolute and Relative
Absolute judgments (A) can be transformed to relative judgments (R).
R can be transformed to A, if the assessors assign a relevance grade to each set.
QUICK-SORT: A PAIRWISE STRATEGY
[Diagram of the quick-sort pairwise strategy: documents are compared against a pivot (P) and partitioned into Better (B), Same (S), and Worse (W); the partitions are then judged recursively.]
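A rough Python sketch of how a quick-sort style pairwise judging strategy could be organized; this illustrates the general idea rather than the paper's exact procedure, and the compare callback (returning >0 for Better, 0 for Same, <0 for Worse) is a hypothetical interface standing in for an assessor:

```python
def quick_sort_judge(docs, compare):
    """Partition documents against a pivot using one assessor comparison
    per document, then recurse; returns tiers ordered best-first."""
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]
    better, same, worse = [], [pivot], []
    for d in rest:
        verdict = compare(d, pivot)   # >0: Better (B), 0: Same (S), <0: Worse (W)
        if verdict > 0:
            better.append(d)
        elif verdict == 0:
            same.append(d)
        else:
            worse.append(d)
    return quick_sort_judge(better, compare) + [same] + quick_sort_judge(worse, compare)
```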
SELECT-THE-BEST-ONES: A PROPOSED NEW STRATEGY
[Diagram of the Select-the-Best-Ones strategy: from the pool of candidate documents (P), assessors repeatedly select the best ones (B), producing ordered tiers.]
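Similarly, a rough sketch of the Select-the-Best-Ones idea; select_best is a hypothetical callback standing in for the assessor who picks the best documents from the current pool, and the resulting tiers induce the relative ordering:

```python
def select_the_best_ones(docs, select_best):
    """Repeatedly ask the assessor to pick the best documents from the
    remaining pool; each picked set becomes the next relevance tier."""
    pool, tiers = list(docs), []
    while pool:
        best = select_best(pool)                   # assessor selects the best ones
        if not best:                               # guard against an empty selection
            tiers.append(pool)
            break
        tiers.append(best)
        pool = [d for d in pool if d not in best]  # judge the rest in the next round
    return tiers                                   # tiers ordered best-first
```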
USER STUDY
Experiment Design
A Latin Square design is used to minimize possible practice effects and order effects:
Each tool has been used to judge all three query sets.
Each query has been judged by three subjects.
Each subject has used every tool and judged every query, but no query is judged by the same subject with two different tools.
USER STUDY (cont.)
Experiment Design
30 Chinese queries are divided into three balanced sets, covering both popular queries and long-tail queries.
SCENE OF USER STUDY
BASIC EVALUATION RESULTS
Efficiency, majority agreement, and discriminative power

Table 2. Basic metrics comparison for the three judgment methods

Method                        | Time            | Majority Agreement (%) | Avg. # Relevance Degrees | # Untied Pairs
Five-grade Absolute Judgments | 6'38            | 97.50                  | 3.12                     | 2000
Quick-Sort Strategy           | 10'57 (+65.1%)* | 94.82 (-2.7%)*         | 4.40 (+41.0%)*           | 2585 (+29.3%)*
Select-the-Best-Ones Strategy | 5'54 (-11.1%)   | 99.31 (+1.9%)*         | 3.80 (+21.8%)*           | 2309 (+15.5%)*

Note: the t-test is conducted with regard to the baseline, i.e., Five-grade Absolute Judgments. “*” denotes that the difference is statistically significant (p-value < 0.05).
FURTHER ANALYSIS ON DISCRIMINATIVE POWER
Three grades, ‘Excellent’, ‘Good’, and ‘Fair’, are split, while ‘Perfect’ and ‘Bad’ are not.
More queries are influenced in SBO than in QS, and the splitting is distributed more evenly in SBO.

Table 3. Detailed analysis on splitting one grade of absolute relevance judgments into more subsets of relative relevance judgments (each cell gives the average number of subsets for that grade and the percentage of queries influenced by splitting it)

Strategy             | Perfect | Excellent      | Good           | Fair           | Bad
Quick-Sort           | (1, 0)  | (1.31, 14.44%) | (1.93, 55.56%) | (1.31, 6.67%)  | (1, 0)
Select-the-Best-Ones | (1, 0)  | (1.21, 23.33%) | (1.78, 51.11%) | (1.12, 20.00%) | (1, 0)
EVALUATION EXPERIMENT ON JUDGMENT QUALITY
Collecting experts' judgments: 5 experts judged 15 Chinese queries as partial orders; they judged individually and then discussed as a group.

Experimental results
Table 4. Consistency between expert judgments and the judgments generated by the three methods for document pairs (the number of concordant/tied/discordant pairs divided by the total number of pairs)

           | Five-grade Absolute Judgments | Quick-Sort Strategy | Select-the-Best-Ones Strategy
Concordant | 0.2946                        | 0.3493              | 0.3371
Tied       | 0.6436                        | 0.4750              | 0.5638
Discordant | 0.0342                        | 0.1142              | 0.0677
DISCUSSION
Absolute relevance judgment method: fast and easy to implement, but loses some useful ordering information.
Quick-sort method: light cognitive load and scalable, but high complexity and an unstable standard.
Select-the-Best-Ones method: efficient with good discriminative power, but heavy cognitive load and not scalable.
CONCLUSION
We propose a new strategy called Select-the-Best-Ones (SBO) to address the problem of relative relevance judgment.
A user study and an evaluation experiment show that the SBO method:
Outperforms the absolute method in terms of agreement and discriminative power
Dramatically improves efficiency over the pairwise relative method (the QS strategy)
Reduces the discordant pairs by about half, compared to the QS method
Thank you!