
Page 1: ictir2016

A Simple and Effective Approach to Score Standardisation

@tetsuyasakai

http://www.f.waseda.jp/tetsuya/sakai.html

September 15@ICTIR 2016 (Newark, DE, USA)

Page 2: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 3: ictir2016

Hard topics, easy topics

[Bar chart: scores of Systems 1-5 on two topics; one topic is easy (mean = 0.70) and the other hard (mean = 0.12).]

Page 4: ictir2016

Low-variance topics, high-variance topics

[Bar chart: scores of Systems 1-5 on two topics; one topic has low score variance (standard deviation = 0.08) and the other high variance (standard deviation = 0.29).]

Page 5: ictir2016

Score standardisation [Webber+08]

The standardised score of the i-th system on the j-th topic is obtained from the raw score by subtracting the topic mean and dividing by the topic standard deviation:

z_ij = (x_ij - m·j) / s·j

where the standardising factors <m·j, s·j> are the mean and standard deviation of the raw scores of the standardising systems on topic j. The standardised score answers: how good is system i compared to the "average" system, in standard deviation units?
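To make the transformation concrete, here is a minimal sketch in Python (NumPy assumed) that standardises a topic-by-run score matrix; the variable names (raw_scores, m_j, s_j) are illustrative, not from the paper.

```python
import numpy as np

def standardise(raw_scores):
    """Standardise a topics x runs matrix of raw scores.

    Each row (topic) is transformed to have mean 0 and standard
    deviation 1 across the standardising systems (the columns).
    """
    m_j = raw_scores.mean(axis=1, keepdims=True)          # per-topic mean
    s_j = raw_scores.std(axis=1, ddof=1, keepdims=True)   # per-topic standard deviation
    return (raw_scores - m_j) / s_j

# toy example: 2 topics x 5 systems
raw = np.array([[0.75, 0.68, 0.72, 0.65, 0.70],
                [0.10, 0.15, 0.08, 0.12, 0.15]])
print(standardise(raw))
```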

Page 6: ictir2016

Now for every topic, mean = 0, variance = 1.

[Bar chart: standardised scores of Systems 1-5 on Topics 1 and 2, now ranging roughly from -2 to +2.]

Comparisons across different topic sets and test collections are possible!

Page 7: ictir2016

Standardised scores have the (-∞, ∞) range and are not very convenient.

[Same chart of standardised scores for Systems 1-5 on Topics 1 and 2 as on the previous slide.]

Transform them back into the [0,1] range!

Page 8: ictir2016

std-CDF: use the cumulative distribution function of the standard normal distribution [Webber+08]
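As a sketch of what std-CDF does (SciPy assumed; not the authors' code), each standardised score z is mapped back into [0, 1] through the standard normal CDF Φ(z):

```python
import numpy as np
from scipy.stats import norm

def std_cdf(z):
    """Map standardised scores back into [0, 1] via the standard normal CDF."""
    return norm.cdf(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(std_cdf(z))   # approx. 0.023, 0.309, 0.5, 0.691, 0.977
```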

[Scatter plot (TREC04): raw nDCG on the x-axis vs. std-CDF nDCG on the y-axis. Each curve is a topic, with 110 runs represented as dots.]

Page 9: ictir2016

std-CDF: emphasises moderately high and moderately low performers – is this a good thing?

[Same TREC04 plot of raw nDCG vs. std-CDF nDCG, with the moderately high and moderately low regions highlighted.]

Page 10: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 11: ictir2016

std-AB: How about a simple linear transformation?

[Scatter plot (TREC04): raw nDCG on the x-axis vs. transformed nDCG on the y-axis, for std-CDF nDCG, std-AB nDCG (A=0.10), and std-AB nDCG (A=0.15).]

Page 12: ictir2016

std-AB with clipping, with the range [0,1]

The transformed score is B + A·z, where z is the standardised score, clipped to the range [0,1].

Let B = 0.5 ("average" system).

Let A = 0.15 so that at least 89% of scores fall within [0.05, 0.95] (Chebyshev's inequality).

For EXTREMELY good/bad systems, B + A·z may fall outside [0,1]; such values are clipped.

This formula with (A,B) is used in educational research: A=100, B=500 for SAT and GRE scores [Lodico+10]; A=10, B=50 for Japanese hensachi "standard scores".
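A minimal sketch of the std-AB transform (Python/NumPy assumed; parameter names are illustrative):

```python
import numpy as np

def std_ab(z, A=0.15, B=0.5):
    """Linearly rescale standardised scores and clip them to [0, 1]."""
    return np.clip(B + A * z, 0.0, 1.0)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(std_ab(z))   # extreme scores are clipped to 0 or 1
```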

Page 13: ictir2016

In practice, clipping does not happen often.

[Two panels (TREC04): per-topic score distributions for raw nDCG and std-AB nDCG, with Topic ID on the x-axis and scores in [0,1] on the y-axis.]

Page 14: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 15: ictir2016

Data for comparing raw vs. std-CDF vs. std-AB

Page 16: ictir2016

Ranking runs by raw, std-CDF, and std-AB measures

For each test collection, the rankings of the standardising systems produced by the three score types are statistically equivalent.

Page 17: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 18: ictir2016

Standardisation factors

[Diagram: from the raw score matrix (topics x standardising systems), per-topic standardisation factors <m·j, s·j> are computed and used to produce the standardised score matrix (topics x standardising systems).]

Page 19: ictir2016

Can the factors handle new systems properly?

[Diagram: raw scores of new systems (outside the standardising systems) are standardised using the factors <m·j, s·j> obtained from the standardising systems.]

Can the new systems be evaluated fairly?

Page 20: ictir2016

Leave one out (1)

(0) Leave out Team t (with L runs): T'(t) = T - {t}; from the original qrels QR = {QRj}, build QR'(t) = {QR'j(t)}, the qrels with the unique contributions of Team t removed.

(1) Compute the measure with QR for all M runs over N topics: the N x M matrix R_QR,T.

(1') Compute the measure with QR'(t): the N x (M - L) matrix R_QR'(t),T'(t) for the remaining runs, and R_QR'(t),{t} for Team t's L runs.

Runs from Team t have been removed from the pooled systems – are these "new" runs evaluated fairly? Compare the two run rankings before and after leave-one-out by means of Kendall's tau.

Zobel's original method [Zobel98] removed one run at a time, but removing the entire team is more realistic [Voorhees02].

Page 21: ictir2016

Leave one out (2)

As on the previous slide, but now with standardisation:

(2) Compute the factors {<m·j, s·j>} from R_QR,T; (2') compute the factors {<m'·j, s'·j>} from R_QR'(t),T'(t).

(3) Standardise R_QR,T with {<m·j, s·j>} to obtain the N x M matrix S_QR,T.

(3') Standardise R_QR'(t),T'(t) and R_QR'(t),{t} with {<m'·j, s'·j>} to obtain S_QR'(t),T'(t) and S_QR'(t),{t}.

The L runs from Team t are also removed from the standardising systems; these L runs are standardised using standardisation factors based on the remaining (M - L) runs.

Page 22: ictir2016

Leave one out (3)

Continuing from the previous slide:

(4a) Apply std-CDF to the full standardised scores to obtain the N x M matrix W_QR,T; (4'a) apply std-CDF to the leave-one-out standardised scores (the M - L remaining runs plus Team t's L runs) to obtain the N x M matrix W_QR'(t),T.

Runs from Team t have been removed from the pooled systems AND from the standardising systems – are these "new" runs evaluated fairly? Compare the two run rankings before and after leave-one-out by means of Kendall's tau.

Page 23: ictir2016

Leave one out (4)

The same procedure with std-AB:

(4b) Apply std-AB to the full standardised scores to obtain the N x M matrix P_QR,T; (4'b) apply std-AB to the leave-one-out standardised scores to obtain the N x M matrix P_QR'(t),T.

Runs from Team t have been removed from the pooled systems AND from the standardising systems – are these "new" runs evaluated fairly? Compare the two run rankings before and after leave-one-out by means of Kendall's tau.
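The comparison itself is just Kendall's tau between two rankings of the same runs; a minimal sketch (SciPy assumed; the score matrices below are toy placeholders for the full and leave-one-out evaluations described above):

```python
import numpy as np
from scipy.stats import kendalltau

# toy matrices: 5 topics x 4 runs, scores before and after leave-one-out
rng = np.random.default_rng(0)
full = rng.random((5, 4))
leave_one_out = full + rng.normal(0, 0.01, size=full.shape)

# Kendall's tau between the run rankings induced by the mean scores
# (tau is invariant to monotone transforms, so comparing mean scores
#  is equivalent to comparing the rankings themselves)
tau, _ = kendalltau(full.mean(axis=0), leave_one_out.mean(axis=0))
print(f"Kendall's tau between the two run rankings: {tau:.3f}")
```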

Page 24: ictir2016

Leave one out results

Similar results for TREC04, 05 can be found in the paper.

Margin of error for 95% CI

Runs outside the pooled and standardising systems can be evaluated fairly for both std-CDF and std-AB.

Page 25: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 26: ictir2016

Discriminative power

• Conduct a significance test for every system pair and plot the p-values

• Discriminative measures = those with small p-values

• [Sakai06SIGIR] used the bootstrap test for every system pair, but running k pairwise tests independently means that the familywise error rate can amount to 1 - (1 - α)^k [Carterette12, Ellis10].

• [Sakai12WWW] used the randomised Tukey HSD test [Carterette12, Sakai14PROMISE] instead to ensure that the familywise error rate is bounded above by α.

We also use randomised Tukey HSD.
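For readers unfamiliar with the randomised (permutation-based) Tukey HSD test of [Carterette12], here is a rough sketch under my own reading of the procedure: scores are permuted across systems within each topic, and the largest pairwise difference of system means in each permuted matrix forms the null distribution against which every observed pairwise difference is compared. Not the authors' code; names are illustrative.

```python
import numpy as np

def randomised_tukey_hsd(scores, n_perm=2000, seed=0):
    """Randomised Tukey HSD p-values for all system pairs.

    scores: topics x systems matrix of evaluation scores.
    Returns a systems x systems matrix of p-values.
    """
    rng = np.random.default_rng(seed)
    means = scores.mean(axis=0)
    obs_diff = np.abs(means[:, None] - means[None, :])   # observed |mean_i - mean_j|

    count = np.zeros_like(obs_diff)
    for _ in range(n_perm):
        # permute system labels independently within each topic
        perm = np.array([rng.permutation(row) for row in scores])
        pm = perm.mean(axis=0)
        max_diff = pm.max() - pm.min()    # largest pairwise difference under the null
        count += (max_diff >= obs_diff)

    return count / n_perm

# toy example: 20 topics x 4 systems with one clearly better system
rng = np.random.default_rng(1)
mat = rng.random((20, 4)) + np.array([0.0, 0.02, 0.05, 0.3])
print(np.round(randomised_tukey_hsd(mat), 3))
```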

Page 27: ictir2016

With nDCG, std-CDF is more discriminative than raw and std-AB scores…

It obtains more statistically significant results, probably because std-CDF emphasises moderately high and moderately low scores.

Page 28: ictir2016

But with nERR, std-CDF is not discriminative, probably because nERR is seldom moderately high or low.

Page 29: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 30: ictir2016

Swap test

• System X > Y with topic set A. Does X > Y also hold with topic set B?

• [Voorhees09] splits 100 topics in half to form A and B, each with 50.

• [Sakai06SIGIR] showed that bootstrap samples (sampling with replacement) can directly handle the original topic set size.

[Diagram: run pairs are grouped into 21 bins (Bin 1, Bin 2, …, Bin 21) by their performance difference, and swap rates are computed per bin. A sketch of the swap test follows below.]
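A rough sketch of a bootstrap-based swap check in the spirit of [Sakai06SIGIR], under my own simplifying assumptions (two independent bootstrap topic samples per trial); this is illustrative, not the exact procedure from the paper.

```python
import numpy as np

def swap_rate(scores_x, scores_y, n_trials=1000, seed=0):
    """Estimate how often the ordering of systems X and Y swaps
    between two bootstrap samples of the same topic set.

    scores_x, scores_y: per-topic scores of the two systems (1-D arrays).
    """
    rng = np.random.default_rng(seed)
    n = len(scores_x)
    swaps = 0
    for _ in range(n_trials):
        a = rng.integers(0, n, n)   # bootstrap topic sample A
        b = rng.integers(0, n, n)   # bootstrap topic sample B
        diff_a = scores_x[a].mean() - scores_y[a].mean()
        diff_b = scores_x[b].mean() - scores_y[b].mean()
        if diff_a * diff_b < 0:     # X > Y on one sample, Y > X on the other
            swaps += 1
    return swaps / n_trials

# toy example: 50 topics, two systems with a small true difference
rng = np.random.default_rng(1)
x = np.clip(rng.normal(0.52, 0.1, 50), 0, 1)
y = np.clip(rng.normal(0.50, 0.1, 50), 0, 1)
print(f"estimated swap rate: {swap_rate(x, y):.3f}")
```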

Page 31: ictir2016

With std-CDF, we get lots of swaps. std-AB is much more consistent across topic sets.

Page 32: ictir2016

What if we consider only run pairs that are statistically significantly different according to randomised Tukey HSD?

Significantly different pairs (raw / std-CDF / std-AB):

                        nDCG                 nERR
TREC03 (3,003 pairs)    810 / 844 / 812      378 / 357 / 386
TREC04 (5,995 pairs)    1434 / 1723 / 1534   223 / 220 / 250
TREC05 (2,701 pairs)    727 / 879 / 758      336 / 329 / 346

[Diagram: the significantly different pairs are grouped into bins (Bin 1', Bin 2', …, Bin 6'). Each bin now has a wider range, as the number of observations is small.]

Page 33: ictir2016

After filtering pairs with randomised Tukey HSD, swaps almost never occur for all three score types

[Charts for TREC03, TREC04, and TREC05: swap rates per bin for Bins 1'~3', i.e. [0, 0.10), [0.10, 0.20), and [0.20, 0.30).]

Previous work did not consider the familywise error rate problem (it used pairwise tests many times).

Example for nERR: #significant pairs: 378; #observations: 378,000; #observations in Bin 1': 980; #swaps in Bin 1': 1 (0.10%).

Page 34: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 35: ictir2016

Topic set size design [Sakai16IRJ,Sakai16ICTIRtutorial]

To determine the topic set size n for a new test collection to be built,

Sakai’s Excel tool based on one-way ANOVA power analysis takes as input:

α: Type I error probability

β: Type II error probability (power = 1 – β)

M: number of systems to be compared

minD: minimum detectable range

= the minimum difference between the best and worst systems for which you want to guarantee 100(1 - β)% power

σ̂²: estimate of the within-system variance (typically obtained from a pilot topic-by-run matrix)
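The actual tool is an Excel sheet; as a rough Python analogue (statsmodels assumed), one can solve a one-way ANOVA power equation for the sample size, under the conservative assumption that only the best and worst systems differ by minD and the rest sit at the grand mean (so Cohen's f = (minD/σ)/sqrt(2M)). This is an approximation for illustration, not the tool itself.

```python
import math
from statsmodels.stats.power import FTestAnovaPower

def topic_set_size(min_d, sigma2, n_systems, alpha=0.05, power=0.80):
    """Rough topic set size estimate via one-way ANOVA power analysis.

    min_d: minimum detectable range (best vs. worst system means).
    sigma2: estimated within-system variance.
    n_systems: number of systems (M) to be compared.
    """
    # worst-case Cohen's f when only two of the M systems differ by min_d
    f = (min_d / math.sqrt(sigma2)) / math.sqrt(2 * n_systems)
    total = FTestAnovaPower().solve_power(effect_size=f, alpha=alpha,
                                          power=power, k_groups=n_systems)
    return math.ceil(total / n_systems)   # topics per system

print(topic_set_size(min_d=0.10, sigma2=0.04, n_systems=10))
```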

Page 36: ictir2016

Estimating the within-system variance for each measure (to obtain future n)

[Diagram: a topic-by-run matrix for each collection C = TREC03, TREC04, TREC05. Residual variances from one-way ANOVA over each matrix (using the sample mean for each system i) are combined into a pooled variance estimate.]

Do this for the raw, std-CDF, and std-AB score matrices to obtain the n's.

Page 37: ictir2016

With std-AB, we get very small within-system variances (1)

The initial estimate of n in one-way ANOVA topic set size design is given by [Nagata03] in terms of λ (the noncentrality parameter of a noncentral chi-square distribution, whose required value for (α, β) = (0.05, 0.20) is approximately a constant), the within-system variance σ̂², and minD.

So n will be small if σ̂² is small.

With std-AB, σ̂² is indeed small because A is small (e.g. 0.15): std-AB simply rescales the standardised scores by A, so it can be shown that the within-system variance is roughly bounded by A² (see the paper for the exact bound).

Page 38: ictir2016

With std-AB, we get very small within-system variances (2)

Page 39: ictir2016

std-AB gives us more realistic topic set sizes for small minD values

• Does not mean that std-AB is "better" than std-CDF and raw, because a minD of (say) 0.02 in std-AB nDCG is not equivalent to a minD of 0.02 in std-CDF or raw nDCG.

• Nevertheless, having realistic topic set sizes for a variety of minD values is probably a convenient feature.

Page 40: ictir2016

If we had fewer teams, what would happen to the standardisation factors? (1)

[Diagram: per-topic standardisation factors <m·j, s·j> are computed from the raw topic-by-run matrix of the standardising systems, before and after removing k teams.]

If the standardisation factors are similar, that implies that we don't need many systems to obtain reliable values.

Page 41: ictir2016

If we had fewer teams, what would happen to the standardisation factors? (2)

Starting with 16 teams, k = 0, …, 14 teams were removed from the matrices before obtaining the standardisation factors.

Each line represents m·j or s·j for a topic (CIs omitted for brevity).

They are quite stable, even when k = 14 teams have been removed. That is, only a few teams are needed to obtain reliable values of m·j and s·j.
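A rough sketch of this stability check (NumPy assumed; the team-to-run mapping and the data are illustrative, not taken from the paper):

```python
import numpy as np

def factors(scores):
    """Per-topic mean and standard deviation over the standardising systems."""
    return scores.mean(axis=1), scores.std(axis=1, ddof=1)

def factors_after_removing_teams(scores, run_teams, k, rng):
    """Recompute the factors after removing the runs of k randomly chosen teams."""
    teams = np.unique(run_teams)
    removed = rng.choice(teams, size=k, replace=False)
    keep = ~np.isin(run_teams, removed)
    return factors(scores[:, keep])

rng = np.random.default_rng(0)
n_topics, n_runs, n_teams = 50, 64, 16
scores = rng.random((n_topics, n_runs))
run_teams = rng.integers(0, n_teams, n_runs)   # which team produced each run

m_full, s_full = factors(scores)
for k in (4, 8, 14):
    m_k, s_k = factors_after_removing_teams(scores, run_teams, k, rng)
    print(k, np.abs(m_k - m_full).max(), np.abs(s_k - s_full).max())
```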

Page 42: ictir2016

If we had fewer teams, what would happen to within-system variances for std-AB? (1)

[Diagram: within-system variance estimates for std-AB are computed from the raw topic-by-run matrix of the standardising systems, before and after removing k teams.]

If the variance estimates are similar, that implies that we don't need many systems to obtain reliable values.

Page 43: ictir2016

If we had fewer teams, what would happen to within-system variances for std-AB? (2)

Each k had 10 trials, so 95% CIs of the variance estimates are shown.

The variance estimates are also stable even if we remove a lot of teams. That is, only a few teams are needed to obtain reliable variance estimates for topic set size design.

Using std-AB with topic set size design also means that we can handle unnormalised measures without any problems [Sakai16AIRS].

Page 44: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 45: ictir2016

Conclusions

• Advantages of score standardisation:

- removes topic hardness, enables comparison across test collections

- normalisation becomes unnecessary

• Advantages of std-AB over std-CDF:

- Low within-system variances, and therefore:

- Substantially lower swap rates (higher consistency across different data)

- Enables us to consider realistic topic set sizes in topic set size design

• By-product: using randomised Tukey HSD (instead of repeated pairwise tests) ensures that swaps almost never occur. If you want a p-value for every system pair, this test is highly recommended.

Swap rates for std-CDF can be higher than those for raw scores, probably due to its nonlinear transformation.

std-AB is a good alternative to std-CDF.

Page 46: ictir2016

Shared resources

• All of the topic-by-run matrices created in our experiments are available at https://waseda.box.com/ICTIR2016PACK

• Computing AP, Q-measure, nDCG, nERR etc.:

http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

• Discriminative power by randomised Tukey HSD:

http://research.nii.ac.jp/ntcir/tools/discpower-en.html

• Topic set size design Excel tools:

http://www.f.waseda.jp/tetsuya/tools.html

Page 47: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 48: ictir2016

We Want Web@NTCIR-13 (1) http://www.thuir.cn/ntcirwww/

[Diagram: at NTCIR-13 (Dec 2017), the NTCIR-13 systems produce new runs that are pooled for both the frozen topic set and the NTCIR-13 fresh topic set.]

Page 49: ictir2016

We Want Web@NTCIR-13 (2) http://www.thuir.cn/ntcirwww/

[Diagram: official NTCIR-13 results are discussed with the fresh topics. Qrels + standardisation factors based on the NTCIR-13 systems are released for one of the two topic sets and NOT released for the other.]

Page 50: ictir2016

We Want Web@NTCIR-14 (1) http://www.thuir.cn/ntcirwww/

[Diagram: at NTCIR-14 (Jun 2019), the NTCIR-14 systems produce new runs pooled for the frozen topic set and the NTCIR-14 fresh topic set, while revived runs are pooled for the fresh topics only.]

Page 51: ictir2016

We Want Web@NTCIR-14 (2) http://www.thuir.cn/ntcirwww/

[Diagram: official NTCIR-14 results are discussed with the fresh topics. Qrels + standardisation factors based on the NTCIR-13+14 systems are NOT released for one topic set, while those based on the NTCIR-(13+)14 systems are released for the other.]

Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.

Page 52: ictir2016

We Want Web@NTCIR-15 (1) http://www.thuir.cn/ntcirwww/

[Diagram: at NTCIR-15 (Dec 2020), the NTCIR-15 systems produce new runs pooled for the frozen topic set and the NTCIR-15 fresh topic set, while revived runs are pooled for the fresh topics only.]

Page 53: ictir2016

We Want Web@NTCIR-15 (2) http://www.thuir.cn/ntcirwww/

[Diagram: official NTCIR-15 results are discussed with the fresh topics. Qrels + standardisation factors based on the NTCIR-(13+14+)15 systems are released.]

Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.

Page 54: ictir2016

We Want Web@NTCIR-15 (3) http://www.thuir.cn/ntcirwww/

[Diagram: for the frozen topic set, qrels + standardisation factors based on the NTCIR-13 systems, on the NTCIR-13+14 systems, and on the NTCIR-13+14+15 systems are released (one set per round); for the fresh topics, those based on the NTCIR-(13+14+)15 systems are released.]

How do the standardisation factors for each frozen topic differ across the 3 rounds?

Page 55: ictir2016

We Want Web@NTCIR-15 (4) http://www.thuir.cn/ntcirwww/

[Diagram: the NTCIR-15 systems are ranked three times, once with the qrels + standardisation factors based on the NTCIR-13 systems, once with those based on the NTCIR-13+14 systems, and once with those based on the NTCIR-13+14+15 systems (plus the NTCIR-(13+14+)15 factors for the fresh topics). Official NTCIR-15 results are discussed with the fresh topics.]

How do the NTCIR-15 system rankings differ across the 3 rounds, with and w/o standardisation?

Page 56: ictir2016

See you all in Tokyo

Page 57: ictir2016

Selected references (1)

[Carterette12] Carterette: Multiple testing in statistical analysis of systems-based information retrieval experiments, ACM TOIS 30(1), 2012.

[Ellis10] Ellis: The essential guide to effect sizes, Cambridge, 2010.

[Lodico+10] Lodico, Spaulding, Voegtle: Methods in educational research, Jossey-Bass, 2010.

Page 58: ictir2016

Selected references (2)

[Sakai06SIGIR] Sakai: Evaluating evaluation metrics based on the bootstrap, ACM SIGIR 2006.

[Sakai12WWW] Sakai: Evaluation with Informational and Navigational Intents, WWW 2012.

[Sakai14PROMISE] Sakai: Metrics, statistics, tests, PROMISE Winter School 2013 (LNCS 8173).

[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval Journal 19(3), 2016. http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf

[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice, ICTIR 2016 Tutorial. http://www.slideshare.net/TetsuyaSakai/ictir2016tutorial-65845256

[Sakai16AIRS] Sakai: The Effect of Score Standardisation on Topic Set Size Design, AIRS 2016, to appear.

Page 59: ictir2016

Selected references (3)

[Voorhees02] Voorhees: The philosophy of information retrieval evaluation, CLEF 2001.

[Voorhees09] Voorhees: Topic set size redux, ACM SIGIR 2009.

[Webber+08] Webber, Moffat, Zobel: Score standardisation for inter-collection comparison of retrieval systems, ACM SIGIR 2008.

[Zobel98] Zobel: How reliable are the results of large-scale information retrieval experiments? ACM SIGIR 1998.