
Page 1: ictir2016

A Simple and Effective Approach to Score Standardisation

@tetsuyasakai

http://www.f.waseda.jp/tetsuya/sakai.html

September 15@ICTIR 2016 (Newark, DE, USA)

Page 2: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 3: ictir2016

Hard topics, easy topics

[Bar chart: scores of Systems 1-5 on two topics; one topic is easy (mean = 0.70) and the other hard (mean = 0.12).]

Page 4: ictir2016

Low-variance topics, high-variance topics

[Bar chart: scores of Systems 1-5 on two topics; one topic has low score variance (standard deviation = 0.08) and the other high variance (standard deviation = 0.29).]

Page 5: ictir2016

Score standardisation [Webber+08]

The standardised score of the i-th system on the j-th topic is obtained from the raw score by subtracting the topic mean and dividing by the topic standard deviation:

z_ij = (x_ij - m·j) / s·j

where the standardising factors <m·j, s·j> are the mean and standard deviation of the raw scores of the standardising systems on topic j. The standardised score answers: how good is system i compared to the "average" system, in standard deviation units?
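To make the transformation concrete, here is a minimal sketch in Python (NumPy assumed) that standardises a topic-by-run score matrix; the variable names (raw_scores, m_j, s_j) are illustrative, not from the paper.

```python
import numpy as np

def standardise(raw_scores):
    """Standardise a topics x runs matrix of raw scores.

    Each row (topic) is transformed to have mean 0 and standard
    deviation 1 across the standardising systems (the columns).
    """
    m_j = raw_scores.mean(axis=1, keepdims=True)          # per-topic mean
    s_j = raw_scores.std(axis=1, ddof=1, keepdims=True)   # per-topic standard deviation
    return (raw_scores - m_j) / s_j

# toy example: 2 topics x 5 systems
raw = np.array([[0.75, 0.68, 0.72, 0.65, 0.70],
                [0.10, 0.15, 0.08, 0.12, 0.15]])
print(standardise(raw))
```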

Page 6: ictir2016

Now for every topic, mean = 0, variance = 1.

[Bar chart: standardised scores of Systems 1-5 on Topics 1 and 2, now ranging roughly from -2 to +2.]

Comparisons across different topic sets and test collections are possible!

Page 7: ictir2016

Standardised scores have the (-∞, ∞) range and are not very convenient.

[Same chart of standardised scores for Systems 1-5 on Topics 1 and 2 as on the previous slide.]

Transform them back into the [0,1] range!

Page 8: ictir2016

std-CDF: use the cumulative distribution function of the standard normal distribution [Webber+08]
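As a sketch of what std-CDF does (SciPy assumed; not the authors' code), each standardised score z is mapped back into [0, 1] through the standard normal CDF Φ(z):

```python
import numpy as np
from scipy.stats import norm

def std_cdf(z):
    """Map standardised scores back into [0, 1] via the standard normal CDF."""
    return norm.cdf(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(std_cdf(z))   # approx. 0.023, 0.309, 0.5, 0.691, 0.977
```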

[Scatter plot (TREC04): raw nDCG on the x-axis vs. std-CDF nDCG on the y-axis. Each curve is a topic, with 110 runs represented as dots.]

Page 9: ictir2016

std-CDF: emphasises moderately high and moderately low performers – is this a good thing?

[Same TREC04 plot of raw nDCG vs. std-CDF nDCG, with the moderately high and moderately low regions highlighted.]

Page 10: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 11: ictir2016

std-AB: How about a simple linear transformation?

[Scatter plot (TREC04): raw nDCG on the x-axis vs. transformed nDCG on the y-axis, for std-CDF nDCG, std-AB nDCG (A=0.10), and std-AB nDCG (A=0.15).]

Page 12: ictir2016

std-AB with clipping, with the range [0,1]

The transformed score is B + A·z, where z is the standardised score, clipped to the range [0,1].

Let B = 0.5 ("average" system).

Let A = 0.15 so that at least 89% of scores fall within [0.05, 0.95] (Chebyshev's inequality).

For EXTREMELY good/bad systems, B + A·z may fall outside [0,1]; such values are clipped.

This formula with (A,B) is used in educational research: A=100, B=500 for SAT and GRE scores [Lodico+10]; A=10, B=50 for Japanese hensachi "standard scores".
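A minimal sketch of the std-AB transform (Python/NumPy assumed; parameter names are illustrative):

```python
import numpy as np

def std_ab(z, A=0.15, B=0.5):
    """Linearly rescale standardised scores and clip them to [0, 1]."""
    return np.clip(B + A * z, 0.0, 1.0)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(std_ab(z))   # extreme scores are clipped to 0 or 1
```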

Page 13: ictir2016

In practice, clipping does not happen often.

[Two panels (TREC04): per-topic score distributions for raw nDCG and std-AB nDCG, with Topic ID on the x-axis and scores in [0,1] on the y-axis.]

Page 14: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 15: ictir2016

Data for comparing raw vs. std-CDF vs. std-AB

Page 16: ictir2016

Ranking runs by raw, std-CDF, and std-AB measures

For each test collection, the rankings of the standardising systems produced by the three score types are statistically equivalent.

Page 17: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 18: ictir2016

Standardisation factors

[Diagram: from the raw score matrix (topics x standardising systems), per-topic standardisation factors <m·j, s·j> are computed and used to produce the standardised score matrix (topics x standardising systems).]

Page 19: ictir2016

Can the factors handle new systems properly?

[Diagram: raw scores of new systems (outside the standardising systems) are standardised using the factors <m·j, s·j> obtained from the standardising systems.]

Can the new systems be evaluated fairly?

Page 20: ictir2016

Leave one out (1)

(0) Leave out Team t (with L runs): T'(t) = T - {t}; from the original qrels QR = {QRj}, build QR'(t) = {QR'j(t)}, the qrels with the unique contributions of Team t removed.

(1) Compute the measure with QR for all M runs over N topics: the N x M matrix R_QR,T.

(1') Compute the measure with QR'(t): the N x (M - L) matrix R_QR'(t),T'(t) for the remaining runs, and R_QR'(t),{t} for Team t's L runs.

Runs from Team t have been removed from the pooled systems – are these "new" runs evaluated fairly? Compare the two run rankings before and after leave-one-out by means of Kendall's tau.

Zobel's original method [Zobel98] removed one run at a time, but removing the entire team is more realistic [Voorhees02].

Page 21: ictir2016

Leave one out (2)

As on the previous slide, but now with standardisation:

(2) Compute the factors {<m·j, s·j>} from R_QR,T; (2') compute the factors {<m'·j, s'·j>} from R_QR'(t),T'(t).

(3) Standardise R_QR,T with {<m·j, s·j>} to obtain the N x M matrix S_QR,T.

(3') Standardise R_QR'(t),T'(t) and R_QR'(t),{t} with {<m'·j, s'·j>} to obtain S_QR'(t),T'(t) and S_QR'(t),{t}.

The L runs from Team t are also removed from the standardising systems; these L runs are standardised using standardisation factors based on the remaining (M - L) runs.

Page 22: ictir2016

Leave one out (3)

Continuing from the previous slide:

(4a) Apply std-CDF to the full standardised scores to obtain the N x M matrix W_QR,T; (4'a) apply std-CDF to the leave-one-out standardised scores (the M - L remaining runs plus Team t's L runs) to obtain the N x M matrix W_QR'(t),T.

Runs from Team t have been removed from the pooled systems AND from the standardising systems – are these "new" runs evaluated fairly? Compare the two run rankings before and after leave-one-out by means of Kendall's tau.

Page 23: ictir2016

Leave one out (4)

The same procedure with std-AB:

(4b) Apply std-AB to the full standardised scores to obtain the N x M matrix P_QR,T; (4'b) apply std-AB to the leave-one-out standardised scores to obtain the N x M matrix P_QR'(t),T.

Runs from Team t have been removed from the pooled systems AND from the standardising systems – are these "new" runs evaluated fairly? Compare the two run rankings before and after leave-one-out by means of Kendall's tau.
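The comparison itself is just Kendall's tau between two rankings of the same runs; a minimal sketch (SciPy assumed; the score matrices below are toy placeholders for the full and leave-one-out evaluations described above):

```python
import numpy as np
from scipy.stats import kendalltau

# toy matrices: 5 topics x 4 runs, scores before and after leave-one-out
rng = np.random.default_rng(0)
full = rng.random((5, 4))
leave_one_out = full + rng.normal(0, 0.01, size=full.shape)

# Kendall's tau between the run rankings induced by the mean scores
# (tau is invariant to monotone transforms, so comparing mean scores
#  is equivalent to comparing the rankings themselves)
tau, _ = kendalltau(full.mean(axis=0), leave_one_out.mean(axis=0))
print(f"Kendall's tau between the two run rankings: {tau:.3f}")
```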

Page 24: ictir2016

Leave one out results

Similar results for TREC04, 05 can be found in the paper.

Margin of error for 95% CI

Runs outside the pooled and standardising systems can be evaluated fairly for both std-CDF and std-AB.

Page 25: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 26: ictir2016

Discriminative power

• Conduct a significance test for every system pair and plot the p-values

• Discriminative measures = those with small p-values

• [Sakai06SIGIR] used the bootstrap test for every system pair, but running k pairwise tests independently means that the familywise error rate can amount to 1 - (1 - α)^k [Carterette12, Ellis10].

• [Sakai12WWW] used the randomised Tukey HSD test [Carterette12, Sakai14PROMISE] instead to ensure that the familywise error rate is bounded above by α.

We also use randomised Tukey HSD.
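For readers unfamiliar with the randomised (permutation-based) Tukey HSD test of [Carterette12], here is a rough sketch under my own reading of the procedure: scores are permuted across systems within each topic, and the largest pairwise difference of system means in each permuted matrix forms the null distribution against which every observed pairwise difference is compared. Not the authors' code; names are illustrative.

```python
import numpy as np

def randomised_tukey_hsd(scores, n_perm=2000, seed=0):
    """Randomised Tukey HSD p-values for all system pairs.

    scores: topics x systems matrix of evaluation scores.
    Returns a systems x systems matrix of p-values.
    """
    rng = np.random.default_rng(seed)
    means = scores.mean(axis=0)
    obs_diff = np.abs(means[:, None] - means[None, :])   # observed |mean_i - mean_j|

    count = np.zeros_like(obs_diff)
    for _ in range(n_perm):
        # permute system labels independently within each topic
        perm = np.array([rng.permutation(row) for row in scores])
        pm = perm.mean(axis=0)
        max_diff = pm.max() - pm.min()    # largest pairwise difference under the null
        count += (max_diff >= obs_diff)

    return count / n_perm

# toy example: 20 topics x 4 systems with one clearly better system
rng = np.random.default_rng(1)
mat = rng.random((20, 4)) + np.array([0.0, 0.02, 0.05, 0.3])
print(np.round(randomised_tukey_hsd(mat), 3))
```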

Page 27: ictir2016

With nDCG, std-CDF is more discriminative than raw and std-AB scores…

It obtains more statistically significant results, probably because std-CDF emphasises moderately high and moderately low scores.

Page 28: ictir2016

But with nERR, std-CDF is not discriminative, probably because nERR is seldom moderately high or low.

Page 29: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 30: ictir2016

Swap test

• System X > Y with topic set A. Does X > Y also hold with topic set B?

• [Voorhees09] splits 100 topics in half to form A and B, each with 50.

• [Sakai06SIGIR] showed that bootstrap samples (sampling with replacement) can directly handle the original topic set size.

[Diagram: run pairs are grouped into 21 bins (Bin 1, Bin 2, …, Bin 21) by their performance difference, and swap rates are computed per bin. A sketch of the swap test follows below.]
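A rough sketch of a bootstrap-based swap check in the spirit of [Sakai06SIGIR], under my own simplifying assumptions (two independent bootstrap topic samples per trial); this is illustrative, not the exact procedure from the paper.

```python
import numpy as np

def swap_rate(scores_x, scores_y, n_trials=1000, seed=0):
    """Estimate how often the ordering of systems X and Y swaps
    between two bootstrap samples of the same topic set.

    scores_x, scores_y: per-topic scores of the two systems (1-D arrays).
    """
    rng = np.random.default_rng(seed)
    n = len(scores_x)
    swaps = 0
    for _ in range(n_trials):
        a = rng.integers(0, n, n)   # bootstrap topic sample A
        b = rng.integers(0, n, n)   # bootstrap topic sample B
        diff_a = scores_x[a].mean() - scores_y[a].mean()
        diff_b = scores_x[b].mean() - scores_y[b].mean()
        if diff_a * diff_b < 0:     # X > Y on one sample, Y > X on the other
            swaps += 1
    return swaps / n_trials

# toy example: 50 topics, two systems with a small true difference
rng = np.random.default_rng(1)
x = np.clip(rng.normal(0.52, 0.1, 50), 0, 1)
y = np.clip(rng.normal(0.50, 0.1, 50), 0, 1)
print(f"estimated swap rate: {swap_rate(x, y):.3f}")
```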

Page 31: ictir2016

With std-CDF, we get lots of swaps. std-AB is much more consistent across topic sets.

Page 32: ictir2016

What if we consider only run pairs that are statistically significantly different according to randomised Tukey HSD?

Significantly different pairs (raw / std-CDF / std-AB):

                        nDCG                 nERR
TREC03 (3,003 pairs)    810 / 844 / 812      378 / 357 / 386
TREC04 (5,995 pairs)    1434 / 1723 / 1534   223 / 220 / 250
TREC05 (2,701 pairs)    727 / 879 / 758      336 / 329 / 346

[Diagram: the significantly different pairs are grouped into bins (Bin 1', Bin 2', …, Bin 6'). Each bin now has a wider range, as the number of observations is small.]

Page 33: ictir2016

After filtering pairs with randomised Tukey HSD, swaps almost never occur for all three score types

[Charts for TREC03, TREC04, and TREC05: swap rates per bin for Bins 1'~3', i.e. [0, 0.10), [0.10, 0.20), and [0.20, 0.30).]

Previous work did not consider the familywise error rate problem (it used pairwise tests many times).

Example for nERR: #significant pairs: 378; #observations: 378,000; #observations in Bin 1': 980; #swaps in Bin 1': 1 (0.10%).

Page 34: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 35: ictir2016

Topic set size design [Sakai16IRJ,Sakai16ICTIRtutorial]

To determine the topic set size n for a new test collection to be built,

Sakai’s Excel tool based on one-way ANOVA power analysis takes as input:

α: Type I error probability

β: Type II error probability (power = 1 – β)

M: number of systems to be compared

minD: minimum detectable range

= the minimum difference between the best and worst systems for which you want to guarantee 100(1 - β)% power

σ̂²: estimate of the within-system variance (typically obtained from a pilot topic-by-run matrix)
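The actual tool is an Excel sheet; as a rough Python analogue (statsmodels assumed), one can solve a one-way ANOVA power equation for the sample size, under the conservative assumption that only the best and worst systems differ by minD and the rest sit at the grand mean (so Cohen's f = (minD/σ)/sqrt(2M)). This is an approximation for illustration, not the tool itself.

```python
import math
from statsmodels.stats.power import FTestAnovaPower

def topic_set_size(min_d, sigma2, n_systems, alpha=0.05, power=0.80):
    """Rough topic set size estimate via one-way ANOVA power analysis.

    min_d: minimum detectable range (best vs. worst system means).
    sigma2: estimated within-system variance.
    n_systems: number of systems (M) to be compared.
    """
    # worst-case Cohen's f when only two of the M systems differ by min_d
    f = (min_d / math.sqrt(sigma2)) / math.sqrt(2 * n_systems)
    total = FTestAnovaPower().solve_power(effect_size=f, alpha=alpha,
                                          power=power, k_groups=n_systems)
    return math.ceil(total / n_systems)   # topics per system

print(topic_set_size(min_d=0.10, sigma2=0.04, n_systems=10))
```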

Page 36: ictir2016

Estimating the within-system variance for each measure (to obtain future n)

[Diagram: a topic-by-run matrix for each collection C = TREC03, TREC04, TREC05. Residual variances from one-way ANOVA over each matrix (using the sample mean for each system i) are combined into a pooled variance estimate.]

Do this for the raw, std-CDF, and std-AB score matrices to obtain the n's.

Page 37: ictir2016

With std-AB, we get very small within-system variances (1)

The initial estimate of n in one-way ANOVA topic set size design is given by [Nagata03] in terms of λ (the noncentrality parameter of a noncentral chi-square distribution, whose required value for (α, β) = (0.05, 0.20) is approximately a constant), the within-system variance σ̂², and minD.

So n will be small if σ̂² is small.

With std-AB, σ̂² is indeed small because A is small (e.g. 0.15): std-AB simply rescales the standardised scores by A, so it can be shown that the within-system variance is roughly bounded by A² (see the paper for the exact bound).

Page 38: ictir2016

With std-AB, we get very small within-system variances (2)

Page 39: ictir2016

std-AB gives us more realistic topic set sizes for small minD values

• Does not mean that std-AB is "better" than std-CDF and raw, because a minD of (say) 0.02 in std-AB nDCG is not equivalent to a minD of 0.02 in std-CDF or raw nDCG.

• Nevertheless, having realistic topic set sizes for a variety of minD values is probably a convenient feature.

Page 40: ictir2016

If we had fewer teams, what would happen to the standardisation factors? (1)

[Diagram: per-topic standardisation factors <m·j, s·j> are computed from the raw topic-by-run matrix of the standardising systems, before and after removing k teams.]

If the standardisation factors are similar, that implies that we don't need many systems to obtain reliable values.

Page 41: ictir2016

If we had fewer teams, what would happen to the standardisation factors? (2)

Starting with 16 teams, k = 0, …, 14 teams were removed from the matrices before obtaining the standardisation factors.

Each line represents m·j or s·j for a topic (CIs omitted for brevity).

They are quite stable, even when k = 14 teams have been removed. That is, only a few teams are needed to obtain reliable values of m·j and s·j.
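A rough sketch of this stability check (NumPy assumed; the team-to-run mapping and the data are illustrative, not taken from the paper):

```python
import numpy as np

def factors(scores):
    """Per-topic mean and standard deviation over the standardising systems."""
    return scores.mean(axis=1), scores.std(axis=1, ddof=1)

def factors_after_removing_teams(scores, run_teams, k, rng):
    """Recompute the factors after removing the runs of k randomly chosen teams."""
    teams = np.unique(run_teams)
    removed = rng.choice(teams, size=k, replace=False)
    keep = ~np.isin(run_teams, removed)
    return factors(scores[:, keep])

rng = np.random.default_rng(0)
n_topics, n_runs, n_teams = 50, 64, 16
scores = rng.random((n_topics, n_runs))
run_teams = rng.integers(0, n_teams, n_runs)   # which team produced each run

m_full, s_full = factors(scores)
for k in (4, 8, 14):
    m_k, s_k = factors_after_removing_teams(scores, run_teams, k, rng)
    print(k, np.abs(m_k - m_full).max(), np.abs(s_k - s_full).max())
```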

Page 42: ictir2016

If we had fewer teams, what would happen to within-system variances for std-AB? (1)

[Diagram: within-system variance estimates for std-AB are computed from the raw topic-by-run matrix of the standardising systems, before and after removing k teams.]

If the variance estimates are similar, that implies that we don't need many systems to obtain reliable values.

Page 43: ictir2016

If we had fewer teams, what would happen to within-system variances for std-AB? (2)

Each k had 10 trials, so 95% CIs of the variance estimates are shown.

The variance estimates are also stable even if we remove a lot of teams. That is, only a few teams are needed to obtain reliable variance estimates for topic set size design.

Using std-AB with topic set size design also means that we can handle unnormalised measures without any problems [Sakai16AIRS].

Page 44: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 45: ictir2016

Conclusions

• Advantages of score standardisation:

- removes topic hardness, enables comparison across test collections

- normalisation becomes unnecessary

• Advantages of std-AB over std-CDF:

- Low within-system variances, and therefore:

- Substantially lower swap rates (higher consistency across different data)

- Enables us to consider realistic topic set sizes in topic set size design

• By-product: using randomised Tukey HSD (instead of repeated pairwise tests) ensures that swaps almost never occur. If you want a p-value for every system pair, this test is highly recommended.

Swap rates for std-CDF can be higher than those for raw scores, probably due to its nonlinear transformation.

std-AB is a good alternative to std-CDF.

Page 46: ictir2016

Shared resources

• All of the topic-by-run matrices created in our experiments are available at https://waseda.box.com/ICTIR2016PACK

• Computing AP, Q-measure, nDCG, nERR etc.:

http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

• Discriminative power by randomised Tukey HSD:

http://research.nii.ac.jp/ntcir/tools/discpower-en.html

• Topic set size design Excel tools:

http://www.f.waseda.jp/tetsuya/tools.html

Page 47: ictir2016

TALK OUTLINE

1. Score standardisation and std-CDF

2. Proposed method: std-AB

3. Data and measures

4. Handling new systems: Leave one out

5. Discriminative power

6. Swap rates

7. Topic set size design

8. Conclusions

9. Future work

Page 48: ictir2016

We Want Web@NTCIR-13 (1) http://www.thuir.cn/ntcirwww/

[Diagram: at NTCIR-13 (Dec 2017), the NTCIR-13 systems produce new runs that are pooled for both the frozen topic set and the NTCIR-13 fresh topic set.]

Page 49: ictir2016

We Want Web@NTCIR-13 (2) http://www.thuir.cn/ntcirwww/

[Diagram: official NTCIR-13 results are discussed with the fresh topics. Qrels + standardisation factors based on the NTCIR-13 systems are released for one of the two topic sets and NOT released for the other.]

Page 50: ictir2016

We Want Web@NTCIR-14 (1) http://www.thuir.cn/ntcirwww/

[Diagram: at NTCIR-14 (Jun 2019), the NTCIR-14 systems produce new runs pooled for the frozen topic set and the NTCIR-14 fresh topic set, while revived runs are pooled for the fresh topics only.]

Page 51: ictir2016

We Want Web@NTCIR-14 (2) http://www.thuir.cn/ntcirwww/

[Diagram: official NTCIR-14 results are discussed with the fresh topics. Qrels + standardisation factors based on the NTCIR-13+14 systems are NOT released for one topic set, while those based on the NTCIR-(13+)14 systems are released for the other.]

Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.

Page 52: ictir2016

We Want Web@NTCIR-15 (1) http://www.thuir.cn/ntcirwww/

[Diagram: at NTCIR-15 (Dec 2020), the NTCIR-15 systems produce new runs pooled for the frozen topic set and the NTCIR-15 fresh topic set, while revived runs are pooled for the fresh topics only.]

Page 53: ictir2016

We Want Web@NTCIR-15 (2) http://www.thuir.cn/ntcirwww/

[Diagram: official NTCIR-15 results are discussed with the fresh topics. Qrels + standardisation factors based on the NTCIR-(13+14+)15 systems are released.]

Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.

Page 54: ictir2016

We Want Web@NTCIR-15 (3) http://www.thuir.cn/ntcirwww/

[Diagram: for the frozen topic set, qrels + standardisation factors based on the NTCIR-13 systems, on the NTCIR-13+14 systems, and on the NTCIR-13+14+15 systems are released (one set per round); for the fresh topics, those based on the NTCIR-(13+14+)15 systems are released.]

How do the standardisation factors for each frozen topic differ across the 3 rounds?

Page 55: ictir2016

We Want Web@NTCIR-15 (4) http://www.thuir.cn/ntcirwww/

[Diagram: the NTCIR-15 systems are ranked three times, once with the qrels + standardisation factors based on the NTCIR-13 systems, once with those based on the NTCIR-13+14 systems, and once with those based on the NTCIR-13+14+15 systems (plus the NTCIR-(13+14+)15 factors for the fresh topics). Official NTCIR-15 results are discussed with the fresh topics.]

How do the NTCIR-15 system rankings differ across the 3 rounds, with and w/o standardisation?

Page 56: ictir2016

See you all in Tokyo

Page 57: ictir2016

Selected references (1)

[Carterette12] Carterette: Multiple testing in statistical analysis of systems-based information retrieval experiments, ACM TOIS 30(1), 2012.

[Ellis10] Ellis: The essential guide to effect sizes, Cambridge, 2010.

[Lodico+10] Lodico, Spaulding, Voegtle: Methods in educational research, Jossey-Bass, 2010.

Page 58: ictir2016

Selected references (2)

[Sakai06SIGIR] Sakai: Evaluating evaluation metrics based on the bootstrap, ACM SIGIR 2006.

[Sakai12WWW] Sakai: Evaluation with Informational and Navigational Intents, WWW 2012.

[Sakai14PROMISE] Sakai: Metrics, statistics, tests, PROMISE Winter School 2013 (LNCS 8173).

[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval Journal 19(3), 2016. http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf

[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice, ICTIR 2016 Tutorial. http://www.slideshare.net/TetsuyaSakai/ictir2016tutorial-65845256

[Sakai16AIRS] Sakai: The Effect of Score Standardisation on Topic Set Size Design, AIRS 2016, to appear.

Page 59: ictir2016

Selected references (3)

[Voorhees02] Voorhees: The philosophy of information retrieval evaluation, CLEF 2001.

[Voorhees09] Voorhees: Topic set size redux, ACM SIGIR 2009.

[Webber+08] Webber, Moffat, Zobel: Score standardisation for inter-collection comparison of retrieval systems, ACM SIGIR 2008.

[Zobel98] Zobel: How reliable are the results of large-scale information retrieval experiments? ACM SIGIR 1998.