Get Another Label? Using Multiple, Noisy Labelers Joint work with Victor Sheng and Foster Provost...

Posted on 15-Jan-2016


Get Another Label? Using Multiple, Noisy Labelers

Joint work with Victor Sheng and Foster Provost

Panos Ipeirotis

Stern School of Business, New York University


Motivation

Many tasks rely on high-quality labels for objects:
– relevance judgments
– duplicate database records
– image recognition
– song categorization
– videos

Labeling can be relatively inexpensive, using Mechanical Turk, ESP game …

ESP Game (by Luis von Ahn)


Mechanical Turk Example

“Are these two documents about the same topic?”


Mechanical Turk Example


Motivation

Labels can be used in training predictive models:
– duplicate detection systems
– image recognition
– web search

But: labels obtained from the above sources are noisy, and this directly affects the quality of the learned models.

– How can we know the quality of the annotators?
– How can we know the correct answer?
– How can we best use noisy annotators?


Quality and Classification Performance

[Figure: Accuracy vs. number of examples (Mushroom dataset), one curve per labeling quality Q = 0.5, 0.6, 0.8, 1.0.]

As labeling quality increases, classification quality increases


How to Improve Labeling Quality

Find better labelers
– Often expensive, or beyond our control

Use multiple, noisy labelers: repeated-labeling
– Our focus


Multiple labelers and resulting label quality

Multiple labelers and classification quality

Selective label acquisition

Our Focus: Labeling using Multiple Noisy Labelers


Majority Voting and Label Quality

[Figure: Integrated quality vs. number of labelers (1, 3, 5, …, 13), one curve per individual labeler accuracy P = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.]

Ask multiple labelers, keep majority label as “true” label

Quality is the probability of the majority label being correct

P is the probability of an individual labeler being correct
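The majority-vote quality curves can be computed exactly with a binomial sum; a minimal sketch, assuming binary labels, independent labelers of equal accuracy, and an odd number of labelers so that ties cannot occur (`majority_quality` is a hypothetical helper name):

```python
from math import comb

def majority_quality(p, n):
    """Probability that the majority vote of n independent labelers,
    each correct with probability p, yields the correct label.
    Assumes binary labels and an odd n (no ties possible)."""
    assert n % 2 == 1, "use an odd number of labelers to avoid ties"
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n // 2 + 1, n + 1))
```

For labelers better than random (p > 0.5) the integrated quality rises toward 1 as labelers are added; for p < 0.5 the majority vote actually gets worse, matching the P = 0.4 curve.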

So…

(Sometimes) the quality obtained from multiple noisy labelers is better than the quality of the best labeler in the set


Multiple noisy labelers improve quality

So, should we always get multiple labels?


Tradeoffs for Classification

Get more labels → improve label quality → improve classification
Get more examples → improve classification

[Figure: Accuracy vs. number of examples (Mushroom dataset), one curve per labeling quality Q = 0.5, 0.6, 0.8, 1.0.]


Basic Labeling Strategies

Get as many data points as possible, one label each

Repeatedly-label everything, same number of times


Repeat-Labeling vs. Single Labeling

P = 0.6 (labeling quality), K = 5 (labels per example)

Repeated

Single

With high noise, repeated labeling is better than single labeling


Repeat-Labeling vs. Single Labeling

P = 0.8 (labeling quality), K = 5 (labels per example)

Repeated

Single

With low noise, getting more (single-labeled) examples is better

Estimating Labeler Quality

(Dawid & Skene, 1979): “multiple diagnoses”

– Assume equal labeler qualities to start
– Estimate “true” labels for the examples
– Estimate the quality of each labeler, given the “true” labels
– Repeat until convergence
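The iteration above can be sketched as a small EM loop. This is a minimal illustration for binary labels only; the function name, the data layout, the uniform class prior, and the single accuracy number per labeler are simplifying assumptions of this sketch (Dawid & Skene estimate full per-labeler confusion matrices and class priors):

```python
from collections import defaultdict

def dawid_skene_binary(labels, n_iters=20):
    """EM sketch in the spirit of Dawid & Skene (1979), binary case.
    labels: dict mapping example_id -> list of (labeler_id, label), label in {0, 1}.
    Returns (posterior Pr{label = 1} per example, accuracy estimate per labeler)."""
    labelers = {w for votes in labels.values() for w, _ in votes}
    # Start from equal qualities: posterior = fraction of positive votes (majority-style).
    post = {ex: sum(l for _, l in votes) / len(votes) for ex, votes in labels.items()}
    for _ in range(n_iters):
        # M-step: accuracy = expected fraction of a labeler's votes agreeing with the "true" label.
        agree, total = defaultdict(float), defaultdict(float)
        for ex, votes in labels.items():
            for w, l in votes:
                agree[w] += post[ex] if l == 1 else (1 - post[ex])
                total[w] += 1
        acc = {w: agree[w] / total[w] for w in labelers}
        # E-step: posterior of the true label given the votes and accuracies (uniform prior).
        for ex, votes in labels.items():
            p1 = p0 = 1.0
            for w, l in votes:
                p1 *= acc[w] if l == 1 else 1 - acc[w]
                p0 *= acc[w] if l == 0 else 1 - acc[w]
            post[ex] = p1 / (p1 + p0)
    return post, acc
```

On data with a consistently wrong labeler, the loop learns a low accuracy for that labeler and discounts its votes, which plain majority voting cannot do.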


Selective Repeated-Labeling

We have seen:
– With noise, and with enough (noisy) examples, getting multiple labels is better than single-labeling

Can we do better?

Selectively allocate the repeated-labeling resources to the data points with the highest uncertainty score, e.g. {+,-,+,+,-,+,+} vs. {+,+,+,+}


Natural Candidate: Entropy

Entropy is a natural measure of label uncertainty:

E({+,+,+,+,+,+})=0 E({+,-, +,-, +,- })=1

Strategy: Get more labels for high-entropy examples

E(S) = −(|S+|/|S|)·log2(|S+|/|S|) − (|S−|/|S|)·log2(|S−|/|S|)

where S+ are the positive labels and S− the negative labels in S
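The formula transcribes directly into code; a minimal sketch (`label_entropy` is a hypothetical helper name):

```python
from math import log2

def label_entropy(pos, neg):
    """Entropy of an observed multiset of labels: pos positive and neg negative votes."""
    n = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:  # 0·log(0) is taken as 0
            p = count / n
            e -= p * log2(p)
    return e
```

A unanimous set like {+,+,+,+,+,+} gives entropy 0, and an evenly split set like {+,−,+,−,+,−} gives entropy 1, as in the examples above.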


What Not to Do: Use Entropy

[Figure: Labeling quality vs. number of labels (waveform, p=0.6) for entropy-based selection (ENTROPY) and uniform round robin (UNF).]

Entropy improves at first, but hurts in the long run

Why Not Entropy?

In the presence of noise, entropy will stay high even after many labels

Entropy is scale invariant
– (3+, 2−) has the same entropy as (600+, 400−)
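The scale invariance is easy to check numerically; a quick sketch (`H` is a hypothetical helper recomputing the entropy of a vote split):

```python
from math import log2

def H(pos, neg):
    # Binary entropy of an observed vote split.
    n = pos + neg
    return -sum((c / n) * log2(c / n) for c in (pos, neg) if c)

# Entropy depends only on the ratio of votes, not their volume, so five
# votes and a thousand votes with the same 3:2 split look equally "uncertain".
assert abs(H(3, 2) - H(600, 400)) < 1e-12
```

This is exactly why entropy cannot tell that (600+, 400−) carries overwhelming evidence for “+” while (3+, 2−) carries almost none.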


Estimating Label Uncertainty (LU)

Observe +’s and –’s and compute Pr{+|obs} and Pr{-|obs}

Label uncertainty = tail of beta distribution

[Figure: Beta probability density function over [0, 1]; S_LU is the tail mass of the posterior beta distribution on the far side of 0.5.]
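The beta-tail score can be computed without any special libraries, since for integer vote counts the Beta CDF reduces to a binomial sum. A sketch assuming a uniform Beta(1, 1) prior on the probability of the positive class (`label_uncertainty` is a hypothetical helper name):

```python
from math import comb

def label_uncertainty(pos, neg):
    """S_LU: tail of the posterior Beta(pos+1, neg+1) on the far side of 0.5.
    Uses the identity I_x(a, b) = Pr{Binomial(a+b-1, x) >= a} for integer a, b,
    so no scipy is needed."""
    a, b = pos + 1, neg + 1
    n = a + b - 1
    cdf_half = sum(comb(n, j) * 0.5**n for j in range(a, n + 1))  # Pr{p <= 0.5}
    return min(cdf_half, 1 - cdf_half)
```

Unlike entropy, which stays near 0.88 for both (7+, 3−) and (14+, 6−), this score falls from about 0.113 to about 0.039 as evidence accumulates at the same split ratio, so well-supported examples stop attracting new labels.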

Label Uncertainty

p=0.7 5 labelers

(3+, 2-) Entropy ~ 0.97


Label Uncertainty

p=0.7 10 labelers

(7+, 3-) Entropy ~ 0.88


Label Uncertainty

p=0.7 20 labelers

(14+, 6-) Entropy ~ 0.88


Comparison


[Figure: Labeling quality vs. number of labels (waveform, p=0.6) for UNF (uniform round robin), MU, LU, and LMU; label uncertainty (LU) outperforms the uniform round-robin baseline.]


Model Uncertainty (MU)

However, labelers are not our only source of labels

A classifier can also give us labels!

Model uncertainty: get more labels for ambiguous/difficult examples

Intuitively: make sure that difficult cases are correct

[Figure: A 2-D cloud of + and − examples on either side of a decision boundary; the examples marked “?” near the boundary are the ambiguous/difficult cases.]


Label + Model Uncertainty

Label and model uncertainty (LMU): avoid examples where either strategy is certain

S_LMU = sqrt(S_LU · S_MU)
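One way to read the combination: the geometric mean is small whenever either factor is small, so an example is skipped as soon as either strategy is certain about it. A one-line sketch assuming the geometric-mean form (`lmu_score` is a hypothetical helper name):

```python
from math import sqrt

def lmu_score(s_lu, s_mu):
    """Combined score: low whenever either the label-uncertainty or the
    model-uncertainty score is low, i.e. whenever either strategy is
    already certain about the example."""
    return sqrt(s_lu * s_mu)
```

An arithmetic mean would not have this property: an example the labels have settled could still rank highly just because the model finds it hard.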

Comparison


[Figure: Labeling quality vs. number of labels (waveform, p=0.6) for UNF (uniform round robin), MU, LU, and LMU; label + model uncertainty (LMU) performs best.]

Model uncertainty alone also improves quality


Classification Improvement

[Figure: Accuracy vs. number of labels (spambase, p=0.6) for UNF, MU, LU, and LMU.]


Conclusions

Gathering multiple labels from noisy labelers is a useful strategy

Under high noise, repeated labeling is almost always better than single labeling

Selective repeated-labeling using label and model uncertainty is even more effective


More Work to Do

Estimating the labeling quality of each labeler

Increased compensation vs. labeler quality

Example-conditional quality issues (some examples more difficult than others)

Multiple “real” labels

Hybrid labeling strategies using “learning-curve gradient”

Other Projects

SQoUT project
Structured Querying over Unstructured Text
http://sqout.stern.nyu.edu
Faceted Interfaces

EconoMining project
The Economic Value of User Generated Content
http://economining.stern.nyu.edu


SQoUT: Structured Querying over Unstructured Text

Information extraction applications extract structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire

Information Extraction System

(e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times


SQoUT: The Questions

Text Databases → Extraction System(s) → Output Tuples

1. Retrieve documents from database/web/archive
2. Process documents
3. Extract output tuples

Questions:
1. How do we retrieve the documents?
2. How do we configure the extraction systems?
3. What is the execution time?
4. What is the output quality?

SIGMOD’06, TODS’07, + in progress

EconoMining Project

Show me the Money!

Applications (in increasing order of difficulty)

Buyer feedback and seller pricing power in online marketplaces (ACL 2007)

Product reviews and product sales (KDD 2007)

Importance of reviewers based on economic impact (ICEC 2007)

Hotel ranking based on “bang for the buck” (WebDB 2008)

Political news (MSM, blogs), prediction markets, and news importance

Basic Idea

Opinion mining an important application of information extraction

Opinions of users are reflected in some economic variable (price, sales)

Some Indicative Dollar Values (Positive / Negative)

Natural method for extracting sentiment strength and polarity

good packaging -$0.56

Naturally captures the pragmatic meaning within the given context

captures misspellings as well

Positive? Negative ?

Thanks!

Q & A?