
Page 1:

Machine Learning in Practice
Lecture 10

Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute

Page 2:

http://www.theallusionist.com/wordpress/wp-content/uploads/gambling8.jpg

Page 3:

Plan for the Day

Announcements
Questions? Quiz answer key posted

Today’s Data Set: Prevalence of Gambling
Exploring the Concept of Cost

http://www.casino-gambling-dictionary.com/

Page 4:

Quiz Notes

Page 5:

Leave-One-Out Cross Validation

On each fold, train on all but 1 data point, test on that 1 data point.
Pro: Maximizes the amount of training data used on each fold
Con: Not stratified
Con: Takes a long time on large data sets
Best to use only when you have a very small amount of training data; it is only needed when 10-fold cross validation is not feasible because of lack of data.
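
For concreteness, here is a minimal sketch of leave-one-out cross validation in Python with scikit-learn; this is an illustration on my part rather than the lecture's Weka setup, and the data set and classifier are placeholders:

    # Minimal sketch of leave-one-out cross validation with scikit-learn.
    # The data set and classifier are placeholders for illustration.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)

    loo = LeaveOneOut()  # one fold per instance: train on n-1 points, test on 1
    scores = cross_val_score(GaussianNB(), X, y, cv=loo)

    # Each fold scores 1 (correct) or 0 (wrong), so the mean is accuracy.
    print("LOOCV accuracy: %.3f over %d folds" % (scores.mean(), len(scores)))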

Page 6:

632 Bootstrap

A method for estimating performance when you have a small data set; consider it an alternative to leave-one-out cross validation.
Sample n times with replacement to create the training set.
Some instances will be repeated; some will be left out, and those will be your test set.
About 63% of the instances in the original set will end up in the training set.

Page 7:

632 Bootstrap

Estimating error over the training set will be an optimistic estimate of performance, because you trained on these examples.
Estimating error over the test set will be a pessimistic estimate of the error, because the 63/37 split gives you less training data than a 90/10 split.
Estimate error by combining the optimistic and pessimistic estimates:
error = .632 * pessimistic_estimate + .368 * optimistic_estimate
Iterate several times and average the performance estimates.
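
A minimal sketch of this estimate in Python, again assuming scikit-learn and a placeholder data set and classifier rather than the lecture's Weka setup:

    # Minimal sketch of the .632 bootstrap performance estimate.
    # The data set and classifier are placeholders for illustration.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    n = len(X)
    rng = np.random.default_rng(0)

    estimates = []
    for _ in range(50):  # iterate several times and average
        # Sample n instances with replacement for the training set.
        train_idx = rng.integers(0, n, size=n)
        # Instances never drawn (about 37% of them) form the test set.
        test_idx = np.setdiff1d(np.arange(n), train_idx)

        model = GaussianNB().fit(X[train_idx], y[train_idx])
        optimistic = 1 - model.score(X[train_idx], y[train_idx])  # training error
        pessimistic = 1 - model.score(X[test_idx], y[test_idx])   # held-out error
        estimates.append(0.632 * pessimistic + 0.368 * optimistic)

    print("bootstrap .632 error estimate: %.3f" % np.mean(estimates))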

Page 8:

Prevalence of Gambling

Page 9:

Gambling Prevalence

Goal is to predict how often people who fit in a particular demographic group (e.g., male versus female, or white versus black versus Hispanic versus other) are classified as having a particular level of gambling risk (at risk, problem, or pathological), either during one specific year or in their lifetime.

Page 14:

Gambling Prevalence

* Risk is the most predictive feature.

Page 17:

Gambling Prevalence

* Demographic is the least predictive feature.

Page 18:

Which algorithm will perform best?

http://www.albanycitizenscouncil.org/Pictures/Gambling2.jpg

Page 19:

Which algorithm will perform best?

Decision Trees: .26 Kappa
Naïve Bayes: .31 Kappa
SMO: .53 Kappa

Page 20:

Decision Trees

* What’s it ignoring and why?

Page 21:

With Binary Splits – Kappa .41

Page 22:

What was different with SMO?

Trained a model for all pairs of classes.
The features that were important for one pairwise distinction were different from those for other pairwise distinctions:
Characteristic=Black was most important for High versus Low (ignored by decision trees)
When and Risk were most important for High versus Medium
Decision trees pay attention to all distinctions at once, so they totally ignored a feature that was important for some pairwise distinctions (see the sketch below).
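
A minimal sketch of this all-pairs scheme in Python with scikit-learn; the linear SVM and placeholder data are illustrative assumptions, not the lecture's Weka SMO configuration:

    # Minimal sketch of one-vs-one (all-pairs) training for a
    # multi-class problem: one binary SVM per pair of classes, so each
    # pairwise model is free to weight features differently.
    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)  # placeholder 3-class data

    clf = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
    print("number of pairwise models:", len(clf.estimators_))  # k*(k-1)/2 = 3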

Page 23:

What was wrong with Naïve Bayes?

Probably just learned noisy probabilities because the data set is small.
Hard to distinguish Low and Medium.

Page 24:

Back to Chapter 5

Page 25:

Thinking about the Cost of an Error: A Theoretical Foundation for Machine Learning Cost

Making the right choice doesn't cost you anything.
Making an error comes with a cost, and some errors cost more than others.
Rather than evaluating your model in terms of accuracy, which treats every error as though it were the same, you can think about average cost.
The real cost is determined by your application.

Page 26:

Unified Framework

Connection between optimization techniques and evaluation methods.
Think about what function you are optimizing; that's what learning is.
Evaluation measures how well you did that optimization, so it makes sense for there to be a deep connection between the learning technique and the evaluation.
New machine learning algorithms are often motivated by modifications to the conceptualization of the cost of an error.

Page 27:

What’s the cost of a gambling mistake?

http://imagecache2.allposters.com/images/pic/PTGPOD/321587~Pile-of-American-Money-Posters.jpg

Page 28:

Thinking About the Practical Cost of an Error

In document retrieval, precision is more important than recall.
You're picking from the whole web, so if you miss some relevant documents it's not a big deal.
Precision matters more because you don't want to have to slog through lots of irrelevant stuff.

Page 29:

Thinking About the Practical Cost of an Error

What if you are trying to predict whether someone will be late? Is it worse to fail to predict that someone will be late when they will be, or vice versa?

Page 30:

Thinking About the Practical Cost of an Error

What if you're trying to predict whether or not a message will get a response?

Page 31:

Thinking About the Practical Cost of an Error

Let's say you are picking out errors in student essays. If you detect an error, you offer the student a correction for it.
What are the implications of missing an error?
What are the implications of imagining an error that doesn't exist?

Page 32:

Cost Sensitive Classification

An example of the connection between the notion of the cost of an error and the training method.
Say you manipulate the cost of different types of errors, and the cost of a decision is computed based on the expected cost.
That affects the function the algorithm is "trying" to optimize: it minimizes expected cost rather than maximizing accuracy.

Page 33:

Cost Sensitive Classification

Cost sensitive classifiers work in two ways:
Manipulate the composition of the training data, by either changing the weight of some instances or artificially boosting the number of instances of some types by strategically including duplicates (see the sketch below)
Manipulate the way predictions are made, selecting the option that minimizes expected cost rather than the most likely choice
In practice it's hard to use cost-sensitive classification in a useful way.
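
As an illustration of the first strategy, here is a minimal sketch of manipulating the training data by duplicating instances of expensive classes; the per-class costs and the tiny data set are made up for illustration:

    # Minimal sketch of cost-sensitive data manipulation: repeat each
    # instance in proportion to the cost of misclassifying its true
    # class, nudging the learner to avoid the expensive errors.
    # The costs and data are made-up placeholders.
    import numpy as np

    def resample_by_cost(X, y, class_cost):
        counts = np.array([max(1, round(class_cost[label])) for label in y])
        idx = np.repeat(np.arange(len(y)), counts)
        return X[idx], y[idx]

    X = np.array([[0.1], [0.2], [0.8], [0.9]])
    y = np.array(["low", "low", "high", "high"])

    # Pretend misclassifying a "high" instance is 3 times as costly.
    X_r, y_r = resample_by_cost(X, y, {"low": 1, "high": 3})
    print(list(y_r))  # each 'high' instance now appears three times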

Page 34:

Cost Sensitive Classification

What if it's 10 times more expensive to make a mistake when selecting Class C?

Expected cost of a decision: Σ_j C_j p_j
The expected cost of predicting class j is computed by multiplying the j column of the cost matrix by the corresponding class probabilities and summing.
The expected cost of selecting C, if the probabilities are computed as A=75%, B=10%, C=15%, is .75*10 + .1*1 = 7.6.

Cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1  10
    B   1   0   1
    C   1   1   0

Page 35:

Cost Sensitive Classification

The expected cost of selecting B, if the probabilities are computed as A=75%, B=10%, C=15%, is .75*1 + .15*1 = .9.
If A is selected, the expected cost is .1*1 + .15*1 = .25.
You can make a choice by minimizing the expected cost of an error.
So in this case, the expected cost is lowest when selecting A, the class with the highest probability.

        A   B   C
    A   0   1  10
    B   1   0   1
    C   1   1   0
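
Putting the last two slides together, here is a minimal sketch in Python of choosing the class with minimum expected cost, using the cost matrix and probabilities from the example:

    # Minimal sketch of minimum-expected-cost prediction, using the
    # cost matrix and class probabilities from the worked example.
    import numpy as np

    classes = ["A", "B", "C"]
    # cost[actual][predicted]: rows are actual, columns are predicted
    cost = np.array([[0, 1, 10],
                     [1, 0,  1],
                     [1, 1,  0]])
    p = np.array([0.75, 0.10, 0.15])  # classifier's class probabilities

    expected = p @ cost  # expected[i] = sum_j p[j] * cost[j][i]
    for c, e in zip(classes, expected):
        print("expected cost of predicting %s: %.2f" % (c, e))  # .25, .90, 7.60
    print("choice:", classes[int(np.argmin(expected))])  # A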

Page 39:

Using Cost Sensitive Classification

* Set up the cost matrix
* Assign a high penalty to the largest error cell

Page 40:

Results

Without cost sensitive classification: .53 Kappa. Using cost sensitive classification increased performance to .55.
The difference is tiny because SMO assigns probability 1 to all predictions, and it is not statistically significant.
SMO with default settings normally predicts one class with confidence 1 and the others with confidence 0, so cost sensitive classification does not have a big effect.

Page 41:

What is the cost of an error?

Assume first that all errors have the same cost.

Quadratic loss: Σ_j (p_j − a_j)²
This is the cost of a decision; j iterates over the classes (A, B, C), p_j is the predicted probability of class j, and a_j is 1 for the correct class and 0 otherwise.
Quadratic loss penalizes you for putting high confidence on a wrong prediction and/or low confidence on a right prediction.

Uniform cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1   1
    B   1   0   1
    C   1   1   0

Page 42:

What is the cost of an error?

Quadratic loss: Σ_j (p_j − a_j)²

If C is right and you say A=75%, B=10%, C=15%:
(.75 − 0)² + (.1 − 0)² + (.15 − 1)² ≈ 1.3

If A is right and you say A=75%, B=10%, C=15%:
(.75 − 1)² + (.1 − 0)² + (.15 − 0)² = .095

The cost is lower when the highest probability is on the correct choice.
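
A minimal sketch of this computation in Python, reproducing the two cases above:

    # Minimal sketch of quadratic loss, reproducing the worked example.
    import numpy as np

    def quadratic_loss(p, correct_index):
        # a_j is 1 for the correct class and 0 otherwise
        a = np.zeros(len(p))
        a[correct_index] = 1.0
        return float(np.sum((p - a) ** 2))

    p = np.array([0.75, 0.10, 0.15])  # probabilities for A, B, C
    print(quadratic_loss(p, 2))  # C is right: 1.295 (about 1.3)
    print(quadratic_loss(p, 0))  # A is right: 0.095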

Page 43:

What is the cost of an error?

Assume all errors have the same cost.

Informational loss: −log_k(p_i)
k is the number of classes, i is the correct class, and p_i is the probability assigned to class i.

If C is right and you say A=75%, B=10%, C=15%: −log_3(.15) = 1.73
If A is right and you say A=75%, B=10%, C=15%: −log_3(.75) = .26
The cost is lower when the highest probability is on the correct choice.
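
A minimal sketch of informational loss in Python, reproducing the two cases above:

    # Minimal sketch of informational loss with log base k, where k is
    # the number of classes, reproducing the worked example.
    import math

    def informational_loss(p, correct_index, k):
        return -math.log(p[correct_index], k)

    p = [0.75, 0.10, 0.15]  # probabilities for A, B, C
    print(round(informational_loss(p, 2, 3), 2))  # C is right: 1.73
    print(round(informational_loss(p, 0, 3), 2))  # A is right: 0.26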

Page 44:

Trade-Offs Between Quadratic Loss and Informational Loss

Quadratic loss pays attention to the probabilities placed on all classes, so you can get partial credit if you put really low probabilities on some of the wrong choices. It is bounded (the maximum value is 2).
Informational loss only pays attention to how you treated the correct prediction. It is more like gambling, and it is not bounded.

Page 45:

Minimum Description Length Principle

Another way of viewing the connection between optimization and evaluation, based on information theory.
Training minimizes how much information you encode in the model: how much information does it take to determine what class an instance belongs to? Information is encoded in your feature space.
Evaluation measures how much information is lost in the classification.
There is a tension between the complexity of the model at training time and information loss at testing time.

Page 46:

Take Home Message

Different types of errors have different costs.
Costs are associated with cells in the confusion matrix; costs may also be associated with the level of confidence with which decisions are made.
There is a connection between the concept of the cost of an error and the learning method: machine learning algorithms optimize a cost function, and that cost function should reflect the real cost in the world.
In cost sensitive classification, the notion of which types of errors cost more can influence classification performance.