Machine Learning in Practice: Lecture 10
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
http://www.theallusionist.com/wordpress/wp-content/uploads/gambling8.jpg
Plan for the Day
- Announcements
- Questions? (Quiz answer key posted)
- Today's Data Set: Prevalence of Gambling
- Exploring the Concept of Cost
http://www.casino-gambling-dictionary.com/
Quiz Notes
Leave-one-out cross validation
- On each fold, train on all but 1 data point, test on that 1 data point
- Pro: Maximizes the amount of training data used on each fold
- Con: Not stratified
- Con: Takes a long time on large data sets
- Best used only when you have a very small amount of training data; only needed when 10-fold cross validation is not feasible because of lack of data
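The procedure above can be sketched in a few lines of Python. Here `train` and `predict` are hypothetical stand-ins for whatever learner you plug in; the toy "majority class" learner below is only for illustration:

```python
# Leave-one-out cross validation: on each fold, train on all but one data
# point and test on the held-out point.

def loocv_accuracy(data, train, predict):
    """data: list of (features, label) pairs."""
    correct = 0
    for i in range(len(data)):
        held_out = data[i]
        training = data[:i] + data[i + 1:]   # all but one data point
        model = train(training)
        correct += int(predict(model, held_out[0]) == held_out[1])
    return correct / len(data)

# Toy illustration: a "majority class" learner.
def train(rows):
    labels = [y for _, y in rows]
    return max(set(labels), key=labels.count)

def predict(model, x):
    return model   # always predict the majority class seen in training

data = [(0, "a"), (1, "a"), (2, "a"), (3, "b")]
print(loocv_accuracy(data, train, predict))  # → 0.75: the lone "b" is always missed
```

Note how every fold uses n−1 of the n instances for training, which is why the estimate uses the training data maximally but is slow on large sets.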
632 Bootstrap
- A method for estimating performance when you have a small data set
- Consider it an alternative to leave-one-out cross validation
- Sample n times with replacement to create the training set
  - Some instances will be repeated
  - Some will be left out; these become your test set
  - About 63.2% of the distinct instances in the original set will end up in the training set
632 Bootstrap (continued)
- Estimating error over the training set gives an optimistic estimate of performance, because you trained on those examples
- Estimating error over the test set gives a pessimistic estimate of the error, because the roughly 63/37 split gives you less training data than a 90/10 split
- Estimate error by combining the optimistic and pessimistic estimates: 0.632 * pessimistic_estimate + 0.368 * optimistic_estimate
- Iterate several times and average the performance estimates
Prevalence of Gambling
Gambling Prevalence
- Goal is to predict how often people who fit in a particular demographic group (e.g., male versus female; White versus Black versus Hispanic versus other) are classified as having a particular level of gambling risk
- At risk, problem, or pathological
- Either during one specific year or in their lifetime
Gambling Prevalence
* Risk is the most predictive feature.
* Demographic is the least predictive feature.
Which algorithm will perform best?
http://www.albanycitizenscouncil.org/Pictures/Gambling2.jpg
Which algorithm will perform best?
- Decision Trees: .26 Kappa
- Naïve Bayes: .31 Kappa
- SMO: .53 Kappa
Decision Trees
* What’s it ignoring and why?
With Binary Splits – Kappa .41
What was different with SMO?
- Trained a model for all pairs of classes
- The features that were important for one pairwise distinction were different from those for other pairwise distinctions
  - Characteristic=Black was most important for High versus Low (ignored by decision trees)
  - When and Risk were most important for High versus Medium
- Decision trees pay attention to all distinctions at once, so they totally ignored a feature that was important for some pairwise distinctions
What was wrong with Naïve Bayes?
- Probably just learned noisy probabilities because the data set is small
- Hard to distinguish Low and Medium
Back to Chapter 5
Thinking About the Cost of an Error: A Theoretical Foundation for Machine Learning Cost
- Making the right choice doesn't cost you anything
- Making an error comes with a cost, and some errors cost more than others
- Rather than evaluating your model in terms of accuracy, which treats every error as though it were the same, you can think about average cost
- The real cost is determined by your application
Unified Framework
- There is a connection between optimization techniques and evaluation methods
- Think about what function you are optimizing; that's what learning is
- Evaluation measures how well you did that optimization, so it makes sense for there to be a deep connection between the learning technique and the evaluation
- New machine learning algorithms are often motivated by modifications to the conceptualization of the cost of an error
What’s the cost of a gambling mistake?
http://imagecache2.allposters.com/images/pic/PTGPOD/321587~Pile-of-American-Money-Posters.jpg
Thinking About the Practical Cost of an Error
- In document retrieval, precision is more important than recall
- You're picking from the whole web, so if you miss some relevant documents it's not a big deal
- Precision matters more: you don't want to have to slog through lots of irrelevant stuff
Thinking About the Practical Cost of an Error
- What if you are trying to predict whether someone will be late?
- Is it worse to predict that someone will be late when they won't, or to miss predicting it when they will?
Thinking About the Practical Cost of an Error
- What if you're trying to predict whether a message will get a response or not?
Thinking About the Practical Cost of an Error
- Let's say you are picking out errors in student essays
- If you detect an error, you offer the student a correction
- What are the implications of missing an error?
- What are the implications of imagining an error that doesn't exist?
Cost Sensitive Classification
- An example of the connection between the notion of the cost of an error and the training method
- Say you manipulate the cost of different types of errors
- The cost of a decision is computed based on the expected cost
- That affects the function the algorithm is "trying" to optimize: minimize expected cost rather than maximize accuracy
Cost Sensitive Classification
- Cost sensitive classifiers work in two ways:
  - Manipulate the composition of the training data (by either changing the weight of some instances or by artificially boosting the number of instances of some types by strategically including some duplicates)
  - Manipulate the way predictions are made: select the option that minimizes expected cost rather than the most likely choice
- In practice it's hard to use cost-sensitive classification in a useful way
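The first approach, reweighting the training data by duplicating instances, can be sketched as follows; the toy data set and the duplication factor are illustrative assumptions, not the lecture's actual setup:

```python
import random

# Manipulating the composition of the training data: artificially boost the
# number of instances of a class whose errors are expensive by strategically
# including duplicates. Toy sketch with hypothetical (features, label) rows.

def boost_class(data, label, factor, seed=0):
    """Add factor-1 extra copies of every instance of `label`."""
    rng = random.Random(seed)
    boosted = list(data)
    extras = [row for row in data if row[1] == label]
    boosted.extend(extras * (factor - 1))
    rng.shuffle(boosted)   # avoid all duplicates sitting together
    return boosted

data = [(0, "A"), (1, "A"), (2, "B"), (3, "C")]
boosted = boost_class(data, "C", factor=10)
print(len(boosted))  # → 13: the original 4 rows plus 9 extra copies of the C row
```

A learner that maximizes accuracy on the boosted set now implicitly treats errors on class C as ten times as costly; instance weights achieve the same effect without the duplication.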
Cost Sensitive Classification
- What if it's 10 times more expensive to make a mistake when selecting Class C?
- Expected cost of a decision: Σ_j C_j p_j
- The cost of predicting class j is computed by multiplying column j of the cost matrix by the corresponding class probabilities
- The expected cost of selecting C, if the probabilities are computed as A=75%, B=10%, C=15%, is .75*10 + .1*1 = 7.6

Cost matrix (rows = actual class, columns = predicted class):

         A    B    C
    A    0    1   10
    B    1    0    1
    C    1    1    0
Cost Sensitive Classification
- The expected cost of selecting B, with probabilities A=75%, B=10%, C=15%, is .75*1 + .15*1 = .9
- If A is selected, the expected cost is .1*1 + .15*1 = .25
- You can make a choice by minimizing the expected cost of an error
- So in this case, the expected cost is lowest when selecting A, the class with the highest probability

(Same cost matrix as above: all errors cost 1, except that predicting C when the actual class is A costs 10.)
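The decision rule in this worked example can be written out directly; the cost matrix and probabilities below are the ones from the slide:

```python
# Choose the class that minimizes expected cost rather than the most probable
# class. cost[actual][predicted] matches the slide's matrix: predicting C
# when the actual class is A costs 10; every other error costs 1.

classes = ["A", "B", "C"]
cost = {
    "A": {"A": 0, "B": 1, "C": 10},
    "B": {"A": 1, "B": 0, "C": 1},
    "C": {"A": 1, "B": 1, "C": 0},
}
probs = {"A": 0.75, "B": 0.10, "C": 0.15}

def expected_cost(predicted):
    # Sum over actual classes j of C_j * p_j for this predicted column
    return sum(probs[actual] * cost[actual][predicted] for actual in classes)

for c in classes:
    print(c, expected_cost(c))   # ≈ A: 0.25, B: 0.9, C: 7.6

best = min(classes, key=expected_cost)
print(best)  # → "A": it minimizes expected cost and has the highest probability
```

Here minimizing expected cost agrees with picking the most probable class, but with a skewed enough cost matrix the two rules can disagree.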
Using Cost Sensitive Classification
* Set up the cost matrix
* Assign a high penalty to the largest-error cell
Results
- Without cost sensitive classification: .53 Kappa
- Using cost sensitive classification increased performance to .55
- Tiny difference because SMO assigns probability 1 to all predictions; not statistically significant
- SMO with default settings normally predicts one class with confidence 1 and the others with confidence 0, so cost sensitive classification does not have a big effect
What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j − a_j)²
  - j iterates over the classes (A, B, C)
  - p_j is the predicted probability of class j; a_j is 1 for the actual class and 0 otherwise
- Penalizes you for putting high confidence on a wrong prediction and/or low confidence on a right prediction

Uniform cost matrix (rows = actual class, columns = predicted class):

         A    B    C
    A    0    1    1
    B    1    0    1
    C    1    1    0
What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j − a_j)²; j iterates over the classes (A, B, C)
- If C is right and you say A=75%, B=10%, C=15%: (.75 − 0)² + (.1 − 0)² + (.15 − 1)² ≈ 1.3
- If A is right and you say A=75%, B=10%, C=15%: (.75 − 1)² + (.1 − 0)² + (.15 − 0)² ≈ .09
- Lower cost if the highest probability is on the correct choice

(Same uniform cost matrix as above.)
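A short sketch that reproduces the two worked examples above (the exact sums are 1.295 and 0.095, which the slide rounds):

```python
# Quadratic loss: sum over classes j of (p_j - a_j)^2, where a_j is 1 for the
# correct class and 0 otherwise.

def quadratic_loss(probs, correct):
    return sum((p - (1.0 if label == correct else 0.0)) ** 2
               for label, p in probs.items())

probs = {"A": 0.75, "B": 0.10, "C": 0.15}

print(quadratic_loss(probs, "C"))  # ≈ 1.295: high confidence on the wrong class
print(quadratic_loss(probs, "A"))  # ≈ 0.095: high confidence on the right class
```

Because every class's probability enters the sum, you earn partial credit for putting low probabilities on the wrong classes, which is the trade-off discussed later against informational loss.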
What is the cost of an error?
- Assume all errors have the same cost
- Informational loss: −log_k(p_i)
  - k is the number of classes
  - i is the correct class
  - p_i is the probability placed on class i
- If C is right and you say A=75%, B=10%, C=15%: −log_3(.15) ≈ 1.73
- If A is right and you say A=75%, B=10%, C=15%: −log_3(.75) ≈ .26
- Lower cost if the highest probability is on the correct choice
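The same two examples for informational loss, using log base k (the number of classes) as defined above:

```python
import math

# Informational loss: -log_k(p_i), where k is the number of classes and p_i is
# the probability placed on the correct class i. Only that one probability
# matters; the probabilities on the wrong classes are ignored.

def informational_loss(probs, correct):
    k = len(probs)
    return -math.log(probs[correct], k)

probs = {"A": 0.75, "B": 0.10, "C": 0.15}

print(informational_loss(probs, "C"))  # ≈ 1.73
print(informational_loss(probs, "A"))  # ≈ 0.26
```

Note that as p_i approaches 0 the loss grows without bound, which is why the next slide calls it unbounded and "more like gambling".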
Trade-Offs Between Quadratic Loss and Informational Loss
- Quadratic loss pays attention to the probabilities placed on all classes
  - So you can get partial credit if you put really low probabilities on some of the wrong choices
  - Bounded (maximum value is 2)
- Informational loss only pays attention to how you treated the correct prediction
  - More like gambling
  - Not bounded
Minimum Description Length Principle
- Another way of viewing the connection between optimization and evaluation; based on information theory
- Training minimizes how much information you encode in the model
  - How much information does it take to determine what class an instance belongs to?
  - Information is encoded in your feature space
- Evaluation measures how much information is lost in the classification
- There is a tension between the complexity of the model at training time and information loss at testing time
Take Home Message
- Different types of errors have different costs
  - Costs are associated with cells in the confusion matrix
  - Costs may also be associated with the level of confidence with which decisions are made
- There is a connection between the concept of the cost of an error and the learning method
  - Machine learning algorithms are optimizing a cost function
  - The cost function should reflect the real cost in the world
- In cost sensitive classification, the notion of which types of errors cost more can influence classification performance