MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000...
Transcript of MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000...
![Page 1: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/1.jpg)
Transformations
Merlise Clyde
Readings: Gelman & Hill Ch 2-4
![Page 2: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/2.jpg)
Assumptions of Linear Regression
Yi = β0 + β1Xi1 + β2Xi2 + . . . βpXip + εi
I Model Linear in Xj but Xj could be a transformation of theoriginal variables
I εi ∼ N(0, σ2)I Yi
ind∼ N(β0 + β1Xi1 + β2Xi2 + . . . βpXip, σ2)
I correct mean functionI constant varianceI independent errorsI Normal errors
![Page 3: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/3.jpg)
Animals
Read in Animal data from MASS. The data set containsmeasurements on body weight and brain weight.
Let’s try to predict brain weight (size) from body weight.
library(MASS)data(Animals)brain.lm = lm(brain ~ body, data=Animals)
![Page 4: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/4.jpg)
Diagnostic Plots## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
540 550 560 570
−10
0010
0030
0050
00
Fitted values
Res
idua
ls
Residuals vs Fitted
African elephant
Asian elephant
Human
−2 −1 0 1 2
−1
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
African elephant
Asian elephant
Brachiosaurus
540 550 560 570
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationAfrican elephant
Asian elephant
Brachiosaurus
0.0 0.2 0.4 0.6 0.8 1.0
−2
−1
01
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.50.51
Residuals vs Leverage
Brachiosaurus
African elephant
Asian elephant
![Page 5: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/5.jpg)
Leverage plot
0 5 10 15 20 25
0.0
0.2
0.4
0.6
0.8
1.0
Index
hatv
alue
s(br
ain.
lm)
Energetic students: how I should plot with ggplot?
![Page 6: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/6.jpg)
Outliers and Influential Points
Flag outliers after Bonferroni Correction
pval = 2*(1 - pt(abs(rstudent(brain.lm)), brain.lm$df -1))rownames(Animals)[pval < .05/nrow(Animals)]
## [1] "Asian elephant" "African elephant"
Cook’s Distance > 1
rownames(Animals)[cooks.distance(brain.lm) > 1]
## [1] "Brachiosaurus"
![Page 7: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/7.jpg)
Remove Influential Point & Refit
brain2.lm = lm(brain ~ body, data=Animals,subset = !cooks.distance(brain.lm)>1)
par(mfrow=c(2,2)); plot(brain2.lm)
500 1000 1500 2000
−20
000
2000
Fitted values
Res
idua
ls
Residuals vs Fitted
African elephantAsian elephant
Dipliodocus
−2 −1 0 1 2
−2
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
African elephant
Asian elephant
Dipliodocus
500 1000 1500 2000
0.0
0.5
1.0
1.5
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationAfrican elephant
Asian elephant
Dipliodocus
0.0 0.1 0.2 0.3 0.4 0.5
−2
01
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance10.5
0.51
Residuals vs Leverage
Dipliodocus
African elephant
Triceratops
![Page 8: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/8.jpg)
Keep removing points?
500 1000 1500 2000 2500 3000
−40
000
2000
4000
Fitted values
Res
idua
ls
Residuals vs Fitted
Asian elephant African elephant
Triceratops
−2 −1 0 1 2
−4
−2
02
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
Triceratops
African elephantAsian elephant
500 1000 1500 2000 2500 3000
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationTriceratops
African elephantAsian elephant
0.0 0.1 0.2 0.3 0.4 0.5 0.6
−4
−2
02
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.5
0.51
Residuals vs Leverage
Triceratops
African elephantAsian elephant
![Page 9: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/9.jpg)
And another one bites the dust
0 1000 2000 3000 4000 5000 6000
−10
000
500
1500
Fitted values
Res
idua
ls
Residuals vs Fitted
Asian elephant
Human
African elephant
−2 −1 0 1 2
−4
−2
02
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
Asian elephant
African elephant
Human
0 1000 2000 3000 4000 5000 6000
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationAsian elephant African elephant
Human
0.0 0.2 0.4 0.6 0.8
−4
−2
02
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.50.51
Residuals vs Leverage
African elephant
Asian elephant
Human
![Page 10: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/10.jpg)
and another one
0 1000 2000 3000 4000
−50
00
500
1000
Fitted values
Res
idua
ls
Residuals vs Fitted
Human
CowHorse
−2 −1 0 1 2
−1
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
Human
Asian elephant
Cow
0 1000 2000 3000 4000
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationHuman
Asian elephant
Cow
0.0 0.2 0.4 0.6 0.8
−2
−1
01
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.5
0.51
Residuals vs Leverage
Asian elephant
Human
Cow
![Page 11: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/11.jpg)
And they just keep coming!
Figure 1: Walt Disney Fantasia
![Page 12: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/12.jpg)
Plot of Original Data (what you should always do first!)
library(ggplot2)ggplot(Animals, aes(x=body, y=brain)) +
geom_point() +xlab("Body Weight") + ylab("Brain Weight")
0
2000
4000
0 25000 50000 75000
Body Weight
Bra
in W
eigh
t
![Page 13: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/13.jpg)
Log Transform
Body Weight
Fre
quen
cy
0e+00 2e+04 4e+04 6e+04 8e+04 1e+05
05
1015
2025
log(Body Weight)
Fre
quen
cy
0 5 10
01
23
45
6Who can reproduce this slide using ggplot? Tell me how on Piazza!Even better make a pull request!
![Page 14: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/14.jpg)
Plot of Transformed DataAnimals= mutate(Animals, log.body = log(body))ggplot(Animals, aes(log.body, brain)) + geom_point()
0
2000
4000
−4 0 4 8 12
log.body
brai
n
#plot(brain ~ body, Animals, log="x")
![Page 15: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/15.jpg)
Diagnostics with log(body)
−500 0 500 1000 1500
−20
000
2000
5000
Fitted values
Res
idua
ls
Residuals vs Fitted
15
7
26
−2 −1 0 1 2
−1
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
15
7
26
−500 0 500 1000 1500
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−Location15
7
26
0.00 0.05 0.10 0.15−
11
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
0.5
1
Residuals vs Leverage
15
7
26
Variance increasing with mean
![Page 16: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/16.jpg)
Try Log-LogAnimals= mutate(Animals, log.brain= log(brain))ggplot(Animals, aes(log.body, log.brain)) + geom_point()
0
2
4
6
8
−4 0 4 8 12
log.body
log.
brai
n
#plot(brain ~ body, Animals, log="xy")
![Page 17: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/17.jpg)
Diagnostics with log(body) & log(brain)
2 4 6 8
−4
−2
01
23
Fitted values
Res
idua
ls
Residuals vs Fitted
6 2616
−2 −1 0 1 2
−2
−1
01
2
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
6 2616
2 4 6 8
0.0
0.5
1.0
1.5
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−Location6 26
16
0.00 0.05 0.10 0.15
−2
−1
01
2
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance 0.5
0.5
Residuals vs Leverage
26616
![Page 18: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/18.jpg)
Optimal Transformation for NormalityThe BoxCox procedure can be used to find “best” powertransformation λ of Y (for positive Y) for a given set of transformedpredictors.
Ψ(Y, λ) ={
Yλ−1λ if λ 6= 0
log(Y) if λ = 0
Find value of λ that maximizes the likelihood derived fromΨ(Y, λ) ∼ N(Xβλ, σ
2λ) (need to obtain distribution of Y first)
Find λ to minimize
RSS(λ) = ‖ΨM(Y, λ)− Xβ̂λ‖2
ΨM(Y, λ) ={
(GM(Y)1−λ(Yλ − 1)/λ if λ 6= 0GM(Y) log(Y) if λ = 0
where GM(Y) = exp(∑
log(Yi )/n) (Geometric mean)
![Page 19: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/19.jpg)
boxcox in R: Profile likelihood
library(MASS)boxcox(braintransX.lm)
−2 −1 0 1 2
−25
0−
200
−15
0−
100
−50
λ
log−
Like
lihoo
d
95%
![Page 20: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/20.jpg)
Caveats
I Boxcox transformation depends on choice of transformations ofX ’s
I For choice of X transformation use boxTidwell inlibrary(car)
I transformations of X ’s can reduce leverage values (potentialinfluence)
I if the dynamic range of Y or X is less than 1 or 10 (iemax/min) then transformation may have little effect
I transformations such as logs may still be useful forinterpretability
I outliers that are not influential may still
![Page 21: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/21.jpg)
Review of Last Class
I In the model with both response and predictor log transformed,are dinosaurs outliers?
I should you test each one individually or as a group; if as agroup how do you think you would you do this using lm?
I do you think your final model is adequate? What else mightyou change?
![Page 22: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian](https://reader034.fdocument.pub/reader034/viewer/2022042420/5f36bb7d5cbf8553ed1909f6/html5/thumbnails/22.jpg)
Check Your Prediction SkillsAfter you determine whether dinos can stay or go and refine yourmodel, what about prediction?
I I would like to predict Aria’s brain size given her current weightof 259 grams. Give me a prediction and interval estimate.
I Is her body weight within the range of the data in Animals orwill you be extrapolating? What are the dangers here?
I Can you find any data on Rose-Breasted Cockatoo brain sizes?Are the values in the prediction interval?