Stats chapter 3
Uploaded by richard-ferreria
Chapter 3
Examining Relationships
3.1 SCATTERPLOTS AND CORRELATION
Basics
• Up until now, we have been concerned with “one-variable data”
• We are about to examine the relationship between two variables.
• Do two variables have to be related?
Variables
• Response variable
– This is the variable that is measured in a study
– The “outcome” of the study
– We can think of it as our “dependent variable”
– ( y )
Variables
• Explanatory variable
– May (or may not) influence or change the response variable
– We often would like to show that different values of the explanatory variable will affect the response
– “independent variable”
– ( x )
Variables
• More often than not, explanatory v. response is just a vocabulary choice.
• Just because we are calling a variable the “response variable” does not mean that the corresponding “explanatory variable” causes change!
• We are content right now to just examine if a relationship exists between the two variables
Scatterplots
• Shows the relationship between two quantitative variables
• Each data pair is represented by a point
• x-coordinate is the value of the explanatory variable
• y-coordinate is the value of the response variable
• Be sure to label and scale both axes!
• Quickly automated using the TI!
Scatterplots on the TI-84
• [stat], [1] (Edit)
• Enter the explanatory variable values in “L1”
• Enter the response variable values in “L2”
– Make sure your L1 and L2 entries correspond
• [2nd], [Y=] (STATPLOT), [1]
– You can define a number of plots from here
• Turn “ON” the plot
• Choose the scatterplot (first icon)
• Xlist: “L1”
• Ylist: “L2”
• [zoom], [9] (ZoomStat)
– I recommend starting with this zoom. Examine and take note of the window!
Scatterplots on the TI89
• From the “apps” choose the “stat/list”
• Enter the explanatory variable values in “list1”
• Enter the response variable values in “list2”
• [F2] (plots), [F1] (define)
• Plot Type: “scatter”
• x: “list1”
• y: “list2”
• [ENTER]
• (ZoomData)
Interpreting a Scatterplot
Use the following list when asked to comment on a scatterplot/relationship between 2 variables.
1. Direction/Association: is the slope positive or negative?
2. Form:
   1. Linear or nonlinear? If nonlinear, what is the relationship? (more on this later)
   2. Are there any clusters? How many?
3. Strength of relationship: strong, moderate, weak?
4. Outliers? Outliers can lie in the x-direction, the y-direction, or both
Categorical data
• We can add categorical information to a scatterplot by using multiple marker types
• Example: all marks that represent dogs are a box, all marks that represent cats are circles
• Sometimes differing patterns will appear when categorical information is added!
3.1 A
• P173 #1, 4, 5, 7, 9
Correlation
• One way to measure the strength of a linear relationship is to calculate the statistic “r”
• The statistic r measures both the strength and the direction of the relationship
• r is known as the “correlation coefficient” and measures the quantity “correlation”
• You should not use the word “correlation” unless you mean r.
Correlation
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
• The above formula is quite time consuming!
• We will compute r on a small set of data.
• Thankfully, we can compute r using our TI (more on this later)
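A minimal sketch of the formula for r, worked on a small data set. The data values and the function name `correlation` are made up for illustration; only the formula itself comes from the slide.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(f"r = {r:.4f}")
```

Each term in the sum is a product of two z-scores, so points that sit above average in both x and y (or below average in both) push r upward.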
Correlation
• r = 0 indicates “no linear relationship”
• r = 1 indicates “perfect line with positive slope”
• r = -1 indicates “perfect line with negative slope”
• Remember -1 ≤ r ≤ 1
Correlation
Cautions
• Correlation requires both variables to be quantitative
• Correlation does not describe curved relationships
• Correlation is not resistant: outliers have a strong effect on r
• Correlation is not a complete summary of 2-variable data
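To see what "not resistant" means in practice, here is a small sketch that computes r for a perfectly linear data set and again after adding one outlier. The data and the helper `correlation` are hypothetical, chosen only to make the effect obvious.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r computed from standardized values (divisor n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up data: five points exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = correlation(xs, ys)

# Same data plus a single outlier at (6, 0).
r_outlier = correlation(xs + [6], ys + [0])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```

One extra point is enough to turn a perfect linear relationship into a weak one, which is why a scatterplot should always accompany r.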
Assignment 3.1B
• P188 #13, 16, 19, 20, 23, 24
3.2 LEAST-SQUARES REGRESSION
Regression Line
• A line (linear equation) that describes the relationship between two variables
• Naturally, just calling a line a “regression line” does not mean that it does an accurate job describing a relationship!
• If you had done this in an algebra class, you probably just “eyeballed” a relationship or found the equation of a line that connected two points in the scatter.
Regression Line
• Regression lines in statistics are a bit “backwards” from what you learned in algebra: the intercept ‘a’ is written first, as in y = a + bx
• a = y-intercept
– This is the predicted value of the response variable when the explanatory variable is zero.
• b = slope
– This is the average amount the response variable changes for every one-unit change in the explanatory variable.
• You will be asked to interpret both the values of ‘a’ and ‘b’
Extrapolation and Interpolation
Interpolation
- Use the regression line to predict values of the response variable for an explanatory variable value within the data range.
Extrapolation
- Use the regression line to predict values of the response variable for an explanatory variable value outside the data range.
- As you might suspect, interpolation good, extrapolation bad
- OK, not really. You need to use results obtained from extrapolation with great caution.
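One way to keep the distinction honest is to label every prediction with whether it falls inside the observed x range. The sketch below assumes a line y = a + bx has already been fitted; the function name, coefficients, and range are all hypothetical.

```python
def predict(a, b, x, x_min, x_max):
    """Return the predicted y and whether x is inside the observed data range."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

# Hypothetical line y = 1 + 2x fitted to data whose x values ran from 0 to 5.
print(predict(1.0, 2.0, 3, 0, 5))    # x inside the data range
print(predict(1.0, 2.0, 10, 0, 5))   # x outside the data range: use with caution
```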
Least-Squares Regression Line (LSRL)
• A good regression line should minimize the vertical distance between the actual y-value of a point in the scatter and the corresponding y-value on the regression line.
• This distance (y_actual – y_predicted) is known as a residual.
Least-Squares Regression Line
• The LSRL minimizes the sum of the squares of the residuals
Least-Squares Regression Line
Computation of the LSRL
1) Obtain xbar, ybar, r, sx and sy
2) Compute ‘b’
$$ b = r \frac{s_y}{s_x} $$
3) Compute ‘a’
$$ a = \bar{y} - b\bar{x} $$
(this is from $\bar{y} = a + b\bar{x}$)
4) Give the equation of the LSRL
The regression line always goes through (xbar, ybar)
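The four computation steps can be sketched as follows, using b = r·sy/sx and a = ybar − b·xbar. The data set and the function name `lsrl` are made up for illustration.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    # Step 1: xbar, ybar, r, sx, sy
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    # Step 2: slope
    b = r * s_y / s_x
    # Step 3: intercept, from ybar = a + b * xbar
    a = y_bar - b * x_bar
    return a, b

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = lsrl(xs, ys)
# Step 4: the equation of the LSRL
print(f"y-hat = {a:.2f} + {b:.2f} x")
```

Plugging xbar into the fitted line returns ybar, which is the "always goes through (xbar, ybar)" fact in action.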
Least-Squares Regression Line
• The equation of your line should be as follows
$$ \hat{y} = a + bx $$
or
(resp var) = a + b (expl var)
• It’s called “y hat”
• In stats, “hats” indicate predicted values
• The example below is an example of the second notation.
(fat gain) = 3.505 – 0.00344 × (NEA change)
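As a quick illustration of reading ‘a’ and ‘b’ off this equation, the sketch below evaluates the fat gain line at a few NEA change values. The coefficients come from the slide; the function name and the input values are hypothetical.

```python
def predicted_fat_gain(nea_change):
    """Predicted fat gain from the slide's line: 3.505 - 0.00344 * (NEA change)."""
    a = 3.505      # intercept: predicted fat gain when NEA change is zero
    b = -0.00344   # slope: change in predicted fat gain per one-unit NEA change
    return a + b * nea_change

# Hypothetical NEA change values, chosen only to show the pattern.
for nea in (0, 100, 400):
    print(nea, round(predicted_fat_gain(nea), 3))
```

Each additional unit of NEA change lowers the predicted fat gain by 0.00344, which is the interpretation of the slope.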
Reading a printout
Reading a printout
As part of the “great compromise of 1998,” you will be required to interpret a printout like the one above
Reading a printout
This is the value of ‘a.’ Look for the line that says “Constant”
a = 1.08921
Reading a printout
This is the value of ‘b.’ It is the coefficient on the line with the explanatory variable
b = 0.1889
Reading a printout
The LSRL for this printout is:
(Gas Used) = 1.08921 + 0.1889 (degree-days)
Assignment 3.2A
• P204 #29, 32, 33, 36, 38
LSRL on the TI83/84
1. Input data in L1 and L2
2. From home, [stat], “CALC,” [8] (LinReg a+bx)
3. On the home screen enter the variable list: “LinReg L1, L2, Y1” (this will copy and paste the LSRL into Y1)
AMAZING! It computes “r” for you, too!
4. [zoom], [9] (ZoomStat)
Take a good look at your LSRL!
LSRL on the TI89
1. On the “stat list” app, input data in “list1” and “list2”
2. [F4] (calc)
3. Choose LinReg a+bx
4. Select “list1” for the expl var and “list2” for the resp var
5. Select “y1” for “store list”
6. [ENTER] and behold the magic!
7. [F2], “zoom data”
Residuals
• (This is where the analysis begins)
• Recall that the residual of a point is y – yhat (y_actual - y_predicted).
• Luckily when you compute an LSRL, your calculator will automatically compute the residuals and place them in a list called “RESID”
– Keep scrolling to the right
Residual Plots
• A residual plot is a type of scatter plot where the x-coordinate is the expl var of an observation and the y-coordinate is the residual of the observation.
• Scatter plot of “expl var” vs. “resid”
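A minimal sketch of what the calculator stores in “RESID”: fit the line, then subtract each predicted value from the actual one. The data and the helper name `fit_lsrl` are made up; the slope is computed from sums of deviations, which is algebraically the same as r·sy/sx.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx            # equals r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_lsrl(xs, ys)

# residual = y_actual - y_predicted
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# These (x, residual) pairs are the points of the residual plot.
for x, e in zip(xs, residuals):
    print(x, round(e, 2))
```

The residuals from a least-squares line always sum to zero (up to rounding), so a residual plot is centered on the horizontal axis.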
Residual Plots
To create a residual plot on your TI,
1. Create a LSRL for your data
2. Choose a scatterplot from the [stat plot] menu
3. Set “Xlist: L1”
4. Set “Ylist: RESID” ([2nd], [stat], “NAMES”)
5. Turn off all other plots and graphs
6. [ZOOM], [9]
Residual Plots
• Residual plots tell us whether a linear model was a good choice for our data.
• We want the residual plot to look like an unstructured scatter of points
• The presence of a curve or any other pattern indicates that the linear model might not be the best choice.
Residual Plots
• A “fan-shaped” pattern (vuvuzela?) indicates that the linear model predicts well only at one end of the x range (larger or smaller values of x), where the residuals are less spread out
• Residuals should be small in value
– Standard deviation of residuals should be small
$$ s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} $$
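The standard deviation of the residuals can be sketched directly from its formula. The data are made up, and the coefficients a = 2.2, b = 0.6 are the least-squares line for that made-up data, assumed already computed.

```python
from math import sqrt

def residual_std_dev(xs, ys, a, b):
    """s = sqrt( sum of squared residuals / (n - 2) ) for the line y = a + b*x."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - 2))

# Made-up data and its least-squares line (a = 2.2, b = 0.6).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s = residual_std_dev(xs, ys, 2.2, 0.6)
print(f"s = {s:.4f}")
```

Note the divisor n − 2 rather than n − 1.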
Assn 3.2B
• P 212 #34, 35, 37, 39, 41
Coefficient of determination
• The value of r² is known as the coefficient of determination.
$$ r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} $$
or popularly(?)
$$ r^2 = \frac{SST - SSE}{SST} $$
Coefficient of determination
• You may often see r² abbreviated as “R-sq”
• SST = sum of the squares of the residuals using the regression y = ybar.
• SSE = sum of the squares of the residuals using the LSRL.
• r² gives us the percentage difference between the total squared-residual “areas” of the two regressions (how much smaller SSE is than SST)
– You can think of the regression y = ybar as the most basic regression line possible.
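A short sketch of r² as (SST − SSE)/SST. The data are made up, and a = 2.2, b = 0.6 are the least-squares coefficients for that data, assumed already computed.

```python
from statistics import mean

def r_squared(xs, ys, a, b):
    """Coefficient of determination for the line y = a + b*x."""
    y_bar = mean(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)                     # residuals from y = ybar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # residuals from the LSRL
    return (sst - sse) / sst

# Made-up data and its least-squares line (a = 2.2, b = 0.6).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys, 2.2, 0.6)
print(f"r^2 = {r2:.3f}")
```

For this data SST = 6 and SSE = 2.4, so switching from the flat line y = ybar to the LSRL removes 60% of the squared error.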
Coefficient of determination
• Interpretation
• A given r² tells us “(r²) percent of the variation in (response variable) can be explained with a LSRL relating (response variable) and (explanatory variable)”
• Fill in the blanks.
• “60.6% of the variation in fat gain is explained by the least-squares regression line relating fat gain and nonexercise activity.”
Facts about LSRL
• The distinction between expl var and resp var is essential.
– You will get a different LSRL if you switch variables
• Correlation is closely related to the slope
• The LSRL always passes through (xbar, ybar)
• r² is the fraction of variation in y that is explained by the LSRL of y on x.
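To check the first fact, the sketch below fits the line twice on the same made-up data: once with y as the response and once with the roles swapped. The helper `fit_lsrl` is hypothetical.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line predicting ys from xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_lsrl(xs, ys)   # y is the response
a_xy, b_xy = fit_lsrl(ys, xs)   # x is the response

print(f"y on x: y-hat = {a_yx:.2f} + {b_yx:.2f} x")
print(f"x on y: x-hat = {a_xy:.2f} + {b_xy:.2f} y")
```

The two lines are not rearrangements of each other: here the slopes are 0.6 and 1.0, and their product equals r² for this data.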
3.3 CORRELATION AND REGRESSION WISDOM
Cautions!
• Correlation and Regression are only useful if the data shows a linear pattern
• Extrapolation often produces unreliable predictions
• Correlation is not resistant
– Outliers will affect your regressions!
Outliers and Influential Points
• Regression outliers fall outside the overall pattern of the other observations
– These can be outliers in the x and/or y direction
• Influential points greatly affect the regression with their inclusion/exclusion
– These are often outliers in the x direction.
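A sketch of an influential point: the same made-up data fitted with and without one observation that is far out in the x direction. The data, the extra point (15, 2), and the helper `fit_lsrl` are all hypothetical.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a_without, b_without = fit_lsrl(xs, ys)

# Add one point that is an outlier in the x direction.
a_with, b_with = fit_lsrl(xs + [15], ys + [2])

print(f"without the point: slope = {b_without:.3f}")
print(f"with the point:    slope = {b_with:.3f}")
```

A single point far from the others in x is enough to flip the sign of the slope here, which is what "influential" means.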
Lurkers
• Often an unaccounted-for variable will affect both the “explanatory” and “response”
– In this scenario, both the “explanatory” and “response” are actually responding to a third variable!
• EX. When the # of Methodist ministers in New England increased from 1860-1915, the # of barrels of imported Cuban rum also increased, with r = 0.999! Would we say that the # of Methodist ministers causes an increase in the import of Cuban rum?
Remember
• Association does not imply causation!
Chapter 3 REV
• P 228 #46, 55, 62, 70, 77, 80, 83, 84