Stats chapter 3

Page 1: Stats chapter 3

Chapter 3

Examining Relationships

Page 2: Stats chapter 3

3.1 SCATTERPLOTS AND CORRELATION

Page 3: Stats chapter 3

Basics

• Up until now, we have been concerned with “one-variable data”

• We are about to examine the relationship between two variables.

• Do two variables have to be related?

Page 4: Stats chapter 3

Variables

• Response variable
  – This is the variable that is measured in a study
  – The “outcome” of studies
  – We can think of it as our “dependent variable”
  – ( y )

Page 5: Stats chapter 3

Variables

• Explanatory variable
  – May (or may not) influence or change the response variable
  – We often would like to show that different values of the explanatory variable will affect the response
  – “independent variable”
  – ( x )

Page 6: Stats chapter 3

Variables

• More often than not, explanatory vs. response is just a vocabulary choice.

• Just because we are calling a variable the “response variable” does not mean that the corresponding “explanatory variable” causes change!

• We are content right now to just examine if a relationship exists between the two variables

Page 7: Stats chapter 3

Scatterplots

• Shows the relationship between two quantitative variables

• Each data pair is represented by a point

• x-coordinate is the value of the explanatory variable

• y-coordinate is the value of the response variable

• Be sure to label and scale both axes!!

• Quickly automated using the TI!

Page 8: Stats chapter 3

Scatterplots on the TI-84

• [stat], [1] (Edit)
• Enter the values of the explanatory variable in “L1”
• Enter the values of the response variable in “L2”
  – Make sure your L1 and L2 correspond
• [2nd], [Y=] (STATPLOT), [1]
  – You can define a number of plots from here
• Turn “ON” the plot
• Choose the scatterplot (first icon)
• Xlist: “L1”
• Ylist: “L2”
• [zoom], [9] (zoomstat)
  – I recommend starting with this zoom. Examine and take note of the window!

Page 9: Stats chapter 3

Scatterplots on the TI-89

• From the “apps” choose the “stat/list”
• Enter the values of the explanatory variable in “list1”
• Enter the values of the response variable in “list2”
• [F2] (plots), [F1] (define)
• Plot Type: “scatter”
• x: “list1”
• y: “list2”
• [ENTER]
• (Zoomdata)

Page 10: Stats chapter 3

Interpreting a Scatterplot

Use the following list when asked to comment on a scatterplot/relationship between 2 variables.

1. Direction/Association: is the slope positive or negative?
2. Form:
   1. Linear or nonlinear? If nonlinear, what is the relationship? (more on this later)
   2. Are there any clusters? How many?
3. Strength of relationship: strong, moderate, weak?
4. Outliers? Outliers are either outliers in the x-direction, y-direction, or both

Page 11: Stats chapter 3

Categorical data

• We can add categorical information to a scatterplot by using multiple marker types

• Example: all marks that represent dogs are boxes, all marks that represent cats are circles

• Sometimes differing patterns will appear when categorical information is added!

Page 12: Stats chapter 3

3.1 A

• P173 #1, 4, 5, 7, 9

Page 13: Stats chapter 3

Correlation

• One way to measure the strength of a linear relationship is to calculate the statistic “r”

• The value of r measures both the strength and the direction of the relationship

• r is known as the “correlation coefficient” and measures the quantity “correlation”

• You should not use the word “correlation” unless you mean r.

Page 14: Stats chapter 3

Correlation

$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

• The above formula is quite time consuming!
• We will compute r on a small set of data.
• Thankfully, we can compute r using our TI (more on this later)
• No.
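
The formula for r can be worked through on a small data set. The following is a minimal Python sketch, assuming made-up values; the lists `hours` and `score` and the function name `correlation` are illustrative and do not come from the textbook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up illustrative data, not from the textbook.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = correlation(hours, score)
print(round(r, 4))   # 0.7746
```

Each term in the sum is positive when a point is on the same side of both means, which is why an upward trend produces a positive r.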

Page 15: Stats chapter 3

Correlation

• r = 0 indicates “no linear relationship”
• r = 1 indicates “perfect line with positive slope”
• r = -1 indicates “perfect line with negative slope”
• Remember -1 ≤ r ≤ 1
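
These three landmark values can be checked numerically. The sketch below uses made-up points (two exact lines and one parabola); the helper `correlation` is an illustrative name, not something from the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of products of standardized x and y values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = [-2, -1, 0, 1, 2]                                # made-up x-values
r_up = correlation(xs, [2 * x + 1 for x in xs])       # points on a line with positive slope
r_down = correlation(xs, [10 - 3 * x for x in xs])    # points on a line with negative slope
r_curve = correlation(xs, [x ** 2 for x in xs])       # points on a parabola: no linear trend

print(r_up, r_down, r_curve)
```

Up to floating-point rounding, the three results are 1, -1, and 0. The parabola case shows that r near 0 rules out a linear relationship only, since y is completely determined by x there.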

Page 16: Stats chapter 3

Correlation

Page 17: Stats chapter 3

Cautions

• Correlation requires both variables to be quantitative

• Correlation does not describe curved relationships

• Correlation is not resistant: outliers have a strong effect on r

• Correlation is not a complete summary of 2-variable data
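
The lack of resistance is easy to see with a small experiment. This is a sketch with made-up data; the added point (6, 0) is a hypothetical outlier chosen only to make the effect obvious.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of products of standardized x and y values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up illustrative data with an upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_without = correlation(xs, ys)
r_with = correlation(xs + [6], ys + [0])   # same data plus one point far below the pattern

print(round(r_without, 3), round(r_with, 3))
```

With these numbers a single extra point is enough to turn a fairly strong positive r into a negative one.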

Page 18: Stats chapter 3

Assignment 3.1B

• P188 #13, 16, 19, 20, 23, 24

Page 19: Stats chapter 3

3.2 LEAST-SQUARES REGRESSION

Page 20: Stats chapter 3

Regression Line

• A line (linear equation) that describes the relationship between two variables

• Naturally, just calling a line a “regression line” does not mean that it does an accurate job describing a relationship!

• If you had done this in an algebra class, you probably just “eyeballed” a relationship or found the equation of a line that connected two points in the scatter.

Page 21: Stats chapter 3

Regression Line

• Regression lines in statistics are a bit “backwards” from what you learned in algebra: the line is written ŷ = a + bx, with the intercept first.

• a = y-intercept
  – This is the predicted value of the response variable when the explanatory variable is zero.

• b = slope
  – This is the average amount the response variable changes for every change of one unit in the explanatory variable.

• You will be asked to interpret both the values of ‘a’ and ‘b’

Page 22: Stats chapter 3

Extrapolation and Interpolation

Interpolation
- Use the regression line to predict values of the response variable for a value of the explanatory variable within the data range.

Extrapolation
- Use the regression line to predict values of the response variable for a value of the explanatory variable outside the data range.
- As you might suspect, interpolation good, extrapolation bad
- OK, not really. You need to use results obtained from extrapolation with great caution.
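
One way to keep the distinction in view is to have a prediction report whether it falls inside the observed x-range. The sketch below assumes a hypothetical fitted line ŷ = 2.2 + 0.6x and a hypothetical data range of 1 to 5; the function name `predict` is illustrative.

```python
def predict(a, b, x, x_min, x_max):
    """Return (predicted y, True if x lies outside the observed x-range)."""
    y_hat = a + b * x
    is_extrapolation = x < x_min or x > x_max
    return y_hat, is_extrapolation

# Hypothetical line y-hat = 2.2 + 0.6x, fitted to x-values that ran from 1 to 5.
a, b = 2.2, 0.6
inside = predict(a, b, 3, x_min=1, x_max=5)     # interpolation
outside = predict(a, b, 12, x_min=1, x_max=5)   # extrapolation: treat with great caution

print(inside, outside)
```

The arithmetic is identical in both cases; only the flag differs, which is the point: the line itself gives no warning that it is being used outside the data.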

Page 23: Stats chapter 3

Least-Squares Regression Line

Least-Squares Regression Line (LSRL)

• A good regression line should minimize the vertical distance between an actual y-value of a point in the scatter and the corresponding y-value on the regression line.

• This distance (y_actual – y_predicted) is known as a residual.

Page 24: Stats chapter 3

Least-Squares Regression Line

Page 25: Stats chapter 3

Least-Squares Regression Line (LSRL)

• The LSRL minimizes the sum of the squares of the residuals
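
To see what “minimizes” means here, compare the sum of squared residuals for the LSRL against a few nearby lines. This is a sketch on made-up data whose LSRL works out to ŷ = 2.2 + 0.6x; the function name `sse` is illustrative.

```python
def sse(a, b, xs, ys):
    """Sum of squared residuals (actual y minus predicted y) for the line a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data; its LSRL works out to y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = sse(2.2, 0.6, xs, ys)
# Nudging the intercept or the slope in either direction makes the sum larger.
others = [
    sse(2.0, 0.6, xs, ys),
    sse(2.4, 0.6, xs, ys),
    sse(2.2, 0.5, xs, ys),
    sse(2.2, 0.7, xs, ys),
]

print(round(best, 2), [round(v, 2) for v in others])   # 2.4 [2.6, 2.6, 2.95, 2.95]
```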

Page 26: Stats chapter 3

Least-Squares Regression Line

Computation of the LSRL

1) Obtain xbar, ybar, r, s_x and s_y
2) Compute ‘b’: $b = r \frac{s_y}{s_x}$
3) Compute ‘a’: $a = \bar{y} - b\bar{x}$ (this is from $\bar{y} = a + b\bar{x}$)
4) Give the equation of the LSRL

The regression line always goes through (xbar, ybar)
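
The four steps can be followed line by line in code. This is a minimal sketch on made-up data (the lists `xs` and `ys` are illustrative, not from the textbook).

```python
from statistics import mean, stdev

# Made-up illustrative data, not from the textbook.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Step 1: obtain xbar, ybar, s_x, s_y and r.
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

# Step 2: slope.
b = r * s_y / s_x

# Step 3: intercept, chosen so the line passes through (xbar, ybar).
a = y_bar - b * x_bar

# Step 4: the equation of the LSRL.
print(f"y-hat = {a:.3f} + {b:.3f}x")   # y-hat = 2.200 + 0.600x
```

Plugging xbar back into the result returns ybar, which confirms that the line goes through (xbar, ybar).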

Page 27: Stats chapter 3

Least-Squares Regression Line

• The equation of your line should be as follows

$\hat{y} = a + bx$   or   (resp var) = a + b (expl var)

• It’s called “y hat”
• In stats, “hats” indicate predicted values

• The example below is an example of the second notation.

(fat gain) = 3.505 – 0.00344 × (NEA change)
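
Using the fat gain equation to make a prediction is a single substitution. In the sketch below the coefficients come from the slide, while the input value 400 is a hypothetical NEA change chosen only for illustration; the function name is also illustrative.

```python
def predicted_fat_gain(nea_change):
    """Predicted fat gain from the slide's equation: 3.505 - 0.00344 * (NEA change)."""
    return 3.505 - 0.00344 * nea_change

# 400 is a hypothetical NEA change, not a value from the data.
print(round(predicted_fat_gain(400), 3))   # 2.129
print(round(predicted_fat_gain(0), 3))     # 3.505, the intercept 'a'
```

The negative slope means each additional unit of NEA change lowers the predicted fat gain by 0.00344.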

Page 28: Stats chapter 3

Reading a printout

Page 29: Stats chapter 3

Reading a printout

As part of the “great compromise of 1998,” you will be required to interpret a printout like the one above

Page 30: Stats chapter 3

Reading a printout

This is the value of ‘a.’ Look for the line that says “constant.”

a = 1.0891

Page 31: Stats chapter 3

Reading a printout

This is the value of ‘b.’ It is the coefficient of the line with the explanatory variable.

b = 0.1889

Page 32: Stats chapter 3

Reading a printout

The LSRL for this printout is:

(Gas Used) = 1.08921 + 0.1889 (degree-days)

Page 33: Stats chapter 3

Assignment 3.2A

• P204 #29, 32, 33, 36, 38

Page 34: Stats chapter 3

LSRL on the TI83/84

1. Input data in L1 and L2
2. From home, [stat], “CALC,” [8] (LinReg a+bx)
3. On the home screen enter the variable list: “LinReg(a+bx) L1, L2, Y1” (this will copy and paste the LSRL into Y1)
   AMAZING! It computes “r” for you, too!
4. [zoom], [9] (zoomstat)
   Take a good look at your LSRL!

Page 35: Stats chapter 3

LSRL on the TI89

1. On the “stat list” app, input data in “list1” and “list2”
2. [F4] (calc)
3. Choose LinReg a+bx
4. Select “list1” for the expl var and “list2” for the resp var
5. Select “y1” for “store list”
6. [ENTER] and behold the magic!
7. [F2], “zoom data”

Page 36: Stats chapter 3

Residuals

• (This is where the analysis begins)
• Recall that the residual of a point is y – yhat (y_actual – y_predicted).
• Luckily when you compute an LSRL, your calculator will automatically compute the residuals and place them in a list called “RESID”
  – Keep scrolling to the right
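
The residual list the calculator produces can be reproduced by hand. This sketch assumes made-up data and its LSRL, ŷ = 2.2 + 0.6x; none of the values come from the textbook.

```python
# Made-up illustrative data and its LSRL, y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# residual = actual y minus predicted y, one per observation
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print([round(e, 2) for e in residuals])   # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

Points above the line have positive residuals and points below it have negative ones. For an LSRL the residuals add up to zero, apart from floating-point rounding.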

Page 37: Stats chapter 3

Residual Plots

• A residual plot is a type of scatter plot where the x-coordinate is the expl var of an observation and the y-coordinate is the residual of the observation.

• Scatter plot of “expl var” vs. “resid”

Page 38: Stats chapter 3

Residual Plots

To create a residual plot on your TI,

1. Create an LSRL for your data
2. Choose a scatterplot from the [stat plot] menu
3. Set “Xlist: L1”
4. Set “Ylist: RESID” ([2nd], [stat], “NAMES”)
5. Turn off all other plots and graphs
6. [ZOOM], [9]

Page 39: Stats chapter 3

Residual Plots

• Residual plots tell us whether a linear model was a good choice for our data.

• We want the residual plot to look like an unstructured scatter of points

• The presence of a curve or any other pattern indicates that the linear model might not be the best choice.
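
A curved pattern in the residuals shows up clearly when a line is fitted to data that is not linear. The sketch below uses made-up points on y = x²; the helper `lsrl` is an illustrative name, and it computes the slope from sums of deviations, which is algebraically the same as b = r·s_y/s_x.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + bx."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))   # equal to r * s_y / s_x
    return y_bar - b * x_bar, b

xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]        # made-up curved data: y = x squared
a, b = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print([round(e, 2) for e in residuals])   # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals run positive, negative, positive from left to right. That U shape is the kind of structure that signals a linear model is missing something.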

Page 40: Stats chapter 3

Residual Plots

• A “fan-shaped” pattern (vuvuzela?) indicates that the linear model only works well for larger or smaller values of x

• Residuals should be small in value
  – Standard deviation of residuals should be small

$$ s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} $$
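
The standard deviation of the residuals takes only a couple of lines to compute. This sketch again assumes made-up data and its LSRL, ŷ = 2.2 + 0.6x.

```python
from math import sqrt

# Made-up illustrative data and its LSRL, y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# s = square root of (sum of squared residuals divided by n - 2)
s = sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))

print(round(s, 4))   # 0.8944
```

The value is in the units of the response variable, so it can be read as a typical size of a prediction error.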

Page 41: Stats chapter 3

Assn 3.2B

• P 212 #34, 35, 37, 39, 41

Page 42: Stats chapter 3

Coefficient of determination

• The value of r² is known as the coefficient of determination.

$$ r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} $$

or popularly(?)

$$ r^2 = \frac{SST - SSE}{SST} $$

Page 43: Stats chapter 3

Coefficient of determination

• You may often see r² abbreviated as “R-sq”
• SST = sum of the squares of the residuals using the regression y = ybar.
• SSE = sum of the squares of the residuals using the LSRL.
• r² gives us the fraction by which the total area of the squared residuals shrinks when we move from the regression y = ybar to the LSRL
  – You can think of the regression y = ybar as the most basic regression line possible.
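
SST, SSE and r² can be computed directly from these definitions. The sketch below assumes made-up data and its LSRL, ŷ = 2.2 + 0.6x; the variable names are illustrative.

```python
from statistics import mean

# Made-up illustrative data and its LSRL, y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

y_bar = mean(ys)
sst = sum((y - y_bar) ** 2 for y in ys)                      # squared residuals from y = ybar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))    # squared residuals from the LSRL
r_squared = (sst - sse) / sst

print(f"SST={sst:.1f} SSE={sse:.1f} r^2={r_squared:.2f}")   # SST=6.0 SSE=2.4 r^2=0.60
```

Here the LSRL removes 60% of the squared error left by the flat line y = ybar. Squaring the correlation of the same data (about 0.7746) gives the same 0.60.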

Page 44: Stats chapter 3

Coefficient of determination

• Interpretation
• r² tells us “(r² as a percent) of the variation in (response variable) can be explained with a LSRL relating (response variable) and (explanatory variable)”
• Fill in the blanks.
• “60.6% of the variation in fat gain is explained by the least-squares regression line relating fat gain and nonexercise activity.”

Page 45: Stats chapter 3

Facts about LSRL

• The distinction between expl var and resp var is essential.
  – You will get a different LSRL if you switch variables
• Correlation is closely related to the slope (b = r·s_y/s_x)
• The LSRL always passes through (xbar, ybar)
• r² is the fraction of variation in y that is explained with a LSRL regression of y on x.
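
The first fact is worth checking once by fitting the same data both ways. This is a sketch on made-up data; the helper `lsrl` is an illustrative name, and it computes the slope from sums of deviations, which is algebraically the same as b = r·s_y/s_x.

```python
from statistics import mean

def lsrl(explanatory, response):
    """Return (a, b) for the least-squares line predicting response from explanatory."""
    e_bar, r_bar = mean(explanatory), mean(response)
    b = (sum((e - e_bar) * (v - r_bar) for e, v in zip(explanatory, response))
         / sum((e - e_bar) ** 2 for e in explanatory))
    return r_bar - b * e_bar, b

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = lsrl(xs, ys)   # x explains y
a_xy, b_xy = lsrl(ys, xs)   # roles switched: y explains x

print(f"y-hat = {a_yx:.1f} + {b_yx:.1f}x")   # y-hat = 2.2 + 0.6x
print(f"x-hat = {a_xy:.1f} + {b_xy:.1f}y")   # x-hat = -1.0 + 1.0y
```

Solving the second equation for y gives y = 1 + x, a slope of 1 rather than 0.6, so the two fits are different lines. The product of the two slopes equals r² (0.6 here).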

Page 46: Stats chapter 3

3.3 CORRELATION AND REGRESSION WISDOM

Page 47: Stats chapter 3

Cautions!

• Correlation and Regression are only useful if the data shows a linear pattern

• Extrapolation often produces unreliable predictions

• Correlation is not resistant
  – Outliers will affect your regressions!

Page 48: Stats chapter 3

Outliers and Influential Points

• Regression outliers fall outside the overall pattern of the other observations
  – These can be outliers in the x and/or y direction

• Influential points greatly affect the regression with their inclusion/exclusion
  – These are often outliers in the x direction.
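
Whether a point is influential can be tested by fitting the line with and without it. The sketch below uses made-up data plus one hypothetical point, (10, 2), placed far out in the x direction; the helper `lsrl` is an illustrative name.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + bx."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))   # equal to r * s_y / s_x
    return y_bar - b * x_bar, b

# Made-up illustrative data with an upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_without, b_without = lsrl(xs, ys)
a_with, b_with = lsrl(xs + [10], ys + [2])   # add one point far to the right

print(round(b_without, 3), round(b_with, 3))
```

With these numbers the slope changes sign when the single extra point is included, which is what “greatly affects the regression” looks like in practice.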

Page 49: Stats chapter 3

Lurkers

• Often an unaccounted-for variable will affect both the “explanatory” and “response” variables
  – In this scenario, both the “explanatory” and “response” are actually responding to a third variable!

• EX. When the # of Methodist ministers in New England increased from 1860-1915, the # of barrels of imported Cuban rum also increased, with r = 0.999! Would we say that the # of Methodist ministers causes an increase in import of Cuban rum?

Page 50: Stats chapter 3

Remember

• Association does not imply causation!

Page 51: Stats chapter 3

Chapter 3 REV

• P 228 #46, 55, 62, 70, 77, 80, 83, 84