Stats chapter 3

Post on 14-Jan-2015

Chapter 3

Examining Relationships

3.1 SCATTERPLOTS AND CORRELATION

Basics

• Up until now, we have been concerned with “one-variable data”

• We are about to view the relationship between two variables.

• Do two variables have to be related?

Variables

• Response variable
– This is the variable that is measured in a study
– The “outcome” of studies
– We can think of it as our “dependent variable”
– ( y )

Variables

• Explanatory variable
– May (or may not) influence or change the response variable
– We often would like to show that different values of the explanatory variable will affect the response
– “independent variable”
– ( x )

Variables

• More often than not, explanatory v. response is just a vocabulary choice.

• Just because we are calling a variable the “response variable” does not mean that the corresponding “explanatory variable” causes change!

• We are content right now to just examine if a relationship exists between the two variables

Scatterplots

• Shows the relationship between two quantitative variables

• Each data pair is represented by a point

• x-coordinate is the value of the explanatory variable

• y-coordinate is the value of the response variable

• Be sure to label and scale both axes!!

• Quickly automated using the TI!

Scatterplots on the TI-84

• [stat], [1] (Edit)

• Enter the explanatory variable values in “L1”

• Enter the response variable values in “L2”
– Make sure your L1 and L2 entries correspond

• [2nd], [Y=] (STATPLOT), [1]
– You can define a number of plots from here

• Turn “ON” the plot

• Choose the scatterplot (first icon)

• Xlist: “L1”

• Ylist: “L2”

• [zoom], [9] (zoomstat)
– I recommend starting with this zoom. Examine and take note of the window!

Scatterplots on the TI89

• From the “apps” choose the “stat/list”

• Enter the explanatory variable values in “list1”

• Enter the response variable values in “list2”

• [F2] (plots), [F1] (define)

• Plot Type: “scatter”

• x: “list1”

• y: “list2”

• [ENTER]

• (Zoomdata)

Interpreting a Scatterplot

Use the following list when asked to comment on a scatterplot/relationship between 2 variables.

1. Direction/Association: is the slope positive or negative?

2. Form:
   1. Linear or nonlinear? If nonlinear, what is the relationship (more on this later)
   2. Are there any clusters? How many?

3. Strength of relationship: strong, moderate, weak?

4. Outliers? Outliers are either outliers in the x-direction, y-direction, or both

Categorical data

• We can add categorical information to a scatterplot by using multiple marker types

• Example: all marks that represent dogs are a box, all marks that represent cats are circles

• Sometimes differing patterns will appear when categorical information is added!

3.1 A

• P173 #1, 4, 5, 7, 9

Correlation

• One way to measure the strength of a linear relationship is to calculate it with variable “r”

• The variable r measures both the strength and the direction of the relationship

• r is known as the “correlation coefficient” and measures the quantity “correlation”

• You should not use the word “correlation” unless you mean r.

Correlation

$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

• The above formula is quite time consuming!

• We will compute r on a small set of data.

• Thankfully, we can compute r using our TI (more on this later)

• No.
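To see what the formula for r is doing, here is a minimal Python sketch. The data values and the function name `correlation` are made up for illustration; they are not from the textbook. Each point contributes the product of its standardized x and standardized y, and r is the sum of those products divided by n - 1.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Compute r from standardized x and y values (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Made-up data, chosen only to keep the arithmetic small
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```

For this toy data the result is a moderately strong positive correlation. The calculator's LinReg output should report the same r for the same lists.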

Correlation

• r = 0 indicates “no linear relationship”

• r = 1 indicates “perfect line with positive slope”

• r = -1 indicates “perfect line with negative slope”

• Remember -1 ≤ r ≤ 1

Correlation

Cautions

• Correlation requires both variables to be quantitative

• Correlation does not describe curved relationships

• Correlation is not resistant: outliers have a strong effect on r

• Correlation is not a complete summary of 2-variable data
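A quick way to see that r is not resistant is to add a single outlier to a perfectly linear data set. The sketch below uses made-up numbers and a hypothetical helper named `correlation`.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r (hypothetical helper for this example)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Made-up data lying exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = correlation(xs, ys)

# Same data plus one outlier at (6, 0)
r_outlier = correlation(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_outlier, 3))
```

One extra point drops r from 1 to about 0.14 here, which is why you should always look at the scatterplot before trusting r.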

Assignment 3.1B

• P188 #13, 16, 19, 20, 23, 24

3.2 LEAST-SQUARES REGRESSION

Regression Line

• A line (linear equation) that describes the relationship between two variables

• Naturally, just calling a line a “regression line” does not mean that it does an accurate job describing a relationship!

• If you had done this in an algebra class, you probably just “eyeballed” a relationship or found the equation of a line that connected two points in the scatter.

Regression Line

• Regression lines in statistics are a bit “backwards” from what you learned in algebra!

• a = y-intercept
– This is the predicted value of the response variable when the explanatory variable is zero.

• b = slope
– This is the average amount the response variable changes for every one-unit change in the explanatory variable.

• You will be asked to interpret both the values of ‘a’ and ‘b’

Extrapolation and Interpolation

Interpolation
- Use the regression line to predict values of the resp var for an expl var value within the data range.

Extrapolation
- Use the regression line to predict values of the resp var for an expl var value outside the data range.
- As you might suspect, interpolation good, extrapolation bad
- OK, not really. You need to use results obtained from extrapolation with great caution.

Least-Squares Regression Line

Least-Squares Regression Line (LSRL)

• A good regression line should minimize the vertical distance between the actual y-value of a point in the scatter and the corresponding y-value on the regression line.

• This distance (yactual – ypredicted) is known as a residual.

Least-Squares Regression Line

• The LSRL minimizes the sum of the squares of the residuals
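To make "minimizes the sum of the squares of the residuals" concrete, the sketch below compares the LSRL with an "eyeballed" line through two of the points. The data and the helper name `sse` are made up for illustration. The slope is computed as Σ(x - xbar)(y - ybar) / Σ(x - xbar)², which is algebraically the same as b = r·sy/sx.

```python
def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares slope and intercept
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
sse_lsrl = sse(a, b, xs, ys)

# "Eyeballed" line through the first and last points, (1, 2) and (5, 5)
b_eye = (5 - 2) / (5 - 1)
a_eye = 2 - b_eye * 1
sse_eyeballed = sse(a_eye, b_eye, xs, ys)

print(round(sse_lsrl, 3), round(sse_eyeballed, 3))
```

The eyeballed line hits two points exactly, yet its total squared error is larger than the LSRL's. Nudging a or b away from the least-squares values in either direction also increases the sum.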

Least-Squares Regression Line

Computation of the LSRL

1) Obtain xbar, ybar, r, sx and sy

2) Compute ‘b’

$$ b = r \frac{s_y}{s_x} $$

3) Compute ‘a’

$$ a = \bar{y} - b\bar{x} $$

(this is from $\bar{y} = a + b\bar{x}$)

4) Give the equation of the LSRL

The regression line always goes through (xbar, ybar)
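The four computation steps can be sketched in Python as follows. The data set is made up for illustration and the variable names are just one possible choice.

```python
from statistics import mean, stdev

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Step 1: obtain xbar, ybar, r, sx and sy
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

# Step 2: slope
b = r * s_y / s_x

# Step 3: intercept, using the fact that the line passes through (xbar, ybar)
a = y_bar - b * x_bar

# Step 4: the equation of the LSRL
print(f"y-hat = {a:.3f} + {b:.3f}x")
```

Plugging xbar into the resulting equation returns ybar, which is a handy check on hand computations.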

Least-Squares Regression Line

• The equation of your line should be as follows

$$ \hat{y} = a + bx \quad \text{or} \quad (\text{resp var}) = a + b\,(\text{expl var}) $$

• $\hat{y}$ is called “y hat”

• In stats, “hats” indicate predicted values

• The example below is an example of the second notation.

(fat gain) = 3.505 – 0.00344 × (NEA change)

Here a = 3.505 and b = –0.00344.

Reading a printout

Reading a printout

As part of the “great compromise of 1998,” you will be required to interpret a printout like the one above

Reading a printout

This is the value of ‘a.’ Look for the line that says “constant”

a = 1.0891

Reading a printout

This is the value of ‘b.’ It is the coefficient on the line with the explanatory variable

b = 0.1889

Reading a printout

The LSRL for this printout is:

(Gas Used) = 1.08921 + 0.1889 (degree-days)
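Once the LSRL has been read off a printout, making a prediction is just a matter of plugging in a value of the explanatory variable. The sketch below uses the coefficients from the equation above; the function name and the input of 20 degree-days are hypothetical choices for illustration.

```python
A = 1.08921  # intercept: the "constant" coefficient in the equation above
B = 0.1889   # slope: the coefficient on degree-days

def predicted_gas_used(degree_days):
    """Predicted gas used (y-hat) for a given number of degree-days."""
    return A + B * degree_days

# 20 degree-days is a hypothetical input, not a value from the printout
prediction = predicted_gas_used(20)
print(round(prediction, 3))
```

Remember that a prediction like this is only trustworthy for degree-day values within the range of the original data.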

Assignment 3.2A

• P204 #29, 32, 33, 36, 38

LSRL on the TI83/84

1. Input data in L1 and L2

2. From home, [stat], “CALC,” [8] (LinReg a+bx)

3. On the home screen enter the variable list: “LinReg L1, L2, Y1” (this will copy and paste the LSRL into Y1)
AMAZING! It computes “r” for you, too!

4. [zoom], [9] (zoomstat)
Take a good look at your LSRL!

LSRL on the TI89

1. On the “stat list” app, input data in “list1” and “list2”

2. [F4] (calc)

3. Choose LinReg a+bx

4. Select “list1” for the expl var and “list2” for the resp var

5. Select “y1” for “store list”

6. [ENTER] and behold the magic!

7. [F2], “zoom data”

Residuals

• (This is where the analysis begins)

• Recall that the residual of a point is y – yhat (yactual - ypredicted).

• Luckily when you compute an LSRL, your calculator will automatically compute the residuals and place them in a list called “RESID”
– Keep scrolling to the right

Residual Plots

• A residual plot is a type of scatter plot where the x-coordinate is the expl var of an observation and the y-coordinate is the residual of the observation.

• Scatter plot of “expl var” vs. “resid”
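For readers without a TI handy, the contents of the “RESID” list and the points of a residual plot can be sketched in Python. The data and the helper name `fit_lsrl` are made up for illustration.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_lsrl(xs, ys)

# residual = y - yhat (actual minus predicted)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Points of the residual plot: (expl var, residual)
residual_plot_points = list(zip(xs, residuals))
for x, resid in residual_plot_points:
    print(x, round(resid, 2))
```

The residuals from an LSRL always add up to zero (apart from rounding), so the residual plot is centered on the horizontal line at 0.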

Residual Plots

To create a residual plot on your TI,

1. Create a LSRL for your data

2. Choose a scatterplot from the [stat plot] menu

3. Set “Xlist: L1”

4. Set “Ylist: RESID” ([2nd], [stat], “NAMES”)

5. Turn off all other plots and graphs

6. [ZOOM], [9]

Residual Plots

• Residual plots tell us whether a linear model was a good choice for our data.

• We want the residual plot to look like an unstructured scatter of points

• The presence of a curve or any other pattern indicates that the linear model might not be the best choice.

Residual Plots

• A “fan-shaped” pattern (vuvuzela?) indicates that the linear model only works well for larger or smaller values of x

• Residuals should be small in value
– Standard deviation of residuals should be small

$$ s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} $$
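The standard deviation of the residuals can be computed directly from its definition. The sketch below uses made-up data; the LSRL coefficients a = 2.2 and b = 0.6 are the least-squares values for this particular toy data set and are hard-coded here to keep the example short.

```python
from math import sqrt

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# LSRL for this toy data set: y-hat = 2.2 + 0.6x
a, b = 2.2, 0.6

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
n = len(xs)

# s = sqrt( sum of squared residuals / (n - 2) )
s = sqrt(sum(resid ** 2 for resid in residuals) / (n - 2))
print(round(s, 4))
```

Note the divisor n - 2 rather than n - 1. Roughly speaking, s is the size of a typical prediction error, in the units of the response variable.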

Assn 3.2B

• P 212 #34, 35, 37, 39, 41

Coefficient of determination

• The value of r² is known as the coefficient of determination.

$$ r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} $$

or popularly(?)

$$ r^2 = \frac{SST - SSE}{SST} $$

Coefficient of determination

• You may often see r² abbreviated as “R-sq”

• SST = sum of the squares of the residuals using the regression y = ybar.

• SSE = sum of the squares of the residuals using the LSRL.

• r² gives us the fraction by which the total area of the squared residuals shrinks when we switch from the regression y = ybar to the LSRL
– You can think of the regression y = ybar as the most basic regression line possible.
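Here is a minimal sketch of the SST and SSE computation, using made-up data. The LSRL coefficients a = 2.2 and b = 0.6 are the least-squares values for this toy data set and are hard-coded for brevity.

```python
# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# LSRL for this toy data set: y-hat = 2.2 + 0.6x
a, b = 2.2, 0.6
y_bar = sum(ys) / len(ys)

# SST: squared residuals around the horizontal line y = ybar
sst = sum((y - y_bar) ** 2 for y in ys)

# SSE: squared residuals around the LSRL
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = (sst - sse) / sst
print(round(sst, 3), round(sse, 3), round(r_squared, 3))
```

For this data the LSRL removes 60% of the squared error left by the line y = ybar, so r² = 0.6. Taking the square root gives about 0.775, the size of the correlation r for the same data.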

Coefficient of determination

• Interpretation

• The given r² tells us “(r²) percent of the variation in (response variable) can be explained with a LSRL relating (response variable) and (explanatory variable)”

• Fill in the blanks.

• “60.6% of the variation in fat gain is explained by the least-squares regression line relating fat gain and nonexercise activity.”

Facts about LSRL

• The distinction between expl var and resp var is essential.
– You will get a different LSRL if you switch variables

• Correlation is closely related to the slope

• The LSRL always passes through (xbar, ybar)

• r² is the fraction of variation in y that is explained with a LSRL regression of y on x.
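The first fact is easy to check numerically: regress y on x, then x on y, and compare. The data and the helper name `fit_lsrl` below are made up for illustration.

```python
def fit_lsrl(expl, resp):
    """Return (a, b) for the LSRL predicting resp from expl (hypothetical helper)."""
    n = len(expl)
    e_bar = sum(expl) / n
    r_bar = sum(resp) / n
    b = (sum((e - e_bar) * (r - r_bar) for e, r in zip(expl, resp))
         / sum((e - e_bar) ** 2 for e in expl))
    a = r_bar - b * e_bar
    return a, b

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_lsrl(xs, ys)  # y is the response: y-hat = a + b*x
a_xy, b_xy = fit_lsrl(ys, xs)  # x is the response: x-hat = a + b*y

print(round(a_yx, 3), round(b_yx, 3))
print(round(a_xy, 3), round(b_xy, 3))
```

The two fits are different lines, not the same line solved for the other variable. The product of the two slopes equals r², which is one way to see how closely slope and correlation are tied together.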

3.3 CORRELATION AND REGRESSION WISDOM

Cautions!

• Correlation and Regression are only useful if the data shows a linear pattern

• Extrapolation often produces unreliable predictions

• Correlation is not resistant
– Outliers will affect your regressions!

Outliers and Influential Points

• Regression outliers fall outside the overall pattern of the other observations
– These can be outliers in the x and/or y direction

• Influential points greatly affect the regression with their inclusion/exclusion
– These are often outliers in the x direction.
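To see how much one influential point can matter, the sketch below fits the LSRL with and without a single point that is far out in the x direction. The data and the helper name `fit_lsrl` are made up for illustration.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a_without, b_without = fit_lsrl(xs, ys)

# Add one point that is an outlier in the x direction: (15, 2)
a_with, b_with = fit_lsrl(xs + [15], ys + [2])

print(round(b_without, 3), round(b_with, 3))
```

Including the one extra point flips the slope from positive to negative. A good habit is to fit the line both ways whenever a point sits far from the rest in x.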

Lurkers

• Often an unaccounted variable will affect both the “explanatory” and “response”
– In this scenario, both the “explanatory” and “response” are actually responding to a third variable!

• EX. When the # of Methodist ministers in New England increased from 1860-1915, the # of barrels of imported Cuban rum also increased with an r = 0.999! Would we say that the # of Methodist ministers causes an increase in import of Cuban rum?

Remember

• Association does not imply causation!

Chapter 3 REV

• P 228 #46, 55, 62, 70, 77, 80, 83, 84