24 Multiple Regression

Simple Models
Errors in the Simple Regression Model
An Embarrassing Lurking Factor
Multiple Regression
Partial and Marginal Slopes
Path Diagrams
R² Increases with Added Variables
The Multiple Regression Model
Calibration Plot
The Residual Plot
Checking Normality
Inference in Multiple Regression
The F-Test in Multiple Regression
Steps in Building a Multiple Regression
Summary


Utilities that supply natural gas to residential customers have to anticipate how much fuel they will need to supply in the coming winter. Natural gas is difficult to store, so utilities contract with pipelines to deliver the gas as needed. The larger the supply that the utility locks in, the larger the cost for the contract. That's OK if the utility needs the fuel, but it's a waste if the contract reserves more gas than needed.

On the other hand, if deliveries fall short, the utility can find itself in a tight spot when the winter turns cold. A shortage means cold houses or expensive purchases of natural gas on the spot market. Either way, the utility will have unhappy customers: they'll either be cold or shocked at surcharges on their gas bills. It makes sense, then, for the utility to anticipate how much gas its customers will need.

Let's focus on estimating the demand of a community of 100,000 residential customers in Michigan. According to the US Energy Information Administration, 62 million residential customers in the US consumed about 5 trillion cubic feet of gas in 2004. That works out to about 80 thousand cubic feet of natural gas per household. In industry parlance, that's 80 MCF per household.

Should the utility lock in 80 MCF for every customer?

Probably not. Does every residence use natural gas for heating, or only some of them? A home that uses gas for heat burns a lot more than one that uses gas only for cooking and heating water. And what about the weather? These 100,000 homes aren't just anywhere in the US; they're in Michigan, where it gets colder than many other parts of the country. You can be sure these homes need more heating than if they were in Florida. Forecasters are calling for a typical winter. For the part of Michigan around Detroit, that means a winter with 6,500 heating degree days.¹

How much gas does the utility need for the winter? How much should they lock in with contracts? As we'll see, the answers to these questions don't have to be the same. To answer either, we'd better look at data.

¹ Here's how to compute heating degree days. For a day with a low temperature of 20 and a high of 60, the average temperature is (20 + 60)/2 = 40. Now subtract the average from 65. This day contributes 65 - 40 = 25 heating degree days to the year. If the average temperature is above 65, the day contributes nothing to the total. It's as if you only need heat when the average temperature is below 65.
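In code, the footnote's rule might look like this minimal sketch (the function name and default base are ours, not from the text):

```python
def heating_degree_days(low, high, base=65):
    """Heating degree days contributed by one day, from its low and high temperatures (F)."""
    average = (low + high) / 2
    return max(base - average, 0)  # days whose average is above 65 contribute nothing

# The footnote's example: a low of 20 and a high of 60
print(heating_degree_days(20, 60))  # 25
```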


The intercept b0 implies that residences use about 34,000 cubic feet of natural gas regardless of the weather. All of these households use gas for heating water, and some use gas for cooking. To capture the effect of the weather, the slope b1 indicates that, on average, consumption increases by about 14,500 cubic feet of gas for each additional MHDD (1,000 heating degree days).

A utility can use this equation to estimate the average gas consumption in residences. The National Oceanic and Atmospheric Administration estimates 6,500 HDD for the winter in Michigan (compared to 700 in Florida). Plugging this value into the least squares line for x, the equation predicts annual gas consumption to be

34.4272 + 14.5147 × 6.5 ≈ 128.8 MCF

Once we verify the conditions of the SRM, we can add a range to convey our uncertainty.
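As a quick check of that arithmetic (a sketch; the coefficients are the estimates quoted above):

```python
b0, b1 = 34.4272, 14.5147   # intercept (MCF) and slope (MCF per MHDD) from the fitted line
mhdd = 6.5                  # 6,500 heating degree days, expressed in thousands
print(b0 + b1 * mhdd)       # roughly 128.8 MCF per household
```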

    Errors in the Simple Regression Model

The SRM says that the explanatory variable x affects the average value of the response y through a linear equation, written as

y = μy|x + ε,  with  μy|x = β0 + β1x.

    The model has two main parts.

(1) Linear pattern. The average value of y depends on x through the line β0 + β1x. That should be the only way that x affects the distribution of y. For example, the value of x should not affect the variance of the εs (as seen in the previous chapter).

(2) Random errors. The errors should resemble a simple random sample from a normal distribution. They must be independent of each other, with equal variance. Normality matters most for prediction intervals. Otherwise, with moderate sample sizes, we can rely on the CLT to justify confidence intervals.

The SRM is a strange model when you think about it. Only one variable systematically affects the response. That seems too simple. Think about what influences energy consumption in your home. The temperature matters, but climate is not the only thing. How about the size of the home? It takes more to heat a big house than a small one. Other things contribute as well: the number of windows, the amount of insulation, the type of construction, and so on. The people who live there also affect the energy consumption. How many live in the house? Do they leave windows open on a cold day? Where do they set the thermostat?

A model that omits these variables treats their effects as random errors. The errors in regression represent variables that influence y but are omitted from the model. The real equation for y looks more like

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ⋯

What is ε? The errors represent the accumulation of all of the other variables that affect the response that we have not accounted for.


Either we are unaware of some of these x's or, even if we are aware, we don't observe them. The SRM draws a boundary after x1 and bundles the rest into the error:

y = β0 + β1x1 + (β2x2 + β3x3 + β4x4 + β5x5 + ⋯)
  = β0 + β1x1 + ε

Everything collected in the parentheses plays the role of ε.

If the omitted variables have comparable effects on y, then the Central Limit Theorem tells us that the sum of their effects is roughly normally distributed. As a result, it's not too surprising that we often find normally distributed residuals. If we omit an important variable whose effect stands out from the others, however, ε needn't be normal. That's another reason to watch for a lack of normality in residuals: deviations from normality can suggest an important omitted variable.

AYT: A simple regression that describes the prices of cars offered for sale at a large dealer regresses Price (y) on Engine Power (x).

(a) Name other explanatory variables that affect the price of the car.²

(b) What happens to the effects of these other explanatory variables if they are not used in the regression model?³

    An Embarrassing Lurking Factor

Tip: The best way to identify omitted explanatory variables is to know the context of your model. The simple regression for gas use says that the only thing that systematically affects consumption is temperature. The size of the home doesn't matter, just the climate.

That's an embarrassing omission. Unless these homes are all the same size, size matters. We'll use the number of rooms to measure the size of the houses. (This is easier to collect in surveys than the number of square feet.) The scatterplot in Figure 24-2 graphs gas consumption versus the number of rooms.

Figure 24-2. Size is related to fuel use as well. (Scatterplot of Natural Gas (MCF) versus Number of Rooms.)

² Others include options (e.g., a sunroof or navigation system), size, and styling (e.g., convertible).
³ Variables that affect y that are not explicitly used in the model become part of the error term.



The slope of the least squares line in the figure is positive: larger homes use more gas.

R²   0.2079
se   36.90

Term                    Estimate    Std Error   t-statistic   p-value
b0                      15.8340     6.9409       2.28         0.0229
b1, Number of Rooms     12.4070     1.0724      11.57


We interpret R² and se in Table 24-3 as in simple regression. R² = 0.5457 indicates that the equation of the model represents about 55% of the variation in gas usage. That's less than we expected, but more than either simple regression. The estimate se = 27.943 MCF estimates the SD of the underlying model errors. Because multiple regression explains more variation, the residual SD is smaller than with simple regression. (The equation for se is in the Formulas section at the end of the chapter.)

The rest of Table 24-3 describes the equation. The equation has an intercept and two slopes, taken from the column labeled Estimate:

Estimated Gas Use = -1.775 + 12.78 MHDD + 6.882 Number of Rooms

The slope for MHDD in this equation differs from the slope for MHDD in the simple regression (14.51). That's not a mistake: the slope of an explanatory variable in multiple regression estimates something different from the slope of the same variable in simple regression.

Partial and Marginal Slopes

The slope 14.51 of MHDD in the simple regression (Table 24-1) estimates the average difference in gas consumption between homes in different climates. Homes in a climate with 3,000 HDD use 14.51 more MCF of gas, on average, than homes in a climate with 2,000 HDD. Because it ignores other differences between homes, a slope in an SRM is called the marginal slope for y on x.

The slope 12.78 for MHDD in the multiple regression (Table 24-3) also estimates the difference in gas consumption between homes in different climates, but it limits the comparison to homes with the same number of rooms. Homes in a climate with 3,000 HDD use 12.78 more MCF of gas, on average, than homes with the same number of rooms in a climate with 2,000 HDD. Because multiple regression adjusts for other factors, the slope in a multiple regression is known as a partial slope.

To appreciate why these estimates are different, consider a specific question. Suppose a customer calls the utility with a question about the added costs for heating a one-room addition to her home. The marginal slope for the number of rooms estimates the difference in gas use between homes that differ in size by one room. The marginal slope 12.41 MCF/Room (Table 24-2) indicates that, on average, larger homes use 12,410 more cubic feet of gas. At $10 per MCF, an added room increases the annual heating bill by about $124. The partial slope 6.88 MCF/Room (Table 24-3) indicates that, on average, homes with another room use 6.88 more MCF of gas, costing $69 annually. Which slope provides a better answer: the marginal or the partial slope for the number of rooms?

The reason for the difference between the estimated slopes is the association between the two explanatory variables. On average, homes with more rooms are in colder climates, as shown in the following plot. (The line is the least squares fit of MHDD on the number of rooms.)


    Figure 24-3. Homes with more rooms tend to be in colder climates.

Simple regression compares the average consumption of smaller to larger homes, ignoring that homes with more rooms tend to be in colder places. Multiple regression adjusts for the association between the explanatory variables; it compares the average consumption of smaller to larger homes in the same climate. That's why the partial slope is smaller. The marginal slope in the simple regression mixes the effect of size (number of rooms) with the effect of climate. Multiple regression separates them. Unless the homeowner who called the utility moved her home to a colder climate when she added the room, she should use the partial slope!

The correlation between MHDD and the number of rooms is r = 0.33. Correlation between explanatory variables is known as collinearity. Collinearity between the explanatory variables explains why R² does not go up as much as we expected. Evidently, there's a bit of overlap between the two explanatory variables. Some of the variation explained by HDD is also explained by the number of rooms. Similarly, some of the variation explained by the number of rooms is explained by HDD. If we add the R²s from the simple regressions, we double count the variation that is explained by both explanatory variables. Multiple regression counts it once, and the resulting R² is smaller.

Path Diagrams

Path diagrams offer another way to appreciate the differences between marginal and partial slopes. A path diagram shows a regression model as a collection of nodes and directed edges. Nodes represent variables, and directed edges show slopes. A note of warning in advance: these pictures of models often suggest that we've uncovered the cause of the variation. That's not true; like simple regression, multiple regression models association.

Let's draw the path diagram of the multiple regression. Arrows lead from the explanatory variables to the response. The diagram also joins the explanatory variables to each other with a double-headed arrow that symbolizes the association between them.

Collinearity: correlation between the explanatory variables in a multiple regression.


Figure 24-4. Path diagram of the multiple regression: arrows run from 1000 Heating Degree Days (12.78 MCF/MHDD) and Number of Rooms (6.882 MCF/Room) to Consumption of natural gas, and a double-headed arrow joins the two explanatory variables (0.2516 Rooms/MHDD, 0.4322 MHDD/Room).

It's important to keep units with the slopes, particularly for the double-headed arrow between the explanatory variables. The slopes for this edge come from two simple regressions: one with rooms as the response (see Figure 24-3) and the other with MHDD as the response.

Estimated Number of Rooms = 5.2602 + 0.2516 MHDD
Estimated MHDD = 1.3776 + 0.4322 Number of Rooms

As an example, consider the difference in consumption between homes in climates that differ by 1,000 heating degree days. The arrow from MHDD to the response indicates that a difference of 1 MHDD produces an average increase of

1 MHDD × 12.78 MCF/MHDD = 12.78 MCF

of natural gas.

That's the direct effect of colder weather: the furnace runs more. A colder climate has an indirect effect, too. Homes in colder climates tend to have more rooms, and larger homes require more heat. Following the path to the number of rooms, homes in the colder climate average

1 MHDD × 0.2516 Rooms/MHDD = 0.2516 Rooms

more than those in the warmer climate. This difference also increases gas use. Following the path from number of rooms to consumption, 0.2516 rooms converts into

0.2516 Rooms × 6.882 MCF/Room ≈ 1.73 MCF

of natural gas.

If we add this indirect effect to the direct effect, homes in the colder climate use, on average,

12.78 MCF + 1.73 MCF ≈ 14.51 MCF

more natural gas than those in the warmer climate. If you've got that "déjà vu all over again" feeling, look back at the summary of the simple regression of gas use on MHDD in Table 24-1. The marginal slope is 14.51. That's exactly what we've calculated from the path diagram.

Neat, but let's make sure we understand why that happens. Simple regression answers a simple question: on average, how much more gas do homes in a colder climate use? The marginal slope shows that those in the colder climate use 14.51 more MCF of natural gas.



The partial slope for HDD in the multiple regression answers a different question. It estimates the difference in gas use due to climate among homes of the same size, excluding the pathway through the number of rooms. Multiple regression separates the marginal slope into two parts: the direct effect (the partial slope) and the indirect effect.

The sum of the two paths reproduces the simple regression slope:

12.78 MCF + 0.2516 × 6.882 MCF ≈ 14.51 MCF
direct effect + indirect effect = marginal effect
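The same bookkeeping in a few lines of code (a sketch using the estimates quoted above):

```python
direct = 12.78               # partial slope: MCF per MHDD, comparing homes with the same rooms
indirect = 0.2516 * 6.882    # rooms per MHDD, times MCF per room
print(direct + indirect)     # about 14.51, the marginal slope from the simple regression
```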

Once you appreciate the role of indirect effects, you begin to think differently about the marginal slope in a simple regression. The marginal slope blends the direct effect of an explanatory variable with all of its indirect effects. It's fine for the marginal slope to add these effects. The problem is that we sometimes forget about indirect effects and interpret the marginal slope as though it represents the direct effect.

Path diagrams help you realize something else about multiple regression, too. When would the marginal and partial slopes be the same? They match if there are no indirect effects. This happens when the explanatory variables are uncorrelated, breaking the pathway for the indirect effect.

AYT: A contractor rehabs suburban homes, replacing exterior windows and siding. He's kept track of material costs for these jobs (basically, the costs of replacement windows and new vinyl siding). The homes he's fixed vary in size. Usually repairs to larger homes require both more windows and more siding. He fits two regressions: a simple regression of cost on the number of windows, and a multiple regression of cost on the number of windows and square feet of siding.

Which should be larger: the marginal slope for the number of windows in the simple regression or the partial slope for the number of windows in the multiple regression?⁴

⁴ The marginal slope is larger. Bigger homes that require replacing more windows also require more siding; homes with more windows are bigger. Hence the marginal slope combines the cost of windows with the cost of more siding. Multiple regression separates these costs. In the language of path diagrams, the marginal slope combines a positive direct effect with a positive indirect effect.


Marginal = Partial: If there's no collinearity (uncorrelated explanatory variables), then the marginal and partial slopes agree.


R² Increases with Added Variables

Is the multiple regression better than the initial simple regression that considers only the role of climate? The overall summary statistics look better: R² is larger and se is smaller. We need to be choosy, however, before we accept the addition of an explanatory variable. With many possible explanations for the variation in the response, we should only include those explanatory variables that add value.

How do you tell if an explanatory variable adds value? R² is not very useful unless it changes dramatically. R² increases every time you add an explanatory variable to a regression. Add a column of random numbers and R² goes up. Not by much, but it goes up. The residual standard deviation se shares a touch of this perversion; it goes down too easily for its own good. For evaluating the changes brought by adding an explanatory variable, we need confidence intervals and tests. That means looking at the conditions for multiple regression.

Under the Hood: Why R² Gets Larger

How does software figure out the slopes in a multiple regression? We thought you'd never ask! The software does what it did before: minimize the sum of squared deviations, least squares. Only now it gets to use another slope to make the sum of squared residuals smaller. With one x, the software minimizes the sum of squares

min over b0, b1 of   Σ (yi - b0 - b1 x1,i)²,   summing over i = 1, …, n,

    by inserting the least squares estimates for b0 and b1.

Look at what happens when we add x2. In a way, x2 has been there all along, but with its slope constrained to be zero. Simple regression constrains the slope of x2 to zero:

min over b0, b1 of   Σ (yi - b0 - b1 x1,i - 0·x2,i)²

When we add x2 to the regression, the software is free to choose a slope for x2. It gets to solve this problem:

min over b0, b1, b2 of   Σ (yi - b0 - b1 x1,i - b2 x2,i)²

Now that it can change b2, the software can find a smaller residual sum of squares. That's why R² goes up. A multiple regression with two explanatory variables has more choices. This flexibility allows the fitting procedure to explain more variation and increase R².
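A small simulation makes the point concrete. This is a sketch with made-up data, not the chapter's data set: even a column of pure noise cannot lower R².

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # pure noise, unrelated to y
y = 3 + 2 * x1 + rng.normal(size=n)

def r_squared(X, y):
    """R^2 from a least squares fit of y on the columns of X plus an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(r_squared(x1, y))                          # simple regression
print(r_squared(np.column_stack([x1, x2]), y))   # never smaller, usually a touch larger
```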

    The Multiple Regression Model

The Multiple Regression Model (MRM) resembles the SRM, only its equation has several explanatory variables rather than one. The equation


for multiple regression with two explanatory variables describes how the conditional average of y given both x1 and x2 depends on the x's:

E(y | x1, x2) = μy|x1,x2 = β0 + β1x1 + β2x2.

Given x1 and x2, y lies at its mean plus added random noise, the errors:

y = μy|x1,x2 + ε

According to this model, only the conditional mean of y changes with the explanatory variables x1 and x2. The rest of the assumptions describe the errors; these are the same three assumptions as in the SRM:

1. The errors are independent, with

2. equal variance σ², and ideally are

3. normally distributed.

Ideally, the errors are an iid sample from a normal distribution. As in the SRM, the MRM requires nothing of the explanatory variables. Because we want to see how y varies with changes in the x's, all we need is variation in the x's. It would not make sense to use a constant as an explanatory variable.

The equation of the MRM embeds several assumptions. It implies that the average of y varies linearly with each explanatory variable, regardless of the other. The x's do not mediate each other's influence on y. Differences in y associated with differences in x1 are the same regardless of the value of x2 (and vice versa). That can be a hard assumption to swallow. Here's an equation for the sales of a product marketed by a company:

Sales = β0 + β1 Advertising Spending + β2 Price Difference + ε

The price difference is the difference between the list price of the company's product and the list price of its main rival.

Let's examine this equation carefully. It implies that the impact on sales, on average, of spending more for advertising is limitless. The more the company advertises, the more it sells. Also, advertising has the same effect regardless of the difference in price. It does not matter which product costs more; advertising has the same effect. That may be the case, but there are situations in which the effect of advertising depends on the difference in price. For example, the effect of advertising might depend on whether the ad is promoting a difference in price.

We've seen remedies for some of these problems. For instance, a log-log scale captures diminishing returns, with log sales regressed on log advertising, but that does nothing for the second problem. Special explanatory variables known as interactions allow the equation to capture synergies between the explanatory variables, but we'll save interactions for Chapter 26. For now, we have to hope that the effect of one explanatory variable does not depend on the value of the other.


Since the only difference between the SRM and the MRM is the equation, it should not come as a surprise that the checklist of conditions matches:

    Straight enough

    No embarrassing lurking variable

    Similar variances

    Nearly normal

The difference lies in the choice of plots used to verify these conditions. Simple regression is simple because there's one key plot: the scatterplot of y on x. If you cannot see a line, neither can the computer. Multiple regression offers more choices, but there is a natural sequence of plots that goes with each numerical summary. You want to look at these before you dig into the output very far.

Calibration Plot

The summary of a regression usually begins with R² and se, and two plots belong with these. These two plots convert multiple regression into a special simple regression. Indeed, most plots make multiple regression look like simple regression in one way or another. This table repeats R² and se for the two-predictor model for natural gas.

R²   0.5457
se   27.97

Table 24-4. Overall summary of the two-predictor model.

The calibration plot summarizes the fit of a multiple regression, much as a scatterplot of y on x summarizes a simple regression. In particular, the calibration plot is a scatterplot of y versus the fitted value ŷ = b0 + b1x1 + b2x2. Rather than plot y on either x1 or x2, the calibration plot places the fitted values on the horizontal axis.

Figure 24-5. Calibration plot for the two-explanatory-variable MRM. (Natural Gas (MCF) versus Estimated Natural Gas (MCF).)

For the simple regression, the scatterplot of y on x shows R²: it's the square of the correlation between y and x. Similarly, the calibration plot shows R² for the multiple regression. R² in multiple regression is again the square of a correlation, namely the square of the correlation between y and the fitted values ŷ.

R²: the square of the correlation between y and the estimate ŷ.


The tighter the data cluster along the diagonal in the calibration plot, the better the fit.
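A generic sketch of a calibration plot (matplotlib; not tied to the chapter's data):

```python
import numpy as np
import matplotlib.pyplot as plt

def calibration_plot(y, y_hat):
    """Plot the response against the fitted values; R^2 is the squared correlation."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    plt.scatter(y_hat, y, s=12)
    lims = [min(y_hat.min(), y.min()), max(y_hat.max(), y.max())]
    plt.plot(lims, lims)                       # the diagonal: tight clustering means a good fit
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2      # matches the regression R^2
    plt.xlabel("Fitted values")
    plt.ylabel("Response")
    plt.title(f"Calibration plot (R^2 = {r2:.3f})")
    plt.show()
    return r2
```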

The Residual Plot

The plot of residuals on x is very useful in simple regression because it zooms in on the deviations around the fitted line. The analogous plot is useful in multiple regression. All we do is shear the calibration plot, twisting it so that the regression line in the calibration plot becomes flat. In other words, we plot the residuals, y - ŷ, on the fitted values ŷ.

This view of the fit shows se. If the residuals are nearly normal, then 95% of them lie within ±2 se of zero. Here's the plot of residuals on fitted values for the multiple regression of gas use on HDD and rooms.

Figure 24-6. Scatterplot of residuals on fitted values. (Natural Gas (MCF) residuals versus Estimated Natural Gas (MCF).)

You can guess that se is about 30 MCF from this plot because all but about 10 cases (out of 512) lie within ±60 MCF of the horizontal line at zero. (The actual value of se is 27.97 MCF.) The residuals should suggest an angry swarm of bees with no real pattern. In this example, the residuals suggest a lack of constant variation (heteroscedasticity, Chapter 23). The residuals at the far left seem less variable than the rest, but the effect is not severe.

That's the most common use of this plot: checking the similar-variances condition. Often, as we've seen in Chapter 23, data become more variable as they get larger. Since ŷ tracks the size of the predictions, this plot is the natural place to check for changes in variation. If you see a pattern in this plot, either a trend in the residuals or changing variation, the model does not meet the conditions of the MRM. In this example, the residuals hover around zero with no trend, but there might be a slight change in the variation.
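A companion sketch for the residual plot, with guide lines at 0 and ±2 se (again generic; q = 2 explanatory variables assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(y, y_hat, q=2):
    """Plot residuals against fitted values with reference lines at 0 and +/- 2 s_e."""
    resid = np.asarray(y) - np.asarray(y_hat)
    s_e = np.sqrt(resid @ resid / (len(resid) - 1 - q))   # residual SD, as in the Formulas section
    plt.scatter(y_hat, resid, s=12)
    plt.axhline(0)
    plt.axhline(2 * s_e, linestyle="--")
    plt.axhline(-2 * s_e, linestyle="--")                 # about 95% of residuals should fall inside
    plt.xlabel("Fitted values")
    plt.ylabel("Residual")
    plt.show()
```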

    Checking Normality

The last condition to check is the nearly normal condition. We haven't discovered outliers or other problems in the other views, so chances are this one will be OK as well. Here's the normal quantile plot of the residuals.


Figure 24-7. Normal quantile plot of residuals from the multiple regression.

It's a good thing that we checked. The residuals are slightly skewed. The residuals reach further out on the positive side (to +80) than on the negative side (-60). The effect is mild, however, and almost consistent with normality. We'll take this as nearly normal, but be careful about predicting the gas consumption of individual homes. (The skewness and slight shift in variation in Figure 24-6 suggest you might have problems with prediction intervals.) For inferences about slopes, however, we're all set to go.
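If your software does not draw a normal quantile plot for you, a sketch like this (assuming scipy is available) gives the same information:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def normal_quantile_check(resid):
    """Normal quantile (Q-Q) plot of regression residuals, plus the sample skewness."""
    resid = np.asarray(resid)
    stats.probplot(resid, dist="norm", plot=plt)   # points near the line look nearly normal
    plt.show()
    return stats.skew(resid)                       # positive values indicate a longer right tail
```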

Inference in Multiple Regression

It's time for confidence intervals. As usual, we start with the standard error. The layout of the estimates for multiple regression in Table 24-3 matches that for simple regression, only the table of estimates has one more row. Each row gives an estimate and its standard error, followed by a t-statistic and p-value.

Term                              Estimate    Std Error   t-statistic   p-value
Intercept, b0                     -1.775      5.339       -0.33         0.7396
Heating Degree Days (000), b1     12.78       0.657       19.45


The t-statistic for the test of H0: β0 = 0 is -0.33; the estimate lies one-third of a standard error below zero. The p-value indicates that 74% of random samples from a population with β0 = 0 produce estimates this far (or farther) from zero. Because the p-value is larger than 0.05, we cannot reject H0. What does that mean? It does not mean β0 is zero; it only suggests that it might be zero, negative, or positive.

The F-Test in Multiple Regression

Multiple regression adds a test that we don't need in simple regression. It's called the F-test, and it usually appears in a portion of the output known as the Analysis of Variance, abbreviated ANOVA. We'll have more to say about the ANOVA table in Chapter 25. (The F-statistic is not needed in simple regression because the t-statistic serves the same purpose. In a simple regression the F-statistic is the square of the t-statistic for the one slope, F = t², and both produce the same p-value.)

The F-test, which is obtained using the F-statistic, measures the explanatory value of all of the explanatory variables taken collectively rather than separately. t-statistics consider the explanatory variables one at a time; an F-statistic looks at them collectively. What null hypothesis is being tested? For the F-statistic, the null hypothesis states that your data are a sample from a population for which all the slopes are zero. In this case, the null hypothesis is H0: β1 = β2 = 0. In other words, it tests the null hypothesis that the model explains nothing. Unless you can reject this one, the explanatory variables collectively explain nothing more than random variation.

The F-test in regression comes in handy because of the problem with R². Namely, it gets larger whenever you add an explanatory variable. In fact, if you add enough explanatory variables, you can make R² as large as you want. You can think of the F-test as a test of the size of R². Suppose that a friend of yours who's taking Stat tells you that she has built a great regression model. Her regression has an R² of 98%. Before you get impressed, you ought to ask her a couple of questions: "How many explanatory variables are in your model?" and "How many observations do you have?" If her model obtains an R² of 98% using two explanatory variables and n = 1,000, you should learn more about her model. If she has n = 50 cases and 45 explanatory variables, the model is not so impressive.

R² does not track the number of explanatory variables or the sample size. The F-statistic cares. R² is the ratio of how much variation resides in the fitted values of your model compared to the variation in y:

R² = Variation in ŷ / Variation in y = Σ (ŷi - ȳ)² / Σ (yi - ȳ)²,

with both sums running over i = 1, …, n.

t vs F: The t-stat tests the effect of one explanatory variable; the F-stat tests the combination of them all.


The more explanatory variables you have, the more variation gets tied up in the fitted values. As you add explanatory variables, the top of this ratio gets larger and the bottom stays the same, so R² goes up. The F-statistic doesn't offer this free lunch; it charges the model for each explanatory variable. For a multiple regression with q explanatory variables, the F-statistic is

F = (Variation in ŷ per explanatory variable) / (Variation remaining per residual d.f.)
  = (R²/q) / ((1 - R²)/(n - q - 1))
  = [R²/(1 - R²)] × [(n - q - 1)/q]

    If you have relatively few explanatory variables compared to the numberof cases (q


    Steps in Building a Multiple Regression

Let's summarize the steps that we've considered and the order in which we've made them. As in simple regression, it pays to start with the big picture. (A code sketch after the list illustrates steps 3 through 8.)

    1) What problem are you trying to solve? Do these data help you?

Until you know enough about the data and the problem to answer these two questions, there's no need to fit a complicated model.

2) Check the scatterplots of the response versus the explanatory variables and also those that show relationships among the explanatory variables. Make sure that you understand the measurement units of the variables and identify any outliers.

3) If these scatterplots of y on xj appear straight enough, fit the multiple regression. Otherwise, find a transformation to straighten out a relationship that bends.

    4) Find the residuals and fitted values from your regression.

5) Make scatterplots that show the overall model (y on ŷ and residuals versus ŷ). The residual plot should look simple. The residual plot is the best place to identify changing variances.

6) Check whether the residuals are nearly normal. If not, be very cautious about using prediction intervals.

7) Check the F-statistic to see whether you can reject the null model and conclude that some slope differs from zero. If not, go no further with tests.

    8) Test and interpret individual partial slopes.
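Here is a compact sketch of steps 3 through 8 using statsmodels; the data frame and its column names are made up for illustration, not the chapter's data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with the chapter's variables (values invented for illustration)
df = pd.DataFrame({
    "gas_mcf": [82, 121, 64, 148, 95, 110, 73, 135],
    "mhdd":    [5.5, 6.8, 4.9, 7.2, 6.0, 6.5, 5.2, 7.0],
    "rooms":   [5, 7, 4, 9, 6, 7, 5, 8],
})

fit = smf.ols("gas_mcf ~ mhdd + rooms", data=df).fit()   # step 3: fit the multiple regression
resid, fitted = fit.resid, fit.fittedvalues              # step 4: residuals and fitted values
# steps 5 and 6: plot gas_mcf vs fitted, resid vs fitted, and a normal quantile plot of resid
print(fit.fvalue, fit.f_pvalue)                          # step 7: the overall F-test
print(fit.summary())                                     # step 8: t-statistics for the partial slopes
```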

4M Subprime Mortgages

Subprime mortgages dominated business news in 2007. A subprime mortgage is a loan made to a riskier borrower than most. As this example shows, there's a reason that banks and hedge funds plunged into the risky housing market: these loans earn more interest so long as the borrower can keep paying. Defaults on such mortgages brought down several hedge funds that year. For this analysis, we've made you an analyst at a creditor who's considering moving into this market.

The two explanatory variables in this analysis are common in this domain. The loan-to-value ratio (LTV) captures the exposure of the lender to defaults. For example, if LTV = 0.80 (80%), then the mortgage covers 80% of the value of the property. The FICO score (named for its owner, the Fair Isaac Company) is the most common commercial measure of the creditworthiness of a borrower.

    Motivation

    State the business decision.

As a mortgage lender, my company would like to know what might be gained by moving into the subprime market. In particular, we'd like to know which characteristics of the borrower and loan affect the amount of interest we can earn on loans in this category.

    Method

Plan your analysis. Identify the predictor and the response.

    Relate regression to the business decision.

    Describe the sample.

Use a plot to make sure correlations make sense. We've omitted them here to save space, but look before you summarize. For example, here's APR versus LTV.

    Verify the big-picture condition.

I'll use a multiple regression. I have two explanatory variables: the credit rating score of the borrower (FICO) and the ratio of the value of the loan to the value of the property (LTV). The response is the annual percentage rate of interest earned by the loan (APR).

The partial slope for FICO describes, for a given LTV, how much poor credit costs the borrower. The partial slope for LTV controls for the exposure we're taking on the loan: the higher the LTV, the more risk we face if the borrower defaults.

We obtained data on 372 mortgages from a credit bureau that monitors the subprime market. These loans are an SRS of subprime mortgages within the geographic area where we operate.

Scatterplots of APR on both explanatory variables seem reasonable, and FICO and LTV are not dramatically correlated with each other. The relationships are linear, so I'll summarize them with a table of correlations.

         LTV        FICO
APR     -0.4265    -0.6696
LTV                 0.4853

Straight enough. Seems OK from scatterplots of APR versus LTV and FICO. There's no evident bending, the plots indicate moderate dependence, and no big outliers.

    Mechanics

Check the additional conditions on the errors by examining the residuals.

No embarrassing lurking factor. I can imagine a few other factors that may be omitted, such as more precise geographic information. Other aspects of the borrower (age, race) had better not matter, or something illegal is going on.

Similar variances. The plot of residuals on fitted values shows consistent variation over the range of fitted values. (There is one outlier and some skewness in the residuals, but those features are more visible in the quantile plot.)

Nearly normal. The histogram and normal quantile plot confirm the skewness of the residuals. The regression fit underestimates APR by up to 7%, but rarely overpredicts by more than 2%. Since I'm not building a prediction interval for individual loans, I'll rely on the CLT to produce normal sampling distributions for my estimates.


If there are no severe violations of the conditions, summarize the overall fit of the model.

Describe the estimated equation.

Build confidence intervals for the relevant parameters. With all of these other details, don't forget to round to the relevant precision.

Here's the summary of my fitted model:

R²   0.4619
se   1.242

F = R²/(1 - R²) × (n - 1 - 2)/2 = 0.4619/(1 - 0.4619) × (372 - 3)/2 ≈ 158

It explains about 46% of the variation (R² = 0.46) in the interest rates, which is highly statistically significant by the F-test. I can clearly reject H0 that both slopes are zero. This model explains real variation in APR. The SD of the residuals is se = 1.24%.

Term    Est        SE        t Stat     p-value
b0      23.691     0.650     36.46
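A one-line check of the F-statistic reported above, using the chapter's formula with q = 2:

```python
r2, n, q = 0.4619, 372, 2
f_stat = (r2 / q) / ((1 - r2) / (n - q - 1))
print(round(f_stat))   # about 158
```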


Summary

The Multiple Regression Model (MRM) expands the Simple Regression Model by incorporating other explanatory variables in its equation. The slopes in the equation of an MRM are partial slopes that typically differ from the marginal slopes in a simple regression. Collinearity is the correlation between predictors in a multiple regression. A path diagram is a useful figure for distinguishing the direct and indirect effects of explanatory variables. A calibration plot of y on ŷ shows the overall fit of the model, visualizing the R² of the model. The plot of residuals on ŷ allows a check for similar variances. The F-statistic measures the overall significance of the fitted model, and individual t-statistics for each partial slope test the incremental improvement in R² obtained by adding that variable to a model containing the other explanatory variables.

Index: calibration plot; collinearity; F-statistic; F-test; slope (marginal, partial).

Formulas

In each formula, q denotes the number of explanatory variables. For the examples in this chapter, q = 2.

F-statistic

F = (Variation in ŷ per predictor) / (Variation remaining per residual d.f.)
  = (R²/q) / ((1 - R²)/(n - q - 1))
  = [R²/(1 - R²)] × [(n - q - 1)/q]

se

Divide the sum of squared residuals by n minus the number of estimates in the equation. For a multiple regression with q = 2 explanatory variables (and estimates b0, b1, and b2),

se² = Σ ei² / (n - 1 - q) = Σ (yi - b0 - b1 xi,1 - b2 xi,2)² / (n - 1 - 2),

with the sums running over i = 1, …, n.

Best Practices

Know the context of your model. It's important in simple regression, but even more important in multiple regression. How are you supposed to guess what factors are missing from the model unless you know something about the problem and the data?


Examine plots of the overall model and coefficients before interpreting the output. You did it in simple regression, and you need to do it even more in multiple regression. It can be really, really tempting to dive right into the output rather than hold off to look at the plots, but you'll find that you make better choices by being more patient.

Check the overall F-statistic before digging into the t-statistics. If you look at enough t-statistics, you'll eventually find explanatory variables that are statistically significant. Statistics rewards persistence. If you check the F-statistic first, you'll avoid the worst of these problems.

Distinguish marginal from partial slopes. A marginal slope combines the direct and indirect effects. A partial slope avoids the indirect effects of other variables in the model. Some would say that the partial slope "holds the other variables fixed," but that's too far from the truth. It is true in a certain mathematical sense, but we didn't hold anything fixed in our example; we just compared energy consumption among homes of differing size in different climates.

Let your software compute prediction intervals in multiple regression. Extrapolation is harder to recognize in multiple regression. For instance, suppose we were to use our multiple regression to estimate gas consumption for 4-room homes in climates with 7,500 heating DD. That's not an outlier on either variable alone, but it's an outlier when we combine the two! In general, prediction intervals have the same form in multiple regression as in simple regression, namely the predicted value ± 2 se. This only applies when not extrapolating. If you do extrapolate, you had better let the software do the calculations. We always do in multiple regression.
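A sketch of letting the software do it, using statsmodels and the same invented data as in the workflow sketch above; the new point combines values that are not extreme separately:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({   # hypothetical data, as in the earlier workflow sketch
    "gas_mcf": [82, 121, 64, 148, 95, 110, 73, 135],
    "mhdd":    [5.5, 6.8, 4.9, 7.2, 6.0, 6.5, 5.2, 7.0],
    "rooms":   [5, 7, 4, 9, 6, 7, 5, 8],
})
fit = smf.ols("gas_mcf ~ mhdd + rooms", data=df).fit()

new_home = pd.DataFrame({"mhdd": [7.5], "rooms": [4]})        # unusual combination: an extrapolation
print(fit.get_prediction(new_home).summary_frame(alpha=0.05)) # obs_ci_lower/upper give the prediction interval
```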

Figure 24-8. Outliers are more subtle when the x's are correlated. (Scatterplot of Heating DD (000) versus Number of Rooms.)

Pitfalls

Become impatient. Multiple regression takes time, not only to learn but also to do. If you get hasty and skip the plots, you may fool yourself into thinking you've figured it all out, only to discover later that it was just an outlier.


Think that you have all of the important variables. Sure, we added a second variable to our model for energy consumption, and it made a lot of sense. That does not mean that we've gotten them all, however. In most applications, it is virtually impossible to know whether you've found all of the relevant explanatory variables.

Think that you've found causal effects. Unless you got your data by running an experiment (and this does happen sometimes), you cannot get causation from a regression model, no matter how many explanatory variables you have in the model. Just as we did not hold any variable fixed, we probably did not change any of the variables either. We've just compared averages.

Think that an insignificant t-statistic implies an explanatory variable has no effect. All it means if we cannot reject H0: β1 = 0 is that this slope might be zero. It might not be. The confidence interval tells you a bit more. If the confidence interval includes zero, it's telling us that the partial slope might be positive or might be negative. Just because we don't know the direction (or sign) of the slope doesn't mean it's zero.

About the Data

The data on household energy consumption are from the Residential Energy Consumption Survey conducted by the US Department of Energy. The study of subprime mortgages is based on an analysis of these loans in a working paper, "Mortgage Brokers and the Subprime Mortgage Market," produced by A. El Anshasy, G. Elliehausen, and Y. Shimazaki of George Washington University and the Credit Research Center of Georgetown University.

Software Hints

The software commands for building a multiple regression are essentially those used for building a model with one explanatory variable. All you need to do is select several columns as explanatory variables rather than just one.

Excel

To fit a multiple regression, follow the menu commands

Tools > Data Analysis > Regression

(If you don't see this option in your Tools menu, you will need to add these commands. See the Software Hints in Chapter 19.) Selecting several columns as X variables produces a multiple regression analysis.

Minitab

The menu sequence

Stat > Regression > Regression

constructs a multiple regression if several variables are chosen as explanatory variables.
