ch16.ci


    4/20/08

    16 Confidence Intervals

    RANGES FOR PARAMETERS
        Confidence Interval for the Proportion
        Confidence Interval for the Mean
    INTERPRETING CONFIDENCE INTERVALS
    MANIPULATING CONFIDENCE INTERVALS
        Combining Confidence Intervals
        Changing the Problem
    CONFIDENCE INTERVAL OR TEST?
    MARGIN OF ERROR
        Determining Sample Size
    SUMMARY


    Ranges for Parameters

    The contemplated launch of a new credit card proposes sending pre-approved applications for an affinity card to 100,000 alumni of a large university. That's the population. Two parameters of this population determine whether the card will be profitable:

    p, the proportion who will return the application.

    μ, the average monthly balance that those who accept the card will carry.

    To estimate these parameters, the bank sent pre-approved applications to a sample of 1,000 alumni in a populous metropolitan area. Of these, 140 accepted the offer. This histogram shows the average balance for these customers in the 3 months after receiving the card.

    Figure 16-1. Balances of a sample of credit card accounts.

    The distribution of balances is right-skewed, with a large number at zero. Accounts with zero balances identify transactors (who pay their balance in full each month) and customers who carry an extra card just in case. This table shows the key statistics.

    Number of offers           1000
    Number accepted            140
    Proportion who accepted    p̂ = 0.14
    Average(balance)           x̄ = $1990.50
    SD                         s = $2833.33
    Skewness                   1.750
    Kurtosis                   2.975

    Table 16-1. Summary statistics in the test of a new affinity credit card.

    Of the 1,000 alumni who received offers, 14% returned the pre-approved application. Among these, the average monthly balance is $1,990.50 with a standard deviation of $2,833.33.

    Confidence Interval for the Proportion

    What can be concluded about p, the proportion in the population of alumni who will accept the offer? Rather than test a hypothesis that asserts p is larger or smaller than some threshold, we will construct a range for p that holds values compatible with the observed sample. A confidence interval is a range of plausible values for a parameter based on the data in a sample from the population. To arrive at this range, we use the sampling distribution of the sample statistic. We will start by building a confidence interval for p based on the sample proportion p̂. Then we will do the same for μ using X̄.

    Because we have a large sample (n = 1000), the Central Limit Theorem tells us that we can use a normal model to approximate the sampling distribution of p̂:

    \[ \hat{p} \sim N\!\left(p,\ \frac{p(1-p)}{n}\right) \]

    It follows that (z_{0.025} = 1.96)

    \[ P\!\left(-1.96\sqrt{p(1-p)/n} \;\le\; \hat{p} - p \;\le\; 1.96\sqrt{p(1-p)/n}\right) = 0.95 \]

    In words, the sample proportion lies within about 2 standard errors of the population proportion in 95% of samples. If the standard error is small, then p̂ is probably close to p. Because we do not know p, we also don't know the standard error of p̂. The plug-in estimate

    \[ se(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n} \]

    that uses p̂ in place of p does quite well, however, and it remains approximately true that

    \[ P\!\left(-1.96\sqrt{\hat{p}(1-\hat{p})/n} \;\le\; \hat{p} - p \;\le\; 1.96\sqrt{\hat{p}(1-\hat{p})/n}\right) \approx 0.95 \]

    The endpoints \( \hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n} \) form a 95% confidence interval for p. Even though this interval uses an estimated standard error, it is nonetheless often called a z-interval because of the use of a normal model for the sampling distribution. (See Chapter 17.)

    The confidence interval is said to cover p if p lies within the interval. The value 95%, or 0.95, is the coverage or confidence level of the interval. By default, most confidence intervals have coverage 0.95, just as most tests set the α-level to 0.05. We often describe the coverage of an interval as 100(1-α)%, with α usually set to 0.05. To alter the coverage, change the number of multiples of the standard error. For instance, the wider interval with z_{0.005} = 2.58 in place of 1.96 has 99% coverage because

    \[ P\!\left(-2.58\sqrt{\hat{p}(1-\hat{p})/n} \;\le\; \hat{p} - p \;\le\; 2.58\sqrt{\hat{p}(1-\hat{p})/n}\right) \approx 0.99 \]

    The endpoints \( \hat{p} \pm 2.58\sqrt{\hat{p}(1-\hat{p})/n} \) form the 99% confidence interval for p. We will stick to 95% intervals unless there is a compelling reason for choosing a different coverage.
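The link between coverage and the multiplier can be checked numerically. The following is a minimal sketch, assuming only Python's standard library; the function name `z_multiplier` is ours, not the text's.

```python
from statistics import NormalDist

def z_multiplier(coverage: float) -> float:
    """Return z_{alpha/2}: the number of standard errors needed so that a
    standard normal variable falls within +/- that many units with the
    requested coverage probability."""
    alpha = 1.0 - coverage
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

z95 = z_multiplier(0.95)  # close to the 1.96 used in the text
z99 = z_multiplier(0.99)  # close to the 2.58 used in the text
print(f"95% coverage: z = {z95:.2f}; 99% coverage: z = {z99:.2f}")
```

Asking for more coverage pushes the multiplier up, which is why the 99% interval is wider than the 95% interval.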

    confidence interval: Range of plausible values for population parameter that are compatible with sample data.

    z-interval: A confidence interval based on using a normal model for the sampling distribution.

    coverage: Probability of a random sample in which the confidence interval holds the parameter. Most often, the coverage is 0.95.


    Confidence Interval for p

    The 100(1-α)% z-interval for p is the range

    \[ \hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \quad\text{to}\quad \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \]

    where P(Z > z_α) = α for Z ~ N(0,1). For a 95% confidence interval, use z_{α/2} = 1.96.

    Checklist

    SRS Condition. The observed sample is a simple random sample from the relevant population. If sampling from a finite population, the sample comprises less than 10% of the population.

    CLT Condition (for proportion). Both np̂ and n(1-p̂) are larger than 10.

    For the credit-card example, n = 1000 and p̂ = 0.14. The data are a simple random sample, and the sample size is large enough to meet the required conditions. The estimated standard error is

    \[ se(\hat{p}) = \sqrt{0.14(1-0.14)/1000} \approx 0.01097 \]

    The 95% confidence interval for p is the range

    \[ 0.14 - 1.96 \times 0.01097 \quad\text{to}\quad 0.14 + 1.96 \times 0.01097 \;\approx\; [0.1185 \text{ to } 0.1615] \]

    We often write a confidence interval with the endpoints enclosed in square brackets [ to ]. We can also write this interval as [11.85% to 16.15%] with added percentage symbols to clarify that the interval describes a proportion. With 95% confidence the population proportion that will accept this offer is between about 12% and 16%.
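The calculation for the credit-card sample can be reproduced in a few lines. This is a sketch under the chapter's assumptions (simple random sample, large n); the function name `proportion_z_interval` is our own.

```python
from math import sqrt

def proportion_z_interval(successes: int, n: int, z: float = 1.96):
    """z-interval for a proportion using the plug-in standard error
    sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# 140 of the 1,000 sampled alumni accepted the offer.
lo, hi = proportion_z_interval(140, 1000)
print(f"[{lo:.4f} to {hi:.4f}]")  # prints [0.1185 to 0.1615]
```

Passing `z=2.58` instead gives the wider 99% interval.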

    Don't lose sight of the importance of a representative sample. If you don't have an SRS, don't bother with a confidence interval. It's easy to imagine how things could have gone wrong in this example. This test market may not be representative: alums elsewhere may be less receptive to the offer. The confidence interval would not adjust for this difference.

    Are You There?

    An auditor checks a sample of 225 randomly chosen transactions from among the many thousands processed in an office. 35 contain errors in crediting or debiting the appropriate account.

    (a) Does this situation meet the conditions required for a z-interval for the proportion? [1]

    (b) Find the 95% confidence interval for p, the proportion of all transactions processed in this office that have these small errors. [2]

    (c) Managers claim that the proportion of errors is about 10%. Does that seem reasonable? [3]

    1 Yes. p̂ = 35/225 ≈ 0.156. Both np̂ and n(1-p̂) are larger than 10. Presumably, this is an SRS.
    2 p̂ ± 1.96 sqrt(p̂(1-p̂)/n) = 0.156 ± 1.96 × 0.0242 = 0.156 ± 0.047, or 0.109 to 0.203.


    Confidence Interval for the Mean

    A similar procedure produces a confidence interval for μ, the mean of a population. The similarity owes to our use of a normal model for the sampling distribution of X̄,

    \[ \bar{X} \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right) \]

    This sampling distribution implies that, for instance,

    \[ P\!\left(-1.96\,\sigma/\sqrt{n} \;\le\; \bar{X} - \mu \;\le\; 1.96\,\sigma/\sqrt{n}\right) = 0.95 \]

    The averages of 95% of the samples are within 1.96σ/√n of μ. Once again, the sample statistic lies within about 2 standard errors of the population parameter in most samples.

    As in tests, we do not know σ and resort to the plug-in estimate

    \[ se(\bar{X}) = S/\sqrt{n} \]

    (The capital S is a reminder that the sample SD is a random variable that changes from sample to sample.) The t-distribution with n-1 degrees of freedom determines the number of standard errors needed to obtain the desired coverage. With α = 0.05,

    \[ P\!\left(-t_{\alpha/2,n-1}\,S/\sqrt{n} \;\le\; \bar{X} - \mu \;\le\; t_{\alpha/2,n-1}\,S/\sqrt{n}\right) \approx 0.95 \]

    The range \( \bar{X} \pm t_{\alpha/2,n-1}\,S/\sqrt{n} \) is the 95% confidence t-interval for μ. Again, we can vary the coverage by changing the number of multiples of the standard error.

    Confidence Interval for μ

    The 100(1-α)% confidence t-interval for μ is

    \[ \bar{X} - t_{\alpha/2,n-1}\,S/\sqrt{n} \quad\text{to}\quad \bar{X} + t_{\alpha/2,n-1}\,S/\sqrt{n} \]

    where P(T_{n-1} > t_{α/2,n-1}) = α/2 for T_{n-1} distributed as a t random variable with n-1 degrees of freedom.

    Checklist

    SRS Condition. The observed sample is a simple random sample from the relevant population. If sampling from a finite population, the sample comprises less than 10% of the population.

    CLT Condition. The sample size is larger than 10 times the squared skewness and 10 times the absolute value of the kurtosis, n > 10 K3² and n > 10|K4| (as in Chapters 13 and 15).

    3 No. 10% seems too low (at 95% confidence). It appears that the rate is larger.

    In this example, we have n = 140 accounts for estimating μ rather than 1000 people available for estimating p. Since n = 140 is larger than 10 K3² ≈ 31 and 10|K4| ≈ 28 (see Table 16-1), these data satisfy the CLT condition. To obtain a 95% confidence interval for μ, we use t_{0.025,139} ≈ 1.98, which is slightly larger than the corresponding factor in a z-interval. The 95% confidence interval for the average balance carried by alumni who accept this credit card is

    \[ 1990.4966 - 1.98 \times 2833.3324/\sqrt{140} \approx \$1516.365 \]

    to

    \[ 1990.4966 + 1.98 \times 2833.3324/\sqrt{140} \approx \$2464.628 \]

    We are 95% confident that μ lies between $1,516.365 and $2,464.628. Might μ be $2,000? Yes, $2,000 lies within the confidence interval. Might μ be $1,250? It could, but that's outside the confidence interval and not compatible with our sample at a 95% level of confidence.
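The t-interval for the average balance can be sketched the same way. Python's standard library has no t-quantile function, so this sketch takes the multiplier 1.98 from the text as an input rather than computing it; the function name `mean_t_interval` is ours.

```python
from math import sqrt

def mean_t_interval(x_bar: float, s: float, n: int, t_mult: float):
    """t-interval for a mean: x_bar +/- t_mult * s / sqrt(n).
    t_mult is the t quantile for the desired coverage (supplied by the caller)."""
    se = s / sqrt(n)
    return x_bar - t_mult * se, x_bar + t_mult * se

# Summary statistics for the 140 accounts; 1.98 is the value the text
# reports for t with 139 degrees of freedom.
lo, hi = mean_t_interval(1990.4966, 2833.3324, 140, t_mult=1.98)
print(f"${lo:,.3f} to ${hi:,.3f}")

# Is a hypothesized mean compatible with the sample at 95% confidence?
for mu in (2000, 1250):
    print(mu, "inside" if lo <= mu <= hi else "outside")
```

With these inputs, $2,000 falls inside the interval and $1,250 falls outside, matching the discussion above.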

    Are You There?

    Office administrators claim that the average amount on a purchase order is $6,000. A sample of 49 purchase orders averages x̄ = $4,200 with s = $3,500.

    (a) What is the relevant sampling distribution? [4]

    (b) Find the 95% confidence interval for μ, the mean of purchase orders handled by this office during the sampled period. [5]

    (c) Do you think the administrators' claim is reasonable? [6]

    Interpreting Confidence Intervals

    We wrote "We are 95% confident that μ lies in the range $1,516.365 to $2,464.628." Let's review what that means.

    The first step in interpreting a confidence interval is to round the endpoints to something more sensible. It makes little sense to present endpoints to the nearest $0.001 if the interval allows μ to be $1,700 or $2,200. The number of digits to show depends on the sample size, but most intervals should be rounded to 2 or 3 digits. For example, we'd summarize this interval after rounding to 2 digits as the range $1,500 to $2,500; you could also show it to superfluous precision as $1,520 to $2,460 by keeping a third digit. Whatever your choice, do the intermediate calculations to the full accuracy allowed by your calculator or software; rounding comes only at the last step.

    After rounding, you typically obtain the same interval whether you use x̄ ± t_{α/2,n-1} s/√n or x̄ ± 2 s/√n. The latter interval, an estimate plus or minus two standard errors, is a handy back-of-the-envelope 95% confidence interval.

    4 The sampling distribution of the average is approximately N(μ, σ²/49), with estimated standard error s/7 = $500.
    5 4200 ± 2.01 (3500/sqrt(49)) = [$3195 to $5205] (t_{0.025,48} = 2.01)
    6 No. $6,000 lies far above the confidence interval and is not compatible with these data.

    back-of-the-envelope: To approximate a 95% interval if you're without a computer or calculator, use the range formed by the sample estimate ± 2 standard errors.

    rounding:
    1. Present the endpoints to 2 or 3 digits.
    2. Carry out intermediate steps to full precision.


    Now that we've rounded the endpoints to something easier to communicate, let's think about the meaning of the phrase "We are 95% confident." Imagine placing a bet with a friend before seeing the sample. The outcome of the bet depends on whether the as yet unseen confidence interval covers μ. Your friend magically knows μ, but she does not know which sample will be observed. You wager $95 that the interval X̄ ± t_{0.025,n-1} S/√n will cover μ; your friend bets $5 that it won't. Whoever is right gets all $100. Even though your friend knows μ, this is a fair bet because your share of the total wager ($95 out of $100) matches your chance for winning the pot.

    Compare that situation to the same bet after seeing the sample. Do you really want to bet with your friend in this case? She knows μ, and so knows whether μ is between $1520 and $2460. If she takes this bet, don't expect to win!

    The difference lies in the timing. The interval X̄ ± t_{0.025,n-1} S/√n covers μ in 95% of samples. The observed sample is either one of the good ones in which the confidence interval covers μ or it's not. We can bet ahead of time whether a coin will land heads or tails, but once it lands, it's either heads or tails. Unlike tossing coins, we typically never learn whether a confidence interval covers the parameter unless we eventually see the whole population. When we say "We are 95% confident," we're describing a procedure that works for 95% of samples. If we lined up the intervals from many, many samples, 95% of these intervals would contain μ. A specific interval either covers μ or not.
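The idea of lining up the intervals from many samples can be imitated with a small simulation. The sketch below uses the z-interval for a proportion rather than the t-interval for μ, because it needs only the standard library; the "true" p = 0.14, the number of repetitions and the seed are assumptions chosen for illustration.

```python
import random
from math import sqrt

def coverage_rate(p: float, n: int, reps: int, z: float = 1.96, seed: int = 1) -> float:
    """Draw `reps` samples of size n from a population with proportion p,
    build the z-interval from each, and return the share that cover p."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        se = sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - z * se <= p <= p_hat + z * se:
            covered += 1
    return covered / reps

# Pretend, as the "friend" in the text does, that we know the population value.
rate = coverage_rate(p=0.14, n=1000, reps=2000)
print(f"share of intervals covering p: {rate:.3f}")
```

The share should land near 0.95. Each individual interval in the simulation either covers p or it does not; the 95% describes the procedure across samples.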

    To avoid some common mistakes, let's consider several incorrect interpretations of a confidence interval and review why each is wrong. You can spot most errors of interpretation if you remember that a confidence interval offers a range for a population parameter.

    1. "95% of all customers keep a balance of $1,520 to $2,460."
       This error is so common that it's worth repeating the histogram of the 140 customer balances.

       Figure 16-2. Histogram of account balances showing the confidence interval.

       The confidence interval (identified by the two vertical red lines) doesn't contain much data, much less 95% of the balances. The confidence interval describes the population mean, not the balance of any individual.

    2. "The mean balance of 95% of samples of 140 accounts will fall between $1,520 and $2,460."
       The confidence interval describes μ, not means of other samples.

    3. "The mean balance is between $1,520 and $2,460."
       The average balance in the population does not have to be between $1,520 and $2,460. This is a 95% confidence interval. It might not contain μ.

    Here's the way we prefer to state a confidence interval when presenting results to a non-technical audience:

    "I am 95% confident that the mean monthly balance in the population of customers who accept an application lies between $1,520 and $2,460."

    The phrase "95% confident" hides a lot. It's our way of saying that we're using a procedure that produces an interval that covers μ in 95% of samples.

    Follow the Money [boxed]

    The profitability of a new credit card depends on how many accept the offer and their balances. Some costs are known. Promotional costs in this example are $300,000 plus $5 to send each of the 100,000 mailed offers, a total of $800,000. The bank invests another $50 to set up each account. For profits, the bank earns 10% of the average balance. Let's work through two examples.

    If p̂ = 10% accept the offer, it costs $500,000 to set up the accounts (10,000 @ $50 each), pushing the total cost of the launch to $800,000 + $500,000 = $1.3 million. If the average monthly balance of these 10,000 accounts is x̄ = $2,500, the bank earns $2.5 million. After subtracting the costs, the bank nets $1.2 million:

    \[
    \begin{aligned}
    \text{profit} &= 100{,}000 \times \hat{p} \times (\bar{x}/10 - \$50) - \$800{,}000 \\
                  &= 100{,}000 \times 0.10 \times (\$2{,}500/10 - \$50) - \$800{,}000 \\
                  &= \$1{,}200{,}000
    \end{aligned}
    \]

    Consider what happens if, however, 5% return the application and carry a smaller balance of $1,500. The bank still has $800,000 in promotional costs, plus $250,000 to set up the accounts, totaling $1,050,000. The interest from 5,000 customers who carry a $1,500 balance is $750,000, and so the bank loses $300,000:

    \[ \text{profit} = 100{,}000 \times 0.05 \times (\$1{,}500/10 - \$50) - \$800{,}000 = -\$300{,}000 \]

    Profitable scenario: 10% accept, $2,500 balance. Bank profits $1.2 million.

    Unprofitable scenario: 5% accept, $1,500 balance. Bank loses $0.3 million.
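The two scenarios follow from one formula, which a short function makes easy to re-run with other inputs. This is a sketch of the boxed calculation; the function and parameter names are ours, and the default values are the costs stated in the box.

```python
def launch_profit(p: float, mean_balance: float, offers: int = 100_000,
                  setup_cost: float = 50.0, fixed_cost: float = 800_000.0,
                  interest_share: float = 0.10) -> float:
    """Profit = offers * p * (interest_share * mean_balance - setup_cost) - fixed_cost.
    fixed_cost covers the $300,000 promotion plus $5 for each of 100,000 mailings."""
    accounts = offers * p
    return accounts * (interest_share * mean_balance - setup_cost) - fixed_cost

profitable = launch_profit(0.10, 2500)    # about +$1.2 million
unprofitable = launch_profit(0.05, 1500)  # about -$0.3 million
print(f"{profitable:,.0f}  {unprofitable:,.0f}")
```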


    Manipulating Confidence Intervals

    Once we have a confidence interval, we can manipulate it to obtain ranges for related quantities. For instance, suppose that federal regulators require that the lending bank keep cash on hand equivalent to 10% of the outstanding balances. How much cash is that? Because the 95% confidence interval for μ is [$1,520 to $2,460], the 95% confidence interval for 10% of μ (0.1 × μ) is [$152 to $246].

    If [L, U] is a 100(1-α)% confidence interval for μ and c is a positive constant, then
        [c × L, c × U]
    is a 100(1-α)% confidence interval for c × μ, and
        [c + L, c + U]
    is a 100(1-α)% confidence interval for c + μ.

    The same applies if the parameter is p rather than μ. More generally, you can substitute a confidence interval for a parameter in many types of mathematical expressions. Functions like logs, square roots, and reciprocals are monotone; they keep going in the same direction.

    If [L, U] is a 95% confidence interval for μ and f is a monotone increasing function, then [f(L), f(U)] is a 95% confidence interval for f(μ). If f is a monotone decreasing function, then [f(U), f(L)] is a 95% confidence interval for f(μ).

    For example, if [1.30 $/€ to 1.70 $/€] is a 95% confidence interval for the expected value of the dollar/euro exchange rate, then

        [1/1.70 = 0.588 €/$ to 1/1.30 = 0.769 €/$]

    is a 95% confidence interval for the expected euro/dollar exchange rate.
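The rule for monotone functions amounts to transforming both endpoints and putting them back in order. A minimal sketch, with a helper name of our own choosing:

```python
def transform_interval(lo: float, hi: float, f):
    """Apply a monotone function f to both endpoints of [lo, hi].
    Sorting the results handles decreasing functions, whose endpoints swap."""
    a, b = f(lo), f(hi)
    return (a, b) if a <= b else (b, a)

# Increasing function: 10% of the mean balance (cash reserve).
cash = transform_interval(1520, 2460, lambda mu: 0.1 * mu)

# Decreasing function: dollars-per-euro to euros-per-dollar.
euro_per_dollar = transform_interval(1.30, 1.70, lambda rate: 1 / rate)

print(cash, tuple(round(x, 3) for x in euro_per_dollar))
```

The helper is only valid for functions that are monotone over the whole interval; a function that rises and then falls inside [lo, hi] would need more care.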

    Combining Confidence Intervals

    If p = 0.14 and μ = $1990.50, the following spreadsheet shows a solid profit for the bank:

    Cost          $300,000     Up-front cost of promotion
                   500,000     Mailing costs (100,000 @ $5)
                   700,000     Cost to set up accounts (14,000 @ $50)
    Total Cost    $1,500,000
    Income        $2,786,700   Interest (14,000 @ 0.1 × $1990.50)
    Net profit    $1,286,700

    Table 16-2. Profitability if sample statistics match parameters of the population.

    The profit is

    \[
    \begin{aligned}
    \text{profit} &= 100{,}000 \times \hat{p} \times (\bar{x}/10 - 50) - 800{,}000 \\
                  &= 100{,}000 \times 0.14 \times (1990.50/10 - 50) - 800{,}000 \\
                  &= \$1{,}286{,}700
    \end{aligned}
    \]

    It is unrealistic, however, to expect p and μ to match p̂ and x̄ exactly. Just because 14% of these alumni return the application does not mean that 14% of the population will. An average balance of $1990.50 in this sample does not guarantee μ = $1990.50.

    Confidence intervals are handy in this situation: substitute intervals for p̂ and x̄. Rather than a number, the result is a range that expresses the uncertainty due to sampling. In this case, we need more than one interval. For the product of two intervals with positive endpoints, we combine the intervals by multiplying lower × lower and upper × upper. (We will illustrate the calculations using rounded intervals to avoid cluttering the calculations with extra digits.)

    \[
    \begin{aligned}
    \text{profit} &= 100{,}000 \times [0.12,\ 0.16] \times ([1520,\ 2460]/10 - 50) - 800{,}000 \\
                  &= 100{,}000 \times ([0.12,\ 0.16] \times [102,\ 196]) - 800{,}000 \\
                  &= [1{,}224{,}000,\ 3{,}136{,}000] - 800{,}000 \\
                  &= [\$424{,}000 \text{ to } \$2{,}336{,}000]
    \end{aligned}
    \]

    Even at the low side of the interval, the bank makes a profit, and the potential profits are more than $2.3 million.
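The interval arithmetic in this calculation can be written out step by step. The helpers below are a sketch with names of our own; they assume positive scaling constants and intervals with positive endpoints, which holds for the rounded intervals used here.

```python
def scale(iv, c):
    """Multiply an interval by a positive constant c."""
    return (c * iv[0], c * iv[1])

def shift(iv, c):
    """Add a constant c to both endpoints."""
    return (iv[0] + c, iv[1] + c)

def multiply_positive(a, b):
    """Product of two intervals whose endpoints are all positive:
    lower x lower and upper x upper."""
    assert a[0] > 0 and b[0] > 0, "rule only valid for positive intervals"
    return (a[0] * b[0], a[1] * b[1])

p_iv = (0.12, 0.16)   # rounded 95% interval for p
mu_iv = (1520, 2460)  # rounded 95% interval for mu

per_account = shift(scale(mu_iv, 0.1), -50)          # about [102, 196]
accounts_income = scale(multiply_positive(p_iv, per_account), 100_000)
profit_iv = shift(accounts_income, -800_000)         # about [424000, 2336000]
print(tuple(round(x) for x in profit_iv))
```

The code reproduces the endpoints, but it says nothing about the coverage of the combined interval, which is the weakness of this approach.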

    This analysis separates p̂ (which is usually determined by how a card is marketed) from the interest earnings represented by x̄/10 (which is determined by interest profits). The analysis shows how uncertainty in estimates of p and μ leads to uncertainty in the profits. That transparency provides talking points for discussions with colleagues in marketing and finance.

    The weakness of this approach is that you end up with a confidence interval with unknown coverage. About all we can conclude is that the coverage of the interval [$424,000 to $2,336,000] is more than 90%. (See Under the Hood: Combining Confidence Intervals.)

    Changing the Problem

    You can often avoid combining multiple confidence intervals by creating a new variable. In this example, let's work directly with the profit earned from each customer rather than construct the profits from p̂ and x̄.

    Each customer who does not accept the card costs the bank $8 for mailing out the application and promotions ($300,000 spread over 100,000). Each customer who accepts the card costs the bank $58 ($8 plus $50 for setting up the account), but earns the bank 10% of the revolving balance. The following variable measures the profit, in dollars, earned for a customer:

    \[
    y_i =
    \begin{cases}
    -8, & \text{if offer not accepted} \\
    \dfrac{\text{Balance}}{10} - 58, & \text{if offer accepted}
    \end{cases}
    \]

    We formed this variable and calculated the following summary.

    ȳ                 $12.867
    s                 $117.674
    n                 1000
    s/√n              $3.721
    95% t-interval    $5.565 to $20.169
    Skewness          6.175
    Kurtosis          44.061

    Table 16-3. Summary statistics for profits earned from each customer.

    Obviously, the constructed variable is not normally distributed: 86% of the y's are -8 because only 14% return the offer. Both the skewness and kurtosis are quite far from 0. Nonetheless, because of the very large sample, these data satisfy the CLT condition. We can use the t-interval because n = 1000 is larger than 10 K3² ≈ 381 and 10|K4| ≈ 441. The t-interval for μ_y, the average profit per offer, is [$5.55 to $20.19]. Scaled up to 100,000 offers, the interval extends from $555,000 to $2,019,000.

    This confidence interval for μ_y has two advantages over the previous range. First, it's shorter ($1.5 million versus $2.0 million). This interval lies entirely inside the previous interval. Second, this interval is a 95% confidence interval. Shorter with known coverage: that's a good combination. By combining everything into one variable, however, we no longer distinguish the role of the acceptance rate p from the average balance μ.
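The per-customer profit variable is simple to construct. The sketch below defines it and checks the sample mean with a deliberately simplified, hypothetical data set in which every one of the 140 acceptors carries the average balance of $1,990.50; the real balances vary, so this reproduces the mean of about $12.867 but not the SD, skewness or kurtosis in Table 16-3.

```python
from statistics import mean

def customer_profit(accepted: bool, balance: float = 0.0) -> float:
    """Profit per mailed offer: -$8 if declined; otherwise 10% of the
    balance minus $58 ($8 mailing and promotion plus $50 setup)."""
    if not accepted:
        return -8.0
    return balance / 10 - 58.0

# Hypothetical stand-in data: 140 acceptors at the average balance, 860 decliners.
profits = [customer_profit(True, 1990.50)] * 140 + [customer_profit(False)] * 860
y_bar = mean(profits)
print(f"mean profit per offer: ${y_bar:.3f}")
```

With the actual 1,000 values of y in hand, the t-interval follows from the mean and standard deviation of this single variable.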

    Confidence Interval or Test?

    The launch of the credit card provides all of the ingredients needed for a hypothesis test. The variable y_i tracks the profit earned per customer. The null hypothesis (with the conservative action not to launch the card) is that the card is not profitable:

    \[ H_0: \mu_y \le \$0 \]

    The alternative hypothesis H_a: μ_y > $0 claims that it is profitable. Using the summary in Table 16-3, the t-statistic is

    \[ t = \frac{\bar{y} - 0}{se(\bar{y})} = \frac{12.86695 - 0}{3.70231} \approx 3.48 \]

    The p-value is equal to 0.00034, much less than the usual α-level. We would reject H0: the test of the launch has proven that it will be profitable beyond reasonable doubt. Notice that the t-statistic also tells us that 0 is not inside the 95% confidence interval. The t-statistic indicates that ȳ lies 3.48 standard errors away from μ0 = 0. Since the 95% confidence interval holds values that are within about 2 standard errors of ȳ, 0 lies outside of the confidence interval. It's not compatible with the data at a 95% level of confidence.

    A test provides: A precise analysis of a specific, hypothesized value for a parameter.

    A CI provides

    The test of H0: μ_y ≤ 0 and the confidence interval answer different questions. The one-sided test tells you whether a proposal has proven itself profitable (with a 5% chance for a Type I error). To reject H0, the profits have to be statistically significantly more than $0 per person. Confidence intervals do more. Suppose that someone asks how high the profits might go. A one-sided test does not answer this question. It simply concludes that average profits are positive. The confidence interval, on the other hand, gives a range for the profits per person, $5.60 to $20. Since 0 lies outside this interval, we're 95% confident that the program is profitable and we have an upper limit for the profitability.

    For an example in which the test gives a different answer from the confidence interval, we need data in which the outcome is less clear. Suppose that ȳ had been smaller, say ȳ = $7. Then the 95% confidence interval would have been about $7 ± 2(3.7) = [-$0.40 to $14.40]; zero lies inside this interval. The t-statistic is t = (7-0)/3.7 ≈ 1.89 and the test still rejects H0 because the p-value ≈ 0.029 < 0.05. The one-sided test rejects H0 even though $0 lies inside the confidence interval.
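The disagreement between the one-sided test and the two-sided interval is easy to reproduce. This sketch uses the hypothetical ȳ = $7 and standard error 3.7 from the paragraph above; as an assumption of ours, it approximates the t-distribution with 999 degrees of freedom by a normal distribution, which is very close at this sample size.

```python
from statistics import NormalDist

y_bar, se = 7.0, 3.7

# Back-of-the-envelope 95% interval: estimate +/- 2 standard errors.
interval = (y_bar - 2 * se, y_bar + 2 * se)

# One-sided test of H0: mu_y <= 0 against Ha: mu_y > 0.
t_stat = (y_bar - 0) / se
p_value = 1 - NormalDist().cdf(t_stat)  # normal approximation to the t tail

zero_inside = interval[0] <= 0 <= interval[1]
rejects = p_value < 0.05
print(f"interval = [{interval[0]:.2f}, {interval[1]:.2f}], "
      f"t = {t_stat:.2f}, p = {p_value:.3f}, "
      f"zero inside: {zero_inside}, test rejects: {rejects}")
```

The interval splits its 5% error rate between two tails while the one-sided test puts all 5% in one tail, which is why the test can reject when zero is still inside the interval.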

    That's the price we pay for using a confidence interval. Because the 95% confidence interval tells us both an upper and lower limit for μ_y, it is less sensitive than the one-sided test when setting a lower limit for μ_y.

    Margin of Error

    The back-of-the-envelope 95% confidence interval for μ is

    \[ \left[\bar{X} - 2S/\sqrt{n},\ \ \bar{X} + 2S/\sqrt{n}\right] \]

    The extent of this interval to either side of X̄ is known as the margin of error:

    \[ \text{Margin of Error} = \frac{2S}{\sqrt{n}} \]

    margin of error: 2S/√n

    A precise confidence interval has a small margin of error. Three factors determine the margin of error:

    1. Level of confidence
       The multiplier 2 ≈ t_{0.025,n-1} comes from the use of a t-distribution to model the sampling distribution of the t-statistic.

    2. Variation of the data
       The smaller the standard deviation S, the narrower the confidence interval becomes. It's usually not possible to reduce the standard deviation without changing the population.

    3. Number of observations
       Confidence intervals reward you for having a large sample. The larger the sample, the shorter the interval because the standard error S/√n gets smaller.

    The reduction in the margin of error achieved with a larger sample is not proportional to the increase in the sample size. The margin of error shrinks in proportion to 1/√n. As a result, to cut the margin of error in half, you need 4 times as many cases. Costs, however, typically rise in proportion to n. So, you'd spend 4 times as much to slice the margin of error in half.

    Determining Sample Size

    How large a sample do we need? A simple answer is "more is better," but data cost time and money. How much is enough? Before you collect data, it's a good idea to know whether the sample size you can afford is adequate for what you want to learn.

    If you know the needed margin of error, you can estimate the necessary sample size. It's only a guess because you won't know until you get the sample. Sample size calculations are rarely exact. To get a particular margin of error, solve for n in this approximate formula:

    \[ \text{Margin of Error} = \frac{2\sigma}{\sqrt{n}} \quad\Longrightarrow\quad n = \frac{4\sigma^2}{(\text{Margin of Error})^2} \]

    The necessary sample size depends on σ. The sample SD s is not available to estimate σ because we have to choose n before collecting the sample. If you have no idea of σ, it is a good idea to obtain a small sample to estimate σ. A pilot sample is a small sample of, say, 10 to 30 cases used to estimate σ.
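The sample-size formula can be sketched as a one-line function. The function name and the numbers in the example (σ = 10, margins of 2 and 1) are hypothetical values chosen to make the arithmetic easy to follow.

```python
def sample_size_for_mean(sigma: float, margin_of_error: float) -> float:
    """Approximate n from Margin of Error = 2 * sigma / sqrt(n),
    i.e. n = 4 * sigma**2 / margin_of_error**2.
    In practice, round the result up to a whole number of cases."""
    return 4 * sigma ** 2 / margin_of_error ** 2

n_wide = sample_size_for_mean(sigma=10, margin_of_error=2)    # 100 cases
n_narrow = sample_size_for_mean(sigma=10, margin_of_error=1)  # 400 cases
print(n_wide, n_narrow, n_narrow / n_wide)
```

Halving the margin of error multiplies the required sample size by 4, which is the cost trade-off described earlier in this section. The value of σ would typically come from a pilot sample.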

    In the special case of proportions, we don't need a pilot sample. The relationship between σ and p allows us to determine n without knowing p. Suppose you're doing a survey of political affiliation. You need for the survey to have a margin of error of 3% or less. Whatever the estimate for p, you want to claim that p̂ lies within 0.03 or less of p (with 95% confidence). For this to happen, the margin of error must be no more than 0.03,

    \[ \text{Margin of Error} = \frac{2\sigma}{\sqrt{n}} \le 0.03 \]

    pilot sample: A small, preliminary sample used to obtain an estimate of σ.


    For proportions, σ = √(p(1-p)) ≤ ½ no matter what the value of p. [7] If we choose n so that the margin of error is 0.03 when σ = ½, then the margin of error cannot be larger than 0.03 when we get our sample. If σ = ½, the margin of error is

    \[ \text{Margin of Error} = \frac{2\sigma}{\sqrt{n}} = \frac{2\left(\tfrac{1}{2}\right)}{\sqrt{n}} = \frac{1}{\sqrt{n}} \le 0.03 \]

    The necessary sample size is then

    \[ n = \frac{1}{(\text{Margin of Error})^2} = \frac{1}{0.03^2} \approx 1{,}111 \]

    A survey with 1,111 or more respondents guarantees the margin of error is 0.03 or smaller.
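For surveys, the worst-case rule reduces to n = 1/(margin of error)². A minimal sketch, with function names of our own:

```python
from math import sqrt

def worst_case_sample_size(margin_of_error: float) -> float:
    """Sample size that keeps 2 * sigma / sqrt(n) within the margin when
    sigma takes its largest possible value for a proportion, 1/2."""
    return 1 / margin_of_error ** 2

def worst_case_margin(n: int) -> float:
    """Largest possible margin of error for a proportion with n respondents."""
    return 1 / sqrt(n)

n_3pct = worst_case_sample_size(0.03)  # about 1,111
n_2pct = worst_case_sample_size(0.02)  # about 2,500
print(round(n_3pct, 1), round(n_2pct, 1), worst_case_margin(400))
```

Because the first result is not a whole number, a strict guarantee calls for rounding up (1,112 respondents); the text's 1,111 is the same figure rounded to the nearest integer.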

    This formula explains why you do not see surveys with a 2% margin of error. To guarantee a 2% margin of error, the survey would need n = 1/0.02² = 2,500 respondents. Evidently, the increased precision isn't perceived to be worth more than doubling the cost. This table shows the sample sizes needed for several choices of the margin of error.

    n         SE(X̄)    Margin of Error
    100       0.050     10%
    200       0.035     7%
    400       0.025     5%
    600       0.020     4%
    1,100     0.015     3%
    2,400     0.010     2%
    10,000    0.005     1%

    Table 16-4. Sample sizes needed for the margin of error in a survey.

    Keep in mind that n is the number of respondents, not the number of people sent questionnaires. A low response rate turns a sample into an unreliable voluntary response survey. As discussed in Chapter 14, it's better to spend resources on increasing the response rate than on surveying a larger group.

    Example 16.1 Property Taxes

    Motivation: state the question

    A Midwestern city faces a budget crunch. To close the gap, the mayor is considering a tax on businesses that is proportional to the amount spent to lease property in the city. How much revenue would a 1% tax generate?

    7 The maximum of f(x) = x(1-x) occurs at x = ½. You can prove this with calculus or by drawing the graph of the function.


    Method: describe the data and select an approach

    If we have a confidence interval for μ, the average cost of a lease, we can obtain a confidence interval for the amount raised by the tax. The city has 4,500 businesses that lease properties; these are the population. If we multiply a 95% confidence interval for μ by 1% of 4,500, we'll have a 95% confidence interval for the total revenue.

    The data are the costs of a random sample of 223 recent leases. The histogram of the lease costs is skewed. One lease costs nearly $3,000,000 per year, whereas most are far smaller.

    We will use a 95% t-interval for μ. Checking the conditions, we find that both are satisfied.

    SRS Condition. The sample consists of less than 10% of the population of leases, randomly chosen from the correct population.

    CLT Condition. The sample size is n = 223 cases. Since 10 K3² ≈ 38 and 10|K4| ≈ 41 (see the summary table in the Mechanics), we have enough data to meet this condition.

    Mechanics: do the analysis

    This table summarizes the key summary statistics of the lease costs, including the skewness K3 and kurtosis K4. The confidence interval is x̄ ± t_{0.025,222} s/√n, with t_{0.025,222} ≈ 1.97.

    x̄               $478603.48
    s               $535342.56
    n               223
    s/√n            $35849.19
    95% interval    $407981 to $549226
    Skewness        1.953
    Kurtosis        4.138

    Message: summarize the results
    Rounding to tens of thousands seems more than enough digits for such a long interval. We are 95% confident that the average cost of a lease is between $410,000 and $550,000. Hence, on average, we can be 95% confident that the tax would raise between $4,100 and $5,500 per business, and thus between $18,450,000 and $24,750,000 citywide (multiply by 4,500, the number of business leases in the city).

    1. Determine parameter
    2. Identify population
    3. Describe data
    4. Choose interval
    5. Check conditions
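The mechanics of this example can be retraced with a few lines of Python. This is only a sketch: it takes the summary statistics and the t percentile (1.97) straight from the table above rather than computing them from the lease data, and the function and variable names are our own.

```python
import math

# Summary statistics reported in the Mechanics table of Example 16.1.
n = 223
xbar = 478603.48
s = 535342.56
t_crit = 1.97            # t percentile quoted in the text (0.025 tail, 222 df)
num_businesses = 4500    # size of the population of leasing businesses
tax_rate = 0.01          # the proposed 1% tax


def t_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for xbar +/- t_crit * s / sqrt(n)."""
    se = s / math.sqrt(n)
    half_width = t_crit * se
    return xbar - half_width, xbar + half_width


lower, upper = t_interval(xbar, s, n, t_crit)

# Multiplying both endpoints by the same positive constant (1% of 4,500)
# turns the interval for the mean into an interval for total revenue.
scale = tax_rate * num_businesses
revenue_lower, revenue_upper = scale * lower, scale * upper

print(f"95% interval for the mean: ${lower:,.0f} to ${upper:,.0f}")
print(f"Total revenue: ${revenue_lower:,.0f} to ${revenue_upper:,.0f}")
```

Using the unrounded endpoints gives a revenue interval of roughly $18.4 million to $24.7 million. The figures in the Message differ slightly because they start from the endpoints already rounded to $410,000 and $550,000.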

    Example 16.2 A Political Poll

    Motivation: state the question
    The mayor was so happy with the amount raised by the business tax that he's decided to run for re-election. Only 40% of registered voters in a survey done by the local newspaper (n = 400), however, think that he's doing a good job. What does this indicate about attitudes among all registered voters?

    Method: describe the data and select an approach
    The parameter of interest is \(p\), the proportion in the population of registered voters who think that the mayor is doing a good job. The data reported in the news is a sample (allegedly) from this population. We'll use a 95% z-interval for \(p\) to summarize what we can reasonably conclude about \(p\) from this sample. The conditions for using this interval are satisfied:

    SRS Condition. The newspaper hires a reputable firm to conduct its polls, so we'll assume that they got a simple random sample. Also, n is much less than 10% of the population. We'd like to see the precise question and find out the rate of non-response, but let's give the pollsters the benefit of the doubt.

    Sample Size Condition. Both \(n\hat{p} = 160\) and \(n(1-\hat{p}) = 240\) are larger than 10.

    Mechanics: do the analysis
    The estimated standard error is

    \[ se(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{0.4 \times 0.6/400} \approx 0.0245. \]

    We use a z-interval for proportions even though we have estimated the standard error. The 95% z-interval for \(p\) is

    \[ [0.40 - 1.96 \times 0.0245,\; 0.40 + 1.96 \times 0.0245] \approx [0.352,\; 0.448] \]

    Message: summarize the results
    We can tell the mayor that he can be 95% confident that between 35% and 45% of the registered voters think that he is doing a good job. Fewer than half appear pleased. Perhaps he needs to convince more voters that the business tax will be good for the city or remind them that it's not them, but businesses, that will pay this tax!

    1. Determine parameter
    2. Identify population
    3. Describe data
    4. Choose interval
    5. Check conditions
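The poll calculation is short enough to script. The sketch below repeats the Sample Size Condition and the Mechanics of Example 16.2 in Python; the helper names are ours, and the only inputs are the survey results reported above (40% approval, n = 400).

```python
import math


def sample_size_condition(p_hat, n, threshold=10):
    """Check that both n*p_hat and n*(1 - p_hat) exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold


def proportion_z_interval(p_hat, n, z=1.96):
    """Return (lower, upper, se) for p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se, se


p_hat, n = 0.40, 400
condition_ok = sample_size_condition(p_hat, n)
low, high, se = proportion_z_interval(p_hat, n)

print(f"condition met: {condition_ok}")
print(f"se = {se:.4f}, 95% z-interval = [{low:.3f}, {high:.3f}]")
```

The printed interval matches the one in the Mechanics, [0.352, 0.448], which the Message rounds to "between 35% and 45%".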


    Summary
    Confidence intervals provide a range for a parameter. The coverage (or confidence level) of a confidence interval is the probability of getting a sample for which the interval includes the parameter. Most often, confidence intervals have coverage 0.95 and are known as 95% confidence intervals. The margin of error is the half-length of the 95% confidence interval. A z-interval uses percentiles from a normal distribution, whereas a t-interval uses (slightly larger) percentiles from a t-distribution. We use a z-interval for proportions even though we use an estimated standard error; we use a t-interval for the mean when the standard error is estimated from the data.

    Key Terms
    confidence interval, 16-3
        for \(\mu\), 16-6
        for \(p\), 16-4
    confidence level, 16-4
    coverage, 16-4
    margin of error, 16-13
    pilot sample, 16-14
    z-interval, 16-4

    Best Practices
    Be sure that your data are an SRS from the right population. If you don't start with a representative sample, it won't matter what you do. A questionnaire that finds that 85% of people enjoy filling out surveys suffers from non-response bias even if we put confidence intervals around this (biased) estimate.

    Use confidence intervals rather than tests to convey a range. A test can tell you whether a plan has demonstrated profitability, but it does not give a range on the amount of profits.

    Stick to 95% confidence intervals. Unless there is a compelling reason for an interval to have larger or smaller coverage, use 95%. (The most common alternatives are 90% and 99%.)

    Round the endpoints of intervals when presenting the results. Software produces many digits of accuracy, but these aren't helpful when presenting your results. By rounding, you avoid littering your summary with superfluous digits.

    Use full precision for intermediate calculations. Store the intermediate results in your calculator or write down the full answer. Round only when you get to the final step.

    Pitfalls
    Thinking that a confidence interval must hold \(\mu\). The name 95% confidence interval means that the interval could be wrong. You have not seen the population, only a sample. Don't claim that you know the parameter lies in this range.


    Using a confidence interval to describe other samples. Confidence intervals describe the population. A confidence interval doesn't describe individual responses or statistics in samples; it's a statement about a population parameter.

    Formulas

    z-interval for the proportion. The \(100(1-\alpha)\%\) confidence interval for the population proportion \(p\) is

    \[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

    For the typical 95% interval, set \(\alpha = 0.05\) and \(z_{.025} = 1.96 \approx 2\).

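    t-interval for the mean. The \(100(1-\alpha)\%\) confidence interval for \(\mu\) when using an estimate of the standard error of \(\bar{X}\) with \(n-1\) degrees of freedom is

    \[ \bar{X} \pm t_{\alpha/2,\,n-1}\, se(\bar{X}) = \bar{X} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \]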

    Margin of error. Half of the length of the approximate 95% z-interval.

    \[ \text{Margin of Error} = \frac{2\sigma}{\sqrt{n}} \quad \text{or} \quad 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
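As a rough illustration of the margin-of-error formulas, the sketch below codes both versions and adds the worst-case bound for a proportion. The bound relies on the fact that x(1-x) is largest at x = 1/2, so the margin for a proportion can be no larger than \(2\sqrt{0.25/n} = 1/\sqrt{n}\). The function names are ours, and the numbers come from the poll in Example 16.2.

```python
import math


def margin_of_error_mean(sigma, n):
    """Approximate 95% margin of error for a mean: 2 * sigma / sqrt(n)."""
    return 2 * sigma / math.sqrt(n)


def margin_of_error_proportion(p_hat, n):
    """Approximate 95% margin of error for a proportion."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)


def worst_case_margin(n):
    """Upper bound on the proportion margin, attained at p_hat = 1/2."""
    return 1 / math.sqrt(n)


# Poll from Example 16.2: 40% approval among n = 400 voters.
poll_margin = margin_of_error_proportion(0.40, 400)
poll_bound = worst_case_margin(400)

print(f"margin of error: {poll_margin:.3f}")   # about 0.049
print(f"worst-case bound: {poll_bound:.3f}")   # 0.050
```

Because this margin uses 2 in place of 1.96, it is slightly wider than the half-length of the interval computed in the example.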

    Under the Hood
    Indicators and Proportions

    A special relationship joins \(\hat{p}\) and \(s\). Once you know \(\hat{p}\), you know \(s\). It's the same relationship as that between the mean and variance of a Bernoulli random variable \(X\), where \(\mathrm{Var}(X) = p(1-p)\).

    The z-interval for \(\mu\) and the z-interval for \(p\) are almost identical. To see the connection, imagine that we have a column of 1s and 0s, with a 1 coded each time the event of interest occurs. \(\hat{p}\) is the mean of this column. A column of 0/1 indicators of an event is called a dummy variable. In the example of credit cards, the dummy variable indicates whether the customer accepts an application:

    \(x_i = 1\) if application is returned
    \(x_i = 0\) if application is not returned

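    The mean \(\bar{x}\) is the proportion of times that \(x_i\) is 1, what we usually label \(\hat{p}\). To see that \(\bar{x} = \hat{p}\), write the numerator of \(\bar{x}\) as the sum of the \(x_i\). The sum of squared deviations about \(\bar{x}\) simplifies:

    \[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \hat{p})^2 = \sum_{x_i = 0} (0 - \hat{p})^2 + \sum_{x_i = 1} (1 - \hat{p})^2 = (n - n_1)\,\hat{p}^2 + n_1 (1 - \hat{p})^2 \]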


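    \(n_1\) is the number of ones, so \(\hat{p} = n_1/n\). If we plug in \(n\hat{p}\) for \(n_1\), we get

    \[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = n(1 - \hat{p})\,\hat{p}^2 + n\hat{p}(1 - \hat{p})^2 = n\hat{p}(1 - \hat{p})(1 - \hat{p} + \hat{p}) = n\hat{p}(1 - \hat{p}) \]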

    If we divide both sides by \(n-1\), the left side becomes \(s^2\), and we find that

    \[ s^2 = \hat{p}(1 - \hat{p})\, \frac{n}{n-1} \]

    You might have guessed this formula from the population version, \(\sigma^2 = p(1-p)\). The sample standard deviation almost agrees with the formula for the SD of Bernoulli trials:

    \[ s = \sqrt{\hat{p}(1 - \hat{p})}\, \sqrt{\frac{n}{n-1}} \]
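A quick numerical check of this identity, as a sketch: build a dummy variable with 140 ones among 1,000 cases (the acceptance counts from the credit card example), compute the sample variance directly, and compare it with \(\hat{p}(1-\hat{p})\,n/(n-1)\). The helper name `indicator_variance` is ours.

```python
import math
import statistics


def indicator_variance(p_hat, n):
    """Sample variance of a 0/1 column implied by its mean p_hat."""
    return p_hat * (1 - p_hat) * n / (n - 1)


# Dummy variable: 1 if the application is returned, 0 otherwise.
x = [1] * 140 + [0] * 860
n = len(x)
p_hat = sum(x) / n                   # the mean of the column is p_hat

s2_direct = statistics.variance(x)   # divides by n - 1
s2_formula = indicator_variance(p_hat, n)

print(f"p_hat = {p_hat:.2f}")
print(f"s^2 computed directly: {s2_direct:.5f}")
print(f"s^2 from the formula:  {s2_formula:.5f}")
print(f"s = {math.sqrt(s2_formula):.4f} vs sqrt(p_hat(1-p_hat)) = {math.sqrt(p_hat * (1 - p_hat)):.4f}")
```

With n this large the factor n/(n-1) barely matters, which is why s "almost agrees" with the Bernoulli formula.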

    Combining Confidence Intervals

    Write the z-interval for \(p\) as \([L(p), U(p)]\) and the z-interval for \(\mu\) as \([L(\mu), U(\mu)]\). Because each is a 95% interval,

    \[ P(L(p) \le p \le U(p)) = P(L(\mu) \le \mu \le U(\mu)) = 0.95 \]

    What can we conclude about \(P(L(p)L(\mu) \le p\mu \le U(p)U(\mu))\)?

    Not as much as we'd like. If we observe a sample in which both intervals cover, then the product of the endpoints covers \(p\mu\). That is, if the events \(A = \{L(p) \le p \le U(p)\}\) and \(B = \{L(\mu) \le \mu \le U(\mu)\}\) both occur, then \(L(p)L(\mu) \le p\mu \le U(p)U(\mu)\). (This is true so long as the endpoints are positive as in this example.) If these events are independent, then the coverage of the combined interval is at least

    \[ P(L(p)L(\mu) \le p\mu \le U(p)U(\mu)) \ge P(A \text{ and } B) = P(A)P(B) = 0.95^2 = 0.9025 \]

    There's no guarantee, however, that the two intervals cover their parameters independently.
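To make the bookkeeping concrete, here is a small sketch that multiplies the endpoints of two intervals and reports the coverage that independence would guarantee. The two input intervals are hypothetical numbers chosen only for illustration, not results from this chapter, and the function names are ours.

```python
import math


def product_interval(interval_p, interval_mu):
    """Multiply matching endpoints; valid only when all endpoints are positive."""
    (low_p, up_p), (low_mu, up_mu) = interval_p, interval_mu
    if min(low_p, low_mu) <= 0:
        raise ValueError("endpoints must be positive for the product to be ordered")
    return low_p * low_mu, up_p * up_mu


def coverage_if_independent(coverage_a=0.95, coverage_b=0.95):
    """Lower bound on combined coverage, assuming the two events are independent."""
    return coverage_a * coverage_b


# Hypothetical 95% intervals: one for a proportion p, one for a mean mu.
interval_p = (0.12, 0.16)
interval_mu = (1800.0, 2100.0)

combined = product_interval(interval_p, interval_mu)
bound = coverage_if_independent()

print(f"interval for p*mu: [{combined[0]:.0f}, {combined[1]:.0f}]")
print(f"coverage is at least {bound:.4f} if the intervals cover independently")
```

The 0.9025 figure holds only under the independence assumption built into `coverage_if_independent`; without it, the code says nothing about the combined coverage.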

    About the Data

    The banking data for this example comes from a research project that studied the performance of consumer loans done at the Wharton Financial Institutions Center. The data on the value of business leases comes from an analysis of real estate prices conducted by several enterprising MBA students. We've changed the data to keep the prices in line with current rates.
