Ling-Chieh Kung / Statistics and Data Analysis for Engineers 123 (2016/9/4)


Statistics and Data Analysis for Engineers

Part 1: Introduction and Descriptive Statistics

Ling-Chieh Kung

Department of Information Management, National Taiwan University

September 4, 2016



What is Statistics?

- Many things are unknown...
  - Consumers’ tastes.
  - Quality of a product.
  - Stock prices.
  - The effectiveness of a new way of teaching/training.

- Statistics is the science of collecting, analyzing, interpreting, and presenting (numerical) data.
  - Ultimate goal (of Business Statistics): to achieve better decision making.

- The study of Statistics includes:
  - Descriptive Statistics.
  - Probability.
  - Inferential Statistics: Estimation.
  - Inferential Statistics: Hypothesis testing.
  - Inferential Statistics: Prediction.

- In summary: To estimate, test, and predict those unknowns.



My plan for today

- Descriptive Statistics.
  - Visualization and summarization.

- Inferential Statistics.
  - (Probability).
  - Hypothesis testing and p-value.
  - Regression analysis.

- Case studies.



Road map

- Basic concepts.

- Data visualization.

- Data summarization.



Populations vs. samples

- A population is a collection of persons, objects, or items.
  - A census investigates the whole population.

- A sample is a portion of the population.
  - Sampling investigates only a subset of the population.
  - We then use the information contained in the sample to infer (“guess”) about the population.

- What are samples for the following populations?
  - All students in NTU.
  - All students in the business school.
  - All chips made in one factory.
  - All consumers who have bought iPhone 6.

- Two important questions:
  - Why sampling?
  - Is a sample representative?



Descriptive vs. inferential statistics

- Descriptive statistics:
  - Graphical or numerical summaries of data.
  - Describing (visualizing or summarizing) a set of data.

- Inferential statistics:
  - Making a “scientific guess” on unknowns.
  - Trying to say something about the population.

- Which is descriptive and which is inferential?
  - Calculating the average height of 1000 randomly selected NTU students.
  - Using this number to estimate the average height of all NTU students.

- Another example (pharmaceutical research):
  - All the potential patients form the population.
  - A group of randomly selected patients is a sample.
  - Use the result on the sample to infer the result on the population.



Parameters vs. statistics

- A numerical summary of a population is a parameter.
  - The average height of all NTU students.
  - The expected coffee demand when the price is 50 NTD.

- A numerical summary of a sample is a statistic.
  - The average height of all NTU male students.
  - The average coffee demand when the price is 50 NTD in the past 6 days.

- People almost always use a statistic to infer a parameter.
  - Some statistics are “good” while some are “bad.”



Parameters vs. statistics: an example

- What is the average height of all NTU students?

- While a census is possible, it is still quite costly.

- It is natural to:
  - Sample some NTU students.
  - Calculate a statistic.
  - Use that statistic to estimate the average height (the parameter).

- Some (good or bad) samples and statistics:
  - The average height of all students in this classroom.
  - The average height of 100 students randomly drawn from all students.
  - The maximum height of 100 students randomly drawn from all students.
  - The sum of heights of 100 students randomly drawn from all students.
  - The average height of 60 male and 40 female students randomly drawn from the population.



Levels of data measurement

- Most data we will play with are numerical.

- Numerical data may be categorized into three levels:
  - Nominal.
  - Ordinal.
  - Quantitative: interval or ratio.



Nominal level

- A nominal scale classifies data into categories with no ranking.

- Data are labels or names used to identify an attribute of the element.

- The labels may be numeric or non-numeric.

- Examples:

  Categorical variable    Values (categories)
  Laptop ownership        Yes / No
  Citizenship             Taiwan / Japan / ...
  Country code            886 / 86 / 1 / ...

- Arithmetic operations cannot be applied to nominal data.



Ordinal level

- An ordinal scale classifies data into categories with ranking.

- The order or rank of the data is meaningful.

- However, differences between numerical labels do not imply distances.

- Examples:

  Categorical variable    Values (categories)
  Product satisfaction    Satisfied, neutral, unsatisfied
  Professor rank          Full, associate, assistant
  Ranking of scores       1, 2, 3, 4, ...

- It is still not meaningful to do arithmetic on ordinal data.
  - Assistant + associate = full?!
  - The grade difference between no. 1 and no. 5 may not be equal to that between no. 11 and no. 15.



Quantitative (interval and ratio) levels

- An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.

- A ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point.

- Ratio data appear more often in the world.
  - Heights, weights, income, prices.

- Interval data are actually rare.
  - Degrees in Celsius or Fahrenheit.
  - GRE or GMAT scores.

- How about degrees in Kelvin?



Some remarks

- Nominal and ordinal data are called qualitative data.

- Interval and ratio data are called quantitative data.

- Most statistical methods are for quantitative data; some are for qualitative data.
  - Distinguishing nominal and ordinal scales is important.
  - Distinguishing interval and ratio scales is not.

- Sometimes qualitative data are called categorical data.

- Sometimes quantitative data are called numeric data.



A short summary

- Understand these terms:
  - Populations vs. samples.
  - Parameters vs. statistics.
  - Inferential statistics vs. descriptive statistics.

- For each scale of measurement, is it meaningful to calculate the following numbers?

  Level          Ranking    Distance
  Nominal        No         No
  Ordinal        Yes        No
  Quantitative   Yes        Yes



Road map

- Basic concepts.

- Data visualization.

- Data summarization.





An example

- For each day in 2011 and 2012, we record the number of daily rentals of the public bike rental system in Washington, D.C.
  - 985, 801, 1349, 1562, 1600, 1606, 1510, ..., 1341, 1796, and 2729.
  - The smallest and largest numbers are 22 and 8714, respectively.

- How do we get a feel for 731 numbers?

  Date          Rentals
  2011/1/1      985
  2011/1/2      801
  2011/1/3      1349
  2011/1/4      1562
  2011/1/5      1600
  2011/1/6      1606
  2011/1/7      1510
  ...
  2012/12/29    1341
  2012/12/30    1796
  2012/12/31    2729



Frequency distributions

- The original 731 numbers form a set of ungrouped data.

- We start by grouping them into a frequency distribution.
  - Grouped data are presented in the form of class intervals and frequencies.

- Let’s create an intuitive frequency distribution.



Frequency distributions: an example

- The resulting classes:

  Class   Class interval   (Which means)
  1       [0, 1000)        0 ≤ x < 1000
  2       [1000, 2000)     1000 ≤ x < 2000
  3       [2000, 3000)     2000 ≤ x < 3000
  ...
  8       [7000, 8000)     7000 ≤ x < 8000
  9       [8000, 9000)     8000 ≤ x < 9000

- How about [0, 999], [1000, 1999], etc.?
- How about (0, 1000], (1000, 2000], etc.?
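The half-open classes above can be sketched in Python. This is a minimal illustration that uses only the ten daily rental counts quoted in the example; the helper name `class_interval` is an assumption made for this sketch, not part of the original material.

```python
from collections import Counter

# The ten daily rental counts quoted in the example
# (the first seven days of 2011 and the last three days of 2012).
rentals = [985, 801, 1349, 1562, 1600, 1606, 1510, 1341, 1796, 2729]

def class_interval(x, width=1000):
    """Return the half-open class [lower, lower + width) that contains x."""
    lower = (x // width) * width
    return (lower, lower + width)

# Count how many observations fall into each class.
frequency = Counter(class_interval(x) for x in rentals)

for (lower, upper), count in sorted(frequency.items()):
    print(f"[{lower}, {upper}): {count}")
```

Because each class includes its lower endpoint and excludes its upper endpoint, a boundary value such as 1000 falls into exactly one class.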



Frequency distributions: an example

- Then we count to get the frequency distribution below.

- This is a set of grouped data.

- Some remarks:
  - Typically we have 5 to 15 classes.
  - Typically all classes have the same width.
  - Be aware of class endpoints! Classes should NOT overlap with each other.
  - If there are outliers, they should be removed first.

  Class interval   Frequency
  [0, 1000)        18
  [1000, 2000)     80
  [2000, 3000)     74
  [3000, 4000)     107
  [4000, 5000)     166
  [5000, 6000)     106
  [6000, 7000)     86
  [7000, 8000)     82
  [8000, 9000)     12



Something more

- We may add class midpoints, relative frequencies, and cumulative frequencies into a frequency table:

  Class interval   Frequency   Class midpoint   Relative frequency   Cumulative frequency
  [0, 1000)        18          500              2.46%                18
  [1000, 2000)     80          1500             10.94%               98
  [2000, 3000)     74          2500             10.12%               172
  [3000, 4000)     107         3500             14.64%               279
  [4000, 5000)     166         4500             22.71%               445
  [5000, 6000)     106         5500             14.50%               551
  [6000, 7000)     86          6500             11.76%               637
  [7000, 8000)     82          7500             11.22%               719
  [8000, 9000)     12          8500             1.64%                731

- How about cumulative relative frequencies?
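The extra columns, including the cumulative relative frequencies asked about above, follow directly from the class frequencies. A minimal Python sketch, with variable names chosen only for illustration:

```python
from itertools import accumulate

# Class frequencies from the table, for [0, 1000), [1000, 2000), ..., [8000, 9000).
frequencies = [18, 80, 74, 107, 166, 106, 86, 82, 12]
total = sum(frequencies)  # 731 days in 2011 and 2012

relative = [f / total for f in frequencies]            # share of each class
cumulative = list(accumulate(frequencies))             # running total of frequencies
cumulative_relative = [c / total for c in cumulative]  # running share

for i, f in enumerate(frequencies):
    lower = 1000 * i
    print(f"[{lower}, {lower + 1000}): {f:4d} {relative[i]:7.2%} "
          f"{cumulative[i]:4d} {cumulative_relative[i]:8.2%}")
```

The last cumulative relative frequency is always 100%, which is a quick sanity check on the table.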



Histograms

- A frequency distribution may be depicted as a histogram.

  Interval         Freq.
  [0, 1000)        18
  [1000, 2000)     80
  [2000, 3000)     74
  [3000, 4000)     107
  [4000, 5000)     166
  [5000, 6000)     106
  [6000, 7000)     86
  [7000, 8000)     82
  [8000, 9000)     12

- It consists of a series of contiguous rectangles, each representing the frequency in a class.



Histograms

- Histograms may be the most important type of data graphs.

- One particular reason to draw histograms is to get some idea about the distribution.
  - Bell shape? M shape? Skewed?
  - Any outlier?
  - We will discuss distributions in more detail.



Frequency polygons

- Alternatively, we may draw a frequency polygon by using line segments connecting dots plotted at class midpoints.
  - The information contained in a frequency polygon is quite similar to that contained in a histogram.



Frequency polygons

- It is more convenient to use a frequency polygon to compare multiple frequency distributions.

- Both: Uni-modal and symmetric.

- 2011: Bi-modal and skewed to the right (right-tailed).

- 2012: Uni-modal and skewed to the left (left-tailed).

- Warning: People may misinterpret a frequency polygon as a line chart (for data with a time sequence).



Line charts

- A line chart is useful in depicting a time series data set.

- A two-dimensional data set whose first dimension (the x-axis) is for labels of time points.

- It visualizes how a quantity changes as time goes by.
  - For our monthly bike rentals:



Pie charts

- A pie chart is a circular depiction of data where each slice represents the percentage of the corresponding category.

- It visualizes relative frequency distributions well.

- For our bike rental data set:
  - What are the proportions of rentals in the four seasons?
  - What are the proportions of rentals on the seven days of a week?



A pie chart for seasonal rentals

  Season                 Total rentals   Proportion
  Winter (12/20-3/20)    471348          14.3%
  Spring (3/21-6/20)     918589          27.9%
  Summer (6/21-9/20)     1061129         32.2%
  Fall (9/21-12/20)      841613          25.6%
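The proportions in the table are each season's subtotal divided by the overall total. A short Python sketch of that calculation, using the totals from the table (the dictionary layout is just one convenient choice):

```python
# Total rentals per season, copied from the table above.
season_totals = {
    "Winter": 471348,
    "Spring": 918589,
    "Summer": 1061129,
    "Fall": 841613,
}

overall = sum(season_totals.values())

# Each slice of the pie is a subtotal over the overall total.
proportions = {season: total / overall for season, total in season_totals.items()}

for season, share in proportions.items():
    print(f"{season:<7} {share:.1%}")
```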



A pie chart for rentals among weekdays

  Day          Total rentals
  Sunday       444027
  Monday       455503
  Tuesday      469109
  Wednesday    473048
  Thursday     485395
  Friday       487790
  Saturday     477807



Data not appropriate for pie charts

- Pie charts are used to visualize proportions, i.e., subtotals over the overall total.

- They should not be used to compare averages.
  - The total numbers of rentals made by male and female users are appropriate for a pie chart.
  - The average numbers of rentals per male and female user are not appropriate for a pie chart.



Bar charts

- Pie charts are useful in visualizing the proportion of each category.

- In demonstrating the differences among categories, a bar chart is a better choice.
  - The larger the category, the longer the bar.
  - Some people draw bars vertically; some horizontally.



Bar charts

- Let’s replace the pie chart with a bar chart.

  Day          Total rentals
  Sunday       444027
  Monday       455503
  Tuesday      469109
  Wednesday    473048
  Thursday     485395
  Friday       487790
  Saturday     477807

- Note that the y-axis does not start at 0!



Bar charts vs. histograms

- What are the differences that distinguish a bar chart from a histogram?

- A bar chart uses noncontiguous bars to visualize categorical data.
- A histogram uses contiguous bars to visualize quantitative data.



Visualizing two variables

- When we have data for two variables, typically we want to identify whether there is any relationship between them.

- Visualizing the data in a two-dimensional manner helps.

- When the two variables are both measured in quantitative scales, we may depict each observation as a point on a plane to create a scatter plot.

- For our bike rental example:
  - How do monthly rentals in 2011 and those in 2012 relate to each other?
  - How do daily casual and registered rentals relate to each other?



Monthly rentals in 2011 and 2012

  Month   2011      2012
  1       38189     96744
  2       48215     103137
  3       64045     164875
  4       94870     174224
  5       135821    195865
  6       143512    202830
  7       141341    203607
  8       136691    214503
  9       127418    218573
  10      123511    198841
  11      102167    152664
  12      87323     123713



Road map

- Basic concepts.

- Data visualization.

- Data summarization.





Summarizing the data with numbers

- Descriptive Statistics includes some common ways to describe data.
  - Summarization with numbers.
  - Visualization with graphs.

- This is always the first step of any data analysis project: To get intuitions that guide our directions.

- Here we talk about summarization.
  - For a set of (a lot of) numbers, we use a few numbers to summarize them.
  - For a population: these numbers are parameters.
  - For a sample: these numbers are statistics.

- We will talk about three things:
  - Measures of central tendency for the center or middle part of data.
  - Measures of variability for how variable the data are.
  - Measures of correlation for the relationship between two variables.



Medians

- The median is the middle value in an ordered set of numbers.
  - Roughly speaking, half of the numbers are below and half are above it.

- Suppose there are $N$ numbers, sorted in order:
  - If $N$ is odd, the median is the $\frac{N+1}{2}$th number.
  - If $N$ is even, the median is the average of the $\frac{N}{2}$th and the $(\frac{N}{2} + 1)$th numbers.

- For example:
  - The median of {1, 2, 4, 5, 6, 8, 9} is 5.
  - The median of {1, 2, 4, 5, 6, 8} is $\frac{4+5}{2} = 4.5$.
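The odd/even rule above can be written as a small Python function. This is a sketch for illustration; in practice `statistics.median` from the standard library does the same job.

```python
def median(values):
    """Median by the odd/even rule: middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)th number (index mid when counting from 0).
        return ordered[mid]
    # Even N: average of the (N / 2)th and (N / 2 + 1)th numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 4, 5, 6, 8, 9]))  # odd number of values
print(median([1, 2, 4, 5, 6, 8]))     # even number of values
```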



Medians

- A median is unaffected by the magnitude of extreme values:
  - The median of {1, 2, 4, 5, 6, 8, 9} is 5.
  - The median of {1, 2, 4, 5, 6, 8, 900} is still 5.

- Medians may be calculated from quantitative or ordinal data.
  - They cannot be calculated from nominal data.

- Unfortunately, a median uses only part of the information contained in these numbers.
  - For quantitative data, a median only treats them as ordinal.



Means

- The mean is the average of a set of data.
  - Can be calculated only from quantitative data.
  - The mean of {1, 2, 4, 5, 6, 8, 9} is $\frac{1 + 2 + 4 + 5 + 6 + 8 + 9}{7} = 5$.

- A mean uses all the information contained in the numbers.

- Unfortunately, a mean will be affected by extreme values.
  - The mean of {1, 2, 4, 5, 6, 8, 900} is $\frac{1+2+4+5+6+8+900}{7} \approx 132.29$!

- Using the mean and median simultaneously can be a good idea.
  - We should try to identify outliers (extreme values that seem to be “strange”) before calculating a mean (or any statistic).
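The contrast between the two measures can be checked with Python's standard `statistics` module, using the two sets from the slides:

```python
from statistics import mean, median

clean = [1, 2, 4, 5, 6, 8, 9]
extreme = [1, 2, 4, 5, 6, 8, 900]  # same set, with 9 replaced by an extreme value

# The mean moves a lot; the median does not move at all.
print(mean(clean), median(clean))
print(round(mean(extreme), 2), median(extreme))
```

Reporting both numbers side by side is a cheap way to notice that a data set may contain extreme values.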



Population means vs. sample means

- Let $\{x_i\}_{i=1,...,N}$ be a population with $N$ as the population size. The population mean is

  $\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$.

- Let $\{x_i\}_{i=1,...,n}$ be a sample with $n < N$ as the sample size. The sample mean is

  $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$.

- People use $\mu$ and $\bar{x}$ in almost the whole statistics world.



Population means vs. sample means

  $\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$,    $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$.

- Aren’t these two means the same?
  - From the perspective of calculation, yes.
  - From the perspective of statistical inference, no.

- Typically the population mean is fixed but unknown.
  - The sample mean is random: We may get different values of $\bar{x}$ today and tomorrow.
  - To start from $\bar{x}$ and use inferential statistics to estimate or test $\mu$, we need to apply probability.



Quartiles and percentiles

- The median lies at the middle of the data.

- The first quartile lies at the middle of the first half of the data.

- The third quartile lies at the middle of the second half of the data.

- For the $p$th percentile:
  - $\frac{p}{100}$ of the values are below it.
  - $1 - \frac{p}{100}$ of the values are above it.

- Median, quartiles, and percentiles:
  - The 25th percentile is the first quartile.
  - The 50th percentile is the median (and the second quartile).
  - The 75th percentile is the third quartile.



Modes

- The mode(s) is (are) the most frequently occurring value(s) in a set of qualitative data.
  - In the set {A, A, A, B, B, C, D, E, F, F, F, G, H}, the modes are A and F. The frequency of each mode (A and F) is 3.

- Though the above definition may also be applied to quantitative data, sometimes it is useless.
  - In many cases, all values are modes!

- For quantitative data, we instead look for the modal class(es).



Modal classes

- In a baseball team, players’ heights (in cm) are:

  178  172  175  184
  172  175  165  178
  177  175  180  182
  177  183  180  178
  179  162  170  171

- For the classes [160, 165), [165, 170), ..., and [185, 190), the modal class is [175, 180).

- We sometimes say the mode of this set is 177.5.

- The way of grouping matters!
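Finding the modal class amounts to grouping and counting. A minimal Python sketch with the 20 heights above; the helper `height_class` and the alternative class width of 10 cm are assumptions made for illustration:

```python
from collections import Counter

heights = [178, 172, 175, 184, 172, 175, 165, 178, 177, 175,
           180, 182, 177, 183, 180, 178, 179, 162, 170, 171]

def height_class(h, width=5):
    """Return the half-open class [lower, lower + width) that contains h."""
    lower = (h // width) * width
    return (lower, lower + width)

# Classes of width 5: [160, 165), [165, 170), ..., as on the slide.
counts = Counter(height_class(h) for h in heights)
modal_class, modal_count = counts.most_common(1)[0]
midpoint = sum(modal_class) / 2
print(modal_class, modal_count, midpoint)

# A different grouping (width 10) gives a different modal class.
wide_counts = Counter(height_class(h, width=10) for h in heights)
print(wide_counts.most_common(1)[0])
```

The second grouping illustrates why the way of grouping matters: the reported "mode" shifts with the class width.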



Variability

- Measures of variability describe the spread or dispersion of a set of data.

- Especially important when two sets of data have the same center.



Ranges and Interquartile ranges

- The range of a set of data $\{x_i\}_{i=1,...,N}$ is the difference between the maximum and minimum numbers, i.e.,

  $\max_{i=1,...,N} \{x_i\} - \min_{i=1,...,N} \{x_i\}$.

- The interquartile range of a set of data is the difference between the third and first quartiles.
  - It is the range of the middle 50% of the data.
  - It excludes the effects of extreme values.



Deviations from the mean

- Consider a set of population data $\{x_i\}_{i=1,...,N}$ with mean $\mu$.

- Intuitively, a way to measure the dispersion is to examine how each number deviates from the mean.

- For $x_i$, the deviation from the population mean is defined as

  $x_i - \mu$.

- For a sample, the deviation from the sample mean of $x_i$ is

  $x_i - \bar{x}$.

  i      x_i   Deviation
  1      1     1 − 5 = −4
  2      2     2 − 5 = −3
  3      4     4 − 5 = −1
  4      5     5 − 5 = 0
  5      6     6 − 5 = 1
  6      8     8 − 5 = 3
  7      9     9 − 5 = 4
  Mean   5



Mean deviations

- May we combine the $N$ deviations into a single number that summarizes the aggregate deviation?

- Intuitively, we may sum them up and then calculate the mean deviation:

  $\frac{\sum_{i=1}^{N} (x_i - \mu)}{N}$.

- Is it always 0?

  i      x_i   Deviation
  1      1     1 − 5 = −4
  2      2     2 − 5 = −3
  3      4     4 − 5 = −1
  4      5     5 − 5 = 0
  5      6     6 − 5 = 1
  6      8     8 − 5 = 3
  7      9     9 − 5 = 4
  Mean   5     0
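The answer is yes, and it follows directly from the definition of the population mean $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$. A short derivation:

```latex
\frac{\sum_{i=1}^{N} (x_i - \mu)}{N}
  = \frac{\sum_{i=1}^{N} x_i - N\mu}{N}
  = \frac{\sum_{i=1}^{N} x_i}{N} - \mu
  = \mu - \mu
  = 0.
```

Positive and negative deviations always cancel out, so the plain mean deviation carries no information about dispersion.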



Adjusting mean deviations

- People use two ways to adjust mean deviations:
  - Mean absolute deviations/errors (MAD):

    $\frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$.

  - Mean squared deviations/errors (variance or MSE):

    $\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$.

- A larger MAD or variance means that the data are more dispersed.

  i      x_i   d_i   |d_i|   d_i^2
  1      1     −4    4       16
  2      2     −3    3       9
  3      4     −1    1       1
  4      5     0     0       0
  5      6     1     1       1
  6      8     3     3       9
  7      9     4     4       16
  Mean   5     0     2.29    7.43



MAD vs. variance

- The main difference:
  - An MAD puts the same weight on all values.
  - A variance puts more weight on extreme values.

- They may give different ranks of dispersion:

  i      x_i   d_i   |d_i|   d_i^2
  1      0     −5    5       25
  2      4     −1    1       1
  3      5     0     0       0
  4      6     1     1       1
  5      10    5     5       25
  Mean   5     0     2.4     10.4

  i      x_i   d_i   |d_i|   d_i^2
  1      1     −4    4       16
  2      2     −3    3       9
  3      5     0     0       0
  4      8     3     3       9
  5      9     4     4       16
  Mean   5     0     2.8     10

- In general, people use variances more than MADs.
  - But MADs are still popular in some areas, e.g., demand forecasting.
  - It is at the analyst’s discretion to choose the appropriate one.
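The two small data sets above can be checked with a few lines of Python. The function names `mean_abs_dev` and `pop_variance` are chosen for this sketch only; both treat the data as a population (dividing by N).

```python
def mean_abs_dev(values):
    """Mean absolute deviation from the mean (population version)."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

def pop_variance(values):
    """Mean squared deviation from the mean (population variance)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

a = [0, 4, 5, 6, 10]  # two far-out values, three close to the mean
b = [1, 2, 5, 8, 9]   # values spread more evenly

for name, data in (("a", a), ("b", b)):
    print(name, round(mean_abs_dev(data), 2), round(pop_variance(data), 2))
```

Set `a` has the smaller MAD but the larger variance, because squaring gives its two extreme values (0 and 10) extra weight.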



Standard deviations

- One drawback of using variances is that the unit of measurement is the square of the original one.

- For the baseball team, the variance of member heights is 34.05 cm². What is it?!

- People take the square root of a variance to generate a standard deviation.

- The standard deviation of member heights is $\sqrt{34.05} \approx 5.84$ cm.

  178  172  175  184
  172  175  165  178
  177  175  180  182
  177  183  180  178
  179  162  170  171

- A standard deviation typically has more managerial implications.



Population v.s. sample variances

- Recall that the formulas for population and sample means are

  $\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$ and $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$, respectively.

- Formula-wise there is no difference.

- However, population and sample variances are

  $\sigma^2 \equiv \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ and $s^2 \equiv \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, respectively.

- Note the difference between $N$ and $n - 1$!

- Population and sample standard deviations are

  $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ and $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, respectively.

- People use $\sigma^2$, $\sigma$, $s^2$, and $s$ in almost the whole statistics world.
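Python's standard `statistics` module exposes both conventions, which makes the $N$ versus $n - 1$ difference easy to see. The data set below is the small example used earlier in these slides; treating it once as a population and once as a sample is done purely for illustration.

```python
from statistics import pvariance, variance, pstdev, stdev

data = [1, 2, 4, 5, 6, 8, 9]  # sum of squared deviations from the mean is 52

# Population formulas divide by N = 7.
print(pvariance(data), pstdev(data))

# Sample formulas divide by n - 1 = 6, so they are slightly larger.
print(variance(data), stdev(data))
```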



Coefficient of variation

- The coefficient of variation is the ratio of the standard deviation to the mean:

  Coefficient of variation $= \frac{\sigma}{\mu}$.

- When will you use coefficients of variation?



z-scores

- Consider a set of sample data $\{x_i\}_{i=1,...,n}$ with sample mean $\bar{x}$ and sample standard deviation $s$. For $x_i$, the z-score is

  $z_i = \frac{x_i - \bar{x}}{s}$.

- In a set of population data $\{x_i\}_{i=1,...,N}$ with population mean $\mu$ and population standard deviation $\sigma$, the z-score of $x_i$ is

  $z_i = \frac{x_i - \mu}{\sigma}$.

- A value’s z-score measures how many standard deviations it deviates from the mean.



z-scores vs. outliers

- For detecting outliers, one common way is to double-check whether $x_i$ is an outlier if

  $|z_i| = \left| \frac{x_i - \mu}{\sigma} \right| > 3$.

  - It is quite rare for the magnitude of a value’s z-score to be so large.
  - For sample data, use $\frac{x_i - \bar{x}}{s}$.

- Some people propose using the median and MAD in a similar way: double-check whether $x_i$ is an outlier if [1]

  $\left| \frac{x_i - \text{median}}{\text{MAD}} \right| > 3$.

- The above rules only suggest that one investigate some extreme values again. These rules are neither sufficient nor necessary for outliers.

[1] The “MAD” here can be mean absolute deviation from mean, mean absolute deviation from median, median absolute deviation from median, etc.



Correlation

- Consider the size of a house and its price in a city:

  Size (in m²)   Price (in $1000)
  75             315
  59             229
  85             355
  65             261
  72             234
  46             216
  107            308
  91             306
  75             289
  65             204
  88             265
  59             195

- How do we measure/describe the correlation (linear relationship) between the two variables?



Intuition

- Consider a set of paired data $\{(x_i, y_i)\}_{i=1,...,N}$.

- When one variable goes up, does the other one tend to go up or down?

- More precisely, if $x_i$ is larger than $\mu_x$ (the mean of the $x_i$s), is it more likely to see $y_i > \mu_y$ or $y_i < \mu_y$?

- If one tends to go up when the other goes up, we say that the two variables have a positive correlation.
  - If one goes up when the other goes down, there is a negative correlation.



Covariances

- We define the covariance of a set of two-dimensional (sample) data as

  $s_{xy} \equiv \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$.

- If most points fall in the first and third quadrants (relative to the means), most $(x_i - \bar{x})(y_i - \bar{y})$ will be positive and $s_{xy}$ tends to be positive.

- Otherwise, $s_{xy}$ tends to be negative.

- For our example, the covariance of house size and price is 617.16.

- Is it large or small?
  - This depends on how variable the two variables themselves are.



Pearson’s correlation coefficients

- To take away the variability of each variable itself, we define the (sample) correlation coefficient as

  $r \equiv \frac{s_{xy}}{s_x s_y}$,

  - $s_x$ and $s_y$ are the sample standard deviations of the $x_i$s and $y_i$s.
  - In our example, we have $r = \frac{617.16}{16.78 \times 50.45} \approx 0.729$.

- It can be shown that we always have $-1 \leq r \leq 1$.
  - $r > 0$: Positive correlation.
  - $r = 0$: No correlation.
  - $r < 0$: Negative correlation.

- People often determine the degree of correlation based on $|r|$:
  - $0 \leq |r| < 0.25$: A weak correlation.
  - $0.25 \leq |r| < 0.5$: A moderately weak correlation.
  - $0.5 \leq |r| < 0.75$: A moderately strong correlation.
  - $0.75 \leq |r| \leq 1$: A strong correlation.
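The covariance and correlation coefficient for the house data can be reproduced in plain Python. The function names below are chosen for this sketch; they implement the sample formulas with the $n - 1$ divisor.

```python
from math import sqrt

# House sizes (in square meters) and prices (in $1000) from the table.
sizes = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
prices = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

def sample_cov(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return sqrt(sample_cov(xs, xs))

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r = s_xy / (s_x * s_y)."""
    return sample_cov(xs, ys) / (sample_std(xs) * sample_std(ys))

print(round(sample_cov(sizes, prices), 2))
print(round(sample_std(sizes), 2), round(sample_std(prices), 2))
print(round(pearson_r(sizes, prices), 3))
```

Dividing by the two standard deviations removes the units, which is why $r$ can be compared across data sets while the covariance cannot.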



Correlation vs. independence

- A correlation coefficient only measures how one variable linearly depends on the other variable.

[Figure: two scatter plots, one with r = 0.5973 and one with r = 0]

- Being uncorrelated does not mean being independent!
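A tiny constructed example makes the point concrete. The data below are not from the slides: they are an assumed illustration in which y is completely determined by x (y = x²), yet the correlation coefficient is zero because the relationship is not linear.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient computed from deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: x is symmetric around 0 and y depends on x exactly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # no linear relationship, although y is a function of x
```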



Correlation vs. causation

- A correlation coefficient only measures whether two variables correlate with each other. High correlation does not mean causation.

- A causes B or B causes A? C causes A and B? Or just by chance?



Correlation of qualitative variables

- Sometimes the variables are not quantitative/numeric.

- For ordinal data, we calculate their Spearman’s rank correlation.

- For nominal data, we calculate Cramér’s V.




Part 2: Hypothesis Testing and p-value

Ling-Chieh Kung


September 4, 2016



Road map

- Sampling.

- Sampling distributions.

- Hypothesis testing.

- p-value, t test, and more.



Random vs. nonrandom sampling

- Sampling is the process of selecting a subset of entities from the whole population.

- Sampling can be random or nonrandom.

- If random, whether an entity is selected is probabilistic.
  - Randomly select 1000 phone numbers from the telephone book and then call them.

- If nonrandom, it is deterministic.
  - Ask all your classmates for their preferences on iOS/Android.

- Most statistical methods are only for random sampling.

- Some popular random sampling techniques:
  - Simple random sampling.
  - Stratified random sampling.
  - Cluster (or area) random sampling.



Simple random sampling

- In simple random sampling, each entity has the same probability of being selected.

- The good part of simple random sampling is that it is simple.

- However, it may result in nonrepresentative samples.

- In simple random sampling, there is some possibility that too much of the data we sample falls in the same stratum.
  - They have the same property.
  - E.g., it is possible that all randomly sampled voters are younger than 40.
  - The sample is thus nonrepresentative.

- How to fix this problem?



Stratified random sampling

- We may apply stratified random sampling.

- We first split the whole population into several strata.
  - Data in one stratum should be (relatively) homogeneous.
  - Data in different strata should be (relatively) heterogeneous.

- We then use simple random sampling for each stratum.




- As an example, suppose that we want to sample 40 out of 1000 graduates to understand the number of credits they get at school.

- Suppose that 100 students double majored; then we can split the whole population into two strata:

  Stratum           Stratum size
  Double major      100
  No double major   900

- To sample 40 graduates, we sample $40 \times \frac{100}{1000} = 4$ from the double-major stratum and 36 from the other stratum.
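The proportional allocation above, followed by simple random sampling within each stratum, can be sketched in Python. The graduate IDs (plain integers per stratum) are hypothetical stand-ins for a real list of graduates, and the integer division is exact only because the stratum shares divide evenly in this example.

```python
import random

strata_sizes = {"Double major": 100, "No double major": 900}
sample_size = 40
population_size = sum(strata_sizes.values())  # 1000 graduates

# Proportional allocation: each stratum gets sample_size * (stratum size / population size).
allocation = {
    name: sample_size * size // population_size
    for name, size in strata_sizes.items()
}

# Simple random sampling (without replacement) inside each stratum.
# Hypothetical IDs 0, 1, ..., size - 1 stand in for the graduates of a stratum.
samples = {
    name: random.sample(range(strata_sizes[name]), k)
    for name, k in allocation.items()
}

print(allocation)
```

When the shares do not divide evenly, the allocation has to be rounded and adjusted so that the stratum sample sizes still add up to the overall sample size.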




- We may further split the population into more strata.
  - Double major: Yes or no.
  - Class: 1994-1998, 1999-2003, 2004-2008, or 2009-2012.
  - This stratification makes sense only if students in different classes tend to take different numbers of units.

- Stratified random sampling is good at reducing sampling error.

- But it can be hard to identify a reasonable stratification.

- It is also more costly and time-consuming.



Cluster (or area) random sampling

- Imagine that you are going to introduce a new product into all the retail stores in Taiwan.

- If the product is actually unpopular, an introduction with a large quantity will incur a huge loss.

- How to get an idea about the popularity?

- Typically we first try to introduce the product in a small area. We put the product on the shelves only in those stores in the specified area.

- This is the idea of cluster (or area) random sampling.
  - Those consumers in the area form a sample.



Cluster (or area) random sampling

- In cluster random sampling, we define clusters.

- We will only choose one or some clusters and then collect all the data in these clusters.
  - If a cluster is too large, we may further split it into multiple second-stage clusters.

- Therefore, we want data in a cluster to be heterogeneous, and data across clusters somewhat homogeneous.

- For example, people may do cluster random sampling to understand the popularity of a new product. Those chosen cities (counties, states, etc.) are called test market cities (counties, states, etc.).
  - People use cluster random sampling in this case because of its feasibility and convenience.
  - We should select test market cities whose population profiles are similar to that of the entire country.



Nonrandom sampling

- Sometimes we do nonrandom sampling.

- Convenience sampling.
  - The researcher samples data that are easy to sample.

- Judgment sampling.
  - The researcher decides who to ask or what data to collect.

- Quota sampling.
  - In each stratum, we use whatever method is easy to fill the quota, a predetermined number of samples in the stratum.

- Snowball sampling.
  - Once we ask one person, we ask her/him to suggest others.

- Nonrandom sampling cannot be analyzed by the statistical methods we introduce in this course.



Road map

- Sampling.

- Sampling distributions.

- Hypothesis testing.

- p-value, t test, and more.



Sampling distributions

- When we cannot examine the whole population, we study a sample.
  - What will be contained in a random sample is unpredictable.
  - We need to know the probability distribution of a sample so that we may connect the sample with the population.

- The probability distribution of a sample is a sampling distribution.



Sampling distributions

- A factory produces bags of candies. Ideally, each bag should weigh 2 kg. As the production process cannot be perfect, a bag of candies should weigh between 1.8 and 2.2 kg.

- Let $X$ be the weight of a bag of candies. Let $\mu$ and $\sigma$ be its expected value and standard deviation.
  - Is $\mu = 2$?
  - Is $1.8 < \mu < 2.2$?
  - How large is $\sigma$?

- Let’s sample:
  - In a random sample of 1 bag of candies, suppose it weighs 2.1 kg. May we conclude that $1.8 < \mu < 2.2$?
  - What if the average weight of 5 bags in a random sample is 2.1 kg?
  - What if the sample size is 10, 50, or 100?
  - What if the mean is 2.3 kg?

- We need to know the sampling distribution of those statistics (sample mean, sample standard deviation, etc.).



Sample means

- The sample mean is one of the most important statistics.

Definition 1

Let $\{X_i\}_{i=1,...,n}$ be a sample from a population, then

  $\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$

is the sample mean.

- Sometimes we write $\bar{x}_n$ to emphasize that the sample size is $n$.

- We assume that $X_i$ and $X_j$ are independent for all $i \neq j$.
  - This is fine if $n \ll N$, i.e., we sample a few items from a large population.
  - In practice, we require $n \leq 0.05N$.



Means and variances of sample means

- Suppose the population mean and variance are $\mu$ and $\sigma^2$, respectively.
  - These two numbers are fixed.

- A sample mean $\bar{x}$ is a random variable.
  - It has its expected value $E[\bar{x}]$, variance $\mathrm{Var}(\bar{x})$, and standard deviation $\sqrt{\mathrm{Var}(\bar{x})}$. These numbers are all fixed.
  - They are also denoted as $\mu_{\bar{x}}$, $\sigma^2_{\bar{x}}$, and $\sigma_{\bar{x}}$, respectively.

- For any population, we have the following theorem:

Proposition 1 (Mean and variance of a sample mean)

Let $\{X_i\}_{i=1,...,n}$ be a size-$n$ random sample from a population with mean $\mu$ and variance $\sigma^2$, then we have

  $\mu_{\bar{x}} = \mu$, $\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$, and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.



Means and variances of sample means

- Do the terms confuse you?
  - The sample mean vs. the mean of the sample mean.
  - The sample variance vs. the variance of the sample mean.

- By definition, they are:
  - $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} X_i$; a random variable.
  - $E[\bar{x}]$; a constant.
  - $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})^2$; a random variable.
  - $\mathrm{Var}(\bar{x})$; a constant.

- The sample variance also has its mean and variance.



Example: Quality inspection

- The weight of a bag of candies follows a normal distribution with mean $\mu = 2$ and standard deviation $\sigma = 0.2$.

- Suppose the quality control officer decides to sample 4 bags and calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$.
  - Note that my production process is actually “good”: $\mu = 2$.
  - Unfortunately, it is not perfect: $\sigma > 0$.
  - We may still be punished (if we are unlucky) even though $\mu = 2$.

- What is the probability that I will be punished?
  - We want to calculate $1 - \Pr(1.8 < \bar{x} < 2.2)$.
  - We know that $\mu_{\bar{x}} = \mu = 2$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{4}} = 0.1$.
  - But we do not know the probability distribution of $\bar{x}$!



Sampling from a normal population

- If the population is normal, the sample mean is also normal!

Proposition 2

Let $\{X_i\}_{i=1,...,n}$ be a size-$n$ random sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Then

  $\bar{x} \sim \mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

- We already know that $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. This is true regardless of the population distribution.

- When the population is normal, the sample mean will also be normal.



Example revisited: Quality inspection

- The weight of a bag of candies follows a normal distribution with mean $\mu = 2$ and standard deviation $\sigma = 0.2$.

- Suppose the quality control officer decides to sample 4 bags and calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$.

- What is the probability that I will be punished?
  - The distribution of the sample mean $\bar{x}$ is ND(2, 0.1).
  - $\Pr(\bar{x} < 1.8) + \Pr(\bar{x} > 2.2) \approx 0.045$.
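The probability can be computed with `statistics.NormalDist` from Python's standard library. This is a minimal sketch; the variable names are chosen for illustration.

```python
from statistics import NormalDist

mu, sigma, n = 2.0, 0.2, 4

# By Proposition 2, the sample mean is normal with mean mu
# and standard deviation sigma / sqrt(n) = 0.1.
sample_mean_dist = NormalDist(mu, sigma / n ** 0.5)

# Punished if the sample mean falls below 1.8 or above 2.2.
p_punished = sample_mean_dist.cdf(1.8) + (1 - sample_mean_dist.cdf(2.2))
print(round(p_punished, 4))
```

Both cut-offs are two standard deviations of the sample mean away from 2, so the result is the familiar two-sided tail probability beyond ±2 standard deviations.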



Adjusting the standard deviation

- When the population is ND($\mu = 2$, $\sigma = 0.2$) and the sample size is $n = 4$, the probability of punishment is 0.045.

- If we adjust our standard deviation $\sigma$ (by paying more or less attention to the production process), the probability will change.

- Reducing $\sigma$ reduces the probability of being punished. With the sampling distribution of $\bar{x}$, we may optimize $\sigma$.
  - An improvement from 0.2 to 0.15 is helpful; from 0.15 to 0.1 is not.



Adjusting the sample size

I When the population is ND(2, 0.2) and the sample size is n = 4, the probability of punishment is 0.045.

I If the quality control officer increases the sample size n, the probability will decrease.

I µ = 2 is actually ideal. A larger sample size makes the officer less likely to make a mistake.
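To see how the punishment probability reacts to σ and to n, one may tabulate it for a few values. This is a minimal sketch under the same normal-population assumption; the helper `punish_prob` and the chosen grids of σ and n are hypothetical, picked only for illustration.

```python
from statistics import NormalDist

def punish_prob(sigma, n, mu=2.0, low=1.8, high=2.2):
    """Probability that the sample mean of n bags falls outside [low, high]."""
    xbar = NormalDist(mu, sigma / n ** 0.5)
    return xbar.cdf(low) + (1 - xbar.cdf(high))

# Vary sigma while keeping n = 4.
by_sigma = {s: punish_prob(s, 4) for s in (0.2, 0.15, 0.1)}
# Vary n while keeping sigma = 0.2.
by_n = {n: punish_prob(0.2, n) for n in (4, 9, 16)}

print(by_sigma)
print(by_n)
```

Both a smaller σ and a larger n shrink the standard deviation of the sample mean, which is why the probability falls in either case.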



Distribution of the sample mean

I So now we have one general conclusion: When we sample from a normal population, the sample mean is also normal.
  I And its mean and standard deviation are µ and $\frac{\sigma}{\sqrt{n}}$, respectively.

I What if the population is non-normal?

I Fortunately, we have a very powerful theorem, the central limit theorem, which applies to any population.



Central limit theorem

I The theorem says that a sample mean is approximately normal when the sample size is large enough.

Proposition 3 (Central limit theorem)

Let $\{X_i\}_{i=1,...,n}$ be a size-n random sample from a population with mean µ and standard deviation σ. Let $\bar{x}_n$ be the sample mean. If σ < ∞, then $\bar{x}_n$ converges to $\mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ as n → ∞.

I How large is “large enough”?

I In practice, typically n ≥ 30 is believed to be large enough.
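The theorem can be illustrated by simulation. The sketch below is an assumption-laden example rather than anything from the slides: it draws from an exponential population (mean 1, standard deviation 1), uses n = 30, and fixes the seed and the number of repetitions arbitrarily.

```python
import random
from statistics import mean, stdev

random.seed(0)   # fixed seed so that the run is reproducible

n = 30           # sample size
reps = 5000      # number of repeated samples

# Exponential population with rate 1: mean 1, standard deviation 1, clearly non-normal.
sample_means = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

print("mean of sample means:", round(mean(sample_means), 3))   # close to mu = 1
print("std of sample means: ", round(stdev(sample_means), 3))  # close to 1 / sqrt(30)

# Share of sample means within 1.96 standard errors of mu; near 0.95 if roughly normal.
se = 1 / n ** 0.5
coverage = sum(abs(m - 1) <= 1.96 * se for m in sample_means) / reps
print("coverage:", round(coverage, 3))
```

Even though a single exponential draw is heavily skewed, the sample means cluster around µ with spread close to σ/√n, as the theorem suggests.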



Road map

I Sampling.






Hypothesis testing

I How do scientists (physicists, chemists, etc.) do research?
  I Observe phenomena.
  I Make hypotheses.
  I Test the hypotheses through experiments (or other methods).
  I Make conclusions about the hypotheses.

I Social scientists and business researchers do the same thing with hypothesis testing.
  I One of the most important techniques of statistical inference.
  I A technique for (statistically) proving things.
  I Relying on sampling distributions.



People ask questions

I In the business (or social science) world, people ask questions:
  I Are older workers more loyal to a company?
  I Does the newly hired CEO enhance our profitability?
  I Is one candidate preferred by more than 50% of voters?
  I Do teenagers eat fast food more often than adults?
  I Is the quality of our products stable enough?

I How should we answer these questions?

I Statisticians suggest:
  I First make a hypothesis.
  I Then test it with samples and statistical methods.



Statistical hypotheses

I A statistical hypothesis is a formal way of stating a hypothesis.
  I Typically it is a mathematical description of parameters to test.

I It contains two parts:
  I The null hypothesis (denoted as H0).
  I The alternative hypothesis (denoted as Ha or H1).

I The alternative hypothesis is:
  I The thing that we want (need) to prove.
  I The conclusion that can be made only if we have strong evidence.

I The null hypothesis corresponds to a default position.
  I We first assume that the null hypothesis is correct.
  I Then we collect sample data.
  I If under the null hypothesis it is quite unlikely to see our observed result, we claim that the null hypothesis is wrong.



Statistical hypotheses: example 1

I In our factory, we produce packs of candy whose average weight should be 1 kg.

I One day, a consumer told us that his pack only weighs 900 g.

I We need to know whether this is just a rare event or our production system is out of control.

I If (we believe) the system is out of control, we need to shut down the machine and spend two days on inspection and maintenance. This will cost us at least $100,000.

I So we should not believe that our system is out of control just because of one complaint. What should we do?




I We first state a hypothesis: “Our production system is under control.”

I Then we ask: Is there strong enough evidence showing that the hypothesis is wrong, i.e., the system is out of control?
  I Initially, we assume that our system is under control.
  I Then we do a survey to see if we have strong enough evidence.
  I We shut down machines only if we can “prove” that the system is indeed out of control.

I Let µ be the average weight (in kg). The statistical hypothesis is

H0 : µ = 1

Ha : µ ≠ 1.




I In our society, we adopt the presumption of innocence.
  I One is considered innocent until proven guilty.

I So when there is a person who probably stole some money:

H0 : The person is innocent

Ha : The person is guilty.

I There are two possible errors:
  I One is guilty but we think she/he is innocent.
  I One is innocent but we think she/he is guilty.

I Which one is more critical?
  I It is unacceptable that an innocent person is considered guilty.
  I We will say one is guilty only if there is strong evidence.




I Consider the following hypothesis: “The candidate is preferred by more than 50% of voters.”

I As we need a default position, and the percentage that we care about is 50%, we will choose our null hypothesis as

H0 : p = 0.5.

  I p is the population proportion of voters preferring the candidate.
  I More precisely, let $X_i = 1$ if voter i prefers this candidate and 0 otherwise, i = 1, ..., N; then $p = \frac{\sum_{i=1}^{N} X_i}{N}$.

I How about the alternative hypothesis? Should it be

Ha : p > 0.5 or Ha : p < 0.5?




I The choice of the alternative hypothesis depends on the related decisions or actions to make.

I Suppose one will go for the election only if she thinks she will win (i.e., p > 0.5); the alternative hypothesis will be

Ha : p > 0.5.

I Suppose one tends to participate in the election and will give up only if the chance is slim; the alternative hypothesis will be

Ha : p < 0.5.

I The alternative hypothesis is “the thing we want (need) to prove.”



Two types of errors

I Type-1 error (false positive): Rejecting a true null hypothesis.
  I There is nothing, but we say there is something.

I Type-2 error (false negative): Not rejecting a false null hypothesis.
  I There is something, but we do not see it.



Remarks

I We want to control the chances for us to make these mistakes.
  I Unfortunately, we cannot control both.
  I We choose to control the probability of a type-1 error.
  I The choice of the default position is important.

I For setting up a statistical hypothesis:
  I Our default position will be put in the null hypothesis.
  I The thing we want to prove (i.e., the thing that needs strong evidence) will be put in the alternative hypothesis.

I For writing the mathematical statement:
  I The equal sign (=) will always be put in the null hypothesis.
  I The alternative hypothesis contains an unequal sign or strict inequality: ≠, >, or <.

I The direction of the alternative hypothesis, when it is an inequality, depends on the context.



One-tailed tests and two-tailed tests

I If the alternative hypothesis contains an unequal sign (≠), the test is a two-tailed test.

I If it contains a strict inequality (> or <), the test is a one-tailed test.

I Suppose we want to test the value of the population mean.
  I In a two-tailed test, we test whether the population mean significantly deviates from a hypothesized value. We do not care whether it is larger or smaller.
  I In a one-tailed test, we test whether the population mean significantly deviates from a hypothesized value in a specific direction.



The first example: a two-tailed test

I Let’s test the average weight (in g) of our products.

H0 : µ = 1000

Ha : µ ≠ 1000.

I The variance of the product weights is $\sigma^2 = 40000\ \mathrm{g}^2$.
  I The case with unknown $\sigma^2$ will be discussed later.

I A random sample has been collected.
  I Suppose the sample size n = 100.
  I Suppose the sample mean $\bar{X} = 963$.

I How to make a conclusion?



Controlling the error probability

I All we can do is to collect a random sample and make our conclusion based on the observed sample.

I It is natural that we may be wrong when we claim µ ≠ 1000.

I We want to control the error probability.
  I Let α be the maximum probability for us to make this error.
  I α is called the significance level.
  I 1 − α is called the confidence level.
  I Target: If µ = 1000, our sampling and testing process will make us claim that µ ≠ 1000 with probability at most α.



Rejection rule

I Now let’s test with the significance level α = 0.05.

I Intuitively, if $\bar{X}$ deviates from 1000 a lot, we should reject the null hypothesis and believe that µ ≠ 1000.
  I If µ = 1000, it is very unlikely to observe such a large deviation.
  I So such a large deviation provides strong evidence.

I So we start by sampling and calculating the sample mean.

I We want to construct a rejection rule: If $|\bar{X} - 1000| > d$, we reject H0. We need to calculate d.



Rejection rule

I We want a distance d such that if H0 is true, the probability of rejecting H0 is at most 5%, i.e.,

$\Pr(|\bar{X} - 1000| > d \mid \mu = 1000) \le 0.05$.

I The smallest d that satisfies the above inequality requires $\Pr(|\bar{X} - 1000| > d) = 0.05$.

I Consider $\bar{X}$:
  I We know σ = 200 and n = 100.
  I We assume that µ = 1000.
  I Thanks to the central limit theorem, $\bar{X} \sim \mathrm{ND}(1000, 20)$.



Rejection rule: the critical value

I According to $\bar{X} \sim \mathrm{ND}(1000, 20)$, $\Pr(|\bar{X} - 1000| > 39.2) = 0.05$. The rejection region is R = (−∞, 960.8) ∪ (1039.2, ∞).

I If $\bar{X}$ falls in the rejection region, we reject H0.
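The critical values can be reproduced with a few lines of code. This is a sketch assuming Python's standard `statistics.NormalDist`; the variable names are illustrative and not from the slides.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05
se = sigma / n ** 0.5                      # standard deviation of the sample mean: 20

# Two-tailed test: put alpha / 2 in each tail.
z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96
d = z * se                                 # about 39.2
lower, upper = mu0 - d, mu0 + d

x_bar = 963
reject = x_bar < lower or x_bar > upper
print(round(lower, 1), round(upper, 1), reject)
```

The observed mean 963 lies just inside the lower critical value, so the rule does not reject H0 at α = 0.05.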



Rejection rule: the critical value

I Because $\bar{x} = 963 \notin R$, we cannot reject H0.
  I The deviation from 1000 is not large enough.
  I The evidence is not strong enough.

I In this example, the two values 960.8 and 1039.2 are the critical values for rejection.
  I If the sample mean is more extreme than one of the critical values, we reject H0.
  I Otherwise, we do not reject H0.

I $\bar{x} = 963$ is not strong enough to support Ha: µ ≠ 1000.

I Concluding statement:
  I Because the sample mean does not lie in the rejection region, we cannot reject H0.
  I With a 95% confidence level, there is no strong evidence showing that the average weight is not 1000 g.
  I Therefore, we should not shut down machines to do an inspection.



Summary

I We want to know whether the machine is out of control.
  I If the machine is actually good, we do not want to reach a conclusion that requires an inspection and maintenance.
  I We will do the inspection only if we have strong evidence suggesting that µ ≠ 1000.

I We want to know whether H0 is false, i.e., µ ≠ 1000.

I We control the probability of making a wrong conclusion.
  I We should not reject H0 if it is true.
  I We limit the probability at α = 5%.

I We will conclude that H0 is false if $\bar{X}$ falls in the rejection region.
  I The calculation of the critical values is based on the normal distribution, which can always be transformed to the z distribution.
  I This is called a z test.



Not rejecting vs. accepting

I We should be careful in writing our conclusions:
  I Wrong: Because the sample mean does not lie in the rejection region, we accept H0. With a 95% confidence level, there is strong evidence showing that the average weight is 1000 g.
  I Right: Because the sample mean does not lie in the rejection region, we cannot reject H0. With a 95% confidence level, there is no strong evidence showing that the average weight is not 1000 g.

I Being unable to prove that something is false does not mean that it is true!



The first example (part 2)

I Suppose that we modify the hypothesis into a directional one [1]:

H0 : µ = 1000

Ha : µ < 1000.

We still have $\sigma^2 = 40000$, n = 100, and α = 0.05.
  I This is a one-tailed test.
  I Once we have strong evidence supporting Ha, we will claim that µ < 1000.

I We need to find a distance d such that

$\Pr(1000 - \bar{X} > d \mid \mu = 1000) = 0.05$.

[1] Some researchers write H0 : µ ≥ 1000 in this case.



Rejection rule: the critical value

I For $0.05 = \Pr(1000 - \bar{X} > d)$, we have d = 32.9.

I As the observed sample mean $\bar{x} = 963 \in (-\infty, 967.1)$, we reject H0.
  I The deviation from 1000 is large enough.
  I The evidence is strong enough.

I In this example, 967.1 is the critical value for rejection.
  I If the sample mean is more extreme than (in this case, below) the critical value, we reject H0.
  I Otherwise, we do not reject H0.

I There is strong evidence supporting Ha: µ < 1000.

I Concluding statement:
  I Because the sample mean lies in the rejection region, we reject H0. With a 95% confidence level, there is strong evidence showing that the average weight is less than 1000 g.



One-tailed tests vs. two-tailed tests

I When should we use a two-tailed test?
  I We use a two-tailed test when we lack information about the direction.
  I E.g., we suspect that the population mean has changed, but we have no idea whether it has become larger or smaller.

I If we know or believe that the change is possible only in one direction, we may use a one-tailed test.

I Having more information (i.e., knowing the direction of change) makes rejection “easier,” i.e., makes it easier to find strong enough evidence.



Summary

I Distinguish the following pairs:
  I One- and two-tailed tests.
  I No evidence showing H0 is false and having evidence showing H0 is true.
  I Not rejecting H0 and accepting H0.
  I Using = and using ≥ or ≤ in the null hypothesis.



Road map

I Sampling.






The p-value

I The p-value is an important, meaningful, and widely adopted tool for hypothesis testing.

Definition 2

For an observed value of a statistic in a statistical test, the p-value is the probability of observing a value that is more extreme than the observed value under the assumption that the null hypothesis is true.

I Calculated based on an observed value of the statistic.
I Is the tail probability of the observed value.
I Assuming that the null hypothesis is true.



The p-value

I Mathematically:
  I Suppose we test a population mean µ with a one-tailed test

H0 : µ = 1000

Ha : µ < 1000.

  I Given an observed $\bar{x}$, the p-value is defined as

$\Pr(\bar{X} \le \bar{x})$.

I In the previous example, σ = 200, n = 100, α = 0.05, and $\bar{x} = 963$.
  I If H0 is true, i.e., µ = 1000, we have $\Pr(\bar{X} \le 963) = 0.032$.
  I The p-value of $\bar{x}$ is 0.032.
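The p-value in this example is a single left-tail probability. A minimal sketch, assuming Python's standard `statistics.NormalDist` and variable names chosen for illustration:

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05
x_bar = 963

# Under H0, the sample mean follows ND(mu0, sigma / sqrt(n)).
null_dist = NormalDist(mu0, sigma / n ** 0.5)

# Left-tail probability, because Ha is mu < 1000.
p_value = null_dist.cdf(x_bar)
print(round(p_value, 3), p_value < alpha)
```

Comparing `p_value` with α gives the same decision as comparing 963 with the one-tailed critical value 967.1.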



How to use the p-value?

I The p-value can be used for constructing a rejection rule.

I For a one-tailed test:
  I If the p-value is smaller than α, we reject H0.
  I If the p-value is greater than α, we do not reject H0.

I In our example, the one-tailed test is

H0 : µ = 1000

Ha : µ < 1000.

  I We have α = 0.05.
  I Because the p-value 0.032 < 0.05, we reject H0.



p-values vs. critical values

I Using the p-value is equivalent to using the critical values.
  I The rejection-or-not decision we make will be the same based on the two methods.



The benefit of using the p-value

I In many studies, researchers do not determine the significance level α before a test is conducted.

I They calculate the p-value and then mark the significance of the result with stars.

I One typical way of assigning stars:

p-value       Significant?            Mark
(0, 0.01]     Highly significant      ***
(0.01, 0.05]  Moderately significant  **
(0.05, 0.1]   Slightly significant    *
(0.1, 1)      Insignificant           (Empty)



The size of a p-value

I Suppose one is testing whether people at different ages sleep for at least eight hours per day on average.
  I Age groups: [10, 15), [15, 20), [20, 25), etc.
  I For group i, a one-tailed test is conducted. Ha : $\mu_i > 8$.
  I The result may be presented in a table:

Group  Age group  p-value
1      [10,15)    0.0002***
2      [15,20)    0.2
3      [20,25)    0.06*
4      [25,30)    0.04**
5      [30,35)    0.03**

I A smaller p-value does NOT mean a larger deviation!
  I We cannot conclude that $\mu_5 > \mu_4$, $\mu_1 > \mu_3$, etc.
  I There are other tests for the difference between two population means.



The p-value for two-tailed tests

I How to construct the rejection rule for a two-tailed test?
  I If the p-value is smaller than $\frac{\alpha}{2}$, we reject H0.
  I If the p-value is greater than $\frac{\alpha}{2}$, we do not reject H0.

I Consider the two-tailed test

H0 : µ = 1000

Ha : µ ≠ 1000.

  I We have α = 0.05.
  I Because the p-value $0.032 > \frac{\alpha}{2} = 0.025$, we do not reject H0.

I Some researchers/books/software use another definition:
  I The p-value for a two-tailed test is two times that for the corresponding one-tailed test.
  I They then compare this p-value with α.
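The two conventions for a two-tailed test lead to the same decision, which can be checked directly. The sketch below assumes Python's standard `statistics.NormalDist`; all names are illustrative.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05
x_bar = 963
null_dist = NormalDist(mu0, sigma / n ** 0.5)

# Tail probability on the side where the observed mean lies.
one_tail = min(null_dist.cdf(x_bar), 1 - null_dist.cdf(x_bar))

# Convention 1: compare the one-tail probability with alpha / 2.
reject_half_alpha = one_tail < alpha / 2
# Convention 2: double the tail probability and compare with alpha.
two_tailed_p = 2 * one_tail
reject_doubled = two_tailed_p < alpha

print(round(one_tail, 3), round(two_tailed_p, 3), reject_half_alpha, reject_doubled)
```

Since `one_tail < alpha / 2` and `2 * one_tail < alpha` are the same inequality, the two rules can never disagree; only the reported number differs.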



Summary

I The p-value is the tail probability of the realized value of a statistic, assuming the null hypothesis is true.

I The p-value method is an alternative way of forming the rejection rule.
  I It is equivalent to the critical-value method.

I The p-value reflects the strength of the evidence against H0: the smaller it is, the stronger the evidence. It is not the probability that H0 is false.

I It does not measure the magnitude of the deviation.



The z test

I In example 1, basically we use the fact that $\bar{X} \sim \mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

I This implies that $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathrm{ND}(0, 1)$, the so-called standard normal distribution, or the z distribution.
  I Therefore, this test is called the z test.

I This requires knowledge of σ.



When the variance is unknown

I When the population variance $\sigma^2$ is unknown, the quantity $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ is unknown.

I What if we use the sample variance $S^2$ as a substitute?

Proposition 4

For a normal population, the quantity

$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$

follows the t distribution with n − 1 degrees of freedom.

I What is the t distribution?



The t distribution

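I The t distribution is defined as follows:

Definition 3

A random variable X follows the t distribution with n degrees of freedom, denoted as X ∼ t(n), if its probability density function is

$f(x \mid n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$,

for all x ∈ (−∞, ∞).

I $\Gamma(x) = \int_0^{\infty} z^{x-1} e^{-z} \, dz$ is the gamma function.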
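The density in Definition 3 is easy to evaluate numerically. The sketch below is an illustration using Python's `math.gamma`; the function names `t_pdf` and `std_normal_pdf` are made up for this example.

```python
import math

def t_pdf(x, n):
    """Density of the t distribution with n degrees of freedom (Definition 3)."""
    coef = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return coef * (1 + x * x / n) ** (-(n + 1) / 2)

def std_normal_pdf(x):
    """Density of the standard normal (z) distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# The t density at 0 rises toward the standard normal density as n grows.
for n in (1, 5, 30, 100):
    print(n, round(t_pdf(0.0, n), 4))
print("z", round(std_normal_pdf(0.0), 4))
```

At the center the t density is lower than the z density, and in the tails it is higher, which matches the picture of a flatter curve.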



The z and t distributions

I Let’s compare $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ and $T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$.
  I Because we do not know σ, we use S to substitute for it.
  I Z ∼ ND(0, 1) and T ∼ t(n − 1).
  I As the t distribution is a substitute for the z distribution, it is designed to be also centered at 0: E[T] = E[Z] = 0.
  I However, as we add one more random variable into the formula (S is random, whereas σ is a known constant), T will be “more random” than Z, i.e., Var(T) > Var(Z).

I Graphically, t curves will be flatter than the z curve.
  I Fact: t(n) → ND(0, 1) as n → ∞.



The t test

I We will use the t test to test the population mean if the population is normal.

I If the sample size is large, we may still use the z distribution with s substituting for σ.



Example 2

I An MBA program seldom admits applicants without work experience longer than two years.

I To test whether the average work experience of admitted students is above two years, 20 admitted applicants are randomly selected.

I Their work experiences prior to entering the program are recorded.
  I Prior to entering the program, they have an average work experience of 2.5 years. This is the sample mean.
  I The sample standard deviation is 1.3765 years.

I The population is believed to be normal.

I The confidence level is set to 95%.



Example 2: hypothesis

I Suppose the one asking the question is a potential applicant with one year of work experience. He is pessimistic and will apply for the program only if the average work experience is proven to be less than two years.

I The hypothesis is

H0 : µ = 2

Ha : µ < 2.

I µ is the average work experience (in years) of all admitted applicants prior to entering the program.

I To encourage him, we need to give him strong evidence showing that his chance is high.



Example 2: hypothesis and test

I Suppose he is optimistic and will not apply for the program only if the average work experience is proven to be greater than two years.

I The hypothesis becomes

H0 : µ = 2

Ha : µ > 2.

I To discourage him, we need to give him strong evidence showing that his chance is slim.

I Let’s consider the optimistic candidate (and Ha : µ > 2) first.

I Because the population variance is unknown and the population is normal, we may use the t test.



Example 2A: calculation and interpretation

I Calculation:
  I The p-value is $\Pr(\bar{X} > 2.5 \mid \mu = 2) = 0.0604$.

I Conclusion:
  I For this one-tailed test, as the p-value > 0.05 = α, we do not reject H0.
  I There is no strong evidence showing that the average work experience is longer than two years.
  I The result is not strong enough to discourage the potential applicant, who has only one year of work experience.

I Decision:
  I The (optimistic) applicant should apply.
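The p-value above comes from the t distribution with 19 degrees of freedom. Statistical software normally reports it directly; the sketch below instead integrates the t density numerically with Simpson's rule so that it needs only the standard library. The helper functions and the number of integration steps are assumptions made for illustration.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Pr(T > t) for t >= 0: Simpson's rule on [0, t], then symmetry about 0."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

n, x_bar, s, mu0, alpha = 20, 2.5, 1.3765, 2.0, 0.05
t_stat = (x_bar - mu0) / (s / math.sqrt(n))    # about 1.62
p_value = t_upper_tail(t_stat, n - 1)          # right tail, because Ha is mu > 2
print(round(t_stat, 3), round(p_value, 4), p_value < alpha)
```

For the pessimistic applicant with Ha : µ < 2, the relevant probability is the left tail, `1 - p_value`.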



Example 2B – a pessimistic applicant

I Suppose the applicant is pessimistic and the hypothesis is

H0 : µ = 2

Ha : µ < 2.

I The p-value will be $\Pr(\bar{X} < 2.5 \mid \mu = 2) = 1 - 0.0604 = 0.9396$.
  I This is calculated based on the t distribution.
  I We do not reject H0 and cannot conclude that µ < 2. There is no strong evidence to encourage him.
  I He should not apply.

I Note that when we write different alternative hypotheses, the final decision is different!
  I This happens if and only if in both cases we do not reject H0.



Summary

I To test the population mean µ:

                       Population distribution
σ²       Sample size   Normal    Nonnormal
Known    n ≥ 30        z         z
Known    n < 30        z         Nonparametric
Unknown  n ≥ 30        t or z    z
Unknown  n < 30        t         Nonparametric

I More parameters that may be tested:
  I Population proportion (z test).
  I Population variance (χ² test).
  I Difference of two population means (t test).
  I Ratio of two population variances (F test).


Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression


Part 3:Regression Analysis

Ling-Chieh Kung


September 4, 2016

Regression Analysis 1 / 83 Ling-Chieh Kung (NTU IM)


Correlation and prediction

I We often try to find correlation among variables.

I For example, prices and sizes of houses:

House          1    2    3    4    5    6
Size (m²)      75   59   85   65   72   46
Price ($1000)  315  229  355  261  234  216

House          7    8    9    10   11   12
Size (m²)      107  91   75   65   88   59
Price ($1000)  308  306  289  204  265  195

I We may calculate their correlation coefficient as r = 0.729.

I Now given a house whose size is 100 m², may we predict its price?
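The correlation coefficient can be recomputed from the twelve houses. This is a plain-Python sketch; the function name `correlation` and the list names are chosen for illustration.

```python
import math

size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]               # m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]   # $1000

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(size, price)
print(round(r, 3))
```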



Correlation among more than two variables

I Sometimes we have more than two variables.

I For example, we may also know the number of bedrooms in each house:

House          1    2    3    4    5    6
Size (m²)      75   59   85   65   72   46
Price ($1000)  315  229  355  261  234  216
Bedroom        1    1    2    2    2    1

House          7    8    9    10   11   12
Size (m²)      107  91   75   65   88   59
Price ($1000)  308  306  289  204  265  195
Bedroom        3    3    2    1    3    1

I How to summarize the correlation among the three variables?

I How to predict house price based on size and number of bedrooms?



Regression analysis

I Regression is a solution!

I As one of the most widely used tools in Statistics, it discovers:
  I Which variables affect a given variable.
  I How they affect the target.

I In general, we will predict/estimate one dependent variable by one or multiple independent variables.
  I Independent variables: Potential factors that may affect the outcome.
  I Dependent variable: The outcome.
  I Independent variables are explanatory variables; the dependent variable is the response variable.

I As another example, suppose we want to predict the number of arriving consumers for tomorrow:
  I Dependent variable: Number of arriving consumers.
  I Independent variables: Weather, holiday or not, promotion or not, etc.



Types of regression analysis

I Based on the number of independent variables:
  I Simple regression: One independent variable.
  I Multiple regression: More than one independent variable.

I The dependent variable may be quantitative or qualitative.
  I In ordinary regression, the dependent variable is quantitative.
  I In logistic regression, the dependent variable is qualitative.

I There are other types of regression models.



Road map

I Simple regression.

I Multiple regression.

I Indicator variables and interaction.

I Endogeneity and residual analysis.

I Logistic regression.



Basic principle

I Consider the price-size relationship again. In the sequel, let $x_i$ be the size and $y_i$ be the price of house i, i = 1, ..., 12.

Size (in m²)  Price (in $1000)
46            216
59            229
59            195
65            261
65            204
72            234
75            315
75            289
85            355
88            265
91            306
107           308

I How to relate sizes and prices “in the best way?”



Linear estimation

I If we believe that the relationship between the two variables is linear, we will assume that

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.

  I $\beta_0$ is the intercept of the equation.
  I $\beta_1$ is the slope of the equation.
  I $\epsilon_i$ is the random noise for estimating record i.

I Somehow there is such a formula, but we do not know $\beta_0$ and $\beta_1$.
  I $\beta_0$ and $\beta_1$ are the parameters of the population.
  I We want to use our sample data (e.g., the information of the twelve houses) to estimate $\beta_0$ and $\beta_1$.
  I We want to form two statistics $\hat{\beta}_0$ and $\hat{\beta}_1$ as our estimates of $\beta_0$ and $\beta_1$.



Linear estimation

I Given the values of $\hat{\beta}_0$ and $\hat{\beta}_1$, we will use $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ as our estimate of $y_i$.

I Then we have

$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \epsilon_i$,

where $\epsilon_i$ is now interpreted as the estimation error.
  I We hope $\epsilon_i = y_i - \hat{y}_i$ to be small.

I For all data points, let’s minimize the sum of squared errors (SSE):

$\sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2$.

I The solution of

$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2$

is our least square approximation (estimation) of the given data.



Least square approximation

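I The least square approximation problem

$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2$

has a closed-form formula for the best $(\hat{\beta}_0, \hat{\beta}_1)$:

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.

I For our house example, we will get $(\hat{\beta}_0, \hat{\beta}_1) = (102.717, 2.192)$.
  I Its SSE is 13118.63.
  I We will never know the true values of $\beta_0$ and $\beta_1$. However, according to our sample data, the best (least square) estimate is (102.717, 2.192).
  I We tend to believe that $\beta_0 = 102.717$ and $\beta_1 = 2.192$.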
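The closed-form formula can be applied to the house data in a few lines. This is a plain-Python sketch; the variable names, and the prediction for a 100 m² house at the end, are additions for illustration.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]               # x_i, in m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]   # y_i, in $1000

n = len(size)
x_bar = sum(size) / n
y_bar = sum(price) / n

# Slope and intercept from the closed-form least square formulas.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price)) / sum(
    (x - x_bar) ** 2 for x in size
)
b0 = y_bar - b1 * x_bar

# Sum of squared errors of the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))

# Predicted price (in $1000) of a 100 m^2 house.
pred_100 = b0 + b1 * 100
print(round(b0, 3), round(b1, 3), round(sse, 2), round(pred_100, 1))
```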



Interpretations

I Our regression model is

y = 102.717 + 2.192x.

I Interpretation: When the house size increases by 1 m², the price is expected to increase by $2,192.

I (Bad) interpretation: For a house whose size is 0 m², the price is expected to be $102,717.



Linear multiple regression

I In most cases, more than one independent variable may be used to explain the outcome of the dependent variable.

I For example, consider the number of bedrooms.

I We may take both variables as independent variables to do linear multiple regression:

$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$.

  I $y_i$ is the house price (in thousands of dollars).
  I $x_{1,i}$ is the house size (in m²).
  I $x_{2,i}$ is the number of bedrooms.
  I $\epsilon_i$ is the random noise.

I Our (least square) estimate is

$(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) = (82.737, 2.854, -15.789)$.

Price (in $1000)  Size (in m²)  Bedroom
315               75            1
229               59            1
355               85            2
261               65            2
234               72            2
216               46            1
308               107           3
306               91            3
289               75            2
204               65            1
265               88            3
195               59            1
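With more than one independent variable there is no one-line formula, but the least square estimate still solves a small linear system (the normal equations). The sketch below is one possible way to compute it in plain Python; `solve` and `least_squares` are hypothetical helpers written for this example, and real statistical software uses more robust numerical routines.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
bedroom = [1, 1, 2, 2, 2, 1, 3, 3, 2, 1, 3, 1]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    k = len(m)
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))) / m[r][r]
    return x

def least_squares(columns, y):
    """Return [b0, b1, ..., bk] minimizing the SSE via the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

b0, b1, b2 = least_squares([size, bedroom], price)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```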



Interpretations

I Our regression model is

y = 82.737 + 2.854x1 − 15.789x2.

I When the house size increases by 1 m² (and all other independent variables are fixed), we expect the price to increase by $2,854.

I When there is one more bedroom (and all other independent variables are fixed), we expect the price to decrease by $15,789.

I One must interpret the results and determine whether the result is meaningful by herself/himself.
  I The number of bedrooms may not be a good indicator of house price.
  I At least not in a linear way.

I We need more than finding coefficients:
  I We need to judge the overall quality of a given regression model.
  I We may want to compare multiple regression models.
  I We must test the significance of regression coefficients.



Model validation: How good is a model?

I How to measure the quality of a model?
  I For the model y = 102.717 + 2.192x, how good is it?
  I In general, for a given regression model $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, how may we evaluate its overall quality?

I The sum of squared total errors (SST), $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$, is the SSE of the worst model, which estimates every $y_i$ by the sample mean $\bar{y}$.

I With our regression model, the sum of squared errors (SSE) is

$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2$.

I The proportion of total variability that is explained by the regression model is

$0 \le R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \le 1$.

The larger $R^2$, the better the regression model.



Obtaining R2

I Whenever we find the estimated coefficients, we have $R^2$.

I Statistical software includes $R^2$ in the regression report.

I For the regression model y = 102.717 + 2.192x, we have $R^2 = 0.5315$:
  I Around 53% of the variability in house prices is explained by house size.

I If (and only if) there is only one independent variable, then $R^2 = r^2$, where r is the correlation coefficient between the dependent and independent variables.
  I $-1 \le r \le 1$.
  I $0 \le r^2 = R^2 \le 1$.
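The definition of R² and its relation to r can be verified on the house data. This is a plain-Python sketch with illustrative variable names.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price))
sxx = sum((x - x_bar) ** 2 for x in size)

# Fitted simple regression line.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))  # errors of our model
sst = sum((y - y_bar) ** 2 for y in price)                        # errors of the mean-only model
r_squared = 1 - sse / sst

# Correlation coefficient between size and price.
r = sxy / (sxx * sst) ** 0.5
print(round(r_squared, 4), round(r ** 2, 4))
```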



Comparing regression models

I Now we have a way to compare regression models.

I For our example:

      Size only  Bedroom only  Size and bedroom
R²    0.5315     0.29          0.5513

  I Using sizes only is better than using numbers of bedrooms only.
  I Is using sizes and bedrooms better?

I In general, adding more variables never decreases R²!
  I In the worst case, we may set the corresponding coefficients to 0.
  I Some variables may actually be meaningless.

I To perform a “fair” comparison and identify those meaningful factors, we need to adjust R² based on the number of independent variables.



Adjusted R2

I The standard way to adjust $R^2$ to adjusted $R^2$ is

$R^2_{\mathrm{adj}} = 1 - \left(\frac{n-1}{n-k-1}\right)(1 - R^2)$.

  I n is the sample size and k is the number of independent variables used.

I For our example:

        Size only  Bedroom only  Size and bedroom
R²      0.5315     0.290         0.5513
R²adj   0.4846     0.219         0.4516

I Actually using sizes only results in the best model!
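The adjustment is simple arithmetic on R², n, and k. The sketch below starts from the R² values reported in the table (rounded as shown there); the function name `adjusted_r2` and the dictionary layout are illustrative choices.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for sample size n and k independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 12  # twelve houses
models = {
    "size only": (0.5315, 1),
    "bedroom only": (0.290, 1),
    "size and bedroom": (0.5513, 2),
}

adj = {name: adjusted_r2(r2, n, k) for name, (r2, k) in models.items()}
for name, value in adj.items():
    print(name, round(value, 4))

best = max(adj, key=adj.get)
print("best by adjusted R^2:", best)
```

Adding the bedroom variable raises R² slightly, but the penalty for the extra variable outweighs the gain, so the adjusted value drops.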



Testing coefficient significance

I Another important task for validating a regression model is to test the significance of each coefficient.

I Recall our model with two independent variables

$y = 82.737 + 2.854 x_1 - 15.789 x_2$.

I Note that 2.854 and −15.789 are solely calculated based on the sample. We never know whether $\beta_1$ and $\beta_2$ are really these two values!

I In fact, we cannot even be sure that $\beta_1$ and $\beta_2$ are not 0. We need to test them:

$H_0 : \beta_i = 0$

$H_a : \beta_i \ne 0$.

I We look for strong enough evidence showing that $\beta_i \ne 0$.




I The testing results are provided in regression reports.

I Statistical software (e.g., R) tells us:

           Coefficients  Standard Error  t Stat   p-value
Intercept  82.737        59.873          1.382    0.200
Size       2.854         1.247           2.289    0.048 **
Bedroom    −15.789       25.056          −0.630   0.544

  I As we have no idea about the population variance, we apply the t test.
  I “Coefficients” records the estimates, which play the role of the sample mean $\bar{x}$; “Standard Error” plays the role of $\frac{S}{\sqrt{n}}$; “t Stat” is the coefficient divided by its standard error, which plays the role of $T = \frac{\bar{x} - 0}{S / \sqrt{n}}$.
  I “p-value” records the tail probabilities of T multiplied by 2 (done by most software). Simply compare them with α!

I Recall the assumption that $\epsilon_i$ is normal!




I Statistical software tells us:

           Coefficients  Standard Error  t Stat   p-value
Intercept  82.737        59.873          1.382    0.200
Size       2.854         1.247           2.289    0.048 **
Bedroom    −15.789       25.056          −0.630   0.544

  I At a 95% confidence level, we believe that $\beta_1 \ne 0$. House size really has some impact on house price.
  I At a 95% confidence level, we have no evidence for $\beta_2 \ne 0$. We cannot conclude that the number of bedrooms has an impact on house price.
  I If we use only size as an independent variable, its p-value will be 0.00714. We will be quite confident that it has an impact.



Road map








House age

I The age of a house may also affect its price.

Price (in $1000)  Size (in m²)  Bedroom  Age (in years)
315               75            1        16
229               59            1        20
355               85            2        16
261               65            2        15
234               72            2        21
216               46            1        16
308               107           3        15
306               91            3        15
289               75            2        14
204               65            1        21
265               88            3        15
195               59            1        26

I Let’s add age as an independent variable in explaining house prices.
  I Because the number of bedrooms seems to be unhelpful, let’s ignore it.



House age

I For house i, let $y_i$ be its price, $x_{1,i}$ be its size, and $x_{3,i}$ be its age. We assume the following linear relationship:

$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{3,i} + \epsilon_i$.

I Software gives us the following regression report:

           Coefficients  Standard Error  t Stat   p-value
Intercept  262.882       83.632          3.143    0.012
Size       1.533         0.628           2.443    0.037 **
Age        −6.368        2.881           −2.211   0.054 *

$R^2 = 0.696$, $R^2_{\mathrm{adj}} = 0.629$

I $R^2_{\mathrm{adj}}$ goes up from 0.485 (size only) to 0.629. Age is significant at a 10% significance level. Seems good!



“Nonlinear” relationship

I May we do better?

I By looking at the age-price scatter plot (and our intuition), maybe the impact of age on price is “nonlinear”:
  I A new house’s value depreciates fast.
  I The value depreciates slowly when the house is old.
  I At least this is true for a car.

I It is worthwhile to try to capture this nonlinear relationship.

I For example, we may try to replace house age by its reciprocal:

$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \left(\frac{1}{x_{3,i}}\right) + \epsilon_i$.



Variable transformation

I To fit

$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \left(\frac{1}{x_{3,i}}\right) + \epsilon_i$

to our sample data:
  I Prepare a new column as 1/age.
  I Input these three columns to software.
  I Read the report.

I We may consider any kind of nonlinear relationship.

I This technique is called variable transformation.

Price (in $1000)  Size (in m²)  1/Age (in 1/years)
315               75            0.063
229               59            0.05
355               85            0.063
261               65            0.067
234               72            0.048
216               46            0.063
308               107           0.067
306               91            0.067
289               75            0.071
204               65            0.048
265               88            0.067
195               59            0.038
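The three steps above can be sketched in code: build the transformed column, fit by least squares, and compare with the untransformed fit. The helper `fit_two` is a hypothetical function written for this illustration; it uses the closed-form solution for exactly two independent variables and works with unrounded 1/age values.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
age = [16, 20, 16, 15, 21, 16, 15, 15, 14, 21, 15, 26]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

# Variable transformation: the new column is the reciprocal of age.
inv_age = [1 / a for a in age]

def fit_two(x1, x2, y):
    """Least square fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, R^2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((w - (b0 + b1 * u + b2 * v)) ** 2 for u, v, w in zip(x1, x2, y))
    sst = sum((w - my) ** 2 for w in y)
    return b0, b1, b2, 1 - sse / sst

b0, b1, b2, r2_inv = fit_two(size, inv_age, price)   # size and 1/age
_, _, _, r2_age = fit_two(size, age, price)          # size and age, for comparison
print(round(b0, 3), round(b1, 3), round(b2, 3))
print(round(r2_inv, 3), round(r2_age, 3))
```

Because both models use two independent variables, their R² values can be compared directly.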



The reciprocal of house age

I Software gives us the following regression report:

            Coefficients  Std. Error  t Stat  p-value
Intercept   22.905        57.154      0.401   0.698
Size        1.524         0.647       2.356   0.043   **
1/Age       2185.575      1044.497    2.092   0.066   *

$R^2 = 0.685$, $R^2_{\text{adj}} = 0.615$

I Validation:
  I Both variables are significant (at different significance levels).
  I Using size and age (rather than 1/age) better explains house price, at least for the given sample data: $R^2_{\text{adj}}$ is 0.629 with age versus 0.615 with 1/age.

I The intuition that house value depreciates at different speeds is not supported by the data.

I Changing 1/age to age² also does not help.



Typical ways of variable transformation



Variable selection and model building

I In general, we may have a lot of candidate independent variables.
  I Size, number of bedrooms, age, distance to a park, distance to a hospital, safety in the neighborhood, etc.
  I If we consider only linear relationships, for $p$ candidate independent variables, we have $2^p - 1$ combinations.
  I For each variable, we have many ways to transform it.
  I In the next lecture, we will introduce the way of modeling interaction among independent variables.

I How to find the “best” regression model (if there is one)?
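To make the $2^p - 1$ count concrete, here is a small R sketch that lists every non-empty subset of three candidate variables and writes each one as a model formula. The variable names are illustrative placeholders.

```r
# Candidate independent variables (names chosen for illustration).
candidates <- c("size", "bedroom", "age")
p <- length(candidates)

# All non-empty subsets: choose k variables for k = 1, ..., p.
subsets <- unlist(
  lapply(seq_len(p), function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Turn each subset into a model formula with price as the dependent variable.
formulas <- sapply(subsets, function(v) paste("price ~", paste(v, collapse = " + ")))
print(formulas)

# The number of candidate models equals 2^p - 1.
print(length(subsets))
```

With ten candidates the same loop would already produce 1023 models, which is why the suggestions on the next slide rely on judgment rather than brute force.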



Variable selection and model building

I There is no “best” model; there are “good” models.

I Some general suggestions:
  I Take each independent variable one at a time and observe the relationship between it and the dependent variable. A scatter plot helps. Use this to consider variable transformation.
  I For each pair of independent variables, check their relationship. If two are highly correlated, quite likely one is not needed.
  I Once a model is built, check the p-values. You may want to remove insignificant variables (but removing a variable may change the significance of other variables).
  I Go back and forth to try various combinations. Stop when a good enough one (with high $R^2$ and $R^2_{\text{adj}}$ and small p-values) is found.

I Software can somewhat automate the process, but its power is limited (e.g., it cannot decide transformation).
  I We may need to find new independent variables.

I Intuitions and experiences may help (or hurt).



Summary

I With a regression model, we try to identify how independent variables affect the dependent variable.
  I For a regression model, we adopt the least squares criterion for estimating the coefficients.

I Model validation:
  I The overall quality of a regression model is decided by its $R^2$ and $R^2_{\text{adj}}$.
  I We may test the significance of independent variables by their p-values.

I Model building:
  I Variable transformation.
  I Variable selection.



Case study: ticket selling

I A theater made hundreds of stage performances in the past six years.

I The owner hopes that statistics and data analysis may help her improve the ticket sales.

I Key question: What makes a show popular?
  I Popularity is defined as the number of tickets sold.
  I Potential factors: year, month, day, time, location, actors/actresses, drama type, ticket prices, etc.

I 100 performances are randomly drawn from the whole pool.
  I All were made during weekends.
  I Tickets were all publicly sold.
  I Tickets for all performances were sold through the same channels.
  I For each performance, the ticket price(s) remained the same.

I As a group of consultants, how may we help the theater?



Variables

I Six variables are obtained:

Variable       Meaning
Year           The year in which the performance was made
Time           Morning, afternoon, or evening
Capacity       The number of seats in the theater hall
AvgPrice       The average of all prices
SalesQty       The number of tickets sold
SalesDuration  Performance day − announcement day

I Labeling and scaling:
  I Years are labeled as 1, 2, ..., and 6 (6 means the last year).
  I Capacities and sales quantities have been scaled in the same proportion.



Data (incomplete)

Yr.  Tm.  Cap.  A.P.  Qty  S.D.  |  Yr.  Tm.  Cap.  A.P.  Qty  S.D.
5    A    230   400   218  50    |  2    M    190   575   190  289
5    A    150   500   119  46    |  6    A    130   500   108  89
5    A    230   400   160  126   |  4    E    200   775   169  100
5    A    200   775   200  324   |  4    E    200   775   135  259
6    E    190   1175  178  115   |  5    A    310   650   251  346
6    A    190   1175  183  109   |  2    A    250   550   250  145
5    E    190   775   161  58    |  1    A    190   675   183  254
3    A    200   675   200  112   |  6    A    200   1175  146  110
5    E    200   775   158  323   |  1    M    200   575   140  94
1    M    200   575   128  360   |  4    A    200   775   195  255



Regression

I To construct a regression model, we first consider quantitative independent variables.
  I Dependent variable: SalesQty.
  I Independent variables: Capacity, AvgPrice, Year.
  I Let’s ignore SalesDuration for a while.

I Note that Year is a quantitative variable.
  I The difference between two values makes sense: 4 − 2 and 5 − 3 both mean a difference of two years.
  I The values will keep increasing.
  I If we have a variable Month whose possible values are 1, 2, ..., and 12, the difference between 12 and 1 is ambiguous: 11 months or 1 month.

I Scatter plots help us consider:
  I Variable selection: Does a variable have an impact?
  I Transformation: What is a variable’s impact?
  I Multicollinearity: Are two variables highly correlated?



Regression

I It seems that Capacity, AvgPrice, and Year are all worth a try.

I Let’s put them into a regression model.

I If we do this one by one:
  I SalesQty = 20.79 + 0.72 Capacity: $R^2 = 0.538$, p-value ≈ 0.
  I SalesQty = 174.9 + 0.0028 AvgPrice: $R^2 = 0.0002$, p-value = 0.885.
  I SalesQty = 203.6 − 6.77 Year: $R^2 = 0.063$, p-value = 0.0115.

I If we include them together:
  I The regression model is

    SalesQty = 24.742 + 0.702 Capacity + 0.027 AvgPrice − 4.696 Year.

  I $R^2 = 0.57$, $R^2_{\text{adj}} = 0.556$; the p-values of Capacity, AvgPrice, and Year are 0, 0.056, and 0.019, respectively.

I Do not try independent variables separately; try them together. (AvgPrice looks useless alone, with a p-value of 0.885, but is significant at the 10% level once Capacity and Year are included.)



Adding Time into the model

I Time may also be an influential variable.

I However, it is qualitative.
  I More precisely, it is nominal.
  I Even if we label Time with numeric values, we cannot treat it as a quantitative variable and put it into a regression model.

I For each qualitative variable, we need to introduce several indicator variables to represent its values.



Numeric labeling does not work

I The variable Time has three values.
  I Morning, afternoon, and evening.
  I Why can’t we label them as 1, 2, and 3 and do regression?

I Suppose we label (morning, afternoon, evening) as (1, 2, 3):
  I The regression model is

    SalesQty = 164.021 + 6.313 Time.

  I Why is this wrong?



Numeric labeling does not work

I Different labeling gives different regression results.

I We may also label (morning, afternoon, evening) as (1, 2, 10) or (3, 1, 2):

Labeling (morning, afternoon, evening)  Regression model                   p-value
(1, 2, 3)                               SalesQty = 164.021 + 6.313 Time    0.294
(1, 2, 10)                              SalesQty = 177.224 − 0.075 Time    0.95
(3, 1, 2)                               SalesQty = 205.725 − 15.091 Time   0.0084
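The effect of relabeling can be tried out in R. The sketch below uses only the 20 performances printed in the “Data (incomplete)” table, so its slopes will not match the numbers above, which come from all 100 performances. The helper `slope_for()` and the vector names are made up for this illustration.

```r
# Show time (M = morning, A = afternoon, E = evening) and tickets sold
# for the 20 performances listed in the "Data (incomplete)" table.
tm  <- c("A", "A", "A", "A", "E", "A", "E", "A", "E", "M",
         "M", "A", "E", "E", "A", "A", "A", "A", "M", "A")
qty <- c(218, 119, 160, 200, 178, 183, 161, 200, 158, 128,
         190, 108, 169, 135, 251, 250, 183, 146, 140, 195)

# Slope of SalesQty on Time under a given numeric coding of (M, A, E).
slope_for <- function(coding) {
  x <- unname(coding[tm])          # translate each label into its number
  unname(coef(lm(qty ~ x))[2])     # fitted slope
}

s123  <- slope_for(c(M = 1, A = 2, E = 3))
s1210 <- slope_for(c(M = 1, A = 2, E = 10))
s312  <- slope_for(c(M = 3, A = 1, E = 2))

# Three arbitrary codings of the same data give three different slopes.
print(c(s123, s1210, s312))
```

Since the coding is arbitrary, none of these slopes has a meaningful interpretation, which is the point of the slide.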



Binary variables

I There is one exception: If a qualitative variable is binary, we may label the values as 0 and 1 and then treat it as quantitative.
  I Labeling values as 1 and 0, 1 and 2, or 7 and 8 is also good.
  I Labeling values as 1 and −1, 1 and 5, or 4 and 8 is bad.

I This is because a regression coefficient measures what happens to the dependent variable “when that independent variable increases by 1.”

I When the binary variable is labeled with 0 and 1, its regression coefficient $\beta_i$ tells us that “if the value changes from 0 to 1 (while all others remain the same), we expect the dependent variable to increase by $\beta_i$.”

I What if we have more than two values?



Indicator variables

I Consider a variable $x$ with three values A, B, and C.

I We first choose a reference level, say, A.

I We then manually create two indicator variables $x_B$ and $x_C$:

$$x_B = \begin{cases} 1 & \text{if } x = B \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad x_C = \begin{cases} 1 & \text{if } x = C \\ 0 & \text{otherwise.} \end{cases}$$

In other words, we have a mapping:

x   x_B  x_C
A   0    0
B   1    0
C   0    1



Indicator variables

I Lastly, we put $x_B$ and $x_C$ into a model to get

$$y = \beta_0 + \cdots + \beta_B x_B + \beta_C x_C.$$

I If $x$ changes from A to B (and all others remain the same), we expect the dependent variable to increase by $\beta_B$.

I If $x$ changes from A to C (and all others remain the same), we expect the dependent variable to increase by $\beta_C$.

I If $x$ changes from B to C (and all others remain the same), the report tells us nothing about whether the difference is significant.

I We use $x$ to divide the data into three groups A, B, and C.

I We are asking, after removing the impacts from other variables, whether there is a significant difference between groups A and B (or A and C).



Indicator variables in general

I Suppose a variable $x$ has five values M, N, O, P, and Q.
  I We first choose a reference level, say, P.
  I We then manually create four indicator variables:

x   x_M  x_N  x_O  x_Q
M   1    0    0    0
N   0    1    0    0
O   0    0    1    0
P   0    0    0    0
Q   0    0    0    1

  I Is there a significant difference between groups P and M, P and N, P and O, and P and Q?

I In general, for a variable with $k$ values, we need $k - 1$ indicator variables.
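R can build such a table of indicator variables automatically. The sketch below assumes a short made-up vector `x` with the five levels from this slide; `relevel()` and `model.matrix()` are base R functions.

```r
# A qualitative variable with five levels (values chosen for illustration).
x <- c("M", "N", "O", "P", "Q", "P", "M")

# Make P the reference level, then let R build the indicator columns.
f <- relevel(factor(x), ref = "P")
X <- model.matrix(~ f)
print(X)

# One intercept column plus k - 1 = 4 indicator columns.
print(ncol(X) - 1)
```

Rows whose value is the reference level P have zeros in all four indicator columns, matching the mapping table above.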



Adding indicator variables for Time

I Time has three values: morning, afternoon, and evening.

I Let’s choose afternoon as the reference level.

I We need two indicator variables:

Time       Time_M  Time_E
morning    1       0
afternoon  0       0
evening    0       1

I Using Time_M and Time_E as our independent variables, we get

  SalesQty = 191 − 30.069 Time_M − 16.303 Time_E,

where the p-values are 0.009 and 0.138, respectively.

I If a performance is rescheduled from afternoon to morning, we expect the sales to decrease by 30.069.
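The two indicator columns can also be created by hand. The following R sketch does so for a few made-up show times and then plugs them into the fitted equation above; the vector `show_time` is hypothetical.

```r
# A few hypothetical show times.
show_time <- c("morning", "afternoon", "evening", "afternoon", "evening")

# Afternoon is the reference level, so it gets 0 in both columns.
time_m <- ifelse(show_time == "morning", 1, 0)
time_e <- ifelse(show_time == "evening", 1, 0)
print(data.frame(show_time, time_m, time_e))

# Expected sales from SalesQty = 191 - 30.069 Time_M - 16.303 Time_E.
expected_sales <- 191 - 30.069 * time_m - 16.303 * time_e
print(expected_sales)
```

The three distinct values printed are the model's expected sales for morning, afternoon, and evening shows.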



Adding indicator variables for Time

I Let’s include Capacity, AvgPrice, Year, Time_M, and Time_E:

  SalesQty = 39.280 + 0.696 Capacity + 0.027 AvgPrice − 5.282 Year − 14.387 Time_M − 21.328 Time_E.

            Coefficients  Std. Error  t Stat   p-value
Intercept   39.280        19.724      1.992    0.049   **
Capacity    0.696         0.069       10.263   0.000   ***
AvgPrice    0.027         0.013       2.033    0.045   **
Year        −5.282        1.931       −2.735   0.007   ***
Time_M      −14.387       7.784       −1.848   0.068   *
Time_E      −21.328       7.227       −2.951   0.004   ***

$R^2 = 0.608$, $R^2_{\text{adj}} = 0.587$
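As an illustration of how the fitted model might be used for prediction, the R sketch below wraps the reported coefficients in a small function. The function name `predict_sales()` and the example show (capacity 200, average price 775, year 4) are made up here.

```r
# Coefficients copied from the regression report above.
coefs <- c(intercept = 39.280, capacity = 0.696, avg_price = 0.027,
           year = -5.282, time_m = -14.387, time_e = -21.328)

# Expected SalesQty for one show; time is "morning", "afternoon", or "evening".
predict_sales <- function(capacity, avg_price, year, time) {
  time_m <- as.numeric(time == "morning")   # indicator for morning
  time_e <- as.numeric(time == "evening")   # indicator for evening
  unname(coefs["intercept"] +
           coefs["capacity"] * capacity +
           coefs["avg_price"] * avg_price +
           coefs["year"] * year +
           coefs["time_m"] * time_m +
           coefs["time_e"] * time_e)
}

print(predict_sales(200, 775, 4, "afternoon"))  # 178.277
print(predict_sales(200, 775, 4, "evening"))    # 156.949
```

The difference between the two predictions is exactly the Time_E coefficient, 21.328.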



Summary

I When an independent variable is qualitative, we need to introduce indicator variables.
  I An indicator variable is either 0 or 1.

I If the qualitative variable has $k$ possible values, we need $k - 1$ indicator variables.
  I For the reference level, all indicator variables are 0.
  I For each other level, exactly one indicator variable is 1.

I We are only testing the differences between the reference level and other levels.
  I We have no idea about the difference between two non-reference levels.
  I We may change the reference level.

I As long as one indicator variable is significant, all other indicator variables for the same qualitative variable should be kept.



Interaction among variables

I In a regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,$$

$\beta_i$ measures how $x_i$ affects $y$.

I Sometimes the impact of $x_i$ on $y$ depends on the value of another variable $x_j$.

I Consider house prices, sizes, and numbers of bedrooms.
  I When a house is big, more bedrooms may be good.
  I When a house is small, more bedrooms may be bad.

I Consider the demand of a product.
  I Demand is sensitive to price: When price goes up, demand goes down.
  I The sensitivity may be different between men and women.

I In this case, we say there is an interaction between $x_i$ and $x_j$.



Modeling interaction

I To model the interaction between $x_i$ and $x_j$, one possibility is to create a new variable $x_i x_j$, which is the product of the two original variables.

I In a regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{1,2} x_1 x_2 + \cdots,$$

$\beta_{1,2}$ measures the interaction between $x_1$ and $x_2$.
  I The impact of $x_1$ on $y$ is $\beta_1 + \beta_{1,2} x_2$.
  I The impact of $x_2$ on $y$ is $\beta_2 + \beta_{1,2} x_1$.

I A quadratic term $x_i^2$ in a regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_1' x_1^2 + \cdots$$

is a special case: The impact of $x_1$ on $y$ depends on the value of $x_1$.



Interaction between Time and AvgPrice

I Do Time and AvgPrice affect each other’s impact?

I Let’s add Time_M × AvgPrice and Time_E × AvgPrice into our model:

                   Coefficients  Std. Error  t Stat   p-value
Intercept          55.876        22.652      2.467    0.015   **
Capacity           0.676         0.068       9.950    0.000   ***
Year               −6.192        1.966       −3.149   0.002   ***
Time_M             −55.205       23.829      −2.317   0.023   **
Time_E             −19.105       21.81       −0.876   0.383
AvgPrice           0.015         0.019       0.836    0.405
Time_M × AvgPrice  0.054         0.030       1.792    0.076   *
Time_E × AvgPrice  −0.004        0.030       −0.136   0.892

$R^2 = 0.624$, $R^2_{\text{adj}} = 0.595$

I If we want to keep Time_E × AvgPrice, we must also keep Time_M × AvgPrice, AvgPrice, Time_M, and Time_E in our model.



Time affects AvgPrice’s impact

I Let’s focus on Time and AvgPrice:

                   Coefficients  Std. Error  t Stat   p-value
Time_M             −55.205       23.829      −2.317   0.023   **
Time_E             −19.105       21.81       −0.876   0.383
AvgPrice           0.015         0.019       0.836    0.405
Time_M × AvgPrice  0.054         0.030       1.792    0.076   *
Time_E × AvgPrice  −0.004        0.030       −0.136   0.892

I People have different price sensitivity for shows at different times. When the price goes up by $1, we expect:
  I The sales of an afternoon show to increase by 0.015.
  I The sales of a morning show to increase by 0.015 + 0.054 = 0.069.
  I The sales of an evening show to increase by 0.015 − 0.004 = 0.011.



AvgPrice affects Time’s impact

I Let’s focus on Time and AvgPrice:

                   Coefficients  Std. Error  t Stat   p-value
Time_M             −55.205       23.829      −2.317   0.023   **
Time_E             −19.105       21.81       −0.876   0.383
AvgPrice           0.015         0.019       0.836    0.405
Time_M × AvgPrice  0.054         0.030       1.792    0.076   *
Time_E × AvgPrice  −0.004        0.030       −0.136   0.892

I If we reschedule an afternoon show to the morning, the impact is

  −55.205 + 0.054 AvgPrice

in expectation. If AvgPrice = 500, e.g., we expect the sales to change by −55.205 + 0.054 × 500 = −28.205, i.e., to decrease by 28.205.

I If we reschedule an afternoon show to the evening, the expected impact is −19.105 − 0.004 AvgPrice.
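The two impact formulas can be evaluated for any price with a couple of one-line R functions. The function and variable names below are made up for illustration; the coefficients are copied from the report.

```r
# Coefficients from the interaction model.
b_time_m  <- -55.205   # Time_M
b_time_e  <- -19.105   # Time_E
b_m_price <-   0.054   # Time_M x AvgPrice
b_e_price <-  -0.004   # Time_E x AvgPrice

# Expected change in SalesQty when an afternoon show is rescheduled.
impact_morning <- function(avg_price) b_time_m + b_m_price * avg_price
impact_evening <- function(avg_price) b_time_e + b_e_price * avg_price

print(impact_morning(500))   # -28.205
print(impact_evening(500))   # -21.105

# Average price at which moving to the morning is expected to make no difference.
break_even_price <- -b_time_m / b_m_price
print(break_even_price)
```

According to these estimates, the morning penalty shrinks as the price rises, while the evening penalty grows slightly.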



Interaction between Time and Year

I Do Time and Year affect each other’s impact?

               Coefficients  Std. Error  t Stat   p-value
(Intercept)    39.597        22.31       1.775    0.079   *
Capacity       0.693         0.068       10.267   0.000   ***
AvgPrice       0.024         0.013       1.799    0.075   *
Time_E         −2.696        18.562      −0.145   0.885
Time_M         −25.114       18.303      −1.372   0.173
Year           −4.703        2.944       −1.597   0.114
Time_E × Year  −4.841        4.302       −1.125   0.263
Time_M × Year  2.898         4.166       0.695    0.489

$R^2 = 0.620$, $R^2_{\text{adj}} = 0.591$

I All five variables related to Time and Year are insignificant.
  I People’s preference over the show time does not change from year to year.
  I The trend from year to year is the same for different show times.

I Though all five variables are insignificant, we typically first try to remove only the interaction terms.



Summary

I Two variables’ interaction may be modeled with a product term.
  I If its coefficient is significantly nonzero, one variable’s impact depends on the other’s value.

I Three rules for keeping variables:
  I Quadratic transformation: If we want to keep $x^2$, we must also keep $x$.
  I Indicator variable: If we want to keep $x^{k'}$, where $x^{k'}$ is the indicator variable representing $x = k'$, we must also keep $x^k$ for all $k \neq k'$.
  I Interaction: If we want to keep $x_i x_j$, we must also keep $x_i$ and $x_j$.

I Therefore:
  I If we want to have $x_i x_j^{k'}$, where $x_j^{k'}$ is the indicator variable representing $x_j = k'$, we must also keep $x_j^k$ for all $k \neq k'$.

I It is possible to add $x_i x_j x_k$ into a regression model.



SalesDuration

I Consider the variable SalesDuration.
  I The difference between the announcement day and the performance day.
  I The number of days that the tickets for a show are publicly sold.
  I The longer the sales duration, the more sales?

I We probably want to add SalesDuration into our regression model.

I This is problematic in this case:
  I Typically the theater determines its schedule for the next year at the end of each year.
  I Most performances are scheduled.
  I Ticket selling starts a few months before a series of shows is performed.
  I However, if a series turns out to be popular, the theater may decide to add more shows into this series.
  I These additional shows have much shorter SalesDuration and typically have high SalesQty.

I In short, SalesQty affects SalesDuration.



Endogeneity

I If in a regression model an independent variable is affected by the dependent variable, we say the model has the endogeneity problem.
  I If we add SalesDuration into our model, we create endogeneity.
  I Year, Time, Capacity, and AvgPrice do not have the endogeneity problem.
  I If any of them may be modified when the theater sees good (or bad) sales, endogeneity emerges.

I Endogeneity results in a biased prediction.

I In our ticket selling example, if we add SalesDuration into our model, the model may mislead us into intentionally announcing shows later!



Example: promotional phone calls

I A bank lets its workers call people to invite them to deposit money.

I Many factors may affect the outcome (success or not):¹
  I The callee’s gender, age, occupation, education level, etc.
  I The caller’s gender, age, experience, etc.
  I The calling day, calling time, weather at the call, etc.

I All this information from past calls is recorded.

I The length of each call is also recorded.
  I It is found to be highly correlated with success/failure.
  I However, it cannot be used as an independent variable.
  I This is because it is affected by the outcome: Once one agrees to deposit money, the call gets longer to talk about more details.

I In this example, if we add call duration into our model, the model may mislead us into asking our workers to speak as slowly as possible.

¹ A regression model that incorporates a qualitative dependent variable will be introduced in later lectures.



Avoiding endogeneity

I To avoid endogeneity:
  I Remove the independent variable that is endogenous.
  I Remove those records in which an independent variable is affected by the dependent one.

I In the ticket selling example:
  I We may remove SalesDuration.
  I We may remove those additional shows.

I In the promotional call example:
  I We may remove the variable of call duration.



Introduction

I When doing regression:
  I We try to discover the hidden relationship among variables.
  I We assume a specific model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \varepsilon$$

    and then fit our sample data to the model.
  I We validate our model based on the degree of fitness ($R^2$ and $R^2_{\text{adj}}$) and significance of variables (p-values).

I If our model is good, the random error $\varepsilon$ should be really “random.”
  I There should be no systematic pattern for $\varepsilon$.

I We need residual analysis.



Residuals

I Consider a pair of variables x and y.

I We may assume a linear relationship

y = β0 + β1x+ ε

for some unknown parameters β0 and β1. ε is the random error.

I Four assumptions on the random error:
  I Zero mean: The expected value of ε is zero for any value of x.
  I Constant variance: The variance of ε is the same for any value of x.
  I Independence: ε for different values of x should be independent.
  I Normality: ε is normal for any value of x.

I Once we obtain a regression model, we need to test these assumptions.
  I To predict: We need the first three.
  I To explain: We need all four.



Testing the four assumptions

I Consider a sample data set $\{(x_i, y_i)\}_{i=1,\dots,n}$.

I Linear regression helps us find estimates $\hat\beta_0$ and $\hat\beta_1$ based on the sample data and obtain the regression formula

$$y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\varepsilon_i,$$

in which the error term $\hat\varepsilon_i$ is called the residual between our estimate $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ and the real value $y_i$.

I By conducting a residual analysis, we check these residuals to see if we have the desired properties.

I While there are rigorous statistical tests, we will only introduce some graphical approaches.



The residual plot and histogram

I We may plot the residuals against the $x_i$’s to form a residual plot.
  I This tests zero mean, constant variance, and independence.
  I There should be no systematic pattern.

I We may construct a histogram of residuals.
  I This tests normality.
  I The histogram should be symmetric and bell-shaped.

I In general:
  I A “good” plot does not guarantee a good model.
  I A “bad” plot strongly suggests that the model is bad!




I Consider the artificial data set as an example.

I There is no pattern in the residual plot: good!




I Consider the artificial data set as an example.

I The histogram is symmetric and bell-shaped: good!



Residual plots that pass and fail the tests



Histograms that pass and fail the tests



Residual analysis for multiple regression

I Suppose that we construct a multiple regression model

$$y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i} + \varepsilon_i.$$

I We still use residual plots and a histogram to test the assumptions.

I Multiple residual plots should be depicted.
  I The vertical axis is always for the residuals.
  I The horizontal axis is for a function of $(x_1, x_2, \ldots, x_p)$.
  I E.g., the $k$th independent variable $x_k$ alone.
  I E.g., the fitted value $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{1,i} + \cdots + \hat\beta_p x_{p,i}$.
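A minimal R sketch of these graphical checks is given below. It uses simulated data (not any data set from the slides), so the variable names, sample size, and true coefficients are all assumptions made for illustration.

```r
# Simulate an artificial data set that satisfies the four assumptions.
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 3 + 2 * x1 - x2 + rnorm(n, sd = 1)

# Fit a multiple regression model and extract residuals and fitted values.
fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Residual plots: residuals against the fitted values and against x1 alone.
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)
plot(x1, res, xlab = "x1", ylab = "Residual")
abline(h = 0, lty = 2)

# Histogram of residuals: should look roughly symmetric and bell-shaped.
hist(res, main = "Histogram of residuals", xlab = "Residual")
```

Because the simulated errors are independent normal draws with constant variance, none of the plots should show a systematic pattern.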



Logistic regression

I So far our regression models always have a quantitative variable as the dependent variable.
  I Some people call this type of regression ordinary regression.

I To have a qualitative variable as the dependent variable, ordinary regression does not work.

I One popular remedy is to use logistic regression.
  I In general, a logistic regression model allows the dependent variable to have multiple levels.
  I We will only consider binary variables in this lecture.

I Let’s first illustrate why ordinary regression fails when the dependent variable is binary.



Example: survival probability

I 45 persons got trapped in a storm during a mountain hike. Unfortunately, some of them died due to the storm.²

I We want to study how the survival probability of a person is affected by her/his gender and age.

Age  Gender  Survived  |  Age  Gender  Survived  |  Age  Gender  Survived
23   Male    No        |  23   Female  Yes       |  15   Male    No
40   Female  Yes       |  28   Male    Yes       |  50   Female  No
40   Male    Yes       |  15   Female  Yes       |  21   Female  Yes
30   Male    No        |  47   Female  No        |  25   Male    No
28   Male    No        |  57   Male    No        |  46   Male    Yes
40   Male    No        |  20   Female  Yes       |  32   Female  Yes
45   Female  No        |  18   Male    Yes       |  30   Male    No
62   Male    No        |  25   Male    No        |  25   Male    No
65   Male    No        |  60   Male    No        |  25   Male    No
45   Female  No        |  25   Male    Yes       |  25   Male    No
25   Female  No        |  20   Male    Yes       |  30   Male    No
28   Male    Yes       |  32   Male    Yes       |  35   Male    No
28   Male    No        |  32   Female  Yes       |  23   Male    Yes
23   Male    No        |  24   Female  Yes       |  24   Male    No
22   Female  Yes       |  30   Male    Yes       |  25   Female  Yes

² The data set comes from the textbook The Statistical Sleuth by Ramsey and Schafer. The story has been modified.



Descriptive statistics

I Overall survival probability is 20/45 = 44.4%.

I Survival or not seems to be affected by gender.

Group   Survivals  Group size  Survival probability
Male    10         30          33.3%
Female  10         15          66.7%

I Survival or not seems to be affected by age.

Age class  Survivals  Group size  Survival probability
[10, 20)   2          3           66.7%
[20, 30)   11         22          50.0%
[30, 40)   4          8           50.0%
[40, 50)   3          7           42.9%
[50, 60)   0          2           0.0%
[60, 70)   0          3           0.0%

I May we do better? May we predict one’s survival probability?
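The group counts above can be reproduced from the 45 records with a few lines of R. In this sketch the records are retyped from the data table (reading it column block by column block); the coding `female = 1` and `survived = 1` and all object names are choices made here.

```r
# The 45 records from the data table.
age <- c(23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28, 28, 23, 22,
         23, 28, 15, 47, 57, 20, 18, 25, 60, 25, 20, 32, 32, 24, 30,
         15, 50, 21, 25, 46, 32, 30, 25, 25, 25, 30, 35, 23, 24, 25)
female <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
            1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
            0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
survived <- c(0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
              1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
              0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1)
d <- data.frame(age, female, survived)

# Overall survival proportion.
print(mean(d$survived))

# Survival proportion by gender (0 = male, 1 = female).
print(tapply(d$survived, d$female, mean))

# Survival proportion by ten-year age class: [10,20), [20,30), ...
age_class <- cut(d$age, breaks = seq(10, 70, by = 10), right = FALSE)
print(table(age_class))
print(tapply(d$survived, age_class, mean))
```

Setting `right = FALSE` makes each class include its lower bound and exclude its upper bound, so a 20-year-old falls into [20,30).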



Ordinary regression is problematic

I Immediately we may want to construct a linear regression model

$$\text{survival}_i = \beta_0 + \beta_1 \text{age}_i + \beta_2 \text{female}_i + \varepsilon_i,$$

where age is one’s age, female is 0 if the person is male or 1 if female, and survival is 1 if the person survived or 0 if not.

I By conducting ordinary regression, we may obtain the regression line

  survival = 0.746 − 0.013 age + 0.319 female.

Though $R^2 = 0.1642$ is low, both variables are significant.



Ordinary regression is problematic

I The regression model gives us “predicted survival probability.”
  I For a man at 80, the “probability” becomes 0.746 − 0.013 × 80 = −0.294, which is unrealistic.

I In general, it is very easy for an ordinary regression model to generate predicted “probability” not within 0 and 1.



Logistic regression

I The right way is to do logistic regression.

I Consider the age-survival example.
  I We still believe that a smaller age increases the survival probability.
  I However, not in a linear way.
  I It should be that when one is young enough, being younger does not help too much.
  I The marginal benefit of being younger should be decreasing.
  I The marginal loss of being older should also be decreasing.

I One particular functional form that exhibits this property is

$$y = \frac{e^x}{1 + e^x} \quad \Leftrightarrow \quad \log\left(\frac{y}{1 - y}\right) = x.$$

  I $x$ can be anything in $(-\infty, \infty)$.
  I $y$ always lies strictly between 0 and 1.
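The functional form and its inverse are easy to try out in R. In the sketch below, `logistic()` and `logit()` are names defined here for illustration; base R offers the same two functions as `plogis()` and `qlogis()`.

```r
# The logistic function maps any real number into (0, 1).
logistic <- function(x) exp(x) / (1 + exp(x))

# Its inverse, the log-odds (logit) function.
logit <- function(y) log(y / (1 - y))

x <- c(-5, -1, 0, 1, 5)
y <- logistic(x)
print(y)          # all values lie strictly between 0 and 1
print(logit(y))   # recovers x (up to rounding error)

# Base R equivalents.
print(plogis(x))
print(qlogis(y))
```

The curve is flat at both ends and steepest around x = 0, where y = 0.5, which is the decreasing marginal effect described above.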



Logistic regression

I We hypothesize that independent variables $x_i$’s affect $\pi$, the probability for $y$ to be 1, in the following form:³

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$

I By conducting logistic regression, we obtain the regression report.

I Some information is new, but the following is familiar:

         Estimate  Std. Error  z value  p-value
age      −0.078    0.037       −2.097   0.036   *
female   1.597     0.755       2.114    0.035   *

I Both variables are significant.

³ Numerical algorithms are used to search for coefficients to make the curve fit the given data points in the best way.



The Logistic regression curve

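I The estimated curve is

$$\log\left(\frac{\pi}{1 - \pi}\right) = 1.633 - 0.078\,\text{age} + 1.597\,\text{female},$$

or equivalently,

$$\pi = \frac{\exp(1.633 - 0.078\,\text{age} + 1.597\,\text{female})}{1 + \exp(1.633 - 0.078\,\text{age} + 1.597\,\text{female})},$$

where $\exp(z)$ means $e^z$ for all $z \in \mathbb{R}$.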



The Logistic regression curve

I The curves can be used to do prediction.

I For a man at 80, $\pi$ is

$$\frac{\exp(1.633 - 0.078 \times 80)}{1 + \exp(1.633 - 0.078 \times 80)},$$

which is 0.0097.

I For a woman at 60, $\pi$ is

$$\frac{\exp(1.633 - 0.078 \times 60 + 1.597)}{1 + \exp(1.633 - 0.078 \times 60 + 1.597)},$$

which is 0.1882.

I $\pi$ is always in [0, 1]. There is no problem for interpreting $\pi$ as a probability.
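These predictions can be checked numerically. The R sketch below uses the coefficients to five decimal places as printed in the R regression report in Part 4 of these slides (1.63312, −0.07820, 1.59729); with the three-decimal values shown in the formula the results differ slightly in the last digits. The function name `survival_prob()` is made up here.

```r
# Logistic regression coefficients as printed in the R report (Part 4).
b0       <-  1.63312
b_age    <- -0.07820
b_female <-  1.59729

# Predicted survival probability; female is 1 for a woman and 0 for a man.
survival_prob <- function(age, female) {
  plogis(b0 + b_age * age + b_female * female)   # exp(z) / (1 + exp(z))
}

print(survival_prob(80, 0))   # a man at 80: about 0.0097
print(survival_prob(60, 1))   # a woman at 60: about 0.188

# For comparison, the ordinary regression line gives an impossible value.
linear_pred <- 0.746 - 0.013 * 80 + 0.319 * 0
print(linear_pred)            # -0.294
```

Whatever age is plugged in, `plogis()` returns a value between 0 and 1, whereas the linear formula does not.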



Comparisons



Interpretations

I The estimated curve is

$$\log\left(\frac{\pi}{1 - \pi}\right) = 1.633 - 0.078\,\text{age} + 1.597\,\text{female}.$$

Any implication?
  I −0.078 age: Younger people are more likely to survive.
  I 1.597 female: Women are more likely to survive.

I In general:
  I Use the p-values to determine the significance of variables.
  I Use the signs of coefficients to give qualitative implications.
  I Use the formula to make predictions.



Model selection

I Recall that in ordinary regression, we use $R^2$ and adjusted $R^2$ to assess the usefulness of a model.

I In logistic regression, we do not have $R^2$ and adjusted $R^2$.

I We have deviance instead.
  I In a regression report, the null deviance can be considered as the total estimation errors without using any independent variable.
  I The residual deviance can be considered as the total estimation errors by using the selected independent variables.
  I Ideally, the residual deviance should be small.⁴

⁴ To be more rigorous, the residual deviance should also be close to its degrees of freedom. This is beyond the scope of this course.



Deviances in the regression report

I The null and residual deviances are provided in the regression report.

I For glm(d$survival ~ d$age + d$female, binomial), we have

Null deviance: 61.827 on 44 degrees of freedom

Residual deviance: 51.256 on 42 degrees of freedom

I Let’s try some models:

Independent variable(s)     Null deviance  Residual deviance
age                         61.827         56.291
female                      61.827         57.286
age, female                 61.827         51.256
age, female, age × female   61.827         47.346

I Using age only is better than using female only.

I How to compare models with different numbers of variables?


Deviances in the regression report

I Adding variables will always reduce the residual deviance.

I To take the number of variables into consideration, we may use the Akaike Information Criterion (AIC). A smaller AIC is better.

I AIC is also included in the regression report:

Independent variable(s)     Null deviance  Residual deviance  AIC
age                         61.827         56.291             60.291
female                      61.827         57.286             61.291
age, female                 61.827         51.256             57.256
age, female, age × female   61.827         47.346             55.346

I AIC is only used to compare nested models.
  I Two models are nested if one’s variables form a subset of the other’s.
  I Model 4 is better than model 3 (based on their AICs).
  I Model 3 is better than either model 1 or model 2 (based on their AICs).
  I Models 1 and 2 cannot be compared (based on their AICs).
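For a logistic regression on individual 0/1 outcomes, the AIC printed by R equals the residual deviance plus twice the number of estimated coefficients (intercept included). The R sketch below applies that identity to the residual deviances in the table; the data frame `models` and its column names are made up here. Note that for the female-only model the identity gives 61.286, slightly different from the 61.291 printed in the table.

```r
# Residual deviances from the table and the number of coefficients
# (including the intercept) in each model.
models <- data.frame(
  model             = c("age", "female", "age + female", "age * female"),
  residual_deviance = c(56.291, 57.286, 51.256, 47.346),
  n_coef            = c(2, 2, 3, 4)
)

# AIC = residual deviance + 2 * (number of coefficients).
models$aic <- models$residual_deviance + 2 * models$n_coef
print(models)

# The model with the smallest AIC.
print(models$model[which.min(models$aic)])
```

The penalty of 2 per coefficient is what allows a model with more variables to be compared with a smaller one: the extra variable must lower the deviance by more than 2 to be worth keeping.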


R for Statistics Public bike rentals


Part 4: R for Statistics and Case Studies

Ling-Chieh Kung


September 4, 2016

R for Statistics and Case Studies 1 / 37 Ling-Chieh Kung (NTU IM)


Road map

I R for Statistics.

I Public bike rentals.



Let’s do statistics with R

I A wholesaler has 440 customers in Portugal:
  I 298 are “horeca”s (hotel/restaurant/cafe).
  I 142 are retailers.

I These customers are located in different regions:
  I Lisbon: 77.
  I Oporto: 47.
  I Others: 316.

I Data source: http://archive.ics.uci.edu/ml/datasets/Wholesale+customers.


Let’s do statistics with R

I The data:

Channel  Region  Fresh  Milk   Grocery  Frozen  D. & P.  Deli.
1        1       30624  7209   4897     18711   763      2876
1        1       11686  2154   6824     3527    592      697
...
2        3       14531  15488  30243    437     14841    1867

I The wholesaler records the annual amount each customer spends on six product categories:
  I Fresh, milk, grocery, frozen, detergents and paper, and delicatessen.
  I Amounts have been scaled to be based on “monetary unit.”

I Channel: hotel/restaurant/cafe = 1, retailer = 2.

I Region: Lisbon = 1, Oporto = 2, others = 3.



Loading the data

I Let’s put the data in “wholesale.csv”, separated by commas.

I We read the data into R:

W <- read.csv("wholesale.csv", header = TRUE)



Basic statistics

I The mean, median, max, and min expenditure on milk:

mean(W$Milk)

median(W$Milk)

max(W$Milk)

min(W$Milk)

I The sample standard deviation of expenditure on milk:

sd(W$Milk)

I Counting:

length(W[1, ])  # number of columns (variables)

length(W[, 1])  # number of rows (customers)



Basic statistics

I Correlation coefficient:

cor(W$Milk, W$Grocery)

I In fact, you may simply do:

W2 <- W[, 3:8]

cor(W2)

I 3:8 is a vector (3, 4, 5, 6, 7, 8).
I W[, 3:8] is the third to the eighth columns of W.
I cor(W2) is the correlation matrix for pairwise correlation coefficients among all columns of W2.



Basic graphs: Scatter plots

plot(W$Grocery, W$Fresh)

plot(W$Grocery, W$D_Paper)



Basic graphs: histograms

hist(W$Milk[which(W$Region == 1)])



Regression with R

I Let’s do regression with R and play with the public bike daily rentaldata set.

I First, let’s load the data:

B <- read.csv("bike day.csv", header = TRUE)

I Take a look at B:

head(B)

mean(B$cnt)

cor(B$cnt, B$temp)

hist(B$cnt)

I Try them!

pairs(B)

pairs(B[, 10:16])



Simple regression

I Let’s build a simple regression model by using the function lm():

fit <- lm(B$cnt ~ B$instant)

summary(fit)

  I Put the dependent variable before the ~ operator.
  I Put the independent variable after the ~ operator.

I We will obtain the regression report:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2392.9613 111.6133 21.44 <2e-16 ***

B$instant 5.7688 0.2642 21.84 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1507 on 729 degrees of freedom

Multiple R-squared: 0.3954, Adjusted R-squared: 0.3946

F-statistic: 476.8 on 1 and 729 DF, p-value: < 2.2e-16




Multiple regression

I Let’s add more variables using the + operator:

fit <- lm(B$cnt ~ B$instant + B$workingday + B$temp)

summary(fit)

I The regression report:

Coefficients:


(Intercept) -280.3863 138.8325 -2.02 0.0438 *

B$instant 5.0197 0.1925 26.07 <2e-16 ***

B$workingday 145.3731 86.5121 1.68 0.0933 .

B$temp 140.2238 5.4246 25.85 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1







Interaction

I Let’s consider interaction using the * operator:

fit <- lm(B$cnt ~ B$instant + B$workingday * B$temp)

summary(fit)


Coefficients:


(Intercept) -631.776 204.732 -3.086 0.00211 **

B$instant 5.026 0.192 26.183 < 2e-16 ***

B$workingday 675.120 243.232 2.776 0.00565 **

B$temp 157.912 9.323 16.938 < 2e-16 ***

B$workingday:B$temp -26.471 11.364 -2.329 0.02012 *

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
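In an R formula, `a * b` is shorthand for `a + b + a:b`, which is why the report above contains both main effects and the `B$workingday:B$temp` row. The small sketch below checks this with `terms()`; it uses bare variable names rather than the `B$` prefix, and no data are needed.

```r
# A formula using the * operator ...
f_star <- cnt ~ instant + workingday * temp

# ... and the same model written out term by term.
f_long <- cnt ~ instant + workingday + temp + workingday:temp

# Both formulas expand to the same list of terms.
print(attr(terms(f_star), "term.labels"))
print(attr(terms(f_long), "term.labels"))
```

Using `:` alone would add the product term without the two main effects, which would violate the rule that an interaction term must be kept together with its component variables.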







Qualitative variables

I Let’s add a non-binary qualitative variable (in a wrong way):

fit <- lm(B$cnt ~ B$instant + B$workingday * B$temp + B$season)

summary(fit)


Coefficients:


(Intercept) -628.7340 208.7156 -3.012 0.00268 **

B$instant 5.0324 0.2085 24.141 < 2e-16 ***

B$workingday 675.0576 243.3996 2.773 0.00569 **

B$temp 158.0409 9.4807 16.670 < 2e-16 ***

B$season -3.1710 41.5623 -0.076 0.93921


---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1







Qualitative variables

I To correctly include a qualitative variable, use the function factor():

fit <- lm(B$cnt ~ B$instant + B$workingday * B$temp + factor(B$season))

summary(fit)

  I factor() tells the R program to interpret those values as categories even if they are numbers.
  I If the values are already non-numeric, there is no need to use factor().

I Let’s read the regression report.




Qualitative variables

I The regression report:¹

Coefficients:


(Intercept) -749.4834 209.3085 -3.581 0.000366 ***

B$instant 5.1296 0.2015 25.459 < 2e-16 ***

B$workingday 632.4411 233.8650 2.704 0.007006 **

B$temp 146.5942 11.7999 12.423 < 2e-16 ***

factor(B$season)2 827.2798 143.1463 5.779 1.12e-08 ***

factor(B$season)3 142.7658 188.6595 0.757 0.449454

factor(B$season)4 272.6144 126.7112 2.151 0.031770 *


---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1




¹ To change the reference level, use relevel().
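A minimal sketch of `relevel()` on a made-up season vector is shown below. By default the first level of a factor is the reference level; `relevel()` moves the level named in `ref` to the front.

```r
# A hypothetical season variable coded 1 to 4.
season <- factor(c(1, 2, 3, 4, 2, 3))

# By default, level "1" comes first and serves as the reference level.
print(levels(season))

# Make season 3 the reference level instead.
season3 <- relevel(season, ref = "3")
print(levels(season3))   # "3" "1" "2" "4"
```

In a regression formula the same idea would be written as `relevel(factor(B$season), ref = "3")`, after which the reported indicator coefficients compare each season with season 3.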



Variable transformation

I To add temp²:

tempSq <- B$temp^2

fit <- lm(B$cnt ~ B$instant + B$workingday * (B$temp + tempSq))

summary(fit)


Coefficients:


(Intercept) -3313.2904 462.5027 -7.164 1.93e-12 ***

B$instant 4.7928 0.1874 25.576 < 2e-16 ***

B$workingday 1934.5264 578.2195 3.346 0.000863 ***

B$temp 482.5310 50.6541 9.526 < 2e-16 ***

tempSq -8.1197 1.2489 -6.501 1.48e-10 ***

B$workingday:B$temp -180.0186 62.5810 -2.877 0.004138 **

B$workingday:tempSq 3.9116 1.5382 2.543 0.011200 *

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


~
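As an alternative to creating tempSq first, the square can be written inside the formula with I(), which keeps ^ from being read as formula syntax. The sketch below checks on made-up data (the data frame toy is hypothetical, not the bike data) that both approaches give the same estimates.

```r
# Hypothetical data with a curved relationship between temp and cnt.
set.seed(2)
toy <- data.frame(temp = runif(200, 0, 35))
toy$cnt <- 500 + 300 * toy$temp - 6 * toy$temp^2 + rnorm(200, sd = 100)

# Option 1: precompute the squared term, as on the slide.
toy$tempSq <- toy$temp^2
fit_a <- lm(cnt ~ temp + tempSq, data = toy)

# Option 2: square inside the formula; I() protects ^ from formula syntax.
fit_b <- lm(cnt ~ temp + I(temp^2), data = toy)

# Same estimates; only the coefficient names differ.
same_estimates <- all.equal(unname(coef(fit_a)), unname(coef(fit_b)))
same_estimates
```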


Fitted values

I Once we execute

fit <- lm(B$cnt ~ B$instant + B$workingday)

the object fit contains more than the regression report.

I It contains the fitted values ŷ_i:

plot(predict(fit))
points(B$cnt, col = "red")

I plot() makes a scatter plot.

I points() adds points onto an existing scatter plot.

I col = "red" makes the points red.



Residuals

I We may also obtain residuals:

residuals(fit)

plot(residuals(fit))

hist(residuals(fit))
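
Fitted values and residuals are linked: each residual is the observed value minus the fitted value. A minimal sketch on made-up data (the data frame toy is hypothetical) that checks this relationship:

```r
# Hypothetical data: a noisy straight line.
set.seed(3)
toy <- data.frame(x = 1:50)
toy$y <- 2 + 0.5 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)

# residual = observed - fitted
manual_resid <- toy$y - fitted(fit)
matches <- all.equal(unname(manual_resid), unname(residuals(fit)))
matches

# With an intercept in the model, the residuals sum to (numerically) zero.
resid_sum <- sum(residuals(fit))
resid_sum
```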



Logistic regression in R

I To run logistic regression in R, all we need to do is to:
  I Replace lm() by glm().
  I Add binomial as the second argument (the family argument of glm()).

I Let’s load the survival data set:

d <- read.csv("survival.csv", header = TRUE)



Logistic regression in R

I By executing

fitRight <- glm(d$survival ~ d$age + d$female, binomial)

summary(fitRight)

we obtain the regression report.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.63312    1.11018   1.471   0.1413
d$age       -0.07820    0.03728  -2.097   0.0359 *
d$female     1.59729    0.75547   2.114   0.0345 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

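The coefficients of a logistic regression are on the log-odds scale. A minimal sketch of how they might be read, with the estimates typed in by hand from the report above; the example person (age 30, female = 1) and the assumption that age is measured in years are illustrative only.

```r
# Estimates copied by hand from the regression report above.
b0       <- 1.63312
b_age    <- -0.07820
b_female <- 1.59729

# Odds ratios: multiplicative change in the odds of survival.
or_age    <- exp(b_age)     # about 0.92 per one-unit increase in age
or_female <- exp(b_female)  # about 4.94 for female = 1 versus female = 0

# Predicted probability for a hypothetical person with age = 30 and female = 1.
log_odds <- b0 + b_age * 30 + b_female * 1
p_surv   <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds)); about 0.71
```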


Road map

I R for Statistics.

I Public bike rentals.



Public bike rental data set

I Recall our daily bike rental data set (in “bike day.csv”).
  I For each day in 2011 and 2012, we have the number of rentals of public bikes in Washington, D.C.
  I There are 731 rows representing the 731 days in the time horizon.

I Data source: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset.

I There are sixteen columns as explained below.


The sixteen variables

I (Serial number) instant: a serial number from 1 to 731.

I (Date information) date, year, season, month: the labels of that date.

I (Working information) holiday, weekday, workingday:
  I holiday is 1 if that day is a national holiday not in a weekend and 0 otherwise;
  I weekday labels whether it is Sunday (labeled as 0), Monday (labeled as 1), ..., or Saturday (labeled as 6);
  I workingday is 1 if that day is a working day (neither a weekend nor a holiday) and 0 otherwise.



The sixteen variables

I (Weather information) Five attributes are recorded in this category:
  I weathersit (weather situation): 1 for sunny or partly cloudy, 2 for misty and cloudy, 3 for light snow or light rain, and 4 for heavy snow or thunderstorm.
  I temp (temperature) and atemp (apparent temperature): the daily average of temperature and apparent temperature (in Celsius), respectively.
  I humidity: the daily average of the humidity (in %).
  I windspeed (wind speed): the daily average of the wind speed (in knots; 1 knot is around 1.852 km/h).

I (Rental data) casual, registered, cnt:
  I casual is the number of rentals made by unregistered users.
  I registered is the number of rentals made by registered members.
  I cnt is the sum of the two numbers.



Questions

I What are the relationships among these variables?

I How do these variables affect the rental outcomes?

I How to build a model to explain the variability of rental outcomes?

I How to build a model to predict future rental outcomes?



Descriptive Statistics

I What are the summaries of these variables?

I What are the shapes of distributions of these variables?

I Is there correlation among these variables?
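
These three questions map onto three base R functions: summary() for numerical summaries, hist() for the shape of a distribution, and cor() for pairwise correlations. A minimal sketch on a made-up stand-in for the bike data (the data frame toy and its values are hypothetical):

```r
# Hypothetical stand-in for a few columns of the bike data.
set.seed(5)
toy <- data.frame(temp = runif(100, 0, 35))
toy$atemp <- toy$temp + rnorm(100, sd = 1.5)
toy$cnt   <- 1000 + 120 * toy$temp + rnorm(100, sd = 400)

summary(toy)        # minimum, quartiles, mean, and maximum of each column
hist(toy$cnt)       # shape of the distribution of cnt
round(cor(toy), 2)  # pairwise correlation coefficients
```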



Capturing the trend

I Construct a regression model for instant and cnt. Do you still see an increasing trend?



Impact of working/holiday

I Construct a regression model for instant and cnt. Do you still see an increasing trend?

I Add the variable holiday into the regression model. On average, what is the impact of being a holiday?

I Remove holiday and add the column workingday into the regression model. On average, what is the impact of being a working day? Compare the result with that for holiday.



Impact of weather (quantitative variables)

I How do temp, atemp, hum, and windspeed affect cnt?

I If you use a regression model with the five variables listed above, what are the potential drawbacks?

I Try to take away temp and do the analysis again.

I Try to add instant and do the analysis again.



Impact of weather (transformation)

I Some people suggest that temp should have a nonlinear impact on cnt. Does this fit your intuition? Draw a scatter plot to help you judge the intuition.

I To capture the nonlinear relationship, let’s add a variable temp² as our second independent variable. Construct the regression model, interpret it, and validate it.

I Does adding temp² improve the model?

I Visualize the two regression models.



Impact of weather (qualitative variables)

I If we construct a regression model with instant, weathersit, and cnt, what is wrong?

I Create indicator variables for weathersit by choosing sunny as the reference level. Construct a regression model with instant, the indicator variables for weathersit, and cnt. Validate and interpret the model.



Seasonality

I Use instant, season, and cnt to build a model.

I What if we replace season by month?



Isolating working and non-working days

I Use a scatter plot to determine whether temp affects casual.

I Construct a regression model with temp and casual. Validate and interpret the model. Construct the scatter plot for temp and casual. Add the linear trend line into the plot.

I On your scatter plot, isolate working and non-working days.

I Construct a regression model with temp, workingday, and casual with no interaction. Validate and interpret the model.

I Construct a regression model with temp, workingday, temp × workingday, and casual. Validate and interpret the model.

I Compare the above three models. Visualize the differences of the regression lines obtained for working days, non-working days, and both.



Isolating working and non-working days

I Construct a multiple linear regression model with temp, workingday, temp × workingday, and cnt. Validate and interpret the model.

I Visualize the differences of the regression lines obtained for working days, non-working days, and both.



Using casual and registered

I To predict or explain cnt, is it good to include casual or registered as independent variables?



The “best” model for cnt?

Independent variables                                  Adjusted R²  MAE
instant, temp                                          0.685        1275.163
instant, temp, workingday, temp × workingday           0.687        1235.614
instant, month, temp, workingday, temp × workingday    0.737        1148.191
instant, month, temp, temp², workingday,               0.751        1059.101
  temp × workingday, temp² × workingday

(MAE is calculated based on the first three months in 2013.)

I Is a good predictive model always a good explanatory one?

I Is a good explanatory model always a good predictive one?
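
The two columns of the table measure different things: adjusted R² is computed on the data used to fit the model, while MAE (mean absolute error) is computed here on later data the model has not seen. A minimal sketch of both on made-up data (the data frame toy, the 90/30 split, and all values are hypothetical, not the slides' actual procedure):

```r
# Hypothetical data: an upward trend plus noise.
set.seed(4)
toy <- data.frame(instant = 1:120)
toy$cnt <- 1000 + 5 * toy$instant + rnorm(120, sd = 50)

# Fit on the first 90 days and hold out the last 30 days.
train    <- toy[1:90, ]
hold_out <- toy[91:120, ]

# Using data = (rather than toy$cnt ~ toy$instant) lets predict() accept new data.
fit <- lm(cnt ~ instant, data = train)

adj_r2 <- summary(fit)$adj.r.squared   # explanatory fit on the training data
pred   <- predict(fit, newdata = hold_out)
mae    <- mean(abs(hold_out$cnt - pred))  # predictive error on unseen data

adj_r2
mae
```

Note the design choice in lm(): a formula written as B$cnt ~ B$instant refers to fixed vectors, so predict() with newdata would not use the new rows; writing cnt ~ instant with data = avoids this.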

