Post on 06-Oct-2020
Lecture 5 1
Lecture 3
The Population Variance
The population variance, denoted σ², is the sum
of the squared deviations about the population
mean divided by the number of observations in
the population, N:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}.$$
An alternative formula is:
$$\sigma^2 = \frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2 = \frac{\sum x_i^2}{N} - \mu^2.$$
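As a quick check that the two forms of the population variance agree, here is a minimal Python sketch. The data set and the function names are hypothetical (they are not part of the lecture); `pvariance` is the population variance from Python's standard `statistics` module.

```python
import math
from statistics import pvariance

def population_variance(xs):
    """Definition: mean of the squared deviations about the population mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def population_variance_shortcut(xs):
    """Alternative form: mean of the squares minus the square of the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

# Hypothetical population of N = 5 values (an assumption for illustration).
population = [2, 4, 4, 5, 10]

by_definition = population_variance(population)
by_shortcut = population_variance_shortcut(population)
print(by_definition, by_shortcut, pvariance(population))
```

The two results can differ in the last few digits because of floating-point round-off, which is the same effect the remark on rounding warns about.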
REMARK: To avoid round-off errors, which
accumulate quickly in these formulas, do not
round until the last computation, and use as
many decimal places as allowed in your calculator.
May 15, 2012
The Sample Variance
When the population is large, we approximate the
population mean µ with the sample mean, x̄.
Similarly, we approximate the population variance
σ² by the sample variance, denoted s²:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}.$$
The alternative form is:
$$s^2 = \frac{\sum x_i^2}{n - 1} - \frac{\left(\sum x_i\right)^2}{n(n - 1)}.$$
REMARK: Notice that we divide by the sample
size minus one (this is different from the formula
for the population variance).
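The sample variance can be sketched in Python the same way. The sample below is hypothetical, and the shortcut function uses the form written with the sum of squares, ∑xᵢ²/(n − 1) − (∑xᵢ)²/(n(n − 1)). `variance` from the standard `statistics` module also divides by n − 1.

```python
import math
from statistics import variance

def sample_variance(xs):
    """Definition: squared deviations about the sample mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Shortcut: sum of squares over (n - 1), minus (sum)^2 over n(n - 1)."""
    n = len(xs)
    total = sum(xs)
    return sum(x * x for x in xs) / (n - 1) - total ** 2 / (n * (n - 1))

# Hypothetical sample of n = 5 observations (an assumption for illustration).
sample = [2, 4, 4, 5, 10]

s2_definition = sample_variance(sample)
s2_shortcut = sample_variance_shortcut(sample)
print(s2_definition, s2_shortcut, variance(sample))
```

For the same five numbers, dividing by n − 1 instead of n gives a larger value than the population formula would.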
Informally, we say: a sample of size n has n
degrees of freedom; one degree of freedom is “used
up” in computing x̄, so there are only n − 1
degrees of freedom available for the sample
variance.
The Standard Deviation
For both cases (the population or the sample),
the standard deviation is the square root of the
corresponding variance:
The population standard deviation is denoted
by σ:
$$\sigma = \sqrt{\sigma^2}.$$
The sample standard deviation is denoted by
s:
$$s = \sqrt{s^2}.$$
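A minimal sketch of both standard deviations, assuming a small hypothetical data set treated once as a whole population and once as a sample. `pstdev` and `stdev` are the corresponding functions in Python's standard `statistics` module.

```python
import math
from statistics import pstdev, stdev

# Hypothetical observations (an assumption for illustration).
data = [2, 4, 4, 5, 10]
n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)

# Population standard deviation: square root of the variance with divisor N.
sigma = math.sqrt(sum_sq_dev / n)

# Sample standard deviation: square root of the variance with divisor n - 1.
s = math.sqrt(sum_sq_dev / (n - 1))

print(sigma, pstdev(data))
print(s, stdev(data))
```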
Advantage of the (population or sample) standard
deviation: it is given in the same units as the
observations.
Advantage of the (population or sample)
variance: it is easier to manipulate algebraically,
in some cases.
Both the standard deviation and the variance are
interpreted as follows: the larger they are, the
more spread out the distribution is. If they equal 0,
the smallest possible value, then all observations
must be equal.
Remark 1. Standard deviation measures spread
about the mean and should be used only when
the mean is chosen as the measure of center.
Remark 2. Standard deviation is not robust: a single
extreme observation can change it substantially.
Remark 3. The sum of the deviations of the
observations from their mean will always be zero.
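Remarks 2 and 3 can be checked numerically. The sketch below assumes a small hypothetical data set and adds one artificial extreme value (100) to show how strongly the standard deviation reacts, compared with the median.

```python
from statistics import mean, median, stdev

# Hypothetical observations (an assumption for illustration).
data = [2, 4, 4, 5, 10]

# Remark 3: deviations from the mean always sum to zero (up to round-off).
deviations = [x - mean(data) for x in data]
total_deviation = sum(deviations)

# Remark 2: one extreme observation changes the standard deviation a lot,
# while the median barely moves.
with_outlier = data + [100]
s_before, s_after = stdev(data), stdev(with_outlier)
median_before, median_after = median(data), median(with_outlier)

print(total_deviation)
print(s_before, s_after)
print(median_before, median_after)
```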
Density curves
Histograms are approximations to the exact
distribution of a variable. Increasing the number of
classes in a histogram makes each rectangle
narrower, and as the number of rectangles approaches
infinity, the graph becomes a curve, called a
density curve.
Properties of the density curve
1. The curve is never below the x-axis: the
function f(x) describing the curve is
nonnegative (it may be zero) for all x.
2. The total area underneath the curve and
above the x-axis equals 1.
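The two properties can be checked numerically for any candidate curve. The sketch below assumes a hypothetical density, f(x) = 2x on [0, 1] and 0 elsewhere (it is not one of the curves from the lecture), and approximates the area with the midpoint rule.

```python
def density(x):
    """Hypothetical density: f(x) = 2x on [0, 1] and 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def area_under(f, a, b, steps=10_000):
    """Approximate the area under f on [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Property 1: the function is nonnegative at every grid point we test.
grid = [-1.0 + 0.01 * i for i in range(301)]
is_nonnegative = all(density(x) >= 0.0 for x in grid)

# Property 2: the total area is (approximately) 1.
total_area = area_under(density, -1.0, 2.0)

print(is_nonnegative, total_area)
```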
Density curves, as we saw, have means, medians,
and modes, as well as standard deviations. The
notation is similar to that for the population
mean and standard deviation (why?).
Most of the time we use software to estimate
density curves. Many times we assume that data
follows a certain density curve.
The normal distribution
Often called the Gaussian curve, the normal curve
was introduced by Carl Friedrich Gauss in 1809
as an error curve for least squares regression,
which we will discuss next time.
There are other symmetric bell-shaped density
curves that are not normal.
Remark 4. The curve is described completely by
2 parameters: µ, the mean, and σ, the standard
deviation.
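To illustrate Remark 4, the sketch below evaluates the normal density from µ and σ alone. The formula used is the standard one for the normal curve; it is not written out in these notes, and the values µ = 12 and σ = 3 are just an assumed example. `NormalDist` from Python's standard `statistics` module serves as a cross-check.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Assumed example parameters (not from the lecture).
mu, sigma = 12.0, 3.0

height_left = normal_density(mu - sigma, mu, sigma)
height_right = normal_density(mu + sigma, mu, sigma)
height_peak = normal_density(mu, mu, sigma)

print(height_left, height_right, height_peak)
print(NormalDist(mu, sigma).pdf(10.0), normal_density(10.0, mu, sigma))
```

The curve is symmetric about µ and highest at µ, so the first two heights agree and the third is the largest.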
The Empirical Rule
If the distribution is approximately bell shaped
(not only normal), then:
1. Approximately 68% of the data will lie within
one standard deviation of the mean. That is,
about 68% of the data will be between µ− σ
and µ+ σ.
2. Approximately 95% of the data will lie within
two standard deviations of the mean.
3. Approximately 99.7% of the data will lie
within three standard deviations of the mean.
For exact values, we need to integrate to find the
area between two points.
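For an exactly normal curve, the integration can be delegated to software. A minimal sketch using the cumulative distribution function of `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Areas within 1, 2, and 3 standard deviations of the mean.
areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(k, round(100 * area, 2), "%")
```

The results are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated with "approximately".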
In general, for any distribution, not only the
normal distribution, Chebyshev’s rule can be
applied:
The proportion of values from a data set that will
fall within k standard deviations of the mean will
be at least
$$\left(1 - \frac{1}{k^2}\right) \cdot 100\%,$$
where k > 1. This rule can be applied to samples
too.
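A minimal sketch of Chebyshev's bound, checked against a deliberately skewed, hypothetical data set (the data and function names are assumptions for illustration).

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is informative only for k > 1")
    return (1 - 1 / k ** 2) * 100

def percent_within(xs, k):
    """Observed percentage of values within k standard deviations of the mean."""
    m, sd = mean(xs), pstdev(xs)
    inside = [x for x in xs if abs(x - m) <= k * sd]
    return 100 * len(inside) / len(xs)

# Hypothetical, strongly skewed data: far from bell shaped.
skewed = [1, 1, 1, 2, 2, 3, 50]

bound_k2 = chebyshev_lower_bound(2)
observed_k2 = percent_within(skewed, 2)
print(bound_k2, observed_k2)
```

The bound is a guarantee, not an estimate: the observed percentage is usually well above it.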
Finding the area under the normal density curve
is not an easy task. It requires a lot of calculus.
One way of avoiding this is to use tables that give
us these areas (probabilities). But for each µ and
σ we would need a new table. How can we avoid
this? By transforming all these curves into a
single standard one, with µ = 0 and σ² = 1.
Standardizing
Convert other values to standard units, or
z-scores, by subtracting the mean and dividing
by the standard deviation:
$$z = \frac{x - \mu}{\sigma}$$
Example: Standardize x = −3 with µ = 2 and
σ = 4. What z-score range corresponds to (8, 17)
with µ = 12 and σ² = 9?
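The two exercises can be worked with a short sketch. The function name `z_score` is hypothetical. Note that the second exercise gives the variance, so σ is its square root.

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# First exercise: x = -3 with mean 2 and standard deviation 4.
z_single = z_score(-3, mu=2, sigma=4)

# Second exercise: the interval (8, 17) with mean 12 and variance 9.
sigma = math.sqrt(9)  # standard deviation is the square root of the variance
z_low = z_score(8, mu=12, sigma=sigma)
z_high = z_score(17, mu=12, sigma=sigma)

print(z_single)
print(round(z_low, 3), round(z_high, 3))
```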
Interpretation: z is the number of standard
deviations that x is away from the mean.
The z-score is unit free. We can use it to compare
observations from different sources (“apples to
oranges”).
Notation: The standard normal distribution is
denoted by N(0, 1), and any other normal
distribution with mean µ and standard deviation σ by
N(µ, σ).
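Standardizing is what lets a single N(0, 1) table (or function) serve every normal curve. The sketch below assumes a normal variable with mean 12 and standard deviation 3 and finds the area between 8 and 17 in two ways: directly, and through z-scores on the standard normal. `NormalDist` is from Python's standard `statistics` module and takes the standard deviation, not the variance, as its second parameter.

```python
import math
from statistics import NormalDist

# Assumed example: normal distribution with mean 12 and standard deviation 3.
mu, sigma = 12.0, 3.0
a, b = 8.0, 17.0

# Direct computation on the N(mu, sigma) curve.
original = NormalDist(mu, sigma)
area_direct = original.cdf(b) - original.cdf(a)

# Same area after converting the endpoints to z-scores.
standard = NormalDist(0.0, 1.0)
z_a = (a - mu) / sigma
z_b = (b - mu) / sigma
area_standardized = standard.cdf(z_b) - standard.cdf(z_a)

print(area_direct, area_standardized)
```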
Relations between variables. Scatter
diagrams
In practice, statisticians are interested in
relationships among multiple variables. For 2
variables, the data come in pairs: the two values
measured on the same individual form one observation.
Sometimes we use the value of one variable in
order to predict another variable. The response
variable is the variable whose value can be
explained by, or is determined by, the value of the
explanatory variable. The response variable
measures the outcome of a study. An explanatory
variable explains or causes changes in the
response variable.
Example:
The relationship between two variables can be
represented by cross-tabulation, side-by-side
or clustered bar graphs, and scatterplots.
Definition 5. A scatter diagram is a graph
that shows the relationship between two
quantitative variables measured on the same
individual.
How to draw a scatter diagram:
• The explanatory variable is plotted on the
horizontal axis and the response variable is
plotted on the vertical axis.
• Each individual in the data set is represented by
a point in the scatter diagram.
• Do not connect the points when drawing a
scatter diagram.
How we interpret a scatter diagram
A scatter diagram may suggest
• a linear relationship,
• a nonlinear relationship, or
• no relation.
Definition 6. Two variables that are linearly
related are said to be positively associated if,
whenever the values of the predictor variable
increase, the values of the response variable also
increase, and they are said to be negatively
associated if, whenever the values of the
predictor variable increase, the values of the
response variable decrease.
Be careful! Do not infer causation from
association.
Definition 7. The linear correlation
coefficient is a measure of the strength of the linear
relation between two quantitative variables. The
sample correlation coefficient is
computed by:
$$r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$
where
x̄ is the sample mean of the predictor variable,
s_x is the sample standard deviation of the
predictor variable,
ȳ is the sample mean of the response variable,
s_y is the sample standard deviation of the
response variable, and
n is the number of individuals in the sample.
The population correlation coefficient is denoted
by ρ.
Example: (0, 0), (1, 2), (2, 2), (3, 5), (4, 6)
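The five example points can be run through the definition of r directly. This is a minimal sketch: the function name is hypothetical, and the first coordinate of each pair is assumed to be the predictor.

```python
import math

def sample_correlation(xs, ys):
    """Average product of standardized x and y values, with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

points = [(0, 0), (1, 2), (2, 2), (3, 5), (4, 6)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

r = sample_correlation(xs, ys)
print(round(r, 4))
```

Here x̄ = 2, ȳ = 3, the deviation products sum to 15, and the squared deviations sum to 10 and 24, so r = 15/√240 ≈ 0.968: a strong positive linear relation.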
Interpretation and properties of r
• −1 ≤ r ≤ 1
• If r = 1 there is a perfect positive linear relation
between the 2 variables.
• If r = −1 there is a perfect negative linear
relation between the 2 variables.
• The closer r is to 1, the stronger the evidence of
a positive linear relation, and the closer to −1, the
stronger the evidence of negative association
between the two variables.
• If r is close to 0 there is evidence of no linear
relation between the 2 variables. This does not
mean no relation, just no linear relation.
• r is a unitless measure of association.
• r is not resistant. It is strongly affected by
outliers.
• Both variables should be quantitative.
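Two of these properties are easy to see numerically. The sketch below assumes five hypothetical points and one artificial outlier, and uses an algebraically equivalent form of r (the sum of deviation products over the square root of the product of the sums of squared deviations).

```python
import math

def correlation(xs, ys):
    """Sample correlation, written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (an assumption for illustration).
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 2, 5, 6]
r_original = correlation(xs, ys)

# Unitless: changing units (for example, metres to centimetres) leaves r unchanged.
r_rescaled = correlation([100 * x for x in xs], [2.5 * y for y in ys])

# Not resistant: a single artificial outlier at (10, 0) changes r drastically.
r_with_outlier = correlation(xs + [10], ys + [0])

print(round(r_original, 3), round(r_rescaled, 3), round(r_with_outlier, 3))
```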