
Regression Analysis

Regression Analysis

Ching-Kang Ing (銀慶剛)

Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan

1 / 162

Regression Analysis

Outline I

1 Finite Sample Theory: Regression Models; Analysis of Variance (ANOVA); Projection Matrices; Estimation; Multivariate Normal Distributions; Gaussian Regression; Interval Estimation; Another Look at β̂; Model Selection; Prediction

2 Large Sample Theory: Motivation; Toward Large Sample Theory I; Toward Large Sample Theory II; Toward Large Sample Theory III

2 / 162

Regression Analysis

Outline II

3 Appendix: Statistical View of Spectral Decomposition; Limit Theorems (Continuous Mapping Theorem; Slutsky's Theorem; Central Limit Theorem; Convergence in the rth Mean; Some Inequalities; Weak Law of Large Numbers); Delta Method; Two-Sample t-Test; Pearson's Chi-Squared Test

3 / 162

Regression Analysis

Finite Sample Theory

Regression Models

Regression Models

Consider the following linear regression model:

yi = β0 + β1xi1 + · · · + βkxik + εi,  i = 1, . . . , n,

where the εi are i.i.d. r.v.s with E(ε1) = 0 and E(ε1²) = Var(ε1) = σ² > 0. Define f(β) = ‖y − Xβ‖², where

X = [ 1 x11 · · · x1k
      ⋮  ⋮        ⋮
      1 xn1 · · · xnk ]   and   y = (y1, . . . , yn)>.

By solving the equation

∂f(β)/∂β = 0,

we obtain the normal equations X>Xβ = X>y, and hence

(β̂0, . . . , β̂k)> ≡ β̂ = (X>X)−1X>y.
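The following minimal numpy sketch (not part of the original slides) fits β̂ = (X>X)−1X>y on simulated data; the simulated design, sample size, and coefficient values are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)                       # y = Xβ + ε, σ² = 1

# Least-squares estimate from the normal equations X>Xβ = X>y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same problem and is numerically more stable
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```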

4 / 162

Regression Analysis

Finite Sample Theory

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA)

Define

SST = ∑_{i=1}^n (yi − ȳ)²,

SSRes = ∑_{i=1}^n (yi − β̂0 − β̂1xi1 − · · · − β̂kxik)² = ∑_{i=1}^n (yi − xi>β̂)²,

SSReg = ∑_{i=1}^n (β̂0 + β̂1xi1 + · · · + β̂kxik − ȳ)² = ∑_{i=1}^n (xi>β̂ − ȳ)²,

where ȳ = n−1 ∑_{i=1}^n yi and xi = (1, xi1, . . . , xik)>. Then we have

SST = SSReg + SSRes.
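A quick numerical check of SST = SSReg + SSRes on the same kind of simulated data as in the earlier sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                      # fitted values xi>β̂
y_bar = y.mean()

SST   = np.sum((y - y_bar) ** 2)
SSRes = np.sum((y - y_hat) ** 2)
SSReg = np.sum((y_hat - y_bar) ** 2)

print(np.isclose(SST, SSReg + SSRes))     # the Pythagorean identity holds
```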

5 / 162

Regression Analysis

Finite Sample Theory

Analysis of Variance (ANOVA)

It is not difficult to see (why?) that

SST = y>(I − M0)y,  where M0 = E/n = 11>/n with 1 = (1, . . . , 1)>,

SSRes = y>(I − Mk)y,  where Mk = X(X>X)−1X>,

and

SSReg = y>(Mk − M0)y.

[Note that

(y1 − x1>β̂, . . . , yn − xn>β̂)> = y − Xβ̂ = y − X(X>X)−1X>y = (I − Mk)y,

and Mk = Mk², (I − Mk)² = I − Mk.]

6 / 162

Regression Analysis

Finite Sample Theory

Analysis of Variance (ANOVA)

Therefore, ANOVA is nothing but

y>(I −M0)y = y>(Mk −M0)y + y>(I −Mk)y.

Actually, ANOVA is a Pythagorean equality, as illustrated in the accompanying figure (omitted here), in which C(X) = {Xa : a ∈ Rk+1} is called the column space of X.

7 / 162

Regression Analysis

Finite Sample Theory

Analysis of Variance (ANOVA)

Another look at SST = SSReg + SSRes

Assume

yi = xi>β + εi,  i = 1, . . . , n,

where E(εi) = 0, Var(εi) = σ², the (xi, εi) are i.i.d., and E(εi | xi) = 0 for all i. Note that we consider the case of "random regressors" instead of fixed ones. Here are some observations:

(i) E(yi) = E(xi>β) is the same for all i.

(ii) Var(yi) is the same for all i.

(iii) E(yi | xi) = E(xi>β + εi | xi) = xi>β.

(iv) Var(yi) = Var(E(yi | xi)) + E(Var(yi | xi))
            = Var(xi>β) + E{E[(yi − xi>β)² | xi]}
            = Var(xi>β) + Var(εi).

8 / 162

Regression Analysis

Finite Sample Theory

Analysis of Variance (ANOVA)

(v) Var(yi) = Var(y1) can be estimated by

    (1/n) ∑_{i=1}^n (yi − ȳ)² =: V̂ar(yi).

    Var(xi>β) = E(xi>β − E(xi>β))² = E(xi>β − E(yi))² can be estimated by

    (1/n) ∑_{i=1}^n (xi>β̂ − ȳ)² =: V̂ar(xi>β).

    Var(εi) can be estimated by

    (1/n) ∑_{i=1}^n (yi − xi>β̂)² =: V̂ar(εi).

(vi) Therefore, SST = SSReg + SSRes is nothing but

    V̂ar(yi) = V̂ar(xi>β) + V̂ar(εi).

9 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Projection Matrices

Let

X = [ x11 · · · x1r
      ⋮         ⋮
      xn1 · · · xnr ] = [X1, . . . , Xr]

be an n × r matrix. The column space of X, C(X), is defined as

C(X) = {Xa : a = (a1, · · · , ar)> ∈ Rr},

noting that Xa = a1X1 + · · · + arXr.

Definition

An n × n matrix M is called an orthogonal projection matrix onto C(X) if and only if

1. for v ∈ C(X), Mv = v,

2. for w ∈ C⊥(X), Mw = 0, where C⊥(X) = {s : v>s = 0 for all v ∈ C(X)}.

10 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Fact 1

C(M) = C(X).

Proof of Fact 1

Let v ∈ C(X). Then

v = Xb = MXb ∈ C(M), (why?)

for some b.

Let v ∈ C(M). Then

v = Ma = M(a1 + a2) = a1 ∈ C(X),

for some a, and some a1 ∈ C(X),a2 ∈ C⊥(X). This completes the proof.

11 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Fact 2

M> = M (symmetric) and M2 = M (idempotent) if and only if M is anorthogonal projection matrix on C(M).

Proof of Fact 2

(⇒) For v ∈ C(M),Mv = MMbidempotent

= Mb = v, for some b.

For w ∈ C⊥(M),Mwsymmetric

= M>w = 0. (why?)(⇐) Define ei = (0, . . . , 0, 1, 0, . . . , 0)>, where i-th component is 1, and the others

are 0.It is suffices to show that for any ei, ej , e

>i M

>(I −M)ej = 0. (why?)

Since we can decompose ei and ej as ei = e(1)i + e

(2)i and ej = e

(1)j + e

(2)j ,

where e(1)i , e

(1)j ∈ C(M) and e

(2)i , e

(2)j ∈ C⊥(M),

e>i M>(I −M)ej = e>i M

>(I −M)(e(1)j + e

(2)j )

why?= e>i M

>e(2)j

why?= e

(1)>i e

(2)j = 0.

This completes the proof.

12 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Fact 3

Orthogonal projection matrices are unique.

Proof of Fact 3

Let M and P be orthogonal projection matrices onto some space S ⊆ Rn.

Then, for any v ∈ Rn,v = v1 + v2, where v1 ∈ S and v2 ∈ S⊥.

The desired conclusion follows from

(M − P )v = (M − P )(v1 + v2) = (M − P )v1 = 0.

13 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Fact 4

Let o1, · · · , or be an orthonormal basis of C(X), i.e.,

oi>oj = 0 if i ≠ j and 1 if i = j,

and for any v ∈ C(X), v = Ob for some b ∈ Rr, where O = [o1, . . . , or]. Then OO> = ∑_{i=1}^r oi oi> is the orthogonal projection matrix onto C(X).

Proof of Fact 4

Since OO> is symmetric and OO>OO> = OO>, where O>O = Ir, the r-dimensional identity matrix, by Fact 2, OO> is the orthogonal projection matrix onto C(OO>).

Moreover, for v ∈ C(X), we have

v = Ob = OO>Ob ∈ C(OO>),

for some b ∈ Rr.

In addition, C(OO>) ⊆ C(O) = C(X). The desired conclusion follows.

14 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Remark

One can also prove the result by showing

(i) for v ∈ C(X), OO>v = OO>Ob = Ob = v, and

(ii) for w ∈ C⊥(X), OO>w = 0 (the n-dimensional vector of zeros).

The difference between the two proofs is that the first invokes Fact 2 to conclude that OO> is the orthogonal projection matrix onto C(OO>) and then infers from the structure of C(OO>) that it equals C(X), whereas the second directly guesses that OO> is the orthogonal projection matrix onto C(X). The former is more roundabout but involves less "guessing"; the latter is the opposite.

15 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Given a matrix X, how do we construct the orthogonal projection matrix for C(X)?

Gram-Schmidt process

Let X = [x1, . . . , xq] for some q ≥ 1. Define

y1 = x1/‖x1‖, where ‖x1‖² = x1>x1,

w2 = x2 − (x2>y1)y1,  y2 = w2/‖w2‖,
⋮
ws = xs − ∑_{i=1}^{s−1} (xs>yi)yi,  ys = ws/‖ws‖,  2 ≤ s ≤ q.

16 / 162
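A minimal numpy sketch of the Gram-Schmidt construction above (an illustration, not part of the slides); the function name, the example matrix, and the tolerance are assumptions:

```python
import numpy as np

def gram_schmidt_projection(X, tol=1e-10):
    """Return an orthonormal basis Y of C(X) and the projection matrix Y Y>."""
    basis = []
    for x in X.T:                      # process the columns x1, ..., xq in order
        w = x.astype(float)
        for y in basis:                # ws = xs - sum_i (xs>yi) yi
            w = w - (x @ y) * y
        norm = np.linalg.norm(w)
        if norm > tol:                 # keep only the non-zero ys (rank of C(X))
            basis.append(w / norm)
    Y = np.column_stack(basis)
    return Y, Y @ Y.T                  # by Fact 4, Y Y> projects onto C(X)

X = np.array([[1., 2., 3.],
              [1., 0., 1.],
              [1., 1., 2.]])           # third column = first + second (rank 2)
Y, M = gram_schmidt_projection(X)
print(Y.shape[1])                      # 2 non-zero basis vectors
print(np.allclose(M @ X, X))           # M fixes every column of X
```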

Regression Analysis

Finite Sample Theory

Projection Matrices

If the rank of C(X) is 1 ≤ r ≤ q, then there are r non-zero yi, denoted byys1 , . . . ,ysr , and Y = (ys1 , . . . ,ysr ) is an orthonormal basis of C(X).

Y Y > is the orthogonal projection matrix onto C(X) (by Fact 4).

17 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Explanation of Rank

Explain "the rank of C(X)":

Let J be a subset of {1, · · · , q} satisfying

(i) {xi, i ∈ J} is linearly independent, i.e., ∑_{i∈J} ai xi = 0 if and only if ai = 0 for all i ∈ J;

(ii) for any J1 ⊇ J with J1 − J ≠ ∅, {xi, i ∈ J1} is not linearly independent.

The rank of C(X) is then defined as #(J), the number of elements in J.

18 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

Moreover, if r(X) = q (i.e., the rank of C(X) is q), then X(X>X)−1X> is the orthogonal projection matrix onto C(X).

Proof

(i) X(X>X)−1X> is symmetric and idempotent.

(ii) C(X(X>X)−1X>) = C(X). (why?)

If 1 ≤ r(X) < q, then

X(X>X)−X> is the orthogonal projection matrix onto C(X),

where A− denotes a generalized inverse (g-inverse) of A, defined as any matrix G such that AGA = A.

Note that

(X>X)− = (X>X)−1 if r(X) = q,

and

there are infinitely many choices of (X>X)− if r(X) < q.

But in either case, X(X>X)−X> is unique, according to Fact 3.
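As a quick check of this uniqueness claim (an illustrative sketch; the rank-deficient matrix below is made up), the Moore-Penrose pseudoinverse computed by numpy is one particular choice of (X>X)−, and the resulting X(X>X)−X> is symmetric, idempotent, and fixes C(X):

```python
import numpy as np

X = np.array([[1., 2., 3.],
              [1., 0., 1.],
              [1., 1., 2.],
              [1., 3., 4.]])                 # rank 2 < q = 3 columns

# One particular g-inverse of X>X: the Moore-Penrose pseudoinverse
M = X @ np.linalg.pinv(X.T @ X) @ X.T

print(np.allclose(M, M.T))                   # symmetric
print(np.allclose(M @ M, M))                 # idempotent
print(np.allclose(M @ X, X))                 # fixes every column of X
print(np.linalg.matrix_rank(M))              # rank 2 = r(X)
```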

19 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

We now go back to regression problems, and summarize the key features ofM0 = n−111>, Mk = X(X>X)−1X>, (I −M0), (I −Mk), and Mk −M0,where

X =

1 x11 · · · x1k...

...1 xn1 · · · xnk

.

(i) M0 is the orthogonal projection matrix onto C(1).

(ii) Mk is the orthogonal projection matrix onto C(X).

(iii) (I −M0) is the orthogonal projection matrix onto C⊥(1).

(iv) (I −Mk) is the orthogonal projection matrix onto C⊥(X).

20 / 162

Regression Analysis

Finite Sample Theory

Projection Matrices

(v) Mk − M0 is the orthogonal projection matrix onto C((I − M0)X), where

    C((I − M0)X) = C( (x11 − x̄1, . . . , xn1 − x̄1)>, . . . , (x1k − x̄k, . . . , xnk − x̄k)> ),  (why?)

    with x̄i = n−1 ∑_{j=1}^n xji.

(vi)

M0Mk = M0 = MkM0,

(I −M0)M0 = 0,

(I −Mk)Mk = 0,

(I −Mk)M0 = 0,

where 0 is the n× n matrix of zeros.

21 / 162

Regression Analysis

Finite Sample Theory

Estimation

Estimation

Does β̂ possess any optimal properties?

E(β̂) = β, since

E(β̂) = E((X>X)−1X>y)
      = E{(X>X)−1X>(Xβ + ε)}
      = β + E((X>X)−1X>ε)
      = β + (X>X)−1X>E(ε)
      = β + (X>X)−1X>0 = β.

Var(β̂) = (X>X)−1σ², because

Var(β̂) = E((β̂ − β)(β̂ − β)>)
        = E{(X>X)−1X>εε>X(X>X)−1}
        = (X>X)−1X>E(εε>)X(X>X)−1 = σ²(X>X)−1,

noting that we have used E(εε>) = σ²I.

22 / 162

Regression Analysis

Finite Sample Theory

Estimation

Gauss-Markov Theorem

For any β̃ = Ay satisfying

β = E(β̃) = E(Ay) = E(A(Xβ + ε)) = AXβ for "all" β,

we have Var(β̂) ≤ Var(β̃) in the sense that Var(β̃) − Var(β̂) is non-negative definite, i.e., for any ‖a‖ = 1,

a>{Var(β̃) − Var(β̂)}a ≥ 0. (∗)

Remark

(i) β̃ = Ay is called a linear estimator of β.

(ii) β̃ is unbiased (since we assume E(β̃) = β for all β).

(iii) This theorem says that β̂ is the best linear unbiased estimator (BLUE) of β.

(iv) (∗) is equivalent to Var(a>β̃) ≥ Var(a>β̂) (why?), meaning that the variance of a>β̃ is never smaller than that of a>β̂, regardless of the direction vector a onto which β̃ and β̂ are projected.

23 / 162

Regression Analysis

Finite Sample Theory

Estimation

Proof of Gauss-Markov Theorem

Let a ∈ Rk+1 be arbitrarily chosen. Then,

Var(a>β̃) = E[a>(β̃ − β)]²  (since β̃ is unbiased)
          = E(a>(β̃ − β̂) + a>(β̂ − β))²
          ≥ Var(a>β̂) + 2E{a>(β̃ − β̂)(β̂ − β)>a}  (since β̂ is unbiased)
          = Var(a>β̂) + 2a>E((A − (X>X)−1X>)εε>X(X>X)−1)a  (why?)
          = Var(a>β̂) + 2σ²a>(A − (X>X)−1X>)X(X>X)−1a  (why?)
          = Var(a>β̂) + 2σ²a>[(X>X)−1a − (X>X)−1a]  (why?)
          = Var(a>β̂).

24 / 162

Regression Analysis

Finite Sample Theory

Estimation

How do we estimate σ²?

σ̂² = (1/(n − (k + 1))) ∑_{i=1}^n (yi − β̂0 − β̂1xi1 − · · · − β̂kxik)²
    = (1/(n − (k + 1))) ∑_{i=1}^n (yi − xi>β̂)²
    = (1/(n − (k + 1))) y>(I − Mk)y.
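A sketch of the unbiased variance estimate σ̂² = y>(I − Mk)y/(n − (k + 1)) on simulated data of the same kind as before (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - (k + 1))

# Equivalent expression through the residual projection I - Mk
Mk = X @ np.linalg.inv(X.T @ X) @ X.T
print(sigma2_hat)
print(np.isclose(sigma2_hat, y @ (np.eye(n) - Mk) @ y / (n - (k + 1))))
```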

25 / 162

Regression Analysis

Finite Sample Theory

Estimation

Why "k + 1"? Because "k + 1" makes σ̂² unbiased, namely, E(σ̂²) = σ².

To see this, we have

E(σ̂²) = (1/(n − (k + 1))) E(y>(I − Mk)y)
       = (1/(n − (k + 1))) E(ε>(I − Mk)ε)  (why?)
       = (σ²/(n − (k + 1))) tr(I − Mk)  (why?)
       = σ²,  (why?)

where ε = (ε1, . . . , εn)>.

Reasons for the second "why": define µ = E(z) and V = Cov(z) = E[(z − µ)(z − µ)>]. Then

E(z>Az) = µ>Aµ + tr(AV).

Since ε>(I − Mk)ε is a scalar,

E(ε>(I − Mk)ε) = E(tr(ε>(I − Mk)ε)) = tr[E{(I − Mk)εε>}] = tr(I − Mk)σ².

26 / 162

Regression Analysis

Finite Sample Theory

Estimation

Some facts about the trace operator

1. tr(A) := ∑_{i=1}^n Aii, where A = [Aij]_{1≤i,j≤n}.

2. tr(AB) = tr(BA) and tr(∑_{i=1}^k Ai) = ∑_{i=1}^k tr(Ai).

3. tr(Mk) = tr(X(X>X)−1X>) = tr((X>X)−1X>X) = tr(Ik+1) = k + 1, where Ik+1 is the (k + 1)-dimensional identity matrix.

4. Alternatively, tr(Mk) = tr(∑_{i=1}^{k+1} oi oi>) = ∑_{i=1}^{k+1} tr(oi oi>) = ∑_{i=1}^{k+1} tr(oi>oi) = k + 1, where {o1, . . . , ok+1} is an orthonormal basis for C(X).

5. Similarly, we have tr(I − Mk) = n − k − 1 and tr(I − M0) = n − 1.

27 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Multivariate Normal Distributions

Definition

We say z has an r-dimensional multivariate normal distribution with mean

E(z) = µ

and variance

E((z − µ)(z − µ)>) = Σ > 0 (i.e., a>Σa > 0 for all a ∈ Rr with ‖a‖ = 1),

denoted by N(µ, Σ), if there exist a k-dimensional standard normal vector

ε = (ε1, . . . , εk)>, k ≥ r (i.e., ε1, . . . , εk are i.i.d. N(0, 1) random variables),

and an r × k nonrandom matrix A of full row rank satisfying AA> = Σ such that

z ∼ Aε + µ,

where ∼ means both sides have the same distribution.

28 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

> If ∃ a ∈ Rr such that a>Σa = 0, then E(a>(z − µ))2 = 0 (why?).

This yields P (a>(z − µ) = 0) = 1 because

E(a>(z − µ))2 = 0 implies E(a>(z − µ)) = 0 and Var(a>z) = 0.

Therefore, with probability “1”, one zi is a linear combination of other zj ’s.

29 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Why E(X) = 0 implies P (X = 0) = 1 for non-negative X?

Fact

Let X be a non-negative r.v., i.e., P (X ≥ 0) = 1. Then, E(X) = 0 impliesP (X = 0) = 1.

Proof of Fact

Suppose P (X = 0) < 1. Then, P (X > 0) > 0, and hence there exists someδ > 0 such that P (X > 0) > δ (why?).

Since P (X > 0)(why?)

= P (⋃∞n=1{X > n−1}) (why?)

= limn→∞ P (X > n−1), itfollows that

P (X > M−1) > δ/2 for some large integer M. (why?) (∗)

Now, (∗) yields

E(X)(why?)

≥ E(XI{X>M−1})(why?)

≥ M−1P (X > M−1) ≥ δ/(2M) > 0,

which gives a contradiction. Thus, the proof is complete.

30 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Remark

1. A =

a>1...a>r

is said to have a full row rank if a1, . . . ,ar are linearly

independent.

2. A is not unique since for any P>P = PP> = Ik, we have

AA> = APP>A> = Σ.

3. If z ∼ N(µ,Σ), then for any B of full row rank, Bz ∼ N(Bµ,BΣB>).

4. If r = 2, then z is said to be bivariate normal.

5. Let z =

(z1z2

)be a two-dimensional random vector and fulfill

z1 ∼ N(0, 1), z2 ∼ N(0, 1), and E(z1z2) = 0.

It is possible that z is not a bivariate normal.

31 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Fact 1

If z ∼ N(µ,Σ), then the joint probability density function (pdf) of z, f(z), isgiven by

f(z) = (2π)−r/2(det(Σ))−1/2 exp

{− (z − µ)>Σ−1(z − µ)

2

}.

Proof of Fact 1

By definition,z ∼ Aε+ µ,

where ε ∼ N(0, Ik), k ≥ r, and A is an r × k matrix of full row rank.

Let b1, . . . , bk−r satisfy

b>i bj =

{1, i = j;

0, i 6= j,

and b>i aj = 0 for all 1 ≤ i ≤ k − r, 1 ≤ j ≤ r.

32 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Proof of Fact 1 (cont.)

Define

A∗ =

(AB

)≡

Ab>1...

b>k−r

and z∗ =

(zw

)= A∗ε+ µ∗,

where µ∗ = (µ, 0, . . . , 0)>.

Then, the joint pdf of z∗ is given by

f∗(z∗) = (2π)−k/2 exp

{− (z∗ − µ∗)(A∗>)−1(A∗)−1(z∗ − µ∗)

2

} ∣∣∣det(A∗−1

)∣∣∣ .

33 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Proof of Fact 1 (cont.)

Note that here we have used the following facts:

(i) The joint pdf of ε is

(2π)−k/2 exp

{−ε>ε

2

}=

k∏i=1

(2π)−1/2 exp

(−ε

2i

2

)since εi’s are independent, the joint pdf of (ε1, . . . , εk) is the product of themarginal pdfs.

(ii) Let the joint pdf of v = (v1, . . . , vk)> be denoted by f(v),v ∈ D ⊆ Rk, letg(v) = (g1(v), . . . , gk(v))> be a “smooth” one-to-one transformation of Donto E ⊆ Rk, and let g−1(s) = (g−11 (s), . . . , g−1k (s))>, s ∈ E denote theinverse transformation of g(s), which satisfies g−1(g(v)) = v.

34 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Proof of Fact 1 (cont.)

Define

J =∂g−1(y)

∂y=

∂g−1

1 (y)∂y1

· · · ∂g−11 (y)∂yk

......

∂g−1k (y)

∂y1· · · ∂g−1

k (y)

∂yk

.

Then, the joint pdf of y = g(v) is given by f(g−1(y))|det(J)|. Now, since

(A∗>)−1(A∗)−1 = (A∗A∗>)−1 =

((AB

)(A> B>

))−1=

((AA>)−1 0

0 Ik−r

)=

(Σ−1 0

0 Ik−r

)and

|det((A∗)−1)| = |det(A∗)|−1 = (det(A∗) det(A∗))−1/2

=(det(A∗) det(A∗>)

)−1/2=(det(A∗A∗>)

)−1/2= (det(Σ))−1/2,

35 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Proof of Fact 1 (cont.)

we have

f∗(z∗)why?= (2π)−r/2 exp

{− (z − µ)>Σ−1(z − µ)

2

}(det(Σ))−1/2

×(2π)−(k−r)/2 exp{−(w>w)/2

},

and hence

f(z) =

∫ ∞−∞· · ·∫ ∞−∞

f∗(z∗) dw

= (2π)−r/2 exp

{− (z − µ)>Σ−1(z − µ)

2

}(det(Σ))−1/2

×∫ ∞−∞· · ·∫ ∞−∞

(2π)−(k−r)/2 exp{−(w>w)/2

}dw

= (2π)−r/2 exp

{− (z − µ)>Σ−1(z − µ)

2

}(det(Σ))−1/2,

where∫∞−∞ · · ·

∫∞−∞(2π)−(k−r)/2 exp

{−(w>w)/2

}dw = 1. (why?)

36 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Fact 2

Assume z ∼ N(µ,Σ) and z =

(z1z2

). Then Cov(z1, z2) = E((z1 − µ1)(z2 −

µ2)>) = 0, where 0 is a zero matrix, if and only if z1 and z2 are independent,where z1 and z2 are r1- and r2-dimensional, respectively.

Proof of Fact 2

⇐) It is easy and hence skipped.⇒) Since Cov(z1, z2) = 0, we have by Fact 1,

f(z) = f(z1, z2)

=

2∏i=1

(2π)−ri/2 exp

{− (zi − µi)>Σ−1ii (zi − µi)

2

}|det(Σii)|−1/2

= f(z1)f(z2),

where (µ>1 ,µ>2 )> = µ and(

Σ11 Σ12

Σ21 Σ22

)= Σ =

(Σ11 00 Σ22

), by hypothesis.

37 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Proof of Fact 2 (cont.)

Since f(z1) is the joint pdf of z1 and f(z2) is the joint pdf of z2, the above identityimplies z1 and z2 are independent. (why?)

Here, we’ve used if X nd Y are independent iff f(x, y) = fx(x)fy(y).

Fact 3

Let z ∼ N(µ, σ2Ir) and C =

(B1

B2

)q×r

, q ≤ r, have a full row rank. Then B1z

and B2z are independent if B1B>2 = 0.

Proof of Fact 3

Since

Cov(B1z,B2z) = E(B1(z − µ)(z − µ)>B>2 ) = σ2B1B>2 = 0,

by Fact 2, the desired conclusion follows.

38 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Definition

Let z be an r-dimensional random vector and let A be an n×n symmetric matrix.Then z>Az is called a quadratic form.

Fact 4

Let E(z) = µ and Var(z) = Σ. Then

E(z>Az) = µ>Aµ+ tr(AΣ).

Proof of Fact 4

For µ = 0, we have

E(z>Az) = E(tr(Azz>)) = tr(AE(zz>)) = tr(AΣ).

For µ 6= 0, we have

tr(AΣ)why?= E((z − µ)>A(z − µ))

why?= E(z>Az)− 2µ>Aµ+ µ>Aµ,

and hence the desired conclusion holds.

39 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Fact 5

If z ∼ N(0, Ir) and M is an r × r orthogonal projection matrix, then

z>Mz ∼ χ2(r(M)),

where r(M) denotes the rank of M and χ2(k) denotes the chi-square distributionwith k degrees of freedom.

Proof of Fact 5

Denote r(M) by q. Let {o1, . . . ,oq} be an orthonormal basis for C(M).

We have shown that M = OO> =∑qi=1 oio

>i , where O = [o1, . . . ,oq] and

note that O>O = Iq.

Since O> has a full row rank, O>z ∼ N(0,O>O) = N(0, Iq), yielding thato>i z, i = 1, . . . , q, are i.i.d. N(0, 1) distributed. In addition, we have

z>OO>z =

q∑i=1

(o>i z)2 ∼ χ2(q),

which completes the proof.40 / 162

Regression Analysis

Finite Sample Theory

Multivariate Normal Distributions

Fact 6

Let z ∼ N(0,Σ). Then z>Σ−1z ∼ χ2(r).

Proof of Fact 6

Since z ∼ N(0,Σ), we have z ∼ Aε in which AA> = Σ and ε ∼ N(0, Ik)for some k ≥ r. Here, A is an r × k matrix of full row rank. This implies

z>Σ−1zd= ε>A>(AA>)−1Aε.

Here,d= means “is equivalent in distribution to”.

Note that A>(AA>)−1A is symmetric and idempotent. Therefore, it is anorthogonal projection matrix with rank r (why?). By Fact 5,

ε>A>(AA>)−1Aε ∼ χ2(r),

and hence gives the desired conclusion.

41 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

Gaussian Regression

Assume ε in y = Xβ + ε obeys ε ∼ N(0, σ2In).

D1 β = (X>X)−1X>ε+ β ∼ N(β, (X>X)−1σ2).Please convince yourself this result!!

D2

σ2 =1

n− k − 1ε>(I −Mk)ε

=σ2

n− k − 1

ε>(I −Mk)ε

σ2∼ σ2χ

2(n− k − 1)

n− k − 1,

recalling that Mk = X(X>X)−1X> and

X =

1 x11 · · · x1k...

...1 xn1 · · · xnk

.

Here I is In, but I sometimes drop the subscript “n” when no confusion ispossible.

42 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

Hypothesis testing

(a) F test

Consider the null hypothesis

H0 : β1 = β2 = · · · = βk = 0 (i.e., the regression is unimportant)

against

HA : H0 is wrong (the alternative hypothesis).

Test statistic:

T1 = (SSReg/k) / (SSRes/(n − k − 1)) = (per-degree-of-freedom contribution of the "regression") / (per-degree-of-freedom contribution of the "model residuals").

T1 is thus a comparison of these two kinds of "contribution".

When T1 is "large" we tend to reject H0, since the contribution of the regression can then no longer be ignored. But what counts as "large"? That depends on the distribution of T1, in particular its distribution under H0.

43 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

More precisely, if H0 holds, T1 should not be too large. If we can derive the distribution of T1 under H0, then we can find the value c for which

PH0(0 ≤ T1 ≤ c) = 95% (this percentage can be adjusted to individual needs).

That is, T1 falls in (0, c) with probability 95%, so when T1 ≥ c we should strongly "suspect" that H0 may be false (because something that is very unlikely under H0 has occurred).

We can therefore use T1 ≥ c (versus T1 < c) as a testing rule, i.e., reject H0 if T1 ≥ c and do not reject H0 if T1 < c. With this rule, the probability of committing a Type I error is 5%. [5% is called the "significance level" of the test, and such a test is called an α-level test with α = 5%.]

                      Truth
Action                H0             HA
Do not reject H0      O.K.           Type II error
Reject H0             Type I error   O.K.

For more on statistical testing, see the article "統計顯著性" (Statistical Significance) by Professor 黃文璋.

44 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

How do we derive the distribution of T1 under H0?

(i) SSReg/k = ε>(Mk − M0)ε/k under H0, which by Fact 5 is distributed as σ²χ²(k)/k.

(ii) SSRes/(n − k − 1) = σ̂², which by D2 is distributed as σ²χ²(n − k − 1)/(n − k − 1).

(iii) SSReg and SSRes are independent. This is because

SSReg = ε>OReg OReg>ε under H0,

where OReg consists of an orthonormal basis of C((I − M0)X), and

SSRes = ε>ORes ORes>ε,

where ORes consists of an orthonormal basis of C⊥((I − M0)X). Moreover, since

OReg>ORes = 0 (the zero matrix),

by Fact 3, OReg>ε and ORes>ε are independent, and hence SSReg and SSRes are independent (why?).

45 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

Note.

46 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

(iv) Combining (i)–(iii), we obtain

T1 ∼ F(k, n − k − 1) under H0,

where F(k, n − k − 1) is the F-distribution with k and n − k − 1 degrees of freedom. [Why? Because under H0, T1 is a ratio of two "independent" chi-square random variables, each divided by its degrees of freedom.]

(v) (α-level) Testing rule: reject H0 if

T1 ≥ f1−α(k, n − k − 1),

where P(F(k, n − k − 1) > f1−α(k, n − k − 1)) = α.

f1−α(k, n − k − 1) is called the upper critical value of the F(k, n − k − 1) distribution.
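A numerical sketch of this F test, using scipy's F distribution for the critical value (illustrative only; the simulated data mirror the earlier sketches):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

Mk = X @ np.linalg.inv(X.T @ X) @ X.T
M0 = np.ones((n, n)) / n
SSReg = y @ (Mk - M0) @ y
SSRes = y @ (np.eye(n) - Mk) @ y

T1 = (SSReg / k) / (SSRes / (n - k - 1))
crit = stats.f.ppf(0.95, k, n - k - 1)          # f_{1-α}(k, n-k-1) with α = 5%
print(T1, crit, T1 >= crit)                     # reject H0 if T1 >= crit
```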

47 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

48 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

(b) Wald test

Consider the linear parametric hypothesis

H0 : Dβ = γ,
HA : H0 is wrong,

where D and γ are known, D is a q × (k + 1) matrix with 1 ≤ q ≤ k + 1, and γ is a q × 1 vector.

Example

If β = (β1, β2, β3, β4)>,

D = ( 1 0 −1  0
      0 1  0 −1 ),   and   γ = (0, 0)>,

then

H0 : β1 = β3 and β2 = β4,   HA : β1 ≠ β3 or β2 ≠ β4.

49 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

By suitably choosing D and γ, Wald tests are much more flexible than F tests.

Test statistic:

W1 = (Dβ̂ − γ)>E−1(Dβ̂ − γ) / (σ̂²q),

where E = D(X>X)−1D>.

What is the distribution of W1 under H0?

(i) Dβ̂ − γ ∼ N(0, σ²E) under H0. (Why? Under H0, Dβ̂ − γ = D(β̂ − β).)

(ii) (Dβ̂ − γ)>E−1(Dβ̂ − γ)/σ² ∼ χ²(q) (by Fact 6).

(iii) β̂ and σ̂² are independent. (Why? We've argued this previously!!)

(iv) σ̂²/σ² ∼ χ²(n − k − 1)/(n − k − 1). (We've already shown this!!)

(v) By (i)–(iv), W1 ∼ F(q, n − k − 1) under H0.

(vi) Now you can choose an α, find the critical value from the F table, and establish your α-level test!!
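A sketch of the Wald statistic W1 for one hypothetical linear restriction (the choice D = [0, 1, −1], γ = 0, i.e., β1 = β2, is invented purely for illustration; the simulated data are as in the earlier sketches):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)

D = np.array([[0.0, 1.0, -1.0]])                 # hypothetical restriction β1 = β2
gamma = np.array([0.0])
q = D.shape[0]

E = D @ np.linalg.inv(X.T @ X) @ D.T
diff = D @ beta_hat - gamma
W1 = diff @ np.linalg.solve(E, diff) / (sigma2_hat * q)

crit = stats.f.ppf(0.95, q, n - k - 1)           # F(q, n-k-1) critical value
print(W1, crit, W1 >= crit)
```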

50 / 162

Regression Analysis

Finite Sample Theory

Gaussian Regression

(c) t-test

Consider the hypothesis

H0 : βj = b, where 1 ≤ j ≤ k and b is known,

against the alternative

HA : βj ≠ b.

We have:

(i) β̂ − β ∼ N(0, (X>X)−1σ²) [see D1], and hence under H0,

β̂j − b = ej>(β̂ − β) ∼ N(0, ej>(X>X)−1ej σ²),

where ej = (0, . . . , 0, 1, 0, . . . , 0)> with the jth component equal to 1 and the others zero.

(ii) σ̂²/σ² ∼ χ²(n − k − 1)/(n − k − 1). [see D2]

(iii) σ̂² and β̂j are independent. (why?)

Regression Analysis

Finite Sample Theory

Gaussian Regression

(iv) By (i)–(iii),

[(β̂j − b)/√(ej>(X>X)−1ej σ²)] / √(σ̂²/σ²) = (β̂j − b)/√(ej>(X>X)−1ej σ̂²) ≡ T ∼ t(n − k − 1) under H0,

where t(n − k − 1) is the t-distribution with n − k − 1 degrees of freedom.

(v) Testing rule: reject H0 if |T| > tα/2(n − k − 1).

We have PH0(|T| > tα/2(n − k − 1)) = α, and hence this is a level-α test.
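A sketch of the t statistic for a single coefficient (the tested index j and value b are arbitrary illustrations; the data are simulated as before):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)

j, b = 1, 0.0                                    # test H0: β1 = 0 (hypothetical)
XtX_inv = np.linalg.inv(X.T @ X)
se_j = np.sqrt(XtX_inv[j, j] * sigma2_hat)       # √(ej>(X>X)^-1 ej σ̂²)
T = (beta_hat[j] - b) / se_j

crit = stats.t.ppf(1 - 0.05 / 2, n - k - 1)      # t_{α/2}(n-k-1) with α = 5%
print(T, crit, abs(T) > crit)                    # reject H0 if |T| > crit
```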

52 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Interval Estimation

We first recall some results on point estimation:

(i) E(β̂) = β and E(σ̂²) = σ² (unbiasedness).

(ii) Var(β̂) = (X>X)−1σ².

(iii) β̂ is BLUE!!

(iv) Var(σ̂²) = 2σ⁴/(n − k − 1) → 0 as n → ∞ (under the normal assumption)

[which is a desirable result, because it shows that the estimation quality keeps improving as the sample size grows!!]

To see this, note first that

(a) σ̂²/σ² ∼ χ²(n − k − 1)/(n − k − 1),

(b) E(χ²(n − k − 1)) = n − k − 1,

(c) Var(χ²(n − k − 1)) = 2(n − k − 1).

By (a)–(c), Var(σ̂²) = 2σ⁴/(n − k − 1) follows.

53 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

However, if the normal assumption fails to hold, how should us calculate Var(σ2)?

Some ideas:

σ2 =1

n− k − 1y>(I −Mk)y =

1

n− k − 1ε>(I −Mk)ε

=1

n− k − 1

n∑i=1

n∑j=1

Aijεiεj ,

where [Aij ]1≤i,j≤n ≡ A = I −Mk. It is clear that

E(σ2) =1

n− k − 1

n∑i=1

n∑j=1

AijE(εiεj)

why?=

1

n− k − 1

n∑i=1

Aiiσ2 =

σ2

n− k − 1tr((I −Mk)) = σ2.

54 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Moreover, we have

E(σ4) =

(1

n− k − 1

)2 n∑i=1

n∑j=1

n∑k=1

n∑l=1

AijAklE(εiεjεkεl)

=

(1

n− k − 1

)2 n∑i=1

A2iiE(ε4i ) (i = j = k = l)

+

(1

n− k − 1

)2 ∑1≤i,k≤ni 6=k

AiiAkkE(ε2i )E(ε2k) (i = j 6= k = l)

+

(1

n− k − 1

)2 ∑1≤i,j≤ni 6=j

A2ijE(ε2i )E(ε2j ) (i = k 6= j = l)

+

(1

n− k − 1

)2 ∑1≤i,j≤ni 6=j

AijAjiE(ε2i )E(ε2j ) (i = l 6= j = k),

where∑

1≤i,j≤ni 6=j

AijAjiE(ε2i )E(ε2j ) =∑

1≤i,j≤ni 6=j

A2ijE(ε2i )E(ε2j ) (since A is symmetric).

55 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Simple algebra shows that

E(σ4) =

(1

n− k − 1

)2

(E(ε41)− 3σ4)

n∑i=1

A2ii

+

(1

n− k − 1

)2

σ4

n∑i=1

n∑k=1

AiiAkk + 2

n∑i=1

n∑j=1

A2ij

=

1

(n− k − 1)2(E(ε41)− 3σ4)

n∑i=1

A2ii + σ4 +

2σ4

n− k − 1.

Note

(i) E(ε41)− 3σ4 = 0 if ε is normal.

(ii)∑ni=1

∑nk=1AiiAkk = (tr(A))2 = (tr(I −Mk))2 = (n− k − 1)2.

(iii)∑ni=1

∑nj=1A

2ij = tr(A2) = tr((I −Mk)2) = tr(I −Mk) = n− k − 1.

Hence

Var(σ2) =1

(n− k − 1)2(E(ε41)− 3σ4)

n∑i=1

A2ii +

2σ4

n− k − 1.

56 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Will1

(n− k − 1)2(E(ε41)− 3σ4)

n∑i=1

A2ii converge to zero as n→∞?

Yes, because

n∑i=1

A2ii ≤

n∑i=1

Aii = tr(A) = tr(I −Mk) = n− k − 1.

To see this, we note that the idempotent property of A yields

Aii =

n∑j=1

A2ij ≥ A2

ii (which also yields 0 ≤ Aii ≤ 1).

57 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

We now get back to interval estimation.

(i) The first goal is to find an interval Iα such that

P(βi ∈ Iα) = 1 − α,

where α is small and chosen by the user; 1 − α is called the "confidence level".

How do we construct Iα?

(a) (β̂i − βi)/√(ei>(X>X)−1ei σ̂²) ∼ t(n − k − 1).

(b) P(βi ∈ (Li, Ri)) = 1 − α, where

Li = β̂i − t1−α/2(n − k − 1) √(ei>(X>X)−1ei σ̂²),
Ri = β̂i + t1−α/2(n − k − 1) √(ei>(X>X)−1ei σ̂²).
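A sketch computing the interval (Li, Ri) for each coefficient on the simulated data used above (illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)           # t_{1-α/2}(n-k-1)
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2_hat)

for i, (lo, hi) in enumerate(zip(beta_hat - t_crit * se, beta_hat + t_crit * se)):
    print(f"beta_{i}: ({lo:.3f}, {hi:.3f})")
```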

58 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Does the interval described in (b) have the shortest length?

To answer this question, we need to solve the following problem:

minimizing b− a subject to F (b)− F (a) = 1− α,

where F (·) denotes the distribution function of t(n− k − 1) distribution, and

P

(a <

βi − βi√e>i (X>X)−1eiσ2

≤ b

)= F (b)− F (a) = 1− α.

By the Lagrange method, define

g(a, b, λ) = b− a− λ(F (b)− F (a)− (1− α))

and let ∇g(a, b, λ) = 0, where ∇g = ( ∂g∂a ,∂g∂b ,

∂g∂λ )>. The last identity yield{

f(b) = f(a) = 1λ ,

F (b)− F (a) = 1− α, (∗)

where f(·) is the pdf of t(n− k − 1) distribution.

59 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Since the pdf of t(n− k− 1) is symmetric and strictly decreasing (increasing)when x ≥ 0 (when x ≤ 0), (∗) implies b = −a and b > 0.

As a result, the unique solution to (∗) is (−b, b) with 2F (b) = 2− α, i.e.,

(−t1−α/2(n− k − 1), t1−α/2(n− k − 1)).

To check whether 2t1−α/2(n− k − 1) minimizes b− a, we still need toconsider the so-called ”bordered” Hessian matrix evaluated at

s∗ =

a∗b∗λ∗

=

−t1−α/2(n− k − 1)t1−α/2(n− k − 1)

1f(t1−α/2(n−k−1))

.

60 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Note that the bordered Hessian matrix is defined by

∇2g =

∂2g∂a∂a

∂2g∂a∂b

∂2g∂a∂λ

· ∂2g∂b∂b

∂2g∂b∂λ

· · ∂2g∂λ∂λ

,

where ∂2g∂λ∂λ = 0, and it is straightforward to show that

∇2g(s∗) =

f ′(−t1−α/2(n−k−1))

f(t1−α/2(n−k−1))0 f(−t1−α/2(n− k − 1))

0−f ′(t1−α/2(n−k−1))

f(t1−α/2(n−k−1))−f(t1−α/2(n− k − 1))

f(−t1−α/2(n− k − 1)) −f(t1−α/2(n− k − 1)) 0

.

Since the principal submatrix f ′(−t1−α/2(n−k−1))f(t1−α/2(n−k−1))

0

0−f ′(t1−α/2(n−k−1))f(t1−α/2(n−k−1))

is positive definite, it follows that 2t1−α/2(n− k − 1) minimizes b− a subjectto F (b)− F (a) = 1− α.

61 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

(ii) The second goal is to find a (k + 1)-dimensional set Vα such that

P (β ∈ Vα) = 1− α.

How to construct Vα?

(a)(β − β)>X>X(β − β)

σ2∼ χ2(k + 1). (by Fact 6)

(b)(β − β)>X>X(β − β)

(k + 1)σ2∼ F (k + 1, n− k − 1).

62 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

(c) Vα:

a ≤ (β − β)>X>X(β − β)

(k + 1)σ2≤ b,

where F ∗(b)− F ∗(a) = 1− α and F ∗(·) is the distribution function ofF (k + 1, n− k − 1).

63 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

(d) It can be shown that the volume of the larger ellipsoid is

πk+12

Γ(k+12 + 1

) ((k + 1)σ2b) k+1

2 (det(X>X))−1/2,

and that of the smaller one is

πk+12

Γ(k+12 + 1

) ((k + 1)σ2a) k+1

2 (det(X>X))−1/2.

64 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

Hence the volume of Vα is minimized by

minimizing bk+12 − a

k+12 subject to F ∗(b)− F ∗(a) = 1− α.

However, in general, this minimization problem does not have a closed formsolution, but it can be shown that when k = 1,

a∗ = 0, and b∗ = F1−α(k + 1, n− k − 1),

and when both n and k are large and n� k,

a∗ ∼ 0 and b∗ ∼ F1−α(k + 1, n− k − 1).

Note also that unlike the t-distributions, when d1 > 1, the pdfs of Fdistributions have very small values near the origin.

65 / 162

Regression Analysis

Finite Sample Theory

Interval Estimation

66 / 162

Regression Analysis

Finite Sample Theory

Another look at β

Another look at β

Let Xk = (Xk−1, xk). Because C(Xk) and C(Xk−1, (I − Mk−1)xk) are the same, we have

Mk y = (Xk−1, xk) (β̂k−1>, β̂k)>
     = (Xk−1, (I − Mk−1)xk) [ (Xk−1>Xk−1)−1              0
                              0>           1/(xk>(I − Mk−1)xk) ] (Xk−1>y, xk>(I − Mk−1)y)>,  (why?)

yielding

Xk−1β̂k−1 + xk β̂k = Xk−1[(Xk−1>Xk−1)−1Xk−1>y − (Xk−1>Xk−1)−1Xk−1>xk β*k] + xk β*k,

where β*k = xk>(I − Mk−1)y / xk>(I − Mk−1)xk.

67 / 162

Regression Analysis

Finite Sample Theory

Another look at β

In addition, since Xk is of full rank, we obtain

β̂k = β*k = xk>(I − Mk−1)y / xk>(I − Mk−1)xk.

This shows that β̂k equals the LSE from the simple regression of (I − Mk−1)y on (I − Mk−1)xk.

As a result, β̂k can only be interpreted as the marginal contribution of xk to y once the effects of the other variables have been removed.

68 / 162

Regression Analysis

Finite Sample Theory

Model Selection

Model Selection

Mallows’ Cp:

Let Xp be a submodel of Xk.

69 / 162

Regression Analysis

Finite Sample Theory

Model Selection

Can we construct a measure that describes its prediction performance?

Let Mp be the orthogonal projection matrix onto C(Xp). Then Mp y can be used to predict the new observations

yNew = Xkβ + εNew,

where εNew and ε are independent but have the same distribution.

The performance of Mp y can be measured by

E‖yNew − Mp y‖² = E‖Xkβ + εNew − Mp y‖² = nσ² + E‖Xkβ − Mp y‖².  (∗)

(∗): the last equality holds since εNew and y are independent.

70 / 162

Regression Analysis

Finite Sample Theory

Model Selection

Let Xk = (Xp, X−p) and β = (βp>, β−p>)>.

Moreover, we have

E‖Xkβ − Mp y‖² = E‖Xpβp + X−pβ−p − Mp(X−pβ−p + Xpβp + ε)‖²
               = E‖(I − Mp)X−pβ−p − Mpε‖²
               = pσ² + β−p>X−p>(I − Mp)X−pβ−p.  (why?)

Hence,

E‖yNew − Mp y‖² = (n + p)σ² + β−p>X−p>(I − Mp)X−pβ−p.

71 / 162

Regression Analysis

Finite Sample Theory

Model Selection

To estimate this expectation, we start by considering

SSRes(p) = y>(I − Mp)y.

Note first that

E(SSRes(p)) = E[(X−pβ−p + ε)>(I − Mp)(X−pβ−p + ε)]
            = β−p>X−p>(I − Mp)X−pβ−p + E(ε>(I − Mp)ε)
            = β−p>X−p>(I − Mp)X−pβ−p + (n − p)σ².

Therefore,

E(SSRes(p) + 2pσ²) = β−p>X−p>(I − Mp)X−pβ−p + (n + p)σ² = E‖yNew − Mp y‖².

Now, Mallows' Cp is defined as

SSRes(p) + 2pσ²,

which is an unbiased estimate of E‖yNew − Mp y‖².

72 / 162
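A sketch of Mallows' Cp as a model-selection score over a few nested submodels (illustrative; the full-model residual variance is plugged in for σ², an assumption commonly made in practice, and the data are simulated as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

Mk = X @ np.linalg.inv(X.T @ X) @ X.T
sigma2_full = y @ (np.eye(n) - Mk) @ y / (n - k - 1)     # σ̂² from the full model

def mallows_cp(cols):
    """SSRes(p) + 2 p σ̂² for the submodel using the given columns of X."""
    Xp = X[:, cols]
    Mp = Xp @ np.linalg.inv(Xp.T @ Xp) @ Xp.T
    return y @ (np.eye(n) - Mp) @ y + 2 * len(cols) * sigma2_full

for cols in ([0], [0, 1], [0, 1, 2]):                    # intercept only, ..., full
    print(cols, mallows_cp(cols))
```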

Regression Analysis

Finite Sample Theory

Prediction

Prediction

(a) How do we predict E(yn+1) = xn+1>β when xn+1 = (1, xn+1,1, . . . , xn+1,k)> is available?

Point prediction: xn+1>β̂.

Prediction interval (under normality):

(i) xn+1>(β̂ − β) ∼ N(0, xn+1>(X>X)−1xn+1 σ²).

Sometimes Xk is used in place of X, in particular when the model selection issue is taken into account.

(ii) xn+1>(β̂ − β)/√(xn+1>(X>X)−1xn+1 σ̂²) ∼ t(n − k − 1).

(iii) Please construct a (1 − α)-level prediction interval by yourself.

(iv) What if the normality assumption is violated?

73 / 162

Regression Analysis

Finite Sample Theory

Prediction

(b) How do we predict yn+1?

Point prediction: xn+1>β̂. (Still the same quantity.)

Prediction interval (under normality):

(i) yn+1 − xn+1>β̂ = εn+1 − xn+1>(β̂ − β) ∼ N(0, (1 + xn+1>(X>X)−1xn+1)σ²).

(yn+1 − xn+1>β̂ is called the prediction error.)

(ii) (yn+1 − xn+1>β̂)/√((1 + xn+1>(X>X)−1xn+1) σ̂²) ∼ t(n − k − 1).

(iii) Please construct your own (1 − α)-level prediction interval for yn+1.
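A sketch of the (1 − α)-level prediction interval for yn+1 implied by (i)–(ii) above (the new design point xn+1 is invented for illustration; the data are simulated as before):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)

alpha = 0.05
x_new = np.array([1.0, 0.5, -1.0])                 # hypothetical (1, x_{n+1,1}, x_{n+1,2})

y_pred = x_new @ beta_hat                          # point prediction x_{n+1}>β̂
se_pred = np.sqrt((1.0 + x_new @ np.linalg.inv(X.T @ X) @ x_new) * sigma2_hat)
t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)

print(y_pred - t_crit * se_pred, y_pred + t_crit * se_pred)
```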

74 / 162

Regression Analysis

Large Sample Theory

Motivation

Large Sample Theory

Motivation

Consider again

y = Xβ + ε.

If the εi's are not normally distributed, how do we make inferences about β and σ²? How do we perform prediction?

Q1: Does β̂ = (X>X)−1X>y → β in probability?

Q2: Does σ̂² = (1/(n − (k + 1))) y>(I − Mk)y → σ² in probability?

Q3: If the answer to Q1 is "yes", what is the limiting distribution of β̂?

Q4: How do we construct confidence intervals for β based on the answer to Q3?

Q5: How do we test linear or even nonlinear hypotheses without normality?

Q6: How do we do prediction without normality?

75 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Question 1

We first answer Q1 in the special case where

X = [ 1 x1
      ⋮  ⋮
      1 xn ].

Definition

A sequence of r.v.s {Zn} is said to converge in probability to a r.v. Z (which may be a non-random constant) if for any ε > 0,

lim_{n→∞} P(|Zn − Z| > ε) = 0,

which is denoted by Zn →pr Z.

Remark

A sequence of random vectors {Zn = (Z1n, . . . , Zkn)>} is said to converge in probability to a random vector Z = (Z1, . . . , Zk)> if Zin →pr Zi for i = 1, . . . , k, which is denoted by Zn →pr Z.

76 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

An answer to Q1:

Since

Var(β) = (X>X)−1σ2 = σ2

Sxx + nx2

nSxx

−nxnSxx−nx

nSxx

1

Sxx

,

we have

P (|β0 − β0| > ε)(∗)≤ σ2

ε2Sxx + nx2

nSxx→ 0 if

x2

Sxx→ 0,

((∗): Chebychev’s inequality, which says if E(X) = µ and Var(X) = σ2, then

P (|X − µ| > ε) ≤ σ2

ε2 ) and

P (|β1 − β1| > ε) ≤ σ2

ε21

Sxx→ 0 if

1

Sxx→ 0,

noting that Sxx =

n∑i=1

(xi − x)2 and x =1

n

n∑i=1

xi.

77 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

As a result, to ensure βpr.→ β, we need

x2

Sxx→ 0 and

1

Sxx→ 0 as n→∞.

Remark

(i) Please give a heuristic explanation of whyx2

Sxx→ 0 is needed for β0 to

converge to β0 in probability.

(ii) Please explain why Cov(β0, β1) is positive (negative) correlated whenx < 0 (x > 0).

(iii) What are the sufficient conditions for βpr.→ β in general cases?

78 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Question 2

An answer to Q2:

We have shown previously that the variance of σ2 converges to 0 as n→∞.Therefore, by Chebyshev’s inequality,

σ2 pr.→ σ2.

79 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Before answering Q3, let us consider the so-called spectral decomposition forsymmetric matrices.

Let A be a k × k symmetric matrix. Then there exist real numbersλ1, . . . , λk and a k-dimensional orthogonal matrix P = (p1, . . . ,pk)satisfying P>P = PP> = I and Api = λipi such that

A = PDP>,

where D =

λ1 0 · · · 0

0. . .

......

. . . 00 · · · 0 λk

.

80 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Remark

(1) λi is called an eigenvalue of A and pi is the eigenvector corresponding to λi.

(2) Let A be positive definite. Then λi > 0 for i = 1, . . . , k.

Proof. 0p.d.< p>i Api = p>i PDP

>piwhy?= λi.↑

by the spectral decomposition

(3) Let A be positive definite. Define

A1/2 = PD1/2P>,

where

D1/2 =

λ1/21 0 · · · 0

0. . .

......

. . . 0

0 · · · 0 λ1/2k

.

Then, we have (A1/2)2 = A.

81 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Remark (cont.)

(4) Define λmax(A) = max{λ1, . . . , λk} and λmin(A) = min{λ1, . . . , λk}. Then,

λmax(A) = sup‖a‖=1

a>Aa and λmin(A) = inf‖a‖=1

a>Aa.

Proof. As shown before,

λi = p>i Api ≤ sup‖a‖=1

a>Aa.

Moreover, for any a ∈ Rk with ‖a‖ = 1, we have a = Pb, where ‖b‖ = 1.Thus,

a>Aa = b>P>PDP>Pb = b>Db =

k∑i=1

λib2i ≤ λmax(A),

where b = (b1, . . . , bk)>. This yields λmax(A) = sup‖a‖=1

a>Aa. The second

statement can be proven similarly.82 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Remark (cont.)

(5) Let A be positive definite. Then λmax(A−1) = 1/λmin(A).

(6) Let B be any real matrix. Define the “spectral norm” of B as follows:

‖B‖ =

(sup‖a‖=1

a>B>Ba

)1/2

=(λmax(B>B)

)1/2.

We have

(i) If B is symmetric with eigenvalues λ1, . . . , λk, then‖B‖ = max{|λ1|, . . . , |λk|}.

(ii) ‖AB‖ ≤ ‖A‖‖B‖, where A is another real matrix whose number of columnsis the same as the number of the rows of B.

(iii) ‖A+B‖ ≤ ‖A‖+ ‖B‖, where A and B have the same numbers of rows andcolumns.

(iv) If B is positive definite, ‖B‖ ≤ tr(B) =∑ki=1 λi, where λi, i = 1, . . . , k, are

the eigenvalues of B.

83 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Remark (cont.)

(7) Let X be the design matrix of a regression model, i.e.,

X =

1 x11 · · · x1k...

......

1 xn1 · · · xnk

.

Then,

λmax(X>X) = sup‖a‖=1

n∑i=1

(a>xi)2,

and

λmin(X>X) = inf‖a‖=1

n∑i=1

(a>xi)2.

(8) Let x ∼ N(0,Σ) be a p-dimensional multivariate normal vector. Then,Σ−1/2x ∼ N(0, I), and hence x>Σ−1x ∼ χ2(p), which has been shownpreviously in a different way.

84 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

We now revisit the question of what makes βpr.→ β in general cases.

The answer to this question is simple. Since

Var(β) = (X>X)−1σ2,

we only needs to show that“each diagonal elements of (X>X)−1 converges to 0”. (∗)

To show (∗), note first that

X>X = T>(T>)−1X>XT−1T ,

where

T =

1 x1 · · · xk0 1 0 · · · 0...

. . .. . .

. . ....

.... . .

. . . 00 · · · 0 1

and T−1 =

1 −x1 · · · −xk0 1 0 · · · 0...

. . .. . .

. . ....

.... . .

. . . 00 · · · 0 1

.

85 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Moreover, we have

(T>)−1X>XT−1 =

(n 0>

0o

X>

(I − En

)o

X

),

where

o

X=

x11 · · · x1k...

...xn1 · · · xnk

and E =

1 · · · 1...

...1 · · · 1

,

noting that

(I − En

)o

X=

x11 − x1 · · · x1k − xk...

...xn1 − x1 · · · xnk − xk

.

where (x11 − x1, . . . , x1k − xk) and (xn1 − x1, . . . , xnk − xk) are the centereddata vectors.

86 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

Hence

(X>X)−1 = T−1

(n−1 0>

0 (o

X>

(I − En

)o

X)−1

)(T−1)>,

yielding

(X>X)−1 =

(1

n+ x>D−1x −x>D−1

−D−1x D−1

),

where (T−1)> = (T>)−1, x = (x1, . . . , xk)>, and D =o

X>

(I − En

)o

X.

87 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

This implies for each 1 ≤ i ≤ k + 1,

(X>X)−1ii ≤ max

{1

n+ x>D−1x, λmax(D−1)

}≤ max

{1

n+‖x‖2

λmin(D),

1

λmin(D)

},

where λmax(D−1) =1

λmin(D), which converges to 0 if

(i) λmin(o

X>

(I − En

)o

X)n→∞→ ∞,

(ii)

k∑i=1

x2i

λmin(o

X>

(I − En

)o

X)

n→∞→ 0.

Compare these two conditions with the answer given for Q1.

88 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory I

This is illustrated in a figure (omitted here). The conditions above require:

(i) even in the direction in which the data are most narrowly spread (as seen from the location of (x̄1, . . . , x̄k)), there must be a sufficiently large sum of squares (information);

(ii) the squared distance from the center of the data to the origin must be negligible compared with λmin(X̊>(I − E/n)X̊).

89 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Question 3

Note first that

β − β = (X>X)−1X>(y −Xβ)

= (X>X)−1X>ε

=

(n∑i=1

xix>i

)−1n∑i=1

εi

n∑i=1

xiεi

,

noting that we first consider X =

1 x1...

...1 xn

.

90 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Since for ε ∼ N(0, σ2I), we have

β − β ∼ N(0, σ2(X>X)−1),

it is natural to conjecture that when ε is not normally distributed,

(X>X)1/2

σ(β − β)

d→ N(0, I). (∗)

Definition

A sequence of random vectors, {xn}, is said to converge to a random vector, x, indistribution if

P (xn ≤ c)→ P (x ≤ c) ≡ F (c) as n→∞,

for all continuous points of F (·), the distribution function of x, which is denoted

by xnd→ x.

91 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Remark

Cramer-Wold Device:

xnd→ x ⇔ a>xn

d→ a>x for any ‖a‖ = 1.

Therefore, (∗) holds iff

a>(X>X)1/2

σ(β − β) = a>

(n∑i=1

xix>i

)−1/2n∑i=1

εiσ

n∑i=1

xiεiσ

=

n∑i=1

(w1n + w2nxi

σ

)εi

d−→ N(0, 1),

where (w1n, w2n) = a>

(n∑i=1

xix>i

)−1/2.

92 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Lindeberg's Central Limit Theorem (for sums of independent r.v.s)

Let Z1n, . . . , Znn be independent r.v.s with E(Zin) = 0 and ∑_{i=1}^n E(Zin²) = ∑_{i=1}^n σin² = 1 for all n. If for any δ > 0,

∑_{i=1}^n E(Zin² I{|Zin| > δ}) −→ 0 as n → ∞,  (Lindeberg's condition)

then

∑_{i=1}^n Zin →d N(0, 1).

(Intuition: no single term dominates, so after this "even" mixing the characteristics of the individual distributions disappear and a normal distribution emerges.)
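A small simulation sketch of the CLT for standardized sums (purely illustrative; the choice of a skewed Exp(1) distribution and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 20000

# Z_in = (X_i - 1)/sqrt(n) for X_i ~ Exp(1): mean 0, variances summing to 1
samples = rng.exponential(scale=1.0, size=(reps, n)) - 1.0
sums = samples.sum(axis=1) / np.sqrt(n)

# The standardized sums should look standard normal: mean ≈ 0, var ≈ 1,
# and roughly 95% of them within ±1.96
print(sums.mean(), sums.var(), np.mean(np.abs(sums) < 1.96))
```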

93 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Remark

(1) Lindegerg’s condition implies

max1≤i≤n

σ2in −→ 0 as n→∞.

To see this, we note that for “any” δ > 0

max1≤i≤n

σ2in = max

1≤i≤nE(Z2

in) ≤ max1≤i≤n

E(Z2inI|Zin|>δ

)+ δ2.

Since the first term converges to 0 by Lindeberg’s condition and since δ canbe arbitrarily small, the desired conclusion follows.

(2) Lindeberg’s condition ⇔ CLT + max1≤i≤n

σ2in −→ 0 as n→∞.

94 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Now, we are in the position to check Lindeberg’s condition for

Zin =

(w1n + w2nxi

σ

)εi

denoted by≡ vinεi.

(i) E(vinεi) = 0. (easy)

(ii)n∑i=1

E(v2inε2i ) = 1. (easy but why?)

(iii) Assume E1/2(ε41) < C1 <∞, for some constants C1, C2,n∑i=1

E[v2inε

2i I{v2inε2i>δ2}

]=

n∑i=1

v2inE[ε2i I{v2inε2i>δ2}

]why?≤

n∑i=1

v2inE1/2(ε4i )P

1/2(v2inε2i > δ2)

≤ C1

n∑i=1

v2inE1/2(v2inε

2i )

δ

≤ C2

(n∑i=1

v2in

)max1≤i≤n

|vin| ≤ C3 max1≤i≤n

|vin|.95 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Therefore, Lindeberg’s condition holds for vinεi if

max1≤i≤n

(v2in) = σ−2 max1≤i≤n

a>( n∑i=1

xix>i

)−1/2(1xi

)2

≤ σ−2a>

(n∑i=1

xix>i

)−1a(1 + max

1≤i≤nx2i )

(∗)≤ σ−2λmax

( n∑i=1

xix>i

)−1 (1 + max1≤i≤n

x2i )

= σ−21 + max

1≤i≤nx2i

λmin

(n∑i=1

xix>i

) −→ 0, as n→∞.

96 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

(∗) To see this, we have by spectral decomposition for A,A = PDP>, where

D =

λ1 . . .

λk

with 0 < λ1 ≤ λ2 ≤ · · · ≤ λk, and P = (p1, . . . ,pk) satisfies

Api = λipi, p>i pi = 1, and p>i pj = 0 for i 6= j.

Hence,

p>kApk = p>k PDP>pk = (0, . . . , 0, 1)

λ1 . . .

λk

0...01

= λk ≤ sup

‖a‖=1

a>Aa.

97 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

On the other hand, for any a ∈ Rk with ‖a‖ = 1, we can express it asa = Pb with ‖b‖ = 1. Thus,

a>Aa = b>P>PD>P>Pb = b>Db =

k∑i=1

λib2i ≤ λk,

where b = (b1, . . . , bk)>. As a result,

λk = sup‖a‖=1

a>Aa.

Similarly, it can be shown that

λ1 = inf‖a‖=1

a>Aa.

98 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

To give a more comprehensive sufficient condition, we note that

λmin

(n∑i=1

xix>i

)= λmin

((n

∑xi∑

xi∑x2i

))= λmin

((1 0x 1

)(1 0−x 1

)(n

∑xi∑

xi∑x2i

)×(

1 −x0 1

)(1 x0 1

))= λmin

((1 0x 1

)(n 00 Sxx

)(1 x0 1

))why?

≥ min{n, Sxx}λmin

((1 0x 1

)(1 x0 1

))(∗)≥ C min{n, Sxx},

provided x <∞, where Sxx =∑

(xi − x)2.

Explain

“why?” : λmin(B>AB) ≥ λmin(B>B)λmin(A)

(∗) : if x <∞, λmin

((1 0x 1

)(1 x0 1

))is “bounded away” from 0. (We will

show this later.)99 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

In view of this, a set of more transparent sufficient conditions for the Lindeberg’scondition is

(i)

max1≤i≤n

x2i

n−→ 0,

(ii) Sxx −→∞, [this one is also needed for βpr.−→ β.]

(iii)

max1≤i≤n

x2i

Sxx−→ 0.

Can you answer Q3 under general multiple regression models?

100 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

In fact, for general multiple regression (k ≥ 1), it is not difficult to show that Lindeberg's condition holds when

(1 + max_{1≤i≤n} ∑_{j=1}^k xij²) / λmin(X>X) −→ 0 as n → ∞.  (>)

(Compare this with the k = 1 case.)

101 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

A further question is whether we can obtain conditions analogous to (i), (ii), (iii) of the k = 1 case under which (>) holds. To answer this question, we need a bit of linear algebra.

(1) Let

T = [ 1  c1 · · · ck
      0  1  0 · · · 0
      ⋮      ⋱      ⋮
      0  0  · · · 0 1 ] = ( 1  c>
                            0  Ik ),

where c = (c1, . . . , ck)> and Ik is the k-dimensional identity matrix. Then we have

λmin(T>T) ≥ 1/(2 + c>c).  (∗)

102 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Proof of (∗)Since (∗) holds trivially when c = 0, we only consider the case c 6= 0.

Note first that

E∗ = T>T =

(1 c>

c cc> + Ik

),

and the eigenvalues of E∗ are those λ’s satisfying

det(E∗ − λIk+1) = 0, (∗∗)

where Ik+1 is the (k + 1)-dimensional identity matrix.

In addition,

det(E∗ − λIk+1) = det

(1− λ c>

c cc> + (1− λ)Ik

)

=

det

(0 c>

c cc>

), ifλ = 1;

det

(1− λ 0>

c (1− 11−λ )cc> + (1− λ)Ik

), ifλ 6= 1.

103 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Proof of (∗) (cont.)

For λ = 1,

det

(0 c>

c cc>

)=

{−c21 6= 0, if k = 1;0, if k > 1.

For λ 6= 1,

det

(1− λ 0>

c (1− 11−λ )cc> + (1− λ)Ik

)= (1− λ) det

((1− 1

1− λ

)cc> + (1− λ)Ik

)because this is a triangular matrix

= (1− λ)k+1 det

(Ik +

(1

1− λ −1

(1− λ)2

)cc>

)det(aAk) = ak det(Ak)

= (1− λ)k+1 det(Ik)

(1 +

(1

1− λ −1

(1− λ)2

)c>c

)Please try to prove det(A+ abb>) = det(A)(1 + ab>A−1b)

= (1− λ)k−1(λ2 − (2 + c>c)λ+ 1).104 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

Proof of (∗) (cont.)

Therefore, the roots for (∗∗) are

λ = 1 or λ =

(2 + c>c)

(1±

√1− 4

(2 + c>c)2

)2

,

yielding

λmin(T>T ) ≥ min

1,

(2 + c>c)

(1−

√1− 4

(2 + c>c)2

)2

≥ min

{1,

1

2 + c>c

} (since

√1− x ≤ 1− x

2

)=

1

2 + c>c.

Thus the proof of (∗) is complete.105 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

(2) We have shown previously that

X>X = T>(n 0>

0 D

)T ,

where

T =

1 x1 · · · xk0 1 0 · · · 0...

. . .. . .

. . ....

.... . .

. . . 00 · · · 0 1

and D =o

X>

(I − En

)o

X .

106 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

By λmin(B>AB) ≥ λmin(B>B)λmin(A) and (∗), we obtain

λmin(X>X) ≥ λmin(T>T )λmin

(n 0>

0 D

)≥ 1

2 +

k∑i=1

x2i

min{n, λmin(D)}

λmin

(n 0>

0 D

)= min{n, λmin(D)}

≥ 1

2 + Vmin{n, λmin(D)}.

Here we assume ∑_{i=1}^k x̄i² < V < ∞ (to keep the discussion focused).

107 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory II

(3) Finally, for (>) to hold, we give the following sufficient conditions:

(i′) max_{1≤i≤n} ∑_{j=1}^k xij² / n −→ 0,

(ii′) λmin(D) −→ ∞ (whose meaning was explained earlier),

(iii′) max_{1≤i≤n} ∑_{j=1}^k xij² / λmin(D) −→ 0.

It is clear that (i), (ii), (iii) correspond to (i′), (ii′), (iii′).

108 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

Questions 4 and 5

Q4 and Q5: How does one construct confidence intervals (CIs) and testing ruleswhen ε is not normal?

Some basic probabilistic tools

(A) Slutsky’s Theorem.

If Xnd−→X, Yn

pr.−→ a and Znpr.−→ b, where a is a vector of real numbers

and b is a real number, then

Y >n Xn + Znd−→ a>X + b.

Corollary. If Xnd−→X and Yn −Xn

pr.−→ 0, then Ynd−→X.

Proof. Since Yn = Xn − (Xn − Yn), the conclusion follows immediately fromSlutsky’s Theorem.

109 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

(B) Big-O and small-o notation for a sequence of random vectors.

Let {an} be a sequence of positive numbers. We say

Xn = Op(an),

where {Xn} is a sequence of random vectors, if for any ε > 0 there exist 0 < Mε < ∞ and a positive integer N such that for all n ≥ N,

P(‖Xn/an‖ > Mε) < ε,

and

Xn = op(an) if Xn/an →pr 0.

(C) Big-O and small-o notation for a sequence of vectors of real numbers.

Let {wn} be a sequence of vectors of real numbers and {an} a sequence of positive numbers. We say wn = O(an) if there exist 0 < M < ∞ and a positive integer N such that for all n ≥ N,

‖wn/an‖ < M,

and wn = o(an) if wn/an → 0.

(D) Some rules. Let Xn = op(1), Op(1), o(1) or O(1), and Yn = op(1), Op(1), o(1) or O(1). (Entries below the diagonal follow by symmetry.)

For "+":                          For "×" (product):

      op   Op   o    O                  op   Op   o    O
op    op   Op   op   Op            op   op   op   op   op
Op    −    Op   Op   Op            Op   −    Op   op   Op
o     −    −    o    O             o    −    −    o    o
O     −    −    −    O             O    −    −    −    O

111 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

(E) If Xn = Op(an), then Xn/an = Op(1),

If Xn = op(an), then Xn/an = op(1).

(F) If Xnd−→X, then Xn = Op(1), and if E‖Xn‖q < K <∞ for some q > 0

and for all n, then Xn = Op(1).

(G) If Xnpr.−→X and Yn

pr.−→ Y , then

(Xn

Yn

)pr.−→(XY

).

If Xnd−→X and Yn

d−→ Y , then

(Xn

Yn

)d−→(XY

), provided {Xn} and

{Yn} are independent.

(H) Continuous mapping theorem.

If Xnpr. or d−→ X and f(·) is a continuous function, then f(Xn)

pr. or d−→ f(X).

112 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

(I) Delta method.

If √n(Zn − u) →d N(0k×1, Vk×k) and f(·) = (f1(·), . . . , fm(·))> : Rk → Rm is a "sufficiently smooth" function, then

√n(f(Zn) − f(u)) →d N(0m×1, (∇f(u))>V(∇f(u))),  (∗)

where

∇f(·) = [ ∂f1(·)/∂x1 · · · ∂fm(·)/∂x1
          ⋮                      ⋮
          ∂f1(·)/∂xk · · · ∂fm(·)/∂xk ]

is a k × m matrix.

Sketch of the proof. By Taylor's theorem, f(Zn) ≈ f(u) + (∇f(u))>(Zn − u), which yields

√n(f(Zn) − f(u)) ≈ (∇f(u))>√n(Zn − u).

This and the CLT for Zn (given as an assumption) lead to the desired conclusion.

113 / 162
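A simulation sketch of the delta method for the scalar map f(z) = z² applied to a sample mean (purely illustrative; the Poisson(2) choice is arbitrary): √n(Z̄n² − µ²) should be approximately N(0, (2µ)²σ²).

```python
import numpy as np

rng = np.random.default_rng(2)
mu, var, n, reps = 2.0, 2.0, 400, 20000        # Poisson(2): mean 2, variance 2

z_bar = rng.poisson(lam=mu, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (z_bar ** 2 - mu ** 2)     # √n (f(Z̄n) − f(µ)) with f(z) = z²

# Delta method: asymptotic variance (f'(µ))² σ² = (2µ)² σ²
print(stat.var(), (2 * mu) ** 2 * var)         # the two should be close
```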

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

We are now ready to answer Q4 & Q5.

(1) An alternative version of CLT for β.

Recall that(X>X)1/2

σ(β − β)

d−→ N(0, I),

under suitable conditions. (What are they?)

Assume

Rn =1

nX>X =

1

n

n∑i=1

xix>in→∞−→ R,

where R is a positive definite matrix.

Then, it can be shown that

1

σR1/2

√n(β − β)

d−→ N(0, I). (∗)

By (∗), we have

√n(β − β)

d−→ N(0,R−1σ2). (∗∗)114 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

Additional materials

(i) ‖σ−1(R1/2n −R1/2)

√n(β − β)‖ ≤ σ−1‖R1/2

n −R1/2‖‖√n(β − β)‖

(‖Ax‖2 = x>A>Ax ≤ ‖A‖2‖x‖2)

(ii) ‖R1/2n −R1/2‖ = o(1) (it’s obvious)

(iii) E‖√n(β − β)‖2 why?

= tr((X>Xn )−1)σ2

= tr(R−1n )σ2 n→∞−→ tr(R−1)σ2<∞. (R is p.d.)

(iv) By (i)–(iii), we have∥∥∥σ−1(R1/2n −R1/2)

√n(β − β)

∥∥∥ = o(1)Op(1) = op(1),

yielding1

σR1/2

√n(β − β) and

1

σR1/2n

√n(β − β)

have the same limiting distribution (by Slutsky’s Theorem), which isN(0, I).

115 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

(2) Consider the problem of testing a nonlinear null hypothesis,

H0 : β0 + β21 = d,

for some known d, against the alternative hypothesis,

HA : β0 + β21 6= d.

For simplify the discussion, we again assume that

X =

1 X1

......

1 Xn

, hence β =

(β0β1

)and β =

(β0β1

).

Set f(β) = β0 + β21 . Then ∇f(β) =

(1

2β1

).

116 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

By the δ-method and (∗∗), we obtain

√n(f(β)− f(β))

H0=√n(f(β)− d)

d−→ N

(0, (1, 2β1)R−1

(1

2β1

)σ2

),

which implies√n(f(β)− d)

σ

√(1, 2β1)R−1

(1

2β1

) d−→ N(0, 1). (∗ ∗ ∗)

Moreover, it holds that

σ

√(1, 2β1)R−1n

(1

2β1

)pr.−→ σ

√(1, 2β1)R−1

(1

2β1

).

(β1

pr.−→ β1Rn −→ R

)117 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

This, (∗ ∗ ∗) and Slutsky’s Theorem together imply√n(f(β)− d)

σ

√(1, 2β1)R−1n

(1

2β1

) d−→ N(0, 1).

This result enables us to construct the following testing rule:reject H0 if

f(β) = β0 + β21 > d+

1.96σ

√(1, 2β1)R−1n

(1

2β1

)√n

or

f(β) = β0 + β21 < d−

1.96σ

√(1, 2β1)R−1n

(1

2β1

)√n

which is an “asymptotic” level 5% test, i.e.,

PH0(reject H0)

n→∞−→ 5%.118 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

(3) Consider the problem of testing the linear hypothesis

H0 : Dq×kβk×1 = γq×1 against HA :∼ H0,

where Dq×k and γq×1 are known.

Set f(β) = Dβ. By the δ-method and the CLT for β, we have under H0,

√n(f(β)− γ)

d−→ N(0,DR−1D>σ2),

and hence by the continuous mapping theorem,

n(f(β)− γ)>(DR−1D>)−1(f(β)− γ)

σ2

d−→ χ2(q).

This, σ2 pr.−→ σ2, Rnn→∞−→ R, and Slutsky’s theorem further give (some

algebraic manipulations are needed!!)

w1 =n(f(β)− γ)>(DR−1n D

>)−1(f(β)− γ)

σ2

d−→ χ2(q).

119 / 162

Regression Analysis

Large Sample Theory

Toward Large Sample Theory III

Therefore, the following testing rule:

reject H0 if w1 > χ21−α(q),

is an asymptotic level α% test.

Please compare this asymptotic test with its counterpart derived from thefinite-sample theory under normal assumptions.

120 / 162

Regression Analysis

Appendix

Statistical View of Spectral Decomposition

Statistical View of Spectral Decomposition

1. Without loss of generality, we can assume Γ = E(xx>) where x is ap-dimensional random vector with E(x) = 0.

2. Define

a1 = argmaxc∈{s∈Rp:‖s‖=1}E((c>x)2) and λ∗1 = E((a>1 x)2).

By Lagrange multipliers method, Γa1 = λ∗1a1. Define

v1 = a>1 x and u1 = argminc∈RpE((x− cv1)>(x− cv1)).

Then,

u1 =E(xv1)

λ∗1=

Γa1

λ∗1= a1,

R1 := x− u1v1 = x− a1v1 = x− a1a>1 x = (Ip − a1a

>1 )x,

and

Γ1 := Var(R1) = E((Ip − a1a>1 )xx>(Ip − a1a

>1 )) = Γ− λ∗1a1a

>1 .

121 / 162

Regression Analysis

Appendix

Statistical View of Spectral Decomposition

3. Define

a2 = argmaxc∈{s∈Rp:‖s‖=1,s>a1=0}E((c>R1)2) and λ∗2 = E((a>2 R1)2).

By Lagrange multipliers method, we let∂∂c

(c>(Γ− λ∗1a1a

>1 )c− h1c>a1 − h2(c>c− 1)

)= 0;

∂∂h1

(c>(Γ− λ∗1a1a

>1 )c− h1c>a1 − h2(c>c− 1)

)= 0;

∂∂h2

(c>(Γ− λ∗1a1a

>1 )c− h1c>a1 − h2(c>c− 1)

)= 0,

and obtain h1 = 0 and h2 = c>Γc. Therefore,

Γ1a2 = (Γ− λ∗1a1a>1 )a2 = Γa2 = λ∗2a2.

122 / 162

Regression Analysis

Appendix

Statistical View of Spectral Decomposition

3. Define

v2 = a>2 R1 = a>2 x and u2 = argminc∈RpE((R1 − cv2)>(R1 − cv2)).

Then,

u2 =E(R1v2)

λ∗2= a2,

and

R2 := R1 − u2v2 = (Ip − (a1a>1 + a2a

>2 ))x.

123 / 162

Regression Analysis

Appendix

Statistical View of Spectral Decomposition

4. By the similar argument as above, we have Rp := (Ip −∑pi=1 aia

>i )x = 0,

and hence

O = Var(Rp) =

(Ip −

p∑i=1

aia>i

(Ip −

p∑i=1

aia>i

)

=

(Γ− 2

p∑i=1

λ∗iaia>i

)+

p∑i=1

p∑j=1

λ∗iaia>i aja

>j

= Γ−p∑i=1

λ∗iaia>i ,

where O is a p× p zero matrix.

5. Define P = (a1, . . . ,ap) and D = diag(λ∗1, . . . , λ∗p). Then,

Γ =

p∑i=1

λ∗iaia>i = PDP>.

124 / 162

Regression Analysis

Appendix

Limit Theorems

Continuous Mapping Theorem

Fact 1

Let Xnpr.−→ X and g be a continuous function on R. Then g(Xn)

pr.−→ g(X).

Proof of Fact 1

For any ε > 0, there exists a large k such that P (|X| > k) ≤ ε2

.

Moreover, we have for any δ > 0 and n ≥ Nδ,ε, P (|Xn −X| > δ) ≤ ε2

.

Since g(x) is uniformly continuous on [−k, k], there exists a δ∗ > 0 such that

|g(x)− g(y)| ≤ ε for all |x− y| ≤ δ∗ and |x| ≤ k.

Now, |g(X)− g(Xn)| > ε implies |X −Xn| > δ∗ or |X| > k, and hence

P (|g(X)− g(Xn)| > ε) ≤ P (|X −Xn| > δ∗) + P (|X| > k) ≤ ε for n ≥ Nδ∗,ε.

Therefore, P (|g(Xn)− g(X)| > ε)→ 0 as n→∞.

Remark

If Xnd−→ X and g is a continuous function on R, then g(Xn)

d−→ g(X).

125 / 162

Regression Analysis

Appendix

Limit Theorems

Fact 2

If X_n →_pr X, then X_n →_d X.

Definition

Let {a_n} be a sequence of real numbers. We denote a_n → 0 by a_n = o(1).

Proof of Fact 2

Goal: F_n(x) → F(x) for all x ∈ C(F), where F(x) = P(X ≤ x), F_n(x) = P(X_n ≤ x), and C(F) is the set of continuity points of F.

Let x ∈ C(F) and x′ < x < x′′. By X_n →_pr X,

P(X ≤ x′) = P(X ≤ x′, X_n ≤ x) + P(X ≤ x′, X_n > x)
          = P(X ≤ x′, X_n ≤ x) + o(1) ≤ F_n(x) + o(1),

and hence F(x′) ≤ liminf_{n→∞} F_n(x).

Similarly, we obtain limsup_{n→∞} F_n(x) ≤ F(x′′), and thus

F(x′) ≤ liminf_{n→∞} F_n(x) ≤ limsup_{n→∞} F_n(x) ≤ F(x′′).

The proof is completed by letting x′ ↑ x, x′′ ↓ x and

F(x) = lim_{x′↑x} F(x′) ≤ liminf_{n→∞} F_n(x) ≤ limsup_{n→∞} F_n(x) ≤ lim_{x′′↓x} F(x′′) = F(x).

Slutsky's Theorem

If X_n − Y_n →_pr 0 and Y_n →_d Y, then X_n →_d Y.

Proof of Slutsky's Theorem

Let x be any continuity point of the c.d.f. of Y, F_Y. Given δ > 0, there exists a small ε > 0 such that x − ε and x + ε are continuity points of F_Y and F_Y(x + ε) − F_Y(x − ε) < δ.

Define F_n(x) = P(X_n ≤ x). Our goal is to show that

F_Y(x − ε) ≤ liminf_{n→∞} F_n(x) ≤ limsup_{n→∞} F_n(x) ≤ F_Y(x + ε),

which implies F_n(x) → F_Y(x) (let δ ↓ 0 and use the continuity of F_Y at x).

Since X_n − Y_n →_pr 0 and Y_n →_d Y, we have

F_n(x) ≤ P(Y_n ≤ x + Y_n − X_n, Y_n − X_n ≤ ε) + P(Y_n − X_n > ε) ≤ P(Y_n ≤ x + ε) + o(1),

F_n(x) = P(Y_n ≤ x + Y_n − X_n, Y_n − X_n ≥ −ε) + o(1)
       ≥ P(Y_n ≤ x − ε, Y_n − X_n ≥ −ε) + o(1)
       ≥ P(Y_n ≤ x − ε) − P(Y_n − X_n < −ε) + o(1) = P(Y_n ≤ x − ε) + o(1),

and hence limsup_{n→∞} F_n(x) ≤ F_Y(x + ε) and liminf_{n→∞} F_n(x) ≥ F_Y(x − ε).


Fact 3

If X_n →_d X and Y_n →_pr c, where c is a constant, then

(a) X_n + Y_n →_d X + c;

(b) X_n Y_n →_d cX.

Proof of Fact 3

For (a), since X_n + Y_n − (X_n + c) = Y_n − c →_pr 0, by Slutsky's theorem it suffices to show that X_n + c →_d X + c, which is obvious.

For (b), it suffices to show that X_n Y_n − cX_n →_pr 0. This is equivalent to showing that if X_n →_d X and Y_n →_pr 0, then X_n Y_n →_pr 0.

Let δ > 0 be an arbitrarily small constant. Then there exists a large M (chosen to be a continuity point of the distribution of |X|) such that P(|X| > M) ≤ δ. Now, for any ε > 0,

P(|X_n Y_n| > ε) ≤ P(|X_n Y_n| > ε, |Y_n| ≤ ε/M) + P(|Y_n| > ε/M)
≤ P(|X_n| > M) + o(1)
= P(|X| > M) + (P(|X_n| > M) − P(|X| > M)) + o(1)
= P(|X| > M) + o(1),

which implies 0 ≤ liminf_{n→∞} P(|X_n Y_n| > ε) ≤ limsup_{n→∞} P(|X_n Y_n| > ε) ≤ δ, and hence X_n Y_n →_pr 0.


Application

Let X_i be i.i.d. ∼ (0, 1) with E(X_1^4) < ∞. Then it follows from

(X_1² + · · · + X_n²)/n →_pr 1,  (X_1 + · · · + X_n)/√n →_d N(0, 1),

and Fact 3 that

√n (X_1 + · · · + X_n) / (X_1² + · · · + X_n²) →_d N(0, 1).
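A quick Monte Carlo sanity check of this application; the particular mean-zero, unit-variance distribution chosen for the X_i is an illustrative assumption.

```python
import numpy as np

# The self-normalized sum sqrt(n) * sum(X_i) / sum(X_i^2) should be
# approximately N(0, 1) for large n.
rng = np.random.default_rng(2)
n, reps = 2000, 3000
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n))  # mean 0, variance 1
T = np.sqrt(n) * X.sum(axis=1) / (X**2).sum(axis=1)
print("sample mean ~ 0:", T.mean().round(3), "  sample sd ~ 1:", T.std().round(3))
```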


Some Remarks on Slutsky's Theorem

(1) If X_n →_pr X and Y_n →_pr Y, then X_n + Y_n →_pr X + Y and X_n Y_n →_pr XY.

Proof.

P(|X_n + Y_n − (X + Y)| > ε) ≤ P(|X_n − X| + |Y_n − Y| > ε)
≤ P(|X_n − X| > ε/2 or |Y_n − Y| > ε/2)
≤ P(|X_n − X| > ε/2) + P(|Y_n − Y| > ε/2) → 0,

as n → ∞. Show by yourself that X_n Y_n →_pr XY.

(2) If X_n →_d X and Y_n →_d c, where c is a constant, then X_n + Y_n →_d X + c and X_n Y_n →_d cX, because Y_n →_d c ⇔ Y_n →_pr c.

(Show by yourself that Y_n →_d c ⇔ Y_n →_pr c.)

(3) Assume X_n →_d X and Y_n →_d Y. Does X_n + Y_n →_d X + Y? No. (The marginal distributions of X and Y do not determine the joint distribution of (X, Y), and hence do not determine the distribution of X + Y.)

(4) If (X_n, Y_n)^⊤ →_d (X, Y)^⊤, then by the continuous mapping theorem,

X_n + Y_n = (1  1)(X_n, Y_n)^⊤ →_d (1  1)(X, Y)^⊤ = X + Y.

Central Limit Theorem

Lindeberg Central Limit Theorem

Let X_1, . . . , X_n be independent random variables with E(X_i) = 0 and E(X_i²) = σ_i² for i = 1, . . . , n. Define s_n² = Σ_{i=1}^{n} σ_i² and S_n = Σ_{i=1}^{n} X_i. Then

S_n / s_n →_d N(0, 1),  (1)

provided that for any ε > 0,

(1/s_n²) Σ_{i=1}^{n} E(X_i² I_{{|X_i| > ε s_n}}) → 0  (Lindeberg's condition)  (2)

as n → ∞.

Lyapunov Central Limit Theorem

Let X_1, . . . , X_n be independent random variables with E(X_i) = 0 and E(X_i²) = σ_i² for i = 1, . . . , n. Define s_n² = Σ_{i=1}^{n} σ_i² and S_n = Σ_{i=1}^{n} X_i. If

(1/s_n^{2+α}) Σ_{i=1}^{n} E(|X_i|^{2+α}) → 0 for some α > 0  (Lyapunov's condition),

then S_n / s_n →_d N(0, 1).
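The following simulation sketch illustrates the theorem for independent but non-identically distributed variables; the uniform distributions chosen for the X_i (which satisfy Lyapunov's, and hence Lindeberg's, condition) are an assumption made only for illustration.

```python
import numpy as np

# X_i uniform on [-i^{1/4}, i^{1/4}], so sigma_i^2 = i^{1/2}/3 and Lyapunov's
# condition holds.  S_n / s_n should be approximately N(0, 1).
rng = np.random.default_rng(3)
n, reps = 2000, 3000
half_width = np.arange(1, n + 1) ** 0.25
sigma2 = half_width**2 / 3.0
s_n = np.sqrt(sigma2.sum())

X = rng.uniform(-half_width, half_width, size=(reps, n))
Z = X.sum(axis=1) / s_n
print("mean ~ 0:", Z.mean().round(3), "  sd ~ 1:", Z.std().round(3))
print("P(Z <= 1.96) ~ 0.975:", (Z <= 1.96).mean().round(3))
```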

Proof of Lindeberg Central Limit Theorem

To prove (1), we need two facts:

(F1) Lévy continuity theorem

Let {X_n} be a sequence of random variables and define ϕ_n(t) = E(exp{itX_n}). Then

X_n →_d X ⇔ ϕ_n(t) → ϕ(t) for every t,

where ϕ(t) = E(exp{itX}).

(F2) Lemma 8.4.1 of Chow and Teicher (1997)

|exp{it} − Σ_{j=0}^{n} (it)^j / j!| ≤ 2^{1−δ} |t|^{n+δ} / ((1+δ)(2+δ) · · · (n+δ)),

where δ is any constant in [0, 1].

Proof of Lindeberg Central Limit Theorem (cont.)

Now,

E(exp{itS_n/s_n}) − exp{−t²/2}
= Σ_{j=1}^{n} [ E(exp{it(S_j + Σ_{i=j+1}^{n} Z_i)/s_n}) − E(exp{it(S_{j−1} + Σ_{i=j}^{n} Z_i)/s_n}) ],  (3)

where S_j = Σ_{i=1}^{j} X_i (with S_0 = 0 and the empty sum Σ_{i=n+1}^{n} Z_i = 0), Z_i are independent N(0, σ_i²), and {Z_i} is independent of {X_i}. The sum telescopes because exp{−t²/2} = E(exp{it Σ_{i=1}^{n} Z_i / s_n}): its first term (j = 1) is E(exp{it(X_1 + Σ_{i=2}^{n} Z_i)/s_n}) − exp{−t²/2}, and its last term (j = n) is E(exp{itS_n/s_n}) − E(exp{it(S_{n−1} + Z_n)/s_n}).

Proof of Lindeberg Central Limit Theorem (cont.)

For the j-th term ("stair j") of the telescoping sum in (3), it holds that

| E(exp{it(S_j + Σ_{i=j+1}^{n} Z_i)/s_n}) − E(exp{it(S_{j−1} + Σ_{i=j}^{n} Z_i)/s_n}) |
≤ | exp{−t²/2} [ E(exp{itS_j/s_n}) exp{t²s_j²/(2s_n²)} − E(exp{itS_{j−1}/s_n}) exp{t²s_{j−1}²/(2s_n²)} ] |
≤ exp{−t²/2} | E(exp{itS_{j−1}/s_n}) exp{t²s_j²/(2s_n²)} [ E(exp{itX_j/s_n}) − exp{−t²σ_j²/(2s_n²)} ] |
≤ | E(exp{itX_j/s_n}) − exp{−t²σ_j²/(2s_n²)} |
≤ | E(exp{itX_j/s_n} − 1 − itX_j/s_n + t²X_j²/(2s_n²)) − (exp{−t²σ_j²/(2s_n²)} − 1 + t²σ_j²/(2s_n²)) |,  (4)

where s_j² = Σ_{i=1}^{j} σ_i² and the first inequality is by

E(exp{it Σ_{i=j+1}^{n} Z_i / s_n}) = exp{−t²(s_n² − s_j²)/(2s_n²)}.

Proof of Lindeberg Central Limit Theorem (cont.)

By (F2) (taking δ = 1 and n = 1, 2), we have

n = 1:  | exp{itX_j/s_n} − 1 − itX_j/s_n + t²X_j²/(2s_n²) | ≤ | exp{itX_j/s_n} − 1 − itX_j/s_n | + t²X_j²/(2s_n²) ≤ t²X_j²/(2s_n²) + t²X_j²/(2s_n²) = t²X_j²/s_n²,

n = 2:  | exp{itX_j/s_n} − 1 − itX_j/s_n + t²X_j²/(2s_n²) | ≤ (1/6)|t|³ |X_j/s_n|³,

and hence

| E(exp{itX_j/s_n} − 1 − itX_j/s_n + t²X_j²/(2s_n²)) |
≤ E( min( t²X_j²/s_n², (1/6)|t|³ |X_j/s_n|³ ) )  (why?)
≤ E( t²X_j²/s_n² I_{{|X_j/s_n| > ε}} ) + E( (1/6)|t|³ |X_j/s_n|³ I_{{|X_j/s_n| ≤ ε}} )
≡ I_j + II_j.  (5)

Proof of Lindeberg Central Limit Theorem (cont.)

In addition, we have

0 ≤ exp{−t²σ_j²/(2s_n²)} − 1 + t²σ_j²/(2s_n²) ≤ t⁴σ_j⁴/(8s_n⁴),  (6)

noting that for x > 0, 0 ≤ exp{−x} − 1 + x ≤ x²/2. Moreover, Σ_{j=1}^{n} σ_j²/s_n² = 1,

E( |X_j/s_n|³ I_{{|X_j/s_n| ≤ ε}} ) ≤ ε σ_j²/s_n²,

and (2) implies max_{1≤j≤n} σ_j²/s_n² → 0.

By (2)–(6), it follows that

| E(exp{itS_n/s_n}) − exp{−t²/2} |
≤ Σ_{j=1}^{n} ( I_j + II_j + t⁴σ_j⁴/(8s_n⁴) )
≤ t² Σ_{j=1}^{n} E(X_j² I_{{|X_j| > ε s_n}})/s_n² + (|t|³/6) ε Σ_{j=1}^{n} σ_j²/s_n² + (t⁴/8) (max_{1≤j≤n} σ_j²/s_n²) Σ_{j=1}^{n} σ_j²/s_n²
= (|t|³/6) ε + o(1).

Proof of Lindeberg Central Limit Theorem (cont.)

Since ε can be arbitrarily small, one gets

| E(exp{itS_n/s_n}) − exp{−t²/2} | → 0 as n → ∞,

which, together with (F1), yields the desired conclusion (1).

Proof of Lyapunov Central Limit Theorem

(1/s_n²) Σ_{i=1}^{n} E(X_i² I_{{|X_i| > δ s_n}})
= (1/s_n²) Σ_{i=1}^{n} E( (|X_i|^{2+α}/|X_i|^α) I_{{|X_i| > δ s_n}} )
≤ (1/(s_n^{2+α} δ^α)) Σ_{i=1}^{n} E(|X_i|^{2+α}) → 0,

if Lyapunov's condition holds. Hence Lindeberg's condition is satisfied.

Example 1

If X_1, . . . , X_n are independent random variables with E(X_i) = 0 for i = 1, . . . , n, sup_{i≥1} E|X_i|^{2+α} < M for some α > 0 and M < ∞, liminf_{n→∞} s_n²/a_n > 0, and n a_n^{−1−α/2} = o(1) for a sequence of positive constants {a_n}, then S_n/s_n →_d N(0, 1).

Proof of Example 1

limsup_{n→∞} (1/s_n^{2+α}) Σ_{i=1}^{n} E(|X_i|^{2+α})
≤ limsup_{n→∞} nM / (a_n^{1+α/2} (s_n²/a_n)^{1+α/2})
≤ limsup_{n→∞} n a_n^{−1−α/2} × limsup_{n→∞} M / (s_n²/a_n)^{1+α/2}
= limsup_{n→∞} n a_n^{−1−α/2} × M / (liminf_{n→∞} s_n²/a_n)^{1+α/2}
= 0,

so Lyapunov's condition holds and the conclusion follows from the Lyapunov central limit theorem.

Example 2

Let P_J be the orthogonal projection matrix onto the space spanned by {X_j : j ∈ J}. Consider ε^⊤(P_{J_2} − P_{J_1})ε, where ε = (ε_1, . . . , ε_n)^⊤ with ε_i independent ∼ (0, 1), J_2 ⊃ J_1, and ♯(J_2) − ♯(J_1) = 1. Then

ε^⊤(P_{J_2} − P_{J_1})ε →_d χ²(1),

provided sup_{i≥1} E|ε_i|^{2+α} < M < ∞ for some α > 0 and max_{1≤i≤n} (P_{J_2})_{ii} → 0 as n → ∞.

Remark

If ε ∼ N(0, I), then ε^⊤(P_{J_2} − P_{J_1})ε ∼ χ²(1).

If (X^⊤X)/a_n → R (positive definite), then

(P_{J_2})_{ii} = e_i^⊤ X_{J_2} (X_{J_2}^⊤ X_{J_2})^{−1} X_{J_2}^⊤ e_i = (x_i(J_2)/√a_n)^⊤ (X_{J_2}^⊤ X_{J_2}/a_n)^{−1} (x_i(J_2)/√a_n)
≤ λ_max( (X_{J_2}^⊤ X_{J_2}/a_n)^{−1} ) × (Σ_{j∈J_2} x_{ij}²)/a_n → 0,

provided a_n^{−1} Σ_{j∈J_2} x_{ij}² → 0, where x_i(J_2) = (x_{ij}, j ∈ J_2)^⊤ and X_{J_2} = (X_j, j ∈ J_2).

Proof of Example 2

Let ♯(J_1) = r. Then P_{J_1} = Σ_{i=1}^{r} o_i o_i^⊤ and P_{J_2} = Σ_{i=1}^{r+1} o_i o_i^⊤, where o_i^⊤ o_i = 1 and o_i^⊤ o_j = 0 for 1 ≤ i < j ≤ r + 1. Hence, P_{J_2} − P_{J_1} = o_{r+1} o_{r+1}^⊤. Without loss of generality, set o_{r+1} = (a_{1n}, . . . , a_{nn})^⊤ with Σ_{i=1}^{n} a_{in}² = 1.

Now ε^⊤(P_{J_2} − P_{J_1})ε = (Σ_{i=1}^{n} a_{in} ε_i)². Note that Σ_{i=1}^{n} a_{in} ε_i can be viewed as

Σ_{i=1}^{n} v_i ε_i / √(Σ_{j=1}^{n} v_j²), where v_i > 0 for i = 1, . . . , n and Σ_{j=1}^{n} v_j² → ∞.

Lyapunov's condition Σ_{i=1}^{n} E|a_{in} ε_i|^{2+α} → 0 follows from

Σ_{i=1}^{n} E|a_{in} ε_i|^{2+α} ≤ M Σ_{i=1}^{n} |a_{in}|^{2+α} ≤ M (Σ_{i=1}^{n} a_{in}²) max_{1≤i≤n} |a_{in}|^α = M max_{1≤i≤n} |a_{in}|^α,

and max_{1≤i≤n} |a_{in}| = (max_{1≤i≤n} a_{in}²)^{1/2} ≤ (max_{1≤i≤n} (P_{J_2})_{ii})^{1/2} → 0.

By the Lyapunov central limit theorem, we have

Σ_{i=1}^{n} a_{in} ε_i →_d N(0, 1),

and hence ε^⊤(P_{J_2} − P_{J_1})ε →_d χ²(1) is obtained using the continuous mapping theorem.
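A simulation sketch of Example 2; the design matrix and the centered-exponential (non-normal, mean 0, variance 1) errors are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

# epsilon^T (P_J2 - P_J1) epsilon should be approximately chi^2(1).
rng = np.random.default_rng(4)
n, reps = 400, 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

def proj(A):
    # Orthogonal projection onto the column space of A.
    return A @ np.linalg.solve(A.T @ A, A.T)

P_J1, P_J2 = proj(X[:, :2]), proj(X)           # J1 has one fewer column than J2
eps = rng.exponential(size=(reps, n)) - 1.0    # mean 0, variance 1, non-normal
Q = np.einsum("ri,ij,rj->r", eps, P_J2 - P_J1, eps)
print("mean ~ 1:", Q.mean().round(3))          # chi^2(1) has mean 1
print("P(Q <= q_0.95) ~ 0.95:", (Q <= chi2.ppf(0.95, df=1)).mean().round(3))
```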

Convergence in the rth Mean

Definition

If E|X_n − X|^r → 0 and E|X|^r < ∞, then we say that X_n converges in the rth mean to X, and we write X_n →_{L_r} X.

Definition

The r-norm of a random variable Z is defined by ‖Z‖_r = (E(|Z|^r))^{1/r}.

Some Inequalities

Jensen's inequality

If g is a convex function, then E(g(X)) ≥ g(E(X)).

Proof of Jensen's inequality

Note that the graph of a convex (differentiable) function lies above its tangent line at every point, and thus

g(x) ≥ g(µ) + g′(µ)(x − µ)

for any x and µ in the domain of g. (If g is not differentiable, replace the tangent line by a supporting line.)

Choosing µ = E(X) and replacing x with the random variable X, we have

g(X) ≥ g(E(X)) + g′(E(X))(X − E(X)).

The proof is completed by taking expectations on both sides of the above inequality.

Application

Let q > 1 and g(x) = x^q for x > 0. Then g(x) is a convex function.

Assume 0 < s < r. By Jensen's inequality, we have

E(|X|^r) = E((|X|^s)^{r/s}) ≥ (E(|X|^s))^{r/s},

and hence (E(|X|^r))^{1/r} ≥ (E(|X|^s))^{1/s}.
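A numerical illustration of this application, i.e. the monotonicity of r-norms in r; the lognormal sample is an illustrative assumption.

```python
import numpy as np

# Check (E|X|^s)^{1/s} <= (E|X|^r)^{1/r} for 0 < s < r on simulated data.
rng = np.random.default_rng(5)
X = rng.lognormal(size=100_000)
for s, r in [(1, 2), (2, 4), (0.5, 3)]:
    norm_s = np.mean(np.abs(X) ** s) ** (1 / s)
    norm_r = np.mean(np.abs(X) ** r) ** (1 / r)
    print(f"s={s}, r={r}:  ||X||_s = {norm_s:.3f} <= ||X||_r = {norm_r:.3f}")
```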

Young's inequality

Let f be a continuous, strictly increasing function on [0, ∞) with f(0) = 0. Then, for any a, b ≥ 0,

ab ≤ ∫_0^a f(x) dx + ∫_0^b f^{−1}(x) dx.

Hölder's inequality

E|XY| ≤ (E(|X|^p))^{1/p} (E(|Y|^q))^{1/q}, where 1/p + 1/q = 1 and p, q ∈ (1, ∞).

Proof of Hölder's inequality

Let f(x) = x^{p−1}. Then by Young's inequality,

ab ≤ ∫_0^a x^{p−1} dx + ∫_0^b x^{1/(p−1)} dx = a^p/p + b^{1+1/(p−1)}/(1 + 1/(p−1)) = a^p/p + b^q/q.  (∗)

Now, let a = |X|/‖X‖_p and b = |Y|/‖Y‖_q. By (∗),

(|X|/‖X‖_p) × (|Y|/‖Y‖_q) ≤ (1/p)(|X|/‖X‖_p)^p + (1/q)(|Y|/‖Y‖_q)^q,

which implies, after taking expectations,

E|XY| / (‖X‖_p ‖Y‖_q) ≤ 1/p + 1/q = 1,

and thus the proof is complete.
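A quick numerical check of Hölder's inequality on simulated, deliberately dependent X and Y; the distributions are illustrative assumptions.

```python
import numpy as np

# E|XY| <= ||X||_p ||Y||_q with 1/p + 1/q = 1, for several values of p.
rng = np.random.default_rng(6)
X = rng.normal(size=200_000)
Y = X**2 + rng.normal(size=200_000)            # deliberately correlated with X
for p in (1.5, 2.0, 3.0):
    q = p / (p - 1)
    lhs = np.mean(np.abs(X * Y))
    rhs = np.mean(np.abs(X) ** p) ** (1 / p) * np.mean(np.abs(Y) ** q) ** (1 / q)
    print(f"p={p}: E|XY| = {lhs:.3f} <= ||X||_p ||Y||_q = {rhs:.3f}")
```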

Minkowski's inequality

‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p, where 1 ≤ p < ∞.

Proof of Minkowski's inequality

By Hölder's inequality,

E(|X + Y|^p) = E(|X + Y|^{p−1} |X + Y|) ≤ E(|X + Y|^{p−1} |X|) + E(|X + Y|^{p−1} |Y|)
≤ (E(|X + Y|^p))^{(p−1)/p} (E(|X|^p))^{1/p} + (E(|X + Y|^p))^{(p−1)/p} (E(|Y|^p))^{1/p}.

Dividing both sides by (E(|X + Y|^p))^{(p−1)/p} completes the proof.

Some Facts

(1) X_n →_{L_r} X ⇒ X_n →_pr X ⇒ X_n →_d X.

If X is a constant, then X_n →_d X ⇒ X_n →_pr X.

If sup_{n≥1} E(|X_n|^p) < ∞ with p > r, then X_n →_pr X ⇒ X_n →_{L_r} X.

(2) X_n →_pr X does not necessarily imply X_n →_{L_r} X.

Example. Let P(X_n = n²) = 1/n and P(X_n = 0) = 1 − 1/n. Then, for any ε > 0 and all large n,

P(|X_n| > ε) = P(X_n > ε) = P(X_n = n²) → 0,

and hence X_n →_pr 0. However,

E|X_n − 0| = E(X_n) = 0 × P(X_n = 0) + n² × P(X_n = n²) = n → ∞.

(3) If X_n →_{L_2} X, then E(X_n) → E(X) and E(X_n²) → E(X²).

Proof. |E(X_n − X)| ≤ E|X_n − X| ≤ (E(X_n − X)²)^{1/2} → 0 and

|E(X_n² − X²)| = |E[(X_n − X)(X_n − X + 2X)]|
≤ E[(X_n − X)²] + 2E|X(X_n − X)|
≤ E[(X_n − X)²] + 2√(E[(X_n − X)²]) √(E(X²)) → 0.
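The counterexample in (2) is easy to see in simulation; the sketch below estimates P(|X_n| > ε) and E|X_n| by Monte Carlo.

```python
import numpy as np

# P(X_n = n^2) = 1/n, else X_n = 0: X_n -> 0 in probability, yet E|X_n| = n.
rng = np.random.default_rng(7)
reps = 200_000
for n in (10, 100, 1000):
    Xn = np.where(rng.random(reps) < 1.0 / n, float(n) ** 2, 0.0)
    print(f"n={n}:  P(|X_n| > 0.1) = {(np.abs(Xn) > 0.1).mean():.4f}"
          f"   E|X_n| = {Xn.mean():.1f} (theory: {n})")
```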

Some Facts (Cont.)

(4) If X_n →_{L_r} X, then E(|X_n|^r) → E(|X|^r).

Proof. For r ≥ 1, by Minkowski's inequality, we have

‖X_n‖_r ≤ ‖X_n − X‖_r + ‖X‖_r and ‖X‖_r ≤ ‖X_n − X‖_r + ‖X_n‖_r,

and hence

‖X‖_r − ‖X_n − X‖_r ≤ ‖X_n‖_r ≤ ‖X‖_r + ‖X_n − X‖_r,

which, in conjunction with X_n →_{L_r} X, yields the desired result. On the other hand, note that (a + b)^r ≤ a^r + b^r for a, b ≥ 0 and 0 < r < 1. Hence, for 0 < r < 1,

‖X_n‖_r^r = ‖X_n − X + X‖_r^r ≤ ‖X_n − X‖_r^r + ‖X‖_r^r and ‖X‖_r^r = ‖X − X_n + X_n‖_r^r ≤ ‖X_n − X‖_r^r + ‖X_n‖_r^r.

By an argument similar to that used for the case r ≥ 1, we have

E(|X_n|^r) → E(|X|^r) for 0 < r < 1,

and thus the proof is complete.

Weak Law of Large Numbers

Fact 4

Let X_1, . . . , X_n be i.i.d. random variables with E|X_1| < ∞ and E(X_1) = µ. Then

S_n/n →_pr µ,

where S_n = Σ_{i=1}^{n} X_i.

Remark

If X_1, . . . , X_n are merely independent random variables with finite means, the weak law of large numbers does not necessarily hold for {X_i}. Consider the following example:

Let X_1, . . . , X_n be a sequence of independent random variables with

P(X_i = √i) = P(X_i = −√i) = 1/2.

Note that E(X_i) = 0 and Var(X_i) = i for i = 1, . . . , n. Moreover,

s_n² = Σ_{i=1}^{n} Var(X_i) = Σ_{i=1}^{n} i = n(n+1)/2.

Remark (Cont.)

Since for some α > 0,

Σ_{i=1}^{n} E(|X_i|^{2+α})/s_n^{2+α} = Σ_{i=1}^{n} i^{1+α/2}/(n(n+1)/2)^{1+α/2} = O(n^{2+α/2}/n^{2+α}) → 0,

by the Lyapunov central limit theorem we have

√2 Σ_{i=1}^{n} X_i / n →_d N(0, 1).

Hence, the weak law of large numbers does not hold for {X_i}: S_n/n does not converge in probability to 0.
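A simulation sketch of this remark: S_n/n does not concentrate at 0, but keeps fluctuating on the scale 1/√2.

```python
import numpy as np

# X_i = +/- sqrt(i) with probability 1/2 each; sqrt(2) * S_n / n is roughly N(0, 1).
rng = np.random.default_rng(8)
n, reps = 2000, 3000
signs = rng.choice([-1.0, 1.0], size=(reps, n))
X = signs * np.sqrt(np.arange(1, n + 1))
mean_n = X.sum(axis=1) / n
print("sd of S_n/n ~ 1/sqrt(2) =", round(1 / np.sqrt(2), 3), ":", mean_n.std().round(3))
print("P(|S_n/n| > 0.2) stays large:", (np.abs(mean_n) > 0.2).mean().round(3))
```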

Proof of Fact 4

Consider

S_n/n − µ = (S_n − m_n)/n + (m_n − nµ)/n = (S_n − m_n)/n − E(X_1 I_{{|X_1| > n}}),  (5-1)

where m_n = Σ_{i=1}^{n} E(X_i^{(n)}) with X_i^{(n)} = X_i I_{{|X_i| ≤ n}}, i = 1, . . . , n.

It suffices to show that

(S_n − m_n)/n →_pr 0,  (5-2)

and

E(|X_1| I_{{|X_1| > n}}) → 0.  (5-3)

Proof of Fact 4 (Cont.)

Since E(|X_1|) < ∞, E(|X_1^{(n)}|) → E(|X_1|) and

E(|X_1|) = E(|X_1^{(n)}|) + E(|X_1| I_{{|X_1| > n}}),

we obtain (5-3).

We next show (5-2). Define S_n^{(n)} = Σ_{i=1}^{n} X_i^{(n)}. Note first that for any ε > 0,

P( |S_n − m_n|/n > ε )
≤ P( |S_n − m_n|/n > ε, ∩_{i=1}^{n} {|X_i| ≤ n} ) + P( ∪_{i=1}^{n} {|X_i| > n} )
≤ P( |S_n^{(n)} − m_n|/n > ε ) + nP(|X_1| > n)
≤ P( |S_n^{(n)} − m_n|/n > ε ) + E(|X_1| I_{{|X_1| > n}})
= P( |S_n^{(n)} − m_n|/n > ε ) + o(1),  (5-4)

where the second inequality uses the union bound together with the i.i.d. assumption, the third uses nP(|X_1| > n) ≤ E(|X_1| I_{{|X_1| > n}}), and the final equality is by (5-3).

Proof of Fact 4 (Cont.)

By Chebyshev's inequality, we have

P( |S_n^{(n)} − m_n|/n > ε ) ≤ E((X_1^{(n)})²)/(nε²).  (5-5)

Moreover,

E((X_1^{(n)})²) = ∫_0^∞ P(X_1² I_{{|X_1| ≤ n}} > x) dx
= 2 ∫_0^∞ u P(X_1² I_{{|X_1| ≤ n}} > u²) du
= 2 ∫_0^∞ u P(|X_1| I_{{|X_1| ≤ n}} > u) du
= 2 ∫_0^∞ u P(|X_1| I_{{|X_1| ≤ n}} > u, |X_1| ≤ n) du + 2 ∫_0^∞ u P(|X_1| I_{{|X_1| ≤ n}} > u, |X_1| > n) du
= 2 ∫_0^∞ u P(u < |X_1| ≤ n) du
= 2 ∫_0^n u P(u < |X_1| ≤ n) du ≤ 2 ∫_0^n u P(|X_1| > u) du.  (5-6)

Proof of Fact 4 (Cont.)

Since uP(|X_1| > u) ≤ E(|X_1| I_{{|X_1| > u}}) = o(1) as u → ∞, we can choose A large enough such that

uP(|X_1| > u) ≤ ε³ for all u ≥ A,

and then

2 ∫_0^n u P(|X_1| > u) du = 2 ∫_0^A u P(|X_1| > u) du + 2 ∫_A^n u P(|X_1| > u) du
≤ A² + 2ε³(n − A).  (5-7)

Hence, (5-2) follows from (5-4)–(5-7), and the proof is complete.

Delta Method

Assume a_n(Z_n − µ) →_d Z, where Z_n, µ, Z are k-dimensional and a_n → ∞ as n → ∞. Let f(·) = (f_1(·), . . . , f_m(·))^⊤ be a smooth function from R^k into R^m with 1 ≤ m ≤ k. Define the k × m matrix

∇f(·) = ( ∂f_j(·)/∂x_i )_{1 ≤ i ≤ k, 1 ≤ j ≤ m},

whose (i, j) entry is ∂f_j(·)/∂x_i.

Suppose that there exists ε > 0 such that for some 0 < G < ∞,

max_{1≤i≤m} sup_{‖x−µ‖≤ε} ‖ ( ∂²f_i(x)/∂x_j∂x_l )_{1≤j,l≤k} ‖ ≤ G.  (∗)

Then

a_n(f(Z_n) − f(µ)) →_d (∇f(µ))^⊤ Z.  (∗∗)
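A minimal numerical illustration of the delta method with k = m = 1; the Exp(1) data and the choice f(x) = log x are illustrative assumptions, not from the slides.

```python
import numpy as np

# For i.i.d. Exp(1) data (mean mu = 1, variance 1), sqrt(n)(Xbar - mu) -> N(0, 1),
# so with f(x) = log(x), sqrt(n)(f(Xbar) - f(mu)) -> N(0, (f'(mu))^2) = N(0, 1).
rng = np.random.default_rng(9)
n, reps = 1000, 4000
Xbar = rng.exponential(size=(reps, n)).mean(axis=1)
T = np.sqrt(n) * (np.log(Xbar) - np.log(1.0))
print("mean ~ 0:", T.mean().round(3))
print("sd ~ |f'(mu)| * sigma = 1:", T.std().round(3))
```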

Proof of Delta Method

Since a_n(Z_n − µ) →_d Z, it holds that

a_n(Z_n − µ) = O_p(1),  (0)

and hence

Z_n − µ = O_p(a_n^{−1}) = o_p(1),

yielding

Z_n →_pr µ.  (1)

Define A_n = {‖Z_n − µ‖ ≤ ε}, where ε is defined in (∗). Then, by (1),

P(A_n) → 1 as n → ∞.  (2)

Proof of Delta Method (cont.)

In the following, we shall prove (∗∗) for the case m = 1; the proof for m > 1 is similar.

By Taylor's theorem,

f_1(Z_n) − f_1(µ) = (∇f_1(µ))^⊤(Z_n − µ) + w_n,  (3)

where w_n = (1/2)(Z_n − µ)^⊤ ( ∂²f_1(ξ)/∂x_j∂x_l )_{1≤j,l≤k} (Z_n − µ) and ‖ξ − µ‖ ≤ ‖Z_n − µ‖.

Let x ∈ R be a continuity point of the distribution function of (∇f_1(µ))^⊤Z. Then

P(a_n(f(Z_n) − f(µ)) ≤ x)
(why?) = P(a_n(f(Z_n) − f(µ)) ≤ x, A_n) + o(1)
(by (3)) = P((∇f_1(µ))^⊤ a_n(Z_n − µ) + a_n w_n ≤ x, A_n) + o(1)
(why?) = P((∇f_1(µ))^⊤ a_n(Z_n − µ) I_{A_n} + a_n w_n I_{A_n} ≤ x) + o(1).  (4)

Proof of Delta Method (cont.)

Note that

|a_n w_n I_{A_n}| (why?) ≤ (1/2) a_n ‖Z_n − µ‖² ‖ ( ∂²f_1(ξ)/∂x_j∂x_l )_{1≤j,l≤k} ‖ I_{A_n}
≤ (1/2) a_n ‖Z_n − µ‖² sup_{‖x−µ‖≤ε} ‖ ( ∂²f_1(x)/∂x_j∂x_l )_{1≤j,l≤k} ‖ I_{A_n}
≤ (1/2) a_n ‖Z_n − µ‖² G
(by (0) and (1)) = o_p(1).  (5)

Moreover, since

(∇f_1(µ))^⊤ a_n(Z_n − µ) →_d (∇f_1(µ))^⊤Z  (by the continuous mapping theorem)

and I_{A_n} →_pr 1 (by (2)), it follows from Slutsky's theorem that

(∇f_1(µ))^⊤ a_n(Z_n − µ) I_{A_n} →_d (∇f_1(µ))^⊤Z.  (6)

Proof of Delta Method (cont.)

By (5) and (6), and Slutsky's theorem, we obtain

(∇f_1(µ))^⊤ a_n(Z_n − µ) I_{A_n} + a_n w_n I_{A_n} →_d (∇f_1(µ))^⊤Z.  (7)

By (4) and (7),

P(a_n(f(Z_n) − f(µ)) ≤ x) → P((∇f_1(µ))^⊤Z ≤ x),

and hence the desired conclusion follows.

Two-Sample t-Test

Consider the model

z = Xµ + ε,

where z = (x_1, . . . , x_m, y_1, . . . , y_n)^⊤, X = (s_{ij}) is an (m+n) × 2 matrix satisfying

s_{ij} = 1 if {1 ≤ i ≤ m, j = 1} or {m+1 ≤ i ≤ m+n, j = 2}, and s_{ij} = 0 otherwise,

µ = (µ_x, µ_y)^⊤, ε = (ε_1, . . . , ε_{m+n})^⊤, and the ε_i's are i.i.d. N(0, σ²).

The least squares estimator of µ is

µ̂ = (µ̂_x, µ̂_y)^⊤ = (X^⊤X)^{−1}X^⊤z = (x̄, ȳ)^⊤ ∼ N( (µ_x, µ_y)^⊤, diag(σ²/m, σ²/n) ).

Under H_0 : µ_x = µ_y,

T = (x̄ − ȳ) / √(σ²(1/m + 1/n)) ∼ N(0, 1).


In practice, σ² is unknown and we can use

σ̂² = (1/(m+n−2)) z^⊤(I − M)z

in place of σ², where M = X(X^⊤X)^{−1}X^⊤.

Define S_x = (m−1)^{−1} Σ_{i=1}^{m} (x_i − x̄)² and S_y = (n−1)^{−1} Σ_{i=1}^{n} (y_i − ȳ)². Then some elementary calculations yield

(I − M)z = (x_1 − x̄, . . . , x_m − x̄, y_1 − ȳ, . . . , y_n − ȳ)^⊤,

and hence

σ̂² = (1/(m+n−2)) ( Σ_{i=1}^{m} (x_i − x̄)² + Σ_{j=1}^{n} (y_j − ȳ)² ) = ((m−1)S_x + (n−1)S_y)/(m+n−2),

which is the pooled variance.
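A short sketch of the pooled two-sample t statistic derived above, with scipy's ttest_ind as a cross-check; the simulated samples are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.normal(loc=0.0, scale=2.0, size=30)   # m = 30
y = rng.normal(loc=0.5, scale=2.0, size=40)   # n = 40
m, n = len(x), len(y)

Sx, Sy = x.var(ddof=1), y.var(ddof=1)
sigma2_hat = ((m - 1) * Sx + (n - 1) * Sy) / (m + n - 2)   # pooled variance
T = (x.mean() - y.mean()) / np.sqrt(sigma2_hat * (1 / m + 1 / n))
p_value = 2 * stats.t.sf(abs(T), df=m + n - 2)

print("T =", round(T, 4), "  p-value =", round(p_value, 4))
print(stats.ttest_ind(x, y, equal_var=True))   # should match T and the p-value
```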


Since T ∼ N(0, 1), (m+n−2)σ̂²/σ² ∼ χ²(m+n−2), and T ⊥ σ̂², we have

(x̄ − ȳ) / √(σ̂²(1/m + 1/n)) = [ (x̄ − ȳ) / √(σ²(1/m + 1/n)) ] / √(σ̂²/σ²) ∼ t(m+n−2).

Assume m/(m+n) → γ_x > 0 and n/(m+n) → γ_y > 0 as m → ∞ and n → ∞. If the ε_i's are i.i.d. (0, σ²) (without assuming normality), then one can show that σ̂² →_pr σ²,

√(m+n) (x̄ − µ_x, ȳ − µ_y)^⊤ →_d N( 0, diag(1/γ_x, 1/γ_y) σ² ),

and, under H_0,

√(m+n−2)(x̄ − ȳ) / √(σ²(1/γ_x + 1/γ_y)) →_d N(0, 1).

This, in conjunction with σ̂² →_pr σ², (m+n−2)/m → 1/γ_x, (m+n−2)/n → 1/γ_y, the continuous mapping theorem, and Slutsky's theorem, yields, under H_0,

√(m+n−2)(x̄ − ȳ) / √( σ̂²( (m+n−2)/m + (m+n−2)/n ) ) →_d N(0, 1).

Pearson's Chi-Squared Test

Suppose that X_1, . . . , X_n is a random sample of size n from a population, and the n observations are classified into k classes A_1, . . . , A_k.

Let p_i denote the probability that an observation falls into the class A_i, with Σ_{i=1}^{k} p_i = 1.

Note first that

Z_t = (I_{{X_t ∈ A_1}}, . . . , I_{{X_t ∈ A_{k−1}}})^⊤ ∼ (p, D − pp^⊤),

(1/√n) Σ_{t=1}^{n} (Z_t − p) →_d N(0, D − pp^⊤),

and

( (1/√n) Σ_{t=1}^{n} (Z_t − p) )^⊤ (D − pp^⊤)^{−1} ( (1/√n) Σ_{t=1}^{n} (Z_t − p) ) →_d χ²(k−1),  (1)

where p = (p_1, . . . , p_{k−1})^⊤ and D = diag(p_1, . . . , p_{k−1}).


Let 1 be the (k−1)-dimensional vector with all entries one, and let O_i = Σ_{t=1}^{n} I_{{X_t ∈ A_i}} for i = 1, . . . , k. Define O = (O_1, . . . , O_{k−1})^⊤.

Since

(D − pp^⊤)^{−1} = D^{−1} + D^{−1}pp^⊤D^{−1}/(1 − p^⊤D^{−1}p) = D^{−1} + 11^⊤/p_k,

we have

( (1/√n) Σ_{t=1}^{n} (Z_t − p) )^⊤ (D − pp^⊤)^{−1} ( (1/√n) Σ_{t=1}^{n} (Z_t − p) )
= (1/n)(O − np)^⊤D^{−1}(O − np) + (1/(np_k))(O − np)^⊤11^⊤(O − np)
= Σ_{i=1}^{k−1} (O_i − np_i)²/(np_i) + (1/(np_k)) ( Σ_{i=1}^{k−1} (O_i − np_i) )²
= Σ_{i=1}^{k−1} (O_i − np_i)²/(np_i) + (1/(np_k)) ( (n − O_k) − n(1 − p_k) )² = Σ_{i=1}^{k} (O_i − np_i)²/(np_i).  (2)

Hence, by (1) and (2),

Σ_{i=1}^{k} (O_i − np_i)²/(np_i) →_d χ²(k−1).
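A sketch of the resulting goodness-of-fit statistic Σ_i (O_i − np_i)²/(np_i), cross-checked against scipy.stats.chisquare; the class probabilities and sample below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2, chisquare

rng = np.random.default_rng(11)
p = np.array([0.2, 0.3, 0.4, 0.1])          # hypothesized class probabilities
n = 1000
O = rng.multinomial(n, p)                   # observed class counts O_1, ..., O_k

stat = np.sum((O - n * p) ** 2 / (n * p))
p_value = chi2.sf(stat, df=len(p) - 1)
print("chi2 statistic =", round(stat, 3), "  p-value =", round(p_value, 4))
print(chisquare(O, f_exp=n * p))            # should agree with the values above
```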
