Statistics --- Summary
Source: tcs.inf.kyushu-u.ac.jp/~kijima/GPS20/GPS20-13.pdf (Author: Shuji Kijima, created 2/25/2020)
Aug. 5, 2020
来嶋 秀治 (Shuji Kijima)
Dept. Informatics,
Graduate School of ISEE
確率統計特論 (Probability & Statistics)
Lesson 13
Final exam (期末試験)
Date/time: August 12 (8/12), 13:00-14:30
Place (場所): on Moodle.
Submit electronic files (photos accepted and recommended). ≤10MB.
Keep your original paper/data at hand (I may ask you to submit them later).
電子ファイルを提出 (写真可: 推奨). 10MB以内.
紙/データを手元に保存しておくこと (後日提出を求める場合がある).
Topics (範囲):
Probability and Statistics.
check the course page (講義ページを参照のこと)
http://tcs.inf.kyushu-u.ac.jp/~kijima/
Books, notes, Google, etc. may be used (持ち込み可).
Communication (e-mail, SNS, BBS) is prohibited (相談不可).
Statistics I
July 1, 2020
来嶋 秀治 (Shuji Kijima)
Dept. Informatics,
Graduate School of ISEE
Today's topics
• estimating population mean
• estimating population variance
• consistent estimator (一致推定量)
• unbiased estimator (不偏推定量)
確率統計特論 (Probability & Statistics)
Lesson 8
Statistical Inference (統計的推論)
Estimation (推定): Lessons 8, 9, 12
Statistical test (統計検定): Lesson 10
Regression (回帰): Lesson 11
Applications
Machine learning (機械学習),
Pattern recognition (パターン認識),
Data mining (データマイニング), etc.
Statistics / Data science

Population, sample, stochastic model
Example 1
We sample 6 accounts of twixxer at random. The following table
shows the numbers of followers.
account       1    2   3    4     5    6
#followers  372  623  89  781  3219  152
Q. How large is the population mean of followers?
Suppose that the number of followers follows some
distribution (e.g., Ex(λ)) with expectation μ.
⇒ sample mean X̄ = (X₁ + ⋯ + Xₙ)/n = 872.7
population (母集団)
sample (標本)
stochastic model (確率モデル)
Sample mean

Proposition
The sample mean X̄ = (X₁ + ⋯ + Xₙ)/n is a consistent estimator of μ.
Proof. By the law of large numbers.

Proposition
X̄ = (X₁ + ⋯ + Xₙ)/n is an unbiased estimator of μ.
Proof. By linearity of expectation, E[X̄] = (E[X₁] + ⋯ + E[Xₙ])/n = μ.

Definition
T(X) is an unbiased estimator of g(θ)
if E_θ[T(X) − g(θ)] = 0 holds.
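The consistency of the sample mean can be checked empirically. A minimal sketch, assuming a hypothetical exponential population with mean 1000 (the distribution and mean are illustrative choices, not from the slides):

```python
import random

random.seed(0)
mu = 1000.0  # hypothetical population mean; model X ~ Ex(lambda) with lambda = 1/mu

def sample_mean(n):
    # X-bar = (X_1 + ... + X_n) / n for n i.i.d. exponential draws
    return sum(random.expovariate(1.0 / mu) for _ in range(n)) / n

# Consistency: the sample mean concentrates around mu as n grows.
for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```

By the law of large numbers, the printed values approach μ = 1000 as n increases.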
Population, sample, stochastic model
Example 1
We sample 6 accounts of twixxer at random. The following table
shows the numbers of followers.
account       1    2   3    4     5    6
#followers  372  623  89  781  3219  152
Q. How large is the population variance of #followers?
Suppose that the number of followers follows some
distribution (e.g., Ex(λ)) with expectation μ and variance σ².
Recall Var(X) := E[(X − μ)²].
population (母集団)
sample (標本)
stochastic model (確率モデル)
Consistency of a sample variance

Proposition
Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n−1) is a consistent estimator of σ² (if Var[(X − E[X])²] < ∞).

Proposition
Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n−1) is an unbiased estimator of σ² (if Var[(X − E[X])²] < ∞).

Proposition
Σᵢ₌₁ⁿ (Xᵢ − X̄)² / n is NOT an unbiased estimator of σ² (in general).
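A small simulation illustrates the last two propositions (the N(0, 4) population, n = 5, and trial count are hypothetical choices): dividing by n − 1 is unbiased, while dividing by n underestimates σ² by a factor (n − 1)/n.

```python
import random

random.seed(1)
mu, sigma2, n, trials = 0.0, 4.0, 5, 20000

def one_sample():
    # draw one sample of size n and return both variance estimates
    xs = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    return ss / (n - 1), ss / n  # unbiased vs naive

unb = naive = 0.0
for _ in range(trials):
    u, b = one_sample()
    unb += u
    naive += b
unb /= trials
naive /= trials
print(unb, naive)  # unb near 4.0; naive near (n-1)/n * 4.0 = 3.2
```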
Statistics II
July 8, 2020
来嶋 秀治 (Shuji Kijima)
Dept. Informatics,
Graduate School of ISEE
Today's topics
• maximum likelihood (最尤推定)
確率統計特論 (Probability & Statistics)
Lesson 9
Statistical inference: maximum likelihood

Example 1
The number of defective products per 10,000 products.
How often do defectives appear?
lot         1  2  3  4  5  6  7  8  9  10
#defective  0  2  0  0  1  1  0  3  1  0
Let X be a r.v. denoting #defectives.
Then X ∼ Po(λ), i.e.,
Pr[X = x] =: f(x) = e^(−λ) λ^x / x!
for unknown parameter λ.
Maximum likelihood

Preparation
Let X₁, …, Xₙ be i.i.d. with density function f(x; θ), and
let f(x₁, …, xₙ; θ) := f(x₁; θ) ⋯ f(xₙ; θ),
i.e., the density function of the joint distribution of X₁, …, Xₙ.
θ: parameter(s), e.g. N(μ, σ²), Ex(λ).
Remark: X₁, …, Xₙ are independent.
Maximum likelihood estimation
Given sample values X₁ = a₁, …, Xₙ = aₙ,
let L(θ | a) := f(a₁, …, aₙ; θ), called the likelihood function, and
θ* = argmax_θ L(θ) is called the maximum likelihood estimator.
Ex. Maximum likelihood

Ex. Poisson distribution
Let X₁, …, Xₙ be independent r.v.s according to Po(λ), and
a₁, …, aₙ are sample values.
Poisson distribution Po(λ): f(x) = e^(−λ) λ^x / x!

The maximum likelihood estimator of λ is argmax_λ L(λ):
L(λ; a₁, …, aₙ) = (e^(−λ) λ^(a₁) / a₁!) (e^(−λ) λ^(a₂) / a₂!) ⋯ (e^(−λ) λ^(aₙ) / aₙ!)
∂/∂λ L(λ; a₁, …, aₙ) = ⋯
Ex. Maximum likelihood (cont.)

Ex. Poisson distribution
Let X₁, …, Xₙ be independent r.v.s according to Po(λ), and
a₁, …, aₙ are sample values.
Poisson distribution Po(λ): f(x) = e^(−λ) λ^x / x!

log L(λ) = Σᵢ₌₁ⁿ log f(aᵢ; λ) = Σᵢ₌₁ⁿ (−λ + aᵢ log λ − log aᵢ!)
         = n(ā log λ − λ) − Σᵢ₌₁ⁿ log aᵢ!
∂/∂λ log L(λ) = n(ā/λ − 1)
Setting this to 0 gives λ = ā:
ā is the maximum likelihood estimator of λ.
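Using the defect counts from Example 1, a grid search over λ confirms the closed-form result λ* = ā (the grid bounds and step size are arbitrary choices for illustration):

```python
import math

# defect counts from the slide's table
a = [0, 2, 0, 0, 1, 1, 0, 3, 1, 0]
n = len(a)
abar = sum(a) / n  # closed-form MLE: lambda* = a-bar = 0.8

def log_L(lam):
    # log L(lambda) = sum_i (-lambda + a_i log lambda - log a_i!)
    return sum(-lam + ai * math.log(lam) - math.lgamma(ai + 1) for ai in a)

# numerical grid search over (0, 3] confirms the closed form
grid = [k / 1000 for k in range(1, 3001)]
lam_star = max(grid, key=log_L)
print(abar, lam_star)
```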
Statistical inference: maximum likelihood

Example 2
Examination scores.
How well do the students understand the material?
student  1  2  3  4  5  6  7  8  9  10
score   72 89 64 52 96 64 70 83 56 70
Let X be a r.v. denoting a score.
Then X ∼ N(μ, σ²), i.e.,
f(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
for unknown parameters μ and σ.
Ex. Maximum likelihood

Ex. Normal distribution
Let X₁, …, Xₙ be independent r.v.s according to N(μ, σ²), and
a₁, …, aₙ are sample values.

ln L(μ, σ; a) = Σᵢ₌₁ⁿ ln f(aᵢ; μ, σ)
  = Σᵢ₌₁ⁿ ln[(1 / (√(2π) σ)) exp(−(aᵢ − μ)² / (2σ²))]
  = Σᵢ₌₁ⁿ [−(1/2) ln 2π − (1/2) ln σ² − (aᵢ − μ)² / (2σ²)]
  = −(n/2) ln 2π − (n/2) ln σ² − Σᵢ₌₁ⁿ (aᵢ − μ)² / (2σ²)
Ex. Maximum likelihood (cont.)

Ex. Normal distribution
Let X₁, …, Xₙ be independent r.v.s according to N(μ, σ²), and
a₁, …, aₙ are sample values.

∂/∂μ ln L(μ, σ; a) = −∂/∂μ [Σᵢ₌₁ⁿ (aᵢ − μ)² / (2σ²)] = Σᵢ₌₁ⁿ (aᵢ − μ) / σ²
Setting this to 0 gives μ = ā:
ā is the maximum likelihood estimator of μ.

∂/∂σ ln L(μ, σ; a) = −∂/∂σ [(n/2) ln σ²] − ∂/∂σ [Σᵢ₌₁ⁿ (aᵢ − μ)² / (2σ²)]
  = −n/σ + Σᵢ₌₁ⁿ (aᵢ − μ)² / σ³
Since μ = ā maximizes L(μ, σ) independently of σ, setting this to 0 with μ = ā gives
(σ*)² = Σᵢ₌₁ⁿ (aᵢ − ā)² / n.
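Applied to the exam scores of Example 2, the closed forms μ* = ā and (σ*)² = Σᵢ (aᵢ − ā)²/n give:

```python
a = [72, 89, 64, 52, 96, 64, 70, 83, 56, 70]  # exam scores from the table
n = len(a)

mu_mle = sum(a) / n                                # mu* = a-bar
var_mle = sum((x - mu_mle) ** 2 for x in a) / n    # (sigma*)^2, note /n not /(n-1)
print(mu_mle, var_mle)  # -> 71.6 177.64
```

Note the MLE of σ² divides by n, so (as shown in Lesson 8) it is not an unbiased estimator of σ².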
Statistics III
July 15, 2020
来嶋 秀治 (Shuji Kijima)
Dept. Informatics,
Graduate School of ISEE
Today's topics
• interval estimation (区間推定)
• hypothesis testing (仮説検定)
• t-test
• χ²-test
確率統計特論 (Probability & Statistics)
Lesson 10
1. Interval estimation
Statistical Inference (統計的推定)
point estimation (点推定)
consistent estimation (一致推定)
unbiased estimation (不偏推定)
maximum likelihood (最尤推定)
interval estimation (区間推定)
Statistical inference

Example 1
A clerk says "our eggs are big: 70[g] on average."
X̄ = 66.3[g], s² = 17.584[g²] for 6 eggs.
Suppose σ² = 18.0 for simplicity.

Let z* (> 0) satisfy
Pr[−z* ≤ (X̄ − μ)/(σ/√n) ≤ z*] ≥ 0.95.
By the central limit theorem,
Pr[−z* ≤ (X̄ − μ)/(σ/√n) ≤ z*] ≈ ∫_{−z*}^{z*} (1/√(2π)) exp(−x²/2) dx
… and we see that z* = 1.960 (see a normal distribution table).
This yields the "two-sided 95% confidence interval" (両側95%信頼区間).
Normal distribution
Wikipedia: Standard normal table
http://en.wikipedia.org/wiki/Normal_distribution
Statistical inference

Example 1 (cont.)
A clerk says "our eggs are big: 70[g] on average."
X̄ = 66.3[g], s² = 17.584[g²] for 6 eggs.
Suppose σ² = 18.0 for simplicity.

With X̄ = 66.3[g], z* = 1.960, σ² = 18.0, n = 6:
Pr[−z* ≤ (X̄ − μ)/(σ/√n) ≤ z*]
  = Pr[−z*σ/√n ≤ X̄ − μ ≤ z*σ/√n]
  = Pr[−X̄ − z*σ/√n ≤ −μ ≤ −X̄ + z*σ/√n]
  = Pr[X̄ − z*σ/√n ≤ μ ≤ X̄ + z*σ/√n]
  = Pr[66.3 − 1.960·√(18/6) ≤ μ ≤ 66.3 + 1.960·√(18/6)]
  = Pr[62.91 ≤ μ ≤ 69.69]
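The interval above can be reproduced numerically with the slide's values (X̄ = 66.3, σ² = 18.0, n = 6, z* = 1.960):

```python
import math

xbar, sigma2, n, z_star = 66.3, 18.0, 6, 1.960
half = z_star * math.sqrt(sigma2 / n)   # half-width z* * sigma / sqrt(n)
lo, hi = xbar - half, xbar + half       # two-sided 95% confidence interval
print(round(lo, 2), round(hi, 2))       # -> 62.91 69.69
```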
2. Hypothesis testing (仮説検定)
Hypothesis testing (仮説検定)

Terminology
• null hypothesis (帰無仮説)
• alternative hypothesis (対立仮説)

Idea
Pr[null hypo is true] is small
⇒ reject the null hypothesis at significance level α
(有意水準αで帰無仮説を棄却する)
Pr[null hypo is true] is not small
⇒ fail to reject the null hypothesis at significance level α
(有意水準αで帰無仮説を棄却しない)
Statistical inference

Example 1 (cont.)
A clerk says "our eggs are big: 70[g] on average."
You bought 6 eggs in a shop.
How large are the eggs sold in this shop?
X̄ = 66.3[g], s² = 17.584[g²]
Is the clerk honest?
egg         1    2    3    4    5    6
weight[g] 64.3 70.4 63.2 67.8 71.3 60.8

Pr[−z* ≤ (X̄ − μ)/(σ/√n) ≤ z*]
  = Pr[−z*σ/√n ≤ X̄ − μ ≤ z*σ/√n]
  = Pr[μ − z*σ/√n ≤ X̄ ≤ μ + z*σ/√n]
  = Pr[70 − 1.960·√(18/6) ≤ X̄ ≤ 70 + 1.960·√(18/6)]
  = Pr[66.6 ≤ X̄ ≤ 73.4]
Statistical inference

Example 1 (cont.)
A clerk says "our eggs are big: 70[g] on average."
X̄ = 66.3[g], s² = 17.584[g²] for 6 eggs.
Assume μ = 70.0 (null hypothesis); suppose σ² = 18.0 for simplicity.
With μ = 70, z* = 1.960, σ² = 18.0, n = 6, the observed X̄ = 66.3[g]
falls outside [66.6, 73.4], so
the null hypothesis μ = 70.0 is rejected at significance level 5%
(帰無仮説 μ = 70.0 は有意水準5%で棄却される).
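Equivalently, one can standardize X̄ and compare |z| with z*. A minimal sketch of this two-sided z-test with the slide's numbers:

```python
import math

mu0, sigma2, n, xbar, z_star = 70.0, 18.0, 6, 66.3, 1.960
z = (xbar - mu0) / math.sqrt(sigma2 / n)  # standardized test statistic
reject = abs(z) > z_star                  # reject at significance level 5%
print(round(z, 3), reject)                # -> -2.136 True
```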
Statistics IV
July 22, 2020
来嶋 秀治 (Shuji Kijima)
Dept. Informatics,
Graduate School of ISEE
Today's topics
• linear regression (線形回帰)
• simple regression (単回帰)
• multiple regression (重回帰)
• autoregression (自己回帰)
• model selection, AIC (モデル選択)
確率統計特論 (Probability & Statistics)
Lesson 11
Ex. Advertisement

Question
How does y increase as x increases?
year            1   2   3   4   5   6   7   8
x: ad. cost     8  11  13  10  15  19  17  20
y: sale amount 115 124 138 120 151 186 169 193
[Scatter plot of y (sale amount) against x (ad. cost)]
Least Square Estimator

Question
How does y increase as x increases?

Linear regression (線形回帰)
Suppose yᵢ = α + βxᵢ + eᵢ where eᵢ ∼ N(0, σ²).
Estimate α and β such that
min Σᵢ₌₁ⁿ (yᵢ − (α + βxᵢ))²

year            1   2   3   4   5   6   7   8
x: ad. cost     8  11  13  10  15  19  17  20
y: sale amount 115 124 138 120 151 186 169 193
Least Square Estimator

Linear regression (線形回帰)
Suppose yᵢ = α + βxᵢ + eᵢ where eᵢ ∼ N(0, σ²).
Estimate α and β such that min Σᵢ₌₁ⁿ (yᵢ − (α + βxᵢ))².

Let g(α, β) := Σᵢ₌₁ⁿ (yᵢ − (α + βxᵢ))².
∂/∂α g(α, β) = Σᵢ₌₁ⁿ −2(yᵢ − (α + βxᵢ))
∂/∂β g(α, β) = Σᵢ₌₁ⁿ (−2xᵢ)(yᵢ − (α + βxᵢ))
Setting ∂g/∂α = 0 and ∂g/∂β = 0:
α + β x̄ = ȳ
α x̄ + β·mean(x²) = mean(xy)

⇒ β̂ = (mean(xy) − x̄·ȳ) / (mean(x²) − x̄²)
   α̂ = ȳ − β̂ x̄

Prop.
E[β̂] = E[s_xy / s_x²] = β
E[α̂] = E[ȳ − β̂ x̄] = E[ȳ] − E[β̂] x̄ = α
i.e., α̂ and β̂ are unbiased estimators.
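The least-squares formulas above, applied directly to the advertisement data:

```python
x = [8, 11, 13, 10, 15, 19, 17, 20]              # ad. cost
y = [115, 124, 138, 120, 151, 186, 169, 193]     # sale amount
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
xybar = sum(xi * yi for xi, yi in zip(x, y)) / n  # mean(xy)
x2bar = sum(xi * xi for xi in x) / n              # mean(x^2)

beta = (xybar - xbar * ybar) / (x2bar - xbar ** 2)  # beta-hat
alpha = ybar - beta * xbar                          # alpha-hat
print(round(alpha, 2), round(beta, 2))
```

Each additional unit of ad. cost is estimated to raise the sale amount by about β̂ ≈ 6.9.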
Statistics V
July 29, 2020
来嶋 秀治 (Shuji Kijima)
Dept. Informatics,
Graduate School of ISEE
Today's topics
• Bayes estimation
• MAP estimation
確率統計特論 (Probability & Statistics)
Lesson 12
Bayesian inference (for discrete Θ)

Thm. (Bayes; ベイズの定理)
Let X ∼ f(x; θ) where θ is an unknown parameter(s).
Now suppose we obtain sample X = z; then
w′(θ | z) = w(θ)·f(z | θ) / Σ_{θ′∈Θ} w(θ′)·f(z | θ′)
where
w(θ) is the prior probability distribution (事前分布) of θ,
w′(θ | z) is the posterior probability distribution (事後分布) of θ.
Rem. w′(θ | z) ∝ w(θ)·L(θ | z),
L(θ | z) := f(z | θ): the likelihood function.
Bayesian inference (for continuous Θ)

Thm. (Bayes; ベイズの定理)
Let X ∼ f(x; θ) where θ is an unknown parameter(s).
Now suppose we obtain sample X = z; then
w′(θ | z) = w(θ)·f(z | θ) / ∫_Θ w(θ′)·f(z | θ′) dθ′
where
w(θ) is the prior probability density (事前分布) of θ,
w′(θ | z) is the posterior probability density (事後分布) of θ.
Rem. w′(θ | z) ∝ w(θ)·L(θ | z),
L(θ | z) := f(z | θ): the likelihood function.
Conjugate Prior

Distribution    Conjugate Prior
Binomial        Beta
Poisson         Gamma
Normal          Normal
Multinomial     Dirichlet
Ex. Bayesian inference

Let X ∼ Po(λ) where λ > 0 is an unknown parameter,
but we know a prior distribution of λ: λ ∼ Ga(α, ν).
Now suppose we obtain sample X = k.
Q. Compute the posterior w′(λ | k).
Gamma density Ga(α, ν): f(x) = (α^ν / Γ(ν)) x^(ν−1) exp(−αx)

w′(λ | k) ∝ w(λ)·L(λ | k) = w(λ)·f(k | λ)
  = (α^ν / Γ(ν)) λ^(ν−1) exp(−αλ) · (λ^k / k!) exp(−λ)
  ∝ λ^(ν+k−1) exp(−(α + 1)λ)
  ∝ ((α + 1)^(ν+k) / Γ(ν + k)) λ^(ν+k−1) exp(−(α + 1)λ),
hence the posterior is Ga(α + 1, ν + k): the Gamma prior is conjugate.
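A numerical sanity check of the conjugacy, with arbitrary hypothetical values α = 2, ν = 3, k = 4: normalizing w(λ)·f(k | λ) on a grid recovers the Ga(α + 1, ν + k) density.

```python
import math

alpha, nu, k = 2.0, 3.0, 4  # hypothetical prior Ga(alpha, nu) and observation X = k

def gamma_pdf(x, a, v):
    # Ga(a, v) density: a^v / Gamma(v) * x^(v-1) * exp(-a x)
    return a ** v / math.gamma(v) * x ** (v - 1) * math.exp(-a * x)

def unnormalized_posterior(lam):
    # prior w(lambda) times Poisson likelihood f(k | lambda)
    return gamma_pdf(lam, alpha, nu) * lam ** k * math.exp(-lam) / math.factorial(k)

# normalize on a grid and compare pointwise against Ga(alpha+1, nu+k)
h = 0.001
xs = [h * i for i in range(1, 30001)]
Z = sum(unnormalized_posterior(x) for x in xs) * h
max_err = max(abs(unnormalized_posterior(x) / Z - gamma_pdf(x, alpha + 1, nu + k))
              for x in xs)
print(max_err)  # small: the normalized posterior matches Ga(alpha+1, nu+k)
```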
MAP estimation
Maximum a posteriori estimator

A MAP (maximum a posteriori) estimator is given by
θ* = argmax_θ w′(θ | z),
meaning that θ* maximizes the posterior probability/density.

Note.
"Bayes estimator" usually refers to the posterior distribution,
while the MAP estimator is the θ* maximizing the posterior.
Maximum likelihood estimator and MAP estimator

Rem. Posterior density
w′(θ | z) = w(θ)·f(z | θ) / Σ_{θ′∈Θ} w(θ′)·f(z | θ′)
where w(θ) is the prior density.

Prop.
The posterior density satisfies
w′(θ | z) ∝ w(θ)·L(θ | z)
where L(θ | z) = f(z | θ) is the likelihood function.

Roughly speaking:
A maximum likelihood estimator maximizes the likelihood function,
which is an artificial concept closely related to, but not exactly
the same as, the joint probability.
A MAP estimator maximizes the posterior probability, artificially
assuming a prior distribution.
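For the Poisson-Gamma example, the MAP estimate is the mode of the Ga(α + 1, ν + k) posterior, which is (ν + k − 1)/(α + 1) when ν + k ≥ 1, while the maximum likelihood estimate from the single observation X = k is λ* = k. A sketch with hypothetical numbers:

```python
alpha, nu, k = 2.0, 3.0, 4  # hypothetical prior Ga(alpha, nu), observation X = k

# MAP: mode of the Ga(alpha+1, nu+k) posterior, (shape - 1) / rate for shape >= 1
map_est = (nu + k - 1) / (alpha + 1)
# MLE from the single Poisson observation: lambda* = k
ml_est = float(k)
print(map_est, ml_est)  # -> 2.0 4.0
```

The prior pulls the MAP estimate away from the raw MLE toward values it considers plausible.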
Further Topics

Probability Theory:
probability space; distribution, expectation, variance;
stochastic inequalities, law of large numbers, central limit theorem.
Statistics: Estimation, Test, Regression.

Related fields:
Optimization
Experimental Design
Machine Learning
Probability Theory
Information Theory
Calculus, Linear Algebra
Computer Science (Programming, Algorithms)
Data Science
Data Mining
Pattern Recognition
Further topics
Probability Inequalities
Stochastic Process
Markov process
Brownian motion/stochastic diff. eq.
Martingale
Ergodic theory
Multivariate Statistics
Principal component analysis (主成分分析)
Machine Learning
SVM
NMF
Deep Learning / neural network
Data Mining