Chapter 2

Revision of Bayesian inference

Bayes Theorem

Notation

Parameter vector: $\theta = (\theta_1, \ldots, \theta_p)^T$

Data: $x = (x_1, x_2, \ldots, x_n)^T$.

Bayes Theorem for events

For two events $E$ and $F$, the conditional probability of $E$ given $F$ is:

\[
\Pr(E \mid F) = \frac{\Pr(E \cap F)}{\Pr(F)}.
\]

This gives us the simplest version of Bayes theorem:

\[
\Pr(E \mid F) = \frac{\Pr(F \mid E)\,\Pr(E)}{\Pr(F)}.
\]

Bayes Theorem

If $E_1, E_2, \ldots, E_n$ is a partition, i.e. the events $E_i$ are mutually exclusive with $\Pr(E_1 \cup \cdots \cup E_n) = 1$, then

\[
\Pr(F) = \Pr\big((F \cap E_1) \cup (F \cap E_2) \cup \cdots \cup (F \cap E_n)\big)
       = \sum_{i=1}^{n} \Pr(F \cap E_i)
       = \sum_{i=1}^{n} \Pr(F \mid E_i)\,\Pr(E_i).
\]

This gives us the more commonly used version of Bayes theorem:

\[
\Pr(E_i \mid F) = \frac{\Pr(F \mid E_i)\,\Pr(E_i)}{\sum_{j=1}^{n} \Pr(F \mid E_j)\,\Pr(E_j)}.
\]

Comments

Bayes theorem is useful because it tells us how to turn probabilities around.

Often we are able to understand the probability of outcomes $F$ conditional on various hypotheses $E_i$.

We can then compute probabilities of the form $\Pr(F \mid E_i)$.

However, when we actually observe some outcome, we are interested in which hypothesis is most likely to be true, or in other words, we are interested in the probabilities of the hypotheses conditional on the outcome, $\Pr(E_i \mid F)$.

Bayes theorem tells us how to compute these posterior probabilities, but we need to use our prior belief $\Pr(E_i)$ for each hypothesis.

In this way, Bayes theorem gives us a coherent way to update our prior beliefs into posterior probabilities following an outcome $F$.

Inference for discrete parameters and data

Suppose our data and parameters have discrete distributions.

The hypotheses $E_i$ are values for our parameter(s) $\Theta$, while the outcome $F$ is a set of data $X$.

Bayes theorem is then

\[
\Pr(\Theta = \theta \mid X = x) = \frac{\Pr(X = x \mid \Theta = \theta)\,\Pr(\Theta = \theta)}{\sum_{t} \Pr(X = x \mid \Theta = t)\,\Pr(\Theta = t)}.
\]

Example 2.1 (Drug testing in athletes)

Due to inaccuracies in drug testing procedures (e.g., false positives and false negatives), in the medical field the results of a drug test represent only one factor in a physician's diagnosis.

Yet when Olympic athletes are tested for illegal drug use (i.e. doping), the results of a single test are used to ban the athlete from competition.

In Chance (Spring 2004), University of Texas biostatisticians D. A. Berry and L. Chastain demonstrated the application of Bayes's Rule for making inferences about testosterone abuse among Olympic athletes.

They used the following example.

Example 2.1 (Drug testing in athletes)

In a population of 1000 athletes, suppose 100 are illegally using testosterone. Of the users, suppose 50 would test positive for testosterone. Of the nonusers, suppose 9 would test positive.

If an athlete tests positive for testosterone, use Bayes theorem to find the probability that the athlete is really doping.

Let $\theta = 1$ be the event that the athlete is a user, and $\theta = 0$ a nonuser; and let $x = 1$ be the event of a positive test result. We need to consider the following questions.

What is the prior information for θ?

What is the observation?

What is the posterior distribution of θ after we get the data?
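The posterior probabilities can be computed directly from the discrete version of Bayes theorem above. A minimal sketch (Python), using only the counts stated in the example:

    # Prior and likelihood from the stated counts: 100 users in 1000 athletes,
    # 50 of the 100 users test positive, 9 of the 900 nonusers test positive.
    prior = {1: 100 / 1000, 0: 900 / 1000}       # Pr(theta)
    likelihood = {1: 50 / 100, 0: 9 / 900}       # Pr(x = 1 | theta)

    # Denominator: total probability of a positive test.
    evidence = sum(likelihood[t] * prior[t] for t in prior)

    # Posterior Pr(theta | x = 1) for each hypothesis.
    posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}
    print(posterior)   # {1: 0.847..., 0: 0.152...}

So although only 1% of nonusers would test positive, the posterior probability that a positive-testing athlete is really a user is about 0.85; roughly 15% of positive tests would come from nonusers.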

Inference for continuous parameters and data

Consider a continuous parameter $\Theta$, and observed data $X$. (Notice that both data and parameters are random variables.)

Our prior beliefs about the parameters define the prior distribution for $\Theta$, specified by a density $\pi(\theta)$.

We formulate a model that defines the distribution of $X$ given the parameters, i.e. we specify a density $f_{X \mid \Theta}(x \mid \theta)$.

This can be regarded as a function of $\theta$ once we have some fixed observed data $x$, and is then called the likelihood:

\[
L(\theta \mid x) = f_{X \mid \Theta}(x \mid \theta).
\]

Inference for continuous parameters and data

The prior and likelihood determine the full joint density over data and parameters:

\[
f_{\Theta,X}(\theta, x) = \pi(\theta)\,L(\theta \mid x).
\]

Given the joint density we are then able to compute its marginals and conditionals:

\[
f_X(x) = \int_{\Theta} f_{\Theta,X}(\theta, x)\, d\theta = \int_{\Theta} \pi(\theta)\,L(\theta \mid x)\, d\theta
\]

and

\[
f_{\Theta \mid X}(\theta \mid x) = \frac{f_{\Theta,X}(\theta, x)}{f_X(x)} = \frac{\pi(\theta)\,L(\theta \mid x)}{\int_{\Theta} \pi(\theta)\,L(\theta \mid x)\, d\theta}.
\]

$f_{\Theta \mid X}(\theta \mid x)$ is known as the posterior density, and is usually denoted $\pi(\theta \mid x)$.

Inference for continuous parameters and data

This leads to the continuous version of Bayes theorem:

\[
\pi(\theta \mid x) = \frac{\pi(\theta)\,L(\theta \mid x)}{\int_{\Theta} \pi(\theta)\,L(\theta \mid x)\, d\theta}.
\]

Now, the denominator is not a function of $\theta$, so we can in fact write this as

\[
\pi(\theta \mid x) \propto \pi(\theta)\,L(\theta \mid x),
\]

where the constant of proportionality is chosen to ensure that the density integrates to one. Hence, the posterior is proportional to the prior times the likelihood.
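For a one-dimensional parameter this normalisation can be done numerically on a grid. A minimal sketch (Python); the data, the $N(\theta, 1)$ model, the $N(0, 2^2)$ prior and the grid are illustrative assumptions, not part of the notes:

    import numpy as np
    from scipy.stats import norm

    x = np.array([1.2, 0.7, 1.9, 1.4])            # illustrative data
    theta = np.linspace(-3, 5, 2001)              # grid over the parameter

    prior = norm.pdf(theta, loc=0, scale=2)       # pi(theta): N(0, 2^2)
    # Likelihood L(theta | x) for a N(theta, 1) model: product over observations.
    lik = np.prod(norm.pdf(x[:, None], loc=theta[None, :], scale=1), axis=0)

    kernel = prior * lik                          # pi(theta) L(theta | x), unnormalised
    post = kernel / np.trapz(kernel, theta)       # normalise so the density integrates to one

    print(np.trapz(post, theta))                  # approximately 1
    print(theta[np.argmax(post)])                 # posterior mode

Only the kernel $\pi(\theta)\,L(\theta \mid x)$ is ever evaluated; the trapezoidal integral supplies the constant of proportionality.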

Bayesian computation

That should be everything!

However, typically only the kernel can be written in closed form, that is

\[
\pi(\theta)\,L(\theta \mid x).
\]

The denominator has the form

\[
\int_{\Theta} \pi(\theta)\,L(\theta \mid x)\, d\theta
\]

and this integral is often intractable.

We can try to evaluate this numerically, but

the integral may be very high dimensional, i.e. there might be many parameters, and the support of $\Theta$ may be very complicated.

Similar problems arise when we want to marginalize or work out an expectation.
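When the dimension is low and the prior is easy to sample from, the denominator $\int_{\Theta} \pi(\theta)\,L(\theta \mid x)\, d\theta = \mathrm{E}_{\pi}[L(\theta \mid x)]$ can at least be approximated by simple Monte Carlo. A minimal sketch (Python), reusing the illustrative $N(\theta, 1)$ model and $N(0, 2^2)$ prior from the earlier sketch:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = np.array([1.2, 0.7, 1.9, 1.4])           # illustrative data, as above

    def likelihood(theta):
        # L(theta | x) for the N(theta, 1) model, vectorised over an array of theta values
        return np.prod(norm.pdf(x[:, None], loc=theta[None, :], scale=1), axis=0)

    # The denominator is the prior expectation of the likelihood, so averaging the
    # likelihood over draws from the prior gives a simple Monte Carlo estimate of it.
    theta_prior = rng.normal(loc=0, scale=2, size=100_000)
    print(likelihood(theta_prior).mean())        # estimate of the normalising constant

In high dimensions this naive estimator breaks down, because almost all prior draws land where the likelihood is negligible.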

Example 2.2 (Normal with unknown mean and variance)

Model: $X_i \mid \mu, \tau \sim N(\mu, 1/\tau)$.

Likelihood for a single observation:

\[
L(\mu, \tau \mid x_i) = f(x_i \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left\{ -\frac{\tau}{2} (x_i - \mu)^2 \right\}
\]

So for $n$ independent observations, $x = (x_1, \ldots, x_n)^T$,

\[
L(\mu, \tau \mid x) = f(x \mid \mu, \tau) = \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}} \exp\left\{ -\frac{\tau}{2} (x_i - \mu)^2 \right\}.
\]

Example 2.2 (Normal with unknown mean and variance)

\[
L(\mu, \tau \mid x) = \left( \frac{\tau}{2\pi} \right)^{n/2} \exp\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - \mu)^2 \right\}
\propto \tau^{n/2} \exp\left\{ -\frac{n\tau}{2} \left[ s^2 + (\bar{x} - \mu)^2 \right] \right\},
\]

where

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.
\]
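This rearrangement in terms of the sufficient statistics $\bar{x}$ and $s^2$ is easy to check numerically. A minimal sketch (Python); the simulated data are purely illustrative:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    x = rng.normal(loc=2.0, scale=1.5, size=50)   # illustrative data
    n, xbar, s2 = len(x), x.mean(), x.var()       # np.var divides by n, matching s^2 here

    def loglik_direct(mu, tau):
        # sum of log N(x_i; mu, 1/tau) densities
        return norm.logpdf(x, loc=mu, scale=1 / np.sqrt(tau)).sum()

    def loglik_sufficient(mu, tau):
        # (n/2) log(tau/(2 pi)) - (n tau / 2) [s^2 + (xbar - mu)^2]
        return 0.5 * n * np.log(tau / (2 * np.pi)) - 0.5 * n * tau * (s2 + (xbar - mu) ** 2)

    print(loglik_direct(1.8, 0.5), loglik_sufficient(1.8, 0.5))   # the two forms agree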

Example 2.2 (Normal with unknown mean and variance)

Conjugate prior specification possible:

\[
\tau \sim \text{Gamma}(g, h), \qquad \mu \mid \tau \sim N\!\left(b, \frac{1}{c\tau}\right).
\]

Undesirable if our prior beliefs for $\mu$ and $\tau$ separate into independent specifications.

Take independent priors for the parameters:

\[
\tau \sim \text{Gamma}(g, h), \qquad \mu \sim N\!\left(b, \frac{1}{c}\right).
\]

Specification no longer conjugate, making analytic analysis intractable. Let us see why!
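With the independent specification we can at least simulate from the prior, for instance to check that it represents sensible beliefs. A minimal sketch (Python); the hyperparameter values below are illustrative assumptions, and note that numpy parameterises the gamma by shape and scale, so scale = 1/h for a Gamma(g, h) with rate h:

    import numpy as np

    rng = np.random.default_rng(3)
    g, h = 2.0, 1.0        # Gamma(g, h): shape g, rate h   (illustrative values)
    b, c = 0.0, 0.1        # N(b, 1/c): mean b, precision c (illustrative values)

    tau = rng.gamma(shape=g, scale=1 / h, size=10_000)           # tau ~ Gamma(g, h)
    mu = rng.normal(loc=b, scale=1 / np.sqrt(c), size=10_000)    # mu ~ N(b, 1/c), independent of tau

    # Prior predictive draws X | mu, tau ~ N(mu, 1/tau): a quick sanity check on the prior.
    x_pred = rng.normal(loc=mu, scale=1 / np.sqrt(tau))
    print(mu.mean(), tau.mean(), x_pred.std())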

Example 2.2 (Normal with unknown mean and variance)

\[
\pi(\mu) = \sqrt{\frac{c}{2\pi}} \exp\left\{ -\frac{c}{2} (\mu - b)^2 \right\} \propto \exp\left\{ -\frac{c}{2} (\mu - b)^2 \right\}
\]

\[
\pi(\tau) = \frac{h^g}{\Gamma(g)}\, \tau^{g-1} \exp\{ -h\tau \} \propto \tau^{g-1} \exp\{ -h\tau \}
\]

So

\[
\pi(\mu, \tau) \propto \tau^{g-1} \exp\left\{ -\frac{c}{2} (\mu - b)^2 - h\tau \right\}
\]

and we obtain

\begin{align*}
\pi(\mu, \tau \mid x) &\propto \tau^{g-1} \exp\left\{ -\frac{c}{2} (\mu - b)^2 - h\tau \right\} \times \tau^{n/2} \exp\left\{ -\frac{n\tau}{2} \left[ s^2 + (\bar{x} - \mu)^2 \right] \right\} \\
&= \tau^{g + \frac{n}{2} - 1} \exp\left\{ -\frac{n\tau}{2} \left[ s^2 + (\bar{x} - \mu)^2 \right] - \frac{c}{2} (\mu - b)^2 - h\tau \right\}.
\end{align*}
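The kernel in the final line is simple to evaluate pointwise, so we can compute it over a grid of $(\mu, \tau)$ values and plot the bivariate surface. A minimal sketch (Python); the simulated data and the hyperparameter values are illustrative assumptions:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x = rng.normal(loc=2.0, scale=1.5, size=50)   # illustrative data
    n, xbar, s2 = len(x), x.mean(), x.var()
    g, h, b, c = 2.0, 1.0, 0.0, 0.1               # illustrative hyperparameters

    def log_kernel(mu, tau):
        # log of tau^(g + n/2 - 1) exp{ -n tau/2 [s2 + (xbar - mu)^2] - c/2 (mu - b)^2 - h tau }
        return ((g + n / 2 - 1) * np.log(tau)
                - 0.5 * n * tau * (s2 + (xbar - mu) ** 2)
                - 0.5 * c * (mu - b) ** 2
                - h * tau)

    M, T = np.meshgrid(np.linspace(xbar - 2, xbar + 2, 200), np.linspace(0.01, 2.0, 200))
    LK = log_kernel(M, T)
    plt.contour(M, T, np.exp(LK - LK.max()))      # rescaling avoids underflow; constants don't matter
    plt.xlabel("mu"); plt.ylabel("tau")
    plt.show()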

Example 2.2 (Normal with unknown mean and variance)

We have the kernel of the posterior for $\mu$ and $\tau$, but it is not in a standard form.

We can gain some idea of the likely values of $(\mu, \tau)$ by plotting the bivariate surface (the integration constant isn't necessary for that).

We cannot work out the posterior mean or variance, or the forms of the marginal posterior distributions for $\mu$ or $\tau$, since we cannot integrate out the other variable.

There is nothing particularly special about the fact that the density represents a Bayesian posterior.

Given any complex non-standard probability distribution, we need ways to understand it, to calculate its moments, and to compute its conditional and marginal distributions and their moments.

Stochastic simulation is one possible solution.
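As a preview of what stochastic simulation can do here, a minimal random-walk Metropolis sketch (Python) that draws from the non-standard posterior kernel above; this method is not covered in this chapter, and the data, hyperparameters and step sizes are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(loc=2.0, scale=1.5, size=50)   # illustrative data, as before
    n, xbar, s2 = len(x), x.mean(), x.var()
    g, h, b, c = 2.0, 1.0, 0.0, 0.1               # illustrative hyperparameters

    def log_kernel(mu, tau):
        # log posterior kernel; -inf outside the support tau > 0
        if tau <= 0:
            return -np.inf
        return ((g + n / 2 - 1) * np.log(tau)
                - 0.5 * n * tau * (s2 + (xbar - mu) ** 2)
                - 0.5 * c * (mu - b) ** 2
                - h * tau)

    # Propose symmetric normal jumps and accept with probability min(1, kernel ratio);
    # the unknown normalising constant cancels in the ratio.
    mu, tau = xbar, 1.0 / s2
    samples = np.empty((10_000, 2))
    for i in range(len(samples)):
        mu_p, tau_p = mu + 0.3 * rng.standard_normal(), tau + 0.1 * rng.standard_normal()
        if np.log(rng.uniform()) < log_kernel(mu_p, tau_p) - log_kernel(mu, tau):
            mu, tau = mu_p, tau_p
        samples[i] = mu, tau

    print(samples[1000:].mean(axis=0))   # crude estimates of the posterior means of (mu, tau)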

Tutorial questions

1. Describe a Monte Carlo algorithm to calculate the normalising constant in a Bayesian analysis.

2. Describe a weighted resampling algorithm for generating draws from a posterior $\pi(\theta \mid x)$ (known up to proportionality) using a proposal with density $\pi(\theta)$, i.e. the prior. What are the weights? How can the weights be used to estimate the unknown normalising constant?