Bayesian scoring functions for Bayesian Belief Networks
BAYESIAN SCORING FUNCTIONS FOR BAYESIAN BELIEF NETWORKS (BBNS)
Jee Vang, [email protected]
Version 3.1 This work is licensed under a Creative Commons Attribution 3.0 Unported License.
PURPOSE AND OUTLINE
Purpose: concisely illustrate how some Bayesian scoring functions used to score Bayesian belief networks (BBNs) have been established
Outline:
  Define a BBN
  Cover what Bayesian scoring functions are based on
    Basic mathematical functions (factorial, gamma, and Beta functions)
    Probability distributions (multinomial, Dirichlet, Dirichlet-multinomial)
    Bayes’ Theorem
    Assumptions
  Give a few Bayesian scoring function examples (BD, K2, BDe, BDeu)
DEFINITION OF A BBN
A BBN is defined as a pair (G,P), where G and P are themselves defined as follows
G is a pair, G(V,E)
  V={V1,…,Vn} is a set of vertices with a one-to-one correspondence with a set of random variables X={X1,…,Xn} (sometimes, Vi and Xi are used interchangeably)
  E is a set of directed edges; Eij denotes Vi → Vj or, equivalently, Xi → Xj
  G is a special type of graph, called an acyclic directed graph (ADG): there is no path starting at any Vi and leading back to itself in the direction of the arrows
  For any Xi → Xj, Xi is said to be a parent of Xj
  All the parents of Xi are denoted pa(Xi)
  G is called the structure of a BBN
P is a joint probability distribution over X
  Chain rule (with the variables ordered so that parents come before their children)
  $$P(X_1,\ldots,X_n) = \prod_{i=1}^{n} P(X_i \mid X_1,\ldots,X_{i-1})$$
  P satisfies the Markov condition, which states that a variable is conditionally independent of its non-descendants given its parents, so the chain rule reduces to
  $$P(X_1,\ldots,X_n) = \prod_{i=1}^{n} P(X_i \mid pa(X_i))$$
  P is called the parameters of a BBN
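The definition above can be illustrated with a minimal Python sketch. The three-variable structure, the binary values, and every probability below are hypothetical and chosen only for illustration; the names `parents`, `cpt`, `is_acyclic`, and `joint` are not from the original slides. The sketch checks that the structure is acyclic and evaluates the joint probability as a product of P(Xi | pa(Xi)) terms.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Hypothetical structure G: X1 -> X3 <- X2, stored as variable -> pa(Xi).
parents = {"X1": (), "X2": (), "X3": ("X1", "X2")}

# Hypothetical parameters P for binary variables:
# cpt[X][parent values] = P(X = 1 | pa(X) = parent values).
cpt = {
    "X1": {(): 0.3},
    "X2": {(): 0.6},
    "X3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}

def is_acyclic(parent_map):
    """Return True if the parent map describes an acyclic directed graph."""
    try:
        # TopologicalSorter expects node -> predecessors, i.e. node -> parents.
        list(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False

def joint(assignment):
    """P(X1,...,Xn) as the product of P(Xi | pa(Xi)) over all variables."""
    p = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

print(is_acyclic(parents))
print(joint({"X1": 1, "X2": 1, "X3": 1}))
```

Storing the graph as a map from each variable to its parents keeps the structure (G) and the conditional probability tables (P) indexed the same way, which is also the layout the scoring functions need later.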
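BASIC MATHEMATICAL FUNCTIONS
The factorial function for a positive integer, $n$, is defined as follows.
$$n! = \prod_{i=1}^{n} i = 1 \times 2 \times \cdots \times n$$
The gamma function generalizes the factorial function, and for a positive integer, $n$, is defined as follows.
$$\Gamma(n) = (n-1)!$$
The Beta function for a set of k positive integers, $x_1,\ldots,x_k$, is defined as follows.
$$B(x_1,\ldots,x_k) = \frac{\prod_{i=1}^{k} \Gamma(x_i)}{\Gamma\left(\sum_{i=1}^{k} x_i\right)}$$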
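As a small sketch of these three functions, the Python below works in log space with the standard library's `math.lgamma`, because factorials and gamma values overflow quickly for realistic counts. The helper names `log_factorial` and `log_beta` are illustrative, not from the slides.

```python
from math import lgamma, exp

def log_factorial(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)

def log_beta(xs):
    """log B(x_1,...,x_k) = sum of log Gamma(x_i) minus log Gamma(sum of x_i)."""
    return sum(lgamma(x) for x in xs) - lgamma(sum(xs))

# Gamma(5) equals 4!, and B(2, 3) = Gamma(2) Gamma(3) / Gamma(5).
print(exp(lgamma(5)), exp(log_factorial(4)))
print(exp(log_beta([2, 3])))
```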
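PROBABILITY DISTRIBUTIONS: MULTINOMIAL AND DIRICHLET
The multinomial probability mass function (PMF) for a discrete random variable $X$ with $k$ values is defined as follows.
$$P(\mathbf{N} \mid \boldsymbol{\theta}) = \frac{N!}{\prod_{i=1}^{k} N_i!} \prod_{i=1}^{k} \theta_i^{N_i}, \qquad N = \sum_{i=1}^{k} N_i$$
The Dirichlet probability density function (PDF) for a continuous random variable $\boldsymbol{\theta}$ is defined as follows.
$$P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \qquad B(\boldsymbol{\alpha}) = B(\alpha_1,\ldots,\alpha_k)$$
Note the following.
  $\boldsymbol{\theta} = \{\theta_1,\ldots,\theta_k\}$ is a set of parameters
  $\mathbf{N} = \{N_1,\ldots,N_k\}$ is a set of counts (frequencies)
  $\boldsymbol{\alpha} = \{\alpha_1,\ldots,\alpha_k\}$ is a set of hyperparameters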
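The two distributions on this slide can be sketched in Python as follows. This is a minimal illustration in log space that assumes every parameter in `thetas` is strictly positive and that `thetas` sums to one; the function names are hypothetical.

```python
from math import lgamma, log, exp

def log_multinomial_pmf(counts, thetas):
    """log of the multinomial PMF for counts N_i and parameters theta_i."""
    n_total = sum(counts)
    log_coef = lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)
    return log_coef + sum(n * log(t) for n, t in zip(counts, thetas))

def log_dirichlet_pdf(thetas, alphas):
    """log of the Dirichlet PDF for parameters theta_i and hyperparameters alpha_i."""
    log_beta = sum(lgamma(a) for a in alphas) - lgamma(sum(alphas))
    return -log_beta + sum((a - 1.0) * log(t) for a, t in zip(alphas, thetas))

# Two observations, one of each value, from a fair binary variable.
print(exp(log_multinomial_pmf([1, 1], [0.5, 0.5])))
# With all hyperparameters equal to 1 the Dirichlet density is flat.
print(exp(log_dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```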
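PROBABILITY DISTRIBUTIONS: DIRICHLET-MULTINOMIAL
The Dirichlet-multinomial PMF for a discrete random variable $X$ with $k$ values is defined as follows.
$$P(\mathbf{N}, \boldsymbol{\theta} \mid \boldsymbol{\alpha}) = P(\mathbf{N} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{N!}{\prod_{i=1}^{k} N_i!} \prod_{i=1}^{k} \theta_i^{N_i} \cdot \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}$$
The Dirichlet-multinomial PMF states that the underlying model generating the data is multinomial and that the parameters are Dirichlet distributed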
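PROBABILITY DISTRIBUTION: DIRICHLET-MULTINOMIAL (CONTINUED)
Integrating the Dirichlet-multinomial over $\boldsymbol{\theta}$ gets the marginal joint probability of the data
$$P(\mathbf{N} \mid \boldsymbol{\alpha}) = \int P(\mathbf{N} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{N!}{\prod_{i=1}^{k} N_i!} \cdot \frac{1}{B(\boldsymbol{\alpha})} \int \prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1}\, d\boldsymbol{\theta}$$
Note that $\prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1}$ takes the form of the Dirichlet
$$\prod_{i=1}^{k} \theta_i^{N_i + \alpha_i - 1} = B(\mathbf{N} + \boldsymbol{\alpha})\, P(\boldsymbol{\theta} \mid \mathbf{N} + \boldsymbol{\alpha})$$
Using substitution, we get the following
$$P(\mathbf{N} \mid \boldsymbol{\alpha}) = \frac{N!}{\prod_{i=1}^{k} N_i!} \cdot \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})} \int P(\boldsymbol{\theta} \mid \mathbf{N} + \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{N!}{\prod_{i=1}^{k} N_i!} \cdot \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}$$
Note $\int P(\boldsymbol{\theta} \mid \mathbf{N} + \boldsymbol{\alpha})\, d\boldsymbol{\theta} = 1$, since this integration is over the Dirichlet PDF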
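PROBABILITY DISTRIBUTION: DIRICHLET-MULTINOMIAL (CONTINUED)
Expand the Beta functions $B(\boldsymbol{\alpha})$ and $B(\mathbf{N} + \boldsymbol{\alpha})$
$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}, \qquad B(\mathbf{N} + \boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k} \Gamma(N_i + \alpha_i)}{\Gamma\left(\sum_{i=1}^{k} (N_i + \alpha_i)\right)}$$
Rearrange the terms
$$P(\mathbf{N} \mid \boldsymbol{\alpha}) = \frac{N!}{\prod_{i=1}^{k} N_i!} \cdot \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\Gamma\left(\sum_{i=1}^{k} (N_i + \alpha_i)\right)} \prod_{i=1}^{k} \frac{\Gamma(N_i + \alpha_i)}{\Gamma(\alpha_i)}$$
Drop $\frac{N!}{\prod_{i=1}^{k} N_i!}$ to get the following proportional relationship
$$P(\mathbf{N} \mid \boldsymbol{\alpha}) \propto \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\Gamma\left(\sum_{i=1}^{k} (N_i + \alpha_i)\right)} \prod_{i=1}^{k} \frac{\Gamma(N_i + \alpha_i)}{\Gamma(\alpha_i)}$$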
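The result of the integration can be checked numerically with a short Python sketch. It assumes the standard closed form of the Dirichlet-multinomial (a multinomial coefficient times a ratio of gamma functions) and separates the coefficient from the remaining "kernel", which is the part kept in the proportional relationship. The function names are illustrative only.

```python
from math import lgamma, exp

def log_multinomial_coefficient(counts):
    """log of N! / prod(N_i!), the term dropped in the proportional form."""
    return lgamma(sum(counts) + 1) - sum(lgamma(n + 1) for n in counts)

def log_dm_kernel(counts, alphas):
    """log of Gamma(sum a) / Gamma(sum (N + a)) * prod Gamma(N_i + a_i) / Gamma(a_i)."""
    out = lgamma(sum(alphas)) - lgamma(sum(alphas) + sum(counts))
    for n, a in zip(counts, alphas):
        out += lgamma(n + a) - lgamma(a)
    return out

def log_dm_pmf(counts, alphas):
    """log of the full marginal P(N | alpha), coefficient included."""
    return log_multinomial_coefficient(counts) + log_dm_kernel(counts, alphas)

# Binary variable, three observations, hyperparameters (1, 1):
# the full PMF over the four possible count vectors should sum to 1.
alphas = [1.0, 1.0]
probs = [exp(log_dm_pmf([n1, 3 - n1], alphas)) for n1 in range(4)]
print(probs, sum(probs))
```

Only the kernel depends on how the counts interact with the hyperparameters; the dropped coefficient is fixed once the counts are observed, so it is the same for every choice of hyperparameters.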
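PROBABILITY DISTRIBUTION: DIRICHLET-MULTINOMIAL (CONTINUED)
We can extend to $n$ discrete random variables, $X_1,\ldots,X_n$, as follows
$$P(\mathbf{N} \mid \boldsymbol{\alpha}) \propto \prod_{i=1}^{n} \frac{\Gamma\left(\sum_{j=1}^{k_i} \alpha_{ij}\right)}{\Gamma\left(\sum_{j=1}^{k_i} (N_{ij} + \alpha_{ij})\right)} \prod_{j=1}^{k_i} \frac{\Gamma(N_{ij} + \alpha_{ij})}{\Gamma(\alpha_{ij})}$$
Note
  $\mathbf{N} = (\mathbf{N}_1,\ldots,\mathbf{N}_n)$ is a vector of count vectors
  $\mathbf{N}_i = (N_{i1},\ldots,N_{ik_i})$ is a count vector
  $\boldsymbol{\alpha} = (\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_n)$ is a vector of hyperparameter vectors
  $\boldsymbol{\alpha}_i = (\alpha_{i1},\ldots,\alpha_{ik_i})$ is a hyperparameter vector
  $\mathbf{k} = (k_1,\ldots,k_n)$ is a vector of the number of values corresponding to each $X_i$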
BAYES’ THEOREM
Use Bayes’ Theorem to compute the probability of the BBN structure, $G$, given the data, $D$, written as
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
Drop the prior and marginal likelihood terms because, typically,
  $P(G)$ is assumed to have a uniform distribution
  $P(D)$ is a normalizing constant
Thus,
$$P(G \mid D) \propto P(D \mid G)$$
ASSUMPTIONS
To compute $P(D \mid G)$, some assumptions are needed
The following assumptions have been reported
  The data is generated from multinomial distributions (multinomial sample),
  The parameters associated with each variable in a BBN are independent (parameter independence),
  If a variable has the same parents in two different networks, then the probability density functions of the parameters associated with this node are identical in both networks (parameter modularity),
  The parameters are distributed according to the Dirichlet distribution (Dirichlet),
  There is no missing data (complete data),
  Data should not help discriminate network structures that represent the same assertions of conditional independence (likelihood equivalence), and
  For any complete DAG, G, the probability of G, $P(G)$, is greater than zero (structure possibility)
Note that not all of these assumptions are needed at the same time
  In general, these assumptions are needed: multinomial sample, parameter independence, parameter modularity, complete data
  In general, some of these assumptions may be relaxed (complete data, multinomial sample, Dirichlet, etc.)
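BAYESIAN DIRICHLET (BD) SCORING FUNCTION
Define the following
  $X = \{X_1,\ldots,X_n\}$ is the set of n discrete random variables
  $pa(X_i)$ is the set of parents of $X_i$, and $q_i$ is the number of unique instantiations (configurations) of $pa(X_i)$
  $r_i$ is the number of values for $X_i$
  $N_{ijk}$ is the number of times $X_i$ takes its k-th value and $pa(X_i)$ takes its j-th configuration, and $\alpha_{ijk}$ is the hyperparameter for the same k and j
Then $P(D \mid G)$ may be defined as follows
$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma\left(\sum_{k=1}^{r_i} \alpha_{ijk}\right)}{\Gamma\left(\sum_{k=1}^{r_i} (N_{ijk} + \alpha_{ijk})\right)} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + \alpha_{ijk})}{\Gamma(\alpha_{ijk})}$$
This equation represents the Bayesian Dirichlet (BD) scoring function
Note that this equation looks very similar to the Dirichlet-multinomial relationship for n variables; it differs by having an extra product term for the parents, $\prod_{j=1}^{q_i}$
In fact, if $q_i = 1$ for every $X_i$ (no variable has any parents), then the two expressions coincide
In that case, hyperparameter $\alpha_{i1k}$ corresponds to $\alpha_{ij}$ and count $N_{i1k}$ corresponds to $N_{ij}$ of the Dirichlet-multinomial (the value index k plays the role of j, and $r_i$ the role of $k_i$)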
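A minimal Python sketch of the BD score follows, assuming the standard form of the function (a product over variables, parent configurations, and values of gamma-function ratios). The data set, the structure, and the helper names `family_counts` and `log_bd_score` are hypothetical. Variables are referred to by column index, values are coded 0, 1, ..., and the score is returned in log space.

```python
from itertools import product
from math import lgamma, exp

def family_counts(data, child, parents, arity):
    """N_ijk for one variable: one row per parent configuration j, one column per value k."""
    configs = list(product(*(range(arity[p]) for p in parents)))  # [()] if no parents
    index = {c: j for j, c in enumerate(configs)}
    table = [[0] * arity[child] for _ in configs]
    for row in data:
        table[index[tuple(row[p] for p in parents)]][row[child]] += 1
    return table

def log_bd_score(counts, alphas):
    """log P(D | G) from nested lists counts[i][j][k] and alphas[i][j][k]."""
    total = 0.0
    for n_i, a_i in zip(counts, alphas):          # variables i
        for n_ij, a_ij in zip(n_i, a_i):          # parent configurations j
            total += lgamma(sum(a_ij)) - lgamma(sum(a_ij) + sum(n_ij))
            for n, a in zip(n_ij, a_ij):          # values k
                total += lgamma(n + a) - lgamma(a)
    return total

# Hypothetical complete data over two binary variables and structure X0 -> X1.
data = [(0, 0), (0, 1), (1, 1), (1, 1)]
arity = [2, 2]
structure = {0: (), 1: (0,)}

counts = [family_counts(data, i, structure[i], arity) for i in sorted(structure)]
alphas = [[[1.0] * len(row) for row in table] for table in counts]  # all alpha_ijk = 1
print(counts)
print(log_bd_score(counts, alphas))
```

Because the score is a product over variables, each variable's contribution depends only on its own parent set, so a structure search can rescore just the variables whose parents changed.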
SOME DIFFERENT BAYESIAN SCORING FUNCTIONS
Some different Bayesian scoring functions are variants of the BD scoring function
  Kutató (K2),
  Bayesian Dirichlet equivalent (BDe),
  Bayesian Dirichlet equivalent uniform (BDeu)
K2, BDe, and BDeu differ in the way the values for the hyperparameters are set
  K2: $\alpha_{ijk} = 1$
  BDe: $\alpha_{ijk} = \alpha \times P(X_i = k, pa(X_i) = j \mid G_0)$, where $G_0$ is a prior network
  BDeu: $\alpha_{ijk} = \frac{\alpha}{r_i q_i}$
  For BDe and BDeu, $\alpha$ is called the equivalent sample size (ESS)
There is no widely accepted approach on how to set ESS
The maximum a posteriori (MAP) BBN structure is very sensitive to ESS; choosing different values of ESS may lead to different MAP structures
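SOME DIFFERENT BAYESIAN SCORING FUNCTIONS (CONTINUED)
In full form, K2 and BDeu are defined as follows, where $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$
K2
$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$
BDeu
$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma\left(\frac{\alpha}{q_i}\right)}{\Gamma\left(N_{ij} + \frac{\alpha}{q_i}\right)} \prod_{k=1}^{r_i} \frac{\Gamma\left(N_{ijk} + \frac{\alpha}{r_i q_i}\right)}{\Gamma\left(\frac{\alpha}{r_i q_i}\right)}$$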
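The K2 and BDeu variants can be sketched in Python as below, assuming the standard hyperparameter settings (all hyperparameters equal to 1 for K2, and ESS divided by the number of values times the number of parent configurations for BDeu). The nested `counts[i][j][k]` layout, the example counts, and the function names are hypothetical.

```python
from math import lgamma, exp

def log_k2(counts):
    """log K2 score: factorial form, equivalent to BD with every alpha_ijk = 1."""
    total = 0.0
    for n_i in counts:                       # variables i
        for n_ij in n_i:                     # parent configurations j
            r_i = len(n_ij)
            # log (r_i - 1)! - log (N_ij + r_i - 1)!, written with lgamma
            total += lgamma(r_i) - lgamma(sum(n_ij) + r_i)
            # log prod_k N_ijk!
            total += sum(lgamma(n + 1) for n in n_ij)
    return total

def log_bdeu(counts, ess):
    """log BDeu score with alpha_ijk = ess / (r_i * q_i)."""
    total = 0.0
    for n_i in counts:
        q_i = len(n_i)
        for n_ij in n_i:
            r_i = len(n_ij)
            a_ij, a_ijk = ess / q_i, ess / (r_i * q_i)
            total += lgamma(a_ij) - lgamma(a_ij + sum(n_ij))
            total += sum(lgamma(n + a_ijk) - lgamma(a_ijk) for n in n_ij)
    return total

# Hypothetical counts for two binary variables; the second has one binary parent.
counts = [[[2, 2]], [[1, 1], [0, 2]]]
print(log_k2(counts))
for ess in (0.5, 1.0, 2.0, 10.0):
    print(ess, log_bdeu(counts, ess))  # the score changes with the ESS
```

Running the loop with different ESS values on the same counts gives different scores, which is the mechanism behind the sensitivity of the MAP structure to ESS noted earlier.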
SUMMARY
The Bayesian scoring functions discussed are based on some basic mathematical functions (factorial, gamma, and Beta), probability distributions (multinomial, Dirichlet, Dirichlet-multinomial), Bayes’ Theorem, and assumptions
The Dirichlet-multinomial compound distribution, and the integration of this PMF over the parameters $\boldsymbol{\theta}$, is key to understanding how some Bayesian scoring functions have been established
Variations of some Bayesian scoring functions primarily deal with setting the values of hyperparameters