Logistic Regression - City University of New York
Today
• Logistic Regression – Maximum Entropy Formulation
• Decision Trees Redux – Now using Information Theory
• Graphical Models – Representing conditional dependence graphically
1
Logistic Regression Optimization
• Take the gradient in terms of w
2
E(w) = −ln p(t|w) = −∑_{n=0}^{N−1} [ tn ln yn + (1 − tn) ln(1 − yn) ]

∇w E = ∑_{n=0}^{N−1} (∂E/∂yn)(∂yn/∂an) ∇w an

where yn = p(c1|xn) = σ(an) and an = wT xn
Optimization
• We know the gradient of the error function, but how do we find the maximum value?
• Setting to zero is nontrivial
• Numerical approximation
3
∇w E = ∑_{n=0}^{N−1} (yn − tn) xn
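Since there is no closed-form solution, a standard choice is an iterative method such as gradient descent. Below is a minimal batch gradient-descent sketch (an illustration, not code from the lecture); it assumes a NumPy design matrix X with a bias column already appended, binary targets t, and arbitrary learning-rate and iteration settings.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, t, lr=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy error E(w).

    X : (N, D) design matrix (include a column of ones for the bias term)
    t : (N,) binary targets in {0, 1}
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        y = sigmoid(X @ w)        # y_n = sigma(w^T x_n)
        grad = X.T @ (y - t)      # grad E = sum_n (y_n - t_n) x_n
        w -= lr * grad / N        # averaged gradient keeps the step size stable
    return w
```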
Entropy
• Measure of uncertainty, or Measure of “Information”
• High uncertainty equals high entropy.
• Rare events are more “informative” than common events.
4
H(x) = −∑_x p(x) log2 p(x)
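As a quick illustration (not from the slides): a fair coin has the maximum entropy of 1 bit, while a heavily biased coin, whose rare outcome is individually very informative, has a low average entropy.

```python
import numpy as np

def entropy(p):
    """H(x) = -sum_x p(x) log2 p(x); zero-probability outcomes are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))    # 1.0 bit: maximum uncertainty for two outcomes
print(entropy([0.99, 0.01]))  # ~0.08 bits: low average uncertainty
```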
Examples of Entropy
• Uniform distributions have higher entropy.
5
Maximum Entropy
• Logistic Regression is also known as Maximum Entropy.
• Entropy is concave, so maximizing it is a convex optimization.
– We can expect convergence.
• Constrain this optimization to enforce good classification.
• Increase maximum likelihood of the data while making the distribution of weights most even.
– Include as many useful features as possible.
6
Maximum Entropy with Constraints
• From Klein and Manning Tutorial
7
Optimization formulation
• If we let the weights represent the likelihood of a value for each feature:
8
max H(w; x, t) = −∑_x w log2 w
s.t. wT x = t  (for each feature i)
and ||w||2 = 1
Solving MaxEnt formulation
• Convex optimization with a concave objective function and linear constraints.
• Lagrange Multipliers
9
max H(w; x, t) = −∑_x w log2 w
s.t. wT x = t  (for each feature i)
and ||w||2 = 1

L(w, λ) = −∑_w w log2 w + ∑_{i=1}^{N} λi (wiT xi − t) + λ0 (||w||2 − 1)
Dual representation of the maximum likelihood estimation of Logistic Regression.
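As a rough numerical sketch of how such a constrained entropy maximization could be set up (this is illustrative only, not the lecture's derivation; the feature values x and target t below are made up), SciPy's minimize accepts the feature-expectation and norm constraints directly:

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(w):
    """Negative entropy: minimizing this maximizes H(w) = -sum w log2 w."""
    w = np.clip(w, 1e-12, None)   # avoid log(0)
    return np.sum(w * np.log2(w))

# Toy feature values and target expectation for the constraint w^T x = t.
x = np.array([0.2, 0.5, 0.9, 0.4])
t = 0.5

constraints = [
    {"type": "eq", "fun": lambda w: w @ x - t},             # w^T x = t
    {"type": "eq", "fun": lambda w: np.sum(w ** 2) - 1.0},  # ||w||_2 = 1
]
result = minimize(neg_entropy, x0=np.full(4, 0.5), method="SLSQP",
                  bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(result.x)
```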
Decision Trees
• Nested ‘if’-statements for classification (a hand-written sketch follows this slide)
• Each Decision Tree Node contains a feature and a split point.
• Challenges:
– Determine which feature and split point to use
– Determine which branches are worth including at all (Pruning)
10
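To make the nested-if picture concrete, here is a tiny hand-built tree; the feature names and split points are made up for illustration, loosely following the figure on the next slide.

```python
def classify(person):
    """A tiny hand-built decision tree: each internal node tests one feature
    against a split point, and each leaf returns a class label."""
    if person["height"] < 66:
        if person["weight"] < 140:
            return "f"
        return "m"
    else:
        if person["weight"] < 150:
            return "f"
        return "m"

print(classify({"height": 64, "weight": 120}))  # -> "f"
```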
Decision Trees
11
[Figure: example decision tree. The root splits on color (blue / brown / green); lower nodes split on height (h) and weight (w) thresholds such as <66, <140, <150, <64, <145, <170; leaves are labeled m or f.]
Ranking Branches
• Last time, we used classification accuracy to measure the value of a branch.
12
[Figure: a split on height < 68 divides 6M / 6F into 1M / 5F and 5M / 1F. Accuracy is 50% before the branch and 83.3% after, a 33.3% improvement.]
Ranking Branches
• Measure Decrease in Entropy of the class distribution following the split
13
[Figure: the same height < 68 split of 6M / 6F into 1M / 5F and 5M / 1F, now scored by the entropy of the class distribution before and after the branch rather than by accuracy.]
InfoGain Criterion
• Calculate the decrease in Entropy across a split point (a worked example follows this slide).
• This represents the amount of information contained in the split.
• This is relatively indifferent to the position on the decision tree.
– More applicable to N-way classification.
– Accuracy represents the mode of the distribution.
– Entropy can be reduced while leaving the mode unaffected.
14
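For the height < 68 split above, the information gain can be computed directly; this is a worked sketch (the values are what the standard formula gives, not numbers quoted from the slides).

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# 6M/6F at the parent; the split height < 68 yields 1M/5F and 5M/1F.
parent, left, right = [6, 6], [1, 5], [5, 1]
n = sum(parent)
h_parent = entropy(parent)                                  # 1.0 bit
h_children = (sum(left) / n) * entropy(left) \
           + (sum(right) / n) * entropy(right)              # ~0.65 bits
info_gain = h_parent - h_children                           # ~0.35 bits
print(h_parent, h_children, info_gain)
```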
Graphical Models and Conditional Independence
• More generally about probabilities, but used in classification and clustering.
• Both Linear Regression and Logistic Regression use probabilistic models.
• Graphical Models allow us to structure and visualize probabilistic models and the relationships between variables.
15
(Joint) Probability Tables
• Represent multinomial joint probabilities between K variables as K-dimensional tables
• Assuming D binary variables, how big is this table?
• What if we had multinomials with M entries?
16
p(x) = p(flu?, achiness?, headache?, . . . , temperature?)
Probability Models
• What if the variables are independent?
• If x and y are independent:
• The original distribution can be factored
• How big is this table, if each variable is binary?
17
p(x) = p(flu?, achiness?, headache?, . . . , temperature?)
p(x, y) = p(x)p(y)
p(x) = p(flu?)p(achiness?)p(headache?) . . . p(temperature?)
Conditional Independence
• Independence assumptions are convenient (Naïve Bayes), but rarely true.
• More often some groups of variables are dependent, but others are independent.
• Still others are conditionally independent.
18
Conditional Independence
• If two variables are conditionally independent.
• E.g. y = flu?, x = achiness?, z = headache?
19
p(x, z|y) = p(x|y)p(z|y)
p(x, z) ≠ p(x)p(z)
x ⊥⊥ z|y
Factorization of a joint
• Assume
• How do you factorize:
20
x ⊥⊥ z|y
p(x, y, z)
p(x, y, z) = p(x, z|y)p(y) = p(x|y)p(z|y)p(y)
Factorization of a joint
• What if there is no conditional independence?
• How do you factorize:
21
p(x, y, z)
p(x, y, z) = p(x, z|y)p(y) = p(x|y, z)p(z|y)p(y)
Structure of Graphical Models
• Graphical models allow us to represent dependence relationships between variables visually
– Graphical models are directed acyclic graphs (DAGs).
– Nodes: random variables
– Edges: Dependence relationship
– No Edge: Independent variables
– Direction of the edge: indicates a parent-child relationship
– Parent: Source – Trigger
– Child: Destination – Response
22
Example Graphical Models
• Parents of a node i are denoted πi
• Factorization of the joint in a graphical model:
23
p(x, y) = p(x)p(y)    p(x, y) = p(x|y)p(y)

p(x0, . . . , xn−1) = ∏_{i=0}^{n−1} p(xi|πi)

[Figure: two two-node graphs over x and y, one with no edge and one with an edge.]
Basic Graphical Models
• Independent Variables
• Observations
• When we observe a variable (fix its value from data), we color the node grey.
• Observing a variable allows us to condition on it, e.g. p(x, z|y).
• Given an observation we can generate pdfs for the other variables.
24
[Figure: three-node graphs over x, y, z illustrating independent variables and an observed (shaded) node.]
Example Graphical Models
• X = cloudy?
• Y = raining?
• Z = wet ground?
• Markov Chain
25
[Graph: x → y → z]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x)p(y|x)p(z|y)
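A small numeric sketch of this chain factorization (the conditional probability tables below are made up, and each variable is treated as binary):

```python
import itertools

# Made-up CPTs for the chain x -> y -> z (cloudy -> raining -> wet ground).
p_x = {1: 0.4, 0: 0.6}                     # p(x)
p_y_given_x = {1: {1: 0.7, 0: 0.3},        # p(y | x = 1)
               0: {1: 0.1, 0: 0.9}}        # p(y | x = 0)
p_z_given_y = {1: {1: 0.9, 0: 0.1},        # p(z | y = 1)
               0: {1: 0.2, 0: 0.8}}        # p(z | y = 0)

def joint(x, y, z):
    """p(x, y, z) = p(x) p(y|x) p(z|y)"""
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]

# Sanity check: the joint sums to 1 over all eight assignments.
print(sum(joint(x, y, z) for x, y, z in itertools.product([0, 1], repeat=3)))
```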
Example Graphical Models
• Markov Chain
• Are x and z conditionally independent given y?
26
[Graph: x → y → z]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x)p(y|x)p(z|y)
p(x, z|y) = p(x|y)p(z|y)
Example Graphical Models
• Markov Chain
27
[Graph: x → y → z]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x)p(y|x)p(z|y)

p(x, z|y) = p(x|z, y)p(z|y)

p(x|z, y) = p(x, y, z) / p(y, z)
          = p(x)p(y|x)p(z|y) / (p(y)p(z|y))
          = p(x)p(y|x) / p(y)
          = p(x, y) / p(y)
          = p(x|y)

p(x, z|y) = p(x|y)p(z|y)  ⇒  x ⊥⊥ z|y
One Trigger Two Responses
• X = achiness?
• Y = flu?
• Z = fever?
28
[Graph: y → x, y → z (one trigger, two responses)]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x|y)p(y)p(z|y)
Example Graphical Models
• Are x and z conditionally independent given y?
29
p(x, z|y) = p(x|y)p(z|y)
[Graph: y → x, y → z]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x|y)p(y)p(z|y)
Example Graphical Models
30
[Graph: y → x, y → z]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x|y)p(y)p(z|y)

p(x, z|y) = p(x|z, y)p(z|y)

p(x|z, y) = p(x, y, z) / p(y, z)
          = p(x|y)p(y)p(z|y) / (p(y)p(z|y))
          = p(x|y)

p(x, z|y) = p(x|y)p(z|y)  ⇒  x ⊥⊥ z|y
Two Triggers One Response
• X = rain?
• Y = wet sidewalk?
• Z = spilled coffee?
31
[Graph: x → y, z → y (two triggers, one response)]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x)p(y|x, z)p(z)
Example Graphical Models
• Are x and z conditionally independent given y?
32
p(x, z|y) = p(x|y)p(z|y)
[Graph: x → y, z → y]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x)p(y|x, z)p(z)
Example Graphical Models
33
[Graph: x → y, z → y]

p(x, y, z) = ∏_{n∈{x,y,z}} p(n|πn) = p(x)p(y|x, z)p(z)

p(x, z|y) = p(x|z, y)p(z|y)

p(x|z, y) = p(x, y, z) / p(y, z)
          = p(x)p(y|x, z)p(z) / (p(y|x, z)p(z))
          = p(x)

p(x, z|y) = p(x)p(z|y)  ⇒  x not ⊥⊥ z|y
(the conditional does not factor as p(x|y)p(z|y))
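A quick numeric check of this "explaining away" behaviour (the probabilities below are made up): marginally x and z are independent by construction, but once the common child y is observed, learning z changes our belief about x.

```python
import itertools

# v-structure x -> y <- z (rain -> wet sidewalk <- spilled coffee), toy CPTs.
p_x = {1: 0.3, 0: 0.7}
p_z = {1: 0.1, 0: 0.9}
p_y1_given_xz = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.80, (0, 0): 0.05}

def joint(x, y, z):
    py1 = p_y1_given_xz[(x, z)]
    return p_x[x] * p_z[z] * (py1 if y == 1 else 1.0 - py1)

# Compare p(x=1 | y=1) with p(x=1 | y=1, z=1).
p_y1 = sum(joint(x, 1, z) for x, z in itertools.product([0, 1], repeat=2))
p_x1_given_y1 = sum(joint(1, 1, z) for z in [0, 1]) / p_y1
p_x1_given_y1z1 = joint(1, 1, 1) / sum(joint(x, 1, 1) for x in [0, 1])
print(p_x1_given_y1, p_x1_given_y1z1)  # they differ: x is not independent of z given y
```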
Factorization
34
[Graph: x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]

p(x0, x1, x2, x3, x4, x5) = ?
Factorization
35
[Graph: x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]

p(x0, x1, x2, x3, x4, x5) = p(x0)p(x1|x0)p(x2|x0)p(x3|x1)p(x4|x2)p(x5|x1, x4)
How Large are the probability tables?
36
p(x0, x1, x2, x3, x4, x5) =
p(x0)p(x1|x0)p(x2|x0)p(x3|x1)p(x4|x2)p(x5|x1, x4)
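One way to answer this (a sketch, assuming every variable is binary) is to count the free parameters of each conditional table p(xi|πi), i.e. (2 − 1) · 2^|πi| per node, against the 2^6 − 1 entries of the unfactored joint:

```python
# Parents of each node in the example graph x0, ..., x5.
parents = {
    "x0": [], "x1": ["x0"], "x2": ["x0"],
    "x3": ["x1"], "x4": ["x2"], "x5": ["x1", "x4"],
}

# Each binary node needs (2 - 1) * 2^(number of parents) free parameters.
factored = sum(2 ** len(p) for p in parents.values())
full_joint = 2 ** len(parents) - 1
print(factored, full_joint)   # 13 vs. 63 free parameters
```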
Model Parameters as Nodes
• Treating model parameters as a random variable, we can include these in a graphical model
• Multivariate Bernoulli
37
[Graph: independent nodes x0, x1, x2, each with its own parameter node µ0, µ1, µ2 as parent.]
Model Parameters as Nodes
• Treating model parameters as a random variable, we can include these in a graphical model
• Multinomial
38
[Graph: a single shared parameter node µ as parent of x0, x1, x2.]
Naïve Bayes Classification
• Observed variables xi are independent given the class variable y
• The distribution can be optimized using maximum likelihood on each variable separately.
• Can easily combine various types of distributions
39
[Graph: class node y as parent of x0, x1, x2.]

p(y|x0, x1, x2) ∝ p(x0, x1, x2|y)p(y)
p(y|x0, x1, x2) ∝ p(x0|y)p(x1|y)p(x2|y)p(y)
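A minimal count-based sketch of this factorization (the toy data and add-ε smoothing below are illustrative, not from the lecture): each p(xi|y) is estimated separately and then combined with p(y).

```python
import numpy as np

# Toy data: rows are binary feature vectors (x0, x1, x2); y holds class labels.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
y = np.array([0, 0, 1, 1, 1])

def fit_naive_bayes(X, y, eps=1.0):
    """Estimate p(y) and each p(x_i = 1 | y) independently by counting."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    cond = {c: (X[y == c].sum(axis=0) + eps) / ((y == c).sum() + 2 * eps)
            for c in classes}
    return prior, cond

def predict(x, prior, cond):
    scores = {}
    for c in prior:
        p_feat = np.where(x == 1, cond[c], 1 - cond[c])
        scores[c] = prior[c] * np.prod(p_feat)   # p(y) * prod_i p(x_i | y)
    return max(scores, key=scores.get)

prior, cond = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0]), prior, cond))   # -> 0 for this toy data
```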
Graphical Models
• Graphical representation of dependency relationships
• Directed Acyclic Graphs
• Nodes as random variables
• Edges define dependency relations
• What can we do with Graphical Models?
– Learn parameters – to fit data
– Understand independence relationships between variables
– Perform inference (marginals and conditionals)
– Compute likelihoods for classification.
40
Plate Notation
• To indicate a repeated variable, draw a plate around it.
41
[Figure: y pointing to x0, x1, ..., xn, and the equivalent plate notation with y pointing to xi inside a plate repeated n times.]
Completely observed Graphical Model
• Observations for every node
• Simplest (least general) graph, assume each independent
42
Completely Observed graphical models
Suppose we have observations for every node.

Flu  Fever  Sinus  Ache  Swell  Head
Y    L      Y      Y     Y      N
N    M      N      N     N      N
Y    H      N      N     Y      Y
Y    M      Y      N     N      Y

In the simplest – least general – graph, assume each variable is independent. Train 6 separate models.
[Graph: nodes Fl, Fe, Si, Ac, Sw, He]

The 2nd simplest graph – the most general – assumes no independence. Build a 6-dimensional table. (Divide by total count.)
[Graph: nodes Fl, Fe, Si, Ac, Sw, He]
Completely observed Graphical Model
• Observations for every node
• Second simplest graph, assume complete dependence
43
Maximum Likelihood
• Each node has a conditional probability table, θ
• Given the tables, we can construct the pdf.
• Use Maximum Likelihood to find the best settings of θ
44
Maximum Likelihood Conditional Probability Tables
Consider the graphical model over x0, ..., x5 from the factorization example.
Each node has a conditional probability table θi.
Given the tables, we have a pdf (below). We have M variables in x, and N data points X.
p(x|θ) = ∏_{i=0}^{M−1} p(xi|πi, θi)

Maximum (log) likelihood:

θ∗ = argmax_θ ln p(X|θ)
   = argmax_θ ∑_{n=0}^{N−1} ln p(Xn|θ)
   = argmax_θ ∑_{n=0}^{N−1} ln ∏_{i=0}^{M−1} p(xin|θi)
   = argmax_θ ∑_{n=0}^{N−1} ∑_{i=0}^{M−1} ln p(xin|θi)

45
Count functions
• Count the number of times something appears in the data
46
Maximum Likelihood CPTs
First, Kronecker’s delta function:

δ(xn, xm) = 1 if xn = xm, 0 otherwise

Counts: the number of times something appears in the data.
m(xi) = ∑_{n=0}^{N−1} δ(xi, xin)

m(X) = ∑_{n=0}^{N−1} δ(X, Xn)

N = ∑_{x1} m(x1) = ∑_{x1} ∑_{x2} m(x1, x2) = ∑_{x1} ∑_{x2} ∑_{x3} m(x1, x2, x3) = . . .
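In code these count functions are just tallies over the data; a sketch with a toy data matrix (one node xi and one parent πi, both binary):

```python
import numpy as np
from collections import Counter

# Toy data: N = 6 observations of (x_i, pi_i), both binary.
data = np.array([[1, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 1]])

# m(x_i): how often each value of x_i appears (a sum of Kronecker deltas).
m_xi = Counter(int(v) for v in data[:, 0])
# m(x_i, pi_i): joint counts of the node together with its parent value.
m_xi_pi = Counter(tuple(int(v) for v in row) for row in data)

print(m_xi)      # Counter({1: 4, 0: 2})
print(m_xi_pi)   # Counter({(1, 1): 3, (1, 0): 1, (0, 1): 1, (0, 0): 1})
```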
Maximum Likelihood
• Define a function:
• Constraint:
47
Maximum likelihood CPTs

l(θ) = ∑_{n=0}^{N−1} ln p(Xn|θ)
     = ∑_{n=0}^{N−1} ln ∏_X p(X|θ)^δ(Xn, X)
     = ∑_{n=0}^{N−1} ∑_X δ(Xn, X) ln p(X|θ)
     = ∑_X m(X) ln p(X|θ)
     = ∑_X m(X) ln ∏_{i=0}^{M−1} p(xi|πi, θi)
     = ∑_X ∑_{i=0}^{M−1} m(X) ln p(xi|πi, θi)
     = ∑_{i=0}^{M−1} ∑_{xi,πi} ∑_{X\{xi,πi}} m(X) ln p(xi|πi, θi)
     = ∑_{i=0}^{M−1} ∑_{xi,πi} m(xi, πi) ln p(xi|πi, θi)

Define a function: θ(xi, πi) = p(xi|πi, θi)
Constraint: ∑_{xi} θ(xi, πi) = 1
Maximum Likelihood
• Use Lagrange Multipliers
48
l(θ) = ∑_{i=0}^{M−1} ∑_{xi} ∑_{πi} m(xi, πi) ln θ(xi, πi) − ∑_{i=0}^{M−1} ∑_{πi} λπi [∑_{xi} θ(xi, πi) − 1]

∂l(θ)/∂θ(xi, πi) = m(xi, πi)/θ(xi, πi) − λπi = 0

θ(xi, πi) = m(xi, πi)/λπi

∑_{xi} m(xi, πi)/λπi = 1  – the constraint

λπi = ∑_{xi} m(xi, πi) = m(πi)

θ(xi, πi) = m(xi, πi)/m(πi)  – counts!
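So the maximum likelihood CPT is literally a ratio of counts; a small sketch, reusing toy joint counts like the ones above:

```python
from collections import Counter

# Toy counts m(x_i, pi_i) for a binary node with one binary parent.
m_joint = Counter({(1, 1): 3, (0, 1): 1, (1, 0): 1, (0, 0): 1})

# m(pi_i) is obtained by summing the joint counts over x_i.
m_parent = Counter()
for (xi, pi), count in m_joint.items():
    m_parent[pi] += count

# theta(x_i, pi_i) = m(x_i, pi_i) / m(pi_i)
theta = {(xi, pi): count / m_parent[pi] for (xi, pi), count in m_joint.items()}
print(theta)   # {(1, 1): 0.75, (0, 1): 0.25, (1, 0): 0.5, (0, 0): 0.5}
```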
Maximum A Posteriori Training
• Bayesians would never do that; the θs need a prior.
49
θ(xi, πi) = (m(xi, πi) + ε) / (m(πi) + ε|xi|)
Conditional Dependence Test
• Can check conditional independence in a graphical model
– “Is achiness (x3) independent of the flu (x0) given fever (x1)?”
– “Is achiness (x3) independent of sinus infections (x2) given fever (x1)?”
50
p(x) = p(x0)p(x1|x0)p(x2|x0)p(x3|x1)p(x4|x2)p(x5|x1, x4)
p(x3|x0, x1, x2) = p(x0, x1, x2, x3) / p(x0, x1, x2)
                 = p(x0)p(x1|x0)p(x2|x0)p(x3|x1) / (p(x0)p(x1|x0)p(x2|x0))
                 = p(x3|x1)

x3 ⊥⊥ x0, x2 | x1
D-Separation and Bayes Ball
• Intuition: nodes are separated or blocked by sets of nodes.
– E.g. nodes x1 and x2 “block” the path from x0 to x5, so x0 is cond. ind. from x5 given x1 and x2.
51
[Figure: the x0, ..., x5 graph from the factorization example; x1 and x2 block the path from x0 to x5.]
Bayes Ball Algorithm
• Shade nodes xc
• Place a “ball” at each node in xa
• Bounce balls around the graph according to rules
• If no balls reach xb, then cond. ind.
52
xa ⊥⊥ xb|xc
Ten rules of Bayes Ball Theorem
53
Bayes Ball Example - I
• x0 ⊥⊥ x4|x2?

[Figure: the x0, ..., x5 graph.]
54
Bayes Ball Example - II
• x0 ⊥⊥ x5|x1, x2?

[Figure: the x0, ..., x5 graph.]
55
Undirected Graphs
• What if we allow undirected graphs?
• What do they correspond to?
• Not Cause/Effect or Trigger/Response, but general dependence
• Example: Image pixels, each pixel is a Bernoulli
– P(x11, …, x1M, …, xM1, …, xMM)
– Bright pixels have bright neighbors
• No parents, just probabilities.
• Grid models are called Markov Random Fields
56
Undirected Graphs
• Undirected separability is easy.
• To check conditional independence of A and B given C, check the graph reachability of A and B without going through nodes in C (a reachability sketch follows below).
57
[Figure: an undirected graph with node groups A, B, C and D.]
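The reachability check is an ordinary graph search; a sketch (the adjacency list below is a made-up toy graph, not the figure's exact structure) that removes the conditioning set C and then looks for any remaining path from A to B:

```python
from collections import deque

def separated(adj, a, b, c):
    """True if, after removing the nodes in c, no node in a can reach a node in b."""
    blocked = set(c)
    frontier = deque(n for n in a if n not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in b:
            return False
        for nbr in adj.get(node, []):
            if nbr not in seen and nbr not in blocked:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Toy undirected graph: every path from A to B passes through C.
adj = {"A": ["C"], "C": ["A", "B", "D"], "B": ["C", "D"], "D": ["C", "B"]}
print(separated(adj, {"A"}, {"B"}, {"C"}))   # True -> A independent of B given C
```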
Next Time
• More fun with Graphical Models
• Read Chapter 8.1, 8.2
58