1
Learning with Information Divergence Geometry
Shinto Eguchi and
Osamu Komori
The Institute of Statistical Mathematics, Japan. Email: [email protected], [email protected]; URL: http://www.ism.ac.jp/~eguchi/
Tutorial Workshop on
1
2
Outline
9:30~10:30 Information divergence class and robust statistical methods I
11:00~12:00 Information divergence class and robust statistical methods II
13:30~14:30 Information geometry on model uncertainty
15:00~10:30 Boosting learning algorithm and U-loss functions I
9:30~10:30 Boosting learning algorithm and U-loss functions II
11:00~12:00 Pattern recognition from genome and omics data
2
3
Key words
Linear connections
Statistical model and estimation
Divergence geometry
Duality of inference and model
Geodesic
ABC of information geometry
Transversality
Riemannian metric (information metric)
Metric and dual connections
Pythagoras theorem
Max entropy model
minimaxity
U-boost
Observational bias Selection bias
Sensitivity analysis The worst case
ε –perturbed model
Pythagoras theorem
3
4
Information divergence class and robust statistical methods I
4
5
geometry
learning
statistics
quantum
R. A. Fisher (1922)
C. R. Rao (1945)
B. Efron (1975 paper), with comment by P. Dawid
S. Amari (1982 paper)
1984 Workshop (RSS 150th anniversary)
Historical comment
5
6
What is IG?
Geometry Uncertainty
Information space
It is a method to quantify uncertainty,
Or, a viewpoint to understand uncertainty?
6
7
Dual Riemannian Geometry
Dual Riemannian geometry gives reformulation for Fisher’s foundation
Cf. Einstein field equations (1916)
Information geometry aims to geometrize a dualistic structure between modeling and estimation.
Estimation is projection of data onto model.
“Estimation is an action by projection and model is an object to be projected”
The interplay of action and object is elucidated by a geometric structure.
Cf. Erlangen program (Klein, 1872)http://en.wikipedia.org/wiki/Erlangen_program
7
8
2 x 2 table
The space of all 2×2 tables associates with a regular tetrahedron
We know the ruled surface is exponential-geodesic, but does anyone know the minimality of the ruled surface?
The space of all independent 2×2 tables associates with a ruled surface in the regular tetrahedron.
8
9
regular tetrahedron
[Figure: the regular tetrahedron of 2×2 tables, with vertices P, Q, R, S at the degenerate tables (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1), the uniform table (1/4, 1/4, 1/4, 1/4) at the centre, and edge tables of the form (1−p, 0, p, 0) and (0, p, 0, 1−p).]
9
10
regular tetrahedron
[Figure: convex (mixture) combinations of 2×2 tables with weights 1/3 and 2/3 inside the regular tetrahedron with vertices P, Q, R, S.]
10
11
Ruled surface
[Figure: the ruled surface of independent tables inside the regular tetrahedron with vertices P, Q, R, S.]
11
12
A 2×2 table
        A      B
  C     x      y      e
  D     z      w      f
        g      h      n
with margins e = x + y, f = z + w, g = x + z, h = y + w and total n = x + y + z + w, has chi-square statistic
  χ² = n (xw − yz)² / (efgh).
For an independent table
        A            B
  C     pq           p(1−q)        p
  D     (1−p)q       (1−p)(1−q)    1−p
        q            1−q           1
we have n { pq(1−p)(1−q) − p(1−q)(1−p)q }² / { p(1−p)q(1−q) } = 0, so χ² = 0. The vertex tables (1,0;0,0), (0,1;0,0), (0,0;1,0), (0,0;0,1) are the degenerate cases.
12
13
e-geodesical
[3D plots omitted: the image of the independence surface under the log coordinates.]
Under the coordinates (π11, π12, π21) → (log π11/π22, log π12/π22, log π21/π22), with π22 = 1 − π11 − π12 − π21, the simplex S3 is mapped into R³. The independence surface
  { ( log pq/((1−p)(1−q)), log p/(1−p), log q/(1−q) ) : 0 < p < 1, 0 < q < 1 }
is mapped onto the plane { (x + y, x, y) : x ∈ R, y ∈ R }, hence it is e-geodesic.
13
14
Two parametrizations
[3D plots omitted.]
The independence surface can be parametrized either by the cell probabilities
  (p, q) → ( pq, p(1−q), (1−p)q, (1−p)(1−q) )
or by the log coordinates
  (p, q) → ( log pq/((1−p)(1−q)), log p/(1−p), log q/(1−q) ).
14
15
[3D plot omitted.]
Two Gaussian distributions N(μ1, I) and N(μ2, I)
15
16
[3D plot omitted.]
Pythagoras theorem in a functional space: N(μ1, I), N(μ2, I), N(μ3, I) placed at the vertices of a right triangle with side lengths 3, 4, 5, so that 3² + 4² = 5².
16
17
Kullback-Leibler divergence
Let p(x) and q(x) be probability density functions. Then the Kullback-Leibler divergence is defined by
  D_KL(p, q) = ∫ p(x) log{ p(x) / q(x) } dx.
For three densities p, q, r we consider the Pythagorean relation
  D_KL(p, r) = D_KL(p, q) + D_KL(q, r).
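As a quick numerical illustration (not from the slides), the divergence can be approximated on a grid; the sketch below assumes two densities evaluated on a common grid x with spacing dx.

```r
# Numerical KL divergence between two densities evaluated on a grid.
# p and q are vectors of density values; dx is the grid spacing.
kl_div <- function(p, q, dx) {
  sum(p * log(p / q)) * dx
}

# Example: KL between N(0,1) and N(1,1) equals (mu1 - mu2)^2 / 2 = 0.5.
x <- seq(-10, 10, by = 0.001)
p <- dnorm(x, mean = 0, sd = 1)
q <- dnorm(x, mean = 1, sd = 1)
kl_div(p, q, dx = 0.001)   # approximately 0.5
```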
17
18
Two one-parameter families
Let 𝒫 be the space of all pdfs on a data space.
m-geodesic:  p_t^(m)(x) = (1 − t) p(x) + t q(x)   (p, q ∈ 𝒫)
e-geodesic:  r_s^(e)(x) = c_s { r(x) }^(1−s) { q(x) }^s   (q, r ∈ 𝒫),
  where 1/c_s = ∫ { r(x) }^(1−s) { q(x) }^s dx.
18
19
Pythagoras theorem
  D_KL(p, r) = D_KL(p, q) + D_KL(q, r)
⟹  D_KL(p_t^(m), r_s^(e)) = D_KL(p_t^(m), q) + D_KL(q, r_s^(e))   for all (s, t) ∈ [0,1] × [0,1].
19
20
Proof.
  D_KL(p_t^(m), r_s^(e)) − { D_KL(p_t^(m), q) + D_KL(q, r_s^(e)) }
   = ∫ ( p_t^(m)(x) − q(x) ) ( log q(x) − log r_s^(e)(x) ) dx
   = ∫ (1 − t)( p(x) − q(x) ) { (1 − s)( log q(x) − log r(x) ) − log c_s } dx
   = (1 − t)(1 − s) ∫ ( p(x) − q(x) ) ( log q(x) − log r(x) ) dx
   = (1 − t)(1 − s) { D_KL(p, r) − D_KL(p, q) − D_KL(q, r) }
   = 0.
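The key identity in the proof can be checked numerically; the sketch below (an illustration, not from the slides) uses three arbitrary normal densities on a grid.

```r
# Numerical check of the identity
# D(p_t^(m), r_s^(e)) - D(p_t^(m), q) - D(q, r_s^(e))
#   = (1 - t)(1 - s) { D(p, r) - D(p, q) - D(q, r) }
x  <- seq(-12, 12, by = 0.001); dx <- 0.001
kl <- function(p, q) sum(p * log(p / q)) * dx

p <- dnorm(x, 0, 1); q <- dnorm(x, 1, 1.5); r <- dnorm(x, -2, 2)
t_m <- 0.3; s_e <- 0.7

pt <- (1 - t_m) * p + t_m * q                        # mixture (m-) geodesic
rs <- r^(1 - s_e) * q^s_e; rs <- rs / (sum(rs) * dx) # exponential (e-) geodesic

lhs <- kl(pt, rs) - kl(pt, q) - kl(q, rs)
rhs <- (1 - t_m) * (1 - s_e) * (kl(p, r) - kl(p, q) - kl(q, r))
all.equal(lhs, rhs)   # TRUE up to numerical error
```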
20
21
ABC of differential geometry
Riemannian metric: g defines an inner product on each tangent space of a manifold M,
  g : 𝔛(M) × 𝔛(M) → R.
Geodesic: a minimal arc between x and y,
  c = argmin { ∫_0^1 g( ẋ(t), ẋ(t) ) dt : x(0) = x, x(1) = y }.
Linear connection: ∇ : 𝔛(M) × 𝔛(M) → 𝔛(M) defines parallelism along a vector field, with
  (1) ∇_{fX} Y = f ∇_X Y,
  (2) ∇_X (fY) = f ∇_X Y + (Xf) Y   (∀ f ∈ F(M), ∀ X, Y ∈ 𝔛(M)).
Componentwise,  ∇_{X_i} X_j = Σ_{k=1}^d Γ_{ij}^k X_k.
21
22
GeodesicA one-parameter family (curve) is called geodesic }:)({ εε ≤≤−= ttC θ
),...,1(0)()())(()(1 1
pitttt kkp
k
p
j
ikj
i =∀=Γ+∑∑= =
θθθ θ
with respect to a linear connection },,1:)({ pkjiikj ≤≤Γ θ
Remark 2:If , any geodesic is a line.),,1(0)( pkjii
kj ≤≤=Γ θ
The property is not invariant with parametrization.
The gedesic C is expressed by a transform rule from parameter θ to ω
),...,1(0)()())((~)(1 1
patttt cbp
c
p
b
abc
a =∀=Γ+ ∑∑= =
ωωω ω
,)()(~11 1 1
∑∑∑∑== = = ∂
∂+Γ=Γ
p
ac
iba
i
p
k
p
j
p
i
ikj
ai
jb
kc
abc
BBBBBω
θω )(for, θaaa
iiai
aai tBB =
∂∂
=∂∂
= ωωθ
θω
22
23
Change rule on connectionsLet us consider a transform φ from parameter θ to ω . Then a geodesic C is
written by . Hence we have the following:
),...,1(0)()())((~)(1 1
patttt cbp
c
p
b
abc
a =∀=Γ+ ∑∑= =
ωωω ω
,)()(~11 1 1
∑∑∑∑== = = ∂
∂+Γ=Γ
p
ac
iba
i
p
k
p
j
p
i
ikj
ai
jb
kc
abc
BBBBBω
θω )(for)( 1θaa
a
iiaB ϕω
ωϕ
=∂
∂=
−
}:))(()({ εεϕ <<−= ttt θω
),...,1()())(()(1
pattBtp
i
iai
a =∀= ∑=
θω θ
),...,1()()())(()())(()(1 11
patttBttBtp
j
p
i
jij
ai
p
i
iai
a =∀∂
∂+= ∑∑∑
= ==
θθθ
θω θθ
⎟⎟⎠
⎞⎜⎜⎝
⎛
∂∂
= i
aaiB
θϕ
23
24
Which reference books?
Foundations of Differential Geometry (Wiley Classics Library), Shoshichi Kobayashi and Katsumi Nomizu.
Methods of Information Geometry, Shun-Ichi Amari and Hiroshi Nagaoka, American Mathematical Society (2001).
Differential-Geometrical Methods in Statistics, Shun-Ichi Amari, Springer (1985).
24
25
Statistical model and information
Statistical model:  M = { p(x, θ) : θ ∈ Θ }   (Θ ⊂ R^p)
Score vector:  s(x, θ) = ∂/∂θ log p(x, θ)
Space of score vectors:  T_θ = { α^T s(·, θ) : α ∈ R^p }
Fisher information metric:  g_θ(u, v) = E_θ{ uv }   (∀ u, v ∈ T_θ)
Fisher information matrix:  I_θ = E_θ{ s(x, θ) s(x, θ)^T }
25
26
Score vector spaceNote 1:
0),(),(
),(),(),(),()(T
T
==
==⇒∈
∫∫∫xθxθxsα
xθxθxsαxθxθxθθ
dp
dpdpuuETu
βαβxθxθxsθxsα
xθxθxθx
θ
θθ
Idp
dpvuvugTvuTTT }),(),(),({
),(),(),(),(,
==
=⇒∈
∫∫
})),((,0)),((:),({ ∞<== θxθxθx θθθ tVtEtS
The space of all random variables with mean 0 and finite variance
Note 2: is an infinite dimensional vector space that includesθS θT
),(),(For θxsθx Tu α=
,....2,1},{ ),(...
),(...
},),({),(11
=∂∂
∂−
∂∂∂
− kuEuuEuiki
k
iki
kkk θxθxθxθx θθ θθθθ
belong to θS
26
27
Linear connections
Score vector:  s(x, θ) = ( s_i(x, θ) )_{i=1,...,p}
e-connection:  Γ_{jk}^{(e) i} = Σ_{i'=1}^p g^{ii'} E_θ{ s_{i'} ∂_j s_k }   (1 ≤ i, j, k ≤ p)
m-connection:  Γ_{jk}^{(m) i} = Σ_{i'=1}^p g^{ii'} E_θ{ s_{i'} ( ∂_j s_k + s_j s_k ) }   (1 ≤ i, j, k ≤ p)
m-geodesic:  p_t^(m)(x) = (1 − t) p(x) + t q(x)   (p, q ∈ 𝒫)
e-geodesic:  r_s^(e)(x) = c_s { r(x) }^(1−s) { q(x) }^s   (q, r ∈ 𝒫)
27
28
Geodesical models
Definition
(i) A statistical model M is said to be e-geodesical if
  p(x), q(x) ∈ M ⟹ c_t { p(x) }^(1−t) { q(x) }^t ∈ M   (∀ t ∈ (0,1)),  where 1/c_t = ∫ { p(x) }^(1−t) { q(x) }^t dx.
(ii) A statistical model M is said to be m-geodesical if
  p(x), q(x) ∈ M ⟹ (1 − t) p(x) + t q(x) ∈ M   (∀ t ∈ (0,1)).
Note: Let 𝒫 be the space of all probability density functions. By definition 𝒫 is e-geodesical and m-geodesical. However, the theoretical framework for 𝒫 is not perfectly complete. Cf. Pistone and Sempi (1995, AS).
28
29
Two type of modeling
})};()(exp{)(),({ 0)e( Θ∈−== θθxtθxθx κTppM
})}({)}({:)({0
)m( xtxtx pp EEpM ==
Exponential model
Matched mean model
})(:RI{ ∞<∈=Θ θθ κp
Let p0(x) be a pdf and t (x) a p-dimensional statistic.
})};()()(logexp{)(),({
1 00
)e( Θ∈−== ∑=
θθxxxθx κθ
I
i
ii p
pppM
}:)()()1(),({1
01
)m( Θ∈+−== ∑∑==
θθθI
iii
I
ii pppM xxθxMixture model
})(:RI{ ∞<∈=Θ θθ κp
Let be a set of pdfs.},...,1,0:)({ Iipi =x
)},...,1(0,10:),...,({1
1 Iii
I
iiI =∀><<==Θ ∑
=
θθθθθ
Exponential model
29
30
Statistical functional
A functional f(p) defined for pdfs p(x) is called a statistical functional.
A statistical functional f(p) is said to be Fisher-consistent for a model M = { p_θ(x) : θ ∈ Θ } if
  (1) f(p) ∈ Θ,
  (2) f(p_θ) = θ   (∀ θ ∈ Θ).
Example. For the normal model
  M = { p_θ(x) = (2πσ²)^(−1/2) exp{ −(x − μ)²/(2σ²) } : θ = (μ, σ²) ∈ R × R_+ },
the functional
  f(p) = ( ∫ x p(x) dx,  ∫ x² p(x) dx − ( ∫ x p(x) dx )² )^T
is Fisher-consistent for θ = (μ, σ²).
30
31
Transversality
Let f(p) be a Fisher-consistent functional. Then
  L_θ(f) = { p : f(p) = θ }
is called a leaf transverse to M if
  (1) L_θ(f) ∩ M = { p_θ },
  (2) ⊕_{θ ∈ Θ} L_θ(f) is a local neighborhood.
[Figure: leaves L_θ(f) and L_{θ*}(f) intersecting M at p_θ and p_{θ*}.]
31
32
Foliation structure
)( fP θLΘ∈
⊕=θ
))((),(),( fP θθθθLTvMTuTt ppp ∈∃∈∃∈∀
),(),(),(such that θxθxθx vut +=
))(()()( fP θθθθLTMTT ppp ⊕=
)( fθL
M
θp
),( θxt ),( θxu
),( θxv
Foliation
Decomposition of tanget spaces
Statistical model }:),()({ Θ∈== θθxxθ ppM )RI( p⊂Θ
32
33
Transversality for MLE
Let p0(x) be a pdf and t(x) a p-dimensional statistic, and consider the exponential model
  M^(e) = { p(x, θ) = p0(x) exp{ θ^T t(x) − κ(θ) } : θ ∈ Θ },  Θ = { θ ∈ R^p : κ(θ) < ∞ }.
For M^(e) the MLE functional
  f_ML(p) = argmax_{θ ∈ Θ} E_p{ log p(x, θ) }
can be written as
  f_ML(p) = arg{ θ ∈ Θ : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } },
so that
  L_θ(f_ML) = { p : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }.
Hence the foliation associated with f_ML is a matched mean model.
33
34
Maximum likelihood foliation
),(),(),( KLKLKL baba rpDppDrpD θθ +=
)( MLfθL
θp
)e(M
1p 2p
3p
2r
1r)2,1,3,2,1( == ba
34
35
Estimating function
Statistical model:  M = { p(x, θ) : θ ∈ Θ }   (Θ ⊂ R^p)
A p-variate function u(x, θ) is unbiased if, by definition,
  E_{p(·,θ)}{ u(x, θ) } = 0,   det( E_{p(·,θ)}{ ∂u(x, θ)/∂θ } ) ≠ 0   (∀ θ ∈ Θ).
The statistical functional
  f(p) = arg solve_{θ ∈ Θ} { E_p{ u(x, θ) } = 0 }
is Fisher-consistent, and the leaf transverse to M,
  L_θ(f) = { p : E_p{ u(x, θ) } = 0 },
is m-geodesical, with 𝒫 = ⊕_{θ ∈ Θ} L_θ(f).
35
36
Information geometry
Statistical model:  M = { p(x, θ) : θ ∈ Θ },  where p(x, θ) is a pdf satisfying ∫ p(x, θ) dx = 1.
Information metric (Rao, 1945):
  g_ij(θ) = E_{p(·,θ)}{ e_i(x, θ) e_j(x, θ) },  where e_i(x, θ) = ∂/∂θ_i log p(x, θ).
Dual connections (Amari, 1982):
  Exponential connection:  Γ_{ij,k}^(e)(θ) = E_{p(·,θ)}{ (∂_i e_j) e_k }
  Mixture connection:  Γ_{ij,k}^(m)(θ) = E_{p(·,θ)}{ (∂_i e_j) e_k + e_i e_j e_k }
36
37
Mixture and exponential model
}:),({ Θ∈= θθxpM
Mixture model (mixture geodesic space)
}1,0:),...,{(,)(),(0
00
=>=Θ= ∑∑==
K
iiiK
K
iii xpxp θθθθθθ
}:),({ Θ∈= θθxpM
Exponential model (exponential geodesic space)
})(:{,)}()(exp{),(1
∞<=Θ−= ∑=
θθθ ψψθθK
iii xtxp
∫= dxxtii )}(exp{log)(where θψ θ
37
38
Triangle with KL divergence
)10()()()( 1e≤≤= − sxqxrzxr ss
ssExponential geodesic
Let p, q and r in M
)10()()1()()(m≤≤−+= txqtxptxptmixture geodesic
∫= dxxqxpxpqpD)()(log)(),(KLKL divergence
}),(),(),(){1)(1( KLKLKL rqDqpDrpDst −−−−=
∫ −−= })log)(log{(emst rqqp
∫ −−−−= })log)(log{()1)(1( rqqpts
),(),(),(e
KLm
KLem
KL stst rqDqpDrpD −−p
q
r
mtp
esr
38
39
Pythagorean Theorem
)(m xps
)(xp
)(xq
)(xr
)(e xr t
),(KL rpD
),(KL rqD
),(KL qpD
),(),(),(e
KLm
KLem
KL stst rqDqpDrpD +=
),(),(e
KLem
KL sst rqDrpD ≥
),(),( mKL
emKL qpDrpD tst ≥ (em algorithm, Amari, 1995)
39
40
Minimum divergence geometry
Let D : M × M → R be an information divergence on a statistical model M such that
  (i) D(p, q) ≥ 0 with equality if and only if p = q,
  (ii) D is differentiable on M × M.
Then we get a Riemannian metric and dual connections on M (Eguchi, 1983, 1992):
  g^(D)(X, Y) = −D(X | Y),
  g( ∇_X^(D) Y, Z ) = −D(XY | Z),
  g( Y, ∇_X^{*(D)} Z ) = −D(Y | XZ)   (∀ X, Z ∈ 𝔛(M)),
where D(XY | Z)(p) = X_p Y_p Z_q D(p, q) |_{q=p}.
40
41
)(),()|(),( )()( ZZYfgZXYfDZYg XD
XfD ∀∇=−=∇
)|()|()|(),()( XYDXYDYXDYXg D ⋅=⋅=−=
0)|],([)|()|(),(),( )()( =⋅=⋅−⋅=− YXDYXDXYDXYgYXg DD
)(Dg
sconnection dual areand *XX ∇∇
)(),)(()|)(()),(( )()( ZZYXfYfgZfYXDZfYg XD
XD ∀+∇=−=∇
is a Riemann metric
)( *21
XX ∇+∇=∇Xwhere
),(),(),(),(),( )()(*)()()( ZYgZYgZYgZYgZYXg XD
XD
XD
XDD ∇+∇=∇+∇=
Remarks
0)|( =⋅XD
41
42
U cross-entropy and U entropy
Let (U, u, ξ) be a triple such that U is a convex function, u = U′ and ξ = u^(−1).
U cross-entropy:  L_U(p, q) = ∫ { p(x) ξ(q(x)) − U(ξ(q(x))) } dx
U entropy:  H_U(p) = L_U(p, p) = ∫ U*(p(x)) dx,
  where U*(s) = s ξ(s) − U(ξ(s)) is the conjugate convex function,
  ξ(s) = argmax_{−∞<t<∞} { st − U(t) },  u(t) = argmax_{−∞<s<∞} { st − U*(s) }.
U divergence:  D_U(p, q) = H_U(p) − L_U(p, q)
42
4343
Example of U divergence
  D_U(p, q) = ∫ { U(ξ(q)) − U(ξ(p)) − p ( ξ(q) − ξ(p) ) } dx
KL divergence:  (U(t), u(t), ξ(s)) = ( exp(t), exp(t), log s ),
  D_KL(p, q) = ∫ { q − p − p ( log q − log p ) } dx = ∫ p log(p/q) dx.
Beta (power) divergence:  (U(t), u(t), ξ(s)) = ( (1 + βt)^{(1+β)/β}/(1 + β), (1 + βt)^{1/β}, (s^β − 1)/β )   (β > 0),
  D_β(p, q) = ∫ { ( q^{1+β} − p^{1+β} )/(1 + β) − p ( q^β − p^β )/β } dx.
Note:  lim_{β→0} U_β(t) = exp(t),  lim_{β→0} D_β(p, q) = D_KL(p, q).
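The beta family can be written out directly; the sketch below (an illustration under the formulas above) codes its (U, u, ξ) triple and the beta-power divergence on a grid, assuming β > 0.

```r
# The beta (power) family through its (U, u, xi) triple.
U_beta  <- function(t, beta) (1 + beta * t)^((1 + beta) / beta) / (1 + beta)
u_beta  <- function(t, beta) (1 + beta * t)^(1 / beta)         # u = U'
xi_beta <- function(s, beta) (s^beta - 1) / beta               # xi = u^{-1}

# Beta-power divergence between densities p and q on a grid with spacing dx.
d_beta <- function(p, q, beta, dx) {
  sum((q^(1 + beta) - p^(1 + beta)) / (1 + beta) -
      p * (q^beta - p^beta) / beta) * dx
}

# As beta -> 0 the beta divergence approaches the KL divergence:
x <- seq(-10, 10, by = 0.001)
p <- dnorm(x, 0, 1); q <- dnorm(x, 1, 1)
d_beta(p, q, beta = 1e-4, dx = 0.001)   # close to 0.5 = KL(p, q)
```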
43
4444
Geometric formula with DU
)|(),()( YXDYXg UU −=
))(()|(),()( MXZZXYDZYg UXU ∈∀−=∇
))(()|(),( *)( MXZXYZDZYg XD ∈∀−=∇
s.t.),,( )(*)()( UUUg ∇∇
dxxqxqg iiji
U )),((),(()( θξθθθθ∫ ∂
∂
∂
∂=)
dxxqxq kjikijU )),((),(( 2
,)( θξθθ
θθθ∫ ∂
∂
∂∂
∂=)Γ
dxxqxq jikkijU )),((),((
2
,)(* θξθθ
θθθ ∂∂
∂
∂
∂∫=)Γ
2005)Eguchi,()(m
)( UU ∀∇=∇
)divergence KL(exp)( =⇒⇐= Ugg U
44
4545
Triangle with DU
])1,0[]1,0[),(()},(),(),(){1)(1(
),(),(),( )()()()(
×∈∀−−−−=
−−
tsrqDqpDrpDst
rqDqpDrpD
UUU
UsU
mtU
Us
mtU
p
q
r
)(mtp
)(Usr
)()()1()(m
xxx tqptpt +−=
)( ))(())(()1()()(s
Us qsrsur κξξ ++−= xxx
mixture geodesic
U geodesic
),(),(),(
}{}{)()()()(
)()(
UsU
mtU
Us
mtU
Usq
mt
rqDqpDrpD
rp
+=
⇒⊥
45
46
Light and shadow of MLE
Light: 1. invariance under data transformations; 2. asymptotic efficiency (sufficiency and efficiency; log, exp; ξ, u).
Shadow: 1. non-robustness; 2. overfitting (log-likelihood on an exponential family).
46
1
Learning with Information Divergence Geometry
Shinto Eguchi and
Osamu Komori
The Institute of Statistical Mathematics, Japan Email: [email protected], [email protected]: http//www.ism.ac.jp/~eguchi/
Tutorial Workshop on
47
2
Outline
9:30~10:30 Information divergence class and robust statistical methods I
11:00~12:00 Information divergence class and robust statistical methods II
13:30~14:30 Information geometry on model uncertainty
15:00~16:00 Boosting learning algorithm and U-loss functions I
9:30~10:30 Boosting learning algorithm and U-loss functions II
11:00~12:00 Pattern recognition from genome and omics data
48
3
Information divergence class and
robust statistical methods II
49
4
Light and shadow of MLE
1.Invariance under data-transformations
2.Asymptotic efficiency
1. Non-robust
Sufficiency and efficiency
log exp
ξ u
2. Over-fitting
Log-likelihood on exponential family
Likelihood method
U-method
50
5
))(()}({),( qUqEqpL pU ξξ −=
U-entropy
U-cross entropy
))(()}({)( pUpEpH pU ξξ −=
Let U(t) = exp(t) .
.1
)1()(Let
1
ββ β
β
++
=
+
ttU
)}({log)( XpEpH pU =
}1)({)(β
β −=
XpEpH pU
),,,( ΞξuUTake a quadruplet
U-entropy
Example 2.
Example 1.
51
6
U-divergence
Information inequality:  H_U(p) ≥ L_U(p, q)
U-divergence:  D_U(p, q) = H_U(p) − L_U(p, q)
  = ∫ { U(ξ(q)) − U(ξ(p)) − u(ξ(p)) ( ξ(q) − ξ(p) ) } dx ≥ 0   (by the convexity of U, since u(ξ(p)) = p)
KL-divergence:  D_KL(p, q) = ∫ { q − p − p ( log q − log p ) } dx
β-power divergence:  D_β(p, q) = ∫ { ( q^{1+β} − p^{1+β} )/(1 + β) − p ( q^β − p^β )/β } dx
52
7
Max U-entropy distribution
})}({:{ τXtΓτ == PEpEqual mean space
Let us fix a statistics t (x).
)(maxarg)(* pHp Up τΓ
x∈
=
))()(()( T* θxtθx κ−= up
)(})({
|)}1)1(())1)((())1(({
T*
0**T*
τ
ε
κξ
εελεεεεε
Γ∈∀+−=
−+−−+−−−+−∂∂
=
qpq
qpqpqpH
tθ
τtθ
Euler-Lagrange’s 1st variation
)()())((Hence T* θxtθx κξ −=p
U-model
53
8
U-estimate
Let p(x) be a data density function with statistical model q_θ(x).
U-loss function:  L_U(θ) = L_U(p, q_θ) = − E_p{ ξ(q_θ(X)) } + ∫ U(ξ(q_θ(x))) dx
U-empirical loss function:  L_U^emp(θ) = − (1/n) Σ_{i=1}^n ξ(q_θ(x_i)) + ∫ U(ξ(q_θ(x))) dx
U-estimate:  θ̂_U = argmin_{θ ∈ Θ} L_U^emp(θ)
U-estimating function:  s_U(x, θ) = w(x, θ) s(x, θ) − E_{q_θ}{ w(X, θ) s(X, θ) },
  where w(x, θ) = ξ′(q_θ(x)) q_θ(x) and s(x, θ) is the score function.
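As a small robustness illustration (a sketch, not from the slides), the U-estimate with the beta generator can be computed for a normal mean with σ = 1 fixed; for this model the integral term of the empirical loss is constant in μ.

```r
# Minimum beta-power divergence estimate of a normal mean (sigma = 1 fixed).
beta_loss <- function(mu, x, beta) {
  q <- dnorm(x, mean = mu, sd = 1)
  # empirical U-loss: -(1/n) sum xi(q(x_i)) + \int U(xi(q)) dx;
  # the integral equals (2*pi)^(-beta/2) * (1 + beta)^(-3/2) for N(mu, 1)
  -mean((q^beta - 1) / beta) + (2 * pi)^(-beta / 2) / (1 + beta)^1.5
}

set.seed(1)
x <- c(rnorm(95, 0, 1), rnorm(5, 10, 1))   # 5% gross outliers near 10

mean(x)                                                     # MLE, pulled toward the outliers
optimize(beta_loss, c(-5, 5), x = x, beta = 0.3)$minimum    # close to 0
```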
54
9
Γ -minimax
})}({:{ τXtΓτ == PEpEqual mean space
),(maxmin)*,*(),(minmax qpLppLqpL UUUpqqp ττ
ττττ Γ∈Γ∈Γ∈Γ∈
==
∫ =−
−=
τxtθxt
xtθx
θ
θτ
))(()(
,))(()(* T
Twhere
κ
κ
u
up
)(maxarg* pHp Up τ
τΓ∈
=
τΓ
τ'Γ
UM
τΓ
'τΓ
}{τ
*τp
*'τp
Let us fix a statistics t (x).
55
10
U-model
}:))(()({ Θ∈−== θxtθx θθ κTU uqMU-model
mean parameter is)}({ xtτθqE=U-estimator of ∑
=
=n
iiU n 1)(1ˆ xtτ
))(())((1)(1
empθθ xθ qUq
nL
n
iiU ξξ −= ∑
=
))((})({1 T
1
Tθθ xtθxtθ κκ −−−= ∑
=
Un
n
ii
0 )}({)(1)(1
emp =−= ∑=
∂∂ Xtxtθ
θθ q
n
iiU E
nL
τ'Γ
UM
τΓ
'τΓ
*τp
*'τpτΓ
UM
).,()ˆ()( ˆempemp
θθθθ qqDLLU
UUUU =−
We observe that
which implies
Furthermore,
56
11
U-Boost density learning
)2())1((minarg),( 1]1,0[),(
TkL kUkk ≤≤+−= −×∈
πφξπφπφπ D
},...,1:)({ )( Mjj == xD φDictionary of ξ-densities
)(minarg11 φφξφ
UD
L∈
==
)()()1()( 1 xxx kkkkk φπξπξ +−= −U-boost
)(minarg*)cov(
ξξξ
ULD∈
=Goal is to find
U(t) = t 2 (Klemela, ML, 2006) U(t)=exp(t) (Friedman et al., JASA,1998)
57
12
Proposed density estimator
))()(()( T θxtθx kuf −=
)1()1(with),...,( 11 TkkkT πππθθθ −−== +θ
T1 ))(,),(()( xxxt Tφφ=
U-estimate under U-model
*ξ
1−kξ
kξ kφ
Nj
jD 1)( }{ == φ
)cov( D
58
13
*ξ
1−kξ
kξ kφ
1+kφ
1+kξ
59
14
*ξ
1−kξ
kξ kφ
1+kφ
1+kξ2+kξ
2+kφ
∞→→ kk as*ξξ
60
15
Statistical machine learning
Brain function
LearningData sets
Signal processingPattern recognition
Vapnik (1995)
T. Hastie, R. Tibshirani, J. Friedman (2001)
・・・・
61
16
))}(()({)( θθθ qUqpLU ξξ −= ∫
Minimum U divergence
U-loss function
)(minargˆ emp θθθ
UU LΘ∈
=MinimumU-estimator
)},(),({),(),(),( θXθXθxθxθxsθ
swEsw qU −=U-estimating function
.function score is),(,))((')(),( where θXsxxθx θθ qqw ξ=
∫∑ −==
))(())((1)(1
empθθ xθ qUq
nL
n
iiU ξξThe empirical loss
Let p(x) be a data density function with statistical model )(xθq
62
17
• Huber’s M-estimator ∑=
Θ∈
n
iix
n 1
),(1min θρθ
⎩⎨⎧
−<−−
=otherwise)sgn(
||if),(
θθθ
θρyk
kyyy
As M-estimation
• location case
• relation ∫−= ))(())((),( θθ qUxqx ξξρ θ
63
18
Influence function
Statistical functional
})),((()()),(({minarg)( ∫∫ +−=Θ∈
dxxpUxdGxpGTU θξθξθ
0
)(),(IF=
⎥⎦⎤
⎢⎣⎡∂∂
≡ε
εεGTxT
xFG δεε θε +−= )1(
)}],(),({),(),()[(),(IF 1 θθθθθ θ xSxwExSxwJxT −= −ΨΨ
Influence function
||),(IF||sup)(GES xTTx
ΨΨ =
64
19
Efficiency
Asymptotic variance
))()()(,0()ˆ( 11 −−⇒− θθθθθ UUUD
U JHJNn
)),(),(()(
),),(),(),(()(
θθθ
θθθθ
XSXwVarH
XSXSXwEJ
U
TU
=
=
111 )()()()( −−− ≤ θθθθ UUU JHJI
Information inequality
where
( equality holds iff U(x) = exp(x) )
65
20
Normal mean
-6 -4 -2 2 4 6
-6
-4
-2
2
4
6
0.125
0.1
0.075
0.05
0.01
0
-6 -4 -2 2 4 6
-6
-4
-2
2
4
6
2.5
0.8
0.4
0.15
0.015
0
β-power estimates η-sigmoid estimates
Influence functionβ= 0
= 0.015= 0.15= 0.4= 0.8= 2.5
η = 0= 0.01= 0.05= 0.075= 0.1= 0.125
66
21
Gross error sensitivity
β-power estimates:
  β:    0    0.015   0.15    0.4     0.8     2.5
  eff:  1    0.972   0.873   0.802   0.753   0.694
  GES:  ∞    1.9     1.04    0.678   0.455   0.197
η-sigmoid estimates:
  η:    0    0.01    0.05    0.075   0.1     0.125
  eff:  1    0.97    0.861   0.799   0.742   0.689
  GES:  ∞    6.16    2.92    2.47    2.21    2.04
67
22
Multivariate normal
pdf is
⎭⎬⎫
⎩⎨⎧ −Σ−Σ=Σ
−−− 2
)()(exp)det)2((),,(1
21 μμπμ yyyf
Tp
尤度方程式 Ψ
),(
)(}))(({)(1
0)()(1
Σ=
⎪⎩
⎪⎨
⎧
Σ=Σ−−−
=−
∑
∑
μθ
θμμθψ
μθψ
ψcxxn
xn
Tjjj
jj
68
23
U-algorithm
Update θ_k = (μ_k, Σ_k) → θ_{k+1} = (μ_{k+1}, Σ_{k+1}) by the iteratively reweighted mean and covariance
  μ_{k+1} = Σ_i w(x_i, θ_k) x_i / Σ_i w(x_i, θ_k),
  Σ_{k+1} = Σ_i w(x_i, θ_k) (x_i − μ_k)(x_i − μ_k)^T / { Σ_i w(x_i, θ_k) − c_U det(Σ_k) }.
Under a mild condition,  L_ψ(θ_{k+1}) ≥ L_ψ(θ_k)   (∀ k = 1, ...).
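A minimal sketch of the reweighting idea follows, using beta-power weights w_i = f(x_i; μ, Σ)^β and assuming the mvtnorm package is available; the determinant correction constant from the slide is omitted, so this is an illustrative approximation rather than the exact U-algorithm.

```r
library(mvtnorm)   # for dmvnorm (assumed available)

u_algorithm <- function(x, beta = 0.2, n_iter = 50) {
  mu <- colMeans(x); sigma <- cov(x)
  for (k in seq_len(n_iter)) {
    w     <- dmvnorm(x, mean = mu, sigma = sigma)^beta   # down-weights outliers
    w     <- w / sum(w)
    mu    <- colSums(w * x)                              # weighted mean
    xc    <- sweep(x, 2, mu)
    sigma <- crossprod(sqrt(w) * xc)                     # weighted covariance
    # NOTE: the slide's det(Sigma_k) correction term is not reproduced here.
  }
  list(mean = mu, cov = sigma)
}
```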
69
24
Simulation
ε-conmination
)ˆ( 0KL θ,θD
, N , N-εG
, -
N .
. , N-εG
ε
ε
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛+⎟⎟
⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛+⎟⎟
⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
9999
00
1001
00
)1(
1004
55
250501
00
)1(
(2)
(1)
ε
ε
ε
ε
(2)
(1)
under 6516702under2439033
05.00
G . . G . .
εε
→
→
=→=KL error
with MLE θ̂
70
25
β-power estimates v.s. η-sigmoid estimates
31.640.3012.400.208.910.10
20.930.0535.300.0139.240
6.040.0014.640.000754.460.00056.700.00025
23.040.000139.240
βηKL error KL error
29.680.3012.200.205.140.106.650.05
13.910.0116.50
3.070.0033.040.0023.90.0013.190.000753.360.000516.50
(1)εG
)2(εG
71
26
Plot of MLEs (no outliers)
11σ12σ
22σ
100 replications with100 size of sampleunder Normal dis.
η- MLE (η=0.0025)β- MLE (β=0.1)MLEtrue (1,0,1)
⎥⎦
⎤⎢⎣
⎡=Σ
2212
1211
σσσσ
72
27
Plot of MLEs with ouliers
η- MLE (η=0.0025)β- MLE (β=0.1)MLEtrue (1,0,1)
)2(εG
100 replications with100 size of sampleunder
11.5
2
2.5
0
0.5
1
1.5
0.5
1
1.5
2
11.5
2
2.5
0
0.5
1
1.5
73
28
Selection for tuning parameter
Squared loss function
dyygyf 221 )}()ˆ,({)ˆ(Loss −= ∫ θθ
∫∑ +−= −
=
dyyfxfn
ii
n
i
2)(
1
)ˆ,(21)ˆ,(1)(CV ββ θθβ
)(CVminargˆ βββ
=
)ˆ,(IF1
1ˆˆ )(βββ θθθ i
i xn −
+≈−Approximate
74
29
Selection for tuning parameter
  CV(β) ≅ − (1/n) Σ_{i=1}^n f( x_i, θ̂_β ) + (1/2) ∫ f( y, θ̂_β )² dy
   + (1/(n − 1)) Σ_i IF( x_i, θ̂_β )^T ∂f( x_i, θ̂_β )/∂θ.
The third term is dominant when there are no outliers, in which case CV(β) has a minimum around β = 0. When there are substantial outliers, the first and second terms are dominant and CV(β) has a minimum around β = 1. Cf. GIC in Konishi and Kitagawa (1996).
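A rough leave-one-out cross-validation sketch for choosing β is given below; it is an illustration of the idea for a univariate N(μ, σ²) model fitted by the beta-power loss, not the slide's exact formula, and the closed-form integral terms are those of the normal model.

```r
# Fit N(mu, sigma^2) by the beta-power loss.
fit_beta <- function(x, beta) {
  loss <- function(th) {
    q <- dnorm(x, th[1], exp(th[2]))
    -mean((q^beta - 1) / beta) +
      (2 * pi * exp(2 * th[2]))^(-beta / 2) / (1 + beta)^1.5
  }
  th <- optim(c(mean(x), log(sd(x))), loss)$par
  c(th[1], exp(th[2]))                        # (mu, sigma)
}

# CV(beta) = -(1/n) sum_i f(x_i; theta^(-i)) + (1/2) \int f(y; theta)^2 dy
cv_beta <- function(x, beta) {
  n <- length(x)
  loo <- sapply(seq_len(n), function(i) {
    th <- fit_beta(x[-i], beta)
    dnorm(x[i], th[1], th[2])
  })
  th_all <- fit_beta(x, beta)
  -mean(loo) + 1 / (4 * th_all[2] * sqrt(pi))  # \int N(y;mu,s)^2 dy = 1/(2 s sqrt(pi))
}
# e.g. sapply(c(0.01, 0.1, 0.5, 1), cv_beta, x = x) and pick the minimiser.
```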
75
30
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛.26.1-
.1-.26 varianceand
00
mean with Normal
0.0250.050.0750.10.1250.150.175
2.82
2.84
2.86
2.88
2.9
2.92
2.94 ⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛.261.126-.126-.228
0.081-
0.054
signalMLE
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛.383.263-.263-1.059
0.184-
0.204
ContamiMLE
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛.286.134-.134-.293
0.132-
0.086 MLE-β
β
)(C βV
07.0ˆ=β
76
31
What is ICA?
Cocktail party effect
Blind source separation
s1 s2 ……… sm
x1 x2 ……… xm
77
32
ICA model
sWxW mmm 1 s.t., −× =∈∈ RR μ
(Independent signals)
(Linear mixture of signals)
0)(,,0)()()()(),...,(
1
111
====
m
mmm
SESEspspspsss ~
),,( 1 nxx
))(())((|)det(|),,( 111 mmmppWpWf μxwμxwx −−=
Aim is to learn W from a dataset
in which unknown. is)()()( 11 mm spspp =s
78
33
ICA likelihood
Log-likelihood function:
  L(W, μ) = Σ_{i=1}^n Σ_{j=1}^m log p_j( w_j^T (x_i − μ) ) + n log |det(W)|
Estimating equation:
  F(x, W, μ) = ∂/∂W log f(x, W, μ) = { I − h( W(x − μ) ) ( W(x − μ) )^T } W^{−T},
  where h(s) = ( −∂ log p_1(s_1)/∂s_1, ..., −∂ log p_m(s_m)/∂s_m ).
Natural gradient algorithm, Amari et al. (1996): update W in the direction { I − h(W(x − μ))(W(x − μ))^T } W.
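The sketch below illustrates a natural-gradient ICA iteration; it assumes the common nonlinearity h(y) = tanh(y) (the true source score functions are of course unknown in practice) and centred data.

```r
# Minimal natural-gradient ICA sketch (after Amari et al., 1996).
ica_natgrad <- function(X, eta = 0.01, n_iter = 200) {
  # X: n x m data matrix, rows are observations (assumed centred)
  m <- ncol(X); W <- diag(m)
  for (it in seq_len(n_iter)) {
    Y <- X %*% t(W)                                  # current source estimates
    G <- diag(m) - crossprod(tanh(Y), Y) / nrow(X)   # I - E[h(y) y^T]
    W <- W + eta * G %*% W                           # natural-gradient step
  }
  W
}
```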
79
34
Beta-ICA
β power equation ),(),,(),,(11
μμμ ββ WBWxFWxf
n
n
iii =∑
=
decomposability:
0]))}({E[
)]()}({E[
])}({E[,
=−×
−−×
−∏≠≠
XwXwp
XwhXwp
Xwp
tttt
ssssss
tqsqqq
β
β
β
μ
μμ
μ
ts ≠∀
80
35
Likelihood ICA
0.5 1 1.5 2 2.5
0.2
0.4
0.6
0.8
1
1.2
1.4
-1.5 -1 -0.5 0.5 1 1.5
-1.5
-1
-0.5
0.5
1
1.5
⎥⎦
⎤⎢⎣
⎡=−
5.01211W
150 signals U(0,1)× U(0,1)
Maximum likelihood
Mixture matrix
81
36
Non-robustness
-2 -1 1 2 3
-2
-1
1
2
-2 -1 1
-2
-1
1
2
50 Gaussian noise N(0, 1)× N(0, 1)
Maximum likelihood
82
37
β power-ICA (β=0.2)
-2 -1 1 2 3
-2
-1
1
2
-10 -7.5 -5 -2.5 2.5 5 7.5
-6
-4
-2
2
4
Minimum β-power
83
38
U-PCA
)()(exp)()),exp(log()( * zzzzz Ψ−=Ψ+=Ψ ηη
∑=
−=n
iiU xrUL
1))),((log()( γμγ
2
22
||||)(||||),(
γγγ yyyr
T
−=
η-sigmoid
)},(min{minargˆ μγγμγ
UU L=
γγγγγ
γγ T
T
iSSxxr max)(tr),(min −=−∑Classical PCA
ryγ
84
39
U-PCA
),(into),(Update ** γμγμ
Tiii xxxw
xwixr
xr
))((),(),S(
),,())((
))((
μμγμμγ
γμγμψ
γμψ
−−−=
∑=
∑
,−
,−where
⎪⎩
⎪⎨
⎧
=
=
∑=
i
n
ii x,,xw
,S
)(
)(
1
*
***
μγμ
γλγμγ
85
40
Non-robustness in Classical PCA
-4 -2 2 4 6
-3
-2
-1
1
2
3
-4 -2 2 4 6
-3
-2
-1
1
2
3
-10-5
0
5
10
-10
-5
0
510
-10
-5
0
5
10
-10-5
0
5
10
-10
-5
0
510
MLE for Pc vector = (.55, .82, .01, .07, .01, .032, .10)
MLE for Pc vector = (.00, .01, .05, .04, .02, .99, .00)
86
41
Data-weights in U-PCA
20 40 60 80 100 120 140
-0.2
0.2
0.4
0.6
0.8
1
U-w
eight
Pc vector = (.55, .82, .01, .07, .01, .032, .10)
U-estimator for Pc vector = (. 64, .75, .01, .09, .01, .03, .09)
87
42
2.5 5 7.5 10 12.5 15
0.2
0.4
0.6
0.8
1
radius r
u(r)
1
Tube neighborhood in U-PCA
Ψγ̂
)−− ΨΨ γμx ˆ,ˆ( izix
88
43
Kernel methods
KMs : mapping the data into a high dimensional feature space
Use of kernel functions for vectors, sequence data, text, images.
SVM, Fisher's LDA, PCA, ICA, CCA, SIR, spectral clustering
(kernel trick, cf. Aizeman et al. 1964)
(Any kernel can be used with any kernel-algorithm)
Kernel-Machines Org (http://agbs.kyb.tuebingen.mpg.de/km/bb/)
SVM Org (http://www.support-vector-machines.org/)
(Spline method, cf. Wahba. 1979)
89
44
RKHS
Φ : X → Z,  X : data space in R^p,  Z : high-dimensional feature space
Kernel:  K(x, u) = Φ(x) · Φ(u)   (kernel trick, cf. Aizerman et al. 1964)
Feature map:  h : X → H,  h_x = K(·, x) ∈ H
Reproducing property:  ⟨ K(·, x), f ⟩_H = f(x),  || Φ(x) ||_Z² = K(x, x) = || h_x ||_H².
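As a minimal sketch of the kernel trick (the choice of the Gaussian kernel and the bandwidth s are assumptions, not part of the slide), inner products in the feature space are computed from a Gram matrix on the data.

```r
# Gaussian kernel Gram matrix: K(x, u) = exp(-||x - u||^2 / (2 s^2)).
gauss_gram <- function(X, s = 1) {
  d2 <- as.matrix(dist(X))^2        # squared Euclidean distances
  exp(-d2 / (2 * s^2))
}

# Centring the Gram matrix, as used in kernel PCA.
centre_gram <- function(K) {
  n <- nrow(K); J <- diag(n) - matrix(1 / n, n, n)
  J %*% K %*% J
}
```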
90
45
Kernel data
-1
-0.5
0
0.5
1-1
-0.5
0
0.5
1
0
0.25
0.5
0.75
1
-1
-0.5
0
0.5
1
}30,...,1:{ =ixi }30,...,1:{ =ihix
91
46
Ψ-loss function
Let p be a true density function on H and m a proto model function.
Definition 1. U-loss function is
∫+−== τξξθ θθθ dmUmEmpCL PUU ))(()(),()(
We assume an exponential model defined by
)),()(exp()( T θκθθ −= hxhm
If we rewrite Ψ-loss function is
.1))(exp(satisfieswhere T∫ =− τκθκ θθ dhx
then)),(exp()( tt ξ−=Ψ
∫ −Ψ−+−Ψ= τκθκθθ θθ dhxUhxEL TTPU )))((()})(({)(
92
47
KPCA Cf. Scholkopf et al 1998 NC
Robust PCA
Wang, Karhunen, Oja (NN 1995)
Higuchi, Eguchi (NC 1998; JMLR 2004)
Xu, Yuille (IEEE NN 1995)
Hubert, Rousseeuw, Branden (Technometrics, 2005)
Robust KPCA
93
48
U-KPCA
∑=∈Γ∈
ΨΨ ΓΨ=Γn
iiOHm
mhzm xk 1,
)),,((minarg)ˆ,ˆ(
Ψ-Kernel Principal Component Analysis Cf. Haung, Yi-Ren , Eguchi (2008)
where Ψ(z) is strictly increasing function and
Remark Ψ0-KPCA = KPCA if
Example )}exp(1{)( 11 zz ββ −−=Ψ −
))(exp(1)exp(1log)( 1
2 zz
−++
=Ψ −
ηββηβ
zz =Ψ )(0
)()(lim 010zz Ψ=Ψ
→βη
β−Ψ=Ψ
→)()(lim 020
zz
2 4 6 8 10
2
4
6
8
10)(1 zΨ
)()(lim 02 zz Ψ=Ψ∞→η
1.0=β5.0=β1=β
0=β
}||)(||||{||21),,( 2*2
HxHxx mhmhmhz −Γ−−=Γ
94
49
Toy example
95
50
Functional data
5 phonemes (TIMIT database), http://www.lsp.ups-tlse.fr/staph/npfda/
sh, iy, aa, dcl, ao ("she", "she", "dark", "dark", "water")
D = { (f_i, y_i) : i = 1, ..., 2000 },  5% contamination:
  f̃_i(t_j) = f_i(t_j) + δ_ij u_ij   (i = 1, ..., 2000; j = 1, ..., 15),
  δ_ij ~ Bernoulli(0.05),  u_ij ~ Uniform(10, 15).
96
51
FPCA
97
1
Learning with Information Divergence Geometry
Shinto Eguchi and
Osamu Komori
The Institute of Statistical Mathematics, Japan Email: [email protected], [email protected]: http//www.ism.ac.jp/~eguchi/
Tutorial Workshop on
98
2
Outline
9:30~10:30 Information divergence class and robust statistical methods I
11:00~12:00 Information divergence class and robust statistical methods II
13:30~14:30 Information geometry on model uncertainty
15:00~16:00 Boosting learning algorithm and U-loss functions I
9:30~10:30 Boosting learning algorithm and U-loss functions II
11:00~12:00 Pattern recognition from genome and omics data
99
3
Information geometry on model uncertainty
100
4
Observational bias
The theory for statistical inference is formulated under an assumption of random mechanism, for example, random sample.
However, the assumption is frequently untestable in situations that involve observational studies. In this sense we have to make a sufficiently cautious inference.
Typically, missing data arise from a variety of missing mechanisms, such as missing completely at random, missing at random, and missing not at random. In particular, missing not at random brings about a serious bias in the inference.
101
5
Hidden Bias
Publication bias -not all studies are reviewed
Confounding -causal effect only partly explained
Measurement error -errors in measure of exposure
102
6
Lung cancer & passive smoking
[Figure: forest plot of odds ratios (log scale, 0.3 to 10.0) for 30 studies.]
103
7
Passive smoking and lung cancer
Log relative risk estimates θ_j (j = 1, ..., 30) from 30 2×2 tables.
  θ̂ = Σ_j w_j θ_j / Σ_j w_j   (w_j is the inverse-variance weight)
The estimated relative risk is 1.24 with 95% confidence interval (1.13, 1.36).
104
Conventional analysis
1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0
stud
y
Odds ratio
510
1520
2530
1.24
105
9
Incomplete Data
z = (data on all studies, selection indicators)
y = (data on selected studies)
z = (response, treatment, potential confounders)
y = (response, treatment)
z = (disease status, true exposure, error)
y = (disease status, observed exposure)
y = h(z)
106
10
Level Sets of h(z)
1. One-to-one 2. Missing 3. Measurement error
4. Interval censor 5. Competing risk 6.Hidden confounder
107
11
Tubular Neighborhood
MεM
})),,((KLmin:),,({ 221 εεε ≤⋅=
∈ YYY gfg θθyΘθ
N
}:),({ Θ∈= θθyYfMModel
Copas, Eguchi (2001)
2
2
),(KL εε ≤MM
Near-model }:),,({ Θ∈= θθyY εε gM
108
12
Mis-specification
  g_Z(z, θ, ε) = f_Z(z, θ) exp{ ε u_Z(z, θ) },
  E_{f_Z}(u_Z) = 0,  E_{f_Z}(u_Z²) = 1,  E_{f_Z}(u_Z s_Z) = 0,
  ε = { 2 KL(g_Z, f_Z) }^{1/2},
  u_Z = "misspecification direction".
109
13
Near model
  g_Y(y, θ, ε) = f_Y(y, θ) exp{ ε u_Y(y, θ) },
  where u_Y(y, θ) = E_{Z|Y}[ u_Z(Z, θ) | y ].
By h : z ↦ y = h(z):
  Model:  f_Z(z, θ) → f_Y(y, θ);  Near-model:  g_Z(z, θ, ε) → g_Y(y, θ, ε).
110
14
Ignorable incompleteness
Let Y = h(Z) be a many-to-one mapping; Z is complete, Y is incomplete.
If Z has density f_Z(z, θ), then Y has
  f_Y(y, θ) = ∫_{h^{−1}(y)} f_Z(z, θ) dz.
MLEs:  θ̂_Z ← data on Z;  θ̂_Y ← data on Y.
If f_Z is true:  E(θ̂_Y) = E(θ̂_Z);  if f_Z is wrong:  E(θ̂_Y) ≠ E(θ̂_Z).
111
15
bias),,,( from },...,{ˆMLE virtual 1 ZZZ θzzzθ ugn ε←
),,,(from},...,{ˆMLEactual 1 YYY θyyyθ ugn ε←
)()|(|),()ˆE()ˆ(E 12 −++=− nOOu εε θbθθ ZZY
)(( ,cov),cov),( 11YYYZZZYZZ ssbbθb uGuGu −− −=−=
⎥⎥⎦
⎤
⎢⎢⎣
⎡⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛⎯⎯ →⎯⎟
⎟⎠
⎞⎜⎜⎝
⎛
−−−−
−−
−−
11
11Dist 1,ˆ
ˆ
YY
YZ
YY
ZZ
bθθbθθ
GGGG
nNn00
εε
bias
Limit distribution
112
16
Asymtotic bias
)1( ||||max min22
|λε
ε−=b
Zu
ifonly and if holds bounds The
.),(),( θysθy YY ∈u
)ˆE()ˆ(E ZY θθb −=
loss ninformatio of eigenvaluesmallest theisminλ
2/12/1 −−=Λ YZY III
113
17
Problem in estimation of bias
The nonignorable model
}{ 22 )(),(exp),(),,( 21 ερεε θθyθyθy YYY −= ufg
gives the worst case if ).,(),( T θysωθy YY =u
However is inestimable and untestable: ω
The profile likelihood ∑=
∈=
n
iigPL
1
)},,,({logmax),( εε ωθyω YΘθY
is flat at 0=ω
114
18
Heckman model for MNAR
))(()1( T1T, xβxψX|R −+Φ== − yrg Y σωεσ+= xβTy
),,()(),,( )( rthrt r xzxz ==
ω
115
19
From pure misspecification
Unbiased perturbed biased perturbedh
116
20
The worst case for bias
),(),(),( T2 θbθbθ ZYZZ uGuu =β
)( 11,cov),( YYZZZZ ssθb −− −= GGuu
ddddθθ
YZYY
YZYYZYYZZ GGGG
GGGGGGGuu)(
)()(),(),( 11T
1111T*22
−−
−−−−
−−−
=≤ ββ
dd
szhsdz
ZY
ZZYYZ
)(
))(()(iff attainsBound
11T
11T* }{
−−
−−
−
−=
GG
GGu
Cauchy-Schwartz’s inequality
)()(
)(11T
11T* )(
ysdd
dy Y
ZY
ZYY −−
−−
−
−=
GG
GGu εM
M
117
21
The worst case
),,,( εωθyYg
),( θyYf
),( ωθy YY If ε+ ).,(),( T θysωθy YY =u
*)),(),,,,(KL(minarg*
θωθωθ YYΘθ
Y ⋅⋅=+∈
fgI εε
118
22
Sensitivity analysis
}{ TT
21),(exp),(),,( ωωθysωθyωθy YYY YIfg −=
θθysωθysωθy
θY
YY ∂∂
+=∂∂ − ),(),()},,(log{ 2/1T
YIg
The most sensitive model
Estimating function of θ with fixed ε, ω
Yθ̂}const.:ˆ{ T
, =ωωθ Yω IY
119
23
Behavior of two MLEs
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛
−−−
⎟⎟⎠
⎞⎜⎜⎝
⎛→⎟
⎟⎠
⎞⎜⎜⎝
⎛
−−
−−−−
−−−
1111
111
,,,ˆˆ
ˆ
ZYZY
ZYY
ZY
Y
bb
θθθθ
IIIIIIInNn
Dε
The two MLEs and are asymptotically normal asYθ̂ Zθ̂
⎟⎠⎞
⎜⎝⎛→=−− −11,)ˆˆ(|)ˆ( Zuuθθθθ ZYY InN
D
Note:this asymptotic expression is valid only when )( 2/1−= nOε
Note: The conditioning cancels selection bias.
120
24
Scenarios A, B, C
Inference from using fYnyy ...,,1
}.)()ˆ()ˆ(:{)(C 2αrkIk T ≤−−= YYY θθθθθ
Scenario A: 10 =⇒= Akε
Scenario C: 1unknown0 >⇒> Ckε
Scenario B:acceptable
n
found had and
,..., observed had weif,0 1 zz>ε
CBA kkk <<⇒
121
25
Scenarios A and C
),0(~)ˆ(2/1 INI fYθθYY −
}.)()ˆ()ˆ(:{)(C 2αrkIk T ≤−−= YYY θθθθθ
Scenario A: ⇒=0ε
Scenario C:),0(~)ˆ(
unknown02/1 INI gY
bθθYY ε
ε
−−
⇒>
?!)(,1 AA kCk =
?!)(,1 22CC kCk κε+=
122
26
Scenario B
andˆ MLEhave couldwe,,..., observe could weIf 1 Zθzz n
),0(~)ˆˆ()( 2/12/1 INII fYZYY θθU −Λ−= −
),0(~|)(* INgYUSUS =
)}ˆˆ()ˆ{()ˆ( 2/12/1ZYYZZZ θθθθθθS −−−=−= II
Conditional confidence interval
}||)(||:{)( 22*αrC ≤= uSθu
123
27
Assumption of non-randomness
  √n ( θ̂_Y − θ̂_Z ) ~ N( 0, I_Y^{−1} − I_Z^{−1} )  under f_Y,
  ( θ̂_Y − θ ) | ( θ̂_Y − θ̂_Z = u ) ~ N( u, I_Z^{−1} )  under g_Y,
so the conditional confidence region is
  C_α(u) = { θ : ( θ̂_Y − θ − u )^T I_Z ( θ̂_Y − θ − u ) ≤ r_α² }.
The acceptance region of the hypothesis H: bias b = 0 at level α is
  B_α = { u : u^T ( I_Y^{−1} − I_Z^{−1} )^{−1} u ≤ r_α² },
so the union of the conditional confidence regions under acceptance of H at level α is
  C = ∪_{u ∈ B_α} C_α(u).
124
28
Theorem
Let C_Y(k) = { θ : ( θ̂_Y − θ )^T I_Y ( θ̂_Y − θ ) ≤ k r_α² }. Then
  C_Y(1) ⊂ ∪_{ ||u||² ≤ r_α² } C(u) ⊆ C_Y(2).
[Figure: the three regions.]
125
29
-1 -0.5 0.5 1
-1
-0.5
0.5
1
-1 -0.5 0.5 1
-1
-0.5
0.5
1
-1 -0.5 0.5 1
-1
-0.5
0.5
1
-1 -0.5 0.5 1
-1
-0.5
0.5
1
)001.0,001.0(),( 21
=λλ
)5.0,5.0(),( 21 =λλ)1.0,1.0(),( 21 =λλ )9.0,1.0(),( 21 =λλ
s.t.),( if attainable is boundupper The21
jiji λλ ≤≤∃
αB
})ˆ()ˆ(var)ˆ(:{)( 21Tαα rrCR A ≤−−= − θθθθθθ
)2()( αα rCRCrCR ⊆⊆
1-dimensional case P = 5% → 0.3%
126
30
-1 -0.5 0.5 1
-1
-0.5
0.5
1
Double-the-variance rule}:);({ Θ∈= θθyfMStatistical model
Random sample Mfn ∈);( ,...,iid
1 θyyy ~
α % confidence interval
})ˆ()ˆ(var)ˆ(:{)( 21Tαα rrCR A ≤−−= − θθθθθθ
)( αrCR )2( αrCR
M }{ εM
127
31
Passive smoke and lung cancer
The estimated relative risk 1.24 with 95% confidence interval (1.13, 1.36)
Square root rule95% confidence interval (1.08, 1.41)
128
32
Risk from passive smoke
129
Root-2-rule
1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0
stud
y
Odds ratio
510
1520
2530
1.24
130
34
Passive smoke and lung cancer
The estimated relative risk 1.24 with 95% confidence interval (1.13, 1.36)
Square root rule95% confidence interval (1.08, 1.41)
131
35
Reference
Eguchi & Copas (1998) JRSSB (near-parametric)
Copas & Eguchi (2001) JRSSB (ε -perturbed model)
Copas & Eguchi (2005) JRSSB (discussions) (Double-the-variance rule)
Eguchi & Copas (2002) Biometrika (Kulback-Leibler divergence)
Henmi, Copas & Eguchi (2007) Biometrics (Meta analysis)
Henmi & Eguchi (2004) Biometrika (Propensity score)
Statistical Analysis With Missing Data. R. J. A. Little, D. B. Rubin. Wiley (2002)
S. Greenland. JRSSA (2005) (Multiple-bias modelling)
Copas & Eguchi (2010) JRSSB (statistical equivalent models)
132
36
Present and Future
Does all this matter?
Statistics ( missing data, response bias, censoring)
Biostatistics (drop-outs, compliance)
Epidemiology ( confounding, measurement error)
Econometrics (identifiability, instruments)
Psychometrics (publication bias, SEM)
causality, counter-factuals, ...
133
1
Learning with
Information Divergence Geometry
Shinto Eguchi and
Osamu Komori
The Institute of Statistical Mathematics, Japan Email: [email protected], [email protected]: http//www.ism.ac.jp/~eguchi/
Tutorial Workshop on
134
2
Outline
9:30~10:30 Information divergence class and robust statistical methods I
11:00~12:00 Information divergence class and robust statistical methods II
13:30~14:30 Information geometry on model uncertainty
15:00~16:00 Boosting learning algorithm and U-loss functions I
9:30~10:30 Boosting learning algorithm and U-loss functions II
11:00~12:00 Pattern recognition from genome and omics data
135
3
Boosting learning algorithm
and U-loss functions I
136
4
Which chameleon wins?
137
5
Pattern recognition…
138
6
What is pattern recognition?
□ There are many examples of pattern recognition.
□ In principle, pattern recognition is a prediction problem for a class label that encodes what humans find interesting and important.
□ Originally, the human brain wants to label phenomena with a few words, for example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect), ....
□ The brain intrinsically predicts the class label from empirical evidence.
139
7
Practice
● Character recognitionvoice recognition
Image recognition
☆ Credit scoring
☆ Medical screening
☆ Default prediction
☆ Weather forcast
●
●
● face recognition
finger print recognition●
speaker recognition●
☆ Treatment effect
☆ Failure prediction☆ Infectious desease
☆ Drug response
140
8
Feature vector & class label
Feature vector:  x = (x_1, ..., x_p);  feature space X ⊆ R^p
Class label:  y;  label set {1, ..., G}
Training data:  D_train = { (x_i, y_i) : i = 1, ..., n }
Test data:  D_test = { (x_j^test, y_j^test) : j = 1, ..., m }
141
9
Classification rule
),...,( 1 pxx=x },,{ Gy 1∈
zyF →),(: x
Classification rule ),(maxarg)(},,1{
yFhGy
F xx∈
=
Feature vector Class label
yh →x:classifier
Discriminant function (score)
3)( =xFh
1 2 3 4 5
)1,( xF)2,( xF
)3,( xF
)4,( xF)5,( xF
)( R∈z
Consider a case of G = 5 for a given x
142
10
Binary classification
Feature vector x = (x_1, ..., x_p), class label y ∈ {−1, +1} (G = 2).
Score:  F : x → z,  F(x) = F(x, +1) − F(x, −1)
Classifier:  h_F(x) = sgn{ F(x) }
  sgn(F(x)) = +1 ⟺ F(x, +1) > F(x, −1),  sgn(F(x)) = −1 ⟺ F(x, +1) < F(x, −1),
so sgn(F(x)) = argmax_{y ∈ {−1, 1}} F(x, y).
143
11
Multi -class
),...,( 1 pxx=x },,{ Gy 1∈
zyF →),(: x
),(maxarg)(},,1{
yFhGy
F xx∈
=
yh →x:
3)( =xFh
1 2 3 4 5
)1,( xF)2,( xF
)3,( xF
)4,( xF)5,( xF
)( R∈z
G = 5,
Classifier
Feature vector Class label
Score function
Classifier by score
144
12
Probability distribution
).,( vector random of pdf a be),(Let yyp xx
∫ ∑∈
=B Cy
ypCBp }d),({),( xx
∫= Xxx d),()( ypypMarginal density
Conditional density
∑∈
=},...,1{
),()(Gy
ypp xx
)(),()|(
xxx
pypyp =
)(),()|(
ypypyp xx =
)()|()()|(),( xxxx pypypypyp ==
)1|()1|(
)1,()1,(
−==
=−==
ypyp
ypyp
xx
xx
145
13
Error ratepR∈x },,{ Gy 1∈
Classifier )( xFh
Feature vector ,class label
has error rate
))(Pr()(Err yhh FF ≠= x
∑∑=≠
==−====G
iF
jiFF iyihjyihh
1),)(Pr(1),)(Pr()(Err xx
Training error
}...,,1:),({for train niyD ii == x
}...,,1:),({for testtesttest mjyD jj == x
nyhih iiF
F})(:{#)(Err train ≠
=x
Test error
myhj
h jjFF
})(:{#)(Err
testtesttest ≠
=x
146
14
False negative / false positive
  FNP(h_F) = Pr( h_F(x) = −1 | y = +1 )   (false negative)
  FPR(h_F) = Pr( h_F(x) = +1 | y = −1 )   (false positive)
[2×2 table: true positive / false negative / false positive / true negative, by h_F(x) = ±1 and y = ±1.]
  Err(h_F) = FNP(h_F) pr(y = +1) + FPR(h_F) pr(y = −1)
147
15
Bayes rule
Let p(y | x) be the conditional probability given x. Define
  F_0(x) = log{ p(y = 1 | x) / p(y = −1 | x) }.
The classifier h_Bayes(x) = sgn( F_0(x) ) leads to the Bayes rule.
Theorem 1. For any classifier h,  Err(h_Bayes) ≤ Err(h).
Note: The optimal classifier is equivalent to the likelihood ratio. However, in practice p(y | x) is unknown, so we have to learn h_Bayes(x) based on the training data set.
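As a small illustration of the Bayes rule when p(y | x) is known (a sketch with two univariate Gaussian class-conditional densities and equal priors, which are assumptions for the example):

```r
# Log posterior odds log p(+1|x)/p(-1|x); with equal priors this is the
# log-likelihood ratio of the two class-conditional densities.
F0 <- function(x) {
  dnorm(x, mean = 1, sd = 1, log = TRUE) - dnorm(x, mean = -1, sd = 1, log = TRUE)
}
h_bayes <- function(x) sign(F0(x))
h_bayes(c(-2, 0.3, 1.5))   # -1  1  1
```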
148
16
Discriminant space associated with Bayes classifier is given by
{ }{ })|1()|1(: }1)(:{
)|1()|1(: }1)(:{
BayesB
BayesB
xxRxxRx
xxRxxRx
−=<+=∈=−=∈=
−=≥+=∈=+=∈=−
+
ypyphR
ypyphRpp
pp
},{ −+ RR
Error rate for Bayes rule
In general, when a classifier h associates with spaces
)}(Err1{)}(Err1{
)|1()()|1()(
)|1()()|1()(
)|1()()|1()(
)|1()()|1()()(Err)(Err
Bayes
\\\\
\\\\
)()(
)()(
)()(
)()(
BB
BB
Bayes
hh
dyppdypp
dyppdypp
dyppdypp
dyppdypphh
RRRR
RRRRRRRR
RRRRRRRR
RRRR
BBBB
BBBB
−−−=
+=−+−=−=
+=−+−=−≥
−=−++=−=
−=−++=−=−
∫∫∫∫
∫∫∫∫
∫∫∫∫
∫∫∫∫
++−−
++++−−−−
++++−−−−
++−−
xxxxxx
xxxxxx
xxxxxx
xxxxxx
)(Err)(Err Bayes hh ≥
−B−−
BRR \ −− ∩ BRR
−R −BR
−− RRB \
149
17
Multivariate normal distribution
The p-variate normal (Gaussian) distribution is defined by the pdf
  φ(x, μ, V) = (2π)^{−p/2} det(V)^{−1/2} exp{ −(x − μ)^T V^{−1} (x − μ)/2 }.
Assume that (x, y) has the pdf
  p(x, y) = p(y) φ(x, μ_y, V)   (x ∈ R^p, y ∈ {1, ..., G}),   G = 2
(assumption of equal variance matrices).
[Figure: two Gaussian densities.]
150
18
Bayes classifier
  F_Bayes(x) = log{ p(y = +1 | x) / p(y = −1 | x) } = α_1^T x + α_0,
  α_1 = V^{−1}( μ_{+1} − μ_{−1} ),
  α_0 = −(1/2)( μ_{+1}^T V^{−1} μ_{+1} − μ_{−1}^T V^{−1} μ_{−1} ) + log{ p(y = +1) / p(y = −1) }.
We call this the Fisher linear discriminant function. Plug in
  μ̂_{+1} = Σ_i I(y_i = +1) x_i / Σ_i I(y_i = +1),  μ̂_{−1} = Σ_i I(y_i = −1) x_i / Σ_i I(y_i = −1),
  V̂ = (1/n) Σ_i ( x_i − μ̂_{y_i} )( x_i − μ̂_{y_i} )^T
into F_Bayes(x).
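The plug-in rule can be coded directly from the estimates above; the sketch below assumes labels y ∈ {−1, +1} and a pooled covariance as on the slide.

```r
# Plug-in Fisher linear discriminant (a minimal sketch).
fisher_lda <- function(X, y) {
  X1 <- X[y == +1, , drop = FALSE]; X0 <- X[y == -1, , drop = FALSE]
  mu1 <- colMeans(X1); mu0 <- colMeans(X0)
  n1 <- nrow(X1); n0 <- nrow(X0); n <- n1 + n0
  # pooled covariance around the class means, divided by n as on the slide
  V <- (crossprod(sweep(X1, 2, mu1)) + crossprod(sweep(X0, 2, mu0))) / n
  a  <- solve(V, mu1 - mu0)
  a0 <- -0.5 * sum((mu1 + mu0) * a) + log(n1 / n0)
  function(xnew) sign(drop(xnew %*% a) + a0)
}
```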
151
19
Boost learning
Boosting by filtering (Schapire, 1990)
Bagging, arcing (bootstrap) (Breiman; Friedman; Hastie)
AdaBoost (Schapire, Freund, Bartlett, Lee)
Can weak learners be combined into a strong learner?
Weak learner (classifier): error rate slightly less than 0.5.
Strong learner (classifier): error rate close to that of the Bayes classifier.
152
20
Web-page on Boost
http://www.boosting.org/
http://www.fml.tuebingen.mpg.de/boosting.org/tutorials
R. Meir and G. Rätsch. An introduction to boosting and leveraginghttp://www.boosting.org/papers/MeiRae03.pdf
R.E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.http://www.boosting.org/papers/Sch99e.ps.gz
Robert Schapire’s home page:http://www.cs.princeton.edu/~schapire/Yoav Freund's home page :http://www1.cs.columbia.edu/~freund/
153
21
Set of weak learners
Decision stumps:  F_stamp = { f(x) = sgn( a x_j − b ) : j ∈ {1, ..., p}, a ∈ {−1, +1}, b ∈ R }
Linear classifiers:  F_linear = { f(x) = sgn( β^T x + β_0 ) : (β, β_0) ∈ R^{p+1} }
Neural nets, SVM, k-nearest neighbours.
Note: each is not strong, but they offer a variety of characters;  F_stamp ⊆ F_linear.
154
22
Exponential loss function
Let D_train = { (x_i, y_i) : i = 1, ..., n } be a training (example) set.
Empirical exponential loss for a score function F(x):
  L_exp^D(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }
Expected exponential loss:
  L_exp^E(F) = ∫ Σ_{y ∈ {−1,+1}} exp{ −y F(x) } q(y | x) q(x) dx,
where q(y | x) is the conditional distribution given x and q(x) is the pdf of x.
155
23
Learning algorithm
)},( ),...,,{( 11 nn yy xx
)(,),1( 11 nww
)(,),1( 22 nww1ε
2ε
1α
2α
Tα
)()1( xf
∑=
T
1)( )(
ttt f xα
)(,),1( nww TT
)()2( xf
)()( xTf
Final learner ∑=
=T
tttT fF
1)( )()( xx α
156
24
Learning curve
50 100 150 200 250
0.05
0.1
0.15
0.2
Iteration number
Training curve
157
25
AdaBoost
1. Initial:  w_1(i) = 1/n (i = 1, ..., n),  F_0(x) = 0.
2. For t = 1, ..., T:
  (a)  f^(t) = argmin_{f ∈ F} ε_t(f),  where ε_t(f) = Σ_i I( f(x_i) ≠ y_i ) w_t(i) / Σ_{i'} w_t(i')
  (b)  α_t = (1/2) log{ (1 − ε_t(f^(t))) / ε_t(f^(t)) }
  (c)  w_{t+1}(i) = w_t(i) exp( −α_t y_i f^(t)(x_i) )
3. Output sign( F_T(x) ),  where F_T(x) = Σ_{t=1}^T α_t f^(t)(x).
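To make the three steps concrete, here is a compact sketch with decision stumps as the weak learners (the exhaustive stump search and the small epsilon guard in α are implementation choices, not part of the slide); labels are y ∈ {−1, +1}.

```r
adaboost_stump <- function(X, y, n_round = 100) {
  n <- nrow(X); w <- rep(1 / n, n)
  stumps <- vector("list", n_round); alpha <- numeric(n_round)
  for (t in seq_len(n_round)) {
    best <- list(err = Inf)
    for (j in seq_len(ncol(X))) for (b in unique(X[, j])) for (s in c(-1, 1)) {
      pred <- s * sign(X[, j] - b + 1e-12)           # candidate stump
      err  <- sum(w * (pred != y))                   # weighted error (a)
      if (err < best$err) best <- list(err = err, j = j, b = b, s = s)
    }
    alpha[t] <- 0.5 * log((1 - best$err) / max(best$err, 1e-12))   # (b)
    pred     <- best$s * sign(X[, best$j] - best$b + 1e-12)
    w        <- w * exp(-alpha[t] * y * pred)                       # (c)
    w        <- w / sum(w)
    stumps[[t]] <- best
  }
  function(Xnew) {                                    # final learner
    score <- rep(0, nrow(Xnew))
    for (t in seq_len(n_round)) {
      st <- stumps[[t]]
      score <- score + alpha[t] * st$s * sign(Xnew[, st$j] - st$b + 1e-12)
    }
    sign(score)
  }
}
```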
158
26
Update weight
  w_t(i) → w_{t+1}(i):  multiply by e^{α_t} if f^(t)(x_i) ≠ y_i,  multiply by e^{−α_t} if f^(t)(x_i) = y_i.
Weighted error rate:  ε_t(f^(t)) → ε_{t+1}(f^(t)) → ε_{t+1}(f^(t+1)).
The worst case:  ε_{t+1}(f^(t)) = 1/2.
159
27
21)( )(1 =+ tt fε
∑∑+
+
=+ ≠=
)'()())(()(
1
1
1)(1 iw
iwyfIft
tn
iiittt xε
∑
∑
=
=
−
−≠= n
ititit
n
itititiit
iwfy
iwfyyfI
1)(
1)()(
)()}(exp{
)()}(exp{))((
x
xx
α
α
∑∑
∑
==
=
=−+≠
≠= n
itiitt
n
itiitt
n
itiitt
iwyfIiwyfI
iwyfI
1)(
1)(
1)(
)())((}exp{)())((}exp{
)())((}exp{
xx
x
αα
α
21
)}(1{)(1
)()(
)()(1
)()(
)(1
)()(
)()(
)(
)(
)()(
)(
=
−−
+−
−
=
tttt
tttt
tt
tt
tttt
tt
ff
ff
ff
ff
f
εε
εε
εε
εεε
160
28
Update in decision stumps
Wrong example 5 5 5 6 466 5 565
}|)(|21{minarg ∑ −=
iiij yfb
bx
Decision stump⎩⎨⎧
<−>+
=−=jj
jjjjjj bx
bxfffs
if1 if1
)(where)(or)()( xxxx
jbThe j-th feature vector njj xx ,...,1
jx
161
29
Next step
Errror 数 45.5 7 9
87.56 798.5
Update foe weights:
jx
4.5
jbWeight up to 2
}|)(|)(21{minarg ∑ −=
iiijj ysiwb
bx
2log4
16log5.0ans. false of nb.ans.correct of nb.log5.01 ==⎥⎦
⎤⎢⎣⎡=α
Weight down to 0.5
jb
162
30
Update in exponential loss
)}(exp{1)(1
exp ii
n
i
Fyn
FL x−= ∑=
)()()( xxx fFF α+→
}))(1()(){(exp fefeFL εε αα −+= −
Consider
[ ]∑=
− =+≠−=n
iiiiiii yfeyfeFy
n 1
))(I())(I()}(exp{1 xxx αα
∑=
−−=+n
iiiii fyFy
nfFL
1exp )}(exp{)}(exp{1)( xx αα
)(
)}(exp{))(()(
exp
1
FL
FyyfIf
n
iiiij∑
=
−≠=
xxεwhere
163
31
Sequential optimization
  L_exp(F + α f) = L_exp(F) { ε(f) e^{α} + (1 − ε(f)) e^{−α} }
  ε(f) e^{α} + (1 − ε(f)) e^{−α} ≥ 2 √{ ε(f)(1 − ε(f)) },
with equality if and only if
  α = α_opt = (1/2) log{ (1 − ε(f)) / ε(f) }.
32
AdaBoost = sequential minimization of the exponential loss
  (a)  f^(t) = argmin_{f ∈ F} ε(f)
  (b)  α_t = argmin_{α ∈ R} L_exp( F_{t−1} + α f^(t) ) = (1/2) log{ (1 − ε(f^(t))) / ε(f^(t)) },
       min_{α ∈ R} L_exp( F_{t−1} + α f^(t) ) = L_exp(F_{t−1}) · 2 √{ ε(f^(t))(1 − ε(f^(t))) }
  (c)  w_{t+1}(i) ∝ w_t(i) exp{ −α_t y_i f^(t)(x_i) }
165
33
Proposal from machine learning
Learnability: boosting weak learners?
AdaBoost : Freund & Schapire (1997)
Weak learners (machines)
})(,....),({ 1 xx pff
Strong machine
)()( )()1(1 xx tt ff αα ++
)( xf
)()1(1 xfαForward stagewise
166
34
Simulation (completely separable)
Feature space [−1, 1] × [−1, 1]; training data { (x_i, y_i) : i = 1, ..., 1000 }, x_i ∈ [−1, 1] × [−1, 1], y_i ∈ {−1, +1}.
Decision boundary:  x_2 = sin(2π x_1).
[Figure: the data and the true decision boundary.]
167
35
-1 -0.5 0.5 1
-1
-0.5
0.5
1
Set of linear classifiers
⎩⎨⎧
<++−≥+++
=++=0 if10 if1
)(sgn),(32211
322113221121 rxrxr
rxrxrrxrxrxxf
Linear classification machines
3321 )1,1(},,{ −Urrr ~
Random generation
168
36
Learning process
•
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
Iter = 1, train err = 0.21 Iter = 13, train err = 0.18 Iter = 17, train err = 0.10
Iter = 23, train err = 0.10 Iter = 31, train err = 0.095 Iter = 47, train err = 0.08
169
37
Learning process (II)
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
Iter = 55, train err = 0.061
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
Iter = 99, train err = 0.032
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
Iter = 155, train err = 0.016
170
38
Final decision boundary
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
Contour of F(x) Sign(F(x))
171
39
KL divergence
},,1{ G=Y
For conditinal distribution p(y|x), q(y|x) given x with commonmarginal density p(x) we can write
Nonnegative function
Feature space pR⊆X
),(),(),,( YXxxx ∈∈ yyym μ
∫ ∑=
+−=X
xxxxxx
G
yyym
yymymmD
1KL d)},(),(
),(),(log),({),( μ
μμ
Note
)()|(),(),()|(),( xxxxxx pyqypypym == μ
∫ ∑=
=X
xxxxx
G
yp
yqypypmD
1KL d)(}
)|()|(log)|({),( μ
Label set
Then,
172
40
Twine of KL loss functionFor data distribution q(x, y) = q(x) q(y|x) we model as
∑=
−=G
gyFgFym
11 )},(),(exp{)|( xxx
xxxxxxx
Xd)()}|()|(
)|()|(log)|({),(
}1,1{KL qymyq
ymyqyqmqD
y∫ ∑
−+∈
+−=
Then
∑=
= G
ggF
yFym
1
2
)},(exp{
)},(exp{)|(x
xx
Log lossExp loss
173
41
Bound for the exponential loss
Empirical exp loss:  L_exp^n(F) = (1/n) Σ_{i=1}^n exp( −y_i F(x_i) )
Expected exp loss:  L_exp(F) = E{ exp( −Y F(X) ) }
Theorem. Let 𝓕 be the space of all discriminant functions and F_opt = argmin_{F ∈ 𝓕} L_exp(F). Then
  F_opt(x) = (1/2) log{ p(y = +1 | x) / p(y = −1 | x) }.
174
42
Variational calculus
  (∂/∂F) L_exp(F) = (∂/∂F) E{ exp( −Y F(X) ) } = E{ −Y exp( −Y F(X) ) | X }
   = −p(y = +1 | X) exp( −F(X) ) + p(y = −1 | X) exp( F(X) ).
Setting this to zero gives exp( 2 F(X) ) = p(y = +1 | X) / p(y = −1 | X), hence
  F_opt(x) = (1/2) log{ p(y = +1 | x) / p(y = −1 | x) }.
175
43
On AdaBoost
FDA or logistic regression is a parametric approach to the Bayes classifier,
  F(x) = α^T x + α_0 = Σ_{j=1}^p α_j x_j + α_0.
AdaBoost:  F(x) = Σ_{t=1}^T α_t f_t(x).
Each f_t(x) is itself a classifier (cf. Real AdaBoost). The stopping time T can be selected according to the state of learning. AdaBoost is also a parametric approach to the Bayes classifier, but the dimension and the basis functions are flexible.
176
44
On AdaBoost (II)
1. Unbalanced examples
2. Onverleaning
EtaBoost
GroupBoost
AsymAdaBoost
LocalBoost
Robust against mislabel examples
Hi-dimension and small sample
Local learning
Balancing the false n/ps’
177
45
Simulation (complete random)
-1 -0.5 0.5 1
-1
-0.5
0.5
1
178
46
Overlearning of AdaBoost
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
Iter = 51, train err = 0.21
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
Iter = 151, train err = 0.06 Iter =301, train err = 0.0
179
47
U-Boost
))(())((1)(1
empθθ xθ qUq
nL
n
iiU ξξ +−= ∑
=U-empirical loss function
∑∑∑= ==
+−=n
i
G
gi
n
iiiU gFU
nyF
nFL
1 11
emp )),((1),(1)( xx
In a context of classification
Unnormalized U-loss
∑∑= =
−=n
i
G
giiiU yFgFU
nFL
1 1
(0) )),(),((1)( xx
Normalized U-loss
∑∑∑= ==
+−=n
i
G
gi
n
iiiU gFU
nyF
nFL
1 11
(1) )),((1),(1)( xx
∑=
=G
ggFu
11)),(( subject to x
180
48
U-Boost (binary)
Unnormalized U-loss
∑ ∑= ±=
−=n
i giiU FyU
nFL
1 1
(0) ))((1)( x
Note ))(()),(),((1
iig
iii FyUyFgFU xxx −=−∑±=
)}1,()1,({)(where21 −−= xxx FFF
Bayes risk consistency
)}|1())(()|1())(({)(
))(()( 1
xxxxx
xx
−=+=−∂∂
=−∂∂ ∑
±=ypFUypFU
FyFU
F y
))}(({minarg where)|1(
)|1())((
))(( **
*xyFUEF
ypyp
FuFu
F−=
−==
=− x
xx
x
F*(x) is Bayes risk consistent because 0)}({
)()()()()(
)(2 >
−−+−
=−∂
∂Fu
FuFuFuFuFu
FuF
181
49
Eta-loss function
Regularized AdaBoost with margin, with generator
  U(t) = (1 − η) exp(t) + η t.
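The sketch below codes the generator and the resulting empirical margin loss (the function names U_eta and eta_loss are mine); η ∈ [0, 1), with η = 0 recovering the exponential loss.

```r
# Eta-loss generator.
U_eta <- function(t, eta) (1 - eta) * exp(t) + eta * t

# Empirical eta-loss of a score F, written through the margins m_i = y_i F(x_i);
# the loss is U_eta evaluated at the negative margin.
eta_loss <- function(margin, eta) mean(U_eta(-margin, eta))
```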
182
50
EtaBoost for Mislabels Expected Eta-loss function
)}())({exp()1()( xx yFyFEFL ηηη −−−=
Optimal score )(minarg* FLFF
η=
The variational argument leads to
ηη
ηη
2))(1(
)1()|()()(
)(
*21*
21
*21
++−
+−=
− xx
x
xFF
yF
ee
eyp
)()(
)(
**
*
1
1)(1
))(1(xx
xxx
FF
F
ee
e
++
+−= εε
ηη
ηε2))(-(1
)( where)()(
21
21
++=
− xxx
FFee
Mislabel modeling
183
51
EtaBoost
0)(),1()(: settings Initial.1 011
=== xFniiw n
,)())((I)(1
iwfyf m
n
iiim ∑
=
≠∝ xε
∑=
=T
tttTT fFF
1)( )()( where,)(sign.3 )( xxx α
Tm ,,1For .2 =
))(exp()()()c( )(*
1 iimmmm yfiwiw xα−∝+
)(min)()a( )( ff mfmm εε =
(b)
184
52
A toy example
185
53
Examples partly mislabeled
186
54
AdaBoost vs. EtaBoost
AdaBoost EtaBoost
187
55
EtaBoost
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
Iter = 51, train err = 0.25 Iter = 51, train err = 0.15 Iter =351, train err = 0.18
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
-1 -0.5 0 0.5 1-1
-0.5
0
0.5
1
188
1
Learning with Information Divergence Geometry
Shinto Eguchi and
Osamu Komori
The Institute of Statistical Mathematics, Japan Email: [email protected], [email protected]: http//www.ism.ac.jp/~eguchi/
Tutorial Workshop on
189
2
Outline
9:30~10:30 Information divergence class and robust statistical methods I
11:00~12:00 Information divergence class and robust statistical methods II
13:30~14:30 Information geometry on model uncertainty
15:00~16:00 Boosting learning algorithm and U-loss functions I
9:30~10:30 Boosting learning algorithm and U-loss functions II
11:00~12:00 Pattern recognition from genome and omics data
190
3
Boosting learning algorithm and U-loss functions II
191
4
GOAL
Statistical inference
Microarray
SNPs
Proteome
A variety of functions associated with genes
Statistical Learning
genetics
informatics
medicine
Modeling and Prediction forKnowledge and discovery
192
5
Project for genome polymorphism analysis
mRNA
Protein
Genome
Microarray
SNPs
Proteome
[ Gene expression ]
[ Protein expression ]
[Single Nucleotide Polymorphism ]
193
6
Project mission
Microarray
SNPs
Proteome
[ Gene expression ]
[ Protein expression ]
[Polymorphism ]
Drug effect / adverse effect
Disease
[ anticancer drug, aspirin ]
[Cancer, diabetes, cardiopathy … ]
Metabolic syndrome
biomarker
Genomic/protemic
Phenotype
194
7
Expression Arrays and the p ≫ n Problem
T. Hastie, R. Tibshirani 20 November, 2003
Gene expression arrays typically have 50 to 100 samples and 5,000
to 20,000 variables (genes). There have been many attempts to adapt
statistical models for regression and classification to these data, and in
many cases these attempts have challenged the computational resources.......
>>
The Dantzig selector:Statistical estimation when p is much larger than n
E. Candes and T. Tao
Ann. Statist. 35, 6 (2007), 2313-2351.
195
8
Recent issue
Fan and Lv (2008)Sure independence screening Dantzig selector
Candes, E. and Tao, T. (2007)
p is much larger than n
Dimension: 1, 100, 1000, 30000
196
9
Unified view in machine learning
Dantzig selector (SVM, programming)
LASSO (LAR, Elastic net)
L2Boosting (Early stopping, εBoosting, Stagewize LASSO)
A tale of three cousins (Meinshausen, Rocha, Yu, 2007 )
L1 regularization
197
10
What is genomic data?
Genome — SNPs [single nucleotide polymorphism]
mRNA — microarray [gene expression]
Protein — proteome [protein expression]
(transcription, translation; diversity, metabolism)
[Figure: example mass spectra.]
198
11
Target
Microarray
SNPs
Proteome
[ Gene expression ]
[ Protein expression ]
[Polymorphism ]
Drug effect / adverse effect
Disease
[ anticancer drug, aspirin ]
[Cancer, diabetes, cardiopathy … ]
biomarker phenotype
),( 1 pxx=x }1,1{ +−∈y
199
12
GeneChip® HT Human Genome U133
Affymetrix's U133 contains more than 54,000 probes, manufactured using semiconductor technology.
For each gene a set of 11 probe pairs is designed; all the probes consist of 25-mer DNA.
200
13
Prediction from microarrays
),( 1 pxx=x
}1,1{ +−∈y
Feature vector dimension p = the nb of geneseach component is of gene expression
Class label names of disease, effect of drugs, adverse effect of drug
training data }1:),({train niyD ii ≤≤= x
yf →x:ˆclassifier )(ˆˆ xfy =predictor
201
14
Leukemic diseases Golub,T. et al. (1999) Science.
The first successful result202
15
Web microarray data
  Dataset    n    y = +1   y = −1   p
  ALL/AML    72   37       35       7129
  Colon      62   40       22       2000
  Estrogen   49   25       24       7129
p ≫ n
Open access data:
http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/
203
16
source("http://www.bioconductor.org/biocLite.R")
biocLite("GEOquery")
library(GEOquery)
d <- getGEO(file = "GSE2034_family.soft.gz")
http://www.ncbi.nlm.nih.gov/geo/
Gene Expression OmnibusNational Center for Biotechnology Information
204
17
Clustering for successful case
205
18
Jones, M. et al. Lancet 2004
Lung cancer
Pathological Classification
Prognosisidentification
Not successful case
206
19
Prediction from mass-spectrometry
),( 1 pxx=x
Feature vector dimension p = peak numbers of molecular masscomponents express peak value
}1,1{ +−∈y
Class label names of disease, effect of drugs, adverse effect of drug
training data }1:),({train niyD ii ≤≤= x
yf →x:ˆclassifier )(ˆˆ xfy =predictor
207
20
MW (Time of Flight)
LensLaser beam source
Mirror
Ion detector
High voltage
+++
+++
+++Protein ChipFlight tube (High vacuum)
Proteome method
Koichi Tanaka, MALDI-TOF MS, 2002 Nobel Prize in Chemistry
[SELDI TOF/MS]
208
21
Lung cancer vs. normal tissue
[Figure: SELDI mass spectra (m/z 1500-2500) for lung cancer and normal tissue samples.]
Proteomic data (Japanese Foundation for Cancer Research)
209
2222
Total proteomic data, filtered by common peaks; AdaBoost learning.
203 subjects (130 ovarian cancer cases, 73 controls).
Fushiki, Fujisawa, Eguchi (2006)
210
23
Goal = Prediction score
Machine learning
Training data = (clinical data, genomic data)
Prediction score
Pattern recognition
Genom
ic data
Clinical data
{xi} {yi}
Sensitivity
x y
Train and test practical realization
F(x)
211
24
Concordance among Gene-Expression–Based Predictors for Breast Cancer
Fan, et al NEJM 355:560-569, 2006
Prognosis prediction for breast cancer
70-gene profile: van 't Veer L J, et al. Nature 2002;415(6871):530-6. Recurrence score: Paik S, et al. NEJM 2004;351:2817-26. Mechanism-derived: Chang H Y, et al. PNAS 2005;102(10):3738-43. Proper subtype: Sorlie T, et al. PNAS 2001;98(19):10869-74. 2-gene ratio: Ma X J, et al. Cancer Cell 2004;5(6):607-16.
5 studies suggest different sets of genes that are related withprognosis for breast cancer
Four of five studies show substantial performance for new validation test data (295 samples)
212
25
Nature 2002; 415: 530-6.
213
26
Single variable study
Let x be a feature vector from a genomic monitor and y an outcome value or phenotype.
Joint analysis p(x | y) versus single-variable analyses p(x_j | y) (j = 1, ..., p).
Note: The multivariate analysis of (x, y) is essentially different from the set of all single-variable analyses. Basically, genomic data are correlated within biological networks, which implies that the information content is not so huge even if the number of observed genes is larger than several tens of thousands.
27
2-sample test approach
The data set D = { (x_1, y_1), ..., (x_n, y_n) } is decomposed into the two samples
  x_j^0 = { x_{ij} : y_i = −1 } = ( x_{1j}^0, ..., x_{n_0 j}^0 ),
  x_j^1 = { x_{ij} : y_i = +1 } = ( x_{1j}^1, ..., x_{n_1 j}^1 )   (j = 1, ..., p),
giving { (x_j^0, x_j^1) : j = 1, ..., p }, where n = n_0 + n_1.
Consider the hypothesis H:  p( x_j | y = +1 ) = p( x_j | y = −1 )
for the two samples { x_{kj}^0 }_{k=1}^{n_0} and { x_{ij}^1 }_{i=1}^{n_1}.
215
28
Typical two-sample test statistics
Z score:  Ẑ(X_j) = ( x̄_j^1 − x̄_j^0 ) / s_j
Student's t test:  t̂(X_j) = ( x̄_j^1 − x̄_j^0 ) / √{ ( (n_1 − 1) s_{1j}² + (n_0 − 1) s_{0j}² ) / (n − 2) }
Wilcoxon test:  Ĉ(X_j) = (1/(n_0 n_1)) Σ_{k=1}^{n_0} Σ_{i=1}^{n_1} I( x_{ij}^1 > x_{kj}^0 )
Note: the Z score and the t test are invariant under the data transformation x → ax + b (a > 0). The Wilcoxon test is invariant under any monotone transformation F:  Ĉ(F(X_j)) = Ĉ(X_j).
However, for different j and k the above test statistics are computed separately and bear no relation to each other.
(We reconsider this point in Part II of the second half.)
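The per-feature statistics can be computed and ranked in a few lines; the sketch below assumes an n × p matrix X and labels y ∈ {−1, +1}, and uses Welch's t from t.test in place of the pooled-variance version.

```r
# Compute Z, t, and Wilcoxon statistics for every feature and rank by |t|.
rank_features <- function(X, y) {
  stats <- apply(X, 2, function(xj) {
    x1 <- xj[y == +1]; x0 <- xj[y == -1]
    c(z = (mean(x1) - mean(x0)) / sd(xj),
      t = unname(t.test(x1, x0)$statistic),
      w = unname(wilcox.test(x1, x0)$statistic))
  })
  order(abs(stats["t", ]), decreasing = TRUE)   # indices of the top-ranked features
}
```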
216
29
p two-group comparisons
Adopt a test statistic { T̂(X_j) : j = 1, ..., p } for the p two-group comparisons, compute the p statistics T̂(X_j), and order them as
  T̂(X_(1)) ≥ ··· ≥ T̂(X_(d)) ≥ ··· ≥ T̂(X_(p)).
Here the multiplicity of testing (FDR) and the accuracy of the ranking are important issues. However, since the problem currently targeted is phenotype prediction, the ranking is used for the following filtering:
  dimension  1 — 100 — 1000 — 30000,  d ≪ p.
217
30
Hierarchical clustering
Clustering ⊆ unsupervised learning
Hierarchical clustering: minimum distance method
Optimal partition clustering: k-means, self-organizing map
Eisen et al. (1998) PNAS
http://derisilab.ucsf.edu/data/microarray/software.html
218
31
ArrayMaker Version 2
Gal File Maker v1.2
Cluster:
Tree View:
J-Express:
http://derisilab.ucsf.edu/data/microarray/software.html
219
32
LASSO
Linear model:  y_i = x_i^T β + α + ε_i   (i = 1, ..., n),  where x_i and β are p-dimensional vectors.
Lasso estimator (least absolute shrinkage and selection operator; Tibshirani, 1996, JRSSB):
  (α̂, β̂) = argmin Σ_{i=1}^n ( y_i − α − x_i^T β )²  subject to  Σ_{j=1}^p |β_j| ≤ t.
220
33
Sparse learning
The Lagrange method for the constrained optimization:
  (∂/∂β) { Σ_{i=1}^n ( y_i − α − x_i^T β )² − λ ( Σ_{j=1}^p |β_j| − t ) } = 0
  ⟺  Σ_{i=1}^n x_i ( y_i − α − x_i^T β ) − λ sgn(β) = 0.
The Lasso estimator and the LSE have the connection
  β̂_OLS = β̂ + λ ( Σ_{i=1}^n x_i x_i^T )^{−1} sgn(β̂).
Assume that Σ_{i=1}^n x_i x_i^T = I (identity matrix). Then
  β̂_j^OLS = β̂_j + λ sgn(β̂_j)   (j = 1, ..., p).
221
34
Sparseness representation
An orthonormal design matrix ( Σ_i x_i x_i^T = I ) leads to
  β̂_j^OLS = β̂_j + λ sgn(β̂_j)   (j = 1, ..., p).
Inverting the above provides the soft-thresholding rule
  β̂_j = sgn(β̂_j^OLS) ( |β̂_j^OLS| − λ )_+,   where (A)_+ = A if A > 0 and 0 otherwise.
[Figure: β̂_j as a function of β̂_j^OLS, flat on [−λ, λ] and shifted by ±λ outside.]
Note: λ is uniquely determined by the constraint Σ_j |β̂_j| = t.
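The soft-thresholding operator is one line of code; the sketch below applies it to a hypothetical vector of OLS coefficients under the orthonormal-design assumption above.

```r
# Soft-thresholding: the lasso solution under an orthonormal design.
soft_threshold <- function(b_ols, lambda) sign(b_ols) * pmax(abs(b_ols) - lambda, 0)

b_ols <- c(3.2, -0.4, 0.9, -2.5)     # hypothetical OLS coefficients
soft_threshold(b_ols, lambda = 1)    #  2.2  0.0  0.0 -1.5
```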
222
35
Control of sparseness
tj =∑ |ˆ| β
t
Control is reduced to a choice of t
223
36
Microarray data
),( 1 pxx=x }1,1{ +−∈y
Leaning machines
)( xfy =
p signals Class labels
ABoost
)},(),...,,{( 11 nn yy xx
)(,),1( 11 nww
)(,),1( 22 nww1ε
2ε
1α
2α
Tα
∑=
T
1)( )(
ttt f xα
)(,),1( nww TT
)()2( xf
)()( xTf
)()1( xf
)}(...,),({ 1 xx Kff ∑=
T
1)( )(
ttt f xα
Difficult Problem: p >> n
224
37
Data
)(,),1( 11 nww
)(,),1( 22 nww
1ε
2ε
Tα
),1()1,1(
),1()1,1(
,,)(,),(
G
Gffαα
xx
∑=
T
ttf
1)( )(x
)(,),1( nww TT
Grouping G machines )( ),(),()1,()1,(1
)( GtGtttGt fff αα ++=
)()1( xf
)()2( xf
)()( xTf
),2()1,2(
),2()1,2(
,,)(,),(
G
Gffαα
xx
),()1,(
),()1,(
,,)(,),(
GTT
GTT ffαα
xx
GroupBoost
GroupBoost )}(...,),({ 1 xx Kff ∑=
T
1)( )(
ttt f xα
225
38
Results Data size
Test error
226
39
Lung cancer analysisGene number Test error
true genes
false genes
227
40
228
41
MW (Time of Flight)
LensLaser beam source
Mirror
Ion detector
High voltage
+++
+++
+++Protein ChipFlight tube (High vacuum)
Proteome[SELDI TOF/MS]
229
42
0
20
40
0
20
40
60
0
20
40
60
20
40
60
25
50
75
100
10
20
30
40
25
50
75
25
50
75
100
25
50
75
100
1500 1750 2000 2250 2500
Lungcancer
Normaltissue
Proteome data
230
43
Averaged curve203 subjects (130 cases (ovarian cancer),73 controls)
231
44
Common peaks
hth
SpecAlign, Wong et al. (2005)
232
45
Peak pattern recognittion
),( 1 pxx=x }1,1{ +−∈y
Learning machines
)( xfy =
p peaks Class labels,
AdaBoost
Example
)(,),1( 11 nww
)(,),1( 22 nww1ε
2ε
1α
2α
Tα
∑=
T
1)( )(
ttt f xα
)(,),1( nww TT
)()2( xf
)()( xTf
)()1( xf
Test data(32cases+18contols)
233
46
Association with drug effect
A joint work with Japan Institute Cancer Research Foundation
Breast cancer patients and a drug effect association
Supervised detection for common peaks
0
0.05
0.1
0.15
0.2
0.25
0.3
0 5 10 15 20 25 30 35
training error
CV error
test error
(peaks)
error rate
234
47
Inference for prediction
Microarray
SNPs
Proteome
Genome Function
Expression pattern(GroupBoost)
Peak pattern (Common peak)
SNP haploblock
Statistical machine Learning
235
Top Related