Sugiyama Lab, NII - 隣接代数と双対平坦構造を用い …Nov.20–23,2019 IBIS2019...

Nov. 20–23, 2019IBIS 2019

隣接代数と双対平坦構造を用いた学習

杉山麿人（国立情報学研究所，JSTさきがけ研究者）

Matrix Balancingp₁₁ p₁₂

p₂₁ p₂₂

1/41

Matrix Balancing

r₁s₁ p₁₁ r₁s₂ p₁₂

r₂s₁ p₂₁ r₂s₂ p₂₂

p₁₁ p₁₂

p₂₁ p₂₂

s₁ 0

0 s₂

r₁ 0

0 r₂

∑j r₁sj p₁j = 1

∑j r₂sj p₂j = 1

∑i ris₁ pi₁ = 1 ∑i ris₂ pi₂ = 1

=

Find r and s:(Make doublystochasticmatrix)

1/41

Sinkhorn-Knopp Algorithm• Alternately rescale all rows and columnsof a matrix 𝑃 to sum to 1

• Commonly used to compute entropy-regularizedOptimal transport (Wasserstein distance)

2/41

Revisit Matrix Balancing [ICML2017]

p₁₁ p₁₂ p₁₃

p₂₁ p₂₂ p₂₃

p₃₁ p₃₂ p₃₃

3/41

https://arxiv.org/abs/1702.08142

Introduce 𝜼

p₁₁ p₁₂ p₁₃

p₂₁ p₂₂ p₂₃

p₃₁ p₃₂ p₃₃

η₁₁

η₂₁

η₃₁

η₁₂ η₁₃

4/41

Introduce 𝜼

3

2

1

2 1p₁₁ p₁₂ p₁₃

p₂₁ p₂₂ p₂₃

p₃₁ p₃₂ p₃₃

η₁₁

η₂₁

η₃₁

η₁₂ η₁₃

4/41

Introduce 𝜽

p₁₁ p₁₂ p₁₃

p₂₁ p₂₂ p₂₃

p₃₁ p₃₂ p₃₃

θ₂₂ θ₂₃

θ₃₂ θ₃₃

θij = log pijlog pi–₁j – log pij–₁log pi–₁j–₁

–+

5/41

Balancing as Constraints on 𝜼 and 𝜽

3

2

1

2 1p₁₁ p₁₂ p₁₃

p₂₁ p₂₂ p₂₃

p₃₁ p₃₂ p₃₃

η₁₁

η₂₁

η₃₁

η₁₂ η₁₃

θ₂₂ θ₂₃

θ₃₂ θ₃₃

θij = log pijlog pi–₁j – log pij–₁log pi–₁j–₁

–+

Matrix balancing ⇔Satisfy ηi₁ = η₁i = 3 – i + 1with keeping all θij

6/41

Natural Gradient• Given 𝑃 ∈ ℝ𝑛×𝑛, introduce (𝜃, 𝜂) aslog𝑝𝑖𝑗 =

∑

𝑘≤𝑖

∑

𝑙≤𝑗𝜃𝑘𝑙, 𝜂𝑖𝑗 =

∑

𝑘≥𝑖

∑

𝑙≥𝑗𝑝𝑘𝑙

• Let 𝐼 = {11, 12,… , 1𝑛, 21,… , 𝑛1}, 𝜽 = (𝜃𝑖)𝑇𝑖∈𝐼 , 𝜼 = (𝜂𝑖)𝑇𝑖∈𝐼• Using Fisher information matrix 𝐺 ∈ ℝ|𝐼|×|𝐼| s.t.𝑔(𝑖𝑗)(𝑘𝑙) = 𝜂max{𝑖,𝑘}max{𝑗,𝑙} − 𝜂𝑖𝑗𝜂𝑘𝑙, update formula is𝜽next = 𝜽 − 𝐺−1(𝜼 − 𝜼correct)

7/41

Results on Hessenberg MatrixN

umbe

r of i

tera

tion

100 500 5000100102104106

nnRu

nnin

g tim

e (s

ec.)

100 500 5000

10410210010–2

106108

> x1000faster

NaturalBNEWTSinkhorn

8/41

Introduce Partial Order Structure

p₁₁ p₁₂ p₁₃

p₂₁ p₂₂ p₂₃

p₃₁ p₃₂ p₃₃

s(p(s), θₛ, ηₛ)

9/41

Partially Ordered Sets (Posets)

∅

{a,b,c}

{a,b} {a,c} {b,c}

{a} {b} {c}

{a,b,c}

{a,b} {a,c} {b,c}

{a} {b} {c}

2{a,b,c} ℕ² Network

10/41

Incidence Algebra• Incidence algebra is defined over a poset (𝑆,≤)

– (Closed) Interval [𝑎, 𝑏] = {𝑠 ∈ 𝑆 ∣ 𝑎 ≤ 𝑠 ≤ 𝑏}

• Members of the incidence algebra arefunctions 𝑓(𝑎, 𝑏) from intervals [𝑎, 𝑏] to a scalar with(𝑓 + 𝑔)(𝑎, 𝑏) = 𝑓(𝑎, 𝑏) + 𝑔(𝑎, 𝑏)

(𝑓𝑔)(𝑎, 𝑏) =∑

𝑎≤𝑥≤𝑏𝑓(𝑎, 𝑥)𝑔(𝑥, 𝑏) (convolution)

11/41

Analogy to Matrix Multiplication• For [𝑎, 𝑏], define 𝒇,𝒈 ∈ ℝ|[𝑎,𝑏]| as

𝒇 =⎡⎢⎣

𝑓(𝑎, 𝑎)⋮

𝑓(𝑎, 𝑏)

⎤⎥⎦, 𝒈 =

⎡⎢⎣

𝑔(𝑎, 𝑎)⋮

𝑔(𝑏, 𝑎)

⎤⎥⎦

• For 𝑓 and 𝑔,(𝑓𝑔)(𝑎, 𝑏) = 𝒇𝑇𝒈

12/41

Special Elements• Delta function 𝛿:

𝛿(𝑎, 𝑏) = { 1 if 𝑎 = 𝑏0 otherwise

• Zeta function 𝜁: (integral)

𝜁(𝑎, 𝑏) = { 1 if 𝑎 ≤ 𝑏0 otherwise

• Möbius function 𝜇 = 𝜁−1: 𝜁𝜇 = 𝛿 (differential)13/41

Möbius Inversion Formula• Given a poset 𝑆, for any functions 𝑓, 𝑔 ∶ 𝑆 → ℝ,the Möbius inversion formula is given as⎧⎪

⎨⎪⎩

𝑔(𝑥) =∑

𝑠∈𝑆𝜁(𝑠, 𝑥)𝑓(𝑠) =

∑

𝑠≤𝑆𝜁(𝑠, 𝑥)𝑓(𝑠) =

∑

𝑠≤𝑥𝑓(𝑠)

𝑓(𝑥) =∑

𝑠∈𝑆𝜇(𝑠, 𝑥)𝑔(𝑠) =

∑

𝑠≤𝑆𝜇(𝑠, 𝑥)𝑔(𝑠)

14/41

E.g.1: Inclusion-Exclusion Principle

A∩B∩C

A∪B∪C

A∩B A∩C B∩C

A B C

A∩B∩C

A∪B∪C

A∩B A∩C B∩C

A B C f (X) = |X| g(X) = |X \ ⋃ Y|Y ⊂ X

f (X) = ∑Y≤X g(X)g(X) = ∑Y≤X μ(Y,X) f (Y)

|A∪B∪C|= |A|+|B|+|C|–|A∪B|–|A∪C|–|B∪C|+|A∩B∩C|

15/41

E.g.2: Divisibility• Divisibility poset: 𝑎 ≤ 𝑏 ⇐⇒ 𝑏|𝑎 (𝑎 divides 𝑏)• The Möbius function: 𝑛 = 𝑏∕𝑎 and

𝜇(𝑛) = { (−1)𝑘 if 𝑛 = 𝑝1𝑝2…𝑝𝑘 for 𝑘 distinct primes0 otherwise

– 𝜇(𝑎, 𝑏) = 𝜇(𝑏|𝑎), the Möbius function in number theory• The Riemann zeta function 𝜁 is given by1∕𝜁(𝑠) =

∑∞

𝑛=1𝜇(𝑛)∕𝑛𝑠

16/41

Log-Linear Model on Poset [ICML2017]

• For probability 𝑝∶𝑆 → (0, 1) with∑𝑥∈𝑆 𝑝(𝑥) = 1,introduce 𝜽 and 𝜼 as𝜃𝑥 =

∑𝑠∈𝑆

𝜇(𝑠, 𝑥) log𝑝(𝑠),

𝜂𝑥 =∑

𝑠∈𝑆𝜁(𝑥, 𝑠)𝑝(𝑠) =

∑𝑠≥𝑥

𝑝(𝑠)

• From the Möbius inversion formula, log-linear model is:log𝑝(𝑥) =

∑𝑠∈𝑆

𝜁(𝑠, 𝑥)𝜃𝑠 =∑

𝑠≤𝑥𝜃𝑠

17/41


Log-Linear Model on Poset

x

(p(x), θx, ηx)

18/41


x

(p(x), θx, ηx)

log p(x) = ∑s≤x θs

18/41


x

(p(x), θx, ηx)


ηx = ∑s≥x p(s)

18/41

Exponential Family• The log-linear model on posets belongs tothe exponential family

• 𝜽 : Natural parameter• 𝜼 : Expectation parameter

19/41

Binary Log-Linear Model [AAAI2019](= Boltzmann machine)

∅

{a,b,c}

{a,b} {a,c} {b,c}

{a} {b} {c}

{a,b,c}

{a,b} {a,c} {b,c}

{a} {b} {c}

2{a,b,c}

20/41



∅

{a,b,c}

{a,b} {a,c} {b,c}

{a} {b} {c}

{a,b,c}

{a,b} {a,c} {b,c}

{a} {b} {c}

2{a,b,c}2{a,b,c}

∅

{a,b,c}

{a,b} {a,c}

{a} {b}

{a,b,c}

{a,b} {a,c}

{a} {b}

{b,c}

{c}{c}


ηx = ∑s≥x p(s)

20/41



000

100

110 101

111

000

100

110 101

111{0, 1}3

011011

001001010010

21/41



000

100

110 101

111

000

100

110 101

111{0, 1}3

011011

001001010010


= –ψ + ∑i θixi + ∑i,jθijxixj + ···

For x with xi₁ = ··· = xik = 1,ηx = ∑s≥x p(s)

= �[xi₁···xik] = Pr(xi₁ = ··· = xik = 1)000

100

110 101

111

000

100

110 101

111{0, 1}3

011

001001010010

21/41


Dually Flat Structure• Let 𝜓(𝜃) = −𝜃(⊥) (convex, partition function)

𝜓(𝜃)Legendre transformation,,,,,,,,,,,,,,,,,,,,,,→ 𝜙(𝜂) =

∑

𝑥∈𝑆𝑝(𝑥) log𝑝(𝑥)

• (𝜓(𝜃), 𝜙(𝜂)) leads to dually flat coordinate system (𝜃, 𝜂):

∇𝜓(𝜃) = 𝜂, 𝜕𝜕𝜃𝑥

𝜓(𝜃) = 𝜂𝑥

∇𝜙(𝜂) = 𝜃, 𝜕𝜕𝜂𝑥

𝜙(𝜂) = 𝜃𝑥22/41

Riemannian Metric (Fisher Information)

𝜕𝜕𝜃𝑥

𝜕𝜕𝜃𝑦

𝜓(𝜃) = 𝜕𝜕𝜃𝑥

𝜂𝑦 =∑

𝑠∈𝑆𝜁(𝑥, 𝑠)𝜁(𝑦, 𝑠)𝑝(𝑠) − 𝜂𝑥𝜂𝑦

𝜕𝜕𝜂𝑥

𝜕𝜕𝜂𝑦

𝜙(𝜂) = 𝜕𝜕𝜂𝑥

𝜃𝑦 =∑

𝑠∈𝑆𝜇(𝑠, 𝑥)𝜇(𝑠, 𝑦)𝑝(𝑠)−1

𝔼𝑠 [𝜕𝜕𝜃𝑥

log𝑝(𝑠) 𝜕𝜕𝜂𝑦

log𝑝(𝑠)] = 𝛿(𝑥, 𝑦)

23/41

Riemannian Metric (Fisher Information)


log𝑝(𝑠) 𝜕𝜕𝜃𝑦

log𝑝(𝑠)] =∑

𝑠∈𝑆𝜁(𝑥, 𝑠)𝜁(𝑦, 𝑠)𝑝(𝑠) − 𝜂𝑥𝜂𝑦

𝔼𝑠 [𝜕𝜕𝜂𝑥


log𝑝(𝑠)] =∑

𝑠∈𝑆𝜇(𝑠, 𝑥)𝜇(𝑠, 𝑦)𝑝(𝑠)−1



log𝑝(𝑠)] = 𝛿(𝑥, 𝑦)

24/41

Mixed Coordinate System• Many problems are formulated as coordinate mixing

P = ( θ1, θ2, ..., θi–1, θi, θi+1, ..., θₙ )

Q = ( η1, η2, ..., ηi–1, θi, θi+1, ..., θₙ )

R = ( η1, η2, ..., ηi–1, ηi, ηi+1, ..., ηₙ )

e-projection(MLE)m-projection

Pythagorean theorem:KL(P, R) = KL(P, Q) + KL(Q, R)

(Q is always unique)

25/41

Mixed Coordinate System (Example)• Many problems are formulated as coordinate mixing

P = ( , , ..., , , , ..., )

Q = ( η1, η2, ..., ηi–1, , , ..., )

R = ( η1, η2, ..., ηi–1, ηi, ηi+1, ..., ηₙ )

Pythagorean theorem:KL(P, R) = KL(P, Q) + KL(Q, R)

(Q is always unique)

0 0 0 0 0 0

0 0 0

Uniform dist.

Empirical dist.

26/41

Two Submanifolds

P = (θ1,θ2,...,θi–1,θi,θi+1,...,θₙ)

R = (η1,η2,...,ηi–1,ηi,ηi+1,...,ηₙ)

Q = (η1,η2,...,ηi–1,θi,θi+1,...,θₙ)

Fix

Fix

e-projection= MLE (Model manifold)

(Data manifold)27/41

Gradient methods for e-projection• e-projection is convex optimization• Gradient descent (first-order):𝜽next ← 𝜽 − 𝜀(𝜼 − �̂�target)

• Natural gradient (second-order)𝜽next ← 𝜽 − 𝐺−1(𝜼 − �̂�target)– 𝐺 is Fisher information matrix w.r.t. 𝜃

• Coordinate descent [IBIS2019]28/41

Boltzmann Machine Training

00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110

1111Boltzmannmachine

Samplespace

29/41

Boltzmann Machine Training

00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace θ = 0

η = η

29/41

Matrix Balancing

θ = θ

η = 3 – i + 1

3x3 matrixas poset:

30/41

Legendre Decomposition [NeurIPS 2018]

3x3x3 tensoras poset:

31/41


Legendre Decomposition [NeurIPS 2018]

3x3x3 tensoras poset:

θ = 0η = η

31/41


Introducing Hidden Variables in BM

00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110

1111BM withhidden variables

Samplespace

32/41


00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace η0101 = ?

32/41


00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace θ = 0

η = η

Need to estimate ηvia EM nonconvexNeed to estimate ηvia EM nonconvex

32/41

Introducing Hidden States

00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace

33/41


00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace

Hiddenstates

34/41


00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace

Hiddenstates

Optimization isstill convex!

ηx

x

35/41


00001000 0100

1100

1110

1111

00001000 0100

1100 10101010

1110


Samplespace θ = 0

η = η

36/41

Blind Source Separation [arXiv]

SourceMixing Received

e.g. image

37/41


Blind Source Separation [arXiv]

SourceMixing Received

e.g. image

θ = 0η = η37/41


Results of BSS

0.270320.436300.621950.37167

IGBSSFastICANMFDicLearn

Source

Received

Reconst.

Method RMSE

38/41

Relationship to Homology• Möbius function is Euler characteristics• Consider order complex ∆(𝑆) of a poset 𝑆 with ⊥,⊤ ∈ 𝑆

(i) Vertices of ∆(𝑆) are elements of 𝑆(ii) Faces of ∆(𝑆) are chains of 𝑆

• For the Euler characteristic 𝜒(∆(𝑆)),𝜇(⊥,⊤) = 𝜒(∆(𝑆)) + 1– Two spaces are homotopy equivalent⇒ Euler characteristics are the same

39/41

Summary: Recipe for Poset-LogLinear1. Treat the target as a poset

2. Introduce the log-linear model on poset

3. Formulate the objective as coordinate mixing

4. Solve it by a gradient method

(This slide is at https://mahito.nii.ac.jp/)

40/41

https://mahito.nii.ac.jp/

Acknowledgment• Tsuda, K. (UTokyo), Nakahara, H. (RIKEN CBS)• Yamada, R. (KyotoU), Mimura, K. (HiroshimaCU)• Luo, S., Azizi, L. (USydney)• Hayashi, S., Matsushima, S. (UTokyo)• Borgwardt, K. and his lab members (ETH Zürich)• Lab members at NII

41/41

Sugiyama Lab, NII - 隣接代数と双対平坦構造を用い …Nov.20–23,2019 IBIS2019...

Documents

Transcript of Sugiyama Lab, NII - 隣接代数と双対平坦構造を用い …Nov.20–23,2019 IBIS2019...

Sugiyama Lab, NII - 隣接代数と双対平坦構造を 用い …Nov.20–23,2019 IBIS2019...

Documents

Transcript of Sugiyama Lab, NII - 隣接代数と双対平坦構造を 用い …Nov.20–23,2019 IBIS2019...

Sugiyama Lab, NII - 隣接代数と双対平坦構造を用い …Nov.20–23,2019 IBIS2019...

Transcript of Sugiyama Lab, NII - 隣接代数と双対平坦構造を用い …Nov.20–23,2019 IBIS2019...