Wasserstein GAN - POSTECH (mlg.postech.ac.kr/~readinglist/slides/20170313.pdf)
Wasserstein GAN
Juho Lee
Jan 23, 2017
Wasserstein GAN (WGAN)
- arXiv submission
- Martin Arjovsky, Soumith Chintala, and Léon Bottou
- A new GAN model that minimizes the Earth-Mover's distance (Wasserstein-1 distance)
- Stabilizes GAN training, with far less mode collapse
- Provides meaningful learning curves that are useful for debugging
Towards principled methods for training generative adversarial networks
- ICLR 2017 (oral)
- Martin Arjovsky and Léon Bottou
- Why do updates get worse as the discriminator gets better?
- Why is GAN training massively unstable?
- The impact of the $-\log D(G(z))$ trick: is it following the JSD?
Learning a probability distribution
- Given a set of observations $\{x_i\}_{i=1}^n$, assume a model distribution $P_\theta$ from a parametric family.
- Select a distance measure $\rho(P_\theta, P_r)$ between the model distribution and the real distribution $P_r$.
- Convergence: as $t \to \infty$, $\theta_t \to \theta$, so $P_{\theta_t} \to P_\theta$, where $\rho(P_r, P_\theta) \to 0$.
- Desirable condition: the mapping $\theta \mapsto \rho(P_r, P_\theta)$ is continuous.
Distances between probability distributions I
Let $(\mathcal{X}, \Sigma)$ be a measurable space, where $\mathcal{X}$ is a compact metric set and $\Sigma$ is its Borel $\sigma$-algebra.
- The Total Variation (TV) distance
$$\delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)|.$$
- The Kullback-Leibler (KL) divergence
$$KL(P_r \| P_\theta) = \int \log\left(\frac{P_r(x)}{P_\theta(x)}\right) P_r(x)\, d\mu(x),$$
where both $P_r$ and $P_\theta$ are assumed to be absolutely continuous, and therefore admit densities, with respect to the same measure $\mu$ on $\mathcal{X}$.
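For intuition, both quantities can be evaluated directly when the distributions are discrete. The sketch below is only an illustration under that assumption; the function names and the toy distributions `p_r`, `p_theta`, and `p_far` are made up for the example and do not come from the slides.

```python
import math

def total_variation(p, q):
    """TV distance between discrete distributions given as {outcome: prob}."""
    support = set(p) | set(q)
    # For discrete distributions, sup_A |P(A) - Q(A)| equals half the L1 distance.
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def kl_divergence(p, q):
    """KL(p || q); infinite as soon as p puts mass where q has none."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical toy distributions.
p_r = {"a": 0.5, "b": 0.5}
p_theta = {"a": 0.9, "b": 0.1}
p_far = {"c": 1.0}  # support disjoint from p_r

tv = total_variation(p_r, p_theta)
kl_fwd = kl_divergence(p_r, p_theta)
kl_rev = kl_divergence(p_theta, p_r)
kl_disjoint = kl_divergence(p_r, p_far)

print(f"TV = {tv:.3f}")
print(f"KL(p_r || p_theta) = {kl_fwd:.4f}, KL(p_theta || p_r) = {kl_rev:.4f}")
print(f"KL with disjoint supports = {kl_disjoint}")
```

The two KL values differ, which shows that KL is not symmetric, and the disjoint-support case shows how it can become infinite.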
Distances between probability distributions II
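- The Jensen-Shannon (JS) divergence
$$JS(P_r, P_\theta) = \frac{1}{2} KL(P_r \| P_m) + \frac{1}{2} KL(P_\theta \| P_m),$$
where $P_m := (P_r + P_\theta)/2$.
- The Earth-Mover's (EM) distance or Wasserstein-1 distance
$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma}\left[\,|x - y|\,\right],$$
where $\Pi(P_r, P_\theta)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P_r$ and $P_\theta$.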
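A small numerical sketch may help contrast the two definitions. It assumes discrete distributions on the real line and uses the standard one-dimensional identity $W = \int |F_p(x) - F_q(x)|\,dx$ (with $F$ the CDF) instead of solving the infimum over couplings; that identity and all names below are assumptions of the example, not part of the slides.

```python
import math

def js_divergence(p, q):
    """JS divergence between discrete distributions given as {point: prob}."""
    def kl_to_mixture(a, b):
        # KL(a || m) with m = (a + b) / 2; m > 0 wherever a > 0, so it is finite.
        return sum(ax * math.log(ax / (0.5 * (ax + b.get(x, 0.0))))
                   for x, ax in a.items() if ax > 0.0)
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)

def wasserstein1_1d(p, q):
    """W1 on the real line, computed as the integral of |F_p - F_q|."""
    points = sorted(set(p) | set(q))
    cdf_gap, total = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)  # F_p - F_q on [left, right)
        total += abs(cdf_gap) * (right - left)
    return total

p_r = {0.0: 0.5, 1.0: 0.5}  # hypothetical reference distribution

results = {}
for shift in (0.0, 0.5, 2.0, 10.0):
    p_shifted = {x + shift: mass for x, mass in p_r.items()}
    results[shift] = (js_divergence(p_r, p_shifted), wasserstein1_1d(p_r, p_shifted))
    print(f"shift={shift:5.1f}  JS={results[shift][0]:.4f}  W={results[shift][1]:.4f}")
```

Once the supports no longer overlap, JS stays at $\log 2$ regardless of the shift, while $W$ keeps growing with it.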
Distances between probability distributions III
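Let $Z \sim \mathrm{Unif}([0, 1])$, let $P_0$ be the distribution of $(0, Z)$, and let $P_\theta$ be the distribution of $(\theta, Z)$.
- $KL(P_\theta \| P_0) = \begin{cases} \infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
- $JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
- $\delta(P_0, P_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$
- $W(P_0, P_\theta) = |\theta|$.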
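One way to see why $W(P_0, P_\theta) = |\theta|$ is to sandwich it. Pairing $(0, z)$ with $(\theta, z)$ is a valid coupling, so its cost is an upper bound; a 1-Lipschitz test function gives a lower bound through the Kantorovich-Rubinstein duality stated later in these slides. The sketch below checks both bounds on samples; the value of `theta`, the sample size, and the test function `f` are choices made for this illustration.

```python
import math
import random

random.seed(0)
theta = 0.7                                   # hypothetical offset between the two lines
z = [random.random() for _ in range(1000)]    # samples of Z ~ Unif([0, 1])
points_0 = [(0.0, zi) for zi in z]            # samples from P_0
points_theta = [(theta, zi) for zi in z]      # samples from P_theta

# Upper bound: the coupling that pairs (0, z) with (theta, z) moves every
# point horizontally by |theta|.
coupling_cost = sum(math.dist(a, b) for a, b in zip(points_0, points_theta)) / len(z)

# Lower bound: f(x, y) = -sign(theta) * x is 1-Lipschitz, so
# E_{P_0}[f] - E_{P_theta}[f] cannot exceed W(P_0, P_theta).
sign = math.copysign(1.0, theta)

def f(point):
    return -sign * point[0]

dual_value = (sum(f(a) for a in points_0) - sum(f(b) for b in points_theta)) / len(z)

print(f"coupling cost (upper bound) = {coupling_cost:.4f}")
print(f"dual value    (lower bound) = {dual_value:.4f}")
```

Both bounds coincide with $|\theta|$, so the distance varies continuously with $\theta$, unlike the KL, JS, and TV values above.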
Instability of GAN I
Original objective function:
$$L(D, g_\theta) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))].$$
The optimal discriminator is
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)},$$
and
$$L(D^*, g_\theta) = 2\,JS(P_r, P_g) - 2 \log 2.$$
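The identity for $L(D^*, g_\theta)$ can be checked numerically on a small discrete example. This is only a sanity check under the assumption of finite supports; the probability vectors `p_r` and `p_g` are arbitrary illustrative values.

```python
import math

# Hypothetical discrete distributions over three outcomes.
p_r = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js(p, q):
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_objective(d):
    """L(D, g) = E_{P_r}[log D(x)] + E_{P_g}[log(1 - D(x))] for a tabular D."""
    return (sum(r * math.log(di) for r, di in zip(p_r, d))
            + sum(g * math.log(1.0 - di) for g, di in zip(p_g, d)))

d_star = [r / (r + g) for r, g in zip(p_r, p_g)]   # optimal discriminator
loss_at_optimum = gan_objective(d_star)
closed_form = 2.0 * js(p_r, p_g) - 2.0 * math.log(2.0)
loss_uninformed = gan_objective([0.5, 0.5, 0.5])   # a discriminator that always says 1/2

print(f"L(D*, g)        = {loss_at_optimum:.6f}")
print(f"2 JS - 2 log 2  = {closed_form:.6f}")
print(f"L(D = 1/2, g)   = {loss_uninformed:.6f}")
```

The constant discriminator $D = 1/2$ yields $-2 \log 2$, which is never larger than the value at $D^*$ because JS is non-negative.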
Instability of GAN II
Theorem 1
Let $P_r$ and $P_g$ be two distributions that have support contained in two closed manifolds $\mathcal{M}$ and $\mathcal{P}$ that don't perfectly align and don't have full dimension. We further assume that $P_r$ and $P_g$ are continuous in their respective manifolds, meaning that if there is a set $A$ with measure 0 in $\mathcal{M}$, then $P_r(A) = 0$ (and analogously for $P_g$). Then, there exists an optimal discriminator $D^* : \mathcal{X} \to [0, 1]$ that has accuracy 1, and for almost any $x$ in $\mathcal{M} \cup \mathcal{P}$, $D^*$ is smooth in a neighbourhood of $x$ and $\nabla_x D^*(x) = 0$.
Instability of GAN III
Theorem 2
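(Vanishing gradients on the generator) Let $g_\theta : \mathcal{Z} \to \mathcal{X}$ be a differentiable function that induces a distribution $P_g$. If some conditions are satisfied, $\|D - D^*\| < \epsilon$, and $\mathbb{E}_{z \sim p(z)}[\|J_\theta g_\theta(z)\|_2^2] \le M^2$, then
$$\left\|\nabla_\theta \mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]\right\|_2 < M \frac{\epsilon}{1 - \epsilon}.$$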
Instability of GAN IV
Instability of GAN V
The $-\log D$ trick I
For the generator, instead of minimizing $\mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]$, minimize $\mathbb{E}_{z \sim p(z)}[-\log D(g_\theta(z))]$. This does not change the fixed points.
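To see why the alternative loss is attractive in practice, it helps to compare the two gradients when the discriminator confidently rejects a generated sample. The sketch below assumes, purely for illustration, that the discriminator output is $D = \sigma(a)$ for a scalar logit $a$, and looks at derivatives with respect to $a$ (the gradient with respect to $\theta$ passes through this factor by the chain rule).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da [-log sigmoid(a)] = sigmoid(a) - 1."""
    return sigmoid(a) - 1.0

def finite_difference(fn, a, h=1e-6):
    return (fn(a + h) - fn(a - h)) / (2.0 * h)

# Numerical check of the two closed-form derivatives at one point.
fd_saturating = finite_difference(lambda a: math.log(1.0 - sigmoid(a)), -2.0)
fd_non_saturating = finite_difference(lambda a: -math.log(sigmoid(a)), -2.0)

# A very negative logit means the discriminator is sure the sample is fake.
for logit in (0.0, -2.0, -5.0, -10.0):
    print(f"logit={logit:6.1f}  D={sigmoid(logit):.5f}  "
          f"saturating={grad_saturating(logit):+.5f}  "
          f"non-saturating={grad_non_saturating(logit):+.5f}")
```

As $D \to 0$ on generated samples, the original loss gives a gradient that shrinks to zero, whereas the $-\log D$ loss keeps a gradient of magnitude close to 1.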
Theorem 3
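Let $D^* = \frac{P_r}{P_r + P_g}$ be the optimal discriminator for a fixed $\theta = \theta_0$. Then
$$\mathbb{E}_{z \sim p(z)}\left[-\nabla_\theta \log D^*(g_\theta(z))\big|_{\theta = \theta_0}\right] = \nabla_\theta\left[KL(P_{g_\theta} \| P_r) - 2\,JS(P_{g_\theta}, P_r)\right]\big|_{\theta = \theta_0}.$$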
The $-\log D$ trick II
Theorem 4
(Under some conditions) $\mathbb{E}_{z \sim p(z)}[-\nabla_\theta \log D(g_\theta(z))]$ follows a centered Cauchy distribution with infinite expectation and variance.
Why should we use Wasserstein distance I
Theorem 5
Let $P_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable over another space $\mathcal{Z}$. Let $g : \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, which will be denoted $g_\theta(z)$. Let $P_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(P_r, P_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(P_r, P_\theta)$ is continuous everywhere and differentiable almost everywhere.
3. Statements 1 and 2 are false for the Jensen-Shannon and KL divergences.
If we choose $g_\theta$ to be any feedforward neural network parametrized by $\theta$, and $p(z)$ to be a prior with $\mathbb{E}_{z \sim p(z)}[\|z\|] < \infty$, then regularity assumption 1 is satisfied.
Why should we use Wasserstein distance II
Theorem 6
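Let $P$ be a distribution on a compact space $\mathcal{X}$ and $(P_n)_{n \in \mathbb{N}}$ be a sequence of distributions on $\mathcal{X}$. Then, as $n \to \infty$,
1. $\delta(P_n, P) \to 0 \iff JS(P_n, P) \to 0$.
2. $W(P_n, P) \to 0 \iff P_n \xrightarrow{\mathcal{D}} P$, where $\xrightarrow{\mathcal{D}}$ denotes convergence in distribution.
3. $KL(P_n \| P) \to 0$ or $KL(P \| P_n) \to 0$ implies the statements in 1.
4. The statements in 1 imply the statements in 2.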
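Two simple sequences on the real line illustrate the last item, and also that its converse can fail. The sketch assumes discrete distributions and the one-dimensional CDF formula for $W$; the sequences (a point mass at $1/n$, and a mixture that moves mass $1/n$ from 0 to 1) are examples chosen here, not taken from the slides.

```python
def total_variation(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def wasserstein1_1d(p, q):
    """W1 on the real line via the integral of |F_p - F_q|."""
    points = sorted(set(p) | set(q))
    cdf_gap, total = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total

target = {0.0: 1.0}  # P: a point mass at 0

def moving_point_mass(n):
    return {1.0 / n: 1.0}                       # point mass at 1/n

def vanishing_mixture(n):
    return {0.0: 1.0 - 1.0 / n, 1.0: 1.0 / n}   # mass 1/n placed at 1

table = []
for n in (1, 10, 100, 1000):
    a, b = moving_point_mass(n), vanishing_mixture(n)
    table.append((n,
                  total_variation(a, target), wasserstein1_1d(a, target),
                  total_variation(b, target), wasserstein1_1d(b, target)))

for n, tv_a, w_a, tv_b, w_b in table:
    print(f"n={n:5d}  point mass: TV={tv_a:.3f} W={w_a:.4f}   "
          f"mixture: TV={tv_b:.4f} W={w_b:.4f}")
```

For the mixture, TV and $W$ both go to zero, consistent with item 4. For the moving point mass, $W$ goes to zero while TV stays at 1, so convergence in $W$ does not imply convergence in TV.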
Why should we use Wasserstein distance III
Approximating the Earth-Mover’s distance
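By the Kantorovich-Rubinstein duality [1],
$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)],$$
where the supremum is over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$. 1-Lipschitz can be replaced by $K$-Lipschitz, in which case the supremum equals $K \cdot W(P_r, P_\theta)$.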
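In one dimension the duality can be verified by hand, because an optimal $f$ has slope $\pm 1$ with sign opposite to $F_{P_r} - F_{P_\theta}$. The sketch below, which assumes discrete distributions on a sorted grid and uses made-up values for `points`, `p_r`, and `p_theta`, builds such an $f$ and compares the dual value with the primal one.

```python
def cdf_gaps(p, q):
    """Running values of F_p - F_q after each grid point except the last."""
    gaps, running = [], 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        running += pi - qi
        gaps.append(running)
    return gaps

def primal_w1(points, p, q):
    """Primal value: integral of |F_p - F_q| over the grid."""
    widths = [b - a for a, b in zip(points, points[1:])]
    return sum(abs(g) * w for g, w in zip(cdf_gaps(p, q), widths))

def optimal_critic(points, p, q):
    """Piecewise-linear 1-Lipschitz f with slope -sign(F_p - F_q) on each interval."""
    values = [0.0]
    for gap, left, right in zip(cdf_gaps(p, q), points, points[1:]):
        slope = (gap < 0) - (gap > 0)
        values.append(values[-1] + slope * (right - left))
    return values

def dual_value(f_values, p, q):
    """E_p[f] - E_q[f] for f given by its values on the grid."""
    return sum(fi * (pi - qi) for fi, pi, qi in zip(f_values, p, q))

# Hypothetical grid and distributions.
points = [0.0, 1.0, 2.5, 4.0]
p_r = [0.1, 0.4, 0.3, 0.2]
p_theta = [0.4, 0.1, 0.1, 0.4]

primal = primal_w1(points, p_r, p_theta)
dual = dual_value(optimal_critic(points, p_r, p_theta), p_r, p_theta)
linear = dual_value(points, p_r, p_theta)  # f(x) = x is 1-Lipschitz but not optimal

print(f"primal W          = {primal:.4f}")
print(f"dual, optimal f   = {dual:.4f}")
print(f"dual, f(x) = x    = {linear:.4f}")
```

The optimal critic attains the primal value, while the linear critic $f(x) = x$ only yields a lower bound (here close to zero, since the two means coincide).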
Theorem 7
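Let $P_r$ be any distribution, and let $P_\theta$ be the distribution of $g_\theta(Z)$ with $g_\theta$ satisfying assumption 1. Then, there exists a solution $f : \mathcal{X} \to \mathbb{R}$ to the problem
$$\max_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)]$$
and we have
$$\nabla_\theta W(P_r, P_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f(g_\theta(z))],$$
when both terms are well-defined.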
WGAN algorithm
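As a rough illustration of how the dual form and the gradient formula combine into a training loop, here is a deliberately tiny one-dimensional sketch. Everything in it is an assumption made for the example: the data distribution $N(2, 1)$, the generator $g_\theta(z) = z + \theta$, the linear critic $f_w(x) = w x$, the step sizes, and the use of plain gradient steps (the WGAN paper uses RMSProp). Clipping $w$ to $[-c, c]$ keeps the critic $c$-Lipschitz.

```python
import random

random.seed(0)

REAL_MEAN = 2.0    # hypothetical data distribution: N(2, 1)
CLIP = 0.01        # weight clipping threshold c
N_CRITIC = 5       # critic updates per generator update
BATCH = 64
LR_CRITIC = 0.05   # illustrative step sizes, not tuned
LR_GEN = 1.0

def mean(values):
    return sum(values) / len(values)

def sample_real(n):
    return [random.gauss(REAL_MEAN, 1.0) for _ in range(n)]

def sample_fake(theta, n):
    # Toy generator: g_theta(z) = z + theta with z ~ N(0, 1).
    return [random.gauss(0.0, 1.0) + theta for _ in range(n)]

w = 0.0       # critic parameter, f_w(x) = w * x
theta = 0.0   # generator parameter

for step in range(500):
    for _ in range(N_CRITIC):
        # Ascend E[f_w(x_real)] - E[f_w(g_theta(z))]; for a linear critic the
        # gradient in w is the difference of the batch means.
        grad_w = mean(sample_real(BATCH)) - mean(sample_fake(theta, BATCH))
        w = max(-CLIP, min(CLIP, w + LR_CRITIC * grad_w))
    # Generator step: descend -E[f_w(g_theta(z))]. Here d/dtheta f_w(z + theta) = w,
    # so the gradient of the loss is -w and the update adds LR_GEN * w.
    theta += LR_GEN * w
    if step % 100 == 0:
        print(f"step={step:3d}  w={w:+.4f}  theta={theta:.3f}")

print(f"final theta = {theta:.3f} (real mean = {REAL_MEAN})")
```

In this toy setting the critic saturates at $\pm c$ with the sign of the mean gap, so $\theta$ drifts toward the real mean and then hovers around it.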
Experiments I
Experiments II
[1] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.