An Information Theoretic Exploration of the Generative Adversarial Network

Information Theory (Winter 2020)

Andrew Miller-Smith (aams)

Introduction

For this project I would like to offer an information theoretic interpretation of the Generative Adversarial Network (GAN). A popular type of generative model, the GAN leverages two separate networks paired in a minimax game, one trying to generate realistic images and the other attempting to discern between fake and real samples. This paradigm has propelled the field of generative models to new heights, allowing networks to produce higher-fidelity samples at previously unattainable image sizes.

While this minimax description is accurate, many introductory materials on the GAN omit the information theoretic motivations for this model and the underpinnings of its mechanisms. I explore these topics and their importance below.

Problem motivation – why do we use GANs?

Many machine learning tasks offer simple objective functions: classifiers optimize a cross-entropy surrogate for accuracy, while detection models add bounding-box regression terms. For generative models, this proves less straightforward, as any metric must quantify the difference between two probability distributions. Most historical models have employed likelihood estimation, and many popular systems, such as variational autoencoders and normalizing flow models, still use this measure.

While these models can produce impressive results, high likelihood does not necessarily lead to sample quality. For example, a model that memorizes training data might produce samples with excellent likelihood, but this clearly misses the mark with respect to learning the true distribution.

This problem motivates the quest for a likelihood-free method of learning, one that can minimize the distance between distributions via some other mechanism. These requirements open the door to information theoretic measures of probability.

Generative Adversarial Networks

In this section I will discuss three variants of the GAN–the classical model, the InfoGAN, and the Wasserstein GAN–and explore their information theoretic underpinnings.

1) Classical GAN

1.a) Architecture

First proposed by [4], the GAN attempts to find a likelihood-free measure between probability distributions P_{Data} and P_G , where the former is the true distribution and the latter is some learned distribution. The basic concept is that of a statistical test, where H_0: P_{Data} = P_G and H_A: P_{Data} \neq P_G . A generator network attempts to learn the true distribution by mapping latent Gaussian noise z to samples, while a discriminator network attempts to learn the measure that will maximize the power of the test, i.e. the ability to reject the null when it is false. This results in the following objective function

\begin{aligned} \min_G \max_D E_{x \sim P_{Data}}(\log(D(x))) + E_{z \sim P_z} \log(1 - D(G(z))) \\\ \end{aligned}

where D(x) is the discriminator’s predicted probability as to whether the sample comes from the true distribution. Note that the generator’s objective function \min_G E_{z \sim P_z} \log(1 - D(G(z))) frequently proves susceptible to vanishing gradients, and in practice researchers flip the objective to \max_G E_{z \sim P_z} \log(D(G(z))) to avoid this.
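To make the training loop concrete, the following is a minimal sketch of one update step under the flipped, non-saturating objective, assuming a PyTorch-style API. The networks G and D, their optimizers, and the latent dimension are hypothetical placeholders, and D is assumed to output a probability of shape (batch, 1) via a final sigmoid.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, latent_dim=100):
    """One classical GAN update with the non-saturating generator loss."""
    batch = real.size(0)

    # Discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))], written as
    # minimizing the negated objective via binary cross-entropy.
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                                  # block gradients into G
    d_loss = F.binary_cross_entropy(D(real), torch.ones(batch, 1)) \
           + F.binary_cross_entropy(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: the flipped objective max_G E[log D(G(z))], which keeps
    # gradients alive when the discriminator confidently rejects samples.
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```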

1.b) Examining the minimax objective

Closer analysis of this minimax objective reveals hidden information theoretic underpinnings. Before we start, let us define the Jensen-Shannon divergence as follows.

\begin{aligned} JSD (P, Q) &= \frac{1}{2}D_{KL}\left(P \ \middle|\middle| \ \frac{P + Q}{2} \right) + \frac{1}{2}D_{KL}\left(Q \ \middle|\middle| \ \frac{P + Q}{2} \right) \\\ \end{aligned}
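As a quick sanity check on this definition, the divergence can be computed directly for discrete distributions; the following NumPy sketch uses arbitrary illustrative vectors.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions, in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.1, 0.1, 0.8])
print(jsd(p, q))   # symmetric and bounded above by log 2 ≈ 0.693 nats
print(jsd(p, p))   # 0.0 for identical distributions
```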

Now for our analysis. Let us first note that for the discriminator’s objective function

\begin{aligned} \max_D E_{x \sim P_{Data}}(\log(D(x))) + E_{z \sim P_z} \log(1 - D(G(z))) \\\ \end{aligned}

the optimal discriminator D^* over the output space [0, 1] is given by

\begin{aligned} D^*(x) &= \frac{P_{Data}(x)}{P_{Data}(x) + P_G(x)} \\\ \end{aligned}

This allows us to reformulate the generator’s objective function as follows.

\begin{aligned} L_G(z) &= \min_G \max_D E_{x \sim P_{Data}}(\log(D(x))) + E_{z \sim P_z} \log(1 - D(G(z))) \\\ &= \min_G E_{x \sim P_{Data}}\left(\log\left(\frac{P_{Data}\left(x\right)}{P_{Data}\left(x\right) + P_G\left(x\right)}\right)\right) + E_{z \sim P_z} \log\left(1 - \frac{P_{Data}\left(G\left(z\right)\right)}{P_{Data}\left(G\left(z\right)\right) + P_G\left(G\left(z\right)\right)}\right) \\\ &= \min_G E_{x \sim P_{Data}}\left(\log\left(\frac{P_{Data}\left(x\right)}{P_{Data}\left(x\right) + P_G\left(x\right)}\right)\right) \\\ & \ \ \ \ + E_{z \sim P_z} \log\left(\frac{P_{Data}\left(G\left(z\right)\right) + P_G\left(G\left(z\right)\right) - P_{Data}\left(G\left(z\right)\right)}{P_{Data}\left(G\left(z\right)\right) + P_G\left(G\left(z\right)\right)}\right) \\\ &= \min_G E_{x \sim P_{Data}}\left(\log\left(\frac{P_{Data}\left(x\right)}{P_{Data}\left(x\right) + P_G\left(x\right)}\right)\right) + E_{z \sim P_z} \log\left(\frac{P_G\left(G\left(z\right)\right)}{P_{Data}\left(G\left(z\right)\right) + P_G\left(G\left(z\right)\right)}\right) \\\ \end{aligned}

Now say the generator has perfectly matched P_{Data} . From our equations before, this would force the optimal discriminator D^* to predict 1/2 for any example. This would lead to the discriminator’s loss becoming

\begin{aligned} \max_D L(D) &= \max_D E_{x \sim P_{Data}}(\log(D(x))) + E_{z \sim P_z} \log(1 - D(G(z))) \\\ &= E_{x \sim P_{Data}}(\log(D^*(x))) + E_{z \sim P_z} \log(1 - D^*(G(z))) \\\ &= E_{x \sim P_{Data}}\left(\log\left(\frac{1}{2}\right)\right) + E_{z \sim P_z} \left(\log\left(\frac{1}{2}\right)\right) \\\ &= -\log(4) \\\ \end{aligned}

Adding and subtracting this quantity from the reformulated generator loss, and noting that for z \sim P_z the sample G(z) is distributed according to P_G , we see

\begin{aligned} L_G(z) &= \min_G E_{x \sim P_{Data}}\left(\log\left(\frac{P_{Data}(x)}{P_{Data}(x) + P_G(x)}\right)\right) + E_{x \sim P_G} \log\left(\frac{P_G(x)}{P_{Data}(x) + P_G(x)}\right) + \log(4) - \log(4) \\ &= \min_G E_{x \sim P_{Data}}\left(\log\left(\frac{P_{Data}(x)}{\frac{P_{Data}(x) + P_G(x)}{2}}\right)\right) + E_{x \sim P_G} \log\left(\frac{P_G(x)}{\frac{P_{Data}(x) + P_G(x)}{2}}\right) - \log(4) \\ &= \min_G D_{KL}\left(P_{Data} \ \middle|\middle| \ \frac{P_{Data} + P_G}{2} \right) + D_{KL}\left(P_{G} \ \middle|\middle| \ \frac{P_{Data} + P_G}{2} \right) - \log(4) \\ &= \min_G 2 JSD\left(P_{Data}, P_G\right) - \log(4) \\ \end{aligned}

Classical GAN results (left) compared with nearest real samples (right). Image courtesy of [4].

Originally detailed in [4], this proof connects the seemingly disjoint notion of the minimax paradigm to an explicit information theoretic measure. In trying to fool the discriminator, the generator thus tries to minimize the Jensen-Shannon divergence between its distribution and that of the true data. This provides a new lens through which to view the model and makes clear the theoretical justification as to why the architecture works. It also opens the door to additional modifications to the architecture via information theoretic adjustments.
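The identity can also be checked numerically: for any discrete distributions, plugging the optimal discriminator into the value function reproduces 2 JSD - \log(4) . A small self-contained sketch with arbitrary example distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])       # stand-in for P_Data
q = np.array([0.2, 0.2, 0.6])       # stand-in for P_G
d_star = p / (p + q)                # optimal discriminator on this support
m = (p + q) / 2                     # mixture distribution

value = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star))
jsd = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))
print(value)                 # ≈ -1.205
print(2 * jsd - np.log(4))   # identical, confirming the identity
```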

2) InfoGAN

2.a) Addition of the mutual information

Despite its sampling prowess, the GAN's reliance on Gaussian noise for generation renders any interpretation of the latent variables difficult. For example, a GAN trained to generate faces might produce a wide range of hair styles, yet mapping the styles back to their latent input can prove intractable if not impossible. The conditional GAN, first proposed by [8] in 2014, attempts to adjust the standard architecture to incorporate data labels corresponding to such features. While this paradigm offers impressive control over output, it necessitates labeled data and cannot learn latent features in an unsupervised fashion.

The InfoGAN attempts to enable latent feature disentanglement through the introduction of an additional information theoretic construct to the objective function. First published in 2016 by [3], the model divides the latent input z_{original} into two components (z, c) (the authors reuse z to denote that this input functions as before). In order to ascribe semantic meaning to c , the authors amend the classical GAN's minimax equation with a new, weighted term encouraging the model to maximize the mutual information I(c; G(z, c)) .

\begin{aligned} \min_G \max_D E_{x \sim P_{Data}}(\log(D(x))) + E_{z \sim P_z, \ c \sim P(c)} \log(1 - D(G(z, c))) - \lambda I(c; G(z, c)) \\ \end{aligned}

The mutual information functions as a measure of the shared information between two random variables. Formally the KL divergence between the joint distribution and the product of the marginal distributions, it also equals the difference between the entropy of one variable and the entropy of that variable conditioned on the other.

\begin{aligned} I(X; Y) &= D_{KL}\left(P(X, Y) \ \middle|\middle| \ P(X)P(Y) \right) \\\ &= H(X) - H(X | Y) \\\ &= H(Y) - H(Y | X) \\\ \end{aligned}
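For intuition, the first identity can be evaluated directly on a discrete joint distribution; the example tables below are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) = D_KL(P(X, Y) || P(X) P(Y)) for a discrete joint table, in nats."""
    px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log((joint / (px * py))[mask])))

# Perfectly correlated bits: I(X; Y) = H(X) = log 2 ≈ 0.693 nats.
print(mutual_information(np.array([[0.5, 0.0],
                                   [0.0, 0.5]])))

# Independent bits: the joint factorizes, so I(X; Y) = 0.
print(mutual_information(np.full((2, 2), 0.25)))
```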

The authors note that in order to compute this value directly, they would need access to the posterior P(C | X) , which they do not have. They instead approximate a lower bound for this value via a learned distribution Q(C | X) , which they derive as follows.

\begin{aligned} I(c; G(c, z)) &= H(c) - H(c | G(c, z)) \\ &= H(c) + E_{x \sim G(c, z)}\left(E_{c' \sim P(c | x)} \log(P(c' | x)) \right) \\ &= H(c) + E_{x \sim G(c, z)}\left(D_{KL}\left(P(\cdot | x) \ \middle|\middle| \ Q(\cdot | x) \right) + E_{c' \sim P(c | x)} \log(Q(c' | x)) \right) \\ &\geq H(c) + E_{x \sim G(c, z)}\left(E_{c' \sim P(c | x)} \log(Q(c' | x)) \right) \\ \end{aligned}

The last step comes from the fact that the KL divergence is greater than or equal to zero. By treating H(c) as a constant and citing a lemma that allows them to avoid sampling from the posterior, they arrive at the following lower bound for the mutual information.

\begin{aligned} L_I(G, Q) &= E_{c \sim P(c), \ x \sim G(c, z)}\left(\log Q(c | x) \right) + H(c) \\ &\leq I(c; G(c, z)) \\ \end{aligned}

This term replaces the mutual information in the objective function to give

\begin{aligned} \min_G \max_D E_{x \sim P_{Data}}(\log(D(x))) + E_{z \sim P_z} \log(1 - D(G(z))) - \lambda L_I(G, Q)\\\ \end{aligned}

where the last term is approximated via Monte Carlo sampling.
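A minimal sketch of this Monte Carlo estimate for a single categorical code, assuming a PyTorch-style API: G and Q are hypothetical networks (in the paper, Q shares most layers with the discriminator), and the dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def info_loss(G, Q, batch=64, latent_dim=62, n_cats=10):
    """Monte Carlo estimate of -E[log Q(c | G(z, c))] for a categorical code."""
    z = torch.randn(batch, latent_dim)           # incompressible noise z
    c = torch.randint(0, n_cats, (batch,))       # c drawn from a uniform prior P(c)
    c_onehot = F.one_hot(c, n_cats).float()
    x = G(torch.cat([z, c_onehot], dim=1))       # x ~ G(z, c)
    return F.cross_entropy(Q(x), c)              # -log Q(c | x), averaged over batch
```

Since H(c) is constant, adding this cross-entropy (scaled by \lambda ) to the losses of both the generator and Q maximizes the lower bound L_I .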

2.b) Disentanglement learning through the updated objective

While ostensibly similar, the InfoGAN’s c and the conditional GAN’s class labels differ in that the former requires no explicit labeling. The model instead requires only a prior P(c) from which it draws samples before feeding them to the generator. This allows for flexible and unsupervised learning of latent traits.

InfoGAN MNIST results. Image courtesy of [3].

In the original paper, the authors apply their model to the MNIST data set with three latent codes (c_1, c_2, c_3) , drawing c_1 uniformly over the discrete set {0, \dots, 9} and c_2 and c_3 from U(-1, 1) . The network assigns clear meaning to these covariates, mapping the first to distinct digits and the second and third to digit rotation and width. Remarkably, the model handles never-before-seen values of c_2 and c_3 in stride, generating appropriately rotated or wide digits at test time when given inputs over U(-2, 2) .

InfoGAN CelebA results. Image courtesy of [3].

The network produces similar results when trained on CelebA. Various c_i correspond to latent features such as pose, hair style, emotion, and the presence or absence of glasses.

These results highlight the power of incorporating additional information theoretic constructs into GANs. Through the inclusion of the mutual information into the objective function, the InfoGAN proves adept at learning previously uninterpretable latent features.

3) Wasserstein GAN

3.a) Motivation

Despite its prowess, the classical GAN often proves difficult to train. This notoriety stems in part from the model’s remarkable sensitivity to hyperparameters, network structure, and even random seeds, which can require extensive calibration to surmount. Even so, the most common challenge when training a classical GAN comes from the phenomenon known as mode collapse, in which the generator produces only a small set of samples regardless of latent input. Models can collapse with little to no warning, and attempts to resuscitate them most often fail.

While the theoretical causes of mode collapse remain an open area of research, [1] suggest the problem might stem from the information theoretic mechanisms underpinning the minimax objective function. The authors examine the Jensen-Shannon divergence from a mathematical-analysis perspective, treating the iterative optimization as a sequence of probability distributions in a metric space.

As in the paper, let \chi be a compact metric space, let \Sigma represent the set of Borel subsets of \chi , and let \mu be a measure on this space. In this context, the KL divergence can be defined as

\begin{aligned} D_{KL}\left(P_a \ \middle|\middle| \ P_b \right) &= \int P_a(x) \log\left(\frac{P_a(x)}{P_b(x)}\right) d\mu(x) \\\ \end{aligned}

where both distributions must be absolutely continuous with respect to \mu . In examining the divergences in this context, the authors demonstrate that there exist simple learning problems that cannot be solved through these measures. To make this concrete, consider the example offered in the paper.

Consider Z \sim U[0, 1] and let P_0 be the distribution of (0, Z) \in \mathbb{R}^2 , i.e. a vertical line at the origin whose second coordinate is sampled from Z . Let P_\theta be the learned distribution of (\theta, Z) . The KL divergence between these two distributions will be

\begin{aligned} D_{KL}\left(P_0 \ \middle|\middle| \ P_\theta \right) &= \begin{cases} 0 & \text{if } \theta = 0 \\ \infty & \text{if } \theta \neq 0 \end{cases} \end{aligned}

(This is because the KL divergence is defined to be \infty whenever some event has zero probability under the second distribution but positive probability under the first, as opposed to being undefined).

Similarly, we have the Jensen-Shannon divergence as

\begin{aligned} \text{JSD}(P_0, P_\theta) &= \begin{cases} 0 & \text{if } \theta = 0 \\ \log(2) & \text{if } \theta \neq 0 \end{cases} \end{aligned}

This means that even in our simple example, neither of the above divergences gives the generator meaningful gradient information. More specifically, the sequence (P_{\theta_t})_{t \in \mathbb{N}} cannot converge to P_0 under these divergences unless \theta_t becomes exactly zero.

Given these problems with our two divergences, the authors propose a different measure for comparing probability distributions: the Wasserstein (Earth Mover) distance. This construct quantifies the cost of transforming one probability distribution into another by "moving" probability mass between them. In the metric space defined above, the authors define the metric as follows.

\begin{aligned} W(P_a, P_b) &= \inf_{\gamma \in \Pi(P_a, P_b)} E_{(x, y) \sim \gamma} \|x - y\| \\\ \end{aligned}

Here we define \Pi(P_a, P_b) as the set of all joint distributions whose marginals equal P_a and P_b respectively. For our example above, we see that

\begin{aligned} W(P_0, P_\theta) &= |\theta| \\\ \end{aligned}

regardless of whether \theta = 0 . This means that unlike the KL and Jensen-Shannon divergences, the Wasserstein distance is continuous everywhere and differentiable almost everywhere, provides useful gradient information, and allows the sequence (P_{\theta_t})_{t \in \mathbb{N}} to converge to the true distribution P_0 .
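This difference is easy to see numerically. The sketch below discretizes the example above: since the coupling can match the shared coordinate Z exactly, the comparison reduces to a point mass at 0 versus a point mass at \theta (SciPy's wasserstein_distance and jensenshannon are used; the grid and variable names are illustrative).

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

grid = np.round(np.arange(0.0, 1.01, 0.01), 2)   # support {0.00, 0.01, ..., 1.00}
p = (grid == 0.0).astype(float)                  # P_0: all mass at 0

for theta in [0.0, 0.01, 0.1, 1.0]:
    q = (grid == theta).astype(float)            # P_theta: all mass at theta
    w = wasserstein_distance(grid, grid, u_weights=p, v_weights=q)
    jsd = jensenshannon(p, q, base=np.e) ** 2    # squared JS distance = JSD in nats
    print(f"theta={theta:5.2f}   W={w:.2f}   JSD={jsd:.3f}")
# W tracks |theta| smoothly; JSD is 0 at theta = 0 and log 2 ≈ 0.693 otherwise.
```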

While promising, replacing the Jensen-Shannon divergence with the Wasserstein distance introduces an intractable computation in the form of the infimum over \Pi(P_a, P_b) . In order to circumvent this calculation, the authors turn to the Kantorovich-Rubinstein duality, which allows them to express the metric as follows

\begin{aligned} W(P_a, P_b) &= \sup_{\|f\|_L \leq 1} E_{x \sim P_a} f(x) - E_{x \sim P_b} f(x) \\ \end{aligned}

where f ranges over all 1-Lipschitz functions from \chi to \mathbb{R} . (In this context, a function g is K -Lipschitz if |g(x) - g(y)| \leq K\|x - y\| for all x, y ; for differentiable g this is equivalent to \|\nabla g\| \leq K ). To enforce this constraint, the authors turn to weight clipping: clamping the network's weights to a small compact set forces the critic to be K -Lipschitz for some K , which only rescales the estimated distance.

The authors thus propose the adjusted objective functions

\begin{aligned} L_D &= \max_D E_{x \sim P_{Data}} D(x) - E_{z \sim P_z} D(G(z)) \\ L_G &= \max_G E_{z \sim P_z} D(G(z)) \\ \end{aligned}

where the discriminator (often called the critic in this setting) predicts a real-valued score for each input instead of a probability. As in the classical GAN, these deceptively simple objectives stem from key information theoretic underpinnings that allow the model to quantify the discrepancies between distributions.
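A minimal sketch of one WGAN update with weight clipping, assuming a PyTorch-style API; G, D, the optimizers, and the clipping constant are placeholders (the paper uses RMSProp and several critic updates per generator update, omitted here for brevity).

```python
import torch

def wgan_step(G, D, opt_G, opt_D, real, latent_dim=100, clip=0.01):
    """One critic + one generator update for the weight-clipped WGAN."""
    batch = real.size(0)

    # Critic: maximize E[D(x)] - E[D(G(z))], written as minimizing the negation.
    z = torch.randn(batch, latent_dim)
    d_loss = -(D(real).mean() - D(G(z).detach()).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Weight clipping keeps the critic's parameters in a compact set,
    # crudely enforcing a K-Lipschitz constraint.
    with torch.no_grad():
        for w in D.parameters():
            w.clamp_(-clip, clip)

    # Generator: maximize E[D(G(z))], written as minimizing its negation.
    z = torch.randn(batch, latent_dim)
    g_loss = -D(G(z)).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```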

For more information and more detailed proofs, please see the original paper [1].

3.b) Wasserstein GANs in practice

The Wasserstein GAN has largely replaced the classical GAN as a standard generative model thanks to its improved training stability and resistance to mode collapse. In practice, researchers often swap the original paper's weight clipping for the gradient penalty from [5], which adds a term to the objective penalizing deviations of the critic's gradient norm from 1. This serves the same Lipschitz-enforcing purpose as weight clipping while allowing for more robust results.
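A hedged sketch of that penalty term, assuming a PyTorch-style API and image-shaped inputs (N, C, H, W); the coefficient lam = 10 follows [5], while the function name is illustrative.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """Penalize deviations of the critic's gradient norm from 1 on interpolates."""
    eps = torch.rand(real.size(0), 1, 1, 1)                 # per-sample mixing weight
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)   # per-sample gradient norm
    return lam * ((grad_norm - 1) ** 2).mean()
```

The returned term is simply added to the critic's loss in place of weight clipping.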

WGAN results with gradient penalty. Image courtesy of [5].

Under both visual evaluation and quantitative metrics for generative models, the Wasserstein GAN produces high-quality samples that more closely resemble the distributions it tries to learn. This loss structure underlies many state-of-the-art generative architectures.

4) Additional reading

As generated images have approached life-like fidelity, recent work has turned again toward disentangling latent features from the noise vector z . Many state-of-the-art models omit the InfoGAN’s explicit use of mutual information and instead feature alternative methods of disentanglement, though future papers could incorporate more direct approaches ([6],[7]). Other prominent architectures have focused on labeled data as in the conditional GAN ([2]).

5) Acknowledgments

Thank you to Professor Weissman and the EE 276 staff for a wonderful quarter. I have enjoyed illuminating the information theoretic underpinnings of GANs and look forward to future research on the subject.

6) References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.
[2] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis, 2018.
[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
[4] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks, 2014.
[5] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs, 2017.
[6] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks, 2018.
[7] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN, 2019.
[8] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.