An (Understandable) Information Theoretic Interpretation of the Variational Autoencoder

EE376A (Winter 2019)

Author: Sam Schwager

Intro to the VAE

The VAE was first proposed in 2013 by Kingma and Welling in the seminal paper “Auto-Encoding Variational Bayes”. The promise of the VAE was a scalable framework for learning directed, continuous latent variable models for large data sets. In plain English, the “latent variable” part means that we are assuming the data we have observed (e.g. home prices) depend on some unobserved (i.e. latent) variable (e.g. distance from the nearest urban center). Furthermore, the scalability concerns kick in when we assume that the latent variable is continuous, such as distance in our home prices example.

However, before delving into the details of the VAE, let’s build some intuition as to how one would go about learning such a model, and what it means to learn a model in the first place. Going back to our home prices example, we are assuming that we have observed a bunch of home prices but for some reason don’t know the corresponding distance to the nearest urban center for each home. Suppose we would like to learn a joint probability distribution p(X, Z) over home prices (X) and distance to the nearest urban center (Z). Maximum likelihood estimation may seem like a good place to start, so we begin by thinking about how we could learn p(X, Z) to maximize the log-likelihood of the observed home prices. Specifically, assuming we have M data points, we are interested in maximizing the following quantity:

l(\theta) = \sum_{i=1}^M \log p_{\theta}(x_i) = \sum_{i=1}^M \log \int_{z_i} p_{\theta}(x_i, z_i) \, dz_i

Note that above the x variables correspond to observed home prices, and we marginalize out the unobserved z variables to obtain the marginal distribution for home prices. At first glance, it may seem that this reduces to simply taking derivatives and setting them to 0 as in the typical MLE setting, but unfortunately the presence of the integral inside the log spells serious trouble for us. The typical solution to this problem is to apply the famous Expectation-Maximization (EM) algorithm. However, notice that if Z is high-dimensional, the integral above becomes intractable and we are unable to realistically compute the posterior distribution over the latent Z in the E-step (A much more thorough treatment of the EM algorithm can be found here: http://cs229.stanford.edu/notes/cs229-notes8.pdf).
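To make the trouble concrete, here is a tiny numerical sketch (nothing to do with the home prices example; the toy model is purely illustrative) that estimates the marginal log-likelihood by brute force, sampling the latent variable from its prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_marginal_mc(x, num_samples=100_000):
    """Naive Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x|z)] for a
    toy model with z ~ N(0, 1) and x | z ~ N(z, 0.5^2) (illustrative choices)."""
    z = rng.standard_normal(num_samples)                 # samples from the prior p(z)
    likelihood = np.exp(-0.5 * ((x - z) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
    return np.log(likelihood.mean())                     # the log sits outside the average

# For this 1-D toy model the true marginal is N(0, 1.25), so we can sanity-check:
x_obs = 1.3
print(log_marginal_mc(x_obs), -0.5 * x_obs ** 2 / 1.25 - 0.5 * np.log(2 * np.pi * 1.25))
# In one dimension this works fine; once z is high-dimensional, the number of prior
# samples needed to cover the space explodes, which is exactly the intractability above.
```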

Information Theory to the Rescue

At this point, realizing we can’t optimize the desired objective function, it may seem like we have no choice but to call it quits. However, as it turns out, with a little information theory we can come up with a lower bound on the log-likelihood that we can optimize very easily. Let’s quickly return to the EM algorithm. The main thing I left out before was that the EM algorithm relies on the following inequality:

l(\theta) \geq \sum_{i=1}^M \mathbb{E}_{Q_i(z_i)}\left[\log \frac{p_{\theta}(x_i, z_i)}{Q_i(z_i)}\right]
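For a single data point, this bound comes from multiplying and dividing by an arbitrary distribution Q_i(z_i) inside the integral and then applying Jensen’s Inequality to the concave log:

\log p_{\theta}(x_i) = \log \int_{z_i} Q_i(z_i) \, \frac{p_{\theta}(x_i, z_i)}{Q_i(z_i)} \, dz_i \geq \int_{z_i} Q_i(z_i) \log \frac{p_{\theta}(x_i, z_i)}{Q_i(z_i)} \, dz_i = \mathbb{E}_{Q_i(z_i)}\left[\log \frac{p_{\theta}(x_i, z_i)}{Q_i(z_i)}\right]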

Jensen’s Inequality is what allows us to move the log inside the expectation and thereby obtain a lower bound on the log-likelihood. Now at least we know that maximizing this lower bound won’t cause us to overshoot the log-likelihood, but the EM algorithm goes one step further and recognizes that when the distribution Q, with respect to which we take the expectation, is equal to the true posterior of the latent z variable, the lower bound given above is tight! Specifically, this means that:

l(\theta) = \sum_{i=1}^M \mathbb{E}_{p(z_i|x_i;\theta)}\left[\log \frac{p_{\theta}(x_i, z_i)}{p(z_i|x_i;\theta)}\right]

With this in mind, the EM algorithm repeats the following process until convergence: find the posterior of the latent z variable (the E-step) and then maximize the expected complete-data log-likelihood, where the expectation is taken with respect to the posterior found in the E-step (the M-step). Crucially, the M-step reduces to the simple fully-observed log-likelihood case that we are used to. Also, we are guaranteed that the log-likelihood never decreases, since the lower bound we are optimizing is tight when we plug in the posterior distribution that we found in the E-step!
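To make the E-step/M-step loop concrete, here is a minimal sketch for a toy model where the posterior is tractable: a two-component, one-dimensional Gaussian mixture with known unit variances (the model, data, and initial values are purely illustrative and have nothing to do with the home prices example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: a two-component mixture with known unit variances (illustrative values).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

pi, mu = 0.5, np.array([-1.0, 1.0])   # initial mixing weight and component means

for _ in range(50):
    # E-step: posterior responsibility p(z_i = 1 | x_i) under the current parameters.
    log_p1 = np.log(pi) - 0.5 * (x - mu[1]) ** 2
    log_p0 = np.log(1 - pi) - 0.5 * (x - mu[0]) ** 2
    r = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    # M-step: maximize the expected complete-data log-likelihood (closed form here).
    pi = r.mean()
    mu = np.array([((1 - r) * x).sum() / (1 - r).sum(), (r * x).sum() / r.sum()])

print(pi, mu)   # should land near 0.7 and component means near (-2, 3)
```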

This is great news, but as mentioned earlier, in many cases the E-step is intractable. Nonetheless, if we refer back to the inequality given above, we see that even if we don’t choose the true posterior distribution, we can still optimize knowing that we won’t overshoot the true log-likelihood. In fact, this inequality is so important to the VAE that we will now present it again but with a different name: the evidence lower bound (ELBO):

l(\theta) \geq \sum_{i=1}^M \mathbb{E}_{Q_i(z_i)}\left[\log \frac{p_{\theta}(x_i, z_i)}{Q_i(z_i)}\right]

Finally, after applying the chain rule to rewrite the joint distribution of X and Z as the marginal of Z times the conditional probability of X given Z, we can rewrite the ELBO as follows:

l(\theta) \geq \sum_{i=1}^M \left( \mathbb{E}_{Q_i(z_i)}[\log p_{\theta}(x_i | z_i)] - D_{KL}(Q_i(z_i) \| p(z_i)) \right)

That is, the ELBO can be thought of as the difference between how likely the observed data is under our chosen posterior over Z (we call this the reconstruction term) and the KL-divergence between our chosen posterior distribution over Z and the prior distribution over Z. This makes a lot of sense! Intuitively, if we think back to our home prices example from before, we want the observed home prices to be as likely as possible given our choice of posterior over Z, and we want our chosen posterior over Z to be as close as possible to the true marginal distribution over the latent variable Z. From our study of information theory, we know that the KL-divergence measures how different two probability distributions are, and is always nonnegative. In fact, it equals zero if and only if the two distributions are the same.
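To see this decomposition in numbers, here is a small sketch that continues the toy model from earlier (z drawn from a standard normal, x drawn from a normal centered at z with standard deviation 0.5; the choice of a Gaussian Q is just for illustration). Any choice of Q gives an ELBO below the true log-likelihood, and plugging in the true posterior makes the bound tight:

```python
import numpy as np

rng = np.random.default_rng(2)

def elbo(x, q_mean, q_std, num_samples=100_000):
    """ELBO = E_Q[log p(x|z)] - KL(Q || p(z)) for the same toy model as before
    (z ~ N(0, 1), x | z ~ N(z, 0.5^2)), with Q(z) chosen to be N(q_mean, q_std^2)."""
    z = q_mean + q_std * rng.standard_normal(num_samples)          # samples from Q
    recon = np.mean(-0.5 * ((x - z) / 0.5) ** 2) - np.log(0.5 * np.sqrt(2 * np.pi))
    kl = 0.5 * (q_std ** 2 + q_mean ** 2 - 1.0) - np.log(q_std)    # closed form for two Gaussians
    return recon - kl

x_obs = 1.3
true_log_px = -0.5 * x_obs ** 2 / 1.25 - 0.5 * np.log(2 * np.pi * 1.25)
# Any Q gives a value below log p(x); the true posterior, N(0.8 * x, 0.2) for this
# toy model, makes the bound tight (up to Monte Carlo error).
print(elbo(x_obs, 0.0, 1.0), elbo(x_obs, 0.8 * x_obs, np.sqrt(0.2)), true_log_px)
```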

Finally, we have an objective that we can maximize tractably, since it is easy for us to compute the conditional probability of X given Z using our model (recall from before that the conditional probability of Z given X was the intractable one, because by Bayes’ Rule computing it would require the marginal probability of X in the denominator). In fact, our problem has now been reduced to finding the approximate posterior Q over the latent Z given the observed X that maximizes the objective above. Variational inference is simply optimization over a set of functions (in our case, probability distributions), so the fact that we are trying to find the optimal Q is what gives the VAE the first part of its name!

Crucially, we no longer have any convergence guarantees, since the ELBO is generally not a tight lower bound on the true log-likelihood: it is tight only when Q equals the true posterior, which will typically lie outside the family we search over. Indeed, the family we choose for Q (e.g. a 10-dimensional Gaussian with identity covariance) is very important, since if the optimal distribution is not in this family, then the Q that we learn can be very far from optimal.

Crucially, we note that searching over the space of possible choices of Q is in general very challenging, especially if Q is reasonably complex. However, one of Kingma and Welling’s most clever and impactful discoveries was that learning a good choice of Q (given a parametric model family for Q) can be accomplished using deep neural networks. In fact, the VAE uses two deep neural networks: an encoder and a decoder. The training process proceeds as follows: first, the observed data (X) is passed through the encoder network, whose outputs are used to produce a latent code (Z) for the observation. Then the decoder network takes the code as input and tries to reconstruct an observation as close as possible to the original one! Thus, our previous decomposition of the ELBO into a reconstruction term and a KL-divergence term suddenly becomes much more concrete: the KL-divergence piece is intimately related to the loss of the encoder network (i.e. the network from which we learn Q), and the reconstruction loss is intimately related to the loss of the decoder network (i.e. the network from which we learn to reconstruct observations from latent codes).
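As a rough sketch of what those two networks might look like in PyTorch (the layer sizes, names, and the choice of a diagonal-Gaussian Q are illustrative assumptions, not the architecture from the original paper or from any particular course code):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """A minimal encoder/decoder pair for flattened binary images (illustrative sizes)."""
    def __init__(self, x_dim=784, z_dim=2, hidden=256):
        super().__init__()
        # Encoder: maps an observation x to the mean and log-variance of Q(z|x).
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_dim))
        # Decoder: maps a latent code z back to logits over the pixels of x.
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def forward(self, x):
        q_mean, q_logvar = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: write z = mean + std * eps with eps ~ N(0, I),
        # so gradients can flow back into the encoder through the sampling step.
        z = q_mean + (0.5 * q_logvar).exp() * torch.randn_like(q_mean)
        return self.decoder(z), q_mean, q_logvar
```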

Using starter code from CS 236: Deep Generative Models at Stanford University taught by professors Stefano Ermon and Aditya Grover, I implemented a VAE wherein we assume the latent Z variables are drawn from a 2-dimensional Gaussian with identity covariance, and the observed variables are images of digits. After training the model for 20,000 epochs, I plotted the loss (i.e. the negative of the ELBO):

As expected, the loss decreased steadily over time, especially over the first couple thousand epochs. However, consider the following plots of the reconstruction loss (i.e. the negative of the reconstruction term defined above) and the KL-divergence throughout training:

It is crucial to notice that although the overall loss and the reconstruction loss decreased throughout training, the KL-divergence increased! Recall that the overall loss is the sum of the reconstruction loss and the KL-divergence. Clearly, the reconstruction loss and the KL-divergence are highly dependent (both are expectations over Q), and looking at the y-axes, we see that in this case there was much more to gain by pushing down the reconstruction loss at the expense of a higher KL-divergence.
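For concreteness, here is roughly what that bookkeeping might look like in code. It reuses the TinyVAE sketch from above rather than the actual CS 236 starter code, and binarized_digit_loader is a hypothetical data loader yielding batches of flattened 0/1 digit images:

```python
import torch
import torch.nn.functional as F

model = TinyVAE(x_dim=784, z_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step, x in enumerate(binarized_digit_loader):
    logits, q_mean, q_logvar = model(x)

    # Reconstruction loss: -E_Q[log p(x|z)] under a Bernoulli decoder, estimated
    # with the single z sampled inside the forward pass.
    recon_loss = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)

    # KL(Q(z) || p(z)) in closed form for a diagonal-Gaussian Q and an N(0, I) prior.
    kl = 0.5 * (q_logvar.exp() + q_mean.pow(2) - 1.0 - q_logvar).sum(-1)

    loss = (recon_loss + kl).mean()   # the plotted loss is exactly the negative ELBO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 1000 == 0:              # logging the two pieces separately gives the plots above
        print(step, loss.item(), recon_loss.mean().item(), kl.mean().item())
```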

After training, I used my model to generate new images of digits by first sampling Z from my learned posterior Q, and then sampling X from my learned conditional distribution of digit images given latent Z assignments. Despite a slightly higher KL-divergence than we might like (we likely would never get exactly to zero), the images look shockingly realistic:

So what have we done here? We started off with a gentle introduction to latent variable models and moved on to a discussion of how learning such models becomes intractable when the latent variables are high-dimensional and continuous. We then introduced the EM algorithm and discussed its limitations. Finally, we explored the inequality (the ELBO) derived during our discussion of EM from an information-theoretic perspective. It turns out that, although we don’t get the convergence guarantee we had with EM, maximizing the ELBO is tractable and can produce extremely impressive results. The code for my implementation of the VAE discussed above can be found here: https://github.com/SRS95/EE376A-Project. I encourage you to download it, make changes, and see if you can produce even better digits!
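As a rough sketch of the generation step (again reusing the TinyVAE sketch from above): sample latent codes and pass them through the decoder. For simplicity the codes below come from the standard normal prior; sampling from the learned Q as described above would mean first passing an observed digit through the encoder:

```python
import torch

model.eval()
with torch.no_grad():
    z = torch.randn(16, 2)                      # 16 latent codes from the 2-D N(0, I) prior
    pixels = torch.sigmoid(model.decoder(z))    # Bernoulli mean for each pixel
    digits = torch.bernoulli(pixels)            # sampled images of digits (16 x 784)
```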

Moving Forward

As a final note, I would like to present an additional avenue for studying the VAE from an information-theoretic standpoint. In “Estimating Information Flow in Neural Networks,” Goldfeld et al. explore tractable methods for measuring the mutual information between different layers of deep feed-forward neural networks. Since the VAE consists of two deep feed-forward neural networks, the encoder and decoder, perhaps Goldfeld et al.’s techniques could be applied to gain a deeper understanding of the interplay between the encoder and decoder throughout training. Furthermore, such an analysis could provide a more principled method for making hyperparameter choices in VAEs (e.g. number of layers for the encoder/decoder, number of units per layer, learning rate, etc.). This analysis would likely allow us to gain a deeper understanding of the information-theoretic significance of the VAE, and perhaps it would even lead us to something more effective.
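I won’t attempt to reproduce Goldfeld et al.’s estimator here, but as a crude stand-in for the idea of measuring information flow, here is a simple histogram-based mutual information estimate between two one-dimensional quantities; the suggested usage at the bottom (comparing a pixel statistic of the input with one latent coordinate at different points during training) is purely hypothetical:

```python
import numpy as np

def binned_mutual_information(a, b, bins=20):
    """Crude histogram-based estimate of I(A; B) in bits for two 1-D arrays.
    This is only an illustration of the idea; it is not the estimator of
    Goldfeld et al., which is built for high-dimensional network layers."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])).sum())

# Hypothetical usage: compare the mean pixel intensity of each input image with
# one coordinate of its latent code, at different points during training.
# mi = binned_mutual_information(x.mean(axis=1), q_mean[:, 0].detach().numpy())
```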

Outreach

I would also like to comment on the fantastic outreach event that the EE376A teaching team put together at Nixon Elementary last weekend. Not wanting to bore the elementary students too badly, I left out most of the technical details of the VAE. However, the vast majority of the children enjoyed thinking about the VAE as a black box that takes in a picture and tries to output another picture that’s as similar as possible. Furthermore, the students really enjoyed seeing how far face generation technology has come, and particularly enjoyed the following website: https://thispersondoesnotexist.com/

Although Nvidia uses GANs rather than VAEs to generate the faces at the link above, the students really enjoyed engaging with deep generative models and seeing how far the technology has come. Finally, I also found that the students were able to grasp and even enjoy fundamental information-theoretic concepts such as entropy and mutual information when they were framed in non-mathematical terms. For example, I found that explaining entropy as the average number of yes/no questions you’d have to ask a source spitting out letters from the English alphabet in order to know the next letter was very effective. Khan Academy has multiple fantastic videos on information theory, and I encourage elementary school teachers to try them out with their students!
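To put a rough number on that picture (assuming, just for a back-of-the-envelope calculation, that all 26 letters were equally likely):

```python
import numpy as np

# If all 26 letters were equally likely, each letter would carry log2(26) bits,
# i.e. you'd need close to five yes/no questions per letter on average.
print(np.log2(26))   # about 4.7

# Real English is far from uniform, so its per-letter entropy is lower,
# which is why fewer questions (on average) suffice in practice.
```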
