# A Debate: The Information Bottleneck Theory for DNNs

EE376A (Winter 2019)

#### Chelsea Sidrane & Ryan Holmdahl

Since the advent of the deep learning era in roughly 2012 A.D. , researchers and practitioners alike have sought theory to understand the massive success of the technique. Naftali Tishby et al theorized that the information bottleneck principle might very well explain this “deep learning” stuff [1, 2, 3].

The Information Bottleneck (IB) principle relies on a few ideas that we will define. The first is compression. To compress a piece of information, such as an image, a text document, or a sequence of bits, is to represent it in an abbreviated fashion, usually with the help of a code. We take our information, encode it using our short and efficient code, perhaps store it or transmit it, and, when we need to, decode the information back to its original form .

A question we must answer before we begin compression is: which part of the information do we care about? For example, if I write the sentence “The quick brown fox jumped over the lazy dog” on a whiteboard:

I might assume that what really matters are the symbolic English characters, and that transcribing it into typed text preserves the information content — but this wouldn’t be the case if we were trying to predict who wrote the sentence on the board.  In handwriting analysis, the way I cross my ‘t’ and the way I loop my ‘p’ are relevant pieces of information that would be lost by transcribing the sentence into typed text.

The concept of a sufficient statistic captures this idea well . Let’s call me, the person who wrote the sentence on the board, $\theta$, and the sentence that I wrote, a sample of my handwriting, $X$. Now let’s take a picture of the sentence and call that picture a function of the sample $T(X)$. If the picture is high quality enough to do forensic handwriting analysis on and determine that I wrote the sentence, then we can say that $T(X)$ is a sufficient statistic of $X$ for $\theta$. The picture ( $T(X)$) captures everything about me ( $\theta$) that the writing on the board ( $X$) does. In contrast, a typed-up version of the text I wrote, $U(X)$, would not be as good for predicting my identity ( $\theta$), as the actual writing on the board ( $X$) would be. This means that the typed-up version of the text ( $U(X)$) is not a sufficient statistic of the writing on the board ( $X$) for predicting the author ( $\theta$).

Sometimes we don’t want just any sufficient statistic, we want a minimal sufficient statistic, say, $M(X)$. All this means is that $M(X)$ is a sufficient statistic for $X$ while using as few bits as possible, or maximally compressing $X$ . For the words on the whiteboard, it’s perhaps sufficient to only see the ‘t’ and ‘p’ that I wrote to positively identify me as the writer, meaning that $T(X)$ contains a lot of unnecessary information and that $M(X)$ would be a picture of just those two letters.

The last concept we’ll introduce is that of mutual information. Mutual information essentially tells you how much information two random variables share. If knowing one random variable $X$ tells you nothing useful about the other $Y$, then their mutual information $I(X;Y)$ is zero . For example, if I flip two coins, and I tell you the first one was heads up, this doesn’t help me figure out the second coin because they are independent events and have zero mutual information. However, if I were to flip two cards from the top of a deck, and I show you that the first is the Queen of Hearts, then you know the second card is not the Queen of Hearts, and is less likely to be a Queen than anything else, so there does exist non-zero, positive mutual information between the two events.

Now that we’ve discussed compression, sufficient statistics, and mutual information, we can get back to the Information Bottleneck theory . It’s a broadly applicable idea, but we will describe it in the context of a neural network [2,3]. For a classification task, we have a class $Y$, and we draw a sample from this class, which we call $X$. Since we’re performing classification, $Y$ represents what is relevant about $X$. We feed $X$ into our neural network, producing activations at each layer of the network. We call the activations of the $i$-th layer $T_i$. The last layer produces a predicted class label, $\hat{Y}$. If the network follows the IB theory, then $\hat{Y}$ attempts to compress the input $X$ as much as possible while retaining as much information as possible about the relevant information $Y$ [1,2,3]. The resulting $\hat{Y}$ may not be a true sufficient statistic, but would strive to approach one.

The IB theory states that this approximate minimally sufficient statistic is produced by maximizing the mutual information between the compressed representation and the variable defining relevance ( $I(\hat{Y},Y)$), and by minimizing the mutual information between the compressed representation and the input ( $I(\hat{Y}, X)$) [1,2,3].

The theory has intuitive appeal to describe the functioning of neural networks, as by definition neural networks attempt to learn some representation of the input with which good predictions of the class label can be made.

Tishby proposed that the theory of classification networks as IB models could be evaluated by measuring the mutual information $I(T_i; Y)$ between intermediate layer representations $T_i$ and the true label $Y$, and the mutual information $I(T_i;X)$ between the intermediate representations $T_i$ and the input $X$ [2,3]. According to Tishby, training beginning from random weights should increase $I(T_i; Y)$ over all layers. Training should also result in learning weights such that $I(T_i; X)$ decreases as one moves from early hidden layers to later ones. The maximization of $I(T_i; Y)$ describes how as the network trains, it learns intermediate representations $T_i$ that approach a sufficient statistic, such that $I(T_i; Y) = I(X;Y)$. The minimization of $I(T_i; X)$ describes how the network learns intermediate representations that shed “irrelevant features” of the input $X$.

Tishby originally proposed the Information Bottleneck theory in a 2000 paper , but formalized the theory as applied to deep learning in a 2015 paper . He and colleagues then published a paper with experiments in 2017  that began a broad debate about understanding how the Information Bottleneck theory may be relevant to deep learning.

This debate has raised a number of interesting questions, such as:

#### How do you meaningfully measure mutual information in a neural network?

This has been one of the more polarizing topics in the IB debate. The answers to this question depend on what the true mutual information of the hidden layers are. If the network is deterministic, then:

1. If the input $X$ is discrete, Amjad and Geiger show that the mutual information $I(X,T_i)$ is piecewise constant as you move along the layers from input to output 
2. If the input $X$ is discrete and if the mappings from an input to the network’s hidden layer activations are invertible, which Goldfeld et al argue is true for any neural network which uses sigmoid or tanh activation functions, each of $X$ and $T_i$ are fully informative of the other and $I(X; T_i)$ is completely constant as you move along the layers from input to output 
3. If the input $X$ is continuous, the mutual information $I(X,T_i) = H(T_i) - H(T_i \mid X)$ is infinite because the density function $p(T_i \mid X)$ is a delta function at $T_i = f(X)$ as Saxe et al demonstrate 

In cases (2) and (3), it is argued that some source of noise in the hidden layers is necessary to give $I(X; T_i)$ a meaningful value. Tishby et al introduce noise by binning the activations, assigning them to buckets based on their value and in doing so reducing many activations to a single value. Goldfeld et al add small random values to each neuron’s activation during training so that the mapping is no longer invertible, arguing that the noise must be part of the network’s function instead of applied after-the-fact like with binning. Achille and Soatto employ a similar method, adding random noise to the network’s weights instead of its activations . Ultimately, it remains an open question how best to derive a value of $I(X; T_i)$ which meaningfully captures compression in the neural networks employed in practice, which generally do not use these noise sources and use non-saturating activations like the rectified linear unit (ReLu). It also remains to be seen if mutual information is the best quantity to measure in order to understand neural networks from an information theoretic perspective

#### How does compression of inputs relate to model generalization?

A central idea in IB theory for neural networks is that compression of the inputs, captured in $I(X; T_i)$, helps the network to generalize. Intuitively, this makes sense: the less a hidden layer’s activations tells you about the value of $X$, the more “unnecessary” features it has shed about $X$, and the fewer “unnecessary” features the model uses, the less prone it is to overfitting. While logical, this is a difficult claim to actually prove, and would seemingly require a consistent correlation between lower $I(X; T_i)$ and higher model performance on withheld samples to be shown on a variety of different models.

#### How does clustering of activations relate to model generalization?

Goldfeld et al argue that the source of compression in neural networks is clustering of hidden layer activations. More specifically, they argue that estimates of $I(X; T_i)$ is mostly measuring how “close together” the activations at layer $i$ are for different inputs of the same class. From this, they propose that the extent to which different models cluster their inputs, and how that clustering changes over the course of a model’s training, be further explored as a cause of model generalization. It’s an interesting theory, and one which makes intuitive sense, but again would require examination under a broader set of neural networks and with more rigorous measures of clustering.

#### Conclusion

As the deep learning community continues to explore how and why neural networks perform as well as they do, the Information Bottleneck theory provides an interesting framework for guiding and interpreting the research accumulated along the way. There are still a lot of questions to tackle, but it’s an approach certainly worth consideration and an in-depth examination. If you’d like to dive into the topic in more detail, look at the references and extra references for some relevant papers to help you get started:

### References

 Tishby, Naftali, Fernando C. Pereira, and William Bialek. “The information bottleneck method.” arXiv preprint physics/0004057 (2000).

 Tishby, Naftali, and Noga Zaslavsky. “Deep learning and the information bottleneck principle.” In 2015 IEEE Information Theory Workshop (ITW), pp. 1-5. IEEE, 2015.

 Shwartz-Ziv, Ravid, and Naftali Tishby. “Opening the black box of deep neural networks via information.” arXiv preprint arXiv:1703.00810 (2017).

 Cover, Thomas M., and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

 Amjad, Rana A. and Bernhard C. Geiger. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle.” (2018). ArXiv.

 Goldfeld, Ziv, Ewout van den Berg, Kristjan Greenewald, Brian Kingsbury, Igor Melnyk, Nam Nguyen, and Yury Polyanskiy. “Estimating Information Flow in DNNs.” (2018).

 Saxe, Andrew Michael, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. “On the information bottleneck theory of deep learning.” (2018).

 Achille, Alessandro, and Stefano Soatto. “Emergence of invariance and disentanglement in deep representations.” The Journal of Machine Learning Research 19, no. 1 (2018): 1947-1980.

 Wikipedia. “Deep Learning: Deep Learning Revolution.

### Extra References

Gabrié, Marylou, Andre Manoel, Clément Luneau, Nicolas Macris, Florent Krzakala, and Lenka Zdeborová. “Entropy and mutual information in models of deep neural networks.” In Advances in Neural Information Processing Systems, pp. 1826-1836. 2018.

Kolchinsky, Artemy, Brendan D. Tracey, and Steven Van Kuyk. “Caveats for information bottleneck in deterministic scenarios.” (2018).