Writing Tutorials for Topics in Information Theory

EE376A (Winter 2019)

by Chelsea Sidrane and Ryan Holmdahl

What we did

In a fast-moving field such as computer science, it’s vital to stay aware of new advances. Most computer scientists, though, aren’t academics, and may not have the background or time required to read and process formal research papers. In domains like deep learning, this issue is addressed by a thriving ecosystem of blog-style tutorials, where academics and practitioners write about new developments in a way that’s approachable to people with a non-academic or non-professional background. This ecosystem has so far proven to be a great supplement to formal publications, and has helped CS folks from diverse backgrounds stay up-to-date on the latest deep learning developments.

Outside of deep learning and a few other “hot” fields, this kind of blog ecosystem doesn’t really exist. We thought that information theory — a domain which impacts nearly everything computer scientists do — could really benefit from writing that falls somewhere between “Information In Small Bits: Information Theory for Kids” and Shannon’s “A Mathematical Theory of Communication”. Some non-threatening but still-informative blog posts could get people thinking more about information theory and deliver the field’s innovations to a bigger crowd.

The posts written for this project piggybacked off of the success of the deep learning blog ecosystem, focusing on topics that combine fundamental ideas of both information theory and deep learning. Hopefully, we can leverage the deep learning hype to get more people interested in information theory.

Post 1: Maximum-entropy Inverse Reinforcement Learning

Chelsea’s research field is reinforcement learning (RL), and she found that in papers and blogs that she read, there was never significant discussion of the information theoretic ideas that are used in RL. In order to remedy this gap in the (blog) literature, it seemed fitting to address a popular technique that borrows a great deal from information theoretic ideas: Maximum Entropy Inverse Reinforcement Learning. Hopefully the blog post can give other RL researchers a deeper understanding of the information theoretic ideas behind MaxEnt IRL.

Post 2: Lossless compression with neural networks

Most deep learning papers either present a new network architecture or apply an existing architecture to a new problem. Kedar Tatwawadi’s paper, DeepZip: Lossless Compression using Recurrent Neural Networks, managed to do both, describing a new neural network/arithmetic coder hybrid and then using that model to tackle lossless compression, a domain neural network research hasn’t often explored. It seemed like a great topic to excite deep learning folks while teaching them about key information theory problems and the cool tools used in the field. The post covers the broad task of compression, the challenges of lossless compression, and Tatwawadi’s neural-network-based approach to it. Hopefully, it can get readers thinking about new ways to leverage machine learning for central information theory problems.

Post 3: Information Bottleneck in neural networks

For our third blog post, we stumbled upon a topic exactly at the intersection of information theory and deep learning: the information bottleneck (IB) debate. This ongoing academic debate surrounds an information theoretic theory for understanding the training process of deep neural networks. We originally approached the blog post as a summary of the debate, focusing on one of the original papers proposing the theory, by Naftali Tishby, and on one countering its arguments, by Ziv Goldfeld. But ultimately, we found that trying to explain all sides of the debate in detail was a monumental task. There are at least 10 papers involved, as well as additional talks following up on many of them. After beginning discussions with both Tishby and Goldfeld, we realized that a proper survey of the debate would likely require extensive discussions with many different authors, and that properly addressing all of the papers would require substantial study of the field to fully understand the arguments being made. In the end, we settled on summarizing the original IB theory itself and discussing the various questions that have been raised about applying it to neural networks. This post was a good lesson in scoping a project to an appropriate size.

Ryan’s Outreach

My outreach event was a game using cards from the game Set:

A few example Set cards.

Kids would play in pairs. One player would come up with a pattern — say, red-green-purple, or stripe-stripe-solid — and then the other player would try to guess that pattern. The first player can’t tell the other anything about the pattern; all they can do is show examples of that pattern. After each example, the second player gets to take a guess. If they get it right, they win. If they don’t, then the first player picks a new example to show them, and they keep going until they get the pattern.

The goal of the game was to get kids thinking about what makes an example “informative.” It’s a nice introduction to the idea that some messages, even if they’re the same length, have more information than others, and that a good example minimizes the guesser’s uncertainty about the space of possible patterns.

Overall, I’d say the players figured it out pretty quickly. Some, especially older kids, picked up on it immediately, and could get it in two guesses (or one, in some luckier cases). Younger kids took a bit longer with it, but it was really great seeing them piece things together as they played. A lot of the time, they’d assemble three cards, look them over, think for a second, then swap out a card or two for better, more informative ones. The biggest source of confusion was probably the definition of a pattern. I assumed it would be interpreted as a sequence of three attribute values, given the few examples of patterns I’d tell them, but I guess my examples weren’t informative enough to explain it!

Chelsea’s Outreach

Chelsea living the dream of being Bill Nye for 10 minutes

For my outreach project I prepared a “Bill Nye”-type presentation for the elementary school students at the event. With the help of Noor, I gave a short and hopefully fun lesson on a key information theory concept, entropy, and talked about one of the major problems in information theory: the joint source channel coding problem. I did this through everyday examples in the hopes of making it more relatable for the kids. I also told the audience a little bit about my field of aerospace engineering, as I thought this would be exciting for the kids (who doesn’t love rockets?). I closed the presentation by talking about a time that I struggled in physics as a freshman in college. Overall, my intention was to convey a little information theory to the assembled audience, and to also present myself as someone the elementary school students could see themselves in. (Err, and I also wanted to tell a few jokes about space :P).

Lossless compression with neural networks

EE376A (Winter 2019)

by Ryan Holmdahl

In this post, we’re going to be discussing Kedar Tatwawadi’s interesting approach to lossless compression, which combines neural networks with classical information theory tools to achieve surprisingly good results. This post starts with some pretty basic definitions and builds up from there, but each section can stand alone, so feel free to skip over any that you already know:

  1. A quick overview of compression
  2. Challenges of lossless compression
  3. Lossless compression with neural networks
    1. RNN probability estimator
    2. Arithmetic coder
    3. Encoding and decoding
    4. DeepZip in practice

A quick overview of compression

Let’s say you have a picture that you want to send to a friend:

A picture of vital importance.

Unfortunately, the file size is pretty large, and it won’t fit in an email attachment. That’s where compression comes in: using compression, you can create a smaller version of the picture that you can send to your friend, who can then decompress and view it on their computer. Generally, compression works like this:

The encoder-decoder model of compression.

You encode the image, producing a new, smaller version. You send the compressed image to your friend, who then decodes it, recreating the original image. Compression algorithms — that is, encoder-decoder pairs — come in two flavors:

  • Lossless compression algorithms always output an exact copy of the original input. This is great for formats like text, where things can go seriously wrong if the output isn’t the same as the input. Algorithms like ZIP and GZIP (which you’ve probably used to create .zip and .gz files on your computer) are commonly used for lossless compression.
  • Lossy compression algorithms output a close approximation of the input. This usually means your compressed files will be smaller than they would with lossless compression, but also means that errors in the output will almost certainly occur. JPEG, the algorithm which compresses images into .jpg files, is lossy; that’s why you see weird miscolorings and other artifacts in .jpg images.

Challenges of lossless compression

Suppose a friend claims that they’ve devised a lossless compression algorithm that can reduce any file to half its original size. Should you trust their claim?

Consider every possible file that is 2KB in size. There are 8 bits in a byte and 1000 bytes in a kilobyte, so each of these files consists of 16000 bits. Each of those bits can be either 1 or 0, so if you consider every possible permutation of bit values, there are 2^{16000} possible 2KB files. If your friend is telling the truth, then each of these files can be compressed to a 1KB version; moreover, since the algorithm is lossless, each 2KB file has to be compressed to a different 1KB file. Otherwise, if two different 2KB files compressed to the same 1KB file, the algorithm would have no way to know which input was originally used when it tries to decode that 1KB file.

This requirement exposes a problem: a 1KB file has only 8000 bits, so there are only 2^{8000} possible 1KB files — far fewer than the number of 2KB files. In fact, there are more possible 2KB files than possible files of all sizes less than 2KB. If there are fewer possible small files than large files, then not every large file can be assigned a unique small file. It’s therefore impossible for any lossless compression algorithm to reduce the size of every possible file, so our friend’s claim has to be incorrect.
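The counting argument scales down nicely, so we can check it directly at a toy size: 16-bit “files” and 8-bit compressed versions (the scale here is just for illustration).

```python
# The counting argument at toy scale: 16-bit "files" vs. 8-bit compressed files.
n_large = 2 ** 16       # number of distinct 16-bit files
n_small = 2 ** 8        # number of distinct 8-bit files

# Files of ALL sizes below 16 bits are still fewer than the 16-bit files alone:
n_all_smaller = sum(2 ** b for b in range(16))   # sizes 0 through 15 bits

print(n_large, n_small, n_all_smaller)  # 65536 256 65535
```

Since 65535 < 65536, there is no way to give every 16-bit file its own unique smaller encoding, no matter how the sizes are mixed.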

Fortunately, a lossless compression algorithm doesn’t need to compress every possible file to be useful. Typically, we design a compression algorithm to work on a particular category of file, such as text documents, images, or DNA sequences. Each of these categories represents a tiny portion of the universe of all possible files. So while we can’t reduce the size of every possible file with a single algorithm, if we can make our algorithm work on the kinds of inputs we expect it to receive, then in practice we’ll usually achieve meaningful compression.

The key challenge of lossless compression is making sure that these expected inputs get encoded to small compressed versions, letting less common inputs receive larger compressions. This challenge even appears within a single file: we’d like our algorithms to use short representations for common bit sequences, letting rarer bit sequences get longer representations. If a file consists only of 01 and 10 repeated randomly, it’d be pretty efficient if our algorithm could figure that out and encode every instance of 01 as a 0 and every 10 as a 1.
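That idea can be sketched in a few lines, assuming the input really is built only from the blocks 01 and 10 (the helper names here are made up for illustration):

```python
# Toy codec for inputs built only from the blocks "01" and "10":
# encode "01" -> "0" and "10" -> "1", halving the file.
def encode_blocks(bits):
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    return "".join({"01": "0", "10": "1"}[p] for p in pairs)

def decode_blocks(code):
    return "".join({"0": "01", "1": "10"}[c] for c in code)

msg = "01101001"
print(encode_blocks(msg))                 # 0110 -- half the length
print(decode_blocks(encode_blocks(msg)))  # 01101001 -- lossless round trip
```

Of course, a real algorithm has to discover the common blocks itself rather than being told about them, which is exactly what the statistical models below are for.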

Indeed, this is roughly how most modern lossless compression algorithms work: the algorithm builds a model of how likely certain sequences are and uses that model to encode the input as concisely as possible. This leaves two main questions:

  • How do you build a statistical model of an input?
  • How do you use the model to generate a compressed output?

Lossless compression with neural networks

The letter “z” is the least commonly used in the English language, appearing less than once per 10,000 letters on average. If you were trying to build a compression algorithm to encode text files, since “z” has such a low probability of occurring, you’d probably assign it a very long bit sequence, so that more frequent letters like “a” and “e” can receive shorter ones. But what happens when your algorithm tries to compress an article about zebras? Suddenly the letter “z” is appearing all over the place, but your algorithm is using a long encoding for it each time. You probably wouldn’t get very good compression on this document.

The bane of a naïve lossless compression algorithm.

If you, a person, were reading this zebra article, you’d figure out pretty fast that “z” is going to appear a lot. It would be nice if our lossless compression algorithm could figure that out also; that is, if the algorithm could adapt its letter frequency model as it encoded the document, and use a shorter encoding for the letter “z” when it realizes that “z” will be very common. This is not a rare problem in compression, and there has been a substantial amount of research in building algorithms that can adapt to a document as it is being encoded.

But these algorithms tend to have a pretty short memory: their models generally only take into account the past 20 or so steps in the input sequence. If the zebra article took a brief digression to discuss horses, the model could “forget” that “z” is a common letter and have to re-update its model when the section ended. It would be nice if we could find a model which is better at capturing long-term dependencies in the inputs.

Fortunately, there’s a whole category of neural networks specifically designed to model sequential inputs and capture their long-term dependencies: recurrent neural networks, or RNNs. There are a lot of great explainers on what RNNs are and how they work that I won’t rehash here; for our purposes, it suffices to say that an RNN is a neural network model that processes an input sequence step-by-step, producing some output at each step.

In his paper DeepZip: Lossless Compression using Recurrent Networks, Kedar Tatwawadi combines RNNs with information theory techniques to build a surprisingly effective lossless compressor. We’ll be taking a deeper look at his approach.

RNN probability estimator

The first component of Tatwawadi’s DeepZip model is called the RNN probability estimator. As the name suggests, it is an RNN which, at each time step, takes as input a symbol in the original sequence. Here, a symbol is any building block of an input sequence; it might be a bit, a base in a DNA strand, an English letter, whatever. After processing the symbol, the RNN probability estimator outputs a vector. Each entry in the vector is the RNN’s prediction of how likely it is that a particular symbol appears next in the sequence. When encoding a bit sequence, an output of [0.25, 0.75] would indicate that the model believes the next bit is 1 with 75% probability. The RNN can then be shown the next symbol in the sequence, for which it will produce probabilities given the symbols it has been previously shown.

The RNN probability estimator in DeepZip is interesting when compared to other neural networks. Most networks use a training dataset to learn their internal parameters. When training is done, the parameters are frozen, and only then is the network used to make predictions on new inputs. The RNN probability estimator, however, undergoes no such training before it is shown a new input. When given something to encode, the RNN starts with random parameters. As it processes symbols in the input, it not only updates its hidden state by the usual RNN rules, it also updates its weight parameters using the loss between its probability predictions and the ground-truth symbol. Not only is the RNN probability estimator trying to learn what dependencies exist in the new sequence, it has to learn how to learn those dependencies.
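Here’s a minimal sketch of that learn-while-compressing loop. To keep it self-contained, a softmax model conditioned only on the previous symbol stands in for the RNN; the architecture, learning rate, and names are illustrative assumptions, not details from Tatwawadi’s paper:

```python
import numpy as np

# Toy stand-in for the RNN probability estimator: a softmax model over the
# previous symbol, trained online as it "compresses" (illustrative only).
rng = np.random.default_rng(0)  # fixed seed: the decoder must start identically
n_symbols = 2
W = rng.normal(scale=0.1, size=(n_symbols, n_symbols))  # random initial weights
lr = 0.1

def predict(prev):
    """Probabilities for the next symbol, given the previous one."""
    logits = W[prev]
    p = np.exp(logits - logits.max())
    return p / p.sum()

seq = [1, 1, 0, 1, 1, 1, 0, 1] * 50  # a source where 1 is much more common
for prev, nxt in zip(seq, seq[1:]):
    p = predict(prev)                 # these would be fed to the coder
    grad = p.copy()
    grad[nxt] -= 1.0                  # cross-entropy gradient w.r.t. logits
    W[prev] -= lr * grad              # online weight update, mid-compression

print(predict(1))  # most of the mass ends up on the common symbol, 1
```

The fixed seed matters: decoding only works if the decoder starts from the exact same random weights and replays the exact same updates.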

Arithmetic coder

The RNN produces symbol probabilities at each step in the input sequence, but the algorithm needs a way to actually translate those probabilities to an encoding of the input. To do this, Tatwawadi uses a classical information theory tool called an arithmetic coder.

An arithmetic coder uses a numerical range to represent the input sequence. The range initially spans from 0.0 to 1.0, and is updated as follows:

  1. Predict the probability of each symbol appearing next in the input sequence. This will probably come from some statistical model, like an RNN probability estimator.
  2. Divide the coder’s current range into subsections. There will be one subsection for each possible symbol, and the subsection’s length is proportional to the probability of its corresponding symbol (produced in step 1).
  3. Read in the next symbol from the input sequence.
  4. Set the coder’s current range to be that symbol’s subsection. The coder’s range is now strictly smaller than it was before. The higher the predicted probability of the symbol, the longer the new range will be.
  5. If there are more symbols in the input sequence, return to step 1. The coder will continue updating its range for each symbol in the input sequence.

After reading in the entire input sequence, the coder is left with a range. The final encoding for the input sequence is the binary fraction representation of any number within that range. A binary fraction is like a regular decimal, except each digit represents a power of 2 instead of a power of 10. For example, 0.25 would be 0.01 as a binary fraction, since the second place after the point counts increments of 2^{-2} = 0.25. In the arithmetic coder, we drop the leading zero and the point, since every number that could be conveyed is between 0.0 and 1.0.
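A quick sketch of the conversion (the helper name is hypothetical):

```python
# Hypothetical helper: the binary-fraction digits of a number in [0, 1).
def to_binary_fraction(x, max_bits=16):
    bits = ""
    while x > 0 and len(bits) < max_bits:
        x *= 2                    # shift the next binary digit left of the point
        if x >= 1:
            bits += "1"
            x -= 1
        else:
            bits += "0"
    return bits or "0"

print(to_binary_fraction(0.25))  # 01 -- i.e. 0*2^-1 + 1*2^-2 = 0.25
print(to_binary_fraction(0.5))   # 1
```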

Let’s walk through an example. Let’s say we want to encode the bit sequence 1101. For simplicity, our model always predicts a 0.25 probability for 0 and 0.75 for 1, regardless of the previous bits in the sequence. Our coder begins with the range 0.0 to 1.0:

Before reading the first bit, the coder divides its range into subsections. 0 has probability 0.25, so it gets the range 0.0 to 0.25. 1 has probability 0.75, so it gets the range 0.25 to 1.0. The divided range looks like this:

Now the encoder reads the first bit, which is a 1. The coder then updates its range to be the subsection assigned to 1; in this case, it’s 0.25 to 1.0:

There are still more symbols to encode, so the coder goes back to the first step. Our model produces the same probabilities, but now the coder uses them to subdivide its new range. This gives 0 the range 0.25 to 0.44 and 1 the range 0.44 to 1.0:

The coder reads the next bit, which is a 1 again, so again the coder updates its range to be the subsection assigned to 1:

For the next bit, the divided range looks like this:

The coder reads in the 0 and sets its range to be the subsection assigned to 0. Our range before the last bit is now 0.44 to 0.58:

For the last bit, the divided range looks like this:

The last bit is a 1, so our final range is roughly 0.47 to 0.58:

Any number in this range can now be used to represent our input sequence. The number in the range with the shortest bit representation is 0.5, so we can select that as our encoding number:

0.5 is represented by the binary fraction 0.1, so we can now save or transmit our original sequence 1101 using the much shorter sequence 1.
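The whole walkthrough can be reproduced in a few lines, using the same fixed model that assigns probability 0.25 to 0 and 0.75 to 1 (the helper names are made up; a real coder would also need fixed-precision arithmetic rather than floats):

```python
import math

# The walkthrough above, unrounded: fixed model with P(0)=0.25, P(1)=0.75.
def encode(bits, p0=0.25):
    lo, hi = 0.0, 1.0
    for b in bits:
        split = lo + p0 * (hi - lo)   # boundary between the two subsections
        lo, hi = (lo, split) if b == "0" else (split, hi)
    return lo, hi

def shortest_codeword(lo, hi):
    """Shortest binary fraction whose value falls in [lo, hi)."""
    k = 1
    while True:
        m = math.ceil(lo * 2**k)      # smallest multiple of 2^-k that is >= lo
        if m / 2**k < hi:
            return format(m, f"0{k}b")
        k += 1

lo, hi = encode("1101")
print(lo, hi)                     # 0.47265625 0.578125 -- the final range, unrounded
print(shortest_codeword(lo, hi))  # 1  (i.e. the number 0.5)
```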

When you want to decode this compressed sequence, you use the arithmetic coder in reverse. Starting again with the range 0.0 to 1.0, the coder generates the output sequence as follows:

  1. Predict the probability of each symbol appearing next in the output sequence. Importantly, these probabilities must be produced by the same model that produced the probabilities in the encoder.
  2. Divide the coder’s current range into subsections. Again, there will be one subsection for each possible symbol, and the subsection’s length is proportional to the probability predicted in step 1 for that symbol.
  3. Identify the subsection which contains the encoded number. Remember that our encoding of the input sequence is the binary representation of a number between 0.0 and 1.0, and that number was contained in the final range of the arithmetic coder during the encoding step.
  4. Add the symbol assigned to that subsection to the output sequence. Our final range from the encoding step was contained in the range selected for each preceding step, so whichever symbol’s subsection contains our input encoding must be in the output sequence.
  5. Set the coder’s current range to be that subsection. The coder’s range is now exactly what it was during the corresponding step of the encoding process.
  6. If there are more symbols to decode, return to step 1. The end of the sequence might be indicated with a special end-of-message symbol, or we might know the number of symbols that should be in the output in advance. If we don’t see that EOM symbol or reach the known end of our sequence, we keep decoding.

We’ll illustrate this with the same example as before. Someone has sent us the sequence 1, and we’ll assume we know the original sequence was four bits long. The coder is initialized with the range 0.0 to 1.0, with the encoding number placed onto the range:

The coder has to use the same statistical model as during encoding to work properly, so it produces the following subsections:

Our encoding represents the number 0.5, which falls into the range for symbol 1, so the coder adds 1 to the output sequence and updates its range. The new range is:

Since the coder knows there are more bits to add to the output sequence, it again subdivides the range using the symbol probabilities:

Again, 0.5 falls into the section for symbol 1, so the output sequence is now 11. The new range becomes:

There still aren’t four bits in the output sequence, so the coder subdivides again:

Now, 0.5 falls into the range for 0, so the coder updates the output sequence to be 110. It takes symbol 0‘s subdivision to be its new range:

Subdividing again, the range becomes:

0.5 now falls into the range of symbol 1 again, so the output sequence becomes 1101. It now consists of four bits, so the coder terminates, and we have our original sequence back.

In practice, we usually won’t know the exact length of the incoming sequence in advance, so we’d probably use a special end-of-message symbol to indicate to the coder when it should stop adding new symbols to the output sequence; otherwise, the coder could keep adding symbols forever.
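The decoding loop can be sketched the same way, again with the fixed 0.25/0.75 model from the example and a known output length (the function name is made up for illustration):

```python
# The decoding walkthrough above: same fixed model, known output length.
def decode(value, length, p0=0.25):
    lo, hi, out = 0.0, 1.0, ""
    for _ in range(length):
        split = lo + p0 * (hi - lo)  # same subsections the encoder produced
        if value < split:            # encoded number falls in 0's subsection...
            out, hi = out + "0", split
        else:                        # ...or in 1's subsection
            out, lo = out + "1", split
    return out

print(decode(0.5, 4))  # 1101 -- the original sequence, recovered
```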

Encoding and decoding

DeepZip combines the RNN probability estimator and the arithmetic coder to encode input sequences, as seen here:

Courtesy of Kedar Tatwawadi.

After the RNN probability estimator is initialized with random weights, the arithmetic coder encodes the first symbol in the input sequence using a default symbol distribution. That first symbol is then passed into the RNN probability estimator, which outputs probabilities for the next symbol. The arithmetic coder uses these probabilities to encode the second symbol. The weights of the RNN probability estimator are updated by comparing its predicted probabilities for the second symbol to the actual identity of the second symbol. Then, the second symbol is input to the RNN probability estimator, outputting probabilities for the third symbol, and the process continues until the input sequence is completely encoded.

Decoding works similarly:

Courtesy of Kedar Tatwawadi.

Importantly, the RNN probability estimator is initialized with the exact same weights used at the start of encoding; this can be done by sending whatever random seed was used during encoding along with the actual encoding of the input. The arithmetic coder then uses the default symbol distribution to parse the first symbol from the encoding. This symbol is passed to the RNN probability estimator, which outputs a set of probabilities for the next symbol; if initialized correctly, these will be the exact probabilities output by the RNN during the first encoding step. These probabilities are used by the coder to parse the second symbol, which is used to update the weights of the RNN probability estimator. This weight update should be exactly the same as that used after the first step of the encoding process. The second symbol is then passed to the RNN probability estimator, which outputs probabilities for the third symbol, and the process continues until the coder reads an end-of-message symbol.

DeepZip in practice

Tatwawadi applied DeepZip to some common challenge datasets, and achieved impressive results. DeepZip was able to encode the human chromosome 1, originally 240MB long, into a 42MB sequence, which was 7MB shorter than that produced by the best known DNA compression model, MFCompress. On text datasets, DeepZip achieved around 2x better compression than the common lossless compression algorithm GZIP, although compression models specifically designed for text performed slightly better. The full results can be found in the paper. It’s worth noting that DeepZip was significantly slower at compressing than the other models tested, which is to be expected when backpropagation needs to be performed at each step of the input sequence.

DeepZip was also applied to procedurally generated Markov-k sources. A Markov-k sequence is a series of N numbers, each an integer from 0 to M-1 for some constant M. The first k symbols are random; each subsequent symbol is the previous symbol minus the symbol which appeared k prior, taken mod M. That is,

X_n = X_{n-1} - X_{n-k} \mod M
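Under that formula, generating a Markov-k source looks like this (the function name and defaults are illustrative, not from the paper):

```python
import random

# A Markov-k source as defined above: k random symbols, then each new symbol
# follows X_n = X_{n-1} - X_{n-k} mod M.
def markov_k_source(n, k, m, seed=0):
    rng = random.Random(seed)
    seq = [rng.randrange(m) for _ in range(k)]  # symbols live in 0..M-1
    while len(seq) < n:
        seq.append((seq[-1] - seq[-k]) % m)     # deterministic after the first k
    return seq

print(markov_k_source(n=16, k=3, m=4))  # only the first 3 symbols are random
```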

If this isn’t quite clear, don’t worry; the main takeaway is that each symbol is dependent on one that’s k steps away in the input sequence. Here’s how the rate of compression improved over time for different values of k when DeepZip used a vanilla RNN as its RNN probability estimator:

Courtesy of Kedar Tatwawadi.

What’s interesting here is that DeepZip was able to get roughly the same level of compression for all values of k up to 35, after which it completely failed to compress the sequence at all. This suggests that the vanilla RNN can only “remember” symbols that fell within 35 steps of the current one. We can see the same plot for a DeepZip model which used a GRU as its RNN probability estimator:

Courtesy of Kedar Tatwawadi.

While the vanilla RNN could only remember up to 35 symbols, the GRU seems to be able to remember up to 50. Tatwawadi proposes that this could be used as a test to compare different RNN flavors going forward: those which can compress higher values of k might have better long-term memory than their counterparts. It’s a cool idea, and after a good amount of validation might spur new innovations in RNNs.

Hopefully, this post has helped you understand how neural networks can be used to create surprisingly effective lossless compressors. The DeepZip model — combining information theory techniques with neural networks — is a great example of cross-disciplinary work producing synergistic results, and one which shows the potential of this exciting area of research.