by Ryan Holmdahl

In this post, we’re going to be discussing Kedar Tatwawadi’s interesting approach to lossless compression, which combines neural networks with classical information theory tools to achieve surprisingly good results. This post starts with some pretty basic definitions and builds up from there, but each section can stand alone, so feel free to skip over any that you already know:

- A quick overview of compression
- Challenges of lossless compression
- Lossless compression with neural networks

## A quick overview of compression

Let’s say you have a picture that you want to send to a friend:

Unfortunately, the file size is pretty large, and it won’t fit in an email attachment. That’s where *compression* comes in: using compression, you can create a smaller version of the picture that you can send to your friend, who can then decompress and view it on their computer. Generally, compression works like this:

You *encode* the image, producing a new, smaller version. You send the compressed image to your friend, who then *decodes* it, recreating the original image. Compression algorithms — that is, encoder-decoder pairs — come in two flavors:

*Lossless*compression algorithms always output an exact copy of the original input. This is great for formats like text, where things can go seriously wrong if the output isn’t the same as the input. Algorithms like ZIP and GZIP (which you’ve probably used to create .zip and .gz files on your computer) are commonly used for lossless compression.*Lossy*compression algorithms output a close approximation of the input. This usually means your compressed files will be smaller than they would with lossless compression, but also means that errors in the output will almost certainly occur. JPEG, the algorithm which compresses images into .jpg files, is lossy; that’s why you see weird miscolorings and other artifacts in .jpg images.

## Challenges of lossless compression

Suppose a friend claims that they’ve devised a lossless compression algorithm that can reduce any file to half its original size. Should you trust their claim?

Consider every possible file that is 2KB in size. There are 8 bits in a byte and 1000 bytes in a kilobyte, so each of these files consists of 16000 bits. Each of those bits can be either $latex 1$ or $latex 0$, so if you consider every possible permutation of bit values, there are $latex 2^{16000}$ possible 2KB files. If your friend is telling the truth, then each of these files can be compressed to a 1KB version; moreover, since the algorithm is lossless, each 2KB file has to be compressed to a *different* 1KB file. Otherwise, if two different 2KB files compressed to the same 1KB file, the algorithm would have no way to know which input was originally used when it tries to decode that 1KB file.

This requirement exposes a problem: a 1KB file has only 8000 bits, so there are only $latex 2^{8000}$ possible 1KB files — far fewer than the number of 2KB files. In fact, there are more possible 2KB files than possible files of all sizes less than 2KB. If there are fewer possible small files than large files, then not every large file can be assigned a unique small file. It’s therefore impossible for *any* lossless compression algorithm to reduce the size of every possible file, so our friend’s claim has to be incorrect.

Fortunately, a lossless compression algorithm doesn’t need to compress *every *possible file to be useful. Typically, we design a compression algorithm to work on a particular category of file, such as text documents, images, or DNA sequences. Each of these categories represents a tiny portion of the universe of all possible files. So while we can’t reduce the size of every possible file with a single algorithm, if we can make our algorithm work on the kinds of inputs we expect it to receive, then in practice we’ll usually achieve meaningful compression.

The key challenge of lossless compression is making sure that these expected inputs get encoded to small compressed versions, letting less common inputs receive larger compressions. This challenge even appears within a single file: we’d like our algorithms to use short representations for common bit sequences, letting rarer bit sequences get longer representations. If a file consists only of $latex 01$ and $latex 10$ repeated randomly, it’d be pretty efficient if our algorithm could figure that out and encode every instance of $latex 01$ as a $latex 0$ and every $latex 10$ as a $latex 1$.

Indeed, this is roughly how most modern lossless compression algorithms work: the algorithm builds a model of how likely certain sequences are and uses that model to encode the input as concisely as possible. This leaves two main questions:

- How do you build a statistical model of an input?
- How do you use the model to generate a compressed output?

## Lossless compression with neural networks

The letter “z” is the least commonly used in the English language, appearing less than once per 10,000 letters on average. If you were trying to build a compression algorithm to encode text files, since “z” has such a low probability of occurring, you’d probably assign it a very long bit sequence, so that more frequent letters like “a” and “e” can receive shorter ones. But what happens when your algorithm tries to compress an article about zebras? Suddenly the letter “z” is appearing all over the place, but your algorithm is using a long encoding for it each time. You probably wouldn’t get very good compression on this document.

If you, a person, were reading this zebra article, you’d figure out pretty fast that “z” is going to appear a lot. It would be nice if our lossless compression algorithm could figure that out also; that is, if the algorithm could adapt its letter frequency model as it encoded the document, and use a shorter encoding for the letter “z” when it realizes that “z” will be very common. This is not a rare problem in compression, and there has been a substantial amount of research in building algorithms that can adapt to a document as it is being encoded.

But these algorithms tend to have a pretty short memory: their models generally only take into account the past 20 or so steps in the input sequence. If the zebra article took a brief digression to discuss horses, the model could “forget” that “z” is a common letter and have to re-update its model when the section ended. It would be nice if we could find a model which is better at capturing long-term dependencies in the inputs.

Fortunately, there’s a whole category of neural networks specifically designed to model sequential inputs and capture their long-term dependencies: recurrent neural networks, or RNNs. There are a lot of great explainers on what RNNs are and how they work that I won’t rehash here; for our purposes, it suffices to say that an RNN is a neural network model that processes an input sequence step-by-step, producing some output at each step.

In his paper DeepZip: Lossless Compression using Recurrent Networks, Kedar Tatwawadi combines RNNs with information theory techniques to build a surprisingly effective lossless compressor. We’ll be taking a deeper look at his approach.

### RNN probability estimator

The first component of Tatwawadi’s DeepZip model is called the RNN probability estimator. As the name suggests, it is an RNN which, at each time step, takes as input a *symbol* in the original sequence. Here, a symbol is any building block of an input sequence; it might be a bit, a base in a DNA strand, an English letter, whatever. After processing the symbol, the RNN probability estimator outputs a vector. Each entry in the vector is the RNN’s prediction of how likely it is that a particular symbol appears next in the sequence. When encoding a bit sequence, an output of $latex [0.25, 0.75]$ would indicate that the model believes the next bit is $latex 1$ with 75% probability. The RNN can then be shown the next symbol in the sequence, for which it will produce probabilities given the symbols it has been previously shown.

The RNN probability estimator in DeepZip is interesting when compared to other neural networks. Most networks use a training dataset to learn their internal parameters. When training is done, the parameters are frozen, and only then is the network used to make predictions on new inputs. The RNN probability estimator, however, undergoes no such training before it is shown a new input. When given something to encode, the RNN starts with random parameters. As it processes symbols in the input, it not only updates its hidden state by the usual RNN rules, it also updates its weight parameters using the loss between its probability predictions and the ground-truth symbol. Not only is the RNN probability estimator trying to learn what dependencies exist in the new sequence, it has to learn *how* to learn those dependencies.

### Arithmetic coder

The RNN produces symbol probabilities at each step in the input sequence, but the algorithm needs a way to actually translate those probabilities to an encoding of the input. To do this, Tatwawadi uses a classical information theory tool called an arithmetic coder.

An arithmetic coder uses a numerical range to represent the input sequence. The range initially spans from 0.0 to 1.0, and is updated as follows:

**Predict the probability of each symbol appearing next in the input sequence.**This will probably come from some statistical model, like an RNN probability estimator.**Divide the coder’s current range into subsections.**There will be one subsection for each possible symbol, and the subsection’s length is proportional to the probability of its corresponding symbol (produced in step 1).**Read in the next symbol from the input sequence.****Set the coder’s current range to be that symbol’s subsection.**The coder’s range is now strictly smaller than it was before. The higher the predicted probability of the symbol, the longer the new range will be.**If there are more symbols in the input sequence, return to step 1.**The coder will continue updating its range for each symbol in the input sequence.

After reading in the entire input sequence, the coder is left with a range. The final encoding for the input sequence is the *binary fraction* representation for any number within that range. A binary fraction is like a regular decimal, except each digit represents a power of 2 instead of a power of 10. For example, 0.25 would be 0.01 as a binary fraction, since the second decimal place counts increments of $latex 2^{-2} = 0.25$. In the arithmetic coder, we ignore the leading zero and decimal point, since all numbers that could be conveyed are between 0.0 and 1.0.

Let’s walk through an example. Let’s say we want to encode the bit sequence $latex 1101$. For simplicity, our model always predicts a 0.25 probability for $latex 0$ and 0.75 for $latex 1$, regardless of the previous bits in the sequence. Our coder begins with the range 0.0 to 1.0:

Before reading the first bit, the coder divides its range into subsections. $latex 0$ has probability 0.25, so it gets the range 0.0 to 0.25. $latex 1$ has probability 0.75, so it gets the range 0.25 to 1.0. The divided range looks like this:

Now the encoder reads the first bit, which is a $latex 1$. The coder then updates its range to be the subsection assigned to $latex 1$; in this case, it’s 0.25 to 1.0:

There are still more symbols to encode, so the coder goes back to the first step. Our model produces the same probabilities, but now the coder uses them to subdivide its new range. This gives $latex 0$ the range 0.25 to 0.44 and $latex 1$ the range 0.44 to 1.0:

The coder reads the next bit, which is a $latex 1$ again, so again the coder updates its range to be the subsection assigned to $latex 1$:

For the next bit, the divided range looks like this:

The coder reads in the $latex 0$ and sets its range to be the subsection assigned to $latex 0$. Our range before the last bit is now 0.44 to 0.58:

For the last bit, the divided range looks like this:

The last bit is a $latex 1$, so our final range is 0.475 to 0.58:

Any number in this range can now be used to represent our input sequence. The number in the range with the shortest bit representation is 0.5, so we can select that as our encoding number:

0.5 is represented by the binary fraction $latex 0.1$, so we can now save or transmit our original sequence $latex 1101$ using the much shorter sequence $latex 1$.

When you want to decode this compressed sequence, you use the arithmetic coder in reverse. Starting again with the range 0.0 to 1.0, the coder generates the output sequence as follows:

**Predict the probability of each symbol appearing next in the output sequence.**Importantly, these probabilities must be produced by the same model that produced the probabilities in the encoder.**Divide the coder’s current range into subsections.**Again, there will be one subsection for each possible symbol, and the subsection’s length is proportional to the probability predicted in step 1 for that symbol.**Identify the subsection which contains the encoded number.**Remember that our encoding of the input sequence is the binary representation of a number between 0.0 and 1.0, and that number was contained in the final range of the arithmetic coder during the encoding step.**Add the symbol assigned to that subsection to the output sequence.**Our final range from the encoding step was contained in the range selected for each preceding step, so whichever symbol’s subsection contains our input encoding must be in the output sequence.**Set the coder’s current range to be that subsection.**The coder’s range is now exactly what it was during the corresponding step of the encoding process.**If there are more symbols to decode, return to step 1.**The end of the sequence might be indicated with a special end-of-message symbol, or we might know the number of symbols that should be in the output in advance. If we don’t see that EOM symbol or reach the known end of our sequence, we keep decoding.

We’ll illustrate this with the same example as before. Someone has sent us the sequence $latex 1$, and we’ll assume we know the original sequence was four bits long. The coder is initialized with the range 0.0 to 1.0, with the encoding number placed onto the range:

The coder has to use the same statistical model as during encoding to work properly, so it produces the following subsections:

Our encoding represents the number 0.5, which falls into the range for symbol $latex 1$, so the coder adds $latex 1$ to the output sequence and updates its range. The new range is:

Since the coder knows there are more bits to add to the output sequence, it again subdivides the range using the symbol probabilities:

Again, 0.5 falls into the section for symbol $latex 1$, so the output sequence is now $latex 11$. The new range becomes:

There still aren’t four bits in the output sequence, so the coder subdivides again:

Now, 0.5 falls into the range for $latex 0$, so the coder updates the output sequence to be $latex 110$. It takes symbol $latex 0$’s subdivision to be its new range:

Subdividing again, the range becomes:

0.5 now falls into the range of symbol $latex 1$ again, so the output sequence becomes $latex 1101$. It now consists of four bits, so the coder terminates, and we have our original sequence back.

In practice, we usually won’t know the exact length of the incoming sequence in advance, so we’d probably use a special end-of-message symbol to indicate to the coder when it should stop adding new symbols to the output sequence; otherwise, the coder could keep adding symbols forever.

### Encoding and decoding

DeepZip combines the RNN probability estimator and the arithmetic coder to encode input sequences, as seen here:

After the RNN probability estimator is initialized with random weights, the arithmetic coder encodes the first symbol in the input sequence using a default symbol distribution. That first symbol is then passed into the RNN probability estimator, which outputs probabilities for the next symbol. The arithmetic coder uses these probabilities to encode the second symbol. The weights of the RNN probability estimator are updated by comparing its predicted probabilities for the second symbol to the actual identity of the second symbol. Then, the second symbol is input to the RNN probability estimator, outputting probabilities for the third symbol, and the process continues until the input sequence is completely encoded.

Decoding works similarly:

Importantly, the RNN probability estimator is initialized with the exact same weights used at the start of encoding; this can be done by sending whatever random seed was used during encoding along with the actual encoding of the input. The arithmetic coder then uses the default symbol distribution to parse the first symbol from the encoding. This symbol is passed to the RNN probability estimator, which outputs a set of probabilities for the next symbol; if initialized correctly, these will be the exact probabilities output by the RNN during the first encoding step. These probabilities are used by the coder to parse the second symbol, which is used to update the weights of the RNN probability estimator. This weight update should be exactly the same as that used after the first step of the encoding process. The second symbol is then passed to the RNN probability estimator, which outputs probabilities for the third symbol, and the process continues until the coder reads an end-of-message symbol.

## DeepZip in practice

Tatwawadi applied DeepZip to some common challenge datasets, and achieved impressive results. DeepZip was able to encode the human chromosome 1, originally 240MB long, into a 42MB sequence, which was 7MB shorter than that produced by the best known DNA compression model, MFCompress. On text datasets, DeepZip achieved around 2x better compression than the common lossless compression algorithm GZIP, although compression models specifically designed for text performed slightly better. The full results can be found in the paper. It’s worth noting that DeepZip was significantly slower at compressing than the other models tested, which is to be expected when backpropagation needs to be performed at each step of the input sequence.

DeepZip was also applied to procedurally generated Markov-$latex k$ sources. A Markov-$latex k$ sequence is a series of $latex N$ numbers, each of which is anywhere from 1 to some constant $latex M$. The first $latex k$ symbols are random, then each subsequent symbol is equal to the previous symbol minus the symbol which appeared $latex k$ prior; that is,

$latex X_n = X_{n-1} – X_{n-k} \mod M$

If this isn’t quite clear, don’t worry; the main takeaway is that each symbol is dependent on one that’s $latex k$ steps away in the input sequence. Here’s how the rate of compression improved over time for different values of $latex k$ when DeepZip used a vanilla RNN as its RNN probability estimator:

What’s interesting here is that DeepZip was able to get roughly the same level of compression for all values of $latex k$ up to 35, after which it completely failed to compress the sequence at all. This suggests that the vanilla RNN can only “remember” symbols that fell within 35 steps of the current one. We can see the same plot for a DeepZip model which used a GRU as its RNN probability estimator:

While the vanilla RNN could only remember up to 35 symbols, the GRU seems to be able to remember up to 50. Tatwawadi proposes that this could be used as a test to compare different RNN flavors going forward: those which can compress higher values of $latex k$ might have better long-term memory than their counterparts. It’s a cool idea, and after a good amount of validation might spur new innovations in RNNs.

Hopefully, this post has helped you understand how neural networks can be used to create surprisingly effective lossless compressors. The DeepZip model — combining information theory techniques with neural networks — is a great example of cross-disciplinary work producing synergistic results, and one which shows the potential of this exciting area of research.

## One thought on “Lossless compression with neural networks”