Saliency-Conditioned Generative Compression

EE376A (Winter 2019)

Authors: Akihiro Matsukawa, Rafael Mitkov Rafailov, Michael Yan (undergraduate volunteer), Jordan Nicholson (undergraduate volunteer)


In the era of information technology, the amount of data being produced and consumed is growing at an exponential rate. In this context, the problem of storing and communicating data effectively is an increasingly important challenge. So far, lossy compression algorithms have been general-purpose, hand-designed codecs such as JPEG. Such a hand-crafted compression scheme must trade off its ability to compress specific classes of data (compression performance) against its ability to compress all classes of data (generality).

Recent developments in the use of machine learning for compression have the potential to automatically design lossy compression schemes tailored to a specific distribution of data. Such algorithms typically train probabilistic deep learning models to directly optimize the balance between the entropy rate of the codes and the distortion of the reconstruction on the observed data distribution.
In this framework of generative compression, the choice of distortion metric is very important. While many such metrics exist, we observe that the true metric we want to optimize is the distortion perceived by a human. In that vein, we explore the use of visual saliency prediction to guide reconstruction: we attempt to reconstruct the areas that humans are more likely to look at with higher quality, possibly at the cost of other areas.

Literature Survey

We provide a literature survey of relevant methods. The list here has been pruned to what is most immediately relevant to the rest of the blog post. For a full list of the literature we reviewed, see here. We will use methods, diagrams, and equations from these papers in the rest of the post. Rather than cite each one individually, consider this our blanket citation for the entire blog post.

Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model (link)

Rather than the standard approach of using a feed-forward CNN to predict a saliency mask, similar to semantic segmentation, the authors propose an iterative attention-based method that uses a recurrent neural network to incrementally refine the saliency mask. This is the saliency model we used; code is here:

Generative Adversarial Networks for Extreme Learned Image Compression (link)

The authors propose a conditional-GAN setup where an encoder-quantizer q(E(.)) produces the conditioning codes for a GAN, which are optionally concatenated with a noise source and go through the normal GAN loss. The loss is augmented with distortion and bitrate terms.

Rather than fitting the entropy directly, the authors bound it from above by quantizing the codes w to L possible values, which caps the entropy at dim(w)log(L).
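To make the bound concrete, here is a small sketch. The code-tensor shape and the number of levels are made-up values for illustration, not settings taken from the paper or from our experiments, and the function name is our own. The bound is in bits when the logarithm is taken base 2.

```python
import math


def entropy_upper_bound_bits(code_dim: int, num_levels: int) -> float:
    """Upper bound on the entropy of the codes, in bits: dim(w) * log2(L)."""
    if code_dim < 0 or num_levels < 1:
        raise ValueError("code_dim must be >= 0 and num_levels must be >= 1")
    return code_dim * math.log2(num_levels)


# Hypothetical example: a 32 x 64 x 8 code tensor quantized to L = 5 levels,
# describing a 512 x 256 image.
code_dim = 32 * 64 * 8
bound_bits = entropy_upper_bound_bits(code_dim, num_levels=5)

# Dividing by the number of image pixels gives a bound in bits per pixel.
bpp_bound = bound_bits / (512 * 256)
```

Because the bound depends only on the code shape and L, the bitrate can be capped by architecture choices alone, without training an explicit entropy model.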

End-to-end Optimized Compression (link)

The authors propose an autoencoder structure that directly optimizes for rate and distortion, along with architectural choices. They also present a custom flexible discrete distribution which can be optimized for lower entropy. Finally, the authors also show that under certain choices of the distortion metric and priors, their framework is equivalent to optimizing a variational autoencoder.

Loss Functions for Image Restoration with Neural Networks (link)

The authors argue that a convex combination of SSIM and L1 is a better metric than L2. L2 penalizes small errors only weakly, and L1 seems to do better than L2 on its own. SSIM is structural, whereas L1 and L2 are purely per-pixel, so tuning the convex combination of the two can optimize for both at the same time.
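Written out, a convex combination of this kind takes the following generic form. The symbols \alpha (the mixing weight) and N (the number of pixels) are our notation for illustration; the paper's exact formulation may differ in detail.

```latex
L_{mix} = \alpha \, L_{SSIM}(x, \hat{x}) + (1 - \alpha) \, L_{1}(x, \hat{x}), \qquad \alpha \in [0, 1]

L_{SSIM}(x, \hat{x}) = 1 - \mathrm{SSIM}(x, \hat{x}), \qquad
L_{1}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - \hat{x}_i \right|
```

Setting \alpha = 0 recovers a pure L1 loss and \alpha = 1 a pure SSIM loss.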

Framework & Approach

Lossy Compression

Lossy compression optimizes a trade-off between entropy and distortion. Letting z = q(E(x)) be the quantized codes and \hat{x} = D(z) be the reconstruction,

L_{c} = H(z) + \mathop{\mathbb{E}}\left[d\left(x, \hat{x}\right)\right]
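As a toy numerical illustration of this trade-off, the sketch below uses a deliberately trivial codec: the "encoder" divides by a step size, the quantizer rounds to integers, and the "decoder" multiplies back. The empirical entropy of the symbols stands in for H(z) and MSE stands in for d. The function names, the step sizes, and the `weight` parameter (set to 1.0 to match the equation above) are assumptions made for this example only.

```python
import math
from collections import Counter


def empirical_entropy_bits(symbols):
    """Empirical entropy (bits per symbol) of a sequence of discrete symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rate_distortion(x, step, weight=1.0):
    """Return (loss, rate, distortion) for a toy scale-and-round codec."""
    z = [round(v / step) for v in x]      # z = q(E(x))
    x_hat = [zi * step for zi in z]       # x_hat = D(z)
    rate = empirical_entropy_bits(z)      # stand-in for H(z)
    distortion = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)  # MSE
    return rate + weight * distortion, rate, distortion


x = [0.1, 0.9, 1.1, 1.9]
fine = rate_distortion(x, step=1.0)    # three distinct symbols, small error
coarse = rate_distortion(x, step=4.0)  # every value maps to 0: zero rate, large error
```

A finer quantization step spends more bits and lowers the distortion, while a coarser one does the opposite; the loss L_c scores both effects with a single number.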

Generative Compression

Generative compression trains a generative machine learning model for the encoder E_\theta and decoder D_\phi, and possibly a distribution over the codes, or otherwise minimizes their entropy H_\gamma. It also introduces its own loss L_g needed to optimize the generative model, such as an adversarial loss in a GAN framework.

L_{gc} = L_{g} + L_{c} = L_{g}  +  H_\gamma(z) + \mathop{\mathbb{E}}\left[d\left(x, D_\phi\left(E_\theta(x)\right)\right)\right]

Saliency Conditioning

To introduce saliency s, we condition the encoder and the distortion loss on it. Note that our approach is intended to guide the compressed codes to focus on salient regions, and therefore the saliency map is not passed to the decoder. This means the saliency map itself does not need to be compressed.

L_{gc} = L_{g} + L_{c} = L_{g}  +  H_\gamma(E_\theta(x, s)) + \mathop{\mathbb{E}}\left[d\left(x, D_\phi\left(E_\theta(x, s)\right), s \right)\right]


While the project was educational, we were not able to conclusively demonstrate the benefit of saliency conditioning on compression: difficulties in reproducing existing methods limited the time we had left to run saliency experiments. This section chronicles our experience.

Dataset & Saliency

We decided to use the Cityscapes dataset for our experiments, since this seemed to be a common baseline in generative compression machine learning literature (although admittedly, not compression literature in general). We used the leftImage set, and rescaled to 512×256 pixels to speed up experiments.

We prepared our data using the pre-trained models released with the saliency model described above. Here are a few sample images with their saliency masks:

Initially, we saved these masks as numpy arrays, and they were surprisingly large in npz format, perhaps due to the lack of compression. In the end, we took advantage of the fact that the png format supports an alpha channel: we saved the saliency masks into that channel and split them out after the data was loaded in TensorFlow.
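The packing and splitting steps can be sketched with NumPy arrays as follows. This is an illustration under assumptions, not our actual pipeline: the function names are hypothetical, the mask is assumed to be a float array in [0, 1], and the PNG writing and the TensorFlow loading code are omitted.

```python
import numpy as np


def pack_rgba(image_rgb, saliency):
    """Stack a uint8 RGB image (H, W, 3) and a [0, 1] saliency mask (H, W)
    into one uint8 RGBA array (H, W, 4), with the mask in the alpha channel."""
    alpha = np.round(np.clip(saliency, 0.0, 1.0) * 255.0).astype(np.uint8)
    return np.concatenate([image_rgb, alpha[..., None]], axis=-1)


def split_rgba(rgba):
    """Inverse of pack_rgba after loading: float image and mask in [0, 1]."""
    image = rgba[..., :3].astype(np.float32) / 255.0
    saliency = rgba[..., 3:].astype(np.float32) / 255.0
    return image, saliency


# Random stand-ins for one 512x256 image and its saliency mask.
rng = np.random.default_rng(0)
image_rgb = rng.integers(0, 256, size=(256, 512, 3), dtype=np.uint8)
mask = rng.random((256, 512))

rgba = pack_rgba(image_rgb, mask)          # this array is what would be written as a PNG
image, saliency = split_rgba(rgba)
```

Storing the mask this way quantizes it to 8 bits, so the recovered values differ from the original floats by at most about 0.5/255.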

Initial baseline attempt

After our literature review, we initially decided to use the method in Generative Adversarial Networks for Extreme Learned Image Compression as our baseline. We made this choice because it seemed to be the most principled GAN approach in our literature survey, and because the authors had also conducted experiments on conditionally compressing only parts of the image based on a Cityscapes semantic mask, which seemed similar to our task.

We found a third-party implementation which claimed good results in its README, so we were optimistic. However, after many attempts, we found that the implementation seemed to overfit on the training set and gave poor performance on the test set:

Training set. Left: Original, right: reconstruction.
Test set. Left: Original, right: reconstruction.

We spent a few weeks (training time was long) trying a variety of remedies to alleviate the overfitting, but could not find a good solution:

  • Whether or not to augment codes with sampled noise.
  • Various distortion metrics (MSE, MS-SSIM) and their weights.
  • Batch size & learning rates.

Unable to reproduce this baseline, we moved on to a new baseline implementation of a different compression algorithm.

A new baseline & saliency conditioning

In order to attempt our idea, we tried a different baseline implementation of generative compression. We chose the algorithm presented in End-to-end Optimized Compression, using its reference implementation.

The algorithm uses MSE as the distortion loss, which we weighted point-wise with the saliency mask. We did not use the raw saliency mask directly as the weight, since the majority of the mask is very close to 0, which would place almost no distortion loss on those pixels.
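One possible way to keep a nonzero weight everywhere is to blend the mask with a constant floor. The sketch below illustrates that idea only; it is not necessarily the weighting used in the experiments, and the function name, the `floor` value, and the normalization by total weight are all assumptions.

```python
import numpy as np


def saliency_weighted_mse(x, x_hat, saliency, floor=0.1):
    """Saliency-weighted MSE for x, x_hat of shape (H, W, C) and a
    saliency mask of shape (H, W) with values in [0, 1].

    `floor` is a hypothetical hyperparameter: each pixel gets the weight
    floor + (1 - floor) * saliency, so non-salient pixels still contribute.
    """
    weights = floor + (1.0 - floor) * saliency                # values in [floor, 1]
    weights = np.broadcast_to(weights[..., None], x.shape)    # same weight for every channel
    sq_err = (x - x_hat) ** 2
    # Normalize by the total weight so the scale stays comparable to plain MSE.
    return float(np.sum(weights * sq_err) / np.sum(weights))
```

With `floor=1.0` every pixel has weight 1 and the function reduces to ordinary MSE; smaller values shift more of the penalty onto salient regions.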

We found mixed results. On the one hand, the saliency-weighted model seemed to produce subtly better reconstructions, especially in areas with high saliency. Here are two samples from the saliency-guided compression results.

The difference is subtle, but notice the text and slight details in the license plates in high saliency regions.

So did we succeed? It turns out, not quite. On the other hand, we found that weighting the loss with the saliency masks in this way seemed to affect the entropy optimization of the codes. Both saliency-guided images actually have a slightly higher bpp, which may account for the slight improvements in reconstruction.

Bits per pixel. Orange is baseline, blue is saliency-conditioned.
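For reference, bits per pixel is the size of the compressed representation divided by the number of image pixels. The helper below is a minimal sketch with a hypothetical name and a made-up stream size, not a measurement from our runs.

```python
def bits_per_pixel(num_bytes: int, height: int, width: int) -> float:
    """Bits per pixel of a compressed stream of `num_bytes` bytes
    describing a height x width image."""
    return 8.0 * num_bytes / (height * width)


# Hypothetical example: a 2048-byte stream for one 512x256 image.
example_bpp = bits_per_pixel(2048, height=256, width=512)  # 16384 bits / 131072 pixels = 0.125
```

Two models are only directly comparable on reconstruction quality when they are evaluated at matching bpp, which is why the higher bpp of the saliency-guided model weakens the visual comparison.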

Conclusion and Future Work

In hindsight, we realized that most existing generative compression algorithms employ strided convolutions and transposed convolutions with quantization of the codes. This means that the produced code is not only of fixed length (though smaller than the original image), but each individual code element can only represent the receptive field of the convolution.
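The size of that receptive field can be worked out with the standard recurrence for stacked convolutions. The layer configuration below is hypothetical and is not taken from any of the implementations we used; dilation is included as a parameter because it is one common way to enlarge the receptive field without changing the stride.

```python
def receptive_field(layers):
    """Receptive field of one output element of a stack of convolutions.

    `layers` is a list of (kernel_size, stride, dilation) tuples, ordered
    from input to output. Returns (receptive_field, jump), both measured in
    input pixels along one spatial dimension; `jump` is the distance between
    the centers of adjacent output elements.
    """
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf, jump


# Hypothetical encoder: four 5x5 convolutions, each with stride 2.
plain = receptive_field([(5, 2, 1)] * 4)                   # (61, 16)

# Same encoder with dilation 2 in the last layer.
dilated = receptive_field([(5, 2, 1)] * 3 + [(5, 2, 2)])   # (93, 16)
```

In this example each code element sees a 61-pixel-wide window and neighboring code elements are 16 pixels apart, so the number of code elements per image region is fixed by the architecture regardless of how salient the region is.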

We tried inserting dilated convolution layers to expand the receptive field, but that seemed to hurt performance overall and introduced checkerboard artifacts into the reconstruction. We had spent a significant amount of the project time trying to reproduce our first baseline. Had we had more time, we could have experimented more.

Reconstruction of model w/ dilated convolution

There is existing prior work that uses recurrent neural networks to iteratively generate codes, and we believe such a framework may lend itself better to saliency conditioning.


For our outreach, we tried to illustrate the minimax game between the generator and discriminator of a generative adversarial network. We created a handout on the topic, explaining neurons and framing this as a game between a forger and a detective. We then asked the children to play a game against the forger inside the computer, where they had to play the role of the detective. We presented the kids with samples from the generator as training progressed, which meant that guessing the fakes became harder and harder. They seemed to have a lot of fun with it!
