Megha Srivastava and Vinay Sriram
Introduction
Existing image compressors compress images individually and therefore do not capture the similarities between different images. Our project studies whether similarities across images can be exploited for better compression. Specifically, we implement and analyze three algorithms that leverage image similarity in different ways. The first algorithm, average-image compression, compresses images by storing their average and the individual differences. The second method performs pixel-level clustering to quantize the images, and then compresses images with similar quantizations together. The final method is a generalization of two-dimensional JPEG compression to three dimensions, exploring how the 3D discrete cosine transform can be used on depth-concatenated similar images for compression. Each of the subsequent sections details and analyzes one of these approaches, and the final section of this blog describes our outreach activity at Nixon Elementary School.
Storing Averages and Differences
Our first method for compressing a group of images was to consider ways to take advantage of similar images in order to reduce the entropy of the files we need to store. We use the following approach (a short code sketch is given after the list):
- Cluster the $N$ images into $k$ clusters using k-means.
- Calculate the average image $A_j$ for each cluster $j$, and store these.
- Calculate the difference images $D_i$, where each $D_i$ is the difference between image $I_i$ and $A_{c(i)}$, the average image of the cluster $c(i)$ to which image $I_i$ was assigned.
- Store the average images, the difference images, and the mapping between the integer index of each file and the integer index of the cluster to which it is assigned.
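Below is a minimal Python sketch of this encoding step. It is an illustration rather than our exact implementation: the function name is ours, and the use of scikit-learn's KMeans for the clustering step is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def average_difference_encode(images, k):
    """images: list of equally sized H x W x 3 uint8 arrays; k: number of clusters."""
    # Cluster the raw pixel vectors with k-means (scikit-learn used for illustration).
    flat = np.stack([im.astype(np.float32).ravel() for im in images])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(flat)

    averages, differences = [], []
    for j in range(k):
        members = [im.astype(np.int16) for im, lbl in zip(images, labels) if lbl == j]
        avg = np.mean(members, axis=0).astype(np.int16)   # average image for cluster j
        averages.append(avg)
        # Signed differences: adding avg back reconstructs each member exactly.
        differences.extend(im - avg for im in members)
    return averages, differences, labels
```

Decoding simply adds each difference image back to its cluster's stored average.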
The motivation for this approach is that if we have very similar images, then we can capture the high-entropy information in one file, and low-entropy differences in the files representing each image. Since compression improves as entropy decreases, we would expect a gain in compression amount if the amount of information the average file can represent is high enough to offset the cost of storing an increased number of files. For instance, let us look at this small motivating example of four images:

As one can see, these images are very similar to each other. The average image is shown below on the left. This average image has higher entropy than any of the other starter images, but it is only one image. By contrast, an example of the difference between one of the images and the average is shown below on the right. Clearly, the entropy of this difference is much lower, as most of the image is now roughly the same pixel value. Thus, we hypothesize that compressing the four low-entropy differences and the high-entropy average will yield a lower file size than compressing the four original high-entropy images.

We use the following procedure to measure the compression rate (a code sketch follows the list).
- Save each image as a 3-D array to a binary file using numpy.save.
- Zip all binary files together to produce the final compressed representation.
- Report the total bits of the zip file divided by the total number of image pixels present in all of the images, representing the compression rate in BPP (bits per pixel).
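A sketch of this measurement procedure, assuming the arrays are written with numpy.save and bundled with Python's standard zipfile module (the file naming is illustrative):

```python
import io
import os
import zipfile
import numpy as np

def zip_bpp(arrays, total_pixels, zip_path="compressed.zip"):
    """Write each array as a .npy entry in a DEFLATE-compressed zip and report BPP."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i, arr in enumerate(arrays):
            buf = io.BytesIO()
            np.save(buf, arr)                       # serialize the array to .npy bytes
            zf.writestr(f"file_{i}.npy", buf.getvalue())
    zip_bits = 8 * os.path.getsize(zip_path)        # total size of the zip in bits
    return zip_bits / total_pixels                  # bits per pixel (BPP)
```

For the baseline, `arrays` is just the original images; for the averaging method, it is the averages plus the differences (and the small cluster-index mapping), with `total_pixels` unchanged.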
We evaluate this method relative to a baseline implementation, in which we simply store each image as a numpy array binary and compress those binaries into a zip. In the small motivating example above, the baseline has a final BPP given as follows.
Meanwhile, the average-image compression method achieves the following BPP.
This simple proof of concept illustrates the potential of our approach. However, in the real world, not all images are as similar as those presented in this example, and the tradeoff that comes with storing an additional image file can vary with the size of the dataset and the number of clusters. So how does this approach work in the “real world”?
Empirical Results
On a dataset of faces centered in an image frame with z-axis rotations, we vary the number of clusters to see how that affects the BPP. We computed the baseline BPP to be 5.94, which already represents a significant improvement over an uncompressed BPP of 24. The plot below quantifies how the averaging method's BPP compares to this baseline.

As we can observe, our approach does not beat the baseline on this larger, more realistic dataset, though it still achieves reasonable compression. Observing the trend provides some clarity: as the number of clusters increases, so does the number of extra files needed to store the average image of each cluster. That extra baggage causes a large overhead, increasing the BPP as the number of clusters grows.
Some insight into our model follows. Below are two example average files, computed with different numbers of clusters; comparing them demonstrates the increased specificity as the number of clusters increases.

Meanwhile, example difference files for the same two cluster counts show an increased level of detail as the number of clusters increases. This phenomenon is due to fewer images being averaged together: clusters as small as two images are formed, which can produce high-entropy differences.

Overall, it is clear that this approach is highly data-dependent. Future work along the lines of analyzing different difference concatenation methods and clustering methods might prove highly beneficial. More sophisticated techniques in these areas could potentially achieve better balances in the trade-off between number of files and the entropy within the files for multi-image compression.
K-Means Compression
Image Clustering Stage
We use image-level clustering before the compression stage: we downsample the original images and cluster the downsampled copies. For clustering, we stack each downsampled image matrix into a vector $x_i \in \mathbb{R}^{wh}$, for $i = 1, \dots, N$, where $N$ is the total number of images, $w$ is the downsampled width, and $h$ is the downsampled height. Note that we assume the images are grayscale. We then cluster the vectors $x_1, \dots, x_N$ into $m$ clusters, where cluster $j$ contains $M_j$ vectors. Each original image is then assigned to the cluster into which its downsampled representation was placed.
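A minimal sketch of this stage, assuming the grayscale images are stacked into a single NumPy array; the stride-based downsampling and the use of scikit-learn's KMeans are our own simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_images(images, n_clusters, factor=8):
    """images: array of shape (N, H, W) grayscale images; factor: downsampling stride."""
    small = images[:, ::factor, ::factor]            # crude downsampling by striding
    vectors = small.reshape(len(images), -1)         # stack each w x h image into a vector
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    return labels                                    # labels[i] = cluster index of image i
```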
Color Quantization Stage
This form of compression involves developing a low-entropy representation of each image using quantization based on color clusters. For a given image cluster (generated in the previous step), we perform the following. First, we create the set of 3-vectors, each representing an (R, G, B) value, comprising the colors present in all of the images in the cluster; let $n$ denote the cardinality of this set. We then cluster these 3-vectors into $k$ clusters using a weighted k-means algorithm. Specifically, each 3-vector $p_i$ is assigned a weight $w_i$ corresponding to the fraction of total pixels that have color $p_i$. Thus, we wish to cluster the points $p_1, \dots, p_n$ with corresponding weights $w_1, \dots, w_n$ into $k$ clusters.
The weighted k-means algorithm is described below (a NumPy sketch follows the list). Let $c_1, \dots, c_n$ denote the cluster assignments, where $c_i$ is the index of the cluster to which point $p_i$ is assigned.
1. Initialize the centroids $\mu_1, \dots, \mu_k$ by randomly sampling from $\{p_1, \dots, p_n\}$.
2. For each point $p_i$, assign $c_i = \arg\min_j \lVert p_i - \mu_j \rVert^2$.
3. Recompute each centroid as $\mu_j = \dfrac{\sum_{i : c_i = j} w_i\, p_i}{\sum_{i : c_i = j} w_i}$.
4. Repeat steps 2 and 3 until the clusters converge.
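Here is a minimal NumPy sketch of these steps; the fixed iteration count (rather than a convergence test) and the initialization details are simplifications.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iters=50, seed=0):
    """points: (n, 3) array of RGB colors; weights: (n,) fractions of pixels per color."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by sampling k distinct colors.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 2: assign each color to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the weighted mean of its assigned colors.
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = weights[mask]
                centroids[j] = (w[:, None] * points[mask]).sum(axis=0) / w.sum()
    return centroids, assign
```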
Once we have the centroid colors, we replace every pixel in each image in an image cluster with the index of the centroid whose color is closest to that pixel's color. Thus, if there are $M$ total images in a cluster, each image becomes a matrix of centroid indices $Q_i$ for $i = 1, \dots, M$.
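Continuing the sketch above, the per-pixel index replacement can be written as follows (array shapes and names are illustrative):

```python
import numpy as np

def quantize_image(image, centroids):
    """image: (H, W, 3) array; centroids: (k, 3) colors from weighted k-means."""
    pixels = image.reshape(-1, 3).astype(float)
    # Index of the closest centroid for every pixel (Euclidean distance in RGB space).
    idx = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return idx.reshape(image.shape[:2]).astype(np.uint8)   # assumes k <= 256 colors
```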
Tensor Compression Stage
For each cluster of images generated in the previous stage, we depth-concatenate the $M$ quantized images in the cluster into a tensor of size $W \times H \times M$. Then, we apply run-length encoding on fixed-size blocks of this tensor, first flattening each block into a vector. Concatenating the run-length results over all of the blocks, we obtain data $(s_1, r_1), \dots, (s_T, r_T)$, where $T$ is the total number of runs and $r_t$ is the run length of symbol $s_t$. This data is concatenated into a bitstream $B$. The total number of bits needed to represent all of the images is then just $|B|$, and the compression rate (in bits per pixel) is $|B|$ divided by the total number of pixels across the images.
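A sketch of this naive block-wise run-length scheme; the block size and the per-run bit cost are illustrative assumptions (a real encoder would entropy-code the (symbol, run) pairs):

```python
import numpy as np

def run_length_encode(vector):
    """Return (symbol, run_length) pairs for a 1-D array of centroid indices."""
    runs, start = [], 0
    for i in range(1, len(vector) + 1):
        if i == len(vector) or vector[i] != vector[start]:
            runs.append((int(vector[start]), i - start))
            start = i
    return runs

def cluster_bpp(tensor, block=8, bits_per_run=16):
    """tensor: (W, H, M) depth-concatenated quantized images for one cluster."""
    W, H, M = tensor.shape
    total_bits = 0
    for x in range(0, W, block):
        for y in range(0, H, block):
            flat = tensor[x:x+block, y:y+block, :].ravel()   # flatten the block naively
            total_bits += len(run_length_encode(flat)) * bits_per_run
    return total_bits / (W * H * M)                          # bits per pixel for the cluster
```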
We consider a motivating example of four similar images below (representing one potential cluster). As is evident, as the number of unique colors used in the representation increases, the reconstruction accuracy improves; at the same time, the number of bits per pixel used to compress the images increases.

The intuition behind the compression rate savings can be found by examining the histogram of colors present within the images. Evidently, certain color intensities (towards the center of the range of possible intensities) are dominant, so the clustering tends to capture these colors. The chosen colors are those that minimize the overall Euclidean reconstruction error between the quantized images and the originals.

Empirical Results
For empirical analysis, we once again use the test dataset of face images. The plot below illustrates the tradeoff between reconstruction accuracy and compression ratio on the full set of images as we vary the number of color clusters $k$ for a fixed number of image clusters $m$. For this method, we use a stricter baseline with a higher compression ratio than in the previous method: a compressed zip folder of the images encoded as JPEGs. This baseline achieves 0.77 BPP, so we beat it only for sufficiently small values of $k$.

Note that as $k$ increases, the reconstruction loss decreases. This is the key tradeoff: while we can achieve better compression ratios with heavier quantization (using fewer color clusters), the reconstruction loss increases as a result. This phenomenon is illustrated graphically in the simple example above. The result is fully intuitive: the lower the entropy of the approximation we create, the higher the compression rate but the lower the reconstruction accuracy.
Our methodology for this approach could be improved in the future through the use of more intelligent run-length encoding. Typically, zig-zag patterns are used within the individual compression blocks in order to exploit local similarities between pixels. For simplicity, however, we naively concatenate the blocks into vectors (which define the runs), so the compression rates are not as good as they could be.
3D JPEG
JPEG compression relies upon the 2D discrete cosine transform (DCT) [1]. The central idea behind classic JPEG compression is to partition the source image into small (typically 8 x 8) blocks and take the 2D DCT of each block. After quantizing the DCT matrix of each block, many coefficients become zero, enabling significant compression. Here, we analyze a 3D generalization of this algorithm for multiple images.
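For reference, the 2D version of this idea on a single 8 x 8 block can be sketched with SciPy; the smooth test block and the flat quantization step are illustrative stand-ins for a real image patch and the standard JPEG quantization table:

```python
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8 x 8 block, standing in for a patch of a natural image.
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = (8 * x + 4 * y).astype(float)

coeffs = dctn(block - 128, norm="ortho")            # 2D DCT-II of the centered block
q = 16.0                                            # illustrative flat quantization step
quantized = np.round(coeffs / q)                    # most high-frequency entries round to zero
reconstructed = idctn(quantized * q, norm="ortho") + 128

print(int((quantized == 0).sum()), "of 64 coefficients are zero")
```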
As in the previous algorithm, we cluster the images into $m$ clusters and then depth-concatenate the images within a given cluster (cluster $j$ contains $M_j$ images) to create a 3D tensor. Once each cluster's tensor is generated, we apply the following procedure for compression.
1. For each cluster, we segment the 3D tensor into $N \times N \times N$ blocks $f$ and use the 3D discrete cosine transform to generate frequency-domain blocks $F$ of the same dimensions as the original blocks. The forward and inverse DCT transforms are given below, as provided by [5].

$$F(u,v,w) = \alpha(u)\,\alpha(v)\,\alpha(w) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \sum_{z=0}^{N-1} f(x,y,z)\, \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right] \cos\!\left[\frac{(2z+1)w\pi}{2N}\right]$$

$$f(x,y,z) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \sum_{w=0}^{N-1} \alpha(u)\,\alpha(v)\,\alpha(w)\, F(u,v,w)\, \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right] \cos\!\left[\frac{(2z+1)w\pi}{2N}\right]$$

Note that in the above equations, we define $\alpha(u) = \sqrt{1/N}$ for $u = 0$ and $\alpha(u) = \sqrt{2/N}$ for $u > 0$. (A code sketch of steps 1-3 is given after this list.)
2. For each block in each cluster, we quantize using a quantization matrix defined following [2][3], where $(i, j, k)$ index the entries of the quantization matrix. Note that $Q$ represents a quantization level, effectively indicating how much to round each DCT coefficient.
3. For each block in each cluster, we filter the zero components, and then encode each block’s coefficients using Huffman encoding. The result is a bit stream for each block.
4. We sum the encoded bit stream sizes for each of the blocks in a cluster to obtain the cluster compressed size (in bits), and then sum over all clusters for the total compressed size.
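The sketch below condenses steps 1-3 for a single cluster tensor, using SciPy's dctn for the 3D transform. The cubic block size, the flat quantization step (a stand-in for the matrix of [2][3]), and the size estimate that Huffman-codes only the nonzero coefficients are all simplifying assumptions.

```python
import heapq
from collections import Counter
import numpy as np
from scipy.fft import dctn

def huffman_bits(symbols):
    """Total encoded length, in bits, of an optimal Huffman code for the symbol stream."""
    counts = list(Counter(symbols).values())
    if len(counts) < 2:
        return len(symbols)                         # degenerate single-symbol stream
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a, b = heapq.heappop(counts), heapq.heappop(counts)
        total += a + b                              # weighted path length accumulates here
        heapq.heappush(counts, a + b)
    return total

def compress_cluster(tensor, block=8, q=16.0):
    """tensor: (H, W, M) depth-concatenated cluster; returns an estimated size in bits."""
    H, W, M = tensor.shape
    symbols = []
    for x in range(0, H, block):
        for y in range(0, W, block):
            for z in range(0, M, block):
                cube = tensor[x:x+block, y:y+block, z:z+block].astype(float)
                coeffs = dctn(cube, norm="ortho")            # 3D DCT-II of the block
                quant = np.round(coeffs / q).astype(int)     # flat quantization step
                symbols.extend(quant[quant != 0].tolist())   # keep only nonzero coefficients
    return huffman_bits(symbols)
```

Summing this estimate over all clusters (step 4) and dividing by the total number of pixels gives a BPP figure comparable to the earlier methods.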
Empirical Results
The below plots illustrate both (1) the relationship between the level of quantization and the compression ratio and (2) the relationship between the reconstruction error on the full set of images and the level of quantization. Evidently, there is a tradeoff between reconstruction accuracy and compression ratio.

Note that in this case, none of the quantization levels beat the baseline performance with a tolerable reconstruction error. Of particular interest in this plot, however, is that both the compression rate and the reconstruction error asymptotically level out as the quantization level $Q$ increases. Therefore, the primary benefit of this method is that, even with near-zero reconstruction error (near-lossless compression), we can still achieve a significant compression factor. Part of the reason this method does not perform as well as classic JPEG is the lack of ad-hoc optimizations: because industry-standard JPEG compressors perform many side optimizations (conversion to the YCbCr color space, chrominance downsampling, etc.), this compressor does not achieve a comparable BPP. Future work on 3D JPEG could include developing more effective quantization schemes. While 2D quantization tables are readily available and have largely been optimized, very little work has been done on 3D quantization tables. More sophisticated approaches could result in improved multi-image compressors.
Outreach Activity
For our outreach project, we wanted to communicate the idea that “it is easier to remember the difference between two images than remember the two images themselves”, which was the premise for our project’s methods. To teach this idea to elementary school students, we presented them with a series of “Spot the Difference” games, through which they competed for a spot on our leaderboard. The goal was, in 3 minutes, to spot as many differences as possible between two similar images. The below panel is an example of two similar images that were displayed to the students.
The students were then presented with a “difference” slide (a black-and-white heat map, see below) highlighting the specific areas where the images differ from each other.
When this difference slide and the main slide were flashed back and forth quickly, the differences popped out, and the students found them much more easily. The difference slide makes the task considerably easier, and demonstrates that storing the information of two whole images is more difficult than viewing the second image from the vantage point of the first one plus the differences between them.
References
[1] Townsend, A. (2017, March 15). JPEG: Image compression algorithm. Retrieved from http://pi.math.cornell.edu/~web6140/TopTenAlgorithms/JPEG.html
[2] Lee, M., Chan, R. K., & Adjeroh, D. A. (1997). Quantization of 3D-DCT Coefficients and Scan Order for Video Compression. Journal of Visual Communication and Image Representation, 8(4), 405-422. doi:10.1006/jvci.1997.0365
[3] Bozinovic, N., & Konrad, J. (2003). Scan order and quantization for 3D-DCT coding. Visual Communications and Image Processing 2003. doi:10.1117/12.503324
[4] Ng, A. (2018). The k-means Clustering Algorithm. CS 229 Lecture Notes. http://cs229.stanford.edu/notes/cs229-notes7a.pdf
[5] Boussakta, S., & Alshibami, H. O. (2004). Fast algorithm for the 3-D DCT-II. IEEE Transactions on Signal Processing, 52(4), 992-1001. ISSN 1053-587X.