**Megha Srivastava and Vinay Sriram**

**Introduction**

Because existing image compressors compress images individually, such compressors do not capture the similarities between different images during compression. Our project aims to study whether similarities across images can be exploited for better compression. Specifically, we implement and analyze three algorithms that leverage image similarity in different ways. The first algorithm, average image compression, compresses images by storing their average and the individual differences. The second method performs pixel-level clustering to quantize the images, and then compresses images with similar quantizations together. The final method is a generalization of two-dimensional JPEG compression to three dimensions, exploring how the 3D discrete cosine transform can be used on depth-concatenated similar images for compression. Each of the subsequent sections details and analyzes one of the above approaches, and the final section of this blog describes our outreach activity at Nixon Elementary School.

**Storing Averages and Differences**

Our first method towards compressing a group of images was to consider ways we can take advantage of similar images to reduce the entropy of the files needed to store. We thought of the following approach:

- Cluster $latex N$ images into $latex K$ clusters using k-means.
- Calculate the average image for each each cluster, $latex \{A_1,…,A_K\}$, and store these.
- Calculate the difference images $latex \{D_1,…,D_N\}$, where each $latex D_i$ is the difference between that image $latex i$ and $latex A_{c(i)}$, the average image for the cluster to which image $latex i$ was assigned.
- Store the average images, difference images, and mapping index between the integer index of a file and the integer value of the cluster to which it is assigned.

The motivation for this approach is that if we have very similar images, then we can capture the high-entropy information in **one file**, and low-entropy differences in the files representing each image. Since compression improves as entropy decreases, we would expect a **gain **in compression amount if the amount of information the average file can represent is high enough to offset the cost of storing an increased number of files. For instance, let us look at this small motivating example of four images:

As one can see, these images are very similar to each other. The average image is shown below on the left. This average image has higher entropy than any of the other starter images, but it is only one image. By contrast, an example of the difference between one the images and the average is shown below on the right. Clearly, the entropy of this difference is a lot lower, as most of the image is now roughly the same pixel value. Thus, we hypothesize that compressing the four low-entropy differences and the high-entropy average will yield a lower file size compared to compressing the four original high-entropy images.

We use the following procedure to measure the compression rate.

- Save each image as 3-d matrix (using the numpy.save package) to a binary file.
- Zip all binary files together to produce the final compressed representation.
- Report the total bits of the zip file divided by the total number of image pixels present in all of the images, representing the compression rate in BPP (bits per pixel).

We evaluate this method relative to a baseline implementation, in which we simply store each image as a numpy array binary and and compress those binaries into a zip. In the small motivating example above, the baseline has a final BPP given as follows.

$latex BPP_{baseline} = \frac{25,775 \times 8}{224 \times 224 \times 4} = 1.03$

Meanwhile, the average-image compression method achieves the following BPP.

$latex BPP_{method} = \frac{21,886 \times 8}{224 \times 224 \times 4} = 0.87$

This simple proof-of-concept illustrates the potential of our approach. However, in the real world, not all images are as similar as those presented this example, and the tradeoff that comes with storing an additional image file can vary based on the sizes of our dataset and the number of clusters. So how does this approach work in the “real world”?

**Empirical Results**

On a dataset of faces centered in an image frame with z-axis rotations, we vary the number of clusters to see how that affects the BPP. We computed the baseline BPP to be 5.94, which already represents represents significant improvement over an uncompressed BPP of 24. We quantify in the below plot how the averaging method BPP compares to this baseline.

As we can observe, our approach does not perform successfully (relative to the baseline) on the larger, more realistic dataset, but still achieves reasonable compression. Observing the trend provides some clarity. As the number of clusters increases, so does the number of extra files needed to maintain the average image for each cluster. That extra baggage clearly causes a large overhead, thus increasing the BPP as the number of clusters increases.

Some insight into our model follows. Below are two example average files. The left image is for $latex K=5$ clusters, while the right image is for $latex K=125$ clusters. This demonstrates the increased specificity as the number of clusters increases.

Meanwhile, example difference files for $latex k=5$ (left) and $latex k=125$ (right) show an increased level of detail as the number of clusters increases. This phenomenon is due to fewer images being averaged together, and so clusters as small as 2 images are formed, which can cause high-entropy differences.

Overall, it is clear that this approach is highly data-dependent. Future work along the lines of analyzing different difference concatenation methods and clustering methods might prove highly beneficial. More sophisticated techniques in these areas could potentially achieve better balances in the trade-off between number of files and the entropy within the files for multi-image compression.

**K-Means Compression**

**Image Clustering Stage**

We use image-level clustering before the compression stage. Here, we downsample the original images and cluster them. For clustering, we stack each downsampled image matrix $latex I_i \in \{0,…,255\}^{w_d \times h_d}$, into a vector $latex v_i \in \{0,…,255\}^{w_dh_d}$, for $latex i \in \{1,…,M\}$, where $latex M$ represents the total number of images, $latex w_d$ represents the downsampled width, and $latex h_d$ represents the downsampled height. Note that we assume the images are grayscale. We then cluster the vectors $latex \{v_1, …, v_M\}$ into $latex K$ clusters, each of size $latex d = \lfloor \frac{M}{K} \rfloor$. Then, each original image is assigned to the cluster into which its downsampled representation was clustered.

**Color Quantization Stage**

This form of compression involves developing a low-entropy representation of each image using quantization based on color clusters. For a given image cluster (generated in the previous step), we perform the following. First, we create a set of 3-vectors, each of which represents an (R, G, B) value. This set comprises the colors present in all of the images in a given cluster. Let $latex N$ denote the cardinality of this set. We cluster the 3-vectors into k clusters using the weighted k-means algorithm. Specifically, each 3-vector $latex x_i \in \{0,…,255\}^3$ is assigned a weight $latex w_i \in [0,1]$ corresponding to the fraction of total pixels that have color $latex x_i$. Thus, we wish to cluster points $latex \{x_1, …, x_N\}$ with corresponding weights $latex \{w_1, …, w_N\}$ into $latex k$ clusters.

The weighted k-means algorithm is described below. Let $latex \{a_1, …, a_N\}$ denote the set of cluster assignments. Specifically, $latex a_i \in \{0,…,k-1\}$ corresponds to the index of the cluster to which point $latex i$ is assigned.

- Initialize the centroids $latex \{c_1, …, c_k\}$ by randomly sampling from $latex \{x_1,…,x_N\}$.
- For each point $latex x_i$, assign $latex a_i = \text{argmin}_{j \in \{1, …, k\}} ||x_i – c_j||^2$
- Recompute each centroid $latex c_j = \frac{\sum_{i=1}^{N} \{a_i = j\} w_i x_i}{\sum_{i=1}^{N} \{a_i = j\} w_i}$.
- Repeat steps 2 and 3 until the clusters converge.

One we have $latex k$ centroid colors, we replace every pixel in each image in an image cluster with the index of the centroid whose color is closest to that pixel’s color. Thus, if there are $latex d$ total images in a cluster, each image becomes a matrix $latex I_m \in \{1,…,k\}^{w \times h}$ for $latex m \in \{1,…,d\}$.

**Tensor Compression Stage**

For each cluster $latex i \in \{1,…,K\}$ of images generated in the previous stage, we depth-concatenate the $latex d$ images in this cluster into a tensor of size $latex w \times h \times d$. Then, we apply run-length encoding on $latex 20 \times 20 \times d$ blocks of the image, first transforming each block into a vector of length $latex 400d$. Concatenating the run length results over all of the blocks, we obtain data $latex \{s_1, …, s_L\}, \{r_1, …, r_L\}$, where $latex L$ is the total quantity of runs and $latex r_j$ is the run length of symbol $latex s_j$. This data is concatenated into a bitsream $latex b_i$. The total bits needed to represent all of the images is then just $latex b_1 + b_2 + … + b_K$. The compression rate (in bits per pixel) is then given as follows.

$latex r = \frac{b_1 + b_2 + … + b_K}{8Mwh}$

We consider a motivating example of four similar images below (representing one potential cluster). As is evident, as the number of unique colors used in the representation increases, the reconstruction accuracy becomes better. By contrast, the number of bits per pixel used to compress the image increases.

The intuition behind the compression rate savings can be found by examining the histogram of colors present within the images. Evidently, we see that certain color intensities (towards the center of the range of possible intensities) are dominant, and so the clustering will have a tendency to capture these colors. The chosen $latex k$ colors are those that minimize the overall euclidean reconstruction error between the quantized image and the original image.

**Empirical Results**

For empirical analysis, we once again use the test dataset of face images. The below plot illustrates a tradeoff between the reconstruction accuracy and the compression ratio on the full set of images. We vary $latex k$ for a fixed number of image clusters $latex K = 150$. For this method, we use a stricter baseline that has higher compression ratio than the previous method – a compressed zip folder of the images encoded as JPEGs. This baseline achieves 0.77 BPP, and so we are able to beat the baseline only for cluster sizes under $latex k = 8$.

Note that as $latex k$ increases, the reconstruction loss decreases. This is the key tradeoff. While we are able to achieve better compression ratios with heavier quantization (using fewer image clusters), the reconstruction loss also increases as a result. This phenomenon is illustrated graphically in the simple example above. This result is fully intuitive – the lower entropy approximation we create, the higher the compression rate but the lower the reconstruction accuracy.

Our methodology for this approach could be improved in the future through the user of more intelligent run-length encoding. Typically, zig-zag patterns are used within the individual compression blocks in order to exploit local similarities between pixels. However, for simplicity, in this approach we naively concatenate the blocks into vectors (which define the runs). Therefore, the compression rates are not as optimal as they could be.

**3D JPEG**

JPEG compression relies upon the 2D discrete cosine transform (DCT) [1]. The central idea behind classic JPEG compression is partitioning the source image into $latex 8 \times 8$ blocks, and then taking the 2D DCT of each of these blocks. After quantizing the DCT matrix of each block, many coefficients become zero, enabling significant compression. Here, we analyze a 3D generalization of this algorithm for multiple images.

As in the previous algorithm, we cluster the images into $latex K$ clusters, and then depth-concatenate images within a given cluster (each cluster contains $latex d$ images) to create a 3D tensor. Once each cluster’s tensor is generated, we apply the following procedure for compression.

- For each cluster, we segment the 3D tensor into blocks $latex x \in \mathbb{R}^{8 \times 8 \times 8}$ and use the 3D discrete cosine transform to generate frequency domain blocks $latex X$ of the same dimension as the original blocks. The forward and inverse DCT transforms are given below, as provided by [5].

$latex X(k_1, k_2, k_3) = \frac{8}{N^3} \epsilon_{k_1} \epsilon_{k_2} \epsilon_{k_3} \sum_{n_1 = 0}^N \sum_{n_2 = 0}^N \sum_{n_3 = 0}^N x(n_1, n_2, n_3) cos\left(\frac{\pi}{2N} (2n_1 + 1)k_1 \right)cos\left(\frac{\pi}{2N} (2n_2 + 1)k_2 \right)cos\left(\frac{\pi}{2N} (2n_3 + 1)k_3 \right)$

$latex x(n_1, n_2, n_3) = \sum_{k_1 = 0}^N \sum_{k_2 = 0}^N \sum_{k_3 = 0}^N \epsilon_{k_1} \epsilon_{k_2} \epsilon_{k_3} X(k_1, k_2, k_3) cos\left(\frac{\pi}{2N} (2n_1 + 1)k_1 \right)cos\left(\frac{\pi}{2N} (2n_2 + 1)k_2 \right)cos\left(\frac{\pi}{2N} (2n_3 + 1)k_3 \right)$

Note that in the above equations, we define $latex \epsilon_{k_i} = \frac{1}{\sqrt{2}}$ for $latex k_i = 0$ and $latex \epsilon_{k_i} = 1$ for $latex k_i \neq 0$.

2. For each block in each cluster, we quantize using a quantization matrix $latex q$ that is defined by the the following equation, where $latex i, j, k$ are used to index the quantization matrix [2][3]. Note that $latex Q$ represents a quantization level, effectively indicating how much to round each DCT coefficient.

$latex q(i,j,k) = \lfloor \frac{Q}{4}(1 + i + j + k) \rfloor$

3. For each block in each cluster, we filter the zero components, and then encode each block’s coefficients using Huffman encoding. The result is a bit stream for each block.

4. We sum the encoded bit stream sizes for each of the blocks in a cluster to obtain the cluster compressed size (in bits), and then sum over all clusters for the total compressed size.

**Empirical Results**

The below plots illustrate both (1) the relationship between the level of quantization and the compression ratio and (2) the relationship between the reconstruction error on the full set of images and the level of quantization. Evidently, there is a tradeoff between reconstruction accuracy and compression ratio.

Note that in this case, none of the quantization levels beat the baseline performance with a tolerable reconstruction error. However, of particular interest with this plot is that both the compression rate and reconstruction error asymptotically level out as $latex Q$ increases. Therefore, the primary benefit of this method is that, even with near zero reconstruction error (near lossless compression), we can still achieve significant a significant compression factor. Part of the reason that this method does not perform as well as classic JPEG is lack of ad-hoc optimizations. Because industry standard JPEG compressors perform many side optimizations (conversion to the YCbCr color scale, chromiance downsampling, etc.), this compressor does not achieve a comparable BPP. Future work with regard to 3D JPEG could include developing more effective quantization schemes. While 2D quantization tables are readily available and have largely been optimized, very little work has been done on 3D quantization tables. More sophisticated approaches could result in improved multi-image compressors.

#### Outreach Activity

For our outreach project, we wanted to communicate the idea that “it is easier to remember the difference between two images than remember the two images themselves”, which was the premise for our project’s methods. To teach this idea to elementary school students, we presented them with a series of “Spot the Difference” games, through which they competed for a spot on our leaderboard. The goal was, in 3 minutes, to spot as many differences as possible between two similar images. The below panel is an example of two similar images that were displayed to the students.

The students were then presented with a “difference” slide (a black-and-white heat map, see below) highlighting the specific areas where the images differ from each other.

When this difference slide and the main slide were flashed back and forth quickly, the differences popped out, and the students found them much more easily. The difference slide makes the task considerably easier, and demonstrates that having to store information of two whole images is more difficult than analyzing the second image from the vantage point of the first one and the differences between them.

**References**

[1] Townsend, A. (2017, March 15). JPEG: Image compression algorithm. Retrieved from http://pi.math.cornell.edu/~web6140/TopTenAlgorithms/JPEG.html

[2] Lee, M., Chan, R. K., & Adjeroh, D. A. (1997). Quantization of 3D-DCT Coefficients and Scan Order for Video Compression. *Journal of Visual Communication and Image Representation,8*(4), 405-422. doi:10.1006/jvci.1997.0365

[3] Bozinovic, N., & Konrad, J. (2003). Scan order and quantization for 3D-DCT coding. *Visual Communications and Image Processing 2003*. doi:10.1117/12.503324

[4] Ng. A (2018). The k-means Clustering Algorithm. *CS 229 Lecture Notes*. http://cs229.stanford.edu/notes/cs229-notes7a.pdf

[5] Boussakta, S. and Alshibami, H.O. (2004) Fast algorithm for the 3-D DCT-II. IEEE

Transactions on Signal Processing, 52 (4). pp. 992-1001. ISSN 1053-587X