By Vinh Nguyen, Jananan Mithrakumar and Pranav Padode.
Tabular datasets are one of the oldest and most common forms of storing large amounts of information. They are used for organizing data to be processed by machines and for visually representing data to humans. They appear in a wide variety of applications, including business transactions, machine-learning datasets, genomics, and census records. Due to their ubiquity and usefulness, lossless compression of such datasets is very important.
Our project is concerned with exploiting dependencies between the features of tabular datasets to achieve greater compression ratios using arithmetic encoding. We start with the most basic scheme, assuming no dependencies between the features. We then move on to exploit possible pairwise dependencies between features. To do so we rely heavily on the excellent work of Pavlichin et al. and their paper on using pairwise dependencies to improve tabular compression (Pavlichin, Ingber & Weissman, 2017). In this approach, the dependency graph of the dataset features is limited to a tree (a graph in which each node has either 0 or 1 parent node).
We then try to extend their work by considering v-structure dependencies between dataset features. These dependencies involve a node having two parents. We show that by considering such dependencies we can achieve greater compression ratios without significantly increasing the space required to store our more complex model.
To test our compression schemes we used a connect-4 dataset from the UCI machine learning dataset repository (Lichman, 2013). The dataset contained 42 features (columns) and 67,557 entries (rows).
We divided our project into two parts. For the first part, we decided to follow the simplest encoding scheme possible and perform arithmetic encoding on each feature of our dataset (each column) separately.
The compression rate (in bits per row) achievable by an arithmetic encoder given this scheme is the sum of the empirical entropies of the features:

R_independent = Σ_i H(X_i)

where H(X_i) is the empirical entropy of the i-th feature.
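A minimal sketch of this baseline, assuming the dataset is loaded into a pandas DataFrame (the function names here are ours, not from the paper): summing per-column empirical entropies gives the rate an ideal arithmetic coder would approach.

```python
import numpy as np
import pandas as pd

def column_entropy(col: pd.Series) -> float:
    """Empirical Shannon entropy of one feature (column), in bits."""
    probs = col.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def independent_rate(df: pd.DataFrame) -> float:
    """Bits per row when each column is arithmetic-coded independently:
    the sum of the marginal empirical entropies."""
    return sum(column_entropy(df[c]) for c in df.columns)
```

For example, a column with two equally frequent symbols contributes exactly 1 bit per row to the total rate.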
We then attempted to reproduce a subset of Pavlichin et al.'s results. Specifically, we wanted to model pairwise dependencies between features in our dataset. We adopted the Chow-Liu maximum spanning tree approach described in their paper. However, to simplify the process, we ignored the cost of compressing our graphical model and the string-to-integer dictionaries that were used when computing probability distributions.
We also skipped the actual compression step, since a generic arithmetic encoder can get within a few bits of the theoretical compression rate; computing the compression rate is therefore sufficient to compare the schemes.
We used the following procedure to compute the achievable compression rate:
- We computed the empirical probability distribution of each feature (column) and the empirical joint probability distributions between all pairs of features. These distributions were computed as histograms representing the probability mass function of each feature.
- Using these probability distributions we computed the entropy of each feature and the joint entropies of all feature pairs.
- We then computed the mutual information between all pairs of features in our dataset to create a mutual information matrix
- We then created a fully connected graph where the edge weights were the mutual information between the vertices of the edge.
- We then found a maximum spanning tree of this graph
- Using our MST we computed the total mutual information of our model
- This then allowed us to compute the compression rate of the scheme as below
The compression rate of this scheme is:

R_tree = Σ_i H(X_i) − Σ_{(i,j) ∈ MST} I(X_i; X_j)
The first term in the above equation corresponds to the sum of the entropies of all the features in our dataset. The second term corresponds to the total mutual information of our dependency model.
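The steps above can be sketched in Python as follows. This is a simplified version under our assumptions (model and dictionary storage costs ignored, as in our experiments); `mutual_information`, `chow_liu_rate`, and the other names are ours, not the paper's code. The maximum spanning tree is found with Kruskal's algorithm over the mutual-information edge weights.

```python
import numpy as np
import pandas as pd
from itertools import combinations

def entropy(col: pd.Series) -> float:
    """Empirical entropy of one column, in bits."""
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x: pd.Series, y: pd.Series) -> float:
    """Empirical mutual information I(X;Y) in bits, from the joint histogram."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)   # marginal of X
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def chow_liu_rate(df: pd.DataFrame) -> float:
    """Tree-model rate in bits/row: sum of marginal entropies minus the
    total mutual information of a maximum spanning tree (Kruskal + union-find)."""
    cols = list(df.columns)
    # Fully connected graph: edge weight = mutual information between endpoints.
    edges = sorted(
        ((mutual_information(df[a], df[b]), a, b) for a, b in combinations(cols, 2)),
        key=lambda e: e[0],
        reverse=True,
    )
    parent = {c: c for c in cols}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    total_mi = 0.0
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:          # edge joins two components: keep it in the tree
            parent[ra] = rb
            total_mi += w
    return sum(entropy(df[c]) for c in cols) - total_mi
```

On a toy table where one column duplicates another, the tree edge between them carries the duplicate's full entropy, so the duplicated column contributes nothing to the rate.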
To extend Pavlichin et al.'s work we included the possibility of v-structure dependencies. To do so, we randomly inserted v-structures into our MST model and recomputed the compression rates. In practice this would not be the way to go about it, since inserting v-structures at arbitrary vertices would make the model and probability-mass-function dictionaries too large. A more practical approach would be to insert v-structures only between vertices with few unique values. However, since we were ignoring the cost of compressing our models and dictionaries, we simply measured how much improvement we could gain by inserting a single v-structure into our model.
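The rate saved by a single v-structure is the conditional mutual information between the child and its second parent, given the first parent. A sketch of that computation (the helper names are ours and hypothetical):

```python
import numpy as np
import pandas as pd

def cond_entropy(df: pd.DataFrame, target: str, given: list) -> float:
    """H(target | given) in bits, via H(target, given) - H(given)
    on the empirical frequencies."""
    joint = df.groupby(given + [target]).size() / len(df)
    marg = df.groupby(given).size() / len(df)
    h_joint = float(-(joint * np.log2(joint)).sum())
    h_marg = float(-(marg * np.log2(marg)).sum())
    return h_joint - h_marg

def v_structure_gain(df: pd.DataFrame, child: str,
                     parent1: str, parent2: str) -> float:
    """Extra bits/row saved by giving `child` a second parent `parent2`
    on top of `parent1`: I(child; parent2 | parent1)."""
    return (cond_entropy(df, child, [parent1])
            - cond_entropy(df, child, [parent1, parent2]))
```

The gain is largest when the child depends jointly on both parents; for a child that is the XOR of two independent parents, neither single parent helps at all, but the v-structure recovers the child's full entropy.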
The data above lists the compression rates and file sizes we achieved on the connect-4 dataset with three different schemes. We can see that inserting a single v-structure does yield a slight improvement in the compression rate. However, the improvement is very small and may not outweigh the increase in space needed to store the more complex model.
Practically, when choosing whether to use V-structures in the dependency model, the reduction in compression rate would have to be weighed against the increase in space needed to store the model (which can also be compressed). Ideally, V-structures would be made between vertices with a few unique values so that the space needed to store the extra dependency information is small.
- Pavlichin, D. S., Ingber, A., & Weissman, T. (2017). Compressing Tabular Data via Pairwise Dependencies. 2017 Data Compression Conference (DCC).
- Lichman, M. (2013). UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml
Our team had a lot of fun presenting our project at the outreach event. We had the students play a game that required them to compress 40 fields of data into 24 fields by finding pairwise dependencies in our data table. As a premise, we set up the following story.
The year is 2100. Mankind has reached the pinnacle of technological advancement. Innovations such as teleportation have become commonplace. However, our accelerated growth has put our planet at risk. Our continuous depletion of natural resources has made the core of our planet unstable. The High Council, led by Mr. Weismann, has decreed that in order for mankind to survive, we need to explore new worlds and inhabit them.
Alas, this realization was made too late! The day after the council’s order, the planet’s core began to implode. We now only have a few minutes before the planet disintegrates. Fortunately, one of our teleportation devices, along with its operator, has already found a planet and is there waiting for the codex data to be transmitted. The imminent implosion of the core has destroyed all sources of energy on this planet. However, you have a sintex battery pack that allows you to beam 24 GigaJoules of energy.
Your mission, should you choose to accept it, is to work with the ISRO scientist Tsachy and the portal operator Henessey to figure out how you can send information about the citizens.
The following is our data table for the citizens
Students were tasked with finding similarities in this data table and grouping them under shared numbers.
The table above shows the worst way the students could have encoded the data.
The table above shows the best way to encode the data. Only two students found it by the end.
Overall, we had a really fun time at the outreach event. We all thought it helped us solidify our understanding of our project since we had to take it down to the most fundamental level. Some of the pictures from our outreach event are given below.