Unsupervised Sequence Clustering with Lossless Compression

EE376A (Winter 2019)

Every day we use the internet and leave digital footprints all over the web: a click here, a scroll there. Even richer patterns emerge when you collect these disparate steps together into a fluid session of interacting with the web. Recently, I have been particularly intrigued by the idea of deriving insights from this rich, yet unstructured, trove of interaction sequences. Inspired by an information theory class at Stanford, I set out to find methods to tackle this interesting challenge.

A Simple Dataset

Sample Interaction Sequence

With the help of friends and volunteers at Stanford, I collected a dataset of 2832 interaction sequences on SimpleEnroll, a Stanford internal service that offers students a graphical interface to add, plan, and drop classes. Each data point corresponds to a sequence of atomic “steps” within a single “session” of use. When a user stops interacting with SimpleEnroll for a set amount of time, the session naturally expires, resulting in sequences ranging from as few as 5 actions to as many as 35. Each step records a wide variety of page data and metadata associated with the action taken at that step, captured through JavaScript events: the interaction mode, DOM element tag, URL, DOM attributes (ID, class name, title, …), the element's text, and metadata such as the time elapsed since the prior action and the origin URL (the webpage from which the session was triggered). Below is a sample step within a sequence.

Sample Interaction Step
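Schematically, a single step can be thought of as a small record of fields. The sketch below is a hypothetical illustration in Python; the field names and values are assumptions chosen for exposition, not the actual logging schema.

# Hypothetical shape of one interaction step (field names are illustrative only).
sample_step = {
    "mode": "click",                       # interaction mode (click, scroll, keypress, ...)
    "tag": "button",                       # DOM element tag
    "url": "https://simpleenroll.stanford.edu/",      # page URL at the time of the action
    "attributes": {"id": "confirmBtn", "class": "btn", "title": ""},
    "text": "Confirm",                     # visible text of the element
    "time_since_last_action_ms": 1840,     # metadata: delay since the previous step
    "origin_url": "https://simpleenroll.stanford.edu/",  # page that triggered the session
}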

Some Theory: Normalized Compression Distance

After studying some of the literature on sequence clustering in the realm of information theory, I set my sights on the Normalized Compression Distance (NCD). NCD is an approximation to the Normalized Information Distance (NID), a universal distance measure for objects of all kinds. In essence, NID, and hence NCD, tries to capture the least amount of work a program needs to do to transform one object into another. This idea leverages Kolmogorov complexity, a measure well studied in algorithmic information theory that describes the computational resources needed to specify an object, or equivalently, the length of the shortest program that produces the object as output. Our final distance measure is thus the following:

Normalized Information Distance
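Written out, NID takes the standard form from the literature:

\[
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}
\]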

In the equation above, K denotes the Kolmogorov complexity; specifically, K(x|y) denotes the length of the shortest program that takes y as input and produces x. This measure has some great properties: NID satisfies the density condition, and it is universal (minimal among all such distances).

First off, NID satisfies the density condition: for any object, there can only be a certain, finite number of objects that are exactly distance d away from it. This condition excludes degenerate distance functions, such as one that yields 0 if the two objects under comparison are identical and 1 otherwise; under such a function, infinitely many objects would sit at the same distance from any given object. An example of a distance function that does satisfy the density condition is the edit distance. Another nice property of NID is that it is universal. Formally, universality states that NID is the least of all distances between objects that satisfy the density condition. Intuitively, this means that NID highlights the most dominant feature in which two objects are alike. Lastly, NID is bounded between 0 and 1, making its values easy to interpret.

However, despite these great properties, NID is not computable because it depends on the Kolmogorov complexity function K, which is itself uncomputable. Thus, in practice, we often use the Normalized Compression Distance (NCD) to approximate NID. NCD replaces Kolmogorov complexity with the length of an object's output under a lossless compressor such as gzip, bzip2, or PPMZ. Intuitively, the compressed length of an object approximates the amount of information it contains, while maintaining the nice properties of NID. We compute NCD as follows:

Normalized Compression Distance, an approximation to NID
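Written out, the standard form of NCD is:

\[
\mathrm{NCD}(x, y) = \frac{Z(xy) - \min\{Z(x),\, Z(y)\}}{\max\{Z(x),\, Z(y)\}}
\]

where xy denotes the concatenation of x and y.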

In the above equation, Z(·) denotes the binary length of an object after being compressed by compressor Z. The derivations of the properties and equations above are omitted for brevity and can be found in the further reading section.

Experiment: Are you trying to add another class?

NCD has been widely used in practice in applications like DNA analysis and phylogenetic studies. Here, we are going to use NCD to cluster student interactions on SimpleEnroll. From my personal experience, interactions on SimpleEnroll fall into two modes: 1) adding or dropping classes, and 2) searching for and planning classes. Thus, we are going to cluster the 2832 interaction sequences and see whether they fit either of these use cases. However, one important problem plaguing our quest for insight is that we don't actually have ground-truth data about users' true intentions. We can only rely on a heuristic built into the SimpleEnroll website itself: if a session triggers the “Confirm” button twice, it is more likely that the student is trying to add or drop a class. As a heuristic, this simple labeling may be inaccurate.

With ground-truth labels generated, we can begin building out our clustering. For each pair of the 2832 sequences, we compute the NCD between them using an open source compression engine, yielding a large distance matrix. From there, we can leverage off-the-shelf clustering algorithms provided by packages like scipy.
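As a rough sketch of this pipeline (assuming each session has already been serialized to a string; the bzip2 compressor and the hierarchical clustering settings below are illustrative assumptions, not the exact setup used):

import bz2
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def compressed_len(data: bytes) -> int:
    # Z(x): length of x after lossless compression.
    return len(bz2.compress(data))

def ncd(x: str, y: str) -> float:
    # NCD(x, y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))
    zx = compressed_len(x.encode())
    zy = compressed_len(y.encode())
    zxy = compressed_len((x + y).encode())
    return (zxy - min(zx, zy)) / max(zx, zy)

def cluster_sequences(sequences, n_clusters=2):
    # Build the pairwise NCD distance matrix.
    n = len(sequences)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(sequences[i], sequences[j])
            dist[i, j] = dist[j, i] = d
    # Hierarchical clustering on the condensed distance matrix.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

For example, cluster_sequences(serialized_sessions, n_clusters=2) returns a cluster label per session, which can then be compared against the heuristic labels.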

For comparison, we evaluate our NCD-based clustering against two other methods, neither of which involves compression. Both comparison methods share a common encoding step: a function that turns an interaction sequence like the one shown above into a sequence of real-valued vectors. For this encoding, we leverage off-the-shelf word embeddings like GloVe: for each step, we average the embeddings of the words in the text, DOM tag, action mode, and element attributes, and concatenate the result with the averaged word embeddings from the other steps. The first comparison method simply applies a Euclidean (L2) distance to these sequence embeddings to compute a distance matrix for clustering. The second comparison method uses a simple two-layer neural network, a staple of machine learning and deep learning, trained on the binary prediction task using 80% of the 2832 data points collected. The table below shows each method's performance on the 20% of the data held out from training.
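A minimal sketch of the embedding baseline is shown below, assuming glove is a pre-loaded lookup table from word to numpy vector; the choice of fields, the tokenization, and the zero-padding to a fixed number of steps are simplifying assumptions for illustration.

import numpy as np
from scipy.spatial.distance import cdist

def embed_step(step: dict, glove: dict, dim: int = 50) -> np.ndarray:
    # Average GloVe vectors over the words found in a step's fields.
    words = []
    for field in ("text", "tag", "mode"):
        words.extend(str(step.get(field, "")).lower().split())
    vectors = [glove[w] for w in words if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def embed_sequence(steps, glove, max_steps=35, dim=50) -> np.ndarray:
    # Concatenate per-step embeddings, zero-padding so every sequence
    # maps to a vector of the same dimension.
    parts = [embed_step(s, glove, dim) for s in steps[:max_steps]]
    parts += [np.zeros(dim)] * (max_steps - len(parts))
    return np.concatenate(parts)

def euclidean_distance_matrix(sequences, glove) -> np.ndarray:
    # Pairwise L2 distances between sequence embeddings, for clustering.
    embeddings = np.stack([embed_sequence(s, glove) for s in sequences])
    return cdist(embeddings, embeddings)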

Performance of various methods

The NCD-based method achieves an 82% F1-score; the Euclidean-distance-based clustering yields only a 62% F1-score; and the supervised neural network yields 89%. Interestingly, we observe that NCD-based clustering, an unsupervised, non-parametric method, performs comparably to a neural network that is supervised and specifically trained on this task of predicting user intention. On the other hand, we also observe that, given the same embedding method, Euclidean-distance-based clustering performs much worse, suggesting that Euclidean distance, unlike NCD, does not highlight the most prominent features of the sequences.

Outreach event

Over the last weekend, I had the pleasure of presenting a component of this project at Lucille M. Nixon Elementary School. For the outreach project, I set out to build a simple demonstration of how we can compute distances between any pair of objects via compression. To achieve this, I leveraged well-established APIs such as the Bing Image Search API and the Google Search API. I take the articles and images returned by the search APIs and extract all of the plain text to use as raw material for compression; NCD is then applied to get a final distance score. When the elementary school students came to my booth, I had them start by comparing their favorite foods and sports, then gradually varied the comparisons to show them how compression can tell us about the distance and similarity between objects in the real world. Below is a sample view of the demonstration.


Sample comparison

Further reading

If you are interested in reading more on topics like normalized compression distance, Kolmogorov complexity, and machine learning on the web domain, here are some resources that contain fully detailed derivations:

On Kolmogorov complexity: here
On Normalized Information Distance: here
On machine learning in the web domain: here

If you have further questions, please contact me at zliu19<at>stanford.edu
