By David Lin (linkewei :at: stanford.edu)
Introduction
In this article, we explore some applications of information-theoretic methods to various combinatorial problems. The various approaches are best thought of as an extension to the probabilistic method, first initiated by Paul Erdős in the 1940s, where a probability distribution imposed on combinatorial objects (e.g. a randomly selected element of a set) can give rise to deterministic conclusions that are otherwise difficult to obtain.
The probabilistic method usually depends on computing the expected value (where we obtain a “typical” value), or showing that the probability of an event is positive (i.e. the probability of constructing a desired object). Information theory forms a natural extension to this method by offering additional quantities to consider, since entropy and mutual information possess intuitive significance once a random distribution is imposed on the objects.
Information Theory Preliminaries
When we construct a (discrete) random variable X taking values among a set \mathcal X, the entropy is defined as H(X) \overset{\scriptscriptstyle\Delta}= \mathbb E[-\log p_X(X)] = \sum_{x\in \mathcal X} -p_X(x) \log p_X(x) where all logs are base 2 and by convention 0\log 0= 0. Following the analogy, if we have a second random variable Y taking values on \mathcal Y, we have the concepts of joint entropy and conditional entropy : \begin{aligned} H(X,Y) &\overset{\scriptscriptstyle\Delta}= \mathbb E[-\log p_{X,Y}(X,Y)] = \sum_{(x,y)\in \mathcal X\times \mathcal Y} -p_{X,Y}(x,y) \log p_{X,Y}(x,y)\\ H(Y|X) &\overset{\scriptscriptstyle\Delta}= \mathbb E[H(Y|X=x)] = \mathbb E[-\log p_{Y|X}] = \sum_{(x,y)\in \mathcal X \times \mathcal Y} - p_{X,Y}(x,y) \log p_{Y|X=x}(y) \end{aligned} There is a natural information-based interpretation for the above. H(X,Y) is a measure of the combined information of X and Y (or equivalently, the measure of the information of (X,Y) as a joint variable), while H(Y|X) is the measure of the expected information revealed by X about Y.These two quantities are also related by the chain rule:
Fact. (Chain Rule) H(X,Y) = H(X) + H(Y|X).
(Interpretation: the information of revealing X and Y is equivalent to the information of revealing X then revealing Y.)
A key inequality is as follows:
Fact. (Data Processing Inequality) H(X|Y,Z) \le H(X|Y). (Interpretation: revealing additional information cannot increase entropy)
Remark. As a quick corollary, we have the much simpler H(Y|X)\le H(Y), which when combined with the chain rule gives us the subadditivity of entropy.
Entropic Lower Bounds
A basic property is that the entropy gives a lower bound on the number of distinct values X can take:Fact. H(X) \le \log |\mathcal X|.
This immediately gives us a way to connect entropy and counting: we can bound any count we want by setting up some distribution over the desired object, and then estimating the entropy.
Problem. Let G = (U\sqcup V,E) be a bipartite graph with vertex partitions of size |U|=m and |V|=n respectively. If P is the set of length 3 paths (edges may be repeated), then |P|\ge \frac{|E|^3}{mn}
Solution. We will select a random length 3 path (e_1,e_2,e_3) as follows: e_2 = \overline{uv} is uniformly selected among all edges, then e_1 is uniformly selected among all edges at u and e_3 is uniformly selected among all edges at v independent of e_1.
First we compute H[e_1|e_2]: \begin{aligned} H[e_1|e_2] &= \sum_{u\in U}\frac{\deg u}{|E|}\log (\deg u)\\ &\ge \frac{m}{|E|}\left(\frac{|E|}{m}\log\left(\frac{|E|}{m}\right)\right) = \log\left(\frac{|E|}{m}\right)\\ \end{aligned}
Similarly H[e_3|e_2]\ge \log\left(\frac{|E|}{n}\right). Then: \begin{aligned} \log |P|&\ge H(e_1,e_2,e_3)\\ &= H(e_2) + H(e_1|e_2) + H(e_3|e_2)\\ &\ge \log\left(\frac{|E|^3}{mn}\right)\\ \end{aligned}
Discussion.
- This is clearly tight for complete graphs.
- The way to invent this solution is firstly to notice the naive argument that gives the RHS: picking (e_1,e_2,e_3) randomly among E\times E\times E, e_1 and e_2 shares the U-vertex with probability 1/m while e_2 and e_3 shares the V-vertex with probability 1/n. Hence the probability of (e_1,e_2,e_3) being a valid path is 1/mn if we make the naive assumption that the two events are independent.
This leads us to the correct solution when we impose such a distribution on the set of 3-paths, then the entropic counting bound gives us the right answer.
Problem. Given finite sets X, A_1,...,A_{n-1} and functions f_i : X \to A_i, a vector (x_1,...,x_n) \in X^n is called nice if f_i(x_i) = f_i(x_{i+1}) for each i = 1,2,..,n-1. Show that the number of nice vectors is at least \frac{|X|^n}{\prod_{i=1}^{n-1} |A_i|}
Solution. Select x_1 uniformly at random, and select x_2 among the possible choices (given x_1) uniformly at random, and so on. The probability of x_i being any particular value of X is fixed at \frac{1}{|X|}. So: \begin{aligned} H[x_1,x_2,...,x_n] &= H[x_1] + H[x_2|x_1] + H[x_3 | x_1,x_2]...\\ &= \log |X| + \mathbb{E}\left[\log \deg x_1\right] + \mathbb{E}\left[\log \deg x_2\right]...\\ &\ge \log |X| - \log\mathbb{E}\left[\frac{1}{\deg x_1}\right] - \log\mathbb{E}\left[\frac{1}{\deg x_2}\right] ... \qquad\text{(GM-HM)}\\ &= \log |X| - \log \left( |A_1|/|X|\right) - \log \left( |A_2|/|X|\right) - ... \end{aligned} which is what we wanted.
Remark. We made use of the QM-AM-GM-HM inequality for random variables: \frac{1}{2}\log \mathbb{E}[X^2] \ge \log \mathbb{E}[X]\ge \mathbb{E}[\log(X)]\ge -\log \mathbb{E}[1/X]
Entropic Lower Bounds
An alternative argument provides us a way to obtain an upper bound instead of a lower bound. We start with the following observation: if X was the uniform distribution, then the \log |\mathcal X| upper bound is in fact attained. This suggests that for an upper bound, we should start with the uniform distribution and then bound the entropy by various methods.A naive way to do so (especially when joint variables are present) is to simply make use of subadditivity. If we are more careful, we could also reveal each variable sequentially and track the incremental information revealed by each variable, expressed by the equation below: H(X_1,X_2,...,X_n) = \sum_{i=1}^n H(X_i|X_1,...,X_{i-1}) \le \sum_{i=1}^n H(X_i)
Problem. (Binomial tail bounding) Show the bound (for k\le n/2) \sum_{i=0}^k \binom{n}{k} \le 2^{nh_2(k/n)} where h_2(\alpha) = \alpha \log_2 \alpha + (1-\alpha) \log_2 (1-\alpha) is the binary entropy function.
Solution. Consider a random n-bit string X_1X_2...X_n uniformly selected from the set of strings with at most n 1’s. We have \log LHS = H(X_1X_2...X_n) \le \sum_{i=1}^n H(X_i) = \sum_{i=1}^n h_2(\mathbb P(X_i = 1))\le nh_2(k/n) since \mathbb P(X_i = 1) \le k/n \le 1/2 and h_2 is monotonic on [0,1/2].
Problem. (Discrete Loomis-Whitney) Let S be a finite subset of \Z^3, and let S_{xy} be the projection of S onto the xy-plane (with S_{yz},S_{zx} defined similarly). Then: |S|^2 \le |S_{xy}|\cdot |S_{yz}|\cdot |S_{zx}|
Solution. The following entropy estimate is useful (special case of Han’s inequality): 2H[X,Y,Z] \le H[X,Y] + H[Y,Z] + H[Z,X] This can be established by a short computation: \begin{aligned} H[X,Y] + H[Y,Z] + H[Z,X] - 2H[X,Y,Z] &= H[X,Y] - H[X|Y,Z] - H[Y|X,Z]\\ &\ge H[X,Y] - H[X|Y] - H[Y] = 0 \qedhere \end{aligned} Now select (P_x,P_y,P_z)\in S uniformly at random, then \begin{aligned} 2\log |S| &= 2H[P_x, P_y, P_z]\\ &\le H[P_x, P_y] + H[P_y, P_z] + H[P_z,P_x]\\ &\le \log |S_{xy}| + \log|S_{yz}| + \log|S_{zx}| \end{aligned} which is exactly log of the inequality we needed to show.
Discussion.
- This was proposed at the International Mathematics Olympiad 1992, a contest for high school students. How might they have solved it? Here’s a possible approach: Alternative Solution. Let V_{a,b} = \{(a,b,c)\in S\}. By the Cauchy-Schwarz inequality: |S|^2 = \left(\sum_{(a,b)\in S_{xy}} |V_{a,b}|\right)^2 \le |S_{xy}|\cdot \sum_{(a,b)\in S_{xy}} |V_{a,b}|^2 then we note that there is an injection from pairs with the same xy coordinates to S_{yz}\times S_{zx}: (a,b,c),(a,b,c')\mapsto (a,c), (b,c').
- What are the equality cases? Looking at the computation above, we require at least: \begin{aligned} I(P_x;P_z|P_y) &= 0\\ I(P_y;P_x,P_z) &= 0 \end{aligned} In particular, the second equation tells us that the values for P_y should have the same distribution conditioned the projection onto S_{zx}. The first tells us that conditioned on Y, (P_x,P_z) forms a grid, so equality holds for “cuboidal sets” S=S_x\times S_y\times S_z (If you look closely, the alternative solution suggests the same equality case).
- triangle if \overrightarrow{xy}, \overrightarrow{yz}, \overrightarrow{zx} are edges
- vee if \overrightarrow{xy}, \overrightarrow{xz} are edges
Solution. Select (X,Y,Z) uniformly at random from the set of all triangles. Then: \begin{aligned} \log_2 T &= H(X,Y,Z)\\ &= H(X) + H(Y|X) + H(Z|Y,X) \end{aligned} where the last line has an interpretation as the entropy decrements from revealing X,Y,Z in that order. However, we can ignore the dependence of Z on X (i.e. H(Z|Y,X) \le H(Z|Y) = H(Y|X) by symmetry) to get an upper bound: \log_2 T \le H(X) + 2H(Y|X) This suggests constructing a distribution on vees as follows: X'\sim X above and Y',Z'\sim Y|X'. The result is that \log_2 (X',Y',Z') is precisely H(X) + 2H(Y|X), but it must be upper bounded by \log_2 |V|.
Discussion. Ordinarily, when we relax H(Z|Y,X) to H(Z|Y), we would get a distribution of length-2 paths (like X\rightarrow Y \rightarrow Z), but we know that every triangle is a length-2 path (and so by following through, the conclusion would have been trivial). However, the magic happens when we make use of cyclic symmetry to “move” H(Z|Y) to H(Y|X).
Other directions
Aside from entropy, other information-theoretic concepts also have some relevance to combinatorial problems. For instance, Moser [2009] used a compression argument to prove the Lovasz local lemma, which guarantees that for a collection of independent variables and a sequence of “local” (i.e. depending only on a few variables) and “rare” bad events , all bad events are avoided with some positive probability.
Acknowledgements
I would like to thank Yanjun for his useful suggestions and advice for this project, and Professor Weissman for teaching this course and introducing me to the wonderful world of information theory.