Lecture 5: Hypothesis Testing and Le Cam’s Two-point Method

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In previous lectures we have seen the ideas of model reduction and its application in establishing the asymptotic lower bounds. However, in many scenarios we would like to find good non-asymptotic bounds, and as we show in the examples of Lecture 3, finding such reductions may be quite difficult. In this and the next few lectures, we will introduce the testing idea which transforms the target problem (usually the estimation problem) into a proper testing problem. Despite the conceptual simplicity, the testing problem may take various forms, some of which may be difficult to analyze. This will be the central topic of the next few lectures. 

1. f-divergences and Inequalities 

1.1. f-divergences 

First we look at the possibly simplest testing problem: if P and Q are two probability measures on the same probability space, and a random sample X is drawn from either P or Q. A (binary) testing problem aims to find some test \Psi(X)\in \{0,1\} which with high probability outputs 0 when X\sim P, and outputs 1 when X\sim Q. To quantify the performance of a certain test \Psi, we define the total error probability as follows: 

\text{err}(\Psi) = P(\Psi(X) \neq 0) + Q(\Psi(X)\neq 1).

Intuitively, when P and Q are close to each other, we will expect that the total error probability for any test is large for they are hard to distinguish. Conversely, when P and Q are very far from each other (e.g., with a disjoint support), then there exists some test which can well distinguish them. The next lemma gives the expression of the minimum \text{err}(\Psi).

Lemma 1 (Le Cam’s first lemma) Minimizing all measurable tests \Psi gives

\min_{\Psi} \text{\rm err}(\Psi) = 1 - \|P-Q\|_{\text{\rm TV}}.

Recall that \|P-Q\|_{\text{\rm TV}} = \frac{1}{2}\int |dP-dQ| is the total variation distance between P and Q

Proof: Let A = \{X: \Psi(X) = 0\}, then 

\text{err}(\Psi) = 1 - P(A) + Q(A) \ge 1 - \sup_B |P(B) - Q(B)| = 1 - \|P-Q\|_{\text{\rm TV}},

with equality if and only if A = \{x: P(dx) \ge Q(dx) \}. \Box 

By Lemma 1, the total variation distance between P and Q exactly quantifies the minimum total error probability of distinguishing between two probability distributions. Similar to Lemma 1, if we impose different weights on the error probabilities, we will arrive at some divergences of the form c\cdot \int |dP - w\cdot dQ|, where c>0 is some constant, and w>0 is the weight on Q(\Psi(X)\neq 1). Recall from convex analysis that for any convex function f with compact support on {\mathbb R}, its second-order derivative f'' always exists in the sense of a positive Radon measure, and 

f(x) = a + bx + \frac{1}{2}\int |x-t| f''(dt) \ \ \ \ \ (1)

for some a,b\in {\mathbb R}.

Remark 1 The decomposition (1) of convex functions has several applications, e.g., in the local time analysis of continuous martingales. 

Motivated by (1), we can actually take a convex combination of the above divergences and arrive at the general class of f-divergences. 

Definition 2 (f-divergence) For a convex function f on [0,\infty) with f(1)=0, the f-divergence between probability measures P and Q on (\Omega,\mathcal{F}) is defined as

D_f(P, Q) = \int f\left(\frac{dP^\parallel}{dQ}\right) dQ + f^\star(0) (1- P^\parallel(\Omega) ),

where P^\parallel is the absolute continuous part of P w.r.t. Q, and f^\star(0) := \lim_{t\rightarrow\infty} f(t)/t

In most practical scenarios, we have P\ll Q, and therefore D_f(P\|Q) = \mathop{\mathbb E}_Q f(dP/dQ). Some well-known examples of f-divergences are as follows. 

Example 1

1. f(x) = |x-1|: total variation distance \|P-Q\|_{\text{\rm TV}}

2. f(x)=(\sqrt{x}-1)^2: Hellinger distance H^2(P,Q) = \int (\sqrt{dP}-\sqrt{dQ})^2

3. f(x)=-\log x: Kullback–Leibler (KL) divergence D_{\text{\rm KL}}(P\|Q) = \int dP\log\frac{dP}{dQ}

4. f(x) = x^2-1: \chi^2-divergence \chi^2(P,Q) = \int\frac{(dP-dQ)^2}{dQ}

The following properties also hold for any f-divergences. 

Lemma 3 D_f(P,Q)\ge 0 and is joint convex in (P,Q). In particular, D_f(\mathsf{K}P, \mathsf{K}Q)\le D_f(P,Q) for any transition kernel \mathsf{K}

Remark 2 The last inequality is known as the data processing inequality. 

Proof: The non-negativity follows from convexity and Jensen’s inequality. The joint convexity is due to that of the perspective transform (x,y)\mapsto yf(x/y) for convex f. Let Y be the output of X through kernel \mathsf{K}, then the last inequality follows from the joint convexity and 

\mathop{\mathbb E}_{Q_{X|Y}}\left[ \frac{dP_{XY}}{dQ_{XY}} \right] = \frac{dP_Y}{dQ_Y}.

We leave the details to the readers. \Box 

1.2. Joint range 

Lemma 1 involves the computation of the total variation distance, which may be hard to evaluate exactly for most statistical models, especially when there is a tensor power structure. As a comparison, Hellinger, KL and \chi^2-divergences tensorize better: 

\begin{array}{rcl} H^2(\otimes_i P_i, \otimes_i Q_i) &=& 2 - 2\prod_i \left(1-\frac{1}{2}H^2(P_i, Q_i) \right),\\ D_{\text{KL}}(\otimes_i P_i\| \otimes_i Q_i) &=& \sum_i D_{\text{KL}}(P_i \| Q_i), \\ \chi^2(\otimes_i P_i, \otimes_i Q_i) &=& \prod_i \left(1 + \chi^2( P_i, Q_i)\right) - 1. \end{array}

Consequently, to apply Lemma 1 to proving lower bounds, we need upper bounds of the total variation distance in terms of Hellinger, KL or \chi^2. We have seen examples of such inequalities from Lecture 3, but have no clue whether the claimed inequalities are tight. To obtain the best inequalities between different f-divergences, we study the joint range of f-divergences. 

For two convex functions f,g on {\mathbb R}_+ with f(1) = g(1)=0, we call (r,s) is in the joint range of (D_f, D_g) if we can find two distributions P,Q over some common probability space such that D_f(P,Q)=r, D_g(P,Q)=s. Hence, the joint range exactly characterizes the relationship between different f-divergences. The central result is as follows: 

Theorem 4 Let \mathcal{R}\subseteq {\mathbb R}^2 be the joint range of (D_f, D_g), and \mathcal{R}_k\subseteq {\mathbb R}^2 be the joint range of (D_f, D_g) when both P and Q are discrete distributions supported on at most k elements. Then

\mathcal{R} = \text{\rm conv}(\mathcal{R}_2) = \mathcal{R}_4,

where \text{\rm conv} denotes the convex hull. 

Theorem 4 provides a generic way to prove inequalities between different f-divergences. Specifically, it shows that to obtain the joint range, it suffices to choose P = (p, 1-p) and Q = (q,1-q), compute the pair (D_f(P,Q), D_g(P,Q)), vary (p,q)\in [0,1]^2 and then take the convex closure. The proof of Theorem 4 requires the following theorems in the convex analysis. 

Theorem 5 (Choquet–Bishop–de Leeuw) Let C be a metrizable convex compact subset of a locally convex topological vector space V. For any point x_0 in C, there exists a probability measure \mu supported on extremal points of C such that x_0 = \int x\mu(dx)

Theorem 6 (Fenchel–Eggleston–Carathéodory) Let S\subseteq {\mathbb R}^d be connected and C=\text{\rm conv}(S). For any x\in C, there exists x_1,\cdots,x_d\in S such that x\in \text{\rm conv}(x_1,\cdots,x_d)

We are ready to prove the statement of Theorem 4. 

Proof: For simplicity we always assume that P\ll Q, then D_f(P,Q) = \mathop{\mathbb E}_Q f(dP/dQ). Viewing the likelihood ratio dP/dQ as a random variable X, we conclude that 

\mathcal{R} = \{(\mathop{\mathbb E}[f(X)], \mathop{\mathbb E}[g(X)]): X\ge 0, \mathop{\mathbb E}[X] = 1 \}.

Similarly, \mathcal{R}_k has the same representation with the additional constraint that X can take at most k values. Let C be the set of all probability measures which are supported on [0,\infty) and have unit mean. By linearity of the expectation and the convexity of C we conclude that \mathcal{R} is convex. Moreover, since all extremal points of C are supported on two elements, by Choquet–Bishop–de Leeuw theorem (note that any topological vector space is always locally convex under the weak-\star topology, where the unit sphere is compact and metrizable) we conclude that the distribution of X can be written as a mixture of distributions supported on only two points. Then by the linearity of expectation again, we conclude that \mathcal{R} = \text{conv}(\mathcal{R}_2)

For the second identity \text{conv}(\mathcal{R}_2)= \mathcal{R}_4, first note that \mathcal{R}_2 is a connected set for it is a continuous image of the divergence pair on [0,1]^2. Hence, by Fenchel–Eggleston–Carathéodory theorem, any point in \text{conv}(\mathcal{R}_2) can be written as a convex combination of two points in \mathcal{R}_2, which gives rise to a random variable X supported on at most 4 elements. \Box 

Based on Theorem 4, one may verify the following inequalities: 

  • TV vs. Hellinger: \frac{H^2(P,Q)}{2} \le \|P-Q\|_{\text{TV}} \le \sqrt{H^2(P,Q)\left(1-\frac{H^2(P,Q)}{4}\right)}.
    Both inequalities are tight. 
  • TV vs. KL: Pinsker’s inequality gives \|P-Q\|_{\text{TV}} \le \sqrt{D_{\text{KL}}(P\|Q)/2}.
    This bound is vacuous when D_{\text{KL}}(P\|Q)\ge 2, where better bounds are available: \|P-Q\|_{\text{TV}} \le \sqrt{1-\exp(-D_{\text{KL}}(P\|Q))} \le 1 - \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q)).
    None of these bounds are tight. 
  • Hellinger vs KL: D_{\text{KL}}(P\|Q) \ge 2\log\frac{2}{2 - H^2(P,Q)}.
    This bound is tight. 
  • KL vs. \chi^2: D_{\text{KL}}(P\|Q) \le \log(1+\chi^2(P,Q)).
    This bound is tight. 

The above inequalities will be used quite often whenever we switch between convenient divergences.

2. Le Cam’s Two-point Method 

One popular way to prove a non-asymptotic lower bound of statistical estimation is via hypothesis testing. Let (P_\theta)_{\theta\in\Theta} be the statistical model of interest, and the loss function is L(\theta,\hat{\theta}): \Theta \times \Theta\rightarrow {\mathbb R}_+ (at the moment we assume that the target is to estimate \theta). The idea is to use the following weaker lower bound 

\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) \ge \sup_{\theta_1,\theta_2\in\Theta} \inf_{\hat{\theta}} \sup_{\theta\in \{\theta_1,\theta_2\} } \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}).

Suppose that we can find two possible candidates of \theta, say, \theta_1 and \theta_2, such that the following two conditions hold: 

  1. Large separation: the total loss L(\theta_1, \hat{\theta}) + L(\theta_2, \hat{\theta}) is always large for any \hat{\theta}
  2. Indistinguishability: based on the observation, the observer cannot reliably distinguish between P_{\theta_1} and P_{\theta_2}

then clearly the observer needs to pay a risk proportional to \inf_{\hat{\theta}} (L(\theta_1, \hat{\theta}) + L(\theta_2, \hat{\theta})). The next theorem makes this intuition precise.

Theorem 7 (Two-point Method) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and \theta_1,\theta_2\in \Theta satisfy that

L(\theta_1, a) + L(\theta_2, a) \ge \Delta, \qquad \forall a\in \mathcal{A},

then 

\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}\right).

Proof: The desired result follows from the following chain of inequalities: 

\begin{array}{rcl} \inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) &\ge & \inf_{\hat{\theta}} \sup_{\theta\in \{\theta_1,\theta_2\} } \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) \\ &\ge & \inf_{\hat{\theta}} \frac{\mathop{\mathbb E}_{\theta_1} L(\theta_1,\hat{\theta}) + \mathop{\mathbb E}_{\theta_2} L(\theta_2,\hat{\theta}) }{2} \\ &=&\frac{1}{2} \inf_{\hat{\theta}} \int L(\theta_1, \hat{\theta}(x))P_{\theta_1}(dx) + \int L(\theta_2, \hat{\theta}(x))P_{\theta_2}(dx) \\ &\ge & \frac{1}{2} \int \left[\inf_{a} ( L(\theta_1,a) + L(\theta_2,a) ) \right]\min\{P_{\theta_1}(dx), P_{\theta_2}(dx) \} \\ &\ge& \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}\right). \end{array} \Box

Note that in Theorem 7, the parameter \Delta characterizes the separation between the two points, and 1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}} is the minimum error probability of testing between \theta = \theta_1 and \theta=\theta_2 (cf. Lemma 1). Typically these two quantities depend negatively on each other: when one becomes larger, the other gets smaller. A rule-of-thumb is to find (\theta_1, \theta_2) such that 1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}} is a constant, say 0.1, and then search for the largest possible separation \Delta

Although the proof of Theorem 7 did not explicitly apply testing arguments, the test can be seen as follows. Let the test be \Psi(X) = 1 if L(\theta_1, \hat{\theta}) < L(\theta_2, \hat{\theta}), and \Psi(X)=2 otherwise, then the separability condition ensures that \theta = \theta_\Psi for \theta\in \{\theta_1,\theta_2\} as long as L(\theta,\hat{\theta}) < \Delta/2. Hence, 

L(\theta, \hat{\theta} ) \ge \frac{\Delta}{2}\cdot 1(\theta\neq \theta_\Psi).

Replacing L(\theta,\hat{\theta}) with the above lower bound, and repeating the analysis in the proof of Theorem 7, a direct application of Lemma 1 gives the same result (with \Delta/2 replaced by \Delta/4). 

The two-point idea is simple but appears everywhere in various lower bounds, for it establishes the indistinguishability between two different tasks due to the uncertainties. Also, despite the conceptual simplicity, searching over all possible pairs of points is typically computationally expensive, and it requires effort to understand the problem structure and figure out the “correct” two points. In later lectures, this idea will be generalized in the following directions: 

  • Multiple points instead of two-point; 
  • One or both of the two points may be a mixture of distributions, i.e., the binary test has point-mixture or mixture-mixture structures; 
  • The chioce of the two points may depend on the estimator \hat{\theta}

3. Examples of Two-point Methods 

The remainder of this lecture is devoted to the applications of the two-point method in various problems. 

3.1. Example I: Parameter Estimation 

We begin with the simplest example of the Gaussian location model P_\theta = \mathcal{N}(\theta,\sigma^2) for \theta\in{\mathbb R}, and our target is to estimate \theta. Since a natural guess is that local perturbation on \theta makes the model indistinguishable, we choose \theta_1 = 0 and \theta_2 = \delta in the two-point method. Let R^\star be the minimax risk under the squared error loss, Theorem 7 gives 

R^\star \ge \frac{\delta^2}{2}\left(1 - \Phi\left( \frac{|\delta|}{2\sigma}\right)\right),

where \Phi(\cdot) is the CDF of \mathcal{N}(0,1). Maximizing the above lower bound over \delta\in {\mathbb R} gives R^\star\ge 0.332\sigma^2. Compared with R^\star = \sigma^2 derived in Lecture 4, the two-point bound loses a constant multiplicative factor.

Next we consider the bounded mean scenario where |\theta|\le B for a known parameter B in advance. Since the above \delta is chosen to be \Theta(\sigma) in the end, this becomes infeasible if the upper bound B is much smaller than \sigma. Consequently, we can choose \theta_1 = 0 and \theta_2 = \sigma \wedge B, which gives R^\star = \Omega(\sigma^2\wedge B^2). This lower bound is also tight, for one of the estimators \hat{\theta} = 0 or \hat{\theta}=X attains it. 

Consider another location model where one observes X_1,\cdots,X_n\sim \mathsf{Unif}(\theta,\theta+1) and aims to estimate \theta\in {\mathbb R}. Again, choosing \theta_1 = 0 and \theta_2 = \delta, the minimax risk R_n^\star is lower bounded by 

R_n^\star \ge \frac{\delta^2}{2} \left(1 - \|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \right).

By the triangle inequality, 

\|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \le n\|P_0 - P_\delta\|_{\text{\rm TV}} = n\delta.

Hence, choosing \delta = 1/(2n) gives R_n^\star \ge 1/(16n^2). The rate \Theta(n^{-2}) is also tight for the minimax risk, for the estimator \hat{\theta} = \min_i X_i attains the rate. 

Remark 3 The uniform location model is not QMD (see Lecture 4), thus it makes sense to have the correct minimax rate different from \Theta(n^{-1}). Moreover, the inequality \|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \le n\|P_0 - P_\delta\|_{\text{\rm TV}} may not be tight for other models. 

3.2. Example II: Functional Estimation 

Recall the discrete entropy estimation problem, where we observe n samples X_1,\cdots,X_n from a discrete distribution P=(p_1,\cdots,p_k), and the target is to estimate the Shannon entropy H(P) = \sum_{i=1}^k -p_i\log p_i. In Lecture 4 we proved an asymptotic minimax lower bound (\log k)^2/n for the mean squared error, and here we show that the same bound holds non-asymptotically (even when k grows with n). 

The idea of using two-point method here is that when we perturb each entry in the uniform distribution by some small \varepsilon/k, the value of the functional H(P) changes by an additive factor of \Theta(\varepsilon\log k), and the KL divergence between these two cases is \Theta(n\varepsilon^2). Hence, choosing \varepsilon \asymp n^{-1/2} will complete the proof. However, some cautions should be taken when choosing the specific points: 

  • The two points cannot be P_1 = (\frac{1}{k},\cdots,\frac{1}{k}), and P_2 = (\frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k},\cdots, \frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k}). In this way, we have | H(P_1) - H(P_2) | = \Theta(\varepsilon^2)
  • The two points cannot be P_1 = (\frac{1}{k},\cdots,\frac{1}{k}), and P_2 = (\frac{1-\varepsilon}{k},\cdots,\frac{1-\varepsilon}{k},\frac{1+(k-1)\varepsilon}{k}). In this way, we have |H(P_1) - H(P_2)|=\Theta(\varepsilon\log k) if \varepsilon = \Omega(k^{-1}) and |H(P_1) - H(P_2)|=\Theta(\varepsilon^2) otherwise. Hence, this choice fails for some parameters (n,k)

The final choice of the two-point is as follows: 

\begin{array}{rcl} P_1 &=& \left(\frac{1}{2(k-1)}, \cdots, \frac{1}{2(k-1)}, \frac{1}{2} \right), \\ P_2 &=& \left(\frac{1-\varepsilon}{2(k-1)},\cdots, \frac{1-\varepsilon}{2(k-1)}, \frac{1 + \varepsilon}{2} \right). \end{array}

Simple algebra gives |H(P_1) - H(P_2)| = \Theta(\varepsilon\log k), and 

D_{\text{KL}}(P_1^{\otimes n} \| P_2^{\otimes n}) = nD_{\text{KL}}(P_1 \| P_2) = \frac{n}{2}\log \frac{1}{1-\varepsilon^2} = \Theta(n\varepsilon^2).

Now choosing \varepsilon \asymp n^{-1/2} completes the proof of R_{n,k}^\star = \Theta((\log k)^2/n). This example shows that the choice of the two points may sometimes be a little bit delicate. 

3.3. Example III: Non-local Choice of Two Points 

We return to the entropy estimation problem, but this time we choose different points to arrive at an entirely different lower bound. Note that all above choices of the two points are based on local perturbations, and the resulting bounds are essentially for the variance. This example will show that a more sophisticated construction can also take care of the bias part. 

For observations X^n\sim P^{\otimes n}, let T=T(X^n) be the sorted histogram of X^n, i.e., T is the sorted version of (h_1,\cdots,h_k) with h_j = \sum_{i=1}^n 1(X_i = j). The next lemma shows that there exists a minimax estimator which depends only on T

Lemma 8 For any estimator \hat{H}(X^n) of the entropy H(P), there exists an estimator \hat{H}^\star(T) such that

\sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P(\hat{H}^\star(T) - H(P))^2 \le \sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P(\hat{H}(X^n) - H(P))^2.

Proof: The proof relies on the symmetry of H(P) in permutations of P. Let \sigma\in \mathcal{S}_k be any permutation on [k], clearly H(P) = H(P\circ \sigma), and the distribution of T under P is the same as that under P\circ \sigma. Hence, let \sigma(X^n) = (\sigma(X_1), \cdots, \sigma(X_n)), the estimator 

\hat{H}^\star = \frac{1}{k!}\sum_{\sigma\in\mathcal{S}_k} \hat{H}(\sigma(X^n))

depends only on T, and the desired inequality follows from Jensen’s inequality. \Box 

Lemma 8 shows that the model can be reduced to P^{\otimes n}\circ T^{-1}, which by data processing inequality results in a smaller total variation distance between two points. Now the new construction of the two points is motivated by the following lemma. 

Lemma 9 Let P=(p_1,\cdots,p_k) and Q=(q_1,\cdots,q_k) be two probability vectors, with \max_{i\in [k]} \max\{p_i, q_i \}\le 1/(40n). Then

\|P^{\otimes n}\circ T^{-1} - Q^{\otimes n}\circ T^{-1} \|_{\text{\rm TV}} \le \frac{1}{2} + 5\sum_{m=1}^\infty n^m\left|\sum_{i=1}^k (p_i^m - q_i^m) \right|.

The proof of Lemma 9 is referred to the bibliographic notes. Lemma 9 shows that if the lower moments of P and Q are equal, the total variation distance between their tensor powers will be small. This fact motivates us to consider distributions P and Q which are k/d repetitions of the vectors 

\frac{1}{k}(x_1,\cdots,x_d) \quad \text{ and }\quad \frac{1}{k}(y_1,\cdots,y_d),

respectively, where d>0 is some fixed integer, and

  1. P and Q are valid distributions: \sum_{i=1}^d x_i = \sum_{i=1}^d y_i = d
  2. P and Q have matched moments: \sum_{i=1}^d x_i^m = \sum_{i=1}^d y_i^m, m=0,1,\cdots,d-1
  3. P and Q have different entropies: \sum_{i=1}^d x_i\log x_i\neq \sum_{i=1}^d y_i\log y_i

The existence of such vectors x,y is ensured by the following lemma. 

Lemma 10 For any positive integer d, there exist vectors x and y satisfying the above properties. 

Proof: Fix any d-dimensional vector x with distinct positive entries. Consider 

p_\Delta(x) = (x-x_1)(x-x_2)\cdots(x-x_d) - \Delta,

which for \Delta>0 small enough clearly has d positive zeros. Let y_1,\cdots,y_d be the zeros of p_\Delta(x), then comparing the coefficients of x,x^2,\cdots,x^{d-1} on both sides of the polynomial and invoking Newton’s identity, vectors x and y have matched moments up to d-1. The first property is ensured via simple normalizations, and it suffices to choose a suitable \Delta to ensure the third property. \Box 

Using the above two points, we have |H(P) - H(Q)| = \Theta(1) and \sum_{i=1}^k p_i^m = \sum_{i=1}^k q_i^m for any m=1,\cdots,d-1. Moreover, for m\ge d, we have 

\left|\sum_{i=1}^k (p_i^m - q_i^m) \right| = O\left( k^{1-m} \right).

Hence, by Lemma 9, 

\|P^{\otimes n}\circ T^{-1} - Q^{\otimes n}\circ T^{-1} \|_{\text{\rm TV}} \le \frac{1}{2} + O\left( \sum_{m=d}^\infty n^m k^{1-m} \right) = \frac{1}{2} + O\left(\frac{k(n/k)^d}{1-(n/k)} \right),

which is strictly smaller than one if n = O(k^{1-1/d} ). Note that the choice of the integer d is arbitrary, we conclude that the minimax risk R_{n,k}^\star is at least a positive constant whenever n=O(k^{1-\varepsilon}) for any \varepsilon>0. This lower bound says nothing about R_{n,k}^\star when n gets larger, but compared with the previous bound R_{n,k}^\star = \Theta((\log k)^2/n), the new bound shows that n actually needs to be much larger than (\log k)^2 for the consistent estimation. 

In later lectures, it will be shown that the minimax risk of this problem is 

R_{n,k}^\star = \Theta\left(\left(\frac{k}{n\log n}\right)^2 + \frac{(\log k)^2}{n} \right).

Hence, the current bound is slightly weaker than the first term, and the old bound is exactly the second term. 

3.4. Example IV: Robust Statistics 

In this section we look at the Gaussian mean estimation again, but now with outliers. Specifically, we consider the following Huber’s \varepsilon-contamination model: 

P_{(\theta,Q)} = (1-\varepsilon)\mathcal{N}(\theta, I_p) + \varepsilon Q,

where Q denotes the outlier distribution and may be arbitrary. The robust estimation problem is that, given n i.i.d. observations from P_\theta, propose a robust estimator which minimizes the worst-case risk of estimating \theta, where the worst case is taken over both \theta\in{\mathbb R}^p and the outlier distribution Q. Under the squared L_2 loss, let R_{n,\varepsilon}^\star be the minimax risk with n samples and outlier probability \varepsilon

The central claim of this section is that 

R_{n,\varepsilon}^\star = \Omega\left(\frac{p}{n} + \varepsilon^2 \right).

This lower bound is tight for the Tukey median and some other estimators; we refer to the bibliographic notes. Since the \Omega(p/n) lower bound even holds when \varepsilon = 0, we may restrict our attention to showing that R_{n,\varepsilon}^\star = \Omega(\varepsilon^2)

To apply the two-point method, we need to choose both (\theta_1,\theta_2) and (Q_1,Q_2), where the latter is more important in this robust setting. Since Q_1 and Q_2 can be arbitrary probability distributions, we actually have a large degree of freedom and can actually achieve perfect ambiguity here. This phenomenon is summarized in the following lemma. 

Lemma 11 Let P_1, P_2 be any probability distributions with \| P_1 - P_2 \|_{\text{\rm TV}}\le \varepsilon/(1-\varepsilon). Then there exist probability distributions Q_1, Q_2 such that

(1-\varepsilon)P_1 +\varepsilon Q_1 = (1-\varepsilon)P_2 + \varepsilon Q_2.

Proof: Consider the signed measure Q=\frac{1-\varepsilon}{\varepsilon}(P_2 - P_1). Clearly Q(\Omega)=0, and by assumption |Q|(\Omega) \le 2, where |Q| denotes the total variation measure of Q. Hence, by Jordan decomposition of signed measures, we have Q = Q_1 - Q_2, where measures Q_1, Q_2 are mutually singular and Q_1(\Omega) = Q_2(\Omega)\le 1. Finally, adding any common measure \mu with \mu(\Omega)=1-Q_1(\Omega) to both Q_1, Q_2 completes the proof. \Box 

By Lemma 11, it suffices to find (\theta_1, \theta_2) such that 

\| \mathcal{N}(\theta_1, I_p) - \mathcal{N}(\theta_2, I_p) \|_{\text{\rm TV}} = 2\Phi\left(\frac{\|\theta_1 - \theta_2\|_2}{2} \right) - 1 \le \frac{\varepsilon}{1-\varepsilon},

and the incurred loss is \Omega(\|\theta_1 - \theta_2\|_2^2). Hence, the lower bound \Omega(\varepsilon^2) is completed by choosing \|\theta_1 - \theta_2\|_2 \asymp \varepsilon

3.5. Example V: Two-armed Bandits 

In this section we study the simplest online learning problem, namely, the two-armed bandit. At each time t\in [T], the learner can pull one of the two arms, and the reward of pulling arm i=1,2 follows a standard normal distribution r_{t,i}\sim \mathcal{N}(\mu_i,1), where the mean parameters \mu_1, \mu_2 are unknown to the learner. A policy \pi = (\pi_t)_{t\in [T]} is a sequence where \pi_t\in \{1,2\} denotes the arm to pull at time t, where \pi_t can depend on all past policies (\pi_\tau)_{\tau\in [t-1]} and past rewards (r_{\tau,\pi_\tau})_{\tau\in [t-1]}. The regret of a policy \pi is defined as the difference of the cumulative reward of the best arm and that of the policy, i.e., 

R_T(\pi) := T\max\{\mu_1,\mu_2\} - \sum_{t=1}^T \mu_{\pi_t}.

The target of the online learning is to devise a policy \pi with a small expected regret, where the expectation is taken over the randomness of the rewards (and therefore the random realizations of the policy). A fundamental tradeoff in online learning is the exploration-exploitation tradeoff, where the learner needs to pull the suboptimal arm sufficiently many times to identify the best arm with better precision, and also needs to exploit the current knowledge to focus on the arm which seems to be the best. 

In this section we aim to prove minimax lower bounds of the regrets in the two-armed bandits. First consider the case where \mu_1, \mu_2 can take any values in [0,1]. The two-point method suggests that we may find two worlds such that 

  1. there is no fixed policy whose regrets are small in both worlds; 
  2. these two worlds are hard to distinguish. 

To do so, we may take (\mu_1,\mu_2) = (\Delta,0) in the first world, and (\mu_1,\mu_2) = (0,\Delta) in the second world, where \Delta>0 is some parameter to be chosen. Note that in both worlds, pulling a wrong arm incurs an instanteous regret of \Delta. Let P_1^t and P_2^t be the distribution of the rewards up to time t in respective worlds, then the minimax regret is lower bounded by 

R_T^\star \ge \Delta \sum_{t=1}^T \frac{P_1^{t-1}(\pi_t = 2) + P_2^{t-1}(\pi_t = 1) }{2}.

Using Lemma 1 and 

\|P_1^t - P_2^t\|_{\text{\rm TV}}^2 \le \frac{1}{2}D_{\text{KL}}(P_1^t\|P_2^t) = \frac{t\Delta^2}{4} \le \frac{T\Delta^2}{4},

we conclude that R_T^\star = \Omega(\sqrt{T}) by choosing \Delta \asymp T^{-1/2}

We also consider the case where there is a known suboptimality gap \Delta between arms, i.e., |\mu_1 - \mu_2| \ge \Delta>0. Using the above arguments and the inequality 

\|P-Q\|_{\text{TV}} \le 1 - \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q) ),

we will arrive at 

R_T^\star = \Omega\left(\sum_{t=1}^T \Delta\exp(-t\Delta^2) \right) = \Omega\left(\frac{1-\exp(-T\Delta^2)}{\Delta}\right),

which converges to a constant 1/\Delta as T\rightarrow\infty. However, this lower bound is not tight – we will show that R_T^\star \sim (\log T)/\Delta as T gets large. To show this stronger lower bound, we need a more delicate construction of the two points. 

Set (\mu_1, \mu_2) = (\Delta,0) in world I, and (\mu_1,\mu_2) = (\Delta,2\Delta) in world II. Let P_1^t, P_2^t be defined as before, then 

D_{\text{KL}}(P_1^t \| P_2^t) \le 2\Delta^2\cdot \mathop{\mathbb E}_1[T_2],

where T_2 is the total number of pulls of arm 2, and \mathop{\mathbb E}_1 denotes the expectation under world I. Hence, applying the previous arguments, we have 

R_T^\star = \Omega(T\Delta\cdot \exp(-2\Delta^2 \mathop{\mathbb E}_1[T_2])).

Clearly, this is a better bound for small \mathop{\mathbb E}_1[T_2], while it is less useful when \mathop{\mathbb E}_1[T_2] is large. However, a large \mathop{\mathbb E}_1[T_2] implies that the learner pulls the suboptimal arm (i.e., arm 2) too many times in world I, and in mathematical words we have 

R_T^\star = \Omega(\Delta\cdot \mathop{\mathbb E}_1[T_2]).

Note that these two bounds exactly characterize the tradeoff between exploitation and exploration. Combining these bounds, we have 

R_T^\star = \Omega(\Delta\cdot \mathop{\mathbb E}_1[T_2] + T\Delta\cdot \exp(-2\Delta^2 \mathop{\mathbb E}_1[T_2]) ) = \Omega\left(\frac{\max\{1, \log(T\Delta^2) \} }{\Delta} \right),

where the second inequality follows from the minimization of the inner sum with respect to \mathop{\mathbb E}_1[T_2]\in [0,T]

In summary, we have two different lower bounds: when there is no assumption on the optimality gap, then R_T^\star = \Omega(\sqrt{T}); when the optimality gap is at least \Delta>0, then R_T^\star = \Omega(\max\{1, \log(T\Delta^2) \} / \Delta ). Both bounds are tight and there exist policies attaining these bounds; we refer to bibliographic notes for details. 

3.6. Example VI: Multi-armed Bandits 

Interestingly, the two-point method can also give the correct minimax lower bound of the regret for general multi-armed bandits (in later lectures, we will provide a proof using multiple points). However, the following application of the two-point method differs from all previous ones. In previous examples, we fix two points and prove that all estimators (or policies) cannot achieve small risk (or regret) under both points. In this example, the construction of the two points depends on the policy. 

We first define the multi-armed bandit problem. In fact, the only difference from the two-armed case is that now there are K arms in total, and the i-th arm has reward distribution \mathcal{N}(\mu_i,1). The regret of a given policy \pi is defined as 

R_T(\pi) = T\max_{i\in [K]} \mu_i - \sum_{t=1}^T \mu_{\pi_t}.

To apply the two-point method, let the first point be (\mu_1,\cdots,\mu_K)=(\Delta,0,\cdots,0). Clearly, the first arm is the optimal arm, and pulling any wrong arm incurs an instanteous regret \Delta. For any given policy \pi, the choice of the second point will depend on the policy: for i\neq 1, let \mathop{\mathbb E}[T_i] be the expected number of pulls of arm i for policy \pi under the first point. Since \sum_{i=2}^K \mathop{\mathbb E}[T_i]\le T, there must exist some i_0\neq 1 such that \mathop{\mathbb E}[T_{i_0}]\le T/(K-1). In other words, under the first point, the policy \pi does not pull arm i_0 much in expectation. Note that i_0 depends on the policy \pi

Now the second point is as follows: (\mu_1,\cdots,\mu_K) = (\Delta,0,\cdots,0,2\Delta,0,\cdots,0), where the only difference from the first point is that \mu_{i_0}=2\Delta. We claim that the two points are hard to distinguish: let P_1, P_2 be the distribution of all observed rewards under these two points, respectively, then 

D_{\text{KL}}(P_1 \| P_2) = 2\Delta^2\cdot \mathop{\mathbb E}[T_{i_0}] \le \frac{2\Delta^2 T}{K-1}.

Hence, by choosing \Delta \asymp \sqrt{K/T} we have D_{\text{KL}}(P_1 \| P_2) = O(1), and consequently the worst-case regret for the policy \pi is at least \Omega(T\Delta) = \Omega(\sqrt{KT}). By the arbitrariness of \pi, this is a minimax lower bound. This bound is tight for multi-armed bandits. 

The idea of constructing points based on given approaches is widely used in problems where several rounds of adaptivity are available. One future lecture will be devoted exclusively to this topic. 

4. Bibliographic Notes 

The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977) (see also Le Cam (1973) for the binary case). The general f-divergence was proposed by Csiszár (1967), and the joint range of f-divergences (Theorem 4) was established in Harremoës and Vajda (2011). For the results in convex analysis, the Choquet–Bishop–de Leeuw theorem can be found in Phelps (2001), and the Fenchel–Eggleston–Carathéodory theorem can be found in Eggleston (1958). The inequalities between f-divergences and the two-point method can be mostly found in the excellent textbook Tsabykov (2009). 

For the examples, the first lower bound of entropy estimation is taken from Wu and Yang (2016). For the second lower bound, Lemma 9 was proved in Valiant (2008) under the Poissonized model, and Lemma 10 is taken from Archaya et al. (2016). For robust statistics, Huber’s \varepsilon-contamination model dates back to Huber (1964), and the general lower bound technique in Lemma 11 is taken from Chen, Gao and Ren (2016). For the robust estimators achieving the lower bound, see Chen, Gao and Ren (2018) and Ilias et al. (2018). 

The lower bounds of bandit problems are taken from Voger (1960) and Lai and Robbins (1985). We also refer to Bubeck, Perchet and Rigollet (2013) and Lattimore and Szepesvári (2018) for non-asymptotic bounds. For more on bandit problems such as the upper bounds, we refer to the book Bubeck and Cesa-Bianchi (2012). 

  1. Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977. 
  2. Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics 1.1 (1973): 38-53. 
  3. Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica 2 (1967): 229-318. 
  4. Peter Harremoës and Igor Vajda. On pairs of f-divergences and their joint range. IEEE Transactions on Information Theory 57.6 (2011): 3230-3235. 
  5. Robert R. Phelps. Lectures on Choquet’s theorem. Springer Science & Business Media, 2001. 
  6. H. G. Eggleston. Convexity. Cambridge University Press, 1958. 
  7. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. 
  8. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720. 
  9. Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing 40.6 (2011): 1927-1968. 
  10. Jayadev Acharya et al. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory 63.1 (2016): 38-56. 
  11. Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics (1964): 73-101. 
  12. Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber’s \varepsilon-contamination model. Electronic Journal of Statistics 10.2 (2016): 3752-3774. 
  13. Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. The Annals of Statistics 46.5 (2018): 1932-1960. 
  14. Ilias Diakonikolas et al. Robustly learning a gaussian: Getting optimal error, efficiently. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2018. 
  15. Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics 31.2 (1960): 430-443. 
  16. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6.1 (1985): 4-22. 
  17. Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. Conference on Learning Theory, 2013. 
  18. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint (2018). 
  19. Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5.1 (2012): 1-122. 

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.