Lecture 9: Multiple Hypothesis Testing: More Examples

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In the last lecture we have seen the general tools and concrete examples of reducing the statistical estimation problem to multiple hypothesis testing. In these examples, the loss function is typically straightforward and the construction of the hypotheses is natural. However, other than statistical estimation the loss function may become more complicated (e.g., the excess risk in learning theory), and the hypothesis constructions may be implicit. To better illustrate the power of the multiple hypothesis testing, this lecture will be exclusively devoted to more examples in potentially non-statistical problems. 

1. Example I: Density Estimation 

This section considers the most fundamental problem in nonparametric statistics, i.e., the density estimation. It is well-known in the nonparametric statistics literature that, if a density on the real line has Hölder smoothness parameter s>0, it then can be estimated within L_2 accuracy \Theta(n^{-s/(2s+1)}) based on n samples. However, in this section we will show that this is no longer the correct minimax rate when we generalize to other notions of smoothness and other natural loss functions. 

Fix an integer k\ge 1 and some norm parameter p\in [1,\infty], the Sobolev space \mathcal{W}^{k,p}(L) on [0,1] is defined as 

\mathcal{W}^{k,p}(L) := \{f\in C[0,1]: \|f\|_p + \|f^{(k)}\|_p \le L \},

where the derivative f^{(k)} is defined in terms of distributions (f not necessarily belongs to C^k[0,1]). Our target is to determine the minimax rate R_{n,k,p,q}^\star of estimating the density f\in \mathcal{W}^{k,p}(L) based on n i.i.d. observations from f, under the general L_q loss with q\in [1,\infty). For simplicity we suppress the dependence on L and assume that L is a large positive constant.

The main result of this section is as follows: 

R_{n,k,p,q}^\star = \begin{cases} \Omega(n^{-\frac{k}{2k+1}}) & \text{if } q<(1+2k)p, \\ \Omega((\frac{\log n}{n})^{\frac{k-1/p + 1/q}{2(k-1/p) + 1/q}} ) & \text{if } q \ge (1+2k)p. \end{cases}

This rate is also tight. We see that as k,p are fixed and q increases from 1 to \infty, the minimax rate for density estimation increases from n^{-k/(2k+1)} to (\log n/n)^{(k-1/p)/(2(k-1/p)+1) }. We will show that these rates correspond to the “dense” and the “sparse” cases, respectively.

1.1. Dense case: q<(1+2k)p 

We first look at the case with q<(1+2k)p, which always holds for Hölder smoothness where the norm parameter is p=\infty. The reason why it is called the “dense” case is that in the hypothesis construction, the density is supported everywhere on [0,1], and the difficulty is to classify the directions of the bumps around some common density. Specifically, let h\in (0,1) be a bandwidth parameter to be specified later, we consider 

f_v(x) := 1 + \sum_{i=1}^{h^{-1}} v_i\cdot h^sg\left(\frac{x - (i-1)h}{h}\right)

for v\in \{\pm 1\}^{1/h}, where g is a fixed smooth function supported on [0,1] with \int_0^1 g(x)dx=0, and s>0 is some tuning parameter to ensure that f_v\in \mathcal{W}^{k,p}(L). Since

f_v^{(k)}(x) = \sum_{i=1}^{h^{-1}} v_i\cdot h^{s-k}g^{(k)}\left(\frac{x - (i-1)h}{h}\right),

we conclude that the choice s=k is sufficient. To invoke the Assoaud’s lemma, first note that for q=1 the separation condition is fulfilled with \Delta = \Theta(h^{k+1}). Moreover, for any v,v'\in \{\pm 1\}^{1/h} with d_{\text{H}}(v,v') = 1, we have

\chi^2(f_v^{\otimes n}, f_{v'}^{\otimes n}) + 1 \le \left( 1 + \frac{ \|f_v - f_{v'}\|_2^2 }{1 - h^k\|g\|_\infty} \right)^n = \left(1 + \frac{ 4h^{2k+1}\|g\|_2^2 }{1 - h^k\|g\|_\infty} \right)^n.

Hence, the largest h>0 to ensure that \chi^2(f_v^{\otimes n}, f_{v'}^{\otimes n}) = O(1) is h = \Theta(n^{-1/(2k+1)}), which for q=1 gives the minimax lower bound

R_{n,k,p,q}^\star = \Omega(n^{-\frac{k}{2k+1}} ).

This lower bound is also valid for the general case q\ge 1 due to the monotonicity of L_q norms.

1.2. Sparse case: q\ge (1+2k)p 

Next we turn to the sparse case q\ge (1+2k)p, where the name of “sparse” comes from the fact that for the hypotheses we will construct densities with only one bump on a small interval, and the main difficulty is to locate that interval. Specifically, let h\in (0,1), s>0 be bandwidth and smoothness parameters to be specified later, and we consider 

f_i(x) := h^sg\left(\frac{x-(i-1)h}{h}\right)\cdot 1\left((i-1)h<x\le ih\right) + (1-h^{s+1}),

where i\in [h^{-1}], and g is a fixed smooth density on [0,1]. Clearly f_i(x) is a density on [0,1] for all i\in [M], and

f_i^{(k)}(x) = h^{s-k}g^{(k)} \left(\frac{x-(i-1)h}{h}\right)\cdot 1\left((i-1)h<x\le ih\right).

Consequently, \|f_i^{(k)}\|_p = h^{s-k+1/p}\|g^{(k)}\|_p, and the Sobolev ball requirement leads to the choice s = k-1/p.

Next we check the conditions of Fano’s inequality. Since 

\|f_i - f_j\|_q = h^{k-1/p+1/q} \cdot 2^{1/q}\|g\|_q, \qquad \forall i\neq j\in [h^{-1}],

the separation condition is fulfilled with \Delta = \Theta(h^{k-1/p+1/q}). Moreover, since

D_{\text{KL}}(f_i \| f_j) \le \frac{\|f_i - f_j\|_2^2}{1-h^s} = \frac{2\|g\|_2^2}{1-h^s}\cdot h^{2(k-1/p)+1},

we conclude that I(V;X)= O(nh^{2(k-1/p)+1}). Consequently, Fano’s inequality gives

R_{n,k,p,q}^\star = \Omega\left(h^{k-1/p+1/q}\left(1 - \frac{O(nh^{2(k-1/p)+1})+\log 2}{\log(1/h)} \right) \right),

and choosing h = \Theta((\log n/n)^{1/(2(k-1/p)+1)} ) gives the desired result.

2. Example II: Aggregation 

In estimation or learning problems, sometimes the learner is given a set of candidate estimators or predictors, and she aims to aggregate them into a new estimate based on the observed data. In scenarios where the candidates are not explicit, aggregation procedures can still be employed based on sample splitting, where the learner splits the data into independent parts, uses the first part to construct the candidates and the second part to aggregate them. 

In this section we restrict ourselves to the regression setting where there are n i.i.d. observations x_1,\cdots,x_n\sim P_X, and 

y_i = f(x_i )+ \xi_i, \qquad i\in [n],

where \xi_i\sim \mathcal{N}(0,1) is the independent noise, and f: \mathcal{X}\rightarrow {\mathbb R} is an unknown regression function with \|f\|_\infty \le 1. There is also a set of candidates \mathcal{F}=\{f_1,\cdots,f_M\}, and the target of aggregation is to find some \hat{f}_n (not necessarily in \mathcal{F}) to minimize

\mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in \Theta} \|f_\lambda - f\|_{L_2(P_X)}^2\right),

where the expectation is taken over the random observations (x_1,y_1),\cdots,(x_n,y_n), f_\lambda(x) := \sum_{i=1}^M \lambda_i f_i(x) for any \lambda \in {\mathbb R}^M, and \Theta\subseteq {\mathbb R}^p is a suitable subset corresponding to different types of aggregation. For a fixed data distribution P_X, the minimax rate of aggregation is defined as the minimum worst-case excess risk over all bounded functions f and candidate functions \{f_1,\cdots,f_M\}:

R_{n,M}^{\Theta} := \sup_{f_1,\cdots,f_M} \inf_{\hat{f}_n} \sup_{\|f\|_\infty\le 1} \mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in \Theta} \|f_\lambda - f\|_{L_2(P_X)}^2\right).

Some special cases are in order:

  1. When \Theta = {\mathbb R}^M, the estimate \hat{f}_n is compared with the best linear aggregates and this is called linear aggregation, with optimal rate denoted as R^{\text{L}}
  2. When \Theta = \Lambda^M = \{\lambda\in {\mathbb R}_+^M: \sum_{i=1}^M \lambda_i\le 1 \}, the estimate \hat{f}_n is compared with the best convex combination of candidates (and zero) and this is called convex aggregation, with optimal rate denoted as R^{\text{C}}
  3. When \Theta = \{e_1,\cdots,e_M\}, the set of canonical vectors, the estimate \hat{f}_n is compared with the best candidate in \mathcal{F} and this is called model selection aggregation, with optimal rate denoted as R^{\text{MS}}

The main result in this section is summarized in the following theorem. 

Theorem 1 If there is a cube \mathcal{X}_0\subseteq \mathcal{X} such that P_X admits a density lower bounded from below w.r.t. the Lebesgue measure on \mathcal{X}_0, then

R_{n,M}^\Theta = \begin{cases} \Omega(1\wedge \frac{M}{n}) &\text{if } \Theta = {\mathbb R}^M, \\ \Omega(1\wedge (\frac{M}{n} + \sqrt{\frac{1}{n}\log(\frac{M}{\sqrt{n}}+1)}) ) & \text{if } \Theta = \Lambda^M, \\ \Omega(1\wedge \frac{\log M}{n}) & \text{if } \Theta = \{e_1,\cdots,e_M\}. \end{cases}

We remark that the rates in Theorem 1 are all tight. In the upcoming subsections we will show that although the loss function of aggregation becomes more complicated, the idea of multiple hypothesis testing can still lead to tight lower bounds. 

2.1. Linear aggregation 

Since the oracle term \inf_{\lambda \in {\mathbb R}^d} \|f_\lambda - f\|_{L^2(P_X)}^2 is hard to deal with, a natural idea would be to consider a well-specified model such that this term is zero. Since P_X admits a density, we may find a partition \mathcal{X} = \cup_{i=1}^M \mathcal{X}_i such that P_X(\mathcal{X}_i) = M^{-1} for all i. Consider the candidate functions f_i(x) = 1(x\in \mathcal{X}_i), and for v\in \{0, 1\}^M, let 

f_v(x) := \gamma\sum_{i=1}^M v_if_i(x),

where \gamma\in (0,1) is to be specified.

To apply the Assoaud’s lemma, note that for the loss function 

L(f,\hat{f}) := \|\hat{f} - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in {\mathbb R}^p} \|f_\lambda - f\|_{L_2(P_X)}^2,

the orthogonality of the candidates (f_i)_{i\in [M]} implies that the separability condition holds for \Delta = \gamma^2/2M. Moreover,

\max_{d_{\text{H}}(v,v')=1 } D_{\text{KL}}(P_{f_v}^{\otimes n} \| P_{f_{v'}}^{\otimes n}) = \max_{d_{\text{H}}(v,v')=1 } \frac{n\|f_v - f_{v'}\|_{L_2(P_X)}^2}{2} = O\left(\frac{n\gamma^2}{M}\right),

Therefore, the Assoud’s lemma (with Pinsker’s inequality) gives

R_{n,M}^{\text{L}} \ge \frac{\gamma^2}{4}\left(1 - O\left( \sqrt{\frac{n\gamma^2}{M}}\right)\right).

Choosing \gamma = 1\wedge \sqrt{M/n} completes the proof.

2.2. Convex aggregation 

The convex aggregation differs only with the linear aggregation in the sense that the linear coefficients must be non-negative and the sum is at most one. Note that the only requirement in the previous arguments is the orthogonality of (f_i)_{i\in [M]} under L_2(P_X), we may choose any orthonormal functions (f_i)_{i\in [M]} under L_2(dx) with \max_i\|f_i\|_\infty = O(1) (existence follows from the cube assumption) and use the density lower bound assumption to conclude that \|f_i\|_{L_2(P_X)} = \Omega(\|f_i\|_2) (to ensure the desired separation). Then the choice of \gamma above becomes O(n^{-1/2}), and we see that the previous arguments still hold for convex aggregation if M = O(\sqrt{n}). Hence, it remains to prove that when M = \Omega(\sqrt{n})

R_{n,M}^{\text{C}} = \Theta\left(1\wedge \sqrt{\frac{1}{n}\log\left(\frac{M}{\sqrt{n}}+1\right) }\right).

Again we consider the well-specified case where 

f_v(x) := \gamma\sum_{i=1}^M v_if_i(x),

with the above orthonormal functions (f_i), a constant scaling factor \gamma = O(1), and a uniform size-m subset of (v_i) being 1/m and zero otherwise (m\in \mathbb{N} is to be specified). Since the vector v is no longer the vertex of a hypercube, we apply the generalized Fano’s inequality instead of Assoaud. First, applying Lemma 4 in Lecture 8 gives

I(V; X) \le \mathop{\mathbb E}_V D_{\text{KL}}(P_{f_V}^{\otimes n} \| P_{f_0}^{\otimes n}) = O\left(\frac{n}{m}\right).

Second, as long as m\le M/3, using similar arguments as in the sparse linear regression example in Lecture 8, for \Delta = c/m with a small constant c>0 we have

p_\Delta := \sup_{f} \mathop{\mathbb P}\left(\|f_V - f\|_{L^2(P_X)}^2 \le \Delta \right) \le \left(\frac{M}{m}+1\right)^{c'm},

with c'>0. Therefore, the generalized Fano’s inequality gives

R_{n,M}^{\text{C}} \ge \frac{c}{m}\left(1 - \frac{O(n/m) + \log 2}{c'm\log\left(\frac{M}{m}+1\right)} \right),

and choosing m \asymp \sqrt{n/\log(M/\sqrt{n}+1)} = O(M) completes the proof.

2.3. Model selection aggregation 

As before, we consider the well-specified case where we select orthonormal functions (\phi_i)_{i\in [M]} on L_2(dx), and let f_i = \gamma \phi_i, f\in \{f_1,\cdots,f_M\}. The orthonormality of (\phi_i) gives \Delta = \Theta(\gamma^2) for the separation condition of the original Fano’s inequality, and 

\max_{i\neq j} D_{\text{KL}}(P_{f_i}^{\otimes n}\| P_{f_j}^{\otimes n}) = \max_{i\neq j} \frac{\|f_i - f_j\|_{L_2(P_X)}^2}{2} = O\left(n\gamma^2\right).

Hence, Fano’s inequality gives

R_{n,M}^{\text{MS}} = \Omega\left(\gamma^2 \left(1 -\frac{n\gamma^2 + \log 2}{\log M}\right)\right),

and choosing \gamma^2 \asymp 1\wedge n^{-1}\log M completes the proof.

We may show a further result that any model selector cannot attain the above optimal rate of model selection aggregation. Hence, even to compare with the best candidate function in \mathcal{F}, the optimal aggregate \hat{f} should not be restricted to \mathcal{F}. Specifically, we have the following result, whose proof is left as an exercise. 

Exercise 1 Under the same assumptions of Theorem 1, show that

\sup_{f_1,\cdots,f_M} \inf_{\hat{f}_n\in\{f_1,\cdots,f_M\} } \sup_{\|f\|_\infty\le 1} \mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \min_{i\in [M]} \|f_i - f\|_{L_2(P_X)}^2\right) = \Omega\left(1\wedge \sqrt{\frac{\log M}{n}}\right).

3. Example III: Learning Theory 

Consider the classification problem where there are n training data (x_1,y_1), \cdots, (x_n, y_n) i.i.d. drawn from an unknown distribution P_{XY}, with y_i \in \{\pm 1\}. There is a given collection of classifiers \mathcal{F} consisting of functions f: \mathcal{X}\rightarrow \{\pm 1\}, and given the training data, the target is to find some classifier \hat{f}\in \mathcal{F} with a small excess risk 

R_{\mathcal{F}}(P_{XY}, \hat{f}) = \mathop{\mathbb E}\left[P_{XY}(Y \neq \hat{f}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))\right],

which is the difference in the performance of the chosen classifer and the best classifier in the function class \mathcal{F}. In the definition of the excess risk, the expectation is taken with respect to the randomness in the training data. The main focus of this section is to characterize the minimax excess risk of a given function class \mathcal{F}, i.e.,

R_{\text{pes}}^\star(\mathcal{F}) := \inf_{\hat{f}}\sup_{P_{XY}} R_{\mathcal{F}}(P_{XY}, \hat{f}).

The subscript “pes” here stands for “pessimistic”, where P_{XY} can be any distribution over \mathcal{X} \times \{\pm 1\} and there may not be a good classifier in \mathcal{F}, i.e., \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X)) may be large. We also consider the optimistic scenario where there exists a perfect (error-free) classifer in \mathcal{F}. Mathematically, denoting by \mathcal{P}_{\text{opt}}(\mathcal{F}) the collection of all probability distributions P_{XY} on \mathcal{X}\times \{\pm 1\} such that \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))=0, the minimax excess risk of a given function class \mathcal{F} in the optimistic case is defined as

R_{\text{opt}}^\star(\mathcal{F}) := \inf_{\hat{f}}\sup_{P_{XY} \in \mathcal{P}_{\text{opt}}(\mathcal{F})} R_{\mathcal{F}}(P_{XY}, \hat{f}).

The central claim of this section is the following: 

Theorem 2 Let the VC dimension of \mathcal{F} be d. Then

R_{\text{\rm opt}}^\star(\mathcal{F}) = \Omega\left(1\wedge \frac{d}{n}\right), \qquad R_{\text{\rm pes}}^\star(\mathcal{F}) = \Omega\left(1\wedge \sqrt{\frac{d}{n}}\right).

Recall that the definition of VC dimension is as follows: 

Definition 3 For a given function class \mathcal{F} consisting of mappings from \mathcal{X} to \{\pm 1\}, the VC dimension of \mathcal{F} is the largest integer d such that there exist d points from \mathcal{X} which can be shattered by \mathcal{F}. Mathematically, it is the largest d>0 such that there exist x_1,\cdots,x_d\in \mathcal{X}, and for all v\in \{\pm 1\}^d, there exists a function f_v\in \mathcal{F} such that f_v(x_i) = v_i for all i\in [d]

VC dimension plays a significant role in statistical learning theory. For example, it is well-known that for the empirical risk minimization (ERM) classifier 

\hat{f}^{\text{ERM}} = \arg\min_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n 1(y_i \neq f(x_i)),

we have

\mathop{\mathbb E}\left[P_{XY}(Y \neq \hat{f}^{\text{ERM}}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))\right] \le 1\wedge \begin{cases} O(d/n) & \text{in optimistic case}, \\ O(\sqrt{d/n}) & \text{in pessimistic case}. \end{cases}

Hence, Theorem 2 shows that the ERM classifier attains the minimax excess risk for all function classes, and the VC dimension exactly characterizes the difficulty of the learning problem.

3.1. Optimistic case 

We first apply the Assoaud’s lemma to the optimistic scenario. By the definition of VC dimension, there exist points x_1,\cdots,x_d\in \mathcal{X} and functions f_v\in \mathcal{F} such that for all v\in \{\pm 1\}^d and i\in [d], we have f_v(x_i) = v_i. Consider 2^{d-1} hypotheses indexed by u\in \{\pm 1\}^{d-1}: the distribution P_X is always 

P_X(\{x_i\}) = p > 0, \quad \forall i\in [d-1], \qquad P_X(\{x_d\}) = 1-p(d-1),

where p\in (0,(d-1)^{-1}) is to be specified later. For the conditional distribution P_{Y|X}, let Y = f_{(u,1)}(X) hold almost surely under the joint distribution P_u. Clearly this is the optimistic case, for there always exists a perfect classifier in \mathcal{F}.

We first examine the separation condition in Assoaud’s lemma, where the loss function here is 

L(P_{XY}, \hat{f}) := P_{XY}(Y \neq \hat{f}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X)).

For any u,u'\in \{\pm 1\}^{d-1} and any \hat{f}\in \mathcal{F}, we have

\begin{array}{rcl} L(P_u, \hat{f}) + L(P_{u'}, \hat{f}) &=& p\cdot \sum_{i=1}^{d-1} 1(\hat{f}(x_i) \neq f_u(x_i)) + 1(\hat{f}(x_i) \neq f_{u'}(x_i)) \\ &=& p\cdot \sum_{i=1}^{d-1} 1(\hat{f}(x_i) \neq u_i) + 1(\hat{f}(x_i) \neq u_i') \\ &\ge & p\cdot d_{\text{H}}(u,u'), \end{array}

and therefore the separation condition holds for \Delta = p. Moreover, if u and u' only differ in the i-th component, then P_u and P_{u'} are completely indistinguishable if x_i does not appear in the training data. Hence, by coupling,

\max_{d_{\text{H}}(u,u')=1 } \|P_u^{\otimes n} - P_{u'}^{\otimes n}\|_{\text{\rm TV}} \le \left(1 - p\right)^n.

Therefore, Assoaud’s lemma gives

R_{\text{opt}}^\star(\mathcal{F}) \ge \frac{(d-1)p}{2}\left(1 - (1-p)^n\right),

and choosing p = \min\{n^{-1},(d-1)^{-1}\} yields to the desired lower bound.

3.2. Pessimistic case 

The analysis of the pessimistic case only differs in the construction of the hypotheses. As before, fix x_1,\cdots,x_d\in \mathcal{X} and f_v\in \mathcal{F} such that f_v(x_i) = v_i for all v\in \{\pm 1\}^d and i\in [d]. Now let P_X be the uniform distribution on \{x_1,\cdots,x_d\}, and under P_v, the conditional probability P_{Y|X} is 

P_v(Y= v_i | X=x_i) = \frac{1}{2} + \delta, \qquad \forall i\in [d],

and \delta\in (0,1/2) is to be specified later. In other words, the classifier f_v only outperforms the random guess by a margin of \delta under P_v.

Again we apply the Assoaud’s lemma for the lower bound of R_{\text{pes}}^\star(\mathcal{F}). First note that for all v\in \{\pm 1\}^d

\min_{f^\star \in \mathcal{F}} P_v(Y \neq f^\star(X)) = \frac{1}{2} - \delta.

Hence, for any v\in \{\pm 1\}^d and any f\in \mathcal{F},

\begin{array}{rcl} L(P_v, f) &=& \frac{1}{d}\sum_{i=1}^d \left[\frac{1}{2} + \left(2\cdot 1(f(x_i) = v_i) - 1\right)\delta \right] - \left(\frac{1}{2}-\delta\right) \\ &=& \frac{2\delta}{d}\sum_{i=1}^d 1(f(x_i) = v_i). \end{array}

By triangle inequality, the separation condition is fulfilled with \Delta = 2\delta/d. Moreover, direct computation yields

\max_{d_{\text{H}}(v,v')=1 } H^2(P_v, P_{v'}) = \frac{1}{d}\left(\sqrt{\frac{1}{2}+\delta} - \sqrt{\frac{1}{2}-\delta}\right)^2 = O\left(\frac{\delta^2}{d}\right),

and therefore tensorization gives

\max_{d_{\text{H}}(v,v')=1 } H^2(P_v^{\otimes n}, P_{v'}^{\otimes n}) = 2 - 2\left(1 - O\left(\frac{\delta^2}{d}\right) \right)^n.

Finally, choosing \delta = (1\wedge \sqrt{d/n})/2 gives the desired result.

3.3. General case 

We may interpolate between the pessimistic and optimistic cases as follows: for any given \varepsilon\in (0,1/2], we restrict to the set \mathcal{P}(\mathcal{F},\varepsilon) of joint distributions with 

\sup_{P_{XY}\in \mathcal{P}(\mathcal{F},\varepsilon) }\inf_{f^\star\in \mathcal{F}} P_{XY}(Y\neq f^\star(X))\le \varepsilon.

Then we may define the minimax excess risk over \mathcal{P}(\mathcal{F},\varepsilon) as

R^\star(\mathcal{F},\varepsilon) := \inf_{\hat{f}}\sup_{P_{XY} \in \mathcal{P}(\mathcal{F},\varepsilon)} R_{\mathcal{F}}(P_{XY}, \hat{f}).

Clearly the optimistic case corresponds to \varepsilon =0, and the pessimistic case corresponds to \varepsilon = 1/2. Similar to the above arguments, we have the following lower bound on R^\star(\mathcal{F},\varepsilon) , whose proof is left as an exercise.

Exercise 2 Show that when the VC dimension of \mathcal{F} is d, then

R^\star(\mathcal{F},\varepsilon) = \Omega\left(1\wedge \left(\frac{d}{n} + \sqrt{\frac{d}{n}\cdot \varepsilon}\right) \right).

4. Example IV: Stochastic Optimization 

Consider the following oracle formulation of the convex optimizaton: let f: \Theta\rightarrow {\mathbb R} be the convex objective function, and we aim to find the minimum value \min_{\theta\in\Theta} f(\theta). To do so, the learner can query some first-order oracle adaptively, where given the query \theta_t\in \Theta the oracle outputs a pair (f(\theta_t), g(\theta_t)) consisting of the objective value at \theta_t (zeroth-order information) and a sub-gradient of f at \theta_t (first-order information). The queries can be made in an adaptive manner where \theta_t can depend on all previous outputs of the oracle. The target of convex optimization is to determine the minimax optimization gap after T queries defined as 

R_T^\star(\mathcal{F}, \Theta) = \inf_{\theta_T}\sup_{f\in \mathcal{F}} \left( f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta) \right),

where \mathcal{F} is a given class of convex functions, and the final query \theta_{T+1} can only depend on the past outputs of the functions.

Since there is no randomness involved in the above problem, the idea of multiple hypothesis testing cannot be directly applied here. Therefore, in this section we consider a simpler problem of stochastic optimization, and we postpone the general case to later lectures. Specifically, suppose that the above first-order oracle \mathcal{O} is stochastic, i.e., it only outputs (\widehat{f}(\theta_t), \widehat{g}(\theta_t)) such that \mathop{\mathbb E}[\widehat{f}(\theta_t)] = f(\theta_t), and \mathop{\mathbb E}[\widehat{g}(\theta_t)] \in \partial f(\theta_t). Let \mathcal{F} be the set of all convex L-Lipschitz functions (in L_2 norm), \Theta = \{\theta\in {\mathbb R}^d: \|\theta\|_2\le R \}, and assume that the subgradient estimate returned by the oracle \mathcal{O} always satisfies \|\widehat{g}(\theta_t)\|_2\le L as well. Now the target is to determine the quantity 

R_{T,d,L,R}^\star = \inf_{\theta_T} \sup_{f \in \mathcal{F}} \sup_{\mathcal{O}} \left(\mathop{\mathbb E} f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta)\right),

where the expectation is taken over the randomness in the oracle output. The main result in this section is summarized in the following theorem:

Theorem 4

R_{T,d,L,R}^\star = \Omega\left(\frac{LR}{\sqrt{T}}\right).

Since it is well-known that the stochastic gradient descent attains the optimization gap O(LR/\sqrt{T}) for convex Lipschitz functions, the lower bound in Theorem 4 is tight. 

Now we apply the multiple hypothesis testing idea to prove Theorem 4. Since the randomness only comes from the oracle, we should design the stochastic oracle carefully to reduce the optimization problem to testing. A natural way is to choose some function F(\theta;X) such that F(\cdot;X) is convex L-Lipschitz for all X, and f(\theta) = \mathop{\mathbb E}_P[F(\theta;X)]. In this way, the oracle can simply generate i.i.d. X_1,\cdots,X_T\sim P, and reveals the random vector (F(\theta_t;X_t), \nabla F(\theta_t;X_t)) to the learner at the t-th query. Hence, the proof boils down to find a collection of probability distributions (P_v)_{v\in \mathcal{V}} such that they are separated apart while hard to distinguish based on T observations. 

We first look at the separation condition. Since the loss function of this problem is L(f,\theta_{T+1}) = f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta), we see that 

\inf_{\theta_{T+1}\in \Theta} L(f,\theta_{T+1}) + L(g,\theta_{T+1}) = \min_{\theta\in\Theta} (f(\theta) + g(\theta)) - \min_{\theta\in\Theta} f(\theta) - \min_{\theta\in\Theta} g(\theta),

which is known as the optimization distance d_{\text{opt}}(f,g). Now consider

F(\theta;X) = L\left|\theta_i - \frac{Rx_i}{\sqrt{d}}\right| \qquad \text{if }x = \pm e_i,

and for v\in \{\pm 1\}^d, let f_v(\theta) = \mathop{\mathbb E}_{P_v}[F(\theta;X)] with P_v defined as

P_v(X = v_ie_i) = \frac{1+\delta}{2d}, \quad P_v(X= -v_ie_i) = \frac{1-\delta}{2d}, \qquad \forall i\in [d].

Clearly F(\cdot;X) is convex and L-Lipschitz, and for \|\theta\|_\infty \le R/\sqrt{d} we have

f_v(\theta) = \frac{LR}{\sqrt{d}} - \frac{\delta L}{d}\sum_{i=1}^d v_i\theta_i.

Hence, it is straightforward to verify that \min_{\|\theta\|_2\le R} f_v(\theta) = LR(1-\delta)/\sqrt{d}, and 

d_{\text{opt}}(f_v,f_{v'}) = \frac{2\delta LR}{d}\cdot d_{\text{H}}(v,v').

In other words, the separation condition in the Assoaud’s lemma is fulfilled with \Delta = 2\delta LR/d. Moreover, simple algebra gives

\max_{d_{\text{H}}(v,v') =1} D_{\text{KL}}(P_v^{\otimes T}\| P_{v'}^{\otimes T}) = O(T\delta^2),

and therefore Assoaud’s lemma gives

R_{T,d,L,R}^\star \ge d\cdot \frac{\delta LR}{d}\left(1 - O(\delta\sqrt{T})\right),

and choosing \delta \asymp T^{-1/2} completes the proof of Theorem 4. In fact, the above arguments also show that if the L_2 norm is replaced by the L_\infty norm, then

R_{T,d,L,R}^\star = \Theta\left(LR\sqrt{\frac{d}{T}}\right).

5. Bibliographic Notes 

The example of nonparametric density estimation over Sobolev balls is taken from Nemirovski (2000), which also contains the tight minimax linear risk among all linear estimators. For the upper bound, linear procedures fail to achieve the minimax rate whenever q>p, and some non-linearities are necessary for the minimax estimators either in the function domain (Lepski, Mammen and Spokoiny (1997)) or the wavelet domain (Donoho et al. (1996)). 

The linear and convex aggregations are proposed in Nemirovski (2000) for the adaptive nonparametric thresholding estimators, and the concept of model selection aggregation is due to Yang (2000). For the optimal rates of different aggregations (together with upper bounds), we refer to Tsybakov (2003) and Leung and Barron (2006). 

The examples from statistical learning theory and stochastic optimization are similar in nature. The results of Theorem 2 and the corresponding upper bounds are taken from Vapnik (1998), albeit with a different proof language. For general results of the oracle complexity of convex optimization, we refer to the wonderful book Nemirovksi and Yudin (1983) and the lecture note Nemirovski (1995). The current proof of Theorem 4 is due to Agarwal et al. (2009). 

  1. Arkadi Nemirovski, Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour 28 (2000): 85. 
  2. Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny, Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics 25.3 (1997): 929–947. 
  3. David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard, Density estimation by wavelet thresholding. The Annals of Statistics (1996): 508–539. 
  4. Yuhong Yang, Combining different procedures for adaptive regression. Journal of multivariate analysis 74.1 (2000): 135–161. 
  5. Alexandre B. Tsybakov, Optimal rates of aggregation. Learning theory and kernel machines. Springer, Berlin, Heidelberg, 2003. 303–313. 
  6. Gilbert Leung, and Andrew R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52.8 (2006): 3396–3410. 
  7. Vlamimir Vapnik, Statistical learning theory. Wiley, New York (1998): 156–160. 
  8. Arkadi Nemirovski, and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983. 
  9. Arkadi Nemirovski, Information-based complexity of convex programming. Lecture Notes, 1995. 
  10. Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar, Information-theoretic lower bounds on the oracle complexity of convex optimization. Advances in Neural Information Processing Systems, 2009. 

Lecture 8: Multiple Hypothesis Testing: Tree, Fano and Assoaud

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In the last three lectures we systematically introduced the Le Cam’s two-point method with generalizations and various examples. The main idea of the two-point method is to reduce the problem in hand to a binary hypothesis testing problem, where the hypotheses may be either single or composite. This approach works when the target is to estimate a scalar (even when the underlying statistical model may involve high-dimensional parameters), while it typically fails when the target is to recover a high-dimensional vector of parameters. For example, even in the simplest Gaussian location model X\sim \mathcal{N}(\theta, I_p) where the target is to estimate \theta under the mean squared loss, the best two-point approach gives a lower bound R_p^\star = \Omega(1) while we have known from Lecture 4 that the minimax risk is R_p^\star = p. To overcome this difficulty, recall from the minimax theorem that a suitable prior should always work, which by discretization motivates the idea of multiple hypothesis testing. 

In this lecture, we develop the general theory of reducing problems into multiple hypothesis testing, and present several tools including tree-based methods, Assoaud’s lemma and Fano’s inequality. In the next two lectures we will enumerate more examples (possibly beyond the minimax estimation in statistics) and variations of the above tools. 

1. General Tools of Multiple Hypothesis Testing 

This section presents the general theory of applying multiple hypothesis testing to estimation problems, as well as some important tools such as tree, Fano and Assoaud. Recall from the two-point method that the separability and indistinguishability conditions are of utmost importance to apply testing arguments, the above tools require different separability conditions and represent the indistinguishability condition in terms of different divergence functions. 

1.1. Tree-based method and Fano’s inequality 

We start with the possibly simplest separability condition, i.e., the chosen parameters (hypothese) \theta_1, \cdots, \theta_m\in \Theta are pairwise seperated. The following lemma shows that the minimax estimation error is lower bounded in terms of the average test error. 

Lemma 1 (From estimation to testing) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and there exist \theta_1,\cdots,\theta_m\in \Theta such that

L(\theta_i, a) + L(\theta_j, a) \ge \Delta, \qquad \forall i\neq j\in [m], a \in \mathcal{A}.


\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i),

where the infimum is taken over all measurable tests \Psi: \mathcal{X} \rightarrow [m]

Proof: For any given estimator \hat{\theta}, we construct a test \Psi as follows: 

\Psi(X) \in \arg\min_{i\in [m]} L(\theta_i, \hat{\theta}(X)).

Then the separability condition implies that

L(\theta_i, \hat{\theta}(X)) \ge \frac{\Delta}{2}\cdot 1(\Psi(X) \neq i), \qquad \forall i\in [m].

Then the rest follows from that the maximum is lower bounded by the average. \Box 

Remark 1 The following weaker separability condition also suffices for Lemma 1 to hold: for any i\neq j\in [m] and a\in \mathcal{A}, the inequality L(\theta_i, a)\le \Delta/2 always implies that L(\theta_j, a)\ge \Delta/2

The remaining quantity in Lemma 1 is the optimal average test error \inf_{\Psi}m^{-1}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i), for which we are looking for a lower bound. Recall that when m=2, this quantity equals to (1- \|P_{\theta_1} - P_{\theta_2}\|_{\text{TV}})/2 by Le Cam’s first lemma. For general m\ge 2, we have the following two different lower bounds. 

Lemma 2 (Tree-based inequality) Let T=([m], E) be any undirected tree on vertex set [m] with edge set E. Then

\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) \ge \frac{1}{m}\sum_{(i,j)\in E} (1 - \|P_{\theta_i} - P_{\theta_j}\|_{\text{\rm TV}}).

Proof: It is straightforward to see that 

\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) = 1 - \frac{1}{m}\int \max\{dP_{\theta_1}(x), \cdots, dP_{\theta_m}(x)\}. \ \ \ \ \ (1)

It is also easy (left as an exercise) to establish the following elementary inequality: for any reals x_1, \cdots, x_m and any tree T=([m], E), we have 

\sum_{i=1}^m x_i - \max_{i\in [m]} x_i \ge \sum_{(i,j)\in E} \min\{x_i, x_j\}.

Now using \int \min\{dP, dQ \} = 1 - \|P-Q\|_{\text{TV}} gives the desired inequality. \Box 

Lemma 3 (Fano’s inequality) Let \bar{P} := m^{-1}\sum_{i=1}^m P_i. Then

\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) \ge 1 - \frac{1}{\log m}\left(\frac{1}{m}\sum_{i=1}^m D_{\text{\rm KL}}(P_i \| \bar{P}) +\log 2\right).

Remark 2 If we introduce auxiliary random variables V\sim \mathsf{Uniform}([m]) and X with P_{X|V=i} := P_{\theta_i}, then

\frac{1}{m}\sum_{i=1}^m D_{\text{\rm KL}}(P_i \| \bar{P}) = I(V;X),

where I(V;X) denotes the mutual information between V and X

Proof: We present two proofs of Lemma 3. The first proof builds upon the representation (1) and is more analytical, while the second proof makes use of the data-processing inequality, which is essentially the classical information-theoretic proof of Fano’s inequality. 

(First Proof) By (1), it suffices to prove that 

\mathop{\mathbb E}_{\bar{P}} \left[ \max_{i\in [m]} \left\{\frac{dP_i}{d\bar{P}} \right\} \right] \le \frac{1}{\log m}\left(\sum_{i=1}^m \mathop{\mathbb E}_{\bar{P}} \left[\frac{dP_i}{d\bar{P}}\log \frac{dP_i}{d\bar{P}}\right] + m\log 2\right).

By the linearity of expectation, it further suffices to prove that for any non-negative reals x_1, \cdots, x_m with \sum_{i=1}^m x_i = m, we have 

\max_{i\in [m]} x_i \le \frac{1}{\log m}\left(\sum_{i=1}^m x_i\log x_i + m\log 2\right).

To establish this inequality, let t:=\max_{i\in [m]} x_i. Then by the convexity of x\mapsto x\log x, Jensen’s inequality gives 

\sum_{i=1}^m x_i\log x_i \ge t\log t + (m-t)\log\frac{m-t}{m-1} \ge m\log \frac{m}{2} - (m-t)\log m.

Plugging in the above inequality completes the proof of Lemma 3. 

(Second Proof) Introduce the auxiliary random variables V and X as in Remark 2. For any fixed X, consider the kernel \mathsf{K}_X: [m]\rightarrow \{0,1\} which sends i\in [m] to 1(\Psi(X) = i), then \mathsf{K}_X\circ P_V is the Bernoulli distribution with parameter m^{-1}, and \mathsf{K}_X\circ P_{V|X} is the Bernoulli distribution with parameter p_X := \mathop{\mathbb P}_{V|X}(V = \Psi(X)). By the data-processing inequality of KL divergence, 

\begin{array}{rcl} D_{\text{KL}}(P_{V|X} \| P_V) & \ge & p_X \log(mp_X) + (1-p_X)\log (1-p_X) \\ &\ge& p_X\log m - \log 2. \end{array}

Now taking expectation on X at both sides gives 

I(V;X) = \mathop{\mathbb E}_X D_{\text{KL}}(P_{V|X} \| P_V) \ge \mathop{\mathbb P}(V = \Psi(X))\cdot \log m - \log 2,

and rearranging completes the proof. \Box

Lemma 2 decomposes a multiple hypothesis testing problem into several binary testing problems, which cannot outperform the best two-point methods in typical scenarios but can be useful when there is some external randomness associated with each P_{\theta_i} (see the bandit example later). Lemma 3 is the well-known Fano’s inequality involving the mutual information, and the additional \log m term is the key difference in multiple hypothesis testing. Hence, the typical lower bound arguments are to apply Lemma 1 together with Lemma 3 (or sometimes Lemma 2), with the following lemma which helps to upper bound the mutual information. 

Lemma 4 (Variational representation of mutual informatino)

I(V;X) = \inf_{Q_X} \mathop{\mathbb E}_V [D_{\text{\rm KL}}(P_{X|V} \| Q_X )].

Proof: Simply verify that 

I(V;X) = \mathop{\mathbb E}_V [D_{\text{\rm KL}}(P_{X|V} \| Q_X )] - D_{\text{KL}}(P_X\|Q_X). \Box

1.2. Assoaud’s lemma 

Instead of the previous pairwise separation condition, Assoaud’s lemma builds upon a different one where the hypotheses are essentially the vertices of an m-dimensional hypercube. Specifically, we shall require that the distance between parameters v and v'\in \{\pm 1\}^p is proportional to their Hamming distance d_{\text{H}}(v,v') := \sum_{i=1}^p 1(v_i \neq v_i'), which becomes quite natural when the parameter of the statistical model lies in an m-dimensional space. 

Theorem 5 (Assoaud’s Lemma) Let L: \Theta\times \mathcal{A} \rightarrow {\mathbb R}_+ be any function. If there exist parameters (\theta_v)\subseteq \Theta indexed by v\in \{\pm1\}^p such that

L(\theta_v, a) + L(\theta_{v'}, a) \ge \Delta\cdot d_{\text{\rm H}}(v,v'), \qquad \forall v,v'\in \{\pm 1\}^p, a\in\mathcal{A},


\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{\Delta}{2}\sum_{j=1}^p \left(1 - \|P_{j,+} - P_{j,-}\|_{\text{\rm TV}}\right),


P_{j,+} := \frac{1}{2^{p-1}}\sum_{v\in \{\pm 1\}^p: v_j = 1} P_{\theta_v}, \quad P_{j,-} := \frac{1}{2^{p-1}}\sum_{v\in \{\pm 1\}^p: v_j = -1} P_{\theta_v}.

Proof: For a given estimator \hat{\theta}, define 

\hat{v} := \arg\min_{v\in \{\pm 1\}^p} L(\theta_v, \hat{\theta}).

Then it is straightforward to see that

L(\theta_v, \hat{\theta}) \ge \frac{L(\theta_v,\hat{\theta}) + L(\theta_{\hat{v}}, \hat{\theta})}{2} \ge \frac{\Delta}{2}\sum_{j=1}^p 1(v_j \neq \hat{v}_j), \qquad \forall v\in \{\pm 1\}^p.

The rest of the proof follows from Le Cam’s first lemma and the fact that the maximum is no smaller than the average. \Box

In most scenarios, the total variation distance between the mixtures P_{j,+} and P_{j,-} is hard to compute directly, and the following corollaries (based on joint convexity of the total variation distance and Cauchy–Schwartz) are often the frequently presented versions of Assoaud’s lemma. 

Corollary 6 Under the conditions of Theorem 5, we have

\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{p\Delta}{2}\cdot \left(1 - \max_{v,v'\in \{\pm 1\}^p: d_{\text{\rm H}}(v,v')=1} \|P_{\theta_v} - P_{\theta_{v'}}\|_{\text{\rm TV}}\right).

Corollary 7 Under the conditions of Theorem 5, we have

\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{p\Delta}{2}\cdot \left(1 - \left(\frac{1}{p2^p}\sum_{j=1}^p \sum_{v\in \{\pm 1\}^p} \|P_{\theta_v} - P_{\theta_{\bar{v}^j}} \|_{\text{\rm TV}}^2 \right)^{1/2} \right),

where \bar{v}^j is the resulting binary vector after flipping the j-th coordinate of v

The idea behind the Assoaud’s lemma is to apply a two-point argument for each coordinate of the parameter and then sum up all coordinates. Hence, Assoaud’s lemma typically improves over the best two-point argument by a factor of the dimension p, given that the hypercube-type separation condition in Theorem 5 holds. 

1.3. Generalized Fano’s inequality 

In the Fano’s inequality and Assoaud’s lemma above, we introduce a random variable V which is uniformly distributed on the set [m] or the hypercube \{\pm 1\}^p. Moreover, the separation condition must hold for each pair of the realizations of V. In general, we may wish to consider some non-uniform random variable V, and only require that the separation condition holds for most pairs. This is the focus of the following generalized Fano’s inequality. 

Theorem 8 (Generalized Fano’s Inequality) Let L: \Theta\times \mathcal{A}\rightarrow {\mathbb R}_+ be any loss function, and \pi be any probability distribution on \Theta. For any \Delta>0, define

p_{\Delta} := \sup_{a\in \mathcal{A}} \pi(\{\theta\in \Theta: L(\theta,a)\le \Delta \}).

Then for V\sim \pi, we have

\inf_{\hat{\theta}}\sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \Delta\left(1 - \frac{I(V;X) + \log 2}{\log(1/p_{\Delta})}\right).

Proof: By Markov’s inequality, it suffices to show that for any estimator \hat{\theta}

\mathop{\mathbb P}\left(L(V, \hat{\theta}(X))\le \Delta \right)\le \frac{I(V;X) + \log 2}{\log(1/p_\Delta)}.

We again use the second proof of Lemma 3. For any fixed X, consider the deterministic kernel \mathsf{K}_X: \Theta\rightarrow \{0,1\} which sends V to 1( L(V,\hat{\theta}(X))\le \Delta ). Then the composition K_X\circ P_V is a Bernoulli distribution with parameter p_X\le p_\Delta, and the composition K_X\circ P_{V|X} is a Bernoulli distribution with parameter q_X := P_{V|X}(L(V,\hat{\theta}(X))\le \Delta). Then by the data processing inequality of KL divergence,

\begin{array}{rcl} D_{\text{KL}}(P_{V|X}\|P_V) &\ge& q_X\log \frac{q_X}{p_X} + (1-q_X)\log \frac{1-q_X}{1-p_X} \\ &\ge & q_X\log \frac{q_X}{p_\Delta} + (1-q_X)\log (1-q_X) \\ &\ge & q_X\log(1/p_\Delta) - \log 2. \end{array}

Since \mathop{\mathbb E}[q_X] = \mathop{\mathbb P}(L(V,\hat{\theta}(X))\le \Delta), taking expectation on X at both sides gives the desired inequality. \Box

To see that Theorem 8 is indeed a generalization to the original Fano’s inequality, simply note that when V\sim \mathsf{Uniform}([m]) and the separation condition in Lemma 1 holds, we have p_\Delta = m^{-1} and therefore the denominator \log(1/p_\Delta) = \log m becomes the log-cardinality. Hence, in the generalized Fano’s inequality, the verification of the seperation condition becomes upper bounding the probability p_\Delta

2. Applications 

In this section we present some applications of the above tools to statistical examples. Here the applications are mostly simple and straightforward, and we defer some more sophisticated examples in other domains to the next lecture. 

2.1. Example I: Gaussian mean estimation 

We start from the possibly simplest example of Gaussian mean estimation. Consider X\sim \mathcal{N}(\theta,\sigma^2 I_p) with known \sigma, and the target is the estimate the vector \theta\in {\mathbb R}^p under the squared \ell_2 loss. Let R_{p,\sigma}^\star be the corresponding minimax risk. 

We first show that any two-point argument fails. In fact, if the two points are chosen to be \theta_1, \theta_2\in {\mathbb R}^p, the two-point argument gives 

R_{p,\sigma}^\star \ge \frac{\|\theta_1 - \theta_2\|_2^2}{2}\left(1-\Phi\left(\frac{\|\theta_1 - \theta_2\|_2}{2\sigma}\right)\right),

where \Phi(\cdot) is the normal CDF. Optimization only gives R_{p,\sigma}^\star = \Omega(\sigma^2).

Now we show that Assoaud’s lemma can give us the rate-optimal lower bound R_{p,\sigma}^\star= \Omega(p\sigma^2). Let P_v = \mathcal{N}(\delta v, \sigma^2 I_p) for v\in \{\pm 1\}^p, where \delta>0 is a parameter to be chosen later. Since 

\|\delta v - \delta v'\|_2^2 = 4\delta^2\cdot d_{\text{H}}(v,v'),

the separation condition in Theorem 5 holds with \Delta = 4\delta^2. Consequently, Corollary 6 gives

R_{p,\sigma}^\star \ge 2p\delta^2\cdot \left(1 - \Phi\left(\frac{\delta}{\sigma}\right)\right),

and choosing \delta =\Theta(\sigma) gives R_{p,\sigma}^\star= \Omega(p\sigma^2).

We can also establish the same lower bound using the generalized Fano’s inequality. Let V\sim \mathsf{Uniform}(\{\pm\delta\}^p), then Lemma 4 gives 

I(V;X) \le \mathop{\mathbb E} D_{\text{KL}}(\mathcal{N}(V,\sigma^2 I_p) \| \mathcal{N}(0, \sigma^2I_p)) = \frac{\mathop{\mathbb E} \|V\|_2^2}{2\sigma^2} = \frac{p\delta^2}{2\sigma^2}.

Since for any \theta_0\in {\mathbb R}^p with v_0 = \text{sign}(\theta_0)\in \{\pm 1\}^p (break ties arbitrarily), we have

\|V - \theta_0\|_2^2 \ge \delta^2 \cdot d_{\text{H}}(\text{sign}(V), v_0),

choosing \Delta = p\delta^2/3 gives

\mathop{\mathbb P}(\|V-\theta_0\|_2^2 \le \Delta) \le \mathop{\mathbb P}(d_{\text{H}}(\text{sign}(V), v_0)\le p/3 ) \le e^{-cp}

for some numerical constant c>0, where the last inequality is given by sub-Gaussian concentration. Consequently, p_\Delta \le e^{-cp}, and Theorem 8 gives

R_{p,\sigma}^\star \ge \frac{p\delta^2}{3}\left(1 - \frac{p\delta^2/(2\sigma^2) + \log 2}{cp}\right).

Again, choosing \delta = c'\sigma with small c'>0 gives R_{p,\sigma}^\star = \Omega(p\sigma^2).

Remark 3 The original version of Fano’s inequality can also be applied here, where \{\theta_1,\cdots,\theta_m\} is a maximal packing of \{\pm \delta\}^p with \ell_2 packing distance O(\delta\sqrt{p})

2.2. Example II: Sparse linear regression 

Consider the following sparse linear regression Y\sim \mathcal{N}(X\theta, \sigma^2 I_n), where X\in {\mathbb R}^{n\times p} is a fixed design matrix, and \theta\in {\mathbb R}^p is a sparse vector with \|\theta\|_0\le s. Here s\in [1,p] is a known sparsity parameter, and we are interested in the minimax risk R_{n,p,s,\sigma}^\star(X) of estimating \theta under the squared \ell_2 loss. Note that when X = I_p, this problem is reduced to the sparse Gaussian mean estimation. 

We apply the generalized Fano’s inequality to this problem. Since a natural difficulty of this problem is to recover the support of \theta, we introduce the random vector V=\delta v_S, where \delta>0 is a parameter to be specified later, v\sim \mathsf{Uniform}\{\pm 1\}^p, v_S denotes the restriction of v onto S, and S\subseteq [p] is uniformly chosen from all size-s subsets of [p]. Clearly, 

\mathop{\mathbb E}[VV^\top] = \delta^2 \mathop{\mathbb E}[v_Sv_S^\top] = \frac{s\delta^2}{p}I_p.

By Lemma 4,

\begin{array}{rcl} I(V;Y) &\le & \mathop{\mathbb E} D_{\text{KL}}(\mathcal{N}(XV, \sigma^2 I_n) \| \mathcal{N}(0, \sigma^2 I_n)) \\ &=& \frac{\mathop{\mathbb E}\|XV\|_2^2}{2\sigma^2} \\ &=& \frac{\text{Trace}(X^\top X\cdot \mathop{\mathbb E}[VV^\top]) }{2\sigma^2}\\ &=& \frac{s\delta^2}{2p\sigma^2}\|X\|_{\text{F}}^2, \end{array}

where \|X\|_{\text{F}} := \sqrt{\text{Trace}(X^\top X)} is the Frobenius norm of X. Now it remains to upper bound p_\Delta for \Delta = s\delta^2/12. Fix any \theta_0\in {\mathbb R}^p such that \|V - \theta_0\|_2^2\le \Delta holds with non-zero probability (otherwise the upper bound is trivial), and by symmetry we may assume that \|\theta_0 - \delta1_{[s]}\|_2^2 \le \Delta. Now by triangle inequality, \|\theta_0 - \delta v_S\|_2^2 \le \Delta implies that \|v_S - 1_{[s]}\|_2^2 \le s/3. Hence,

\begin{array}{rcl} p_\Delta &\le& \mathop{\mathbb P}(\|v_S - 1_{[s]}\|_2^2\le s/3) \\ &\le& \frac{1}{2^s\binom{p}{s}}\sum_{k=\lceil 2s/3 \rceil}^s \binom{s}{k} \binom{p-k}{s-k}2^{s-k} \\ &\le& \left(s/(ep)\right)^{cs} \end{array}

for a numerical constant c>0 after some algebra. Consequently, Theorem 8 gives

R_{n,p,s,\sigma}^\star(X) \ge \frac{s\delta^2}{12}\left(1 - \frac{ \frac{s\delta^2}{2p\sigma^2}\|X\|_{\text{F}}^2 + \log 2}{cs\log(ep/s)} \right).

Finally, choosing \delta^2 = c'p\sigma^2\log(ep/s)/\|X\|_{\text{F}}^2 for some small constant c'>0 gives

R_{n,p,s,\sigma}^\star (X)= \Omega\left(\frac{sp\sigma^2\log(ep/s)}{\|X\|_{\text{F}}^2} \right).

In the special cases where X= I_p or X consists of i.i.d. \mathcal{N}(0,n^{-1}) entries, we arrive at the tight lower bound \Omega(s\sigma^2\log(ep/s)) for both sparse Gaussian mean estimation and compressed sensing.

2.3. Example III: Multi-armed bandit 

Next we revisit the example of the multi-armed bandit in Lecture 5. Let T be the time horizon, K be the total number of arms. For each i\in [K], the reward of pulling arm i at time t is an independent random variable r_{t,i}\sim \mathcal{N}(\mu_i, 1). The target of the learner is to devise a policy (\pi_t)_{t\in [T]} where \pi_t\in [K] is the arm to pull at time t, and \pi_t may depend on the entire observed history (\pi_\tau)_{\tau<t} and (r_{\tau,\pi_\tau})_{\tau<t}. The learner would like minimize the following worst-case regret: 

R_{T,K}^\star = \inf_{\pi} \sup_{\max_{i\in [K]}|\mu_i| \le \sqrt{K} } \mathop{\mathbb E}\left[T \max_{1\le i\le K} \mu_i - \sum_{t=1}^T \mu_{t,\pi_t}\right],

where \max_{i\in [K]}|\mu_i|\le \sqrt{K} is a technical condition. As in Lecture 5, our target is to show that

R_{T,K}^\star = \Omega(\sqrt{KT})

via multiple hypothesis testing.

To prove the lower bound, a natural idea is to construct K hypotheses where the i-th arm is the optimal arm under the i-th hypothesis. Specifically, we set 

\begin{array}{rcl} \mu^{(1)} &=& (\delta, 0, 0, 0, \cdots, 0), \\ \mu^{(2)} &=& (\delta, 2\delta, 0, 0, \cdots, 0), \\ \mu^{(3)} &=& (\delta, 0, 2\delta, 0, \cdots, 0), \\ \vdots & & \vdots \\ \mu^{(K)} &=& (\delta, 0, 0, 0, \cdots, 2\delta), \end{array}

where \delta>0 is some parameter to be specified later. The construction is not entirely symmetric, and the reward distribution under the first arm is always \mathcal{N}(\delta,1).

For a mean vector \mu and policy \pi, let L(\mu,\pi) = T \max_{1\le i\le K} \mu_i - \sum_{t=1}^T \mu_{t,\pi_t} be the non-negative loss function for this online learning problem. Then clearly the separation condition in Lemma 1 is fulfilled with \Delta = \delta T. Next, we apply Lemma 2 to a star graph ([K], \{(1,2),(1,3),\cdots,(1,K) \}) with center 1 gives 

\begin{array}{rcl} R_{T,K}^\star &\overset{(a)}{\ge} & \frac{\delta T}{2}\cdot \frac{1}{K}\sum_{i=2}^K \left(1 - \|P_{\mu^{(1)}} - P_{\mu^{(i)}} \|_{\text{TV}}\right) \\ &\overset{(b)}{\ge} & \frac{\delta T}{4K} \sum_{i=2}^K \exp\left( -D_{\text{KL}}(P_{\mu^{(1)}} \| P_{\mu^{(i)}} ) \right) \\ &\overset{(c)}{=}& \frac{\delta T}{4K} \sum_{i=2}^K \exp\left( - 2\delta^2 \mathop{\mathbb E}_1(T_i)\right) \\ &\overset{(d)}{\ge}& \frac{\delta T}{4}\cdot \frac{K-1}{K} \exp\left(-2\delta^2\cdot \frac{\mathop{\mathbb E}_1(T_2 + \cdots + T_K)}{K-1} \right) \\ &\overset{(e)}{\ge} & \frac{\delta T}{4}\cdot \frac{K-1}{K}\exp\left(-\frac{2T\delta^2}{K-1}\right), \end{array}

where in step (a) we denote by P_{\mu^{(i)}} the distribution of the observed rewards under mean vector \mu^{(i)}, step (b) follows from the inequality 1 - \|P-Q\|_{\text{TV}}\ge \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q)) presented in Lecture 5, step (c) is the evaluation of the KL divergence where \mathop{\mathbb E}_1(T_i) denotes the expected number of pullings of arm i under the mean vector \mu^{(1)}, step (d) follows from Jensen’s inequality, and step (e) is due to the deterministic inequality T_2 + \cdots + T_K\le T. Now choosing \delta = \Theta(\sqrt{K/T}) gives the desired lower bound.

The main reason why we apply Lemma 2 instead of the Fano’s inequality and choose the same reward distribution for arm 1 at all times is to deal with the random number of pullings of different arms under different reward distributions. In the above way, we may stick to the expectation \mathop{\mathbb E}_1 and apply the deterministic inequality T_2 + \cdots + T_K\le T to handle the randomness. 

2.4. Example IV: Gaussian mixture estimation 

Finally we look at a more involved example in nonparametric statistics. Let f be a density on {\mathbb R} which is a Gaussian mixture, i.e., f = g * \mathcal{N}(0,1) for some density g, where * denotes the convolution. We consider the estimation of the density f given n i.i.d. observations X_1,\cdots,X_n from f, and we denote by R_n^\star the minimax risk under the squared L_2 loss between real-valued functions. The central claim is that 

R_n^\star = \Omega\left(\frac{(\log n)^{1/2}}{n}\right),

which is slightly larger than the (squared) parametric rate n^{-1}.

Before proving this lower bound, we first gain some insights from the upper bound. In the problem formulation, the only restriction on the density f is that f must be a Gaussian mixture. Since convolution makes function smoother, f should be fairly smooth, and harmonic analysis suggests that the extent of smoothness is reflected in the speed of decay of the Fourier transform of f. Let \widehat{f} be the Fourier transform of f, we have 

\widehat{f}(\omega) = \widehat{g}(\omega)\cdot \frac{1}{\sqrt{2\pi}}e^{-\omega^2/2},

and \|\widehat{g}\|_\infty \le \|g\|_1=1. Hence, if we truncate \widehat{f}(\omega) to zero for all |\omega| > 2\sqrt{\log n}, the L_2 approximation error would be smaller than O(n^{-1}). This suggests us to consider the kernel K on {\mathbb R} with

\widehat{K}(\omega) = 1(|\omega|\le 2\sqrt{\log n}).

Then the density estimator \mathop{\mathbb P}_n * K has mean f * K, which by Parseval’s identity has squared bias

\|f - f*K\|_2^2 = \|\widehat{f}(1-\widehat{K}) \|_2^2 = O(n^{-1}).

Moreover, by Parseval again,

\mathop{\mathbb E}\|\mathop{\mathbb P}_n * K - f * K\|_2^2 \le \frac{\|K\|_2^2}{n} = \frac{\|\widehat{K}\|_2^2}{n} = O\left(\frac{\sqrt{\log n}}{n}\right),

and a triangle inequality leads to the desired result.

The upper bound analysis motivates us to apply Fourier-type arguments in the lower bound. For v\in \{\pm 1\}^p (with integer p to be specified later), consider the density 

f_v = f_0 + \left(\delta\cdot \sum_{i=1}^p v_ig_i \right) * \phi,

where f_0 is some centered probability density, \delta>0 is some parameter to be specified later, g_1,\cdots,g_p are perturbation functions with \int g_i = 0, and \phi is the PDF of \mathcal{N}(0,1). To apply the Assouad’s lemma on this hypercube structure, we require the following orthogonality condition

\int_{{\mathbb R}} (g_i*\phi)(x) (g_j*\phi)(x)dx = 1(i=j)

to ensure the separation condition in Theorem 5 with \Delta = \Theta(\delta^2). By Plancherel’s identity, the above condition is equivalent to

\int_{{\mathbb R}} \widehat{g}_i(\omega)\widehat{g}_j(\omega)\phi^2(\omega)d\omega = 1(i=j).

Recall from Lecture 7 that the Hermite polynomials are orthogonal under the normal distribution, a natural candidate of \widehat{g}_i(\omega) is \widehat{g}_i(\omega)\propto \phi(\omega)H_{2i-1}(2\omega), where we have used that \phi(\omega)^4 \propto \phi(2\omega), and we restrict to odd degrees to ensure that \int g_i = \widehat{g}_i(0) = 0. This choice enjoys another nice property where the inverse Fourier transform of \widehat{g}_i has the following closed-form expression for g_i:

g_i(x) = \sqrt{2}(2\pi)^{3/4}\sqrt{\frac{3^{2i-1}}{(2i-1)!}}\cdot\phi(x) H_{2i-1}\left(\frac{2x}{\sqrt{3}}\right).

Moreover, since H_k(x)\sim \sqrt{k!}e^{x^2/4} for large x, we have \|g_i\|_\infty = \Theta(3^i). Therefore, we must have p=O(\log n) to ensure the non-negativity of the density f_v.

Now the only remaining condition is to check that the \chi^2-divergence between neighboring vertices is at most O(1), which is essentially equivalent to 

n\delta^2\cdot \int_{{\mathbb R}} \frac{(g_i * \phi(x))^2}{f_0(x)}dx = O(1).

A natural choice of f_0 is the PDF of \mathcal{N}(0,\sigma^2), with \sigma>0 to be specified. Splitting the above integral over {\mathbb R} into |x|\le C\sigma and |x|> C\sigma for some large constant C>0, the orthogonality relation of g_i gives

\int_{|x|\le C\sigma} \frac{(g_i * \phi(x))^2}{f_0(x)}dx \le \sqrt{2\pi}e^{C^2/2}\sigma \cdot \int_{{\mathbb R}} (g_i * \phi(x))^2dx = O(\sigma).

To evaluate the integral over |x|>C\sigma, recall that g_i(x) behaves as 3^ie^{-cx^2} for large |x|. Hence, the natural requirement \sigma^2 = \Omega(p) leads to the same upper bound O(\sigma) for the second integral, and therefore the largest choice of \delta is \delta^2 = (n\sigma)^{-1}.

To sum up, we have arrived at the lower bound 

R_{n}^\star = \Omega\left(\frac{p}{n\sigma}\right)

subject to the constraints \sigma^2 = \Omega(\sqrt{p}) and p = O(\log n). Hence, the optimal choices for the auxiliary parameters are p = \Theta(\log n), \sigma = \Theta(\sqrt{\log n}), leading to the desired lower bound.

3. Bibliographic Notes 

The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977), who also applied Fano’s lemma (Fano (1952)) to a statistical setting. The Assouad’s lemma is due to Assoaud (1983), and we also refer to a survey paper Yu (1997). The tree-based technique (Lemma 2 and the analysis of the multi-armed bandit) is due to Gao et al. (2019), and the generalized Fano’s inequality is motivated by the distance-based Fano’s inequality (Duchi and Wainwright (2013)) and the current form is presented in the lecture note Duchi (2019). 

For the examples, the lower bound of sparse linear regression is taken from Candes and Davenport (2013), and we also refer to Donoho and Johnstone (1994), Raskutti, Wainwright and Yu (2011), and Zhang, Wainwright and Jordan (2017) for related results. The full proof of the Gaussian mixture example is referred to Kim (2014). 

  1. Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977. 
  2. Robert M. Fano, Class notes for transmission of information. Massachusetts Institute of Technology, Tech. Rep (1952). 
  3. Patrick Assouad, Densité et dimension. Annales de l’Institut Fourier. Vol. 33. No. 3. 1983. 
  4. Bin Yu, Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam. Springer, New York, NY, 1997. 423–435. 
  5. Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou, Batched multi-armed bandit problem. Advances in Neural Information Processing Systems, 2019. 
  6. John C. Duchi, and Martin J. Wainwright. Distance-based and continuum Fano inequalities with applications to statistical estimation. arXiv preprint arXiv:1311.2669 (2013). 
  7. John C. Duchi, Lecture notes for statistics 311/electrical engineering 377. 2019. 
  8. Emmanuel J. Candes, and Mark A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis 34.2 (2013): 317–323. 
  9. David L. Donoho, and Iain M. Johnstone. Minimax risk over \ell_p-balls for \ell_q-error. Probability Theory and Related Fields 99.2 (1994): 277–303. 
  10. Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over \ell_q-balls. IEEE transactions on information theory 57.10 (2011): 6976–6994. 
  11. Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. Electronic Journal of Statistics 11.1 (2017): 752-799. 
  12. Arlene K. H. Kim. Minimax bounds for estimation of normal mixtures. Bernoulli 20.4 (2014): 1802–1818. 

Lecture 7: Mixture vs. Mixture and Moment Matching

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In the last two lectures we have seen two specific categories of the Le Cam’s two-point methods, i.e., testing between two single hypotheses, or between one single hypothesis and a mixture of hypotheses. Of course, the most powerful and natural generalization of the two-point methods is to test between two mixtures of distributions, which by the minimax theorem is potentially the best possible approach to test between two composite hypotheses. However, the least favorable prior (mixture) may be hard to find, and it may be theoretically difficult to upper bound the total variation distance between two mixtures of distributions. In this lecture, we show that both problems are closely related to moment matching via examples in Gaussian and Poisson models. 

1. Fuzzy Hypothesis Testing and Moment Matching 

In this section, we present the general tools necessary for this lecture. First, we prove a generalized two-point method also known as the fuzzy hypothesis testing. To upper bound the crucial divergence term between two mixtures, we introduce the orthogonal polynomials under different distributions and show that the upper bound depends on moment differences of the mixtures. 

1.1. Mixture vs. Mixture 

We first state the theorem of the generalized two-point methods with two mixtures. As usual, we require that any two points in respective mixtures be well separated but the mixtures be indistinguishable from samples. However, since many natural choices of mixtures may not satisfy the well-separated property in the worst case, the next theorem will be a bit more flexible to require that the mixtures are separated with a large probability. 

Theorem 1 (Mixture vs. Mixture) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and there exist \Theta_0\subseteq \Theta and \Theta_1 \subseteq \Theta such that

L(\theta_0, a) + L(\theta_1, a) \ge \Delta, \qquad \forall a\in \mathcal{A}, \theta_0\in \Theta_0, \theta_1\in \Theta_1.

Then for any probability measures \mu_0, \mu_1 supported on \Theta, we have

\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|\mathop{\mathbb E}_{\mu_0(d\theta)}P_{\theta} - \mathop{\mathbb E}_{\mu_1(d\theta)}P_{\theta} \|_{\text{\rm TV}} - \mu_0(\Theta_0^c) - \mu_1(\Theta_1^c)\right).

Proof: For i=0,1, let \nu_i be the conditional probability measure of \mu_i conditioned on \Theta_i, i.e., 

\nu_i(A) := \frac{\mu_i(A\cap \Theta_i)}{\mu_i(\Theta_i)}.

Simple algebra gives \|\mu_i - \nu_i\|_{\text{TV}} = \mu_i(\Theta_i^c). By coupling, we also have 

\|\mathop{\mathbb E}_{\mu_i(d\theta)}P_{\theta} - \mathop{\mathbb E}_{\nu_i(d\theta)}P_{\theta} \|_{\text{\rm TV}} \le \|\mu_i - \nu_i\|_{\text{TV}} = \mu_i(\Theta_i^c).

Now the desired result follows from the standard two-point arguments and the triangle inequality of the total variation distance. \Box 

The central quantity in Theorem 1 is \|\mathop{\mathbb E}_{\mu_0(d\theta)}P_{\theta} - \mathop{\mathbb E}_{\mu_1(d\theta)}P_{\theta} \|_{\text{\rm TV}}, the total variation distance between mixture distributions with priors \mu_0 and \mu_1. A general upper bound on this quantity is very hard to obtain, but the next two sections will show that it is small when the moments of \mu_0 and \mu_1 are close to each other if the model (P_\theta) is Gaussian or Poisson. 

1.2. Hermite and Charlier Polynomials 

This section reviews some preliminary results on orthogonal polynomials under a fixed probability distribution. Let P be a probability measure on (\mathcal{X}, \mathcal{F}) with \mathcal{X}\subseteq {\mathbb R} and all moments finite. Recall that functions \{p_n\}_{n=0}^\infty with p_n: \mathcal{X} \rightarrow {\mathbb R} are called orthogonal under P iff 

\mathop{\mathbb E}[p_m(X)p_n(X)] = 0

for X\sim P and all m\neq n. By orthogonal polynomials we mean that for each n\in {\mathbb N}, x\mapsto p_n(x) is a polynomial with degree n

The simplest way to construct orthogonal polynomials is via the Gram–Schmidt orthogonalization. Specifically, we may choose p_0(x) = 1, and p_n(x) to be the orthogonal component of x^n projected onto \text{span}(p_0(x),\cdots,p_{n-1}(x)) with the inner product structure (f,g)\mapsto \mathop{\mathbb E}_P[f(X)g(X)]. This approach works for general distributions and can be easily implemented in practice, but it gives little insight on the properties of p_n. We shall apply a new approach to arrive at orthogonal functions, which turn out to be polynomials in Gaussian and Poisson models. 

Let (P_\theta)_{\theta\in [\theta_0-\varepsilon,\theta_0+\varepsilon]\subseteq {\mathbb R}} be a family of distributions on (\mathcal{X},\mathcal{F}) with P_{\theta_0} = P. Assume that (P_\theta) admits the following local expansion around \theta = \theta_0

\frac{dP_{\theta_0 +u}}{dP_{\theta_0}}(x) = \sum_{m=0}^\infty p_m(x;\theta_0)\frac{u^m}{m!}, \qquad \forall |u|\le \varepsilon, x\in \mathcal{X}. \ \ \ \ \ (1)

The next lemma claims that under specific conditions of (P_\theta), the functions \{p_m(x;\theta_0)\}_{m=0}^\infty are orthogonal under P_{\theta_0}

Lemma 2 Under the above conditions, if for all u,v\in [-\varepsilon,\varepsilon] the quantity

\int_{\mathcal{X}} \frac{dP_{\theta_0+u}dP_{\theta_0+v}}{dP_{\theta_0}}

depends only on their product uv and \theta_0, then the functions \{p_m(x;\theta_0)\}_{m=0}^\infty are orthogonal under P_{\theta_0}

Remark 1 Recall that the quantity \int_{\mathcal{X}} dP_{\theta_0+u}dP_{\theta_0+v}/ dP_{\theta_0} plays an important role in the Ingster-Suslina method in Lecture 6. The upper bounds in the next section can be thought of as a generalization of the Ingster-Suslina method, with the help of proper orthogonality properties. 

Proof: The local expansion of likelihood ratio gives 

\int_{\mathcal{X}} \frac{dP_{\theta_0+u}dP_{\theta_0+v}}{dP_{\theta_0}} = \sum_{m,n=0}^\infty \mathop{\mathbb E}_{P_{\theta_0}}[p_m(X;\theta_0) p_n(X;\theta_0) ]\cdot \frac{u^mv^n}{m!n!}.

The condition of Lemma 2 implies that the coefficient of the monomial u^mv^n on the RHS with m\neq n must be zero, as desired. \Box 

Exercise 1 Show that under the conditions in Lemma 2, \mathop{\mathbb E}_{P_{\theta_0+u}} [p_m(X;\theta_0)] = c_mu^m for some scalar c_m\neq 0 and any u\in [-\varepsilon,\varepsilon]. In other words, p_m(X;\theta_0) is an unbiased estimator of u^m up to scaling in the location model (P_{\theta_0+u})_{u\in [-\varepsilon,\varepsilon]}

The condition of Lemma 2 is satisfied by various well-known probability distributions. For example, 

\int_{{\mathbb R}} \frac{d\mathcal{N}(\theta_0+u, 1)d\mathcal{N}(\theta_0+v,1)}{d\mathcal{N}(\theta_0,1)} = \exp(uv),

and for any \lambda>0

\int_{{\mathbb N}} \frac{d\mathsf{Poi}(\lambda+u) d\mathsf{Poi}(\lambda+v)}{d\mathsf{Poi}(\lambda)} = \exp\left(\frac{uv}{\lambda}\right).

In fact, the functions p_m(x;\theta_0) given in the defining equation (1) in the Gaussian and Poisson models are both polynomials of degree m, known as the Hermite polynomial H_m(x) and the Charlier polynomial c_m(x;\lambda), respectively. The proof of Lemma 2 gives the following orthogonal relations: 

\begin{array}{rcl} \mathop{\mathbb E}_{X\sim \mathcal{N}(0,1)} [H_m(X)H_n(X)] &=& n!\cdot 1(m=n) \\ \mathop{\mathbb E}_{X\sim \mathsf{Poi}(\lambda)} [c_m(X;\lambda)c_n(X;\lambda)] &=& \frac{n!}{\lambda^n}\cdot 1(m=n). \end{array}

Throughout this lecture, we won’t need the specific forms of the Hermite and Charlier polynomials. We shall only need the defining property (1) and the above orthogonal relations. 

1.3. Divergences between Mixtures 

Now we are ready to present the upper bound on the total variation distance between Gaussian mixture and Poisson mixture models. We also provide upper bounds on the \chi^2 divergence due to its nice tensorization property. 

We first deal with the Gaussian location model. Let U and U' be two random variables on {\mathbb R}, and let \mathop{\mathbb E} \mathcal{N}(U,1) be the Gaussian mixture with random mean U

Theorem 3 (Divergence between Gaussian Mixtures) For any \mu\in {\mathbb R}, we have

\|\mathop{\mathbb E} \mathcal{N}(\mu+ U,1) - \mathop{\mathbb E} \mathcal{N}(\mu+U',1) \|_{\text{\rm TV}} \le \frac{1}{2}\left(\sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2 }{m!} \right)^{\frac{1}{2}}.

Moreover, if \mathop{\mathbb E}[U']=0 and \mathop{\mathbb E}[(U')^2]\le M, then 

\chi^2(\mathop{\mathbb E}\mathcal{N}(\mu+ U,1), \mathop{\mathbb E}\mathcal{N}(\mu+U',1)) \le e^{M^2/2}\cdot \sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2 }{m!}.

Proof: By translation we may assume that \mu=0. Let \phi(x) be the pdf of \mathcal{N}(0,1), and \Delta_m := \mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m]. Then 

\begin{array}{rcl} \|\mathop{\mathbb E} \mathcal{N}(U,1) - \mathop{\mathbb E} \mathcal{N}(U',1) \|_{\text{\rm TV}} &=& \frac{1}{2}\int_{{\mathbb R}} |\mathop{\mathbb E}[\phi(x-U)] - \mathop{\mathbb E}[\phi(x-U')]|dx \\ &\overset{(a)}{=}& \frac{1}{2}\int_{{\mathbb R}} \left|\sum_{m=0}^\infty H_m(x)\frac{\Delta_m}{m!} \right|\phi(x)dx \\ &\overset{(b)}{\le} & \frac{1}{2} \left(\int_{{\mathbb R}} \left|\sum_{m=0}^\infty H_m(x)\frac{\Delta_m}{m!} \right|^2\phi(x)dx \right)^{\frac{1}{2}} \\ &\overset{(c)}{=} & \frac{1}{2}\left(\sum_{m=0}^\infty \frac{\Delta_m^2}{m!}\right)^{\frac{1}{2}}, \end{array}

where step (a) is due to the defining property (1), step (b) follows from the Cauchy–Schwartz inequality, and step (c) uses the orthogonal relation of H_m(x). Hence the upper bound on the total variation distance is proved. 

For the \chi^2-divergence, first note that by Jensen’s inequality, 

\mathop{\mathbb E}[\phi(x-U')] = \phi(x)\mathop{\mathbb E}\left[\exp\left(U'x - \frac{(U')^2}{2}\right)\right] \ge \phi(x)e^{-M^2/2}.

As a result, 

\begin{array}{rcl} \chi^2(\mathop{\mathbb E}\mathcal{N}(U,1), \mathop{\mathbb E}\mathcal{N}(U',1)) &=& \int_{{\mathbb R}} \frac{|\mathop{\mathbb E}[\phi(x-U)] - \mathop{\mathbb E}[\phi(x-U')]|^2}{\mathop{\mathbb E}[\phi(x-U')]}dx \\ &\le & e^{M^2/2}\cdot \int_{{\mathbb R}} \frac{|\mathop{\mathbb E}[\phi(x-U)] - \mathop{\mathbb E}[\phi(x-U')]|^2}{\phi(x)}dx \\ &= & e^{M^2/2}\cdot \int_{{\mathbb R}} \left|\sum_{m=0}^\infty H_m(x)\frac{\Delta_m}{m!} \right|^2 \phi(x)dx \\ &=& e^{M^2/2}\cdot \sum_{m=0}^\infty \frac{\Delta_m^2}{m!}, \end{array}

where again we have used the defining property (1) and the orthogonality in the last two identities. \Box 

Specifically, Theorem 3 shows that when the moments of U and U' are close, then the corresponding Gaussian mixtures are statistically close. Similar results also hold for Poisson mixtures. 

Theorem 4 (Divergence between Poisson Mixtures) For any \lambda>0 and random variables U,U' supported on [-\lambda,\infty), we have

\|\mathop{\mathbb E}\mathsf{Poi}(\lambda+U) - \mathop{\mathbb E}\mathsf{Poi}(\lambda+U') \|_{\text{\rm TV}} \le \frac{1}{2}\left(\sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2}{m!\lambda^m}\right)^{\frac{1}{2}}.

Moreover, if \mathop{\mathbb E}[U']=0 and |U'|\le M almost surely, then 

\chi^2(\mathop{\mathbb E}\mathsf{Poi}(\lambda+U), \mathop{\mathbb E}\mathsf{Poi}(\lambda+U') ) \le e^{M}\cdot \sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2}{m!\lambda^m}.

Proof: The proof of both inequalities essentially follow the same lines as those in the proof of Theorem 3, with the Hermite polynomial H_m(x) replaced by the Charlier polynomial c_m(x;\lambda). The only difference is that when \mathop{\mathbb E}[U']=0 and |U'|\le M almost surely, for all x\in {\mathbb N} we have 

\begin{array}{rcl} \mathop{\mathbb P}(\mathop{\mathbb E}\mathsf{Poi}(\lambda+U') = x ) &=& \mathop{\mathbb E}\left[ e^{-\lambda-U'} \frac{(\lambda+U')^x}{x!}\right] \\ &\ge & e^{-\lambda-M} \mathop{\mathbb E}\left[\frac{(\lambda+U')^x}{x!} \right] \\ &\ge & e^{-\lambda-M}\frac{\lambda^x}{x!} \\ &=& e^{-M}\cdot \mathop{\mathbb P}(\mathsf{Poi}(\lambda) = x ). \end{array} \Box

2. Examples in Gaussian Models 

In this section, we present examples in Gaussian location models where we need to test between two mixtures. In these examples, we match moments up to either some finite degree, or some large and growing degrees such that the information divergences become extremely small. 

2.1. Gaussian Mixture Models 

Consider the following two-component Gaussian mixture model where n i.i.d. samples X_1,\cdots,X_n are drawn from p\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2). One possible task of proper learning is to estimate the parameters (\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2) within O(1)-distance to the truth up to permutation. In other words, the target is to recover the components of the Gaussian mixture, which in practice helps to perform tasks such as clustering. The target is to determine the optimal sample complexity of this problem, assuming that p\in (0,1) is unknown but bounded away from 0 and 1 (e.g., 0.01\le p\le 0.99), and the overall variance of the mixture is at most \sigma^2, where \sigma = \Omega(1) is some prespecified parameter. 

To derive a lower bound, the two-point method suggests to find two sets of parameters (p,\mu_1,\mu_2,\sigma_1,\sigma_2) which are \Omega(1)-seperated, while the information divergence between these two mixtures is O(1). However, since Theorem 3 can only deal with mixtures with an identical variance in each component, we cannot simply take U to be a discrete random variable supported on two points \{\mu_1, \mu_2 \}. To overcome this difficulty, note that 

\left( p\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2) \right) * \mathcal{N}(0,\sigma^2) = p\mathcal{N}(\mu_1,\sigma_1^2 + \sigma^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2 + \sigma^2),

where * denotes convolution of probability measures. Hence, we may treat the overall mixture to be p\mathcal{N}(\mu_1,\sigma_1^2 + \sigma^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2 + \sigma^2), and choose U \sim p\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2) and similarly for U'. Now the desired \chi^2-divergence becomes 

\chi^2(U * \mathcal{N}(0,\sigma^2), U' * \mathcal{N}(0,\sigma^2)).

To apply Theorem 3, we should construct random variables U and U' with as many matched moments as possible. Since there are 5 free parameters in the two-component Gaussian mixtures, we expect that U and U' can only have matched moments up to degree 5. A specific choice can be as follows: 

\begin{array}{rcl} U \sim P & = & 0.5\mathcal{N}(-1,1) + 0.5\mathcal{N}(1,2), \\ U' \sim Q & \approx & 0.2968\mathcal{N}(-1.2257, 0.6100) + 0.7032\mathcal{N}(0.5173, 2.3960). \end{array}

Clearly both U * \mathcal{N}(0,\sigma^2) and U' * \mathcal{N}(0,\sigma^2) have overall variance \Theta(\sigma^2). Replacing U, U' by their centered version, Theorem 3 gives 

\chi^2(U * \mathcal{N}(0,\sigma^2), U' * \mathcal{N}(0,\sigma^2)) \le e^{O(1)/\sigma^2}\cdot \sum_{m=6}^\infty \frac{2^m}{m!\sigma^{2m}} = O\left(\frac{1}{\sigma^{12}}\right)

as long as \sigma = \Omega(1). Hence, by the additivity of the \chi^2-divergence, we conclude that n = \Omega(\sigma^{12}) is a lower bound on the sample complexity. 

The seemingly strange bound \Omega(\sigma^{12}) is also tight for this problem, and the idea is to estimate the first 5 moments of the mixture and then show that close moments imply close parameters. We leave the details to the reference in the bibliographic notes. 

2.2. L_1-norm Estimation of Bounded Gaussian Mean 

Consider the Gaussian location model X\sim \mathcal{N}(\theta,I_p) with unit variance, where the mean vector \theta\in {\mathbb R}^p satisfies \|\theta\|_\infty \le 1. The target here is to estimate the L_1 norm of the mean vector \|\theta\|_1, and let R_p^\star be the minimax risk under the absolute value loss. The main result here is to prove the following tight lower bound: 

R_p^\star = \Omega\left(p\cdot \frac{\log\log p}{\log p}\right).

2.2.1. Failure of Point vs. Mixture

Motivated by the point vs. mixture approach in the last lecture, one natural idea is to test between H_0: \theta =0 and a composite hypothesis H_1: \|\theta\|_1 \ge \rho, where \rho>0 is a parameter to be specified later such that H_0 and H_1 are indistinguishable in the minimax sense. Consequently, this approach gives the lower bound R_p^\star = \Omega(\rho), and the target is to find some \rho>0 as large as possible. We claim that the largest possible \rho is \rho = \Theta(p^{3/4}), which is strictly smaller than the desired minimax risk. 

Let P_\theta = \mathcal{N}(\theta, I_p), and consider any prior distribution \pi supported on \{\theta: \|\theta\|_1 \ge \rho \}. Then Ingster-Suslina method gives 

\chi^2\left( \mathop{\mathbb E}_\pi[P_\theta], P_0\right) = \mathop{\mathbb E}_{\theta,\theta'\sim \pi} \exp(\theta^\top \theta') - 1.

Using the Taylor expansion \exp(x) = \sum_{k=0}^\infty x^k/k! and the inequality \mathop{\mathbb E}[(\theta^\top \theta')^k]\ge 0 for all k\in {\mathbb N}, we conclude that 

\begin{array}{rcl} \chi^2\left( \mathop{\mathbb E}_\pi[P_\theta], P_0\right) &\ge & \frac{1}{2}\mathop{\mathbb E}_{\theta,\theta'} [(\theta^\top \theta')^2] \\ &=& \frac{1}{2}\sum_{i=1}^p (\mathop{\mathbb E}[\theta_i^2])^2 + \frac{1}{2}\sum_{i\neq j} (\mathop{\mathbb E}[\theta_i\theta_j])^2 \\ &\ge & \frac{1}{2}\sum_{i=1}^p (\mathop{\mathbb E}[\theta_i^2])^2 \\ &\ge & \frac{1}{2p}\left(\sum_{i=1}^p \mathop{\mathbb E}[\theta_i^2] \right)^2 \\ &\ge & \frac{1}{2p}\left(\frac{\mathop{\mathbb E}[\|\theta\|_1^2 ]}{p} \right)^2 \\ &\ge & \frac{\rho^4}{2p^3} . \end{array}

Note that the above inequality holds for any \pi supported on \{\theta: \|\theta\|_1 \ge \rho \}. Hence, when \rho = \Omega(p^{3/4}), these hypotheses H_0 and H_1 become statistically distinguishable, and therefore the best possible lower bound from the point vs. mixture approach is \Omega(p^{3/4})

There is also another way to show the desired failure, i.e., we may construct an explicit test which reliably distinguishes between H_0: \theta =0 and H_1: \|\theta\|_1 \ge p^{3/4}. The idea is to apply the \chi^2 test, i.e., compute the statistic T = \|X\|_2^2 - p. Clearly under H_0 we have T = O_p(\sqrt{p}), and after some algebra we may show that T = \|\theta\|_2^2 - O_p(\sqrt{p} + \|\theta\|_2) under P_\theta. Since \|\theta\|_1\ge p^{3/4} implies \|\theta\|_2\ge p^{1/4}, we conclude that comparing T with a suitable threshold O(\sqrt{p}) results in a reliable test. 

2.2.1. Moment Matching and Polynomial Approximation

Previous section shows that testing between a single distribution and a mixture does not work, where the knowledge of the single distribution can be used for the \chi^2 test and may make the problem significantly easier. Hence, an improvement is to consider two composite hypotheses H_0: \|\theta\|_1\le \rho_0 and H_1: \|\theta\|_1\ge \rho_1, where \rho_1>\rho_0>0 are parameters to be chosen later. For the priors \pi_0, \pi_1 on \theta, Theorem 3 motivates us to use product priors \pi_i = \mu_i^{\otimes p} where \mu_0, \mu_1 are probability measures on [-1,1] with matched moments up to degree K (to be chosen later). The specific choices of \mu_0, \mu_1 must fulfill the following requirements: 

  1. Have matched moments up to degree K while with the quantity \rho_1 - \rho_0 as large as possible; 
  2. For i\in \{0,1\}, the prior \pi_i is supported on H_i
  3. The \chi^2-divergence is upper bounded by \chi^2(\pi_0, \pi_1) = O(1), or in other words, \chi^2(\mu_0, \mu_1) = O(1/p)

We check the above requirements in the reverse order and specify the choices of \mu_0, \mu_1 and K gradually. For the last requirement, Theorem 3 with M \le 2 gives 

\chi^2\left(\mu_0,\mu_1 \right) \le e^2\sum_{m=K+1}^\infty \frac{2^{m+1}}{m!} \le 2e^2\sum_{m=K+1}^\infty \left(\frac{2e}{m}\right)^m \le \frac{2e^2}{1-2e/K}\cdot \left(\frac{2e}{K}\right)^K.

As a result, to have \chi^2(\mu_0,\mu_1) = O(1/p), it suffices to take K = \Omega(\log p/\log\log p)

The second constraint that \pi_i be (almost) supported on H_i is also easy. Set 

\rho_0 = p\mathop{\mathbb E}_{\theta\sim \mu_0}[|\theta|] + c\sqrt{p}, \quad \rho_1 = p\mathop{\mathbb E}_{\theta\sim \mu_1}[|\theta|] - c\sqrt{p},

where c>0 is a large eough numerical constant. The idea behind the above choices is that, under the product distribution \mu_i^{\otimes p}, the random variable \|\theta\|_1 is the sum of p i.i.d. random variables taking value in [-1,1] following distribution \mu_i. Then by the sub-Gaussian concentration, the L_1 norm is centered at p\mathop{\mathbb E}_{\theta\sim \mu_i}[|\theta|] with fluctuation O(\sqrt{p}). Hence, for large c>0 both probabilities \pi_0(\{\|\theta\|_1> \rho_0 \}) and \pi_1(\{\|\theta\|_1<\rho_1 \}) are small, which fulfills the conditions in Theorem 1. 

The most non-trivial requirement is the first requirement, which by our choice of \rho_0, \rho_1 essentially aims to maximize the difference \mathop{\mathbb E}_{\theta\sim \mu_1}[|\theta|] - \mathop{\mathbb E}_{\theta\sim \mu_0}[|\theta|] subject to the constraint that the probability measures \mu_0, \mu_1 are supported on [-1,1] and have matching first K moments. The following lemma shows the duality between moment matching and best polynomial approximation. 

Lemma 5 For any bounded interval I\subseteq {\mathbb R} and real-valued function f on I, let S^\star be the maximum difference \mathop{\mathbb E}_{\theta\sim \mu_1}[f(\theta)] - \mathop{\mathbb E}_{\theta\sim \mu_0}[f(\theta)] subject to the constraint that the probability measures \mu_0, \mu_1 are supported on I and have matching first K moments. Then

S^\star = 2E_K(f;I ),

where E_K(f;I) denotes the best degree-K polynomial approximation error of f on the interval I

E_K(f; I) := \inf_{a_0,\cdots,a_K } \sup_{x\in I} \left|f(x) - \sum_{k=0}^K a_kx^k \right|.

Proof: It is an easy exercise to show that S^\star \le 2E_K(f;I). We present two proofs for the hard direction S^\star \ge 2E_K(f;I). The first proof is an abstract proof which holds for general basis functions other than monomials, while the construction of the measures \mu_0, \mu_1 is implicit. The second proof gives an explicit construction, while some properties of polynomials are used in the proof. 

(First Proof) Consider the following linear functional T: \text{span}(1,x,\cdots,x^K,f(x)) \rightarrow {\mathbb R}, with T(p) = 0 for p\in \text{span}(1,x,\cdots,x^K) and T(f) = E_K(f;I). Equipped with the \|\cdot\|_\infty norm on functions, it is easy to show that the operator norm of T is \|T\| = 1. By the Hahn-Banach theorem, the linear functional T can be extended to C(I) \rightarrow {\mathbb R} without increasing the operator norm. Then by the Riesz representation theorem, there exists a signed Radon measure \mu on I such that 

T(g) = \int_I g(x)\mu(dx), \quad \forall g \in C(I).

The fact \|T\| = 1 implies that the total variation of \mu is one. Write \mu = \mu_+ - \mu_- by Jordan decomposition of signed measures, then \mu_1 = 2\mu_+ and \mu_0 = 2\mu_- satisfy the desired properties. 

(Second Proof) Let p_K be a degree-K polynomial with \|f-p_K\|_\infty = E_K(f;I) on I. Since \{1,x,\cdots,x^K\} is a Haar basis on I, Chebyshev’s alternation theorem shows that there exist K+2 points x_1,\cdots,x_{K+2} such that f(x_i) - p_K(x_i) = \varepsilon\cdot (-1)^iE_K(f;I) with \varepsilon=1 or -1. Consider the signed measure \mu supported on \{x_1,\cdots,x_{K+2}\} with 

\mu( \{x_i\} ) := c\left(\prod_{j\neq i} (x_i - x_j) \right)^{-1},

where c\in{\mathbb R} is a normalizing constant such that c\cdot \varepsilon\sum_{i=1}^{K+2} (-1)^i\mu(\{x_i\}) = 2. Then by simple algebra, \mathop{\mathbb E}_{\mu}f = 2E_K(f;I) and \mu has total variation 2. Moreover, the following identity is given by Lagrange interpolation

\sum_{i=1}^{K+2} x_i^k\prod_{j\neq i} \frac{x-x_j}{x_i-x_j} = x^k, \quad k = 0,1,\cdots,K,

and comparing the coefficient of x^{K+1} on both sides gives \mathop{\mathbb E}_\mu[x^k]=0 for k=0,\cdots,K. Now another Jordan decomposition of \mu gives the desired result. \Box 

By Lemma 5, it boils down to the best degree-K polynomial approximation error of |x| on [-1,1]. This error is analyzed in approximation theory and summarized in the following lemma. 

Lemma 6 There is a numerical constant (known as the Bernstein’s constant) \beta_\star \approx 0.280169499 such that

E_K(|x|; [-1,1]) = (\beta_\star+o_K(1))K^{-1}.

Hence, by Lemma 5 and 6, the condition of Theorem 1 is satisfied with \Delta = \Theta(pK^{-1}) = \Theta(p\cdot \frac{\log\log p}{\log p}). As a result, we finally arrive at R_p^\star = \Omega(p\cdot \frac{\log\log p}{\log p})

3. Examples in Poisson Models 

In this section, we present examples in i.i.d. sampling models from a discrete distribution with a large support. We show that a general Poissonization technique allows us to operate in the simpler Poisson models, and use moment matching to either finite or growing degrees to establish tight lower bounds. 

3.1. Poissonization and Approximate Distribution 

Throughout this section the statistical model is i.i.d. sampling from a discrete distribution X_1,\cdots,X_n \sim P=(p_1,\cdots,p_k), where n denotes the sample size and k denotes the support size. It is well-known that the histogram (h_1,\cdots,h_k) with 

h_j := \sum_{i=1}^n 1(X_i = j), \quad \forall j\in [k]

constitutes a sufficient statistic. Moreover, h_j\sim \mathsf{B}(n,p_j) for all j\in [k], and (h_1,\cdots,h_k) are negatively dependent. To remove the dependence among different bins, recall the following Poissonized model: 

Definition 7 (Poissonization) In the Poissonized model, h_j\sim \mathsf{Poi}(np_j) for all j\in [k] and they are mutually independent. 

In other words, in the Poissonized model we draw a random number N\sim \mathsf{Poi}(n) of samples from P and then compute the histogram. In Lecture 3 we have shown the asymptotic equivalence of the i.i.d. sampling model and the Poissonized model, but the arguments are highly asymptotic. The following lemma establishes a non-asymptotic relationship. 

Lemma 8 For a given statistical task, let R_n and R_n^\star be the minimax risk under the i.i.d. sampling model and the Poissonized model with design sample size n, respectively. Then

\frac{1}{2}R_{2n} \le R_n^\star \le R_{n/2} + R_0 e^{-n/8}.

Proof: Let R_n(\pi), R_n^\star(\pi) be the Bayes risks under prior \pi in respective models. The desired inequality for Bayes risks follows from the identity 

R_n^\star(\pi) = \sum_{m=0}^\infty R_m(\pi)\cdot \mathop{\mathbb P}(\mathsf{Poi}(n) = m),

the Poisson tail bounds and the monotonicity of Bayes risks R_0(\pi)\ge R_1(\pi) \ge \cdots. Now the minimax theorem gives the desired lemma. 

To establish the first identity, simply note that under any prior distribution \pi, the Bayes estimator under the Poissonized model given the realization N=m is exactly the Bayes estimator under the i.i.d. sampling model with m samples. \Box 

Lemma 8 shows that the minimax risk essentially does not change after Poissonization. We sometimes also consider approximate distributions in the Poissonized model, where \sum_{j=1}^k p_j may not necessarily sum into one (note that the distribution of the histogram is still well-defined). The approximate distribution is typically used in lower bounds where a product prior is assigned to (p_1,\cdots,p_k) and cannot preserve the distribution property. For statistical problems where the objective function or hypothesis depends on the vector P=(p_1,\cdots,p_k) in a nice way that changing P into \lambda P with \lambda\approx 1 will not change the objective much, in the lower bound it typically suffices to consider the approximate distributions in the Poissonized model. The key idea is that, conditioning on \sum_{j=1}^k h_j=m, the histogram (h_1,\cdots,h_k) is exactly distributed as that in the i.i.d. sampling model from P/\|p\|_1 with m samples. Then we may construct the estimator in the Poissonized model from the hypothetical optimal estimators (with different sample sizes) in the i.i.d. sampling model, and applying the same tail bounds as in the proof of Lemma 8 suffices. The details of the arguments may vary from example to example, but in most scenarios it will not hurt the lower bound. 

3.2. Generalized Uniformity Testing 

Consider the following generalized uniformity testing problem: given n i.i.d. observations from some discrete distribution P supported on at most k elements, one would like to test whether the underlying distribution P is uniform on its support. Note the difference from the traditional uniformity testing problem: the distribution P may be supported on a subset of [k] while still be uniform on this subset. Specifically, the task is to determine the sample complexity of distinguishing from the hypothesis H_0: P uniform on its support and H_1: P is \varepsilon-away from any uniform distribution with support \subseteq [k] under \ell_1 distance. We will show that the desired sample complexity is lower bounded by 

\Omega\left(\frac{\sqrt{k}}{\varepsilon^2} + \frac{k^{2/3}}{\varepsilon^{4/3}} \right).

Note that the first term \Omega(\sqrt{k}/\varepsilon^2) trivially follows from the Paninski’s construction in the traditional uniformity testing problem, the goal is to prove the second term. It is expected that the second term captures the difficulty of recovering the support of the distribution, and therefore we should consider a mixture of uniform distributions with random support in H_0. Specifically, we consider the following mixture: let U,U' be two random variables with 

U = \begin{cases} 0 & \text{w.p. } \frac{\varepsilon^2}{1+\varepsilon^2} \\ \frac{1+\varepsilon^2}{k} & \text{w.p. } \frac{1}{1+\varepsilon^2} \end{cases}, \qquad U' = \begin{cases} \frac{1-\varepsilon}{k} & \text{w.p. } \frac{1}{2} \\ \frac{1+\varepsilon}{k} & \text{w.p. } \frac{1}{2} \end{cases}.

Assign the k-fold product distribution of U (or U') to the probability vector (p_1,\cdots,p_k), then (p_1,\cdots,p_k) forms an approximate probability distribution since \mathop{\mathbb E}[U] = \mathop{\mathbb E}[U'] = k^{-1}. Moveover, under prior U the (normalized) distribution is always uniform, and under prior U' the (normalized) distribution is \Omega(\varepsilon)-far from any uniform distribution supported on a subset of [k] with high probability. Hence, neglecting the additional details for the approximate distribution, it suffices to show that 

\chi^2\left(\mathop{\mathbb E}[\mathsf{Poi}(nU)], \mathop{\mathbb E}[\mathsf{Poi}(nU')] \right) = O\left(\frac{1}{k}\right), \quad \text{if }n = O\left(\frac{k^{2/3}}{\varepsilon^{4/3}}\right).

To establish the above bound for the \chi^2-divergence, note that the random variables U,U' are chosen carefully with \mathop{\mathbb E}[U^m] = \mathop{\mathbb E}[(U')^m] for m=0,1,2. Moreover, 

|\mathop{\mathbb E}[(U-k^{-1})^m] - \mathop{\mathbb E}[(U' - k^{-1})^m ]| \le \frac{2\varepsilon^2}{k^m}, \qquad m\ge 3.

Consequently, Theorem 4 gives that 

\chi^2\left(\mathop{\mathbb E}[\mathsf{Poi}(nU)], \mathop{\mathbb E}[\mathsf{Poi}(nU')] \right) \le e^{n\varepsilon/k} \sum_{m=3}^\infty \frac{4\varepsilon^4}{m!(n/k)^m}\left(\frac{n}{k}\right)^{2m} = e^{n\varepsilon/k} \sum_{m=3}^\infty \frac{4\varepsilon^4}{m!}\left(\frac{n}{k}\right)^{m}.

Since k^{2/3}/\varepsilon^{4/3} is a stronger lower bound than \sqrt{k}/\varepsilon^2 if and only if k\ge \varepsilon^{-4}, under this condition and n\le k^{2/3}/\varepsilon^{4/3} we will have n\le k. Then the above inequality gives 

\chi^2\left(\mathop{\mathbb E}[\mathsf{Poi}(nU)], \mathop{\mathbb E}[\mathsf{Poi}(nU')] \right) = O\left(\frac{n^3\varepsilon^4}{k^3}\right),

which is the claimed upper bound on the \chi^2-divergence, establishing the lower bound.

We provide some discussions on the choice of U and U'. Theorem 4 suggests that if U and U' could match more moments, the \chi^2-divergence could even be smaller. However, the number of matched moments is in fact limited by the problem structure. In the traditional uniformity testing problem, in the last lecture we essentially choose U \equiv 1/k and U' as above. In this case, U and U' only match the first moment, which is the best possible for U must be a constant. In generalized uniformity testing, U must only be supported on two points, one of which is zero. Meanwhile, the support of U' can potentially be arbitrarily large. The next lemma shows that no matter how we choose U', we can match at most the first two moments. 

Lemma 9 Let \mu be a probability measure supported on k elements of [0,\infty), one of which is zero, and \mu' be any probability measure supported on [0,\infty). Then if \mu and \mu' match the first 2k-1 moments, we must have \mu = \mu'

Proof: Let X\sim \mu, X'\sim \mu'. Let the support of \mu be 0,x_1,\cdots,x_{k-1}. Consider the polynomial Q(x) = x\prod_{i=1}^{k-1}(x-x_i)^2 of degree 2k-1, the assumption gives \mathop{\mathbb E}[Q(X')] = \mathop{\mathbb E}[Q(X)]=0. Finally, since Q(X')\ge 0 is always non-negative, we have \mu' = \mu. \Box 

Lemma 9 applied to k=2 shows that moment matching up to degree 2 is the best we can hope for. In fact, the above lower bound is also tight (see bibliographic notes). 

3.3. Shannon Entropy Estimation 

Finally we revisit the Shannon entropy estimation problem where the target is to estimate the Shannon entropy H(P) = \sum_{i=1}^k -p_i\log p_i. Let R_{n,k}^\star be the minimax risk of estimating H(P) under the mean squared error, our target is to show that 

R_{n,k}^\star = \Omega\left( \left(\frac{k}{n\log n}\right)^2 + \frac{\log^2 k}{n} \right), \qquad \text{if } n = \Omega\left(\frac{k}{\log k}\right).

The lower bound \Omega(n^{-1}\log^2 k) has already been shown via the two-point method in Lecture 5, and therefore the remaining target is to establish \Omega(k^2/(n\log n)^2)

Similar to the L_1 norm example above, the Shannon entropy H(P) is a symmetric sum of individual functions of p_i, where the individual function is non-differentiable at zero. It suggests us to apply similar ideas based on moment matching and best polynomial approximation to establish the lower bound. Specifically, the target is to construct two priors \mu_0, \mu_1 on the interval [0,M] (with parameter M>0 to be chosen later) such that: 

  1. The priors \mu_0, \mu_1 have matched moments up to degree K (to be chosen later); 
  2. The difference between \mathop{\mathbb E}_{p\sim \mu_0}[-p\log p] and \mathop{\mathbb E}_{p\sim \mu_1}[-p\log p] is large; 
  3. With high probability, the Shannon entropy H(P) under P\sim \mu_0^{\otimes k} and P\sim \mu_1^{\otimes k} is well-separated; 
  4. The common mean \mathop{\mathbb E}_{p\sim \mu_0}[p] is at most O((n\log n)^{-1})

Note that the first requirement (moment matching) ensures a small TV distance between the mixtures, the second and third requirements ensure the separation property (i.e., lower bound the mean difference and upper bound the fluctuations), and the last requirement ensures that (p_1,\cdots,p_k) sums into a constant smaller than one (recall that k=O(n\log n) by assumption) and therefore setting p_{k+1} := 1-\sum_{i=1}^k p_i gives a valid probability vector (p_1,\cdots,p_{k+1}). We check these requirements one by one to find proper parameters (M,K)

For the first requirement, recall that Theorem 4 gives 

\|\mathop{\mathbb E}_{p\sim \mu_0}[\mathsf{Poi}(np)] -\mathop{\mathbb E}_{p\sim \mu_1}[\mathsf{Poi}(np)] \|_{\text{TV}}^2 \le \sum_{m=K+1}^\infty \frac{(nM)^{2m}}{m!(nM)^m} \le \sum_{m=K+1}^\infty \left(\frac{enM}{m}\right)^m.

Since the individual TV distance should be at most O(k^{-1}) (for the future triangle inequality), we should choose K = \Omega(\max\{nM, \log k \}) = \Omega(\max\{nM,\log n\}), where \log k\asymp \log n is due to the assumption n=\Omega(k/\log k) and that n= O(k^2) to make the first term \Omega(k^2/(n\log n)^2) become dominate in the minimax risk. 

For the second requirement, by the duality result in Lemma 5 it is essentially the best degree-K polynomial approximation error of -x\log x on [0,M]. The next lemma gives the best approximation error. 

Lemma 10

E_K(-x\log x; [0,M]) = \Theta\left(\frac{M}{K^2}\right).

By Lemma 10, the target is to maximize M/K^2 subject to the previous condition K=\Omega(\max\{nM,\log n \}). Simple algebra shows that the maximum is \Theta((n\log n)^{-1} ), with the maximizer M \asymp n^{-1}\log n, K\asymp \log n

To resolve the third requirement, note that the mean difference of H(P) is \Theta(k/(n\log n)) by the above choice of M and K. Moreover, since -p\log p\in [0, c(\log n)^2/n] for some constant c>0 if p\in [0,M], the sub-Gaussian concentration shows that the fluctuation of H(P) under both \mu_i^{\otimes k} is at most O(\sqrt{k}(\log n)^2/n). Since n=O(k^2), the fluctuation is indeed negligible compared with the mean difference. Careful analysis also shows that the contribution of p_{k+1} to the entropy difference is negligible. 

The last requirement requires proper modifications on the priors \mu_0, \mu_1 constructed in Lemma 5 to satisfy the mean constraint. This can be done via the following trick of change of measures: let \nu_0, \nu_1 be the priors constructed in Lemma 5which attains E_{\log n}(-\log x; [\frac{1}{n\log n}, \frac{\log n}{n}]), whose value is summarized in the following lemma. 

Lemma 11

E_{\log n}\left(-\log x; \left[\frac{1}{n\log n}, \frac{\log n}{n}\right]\right) = \Theta(1).

Next we construct the priors \mu_0, \mu_1 as follows: for i=0,1, set 

\mu_i(dx) = \left(1 - \mathop{\mathbb E}_{X\sim\nu_i}\left[\frac{1}{nX\log n}\right] \right)\delta_0(dx) + \frac{\nu_i(dx)}{nx\log n}.

Then it is easy to show that both \mu_0 and \mu_1 are probability measures, have matched moments up to degree K+1, and have mean (n\log n)^{-1}. Moreover, 

\mathop{\mathbb E}_{x\sim \mu_1} \left[-x\log x \right] - \mathop{\mathbb E}_{x\sim \mu_0} \left[-x\log x \right] = \frac{\mathop{\mathbb E}_{x\sim \nu_1}[-\log x] - \mathop{\mathbb E}_{x\sim \nu_0}[-\log x] }{n\log n} \asymp \frac{1}{n\log n}.

Hence, the fourth requirement is fulfilled without hurting the previous ones. 

In summary, Theorem 1 holds with \Delta = \Theta(k^2/(n\log n)^2), and we arrive at the desired lower bound R_{n,k}^\star = \Omega(k^2/(n\log n)^2 )

4. Bibliographic Notes 

The method of two fuzzy hypotheses (Theorem 1) are systematically used in Ingster and Suslina (2012) on nonparametric testing, and the current form of the theorem is taken from Theorem 2.14 of Tsybakov (2009). Statistical closeness of Gaussian mixture models via moment matching (Theorem 3) was established in Cai and Low (2011), Hardt and Price (2015), Wu and Yang (2018) for the \chi^2-divergence, where the result is new for the TV distance. For Theorem 4, the TV version was established in part by Jiao, Kartik, Han and Weissman (2015) and Jiao, Han and Weissman (2018), where the \chi^2 version is new here. When U,U'\in [0,M], a stronger bound of the TV distance without the squared root is also available in Wu and Yang (2016). For more properties of Hermite and Charlier polynomials, we refer to Labelle and Yeh (1989). 

For Gaussian examples, the proper learning of two-component Gaussian mixture was established in Hardt and Price (2015). The L_1 norm estimation problem was taken from Cai and Low (2011), which was further motivated by Lepski, Nemirovski and Spokoiny (1999). The Gaussian mean testing example under L_1 metric was taken from Ingster and Suslina (2012). For technical lemmas, proofs of Lemma 5 is taken from Lepski, Nemirovski and Spokoiny (1999) and Wu and Yang (2016), respectively, and Lemma 6 was due to Bernstein (1912). 

For Poisson examples, the non-asymptotic equivalence between i.i.d. sampling model and the Poissonized model was due to Jiao, Kartik, Han and Weissman (2015). The tight bounds of the generalized uniformity testing problem were due to Diakonikolas, Kane and Stewart (2018), where their proof was greatly simplified here thanks to Theorem 4. For Shannon entropy estimation, the optimal sample complexity was obtained in Valiant and Valiant (2011), and the minimax risk was obtained independently in Jiao, Kartik, Han and Weissman (2015) and Wu and Yang (2016). For tools in approximation theory to establish Lemma 10 and 11, we refer to books Devore and Lorentz (1993), Ditzian and Totik (2012) for wonderful toolsets. 

  1. Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012. 
  2. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. 
  3. T. Tony Cai, and Mark G. Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics 39.2 (2011): 1012–1041. 
  4. Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two gaussians. Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, 2015. 
  5. Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. arXiv preprint arXiv:1807.07237 (2018). 
  6. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman, Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835-2885. 
  7. Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the L_1 distance. IEEE Transactions on Information Theory 64.10 (2018): 6672–6706. 
  8. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702–3720. 
  9. Jacques Labelle, and Yeong Nan Yeh. The combinatorics of Laguerre, Charlier, and Hermite polynomials. Studies in Applied Mathematics 80.1 (1989): 25–36. 
  10. Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the L r norm of a regression function. Probability theory and related fields 113.2 (1999): 221-253. 
  11. Serge Bernstein. Sur l’ordre de la meilleure approximation des fonctions continues par des polynomes de degré donné. Vol. 4. Hayez, imprimeur des académies royales, 1912. 
  12. Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018. 
  13. Gregory Valiant and Paul Valiant. The power of linear estimators. 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science. IEEE, 2011. 
  14. Ronald A. DeVore and George G. Lorentz. Constructive approximation. Vol. 303. Springer Science & Business Media, 1993. 
  15. Zeev Ditzian and Vilmos Totik. Moduli of smoothness. Vol. 9. Springer Science & Business Media, 2012. 

Lecture 6: Generalized Two-Point Method: Point vs. Mixture

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In the last lecture we introduced Le Cam’s two-point method to establish information-theoretic lower bounds, where the target is to find two single hypotheses which are well separated while indistinguishable. However, restricting to two single hypotheses gives the learner a lot of information in many scenarios, and any pair of single hypotheses which are far apart becomes distinguishable. From another point of view, if we would like to mimic the Bayesian idea of using the least favorable prior, it is very unlikely that the least favorable prior is supported on only two points. Hence, in this lecture we generalize the two-point idea to the case where one point is a mixture of distributions, which suits well in various examples. 

1. General Tools 

1.1. A Motivating Example 

We first discuss an example where the simple two-point method fails. Consider the following hidden clique detection problem: there is an underlying undirected simple graph G with vertex set [n] which either comes from an Erdös-Rényi graph \mathcal{G}(n,1/2) (where each edge appears with probability 1/2 independently), or there is a clique on k vertices of G and all other edges appear with probability 1/2 independently. Recall that a clique is a complete graph where each two vertices are connected via an edge. The target is to reliably distinguish between these two cases, i.e., a vanilla Erdös-Rényi graph versus that planted with a hidden clique of size k

This task is a composite hypothesis testing problem, where H_0 consists of a single hypothesis where G=\mathcal{G}(n,1/2), while H_1 consists of a family of hypotheses depending on the precise location of the hidden clique. Of course, an Erdös-Rényi graph is unlikely to have large cliques, thus this task becomes easier when k is large. Let k_n>0 be the threshold that the total error probability of the optimal test is at most 1/3 when k\ge k_n and is at least 1/3 when k<k_n

We first try to use Le Cam’s two-point method to find a lower bound on k_n. Let P\in H_0 be the distribution of \mathcal{G}(n,1/2), and Q_S\in H_1 be the distribution of the graph where there is a planted clique on the vertex set S\subseteq [n] with |S|=k. To compute the total variation distance between P and Q_S, note that with probability 2^{-\binom{k}{2}} the Erdös-Rényi graph has a clique supported on S. Hence, 

\|P - Q_S \|_{\text{TV}} = 1 - 2^{-\binom{k}{2}}

holds for any S. As a result, we have k=\Omega(1) to ensure that \|P - Q_S \|_{\text{TV}}\ge 2/3, which is not an informative bound.

The problem with the above two-point approach is that, if the learner is told the exact location of the clique, she can simply look at the subgraph supported on that location and tell whether a clique is present. In other words, any single hypothesis from H_1 gives too much information to the learner. To overcome this difficulty, a natural idea is to take the support S of the clique uniformly distributed on all size-k subsets of [n], which is expected to result in a graph looking more similar to the Erdös-Rényi graph. This is exactly the central topic of this lecture. 

1.2. Generalized Le Cam’s Method: Point vs. Mixture 

As is shown in the previous example, there is need to generalize the Le Cam’s two-point method to the scenario where one of the points is a mixture of distributions. The theoretical foundation of the generalization is in fact quite simple, and the proof of the next theorem parallels that of Theorem 7 in Lecture 5. 

Theorem 1 (Point vs. Mixture) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and there exist \theta_0\in \Theta and \Theta_1 \subseteq \Theta such that

L(\theta_0, a) + L(\theta_1, a) \ge \Delta, \qquad \forall a\in \mathcal{A}, \theta_1\in \Theta_1,

then for any probability measure \mu supported on \Theta_1, we have

\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_0} - \mathop{\mathbb E}_{\mu(d\theta)}P_{\theta} \|_{\text{\rm TV}}\right).

The main difficulties in the practical application of Theorem 1 lie in two aspects. First, it may be non-trivial to find the proper mixture \mu and the support \Theta_1. Second, upper bounding the total variation distance \|P_{\theta_0} - \mathop{\mathbb E}_{\mu(d\theta)}P_{\theta} \|_{\text{\rm TV}} may be hard, especially when P_\theta takes the form of Q_\theta^{\otimes n}. The next section proposes a general tool to overcome the second difficulty, where we use various examples to illustrate the art of the first difficulty. 

1.3. Upper Bounding the Divergence 

To upper bound the divergence between a single distribution and a mixture of distributions, the next lemma shows that the \chi^2-divergence is a proper choice. 

Lemma 2 Let (P_\theta)_{\theta\in\Theta} be a family of probability distributions parametrized by \theta, and Q be any fixed probability distribution. Then for any probability measure \mu supported on \Theta, we have

\chi^2\left(\mathop{\mathbb E}_\mu [P_\theta], Q \right) = \mathop{\mathbb E}\left[ \int \frac{dP_\theta dP_{\theta'}}{dQ} \right] - 1,

where the expectation is taken with respect to independent \theta,\theta'\sim \mu

Proof: Follows from the definition \chi^2(P,Q) = \int\frac{(dP)^2}{dQ}-1 and simple algebra. \Box 

The advantages of Lemma 2 are two-fold. First, it factorizes well for tensor products: 

\chi^2\left(\mathop{\mathbb E}_\mu [P_\theta^{\otimes n}], Q^{\otimes n} \right) =\mathop{\mathbb E}\left[ \left( \int \frac{dP_\theta dP_{\theta'}}{dQ}\right)^n \right] - 1.

This identity makes it more convenient to use \chi^2-divergence to deal with the point-mixture scenario. Second, for many statistical models (and a proper choice of Q), the quantity 

\int \frac{dP_\theta dP_{\theta'}}{dQ}

admits a simple closed-form expression in (\theta,\theta'). We will see several examples in the rest of this lecture. 

2. Examples of Testing Problems 

In this section, we provide examples of composite hypothesis testing problems where the null hypothesis H_0 consists of a single probability distribution, while the alternative H_1 consists of a family of distributions. We will see that choosing a proper mixture and applying the tools in Lemma 2 gives desired lower bounds. 

2.1. Hidden Clique Detection 

Recall the hidden clique detection problem described in Section 1.1, where we aim to obtain tight lower bounds for the threshold k_n on the clique size. As discussed, the point is chosen to be the single distribution P for the Erdös-Rényi graph, and the mixture is taken to be \mathop{\mathbb E}_S[Q_S] where S is uniformly distributed on all size-k subsets of [n]. It remains to upper bound \chi^2(\mathop{\mathbb E}_S[Q_S], P), which by Lemma 2 equals to 

\chi^2(\mathop{\mathbb E}_S[Q_S], P) = \mathop{\mathbb E}_{S,S'}\left[ \int \frac{dQ_S dQ_{S'}}{dP} \right] - 1.

By simple algebra, 

\int \frac{dQ_S dQ_{S'}}{dP} = 2^{\binom{|S\cap S'|}{2}}.

Since the probability of |S\cap S'|=r for r=0,\cdots,k is \binom{k}{r}\binom{n-k}{k-r}/\binom{n}{k} , we conclude that 

\begin{array}{rcl} \chi^2(\mathop{\mathbb E}_S[Q_S], P) = \sum_{r=0}^k 2^{\binom{r}{2}}\cdot \frac{\binom{k}{r} \binom{n-k}{k-r}}{\binom{n}{k}} - 1 = \sum_{r=1}^k 2^{\binom{r}{2}}\cdot \frac{\binom{k}{r} \binom{n-k}{k-r}}{\binom{n}{k}} . \end{array}

After carefully bounding the binomial coefficients, we conclude that the \chi^2-divergence is O(1) as long as k = 2\log_2 n - 2\log_2\log_2 n+\Omega(1). Hence, the threshold k_n has the lower bound k_n = 2\log_2 n - 2\log_2\log_2 n+\Omega(1)

This lower bound is tight. Note that the expected number of size-k cliques in \mathcal{G}(n,1/2) is exactly 


using some algebra and Markov’s inequality, we conclude that if k = 2\log_2 n -2\log_2\log_2 n + O(1), with high probability the Erdös-Rényi graph has no size-k cliques. Hence, by enumerating size-k cliques in the graph, we have k_n = 2\log_2 n -2\log_2\log_2 n +\Theta(1)

2.2. Uniformity Testing 

The uniformity testing problem, also known as the goodness-of-fit in theoretical computer science, is as follows. Given n i.i.d. observations from a discrete distribution P supported on at most k elements, one would like to test between 

H_0: P = U_k \qquad \text{vs.} \qquad H_1: \|P - U_k\|_1\ge \varepsilon,

where \varepsilon>0 is some prescribed parameter (which may depends on n), and U_k denotes the uniform distribution over [k]. The target is to determine the sample complexity of this problem, i.e., find the smallest sample size n such that there exists a test with total error probability at most 1/3

Note that H_1 is a composite hypothesis. If we would like to choose a single distribution from H_1, the most natural choice would be Q = (\frac{1+\varepsilon}{k},\cdots,\frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k},\cdots,\frac{1-\varepsilon}{k}). For the learner to distinguish between P = U_k and Q, she could combine the first and second half of symbols into two super-symbols, with uniform distribution (\frac{1}{2},\frac{1}{2}) under H_0, and distribution (\frac{1+\varepsilon}{2},\frac{1-\varepsilon}{2}) under H_1. Consequently, \Omega(\varepsilon^{-2}) samples are sufficient to distinguish between these two cases, which loses dependence on the support size k. As before, the problem of choosing a single hypothesis in H_1 is that the learner then knows the exact locations of the symbols with probability larger than 1/k and can combine them. 

Hence, we should pick up several distributions from H_1 and consider a proper mixture. The solution is to use the following Paninski’s construction: let random vector v be distributed uniformly on \{\pm 1\}^{k/2} (assuming that k is even), and consider the mixture \mathop{\mathbb E}_v[Q_v^{\otimes n}], with 

Q_v = \left(\frac{1+v_1\varepsilon}{k}, \frac{1-v_1\varepsilon}{k}, \cdots, \frac{1+v_{k/2}\varepsilon}{k}, \frac{1-v_{k/2}\varepsilon}{k} \right).

Clearly, each Q_v belongs to H_1. Moreover, 

\int \frac{dQ_v dQ_{v'}}{dP} = \frac{1}{k}\sum_{i=1}^{k/2}\sum_{u\in \{\pm 1\}} (1+uv_i\varepsilon)(1+uv_i'\varepsilon) = 1 + \frac{2\varepsilon^2}{k}\sum_{i=1}^{k/2} v_iv_i'.

Therefore, by Lemma 2 and the discussions thereafter, we have 

\begin{array}{rcl} \chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n} \right) +1 &=& \mathop{\mathbb E}_{v,v'}\left[ \left(1 + \frac{2\varepsilon^2}{k}\sum_{i=1}^{k/2} v_iv_i'\right)^n \right] \\ &\le & \mathop{\mathbb E}_{v,v'} \exp\left(\frac{2n\varepsilon^2}{k} \sum_{i=1}^{k/2} v_iv_i' \right) \\ &\le & \mathop{\mathbb E}_v \exp\left(\frac{2n^2\varepsilon^4}{k^2} \|v\|_2^2 \right) \\ &=& \exp\left(\frac{n^2\varepsilon^4}{k} \right), \end{array}

where the last inequality follows from the 1-subGaussianity of the vector v'. Hence, as long as n = O(\frac{\sqrt{k}}{\varepsilon^2}), the above \chi^2-divergence is upper bounded by O(1). Therefore, the sample complexity of the uniformity testing problem is at least n = \Omega(\frac{\sqrt{k}}{\varepsilon^2}), which is also tight (see bibliographic notes for details). 

2.3. General Identity Testing 

Next we generalize the uniformity testing problem to the case where the uniform distribution U_k can be replaced by any given probability distribution U. Specifically, given n i.i.d. observations from a discrete distribution P supported on at most k elements, one would like to test between 

H_0: P = U \qquad \text{vs.} \qquad H_1: \|P - U\|_1\ge \varepsilon,

where \varepsilon>0 is some prescribed parameter, and U=(u_1,\cdots,u_k) is a fixed known probability distribution. Again, the target is to determine the sample complexity of this identity testing problem, in terms of U and \varepsilon

For simplicity, we assume that k is an even integer, and the components of U appear in pairs, i.e., u_1 = u_2, u_3 = u_4, etc. We reparametrize U by U=(u_1,u_1,u_2,u_2,\cdots,u_{k/2},u_{k/2}). We also assume that u_i\ge \varepsilon^3/8 for all i. The general case requires proper trimming on U, and is referred to the bibliographic notes. As in the previous Paninski’ construction, we again consider a uniform mixture in H_1, but with different perturbation sizes. Specifically, let random vector v be distributed uniformly on \{\pm 1\}^{k/2} (assuming that k is even), and consider the mixture \mathop{\mathbb E}_v[Q_v^{\otimes n}], with 

Q_v = \left(u_1 + v_1\varepsilon_1, u_1 - v_1\varepsilon_1, \cdots, u_{k/2}+v_{k/2}\varepsilon_{k/2}, u_{k/2} - v_{k/2}\varepsilon_{k/2} \right).

Here the parameters \varepsilon_1,\cdots,\varepsilon_{k/2}>0 will be specified later. Note that Q_v is a valid probability distribution if \varepsilon_i \le u_i for all i, and the assumption \|Q_v - U\|_1 \ge \varepsilon requires that \sum_{i=1}^{k/2} \varepsilon_i \ge \varepsilon/2. As before, 

\chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n} \right) +1 \le \mathop{\mathbb E}_{v,v'} \exp\left( 2n\sum_{i=1}^{k/2} \frac{\varepsilon_i^2}{u_i} v_iv_i' \right) \le \exp\left(2n^2\sum_{i=1}^{k/2} \frac{\varepsilon_i^4}{u_i^2} \right),

where again the last step follows from the sub-Gaussianity. Since our target is to find the largest possible n such that the \chi^2-divergence is O(1), we aim to find \varepsilon_i‘s to minimize the quantity \sum_{i=1}^{k/2} \varepsilon_i^4 / u_i^2 subject to the above constraints. To do so, we recall the following elementary inequality. 

Lemma 3 (Hölder’s Inequality) Let a_{i,j} be non-negative reals for all i\in [m], j\in [n]. Then

\prod_{i=1}^m \left(\sum_{j=1}^n a_{i,j} \right)^{\frac{1}{m}} \ge \sum_{j=1}^n \left(\prod_{i=1}^m a_{i,j} \right)^{\frac{1}{m}}.

Proof: By homogeneity we may assume that \sum_{j=1}^n a_{i,j}=1 for all i\in [m]. Then the desired inequality follows from the AM-GM inequality 

\left(\prod_{i=1}^m a_{i,j} \right)^{\frac{1}{m}} \le \frac{1}{m}\sum_{i=1}^m a_{i,j}. \Box

By Lemma 3, we have the following inequality: 

\left(\sum_{i=1}^{k/2} \frac{\varepsilon_i^4}{u_i^2} \right) \left(\sum_{i=1}^{k/2} u_i^{2/3}\right)^3 \ge \left(\sum_{i=1}^{k/2} \varepsilon_i\right)^4 \ge \frac{\varepsilon^4}{16},

which gives the lower bound \sum_{i=1}^{k/2} \varepsilon_i^4/u_i^2 \ge \varepsilon^4/(16\|U\|_{2/3}^2), with identity when

\varepsilon_i = \varepsilon u_i^{2/3}/(2\|U\|_{2/3}^{2/3}).

It is straightforward to check that \varepsilon_i \le u_i for all i thanks to the assumption u_i\ge \varepsilon^3/8. Consequently, we arrive at the lower bound 

n = \Omega\left(\frac{\|U\|_{2/3}}{\varepsilon^2} \right)

for the sample complexity, which is tight in general. Recall that how the mysterious (2/3)-norm is obtained by carefully choosing the mixture.

3. Examples of Estimation Problems 

In this section, we will look at several estimation problems and show how Theorem 1 provides the tight lower bound for these problems. We will see that the choice of the mixture becomes natural after we understand the problem. 

3.1. Linear Functional Estimation of Sparse Gaussian Mean 

Consider a Gaussian location model X\sim \mathcal{N}(\mu, \sigma^2I_p), where we assume that the mean vector \mu is s-sparse, i.e., \|\mu\|_0\le s. The target is to estimate the linear functional of \mu, i.e., L(\mu):= \sum_{i=1}^p \mu_i, and we aim to prove a tight minimax lower bound for the minimax risk R_{s,p,\sigma}^\star under the quadratic loss. 

The key feature of this problem is the dependence of the minimax risk on the sparsity parameter s. If one could know the support of \mu, summing over X_i for i\in \text{supp}(\mu) gives a natural estimator with risk O(s\sigma^2). However, due to the unknown support of \mu, we will show that R_{s,p,\sigma}^\star = \Omega(s^2\sigma^2\log p) for small s, where the additional penalty s\log p is essentially the logarithm of \binom{p}{s}, the number of possible supports of \mu

To incorporate the effect of unknown support into the lower bound, a natural choice would be to consider a random mean vector with support uniformly distributed on all size-s subsets of [p]. Formally, let P=\mathcal{N}(0,\sigma^2I_p), and Q_S = \mathcal{N}(\rho 1_S, \sigma^2 I_p) with 1_S being the indicator vector of the subset S\subseteq [p], and \rho>0 is a parameter to be specified. Moreover, we let S be uniformly distributed on all possible size-s subsets of [p]. Clearly, the separability condition in Theorem 1 is satisfied with \Delta = s^2\rho^2/2. To upper bound the \chi^2-divergence, we again compute 

\int \frac{dQ_S dQ_{S'}}{dP} = \exp\left(\frac{\rho^2}{\sigma^2} |S\cap S'|\right),

and consequently 

\chi^2\left(\mathop{\mathbb E}_S[Q_S], P \right) = \mathop{\mathbb E}_{S,S'} \exp\left(\frac{\rho^2}{\sigma^2} |S\cap S'|\right) - 1.

The random variable |S\cap S'| follows a hypergeometric distribution with pmf \mathop{\mathbb P}(|S\cap S'| = r) = \binom{s}{r}\binom{p-s}{s-r} / \binom{p}{s}. It is well-known in probability theory that a hypergeometric distribution refers to sampling without replacement, while the Binomial distribution \mathsf{B}(s,s/p) refers to sampling with replacement. The next lemma shows a relationship between these two random variables. 

Lemma 4 Let \{x_1,\cdots,x_M\} be an urn, X_1,\cdots,X_n be samples uniformly drawn from the urn without replacement, and X_1^\star, \cdots, X_n^\star be samples uniformly drawn from the urn with replacement. Define S_n = \sum_{i=1}^n X_i and S_n^\star = \sum_{i=1}^n X_i^\star, then there exists a coupling between (S_n, S_n^\star) such that \mathop{\mathbb E}[S_n^\star|S_n] = S_n

Proof: The coupling is defined as follows: for any distinct y_1,\cdots,y_n \in \{x_1,\cdots,x_M\} and not necessarily distinct z_1,\cdots,z_n\in \{y_1,\cdots,y_n\}, let the conditional distribution be 

\mathop{\mathbb P}(X_1^\star = z_1, \cdots, X_n^\star = z_n | X_1 = y_1, \cdots, X_n = y_n) = \frac{\binom{M}{n}}{M^n\binom{M-L}{n-L}},

where L denotes the number of distinct elements in z_1,\cdots,z_n. To verify that this is a valid conditional probability measure, recall that the number of ways of drawing n labeled elements with L distinct elements from an urn of n elements is S(n,L)(n)_L, where S(n,L) is the Stirling’s number of the second kind, and (n)_L := n! / (n-L)! is the falling factorial. Hence, the probability is valid due to the identity 

\sum_{L=1}^n \frac{\binom{M}{n}}{M^n\binom{M-L}{n-L}}\cdot S(n,L)(n)_L = \frac{1}{M^n}\sum_{L=1}^n S(n,L)(M)_L = 1,

where the last step is due to that \sum_{L=1}^n S(n,L)(M)_L is the number of ways of drawing n labeled elements from an urn of M elements and therefore equals M^n

From this conditional probability, it is obvious to see that the unconditional probability of (X_1^\star, \cdots, X_n^\star) coincides with sampling with replacement. Moreover, 

\sum_{i=1}^n \mathop{\mathbb P}(X_i^\star = y| X_1 = y_1, \cdots, X_n = y_n) = 1(y\in \{y_1,\cdots,y_n\})

follows from symmetry, and therefore \mathop{\mathbb E}[S_n^\star|S_n] = S_n. \Box 

By Lemma 4, there exists some \sigma-algebra \mathcal{B} such that |S\cap S'| has the same distribution as \mathop{\mathbb E}[Z|\mathcal{B}] with Z\sim \mathsf{B}(s,s/p). Now by Jensen’s inequality, 

\chi^2\left(\mathop{\mathbb E}_S[Q_S], P \right) \le \mathop{\mathbb E} \exp(\rho^2 Z/\sigma^2 ) - 1 = \left(1 - \frac{s}{p} + \frac{s}{p}\exp(\rho^2/\sigma^2) \right)^s - 1.

Hence, the largest \rho^2 to ensure a constant \chi^2-divergence is \rho^2 = \sigma^2\log(1+p/s^2), which by Theorem 1 gives 

R_{s,p,\sigma}^\star = \Omega\left(\sigma^2 s^2 \log\left(1+ \frac{p}{s^2}\right)\right),

which has a phase transition at s\sim \sqrt{p}. In fact, this lower bound is also tight: if s \ge \sqrt{p}, the natural estimator \sum_{i=1}^p X_i is rate-optimal; if s<\sqrt{p}, the thresholding estimator \sum_{i=1}^p X_i1(|X_i|\ge t) is rate-optimal for some proper threshold t

3.2. Largest Eigenvalue of Covariance Matrix 

Consider i.i.d. observations X_1,\cdots,X_n\sim \mathcal{N}(0,\Sigma), where \Sigma\in {\mathbb R}^{p\times p} is the covariance matrix. Assuming \|\Sigma\|_{\text{op}}\le 1, classical matrix concentration inequality shows that the empirical covariance matrix \hat{\Sigma} := n^{-1}\sum_{i=1}^n X_iX_i^\top satisfies that 

\mathop{\mathbb E}\|\hat{\Sigma} - \Sigma\|_{\text{op}} = O\left(\sqrt{\frac{p}{n}}\right).

We show that the \sqrt{p/n} bound is rate-optimal. In fact, we may even show a stronger result: the minimax risk R_{n,p}^\star in estimating \|\Sigma\|_{\text{op}} under the absolute value loss is at least 

R_{n,p}^\star = \Omega\left( 1 \wedge \sqrt{\frac{p}{n}}\right).

Note that by triangle inequality, estimating the largest eigenvalue is indeed an easier problem than covariance estimation. 

As before, a natural idea to establish the lower bound is to choose a uniform direction for the principal eigenvector, so that the learner must distribute her effort into all directions evenly. Specifically, we choose P = \mathcal{N}(0, aI_p), and Q_v = \mathcal{N}(0,aI_p + bvv^\top), where a,b>0 are some parameters to be chosen later, and v is uniformly distributed on the unit ball \{v\in {\mathbb R}^p: \|v\|_2 =1 \}. It is easy to verify that \|\Sigma\|_{\text{op}}\le 1 is fulfilled as long as a+b\le 1, and \Delta = b in Theorem 1. Moreover, 

\int \frac{dQ_vdQ_{v'}}{dP} = \left( 1 - \frac{b^2}{a^2}(v^\top v')^2 \right)^{-\frac{1}{2}}.

Hence, by Lemma 2, 

\chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n}\right) +1 = \mathop{\mathbb E}_{v,v'} \left( 1 - \frac{b^2}{a^2}(v^\top v')^2 \right)^{-\frac{n}{2}} \le \mathop{\mathbb E}_{v,v'} \exp\left(\frac{nb^2}{a^2}(v^\top v')^2 \right)

assuming b/a\le 1/2, where the last inequality follows from 1-x \ge \exp(-2x) whenever 0\le x\le 1/4. Computation the surface area of spherical caps gives

\mathop{\mathbb P}(|v^\top v'| \ge t) = \frac{1}{\text{Beta}(\frac{p-1}{2}, \frac{1}{2})}\int_0^{1-t^2} x^{\frac{p-3}{2}}(1-x)^{-\frac{1}{2}}dx = \exp(-\Omega(pt^2)).

Consequently, to make \chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n}\right) = O(1), we may take 

a = \frac{1}{2}, \qquad b = \frac{1}{4} \wedge \sqrt{\frac{p}{n}},

which gives the desired minimax lower bound R_{n,p}^\star = \Omega\left( 1 \wedge \sqrt{p/n}\right).

3.3. Quadratic Functional Estimation 

Finally we switch to a non-parametric example. Let X_1,\cdots,X_n be i.i.d. observations drawn from a density f, where f is supported on [0,1] and belongs to the Hölder ball with smoothness parameter s>0

f\in \mathcal{H}^s(L) := \left\{f\in C[0,1]: \sup_{x\neq y} \frac{|f^{(m)}(x) - f^{(m)}(y)| }{|x-y|^\alpha}\le L \right\},

with s = m+\alpha, m\in \mathbb{N}, \alpha\in (0,1]. The target is to estimate the quadratic functional 

Q(f) := \int_0^1 f(x)^2 dx,

and we let R_{n,s}^\star be the corresponding minimax risk under the absolute value loss. The central claim is that 

R_{n,s}^\star = \Omega\left( n^{-\frac{4s}{4s+1}} + n^{-\frac{1}{2}}\right),

which has an “elbow effect” at s=1/4

A central tool for proving nonparametric lower bounds is via parametric reduction, i.e., restricting f to a parametric family while the subproblem is as hard as the original nonparametric problem. In this example, a natural candidate is 

f_v(x) = 1 + c_0L\cdot \sum_{i=1}^{h^{-1}/2} v_ih^s\left[ g\left(\frac{x-(2i-2)h}{h} \right) - g\left(\frac{x-(2i-1)h}{h} \right) \right],

where h\in (0,1) is the bandwidth to be chosen later, the vector v satisfies \|v\|_\infty\le 1, and g is a fixed smooth function on {\mathbb R} vanishing outside [0,1]. It is a simple exercise to show that for c_0>0 sufficiently small, each f_v is a valid density on [0,1] and belongs to the Hölder ball \mathcal{H}^s(L). Intuitively, the density f_v is a perturbation on the uniform density with bumps modulated by the vector v. Moreover, 

Q(f_v) = 1 + (c_0L\|g\|_2)^2 h^{2s+1} \cdot\|v\|_2^2,

and therefore the target of the subproblem is to estimate \|v\|_2^2 based on X_1,\cdots,X_n\sim f_v. Again, the idea of Theorem 1 can be applied. 

Let v be uniformly distributed on \{\pm 1\}^{1/(2h)}. By simple calculation, 

\int_0^1 \frac{f_v(x)f_{v'}(x)}{f_0(x)}dx = 1 + (c_0L\|g\|_2)^2h^{2s+1}\cdot v^\top v'.

Consequently, Lemma 2 gives 

\begin{array}{rcl} \chi^2\left(\mathop{\mathbb E}_v[f_v^{\otimes n}], f_0^{\otimes n} \right) + 1 &=& \mathop{\mathbb E}_{v,v'} \left(1 + (c_0L\|g\|_2)^2h^{2s+1}\cdot v^\top v' \right)^n \\ &\le & \mathop{\mathbb E}_{v,v'} \exp\left( (c_0L\|g\|_2)^2nh^{2s+1}\cdot v^\top v'\right) \\ &\le& \exp\left(\frac{1}{2}(c_0L\|g\|_2)^4n^2h^{4s+2}\cdot \frac{1}{2h} \right). \end{array}

Hence, choosing h = \Theta(n^{-\frac{2}{4s+1}}) gives \chi^2\left(\mathop{\mathbb E}_v[f_v^{\otimes n}], f_0^{\otimes n} \right) = O(1). Moreover, Theorem 1 is fulfilled with \Delta = \Theta(h^{2s}) = \Theta(n^{-\frac{4s}{4s+1}}), and therefore we have proved that 

R_{n,s}^\star = \Theta(n^{-\frac{4s}{4s+1}}).

The parametric rate R_{n,s}^\star = \Theta(n^{-1/2}) is trivial (by the tools in Lecture 4). 

4. Bibliographic Notes 

A wonderful book which treats the point-mixture idea of Theorem 1 sysmatically and contains the method in Lemma 2 is Ingster and Suslina (2012). The sharp bounds on the clique size in the hidden clique problem are due to Matula (1972). It is a long-standing open problem whether there exists a polynomial-time algorithm attaining this statistical limit, and we refer to the lecture notes Lugosi (2017) for details. The sample complexity of the uniformity testing problem is due to Paninski (2008), and the general identity testing problem is studied in Valiant and Valiant (2017). There are also some other generalizations or tools for uniformity testing, e.g., Diakonikolas and Kane (2016), Diakonikolas, Kane and Stewart (2018), Blais, Canonne and Gur (2019). 

For the estimation problem, the linear functional estimation for sparse Gaussian mean is due to Collier, Comminges and Tsybakov (2017), where Lemma 4 can be found in Aldous (1985). The largest eigenvalue example is taken from the lecture note Wu (2017). The nonparametric functional estimation problem is widely studied in statistics, where the minimax rate for estimating the quadratic functional can be found in Bickel and Ritov (1988), and it is \Theta(n^{-\frac{4s}{4s+d}} + n^{-\frac{1}{2}}) for general dimension d. The same minimax rate (and the elbow effect) actually holds for all smooth functionals, see Birgé and Massart (1995), Kerkyacharian and Picard (1996) and Robins et al (2017). 

  1. Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012. 
  2. David W. Matula. Employee party problem. Notices of the American Mathematical Society. Vol. 19, No. 2, 1972. 
  3. Gábor Lugosi. Lectures on Combinatorial Statistics. 2017. 
  4. Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory 54.10 (2008): 4750–4755. 
  5. Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing 46.1 (2017): 429–455. 
  6. Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2016. 
  7. Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018. 
  8. Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory (TOCT) 11.2 (2019): 6. 
  9. Olivier Collier, Laëtitia Comminges, and Alexandre B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics 45.3 (2017): 923–958. 
  10. David J. Aldous. Exchangeability and related topics. Ecole d’Eté de Probabilités de Saint-Flour XIII-1983. Springer, Berlin, Heidelberg, 1985. 1–198. 
  11. Yihong Wu. Lecture notes for ECE598YW: Information-theoretic methods for high-dimensional statistics. 2017. 
  12. Peter J. Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya: The Indian Journal of Statistics, Series A (1988): 381–393. 
  13. Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. The Annals of Statistics 23.1 (1995): 11–29. 
  14. Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics 24.2 (1996): 485-507. 
  15. James Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and Aad van der Vaart. Higher order estimating equations for high-dimensional models. Annals of statistics 45, no. 5 (2017): 1951. 

Lecture 5: Hypothesis Testing and Le Cam’s Two-point Method

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In previous lectures we have seen the ideas of model reduction and its application in establishing the asymptotic lower bounds. However, in many scenarios we would like to find good non-asymptotic bounds, and as we show in the examples of Lecture 3, finding such reductions may be quite difficult. In this and the next few lectures, we will introduce the testing idea which transforms the target problem (usually the estimation problem) into a proper testing problem. Despite the conceptual simplicity, the testing problem may take various forms, some of which may be difficult to analyze. This will be the central topic of the next few lectures. 

1. f-divergences and Inequalities 

1.1. f-divergences 

First we look at the possibly simplest testing problem: if P and Q are two probability measures on the same probability space, and a random sample X is drawn from either P or Q. A (binary) testing problem aims to find some test \Psi(X)\in \{0,1\} which with high probability outputs 0 when X\sim P, and outputs 1 when X\sim Q. To quantify the performance of a certain test \Psi, we define the total error probability as follows: 

\text{err}(\Psi) = P(\Psi(X) \neq 0) + Q(\Psi(X)\neq 1).

Intuitively, when P and Q are close to each other, we will expect that the total error probability for any test is large for they are hard to distinguish. Conversely, when P and Q are very far from each other (e.g., with a disjoint support), then there exists some test which can well distinguish them. The next lemma gives the expression of the minimum \text{err}(\Psi).

Lemma 1 (Le Cam’s first lemma) Minimizing all measurable tests \Psi gives

\min_{\Psi} \text{\rm err}(\Psi) = 1 - \|P-Q\|_{\text{\rm TV}}.

Recall that \|P-Q\|_{\text{\rm TV}} = \frac{1}{2}\int |dP-dQ| is the total variation distance between P and Q

Proof: Let A = \{X: \Psi(X) = 0\}, then 

\text{err}(\Psi) = 1 - P(A) + Q(A) \ge 1 - \sup_B |P(B) - Q(B)| = 1 - \|P-Q\|_{\text{\rm TV}},

with equality if and only if A = \{x: P(dx) \ge Q(dx) \}. \Box 

By Lemma 1, the total variation distance between P and Q exactly quantifies the minimum total error probability of distinguishing between two probability distributions. Similar to Lemma 1, if we impose different weights on the error probabilities, we will arrive at some divergences of the form c\cdot \int |dP - w\cdot dQ|, where c>0 is some constant, and w>0 is the weight on Q(\Psi(X)\neq 1). Recall from convex analysis that for any convex function f with compact support on {\mathbb R}, its second-order derivative f'' always exists in the sense of a positive Radon measure, and 

f(x) = a + bx + \frac{1}{2}\int |x-t| f''(dt) \ \ \ \ \ (1)

for some a,b\in {\mathbb R}.

Remark 1 The decomposition (1) of convex functions has several applications, e.g., in the local time analysis of continuous martingales. 

Motivated by (1), we can actually take a convex combination of the above divergences and arrive at the general class of f-divergences. 

Definition 2 (f-divergence) For a convex function f on [0,\infty) with f(1)=0, the f-divergence between probability measures P and Q on (\Omega,\mathcal{F}) is defined as

D_f(P, Q) = \int f\left(\frac{dP^\parallel}{dQ}\right) dQ + f^\star(0) (1- P^\parallel(\Omega) ),

where P^\parallel is the absolute continuous part of P w.r.t. Q, and f^\star(0) := \lim_{t\rightarrow\infty} f(t)/t

In most practical scenarios, we have P\ll Q, and therefore D_f(P\|Q) = \mathop{\mathbb E}_Q f(dP/dQ). Some well-known examples of f-divergences are as follows. 

Example 1

1. f(x) = |x-1|: total variation distance \|P-Q\|_{\text{\rm TV}}

2. f(x)=(\sqrt{x}-1)^2: Hellinger distance H^2(P,Q) = \int (\sqrt{dP}-\sqrt{dQ})^2

3. f(x)=-\log x: Kullback–Leibler (KL) divergence D_{\text{\rm KL}}(P\|Q) = \int dP\log\frac{dP}{dQ}

4. f(x) = x^2-1: \chi^2-divergence \chi^2(P,Q) = \int\frac{(dP-dQ)^2}{dQ}

The following properties also hold for any f-divergences. 

Lemma 3 D_f(P,Q)\ge 0 and is joint convex in (P,Q). In particular, D_f(\mathsf{K}P, \mathsf{K}Q)\le D_f(P,Q) for any transition kernel \mathsf{K}

Remark 2 The last inequality is known as the data processing inequality. 

Proof: The non-negativity follows from convexity and Jensen’s inequality. The joint convexity is due to that of the perspective transform (x,y)\mapsto yf(x/y) for convex f. Let Y be the output of X through kernel \mathsf{K}, then the last inequality follows from the joint convexity and 

\mathop{\mathbb E}_{Q_{X|Y}}\left[ \frac{dP_{XY}}{dQ_{XY}} \right] = \frac{dP_Y}{dQ_Y}.

We leave the details to the readers. \Box 

1.2. Joint range 

Lemma 1 involves the computation of the total variation distance, which may be hard to evaluate exactly for most statistical models, especially when there is a tensor power structure. As a comparison, Hellinger, KL and \chi^2-divergences tensorize better: 

\begin{array}{rcl} H^2(\otimes_i P_i, \otimes_i Q_i) &=& 2 - 2\prod_i \left(1-\frac{1}{2}H^2(P_i, Q_i) \right),\\ D_{\text{KL}}(\otimes_i P_i\| \otimes_i Q_i) &=& \sum_i D_{\text{KL}}(P_i \| Q_i), \\ \chi^2(\otimes_i P_i, \otimes_i Q_i) &=& \prod_i \left(1 + \chi^2( P_i, Q_i)\right) - 1. \end{array}

Consequently, to apply Lemma 1 to proving lower bounds, we need upper bounds of the total variation distance in terms of Hellinger, KL or \chi^2. We have seen examples of such inequalities from Lecture 3, but have no clue whether the claimed inequalities are tight. To obtain the best inequalities between different f-divergences, we study the joint range of f-divergences. 

For two convex functions f,g on {\mathbb R}_+ with f(1) = g(1)=0, we call (r,s) is in the joint range of (D_f, D_g) if we can find two distributions P,Q over some common probability space such that D_f(P,Q)=r, D_g(P,Q)=s. Hence, the joint range exactly characterizes the relationship between different f-divergences. The central result is as follows: 

Theorem 4 Let \mathcal{R}\subseteq {\mathbb R}^2 be the joint range of (D_f, D_g), and \mathcal{R}_k\subseteq {\mathbb R}^2 be the joint range of (D_f, D_g) when both P and Q are discrete distributions supported on at most k elements. Then

\mathcal{R} = \text{\rm conv}(\mathcal{R}_2) = \mathcal{R}_4,

where \text{\rm conv} denotes the convex hull. 

Theorem 4 provides a generic way to prove inequalities between different f-divergences. Specifically, it shows that to obtain the joint range, it suffices to choose P = (p, 1-p) and Q = (q,1-q), compute the pair (D_f(P,Q), D_g(P,Q)), vary (p,q)\in [0,1]^2 and then take the convex closure. The proof of Theorem 4 requires the following theorems in the convex analysis. 

Theorem 5 (Choquet–Bishop–de Leeuw) Let C be a metrizable convex compact subset of a locally convex topological vector space V. For any point x_0 in C, there exists a probability measure \mu supported on extremal points of C such that x_0 = \int x\mu(dx)

Theorem 6 (Fenchel–Eggleston–Carathéodory) Let S\subseteq {\mathbb R}^d be connected and C=\text{\rm conv}(S). For any x\in C, there exists x_1,\cdots,x_d\in S such that x\in \text{\rm conv}(x_1,\cdots,x_d)

We are ready to prove the statement of Theorem 4. 

Proof: For simplicity we always assume that P\ll Q, then D_f(P,Q) = \mathop{\mathbb E}_Q f(dP/dQ). Viewing the likelihood ratio dP/dQ as a random variable X, we conclude that 

\mathcal{R} = \{(\mathop{\mathbb E}[f(X)], \mathop{\mathbb E}[g(X)]): X\ge 0, \mathop{\mathbb E}[X] = 1 \}.

Similarly, \mathcal{R}_k has the same representation with the additional constraint that X can take at most k values. Let C be the set of all probability measures which are supported on [0,\infty) and have unit mean. By linearity of the expectation and the convexity of C we conclude that \mathcal{R} is convex. Moreover, since all extremal points of C are supported on two elements, by Choquet–Bishop–de Leeuw theorem (note that any topological vector space is always locally convex under the weak-\star topology, where the unit sphere is compact and metrizable) we conclude that the distribution of X can be written as a mixture of distributions supported on only two points. Then by the linearity of expectation again, we conclude that \mathcal{R} = \text{conv}(\mathcal{R}_2)

For the second identity \text{conv}(\mathcal{R}_2)= \mathcal{R}_4, first note that \mathcal{R}_2 is a connected set for it is a continuous image of the divergence pair on [0,1]^2. Hence, by Fenchel–Eggleston–Carathéodory theorem, any point in \text{conv}(\mathcal{R}_2) can be written as a convex combination of two points in \mathcal{R}_2, which gives rise to a random variable X supported on at most 4 elements. \Box 

Based on Theorem 4, one may verify the following inequalities: 

  • TV vs. Hellinger: \frac{H^2(P,Q)}{2} \le \|P-Q\|_{\text{TV}} \le \sqrt{H^2(P,Q)\left(1-\frac{H^2(P,Q)}{4}\right)}.
    Both inequalities are tight. 
  • TV vs. KL: Pinsker’s inequality gives \|P-Q\|_{\text{TV}} \le \sqrt{D_{\text{KL}}(P\|Q)/2}.
    This bound is vacuous when D_{\text{KL}}(P\|Q)\ge 2, where better bounds are available: \|P-Q\|_{\text{TV}} \le \sqrt{1-\exp(-D_{\text{KL}}(P\|Q))} \le 1 - \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q)).
    None of these bounds are tight. 
  • Hellinger vs KL: D_{\text{KL}}(P\|Q) \ge 2\log\frac{2}{2 - H^2(P,Q)}.
    This bound is tight. 
  • KL vs. \chi^2: D_{\text{KL}}(P\|Q) \le \log(1+\chi^2(P,Q)).
    This bound is tight. 

The above inequalities will be used quite often whenever we switch between convenient divergences.

2. Le Cam’s Two-point Method 

One popular way to prove a non-asymptotic lower bound of statistical estimation is via hypothesis testing. Let (P_\theta)_{\theta\in\Theta} be the statistical model of interest, and the loss function is L(\theta,\hat{\theta}): \Theta \times \Theta\rightarrow {\mathbb R}_+ (at the moment we assume that the target is to estimate \theta). The idea is to use the following weaker lower bound 

\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) \ge \sup_{\theta_1,\theta_2\in\Theta} \inf_{\hat{\theta}} \sup_{\theta\in \{\theta_1,\theta_2\} } \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}).

Suppose that we can find two possible candidates of \theta, say, \theta_1 and \theta_2, such that the following two conditions hold: 

  1. Large separation: the total loss L(\theta_1, \hat{\theta}) + L(\theta_2, \hat{\theta}) is always large for any \hat{\theta}
  2. Indistinguishability: based on the observation, the observer cannot reliably distinguish between P_{\theta_1} and P_{\theta_2}

then clearly the observer needs to pay a risk proportional to \inf_{\hat{\theta}} (L(\theta_1, \hat{\theta}) + L(\theta_2, \hat{\theta})). The next theorem makes this intuition precise.

Theorem 7 (Two-point Method) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and \theta_1,\theta_2\in \Theta satisfy that

L(\theta_1, a) + L(\theta_2, a) \ge \Delta, \qquad \forall a\in \mathcal{A},


\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}\right).

Proof: The desired result follows from the following chain of inequalities: 

\begin{array}{rcl} \inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) &\ge & \inf_{\hat{\theta}} \sup_{\theta\in \{\theta_1,\theta_2\} } \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) \\ &\ge & \inf_{\hat{\theta}} \frac{\mathop{\mathbb E}_{\theta_1} L(\theta_1,\hat{\theta}) + \mathop{\mathbb E}_{\theta_2} L(\theta_2,\hat{\theta}) }{2} \\ &=&\frac{1}{2} \inf_{\hat{\theta}} \int L(\theta_1, \hat{\theta}(x))P_{\theta_1}(dx) + \int L(\theta_2, \hat{\theta}(x))P_{\theta_2}(dx) \\ &\ge & \frac{1}{2} \int \left[\inf_{a} ( L(\theta_1,a) + L(\theta_2,a) ) \right]\min\{P_{\theta_1}(dx), P_{\theta_2}(dx) \} \\ &\ge& \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}\right). \end{array} \Box

Note that in Theorem 7, the parameter \Delta characterizes the separation between the two points, and 1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}} is the minimum error probability of testing between \theta = \theta_1 and \theta=\theta_2 (cf. Lemma 1). Typically these two quantities depend negatively on each other: when one becomes larger, the other gets smaller. A rule-of-thumb is to find (\theta_1, \theta_2) such that 1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}} is a constant, say 0.1, and then search for the largest possible separation \Delta

Although the proof of Theorem 7 did not explicitly apply testing arguments, the test can be seen as follows. Let the test be \Psi(X) = 1 if L(\theta_1, \hat{\theta}) < L(\theta_2, \hat{\theta}), and \Psi(X)=2 otherwise, then the separability condition ensures that \theta = \theta_\Psi for \theta\in \{\theta_1,\theta_2\} as long as L(\theta,\hat{\theta}) < \Delta/2. Hence, 

L(\theta, \hat{\theta} ) \ge \frac{\Delta}{2}\cdot 1(\theta\neq \theta_\Psi).

Replacing L(\theta,\hat{\theta}) with the above lower bound, and repeating the analysis in the proof of Theorem 7, a direct application of Lemma 1 gives the same result (with \Delta/2 replaced by \Delta/4). 

The two-point idea is simple but appears everywhere in various lower bounds, for it establishes the indistinguishability between two different tasks due to the uncertainties. Also, despite the conceptual simplicity, searching over all possible pairs of points is typically computationally expensive, and it requires effort to understand the problem structure and figure out the “correct” two points. In later lectures, this idea will be generalized in the following directions: 

  • Multiple points instead of two-point; 
  • One or both of the two points may be a mixture of distributions, i.e., the binary test has point-mixture or mixture-mixture structures; 
  • The chioce of the two points may depend on the estimator \hat{\theta}

3. Examples of Two-point Methods 

The remainder of this lecture is devoted to the applications of the two-point method in various problems. 

3.1. Example I: Parameter Estimation 

We begin with the simplest example of the Gaussian location model P_\theta = \mathcal{N}(\theta,\sigma^2) for \theta\in{\mathbb R}, and our target is to estimate \theta. Since a natural guess is that local perturbation on \theta makes the model indistinguishable, we choose \theta_1 = 0 and \theta_2 = \delta in the two-point method. Let R^\star be the minimax risk under the squared error loss, Theorem 7 gives 

R^\star \ge \frac{\delta^2}{2}\left(1 - \Phi\left( \frac{|\delta|}{2\sigma}\right)\right),

where \Phi(\cdot) is the CDF of \mathcal{N}(0,1). Maximizing the above lower bound over \delta\in {\mathbb R} gives R^\star\ge 0.332\sigma^2. Compared with R^\star = \sigma^2 derived in Lecture 4, the two-point bound loses a constant multiplicative factor.

Next we consider the bounded mean scenario where |\theta|\le B for a known parameter B in advance. Since the above \delta is chosen to be \Theta(\sigma) in the end, this becomes infeasible if the upper bound B is much smaller than \sigma. Consequently, we can choose \theta_1 = 0 and \theta_2 = \sigma \wedge B, which gives R^\star = \Omega(\sigma^2\wedge B^2). This lower bound is also tight, for one of the estimators \hat{\theta} = 0 or \hat{\theta}=X attains it. 

Consider another location model where one observes X_1,\cdots,X_n\sim \mathsf{Unif}(\theta,\theta+1) and aims to estimate \theta\in {\mathbb R}. Again, choosing \theta_1 = 0 and \theta_2 = \delta, the minimax risk R_n^\star is lower bounded by 

R_n^\star \ge \frac{\delta^2}{2} \left(1 - \|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \right).

By the triangle inequality, 

\|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \le n\|P_0 - P_\delta\|_{\text{\rm TV}} = n\delta.

Hence, choosing \delta = 1/(2n) gives R_n^\star \ge 1/(16n^2). The rate \Theta(n^{-2}) is also tight for the minimax risk, for the estimator \hat{\theta} = \min_i X_i attains the rate. 

Remark 3 The uniform location model is not QMD (see Lecture 4), thus it makes sense to have the correct minimax rate different from \Theta(n^{-1}). Moreover, the inequality \|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \le n\|P_0 - P_\delta\|_{\text{\rm TV}} may not be tight for other models. 

3.2. Example II: Functional Estimation 

Recall the discrete entropy estimation problem, where we observe n samples X_1,\cdots,X_n from a discrete distribution P=(p_1,\cdots,p_k), and the target is to estimate the Shannon entropy H(P) = \sum_{i=1}^k -p_i\log p_i. In Lecture 4 we proved an asymptotic minimax lower bound (\log k)^2/n for the mean squared error, and here we show that the same bound holds non-asymptotically (even when k grows with n). 

The idea of using two-point method here is that when we perturb each entry in the uniform distribution by some small \varepsilon/k, the value of the functional H(P) changes by an additive factor of \Theta(\varepsilon\log k), and the KL divergence between these two cases is \Theta(n\varepsilon^2). Hence, choosing \varepsilon \asymp n^{-1/2} will complete the proof. However, some cautions should be taken when choosing the specific points: 

  • The two points cannot be P_1 = (\frac{1}{k},\cdots,\frac{1}{k}), and P_2 = (\frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k},\cdots, \frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k}). In this way, we have | H(P_1) - H(P_2) | = \Theta(\varepsilon^2)
  • The two points cannot be P_1 = (\frac{1}{k},\cdots,\frac{1}{k}), and P_2 = (\frac{1-\varepsilon}{k},\cdots,\frac{1-\varepsilon}{k},\frac{1+(k-1)\varepsilon}{k}). In this way, we have |H(P_1) - H(P_2)|=\Theta(\varepsilon\log k) if \varepsilon = \Omega(k^{-1}) and |H(P_1) - H(P_2)|=\Theta(\varepsilon^2) otherwise. Hence, this choice fails for some parameters (n,k)

The final choice of the two-point is as follows: 

\begin{array}{rcl} P_1 &=& \left(\frac{1}{2(k-1)}, \cdots, \frac{1}{2(k-1)}, \frac{1}{2} \right), \\ P_2 &=& \left(\frac{1-\varepsilon}{2(k-1)},\cdots, \frac{1-\varepsilon}{2(k-1)}, \frac{1 + \varepsilon}{2} \right). \end{array}

Simple algebra gives |H(P_1) - H(P_2)| = \Theta(\varepsilon\log k), and 

D_{\text{KL}}(P_1^{\otimes n} \| P_2^{\otimes n}) = nD_{\text{KL}}(P_1 \| P_2) = \frac{n}{2}\log \frac{1}{1-\varepsilon^2} = \Theta(n\varepsilon^2).

Now choosing \varepsilon \asymp n^{-1/2} completes the proof of R_{n,k}^\star = \Theta((\log k)^2/n). This example shows that the choice of the two points may sometimes be a little bit delicate. 

3.3. Example III: Non-local Choice of Two Points 

We return to the entropy estimation problem, but this time we choose different points to arrive at an entirely different lower bound. Note that all above choices of the two points are based on local perturbations, and the resulting bounds are essentially for the variance. This example will show that a more sophisticated construction can also take care of the bias part. 

For observations X^n\sim P^{\otimes n}, let T=T(X^n) be the sorted histogram of X^n, i.e., T is the sorted version of (h_1,\cdots,h_k) with h_j = \sum_{i=1}^n 1(X_i = j). The next lemma shows that there exists a minimax estimator which depends only on T

Lemma 8 For any estimator \hat{H}(X^n) of the entropy H(P), there exists an estimator \hat{H}^\star(T) such that

\sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P(\hat{H}^\star(T) - H(P))^2 \le \sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P(\hat{H}(X^n) - H(P))^2.

Proof: The proof relies on the symmetry of H(P) in permutations of P. Let \sigma\in \mathcal{S}_k be any permutation on [k], clearly H(P) = H(P\circ \sigma), and the distribution of T under P is the same as that under P\circ \sigma. Hence, let \sigma(X^n) = (\sigma(X_1), \cdots, \sigma(X_n)), the estimator 

\hat{H}^\star = \frac{1}{k!}\sum_{\sigma\in\mathcal{S}_k} \hat{H}(\sigma(X^n))

depends only on T, and the desired inequality follows from Jensen’s inequality. \Box 

Lemma 8 shows that the model can be reduced to P^{\otimes n}\circ T^{-1}, which by data processing inequality results in a smaller total variation distance between two points. Now the new construction of the two points is motivated by the following lemma. 

Lemma 9 Let P=(p_1,\cdots,p_k) and Q=(q_1,\cdots,q_k) be two probability vectors, with \max_{i\in [k]} \max\{p_i, q_i \}\le 1/(40n). Then

\|P^{\otimes n}\circ T^{-1} - Q^{\otimes n}\circ T^{-1} \|_{\text{\rm TV}} \le \frac{1}{2} + 5\sum_{m=1}^\infty n^m\left|\sum_{i=1}^k (p_i^m - q_i^m) \right|.

The proof of Lemma 9 is referred to the bibliographic notes. Lemma 9 shows that if the lower moments of P and Q are equal, the total variation distance between their tensor powers will be small. This fact motivates us to consider distributions P and Q which are k/d repetitions of the vectors 

\frac{1}{k}(x_1,\cdots,x_d) \quad \text{ and }\quad \frac{1}{k}(y_1,\cdots,y_d),

respectively, where d>0 is some fixed integer, and

  1. P and Q are valid distributions: \sum_{i=1}^d x_i = \sum_{i=1}^d y_i = d
  2. P and Q have matched moments: \sum_{i=1}^d x_i^m = \sum_{i=1}^d y_i^m, m=0,1,\cdots,d-1
  3. P and Q have different entropies: \sum_{i=1}^d x_i\log x_i\neq \sum_{i=1}^d y_i\log y_i

The existence of such vectors x,y is ensured by the following lemma. 

Lemma 10 For any positive integer d, there exist vectors x and y satisfying the above properties. 

Proof: Fix any d-dimensional vector x with distinct positive entries. Consider 

p_\Delta(x) = (x-x_1)(x-x_2)\cdots(x-x_d) - \Delta,

which for \Delta>0 small enough clearly has d positive zeros. Let y_1,\cdots,y_d be the zeros of p_\Delta(x), then comparing the coefficients of x,x^2,\cdots,x^{d-1} on both sides of the polynomial and invoking Newton’s identity, vectors x and y have matched moments up to d-1. The first property is ensured via simple normalizations, and it suffices to choose a suitable \Delta to ensure the third property. \Box 

Using the above two points, we have |H(P) - H(Q)| = \Theta(1) and \sum_{i=1}^k p_i^m = \sum_{i=1}^k q_i^m for any m=1,\cdots,d-1. Moreover, for m\ge d, we have 

\left|\sum_{i=1}^k (p_i^m - q_i^m) \right| = O\left( k^{1-m} \right).

Hence, by Lemma 9, 

\|P^{\otimes n}\circ T^{-1} - Q^{\otimes n}\circ T^{-1} \|_{\text{\rm TV}} \le \frac{1}{2} + O\left( \sum_{m=d}^\infty n^m k^{1-m} \right) = \frac{1}{2} + O\left(\frac{k(n/k)^d}{1-(n/k)} \right),

which is strictly smaller than one if n = O(k^{1-1/d} ). Note that the choice of the integer d is arbitrary, we conclude that the minimax risk R_{n,k}^\star is at least a positive constant whenever n=O(k^{1-\varepsilon}) for any \varepsilon>0. This lower bound says nothing about R_{n,k}^\star when n gets larger, but compared with the previous bound R_{n,k}^\star = \Theta((\log k)^2/n), the new bound shows that n actually needs to be much larger than (\log k)^2 for the consistent estimation. 

In later lectures, it will be shown that the minimax risk of this problem is 

R_{n,k}^\star = \Theta\left(\left(\frac{k}{n\log n}\right)^2 + \frac{(\log k)^2}{n} \right).

Hence, the current bound is slightly weaker than the first term, and the old bound is exactly the second term. 

3.4. Example IV: Robust Statistics 

In this section we look at the Gaussian mean estimation again, but now with outliers. Specifically, we consider the following Huber’s \varepsilon-contamination model: 

P_{(\theta,Q)} = (1-\varepsilon)\mathcal{N}(\theta, I_p) + \varepsilon Q,

where Q denotes the outlier distribution and may be arbitrary. The robust estimation problem is that, given n i.i.d. observations from P_\theta, propose a robust estimator which minimizes the worst-case risk of estimating \theta, where the worst case is taken over both \theta\in{\mathbb R}^p and the outlier distribution Q. Under the squared L_2 loss, let R_{n,\varepsilon}^\star be the minimax risk with n samples and outlier probability \varepsilon

The central claim of this section is that 

R_{n,\varepsilon}^\star = \Omega\left(\frac{p}{n} + \varepsilon^2 \right).

This lower bound is tight for the Tukey median and some other estimators; we refer to the bibliographic notes. Since the \Omega(p/n) lower bound even holds when \varepsilon = 0, we may restrict our attention to showing that R_{n,\varepsilon}^\star = \Omega(\varepsilon^2)

To apply the two-point method, we need to choose both (\theta_1,\theta_2) and (Q_1,Q_2), where the latter is more important in this robust setting. Since Q_1 and Q_2 can be arbitrary probability distributions, we actually have a large degree of freedom and can actually achieve perfect ambiguity here. This phenomenon is summarized in the following lemma. 

Lemma 11 Let P_1, P_2 be any probability distributions with \| P_1 - P_2 \|_{\text{\rm TV}}\le \varepsilon/(1-\varepsilon). Then there exist probability distributions Q_1, Q_2 such that

(1-\varepsilon)P_1 +\varepsilon Q_1 = (1-\varepsilon)P_2 + \varepsilon Q_2.

Proof: Consider the signed measure Q=\frac{1-\varepsilon}{\varepsilon}(P_2 - P_1). Clearly Q(\Omega)=0, and by assumption |Q|(\Omega) \le 2, where |Q| denotes the total variation measure of Q. Hence, by Jordan decomposition of signed measures, we have Q = Q_1 - Q_2, where measures Q_1, Q_2 are mutually singular and Q_1(\Omega) = Q_2(\Omega)\le 1. Finally, adding any common measure \mu with \mu(\Omega)=1-Q_1(\Omega) to both Q_1, Q_2 completes the proof. \Box 

By Lemma 11, it suffices to find (\theta_1, \theta_2) such that 

\| \mathcal{N}(\theta_1, I_p) - \mathcal{N}(\theta_2, I_p) \|_{\text{\rm TV}} = 2\Phi\left(\frac{\|\theta_1 - \theta_2\|_2}{2} \right) - 1 \le \frac{\varepsilon}{1-\varepsilon},

and the incurred loss is \Omega(\|\theta_1 - \theta_2\|_2^2). Hence, the lower bound \Omega(\varepsilon^2) is completed by choosing \|\theta_1 - \theta_2\|_2 \asymp \varepsilon

3.5. Example V: Two-armed Bandits 

In this section we study the simplest online learning problem, namely, the two-armed bandit. At each time t\in [T], the learner can pull one of the two arms, and the reward of pulling arm i=1,2 follows a standard normal distribution r_{t,i}\sim \mathcal{N}(\mu_i,1), where the mean parameters \mu_1, \mu_2 are unknown to the learner. A policy \pi = (\pi_t)_{t\in [T]} is a sequence where \pi_t\in \{1,2\} denotes the arm to pull at time t, where \pi_t can depend on all past policies (\pi_\tau)_{\tau\in [t-1]} and past rewards (r_{\tau,\pi_\tau})_{\tau\in [t-1]}. The regret of a policy \pi is defined as the difference of the cumulative reward of the best arm and that of the policy, i.e., 

R_T(\pi) := T\max\{\mu_1,\mu_2\} - \sum_{t=1}^T \mu_{\pi_t}.

The target of the online learning is to devise a policy \pi with a small expected regret, where the expectation is taken over the randomness of the rewards (and therefore the random realizations of the policy). A fundamental tradeoff in online learning is the exploration-exploitation tradeoff, where the learner needs to pull the suboptimal arm sufficiently many times to identify the best arm with better precision, and also needs to exploit the current knowledge to focus on the arm which seems to be the best. 

In this section we aim to prove minimax lower bounds of the regrets in the two-armed bandits. First consider the case where \mu_1, \mu_2 can take any values in [0,1]. The two-point method suggests that we may find two worlds such that 

  1. there is no fixed policy whose regrets are small in both worlds; 
  2. these two worlds are hard to distinguish. 

To do so, we may take (\mu_1,\mu_2) = (\Delta,0) in the first world, and (\mu_1,\mu_2) = (0,\Delta) in the second world, where \Delta>0 is some parameter to be chosen. Note that in both worlds, pulling a wrong arm incurs an instanteous regret of \Delta. Let P_1^t and P_2^t be the distribution of the rewards up to time t in respective worlds, then the minimax regret is lower bounded by 

R_T^\star \ge \Delta \sum_{t=1}^T \frac{P_1^{t-1}(\pi_t = 2) + P_2^{t-1}(\pi_t = 1) }{2}.

Using Lemma 1 and 

\|P_1^t - P_2^t\|_{\text{\rm TV}}^2 \le \frac{1}{2}D_{\text{KL}}(P_1^t\|P_2^t) = \frac{t\Delta^2}{4} \le \frac{T\Delta^2}{4},

we conclude that R_T^\star = \Omega(\sqrt{T}) by choosing \Delta \asymp T^{-1/2}

We also consider the case where there is a known suboptimality gap \Delta between arms, i.e., |\mu_1 - \mu_2| \ge \Delta>0. Using the above arguments and the inequality 

\|P-Q\|_{\text{TV}} \le 1 - \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q) ),

we will arrive at 

R_T^\star = \Omega\left(\sum_{t=1}^T \Delta\exp(-t\Delta^2) \right) = \Omega\left(\frac{1-\exp(-T\Delta^2)}{\Delta}\right),

which converges to a constant 1/\Delta as T\rightarrow\infty. However, this lower bound is not tight – we will show that R_T^\star \sim (\log T)/\Delta as T gets large. To show this stronger lower bound, we need a more delicate construction of the two points. 

Set (\mu_1, \mu_2) = (\Delta,0) in world I, and (\mu_1,\mu_2) = (\Delta,2\Delta) in world II. Let P_1^t, P_2^t be defined as before, then 

D_{\text{KL}}(P_1^t \| P_2^t) \le 2\Delta^2\cdot \mathop{\mathbb E}_1[T_2],

where T_2 is the total number of pulls of arm 2, and \mathop{\mathbb E}_1 denotes the expectation under world I. Hence, applying the previous arguments, we have 

R_T^\star = \Omega(T\Delta\cdot \exp(-2\Delta^2 \mathop{\mathbb E}_1[T_2])).

Clearly, this is a better bound for small \mathop{\mathbb E}_1[T_2], while it is less useful when \mathop{\mathbb E}_1[T_2] is large. However, a large \mathop{\mathbb E}_1[T_2] implies that the learner pulls the suboptimal arm (i.e., arm 2) too many times in world I, and in mathematical words we have 

R_T^\star = \Omega(\Delta\cdot \mathop{\mathbb E}_1[T_2]).

Note that these two bounds exactly characterize the tradeoff between exploitation and exploration. Combining these bounds, we have 

R_T^\star = \Omega(\Delta\cdot \mathop{\mathbb E}_1[T_2] + T\Delta\cdot \exp(-2\Delta^2 \mathop{\mathbb E}_1[T_2]) ) = \Omega\left(\frac{\max\{1, \log(T\Delta^2) \} }{\Delta} \right),

where the second inequality follows from the minimization of the inner sum with respect to \mathop{\mathbb E}_1[T_2]\in [0,T]

In summary, we have two different lower bounds: when there is no assumption on the optimality gap, then R_T^\star = \Omega(\sqrt{T}); when the optimality gap is at least \Delta>0, then R_T^\star = \Omega(\max\{1, \log(T\Delta^2) \} / \Delta ). Both bounds are tight and there exist policies attaining these bounds; we refer to bibliographic notes for details. 

3.6. Example VI: Multi-armed Bandits 

Interestingly, the two-point method can also give the correct minimax lower bound of the regret for general multi-armed bandits (in later lectures, we will provide a proof using multiple points). However, the following application of the two-point method differs from all previous ones. In previous examples, we fix two points and prove that all estimators (or policies) cannot achieve small risk (or regret) under both points. In this example, the construction of the two points depends on the policy. 

We first define the multi-armed bandit problem. In fact, the only difference from the two-armed case is that now there are K arms in total, and the i-th arm has reward distribution \mathcal{N}(\mu_i,1). The regret of a given policy \pi is defined as 

R_T(\pi) = T\max_{i\in [K]} \mu_i - \sum_{t=1}^T \mu_{\pi_t}.

To apply the two-point method, let the first point be (\mu_1,\cdots,\mu_K)=(\Delta,0,\cdots,0). Clearly, the first arm is the optimal arm, and pulling any wrong arm incurs an instanteous regret \Delta. For any given policy \pi, the choice of the second point will depend on the policy: for i\neq 1, let \mathop{\mathbb E}[T_i] be the expected number of pulls of arm i for policy \pi under the first point. Since \sum_{i=2}^K \mathop{\mathbb E}[T_i]\le T, there must exist some i_0\neq 1 such that \mathop{\mathbb E}[T_{i_0}]\le T/(K-1). In other words, under the first point, the policy \pi does not pull arm i_0 much in expectation. Note that i_0 depends on the policy \pi

Now the second point is as follows: (\mu_1,\cdots,\mu_K) = (\Delta,0,\cdots,0,2\Delta,0,\cdots,0), where the only difference from the first point is that \mu_{i_0}=2\Delta. We claim that the two points are hard to distinguish: let P_1, P_2 be the distribution of all observed rewards under these two points, respectively, then 

D_{\text{KL}}(P_1 \| P_2) = 2\Delta^2\cdot \mathop{\mathbb E}[T_{i_0}] \le \frac{2\Delta^2 T}{K-1}.

Hence, by choosing \Delta \asymp \sqrt{K/T} we have D_{\text{KL}}(P_1 \| P_2) = O(1), and consequently the worst-case regret for the policy \pi is at least \Omega(T\Delta) = \Omega(\sqrt{KT}). By the arbitrariness of \pi, this is a minimax lower bound. This bound is tight for multi-armed bandits. 

The idea of constructing points based on given approaches is widely used in problems where several rounds of adaptivity are available. One future lecture will be devoted exclusively to this topic. 

4. Bibliographic Notes 

The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977) (see also Le Cam (1973) for the binary case). The general f-divergence was proposed by Csiszár (1967), and the joint range of f-divergences (Theorem 4) was established in Harremoës and Vajda (2011). For the results in convex analysis, the Choquet–Bishop–de Leeuw theorem can be found in Phelps (2001), and the Fenchel–Eggleston–Carathéodory theorem can be found in Eggleston (1958). The inequalities between f-divergences and the two-point method can be mostly found in the excellent textbook Tsabykov (2009). 

For the examples, the first lower bound of entropy estimation is taken from Wu and Yang (2016). For the second lower bound, Lemma 9 was proved in Valiant (2008) under the Poissonized model, and Lemma 10 is taken from Archaya et al. (2016). For robust statistics, Huber’s \varepsilon-contamination model dates back to Huber (1964), and the general lower bound technique in Lemma 11 is taken from Chen, Gao and Ren (2016). For the robust estimators achieving the lower bound, see Chen, Gao and Ren (2018) and Ilias et al. (2018). 

The lower bounds of bandit problems are taken from Voger (1960) and Lai and Robbins (1985). We also refer to Bubeck, Perchet and Rigollet (2013) and Lattimore and Szepesvári (2018) for non-asymptotic bounds. For more on bandit problems such as the upper bounds, we refer to the book Bubeck and Cesa-Bianchi (2012). 

  1. Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977. 
  2. Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics 1.1 (1973): 38-53. 
  3. Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica 2 (1967): 229-318. 
  4. Peter Harremoës and Igor Vajda. On pairs of f-divergences and their joint range. IEEE Transactions on Information Theory 57.6 (2011): 3230-3235. 
  5. Robert R. Phelps. Lectures on Choquet’s theorem. Springer Science & Business Media, 2001. 
  6. H. G. Eggleston. Convexity. Cambridge University Press, 1958. 
  7. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. 
  8. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720. 
  9. Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing 40.6 (2011): 1927-1968. 
  10. Jayadev Acharya et al. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory 63.1 (2016): 38-56. 
  11. Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics (1964): 73-101. 
  12. Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber’s \varepsilon-contamination model. Electronic Journal of Statistics 10.2 (2016): 3752-3774. 
  13. Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. The Annals of Statistics 46.5 (2018): 1932-1960. 
  14. Ilias Diakonikolas et al. Robustly learning a gaussian: Getting optimal error, efficiently. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2018. 
  15. Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics 31.2 (1960): 430-443. 
  16. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6.1 (1985): 4-22. 
  17. Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. Conference on Learning Theory, 2013. 
  18. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint (2018). 
  19. Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5.1 (2012): 1-122.