Lecture 6: Generalized Two-Point Method: Point vs. Mixture


(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In the last lecture we introduced Le Cam’s two-point method to establish information-theoretic lower bounds, where the target is to find two single hypotheses which are well separated yet indistinguishable. However, in many scenarios restricting to two single hypotheses gives the learner too much information, and any pair of single hypotheses which are far apart becomes distinguishable. From another point of view, if we would like to mimic the Bayesian idea of using the least favorable prior, it is very unlikely that the least favorable prior is supported on only two points. Hence, in this lecture we generalize the two-point idea to the case where one point is a mixture of distributions, which works well in a variety of examples. 

1. General Tools 

1.1. A Motivating Example 

We first discuss an example where the simple two-point method fails. Consider the following hidden clique detection problem: there is an underlying undirected simple graph G with vertex set [n] which either comes from an Erdös-Rényi graph \mathcal{G}(n,1/2) (where each edge appears with probability 1/2 independently), or contains a planted clique on k vertices while all other edges appear with probability 1/2 independently. Recall that a clique is a complete graph in which every two vertices are connected by an edge. The target is to reliably distinguish between these two cases, i.e., a vanilla Erdös-Rényi graph versus one planted with a hidden clique of size k. 

This task is a composite hypothesis testing problem, where H_0 consists of a single hypothesis where G=\mathcal{G}(n,1/2), while H_1 consists of a family of hypotheses depending on the precise location of the hidden clique. Of course, an Erdös-Rényi graph is unlikely to have large cliques, thus this task becomes easier when k is large. Let k_n>0 be the threshold such that the total error probability of the optimal test is at most 1/3 when k\ge k_n and is at least 1/3 when k<k_n. 

We first try to use Le Cam’s two-point method to find a lower bound on k_n. Let P\in H_0 be the distribution of \mathcal{G}(n,1/2), and Q_S\in H_1 be the distribution of the graph where there is a planted clique on the vertex set S\subseteq [n] with |S|=k. To compute the total variation distance between P and Q_S, note that with probability 2^{-\binom{k}{2}} the Erdös-Rényi graph has a clique supported on S. Hence, 

 \|P - Q_S \|_{\text{TV}} = 1 - 2^{-\binom{k}{2}}

holds for any S. As a result, we have k=\Omega(1) to ensure that \|P - Q_S \|_{\text{TV}}\ge 2/3, which is not an informative bound.

The problem with the above two-point approach is that, if the learner is told the exact location of the clique, she can simply look at the subgraph supported on that location and tell whether a clique is present. In other words, any single hypothesis from H_1 gives too much information to the learner. To overcome this difficulty, a natural idea is to take the support S of the clique to be uniformly distributed over all size-k subsets of [n], which is expected to result in a graph looking much more similar to the Erdös-Rényi graph. This is exactly the central topic of this lecture. 

1.2. Generalized Le Cam’s Method: Point vs. Mixture 

As is shown in the previous example, we need to generalize Le Cam’s two-point method to the scenario where one of the points is a mixture of distributions. The theoretical foundation of the generalization is in fact quite simple, and the proof of the next theorem parallels that of Theorem 7 in Lecture 5. 

Theorem 1 (Point vs. Mixture) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and suppose that there exist \theta_0\in \Theta and \Theta_1 \subseteq \Theta such that

 L(\theta_0, a) + L(\theta_1, a) \ge \Delta, \qquad \forall a\in \mathcal{A}, \theta_1\in \Theta_1.

Then for any probability measure \mu supported on \Theta_1, we have

 \inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_0} - \mathop{\mathbb E}_{\mu(d\theta)}P_{\theta} \|_{\text{\rm TV}}\right).

The main difficulties in the practical application of Theorem 1 lie in two aspects. First, it may be non-trivial to find the proper mixture \mu and the support \Theta_1. Second, upper bounding the total variation distance \|P_{\theta_0} - \mathop{\mathbb E}_{\mu(d\theta)}P_{\theta} \|_{\text{\rm TV}} may be hard, especially when P_\theta takes the form of Q_\theta^{\otimes n}. The next section proposes a general tool to overcome the second difficulty, while the examples in the rest of this lecture illustrate the art of handling the first. 

1.3. Upper Bounding the Divergence 

To upper bound the divergence between a single distribution and a mixture of distributions, the next lemma shows that the \chi^2-divergence is a proper choice. 

Lemma 2 Let (P_\theta)_{\theta\in\Theta} be a family of probability distributions parametrized by \theta, and Q be any fixed probability distribution. Then for any probability measure \mu supported on \Theta, we have

 \chi^2\left(\mathop{\mathbb E}_\mu [P_\theta], Q \right) = \mathop{\mathbb E}\left[ \int \frac{dP_\theta dP_{\theta'}}{dQ} \right] - 1,

where the expectation is taken with respect to independent \theta,\theta'\sim \mu. 

Proof: Follows from the definition \chi^2(P,Q) = \int\frac{(dP)^2}{dQ}-1 and simple algebra.  \Box 

The advantages of Lemma 2 are two-fold. First, it factorizes well for tensor products: 

 \chi^2\left(\mathop{\mathbb E}_\mu [P_\theta^{\otimes n}], Q^{\otimes n} \right) =\mathop{\mathbb E}\left[ \left( \int \frac{dP_\theta dP_{\theta'}}{dQ}\right)^n \right] - 1.

This identity makes it more convenient to use \chi^2-divergence to deal with the point-mixture scenario. Second, for many statistical models (and a proper choice of Q), the quantity 

 \int \frac{dP_\theta dP_{\theta'}}{dQ}

admits a simple closed-form expression in (\theta,\theta'). We will see several examples in the rest of this lecture. 
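
To see Lemma 2 in action numerically, here is a minimal Python sketch that builds a small discrete family (P_\theta), a prior \mu, and a reference distribution Q at random (all sizes and distributions below are arbitrary illustrative choices) and checks that the two sides of the identity coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete family (P_theta), a prior mu over finitely many theta's,
# and a reference distribution Q; all of them are random illustrative choices.
m, T = 5, 4                              # support size, number of theta's
P = rng.dirichlet(np.ones(m), size=T)    # row t is the distribution P_theta
mu = rng.dirichlet(np.ones(T))           # prior mu over theta
Q = rng.dirichlet(np.ones(m))            # reference distribution Q

mix = mu @ P                             # the mixture E_mu[P_theta]

# Left-hand side of Lemma 2: chi^2(E_mu[P_theta], Q) = sum_x mix(x)^2 / Q(x) - 1
lhs = np.sum(mix**2 / Q) - 1

# Right-hand side: E_{theta, theta' ~ mu} [ int dP_theta dP_theta' / dQ ] - 1
pair = (P / Q) @ P.T                     # pair[t, t'] = sum_x P_t(x) P_t'(x) / Q(x)
rhs = mu @ pair @ mu - 1

print(lhs, rhs)                          # the two quantities agree
```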

2. Examples of Testing Problems 

In this section, we provide examples of composite hypothesis testing problems where the null hypothesis H_0 consists of a single probability distribution, while the alternative H_1 consists of a family of distributions. We will see that choosing a proper mixture and applying the tools in Lemma 2 gives desired lower bounds. 

2.1. Hidden Clique Detection 

Recall the hidden clique detection problem described in Section 1.1, where we aim to obtain tight lower bounds for the threshold k_n on the clique size. As discussed, the point is chosen to be the single distribution P for the Erdös-Rényi graph, and the mixture is taken to be \mathop{\mathbb E}_S[Q_S] where S is uniformly distributed on all size-k subsets of [n]. It remains to upper bound \chi^2(\mathop{\mathbb E}_S[Q_S], P), which by Lemma 2 equals 

 \chi^2(\mathop{\mathbb E}_S[Q_S], P) = \mathop{\mathbb E}_{S,S'}\left[ \int \frac{dQ_S dQ_{S'}}{dP} \right] - 1.

By simple algebra, 

 \int \frac{dQ_S dQ_{S'}}{dP} = 2^{\binom{|S\cap S'|}{2}}.

Since the probability of |S\cap S'|=r for r=0,\cdots,k is \binom{k}{r}\binom{n-k}{k-r}/\binom{n}{k}, we conclude that 

 \begin{array}{rcl} \chi^2(\mathop{\mathbb E}_S[Q_S], P) = \sum_{r=0}^k 2^{\binom{r}{2}}\cdot \frac{\binom{k}{r} \binom{n-k}{k-r}}{\binom{n}{k}} - 1 = \sum_{r=2}^k \left(2^{\binom{r}{2}} - 1\right)\cdot \frac{\binom{k}{r} \binom{n-k}{k-r}}{\binom{n}{k}}, \end{array}

where the second identity uses \sum_{r=0}^k \binom{k}{r}\binom{n-k}{k-r}/\binom{n}{k} = 1 and 2^{\binom{r}{2}}=1 for r\in\{0,1\}.

After carefully bounding the binomial coefficients, we conclude that the \chi^2-divergence is O(1) as long as k \le 2\log_2 n - 2\log_2\log_2 n - O(1). Hence, the threshold k_n admits the lower bound k_n \ge 2\log_2 n - 2\log_2\log_2 n - O(1). 

This lower bound is tight. Note that the expected number of size-k cliques in \mathcal{G}(n,1/2) is exactly 

 \binom{n}{k}2^{-\binom{k}{2}}.

Using some algebra and Markov’s inequality, we conclude that if k \ge 2\log_2 n -2\log_2\log_2 n + C for a sufficiently large constant C>0, then with high probability the Erdös-Rényi graph has no size-k clique. Hence, by enumerating size-k cliques in the graph we obtain a matching upper bound, and therefore k_n = 2\log_2 n -2\log_2\log_2 n +O(1). 
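
As a numerical sanity check of the above threshold, the hypergeometric sum for the \chi^2-divergence can be evaluated exactly for moderate n. A minimal Python sketch, where the value n=10^4 and the range of k are illustrative choices:

```python
from math import comb, log2

def chi2_hidden_clique(n, k):
    """chi^2(E_S[Q_S], P) for the planted-clique mixture,
    via the hypergeometric formula derived above."""
    denom = comb(n, k)
    total = sum(2**(r * (r - 1) // 2) * comb(k, r) * comb(n - k, k - r) / denom
                for r in range(k + 1))
    return total - 1

# For n = 10**4 the chi^2-divergence stays of constant order until k is within
# an additive constant of 2*log2(n) - 2*log2(log2(n)), and then blows up rapidly.
n = 10**4
print(2 * log2(n) - 2 * log2(log2(n)))       # roughly 19
for k in range(12, 24):
    print(k, chi2_hidden_clique(n, k))
```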

2.2. Uniformity Testing 

The uniformity testing problem, also known as goodness-of-fit testing in theoretical computer science, is as follows. Given n i.i.d. observations from a discrete distribution P supported on at most k elements, one would like to test between 

 H_0: P = U_k \qquad \text{vs.} \qquad H_1: \|P - U_k\|_1\ge \varepsilon,

where \varepsilon>0 is some prescribed parameter (which may depend on n), and U_k denotes the uniform distribution over [k]. The target is to determine the sample complexity of this problem, i.e., find the smallest sample size n such that there exists a test with total error probability at most 1/3. 

Note that H_1 is a composite hypothesis. If we would like to choose a single distribution from H_1, the most natural choice would be Q = (\frac{1+\varepsilon}{k},\cdots,\frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k},\cdots,\frac{1-\varepsilon}{k}). However, to distinguish between P = U_k and Q, the learner could combine the first and second half of the symbols into two super-symbols, with uniform distribution (\frac{1}{2},\frac{1}{2}) under H_0 and distribution (\frac{1+\varepsilon}{2},\frac{1-\varepsilon}{2}) under H_1. Consequently, O(\varepsilon^{-2}) samples suffice to distinguish between these two cases, so the resulting lower bound loses the dependence on the support size k. As before, the problem with choosing a single hypothesis from H_1 is that the learner then knows exactly which symbols have probability larger than 1/k and can pool them together. 

Hence, we should pick several distributions from H_1 and consider a proper mixture. The solution is to use the following construction due to Paninski: let the random vector v be distributed uniformly on \{\pm 1\}^{k/2} (assuming that k is even), and consider the mixture \mathop{\mathbb E}_v[Q_v^{\otimes n}], with 

 Q_v = \left(\frac{1+v_1\varepsilon}{k}, \frac{1-v_1\varepsilon}{k}, \cdots, \frac{1+v_{k/2}\varepsilon}{k}, \frac{1-v_{k/2}\varepsilon}{k} \right).

Clearly, each Q_v belongs to H_1. Moreover, 

 \int \frac{dQ_v dQ_{v'}}{dP} = \frac{1}{k}\sum_{i=1}^{k/2}\sum_{u\in \{\pm 1\}} (1+uv_i\varepsilon)(1+uv_i'\varepsilon) = 1 + \frac{2\varepsilon^2}{k}\sum_{i=1}^{k/2} v_iv_i'.

Therefore, by Lemma 2 and the discussions thereafter, we have 

 \begin{array}{rcl} \chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n} \right) +1 &=& \mathop{\mathbb E}_{v,v'}\left[ \left(1 + \frac{2\varepsilon^2}{k}\sum_{i=1}^{k/2} v_iv_i'\right)^n \right] \\ &\le & \mathop{\mathbb E}_{v,v'} \exp\left(\frac{2n\varepsilon^2}{k} \sum_{i=1}^{k/2} v_iv_i' \right) \\ &\le & \mathop{\mathbb E}_v \exp\left(\frac{2n^2\varepsilon^4}{k^2} \|v\|_2^2 \right) \\ &=& \exp\left(\frac{n^2\varepsilon^4}{k} \right), \end{array}

where the last inequality follows from the 1-sub-Gaussianity of the vector v', and the final equality uses \|v\|_2^2 = k/2. Hence, as long as n = O(\frac{\sqrt{k}}{\varepsilon^2}), the above \chi^2-divergence is upper bounded by O(1). Therefore, the sample complexity of the uniformity testing problem is at least n = \Omega(\frac{\sqrt{k}}{\varepsilon^2}), which is also tight (see bibliographic notes for details). 
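
Since \sum_{i=1}^{k/2} v_iv_i' is distributed as a sum of k/2 i.i.d. Rademacher variables, the \chi^2-divergence above can also be computed exactly and compared with the bound \exp(n^2\varepsilon^4/k)-1. A minimal Python sketch, where k, \varepsilon and the choice n \approx \sqrt{k}/\varepsilon^2 are illustrative:

```python
from math import comb, exp, sqrt

def chi2_paninski(n, k, eps):
    """Exact chi^2(E_v[Q_v^n], U_k^n) for the Paninski mixture, using the fact
    that sum_i v_i v_i' is a sum of k/2 i.i.d. Rademacher variables."""
    m = k // 2
    return sum(comb(m, j) / 2**m * (1 + 2 * eps**2 / k * (m - 2 * j))**n
               for j in range(m + 1)) - 1

# Illustrative numbers: with n of order sqrt(k)/eps^2, both the exact chi^2
# and the upper bound exp(n^2 eps^4 / k) - 1 are of constant order.
k, eps = 100, 0.1
n = int(sqrt(k) / eps**2)
print(chi2_paninski(n, k, eps), exp(n**2 * eps**4 / k) - 1)
```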

2.3. General Identity Testing 

Next we generalize the uniformity testing problem to the case where the uniform distribution U_k can be replaced by any given probability distribution U. Specifically, given n i.i.d. observations from a discrete distribution P supported on at most k elements, one would like to test between 

 H_0: P = U \qquad \text{vs.} \qquad H_1: \|P - U\|_1\ge \varepsilon,

where \varepsilon>0 is some prescribed parameter, and U=(u_1,\cdots,u_k) is a fixed known probability distribution. Again, the target is to determine the sample complexity of this identity testing problem, in terms of U and \varepsilon. 

For simplicity, we assume that k is an even integer, and that the components of U appear in pairs, i.e., u_1 = u_2, u_3 = u_4, etc. We reparametrize U by U=(u_1,u_1,u_2,u_2,\cdots,u_{k/2},u_{k/2}). We also assume that u_i\ge \varepsilon^3/8 for all i; the general case requires proper trimming on U and is deferred to the bibliographic notes. As in the previous Paninski construction, we again consider a uniform mixture in H_1, but with different perturbation sizes. Specifically, let the random vector v be distributed uniformly on \{\pm 1\}^{k/2}, and consider the mixture \mathop{\mathbb E}_v[Q_v^{\otimes n}], with 

 Q_v = \left(u_1 + v_1\varepsilon_1, u_1 - v_1\varepsilon_1, \cdots, u_{k/2}+v_{k/2}\varepsilon_{k/2}, u_{k/2} - v_{k/2}\varepsilon_{k/2} \right).

Here the parameters \varepsilon_1,\cdots,\varepsilon_{k/2}>0 will be specified later. Note that Q_v is a valid probability distribution if \varepsilon_i \le u_i for all i, and the assumption \|Q_v - U\|_1 \ge \varepsilon requires that \sum_{i=1}^{k/2} \varepsilon_i \ge \varepsilon/2. As before, 

 \chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n} \right) +1 \le \mathop{\mathbb E}_{v,v'} \exp\left( 2n\sum_{i=1}^{k/2} \frac{\varepsilon_i^2}{u_i} v_iv_i' \right) \le \exp\left(2n^2\sum_{i=1}^{k/2} \frac{\varepsilon_i^4}{u_i^2} \right),

where again the last step follows from the sub-Gaussianity. Since our target is to find the largest possible n such that the \chi^2-divergence is O(1), we aim to find \varepsilon_i’s to minimize the quantity \sum_{i=1}^{k/2} \varepsilon_i^4 / u_i^2 subject to the above constraints. To do so, we recall the following elementary inequality. 

Lemma 3 (Hölder’s Inequality) Let a_{i,j} be non-negative reals for all i\in [m], j\in [n]. Then

 \prod_{i=1}^m \left(\sum_{j=1}^n a_{i,j} \right)^{\frac{1}{m}} \ge \sum_{j=1}^n \left(\prod_{i=1}^m a_{i,j} \right)^{\frac{1}{m}}.

Proof: By homogeneity we may assume that \sum_{j=1}^n a_{i,j}=1 for all i\in [m]. Then the desired inequality follows from the AM-GM inequality 

 \left(\prod_{i=1}^m a_{i,j} \right)^{\frac{1}{m}} \le \frac{1}{m}\sum_{i=1}^m a_{i,j}.  \Box

By Lemma 3 (applied with m=4, taking a_{1,i} = \varepsilon_i^4/u_i^2 and a_{2,i}=a_{3,i}=a_{4,i}=u_i^{2/3}), we have the following inequality: 

 \left(\sum_{i=1}^{k/2} \frac{\varepsilon_i^4}{u_i^2} \right) \left(\sum_{i=1}^{k/2} u_i^{2/3}\right)^3 \ge \left(\sum_{i=1}^{k/2} \varepsilon_i\right)^4 \ge \frac{\varepsilon^4}{16},

which gives the lower bound \sum_{i=1}^{k/2} \varepsilon_i^4/u_i^2 \ge \varepsilon^4/(16\|U\|_{2/3}^2), with equality when

 \varepsilon_i = \varepsilon u_i^{2/3}/(2\|U\|_{2/3}^{2/3}).

It is straightforward to check that \varepsilon_i \le u_i for all i thanks to the assumption u_i\ge \varepsilon^3/8. Consequently, we arrive at the lower bound 

 n = \Omega\left(\frac{\|U\|_{2/3}}{\varepsilon^2} \right)

for the sample complexity, which is tight in general. Note how the mysterious (2/3)-norm emerges from a careful choice of the mixture.
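
The optimization above is straightforward to carry out numerically. The following Python sketch (for an arbitrary paired distribution U chosen only for illustration, with the trimming of very small u_i ignored, and the function name purely illustrative) computes the optimal perturbations \varepsilon_i and the resulting lower bound of order \|U\|_{2/3}/\varepsilon^2:

```python
import numpy as np

def identity_testing_lower_bound(U, eps):
    """Sketch of the construction above for a paired distribution U (so that
    u_1 = u_2, u_3 = u_4, ...): compute the optimal perturbations eps_i and
    the sample-complexity lower bound of order ||U||_{2/3} / eps^2.
    The trimming of very small u_i is ignored in this sketch."""
    u = np.asarray(U)[::2]                    # one representative u_i per pair
    B = np.sum(u ** (2 / 3))                  # so that ||U||_{2/3} ~ B^{3/2}
    eps_i = eps * u ** (2 / 3) / (2 * B)      # optimal choice derived above
    assert np.all(eps_i <= u), "requires u_i >= eps^3/8 after trimming"
    assert np.isclose(eps_i.sum(), eps / 2)   # ensures ||Q_v - U||_1 >= eps
    objective = np.sum(eps_i**4 / u**2)       # equals eps^4 / (16 B^3)
    return B ** 1.5 / eps**2, objective

# Example with an (arbitrary, paired) power-law distribution on k symbols:
k, eps = 100, 0.1
w = 1.0 / np.arange(1, k // 2 + 1)
w = np.repeat(w / (2 * w.sum()), 2)           # paired probabilities, sum to 1
print(identity_testing_lower_bound(w, eps))
```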

3. Examples of Estimation Problems 

In this section, we will look at several estimation problems and show how Theorem 1 provides the tight lower bound for these problems. We will see that the choice of the mixture becomes natural after we understand the problem. 

3.1. Linear Functional Estimation of Sparse Gaussian Mean 

Consider a Gaussian location model X\sim \mathcal{N}(\mu, \sigma^2I_p), where we assume that the mean vector \mu is s-sparse, i.e., \|\mu\|_0\le s. The target is to estimate the linear functional of \mu, i.e., L(\mu):= \sum_{i=1}^p \mu_i, and we aim to prove a tight lower bound on the minimax risk R_{s,p,\sigma}^\star under the quadratic loss. 

The key feature of this problem is the dependence of the minimax risk on the sparsity parameter s. If one knew the support of \mu, summing X_i over i\in \text{supp}(\mu) would give a natural estimator with risk O(s\sigma^2). However, since the support of \mu is unknown, we will show that R_{s,p,\sigma}^\star = \Omega(s^2\sigma^2\log p) for small s, where the additional factor s\log p is essentially the logarithm of \binom{p}{s}, the number of possible supports of \mu. 

To incorporate the effect of unknown support into the lower bound, a natural choice would be to consider a random mean vector with support uniformly distributed on all size-s subsets of [p]. Formally, let P=\mathcal{N}(0,\sigma^2I_p), and Q_S = \mathcal{N}(\rho 1_S, \sigma^2 I_p) with 1_S being the indicator vector of the subset S\subseteq [p], and \rho>0 is a parameter to be specified. Moreover, we let S be uniformly distributed on all possible size-s subsets of [p]. Clearly, the separability condition in Theorem 1 is satisfied with \Delta = s^2\rho^2/2. To upper bound the \chi^2-divergence, we again compute 

 \int \frac{dQ_S dQ_{S'}}{dP} = \exp\left(\frac{\rho^2}{\sigma^2} |S\cap S'|\right),

and consequently 

 \chi^2\left(\mathop{\mathbb E}_S[Q_S], P \right) = \mathop{\mathbb E}_{S,S'} \exp\left(\frac{\rho^2}{\sigma^2} |S\cap S'|\right) - 1.

The random variable |S\cap S'| follows a hypergeometric distribution with pmf \mathop{\mathbb P}(|S\cap S'| = r) = \binom{s}{r}\binom{p-s}{s-r} / \binom{p}{s}. It is well known in probability theory that a hypergeometric distribution corresponds to sampling without replacement, while the binomial distribution \mathsf{B}(s,s/p) corresponds to sampling with replacement. The next lemma shows a relationship between these two random variables. 

Lemma 4 Let \{x_1,\cdots,x_M\} be an urn, X_1,\cdots,X_n be samples uniformly drawn from the urn without replacement, and X_1^\star, \cdots, X_n^\star be samples uniformly drawn from the urn with replacement. Define S_n = \sum_{i=1}^n X_i and S_n^\star = \sum_{i=1}^n X_i^\star; then there exists a coupling of (S_n, S_n^\star) such that \mathop{\mathbb E}[S_n^\star|S_n] = S_n. 

Proof: The coupling is defined as follows: for any distinct y_1,\cdots,y_n \in \{x_1,\cdots,x_M\} and not necessarily distinct z_1,\cdots,z_n\in \{y_1,\cdots,y_n\}, let the conditional distribution be 

 \mathop{\mathbb P}(X_1^\star = z_1, \cdots, X_n^\star = z_n | X_1 = y_1, \cdots, X_n = y_n) = \frac{\binom{M}{n}}{M^n\binom{M-L}{n-L}},

where L denotes the number of distinct elements in z_1,\cdots,z_n. To verify that this is a valid conditional probability measure, recall that the number of ways of drawing n labeled elements with L distinct elements from an urn of n elements is S(n,L)(n)_L, where S(n,L) is the Stirling’s number of the second kind, and (n)_L := n! / (n-L)! is the falling factorial. Hence, the probability is valid due to the identity 

 \sum_{L=1}^n \frac{\binom{M}{n}}{M^n\binom{M-L}{n-L}}\cdot S(n,L)(n)_L = \frac{1}{M^n}\sum_{L=1}^n S(n,L)(M)_L = 1,

where the last step is due to the fact that \sum_{L=1}^n S(n,L)(M)_L counts the number of ways of drawing n labeled elements from an urn of M elements, and therefore equals M^n. 

From this conditional probability, it is straightforward to verify that the unconditional distribution of (X_1^\star, \cdots, X_n^\star) coincides with sampling with replacement. Moreover, 

 \sum_{i=1}^n \mathop{\mathbb P}(X_i^\star = y| X_1 = y_1, \cdots, X_n = y_n) = 1(y\in \{y_1,\cdots,y_n\})

follows from symmetry, and therefore \mathop{\mathbb E}[S_n^\star|S_n] = S_n.  \Box 

By Lemma 4, there exists some \sigma-algebra \mathcal{B} such that |S\cap S'| has the same distribution as \mathop{\mathbb E}[Z|\mathcal{B}] with Z\sim \mathsf{B}(s,s/p). Now by Jensen’s inequality, 

 \chi^2\left(\mathop{\mathbb E}_S[Q_S], P \right) \le \mathop{\mathbb E} \exp(\rho^2 Z/\sigma^2 ) - 1 = \left(1 - \frac{s}{p} + \frac{s}{p}\exp(\rho^2/\sigma^2) \right)^s - 1.

Hence, the largest \rho^2 to ensure a constant \chi^2-divergence is \rho^2 = \sigma^2\log(1+p/s^2), which by Theorem 1 gives 

 R_{s,p,\sigma}^\star = \Omega\left(\sigma^2 s^2 \log\left(1+ \frac{p}{s^2}\right)\right),

which has a phase transition at s\sim \sqrt{p}. In fact, this lower bound is also tight: if s \ge \sqrt{p}, the natural estimator \sum_{i=1}^p X_i is rate-optimal; if s<\sqrt{p}, the thresholding estimator \sum_{i=1}^p X_i1(|X_i|\ge t) is rate-optimal for some proper threshold t. 
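
The Gaussian pairing identity \int dQ_S dQ_{S'}/dP = \exp(\rho^2|S\cap S'|/\sigma^2) used above is also easy to check by Monte Carlo. A minimal Python sketch, where the dimension, the supports S, S', and the values of \rho,\sigma are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of the Gaussian pairing formula used above:
#   int dQ_S dQ_{S'} / dP = exp(rho^2 |S cap S'| / sigma^2),
# with P = N(0, sigma^2 I_p) and Q_S = N(rho 1_S, sigma^2 I_p).
p, sigma, rho = 8, 1.0, 0.5
ind_S  = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)   # indicator of S
ind_Sp = np.array([0, 1, 1, 1, 1, 0, 0, 0], dtype=float)   # indicator of S'

X = sigma * rng.standard_normal((200_000, p))               # samples from P

def lr(X, ind):
    # likelihood ratio dQ_S/dP(x) = exp((rho/sigma^2) <x, 1_S> - rho^2 |S|/(2 sigma^2))
    return np.exp(rho / sigma**2 * (X @ ind) - rho**2 * ind.sum() / (2 * sigma**2))

print(np.mean(lr(X, ind_S) * lr(X, ind_Sp)),                # Monte Carlo estimate
      np.exp(rho**2 * ind_S @ ind_Sp / sigma**2))           # closed form
```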

3.2. Largest Eigenvalue of Covariance Matrix 

Consider i.i.d. observations X_1,\cdots,X_n\sim \mathcal{N}(0,\Sigma), where \Sigma\in {\mathbb R}^{p\times p} is the covariance matrix. Assuming \|\Sigma\|_{\text{op}}\le 1, a classical matrix concentration inequality shows that the empirical covariance matrix \hat{\Sigma} := n^{-1}\sum_{i=1}^n X_iX_i^\top satisfies 

 \mathop{\mathbb E}\|\hat{\Sigma} - \Sigma\|_{\text{op}} = O\left(\sqrt{\frac{p}{n}}\right).

We show that the \sqrt{p/n} bound is rate-optimal. In fact, we may even show a stronger result: the minimax risk R_{n,p}^\star in estimating \|\Sigma\|_{\text{op}} under the absolute value loss is at least 

 R_{n,p}^\star = \Omega\left( 1 \wedge \sqrt{\frac{p}{n}}\right).

Note that by triangle inequality, estimating the largest eigenvalue is indeed an easier problem than covariance estimation. 

As before, a natural idea to establish the lower bound is to choose a uniform direction for the principal eigenvector, so that the learner must distribute her effort evenly over all directions. Specifically, we choose P = \mathcal{N}(0, aI_p), and Q_v = \mathcal{N}(0,aI_p + bvv^\top), where a,b>0 are some parameters to be chosen later, and v is uniformly distributed on the unit sphere \{v\in {\mathbb R}^p: \|v\|_2 =1 \}. It is easy to verify that \|\Sigma\|_{\text{op}}\le 1 is fulfilled as long as a+b\le 1, and that Theorem 1 applies with \Delta = b. Moreover, 

 \int \frac{dQ_vdQ_{v'}}{dP} = \left( 1 - \frac{b^2}{a^2}(v^\top v')^2 \right)^{-\frac{1}{2}}.

Hence, by Lemma 2, 

 \chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n}\right) +1 = \mathop{\mathbb E}_{v,v'} \left( 1 - \frac{b^2}{a^2}(v^\top v')^2 \right)^{-\frac{n}{2}} \le \mathop{\mathbb E}_{v,v'} \exp\left(\frac{nb^2}{a^2}(v^\top v')^2 \right)

assuming b/a\le 1/2, where the last inequality follows from 1-x \ge \exp(-2x) whenever 0\le x\le 1/4. Computing the surface area of spherical caps gives

 \mathop{\mathbb P}(|v^\top v'| \ge t) = \frac{1}{\text{Beta}(\frac{p-1}{2}, \frac{1}{2})}\int_0^{1-t^2} x^{\frac{p-3}{2}}(1-x)^{-\frac{1}{2}}dx = \exp(-\Omega(pt^2)).

Consequently, to make \chi^2\left(\mathop{\mathbb E}_v[Q_v^{\otimes n}], P^{\otimes n}\right) = O(1), we may take 

 a = \frac{1}{2}, \qquad b = c\left(1 \wedge \sqrt{\frac{p}{n}}\right)

for a sufficiently small constant c>0, which gives the desired minimax lower bound R_{n,p}^\star = \Omega\left( 1 \wedge \sqrt{p/n}\right).
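
The identity \int dQ_vdQ_{v'}/dP = (1-(b/a)^2(v^\top v')^2)^{-1/2} can also be verified numerically from the general formula for zero-mean Gaussians obtained by completing the square. A minimal Python sketch with illustrative values of p, a, b (chosen so that b/a\le 1/2):

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_pairing(S1, S2, S0):
    """int dN(0,S1) dN(0,S2) / dN(0,S0), computed from the closed form
    det(S0)^{1/2} det(S1)^{-1/2} det(S2)^{-1/2} det(S1^{-1}+S2^{-1}-S0^{-1})^{-1/2}
    (obtained by completing the square; requires the last matrix to be PD)."""
    M = np.linalg.inv(S1) + np.linalg.inv(S2) - np.linalg.inv(S0)
    return np.sqrt(np.linalg.det(S0) /
                   (np.linalg.det(S1) * np.linalg.det(S2) * np.linalg.det(M)))

# Check the identity used above with illustrative values of p, a, b.
p, a, b = 4, 0.5, 0.2
v  = rng.standard_normal(p); v  /= np.linalg.norm(v)
vp = rng.standard_normal(p); vp /= np.linalg.norm(vp)
lhs = gaussian_pairing(a * np.eye(p) + b * np.outer(v, v),
                       a * np.eye(p) + b * np.outer(vp, vp),
                       a * np.eye(p))
rhs = (1 - (b / a)**2 * np.dot(v, vp)**2) ** (-0.5)
print(lhs, rhs)                                  # the two values coincide
```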

3.3. Quadratic Functional Estimation 

Finally we switch to a non-parametric example. Let X_1,\cdots,X_n be i.i.d. observations drawn from a density f, where f is supported on [0,1] and belongs to the Hölder ball with smoothness parameter s>0

 f\in \mathcal{H}^s(L) := \left\{f\in C[0,1]: \sup_{x\neq y} \frac{|f^{(m)}(x) - f^{(m)}(y)| }{|x-y|^\alpha}\le L \right\},

with s = m+\alpha, m\in \mathbb{N}, \alpha\in (0,1]. The target is to estimate the quadratic functional 

 Q(f) := \int_0^1 f(x)^2 dx,

and we let R_{n,s}^\star be the corresponding minimax risk under the absolute value loss. The central claim is that 

 R_{n,s}^\star = \Omega\left( n^{-\frac{4s}{4s+1}} + n^{-\frac{1}{2}}\right),

which has an “elbow effect” at s=1/4. 

A central tool for proving nonparametric lower bounds is via parametric reduction, i.e., restricting f to a parametric subfamily such that the resulting subproblem is essentially as hard as the original nonparametric problem. In this example, a natural candidate is 

 f_v(x) = 1 + c_0L\cdot \sum_{i=1}^{h^{-1}/2} v_ih^s\left[ g\left(\frac{x-(2i-2)h}{h} \right) - g\left(\frac{x-(2i-1)h}{h} \right) \right],

where h\in (0,1) is the bandwidth to be chosen later, the vector v satisfies \|v\|_\infty\le 1, and g is a fixed smooth function on {\mathbb R} vanishing outside [0,1]. It is a simple exercise to show that for c_0>0 sufficiently small, each f_v is a valid density on [0,1] and belongs to the Hölder ball \mathcal{H}^s(L). Intuitively, the density f_v is a perturbation of the uniform density with bumps modulated by the vector v. Moreover, 

 Q(f_v) = 1 + (c_0L\|g\|_2)^2 h^{2s+1} \cdot\|v\|_2^2,

and therefore the target of the subproblem is to estimate \|v\|_2^2 based on X_1,\cdots,X_n\sim f_v. Again, the idea of Theorem 1 can be applied. 

Let v be uniformly distributed on \{\pm 1\}^{1/(2h)}. By simple calculation, 

 \int_0^1 \frac{f_v(x)f_{v'}(x)}{f_0(x)}dx = 1 + (c_0L\|g\|_2)^2h^{2s+1}\cdot v^\top v'.

Consequently, Lemma 2 gives 

 \begin{array}{rcl} \chi^2\left(\mathop{\mathbb E}_v[f_v^{\otimes n}], f_0^{\otimes n} \right) + 1 &=& \mathop{\mathbb E}_{v,v'} \left(1 + (c_0L\|g\|_2)^2h^{2s+1}\cdot v^\top v' \right)^n \\ &\le & \mathop{\mathbb E}_{v,v'} \exp\left( (c_0L\|g\|_2)^2nh^{2s+1}\cdot v^\top v'\right) \\ &\le& \exp\left(\frac{1}{2}(c_0L\|g\|_2)^4n^2h^{4s+2}\cdot \frac{1}{2h} \right). \end{array}

Hence, choosing h = \Theta(n^{-\frac{2}{4s+1}}) gives \chi^2\left(\mathop{\mathbb E}_v[f_v^{\otimes n}], f_0^{\otimes n} \right) = O(1). Moreover, Theorem 1 is fulfilled with \Delta = \Theta(h^{2s}) = \Theta(n^{-\frac{4s}{4s+1}}), and therefore we have proved that 

 R_{n,s}^\star = \Omega(n^{-\frac{4s}{4s+1}}).

The remaining parametric lower bound \Omega(n^{-\frac{1}{2}}) is trivial (by the tools in Lecture 4). 
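
As a numerical sanity check of the bump construction, one can verify that each f_v integrates to one and that Q(f_v)-1 is proportional to h^{2s+1}\|v\|_2^2, with a proportionality constant that does not depend on v. A minimal Python sketch, where the bump g and the values of s, L, c_0, h are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters: s = 1, L = 1, a small c_0, and m bump pairs of width h.
s, L, c0 = 1.0, 1.0, 0.1
m = 10                         # number of +/- bump pairs
h = 1.0 / (2 * m)              # bandwidth, so the 2m bumps tile [0, 1]

def g(t):
    # a standard smooth bump supported on [0, 1]
    out = np.zeros_like(t)
    inside = (t > 0) & (t < 1)
    ti = t[inside]
    out[inside] = np.exp(-1.0 / (ti * (1.0 - ti)))
    return out

x = np.linspace(0.0, 1.0, 100001)

def f_v(v):
    out = np.ones_like(x)
    for i, vi in enumerate(v, start=1):
        out += c0 * L * vi * h**s * (g((x - (2 * i - 2) * h) / h)
                                     - g((x - (2 * i - 1) * h) / h))
    return out

for _ in range(3):
    v = rng.choice([-1.0, 1.0], size=m)
    fv = f_v(v)
    mass = fv.mean()                   # ~ int_0^1 f_v(x) dx, should be ~ 1
    Q = (fv**2).mean()                 # ~ Q(f_v)
    print(mass, (Q - 1) / (h**(2 * s + 1) * (v @ v)))   # ratio is constant in v
```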

4. Bibliographic Notes 

A wonderful book which treats the point-mixture idea of Theorem 1 systematically and contains the method in Lemma 2 is Ingster and Suslina (2012). The sharp bounds on the clique size in the hidden clique problem are due to Matula (1972). It is a long-standing open problem whether there exists a polynomial-time algorithm attaining this statistical limit, and we refer to the lecture notes Lugosi (2017) for details. The sample complexity of the uniformity testing problem is due to Paninski (2008), and the general identity testing problem is studied in Valiant and Valiant (2017). There are also some other generalizations or tools for uniformity testing, e.g., Diakonikolas and Kane (2016), Diakonikolas, Kane and Stewart (2018), Blais, Canonne and Gur (2019). 

For the estimation problems, the linear functional estimation for the sparse Gaussian mean is due to Collier, Comminges and Tsybakov (2017), where Lemma 4 can be found in Aldous (1985). The largest eigenvalue example is taken from the lecture notes Wu (2017). The nonparametric functional estimation problem is widely studied in statistics, where the minimax rate for estimating the quadratic functional can be found in Bickel and Ritov (1988), and it is \Theta(n^{-\frac{4s}{4s+d}} + n^{-\frac{1}{2}}) for general dimension d. The same minimax rate (and the elbow effect) actually holds for all smooth functionals; see Birgé and Massart (1995), Kerkyacharian and Picard (1996) and Robins et al. (2017). 

  1. Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012. 
  2. David W. Matula. Employee party problem. Notices of the American Mathematical Society. Vol. 19, No. 2, 1972. 
  3. Gábor Lugosi. Lectures on Combinatorial Statistics. 2017. 
  4. Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory 54.10 (2008): 4750–4755. 
  5. Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing 46.1 (2017): 429–455. 
  6. Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2016. 
  7. Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018. 
  8. Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory (TOCT) 11.2 (2019): 6. 
  9. Olivier Collier, Laëtitia Comminges, and Alexandre B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics 45.3 (2017): 923–958. 
  10. David J. Aldous. Exchangeability and related topics. Ecole d’Eté de Probabilités de Saint-Flour XIII-1983. Springer, Berlin, Heidelberg, 1985. 1–198. 
  11. Yihong Wu. Lecture notes for ECE598YW: Information-theoretic methods for high-dimensional statistics. 2017. 
  12. Peter J. Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya: The Indian Journal of Statistics, Series A (1988): 381–393. 
  13. Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. The Annals of Statistics 23.1 (1995): 11–29. 
  14. Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics 24.2 (1996): 485-507. 
  15. James Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and Aad van der Vaart. Higher order estimating equations for high-dimensional models. Annals of statistics 45, no. 5 (2017): 1951. 
