(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In the last lecture we introduced Le Cam’s two-point method to establish information-theoretic lower bounds, where the target is to find two single hypotheses which are well separated yet indistinguishable. However, in many scenarios restricting to two single hypotheses gives the learner a lot of information, and any pair of single hypotheses which are far apart becomes distinguishable. From another point of view, if we would like to mimic the Bayesian idea of using the least favorable prior, it is very unlikely that the least favorable prior is supported on only two points. Hence, in this lecture we generalize the two-point idea to the case where one point is a mixture of distributions, which works well in various examples.
1. General Tools
1.1. A Motivating Example
We first discuss an example where the simple two-point method fails. Consider the following hidden clique detection problem: there is an underlying undirected simple graph on the vertex set $[n]=\{1,\cdots,n\}$, which either comes from an Erdős–Rényi graph $\mathcal{G}(n,1/2)$ (where each edge appears with probability $1/2$ independently), or there is a clique on some $k$ vertices of $[n]$ and all other edges appear with probability $1/2$ independently. Recall that a clique is a complete graph where every two vertices are connected by an edge. The target is to reliably distinguish between these two cases, i.e., a vanilla Erdős–Rényi graph versus one planted with a hidden clique of size $k$.
This task is a composite hypothesis testing problem, where the null $H_0$ consists of the single hypothesis $G\sim\mathcal{G}(n,1/2)$, while the alternative $H_1$ consists of a family of hypotheses depending on the precise location of the hidden clique. Of course, an Erdős–Rényi graph is unlikely to have large cliques, thus this task becomes easier when $k$ is large. Let $k^\star = k^\star(n)$ be the threshold such that the total error probability of the optimal test is at most $1/3$ when $k\ge k^\star$ and is at least $1/3$ when $k<k^\star$.
We first try to use Le Cam’s two-point method to find a lower bound on $k^\star$. Let $P_0$ be the distribution of $\mathcal{G}(n,1/2)$, and $P_S$ be the distribution of the graph where there is a planted clique on the vertex set $S\subseteq[n]$ with $|S|=k$. To compute the total variation distance between $P_0$ and $P_S$, note that with probability $2^{-\binom{k}{2}}$ the Erdős–Rényi graph already has a clique supported on $S$. Hence,
$$\|P_0 - P_S\|_{\rm TV} \le 1 - 2^{-\binom{k}{2}}$$
holds for any $S$. As a result, the two-point method only certifies that the testing error is bounded away from zero when $\binom{k}{2}=O(1)$, i.e., it only gives $k^\star = \Omega(1)$, which is not an informative bound.
The problem with the above two-point approach is that, if the learner is told the exact location of the clique, she can simply look at the subgraph supported on that location and tell whether a clique is present. In other words, any single hypothesis from $H_1$ gives too much information to the learner. To overcome this difficulty, a natural idea is to take the support $S$ of the clique uniformly distributed on all size-$k$ subsets of $[n]$, which is expected to result in a graph looking more similar to the Erdős–Rényi graph. This is exactly the central topic of this lecture.
1.2. Generalized Le Cam’s Method: Point vs. Mixture
As shown in the previous example, there is a need to generalize Le Cam’s two-point method to the scenario where one of the points is a mixture of distributions. The theoretical foundation of this generalization is in fact quite simple, and the proof of the next theorem parallels that of Theorem 7 in Lecture 5.
Theorem 1 (Point vs. Mixture) Let $L: \Theta\times\mathcal{A}\rightarrow\mathbb{R}_+$ be any loss function, and suppose there exist $\theta_0\in\Theta$, $\Delta\ge 0$ and $\Theta_1\subseteq\Theta$ such that
$$\inf_{a\in\mathcal{A}} \big( L(\theta_0,a) + L(\theta_1,a) \big) \ge \Delta \qquad \text{for all } \theta_1\in\Theta_1,$$
then for any probability measure $\mu$ supported on $\Theta_1$, we have
$$\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathbb{E}_\theta\big[ L(\theta,\hat{\theta}) \big] \ge \frac{\Delta}{2}\left(1 - \Big\| P_{\theta_0} - \int P_{\theta_1}\,\mu(d\theta_1) \Big\|_{\rm TV} \right).$$
The main difficulties in the practical application of Theorem 1 lie in two aspects. First, it may be non-trivial to find a proper mixture $\mu$ and its support $\Theta_1$. Second, upper bounding the total variation distance $\| P_{\theta_0} - \int P_{\theta_1}\,\mu(d\theta_1)\|_{\rm TV}$ may be hard, especially when the individual distributions take the form of tensor products $P_{\theta}^{\otimes n}$, so that the mixture no longer has a product structure. The next section proposes a general tool to overcome the second difficulty, while we use various examples to illustrate the art of handling the first one.
1.3. Upper Bounding the Divergence
To upper bound the divergence between a single distribution and a mixture of distributions, the next lemma shows that the $\chi^2$-divergence is a proper choice (recall that a $\chi^2$-divergence bound implies a total variation bound via $\|P-Q\|_{\rm TV}\le \frac{1}{2}\sqrt{\chi^2(P,Q)}$).
Lemma 2 Let $(P_\theta)_{\theta\in\Theta}$ be a family of probability distributions parametrized by $\theta\in\Theta$, and $Q$ be any fixed probability distribution. Then for any probability measure $\mu$ supported on $\Theta$, we have
$$\chi^2\left( \int P_\theta\,\mu(d\theta),\ Q \right) = \mathbb{E}\left[ \int \frac{dP_\theta\, dP_{\theta'}}{dQ} \right] - 1,$$
where the expectation is taken with respect to independent $\theta,\theta'\sim\mu$.
Proof: Follows from the definition and simple algebra.
The advantages of Lemma 2 are two-fold. First, it factorizes well for tensor products:
$$\int \frac{dP_\theta^{\otimes n}\, dP_{\theta'}^{\otimes n}}{dQ^{\otimes n}} = \left( \int \frac{dP_\theta\, dP_{\theta'}}{dQ} \right)^{n}.$$
This identity makes it more convenient to use the $\chi^2$-divergence to deal with the point-mixture scenario. Second, for many statistical models (and a proper choice of $Q$), the quantity $\int \frac{dP_\theta\, dP_{\theta'}}{dQ}$ admits a simple closed-form expression in $(\theta,\theta')$. We will see several examples in the rest of this lecture.
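As a quick sanity check of Lemma 2 (a numerical sketch of my own, not part of the original notes), consider Gaussian location models $P_\theta = \mathcal{N}(\theta,1)$ and $Q=\mathcal{N}(0,1)$, for which a direct computation gives $\int \frac{dP_\theta\, dP_{\theta'}}{dQ} = e^{\theta\theta'}$. The snippet below compares the right-hand side of Lemma 2 with a brute-force numerical integration of the $\chi^2$-divergence for a finitely supported mixing measure $\mu$; the atoms and weights are arbitrary test values.

```python
# Numerical check of Lemma 2 for Gaussian location models (illustrative sketch).
import numpy as np

def chi2_mixture_vs_standard_normal(thetas, weights):
    """Brute-force chi^2( sum_i w_i N(theta_i, 1), N(0, 1) ) by numerical integration."""
    x = np.linspace(-12.0, 12.0, 200_001)
    dx = x[1] - x[0]
    q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)                     # density of Q = N(0,1)
    p_mix = sum(w * np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)
                for w, t in zip(weights, thetas))                  # mixture density
    return np.sum(p_mix**2 / q) * dx - 1.0

thetas  = np.array([-1.0, 0.5, 2.0])      # atoms of mu (test values)
weights = np.array([0.3, 0.5, 0.2])       # weights of mu

lhs = chi2_mixture_vs_standard_normal(thetas, weights)
# Lemma 2: chi^2 = E[exp(theta * theta')] - 1 over independent theta, theta' ~ mu
rhs = weights @ np.exp(np.outer(thetas, thetas)) @ weights - 1.0
print(lhs, rhs)                           # the two numbers agree up to discretization error
```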
2. Examples of Testing Problems
In this section, we provide examples of composite hypothesis testing problems where the null hypothesis $H_0$ consists of a single probability distribution, while the alternative $H_1$ consists of a family of distributions. We will see that choosing a proper mixture and applying the tools in Lemma 2 gives the desired lower bounds.
2.1. Hidden Clique Detection
Recall the hidden clique detection problem described in Section 1.1, where we aim to obtain a tight lower bound for the threshold $k^\star$ on the clique size. As discussed, the point is chosen to be the single distribution $P_0$ of the Erdős–Rényi graph $\mathcal{G}(n,1/2)$, and the mixture is taken to be $\mathbb{E}_S[P_S]$, where $P_S$ is the distribution of the graph with a planted clique on $S$, and $S$ is uniformly distributed on all size-$k$ subsets of $[n]$. It remains to upper bound $\chi^2(\mathbb{E}_S[P_S], P_0)$, which by Lemma 2 equals
$$\mathbb{E}_{S,S'}\left[ \int \frac{dP_S\, dP_{S'}}{dP_0} \right] - 1.$$
By simple algebra,
$$\int \frac{dP_S\, dP_{S'}}{dP_0} = 2^{\binom{|S\cap S'|}{2}}.$$
Since the probability of $|S\cap S'|=r$ for independent uniform $S,S'$ is $\binom{k}{r}\binom{n-k}{k-r}/\binom{n}{k}$, we conclude that
$$\chi^2\big( \mathbb{E}_S[P_S],\ P_0 \big) = \sum_{r=0}^{k} \frac{\binom{k}{r}\binom{n-k}{k-r}}{\binom{n}{k}}\, 2^{\binom{r}{2}} - 1.$$
After carefully bounding the binomial coefficients, we conclude that the $\chi^2$-divergence is $o(1)$ as long as $k \le (2-\varepsilon)\log_2 n$ for any constant $\varepsilon>0$. Hence, the threshold $k^\star$ has the lower bound $k^\star \ge (2-o(1))\log_2 n$.
This lower bound is tight. Note that the expected number of size-$k$ cliques in $\mathcal{G}(n,1/2)$ is exactly $\binom{n}{k}2^{-\binom{k}{2}}$; using some algebra and Markov’s inequality, we conclude that if $k\ge (2+\varepsilon)\log_2 n$, with high probability the Erdős–Rényi graph has no size-$k$ cliques. Hence, by enumerating size-$k$ cliques in the graph, we have $k^\star \le (2+o(1))\log_2 n$.
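To see the $2\log_2 n$ threshold numerically, the exact $\chi^2$-divergence formula above can be evaluated directly for moderate $n$ and $k$. The following sketch (my own illustration, not from the notes; the values of $n$ and $k$ are arbitrary) shows that the divergence stays small below the threshold and blows up above it.

```python
# Exact chi^2 divergence for the planted clique mixture (illustrative sketch).
from math import comb, log2

def chi2_planted_clique(n, k):
    """chi^2( E_S[P_S], P_0 ) = sum_r P(|S ∩ S'| = r) * 2^{C(r,2)} - 1."""
    total = 0.0
    for r in range(k + 1):
        p_r = comb(k, r) * comb(n - k, k - r) / comb(n, k)   # hypergeometric pmf
        total += p_r * 2 ** comb(r, 2)
    return total - 1.0

n = 10_000
print("2 * log2(n) =", 2 * log2(n))          # predicted threshold, about 26.6
for k in (15, 20, 25, 30, 35):
    print(k, chi2_planted_clique(n, k))      # small below the threshold, huge above it
```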
2.2. Uniformity Testing
The uniformity testing problem, also known as goodness-of-fit testing in theoretical computer science, is as follows. Given $n$ i.i.d. observations from a discrete distribution $P$ supported on at most $k$ elements, one would like to test between
$$H_0: P = U_k \qquad \text{versus} \qquad H_1: \|P - U_k\|_1 \ge \varepsilon,$$
where $\varepsilon\in(0,1)$ is some prescribed parameter (which may depend on $k$), and $U_k$ denotes the uniform distribution over $[k]$. The target is to determine the sample complexity of this problem, i.e., to find the smallest sample size $n$ such that there exists a test with total error probability at most $1/3$.
Note that $H_1$ is a composite hypothesis. If we would like to choose a single distribution from $H_1$, the most natural choice would be
$$P_1 = \left( \frac{1+\varepsilon}{k},\cdots,\frac{1+\varepsilon}{k},\ \frac{1-\varepsilon}{k},\cdots,\frac{1-\varepsilon}{k} \right),$$
which adds mass $\varepsilon/k$ to each of the first $k/2$ symbols and removes it from the remaining ones. For the learner to distinguish between $H_0$ and $P = P_1$, she could combine the first and second halves of the symbols into two super-symbols, with uniform distribution $(\frac{1}{2},\frac{1}{2})$ under $H_0$, and distribution $(\frac{1+\varepsilon}{2},\frac{1-\varepsilon}{2})$ under $P_1$. Consequently, $O(\varepsilon^{-2})$ samples are sufficient to distinguish between these two cases, which loses the dependence on the support size $k$. As before, the problem of choosing a single hypothesis in $H_1$ is that the learner then knows the exact locations of the symbols with probability larger than $1/k$ and can combine them.
Hence, we should pick several distributions from $H_1$ and consider a proper mixture. The solution is to use the following Paninski’s construction: let the random vector $\theta=(\theta_1,\cdots,\theta_{k/2})$ be distributed uniformly on $\{\pm 1\}^{k/2}$ (assuming that $k$ is even), and consider the mixture $\mathbb{E}_\theta[P_\theta^{\otimes n}]$, with
$$P_\theta = \left( \frac{1+\varepsilon\theta_1}{k}, \frac{1-\varepsilon\theta_1}{k}, \cdots, \frac{1+\varepsilon\theta_{k/2}}{k}, \frac{1-\varepsilon\theta_{k/2}}{k} \right).$$
Clearly, each $P_\theta$ belongs to $H_1$. Moreover,
$$\int \frac{dP_\theta\, dP_{\theta'}}{dU_k} = 1 + \frac{2\varepsilon^2}{k}\langle\theta,\theta'\rangle.$$
Therefore, by Lemma 2 and the discussions thereafter, we have
$$\chi^2\big( \mathbb{E}_\theta[P_\theta^{\otimes n}],\ U_k^{\otimes n} \big) = \mathbb{E}\left[ \left(1 + \frac{2\varepsilon^2}{k}\langle\theta,\theta'\rangle\right)^{n} \right] - 1 \le \mathbb{E}\left[ \exp\left( \frac{2n\varepsilon^2}{k}\langle\theta,\theta'\rangle \right) \right] - 1 \le \exp\left( \frac{n^2\varepsilon^4}{k} \right) - 1,$$
where the last inequality follows from the $1$-sub-Gaussianity of each coordinate of the Rademacher vector $\theta\odot\theta'$ (so that $\langle\theta,\theta'\rangle$ is $\sqrt{k/2}$-sub-Gaussian). Hence, as long as $n = O(\sqrt{k}/\varepsilon^2)$, the above $\chi^2$-divergence is upper bounded by a constant. Therefore, the sample complexity of the uniformity testing problem is at least $\Omega(\sqrt{k}/\varepsilon^2)$, which is also tight (see bibliographic notes for details).
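The $\chi^2$-divergence in Paninski’s construction can also be computed exactly, since $\langle\theta,\theta'\rangle = 2B - k/2$ with $B\sim\mathsf{B}(k/2,1/2)$. The sketch below (my own illustration, not from the notes; $k$, $\varepsilon$ and the sample sizes are arbitrary test values) compares the exact value with the sub-Gaussian bound $\exp(n^2\varepsilon^4/k)-1$ around the critical sample size $n\asymp\sqrt{k}/\varepsilon^2$.

```python
# Exact chi^2 divergence for Paninski's mixture vs. the sub-Gaussian bound (sketch).
from math import comb, exp

def chi2_paninski(n, k, eps):
    """chi^2( E_theta[P_theta^(n)], U_k^(n) ) = E[(1 + 2 eps^2 <theta,theta'> / k)^n] - 1."""
    m = k // 2
    total = 0.0
    for b in range(m + 1):                          # b = #{i : theta_i = theta'_i}
        prob = comb(m, b) / 2**m                    # P(<theta, theta'> = 2b - m)
        total += prob * (1 + 2 * eps**2 * (2 * b - m) / k) ** n
    return total - 1.0

k, eps = 1000, 0.1
for n in (100, 1000, 3162, 10_000):                 # sqrt(k)/eps^2 is about 3162
    print(n, chi2_paninski(n, k, eps), exp(n**2 * eps**4 / k) - 1)
```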
2.3. General Identity Testing
Next we generalize the uniformity testing problem to the case where the uniform distribution is replaced by an arbitrary given probability distribution $Q$. Specifically, given $n$ i.i.d. observations from a discrete distribution $P$ supported on at most $k$ elements, one would like to test between
$$H_0: P = Q \qquad \text{versus} \qquad H_1: \|P - Q\|_1 \ge \varepsilon,$$
where $\varepsilon$ is some prescribed parameter, and $Q$ is a fixed and known probability distribution. Again, the target is to determine the sample complexity of this identity testing problem, in terms of $\varepsilon$ and $Q$.
For simplicity, we assume that $k$ is an even integer, and the components of $Q$ appear in pairs, i.e., $Q(1)=Q(2)$, $Q(3)=Q(4)$, etc. We reparametrize $Q$ by $(q_1,q_1,q_2,q_2,\cdots,q_{k/2},q_{k/2})$. We also assume that $q_i \ge \varepsilon^3$ for all $i\in[k/2]$. The general case requires a proper trimming on $Q$, and is referred to the bibliographic notes. As in the previous Paninski’s construction, we again consider a uniform mixture in $H_1$, but with different perturbation sizes. Specifically, let the random vector $\theta=(\theta_1,\cdots,\theta_{k/2})$ be distributed uniformly on $\{\pm1\}^{k/2}$ (assuming that $k$ is even), and consider the mixture $\mathbb{E}_\theta[P_\theta^{\otimes n}]$, with
$$P_\theta = \big( q_1+\varepsilon_1\theta_1,\ q_1-\varepsilon_1\theta_1,\ \cdots,\ q_{k/2}+\varepsilon_{k/2}\theta_{k/2},\ q_{k/2}-\varepsilon_{k/2}\theta_{k/2} \big).$$
Here the parameters $\varepsilon_1,\cdots,\varepsilon_{k/2}\ge 0$ will be specified later. Note that $P_\theta$ is a valid probability distribution if $\varepsilon_i\le q_i$ for all $i$, and the requirement $P_\theta\in H_1$ translates into $\sum_{i=1}^{k/2}\varepsilon_i \ge \varepsilon/2$ (since $\|P_\theta - Q\|_1 = 2\sum_i\varepsilon_i$). As before,
$$\chi^2\big( \mathbb{E}_\theta[P_\theta^{\otimes n}],\ Q^{\otimes n} \big) = \mathbb{E}\left[ \left( 1 + 2\sum_{i=1}^{k/2}\frac{\varepsilon_i^2}{q_i}\theta_i\theta_i' \right)^{n} \right] - 1 \le \exp\left( 2n^2\sum_{i=1}^{k/2}\frac{\varepsilon_i^4}{q_i^2} \right) - 1,$$
where again the last step follows from the sub-Gaussianity. Since our target is to find the largest possible $n$ such that the $\chi^2$-divergence is $O(1)$, we aim to find the $\varepsilon_i$’s to minimize the quantity $\sum_{i=1}^{k/2}\varepsilon_i^4/q_i^2$ subject to the above constraints. To do so, we recall the following elementary inequality.
Lemma 3 (Hölder’s Inequality) Let $a_{ij}\ge 0$ be non-negative reals for all $i\in[m]$ and $j\in[d]$. Then
$$\prod_{j=1}^{d}\left( \sum_{i=1}^{m} a_{ij} \right) \ge \left( \sum_{i=1}^{m} \prod_{j=1}^{d} a_{ij}^{1/d} \right)^{d}.$$
Proof: By homogeneity we may assume that $\sum_{i=1}^{m} a_{ij}=1$ for all $j\in[d]$. Then the desired inequality follows from the AM–GM inequality
$$\prod_{j=1}^{d} a_{ij}^{1/d} \le \frac{1}{d}\sum_{j=1}^{d} a_{ij},$$
summed over $i\in[m]$.
By Lemma 3 (applied with $d=4$, $a_{i1}=\varepsilon_i^4/q_i^2$ and $a_{i2}=a_{i3}=a_{i4}=q_i^{2/3}$), we have the following inequality:
$$\left( \sum_{i=1}^{k/2}\frac{\varepsilon_i^4}{q_i^2} \right)\left( \sum_{i=1}^{k/2} q_i^{2/3} \right)^{3} \ge \left( \sum_{i=1}^{k/2}\varepsilon_i \right)^{4} \ge \left( \frac{\varepsilon}{2} \right)^{4},$$
which gives the lower bound $\sum_{i=1}^{k/2}\varepsilon_i^4/q_i^2 \ge (\varepsilon/2)^4/(\sum_{i=1}^{k/2} q_i^{2/3})^3$, with identity when
$$\varepsilon_i = \frac{\varepsilon}{2}\cdot\frac{q_i^{2/3}}{\sum_{j=1}^{k/2} q_j^{2/3}}.$$
It is straightforward to check that $\varepsilon_i\le q_i$ for all $i$ under this choice, thanks to the assumption $q_i\ge\varepsilon^3$ (note that $\sum_j q_j^{2/3}\ge(\sum_j q_j)^{2/3}=2^{-2/3}$). Consequently, since $(\sum_{i=1}^{k/2} q_i^{2/3})^3 = \Theta(\|Q\|_{2/3}^2)$, we arrive at the lower bound
$$n = \Omega\left( \frac{\|Q\|_{2/3}}{\varepsilon^2} \right)$$
for the sample complexity, which is tight in general. Note how the mysterious $2/3$-norm is obtained by carefully choosing the mixture.
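To see the optimization at work, the short sketch below (my own illustration, not from the notes) compares the objective $\sum_i\varepsilon_i^4/q_i^2$ at the Hölder-optimal allocation $\varepsilon_i\propto q_i^{2/3}$ with two other allocations of the same total budget $\sum_i\varepsilon_i=\varepsilon/2$; the distribution $q$ and the value of $\varepsilon$ are arbitrary, and the validity constraint $\varepsilon_i\le q_i$ is ignored for this comparison.

```python
# Comparing allocations of the perturbation budget in the identity testing bound (sketch).
import numpy as np

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(50)) / 2                 # q_1, ..., q_{k/2}, summing to 1/2
eps = 0.05

def objective(e):                                  # quantity controlling the chi^2 bound
    return np.sum(e**4 / q**2)

opt  = (eps / 2) * q**(2/3) / np.sum(q**(2/3))     # Hölder-optimal: eps_i proportional to q_i^{2/3}
unif = np.full_like(q, (eps / 2) / q.size)         # uniform allocation
prop = (eps / 2) * q / np.sum(q)                   # allocation proportional to q_i

print(objective(opt), objective(unif), objective(prop))
print((eps / 2)**4 / np.sum(q**(2/3))**3)          # the Hölder lower bound, attained by `opt`
```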
3. Examples of Estimation Problems
In this section, we will look at several estimation problems and show how Theorem 1 provides the tight lower bound for these problems. We will see that the choice of the mixture becomes natural after we understand the problem.
3.1. Linear Functional Estimation of Sparse Gaussian Mean
Consider a Gaussian location model $X\sim\mathcal{N}(\theta, I_p)$, where we assume that the mean vector $\theta\in\mathbb{R}^p$ is $s$-sparse, i.e., $\|\theta\|_0\le s$. The target is to estimate the linear functional of $\theta$, i.e., $T(\theta)=\sum_{i=1}^{p}\theta_i$, and we aim to prove a tight minimax lower bound for the minimax risk $R^\star_{s,p}$ under the quadratic loss.
The key feature of this problem is the dependence of the minimax risk on the sparsity parameter $s$. If one could know the support of $\theta$, summing $X_i$ over the coordinates $i$ in the support gives a natural estimator with risk $s$. However, due to the unknown support of $\theta$, we will show that $R^\star_{s,p} = \Omega(s^2\log(1+p/s^2))$ for small $s$, where the additional penalty factor $s\log(1+p/s^2)$ is essentially the logarithm of $\binom{p}{s}$, the number of possible supports of $\theta$.
To incorporate the effect of the unknown support into the lower bound, a natural choice would be to consider a random mean vector with support uniformly distributed on all size-$s$ subsets of $[p]$. Formally, let $\theta_0 = 0$, and $\theta_S = \mu\mathbb{1}_S$ with $\mathbb{1}_S\in\{0,1\}^p$ being the indicator vector of the subset $S\subseteq[p]$, where $\mu>0$ is a parameter to be specified later. Moreover, we let $S$ be uniformly distributed on all possible size-$s$ subsets of $[p]$. Clearly, the separability condition in Theorem 1 is satisfied with $\Delta = (s\mu)^2/2$. To upper bound the $\chi^2$-divergence, we again compute
$$\int \frac{dP_{\theta_S}\, dP_{\theta_{S'}}}{dP_{\theta_0}} = \exp\big( \mu^2 |S\cap S'| \big),$$
and consequently
$$\chi^2\big( \mathbb{E}_S[P_{\theta_S}],\ P_{\theta_0} \big) = \mathbb{E}\big[ \exp\big( \mu^2 |S\cap S'| \big) \big] - 1.$$
The random variable $|S\cap S'|$ follows a hypergeometric distribution with pmf $\mathbb{P}(|S\cap S'|=r) = \binom{s}{r}\binom{p-s}{s-r}/\binom{p}{s}$. It is well known in probability theory that a hypergeometric distribution refers to sampling without replacement, while the binomial distribution $\mathsf{B}(s, s/p)$ refers to sampling with replacement. The next lemma shows a relationship between these two random variables.
Lemma 4 Let $\mathcal{C}=\{c_1,\cdots,c_p\}$ be an urn of real numbers, $X_1,\cdots,X_s$ be samples uniformly drawn from the urn without replacement, and $Y_1,\cdots,Y_s$ be samples uniformly drawn from the urn with replacement. Define $X=\sum_{i=1}^{s}X_i$ and $Y=\sum_{i=1}^{s}Y_i$, then there exists a coupling between $(X,Y)$ such that $\mathbb{E}[Y\mid X]=X$.
Proof: The coupling is defined as follows: draw $X_1,\cdots,X_s$ without replacement, and for any distinct $x_1,\cdots,x_s$ and not necessarily distinct $y_1,\cdots,y_s$ from the urn, let the conditional distribution be
$$\mathbb{P}\big( Y_1=y_1,\cdots,Y_s=y_s \mid X_1=x_1,\cdots,X_s=x_s \big) = \frac{(p)_m}{(s)_m\, p^s}\cdot \mathbb{1}\big( \{y_1,\cdots,y_s\}\subseteq\{x_1,\cdots,x_s\} \big),$$
where $m$ denotes the number of distinct elements in $(y_1,\cdots,y_s)$. To verify that this is a valid conditional probability measure, recall that the number of ways of drawing $s$ labeled elements with exactly $m$ distinct elements from an urn of $p$ elements is ${s\brace m}(p)_m$, where ${s\brace m}$ is the Stirling number of the second kind, and $(p)_m = p(p-1)\cdots(p-m+1)$ is the falling factorial. Hence, the probability is valid due to the identity
$$\sum_{m=1}^{s} {s\brace m}(s)_m\cdot \frac{(p)_m}{(s)_m\, p^s} = \frac{1}{p^s}\sum_{m=1}^{s}{s\brace m}(p)_m = 1,$$
where the last step is due to that $\sum_{m=1}^{s}{s\brace m}(p)_m$ is the number of ways of drawing $s$ labeled elements from an urn of $p$ elements and therefore equals $p^s$.
From this conditional probability, it can be directly verified that the unconditional probability of $(Y_1,\cdots,Y_s)$ coincides with sampling with replacement. Moreover, since the conditional distribution is invariant under permutations of the values $x_1,\cdots,x_s$, the identity $\mathbb{E}[Y_i\mid X_1,\cdots,X_s] = \frac{1}{s}\sum_{j=1}^{s}X_j$ follows from symmetry, and therefore $\mathbb{E}[Y\mid X]=X$.
By Lemma 4, there exists some $\sigma$-algebra $\mathcal{F}$ such that $|S\cap S'|$ has the same distribution as $\mathbb{E}[Z\mid\mathcal{F}]$ with $Z\sim\mathsf{B}(s,s/p)$. Now by Jensen’s inequality,
$$\mathbb{E}\big[ \exp\big( \mu^2|S\cap S'| \big) \big] \le \mathbb{E}\big[ \exp\big( \mu^2 Z \big) \big] = \left( 1 + \frac{s}{p}\big( e^{\mu^2} - 1 \big) \right)^{s}.$$
Hence, the largest $\mu^2$ to ensure a constant $\chi^2$-divergence is $\mu^2 = \Theta(\log(1+p/s^2))$, which by Theorem 1 gives
$$R^\star_{s,p} = \Omega\big( s^2\log(1+p/s^2) \big),$$
which has a phase transition at $s = \Theta(\sqrt{p})$. In fact, this lower bound is also tight: if $s = \Omega(\sqrt{p})$, the natural estimator $\hat{T} = \sum_{i=1}^{p} X_i$ is rate-optimal; if $s = O(\sqrt{p})$, the thresholding estimator $\hat{T} = \sum_{i=1}^{p} X_i\mathbb{1}(|X_i|\ge\tau)$ is rate-optimal for some properly chosen threshold $\tau>0$.
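As a numerical check of the last two displays (my own sketch, not from the notes; the values of $p$ and $s$ are arbitrary), one can evaluate $\mathbb{E}[\exp(\mu^2|S\cap S'|)]$ exactly from the hypergeometric pmf and compare it with the binomial bound obtained from Lemma 4 and Jensen’s inequality, at the critical choice $\mu^2 = \log(1+p/s^2)$.

```python
# Hypergeometric MGF vs. the binomial bound from Lemma 4 + Jensen (sketch).
from math import comb, exp, log

def mgf_hypergeometric(p, s, mu2):
    """E[exp(mu2 * |S ∩ S'|)] with |S ∩ S'| hypergeometric (two size-s subsets of [p])."""
    return sum(comb(s, r) * comb(p - s, s - r) / comb(p, s) * exp(mu2 * r)
               for r in range(s + 1))

def mgf_binomial_bound(p, s, mu2):
    """E[exp(mu2 * Z)] with Z ~ B(s, s/p)."""
    return (1 + (s / p) * (exp(mu2) - 1)) ** s

p, s = 10_000, 20
mu2 = log(1 + p / s**2)                         # the critical choice of mu^2
print(mgf_hypergeometric(p, s, mu2), mgf_binomial_bound(p, s, mu2))
# Both are O(1), so Theorem 1 yields a lower bound of order (s * mu)^2 = s^2 * log(1 + p/s^2).
```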
3.2. Largest Eigenvalue of Covariance Matrix
Consider i.i.d. observations $X_1,\cdots,X_n\sim\mathcal{N}(0,\Sigma)$, where $\Sigma\in\mathbb{R}^{d\times d}$ is the covariance matrix. Assuming $\|\Sigma\|_{\rm op}\le 1$, a classical matrix concentration inequality shows that the empirical covariance matrix $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}X_iX_i^\top$ satisfies
$$\mathbb{E}\big\| \hat{\Sigma}-\Sigma \big\|_{\rm op} = O\left( \sqrt{\frac{d}{n}} + \frac{d}{n} \right).$$
We show that this bound is rate-optimal (when $d=O(n)$). In fact, we may even show a stronger result: the minimax risk $R^\star_{n,d}$ in estimating the largest eigenvalue $\lambda_{\max}(\Sigma)$ under the absolute value loss is at least
$$R^\star_{n,d} = \Omega\left( \min\left\{ \sqrt{\frac{d}{n}},\ 1 \right\} \right).$$
Note that by the triangle inequality $|\lambda_{\max}(\hat{\Sigma})-\lambda_{\max}(\Sigma)|\le\|\hat{\Sigma}-\Sigma\|_{\rm op}$, estimating the largest eigenvalue is indeed an easier problem than covariance estimation under the operator norm.
As before, a natural idea to establish the lower bound is to choose a uniformly random direction for the principal eigenvector, so that the learner must distribute her effort over all directions evenly. Specifically, we choose $\Sigma_0 = \frac{1}{2}I_d$, and $\Sigma_v = \frac{1}{2}(I_d + \lambda vv^\top)$, where $\lambda\in(0,1)$ is a parameter to be chosen later, and $v$ is uniformly distributed on the unit sphere $\mathbb{S}^{d-1}$. It is easy to verify that the constraint $\|\Sigma_v\|_{\rm op}\le 1$ is fulfilled as long as $\lambda\le 1$, and $\Delta = \lambda/2$ in Theorem 1. Moreover,
$$\int \frac{dP_{\Sigma_u}\, dP_{\Sigma_v}}{dP_{\Sigma_0}} = \big( 1 - \lambda^2\langle u,v\rangle^2 \big)^{-1/2}.$$
Hence, by Lemma 2,
$$\chi^2\big( \mathbb{E}_v[P_{\Sigma_v}^{\otimes n}],\ P_{\Sigma_0}^{\otimes n} \big) = \mathbb{E}_{u,v}\Big[ \big( 1-\lambda^2\langle u,v\rangle^2 \big)^{-n/2} \Big] - 1 \le \mathbb{E}_{u,v}\Big[ \exp\big( n\lambda^2\langle u,v\rangle^2 \big) \Big] - 1$$
assuming $\lambda\le 1/\sqrt{2}$ (here $u,v$ are independent uniform directions), where the last inequality follows from $(1-x)^{-1/2}\le e^{x}$ whenever $x\in[0,1/2]$. Computation of the surface area of spherical caps gives
$$\mathbb{P}\big( |\langle u,v\rangle| \ge t \big) \le 2\exp\left( -\frac{dt^2}{2} \right),$$
so that $\langle u,v\rangle$ is $O(1/\sqrt{d})$-sub-Gaussian. Consequently, to make the above $\chi^2$-divergence $O(1)$, we may take
$$\lambda = c\min\left\{ \sqrt{\frac{d}{n}},\ 1 \right\}$$
for a sufficiently small constant $c>0$, which gives the desired minimax lower bound $R^\star_{n,d} = \Omega(\min\{\sqrt{d/n},1\})$.
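The $\chi^2$-divergence above can also be evaluated numerically, using the fact that $t=\langle u,v\rangle$ has density proportional to $(1-t^2)^{(d-3)/2}$ on $[-1,1]$. The sketch below (my own illustration, not from the notes; $n$, $d$ and the constants $c$ are arbitrary test values) shows that the divergence stays bounded for $\lambda=c\sqrt{d/n}$ with a small constant $c$ and grows rapidly as $c$ increases.

```python
# chi^2 divergence for the random principal direction mixture (illustrative sketch).
import numpy as np

def chi2_eigenvalue_mixture(n, d, lam):
    """E_{u,v}[(1 - lam^2 <u,v>^2)^{-n/2}] - 1 via numerical integration over t = <u,v>."""
    t = np.linspace(-1 + 1e-9, 1 - 1e-9, 400_001)
    w = (1.0 - t**2) ** ((d - 3) / 2)            # unnormalized density of t = <u, v>
    w /= w.sum()                                 # normalize on the grid
    return float(np.sum(w * (1.0 - lam**2 * t**2) ** (-n / 2))) - 1.0

n, d = 2000, 100
for c in (0.25, 0.5, 1.0, 2.0):
    lam = c * np.sqrt(d / n)
    print(c, chi2_eigenvalue_mixture(n, d, lam))
# The divergence is O(1) for small c and blows up for larger c,
# matching the choice of lambda of order sqrt(d/n) in the lower bound.
```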
3.3. Quadratic Functional Estimation
Finally we switch to a nonparametric example. Let $X_1,\cdots,X_n$ be i.i.d. observations drawn from a density $f$, where $f$ is supported on $[0,1]$ and belongs to the Hölder ball with smoothness parameter $s>0$:
$$\mathcal{H}^{s}(L) = \left\{ f: \big| f^{(m)}(x) - f^{(m)}(y) \big| \le L|x-y|^{s-m},\ \forall x,y\in[0,1] \right\}, \qquad m = \lceil s\rceil - 1,$$
with $L>0$ a constant. The target is to estimate the quadratic functional
$$Q(f) = \int_0^1 f(x)^2\, dx,$$
and we let $R^\star_{n,s}$ be the corresponding minimax risk under the absolute value loss. The central claim is that
$$R^\star_{n,s} = \Theta\left( n^{-\frac{4s}{4s+1}} + n^{-\frac{1}{2}} \right),$$
which has an “elbow effect” at $s=1/4$.
A central tool for proving nonparametric lower bounds is parametric reduction, i.e., restricting to a parametric subfamily such that the subproblem is as hard as the original nonparametric problem. In this example, a natural candidate is
$$f_\theta(x) = 1 + c\sum_{i=1}^{1/h} \theta_i\, h^{s}\, g\left( \frac{x-(i-1)h}{h} \right),$$
where $h\in(0,1)$ is the bandwidth to be chosen later (with $1/h$ an integer), the vector $\theta = (\theta_1,\cdots,\theta_{1/h})$ satisfies $\|\theta\|_\infty\le 1$, and $g$ is a fixed smooth function on $\mathbb{R}$ vanishing outside $[0,1]$ with $\int_0^1 g(x)\,dx = 0$. It is a simple exercise to show that for $c>0$ sufficiently small, each $f_\theta$ is a valid density on $[0,1]$ and belongs to the Hölder ball $\mathcal{H}^s(L)$. Intuitively, the density $f_\theta$ is a perturbation of the uniform density with small bumps modulated by the vector $\theta$. Moreover,
$$Q(f_\theta) = \int_0^1 f_\theta(x)^2\, dx = 1 + c^2\|g\|_2^2\, h^{2s+1}\sum_{i=1}^{1/h}\theta_i^2,$$
and therefore the target of the subproblem is to estimate $\sum_{i=1}^{1/h}\theta_i^2$ (up to an affine transformation) based on $X_1,\cdots,X_n\sim f_\theta$. Again, the idea of Theorem 1 can be applied.
Let the point be the uniform density $f_0\equiv 1$, and let $\theta$ be uniformly distributed on $\{\pm1\}^{1/h}$. By simple calculation,
$$\int_0^1 \frac{f_\theta(x) f_{\theta'}(x)}{f_0(x)}\, dx = 1 + c^2\|g\|_2^2\, h^{2s+1}\langle\theta,\theta'\rangle.$$
Consequently, Lemma 2 gives
$$\chi^2\big( \mathbb{E}_\theta[f_\theta^{\otimes n}],\ f_0^{\otimes n} \big) = \mathbb{E}\Big[ \big( 1 + c^2\|g\|_2^2\, h^{2s+1}\langle\theta,\theta'\rangle \big)^{n} \Big] - 1 \le \exp\left( \frac{c^4\|g\|_2^4\, n^2 h^{4s+1}}{2} \right) - 1.$$
Hence, choosing $h = \Theta(n^{-\frac{2}{4s+1}})$ gives $\chi^2 = O(1)$. Moreover, Theorem 1 is fulfilled with $\Delta = \Theta(h^{2s}) = \Theta(n^{-\frac{4s}{4s+1}})$, and therefore we have proved that
$$R^\star_{n,s} = \Omega\big( n^{-\frac{4s}{4s+1}} \big).$$
The parametric rate $\Omega(n^{-1/2})$ is trivial (by the tools in Lecture 4).
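As a final sanity check (my own sketch, not from the notes; it uses a sine bump in place of a generic smooth compactly supported $g$, and arbitrary values of $c$, $h$ and $s$), the key identity $\int_0^1 f_\theta f_{\theta'}\,dx = 1 + c^2\|g\|_2^2 h^{2s+1}\langle\theta,\theta'\rangle$ can be verified numerically:

```python
# Numerical check of the inner-product identity for the bump construction (sketch).
import numpy as np

rng = np.random.default_rng(2)
s, c, m = 0.3, 0.2, 8                       # smoothness, bump amplitude, number of bumps
h = 1.0 / m                                 # bandwidth
x = np.linspace(0.0, 1.0, 2_000_001)        # fine grid on [0, 1]

def g(u):                                   # mean-zero bump supported on [0, 1]
    return np.where((u >= 0) & (u <= 1), np.sin(2 * np.pi * u), 0.0)

def f(theta):                               # perturbed density f_theta
    return 1 + c * h**s * sum(theta[i] * g((x - i * h) / h) for i in range(m))

theta1 = rng.choice([-1, 1], size=m)
theta2 = rng.choice([-1, 1], size=m)
lhs = np.mean(f(theta1) * f(theta2))        # approximates int_0^1 f_theta f_theta' dx
rhs = 1 + c**2 * 0.5 * h**(2 * s + 1) * np.dot(theta1, theta2)   # ||sin(2 pi u)||_2^2 = 1/2
print(lhs, rhs)                             # the two values agree up to discretization error
```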
4. Bibliographic Notes
A wonderful book which treats the point-mixture idea of Theorem 1 systematically and contains the method in Lemma 2 is Ingster and Suslina (2012). The sharp bounds on the clique size in the hidden clique problem are due to Matula (1972). It is a long-standing open problem whether there exists a polynomial-time algorithm attaining this statistical limit, and we refer to the lecture notes Lugosi (2017) for details. The sample complexity of the uniformity testing problem is due to Paninski (2008), and the general identity testing problem is studied in Valiant and Valiant (2017). There are also some other generalizations or tools for uniformity testing, e.g., Diakonikolas and Kane (2016), Diakonikolas, Kane and Stewart (2018), Blais, Canonne and Gur (2019).
For the estimation problems, the linear functional estimation for the sparse Gaussian mean is due to Collier, Comminges and Tsybakov (2017), where Lemma 4 can be found in Aldous (1985). The largest eigenvalue example is taken from the lecture notes Wu (2017). The nonparametric functional estimation problem is widely studied in statistics, where the minimax rate for estimating the quadratic functional can be found in Bickel and Ritov (1988), and it is $\Theta(n^{-\frac{4s}{4s+d}} + n^{-1/2})$ for general dimension $d$. The same minimax rate (and the elbow effect) actually holds for all smooth functionals, see Birgé and Massart (1995), Kerkyacharian and Picard (1996) and Robins et al. (2017).
- Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012.
- David W. Matula. Employee party problem. Notices of the American Mathematical Society. Vol. 19, No. 2, 1972.
- Gábor Lugosi. Lectures on Combinatorial Statistics. 2017.
- Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory 54.10 (2008): 4750–4755.
- Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing 46.1 (2017): 429–455.
- Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2016.
- Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018.
- Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory (TOCT) 11.2 (2019): 6.
- Olivier Collier, Laëtitia Comminges, and Alexandre B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics 45.3 (2017): 923–958.
- David J. Aldous. Exchangeability and related topics. Ecole d’Eté de Probabilités de Saint-Flour XIII-1983. Springer, Berlin, Heidelberg, 1985. 1–198.
- Yihong Wu. Lecture notes for ECE598YW: Information-theoretic methods for high-dimensional statistics. 2017.
- Peter J. Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya: The Indian Journal of Statistics, Series A (1988): 381–393.
- Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. The Annals of Statistics 23.1 (1995): 11–29.
- Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics 24.2 (1996): 485-507.
- James Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and Aad van der Vaart. Higher order estimating equations for high-dimensional models. Annals of statistics 45, no. 5 (2017): 1951.