Lecture 9: Multiple Hypothesis Testing: More Examples
(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In the last lecture we have seen the general tools and concrete examples of reducing the statistical estimation problem to multiple hypothesis testing. In these examples, the loss function is typically straightforward and the construction of the hypotheses is natural. However, in problems beyond statistical estimation the loss function may become more complicated (e.g., the excess risk in learning theory), and the hypothesis constructions may be implicit. To better illustrate the power of multiple hypothesis testing, this lecture will be exclusively devoted to more examples in potentially non-statistical problems.
1. Example I: Density Estimation
This section considers the most fundamental problem in nonparametric statistics, i.e., the density estimation. It is well-known in the nonparametric statistics literature that, if a density on the real line has Hölder smoothness parameter , it then can be estimated within
accuracy
based on
samples. However, in this section we will show that this is no longer the correct minimax rate when we generalize to other notions of smoothness and other natural loss functions.
Fix an integer and some norm parameter
, the Sobolev space
on
is defined as
where the derivative is defined in terms of distributions (
not necessarily belongs to
). Our target is to determine the minimax rate
of estimating the density
based on
i.i.d. observations from
, under the general
loss with
. For simplicity we suppress the dependence on
and assume that
is a large positive constant.
The main result of this section is as follows:
This rate is also tight. We see that as are fixed and
increases from
to
, the minimax rate for density estimation increases from
to
. We will show that these rates correspond to the “dense” and the “sparse” cases, respectively.
1.1. Dense case:
We first look at the case with , which always holds for Hölder smoothness where the norm parameter is
. The reason why it is called the “dense” case is that in the hypothesis construction, the density is supported everywhere on
, and the difficulty is to classify the directions of the bumps around some common density. Specifically, let
be a bandwidth parameter to be specified later, we consider
for , where
is a fixed smooth function supported on
with
, and
is some tuning parameter to ensure that
. Since
we conclude that the choice is sufficient. To invoke Assouad’s lemma, first note that for
the separation condition is fulfilled with
. Moreover, for any
with
, we have
Hence, the largest to ensure that
is
, which for
gives the minimax lower bound
This lower bound is also valid for the general case due to the monotonicity of
norms.
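Since the displayed formulas in this subsection were typeset as images and are not reproduced above, the following Python sketch illustrates one standard way to realize the dense bump construction; the bump profile, the amplitude of order h^s, and all constants are illustrative assumptions rather than the exact choices in the displays.

```python
# A minimal sketch (not the exact construction above) of the "dense" hypotheses:
# perturb the uniform density on [0, 1] by signed bumps of width h and height
# proportional to h**s. The profile g and all constants are illustrative.
import numpy as np

def bump(u):
    # smooth profile supported on (0, 1) with zero integral, so every perturbed
    # function still integrates to one
    out = np.zeros_like(u)
    inside = (u > 0) & (u < 1)
    v = u[inside]
    out[inside] = np.exp(-1.0 / (v * (1 - v))) * np.sin(2 * np.pi * v)
    return out

def f_tau(x, tau, s=1.0, c=1.0):
    h = 1.0 / len(tau)                  # bandwidth = 1 / number of bumps
    f = np.ones_like(x)                 # start from the uniform density
    for j, t in enumerate(tau):
        f += c * t * h**s * bump((x - j * h) / h)
    return f

x = np.linspace(0, 1, 200_001)
tau = np.random.default_rng(0).choice([-1, 1], size=16)
f = f_tau(x, tau)
print("min value of f_tau :", f.min())   # stays positive for small c * h**s
print("integral of f_tau  :", f.mean())  # Riemann sum over [0, 1], close to 1
```

Two hypotheses whose sign vectors differ in one coordinate differ by a single signed bump, which is exactly the hypercube structure that Assouad’s lemma exploits.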
1.2. Sparse case:
Next we turn to the sparse case , where the name of “sparse” comes from the fact that for the hypotheses we will construct densities with only one bump on a small interval, and the main difficulty is to locate that interval. Specifically, let
be bandwidth and smoothness parameters to be specified later, and we consider
where , and
is a fixed smooth density on
. Clearly
is a density on
for all
, and
Consequently, , and the Sobolev ball requirement leads to the choice
.
Next we check the conditions of Fano’s inequality. Since
the separation condition is fulfilled with . Moreover, since
we conclude that . Consequently, Fano’s inequality gives
and choosing gives the desired result.
2. Example II: Aggregation
In estimation or learning problems, sometimes the learner is given a set of candidate estimators or predictors, and she aims to aggregate them into a new estimate based on the observed data. In scenarios where the candidates are not explicit, aggregation procedures can still be employed based on sample splitting, where the learner splits the data into independent parts, uses the first part to construct the candidates and the second part to aggregate them.
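As a small, hypothetical illustration of the sample-splitting scheme just described, the sketch below builds polynomial candidates on one half of the data and performs model-selection aggregation (picking the empirical risk minimizer) on the held-out half; the data-generating process and the candidate family are made up for illustration.

```python
# Sample-splitting aggregation (sketch): candidates are built on the first half
# of the data, and model-selection aggregation picks the candidate with the
# smallest empirical risk on the second half. The polynomial candidates and the
# data-generating process are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)   # unknown regression function + noise

x1, y1, x2, y2 = x[: n // 2], y[: n // 2], x[n // 2 :], y[n // 2 :]

# Step 1: build candidate predictors f_1, ..., f_M from the first half
candidates = [np.polynomial.Polynomial.fit(x1, y1, deg) for deg in range(1, 8)]

# Step 2: model-selection aggregation on the second half
risks = [np.mean((y2 - f(x2)) ** 2) for f in candidates]
best = candidates[int(np.argmin(risks))]
print("selected degree:", best.degree(), " held-out risk:", min(risks))
```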
In this section we restrict ourselves to the regression setting where there are i.i.d. observations
, and
where is the independent noise, and
is an unknown regression function with
. There is also a set of candidates
, and the target of aggregation is to find some
(not necessarily in
) to minimize
where the expectation is taken over the random observations ,
for any
, and
is a suitable subset corresponding to different types of aggregation. For a fixed data distribution
, the minimax rate of aggregation is defined as the minimum worst-case excess risk over all bounded functions
and candidate functions
:
Some special cases are in order:
- When
, the estimate
is compared with the best linear aggregates and this is called linear aggregation, with optimal rate denoted as
;
- When
, the estimate
is compared with the best convex combination of candidates (and zero) and this is called convex aggregation, with optimal rate denoted as
;
- When
, the set of canonical vectors, the estimate
is compared with the best candidate in
and this is called model selection aggregation, with optimal rate denoted as
.
The main result in this section is summarized in the following theorem.
Theorem 1 If there is a cube
such that
admits a density lower bounded from below w.r.t. the Lebesgue measure on
, then
We remark that the rates in Theorem 1 are all tight. In the upcoming subsections we will show that although the loss function of aggregation becomes more complicated, the idea of multiple hypothesis testing can still lead to tight lower bounds.
2.1. Linear aggregation
Since the oracle term is hard to deal with, a natural idea would be to consider a well-specified model such that this term is zero. Since
admits a density, we may find a partition
such that
for all
. Consider the candidate functions
, and for
, let
where is to be specified.
To apply Assouad’s lemma, note that for the loss function
the orthogonality of the candidates implies that the separability condition holds for
. Moreover,
Therefore, Assouad’s lemma (with Pinsker’s inequality) gives
Choosing completes the proof.
2.2. Convex aggregation
Convex aggregation differs from linear aggregation only in that the linear coefficients must be non-negative and sum to at most one. Note that the only requirement in the previous arguments is the orthogonality of under
, we may choose any orthonormal functions
under
with
(existence follows from the cube assumption) and use the density lower bound assumption to conclude that
(to ensure the desired separation). Then the choice of
above becomes
, and we see that the previous arguments still hold for convex aggregation if
. Hence, it remains to prove that when
,
Again we consider the well-specified case where
with the above orthonormal functions , a constant scaling factor
, and a uniform size-
subset of
being
and zero otherwise (
is to be specified). Since the vector
is no longer the vertex of a hypercube, we apply the generalized Fano’s inequality instead of Assouad’s lemma. First, applying Lemma 4 in Lecture 8 gives
Second, as long as , using similar arguments as in the sparse linear regression example in Lecture 8, for
with a small constant
we have
with . Therefore, the generalized Fano’s inequality gives
and choosing completes the proof.
2.3. Model selection aggregation
As before, we consider the well-specified case where we select orthonormal functions on
, and let
,
. The orthonormality of
gives
for the separation condition of the original Fano’s inequality, and
Hence, Fano’s inequality gives
and choosing completes the proof.
We may further show that no model selector can attain the above optimal rate of model selection aggregation. Hence, even to compare with the best candidate function in , the optimal aggregate
should not be restricted to
. Specifically, we have the following result, whose proof is left as an exercise.
Exercise 1 Under the same assumptions of Theorem 1, show that
3. Example III: Learning Theory
Consider the classification problem where there are training data
i.i.d. drawn from an unknown distribution
, with
. There is a given collection of classifiers
consisting of functions
, and given the training data, the target is to find some classifier
with a small excess risk
which is the difference in performance between the chosen classifier and the best classifier in the function class . In the definition of the excess risk, the expectation is taken with respect to the randomness in the training data. The main focus of this section is to characterize the minimax excess risk of a given function class
, i.e.,
The subscript “pes” here stands for “pessimistic”, where can be any distribution over
and there may not be a good classifier in
, i.e.,
may be large. We also consider the optimistic scenario where there exists a perfect (error-free) classifier in
. Mathematically, denoting by
the collection of all probability distributions
on
such that
, the minimax excess risk of a given function class
in the optimistic case is defined as
The central claim of this section is the following:
Theorem 2 Let the VC dimension of
be
. Then
Recall that the definition of VC dimension is as follows:
Definition 3 For a given function class
consisting of mappings from
to
, the VC dimension of
is the largest integer
such that there exist
points from
which can be shattered by
. Mathematically, it is the largest
such that there exist
, and for all
, there exists a function
such that
for all
.
VC dimension plays a significant role in statistical learning theory. For example, it is well-known that for the empirical risk minimization (ERM) classifier
we have
Hence, Theorem 2 shows that the ERM classifier attains the minimax excess risk for all function classes, and the VC dimension exactly characterizes the difficulty of the learning problem.
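To make the ERM procedure concrete, here is a small Python sketch with a finite class of threshold classifiers (VC dimension one); the data distribution, the noise level, and the class are illustrative assumptions, and the excess risk is estimated on a large held-out sample.

```python
# ERM over a small class of threshold classifiers (a sketch; the class and the
# data distribution are illustrative, not the construction in the text).
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(0, 1, size=n)
    y = (x > 0.35).astype(int)              # true labels given by a threshold
    flip = rng.random(n) < 0.1              # 10% label noise
    return x, np.where(flip, 1 - y, y)

thresholds = np.linspace(0, 1, 101)         # finite class: f_t(x) = 1{x > t}

def err(t, x, y):
    return np.mean((x > t).astype(int) != y)

x_tr, y_tr = sample(200)
t_erm = thresholds[np.argmin([err(t, x_tr, y_tr) for t in thresholds])]

x_te, y_te = sample(200_000)
risk_erm = err(t_erm, x_te, y_te)
risk_best = min(err(t, x_te, y_te) for t in thresholds)
print("excess risk of ERM:", risk_erm - risk_best)   # of order sqrt(d/n) or smaller
```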
3.1. Optimistic case
We first apply Assouad’s lemma to the optimistic scenario. By the definition of VC dimension, there exist points and functions
such that for all
and
, we have
. Consider
hypotheses indexed by
: the distribution
is always
where is to be specified later. For the conditional distribution
, let
hold almost surely under the joint distribution
. Clearly this is the optimistic case, for there always exists a perfect classifier in
.
We first examine the separation condition in Assouad’s lemma, where the loss function here is
For any and any
, we have
and therefore the separation condition holds for . Moreover, if
and
only differ in the
-th component, then
and
are completely indistinguishable if
does not appear in the training data. Hence, by coupling,
Therefore, Assouad’s lemma gives
and choosing yields the desired lower bound.
3.2. Pessimistic case
The analysis of the pessimistic case only differs in the construction of the hypotheses. As before, fix and
such that
for all
and
. Now let
be the uniform distribution on
, and under
, the conditional probability
is
and is to be specified later. In other words, the classifier
only outperforms the random guess by a margin of
under
.
Again we apply Assouad’s lemma for the lower bound of . First note that for all
,
Hence, for any and any
,
By triangle inequality, the separation condition is fulfilled with . Moreover, direct computation yields
and therefore tensorization gives
Finally, choosing gives the desired result.
3.3. General case
We may interpolate between the pessimistic and optimistic cases as follows: for any given , we restrict to the set
of joint distributions with
Then we may define the minimax excess risk over as
Clearly the optimistic case corresponds to , and the pessimistic case corresponds to
. Similar to the above arguments, we have the following lower bound on
, whose proof is left as an exercise.
Exercise 2 Show that when the VC dimension of
is
, then
4. Example IV: Stochastic Optimization
Consider the following oracle formulation of convex optimization: let be the convex objective function, and we aim to find the minimum value
. To do so, the learner can query some first-order oracle adaptively, where given the query
the oracle outputs a pair
consisting of the objective value at
(zeroth-order information) and a sub-gradient of
at
(first-order information). The queries can be made in an adaptive manner where
can depend on all previous outputs of the oracle. The target of convex optimization is to determine the minimax optimization gap after
queries defined as
where is a given class of convex functions, and the final query
can only depend on the past outputs of the functions.
Since there is no randomness involved in the above problem, the idea of multiple hypothesis testing cannot be directly applied here. Therefore, in this section we consider a simpler problem of stochastic optimization, and we postpone the general case to later lectures. Specifically, suppose that the above first-order oracle is stochastic, i.e., it only outputs
such that
, and
. Let
be the set of all convex
-Lipschitz functions (in
norm),
, and assume that the subgradient estimate returned by the oracle
always satisfies
as well. Now the target is to determine the quantity
where the expectation is taken over the randomness in the oracle output. The main result in this section is summarized in the following theorem:
Theorem 4
Since it is well-known that the stochastic gradient descent attains the optimization gap for convex Lipschitz functions, the lower bound in Theorem 4 is tight.
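On the upper-bound side, the following sketch runs projected stochastic subgradient descent on one particular convex 1-Lipschitz objective over the unit Euclidean ball; the objective, the noise model, and the step size are illustrative assumptions, chosen so that the averaged iterate exhibits the familiar 1/sqrt(T) optimization gap.

```python
# Projected stochastic subgradient descent on the convex 1-Lipschitz objective
# f(x) = ||x - x*|| over the unit ball (a sketch; the objective and the noise
# model are illustrative). The averaged iterate achieves an O(1/sqrt(T)) gap.
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 10_000
x_star = np.ones(d) / (2 * np.sqrt(d))            # minimizer inside the unit ball

def noisy_subgradient(x):
    diff = x - x_star
    g = diff / max(np.linalg.norm(diff), 1e-12)   # subgradient of ||x - x*||
    return g + 0.5 * rng.standard_normal(d)       # unbiased noisy first-order oracle

def project(x):                                   # projection onto the unit ball
    return x / max(np.linalg.norm(x), 1.0)

x = np.zeros(d)
avg = np.zeros(d)
for t in range(1, T + 1):
    x = project(x - noisy_subgradient(x) / np.sqrt(T))   # constant step ~ 1/sqrt(T)
    avg += (x - avg) / t                                  # running average of iterates

gap = np.linalg.norm(avg - x_star)                # f(avg) - min f for this objective
print("optimization gap after T queries:", gap, " vs 1/sqrt(T):", 1 / np.sqrt(T))
```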
Now we apply the multiple hypothesis testing idea to prove Theorem 4. Since the randomness only comes from the oracle, we should design the stochastic oracle carefully to reduce the optimization problem to testing. A natural way is to choose some function such that
is convex
-Lipschitz for all
, and
. In this way, the oracle can simply generate i.i.d.
, and reveal the random vector
to the learner at the
-th query. Hence, the proof boils down to finding a collection of probability distributions
such that they are separated apart while hard to distinguish based on
observations.
We first look at the separation condition. Since the loss function of this problem is , we see that
which is known as the optimization distance . Now consider
and for , let
with
defined as
Clearly is convex and
-Lipschitz, and for
we have
Hence, it is straightforward to verify that , and
In other words, the separation condition in Assouad’s lemma is fulfilled with . Moreover, simple algebra gives
and therefore Assouad’s lemma gives
and choosing completes the proof of Theorem 4. In fact, the above arguments also show that if the
norm is replaced by the
norm, then
5. Bibliographic Notes
The example of nonparametric density estimation over Sobolev balls is taken from Nemirovski (2000), which also contains the tight minimax linear risk among all linear estimators. For the upper bound, linear procedures fail to achieve the minimax rate whenever , and some non-linearities are necessary for the minimax estimators either in the function domain (Lepski, Mammen and Spokoiny (1997)) or the wavelet domain (Donoho et al. (1996)).
The linear and convex aggregations are proposed in Nemirovski (2000) for the adaptive nonparametric thresholding estimators, and the concept of model selection aggregation is due to Yang (2000). For the optimal rates of different aggregations (together with upper bounds), we refer to Tsybakov (2003) and Leung and Barron (2006).
The examples from statistical learning theory and stochastic optimization are similar in nature. The results of Theorem 2 and the corresponding upper bounds are taken from Vapnik (1998), albeit with a different proof language. For general results on the oracle complexity of convex optimization, we refer to the wonderful book Nemirovski and Yudin (1983) and the lecture notes Nemirovski (1995). The current proof of Theorem 4 is due to Agarwal et al. (2009).
- Arkadi Nemirovski, Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour 28 (2000): 85.
- Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny, Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics 25.3 (1997): 929–947.
- David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard, Density estimation by wavelet thresholding. The Annals of Statistics (1996): 508–539.
- Yuhong Yang, Combining different procedures for adaptive regression. Journal of multivariate analysis 74.1 (2000): 135–161.
- Alexandre B. Tsybakov, Optimal rates of aggregation. Learning theory and kernel machines. Springer, Berlin, Heidelberg, 2003. 303–313.
- Gilbert Leung, and Andrew R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52.8 (2006): 3396–3410.
- Vladimir Vapnik, Statistical learning theory. Wiley, New York (1998): 156–160.
- Arkadi Nemirovski, and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
- Arkadi Nemirovski, Information-based complexity of convex programming. Lecture Notes, 1995.
- Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar, Information-theoretic lower bounds on the oracle complexity of convex optimization. Advances in Neural Information Processing Systems, 2009.
Lecture 8: Multiple Hypothesis Testing: Tree, Fano and Assouad
(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In the last three lectures we systematically introduced Le Cam’s two-point method with generalizations and various examples. The main idea of the two-point method is to reduce the problem at hand to a binary hypothesis testing problem, where the hypotheses may be either single or composite. This approach works when the target is to estimate a scalar (even when the underlying statistical model may involve high-dimensional parameters), while it typically fails when the target is to recover a high-dimensional vector of parameters. For example, even in the simplest Gaussian location model where the target is to estimate
under the mean squared loss, the best two-point approach gives a lower bound
while we have known from Lecture 4 that the minimax risk is
. To overcome this difficulty, recall from the minimax theorem that a suitable prior should always work, which by discretization motivates the idea of multiple hypothesis testing.
In this lecture, we develop the general theory of reducing problems into multiple hypothesis testing, and present several tools including tree-based methods, Assouad’s lemma and Fano’s inequality. In the next two lectures we will enumerate more examples (possibly beyond the minimax estimation in statistics) and variations of the above tools.
1. General Tools of Multiple Hypothesis Testing
This section presents the general theory of applying multiple hypothesis testing to estimation problems, as well as some important tools such as tree, Fano and Assouad. Recall from the two-point method that the separability and indistinguishability conditions are of utmost importance for applying testing arguments; the above tools require different separability conditions and represent the indistinguishability condition in terms of different divergence functions.
1.1. Tree-based method and Fano’s inequality
We start with the possibly simplest separability condition, i.e., the chosen parameters (hypotheses) are pairwise separated. The following lemma shows that the minimax estimation error is lower bounded in terms of the average test error.
Lemma 1 (From estimation to testing) Let
be any loss function, and there exist
such that
Then
where the infimum is taken over all measurable tests
.
Proof: For any given estimator , we construct a test
as follows:
Then the separability condition implies that
The rest follows from the fact that the maximum is lower bounded by the average.
Remark 1 The following weaker separability condition also suffices for Lemma 1 to hold: for any
and
, the inequality
always implies that
.
The remaining quantity in Lemma 1 is the optimal average test error , for which we are looking for a lower bound. Recall that when
, this quantity equals
by Le Cam’s first lemma. For general
, we have the following two different lower bounds.
Lemma 2 (Tree-based inequality) Let
be any undirected tree on vertex set
with edge set
. Then
Proof: It is straightforward to see that
It is also easy (left as an exercise) to establish the following elementary inequality: for any reals and any tree
, we have
Now using gives the desired inequality.
Lemma 3 (Fano’s inequality) Let
. Then
Remark 2 If we introduce auxiliary random variables
and
with
, then
where
denotes the mutual information between
and
.
Proof: We present two proofs of Lemma 3. The first proof builds upon the representation (1) and is more analytical, while the second proof makes use of the data-processing inequality, which is essentially the classical information-theoretic proof of Fano’s inequality.
(First Proof) By (1), it suffices to prove that
By the linearity of expectation, it further suffices to prove that for any non-negative reals with
, we have
To establish this inequality, let . Then by the convexity of
, Jensen’s inequality gives
Plugging in the above inequality completes the proof of Lemma 3.
(Second Proof) Introduce the auxiliary random variables and
as in Remark 2. For any fixed
, consider the kernel
which sends
to
, then
is the Bernoulli distribution with parameter
, and
is the Bernoulli distribution with parameter
. By the data-processing inequality of KL divergence,
Now taking expectation on at both sides gives
and rearranging completes the proof.
Lemma 2 decomposes a multiple hypothesis testing problem into several binary testing problems, which cannot outperform the best two-point methods in typical scenarios but can be useful when there is some external randomness associated with each (see the bandit example later). Lemma 3 is the well-known Fano’s inequality involving the mutual information, and the additional
term is the key difference in multiple hypothesis testing. Hence, the typical lower bound arguments are to apply Lemma 1 together with Lemma 3 (or sometimes Lemma 2), with the following lemma which helps to upper bound the mutual information.
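Since the displayed statements above were typeset as images and are not reproduced here, it may help to record the standard packaged form of this reduction (Lemma 1 combined with Lemma 3 in the mutual information form of Remark 2); the notation is generic and the exact constants should be checked against the displays above.

```latex
% Standard packaged form of the Fano method (generic notation; constants may
% differ slightly from the exact displays above). Suppose the loss d is a
% (pseudo-)metric and d(\theta_i, \theta_j) \ge 2\delta for all i \neq j. Then
\inf_{\hat\theta}\ \max_{1\le i\le N}\ \mathbb{E}_{\theta_i}\, d(\hat\theta,\theta_i)
\;\ge\; \delta\left(1-\frac{I(V;X)+\log 2}{\log N}\right),
% where V \sim \mathrm{Uniform}\{1,\dots,N\} and X \mid V=i \sim P_{\theta_i};
% the mutual information I(V;X) is then upper bounded via the variational
% representation in Lemma 4 below.
```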
Lemma 4 (Variational representation of mutual information)
Proof: Simply verify that
1.2. Assouad’s lemma
Instead of the previous pairwise separation condition, Assouad’s lemma builds upon a different one where the hypotheses are essentially the vertices of an -dimensional hypercube. Specifically, we shall require that the distance between parameters
and
is proportional to their Hamming distance
, which becomes quite natural when the parameter of the statistical model lies in an
-dimensional space.
Theorem 5 (Assouad’s Lemma) Let
be any function. If there exist parameters
indexed by
such that
then
where
Proof: For a given estimator , define
Then it is straightforward to see that
The rest of the proof follows from Le Cam’s first lemma and the fact that the maximum is no smaller than the average.
In most scenarios, the total variation distance between the mixtures and
is hard to compute directly, and the following corollaries (based on the joint convexity of the total variation distance and the Cauchy–Schwarz inequality) are the frequently presented versions of Assouad’s lemma.
Corollary 6 Under the conditions of Theorem 5, we have
Corollary 7 Under the conditions of Theorem 5, we have
where
is the resulting binary vector after flipping the
-th coordinate of
.
The idea behind Assouad’s lemma is to apply a two-point argument to each coordinate of the parameter and then sum over all coordinates. Hence, Assouad’s lemma typically improves over the best two-point argument by a factor of the dimension , given that the hypercube-type separation condition in Theorem 5 holds.
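For reference, a commonly used packaged form of Assouad’s lemma (in the spirit of Corollary 7, written in generic notation) is recorded below; the exact separation condition and constants should be checked against Theorem 5 and Corollary 7 above.

```latex
% A commonly used form of Assouad's lemma (generic notation). Suppose that for
% every estimate \hat\theta one can extract a sign vector
% \hat\tau(\hat\theta) \in \{-1,+1\}^d with
%   L(\theta_\tau, \hat\theta) \ \ge\ \delta \sum_{j=1}^d
%   \mathbf{1}\{\hat\tau_j(\hat\theta) \neq \tau_j\} \quad \text{for all } \tau.
% Then
\inf_{\hat\theta}\ \max_{\tau \in \{-1,+1\}^d}\ \mathbb{E}_{\tau}\, L(\theta_\tau, \hat\theta)
\;\ge\; \frac{d\,\delta}{2}
\Bigl(1 - \max_{d_{\mathrm{H}}(\tau,\tau')=1} \mathrm{TV}(P_\tau, P_{\tau'})\Bigr).
```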
1.3. Generalized Fano’s inequality
In Fano’s inequality and Assouad’s lemma above, we introduce a random variable which is uniformly distributed on the set
or the hypercube
. Moreover, the separation condition must hold for each pair of the realizations of
. In general, we may wish to consider some non-uniform random variable
, and only require that the separation condition holds for most pairs. This is the focus of the following generalized Fano’s inequality.
Theorem 8 (Generalized Fano’s Inequality) Let
be any loss function, and
be any probability distribution on
. For any
, define
Then for
, we have
Proof: By Markov’s inequality, it suffices to show that for any estimator ,
We again use the second proof of Lemma 3. For any fixed , consider the deterministic kernel
which sends
to
. Then the composition
is a Bernoulli distribution with parameter
, and the composition
is a Bernoulli distribution with parameter
. Then by the data processing inequality of KL divergence,
Since , taking expectation on
at both sides gives the desired inequality.
To see that Theorem 8 is indeed a generalization to the original Fano’s inequality, simply note that when and the separation condition in Lemma 1 holds, we have
and therefore the denominator
becomes the log-cardinality. Hence, in the generalized Fano’s inequality, the verification of the separation condition becomes upper bounding the probability
.
2. Applications
In this section we present some applications of the above tools to statistical examples. Here the applications are mostly simple and straightforward, and we defer some more sophisticated examples in other domains to the next lecture.
2.1. Example I: Gaussian mean estimation
We start with the possibly simplest example of Gaussian mean estimation. Consider with known
, and the target is to estimate the vector
under the squared
loss. Let
be the corresponding minimax risk.
We first show that any two-point argument fails. In fact, if the two points are chosen to be , the two-point argument gives
where is the normal CDF. Optimization only gives
.
Now we show that Assouad’s lemma can give us the rate-optimal lower bound . Let
for
, where
is a parameter to be chosen later. Since
the separation condition in Theorem 5 holds with . Consequently, Corollary 6 gives
and choosing gives
.
We can also establish the same lower bound using the generalized Fano’s inequality. Let , then Lemma 4 gives
Since for any with
(break ties arbitrarily), we have
choosing gives
for some numerical constant , where the last inequality is given by sub-Gaussian concentration. Consequently,
, and Theorem 8 gives
Again, choosing with small
gives
.
Remark 3 The original version of Fano’s inequality can also be applied here, where
is a maximal packing of
with
packing distance
.
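As a numerical sanity check of the rate just derived, the sketch below simulates the sample mean estimator based on n i.i.d. d-dimensional Gaussian observations; its risk equals sigma^2 * d / n exactly (with n = 1 recovering the single-observation version of the model), so the Assouad/Fano lower bound above is tight up to constants. The dimension and sample sizes are arbitrary.

```python
# Numerical check that the sample mean attains the rate sigma^2 * d / n for
# estimating a d-dimensional Gaussian mean under squared l2 loss (a sketch;
# the specific d, n, and number of repetitions are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, reps = 20, 50, 1.0, 2000
theta = rng.standard_normal(d)

risks = []
for _ in range(reps):
    X = theta + sigma * rng.standard_normal((n, d))   # n i.i.d. N(theta, sigma^2 I_d)
    theta_hat = X.mean(axis=0)                        # sample mean estimator
    risks.append(np.sum((theta_hat - theta) ** 2))

print("empirical risk:", np.mean(risks))
print("sigma^2 * d/n :", sigma**2 * d / n)            # matches up to Monte Carlo error
```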
2.2. Example II: Sparse linear regression
Consider the following sparse linear regression , where
is a fixed design matrix, and
is a sparse vector with
. Here
is a known sparsity parameter, and we are interested in the minimax risk
of estimating
under the squared
loss. Note that when
, this problem is reduced to the sparse Gaussian mean estimation.
We apply the generalized Fano’s inequality to this problem. Since a natural difficulty of this problem is to recover the support of , we introduce the random vector
, where
is a parameter to be specified later,
,
denotes the restriction of
onto
, and
is uniformly chosen from all size-
subsets of
. Clearly,
By Lemma 4,
where is the Frobenius norm of
. Now it remains to upper bound
for
. Fix any
such that
holds with non-zero probability (otherwise the upper bound is trivial), and by symmetry we may assume that
. Now by triangle inequality,
implies that
. Hence,
for a numerical constant after some algebra. Consequently, Theorem 8 gives
Finally, choosing for some small constant
gives
In the special cases where or
consists of i.i.d.
entries, we arrive at the tight lower bound
for both sparse Gaussian mean estimation and compressed sensing.
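The hypothesis class used in this argument can be pictured with a few lines of code: a common magnitude placed on a uniformly random size-s support, so that pairwise distances are proportional to the symmetric difference of the supports. The magnitude gamma below is an illustrative placeholder for the tuned value in the displays above.

```python
# Sketch of the sparse hypothesis class in the generalized Fano argument:
# beta_S = gamma * 1_S for a size-s subset S of [d]. The value of gamma is an
# illustrative placeholder for the tuned choice in the text.
import numpy as np
from itertools import combinations

d, s, gamma = 12, 3, 0.5
betas = []
for S in combinations(range(d), s):
    b = np.zeros(d)
    b[list(S)] = gamma
    betas.append(b)
betas = np.array(betas)
print("number of hypotheses:", len(betas), " (= C(d, s))")

# pairwise squared distance = gamma^2 * |symmetric difference of the supports|
i, j = 0, 1
print("||beta_i - beta_j||^2 =", np.sum((betas[i] - betas[j]) ** 2),
      " vs gamma^2 * |S_i xor S_j| =", gamma**2 * np.sum(betas[i] != betas[j]))
```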
2.3. Example III: Multi-armed bandit
Next we revisit the example of the multi-armed bandit in Lecture 5. Let be the time horizon,
be the total number of arms. For each
, the reward of pulling arm
at time
is an independent random variable
. The target of the learner is to devise a policy
where
is the arm to pull at time
, and
may depend on the entire observed history
and
. The learner would like to minimize the following worst-case regret:
where is a technical condition. As in Lecture 5, our target is to show that
via multiple hypothesis testing.
To prove the lower bound, a natural idea is to construct hypotheses where the
-th arm is the optimal arm under the
-th hypothesis. Specifically, we set
where is some parameter to be specified later. The construction is not entirely symmetric, and the reward distribution under the first arm is always
.
For a mean vector and policy
, let
be the non-negative loss function for this online learning problem. Then clearly the separation condition in Lemma 1 is fulfilled with
Next, applying Lemma 2 to a star graph
with center
gives
where in step (a) we denote by the distribution of the observed rewards under mean vector
, step (b) follows from the inequality
presented in Lecture 5, step (c) is the evaluation of the KL divergence where
denotes the expected number of pullings of arm
under the mean vector
, step (d) follows from Jensen’s inequality, and step (e) is due to the deterministic inequality
. Now choosing
gives the desired lower bound.
The main reason why we apply Lemma 2 instead of Fano’s inequality and choose the same reward distribution for arm at all times is to deal with the random number of pullings of different arms under different reward distributions. In the above way, we may stick to the expectation
and apply the deterministic inequality
to handle the randomness.
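For intuition, the following sketch simulates a standard bandit algorithm (UCB1) on instances of the above flavor, where one arm is elevated by a gap Delta of order sqrt(K/T); the instance and the choice of Delta are illustrative rather than the exact construction above, but the worst-case regret is seen to be of order sqrt(K*T).

```python
# On "hard" bandit instances (one arm elevated by Delta ~ sqrt(K/T)), even a
# good policy such as UCB1 incurs regret on the order of sqrt(K*T). The
# instance and Delta are illustrative choices, not the exact construction above.
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 20_000
Delta = 0.5 * np.sqrt(K / T)

def run_ucb(best_arm):
    mu = np.full(K, 0.5)
    mu[best_arm] += Delta
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                  # pull each arm once first
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < mu[arm])           # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        regret += mu[best_arm] - mu[arm]
    return regret

worst = max(run_ucb(j) for j in range(K))
print("worst-case regret:", worst, " vs sqrt(K*T):", np.sqrt(K * T))
```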
2.4. Example IV: Gaussian mixture estimation
Finally we look at a more involved example in nonparametric statistics. Let be a density on
which is a Gaussian mixture, i.e.,
for some density
, where
denotes the convolution. We consider the estimation of the density
given
i.i.d. observations
from
, and we denote by
the minimax risk under the squared
loss between real-valued functions. The central claim is that
which is slightly larger than the (squared) parametric rate .
Before proving this lower bound, we first gain some insights from the upper bound. In the problem formulation, the only restriction on the density is that
must be a Gaussian mixture. Since convolution makes functions smoother,
should be fairly smooth, and harmonic analysis suggests that the extent of smoothness is reflected in the speed of decay of the Fourier transform of
. Let
be the Fourier transform of
, we have
and . Hence, if we truncate
to zero for all
, the
approximation error would be smaller than
. This suggests considering the kernel
on
with
Then the density estimator has mean
, which by Parseval’s identity has squared bias
Moreover, by Parseval again,
and a triangle inequality leads to the desired result.
The upper bound analysis motivates us to apply Fourier-type arguments in the lower bound. For (with integer
to be specified later), consider the density
where is some centered probability density,
is some parameter to be specified later,
are perturbation functions with
, and
is the PDF of
. To apply Assouad’s lemma to this hypercube structure, we require the following orthogonality condition
to ensure the separation condition in Theorem 5 with . By Plancherel’s identity, the above condition is equivalent to
Recall from Lecture 7 that the Hermite polynomials are orthogonal under the normal distribution; a natural candidate for is
, where we have used that
, and we restrict to odd degrees to ensure that
. This choice enjoys another nice property where the inverse Fourier transform of
has the following closed-form expression for
:
Moreover, since for large
, we have
. Therefore, we must have
to ensure the non-negativity of the density
.
Now the only remaining condition is to check that the -divergence between neighboring vertices is at most
, which is essentially equivalent to
A natural choice of is the PDF of
, with
to be specified. Splitting the above integral over
into
and
for some large constant
, the orthogonality relation of
gives
To evaluate the integral over , recall that
behaves as
for large
. Hence, the natural requirement
leads to the same upper bound
for the second integral, and therefore the largest choice of
is
.
To sum up, we have arrived at the lower bound
subject to the constraints and
. Hence, the optimal choices for the auxiliary parameters are
, leading to the desired lower bound.
3. Bibliographic Notes
The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977), who also applied Fano’s lemma (Fano (1952)) to a statistical setting. Assouad’s lemma is due to Assouad (1983), and we also refer to the survey paper Yu (1997). The tree-based technique (Lemma 2 and the analysis of the multi-armed bandit) is due to Gao et al. (2019), and the generalized Fano’s inequality is motivated by the distance-based Fano’s inequality (Duchi and Wainwright (2013)); the current form is presented in the lecture notes Duchi (2019).
For the examples, the lower bound for sparse linear regression is taken from Candes and Davenport (2013), and we also refer to Donoho and Johnstone (1994), Raskutti, Wainwright and Yu (2011), and Zhang, Wainwright and Jordan (2017) for related results. For the full proof of the Gaussian mixture example we refer to Kim (2014).
- Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977.
- Robert M. Fano, Class notes for transmission of information. Massachusetts Institute of Technology, Tech. Rep (1952).
- Patrice Assouad, Densité et dimension. Annales de l’Institut Fourier. Vol. 33. No. 3. 1983.
- Bin Yu, Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam. Springer, New York, NY, 1997. 423–435.
- Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou, Batched multi-armed bandit problem. Advances in Neural Information Processing Systems, 2019.
- John C. Duchi, and Martin J. Wainwright. Distance-based and continuum Fano inequalities with applications to statistical estimation. arXiv preprint arXiv:1311.2669 (2013).
- John C. Duchi, Lecture notes for statistics 311/electrical engineering 377. 2019.
- Emmanuel J. Candes, and Mark A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis 34.2 (2013): 317–323.
- David L. Donoho, and Iain M. Johnstone. Minimax risk over
-balls for
-error. Probability Theory and Related Fields 99.2 (1994): 277–303.
- Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over
-balls. IEEE transactions on information theory 57.10 (2011): 6976–6994.
- Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. Electronic Journal of Statistics 11.1 (2017): 752-799.
- Arlene K. H. Kim. Minimax bounds for estimation of normal mixtures. Bernoulli 20.4 (2014): 1802–1818.
Lecture 7: Mixture vs. Mixture and Moment Matching
(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In the last two lectures we have seen two specific categories of Le Cam’s two-point method, i.e., testing between two single hypotheses, or between a single hypothesis and a mixture of hypotheses. Of course, the most powerful and natural generalization of the two-point method is to test between two mixtures of distributions, which by the minimax theorem is potentially the best possible approach to test between two composite hypotheses. However, the least favorable prior (mixture) may be hard to find, and it may be theoretically difficult to upper bound the total variation distance between two mixtures of distributions. In this lecture, we show that both problems are closely related to moment matching via examples in Gaussian and Poisson models.
1. Fuzzy Hypothesis Testing and Moment Matching
In this section, we present the general tools necessary for this lecture. First, we prove a generalized two-point method also known as the fuzzy hypothesis testing. To upper bound the crucial divergence term between two mixtures, we introduce the orthogonal polynomials under different distributions and show that the upper bound depends on moment differences of the mixtures.
1.1. Mixture vs. Mixture
We first state the theorem of the generalized two-point method with two mixtures. As usual, we require that any two points in the respective mixtures be well separated but the mixtures be indistinguishable from samples. However, since many natural choices of mixtures may not satisfy the well-separated property in the worst case, the next theorem is a bit more flexible and only requires that the mixtures be separated with large probability.
Theorem 1 (Mixture vs. Mixture) Let
be any loss function, and there exist
and
such that
Then for any probability measures
supported on
, we have
Proof: For , let
be the conditional probability measure of
conditioned on
, i.e.,
Simple algebra gives . By coupling, we also have
Now the desired result follows from the standard two-point arguments and the triangle inequality of the total variation distance.
The central quantity in Theorem 1 is , the total variation distance between mixture distributions with priors
and
. A general upper bound on this quantity is very hard to obtain, but the next two sections will show that it is small when the moments of
and
are close to each other if the model
is Gaussian or Poisson.
1.2. Hermite and Charlier Polynomials
This section reviews some preliminary results on orthogonal polynomials under a fixed probability distribution. Let be a probability measure on
with
and all moments finite. Recall that functions
with
are called orthogonal under
iff
for and all
. By orthogonal polynomials we mean that for each
,
is a polynomial with degree
.
The simplest way to construct orthogonal polynomials is via the Gram–Schmidt orthogonalization. Specifically, we may choose , and
to be the orthogonal component of
projected onto
with the inner product structure
. This approach works for general distributions and can be easily implemented in practice, but it gives little insight on the properties of
. We shall apply a new approach to arrive at orthogonal functions, which turn out to be polynomials in Gaussian and Poisson models.
Let be a family of distributions on
with
. Assume that
admits the following local expansion around
:
The next lemma claims that under specific conditions of , the functions
are orthogonal under
.
Lemma 2 Under the above conditions, if for all
the quantity
depends only on their product
and
, then the functions
are orthogonal under
.
Remark 1 Recall that the quantity
plays an important role in the Ingster-Suslina method in Lecture 6. The upper bounds in the next section can be thought of as a generalization of the Ingster-Suslina method, with the help of proper orthogonality properties.
Proof: The local expansion of likelihood ratio gives
The condition of Lemma 2 implies that the coefficient of the monomial on the RHS with
must be zero, as desired.
Exercise 1 Show that under the conditions in Lemma 2,
for some scalar
and any
. In other words,
is an unbiased estimator of
up to scaling in the location model
.
The condition of Lemma 2 is satisfied by various well-known probability distributions. For example,
and for any ,
In fact, the functions given in the defining equation (1) in the Gaussian and Poisson models are both polynomials of degree
, known as the Hermite polynomial
and the Charlier polynomial
, respectively. The proof of Lemma 2 gives the following orthogonal relations:
Throughout this lecture, we won’t need the specific forms of the Hermite and Charlier polynomials. We shall only need the defining property (1) and the above orthogonal relations.
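The orthogonality relations can also be checked numerically. The snippet below verifies, by Monte Carlo, the standard relation E[He_m(X) He_n(X)] = n! 1{m = n} for the probabilists’ Hermite polynomials under the standard normal distribution (the Charlier polynomials satisfy an analogous relation under the Poisson distribution).

```python
# Numerical check of the orthogonality relation for the probabilists' Hermite
# polynomials He_n under N(0, 1): E[He_m(X) He_n(X)] = n! * 1{m = n}.
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(0)
x = rng.standard_normal(2_000_000)

def He(n, x):
    c = np.zeros(n + 1)
    c[n] = 1.0
    return hermeval(x, c)          # evaluates the degree-n Hermite_e polynomial

for m in range(4):
    row = [np.mean(He(m, x) * He(n, x)) for n in range(4)]
    exact = [math.factorial(n) if n == m else 0 for n in range(4)]
    print([round(v, 2) for v in row], " expected:", exact)
```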
1.3. Divergences between Mixtures
Now we are ready to present the upper bound on the total variation distance between Gaussian mixture and Poisson mixture models. We also provide upper bounds on the divergence due to its nice tensorization property.
We first deal with the Gaussian location model. Let and
be two random variables on
, and let
be the Gaussian mixture with random mean
.
Theorem 3 (Divergence between Gaussian Mixtures) For any
, we have
Moreover, if
and
, then
Proof: By translation we may assume that . Let
be the pdf of
, and
. Then
where step (a) is due to the defining property (1), step (b) follows from the Cauchy–Schwarz inequality, and step (c) uses the orthogonal relation of . Hence the upper bound on the total variation distance is proved.
For the -divergence, first note that by Jensen’s inequality,
As a result,
where again we have used the defining property (1) and the orthogonality in the last two identities.
Specifically, Theorem 3 shows that when the moments of and
are close, then the corresponding Gaussian mixtures are statistically close. Similar results also hold for Poisson mixtures.
Theorem 4 (Divergence between Poisson Mixtures) For any
and random variables
supported on
, we have
Moreover, if
and
almost surely, then
Proof: The proofs of both inequalities essentially follow the same lines as those in the proof of Theorem 3, with the Hermite polynomial
. The only difference is that when
and
almost surely, for all
we have
2. Examples in Gaussian Models
In this section, we present examples in Gaussian location models where we need to test between two mixtures. In these examples, we match moments up to either some finite degree, or some large and growing degrees such that the information divergences become extremely small.
2.1. Gaussian Mixture Models
Consider the following two-component Gaussian mixture model where i.i.d. samples
are drawn from
. One possible task of proper learning is to estimate the parameters
within
-distance to the truth up to permutation. In other words, the target is to recover the components of the Gaussian mixture, which in practice helps to perform tasks such as clustering. The target is to determine the optimal sample complexity of this problem, assuming that
is unknown but bounded away from
and
(e.g.,
), and the overall variance of the mixture is at most
, where
is some prespecified parameter.
To derive a lower bound, the two-point method suggests finding two sets of parameters which are
-separated, while the information divergence between these two mixtures is
. However, since Theorem 3 can only deal with mixtures with an identical variance in each component, we cannot simply take
to be a discrete random variable supported on two points
. To overcome this difficulty, note that
where denotes convolution of probability measures. Hence, we may treat the overall mixture as
, and choose
and similarly for
. Now the desired
-divergence becomes
To apply Theorem 3, we should construct random variables and
with as many matched moments as possible. Since there are
free parameters in the two-component Gaussian mixtures, we expect that
and
can only have matched moments up to degree
. A specific choice can be as follows:
Clearly both and
have overall variance
. Replacing
by their centered version, Theorem 3 gives
as long as . Hence, by the additivity of the
-divergence, we conclude that
is a lower bound on the sample complexity.
The seemingly strange bound is also tight for this problem, and the idea is to estimate the first
moments of the mixture and then show that close moments imply close parameters. We leave the details to the reference in the bibliographic notes.
2.2. -norm Estimation of Bounded Gaussian Mean
Consider the Gaussian location model with unit variance, where the mean vector
satisfies
. The target here is to estimate the
norm of the mean vector
, and let
be the minimax risk under the absolute value loss. The main result here is to prove the following tight lower bound:
2.2.1. Failure of Point vs. Mixture
Motivated by the point vs. mixture approach in the last lecture, one natural idea is to test between and a composite hypothesis
, where
is a parameter to be specified later such that
and
are indistinguishable in the minimax sense. Consequently, this approach gives the lower bound
, and the target is to find some
as large as possible. We claim that the largest possible
is
, which is strictly smaller than the desired minimax risk.
Let , and consider any prior distribution
supported on
. Then the Ingster-Suslina method gives
Using the Taylor expansion and the inequality
for all
, we conclude that
Note that the above inequality holds for any supported on
. Hence, when
, these hypotheses
and
become statistically distinguishable, and therefore the best possible lower bound from the point vs. mixture approach is
.
There is also another way to show the desired failure, i.e., we may construct an explicit test which reliably distinguishes between and
. The idea is to apply the
test, i.e., compute the statistic
. Clearly under
we have
, and after some algebra we may show that
under
. Since
implies
, we conclude that comparing
with a suitable threshold
results in a reliable test.
2.2.2. Moment Matching and Polynomial Approximation
The previous section shows that testing between a single distribution and a mixture does not work, since the knowledge of the single distribution can be used in the test and may make the problem significantly easier. Hence, an improvement is to consider two composite hypotheses
and
, where
are parameters to be chosen later. For the priors
on
, Theorem 3 motivates us to use product priors
where
are probability measures on
with matched moments up to degree
(to be chosen later). The specific choices of
must fulfill the following requirements:
- Have matched moments up to degree
while with the quantity
as large as possible;
- For
, the prior
is supported on
;
- The
-divergence is upper bounded by
, or in other words,
.
We check the above requirements in the reverse order and specify the choices of and
gradually. For the last requirement, Theorem 3 with
gives
As a result, to have , it suffices to take
.
The second constraint that be (almost) supported on
is also easy. Set
where is a large enough numerical constant. The idea behind the above choices is that, under the product distribution
, the random variable
is the sum of
i.i.d. random variables taking value in
following distribution
. Then by the sub-Gaussian concentration, the
norm is centered at
with fluctuation
. Hence, for large
both probabilities
and
are small, which fulfills the conditions in Theorem 1.
The most non-trivial requirement is the first requirement, which by our choice of essentially aims to maximize the difference
subject to the constraint that the probability measures
are supported on
and have matching first
moments. The following lemma shows the duality between moment matching and best polynomial approximation.
Lemma 5 For any bounded interval
and real-valued function
on
, let
be the maximum difference
subject to the constraint that the probability measures
are supported on
and have matching first
moments. Then
where
denotes the best degree-
polynomial approximation error of
on the interval
:
Proof: It is an easy exercise to show that . We present two proofs for the hard direction
. The first proof is an abstract proof which holds for general basis functions other than monomials, while the construction of the measures
is implicit. The second proof gives an explicit construction, while some properties of polynomials are used in the proof.
(First Proof) Consider the following linear functional , with
for
and
. Equipped with the
norm on functions, it is easy to show that the operator norm of
is
. By the Hahn-Banach theorem, the linear functional
can be extended to
without increasing the operator norm. Then by the Riesz representation theorem, there exists a signed Radon measure
on
such that
The fact implies that the total variation of
is one. Write
by Jordan decomposition of signed measures, then
and
satisfy the desired properties.
(Second Proof) Let be a degree-
polynomial with
on
. Since
is a Haar basis on
, Chebyshev’s alternation theorem shows that there exist
points
such that
with
or
. Consider the signed measure
supported on
with
where is a normalizing constant such that
. Then by simple algebra,
and
has total variation
. Moreover, the following identity is given by Lagrange interpolation
and comparing the coefficient of on both sides gives
for
. Now another Jordan decomposition of
gives the desired result.
By Lemma 5, it boils down to the best degree- polynomial approximation error of
on
. This error is analyzed in approximation theory and summarized in the following lemma.
Lemma 6 There is a numerical constant (known as the Bernstein’s constant)
such that
Hence, by Lemmas 5 and 6, the condition of Theorem 1 is satisfied with . As a result, we finally arrive at
.
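The 1/K decay in Lemma 6 is easy to observe numerically. The snippet below approximates |x| on [-1, 1] by Chebyshev interpolation, which is not the best uniform approximation appearing in Lemma 6 but whose error decays at the same 1/K rate.

```python
# Illustration of Lemma 6: polynomial approximation of |x| on [-1, 1] has
# uniform error of order 1/K. Chebyshev interpolation is a convenient (though
# not optimal) stand-in for the best uniform approximation.
import numpy as np

grid = np.linspace(-1, 1, 200_001)
for K in [4, 8, 16, 32, 64]:
    nodes = np.cos((2 * np.arange(K + 1) + 1) * np.pi / (2 * (K + 1)))  # Chebyshev nodes
    coeffs = np.polynomial.chebyshev.chebfit(nodes, np.abs(nodes), K)
    err = np.max(np.abs(np.abs(grid) - np.polynomial.chebyshev.chebval(grid, coeffs)))
    print(f"K = {K:3d}   uniform error = {err:.4f}   K * error = {K * err:.3f}")
```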
3. Examples in Poisson Models
In this section, we present examples in i.i.d. sampling models from a discrete distribution with a large support. We show that a general Poissonization technique allows us to operate in the simpler Poisson models, and use moment matching to either finite or growing degrees to establish tight lower bounds.
3.1. Poissonization and Approximate Distribution
Throughout this section the statistical model is i.i.d. sampling from a discrete distribution , where
denotes the sample size and
denotes the support size. It is well-known that the histogram
with
constitutes a sufficient statistic. Moreover, for all
, and
are negatively dependent. To remove the dependence among different bins, recall the following Poissonized model:
Definition 7 (Poissonization) In the Poissonized model,
for all
and they are mutually independent.
In other words, in the Poissonized model we draw a random number of samples from
and then compute the histogram. In Lecture 3 we have shown the asymptotic equivalence of the i.i.d. sampling model and the Poissonized model, but the arguments are highly asymptotic. The following lemma establishes a non-asymptotic relationship.
Lemma 8 For a given statistical task, let
and
be the minimax risk under the i.i.d. sampling model and the Poissonized model with design sample size
, respectively. Then
Proof: Let be the Bayes risks under prior
in respective models. The desired inequality for Bayes risks follows from the identity
the Poisson tail bounds and the monotonicity of Bayes risks . Now the minimax theorem gives the desired lemma.
To establish the first identity, simply note that under any prior distribution , the Bayes estimator under the Poissonized model given the realization
is exactly the Bayes estimator under the i.i.d. sampling model with
samples.
Lemma 8 shows that the minimax risk essentially does not change after Poissonization. We sometimes also consider approximate distributions in the Poissonized model, where may not necessarily sum to one (note that the distribution of the histogram is still well-defined). The approximate distribution is typically used in lower bounds where a product prior is assigned to
and cannot preserve the distribution property. For statistical problems where the objective function or hypothesis depends on the vector
in a nice way that changing
into
with
will not change the objective much, in the lower bound it typically suffices to consider the approximate distributions in the Poissonized model. The key idea is that, conditioning on
, the histogram
is exactly distributed as that in the i.i.d. sampling model from
with
samples. Then we may construct the estimator in the Poissonized model from the hypothetical optimal estimators (with different sample sizes) in the i.i.d. sampling model, and applying the same tail bounds as in the proof of Lemma 8 suffices. The details of the arguments may vary from example to example, but in most scenarios it will not hurt the lower bound.
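The effect of Poissonization is easy to see in simulation: with a Poisson(n) sample size, the bin counts of the histogram become independent Poisson(n*p_i) variables, whereas under a fixed sample size they are negatively correlated multinomial counts. The sketch below compares the two (the specific distribution p is arbitrary).

```python
# Poissonization in simulation: drawing N ~ Poisson(n) samples from p and
# forming the histogram gives independent Poisson(n * p_i) counts per bin,
# versus the (dependent) multinomial counts under a fixed sample size n.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, np.array([0.5, 0.3, 0.2])
reps = 50_000

poissonized = np.array([rng.multinomial(rng.poisson(n), p) for _ in range(reps)])
fixed_n = rng.multinomial(n, p, size=reps)

print("Cov(h_1, h_2), Poissonized:", np.cov(poissonized[:, 0], poissonized[:, 1])[0, 1])
print("Cov(h_1, h_2), fixed n    :", np.cov(fixed_n[:, 0], fixed_n[:, 1])[0, 1])
print("Var(h_1) vs n*p_1 (Poisson):", poissonized[:, 0].var(), n * p[0])
```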
3.2. Generalized Uniformity Testing
Consider the following generalized uniformity testing problem: given i.i.d. observations from some discrete distribution
supported on at most
elements, one would like to test whether the underlying distribution
is uniform on its support. Note the difference from the traditional uniformity testing problem: the distribution
may be supported on a subset of
while still being uniform on this subset. Specifically, the task is to determine the sample complexity of distinguishing from the hypothesis
uniform on its support and
is
-away from any uniform distribution with support
under
distance. We will show that the desired sample complexity is lower bounded by
Note that the first term trivially follows from Paninski’s construction in the traditional uniformity testing problem; the goal is to prove the second term. It is expected that the second term captures the difficulty of recovering the support of the distribution, and therefore we should consider a mixture of uniform distributions with random support in
. Specifically, we consider the following mixture: let
be two random variables with
Assign the -fold product distribution of
(or
) to the probability vector
, then
forms an approximate probability distribution since
. Moreover, under prior
the (normalized) distribution is always uniform, and under prior
the (normalized) distribution is
-far from any uniform distribution supported on a subset of
with high probability. Hence, neglecting the additional details for the approximate distribution, it suffices to show that
To establish the above bound for the -divergence, note that the random variables
are chosen carefully with
for
. Moreover,
Consequently, Theorem 4 gives that
Since is a stronger lower bound than
if and only if
, under this condition and
we will have
. Then the above inequality gives
which is the claimed upper bound on the -divergence, establishing the lower bound.
We provide some discussions on the choice of and
. Theorem 4 suggests that if
and
could match more moments, the
-divergence could even be smaller. However, the number of matched moments is in fact limited by the problem structure. In the traditional uniformity testing problem, in the last lecture we essentially choose
and
as above. In this case,
and
only match the first moment, which is the best possible for
must be a constant. In generalized uniformity testing,
must only be supported on two points, one of which is zero. Meanwhile, the support of
can potentially be arbitrarily large. The next lemma shows that no matter how we choose
, we can match at most the first two moments.
Lemma 9 Let
be a probability measure supported on
elements of
, one of which is zero, and
be any probability measure supported on
. Then if
and
match the first
moments, we must have
.
Proof: Let . Let the support of
be
. Consider the polynomial
of degree
, the assumption gives
. Finally, since
is always non-negative, we have
.
Lemma 9 applied to shows that moment matching up to degree
is the best we can hope for. In fact, the above lower bound is also tight (see bibliographic notes).
3.3. Shannon Entropy Estimation
Finally we revisit the Shannon entropy estimation problem where the target is to estimate the Shannon entropy . Let
be the minimax risk of estimating
under the mean squared error, our target is to show that
The lower bound has already been shown via the two-point method in Lecture 5, and therefore the remaining target is to establish
.
Similar to the norm example above, the Shannon entropy
is a symmetric sum of individual functions of
, where the individual function is non-differentiable at zero. This suggests applying similar ideas based on moment matching and best polynomial approximation to establish the lower bound. Specifically, the target is to construct two priors
on the interval
(with parameter
to be chosen later) such that:
- The priors
have matched moments up to degree
(to be chosen later);
- The difference between
and
is large;
- With high probability, the Shannon entropy
under
and
is well-separated;
- The common mean
is at most
.
Note that the first requirement (moment matching) ensures a small TV distance between the mixtures, the second and third requirements ensure the separation property (i.e., lower bound the mean difference and upper bound the fluctuations), and the last requirement ensures that sums to a constant smaller than one (recall that
by assumption) and therefore setting
gives a valid probability vector
. We check these requirements one by one to find proper parameters
.
For the first requirement, recall that Theorem 4 gives
Since the individual TV distance should be at most (in view of the triangle inequality to be used later), we should choose
, where
is due to the assumption
and that
to make the first term
become dominant in the minimax risk.
For the second requirement, by the duality result in Lemma 5 it is essentially the best degree- polynomial approximation error of
on
. The next lemma gives the best approximation error.
Lemma 10
By Lemma 10, the target is to maximize subject to the previous condition
. Simple algebra shows that the maximum is
, with the maximizer
.
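Since the display stating Lemma 10 is not reproduced above, here is a hedged numerical sketch of the classical fact presumably behind it: the best degree-K uniform approximation error of x ↦ x log x on [0,1] decays at the rate K^{-2}. The Chebyshev interpolant below is only a proxy for the best approximation (it is within a logarithmic factor of it), so the rescaled errors should grow at most slowly with K.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def f(x):
    # x * log(x) with the convention 0 * log(0) = 0
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x * np.log(np.clip(x, 1e-300, None)), 0.0)

xs = np.linspace(0.0, 1.0, 20001)
for K in (4, 8, 16, 32, 64):
    # Interpolation at Chebyshev points on [0,1]: a proxy for the best degree-K approximation
    p = Chebyshev.interpolate(f, K, domain=[0, 1])
    err = np.max(np.abs(f(xs) - p(xs)))
    print(f"K={K:>3}:  sup-error = {err:.2e},  K^2 * error = {K**2 * err:.3f}")
```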
To resolve the third requirement, note that the mean difference of is
by the above choice of
and
. Moreover, since
for some constant
if
, the sub-Gaussian concentration shows that the fluctuation of
under both
is at most
. Since
, the fluctuation is indeed negligible compared with the mean difference. Careful analysis also shows that the contribution of
to the entropy difference is negligible.
The last requirement calls for proper modifications of the priors constructed in Lemma 5 to satisfy the mean constraint. This can be done via the following change-of-measure trick: let
be the priors constructed in Lemma 5, which attain
, whose value is summarized in the following lemma.
Lemma 11
Next we construct the priors as follows: for
, set
Then it is easy to show that both and
are probability measures, have matched moments up to degree
, and have mean
. Moreover,
Hence, the fourth requirement is fulfilled without hurting the previous ones.
In summary, Theorem 1 holds with , and we arrive at the desired lower bound
.
4. Bibliographic Notes
The method of two fuzzy hypotheses (Theorem 1) is used systematically in Ingster and Suslina (2012) on nonparametric testing, and the current form of the theorem is taken from Theorem 2.14 of Tsybakov (2009). Statistical closeness of Gaussian mixture models via moment matching (Theorem 3) was established in Cai and Low (2011), Hardt and Price (2015), and Wu and Yang (2018) for the -divergence, while the result for the TV distance is new here. For Theorem 4, the TV version was established in part by Jiao, Venkat, Han and Weissman (2015) and Jiao, Han and Weissman (2018), where the
version is new here. When
, a stronger bound on the TV distance without the square root is also available in Wu and Yang (2016). For more properties of Hermite and Charlier polynomials, we refer to Labelle and Yeh (1989).
For Gaussian examples, the proper learning of two-component Gaussian mixtures was established in Hardt and Price (2015). The norm estimation problem was taken from Cai and Low (2011), which was further motivated by Lepski, Nemirovski and Spokoiny (1999). The Gaussian mean testing example under
metric was taken from Ingster and Suslina (2012). For technical lemmas, the proofs of Lemma 5 are taken from Lepski, Nemirovski and Spokoiny (1999) and Wu and Yang (2016), respectively, and Lemma 6 is due to Bernstein (1912).
For Poisson examples, the non-asymptotic equivalence between the i.i.d. sampling model and the Poissonized model is due to Jiao, Venkat, Han and Weissman (2015). The tight bounds for the generalized uniformity testing problem are due to Batu and Canonne (2017) for constant
, and Diakonikolas, Kane and Stewart (2018) for general , where their proof is greatly simplified here thanks to Theorem 4. For Shannon entropy estimation, the optimal sample complexity was obtained in Valiant and Valiant (2011), and the minimax risk was obtained independently in Jiao, Venkat, Han and Weissman (2015) and Wu and Yang (2016). For the tools in approximation theory used to establish Lemmas 10 and 11, we refer to the books DeVore and Lorentz (1993) and Ditzian and Totik (2012).
- Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012.
- Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
- T. Tony Cai, and Mark G. Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics 39.2 (2011): 1012–1041.
- Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two gaussians. Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, 2015.
- Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. arXiv preprint arXiv:1807.07237 (2018).
- Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835–2885.
- Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the
distance. IEEE Transactions on Information Theory 64.10 (2018): 6672–6706.
- Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702–3720.
- Jacques Labelle, and Yeong Nan Yeh. The combinatorics of Laguerre, Charlier, and Hermite polynomials. Studies in Applied Mathematics 80.1 (1989): 25–36.
- Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the L_r norm of a regression function. Probability Theory and Related Fields 113.2 (1999): 221–253.
- Serge Bernstein. Sur l’ordre de la meilleure approximation des fonctions continues par des polynomes de degré donné. Vol. 4. Hayez, imprimeur des académies royales, 1912.
- Tugkan Batu, and Clément L. Canonne. Generalized uniformity testing. 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2017.
- Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018.
- Gregory Valiant and Paul Valiant. The power of linear estimators. 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science. IEEE, 2011.
- Ronald A. DeVore and George G. Lorentz. Constructive approximation. Vol. 303. Springer Science & Business Media, 1993.
- Zeev Ditzian and Vilmos Totik. Moduli of smoothness. Vol. 9. Springer Science & Business Media, 2012.
Lecture 6: Generalized Two-Point Method: Point vs. Mixture
Blog, Online Lectures (Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In the last lecture we introduced Le Cam’s two-point method to establish information-theoretic lower bounds, where the target is to find two single hypotheses which are well separated while indistinguishable. However, restricting to two single hypotheses gives the learner a lot of information in many scenarios, so that any pair of single hypotheses which are far apart becomes distinguishable. From another point of view, if we would like to mimic the Bayesian idea of using the least favorable prior, it is very unlikely that the least favorable prior is supported on only two points. Hence, in this lecture we generalize the two-point idea to the case where one point is a mixture of distributions, which works well in various examples.
1. General Tools
1.1. A Motivating Example
We first discuss an example where the simple two-point method fails. Consider the following hidden clique detection problem: there is an underlying undirected simple graph with vertex set
which either comes from an Erdös-Rényi graph
(where each edge appears with probability
independently), or there is a clique on
vertices of
and all other edges appear with probability
independently. Recall that a clique is a complete graph in which every two vertices are connected by an edge. The target is to reliably distinguish between these two cases, i.e., a vanilla Erdös-Rényi graph versus one planted with a hidden clique of size
.
This task is a composite hypothesis testing problem, in which consists of a single hypothesis where
, while
consists of a family of hypotheses depending on the precise location of the hidden clique. Of course, an Erdös-Rényi graph is unlikely to have large cliques, and thus this task becomes easier when
is large. Let
be the threshold such that the total error probability of the optimal test is at most
when
and is at least
when
.
We first try to use Le Cam’s two-point method to find a lower bound on . Let
be the distribution of
, and
be the distribution of the graph where there is a planted clique on the vertex set
with
. To compute the total variation distance between
and
, note that with probability
the Erdös-Rényi graph has a clique supported on
. Hence,
holds for any . As a result, we have
to ensure that
, which is not an informative bound.
The problem with the above two-point approach is that, if the learner is told the exact location of the clique, she can simply look at the subgraph supported on that location and tell whether a clique is present. In other words, any single hypothesis from gives too much information to the learner. To overcome this difficulty, a natural idea is to take the support
of the clique uniformly distributed on all size-
subsets of
, which is expected to result in a graph looking more similar to the Erdös-Rényi graph. This is exactly the central topic of this lecture.
1.2. Generalized Le Cam’s Method: Point vs. Mixture
As shown in the previous example, there is a need to generalize Le Cam’s two-point method to the scenario where one of the points is a mixture of distributions. The theoretical foundation of this generalization is in fact quite simple, and the proof of the next theorem parallels that of Theorem 7 in Lecture 5.
Theorem 1 (Point vs. Mixture) Let
be any loss function, and suppose there exist
and
such that
then for any probability measure
supported on
, we have
The main difficulties in the practical application of Theorem 1 lie in two aspects. First, it may be non-trivial to find the proper mixture and the support
. Second, upper bounding the total variation distance
may be hard, especially when
takes the form of
. The next section proposes a general tool to overcome the second difficulty, and we will use various examples to illustrate the art of handling the first one.
1.3. Upper Bounding the Divergence
To upper bound the divergence between a single distribution and a mixture of distributions, the next lemma shows that the -divergence is a proper choice.
Lemma 2 Let
be a family of probability distributions parametrized by
, and
be any fixed probability distribution. Then for any probability measure
supported on
, we have
where the expectation is taken with respect to independent
.
Proof: Follows from the definition and simple algebra.
The advantages of Lemma 2 are two-fold. First, it factorizes well for tensor products:
This identity makes it more convenient to use -divergence to deal with the point-mixture scenario. Second, for many statistical models (and a proper choice of
), the quantity
admits a simple closed-form expression in . We will see several examples in the rest of this lecture.
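For concreteness, the identity in question is presumably the standard tensorization from the Ingster–Suslina chi-square method (stated here as a reconstruction, since the original display is not reproduced): combining Lemma 2 with the product structure of the observations yields
\[
1+\chi^2\Big(\mathbb{E}_{\theta\sim\mu}\,P_\theta^{\otimes n},\; Q^{\otimes n}\Big)
= \mathbb{E}_{\theta,\theta'}\left[\left(\int \frac{\mathrm{d}P_\theta\,\mathrm{d}P_{\theta'}}{\mathrm{d}Q}\right)^{n}\right],
\]
where \(\theta,\theta'\) are drawn independently from \(\mu\). In particular, the whole \(n\)-sample computation reduces to controlling the single-observation overlap integral \(\int \mathrm{d}P_\theta\,\mathrm{d}P_{\theta'}/\mathrm{d}Q\).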
2. Examples of Testing Problems
In this section, we provide examples of composite hypothesis testing problems where the null hypothesis consists of a single probability distribution, while the alternative
consists of a family of distributions. We will see that choosing a proper mixture and applying the tools in Lemma 2 gives desired lower bounds.
2.1. Hidden Clique Detection
Recall the hidden clique detection problem described in Section 1.1, where we aim to obtain tight lower bounds for the threshold on the clique size. As discussed, the point is chosen to be the single distribution
for the Erdös-Rényi graph, and the mixture is taken to be
where
is uniformly distributed on all size-
subsets of
. It remains to upper bound
, which by Lemma 2 equals
By simple algebra,
Since the probability of for
is
, we conclude that
After carefully bounding the binomial coefficients, we conclude that the -divergence is
as long as
. Hence, the threshold
has the lower bound
.
This lower bound is tight. Note that the expected number of size- cliques in
is exactly
Using some algebra and Markov’s inequality, we conclude that if , then with high probability the Erdös-Rényi graph has no size-
cliques. Hence, by enumerating size-
cliques in the graph, we have
.
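The first-moment computation above is easy to reproduce numerically. The sketch below assumes edge probability 1/2 and the standard formula E[# of size-k cliques] = C(n,k)·2^{-C(k,2)} (both are assumptions, since the displayed quantities above are not reproduced); it finds the smallest k at which this expectation drops below one and compares it with the classical 2·log2(n) benchmark.

```python
import math

def expected_cliques(n: int, k: int) -> float:
    """Expected number of k-cliques in G(n, 1/2): C(n, k) * 2^(-C(k, 2))."""
    return math.comb(n, k) * 2.0 ** (-math.comb(k, 2))

for n in (100, 1000, 10_000, 100_000):
    k = 1
    while expected_cliques(n, k) >= 1.0:
        k += 1
    # By Markov's inequality, G(n, 1/2) has no k-clique with high probability beyond this point.
    print(f"n={n:>6}:  smallest k with E[#cliques] < 1 is {k:>2},  2*log2(n) = {2 * math.log2(n):.1f}")
```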
2.2. Uniformity Testing
The uniformity testing problem, also known as goodness-of-fit testing in theoretical computer science, is as follows. Given i.i.d. observations from a discrete distribution
supported on at most
elements, one would like to test between
where is some prescribed parameter (which may depend on
), and
denotes the uniform distribution over
. The target is to determine the sample complexity of this problem, i.e., find the smallest sample size
such that there exists a test with total error probability at most
.
Note that is a composite hypothesis. If we would like to choose a single distribution from
, the most natural choice would be
. For the learner to distinguish between
and
, she could combine the first and second halves of the symbols into two super-symbols, with uniform distribution
under
, and distribution
under
. Consequently,
samples are sufficient to distinguish between these two cases, which loses the dependence on the support size
. As before, the problem of choosing a single hypothesis in
is that the learner then knows the exact locations of the symbols with probability larger than
and can combine them.
Hence, we should pick several distributions from and consider a proper mixture. The solution is to use the following construction due to Paninski: let the random vector
be distributed uniformly on
(assuming that
is even), and consider the mixture
, with
Clearly, each belongs to
. Moreover,
Therefore, by Lemma 2 and the discussions thereafter, we have
where the last inequality follows from the -sub-Gaussianity of the vector
. Hence, as long as
, the above
-divergence is upper bounded by
. Therefore, the sample complexity of the uniformity testing problem is at least
, which is also tight (see bibliographic notes for details).
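As a sanity check on the construction, the sketch below builds one perturbed distribution from Paninski's family and verifies that it is a valid probability vector at total variation distance exactly ε from the uniform distribution. The perturbation size 2ε/k per symbol is my reading of the standard construction, since the displayed formula above is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps = 10, 0.1                      # support size (even) and proximity parameter
u = rng.choice([-1, 1], size=k // 2)  # one random sign per pair of symbols

# Perturbed distribution: the two symbols of pair i receive mass (1 +/- 2*eps*u_i)/k.
# (The exact perturbation size is an assumption; see the lead-in above.)
p_u = np.empty(k)
p_u[0::2] = (1 + 2 * eps * u) / k
p_u[1::2] = (1 - 2 * eps * u) / k

uniform = np.full(k, 1 / k)
print("valid distribution:", np.isclose(p_u.sum(), 1.0) and np.all(p_u >= 0))
print("TV distance from uniform:", 0.5 * np.abs(p_u - uniform).sum())  # equals eps
```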
2.3. General Identity Testing
Next we generalize the uniformity testing problem to the case where the uniform distribution can be replaced by any given probability distribution
. Specifically, given
i.i.d. observations from a discrete distribution
supported on at most
elements, one would like to test between
where is some prescribed parameter, and
is a fixed known probability distribution. Again, the target is to determine the sample complexity of this identity testing problem, in terms of
and
.
For simplicity, we assume that is an even integer, and the components of
appear in pairs, i.e.,
, etc. We reparametrize
by
. We also assume that
for all
. The general case requires proper trimming on
, and we refer to the bibliographic notes for details. As in the previous Paninski construction, we again consider a uniform mixture in
, but with different perturbation sizes. Specifically, let random vector
be distributed uniformly on
(assuming that
is even), and consider the mixture
, with
Here the parameters will be specified later. Note that
is a valid probability distribution if
for all
, and the assumption
requires that
. As before,
where again the last step follows from the sub-Gaussianity. Since our target is to find the largest possible such that the
-divergence is
, we aim to find
‘s to minimize the quantity
subject to the above constraints. To do so, we recall the following elementary inequality.
Lemma 3 (Hölder’s Inequality) Let
be non-negative reals for all
. Then
Proof: By homogeneity we may assume that for all
. Then the desired inequality follows from the AM-GM inequality
By Lemma 3, we have the following inequality:
which gives the lower bound , with identity when
It is straightforward to check that for all
thanks to the assumption
. Consequently, we arrive at the lower bound
for the sample complexity, which is tight in general. Note how the mysterious -norm is obtained by carefully choosing the mixture.
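To see where the 2/3-norm comes from, here is a minimal numerical sketch of the Hölder step. The optimization below (minimize Σ_i ε_i⁴/p_i² subject to Σ_i ε_i = ε over nonnegative perturbations, with the pairing structure ignored) is my reading of the missing displays, so treat the exact formulation as an assumption; the equality case of Lemma 3 then gives the minimizer ε_i ∝ p_i^{2/3} and the minimum value ε⁴/‖p‖_{2/3}², and random feasible perturbations indeed do no better.

```python
import numpy as np

rng = np.random.default_rng(1)
k, eps = 16, 0.05
p = rng.dirichlet(0.5 * np.ones(k))     # a generic known reference distribution

def objective(e):
    # Quantity appearing in the chi-square bound (assumed form): sum_i eps_i^4 / p_i^2
    return np.sum(e**4 / p**2)

# Hoelder equality case: eps_i proportional to p_i^(2/3)
e_star = eps * p ** (2 / 3) / np.sum(p ** (2 / 3))
best = objective(e_star)
closed_form = eps**4 / np.sum(p ** (2 / 3)) ** 3   # = eps^4 / ||p||_{2/3}^2
print("optimal value:", best, " closed form:", closed_form)

# Random feasible perturbations (nonnegative, summing to eps) are never better.
for _ in range(5):
    e = rng.random(k)
    e *= eps / e.sum()
    print("random feasible value:", objective(e), ">=", best)
```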
3. Examples of Estimation Problems
In this section, we will look at several estimation problems and show how Theorem 1 provides the tight lower bound for these problems. We will see that the choice of the mixture becomes natural after we understand the problem.
3.1. Linear Functional Estimation of Sparse Gaussian Mean
Consider a Gaussian location model , where we assume that the mean vector
is
-sparse, i.e.,
. The target is to estimate the linear functional of
, i.e.,
, and we aim to prove a tight minimax lower bound for the minimax risk
under the quadratic loss.
The key feature of this problem is the dependence of the minimax risk on the sparsity parameter . If one knew the support of
, summing over
for
would give a natural estimator with risk
. However, due to the unknown support of
, we will show that
for small
, where the additional penalty
is essentially the logarithm of
, the number of possible supports of
.
To incorporate the effect of unknown support into the lower bound, a natural choice would be to consider a random mean vector with support uniformly distributed on all size- subsets of
. Formally, let
, and
with
being the indicator vector of the subset
, and
is a parameter to be specified. Moreover, we let
be uniformly distributed on all possible size-
subsets of
. Clearly, the separability condition in Theorem 1 is satisfied with
. To upper bound the
-divergence, we again compute
and consequently
The random variable follows a hypergeometric distribution with pmf
. It is well known in probability theory that the hypergeometric distribution refers to sampling without replacement, while the binomial distribution
refers to sampling with replacement. The next lemma shows a relationship between these two random variables.
Lemma 4 Let
be an urn,
be samples uniformly drawn from the urn without replacement, and
be samples uniformly drawn from the urn with replacement. Define
and
, then there exists a coupling between
such that
.
Proof: The coupling is defined as follows: for any distinct and not necessarily distinct
, let the conditional distribution be
where denotes the number of distinct elements in
. To verify that this is a valid conditional probability measure, recall that the number of ways of drawing
labeled elements with
distinct elements from an urn of
elements is
, where
is the Stirling number of the second kind, and
is the falling factorial. Hence, the probability is valid due to the identity
where the last step is due to the fact that is the number of ways of drawing
labeled elements from an urn of
elements and therefore equals
.
From this conditional probability, it is easy to see that the unconditional distribution of coincides with sampling with replacement. Moreover,
follows from symmetry, and therefore .
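Combined with Jensen's inequality as in the next paragraph, Lemma 4 implies that E[g(H)] ≤ E[g(B)] for every convex function g, where H is hypergeometric and B is the corresponding binomial. A quick numerical check with g(x) = exp(tx) (an illustration of this consequence, not part of the original argument):

```python
import numpy as np
from scipy.stats import binom, hypergeom

# Urn with M balls, n of them marked; draw N balls.
M, n, N = 50, 20, 15
H = hypergeom(M, n, N)     # sampling without replacement
B = binom(N, n / M)        # sampling with replacement

k = np.arange(N + 1)
for t in (0.2, 0.5, 1.0, 2.0):
    mgf_H = np.sum(np.exp(t * k) * H.pmf(k))
    mgf_B = np.sum(np.exp(t * k) * B.pmf(k))
    # Convex-order consequence of the coupling plus Jensen: E[exp(tH)] <= E[exp(tB)]
    print(f"t={t}:  E[exp(tH)] = {mgf_H:10.4f}  <=  E[exp(tB)] = {mgf_B:10.4f}")
```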
By Lemma 4, there exists some -algebra
such that
has the same distribution as
with
. Now by Jensen’s inequality,
Hence, the largest to ensure a constant
-divergence is
, which by Theorem 1 gives
which has a phase transition at . In fact, this lower bound is also tight: if
, the natural estimator
is rate-optimal; if
, the thresholding estimator
is rate-optimal for some proper threshold
.
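To complement the lower bound, here is a minimal simulation sketch of the two estimators mentioned above, for one observation per coordinate with unit noise. The threshold value sqrt(2 log d) and the planted signal strength are my illustrative choices (the rate-optimal threshold in the text is a missing expression):

```python
import numpy as np

rng = np.random.default_rng(2)
d, s = 10_000, 20                           # ambient dimension and sparsity
theta = np.zeros(d)
theta[:s] = 10.0                            # an s-sparse mean vector (hypothetical values)
target = theta.sum()                        # linear functional: sum of the coordinates

tau = np.sqrt(2 * np.log(d))                # illustrative threshold
errs_plain, errs_thresh = [], []
for _ in range(200):
    y = theta + rng.standard_normal(d)      # Gaussian location model
    errs_plain.append((y.sum() - target) ** 2)
    y_thr = np.where(np.abs(y) > tau, y, 0.0)
    errs_thresh.append((y_thr.sum() - target) ** 2)

print("plain sum:        MSE ~", np.mean(errs_plain), "(about d =", d, ")")
print("thresholded sum:  MSE ~", np.mean(errs_thresh), "(much smaller for sparse means)")
```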
3.2. Largest Eigenvalue of Covariance Matrix
Consider i.i.d. observations , where
is the covariance matrix. Assuming
, classical matrix concentration inequalities show that the empirical covariance matrix
satisfies that
We show that the bound is rate-optimal. In fact, we may even show a stronger result: the minimax risk
in estimating
under the absolute value loss is at least
Note that by the triangle inequality, estimating the largest eigenvalue is indeed an easier problem than covariance estimation.
As before, a natural idea to establish the lower bound is to choose a uniformly random direction for the principal eigenvector, so that the learner must distribute her effort evenly over all directions. Specifically, we choose , and
, where
are some parameters to be chosen later, and
is uniformly distributed on the unit ball
. It is easy to verify that
is fulfilled as long as
, and
in Theorem 1. Moreover,
Hence, by Lemma 2,
assuming , where the last inequality follows from
whenever
. Computation the surface area of spherical caps gives
Consequently, to make , we may take
which gives the desired minimax lower bound .
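A quick simulation sketch of the matching upper bound, with identity covariance chosen for convenience: the error of the top eigenvalue of the empirical covariance matrix scales like sqrt(d/n) (up to constants) in the regime n ≫ d.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50
for n in (200, 800, 3200, 12800):
    errs = []
    for _ in range(50):
        X = rng.standard_normal((n, d))          # i.i.d. N(0, I_d) observations
        emp_cov = X.T @ X / n
        top = np.linalg.eigvalsh(emp_cov)[-1]    # largest empirical eigenvalue
        errs.append(abs(top - 1.0))              # the true largest eigenvalue is 1
    print(f"n={n:>6}:  mean error = {np.mean(errs):.4f},  sqrt(d/n) = {np.sqrt(d / n):.4f}")
```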
3.3. Quadratic Functional Estimation
Finally we switch to a nonparametric example. Let be i.i.d. observations drawn from a density
, where
is supported on
and belongs to the Hölder ball with smoothness parameter
:
with . The target is to estimate the quadratic functional
and we let be the corresponding minimax risk under the absolute value loss. The central claim is that
which has an “elbow effect” at .
A central tool for proving nonparametric lower bounds is parametric reduction, i.e., restricting to a parametric subfamily such that the subproblem is as hard as the original nonparametric problem. In this example, a natural candidate is
where is the bandwidth to be chosen later, the vector
satisfies
, and
is a fixed smooth function on
vanishing outside
. It is a simple exercise to show that for
sufficiently small, each
is a valid density on
and belongs to the Hölder ball
. Intuitively, the density
is a perturbation of the uniform density with bumps modulated by the vector
. Moreover,
and therefore the target of the subproblem is to estimate based on
. Again, the idea of Theorem 1 can be applied.
Let be uniformly distributed on
. By simple calculation,
Consequently, Lemma 2 gives
Hence, choosing gives
. Moreover, Theorem 1 is fulfilled with
, and therefore we have proved that
The parametric rate is trivial (by the tools in Lecture 4).
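A numerical sketch of the parametric subfamily used above. The exact form of the bumps is an assumption reconstructed from the description: f_v(x) = 1 + Σ_j v_j h^s g((x − (j−1)h)/h) with a zero-mean bump g supported on [0,1] (smoothness of this particular g at the endpoints is ignored for the check). The sketch verifies that each f_v is a probability density and that its quadratic functional exceeds that of the uniform density by about h^{2s}·∫g², independently of the sign pattern, which is the separation exploited in the lower bound.

```python
import numpy as np

h, s = 0.05, 1.0                       # bandwidth and smoothness (illustrative values)
m = int(round(1 / h))                  # number of bumps
rng = np.random.default_rng(4)
v = rng.choice([-1.0, 1.0], size=m)    # sign pattern modulating the bumps

def g(t):
    # zero-mean bump supported on [0, 1] (endpoint smoothness ignored in this check)
    return np.where((t > 0) & (t < 1), np.sin(2 * np.pi * t), 0.0)

x = (np.arange(1_000_000) + 0.5) / 1_000_000       # midpoint grid on [0, 1]
f = np.ones_like(x)
for j in range(m):
    f += v[j] * h**s * g((x - j * h) / h)

print("nonnegative:", f.min() >= 0)
print("integral of f      :", f.mean())            # ~1, so f_v is a density
print("integral of f^2 - 1:", (f**2).mean() - 1)   # ~ h^(2s) * int(g^2), independent of the signs
print("predicted value    :", h ** (2 * s) * np.mean(g(x) ** 2))
```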
4. Bibliographic Notes
A wonderful book which treats the point-mixture idea of Theorem 1 systematically and contains the method in Lemma 2 is Ingster and Suslina (2012). The sharp bounds on the clique size in the hidden clique problem are due to Matula (1972). It is a long-standing open problem whether there exists a polynomial-time algorithm attaining this statistical limit, and we refer to the lecture notes Lugosi (2017) for details. The sample complexity of the uniformity testing problem is due to Paninski (2008), and the general identity testing problem is studied in Valiant and Valiant (2017). There are also some other generalizations and tools for uniformity testing, e.g., Diakonikolas and Kane (2016), Diakonikolas, Kane and Stewart (2018), and Blais, Canonne and Gur (2019).
For the estimation problem, the linear functional estimation for sparse Gaussian mean is due to Collier, Comminges and Tsybakov (2017), where Lemma 4 can be found in Aldous (1985). The largest eigenvalue example is taken from the lecture note Wu (2017). The nonparametric functional estimation problem is widely studied in statistics, where the minimax rate for estimating the quadratic functional can be found in Bickel and Ritov (1988), and it is for general dimension
. The same minimax rate (and the elbow effect) actually holds for all smooth functionals, see Birgé and Massart (1995), Kerkyacharian and Picard (1996) and Robins et al (2017).
- Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012.
- David W. Matula. Employee party problem. Notices of the American Mathematical Society. Vol. 19, No. 2, 1972.
- Gábor Lugosi. Lectures on Combinatorial Statistics. 2017.
- Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory 54.10 (2008): 4750–4755.
- Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing 46.1 (2017): 429–455.
- Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2016.
- Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018.
- Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory (TOCT) 11.2 (2019): 6.
- Olivier Collier, Laëtitia Comminges, and Alexandre B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics 45.3 (2017): 923–958.
- David J. Aldous. Exchangeability and related topics. Ecole d’Eté de Probabilités de Saint-Flour XIII-1983. Springer, Berlin, Heidelberg, 1985. 1–198.
- Yihong Wu. Lecture notes for ECE598YW: Information-theoretic methods for high-dimensional statistics. 2017.
- Peter J. Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya: The Indian Journal of Statistics, Series A (1988): 381–393.
- Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. The Annals of Statistics 23.1 (1995): 11–29.
- Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics 24.2 (1996): 485–507.
- James Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and Aad van der Vaart. Higher order estimating equations for high-dimensional models. Annals of statistics 45, no. 5 (2017): 1951.