(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In previous lectures we saw the idea of model reduction and its application to establishing asymptotic lower bounds. However, in many scenarios we would like to find good non-asymptotic bounds, and as the examples in Lecture 3 showed, finding such reductions may be quite difficult. In this and the next few lectures, we introduce the testing idea, which transforms the target problem (usually an estimation problem) into a suitable testing problem. Despite its conceptual simplicity, the testing problem may take various forms, some of which are difficult to analyze. This will be the central topic of the next few lectures.
1. $f$-divergences and Inequalities
1.1. $f$-divergences
First we look at possibly the simplest testing problem: let $P$ and $Q$ be two probability measures on the same probability space, and let a random sample $X$ be drawn from either $P$ or $Q$. A (binary) testing problem aims to find some test $\Psi = \Psi(X) \in \{0,1\}$ which with high probability outputs $0$ when $X \sim P$, and outputs $1$ when $X \sim Q$. To quantify the performance of a given test $\Psi$, we define the total error probability as follows:

$$P(\Psi(X) = 1) + Q(\Psi(X) = 0).$$
Intuitively, when $P$ and $Q$ are close to each other, we expect the total error probability of any test to be large, for the two measures are hard to distinguish. Conversely, when $P$ and $Q$ are very far from each other (e.g., with disjoint supports), there exists some test which distinguishes them well. The next lemma gives the expression of the minimum total error probability.
Lemma 1 (Le Cam’s first lemma) Minimizing over all measurable tests $\Psi$ gives

$$\inf_{\Psi}\left\{ P(\Psi(X) = 1) + Q(\Psi(X) = 0) \right\} = 1 - \mathrm{TV}(P, Q).$$

Recall that $\mathrm{TV}(P,Q) := \sup_A |P(A) - Q(A)| = \frac{1}{2}\int |dP - dQ|$ is the total variation distance between $P$ and $Q$.
Proof: Let $A := \{x: \Psi(x) = 1\}$, then

$$P(\Psi(X) = 1) + Q(\Psi(X) = 0) = P(A) + 1 - Q(A) = 1 - \left(Q(A) - P(A)\right) \ge 1 - \mathrm{TV}(P,Q),$$

with equality if and only if $A = \{dQ > dP\}$. $\Box$
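As a quick numerical sanity check of Lemma 1 (a toy three-point example of my own, not from the notes), the sketch below brute-forces all deterministic tests on a finite alphabet and compares the best total error with $1 - \mathrm{TV}(P,Q)$:

```python
import itertools
import numpy as np

# Toy distributions on a 3-point alphabet; the optimal test outputs 1 ("Q")
# exactly on the set A = {Q > P}, with total error 1 - TV(P, Q).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.2, 0.6])

tv = 0.5 * np.abs(P - Q).sum()   # TV(P, Q) = 0.4 here

# Brute force over all 2^3 deterministic tests Psi: mask[i] = 1 means Psi = 1.
best = min(
    P[np.array(mask, dtype=bool)].sum() + Q[~np.array(mask, dtype=bool)].sum()
    for mask in itertools.product([0, 1], repeat=3)
)
print(best, 1 - tv)   # both equal 0.6
```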
By Lemma 1, the total variation distance between $P$ and $Q$ exactly quantifies the minimum total error probability of distinguishing between two probability distributions. Similar to Lemma 1, if we impose different weights on the two error probabilities, we will arrive at divergences of the form

$$\int (dP - t\,dQ)_+ = 1 - \inf_{\Psi}\left\{ P(\Psi = 1) + t\,Q(\Psi = 0) \right\},$$

where $t \ge 0$ is some constant, namely the weight on the error probability $Q(\Psi = 0)$. Recall from convex analysis that for any convex function $f$ on a compact interval of $[0,\infty)$, its second-order derivative $f''$ always exists in the sense of a positive Radon measure, and

$$f(x) = c_0 + c_1 x + \int_0^\infty (x - t)_+\, f''(dt) \qquad\qquad (1)$$

for some $c_0, c_1 \in \mathbb{R}$.
Remark 1 The decomposition (1) of convex functions has several applications, e.g., in the local time analysis of continuous martingales (cf. Tanaka’s formula).
Motivated by (1), we can actually take convex combinations of the above divergences and arrive at the general class of $f$-divergences.
Definition 2 ($f$-divergence) For a convex function $f$ on $[0,\infty)$ with $f(1) = 0$, the $f$-divergence between probability measures $P$ and $Q$ on $(\Omega, \mathcal{F})$ is defined as

$$D_f(P\|Q) := \int_\Omega f\left(\frac{dP}{dQ}\right) dQ + f'(\infty)\cdot P^{\perp}(\Omega),$$

where $\frac{dP}{dQ}\,dQ$ is the absolutely continuous part of $P$ w.r.t. $Q$ (so that $P = \frac{dP}{dQ}\,dQ + P^\perp$ is the Lebesgue decomposition), and $f'(\infty) := \lim_{t\to\infty} f(t)/t$.
In most practical scenarios, we have $P \ll Q$, and therefore $D_f(P\|Q) = \mathbb{E}_Q\, f\left(\frac{dP}{dQ}\right)$. Some well-known examples of $f$-divergences are as follows.
Example 1
1. $f(t) = \frac{1}{2}|t - 1|$: total variation distance $\mathrm{TV}(P,Q) = \frac{1}{2}\int |dP - dQ|$;
2. $f(t) = (1 - \sqrt{t})^2$: squared Hellinger distance $H^2(P,Q) = \int (\sqrt{dP} - \sqrt{dQ})^2$;
3. $f(t) = t\log t$: Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}(P\|Q) = \int dP \log\frac{dP}{dQ}$;
4. $f(t) = (t - 1)^2$: $\chi^2$-divergence $\chi^2(P\|Q) = \int \frac{(dP - dQ)^2}{dQ}$.
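The following minimal sketch (function names are mine) evaluates $D_f(P\|Q) = \mathbb{E}_Q f(dP/dQ)$ on a finite alphabet with $P \ll Q$ and checks that the four choices of $f$ above reproduce the classical closed forms:

```python
import numpy as np

def f_divergence(P, Q, f):
    """Generic D_f(P || Q) = E_Q f(dP/dQ) for finite P << Q."""
    r = P / Q                      # likelihood ratio dP/dQ
    return float(np.sum(Q * f(r)))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

tv   = f_divergence(P, Q, lambda t: 0.5 * np.abs(t - 1))
hsq  = f_divergence(P, Q, lambda t: (1 - np.sqrt(t)) ** 2)   # H^2(P, Q)
kl   = f_divergence(P, Q, lambda t: t * np.log(t))           # KL(P || Q)
chi2 = f_divergence(P, Q, lambda t: (t - 1) ** 2)            # chi^2(P || Q)

assert np.isclose(tv, 0.5 * np.abs(P - Q).sum())
assert np.isclose(hsq, ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum())
assert np.isclose(kl, (P * np.log(P / Q)).sum())
assert np.isclose(chi2, ((P - Q) ** 2 / Q).sum())
```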
The following properties hold for any $f$-divergence.

Lemma 3 $D_f(P\|Q) \ge 0$, and $(P,Q) \mapsto D_f(P\|Q)$ is jointly convex in $(P,Q)$. In particular, $D_f(P_Y\|Q_Y) \le D_f(P_X\|Q_X)$ for any transition kernel $P_{Y|X}$, where $P_Y$ (resp. $Q_Y$) denotes the output distribution of $P_X$ (resp. $Q_X$) through the kernel.
Remark 2 The last inequality is known as the data processing inequality.
Proof: The non-negativity follows from convexity and Jensen’s inequality: for $P \ll Q$, $D_f(P\|Q) \ge f(\mathbb{E}_Q[\frac{dP}{dQ}]) = f(1) = 0$. The joint convexity is due to that of the perspective transform $(p,q) \mapsto q f(p/q)$ for convex $f$. Let $P_Y, Q_Y$ be the outputs of $P_X, Q_X$ through the kernel $P_{Y|X}$; then the last inequality follows from the joint convexity and

$$P_Y(\cdot) = \int P_{Y|X}(\cdot \mid x)\, P_X(dx), \qquad Q_Y(\cdot) = \int P_{Y|X}(\cdot \mid x)\, Q_X(dx).$$

We leave the details to the readers. $\Box$
1.2. Joint range
Lemma 1 involves the computation of the total variation distance, which may be hard to evaluate exactly for most statistical models, especially when there is a tensor power structure. As a comparison, the Hellinger, KL and $\chi^2$-divergences tensorize better:

$$H^2\left(\prod_{i=1}^n P_i, \prod_{i=1}^n Q_i\right) = 2 - 2\prod_{i=1}^n\left(1 - \frac{H^2(P_i, Q_i)}{2}\right),$$

$$D_{\mathrm{KL}}\left(\prod_{i=1}^n P_i \,\Big\|\, \prod_{i=1}^n Q_i\right) = \sum_{i=1}^n D_{\mathrm{KL}}(P_i\|Q_i),$$

$$\chi^2\left(\prod_{i=1}^n P_i \,\Big\|\, \prod_{i=1}^n Q_i\right) = \prod_{i=1}^n\left(1 + \chi^2(P_i\|Q_i)\right) - 1.$$
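A short numerical check of these tensorization identities on the product of two toy binary distributions (the example is mine):

```python
import numpy as np

def hsq(P, Q):   return ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
def kl(P, Q):    return (P * np.log(P / Q)).sum()
def chi2(P, Q):  return ((P - Q) ** 2 / Q).sum()

P1, Q1 = np.array([0.6, 0.4]), np.array([0.3, 0.7])
P2, Q2 = np.array([0.2, 0.8]), np.array([0.5, 0.5])
P, Q = np.kron(P1, P2), np.kron(Q1, Q2)   # joint pmf of the independent pair

assert np.isclose(hsq(P, Q), 2 - 2 * (1 - hsq(P1, Q1) / 2) * (1 - hsq(P2, Q2) / 2))
assert np.isclose(kl(P, Q), kl(P1, Q1) + kl(P2, Q2))
assert np.isclose(chi2(P, Q), (1 + chi2(P1, Q1)) * (1 + chi2(P2, Q2)) - 1)
```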
Consequently, to apply Lemma 1 to proving lower bounds, we need upper bounds on the total variation distance in terms of the Hellinger, KL or $\chi^2$ divergences. We have seen examples of such inequalities in Lecture 3, but have no clue whether the claimed inequalities are tight. To obtain the best inequalities between different $f$-divergences, we study the joint range of $f$-divergences.
For two convex functions $f$ and $g$ on $[0,\infty)$ with $f(1) = g(1) = 0$, we say that a point $(s,t) \in \mathbb{R}^2$ is in the joint range of $(D_f, D_g)$ if we can find two distributions $P, Q$ over some common probability space such that $D_f(P\|Q) = s$ and $D_g(P\|Q) = t$. Hence, the joint range exactly characterizes the relationship between different $f$-divergences. The central result is as follows:
Theorem 4 Let $\mathcal{R} \subseteq \mathbb{R}^2$ be the joint range of $(D_f(P\|Q), D_g(P\|Q))$, and let $\mathcal{R}_k$ be the joint range of $(D_f(P\|Q), D_g(P\|Q))$ when both $P$ and $Q$ are discrete distributions supported on at most $k$ elements. Then

$$\mathcal{R} = \mathrm{co}(\mathcal{R}_2) = \mathcal{R}_4,$$

where $\mathrm{co}(\cdot)$ denotes the convex hull.
Theorem 4 provides a generic way to prove inequalities between different $f$-divergences. Specifically, it shows that to obtain the joint range, it suffices to choose binary distributions $P = (p, 1-p)$ and $Q = (q, 1-q)$, compute the pair $(D_f(P\|Q), D_g(P\|Q))$, vary $(p,q) \in [0,1]^2$ and then take the convex closure. The proof of Theorem 4 requires the following theorems from convex analysis.
Theorem 5 (Choquet–Bishop–de Leeuw) Let $K$ be a metrizable convex compact subset of a locally convex topological vector space $V$. For any point $x_0$ in $K$, there exists a probability measure $\mu$ supported on the extreme points of $K$ such that $x_0 = \int_K x\, \mu(dx)$.
Theorem 6 (Fenchel–Eggleston–Carathéodory) Let $S \subseteq \mathbb{R}^d$ be connected and $x \in \mathrm{co}(S)$. Then there exist $x_1, \cdots, x_d \in S$ and $\lambda_1, \cdots, \lambda_d \ge 0$ with $\sum_{i=1}^d \lambda_i = 1$ such that $x = \sum_{i=1}^d \lambda_i x_i$.
We are ready to prove the statement of Theorem 4.
Proof: For simplicity we always assume that $P \ll Q$, then $D_f(P\|Q) = \mathbb{E}_Q f(\frac{dP}{dQ})$. Viewing the likelihood ratio $\frac{dP}{dQ}$ as a non-negative random variable $X$ with $\mathbb{E}[X] = 1$, we conclude that

$$\mathcal{R} = \left\{ \left(\mathbb{E} f(X),\ \mathbb{E} g(X)\right) : X \ge 0,\ \mathbb{E}[X] = 1 \right\}.$$

Similarly, $\mathcal{R}_k$ has the same representation with the additional constraint that $X$ can take at most $k$ values. Let $\mathcal{P}$ be the set of all probability measures which are supported on $[0,\infty)$ and have unit mean. By the linearity of expectation and the convexity of $\mathcal{P}$ we conclude that $\mathcal{R}$ is convex. Moreover, since all extreme points of $\mathcal{P}$ are supported on at most two elements, by the Choquet–Bishop–de Leeuw theorem (note that any topological vector space is locally convex under the weak-$\star$ topology, where the relevant set of measures is compact and metrizable) we conclude that the distribution of $X$ can be written as a mixture of distributions supported on only two points. Then by the linearity of expectation again, we conclude that $\mathcal{R} = \mathrm{co}(\mathcal{R}_2)$.

For the second identity $\mathrm{co}(\mathcal{R}_2) = \mathcal{R}_4$, first note that $\mathcal{R}_2$ is a connected set, for it is a continuous image of the divergence pairs evaluated on the connected set of binary distribution pairs. Hence, by the Fenchel–Eggleston–Carathéodory theorem, any point in $\mathrm{co}(\mathcal{R}_2) \subseteq \mathbb{R}^2$ can be written as a convex combination of two points in $\mathcal{R}_2$, which gives rise to a random variable $X$ supported on at most $4$ elements. $\Box$
Based on Theorem 4, one may verify the following inequalities:
- TV vs. Hellinger:
$$\frac{H^2(P,Q)}{2} \le \mathrm{TV}(P,Q) \le H(P,Q)\sqrt{1 - \frac{H^2(P,Q)}{4}}.$$
Both inequalities are tight.
- TV vs. KL: Pinsker’s inequality gives
$$\mathrm{TV}(P,Q) \le \sqrt{\frac{D_{\mathrm{KL}}(P\|Q)}{2}}.$$
This bound is vacuous when $D_{\mathrm{KL}}(P\|Q) \ge 2$, where better bounds are available:
$$\mathrm{TV}(P,Q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(P\|Q)}} \le 1 - \frac{1}{2}e^{-D_{\mathrm{KL}}(P\|Q)}.$$
None of these bounds are tight.
- Hellinger vs. KL:
$$H^2(P,Q) \le D_{\mathrm{KL}}(P\|Q).$$
This bound is tight.
- KL vs. $\chi^2$:
$$D_{\mathrm{KL}}(P\|Q) \le \log\left(1 + \chi^2(P\|Q)\right).$$
This bound is tight.
The above inequalities will be used quite often whenever we switch between convenient divergences.
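As a sanity check, the sketch below verifies all of the above inequalities on randomly drawn pairs of distributions (a Monte Carlo illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(P, Q):   return 0.5 * np.abs(P - Q).sum()
def hsq(P, Q):  return ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
def kl(P, Q):   return (P * np.log(P / Q)).sum()
def chi2(P, Q): return ((P - Q) ** 2 / Q).sum()

for _ in range(10000):
    P, Q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    t, h2, d, c = tv(P, Q), hsq(P, Q), kl(P, Q), chi2(P, Q)
    assert h2 / 2 <= t <= np.sqrt(h2) * np.sqrt(1 - h2 / 4) + 1e-12
    assert t <= np.sqrt(d / 2) + 1e-12                   # Pinsker
    assert t <= np.sqrt(1 - np.exp(-d)) + 1e-12          # better bound for large KL
    assert h2 <= d + 1e-12 and d <= np.log(1 + c) + 1e-12
```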
2. Le Cam’s Two-point Method
One popular way to prove a non-asymptotic lower bound for statistical estimation is via hypothesis testing. Let $(P_\theta)_{\theta\in\Theta}$ be the statistical model of interest, and let the loss function be $L(\theta, \hat\theta)$ (at the moment we assume that the target is to estimate $\theta$ itself). The idea is to use the following weaker lower bound:

$$\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}_\theta L(\theta,\hat\theta) \ge \inf_{\hat\theta}\max_{\theta\in\{\theta_0,\theta_1\}}\mathbb{E}_\theta L(\theta,\hat\theta) \ge \inf_{\hat\theta}\frac{1}{2}\left(\mathbb{E}_{\theta_0}L(\theta_0,\hat\theta) + \mathbb{E}_{\theta_1}L(\theta_1,\hat\theta)\right).$$
Suppose that we can find two possible candidates of $\theta$, say, $\theta_0$ and $\theta_1$, such that the following two conditions hold:
- Large separation: the total loss $L(\theta_0, a) + L(\theta_1, a)$ is always large for any action $a$;
- Indistinguishability: based on the observation, the observer cannot reliably distinguish between $P_{\theta_0}$ and $P_{\theta_1}$;
then clearly the observer needs to pay a risk proportional to the separation. The next theorem makes this intuition precise.
Theorem 7 (Two-point Method) Let $L$ be any loss function, and $\theta_0, \theta_1 \in \Theta$ satisfy that

$$L(\theta_0, a) + L(\theta_1, a) \ge \Delta \qquad \text{for all } a,$$

then

$$\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}_\theta L(\theta, \hat\theta) \ge \frac{\Delta}{2}\left(1 - \mathrm{TV}(P_{\theta_0}, P_{\theta_1})\right).$$

Proof: The desired result follows from the following chain of inequalities:

$$\begin{aligned} \inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}_\theta L(\theta,\hat\theta) &\ge \inf_{\hat\theta}\frac{1}{2}\left(\mathbb{E}_{\theta_0}L(\theta_0,\hat\theta) + \mathbb{E}_{\theta_1}L(\theta_1,\hat\theta)\right) \\ &\ge \inf_{\hat\theta}\frac{1}{2}\int \left(L(\theta_0,\hat\theta) + L(\theta_1,\hat\theta)\right)\min\{dP_{\theta_0}, dP_{\theta_1}\} \\ &\ge \frac{\Delta}{2}\int \min\{dP_{\theta_0}, dP_{\theta_1}\} = \frac{\Delta}{2}\left(1 - \mathrm{TV}(P_{\theta_0}, P_{\theta_1})\right). \quad \Box \end{aligned}$$
Note that in Theorem 7, the parameter $\Delta$ characterizes the separation between the two points, and $1 - \mathrm{TV}(P_{\theta_0}, P_{\theta_1})$ is the minimum total error probability of testing between $P_{\theta_0}$ and $P_{\theta_1}$ (cf. Lemma 1). Typically these two quantities depend negatively on each other: when one becomes larger, the other gets smaller. A rule of thumb is to find $(\theta_0, \theta_1)$ such that $\mathrm{TV}(P_{\theta_0}, P_{\theta_1})$ is a constant, say $1/2$, and then search for the largest possible separation $\Delta$.
Although the proof of Theorem 7 does not explicitly apply a testing argument, the underlying test can be seen as follows. Let the test be $\Psi(X) = 0$ if $L(\theta_0, \hat\theta(X)) < \Delta/2$, and $\Psi(X) = 1$ otherwise; then the separation condition ensures that $\Psi(X) = i$ as long as $L(\theta_i, \hat\theta(X)) < \Delta/2$. Hence, by Markov’s inequality,

$$\sup_{\theta\in\Theta}\mathbb{E}_\theta L(\theta,\hat\theta) \ge \frac{\Delta}{2}\cdot\max_{i\in\{0,1\}} P_{\theta_i}\left(\Psi(X) \neq i\right) \ge \frac{\Delta}{4}\left(P_{\theta_0}(\Psi(X) = 1) + P_{\theta_1}(\Psi(X) = 0)\right).$$

Replacing the risk with the above lower bound, and repeating the analysis in the proof of Theorem 7, a direct application of Lemma 1 gives the same result (with $\Delta/2$ replaced by $\Delta/4$).
The two-point idea is simple but appears everywhere in various lower bounds, for it establishes the indistinguishability between two different tasks due to statistical uncertainty. Also, despite its conceptual simplicity, searching over all possible pairs of points is typically computationally expensive, and it requires effort to understand the problem structure and figure out the “correct” two points. In later lectures, this idea will be generalized in the following directions:
- Multiple points instead of two points;
- One or both of the two points may be a mixture of distributions, i.e., the binary test has a point-mixture or mixture-mixture structure;
- The choice of the two points may depend on the estimator $\hat\theta$.
3. Examples of Two-point Methods
The remainder of this lecture is devoted to the applications of the two-point method in various problems.
3.1. Example I: Parameter Estimation
We begin with the simplest example of the Gaussian location model $X_1,\cdots,X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, \sigma^2)$ for $\theta\in\mathbb{R}$, and our target is to estimate the mean $\theta$. Since a natural guess is that a local perturbation of $\theta$ makes the model indistinguishable, we choose $\theta_0 = 0$ and $\theta_1 = \delta$ in the two-point method. Let $R_{n,\sigma}^*$ be the minimax risk under the squared error loss; since $(a - \theta_0)^2 + (a - \theta_1)^2 \ge \delta^2/2$ for any $a$, Theorem 7 gives

$$R_{n,\sigma}^* \ge \frac{\delta^2}{4}\left(1 - \mathrm{TV}\left(\mathcal{N}(0,\sigma^2)^{\otimes n}, \mathcal{N}(\delta,\sigma^2)^{\otimes n}\right)\right) = \frac{\delta^2}{4}\left(2 - 2\Phi\left(\frac{\sqrt{n}\,\delta}{2\sigma}\right)\right),$$

where $\Phi$ is the CDF of $\mathcal{N}(0,1)$. Maximizing the above lower bound over $\delta > 0$ gives $R^*_{n,\sigma} \ge c\,\sigma^2/n$ for some absolute constant $c > 0$. Compared with the bound $\sigma^2/n$ derived in Lecture 4, the two-point bound loses a constant multiplicative factor.
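To see the constant explicitly, one may substitute $\delta = 2\sigma u/\sqrt{n}$, so the bound becomes $\frac{\sigma^2}{n}\cdot 2u^2(1 - \Phi(u))$; the following sketch (assuming SciPy is available) maximizes this factor numerically:

```python
import numpy as np
from scipy.stats import norm

# With delta = 2*sigma*u/sqrt(n), the two-point bound reads
# (sigma^2 / n) * 2 * u^2 * (1 - Phi(u)); the lost constant factor is
# max_u 2 * u^2 * (1 - Phi(u)), independent of n and sigma.
u = np.linspace(0, 5, 100001)
c = 2 * u ** 2 * (1 - norm.cdf(u))
print(c.max(), u[c.argmax()])   # ~0.33, attained around u ~ 1.2
```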
Next we consider the bounded mean scenario where $|\theta| \le B$ for a parameter $B > 0$ known in advance. Since the above $\delta$ is chosen to be of order $\sigma/\sqrt{n}$ in the end, this choice becomes infeasible if the upper bound $B$ is much smaller than $\sigma/\sqrt{n}$. Consequently, we can choose $\theta_0 = 0$ and $\theta_1 = \delta = \min\{B, \sigma/\sqrt{n}\}$, which gives

$$R^*_{n,\sigma,B} = \Omega\left(\min\left\{B^2, \frac{\sigma^2}{n}\right\}\right).$$

This lower bound is also tight, for one of the estimators $\hat\theta \equiv 0$ or $\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i$ attains it.
Consider another location model where one observes $X_1,\cdots,X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(\theta, \theta+1)$ and aims to estimate $\theta\in\mathbb{R}$. Again, choosing $\theta_0 = 0$ and $\theta_1 = \delta$, the minimax risk $R_n^*$ under the squared error loss is lower bounded by

$$R_n^* \ge \frac{\delta^2}{4}\left(1 - \mathrm{TV}\left(\mathrm{Unif}(0,1)^{\otimes n}, \mathrm{Unif}(\delta, 1+\delta)^{\otimes n}\right)\right).$$

By the triangle inequality,

$$\mathrm{TV}\left(\mathrm{Unif}(0,1)^{\otimes n}, \mathrm{Unif}(\delta, 1+\delta)^{\otimes n}\right) \le n\cdot\mathrm{TV}\left(\mathrm{Unif}(0,1), \mathrm{Unif}(\delta, 1+\delta)\right) = n\delta.$$

Hence, choosing $\delta = \frac{1}{2n}$ gives $R_n^* = \Omega(n^{-2})$. The rate $\Theta(n^{-2})$ is also tight for the minimax risk, for the estimator $\hat\theta = \min_{i\in[n]} X_i$ attains the rate.
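A small simulation (an illustration of mine, not from the notes) confirming that $\hat\theta = \min_i X_i$ indeed attains the $\Theta(n^{-2})$ rate:

```python
import numpy as np

# For Unif(theta, theta+1), min_i X_i - theta ~ Beta(1, n), so its mean
# squared error is 2 / ((n+1)(n+2)) ~ 2 / n^2; the printed values approach 2.
rng = np.random.default_rng(1)
theta = 0.7
for n in [10, 100, 1000]:
    X = rng.uniform(theta, theta + 1, size=(5000, n))
    mse = ((X.min(axis=1) - theta) ** 2).mean()
    print(n, n ** 2 * mse)
```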
Remark 3 The uniform location model is not QMD (see Lecture 4), thus it makes sense that the correct minimax rate differs from $\Theta(n^{-1})$. Moreover, the inequality $\mathrm{TV}(P^{\otimes n}, Q^{\otimes n}) \le n\cdot\mathrm{TV}(P,Q)$ may not be tight for other models.
3.2. Example II: Functional Estimation
Recall the discrete entropy estimation problem, where we observe $n$ i.i.d. samples from a discrete distribution $P = (p_1,\cdots,p_k)$, and the target is to estimate the Shannon entropy $H(P) := \sum_{i=1}^k -p_i\log p_i$. In Lecture 4 we proved an asymptotic minimax lower bound $\Omega(\frac{\log^2 k}{n})$ for the mean squared error, and here we show that the same bound holds non-asymptotically (even when $k$ grows with $n$).
The idea of using the two-point method here is that when we properly perturb the entries of a near-uniform distribution by a relative factor $1\pm\epsilon$, the value of the functional $H(P)$ changes by an additive factor of $\Theta(\epsilon\log k)$, while the KL divergence between these two cases is $O(\epsilon^2)$, hence $O(n\epsilon^2)$ for $n$ samples. Hence, choosing $\epsilon \asymp n^{-1/2}$ will complete the proof. However, some caution should be taken when choosing the specific points:
- The two points cannot be $P = (\frac{1}{k},\cdots,\frac{1}{k})$ and $Q = (\frac{1+\epsilon}{k}, \frac{1-\epsilon}{k}, \cdots, \frac{1+\epsilon}{k}, \frac{1-\epsilon}{k})$. In this way, since the uniform distribution maximizes the entropy and the perturbation is around the maximizer, we only have $H(P) - H(Q) = O(\epsilon^2)$, which is of the same order as the KL divergence and thus too small;
- The two points cannot be $P = (\frac{1}{k},\cdots,\frac{1}{k})$ and $Q = (\frac{1}{k}+\epsilon, \frac{1}{k}-\frac{\epsilon}{k-1}, \cdots, \frac{1}{k}-\frac{\epsilon}{k-1})$. In this way, we have $H(P) - H(Q) = O(k\epsilon^2)$ if $\epsilon = O(\frac{1}{k})$ and $D_{\mathrm{KL}}(P\|Q) = \Theta(\epsilon)$ otherwise. Hence, this choice fails for some parameters $(n,k)$.
The final choice of the two points is as follows:

$$P = \left(\frac{1}{2(k-1)}, \cdots, \frac{1}{2(k-1)}, \frac{1}{2}\right), \qquad Q = \left(\frac{1+\epsilon}{2(k-1)}, \cdots, \frac{1+\epsilon}{2(k-1)}, \frac{1-\epsilon}{2}\right).$$

Simple algebra gives $H(Q) - H(P) = \frac{\epsilon}{2}\log(k-1) + O(\epsilon^2)$, and

$$D_{\mathrm{KL}}(P\|Q) = \frac{1}{2}\log\frac{1}{1+\epsilon} + \frac{1}{2}\log\frac{1}{1-\epsilon} = -\frac{1}{2}\log(1-\epsilon^2) = O(\epsilon^2).$$

Now choosing $\epsilon \asymp n^{-1/2}$ completes the proof of the lower bound $\Omega(\frac{\log^2 k}{n})$. This example shows that the choice of the two points may sometimes be a little bit delicate.
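The following sketch numerically confirms the two claimed properties of this construction, namely $H(Q) - H(P) = \Theta(\epsilon\log k)$ and $D_{\mathrm{KL}}(P\|Q) = O(\epsilon^2)$:

```python
import numpy as np

def entropy(p):
    return float(-(p * np.log(p)).sum())

for k, eps in [(100, 0.01), (10000, 0.01), (10000, 0.001)]:
    P = np.append(np.full(k - 1, 1 / (2 * (k - 1))), 0.5)
    Q = np.append(np.full(k - 1, (1 + eps) / (2 * (k - 1))), (1 - eps) / 2)
    kl = float((P * np.log(P / Q)).sum())   # = -log(1 - eps^2) / 2
    # both printed ratios stay near 1/2, matching the claimed orders
    print(k, eps, (entropy(Q) - entropy(P)) / (eps * np.log(k)), kl / eps ** 2)
```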
3.3. Example III: Non-local Choice of Two Points
We return to the entropy estimation problem, but this time we choose different points to arrive at an entirely different lower bound. Note that all above choices of the two points are based on local perturbations, and the resulting bounds are essentially for the variance. This example will show that a more sophisticated construction can also take care of the bias part.
For observations $X^n = (X_1,\cdots,X_n)$, let $h^\downarrow = (h^\downarrow_1,\cdots,h^\downarrow_k)$ be the sorted histogram of $X^n$, i.e., $h^\downarrow$ is the sorted version of the histogram $h = (h_1,\cdots,h_k)$, $h_j := \sum_{i=1}^n \mathbb{1}(X_i = j)$, with $h^\downarrow_1 \ge h^\downarrow_2 \ge \cdots \ge h^\downarrow_k$. The next lemma shows that there exists a minimax estimator which depends only on $h^\downarrow$.
Lemma 8 For any estimator $\hat{H} = \hat{H}(X^n)$ of the entropy $H(P)$, there exists an estimator $\tilde{H} = \tilde{H}(h^\downarrow)$ depending only on the sorted histogram such that

$$\sup_P \mathbb{E}_P\left(\tilde{H} - H(P)\right)^2 \le \sup_P \mathbb{E}_P\left(\hat{H} - H(P)\right)^2.$$
Proof: The proof relies on the invariance of $H(P)$ under permutations of $P$. Let $\sigma$ be any permutation on $[k]$; clearly $H(P\circ\sigma) = H(P)$, and the distribution of $h^\downarrow$ under $P\circ\sigma$ is the same as that under $P$. Hence, letting $\tilde{H} := \mathbb{E}_\sigma[\hat{H}(\sigma(X^n))]$, where the average is over a uniformly random permutation $\sigma$ applied to the symbols of $X^n$, the estimator $\tilde{H}$ depends only on $h^\downarrow$, and the desired inequality follows from Jensen’s inequality. $\Box$
Lemma 8 shows that the model can be reduced to the one with observation $h^\downarrow$, which by the data processing inequality results in a smaller total variation distance between the two points. Now the new construction of the two points is motivated by the following lemma.
Lemma 9 Let $P = (p_1,\cdots,p_k)$ and $Q = (q_1,\cdots,q_k)$ be two probability vectors, with $\max_i \max\{p_i, q_i\} = O(1/n)$. Then there exists an absolute constant $C > 0$ such that

$$\mathrm{TV}\left(\mathcal{L}(h^\downarrow \mid P^{\otimes n}),\ \mathcal{L}(h^\downarrow \mid Q^{\otimes n})\right) \le \frac{1}{2}\sum_{m=1}^\infty (Cn)^m \left|\sum_{i=1}^k \left(p_i^m - q_i^m\right)\right|.$$
The proof of Lemma 9 is referred to the bibliographic notes. Lemma 9 shows that if the lower-order moments of $P$ and $Q$ are equal, the total variation distance between (the sorted histograms of) their tensor powers will be small. This fact motivates us to consider distributions $P$ and $Q$ which are $k/d$ repetitions of the vectors $(\frac{x_1}{k},\cdots,\frac{x_d}{k})$ and $(\frac{y_1}{k},\cdots,\frac{y_d}{k})$ respectively, where $d$ is some fixed integer, and the vectors $x, y \in \mathbb{R}^d$ with positive entries satisfy the following properties:
- $P$ and $Q$ are valid distributions: $\frac{1}{d}\sum_{i=1}^d x_i = \frac{1}{d}\sum_{i=1}^d y_i = 1$;
- $P$ and $Q$ have matched moments: $\sum_i p_i^m = \sum_i q_i^m$ for all $m = 1,\cdots,d-1$;
- $P$ and $Q$ have different entropies: $H(P) \neq H(Q)$.
The existence of such vectors is ensured by the following lemma.
Lemma 10 For any positive integer $d$, there exist vectors $x$ and $y$ with positive entries satisfying the above properties.
Proof: Fix any $d$-dimensional vector $y$ with distinct positive entries. Consider

$$p_\delta(t) := \prod_{i=1}^d (t - y_i) - \delta,$$

which for $\delta > 0$ small enough clearly has $d$ distinct positive zeros. Let $x_1,\cdots,x_d$ be the zeros of $p_\delta$; then comparing the coefficients of $t^m$ for $m = 1,\cdots,d-1$ on both sides of the identity $p_\delta(t) = \prod_{i=1}^d (t - x_i)$ and invoking Newton’s identities, the vectors $x$ and $y$ have matched power sums up to order $d-1$ (only the constant coefficient is changed by the perturbation). The first property is ensured via a simple common normalization, and it suffices to choose a suitable $\delta$ to ensure the third property. $\Box$
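A numerical rendering of this construction (my own sketch; the unit-mean normalization of the first property is omitted here) using the perturbed polynomial and its roots:

```python
import numpy as np

d, delta = 4, 1e-3
y = np.array([1.0, 2.0, 3.0, 4.0])     # distinct positive entries
coeffs = np.poly(y)                    # coefficients of prod_i (t - y_i)
coeffs[-1] -= delta                    # perturb only the constant term
x = np.roots(coeffs)                   # roots of the perturbed polynomial
assert np.allclose(x.imag, 0) and np.all(x.real > 0)
x = np.sort(x.real)

# Power sums match for m = 1, ..., d-1; the d-th moment differs by 4*delta.
for m in range(1, d + 1):
    print(m, (x ** m).sum() - (y ** m).sum())
print("entropy gap:", (x * np.log(x)).sum() - (y * np.log(y)).sum())
```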
Using the above two points, we have

$$H(P) - H(Q) = \frac{1}{d}\left(\sum_{i=1}^d y_i\log y_i - \sum_{i=1}^d x_i\log x_i\right) \neq 0,$$

a separation not depending on $(n,k)$, and $\sum_i p_i^m = \sum_i q_i^m$ for any $m \le d-1$. Moreover, for $m \ge d$, we have

$$\left|\sum_i \left(p_i^m - q_i^m\right)\right| \le k\left(\frac{C_0}{k}\right)^m,$$

where $C_0 := \max_i \max\{x_i, y_i\}$ depends only on $d$. Hence, by Lemma 9,

$$\mathrm{TV}\left(\mathcal{L}(h^\downarrow \mid P^{\otimes n}),\ \mathcal{L}(h^\downarrow \mid Q^{\otimes n})\right) \le \frac{k}{2}\sum_{m=d}^\infty \left(\frac{C C_0 n}{k}\right)^m,$$

which is strictly smaller than one if $n \le c\, k^{1-1/d}$ for a sufficiently small constant $c > 0$. Note that the choice of the integer $d$ is arbitrary; we conclude that the minimax risk $R^*_{n,k}$ is at least a positive constant whenever $n = O(k^{1-\delta})$, for any $\delta > 0$. This lower bound says nothing about the decay of the risk when $n$ gets larger, but compared with the previous bound $\Omega(\frac{\log^2 k}{n})$, the new bound shows that $n$ actually needs to be almost linear in $k$ for consistent estimation.
In later lectures, it will be shown that the minimax risk of this problem is

$$R^*_{n,k} \asymp \left(\frac{k}{n\log k}\right)^2 + \frac{\log^2 k}{n}.$$

Hence, the current bound is slightly weaker than that implied by the first term, and the old bound is exactly the second term.
3.4. Example IV: Robust Statistics
In this section we look at the Gaussian mean estimation problem again, but now with outliers. Specifically, we consider the following Huber’s $\epsilon$-contamination model:

$$X_1,\cdots,X_n \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,\mathcal{N}(\theta, I_p) + \epsilon\, Q,$$

where $Q$ denotes the outlier distribution and may be arbitrary. The robust estimation problem is that, given $n$ i.i.d. observations from the contaminated mixture, propose a robust estimator which minimizes the worst-case risk of estimating $\theta \in \mathbb{R}^p$, where the worst case is taken over both $\theta$ and the outlier distribution $Q$. Under the squared $\ell_2$ loss, let $R^*_{n,\epsilon,p}$ be the minimax risk with $n$ samples and outlier probability $\epsilon$.
The central claim of this section is that

$$R^*_{n,\epsilon,p} = \Omega\left(\frac{p}{n} + \epsilon^2\right).$$

This lower bound is tight for the Tukey median and some other estimators; we refer to the bibliographic notes. Since the lower bound $\Omega(p/n)$ holds even when $\epsilon = 0$, we may restrict our attention to showing that $R^*_{n,\epsilon,p} = \Omega(\epsilon^2)$.
To apply the two-point method, we need to choose both pairs $(\theta_0, Q_0)$ and $(\theta_1, Q_1)$, where the choice of the outlier distributions is the more important one in this robust setting. Since $Q_0$ and $Q_1$ can be arbitrary probability distributions, we actually have a large degree of freedom and can achieve perfect ambiguity here. This phenomenon is summarized in the following lemma.
Lemma 11 Let $P_0, P_1$ be any probability distributions with $\mathrm{TV}(P_0, P_1) \le \frac{\epsilon}{1-\epsilon}$. Then there exist probability distributions $Q_0, Q_1$ such that

$$(1-\epsilon)P_0 + \epsilon Q_0 = (1-\epsilon)P_1 + \epsilon Q_1.$$
Proof: Consider the signed measure $\mu := (1-\epsilon)(P_1 - P_0)$. Clearly $\mu(\Omega) = 0$, and by assumption $|\mu|(\Omega) = 2(1-\epsilon)\,\mathrm{TV}(P_0, P_1) \le 2\epsilon$, where $|\mu|$ denotes the total variation measure of $\mu$. Hence, by the Jordan decomposition of signed measures, we have $\mu = \mu_+ - \mu_-$, where the measures $\mu_+, \mu_-$ are mutually singular and $\mu_+(\Omega) = \mu_-(\Omega) \le \epsilon$. Finally, adding any common measure $\nu$ with $\nu(\Omega) = \epsilon - \mu_+(\Omega)$ to both parts and setting $Q_0 = (\mu_+ + \nu)/\epsilon$, $Q_1 = (\mu_- + \nu)/\epsilon$ completes the proof. $\Box$
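On a finite alphabet the construction in the proof is completely explicit; the following sketch (toy distributions of my own) builds $Q_0, Q_1$ from the Jordan decomposition and verifies that the two contaminated mixtures coincide:

```python
import numpy as np

eps = 0.2
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.4, 0.35, 0.25])
assert 0.5 * np.abs(P0 - P1).sum() <= eps / (1 - eps)   # TV condition

mu = (1 - eps) * (P1 - P0)
mu_pos, mu_neg = np.maximum(mu, 0), np.maximum(-mu, 0)  # Jordan decomposition
nu = (eps - mu_pos.sum()) * P0      # any common measure with the leftover mass
Q0, Q1 = (mu_pos + nu) / eps, (mu_neg + nu) / eps

assert np.allclose((1 - eps) * P0 + eps * Q0, (1 - eps) * P1 + eps * Q1)
assert np.isclose(Q0.sum(), 1) and np.isclose(Q1.sum(), 1)
```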
By Lemma 11, it suffices to find $\theta_0, \theta_1$ such that

$$\mathrm{TV}\left(\mathcal{N}(\theta_0, I_p), \mathcal{N}(\theta_1, I_p)\right) \le \frac{\epsilon}{1-\epsilon},$$

in which case the two contaminated models coincide and the observer must pay the incurred loss $\Omega(\|\theta_0 - \theta_1\|_2^2)$. Since $\mathrm{TV}(\mathcal{N}(\theta_0, I_p), \mathcal{N}(\theta_1, I_p)) = 2\Phi(\|\theta_0 - \theta_1\|_2/2) - 1 \le \|\theta_0 - \theta_1\|_2$, the lower bound $R^*_{n,\epsilon,p} = \Omega(\epsilon^2)$ is completed by choosing $\|\theta_0 - \theta_1\|_2 = \epsilon$.
3.5. Example V: Two-armed Bandits
In this section we study the simplest online learning problem, namely the two-armed bandit. At each time $t = 1,\cdots,T$, the learner can pull one of the two arms, and the reward of pulling arm $i\in\{1,2\}$ follows a normal distribution $\mathcal{N}(\mu_i, 1)$, where the mean parameters $(\mu_1, \mu_2)$ are unknown to the learner. A policy $\pi$ is a sequence $(\pi_t)_{t\in[T]}$, where $\pi_t\in\{1,2\}$ denotes the arm to pull at time $t$, and $\pi_t$ can depend on all past pulls $(\pi_1,\cdots,\pi_{t-1})$ and past rewards $(r_1,\cdots,r_{t-1})$. The regret of a policy $\pi$ is defined as the difference between the cumulative reward of the best arm and that of the policy, i.e.,

$$R_T(\pi) := T\max\{\mu_1, \mu_2\} - \sum_{t=1}^T \mu_{\pi_t}.$$

The target of online learning is to devise a policy with a small expected regret, where the expectation is taken over the randomness of the rewards (and therefore the random realizations of the policy). A fundamental tradeoff in online learning is the exploration-exploitation tradeoff, where the learner needs to pull the possibly suboptimal arms sufficiently many times to identify the best arm with better precision, and also needs to exploit the current knowledge to focus on the arm which seems to be the best.
In this section we aim to prove minimax lower bounds on the regret in two-armed bandits. First consider the case where $(\mu_1, \mu_2)$ can take any values in $\mathbb{R}^2$. The two-point method suggests that we may find two worlds such that
- there is no fixed policy whose regrets are small in both worlds;
- these two worlds are hard to distinguish.
To do so, we may take $(\mu_1, \mu_2) = (\Delta, 0)$ in the first world, and $(\mu_1, \mu_2) = (0, \Delta)$ in the second world, where $\Delta > 0$ is some parameter to be chosen. Note that in both worlds, pulling a wrong arm incurs an instantaneous regret of $\Delta$. Let $P_1$ and $P_2$ be the distributions of the rewards up to time $T$ in the respective worlds; then the minimax regret is lower bounded by

$$\sup_{(\mu_1,\mu_2)}\mathbb{E}[R_T(\pi)] \ge \frac{\Delta}{2}\left(\mathbb{E}_1[T_2] + \mathbb{E}_2[T_1]\right) = \frac{\Delta}{2}\sum_{t=1}^T\left(P_1(\pi_t = 2) + P_2(\pi_t = 1)\right),$$

where $T_i$ denotes the number of pulls of arm $i$. Using Lemma 1 (applied at each time $t$ to the test $\Psi_t = \pi_t$) and

$$\mathrm{TV}(P_1, P_2) \le \sqrt{\frac{D_{\mathrm{KL}}(P_1\|P_2)}{2}} = \sqrt{\frac{T\Delta^2}{4}} = \frac{\Delta\sqrt{T}}{2},$$

we conclude that $\sup_{(\mu_1,\mu_2)}\mathbb{E}[R_T(\pi)] \ge \frac{\Delta T}{2}(1 - \frac{\Delta\sqrt{T}}{2}) = \frac{\sqrt{T}}{4} = \Omega(\sqrt{T})$ by choosing $\Delta = T^{-1/2}$.
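A quick numerical illustration that $\Delta = T^{-1/2}$ indeed maximizes the bound $\frac{\Delta T}{2}(1 - \frac{\Delta\sqrt{T}}{2})$ and yields $\sqrt{T}/4$:

```python
import numpy as np

# Grid search over Delta confirms the maximizer 1/sqrt(T) and the value sqrt(T)/4.
for T in [100, 10 ** 4, 10 ** 6]:
    deltas = np.linspace(0, 2 / np.sqrt(T), 100001)
    vals = deltas * T / 2 * (1 - deltas * np.sqrt(T) / 2)
    print(T, deltas[vals.argmax()] * np.sqrt(T), vals.max() / np.sqrt(T))
    # prints maximizer ratio ~1.0 and value ratio ~0.25 for every T
```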
We also consider the case where there is a known suboptimality gap between the arms, i.e., $|\mu_1 - \mu_2| = \Delta$ for a fixed and known $\Delta > 0$. Using the above arguments (applied to the first $t \le T$ rounds) and the inequality

$$1 - \mathrm{TV}(P_1, P_2) \ge \frac{1}{2}e^{-D_{\mathrm{KL}}(P_1\|P_2)},$$

we will arrive at

$$\sup_{(\mu_1,\mu_2)}\mathbb{E}[R_T(\pi)] \ge \sup_{1\le t\le T}\frac{\Delta t}{4}e^{-t\Delta^2/2} = \Omega\left(\frac{1}{\Delta}\right),$$

which converges to a constant as $T\to\infty$. However, this lower bound is not tight – we will show that the minimax regret is $\Omega(\frac{\log(T\Delta^2)}{\Delta})$ as $T$ gets large. To show this stronger lower bound, we need a more delicate construction of the two points.
Set $(\mu_1, \mu_2) = (\Delta, 0)$ in world I, and $(\mu_1, \mu_2) = (\Delta, 2\Delta)$ in world II. Let $P_1, P_2$ be defined as before; then by the chain rule,

$$D_{\mathrm{KL}}(P_1\|P_2) = \frac{(2\Delta)^2}{2}\,\mathbb{E}_1[T_2] = 2\Delta^2\,\mathbb{E}_1[T_2],$$

where $T_2 := \sum_{t=1}^T \mathbb{1}(\pi_t = 2)$ is the total number of pulls of arm $2$, and $\mathbb{E}_1$ denotes the expectation under world I. Hence, applying the previous arguments, we have

$$\sup_{(\mu_1,\mu_2)}\mathbb{E}[R_T(\pi)] \ge \frac{\Delta T}{4}e^{-2\Delta^2\,\mathbb{E}_1[T_2]}.$$

Clearly, this is a better bound for small $\mathbb{E}_1[T_2]$, while it is less useful when $\mathbb{E}_1[T_2]$ is large. However, a large $\mathbb{E}_1[T_2]$ implies that the learner pulls the suboptimal arm (i.e., arm 2) too many times in world I, and in mathematical words we have

$$\sup_{(\mu_1,\mu_2)}\mathbb{E}[R_T(\pi)] \ge \Delta\,\mathbb{E}_1[T_2].$$

Note that these two bounds exactly characterize the tradeoff between exploitation and exploration. Combining these bounds, we have

$$2\sup_{(\mu_1,\mu_2)}\mathbb{E}[R_T(\pi)] \ge \Delta\left(\mathbb{E}_1[T_2] + \frac{T}{4}e^{-2\Delta^2\,\mathbb{E}_1[T_2]}\right) \ge \Delta\cdot\frac{\log(T\Delta^2/2) + 1}{2\Delta^2} = \Omega\left(\frac{\log(T\Delta^2)}{\Delta}\right),$$

where the second inequality follows from the minimization of the inner sum with respect to $\mathbb{E}_1[T_2] \ge 0$.
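The inner minimization can be checked numerically (assuming SciPy is available); the minimizer matches the analytic expression $\log(T\Delta^2/2)/(2\Delta^2)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimize g(x) = x + (T/4) * exp(-2 * Delta^2 * x) over x = E_1[T_2] >= 0.
T, Delta = 10 ** 6, 0.1
g = lambda x: x + T / 4 * np.exp(-2 * Delta ** 2 * x)
res = minimize_scalar(g, bounds=(0, T), method="bounded")
x_star = np.log(T * Delta ** 2 / 2) / (2 * Delta ** 2)
print(res.x, x_star)        # numerical and analytic minimizers agree
print(res.fun * Delta / 2)  # resulting regret bound: Theta(log(T*Delta^2)/Delta)
```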
In summary, we have two different lower bounds: when there is no assumption on the optimality gap, the worst-case regret is $\Omega(\sqrt{T})$; when the optimality gap is known to be $\Delta$, the worst-case regret is $\Omega(\frac{\log(T\Delta^2)}{\Delta})$. Both bounds are tight and there exist policies attaining these bounds; we refer to the bibliographic notes for details.
3.6. Example VI: Multi-armed Bandits
Interestingly, the two-point method can also give the correct minimax lower bound on the regret for general multi-armed bandits (in later lectures, we will provide a proof using multiple points). However, the following application of the two-point method differs from all previous ones. In previous examples, we fixed two points and proved that no estimator (or policy) can achieve a small risk (or regret) under both points. In this example, the construction of the two points depends on the policy.
We first define the multi-armed bandit problem. In fact, the only difference from the two-armed case is that now there are $K$ arms in total, and the $i$-th arm has reward distribution $\mathcal{N}(\mu_i, 1)$. The regret of a given policy $\pi$ is defined as

$$R_T(\pi) := T\max_{i\in[K]}\mu_i - \sum_{t=1}^T \mu_{\pi_t}.$$
To apply the two-point method, let the first point be $\mu = (\Delta, 0, \cdots, 0)$. Clearly, the first arm is the optimal arm, and pulling any wrong arm incurs an instantaneous regret $\Delta$. For any given policy $\pi$, the choice of the second point will depend on the policy: for $i = 2,\cdots,K$, let $\mathbb{E}_1[T_i]$ be the expected number of pulls of arm $i$ under policy $\pi$ at the first point. Since $\sum_{i=2}^K \mathbb{E}_1[T_i] \le T$, there must exist some $i_0 \in \{2,\cdots,K\}$ such that $\mathbb{E}_1[T_{i_0}] \le \frac{T}{K-1}$. In other words, under the first point, the policy $\pi$ does not pull arm $i_0$ much in expectation. Note that $i_0$ depends on the policy $\pi$.
Now the second point is as follows: $\mu' = (\Delta, 0, \cdots, 0, 2\Delta, 0, \cdots, 0)$, where the only difference from the first point is that $\mu'_{i_0} = 2\Delta$. We claim that the two points are hard to distinguish: let $P_1, P_2$ be the distributions of all observed rewards under these two points, respectively; then

$$\mathrm{TV}(P_1, P_2) \le \sqrt{\frac{D_{\mathrm{KL}}(P_1\|P_2)}{2}} = \sqrt{\Delta^2\,\mathbb{E}_1[T_{i_0}]} \le \Delta\sqrt{\frac{T}{K-1}}.$$

Hence, by choosing $\Delta = \frac{1}{2}\sqrt{\frac{K-1}{T}}$ we have $\mathrm{TV}(P_1, P_2) \le \frac{1}{2}$, and consequently, by the same argument as in the two-armed case, the worst-case regret for the policy $\pi$ is at least $\frac{\Delta T}{2}(1 - \mathrm{TV}(P_1, P_2)) \ge \frac{\Delta T}{4} = \Omega(\sqrt{KT})$. By the arbitrariness of $\pi$, this is a minimax lower bound. This bound is tight for multi-armed bandits.
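Plugging in the chosen $\Delta$ gives a concrete numerical form of the bound (a simple evaluation sketch):

```python
import numpy as np

# Delta = sqrt((K-1)/T)/2 keeps TV <= 1/2, and the worst-case regret is at
# least Delta * T / 4 = sqrt((K-1)*T)/8 = Omega(sqrt(K*T)).
for K, T in [(10, 10 ** 4), (100, 10 ** 6)]:
    Delta = 0.5 * np.sqrt((K - 1) / T)
    tv_bound = Delta * np.sqrt(T / (K - 1))   # equals 1/2 by construction
    regret_bound = Delta * T / 4
    print(K, T, tv_bound, regret_bound / np.sqrt(K * T))   # ratio ~ 1/8
```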
The idea of constructing the two points based on the given estimator or policy is widely used in problems where several rounds of adaptivity are available. One future lecture will be devoted exclusively to this topic.
4. Bibliographic Notes
The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977) (see also Le Cam (1973) for the binary case). The general $f$-divergence was proposed by Csiszár (1967), and the joint range of $f$-divergences (Theorem 4) was established in Harremoës and Vajda (2011). For the results in convex analysis, the Choquet–Bishop–de Leeuw theorem can be found in Phelps (2001), and the Fenchel–Eggleston–Carathéodory theorem can be found in Eggleston (1958). The inequalities between $f$-divergences and the two-point method can be mostly found in the excellent textbook Tsybakov (2009).
For the examples, the first lower bound of entropy estimation is taken from Wu and Yang (2016). For the second lower bound, Lemma 9 was proved in Valiant (2011) under the Poissonized model, and Lemma 10 is taken from Acharya et al. (2016). For robust statistics, Huber’s $\epsilon$-contamination model dates back to Huber (1964), and the general lower bound technique in Lemma 11 is taken from Chen, Gao and Ren (2016). For the robust estimators achieving the lower bound, see Chen, Gao and Ren (2018) and Diakonikolas et al. (2018).
The lower bounds of bandit problems are taken from Vogel (1960) and Lai and Robbins (1985). We also refer to Bubeck, Perchet and Rigollet (2013) and Lattimore and Szepesvári (2018) for non-asymptotic bounds. For more on bandit problems, such as the upper bounds, we refer to the book Bubeck and Cesa-Bianchi (2012).
- Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977.
- Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics 1.1 (1973): 38-53.
- Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2 (1967): 229-318.
- Peter Harremoës and Igor Vajda. On pairs of $f$-divergences and their joint range. IEEE Transactions on Information Theory 57.6 (2011): 3230-3235.
- Robert R. Phelps. Lectures on Choquet’s theorem. Springer Science & Business Media, 2001.
- H. G. Eggleston. Convexity. Cambridge University Press, 1958.
- Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
- Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.
- Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing 40.6 (2011): 1927-1968.
- Jayadev Acharya et al. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory 63.1 (2016): 38-56.
- Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics (1964): 73-101.
- Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber’s $\epsilon$-contamination model. Electronic Journal of Statistics 10.2 (2016): 3752-3774.
- Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. The Annals of Statistics 46.5 (2018): 1932-1960.
- Ilias Diakonikolas et al. Robustly learning a Gaussian: Getting optimal error, efficiently. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2018.
- Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics 31.2 (1960): 430-443.
- Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6.1 (1985): 4-22.
- Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. Conference on Learning Theory, 2013.
- Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint (2018).
- Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5.1 (2012): 1-122.