(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at email@example.com.)
In previous lectures we have seen the ideas of model reduction and its application in establishing the asymptotic lower bounds. However, in many scenarios we would like to find good non-asymptotic bounds, and as we show in the examples of Lecture 3, finding such reductions may be quite difficult. In this and the next few lectures, we will introduce the testing idea which transforms the target problem (usually the estimation problem) into a proper testing problem. Despite the conceptual simplicity, the testing problem may take various forms, some of which may be difficult to analyze. This will be the central topic of the next few lectures.
1. -divergences and Inequalities
First we look at the possibly simplest testing problem: if and are two probability measures on the same probability space, and a random sample is drawn from either or . A (binary) testing problem aims to find some test which with high probability outputs when , and outputs when . To quantify the performance of a certain test , we define the total error probability as follows:
Intuitively, when and are close to each other, we will expect that the total error probability for any test is large for they are hard to distinguish. Conversely, when and are very far from each other (e.g., with a disjoint support), then there exists some test which can well distinguish them. The next lemma gives the expression of the minimum .
Lemma 1 (Le Cam’s first lemma) Minimizing all measurable tests gives
Recall that is the total variation distance between and .
Proof: Let , then
with equality if and only if .
By Lemma 1, the total variation distance between and exactly quantifies the minimum total error probability of distinguishing between two probability distributions. Similar to Lemma 1, if we impose different weights on the error probabilities, we will arrive at some divergences of the form , where is some constant, and is the weight on . Recall from convex analysis that for any convex function with compact support on , its second-order derivative always exists in the sense of a positive Radon measure, and
for some .
Remark 1 The decomposition (1) of convex functions has several applications, e.g., in the local time analysis of continuous martingales.
Motivated by (1), we can actually take a convex combination of the above divergences and arrive at the general class of -divergences.
Definition 2 (-divergence) For a convex function on with , the -divergence between probability measures and on is defined as
where is the absolute continuous part of w.r.t. , and .
In most practical scenarios, we have , and therefore . Some well-known examples of -divergences are as follows.
1. : total variation distance ;
2. : Hellinger distance ;
3. : Kullback–Leibler (KL) divergence ;
4. : -divergence .
The following properties also hold for any -divergences.
Lemma 3 and is joint convex in . In particular, for any transition kernel .
Remark 2 The last inequality is known as the data processing inequality.
Proof: The non-negativity follows from convexity and Jensen’s inequality. The joint convexity is due to that of the perspective transform for convex . Let be the output of through kernel , then the last inequality follows from the joint convexity and
We leave the details to the readers.
1.2. Joint range
Lemma 1 involves the computation of the total variation distance, which may be hard to evaluate exactly for most statistical models, especially when there is a tensor power structure. As a comparison, Hellinger, KL and -divergences tensorize better:
Consequently, to apply Lemma 1 to proving lower bounds, we need upper bounds of the total variation distance in terms of Hellinger, KL or . We have seen examples of such inequalities from Lecture 3, but have no clue whether the claimed inequalities are tight. To obtain the best inequalities between different -divergences, we study the joint range of -divergences.
For two convex functions on with , we call is in the joint range of if we can find two distributions over some common probability space such that . Hence, the joint range exactly characterizes the relationship between different -divergences. The central result is as follows:
Theorem 4 Let be the joint range of , and be the joint range of when both and are discrete distributions supported on at most elements. Then
where denotes the convex hull.
Theorem 4 provides a generic way to prove inequalities between different -divergences. Specifically, it shows that to obtain the joint range, it suffices to choose and , compute the pair , vary and then take the convex closure. The proof of Theorem 4 requires the following theorems in the convex analysis.
Theorem 5 (Choquet–Bishop–de Leeuw) Let be a metrizable convex compact subset of a locally convex topological vector space . For any point in , there exists a probability measure supported on extremal points of such that .
Theorem 6 (Fenchel–Eggleston–Carathéodory) Let be connected and . For any , there exists such that .
We are ready to prove the statement of Theorem 4.
Proof: For simplicity we always assume that , then . Viewing the likelihood ratio as a random variable , we conclude that
Similarly, has the same representation with the additional constraint that can take at most values. Let be the set of all probability measures which are supported on and have unit mean. By linearity of the expectation and the convexity of we conclude that is convex. Moreover, since all extremal points of are supported on two elements, by Choquet–Bishop–de Leeuw theorem (note that any topological vector space is always locally convex under the weak- topology, where the unit sphere is compact and metrizable) we conclude that the distribution of can be written as a mixture of distributions supported on only two points. Then by the linearity of expectation again, we conclude that .
For the second identity , first note that is a connected set for it is a continuous image of the divergence pair on . Hence, by Fenchel–Eggleston–Carathéodory theorem, any point in can be written as a convex combination of two points in , which gives rise to a random variable supported on at most elements.
Based on Theorem 4, one may verify the following inequalities:
- TV vs. Hellinger:
Both inequalities are tight.
- TV vs. KL: Pinsker’s inequality gives
This bound is vacuous when , where better bounds are available:
None of these bounds are tight.
- Hellinger vs KL:
This bound is tight.
- KL vs. :
This bound is tight.
The above inequalities will be used quite often whenever we switch between convenient divergences.
2. Le Cam’s Two-point Method
One popular way to prove a non-asymptotic lower bound of statistical estimation is via hypothesis testing. Let be the statistical model of interest, and the loss function is (at the moment we assume that the target is to estimate ). The idea is to use the following weaker lower bound
Suppose that we can find two possible candidates of , say, and , such that the following two conditions hold:
- Large separation: the total loss is always large for any ;
- Indistinguishability: based on the observation, the observer cannot reliably distinguish between and ;
then clearly the observer needs to pay a risk proportional to . The next theorem makes this intuition precise.
Theorem 7 (Two-point Method) Let be any loss function, and satisfy that
Proof: The desired result follows from the following chain of inequalities:
Note that in Theorem 7, the parameter characterizes the separation between the two points, and is the minimum error probability of testing between and (cf. Lemma 1). Typically these two quantities depend negatively on each other: when one becomes larger, the other gets smaller. A rule-of-thumb is to find such that is a constant, say , and then search for the largest possible separation .
Although the proof of Theorem 7 did not explicitly apply testing arguments, the test can be seen as follows. Let the test be if , and otherwise, then the separability condition ensures that for as long as . Hence,
Replacing with the above lower bound, and repeating the analysis in the proof of Theorem 7, a direct application of Lemma 1 gives the same result (with replaced by ).
The two-point idea is simple but appears everywhere in various lower bounds, for it establishes the indistinguishability between two different tasks due to the uncertainties. Also, despite the conceptual simplicity, searching over all possible pairs of points is typically computationally expensive, and it requires effort to understand the problem structure and figure out the “correct” two points. In later lectures, this idea will be generalized in the following directions:
- Multiple points instead of two-point;
- One or both of the two points may be a mixture of distributions, i.e., the binary test has point-mixture or mixture-mixture structures;
- The chioce of the two points may depend on the estimator .
3. Examples of Two-point Methods
The remainder of this lecture is devoted to the applications of the two-point method in various problems.
3.1. Example I: Parameter Estimation
We begin with the simplest example of the Gaussian location model for , and our target is to estimate . Since a natural guess is that local perturbation on makes the model indistinguishable, we choose and in the two-point method. Let be the minimax risk under the squared error loss, Theorem 7 gives
where is the CDF of . Maximizing the above lower bound over gives . Compared with derived in Lecture 4, the two-point bound loses a constant multiplicative factor.
Next we consider the bounded mean scenario where for a known parameter in advance. Since the above is chosen to be in the end, this becomes infeasible if the upper bound is much smaller than . Consequently, we can choose and , which gives . This lower bound is also tight, for one of the estimators or attains it.
Consider another location model where one observes and aims to estimate . Again, choosing and , the minimax risk is lower bounded by
By the triangle inequality,
Hence, choosing gives . The rate is also tight for the minimax risk, for the estimator attains the rate.
Remark 3 The uniform location model is not QMD (see Lecture 4), thus it makes sense to have the correct minimax rate different from . Moreover, the inequality may not be tight for other models.
3.2. Example II: Functional Estimation
Recall the discrete entropy estimation problem, where we observe samples from a discrete distribution , and the target is to estimate the Shannon entropy . In Lecture 4 we proved an asymptotic minimax lower bound for the mean squared error, and here we show that the same bound holds non-asymptotically (even when grows with ).
The idea of using two-point method here is that when we perturb each entry in the uniform distribution by some small , the value of the functional changes by an additive factor of , and the KL divergence between these two cases is . Hence, choosing will complete the proof. However, some cautions should be taken when choosing the specific points:
- The two points cannot be , and . In this way, we have ;
- The two points cannot be , and . In this way, we have if and otherwise. Hence, this choice fails for some parameters .
The final choice of the two-point is as follows:
Simple algebra gives , and
Now choosing completes the proof of . This example shows that the choice of the two points may sometimes be a little bit delicate.
3.3. Example III: Non-local Choice of Two Points
We return to the entropy estimation problem, but this time we choose different points to arrive at an entirely different lower bound. Note that all above choices of the two points are based on local perturbations, and the resulting bounds are essentially for the variance. This example will show that a more sophisticated construction can also take care of the bias part.
For observations , let be the sorted histogram of , i.e., is the sorted version of with . The next lemma shows that there exists a minimax estimator which depends only on .
Lemma 8 For any estimator of the entropy , there exists an estimator such that
Proof: The proof relies on the symmetry of in permutations of . Let be any permutation on , clearly , and the distribution of under is the same as that under . Hence, let , the estimator
depends only on , and the desired inequality follows from Jensen’s inequality.
Lemma 8 shows that the model can be reduced to , which by data processing inequality results in a smaller total variation distance between two points. Now the new construction of the two points is motivated by the following lemma.
Lemma 9 Let and be two probability vectors, with . Then
The proof of Lemma 9 is referred to the bibliographic notes. Lemma 9 shows that if the lower moments of and are equal, the total variation distance between their tensor powers will be small. This fact motivates us to consider distributions and which are repetitions of the vectors
respectively, where is some fixed integer, and
- and are valid distributions: ;
- and have matched moments: ;
- and have different entropies: .
The existence of such vectors is ensured by the following lemma.
Lemma 10 For any positive integer , there exist vectors and satisfying the above properties.
Proof: Fix any -dimensional vector with distinct positive entries. Consider
which for small enough clearly has positive zeros. Let be the zeros of , then comparing the coefficients of on both sides of the polynomial and invoking Newton’s identity, vectors and have matched moments up to . The first property is ensured via simple normalizations, and it suffices to choose a suitable to ensure the third property.
Using the above two points, we have and for any . Moreover, for , we have
Hence, by Lemma 9,
which is strictly smaller than one if . Note that the choice of the integer is arbitrary, we conclude that the minimax risk is at least a positive constant whenever for any . This lower bound says nothing about when gets larger, but compared with the previous bound , the new bound shows that actually needs to be much larger than for the consistent estimation.
In later lectures, it will be shown that the minimax risk of this problem is
Hence, the current bound is slightly weaker than the first term, and the old bound is exactly the second term.
3.4. Example IV: Robust Statistics
In this section we look at the Gaussian mean estimation again, but now with outliers. Specifically, we consider the following Huber’s -contamination model:
where denotes the outlier distribution and may be arbitrary. The robust estimation problem is that, given i.i.d. observations from , propose a robust estimator which minimizes the worst-case risk of estimating , where the worst case is taken over both and the outlier distribution . Under the squared loss, let be the minimax risk with samples and outlier probability .
The central claim of this section is that
This lower bound is tight for the Tukey median and some other estimators; we refer to the bibliographic notes. Since the lower bound even holds when , we may restrict our attention to showing that .
To apply the two-point method, we need to choose both and , where the latter is more important in this robust setting. Since and can be arbitrary probability distributions, we actually have a large degree of freedom and can actually achieve perfect ambiguity here. This phenomenon is summarized in the following lemma.
Lemma 11 Let be any probability distributions with . Then there exist probability distributions such that
Proof: Consider the signed measure . Clearly , and by assumption , where denotes the total variation measure of . Hence, by Jordan decomposition of signed measures, we have , where measures are mutually singular and . Finally, adding any common measure with to both completes the proof.
By Lemma 11, it suffices to find such that
and the incurred loss is . Hence, the lower bound is completed by choosing .
3.5. Example V: Two-armed Bandits
In this section we study the simplest online learning problem, namely, the two-armed bandit. At each time , the learner can pull one of the two arms, and the reward of pulling arm follows a standard normal distribution , where the mean parameters are unknown to the learner. A policy is a sequence where denotes the arm to pull at time , where can depend on all past policies and past rewards . The regret of a policy is defined as the difference of the cumulative reward of the best arm and that of the policy, i.e.,
The target of the online learning is to devise a policy with a small expected regret, where the expectation is taken over the randomness of the rewards (and therefore the random realizations of the policy). A fundamental tradeoff in online learning is the exploration-exploitation tradeoff, where the learner needs to pull the suboptimal arm sufficiently many times to identify the best arm with better precision, and also needs to exploit the current knowledge to focus on the arm which seems to be the best.
In this section we aim to prove minimax lower bounds of the regrets in the two-armed bandits. First consider the case where can take any values in . The two-point method suggests that we may find two worlds such that
- there is no fixed policy whose regrets are small in both worlds;
- these two worlds are hard to distinguish.
To do so, we may take in the first world, and in the second world, where is some parameter to be chosen. Note that in both worlds, pulling a wrong arm incurs an instanteous regret of . Let and be the distribution of the rewards up to time in respective worlds, then the minimax regret is lower bounded by
Using Lemma 1 and
we conclude that by choosing .
We also consider the case where there is a known suboptimality gap between arms, i.e., . Using the above arguments and the inequality
we will arrive at
which converges to a constant as . However, this lower bound is not tight – we will show that as gets large. To show this stronger lower bound, we need a more delicate construction of the two points.
Set in world I, and in world II. Let be defined as before, then
where is the total number of pulls of arm , and denotes the expectation under world I. Hence, applying the previous arguments, we have
Clearly, this is a better bound for small , while it is less useful when is large. However, a large implies that the learner pulls the suboptimal arm (i.e., arm 2) too many times in world I, and in mathematical words we have
Note that these two bounds exactly characterize the tradeoff between exploitation and exploration. Combining these bounds, we have
where the second inequality follows from the minimization of the inner sum with respect to .
In summary, we have two different lower bounds: when there is no assumption on the optimality gap, then ; when the optimality gap is at least , then . Both bounds are tight and there exist policies attaining these bounds; we refer to bibliographic notes for details.
3.6. Example VI: Multi-armed Bandits
Interestingly, the two-point method can also give the correct minimax lower bound of the regret for general multi-armed bandits (in later lectures, we will provide a proof using multiple points). However, the following application of the two-point method differs from all previous ones. In previous examples, we fix two points and prove that all estimators (or policies) cannot achieve small risk (or regret) under both points. In this example, the construction of the two points depends on the policy.
We first define the multi-armed bandit problem. In fact, the only difference from the two-armed case is that now there are arms in total, and the -th arm has reward distribution . The regret of a given policy is defined as
To apply the two-point method, let the first point be . Clearly, the first arm is the optimal arm, and pulling any wrong arm incurs an instanteous regret . For any given policy , the choice of the second point will depend on the policy: for , let be the expected number of pulls of arm for policy under the first point. Since , there must exist some such that . In other words, under the first point, the policy does not pull arm much in expectation. Note that depends on the policy .
Now the second point is as follows: , where the only difference from the first point is that . We claim that the two points are hard to distinguish: let be the distribution of all observed rewards under these two points, respectively, then
Hence, by choosing we have , and consequently the worst-case regret for the policy is at least . By the arbitrariness of , this is a minimax lower bound. This bound is tight for multi-armed bandits.
The idea of constructing points based on given approaches is widely used in problems where several rounds of adaptivity are available. One future lecture will be devoted exclusively to this topic.
4. Bibliographic Notes
The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977) (see also Le Cam (1973) for the binary case). The general -divergence was proposed by Csiszár (1967), and the joint range of -divergences (Theorem 4) was established in Harremoës and Vajda (2011). For the results in convex analysis, the Choquet–Bishop–de Leeuw theorem can be found in Phelps (2001), and the Fenchel–Eggleston–Carathéodory theorem can be found in Eggleston (1958). The inequalities between -divergences and the two-point method can be mostly found in the excellent textbook Tsabykov (2009).
For the examples, the first lower bound of entropy estimation is taken from Wu and Yang (2016). For the second lower bound, Lemma 9 was proved in Valiant (2008) under the Poissonized model, and Lemma 10 is taken from Archaya et al. (2016). For robust statistics, Huber’s -contamination model dates back to Huber (1964), and the general lower bound technique in Lemma 11 is taken from Chen, Gao and Ren (2016). For the robust estimators achieving the lower bound, see Chen, Gao and Ren (2018) and Ilias et al. (2018).
The lower bounds of bandit problems are taken from Voger (1960) and Lai and Robbins (1985). We also refer to Bubeck, Perchet and Rigollet (2013) and Lattimore and Szepesvári (2018) for non-asymptotic bounds. For more on bandit problems such as the upper bounds, we refer to the book Bubeck and Cesa-Bianchi (2012).
- Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977.
- Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics 1.1 (1973): 38-53.
- Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica 2 (1967): 229-318.
- Peter Harremoës and Igor Vajda. On pairs of -divergences and their joint range. IEEE Transactions on Information Theory 57.6 (2011): 3230-3235.
- Robert R. Phelps. Lectures on Choquet’s theorem. Springer Science & Business Media, 2001.
- H. G. Eggleston. Convexity. Cambridge University Press, 1958.
- Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
- Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.
- Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing 40.6 (2011): 1927-1968.
- Jayadev Acharya et al. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory 63.1 (2016): 38-56.
- Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics (1964): 73-101.
- Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber’s -contamination model. Electronic Journal of Statistics 10.2 (2016): 3752-3774.
- Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huberâ€™s contamination model. The Annals of Statistics 46.5 (2018): 1932-1960.
- Ilias Diakonikolas et al. Robustly learning a gaussian: Getting optimal error, efficiently. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2018.
- Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics 31.2 (1960): 430-443.
- Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6.1 (1985): 4-22.
- Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. Conference on Learning Theory, 2013.
- Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint (2018).
- Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5.1 (2012): 1-122.