(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In this lecture, we establish the asymptotic lower bounds for general statistical decision problems. Specifically, we show that models satisfying the benign local asymptotic normality (LAN) condition are asymptotically equivalent to a Gaussian location model, under which the Hájek–Le Cam local asymptotic minimax (LAM) lower bound holds. We also apply this theorem to both parametric and nonparametric problems.
1. History of Asymptotic Statistics
To begin with, we first recall the notions of the score function and Fisher information, which can be found in most textbooks.
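Definition 1 (Fisher Information) A family of distributions $(P_\theta)_{\theta\in\Theta\subseteq\mathbb{R}^d}$ on $\mathcal{X}$ is quadratic mean differentiable (QMD) at $\theta\in\mathbb{R}^d$ if there exists a score function $\dot{\ell}_\theta: \mathcal{X}\to\mathbb{R}^d$ such that
$$\int \left(\sqrt{dP_{\theta+h}} - \sqrt{dP_\theta} - \frac{1}{2}h^\top\dot{\ell}_\theta\sqrt{dP_\theta}\right)^2 = o(\|h\|^2) \quad \text{as } h\to 0.$$
In this case, the matrix $I(\theta) := \mathbb{E}_{P_\theta}[\dot{\ell}_\theta\dot{\ell}_\theta^\top]$ exists and is called the Fisher information at $\theta$.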
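R. A. Fisher popularized the above Fisher information and the use of the maximum likelihood estimator (MLE) starting from the 1920s. He believed that the Fisher information of a statistical model characterizes the fundamental limits of estimating $\theta$ based on $n$ i.i.d. observations from $P_\theta$, and that the MLE asymptotically attains this limit. More specifically, he made the following conjectures:
- For any asymptotically normal estimator $\hat{\theta}_n$ such that $\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\to} \mathcal{N}(0,\Sigma_\theta)$ for any $\theta\in\Theta\subseteq\mathbb{R}^d$, there must be
$$\Sigma_\theta \succeq I(\theta)^{-1}, \qquad \forall \theta\in\Theta. \tag{1}$$
- The MLE satisfies that $\Sigma_\theta = I(\theta)^{-1}$ for any $\theta\in\Theta$.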
Although the second conjecture is easier to establish assuming certain regularity conditions, the first conjecture, which seems to be correct by the well-known Cramér-Rao bound, actually caused some trouble when people tried to prove it. The following example shows that (1) may be quite problematic.
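Example 1 Here is a counterexample to (1) proposed by Hodges in 1951. Let $\Theta=\mathbb{R}$, and consider a Gaussian location model where $X_1,\cdots,X_n$ are i.i.d. distributed as $\mathcal{N}(\theta,1)$. A natural estimator of $\theta$ is the empirical mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$, and the Fisher information is $I(\theta)\equiv 1$. Hodges’ estimator is constructed as follows:
$$\hat{\theta}_n = \bar{X}_n\cdot 1(|\bar{X}_n|\ge n^{-1/4}). \tag{2}$$
It is easy to show that $\sqrt{n}(\hat{\theta}_n - \theta)\overset{d}{\to}\mathcal{N}(0,\Sigma_\theta)$ for any $\theta\in\mathbb{R}$, with $\Sigma_\theta=1$ for non-zero $\theta$ and $\Sigma_\theta=0$ for $\theta=0$. Consequently, (1) does not hold for Hodges’ estimator. The same result holds if the threshold $n^{-1/4}$ in (2) is replaced by any sequence $a_n$ with $a_n=o(1)$ and $a_n=\omega(n^{-1/2})$.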
Hodges’ example suggests that (1) should at least be weakened in appropriate ways. Observing the structure of Hodges’ estimator (2) carefully, there can be three possible attempts:
- The estimator $\hat{\theta}_n$ is superefficient (i.e., (1) fails) only at one point $\theta=0$. It may be expected that the set of $\theta$ such that (1) fails is quite small.
- Although Hodges’ estimator satisfies the asymptotic normality condition, i.e., $\sqrt{n}(\hat{\theta}_n-\theta)$ weakly converges to a normal distribution under $P_\theta^{\otimes n}$, for any non-zero perturbation $h\in\mathbb{R}$ the sequence $\sqrt{n}(\hat{\theta}_n - \theta - h/\sqrt{n})$ does not weakly converge to the same distribution under $P_{\theta+h/\sqrt{n}}^{\otimes n}$ at $\theta=0$. Hence, we may expect that (1) actually holds for more regular estimator sequences.
- Let $R(\theta)$ be the risk function of the estimator $\hat{\theta}_n$ under the absolute value loss. It can be computed that $R(0)=o(1/\sqrt{n})$, while $R(\theta)=\Omega(|\theta|)$ for $|\theta|=O(n^{-1/4})$. In other words, the worst-case risk over an interval of size $n^{-1/2}$ around $\theta=0$ is still of the order $\Omega(n^{-1/2})$, which is considerably larger than the risk at the single point $\theta=0$. Therefore, it may make sense to consider the local minimax risk.
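The third observation can be illustrated by a small simulation. The following is only a sketch under stated assumptions: the sample size, the number of trials, and the perturbed point $\theta = 0.5\,n^{-1/4}$ are illustrative choices that do not appear in the lecture, and the helper names `hodges` and `scaled_abs_risk` are hypothetical. It uses the fact that the sample mean in the $\mathcal{N}(\theta,1)$ model is exactly $\mathcal{N}(\theta,1/n)$.

```python
import math
import random

def hodges(xbar, n):
    """Hodges' estimator (2): keep the sample mean only if |xbar| >= n^(-1/4)."""
    return xbar if abs(xbar) >= n ** -0.25 else 0.0

def scaled_abs_risk(theta, n, trials=20000, seed=0):
    """Monte Carlo estimate of sqrt(n) * E|theta_hat - theta| under N(theta, 1) data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The sample mean of n i.i.d. N(theta, 1) draws is exactly N(theta, 1/n).
        xbar = rng.gauss(theta, 1.0 / math.sqrt(n))
        total += abs(hodges(xbar, n) - theta)
    return math.sqrt(n) * total / trials

n = 10_000
risk_at_zero = scaled_abs_risk(0.0, n)             # superefficient point
risk_nearby = scaled_abs_risk(0.5 * n ** -0.25, n)  # illustrative nearby point
risk_mean = math.sqrt(2.0 / math.pi)                # sqrt(n) * E|Xbar - theta| for the sample mean
print(risk_at_zero, risk_nearby, risk_mean)
```

In this sketch the scaled risk is essentially zero at $\theta=0$ but much larger than that of the sample mean at the nearby point, which is the behavior that motivates looking at local worst-case risks.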
It turns out that all these attempts can be successful, and the following theorem summarizes the key results of asymptotic statistics developed by J. Hájek and L. Le Cam in the 1970s.
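Theorem 2 (Asymptotic Theorems) Let $(P_\theta)_{\theta\in\Theta\subseteq\mathbb{R}^d}$ be a QMD statistical model which admits a non-singular Fisher information $I(\theta_0)$ at $\theta_0$. Let $\psi:\Theta\to\mathbb{R}^p$ be differentiable at $\theta=\theta_0$ with Jacobian $\dot{\psi}_{\theta_0}\in\mathbb{R}^{p\times d}$, and $T_n$ be an estimator sequence of $\psi(\theta)$ in the model $(P_\theta^{\otimes n})$.
1. (Almost everywhere convolution theorem) If $\sqrt{n}(T_n-\psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for every $\theta$, and $I(\theta)$ is non-singular for every $\theta$, then there exists some probability measure $M_\theta$ such that
$$L_\theta = \mathcal{N}(0, \dot{\psi}_\theta I(\theta)^{-1}\dot{\psi}_\theta^\top) * M_\theta$$
for Lebesgue almost every $\theta$, where $*$ denotes the convolution.
2. (Convolution theorem) If $\sqrt{n}(T_n-\psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for $\theta=\theta_0$, and $T_n$ is regular in the sense that $\sqrt{n}(T_n-\psi(\theta_0+h/\sqrt{n}))$ weakly converges to the same limit under $P_{\theta_0+h/\sqrt{n}}^{\otimes n}$ for every $h\in\mathbb{R}^d$, then there exists some probability measure $M$ such that
$$L_{\theta_0} = \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top) * M.$$
3. (Local asymptotic minimax theorem) Let $\ell$ be a bowl-shaped loss function, i.e., $\ell$ is non-negative, symmetric and quasi-convex. In mathematical terms, $\ell(x)=\ell(-x)\ge 0$ and the sublevel sets $\{x: \ell(x)\le t\}$ are convex for all $t\in\mathbb{R}$. Then
$$\lim_{c\to\infty}\liminf_{n\to\infty}\sup_{\|h\|\le c}\mathbb{E}_{\theta_0+\frac{h}{\sqrt{n}}}\,\ell\left(\sqrt{n}\left(T_n-\psi\left(\theta_0+\frac{h}{\sqrt{n}}\right)\right)\right)\ge \mathbb{E}\,\ell(Z)$$
with $Z\sim\mathcal{N}(0,\dot{\psi}_{\theta_0}I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)$.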
We will be primarily interested in the local asymptotic minimax (LAM) theorem, for it directly gives general lower bounds for statistical estimation. This theorem will be proved in the next two sections using asymptotic equivalence between models, and some applications will be given in the subsequent section.
2. Gaussian Location Model
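In this section we study possibly the simplest statistical model, i.e., the Gaussian location model, and will show in the next section that all regular models converge to it asymptotically. In the Gaussian location model, we have $\theta\in\mathbb{R}^d$ and observe $X\sim\mathcal{N}(\theta,\Sigma)$ with a known non-singular covariance $\Sigma$. For a bowl-shaped loss function $\ell$ (defined in Theorem 2), a natural estimator of $\theta$ is $\hat{\theta}=X$, whose worst-case risk is $\mathbb{E}\,\ell(Z)$ with $Z\sim\mathcal{N}(0,\Sigma)$. The main theorem in this section is that the natural estimator $\hat{\theta}=X$ is minimax.
Theorem 3 For any bowl-shaped loss $\ell$, we have
$$\inf_{\hat{\theta}}\sup_{\theta\in\mathbb{R}^d}\mathbb{E}_\theta\,\ell(\hat{\theta}-\theta) = \mathbb{E}\,\ell(Z)$$
for $Z\sim\mathcal{N}(0,\Sigma)$.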
The proof of Theorem 3 relies on the following important lemma for Gaussian random variables.
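Lemma 4 (Anderson’s Lemma) Let $Z\sim\mathcal{N}(0,\Sigma)$ and $\ell$ be bowl-shaped. Then
$$\min_{x\in\mathbb{R}^d}\mathbb{E}\,\ell(Z+x) = \mathbb{E}\,\ell(Z).$$
Proof: For $t\ge 0$, let $K_t=\{z:\ell(z)\le t\}\subseteq\mathbb{R}^d$. Since $\ell$ is bowl-shaped, the set $K_t$ is convex and symmetric. Moreover, since
$$\mathbb{E}\,\ell(Z+x) = \int_0^\infty \left(1-\mathbb{P}(Z\in K_t-x)\right)dt,$$
it suffices to show that $\mathbb{P}(Z\in K_t-x)\le\mathbb{P}(Z\in K_t)$ for any $x\in\mathbb{R}^d$. We shall need the following inequality in convex geometry.
Theorem 5 (Prékopa–Leindler Inequality, or Functional Brunn–Minkowski Inequality) Let $\lambda\in(0,1)$ and $f,g,h$ be non-negative real-valued measurable functions on $\mathbb{R}^d$, with
$$h((1-\lambda)x+\lambda y)\ge f(x)^{1-\lambda}g(y)^{\lambda},\qquad \forall x,y\in\mathbb{R}^d.$$
Then
$$\int_{\mathbb{R}^d}h(x)dx \ge \left(\int_{\mathbb{R}^d}f(x)dx\right)^{1-\lambda}\left(\int_{\mathbb{R}^d}g(x)dx\right)^{\lambda}.$$
Let $\phi$ be the density function of $Z$, which is log-concave. Consider functions
$$f(z)=\phi(z)\cdot 1_{K_t+x}(z),\qquad g(z)=\phi(z)\cdot 1_{K_t-x}(z),\qquad h(z)=\phi(z)\cdot 1_{K_t}(z)$$
and $\lambda=1/2$. The log-concavity of $\phi$, together with $(K_t-x)/2+(K_t+x)/2=K_t/2+K_t/2=K_t$ by convexity of $K_t$, ensures the condition of Theorem 5. Hence, Theorem 5 gives
$$\mathbb{P}(Z\in K_t)\ge\sqrt{\mathbb{P}(Z\in K_t+x)\cdot\mathbb{P}(Z\in K_t-x)}.$$
Finally, by symmetry of $Z$ and $K_t$, we have $\mathbb{P}(Z\in K_t+x)=\mathbb{P}(Z\in K_t-x)$, which completes the proof.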
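The key inequality in the proof, namely that a centered Gaussian puts the most mass on an unshifted symmetric convex set, can be checked numerically in one dimension. This is a minimal sketch, assuming $d=1$, $Z\sim\mathcal{N}(0,1)$ and the symmetric interval $K=[-1,1]$; these choices and the helper names are illustrative and not part of the lecture.

```python
import math

def gauss_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_shifted_interval(a, x, sigma=1.0):
    """P(Z in [-a, a] - x) for Z ~ N(0, sigma^2)."""
    return gauss_cdf((a - x) / sigma) - gauss_cdf((-a - x) / sigma)

a = 1.0
shifts = [k / 10.0 for k in range(-30, 31)]          # candidate shifts x in [-3, 3]
probs = [prob_shifted_interval(a, x) for x in shifts]
best_shift = shifts[probs.index(max(probs))]
print(best_shift, max(probs))
```

On this grid the probability is largest at the shift $x=0$, in line with the inequality $\mathbb{P}(Z\in K-x)\le\mathbb{P}(Z\in K)$ for symmetric convex $K$.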
Now we are ready to prove Theorem 3.
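Proof: Consider a Gaussian prior $\pi=\mathcal{N}(0,\Sigma_0)$ on $\theta$. By algebra, the posterior distribution of $\theta$ given $X=x$ is Gaussian distributed as $\mathcal{N}\left((\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}x,\ (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\right)$. By Proposition 3 in Lecture 3 and the above Anderson’s lemma, the Bayes estimator is then the posterior mean $\hat{\theta}=(\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}X$. Since the minimax risk is lower bounded by any Bayes risk (as the maximum is no less than the average), we have
$$\inf_{\hat{\theta}}\sup_{\theta\in\mathbb{R}^d}\mathbb{E}_\theta\,\ell(\hat{\theta}-\theta)\ge\int \ell\, d\mathcal{N}\left(0,(\Sigma_0^{-1}+\Sigma^{-1})^{-1}\right).$$
Since this inequality holds for any $\Sigma_0$, choosing $\Sigma_0=\lambda I$ with $\lambda\to\infty$ completes the proof of Theorem 3.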
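The limiting step can be made concrete in the scalar case with squared loss, where the Bayes risk equals the posterior variance. The sketch below assumes $d=1$, a noise variance of $2$ and a handful of prior variances $\lambda$; all of these numbers, and the function name `posterior_variance`, are illustrative assumptions.

```python
def posterior_variance(prior_var, noise_var):
    """Posterior variance (1/prior_var + 1/noise_var)^(-1) in the scalar Gaussian model."""
    return 1.0 / (1.0 / prior_var + 1.0 / noise_var)

noise_var = 2.0
prior_vars = (1.0, 10.0, 100.0, 1e4, 1e6)
# Under squared loss the Bayes risk of the posterior mean is the posterior variance.
bayes_risks = [posterior_variance(lam, noise_var) for lam in prior_vars]
print(bayes_risks)
```

The Bayes risks increase with $\lambda$ and approach the noise variance, which is the risk of the natural estimator $\hat{\theta}=X$ under squared loss.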
3. Local Asymptotic Minimax Theorem
In this section, we show that regular statistical models converge to a Gaussian location model asymptotically. To do so, we shall need verifiable criteria to establish the convergence of Le Cam’s distance, as well as the specific regularity conditions.
3.1. Likelihood Ratio Criteria for Asymptotic Equivalence
In Lecture 3 we introduced the notion of Le Cam’s model distance, and showed that it can be upper bounded via the randomization criterion. However, designing a suitable transition kernel between models is too ad-hoc and sometimes challenging, and it will be helpful if simple criteria suffice.
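The main result of this subsection is the following likelihood ratio criterion:
Theorem 6 Let $\mathcal{M}_n=(\mathcal{X}_n,\mathcal{F}_n,\{P_{n,0},\cdots,P_{n,m}\})$ and $\mathcal{M}=(\mathcal{X},\mathcal{F},\{P_0,\cdots,P_m\})$ be finite statistical models. Further assume that $\mathcal{M}$ is homogeneous in the sense that any pair in $\{P_0,\cdots,P_m\}$ is mutually absolutely continuous. Define
$$L_{n,i}(x_n)=\frac{dP_{n,i}}{dP_{n,0}}(x_n),\qquad i\in[m]$$
as the likelihood ratios, and similarly for $L_i(x)$. Then $\lim_{n\to\infty}\Delta(\mathcal{M}_n,\mathcal{M})=0$ if the distribution of $(L_{n,1}(X_n),\cdots,L_{n,m}(X_n))$ under $X_n\sim P_{n,0}$ weakly converges to that of $(L_1(X),\cdots,L_m(X))$ under $X\sim P_0$.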
In other words, Theorem 6 states that a sufficient condition for asymptotic equivalence of models is the weak convergence of likelihood ratios. Although we shall not use that, this is also a necessary condition. The finiteness assumption is mainly for technical purposes, and the general case requires proper limiting arguments.
To prove Theorem 6, we need the following notion of standard models.
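Definition 7 (Standard Model) Let $\mathcal{S}_m=\{(t_0,\cdots,t_m)\in\mathbb{R}_+^{m+1}:\sum_{i=0}^m t_i=m+1\}$, and $\sigma(\mathcal{S}_m)$ be its Borel $\sigma$-algebra. A standard distribution $\mu$ on $(\mathcal{S}_m,\sigma(\mathcal{S}_m))$ is a probability measure such that $\mathbb{E}_\mu[t_i]=1$ for any $i=0,1,\cdots,m$. The model
$$\mathcal{N}=(\mathcal{S}_m,\sigma(\mathcal{S}_m),\{Q_0,\cdots,Q_m\}),\qquad\text{with}\quad dQ_i=t_i\,d\mu,$$
is called the standard model of $\mu$.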
The following lemma shows that any finite statistical model can be transformed into an equivalent standard form.
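Lemma 8 Let $\mathcal{M}=(\mathcal{X},\mathcal{F},\{P_0,\cdots,P_m\})$ be a finite model, and $\mathcal{N}$ be a standard model with standard distribution $\mu$ being the distribution of $t:=\left(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}}\right)\in\mathcal{S}_m$ under the mean measure $\overline{P}=\frac{1}{m+1}\sum_{i=0}^m P_i$. Then $\Delta(\mathcal{M},\mathcal{N})=0$.
Proof: Since $\mathbb{E}_{\overline{P}}[dP_i/d\overline{P}]=1$, the measure $\mu$ is a standard distribution. Moreover, let $Q_i$ be the distribution of $t$ under $P_i$, then
$$\mathbb{E}_{Q_i}[h(t)]=\mathbb{E}_{P_i}[h(t)]=\mathbb{E}_{\overline{P}}\left[h(t)\frac{dP_i}{d\overline{P}}\right]=\mathbb{E}_\mu[h(t)\,t_i]$$
for any measurable function $h$, which gives $dQ_i=t_i\,d\mu$, agreeing with the standard model. Finally, since $dP_i(x)=t_i(x)\,d\overline{P}(x)$ depends on $x$ only through $t$, by the factorization criterion (e.g., Theorem 7 in Lecture 3) we conclude that the statistic $t$ is sufficient, and therefore $\Delta(\mathcal{M},\mathcal{N})=0$.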
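The construction in Lemma 8 can be traced by hand on a small discrete example. The following sketch assumes a hypothetical model with two distributions ($m=1$) on a three-point sample space; the probabilities are made up for illustration, and exact rational arithmetic is used so that the identities hold without rounding.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical finite model: two distributions (m = 1) on a 3-point sample space.
P = [
    [F(1, 2), F(1, 4), F(1, 4)],   # P_0
    [F(1, 6), F(1, 4), F(7, 12)],  # P_1
]
m = len(P) - 1
n_points = len(P[0])

# Mean measure and the statistic t(x) = (dP_0/dPbar, ..., dP_m/dPbar)(x).
P_bar = [sum(P[i][x] for i in range(m + 1)) / (m + 1) for x in range(n_points)]
t = [tuple(P[i][x] / P_bar[x] for i in range(m + 1)) for x in range(n_points)]

# Standard distribution mu: the law of t(X) under X ~ P_bar.
mu = defaultdict(F)
for x in range(n_points):
    mu[t[x]] += P_bar[x]

# Standard model Q_i(dt) = t_i mu(dt), compared with the law of t(X) under X ~ P_i.
Q = [{pt: pt[i] * mass for pt, mass in mu.items()} for i in range(m + 1)]
pushforward = []
for i in range(m + 1):
    law = defaultdict(F)
    for x in range(n_points):
        law[t[x]] += P[i][x]
    pushforward.append(dict(law))

print(dict(mu))
```

Here every value of $t$ lies on the simplex $\{t_0+t_1=2\}$, each coordinate has mean one under $\mu$, and $Q_i$ coincides with the distribution of $t$ under $P_i$.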
Lemma 8 helps to convert the sample space of all finite models to the simplex $\mathcal{S}_m$, and comparison between models is reduced to the comparison between their standard distributions. Consequently, we have the following quantitative bound on the model distance between finite models.
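Lemma 9 Let $\mathcal{M}=(\mathcal{X},\mathcal{F},\{P_0,\cdots,P_m\})$ and $\mathcal{N}=(\mathcal{Y},\mathcal{G},\{Q_0,\cdots,Q_m\})$ be two finite models with standard distributions $\mu_1,\mu_2$ respectively. Then
$$\Delta(\mathcal{M},\mathcal{N})\le(m+1)\cdot\|\mu_1-\mu_2\|_{\text{D}}:=(m+1)\cdot\sup_{\phi}\left|\mathbb{E}_{\mu_1}\phi-\mathbb{E}_{\mu_2}\phi\right|,$$
where $\|\mu_1-\mu_2\|_{\text{D}}$ denotes Dudley’s metric between probability measures $\mu_1,\mu_2$, and the supremum is taken over all measurable functions $\phi:\mathcal{S}_m\to\mathbb{R}$ with $\|\phi\|_\infty\le 1$ and $|\phi(t_1)-\phi(t_2)|\le\|t_1-t_2\|_\infty$ for any $t_1,t_2\in\mathcal{S}_m$.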
Remark 1 Recall that Dudley’s metric metrizes the weak convergence of probability measures on a metric space with its Borel $\sigma$-algebra. The fact that it is smaller than the total variation distance will be crucial to establish Theorem 6.
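Proof: Similar to the proof of the randomization criterion (Theorem 5 in Lecture 3), the following upper bound on the model distance holds:
$$\Delta(\mathcal{M},\mathcal{N})\le\sup_{L,\pi}\left|R_{L,\pi}^\star(\mathcal{M})-R_{L,\pi}^\star(\mathcal{N})\right|,$$
where $R_{L,\pi}^\star(\mathcal{M})$ denotes the Bayes risk of model $\mathcal{M}$ under loss $L$ and prior $\pi$, and the loss $L$ is non-negative and upper bounded by one in the supremum. By algebra, the Bayes risk admits the following simple form under the standard model with standard distribution $\mu$:
$$R_{L,\pi}^\star(\mu)=\int_{\mathcal{S}_m}\left(\inf_{c(t)\in\mathcal{C}_{L,\pi}(t)}c(t)^\top t\right)\mu(dt),$$
where the set $\mathcal{C}_{L,\pi}(t)$ is defined as
$$\mathcal{C}_{L,\pi}(t)=\left\{\left(\pi_0\int L(0,a)\delta(t,da),\cdots,\pi_m\int L(m,a)\delta(t,da)\right):\ \text{stochastic kernel }\delta:\mathcal{S}_m\to\mathcal{A}\right\}.$$
Since the diameter of $\mathcal{C}_{L,\pi}(t)$ in $\|\cdot\|_1$ is at most one, we conclude that $t\mapsto\inf_{c(t)\in\mathcal{C}_{L,\pi}(t)}c(t)^\top t$ is upper bounded by $m+1$ and $1$-Lipschitz under $\|\cdot\|_\infty$. The rest of the proof follows from the definition of Dudley’s metric.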
Finally we are ready to present the proof of Theorem 6. Note that there is a bijective map between $(L_1,\cdots,L_m)$ and $\left(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}}\right)$, which is continuous under the model $\mathcal{M}$ due to the homogeneity assumption. Then by the continuous mapping theorem (see the remark below), the weak convergence of likelihood ratios implies the weak convergence of their standard distributions. Since Dudley’s metric metrizes the weak convergence of probability measures, the result of Lemma 9 completes the proof.
Remark 2 The continuous mapping theorem for weak convergence states that, if Borel-measurable random variables $X_n$ converge weakly to $X$ on a metric space, and $f$ is a function continuous on a set $C$ such that $\mathbb{P}(X\in C)=1$, then $f(X_n)$ also converges weakly to $f(X)$. Note that the function $f$ is only required to be continuous on the support of the limiting random variable $X$.
3.2. Locally Asymptotically Normal (LAN) Models
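Motivated by Theorem 6, in order to prove that certain models asymptotically become normal, we may show that the likelihood functions weakly converge to those in the normal model. Note that for the Gaussian location model $\mathcal{M}=(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d),(\mathcal{N}(\theta,\Sigma))_{\theta\in\mathbb{R}^d})$, the log-likelihood ratio is given by
$$\log\frac{dP_{\theta+h}}{dP_\theta}(x)=h^\top\Sigma^{-1}(x-\theta)-\frac{1}{2}h^\top\Sigma^{-1}h, \tag{3}$$
where $\Sigma^{-1}(x-\theta)\sim\mathcal{N}(0,\Sigma^{-1})$ for $x\sim P_\theta$. Equation (3) motivates the following definition of locally asymptotically normal (LAN) models, in which the likelihood function looks like (3).
Definition 10 (Local Asymptotic Normality) A sequence of models $\mathcal{M}_n=(\mathcal{X}_n,\mathcal{F}_n,(P_{n,h})_{h\in\Theta_n})$ with $\Theta_n\uparrow\mathbb{R}^d$ is called locally asymptotically normal (LAN) with central sequence $Z_n$ and Fisher information matrix $I_0$ if
$$\log\frac{dP_{n,h}}{dP_{n,0}}=h^\top Z_n-\frac{1}{2}h^\top I_0h+r_n(h), \tag{4}$$
with $Z_n\overset{d}{\to}\mathcal{N}(0,I_0)$ under $P_{n,0}$, and $r_n(h)$ converges to zero in probability under $P_{n,0}$ for any fixed $h\in\mathbb{R}^d$.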
Based on the form of the likelihood ratio in (4), the following theorem is then immediate.
Theorem 11 If a sequence of models $\mathcal{M}_n$ satisfies the LAN condition with Fisher information matrix $I_0$, then $\lim_{n\to\infty}\Delta(\mathcal{M}_n,\mathcal{M})=0$ for $\mathcal{M}=(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d),(\mathcal{N}(\theta,I_0^{-1}))_{\theta\in\mathbb{R}^d})$.
Proof: Note that for any finite sub-model, Slutsky’s theorem applied to (4) gives the desired convergence in distribution, and clearly the Gaussian location model is homogeneous. Now applying Theorem 6 gives the desired convergence. We leave the discussion of the general case to the bibliographic notes.
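Now the only remaining task is to check the likelihood ratios for some common models and show that the LAN condition is satisfied. For example, for QMD models $(P_\theta)_{\theta\in\Theta}$, we heuristically have
$$\log\frac{dP_{\theta+h/\sqrt{n}}^{\otimes n}}{dP_\theta^{\otimes n}}(X)=h^\top\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n\dot{\ell}_\theta(X_i)\right)+\frac{1}{2}h^\top\left(\frac{1}{n}\sum_{i=1}^n\ddot{\ell}_\theta(X_i)\right)h+o_{P_\theta^{\otimes n}}(1),$$
where $\ddot{\ell}_\theta$ denotes the Hessian of the log-likelihood, and intuitively the CLT and LLN lead to the desired form (4). The next proposition makes this intuition precise.
Proposition 12 Let $(P_\theta)_{\theta\in\Theta}$ be QMD in an open set $\Theta\subseteq\mathbb{R}^d$ with Fisher information matrix $I(\theta)$. Then the sequence of models $\mathcal{M}_n=(\mathcal{X}^n,\mathcal{F}^{\otimes n},\{P_{\theta+h/\sqrt{n}}^{\otimes n}:h\in\Theta_n\})$, with $\Theta_n=\{h\in\mathbb{R}^d:\theta+h/\sqrt{n}\in\Theta\}\uparrow\mathbb{R}^d$, satisfies the LAN condition with Fisher information $I(\theta)$.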
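Proof: Write $P_n=P_{\theta+h/\sqrt{n}}$, $P=P_\theta$ and $g=h^\top\dot{\ell}_\theta$. Then by Taylor expansion,
$$\log\frac{dP_n^{\otimes n}}{dP^{\otimes n}}(X)=\sum_{i=1}^n\log\frac{dP_n}{dP}(X_i)=2\sum_{i=1}^n\log\left(1+\frac{1}{2}W_{ni}\right)=\sum_{i=1}^nW_{ni}-\frac{1}{4}\sum_{i=1}^nW_{ni}^2+\sum_{i=1}^nW_{ni}^2r(W_{ni}),$$
where $W_{ni}:=2\left(\sqrt{(dP_n/dP)(X_i)}-1\right)$, and $r(w)=o(1)$ as $w\to 0$. By the QMD condition, we have
$$\text{Var}\left(\sum_{i=1}^nW_{ni}-\frac{1}{\sqrt{n}}\sum_{i=1}^ng(X_i)\right)=n\,\text{Var}\left(W_{n1}-\frac{g(X_1)}{\sqrt{n}}\right)=o(1).$$
Moreover, $\mathbb{E}_P\,g(X_i)=0$ by the property of the score function, and
$$\mathbb{E}\left[\sum_{i=1}^nW_{ni}\right]=-n\int\left(\sqrt{dP_n}-\sqrt{dP}\right)^2\to-\frac{1}{4}\mathbb{E}_P[g(X)^2].$$
Consequently, we conclude that
$$\log\frac{dP_n^{\otimes n}}{dP^{\otimes n}}(X)=\frac{1}{\sqrt{n}}\sum_{i=1}^ng(X_i)-\frac{1}{4}\mathbb{E}_P[g(X)^2]-\frac{1}{4}\sum_{i=1}^nW_{ni}^2+\sum_{i=1}^nW_{ni}^2r(W_{ni})+o_{P^{\otimes n}}(1).$$
For the sum of squares, the QMD condition gives $nW_{ni}^2=g(X_i)^2+o_P(1)$, and therefore the LLN gives $\sum_{i=1}^nW_{ni}^2\overset{p}{\to}\mathbb{E}_P[g(X)^2]$. For the last term, Markov’s inequality gives $n\mathbb{P}(|W_{ni}|\ge\varepsilon)\to 0$, and therefore $\max_{i\in[n]}|r(W_{ni})|=o_P(1)$. Since $\mathbb{E}_P[g(X)^2]=h^\top I(\theta)h$, the log-likelihood ratio takes the form (4) with central sequence $Z_n=\frac{1}{\sqrt{n}}\sum_{i=1}^n\dot{\ell}_\theta(X_i)$, as desired.
In other words, Proposition 12 implies that all regular statistical models locally look like a normal model, where the local radius is $\Theta(n^{-1/2})$.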
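The LAN expansion can be checked numerically on a simple model. The sketch below assumes the Bernoulli($p$) family, whose score is $(x-p)/(p(1-p))$ and whose Fisher information is $1/(p(1-p))$; the values of $p$ and $h$, the choice of a data set whose success count sits about one standard deviation above its mean, and the function name `lan_remainder` are all illustrative assumptions rather than part of the lecture.

```python
import math

p, h = 0.3, 1.0                      # hypothetical base parameter and local direction
fisher = 1.0 / (p * (1.0 - p))       # Fisher information of Bernoulli(p)

def lan_remainder(n):
    """|exact log-likelihood ratio - LAN quadratic approximation| for one data set."""
    # Assumed data: the number of successes sits about one standard deviation above n*p.
    successes = round(n * p + math.sqrt(n * p * (1.0 - p)))
    p_local = p + h / math.sqrt(n)
    exact = (successes * math.log(p_local / p)
             + (n - successes) * math.log((1.0 - p_local) / (1.0 - p)))
    # Central sequence Z_n = n^(-1/2) * sum of scores, with score (x - p) / (p (1 - p)).
    z_n = (successes - n * p) / (math.sqrt(n) * p * (1.0 - p))
    approx = h * z_n - 0.5 * h * h * fisher
    return abs(exact - approx)

remainders = [lan_remainder(n) for n in (10**2, 10**4, 10**6)]
print(remainders)
```

For this particular data set the gap between the exact log-likelihood ratio and $hZ_n-\frac{1}{2}h^2I(p)$ shrinks as $n$ grows, consistent with a remainder that vanishes in probability.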
3.3. Proof of LAM Theorem
Now we are ready to glue all necessary ingredients together. First, for product QMD statistical models, Proposition 12 implies that the LAN condition is satisfied for the local model around any chosen parameter $\theta$. Second, by Theorem 11, these local models converge to a Gaussian location model with covariance $I(\theta)^{-1}$. By definition of the model distance, the minimax risk of these local models is asymptotically no smaller than that of the limiting Gaussian location model, which by Theorem 3 is $\mathbb{E}\,\ell(Z)$ with $Z\sim\mathcal{N}(0,I(\theta)^{-1})$ for bowl-shaped loss $\ell$. Consequently, we have the following local asymptotic minimax theorem.
Theorem 13 (LAM, restated) Let $(P_\theta)_{\theta\in\Theta\subseteq\mathbb{R}^d}$ be a QMD statistical model which admits a non-singular Fisher information $I(\theta_0)$ at $\theta_0$. Let $\psi:\Theta\to\mathbb{R}^p$ be differentiable at $\theta=\theta_0$, and $T_n$ be an estimator sequence of $\psi(\theta)$ in the model $(P_\theta^{\otimes n})$. Consider any compact action space $\mathcal{A}\subset\mathbb{R}^p$, and any bowl-shaped loss function $\ell:\mathbb{R}^p\to\mathbb{R}_+$. Then
$$\lim_{c\to\infty}\liminf_{n\to\infty}\sup_{\|h\|\le c}\mathbb{E}_{\theta_0+\frac{h}{\sqrt{n}}}\,\ell\left(\sqrt{n}\left(T_n-\psi\left(\theta_0+\frac{h}{\sqrt{n}}\right)\right)\right)\ge\mathbb{E}\,\ell(Z)$$
with $Z\sim\mathcal{N}(0,\dot{\psi}_{\theta_0}I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)$.
Note that here the compactness of the action space is required for the limiting arguments, while all our previous analyses consider finite models. Our arguments via model distance are also different from those used by Hájek and Le Cam, who introduced the notion of contiguity to arrive at the same result under weaker conditions. We refer to the bibliographic notes for further details of these alternative approaches.
4. Applications and Limitations
In this section, we will apply the LAM theorem to prove asymptotic lower bounds for both parametric and nonparametric problems. We will also discuss the limitations of LAM to motivate the necessity of future lectures.
4.1. Parametric Entropy Estimation
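Consider the discrete i.i.d. sampling model $X_1,\cdots,X_n\sim P=(p_1,\cdots,p_k)\in\mathcal{M}_k$, where $\mathcal{M}_k$ denotes the probability simplex on $k$ elements. The target is to estimate the Shannon entropy
$$H(P):=\sum_{i=1}^k-p_i\log p_i$$
under the mean squared loss. We can apply LAM to prove a local minimax lower bound for this problem.
First we compute the Fisher information of the multinomial model $X\sim P=(p_1,\cdots,p_k)$, where we set $\theta=(p_1,\cdots,p_{k-1})$ to be the free parameters. It is easy to show that
$$I(\theta)=\text{diag}(p_1^{-1},\cdots,p_{k-1}^{-1})+p_k^{-1}{\bf 1}{\bf 1}^\top.$$
By the matrix inversion formula $(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$, we have
$$I(\theta)^{-1}=\text{diag}(\theta)-\theta\theta^\top.$$
Now choosing $\psi(\theta)=\left(\sum_{i=1}^{k-1}-\theta_i\log\theta_i\right)-\left(1-\sum_{i=1}^{k-1}\theta_i\right)\log\left(1-\sum_{i=1}^{k-1}\theta_i\right)$ and $\ell(t)=t^2$ in LAM, after some algebra we arrive at
$$\inf_{\hat{H}}\sup_{P\in\mathcal{M}_k}\mathbb{E}_P(\hat{H}-H(P))^2\ge\frac{1+o_n(1)}{n}\sup_{P\in\mathcal{M}_k}\text{Var}\left[\log\frac{1}{P(X)}\right].$$
Note that the $\sup$ on the RHS is due to our arbitrary choice of the centering $\theta$.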
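The "some algebra" behind this bound can be verified numerically for a fixed distribution. This is a sketch under the assumption of a hypothetical distribution $P=(0.2,0.3,0.5)$ on $k=3$ symbols; it checks that $\text{diag}(\theta)-\theta\theta^\top$ inverts the Fisher information and that $\dot{\psi}_\theta I(\theta)^{-1}\dot{\psi}_\theta^\top$, with gradient entries $\partial\psi/\partial\theta_i=\log(p_k/\theta_i)$, equals $\text{Var}[\log(1/P(X))]$.

```python
import math

p = [0.2, 0.3, 0.5]          # hypothetical distribution on k = 3 symbols
theta = p[:-1]               # free parameters (p_1, ..., p_{k-1})
d = len(theta)

# Fisher information diag(1/theta) + (1/p_k) 1 1^T and its claimed inverse.
fisher = [[(1.0 / theta[i] if i == j else 0.0) + 1.0 / p[-1] for j in range(d)]
          for i in range(d)]
fisher_inv = [[(theta[i] if i == j else 0.0) - theta[i] * theta[j] for j in range(d)]
              for i in range(d)]
product = [[sum(fisher[i][l] * fisher_inv[l][j] for l in range(d)) for j in range(d)]
           for i in range(d)]

# Gradient of the entropy in the free parameters: d psi / d theta_i = log(p_k / theta_i).
grad = [math.log(p[-1] / theta[i]) for i in range(d)]
lam_bound = sum(grad[i] * fisher_inv[i][j] * grad[j] for i in range(d) for j in range(d))

# Variance of the information density log(1 / P(X)) under X ~ P.
mean_info = sum(q * -math.log(q) for q in p)
var_info = sum(q * math.log(q) ** 2 for q in p) - mean_info ** 2
print(lam_bound, var_info)
```

The two printed numbers agree up to floating-point error, which is the identity used to pass from the LAM theorem to the variance expression in the lower bound.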
4.2. Nonparametric Entropy Estimation
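Consider a continuous i.i.d. sampling model from some density, i.e., $X_1,\cdots,X_n\sim f$. Assume that the density $f$ is supported on $\mathbb{R}$, and the target is to estimate the differential entropy
$$h(f):=\int_{\mathbb{R}}-f(x)\log f(x)dx.$$
As before, we would like to prove a local minimax lower bound for the mean squared error around any target density $f_0$. However, since the model is nonparametric and the parameter $f$ has an infinite dimension, there is no Fisher information matrix for this model. To overcome this difficulty, we may consider a one-dimensional parametric submodel instead.
Let $g:\mathbb{R}\to\mathbb{R}$ be any measurable function with $\text{supp}(g)\subseteq\text{supp}(f_0)$ and $\int_{\mathbb{R}}g(x)dx=0$, then $f_0+tg$ is still a valid density on $\mathbb{R}$ for small $|t|$. Consequently, keeping $t$ small, the i.i.d. sampling model $X_1,\cdots,X_n\sim f_0+tg$ becomes a submodel parametrized only by $t$. For this 1-D parametric submodel, the Fisher information at $t=0$ can be computed as
$$I_0=\int_{\mathbb{R}}\frac{g^2(x)}{f_0(x)}dx.$$
Setting $\psi(t)=h(f_0+tg)$ in LAM, we have
$$\psi'(0)=\int_{\mathbb{R}}g(x)(-1-\log f_0(x))dx,$$
and consequently
$$\inf_{\hat{h}}\sup_f\mathbb{E}_f(\hat{h}-h(f))^2\ge\frac{1+o_n(1)}{n}\left(\int_{\mathbb{R}}\frac{g^2(x)}{f_0(x)}dx\right)^{-1}\left(\int_{\mathbb{R}}g(x)(1+\log f_0(x))dx\right)^2.$$
Since our choice of the test function $g$ is arbitrary, we may actually choose the worst-case $g$ such that the above lower bound is maximized. We claim that the maximum value is
$$V(f_0):=\int_{\mathbb{R}}f_0(x)\log^2f_0(x)dx-h(f_0)^2.$$
Clearly, this value is attained for the test function $g(x)=f_0(x)(-\log f_0(x)-h(f_0))$. For the maximality, the Cauchy–Schwarz inequality and the assumption $\int_{\mathbb{R}}g(x)dx=0$ give
$$\left(\int_{\mathbb{R}}g(x)(1+\log f_0(x))dx\right)^2=\left(\int_{\mathbb{R}}g(x)(\log f_0(x)+h(f_0))dx\right)^2\le\int_{\mathbb{R}}\frac{g(x)^2}{f_0(x)}dx\cdot\int_{\mathbb{R}}f_0(x)(\log f_0(x)+h(f_0))^2dx=V(f_0)\cdot\int_{\mathbb{R}}\frac{g(x)^2}{f_0(x)}dx.$$
Therefore, the parametric lower bound for nonparametric entropy estimation is
$$\inf_{\hat{h}}\sup_f\mathbb{E}_f(\hat{h}-h(f))^2\ge\frac{1+o_n(1)}{n}\cdot V(f_0).$$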
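As a sanity check, the quantities in this subsection can be evaluated numerically for a concrete density. The sketch below assumes $f_0$ is the $\mathcal{N}(0,\sigma^2)$ density with $\sigma=1.5$ (an illustrative choice), for which $V(f_0)=1/2$ regardless of $\sigma$, and uses a plain Riemann sum on a truncated grid; the grid parameters and variable names are assumptions made for this example.

```python
import math

sigma = 1.5                       # hypothetical choice: f0 is the N(0, sigma^2) density
half_width, n_steps = 12.0 * sigma, 40_000
step = 2.0 * half_width / n_steps
grid = [-half_width + step * i for i in range(n_steps + 1)]

f = [math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi)) for x in grid]
log_f = [math.log(v) for v in f]

def integrate(values):
    """Riemann sum on the uniform grid; the Gaussian tails outside it are negligible."""
    return step * sum(values)

entropy = integrate([-v * lv for v, lv in zip(f, log_f)])             # h(f0)
V = integrate([v * lv * lv for v, lv in zip(f, log_f)]) - entropy ** 2  # V(f0)

# Candidate worst-case direction g = f0 * (-log f0 - h(f0)).
g = [v * (-lv - entropy) for v, lv in zip(f, log_f)]
mass_g = integrate(g)                                             # should vanish
fisher_sub = integrate([gv * gv / v for gv, v in zip(g, f)])      # I_0 of the submodel
slope = integrate([gv * (-1.0 - lv) for gv, lv in zip(g, log_f)]) # psi'(0)
bound = slope ** 2 / fisher_sub                                   # psi'(0)^2 / I_0
print(entropy, V, bound)
```

In this example the direction $g$ integrates to zero and the submodel bound $\psi'(0)^2/I_0$ matches $V(f_0)$, illustrating that the claimed maximizer attains the Cauchy–Schwarz upper bound.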
4.3. Limitations of Classical Asymptotics
The theorems from classical asymptotics can typically help to prove an error bound with an explicit constant, and it is also known that these bounds are optimal and achieved by MLE. However, there are still some problems in these approaches:
- Non-asymptotic vs. asymptotic: Asymptotic bounds are useful only in scenarios where the problem size remains fixed and the sample size grows to infinity, and there is no general guarantee of when we have entered the asymptotic regime (it may even require that $n\gg e^d$). In practice, many modern problems are high-dimensional ones where the number of parameters is comparable to or even larger than the sample size (e.g., an over-parametrized neural network), and some key properties of the problem may be entirely obscured in the asymptotic regime.
- Parametric vs. nonparametric: The results in classical asymptotics may not be helpful for a large number of nonparametric problems, where the main difficulty is their infinite-dimensional nature. Although sometimes the parametric reduction is helpful (e.g., the entropy example), the parametric rate is in general not attainable in nonparametric problems and some other tools are necessary. For example, if we would like to estimate the density $f$ at some point $x_0$, the worst-case test function will actually give a vacuous lower bound (which is infinity).
- Global vs. local: As the name of LAM suggests, the minimax lower bound here holds locally. However, the global structure of some problems may also be important, and the global minimax lower bound may be much larger than the supremum of local bounds over all possible points. For example, in Shannon entropy estimation, the bias is actually dominating the problem and cannot be reflected in local methods.
To overcome these difficulties, we need to develop tools to establish non-asymptotic results for possibly high-dimensional or nonparametric problems, which is the focus of the rest of the lecture series.
5. Bibliographic Notes
The asymptotic theorems in Theorem 2 are first presented in Hájek (1970) and Hájek (1972), and we refer to Le Cam (1986), Le Cam and Yang (1990) and van der Vaart (2000) as excellent textbooks. Here the approach of using model distance to establish LAM is taken from Section 4, 6 of Liese and Miescke (2007); also see Le Cam (1972).
There is another line of approach to establish the asymptotic theorems. A key concept is the contiguity proposed by Le Cam (1960), which enables an asymptotic change of measure. Based on contiguity and LAN condition, the distribution of any (regular) estimator under the local alternative can be evaluated. Then the convolution theorem can be shown, which helps to establish LAM; details can be found in van der Vaart (2000). LAM theorem can also be established directly by computing the asymptotic Bayes risk under proper priors; see Section 6 of Le Cam and Yang (1990).
For parametric or nonparametric entropy estimation, we refer to recent papers (Jiao et al. (2015) and Wu and Yang (2016) for the discrete case, Berrett, Samworth and Yuan (2019) and Han et al. (2017) for the continuous case) and the references therein.
- Jaroslav Hájek, A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14.4 (1970): 323-330.
- Jaroslav Hájek, Local asymptotic minimax and admissibility in estimation. Proceedings of the sixth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1972.
- Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986.
- Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990.
- Aad W. Van der Vaart, Asymptotic statistics. Vol. 3. Cambridge university press, 2000.
- Friedrich Liese and Klaus-J. Miescke. Statistical decision theory. Springer, New York, NY, 2007.
- Lucien M. Le Cam, Limits of experiments. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
- Lucien M. Le Cam, Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98 (1960).
- Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman, Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835-2885.
- Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.
- Thomas B. Berrett, Richard J. Samworth, and Ming Yuan, Efficient multivariate entropy estimation via $k$-nearest neighbour distances. The Annals of Statistics 47.1 (2019): 288-318.
- Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu, Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141 (2017).