(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)
In this lecture, we establish the asymptotic lower bounds for general statistical decision problems. Specifically, we show that models satisfying the benign local asymptotic normality (LAN) condition are asymptotically equivalent to a Gaussian location model, under which the Hájek–Le Cam local asymptotic minimax (LAM) lower bound holds. We also apply this theorem to both parametric and nonparametric problems.
1. History of Asymptotic Statistics
To begin with, we recall the notions of the score function and the Fisher information, which can be found in most textbooks.
Definition 1 (Fisher Information) A family of distributions $(P_\theta)_{\theta\in\Theta\subseteq\mathbb{R}^d}$ on $(\mathcal{X},\mathcal{F})$ is quadratic mean differentiable (QMD) at $\theta\in\Theta$ if there exists a score function $\dot{\ell}_\theta:\mathcal{X}\to\mathbb{R}^d$ such that
$$\int\left(\sqrt{dP_{\theta+h}}-\sqrt{dP_\theta}-\frac{1}{2}h^\top\dot{\ell}_\theta\sqrt{dP_\theta}\right)^2 = o(\|h\|^2), \qquad h\to 0.$$
In this case, the matrix $I(\theta) := \mathbb{E}_\theta[\dot{\ell}_\theta\dot{\ell}_\theta^\top]$ exists and is called the Fisher information at $\theta$.
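For instance, for the Gaussian location family that will reappear in Example 1 below, the score function and the Fisher information take the familiar form
$$P_\theta=\mathcal{N}(\theta,1):\qquad \dot{\ell}_\theta(x)=\frac{\partial}{\partial\theta}\log\frac{dP_\theta}{dx}(x)=x-\theta,\qquad I(\theta)=\mathbb{E}_\theta\left[(X-\theta)^2\right]=1.$$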
R. A. Fisher popularized the above Fisher information and the usage of the maximum likelihood estimator (MLE) starting from the 1920s. He believed that the Fisher information of a statistical model characterizes the fundamental limit of estimating $\theta$ based on $n$ i.i.d. observations from $P_\theta$, and that the MLE asymptotically attains this limit. More specifically, he made the following conjectures:
- For any asymptotically normal estimator sequence $T_n$ such that $\sqrt{n}(T_n-\theta)\to\mathcal{N}(0,\Sigma_\theta)$ in distribution under $P_\theta$ for any $\theta\in\Theta$, there must be
$$\Sigma_\theta \succeq I(\theta)^{-1}. \qquad (1)$$
- The MLE satisfies that $\sqrt{n}(\hat{\theta}_{\mathrm{MLE}}-\theta)\to\mathcal{N}(0,I(\theta)^{-1})$ in distribution under $P_\theta$ for any $\theta\in\Theta$.
Although the second conjecture is easier to establish assuming certain regularity conditions, the first conjecture, which seems plausible in view of the well-known Cramér–Rao bound, actually caused some trouble when people tried to prove it. The following example shows that (1) may be quite problematic.
Example 1 Here is a counterexample to (1) proposed by Hodges in 1951. Let $\Theta=\mathbb{R}$, and consider a Gaussian location model where $X_1,\dots,X_n$ are i.i.d. distributed as $\mathcal{N}(\theta,1)$. A natural estimator of $\theta$ is the empirical mean $\bar{X}_n=\frac{1}{n}\sum_{i=1}^n X_i$, and the Fisher information is $I(\theta)\equiv 1$. The Hodges' estimator is constructed as follows:
$$T_n = \begin{cases} \bar{X}_n & \text{if } |\bar{X}_n|\ge n^{-1/4}, \\ 0 & \text{if } |\bar{X}_n|< n^{-1/4}. \end{cases} \qquad (2)$$
It is easy to show that $\sqrt{n}(T_n-\theta)\to\mathcal{N}(0,\sigma_\theta^2)$ in distribution under $P_\theta$ for any $\theta\in\mathbb{R}$, with $\sigma_\theta^2=1$ for non-zero $\theta$ and $\sigma_\theta^2=0$ for $\theta=0$. Consequently, (1) does not hold for the Hodges' estimator. The same result holds if the threshold $n^{-1/4}$ in (2) is changed to any sequence $(a_n)$ with $a_n\to 0$ and $\sqrt{n}\,a_n\to\infty$.
Hodges' example suggests that (1) should at least be weakened in appropriate ways. Observing the structure of Hodges' estimator (2) carefully, there are three possible attempts (see also the simulation sketch after this list):
- The estimator $T_n$ is superefficient (i.e., (1) fails) only at the single point $\theta=0$. It may be expected that the set of $\theta$ such that (1) fails is quite small.
- Although the Hodges' estimator satisfies the asymptotic normality condition, i.e., $\sqrt{n}(T_n-\theta)$ weakly converges to a normal distribution under $P_\theta$, for any non-zero perturbation $h$ the sequence $\sqrt{n}(T_n-\theta-h/\sqrt{n})$ does not weakly converge to the same distribution under $P_{\theta+h/\sqrt{n}}$ when $\theta=0$. Hence, we may expect that (1) actually holds for more regular estimator sequences.
- Let $R_n(\theta):=\mathbb{E}_\theta|T_n-\theta|$ be the risk function of the estimator $T_n$ under the absolute value loss. It can be computed that $R_n(0)=o(n^{-1/2})$, while $\sup_{|\theta|\le n^{-1/4}}R_n(\theta)=\Omega(n^{-1/4})$. In other words, the worst-case risk over an interval of size $n^{-1/4}$ around $\theta=0$ is still of the order $n^{-1/4}$, which is considerably larger than the $O(n^{-1/2})$ risk at the single point $\theta=0$. Therefore, it may make sense to consider the local minimax risk.
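The following is a small illustrative simulation in Python (the sample sizes, the number of Monte Carlo replications, and the absolute value loss are arbitrary choices): it shows the superefficiency of the Hodges' estimator at $\theta=0$ and the blow-up of its risk at a point of order $n^{-1/4}$ near the threshold.
```python
import numpy as np

rng = np.random.default_rng(0)

def hodges(xbar, n):
    # Hodges' estimator (2): keep the empirical mean unless it falls below the threshold n^{-1/4}
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

def scaled_risk(theta, n, reps=200_000):
    # sqrt(n) times the absolute-error risk, estimated by Monte Carlo; note X_bar ~ N(theta, 1/n)
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)
    return (np.sqrt(n) * np.mean(np.abs(xbar - theta)),
            np.sqrt(n) * np.mean(np.abs(hodges(xbar, n) - theta)))

for n in [100, 10_000, 1_000_000]:
    at_zero = scaled_risk(0.0, n)                  # superefficiency at theta = 0
    near_threshold = scaled_risk(n ** (-0.25), n)  # local risk blow-up at theta = n^{-1/4}
    print(n, "empirical mean / Hodges at 0:", at_zero,
          "| Hodges near n^(-1/4):", near_threshold[1])
```
As $n$ grows, the scaled risk of the Hodges' estimator vanishes at $\theta=0$ but grows like $n^{1/4}$ near the threshold, in line with the third point above.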
It turns out that all these attempts can be successful, and the following theorem summarizes the key results of asymptotic statistics developed by J. Hájek and L. Le Cam in the 1970s.
Theorem 2 (Asymptotic Theorems) Let $(P_\theta)_{\theta\in\Theta\subseteq\mathbb{R}^d}$ be a QMD statistical model which admits a non-singular Fisher information $I(\theta_0)$ at $\theta_0$. Let $\psi:\Theta\to\mathbb{R}^{d'}$ be differentiable at $\theta_0$ with derivative $\dot{\psi}(\theta_0)$, and $T_n$ be an estimator sequence of $\psi(\theta)$ in the model $(P_\theta^{\otimes n})_{\theta\in\Theta}$.
1. (Almost everywhere convolution theorem) If $\sqrt{n}(T_n-\psi(\theta))$ converges in distribution to some probability measure $Q_\theta$ under $P_\theta^{\otimes n}$ for every $\theta$, and $I(\theta)$ is non-singular for every $\theta$, then there exists some probability measure $M_\theta$ such that
$$Q_\theta = \mathcal{N}\left(0,\dot{\psi}(\theta)I(\theta)^{-1}\dot{\psi}(\theta)^\top\right)*M_\theta$$
for Lebesgue almost every $\theta$, where $*$ denotes the convolution.
2. (Convolution theorem) If $\sqrt{n}(T_n-\psi(\theta_0+h/\sqrt{n}))$ converges in distribution to some probability measure $Q$ under $P_{\theta_0+h/\sqrt{n}}^{\otimes n}$ for $h=0$, and $T_n$ is regular in the sense that $\sqrt{n}(T_n-\psi(\theta_0+h/\sqrt{n}))$ weakly converges to the same limit $Q$ under $P_{\theta_0+h/\sqrt{n}}^{\otimes n}$ for every $h\in\mathbb{R}^d$, then there exists some probability measure $M$ such that
$$Q = \mathcal{N}\left(0,\dot{\psi}(\theta_0)I(\theta_0)^{-1}\dot{\psi}(\theta_0)^\top\right)*M.$$
3. (Local asymptotic minimax theorem) Let $L:\mathbb{R}^{d'}\to\mathbb{R}_+$ be a bowl-shaped loss function, i.e., $L$ is non-negative, symmetric and quasi-convex. In mathematical words, $L(x)=L(-x)\ge 0$ and the sublevel sets $\{x: L(x)\le t\}$ are convex for all $t\ge 0$. Then
$$\liminf_{c\to\infty}\liminf_{n\to\infty}\sup_{\|h\|\le c}\mathbb{E}_{\theta_0+h/\sqrt{n}}\,L\left(\sqrt{n}\left(T_n-\psi\left(\theta_0+\frac{h}{\sqrt{n}}\right)\right)\right)\ \ge\ \mathbb{E}\,L(Z),$$
with $Z\sim\mathcal{N}(0,\dot{\psi}(\theta_0)I(\theta_0)^{-1}\dot{\psi}(\theta_0)^\top)$.
We will be primarily interested in the local asymptotic minimax (LAM) theorem, for it directly gives general lower bounds for statistical estimation. This theorem will be proved in the next two sections using asymptotic equivalence between models, and some applications will be given in the subsequent section.
2. Gaussian Location Model
In this section we study possibly the simplest statistical model, i.e., the Gaussian location model, and will show in the next section that all regular models converge to it asymptotically. In the Gaussian location model, we have $\theta\in\mathbb{R}^d$ and observe $X\sim\mathcal{N}(\theta,\Sigma)$ with a known non-singular covariance $\Sigma$. Consider a bowl-shaped loss function $L$ (defined in Theorem 2); a natural estimator of $\theta$ is $\hat{\theta}=X$, whose worst-case risk is $\mathbb{E}\,L(Z)$ with $Z\sim\mathcal{N}(0,\Sigma)$. The main theorem in this section is that the natural estimator $\hat{\theta}=X$ is minimax.
Theorem 3 For any bowl-shaped loss $L$, we have
$$\inf_{\hat{\theta}}\sup_{\theta\in\mathbb{R}^d}\mathbb{E}_\theta L(\hat{\theta}(X)-\theta)\ \ge\ \mathbb{E}\,L(Z)$$
for $Z\sim\mathcal{N}(0,\Sigma)$.
The proof of Theorem 3 relies on the following important lemma for Gaussian random variables.
Lemma 4 (Anderson's Lemma) Let $Z\sim\mathcal{N}(0,\Sigma)$ and $L$ be bowl-shaped. Then for any $x\in\mathbb{R}^d$,
$$\mathbb{E}\,L(Z+x)\ \ge\ \mathbb{E}\,L(Z).$$
Proof: For $t\ge 0$, let $K_t := \{z\in\mathbb{R}^d: L(z)\le t\}$. Since $L$ is bowl-shaped, the set $K_t$ is convex and symmetric. Moreover, since
$$\mathbb{E}\,L(Z+x) = \int_0^\infty \mathbb{P}(L(Z+x)>t)\,dt = \int_0^\infty\left(1-\mathbb{P}(Z\in K_t-x)\right)dt,$$
it suffices to show that $\mathbb{P}(Z\in K_t-x)\le\mathbb{P}(Z\in K_t)$ for any $x\in\mathbb{R}^d$. We shall need the following inequality in convex geometry.
Theorem 5 (Prékopa–Leindler Inequality, or Functional Brunn–Minkowski Inequality) Let $\lambda\in(0,1)$ and $f,g,h$ be non-negative real-valued measurable functions on $\mathbb{R}^d$, with
$$h(\lambda u+(1-\lambda)v)\ \ge\ f(u)^\lambda g(v)^{1-\lambda}\qquad\text{for all } u,v\in\mathbb{R}^d.$$
Then
$$\int_{\mathbb{R}^d}h\ \ge\ \left(\int_{\mathbb{R}^d}f\right)^\lambda\left(\int_{\mathbb{R}^d}g\right)^{1-\lambda}.$$
Let $\phi$ be the density function of $Z\sim\mathcal{N}(0,\Sigma)$, which is log-concave. Consider the functions $f(z):=\phi(z)\mathbb{1}(z\in K_t-x)$, $g(z):=\phi(z)\mathbb{1}(z\in K_t+x)$ and $h(z):=\phi(z)\mathbb{1}(z\in K_t)$ with $\lambda=1/2$; the log-concavity of $\phi$ and the convexity of $K_t$ (the midpoint of a point in $K_t-x$ and a point in $K_t+x$ lies in $K_t$) ensure the condition of Theorem 5. Hence, Theorem 5 gives
$$\mathbb{P}(Z\in K_t)\ \ge\ \sqrt{\mathbb{P}(Z\in K_t-x)\,\mathbb{P}(Z\in K_t+x)}.$$
Finally, by the symmetry of $\phi$ and $K_t$, we have $\mathbb{P}(Z\in K_t-x)=\mathbb{P}(Z\in K_t+x)$, which completes the proof.
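As a quick numerical illustration of Anderson's lemma, the following Monte Carlo sketch in Python (the covariance, the shifts, and the particular bowl-shaped loss are arbitrary choices) checks that shifting a centered Gaussian away from the origin only increases the expected bowl-shaped loss.
```python
import numpy as np

rng = np.random.default_rng(1)

def L(z):
    # A bowl-shaped loss: non-negative, symmetric, with convex sublevel sets (clipped quadratic).
    return np.minimum(np.sum(z ** 2, axis=-1), 4.0)

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                     # an arbitrary non-singular covariance
A = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((1_000_000, 2)) @ A.T      # Z ~ N(0, Sigma)

for shift in [np.zeros(2), np.array([0.5, 0.0]), np.array([1.0, -1.0])]:
    print(shift, np.mean(L(Z + shift)))            # E L(Z + x) is smallest at x = 0
```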
Now we are ready to prove Theorem 3.
Proof: Consider a Gaussian prior $\pi_\lambda=\mathcal{N}(0,\lambda\Sigma)$ on $\theta$ with $\lambda>0$. By algebra, the posterior distribution of $\theta$ given $X$ is Gaussian distributed as $\mathcal{N}(\frac{\lambda}{1+\lambda}X,\frac{\lambda}{1+\lambda}\Sigma)$. By Proposition 3 in Lecture 3 and the above Anderson's lemma, the Bayes estimator is then $\hat{\theta}_{\pi_\lambda}=\frac{\lambda}{1+\lambda}X$. Since the minimax risk is lower bounded by any Bayes risk (as the maximum is no less than the average), we have
$$\inf_{\hat{\theta}}\sup_{\theta\in\mathbb{R}^d}\mathbb{E}_\theta L(\hat{\theta}-\theta)\ \ge\ \mathbb{E}\,L(Z_\lambda),\qquad Z_\lambda\sim\mathcal{N}\left(0,\frac{\lambda}{1+\lambda}\Sigma\right).$$
Since this inequality holds for any $\lambda>0$, choosing $\lambda\to\infty$ with $\mathbb{E}\,L(Z_\lambda)\to\mathbb{E}\,L(Z)$ completes the proof of Theorem 3.
3. Local Asymptotic Minimax Theorem
In this section, we show that regular statistical models converge to a Gaussian location model asymptotically. To do so, we shall need verifiable criteria for establishing convergence in Le Cam's distance, as well as the specific regularity conditions.
3.1. Likelihood Ratio Criteria for Asymptotic Equivalence
In Lecture 3 we introduced the notion of Le Cam’s model distance, and showed that it can be upper bounded via the randomization criterion. However, designing a suitable transition kernel between models is too ad-hoc and sometimes challenging, and it will be helpful if simple criteria suffice.
The main result of this subsection is the following likelihood ratio criterion:
Theorem 6 Let $\mathcal{M}_n=(\mathcal{X}_n,\mathcal{F}_n,(P_{n,\theta})_{\theta\in\Theta})$ and $\mathcal{M}=(\mathcal{X},\mathcal{F},(P_\theta)_{\theta\in\Theta})$ be statistical models with a common finite parameter set $\Theta$, $|\Theta|=m<\infty$. Further assume that $\mathcal{M}$ is homogeneous in the sense that any pair in $(P_\theta)_{\theta\in\Theta}$ is mutually absolutely continuous. Fix any $\theta_0\in\Theta$ and define
$$L_n := \left(\frac{dP_{n,\theta}}{dP_{n,\theta_0}}\right)_{\theta\in\Theta}$$
as the likelihood ratios, and similarly $L:=\left(\frac{dP_\theta}{dP_{\theta_0}}\right)_{\theta\in\Theta}$ for the model $\mathcal{M}$. Then $\Delta(\mathcal{M}_n,\mathcal{M})\to 0$ if the distribution of $L_n$ under $P_{n,\theta_0}$ weakly converges to that of $L$ under $P_{\theta_0}$.
In other words, Theorem 6 states that a sufficient condition for the asymptotic equivalence of models is the weak convergence of the likelihood ratios. Although we shall not use this fact, the condition is also necessary. The finiteness assumption is mainly for technical purposes, and the general case requires proper limiting arguments.
To prove Theorem 6, we need the following notion of standard models.
Definition 7 (Standard Model) Let $\Delta_m:=\{x\in\mathbb{R}_+^m:\sum_{\theta=1}^m x_\theta=1\}$ be the probability simplex, and $\mathcal{B}$ be its Borel $\sigma$-algebra. A standard distribution $\mu$ on $(\Delta_m,\mathcal{B})$ is a probability measure such that
$$\int_{\Delta_m} x_\theta\,\mu(dx) = \frac{1}{m}\qquad\text{for any }\theta\in\{1,\dots,m\}.$$
The model $\mathcal{N}_\mu:=(\Delta_m,\mathcal{B},(Q_\theta)_{\theta=1}^m)$ with $dQ_\theta(x)=m\,x_\theta\,\mu(dx)$ is called the standard model of $\mu$.
The following lemma shows that any finite statistical model can be transformed into an equivalent standard form.
Lemma 8 Let $\mathcal{M}=(\mathcal{X},\mathcal{F},(P_\theta)_{\theta=1}^m)$ be a finite model, and $\mathcal{N}_\mu$ be a standard model with standard distribution $\mu$ being the distribution of the likelihood-ratio vector
$$T(X) := \left(\frac{dP_\theta}{d(P_1+\cdots+P_m)}(X)\right)_{\theta=1}^m \in \Delta_m$$
under the mean measure $\bar{P}:=\frac{1}{m}\sum_{\theta=1}^m P_\theta$. Then $\Delta(\mathcal{M},\mathcal{N}_\mu)=0$.
Proof: Since $\int_{\mathcal{X}}\frac{dP_\theta}{d(P_1+\cdots+P_m)}\,d\bar{P}=\frac{1}{m}\int_{\mathcal{X}}dP_\theta=\frac{1}{m}$, the measure $\mu$ is a standard distribution. Moreover, let $Q_\theta$ be the distribution of $T(X)$ under $P_\theta$; then
$$\int f\,dQ_\theta = \int_{\mathcal{X}} f(T(x))\,dP_\theta(x) = \int_{\mathcal{X}} f(T(x))\cdot m\,T_\theta(x)\,d\bar{P}(x) = \int_{\Delta_m} f(t)\cdot m\,t_\theta\,\mu(dt)$$
for any bounded measurable function $f$, which gives $dQ_\theta(t)=m\,t_\theta\,\mu(dt)$, agreeing with the standard model. Finally, since $\frac{dP_\theta}{d\bar{P}}(x)=m\,T_\theta(x)$ depends on $x$ only through $T(x)$, by the factorization criterion (e.g., Theorem 7 in Lecture 3) we conclude that the statistic $T(X)$ is sufficient, and therefore $\Delta(\mathcal{M},\mathcal{N}_\mu)=0$.
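For concreteness, consider the binary model $\mathcal{M}=\{P_1,P_2\}$ with $P_1=\mathsf{Bern}(p)$ and $P_2=\mathsf{Bern}(q)$ (an illustrative choice). The vector $T$ takes only two values, $T(1)=\left(\frac{p}{p+q},\frac{q}{p+q}\right)$ and $T(0)=\left(\frac{1-p}{2-p-q},\frac{1-q}{2-p-q}\right)$, so the standard distribution is the two-point measure
$$\mu = \frac{p+q}{2}\,\delta_{T(1)} + \frac{2-p-q}{2}\,\delta_{T(0)} \quad\text{on }\Delta_2,$$
which indeed satisfies $\int x_\theta\,\mu(dx)=\frac{1}{2}$ for both $\theta\in\{1,2\}$.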
Lemma 8 helps to convert the sample space of any finite model to the simplex $\Delta_m$, and the comparison between models is reduced to the comparison between their standard distributions. Consequently, we have the following quantitative bound on the model distance between finite models.
Lemma 9 Let $\mathcal{M}$ and $\mathcal{N}$ be two finite models with a common parameter set of size $m$ and standard distributions $\mu_1,\mu_2$ respectively. Then
$$\Delta(\mathcal{M},\mathcal{N})\ \le\ 2m\cdot d_D(\mu_1,\mu_2),$$
where $d_D(\mu_1,\mu_2):=\sup_f\left|\int f\,d\mu_1-\int f\,d\mu_2\right|$ denotes the Dudley's metric between probability measures on $\Delta_m$, and the supremum is taken over all measurable functions $f$ with $\|f\|_\infty\le 1$ and $|f(x)-f(y)|\le\|x-y\|_1$ for any $x,y\in\Delta_m$.
Remark 1 Recall that Dudley's metric metrizes the weak convergence of probability measures on a metric space equipped with its Borel $\sigma$-algebra. The fact that it is smaller than the total variation distance (so that weak convergence of the standard distributions suffices) will be crucial to establishing Theorem 6.
Proof: Similar to the proof of the randomization criterion (Theorem 5 in Lecture 3), the following upper bound on the model distance holds:
$$\Delta(\mathcal{M},\mathcal{N})\ \le\ 2\sup_{\pi,L}\left|r_\pi^\star(\mathcal{M},L)-r_\pi^\star(\mathcal{N},L)\right|,$$
where $r_\pi^\star(\mathcal{M},L)$ denotes the Bayes risk of model $\mathcal{M}$ under loss $L$ and prior $\pi$, and the loss $L$ is non-negative and upper bounded by one in the supremum. By algebra, the Bayes risk admits the following simple form under the standard model with standard distribution $\mu$:
$$r_\pi^\star(\mu,L) = m\int_{\Delta_m} g_{\pi,L}(x)\,\mu(dx), \qquad g_{\pi,L}(x):=\inf_{v\in V}\langle v,x\rangle,$$
where the set $V$ is defined as
$$V := \left\{\big(\pi(\theta)L(\theta,a)\big)_{\theta=1}^m:\ a\in\mathcal{A}\right\}\subseteq[0,1]^m.$$
Since the diameter of $V$ in the $\ell_\infty$ norm is at most one (indeed $\|v\|_\infty\le 1$ for every $v\in V$), we conclude that $g_{\pi,L}$ is upper bounded by one and $1$-Lipschitz under $\|\cdot\|_1$. The rest of the proof follows from the definition of Dudley's metric.
Finally we are ready to present the proof of Theorem 6. Note that there is a bijective map between the vector of likelihood ratios $\left(\frac{dP_\theta}{dP_{\theta_0}}\right)_{\theta\in\Theta}$ and the vector $\left(\frac{dP_\theta}{d\sum_{\theta'}P_{\theta'}}\right)_{\theta\in\Theta}$ defining the standard distribution, which is continuous under the model $\mathcal{M}$ due to the homogeneity assumption. Then by the continuous mapping theorem (see the remark below), the weak convergence of the likelihood ratios implies the weak convergence of the corresponding standard distributions. Since Dudley's metric metrizes the weak convergence of probability measures, the result of Lemma 9 completes the proof.
Remark 2 The continuous mapping theorem for weak convergence states that, if Borel-measurable random variables $X_n$ converge weakly to $X$ on a metric space, and $g$ is a function continuous on a set $C$ such that $\mathbb{P}(X\in C)=1$, then $g(X_n)$ also converges weakly to $g(X)$. Note that the function $g$ is only required to be continuous on the support of the limiting random variable $X$.
3.2. Locally Asymptotically Normal (LAN) Models
Motivated by Theorem 6, in order to prove that certain models asymptotically become normal, we may show that the likelihood functions weakly converge to those in the normal model. Note that for the Gaussian location model $(\mathcal{N}(h,I^{-1}))_{h\in\mathbb{R}^d}$ with a non-singular matrix $I$, the log-likelihood ratio is given by
$$\log\frac{d\mathcal{N}(h,I^{-1})}{d\mathcal{N}(0,I^{-1})}(X) = h^\top\Delta - \frac{1}{2}h^\top I h, \qquad (3)$$
where $\Delta:=IX\sim\mathcal{N}(0,I)$ for $X\sim\mathcal{N}(0,I^{-1})$. The equation (3) motivates the following definition of locally asymptotically normal (LAN) models, in which the likelihood ratio looks like (3).
Definition 10 (Local Asymptotic Normality) A sequence of models $\mathcal{M}_n=(\mathcal{X}_n,\mathcal{F}_n,(P_{n,h})_{h\in\mathbb{R}^d})$ with local parameter $h\in\mathbb{R}^d$ is called locally asymptotically normal (LAN) with central sequence $\Delta_n$ and Fisher information matrix $I$ if
$$\log\frac{dP_{n,h}}{dP_{n,0}} = h^\top\Delta_n - \frac{1}{2}h^\top I h + r_n(h), \qquad (4)$$
with $\Delta_n\to\mathcal{N}(0,I)$ in distribution under $P_{n,0}$, and the remainder $r_n(h)$ converges to zero in probability under $P_{n,0}$ for any fixed $h\in\mathbb{R}^d$.
Based on the form of the likelihood ratio in (4), the following theorem is then immediate.
Theorem 11 If a sequence of models $\mathcal{M}_n$ satisfies the LAN condition with Fisher information matrix $I$, then $\Delta(\mathcal{M}_n,\mathcal{M})\to 0$ for the Gaussian location model $\mathcal{M}=(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d),(\mathcal{N}(h,I^{-1}))_{h\in\mathbb{R}^d})$.
Proof: Note that for any finite sub-model, Slutsky's theorem applied to (4) gives the desired weak convergence of the likelihood ratios, and clearly the Gaussian location model is homogeneous. Now applying Theorem 6 gives the desired convergence. We leave the discussion of the general case to the bibliographic notes.
Now the only remaining task is to check the likelihood ratios for some common models and show that the LAN condition is satisfied. For example, for product QMD models $(P_{\theta_0+h/\sqrt{n}}^{\otimes n})_{h\in\mathbb{R}^d}$, we have
$$\log\frac{dP_{\theta_0+h/\sqrt{n}}^{\otimes n}}{dP_{\theta_0}^{\otimes n}}(X_1,\dots,X_n) = \sum_{i=1}^n\log\frac{dP_{\theta_0+h/\sqrt{n}}}{dP_{\theta_0}}(X_i),$$
where intuitively the CLT and the LLN applied to the summands will arrive at the desired form (4). The next proposition makes this intuition precise.
Proposition 12 Let $(P_\theta)_{\theta\in\Theta}$ be QMD in an open set $\Theta\subseteq\mathbb{R}^d$ with Fisher information matrix $I(\theta_0)$ at $\theta_0\in\Theta$. Then the sequence of models
$$\mathcal{M}_n = \left(\mathcal{X}^n,\mathcal{F}^{\otimes n},(P_{\theta_0+h/\sqrt{n}}^{\otimes n})_{h\in\mathbb{R}^d}\right),$$
with central sequence $\Delta_n=\frac{1}{\sqrt{n}}\sum_{i=1}^n\dot{\ell}_{\theta_0}(X_i)$, satisfies the LAN condition with Fisher information $I(\theta_0)$.
Proof: Write $W_{ni}:=2\left(\sqrt{\frac{dP_{\theta_0+h/\sqrt{n}}}{dP_{\theta_0}}}(X_i)-1\right)$ and $g(x):=h^\top\dot{\ell}_{\theta_0}(x)$. Then by Taylor expansion,
$$\log\frac{dP_{\theta_0+h/\sqrt{n}}^{\otimes n}}{dP_{\theta_0}^{\otimes n}} = \sum_{i=1}^n 2\log\left(1+\frac{W_{ni}}{2}\right) = \sum_{i=1}^n W_{ni} - \frac{1}{4}\sum_{i=1}^n W_{ni}^2 + \sum_{i=1}^n W_{ni}^2 R(W_{ni}),$$
where $R(w)\to 0$ as $w\to 0$. By the QMD condition, we have
$$\mathbb{E}_{\theta_0}[W_{ni}] = -\frac{1}{4n}h^\top I(\theta_0)h + o(n^{-1}), \qquad \mathbb{E}_{\theta_0}\left[\left(W_{ni}-\frac{g(X_i)}{\sqrt{n}}\right)^2\right] = o(n^{-1}).$$
Moreover, by the property of the score function, $\mathbb{E}_{\theta_0}[g(X_i)]=0$ and $\mathbb{E}_{\theta_0}[g(X_i)^2]=h^\top I(\theta_0)h$. Consequently, we conclude that
$$\sum_{i=1}^n W_{ni} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) - \frac{1}{4}h^\top I(\theta_0)h + o_{P_{\theta_0}}(1) = h^\top\Delta_n - \frac{1}{4}h^\top I(\theta_0)h + o_{P_{\theta_0}}(1).$$
For the second term, the QMD condition gives $n\,\mathbb{E}_{\theta_0}[W_{ni}^2]\to h^\top I(\theta_0)h$, and therefore the LLN (for triangular arrays) gives $\sum_{i=1}^n W_{ni}^2\to h^\top I(\theta_0)h$ in probability. For the last term, Markov's inequality gives $\mathbb{P}(\max_{1\le i\le n}|W_{ni}|>\varepsilon)\le n\varepsilon^{-2}\,\mathbb{E}[W_{n1}^2\mathbb{1}(|W_{n1}|>\varepsilon)]\to 0$, and therefore $\sum_{i=1}^n W_{ni}^2R(W_{ni})\to 0$ in probability. Combining the three terms gives the LAN expansion (4) with $\Delta_n=\frac{1}{\sqrt{n}}\sum_{i=1}^n\dot{\ell}_{\theta_0}(X_i)$, as desired.
In other words, Proposition 12 implies that all regular (QMD) statistical models locally look like a Gaussian location model, where the local radius is of order $n^{-1/2}$.
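The LAN expansion (4) can also be checked numerically for a simple QMD family. The Python sketch below uses the exponential family $\mathrm{Exp}(\theta)$ with $\theta_0=1$ (an arbitrary choice for illustration), where $\dot{\ell}_\theta(x)=1/\theta-x$ and $I(\theta)=1/\theta^2$, and compares the exact log-likelihood ratio with the quadratic approximation $h\Delta_n-\frac{1}{2}h^2 I(\theta_0)$.
```python
import numpy as np

rng = np.random.default_rng(2)

theta0, h, n = 1.0, 2.0, 10_000
X = rng.exponential(scale=1.0 / theta0, size=(2000, n))    # 2000 replications of n i.i.d. samples

def loglik(x, theta):
    # log-likelihood of Exp(theta) with density theta * exp(-theta * x)
    return np.sum(np.log(theta) - theta * x, axis=-1)

theta_n = theta0 + h / np.sqrt(n)
exact = loglik(X, theta_n) - loglik(X, theta0)             # exact log-likelihood ratio

score = 1.0 / theta0 - X                                   # score function at theta0
Delta_n = np.sum(score, axis=-1) / np.sqrt(n)              # central sequence
I0 = 1.0 / theta0 ** 2                                     # Fisher information at theta0
lan = h * Delta_n - 0.5 * h ** 2 * I0                      # LAN approximation (4)

print("mean |remainder| :", np.mean(np.abs(exact - lan)))  # small for large n
print("Delta_n mean/var :", Delta_n.mean(), Delta_n.var(), "(approx. 0 and I0)")
```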
3.3. Proof of LAM Theorem
Now we are ready to glue all the necessary ingredients together. First, for product QMD statistical models, Proposition 12 implies that the LAN condition is satisfied for the local model around any chosen parameter $\theta_0$. Second, by Theorem 11, these local models converge to a Gaussian location model with covariance $I(\theta_0)^{-1}$. By the definition of the model distance, the minimax risk of these local models is asymptotically no smaller than that of the limiting Gaussian location model, which by Theorem 3 is $\mathbb{E}\,L(Z)$ with $Z\sim\mathcal{N}(0,\dot{\psi}(\theta_0)I(\theta_0)^{-1}\dot{\psi}(\theta_0)^\top)$ for any bowl-shaped loss $L$. Consequently, we have the following local asymptotic minimax theorem.
Theorem 13 (LAM, restated) Let $(P_\theta)_{\theta\in\Theta\subseteq\mathbb{R}^d}$ be a QMD statistical model which admits a non-singular Fisher information $I(\theta_0)$ at $\theta_0$. Let $\psi:\Theta\to\mathbb{R}^{d'}$ be differentiable at $\theta_0$, and $T_n$ be an estimator sequence of $\psi(\theta)$ in the model $(P_\theta^{\otimes n})_{\theta\in\Theta}$. Consider any compact action space $\mathcal{A}\subseteq\mathbb{R}^{d'}$, and any bowl-shaped loss function $L$. Then
$$\liminf_{c\to\infty}\liminf_{n\to\infty}\sup_{\|h\|\le c}\mathbb{E}_{\theta_0+h/\sqrt{n}}\,L\left(\sqrt{n}\left(T_n-\psi\left(\theta_0+\frac{h}{\sqrt{n}}\right)\right)\right)\ \ge\ \mathbb{E}\,L(Z),$$
with $Z\sim\mathcal{N}(0,\dot{\psi}(\theta_0)I(\theta_0)^{-1}\dot{\psi}(\theta_0)^\top)$.
Note that here the compactness of the action space is required for the limiting arguments, while all our previous analyses consider finite models. Our arguments via the model distance are also different from those used by Hájek and Le Cam, who introduced the notion of contiguity to arrive at the same result under weaker conditions. We refer to the bibliographic notes for further details of these alternative approaches.
4. Applications and Limitations
In this section, we will apply the LAM theorem to prove asymptotic lower bounds for both parametric and nonparametric problems. We will also discuss the limitations of LAM to motivate the necessity of future lectures.
4.1. Parametric Entropy Estimation
Consider the discrete i.i.d. sampling model $X_1,\dots,X_n\sim P=(p_1,\dots,p_k)\in\mathcal{M}_k$, where $\mathcal{M}_k$ denotes the probability simplex on $k$ elements. The target is to estimate the Shannon entropy
$$H(P) := \sum_{i=1}^k -p_i\log p_i$$
under the mean squared loss. We can apply LAM to prove a local minimax lower bound for this problem.
First we compute the Fisher information of the multinomial model with parameters $(p_1,\dots,p_{k-1})$, where we set $p_1,\dots,p_{k-1}$ to be the free parameters and $p_k=1-\sum_{i=1}^{k-1}p_i$. It is easy to show that
$$I(P) = \mathrm{diag}\left(\frac{1}{p_1},\dots,\frac{1}{p_{k-1}}\right) + \frac{1}{p_k}\mathbf{1}\mathbf{1}^\top.$$
By the matrix inversion formula (Sherman–Morrison), we have
$$I(P)^{-1} = \mathrm{diag}(p) - pp^\top, \qquad p := (p_1,\dots,p_{k-1})^\top.$$
Now choosing $\psi(P)=H(P)$ and $L(t)=t^2$ in LAM, after some algebra we arrive at
$$\liminf_{c\to\infty}\liminf_{n\to\infty}\sup_{\|P-P_0\|\le c/\sqrt{n}} n\,\mathbb{E}_P\left(\hat{H}_n-H(P)\right)^2\ \ge\ \dot{\psi}(P_0)^\top I(P_0)^{-1}\dot{\psi}(P_0) = \sum_{i=1}^k p_{0,i}\log^2 p_{0,i} - H(P_0)^2,$$
which holds for any $P_0$ in the interior of $\mathcal{M}_k$. Note that the quantity $\mathrm{Var}_{P_0}(\log p_0(X)) = \sum_{i=1}^k p_{0,i}\log^2 p_{0,i} - H(P_0)^2$ on the RHS is due to our arbitrary choice of the centering $P_0$.
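As a sanity check (a simulation sketch with an arbitrarily chosen distribution), the plug-in estimator $H(\hat{P}_n)$ based on the empirical distribution is asymptotically normal with variance $\mathrm{Var}_P(\log p(X))/n$ for fixed $k$, so its rescaled mean squared error should approach the LAM lower bound above.
```python
import numpy as np

rng = np.random.default_rng(3)

p = np.array([0.5, 0.25, 0.15, 0.1])                 # an arbitrary distribution on k = 4 elements
H = -np.sum(p * np.log(p))                           # Shannon entropy H(P)
var_lb = np.sum(p * np.log(p) ** 2) - H ** 2         # Var_P(log p(X)): the LAM lower bound

n, reps = 100_000, 2000
counts = rng.multinomial(n, p, size=reps)
p_hat = counts / n
with np.errstate(divide="ignore", invalid="ignore"):
    H_hat = -np.sum(np.where(p_hat > 0, p_hat * np.log(p_hat), 0.0), axis=1)

print("n * MSE of plug-in:", n * np.mean((H_hat - H) ** 2))
print("LAM lower bound   :", var_lb)
```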
4.2. Nonparametric Entropy Estimation
Consider a continuous i.i.d. sampling model from some density, i.e., $X_1,\dots,X_n\sim f$. Assume that the density $f$ is supported on $[0,1]$, and the target is to estimate the differential entropy
$$h(f) := -\int_0^1 f(x)\log f(x)\,dx.$$
As before, we would like to prove a local minimax lower bound for the mean squared error around any target density $f_0$ (bounded away from zero and infinity). However, since the model is nonparametric and the parameter $f$ is infinite-dimensional, there is no Fisher information matrix for this model. To overcome this difficulty, we may consider a one-dimensional parametric submodel instead.
Let $g:[0,1]\to\mathbb{R}$ be any bounded measurable function with $\int_0^1 g(x)\,dx=0$ and $g\not\equiv 0$; then $f_t:=f_0+t\cdot g$ is still a valid density on $[0,1]$ for small $|t|$. Consequently, keeping $|t|$ small, the i.i.d. sampling model $X_1,\dots,X_n\sim f_t$ becomes a submodel parametrized only by $t\in\mathbb{R}$. For this 1-D parametric submodel, the Fisher information at $t=0$ can be computed as
$$I(0) = \int_0^1\frac{g(x)^2}{f_0(x)}\,dx.$$
Setting $\psi(t)=h(f_t)$ and $L(t)=t^2$ in LAM, and noting that $\psi'(0)=-\int_0^1 g(x)(1+\log f_0(x))\,dx=-\int_0^1 g(x)\log f_0(x)\,dx$ thanks to $\int_0^1 g(x)\,dx=0$, we have
$$\liminf_{c\to\infty}\liminf_{n\to\infty}\sup_{|t|\le c/\sqrt{n}} n\,\mathbb{E}_{f_t}\left(\hat{h}_n-h(f_t)\right)^2\ \ge\ \frac{\psi'(0)^2}{I(0)},$$
and consequently the local minimax risk around $f_0$ is lower bounded by
$$\frac{\psi'(0)^2}{I(0)} = \frac{\left(\int_0^1 g(x)\log f_0(x)\,dx\right)^2}{\int_0^1 g(x)^2/f_0(x)\,dx}.$$
Since our choice of the test function $g$ is arbitrary, we may actually choose the worst-case $g$ such that the above lower bound is maximized. We claim that the maximum value is
$$\mathrm{Var}_{f_0}\left(\log f_0(X)\right) = \int_0^1 f_0(x)\log^2 f_0(x)\,dx - h(f_0)^2.$$
Clearly, this value is attained by the test function $g^\star(x)=-f_0(x)\left(\log f_0(x)+h(f_0)\right)$. For the maximality, the Cauchy–Schwarz inequality and the assumption $\int_0^1 g(x)\,dx=0$ give
$$\left(\int_0^1 g(x)\log f_0(x)\,dx\right)^2 = \left(\int_0^1 \frac{g(x)}{\sqrt{f_0(x)}}\cdot\sqrt{f_0(x)}\big(\log f_0(x)+h(f_0)\big)\,dx\right)^2 \le \int_0^1\frac{g(x)^2}{f_0(x)}\,dx\cdot\mathrm{Var}_{f_0}\left(\log f_0(X)\right).$$
Therefore, the parametric lower bound for nonparametric entropy estimation is
$$\liminf_{c\to\infty}\liminf_{n\to\infty}\sup_{|t|\le c/\sqrt{n}} n\,\mathbb{E}_{f_t}\left(\hat{h}_n-h(f_t)\right)^2\ \ge\ \mathrm{Var}_{f_0}\left(\log f_0(X)\right).$$
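The claim about the worst-case test function can be verified numerically. The Python sketch below uses the arbitrary choice $f_0=\mathrm{Beta}(2,5)$ on $[0,1]$ and evaluates the lower bound $\left(\int_0^1 g\log f_0\right)^2/\int_0^1 g^2/f_0$ for the claimed maximizer $g^\star$ and a few other test functions, confirming that $g^\star$ attains $\mathrm{Var}_{f_0}(\log f_0(X))$.
```python
import numpy as np

x = np.linspace(1e-4, 1 - 1e-4, 200_001)          # grid on (0, 1)
dx = x[1] - x[0]

def integrate(v):
    return np.sum(v) * dx                         # simple Riemann sum on the grid

f0 = 30 * x * (1 - x) ** 4                        # Beta(2,5) density, an arbitrary choice
h0 = integrate(-f0 * np.log(f0))                  # differential entropy h(f_0)
var0 = integrate(f0 * np.log(f0) ** 2) - h0 ** 2  # Var_{f_0}(log f_0(X))

def lower_bound(g):
    g = g - f0 * integrate(g)                     # enforce the constraint that g integrates to zero
    return integrate(g * np.log(f0)) ** 2 / integrate(g ** 2 / f0)

g_star = -f0 * (np.log(f0) + h0)                  # claimed worst-case test function
others = [f0 * np.sin(2 * np.pi * x), f0 * (x - 0.5), f0 * np.cos(np.pi * x)]

print("Var_{f0}(log f0):", var0)
print("bound at g_star :", lower_bound(g_star))   # matches var0
print("bound at others :", [round(lower_bound(g), 4) for g in others])
```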
4.3. Limitations of Classical Asymptotics
The theorems from classical asymptotics can typically help to prove an error bound with an explicit constant, and it is also known that these bounds are optimal and achieved by MLE. However, there are still some problems in these approaches:
- Non-asymptotic vs. asymptotic: Asymptotic bounds are useful only in scenarios where the problem size remains fixed and the sample size grows to infinity, and there is no general guarantee of when we have entered the asymptotic regime (entering it may even require an enormous sample size). In practice, essentially all recent problems are high-dimensional ones where the number of parameters is comparable to or even larger than the sample size (e.g., an over-parametrized neural network), and some key properties of the problem may be entirely obscured in the asymptotic regime.
- Parametric vs. nonparametric: The results in classical asymptotics may not be helpful for a large class of nonparametric problems, whose main difficulty is their infinite-dimensional nature. Although the parametric reduction is sometimes helpful (e.g., the entropy example above), the parametric rate $n^{-1/2}$ is in general not attainable in nonparametric problems and some other tools are necessary. For example, if we would like to estimate the density $f(x_0)$ at some point $x_0$, the worst-case test function will actually give a vacuous lower bound (which is infinity).
- Global vs. local: As the name of LAM suggests, the minimax lower bound here holds locally. However, the global structure of some problems may also be important, and the global minimax lower bound may be much larger than the supremum of local bounds over all possible points. For example, in Shannon entropy estimation, the bias is actually dominating the problem and cannot be reflected in local methods.
To overcome these difficulties, we need to develop tools to establish non-asymptotic results for possibly high-dimensional or nonparametric problems, which is the focus of the rest of the lecture series.
5. Bibliographic Notes
The asymptotic theorems in Theorem 2 were first presented in Hájek (1970) and Hájek (1972), and we refer to Le Cam (1986), Le Cam and Yang (1990) and van der Vaart (2000) as excellent textbooks. Here the approach of using the model distance to establish LAM is taken from Sections 4 and 6 of Liese and Miescke (2007); also see Le Cam (1972).
There is another line of approach to establish the asymptotic theorems. A key concept is the contiguity proposed by Le Cam (1960), which enables an asymptotic change of measure. Based on contiguity and LAN condition, the distribution of any (regular) estimator under the local alternative can be evaluated. Then the convolution theorem can be shown, which helps to establish LAM; details can be found in van der Vaart (2000). LAM theorem can also be established directly by computing the asymptotic Bayes risk under proper priors; see Section 6 of Le Cam and Yang (1990).
For parametric or nonparametric entropy estimation, we refer to recent papers (Jiao et al. (2015) and Wu and Yang (2016) for the discrete case, Berrett, Samworth and Yuan (2019) and Han et al. (2017) for the continuous case) and the references therein.
- Jaroslav Hájek, A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14.4 (1970): 323-330.
- Jaroslav Hájek, Local asymptotic minimax and admissibility in estimation. Proceedings of the sixth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1972.
- Lucien M. Le Cam, Asymptotic methods in statistical decision theory. Springer, New York, 1986.
- Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990.
- Aad W. van der Vaart, Asymptotic statistics. Vol. 3. Cambridge University Press, 2000.
- Friedrich Liese and Klaus-J. Miescke. Statistical decision theory. Springer, New York, NY, 2007.
- Lucien M. Le Cam, Limits of experiments. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
- Lucien M. Le Cam, Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98 (1960).
- Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman, Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835-2885.
- Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.
- Thomas B. Berrett, Richard J. Samworth, and Ming Yuan, Efficient multivariate entropy estimation via $k$-nearest neighbour distances. The Annals of Statistics 47.1 (2019): 288-318.
- Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu, Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141 (2017).