Lecture 4: Local Asymptotic Normality and Asymptotic Theorems


(Warning: These materials may contain many typos and errors. We would be grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In this lecture, we establish the asymptotic lower bounds for general statistical decision problems. Specifically, we show that models satisfying the benign local asymptotic normality (LAN) condition are asymptotically equivalent to a Gaussian location model, under which the Hájek–Le Cam local asymptotic minimax (LAM) lower bound holds. We also apply this theorem to both parametric and nonparametric problems. 

1. History of Asymptotic Statistics 

To begin with, we first recall the notions of the score function and Fisher information, which can be found in most textbooks. 

Definition 1 (Fisher Information) A family of distributions (P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d} on \mathcal{X} is quadratic mean differentiable (QMD) at \theta\in {\mathbb R}^d if there exists a score function \dot{\ell}_\theta: \mathcal{X} \rightarrow {\mathbb R}^d such that

 \int\left(\sqrt{dP_{\theta+h}} - \sqrt{dP_\theta} - \frac{1}{2}h^\top \dot{\ell}_\theta \sqrt{dP_\theta} \right)^2 = o(\|h\|^2).

In this case, the matrix I(\theta) := \mathop{\mathbb E}_{P_\theta}[\dot{\ell}_\theta\dot{\ell}_\theta^\top ] exists and is called the Fisher information at \theta

R. A. Fisher popularized the Fisher information above and the use of the maximum likelihood estimator (MLE) starting from the 1920s. He believed that the Fisher information of a statistical model characterizes the fundamental limit of estimating \theta based on n i.i.d. observations from P_\theta, and that the MLE asymptotically attains this limit. More specifically, he made the following conjectures: 

  1. For any asymptotically normal estimators \hat{\theta}_n such that \sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma_\theta) for any \theta\in \Theta\subseteq {\mathbb R}^d, there must be  \Sigma_\theta \succeq I(\theta)^{-1}, \qquad \forall \theta\in \Theta. \ \ \ \ \ (1)
  2. The MLE satisfies \Sigma_\theta = I(\theta)^{-1} for every \theta\in\Theta. 

Although the second conjecture is easier to establish under certain regularity conditions, the first conjecture, which seems plausible in view of the well-known Cramér–Rao bound, actually caused some trouble when people tried to prove it. The following example shows that (1) may be quite problematic. 

Example 1 Here is a counterexample to (1) proposed by Hodges in 1951. Let \Theta = {\mathbb R}, and consider a Gaussian location model where X_1,\cdots,X_n are i.i.d. distributed as \mathcal{N}(\theta,1). A natural estimator of \theta is the empirical mean \bar{X}_n = \sum_{i=1}^n X_i/n, and the Fisher information is I(\theta)\equiv 1. Hodges’ estimator is constructed as follows:

 \hat{\theta}_n = \bar{X}_n\cdot 1(|\bar{X}_n| \ge n^{-1/4}). \ \ \ \ \ (2)

It is easy to show that \sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma_\theta) for any \theta\in {\mathbb R}, with \Sigma_\theta = 1 for non-zero \theta and \Sigma_\theta = 0 for \theta=0. Consequently, (1) does not hold for Hodges’ estimator. The same conclusion holds if the threshold n^{-1/4} in (2) is replaced by any sequence a_n with a_n = o(1) and a_n = \omega(n^{-1/2}). 
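
To make the superefficiency phenomenon concrete, here is a minimal Monte Carlo sketch (not part of the original argument; the sample sizes, the values of \theta, and the use of squared loss are arbitrary illustrative choices, and \bar{X}_n is drawn directly from its \mathcal{N}(\theta,1/n) distribution since it is sufficient). At \theta=0 the scaled risk of Hodges’ estimator collapses, while at \theta of order n^{-1/4} it blows up, in line with the discussion below.

```python
import numpy as np

rng = np.random.default_rng(0)

def hodges(xbar, n):
    # Hodges' estimator (2): keep the empirical mean only when it exceeds the threshold n^{-1/4}.
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

def scaled_mse(theta, n, reps=200_000):
    # Monte Carlo estimate of n * E_theta (estimator - theta)^2; the sufficient statistic
    # xbar is drawn directly from its N(theta, 1/n) distribution.
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)
    return n * np.mean((hodges(xbar, n) - theta) ** 2), n * np.mean((xbar - theta) ** 2)

for n in (100, 10_000, 1_000_000):
    for theta in (0.0, 0.5 * n ** (-0.25)):
        m_hodges, m_mean = scaled_mse(theta, n)
        print(f"n={n:>9}  theta={theta:7.4f}  "
              f"n*MSE(Hodges)={m_hodges:9.3f}  n*MSE(mean)={m_mean:6.3f}")
```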

Hodges’ example suggests that (1) should at least be weakened in appropriate ways. A careful look at the structure of Hodges’ estimator (2) suggests three possible remedies: 

  1. The estimator \hat{\theta}_n is superefficient (i.e., (1) fails) only at the single point \theta=0. One may therefore expect that the set of \theta at which (1) fails is quite small. 
  2. Although Hodges’ estimator satisfies the asymptotic normality condition, i.e., \sqrt{n}(\hat{\theta}_n - \theta) converges weakly to a normal distribution under P_\theta^{\otimes n} for every \theta, it is not regular at \theta=0: for any non-zero perturbation h\in {\mathbb R}, the sequence \sqrt{n}(\hat{\theta}_n - h/\sqrt{n}) does not converge weakly to the same limit under P_{h/\sqrt{n}}^{\otimes n}. Hence, we may expect that (1) actually holds for more regular estimator sequences. 
  3. Let R(\theta) be the risk function of the estimator \hat{\theta}_n under the absolute value loss. It can be computed that R(0) = o(1/\sqrt{n}), while R(\theta) = \Omega(|\theta|) for |\theta| = O(n^{-1/4}). In other words, the worst-case risk over an interval of size n^{-1/2} around \theta=0 is still of order \Omega(n^{-1/2}), which is considerably larger than the risk at the single point \theta=0. Therefore, it makes sense to consider the local minimax risk. 

It turns out that all these attempts can be successful, and the following theorem summarizes the key results of asymptotic statistics developed by J. Hájek and L. Le Cam in the 1970s. 

Theorem 2 (Asymptotic Theorems) Let (P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d} be a QMD statistical model which admits a non-singular Fisher information I(\theta_0) at \theta_0. Let \psi(\theta) be differentiable at \theta=\theta_0, and T_n be an estimator sequence of \psi(\theta) in the model (P_\theta^{\otimes n}).

1. (Almost everywhere convolution theorem) If \sqrt{n}(T_n - \psi(\theta)) converges in distribution to some probability measure L_\theta for every \theta, and I(\theta) is non-singular for every \theta, then there exists some probability measure M such that  L_{\theta} = \mathcal{N}(0, \dot{\psi}_{\theta} I(\theta)^{-1}\dot{\psi}_{\theta}^\top) * M
for Lebesgue almost every \theta, where * denotes the convolution. 

2. (Convolution theorem) If \sqrt{n}(T_n - \psi(\theta)) converges in distribution to some probability measure L_\theta for \theta = \theta_0, and T_n is regular in the sense that \sqrt{n}(T_n - \psi(\theta+h/\sqrt{n})) weakly converges to the same limit under P_{\theta+h/\sqrt{n}}^{\otimes n} for every h\in {\mathbb R}^d, then there exists some probability measure M such that  L_{\theta_0} = \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top) * M.

3. (Local asymptotic minimax theorem) Let \ell be a bowl-shaped loss function, i.e., \ell is non-negative, symmetric and quasi-convex; in mathematical terms, \ell(x) = \ell(-x)\ge 0 and the sublevel sets \{x: \ell(x)\le t \} are convex for all t\in {\mathbb R}. Then  \lim_{c\rightarrow\infty} \liminf_{n\rightarrow\infty} \sup_{\|h\|\le c } \mathop{\mathbb E}_{\theta_0+\frac{h}{\sqrt{n}}} \ell\left(\sqrt{n}\left(T_n - \psi\left(\theta_0 + \frac{h}{\sqrt{n}}\right) \right) \right) \ge \mathop{\mathbb E} \ell(Z)
with Z\sim \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)

We will be primarily interested in the local asymptotic minimax (LAM) theorem, for it directly gives general lower bounds for statistical estimation. This theorem will be proved in the next two sections using asymptotic equivalence between models, and some applications will be given in the subsequent section. 

2. Gaussian Location Model 

In this section we study possibly the simplest statistical model, the Gaussian location model, and will show in the next section that all regular models converge to it asymptotically. In the Gaussian location model, we have \theta\in {\mathbb R}^d and observe X\sim \mathcal{N}(\theta,\Sigma) with a known non-singular covariance \Sigma. For a bowl-shaped loss function \ell (defined in Theorem 2), a natural estimator of \theta is \hat{\theta}=X, whose worst-case risk is \mathop{\mathbb E} \ell(Z) with Z\sim \mathcal{N}(0,\Sigma). The main theorem in this section is that the natural estimator \hat{\theta}=X is minimax. 

Theorem 3 For any bowl-shaped loss \ell, we have

 \inf_{\hat{\theta}} \sup_{\theta\in {\mathbb R}^d} \mathop{\mathbb E}_\theta \ell( \hat{\theta} - \theta ) = \mathop{\mathbb E} \ell(Z)

for Z\sim \mathcal{N}(0,\Sigma)

The proof of Theorem 3 relies on the following important lemma for Gaussian random variables. 

Lemma 4 (Anderson’s Lemma) Let Z\sim \mathcal{N}(0,\Sigma) and \ell be bowl-shaped. Then

 \min_{x\in{\mathbb R}^d} \mathop{\mathbb E} \ell(Z+x) = \mathop{\mathbb E} \ell(Z).

Proof: For t\ge 0, let K_t = \{z: \ell(z)\le t \} \subseteq {\mathbb R}^d. Since \ell is bowl-shaped, the set K_t is convex. Moreover, since 

 \mathop{\mathbb E} \ell(Z+x) = \int_0^\infty (1 - \mathop{\mathbb P}(Z\in K_t - x) )dt,

it suffices to show that \mathop{\mathbb P}(Z\in K_t - x )\le \mathop{\mathbb P}(Z\in K_t) for any x\in {\mathbb R}^d. We shall need the following inequality in convex geometry. 

Theorem 5 (Prékopa–Leindler Inequality, or Functional Brunn–Minkowski Inequality) Let \lambda\in (0,1) and f,g,h be non-negative real-valued measurable functions on {\mathbb R}^d, with

 h((1-\lambda)x + \lambda y) \ge f(x)^{1-\lambda} g(y)^\lambda, \qquad \forall x,y\in {\mathbb R}^d.

Then 

 \int_{{\mathbb R}^d} h(x)dx \ge \left( \int_{{\mathbb R}^d} f(x)dx\right)^{1-\lambda} \left( \int_{{\mathbb R}^d} g(x)dx\right)^\lambda.

Let \phi be the density function of Z, which is log-concave. Consider functions 

 f(z) = \phi(z)\cdot 1_{K_t+x} (z), \quad g(z) = \phi(z)\cdot 1_{K_t-x}(z), \quad h(z) = \phi(z)\cdot 1_{K_t}(z)

 and \lambda=1/2. The log-concavity of \phi, together with the identity (K_t-x)/2 + (K_t+x)/2 = K_t/2 + K_t/2 = K_t (by convexity of K_t), ensures the condition of Theorem 5. Hence, Theorem 5 gives 

 \mathop{\mathbb P}(Z\in K_t) \ge \sqrt{\mathop{\mathbb P}(Z\in K_t+x) \cdot \mathop{\mathbb P}(Z\in K_t-x) }.

Finally, by symmetry of Z and K_t, we have \mathop{\mathbb P}(Z\in K_t+x) = \mathop{\mathbb P}(Z\in K_t-x), which completes the proof.  \Box 
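
As a quick numerical sanity check of Anderson’s lemma (a sketch only, not a proof; the covariance, the shift vectors, and the truncated-norm loss below are arbitrary choices), one can compare \mathop{\mathbb E} \ell(Z+x) with \mathop{\mathbb E} \ell(Z) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(z):
    # A bowl-shaped loss: non-negative, symmetric, with convex sublevel sets
    # (balls of radius t for t < 2, the whole plane otherwise).
    return np.minimum(np.linalg.norm(z, axis=-1), 2.0)

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                      # an arbitrary non-singular covariance
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((500_000, 2)) @ L.T         # Z ~ N(0, Sigma)

print("E l(Z)        =", loss(Z).mean())
for x in ([0.5, 0.0], [1.0, -1.0], [3.0, 2.0]):
    shifted = loss(Z + np.array(x)).mean()          # should never fall below E l(Z)
    print(f"E l(Z + {x}) = {shifted:.4f}")
```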

Now we are ready to prove Theorem 3. 

Proof: Consider a Gaussian prior \pi=\mathcal{N}(0,\Sigma_0) on \theta. By algebra, the posterior distribution of \theta given X=x is Gaussian distributed as \mathcal{N}( (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}x, (\Sigma_0^{-1}+\Sigma^{-1})^{-1} ) . By Proposition 3 in Lecture 3 and the above Anderson’s lemma, the Bayes estimator is then \hat{\theta} = (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}X. Since the minimax risk is lower bounded by any Bayes risk (as the maximum is no less than the average), we have 

 \inf_{\hat{\theta}} \sup_{\theta\in {\mathbb R}^d} \mathop{\mathbb E}_\theta \ell( \hat{\theta} - \theta ) \ge \int \ell d\mathcal{N}(0, (\Sigma_0^{-1}+\Sigma^{-1})^{-1} ).

Since this inequality holds for any \Sigma_0, choosing \Sigma_0 = \lambda I with \lambda\rightarrow\infty completes the proof of Theorem 3.  \Box 
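
The Gaussian posterior computation used in this proof is also easy to check numerically. The sketch below (with arbitrary choices of \Sigma_0, \Sigma and x) compares the stated posterior mean and covariance with the standard joint-Gaussian conditioning formulas, and illustrates that the posterior covariance (\Sigma_0^{-1}+\Sigma^{-1})^{-1} approaches \Sigma as \Sigma_0=\lambda I with \lambda\rightarrow\infty.

```python
import numpy as np

# Arbitrary choices of prior covariance, observation covariance, and observation.
Sigma0 = np.array([[3.0, 1.0], [1.0, 2.0]])   # prior covariance
Sigma  = np.array([[1.0, 0.2], [0.2, 0.5]])   # observation covariance
x      = np.array([1.5, -0.7])

inv = np.linalg.inv

# Posterior as stated in the proof: N((Sigma0^-1 + Sigma^-1)^-1 Sigma^-1 x, (Sigma0^-1 + Sigma^-1)^-1).
post_cov  = inv(inv(Sigma0) + inv(Sigma))
post_mean = post_cov @ inv(Sigma) @ x

# The same posterior from the joint-Gaussian conditioning formula.
post_mean_alt = Sigma0 @ inv(Sigma0 + Sigma) @ x
post_cov_alt  = Sigma0 - Sigma0 @ inv(Sigma0 + Sigma) @ Sigma0

print(np.allclose(post_mean, post_mean_alt), np.allclose(post_cov, post_cov_alt))

# As Sigma0 = lambda * I with lambda -> infinity, the posterior covariance approaches Sigma.
for lam in (1.0, 10.0, 1000.0):
    print(lam, np.round(inv(np.eye(2) / lam + inv(Sigma)), 4))
```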

3. Local Asymptotic Minimax Theorem 

In this section, we show that regular statistical models converge to a Gaussian location model asymptotically. To do so, we shall need verifiable criteria to establish convergence in Le Cam’s distance, as well as the specific regularity conditions. 

3.1. Likelihood Ratio Criteria for Asymptotic Equivalence 

In Lecture 3 we introduced the notion of Le Cam’s model distance, and showed that it can be upper bounded via the randomization criterion. However, designing a suitable transition kernel between models is ad hoc and sometimes challenging, and it would be helpful to have simpler criteria. 

The main result of this subsection is the following likelihood ratio criterion: 

Theorem 6 Let \mathcal{M}_n = (\mathcal{X}_n, \mathcal{F}_n, \{P_{n,0},\cdots, P_{n,m}\}) and \mathcal{M} = (\mathcal{X}, \mathcal{F}, \{P_0, \cdots, P_m \}) be finite statistical models. Further assume that \mathcal{M} is homogeneous in the sense that any pair in \{P_0,\cdots,P_m\} is mutually absolutely continuous. Define

 L_{n,i}(x_n) = \frac{dP_{n,i}}{dP_{n,0}}(x_n), \qquad i\in [m]

as the likelihood ratios, and similarly for L_i(x). Then \lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{M})=0 if the distribution of (L_{n,1}(X_n),\cdots, L_{n,m}(X_n)) under X_n\sim P_{n,0} weakly converges to that of (L_1(X),\cdots,L_m(X)) under X\sim P_0

In other words, Theorem 6 states that a sufficient condition for asymptotic equivalence of models is the weak convergence of the likelihood ratios. Although we shall not use this fact, the condition is also necessary. The finiteness assumption is mainly for technical purposes, and the general case requires proper limiting arguments. 

To prove Theorem 6, we need the following notion of standard models

Definition 7 (Standard Model) Let \mathcal{S}_{m+1} = \{(t_0,\cdots,t_m)\in {\mathbb R}_+^{m+1}: \sum_{i=0}^m t_i=m+1 \}, and \sigma(\mathcal{S}_{m+1}) be its Borel \sigma-algebra. A standard distribution \mu on (\mathcal{S}_{m+1},\sigma(\mathcal{S}_{m+1})) is a probability measure such that \mathop{\mathbb E}_{\mu}[t_i]= 1 for any i=0,1,\cdots,m. The model

 \mathcal{N} = (\mathcal{S}_{m+1},\sigma(\mathcal{S}_{m+1}), \{Q_0,\cdots, Q_m \}), \qquad \text{with}\quad dQ_i = t_id\mu,

is called the standard model of \mu

The following lemma shows that any finite statistical model can be transformed into an equivalent standard form. 

Lemma 8 Let \mathcal{M} = (\mathcal{X}, \mathcal{F}, \{P_0, \cdots, P_m \}) be a finite model, and \mathcal{N} be a standard model with standard distribution \mu being the distribution of t:=(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}})\in \mathcal{S}_{m+1} under mean measure \overline{P}=\sum_{i=0}^m P_i/(m+1). Then \Delta(\mathcal{M},\mathcal{N})=0

Proof: Since \mathop{\mathbb E}_{\mu}[t_i] = \mathop{\mathbb E}_{\overline{P}}[dP_i/d\overline{P}]=1, the measure \mu is a standard distribution. Moreover, let Q_i be the distribution of t under P_i, then 

 \mathop{\mathbb E}_{Q_i}[h(t)] = \mathop{\mathbb E}_{\overline{P}}\left[h(t)\frac{dP_i}{d\overline{P}} \right] = \mathop{\mathbb E}_{\mu}\left[h(t) t_i \right]

for any measurable function h, which gives dQ_i = t_id\mu, agreeing with the standard model. Finally, since dP_i(x) = t_i d\mu(x), by the factorization criterion (e.g., Theorem 7 in Lecture 3) we conclude that the statistic t is sufficient, and therefore \Delta(\mathcal{M},\mathcal{N})=0.  \Box 

Lemma 8 helps to convert the sample space of all finite models to the simplex \mathcal{S}_{m+1}, and comparison between models is reduced to the comparison between their standard distributions. Consequently, we have the following quantitative bound on the model distance between finite models. 

Lemma 9 Let \mathcal{M} = (\mathcal{X},\mathcal{F},\{P_0,\cdots,P_m \}) and \mathcal{N} = (\mathcal{Y}, \mathcal{G}, \{Q_0,\cdots,Q_m\}) be two finite models with standard distributions \mu_1, \mu_2 respectively. Then

 \Delta(\mathcal{M},\mathcal{N}) \le (m+1)\cdot \|\mu_1 - \mu_2\|_{\text{\rm D}} := (m+1)\cdot \sup_{\phi} |\mathop{\mathbb E}_{\mu_1} \phi - \mathop{\mathbb E}_{\mu_2} \phi|,

where  \|\mu_1 - \mu_2\|_{\text{\rm D}} denotes Dudley’s metric between probability measures \mu_1,\mu_2, and the supremum is taken over all measurable functions \phi: \mathcal{S}_{m+1}\rightarrow {\mathbb R} with \|\phi\|_\infty\le 1 and |\phi(t_1) - \phi(t_2)| \le \|t_1 - t_2\|_\infty for any t_1,t_2\in \mathcal{S}_{m+1}. 

Remark 1 Recall that Dudley’s metric metrizes the weak convergence of probability measures on a metric space with its Borel \sigma-algebra. The fact that it is smaller than the total variation distance will be crucial to establish Theorem 6. 

Proof: Similar to the proof of the randomization criterion (Theorem 5 in Lecture 3), the following upper bound on the model distance holds: 

 \Delta(\mathcal{M},\mathcal{N}) \le \sup_{L,\pi} | R_{L,\pi}^\star(\mathcal{M}) - R_{L,\pi}^\star(\mathcal{N}) |,

where R_{L,\pi}^\star(\mathcal{M}) denotes the Bayes risk of model \mathcal{M} under loss L and prior \pi, and the loss L is non-negative and upper bounded by one in the supremum. By algebra, the Bayes risk admits the following simple form under the standard model with standard distribution \mu

 R_{L,\pi}^\star(\mu) = \int_{\mathcal{S}_{m+1}} \left(\inf_{c(t)\in \mathcal{C}_{L,\pi}(t)} c(t)^\top t \right) \mu(dt),

where the set \mathcal{C}_{L,\pi}(t) is defined as 

 \mathcal{C}_{L,\pi}(t) = \left\{\left(\pi_0\int L(0,a)\delta(t,da), \cdots, \pi_m\int L(m,a)\delta(t,da) \right): \text{stochastic kernel }\delta: \mathcal{S}_{m+1}\rightarrow \mathcal{A} \right\}.

Since the diameter of \mathcal{C}_{L,\pi}(t) in \|\cdot\|_1 is one, we conclude that t\mapsto \inf_{c(t)\in \mathcal{C}_{L,\pi}(t)} c(t)^\top t is upper bounded by m+1 and 1-Lipschitz under \|\cdot\|_\infty. The rest of the proof follows from the definition of Dudley’s metric.  \Box 

Finally we are ready to present the proof of Theorem 6. Note that there is a bijective map between (L_1,\cdots,L_m) and (\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}}), which is continuous under the model \mathcal{M} due to the homogeneity assumption. Then by continuous mapping theorem (see remark below), the weak convergence of likelihood ratios implies the weak convergence of their standard distributions. Since Dudley’s metric metrizes the weak convergence of probability measures, the result of Lemma 9 completes the proof. 

Remark 2 The continuous mapping theorem for weak convergence states that, if Borel-measurable random variables X_n converges weakly to X on a metric space, and f is a function continuous on a set C such that \mathop{\mathbb P}(X\in C)=1, then f(X_n) also converges weakly to f(X). Note that the function f is only required to be continuous on the support of the limiting random variable X

3.2. Locally Asymptotically Normal (LAN) Models 

Motivated by Theorem 6, in order to prove that certain models asymptotically become Gaussian, we may show that their likelihood ratios weakly converge to those of the Gaussian location model. Note that for the Gaussian location model \mathcal{M}=({\mathbb R}^d, \mathcal{B}({\mathbb R}^d), (\mathcal{N}(\theta,\Sigma))_{\theta\in{\mathbb R}^d}), the log-likelihood ratio is given by 

 \log \frac{dP_{\theta+h}}{dP_\theta}(x) = h^\top \Sigma^{-1}(x-\theta) - \frac{1}{2}h^\top \Sigma^{-1}h, \ \ \ \ \ (3)

where \Sigma^{-1}(x-\theta)\sim \mathcal{N}(0,\Sigma^{-1}) for x\sim P_\theta. Equation (3) motivates the following definition of locally asymptotically normal (LAN) models, in which the likelihood ratio asymptotically looks like (3). 
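
Since (3) is an exact finite-sample identity, it can be verified directly; the sketch below (with arbitrary choices of \Sigma, \theta, h and a random observation x) compares the two sides using scipy’s multivariate normal log-density.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # arbitrary non-singular covariance
theta = np.array([0.5, -1.0])
h     = np.array([0.7, 0.2])
x     = rng.multivariate_normal(theta, Sigma)

# Left-hand side of (3): exact log-likelihood ratio at x.
lhs = (multivariate_normal(theta + h, Sigma).logpdf(x)
       - multivariate_normal(theta, Sigma).logpdf(x))

# Right-hand side of (3): linear-quadratic form in h.
Sinv = np.linalg.inv(Sigma)
rhs = h @ Sinv @ (x - theta) - 0.5 * h @ Sinv @ h

print(lhs, rhs, np.isclose(lhs, rhs))
```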

Definition 10 (Local Asymptotic Normality) A sequence of models \mathcal{M}_n = (\mathcal{X}_n, \mathcal{F}_n, (P_{n,h})_{h\in \Theta_n}) with \Theta_n \uparrow {\mathbb R}^d is called locally asymptotically normal (LAN) with central sequence Z_n and Fisher information matrix I_0 if

 \log\frac{dP_{n,h}}{dP_{n,0}} = h^\top Z_n - \frac{1}{2}h^\top I_0h + r_n(h), \ \ \ \ \ (4)

with Z_n \overset{d}{\rightarrow} \mathcal{N}(0,I_0) under P_{n,0}, and r_n(h) converges to zero in probability under P_{n,0} for any fixed h\in {\mathbb R}^d

Based on the form of the likelihood ratio in (4), the following theorem is then immediate. 

Theorem 11 If a sequence of models \mathcal{M}_n satisfies the LAN condition with Fisher information matrix I_0, then \lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{M}) = 0 for \mathcal{M}=({\mathbb R}^d, \mathcal{B}({\mathbb R}^d), (\mathcal{N}(\theta,I_0^{-1}))_{\theta\in{\mathbb R}^d})

Proof: Note that for any finite sub-model, Slutsky’s theorem applied to (4) gives the desired convergence in distribution, and clearly the Gaussian location model is homogeneous. Now applying Theorem 6 gives the desired convergence. We leave the discussion of the general case to the bibliographic notes.  \Box 

Now the only remaining task is to check the likelihood ratios for some common models and show that the LAN condition is satisfied. For example, for QMD models (P_\theta)_{\theta\in\Theta}, we have 

 \log \frac{dP_{\theta+h/\sqrt{n}}^{\otimes n}}{dP_\theta^{\otimes n}}(X) = h^\top \left( \frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{\ell}_\theta(X_i) \right) + \frac{1}{2} h^\top \left(\frac{1}{n}\sum_{i=1}^n \ddot{\ell}_\theta(X_i) \right)h + o_{P_\theta^{\otimes n}}(1),

where intuitively the CLT and the LLN lead to the desired form (4). The next proposition makes this intuition precise. 

Proposition 12 Let (P_\theta)_{\theta\in\Theta} be QMD in an open set \Theta\subseteq{\mathbb R}^d with Fisher information matrix I(\theta). Then the sequence of models \mathcal{M}_n = (\mathcal{X}^n, \mathcal{F}^{\otimes n}, (P_{\theta+h/\sqrt{n}}^{\otimes n} )_{h\in \Theta_n} ), with \Theta_n = \{h\in {\mathbb R}^d: \theta+ h/\sqrt{n}\in \Theta\}\uparrow {\mathbb R}^d, satisfies the LAN condition with Fisher information I(\theta). 

Proof: Write P_n = P_{\theta+h/\sqrt{n}}, P = P_\theta and g = h^\top \dot{\ell}_\theta. Then by Taylor expansion, 

 \begin{array}{rcl} \log\frac{dP_n^{\otimes n}}{dP^{\otimes n}}(X) & = & \sum_{i=1}^n \log \frac{dP_n}{dP}(X_i) \\ &=& 2\sum_{i=1}^n \log\left(1+\frac{1}{2}W_{ni}\right) \\ &=& \sum_{i=1}^n W_{ni} - \frac{1}{4}\sum_{i=1}^n W_{ni}^2 + \sum_{i=1}^n W_{ni}^2 r(W_{ni}), \end{array}

where W_{ni} := 2(\sqrt{(dP_n/dP)(X_i)} - 1), and r(w) \rightarrow 0 as w\rightarrow 0. By the QMD condition, we have 

 \text{Var}\left(\sum_{i=1}^n W_{ni} - \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) \right) = n\text{Var}\left(W_{n1} - \frac{g(X_1)}{\sqrt{n}}\right) = o(1).

Moreover, \mathop{\mathbb E}_P g(X_i) = 0 by the property of the score function, and 

 \mathop{\mathbb E}\left[\sum_{i=1}^n W_{ni}\right] = -n\int (\sqrt{dP_n} - \sqrt{dP})^2 \rightarrow - \frac{1}{4}\mathop{\mathbb E}_P[g(X)^2].

Consequently, we conclude that 

 \sum_{i=1}^n W_{ni} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) -\frac{1}{4}\mathop{\mathbb E}_P[g(X)^2] + o_P(1).

For the second term, the QMD condition gives nW_{ni}^2 = g(X_i)^2 + o_P(1), and therefore the LLN gives \sum_{i=1}^n W_{ni}^2 = \mathop{\mathbb E}_P[g(X)^2] + o_P(1). For the last term, Markov’s inequality gives n\mathop{\mathbb P}(|W_{ni}|\ge \varepsilon)\rightarrow 0, and therefore \max_{i\in [n]} | r(W_{ni}) | = o_P(1), as desired.  \Box 

In other words, Proposition 12 implies that all regular statistical models locally look like a Gaussian location model, where the local radius is \Theta(n^{-1/2}). 
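
As a numerical illustration of Proposition 12, the sketch below uses an exponential model with rate \theta (so that \dot{\ell}_\theta(x)=1/\theta-x and I(\theta)=\theta^{-2}), with an arbitrary choice of h and a moderate n, and compares the exact log-likelihood ratio with the LAN approximation h Z_n - \frac{1}{2}h^2 I(\theta) over a few Monte Carlo replications; the remainder r_n(h) should be small.

```python
import numpy as np

rng = np.random.default_rng(4)

theta, h, n, reps = 2.0, 1.0, 10_000, 5
I_theta = 1.0 / theta ** 2                      # Fisher information of Exp(rate = theta)

for _ in range(reps):
    X = rng.exponential(scale=1.0 / theta, size=n)   # X_i ~ Exp(theta) under P_theta
    theta_n = theta + h / np.sqrt(n)
    # Exact log-likelihood ratio log dP_{theta + h/sqrt(n)}^{(n)} / dP_theta^{(n)} at X.
    llr = n * np.log(theta_n / theta) - (theta_n - theta) * X.sum()
    # LAN approximation: central sequence Z_n = n^{-1/2} sum_i score(X_i), score(x) = 1/theta - x.
    Z_n = (1.0 / theta - X).sum() / np.sqrt(n)
    lan = h * Z_n - 0.5 * h ** 2 * I_theta
    print(f"exact = {llr:8.4f}, LAN approx = {lan:8.4f}, remainder = {llr - lan:+.4f}")
```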

3.3. Proof of LAM Theorem 

Now we are ready to glue all the necessary ingredients together. First, for product QMD statistical models, Proposition 12 implies that the LAN condition is satisfied for the local model around any chosen parameter \theta. Second, by Theorem 11, these local models converge to a Gaussian location model with covariance I(\theta)^{-1}. By the definition of the model distance, the minimax risk of these local models is asymptotically no smaller than that of the limiting Gaussian location model, which by Theorem 3 is \mathop{\mathbb E} \ell(Z) with Z\sim \mathcal{N}(0,I(\theta)^{-1}) for any bowl-shaped loss \ell. Consequently, we have the following local asymptotic minimax theorem. 

Theorem 13 (LAM, restated) Let (P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d} be a QMD statistical model which admits a non-singular Fisher information I(\theta_0) at \theta_0. Let \psi(\theta) be differentiable at \theta=\theta_0, and T_n be an estimator sequence of \psi(\theta) in the model (P_\theta^{\otimes n}). Consider any compact action space \mathcal{A}\subset {\mathbb R}^d, and any bowl-shaped loss function \ell: {\mathbb R}^d\rightarrow {\mathbb R}_+. Then

 \lim_{c\rightarrow\infty} \liminf_{n\rightarrow\infty} \sup_{\|h\|\le c } \mathop{\mathbb E}_{\theta_0+\frac{h}{\sqrt{n}}} \ell\left(\sqrt{n}\left(T_n - \psi\left(\theta_0 + \frac{h}{\sqrt{n}}\right) \right) \right) \ge \mathop{\mathbb E} \ell(Z)

with Z\sim \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)

Note that here the compactness of the action space is required for the limiting arguments, while all our previous analysis considers finite models. Our arguments via model distance are also different from those used by Hájek and Le Cam, who introduced the notion of contiguity to arrive at the same result under weaker conditions. We refer to the bibliographic notes for further details of these alternative approaches. 

4. Applications and Limitations 

In this section, we will apply the LAM theorem to prove asymptotic lower bounds for both parametric and nonparametric problems. We will also discuss the limitations of LAM to motivate the necessity of future lectures. 

4.1. Parametric Entropy Estimation 

Consider the discrete i.i.d. sampling model X_1,\cdots,X_n\sim P=(p_1,\cdots,p_k)\in\mathcal{M}_k, where \mathcal{M}_k denotes the probability simplex on k elements. The target is to estimate the Shannon entropy 

 H(P) := \sum_{i=1}^k -p_i\log p_i

under the mean squared loss. We can apply LAM to prove a local minimax lower bound for this problem. 

First we compute the Fisher information of the multinomial model X\sim P, where we set \theta=(p_1,\cdots,p_{k-1}) to be the free parameters. It is easy to show that 

 I(\theta) = \text{diag}(p_1^{-1},\cdots,p_{k-1}^{-1}) + p_k^{-1}{\bf 1}{\bf 1}^\top .

By the matrix inversion formula (A + UCV )^{-1} = A^{-1} - A^{-1}U(C^{-1} + V A^{-1}U)^{-1}V A^{-1}, we have 

 I(\theta)^{-1} = \text{diag}(\theta) - \theta\theta^\top.

Now choosing \psi(\theta) = (\sum_{i=1}^{k-1} -\theta_i\log \theta_i) - (1-\sum_{i=1}^{k-1}\theta_i)\log(1-\sum_{i=1}^{k-1}\theta_i) and \ell(t) = t^2 in LAM, after some algebra we arrive at 

 \inf_{\hat{H}} \sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P (\hat{H} - H(P))^2 \ge \frac{1+o_n(1)}{n} \sup_{P\in \mathcal{M}_k} \text{Var}\left[\log\frac{1}{P(X)}\right].

Note that the supremum over P\in \mathcal{M}_k on the RHS is due to the fact that the centering \theta can be chosen arbitrarily. 
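
The “some algebra” above can also be checked numerically. The sketch below (for an arbitrary distribution on k=4 elements) builds I(\theta), verifies the inverse formula, and confirms that the delta-method variance \dot{\psi}_\theta I(\theta)^{-1}\dot{\psi}_\theta^\top equals \text{Var}[\log(1/P(X))].

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])             # an arbitrary distribution on k = 4 elements
theta = p[:-1]                                  # free parameters (p_1, ..., p_{k-1})

# Fisher information of the multinomial model and its claimed inverse.
I = np.diag(1.0 / theta) + np.ones((3, 3)) / p[-1]
I_inv = np.diag(theta) - np.outer(theta, theta)
print(np.allclose(I @ I_inv, np.eye(3)))        # True

# Gradient of psi(theta) = H(P) in the free parameters: d psi / d theta_j = log(p_k / p_j).
psi_dot = np.log(p[-1] / theta)

delta_var = psi_dot @ I_inv @ psi_dot                               # psi_dot^T I^{-1} psi_dot
var_log = np.sum(p * np.log(p) ** 2) - np.sum(p * np.log(p)) ** 2   # Var[log(1/P(X))]
print(delta_var, var_log, np.isclose(delta_var, var_log))
```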

4.2. Nonparametric Entropy Estimation 

Consider a continuous i.i.d. sampling model from some density, i.e., X_1,\cdots,X_n\sim f. Assume that the density f is supported on {\mathbb R}^d, and the target is to estimate the differential entropy 

 h(f) := \int_{{\mathbb R}^d} -f(x)\log f(x)dx.

As before, we would like to prove a local minimax lower bound for the mean squared error around any target density f_0. However, since the model is nonparametric and the parameter f is infinite-dimensional, there is no Fisher information matrix for this model. To overcome this difficulty, we may consider a one-dimensional parametric submodel instead.

Let g: {\mathbb R}^d\rightarrow {\mathbb R} be any measurable function with \text{supp}(g)\subseteq \text{supp}(f_0) and \int_{{\mathbb R}^d} g(x)dx=0; then f_0+tg is still a valid density on {\mathbb R}^d for sufficiently small |t| (under mild conditions on g). Consequently, keeping t small, the i.i.d. sampling model X_1,\cdots,X_n\sim f_0+tg becomes a submodel parametrized only by t. For this one-dimensional parametric submodel, the Fisher information at t=0 can be computed as 

 I_0 = \int_{{\mathbb R}^d} \frac{g^2(x)}{f_0(x)} dx.

Setting \psi(t) = h(f_0+tg) in LAM, we have 

 \psi'(0) = \int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx,

and consequently 

 \begin{array}{rcl} \inf_{\hat{h}} \sup_{f} \mathop{\mathbb E}_f(\hat{h} - h(f))^2 &\ge& \inf_{\hat{h}} \sup_{t} \mathop{\mathbb E}_{f_0+tg}(\hat{h} - \psi(t))^2 \\ &\ge& \frac{1+o_n(1)}{n}\left(\int_{{\mathbb R}^d} \frac{g^2(x)}{f_0(x)} dx\right)^{-1} \left(\int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx\right)^2. \end{array}

Since our choice of the test function g is arbitrary, we may actually choose the worst-case g such that the above lower bound is maximized. We claim that the maximum value is 

 V(f_0):=\int_{{\mathbb R}^d} f_0(x)\log^2 f_0(x)dx - h(f_0)^2.

Clearly, this value is attained by the test function g(x)= f_0(x)(-\log f_0(x) - h(f_0)). For the maximality, the Cauchy–Schwarz inequality and the assumption \int_{{\mathbb R}^d} g(x)dx=0 give 

 \begin{array}{rcl} \left( \int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx \right)^2 &=& \left( \int_{{\mathbb R}^d} g(x)(-\log f_0(x) - h(f_0))dx \right)^2\\ &\le & \int_{{\mathbb R}^d} \frac{g(x)^2}{f_0(x)}dx \cdot \int_{{\mathbb R}^d} f_0(x)(-\log f_0(x) - h(f_0))^2dx \\ &=& V(f_0)\cdot \int_{{\mathbb R}^d} \frac{g(x)^2}{f_0(x)}dx. \end{array}

Therefore, the parametric lower bound for nonparametric entropy estimation is 

 \inf_{\hat{h}} \sup_{f} \mathop{\mathbb E}_f(\hat{h} - h(f))^2 \ge \frac{1+o_n(1)}{n}\cdot V(f_0).
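
The optimality claim for the test function g can be checked on a concrete density. The sketch below takes f_0 to be the standard normal density on {\mathbb R} (an arbitrary choice, for which V(f_0)=1/2) and numerically evaluates \int g, I_0, \psi'(0) and V(f_0) for g=f_0(-\log f_0 - h(f_0)).

```python
import numpy as np
from scipy.integrate import quad

# f_0: standard normal density on the real line (d = 1).
f0     = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
log_f0 = lambda x: -x ** 2 / 2 - 0.5 * np.log(2 * np.pi)

h_f0 = quad(lambda x: -f0(x) * log_f0(x), -np.inf, np.inf)[0]        # differential entropy h(f_0)
V_f0 = quad(lambda x: f0(x) * log_f0(x) ** 2, -np.inf, np.inf)[0] - h_f0 ** 2

# Optimal test function g = f_0 * (-log f_0 - h(f_0)); it integrates to zero.
g = lambda x: f0(x) * (-log_f0(x) - h_f0)
print("int g           =", quad(g, -np.inf, np.inf)[0])

I0        = quad(lambda x: g(x) ** 2 / f0(x), -np.inf, np.inf)[0]    # Fisher info of the submodel
psi_prime = quad(lambda x: g(x) * (1 - log_f0(x)), -np.inf, np.inf)[0]

print("psi'(0)^2 / I_0 =", psi_prime ** 2 / I0)
print("V(f_0)          =", V_f0, "(equals 1/2 for the standard normal)")
```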

4.3. Limitations of Classical Asymptotics 

The theorems from classical asymptotics typically help to prove an error bound of order \Theta(n^{-1/2}) with an explicit constant, and it is also known that these bounds are optimal and attained by the MLE. However, there are still some problems with these approaches: 

  1. Non-asymptotic vs. asymptotic: Asymptotic bounds are useful only in scenarios where the problem size remains fixed and the sample size grows to infinity, and there is no general guarantee of when we have entered the asymptotic regime (it may even require that n\gg e^d). In practice, essentially all recent problems are high-dimensional ones where the number of parameters is comparable to or even larger than the sample size (e.g., an over-parametrized neural network), and some key properties of the problem may be entirely obscured in the asymptotic regime. 
  2. Parametric vs. nonparametric: The results of classical asymptotics may not be helpful for many nonparametric problems, the main obstacle being their infinite-dimensional nature. Although the parametric reduction is sometimes helpful (e.g., in the entropy example), the parametric rate \Theta(n^{-1/2}) is in general not attainable in nonparametric problems, and other tools are necessary. For example, if we would like to estimate the density f at some point x_0, the worst-case test function actually gives a vacuous lower bound (which is infinite). 
  3. Global vs. local: As the name of LAM suggests, the minimax lower bound here holds locally. However, the global structure of some problems may also be important, and the global minimax lower bound may be much larger than the supremum of the local bounds over all points. For example, in Shannon entropy estimation, the bias is actually the dominating term and cannot be captured by local methods. 

To overcome these difficulties, we need to develop tools to establish non-asymptotic results for possibly high-dimensional or nonparametric problems, which is the focus of the rest of the lecture series. 

5. Bibliographic Notes 

The asymptotic theorems in Theorem 2 were first presented in Hájek (1970) and Hájek (1972), and we refer to Le Cam (1986), Le Cam and Yang (1990) and van der Vaart (2000) as excellent textbooks. The approach here of using model distance to establish LAM is taken from Sections 4 and 6 of Liese and Miescke (2007); see also Le Cam (1972). 

There is another line of approach to establishing the asymptotic theorems. A key concept is contiguity, proposed by Le Cam (1960), which enables an asymptotic change of measure. Based on contiguity and the LAN condition, the distribution of any (regular) estimator under the local alternative can be evaluated. Then the convolution theorem can be shown, which helps to establish LAM; details can be found in van der Vaart (2000). The LAM theorem can also be established directly by computing the asymptotic Bayes risk under proper priors; see Section 6 of Le Cam and Yang (1990). 

For parametric or nonparametric entropy estimation, we refer to recent papers (Jiao et al. (2015) and Wu and Yang (2016) for the discrete case, Berrett, Samworth and Yuan (2019) and Han et al. (2017) for the continuous case) and the references therein. 

  1. Jaroslav Hájek, A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14.4 (1970): 323-330. 
  2. Jaroslav Hájek, Local asymptotic minimax and admissibility in estimation. Proceedings of the sixth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1972. 
  3. Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986. 
  4. Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990. 
  5. Aad W. van der Vaart, Asymptotic statistics. Vol. 3. Cambridge University Press, 2000. 
  6. Friedrich Liese and Klaus-J. Miescke. Statistical decision theory. Springer, New York, NY, 2007. 
  7. Lucien M. Le Cam, Limits of experiments. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972. 
  8. Lucien M. Le Cam, Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98 (1960). 
  9. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman, Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835-2885. 
  10. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720. 
  11. Thomas B. Berrett, Richard J. Samworth, and Ming Yuan, Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics 47.1 (2019): 288-318. 
  12. Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu, Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141 (2017).
