Lecture 4: Local Asymptotic Normality and Asymptotic Theorems


(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

In this lecture, we establish the asymptotic lower bounds for general statistical decision problems. Specifically, we show that models satisfying the benign local asymptotic normality (LAN) condition are asymptotically equivalent to a Gaussian location model, under which the Hájek–Le Cam local asymptotic minimax (LAM) lower bound holds. We also apply this theorem to both parametric and nonparametric problems. 

1. History of Asymptotic Statistics 

To begin with, we first recall the notions of the score function and Fisher information, which can be found in most textbooks. 

Definition 1 (Fisher Information) A family of distributions {(P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d}} on {\mathcal{X}} is quadratic mean differentiable (QMD) at {\theta\in {\mathbb R}^d} if there exists a score function {\dot{\ell}_\theta: \mathcal{X} \rightarrow {\mathbb R}^d} such that 

\displaystyle \int\left(\sqrt{dP_{\theta+h}} - \sqrt{dP_\theta} - \frac{1}{2}h^\top \dot{\ell}_\theta \sqrt{dP_\theta} \right)^2 = o(\|h\|^2).

In this case, the matrix {I(\theta) := \mathop{\mathbb E}_{P_\theta}[\dot{\ell}_\theta\dot{\ell}_\theta^\top ]} exists and is called the Fisher information at {\theta}.

R. A. Fisher popularized the above Fisher information and the use of the maximum likelihood estimator (MLE) starting from the 1920s. He believed that the Fisher information of a statistical model characterizes the fundamental limit of estimating {\theta} based on {n} i.i.d. observations from {P_\theta}, and that the MLE asymptotically attains this limit. More specifically, he made the following conjectures:

  1. For any asymptotically normal estimators {\hat{\theta}_n} such that {\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma_\theta)} for any {\theta\in \Theta\subseteq {\mathbb R}^d}, there must be \displaystyle \Sigma_\theta \succeq I(\theta)^{-1},  \forall \theta\in \Theta. \ \ \ \ \ (1)
  2. The MLE satisfies {\Sigma_\theta = I(\theta)^{-1}} for any {\theta\in\Theta}.

Although the second conjecture is relatively easy to establish under suitable regularity conditions, the first conjecture, which seems plausible in view of the well-known Cramér–Rao bound, caused considerable trouble when people tried to prove it. The following example shows that (1) may be quite problematic.

Example 1 Here is a counterexample to (1) proposed by Hodges in 1951. Let {\Theta = {\mathbb R}}, and consider a Gaussian location model where {X_1,\cdots,X_n} are i.i.d. distributed as {\mathcal{N}(\theta,1)}. A natural estimator of {\theta} is the empirical mean {\bar{X}_n = \sum_{i=1}^n X_i/n}, and the Fisher information is {I(\theta)\equiv 1}. Hodges’ estimator is constructed as follows:

\displaystyle \hat{\theta}_n = \bar{X}_n\cdot 1(|\bar{X}_n| \ge n^{-1/4}). \ \ \ \ \ (2)

It is easy to show that {\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma_\theta)} for any {\theta\in {\mathbb R}}, with {\Sigma_\theta = 1} for non-zero {\theta} and {\Sigma_\theta = 0} for {\theta=0}. Consequently, (1) does not hold for Hodges’ estimator. The same result holds if the threshold {n^{-1/4}} in (2) is replaced by any sequence {a_n} with {a_n = o(1)} and {a_n = \omega(n^{-1/2})}.
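
The superefficiency at {\theta=0} and the poor local risk can both be seen in a quick simulation. The following is a minimal sketch (not part of the original notes) using the squared loss and an arbitrary local alternative {\theta = 0.5n^{-1/4}}; all numerical choices are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def hodges(xbar, n):
    # Hodges' estimator (2): threshold the sample mean at n^{-1/4}
    return xbar * (np.abs(xbar) >= n**-0.25)

def scaled_mse(theta, n, reps=100_000):
    # Monte Carlo estimate of E[n (theta_hat - theta)^2];
    # the sample mean is simulated directly as theta + N(0, 1/n)
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)
    return n * np.mean((hodges(xbar, n) - theta)**2)

for n in [100, 10_000, 1_000_000]:
    # at theta = 0 the scaled risk tends to 0 (superefficiency), while
    # at the local alternative theta ~ n^{-1/4} it diverges (attempt 3)
    print(n, scaled_mse(0.0, n), scaled_mse(0.5 * n**-0.25, n))
```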

Hodges’ example suggests that (1) should at least be weakened in appropriate ways. A careful look at the structure of Hodges’ estimator (2) suggests three possible remedies:

  1. The estimator {\hat{\theta}_n} is superefficient (i.e., where (1) fails) only at one point {\theta=0}. It may be expected that the set of {\theta} such that (1) fails is quite small. 
  2. Although Hodges’ estimator satisfies the asymptotic normality condition, i.e., {\sqrt{n}(\hat{\theta}_n - \theta)} weakly converges to a normal distribution under {P_\theta^{\otimes n}}, for any non-zero perturbation {h\in {\mathbb R}} the sequence {\sqrt{n}(\hat{\theta}_n - \theta - h/\sqrt{n})} does not weakly converge to the same distribution under {P_{\theta+h/\sqrt{n}}^{\otimes n}}. Hence, we may expect that (1) actually holds for more regular estimator sequences.
  3. Let {R(\theta)} be the risk function of the estimator {\hat{\theta}_n} under the absolute value loss. It can be computed that {R(0) = o(1/\sqrt{n})}, while {R(\theta) = \Omega(|\theta|)} for {|\theta| = O(n^{-1/4})}. In other words, the worst-case risk over an interval of size {n^{-1/2}} around {\theta=0} is still of the order {\Omega(n^{-1/2})}, which is considerably larger than the risk at the single point {\theta=0}. Therefore, it may make sense to consider the local minimax risk.

It turns out that all these attempts can be successful, and the following theorem summarizes the key results of asymptotic statistics developed by J. Hájek and L. Le Cam in the 1970s.

Theorem 2 (Asymptotic Theorems) Let {(P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d}} be a QMD statistical model which admits a non-singular Fisher information {I(\theta_0)} at {\theta_0}. Let {\psi(\theta)} be differentiable at {\theta=\theta_0}, and {T_n} be an estimator sequence of {\psi(\theta)} in the model {(P_\theta^{\otimes n})}.

1. (Almost everywhere convolution theorem) If {\sqrt{n}(T_n - \psi(\theta))} converges in distribution to some probability measure {L_\theta} for every {\theta}, and {I(\theta)} is non-singular for every {\theta}, then there exists some probability measure {M} such that 

\displaystyle L_{\theta} = \mathcal{N}(0, \dot{\psi}_{\theta} I(\theta)^{-1}\dot{\psi}_{\theta}^\top) * M

for Lebesgue almost every {\theta}, where {*} denotes the convolution. 

2. (Convolution theorem) If {\sqrt{n}(T_n - \psi(\theta))} converges in distribution to some probability measure {L_\theta} for {\theta = \theta_0}, and {T_n} is regular in the sense that {\sqrt{n}(T_n - \psi(\theta+h/\sqrt{n}))} weakly converges to the same limit under {P_{\theta+h/\sqrt{n}}^{\otimes n}} for every {h\in {\mathbb R}^d}, then there exists some probability measure {M} such that 

\displaystyle L_{\theta_0} = \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top) * M.

3. (Local asymptotic minimax theorem) Let {\ell} be a bowl-shaped loss function, i.e., {\ell} is non-negative, symmetric and quasi-convex. In mathematical terms, {\ell(x) = \ell(-x)\ge 0} and the sublevel sets {\{x: \ell(x)\le t \}} are convex for all {t\in {\mathbb R}}. Then

\displaystyle \lim_{c\rightarrow\infty} \liminf_{n\rightarrow\infty} \sup_{\|h\|\le c } \mathop{\mathbb E}_{\theta_0+\frac{h}{\sqrt{n}}} \ell\left(\sqrt{n}\left(T_n - \psi\left(\theta_0 + \frac{h}{\sqrt{n}}\right) \right) \right) \ge \mathop{\mathbb E} \ell(Z)

with {Z\sim \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)}.

We will be primarily interested in the local asymptotic minimax (LAM) theorem, for it directly gives general lower bounds for statistical estimation. This theorem will be proved in the next two sections using asymptotic equivalence between models, and some applications will be given in the subsequent section. 

2. Gaussian Location Model 

In this section we study arguably the simplest statistical model, i.e., the Gaussian location model, and will show in the next section that all regular models converge to it asymptotically. In the Gaussian location model, we have {\theta\in {\mathbb R}^d} and observe {X\sim \mathcal{N}(\theta,\Sigma)} with a known non-singular covariance {\Sigma}. For a bowl-shaped loss function {\ell} (defined in Theorem 2), a natural estimator of {\theta} is {\hat{\theta}=X}, whose worst-case risk is {\mathop{\mathbb E} \ell(Z)} with {Z\sim \mathcal{N}(0,\Sigma)}. The main theorem of this section is that the natural estimator {\hat{\theta}=X} is minimax.

Theorem 3 For any bowl-shaped loss {\ell}, we have 

\displaystyle \inf_{\hat{\theta}} \sup_{\theta\in {\mathbb R}^d} \mathop{\mathbb E}_\theta \ell( \hat{\theta} - \theta ) = \mathop{\mathbb E} \ell(Z)

for {Z\sim \mathcal{N}(0,\Sigma)}.

The proof of Theorem 3 relies on the following important lemma for Gaussian random variables. 

Lemma 4 (Anderson’s Lemma) Let {Z\sim \mathcal{N}(0,\Sigma)} and {\ell} be bowl-shaped. Then 

\displaystyle \min_{x\in{\mathbb R}^d} \mathop{\mathbb E} \ell(Z+x) = \mathop{\mathbb E} \ell(Z).

Proof: For {t\ge 0}, let {K_t = \{z: \ell(z)\le t \} \subseteq {\mathbb R}^d}. Since {\ell} is bowl-shaped, the set {K_t} is convex. Moreover, since 

\displaystyle \mathop{\mathbb E} \ell(Z+x) = \int_0^\infty (1 - \mathop{\mathbb P}(Z\in K_t - x) )dt,

it suffices to show that {\mathop{\mathbb P}(Z\in K_t - x )\le \mathop{\mathbb P}(Z\in K_t)} for any {x\in {\mathbb R}^d}. We shall need the following inequality in convex geometry. 

Theorem 5 (Prékopa–Leindler Inequality, or Functional Brunn–Minkowski Inequality) Let {\lambda\in (0,1)} and {f,g,h} be non-negative real-valued measurable functions on {{\mathbb R}^d}, with

\displaystyle h((1-\lambda)x + \lambda y) \ge f(x)^{1-\lambda} g(y)^\lambda, \qquad \forall x,y\in {\mathbb R}^d.

Then 

\displaystyle \int_{{\mathbb R}^d} h(x)dx \ge \left( \int_{{\mathbb R}^d} f(x)dx\right)^{1-\lambda} \left( \int_{{\mathbb R}^d} g(x)dx\right)^\lambda.

Let {\phi} be the density function of {Z}, which is log-concave. Consider functions 

\displaystyle f(z) = \phi(z)\cdot 1_{K_t+x} (z), \quad g(z) = \phi(z)\cdot 1_{K_t-x}(z), \quad h(z) = \phi(z)\cdot 1_{K_t}(z)

and {\lambda=1/2}. The log-concavity of {\phi}, together with the identity {(K_t-x)/2 + (K_t+x)/2 = K_t/2 + K_t/2 = K_t} (valid by convexity of {K_t}), ensures the condition of Theorem 5. Hence, Theorem 5 gives

\displaystyle \mathop{\mathbb P}(Z\in K_t) \ge \sqrt{\mathop{\mathbb P}(Z\in K_t+x) \cdot \mathop{\mathbb P}(Z\in K_t-x) }.

Finally, by symmetry of {Z} and {K_t}, we have {\mathop{\mathbb P}(Z\in K_t+x) = \mathop{\mathbb P}(Z\in K_t-x)}, which completes the proof. \Box
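
A quick Monte Carlo check of Anderson’s lemma may be instructive. The sketch below (not from the original notes) uses a truncated quadratic loss {\ell(v) = \min(\|v\|^2, 4)}, which is bowl-shaped, and an arbitrary covariance matrix; it verifies that centering at zero minimizes the expected loss.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)

def risk(x):
    # E[ell(Z + x)] for the bowl-shaped loss ell(v) = min(||v||^2, 4)
    return np.minimum(np.sum((Z + x)**2, axis=1), 4.0).mean()

print(risk(np.zeros(2)))                      # risk at the center
for x in [np.array([0.5, 0.0]), np.array([-1.0, 1.0])]:
    print(risk(x))                            # always at least the centered risk
```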

Now we are ready to prove Theorem 3.

Proof: Consider a Gaussian prior {\pi=\mathcal{N}(0,\Sigma_0)} on {\theta}. By algebra, the posterior distribution of {\theta} given {X=x} is Gaussian distributed as {\mathcal{N}( (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}x, (\Sigma_0^{-1}+\Sigma^{-1})^{-1} ) }. By Proposition 3 in Lecture 3 and the above Anderson’s lemma, the Bayes estimator is then {\hat{\theta} = (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}X}. Since the minimax risk is lower bounded by any Bayes risk (as the maximum is no less than the average), we have 

\displaystyle \inf_{\hat{\theta}} \sup_{\theta\in {\mathbb R}^d} \mathop{\mathbb E}_\theta \ell( \hat{\theta} - \theta ) \ge \int \ell d\mathcal{N}(0, (\Sigma_0^{-1}+\Sigma^{-1})^{-1} ).

Since this inequality holds for any {\Sigma_0}, choosing {\Sigma_0 = \lambda I} with {\lambda\rightarrow\infty} completes the proof of Theorem 3. \Box

3. Local Asymptotic Minimax Theorem 

In this section, we show that regular statistical models converge to a Gaussian location model asymptotically. To do so, we shall need verifiable criteria to establish the convergence of Le Cam’s distance, as well as the specific regularity conditions.

3.1. Likelihood Ratio Criteria for Asymptotic Equivalence 

In Lecture 3 we introduced the notion of Le Cam’s model distance, and showed that it can be upper bounded via the randomization criterion. However, designing a suitable transition kernel between models is often ad hoc and sometimes challenging, so simpler criteria would be helpful.

The main result of this subsection is the following likelihood ratio criterion:

Theorem 6 Let {\mathcal{M}_n = (\mathcal{X}_n, \mathcal{F}_n, \{P_{n,0},\cdots, P_{n,m}\})} and {\mathcal{M} = (\mathcal{X}, \mathcal{F}, \{P_0, \cdots, P_m \})} be finite statistical models. Further assume that {\mathcal{M}} is homogeneous in the sense that any pair in {\{P_0,\cdots,P_m\}} is mutually absolutely continuous. Define 

\displaystyle L_{n,i}(x_n) = \frac{dP_{n,i}}{dP_{n,0}}(x_n), \qquad i\in [m]

as the likelihood ratios, and similarly for {L_i(x)}. Then {\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{M})=0} if the distribution of {(L_{n,1}(X_n),\cdots, L_{n,m}(X_n))} under {X_n\sim P_{n,0}} weakly converges to that of {(L_1(X),\cdots,L_m(X))} under {X\sim P_0}

In other words, Theorem 6 states that a sufficient condition for asymptotic equivalence of models is the weak convergence of likelihood ratios. Although we shall not use this fact, the condition is also necessary. The finiteness assumption is mainly for technical purposes, and the general case requires proper limiting arguments.

To prove Theorem 6, we need the following notion of standard models

Definition 7 (Standard Model) Let {\mathcal{S}_{m+1} = \{(t_0,\cdots,t_m)\in {\mathbb R}_+^{m+1}: \sum_{i=0}^m t_i=m+1 \}}, and {\sigma(\mathcal{S}_{m+1})} be its Borel {\sigma}-algebra. A standard distribution {\mu} on {(\mathcal{S}_{m+1},\sigma(\mathcal{S}_{m+1}))} is a probability measure such that {\mathop{\mathbb E}_{\mu}[t_i]= 1} for any {i=0,1,\cdots,m}. The model 

\displaystyle \mathcal{N} = (\mathcal{S}_{m+1},\sigma(\mathcal{S}_{m+1}), \{Q_0,\cdots, Q_m \}), \qquad \text{with}\quad dQ_i = t_id\mu,

is called the standard model of {\mu}

The following lemma shows that any finite statistical model can be transformed into an equivalent standard form. 

Lemma 8 Let {\mathcal{M} = (\mathcal{X}, \mathcal{F}, \{P_0, \cdots, P_m \})} be a finite model, and {\mathcal{N}} be a standard model with standard distribution {\mu} being the distribution of {t:=(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}})\in \mathcal{S}_{m+1}} under mean measure {\overline{P}=\sum_{i=0}^m P_i/(m+1)}. Then {\Delta(\mathcal{M},\mathcal{N})=0}

Proof: Since {\mathop{\mathbb E}_{\mu}[t_i] = \mathop{\mathbb E}_{\overline{P}}[dP_i/d\overline{P}]=1}, the measure {\mu} is a standard distribution. Moreover, let {Q_i} be the distribution of {t} under {P_i}, then 

\displaystyle \mathop{\mathbb E}_{Q_i}[h(t)] = \mathop{\mathbb E}_{\overline{P}}\left[h(t)\frac{dP_i}{d\overline{P}} \right] = \mathop{\mathbb E}_{\mu}\left[h(t) t_i \right]

for any measurable function {h}, which gives {dQ_i = t_id\mu}, agreeing with the standard model. Finally, since {dP_i(x) = t_i d\mu(x)}, by the factorization criterion (e.g., Theorem 7 in Lecture 3) we conclude that the statistic {t} is sufficient, and therefore {\Delta(\mathcal{M},\mathcal{N})=0}. \Box

Lemma 8 helps to convert the sample space of all finite models to the simplex {\mathcal{S}_{m+1}}, and comparison between models is reduced to the comparison between their standard distributions. Consequently, we have the following quantitative bound on the model distance between finite models. 

Lemma 9 Let {\mathcal{M} = (\mathcal{X},\mathcal{F},\{P_0,\cdots,P_m \})} and {\mathcal{N} = (\mathcal{Y}, \mathcal{G}, \{Q_0,\cdots,Q_m\})} be two finite models with standard distributions {\mu_1, \mu_2} respectively. Then 

\displaystyle \Delta(\mathcal{M},\mathcal{N}) \le (m+1)\cdot \|\mu_1 - \mu_2\|_{\text{\rm D}} := (m+1)\cdot \sup_{\phi} |\mathop{\mathbb E}_{\mu_1} \phi - \mathop{\mathbb E}_{\mu_2} \phi|,

where { \|\mu_1 - \mu_2\|_{\text{\rm D}}} denotes the Dudley’s metric between probability measures {\mu_1,\mu_2}, and the supremum is taken over all measurable functions {\phi: \mathcal{S}_{m+1}\rightarrow {\mathbb R}} with {\|\phi\|_\infty\le 1} and {|\phi(t_1) - \phi(t_2)| \le \|t_1 - t_2\|_\infty} for any {t_1,t_2\in \mathcal{S}_{m+1}}

Remark 1 Recall that Dudley’s metric metrizes the weak convergence of probability measures on a metric space with its Borel {\sigma}-algebra. The fact that it is smaller than the total variation distance will be crucial to establish Theorem 6

Proof: Similar to the proof of the randomization criterion (Theorem 5 in Lecture 3), the following upper bound on the model distance holds: 

\displaystyle \Delta(\mathcal{M},\mathcal{N}) \le \sup_{L,\pi} | R_{L,\pi}^\star(\mathcal{M}) - R_{L,\pi}^\star(\mathcal{N}) |,

where {R_{L,\pi}^\star(\mathcal{M})} denotes the Bayes risk of model {\mathcal{M}} under loss {L} and prior {\pi}, and the loss {L} is non-negative and upper bounded by one in the supremum. By algebra, the Bayes risk admits the following simple form under the standard model with standard distribution {\mu}

\displaystyle R_{L,\pi}^\star(\mu) = \int_{\mathcal{S}_{m+1}} \left(\inf_{c(t)\in \mathcal{C}_{L,\pi}(t)} c(t)^\top t \right) \mu(dt),

where the set {\mathcal{C}_{L,\pi}(t)} is defined as 

\displaystyle \mathcal{C}_{L,\pi}(t) = \left\{\left(\pi_0\int L(0,a)\delta(t,da), \cdots, \pi_m\int L(m,a)\delta(t,da) \right): \text{stochastic kernel }\delta: \mathcal{S}_{m+1}\rightarrow \mathcal{A} \right\}.

Since the diameter of {\mathcal{C}_{L,\pi}(t)} in {\|\cdot\|_1} is one, we conclude that {t\mapsto \inf_{c(t)\in \mathcal{C}_{L,\pi}(t)} c(t)^\top t} is upper bounded by {m+1} and {1}-Lipschitz under {\|\cdot\|_\infty}. The rest of the proof follows from the definition of Dudley’s metric. \Box

Finally we are ready to present the proof of Theorem 6. Note that there is a bijective map between {(L_1,\cdots,L_m)} and {(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}})}, which is continuous under the model {\mathcal{M}} due to the homogeneity assumption. Then by continuous mapping theorem (see remark below), the weak convergence of likelihood ratios implies the weak convergence of their standard distributions. Since Dudley’s metric metrizes the weak convergence of probability measures, the result of Lemma 9 completes the proof. 

Remark 2 The continuous mapping theorem for weak convergence states that, if Borel-measurable random variables {X_n} converge weakly to {X} on a metric space, and {f} is a function continuous on a set {C} such that {\mathop{\mathbb P}(X\in C)=1}, then {f(X_n)} also converges weakly to {f(X)}. Note that the function {f} is only required to be continuous on the support of the limiting random variable {X}.

3.2. Locally Asymptotically Normal (LAN) Models 

Motivated by Theorem 6, in order to prove that certain models asymptotically become normal, we may show that the likelihood functions weakly converge to those in the normal model. Note that for the Gaussian location model {\mathcal{M}=({\mathbb R}^d, \mathcal{B}({\mathbb R}^d), (\mathcal{N}(\theta,\Sigma))_{\theta\in{\mathbb R}^d})}, the log-likelihood ratio is given by 

\displaystyle \log \frac{dP_{\theta+h}}{dP_\theta}(x) = h^\top \Sigma^{-1}(x-\theta) - \frac{1}{2}h^\top \Sigma^{-1}h, \ \ \ \ \ (3)

where {\Sigma^{-1}(x-\theta)\sim \mathcal{N}(0,\Sigma^{-1})} for {x\sim P_\theta}. Equation (3) motivates the following definition of locally asymptotically normal (LAN) models, in which the likelihood ratio takes approximately the form (3).
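
Identity (3) is straightforward to verify numerically. The following sketch (illustrative only; the dimension, covariance and points are arbitrary) compares the log-likelihood ratio computed from the Gaussian densities with the closed form in (3).

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
theta = np.array([1.0, -0.5])
h = np.array([0.7, 0.2])
x = np.array([1.5, 0.1])

# log dP_{theta+h}/dP_theta(x) computed from the two Gaussian densities
lhs = (multivariate_normal(theta + h, Sigma).logpdf(x)
       - multivariate_normal(theta, Sigma).logpdf(x))

# closed form (3): h' Sigma^{-1} (x - theta) - (1/2) h' Sigma^{-1} h
Si = np.linalg.inv(Sigma)
rhs = h @ Si @ (x - theta) - 0.5 * h @ Si @ h
print(lhs, rhs)   # the two values agree
```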

Definition 10 (Local Asymptotic Normality) A sequence of models {\mathcal{M}_n = (\mathcal{X}_n, \mathcal{F}_n, (P_{n,h})_{h\in \Theta_n})} with {\Theta_n \uparrow {\mathbb R}^d} is called locally asymptotically normal (LAN) with central sequence {Z_n} and Fisher information matrix {I_0} if 

\displaystyle \log\frac{dP_{n,h}}{dP_{n,0}} = h^\top Z_n - \frac{1}{2}h^\top I_0h + r_n(h), \ \ \ \ \ (4)

with {Z_n \overset{d}{\rightarrow} \mathcal{N}(0,I_0)} under {P_{n,0}}, and {r_n(h)} converges to zero in probability under {P_{n,0}} for any fixed {h\in {\mathbb R}^d}

Based on the form of the likelihood ratio in (4), the following theorem is then immediate. 

Theorem 11 If a sequence of models {\mathcal{M}_n} satisfies the LAN condition with Fisher information matrix {I_0}, then {\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{M}) = 0} for {\mathcal{M}=({\mathbb R}^d, \mathcal{B}({\mathbb R}^d), (\mathcal{N}(\theta,I_0^{-1}))_{\theta\in{\mathbb R}^d})}.

Proof: Note that for any finite sub-model, Slutsky’s theorem applied to (4) gives the desired weak convergence of likelihood ratios, and the Gaussian location model is clearly homogeneous. Applying Theorem 6 then gives the claimed convergence of models. We defer the discussion of the general case to the bibliographic notes. \Box

Now the only remaining task is to check the likelihood ratios for some common models and show that the LAN condition is satisfied. For example, for QMD models {(P_\theta)_{\theta\in\Theta}}, we have 

\displaystyle \log \frac{dP_{\theta+h/\sqrt{n}}^{\otimes n}}{dP_\theta^{\otimes n}}(X) = h^\top \left( \frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{\ell}_\theta(X_i) \right) + \frac{1}{2} h^\top \left(\frac{1}{n}\sum_{i=1}^n \ddot{\ell}_\theta(X_i) \right)h + o_{P_\theta^{\otimes n}}(1),

where intuitively the CLT and the LLN lead to the desired form (4). The next proposition makes this intuition precise.

Proposition 12 Let {(P_\theta)_{\theta\in\Theta}} be QMD in an open set {\Theta\subseteq{\mathbb R}^d} with Fisher information matrix {I(\theta)}. Then the sequence of models {\mathcal{M}_n = (\mathcal{X}^n, \mathcal{F}^{\otimes n}, (P_{\theta+h/\sqrt{n}}^{\otimes n} )_{h\in \Theta_n} )}, with {\Theta_n = \{h\in {\mathbb R}^d: \theta+ h/\sqrt{n}\in \Theta\}\uparrow {\mathbb R}^d}, satisfies the LAN condition with Fisher information {I(\theta)}.

Proof: Write {P_n = P_{\theta+h/\sqrt{n}}, P = P_\theta} and {g = h^\top \dot{\ell}_\theta}. Then by Taylor expansion, 

\displaystyle \begin{array}{rcl} \log\frac{dP_n^{\otimes n}}{dP^{\otimes n}}(X) & = & \sum_{i=1}^n \log \frac{dP_n}{dP}(X_i) \\ &=& 2\sum_{i=1}^n \log\left(1+\frac{1}{2}W_{ni}\right) \\ &=& \sum_{i=1}^n W_{ni} - \frac{1}{4}\sum_{i=1}^n W_{ni}^2 + \sum_{i=1}^n W_{ni}^2 r(W_{ni}), \end{array}

where {W_{ni} := 2(\sqrt{(dP_n/dP)(X_i)} - 1)}, and the Taylor remainder satisfies {r(w) \rightarrow 0} as {w\rightarrow 0}. By the QMD condition, we have

\displaystyle \text{Var}\left(\sum_{i=1}^n W_{ni} - \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) \right) = n\text{Var}\left(W_{n1} - \frac{g(X_1)}{\sqrt{n}}\right) = o(1).

Moreover, {\mathop{\mathbb E}_P g(X_i) = 0} by the property of the score function, and 

\displaystyle \mathop{\mathbb E}\left[\sum_{i=1}^n W_{ni}\right] = -n\int (\sqrt{dP_n} - \sqrt{dP})^2 \rightarrow - \frac{1}{4}\mathop{\mathbb E}_P[g(X)^2].

Consequently, we conclude that 

\displaystyle \sum_{i=1}^n W_{ni} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) -\frac{1}{4}\mathop{\mathbb E}_P[g(X)^2] + o_P(1).

For the second term, the QMD condition gives {nW_{ni}^2 = g(X_i)^2 + o_P(1)}, and therefore the LLN gives {\sum_{i=1}^n W_{ni}^2 = \mathop{\mathbb E}_P[g(X)^2] + o_P(1)}. For the last term, Markov’s inequality gives {n\mathop{\mathbb P}(|W_{ni}|\ge \varepsilon)\rightarrow 0}, and therefore {\max_{i\in [n]} | r(W_{ni}) | = o_P(1)}, as desired. \Box

In other words, Proposition 12 implies that all regular statistical models locally look like a Gaussian location model, where the local radius is {\Theta(n^{-1/2})}.
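
The LAN expansion (4) can also be observed directly in simulation. Below is an illustrative sketch (not from the original notes) for the Poisson family, where the score is {\dot{\ell}_\theta(x) = x/\theta - 1} and {I(\theta) = 1/\theta}; the remainder {r_n(h)} is seen to be small and the central sequence has variance close to {I(\theta)}.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
theta, h, n, reps = 2.0, 1.0, 2_000, 1_000
I0 = 1.0 / theta                  # Fisher information of Poisson(theta)

X = rng.poisson(theta, size=(reps, n))
loglr = (poisson.logpmf(X, theta + h / np.sqrt(n))
         - poisson.logpmf(X, theta)).sum(axis=1)
Zn = ((X / theta - 1.0) / np.sqrt(n)).sum(axis=1)   # central sequence
rn = loglr - (h * Zn - 0.5 * h**2 * I0)             # remainder in (4)

print(np.mean(np.abs(rn)))   # small: the LAN expansion holds
print(np.var(Zn), I0)        # Z_n is approximately N(0, I(theta))
```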

3.3. Proof of LAM Theorem 

Now we are ready to glue all necessary ingredients together. First, for product QMD statistical models, Proposition 12 implies that the LAN condition is satisfied for the local model around any chosen parameter {\theta}. Second, by Theorem 11, these local models will converge to a Gaussian location model with covariance {I(\theta)^{-1}}. By definition of the model distance, the minimax risk of these local models will be asymptotically no smaller than that of the limiting Gaussian location model, which by Theorem 3 is {\mathop{\mathbb E} \ell(Z)} with {Z\sim \mathcal{N}(0,I(\theta)^{-1})} for bowl-shaped loss {\ell}. Consequently, we have the following local asymptotic minimax theorem.

Theorem 13 (LAM, restated) Let {(P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d}} be a QMD statistical model which admits a non-singular Fisher information {I(\theta_0)} at {\theta_0}. Let {\psi(\theta)} be differentiable at {\theta=\theta_0}, and {T_n} be an estimator sequence of {\psi(\theta)} in the model {(P_\theta^{\otimes n})}. Consider any compact action space {\mathcal{A}\subset {\mathbb R}^d}, and any bowl-shaped loss function {\ell: {\mathbb R}^d\rightarrow {\mathbb R}_+}. Then 

\displaystyle \lim_{c\rightarrow\infty} \liminf_{n\rightarrow\infty} \sup_{\|h\|\le c } \mathop{\mathbb E}_{\theta_0+\frac{h}{\sqrt{n}}} \ell\left(\sqrt{n}\left(T_n - \psi\left(\theta_0 + \frac{h}{\sqrt{n}}\right) \right) \right) \ge \mathop{\mathbb E} \ell(Z)

with {Z\sim \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)}.

Note that here the compactness of the action space is required for the limiting arguments, while all our previous analyses considered finite models. Our arguments via model distance are also different from those used by Hájek and Le Cam, who introduced the notion of contiguity to arrive at the same result under weaker conditions. We refer to the bibliographic notes for further details of these alternative approaches.

4. Applications and Limitations 

In this section, we will apply the LAM theorem to prove asymptotic lower bounds for both parametric and nonparametric problems. We will also discuss the limitations of LAM to motivate the necessity of future lectures. 

4.1. Parametric Entropy Estimation 

Consider the discrete i.i.d. sampling model {X_1,\cdots,X_n\sim P=(p_1,\cdots,p_k)\in\mathcal{M}_k}, where {\mathcal{M}_k} denotes the probability simplex on {k} elements. The target is to estimate the Shannon entropy 

\displaystyle H(P) := \sum_{i=1}^k -p_i\log p_i

under the mean squared loss. We can apply LAM to prove a local minimax lower bound for this problem. 

First we compute the Fisher information of the multinomial model {X\sim P}, where we set {\theta=(p_1,\cdots,p_{k-1})} to be the free parameters. It is easy to show that 

\displaystyle I(\theta) = \text{diag}(p_1^{-1},\cdots,p_{k-1}^{-1}) + p_k^{-1}{\bf 1}{\bf 1}^\top .

By the matrix inversion formula {(A + UCV )^{-1} = A^{-1} - A^{-1}U(C^{-1} + V A^{-1}U)^{-1}V A^{-1}}, we have 

\displaystyle I(\theta)^{-1} = \text{diag}(\theta) - \theta\theta^\top.

Now choosing {\psi(\theta) = (\sum_{i=1}^{k-1} -\theta_i\log \theta_i) - (1-\sum_{i=1}^{k-1}\theta_i)\log(1-\sum_{i=1}^{k-1}\theta_i)} and {\ell(t) = t^2} in LAM, after some algebra we arrive at 

\displaystyle \inf_{\hat{H}} \sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P (\hat{H} - H(P))^2 \ge \frac{1+o_n(1)}{n} \sup_{P\in \mathcal{M}_k} \text{Var}\left[\log\frac{1}{P(X)}\right].

Note that the {\sup_{P\in \mathcal{M}_k}} on the RHS is due to our arbitrary choice of the centering {\theta}.
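
As a sanity check, the plug-in (MLE) entropy estimator attains this bound asymptotically at any fixed interior {P}. The sketch below (illustrative; the distribution and sample size are arbitrary) compares {n} times the Monte Carlo MSE of the plug-in estimator with the LAM constant {\text{Var}[\log(1/P(X))]}.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([0.5, 0.3, 0.2])
H = -(P * np.log(P)).sum()
V = (P * np.log(P)**2).sum() - H**2   # Var[log(1/P(X))], the LAM constant

n, reps = 5_000, 20_000
counts = rng.multinomial(n, P, size=reps)
Phat = counts / n
with np.errstate(divide="ignore", invalid="ignore"):
    # plug-in entropy, with the convention 0 log 0 = 0
    terms = np.where(counts > 0, Phat * np.log(Phat), 0.0)
Hhat = -terms.sum(axis=1)
print(n * np.mean((Hhat - H)**2), V)   # n * MSE approaches V
```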

4.2. Nonparametric Entropy Estimation 

Consider a continuous i.i.d. sampling model from some density, i.e., {X_1,\cdots,X_n\sim f}. Assume that the density {f} is supported on {{\mathbb R}^d}, and the target is to estimate the differential entropy 

\displaystyle h(f) := \int_{{\mathbb R}^d} -f(x)\log f(x)dx.

As before, we would like to prove a local minimax lower bound for the mean squared error around any target density {f_0}. However, since the model is nonparametric and the parameter {f} is infinite-dimensional, there is no Fisher information matrix for this model. To overcome this difficulty, we may consider a one-dimensional parametric submodel instead.

Let {g: {\mathbb R}^d\rightarrow {\mathbb R}} be any measurable function with {\text{supp}(g)\subseteq \text{supp}(f_0)} and {\int_{{\mathbb R}^d} g(x)dx=0}, then {f_0+tg} is still a valid density on {{\mathbb R}^d} for small {|t|}. Consequently, keeping {t} small, the i.i.d. sampling model {X_1,\cdots,X_n\sim f_0+tg} becomes a submodel parametrized only by {t}. For this 1-D parametric submodel, the Fisher information at {t=0} can be computed as 

\displaystyle I_0 = \int_{{\mathbb R}^d} \frac{g^2(x)}{f_0(x)} dx.

Setting {\psi(t) = h(f_0+tg)} in LAM, we have 

\displaystyle \psi'(0) = \int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx,

and consequently 

\displaystyle \begin{array}{rcl} \inf_{\hat{h}} \sup_{f} \mathop{\mathbb E}_f(\hat{h} - h(f))^2 &\ge& \inf_{\hat{h}} \sup_{t} \mathop{\mathbb E}_{f_0+tg}(\hat{h} - \psi(t))^2 \\ &\ge& \frac{1+o_n(1)}{n}\left(\int_{{\mathbb R}^d} \frac{g^2(x)}{f_0(x)} dx\right)^{-1} \left(\int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx\right)^2. \end{array}

Since our choice of the test function {g} is arbitrary, we may actually choose the worst-case {g} such that the above lower bound is maximized. We claim that the maximum value is 

\displaystyle V(f_0):=\int_{{\mathbb R}^d} f_0(x)\log^2 f_0(x)dx - h(f_0)^2.

Clearly, this value is attained for the test function {g (x)= f_0(x)(-\log f_0(x) - h(f_0)).} For the maximality, the Cauchy–Schwarz inequality and the assumption {\int_{{\mathbb R}^d} g(x)dx=0} give

\displaystyle \begin{array}{rcl} \left( \int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx \right)^2 &=& \left( \int_{{\mathbb R}^d} g(x)(-\log f_0(x) - h(f_0))dx \right)^2\\ &\le & \int_{{\mathbb R}^d} \frac{g(x)^2}{f_0(x)}dx \cdot \int_{{\mathbb R}^d} f_0(x)(-\log f_0(x) - h(f_0))^2dx \\ &=& V(f_0)\cdot \int_{{\mathbb R}^d} \frac{g(x)^2}{f_0(x)}dx. \end{array}

Therefore, the parametric lower bound for nonparametric entropy estimation is 

\displaystyle \inf_{\hat{h}} \sup_{f} \mathop{\mathbb E}_f(\hat{h} - h(f))^2 \ge \frac{1+o_n(1)}{n}\cdot V(f_0).
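
For a concrete example (not in the original notes), take {f_0} to be the standard normal density, so that {-\log f_0(X) = \frac{1}{2}\log(2\pi) + X^2/2}; then {h(f_0) = \frac{1}{2}\log(2\pi e)} and {V(f_0) = \text{Var}(X^2/2) = 1/2}. A quick Monte Carlo confirmation:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal(1_000_000)                    # samples from f0 = N(0,1)
neglog = 0.5 * np.log(2 * np.pi) + X**2 / 2           # -log f0(X)
print(neglog.mean(), 0.5 * np.log(2 * np.pi * np.e))  # h(f0)
print(neglog.var(), 0.5)                              # V(f0) = Var(X^2)/4 = 1/2
```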

4.3. Limitations of Classical Asymptotics 

The theorems from classical asymptotics can typically help to prove an error bound {\Theta(n^{-1/2})} with an explicit constant, and it is also known that these bounds are optimal and attained by the MLE. However, these approaches still suffer from several problems:

  1. Non-asymptotic vs. asymptotic: Asymptotic bounds are useful only in scenarios where the problem size remains fixed and the sample size grows to infinity, and there is no general guarantee of when we have entered the asymptotic regime (it may even require that {n\gg e^d}). In practice, essentially all recent problems are high-dimensional ones where the number of parameters is comparable to or even larger than the sample size (e.g., an over-parametrized neural network), and some key properties of the problem may be entirely obscured in the asymptotic regime. 
  2. Parametric vs. nonparametric: The results in classical asymptotics may not be helpful for a large class of nonparametric problems, the main obstacle being their infinite-dimensional nature. Although the parametric reduction is sometimes helpful (e.g., the entropy example), the parametric rate {\Theta(n^{-1/2})} is in general not attainable in nonparametric problems, and some other tools are necessary. For example, if we would like to estimate the density {f} at some point {x_0}, the worst-case test function will actually give a vacuous lower bound (which is infinity).
  3. Global vs. local: As the name of LAM suggests, the minimax lower bound here holds locally. However, the global structure of some problems may also be important, and the global minimax lower bound may be much larger than the supremum of local bounds over all possible points. For example, in Shannon entropy estimation, the bias actually dominates the problem and cannot be reflected in local methods.

To overcome these difficulties, we need to develop tools to establish non-asymptotic results for possibly high-dimensional or nonparametric problems, which is the focus of the rest of the lecture series. 

5. Bibliographic Notes 

The asymptotic theorems in Theorem 2 were first presented in Hájek (1970) and Hájek (1972), and we refer to Le Cam (1986), Le Cam and Yang (1990) and van der Vaart (2000) as excellent textbooks. Here the approach of using model distance to establish LAM is taken from Sections 4 and 6 of Liese and Miescke (2007); also see Le Cam (1972).

There is another line of approach to establish the asymptotic theorems. A key concept is contiguity, proposed by Le Cam (1960), which enables an asymptotic change of measure. Based on contiguity and the LAN condition, the distribution of any (regular) estimator under the local alternative can be evaluated. Then the convolution theorem can be shown, which helps to establish LAM; details can be found in van der Vaart (2000). The LAM theorem can also be established directly by computing the asymptotic Bayes risk under proper priors; see Section 6 of Le Cam and Yang (1990).

For parametric or nonparametric entropy estimation, we refer to recent papers (Jiao et al. (2015) and Wu and Yang (2016) for the discrete case, Berrett, Samworth and Yuan (2019) and Han et al. (2017) for the continuous case) and the references therein.

  1. Jaroslav Hájek, A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14.4 (1970): 323-330. 
  2. Jaroslav Hájek, Local asymptotic minimax and admissibility in estimation. Proceedings of the sixth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1972. 
  3. Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986. 
  4. Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990. 
  5. Aad W. van der Vaart, Asymptotic statistics. Vol. 3. Cambridge University Press, 2000. 
  6. Friedrich Liese and Klaus-J. Miescke, Statistical decision theory. Springer, New York, NY, 2007. 
  7. Lucien M. Le Cam, Limits of experiments. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972. 
  8. Lucien M. Le Cam, Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98 (1960). 
  9. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman, Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835-2885. 
  10. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720. 
  11. Thomas B. Berrett, Richard J. Samworth, and Ming Yuan, Efficient multivariate entropy estimation via {k}-nearest neighbour distances. The Annals of Statistics 47.1 (2019): 288-318. 
  12. Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu, Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141 (2017).

Lecture 3: Statistical Decision Theory: Model Distance and Equivalence


This lecture starts to talk about specific tools and ideas to prove information-theoretic lower bounds. We will temporarily restrict ourselves to statistical inference problems (which most lower bounds apply to), where the presence of randomness is a key feature in these problems. Since the observer cannot control the realizations of randomness, the information contained in the observations, albeit not necessarily in a discrete structure (e.g., those in Lecture 2), can still be limited. In this lecture and subsequent ones, we will introduce the reduction and hypothesis testing ideas to prove lower bounds of statistical inference, and these ideas will also be applied to other problems.

1. Basics of Statistical Decision Theory 

To introduce statistical inference problems, we first review some basics of statistical decision theory. Mathematically, let {(\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta\in\Theta})} be a collection of probability triplets indexed by {\theta}, and the observation {X} is a random variable following distribution {P_{\theta}} for an unknown parameter {\theta} in the known parameter set {\Theta}. The family {(P_\theta)_{\theta\in\Theta}} is typically called a model or experiment, for the parameter {\theta} can represent different model parameters or theories to explain the observation {X}. A decision rule, or a (randomized) estimator, is a transition kernel {\delta(x,da)} from {\mathcal{X}} to some action space {\mathcal{A}}. Let {L: \Theta\times \mathcal{A}\rightarrow {\mathbb R}_+} be a loss function, where {L(\theta,a)} represents the loss of using action {a} when the true parameter is {\theta}. A central quantity to measure the quality of a decision rule is the risk in the following definition.

Definition 1 (Risk) Under the above notations, the risk of the decision rule {\delta} under loss function {L} and the true parameter {\theta} is defined as

\displaystyle R_\theta(\delta) = \iint L(\theta,a)P_\theta(dx)\delta(x,da). \ \ \ \ \ (1)

Intuitively, the risk characterizes the expected loss (over the randomness in both the observation and the decision rule) of the decision rule when the true parameter is {\theta}. When {\delta(x,da)} is a point mass {\delta(a-T(x))} for some deterministic function {T:\mathcal{X}\rightarrow \mathcal{A}}, we will also call {T(X)} an estimator, and the risk in (1) becomes

\displaystyle R_\theta(T) = \int L(\theta,T(x)) P_\theta(dx) = \mathop{\mathbb E}_{\theta} L(\theta, T(X)). \ \ \ \ \ (2)
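
As a toy illustration of (2) (not in the original notes), the sketch below computes the risk functions of two estimators in the model {X\sim \mathcal{N}(\theta,1)} under the squared-error loss: the identity estimator has constant risk {1}, while the shrinkage estimator {T(X)=X/2} has risk {1/4 + \theta^2/4}, so neither dominates the other.

```python
import numpy as np

rng = np.random.default_rng(5)

def risk(T, theta, reps=200_000):
    # Monte Carlo version of (2) for X ~ N(theta, 1), squared-error loss
    X = theta + rng.standard_normal(reps)
    return np.mean((T(X) - theta)**2)

for theta in [0.0, 1.0, 3.0]:
    print(theta,
          risk(lambda x: x, theta),        # R_theta = 1 for all theta
          risk(lambda x: 0.5 * x, theta))  # R_theta = 1/4 + theta^2/4
```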

Example 1 In the linear regression model with random design, let the observations {(x_1,y_1), \cdots, (x_n,y_n)\in {\mathbb R}^p\times {\mathbb R}} be i.i.d. with {x_i \sim P_X} and {y_i|x_i\sim \mathcal{N}(x_i^\top \theta, \sigma^2)}. Here the parameter set {\Theta={\mathbb R}^p} is a finite-dimensional Euclidean space, and therefore we call this model parametric. The quantity of interest is {\theta}, and the loss function may be chosen to be the prediction error {L(\theta,\hat{\theta}) = \mathop{\mathbb E}_{\theta} (y-x^\top \hat{\theta})^2} of the linear estimator {f(x) = x^\top \hat{\theta}}.

Example 2 In density estimation model, let {X_1, \cdots, X_n} be i.i.d. drawn from some {1}-Lipschitz density {f} supported on {[0,1]}. Here the parameter set {\Theta} of the unknown {f} is the infinite-dimensional space of all possible {1}-Lipschitz functions on {[0,1]}, and we call this model non-parametric. The target may be to estimate the density {f} at a point, the entire density, or some functional of the density. In respective settings, the loss functions can be 

\displaystyle L_1(f,T) = |T - f(0)|, \quad L_2(f,T) = \int_0^1 (T(x) - f(x))^2dx, \quad L_3(f,T) = |T - \|f\|_1|.

Example 3 By allowing general action spaces and loss functions, the decision-theoretic framework can also incorporate some non-statistical examples. For instance, in stochastic optimization {\theta\in\Theta} may parameterize a class of convex Lipschitz functions {f_\theta: [-1,1]^d\rightarrow {\mathbb R}}, and {X} denotes the noisy observations of the gradients at the queried points. Then the action space {\mathcal{A}} may just be the entire domain {[-1,1]^d}, and the loss function {L} is the optimality gap defined as 

\displaystyle L(\theta, a) = f_\theta(a) - \min_{a^\star \in [-1,1]^d} f_\theta(a^\star).

In practice, one would like to find optimal decision rules for a given task. However, the risk is a function of {\theta} and it is hard to compare two risk functions directly. Hence, people typically map the risk functions into scalars and arrive at the following minimax and Bayesian paradigm. 

Definition 2 (Minimax and Bayes Decision Rule) The minimax decision rule is the decision rule {\delta} which minimizes the quantity {\max_{\theta\in\Theta} R_\theta(\delta)}. The Bayes decision rule under distribution {\pi(d\theta)} (called the prior distribution) is the decision rule {\delta} which minimizes the quantity {\int R_\theta(\delta)\pi(d\theta)}

It is typically hard to find the minimax decision rule in practice, while the Bayes decision rule admits a closed-form expression (although hard to compute in general). 

Proposition 3 The Bayes decision rule under prior {\pi} is given by the estimator 

\displaystyle T(x) \in \arg\min_{a\in\mathcal{A}} \int L(\theta,a)\pi(d\theta|x), \ \ \ \ \ (3)

where {\pi(d\theta|x)} denotes the posterior distribution of {\theta} under {\pi} (assuming the existence of regular posterior). 

Proof: Left as an exercise for the reader. \Box
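
As a concrete instance of (3) (not in the original notes), consider a two-point parameter set {\Theta=\{0,1\}} with {X\sim \text{Bern}(0.2)} under {\theta=0} and {X\sim \text{Bern}(0.7)} under {\theta=1}, a uniform prior, and the 0-1 loss; the Bayes rule is then the posterior mode.

```python
import numpy as np

prior = np.array([0.5, 0.5])
pmf = np.array([[0.8, 0.2],    # P(X=0 | theta=0), P(X=1 | theta=0)
                [0.3, 0.7]])   # P(X=0 | theta=1), P(X=1 | theta=1)

for x in [0, 1]:
    post = prior * pmf[:, x]
    post /= post.sum()             # posterior pi(d theta | x)
    # (3): the action minimizing the posterior expected 0-1 loss
    # is the posterior mode
    action = int(np.argmax(post))
    print(x, post, action)
```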

2. Deficiency and Model Distance 

The central target of statistical inference is to propose some decision rule for a given statistical model with small risks. Similarly, the counterpart in the lower bound is to prove that certain risks are unavoidable for any decision rules. Here to compare risks, we may either compare the entire risk function, or its minimax or Bayes version. In this lecture we will focus on the risk function, and many later lectures will be devoted to appropriate minimax risks. 

We start with the task of comparing two statistical models with the same parameter set {\Theta}. Intuitively, one may think that the model with a smaller noise level would be better than the other, e.g., the model {\mathcal{M}_1 = \{\mathcal{N}(\theta,1): \theta\in {\mathbb R} \}} should be better than {\mathcal{M}_2 = \{\mathcal{N}(\theta,2): \theta\in {\mathbb R} \}}. However, this criterion is bad for two reasons:

  1. There is no proper notion of noise for general (especially non-additive) statistical models; 
  2. Even if a natural notion of noise exists for certain models, it is not necessarily true that the model with smaller noise is always better. For example, let \displaystyle \mathcal{M}_1 = \{\text{Unif}\{\theta-1,\theta+1 \}: |\theta|\le 1\}, \quad \mathcal{M}_2 = \{\text{Unif}\{\theta-3,\theta+3 \}: |\theta|\le 1\}, so that the model {\mathcal{M}_1} has a smaller magnitude of additive noise. However, despite the larger noise magnitude, the model {\mathcal{M}_2} is actually noise-free, since one can decode perfect knowledge of {\theta} from the observation (a short sketch of the decoder follows this list).
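
The decoding step is elementary: for {|\theta|\le 1} the supports {\{\theta+3\}\subseteq [2,4]} and {\{\theta-3\}\subseteq [-4,-2]} never overlap, so a single observation from {\mathcal{M}_2} identifies {\theta} exactly. A minimal sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)

def decode(y):
    # the two supports are disjoint for |theta| <= 1, so theta is
    # recovered exactly from a single observation y = theta +/- 3
    return y - 3.0 if y >= 2.0 else y + 3.0

theta = rng.uniform(-1, 1)
y = theta + 3.0 * rng.choice([-1.0, 1.0])
print(theta, decode(y))   # identical: M_2 is noise-free
```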

To overcome the above difficulties, we introduce the idea of model reduction. Specifically, we aim to establish the fact that for any decision rule in one model, there is another decision rule in the other model which is uniformly no worse than the former rule. The idea of reduction appears in many fields, e.g., in P/NP theory it is sufficient to work out one NP-complete instance (e.g., circuit satisfiability) from scratch and establish all others by polynomial reduction. This reduction idea is made precise via the following definition. 

Definition 4 (Model Deficiency) For two statistical models {\mathcal{M} = (\mathcal{X}, \mathcal{F}, (P_{\theta})_{\theta\in \Theta})} and {\mathcal{N} = (\mathcal{Y}, \mathcal{G}, (Q_{\theta})_{\theta\in \Theta})}, we say {\mathcal{M}} is {\varepsilon}-deficient relative to {\mathcal{N}} if for any finite subset {\Theta_0\subseteq \Theta}, any finite action space {\mathcal{A}}, any loss function {L: \Theta_0\times \mathcal{A}\rightarrow [0,1]}, and any decision rule {\delta_{\mathcal{N}}} based on model {\mathcal{N}}, there exists some decision rule {\delta_{\mathcal{M}}} based on model {\mathcal{M}} such that

\displaystyle R_\theta(\delta_{\mathcal{M}}) \le R_\theta(\delta_{\mathcal{N}}) + \varepsilon, \qquad \forall \theta\in \Theta_0. \ \ \ \ \ (4)

Note that the definition of model deficiency does not involve the specific choice of the action space and loss function, and the finiteness of {\Theta_0} and {\mathcal{A}} in the definition is mainly for technical purposes. Intuitively speaking, {\mathcal{M}} is {\varepsilon}-deficient relative to {\mathcal{N}} if the entire risk function of some decision rule in {\mathcal{M}} is no worse than that of any given decision rule in {\mathcal{N}}, within an additive gap {\varepsilon}

Although Definition 4 gives strong performance guarantees for the {\varepsilon}-deficient model {\mathcal{M}}, it is difficult to verify the condition (4), i.e., to transform an arbitrary decision rule {\delta_\mathcal{N}} into some proper {\delta_\mathcal{M}}. A crucial observation is that it may be easier to transform between the models {\mathcal{M}} and {\mathcal{N}} directly, in particular when {\mathcal{N}} is a randomization of {\mathcal{M}}. For example, if there exists some stochastic kernel {\mathsf{K}: \mathcal{X} \rightarrow \mathcal{Y}} such that {Q_\theta = \mathsf{K}P_\theta} for all {\theta\in\Theta} (where { \mathsf{K}P_\theta} denotes the marginal distribution of the output of kernel {\mathsf{K}} with input distributed as {P_\theta}), then we may simply set {\delta_\mathcal{M} = \delta_\mathcal{N} \circ \mathsf{K}} to arrive at (4) with {\varepsilon=0}. The following theorem shows that model deficiency is in fact equivalent to approximate randomization.

Theorem 5 Model {\mathcal{M}} is {\varepsilon}-deficient with respect to {\mathcal{N}} if and only if there exists some stochastic kernel {\mathsf{K}: \mathcal{X} \rightarrow \mathcal{Y}} such that 

\displaystyle \sup_{\theta\in\Theta} \|Q_\theta - \mathsf{K}P_\theta \|_{\text{\rm TV}} \le \varepsilon,

where {\|P-Q\|_{\text{\rm TV}} := \frac{1}{2}\int |dP-dQ|} is the total variation distance between probability measures {P} and {Q}

Proof: The sufficiency part is easy. Given such a kernel {\mathsf{K}} and a decision rule {\delta_\mathcal{N}} based on model {\mathcal{N}}, we simply set {\delta_\mathcal{M} = \delta_\mathcal{N} \circ \mathsf{K}}, i.e., transmit the output through kernel {\mathsf{K}} and apply {\delta_\mathcal{N}}. Then for any {\theta\in\Theta}

\displaystyle \begin{array}{rcl} R_\theta(\delta_{\mathcal{M}}) - R_\theta(\delta_{\mathcal{N}}) &=& \iint L(\theta,a)\delta_\mathcal{N}(y,da) \left[\int P_\theta(dx)\mathsf{K}(dy|x)- Q_\theta(dy) \right] \\ &\le & \|Q_\theta - \mathsf{K}P_\theta \|_{\text{TV}} \le \varepsilon, \end{array}

for the loss function is non-negative and upper bounded by one. 

The necessity part is slightly more complicated, and for simplicity we assume that all {\Theta, \mathcal{X}, \mathcal{Y}} are finite (the general case requires proper limiting arguments). In this case, any decision rules {\delta_\mathcal{M}} or {\delta_\mathcal{N}}, loss functions {L} and priors {\pi(d\theta)} can be represented by a finite-dimensional vector. Given {\mathcal{A}} and {\delta_{\mathcal{N}}}, the condition (4) ensures that 

\displaystyle \sup_{L(\theta,a),\pi(d\theta)} \inf_{\delta_{\mathcal{M}}}\iint L(\theta,a)\pi(d\theta)\left[\int \delta_\mathcal{M}(x,da)P_\theta(dx) - \int \delta_\mathcal{N}(y,da)Q_\theta(dy)\right] \le \varepsilon. \ \ \ \ \ (5)

Since the LHS of (5) is bilinear in {L(\theta,a)\pi(d\theta)} and {\delta_\mathcal{M}(x,da)}, both of which range over convex sets (e.g., the domain for {M(\theta,a) := L(\theta,a)\pi(d\theta)} is exactly {\{M\in [0,1]^{\Theta\times \mathcal{A}}: \sum_\theta \|M(\theta, \cdot)\|_\infty \le 1 \}}), the minimax theorem allows us to swap the {\sup} and {\inf} in (5) to obtain that

\displaystyle \inf_{\delta_{\mathcal{M}}}\sup_{L(\theta,a),\pi(d\theta)}\iint L(\theta,a)\pi(d\theta)\left[\int \delta_\mathcal{M}(x,da)P_\theta(dx) - \int \delta_\mathcal{N}(y,da)Q_\theta(dy)\right] \le \varepsilon. \ \ \ \ \ (6)

By evaluating the inner supremum, (6) implies the existence of some {\delta_\mathcal{M}^\star} such that 

\displaystyle \sup_{\theta\in\Theta} \frac{1}{2}\int_{\mathcal{A}} \left| \int_{\mathcal{X}} \delta_\mathcal{M}^\star(x,da)P_\theta(dx) - \int_{\mathcal{Y}} \delta_\mathcal{N}(y,da)Q_\theta(dy)\right| \le \varepsilon. \ \ \ \ \ (7)

Finally, choosing {\mathcal{A}=\mathcal{Y}} and {\delta_\mathcal{N}(y,da) = 1(y=a)} in (7), the corresponding {\delta_\mathcal{M}^\star} is the desired kernel {\mathsf{K}}. \Box
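
The sufficiency direction of Theorem 5 is easy to see concretely. In the sketch below (not from the original notes), the kernel {\mathsf{K}} adds independent {\mathcal{N}(0,1)} noise, so that {\mathsf{K}P_\theta = \mathcal{N}(\theta,2)}; hence {\{\mathcal{N}(\theta,1)\}_\theta} is {0}-deficient relative to {\{\mathcal{N}(\theta,2)\}_\theta}, and a crude histogram estimate confirms that the total variation distance is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, reps = 1.3, 500_000

X = theta + rng.standard_normal(reps)                  # X ~ N(theta, 1)
KX = X + rng.standard_normal(reps)                     # kernel K: add N(0,1) noise
Y = theta + np.sqrt(2.0) * rng.standard_normal(reps)   # Y ~ N(theta, 2)

# crude plug-in estimate of TV(law(KX), law(Y)) via histograms
bins = np.linspace(theta - 5, theta + 5, 60)
p, _ = np.histogram(KX, bins, density=True)
q, _ = np.histogram(Y, bins, density=True)
print(0.5 * np.sum(np.abs(p - q)) * np.diff(bins)[0])  # close to 0
```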

Based on the notion of deficiency, we are ready to define the distance between statistical models, also known as Le Cam’s distance.

Definition 6 (Le Cam’s Distance) For two statistical models {\mathcal{M}} and {\mathcal{N}} with the same parameter set {\Theta}, Le Cam’s distance {\Delta(\mathcal{M},\mathcal{N})} is defined as the infimum of {\varepsilon\ge 0} such that {\mathcal{M}} is {\varepsilon}-deficient relative to {\mathcal{N}}, and {\mathcal{N}} is {\varepsilon}-deficient relative to {\mathcal{M}}

It is a simple exercise to show that Le Cam’s distance is a pseudo-metric in the sense that it is symmetric and satisfies the triangle inequality. The main importance of Le Cam’s distance is that it helps to establish equivalence between some statistical models, and people are typically interested in the case where {\Delta(\mathcal{M},\mathcal{N})=0} or {\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0}. The main idea is to use randomization (i.e., Theorem 5) to obtain an upper bound on Le Cam’s distance, and then apply Definition 4 to deduce useful results (e.g., to carry over an asymptotically optimal procedure in one model to other models).

In the remainder of this lecture, I will give some examples of models whose distance is zero or asymptotically zero. In the next lecture it will be shown that regular models will always be close to some Gaussian location model asymptotically, and thereby the classical asymptotic theory of statistics can be established. 

3. Equivalence between Models 

3.1. Sufficiency 

We first examine the case where {\Delta(\mathcal{M},\mathcal{N})=0}. By Theorem 5, models {\mathcal{M}} and {\mathcal{N}} are mutual randomizations. In the special case where {Y=T(X)\sim Q_\theta} is a deterministic function of {X\sim P_\theta} (thus {Q_\theta=P_\theta\circ T^{-1}} is the push-forward measure of {P_\theta} through {T}), we have the following result. 

Theorem 7 Under the above setting, {\Delta(\mathcal{M},\mathcal{N})=0} if and only if {\theta-Y-X} forms a Markov chain.

Note that the Markov condition {\theta-Y-X} is the usual definition of sufficient statistics, and also gives the well-known Rao–Blackwell factorization criterion for sufficiency. Hence, sufficiency is in fact a special case of model equivalence, and deficiency can be thought of as approximate sufficiency.

3.2. Equivalence between Multinomial and Poissonized Models 

Consider a discrete probability vector {P=(p_1,\cdots,p_k)} with {p_i\ge 0, \sum_{i=1}^k p_i=1}. A widely-used model in practice is the multinomial model {\mathcal{M}_n}, which models the i.i.d. sampling process and draws i.i.d. observations {X_1,\cdots, X_n\sim P}. However, a potential difficulty in handling multinomial models is that the empirical frequencies {\hat{p}_1, \cdots, \hat{p}_k} of symbols are dependent, which makes the analysis annoying. To overcome this difficulty, a common procedure is to consider a Poissonized model {\mathcal{N}_n}, where we first draw a Poisson random variable {N\sim \text{Poisson}(n)} and then observe i.i.d. {X_1,\cdots,X_N\sim P}. Due to the nice properties of Poisson random variables, the empirical frequencies now follow independent (scaled) Poisson distributions.

The next theorem shows that the multinomial and Poissonized models are asymptotically equivalent, which means that it actually does no harm to consider the more convenient Poissonized model for analysis, at least asymptotically. In later lectures I will also show a non-asymptotic result between these two models. 

Theorem 8 For fixed {k}, {\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0}

Proof: We only show that {\mathcal{M}_n} is {\varepsilon_n}-deficient relative to {\mathcal{N}_n}, with {\lim_{n\rightarrow\infty} \varepsilon_n=0}, where the other direction is analogous. By Theorem 5, it suffices to show that {\mathcal{N}_n} is an approximate randomization of {\mathcal{M}_n}. The randomization procedure is as follows: based on the observations {X_1,\cdots,X_n} under the multinomial model, let {P_n=(\hat{p}_1,\cdots,\hat{p}_k)} be the vector of empirical frequencies. Next draw an independent random variable {N\sim \text{Poisson}(n)}. If {N\le n}, let {(X_1,\cdots,X_N)} be the output of the kernel. Otherwise, we generate i.i.d. samples {X_{n+1}', \cdots, X_N'\sim P_n}, and let {(X_1,\cdots,X_n,X_{n+1}',\cdots,X_N')} be the output. We remark that it is important that the above randomization procedure does not depend on the unknown {P}.

Let {\mathcal{N}_P, \mathcal{N}_P'} be the distribution of the Poissonized and randomized model under true parameter {P}, respectively. Now it is easily shown that 

\displaystyle \|\mathcal{N}_P- \mathcal{N}_P' \|_{\text{TV}} = \mathop{\mathbb E}_m \mathop{\mathbb E}_{X^n} \|P_n^{\otimes m} - P^{\otimes m} \|_{\text{TV}}, \ \ \ \ \ (8)

where {m:=(N-n)_+}, {P^{\otimes m}} denotes the {m}-fold product of {P}, {\mathop{\mathbb E}_m} takes the expectation w.r.t. {m}, and {\mathop{\mathbb E}_{X^n}} takes the expectation w.r.t. random samples {X^n\sim P}. To upper bound the total variation distance in (8), we shall need the following lemma.

Lemma 9 Let {D_{\text{\rm KL}}(P\|Q) = \int dP\log \frac{dP}{dQ}} and {\chi^2(P,Q) = \int \frac{(dP-dQ)^2}{dQ}} be the KL divergence and {\chi^2}-divergence, respectively. Then 

\displaystyle 2\|P-Q \|_{\text{\rm TV}}^2 \le D_{\text{\rm KL}}(P\|Q) \le \chi^2(P,Q).

The proof of Lemma 9 will be given in later lectures when we talk about joint ranges of divergences. Then by Lemma 9 and Jensen’s inequality, 

\displaystyle \begin{array}{rcl} \mathop{\mathbb E}_{X^n} \|P_n^{\otimes m} - P^{\otimes m} \|_{\text{TV}} & \le & \mathop{\mathbb E}_{X^n}\sqrt{\frac{1}{2} D_{\text{KL}}(P_n^{\otimes m}\|P^{\otimes m} ) }\\ &=& \mathop{\mathbb E}_{X^n}\sqrt{\frac{m}{2} D_{\text{KL}}(P_n\|P ) } \\ &\le& \mathop{\mathbb E}_{X^n}\sqrt{\frac{m}{2} \chi^2(P_n,P ) }\\ &\le& \sqrt{\frac{m}{2} \mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) }. \end{array}

Since by simple algebra, 

\displaystyle \mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) = \sum_{i=1}^k \frac{\mathop{\mathbb E}_{X^n} (\hat{p}_i-p_i)^2 }{p_i} = \sum_{i=1}^k \frac{p_i(1-p_i)}{np_i} = \frac{k-1}{n},

then by (8) we obtain that 

\displaystyle \|\mathcal{N}_P- \mathcal{N}_P' \|_{\text{TV}} \le \mathop{\mathbb E}_m \sqrt{\frac{m(k-1)}{2n}} \le \sqrt{\frac{k-1}{2n}}\cdot (\mathop{\mathbb E} m^2)^{\frac{1}{4}} \le \sqrt{\frac{k-1}{2\sqrt{n}}},

which goes to zero uniformly in {P} as {n\rightarrow\infty}, as desired. \Box
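
The key identity {\mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) = (k-1)/n} used above is easy to confirm by simulation. A minimal sketch (with an arbitrary choice of {P} and {n}):

```python
import numpy as np

rng = np.random.default_rng(8)
P = np.array([0.4, 0.3, 0.2, 0.1])       # k = 4 symbols
n, reps = 1_000, 50_000

Phat = rng.multinomial(n, P, size=reps) / n   # empirical frequencies
chi2 = ((Phat - P)**2 / P).sum(axis=1)        # chi^2(P_n, P)
print(chi2.mean(), (len(P) - 1) / n)          # both equal (k-1)/n
```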

3.3. Equivalence between Nonparametric Regression and Gaussian White Noise Models 

A well-known problem in nonparametric statistics is the nonparametric regression: 

\displaystyle y_i = f\left(\frac{i}{n}\right) + \sigma\xi_i, \qquad i=1,\cdots,n, \quad \xi_i\overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1), \ \ \ \ \ (9)

where the underlying regression function {f} is unknown, and {\sigma>0} is some noise level. Typically, the statistical goal is to recover the function {f} at some point or globally, and some smoothness conditions are necessary to perform this task. A typical assumption is that {f\in \mathcal{H}^s(L)} belongs to some Hölder ball, where

\displaystyle \mathcal{H}^s(L) := \left\{f\in C[0,1]: \sup_{x\neq y}\frac{|f^{(m)}(x) - f^{(m)}(y)| }{|x-y|^\alpha} \le L\right\},

where {s=m+\alpha}, with {m\in {\mathbb N}} and {\alpha\in (0,1]}, denotes the smoothness parameter.

There is also a continuous version of (9) called the Gaussian white noise model, where a process {(Y_t)_{t\in [0,1]}} satisfying the following stochastic differential equation is observed: 

\displaystyle dY_t = f(t)dt + \frac{\sigma}{\sqrt{n}}dB_t, \qquad t\in [0,1], \ \ \ \ \ (10)

where {(B_t)_{t\in [0,1]}} is the standard Brownian motion. Compared with the regression model in (9), the white noise model in (10) gets rid of the quantization issue of {[0,1]} and is therefore easier to analyze. Note that in both models {n} is effectively the sample size.

Let {\mathcal{M}_n, \mathcal{N}_n} be the regression and white noise models with known parameters {(\sigma,L)} and the parameter set {f\in \mathcal{H}^s(L)}, respectively. The main result in this section is that, when {s>1/2}, these models are asymptotically equivalent.

Theorem 10 If {s>1/2}, we have {\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0}

Proof: Consider another Gaussian white noise model {\mathcal{N}_n^\star} where the only difference is to replace {f} in (10) by {f^\star} defined as 

\displaystyle f^\star(t) = \sum_{i=1}^n f\left(\frac{i}{n}\right) 1\left(\frac{i-1}{n}\le t<\frac{i}{n}\right), \qquad t\in [0,1].

Note that under the same parameter {f}, we have 

\displaystyle \begin{array}{rcl} D_{\text{KL}}(P_{Y_{[0,1]}^\star} \| P_{Y_{[0,1]}}) &=& \frac{n}{2\sigma^2}\int_0^1 (f(t) - f^\star(t))^2dt\\ & =& \frac{n}{2\sigma^2}\sum_{i=1}^n \int_{(i-1)/n}^{i/n} (f(t) - f(i/n))^2dt \\ & \le & \frac{L^2}{2\sigma^2}\cdot n^{1-2(s\wedge 1)}, \end{array}

which goes to zero uniformly in {f} as {n\rightarrow\infty}. Therefore, by Theorem 5 and Lemma 9, we have {\Delta(\mathcal{N}_n, \mathcal{N}_n^\star)\rightarrow 0}. On the other hand, in the model {\mathcal{N}_n^\star} the likelihood ratio between the signal distribution {P_{Y^\star}} and the pure noise distribution {P_{Z^\star}} is 

\displaystyle \begin{array}{rcl} \frac{dP_{Y^\star}}{dP_{Z^\star}}((Y_t^\star)_{t\in [0,1]}) &=& \exp\left(\frac{n}{2\sigma^2}\left(\int_0^1 2f^\star(t)dY_t^\star-\int_0^1 f^\star(t)^2 dt \right)\right) \\ &=& \exp\left(\frac{n}{2\sigma^2}\left(\sum_{i=1}^n 2f(i/n)(Y_{i/n}^\star - Y_{(i-1)/n}^\star) -\int_0^1 f^\star(t)^2 dt \right)\right). \end{array}

As a result, under model {\mathcal{N}_n^\star}, there is a Markov chain {f \rightarrow (n(Y_{i/n}^\star - Y_{(i-1)/n}^\star))_{i\in [n]}\rightarrow (Y_t^\star)_{t\in [0,1]}}. Since under the same parameter {f}, {(n(Y_{i/n}^\star - Y_{(i-1)/n}^\star))_{i\in [n]}} under {\mathcal{N}_n^\star} is identically distributed as {(y_i)_{i\in [n]}} under {\mathcal{M}_n}, by Theorem 7 we have exact sufficiency and conclude that {\Delta(\mathcal{M}_n, \mathcal{N}_n^\star)=0}. Then the rest follows from the triangle inequality. \Box
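
To make the sufficiency step concrete, here is a small numerical sketch in Python (the choices of {f}, {\sigma} and {n} below are arbitrary and purely illustrative, not part of the construction): it generates the increments of the process in {\mathcal{N}_n^\star} over the cells {[(i-1)/n, i/n)} and rescales them by {n}, which reproduces exactly the law of the regression observations in (9). 

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 0.5
f = lambda t: np.sin(2 * np.pi * t)   # an arbitrary illustrative choice of f
grid = np.arange(1, n + 1) / n

# Increments of dY_t = f*(t) dt + (sigma / sqrt(n)) dB_t over [(i-1)/n, i/n),
# where f* is piecewise constant and equals f(i/n) on the i-th cell.
dB = rng.normal(scale=np.sqrt(1.0 / n), size=n)   # Brownian increments
dY = f(grid) / n + (sigma / np.sqrt(n)) * dB

# Rescaled increments: n * dY = f(i/n) + sigma * N(0,1), i.e., the law of (9).
y = n * dY
print(np.std(y - f(grid)))   # should be close to sigma = 0.5
```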

3.4. Equivalence between Density Estimation and Gaussian White Noise Models 

Another widely-used model in nonparametric statistics is the density estimation model, where samples {X_1,\cdots,X_n} are i.i.d. drawn from some unknown density {f}. Typically some smoothness condition is also necessary for the density, and we again assume that {f\in \mathcal{H}^s(L)} belongs to the Hölder ball. 

Compared with the previous results, a slightly more involved result is that the density estimation model, albeit with a seemingly different form, is also asymptotically equivalent to a proper Gaussian white noise model. However, here the Gaussian white noise model should take the following different form: 

\displaystyle dY_t = \sqrt{f(t)}dt + \frac{1}{2\sqrt{n}}dB_t, \qquad t\in [0,1]. \ \ \ \ \ (11)

In other words, in nonparametric statistics the problems of density estimation, regression and estimation in Gaussian white noise are all asymptotically equivalent, under certain smoothness conditions. 

Let {\mathcal{M}_n, \mathcal{N}_n} be the density estimation model and the Gaussian white noise model in (11), respectively. The main result is summarized in the following theorem. 

Theorem 11 If {s>1/2} and the density {f} is bounded away from zero everywhere, then {\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0}. 

Proof: Instead of the original density estimation model, we consider a Poissonized sampling model {\mathcal{M}_{n,P}}, where the observation under {\mathcal{M}_{n,P}} is a Poisson process {(Z_t)_{t\in [0,1]}} on {[0,1]} with intensity {nf(t)}. Similar to the proof of Theorem 8, we have {\Delta(\mathcal{M}_n, \mathcal{M}_{n,P})\rightarrow 0}, and it remains to show that {\Delta(\mathcal{N}_n, \mathcal{M}_{n,P})\rightarrow 0}. 

Fix an equally spaced grid {t_0=0,t_1, t_2, \cdots, t_m=1} in {[0,1]} with {m=n^{1-\varepsilon}}, where {\varepsilon>0} is a small constant depending only on {s}. Next we introduce two new models {\mathcal{M}_{n,P}^\star} and {\mathcal{N}_n^\star}, where the only difference is that the parameter {f} is replaced by {f^\star} defined as 

\displaystyle f^\star(t) = \sum_{i=1}^m f(t_i) 1\left(t_{i-1}\le t<t_i\right), \qquad t\in [0,1].

As long as {2s(1-\varepsilon)>1}, the same arguments in the proof of Theorem 10 can be applied to arrive at {\Delta(\mathcal{M}_{n,P}^\star, \mathcal{M}_{n,P}), \Delta(\mathcal{N}_{n}^\star, \mathcal{N}_{n})\rightarrow 0} (for the white noise model, the assumption that {f} is bounded away from zero ensures the smoothness of {\sqrt{f}}). Hence, it further suffices to focus on the new models and show that {\Delta(\mathcal{M}_{n,P}^\star, \mathcal{N}_n^\star)\rightarrow 0}. An interesting observation is that under the model {\mathcal{M}_{n,P}^\star}, the vector {\mathbf{Z}=(Z_1,\cdots,Z_m)} with 

\displaystyle Z_i = \sum_{j=1}^N 1(t_{i-1}\le X_j<t_i) \sim \text{Poisson}\left(n^\varepsilon f(t_i) \right), \quad i\in [m]

is sufficient, where {N\sim \text{Poisson}(n)} denotes the Poissonized sample size. Moreover, under the model {\mathcal{N}_n^\star}, the vector {\mathbf{Y}=(Y_1,\cdots,Y_m)} with 

\displaystyle Y_i = \sqrt{nm}(Y_{t_i} - Y_{t_{i-1}}) \sim \mathcal{N}\left(\sqrt{n^\varepsilon f(t_i)}, \frac{1}{4}\right), \quad i\in [m]

is also sufficient. Further, all entries of {\mathbf{Y}} and {\mathbf{Z}} are mutually independent. Hence, the ultimate goal is to find mutual randomizations between {\mathbf{Y}} and {\mathbf{Z}} for {f\in \mathcal{H}^s(L)}

To do so, a first attempt would be to find a bijective mapping {Z_i \leftrightarrow Y_i} independently for each {i}. However, this approach would lose useful information from the neighbors, as we know that {f(t_i)\approx f(t_{i+1})} thanks to the smoothness of {f}. For example, we have {Z_1|Z_1+Z_2 \sim \text{Binomial}(Z_1+Z_2, p)} with {p = \frac{f(t_1)}{f(t_1) + f(t_2)}\approx \frac{1}{2}}, and {Y_1 - Y_2\sim \mathcal{N}(\mu, \frac{1}{2})} with {\mu = n^{\varepsilon/2}(\sqrt{f(t_1)} - \sqrt{f(t_2)})\approx 0}. Motivated by this fact, we represent {\mathbf{Z}} and {\mathbf{Y}} in the following bijective way (assume that {m} is even): 

\displaystyle \begin{array}{rcl} \mathbf{Z}' &=& (Z_1, Z_1 + Z_2, Z_3, Z_3 + Z_4, \cdots, Z_{m-1}, Z_{m-1} + Z_m), \\ \mathbf{Y}' &=& (Y_1-Y_2, Y_1+Y_2, \cdots, Y_{m-1}-Y_m, Y_{m-1}+Y_m). \end{array}

Since {(Z_1+Z_2,Z_3+Z_4,\cdots,Z_{m-1}+Z_m)} is again a vector of independent Poisson random variables, we may repeat the above transformation on this new vector; similarly, the vector of Gaussian sums in {\mathbf{Y}'} consists of independent Gaussians. Hence, at each iteration we may leave half of the components unchanged, and apply the above transformations to the other half. We repeat the iteration for {\log_2 \sqrt{n}} times (assuming {\sqrt{n}} is a power of {2}), so that finally we arrive at a vector of length {m/\sqrt{n} = n^{1/2-\varepsilon}} consisting of sums. Let {\mathbf{Z}^{(1)}} (resp. {\mathbf{Y}^{(1)}}) be the final vector of sums, and {\mathbf{Z}^{(2)}} (resp. {\mathbf{Y}^{(2)}}) be the vector of remaining entries which are left unchanged at some iteration. 

Remark 1 Experienced readers may have noticed that these are the wavelet coefficients under the Haar wavelet basis, where superscripts {1} and {2} stand for father and mother wavelets, respectively. 
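
The recursive structure above is easy to mirror in code. The following Python sketch (purely illustrative) performs the pairwise rounds on a Gaussian-side vector, setting aside the differences and passing the sums on to the next round; on the Poisson side one would instead set aside the first entry {Z_1} of each pair. 

```python
import numpy as np

def pairwise_rounds(v, rounds):
    """Haar-style recursion: at each round, set aside one entry per pair
    (here the difference, as on the Gaussian side) and pass the pairwise
    sums on to the next round."""
    kept = []
    for _ in range(rounds):
        kept.append(v[0::2] - v[1::2])   # entries left unchanged afterwards
        v = v[0::2] + v[1::2]            # sums fed into the next round
    return v, kept                       # (Y^{(1)}, the levels of Y^{(2)})

v = np.random.default_rng(0).normal(size=16)
sums, kept = pairwise_rounds(v, rounds=2)   # lengths: 4 and [8, 4]
```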

Next we are ready to describe the randomization procedure. For notational simplicity we will write {Z_1+Z_2} as a representative example of an entry in {\mathbf{Z}^{(1)}}, and write {Z_1} as a representative example of an entry in {\mathbf{Z}^{(2)}}. 

For entries in {\mathbf{Z}^{(1)}}, note that by the delta method, for {Z\sim \text{Poisson}(\lambda)} the random variable {\sqrt{Z}} is approximately distributed as {\mathcal{N}(\sqrt{\lambda},1/4)} (in fact, the square root is the variance-stabilizing transformation for Poisson random variables). The exact transformation is then given by 

\displaystyle Z_1 + Z_2 \mapsto \text{sign}(Z_1 + Z_2 +U)\cdot \sqrt{|Z_1 + Z_2 + U|}, \ \ \ \ \ (12)

where {U\sim \text{Uniform}([-1/2,1/2])} is an independent auxiliary variable. The mapping (12) is one-to-one and can thus be inverted as well. 
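
As a quick sanity check, the following Python snippet (the intensity {\lambda} below is an arbitrary stand-in for {n^\varepsilon(f(t_1)+f(t_2))}) confirms numerically that the output of (12) is approximately {\mathcal{N}(\sqrt{\lambda}, 1/4)}. 

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 200.0                                  # stand-in for n^eps * (f(t_1)+f(t_2))
z = rng.poisson(lam, size=100_000)           # plays the role of Z_1 + Z_2
u = rng.uniform(-0.5, 0.5, size=z.size)      # auxiliary randomization
w = np.sign(z + u) * np.sqrt(np.abs(z + u))  # the map (12)
print(w.mean(), np.sqrt(lam))                # both close to 14.14
print(w.var(), 0.25)                         # variance stabilized near 1/4
```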

For entries in {\mathbf{Z}^{(2)}}, we aim to use quantile transformations to convert {\text{Binomial}(Z_1+Z_2, 1/2)} to {\mathcal{N}(0,1/2)}. For {k\ge 0}, let {F_k} be the CDF of {\text{Binomial}(k, 1/2)}, and {\Phi} be the CDF of {\mathcal{N}(0,1)}. Then the one-to-one quantile transformation is given by 

\displaystyle Z_1 \mapsto \frac{1}{\sqrt{2}}\Phi^{-1}(F_{Z_1+Z_2}(Z_1+U)), \ \ \ \ \ (13)

where again {U\sim \text{Uniform}([-1/2,1/2])} is an independent auxiliary variable. The output of (13) is expected to be close in distribution to {Y_1-Y_2}, and the overall transformation is again invertible. 
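
The quantile step can also be checked numerically. The sketch below (Python with scipy; the two intensities are arbitrary illustrative values) realizes the smoothed CDF value {F_{Z_1+Z_2}(Z_1+U)} via a randomized probability integral transform, which is one standard way to implement it (not necessarily the exact construction in the literature), and compares the output of (13) with the target law of {Y_1-Y_2}. 

```python
import numpy as np
from scipy.stats import binom, norm

rng = np.random.default_rng(2)
lam1, lam2 = 210.0, 190.0              # stand-ins for n^eps * f(t_1), n^eps * f(t_2)
z1 = rng.poisson(lam1, size=100_000)   # plays Z_1
z2 = rng.poisson(lam2, size=100_000)   # plays Z_2
k = z1 + z2
# Randomized PIT under the reference Binomial(k, 1/2), then Gaussian quantile:
# a smoothed realization of F_{Z_1+Z_2}(Z_1+U), followed by Phi^{-1}/sqrt(2).
u = rng.uniform(0.0, 1.0, size=z1.size)
v = binom.cdf(z1 - 1, k, 0.5) + u * binom.pmf(z1, k, 0.5)
out = norm.ppf(v) / np.sqrt(2)
print(out.mean(), np.sqrt(lam1) - np.sqrt(lam2))   # both close to 0.71
print(out.var(), 0.5)                              # close to 1/2
```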

The approximation properties of these transformations are summarized in the following theorem. 

Theorem 12 Sticking to the specific examples of {Z_1} and {Z_1 + Z_2}, let {P_1, P_2} be the respective distributions of the RHS of (12) and (13), and let {Q_1, Q_2} be the respective distributions of {Y_1 + Y_2} and {Y_1 - Y_2}. Then we have 

\displaystyle \begin{array}{rcl} H^2(P_1, Q_1) & \le & \frac{C}{n^\varepsilon (f(t_1) + f(t_2))}, \\ H^2(P_2, Q_2) & \le & C\left(\frac{f(t_1)-f(t_2)}{f(t_1)+f(t_2)} \right)^2 + Cn^\varepsilon \left(\frac{f(t_1)-f(t_2)}{f(t_1)+f(t_2)} \right)^4. \end{array}

where {C>0} is some universal constant, and {H^2(P,Q) := \int (\sqrt{dP}-\sqrt{dQ})^2} denotes the squared Hellinger distance. 

The proof of Theorem 12 is purely probabilistic and quite involved, and is omitted here. We now apply Theorem 12 to the vector {\mathbf{Z}^{(1)}} of length {m/\sqrt{n}}, each component of which is a sum over {\sqrt{n}} cells whose intensities are bounded away from zero. Consequently, letting {\mathsf{K}} be the overall transition kernel of the randomization, the tensorization inequality {H^2(\otimes_i P_i, \otimes_i Q_i)\le \sum_i H^2(P_i,Q_i)} gives 

\displaystyle H^2(\mathsf{K}P_{\mathbf{Z}^{(1)}}, P_{\mathbf{Y}^{(1)}}) = O\left( \frac{m}{\sqrt{n}}\cdot \frac{1}{n^\varepsilon \cdot n^{1/2}} \right) = O(n^{-2\varepsilon}) \rightarrow 0.

As for the vector {\mathbf{Z}^{(2)}}, the components lie in {\ell_{\max} := \log_2 \sqrt{n}} different levels. At level {\ell\in [\ell_{\max}]}, the spacing of the grid becomes {n^{-1+\varepsilon}\cdot 2^{\ell}}, and there are {m\cdot 2^{-\ell}} elements. Also, we have {(f(t_1)-f(t_2))/(f(t_1)+f(t_2))=O(n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})} at the {\ell}-th level, with {s':= s\wedge 1}. Consequently, 

\displaystyle \begin{array}{rcl} H^2(\mathsf{K}P_{\mathbf{Z}^{(2)}}, P_{\mathbf{Y}^{(2)}}) &\le & \sum_{\ell=1}^{\ell_{\max}} m2^{-\ell}\cdot O\left((n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})^2 + n^\varepsilon 2^\ell\cdot (n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})^4 \right) \\ &= & O\left(n^{-(1-\varepsilon)(2s'-1)}\cdot 2^{(2s'-1)\ell_{\max}} + n\cdot (n^{-1+\varepsilon}2^{\ell_{\max}})^{4s'} \right) \\ &=& O\left(n^{-(1/2-\varepsilon)(2s'-1)} + n^{1-2s'(1-2\varepsilon)} \right). \end{array}

Since {s'>1/2}, we may choose {\varepsilon} to be sufficiently small (i.e., {2s'(1-2\varepsilon)>1}) to make {H^2(\mathsf{K}P_{\mathbf{Z}^{(2)}}, P_{\mathbf{Y}^{(2)}}) = o(1)}. The proof is complete. \Box

4. Bibliographic Notes 

The statistical decision theory framework dates back to Wald (1950), and is now standard material in graduate courses in statistics. There are many excellent textbooks on this topic, e.g., Lehmann and Casella (2006) and Lehmann and Romano (2006). The concept of model deficiency is due to Le Cam (1964), where the randomization criterion (Theorem 5) was proved. The present form is taken from Torgersen (1991). We also refer to the excellent monographs by Le Cam (1986) and Le Cam and Yang (1990). 

The asymptotic equivalence between nonparametric models has been studied in a series of papers since the 1990s. The equivalence between nonparametric regression and Gaussian white noise models (Theorem 10) was established in Brown and Low (1996), where both fixed and random designs were studied. It was also shown in a follow-up work (Brown and Zhang 1998) that these models are non-equivalent if {s\le 1/2}. The equivalence of the density estimation model and others (Theorem 11) was established in Brown et al. (2004), and we also refer to a recent work (Ray and Schmidt-Hieber 2016) which relaxed the crucial assumption that the density is bounded away from zero. Poisson approximation or Poissonization is a well-known technique widely used in probability theory, statistics, and theoretical computer science, and the current treatment is essentially taken from Brown et al. (2004). 

  1. Abraham Wald, Statistical decision functions. Wiley, 1950. 
  2. Erich L. Lehmann and George Casella, Theory of point estimation. Springer Science & Business Media, 2006. 
  3. Erich L. Lehmann and Joseph P. Romano, Testing statistical hypotheses. Springer Science & Business Media, 2006. 
  4. Lucien M. Le Cam, Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics (1964): 1419-1455. 
  5. Erik Torgersen, Comparison of statistical experiments. Cambridge University Press, 1991. 
  6. Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986. 
  7. Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990. 
  8. Lawrence D. Brown and Mark G. Low, Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics 24.6 (1996): 2384-2398. 
  9. Lawrence D. Brown and Cun-Hui Zhang, Asymptotic nonequivalence of nonparametric experiments when the smoothness index is {1/2}. The Annals of Statistics 26.1 (1998): 279-287. 
  10. Lawrence D. Brown, Andrew V. Carter, Mark G. Low, and Cun-Hui Zhang, Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. The Annals of Statistics 32.5 (2004): 2074-2097. 
  11. Kolyan Ray, and Johannes Schmidt-Hieber, The Le Cam distance between density estimation, Poisson processes and Gaussian white noise. arXiv preprint arXiv:1608.01824 (2016).

Lecture 2: Importance of Problem Structures

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

This lecture continues the introduction of information-theoretic lower bounds, still without formally developing the tools. The emphasis of this lecture is that the dependence of the fundamental limits on the problem structure may be very subtle: the results may change significantly even with tiny changes in the problem setting. Hence, one should be very careful when applying known lower bounds to similar problems, and it is vital to fully understand the specific problem structure. We will use a specific example of communication complexity from the theoretical computer science literature to illustrate this phenomenon. 

1. Equality Evaluation: Deterministic Communication Complexity 

Alice and Bob hold vectors {x,y \in \{0,1\}^n} respectively, and they would like to check whether {x=y} or not. However, they live very far from each other, and only limited resources of communication are available. What is the minimum number of communication bits to perform this task reliably, in the sense that they always find the correct answer for any pair {(x,y)}? In other words, what is the deterministic communication complexity of the distributed computing problem for the equality function {f(x,y)=1(x=y)}? 

To answer this question, we need to formally define all possible communication protocols. Here we define a general class of blackboard communication protocols. 

Definition 1 (Blackboard Communication Protocol) A blackboard communication protocol over domain {X\times Y} with range {Z} is a binary tree where each internal node {v} is labeled by a function {a_v: X\rightarrow \{0,1\}} or by a function {b_v: Y\rightarrow \{0,1\}}, and each leaf is labeled by an element {z\in Z}

The value of the protocol on input {(x,y)} is the label of the leaf reached by starting from the root and walking on the tree using the following rule: at each internal node labeled by {a_v}, walk left if {a_v(x)=0} and right if {a_v(x)=1}; at each internal node labeled by {b_v}, walk left if {b_v(y)=0} and right if {b_v(y)=1}. The cost of the protocol is the height of the tree. 

Intuitively speaking, there is a blackboard in front of both Alice and Bob in a blackboard communication protocol on which they can write binary bits, and this process continues sequentially where later bits can depend on the entire history. The formal definition above simply uses a binary tree to keep track of the history, and uses functions {a_v, b_v} to illustrate who will be writing on the blackboard at stage {v}. The communication cost of a given protocol is simply the maximum length of the message written on the blackboard, where the maximum is taken over all possible input pairs {(x,y)\in X\times Y}

Now the question we are asking is: among all blackboard communication protocols which always give the correct evaluation of the equality function, what is the minimum communication cost? This is also known as the deterministic communication complexity, where “deterministic” here means zero-error. 

To prove a lower bound on the deterministic complexity, we need to make the following key observation: 

Lemma 2 (Copy-paste Property) If a blackboard communication protocol arrives at the same leaf node {v} on both inputs {(x_1,y_1)} and {(x_2,y_2)}, then on input {(x_1,y_2)} the protocol also arrives at {v}

Proof: For each internal node {v'} labeled by {a_{v'}} on the path from the root to {v}, the condition ensures that {a_{v'}(x_1) = a_{v'}(x_2)}. Similarly, {b_{v'}(y_1)=b_{v'}(y_2)} if the internal node is labeled by {b_{v'}}. Then the result is immediate. \Box

The above copy-paste property shows the following limitation of any blackboard communication protocol: the set of all input pairs {(x,y)} arriving at any leaf node forms a rectangle, i.e., takes the form {X_v\times Y_v}. Hence, if the function to be computed is non-constant on large rectangles, each leaf node can only take care of a small rectangle, and many leaves are required. This is the key insight of the following theorem. 

Theorem 3 (Log-rank Inequality) Let {M\in \{0,1\}^{X\times Y}} be a binary matrix defined as {M(x,y)=f(x,y)} for all {x\in X, y\in Y}. Then the deterministic communication complexity of computing {f} is at least {\log_2 \text{\rm rank}(M)}, where the linear rank is understood over {{\mathbb R}}

Proof: For any leaf node {v} which outputs {1}, define {M_v(x,y) = 1(v\text{ is reached on input }(x,y))}. Then clearly {M= \sum_v M_v}. By the sub-additivity of rank, we have 

\displaystyle \text{rank}(M) \le \sum_v \text{rank}(M_v). \ \ \ \ \ (1)

However, by Lemma 2, the input pairs arriving at each leaf form a rectangle, so {\text{rank}(M_v)\le 1} for any leaf {v}. Hence inequality (1) implies that the number of leaf nodes is at least {\text{rank}(M)}, and we are done. \Box

Applying Theorem 3 to the equality function, the matrix {M} is a {2^n\times 2^n} identity matrix, and therefore the deterministic communication complexity is at least {n}. This is essentially tight: Alice can communicate the entire vector {x} to Bob using {n} bits and Bob then sends the 1-bit evaluation of the equality function back to Alice. Hence, the deterministic communication complexity of equality evaluation is {\Theta(n)}

2. Equality Evaluation: Randomized Private-coin Communication Complexity 

Next we consider a slight variant of the previous problem: what is the communication complexity if Alice and Bob can tolerate some maximum probability of error {\varepsilon>0} on all inputs? This quantity is called the randomized communication complexity. Here to introduce randomness, we assume that both Alice and Bob have access to some (private, not shared) random number generators. We also assume that {\varepsilon} is a small constant (say, {\varepsilon=0.01}) and suppress the dependence of the communication complexity on {\varepsilon}

A striking result is that, if we allow some small probability of error (even of the order {1/\text{poly}(n)}), the randomized communication complexity drops from {\Theta(n)} to {O(\log n)}. Here is a communication protocol: by Bertrand’s postulate there exists some prime number {p\in [n^2,2n^2]}. Alice chooses such a {p} and {z\sim \text{Unif}([p])}, evaluates the polynomial 

\displaystyle q(x,z) = x_0 + x_1z + \cdots + x_{n-1}z^{n-1} \pmod p, \ \ \ \ \ (2)

and sends {(p,z,q(x,z))} to Bob using {O(\log n)} bits, where {x=(x_0,x_1,\cdots,x_{n-1})}. Bob then evaluates {q(y,z)} in (2) and outputs {1(q(x,z)=q(y,z))}. Clearly, if {x=y} then Bob always outputs {1} correctly. If {x\neq y}, the map {z\mapsto q(x,z) - q(y,z)} is a non-zero polynomial of degree at most {n-1}, so it has at most {n-1} zeros in {\mathbb{F}_p}. Consequently, Bob makes mistakes with probability at most {n/p\le 1/n} for any {x\neq y}. 
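
A minimal implementation of this fingerprinting protocol in Python is given below; the choice of taking the smallest prime above {n^2} by trial division is just for illustration. 

```python
import random

def is_prime(p):
    if p < 2:
        return False
    d = 2
    while d * d <= p:
        if p % d == 0:
            return False
        d += 1
    return True

def alice_message(x):
    """x is a bit string of length n; returns (p, z, q(x,z)) as in (2)."""
    n = len(x)
    p = next(q for q in range(n * n, 2 * n * n + 1) if is_prime(q))  # Bertrand
    z = random.randrange(p)
    qx = sum(int(b) * pow(z, i, p) for i, b in enumerate(x)) % p
    return p, z, qx   # O(log n) bits in total

def bob_decides(y, message):
    p, z, qx = message
    qy = sum(int(b) * pow(z, i, p) for i, b in enumerate(y)) % p
    return qx == qy   # Bob's output 1(q(x,z) = q(y,z))

x = "1011000111" * 10                        # n = 100
y = x[:-1] + ("0" if x[-1] == "1" else "1")  # differs in the last bit
print(bob_decides(x, alice_message(x)))      # always True when x == y
print(bob_decides(y, alice_message(x)))      # False except w.p. <= 1/n
```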

The previous upper bound shows that Theorem 3 no longer holds for the randomized communication complexity. In fact, a careful inspection of the proof shows that the zero-error property is crucial for Theorem 3 to hold. However, we may still perform a deterministic simulation of any randomized protocol and reduce to the previous case. 

Theorem 4 Let {D(f)} and {R(f)} be the deterministic and randomized private-coin communication complexity of computing {f}, respectively. Then

\displaystyle D(f) \le 2^{R(f)}\left(-\log_2 \left(\frac{1}{2}-\varepsilon\right) + R(f) \right). \ \ \ \ \ (3)

Proof: Given a randomized protocol, Alice can transmit the probabilities of arriving at all {2^{R(f)}} leaf nodes based on her input {x}. Given these probabilities, Bob can privately evaluate the probabilities of arriving at each leaf based on the inputs {(x,y)} (which is made possible by the product law of independent events thanks to the private-coin assumption). Then Bob sums up probabilities for leaf nodes which output {0} and {1}, respectively, and takes the majority vote. Since the error probability of the randomized protocol is at most {\varepsilon<1/2}, the final output given by the majority vote is always correct. The proof of (3) is completed by noting that these probabilities sent by Alice can be made within precision {(\frac{1}{2}-\varepsilon)2^{-R(f)}} without hurting the majority vote. \Box

Applying Theorem 4 to the equality function {f}, we immediately obtain that {R(f) = \Omega(\log n)}. Hence, the randomized private-coin communication complexity of equality evaluation is {\Theta(\log n)}

3. Equality Evaluation: Randomized Public-coin Communication Complexity 

We ask the same question as in the previous section, but now Alice and Bob have shared randomness. Careful readers may have noticed a subtle point used in the proof of Theorem 4, i.e., the identity {\mathop{\mathbb P}(A\cap B)=\mathop{\mathbb P}(A)\mathop{\mathbb P}(B)} for events {A,B} given by independent private randomness. Hence, the {\Omega(\log n)} lower bound does not directly apply to the public-coin scenario, and public-coin protocols may perform much better. This is indeed the case: the randomized public-coin communication complexity is actually {\Theta(1)}. 

There is nothing to prove for the lower bound. For the upper bound, here is a simple protocol: both Alice and Bob draw (the same collection of) independent vectors {z_1, \cdots, z_m\sim \text{Unif}(\mathbb{F}_2^n)}. Then Alice sends the {m}-bit vector {(z_1^\top x, z_2^\top x, \cdots, z_m^\top x)} to Bob, and Bob claims that {x=y} if and only if {z_i^\top x = z_i^\top y} for all {i\in [m]}. Clearly, this protocol errs with probability at most {2^{-m}} on all inputs. Hence, the randomized public-coin communication complexity of equality evaluation is {\Theta(1)}. 
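
Here is a minimal sketch of this public-coin protocol in Python, with the shared vectors {z_1,\cdots,z_m} represented as random {n}-bit integers and the inner products {z_i^\top x} computed as parities over {\mathbb{F}_2}. 

```python
import random

def equality_public_coin(x, y, n, m=32):
    """x, y: n-bit integers held by Alice and Bob; m shared random vectors."""
    shared = [random.getrandbits(n) for _ in range(m)]   # public coins
    parity = lambda v: bin(v).count("1") & 1
    alice_msg = [parity(z & x) for z in shared]          # m bits from Alice
    return all(b == parity(z & y) for z, b in zip(shared, alice_msg))

# errs only when x != y, with probability 2^{-m} over the shared coins
print(equality_public_coin(12345, 12345, n=64))   # True
print(equality_public_coin(12345, 54321, n=64))   # False w.p. 1 - 2^{-32}
```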

4. Equality Evaluation: Randomized One-round Private-coin Communication Complexity 

Now we consider a final variant of the distributed equality evaluation problem, where we assume a randomized private-coin protocol but restrict the protocol to be only one-round. Recall that the previous blackboard communication protocol allows for multiple rounds of interaction, i.e., Alice may provide feedback to the messages transmitted by Bob, and vice versa. In the one-round scenario, Alice and Bob can only send {k}-bit messages once to a central referee, and the target is to minimize the communication cost {k} while controlling the maximum error probability below {\varepsilon}. Then what is the communication complexity in this scenario? 

Surprisingly, the communication complexity is {\Theta(\sqrt{n})}, which is different from all previous ones. We first prove the lower bound. Notice that a one-round protocol has a more concise representation than a binary tree: any randomized strategy of Alice or Bob may be written as a stochastic mapping from {[2^n]} to {[2^k]}. In particular, the copy-paste property in Lemma 2 becomes insufficient for our purposes. Let {A,B\in [0,1]^{2^n\times 2^k}} be the transition matrices of the stochastic mappings of Alice and Bob, respectively, and {R\in [0,1]^{2^k\times 2^k}} be the matrix with {R(m_A,m_B)} being the probability that the referee outputs {1} given messages {(m_A, m_B)}. Since {(ARB^\top)_{x,y}} is the probability of outputting {1} on input {(x,y)}, the correctness property ensures that

\displaystyle \|\text{vec}(ARB^\top - I_{2^n}) \|_\infty \le \varepsilon, \ \ \ \ \ (4)

where {I_m} denotes the {m\times m} identity matrix, and {\|\cdot \|_\infty} denotes the {\ell_\infty} norm of vectors. Hence, (4) says that {ARB^\top} is an approximate non-negative matrix factorization of the identity matrix, so {2^k} is at least the approximate non-negative rank of the identity matrix. The following result in linear algebra provides a lower bound on {k}. 

Theorem 5 Let {0<\varepsilon\le 0.1}. If there exist stochastic matrices {A,B\in [0,1]^{N\times m}} and a non-negative matrix {R\in [0,1]^{m\times m}} such that {\|\text{\rm vec}(ARB^\top - I_N)\|_\infty\le \varepsilon}, then 

\displaystyle m \ge 2^{\Omega(\sqrt{\log N})}.

The proof of Theorem 5 is quite involved, and we refer interested readers to the reference cited in the bibliographic notes. Now {k=\Omega(\sqrt{n})} is an immediate consequence of (4) and Theorem 5. 

To show the upper bound {k=O(\sqrt{n})}, the idea is to simulate a randomized protocol by deterministic ones. Recall that in the previous section, there is a randomized (also one-round) public-coin protocol which requires {O(1)} bits of communication. Let {R_1, \cdots, R_m} be {m} independent deterministic realizations (i.e., with the public coins frozen) of this randomized protocol with maximum error probability {\varepsilon}. Then by Hoeffding’s inequality, for any fixed input {(x,y)} we have

\displaystyle \mathop{\mathbb P}\left(\frac{1}{m}\sum_{i=1}^m 1(R_i(x,y) \neq 1(x=y)) \ge 2\varepsilon \right) \le \exp\left(-2m\varepsilon^2\right). \ \ \ \ \ (5)

Hence, by (5) and the union bound over {2^{2n}} possible inputs, for fixed {\varepsilon>0} we may simply choose {m=O(n)} such that the average probability of error of the {m} deterministic protocols on each input is at most {2\varepsilon}. 

Now the private-coin protocol is as follows: arrange the above {m} deterministic protocols in a {\sqrt{m}\times\sqrt{m}} matrix. Alice draws {i\sim \mathsf{Unif}([\sqrt{m}])}, runs her part of all protocols in the {i}-th row, and sends all her messages to the referee (together with {i}). Similarly, Bob draws {j\sim \mathsf{Unif}([\sqrt{m}])} and does the same for all protocols in the {j}-th column. The referee then looks at the protocol at the {(i,j)}-th entry, whose probability of error, averaged over the private coins {(i,j)}, is exactly the average probability of error of these {m} deterministic protocols, which is at most {2\varepsilon}. Meanwhile, the communication cost of this protocol is {O(\sqrt{m}) = O(\sqrt{n})}, as desired. 
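
To illustrate the mechanics, here is a small Python sketch of the grid simulation. Purely for concreteness, each deterministic protocol is taken to be a frozen bundle of {7} parity checks, a hypothetical choice giving per-protocol error {2^{-7}} when {x\neq y} that stands in for the {O(1)}-bit public-coin protocol above; the point is the {O(\sqrt{m})} communication pattern rather than the exact constants. 

```python
import math
import random

def simulate_one_round(x, y, n, m=1024):
    # m deterministic protocols: frozen realizations of the public coins,
    # each a bundle of 7 parity-check masks
    protocols = [[random.getrandbits(n) for _ in range(7)] for _ in range(m)]
    parity = lambda v: bin(v).count("1") & 1
    side = int(math.isqrt(m))        # arrange the protocols in a side x side grid
    i = random.randrange(side)       # Alice's private coin (sent to the referee)
    j = random.randrange(side)       # Bob's private coin (sent to the referee)
    # Alice's O(sqrt(m))-bit message: her parities for every protocol in row i
    alice = [[parity(z & x) for z in protocols[i * side + c]] for c in range(side)]
    # Bob's message: his parities for every protocol in column j
    bob = [[parity(z & y) for z in protocols[r * side + j]] for r in range(side)]
    # The referee evaluates the single protocol at entry (i, j)
    return alice[j] == bob[i]

print(simulate_one_round(7, 7, n=64))   # True
print(simulate_one_round(7, 9, n=64))   # False w.p. >= 1 - 2^{-7}
```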

5. Bibliographic Notes 

Communication complexity is a central topic in theoretical computer science. The blackboard communication protocol was proposed by Yao in 1979, and some central tools (including Theorems 3 and 4) are presented in the excellent book by Kushilevitz and Nisan (1996). The proof of Theorem 5 is given by Newman and Szegedy (1996), and the simulation idea is due to Newman (1991). 

  1. Andrew Chi-Chih Yao, Some complexity questions related to distributive computing (preliminary report). In Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209–213. ACM, 1979. 
  2. Eyal Kushilevitz and Noam Nisan, Communication complexity. Cambridge University Press. 1996. 
  3. Ilan Newman and Mario Szegedy, Public vs private coin flips in one round communication games, In Proceedings of the 28th annual ACM symposium on Theory of computing, pages 561–570. ACM, 1996. 
  4. Ilan Newman, Private vs. common random bits in communication complexity. Information processing letters 39.2 (1991): 67–71.

Lecture 1: Introduction to Information-theoretic Lower Bounds via Puzzles

Blog, Online Lectures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

1. Overview 

Most scientific research is devoted to developing new tools or findings for challenging problems arising in practice, which help to provide explicit solutions to real problems. These works are typically called constructive, and characterize tasks which are possible to accomplish. However, there is also another line of work which focuses on the other side, i.e., provides fundamental limits on certain tasks. In other words, these works are mostly non-constructive and illustrate targets which are impossible to achieve. Such results are also useful for a number of reasons: we may certify the fundamental limits and optimality of some procedures so that it becomes unnecessary to look for better ones, we may understand the limitations of the current framework, and we may compute the amount of resources necessary to complete certain tasks. 

Among different types of fundamental limits (called converses in information theory, and lower bounds more broadly in a number of fields), throughout these lectures we will be interested in so-called information-theoretic ones. When it comes to information-theoretic lower bounds, typically one makes observations (either passively or actively) subject to some prescribed rules and argues that certain tasks cannot be accomplished using any tool based solely on these observations. To show these impossibility results, one argues that the amount of information contained in these observations is insufficient to perform these tasks. Therefore, the limited-information structure is of utmost importance in such arguments, and this structure occurs in a number of fields including statistics, machine learning, optimization, and reinforcement learning, to name a few; e.g., a given size of samples to perform inference, a given amount of training data to learn a statistical model, a given number of gradient evaluations to optimize a convex function, a given round of adaptivity to search for a good policy, etc. 

These lectures are devoted to providing an overview of (classical and modern) tools to establish information-theoretic lower bounds. We will neither be restricted to certain problems (e.g., statistical inference, optimization, bandit problems), nor restrict ourselves to tools in information theory. Instead, we will try to present an extensive set of tools/ideas which are suitable for different problem structures, followed by numerous interdisciplinary examples. We will see that certain tools/ideas can be applied to many seemingly unrelated problems. 

2. Usefulness of Lower Bounds 

We ask the following question: why do we care about lower bounds? We remark that the usefulness of lower bounds is not restricted to providing fundamental limits and telling people what is impossible. In fact, the power of understanding lower bounds lies more in the upper bounds, in the sense that it helps us to understand the problem structure better, including figuring out the most difficult part of the problem and the most essential part of the information which should be made full use of. In other words, lower bounds are intertwined with upper bounds, and should by no means be treated as an independent component. We elaborate on this point via the following examples from puzzle games. 

2.1. Example I: Card Guessing Game 

There is a magic show as follows: a {52}-card deck is thoroughly shuffled by the audience. Alice draws {5} cards from the top of the deck and reveals {4} of them to Bob one after another, and then Bob can always correctly guess the remaining card in Alice’s hand. How can Alice and Bob achieve that? 

Instead of proposing an explicit strategy for Alice and Bob, let us look at the information-theoretic limit of this game first. Suppose the deck consists of {n} cards; what are the possible values of {n} such that such a strategy still exists? To prove such a bound, we need to understand what the possible strategies are. From Alice’s side, her strategy is simply a mapping {f} from unordered {5}-tuples to ordered {4}-tuples, with the additional restriction that {f(A)\subseteq A} for any unordered {5}-tuple {A\subseteq [n]}. From Bob’s side, his strategy is another mapping {g} from ordered {4}-tuples to a specific card in {[n]}. Finally, the correctness of Bob’s guessing corresponds to

\displaystyle \{ g(f(A))\} \cup f(A) = A, \qquad \forall A\subseteq [n], |A|=5. \ \ \ \ \ (1)

An equivalent way to state (1) is that Bob can recover all {5} cards after observing the ordered {4}-tuple revealed by Alice; write {h} for this strategy. Now we come to our information-theoretic observation: do ordered {4}-tuples contain enough information to recover any unordered {5}-tuple? Here we quantify information as cardinality: mathematically speaking, the mapping {h} must be surjective, so {|\text{dom}(h)|\ge |\text{range}(h)|}. In other words,

\displaystyle 4!\cdot \binom{n}{4} \ge \binom{n}{5} \Longrightarrow n \le 124. \ \ \ \ \ (2)

Hence, this magic show will fail for any deck with more than {124} cards. The next question is: is {n=124} achievable? Now we are seeking an upper bound, but the previous lower bound still helps: if equality holds in (2), then {h} must be a bijection! Keeping the bijective nature of {h} in mind, it becomes not too hard to propose the following strategy which works. Label all cards by {1,2,\cdots,124}, and suppose the chosen cards are {c_1<c_2<\cdots<c_5}. Alice computes the sum {s=\sum_{i=1}^5 c_i \pmod 5}, keeps {c_s} and reveals the others {d_1<d_2<d_3<d_4} to Bob. Let {t=\sum_{i=1}^4 d_i \pmod 5}; then for Bob to decode {c_s}, he only needs to solve the following equation: 

\displaystyle c = - t + \min\{i\in [4]: c<d_i \} \pmod 5, \ \ \ \ \ (3)

where we define {\min\emptyset = 5}. It is easy to show that (3) always admits exactly {24} solutions in {[124]}, and the index of the true solution can be encoded using the {4!=24} different permutations of the revealed cards {(d_1,\cdots,d_4)}. 
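
The full strategy can be implemented and tested in a few lines of Python. In the sketch below we adopt the hypothetical conventions that a sum {s\equiv 0 \pmod 5} is read as {s=5} (Alice keeps the largest card) and that the solution index is encoded by the lexicographic rank of the permutation of the revealed cards; Bob’s decoder simply enumerates the {24} solutions of (3). 

```python
import random
from itertools import permutations

N = 124

def solutions(revealed, t):
    """The 24 candidates c solving (3), enumerated in increasing order."""
    sols = []
    for c in range(1, N + 1):
        if c in revealed:
            continue
        r = 1 + sum(d < c for d in revealed)   # r = min{i : c < d_i}, or 5
        if (c + t - r) % 5 == 0:
            sols.append(c)
    return sols

def alice(hand):
    cs = sorted(hand)
    s = sum(cs) % 5 or 5                 # convention: s = 0 is read as 5
    hidden = cs[s - 1]
    rest = [c for c in cs if c != hidden]
    t = sum(rest) % 5
    idx = solutions(rest, t).index(hidden)   # an index in 0..23
    return list(permutations(rest))[idx]     # encoded in the card ordering

def bob(ordered):
    rest = sorted(ordered)
    t = sum(rest) % 5
    idx = list(permutations(rest)).index(tuple(ordered))
    return solutions(rest, t)[idx]

for _ in range(1000):
    hand = random.sample(range(1, N + 1), 5)
    revealed = alice(hand)
    assert set(revealed) | {bob(revealed)} == set(hand)
```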

2.2. Example II: Coin Flipping Game 

Now Alice and Bob play another game cooperatively: consider {n} coins on the table, each of which may be head or tail, and the initial state is unknown to both Alice and Bob. The audience tells a number in {[n]} to Alice; she comes to the table, looks at all coins, and flips one of them. Bob then comes to the table and correctly announces the number told by the audience. The question is: for which values of {n} can they come up with such a strategy? 

To answer this question, again we need to understand the structure of all strategies. Let {s\in \{0,1\}^n} be the initial state of the coins, which may be arbitrary. We may also identify {s} as vertices of an {n}-dimensional hypercube. Let {k\in [n]} be the number given by the audience. Alice’s strategy is then a mapping {f: \{0,1\}^n\times [n]\rightarrow \{e_1,\cdots,e_n\}}, choosing a coin to flip (where {e_i} denotes the {i}-th canonical vector) based on her knowledge. Then Bob’s strategy is a mapping {g: \{0,1\}^n\rightarrow [n]}, and the correctness condition implies that

\displaystyle g(s \oplus f(s,k)) = k, \qquad \forall s\in\{0,1\}^n, k\in[n]. \ \ \ \ \ (4)

It is clear from (4) that the map {k\mapsto f(s,k)} must be injective for any {s}. By cardinality arguments, this map is further bijective. Now the problem structure becomes more transparent: for each {k\in [n]}, let {g^{-1}(k)\subseteq \{0,1\}^n} be the set of states where Bob will claim {k}, and say these vertices of the hypercube have the {k}-th color. Then (4) states that, for each vertex {s\in \{0,1\}^n}, its {n} neighbors {\{s\oplus e_i \}_{i\in [n]}} must have {n} different colors. The converse is also true: if such a coloring scheme exists, then we may find {f,g} such that (4) holds. 

Now when does such a coloring scheme exist? A simple observation is that the number of vertices of each color must be the same: since every vertex has exactly one neighbor of each color, double counting gives {n\cdot |g^{-1}(k)| = 2^n} for every color {k}. Hence, we must have {n \mid 2^n}, which implies that {n=2^m} must be a power of {2}. Based on the coloring intuition given by the lower bound, the strategy also becomes simple: consider any finite Abelian group {G=\{g_1,\cdots,g_n\}} with {n} elements; we identify colors with elements of {G} and let {s\in \{0,1\}^n} have color

\displaystyle G(s) = s_1g_1 + s_2g_2 + \cdots + s_ng_n \in G. \ \ \ \ \ (5)

It is straightforward to verify that (4) holds if and only if {G} has characteristic {2}, i.e., {2g=0} for any {g\in G}. By algebra, such an Abelian group exists if and only if {n} is a power of {2}, e.g., {G=\mathbb{F}_2^m} when {n=2^m} (necessity can be shown via taking quotients {G/\{0,g\}} repeatedly). Hence, such a strategy exists if and only if {n=2^m} is a power of {2}
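
For {n} a power of two, the strategy is a one-liner and correctness can be verified exhaustively. In the Python sketch below the coins are indexed by {0,1,\cdots,n-1} (a hypothetical but convenient labeling), so that the group {G=\mathbb{F}_2^m} is realized by bitwise XOR on indices; Alice’s flip is then the unique coin {i} with {G(s)\oplus g_i = g_k}. 

```python
import itertools

m = 3
n = 2 ** m   # number of coins

def color(s):
    """Bob's rule (5): XOR of the indices of all coins showing heads."""
    c = 0
    for i, bit in enumerate(s):
        if bit:
            c ^= i
    return c

def alice_flip(s, k):
    """Flip the unique coin i satisfying color(s) XOR i = k."""
    return color(s) ^ k

# exhaustive check of (4): Bob decodes k for every state and every target
for s in itertools.product([0, 1], repeat=n):
    for k in range(n):
        t = list(s)
        t[alice_flip(s, k)] ^= 1
        assert color(t) == k
```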

2.3. Example III: Hat Guessing Game 

Now we look at a hat-guessing game with a more complicated information structure. There are {15} people sitting together, each of whom wears a hat of color red or blue, independently with probability {\frac{1}{2}} each. They cannot talk to each other, and each can see the colors of all hats except for his own. Now each of them simultaneously chooses to guess the color of his own hat, or chooses to pass. They win if at least one person guesses, and all guesses are correct. What is the optimal winning probability in this game? 

The answer to this question is {\frac{15}{16}}, which is shocking at first sight because it greatly outperforms the {\frac{1}{2}} achieved by the simple scheme where only the first person guesses. To understand this improvement, we first think about the fundamental limit of this problem. Let {s\in \{0,1\}^{15}} be the sequence of hat colors; call {s} a success state if they win under state {s}, and a failure state otherwise. Clearly, at each success state {s}, there must be some person {i\in [15]} who makes a correct guess. However, since he cannot see {s_i}, he will make the same guess even if {s_i} is flipped, and this guess becomes wrong in the new state. This argument seems to suggest that the {\frac{1}{2}} winning probability cannot be improved, for each success state corresponds to a failure state and therefore the success states constitute at most half of all states. However, this intuition is wrong, since multiple success states may correspond to the same failure state. 

Mathematically speaking, let {S} be the success states and {F} be the failure states. By the previous argument, there exists a map {f: S\rightarrow F} which only flips one coordinate. Since there are at most {15} coordinates which can be flipped, we have {|f^{-1}(s)|\le 15} for each {s\in F}. Consequently, {|S|\le 15|F|}, and the winning probability is 

\displaystyle \frac{|S|}{|S|+|F|} \le \frac{15}{16}. \ \ \ \ \ (6)

The above lower bound argument shows that, in order to achieve the optimal winning probability, we must have {|f^{-1}(s)|=15} for any {s\in F}. Consequently, in any failure state, everyone must make a wrong guess. This crucial observation motivates the following strategy: identify people with the elements of {\mathbb{F}_2^4 -\{0\}}. Each person {v} computes the sums {s_1, s_2} of the indices of the people he sees with red and blue hats, respectively. If {v=s_1}, he claims blue; if {v=s_2}, he claims red; otherwise he passes. Then it becomes obvious that a failure occurs if and only if the indices of people with red hats sum to {0\in \mathbb{F}_2^4}, whose probability is {(\frac{1}{2})^4 = \frac{1}{16}}. 
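
The {\frac{15}{16}} figure can be confirmed by brute force over all {2^{15}} states. The Python sketch below implements exactly the strategy above, with people labeled by the nonzero elements {1,\cdots,15} of {\mathbb{F}_2^4} and the sums computed by XOR. 

```python
import itertools

people = range(1, 16)   # nonzero elements of F_2^4

def win(state):
    """state[v] == 1 iff person v wears red; returns True iff they win."""
    guesses = []
    for v in people:
        s1 = s2 = 0
        for u in people:
            if u != v:                # each person sees all hats but his own
                if state[u]:
                    s1 ^= u           # sum of visible red indices
                else:
                    s2 ^= u           # sum of visible blue indices
        if v == s1:
            guesses.append((v, 0))    # claims his own hat is blue
        elif v == s2:
            guesses.append((v, 1))    # claims his own hat is red
    return bool(guesses) and all(state[v] == g for v, g in guesses)

wins = sum(win(dict(zip(people, bits)))
           for bits in itertools.product([0, 1], repeat=15))
print(wins / 2 ** 15)   # 0.9375 = 15/16
```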

3. Bibliographic Notes 

There are tons of puzzles that can be cracked by information-theoretic analysis. We recommend two books for readers who would like to see more fun puzzles: 

  1. Peter Winkler, Mathematical puzzles: a connoisseur’s collection. AK Peters/CRC Press, 2003. 
  2. Jiri Matousek, Thirty-three miniatures: Mathematical and Algorithmic applications of Linear Algebra. Providence, RI: American Mathematical Society, 2010.