Lecture 3: Statistical Decision Theory: Model Distance and Equivalence


(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

This lecture begins our discussion of specific tools and ideas for proving information-theoretic lower bounds. We will temporarily restrict ourselves to statistical inference problems (to which most lower bounds apply), where the presence of randomness is a key feature. Since the observer cannot control the realizations of randomness, the information contained in the observations, albeit not necessarily in a discrete structure (e.g., those in Lecture 2), can still be limited. In this lecture and subsequent ones, we will introduce the reduction and hypothesis testing ideas to prove lower bounds for statistical inference, and these ideas will also be applied to other problems. 

1. Basics of Statistical Decision Theory 

To introduce statistical inference problems, we first review some basics of statistical decision theory. Mathematically, let (\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta\in\Theta}) be a collection of probability triplets indexed by \theta, where the observation X is a random variable following distribution P_{\theta} for an unknown parameter \theta in the known parameter set \Theta. The family (P_\theta)_{\theta\in\Theta} is typically called a model or an experiment, for the parameter \theta can represent different model parameters or theories to explain the observation X. A decision rule, or a (randomized) estimator, is a transition kernel \delta(x,da) from \mathcal{X} to some action space \mathcal{A}. Let L: \Theta\times \mathcal{A}\rightarrow {\mathbb R}_+ be a loss function, where L(\theta,a) represents the loss of using action a when the true parameter is \theta. A central quantity to measure the quality of a decision rule is the risk in the following definition. 

Definition 1 (Risk) Under the above notations, the risk of the decision rule \delta under loss function L and the true parameter \theta is defined as

 R_\theta(\delta) = \iint L(\theta,a)P_\theta(dx)\delta(x,da). \ \ \ \ \ (1)

Intuitively, the risk characterizes the expected loss (over the randomness in both the observation and the decision rule) of the decision rule when the true parameter is \theta. When \delta(x,da) is a point mass \delta(a-T(x)) for some deterministic function T:\mathcal{X}\rightarrow \mathcal{A}, we will also call T(X) an estimator, and the risk in (1) becomes 

 R_\theta(T) = \int L(\theta,T(x)) P_\theta(dx) = \mathop{\mathbb E}_{\theta} L(\theta, T(X)). \ \ \ \ \ (2)
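As a sanity check of (2), the following minimal sketch computes the risk exactly for a toy model that is not part of the lecture: X = (X_1, ..., X_n) i.i.d. Bernoulli(\theta), with the sample mean as the estimator T and the squared loss. The function names (`risk`, `sample_mean`, `squared_loss`) and the toy model itself are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def risk(theta, n, estimator, loss):
    """Exact risk E_theta[L(theta, T(X))] for X = (X_1, ..., X_n) i.i.d. Bernoulli(theta)."""
    total = Fraction(0)
    for x in product((0, 1), repeat=n):
        ones = sum(x)
        prob = theta ** ones * (1 - theta) ** (n - ones)  # P_theta(x)
        total += prob * loss(theta, estimator(x))
    return total

def sample_mean(x):
    """Deterministic estimator T(x): the fraction of ones."""
    return Fraction(sum(x), len(x))

def squared_loss(theta, a):
    return (theta - a) ** 2

theta = Fraction(1, 3)
# For the sample mean under squared loss, the risk is the variance theta * (1 - theta) / n.
print(risk(theta, 4, sample_mean, squared_loss) == theta * (1 - theta) / 4)
```

Exact rational arithmetic is used so that the comparison with the closed-form variance involves no rounding.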

Example 1 In the linear regression model with random design, let the observations (x_1,y_1), \cdots, (x_n,y_n)\in {\mathbb R}^p\times {\mathbb R} be i.i.d. with x_i \sim P_X and y_i|x_i\sim \mathcal{N}(x_i^\top \theta, \sigma^2). Here the parameter set \Theta={\mathbb R}^p is a finite-dimensional Euclidean space, and therefore we call this model parametric. The quantity of interest is \theta, and the loss function may be chosen to be the prediction error L(\theta,\hat{\theta}) = \mathop{\mathbb E}_{\theta} (y-x^\top \hat{\theta})^2 of the linear estimator f(x) = x^\top \hat{\theta}.

Example 2 In the density estimation model, let X_1, \cdots, X_n be i.i.d. drawn from some 1-Lipschitz density f supported on [0,1]. Here the parameter set \Theta of the unknown f is the infinite-dimensional space of all possible 1-Lipschitz functions on [0,1], and we call this model non-parametric. The target may be to estimate the density f at a point, the entire density, or some functional of the density. In the respective settings, the loss functions can be

 L_1(f,T) = |T - f(0)|, \quad L_2(f,T) = \int_0^1 (T(x) - f(x))^2dx, \quad L_3(f,T) = |T - \|f\|_1|.

Example 3 By allowing general action spaces and loss functions, the decision-theoretic framework can also incorporate some non-statistical examples. For instance, in stochastic optimization \theta\in\Theta may parameterize a class of convex Lipschitz functions f_\theta: [-1,1]^d\rightarrow {\mathbb R}, and X denotes the noisy observations of the gradients at the queried points. Then the action space \mathcal{A} may just be the entire domain [-1,1]^d, and the loss function L is the optimality gap defined as

 L(\theta, a) = f_\theta(a) - \min_{a^\star \in [-1,1]^d} f_\theta(a^\star).

In practice, one would like to find optimal decision rules for a given task. However, the risk is a function of \theta and it is hard to compare two risk functions directly. Hence, people typically map the risk functions into scalars and arrive at the following minimax and Bayesian paradigm. 

Definition 2 (Minimax and Bayes Decision Rule) The minimax decision rule is the decision rule \delta which minimizes the quantity \max_{\theta\in\Theta} R_\theta(\delta). The Bayes decision rule under distribution \pi(d\theta) (called the prior distribution) is the decision rule \delta which minimizes the quantity \int R_\theta(\delta)\pi(d\theta).

It is typically hard to find the minimax decision rule in practice, while the Bayes decision rule admits a closed-form expression (although hard to compute in general). 

Proposition 3 The Bayes decision rule under prior \pi is given by the estimator

 T(x) \in \arg\min_{a\in\mathcal{A}} \int L(\theta,a)\pi(d\theta|x), \ \ \ \ \ (3)

where \pi(d\theta|x) denotes the posterior distribution of \theta under \pi (assuming the existence of regular posterior). 

Proof: Left as an exercise for the reader.  \Box 
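To make (3) concrete, here is a small sketch on a finite toy problem (two parameters, three observations, two actions, 0-1 loss). The numbers and the helper names `bayes_rule` and `bayes_risk` are assumptions chosen for illustration. The sketch computes the posterior-minimizing rule and compares its Bayes risk with a brute-force search over all deterministic rules; restricting to deterministic rules loses nothing here because the Bayes risk is linear in the decision rule.

```python
from itertools import product

def bayes_rule(prior, lik, loss, actions):
    """Return T with T[x] minimizing the posterior expected loss, as in (3).

    prior[t]   : prior mass of parameter t
    lik[t][x]  : P_t(x)
    loss(t, a) : loss of action a under parameter t
    """
    rule = []
    for x in range(len(lik[0])):
        # Unnormalized posterior weights pi(t) * P_t(x); the normalizer does not affect the argmin.
        weights = [prior[t] * lik[t][x] for t in range(len(prior))]
        rule.append(min(actions,
                        key=lambda a: sum(w * loss(t, a) for t, w in enumerate(weights))))
    return rule

def bayes_risk(rule, prior, lik, loss):
    """Average risk of a deterministic rule under the prior."""
    return sum(prior[t] * lik[t][x] * loss(t, rule[x])
               for t in range(len(prior)) for x in range(len(rule)))

prior = [0.5, 0.5]
lik = [[0.7, 0.2, 0.1],   # P_0
       [0.1, 0.3, 0.6]]   # P_1
actions = [0, 1]
zero_one = lambda t, a: float(t != a)

T = bayes_rule(prior, lik, zero_one, actions)
# Brute force over all 2^3 deterministic rules for comparison.
best = min(bayes_risk(list(r), prior, lik, zero_one) for r in product(actions, repeat=3))
print(T, bayes_risk(T, prior, lik, zero_one), best)
```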

2. Deficiency and Model Distance 

The central target of statistical inference is to propose some decision rule for a given statistical model with small risks. Similarly, the counterpart in the lower bound is to prove that certain risks are unavoidable for any decision rules. Here to compare risks, we may either compare the entire risk function, or its minimax or Bayes version. In this lecture we will focus on the risk function, and many later lectures will be devoted to appropriate minimax risks. 

We start with the task of comparing two statistical models with the same parameter set \Theta. Intuitively, one may think that the model with a smaller noise level would be better than the other, e.g., the model \mathcal{M}_1 = \{\mathcal{N}(\theta,1): \theta\in {\mathbb R} \} should be better than \mathcal{M}_2 = \{\mathcal{N}(\theta,2): \theta\in {\mathbb R} \}. However, this criterion is inadequate for two reasons: 

  1. There is no proper notion of noise for general (especially non-additive) statistical models; 
  2. Even if a natural notion of noise exists for certain models, it is not necessarily true that the model with smaller noise is always better. For example, let  \mathcal{M}_1 = \{\text{Unif}\{\theta-1,\theta+1 \}: |\theta|\le 1\}, \quad \mathcal{M}_2 = \{\text{Unif}\{\theta-3,\theta+3 \}: |\theta|\le 1\},
    so that the model \mathcal{M}_1 has a smaller magnitude of additive noise. However, despite its larger noise magnitude, the model \mathcal{M}_2 is actually noise-free, since \theta can be decoded perfectly from the observation. 
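The second point can be checked directly. In the sketch below (function names are illustrative assumptions), `decode_m2` recovers \theta exactly from one observation of \mathcal{M}_2, while `m1_candidates` shows that an observation from \mathcal{M}_1 can be consistent with two different parameters.

```python
import random

def sample(theta, shift, rng):
    """One draw from Unif{theta - shift, theta + shift}."""
    return theta + rng.choice((-shift, shift))

def decode_m2(x):
    """Recover theta from X ~ Unif{theta - 3, theta + 3} when |theta| <= 1.

    X > 0 forces X = theta + 3 (because theta - 3 <= -2), and X < 0 forces X = theta - 3.
    """
    return x - 3 if x > 0 else x + 3

def m1_candidates(x):
    """All theta in [-1, 1] that could have produced x under Unif{theta - 1, theta + 1}."""
    return [t for t in (x - 1, x + 1) if abs(t) <= 1]

rng = random.Random(0)
for theta in (-1.0, -0.25, 0.0, 0.5, 1.0):
    assert decode_m2(sample(theta, 3, rng)) == theta  # perfect decoding in M_2

print(m1_candidates(0))  # both theta = -1 and theta = 1 are consistent with X = 0 in M_1
```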

To overcome the above difficulties, we introduce the idea of model reduction. Specifically, we aim to establish the fact that for any decision rule in one model, there is another decision rule in the other model which is uniformly no worse than the former rule. The idea of reduction appears in many fields, e.g., in P/NP theory it is sufficient to work out one NP-complete instance (e.g., circuit satisfiability) from scratch and establish all others by polynomial reduction. This reduction idea is made precise via the following definition. 

Definition 4 (Model Deficiency) For two statistical models \mathcal{M} = (\mathcal{X}, \mathcal{F}, (P_{\theta})_{\theta\in \Theta}) and \mathcal{N} = (\mathcal{Y}, \mathcal{G}, (Q_{\theta})_{\theta\in \Theta}), we say that \mathcal{M} is \varepsilon-deficient relative to \mathcal{N} if for any finite subset \Theta_0\subseteq \Theta, any finite action space \mathcal{A}, any loss function L: \Theta_0\times \mathcal{A}\rightarrow [0,1], and any decision rule \delta_{\mathcal{N}} based on model \mathcal{N}, there exists some decision rule \delta_{\mathcal{M}} based on model \mathcal{M} such that

 R_\theta(\delta_{\mathcal{M}}) \le R_\theta(\delta_{\mathcal{N}}) + \varepsilon, \qquad \forall \theta\in \Theta_0. \ \ \ \ \ (4)

Note that the definition of model deficiency does not involve the specific choice of the action space and loss function, and the finiteness of \Theta_0 and \mathcal{A} in the definition is mainly for technical purposes. Intuitively speaking, \mathcal{M} is \varepsilon-deficient relative to \mathcal{N} if, for any given decision rule in \mathcal{N}, some decision rule in \mathcal{M} has an entire risk function that is no worse, up to an additive gap \varepsilon.

Although Definition 4 gives strong performance guarantees for the \varepsilon-deficient model \mathcal{M}, it is difficult to verify the condition (4), i.e., to transform an arbitrary decision rule \delta_\mathcal{N} to some proper \delta_\mathcal{M}. A crucial observation here is that it may be easier to transform between models \mathcal{M} and \mathcal{N}, and in particular, when \mathcal{N} is a randomization of \mathcal{M}. For example, if there exists some stochastic kernel \mathsf{K}: \mathcal{X} \rightarrow \mathcal{Y} such that Q_\theta = \mathsf{K}P_\theta for all \theta\in\Theta (where  \mathsf{K}P_\theta denotes the marginal distribution of the output of kernel \mathsf{K} with input distributed as P_\theta), then we may simply set \delta_\mathcal{M} = \delta_\mathcal{N} \circ \mathsf{K} to arrive at (4) with \varepsilon=0. The following theorem shows that model deficiency is in fact equivalent to approximate randomization. 

Theorem 5 Model \mathcal{M} is \varepsilon-deficient relative to \mathcal{N} if and only if there exists some stochastic kernel \mathsf{K}: \mathcal{X} \rightarrow \mathcal{Y} such that

 \sup_{\theta\in\Theta} \|Q_\theta - \mathsf{K}P_\theta \|_{\text{\rm TV}} \le \varepsilon,

where \|P-Q\|_{\text{\rm TV}} := \frac{1}{2}\int |dP-dQ| is the total variation distance between probability measures P and Q.

Proof: The sufficiency part is easy. Given such a kernel \mathsf{K} and a decision rule \delta_\mathcal{N} based on model \mathcal{N}, we simply set \delta_\mathcal{M} = \delta_\mathcal{N} \circ \mathsf{K}, i.e., pass the observation through the kernel \mathsf{K} and apply \delta_\mathcal{N} to its output. Then for any \theta\in\Theta,

 \begin{array}{rcl} R_\theta(\delta_{\mathcal{M}}) - R_\theta(\delta_{\mathcal{N}}) &=& \iint L(\theta,a)\delta_\mathcal{N}(y,da) \left[\int P_\theta(dx)\mathsf{K}(dy|x)- Q_\theta(dy) \right] \\ &\le & \|Q_\theta - \mathsf{K}P_\theta \|_{\text{TV}} \le \varepsilon, \end{array}

for the loss function is non-negative and upper bounded by one. 

The necessity part is slightly more complicated, and for simplicity we assume that all \Theta, \mathcal{X}, \mathcal{Y} are finite (the general case requires proper limiting arguments). In this case, any decision rules \delta_\mathcal{M} or \delta_\mathcal{N}, loss functions L and priors \pi(d\theta) can be represented by a finite-dimensional vector. Given \mathcal{A} and \delta_{\mathcal{N}}, the condition (4) ensures that 

 \sup_{L(\theta,a),\pi(d\theta)} \inf_{\delta_{\mathcal{M}}}\iint L(\theta,a)\pi(d\theta)\left[\int \delta_\mathcal{M}(x,da)P_\theta(dx) - \int \delta_\mathcal{N}(y,da)Q_\theta(dy)\right] \le \varepsilon. \ \ \ \ \ (5)

Since the LHS of (5) is bilinear in L(\theta,a)\pi(d\theta) and \delta_\mathcal{M}(x,da), both of which range over convex sets (e.g., the domain for M(\theta,a) := L(\theta,a)\pi(d\theta) is exactly \{M\in [0,1]^{\Theta\times \mathcal{A}}: \sum_\theta \|M(\theta, \cdot)\|_\infty \le 1 \}), the minimax theorem allows us to swap \sup and \inf in (5) to obtain 

 \inf_{\delta_{\mathcal{M}}}\sup_{L(\theta,a),\pi(d\theta)}\iint L(\theta,a)\pi(d\theta)\left[\int \delta_\mathcal{M}(x,da)P_\theta(dx) - \int \delta_\mathcal{N}(y,da)Q_\theta(dy)\right] \le \varepsilon. \ \ \ \ \ (6)

By evaluating the inner supremum, (6) implies the existence of some \delta_\mathcal{M}^\star such that 

 \sup_{\theta\in\Theta} \frac{1}{2}\int_{\mathcal{A}} \left| \int_{\mathcal{X}} \delta_\mathcal{M}^\star(x,da)P_\theta(dx) - \int_{\mathcal{Y}} \delta_\mathcal{N}(y,da)Q_\theta(dy)\right| \le \varepsilon. \ \ \ \ \ (7)

Finally, choosing \mathcal{A}=\mathcal{Y} and \delta_\mathcal{N}(y,da) = 1(y=a) in (7), the corresponding \delta_\mathcal{M}^\star is the desired kernel \mathsf{K}.  \Box 
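The sufficiency direction of Theorem 5 is easy to verify numerically on finite spaces. The sketch below uses made-up 2x2 models, an arbitrary kernel, and a loss with values in [0,1]; all numbers and helper names are assumptions for illustration. It checks that the composed rule \delta_\mathcal{N} \circ \mathsf{K} satisfies (4) with \varepsilon equal to the worst-case total variation distance.

```python
def push(P_theta, K):
    """Distribution K P_theta of the kernel output when the input has law P_theta."""
    return [sum(P_theta[x] * K[x][y] for x in range(len(K))) for y in range(len(K[0]))]

def tv(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def compose(K, delta_N):
    """Decision rule delta_M = delta_N o K: pass x through K, then apply delta_N."""
    return [[sum(K[x][y] * delta_N[y][a] for y in range(len(delta_N)))
             for a in range(len(delta_N[0]))] for x in range(len(K))]

def risk(law, delta, loss_theta):
    """Risk sum_x sum_a law[x] * delta[x][a] * L(theta, a) for one fixed theta."""
    return sum(law[x] * delta[x][a] * loss_theta[a]
               for x in range(len(law)) for a in range(len(loss_theta)))

P = [[0.8, 0.2], [0.3, 0.7]]          # P[theta][x]
Q = [[0.7, 0.3], [0.35, 0.65]]        # Q[theta][y]
K = [[0.9, 0.1], [0.2, 0.8]]          # K[x][y], rows sum to one
delta_N = [[1.0, 0.0], [0.0, 1.0]]    # delta_N[y][a]
loss = [[0.0, 1.0], [0.5, 0.0]]       # loss[theta][a], values in [0, 1]

eps = max(tv(Q[t], push(P[t], K)) for t in range(2))
delta_M = compose(K, delta_N)
gaps = [risk(P[t], delta_M, loss[t]) - risk(Q[t], delta_N, loss[t]) for t in range(2)]
print(eps, gaps)  # every gap is at most eps, as in (4)
```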

Based on the notion of deficiency, we are ready to define the distance between statistical models, also known as Le Cam’s distance.

Definition 6 (Le Cam’s Distance) For two statistical models \mathcal{M} and \mathcal{N} with the same parameter set \Theta, Le Cam’s distance \Delta(\mathcal{M},\mathcal{N}) is defined as the infimum of \varepsilon\ge 0 such that \mathcal{M} is \varepsilon-deficient relative to \mathcal{N}, and \mathcal{N} is \varepsilon-deficient relative to \mathcal{M}.

It is a simple exercise to show that Le Cam’s distance is a pseudo-metric in the sense that it is symmetric and satisfies the triangle inequality. The main importance of Le Cam’s distance is that it helps to establish equivalence between some statistical models, and people are typically interested in the case where \Delta(\mathcal{M},\mathcal{N})=0 or \lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0. The main idea is to use randomization (i.e., Theorem 5) to obtain an upper bound on Le Cam’s distance, and then apply Definition 4 to deduce useful results (e.g., to carry over an asymptotically optimal procedure in one model to other models). 

In the remainder of this lecture, I will give some examples of models whose distance is zero or asymptotically zero. In the next lecture it will be shown that regular models will always be close to some Gaussian location model asymptotically, and thereby the classical asymptotic theory of statistics can be established. 

3. Equivalence between Models 

3.1. Sufficiency 

We first examine the case where \Delta(\mathcal{M},\mathcal{N})=0. By Theorem 5, models \mathcal{M} and \mathcal{N} are mutual randomizations. In the special case where Y=T(X)\sim Q_\theta is a deterministic function of X\sim P_\theta (thus Q_\theta=P_\theta\circ T^{-1} is the push-forward measure of P_\theta through T), we have the following result. 

Theorem 7 Under the above setting, \Delta(\mathcal{M},\mathcal{N})=0 if and only if \theta-Y-X forms a Markov chain. 

Note that the Markov condition \theta-Y-X is the usual definition of sufficient statistics, and also gives the well-known Fisher–Neyman factorization criterion for sufficiency. Hence, sufficiency is in fact a special case of model equivalence, and deficiency can be thought of as approximate sufficiency. 
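As an illustration of the Markov condition \theta-Y-X, consider a toy case that is not in the lecture: X = (X_1, X_2, X_3) i.i.d. Bernoulli(\theta) and Y = T(X) = X_1+X_2+X_3. The sketch below (helper name assumed) computes the conditional law of X given Y exactly and confirms that it does not depend on \theta, so a \theta-free kernel from Y back to X exists.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_given_sum(theta, n):
    """Map x -> P_theta(X = x | T = sum(x)) for X i.i.d. Bernoulli(theta)."""
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    marginal = {}
    for x, p in joint.items():
        marginal[sum(x)] = marginal.get(sum(x), 0) + p  # P_theta(T = t)
    return {x: p / marginal[sum(x)] for x, p in joint.items()}

a = conditional_given_sum(Fraction(1, 4), 3)
b = conditional_given_sum(Fraction(2, 3), 3)
print(a == b)  # the conditional law is the same for both parameters
```

Given the sum t, every sequence with t ones is equally likely, so the kernel simply draws a uniformly random arrangement.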

3.2. Equivalence between Multinomial and Poissonized Models 

Consider a discrete probability vector P=(p_1,\cdots,p_k) with p_i\ge 0, \sum_{i=1}^k p_i=1. A widely-used model in practice is the multinomial model \mathcal{M}_n, which models the i.i.d. sampling process and draws i.i.d. observations X_1,\cdots, X_n\sim P. However, a potential difficulty in handling multinomial models is that the empirical frequencies \hat{p}_1, \cdots, \hat{p}_k of symbols are dependent, which complicates the analysis. To overcome this difficulty, a common procedure is to consider a Poissonized model \mathcal{N}_n, where we first draw a Poisson random variable N\sim \text{Poisson}(n) and then observe i.i.d. X_1,\cdots,X_N\sim P. Due to the nice properties of Poisson random variables, the empirical frequencies now follow independent scaled Poisson distributions. 
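The independence claim can be checked on the level of probability mass functions: under Poissonized sampling, the joint pmf of the symbol counts factorizes into a product of \text{Poisson}(np_i) pmfs. The following sketch (helper names and the numerical values are assumptions) compares the two expressions for one count vector.

```python
from math import exp, factorial, isclose, prod

def poisson_pmf(lam, c):
    return exp(-lam) * lam ** c / factorial(c)

def poissonized_count_pmf(counts, n, p):
    """P(counts) when N ~ Poisson(n) and, given N, the counts are Multinomial(N, p)."""
    total = sum(counts)
    multinomial_coef = factorial(total)
    for c in counts:
        multinomial_coef //= factorial(c)  # exact integer division at every step
    return poisson_pmf(n, total) * multinomial_coef * prod(q ** c for q, c in zip(p, counts))

def independent_poisson_pmf(counts, n, p):
    """Product of Poisson(n * p_i) pmfs, one per symbol."""
    return prod(poisson_pmf(n * q, c) for q, c in zip(p, counts))

counts, n, p = (2, 0, 3), 4, (0.2, 0.3, 0.5)
print(isclose(poissonized_count_pmf(counts, n, p), independent_poisson_pmf(counts, n, p)))
```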

The next theorem shows that the multinomial and Poissonized models are asymptotically equivalent, which means that it actually does no harm to consider the more convenient Poissonized model for analysis, at least asymptotically. In later lectures I will also show a non-asymptotic result between these two models. 

Theorem 8 For fixed k, \lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0.

Proof: We only show that \mathcal{M}_n is \varepsilon_n-deficient relative to \mathcal{N}_n, with \lim_{n\rightarrow\infty} \varepsilon_n=0; the other direction is analogous. By Theorem 5, it suffices to show that \mathcal{N}_n is an approximate randomization of \mathcal{M}_n. The randomization procedure is as follows: based on the observations X_1,\cdots,X_n under the multinomial model, let P_n=(\hat{p}_1,\cdots,\hat{p}_k) be the vector of empirical frequencies. Next draw an independent random variable N\sim \text{Poisson}(n). If N\le n, let (X_1,\cdots,X_N) be the output of the kernel. Otherwise, we generate i.i.d. samples X_{n+1}', \cdots, X_N'\sim P_n, and let (X_1,\cdots,X_n,X_{n+1}',\cdots,X_N') be the output. We remark that it is important that the above randomization procedure does not depend on the unknown P.
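The randomization procedure just described can be sketched as follows. This is a minimal illustration using only the standard library: `sample_poisson` is a simple Poisson sampler (Knuth's multiplication method, an implementation choice not taken from the lecture), and the optional argument `N` is a hypothetical hook that fixes the Poisson draw for testing.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate lam."""
    threshold, k, running = math.exp(-lam), 0, rng.random()
    while running > threshold:
        k += 1
        running *= rng.random()
    return k

def poissonize(samples, rng, N=None):
    """Map multinomial samples X_1..X_n to a sequence of length N ~ Poisson(n).

    Uses only the observed samples (never the unknown P): extra draws come
    from the empirical distribution P_n, i.e. uniform resampling of `samples`.
    """
    n = len(samples)
    if N is None:
        N = sample_poisson(n, rng)
    if N <= n:
        return list(samples[:N])            # keep the first N observations
    extra = [rng.choice(samples) for _ in range(N - n)]
    return list(samples) + extra            # append N - n draws from P_n

rng = random.Random(1)
xs = [0, 1, 1, 2, 0]
print(len(poissonize(xs, rng, N=8)), poissonize(xs, rng, N=3))
```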

Let \mathcal{N}_P, \mathcal{N}_P' be the distribution of the Poissonized and randomized model under true parameter P, respectively. Now it is easily shown that 

 \|\mathcal{N}_P- \mathcal{N}_P' \|_{\text{TV}} = \mathop{\mathbb E}_m \mathop{\mathbb E}_{X^n} \|P_n^{\otimes m} - P^{\otimes m} \|_{\text{TV}}, \ \ \ \ \ (8)

where m:=(N-n)_+, P^{\otimes m} denotes the m-fold product of P, \mathop{\mathbb E}_m takes the expectation w.r.t. m, and \mathop{\mathbb E}_{X^n} takes the expectation w.r.t. random samples X^n\sim P. To upper bound the total variation distance in (8), we shall need the following lemma. 

Lemma 9 Let D_{\text{\rm KL}}(P\|Q) = \int dP\log \frac{dP}{dQ} and \chi^2(P,Q) = \int \frac{(dP-dQ)^2}{dQ} be the KL divergence and \chi^2-divergence, respectively. Then

 2\|P-Q \|_{\text{\rm TV}}^2 \le D_{\text{\rm KL}}(P\|Q) \le \chi^2(P,Q).
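The chain of inequalities in Lemma 9 can be checked numerically for discrete distributions. The two probability vectors below are arbitrary illustrative choices, and the function names are assumptions.

```python
import math

def tv(p, q):
    """Total variation distance (1/2) * sum |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence sum p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def chi2(p, q):
    """Chi-squared divergence sum (p_i - q_i)^2 / q_i; assumes q_i > 0."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
print(2 * tv(p, q) ** 2, kl(p, q), chi2(p, q))  # listed in increasing order
```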

The proof of Lemma 9 will be given in later lectures when we talk about joint ranges of divergences. Then by Lemma 9 and Jensen’s inequality, 

 \begin{array}{rcl} \mathop{\mathbb E}_{X^n} \|P_n^{\otimes m} - P^{\otimes m} \|_{\text{TV}} & \le & \mathop{\mathbb E}_{X^n}\sqrt{\frac{1}{2} D_{\text{KL}}(P_n^{\otimes m}\|P^{\otimes m} ) }\\ &=& \mathop{\mathbb E}_{X^n}\sqrt{\frac{m}{2} D_{\text{KL}}(P_n\|P ) } \\ &\le& \mathop{\mathbb E}_{X^n}\sqrt{\frac{m}{2} \chi^2(P_n,P ) }\\ &\le& \sqrt{\frac{m}{2} \mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) }. \end{array}

Since by simple algebra, 

 \mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) = \sum_{i=1}^k \frac{\mathop{\mathbb E}_{X^n} (\hat{p}_i-p_i)^2 }{p_i} = \sum_{i=1}^k \frac{p_i(1-p_i)}{np_i} = \frac{k-1}{n},

then by (8) we obtain that 

 \|\mathcal{N}_P- \mathcal{N}_P' \|_{\text{TV}} \le \mathop{\mathbb E}_m \sqrt{\frac{m(k-1)}{2n}} \le \sqrt{\frac{k-1}{2n}}\cdot (\mathop{\mathbb E} m^2)^{\frac{1}{4}} \le \sqrt{\frac{k-1}{2\sqrt{n}}},

which goes to zero uniformly in P as n\rightarrow\infty, as desired.  \Box 
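The identity \mathop{\mathbb E}_{X^n}\chi^2(P_n,P) = (k-1)/n used in the proof can be confirmed exactly by brute-force enumeration for small n and k. The sketch below does this with rational arithmetic and also evaluates the final bound \sqrt{(k-1)/(2\sqrt{n})}; the example distribution and function names are assumptions.

```python
import math
from fractions import Fraction
from itertools import product

def expected_chi2(p, n):
    """Exact E chi^2(P_n, P) over n i.i.d. draws from p, by enumerating all k^n sequences."""
    k = len(p)
    total = Fraction(0)
    for seq in product(range(k), repeat=n):
        prob = Fraction(1)
        for s in seq:
            prob *= p[s]
        # chi^2 divergence between the empirical frequencies and p
        chi2 = sum((Fraction(seq.count(i), n) - p[i]) ** 2 / p[i] for i in range(k))
        total += prob * chi2
    return total

def tv_bound(k, n):
    """Final upper bound sqrt((k - 1) / (2 sqrt(n))) from the proof of Theorem 8."""
    return math.sqrt((k - 1) / (2 * math.sqrt(n)))

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(expected_chi2(p, 4) == Fraction(3 - 1, 4))
print(tv_bound(3, 10 ** 2), tv_bound(3, 10 ** 6))  # decreases slowly, at rate n^(-1/4)
```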

Remark 1 In fact, the following non-asymptotic characterization

\Delta(\mathcal{M}_{n,k}, \mathcal{N}_{n,k}) \asymp \min\{1, \sqrt{k/n} \}

could be established. This result is contained in an upcoming paper.

3.3. Equivalence between Nonparametric Regression and Gaussian White Noise Models 

A well-known problem in nonparametric statistics is the nonparametric regression: 

 y_i = f\left(\frac{i}{n}\right) + \sigma\xi_i, \qquad i=1,\cdots,n, \quad \xi_i\overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1), \ \ \ \ \ (9)

where the underlying regression function f is unknown, and \sigma>0 is some noise level. Typically, the statistical goal is to recover the function f at some point or globally, and some smoothness conditions are necessary to perform this task. A typical assumption is that f belongs to some Hölder ball \mathcal{H}^s(L), where 

 \mathcal{H}^s(L) := \left\{f\in C[0,1]: \sup_{x\neq y}\frac{|f^{(m)}(x) - f^{(m)}(y)| }{|x-y|^\alpha} \le L\right\},

where s=m+\alpha, with m\in {\mathbb N} and \alpha\in (0,1], denotes the smoothness parameter. 

There is also a continuous version of (9) called the Gaussian white noise model, where a process (Y_t)_{t\in [0,1]} satisfying the following stochastic differential equation is observed: 

 dY_t = f(t)dt + \frac{\sigma}{\sqrt{n}}dB_t, \qquad t\in [0,1], \ \ \ \ \ (10)

where (B_t)_{t\in [0,1]} is the standard Brownian motion. Compared with the regression model in (9), the white noise model in (10) gets rid of the quantization issue of [0,1] and is therefore easier to analyze. Note that in both models n is effectively the sample size. 

Let \mathcal{M}_n, \mathcal{N}_n be the regression and white noise models, respectively, with known parameters (\sigma,L) and the parameter set f\in \mathcal{H}^s(L). The main result in this section is that, when s>1/2, these models are asymptotically equivalent. 

Theorem 10 If s>1/2, we have \lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0.

Proof: Consider another Gaussian white noise model \mathcal{N}_n^\star where the only difference is to replace f in (10) by f^\star defined as 

 f^\star(t) = \sum_{i=1}^n f\left(\frac{i}{n}\right) 1\left(\frac{i-1}{n}\le t<\frac{i}{n}\right), \qquad t\in [0,1].

Note that under the same parameter f, we have 

 \begin{array}{rcl} D_{\text{KL}}(P_{Y_{[0,1]}^\star} \| P_{Y_{[0,1]}}) &=& \frac{n}{2\sigma^2}\int_0^1 (f(t) - f^\star(t))^2dt\\ & =& \frac{n}{2\sigma^2}\sum_{i=1}^n \int_{(i-1)/n}^{i/n} (f(t) - f(i/n))^2dt \\ & \le & \frac{L^2}{2\sigma^2}\cdot n^{1-2(s\wedge 1)}, \end{array}

which goes to zero uniformly in f as n\rightarrow\infty. Therefore, by Theorem 5 and Lemma 9, we have \Delta(\mathcal{N}_n, \mathcal{N}_n^\star)\rightarrow 0. On the other hand, in the model \mathcal{N}_n^\star the likelihood ratio between the signal distribution P_{Y^\star} and the pure noise distribution P_{Z^\star} is 

 \begin{array}{rcl} \frac{dP_{Y^\star}}{dP_{Z^\star}}((Y_t^\star)_{t\in [0,1]}) &=& \exp\left(\frac{n}{2\sigma^2}\left(\int_0^1 2f^\star(t)dY_t^\star-\int_0^1 f^\star(t)^2 dt \right)\right) \\ &=& \exp\left(\frac{n}{2\sigma^2}\left(\sum_{i=1}^n 2f(i/n)(Y_{i/n}^\star - Y_{(i-1)/n}^\star) -\int_0^1 f^\star(t)^2 dt \right)\right). \end{array}

As a result, under model \mathcal{N}_n^\star, there is a Markov chain f \rightarrow (n(Y_{i/n}^\star - Y_{(i-1)/n}^\star))_{i\in [n]}\rightarrow (Y_t^\star)_{t\in [0,1]}. Since under the same parameter f, (n(Y_{i/n}^\star - Y_{(i-1)/n}^\star))_{i\in [n]} under \mathcal{N}_n^\star is identically distributed as (y_i)_{i\in [n]} under \mathcal{M}_n, by Theorem 7 we have exact sufficiency and conclude that \Delta(\mathcal{M}_n, \mathcal{N}_n^\star)=0. Then the rest follows from the triangle inequality.  \Box 
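To get a feel for the discretization step in this proof, the sketch below numerically evaluates the KL divergence \frac{n}{2\sigma^2}\int_0^1 (f(t)-f^\star(t))^2dt for one concrete function. The choice f(t) = Lt (which is L-Lipschitz, so s=1), the midpoint-rule integration, and the function name are illustrative assumptions. For this f, a direct computation gives the value L^2/(6\sigma^2 n), which sits below the bound L^2/(2\sigma^2)\cdot n^{-1}.

```python
def kl_discretization(f, n, sigma, grid=200):
    """Approximate n / (2 sigma^2) * int_0^1 (f(t) - f*(t))^2 dt with a midpoint rule.

    f*(t) = f(i/n) on [(i-1)/n, i/n); `grid` midpoints are used per cell.
    """
    total = 0.0
    for i in range(1, n + 1):
        right = f(i / n)                       # value of f* on the i-th cell
        for j in range(grid):
            t = (i - 1) / n + (j + 0.5) / (grid * n)
            total += (f(t) - right) ** 2 / (grid * n)
    return n / (2 * sigma ** 2) * total

L, sigma = 2.0, 1.0
f = lambda t: L * t   # L-Lipschitz, so s = 1 in the Holder-ball notation

for n in (50, 100):
    value = kl_discretization(f, n, sigma)
    bound = L ** 2 / (2 * sigma ** 2) / n      # L^2 / (2 sigma^2) * n^(1 - 2s) with s = 1
    print(n, value, bound)
```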

3.4. Equivalence between Density Estimation and Gaussian White Noise Models 

Another widely-used model in nonparametric statistics is the density estimation model, where samples X_1,\cdots,X_n are i.i.d. drawn from some unknown density f. Typically some smoothness condition is also necessary for the density, and we assume that f again belongs to the Hölder ball \mathcal{H}^s(L). 

Compared with the previous results, a slightly more involved result is that the density estimation model, albeit with a seemingly different form, is also asymptotically equivalent to a proper Gaussian white noise model. However, here the Gaussian white noise model should take the following different form: 

 dY_t = \sqrt{f(t)}dt + \frac{1}{2\sqrt{n}}dB_t, \qquad t\in [0,1]. \ \ \ \ \ (11)

In other words, in nonparametric statistics the problems of density estimation, regression and estimation in Gaussian white noise are all asymptotically equivalent, under certain smoothness conditions. 

Let \mathcal{M}_n, \mathcal{N}_n be the density estimation model and the Gaussian white noise model in (11), respectively. The main result is summarized in the following theorem. 

Theorem 11 If s>1/2 and the density f is bounded away from zero everywhere, then \lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0.

Proof: Instead of the original density estimation model, we consider a Poissonized sampling model \mathcal{M}_{n,P}, where the observation is a Poisson process (Z_t)_{t\in [0,1]} on [0,1] with intensity nf(t). Similar to the proof of Theorem 8, we have \Delta(\mathcal{M}_n, \mathcal{M}_{n,P})\rightarrow 0, and it remains to show that \Delta(\mathcal{N}_n, \mathcal{M}_{n,P})\rightarrow 0.

Fix an equally spaced grid t_0=0,t_1, t_2, \cdots, t_m=1 in [0,1] with m=n^{1-\varepsilon}, where \varepsilon>0 is a small constant depending only on s. Next we come up with two new models \mathcal{M}_{n,P}^\star and \mathcal{N}_n^\star, where the only difference is that the parameter f is replaced by f^\star defined as 

 f^\star(t) = \sum_{i=1}^m f(t_i) 1\left(t_{i-1}\le t<t_i\right), \qquad t\in [0,1].

As long as 2s(1-\varepsilon)>1, the same arguments as in the proof of Theorem 10 can be applied to arrive at \Delta(\mathcal{M}_{n,P}^\star, \mathcal{M}_{n,P}), \Delta(\mathcal{N}_{n}^\star, \mathcal{N}_{n})\rightarrow 0 (for the white noise model, the assumption that f is bounded away from zero ensures the smoothness of \sqrt{f}). Hence, it further suffices to focus on the new models and show that \Delta(\mathcal{M}_{n,P}^\star, \mathcal{N}_n^\star)\rightarrow 0. An interesting observation is that under the model \mathcal{M}_{n,P}^\star, the vector of bin counts \mathbf{Y}=(Y_1,\cdots,Y_m) with 

 Y_i = \sum_{j=1}^N 1(t_{i-1}\le X_j<t_i) \sim \text{Poisson}\left(n^\varepsilon f(t_i) \right), \quad i\in [m]

is sufficient, where X_1,\cdots,X_N denote the points of the Poisson process. Moreover, under the model \mathcal{N}_n^\star, the vector \mathbf{Z}=(Z_1,\cdots,Z_m) of scaled increments of the white noise process (Y_t)_{t\in [0,1]} with 

 Z_i = \sqrt{nm}(Y_{t_i} - Y_{t_{i-1}}) \sim \mathcal{N}\left(\sqrt{n^\varepsilon f(t_i)}, \frac{1}{4}\right), \quad i\in [m]

is also sufficient. (In line with the rest of the proof, \mathbf{Y} denotes the Poisson counts and \mathbf{Z} the Gaussian increments; the count Y_i should not be confused with the process value Y_{t_i}.) Further, the entries of \mathbf{Y} are mutually independent, and so are the entries of \mathbf{Z}. Hence, the ultimate goal is to find mutual randomizations between \mathbf{Y} and \mathbf{Z} for f\in \mathcal{H}^s(L).

To do so, a first attempt would be to find a bijective mapping Y_i \leftrightarrow Z_i independently for each i. However, this approach would lose useful information from the neighbors as we know that f(t_i)\approx f(t_{i+1}) thanks to the smoothness of f. For example, we have Y_1|Y_1+Y_2 \sim \text{Binomial}(Y_1+Y_2, p) with p = \frac{f(t_1)}{f(t_1) + f(t_2)}\approx \frac{1}{2}, and Z_1 - Z_2\sim \mathcal{N}(\mu, \frac{1}{2}) with \mu = n^{\varepsilon/2}(\sqrt{f(t_1)} - \sqrt{f(t_2)})\approx 0. Motivated by this fact, we represent \mathbf{Y} and \mathbf{Z} in the following bijective way (assume that m is even): 

 \begin{array}{rcl} \mathbf{Y}' &=& (Y_1, Y_1 + Y_2, Y_3, Y_3 + Y_4, \cdots, Y_{m-1}, Y_{m-1} + Y_m), \\ \mathbf{Z}' &=& (Z_1-Z_2, Z_1+Z_2, \cdots, Z_{m-1}-Z_m, Z_{m-1}+Z_m). \end{array}

Since (Y_1+Y_2,Y_3+Y_4,\cdots,Y_{m-1}+Y_m) is again an independent Poisson vector, we may repeat the above transformation for this new vector. Similar things also hold for \mathbf{Z}'. Hence, at each iteration we may leave half of the components unchanged, and apply the above transformations to the other half. We repeat the iteration \log_2 \sqrt{n} times (assuming \sqrt{n} is a power of 2), so that finally we arrive at a vector of length m/\sqrt{n} = n^{1/2-\varepsilon} consisting of sums. Let \mathbf{Y}^{(1)} (resp. \mathbf{Z}^{(1)}) be the final vector of sums, and \mathbf{Y}^{(2)} (resp. \mathbf{Z}^{(2)}) be the vector of remaining entries which are left unchanged at some iteration. 
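The iterated pairing for the Poisson-side vector can be sketched as follows. The function names `haar_split` and `haar_merge` are illustrative assumptions; the sketch only demonstrates that the representation (final sums plus the entries kept at each level) is a bijection. The Gaussian-side vector would use differences in place of the kept entries.

```python
def haar_split(v, levels):
    """Repeatedly map (v1, v2, v3, v4, ...) to kept entries (v1, v3, ...) and sums (v1+v2, v3+v4, ...).

    Returns (sums, kept) where `sums` is the final vector of block sums and
    `kept` lists, level by level, the entries left unchanged at that level.
    """
    kept = []
    sums = list(v)
    for _ in range(levels):
        assert len(sums) % 2 == 0, "length must be divisible by 2 at every level"
        kept.append(sums[0::2])
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
    return sums, kept

def haar_merge(sums, kept):
    """Inverse of haar_split: undo the levels from coarsest to finest."""
    cur = list(sums)
    for first in reversed(kept):
        # Each pair is recovered as (first entry, pair sum - first entry).
        cur = [x for a, s in zip(first, cur) for x in (a, s - a)]
    return cur

v = [3, 1, 4, 1, 5, 9, 2, 6]
sums, kept = haar_split(v, 2)
print(sums, kept, haar_merge(sums, kept) == v)
```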

Remark 2 Experienced readers may have noticed that these are the wavelet coefficients under the Haar wavelet basis, where superscripts 1 and 2 stand for father and mother wavelets, respectively. 

Next we are ready to describe the randomization procedure. For notational simplicity we will write Y_1+Y_2 as a representative example of an entry in \mathbf{Y}^{(1)}, and write Y_1 as a representative example of an entry in \mathbf{Y}^{(2)}.

For entries in \mathbf{Y}^{(1)}, note that by the delta method, for Y\sim \text{Poisson}(\lambda), the random variable \sqrt{Y} is approximately distributed as \mathcal{N}(\sqrt{\lambda},1/4) (in fact, the square root is the variance-stabilizing transformation for Poisson random variables). The exact transformation is then given by 

 Y_1 + Y_2 \mapsto \text{sign}(Y_1 + Y_2 +U)\cdot \sqrt{|Y_1 + Y_2 + U|}, \ \ \ \ \ (12)

where U\sim \text{Uniform}([-1/2,1/2]) is an independent auxiliary variable. The mapping (12) is one-to-one and can thus be inverted as well. 
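The variance-stabilizing property behind (12) can be checked numerically without simulation. The sketch below (function name and truncation rule are assumptions) sums the Poisson pmf to compute the mean and variance of \sqrt{Y} for Y\sim \text{Poisson}(\lambda); it ignores the auxiliary variable U and the sign in (12), and only illustrates the \mathcal{N}(\sqrt{\lambda},1/4) approximation for large \lambda.

```python
import math

def sqrt_poisson_moments(lam, cutoff=None):
    """Mean and variance of sqrt(Y) for Y ~ Poisson(lam), by summing the pmf up to `cutoff`."""
    if cutoff is None:
        cutoff = int(lam + 20 * math.sqrt(lam) + 50)   # far into the upper tail
    mean = second = 0.0
    for y in range(cutoff + 1):
        # pmf computed in log-space to avoid overflow in lam ** y and y!
        pmf = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
        mean += pmf * math.sqrt(y)
        second += pmf * y          # E[(sqrt Y)^2] = E[Y]
    return mean, second - mean ** 2

for lam in (10.0, 100.0, 1000.0):
    mean, var = sqrt_poisson_moments(lam)
    print(lam, mean - math.sqrt(lam), var)  # bias shrinks and variance approaches 1/4
```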

For entries in \mathbf{Y}^{(2)}, we aim to use quantile transformations to convert \text{Binomial}(Y_1+Y_2, 1/2) to \mathcal{N}(0,1/2). For k\ge 0, let F_k be the CDF of \text{Binomial}(k, 1/2), and \Phi be the CDF of \mathcal{N}(0,1). Then the one-to-one quantile transformation is given by 

 Y_1 \mapsto \frac{1}{\sqrt{2}}\Phi^{-1}(F_{Y_1+Y_2}(Y_1+U)), \ \ \ \ \ (13)

where again U\sim \text{Uniform}([-1/2,1/2]) is an independent auxiliary variable. The output given by (13) will be expected to be close in distribution to Z_1-Z_2, and the overall transformation is also invertible. 

The approximation properties of these transformations are summarized in the following theorem. 

Theorem 12 Sticking to the specific examples of Y_1 and Y_1 + Y_2, let P_1, P_2 be the respective distributions of the RHS in (12) and (13), and Q_1, Q_2 be the respective distributions of Z_1 + Z_2 and Z_1 - Z_2. Then

 \begin{array}{rcl} H^2(P_1, Q_1) & \le & \frac{C}{n^\varepsilon (f(t_1) + f(t_2))}, \\ H^2(P_2, Q_2) & \le & C\left(\frac{f(t_1)-f(t_2)}{f(t_1)+f(t_2)} \right)^2 + Cn^\varepsilon \left(\frac{f(t_1)-f(t_2)}{f(t_1)+f(t_2)} \right)^4, \end{array}

where C>0 is some universal constant, and H^2(P,Q) := \int (\sqrt{dP}-\sqrt{dQ})^2 denotes the squared Hellinger distance. 

The proof of Theorem 12 is purely probabilistic and involved, and is omitted here. To apply Theorem 12 to the vector \mathbf{Y}^{(1)} of length m/\sqrt{n}, note that each component is the sum of \sqrt{n} elements, with f bounded away from zero. Consequently, letting \mathsf{K} be the overall transition kernel of the randomization, the inequality H^2(\otimes_i P_i, \otimes_i Q_i)\le \sum_i H^2(P_i,Q_i) gives 

 H^2(\mathsf{K}P_{\mathbf{Y}^{(1)}}, P_{\mathbf{Z}^{(1)}}) = O\left( \frac{m}{\sqrt{n}}\cdot \frac{1}{n^\varepsilon \cdot n^{1/2}} \right) = O(n^{-2\varepsilon}) \rightarrow 0.

As for the vector \mathbf{Y}^{(2)}, the components lie in \ell_{\max} := \log_2 \sqrt{n} possible different levels. At level \ell\in [\ell_{\max}], the spacing of the grid becomes n^{-1+\varepsilon}\cdot 2^{\ell}, and there are m\cdot 2^{-\ell} elements. Also, we have (f(t_1)-f(t_2))/(f(t_1)+f(t_2))=O(n^{(\varepsilon-1)s'}\cdot 2^{\ell s'}) at \ell-th level, with s':= s\wedge 1. Consequently, 

 \begin{array}{rcl} H^2(\mathsf{K}P_{\mathbf{Y}^{(2)}}, P_{\mathbf{Z}^{(2)}}) &\le & \sum_{\ell=1}^{\ell_{\max}} m2^{-\ell}\cdot O\left((n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})^2 + n^\varepsilon 2^\ell\cdot (n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})^4 \right) \\ &= & O\left(n^{-(1-\varepsilon)(2s'-1)}\cdot 2^{(2s'-1)\ell_{\max}} + n\cdot (n^{-1+\varepsilon}2^{\ell_{\max}})^{4s'} \right) \\ &=& O\left(n^{-(1/2-\varepsilon)(2s'-1)} + n^{1-2s'(1-2\varepsilon)} \right). \end{array}

Since s'>1/2, we may choose \varepsilon to be sufficiently small (i.e., 2s'(1-2\varepsilon)>1) to make H^2(\mathsf{K}P_{\mathbf{Y}^{(2)}}, P_{\mathbf{Z}^{(2)}}) = o(1). The proof is completed.  \Box 

4. Bibliographic Notes 

The statistical decision theory framework dates back to Wald (1950), and is now part of the standard graduate curriculum in statistics. There are many excellent textbooks on this topic, e.g., Lehmann and Casella (2006) and Lehmann and Romano (2006). The concept of model deficiency is due to Le Cam (1964), where the randomization criterion (Theorem 5) was proved. The present form is taken from Torgersen (1991). We also refer to the excellent monographs by Le Cam (1986) and Le Cam and Yang (1990). 

The asymptotic equivalence between nonparametric models has been studied in a series of papers since the 1990s. The equivalence between nonparametric regression and Gaussian white noise models (Theorem 10) was established in Brown and Low (1996), where both the fixed and random designs were studied. It was also shown in a follow-up work (Brown and Zhang 1998) that these models are non-equivalent if s\le 1/2. The equivalence of the density estimation model and others (Theorem 11) was established in Brown et al. (2004), and we also refer to a recent work (Ray and Schmidt-Hieber 2016) which relaxed the crucial assumption that the density is bounded away from zero. Poisson approximation or Poissonization is a well-known technique widely used in probability theory, statistics and theoretical computer science, and the current treatment is essentially taken from Brown et al. (2004). 

  1. Abraham Wald, Statistical decision functions. Wiley, 1950. 
  2. Erich L. Lehmann and George Casella, Theory of point estimation. Springer Science & Business Media, 2006. 
  3. Erich L. Lehmann and Joseph P. Romano, Testing statistical hypotheses. Springer Science & Business Media, 2006. 
  4. Lucien M. Le Cam, Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics (1964): 1419-1455. 
  5. Erik Torgersen, Comparison of statistical experiments. Cambridge University Press, 1991. 
  6. Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986. 
  7. Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990. 
  8. Lawrence D. Brown and Mark G. Low, Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics 24.6 (1996): 2384-2398. 
  9. Lawrence D. Brown and Cun-Hui Zhang, Asymptotic nonequivalence of nonparametric experiments when the smoothness index is 1/2. The Annals of Statistics 26.1 (1998): 279-287. 
  10. Lawrence D. Brown, Andrew V. Carter, Mark G. Low, and Cun-Hui Zhang, Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. The Annals of Statistics 32.5 (2004): 2074-2097. 
  11. Kolyan Ray and Johannes Schmidt-Hieber, The Le Cam distance between density estimation, Poisson processes and Gaussian white noise. arXiv preprint arXiv:1608.01824 (2016). 
