(Warning: These materials may contain many typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

This lecture starts to talk about specific tools and ideas to prove information-theoretic lower bounds. We will temporarily restrict ourselves to statistical inference problems (to which most lower bounds apply), where the presence of randomness is a key feature. Since the observer cannot control the realizations of randomness, the information contained in the observations, albeit not necessarily in a discrete structure (e.g., those in Lecture 2), can still be limited. In this lecture and subsequent ones, we will introduce the *reduction* and *hypothesis testing* ideas to prove lower bounds of statistical inference, and these ideas will also be applied to other problems.

**1. Basics of Statistical Decision Theory**

To introduce statistical inference problems, we first review some basics of statistical decision theory. Mathematically, let $(\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta\in\Theta})$ be a collection of probability triplets indexed by $\theta$, and the observation $X$ is a random variable following distribution $P_\theta$ for an *unknown* parameter $\theta$ in the *known* parameter set $\Theta$. The specific structure of $(P_\theta)_{\theta\in\Theta}$ is typically called *models* or *experiments*, for the parameter $\theta$ can represent different model parameters or theories to explain the observation $X$. A *decision rule*, or a *(randomized) estimator*, is a transition kernel $\delta(x, da)$ from $(\mathcal{X}, \mathcal{F})$ to some action space $\mathcal{A}$. Let $L: \Theta\times\mathcal{A}\to\mathbb{R}_+$ be a loss function, where $L(\theta, a)$ represents the loss of using action $a$ when the true parameter is $\theta$. A central quantity to measure the quality of a decision rule is the *risk* in the following definition.

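Definition 1 (Risk). Under the above notations, the risk of the decision rule $\delta$ under loss function $L$ and the true parameter $\theta$ is defined as

$$R_\theta(\delta) = \iint L(\theta, a)\, \delta(x, da)\, P_\theta(dx). \qquad (1)$$

Intuitively, the risk characterizes the expected loss (over the randomness in both the observation and the decision rule) of the decision rule when the true parameter is $\theta$. When $\delta(x, da)$ is a point mass at $T(x)$ for some deterministic function $T: \mathcal{X}\to\mathcal{A}$, we will also call $T(X)$ an *estimator*, and the risk in (1) becomes

$$R_\theta(T) = \int L(\theta, T(x))\, P_\theta(dx) = \mathbb{E}_\theta\left[L(\theta, T(X))\right]. \qquad (2)$$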

Example 1. In the linear regression model with random design, let the observations be i.i.d. with and . Here the parameter set is a finite-dimensional Euclidean space, and therefore we call this model parametric. The quantity of interest is , and the loss function may be chosen to be the prediction error of the linear estimator .

Example 2. In the density estimation model, let be i.i.d. drawn from some 1-Lipschitz density supported on . Here the parameter set of the unknown is the infinite-dimensional space of all possible 1-Lipschitz densities on , and we call this model non-parametric. The target may be to estimate the density at a point, the entire density, or some functional of the density. In respective settings, the loss functions can be

Example 3. By allowing general action spaces and loss functions, the decision-theoretic framework can also incorporate some non-statistical examples. For instance, in stochastic optimization may parameterize a class of convex Lipschitz functions , and denotes the noisy observations of the gradients at the queried points. Then the action space may just be the entire domain , and the loss function is the optimality gap defined as

In practice, one would like to find optimal decision rules for a given task. However, the risk is a function of and it is hard to compare two risk functions directly. Hence, people typically map the risk functions into scalars and arrive at the following *minimax* and *Bayesian* paradigm.
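To make the difficulty of comparing risk functions concrete, here is a minimal sketch in a setting of our own choosing (it is not one of the examples above): $n$ i.i.d. observations from $\mathcal{N}(\theta, \sigma^2)$ under squared error loss, and hypothetical estimators of the form $c\bar{X}$. By the bias-variance decomposition the risk is $c^2\sigma^2/n + (c-1)^2\theta^2$, so neither the sample mean ($c = 1$) nor a shrunk version ($c = 1/2$) dominates the other over all $\theta$.

```python
def risk_scaled_mean(theta: float, c: float, n: int, sigma: float = 1.0) -> float:
    """Exact squared-error risk of T(X) = c * mean(X_1..X_n), X_i ~ N(theta, sigma^2)."""
    variance = c * c * sigma * sigma / n   # variance of c * sample mean
    bias = (c - 1.0) * theta               # bias of c * sample mean
    return variance + bias * bias


n = 10
for theta in [0.0, 0.25, 0.5, 1.0, 2.0]:
    r_mean = risk_scaled_mean(theta, 1.0, n)     # sample mean
    r_shrunk = risk_scaled_mean(theta, 0.5, n)   # shrinkage towards zero
    better = "shrunk" if r_shrunk < r_mean else "mean"
    print(f"theta={theta:4.2f}  mean={r_mean:.4f}  shrunk={r_shrunk:.4f}  better={better}")
```

The two risk curves cross, which is why the minimax and Bayes paradigms below first reduce each risk function to a scalar.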

Definition 2 (Minimax and Bayes Decision Rule). The minimax decision rule is the decision rule $\delta$ which minimizes the quantity $\max_{\theta\in\Theta} R_\theta(\delta)$. The Bayes decision rule under distribution $\pi(d\theta)$ (called the prior distribution) is the decision rule $\delta$ which minimizes the quantity $\int R_\theta(\delta)\,\pi(d\theta)$.

It is typically hard to find the minimax decision rule in practice, while the Bayes decision rule admits a closed-form expression (although hard to compute in general).

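Proposition 3. The Bayes decision rule under prior $\pi$ is given by the estimator

$$T(x) \in \arg\min_{a\in\mathcal{A}} \int L(\theta, a)\,\pi(d\theta \mid x), \qquad (3)$$

where $\pi(d\theta\mid x)$ denotes the posterior distribution of $\theta$ under $\pi$ (assuming the existence of regular posterior).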

*Proof:* Left as an exercise for the reader.
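As a sanity check rather than a proof, the following sketch works through Proposition 3 on a finite toy problem. The prior, likelihood table, and 0-1 loss are made-up numbers, and the function names are ours. The rule that minimizes the posterior expected loss for each observation is compared against a brute-force search over all deterministic rules.

```python
from itertools import product


def bayes_rule(prior, lik, loss):
    """Return x -> action minimizing the posterior expected loss.

    prior[t] = pi(t), lik[t][x] = P_t(x), loss[t][a] = L(t, a).
    """
    n_theta, n_x, n_a = len(prior), len(lik[0]), len(loss[0])
    rule = []
    for x in range(n_x):
        # Unnormalized posterior weights; the normalizer does not change the argmin.
        w = [prior[t] * lik[t][x] for t in range(n_theta)]
        scores = [sum(w[t] * loss[t][a] for t in range(n_theta)) for a in range(n_a)]
        rule.append(min(range(n_a), key=scores.__getitem__))
    return rule


def bayes_risk(rule, prior, lik, loss):
    """Average risk of a deterministic rule under the prior."""
    return sum(
        prior[t] * lik[t][x] * loss[t][rule[x]]
        for t in range(len(prior))
        for x in range(len(rule))
    )


# Hypothetical toy problem: two parameters, three observations, two actions.
prior = [0.3, 0.7]
lik = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
loss = [[0.0, 1.0], [1.0, 0.0]]

rule = bayes_rule(prior, lik, loss)
brute_force = min(
    bayes_risk(list(r), prior, lik, loss) for r in product(range(2), repeat=3)
)
print(rule, bayes_risk(rule, prior, lik, loss), brute_force)
```

Searching over deterministic rules suffices here because the Bayes risk is linear in the decision rule, so a minimum is attained at a deterministic one.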

**2. Deficiency and Model Distance**

The central target of statistical inference is to propose some decision rule for a given statistical model with small risks. Similarly, the counterpart in the lower bound is to prove that certain risks are unavoidable for any decision rules. Here to compare risks, we may either compare the entire risk function, or its minimax or Bayes version. In this lecture we will focus on the risk function, and many later lectures will be devoted to appropriate minimax risks.

We start with the task of comparing two statistical models with the same parameter set . Intuitively, one may think that the model with a smaller noise level would be better than the other, e.g., the model should be better than . However, this criterion is bad due to two reasons:

- There is no proper notion of noise for general (especially non-additive) statistical models;
- Even if a natural notion of noise exists for certain models, it is not necessarily true that the model with smaller noise is always better. For example, let

the model has a smaller magnitude of the additive noise. However, despite the larger magnitude of its noise, the model is actually *noise-free*, since one can perfectly decode from the observation.

To overcome the above difficulties, we introduce the idea of model reduction. Specifically, we aim to establish the fact that for any decision rule in one model, there is another decision rule in the other model which is uniformly no worse than the former rule. The idea of reduction appears in many fields, e.g., in P/NP theory it is sufficient to work out one NP-complete instance (e.g., circuit satisfiability) from scratch and establish all others by polynomial reduction. This reduction idea is made precise via the following definition.

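Definition 4 (Model Deficiency). For two statistical models $\mathcal{M} = (\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta\in\Theta})$ and $\mathcal{N} = (\mathcal{Y}, \mathcal{G}, (Q_\theta)_{\theta\in\Theta})$, we say that $\mathcal{M}$ is $\varepsilon$-deficient relative to $\mathcal{N}$ if for any finite subset $\Theta_0\subseteq\Theta$, any finite action space $\mathcal{A}$, any loss function $L: \Theta_0\times\mathcal{A}\to[0,1]$, and any decision rule $\delta_{\mathcal{N}}$ based on model $\mathcal{N}$, there exists some decision rule $\delta_{\mathcal{M}}$ based on model $\mathcal{M}$ such that

$$R_\theta(\delta_{\mathcal{M}}) \le R_\theta(\delta_{\mathcal{N}}) + \varepsilon, \qquad \forall\, \theta\in\Theta_0. \qquad (4)$$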

Note that the definition of model deficiency does not involve the specific choice of the action space and loss function, and the finiteness of $\Theta_0$ and $\mathcal{A}$ in the definition is mainly for technical purposes. Intuitively speaking, $\mathcal{M}$ is $\varepsilon$-deficient relative to $\mathcal{N}$ if the entire risk function of *some* decision rule in $\mathcal{M}$ is no worse than that of *any* given decision rule in $\mathcal{N}$, within an additive gap $\varepsilon$.

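Although Definition 4 gives strong performance guarantees for the $\varepsilon$-deficient model $\mathcal{M}$, it is difficult to verify the condition (4), i.e., to transform an arbitrary decision rule $\delta_{\mathcal{N}}$ to some proper $\delta_{\mathcal{M}}$. A crucial observation here is that it may be easier to transform between models $\mathcal{M}$ and $\mathcal{N}$, and in particular, when $\mathcal{N}$ is a *randomization* of $\mathcal{M}$. For example, if there exists some stochastic kernel $\mathsf{K}: \mathcal{X}\to\mathcal{Y}$ such that $Q_\theta = \mathsf{K}P_\theta$ for all $\theta\in\Theta$ (where $\mathsf{K}P_\theta$ denotes the marginal distribution of the output of kernel $\mathsf{K}$ with input distributed as $P_\theta$), then we may simply set $\delta_{\mathcal{M}} = \delta_{\mathcal{N}}\circ\mathsf{K}$ to arrive at (4) with $\varepsilon = 0$. The following theorem shows that model deficiency is in fact equivalent to approximate randomization.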

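Theorem 5. Model $\mathcal{M}$ is $\varepsilon$-deficient with respect to $\mathcal{N}$ if and only if there exists some stochastic kernel $\mathsf{K}: \mathcal{X}\to\mathcal{Y}$ such that

$$\sup_{\theta\in\Theta} \|Q_\theta - \mathsf{K}P_\theta\|_{\text{TV}} \le \varepsilon,$$

where $\|P-Q\|_{\text{TV}} = \frac{1}{2}\int |dP - dQ|$ is the total variation distance between probability measures $P$ and $Q$.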

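*Proof:* The sufficiency part is easy. Given such a kernel $\mathsf{K}$ and a decision rule $\delta_{\mathcal{N}}$ based on model $\mathcal{N}$, we simply set $\delta_{\mathcal{M}} = \delta_{\mathcal{N}}\circ\mathsf{K}$, i.e., transmit the output through kernel $\mathsf{K}$ and apply $\delta_{\mathcal{N}}$. Then for any $\theta\in\Theta$,

$$R_\theta(\delta_{\mathcal{M}}) - R_\theta(\delta_{\mathcal{N}}) = \iint L(\theta, a)\,\delta_{\mathcal{N}}(y, da)\left[\mathsf{K}P_\theta(dy) - Q_\theta(dy)\right] \le \|Q_\theta - \mathsf{K}P_\theta\|_{\text{TV}} \le \varepsilon,$$

for the loss function is non-negative and upper bounded by one.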
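The sufficiency direction can be checked numerically on finite models. The sketch below uses made-up probability tables `P` (model $\mathcal{M}$), `Q` (model $\mathcal{N}$), a hypothetical kernel `K`, and a 0-1 loss. It composes a decision rule for $\mathcal{N}$ with the kernel and confirms that the risk gap never exceeds the largest total variation distance between $Q_\theta$ and the pushed-forward $\mathsf{K}P_\theta$.

```python
def push(p, kernel):
    """Distribution of the kernel output when the input has distribution p."""
    return [sum(p[x] * kernel[x][y] for x in range(len(p))) for y in range(len(kernel[0]))]


def tv(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def compose(kernel, rule):
    """Decision rule on X obtained by applying the kernel, then the rule on Y."""
    return [
        [sum(kernel[x][y] * rule[y][a] for y in range(len(rule))) for a in range(len(rule[0]))]
        for x in range(len(kernel))
    ]


def risk(p, rule, loss_row):
    """Risk of a randomized rule under observation distribution p and loss L(theta, .)."""
    return sum(p[x] * rule[x][a] * loss_row[a] for x in range(len(p)) for a in range(len(loss_row)))


# Hypothetical finite models with two parameters.
P = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]    # model M on three outcomes
Q = [[0.7, 0.3], [0.45, 0.55]]            # model N on two outcomes
K = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # stochastic kernel from X to Y
loss = [[0.0, 1.0], [1.0, 0.0]]           # 0-1 loss, values in [0, 1]
delta_N = [[1.0, 0.0], [0.0, 1.0]]        # rule in N: report the observation

delta_M = compose(K, delta_N)
eps = max(tv(Q[t], push(P[t], K)) for t in range(2))
gaps = [risk(P[t], delta_M, loss[t]) - risk(Q[t], delta_N, loss[t]) for t in range(2)]
print(eps, gaps)
```

The gap may be negative for some parameters; the theorem only promises that it is at most $\varepsilon$ for every $\theta$.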

The necessity part is slightly more complicated, and for simplicity we assume that all are finite (the general case requires proper limiting arguments). In this case, any decision rules or , loss functions and priors can be represented by a finite-dimensional vector. Given and , the condition (4) ensures that

Note that the LHS of (5) is bilinear in and , both of which range over some convex sets (e.g., the domain for is exactly ), the minimax theorem allows to swap and of (5) to obtain that

By evaluating the inner supremum, (6) implies the existence of some such that

Finally, choosing and in (7), the corresponding is the desired kernel .

Based on the notion of deficiency, we are ready to define the distance between statistical models, also known as the *Le Cam’s distance*.

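Definition 6 (Le Cam’s Distance). For two statistical models $\mathcal{M}$ and $\mathcal{N}$ with the same parameter set $\Theta$, Le Cam’s distance $\Delta(\mathcal{M}, \mathcal{N})$ is defined as the infimum of $\varepsilon\ge 0$ such that $\mathcal{M}$ is $\varepsilon$-deficient relative to $\mathcal{N}$, and $\mathcal{N}$ is $\varepsilon$-deficient relative to $\mathcal{M}$.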

It is a simple exercise to show that Le Cam’s distance is a pseudo-metric in the sense that it is symmetric and satisfies the triangle inequality. The main importance of Le Cam’s distance is that it helps to establish equivalence between some statistical models, and people are typically interested in the case where $\Delta(\mathcal{M}, \mathcal{N}) = 0$ or $\lim_{n\to\infty}\Delta(\mathcal{M}_n, \mathcal{N}_n) = 0$. The main idea is to use randomization (i.e., Theorem 5) to obtain an upper bound on Le Cam’s distance, and then apply Definition 4 to deduce useful results (e.g., to carry over an asymptotically optimal procedure in one model to other models).

In the remainder of this lecture, I will give some examples of models whose distance is zero or asymptotically zero. In the next lecture it will be shown that *regular* models will always be close to some Gaussian location model asymptotically, and thereby the classical asymptotic theory of statistics can be established.

**3. Equivalence between Models**

**3.1. Sufficiency**

We first examine the case where $\Delta(\mathcal{M}, \mathcal{N}) = 0$. By Theorem 5, models $\mathcal{M}$ and $\mathcal{N}$ are mutual randomizations. In the special case where $Y = T(X)$ is a deterministic function of $X$ (thus $Q_\theta = P_\theta\circ T^{-1}$ is the push-forward measure of $P_\theta$ through $T$), we have the following result.

Theorem 7. Under the above setting, $\Delta(\mathcal{M}, \mathcal{N}) = 0$ if and only if $\theta - Y - X$ forms a Markov chain.

Note that the Markov condition $\theta - Y - X$ is the usual definition of *sufficient statistics*, and also gives the well-known Fisher–Neyman factorization criterion for sufficiency. Hence, sufficiency is in fact a special case of model equivalence, and deficiency can be thought of as approximate sufficiency.
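A small exact computation illustrates the Markov condition behind sufficiency. The example is our own: $X$ consists of $n$ i.i.d. Bernoulli($\theta$) bits and $T(X)$ is their sum. The conditional law of $X$ given $T(X)$ is uniform over the bit strings with that sum, whatever $\theta$ is, so $X$ can be regenerated from $T(X)$ without knowing $\theta$.

```python
from fractions import Fraction
from itertools import product


def conditional_given_sum(theta, n):
    """Return {t: {x: P_theta(X = x | sum(X) = t)}} for X i.i.d. Bernoulli(theta) in {0,1}^n."""
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
    }
    totals = {}
    for x, p in joint.items():
        totals[sum(x)] = totals.get(sum(x), 0) + p
    cond = {}
    for x, p in joint.items():
        cond.setdefault(sum(x), {})[x] = p / totals[sum(x)]
    return cond


# Two different parameters give exactly the same conditional law of X given the sum.
cond_a = conditional_given_sum(Fraction(1, 3), 4)
cond_b = conditional_given_sum(Fraction(3, 4), 4)
print(cond_a == cond_b, cond_a[2][(1, 1, 0, 0)])
```

Exact rational arithmetic via `fractions.Fraction` is used so that the equality check is not blurred by floating-point error.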

**3.2. Equivalence between Multinomial and Poissonized Models**

Consider a discrete probability vector $P = (p_1, \ldots, p_k)$ with $p_i\ge 0$ and $\sum_{i=1}^k p_i = 1$. A widely-used model in practice is the multinomial model $\mathcal{M}_n$, which models the i.i.d. sampling process and draws i.i.d. observations $X_1, \ldots, X_n\sim P$. However, a potential difficulty in handling multinomial models is that the empirical frequencies of symbols are dependent, which complicates the analysis. To overcome this difficulty, a common procedure is to consider a *Poissonized* model $\mathcal{N}_n$, where we first draw a Poisson random variable $N\sim\mathsf{Poi}(n)$ and then observe i.i.d. $X_1, \ldots, X_N\sim P$. Due to the nice properties of Poisson random variables, the symbol counts are now independent Poisson random variables, so the empirical frequencies follow independent scaled Poisson distributions.
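The independence claim can be verified directly for small cases. In the sketch below (function names and the numerical values are ours), the probability of observing a given count vector under Poissonized sampling, namely a $\mathsf{Poi}(n)$ total followed by a multinomial split, is compared with the product of independent $\mathsf{Poi}(np_i)$ probabilities.

```python
import math


def poisson_pmf(k, lam):
    """P(Poisson(lam) = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poissonized_count_pmf(counts, n, p):
    """P(counts) when N ~ Poisson(n) samples are drawn i.i.d. from p."""
    total = sum(counts)
    coef = math.factorial(total)
    prob = 1.0
    for h, p_i in zip(counts, p):
        coef //= math.factorial(h)   # builds the multinomial coefficient exactly
        prob *= p_i ** h
    return poisson_pmf(total, n) * coef * prob


def product_poisson_pmf(counts, n, p):
    """P(counts) when the i-th count is an independent Poisson(n * p_i)."""
    out = 1.0
    for h, p_i in zip(counts, p):
        out *= poisson_pmf(h, n * p_i)
    return out


p = [0.5, 0.3, 0.2]
for counts in [(0, 0, 0), (2, 1, 0), (3, 3, 1), (5, 0, 4)]:
    print(counts, poissonized_count_pmf(counts, 6, p), product_poisson_pmf(counts, 6, p))
```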

The next theorem shows that the multinomial and Poissonized models are asymptotically equivalent, which means that it actually does no harm to consider the more convenient Poissonized model for analysis, at least asymptotically. In later lectures I will also show a non-asymptotic result between these two models.

Theorem 8. For fixed $k$, $\lim_{n\to\infty}\Delta(\mathcal{M}_n, \mathcal{N}_n) = 0$.

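*Proof:* We only show that $\mathcal{M}_n$ is $\varepsilon_n$-deficient relative to $\mathcal{N}_n$, with $\lim_{n\to\infty}\varepsilon_n = 0$, where the other direction is analogous. By Theorem 5, it suffices to show that $\mathcal{N}_n$ is an approximate randomization of $\mathcal{M}_n$. The randomization procedure is as follows: based on the observations $X_1, \ldots, X_n$ under the multinomial model, let $P_n = (\hat{p}_1, \ldots, \hat{p}_k)$ be the vector of empirical frequencies. Next draw an independent random variable $N\sim\mathsf{Poi}(n)$. If $N\le n$, let $(X_1, \ldots, X_N)$ be the output of the kernel. Otherwise, we generate i.i.d. samples $X_1', \ldots, X_{N-n}'\sim P_n$, and let $(X_1, \ldots, X_n, X_1', \ldots, X_{N-n}')$ be the output. We remark that it is important that the above randomization procedure does not depend on the unknown $P$.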
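The randomization procedure in this proof can be sketched in a few lines. The function name `extend_to_poissonized` is ours, and the Poisson draw is passed in as the argument `big_n` rather than sampled inside the function, an assumption made only to keep the sketch deterministic and free of non-standard libraries. Extra samples are drawn uniformly from the observed ones, which is the same as sampling from the empirical distribution.

```python
import random


def extend_to_poissonized(samples, big_n, rng):
    """Map n multinomial samples to a sample of size big_n (a Poisson(n) draw made elsewhere)."""
    n = len(samples)
    if big_n <= n:
        # Fewer samples requested than available: keep a prefix.
        return list(samples[:big_n])
    # Otherwise pad with draws from the empirical distribution of the observed samples.
    extra = [rng.choice(samples) for _ in range(big_n - n)]
    return list(samples) + extra


rng = random.Random(0)
samples = [0, 2, 2, 1, 0]
short = extend_to_poissonized(samples, 3, rng)
long = extend_to_poissonized(samples, 8, rng)
print(short, long)
```

Note that the function never looks at the true distribution, only at the observed samples, in line with the remark that the kernel must not depend on the unknown parameter.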

Let be the distribution of the Poissonized and randomized model under true parameter , respectively. Now it is easily shown that

where , denotes the -fold product of , takes the expectation w.r.t. , and takes the expectation w.r.t. random samples . To upper bound the total variation distance in (8), we shall need the following lemma.

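Lemma 9. Let $D_{\text{KL}}(P\|Q) = \int dP\log\frac{dP}{dQ}$ and $\chi^2(P, Q) = \int\frac{(dP-dQ)^2}{dQ}$ be the KL divergence and $\chi^2$-divergence, respectively. Then

$$2\|P-Q\|_{\text{TV}}^2 \le D_{\text{KL}}(P\|Q) \le \chi^2(P, Q).$$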

The proof of Lemma 9 will be given in later lectures when we talk about joint ranges of divergences. Then by Lemma 9 and Jensen’s inequality,

Since by simple algebra,

then by (8) we obtain that

which goes to zero uniformly in as , as desired.
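The divergence comparison used in this proof can be sanity-checked numerically. The sketch below assumes Lemma 9 is the standard chain $2\|P-Q\|_{\text{TV}}^2\le D_{\text{KL}}(P\|Q)\le\chi^2(P,Q)$ (Pinsker's inequality followed by the bound of the KL divergence by the $\chi^2$-divergence), and evaluates the three quantities on a pair of made-up discrete distributions.

```python
import math


def tv(p, q):
    """Total variation distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(p, q):
    """KL divergence D(p || q), assuming q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def chi2(p, q):
    """Chi-squared divergence, assuming q > 0 everywhere."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(2 * tv(p, q) ** 2, kl(p, q), chi2(p, q))
```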

Remark 1. In fact, the following non-asymptotic characterization

could be established. This result is contained in an upcoming paper.

**3.3. Equivalence between Nonparametric Regression and Gaussian White Noise Models**

A well-known problem in nonparametric statistics is the nonparametric regression:

where the underlying regression function is unknown, and is some noise level. Typically, the statistical goal is to recover the function at some point or globally, and some smoothness conditions are necessary to perform this task. A typical assumption is that belongs to some Hölder ball, where

with denotes the smoothness parameter.

There is also a continuous version of (9) called the Gaussian white noise model, where a process satisfying the following stochastic differential equation is observed:

where is the standard Brownian motion. Compared with the regression model in (9), the white noise model in (10) gets rid of the quantization issue of and is therefore easier to analyze. Note that in both models is effectively the sample size.

Let be the regression and white noise models with known parameters and the parameter set , respectively. The main result in this section is that, when , these models are asymptotically equivalent.

Theorem 10. If , we have .

*Proof:* Consider another Gaussian white noise model where the only difference is to replace in (10) by defined as

Note that under the same parameter , we have

which goes to zero uniformly in as . Therefore, by Theorem 5 and Lemma 9, we have . On the other hand, in the model the likelihood ratio between the signal distribution and the pure noise distribution is

As a result, under model , there is a Markov chain . Since under the same parameter , under is identically distributed as under , by Theorem 7 we have exact sufficiency and conclude that . Then the rest follows from the triangle inequality.

**3.4. Equivalence between Density Estimation and Gaussian White Noise Models**

Another widely-used model in nonparametric statistics is the density estimation model, where samples are i.i.d. drawn from some unknown density . Typically some smoothness condition is also necessary for the density, and we assume that again belongs to the Hölder ball.

Compared with the previous results, a slightly more involved result is that the density estimation model, albeit with a seemingly different form, is also asymptotically equivalent to a proper Gaussian white noise model. However, here the Gaussian white noise model should take the following different form:

In other words, in nonparametric statistics the problems of density estimation, regression and estimation in Gaussian white noise are all asymptotically equivalent, under certain smoothness conditions.

Let be the density estimation model and the Gaussian white noise model in (11), respectively. The main result is summarized in the following theorem.

Theorem 11. If and the density is bounded away from zero everywhere, then .

*Proof:* Instead of the original density estimation model, we consider a Poissonized sampling model , where the observation under is a Poisson process on with intensity . Similar to the proof of Theorem 8, we have and it remains to show that .

Fix an equal-spaced grid in with , where is a small constant depending only on . Next we come up with two new models and , where the only difference is that the parameter is replaced by defined as

As long as , the same arguments in the proof of Theorem 10 can be applied to arrive at (for the white noise model, the assumption that is bounded away from zero ensures the smoothness of ). Hence, it further suffices to focus on the new models and show that . An interesting observation is that under the model , the vector with

is sufficient. Moreover, under the model , the vector with

is also sufficient. Further, all entries of and are mutually independent. Hence, the ultimate goal is to find mutual randomizations between and for .

To do so, a first attempt would be to find a bijective mapping independently for each . However, this approach would lose useful information from the neighbors as we know that thanks to the smoothness of . For example, we have with , and with . Motivated by this fact, we represent and in the following bijective way (assume that is even):

Note that is again an independent Poisson vector, we may repeat the above transformation for this new vector. Similar things also hold for . Hence, at each iteration we may leave half of the components unchanged, and apply the above transformations to the other half. We repeat the iteration for times (assuming is a power of ), so that finally we arrive at a vector of length consisting of sums. Let (resp. ) be the final vector of sums, and (resp. ) be the vector of remaining entries which are left unchanged at some iteration.

Remark 2. Experienced readers may have noticed that these are the wavelet coefficients under the Haar wavelet basis, where superscripts and stand for father and mother wavelets, respectively.
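The iterated pairing can be sketched as follows. This is only a Haar-type illustration under our own assumptions: at each level we keep the first member of every adjacent pair unchanged and pass the pair sums on to the next level. The lecture's exact bijection may order or define the kept entries differently, and the names `forward` and `inverse` are ours. The point is that the map is invertible, so no information is lost.

```python
def forward(x, levels):
    """Repeatedly replace adjacent pairs by their sums, keeping one member of each pair aside."""
    sums = list(x)
    details = []
    for _ in range(levels):
        kept = sums[0::2]                                        # entries left unchanged at this level
        sums = [a + b for a, b in zip(sums[0::2], sums[1::2])]   # pair sums go to the next level
        details.append(kept)
    return sums, details


def inverse(sums, details):
    """Undo `forward`: rebuild each pair from its sum and its kept first member."""
    x = list(sums)
    for kept in reversed(details):
        x = [v for s, k in zip(x, kept) for v in (k, s - k)]
    return x


x = [3, 1, 4, 1, 5, 9, 2, 6]
sums, details = forward(x, levels=2)
print(sums, details, inverse(sums, details))
```

With the input length a power of two and `levels` at most its base-2 logarithm, every level has an even number of entries, matching the assumption made in the text.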

Next we are ready to describe the randomization procedure. For notational simplicity we will write as a representative example of an entry in , and write as a representative example of an entry in .

For entries in , note that by the delta method, for $X\sim\mathsf{Poi}(\lambda)$, the random variable $\sqrt{X}$ is approximately distributed as $\mathcal{N}(\sqrt{\lambda}, 1/4)$ (in fact, the square root is the variance-stabilizing transformation for Poisson random variables). The exact transformation is then given by

where is an independent auxiliary variable. The mapping (12) is one-to-one and can thus be inverted as well.
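The variance-stabilizing property of the square root for Poisson random variables can be checked by direct summation of the probability mass function. This sketch (the function name and truncation rule are ours) does not implement the transformation (12) itself. It only computes the mean and variance of the square root of a Poisson variable, which the delta method predicts to be close to $\sqrt{\lambda}$ and $1/4$ for large $\lambda$.

```python
import math


def sqrt_poisson_moments(lam):
    """Mean and variance of sqrt(X) for X ~ Poisson(lam), by summing the pmf."""
    upper = int(lam + 20 * math.sqrt(lam) + 50)   # truncation point; the tail beyond is negligible
    log_p = -lam                                  # log P(X = 0)
    mean = 0.0
    second = 0.0                                  # E[sqrt(X)^2] = E[X]
    for k in range(upper + 1):
        p_k = math.exp(log_p)
        mean += math.sqrt(k) * p_k
        second += k * p_k
        log_p += math.log(lam) - math.log(k + 1)  # log P(X = k + 1)
    return mean, second - mean ** 2


for lam in [4.0, 25.0, 100.0, 400.0]:
    m, v = sqrt_poisson_moments(lam)
    print(f"lambda={lam:6.1f}  mean={m:.4f}  sqrt(lambda)={math.sqrt(lam):.4f}  var={v:.4f}")
```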

For entries in , we aim to use quantile transformations to convert to . For , let be the CDF of , and be the CDF of . Then the one-to-one quantile transformation is given by

where again is an independent auxiliary variable. The output given by (13) will be expected to be close in distribution to , and the overall transformation is also invertible.
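For intuition on quantile transformations with an auxiliary uniform variable, here is a sketch of the standard randomized probability integral transform for a discrete distribution, followed by the inverse CDF of a target law. It is not claimed to coincide with (13). The binomial source, the standard normal target, and the function names are assumptions made for illustration. The key fact checked is that spreading each atom uniformly over its CDF jump yields an exactly uniform variable.

```python
from statistics import NormalDist


def randomized_pit_cdf(pmf, t):
    """P(U <= t) for U = F(X-1) + V * pmf(X), with V ~ Uniform(0,1) independent of X ~ pmf."""
    total = 0.0
    lower = 0.0   # F(x - 1), the CDF just below the current atom
    for p_x in pmf:
        if p_x > 0:
            total += p_x * min(1.0, max(0.0, (t - lower) / p_x))
        lower += p_x
    return total


def quantile_transform(x, v, pmf, target=NormalDist()):
    """Map a discrete outcome x and auxiliary v in (0,1) to the target law via its inverse CDF."""
    u = sum(pmf[:x]) + v * pmf[x]   # uniform on (0,1) when v is uniform and x ~ pmf
    return target.inv_cdf(u)


# Binomial(4, 1/2) probabilities as a hypothetical discrete source.
pmf = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
print([randomized_pit_cdf(pmf, t) for t in (0.1, 0.5, 0.9)])
print(quantile_transform(2, 0.5, pmf))
```

Here `v` must lie strictly between 0 and 1, since `NormalDist.inv_cdf` rejects the endpoints. Given `x`, the map is increasing in `v` and hence invertible on its range.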

The approximation properties of these transformations are summarized in the following theorem.

Theorem 12. Sticking to the specific examples of and , let be the respective distributions of the RHS in (12) and (13), and be the respective distributions of and , we have

where is some universal constant, and denotes the Hellinger distance.

The proof of Theorem 12 is purely probabilistic and involved, and is omitted here. Applying Theorem 12 to the vector of length , each component is the sum of elements bounded away from zero. Consequently, let be the overall transition kernel of the randomization, the inequality gives

As for the vector , the components lie in possible different levels. At level , the spacing of the grid becomes , and there are elements. Also, we have at -th level, with . Consequently,

Since , we may choose to be sufficiently small (i.e., ) to make . The proof is completed.

**4. Bibliographic Notes**

The statistical decision theory framework dates back to Wald (1950), and is now standard material in graduate courses in statistics. There are many excellent textbooks on this topic, e.g., Lehmann and Casella (2006) and Lehmann and Romano (2006). The concept of model deficiency is due to Le Cam (1964), where the randomization criterion (Theorem 5) was proved. The present form is taken from Torgersen (1991). We also refer to the excellent monographs by Le Cam (1986) and Le Cam and Yang (1990).

The asymptotic equivalence between nonparametric models has been studied in a series of papers since the 1990s. The equivalence between nonparametric regression and Gaussian white noise models (Theorem 10) was established in Brown and Low (1996), where both the fixed and random designs were studied. It was also shown in a follow-up work (Brown and Zhang 1998) that these models are non-equivalent if the smoothness index equals 1/2. The equivalence of the density estimation model and others (Theorem 11) was established in Brown et al. (2004), and we also refer to a recent work (Ray and Schmidt-Hieber 2016) which relaxed the crucial assumption that the density is bounded away from zero. Poisson approximation or Poissonization is a well-known technique widely used in probability theory, statistics and theoretical computer science, and the current treatment is essentially taken from Brown et al. (2004).

- Abraham Wald, *Statistical decision functions.* Wiley, 1950.
- Erich L. Lehmann and George Casella, *Theory of point estimation.* Springer Science & Business Media, 2006.
- Erich L. Lehmann and Joseph P. Romano, *Testing statistical hypotheses.* Springer Science & Business Media, 2006.
- Lucien M. Le Cam, *Sufficiency and approximate sufficiency.* The Annals of Mathematical Statistics (1964): 1419-1455.
- Erik Torgersen, *Comparison of statistical experiments.* Cambridge University Press, 1991.
- Lucien M. Le Cam, *Asymptotic methods in statistical decision theory.* Springer, New York, 1986.
- Lucien M. Le Cam and Grace Yang, *Asymptotics in statistics.* Springer, New York, 1990.
- Lawrence D. Brown and Mark G. Low, *Asymptotic equivalence of nonparametric regression and white noise.* The Annals of Statistics 24.6 (1996): 2384-2398.
- Lawrence D. Brown and Cun-Hui Zhang, *Asymptotic nonequivalence of nonparametric experiments when the smoothness index is 1/2.* The Annals of Statistics 26.1 (1998): 279-287.
- Lawrence D. Brown, Andrew V. Carter, Mark G. Low, and Cun-Hui Zhang, *Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift.* The Annals of Statistics 32.5 (2004): 2074-2097.
- Kolyan Ray and Johannes Schmidt-Hieber, *The Le Cam distance between density estimation, Poisson processes and Gaussian white noise.* arXiv preprint arXiv:1608.01824 (2016).
