Lecture 7: Mixture vs. Mixture and Moment Matching


(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.) 

In the last two lectures we have seen two specific instances of Le Cam’s two-point method: testing between two single hypotheses, and testing between one single hypothesis and a mixture of hypotheses. Of course, the most powerful and natural generalization of the two-point method is to test between two mixtures of distributions, which by the minimax theorem is potentially the best possible approach to test between two composite hypotheses. However, the least favorable prior (mixture) may be hard to find, and it may be theoretically difficult to upper bound the total variation distance between two mixtures of distributions. In this lecture, we show via examples in Gaussian and Poisson models that both problems are closely related to moment matching. 

1. Fuzzy Hypothesis Testing and Moment Matching 

In this section, we present the general tools necessary for this lecture. First, we prove a generalized two-point method also known as the fuzzy hypothesis testing. To upper bound the crucial divergence term between two mixtures, we introduce the orthogonal polynomials under different distributions and show that the upper bound depends on moment differences of the mixtures. 

1.1. Mixture vs. Mixture 

We first state the theorem of the generalized two-point methods with two mixtures. As usual, we require that any two points in respective mixtures be well separated but the mixtures be indistinguishable from samples. However, since many natural choices of mixtures may not satisfy the well-separated property in the worst case, the next theorem will be a bit more flexible to require that the mixtures are separated with a large probability. 

Theorem 1 (Mixture vs. Mixture) Let L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+ be any loss function, and suppose that there exist \Theta_0\subseteq \Theta and \Theta_1 \subseteq \Theta such that

L(\theta_0, a) + L(\theta_1, a) \ge \Delta, \qquad \forall a\in \mathcal{A}, \theta_0\in \Theta_0, \theta_1\in \Theta_1.

Then for any probability measures \mu_0, \mu_1 supported on \Theta, we have

\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|\mathop{\mathbb E}_{\mu_0(d\theta)}P_{\theta} - \mathop{\mathbb E}_{\mu_1(d\theta)}P_{\theta} \|_{\text{\rm TV}} - \mu_0(\Theta_0^c) - \mu_1(\Theta_1^c)\right).

Proof: For i=0,1, let \nu_i be the conditional probability measure of \mu_i conditioned on \Theta_i, i.e., 

\nu_i(A) := \frac{\mu_i(A\cap \Theta_i)}{\mu_i(\Theta_i)}.

Simple algebra gives \|\mu_i - \nu_i\|_{\text{TV}} = \mu_i(\Theta_i^c). By coupling, we also have 

\|\mathop{\mathbb E}_{\mu_i(d\theta)}P_{\theta} - \mathop{\mathbb E}_{\nu_i(d\theta)}P_{\theta} \|_{\text{\rm TV}} \le \|\mu_i - \nu_i\|_{\text{TV}} = \mu_i(\Theta_i^c).

Now the desired result follows from the standard two-point arguments and the triangle inequality of the total variation distance. \Box 

The central quantity in Theorem 1 is \|\mathop{\mathbb E}_{\mu_0(d\theta)}P_{\theta} - \mathop{\mathbb E}_{\mu_1(d\theta)}P_{\theta} \|_{\text{\rm TV}}, the total variation distance between mixture distributions with priors \mu_0 and \mu_1. A general upper bound on this quantity is very hard to obtain, but the next two sections will show that it is small when the moments of \mu_0 and \mu_1 are close to each other if the model (P_\theta) is Gaussian or Poisson. 

1.2. Hermite and Charlier Polynomials 

This section reviews some preliminary results on orthogonal polynomials under a fixed probability distribution. Let P be a probability measure on (\mathcal{X}, \mathcal{F}) with \mathcal{X}\subseteq {\mathbb R} and all moments finite. Recall that functions \{p_n\}_{n=0}^\infty with p_n: \mathcal{X} \rightarrow {\mathbb R} are called orthogonal under P iff 

\mathop{\mathbb E}[p_m(X)p_n(X)] = 0

for X\sim P and all m\neq n. By orthogonal polynomials we mean that for each n\in {\mathbb N}, x\mapsto p_n(x) is a polynomial with degree n.

The simplest way to construct orthogonal polynomials is via the Gram–Schmidt orthogonalization. Specifically, we may choose p_0(x) = 1, and p_n(x) to be the orthogonal component of x^n projected onto \text{span}(p_0(x),\cdots,p_{n-1}(x)) with the inner product structure (f,g)\mapsto \mathop{\mathbb E}_P[f(X)g(X)]. This approach works for general distributions and can be easily implemented in practice, but it gives little insight on the properties of p_n. We shall apply a new approach to arrive at orthogonal functions, which turn out to be polynomials in Gaussian and Poisson models. 

Let (P_\theta)_{\theta\in [\theta_0-\varepsilon,\theta_0+\varepsilon]\subseteq {\mathbb R}} be a family of distributions on (\mathcal{X},\mathcal{F}) with P_{\theta_0} = P. Assume that (P_\theta) admits the following local expansion around \theta = \theta_0:

\frac{dP_{\theta_0 +u}}{dP_{\theta_0}}(x) = \sum_{m=0}^\infty p_m(x;\theta_0)\frac{u^m}{m!}, \qquad \forall |u|\le \varepsilon, x\in \mathcal{X}. \ \ \ \ \ (1)

The next lemma claims that under specific conditions of (P_\theta), the functions \{p_m(x;\theta_0)\}_{m=0}^\infty are orthogonal under P_{\theta_0}.

Lemma 2 Under the above conditions, if for all u,v\in [-\varepsilon,\varepsilon] the quantity

\int_{\mathcal{X}} \frac{dP_{\theta_0+u}dP_{\theta_0+v}}{dP_{\theta_0}}

depends only on their product uv and \theta_0, then the functions \{p_m(x;\theta_0)\}_{m=0}^\infty are orthogonal under P_{\theta_0}.

Remark 1 Recall that the quantity \int_{\mathcal{X}} dP_{\theta_0+u}dP_{\theta_0+v}/ dP_{\theta_0} plays an important role in the Ingster-Suslina method in Lecture 6. The upper bounds in the next section can be thought of as a generalization of the Ingster-Suslina method, with the help of proper orthogonality properties. 

Proof: The local expansion of likelihood ratio gives 

\int_{\mathcal{X}} \frac{dP_{\theta_0+u}dP_{\theta_0+v}}{dP_{\theta_0}} = \sum_{m,n=0}^\infty \mathop{\mathbb E}_{P_{\theta_0}}[p_m(X;\theta_0) p_n(X;\theta_0) ]\cdot \frac{u^mv^n}{m!n!}.

The condition of Lemma 2 implies that the coefficient of the monomial u^mv^n on the RHS with m\neq n must be zero, as desired. \Box 

Exercise 1 Show that under the conditions in Lemma 2, \mathop{\mathbb E}_{P_{\theta_0+u}} [p_m(X;\theta_0)] = c_mu^m for some scalar c_m\neq 0 and any u\in [-\varepsilon,\varepsilon]. In other words, p_m(X;\theta_0) is an unbiased estimator of u^m up to scaling in the location model (P_{\theta_0+u})_{u\in [-\varepsilon,\varepsilon]}.

The condition of Lemma 2 is satisfied by various well-known probability distributions. For example, 

\int_{{\mathbb R}} \frac{d\mathcal{N}(\theta_0+u, 1)d\mathcal{N}(\theta_0+v,1)}{d\mathcal{N}(\theta_0,1)} = \exp(uv),

and for any \lambda>0

\int_{{\mathbb N}} \frac{d\mathsf{Poi}(\lambda+u) d\mathsf{Poi}(\lambda+v)}{d\mathsf{Poi}(\lambda)} = \exp\left(\frac{uv}{\lambda}\right).
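As a quick sanity check of the two identities above, the following Python sketch evaluates both quantities numerically: a Riemann sum for the Gaussian integral and a truncated sum for the Poisson one. The function names, the integration grid and the truncation level are illustrative choices and not part of the lecture.

```python
import math

def gaussian_overlap(u, v, half_width=12.0, step=1e-3):
    """Riemann sum for int phi(x-u) phi(x-v) / phi(x) dx, phi = pdf of N(0,1)."""
    n_steps = int(2 * half_width / step)
    total = 0.0
    for i in range(n_steps + 1):
        x = -half_width + i * step
        # log of phi(x-u) * phi(x-v) / phi(x)
        log_term = (-(x - u) ** 2 - (x - v) ** 2 + x ** 2) / 2 - 0.5 * math.log(2 * math.pi)
        total += math.exp(log_term) * step
    return total

def poisson_overlap(lam, u, v, x_max=200):
    """Truncated sum over x of Poi(lam+u)(x) * Poi(lam+v)(x) / Poi(lam)(x)."""
    total = 0.0
    for x in range(x_max + 1):
        log_term = (-(lam + u) - (lam + v) + lam
                    + x * (math.log(lam + u) + math.log(lam + v) - math.log(lam))
                    - math.lgamma(x + 1))
        total += math.exp(log_term)
    return total

# Compare with the closed forms exp(uv) and exp(uv / lambda).
print(gaussian_overlap(0.3, -0.8), math.exp(0.3 * -0.8))
print(poisson_overlap(2.0, 0.5, -0.7), math.exp(0.5 * -0.7 / 2.0))
```

In both cases the numerical value depends on (u,v) only through the product uv, which is exactly the condition required by Lemma 2.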

In fact, the functions p_m(x;\theta_0) given in the defining equation (1) in the Gaussian and Poisson models are both polynomials of degree m, known as the Hermite polynomial H_m(x) and the Charlier polynomial c_m(x;\lambda), respectively. The proof of Lemma 2 gives the following orthogonal relations: 

\begin{array}{rcl} \mathop{\mathbb E}_{X\sim \mathcal{N}(0,1)} [H_m(X)H_n(X)] &=& n!\cdot 1(m=n) \\ \mathop{\mathbb E}_{X\sim \mathsf{Poi}(\lambda)} [c_m(X;\lambda)c_n(X;\lambda)] &=& \frac{n!}{\lambda^n}\cdot 1(m=n). \end{array}

Throughout this lecture, we won’t need the specific forms of the Hermite and Charlier polynomials. We shall only need the defining property (1) and the above orthogonal relations. 
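Although the explicit forms are not needed, it may help to see the orthogonality relation in action. In the Poisson model the likelihood ratio in (1) equals e^{-u}(1+u/\lambda)^x, and expanding it in powers of u suggests the formula c_m(x;\lambda) = \sum_{j=0}^m \binom{m}{j}(-1)^{m-j} x(x-1)\cdots(x-j+1)/\lambda^j. This formula is derived here only for illustration and is not stated in the lecture. The sketch below, with hypothetical helper names and a truncated Poisson support, checks the Charlier orthogonality relation numerically.

```python
import math

def falling_factorial(x, j):
    """x (x-1) ... (x-j+1), with the empty product equal to 1."""
    out = 1
    for i in range(j):
        out *= (x - i)
    return out

def charlier(m, x, lam):
    """Coefficient of u^m / m! in exp(-u) * (1 + u/lam)^x."""
    return sum(math.comb(m, j) * (-1) ** (m - j) * falling_factorial(x, j) / lam ** j
               for j in range(m + 1))

def poisson_pmf_table(lam, x_max):
    """[P(Poi(lam) = x) for x = 0..x_max], built iteratively to avoid huge factorials."""
    pmf = [math.exp(-lam)]
    for x in range(x_max):
        pmf.append(pmf[-1] * lam / (x + 1))
    return pmf

def charlier_inner_product(m, n, lam, x_max=120):
    """Truncated version of E[c_m(X; lam) c_n(X; lam)] for X ~ Poi(lam)."""
    pmf = poisson_pmf_table(lam, x_max)
    return sum(p * charlier(m, x, lam) * charlier(n, x, lam) for x, p in enumerate(pmf))

lam = 2.0
for m in range(4):
    print([round(charlier_inner_product(m, n, lam), 6) for n in range(4)])
```

The printed matrix should be diagonal up to rounding, with diagonal entries m!/\lambda^m.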

1.3. Divergences between Mixtures 

Now we are ready to present the upper bound on the total variation distance between Gaussian mixture and Poisson mixture models. We also provide upper bounds on the \chi^2 divergence due to its nice tensorization property. 

We first deal with the Gaussian location model. Let U and U' be two random variables on {\mathbb R}, and let \mathop{\mathbb E} \mathcal{N}(U,1) be the Gaussian mixture with random mean U.

Theorem 3 (Divergence between Gaussian Mixtures) For any \mu\in {\mathbb R}, we have

\|\mathop{\mathbb E} \mathcal{N}(\mu+ U,1) - \mathop{\mathbb E} \mathcal{N}(\mu+U',1) \|_{\text{\rm TV}} \le \frac{1}{2}\left(\sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2 }{m!} \right)^{\frac{1}{2}}.

Moreover, if \mathop{\mathbb E}[U']=0 and \mathop{\mathbb E}[(U')^2]\le M^2, then 

\chi^2(\mathop{\mathbb E}\mathcal{N}(\mu+ U,1), \mathop{\mathbb E}\mathcal{N}(\mu+U',1)) \le e^{M^2/2}\cdot \sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2 }{m!}.

Proof: By translation we may assume that \mu=0. Let \phi(x) be the pdf of \mathcal{N}(0,1), and \Delta_m := \mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m]. Then 

\begin{array}{rcl} \|\mathop{\mathbb E} \mathcal{N}(U,1) - \mathop{\mathbb E} \mathcal{N}(U',1) \|_{\text{\rm TV}} &=& \frac{1}{2}\int_{{\mathbb R}} |\mathop{\mathbb E}[\phi(x-U)] - \mathop{\mathbb E}[\phi(x-U')]|dx \\ &\overset{(a)}{=}& \frac{1}{2}\int_{{\mathbb R}} \left|\sum_{m=0}^\infty H_m(x)\frac{\Delta_m}{m!} \right|\phi(x)dx \\ &\overset{(b)}{\le} & \frac{1}{2} \left(\int_{{\mathbb R}} \left|\sum_{m=0}^\infty H_m(x)\frac{\Delta_m}{m!} \right|^2\phi(x)dx \right)^{\frac{1}{2}} \\ &\overset{(c)}{=} & \frac{1}{2}\left(\sum_{m=0}^\infty \frac{\Delta_m^2}{m!}\right)^{\frac{1}{2}}, \end{array}

where step (a) is due to the defining property (1), step (b) follows from the Cauchy–Schwarz inequality, and step (c) uses the orthogonal relation of H_m(x). Hence the upper bound on the total variation distance is proved. 

For the \chi^2-divergence, first note that by Jensen’s inequality, 

\mathop{\mathbb E}[\phi(x-U')] = \phi(x)\mathop{\mathbb E}\left[\exp\left(U'x - \frac{(U')^2}{2}\right)\right] \ge \phi(x)e^{-M^2/2}.

As a result, 

\begin{array}{rcl} \chi^2(\mathop{\mathbb E}\mathcal{N}(U,1), \mathop{\mathbb E}\mathcal{N}(U',1)) &=& \int_{{\mathbb R}} \frac{|\mathop{\mathbb E}[\phi(x-U)] - \mathop{\mathbb E}[\phi(x-U')]|^2}{\mathop{\mathbb E}[\phi(x-U')]}dx \\ &\le & e^{M^2/2}\cdot \int_{{\mathbb R}} \frac{|\mathop{\mathbb E}[\phi(x-U)] - \mathop{\mathbb E}[\phi(x-U')]|^2}{\phi(x)}dx \\ &= & e^{M^2/2}\cdot \int_{{\mathbb R}} \left|\sum_{m=0}^\infty H_m(x)\frac{\Delta_m}{m!} \right|^2 \phi(x)dx \\ &=& e^{M^2/2}\cdot \sum_{m=0}^\infty \frac{\Delta_m^2}{m!}, \end{array}

where again we have used the defining property (1) and the orthogonality in the last two identities. \Box 
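To see the total variation bound of Theorem 3 at work, here is a small Python sketch comparing a numerically integrated total variation distance with the moment-based upper bound. The two discrete priors are hypothetical choices (not from the lecture) that match moments up to degree 3: U is uniform on \{-1,1\}, and U' puts mass 1/2 at 0 and mass 1/4 at each of \pm\sqrt{2}. The infinite sum is truncated at a finite degree, which is harmless here because the terms decay factorially.

```python
import math

def phi(x):
    """Density of N(0,1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def mixture_pdf(x, atoms):
    """Density at x of E[N(U,1)] for a discrete U given as [(value, prob), ...]."""
    return sum(p * phi(x - a) for a, p in atoms)

def tv_distance(atoms_u, atoms_v, half_width=12.0, step=1e-3):
    """Riemann-sum approximation of the TV distance between the two Gaussian mixtures."""
    n_steps = int(2 * half_width / step)
    total = 0.0
    for i in range(n_steps + 1):
        x = -half_width + i * step
        total += abs(mixture_pdf(x, atoms_u) - mixture_pdf(x, atoms_v)) * step
    return total / 2

def moment(atoms, m):
    return sum(p * a ** m for a, p in atoms)

def theorem3_tv_bound(atoms_u, atoms_v, max_degree=60):
    """(1/2) * sqrt(sum_m Delta_m^2 / m!), truncated at max_degree."""
    s = sum((moment(atoms_u, m) - moment(atoms_v, m)) ** 2 / math.factorial(m)
            for m in range(max_degree + 1))
    return 0.5 * math.sqrt(s)

U = [(-1.0, 0.5), (1.0, 0.5)]
U_prime = [(-math.sqrt(2), 0.25), (0.0, 0.5), (math.sqrt(2), 0.25)]
tv = tv_distance(U, U_prime)
bound = theorem3_tv_bound(U, U_prime)
print(tv, bound)
```

Since the first three moments agree, only the terms with m\ge 4 contribute to the bound.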

Specifically, Theorem 3 shows that when the moments of U and U' are close, then the corresponding Gaussian mixtures are statistically close. Similar results also hold for Poisson mixtures. 

Theorem 4 (Divergence between Poisson Mixtures) For any \lambda>0 and random variables U,U' supported on [-\lambda,\infty), we have

\|\mathop{\mathbb E}\mathsf{Poi}(\lambda+U) - \mathop{\mathbb E}\mathsf{Poi}(\lambda+U') \|_{\text{\rm TV}} \le \frac{1}{2}\left(\sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2}{m!\lambda^m}\right)^{\frac{1}{2}}.

Moreover, if \mathop{\mathbb E}[U']=0 and |U'|\le M almost surely, then 

\chi^2(\mathop{\mathbb E}\mathsf{Poi}(\lambda+U), \mathop{\mathbb E}\mathsf{Poi}(\lambda+U') ) \le e^{M}\cdot \sum_{m=0}^\infty \frac{|\mathop{\mathbb E}[U^m] - \mathop{\mathbb E}[(U')^m] |^2}{m!\lambda^m}.

Proof: The proofs of both inequalities essentially follow the same lines as those in the proof of Theorem 3, with the Hermite polynomial H_m(x) replaced by the Charlier polynomial c_m(x;\lambda). The only difference is that when \mathop{\mathbb E}[U']=0 and |U'|\le M almost surely, for all x\in {\mathbb N} we have 

\begin{array}{rcl} \mathop{\mathbb P}(\mathop{\mathbb E}\mathsf{Poi}(\lambda+U') = x ) &=& \mathop{\mathbb E}\left[ e^{-\lambda-U'} \frac{(\lambda+U')^x}{x!}\right] \\ &\ge & e^{-\lambda-M} \mathop{\mathbb E}\left[\frac{(\lambda+U')^x}{x!} \right] \\ &\ge & e^{-\lambda-M}\frac{\lambda^x}{x!} \\ &=& e^{-M}\cdot \mathop{\mathbb P}(\mathsf{Poi}(\lambda) = x ). \end{array} \Box

2. Examples in Gaussian Models 

In this section, we present examples in Gaussian location models where we need to test between two mixtures. In these examples, we match moments up to either some finite degree, or some large and growing degrees such that the information divergences become extremely small. 

2.1. Gaussian Mixture Models 

Consider the following two-component Gaussian mixture model where n i.i.d. samples X_1,\cdots,X_n are drawn from p\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2). One possible task of proper learning is to estimate the parameters (\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2) within O(1)-distance to the truth up to permutation. In other words, the target is to recover the components of the Gaussian mixture, which in practice helps to perform tasks such as clustering. The target is to determine the optimal sample complexity of this problem, assuming that p\in (0,1) is unknown but bounded away from 0 and 1 (e.g., 0.01\le p\le 0.99), and the overall variance of the mixture is at most \sigma^2, where \sigma = \Omega(1) is some prespecified parameter. 

To derive a lower bound, the two-point method suggests finding two sets of parameters (p,\mu_1,\mu_2,\sigma_1,\sigma_2) which are \Omega(1)-separated, while the information divergence between these two mixtures is O(1). However, since Theorem 3 can only deal with mixtures with an identical variance in each component, we cannot simply take U to be a discrete random variable supported on two points \{\mu_1, \mu_2 \}. To overcome this difficulty, note that 

\left( p\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2) \right) * \mathcal{N}(0,\sigma^2) = p\mathcal{N}(\mu_1,\sigma_1^2 + \sigma^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2 + \sigma^2),

where * denotes convolution of probability measures. Hence, we may treat the overall mixture to be p\mathcal{N}(\mu_1,\sigma_1^2 + \sigma^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2 + \sigma^2), and choose U \sim p\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\mathcal{N}(\mu_2, \sigma_2^2) and similarly for U'. Now the desired \chi^2-divergence becomes 

\chi^2(U * \mathcal{N}(0,\sigma^2), U' * \mathcal{N}(0,\sigma^2)).

To apply Theorem 3, we should construct random variables U and U' with as many matched moments as possible. Since there are 5 free parameters in the two-component Gaussian mixtures, we expect that U and U' can only have matched moments up to degree 5. A specific choice can be as follows: 

\begin{array}{rcl} U \sim P & = & 0.5\mathcal{N}(-1,1) + 0.5\mathcal{N}(1,2), \\ U' \sim Q & \approx & 0.2968\mathcal{N}(-1.2257, 0.6100) + 0.7032\mathcal{N}(0.5173, 2.3960). \end{array}

Clearly both U * \mathcal{N}(0,\sigma^2) and U' * \mathcal{N}(0,\sigma^2) have overall variance \Theta(\sigma^2). Replacing U, U' by their centered version, Theorem 3 gives 

\chi^2(U * \mathcal{N}(0,\sigma^2), U' * \mathcal{N}(0,\sigma^2)) \le e^{O(1)/\sigma^2}\cdot \sum_{m=6}^\infty \frac{2^m}{m!\sigma^{2m}} = O\left(\frac{1}{\sigma^{12}}\right)

as long as \sigma = \Omega(1). Hence, by the tensorization property of the \chi^2-divergence, we conclude that n = \Omega(\sigma^{12}) is a lower bound on the sample complexity. 
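The claim that P and Q match moments up to degree 5 can be checked numerically. The sketch below computes raw moments of Gaussian mixtures via the standard recursion \mathop{\mathbb E}[X^m] = \mu \mathop{\mathbb E}[X^{m-1}] + (m-1)s^2 \mathop{\mathbb E}[X^{m-2}] for X\sim \mathcal{N}(\mu,s^2), reading the second argument of \mathcal{N}(\cdot,\cdot) as a variance. The function names are illustrative. Since the parameters of Q are rounded to four digits, the moments only agree up to a small rounding error.

```python
def gaussian_moments(mean, var, max_degree):
    """Raw moments E[X^m], m = 0..max_degree, of N(mean, var)."""
    moments = [1.0, mean]
    for m in range(2, max_degree + 1):
        moments.append(mean * moments[m - 1] + (m - 1) * var * moments[m - 2])
    return moments[: max_degree + 1]

def mixture_moments(components, max_degree):
    """Raw moments of a Gaussian mixture given as [(weight, mean, variance), ...]."""
    out = [0.0] * (max_degree + 1)
    for weight, mean, var in components:
        for m, value in enumerate(gaussian_moments(mean, var, max_degree)):
            out[m] += weight * value
    return out

P = [(0.5, -1.0, 1.0), (0.5, 1.0, 2.0)]
Q = [(0.2968, -1.2257, 0.6100), (0.7032, 0.5173, 2.3960)]

moments_P = mixture_moments(P, 6)
moments_Q = mixture_moments(Q, 6)
gaps = [abs(a - b) for a, b in zip(moments_P, moments_Q)]
for m, gap in enumerate(gaps):
    print(m, moments_P[m], moments_Q[m], gap)
```

The gaps for degrees 0 through 5 should be negligible, while the sixth moments differ visibly, in line with the sum starting at m=6 in the bound above.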

The seemingly strange bound \Omega(\sigma^{12}) is also tight for this problem, and the idea is to estimate the first 5 moments of the mixture and then show that close moments imply close parameters. We leave the details to the reference in the bibliographic notes. 

2.2. L_1-norm Estimation of Bounded Gaussian Mean 

Consider the Gaussian location model X\sim \mathcal{N}(\theta,I_p) with unit variance, where the mean vector \theta\in {\mathbb R}^p satisfies \|\theta\|_\infty \le 1. The target here is to estimate the L_1 norm of the mean vector \|\theta\|_1, and let R_p^\star be the minimax risk under the absolute value loss. The main result here is to prove the following tight lower bound: 

R_p^\star = \Omega\left(p\cdot \frac{\log\log p}{\log p}\right).

2.2.1. Failure of Point vs. Mixture

Motivated by the point vs. mixture approach in the last lecture, one natural idea is to test between H_0: \theta =0 and a composite hypothesis H_1: \|\theta\|_1 \ge \rho, where \rho>0 is a parameter to be specified later such that H_0 and H_1 are indistinguishable in the minimax sense. Consequently, this approach gives the lower bound R_p^\star = \Omega(\rho), and the target is to find some \rho>0 as large as possible. We claim that the largest possible \rho is \rho = \Theta(p^{3/4}), which is strictly smaller than the desired minimax risk. 

Let P_\theta = \mathcal{N}(\theta, I_p), and consider any prior distribution \pi supported on \{\theta: \|\theta\|_1 \ge \rho \}. Then the Ingster-Suslina method gives 

\chi^2\left( \mathop{\mathbb E}_\pi[P_\theta], P_0\right) = \mathop{\mathbb E}_{\theta,\theta'\sim \pi} \exp(\theta^\top \theta') - 1.

Using the Taylor expansion \exp(x) = \sum_{k=0}^\infty x^k/k! and the inequality \mathop{\mathbb E}[(\theta^\top \theta')^k]\ge 0 for all k\in {\mathbb N}, we conclude that 

\begin{array}{rcl} \chi^2\left( \mathop{\mathbb E}_\pi[P_\theta], P_0\right) &\ge & \frac{1}{2}\mathop{\mathbb E}_{\theta,\theta'} [(\theta^\top \theta')^2] \\ &=& \frac{1}{2}\sum_{i=1}^p (\mathop{\mathbb E}[\theta_i^2])^2 + \frac{1}{2}\sum_{i\neq j} (\mathop{\mathbb E}[\theta_i\theta_j])^2 \\ &\ge & \frac{1}{2}\sum_{i=1}^p (\mathop{\mathbb E}[\theta_i^2])^2 \\ &\ge & \frac{1}{2p}\left(\sum_{i=1}^p \mathop{\mathbb E}[\theta_i^2] \right)^2 \\ &\ge & \frac{1}{2p}\left(\frac{\mathop{\mathbb E}[\|\theta\|_1^2 ]}{p} \right)^2 \\ &\ge & \frac{\rho^4}{2p^3} . \end{array}

Note that the above inequality holds for any \pi supported on \{\theta: \|\theta\|_1 \ge \rho \}. Hence, when \rho = \Omega(p^{3/4}), these hypotheses H_0 and H_1 become statistically distinguishable, and therefore the best possible lower bound from the point vs. mixture approach is \Omega(p^{3/4}).

There is also another way to show the desired failure, i.e., we may construct an explicit test which reliably distinguishes between H_0: \theta =0 and H_1: \|\theta\|_1 \ge p^{3/4}. The idea is to apply the \chi^2 test, i.e., compute the statistic T = \|X\|_2^2 - p. Clearly under H_0 we have T = O_p(\sqrt{p}), and after some algebra we may show that T = \|\theta\|_2^2 - O_p(\sqrt{p} + \|\theta\|_2) under P_\theta. Since \|\theta\|_1\ge p^{3/4} implies \|\theta\|_2\ge p^{1/4}, we conclude that comparing T with a suitable threshold O(\sqrt{p}) results in a reliable test. 
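The reliability of the \chi^2 test can be made quantitative with Chebyshev's inequality. The sketch below relies on the standard facts that, for X\sim \mathcal{N}(\theta,I_p), the statistic T = \|X\|_2^2 - p has mean \|\theta\|_2^2 and variance 2p + 4\|\theta\|_2^2. The alternative \|\theta\|_1\ge Cp^{3/4} with a constant C, the threshold \tau and the function name are assumptions made for illustration only.

```python
import math

def chi_square_test_error_bounds(p, C):
    """Chebyshev bounds on the errors of the test 'reject H0 iff T > tau',
    where T = ||X||_2^2 - p and H1 is ||theta||_1 >= C * p**0.75."""
    s_min = C ** 2 * math.sqrt(p)   # ||theta||_2^2 >= ||theta||_1^2 / p >= s_min under H1
    tau = s_min / 2                 # hypothetical threshold, of order sqrt(p)
    # Under H0: E[T] = 0 and Var(T) = 2p.
    type_1 = 2 * p / tau ** 2
    # Under H1: E[T] = s and Var(T) = 2p + 4s with s = ||theta||_2^2 >= s_min;
    # the Chebyshev bound (2p + 4s) / (s - tau)^2 is decreasing in s, so s = s_min is the worst case.
    type_2 = (2 * p + 4 * s_min) / (s_min - tau) ** 2
    return min(type_1, 1.0), min(type_2, 1.0)

print(chi_square_test_error_bounds(10**4, 3.0))
```

Both bounds are of order C^{-4} for large p, so the test is reliable once C is a large enough constant.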

2.2.2. Moment Matching and Polynomial Approximation

The previous section shows that testing between a single distribution and a mixture does not work, as the knowledge of the single distribution can be used for the \chi^2 test and may make the problem significantly easier. Hence, an improvement is to consider two composite hypotheses H_0: \|\theta\|_1\le \rho_0 and H_1: \|\theta\|_1\ge \rho_1, where \rho_1>\rho_0>0 are parameters to be chosen later. For the priors \pi_0, \pi_1 on \theta, Theorem 3 motivates us to use product priors \pi_i = \mu_i^{\otimes p} where \mu_0, \mu_1 are probability measures on [-1,1] with matched moments up to degree K (to be chosen later). The specific choices of \mu_0, \mu_1 must fulfill the following requirements: 

  1. Have matched moments up to degree K, while making the quantity \rho_1 - \rho_0 as large as possible; 
  2. For i\in \{0,1\}, the prior \pi_i is supported on H_i;
  3. The \chi^2-divergence is upper bounded by \chi^2(\pi_0, \pi_1) = O(1), or in other words, \chi^2(\mu_0, \mu_1) = O(1/p).

We check the above requirements in the reverse order and specify the choices of \mu_0, \mu_1 and K gradually. For the last requirement, Theorem 3 with M \le 2 gives 

\chi^2\left(\mu_0,\mu_1 \right) \le e^2\sum_{m=K+1}^\infty \frac{2^{m+1}}{m!} \le 2e^2\sum_{m=K+1}^\infty \left(\frac{2e}{m}\right)^m \le \frac{2e^2}{1-2e/K}\cdot \left(\frac{2e}{K}\right)^K.

As a result, to have \chi^2(\mu_0,\mu_1) = O(1/p), it suffices to take K = \Omega(\log p/\log\log p).
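To get a feeling for how slowly the required degree K grows, the following Python sketch finds the smallest K for which the last upper bound above drops below 1/p, and prints it next to \log p/\log\log p. The function names and the choice of the target 1/p (with constant one) are illustrative assumptions.

```python
import math

def chi2_bound(K):
    """Right-hand side 2e^2 / (1 - 2e/K) * (2e/K)^K; only meaningful for K > 2e."""
    r = 2 * math.e / K
    return 2 * math.e ** 2 / (1 - r) * r ** K

def smallest_degree(p):
    """Smallest integer K > 2e with chi2_bound(K) <= 1/p."""
    K = 6  # smallest integer exceeding 2e
    while chi2_bound(K) > 1.0 / p:
        K += 1
    return K

for p in (10**3, 10**6, 10**9, 10**12):
    K = smallest_degree(p)
    print(p, K, round(math.log(p) / math.log(math.log(p)), 2))
```

The bound is decreasing in K on the range K > 2e, so the linear search returns the exact threshold for this particular inequality.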

The second constraint that \pi_i be (almost) supported on H_i is also easy. Set 

\rho_0 = p\mathop{\mathbb E}_{\theta\sim \mu_0}[|\theta|] + c\sqrt{p}, \quad \rho_1 = p\mathop{\mathbb E}_{\theta\sim \mu_1}[|\theta|] - c\sqrt{p},

where c>0 is a large enough numerical constant. The idea behind the above choices is that, under the product distribution \mu_i^{\otimes p}, the random variable \|\theta\|_1 is the sum of p i.i.d. random variables |\theta_j|\in [0,1], where each \theta_j follows the distribution \mu_i. Then by the sub-Gaussian concentration, the L_1 norm is centered at p\mathop{\mathbb E}_{\theta\sim \mu_i}[|\theta|] with fluctuation O(\sqrt{p}). Hence, for large c>0 both probabilities \pi_0(\{\|\theta\|_1> \rho_0 \}) and \pi_1(\{\|\theta\|_1<\rho_1 \}) are small, which fulfills the conditions in Theorem 1. 

The most non-trivial requirement is the first requirement, which by our choice of \rho_0, \rho_1 essentially aims to maximize the difference \mathop{\mathbb E}_{\theta\sim \mu_1}[|\theta|] - \mathop{\mathbb E}_{\theta\sim \mu_0}[|\theta|] subject to the constraint that the probability measures \mu_0, \mu_1 are supported on [-1,1] and have matching first K moments. The following lemma shows the duality between moment matching and best polynomial approximation. 

Lemma 5 For any compact interval I\subseteq {\mathbb R} and continuous real-valued function f on I, let S^\star be the maximum difference \mathop{\mathbb E}_{\theta\sim \mu_1}[f(\theta)] - \mathop{\mathbb E}_{\theta\sim \mu_0}[f(\theta)] subject to the constraint that the probability measures \mu_0, \mu_1 are supported on I and have matching first K moments. Then

S^\star = 2E_K(f;I ),

where E_K(f;I) denotes the best degree-K polynomial approximation error of f on the interval I:

E_K(f; I) := \inf_{a_0,\cdots,a_K } \sup_{x\in I} \left|f(x) - \sum_{k=0}^K a_kx^k \right|.

Proof: It is an easy exercise to show that S^\star \le 2E_K(f;I). We present two proofs for the hard direction S^\star \ge 2E_K(f;I). The first proof is an abstract proof which holds for general basis functions other than monomials, while the construction of the measures \mu_0, \mu_1 is implicit. The second proof gives an explicit construction, while some properties of polynomials are used in the proof. 

(First Proof) Consider the following linear functional T: \text{span}(1,x,\cdots,x^K,f(x)) \rightarrow {\mathbb R}, with T(p) = 0 for p\in \text{span}(1,x,\cdots,x^K) and T(f) = E_K(f;I). Equipped with the \|\cdot\|_\infty norm on functions, it is easy to show that the operator norm of T is \|T\| = 1. By the Hahn-Banach theorem, the linear functional T can be extended to C(I) \rightarrow {\mathbb R} without increasing the operator norm. Then by the Riesz representation theorem, there exists a signed Radon measure \mu on I such that 

T(g) = \int_I g(x)\mu(dx), \quad \forall g \in C(I).

The fact \|T\| = 1 implies that the total variation of \mu is one. Write \mu = \mu_+ - \mu_- by Jordan decomposition of signed measures, then \mu_1 = 2\mu_+ and \mu_0 = 2\mu_- satisfy the desired properties. 

(Second Proof) Let p_K be a degree-K polynomial with \|f-p_K\|_\infty = E_K(f;I) on I. Since \{1,x,\cdots,x^K\} is a Haar basis on I, Chebyshev’s alternation theorem shows that there exist K+2 points x_1,\cdots,x_{K+2} such that f(x_i) - p_K(x_i) = \varepsilon\cdot (-1)^iE_K(f;I) with \varepsilon=1 or -1. Consider the signed measure \mu supported on \{x_1,\cdots,x_{K+2}\} with 

\mu( \{x_i\} ) := c\left(\prod_{j\neq i} (x_i - x_j) \right)^{-1},

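where c\in{\mathbb R} is a normalizing constant such that \varepsilon\sum_{i=1}^{K+2} (-1)^i\mu(\{x_i\}) = 2. Since the signs of \mu(\{x_i\}) alternate in i, simple algebra shows that \mu has total variation 2 and \mathop{\mathbb E}_{\mu}[f-p_K] = 2E_K(f;I), so that \mathop{\mathbb E}_{\mu}f = 2E_K(f;I) once we show that \mu annihilates all polynomials of degree at most K. To this end, the following identity is given by Lagrange interpolation: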

\sum_{i=1}^{K+2} x_i^k\prod_{j\neq i} \frac{x-x_j}{x_i-x_j} = x^k, \quad k = 0,1,\cdots,K,

and comparing the coefficient of x^{K+1} on both sides gives \mathop{\mathbb E}_\mu[x^k]=0 for k=0,\cdots,K. Now another Jordan decomposition of \mu gives the desired result. \Box 
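The explicit construction in the second proof can be carried out by hand in a small case. The Python sketch below takes f(x)=|x| on [-1,1] and K=3, and relies on the textbook fact (assumed here, not taken from the lecture) that x^2+1/8 is the best approximation of |x| by polynomials of degree at most 3, with error 1/8 attained with alternating signs at the K+2=5 points -1,-1/2,0,1/2,1. Exact rational arithmetic is used so that the moment identities hold without rounding; the helper names are illustrative.

```python
from fractions import Fraction

def alternation_measure(points, f):
    """Signed measure with mass proportional to 1 / prod_{j != i} (x_i - x_j),
    rescaled to total variation 2 and oriented so that its integral of f is positive."""
    raw = []
    for i, xi in enumerate(points):
        prod = Fraction(1)
        for j, xj in enumerate(points):
            if j != i:
                prod *= (xi - xj)
        raw.append(1 / prod)
    scale = 2 / sum(abs(w) for w in raw)
    weights = [scale * w for w in raw]
    if sum(w * f(x) for w, x in zip(weights, points)) < 0:
        weights = [-w for w in weights]
    return weights

def expect(mu, g):
    """Expectation of g under a discrete measure given as {atom: probability}."""
    return sum(prob * g(x) for x, prob in mu.items())

points = [Fraction(-1), Fraction(-1, 2), Fraction(0), Fraction(1, 2), Fraction(1)]
weights = alternation_measure(points, abs)
mu_1 = {x: w for x, w in zip(points, weights) if w > 0}    # positive part of the signed measure
mu_0 = {x: -w for x, w in zip(points, weights) if w < 0}   # negative part of the signed measure

print(mu_1, mu_0)
print(expect(mu_1, abs) - expect(mu_0, abs))  # compare with 2 * E_3(|x|; [-1,1]) = 1/4
```

Here \mu_1 and \mu_0 are probability measures with identical moments up to degree 3, while their expectations of |x| differ by exactly twice the approximation error.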

By Lemma 5, it boils down to the best degree-K polynomial approximation error of |x| on [-1,1]. This error is analyzed in approximation theory and summarized in the following lemma. 

Lemma 6 There is a numerical constant (known as the Bernstein’s constant) \beta_\star \approx 0.280169499 such that

E_K(|x|; [-1,1]) = (\beta_\star+o_K(1))K^{-1}.

Hence, by Lemmas 5 and 6, the condition of Theorem 1 is satisfied with \Delta = \Theta(pK^{-1}) = \Theta(p\cdot \frac{\log\log p}{\log p}). As a result, we finally arrive at R_p^\star = \Omega(p\cdot \frac{\log\log p}{\log p}).

3. Examples in Poisson Models 

In this section, we present examples in i.i.d. sampling models from a discrete distribution with a large support. We show that a general Poissonization technique allows us to operate in the simpler Poisson models, and use moment matching to either finite or growing degrees to establish tight lower bounds. 

3.1. Poissonization and Approximate Distribution 

Throughout this section the statistical model is i.i.d. sampling from a discrete distribution X_1,\cdots,X_n \sim P=(p_1,\cdots,p_k), where n denotes the sample size and k denotes the support size. It is well-known that the histogram (h_1,\cdots,h_k) with 

h_j := \sum_{i=1}^n 1(X_i = j), \quad \forall j\in [k]

constitutes a sufficient statistic. Moreover, h_j\sim \mathsf{B}(n,p_j) for all j\in [k], and (h_1,\cdots,h_k) are negatively dependent. To remove the dependence among different bins, recall the following Poissonized model: 

Definition 7 (Poissonization) In the Poissonized model, h_j\sim \mathsf{Poi}(np_j) for all j\in [k] and they are mutually independent. 

In other words, in the Poissonized model we draw a random number N\sim \mathsf{Poi}(n) of samples from P and then compute the histogram. In Lecture 3 we have shown the asymptotic equivalence of the i.i.d. sampling model and the Poissonized model, but the arguments are highly asymptotic. The following lemma establishes a non-asymptotic relationship. 

Lemma 8 For a given statistical task, let R_n and R_n^\star be the minimax risk under the i.i.d. sampling model and the Poissonized model with design sample size n, respectively. Then

\frac{1}{2}R_{2n} \le R_n^\star \le R_{n/2} + R_0 e^{-n/8}.

Proof: Let R_n(\pi), R_n^\star(\pi) be the Bayes risks under prior \pi in respective models. The desired inequality for Bayes risks follows from the identity 

R_n^\star(\pi) = \sum_{m=0}^\infty R_m(\pi)\cdot \mathop{\mathbb P}(\mathsf{Poi}(n) = m),

the Poisson tail bounds and the monotonicity of Bayes risks R_0(\pi)\ge R_1(\pi) \ge \cdots. Now the minimax theorem gives the desired lemma. 

To establish the first identity, simply note that under any prior distribution \pi, the Bayes estimator under the Poissonized model given the realization N=m is exactly the Bayes estimator under the i.i.d. sampling model with m samples. \Box 
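The two constants in Lemma 8 come from Poisson tail probabilities. One plausible reading of the proof (the lecture only says "the Poisson tail bounds") is that the lower bound uses \mathop{\mathbb P}(\mathsf{Poi}(n)\le 2n)\ge 1/2 and the upper bound uses \mathop{\mathbb P}(\mathsf{Poi}(n) < n/2)\le e^{-n/8}. The sketch below, with illustrative function names, evaluates both probabilities for a few even values of n.

```python
import math

def poisson_cdf(n, m):
    """P(Poi(n) <= m), accumulated term by term to avoid large factorials."""
    if m < 0:
        return 0.0
    term = math.exp(-n)
    total = term
    for x in range(1, m + 1):
        term *= n / x
        total += term
    return total

def lemma8_tail_checks(n):
    """Return (P(N <= 2n), P(N < n/2), exp(-n/8)) for N ~ Poi(n), with n an even integer."""
    return poisson_cdf(n, 2 * n), poisson_cdf(n, n // 2 - 1), math.exp(-n / 8)

for n in (10, 50, 200):
    print(n, lemma8_tail_checks(n))
```

Combined with the monotonicity R_0(\pi)\ge R_1(\pi)\ge\cdots, these two tail estimates turn the mixture identity for R_n^\star(\pi) into the two-sided inequality of the lemma.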

Lemma 8 shows that the minimax risk essentially does not change after Poissonization. We sometimes also consider approximate distributions in the Poissonized model, where \sum_{j=1}^k p_j may not necessarily sum to one (note that the distribution of the histogram is still well-defined). The approximate distribution is typically used in lower bounds where a product prior is assigned to (p_1,\cdots,p_k) and cannot preserve the distribution property. For statistical problems where the objective function or hypothesis depends on the vector P=(p_1,\cdots,p_k) in a nice way, so that changing P into \lambda P with \lambda\approx 1 will not change the objective much, in the lower bound it typically suffices to consider the approximate distributions in the Poissonized model. The key idea is that, conditioning on \sum_{j=1}^k h_j=m, the histogram (h_1,\cdots,h_k) is exactly distributed as that in the i.i.d. sampling model from P/\|P\|_1 with m samples. Then we may construct the estimator in the Poissonized model from the hypothetical optimal estimators (with different sample sizes) in the i.i.d. sampling model, and applying the same tail bounds as in the proof of Lemma 8 suffices. The details of the arguments may vary from example to example, but in most scenarios it will not hurt the lower bound. 

3.2. Generalized Uniformity Testing 

Consider the following generalized uniformity testing problem: given n i.i.d. observations from some discrete distribution P supported on at most k elements, one would like to test whether the underlying distribution P is uniform on its support. Note the difference from the traditional uniformity testing problem: the distribution P may be supported on a subset of [k] while still being uniform on this subset. Specifically, the task is to determine the sample complexity of distinguishing between the hypotheses H_0: P is uniform on its support and H_1: P is \varepsilon-away from any uniform distribution with support \subseteq [k] under the \ell_1 distance. We will show that the desired sample complexity is lower bounded by 

\Omega\left(\frac{\sqrt{k}}{\varepsilon^2} + \frac{k^{2/3}}{\varepsilon^{4/3}} \right).

Note that the first term \Omega(\sqrt{k}/\varepsilon^2) trivially follows from Paninski’s construction in the traditional uniformity testing problem, so the goal is to prove the second term. It is expected that the second term captures the difficulty of recovering the support of the distribution, and therefore we should consider a mixture of uniform distributions with random support in H_0. Specifically, we consider the following mixture: let U,U' be two random variables with 

U = \begin{cases} 0 & \text{w.p. } \frac{\varepsilon^2}{1+\varepsilon^2} \\ \frac{1+\varepsilon^2}{k} & \text{w.p. } \frac{1}{1+\varepsilon^2} \end{cases}, \qquad U' = \begin{cases} \frac{1-\varepsilon}{k} & \text{w.p. } \frac{1}{2} \\ \frac{1+\varepsilon}{k} & \text{w.p. } \frac{1}{2} \end{cases}.

Assign the k-fold product distribution of U (or U') to the probability vector (p_1,\cdots,p_k), then (p_1,\cdots,p_k) forms an approximate probability distribution since \mathop{\mathbb E}[U] = \mathop{\mathbb E}[U'] = k^{-1}. Moreover, under prior U the (normalized) distribution is always uniform, and under prior U' the (normalized) distribution is \Omega(\varepsilon)-far from any uniform distribution supported on a subset of [k] with high probability. Hence, neglecting the additional details for the approximate distribution, it suffices to show that 

\chi^2\left(\mathop{\mathbb E}[\mathsf{Poi}(nU)], \mathop{\mathbb E}[\mathsf{Poi}(nU')] \right) = O\left(\frac{1}{k}\right), \quad \text{if }n = O\left(\frac{k^{2/3}}{\varepsilon^{4/3}}\right).

To establish the above bound for the \chi^2-divergence, note that the random variables U,U' are chosen carefully with \mathop{\mathbb E}[U^m] = \mathop{\mathbb E}[(U')^m] for m=0,1,2. Moreover, 

|\mathop{\mathbb E}[(U-k^{-1})^m] - \mathop{\mathbb E}[(U' - k^{-1})^m ]| \le \frac{2\varepsilon^2}{k^m}, \qquad m\ge 3.

Consequently, Theorem 4 gives that 

\chi^2\left(\mathop{\mathbb E}[\mathsf{Poi}(nU)], \mathop{\mathbb E}[\mathsf{Poi}(nU')] \right) \le e^{n\varepsilon/k} \sum_{m=3}^\infty \frac{4\varepsilon^4}{m!(n/k)^m}\left(\frac{n}{k}\right)^{2m} = e^{n\varepsilon/k} \sum_{m=3}^\infty \frac{4\varepsilon^4}{m!}\left(\frac{n}{k}\right)^{m}.

Since k^{2/3}/\varepsilon^{4/3} is a stronger lower bound than \sqrt{k}/\varepsilon^2 if and only if k\ge \varepsilon^{-4}, under this condition and n\le k^{2/3}/\varepsilon^{4/3} we will have n\le k. Then the above inequality gives 

\chi^2\left(\mathop{\mathbb E}[\mathsf{Poi}(nU)], \mathop{\mathbb E}[\mathsf{Poi}(nU')] \right) = O\left(\frac{n^3\varepsilon^4}{k^3}\right),

which is the claimed upper bound on the \chi^2-divergence, establishing the lower bound.
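The two moment properties of U and U' used above are easy to verify in exact arithmetic. The following Python sketch does so for one hypothetical choice of parameters (k=100, \varepsilon=1/10); the function names are illustrative.

```python
from fractions import Fraction

def priors(k, eps):
    """Two-point priors U and U' from the construction above, as lists of (value, probability)."""
    U = [(Fraction(0), eps**2 / (1 + eps**2)), ((1 + eps**2) / k, 1 / (1 + eps**2))]
    U_prime = [((1 - eps) / k, Fraction(1, 2)), ((1 + eps) / k, Fraction(1, 2))]
    return U, U_prime

def moment(prior, m, center=Fraction(0)):
    """E[(V - center)^m] for a discrete random variable V."""
    return sum(prob * (value - center) ** m for value, prob in prior)

k, eps = 100, Fraction(1, 10)
U, U_prime = priors(k, eps)

# Raw moments of degree 0, 1, 2 coincide exactly.
raw_gaps = [moment(U, m) - moment(U_prime, m) for m in range(3)]

# Centered moment gaps of degree m >= 3, to be compared with 2 * eps^2 / k^m.
centered_gaps = {m: abs(moment(U, m, Fraction(1, k)) - moment(U_prime, m, Fraction(1, k)))
                 for m in range(3, 9)}

print(raw_gaps)
for m, gap in centered_gaps.items():
    print(m, float(gap), float(2 * eps**2 / Fraction(k) ** m))
```

The third centered moments already differ, which is consistent with the sum in the \chi^2 bound starting at m=3.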

We provide some discussion on the choice of U and U'. Theorem 4 suggests that if U and U' could match more moments, the \chi^2-divergence could be even smaller. However, the number of matched moments is in fact limited by the problem structure. In the traditional uniformity testing problem, in the last lecture we essentially chose U \equiv 1/k and U' as above. In this case, U and U' only match the first moment, which is the best possible since U must be a constant. In generalized uniformity testing, U can only be supported on two points, one of which is zero. Meanwhile, the support of U' can potentially be arbitrarily large. The next lemma shows that no matter how we choose U', we can match at most the first two moments. 

Lemma 9 Let \mu be a probability measure supported on k elements of [0,\infty), one of which is zero, and \mu' be any probability measure supported on [0,\infty). Then if \mu and \mu' match the first 2k-1 moments, we must have \mu = \mu'.

Proof: Let X\sim \mu, X'\sim \mu'. Let the support of \mu be 0,x_1,\cdots,x_{k-1}. Consider the polynomial Q(x) = x\prod_{i=1}^{k-1}(x-x_i)^2 of degree 2k-1. Since Q vanishes on the support of \mu, the moment matching assumption gives \mathop{\mathbb E}[Q(X')] = \mathop{\mathbb E}[Q(X)]=0. Since Q(X')\ge 0 always holds, we have Q(X')=0 almost surely, i.e., \mu' is supported on \{0,x_1,\cdots,x_{k-1}\}. Finally, two measures on these k points with the same moments of degrees 0,1,\cdots,k-1 have the same weights (the Vandermonde system is invertible), so \mu' = \mu. \Box 

Lemma 9 applied to k=2 shows that moment matching up to degree 2 is the best we can hope for. In fact, the above lower bound is also tight (see bibliographic notes). 
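The k=2 case of Lemma 9 can be checked directly on the priors U and U' used above. With a = (1+\varepsilon^2)/k denoting the nonzero atom of U, the polynomial Q(x) = x(x-a)^2 vanishes on the support of U but not on that of U', and because the first two moments match, \mathop{\mathbb E}[Q(U')] equals the gap between the third moments. The sketch below verifies this in exact arithmetic; the values of k and \varepsilon and the helper name are illustrative.

```python
from fractions import Fraction

def expect(points, probs, f):
    """Expectation of f under a finitely supported distribution."""
    return sum(p * f(x) for x, p in zip(points, probs))

k, eps = 50, Fraction(1, 10)   # illustrative values
a = (1 + eps**2) / k           # nonzero atom of U
U = ([Fraction(0), a], [eps**2 / (1 + eps**2), 1 / (1 + eps**2)])
U_prime = ([(1 - eps) / k, (1 + eps) / k], [Fraction(1, 2), Fraction(1, 2)])

def Q(x):
    """The degree-3 polynomial from the proof of Lemma 9 with k = 2."""
    return x * (x - a) ** 2

eq_u = expect(*U, Q)                 # zero, since Q vanishes on {0, a}
eq_u_prime = expect(*U_prime, Q)     # strictly positive
third_gap = (expect(*U_prime, lambda x: x**3)
             - expect(*U, lambda x: x**3))

print(eq_u, eq_u_prime, third_gap)
```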

3.3. Shannon Entropy Estimation 

Finally we revisit the Shannon entropy estimation problem where the target is to estimate the Shannon entropy H(P) = \sum_{i=1}^k -p_i\log p_i. Let R_{n,k}^\star be the minimax risk of estimating H(P) under the mean squared error. Our target is to show that 

R_{n,k}^\star = \Omega\left( \left(\frac{k}{n\log n}\right)^2 + \frac{\log^2 k}{n} \right), \qquad \text{if } n = \Omega\left(\frac{k}{\log k}\right).

The lower bound \Omega(n^{-1}\log^2 k) has already been shown via the two-point method in Lecture 5, and therefore the remaining target is to establish \Omega(k^2/(n\log n)^2).

Similar to the L_1 norm example above, the Shannon entropy H(P) is a symmetric sum of individual functions of p_i, where the individual function is non-differentiable at zero. This suggests applying similar ideas based on moment matching and best polynomial approximation to establish the lower bound. Specifically, the target is to construct two priors \mu_0, \mu_1 on the interval [0,M] (with parameter M>0 to be chosen later) such that: 

  1. The priors \mu_0, \mu_1 have matched moments up to degree K (to be chosen later); 
  2. The difference between \mathop{\mathbb E}_{p\sim \mu_0}[-p\log p] and \mathop{\mathbb E}_{p\sim \mu_1}[-p\log p] is large; 
  3. With high probability, the Shannon entropy H(P) under P\sim \mu_0^{\otimes k} and P\sim \mu_1^{\otimes k} is well-separated; 
  4. The common mean \mathop{\mathbb E}_{p\sim \mu_0}[p] is at most O((n\log n)^{-1}).

Note that the first requirement (moment matching) ensures a small TV distance between the mixtures, the second and third requirements ensure the separation property (i.e., lower bound the mean difference and upper bound the fluctuations), and the last requirement ensures that (p_1,\cdots,p_k) sums to a constant smaller than one (recall that k=O(n\log n) by assumption), so that setting p_{k+1} := 1-\sum_{i=1}^k p_i gives a valid probability vector (p_1,\cdots,p_{k+1}). We check these requirements one by one to find proper parameters (M,K).

For the first requirement, recall that Theorem 4 gives 

\|\mathop{\mathbb E}_{p\sim \mu_0}[\mathsf{Poi}(np)] -\mathop{\mathbb E}_{p\sim \mu_1}[\mathsf{Poi}(np)] \|_{\text{TV}}^2 \le \sum_{m=K+1}^\infty \frac{(nM)^{2m}}{m!(nM)^m} \le \sum_{m=K+1}^\infty \left(\frac{enM}{m}\right)^m.

Since the individual TV distance should be at most O(k^{-1}) (for the future triangle inequality), we should choose K = \Omega(\max\{nM, \log k \}) = \Omega(\max\{nM,\log n\}), where \log k\asymp \log n is due to the assumption n=\Omega(k/\log k) and the condition n= O(k^2), under which the first term \Omega(k^2/(n\log n)^2) is dominant in the minimax risk. 
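To see how large K must be in this bound, the following sketch evaluates a truncated version of the series \sum_{m>K}(enM/m)^m and searches for the smallest K that pushes it below k^{-2}, so that the TV distance itself is at most 1/k. The function names, the truncation length, and the threshold k^{-2} without constants are our illustrative choices.

```python
import math

def tail_bound(nM, K, extra_terms=500):
    """Truncated value of sum_{m > K} (e * nM / m)^m, computed in log-space."""
    total = 0.0
    for m in range(K + 1, K + 1 + extra_terms):
        total += math.exp(m * (1.0 + math.log(nM) - math.log(m)))
    return total

def smallest_degree(nM, k):
    """Smallest K whose tail bound is at most 1/k^2 (no constants tuned)."""
    K = 1
    while tail_bound(nM, K) > 1.0 / k**2:
        K += 1
    return K

for nM, k in [(1.0, 10**3), (5.0, 10**3), (5.0, 10**6), (20.0, 10**3)]:
    print("nM =", nM, "k =", k, "smallest K =", smallest_degree(nM, k))
```

The output should grow with both nM and \log k, in line with K = \Omega(\max\{nM, \log k\}).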

For the second requirement, by the duality result in Lemma 5 it is essentially the best degree-K polynomial approximation error of -x\log x on [0,M]. The next lemma gives the best approximation error. 

Lemma 10

E_K(-x\log x; [0,M]) = \Theta\left(\frac{M}{K^2}\right).

By Lemma 10, the target is to maximize M/K^2 subject to the previous condition K=\Omega(\max\{nM,\log n \}). Simple algebra shows that the maximum is \Theta((n\log n)^{-1} ), with the maximizer M \asymp n^{-1}\log n, K\asymp \log n.
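For completeness, the algebra can be spelled out as follows, writing the constraint as K \ge c\max\{nM, \log n\} for a constant c>0 (the constant c is a placeholder introduced here for illustration).

```latex
% Constraint: K \ge c \max\{nM, \log n\}, i.e. M \le K/(cn) and K \ge c \log n.
\max_{M,K} \frac{M}{K^2}
  \quad \text{s.t.} \quad M \le \frac{K}{cn}, \quad K \ge c\log n .
% For fixed K the best choice is M = K/(cn), which leaves a decreasing function of K:
\frac{M}{K^2} \le \frac{1}{c\,nK} \le \frac{1}{c^2\, n\log n},
% with equality at K = c\log n and M = \frac{\log n}{n}.
```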

To resolve the third requirement, note that the mean difference of H(P) is \Theta(k/(n\log n)) by the above choice of M and K. Moreover, since -p\log p\in [0, c(\log n)^2/n] for some constant c>0 if p\in [0,M], the sub-Gaussian concentration shows that the fluctuation of H(P) under both \mu_i^{\otimes k} is at most O(\sqrt{k}(\log n)^2/n). The ratio of the fluctuation to the mean difference is therefore O((\log n)^3/\sqrt{k}), which vanishes since n=O(k^2) gives \sqrt{k} = \Omega(n^{1/4}); hence the fluctuation is indeed negligible compared with the mean difference. Careful analysis also shows that the contribution of p_{k+1} to the entropy difference is negligible. 

The last requirement calls for proper modifications of the priors \mu_0, \mu_1 constructed in Lemma 5 to satisfy the mean constraint. This can be done via the following change-of-measure trick: let \nu_0, \nu_1 be the priors constructed in Lemma 5 which attain E_{\log n}(-\log x; [\frac{1}{n\log n}, \frac{\log n}{n}]), whose value is summarized in the following lemma. 

Lemma 11

E_{\log n}\left(-\log x; \left[\frac{1}{n\log n}, \frac{\log n}{n}\right]\right) = \Theta(1).

Next we construct the priors \mu_0, \mu_1 as follows: for i=0,1, set 

\mu_i(dx) = \left(1 - \mathop{\mathbb E}_{X\sim\nu_i}\left[\frac{1}{nX\log n}\right] \right)\delta_0(dx) + \frac{\nu_i(dx)}{nx\log n}.

Then it is easy to show that both \mu_0 and \mu_1 are probability measures, have matched moments up to degree K+1, and have mean (n\log n)^{-1}. Moreover, 

\mathop{\mathbb E}_{x\sim \mu_1} \left[-x\log x \right] - \mathop{\mathbb E}_{x\sim \mu_0} \left[-x\log x \right] = \frac{\mathop{\mathbb E}_{x\sim \nu_1}[-\log x] - \mathop{\mathbb E}_{x\sim \nu_0}[-\log x] }{n\log n} \asymp \frac{1}{n\log n}.

Hence, the fourth requirement is fulfilled without hurting the previous ones. 
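The change of measure above is easy to experiment with on a discrete prior. The sketch below applies it to a hypothetical three-atom \nu on [\frac{1}{n\log n}, \frac{\log n}{n}] (not the extremal prior from Lemma 5, which we do not construct here) and checks the total mass, the mean, the shift of moments by one degree, and the identity for \mathop{\mathbb E}[-x\log x]. All names and numerical values are illustrative.

```python
import math

def tilt_prior(atoms, weights, n):
    """Map a discrete nu to mu(dx) = (1 - E[1/(n X log n)]) delta_0 + nu(dx)/(n x log n)."""
    scale = n * math.log(n)
    new_weights = [w / (scale * x) for x, w in zip(atoms, weights)]
    mass_at_zero = 1.0 - sum(new_weights)
    return [0.0] + list(atoms), [mass_at_zero] + new_weights

def expect(atoms, weights, f):
    return sum(w * f(x) for x, w in zip(atoms, weights))

def neg_x_log_x(x):
    """-x log x with the convention 0 log 0 = 0."""
    return 0.0 if x == 0.0 else -x * math.log(x)

n = 1000
scale = n * math.log(n)
lo, hi = 1.0 / scale, math.log(n) / n

# A hypothetical prior nu supported inside [lo, hi].
nu_atoms, nu_weights = [lo, 2 * lo, hi], [0.2, 0.3, 0.5]
mu_atoms, mu_weights = tilt_prior(nu_atoms, nu_weights, n)

print("total mass :", sum(mu_weights))
print("mean       :", expect(mu_atoms, mu_weights, lambda x: x), "vs", 1.0 / scale)
print("2nd moment :", expect(mu_atoms, mu_weights, lambda x: x**2),
      "vs", expect(nu_atoms, nu_weights, lambda x: x) / scale)
print("E[-x log x]:", expect(mu_atoms, mu_weights, neg_x_log_x),
      "vs", expect(nu_atoms, nu_weights, lambda x: -math.log(x)) / scale)
```

Since the j-th moment of \mu_i equals the (j-1)-th moment of \nu_i divided by n\log n, matching moments of \nu_0, \nu_1 up to degree K translates into matching moments of \mu_0, \mu_1 up to degree K+1.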

In summary, Theorem 1 holds with \Delta = \Theta(k^2/(n\log n)^2), and we arrive at the desired lower bound R_{n,k}^\star = \Omega(k^2/(n\log n)^2 ).

4. Bibliographic Notes 

The method of two fuzzy hypotheses (Theorem 1) is systematically used in Ingster and Suslina (2012) on nonparametric testing, and the current form of the theorem is taken from Theorem 2.14 of Tsybakov (2009). Statistical closeness of Gaussian mixture models via moment matching (Theorem 3) was established in Cai and Low (2011), Hardt and Price (2015), Wu and Yang (2018) for the \chi^2-divergence, while the result for the TV distance is new. For Theorem 4, the TV version was established in part by Jiao, Venkat, Han and Weissman (2015) and Jiao, Han and Weissman (2018), while the \chi^2 version is new here. When U,U'\in [0,M], a stronger bound of the TV distance without the square root is also available in Wu and Yang (2016). For more properties of Hermite and Charlier polynomials, we refer to Labelle and Yeh (1989). 

For Gaussian examples, the proper learning of two-component Gaussian mixtures was established in Hardt and Price (2015). The L_1 norm estimation problem was taken from Cai and Low (2011), which was further motivated by Lepski, Nemirovski and Spokoiny (1999). The Gaussian mean testing example under the L_1 metric was taken from Ingster and Suslina (2012). For technical lemmas, the proof of Lemma 5 is taken from Lepski, Nemirovski and Spokoiny (1999) and Wu and Yang (2016), and Lemma 6 is due to Bernstein (1912). 

For Poisson examples, the non-asymptotic equivalence between the i.i.d. sampling model and the Poissonized model is due to Jiao, Venkat, Han and Weissman (2015). The tight bounds for the generalized uniformity testing problem are due to Diakonikolas, Kane and Stewart (2018), whose proof is greatly simplified here thanks to Theorem 4. For Shannon entropy estimation, the optimal sample complexity was obtained in Valiant and Valiant (2011), and the minimax risk was obtained independently in Jiao, Venkat, Han and Weissman (2015) and Wu and Yang (2016). For the approximation theory tools used to establish Lemmas 10 and 11, we refer to the books DeVore and Lorentz (1993) and Ditzian and Totik (2012). 

  1. Yuri Ingster and Irina A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Vol. 169. Springer Science & Business Media, 2012. 
  2. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. 
  3. T. Tony Cai and Mark G. Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics 39.2 (2011): 1012–1041. 
  4. Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two Gaussians. Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, 2015. 
  5. Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. arXiv preprint arXiv:1807.07237 (2018). 
  6. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835–2885. 
  7. Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the L_1 distance. IEEE Transactions on Information Theory 64.10 (2018): 6672–6706. 
  8. Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702–3720. 
  9. Jacques Labelle and Yeong Nan Yeh. The combinatorics of Laguerre, Charlier, and Hermite polynomials. Studies in Applied Mathematics 80.1 (1989): 25–36. 
  10. Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the L_r norm of a regression function. Probability theory and related fields 113.2 (1999): 221–253. 
  11. Serge Bernstein. Sur l’ordre de la meilleure approximation des fonctions continues par des polynomes de degré donné. Vol. 4. Hayez, imprimeur des académies royales, 1912. 
  12. Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems. 2018. 
  13. Gregory Valiant and Paul Valiant. The power of linear estimators. 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science. IEEE, 2011. 
  14. Ronald A. DeVore and George G. Lorentz. Constructive approximation. Vol. 303. Springer Science & Business Media, 1993. 
  15. Zeev Ditzian and Vilmos Totik. Moduli of smoothness. Vol. 9. Springer Science & Business Media, 2012. 
