Lecture 8: Multiple Hypothesis Testing: Tree, Fano and Assouad

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

In the last three lectures we systematically introduced Le Cam’s two-point method with generalizations and various examples. The main idea of the two-point method is to reduce the problem at hand to a binary hypothesis testing problem, where the hypotheses may be either simple or composite. This approach works when the target is to estimate a scalar (even when the underlying statistical model may involve high-dimensional parameters), while it typically fails when the target is to recover a high-dimensional vector of parameters. For example, even in the simplest Gaussian location model $X\sim \mathcal{N}(\theta, I_p)$ where the target is to estimate $\theta$ under the mean squared loss, the best two-point approach gives a lower bound $R_p^\star = \Omega(1)$ while we know from Lecture 4 that the minimax risk is $R_p^\star = p$. To overcome this difficulty, recall from the minimax theorem that a suitable prior should always work, which by discretization motivates the idea of multiple hypothesis testing.

In this lecture, we develop the general theory of reducing problems into multiple hypothesis testing, and present several tools including tree-based methods, Assouad’s lemma and Fano’s inequality. In the next two lectures we will enumerate more examples (possibly beyond the minimax estimation in statistics) and variations of the above tools.

1. General Tools of Multiple Hypothesis Testing

This section presents the general theory of applying multiple hypothesis testing to estimation problems, as well as some important tools such as the tree-based method, Fano’s inequality and Assouad’s lemma. Recall from the two-point method that the separability and indistinguishability conditions are of utmost importance when applying testing arguments. The above tools require different separability conditions and express the indistinguishability condition in terms of different divergence functions.

1.1. Tree-based method and Fano’s inequality

We start with possibly the simplest separability condition, i.e., the chosen parameters (hypotheses) $\theta_1, \cdots, \theta_m\in \Theta$ are pairwise separated. The following lemma shows that the minimax estimation error is lower bounded in terms of the average test error.

Lemma 1 (From estimation to testing) Let $L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+$ be any loss function, and suppose there exist $\theta_1,\cdots,\theta_m\in \Theta$ such that

$$L(\theta_i, a) + L(\theta_j, a) \ge \Delta, \qquad \forall i\neq j\in [m], a \in \mathcal{A}.$$

Then

$$\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i),$$

where the infimum is taken over all measurable tests $\Psi: \mathcal{X} \rightarrow [m]$.

Proof: For any given estimator $\hat{\theta}$, we construct a test $\Psi$ as follows:

$$\Psi(X) \in \arg\min_{i\in [m]} L(\theta_i, \hat{\theta}(X)).$$

Then the separability condition implies that

$$L(\theta_i, \hat{\theta}(X)) \ge \frac{\Delta}{2}\cdot 1(\Psi(X) \neq i), \qquad \forall i\in [m].$$

The rest follows from the fact that the maximum is lower bounded by the average. $\Box$

Remark 1 The following weaker separability condition also suffices for Lemma 1 to hold: for any $i\neq j\in [m]$ and $a\in \mathcal{A}$, the inequality $L(\theta_i, a)\le \Delta/2$ always implies that $L(\theta_j, a)\ge \Delta/2$.

The remaining quantity in Lemma 1 is the optimal average test error $\inf_{\Psi}m^{-1}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i)$, for which we are looking for a lower bound. Recall that when $m=2$, this quantity equals $(1- \|P_{\theta_1} - P_{\theta_2}\|_{\text{TV}})/2$ by Le Cam’s first lemma. For general $m\ge 2$, we have the following two different lower bounds.

Lemma 2 (Tree-based inequality) Let $T=([m], E)$ be any undirected tree on vertex set $[m]$ with edge set $E$. Then

$$\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) \ge \frac{1}{m}\sum_{(i,j)\in E} (1 - \|P_{\theta_i} - P_{\theta_j}\|_{\text{\rm TV}}).$$

Proof: It is straightforward to see that

$$\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) = 1 - \frac{1}{m}\int \max\{dP_{\theta_1}(x), \cdots, dP_{\theta_m}(x)\}. \ \ \ \ \ (1)$$

It is also easy (left as an exercise) to establish the following elementary inequality: for any reals $x_1, \cdots, x_m$ and any tree $T=([m], E)$, we have

$$\sum_{i=1}^m x_i - \max_{i\in [m]} x_i \ge \sum_{(i,j)\in E} \min\{x_i, x_j\}.$$

Now using $\int \min\{dP, dQ \} = 1 - \|P-Q\|_{\text{TV}}$ gives the desired inequality. $\Box$
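As a sanity check, identity (1) and Lemma 2 can be compared numerically on finite sample spaces. The following is a minimal sketch, assuming randomly generated pmfs on $k$ points; the helper names `tv`, `optimal_avg_error` and `tree_lower_bound` are hypothetical and not part of the lecture.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs on the same finite set."""
    return 0.5 * float(np.abs(p - q).sum())

def optimal_avg_error(P):
    """Exact optimal average test error via identity (1); rows of P are pmfs."""
    m = P.shape[0]
    return 1.0 - float(P.max(axis=0).sum()) / m

def tree_lower_bound(P, edges):
    """Right-hand side of Lemma 2 for a tree given as a list of edges (i, j)."""
    m = P.shape[0]
    return sum(1.0 - tv(P[i], P[j]) for i, j in edges) / m

# Assumption: m random hypotheses on a sample space of size k (illustration only).
rng = np.random.default_rng(0)
m, k = 5, 8
P = rng.dirichlet(np.ones(k), size=m)

star = [(0, j) for j in range(1, m)]       # star graph centered at vertex 0
path = [(j, j + 1) for j in range(m - 1)]  # path graph

exact = optimal_avg_error(P)
star_bound = tree_lower_bound(P, star)
path_bound = tree_lower_bound(P, path)
print(f"exact={exact:.4f} star={star_bound:.4f} path={path_bound:.4f}")
```

Different trees give different bounds, so in applications the tree is chosen to suit the structure of the hypotheses.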

Lemma 3 (Fano’s inequality) Let $P_i := P_{\theta_i}$ and $\bar{P} := m^{-1}\sum_{i=1}^m P_i$. Then

$$\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) \ge 1 - \frac{1}{\log m}\left(\frac{1}{m}\sum_{i=1}^m D_{\text{\rm KL}}(P_i \| \bar{P}) +\log 2\right).$$

Remark 2 If we introduce auxiliary random variables $V\sim \mathsf{Uniform}([m])$ and $X$ with $P_{X|V=i} := P_{\theta_i}$, then

$$\frac{1}{m}\sum_{i=1}^m D_{\text{\rm KL}}(P_i \| \bar{P}) = I(V;X),$$

where $I(V;X)$ denotes the mutual information between $V$ and $X$.

Proof: We present two proofs of Lemma 3. The first proof builds upon the representation (1) and is more analytical, while the second proof makes use of the data-processing inequality, which is essentially the classical information-theoretic proof of Fano’s inequality.

(First Proof) By (1), it suffices to prove that

$$\mathop{\mathbb E}_{\bar{P}} \left[ \max_{i\in [m]} \left\{\frac{dP_i}{d\bar{P}} \right\} \right] \le \frac{1}{\log m}\left(\sum_{i=1}^m \mathop{\mathbb E}_{\bar{P}} \left[\frac{dP_i}{d\bar{P}}\log \frac{dP_i}{d\bar{P}}\right] + m\log 2\right).$$

By the linearity of expectation, it further suffices to prove that for any non-negative reals $x_1, \cdots, x_m$ with $\sum_{i=1}^m x_i = m$, we have

$$\max_{i\in [m]} x_i \le \frac{1}{\log m}\left(\sum_{i=1}^m x_i\log x_i + m\log 2\right).$$

To establish this inequality, let $t:=\max_{i\in [m]} x_i$. Then by the convexity of $x\mapsto x\log x$, Jensen’s inequality gives

$$\sum_{i=1}^m x_i\log x_i \ge t\log t + (m-t)\log\frac{m-t}{m-1} \ge m\log \frac{m}{2} - (m-t)\log m.$$

Plugging in the above inequality completes the proof of Lemma 3.

(Second Proof) Introduce the auxiliary random variables $V$ and $X$ as in Remark 2. For any fixed $X$, consider the kernel $\mathsf{K}_X: [m]\rightarrow \{0,1\}$ which sends $i\in [m]$ to $1(\Psi(X) = i)$, then $\mathsf{K}_X\circ P_V$ is the Bernoulli distribution with parameter $m^{-1}$, and $\mathsf{K}_X\circ P_{V|X}$ is the Bernoulli distribution with parameter $p_X := \mathop{\mathbb P}_{V|X}(V = \Psi(X))$. By the data-processing inequality of KL divergence,

$$\begin{array}{rcl} D_{\text{KL}}(P_{V|X} \| P_V) & \ge & p_X \log(mp_X) + (1-p_X)\log (1-p_X) \\ &\ge& p_X\log m - \log 2. \end{array}$$

Now taking expectation on $X$ at both sides gives

$$I(V;X) = \mathop{\mathbb E}_X D_{\text{KL}}(P_{V|X} \| P_V) \ge \mathop{\mathbb P}(V = \Psi(X))\cdot \log m - \log 2,$$

and rearranging completes the proof. $\Box$
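To see how tight Lemma 3 is in a concrete case, one can compare the Fano bound with the exact error from (1) on a finite sample space. This is a minimal sketch, assuming $m$ randomly generated pmfs that are close to each other; all function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats for pmfs with supp(p) inside supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(P):
    """I(V;X) for V uniform over the rows of P, as in Remark 2."""
    p_bar = P.mean(axis=0)
    return float(np.mean([kl(p, p_bar) for p in P]))

def fano_lower_bound(P):
    """Right-hand side of Lemma 3."""
    m = P.shape[0]
    return 1.0 - (mutual_information(P) + np.log(2.0)) / np.log(m)

def optimal_avg_error(P):
    """Exact optimal average test error via identity (1)."""
    m = P.shape[0]
    return 1.0 - float(P.max(axis=0).sum()) / m

# Assumption: 16 nearby hypotheses on 32 points (a concentrated Dirichlet draw).
rng = np.random.default_rng(0)
P = rng.dirichlet(50.0 * np.ones(32), size=16)

fano = fano_lower_bound(P)
exact = optimal_avg_error(P)
print(f"I(V;X)={mutual_information(P):.4f} fano={fano:.4f} exact={exact:.4f}")
```

When the hypotheses are hard to distinguish, the mutual information is small and the bound approaches $1 - \log 2/\log m$.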

Lemma 2 decomposes a multiple hypothesis testing problem into several binary testing problems, which cannot outperform the best two-point methods in typical scenarios but can be useful when there is some external randomness associated with each $P_{\theta_i}$ (see the bandit example later). Lemma 3 is the well-known Fano’s inequality involving the mutual information, and the additional $\log m$ term is the key difference in multiple hypothesis testing. Hence, the typical lower bound arguments are to apply Lemma 1 together with Lemma 3 (or sometimes Lemma 2), with the following lemma which helps to upper bound the mutual information.

Lemma 4 (Variational representation of mutual information)

$$I(V;X) = \inf_{Q_X} \mathop{\mathbb E}_V [D_{\text{\rm KL}}(P_{X|V} \| Q_X )].$$

Proof: Simply verify that

$$I(V;X) = \mathop{\mathbb E}_V [D_{\text{\rm KL}}(P_{X|V} \| Q_X )] - D_{\text{KL}}(P_X\|Q_X).$$ $\Box$
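The identity in the proof of Lemma 4 is easy to verify numerically in the finite case. The sketch below assumes a random (not necessarily uniform) law for $V$ and random conditional pmfs; the names `avg_kl`, `w` and `p_x` are illustrative only.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats for pmfs with supp(p) inside supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Assumption: V takes m values with weights w; row i of P is P_{X|V=i}.
rng = np.random.default_rng(1)
m, k = 4, 6
P = rng.dirichlet(np.ones(k), size=m)
w = rng.dirichlet(np.ones(m))
p_x = w @ P  # marginal law of X

def avg_kl(q):
    """E_V[ D(P_{X|V} || q) ] for a candidate reference pmf q."""
    return sum(w[i] * kl(P[i], q) for i in range(m))

mi = avg_kl(p_x)  # choosing q = P_X recovers I(V;X)

gaps, identity_errors = [], []
for _ in range(1000):
    q = rng.dirichlet(np.ones(k))
    gap = avg_kl(q) - mi
    gaps.append(gap)
    identity_errors.append(abs(gap - kl(p_x, q)))

print(f"I(V;X)={mi:.4f} min gap={min(gaps):.4e} max identity error={max(identity_errors):.2e}")
```

Any convenient $Q_X$ therefore yields an upper bound on $I(V;X)$, which is how the lemma is used in the examples.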

1.2. Assouad’s lemma

Instead of the previous pairwise separation condition, Assouad’s lemma builds upon a different one where the hypotheses are essentially the vertices of a $p$-dimensional hypercube. Specifically, we shall require that the distance between parameters $v$ and $v'\in \{\pm 1\}^p$ is proportional to their Hamming distance $d_{\text{H}}(v,v') := \sum_{i=1}^p 1(v_i \neq v_i')$, which becomes quite natural when the parameter of the statistical model lies in a $p$-dimensional space.

Theorem 5 (Assouad’s Lemma) Let $L: \Theta\times \mathcal{A} \rightarrow {\mathbb R}_+$ be any function. If there exist parameters $(\theta_v)\subseteq \Theta$ indexed by $v\in \{\pm1\}^p$ such that

$$L(\theta_v, a) + L(\theta_{v'}, a) \ge \Delta\cdot d_{\text{\rm H}}(v,v'), \qquad \forall v,v'\in \{\pm 1\}^p, a\in\mathcal{A},$$

then

$$\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{\Delta}{2}\sum_{j=1}^p \left(1 - \|P_{j,+} - P_{j,-}\|_{\text{\rm TV}}\right),$$

where

$$P_{j,+} := \frac{1}{2^{p-1}}\sum_{v\in \{\pm 1\}^p: v_j = 1} P_{\theta_v}, \quad P_{j,-} := \frac{1}{2^{p-1}}\sum_{v\in \{\pm 1\}^p: v_j = -1} P_{\theta_v}.$$

Proof: For a given estimator $\hat{\theta}$, define

$$\hat{v} := \arg\min_{v\in \{\pm 1\}^p} L(\theta_v, \hat{\theta}).$$

Then it is straightforward to see that

$$L(\theta_v, \hat{\theta}) \ge \frac{L(\theta_v,\hat{\theta}) + L(\theta_{\hat{v}}, \hat{\theta})}{2} \ge \frac{\Delta}{2}\sum_{j=1}^p 1(v_j \neq \hat{v}_j), \qquad \forall v\in \{\pm 1\}^p.$$

The rest of the proof follows from Le Cam’s first lemma and the fact that the maximum is no smaller than the average. $\Box$

In most scenarios, the total variation distance between the mixtures $P_{j,+}$ and $P_{j,-}$ is hard to compute directly, and the following corollaries (based on the joint convexity of the total variation distance and Cauchy–Schwarz) are the more frequently presented versions of Assouad’s lemma.

Corollary 6 Under the conditions of Theorem 5, we have

$$\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{p\Delta}{2}\cdot \left(1 - \max_{v,v'\in \{\pm 1\}^p: d_{\text{\rm H}}(v,v')=1} \|P_{\theta_v} - P_{\theta_{v'}}\|_{\text{\rm TV}}\right).$$

Corollary 7 Under the conditions of Theorem 5, we have

$$\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{p\Delta}{2}\cdot \left(1 - \left(\frac{1}{p2^p}\sum_{j=1}^p \sum_{v\in \{\pm 1\}^p} \|P_{\theta_v} - P_{\theta_{\bar{v}^j}} \|_{\text{\rm TV}}^2 \right)^{1/2} \right),$$

where $\bar{v}^j$ is the resulting binary vector after flipping the $j$-th coordinate of $v$.
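The relation between Theorem 5 and its two corollaries can be checked on a small hypercube. The following is a minimal sketch, assuming $p=3$ and random pmfs $P_{\theta_v}$ on $k$ points; it compares the three right-hand sides with the common factor $\Delta/2$ dropped, and all names are hypothetical.

```python
import itertools
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs."""
    return 0.5 * float(np.abs(p - q).sum())

# Assumption: one random pmf per vertex of the hypercube {-1, +1}^p.
rng = np.random.default_rng(2)
p, k = 3, 10
cube = list(itertools.product([-1, 1], repeat=p))
P = {v: rng.dirichlet(np.ones(k)) for v in cube}

def flip(v, j):
    """Vertex obtained from v by flipping its j-th coordinate."""
    return v[:j] + (-v[j],) + v[j + 1:]

def mixture_tv(j):
    """TV distance between the mixtures P_{j,+} and P_{j,-} of Theorem 5."""
    plus = np.mean([P[v] for v in cube if v[j] == 1], axis=0)
    minus = np.mean([P[v] for v in cube if v[j] == -1], axis=0)
    return tv(plus, minus)

# neighbor_tvs[j, idx] = TV between P_v and P_{v with j-th coordinate flipped}
neighbor_tvs = np.array([[tv(P[v], P[flip(v, j)]) for v in cube] for j in range(p)])

thm5 = sum(1.0 - mixture_tv(j) for j in range(p))
cor6 = p * (1.0 - neighbor_tvs.max())
cor7 = p * (1.0 - np.sqrt((neighbor_tvs ** 2).mean()))
print(f"Theorem 5: {thm5:.4f}  Corollary 7: {cor7:.4f}  Corollary 6: {cor6:.4f}")
```

The ordering Corollary 6 $\le$ Corollary 7 $\le$ Theorem 5 reflects that each corollary trades tightness for a quantity that only involves neighboring vertices.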

The idea behind Assouad’s lemma is to apply a two-point argument for each coordinate of the parameter and then sum up all coordinates. Hence, Assouad’s lemma typically improves over the best two-point argument by a factor of the dimension $p$, given that the hypercube-type separation condition in Theorem 5 holds.

1.3. Generalized Fano’s inequality

In Fano’s inequality and Assouad’s lemma above, we introduced a random variable $V$ which is uniformly distributed on the set $[m]$ or the hypercube $\{\pm 1\}^p$. Moreover, the separation condition must hold for each pair of the realizations of $V$. In general, we may wish to consider some non-uniform random variable $V$, and only require that the separation condition holds for most pairs. This is the focus of the following generalized Fano’s inequality.

Theorem 8 (Generalized Fano’s Inequality) Let $L: \Theta\times \mathcal{A}\rightarrow {\mathbb R}_+$ be any loss function, and $\pi$ be any probability distribution on $\Theta$. For any $\Delta>0$, define

$$p_{\Delta} := \sup_{a\in \mathcal{A}} \pi(\{\theta\in \Theta: L(\theta,a)\le \Delta \}).$$

Then for $V\sim \pi$, we have

$$\inf_{\hat{\theta}}\sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \Delta\left(1 - \frac{I(V;X) + \log 2}{\log(1/p_{\Delta})}\right).$$

Proof: By Markov’s inequality, it suffices to show that for any estimator $\hat{\theta}$

$$\mathop{\mathbb P}\left(L(V, \hat{\theta}(X))\le \Delta \right)\le \frac{I(V;X) + \log 2}{\log(1/p_\Delta)}.$$

We again use the second proof of Lemma 3. For any fixed $X$, consider the deterministic kernel $\mathsf{K}_X: \Theta\rightarrow \{0,1\}$ which sends $V$ to $1( L(V,\hat{\theta}(X))\le \Delta )$. Then the composition $\mathsf{K}_X\circ P_V$ is a Bernoulli distribution with parameter $p_X\le p_\Delta$, and the composition $\mathsf{K}_X\circ P_{V|X}$ is a Bernoulli distribution with parameter $q_X := P_{V|X}(L(V,\hat{\theta}(X))\le \Delta)$. Then by the data processing inequality of KL divergence,

$$\begin{array}{rcl} D_{\text{KL}}(P_{V|X}\|P_V) &\ge& q_X\log \frac{q_X}{p_X} + (1-q_X)\log \frac{1-q_X}{1-p_X} \\ &\ge & q_X\log \frac{q_X}{p_\Delta} + (1-q_X)\log (1-q_X) \\ &\ge & q_X\log(1/p_\Delta) - \log 2. \end{array}$$

Since $\mathop{\mathbb E}[q_X] = \mathop{\mathbb P}(L(V,\hat{\theta}(X))\le \Delta)$, taking expectation on $X$ at both sides gives the desired inequality. $\Box$

To see that Theorem 8 is indeed a generalization of the original Fano’s inequality, simply note that when $V\sim \mathsf{Uniform}([m])$ and the separation condition in Lemma 1 holds with separation parameter larger than $2\Delta$, each action $a$ can satisfy $L(\theta_i,a)\le \Delta$ for at most one hypothesis. Hence $p_\Delta \le m^{-1}$, and the denominator $\log(1/p_\Delta)\ge \log m$ is at least the log-cardinality. Hence, in the generalized Fano’s inequality, the verification of the separation condition becomes upper bounding the probability $p_\Delta$.

2. Applications

In this section we present some applications of the above tools to statistical examples. Here the applications are mostly simple and straightforward, and we defer some more sophisticated examples in other domains to the next lecture.

2.1. Example I: Gaussian mean estimation

We start from possibly the simplest example of Gaussian mean estimation. Consider $X\sim \mathcal{N}(\theta,\sigma^2 I_p)$ with known $\sigma$, and the target is to estimate the vector $\theta\in {\mathbb R}^p$ under the squared $\ell_2$ loss. Let $R_{p,\sigma}^\star$ be the corresponding minimax risk.

We first show that any two-point argument fails. In fact, if the two points are chosen to be $\theta_1, \theta_2\in {\mathbb R}^p$, the two-point argument gives

$$R_{p,\sigma}^\star \ge \frac{\|\theta_1 - \theta_2\|_2^2}{2}\left(1-\Phi\left(\frac{\|\theta_1 - \theta_2\|_2}{2\sigma}\right)\right),$$

where $\Phi(\cdot)$ is the normal CDF. Optimizing over the distance $\|\theta_1-\theta_2\|_2$ only gives $R_{p,\sigma}^\star = \Omega(\sigma^2)$.

Now we show that Assouad’s lemma can give us the rate-optimal lower bound $R_{p,\sigma}^\star= \Omega(p\sigma^2)$. Let $P_v = \mathcal{N}(\delta v, \sigma^2 I_p)$ for $v\in \{\pm 1\}^p$, where $\delta>0$ is a parameter to be chosen later. Since

$$\|\delta v - \delta v'\|_2^2 = 4\delta^2\cdot d_{\text{H}}(v,v'),$$

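and $\|a-\theta\|_2^2 + \|a-\theta'\|_2^2 \ge \|\theta-\theta'\|_2^2/2$ for all $a,\theta,\theta'\in{\mathbb R}^p$, the separation condition in Theorem 5 holds with $\Delta = 2\delta^2$. Consequently, Corollary 6 together with $\|P_v - P_{v'}\|_{\text{TV}} = 2\Phi(\delta/\sigma)-1$ for neighboring $v, v'$ gives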

$$R_{p,\sigma}^\star \ge 2p\delta^2\cdot \left(1 - \Phi\left(\frac{\delta}{\sigma}\right)\right),$$

and choosing $\delta =\Theta(\sigma)$ gives $R_{p,\sigma}^\star= \Omega(p\sigma^2)$.
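The gap between the two approaches can be made concrete by maximizing both displayed bounds numerically. This is a minimal sketch, assuming a simple grid search over the free parameter; the function names and the grid settings are illustrative choices, not part of the lecture.

```python
import math

def upper_tail(x):
    """1 - Phi(x) for the standard normal CDF Phi."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def two_point_bound(dist, sigma):
    """Two-point bound as a function of dist = ||theta_1 - theta_2||_2."""
    return 0.5 * dist ** 2 * upper_tail(dist / (2.0 * sigma))

def assouad_bound(p, delta, sigma):
    """Assouad bound 2 p delta^2 (1 - Phi(delta / sigma))."""
    return 2.0 * p * delta ** 2 * upper_tail(delta / sigma)

def maximize(f, sigma, grid_size=20000, max_ratio=20.0):
    """Grid search for the maximum of f over (0, max_ratio * sigma]."""
    grid = (max_ratio * sigma * (i + 1) / grid_size for i in range(grid_size))
    return max(f(x) for x in grid)

sigma = 1.5
for p in (1, 10, 100):
    best_two = maximize(lambda d: two_point_bound(d, sigma), sigma)
    best_assouad = maximize(lambda d: assouad_bound(p, d, sigma), sigma)
    # The first ratio does not grow with p; the second stays a constant fraction of p * sigma^2.
    print(p, best_two / sigma ** 2, best_assouad / (p * sigma ** 2))
```

The optimized two-point bound is a fixed multiple of $\sigma^2$ regardless of $p$, whereas the optimized Assouad bound is the same multiple of $p\sigma^2$.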

We can also establish the same lower bound using the generalized Fano’s inequality. Let $V\sim \mathsf{Uniform}(\{\pm\delta\}^p)$, then Lemma 4 gives

$$I(V;X) \le \mathop{\mathbb E} D_{\text{KL}}(\mathcal{N}(V,\sigma^2 I_p) \| \mathcal{N}(0, \sigma^2I_p)) = \frac{\mathop{\mathbb E} \|V\|_2^2}{2\sigma^2} = \frac{p\delta^2}{2\sigma^2}.$$

Since for any $\theta_0\in {\mathbb R}^p$ with $v_0 = \text{sign}(\theta_0)\in \{\pm 1\}^p$ (break ties arbitrarily), we have

$$\|V - \theta_0\|_2^2 \ge \delta^2 \cdot d_{\text{H}}(\text{sign}(V), v_0),$$

choosing $\Delta = p\delta^2/3$ gives

$$\mathop{\mathbb P}(\|V-\theta_0\|_2^2 \le \Delta) \le \mathop{\mathbb P}(d_{\text{H}}(\text{sign}(V), v_0)\le p/3 ) \le e^{-cp}$$

for some numerical constant $c>0$, where the last inequality is given by sub-Gaussian concentration. Consequently, $p_\Delta \le e^{-cp}$, and Theorem 8 gives

$$R_{p,\sigma}^\star \ge \frac{p\delta^2}{3}\left(1 - \frac{p\delta^2/(2\sigma^2) + \log 2}{cp}\right).$$

Again, choosing $\delta = c'\sigma$ with small $c'>0$ gives $R_{p,\sigma}^\star = \Omega(p\sigma^2)$.
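The constants in this argument can be made explicit. The sketch below assumes Hoeffding’s inequality as the sub-Gaussian concentration step, which gives $c = 1/18$ since $d_{\text{H}}(\text{sign}(V), v_0)\sim \mathsf{Binomial}(p,1/2)$, and it evaluates the bound of Theorem 8 with the exact binomial tail in place of $e^{-cp}$. The function names and the parameter values $p=300$, $\delta = 0.2\sigma$ are illustrative assumptions.

```python
import math

def binom_lower_tail(p, k):
    """P(Bin(p, 1/2) <= k), computed exactly with integer arithmetic."""
    return sum(math.comb(p, i) for i in range(k + 1)) / 2 ** p

def generalized_fano_bound(p, sigma, delta):
    """Theorem 8 with Delta = p delta^2 / 3 and the binomial-tail bound on p_Delta."""
    Delta = p * delta ** 2 / 3.0
    mi_upper = p * delta ** 2 / (2.0 * sigma ** 2)  # bound on I(V;X) from Lemma 4
    p_Delta_upper = binom_lower_tail(p, p // 3)     # upper bound on p_Delta
    return Delta * (1.0 - (mi_upper + math.log(2.0)) / math.log(1.0 / p_Delta_upper))

p, sigma = 300, 1.0
tail = binom_lower_tail(p, p // 3)
hoeffding = math.exp(-p / 18.0)  # exp(-2 p (1/2 - 1/3)^2)
bound = generalized_fano_bound(p, sigma, delta=0.2 * sigma)
print(f"tail={tail:.3e} hoeffding={hoeffding:.3e} bound/(p sigma^2)={bound / (p * sigma ** 2):.4f}")
```

The resulting constant in front of $p\sigma^2$ is small, which is typical for Fano-type arguments: they capture the rate but not the sharp constant.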

Remark 3 The original version of Fano’s inequality can also be applied here, where $\{\theta_1,\cdots,\theta_m\}$ is a maximal packing of $\{\pm \delta\}^p$ with $\ell_2$ packing distance $\Theta(\delta\sqrt{p})$.

2.2. Example II: Sparse linear regression

Consider the following sparse linear regression $Y\sim \mathcal{N}(X\theta, \sigma^2 I_n)$, where $X\in {\mathbb R}^{n\times p}$ is a fixed design matrix, and $\theta\in {\mathbb R}^p$ is a sparse vector with $\|\theta\|_0\le s$. Here $s\in [1,p]$ is a known sparsity parameter, and we are interested in the minimax risk $R_{n,p,s,\sigma}^\star(X)$ of estimating $\theta$ under the squared $\ell_2$ loss. Note that when $X = I_p$, this problem is reduced to the sparse Gaussian mean estimation.

We apply the generalized Fano’s inequality to this problem. Since a natural difficulty of this problem is to recover the support of $\theta$, we introduce the random vector $V=\delta v_S$, where $\delta>0$ is a parameter to be specified later, $v\sim \mathsf{Uniform}\{\pm 1\}^p$, $v_S$ denotes the restriction of $v$ onto $S$, and $S\subseteq [p]$ is uniformly chosen from all size-$s$ subsets of $[p]$. Clearly,

$$\mathop{\mathbb E}[VV^\top] = \delta^2 \mathop{\mathbb E}[v_Sv_S^\top] = \frac{s\delta^2}{p}I_p.$$

By Lemma 4,

$$\begin{array}{rcl} I(V;Y) &\le & \mathop{\mathbb E} D_{\text{KL}}(\mathcal{N}(XV, \sigma^2 I_n) \| \mathcal{N}(0, \sigma^2 I_n)) \\ &=& \frac{\mathop{\mathbb E}\|XV\|_2^2}{2\sigma^2} \\ &=& \frac{\text{Trace}(X^\top X\cdot \mathop{\mathbb E}[VV^\top]) }{2\sigma^2}\\ &=& \frac{s\delta^2}{2p\sigma^2}\|X\|_{\text{F}}^2, \end{array}$$

where $\|X\|_{\text{F}} := \sqrt{\text{Trace}(X^\top X)}$ is the Frobenius norm of $X$. Now it remains to upper bound $p_\Delta$ for $\Delta = s\delta^2/12$. Fix any $\theta_0\in {\mathbb R}^p$ such that $\|V - \theta_0\|_2^2\le \Delta$ holds with non-zero probability (otherwise the upper bound is trivial), and by symmetry we may assume that $\|\theta_0 - \delta1_{[s]}\|_2^2 \le \Delta$. Now by triangle inequality, $\|\theta_0 - \delta v_S\|_2^2 \le \Delta$ implies that $\|v_S - 1_{[s]}\|_2^2 \le s/3$. Hence,

$$\begin{array}{rcl} p_\Delta &\le& \mathop{\mathbb P}(\|v_S - 1_{[s]}\|_2^2\le s/3) \\ &\le& \frac{1}{2^s\binom{p}{s}}\sum_{k=\lceil 2s/3 \rceil}^s \binom{s}{k} \binom{p-k}{s-k}2^{s-k} \\ &\le& \left(s/(ep)\right)^{cs} \end{array}$$

for a numerical constant $c>0$ after some algebra. Consequently, Theorem 8 gives

$$R_{n,p,s,\sigma}^\star(X) \ge \frac{s\delta^2}{12}\left(1 - \frac{ \frac{s\delta^2}{2p\sigma^2}\|X\|_{\text{F}}^2 + \log 2}{cs\log(ep/s)} \right).$$

Finally, choosing $\delta^2 = c'p\sigma^2\log(ep/s)/\|X\|_{\text{F}}^2$ for some small constant $c'>0$ gives

$$R_{n,p,s,\sigma}^\star (X)= \Omega\left(\frac{sp\sigma^2\log(ep/s)}{\|X\|_{\text{F}}^2} \right).$$

In the special cases where $X= I_p$ or $X$ consists of i.i.d. $\mathcal{N}(0,n^{-1})$ entries, we arrive at the tight lower bound $\Omega(s\sigma^2\log(ep/s))$ for both sparse Gaussian mean estimation and compressed sensing.
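The second-moment identity $\mathop{\mathbb E}[VV^\top] = (s\delta^2/p) I_p$ and the resulting trace computation can be verified by exact enumeration for small $p$ and $s$. This is a minimal sketch, assuming a random Gaussian design matrix purely for illustration; `second_moment` is a hypothetical helper.

```python
import itertools
import numpy as np

def second_moment(p, s, delta):
    """Exact E[V V^T] for V = delta * v_S, S a uniform size-s subset, v uniform signs."""
    total = np.zeros((p, p))
    count = 0
    for S in itertools.combinations(range(p), s):
        for signs in itertools.product([-1.0, 1.0], repeat=s):
            V = np.zeros(p)
            V[list(S)] = delta * np.array(signs)  # zero outside the support S
            total += np.outer(V, V)
            count += 1
    return total / count

p, s, delta, n = 5, 2, 0.7, 4
M = second_moment(p, s, delta)

# Assumption: an arbitrary fixed design matrix X (random here, for illustration).
rng = np.random.default_rng(3)
X = rng.standard_normal((n, p))
lhs = np.trace(X.T @ X @ M)                             # E ||X V||_2^2
rhs = s * delta ** 2 / p * np.linalg.norm(X, "fro") ** 2
print(np.round(M, 4))
print(lhs, rhs)
```

Only the Frobenius norm of $X$ enters the mutual information bound, which is why the final lower bound depends on the design through $\|X\|_{\text{F}}^2$ alone.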

2.3. Example III: Multi-armed bandit

Next we revisit the example of the multi-armed bandit in Lecture 5. Let $T$ be the time horizon and $K$ be the total number of arms. For each $i\in [K]$, the reward of pulling arm $i$ at time $t$ is an independent random variable $r_{t,i}\sim \mathcal{N}(\mu_i, 1)$. The target of the learner is to devise a policy $(\pi_t)_{t\in [T]}$ where $\pi_t\in [K]$ is the arm to pull at time $t$, and $\pi_t$ may depend on the entire observed history $(\pi_\tau)_{\tau<t}$ and $(r_{\tau,\pi_\tau})_{\tau<t}$. The learner would like to minimize the following worst-case regret:

$$R_{T,K}^\star = \inf_{\pi} \sup_{\max_{i\in [K]}|\mu_i| \le \sqrt{K} } \mathop{\mathbb E}\left[T \max_{1\le i\le K} \mu_i - \sum_{t=1}^T \mu_{\pi_t}\right],$$

where $\max_{i\in [K]}|\mu_i|\le \sqrt{K}$ is a technical condition. As in Lecture 5, our target is to show that

$$R_{T,K}^\star = \Omega(\sqrt{KT})$$

via multiple hypothesis testing.

To prove the lower bound, a natural idea is to construct $K$ hypotheses where the $i$-th arm is the optimal arm under the $i$-th hypothesis. Specifically, we set

$$\begin{array}{rcl} \mu^{(1)} &=& (\delta, 0, 0, 0, \cdots, 0), \\ \mu^{(2)} &=& (\delta, 2\delta, 0, 0, \cdots, 0), \\ \mu^{(3)} &=& (\delta, 0, 2\delta, 0, \cdots, 0), \\ \vdots & & \vdots \\ \mu^{(K)} &=& (\delta, 0, 0, 0, \cdots, 2\delta), \end{array}$$

where $\delta>0$ is some parameter to be specified later. The construction is not entirely symmetric, and the reward distribution under the first arm is always $\mathcal{N}(\delta,1)$.

For a mean vector $\mu$ and policy $\pi$, let $L(\mu,\pi) = T \max_{1\le i\le K} \mu_i - \sum_{t=1}^T \mu_{\pi_t}$ be the non-negative loss function for this online learning problem. Then clearly the separation condition in Lemma 1 is fulfilled with $\Delta = \delta T$. Next, applying Lemma 2 to the star graph $([K], \{(1,2),(1,3),\cdots,(1,K) \})$ with center $1$ gives

$$\begin{array}{rcl} R_{T,K}^\star &\overset{(a)}{\ge} & \frac{\delta T}{2}\cdot \frac{1}{K}\sum_{i=2}^K \left(1 - \|P_{\mu^{(1)}} - P_{\mu^{(i)}} \|_{\text{TV}}\right) \\ &\overset{(b)}{\ge} & \frac{\delta T}{4K} \sum_{i=2}^K \exp\left( -D_{\text{KL}}(P_{\mu^{(1)}} \| P_{\mu^{(i)}} ) \right) \\ &\overset{(c)}{=}& \frac{\delta T}{4K} \sum_{i=2}^K \exp\left( - 2\delta^2 \mathop{\mathbb E}_1(T_i)\right) \\ &\overset{(d)}{\ge}& \frac{\delta T}{4}\cdot \frac{K-1}{K} \exp\left(-2\delta^2\cdot \frac{\mathop{\mathbb E}_1(T_2 + \cdots + T_K)}{K-1} \right) \\ &\overset{(e)}{\ge} & \frac{\delta T}{4}\cdot \frac{K-1}{K}\exp\left(-\frac{2T\delta^2}{K-1}\right), \end{array}$$

where in step (a) we denote by $P_{\mu^{(i)}}$ the distribution of the observed rewards under mean vector $\mu^{(i)}$, step (b) follows from the inequality $1 - \|P-Q\|_{\text{TV}}\ge \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q))$ presented in Lecture 5, step (c) is the evaluation of the KL divergence where $\mathop{\mathbb E}_1(T_i)$ denotes the expected number of pulls of arm $i$ under the mean vector $\mu^{(1)}$, step (d) follows from Jensen’s inequality, and step (e) is due to the deterministic inequality $T_2 + \cdots + T_K\le T$. Now choosing $\delta = \Theta(\sqrt{K/T})$ gives the desired lower bound.
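The final choice of $\delta$ can be made explicit by maximizing the last line of the chain over $\delta$. The sketch below assumes the closed-form maximizer $\delta^2 = (K-1)/(4T)$, obtained by differentiating $\delta\mapsto \delta\exp(-2T\delta^2/(K-1))$; the function names and the values of $T$ and $K$ are illustrative.

```python
import math

def bandit_lower_bound(T, K, delta):
    """Last line of the chain: (delta T / 4) * (K-1)/K * exp(-2 T delta^2 / (K-1))."""
    return 0.25 * delta * T * (K - 1) / K * math.exp(-2.0 * T * delta ** 2 / (K - 1))

def best_delta(T, K):
    """Maximizer of delta * exp(-2 T delta^2 / (K-1)), i.e. delta^2 = (K-1) / (4T)."""
    return math.sqrt((K - 1) / (4.0 * T))

T, K = 10_000, 10
delta_star = best_delta(T, K)
value = bandit_lower_bound(T, K, delta_star)
ratio = value / math.sqrt(K * T)
print(f"delta*={delta_star:.4f} bound={value:.3f} bound/sqrt(KT)={ratio:.4f}")
# The technical constraint max_i |mu_i| <= sqrt(K) requires 2 * delta <= sqrt(K).
print(2.0 * delta_star <= math.sqrt(K))
```

With this choice the bound equals $\frac{K-1}{K}\cdot\frac{\sqrt{(K-1)T}}{8\sqrt{e}}$, which is of order $\sqrt{KT}$ for $K\ge 2$.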

The main reason why we apply Lemma 2 instead of Fano’s inequality and keep the same reward distribution for arm $1$ under all hypotheses is to deal with the random number of pulls of different arms under different reward distributions. In the above way, we may stick to the expectation $\mathop{\mathbb E}_1$ and apply the deterministic inequality $T_2 + \cdots + T_K\le T$ to handle the randomness.

2.4. Example IV: Gaussian mixture estimation

Finally we look at a more involved example in nonparametric statistics. Let $f$ be a density on ${\mathbb R}$ which is a Gaussian mixture, i.e., $f = g * \mathcal{N}(0,1)$ for some density $g$, where $*$ denotes the convolution. We consider the estimation of the density $f$ given $n$ i.i.d. observations $X_1,\cdots,X_n$ from $f$, and we denote by $R_n^\star$ the minimax risk under the squared $L_2$ loss between real-valued functions. The central claim is that

$$R_n^\star = \Omega\left(\frac{(\log n)^{1/2}}{n}\right),$$

which is slightly larger than the (squared) parametric rate $n^{-1}$.

Before proving this lower bound, we first gain some insights from the upper bound. In the problem formulation, the only restriction on the density $f$ is that $f$ must be a Gaussian mixture. Since convolution makes functions smoother, $f$ should be fairly smooth, and harmonic analysis suggests that the extent of smoothness is reflected in the speed of decay of the Fourier transform of $f$. Letting $\widehat{f}$ be the Fourier transform of $f$, we have

$$\widehat{f}(\omega) = \widehat{g}(\omega)\cdot \frac{1}{\sqrt{2\pi}}e^{-\omega^2/2},$$

and $\|\widehat{g}\|_\infty \le \|g\|_1=1$. Hence, if we truncate $\widehat{f}(\omega)$ to zero for all $|\omega| > 2\sqrt{\log n}$, the $L_2$ approximation error would be smaller than $O(n^{-1})$. This suggests considering the kernel $K$ on ${\mathbb R}$ with

$$\widehat{K}(\omega) = 1(|\omega|\le 2\sqrt{\log n}).$$

Then the density estimator $\mathop{\mathbb P}_n * K$ has mean $f * K$, which by Parseval’s identity has squared bias

$$\|f - f*K\|_2^2 = \|\widehat{f}(1-\widehat{K}) \|_2^2 = O(n^{-1}).$$

Moreover, by Parseval again,

$$\mathop{\mathbb E}\|\mathop{\mathbb P}_n * K - f * K\|_2^2 \le \frac{\|K\|_2^2}{n} = \frac{\|\widehat{K}\|_2^2}{n} = O\left(\frac{\sqrt{\log n}}{n}\right),$$

and a triangle inequality leads to the desired result.
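The bias and variance bounds above can be evaluated numerically to see that the variance term dominates. This is a minimal sketch, assuming the Fourier normalization used in the display for $\widehat{f}$ (so that $|\widehat{f}(\omega)|^2 \le e^{-\omega^2}/(2\pi)$) and ignoring convention-dependent constants in Parseval’s identity; the function names are hypothetical.

```python
import math

def cutoff(n):
    """Frequency cutoff 2 * sqrt(log n) of the kernel."""
    return 2.0 * math.sqrt(math.log(n))

def bias_sq_bound(n):
    """(1 / (2 pi)) * integral of exp(-w^2) over |w| > cutoff, via erfc."""
    return math.erfc(cutoff(n)) / (2.0 * math.sqrt(math.pi))

def variance_bound(n):
    """||K_hat||_2^2 / n, where K_hat is the indicator of [-cutoff, cutoff]."""
    return 2.0 * cutoff(n) / n

for n in (10, 1_000, 1_000_000):
    print(n, bias_sq_bound(n), variance_bound(n), math.sqrt(math.log(n)) / n)
```

Since $\text{erfc}(c)\le e^{-c^2}$, the squared bias is in fact at most of order $n^{-4}$, so the rate $\sqrt{\log n}/n$ comes entirely from the variance term.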

The upper bound analysis motivates us to apply Fourier-type arguments in the lower bound. For $v\in \{\pm 1\}^p$ (with integer $p$ to be specified later), consider the density

$$f_v = f_0 + \left(\delta\cdot \sum_{i=1}^p v_ig_i \right) * \phi,$$

where $f_0$ is some centered probability density, $\delta>0$ is some parameter to be specified later, $g_1,\cdots,g_p$ are perturbation functions with $\int g_i = 0$, and $\phi$ is the PDF of $\mathcal{N}(0,1)$. To apply Assouad’s lemma on this hypercube structure, we require the following orthogonality condition

$$\int_{{\mathbb R}} (g_i*\phi)(x) (g_j*\phi)(x)dx = 1(i=j)$$

to ensure the separation condition in Theorem 5 with $\Delta = \Theta(\delta^2)$. By Plancherel’s identity, the above condition is equivalent to

$$\int_{{\mathbb R}} \widehat{g}_i(\omega)\widehat{g}_j(\omega)\phi^2(\omega)d\omega = 1(i=j).$$

Recalling from Lecture 7 that the Hermite polynomials are orthogonal under the normal distribution, a natural candidate for $\widehat{g}_i(\omega)$ is $\widehat{g}_i(\omega)\propto \phi(\omega)H_{2i-1}(2\omega)$, where we have used that $\phi(\omega)^4 \propto \phi(2\omega)$, and we restrict to odd degrees to ensure that $\int g_i = \widehat{g}_i(0) = 0$. Another nice property of this choice is that the inverse Fourier transform of $\widehat{g}_i$ has the following closed-form expression:

$$g_i(x) = \sqrt{2}(2\pi)^{3/4}\sqrt{\frac{3^{2i-1}}{(2i-1)!}}\cdot\phi(x) H_{2i-1}\left(\frac{2x}{\sqrt{3}}\right).$$

Moreover, since $H_k(x)\sim \sqrt{k!}e^{x^2/4}$ for large $x$, we have $\|g_i\|_\infty = \Theta(3^i)$. Therefore, we must have $p=O(\log n)$ to ensure the non-negativity of the density $f_v$.

Now the only remaining condition is to check that the $\chi^2$-divergence between neighboring vertices is at most $O(1)$, which is essentially equivalent to

$$n\delta^2\cdot \int_{{\mathbb R}} \frac{(g_i * \phi(x))^2}{f_0(x)}dx = O(1).$$

A natural choice of $f_0$ is the PDF of $\mathcal{N}(0,\sigma^2)$, with $\sigma>0$ to be specified. Splitting the above integral over ${\mathbb R}$ into $|x|\le C\sigma$ and $|x|> C\sigma$ for some large constant $C>0$, the orthogonality relation of $g_i$ gives

$$\int_{|x|\le C\sigma} \frac{(g_i * \phi(x))^2}{f_0(x)}dx \le \sqrt{2\pi}e^{C^2/2}\sigma \cdot \int_{{\mathbb R}} (g_i * \phi(x))^2dx = O(\sigma).$$

To evaluate the integral over $|x|>C\sigma$, recall that $g_i(x)$ behaves as $3^ie^{-cx^2}$ for large $|x|$. Hence, the natural requirement $\sigma^2 = \Omega(p)$ leads to the same upper bound $O(\sigma)$ for the second integral, and therefore the largest choice of $\delta$ is $\delta^2 = (n\sigma)^{-1}$.

To sum up, we have arrived at the lower bound

$$R_{n}^\star = \Omega\left(\frac{p}{n\sigma}\right)$$

subject to the constraints $\sigma^2 = \Omega(p)$ and $p = O(\log n)$. Hence, the optimal choices for the auxiliary parameters are $p = \Theta(\log n), \sigma = \Theta(\sqrt{\log n})$, leading to the desired lower bound.

3. Bibliographic Notes

The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977), who also applied Fano’s lemma (Fano (1952)) to a statistical setting. Assouad’s lemma is due to Assouad (1983), and we also refer to the survey paper Yu (1997). The tree-based technique (Lemma 2 and the analysis of the multi-armed bandit) is due to Gao et al. (2019), and the generalized Fano’s inequality is motivated by the distance-based Fano’s inequality (Duchi and Wainwright (2013)) and the current form is presented in the lecture note Duchi (2019).

For the examples, the lower bound of sparse linear regression is taken from Candes and Davenport (2013), and we also refer to Donoho and Johnstone (1994), Raskutti, Wainwright and Yu (2011), and Zhang, Wainwright and Jordan (2017) for related results. The full proof of the Gaussian mixture example can be found in Kim (2014).

1. Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977.
2. Robert M. Fano, Class notes for transmission of information. Massachusetts Institute of Technology, Tech. Rep (1952).
3. Patrick Assouad, Densité et dimension. Annales de l’Institut Fourier. Vol. 33. No. 3. 1983.
4. Bin Yu, Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam. Springer, New York, NY, 1997. 423–435.
5. Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou, Batched multi-armed bandit problem. Advances in Neural Information Processing Systems, 2019.
6. John C. Duchi, and Martin J. Wainwright. Distance-based and continuum Fano inequalities with applications to statistical estimation. arXiv preprint arXiv:1311.2669 (2013).
7. John C. Duchi, Lecture notes for statistics 311/electrical engineering 377. 2019.
8. Emmanuel J. Candes, and Mark A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis 34.2 (2013): 317–323.
9. David L. Donoho, and Iain M. Johnstone. Minimax risk over $\ell_p$-balls for $\ell_q$-error. Probability Theory and Related Fields 99.2 (1994): 277–303.
10. Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE transactions on information theory 57.10 (2011): 6976–6994.
11. Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. Electronic Journal of Statistics 11.1 (2017): 752–799.
12. Arlene K. H. Kim. Minimax bounds for estimation of normal mixtures. Bernoulli 20.4 (2014): 1802–1818.
