# Lecture 9: Multiple Hypothesis Testing: More Examples

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

In the last lecture we saw the general tools and concrete examples of reducing a statistical estimation problem to multiple hypothesis testing. In these examples, the loss function is typically straightforward and the construction of the hypotheses is natural. However, beyond statistical estimation the loss function may become more complicated (e.g., the excess risk in learning theory), and the hypothesis constructions may be implicit. To better illustrate the power of multiple hypothesis testing, this lecture is devoted exclusively to more examples in potentially non-statistical problems.

1. Example I: Density Estimation

This section considers the most fundamental problem in nonparametric statistics, namely density estimation. It is well-known in the nonparametric statistics literature that, if a density on the real line has Hölder smoothness parameter $s>0$, then it can be estimated within $L_2$ accuracy $\Theta(n^{-s/(2s+1)})$ based on $n$ samples. However, in this section we will show that this is no longer the correct minimax rate when we generalize to other notions of smoothness and other natural loss functions.

Fix an integer $k\ge 1$ and a norm parameter $p\in [1,\infty]$. The Sobolev space $\mathcal{W}^{k,p}(L)$ on $[0,1]$ is defined as

$$\mathcal{W}^{k,p}(L) := \{f\in C[0,1]: \|f\|_p + \|f^{(k)}\|_p \le L \},$$

where the derivative $f^{(k)}$ is defined in the sense of distributions ($f$ does not necessarily belong to $C^k[0,1]$). Our target is to determine the minimax rate $R_{n,k,p,q}^\star$ of estimating the density $f\in \mathcal{W}^{k,p}(L)$ based on $n$ i.i.d. observations from $f$, under the general $L_q$ loss with $q\in [1,\infty)$. For simplicity we suppress the dependence on $L$ and assume that $L$ is a large positive constant.

The main result of this section is as follows:

$$R_{n,k,p,q}^\star = \begin{cases} \Omega(n^{-\frac{k}{2k+1}}) & \text{if } q<(1+2k)p, \\ \Omega((\frac{\log n}{n})^{\frac{k-1/p + 1/q}{2(k-1/p) + 1}} ) & \text{if } q \ge (1+2k)p. \end{cases}$$

This rate is also tight. We see that as $k,p$ are fixed and $q$ increases from $1$ to $\infty$, the minimax rate for density estimation increases from $n^{-k/(2k+1)}$ to $(\log n/n)^{(k-1/p)/(2(k-1/p)+1) }$. We will show that these rates correspond to the “dense” and the “sparse” cases, respectively.
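As a quick sanity check on the two regimes, the following sketch evaluates the rate exponent as a function of $(k,p,q)$. It assumes the sparse-case exponent $(k-1/p+1/q)/(2(k-1/p)+1)$, which is what the choice of $h$ in Section 1.2 produces; the function name `density_rate_exponent` is a hypothetical helper, not part of the lecture.

```python
import math

def density_rate_exponent(k, p, q):
    """Exponent r with minimax rate n^{-r} (up to a log factor in the sparse case)."""
    inv_p = 0.0 if math.isinf(p) else 1.0 / p
    if q < (1 + 2 * k) * p:
        return k / (2 * k + 1)              # dense regime
    s = k - inv_p                           # effective smoothness in the sparse regime
    return (s + 1.0 / q) / (2 * s + 1)

# The two regimes should agree at the boundary q = (1 + 2k) p.
k, p = 2, 2.0
q_star = (1 + 2 * k) * p
print(density_rate_exponent(k, p, q_star - 1e-9), density_rate_exponent(k, p, q_star))
```

The exponent is continuous at $q=(1+2k)p$ and decreases towards $(k-1/p)/(2(k-1/p)+1)$ as $q\to\infty$, in line with the discussion above.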

1.1. Dense case: $q<(1+2k)p$

We first look at the case with $q<(1+2k)p$, which always holds for Hölder smoothness where the norm parameter is $p=\infty$. The reason why it is called the “dense” case is that in the hypothesis construction, the density is supported everywhere on $[0,1]$, and the difficulty is to classify the directions of the bumps around some common density. Specifically, let $h\in (0,1)$ be a bandwidth parameter to be specified later, and consider

$$f_v(x) := 1 + \sum_{i=1}^{h^{-1}} v_i\cdot h^sg\left(\frac{x - (i-1)h}{h}\right)$$

for $v\in \{\pm 1\}^{1/h}$, where $g$ is a fixed smooth function supported on $[0,1]$ with $\int_0^1 g(x)dx=0$, and $s>0$ is some tuning parameter to ensure that $f_v\in \mathcal{W}^{k,p}(L)$. Since

$$f_v^{(k)}(x) = \sum_{i=1}^{h^{-1}} v_i\cdot h^{s-k}g^{(k)}\left(\frac{x - (i-1)h}{h}\right),$$

we conclude that the choice $s=k$ is sufficient. To invoke Assouad’s lemma, first note that for $q=1$ the separation condition is fulfilled with $\Delta = \Theta(h^{k+1})$. Moreover, for any $v,v'\in \{\pm 1\}^{1/h}$ with $d_{\text{H}}(v,v') = 1$, we have

$$\chi^2(f_v^{\otimes n}, f_{v'}^{\otimes n}) + 1 \le \left( 1 + \frac{ \|f_v - f_{v'}\|_2^2 }{1 - h^k\|g\|_\infty} \right)^n = \left(1 + \frac{ 4h^{2k+1}\|g\|_2^2 }{1 - h^k\|g\|_\infty} \right)^n.$$

Hence, the largest $h>0$ to ensure that $\chi^2(f_v^{\otimes n}, f_{v'}^{\otimes n}) = O(1)$ is $h = \Theta(n^{-1/(2k+1)})$, which for $q=1$ gives the minimax lower bound

$$R_{n,k,p,q}^\star = \Omega(n^{-\frac{k}{2k+1}} ).$$

This lower bound is also valid for the general case $q\ge 1$ due to the monotonicity of $L_q$ norms.
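The identity $\|f_v - f_{v'}\|_2^2 = 4h^{2k+1}\|g\|_2^2$ for neighboring $v,v'$ can be checked numerically. The sketch below is only illustrative: it assumes the bump $g(x)=\sin(2\pi x)$ on $[0,1]$ (zero mean, $\|g\|_2^2=1/2$, but not smooth at the endpoints when extended by zero), and the helper names are hypothetical.

```python
import math

def g(x):
    # Illustrative zero-mean bump on [0, 1]; an assumption made for this check only.
    return math.sin(2 * math.pi * x) if 0.0 <= x <= 1.0 else 0.0

def f_v(x, v, h, k):
    """Perturbed density 1 + sum_i v_i h^k g((x - (i-1)h)/h), i.e. s = k."""
    i = min(int(x / h), len(v) - 1)         # 0-based index of the cell containing x
    return 1.0 + v[i] * h**k * g((x - i * h) / h)

def l2_dist_sq(v, w, h, k, grid=8000):
    # Midpoint Riemann sum of (f_v - f_w)^2 over [0, 1].
    total = 0.0
    for j in range(grid):
        x = (j + 0.5) / grid
        total += (f_v(x, v, h, k) - f_v(x, w, h, k)) ** 2
    return total / grid

h, k = 0.125, 2
v = [1] * 8
w = [-1] + [1] * 7                           # Hamming distance 1 from v
numeric = l2_dist_sq(v, w, h, k)
closed_form = 4 * h ** (2 * k + 1) * 0.5     # 4 h^{2k+1} ||g||_2^2 with ||g||_2^2 = 1/2
print(numeric, closed_form)
```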

1.2. Sparse case: $q\ge (1+2k)p$

Next we turn to the sparse case $q\ge (1+2k)p$, where the name of “sparse” comes from the fact that for the hypotheses we will construct densities with only one bump on a small interval, and the main difficulty is to locate that interval. Specifically, let $h\in (0,1), s>0$ be bandwidth and smoothness parameters to be specified later, and we consider

$$f_i(x) := h^sg\left(\frac{x-(i-1)h}{h}\right)\cdot 1\left((i-1)h<x\le ih\right) + 1 - h^{s+1},$$

where $i\in [h^{-1}]$, and $g$ is a fixed smooth density on $[0,1]$. Clearly $f_i(x)$ is a density on $[0,1]$ for all $i\in [h^{-1}]$, and

$$f_i^{(k)}(x) = h^{s-k}g^{(k)} \left(\frac{x-(i-1)h}{h}\right)\cdot 1\left((i-1)h<x\le ih\right).$$

Consequently, $\|f_i^{(k)}\|_p = h^{s-k+1/p}\|g^{(k)}\|_p$, and the Sobolev ball requirement leads to the choice $s = k-1/p$.

Next we check the conditions of Fano’s inequality. Since

$$\|f_i - f_j\|_q = h^{k-1/p+1/q} \cdot 2^{1/q}\|g\|_q, \qquad \forall i\neq j\in [h^{-1}],$$

the separation condition is fulfilled with $\Delta = \Theta(h^{k-1/p+1/q})$. Moreover, since

$$D_{\text{KL}}(f_i \| f_j) \le \frac{\|f_i - f_j\|_2^2}{1-h^s} = \frac{2\|g\|_2^2}{1-h^s}\cdot h^{2(k-1/p)+1},$$

we conclude that $I(V;X)= O(nh^{2(k-1/p)+1})$. Consequently, Fano’s inequality gives

$$R_{n,k,p,q}^\star = \Omega\left(h^{k-1/p+1/q}\left(1 - \frac{O(nh^{2(k-1/p)+1})+\log 2}{\log(1/h)} \right) \right),$$

and choosing $h = \Theta((\log n/n)^{1/(2(k-1/p)+1)} )$ gives the desired result.
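To see that this choice of $h$ keeps the Fano factor bounded away from zero, one can plug in numbers. The sketch below sets every constant hidden in the $O(\cdot)$ and $\Theta(\cdot)$ notation to $1$ and uses a hypothetical small constant `c` in $h = (c\log n/n)^{1/(2(k-1/p)+1)}$; both are assumptions made for illustration only.

```python
import math

def sparse_fano_bound(n, k, p, q, c=0.05):
    """Evaluate h^{s+1/q} * (1 - (n h^{2s+1} + log 2) / log(1/h)) with s = k - 1/p.

    Hidden constants are set to 1; c is a hypothetical tuning constant.
    """
    s = k - 1.0 / p
    h = (c * math.log(n) / n) ** (1.0 / (2 * s + 1))
    info = n * h ** (2 * s + 1)              # proxy for I(V; X); equals c * log n
    fano_factor = 1.0 - (info + math.log(2)) / math.log(1.0 / h)
    return h ** (s + 1.0 / q) * fano_factor, h, fano_factor

bound, h, factor = sparse_fano_bound(10**6, k=2, p=2, q=20)
print(bound, h, factor)
```

With a small enough `c`, the mutual information term is a small fraction of $\log(1/h)$, so the bound is of order $h^{k-1/p+1/q}$.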

2. Example II: Aggregation

In estimation or learning problems, sometimes the learner is given a set of candidate estimators or predictors, and she aims to aggregate them into a new estimate based on the observed data. In scenarios where the candidates are not explicit, aggregation procedures can still be employed based on sample splitting, where the learner splits the data into independent parts, uses the first part to construct the candidates and the second part to aggregate them.

In this section we restrict ourselves to the regression setting where there are $n$ i.i.d. observations $x_1,\cdots,x_n\sim P_X$, and

$$y_i = f(x_i )+ \xi_i, \qquad i\in [n],$$

where $\xi_i\sim \mathcal{N}(0,1)$ is the independent noise, and $f: \mathcal{X}\rightarrow {\mathbb R}$ is an unknown regression function with $\|f\|_\infty \le 1$. There is also a set of candidates $\mathcal{F}=\{f_1,\cdots,f_M\}$, and the target of aggregation is to find some $\hat{f}_n$ (not necessarily in $\mathcal{F}$) to minimize

$$\mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in \Theta} \|f_\lambda - f\|_{L_2(P_X)}^2\right),$$

where the expectation is taken over the random observations $(x_1,y_1),\cdots,(x_n,y_n)$, $f_\lambda(x) := \sum_{i=1}^M \lambda_i f_i(x)$ for any $\lambda \in {\mathbb R}^M$, and $\Theta\subseteq {\mathbb R}^M$ is a suitable subset corresponding to different types of aggregation. For a fixed data distribution $P_X$, the minimax rate of aggregation is defined as the minimum worst-case excess risk over all bounded functions $f$ and candidate functions $\{f_1,\cdots,f_M\}$:

$$R_{n,M}^{\Theta} := \sup_{f_1,\cdots,f_M} \inf_{\hat{f}_n} \sup_{\|f\|_\infty\le 1} \mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in \Theta} \|f_\lambda - f\|_{L_2(P_X)}^2\right).$$

Some special cases are in order:

1. When $\Theta = {\mathbb R}^M$, the estimate $\hat{f}_n$ is compared with the best linear aggregates and this is called linear aggregation, with optimal rate denoted as $R^{\text{L}}$.
2. When $\Theta = \Lambda^M = \{\lambda\in {\mathbb R}_+^M: \sum_{i=1}^M \lambda_i\le 1 \}$, the estimate $\hat{f}_n$ is compared with the best convex combination of candidates (and zero) and this is called convex aggregation, with optimal rate denoted as $R^{\text{C}}$.
3. When $\Theta = \{e_1,\cdots,e_M\}$, the set of canonical vectors, the estimate $\hat{f}_n$ is compared with the best candidate in $\mathcal{F}$ and this is called model selection aggregation, with optimal rate denoted as $R^{\text{MS}}$.

The main result in this section is summarized in the following theorem.

Theorem 1 If there is a cube $\mathcal{X}_0\subseteq \mathcal{X}$ such that $P_X$ admits a density bounded from below w.r.t. the Lebesgue measure on $\mathcal{X}_0$, then

$$R_{n,M}^\Theta = \begin{cases} \Omega(1\wedge \frac{M}{n}) &\text{if } \Theta = {\mathbb R}^M, \\ \Omega(1\wedge \frac{M}{n} \wedge \sqrt{\frac{1}{n}\log(\frac{M}{\sqrt{n}}+1)} ) & \text{if } \Theta = \Lambda^M, \\ \Omega(1\wedge \frac{\log M}{n}) & \text{if } \Theta = \{e_1,\cdots,e_M\}. \end{cases}$$

We remark that the rates in Theorem 1 are all tight. In the upcoming subsections we will show that although the loss function of aggregation becomes more complicated, the idea of multiple hypothesis testing can still lead to tight lower bounds.

2.1. Linear aggregation

Since the oracle term $\inf_{\lambda \in {\mathbb R}^M} \|f_\lambda - f\|_{L_2(P_X)}^2$ is hard to deal with, a natural idea is to consider a well-specified model in which this term is zero. Since $P_X$ admits a density, we may find a partition $\mathcal{X} = \cup_{i=1}^M \mathcal{X}_i$ such that $P_X(\mathcal{X}_i) = M^{-1}$ for all $i$. Consider the candidate functions $f_i(x) = 1(x\in \mathcal{X}_i)$, and for $v\in \{0, 1\}^M$, let

$$f_v(x) := \gamma\sum_{i=1}^M v_if_i(x),$$

where $\gamma\in (0,1)$ is to be specified.

To apply Assouad’s lemma, note that for the loss function

$$L(f,\hat{f}) := \|\hat{f} - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in {\mathbb R}^M} \|f_\lambda - f\|_{L_2(P_X)}^2,$$

the oracle term vanishes at every $f_v$, and the orthogonality of the candidates $(f_i)_{i\in [M]}$ implies that the separation condition holds for $\Delta = \gamma^2/(2M)$. Moreover,

$$\max_{d_{\text{H}}(v,v')=1 } D_{\text{KL}}(P_{f_v}^{\otimes n} \| P_{f_{v'}}^{\otimes n}) = \max_{d_{\text{H}}(v,v')=1 } \frac{n\|f_v - f_{v'}\|_{L_2(P_X)}^2}{2} = O\left(\frac{n\gamma^2}{M}\right).$$

Therefore, Assouad’s lemma (with Pinsker’s inequality) gives

$$R_{n,M}^{\text{L}} \ge \frac{\gamma^2}{4}\left(1 - O\left( \sqrt{\frac{n\gamma^2}{M}}\right)\right).$$

Choosing $\gamma \asymp 1\wedge \sqrt{M/n}$ with a sufficiently small constant completes the proof.
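The bound above can be made fully explicit in this construction, since $\|f_v - f_{v'}\|_{L_2(P_X)}^2 = \gamma^2/M$ for neighboring $v,v'$, so the KL divergence equals $n\gamma^2/(2M)$ and Pinsker’s inequality gives $\text{TV}\le \sqrt{\text{KL}/2}$. The following sketch evaluates $\frac{\gamma^2}{4}(1-\text{TV})$ with $\gamma = 1\wedge\sqrt{M/n}$; the function name is a hypothetical helper.

```python
import math

def linear_aggregation_bound(n, M):
    """Evaluate (gamma^2 / 4) * (1 - TV) with TV <= sqrt(KL / 2), KL = n gamma^2 / (2M)."""
    gamma = min(1.0, math.sqrt(M / n))
    kl = n * gamma**2 / (2 * M)              # KL between neighboring hypotheses
    tv = math.sqrt(kl / 2)                   # Pinsker's inequality
    return gamma**2 / 4 * (1 - tv)

print(linear_aggregation_bound(1000, 10))    # regime M <= n
print(linear_aggregation_bound(1000, 5000))  # regime M > n
```

For $M\le n$ the total variation bound equals $1/2$ and the result is $M/(8n)$; for $M>n$ it is at least $1/8$.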

2.2. Convex aggregation

Convex aggregation differs from linear aggregation only in that the coefficients must be non-negative and sum to at most one. Since the only requirement in the previous arguments is the orthogonality of $(f_i)_{i\in [M]}$ under $L_2(P_X)$, we may choose any orthonormal functions $(f_i)_{i\in [M]}$ under $L_2(dx)$ with $\max_i\|f_i\|_\infty = O(1)$ (existence follows from the cube assumption) and use the density lower bound assumption to conclude that $\|f_i\|_{L_2(P_X)} = \Omega(\|f_i\|_2)$ (to ensure the desired separation). Then the choice of $\gamma$ above becomes $O(n^{-1/2})$, and the coefficient vector $\gamma v$ lies in $\Lambda^M$ as long as $\gamma M\le 1$, so the previous arguments still hold for convex aggregation if $M = O(\sqrt{n})$. Hence, it remains to prove that when $M = \Omega(\sqrt{n})$,

$$R_{n,M}^{\text{C}} = \Omega\left(1\wedge \sqrt{\frac{1}{n}\log\left(\frac{M}{\sqrt{n}}+1\right) }\right).$$

Again we consider the well-specified case where

$$f_v(x) := \gamma\sum_{i=1}^M v_if_i(x),$$

with the above orthonormal functions $(f_i)$, a constant scaling factor $\gamma = O(1)$, and a uniform size-$m$ subset of $(v_i)$ being $1/m$ and zero otherwise ($m\in \mathbb{N}$ is to be specified). Since the vector $v$ is no longer the vertex of a hypercube, we apply the generalized Fano’s inequality instead of Assouad’s lemma. First, applying Lemma 4 in Lecture 8 gives

$$I(V; X) \le \mathop{\mathbb E}_V D_{\text{KL}}(P_{f_V}^{\otimes n} \| P_{f_0}^{\otimes n}) = O\left(\frac{n}{m}\right).$$

Second, as long as $m\le M/3$, using similar arguments as in the sparse linear regression example in Lecture 8, for $\Delta = c/m$ with a small constant $c>0$ we have

$$p_\Delta := \sup_{f} \mathop{\mathbb P}\left(\|f_V - f\|_{L_2(P_X)}^2 \le \Delta \right) \le \left(\frac{M}{m}+1\right)^{-c'm},$$

with $c'>0$. Therefore, the generalized Fano’s inequality gives

$$R_{n,M}^{\text{C}} \ge \frac{c}{m}\left(1 - \frac{O(n/m) + \log 2}{c'm\log\left(\frac{M}{m}+1\right)} \right),$$

and choosing $m \asymp \sqrt{n/\log(M/\sqrt{n}+1)} = O(M)$ completes the proof.
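A small numerical illustration of this choice of $m$: the sketch below rounds $\sqrt{n/\log(M/\sqrt{n}+1)}$ up to an integer (all constants set to $1$, which is an assumption) and compares the resulting separation level $1/m$ with the target rate. The function name is hypothetical.

```python
import math

def convex_aggregation_m(n, M):
    """Sparsity level m ~ sqrt(n / log(M / sqrt(n) + 1)), rounded up."""
    return math.ceil(math.sqrt(n / math.log(M / math.sqrt(n) + 1)))

n, M = 10_000, 5_000
m = convex_aggregation_m(n, M)
rate = math.sqrt(math.log(M / math.sqrt(n) + 1) / n)
print(m, 1 / m, rate, m <= M / 3)
```

In this example $1/m$ matches the rate $\sqrt{\log(M/\sqrt{n}+1)/n}$ up to rounding, and the requirement $m\le M/3$ holds comfortably.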

2.3. Model selection aggregation

As before, we consider the well-specified case where we select orthonormal functions $(\phi_i)_{i\in [M]}$ on $L_2(dx)$, and let $f_i = \gamma \phi_i$, $f\in \{f_1,\cdots,f_M\}$. The orthonormality of $(\phi_i)$ gives $\Delta = \Theta(\gamma^2)$ for the separation condition of the original Fano’s inequality, and

$$\max_{i\neq j} D_{\text{KL}}(P_{f_i}^{\otimes n}\| P_{f_j}^{\otimes n}) = \max_{i\neq j} \frac{n\|f_i - f_j\|_{L_2(P_X)}^2}{2} = O\left(n\gamma^2\right).$$

Hence, Fano’s inequality gives

$$R_{n,M}^{\text{MS}} = \Omega\left(\gamma^2 \left(1 -\frac{n\gamma^2 + \log 2}{\log M}\right)\right),$$

and choosing $\gamma^2 \asymp 1\wedge n^{-1}\log M$ completes the proof.
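The role of the constant in $\gamma^2 \asymp 1\wedge n^{-1}\log M$ can be seen by evaluating the Fano bound directly. The sketch below sets the constants hidden in $\Omega(\cdot)$ and $O(\cdot)$ to $1$ and uses a hypothetical constant `c = 0.25` in the choice of $\gamma^2$; both are assumptions.

```python
import math

def model_selection_bound(n, M, c=0.25):
    """Evaluate gamma^2 * (1 - (n gamma^2 + log 2) / log M) with gamma^2 = c * min(1, log(M)/n)."""
    gamma_sq = c * min(1.0, math.log(M) / n)
    fano_factor = 1.0 - (n * gamma_sq + math.log(2)) / math.log(M)
    return gamma_sq * fano_factor

print(model_selection_bound(1000, 1000))
```

With $c<1$ and $M$ moderately large, the Fano factor stays bounded away from zero and the bound is a constant fraction of $1\wedge n^{-1}\log M$.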

We may show a further result: no model selector can attain the above optimal rate of model selection aggregation. Hence, even to compare with the best candidate function in $\mathcal{F}$, the optimal aggregate $\hat{f}$ should not be restricted to $\mathcal{F}$. Specifically, we have the following result, whose proof is left as an exercise.

Exercise 1 Under the same assumptions of Theorem 1, show that

$$\sup_{f_1,\cdots,f_M} \inf_{\hat{f}_n\in\{f_1,\cdots,f_M\} } \sup_{\|f\|_\infty\le 1} \mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \min_{i\in [M]} \|f_i - f\|_{L_2(P_X)}^2\right) = \Omega\left(1\wedge \sqrt{\frac{\log M}{n}}\right).$$

3. Example III: Learning Theory

Consider the classification problem where there are $n$ training data $(x_1,y_1), \cdots, (x_n, y_n)$ i.i.d. drawn from an unknown distribution $P_{XY}$, with $y_i \in \{\pm 1\}$. There is a given collection of classifiers $\mathcal{F}$ consisting of functions $f: \mathcal{X}\rightarrow \{\pm 1\}$, and given the training data, the target is to find some classifier $\hat{f}\in \mathcal{F}$ with a small excess risk

$$R_{\mathcal{F}}(P_{XY}, \hat{f}) = \mathop{\mathbb E}\left[P_{XY}(Y \neq \hat{f}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))\right],$$

which is the difference in the performance of the chosen classifier and the best classifier in the function class $\mathcal{F}$. In the definition of the excess risk, the expectation is taken with respect to the randomness in the training data. The main focus of this section is to characterize the minimax excess risk of a given function class $\mathcal{F}$, i.e.,

$$R_{\text{pes}}^\star(\mathcal{F}) := \inf_{\hat{f}}\sup_{P_{XY}} R_{\mathcal{F}}(P_{XY}, \hat{f}).$$

The subscript “pes” here stands for “pessimistic”, where $P_{XY}$ can be any distribution over $\mathcal{X} \times \{\pm 1\}$ and there may not be a good classifier in $\mathcal{F}$, i.e., $\inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))$ may be large. We also consider the optimistic scenario where there exists a perfect (error-free) classifier in $\mathcal{F}$. Mathematically, denoting by $\mathcal{P}_{\text{opt}}(\mathcal{F})$ the collection of all probability distributions $P_{XY}$ on $\mathcal{X}\times \{\pm 1\}$ such that $\inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))=0$, the minimax excess risk of a given function class $\mathcal{F}$ in the optimistic case is defined as

$$R_{\text{opt}}^\star(\mathcal{F}) := \inf_{\hat{f}}\sup_{P_{XY} \in \mathcal{P}_{\text{opt}}(\mathcal{F})} R_{\mathcal{F}}(P_{XY}, \hat{f}).$$

The central claim of this section is the following:

Theorem 2 Let the VC dimension of $\mathcal{F}$ be $d$. Then

$$R_{\text{opt}}^\star(\mathcal{F}) = \Omega\left(1\wedge \frac{d}{n}\right), \qquad R_{\text{pes}}^\star(\mathcal{F}) = \Omega\left(1\wedge \sqrt{\frac{d}{n}}\right).$$

Recall that the definition of VC dimension is as follows:

Definition 3 For a given function class $\mathcal{F}$ consisting of mappings from $\mathcal{X}$ to $\{\pm 1\}$, the VC dimension of $\mathcal{F}$ is the largest integer $d$ such that there exist $d$ points from $\mathcal{X}$ which can be shattered by $\mathcal{F}$. Mathematically, it is the largest $d>0$ such that there exist $x_1,\cdots,x_d\in \mathcal{X}$, and for all $v\in \{\pm 1\}^d$, there exists a function $f_v\in \mathcal{F}$ such that $f_v(x_i) = v_i$ for all $i\in [d]$.
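For a finite domain and a finite class, the definition can be checked by brute force. The sketch below is a direct (exponential-time) transcription of Definition 3; the two toy classes, thresholds and intervals on $\{0,1,2,3\}$, are illustrative assumptions and not taken from the lecture.

```python
from itertools import combinations, product

def is_shattered(points, classifiers):
    """True if every sign pattern on `points` is realized by some classifier."""
    realized = {tuple(f(x) for x in points) for f in classifiers}
    return all(pattern in realized for pattern in product((-1, 1), repeat=len(points)))

def vc_dimension(domain, classifiers):
    """Largest d such that some d-point subset of the finite `domain` is shattered."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(is_shattered(pts, classifiers) for pts in combinations(domain, d)):
            best = d
        else:
            break  # subsets of shattered sets are shattered, so no larger set can work
    return best

domain = [0, 1, 2, 3]
# Thresholds f_t(x) = +1 if x >= t, else -1.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
# Intervals f_{a,b}(x) = +1 if a <= x <= b, else -1 (a > b gives the all -1 classifier).
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else -1
             for a in domain for b in domain]
print(vc_dimension(domain, thresholds), vc_dimension(domain, intervals))
```

Thresholds cannot realize the pattern $(+1,-1)$ on two ordered points, and intervals cannot realize $(+1,-1,+1)$ on three, which is why the brute-force search stops at $1$ and $2$ respectively.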

VC dimension plays a significant role in statistical learning theory. For example, it is well-known that for the empirical risk minimization (ERM) classifier

$$\hat{f}^{\text{ERM}} = \arg\min_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n 1(y_i \neq f(x_i)),$$

we have

$$\mathop{\mathbb E}\left[P_{XY}(Y \neq \hat{f}^{\text{ERM}}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))\right] \le 1\wedge \begin{cases} O(d/n) & \text{in optimistic case}, \\ O(\sqrt{d/n}) & \text{in pessimistic case}. \end{cases}$$

Hence, Theorem 2 shows that the ERM classifier attains the minimax excess risk up to constant factors for all function classes, and the VC dimension characterizes the difficulty of the learning problem.
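For a finite class, the ERM classifier defined above is a one-line minimization of the empirical error. The following sketch uses a hypothetical toy class of threshold classifiers and a small hand-made data set with one noisy label; neither is from the lecture.

```python
def empirical_error(f, data):
    """Fraction of training pairs (x, y) misclassified by f."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

def erm(classifiers, data):
    """Return a classifier in the (finite) class minimizing the empirical error."""
    return min(classifiers, key=lambda f: empirical_error(f, data))

# Hypothetical toy class: thresholds f_t(x) = +1 if x >= t, else -1.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
# Labels follow the threshold t = 2, except for one noisy label at x = 3.
data = [(0, -1), (1, -1), (2, 1), (3, 1), (3, -1)]
f_hat = erm(thresholds, data)
print(empirical_error(f_hat, data))
```

Here the threshold at $t=2$ is the unique empirical risk minimizer, with one error out of five samples.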

3.1. Optimistic case

We first apply Assouad’s lemma to the optimistic scenario. By the definition of VC dimension, there exist points $x_1,\cdots,x_d\in \mathcal{X}$ and functions $f_v\in \mathcal{F}$ such that for all $v\in \{\pm 1\}^d$ and $i\in [d]$, we have $f_v(x_i) = v_i$. Consider $2^{d-1}$ hypotheses indexed by $u\in \{\pm 1\}^{d-1}$: the distribution $P_X$ is always

$$P_X(\{x_i\}) = p > 0, \quad \forall i\in [d-1], \qquad P_X(\{x_d\}) = 1-p(d-1),$$

where $p\in (0,(d-1)^{-1})$ is to be specified later. For the conditional distribution $P_{Y|X}$, let $Y = f_{(u,1)}(X)$ hold almost surely under the joint distribution $P_u$. Clearly this is the optimistic case, for there always exists a perfect classifier in $\mathcal{F}$.

We first examine the separation condition in Assouad’s lemma, where the loss function here is

$$L(P_{XY}, \hat{f}) := P_{XY}(Y \neq \hat{f}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X)).$$

For any $u,u'\in \{\pm 1\}^{d-1}$ and any $\hat{f}\in \mathcal{F}$, we have

$$\begin{array}{rcl} L(P_u, \hat{f}) + L(P_{u'}, \hat{f}) &\ge& p\cdot \sum_{i=1}^{d-1} \left[1(\hat{f}(x_i) \neq f_{(u,1)}(x_i)) + 1(\hat{f}(x_i) \neq f_{(u',1)}(x_i))\right] \\ &=& p\cdot \sum_{i=1}^{d-1} \left[1(\hat{f}(x_i) \neq u_i) + 1(\hat{f}(x_i) \neq u_i')\right] \\ &\ge & p\cdot d_{\text{H}}(u,u'), \end{array}$$

and therefore the separation condition holds for $\Delta = p$. Moreover, if $u$ and $u'$ only differ in the $i$-th component, then $P_u$ and $P_{u'}$ are completely indistinguishable if $x_i$ does not appear in the training data, which happens with probability $(1-p)^n$. Hence, by coupling,

$$\max_{d_{\text{H}}(u,u')=1 } \|P_u^{\otimes n} - P_{u'}^{\otimes n}\|_{\text{TV}} \le 1 - \left(1 - p\right)^n.$$

Therefore, Assouad’s lemma gives

$$R_{\text{opt}}^\star(\mathcal{F}) \ge \frac{(d-1)p}{2}\left(1-p\right)^n,$$

and choosing $p = \min\{n^{-1},(d-1)^{-1}\}$ yields the desired lower bound.
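The resulting bound is easy to evaluate. The sketch below uses the coupling argument in the form $\text{TV}\le 1-(1-p)^n$ (two hypotheses differing at $x_i$ induce the same law of the sample unless $x_i$ is drawn at least once), so that Assouad’s lemma yields $\frac{(d-1)p}{2}(1-p)^n$; the function name is a hypothetical helper and $d\ge 2$ is assumed.

```python
def optimistic_bound(n, d):
    """Evaluate (d-1) p / 2 * (1 - TV) with TV <= 1 - (1-p)^n and p = min(1/n, 1/(d-1))."""
    p = min(1.0 / n, 1.0 / (d - 1))
    tv_bound = 1.0 - (1.0 - p) ** n          # probability that x_i appears among n draws
    return (d - 1) * p / 2 * (1.0 - tv_bound)

for n, d in [(100, 11), (1000, 1001), (5, 50)]:
    print(n, d, optimistic_bound(n, d), min(1.0, (d - 1) / n) / 8)
```

Since $(1-1/m)^m\ge 1/4$ for every integer $m\ge 2$, the bound is at least $\frac{1}{8}\left(1\wedge \frac{d-1}{n}\right)$ in these regimes.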

3.2. Pessimistic case

The analysis of the pessimistic case only differs in the construction of the hypotheses. As before, fix $x_1,\cdots,x_d\in \mathcal{X}$ and $f_v\in \mathcal{F}$ such that $f_v(x_i) = v_i$ for all $v\in \{\pm 1\}^d$ and $i\in [d]$. Now let $P_X$ be the uniform distribution on $\{x_1,\cdots,x_d\}$, and under $P_v$, the conditional probability $P_{Y|X}$ is

$$P_v(Y= v_i | X=x_i) = \frac{1}{2} + \delta, \qquad \forall i\in [d],$$

where $\delta\in (0,1/2)$ is to be specified later. In other words, the classifier $f_v$ only outperforms the random guess by a margin of $\delta$ under $P_v$.

Again we apply Assouad’s lemma for the lower bound of $R_{\text{pes}}^\star(\mathcal{F})$. First note that for all $v\in \{\pm 1\}^d$,

$$\min_{f^\star \in \mathcal{F}} P_v(Y \neq f^\star(X)) = \frac{1}{2} - \delta.$$

Hence, for any $v\in \{\pm 1\}^d$ and any $f\in \mathcal{F}$,

$$\begin{array}{rcl} L(P_v, f) &=& \frac{1}{d}\sum_{i=1}^d \left[\frac{1}{2} + \left(2\cdot 1(f(x_i) \neq v_i) - 1\right)\delta \right] - \left(\frac{1}{2}-\delta\right) \\ &=& \frac{2\delta}{d}\sum_{i=1}^d 1(f(x_i) \neq v_i). \end{array}$$

By the triangle inequality, the separation condition is fulfilled with $\Delta = 2\delta/d$. Moreover, direct computation yields

$$\max_{d_{\text{H}}(v,v')=1 } H^2(P_v, P_{v'}) = \frac{1}{d}\left(\sqrt{\frac{1}{2}+\delta} - \sqrt{\frac{1}{2}-\delta}\right)^2 = O\left(\frac{\delta^2}{d}\right),$$

and therefore tensorization gives

$$\max_{d_{\text{H}}(v,v')=1 } H^2(P_v^{\otimes n}, P_{v'}^{\otimes n}) = 2 - 2\left(1 - O\left(\frac{\delta^2}{d}\right) \right)^n.$$

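Finally, Assouad’s lemma gives $R_{\text{pes}}^\star(\mathcal{F}) = \Omega(\delta)$ as long as the above squared Hellinger distance is bounded away from $2$, and choosing $\delta = (1\wedge \sqrt{d/n})/2$ gives the desired result.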
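Both ingredients of this argument can be verified by enumeration for small $d$. The sketch below computes the excess risk directly from the definition of $P_v$ and compares it with $\frac{2\delta}{d}\sum_i 1(f(x_i)\neq v_i)$, and it also checks that the one-flip Hellinger quantity lies between $2\delta^2/d$ and $4\delta^2/d$. The helper names are hypothetical.

```python
import math
from itertools import product

def excess_risk(v, f_vals, delta):
    """Excess risk under P_v of a classifier with values f_vals on x_1..x_d (uniform P_X)."""
    d = len(v)
    # At x_i the label equals v_i with probability 1/2 + delta.
    err = sum((0.5 - delta) if f == vi else (0.5 + delta)
              for f, vi in zip(f_vals, v)) / d
    return err - (0.5 - delta)               # subtract the risk of the best classifier f_v

def hellinger_sq_one_flip(delta, d):
    """(1/d) * (sqrt(1/2 + delta) - sqrt(1/2 - delta))^2 for neighboring hypotheses."""
    return (math.sqrt(0.5 + delta) - math.sqrt(0.5 - delta)) ** 2 / d

delta, d = 0.1, 3
for v in product((-1, 1), repeat=d):
    for f_vals in product((-1, 1), repeat=d):
        mismatches = sum(a != b for a, b in zip(v, f_vals))
        assert abs(excess_risk(v, f_vals, delta) - 2 * delta * mismatches / d) < 1e-12
print(hellinger_sq_one_flip(delta, d), 2 * delta**2 / d, 4 * delta**2 / d)
```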

3.3. General case

We may interpolate between the pessimistic and optimistic cases as follows: for any given $\varepsilon\in (0,1/2]$, we restrict to the set $\mathcal{P}(\mathcal{F},\varepsilon)$ of joint distributions with

$$\sup_{P_{XY}\in \mathcal{P}(\mathcal{F},\varepsilon) }\inf_{f^\star\in \mathcal{F}} P_{XY}(Y\neq f^\star(X))\le \varepsilon.$$

Then we may define the minimax excess risk over $\mathcal{P}(\mathcal{F},\varepsilon)$ as

$$R^\star(\mathcal{F},\varepsilon) := \inf_{\hat{f}}\sup_{P_{XY} \in \mathcal{P}(\mathcal{F},\varepsilon)} R_{\mathcal{F}}(P_{XY}, \hat{f}).$$

Clearly the optimistic case corresponds to $\varepsilon =0$, and the pessimistic case corresponds to $\varepsilon = 1/2$. Similar to the above arguments, we have the following lower bound on $R^\star(\mathcal{F},\varepsilon)$, whose proof is left as an exercise.

Exercise 2 Show that when the VC dimension of $\mathcal{F}$ is $d$, then

$$R^\star(\mathcal{F},\varepsilon) = \Omega\left(1\wedge \left(\frac{d}{n} + \sqrt{\frac{d}{n}\cdot \varepsilon}\right) \right).$$

4. Example IV: Stochastic Optimization

Consider the following oracle formulation of convex optimization: let $f: \Theta\rightarrow {\mathbb R}$ be the convex objective function, and we aim to find the minimum value $\min_{\theta\in\Theta} f(\theta)$. To do so, the learner can query some first-order oracle adaptively, where given the query $\theta_t\in \Theta$ the oracle outputs a pair $(f(\theta_t), g(\theta_t))$ consisting of the objective value at $\theta_t$ (zeroth-order information) and a sub-gradient of $f$ at $\theta_t$ (first-order information). The queries can be made in an adaptive manner where $\theta_t$ can depend on all previous outputs of the oracle. The target of convex optimization is to determine the minimax optimization gap after $T$ queries defined as

$$R_T^\star(\mathcal{F}, \Theta) = \inf_{\theta_{T+1}}\sup_{f\in \mathcal{F}} \left( f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta) \right),$$

where $\mathcal{F}$ is a given class of convex functions, and the final query $\theta_{T+1}$ can only depend on the past outputs of the oracle.

Since there is no randomness involved in the above problem, the idea of multiple hypothesis testing cannot be directly applied here. Therefore, in this section we consider a simpler problem of stochastic optimization, and we postpone the general case to later lectures. Specifically, suppose that the above first-order oracle $\mathcal{O}$ is stochastic, i.e., it only outputs $(\widehat{f}(\theta_t), \widehat{g}(\theta_t))$ such that $\mathop{\mathbb E}[\widehat{f}(\theta_t)] = f(\theta_t)$, and $\mathop{\mathbb E}[\widehat{g}(\theta_t)] \in \partial f(\theta_t)$. Let $\mathcal{F}$ be the set of all convex $L$-Lipschitz functions (in $L_2$ norm), $\Theta = \{\theta\in {\mathbb R}^d: \|\theta\|_2\le R \}$, and assume that the subgradient estimate returned by the oracle $\mathcal{O}$ always satisfies $\|\widehat{g}(\theta_t)\|_2\le L$ as well. Now the target is to determine the quantity

$$R_{T,d,L,R}^\star = \inf_{\theta_{T+1}} \sup_{f \in \mathcal{F}} \sup_{\mathcal{O}} \left(\mathop{\mathbb E} f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta)\right),$$

where the expectation is taken over the randomness in the oracle output. The main result in this section is summarized in the following theorem:

Theorem 4

$$R_{T,d,L,R}^\star = \Omega\left(\frac{LR}{\sqrt{T}}\right).$$

Since it is well-known that the stochastic gradient descent attains the optimization gap $O(LR/\sqrt{T})$ for convex Lipschitz functions, the lower bound in Theorem 4 is tight.

Now we apply the multiple hypothesis testing idea to prove Theorem 4. Since the randomness only comes from the oracle, we should design the stochastic oracle carefully to reduce the optimization problem to testing. A natural way is to choose some function $F(\theta;X)$ such that $F(\cdot;X)$ is convex $L$-Lipschitz for all $X$, and $f(\theta) = \mathop{\mathbb E}_P[F(\theta;X)]$. In this way, the oracle can simply generate i.i.d. $X_1,\cdots,X_T\sim P$, and reveal the random vector $(F(\theta_t;X_t), \nabla F(\theta_t;X_t))$ to the learner at the $t$-th query. Hence, the proof boils down to finding a collection of probability distributions $(P_v)_{v\in \mathcal{V}}$ that are well separated yet hard to distinguish based on $T$ observations.

We first look at the separation condition. Since the loss function of this problem is $L(f,\theta_{T+1}) = f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta)$, we see that

$$\inf_{\theta_{T+1}\in \Theta} L(f,\theta_{T+1}) + L(g,\theta_{T+1}) = \min_{\theta\in\Theta} (f(\theta) + g(\theta)) - \min_{\theta\in\Theta} f(\theta) - \min_{\theta\in\Theta} g(\theta),$$

which is known as the optimization distance $d_{\text{opt}}(f,g)$. Now consider

$$F(\theta;X) = L\left|\theta_i - \frac{Rx_i}{\sqrt{d}}\right| \qquad \text{if }x = \pm e_i,$$

and for $v\in \{\pm 1\}^d$, let $f_v(\theta) = \mathop{\mathbb E}_{P_v}[F(\theta;X)]$ with $P_v$ defined as

$$P_v(X = v_ie_i) = \frac{1+\delta}{2d}, \quad P_v(X= -v_ie_i) = \frac{1-\delta}{2d}, \qquad \forall i\in [d].$$

Clearly $F(\cdot;X)$ is convex and $L$-Lipschitz, and for $\|\theta\|_\infty \le R/\sqrt{d}$ we have

$$f_v(\theta) = \frac{LR}{\sqrt{d}} - \frac{\delta L}{d}\sum_{i=1}^d v_i\theta_i.$$

Hence, it is straightforward to verify that $\min_{\|\theta\|_2\le R} f_v(\theta) = LR(1-\delta)/\sqrt{d}$, and

$$d_{\text{opt}}(f_v,f_{v'}) = \frac{2\delta LR}{d^{3/2}}\cdot d_{\text{H}}(v,v').$$

In other words, the separation condition in Assouad’s lemma is fulfilled with $\Delta = 2\delta LR/d^{3/2}$. Moreover, since $P_v$ and $P_{v'}$ with $d_{\text{H}}(v,v')=1$ differ only on two atoms of total mass $1/d$, simple algebra gives

$$\max_{d_{\text{H}}(v,v') =1} D_{\text{KL}}(P_v^{\otimes T}\| P_{v'}^{\otimes T}) = O\left(\frac{T\delta^2}{d}\right),$$

and therefore Assouad’s lemma gives

$$R_{T,d,L,R}^\star \ge d\cdot \frac{\delta LR}{d^{3/2}}\left(1 - O\left(\delta\sqrt{\frac{T}{d}}\right)\right),$$

and choosing $\delta \asymp \sqrt{d/T}$ completes the proof of Theorem 4 when $T\ge d$ (when $T<d$, the same construction restricted to the first $T$ coordinates gives the same bound). In fact, the above arguments also show that if the $L_2$ norm is replaced by the $L_\infty$ norm, then

$$R_{T,d,L,R}^\star = \Theta\left(LR\sqrt{\frac{d}{T}}\right).$$
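The optimization distance in the above construction can be checked by brute force. Each $f_v$ is a sum of one-dimensional convex piecewise-linear functions with kinks at $\pm R/\sqrt{d}$, so each coordinate can be minimized over the two kinks, and every such minimizer has $\|\theta\|_2 = R$ and is feasible. Under these observations, the sketch below computes $d_{\text{opt}}(f_v,f_{v'})$ directly from the definition of $F$ and $P_v$ and compares it with $2\delta LR\, d_{\text{H}}(v,v')/d^{3/2}$; the helper names are hypothetical.

```python
import math
from itertools import product

def coord_loss(t, v_i, delta, L, R, d):
    """Contribution of coordinate i to f_v(theta) = E_{P_v} F(theta; X) when theta_i = t."""
    a = R / math.sqrt(d)
    return (L / d) * ((1 + delta) / 2 * abs(t - v_i * a)
                      + (1 - delta) / 2 * abs(t + v_i * a))

def d_opt(v, w, delta, L, R):
    """min(f_v + f_w) - min f_v - min f_w, via coordinate-wise search over the kinks."""
    d = len(v)
    a = R / math.sqrt(d)
    kinks = (-a, a)
    total = 0.0
    for vi, wi in zip(v, w):
        joint = min(coord_loss(t, vi, delta, L, R, d) + coord_loss(t, wi, delta, L, R, d)
                    for t in kinks)
        best_v = min(coord_loss(t, vi, delta, L, R, d) for t in kinks)
        best_w = min(coord_loss(t, wi, delta, L, R, d) for t in kinks)
        total += joint - best_v - best_w
    return total

delta, L, R, d = 0.2, 1.0, 1.0, 4
v = (1, 1, 1, 1)
w = (1, -1, 1, -1)                           # Hamming distance 2 from v
print(d_opt(v, w, delta, L, R), 2 * delta * L * R * 2 / d**1.5)
```

Only the coordinates where $v$ and $v'$ disagree contribute, each by $2\delta LR/d^{3/2}$, which is the per-coordinate separation used in Assouad’s lemma.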

5. Bibliographic Notes

The example of nonparametric density estimation over Sobolev balls is taken from Nemirovski (2000), which also contains the tight minimax linear risk among all linear estimators. For the upper bound, linear procedures fail to achieve the minimax rate whenever $q>p$, and some non-linearities are necessary for the minimax estimators either in the function domain (Lepski, Mammen and Spokoiny (1997)) or the wavelet domain (Donoho et al. (1996)).

The linear and convex aggregations are proposed in Nemirovski (2000) for the adaptive nonparametric thresholding estimators, and the concept of model selection aggregation is due to Yang (2000). For the optimal rates of different aggregations (together with upper bounds), we refer to Tsybakov (2003) and Leung and Barron (2006).

The examples from statistical learning theory and stochastic optimization are similar in nature. The results of Theorem 2 and the corresponding upper bounds are taken from Vapnik (1998), albeit with a different proof language. For general results of the oracle complexity of convex optimization, we refer to the wonderful book Nemirovski and Yudin (1983) and the lecture note Nemirovski (1995). The current proof of Theorem 4 is due to Agarwal et al. (2009).

1. Arkadi Nemirovski, Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour 28 (2000): 85.
2. Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny, Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics 25.3 (1997): 929–947.
3. David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard, Density estimation by wavelet thresholding. The Annals of Statistics (1996): 508–539.
4. Yuhong Yang, Combining different procedures for adaptive regression. Journal of multivariate analysis 74.1 (2000): 135–161.
5. Alexandre B. Tsybakov, Optimal rates of aggregation. Learning theory and kernel machines. Springer, Berlin, Heidelberg, 2003. 303–313.
6. Gilbert Leung, and Andrew R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52.8 (2006): 3396–3410.
7. Vladimir Vapnik, Statistical learning theory. Wiley, New York (1998): 156–160.
8. Arkadi Nemirovski, and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
9. Arkadi Nemirovski, Information-based complexity of convex programming. Lecture Notes, 1995.
10. Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar, Information-theoretic lower bounds on the oracle complexity of convex optimization. Advances in Neural Information Processing Systems, 2009.
