# Lecture 9: Multiple Hypothesis Testing: More Examples

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

In the last lecture we saw the general tools and concrete examples of reducing a statistical estimation problem to multiple hypothesis testing. In those examples the loss function was typically straightforward and the construction of the hypotheses was natural. However, in problems other than statistical estimation the loss function may become more complicated (e.g., the excess risk in learning theory), and the hypothesis constructions may be implicit. To better illustrate the power of multiple hypothesis testing, this lecture is devoted exclusively to further examples, including potentially non-statistical problems.

1. Example I: Density Estimation

This section considers one of the most fundamental problems in nonparametric statistics: density estimation. It is well known in the nonparametric statistics literature that if a density on the real line has Hölder smoothness parameter $s>0$, then it can be estimated within $L_2$ accuracy $\Theta(n^{-s/(2s+1)})$ based on $n$ samples. However, in this section we will show that this is no longer the correct minimax rate when we generalize to other notions of smoothness and other natural loss functions.

For a fixed integer $k\ge 1$ and norm parameter $p\in [1,\infty]$, the Sobolev space $\mathcal{W}^{k,p}(L)$ on $[0,1]$ is defined as

$$\mathcal{W}^{k,p}(L) := \{f\in C[0,1]: \|f\|_p + \|f^{(k)}\|_p \le L \},$$

where the derivative $f^{(k)}$ is understood in the distributional sense ($f$ need not belong to $C^k[0,1]$). Our target is to determine the minimax rate $R_{n,k,p,q}^\star$ of estimating the density $f\in \mathcal{W}^{k,p}(L)$ based on $n$ i.i.d. observations from $f$, under the general $L_q$ loss with $q\in [1,\infty)$. For simplicity we suppress the dependence on $L$ and assume that $L$ is a large positive constant.

The main result of this section is as follows:

$$R_{n,k,p,q}^\star = \begin{cases} \Omega(n^{-\frac{k}{2k+1}}) & \text{if } q<(1+2k)p, \\ \Omega((\frac{\log n}{n})^{\frac{k-1/p + 1/q}{2(k-1/p) + 1}} ) & \text{if } q \ge (1+2k)p. \end{cases}$$

This rate is also tight. We see that as $k,p$ are fixed and $q$ increases from $1$ to $\infty$, the minimax rate for density estimation increases from $n^{-k/(2k+1)}$ to $(\log n/n)^{(k-1/p)/(2(k-1/p)+1) }$. We will show that these rates correspond to the “dense” and the “sparse” cases, respectively.
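
As a quick sanity check (an illustrative script, not part of the notes), the dense exponent $k/(2k+1)$ and the sparse exponent $(k-1/p+1/q)/(2(k-1/p)+1)$ derived in the next two subsections coincide exactly at the phase boundary $q=(1+2k)p$, so the minimax rate is continuous in $q$:

```python
import math

# Dense and sparse rate exponents for estimating f in W^{k,p}(L) under L_q loss.
def dense_exponent(k):
    return k / (2 * k + 1)                  # rate n^{-k/(2k+1)}

def sparse_exponent(k, p, q):
    s = k - 1 / p
    return (s + 1 / q) / (2 * s + 1)        # rate (log n / n)^{exponent}

# The two exponents agree at the boundary q = (1+2k)p, and the sparse
# exponent (hence the rate) only gets worse as q grows beyond it.
for k in (1, 2, 3):
    for p in (1, 2, 5):
        q = (2 * k + 1) * p
        assert math.isclose(dense_exponent(k), sparse_exponent(k, p, q))
        assert sparse_exponent(k, p, 2 * q) < dense_exponent(k)
```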

1.1. Dense case: $q<(1+2k)p$

We first look at the case $q<(1+2k)p$, which always holds for Hölder smoothness, where the norm parameter is $p=\infty$. The reason why it is called the “dense” case is that in the hypothesis construction, the density is supported everywhere on $[0,1]$, and the difficulty is to classify the directions of the bumps around some common density. Specifically, let $h\in (0,1)$ be a bandwidth parameter to be specified later, and consider

$$f_v(x) := 1 + \sum_{i=1}^{h^{-1}} v_i\cdot h^sg\left(\frac{x - (i-1)h}{h}\right)$$

for $v\in \{\pm 1\}^{1/h}$, where $g$ is a fixed smooth function supported on $[0,1]$ with $\int_0^1 g(x)dx=0$, and $s>0$ is some tuning parameter to ensure that $f_v\in \mathcal{W}^{k,p}(L)$. Since

$$f_v^{(k)}(x) = \sum_{i=1}^{h^{-1}} v_i\cdot h^{s-k}g^{(k)}\left(\frac{x - (i-1)h}{h}\right),$$

we conclude that the choice $s=k$ is sufficient. To invoke Assouad’s lemma, first note that for $q=1$ the separation condition is fulfilled with $\Delta = \Theta(h^{k+1})$. Moreover, for any $v,v'\in \{\pm 1\}^{1/h}$ with $d_{\text{H}}(v,v') = 1$, we have

$$\chi^2(f_v^{\otimes n}, f_{v'}^{\otimes n}) + 1 \le \left( 1 + \frac{ \|f_v - f_{v'}\|_2^2 }{1 - h^k\|g\|_\infty} \right)^n = \left(1 + \frac{ 4h^{2k+1}\|g\|_2^2 }{1 - h^k\|g\|_\infty} \right)^n.$$

Hence, the largest $h>0$ to ensure that $\chi^2(f_v^{\otimes n}, f_{v'}^{\otimes n}) = O(1)$ is $h = \Theta(n^{-1/(2k+1)})$, which for $q=1$ gives the minimax lower bound

$$R_{n,k,p,q}^\star = \Omega(n^{-\frac{k}{2k+1}} ).$$

This lower bound is also valid for the general case $q\ge 1$ due to the monotonicity of $L_q$ norms.
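
The bandwidth balancing above can be checked numerically. The following sketch (an illustration, not from the notes; it assumes $\|g\|_2 = 1$ and that $h^k\|g\|_\infty \le 1/2$, so the denominator in the $\chi^2$ bound is at least $1/2$) confirms that $h = n^{-1/(2k+1)}$ keeps the $\chi^2$ divergence between neighboring hypotheses $O(1)$ uniformly in $n$:

```python
import math

# With h = n^{-1/(2k+1)}, the chi-square upper bound (1 + 4h^{2k+1}/(1/2))^n - 1
# equals (1 + 8/n)^n - 1 <= e^8 - 1 for every n, so neighboring hypotheses
# remain statistically indistinguishable at this bandwidth.
def chi2_upper_bound(n, k):
    h = n ** (-1 / (2 * k + 1))
    return (1 + 4 * h ** (2 * k + 1) / 0.5) ** n - 1

for n in (10, 10**3, 10**6):
    for k in (1, 2):
        assert chi2_upper_bound(n, k) <= math.exp(8) - 1
```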

1.2. Sparse case: $q\ge (1+2k)p$

Next we turn to the sparse case $q\ge (1+2k)p$, where the name “sparse” comes from the fact that the constructed hypotheses are densities with only one bump on a small interval, and the main difficulty is to locate that interval. Specifically, let $h\in (0,1)$ and $s>0$ be bandwidth and smoothness parameters to be specified later, and consider

$$f_i(x) := \left(1 - h^{s+1}\right) + h^sg\left(\frac{x-(i-1)h}{h}\right)\cdot 1\left((i-1)h < x \le ih\right),$$

where $i\in [h^{-1}]$, and $g$ is a fixed smooth density on $[0,1]$. Clearly $f_i$ is a density on $[0,1]$ for all $i\in [h^{-1}]$, and

$$f_i^{(k)}(x) = h^{s-k}g^{(k)} \left(\frac{x-(i-1)h}{h}\right)\cdot 1\left((i-1)h < x \le ih\right).$$

Consequently, $\|f_i^{(k)}\|_p = h^{s-k+1/p}\|g^{(k)}\|_p$, and the Sobolev ball requirement leads to the choice $s = k-1/p$.

Next we check the conditions of Fano’s inequality. Since

$$\|f_i - f_j\|_q = h^{k-1/p+1/q} \cdot 2^{1/q}\|g\|_q, \qquad \forall i\neq j\in [h^{-1}],$$

the separation condition is fulfilled with $\Delta = \Theta(h^{k-1/p+1/q})$. Moreover, since

$$D_{\text{KL}}(f_i \| f_j) \le \frac{\|f_i - f_j\|_2^2}{1-h^s} = \frac{2\|g\|_2^2}{1-h^s}\cdot h^{2(k-1/p)+1},$$

we conclude that $I(V;X)= O(nh^{2(k-1/p)+1})$. Consequently, Fano’s inequality gives

$$R_{n,k,p,q}^\star = \Omega\left(h^{k-1/p+1/q}\left(1 - \frac{O(nh^{2(k-1/p)+1})+\log 2}{\log(1/h)} \right) \right),$$

and choosing $h = \Theta((\log n/n)^{1/(2(k-1/p)+1)} )$ gives the desired result.
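
The final step is pure arithmetic and can be checked numerically. The sketch below (an illustration, not from the notes; the constant $c = 0.05$ in the $O(\cdot)$ term is an assumption) verifies that with this choice of $h$, the KL term equals $c\log n$ while $\log(1/h)$ grows like $\log n / (2(k-1/p)+1)$, so the Fano bracket stays positive:

```python
import math

# Fano bracket 1 - (O(n h^{2(k-1/p)+1}) + log 2) / log(1/h) at the chosen
# bandwidth h = (log n / n)^{1/(2(k-1/p)+1)}; c is the assumed O(.) constant.
def fano_bracket(n, k, p, c=0.05):
    s = k - 1 / p
    h = (math.log(n) / n) ** (1 / (2 * s + 1))
    kl_term = c * n * h ** (2 * s + 1)          # = c * log n
    return 1 - (kl_term + math.log(2)) / math.log(1 / h)
```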

2. Example II: Aggregation

In estimation or learning problems, sometimes the learner is given a set of candidate estimators or predictors, and she aims to aggregate them into a new estimate based on the observed data. In scenarios where the candidates are not explicit, aggregation procedures can still be employed based on sample splitting, where the learner splits the data into independent parts, uses the first part to construct the candidates and the second part to aggregate them.

In this section we restrict ourselves to the regression setting where there are $n$ i.i.d. observations $x_1,\cdots,x_n\sim P_X$, and

$$y_i = f(x_i )+ \xi_i, \qquad i\in [n],$$

where $\xi_i\sim \mathcal{N}(0,1)$ is the independent noise, and $f: \mathcal{X}\rightarrow {\mathbb R}$ is an unknown regression function with $\|f\|_\infty \le 1$. There is also a set of candidates $\mathcal{F}=\{f_1,\cdots,f_M\}$, and the target of aggregation is to find some $\hat{f}_n$ (not necessarily in $\mathcal{F}$) to minimize

$$\mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in \Theta} \|f_\lambda - f\|_{L_2(P_X)}^2\right),$$

where the expectation is taken over the random observations $(x_1,y_1),\cdots,(x_n,y_n)$, $f_\lambda(x) := \sum_{i=1}^M \lambda_i f_i(x)$ for any $\lambda \in {\mathbb R}^M$, and $\Theta\subseteq {\mathbb R}^M$ is a suitable subset corresponding to different types of aggregation. For a fixed data distribution $P_X$, the minimax rate of aggregation is defined as the minimum worst-case excess risk over all bounded functions $f$ and candidate functions $\{f_1,\cdots,f_M\}$:

$$R_{n,M}^{\Theta} := \sup_{f_1,\cdots,f_M} \inf_{\hat{f}_n} \sup_{\|f\|_\infty\le 1} \mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in \Theta} \|f_\lambda - f\|_{L_2(P_X)}^2\right).$$

Some special cases are in order:

1. When $\Theta = {\mathbb R}^M$, the estimate $\hat{f}_n$ is compared with the best linear aggregate; this is called linear aggregation, with optimal rate denoted by $R^{\text{L}}$.
2. When $\Theta = \Lambda^M = \{\lambda\in {\mathbb R}_+^M: \sum_{i=1}^M \lambda_i\le 1 \}$, the estimate $\hat{f}_n$ is compared with the best convex combination of the candidates (and zero); this is called convex aggregation, with optimal rate denoted by $R^{\text{C}}$.
3. When $\Theta = \{e_1,\cdots,e_M\}$, the set of canonical vectors, the estimate $\hat{f}_n$ is compared with the best single candidate in $\mathcal{F}$; this is called model selection aggregation, with optimal rate denoted by $R^{\text{MS}}$.

The main result in this section is summarized in the following theorem.

Theorem 1 If there is a cube $\mathcal{X}_0\subseteq \mathcal{X}$ such that $P_X$ admits a density bounded away from zero w.r.t. the Lebesgue measure on $\mathcal{X}_0$, then

$$R_{n,M}^\Theta = \begin{cases} \Omega(1\wedge \frac{M}{n}) &\text{if } \Theta = {\mathbb R}^M, \\ \Omega(1\wedge (\frac{M}{n} + \sqrt{\frac{1}{n}\log(\frac{M}{\sqrt{n}}+1)}) ) & \text{if } \Theta = \Lambda^M, \\ \Omega(1\wedge \frac{\log M}{n}) & \text{if } \Theta = \{e_1,\cdots,e_M\}. \end{cases}$$

We remark that the rates in Theorem 1 are all tight. In the upcoming subsections we will show that although the loss function of aggregation becomes more complicated, the idea of multiple hypothesis testing can still lead to tight lower bounds.

2.1. Linear aggregation

Since the oracle term $\inf_{\lambda \in {\mathbb R}^M} \|f_\lambda - f\|_{L_2(P_X)}^2$ is hard to deal with, a natural idea is to consider a well-specified model in which this term is zero. Since $P_X$ admits a density, we may find a partition $\mathcal{X} = \cup_{i=1}^M \mathcal{X}_i$ such that $P_X(\mathcal{X}_i) = M^{-1}$ for all $i$. Consider the candidate functions $f_i(x) = 1(x\in \mathcal{X}_i)$, and for $v\in \{0, 1\}^M$, let

$$f_v(x) := \gamma\sum_{i=1}^M v_if_i(x),$$

where $\gamma\in (0,1)$ is to be specified.

To apply Assouad’s lemma, note that for the loss function

$$L(f,\hat{f}) := \|\hat{f} - f\|_{L_2(P_X)}^2 - \inf_{\lambda\in {\mathbb R}^M} \|f_\lambda - f\|_{L_2(P_X)}^2,$$

the orthogonality of the candidates $(f_i)_{i\in [M]}$ implies that the separability condition holds for $\Delta = \gamma^2/2M$. Moreover,

$$\max_{d_{\text{H}}(v,v')=1 } D_{\text{KL}}(P_{f_v}^{\otimes n} \| P_{f_{v'}}^{\otimes n}) = \max_{d_{\text{H}}(v,v')=1 } \frac{n\|f_v - f_{v'}\|_{L_2(P_X)}^2}{2} = O\left(\frac{n\gamma^2}{M}\right).$$

Therefore, Assouad’s lemma (combined with Pinsker’s inequality) gives

$$R_{n,M}^{\text{L}} \ge \frac{\gamma^2}{4}\left(1 - O\left( \sqrt{\frac{n\gamma^2}{M}}\right)\right).$$

Choosing $\gamma = 1\wedge \sqrt{M/n}$ completes the proof.
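
Although this section only proves lower bounds, the $M/n$ rate is easy to see to be achievable in this construction. The simulation below (an illustration, not from the notes; it assumes equiprobable cells sampled uniformly, unit Gaussian noise, and the within-cell averaging estimator) exhibits the $M/n$ excess risk:

```python
import random
import statistics

# Well-specified model f = sum_i c_i 1(x in X_i): estimating each c_i by the
# within-cell average of the y's yields ||f_hat - f||_{L2(P_X)}^2 ~ M/n.
def linear_aggregation_risk(n, M, trials=50, seed=0):
    rng = random.Random(seed)
    risks = []
    for _ in range(trials):
        c = [rng.choice((0.0, 0.5)) for _ in range(M)]    # true coefficients
        sums, counts = [0.0] * M, [0] * M
        for _ in range(n):
            i = rng.randrange(M)                          # cell of x ~ P_X
            sums[i] += c[i] + rng.gauss(0, 1)             # y = f(x) + noise
            counts[i] += 1
        c_hat = [s / m if m else 0.0 for s, m in zip(sums, counts)]
        # ||f_hat - f||_{L2(P_X)}^2 = sum_i (c_hat_i - c_i)^2 * P_X(X_i)
        risks.append(sum((a - b) ** 2 for a, b in zip(c_hat, c)) / M)
    return statistics.mean(risks)
```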

2.2. Convex aggregation

Convex aggregation differs from linear aggregation only in that the linear coefficients must be non-negative and sum to at most one. Since the only requirement in the previous argument is the orthogonality of $(f_i)_{i\in [M]}$ under $L_2(P_X)$, we may choose any orthonormal functions $(f_i)_{i\in [M]}$ under $L_2(dx)$ with $\max_i\|f_i\|_\infty = O(1)$ (existence follows from the cube assumption) and use the density lower bound assumption to conclude that $\|f_i\|_{L_2(P_X)} = \Omega(\|f_i\|_2)$ (to ensure the desired separation). Then the choice of $\gamma$ above becomes $O(n^{-1/2})$, and the previous arguments still apply to convex aggregation when $M = O(\sqrt{n})$. Hence, it remains to prove that when $M = \Omega(\sqrt{n})$,

$$R_{n,M}^{\text{C}} = \Omega\left(1\wedge \sqrt{\frac{1}{n}\log\left(\frac{M}{\sqrt{n}}+1\right) }\right).$$

Again we consider the well-specified case where

$$f_v(x) := \gamma\sum_{i=1}^M v_if_i(x),$$

with the above orthonormal functions $(f_i)$, a constant scaling factor $\gamma = O(1)$, and $v$ uniformly distributed over all vectors with exactly $m$ coordinates equal to $1/m$ and the rest zero ($m\in \mathbb{N}$ is to be specified). Since the vector $v$ is no longer a vertex of a hypercube, we apply the generalized Fano’s inequality instead of Assouad’s lemma. First, applying Lemma 4 in Lecture 8 gives

$$I(V; X) \le \mathop{\mathbb E}_V D_{\text{KL}}(P_{f_V}^{\otimes n} \| P_{f_0}^{\otimes n}) = O\left(\frac{n}{m}\right).$$

Second, as long as $m\le M/3$, using similar arguments as in the sparse linear regression example in Lecture 8, for $\Delta = c/m$ with a small constant $c>0$ we have

$$p_\Delta := \sup_{f} \mathop{\mathbb P}\left(\|f_V - f\|_{L_2(P_X)}^2 \le \Delta \right) \le \left(\frac{M}{m}+1\right)^{-c'm},$$

for some constant $c'>0$. Therefore, the generalized Fano’s inequality gives

$$R_{n,M}^{\text{C}} \ge \frac{c}{m}\left(1 - \frac{O(n/m) + \log 2}{c'm\log\left(\frac{M}{m}+1\right)} \right),$$

and choosing $m \asymp \sqrt{n/\log(M/\sqrt{n}+1)} = O(M)$ completes the proof.
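
The admissibility of this choice of $m$ is easy to verify numerically. The sketch below (an illustration, not from the notes; the constant $10$ in the information condition is an assumption) checks both conditions of the generalized Fano argument in the regime $M = \Omega(\sqrt{n})$:

```python
import math

# m ~ sqrt(n / log(M / sqrt(n) + 1)): check the information condition
# n/m = O(m log(M/m + 1)) and the sparsity condition m <= M/3, so that
# Delta = c/m matches the claimed rate sqrt(log(M/sqrt(n)+1)/n).
def m_choice(n, M):
    return max(1, round(math.sqrt(n / math.log(M / math.sqrt(n) + 1))))

for n, M in [(10**4, 10**3), (10**4, 10**5), (10**6, 10**5)]:
    m = m_choice(n, M)
    assert n / m <= 10 * m * math.log(M / m + 1)   # information condition
    assert m <= M / 3                              # sparsity condition
```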

2.3. Model selection aggregation

As before, we consider the well-specified case: select orthonormal functions $(\phi_i)_{i\in [M]}$ in $L_2(dx)$, and let $f_i = \gamma \phi_i$ and $f\in \{f_1,\cdots,f_M\}$. The orthonormality of $(\phi_i)$ gives $\Delta = \Theta(\gamma^2)$ for the separation condition of the original Fano’s inequality, and

$$\max_{i\neq j} D_{\text{KL}}(P_{f_i}^{\otimes n}\| P_{f_j}^{\otimes n}) = \max_{i\neq j} \frac{n\|f_i - f_j\|_{L_2(P_X)}^2}{2} = O\left(n\gamma^2\right).$$

Hence, Fano’s inequality gives

$$R_{n,M}^{\text{MS}} = \Omega\left(\gamma^2 \left(1 -\frac{n\gamma^2 + \log 2}{\log M}\right)\right),$$

and choosing $\gamma^2 \asymp 1\wedge n^{-1}\log M$ completes the proof.

We may further show that no model selector can attain the above optimal rate of model selection aggregation. Hence, even to compare with the best candidate function in $\mathcal{F}$, the optimal aggregate $\hat{f}$ should not be restricted to $\mathcal{F}$. Specifically, we have the following result, whose proof is left as an exercise.

Exercise 1 Under the same assumptions of Theorem 1, show that

$$\sup_{f_1,\cdots,f_M} \inf_{\hat{f}_n\in\{f_1,\cdots,f_M\} } \sup_{\|f\|_\infty\le 1} \mathop{\mathbb E}_f \left(\|\hat{f}_n - f\|_{L_2(P_X)}^2 - \min_{i\in [M]} \|f_i - f\|_{L_2(P_X)}^2\right) = \Omega\left(1\wedge \sqrt{\frac{\log M}{n}}\right).$$

3. Example III: Learning Theory

Consider the classification problem where there are $n$ training data $(x_1,y_1), \cdots, (x_n, y_n)$ i.i.d. drawn from an unknown distribution $P_{XY}$, with $y_i \in \{\pm 1\}$. There is a given collection of classifiers $\mathcal{F}$ consisting of functions $f: \mathcal{X}\rightarrow \{\pm 1\}$, and given the training data, the target is to find some classifier $\hat{f}\in \mathcal{F}$ with a small excess risk

$$R_{\mathcal{F}}(P_{XY}, \hat{f}) = \mathop{\mathbb E}\left[P_{XY}(Y \neq \hat{f}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))\right],$$

which is the difference between the performance of the chosen classifier and that of the best classifier in the function class $\mathcal{F}$. In the definition of the excess risk, the expectation is taken with respect to the randomness in the training data. The main focus of this section is to characterize the minimax excess risk of a given function class $\mathcal{F}$, i.e.,

$$R_{\text{pes}}^\star(\mathcal{F}) := \inf_{\hat{f}}\sup_{P_{XY}} R_{\mathcal{F}}(P_{XY}, \hat{f}).$$

The subscript “pes” here stands for “pessimistic”, where $P_{XY}$ can be any distribution over $\mathcal{X} \times \{\pm 1\}$ and there may not be a good classifier in $\mathcal{F}$, i.e., $\inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))$ may be large. We also consider the optimistic scenario where there exists a perfect (error-free) classifier in $\mathcal{F}$. Mathematically, denoting by $\mathcal{P}_{\text{opt}}(\mathcal{F})$ the collection of all probability distributions $P_{XY}$ on $\mathcal{X}\times \{\pm 1\}$ such that $\inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))=0$, the minimax excess risk of a given function class $\mathcal{F}$ in the optimistic case is defined as

$$R_{\text{opt}}^\star(\mathcal{F}) := \inf_{\hat{f}}\sup_{P_{XY} \in \mathcal{P}_{\text{opt}}(\mathcal{F})} R_{\mathcal{F}}(P_{XY}, \hat{f}).$$

The central claim of this section is the following:

Theorem 2 Let the VC dimension of $\mathcal{F}$ be $d$. Then

$$R_{\text{\rm opt}}^\star(\mathcal{F}) = \Omega\left(1\wedge \frac{d}{n}\right), \qquad R_{\text{\rm pes}}^\star(\mathcal{F}) = \Omega\left(1\wedge \sqrt{\frac{d}{n}}\right).$$

Recall that the definition of VC dimension is as follows:

Definition 3 For a given function class $\mathcal{F}$ consisting of mappings from $\mathcal{X}$ to $\{\pm 1\}$, the VC dimension of $\mathcal{F}$ is the largest integer $d$ such that some set of $d$ points from $\mathcal{X}$ can be shattered by $\mathcal{F}$. Mathematically, it is the largest $d>0$ such that there exist $x_1,\cdots,x_d\in \mathcal{X}$ with the property that for every $v\in \{\pm 1\}^d$, there exists a function $f_v\in \mathcal{F}$ such that $f_v(x_i) = v_i$ for all $i\in [d]$.
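
Definition 3 can be made concrete with a small brute-force computation (an illustration, not part of the notes) for a finite class, represented as a list of $\pm 1$ label vectors on a finite domain:

```python
from itertools import combinations, product

# Brute-force VC dimension: find the largest d such that some d points are
# shattered, i.e., all 2^d sign patterns on them are realized by the class.
def vc_dimension(labelings):
    n_points = len(labelings[0])
    best = 0
    for d in range(1, n_points + 1):
        shattered = False
        for pts in combinations(range(n_points), d):
            patterns = {tuple(f[i] for i in pts) for f in labelings}
            if len(patterns) == 2 ** d:
                shattered = True
                break
        if not shattered:
            return best
        best = d
    return best

# Threshold classifiers f_t(i) = +1 iff i >= t on 4 points: VC dimension 1,
# since no pair i < j can realize the pattern (+1 at i, -1 at j).
thresholds = [tuple(1 if i >= t else -1 for i in range(4)) for t in range(5)]
```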

VC dimension plays a significant role in statistical learning theory. For example, it is well-known that for the empirical risk minimization (ERM) classifier

$$\hat{f}^{\text{ERM}} = \arg\min_{f\in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n 1(y_i \neq f(x_i)),$$

we have

$$\mathop{\mathbb E}\left[P_{XY}(Y \neq \hat{f}^{\text{ERM}}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X))\right] \le 1\wedge \begin{cases} O(d/n) & \text{in optimistic case}, \\ O(\sqrt{d/n}) & \text{in pessimistic case}. \end{cases}$$

Hence, Theorem 2 shows that the ERM classifier attains the minimax excess risk for all function classes, and the VC dimension exactly characterizes the difficulty of the learning problem.
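
The $d/n$ behavior in the optimistic case can be seen in a small simulation (an illustrative sketch; it uses $d-1$ rare points of mass $1/n$ each, noiseless labels, and random tie-breaking on points the ERM never observes — all assumptions of this sketch):

```python
import random

# Optimistic case: d - 1 rare points each with mass p = 1/n, labels noiseless.
# An ERM over all sign patterns agrees with f* on every observed point and
# guesses on unobserved ones, so each unseen rare point contributes p/2 to the
# expected excess risk; the total is ~ (d-1)/(2en), i.e., Theta(d/n).
def erm_excess_risk(d, n, trials=2000, seed=1):
    rng = random.Random(seed)
    p = 1.0 / n
    total = 0.0
    for _ in range(trials):
        seen = set()
        for _ in range(n):
            if rng.random() < p * (d - 1):       # one of the rare points drawn
                seen.add(rng.randrange(d - 1))
        # expected excess risk given the unseen set: p/2 per unseen rare point
        total += sum(p * 0.5 for i in range(d - 1) if i not in seen)
    return total / trials
```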

3.1. Optimistic case

We first apply Assouad’s lemma to the optimistic scenario. By the definition of the VC dimension, there exist points $x_1,\cdots,x_d\in \mathcal{X}$ and functions $f_v\in \mathcal{F}$ such that $f_v(x_i) = v_i$ for all $v\in \{\pm 1\}^d$ and $i\in [d]$. Consider $2^{d-1}$ hypotheses indexed by $u\in \{\pm 1\}^{d-1}$: the distribution $P_X$ is always

$$P_X(\{x_i\}) = p > 0, \quad \forall i\in [d-1], \qquad P_X(\{x_d\}) = 1-p(d-1),$$

where $p\in (0,(d-1)^{-1})$ is to be specified later. For the conditional distribution $P_{Y|X}$, let $Y = f_{(u,1)}(X)$ hold almost surely under the joint distribution $P_u$. Clearly this is the optimistic case, for there always exists a perfect classifier in $\mathcal{F}$.

We first examine the separation condition in Assouad’s lemma, where the loss function here is

$$L(P_{XY}, \hat{f}) := P_{XY}(Y \neq \hat{f}(X)) - \inf_{f^\star \in \mathcal{F}} P_{XY}(Y \neq f^\star(X)).$$

For any $u,u'\in \{\pm 1\}^{d-1}$ and any $\hat{f}\in \mathcal{F}$, we have

$$\begin{array}{rcl} L(P_u, \hat{f}) + L(P_{u'}, \hat{f}) &=& p\cdot \sum_{i=1}^{d-1} \left[1(\hat{f}(x_i) \neq f_{(u,1)}(x_i)) + 1(\hat{f}(x_i) \neq f_{(u',1)}(x_i))\right] \\ &=& p\cdot \sum_{i=1}^{d-1} \left[1(\hat{f}(x_i) \neq u_i) + 1(\hat{f}(x_i) \neq u_i')\right] \\ &\ge & p\cdot d_{\text{H}}(u,u'), \end{array}$$

and therefore the separation condition holds for $\Delta = p$. Moreover, if $u$ and $u'$ only differ in the $i$-th component, then $P_u$ and $P_{u'}$ are completely indistinguishable unless $x_i$ appears in the training data. Hence, by a coupling argument,

$$\max_{d_{\text{H}}(u,u')=1 } \|P_u^{\otimes n} - P_{u'}^{\otimes n}\|_{\text{\rm TV}} \le 1 - \left(1 - p\right)^n.$$

Therefore, Assouad’s lemma gives

$$R_{\text{opt}}^\star(\mathcal{F}) \ge \frac{(d-1)p}{2}\left(1-p\right)^n,$$

and choosing $p = \min\{n^{-1},(d-1)^{-1}\}$ yields the desired lower bound: then $(1-p)^n \ge (1-1/n)^n = \Omega(1)$ and $(d-1)p = \Omega(1\wedge d/n)$.

3.2. Pessimistic case

The analysis of the pessimistic case only differs in the construction of the hypotheses. As before, fix $x_1,\cdots,x_d\in \mathcal{X}$ and $f_v\in \mathcal{F}$ such that $f_v(x_i) = v_i$ for all $v\in \{\pm 1\}^d$ and $i\in [d]$. Now let $P_X$ be the uniform distribution on $\{x_1,\cdots,x_d\}$, and under $P_v$, the conditional probability $P_{Y|X}$ is

$$P_v(Y= v_i | X=x_i) = \frac{1}{2} + \delta, \qquad \forall i\in [d],$$

and $\delta\in (0,1/2)$ is to be specified later. In other words, under $P_v$ the classifier $f_v$ outperforms random guessing only by a margin of $\delta$.

Again we apply Assouad’s lemma to lower bound $R_{\text{pes}}^\star(\mathcal{F})$. First note that for all $v\in \{\pm 1\}^d$,

$$\min_{f^\star \in \mathcal{F}} P_v(Y \neq f^\star(X)) = \frac{1}{2} - \delta.$$

Hence, for any $v\in \{\pm 1\}^d$ and any $f\in \mathcal{F}$,

$$\begin{array}{rcl} L(P_v, f) &=& \frac{1}{d}\sum_{i=1}^d \left[\frac{1}{2} - \left(2\cdot 1(f(x_i) = v_i) - 1\right)\delta \right] - \left(\frac{1}{2}-\delta\right) \\ &=& \frac{2\delta}{d}\sum_{i=1}^d 1(f(x_i) \neq v_i). \end{array}$$

By the triangle inequality, the separation condition is fulfilled with $\Delta = 2\delta/d$. Moreover, since only the two atoms $(x_i,\pm 1)$ change when $v_i$ is flipped, direct computation yields

$$\max_{d_{\text{H}}(v,v')=1 } H^2(P_v, P_{v'}) = \frac{2}{d}\left(\sqrt{\frac{1}{2}+\delta} - \sqrt{\frac{1}{2}-\delta}\right)^2 = O\left(\frac{\delta^2}{d}\right),$$

and therefore tensorization gives

$$\max_{d_{\text{H}}(v,v')=1 } H^2(P_v^{\otimes n}, P_{v'}^{\otimes n}) = 2 - 2\left(1 - O\left(\frac{\delta^2}{d}\right) \right)^n.$$

Finally, choosing $\delta = (1\wedge \sqrt{d/n})/2$ keeps $H^2(P_v^{\otimes n}, P_{v'}^{\otimes n})$ bounded away from $2$, so Assouad’s lemma yields $R_{\text{pes}}^\star(\mathcal{F}) = \Omega(d\cdot \delta/d) = \Omega(1\wedge \sqrt{d/n})$, as desired.
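
The Hellinger computation above is easy to verify numerically (an illustrative check, not from the notes): only the two atoms $(x_i, \pm 1)$ change when $v_i$ is flipped, and the resulting $H^2$ is $\Theta(\delta^2/d)$, with constant close to $4$ for small $\delta$:

```python
import math

# H^2 between P_v and P_{v'} differing in one coordinate: the atoms (x_i, +1)
# and (x_i, -1) have their masses (1/2 +- delta)/d swapped, all others agree.
def hellinger_sq_one_flip(d, delta):
    hi, lo = (0.5 + delta) / d, (0.5 - delta) / d
    return 2 * (math.sqrt(hi) - math.sqrt(lo)) ** 2

for d in (5, 50):
    for delta in (0.01, 0.1):
        ratio = hellinger_sq_one_flip(d, delta) / (delta ** 2 / d)
        assert 3.5 < ratio < 4.5        # H^2 = Theta(delta^2 / d)
```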

3.3. General case

We may interpolate between the pessimistic and optimistic cases as follows: for any given $\varepsilon\in (0,1/2]$, we restrict to the set $\mathcal{P}(\mathcal{F},\varepsilon)$ of joint distributions with

$$\sup_{P_{XY}\in \mathcal{P}(\mathcal{F},\varepsilon) }\inf_{f^\star\in \mathcal{F}} P_{XY}(Y\neq f^\star(X))\le \varepsilon.$$

Then we may define the minimax excess risk over $\mathcal{P}(\mathcal{F},\varepsilon)$ as

$$R^\star(\mathcal{F},\varepsilon) := \inf_{\hat{f}}\sup_{P_{XY} \in \mathcal{P}(\mathcal{F},\varepsilon)} R_{\mathcal{F}}(P_{XY}, \hat{f}).$$

Clearly the optimistic case corresponds to $\varepsilon =0$, and the pessimistic case corresponds to $\varepsilon = 1/2$. Similar to the above arguments, we have the following lower bound on $R^\star(\mathcal{F},\varepsilon)$, whose proof is left as an exercise.

Exercise 2 Show that when the VC dimension of $\mathcal{F}$ is $d$, then

$$R^\star(\mathcal{F},\varepsilon) = \Omega\left(1\wedge \left(\frac{d}{n} + \sqrt{\frac{d}{n}\cdot \varepsilon}\right) \right).$$

4. Example IV: Stochastic Optimization

Consider the following oracle formulation of convex optimization: let $f: \Theta\rightarrow {\mathbb R}$ be the convex objective function, and we aim to find the minimum value $\min_{\theta\in\Theta} f(\theta)$. To do so, the learner can query some first-order oracle adaptively: given the query $\theta_t\in \Theta$, the oracle outputs a pair $(f(\theta_t), g(\theta_t))$ consisting of the objective value at $\theta_t$ (zeroth-order information) and a sub-gradient of $f$ at $\theta_t$ (first-order information). The queries can be made in an adaptive manner, where $\theta_t$ can depend on all previous outputs of the oracle. The target of convex optimization is to determine the minimax optimization gap after $T$ queries, defined as

$$R_T^\star(\mathcal{F}, \Theta) = \inf_{\theta_{T+1}}\sup_{f\in \mathcal{F}} \left( f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta) \right),$$

where $\mathcal{F}$ is a given class of convex functions, and the final query $\theta_{T+1}$ can only depend on the past outputs of the oracle.

Since there is no randomness involved in the above problem, the idea of multiple hypothesis testing cannot be directly applied here. Therefore, in this section we consider a simpler problem of stochastic optimization, and we postpone the general case to later lectures. Specifically, suppose that the above first-order oracle $\mathcal{O}$ is stochastic, i.e., it only outputs $(\widehat{f}(\theta_t), \widehat{g}(\theta_t))$ such that $\mathop{\mathbb E}[\widehat{f}(\theta_t)] = f(\theta_t)$, and $\mathop{\mathbb E}[\widehat{g}(\theta_t)] \in \partial f(\theta_t)$. Let $\mathcal{F}$ be the set of all convex $L$-Lipschitz functions (in $L_2$ norm), $\Theta = \{\theta\in {\mathbb R}^d: \|\theta\|_2\le R \}$, and assume that the subgradient estimate returned by the oracle $\mathcal{O}$ always satisfies $\|\widehat{g}(\theta_t)\|_2\le L$ as well. Now the target is to determine the quantity

$$R_{T,d,L,R}^\star = \inf_{\theta_{T+1}} \sup_{f \in \mathcal{F}} \sup_{\mathcal{O}} \left(\mathop{\mathbb E} f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta)\right),$$

where the expectation is taken over the randomness in the oracle output. The main result in this section is summarized in the following theorem:

Theorem 4

$$R_{T,d,L,R}^\star = \Omega\left(\frac{LR}{\sqrt{T}}\right).$$

Since it is well-known that the stochastic gradient descent attains the optimization gap $O(LR/\sqrt{T})$ for convex Lipschitz functions, the lower bound in Theorem 4 is tight.
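
To illustrate the matching upper bound, here is a minimal projected stochastic subgradient descent run on a one-dimensional instance (a sketch with assumed constants, not the notes' construction): $F(\theta;X)=|\theta-X|$ with $X=-1/2$ w.p. $0.9$ and $X=+1/2$ w.p. $0.1$, so $f(\theta)=\mathop{\mathbb E}|\theta-X|$ is convex and $1$-Lipschitz on $[-R,R]$ with minimum value $0.1$ at $\theta=-1/2$.

```python
import math
import random

# Projected SGD with step size R/(L sqrt(T)); the averaged iterate enjoys the
# standard O(LR/sqrt(T)) optimization-gap guarantee (here L = 1).
def sgd_gap(T, R=1.0, seed=0):
    rng = random.Random(seed)
    eta = R / math.sqrt(T)
    theta, avg = 0.0, 0.0
    for _ in range(T):
        x = -0.5 if rng.random() < 0.9 else 0.5
        g = 1.0 if theta > x else -1.0            # subgradient of |theta - x|
        theta = min(R, max(-R, theta - eta * g))  # projected SGD step
        avg += theta / T                          # running average iterate
    f = lambda t: 0.9 * abs(t + 0.5) + 0.1 * abs(t - 0.5)
    return f(avg) - 0.1                           # optimization gap, >= 0
```

In a typical run the gap decays at roughly the $1/\sqrt{T}$ rate predicted by the theory.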

Now we apply the multiple hypothesis testing idea to prove Theorem 4. Since the randomness only comes from the oracle, we should design the stochastic oracle carefully to reduce the optimization problem to testing. A natural way is to choose some function $F(\theta;X)$ such that $F(\cdot;X)$ is convex and $L$-Lipschitz for all $X$, and $f(\theta) = \mathop{\mathbb E}_P[F(\theta;X)]$. In this way, the oracle can simply generate i.i.d. $X_1,\cdots,X_T\sim P$ and reveal the random vector $(F(\theta_t;X_t), \nabla F(\theta_t;X_t))$ to the learner at the $t$-th query. Hence, the proof boils down to finding a collection of probability distributions $(P_v)_{v\in \mathcal{V}}$ that are well separated yet hard to distinguish based on $T$ observations.

We first look at the separation condition. Since the loss function of this problem is $L(f,\theta_{T+1}) = f(\theta_{T+1}) - \min_{\theta\in\Theta} f(\theta)$, we see that

$$\inf_{\theta_{T+1}\in \Theta} \left[L(f,\theta_{T+1}) + L(g,\theta_{T+1})\right] = \min_{\theta\in\Theta} (f(\theta) + g(\theta)) - \min_{\theta\in\Theta} f(\theta) - \min_{\theta\in\Theta} g(\theta),$$

which is known as the optimization distance $d_{\text{opt}}(f,g)$. Now consider

$$F(\theta;X) = L\left|\theta_i - \frac{Rx_i}{\sqrt{d}}\right| \qquad \text{if }x = \pm e_i,$$

and for $v\in \{\pm 1\}^d$, let $f_v(\theta) = \mathop{\mathbb E}_{P_v}[F(\theta;X)]$ with $P_v$ defined as

$$P_v(X = v_ie_i) = \frac{1+\delta}{2d}, \quad P_v(X= -v_ie_i) = \frac{1-\delta}{2d}, \qquad \forall i\in [d].$$

Clearly $F(\cdot;X)$ is convex and $L$-Lipschitz, and for $\|\theta\|_\infty \le R/\sqrt{d}$ we have

$$f_v(\theta) = \frac{LR}{\sqrt{d}} - \frac{\delta L}{d}\sum_{i=1}^d v_i\theta_i.$$

Hence, it is straightforward to verify that $\min_{\|\theta\|_2\le R} f_v(\theta) = LR(1-\delta)/\sqrt{d}$, and

$$d_{\text{opt}}(f_v,f_{v'}) = \frac{2\delta LR}{d^{3/2}}\cdot d_{\text{H}}(v,v').$$

In other words, the separation condition in Assouad’s lemma is fulfilled with $\Delta = 2\delta LR/d^{3/2}$. Moreover, since flipping one coordinate of $v$ only swaps the probabilities that $P_v$ assigns to $\pm e_i$, simple algebra gives

$$\max_{d_{\text{H}}(v,v') =1} D_{\text{KL}}(P_v^{\otimes T}\| P_{v'}^{\otimes T}) = O\left(\frac{T\delta^2}{d}\right),$$

and therefore Assouad’s lemma gives

$$R_{T,d,L,R}^\star \ge d\cdot \frac{\delta LR}{d^{3/2}}\left(1 - O\left(\delta\sqrt{\frac{T}{d}}\right)\right),$$

and choosing $\delta \asymp \sqrt{d/T}\wedge 1$ completes the proof of Theorem 4. In fact, the above arguments also show that if the $L_2$ norm is replaced by the $L_\infty$ norm, then

$$R_{T,d,L,R}^\star = \Theta\left(LR\sqrt{\frac{d}{T}}\right).$$
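
Since $f_v + f_{v'}$ is separable across coordinates, the optimization distance of this construction can be verified numerically (an illustrative check, not from the notes). For a box $\{\theta: \|\theta\|_\infty \le a\}$ of half-width $a$, one coordinate at a time, one finds $d_{\text{opt}}(f_v, f_{v'}) = 2\delta L a\, d_{\text{H}}(v,v')/d$; the $L_2$ and $L_\infty$ settings correspond to different choices of $a$.

```python
# Exact (grid-based) evaluation of d_opt for the piecewise-linear construction:
# f_v(theta) = sum_i [ (1+delta)/(2d) L |theta_i - v_i a| +
#                      (1-delta)/(2d) L |theta_i + v_i a| ].
def d_opt(d, delta, L, a, flips):
    grid = [a * (i / 500.0 - 1.0) for i in range(1001)]     # grid on [-a, a]
    def coord_min(weights):     # min over theta of sum_j w_j * |theta - c_j|
        return min(sum(w * abs(t - c) for w, c in weights) for t in grid)
    w_plus = [(L * (1 + delta) / (2 * d), a), (L * (1 - delta) / (2 * d), -a)]
    w_minus = [(L * (1 + delta) / (2 * d), -a), (L * (1 - delta) / (2 * d), a)]
    min_f = d * coord_min(w_plus)               # = min of a single f_v
    min_sum = ((d - flips) * coord_min(w_plus + w_plus)     # agreeing coords
               + flips * coord_min(w_plus + w_minus))       # flipped coords
    return min_sum - 2 * min_f

# Matches 2 * delta * L * a * flips / d (minima sit at +-a or 0, on the grid).
assert abs(d_opt(4, 0.2, 1.0, 1.0, 1) - 2 * 0.2 * 1.0 * 1.0 * 1 / 4) < 1e-9
```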

5. Bibliographic Notes

The example of nonparametric density estimation over Sobolev balls is taken from Nemirovski (2000), which also contains the tight minimax linear risk among all linear estimators. For the upper bound, linear procedures fail to achieve the minimax rate whenever $q>p$, and some non-linearities are necessary for the minimax estimators either in the function domain (Lepski, Mammen and Spokoiny (1997)) or the wavelet domain (Donoho et al. (1996)).

The linear and convex aggregations are proposed in Nemirovski (2000) for the adaptive nonparametric thresholding estimators, and the concept of model selection aggregation is due to Yang (2000). For the optimal rates of different aggregations (together with upper bounds), we refer to Tsybakov (2003) and Leung and Barron (2006).

The examples from statistical learning theory and stochastic optimization are similar in nature. The results of Theorem 2 and the corresponding upper bounds are taken from Vapnik (1998), albeit with a different proof language. For general results on the oracle complexity of convex optimization, we refer to the wonderful book Nemirovski and Yudin (1983) and the lecture notes Nemirovski (1995). The current proof of Theorem 4 is due to Agarwal et al. (2009).

1. Arkadi Nemirovski, Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour 28 (2000): 85.
2. Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny, Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics 25.3 (1997): 929–947.
3. David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard, Density estimation by wavelet thresholding. The Annals of Statistics (1996): 508–539.
4. Yuhong Yang, Combining different procedures for adaptive regression. Journal of Multivariate Analysis 74.1 (2000): 135–161.
5. Alexandre B. Tsybakov, Optimal rates of aggregation. Learning theory and kernel machines. Springer, Berlin, Heidelberg, 2003. 303–313.
6. Gilbert Leung, and Andrew R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52.8 (2006): 3396–3410.
7. Vladimir Vapnik, Statistical learning theory. Wiley, New York (1998): 156–160.
8. Arkadi Nemirovski, and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
9. Arkadi Nemirovski, Information-based complexity of convex programming. Lecture Notes, 1995.
10. Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar, Information-theoretic lower bounds on the oracle complexity of convex optimization. Advances in Neural Information Processing Systems, 2009.

# Lecture 8: Multiple Hypothesis Testing: Tree, Fano and Assouad


In the last three lectures we systematically introduced Le Cam’s two-point method, with generalizations and various examples. The main idea of the two-point method is to reduce the problem at hand to a binary hypothesis testing problem, where the hypotheses may be either simple or composite. This approach works when the target is to estimate a scalar (even when the underlying statistical model may involve high-dimensional parameters), while it typically fails when the target is to recover a high-dimensional vector of parameters. For example, even in the simplest Gaussian location model $X\sim \mathcal{N}(\theta, I_p)$ where the target is to estimate $\theta$ under the mean squared loss, the best two-point approach gives a lower bound $R_p^\star = \Omega(1)$, while we have known from Lecture 4 that the minimax risk is $R_p^\star = p$. To overcome this difficulty, recall from the minimax theorem that a suitable prior should always work, which by discretization motivates the idea of multiple hypothesis testing.

In this lecture, we develop the general theory of reducing estimation problems to multiple hypothesis testing, and present several tools including tree-based methods, Assouad’s lemma and Fano’s inequality. In the next two lectures we will enumerate more examples (possibly beyond minimax estimation in statistics) and variations of the above tools.

1. General Tools of Multiple Hypothesis Testing

This section presents the general theory of applying multiple hypothesis testing to estimation problems, as well as some important tools such as the tree-based method, Fano’s inequality and Assouad’s lemma. Recall from the two-point method that the separability and indistinguishability conditions are of utmost importance for testing arguments; the tools below require different separability conditions and represent the indistinguishability condition in terms of different divergences.

1.1. Tree-based method and Fano’s inequality

We start with possibly the simplest separability condition, i.e., that the chosen parameters (hypotheses) $\theta_1, \cdots, \theta_m\in \Theta$ are pairwise separated. The following lemma shows that the minimax estimation error is lower bounded in terms of the average test error.

Lemma 1 (From estimation to testing) Let $L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+$ be any loss function, and suppose that there exist $\theta_1,\cdots,\theta_m\in \Theta$ such that

$$L(\theta_i, a) + L(\theta_j, a) \ge \Delta, \qquad \forall i\neq j\in [m], a \in \mathcal{A}.$$

Then

$$\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i),$$

where the infimum is taken over all measurable tests $\Psi: \mathcal{X} \rightarrow [m]$.

Proof: For any given estimator $\hat{\theta}$, we construct a test $\Psi$ as follows:

$$\Psi(X) \in \arg\min_{i\in [m]} L(\theta_i, \hat{\theta}(X)).$$

Then the separability condition implies that

$$L(\theta_i, \hat{\theta}(X)) \ge \frac{\Delta}{2}\cdot 1(\Psi(X) \neq i), \qquad \forall i\in [m].$$

The rest follows from the fact that the maximum is lower bounded by the average. $\Box$

Remark 1 The following weaker separability condition also suffices for Lemma 1 to hold: for any $i\neq j\in [m]$ and $a\in \mathcal{A}$, the inequality $L(\theta_i, a)\le \Delta/2$ always implies that $L(\theta_j, a)\ge \Delta/2$.

The remaining quantity in Lemma 1 is the optimal average test error $\inf_{\Psi}m^{-1}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i)$, for which we are looking for a lower bound. Recall that when $m=2$, this quantity equals $(1- \|P_{\theta_1} - P_{\theta_2}\|_{\text{TV}})/2$ by Le Cam’s first lemma. For general $m\ge 2$, we have the following two different lower bounds.

Lemma 2 (Tree-based inequality) Let $T=([m], E)$ be any undirected tree on vertex set $[m]$ with edge set $E$. Then

$$\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) \ge \frac{1}{m}\sum_{(i,j)\in E} (1 - \|P_{\theta_i} - P_{\theta_j}\|_{\text{\rm TV}}).$$

Proof: It is straightforward to see that

$$\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) = 1 - \frac{1}{m}\int \max\{dP_{\theta_1}(x), \cdots, dP_{\theta_m}(x)\}. \ \ \ \ \ (1)$$

It is also easy (left as an exercise) to establish the following elementary inequality: for any reals $x_1, \cdots, x_m$ and any tree $T=([m], E)$, we have

$$\sum_{i=1}^m x_i - \max_{i\in [m]} x_i \ge \sum_{(i,j)\in E} \min\{x_i, x_j\}.$$

Now using $\int \min\{dP, dQ \} = 1 - \|P-Q\|_{\text{TV}}$ gives the desired inequality. $\Box$
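Since the elementary inequality in the proof is left as an exercise, a quick numerical check may be reassuring. The pure-Python sketch below (an illustration added to these notes, not part of the original argument) draws random reals and random labeled trees and verifies $\sum_{i=1}^m x_i - \max_{i\in[m]} x_i \ge \sum_{(i,j)\in E}\min\{x_i, x_j\}$:

```python
import random

def tree_gap(xs, edges):
    # LHS minus RHS of: sum(x) - max(x) >= sum over tree edges of min(x_i, x_j)
    return sum(xs) - max(xs) - sum(min(xs[i], xs[j]) for i, j in edges)

def random_tree(m, rng):
    # a random labeled tree: attach each new vertex to a random earlier vertex
    return [(rng.randrange(i), i) for i in range(1, m)]

rng = random.Random(0)
for _ in range(2000):
    m = rng.randrange(2, 10)
    xs = [rng.uniform(-5.0, 5.0) for _ in range(m)]
    assert tree_gap(xs, random_tree(m, rng)) >= -1e-12
```

For a proof, orient the tree away from a vertex achieving the maximum; each remaining vertex has a unique parent edge, and $\min\{x_i,x_j\}$ is at most the value at the child endpoint.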

Lemma 3 (Fano’s inequality) Let $\bar{P} := m^{-1}\sum_{i=1}^m P_i$. Then

$$\inf_{\Psi}\frac{1}{m}\sum_{i=1}^m \mathop{\mathbb P}_{\theta_i}(\Psi\neq i) \ge 1 - \frac{1}{\log m}\left(\frac{1}{m}\sum_{i=1}^m D_{\text{\rm KL}}(P_i \| \bar{P}) +\log 2\right).$$

Remark 2 If we introduce auxiliary random variables $V\sim \mathsf{Uniform}([m])$ and $X$ with $P_{X|V=i} := P_{\theta_i}$, then

$$\frac{1}{m}\sum_{i=1}^m D_{\text{\rm KL}}(P_i \| \bar{P}) = I(V;X),$$

where $I(V;X)$ denotes the mutual information between $V$ and $X$.

Proof: We present two proofs of Lemma 3. The first proof builds upon the representation (1) and is more analytical, while the second proof makes use of the data-processing inequality, which is essentially the classical information-theoretic proof of Fano’s inequality.

(First Proof) By (1), it suffices to prove that

$$\mathop{\mathbb E}_{\bar{P}} \left[ \max_{i\in [m]} \left\{\frac{dP_i}{d\bar{P}} \right\} \right] \le \frac{1}{\log m}\left(\sum_{i=1}^m \mathop{\mathbb E}_{\bar{P}} \left[\frac{dP_i}{d\bar{P}}\log \frac{dP_i}{d\bar{P}}\right] + m\log 2\right).$$

By the linearity of expectation, it further suffices to prove that for any non-negative reals $x_1, \cdots, x_m$ with $\sum_{i=1}^m x_i = m$, we have

$$\max_{i\in [m]} x_i \le \frac{1}{\log m}\left(\sum_{i=1}^m x_i\log x_i + m\log 2\right).$$

To establish this inequality, let $t:=\max_{i\in [m]} x_i$. Then by the convexity of $x\mapsto x\log x$, Jensen’s inequality gives

$$\sum_{i=1}^m x_i\log x_i \ge t\log t + (m-t)\log\frac{m-t}{m-1} \ge m\log \frac{m}{2} - (m-t)\log m.$$

Plugging in the above inequality completes the proof of Lemma 3.
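The scalar inequality used above can also be checked numerically. The following pure-Python sketch (an illustration added here, not part of the original notes) samples random non-negative vectors summing to $m$ and verifies $\max_{i\in[m]} x_i \le \frac{1}{\log m}(\sum_{i=1}^m x_i\log x_i + m\log 2)$:

```python
import math
import random

def fano_scalar_rhs(xs):
    # (sum_i x_i log x_i + m log 2) / log m, with the convention 0 log 0 = 0
    m = len(xs)
    return (sum(x * math.log(x) for x in xs if x > 0) + m * math.log(2)) / math.log(m)

rng = random.Random(1)
for _ in range(2000):
    m = rng.randrange(2, 12)
    ws = [rng.random() for _ in range(m)]
    xs = [m * w / sum(ws) for w in ws]  # non-negative reals summing to m
    assert max(xs) <= fano_scalar_rhs(xs) + 1e-9
```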

(Second Proof) Introduce the auxiliary random variables $V$ and $X$ as in Remark 2. For any fixed $X$, consider the kernel $\mathsf{K}_X: [m]\rightarrow \{0,1\}$ which sends $i\in [m]$ to $1(\Psi(X) = i)$, then $\mathsf{K}_X\circ P_V$ is the Bernoulli distribution with parameter $m^{-1}$, and $\mathsf{K}_X\circ P_{V|X}$ is the Bernoulli distribution with parameter $p_X := \mathop{\mathbb P}_{V|X}(V = \Psi(X))$. By the data-processing inequality of KL divergence,

$$\begin{array}{rcl} D_{\text{KL}}(P_{V|X} \| P_V) & \ge & p_X \log(mp_X) + (1-p_X)\log (1-p_X) \\ &\ge& p_X\log m - \log 2. \end{array}$$

Now taking expectation on $X$ at both sides gives

$$I(V;X) = \mathop{\mathbb E}_X D_{\text{KL}}(P_{V|X} \| P_V) \ge \mathop{\mathbb P}(V = \Psi(X))\cdot \log m - \log 2,$$

and rearranging completes the proof. $\Box$

Lemma 2 decomposes a multiple hypothesis testing problem into several binary testing problems; it cannot outperform the best two-point method in typical scenarios, but it can be useful when there is some external randomness associated with each $P_{\theta_i}$ (see the bandit example later). Lemma 3 is the well-known Fano’s inequality involving the mutual information, and the additional $\log m$ term is the key gain from multiple hypothesis testing. Hence, the typical lower bound argument is to apply Lemma 1 together with Lemma 3 (or sometimes Lemma 2), using the following lemma to upper bound the mutual information.
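On small discrete models, both sides of Fano’s inequality are explicitly computable, which makes the lemma easy to visualize. The pure-Python sketch below (illustrative, not from the original notes) evaluates the exact optimal average test error via the representation (1) and compares it with the Fano lower bound of Lemma 3 for random distributions on a finite alphabet:

```python
import math
import random

def avg_test_error(dists):
    # optimal average error: 1 - (1/m) * sum_x max_i P_i(x), cf. (1)
    m = len(dists)
    n = len(dists[0])
    return 1.0 - sum(max(d[x] for d in dists) for x in range(n)) / m

def fano_bound(dists):
    # 1 - (I(V;X) + log 2) / log m, with V uniform on [m]
    m = len(dists)
    n = len(dists[0])
    pbar = [sum(d[x] for d in dists) / m for x in range(n)]
    mi = sum(d[x] * math.log(d[x] / pbar[x])
             for d in dists for x in range(n) if d[x] > 0) / m
    return 1.0 - (mi + math.log(2)) / math.log(m)

def random_dist(n, rng):
    w = [rng.random() for _ in range(n)]
    return [v / sum(w) for v in w]

rng = random.Random(2)
for _ in range(200):
    m, n = rng.randrange(2, 8), rng.randrange(2, 8)
    dists = [random_dist(n, rng) for _ in range(m)]
    assert avg_test_error(dists) >= fano_bound(dists) - 1e-9
```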

Lemma 4 (Variational representation of mutual information)

$$I(V;X) = \inf_{Q_X} \mathop{\mathbb E}_V [D_{\text{\rm KL}}(P_{X|V} \| Q_X )].$$

Proof: Simply verify that

$$I(V;X) = \mathop{\mathbb E}_V [D_{\text{\rm KL}}(P_{X|V} \| Q_X )] - D_{\text{KL}}(P_X\|Q_X).$$ $\Box$
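The variational representation can likewise be verified on a toy discrete model: the average KL divergence to any candidate $Q_X$ is minimized at $Q_X = P_X$, where it equals the mutual information. A small pure-Python illustration (added as a sanity check, not part of the original notes):

```python
import math
import random

def kl(p, q):
    # KL divergence between two distributions on a finite alphabet
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def avg_kl_to(dists, q):
    # E_V D(P_{X|V} || Q) for V uniform over the rows of `dists`
    return sum(kl(d, q) for d in dists) / len(dists)

def random_dist(n, rng):
    w = [rng.random() + 1e-3 for _ in range(n)]
    return [v / sum(w) for v in w]

rng = random.Random(3)
dists = [random_dist(5, rng) for _ in range(4)]
px = [sum(d[x] for d in dists) / len(dists) for x in range(5)]
mi = avg_kl_to(dists, px)       # attained at Q = P_X, so this equals I(V;X)
for _ in range(500):            # any other Q can only be larger
    q = random_dist(5, rng)
    assert avg_kl_to(dists, q) >= mi - 1e-12
```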

1.2. Assouad’s lemma

Instead of the previous pairwise separation condition, Assouad’s lemma builds upon a different one where the hypotheses are essentially the vertices of a $p$-dimensional hypercube. Specifically, we shall require that the distance between parameters $v$ and $v'\in \{\pm 1\}^p$ is proportional to their Hamming distance $d_{\text{H}}(v,v') := \sum_{i=1}^p 1(v_i \neq v_i')$, which becomes quite natural when the parameter of the statistical model lies in a $p$-dimensional space.

Theorem 5 (Assouad’s Lemma) Let $L: \Theta\times \mathcal{A} \rightarrow {\mathbb R}_+$ be any function. If there exist parameters $(\theta_v)\subseteq \Theta$ indexed by $v\in \{\pm1\}^p$ such that

$$L(\theta_v, a) + L(\theta_{v'}, a) \ge \Delta\cdot d_{\text{\rm H}}(v,v'), \qquad \forall v,v'\in \{\pm 1\}^p, a\in\mathcal{A},$$

then

$$\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{\Delta}{2}\sum_{j=1}^p \left(1 - \|P_{j,+} - P_{j,-}\|_{\text{\rm TV}}\right),$$

where

$$P_{j,+} := \frac{1}{2^{p-1}}\sum_{v\in \{\pm 1\}^p: v_j = 1} P_{\theta_v}, \quad P_{j,-} := \frac{1}{2^{p-1}}\sum_{v\in \{\pm 1\}^p: v_j = -1} P_{\theta_v}.$$

Proof: For a given estimator $\hat{\theta}$, define

$$\hat{v} := \arg\min_{v\in \{\pm 1\}^p} L(\theta_v, \hat{\theta}).$$

Then it is straightforward to see that

$$L(\theta_v, \hat{\theta}) \ge \frac{L(\theta_v,\hat{\theta}) + L(\theta_{\hat{v}}, \hat{\theta})}{2} \ge \frac{\Delta}{2}\sum_{j=1}^p 1(v_j \neq \hat{v}_j), \qquad \forall v\in \{\pm 1\}^p.$$

The rest of the proof follows from Le Cam’s first lemma and the fact that the maximum is no smaller than the average. $\Box$

In most scenarios, the total variation distance between the mixtures $P_{j,+}$ and $P_{j,-}$ is hard to compute directly, and the following corollaries (based on the joint convexity of the total variation distance and Cauchy–Schwarz) are the versions of Assouad’s lemma presented most frequently.

Corollary 6 Under the conditions of Theorem 5, we have

$$\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{p\Delta}{2}\cdot \left(1 - \max_{v,v'\in \{\pm 1\}^p: d_{\text{\rm H}}(v,v')=1} \|P_{\theta_v} - P_{\theta_{v'}}\|_{\text{\rm TV}}\right).$$

Corollary 7 Under the conditions of Theorem 5, we have

$$\inf_{\hat{\theta}} \sup_{\theta\in \Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \frac{p\Delta}{2}\cdot \left(1 - \left(\frac{1}{p2^p}\sum_{j=1}^p \sum_{v\in \{\pm 1\}^p} \|P_{\theta_v} - P_{\theta_{\bar{v}^j}} \|_{\text{\rm TV}}^2 \right)^{1/2} \right),$$

where $\bar{v}^j$ is the resulting binary vector after flipping the $j$-th coordinate of $v$.

The idea behind Assouad’s lemma is to apply a two-point argument to each coordinate of the parameter and then sum over all coordinates. Hence, Assouad’s lemma typically improves over the best two-point argument by a factor of the dimension $p$, given that the hypercube-type separation condition in Theorem 5 holds.

1.3. Generalized Fano’s inequality

In Fano’s inequality and Assouad’s lemma above, we introduced a random variable $V$ which is uniformly distributed on the set $[m]$ or the hypercube $\{\pm 1\}^p$. Moreover, the separation condition must hold for each pair of realizations of $V$. In general, we may wish to consider some non-uniform random variable $V$, and only require that the separation condition holds for most pairs. This is the focus of the following generalized Fano’s inequality.

Theorem 8 (Generalized Fano’s Inequality) Let $L: \Theta\times \mathcal{A}\rightarrow {\mathbb R}_+$ be any loss function, and $\pi$ be any probability distribution on $\Theta$. For any $\Delta>0$, define

$$p_{\Delta} := \sup_{a\in \mathcal{A}} \pi(\{\theta\in \Theta: L(\theta,a)\le \Delta \}).$$

Then for $V\sim \pi$, we have

$$\inf_{\hat{\theta}}\sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta[L(\theta,\hat{\theta})] \ge \Delta\left(1 - \frac{I(V;X) + \log 2}{\log(1/p_{\Delta})}\right).$$

Proof: By Markov’s inequality, it suffices to show that for any estimator $\hat{\theta}$,

$$\mathop{\mathbb P}\left(L(V, \hat{\theta}(X))\le \Delta \right)\le \frac{I(V;X) + \log 2}{\log(1/p_\Delta)}.$$

We again use the second proof of Lemma 3. For any fixed $X$, consider the deterministic kernel $\mathsf{K}_X: \Theta\rightarrow \{0,1\}$ which sends $V$ to $1( L(V,\hat{\theta}(X))\le \Delta )$. Then the composition $\mathsf{K}_X\circ P_V$ is a Bernoulli distribution with parameter $p_X\le p_\Delta$, and the composition $\mathsf{K}_X\circ P_{V|X}$ is a Bernoulli distribution with parameter $q_X := P_{V|X}(L(V,\hat{\theta}(X))\le \Delta)$. Then by the data-processing inequality of the KL divergence,

$$\begin{array}{rcl} D_{\text{KL}}(P_{V|X}\|P_V) &\ge& q_X\log \frac{q_X}{p_X} + (1-q_X)\log \frac{1-q_X}{1-p_X} \\ &\ge & q_X\log \frac{q_X}{p_\Delta} + (1-q_X)\log (1-q_X) \\ &\ge & q_X\log(1/p_\Delta) - \log 2. \end{array}$$

Since $\mathop{\mathbb E}[q_X] = \mathop{\mathbb P}(L(V,\hat{\theta}(X))\le \Delta)$, taking expectation on $X$ at both sides gives the desired inequality. $\Box$

To see that Theorem 8 is indeed a generalization of the original Fano’s inequality, simply note that when $V\sim \mathsf{Uniform}([m])$ and the separation condition in Lemma 1 holds, we have $p_\Delta = m^{-1}$ and therefore the denominator $\log(1/p_\Delta) = \log m$ becomes the log-cardinality. Hence, in the generalized Fano’s inequality, the verification of the separation condition becomes upper bounding the probability $p_\Delta$.

2. Applications

In this section we present some applications of the above tools to statistical examples. Here the applications are mostly simple and straightforward, and we defer some more sophisticated examples in other domains to the next lecture.

2.1. Example I: Gaussian mean estimation

We start from possibly the simplest example of Gaussian mean estimation. Consider $X\sim \mathcal{N}(\theta,\sigma^2 I_p)$ with known $\sigma$, where the target is to estimate the vector $\theta\in {\mathbb R}^p$ under the squared $\ell_2$ loss. Let $R_{p,\sigma}^\star$ be the corresponding minimax risk.

We first show that any two-point argument fails. In fact, if the two points are chosen to be $\theta_1, \theta_2\in {\mathbb R}^p$, the two-point argument gives

$$R_{p,\sigma}^\star \ge \frac{\|\theta_1 - \theta_2\|_2^2}{2}\left(1-\Phi\left(\frac{\|\theta_1 - \theta_2\|_2}{2\sigma}\right)\right),$$

where $\Phi(\cdot)$ is the normal CDF. Optimization only gives $R_{p,\sigma}^\star = \Omega(\sigma^2)$.

Now we show that Assoaud’s lemma can give us the rate-optimal lower bound $R_{p,\sigma}^\star= \Omega(p\sigma^2)$. Let $P_v = \mathcal{N}(\delta v, \sigma^2 I_p)$ for $v\in \{\pm 1\}^p$, where $\delta>0$ is a parameter to be chosen later. Since

$$\|\delta v - \delta v'\|_2^2 = 4\delta^2\cdot d_{\text{H}}(v,v'),$$

the separation condition in Theorem 5 holds with $\Delta = 4\delta^2$. Consequently, Corollary 6 gives

$$R_{p,\sigma}^\star \ge 2p\delta^2\cdot \left(1 - \Phi\left(\frac{\delta}{\sigma}\right)\right),$$

and choosing $\delta =\Theta(\sigma)$ gives $R_{p,\sigma}^\star= \Omega(p\sigma^2)$.
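Numerically, the Assouad bound above indeed scales linearly in $p$. The pure-Python snippet below (an illustration added to these notes; the standard normal CDF is computed via `math.erf`) evaluates the displayed bound $2p\delta^2(1-\Phi(\delta/\sigma))$ at the choice $\delta = \sigma$:

```python
import math

def phi_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def assouad_bound(p, sigma, delta):
    # the displayed lower bound 2 p delta^2 (1 - Phi(delta / sigma))
    return 2.0 * p * delta**2 * (1.0 - phi_cdf(delta / sigma))

sigma = 1.0
for p in (1, 10, 100, 1000):
    # delta = sigma gives 2 (1 - Phi(1)) ~ 0.317 times p sigma^2
    assert assouad_bound(p, sigma, sigma) >= 0.3 * p * sigma**2
```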

We can also establish the same lower bound using the generalized Fano’s inequality. Let $V\sim \mathsf{Uniform}(\{\pm\delta\}^p)$, then Lemma 4 gives

$$I(V;X) \le \mathop{\mathbb E} D_{\text{KL}}(\mathcal{N}(V,\sigma^2 I_p) \| \mathcal{N}(0, \sigma^2I_p)) = \frac{\mathop{\mathbb E} \|V\|_2^2}{2\sigma^2} = \frac{p\delta^2}{2\sigma^2}.$$

Since for any $\theta_0\in {\mathbb R}^p$ with $v_0 = \text{sign}(\theta_0)\in \{\pm 1\}^p$ (break ties arbitrarily), we have

$$\|V - \theta_0\|_2^2 \ge \delta^2 \cdot d_{\text{H}}(\text{sign}(V), v_0),$$

choosing $\Delta = p\delta^2/3$ gives

$$\mathop{\mathbb P}(\|V-\theta_0\|_2^2 \le \Delta) \le \mathop{\mathbb P}(d_{\text{H}}(\text{sign}(V), v_0)\le p/3 ) \le e^{-cp}$$

for some numerical constant $c>0$, where the last inequality is given by sub-Gaussian concentration. Consequently, $p_\Delta \le e^{-cp}$, and Theorem 8 gives

$$R_{p,\sigma}^\star \ge \frac{p\delta^2}{3}\left(1 - \frac{p\delta^2/(2\sigma^2) + \log 2}{cp}\right).$$

Again, choosing $\delta = c'\sigma$ with small $c'>0$ gives $R_{p,\sigma}^\star = \Omega(p\sigma^2)$.
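The sub-Gaussian concentration step can be made fully explicit: since $V$ has i.i.d. uniform signs, $d_{\text{H}}(\text{sign}(V), v_0)$ is $\mathsf{Binomial}(p, 1/2)$ for any fixed $v_0$, and Hoeffding’s inequality gives $\mathop{\mathbb P}(B \le p/3)\le e^{-2p(1/6)^2} = e^{-p/18}$, so the constant $c = 1/18$ works. The exact binomial tail confirms this (pure-Python sketch, added for illustration):

```python
import math

def binom_half_cdf(n, k):
    # P(Binomial(n, 1/2) <= k), computed exactly
    return sum(math.comb(n, j) for j in range(k + 1)) / 2**n

# Hoeffding: P(B <= p/3) <= exp(-2p(1/6)^2) = exp(-p/18)
for p in range(3, 200):
    assert binom_half_cdf(p, p // 3) <= math.exp(-p / 18)
```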

Remark 3 The original version of Fano’s inequality can also be applied here, where $\{\theta_1,\cdots,\theta_m\}$ is a maximal packing of $\{\pm \delta\}^p$ with $\ell_2$ packing distance $O(\delta\sqrt{p})$.

2.2. Example II: Sparse linear regression

Consider the following sparse linear regression $Y\sim \mathcal{N}(X\theta, \sigma^2 I_n)$, where $X\in {\mathbb R}^{n\times p}$ is a fixed design matrix, and $\theta\in {\mathbb R}^p$ is a sparse vector with $\|\theta\|_0\le s$. Here $s\in [1,p]$ is a known sparsity parameter, and we are interested in the minimax risk $R_{n,p,s,\sigma}^\star(X)$ of estimating $\theta$ under the squared $\ell_2$ loss. Note that when $X = I_p$, this problem is reduced to the sparse Gaussian mean estimation.

We apply the generalized Fano’s inequality to this problem. Since a natural difficulty of this problem is to recover the support of $\theta$, we introduce the random vector $V=\delta v_S$, where $\delta>0$ is a parameter to be specified later, $v\sim \mathsf{Uniform}\{\pm 1\}^p$, $v_S$ denotes the restriction of $v$ onto $S$, and $S\subseteq [p]$ is uniformly chosen from all size-$s$ subsets of $[p]$. Clearly,

$$\mathop{\mathbb E}[VV^\top] = \delta^2 \mathop{\mathbb E}[v_Sv_S^\top] = \frac{s\delta^2}{p}I_p.$$

By Lemma 4,

$$\begin{array}{rcl} I(V;Y) &\le & \mathop{\mathbb E} D_{\text{KL}}(\mathcal{N}(XV, \sigma^2 I_n) \| \mathcal{N}(0, \sigma^2 I_n)) \\ &=& \frac{\mathop{\mathbb E}\|XV\|_2^2}{2\sigma^2} \\ &=& \frac{\text{Trace}(X^\top X\cdot \mathop{\mathbb E}[VV^\top]) }{2\sigma^2}\\ &=& \frac{s\delta^2}{2p\sigma^2}\|X\|_{\text{F}}^2, \end{array}$$

where $\|X\|_{\text{F}} := \sqrt{\text{Trace}(X^\top X)}$ is the Frobenius norm of $X$. Now it remains to upper bound $p_\Delta$ for $\Delta = s\delta^2/12$. Fix any $\theta_0\in {\mathbb R}^p$ such that $\|V - \theta_0\|_2^2\le \Delta$ holds with non-zero probability (otherwise the upper bound is trivial), and by symmetry we may assume that $\|\theta_0 - \delta1_{[s]}\|_2^2 \le \Delta$. Now by triangle inequality, $\|\theta_0 - \delta v_S\|_2^2 \le \Delta$ implies that $\|v_S - 1_{[s]}\|_2^2 \le s/3$. Hence,

$$\begin{array}{rcl} p_\Delta &\le& \mathop{\mathbb P}(\|v_S - 1_{[s]}\|_2^2\le s/3) \\ &\le& \frac{1}{2^s\binom{p}{s}}\sum_{k=\lceil 2s/3 \rceil}^s \binom{s}{k} \binom{p-k}{s-k}2^{s-k} \\ &\le& \left(s/(ep)\right)^{cs} \end{array}$$

for a numerical constant $c>0$ after some algebra. Consequently, Theorem 8 gives

$$R_{n,p,s,\sigma}^\star(X) \ge \frac{s\delta^2}{12}\left(1 - \frac{ \frac{s\delta^2}{2p\sigma^2}\|X\|_{\text{F}}^2 + \log 2}{cs\log(ep/s)} \right).$$

Finally, choosing $\delta^2 = c'p\sigma^2\log(ep/s)/\|X\|_{\text{F}}^2$ for some small constant $c'>0$ gives

$$R_{n,p,s,\sigma}^\star (X)= \Omega\left(\frac{sp\sigma^2\log(ep/s)}{\|X\|_{\text{F}}^2} \right).$$

In the special cases where $X= I_p$ or $X$ consists of i.i.d. $\mathcal{N}(0,n^{-1})$ entries, we arrive at the tight lower bound $\Omega(s\sigma^2\log(ep/s))$ for both sparse Gaussian mean estimation and compressed sensing.
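As a sanity check on the counting bound for $p_\Delta$, one can enumerate all supports $S$ and sign patterns for small $(p, s)$, fixing $\theta_0 = \delta 1_{[s]}$ by the symmetry argument above (the factor $\delta$ cancels), and compare the exact probability with the binomial sum before the final $(s/(ep))^{cs}$ simplification. A brute-force pure-Python sketch, added for illustration:

```python
import math
from itertools import combinations, product

def exact_p_delta(p, s):
    # P(||v_S - 1_[s]||_2^2 <= s/3) by enumeration over supports S and signs
    total, hits = 0, 0
    for S in combinations(range(p), s):
        for signs in product((-1, 1), repeat=s):
            v = [0] * p
            for idx, eps in zip(S, signs):
                v[idx] = eps
            dist2 = sum((v[i] - (1 if i < s else 0)) ** 2 for i in range(p))
            total += 1
            hits += dist2 <= s / 3
    return hits / total

def counting_bound(p, s):
    # the binomial-sum upper bound displayed in the text
    tot = sum(math.comb(s, k) * math.comb(p - k, s - k) * 2 ** (s - k)
              for k in range(math.ceil(2 * s / 3), s + 1))
    return tot / (2**s * math.comb(p, s))

for p, s in [(8, 3), (10, 4), (12, 5)]:
    assert exact_p_delta(p, s) <= counting_bound(p, s) + 1e-12
```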

2.3. Example III: Multi-armed bandit

Next we revisit the example of the multi-armed bandit in Lecture 5. Let $T$ be the time horizon and $K$ be the total number of arms. For each $i\in [K]$, the reward of pulling arm $i$ at time $t$ is an independent random variable $r_{t,i}\sim \mathcal{N}(\mu_i, 1)$. The target of the learner is to devise a policy $(\pi_t)_{t\in [T]}$, where $\pi_t\in [K]$ is the arm to pull at time $t$, and $\pi_t$ may depend on the entire observed history $(\pi_\tau)_{\tau<t}$ and $(r_{\tau,\pi_\tau})_{\tau<t}$. The learner would like to minimize the following worst-case regret:

$$R_{T,K}^\star = \inf_{\pi} \sup_{\max_{i\in [K]}|\mu_i| \le \sqrt{K} } \mathop{\mathbb E}\left[T \max_{1\le i\le K} \mu_i - \sum_{t=1}^T \mu_{\pi_t}\right],$$

where $\max_{i\in [K]}|\mu_i|\le \sqrt{K}$ is a technical condition. As in Lecture 5, our target is to show that

$$R_{T,K}^\star = \Omega(\sqrt{KT})$$

via multiple hypothesis testing.

To prove the lower bound, a natural idea is to construct $K$ hypotheses where the $i$-th arm is the optimal arm under the $i$-th hypothesis. Specifically, we set

$$\begin{array}{rcl} \mu^{(1)} &=& (\delta, 0, 0, 0, \cdots, 0), \\ \mu^{(2)} &=& (\delta, 2\delta, 0, 0, \cdots, 0), \\ \mu^{(3)} &=& (\delta, 0, 2\delta, 0, \cdots, 0), \\ \vdots & & \vdots \\ \mu^{(K)} &=& (\delta, 0, 0, 0, \cdots, 2\delta), \end{array}$$

where $\delta>0$ is some parameter to be specified later. The construction is not entirely symmetric, and the reward distribution under the first arm is always $\mathcal{N}(\delta,1)$.

For a mean vector $\mu$ and policy $\pi$, let $L(\mu,\pi) = T \max_{1\le i\le K} \mu_i - \sum_{t=1}^T \mu_{\pi_t}$ be the non-negative loss function for this online learning problem. Then clearly the separation condition in Lemma 1 is fulfilled with $\Delta = \delta T$. Next, applying Lemma 2 to the star graph $([K], \{(1,2),(1,3),\cdots,(1,K) \})$ with center $1$ gives

$$\begin{array}{rcl} R_{T,K}^\star &\overset{(a)}{\ge} & \frac{\delta T}{2}\cdot \frac{1}{K}\sum_{i=2}^K \left(1 - \|P_{\mu^{(1)}} - P_{\mu^{(i)}} \|_{\text{TV}}\right) \\ &\overset{(b)}{\ge} & \frac{\delta T}{4K} \sum_{i=2}^K \exp\left( -D_{\text{KL}}(P_{\mu^{(1)}} \| P_{\mu^{(i)}} ) \right) \\ &\overset{(c)}{=}& \frac{\delta T}{4K} \sum_{i=2}^K \exp\left( - 2\delta^2 \mathop{\mathbb E}_1(T_i)\right) \\ &\overset{(d)}{\ge}& \frac{\delta T}{4}\cdot \frac{K-1}{K} \exp\left(-2\delta^2\cdot \frac{\mathop{\mathbb E}_1(T_2 + \cdots + T_K)}{K-1} \right) \\ &\overset{(e)}{\ge} & \frac{\delta T}{4}\cdot \frac{K-1}{K}\exp\left(-\frac{2T\delta^2}{K-1}\right), \end{array}$$

where in step (a) we denote by $P_{\mu^{(i)}}$ the distribution of the observed rewards under the mean vector $\mu^{(i)}$, step (b) follows from the inequality $1 - \|P-Q\|_{\text{TV}}\ge \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q))$ presented in Lecture 5, step (c) is the evaluation of the KL divergence where $\mathop{\mathbb E}_1(T_i)$ denotes the expected number of pulls of arm $i$ under the mean vector $\mu^{(1)}$, step (d) follows from Jensen’s inequality, and step (e) is due to the deterministic inequality $T_2 + \cdots + T_K\le T$. Now choosing $\delta = \Theta(\sqrt{K/T})$ gives the desired lower bound.

The main reason why we apply Lemma 2 instead of Fano’s inequality, and choose the same reward distribution for arm $1$ at all times, is to deal with the random number of pulls of different arms under different reward distributions. In this way, we may stick to the expectation $\mathop{\mathbb E}_1$ and apply the deterministic inequality $T_2 + \cdots + T_K\le T$ to handle the randomness.
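Step (b) relies on the inequality $1 - \|P-Q\|_{\text{TV}} \ge \frac{1}{2}e^{-D_{\text{KL}}(P\|Q)}$ from Lecture 5. For two unit-variance Gaussians both sides are available in closed form ($\|P-Q\|_{\text{TV}} = 2\Phi(|\mu|/2)-1$ and $D_{\text{KL}} = \mu^2/2$), so the inequality can be checked directly; a pure-Python illustration (not part of the original notes):

```python
import math

def tv_gauss(mu):
    # TV distance between N(0,1) and N(mu,1): 2*Phi(|mu|/2) - 1 = erf(|mu| / (2 sqrt 2))
    return math.erf(abs(mu) / (2.0 * math.sqrt(2.0)))

def kl_gauss(mu):
    # KL(N(0,1) || N(mu,1)) = mu^2 / 2
    return mu * mu / 2.0

for mu in (0.0, 0.1, 0.5, 1.0, 2.0, 4.0, 8.0):
    assert 1.0 - tv_gauss(mu) >= 0.5 * math.exp(-kl_gauss(mu))
```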

2.4. Example IV: Gaussian mixture estimation

Finally we look at a more involved example in nonparametric statistics. Let $f$ be a density on ${\mathbb R}$ which is a Gaussian mixture, i.e., $f = g * \mathcal{N}(0,1)$ for some density $g$, where $*$ denotes the convolution. We consider the estimation of the density $f$ given $n$ i.i.d. observations $X_1,\cdots,X_n$ from $f$, and we denote by $R_n^\star$ the minimax risk under the squared $L_2$ loss between real-valued functions. The central claim is that

$$R_n^\star = \Omega\left(\frac{(\log n)^{1/2}}{n}\right),$$

which is slightly larger than the (squared) parametric rate $n^{-1}$.

Before proving this lower bound, we first gain some insights from the upper bound. In the problem formulation, the only restriction on the density $f$ is that $f$ must be a Gaussian mixture. Since convolution makes functions smoother, $f$ should be fairly smooth, and harmonic analysis suggests that the extent of smoothness is reflected in the speed of decay of the Fourier transform of $f$. Letting $\widehat{f}$ be the Fourier transform of $f$, we have

$$\widehat{f}(\omega) = \widehat{g}(\omega)\cdot \frac{1}{\sqrt{2\pi}}e^{-\omega^2/2},$$

and $\|\widehat{g}\|_\infty \le \|g\|_1=1$. Hence, if we truncate $\widehat{f}(\omega)$ to zero for all $|\omega| > 2\sqrt{\log n}$, the $L_2$ approximation error is at most $O(n^{-1})$. This suggests considering the kernel $K$ on ${\mathbb R}$ with

$$\widehat{K}(\omega) = 1(|\omega|\le 2\sqrt{\log n}).$$

Then the density estimator $\mathop{\mathbb P}_n * K$ has mean $f * K$, which by Parseval’s identity has squared bias

$$\|f - f*K\|_2^2 = \|\widehat{f}(1-\widehat{K}) \|_2^2 = O(n^{-1}).$$

Moreover, by Parseval again,

$$\mathop{\mathbb E}\|\mathop{\mathbb P}_n * K - f * K\|_2^2 \le \frac{\|K\|_2^2}{n} = \frac{\|\widehat{K}\|_2^2}{n} = O\left(\frac{\sqrt{\log n}}{n}\right),$$

and a triangle inequality leads to the desired result.
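The claimed truncation error can be quantified: since $|\widehat{f}(\omega)|\le e^{-\omega^2/2}$ up to constants, the tail mass beyond the cutoff $a = 2\sqrt{\log n}$ is controlled by $\mathrm{erfc}(a) \le e^{-a^2} = n^{-4}$, which is far below $n^{-1}$. A pure-Python check (illustrative; constant factors from the Fourier convention are ignored):

```python
import math

for n in (2, 10, 100, 10**4, 10**8):
    a = 2.0 * math.sqrt(math.log(n))   # truncation point 2 sqrt(log n)
    tail = math.erfc(a)                # Gaussian tail integral, up to constants
    assert tail <= n ** -4.0           # erfc(a) <= exp(-a^2) = n^{-4}
    assert n ** -4.0 <= 1.0 / n        # hence the squared bias is o(1/n)
```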

The upper bound analysis motivates us to apply Fourier-type arguments in the lower bound. For $v\in \{\pm 1\}^p$ (with integer $p$ to be specified later), consider the density

$$f_v = f_0 + \left(\delta\cdot \sum_{i=1}^p v_ig_i \right) * \phi,$$

where $f_0$ is some centered probability density, $\delta>0$ is some parameter to be specified later, $g_1,\cdots,g_p$ are perturbation functions with $\int g_i = 0$, and $\phi$ is the PDF of $\mathcal{N}(0,1)$. To apply Assouad’s lemma on this hypercube structure, we require the following orthogonality condition

$$\int_{{\mathbb R}} (g_i*\phi)(x) (g_j*\phi)(x)dx = 1(i=j)$$

to ensure the separation condition in Theorem 5 with $\Delta = \Theta(\delta^2)$. By Plancherel’s identity, the above condition is equivalent to

$$\int_{{\mathbb R}} \widehat{g}_i(\omega)\widehat{g}_j(\omega)\phi^2(\omega)d\omega = 1(i=j).$$

Recall from Lecture 7 that the Hermite polynomials are orthogonal under the normal distribution; a natural candidate is therefore $\widehat{g}_i(\omega)\propto \phi(\omega)H_{2i-1}(2\omega)$, where we have used that $\phi(\omega)^4 \propto \phi(2\omega)$, and we restrict to odd degrees to ensure that $\int g_i = \widehat{g}_i(0) = 0$. This choice enjoys another nice property: the inverse Fourier transform of $\widehat{g}_i$ admits the closed-form expression

$$g_i(x) = \sqrt{2}(2\pi)^{3/4}\sqrt{\frac{3^{2i-1}}{(2i-1)!}}\cdot\phi(x) H_{2i-1}\left(\frac{2x}{\sqrt{3}}\right).$$

Moreover, since $H_k(x)\sim \sqrt{k!}e^{x^2/4}$ for large $x$, we have $\|g_i\|_\infty = \Theta(3^i)$. Therefore, we must have $p=O(\log n)$ to ensure the non-negativity of the density $f_v$.
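The orthogonality of this candidate can be confirmed numerically: assuming the probabilists' normalization of the Hermite polynomials (orthogonal with weight $\phi$ and $\int H_a^2\phi = a!$), the substitution $u = 2\omega$ together with $\phi(\omega)^4\propto\phi(2\omega)$ gives $\int H_a(2\omega)H_b(2\omega)\phi(\omega)^4 d\omega \propto a!\,1(a=b)$. A pure-Python quadrature sketch (illustrative, not part of the original notes):

```python
import math

def he(k, x):
    # probabilists' Hermite polynomial via He_{j+1} = x He_j - j He_{j-1}
    h0, h1 = 1.0, x
    if k == 0:
        return h0
    for j in range(1, k):
        h0, h1 = h1, x * h1 - j * h0
    return h1

def inner(a, b, lo=-12.0, hi=12.0, n=48000):
    # midpoint rule for the integral of He_a(2w) He_b(2w) phi(w)^4 dw
    h = (hi - lo) / n
    tot = 0.0
    for i in range(n):
        w = lo + (i + 0.5) * h
        phi4 = math.exp(-2.0 * w * w) / (2.0 * math.pi) ** 2
        tot += he(a, 2.0 * w) * he(b, 2.0 * w) * phi4 * h
    return tot

degs = (1, 3, 5)  # odd degrees, matching g_1, g_2, g_3
vals = {(a, b): inner(a, b) for a in degs for b in degs}
for a in degs:
    for b in degs:
        if a == b:
            assert vals[a, b] > 1e-4  # diagonal entries are strictly positive
        else:
            assert abs(vals[a, b]) < 1e-4 * math.sqrt(vals[a, a] * vals[b, b])
```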

Now the only remaining condition is to check that the $\chi^2$-divergence between neighboring vertices is at most $O(1)$, which is essentially equivalent to

$$n\delta^2\cdot \int_{{\mathbb R}} \frac{(g_i * \phi(x))^2}{f_0(x)}dx = O(1).$$

A natural choice of $f_0$ is the PDF of $\mathcal{N}(0,\sigma^2)$, with $\sigma>0$ to be specified. Splitting the above integral over ${\mathbb R}$ into $|x|\le C\sigma$ and $|x|> C\sigma$ for some large constant $C>0$, the orthogonality relation of $g_i$ gives

$$\int_{|x|\le C\sigma} \frac{(g_i * \phi(x))^2}{f_0(x)}dx \le \sqrt{2\pi}e^{C^2/2}\sigma \cdot \int_{{\mathbb R}} (g_i * \phi(x))^2dx = O(\sigma).$$

To evaluate the integral over $|x|>C\sigma$, recall that $g_i(x)$ behaves as $3^ie^{-cx^2}$ for large $|x|$. Hence, the natural requirement $\sigma^2 = \Omega(p)$ leads to the same upper bound $O(\sigma)$ for the second integral, and therefore the largest choice of $\delta$ is $\delta^2 = (n\sigma)^{-1}$.

To sum up, we have arrived at the lower bound

$$R_{n}^\star = \Omega\left(\frac{p}{n\sigma}\right)$$

subject to the constraints $\sigma^2 = \Omega(p)$ and $p = O(\log n)$. Hence, the optimal choices for the auxiliary parameters are $p = \Theta(\log n)$ and $\sigma = \Theta(\sqrt{\log n})$, leading to the desired lower bound.

3. Bibliographic Notes

The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977), who also applied Fano’s lemma (Fano (1952)) in a statistical setting. Assouad’s lemma is due to Assouad (1983), and we also refer to the survey paper Yu (1997). The tree-based technique (Lemma 2 and the analysis of the multi-armed bandit) is due to Gao et al. (2019), and the generalized Fano’s inequality is motivated by the distance-based Fano’s inequality (Duchi and Wainwright (2013)); the current form is presented in the lecture notes Duchi (2019).

For the examples, the lower bound for sparse linear regression is taken from Candes and Davenport (2013), and we also refer to Donoho and Johnstone (1994), Raskutti, Wainwright and Yu (2011), and Zhang, Wainwright and Jordan (2017) for related results. For the full proof of the Gaussian mixture example we refer to Kim (2014).

1. Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977.
2. Robert M. Fano, Class notes for transmission of information. Massachusetts Institute of Technology, Tech. Rep (1952).
3. Patrick Assouad, Densité et dimension. Annales de l’Institut Fourier. Vol. 33. No. 3. 1983.
4. Bin Yu, Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam. Springer, New York, NY, 1997. 423–435.
5. Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou, Batched multi-armed bandit problem. Advances in Neural Information Processing Systems, 2019.
6. John C. Duchi, and Martin J. Wainwright. Distance-based and continuum Fano inequalities with applications to statistical estimation. arXiv preprint arXiv:1311.2669 (2013).
7. John C. Duchi, Lecture notes for statistics 311/electrical engineering 377. 2019.
8. Emmanuel J. Candes, and Mark A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis 34.2 (2013): 317–323.
9. David L. Donoho, and Iain M. Johnstone. Minimax risk over $\ell_p$-balls for $\ell_q$-error. Probability Theory and Related Fields 99.2 (1994): 277–303.
10. Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory 57.10 (2011): 6976–6994.
11. Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. Electronic Journal of Statistics 11.1 (2017): 752–799.
12. Arlene K. H. Kim. Minimax bounds for estimation of normal mixtures. Bernoulli 20.4 (2014): 1802–1818.

# Journal for High Schoolers in 2019

Journal for High Schoolers

In 2019, from June to August, 40 high school students attended the STEM to SHTEM (Science, Humanities, Technology, Engineering and Mathematics) summer program hosted by Prof. Tsachy Weissman and the Stanford Compression Forum. During this summer program, the high schoolers pursued fun research projects in various domains under the supervision of 18 mentors; the entire collection of the high schoolers’ reports can be found below.

# Olfactory Communications In a Virtual Environment


Authors:

Anushka Sheth, Bikrant Das Sharma, Harvey Jarin, Ian Zhang, Jonathon Sneh, Leena Elzeiny, Prayusha Parikh and Vyomika Gupta

Abstract:

Currently, VR systems immerse an individual visually and auditorily, but the lack of  olfactory stimulation is part of the void that separates virtual reality from real life  experiences. The prospect of adding scent as a third dimension to virtual reality prompted  our research of the effects of adding an olfactory feature to virtual interactions, as well as  its effect on the judgement of pleasantness and personality of others met in the virtual  environment. At the start of the experiment, all participants take a personality quiz. The  quiz predicts how much each participant enjoys a certain smell. Each individual then  participates in a conversation in virtual reality with a stranger. Some participants have  scent dispersed during their conversation, and some participants do not. The stranger the  participant meets has scripted responses. Immediately after, each participant answers  questions based on their conversation. The questionnaire quantitatively scores the (1) pleasantness to examine if scents impact the perception of someone and (2) the perceived  personality of the stranger, in order to determine whether olfactory communication aids  in an accurate first impression. This research is aimed at investigating the implications of  integrating olfactory communication in virtual interactions and assessing its effects on the  judgement of the other person’s personality and pleasantness.

1. Introduction

Olfactory communication is deeply instinctual. A scent is first processed by the olfactory bulb, which directly activates the amygdala, the control center for emotions in the brain. The association between smells and emotions can run so deep that, according to three clinical studies by two psychiatric professors, patients with post-traumatic stress disorder reported feeling guilt and nausea upon smelling an environment associated with the traumatic incident [1]. Currently, technology is slowly shifting to incorporate olfaction machinery into spaces, to bring a new era of immersive reality. In order to simulate the physical world, one must mimic all dimensions of it, but little research has been conducted to understand the human response to synthetic scents.

The purpose of our research is to determine the effect of aroma on a person’s judgement of the pleasantness and personality of someone they meet. A personality quiz is given to the participant to predict their enjoyment of a certain smell. The participant then interacts with another person. During the virtual reality conversation between the participant and the actor, the scent is dispersed for some subjects without their knowledge.

2. Background

Virtual reality (VR) is a highly realistic, immersive, and simulated experience incorporating sophisticated visual and audio elements. Virtual reality is primarily used for educational and entertainment purposes, as it simulates a realistic interaction. The two participants interact in virtual reality through the platform AltspaceVR, a social virtual world where users create avatars and explore chat rooms. AltspaceVR serves to supplement social interactions, and connects users worldwide. While AltspaceVR and other such virtual communication platforms attempt to mimic and replicate real life encounters, many subtleties humans have in their physical interactions are lost. Most first impressions, as well as the formation of memories, require tangible visual and olfactory cues that cannot currently be conveyed through virtual means. First impressions are formed in the amygdala and posterior cingulate cortex, parts of the brain that process fear and assign value to objects. The factors involved in the assignment of value and trustworthiness are largely derived from visual signals, primarily body language. People can evaluate faces on trustworthiness, status, and attractiveness following even just a brief glance (e.g., 33 ms), and extra time (100 or 500 ms) only leads to a small improvement in the correspondence of these time-constrained evaluations with an independent set of time-unconstrained judgments. Perceptions of personality are also based in visual cues, including clothing, posture, and eye contact. Many such cues, however, are also conveyed through olfaction.

Yet this strong connection has not always been appreciated. The potential of olfactory communication has been dismissed since the ancient Greek philosophers: Plato and Aristotle described scent as the most animalistic of the five senses [2]. Later philosophers such as Immanuel Kant agreed, since smell contributed little to the music and art of the Enlightenment. On the other hand, some non-Western cultures glorified the sense of smell [3]. For example, the Bororo people of Brazil consider body odor to be the life force of an individual and the smell of breath to be linked to the soul.

3. Methods and Materials

We received a scent machine from OWidgets, a device from the SCHI Lab at the University of Sussex. AltspaceVR was used for the virtual reality world and experience. The VR headsets used were an Oculus Rift and an HTC Vive. Scents were received from the Institute of Art and Olfaction in Los Angeles, California. An Arduino board, connected to a laptop, controlled the dispersion of scent from the OWidgets machine.

The experience begins in a waiting room, where participants are ushered in and out. Each participant takes the scent personality quiz anonymously, and the database stores the person's scent preferences scored by the quiz. From there, participants are told the narrative for the experiment: they have left their lives and are in a state that determines whether they will return to their old life or move on to the afterlife. In the room, the participant enters the virtual reality world. Another person is present in the virtual reality world, and the two users converse. After the experience, participants take an exit survey, judging each other's personality and pleasantness.

An important factor in conducting the experiment is that participants remain blinded to its true purpose. The experiment is disguised as a study on the effect of virtual reality on first impressions. Each participant meets an "actor": a person who has memorized his/her responses to the questions and gives the memorized response when the participant asks a question.

After their interaction, the participant steps out of the VR headset and takes an exit survey on the interaction they just had. The exit survey asks the participant to judge the actor's personality and pleasantness.

3.1 Altspace

AltspaceVR is a virtual reality platform used as a meeting space for both private and public events. In AltspaceVR, worlds can be built from scratch, which allows for greater customization and control over certain variables (avatars, music, boundaries, setting). In addition to these customization features, AltspaceVR also supports world-building through Unity and in-game app development through its MDK feature. This combination of features makes AltspaceVR ideal for our experimental procedure.

For our world, we wanted to create a neutral, non-religious interpretation of the afterlife. To ensure that the world would not elicit any negative emotions, we also wanted to make it as serene and calm as possible. We determined that the best world would include the following features: soothing music, a nighttime mountain background, a reflective plane, and a limiting boundary in the form of rocks. To make the experience more interactive, we also implemented a Tic-Tac-Toe game to go along with the questions.

3.2 Personal Quiz

To determine the scent that corresponded to a participant's unique personality, we created our own personality quiz. Based on our research on various personality quizzes (Myers-Briggs, Five Factor Model, Eysenck's PEN Model), we determined that a quiz following the same general structure and trait associations as the FFM would best suit our purpose. We devised 12 questions (11 four-response multiple-choice questions and 1 two-response question) based around the five main personality traits in the FFM: Openness to experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN). Since there are four main fragrance families (Floral, Oriental, Woody, Fresh) but five trait groups, we combined Openness and Extraversion due to their similarities [4]. To connect fragrance families and personality traits, we drew on reported correlations between one's personality and preferred perfume. The fragrance-trait relations we used were Floral-Neuroticism, Oriental-Conscientiousness, Woody-Agreeableness, and Fresh-Openness/Extraversion [5].
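The fragrance-trait correspondence above can be written out directly. A minimal sketch (the dictionary name is ours, not part of the quiz implementation):

```python
# Fragrance-trait relations used by the quiz, as described in the text.
# Openness and Extraversion are combined into one group (Fresh) because
# there are four fragrance families but five FFM traits.
TRAIT_TO_FAMILY = {
    "Neuroticism": "Floral",
    "Conscientiousness": "Oriental",
    "Agreeableness": "Woody",
    "Openness": "Fresh",
    "Extraversion": "Fresh",
}
```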

Early versions of our quiz were inaccurate, as a majority of people overwhelmingly favored one response over the other answer choices for several questions. This led to most people scoring highest in the Fresh fragrance family. After several iterations of the quiz, modifying and removing biased questions, we were left with 12 questions with even distributions of answer responses. For scoring, we assigned each response three points (with the exception of question 11, which was given four points for more weight). The points corresponded to different scents, as our research suggested that those who chose a specific answer were more likely to favor a specific fragrance. Each answer choice an individual made added points to one or more scents. In total, 37 points could be earned in each of the four scents, so there was an equal opportunity to score highest in any of the four fragrance families. After all 12 questions were answered, the points scored in each fragrance family were summed to create the participant's scent array, which ranked the four scents in order to anticipate which one the participant would be partial to.

Next, we had to evaluate whether an individual's personality, as judged by our quiz, actually had an effect on the kind of smells they preferred. To test the accuracy of our personality quiz, we went to various locations on campus and surveyed people randomly. Participants were first given the quiz so that we could predict the scent they should prefer based on their personality type. Then they were given four strips of paper, each sprayed with a different fragrance oil, and were asked to choose which one they preferred. Of the 24 people we surveyed, 16 chose one of their top two scoring scents as their favorite, suggesting that our quiz can often predict which aromas an individual will prefer.
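The scoring procedure above can be sketched as follows. This is an illustrative reconstruction, not our actual question bank: the per-answer point assignments shown here are placeholder values, but the mechanics (each answer adds points to one or more families, totals are ranked into a scent array) follow the description in the text.

```python
# Sketch of the quiz scoring: each chosen answer awards points to one or
# more of the four fragrance families; the totals are then ranked from
# highest to lowest to form the participant's "scent array".

FAMILIES = ["Floral", "Oriental", "Woody", "Fresh"]

def score_quiz(answers, answer_points):
    """answers: chosen answer index for each question.
    answer_points: per question, a list mapping each answer choice to a
    dict of {family: points} awarded by that choice (placeholder data)."""
    totals = {f: 0 for f in FAMILIES}
    for question, choice in enumerate(answers):
        for family, pts in answer_points[question][choice].items():
            totals[family] += pts
    # Rank families by total score, highest first.
    return sorted(FAMILIES, key=lambda f: totals[f], reverse=True)

# Toy example with two questions (hypothetical weights; question 2 is
# weighted four points, like question 11 in the real quiz):
answer_points = [
    [{"Fresh": 3}, {"Floral": 3}, {"Woody": 2, "Oriental": 1}, {"Oriental": 3}],
    [{"Woody": 4}, {"Fresh": 4}],
]
print(score_quiz([0, 1], answer_points))  # Fresh ranks first with 7 points
```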

3.3 OWidgets Scent Machine

To disperse the participants' unique perfume throughout the room, we looked into various scent delivery systems such as essential oil diffusers, heat diffusers, and nebulizer oil diffusers. While each of these has proved effective at dispersing smells throughout a given environment, the fragrance oils they give off tend to linger in the room for hours. Over time, the different smells would blend together to form olfactory white, an unidentifiable scent that would distort our data. We wanted to avoid the Smell-O-Vision disaster of the mid-twentieth century, in which viewers were overwhelmed by the multiple scents coming at them. Furthermore, we had to ensure that the smell was not too strong. Since each VR experience would only last four to five minutes, we needed a way to easily clear the smell after each encounter.

To do this, we partnered with OWidgets, a company that designs ways to incorporate olfaction into various experiences. The company was founded by Emanuela Maggioni, an experimental psychologist whose interest in the human olfactory system led her to develop a machine for further experimentation with smells. The scent machine she and her team created allows three different scents to be sprayed from its nozzles. Through experimentation, we confirmed that perfumes sprayed from the machine only noticeably lingered in the environment for 1-3 seconds after the machine was turned off. We found that the ideal distance between the user and the OWidgets scent machine was five feet. Moreover, we discovered that the best way to spray the fragrances was to run the machine on a cycle, on for 5 seconds and off for 15 seconds, for the duration of the experience (about four minutes). This ensured that the participant could smell the scent for most of the experience without being overwhelmed by a strong fragrance.
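The dispersion cycle above is easy to make precise. A minimal sketch of the timing logic only (in the actual setup an Arduino drove the machine; the function name and schedule representation here are ours):

```python
# Timing of the scent dispersion cycle described above: 5 s on, 15 s off,
# repeated for the ~4-minute experience.

ON_S, OFF_S = 5, 15      # spray / rest durations in seconds
EXPERIENCE_S = 4 * 60    # total experience length in seconds

def spray_schedule(on_s=ON_S, off_s=OFF_S, total_s=EXPERIENCE_S):
    """Return (start, stop) times of each spray burst within total_s."""
    bursts, t = [], 0
    while t < total_s:
        bursts.append((t, min(t + on_s, total_s)))
        t += on_s + off_s
    return bursts

bursts = spray_schedule()
scented = sum(stop - start for start, stop in bursts)
print(len(bursts), scented)  # 12 bursts, 60 s of scent in a 240 s run
```

With a 20-second period, the participant smells the fragrance for a quarter of the experience, which matches the goal of keeping the scent present but not overwhelming.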

4. Future Directions

The next step in continuing this research would be to use more perfumes in the VR interaction. Expanding from only Fresh to also include Floral, Oriental (warm and spicy), and Woody scents would offer more data about responses to smells that people like and dislike.

Afterwards, expanding to a more dynamic set of scents would also be a possible future direction. By implementing scents that correspond to one's environment, users would gain an even more immersive experience. Smells in an environment could be identified by an electronic nose (E-nose) and then transferred to another user with a system similar to ours, but the actual replication of scents, as well as the current capabilities of E-noses, limits this process. Future advancements in these areas would be needed to include dynamic scents in VR or any other platform.

5. Acknowledgements

We would like to acknowledge Professor Tsachy Weissman of Stanford's Electrical Engineering Department and head of the Stanford Compression Forum for his guidance and help with the project. We also want to thank Devon Baur for mentoring our group and helping us design the project, access and obtain materials, and run the experiment. Additionally, we want to thank Professors Subhashish Mitra, Gordon Wetzstein, and Debbie Senesky for all of their help with the project. We also want to thank the OWidgets group in London for giving us a scent machine and helping us set it up and troubleshoot. Lastly, we want to thank the Institute of Art and Olfaction in Los Angeles for providing us with the scents we used for the project.

6. References

[1] Vermetten E & Bremner JD. Olfaction as a traumatic reminder in posttraumatic stress disorder: case reports and review. The Journal of Clinical Psychiatry 64 (2003), 202-207.

[2] Camille Ferdenzi, S. Craig Roberts, Annett Schirmer, Sylvain Delplanque, Sezen Cekic, Christelle Porcherot, Isabelle Cayeux, David Sander, Didier Grandjean. Variability of Affective Responses to Odors: Culture, Gender, and Olfactory Knowledge. Chemical Senses, Volume 38, Issue 2, February 2013, Pages 175-186, https://doi.org/10.1093/chemse/bjs083

[3] Bhana, Yusuf. “Localising Smells: How Different Cultures Perceive Scents.” ​TranslateMedia​, 7 Dec. 2017, www.translatemedia.com/translation-blog/localising-smells-different-cultures-perceive-scents/

[4] Mensing J., Beck C. (1988) The psychology of fragrance selection. In: Van Toller S., Dodd G.H. (eds) Perfumery. Springer, Dordrecht

[5] Ahrndt, Sebastian & Fähndrich, Johannes & Albayrak, Sahin. (2015). Modelling of Personality in Agents: From Psychology to Implementation.