On Entropy and Investment Theory

EE376A (Winter 2019)

by Rifath Rashid

It’s a hot and humid summer’s day in New York, and you’ve just entered the Belmont racetrack for the Belmont Stakes, the final leg of the prestigious Triple Crown in thoroughbred horse racing.  Today, there are five horses on the track, and you have $100 in your pocket that you’re looking to spend on bets.  As the tumultuous crowd cheers and the announcer’s voice crackles overhead, you wonder, how should you bet? 

This is arguably the most important question in a gambling decision.  You might imagine that there are infinitely many scenarios here, all dependent on too many factors to count.  What horses are racing?  Who are the jockeys?  What was each horse fed in the morning?  What will the weather be like in an hour and which horses are best equipped for that weather?

One way to make your decision, it turns out, is to turn to a concept you might have heard in your middle school science classroom. Remember the phrase "the entropy of the universe is always increasing"? This statement, which derives its truth from the second law of thermodynamics, has likely passed through one ear and out the other for generations of chemistry students, many of whom never end up in careers in science. Yet the notion of entropy crops up in many fields and situations outside the walls of the 7th grade physical science classroom.

Take, for instance, the role of entropy in the field of investment theory.  Entropy underpins many studied and successful betting strategies, like Kelly proportional investing, and is also a critical element in understanding modern decision-making algorithms, some of which have shaped the field of Artificial Intelligence.  In the real world, most investment strategies depend on exploiting side information in some useful way, and understanding entropy gives us a framework for figuring out how to value this information.

Entropy is most commonly explained as the level of unpredictability of something you're trying to measure.  In the case of the racetrack, what you're trying to gauge is the outcome of the race. If you walk in and all the horses appear to be in perfect shape, rearing at the sound of the crowds as their jockeys hype them up for a win, it might be hard to tell which horse is a winner. In investment theory, this situation is said to have high entropy: given that all horses have a good chance of winning, the outcome of the race is hard to predict.   Now, imagine that moments after you walk in, freak lightning strikes every horse but one.  The organizers of the Belmont racetrack (aware of where they placed their own bets) yell, "The race must go on!"  A situation like this, where it's easy to predict the winner, is said to have low entropy.
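The two extremes can be made concrete with a few lines of code. This is a minimal sketch (the win probabilities are made up for illustration) that computes the Shannon entropy, in bits, of each race's outcome:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Five evenly matched horses: the outcome is maximally unpredictable.
high = entropy_bits([0.2] * 5)

# One heavy favorite left standing: the outcome is nearly certain.
low = entropy_bits([0.96, 0.01, 0.01, 0.01, 0.01])
```

The evenly matched race comes out to log2(5) ≈ 2.32 bits, while the post-lightning race is a small fraction of a bit.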

Now imagine you’re somewhere in between the two extreme scenarios.  That is, some horses look healthy and ready to win, while others seem not so interested in the race.  How will you place your bets?  Your first instinct might be to take a hard look at each horse and their past races to figure out which horse has the best chance of winning.  From looking at the results of the last three races, you might convince yourself that the horse “Hailey’s Charm” is most likely to win, and you decide to place all your money on her.  If you lose, you lose.  If you win, you really win.

This strategy makes sense on an intuitive level, but investment schemes like this are often considered sub-optimal. Although you'll get a bigger payout on the races where you get lucky, you'll get lucky on relatively few of them, leaving you with middling overall winnings across many races. A good investment scheme accounts for the fact that outcomes are probabilistic events, and helps you accrue the biggest gain over a large number of events.

In order to come up with a good investment scheme that beats the odds, it's important to consider something called a doubling rate. The doubling rate of an investment scheme is the exponential rate at which your wealth grows, averaged over all possible outcomes of the event you're betting on.  For instance, if your doubling rate is 1, then with each race you bet on using the same investment strategy, you're expected to double your investment.  But a caveat to the terminology: "doubling" doesn't always refer to gains. If your betting strategy is sub-optimal, your doubling rate can be negative, which means that your investment is actually shrinking with every race.

In 1956, John Kelly, a researcher at Bell Labs, showed that the optimal investment strategy in situations like our horse race is something called proportional investing. That is, the best doubling rate is achieved when you place bets on each horse in proportion to the probability that it will win.  Originally, you might have ignored some horse because it didn't look so peppy, and so you decided to skip putting any money on it at all.  What Kelly showed is that in the long run, your best option is to still place some money on that horse, even if you personally don't think it is going to win.

The doubling rate of a Kelly proportional investing scheme ultimately boils down to two things: the expectation of return and the entropy of the underlying event.  Here, the expectation of return is simply how much you expect to make on the horse race as a multiple of the wealth you invested on each horse.  The doubling rate, however, is negatively related to the entropy of the underlying event: if every horse has the same chance of winning (a high entropy situation), then we probably won't get that great of a return. This relationship between doubling rate and entropy means that before you can head to the racetrack and start proportionally investing all of your money, you need to know the probability that each horse will win, because these probabilities define the entropy of the race's outcome.
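The entropy connection can be checked numerically. In this sketch (the probabilities and the uniform 5-for-1 odds are invented for illustration), the doubling rate is the expected log2-growth of wealth per race; with fair uniform odds, proportional betting achieves exactly log2(5) minus the entropy of the win probabilities, and it beats an allocation tilted toward the favorite:

```python
import math

def doubling_rate(bets, probs, odds):
    """W(b, p) = sum_i p_i log2(b_i o_i): expected log2 wealth growth per race."""
    return sum(p * math.log2(b * o) for p, b, o in zip(probs, bets, odds) if p > 0)

probs = [0.4, 0.25, 0.15, 0.15, 0.05]   # hypothetical win probabilities
odds = [5] * 5                           # uniform 5-for-1 odds

w_kelly = doubling_rate(probs, probs, odds)                      # bet in proportion to p
w_skewed = doubling_rate([0.6, 0.2, 0.1, 0.05, 0.05], probs, odds)  # overbet the favorite

# With fair uniform odds, W = log2(m) - H(p): doubling rate falls as entropy rises.
entropy = -sum(p * math.log2(p) for p in probs)
```

Here `w_kelly` equals `math.log2(5) - entropy`, making the "doubling rate trades off against entropy" claim explicit.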

The difficult reality is that it's impossible to know exactly what each horse's winning probability is, because there are simply too many uncertainties at play and too much information to possibly consider: the horse's morning regimen, the horse's focus at the start of the race, the current weather, and so on. But we can try to focus on the most relevant information.  How might we understand which pieces of information are useful and which are not?   Consider weather information. If our favorite meteorologist just called and told us that it is almost certain to rain today, it might not be obvious whether this information gives us a better or worse understanding of which horses will win.  But if we know that some horses perform very well in rain while others simply can't function in it, then knowing about the rain is incredibly valuable because it helps us better understand the outcome of the event.  In this case, having additional, or "side," information about the weather decreases the entropy of our situation, since we have a better idea of which horse might win.  Conversely, if all horses become unpredictable in the rain, then basing our decisions on the weather might adversely affect our understanding of the race.  Either way, it's clear that weather could be an important factor. It turns out that there's a special type of entropy, called conditional entropy, which can tell us how much a piece of side information, like weather, is actually useful for our task of betting on horses.
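A toy two-horse model (all numbers invented) shows how a forecast lowers entropy. The gap H(winner) − H(winner | weather) is the conditional-entropy reduction, i.e., the mutual information between weather and outcome, and it is never negative:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative model: a mudder vs. a fair-weather sprinter.
p_rain = 0.5
p_win_given_rain = [0.9, 0.1]   # mudder dominates in rain
p_win_given_sun = [0.2, 0.8]    # sprinter dominates in sun

# Marginal win probabilities without any forecast.
p_win = [p_rain * r + (1 - p_rain) * s
         for r, s in zip(p_win_given_rain, p_win_given_sun)]

h_x = H(p_win)                                                   # entropy, no forecast
h_x_given_y = p_rain * H(p_win_given_rain) + (1 - p_rain) * H(p_win_given_sun)
info_value = h_x - h_x_given_y                                   # mutual information I(X;Y)
```

In this toy setup the forecast removes roughly 0.4 bits of uncertainty about the winner.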

In 2011, researchers at Stanford generalized this paradigm to the multivariate case, where we have many outcomes we care about and multiple sources of side information.  A big takeaway from their work is the realization that the strength of the relationship between our sources of side information and our events sets an upper bound on the growth of our investments.  Upper bounds such as this one are interesting because they give us reasonable guarantees to aim for.

Entropy is integral to our understanding of how probabilistic systems and events work.  Every day, we make judgements and bets on outcomes that we may not completely understand, but advancements in information theory have helped to delineate the underlying mathematical nature of these decisions.


I pursued two projects this quarter, and focused my outreach efforts on my other group project “On Entropy in NBA Basketball.” Please check it out! 

Sensory Processing in the Retina

EE376A (Winter 2019)

By Saarthak Sarup, Raman Vilkhu, and Louis Blankemeier

Modeling the Vertebrate Retina

“To suppose that the eye with all its inimitable contrivances for adjusting the focus to different distances, for admitting different amounts of light, and for the correction of spherical and chromatic aberration, could have been formed by natural selection, seems, I confess, absurd in the highest degree…The difficulty of believing that a perfect and complex eye could be formed by natural selection, though insuperable by our imagination, should not be considered subversive of the theory.” – Charles Darwin

The sentiment behind this quote truly captures the miracle that is the human eye… even Darwin had to specifically address its unique case in his argument for natural selection. In this project, we aim to capture some of this complexity in the structure and function of the eye and study it under the lens of information theory.

A short biology lesson…

Figure 1: Anatomy of the human eye, with exploded view of the retina. [6]

The human eye is composed of various complex structures; however, the one studied and modeled throughout this blog is the retina. The retina is responsible for converting the light that enters through the cornea into electrical signals sent to the visual cortex in the brain via the optic nerve. As seen in Figure 1, the retina lies in the back of the eye and can be further decomposed into layers of neurons and photoreceptors. As a brief introduction to the biology, light passes through the layers from the leftmost to the rightmost: nerve fiber layer (RNFL), retinal ganglion cells (RGC), amacrine cells (AC), bipolar cells (BC), horizontal cells (HC), photoreceptors (PR), and the retinal pigment epithelium (RPE). For the scope of this blog, it is sufficient to have a high-level understanding that the retina translates varying light responses into electrical spikes (known as action potentials) which are transmitted along the neurons in the optic nerve to the visual cortex, where perception takes place. For those interested in these specific retina layers and their unique contributions to this encoding process, please refer to [4].

Communicating through the eye

The primary objective of the project was to model the eye under the lens of information theory and see what analytical results can be derived to answer questions such as: what is the maximum rate achievable by the eye? What is the best model for the encoding/decoding done in the human visual system? In order to study these overarching questions, we had to start by modeling the system as an information processing channel.

Figure 2: Vision mapped as an information processing problem.

First, the eye takes in a light response captured at the cornea and encodes it into electrical spikes defined by a spiking rate. In an information theory sense, this is equivalent to encoding a source using some encoder that takes intensity and maps it to discrete spikes. Here, as seen in Figure 2, we assume that the eye does rate-encoding, where the majority of the useful "information" is captured by the spiking rate rather than by precise spatio-temporal information about each individual spike. This assumption is heavily debated in the literature, and the true encoding scheme used by the eye remains a mystery; it is nonetheless a good starting point. Additionally, for the scope of our project, we generated the visual inputs with known statistics matching those of natural scenes, as discussed in later sections. Therefore, mapping the biology to the information theory, the encoder in the system models the retina: it maps light responses to spikes.

In the visual system, the communication channel is represented by the optic nerve, which runs from the retina to the visual cortex in the brain. In our model, this communication channel is modeled as a parallel, inhomogeneous Poisson channel with some additive Poisson noise. Specifically, the number of parallel channels corresponds to the number of retinal ganglion cells connected to the optic nerve bundle, each transmitting some spike rate on the channel.

Finally, in our model, the communication channel feeds into a decoder which implements a strategy to capture the spike rates and reconstruct the perceived visual inputs from these estimated rates. In biology, this corresponds to the visual cortex of the brain, which perceives the spikes as some visual image.

Overall, this general framework of modelling the visual system as an information theory problem is particularly interesting because we can apply well-known information theory techniques to gain insights into otherwise mysterious processes.

Joint Source-Channel Coding: A Review

Our model of the early visual system depicts both compression via neural downsampling and communication via Poisson spiking. Altogether, our work fits into the joint source-channel coding framework. According to Shannon's separation theorem, the optimal strategy in this framework is to independently optimize both our source coding compression scheme and our channel coding communication protocol.

Source coding for optimal compression

In source coding, we deal with a sequence of inputs $latex X_1,X_2,\dots,X_n \sim p(x), x\in \mathcal{X}$, which is encoded into an index $latex f_n(X^n)\in \{1,2,\dots,2^{nR}\}$ and then decoded into an estimate $latex \hat{X}^n$ from the output alphabet $latex \hat{\mathcal{X}}$. This procedure is outlined in Figure 3 below. An important quantity in this framework is the compression rate $latex R$, defined as the average number of bits per input symbol. According to Shannon's source coding theorem, it is impossible to perfectly compress input data such that the compression rate is less than the entropy of the source. In the case where lossy compression is tolerated, the tradeoff between minimizing the compression rate and minimizing the loss is given by the rate-distortion curve. Points along this curve $latex R(D)$ define the smallest achievable compression rates (largest compression factors) for a given distortion $latex D$.

Figure 3: Standard source coding framework

Channel coding for reliable communication

Channel coding consists of finding a set of channel input symbol sequences such that each sequence is sufficiently "far" from the others and each input sequence maps to a disjoint set of output sequences. In this case, the channel input sequence can be decoded without error. Shannon showed that for every channel, there exists some channel code such that $latex \max_{p(x)} I(X;Y)$ bits can be sent per channel use; $latex \max_{p(x)} I(X;Y)$ is known as the channel capacity. This idea can be generalized to a channel which takes a continuous waveform as input. In this case, channel coding consists of finding waveforms, which last for duration $latex T$, and are spaced sufficiently far apart such that they can be decoded without error. This is shown in Figure 4 for the case of four independent Poisson channels. Here, the codebook consists of waveforms which take the form of pulse modulated signals. As discussed in the sections "Channel Capacity under Poisson Noise" and "A Mathematical Aside", a code generated with such waveforms can reach capacity. Thus, each continuous input is mapped to a pulse modulated signal which is then transmitted across the Poisson channel.

Figure 4: Channel coding framework

Compressing a Gaussian Source with Finite Bandwidth

How can we statistically characterize our visual inputs? In the space of continuous and unbounded alphabets, Gaussian distributions maximize the available entropy under a bounded power (technically, second moment) constraint. As a result, we take a Gaussian source to be a general description of our 2D image data. To model a general bandwidth assumption on our images, we employ a shift-invariant autocorrelation function given by the squared-exponential kernel $latex K(\textbf{x}-\textbf{x}') = \exp\Big(\frac{-|\textbf{x}-\textbf{x}'|^2}{2l^2}\Big)$. This Gaussian process allows us to draw random $latex S\times S$ images as $latex S^2$-length vectors from $latex \mathcal{N} \big(\textbf{0}, \Sigma_l \big)$, where the covariance matrix $latex \Sigma_l$ is parameterized by the length constant $latex l$. This process gives a non-uniform power spectrum with the analytic form given below:

$latex S(f) = \sqrt{2\pi l^2}\exp(-2\pi^2 l^2 f^2)$

To further simplify our analysis, we consider images in gray-scale, as color representation in the retinal code can be handled somewhat orthogonally to intensity by color-tuned photoreceptors and retinal ganglion cells. Example frames from this procedure can be seen in Fig. 5 below.
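The sampling procedure described above can be sketched in a few lines. This is an illustrative implementation (the grid size, length constant, and jitter term are our own choices, not values from the project):

```python
import numpy as np

S, l = 16, 0.1                      # image side (pixels) and length constant (illustrative)
grid = np.linspace(0, 1, S)
coords = np.stack(np.meshgrid(grid, grid), -1).reshape(-1, 2)

# Squared-exponential kernel K(x, x') = exp(-|x - x'|^2 / (2 l^2))
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
Sigma = np.exp(-d2 / (2 * l ** 2)) + 1e-8 * np.eye(S * S)   # small jitter for stability

# Draw one S*S image as an S^2-length vector from N(0, Sigma_l).
rng = np.random.default_rng(0)
frame = rng.multivariate_normal(np.zeros(S * S), Sigma).reshape(S, S)
```

Larger `l` yields smoother frames, matching the progression in Figure 5.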

Figure 5: Typical frames for this isotropic Gaussian process with $latex l=0.05$, $latex l=0.1$, and $latex l=0.2$ from left to right

How good can we theoretically do?

When compressing a memoryless Gaussian source with a uniform power spectrum $latex \sigma^2$, the maximum achievable compression rate for a minimum mean-squared distortion $latex D$ is given by the well-known expression, $latex R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$.
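In code this expression is a one-liner; a small sketch with unit variance assumed:

```python
import math

def rate_gaussian(sigma2, D):
    """R(D) = 1/2 log2(sigma^2 / D) bits per symbol, for 0 < D <= sigma^2."""
    return 0.5 * math.log2(sigma2 / D) if D < sigma2 else 0.0

# Halving the tolerated distortion costs exactly half a bit per symbol:
r1 = rate_gaussian(1.0, 0.25)    # 1.0 bit
r2 = rate_gaussian(1.0, 0.125)   # 1.5 bits
```

Note that at `D = sigma2` the rate is zero: simply outputting the mean already achieves that distortion.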

For our non-memoryless case, the optimal compression strategy can be clarified in the spectral domain. Here, the single spatially-correlated source can be decomposed into infinitely many parallel independent sources, each with variance given by the power $latex S(f)$. This transformation inspires a reverse water-filling solution [7], where we use a given distortion budget $latex D$ to determine a threshold $latex \theta$, such that compression is done by only encoding the components where $latex S(f)>\theta$. In this parametric form, the new rate-distortion curve is given by the following equations:

$latex D(\theta) = \int_{-\frac{1}{2}}^{\frac{1}{2}}\min(\theta,S(f)) \ df$

$latex R(\theta) = \int_{-\frac{1}{2}}^{\frac{1}{2}}\max\Big(0,\frac{1}{2}\log\frac{S(f)}{\theta}\Big) \ df$

To reiterate an important point, these $latex R(D)$ curves depend on the length scale $latex l$ of the source.  In the limit where $latex l \rightarrow \infty$, our image becomes uniform and its power spectrum is concentrated at $latex f=0$.  This source distribution yields the lowest rate-distortion curve, since all the input pixels can be encoded with a single uniform intensity and all the distortion is concentrated in the quantization of that one value.  In the limit where $latex l\rightarrow 0$, our bandwidth becomes unbounded and we recover the white Gaussian source, with a constant power spectrum $latex S(f) = \sigma^2$.  At this extreme, each input pixel is uncorrelated with the others, so under the same fixed distortion $latex D$ each symbol must absorb its own share of the distortion budget, forcing a more aggressive quantization and a higher rate.  This intuition is validated in Fig. 6a, where the rate-distortion curves are computed for a variety of length scales.
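The parametric curve above can be evaluated numerically. A sketch using a plain Riemann sum (the grid resolution and the example values of the threshold are arbitrary choices for illustration):

```python
import math

def spectrum(f, l):
    """Power spectrum S(f) of the squared-exponential kernel."""
    return math.sqrt(2 * math.pi * l ** 2) * math.exp(-2 * math.pi ** 2 * l ** 2 * f ** 2)

def rd_point(theta, l, n=2001):
    """Riemann-sum evaluation of the parametric (D(theta), R(theta)) pair over [-1/2, 1/2]."""
    df = 1.0 / n
    D = R = 0.0
    for i in range(n):
        f = -0.5 + (i + 0.5) * df
        S = spectrum(f, l)
        D += min(theta, S) * df                             # distortion integrand
        R += max(0.0, 0.5 * math.log2(S / theta)) * df      # rate integrand
    return D, R

d_hi, r_hi = rd_point(0.01, l=0.2)   # small theta: low distortion, high rate
d_lo, r_lo = rd_point(0.30, l=0.2)   # large theta: high distortion, low rate
```

Sweeping `theta` traces out the whole curve, which is how a plot like Fig. 6a could be generated.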

Figure 6: (a) Rate-distortion curve for various length scales $latex l$. (b) Shaded region describes achievable compression rates

How good do we practically do?

Given these fundamental limits, we now describe our neural compression scheme and compare our results. In the pioneering work by Dr. Haldan Hartline in 1938 [3], researchers discovered evidence of a receptive field in retinal ganglion cells, corresponding to a particular region of sensory space in which a stimulus would modify the firing properties of the neuron. In more mathematical terms, this property is modeled through a convolutional spatial filter which the neuron applies to its input to determine its response.

We thus interpret each pixel value as the encoded firing intensity of the primary photoreceptors, which we compress by applying spatial filters $latex k(x-x_c,y-y_c)$ at each neuron location $latex (x_c,y_c)$. Our model uses an isotropic Gaussian for each neuron’s spatial filter, producing an output $latex r(x_c,y_c) = k(x-x_c,y-y_c) * I(x,y)$ which is used as the neuron’s firing rate. If the neurons are spaced $latex \Delta x$ pixels apart, a $latex S\times S$ image is compressed into a $latex \frac{S}{\Delta x}\times\frac{S}{\Delta x}$ output, corresponding to a compression rate of $latex \frac{1}{\Delta x^2}$.
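A minimal sketch of this encoder follows. The filter width, sigma, and neuron spacing are illustrative choices, and edge padding is an implementation detail not specified above:

```python
import numpy as np

def gaussian_kernel(w, sigma):
    """Normalized isotropic Gaussian receptive field on a w x w patch."""
    ax = np.arange(w) - (w - 1) / 2
    k = np.exp(-(ax[None, :] ** 2 + ax[:, None] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def retina_encode(image, dx, w=5, sigma=1.0):
    """Firing rates: each neuron applies its receptive field on a grid spaced dx apart."""
    k = gaussian_kernel(w, sigma)
    S = image.shape[0]
    pad = w // 2
    padded = np.pad(image, pad, mode="edge")
    centers = range(0, S, dx)
    return np.array([[(padded[y:y + w, x:x + w] * k).sum() for x in centers]
                     for y in centers])

img = np.random.default_rng(1).normal(size=(16, 16))
rates = retina_encode(img, dx=2)     # 16x16 -> 8x8, i.e., compression rate 1/4
```

With neurons every `dx = 2` pixels, the output has `(S/dx)^2` rates, matching the `1/dx^2` compression rate stated above.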

To measure the distortion of this compression scheme, we propose decoding schemes for reconstructing the image. These schemes are not biologically motivated, as it's unclear what neural process, if any, attempts to reconstruct the visual inputs to the photoreceptors. It is nevertheless instructive to consider how our biologically-inspired encoding scheme could be reversed. A simple procedure is one which deconvolves a neuron's firing rate $latex r$ with its receptive field, producing an estimate $latex \hat{I}(x,y) = \sum_{(x_c,y_c)} r(x_c,y_c)k(x-x_c,y-y_c)$. This decoding procedure is nearly optimal under the fewest assumptions on the input source distribution, as is shown in our appendix below. A more complex procedure which is allowed perfect knowledge of the distribution's covariance and global firing rate information can decode with much lower distortion, though its estimate $latex \hat{I}(x,y)$ is difficult to express analytically and is instead the solution to a convex optimization problem. Its derivation is tied closely to the minimization of the squared Mahalanobis distance $latex d_M(\textbf{x};\mu, \Sigma)^2 = (\textbf{x}-\mu)^T\Sigma^{-1}(\textbf{x}-\mu)$, and is also shown more explicitly in the appendix below. An example compression and decompression is shown in Fig. 7.

Figure 7: Example compression and reconstruction using the naive and optimal decoders

Are we even measuring this right?

Commonly used image fidelity metrics, such as mean-squared error (MSE), are simple to calculate and have clear physical meaning; however, these metrics often do not reflect perceived visual quality. Therefore, modern research has focused on fidelity metrics that incorporate high-level properties of the human visual system. One such metric is known as the structural similarity (SSIM) metric. For the scope of this blog post, it is enough to know that SSIM incorporates image properties such as averages, variances, covariances, and dynamic range in order to better gauge perceived similarity between images.
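As a rough illustration, here is a simplified single-window SSIM computed from global statistics (real SSIM averages the index over local windows; the constants `k1`, `k2` follow the common convention, and the pixel values are invented):

```python
import statistics

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM over two equal-length pixel lists, using global statistics."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = statistics.fmean([(a - mx) ** 2 for a in x])
    vy = statistics.fmean([(b - my) ** 2 for b in y])
    cov = statistics.fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

img = [0.1, 0.5, 0.9, 0.4, 0.3, 0.7]
same = ssim_global(img, img)                        # identical images: SSIM = 1
shifted = ssim_global(img, [p + 0.2 for p in img])  # mean shift lowers SSIM a little
```

Note how a pure brightness shift only penalizes the luminance term while leaving contrast and structure intact, unlike MSE, which penalizes it heavily.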

Figure 8: SSIM vs MSE. [8]

Channel Capacity under Poisson Noise

After compression, how well can this information be transmitted?

Here we outline some key results from Wyner [1][2]. The Poisson channel takes as input a continuous waveform $latex \lambda(t)$ with domain $latex 0\leq t < \infty$. We assume the following constraints: $latex 0\leq \lambda(t) \leq A$ and $latex (1/T)\int_0^T \lambda(t)dt \leq \sigma A$, where $latex 0<\sigma \leq 1$. The channel output is a Poisson process with intensity $latex \lambda(t)+\lambda_0$, where $latex \lambda_0$ represents additive Poisson noise. $latex \nu (t)$ defines a Poisson counting function which gives the number of spikes received before time $latex t$. More specifically, $latex \nu (t)$ is a staircase function where each step represents detection of a spike at the output. We then have that $latex Pr\{\nu (t+\tau)-\nu (t)=j\}=\frac{e^{-\Lambda}\Lambda^j}{j!}$ where $latex \Lambda=\int_t^{t+\tau}(\lambda(t')+\lambda_0)dt'$.

We now describe a general Poisson channel code. The input codebook consists of a set of $latex M$ waveforms $latex \lambda_m(t)$, $latex 0\leq t \leq T$, which satisfy $latex 0\leq \lambda_m(t) \leq A$ and $latex (1/T)\int_0^T \lambda_m(t)dt \leq \sigma A$. The output alphabet consists of the set $latex S(T)$, where $latex S(T)$ is the set of all $latex \nu (t)$ for $latex 0\leq t \leq T$. The channel decoder is the mapping $latex D: S(T) \rightarrow \{1,2,\dots,M\}$. The output of the channel takes on a value $latex \nu_0^T$ from the output alphabet $latex S(T)$; $latex \nu_0^T$ is a staircase function with domain $latex [0,T)$. We then have an error probability given by $latex P_e = \frac{1}{M}\sum _{m=1}^M Pr\{D(\nu_0^T)\neq m\}$. The rate $latex R$ of the code, in bits per second, is therefore $latex \frac{\log M}{T}$.

We say that $latex R$ is achievable if for all $latex \epsilon>0$ there exists a code such that $latex M\geq 2^{RT}$ and $latex P_e \leq \epsilon$. The channel capacity is then the supremum of all such achievable rates. For codes satisfying these constraints, the capacity is:

$latex C=A[q^*(1+s)\log (1+s)+(1-q^*)s \log s- (q^* + s) \log (q^* + s)] $

Here, $latex s=\lambda_0/A$, $latex q^*=\min\{\sigma,q_0(s)\}$, and $latex q_0(s)=\frac{(1+s)^{1+s}}{e\,s^s}-s$.
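This capacity expression can be evaluated directly. A minimal sketch in natural logarithms (so the result is in nats per second); the parameter values are illustrative, and we take q0(s) = (1+s)^{1+s}/(e s^s) − s, the stationary point of the bracketed expression with respect to q:

```python
import math

def poisson_capacity(A, lam0, sigma):
    """Capacity (nats/s) of the peak- and average-power-limited Poisson channel."""
    s = lam0 / A
    q0 = (1 + s) ** (1 + s) / (math.e * s ** s) - s
    q = min(sigma, q0)
    return A * (q * (1 + s) * math.log(1 + s)
                + (1 - q) * s * math.log(s)
                - (q + s) * math.log(q + s))

c_noisy = poisson_capacity(A=100.0, lam0=10.0, sigma=0.5)
c_quiet = poisson_capacity(A=100.0, lam0=1.0, sigma=0.5)
```

As expected, lowering the dark-noise intensity `lam0` raises the capacity, which is the trend plotted in Figure 9c.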

Figure 9: (a) Channel capacity v. $latex A$, (b) Channel capacity v. $latex \sigma$, (c) Channel capacity v. $latex \lambda_0$, (d) Channel capacity v. number of neurons

But what if we had more noise?

Interestingly, work by Frey [5] has shown that in the presence of random, nondeterministic background noise $latex \lambda_0$, feedback can improve channel capacity (under mild assumptions on the distribution of the noise). This is surprising because the Poisson channel is still memoryless, and in general, feedback cannot increase the capacity of a memoryless channel. Nevertheless, biologically speaking, this is a more realistic setting, given that seemingly stochastic vesicle release appears to inject random noise into the propagation of spikes within the nervous system. This is thus a particularly encouraging result from information theory: recurrent connections and backpropagating dendritic action potentials are a common feature within biological neural networks, and may jointly be facilitating this feedback mechanism to counter the intrinsic background noise.

A Mathematical Aside

For the more mathematically inclined:

Optimal decoders

We first set up the problem. Each $latex S\times S$ image frame $latex I$ is vectorized into an $latex S^2$-length vector $latex f$. The receptive field of neuron $latex i$, a $latex w\times w$ patch centered at its location $latex (x_c,y_c)$ within the $latex S\times S$ frame, is also vectorized into an $latex S^2$-length vector $latex k_i$. In this framework, we can compute the rates of all $latex N$ neurons by arranging the $latex k_1 \dots k_N$ receptive field vectors into a matrix $latex C \in \mathbb{R}^{N\times S^2}$ such that the rates $latex r$ are computed as $latex r = Cf$.

This operation necessarily loses information as the solution to $latex r=C\hat{f}$ is underdetermined. A common technique in probabilistic inference is determining the maximum likelihood estimate (MLE). Since our source is a Gaussian source with $latex \mu =0$ and covariance $latex \Sigma_l$ (where the subscript $latex l$ designates its parametrization by the length constant $latex l$), we write out our log-likelihood function $latex L(\mu,\Sigma)$:

$latex L(0,\Sigma_l) = -\frac{1}{2}\log(|\Sigma_l|) - \frac{1}{2}x^T\Sigma_l^{-1}x - \frac{S^2}{2}\log(2\pi)$

Within this expression we recognize the square of the Mahalanobis distance $latex d$, which can be considered an expression of the distance between a point $latex x$ to the distribution $latex \mathcal{N}(0, \Sigma_l)$. Rewriting this expression along with a constant term $latex c = \frac{1}{2}\log(|\Sigma_l|) + \frac{S^2}{2}\log(2\pi)$, we get:

$latex L(0,\Sigma_l) = -\frac{1}{2}d^2 - c$

With this, we see we can maximize the log-likelihood by minimizing the squared Mahalanobis distance $latex d^2(\hat{f};0,\Sigma_l) = \hat{f}^T\Sigma_l^{-1}\hat{f}$ subject to the constraint $latex C\hat{f} = r$. Moreover, this is a simple convex optimization problem, since the matrix $latex \Sigma_l^{-1}$ is a positive definite precision matrix.

$latex \min_f \quad f^T\Sigma_l^{-1}f$

$latex \text{s.t.} \quad Cf = r$

This decoder is optimal under the assumption that the source is a stationary Gaussian distribution with $latex \mu=0$, covariance $latex \Sigma_l$, and the decoder has perfect knowledge of the distribution. We also consider the case where the decoder makes no assumptions on a finite bandwidth of the source, i.e., $latex \Sigma_l \rightarrow \mathbb{I}$. The resulting objective corresponds to a least-norm problem on $latex f$:

$latex \min_f \quad f^T f$

$latex \text{s.t.} \quad Cf = r$

This optimization problem has an explicit solution $latex f = C^T(CC^T)^{-1}r$, which we now compare to our naive deconvolution solution $latex \hat{I}(x,y) = \sum_{(x_c,y_c)} r(x_c,y_c) k(x-x_c,y-y_c)$. When vectorized into the format of this framework, this is equivalent to $latex f = C^Tr$. Now the comparison becomes clearer: the least-norm solution and the deconvolution solution are equivalent if $latex CC^T = \mathbb{I}$. This is the case when each neuron's receptive field $latex k_i$ is unit length and pairwise orthogonal ($latex k_i^Tk_j = 0$), or in other words, when the receptive fields have no overlap over the image.
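These closed forms can be checked numerically. A small sketch with toy dimensions, hand-built non-overlapping receptive fields, and a toy covariance of our own choosing (the general MLE solution used here, f = Σ Cᵀ(C Σ Cᵀ)⁻¹ r, is the KKT solution of the equality-constrained problem above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, S2 = 4, 12                         # 4 neurons, 12-pixel vectorized "image"

# Orthonormal, non-overlapping receptive fields: C C^T = I
C = np.zeros((n, S2))
for i in range(n):
    C[i, 3 * i:3 * i + 3] = 1 / np.sqrt(3)

f_true = rng.normal(size=S2)
r = C @ f_true                        # observed firing rates

# Least-norm decoder (no bandwidth assumption): f = C^T (C C^T)^{-1} r
f_ln = C.T @ np.linalg.solve(C @ C.T, r)

# Naive deconvolution: f = C^T r (coincides with least-norm when C C^T = I)
f_deconv = C.T @ r

# General MLE decoder with a known (toy) covariance Sigma_l, via the KKT conditions
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(S2), np.arange(S2)))
f_mle = Sigma @ C.T @ np.linalg.solve(C @ Sigma @ C.T, r)
```

Both decoders reproduce the observed rates exactly (`C @ f == r`), and with orthonormal fields the least-norm and deconvolution estimates agree, as claimed above.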

It is important to note that this decoder is optimal only in the noiseless channel case. In the case where the receiver's estimate $latex \hat{r} \neq r$, the decompression process minimizes the objective over the solution set of $latex Cf = \hat{r}$ and fails to find the appropriate reconstruction. We thus modify our constraints to respect our noisy channel by assuming the true rate $latex r$ falls in the set $latex [\hat{r}-\sigma_{\hat{r}}, \hat{r}+\sigma_{\hat{r}}]$. The new stochastic optimization problem is reformulated below, where the channel noise is known to take a Poisson distribution (so $latex \sigma_{\hat{r}} = \sqrt{\hat{r}}$).

$latex \min_f \quad f^T\Sigma_l^{-1}f$

$latex \text{s.t.} \quad \hat{r}-\sqrt{\hat{r}} \leq Cf \leq \hat{r}+\sqrt{\hat{r}}$

Channel capacity lower bound

We now derive an expression for the lower bound of the Poisson channel capacity [1][2]. The main idea of this proof is that we can analyze $latex \max_{p(x)} I(X;Y)$ when $latex X$ is constrained to a subset of all possible inputs to the Poisson channel. Clearly, if this constraint is lifted, the same distribution over the inputs can be achieved by setting $latex p(x)=0$ for any $latex x$ which is not a member of the constrained subset. Therefore, $latex \max_{p(x)} I(X;Y)$ when $latex X$ is constrained to a subset of inputs to the channel is a lower bound on the capacity of the Poisson channel.

Specifically, we impose the constraint that $latex \lambda(t)$ is piecewise constant so that it takes one of two values, 0 or $latex A$, over each interval $latex ((n-1)\Delta,n\Delta]$. We then define the discrete signal $latex x_n$ such that $latex x_n = 1$ if $latex \lambda(t)=A$ in the interval $latex ((n-1)\Delta,n\Delta]$; otherwise, $latex x_n=0$. Additionally, we assume that the detector sees only the values $latex \nu(n\Delta)$. Thus, it sees the number of spikes received in the interval $latex ((n-1)\Delta,n\Delta]$, which we define as $latex y_n=\nu(n\Delta)-\nu((n-1)\Delta)$. Although there is a finite probability that $latex \nu(n\Delta)-\nu((n-1)\Delta)\geq 2$, we take $latex y_n = 0$ in all cases where $latex \nu(n\Delta)-\nu((n-1)\Delta)\neq 1$. This is a reasonable approximation for small enough $latex \Delta$, and thus $latex y_n$ is binary valued. We have now reduced the channel to a binary input, binary output channel characterized by $latex P(y_n|x_n)$. Any capacity achieved by this channel will be a lower bound on the capacity achieved by the general Poisson channel, as we are dealing with a restriction of the input and output spaces. However, it will turn out that this bound is tight.

From the Poisson distribution, we have that $latex P(y_n=1|x_n=0)=\lambda_0\Delta e^{-\lambda_0\Delta}$. Since $latex \lambda_0=s A$, this gives $latex P(y_n=1|x_n=0)=s A\Delta e^{-s A\Delta}$. Likewise, $latex P(y_n=1|x_n=1)=(1+s)A\Delta e^{-(1+s)A\Delta}$. Now, the average power constraint takes the form $latex \frac{1}{N}\sum_{n=1}^N x_{mn} \leq \sigma$, where $latex m$ indexes the codeword. Therefore, the lower bound on capacity is given by $latex \max I(x_n;y_n)/\Delta$, where the maximum is taken over all distributions of $latex x_n$. Clearly, $latex p(x_n)$ must satisfy $latex E(x_n)\leq \sigma$. Now we define $latex q=Pr\{x_n=1\}$, $latex a=sA\Delta e^{-sA\Delta}$, and $latex b=(1+s)A\Delta e^{-(1+s)A\Delta}$. We then have that $latex I(x_n;y_n)=h(qb+(1-q)a)-qh(b)-(1-q)h(a)$ where $latex h(\cdot)$ is the binary entropy. We then have $latex \Delta C\geq \max_{0\leq q \leq \sigma} I(x_n;y_n)$. After making a few simplifying approximations, we have an expression which is concave in $latex q$, which we maximize with respect to $latex q$ subject to the constraint $latex q\leq \sigma$. The mutual information is maximized when $latex q=\frac{(1+s)^{1+s}}{e\,s^s}-s$, and we define $latex q_0(s)=\frac{(1+s)^{1+s}}{e\,s^s}-s$. We then have that the capacity is lower bounded by $latex C=A[q^*(1+s)\log (1+s)+(1-q^*)s \log s- (q^* + s) \log (q^* + s)]$ where $latex q^*=\min\{\sigma,q_0(s)\}$.
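The maximizing input distribution can be sanity-checked numerically. This sketch grid-searches the binary-channel mutual information over q with illustrative parameters (s = 1 and a small A·Δ of our choosing), and compares the argmax against q0(s) = (1+s)^{1+s}/(e s^s) − s, the stationary point of the limiting expression:

```python
import math

def h(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def mutual_info(q, s, A_dt):
    """I(x_n; y_n) for the reduced binary-input, binary-output channel."""
    a = s * A_dt * math.exp(-s * A_dt)
    b = (1 + s) * A_dt * math.exp(-(1 + s) * A_dt)
    p1 = q * b + (1 - q) * a
    return h(p1) - q * h(b) - (1 - q) * h(a)

s, A_dt = 1.0, 1e-3                      # illustrative: s = lambda_0 / A, A_dt = A * Delta
q_grid = [i / 1000 for i in range(1, 1000)]
q_star = max(q_grid, key=lambda q: mutual_info(q, s, A_dt))

q0 = (1 + s) ** (1 + s) / (math.e * s ** s) - s   # ~0.4715 for s = 1
```

For small A·Δ the grid-search maximizer lands close to q0, consistent with the derivation above.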

Taking our channel to consist of $latex m$ neurons, each modeled as an independent Poisson channel, we have the total capacity $latex m A[q^*(1+s)\log (1+s)+(1-q^*)s \log s- (q^* + s) \log (q^* + s)]$.
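As a sanity check, the bound can also be evaluated numerically by maximizing $latex I(x_n;y_n)$ over $latex q$ on a grid rather than via the closed form. The sketch below uses hypothetical parameter values for A, s, σ, and Δ:

```python
import numpy as np

def h(p):
    # binary entropy in nats, with h(0) = h(1) = 0
    p = np.clip(p, 1e-300, 1 - 1e-16)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# hypothetical parameters: peak rate A, dark-current ratio s,
# duty-cycle constraint sigma, and slot width Delta
A, s, sigma, Delta = 1.0, 0.1, 0.3, 1e-3

a = s * A * Delta * np.exp(-s * A * Delta)              # P(y=1 | x=0)
b = (1 + s) * A * Delta * np.exp(-(1 + s) * A * Delta)  # P(y=1 | x=1)

q = np.linspace(0.0, sigma, 10001)                      # grid respecting E[x] <= sigma
I = h(q * b + (1 - q) * a) - q * h(b) - (1 - q) * h(a)  # I(x;y) per slot, in nats
C_lower = I.max() / Delta                               # nats per unit time
```

This avoids the simplifying approximations entirely: the grid maximum of the exact binary-channel mutual information gives the same lower bound up to grid resolution.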

Outreach Activity

For the outreach activity at Nixon Elementary School, our group decided to create a board game called A Polar (codes) Expedition. The purpose of this outreach activity was to help the kids walk away with the ability to XOR and, furthermore, to use that ability to decode a simple 4-bit polar code.
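For readers curious what decoding a 4-bit polar code with XORs looks like, here is a minimal sketch (not our actual game materials). Over GF(2) the 4-bit polar transform is its own inverse, so in the noiseless case the same XOR circuit both encodes and decodes:

```python
def polar4(u):
    # 4-bit polar transform: x = u * (F kron F) mod 2, where F = [[1,0],[1,1]]
    # is the Arikan kernel; every operation reduces to an XOR
    u1, u2, u3, u4 = u
    return [u1 ^ u2 ^ u3 ^ u4, u2 ^ u4, u3 ^ u4, u4]

# the transform is an involution over GF(2): applying it twice
# recovers the original bits
codeword = polar4([1, 0, 1, 1])   # -> [1, 1, 0, 1]
decoded = polar4(codeword)        # -> [1, 0, 1, 1]
```

In the game, the kids effectively evaluated these XOR equations by hand on sticky-note bits.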

The overarching theme of the board game is that the player is trapped in a polar storm and needs to use polar codes to figure out where the rescue team will arrive. The outreach activity was a huge success (see some pictures from the event below). It was especially amazing to see kids even younger than our anticipated target age (3rd and 4th graders) pick up XOR-ing really fast and walk through the game. Additionally, it was inspiring to see the kids so excited after solving the polar code and then realizing that this decoding scheme is one actually used in modern systems. Overall, our group believes the outreach component of the course was truly a wholesome experience, and we all came out of it hoping to continue taking part in STEM outreach in some capacity throughout the rest of our PhDs. It was special to see the genuine curiosity that the kids had while learning about polar codes through our game, and at the end of the day it really reminded us why we are here to do a PhD in the first place: it boils down to the simple fact that we find these topics to be cool and interesting, and we derive happiness from learning more and more about EE.

The transcript for our board game can be found at the link below; please take a look if you would like to read the story arcs that the kids got to take on their polar expedition: https://docs.google.com/document/d/1gQk1USQOqbW8xdxMENY7QQ4kr4LFIazgFS149aJIhuQ/edit?usp=sharing


[1] A.D. Wyner, “Capacity and error exponent for the direct detection photon channel–Part 1,” in IEEE Trans. Inform. Theory, vol. 34, pp. 1449-1461, 1988.

[2] A.D. Wyner, “Capacity and error exponent for the direct detection photon channel–Part 2,” in IEEE Trans. Inform. Theory, vol. 34, pp. 1462-1471, 1988.

[3] H.K. Hartline, “The response of single optic nerve fibers to illumination of the retina,” in American Journal of Physiology, vol. 121, no. 2, pp. 400-415, 1938.

[4] Kolb H, Fernandez E, Nelson R, editors. Webvision: The Organization of the Retina and Visual System [Internet]. Salt Lake City (UT): University of Utah Health Sciences Center; 1995-.

[5] M. Frey, “Information capacity of the Poisson channel,” in IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 244-256, 1991.

[6] Palanker D – Own work, CC BY-SA 4.0 https://commons.wikimedia.org/w/index.php?curid=49156032.

[7] R. Zamir, Y. Kochman, U. Erez, “Achieving the Gaussian rate-distortion function by prediction,” in IEEE Trans. Inform. Theory, vol. 54, no. 7, pp. 3354-3364, 2008.

[8] Z. Wang et al., “Image Quality Assessment: From Error Visibility to Structural Similarity,” in IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.

Compressing Neuropixel Recordings

EE376A (Winter 2019)

by Benjamin Antin and Anthony Degleris

In a recent conversation with a neuroscience professor at Stanford, we found out that his lab spends over eight thousand dollars per month storing data: a clear case for lossless compression! In the following blog post, we explore lossless compression for high-dimensional neural electrode recordings on a sample dataset from Neuropixels probes. We compare universal compressors, wavelet-based methods, and a simple delta coding scheme, and achieve roughly 3x compression on our dataset. Our code can be found here.

Neuropixel and the data storage problem

There is a growing understanding among neuroscientists that in order to understand the brain, we simply need more data. Since the 1940s, scientists have used recordings from single neurons (so-called single unit recordings) as a way to probe how the brain works. Since the first recordings were made using glass electrodes, the number of neurons from which we are able to record has doubled approximately every 7 years, following a Moore's-law-like pattern [Stevenson 2011, Nature Neuroscience].

The most recent of these advances are Neuropixels probes. These are CMOS-fabricated probes capable of recording from up to 384 channels (and thus from a similar number of individual neurons) simultaneously [Jun et al 2017, Nature]. Prior to this, state-of-the-art neural recording used MEMS-fabricated arrays capable of recording a maximum of around 100 neurons. All this is to say that there is a data revolution going on in neuroscience, and storage will become a top priority.

Why we care about raw data

In many cases, neuroscientists are interested in looking at Action Potentials (APs): spikes in voltage across the cell membrane which are used to signal across long distances. Instead of looking at the raw electrical signals, neuroscientists often split the data into short time-bins (about 1ms long) and binarize the measurements: “1” means that a spike occurred; “0” means that there was no spike. These spikes tend to be sparse, meaning there are far fewer spikes than there are bins. Because of the data's sparsity, binarized spike data is easier to store than raw electrical recordings: we can just store the indices where spikes occur and achieve orders of magnitude compression (10x or possibly even 100x).
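A minimal sketch of this index-based storage idea (the spike positions below are made up for illustration):

```python
import numpy as np

n_bins = 10_000
spikes = np.zeros(n_bins, dtype=np.uint8)
spikes[[5, 137, 4002]] = 1          # hypothetical spike positions

# store only the indices of the spikes instead of the full binary vector
idx = np.flatnonzero(spikes).astype(np.uint32)

dense_bytes = n_bins // 8           # bit-packed binary vector: 1250 bytes
sparse_bytes = idx.nbytes           # 3 spikes * 4 bytes = 12 bytes

# reconstruction from the indices is exact
recovered = np.zeros(n_bins, dtype=np.uint8)
recovered[idx] = 1
```

The sparser the spiking, the bigger the win; with 3 spikes in 10,000 bins the index form is about 100x smaller than even a bit-packed dense vector.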

However, there’s a growing movement toward including the study of Local-Field Potential (LFP) in the study of neural activity and behavior. Local Field Potential can roughly be thought of as the low-frequency components in neural recordings. Whereas APs are all-or-nothing events which travel long distances, LFP is gradated and local. LFP is more poorly understood than Action Potentials, but there is a growing movement to study and understand it [Herreras 2016, Frontiers in Neural Circuits]. The potential role of LFP in future studies motivates our need to store raw neural recordings losslessly, so that the raw experimental data is always available if future studies require it.

The dataset

To explore lossless compression of neural recordings, we use a publicly-available dataset recorded using Neuropixels probes (http://data.cortexlab.net/singlePhase3/). The dataset is a recording from 384 channels in an awake mouse over the course of an hour. The raw electrical signals have been quantized to 16-bit integers, creating a matrix of size 384 by 1.8 million. The data takes up 1.3GB when stored uncompressed. Individual electrode channels are placed along the rows of the data matrix, while the columns of the data matrix are used to index time. From here on, we’ll refer to our data matrix as $latex \mathbf{X}$.

To give the reader a sense of the structure present in these data, a plot is shown below of 20 electrode channels over 2000 time-steps. Since the data is a time series, we see that there is strong correlation across time. Since adjacent electrodes are physically close to one another, we also note strong correlations between adjacent rows.

20 Neuropixel electrode channels over 2000 time steps. Note the correlations.

The time-series nature of our data makes it somewhat similar to EEG data. In the figure below, we show a sample EEG recording from multiple electrodes placed on the scalp.

8 EEG waveforms from electrodes placed on the scalp.

Note that the EEG waveforms also exhibit correlations through time, as well as across adjacent channels. Of course, EEG recordings are not distributed exactly like neural recordings (they tend to have fewer channels, for instance). Still, the two types of data seem similar enough that we'd expect approaches to compressing EEG data to work well for compressing Neuropixel data. One study [Antoniol 1997, IEEE Transactions on Biomedical Engineering] demonstrated roughly 2x compression using simple delta coding combined with Huffman coding. This work, in part, motivates our use of delta coding in the next sections.

Understanding the data

Since the data is quantized as 16-bit integers, each entry can take on $latex 2^{16}$ possible values, ranging from -32,768 to 32,767. However, not all these values are being used. In the below figures, we see that the histogram of raw values is concentrated in about 100 bins. Even better, the histogram of first differences is concentrated in just 40 bins. This is encouraging: we may be able to store the raw values or the first differences using fewer than 16 bits per integer.

As a lower bound on the compression performance, we first calculate the first-order entropy. For this calculation, we treat every entry of X as an i.i.d. sample from a discrete distribution. The first-order entropy calculation gives 3.53 bits per symbol. Based on this calculation, we shouldn't hope for a universal compressor to do better than 306 MB for the entire array. Similarly, a first-order entropy calculation of the difference data gives 3.37 bits per symbol, or 292 MB for the entire file. Can we come close to achieving these rates?
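The first-order entropy calculation is easy to reproduce; here is a sketch on synthetic stand-in data (the real calculation runs on the data matrix $latex \mathbf{X}$):

```python
import numpy as np

def first_order_entropy(x):
    # empirical entropy in bits per symbol, treating every entry
    # of the array as an i.i.d. draw from a discrete distribution
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# synthetic stand-in for the 16-bit data: 40 roughly equiprobable values
rng = np.random.default_rng(0)
x = rng.integers(-20, 20, size=100_000, dtype=np.int16)

H = first_order_entropy(x)          # bits per symbol
total_MB = H * x.size / 8 / 1e6     # i.i.d. lower bound on compressed size
```

For the uniform toy data above, H is close to log2(40) ≈ 5.32 bits; on the real first differences the histogram is far from uniform, which is why the bound drops to 3.37 bits per symbol.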

Writing a compressor: results

1) Baseline: off the shelf universal compressors

Our first approach was to simply compare different universal compressors on the raw data. We stuck to the basics: gzip, which uses the Lempel-Ziv-based DEFLATE algorithm, and bzip2, which is based on the Burrows-Wheeler transform. We achieved around 2-3x compression. Later, we tried other compressors (such as sprintz or density) but with limited success; no compressor outperformed bzip2. This gave us insight into what compression rates we could expect.

Method | File Size | Compression Rate
Raw | 1332 MB | 1.00
gzip | 559 MB | 2.38
bzip2 | 423 MB | 3.15

2) Simple, effective and online: delta coding

Delta coding is a simple technique frequently used to compress time-series data. The idea isn’t super complicated: instead of encoding the raw values, we just encode the differences. Since the differences tend to be small, we can use just a few bits to encode each value. When the differences happen to be large, we instead encode the difference in a separate file.

There is an inherent tradeoff when applying this method: using smaller bit-widths to represent the differences means that the differences take up less space. However, since we need to store the outliers as well as their indices, having too many outliers does not work well. We experimented with 8-bit, 6-bit, and 4-bit representations. The results below are for the total file size after delta coding the data.
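A simplified sketch of the encoder and decoder for the 8-bit case (the actual on-disk format, e.g. how the outlier file is serialized, is omitted):

```python
import numpy as np

def delta_encode(x):
    # first differences stored as int8; any difference that does not fit
    # in 8 signed bits is zeroed out in the main stream and kept as an
    # (index, value) pair in a separate outlier list
    d = np.diff(x.astype(np.int64), prepend=np.int64(0))
    mask = (d < -128) | (d > 127)
    small = np.where(mask, 0, d).astype(np.int8)
    outliers = [(int(i), int(d[i])) for i in np.flatnonzero(mask)]
    return small, outliers

def delta_decode(small, outliers):
    # restore the large differences, then undo the differencing
    d = small.astype(np.int64)
    for i, v in outliers:
        d[i] = v
    return np.cumsum(d)

x = np.array([1000, 1003, 1001, 5000, 5002])
small, outliers = delta_encode(x)     # two outliers: the first sample and the jump
recovered = delta_decode(small, outliers)
```

Because the first sample is itself stored as an outlier when it is large, the round trip is exact; the compression win comes from the main stream shrinking from 16 to 8 bits per sample.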

Method | File Size | Compression Rate
Raw | 1332 MB | 1.00
4-bit diffs + bzip2 | 688 MB | 1.94
6-bit diffs + bzip2 | 418 MB | 3.19
8-bit diffs + bzip2 | 403 MB | 3.31
8-bit diffs (w/o compressor) | 694 MB | 1.92

One astounding fact is that the 8-bit differences achieved nearly a 2x compression rate before applying a universal compressor. Since the first difference is easily computed in practice, this means delta coding has the potential to compress neural data online, i.e. directly on the hardware while recording. This could allow neuroscientists to record and transmit Neuropixel data under limited bandwidth conditions, possibly even wirelessly.

3) Fancy stuff: integer wavelet compression

Inspired by delta coding, we wanted to find other methods for compressing data quickly and efficiently, attempting to achieve high compression rates with simple calculations in a possibly online setting. We chose to investigate the integer wavelet transform. The wavelet transform is a method for decomposing a signal into various temporal and frequency components, and often yields sparse representations [Mallat, A Wavelet Tour of Signal Processing]. These sparse representations lead to efficient compression of many signals — in fact, wavelet transforms underlie the JPEG 2000 image compression algorithm.

We decided to implement the integer-to-integer Haar transform (a very simple wavelet transform) [Fouad and Dansereau 2014, IJ Image, Graphics and Signal Processing]. After applying this transform to each electrode, we applied the bzip2 compressor and reduced the file to 502 MB, no better than just compressing the raw data! Wavelets, in this case, were a failed attempt.
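For reference, one level of the integer-to-integer Haar transform (often called the S-transform) can be sketched as follows; it uses only integer floor-means and differences, and is exactly invertible:

```python
import numpy as np

def haar_forward(x):
    # one level of the integer Haar (S-) transform on an even-length signal:
    # pairwise floor-means (low band) and differences (high band)
    a, b = x[0::2].astype(np.int64), x[1::2].astype(np.int64)
    s = (a + b) // 2
    d = a - b
    return s, d

def haar_inverse(s, d):
    # exact integer inverse of the forward step
    a = s + (d + 1) // 2
    b = a - d
    x = np.empty(2 * len(s), dtype=np.int64)
    x[0::2], x[1::2] = a, b
    return x

x = np.array([1000, 1003, 1001, 5000, 5002, 4998], dtype=np.int64)
s, d = haar_forward(x)
recovered = haar_inverse(s, d)   # identical to x
```

The hope was that the high band d would be sparse and highly compressible; on our data, it turned out not to beat plain differencing.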


So, what did we learn from this experiment?

  • Universal compressors do pretty well on neural data, and are hard to improve upon.
  • Simple delta coding achieves a reasonable compression rate, and can be implemented online to reduce the bandwidth needed for transmitting Neuropixel recordings.
  • Although wavelet compression has been successful in the past, it appears to work poorly for this type of data.

In the future, we hope to explore several directions:

  • Exploiting correlations between electrodes.
  • Advanced delta coding (e.g. second order differences or linear predictors).
  • Advanced wavelet techniques (e.g. using more complex wavelets, transforming both dimensions).

Overall, our project highlights how Neuropixel and other modern neuroscience datasets present a new challenge for compression — can we compress this data live with simple algorithms like delta coding in order to transmit large amounts of data wirelessly? Can we reduce the growing storage costs related to neuroscience datasets? Is there an efficient representation of these recordings, and if so, will this representation tell us something fundamental about the brain? Questions like these fit naturally in an information theory framework, and our exploration is a first step in tackling this problem.


For our outreach project at Lucille Nixon Elementary School, we designed a game called Spy Messaging in which the players had to communicate with each other through a noisy channel. The idea is that one player (the Spy) must communicate the location of the secret documents using three characters, each written on its own sticky note, to another player (the Analyst). The trick, though, is that the game master (Anthony) could erase one of the letters at random by pulling away one of the sticky notes. This simulates a binary erasure channel, where the erasure probability is the probability that Anthony decides to be sneaky. Below is the description from our poster, describing how to play the game:

The instructions for Spy Messaging.

The game was played using a board showing many rooms (which we borrowed from the game of “Clue”).

The Spy Messaging board and three letters one might use to encode “Library”.

At the beginning of the game, we place a yellow marker in one of the rooms, signifying the location of the secret documents. While we do so, the Analyst must turn around so that they can’t see where we are placing the marker. While the Analyst is turned around, the Spy encodes the location of the secret documents using three characters, each one on a sticky note (shown above). Once the Spy is done writing, we remove the yellow marker, and (optionally) erase one of the characters in the message.

The Analyst must then decipher the location of the secret documents from whatever characters remain. If they do so correctly, they win a prize: candy, of course.

As we guided students through the game, we had to adapt on the fly to different age groups and maturity levels. Older kids quickly learned to assign each room its own character, and then repeat that character three times (repetition coding). For our youngest participants, though, communicating a secret message even without any erasures was challenging enough. We found it helpful to have a parent or guardian be one of the partners, so that they could coach their child. Following the lead of other booths at the outreach event, we added a “Leaderboard” where students who won the game could write their name. For reasons we still don’t fully understand, kids loved writing their names down after they won.
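The strategy the older kids discovered is exactly a repetition code over an erasure channel; a toy sketch of the game (the function names and erasure probability are made up):

```python
import random

def spy_encode(room_char, n=3):
    # repetition code: write the same character on all n sticky notes
    return [room_char] * n

def game_master(codeword, p_erase=1/3, rng=random):
    # each sticky note is pulled away (erased) independently
    return [c if rng.random() > p_erase else None for c in codeword]

def analyst_decode(received):
    # any surviving copy reveals the room; decoding fails only if
    # every single copy was erased
    for c in received:
        if c is not None:
            return c
    return None
```

With one guaranteed erasure out of three notes, as in our game, repetition always succeeds; with independent erasures, failure requires all three notes to vanish, which happens with probability p_erase cubed.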

At the end of the game, we explained to each participant how the game they had just played was related to information theory: every time our phones and computers send information, that information can get partially erased. By being clever about the way we transmit messages, we can send images, texts, and even movies without losing information. Although we never used the phrase “Binary Erasure Channel,” we think kids got the idea.

Ben (left) and Anthony (right) running the Spy Messaging station at Lucille Nixon Elementary.

Informatrix — Learn Information Theory through Fun!

EE376A (Winter 2019)

By Jack Jin, Yueyang Liu, and Kailai Xu

Love board games? Want to learn information theory? Then this might be the right post for you. In this post, we introduce a new board game that helps you learn information theory through fun: Informatrix.

— A game of information

The game supports 2 to 6 players at the same time and is appropriate for ages 6 and up. A single game usually takes about 30 minutes.

The game is inspired by Chinese checkers and Catan. In this game, every player starts with 5-10 information sources and tries to transmit them through the channels. Each player has assigned destinations for their sources, and the first player to successfully transmit all of their sources to their destinations wins the game. This simulates the process of information transmission in real life: your information sources are not safe in the noisy channels, so be ready to route them with your intelligence!

The board of Informatrix. Each polygon (4, 6, or 12 sides) contains one of the three resources, noise, or a random resource. The channels are the sides of the polygons, and the intersections are the relays. Each relay is affected by the (at most) three adjacent polygons.

As you can see on the board, the channels are populated with different resources: copper, plastic, and silicon. These are the essential building blocks of the communication system. We also have several noisy sinks. Each round, the players move one of their sources one step forward; then a die is rolled to decide the active noisy sink and the active channels. Any information source on the edge of the active sink is corrupted and sent back to the starting point, and players acquire the corresponding resources if they have sources on the edge of the active channels.

The three resources available in the game.

The players can use the resources to purchase correction codes. The price of one correction code is 6 copper + 3 silicon + 1 plastic. When one of a player's sources is corrupted, she can pay one correction code to protect the source from corruption. Players can also purchase compression at the price of 1 silicon, allowing the player to move her information bits twice in the current turn. The most fun part is that the players can trade resources and correction codes with one another at any price they agree on.

We also mimic congestion in information theory by disallowing two sources from occupying the same location. The players can negotiate with each other or reroute their information.

That’s all! Simple enough, but it covers many basic concepts in information theory. It teaches the players how information is transmitted and how it is affected by noise and congestion in the channels. The concept of correction codes is also very important, and we make the codes an incentive for players to collect resources.

Outreach Activity

As for our outreach event at Nixon Elementary School, we designed a session where we had the kids read flashcards on information theory and answer short questions. The topics on the flashcards were:

  • Background of Claude Elwood Shannon and his 1948 paper “A Mathematical Theory of Communication”. 
  • A typical structure of a communication system. 
A communication System
  • Examples of a noiseless channel, a binary symmetric channel, etc. 
  • Code, and ASCII and DNA as examples. 
  • The concept of entropy and an example of Morse code. 
  • Mutual information. An example that illustrates the concept is presented below: You walk outside at night and find that the grass is wet. You then deduce that it must have either rained, or the sprinkler must have been turned on. Thus, Rain and Sprinkler have mutual information!
An illustration of mutual information
  • Huffman Code. 
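To put numbers on the rain/sprinkler flashcard, one can compute the mutual information of a toy joint distribution (the probabilities below are entirely made up for illustration):

```python
import numpy as np

# hypothetical joint distribution P(rain, grass wet):
# rows = {no rain, rain}, columns = {grass dry, grass wet}
P = np.array([[0.50, 0.10],
              [0.05, 0.35]])

px = P.sum(axis=1, keepdims=True)               # marginal of rain
py = P.sum(axis=0, keepdims=True)               # marginal of wet grass
I = float((P * np.log2(P / (px @ py))).sum())   # mutual information, in bits
```

Here I is strictly positive, which is the flashcard's point: seeing wet grass tells you something about whether it rained.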

Below is a photo of the poster (of eight flashcards) we had in the outreach event:

A photo of the posters at the outreach event. We forgot to take one during the event, so we cropped this out from a group photo taken at our booth.

We learned a lot during the outreach event explaining basic concepts in information theory to 1st through 5th graders. The older kids were able to grasp the concepts, while for some younger kids we had to explain using words they understood. A 1st grader asked questions like “what does the word ‘structure’ mean?” We found it easier to communicate the ideas when a younger kid came with his or her older sibling, as they could discuss their understandings with each other.

The outreach event was overall a fun experience, and both the kids and we enjoyed it. As prizes, we gave the kids pencils and flashcards with information theory concepts on them, so that they could review the concepts after the event. We distributed more than 70 copies of the flashcards.

Humans vs. Animal: Who’s the best for conversation?

EE376A (Winter 2019)

By Logan Spear, Grant Spellman, Heather Shen, and Andre Cornman

Human conversation is made up of more than just the words we speak; our messages are enriched by ‘paralinguistics’, or modifications to the way we deliver our words (such as tone, volume, and pitch) and nonverbal cues (consider facial expressions, posture, and gaze). Together, these ‘superlinguistic’ alterations to our words can enhance and amplify our messages or change their meaning completely. But how can we measure and quantify these effects? For this final project, our group explored the complexity of human speech and communication as well as various channels (in addition to speech) for human conversation.

Given that the audience at Nixon Elementary School was largely first through fifth graders, we narrowed our outreach event to focus on showing students how to visualize and compare various sounds.

We designed our project interface with two main panels: one for visualizing different animal sounds, and one for interactively visualizing short voice recordings (see Fig. 1). On the first panel, the students could choose from a list of animals (monkey, duck, elephant, or dog), play a recorded clip from the selected animal, and visualize the sound using a time and spectrogram plot. On the second panel, the student could record a three second clip of their voice and visualize it using time and spectrogram plots as well.
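The core of the visualization can be sketched with plain NumPy; the pure 440 Hz tone below is a hypothetical stand-in for a student's recording (our actual GUI also handled audio capture and plotting):

```python
import numpy as np

fs = 8000                              # sample rate in Hz (hypothetical)
t = np.arange(0, 3, 1 / fs)            # a three-second clip
x = np.sin(2 * np.pi * 440 * t)        # stand-in for a recorded voice

# short-time Fourier magnitudes: 256-sample Hann-windowed frames
n = 256
frames = x[: len(x) // n * n].reshape(-1, n) * np.hanning(n)
S = np.abs(np.fft.rfft(frames, axis=1)).T    # spectrogram: frequency x time
freqs = np.fft.rfftfreq(n, 1 / fs)

dominant = freqs[S.mean(axis=1).argmax()]    # FFT bin nearest 440 Hz
```

A steady whistle shows up as a single bright horizontal line in S; an undulating sound shows the dominant frequency sweeping up and down over time.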

Figure 1: Left – the recorded noise and spectrogram of a monkey. Right – the recorded noise and spectrogram of a person talking.
Figure 2: Left – the recorded noise and spectrogram of a monkey. Right – the recorded noise and spectrogram of an undulating sound.

We found that most of the students easily understood how the amplitude of their voice affected the time plot. However, most students were not yet familiar with the idea of frequency, and being able to visualize it in the spectrogram was exciting and new to them! You may have heard some high-pitched noises; do not worry, that was just some excited young students learning the correlation between frequency and sound. Students were particularly excited about seeing an undulating noise visualized as a wave in the spectrogram (see Fig. 2) or how a whistle transformed into a straight line over time in the frequency domain (see Fig. 3).

Figure 3: Left – the recorded noise and spectrogram of a duck. Right – the recorded noise and spectrogram of a whistle.

To ground these concepts for students, we compared their own spectrograms with well-known animal noises, including a monkey, elephant, duck, and dog. Students were then able to try to emulate animal noises and see how that also might affect the spectrogram.

For students with a more advanced understanding of frequency (and parents curious about our project), we started to broach the idea of measuring the “complexity” of sound with simple metrics. For example, after thresholding the recorded sound to reduce noise, we displayed the minimum and maximum frequency as well as the tonal variance on our GUI. Although simplistic, these metrics began the conversation around what makes human language more complex than animal communication. For example, we were able to show that the frequency range of human speech was typically larger than that of the animals and had greater variance.
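Those metrics can be sketched as follows (the thresholding rule and the toy spectrogram are our own simplifications for illustration; `S` is a magnitude spectrogram with one row per frequency bin):

```python
import numpy as np

def freq_metrics(S, freqs, thresh=0.1):
    # zero out bins below a fraction of the peak magnitude to reduce
    # noise, then report the active frequency range and tonal variance
    active = S > thresh * S.max()
    present = freqs[active.any(axis=1)]
    fmin, fmax = float(present.min()), float(present.max())
    dominant = freqs[S.argmax(axis=0)]   # per-frame dominant frequency
    return fmin, fmax, float(dominant.var())

# toy spectrogram: a single pure tone at 20 Hz across 4 frames
S = np.zeros((5, 4))
S[2, :] = 1.0
freqs = np.arange(5) * 10.0
fmin, fmax, tonal_var = freq_metrics(S, freqs)
```

For the pure tone, the range collapses to a point and the tonal variance is zero; a sweeping or undulating sound spreads the range and grows the variance, which is the intuition we offered for “complexity.”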

Photos from the outreach event at Nixon Elementary School

Slightly more technical stuff
Now, we’ll discuss some ideas we’ve come up with through the process of creating our outreach project, reading some papers, and discussing things over dinner.

Initially, we wanted to do a project in which we captured the “sophistication” or “complexity” of a signal (specifically an audio clip) in order to compare the potential capacities of different species. What we found out, rather quickly, is that that task is extremely difficult. Within our research, though, we did find a paper that sparked our interest, ultimately leading us to flesh out the ideas we outline below. Thanks to Ariana, our mentor, for sending us so many great sources, such as this paper.

The paper that sparked our adventure into the idea for a model of human communication was On the Information Rate of Speech Communication. In this paper, they propose a model of communication and use it to estimate the information rate of English speech. The authors assume no prior knowledge of how language works, and rely on recordings of different people speaking the same sentences. Most importantly, they assume that a message consists entirely of the string of words spoken, and has no dependence on any other factor. When we read it, we were like “yeah, okay, human communication (not just speech) is complicated, and you’re going to need to make some simplifying assumptions, but that one just seems too big.” That sentiment sparked our discussions on the topic of human communication, and eventually led to the (rough) idea of a model we present now.

Our Model
At the heart of our model is the idea that a message between humans is made up of more than just the words spoken. We believe that there are many “superlinguistic” methods of transmitting and receiving information, and our proposed model tries to account for those factors. At a high level, our model consists of a message that gets chopped up into pieces (some of which may be redundant), and then those pieces each get sent across a different channel to the receiver. The receiver then tries to reconstruct the original message given all of the pieces she received. The idea is that the channels represent the various forms of superlinguistic communication, and a true message is made up of more than just the words received over one channel.

We’ll now provide a deeper discussion of the message (which we ourselves still don’t fully understand), and then the channels (which, again, we don’t fully understand).

The Message
To begin with, we point out that messages, as we see them, can be very complicated, since it’s possible to overload single interactions with multiple meanings (some of which are even unintentional). We’ll explore two different example interactions which will motivate our discussion of messages and subsequent discussion of channels.

Imagine first what we consider to be a very simple, direct message: one friend saying to another “pass me the broccoli” at the dinner table. This is an example of one of the simplest messages we could come up with: the speaker’s main (and likely only) intent is for her friend to pass her the broccoli so that she can add some to her plate. In this context, this message makes practically no use of any of the superlinguistic channels of communication, and the primary message is very clear.

Now imagine the case when a friend stomps into the room, sits down heavily in the chair next to you, sighs loudly, and says “I’m exhausted.” In this case, we imagine that the primary message that the speaker is communicating is actually a plea for the listener to ask them what’s up and comfort them, not the mere statement that they’re exhausted. This (rather contrived) example is meant to demonstrate how much meaning can be conveyed through channels other than the selection of words spoken.

In general, we see a single message as something that can actually contain information or statements regarding many different things. Some general types of information within messages that we’ve come up with include things like information about the speaker, the speaker’s opinion of other people or things, a request of the listener, or an invitation to the listener. Honestly, trying to capture all of the different types of information that could be communicated is very difficult, and this list is extremely incomplete and messy. The point, though, is to convey the idea of how layered a message can be and emphasize that there can be multiple discrete sub-messages contained within a single message. In general, we consider most messages to contain a primary message (which is the majority of what the speaker aims to communicate), and often also a secondary message (which is usually an implication made by the speaker), and the remaining information in the message is general background on context or character of the speaker, which consists mostly of things instinctively inferred by the listener rather than intentionally transmitted by the speaker.

So, in summary, messages have the potential to be very complex because of the way they can actually contain multiple, discrete ideas within a single message. To make things at least a little tractable, we consider that most messages contain a primary message, which is the speaker’s intended message, and can contain subsequent messages (such as intended suggestions by the speaker), and finally general information which is basically the information you get from a speaker purely through interacting with them and not by their intention (and it is often related to the nature of the speaker themselves).

The Channels
Now we’ll talk about the channels. In our model, the message itself gets broken up into pieces (not necessarily at random) and each piece is sent across a different channel to the receiver. These channels are supposed to represent the various “superlinguistic” modes of communication, such as concrete things like tone, body language, and cadence, but also more abstract things like speaker-listener relationship, cultural backgrounds of the involved parties, and conversational context. Again, this list is incomplete and rough, but the point is to convey the idea that communicating information relies on more abstract ideas than just the words or even tone and body language. Furthermore, each channel is usually used to relay a certain type or part of the message, which is why the message is not simply broken up at random to be sent across these channels.

When considering the more abstract things, like cultural background, we recognize that it’s weird to consider those as a channel. As a result, we can also consider the case where these abstract things instead parametrize the other, more classic channels, rather than constituting a channel themselves. For instance, a culture that makes heavy use of sarcasm may interpret messages sent across the tonal or facial expression channels very differently than a culture in which sarcasm is rare.

To summarize the whole model: in communicating, the speaker (who may not even be speaking out loud) sends a message. That message in fact contains many messages, some which the speaker intends to convey and some which she doesn’t. We assume that the speaker usually has at least a primary message which she actually intends to communicate. That whole message is split up into various blocks (some of which may be redundant) and sent across multiple channels to the receiver. The receiver then reconstructs the whole message from the received transmissions. Hopefully, all the discussion leading up to this at least sort of convinced you of the nuance in communication that we hope this model is able to capture.

Suggested Experiments
Now we’ll talk about some of the weird experiments we came up with to try to verify or flesh out this model. These ideas are more for fun and aren’t very rigorous; think of them as suggestions for how to begin thinking about experiments. For context, when coming up with these ideas, we thought there were two things we needed to do. The first is to come up with simple, repeatable messages, so that you can control the message and tweak other things instead. The second is to find ways to cut off certain channels, or at least to select which channels to funnel the message through. This would let you explore which parts of which messages are sent across different channels.

Regarding the control of the message, we thought that one was pretty difficult. However, we suspect that if you focus on something expository, like describing pictures or videos (as they do in the Humans are Awesome project), then you can remove most of the “fuzzy” part of messages. And that’s about the best we came up with.

For the control of channels, we basically have one idea that can be reused a bunch of times: directly cut off or mess with one of the obvious channels. For instance, put two people in separate rooms and have them communicate over an instant messenger, and compare that to two people talking directly about the same thing. Or put two people in the same room, but don’t let them talk; have them use the instant messenger instead, so they can see each other but aren’t speaking. Similarly, put two people in the same room and let them talk to each other, but blindfold them, or at least don’t let them see each other.

For testing those more abstract ideas (like cultural background or relationship), compare communication between a person and a friend or peer to the same person communicating with a non-peer. Similarly, compare the communication between two people from the same cultural background to two people from different cultural backgrounds (who also have relatively little exposure to the culture of their respective communication partner). Along similar lines, you could also explore the effect of language itself by having people communicate on the same topic twice, once in each of two languages. Basically, once you get the idea, you can create plenty of your own experiments.

In conclusion: we naively tried to capture the “sophistication” of communication based on an audio clip and then convey that to elementary schoolers. We found out that’s really hard. For our outreach, we instead tried to give them something they could explore intuitively, and we had a lot of fun with that (they did too, we hope). Through some readings we did on information-theoretic approaches to analyzing human speech, we came across an interesting paper that inspired us to come up with our own model of human communication. The model consists of a message which is split up and sent across multiple channels, reflecting the aspects of communication beyond simply the words used. We discussed the model and then presented some experiments people could try, experiments which would probably reveal a bunch of flaws in the model and hopefully suggest ways to refine it. Most importantly, though, we learned a lot and had a great time doing it all!