Learning is hard work

EE376A (Winter 2019)

Daniel Wennberg1

Accounting for around 20 % of the body’s energy consumption, the human brain burns calories at 10 times the rate of your average organ when measured per unit mass.2 However, its $latex \sim 20 \mathrm{W}$ drain pales in comparison to that of the hardware used for training and inference in machine learning: a single GPU can consume several hundred watts, more than all the other components of a computer combined, and these days deep neural networks are routinely trained on systems containing more such processors than you can count on one hand.3

Anyone who has felt the heat from the bottom of their laptop has experienced the energy cost of information processing within the current paradigms, but systems capable of learning from their inputs seem to have a particularly large appetite. Perhaps this can tell us something about what learning is at the most fundamental level? Here I will describe an attempt to study perhaps the simplest learning system, the perceptron, as a physical system subject to a stochastic driving force, and suggest a novel perspective on the thermodynamic performance of a learning rule. From an efficiency standpoint, we may consider the transient energy expenditure during learning as a cost to be minimized; however, learning is also about producing a state of high information content, which in thermodynamic terms means a far-from-equilibrium and thus highly dissipative state. Perhaps a good learning rule is rather one for which minimizing loss and maximizing heat dissipation go hand in hand?

To explore this question, we need to know a couple of things about stochastic thermodynamics. This is a rather specialized field of study in its own right, and I’m only aware of three papers in which it is applied to learning algorithms. Therefore, in order that this text may be enjoyed by as many as possible, I will spend the bulk of it laying out the basics of stochastic thermodynamics, how it connects to information theory, and how it can be applied to study learning in the perceptron. I will conclude by presenting one novel result, but this is merely a taste of what it would actually take to properly answer the question posed above.

Thermodynamics and information

Classical thermodynamics is a self-contained phenomenological theory, in which entropy is defined as a measure of irreversibility. If the change in the total entropy of the universe between the beginning and end of some process is zero, it could just as well have happened the other way around, and we say that the process is reversible. Otherwise, it’s a one-way street: the only allowed direction is the one in which entropy increases, and the larger the entropy increase, the more outrageous the reverse process would be, like the unmixing of paint or the spontaneous reassembly of shards of glass. This is the content of the second law of thermodynamics, that total entropy can never decrease, and this is the reason we cannot solve the energy crisis by simply extracting thermal energy (of which there is plenty) from the ocean and atmosphere and converting it into electricity.

However, thermodynamics only makes reference to macroscopic observables such as temperature, pressure, and volume. What if a hypothetical being had the ability to observe each individual molecule in a reservoir of gas and sort them into two separate reservoirs according to their kinetic energy? That would reduce the entropy in the gas and allow us to extract useful work from it. This being is known as Maxwell’s demon, and the resolution to the apparent violation of the second law builds on connections between thermodynamics and the mechanics of the gas molecules as well as the information processing capabilities of the demon, provided by the framework of statistical mechanics. If we assume a probability distribution (or, in physics parlance, an ensemble) over the fully specified states of the fundamental constituents of the system, the entropy of the system can be defined as the corresponding Shannon entropy, in physics called Gibbs entropy. That is, if $latex x$ represents a possible microscopic state and $latex p(x)$ is the corresponding probability mass, the Gibbs entropy is,

$latex S = -\sum_x p(x) \log p(x) &s=1$ .

Thermodynamic observables correspond to various means under this distribution, and an equilibrium state is represented by the probability distribution that maximizes entropy for a set of constraints on thermodynamic observables.4 Thus, the thermodynamic entropy can be interpreted as the amount of information swept under the rug by describing a system with many degrees of freedom in terms of only a small set of observables.
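
To make the definition concrete, here is a small Python sketch of the Gibbs/Shannon entropy of a discrete ensemble, illustrating that the uniform distribution maximizes it (the example distributions are made up for illustration):

```python
import numpy as np

def gibbs_entropy(p):
    """S = -sum_x p(x) log p(x), in nats; zero-probability states contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# With only a normalization constraint, the uniform distribution over
# 4 microstates maximizes entropy: S = log 4.
print(gibbs_entropy([0.25, 0.25, 0.25, 0.25]))  # log 4 ~ 1.386 nats
print(gibbs_entropy([0.97, 0.01, 0.01, 0.01]))  # strongly peaked: much smaller
```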

This is perhaps the fundamental sense in which we can claim that information is physical: through entropy, the constraints on the time evolution of a system depend on which information about the system is kept and which is discarded. A more operational take on the same sentiment is that the amount of useful work we can extract from a system depends on the amount of information we have about it.5

Fluctuations in heat and entropy production6

The deterministic nature of classical thermodynamics corresponds to our experience in daily life: water predictably freezes at exactly $latex 0 \, {}^\circ \mathrm{C}$ and boils at exactly $latex 100 \, {}^\circ \mathrm{C}$. However, if we reduce the system size such that the number of degrees of freedom is closer to 1 than to Avogadro’s number, fluctuations become significant, and we find that classical thermodynamics only describes average behavior. Assuming a separation of timescales between the fast dynamics of the thermal environment and the slow intrinsic dynamics of the system, this can be modeled as a Markovian process represented by a stochastic differential equation that physicists refer to as a Langevin equation. In the case of an overdamped system (that is, a system with negligible inertia), the equation reads,

$latex d \boldsymbol{x} = \mu \boldsymbol{F}_t(\boldsymbol{x}) dt + \sqrt{2 D} \, d \boldsymbol{W} &s=1$ .

Here, $latex \boldsymbol{x}$ is a vector of length $latex n$ representing the state of the system, while $latex \boldsymbol{F}_t$ is the driving force at time $latex t$, and $latex d \boldsymbol{W}$ is a vector of uncorrelated stochastic increments with variance $latex dt$ (Wiener increments), modeling the unpredictable kicks that the system receives courtesy of the thermal energy in the environment. The mobility $latex \mu$ and diffusion $latex D$ are related by the Einstein relation $latex D = T \mu$, where $latex T$ is the environment temperature.
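
A Langevin equation like this is straightforward to simulate. The sketch below integrates an ensemble of overdamped particles in a quadratic potential with the Euler-Maruyama method (all parameter values are illustrative choices, not taken from the text) and checks that the ensemble relaxes to the Boltzmann variance $latex T/k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama integration of the overdamped Langevin equation in a
# quadratic potential V(x) = k x^2 / 2. Parameters are illustrative only.
mu, D = 1.0, 1.0               # Einstein relation: temperature T = D / mu = 1
k, dt, n_steps = 1.0, 1e-2, 2000
x = np.zeros(5000)             # 5000 independent trajectories, all starting at 0

for _ in range(n_steps):
    F = -k * x                                            # conservative force -dV/dx
    x += mu * F * dt + np.sqrt(2 * D * dt) * rng.standard_normal(x.size)

# After relaxation the ensemble is Boltzmann distributed: Var(x) = T / k = 1
print(np.var(x))
```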

The first law of thermodynamics states that energy is always conserved, and is often written $latex \Delta E = W – Q$, where $latex \Delta E$ is the change in internal energy of a system, $latex W$ is the work performed on the system from the outside, and $latex Q$ is the heat dissipated to the environment. We can apply this law to any infinitesimal time step of any particular trajectory arising from the Langevin equation if we identify the heat dissipated with the net force times displacement, $latex dq = \boldsymbol{F} \cdot d \boldsymbol{x}$. The infinitesimal work from the outside must then be $latex dw = dV + dq$, where $latex dV = dE$ is the infinitesimal change in any potential energy $latex V$ that contributes a term $latex -\nabla V$ to the net force $latex \boldsymbol{F}$ (we do not consider overdamped systems to have kinetic energy).7

Having defined the heat dissipation $latex dq$, we can use the standard thermodynamical definition of change in entropy in the environment: $latex ds^\text{env} = dq / T$, where $latex T$ is the temperature. However, in order to define the entropy of the system, we also need a probability distribution $latex p_t(\boldsymbol{x})$, provided by the Langevin equation in combination with an ensemble $latex p_0(\boldsymbol{x})$ for the initial state. We define the system entropy on a particular trajectory $latex \boldsymbol{x}_t$ as the information content, or surprise, of that particular realization at each time, $latex s_t^\text{sys} = -\log p_t(\boldsymbol{x}_t)$, such that the ensemble average becomes the system Gibbs entropy,

$latex S_t^\text{sys} = -\int p_t(\boldsymbol{x}) \log p_t(\boldsymbol{x}) d^n x &s=1$ .

In order to express other ensemble-averaged quantities, we introduce the probability current,

$latex \boldsymbol{j}_t(\boldsymbol{x}) = \mu \boldsymbol{F}_t(\boldsymbol{x}) p_t(\boldsymbol{x}) – D \nabla p_t(\boldsymbol{x}) &s=1$ ,

such that we can write the equation of motion for the probability density, known as the Fokker-Planck equation, as,

$latex \frac{\partial p_t(\boldsymbol{x})}{\partial t} = – \nabla \cdot \boldsymbol{j}_t(\boldsymbol{x}) &s=1$ .

Using this we can show that the entropy produced in the environment is,

$latex \dot{S}_t^\text{env} = \frac{1}{T} \int \boldsymbol{F}_t(\boldsymbol{x}) \cdot \boldsymbol{j}_t(\boldsymbol{x}) d^n x &s=1$ ,

and we find the total entropy production to be nonnegative, as required by the second law,

$latex \dot{S}_t^\text{tot} = \int \frac{\boldsymbol{j}_t(\boldsymbol{x})^2}{D p_t(\boldsymbol{x})} d^n x \geq 0 &s=1$ .
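
As a numerical sanity check on these formulas: for a static quadratic potential, the Boltzmann density should carry zero probability current, so the Fokker-Planck equation gives $latex \partial p / \partial t = 0$ and no entropy is produced. A small sketch with illustrative parameter values:

```python
import numpy as np

# Check that the Boltzmann density p ~ exp(-V/T) in a quadratic potential
# V(x) = k x^2 / 2 carries (numerically) zero probability current
# j = mu F p - D p', so the system is in equilibrium and produces no entropy.
mu, D, k = 1.0, 1.0, 2.0
T = D / mu                                    # Einstein relation
x = np.linspace(-3.0, 3.0, 2001)
p = np.sqrt(k / (2 * np.pi * T)) * np.exp(-k * x**2 / (2 * T))

F = -k * x                                    # conservative force -dV/dx
j = mu * F * p - D * np.gradient(p, x)        # probability current
print(np.max(np.abs(j)))                      # ~0 up to discretization error
```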

Thermodynamics and learning

The perceptron is a linear classifier that maps a vector of $latex n$ numbers to a binary label. It is often presented as a simple model of a neuron that either emits a spike or doesn’t, depending on the input from $latex n$ upstream neurons. Mathematically, the model is defined by a vector $latex \boldsymbol{w}$ of $latex n$ weights, and a classification rule $latex \boldsymbol{\xi} \mapsto \sigma = \mathrm{sign}(\mathcal{A})$, where $latex \boldsymbol{\xi}$ is an input, $latex \mathcal{A} = \boldsymbol{w} \cdot \boldsymbol{\xi} / \sqrt{n}$ is the activation of the neuron, and $latex \sigma$ is the output label. Conventionally, the definition of the perceptron includes a particular learning rule to determine the weights based on a set of training examples; however, there are in principle many possible ways to train this model, some of which will be discussed in the following.
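
In code, the classification rule is a one-liner (the tie-breaking convention for $latex \mathcal{A} = 0$ is our own choice, not specified above):

```python
import numpy as np

def perceptron_predict(w, xi):
    """Classification rule xi -> sigma = sign(A), with activation A = w . xi / sqrt(n)."""
    A = np.dot(w, xi) / np.sqrt(len(w))
    return 1 if A >= 0 else -1   # sign(0) = +1 is our tie-breaking convention

# A toy example with n = 4 upstream neurons:
w = np.array([0.5, -1.0, 2.0, 0.0])   # weights
xi = np.array([1, 1, 1, -1])          # input spikes / silences
print(perceptron_predict(w, xi))      # A = 1.5 / 2 = 0.75 > 0, so sigma = +1
```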

Here we follow Goldt & Seifert (2017),8 and consider a situation where there is a predefined weight vector $latex \boldsymbol{T}$, referred to as the teacher, that defines the correct classification rule that we want the model to learn. We also restrict the input vectors to have entries $latex \pm 1$, which can be thought of as representing the presence or absence of a spike from the corresponding upstream neuron. The training is modeled by a Langevin equation for the time evolution of the weights, with an input-dependent force:

$latex d \boldsymbol{w} = \boldsymbol{F} \left(\boldsymbol{w}, \boldsymbol{\xi}^{\mu_t}, \sigma_T^{\mu_t}, \nu \right) dt + 2 d \boldsymbol{W} &s=1$ .

Here, $latex \mu_t$ is some process that associates a particular input $latex \boldsymbol{\xi}^{\mu_t}$ with each point in time, and $latex \sigma_T = \mathrm{sign}(\boldsymbol{T} \cdot \boldsymbol{\xi})$ is the correct label for input $latex \boldsymbol{\xi}$, determined by the teacher. The parameter $latex \nu$ is the learning rate. We set both the mobility and diffusion constant to unity, $latex \mu = D = 1$, since varying these just corresponds to rescaling the weights and time.

We consider forces of the form

$latex \boldsymbol{F}(\boldsymbol{w}, \boldsymbol{\xi}, \sigma_T, \nu) = -k \boldsymbol{w} + \nu \sigma_T \boldsymbol{\xi} \mathcal{F}(\mathcal{A}, \sigma_T) &s=1$ .

To a physicist, the first term represents a conservative force that drives the weights towards the minimum of a potential energy $latex V(\boldsymbol{w}) = k \boldsymbol{w}^2 / 2$ (in the absence of training the system would equilibrate to the Boltzmann distribution with this energy function). This keeps the weight magnitudes from growing without bounds in response to the training, so computationally it serves as a regularization term.

The input-dependent force drives the weight vector in the direction of alignment (when $latex \sigma_T = 1$) or anti-alignment (when $latex \sigma_T = -1$) with the input vector $latex \boldsymbol{\xi}$. The overall strength of this drive is set by the learning rate $latex \nu$, and it is modulated for each input by the activation-dependent learning rule $latex \mathcal{F}$. The learning rules that will be considered here are:

  • Hebbian: $latex \mathcal{F}(\mathcal{A}, \sigma_T) = 1$
  • Perceptron: $latex \mathcal{F}(\mathcal{A}, \sigma_T) = \Theta(-\sigma_T \mathcal{A})$
  • AdaTron: $latex \mathcal{F}(\mathcal{A}, \sigma_T) = |\mathcal{A}| \Theta(-\sigma_T \mathcal{A})$
  • $latex \alpha$-tron: $latex \mathcal{F}(\mathcal{A}, \sigma_T) = \Theta(-\sigma_T \mathcal{A}) + |\mathcal{A}|^\alpha \Theta(\sigma_T \mathcal{A})$

Here, $latex \Theta(x)$ is the step function: $latex \Theta(x) = 1$ if $latex x > 0$ and $latex \Theta(x) = 0$ otherwise. Hebbian learning weights all training examples equally, the classical perceptron algorithm only updates the weights when the neuron predicted the wrong label, and AdaTron learning is a modification of perceptron learning where the correction is larger the more confident the neuron was in its erroneous prediction. The $latex \alpha$-tron rule is a novel family of learning rules introduced here, designed as a modification of Hebbian learning such that better predictions lead to increased variance in the learning force, and hence, as we will show below, more dissipation. This is achieved by using a constant weight $latex 1$ for incorrect predictions and an activation-dependent weight $latex |\mathcal{A}|^\alpha$ for correct predictions, where the parameter $latex \alpha$ can be tuned to adjust the strength of the fluctuations.
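
For concreteness, the four rules can be written out as plain Python functions (the value $latex \alpha = 0.5$ below is just an example):

```python
def theta(x):
    """Heaviside step function: 1 if x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

def hebbian(A, sT):
    return 1.0

def perceptron_rule(A, sT):
    return theta(-sT * A)

def adatron(A, sT):
    return abs(A) * theta(-sT * A)

def alpha_tron(A, sT, alpha=0.5):
    return theta(-sT * A) + abs(A) ** alpha * theta(sT * A)

# A confident, wrong prediction (sigma_T = +1 but A = -2) triggers every rule:
print(hebbian(-2, 1), perceptron_rule(-2, 1), adatron(-2, 1), alpha_tron(-2, 1))
# A correct prediction (A = +2) only contributes under Hebbian and alpha-tron learning:
print(hebbian(2, 1), perceptron_rule(2, 1), adatron(2, 1), alpha_tron(2, 1))
```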

To fully specify the problem, we must define two sets of inputs: a training set $latex D = \{\boldsymbol{\xi}^\mu\}_{\mu = 1}^N$ and a test set $latex \tilde{D} = \{\boldsymbol{\xi}^{\tilde{\mu}}\}_{\tilde{\mu} = 1}^{\tilde{N}}$. We will assume that both sets are sufficiently large and uniform samples as to be, for the analysis that follows, statistically indistinguishable from the uniform distribution over all the $latex 2^n$ possible input vectors.

As the performance metric for training we will use the generalization error $latex \epsilon = \langle \Theta(-\sigma^{\tilde{\mu}} \sigma_T^{\tilde{\mu}}) \rangle_{\tilde{D}}$, that is, the fraction of test examples for which the perceptron yields incorrect predictions. In the thermodynamic limit, $latex n \rightarrow \infty$, it can be shown that the generalization error is proportional to the angle between the weight and teacher vectors,

$latex \epsilon = \frac{1}{\pi} \arccos \left( \frac{\boldsymbol{T} \cdot \boldsymbol{w}}{|\boldsymbol{T}| |\boldsymbol{w}|} \right) &s=1$ .

The generalization error can be related to the mutual information between the predicted and correct label: $latex I(\sigma_T^{\tilde{\mu}} ; \sigma^{\tilde{\mu}}) = 1 – h_2(\epsilon)$, where $latex h_2$ is the binary entropy function. Moreover, this mutual information can be bounded by the entropy produced over the course of learning. These connections close the circle between the information theoretical, thermodynamical, and learning perspectives on this model.
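
The relation $latex I = 1 - h_2(\epsilon)$ is easy to transcribe (values in bits):

```python
import numpy as np

def h2(p):
    """Binary entropy function in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def label_information(eps):
    """I(sigma_T; sigma) = 1 - h2(eps) bits, given generalization error eps."""
    return 1.0 - h2(eps)

print(label_information(0.0))   # perfect learning: 1 bit per label
print(label_information(0.5))   # chance performance: 0 bits
```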

Steady state entropy production of learning

For the learning to converge to a steady state, we will assume a separation of timescales between the process $latex \mu_t$ that flips through training examples, and the relaxation time for the weights, governed by the learning rate $latex \nu$. The idea is that $latex \mu_t$ scans a representative sample of training examples fast enough that in any time interval over which the weights change appreciably, the time evolution is completely determined by the statistical properties of the training set, and the detailed dynamics of $latex \mu_t$ are irrelevant.

In steady state, the system entropy is constant and the prevailing entropy production must occur in the environment. Hence, we can write the total steady-state entropy production as,

$latex \dot{S}^\text{s}(\nu) = \int \left\langle \boldsymbol{F}\left(\boldsymbol{w}, \boldsymbol{\xi}^\mu, \sigma_T^\mu, \nu \right) \cdot \boldsymbol{j}^{\text{s},\mu}(\boldsymbol{w}; \nu) \right\rangle_D d^n w &s=1$ .

Here we write $latex \boldsymbol{j}^{\text{s},\mu}(\boldsymbol{w}; \nu)$ for the probability current when the system is in steady state and the training example indexed by $latex \mu$ is presented, and we take the average over the training set, $latex \langle \cdot \rangle_D$, to define the effective, $latex \mu$-independent entropy production.

The steady-state currents can be written in terms of the force and steady-state probability density $latex p^s(\boldsymbol{w}; \nu)$ as,

$latex \boldsymbol{j}^{\text{s},\mu} (\boldsymbol{w}; \nu) = \left[\boldsymbol{F} \left(\boldsymbol{w}, \boldsymbol{\xi}^\mu, \sigma_T^\mu, \nu\right) – \nabla \right] p^\text{s} (\boldsymbol{w}; \nu) &s=1$ ,

and by subtracting the input-averaged version of the same equation we find that,

$latex \boldsymbol{j}^{\text{s},\mu}(\boldsymbol{w}; \nu) = \left\langle \boldsymbol{j}^{\text{s},\mu}(\boldsymbol{w}; \nu) \right\rangle_D + p^\text{s}(\boldsymbol{w}; \nu) \left[\boldsymbol{F}(\boldsymbol{w}, \boldsymbol{\xi}^\mu, \sigma_T^\mu, \nu) – \left\langle \boldsymbol{F}(\boldsymbol{w}, \boldsymbol{\xi}^\mu, \sigma_T^\mu, \nu) \right\rangle_D\right] &s=1$ .

Plugging this back into the expression for total entropy production, we obtain two distinct terms, and write $latex \dot{S}^\text{s}(\nu) = \dot{S}^{\text{s},\text{hk}}(\nu) + \dot{S}^{\text{s},\text{ex}}(\nu)$. The first term is the housekeeping entropy rate,

$latex \dot{S}^{\text{s},\text{hk}}(\nu) = \int \left\langle \boldsymbol{F} \left(\boldsymbol{w}, \boldsymbol{\xi}^\mu, \sigma_T^\mu, \nu\right) \right\rangle_D \cdot \left\langle \boldsymbol{j}^{\text{s},\mu}(\boldsymbol{w}; \nu) \right\rangle_D d^n w &s=1$ ,

which is the entropy production required to maintain the steady state. This contribution would remain the same in a batch learning scenario where the force does not fluctuate from scanning through training examples, but instead takes the average value $latex \left\langle \boldsymbol{F} \left(\boldsymbol{w}, \boldsymbol{\xi}^\mu, \sigma_T^\mu, \nu\right) \right\rangle_D$ at all times.

The fluctuations are responsible for the second contribution, the excess entropy production,

$latex \dot{S}^{\text{s},\text{ex}}(\nu) = n \nu^2 \int p^\text{s}(\boldsymbol{w}; \nu) \left[ \left\langle \mathcal{F} \left(\mathcal{A}^\mu, \sigma_T^\mu \right)^2 \right\rangle_D – \left\langle \mathcal{F} \left(\mathcal{A}^\mu, \sigma_T^\mu \right) \right\rangle_D^2 \right] d^n w &s=1$ ,

where we simplified the expression by substituting the definition of $latex \boldsymbol{F}$ and noting that $latex \sigma_T^2 = 1$ and $latex \boldsymbol{\xi}^2 = n$. By the separation-of-timescales assumption, the input fluctuations average out quickly enough not to affect the time evolution of the probability distribution on the timescales of interest (this is a condition for defining a steady state in the first place); however, they still add to the heat dissipation.
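
To see how this bracketed variance differs between rules, here is a small Monte-Carlo sketch for a fixed, partially trained weight vector (the setup and parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte-Carlo estimate of the variance <F^2> - <F>^2 of the learning rule
# over random inputs, for a fixed weight vector roughly aligned with the
# teacher. All parameter choices here are illustrative.
n, n_inputs = 200, 20_000
T = rng.standard_normal(n)
w = T + 0.5 * rng.standard_normal(n)            # a partially trained weight vector

xi = rng.choice([-1.0, 1.0], size=(n_inputs, n))
A = xi @ w / np.sqrt(n)                         # activations
sigma_T = np.sign(xi @ T)                       # teacher labels

F_hebbian = np.ones(n_inputs)                                   # F = 1
F_perceptron = (sigma_T * A < 0).astype(float)                  # Theta(-sigma_T A)
F_alpha_tron = np.where(sigma_T * A < 0, 1.0, np.abs(A) ** 0.5) # alpha = 0.5

for name, F in [("Hebbian", F_hebbian), ("perceptron", F_perceptron),
                ("alpha-tron", F_alpha_tron)]:
    print(name, F.var())
# Hebbian learning has zero variance in F itself, while the alpha-tron
# retains fluctuations even when most predictions are correct.
```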

One of the main contributions of this work is identifying the latter term, a contribution to the excess entropy production that remains even in the steady state. In Goldt & Seifert (2017), the assumption of time scale separation was taken to imply that there would be no difference between presenting inputs sequentially (online learning) and in batch, and this effect of fluctuations was therefore neglected. A question not pursued here is whether this refinement modifies any of the bounds on mutual information from their paper.

Learning by adaptation?

How do we choose a good learning rule $latex \mathcal{F}(\mathcal{A}, \sigma_T)$? We have defined our metric, the generalization error $latex \epsilon$, but since we cannot try every imaginable learning rule, we need some heuristic for designing one. We have derived an expression that relates the steady-state entropy production (or, equivalently, heat dissipation) to $latex \mathcal{F}$. Perhaps we can find a way to apply that?

It has been observed that in many complex systems out of equilibrium, the likelihood of observing a particular outcome grows with the amount of heat dissipated over the history leading up to that outcome. The hypothesis of dissipative adaptation asserts that, as a corollary of this, one should expect many complex systems to evolve towards regions of state space where they absorb a lot of work from their driving forces so they can dissipate a lot of heat to their environments.9 For example, simulations have demonstrated a tendency for many-species chemical reaction networks to evolve towards statistically exceptional configurations that are uniquely adapted to absorbing work from the external drives they are subjected to.10 Inspired by this, we want to investigate the performance of learning rules designed such that the force fluctuations increase as the generalization error decreases. The $latex \alpha$-tron family of learning rules, presented above, is one attempt at designing such rules.

A simulation

Now that we begin to understand the thermodynamical perspective on learning and the (admittedly quite speculative) rationale behind the $latex \alpha$-tron, let us put it to the test. For the purpose of this example, we set $latex k = 1$, $latex \nu = 3 \sqrt{n}$,11 and $latex n = 10000$. Sampling a random but fixed teacher $latex \boldsymbol{T}$ and a random initial $latex \boldsymbol{w}$, and using a fixed step size of $latex \Delta t = 10^{-3}$, we simulate the Langevin equation for the perceptron in the most straightforward way, that is, using the Euler-Maruyama method. The code is available here, and the result looks like this:

Among the three established algorithms, we see that Hebbian learning is the clear winner. Interestingly, however, the $latex \alpha$-tron slightly outperforms Hebbian learning for $latex \alpha \in \{0.25, 0.50\}$. Was it that simple? Did our adaptation-based approach just revolutionize perceptron theory?

Probably not. First of all, this simulation is a little bit of a bodge. We sample training examples at random at each time step, but a step size of $latex k \Delta t = 10^{-3}$ is not nearly small enough to introduce proper separation of time scales. We know this because all algorithms perform measurably better if we reduce the step size. To partially mitigate this, the simulation was actually performed by averaging the force over a small batch of 30 examples at each time step, but this is still not enough to converge to time scale-separated dynamics; on the other hand, it serves to dampen the input-driven fluctuations that were the entire rationale behind this study by a factor of $latex 1 / \sqrt{30} \approx 0.18$. However, this was the compromise my laptop could handle. The qualitative features of the figure remain the same for different $latex \Delta t$, so perhaps the $latex \alpha$-tron does indeed perform better than the established algorithms for these particular parameters and $latex \alpha \lesssim 0.5$, at least when averaging over small batches.
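
For readers who cannot access the linked code, a stripped-down sketch of this kind of simulation (Hebbian rule only, a single input per step instead of batches of 30, and a much smaller $latex n$, so the numbers will not match the figure) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama simulation of the weight dynamics under the Hebbian rule
# (F = 1). Unlike the runs in the text, n is small and no batch averaging
# of the learning force is used.
n, k, dt, n_steps = 100, 1.0, 1e-3, 20_000
nu = 3 * np.sqrt(n)

T = rng.standard_normal(n)                   # fixed random teacher
w = rng.standard_normal(n)                   # random initial weights

for _ in range(n_steps):
    xi = rng.choice([-1.0, 1.0], size=n)     # one random training input
    sigma_T = np.sign(T @ xi)                # teacher's label
    F = -k * w + nu * sigma_T * xi           # regularization + Hebbian learning force
    w += F * dt + np.sqrt(2 * dt) * rng.standard_normal(n)   # mu = D = 1

eps = np.arccos((T @ w) / (np.linalg.norm(T) * np.linalg.norm(w))) / np.pi
print(eps)   # generalization error, well below the chance value of 0.5
```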

A methodological problem is of course that this is just a single run at a single point in parameter space. More thorough studies are needed before interesting claims can be made.

More importantly, we have not actually looked at the rate of entropy production after convergence, so we cannot say whether the solid $latex \alpha$-tron performance has anything to do with the fluctuations that we tried to design into the algorithm under inspiration from dissipative adaptation. A notable feature of the $latex \alpha$-tron is that the magnitude of the learning force scales with the activation $latex \mathcal{A} \sim |\boldsymbol{w}|$ when the prediction is correct. Since only the direction of $latex \boldsymbol{w}$ matters for prediction, this gives the algorithm an extra knob with which to balance the strength of the learning force against regularization and thermal noise such that the steadiest possible nonequilibrium state can be maintained (in other words, it can dynamically adjust the learning rate that applies when the prediction is correct). Whether this leads to more or less dissipation than Hebbian learning is by no means obvious; note that both terms in the steady state entropy production scale with the square of the learning force. However, it does suggest that some interesting form of adaptation is taking place in the $latex \alpha$-tron.

Finally, we have not said much about an essential aspect of the systems for which dissipative adaptation is relevant, namely that they are complex. Sophisticated machine learning algorithms such as deep convolutional neural networks with nonlinearities in every layer are arguably quite complex, and biological systems that can learn or adapt, such as brains, immune systems, gene regulatory networks, ecological networks, and so on, are some of the most complex systems known. It is for such systems that dissipative adaptation can emerge, because strong fluctuations can help the system overcome energy barriers that would otherwise confine it to a small region of its state space. The perceptron, on the other hand, is not a complex system at all; it is a simple tug-of-war between a learning force that fluctuates a little but always points in the same general direction, a deterministic regularization force derived from a quadratic potential, and, in our version, some thermal noise sprinkled on top. Moreover, in certain limiting cases, the mean learning force is a simple function of $latex \boldsymbol{w}$ that can be thought of as an extra term in the potential energy: for example, with Hebbian learning it is constant, and in the $latex n \rightarrow \infty$ limit the effective potential is,

$latex V(\boldsymbol{w}) = \frac{k}{2} \boldsymbol{w}^2 – \frac{\nu}{\sqrt{n}} \mathrm{sign}(\boldsymbol{T}) \cdot \boldsymbol{w} &s=1$ ,

where the sign function applies elementwise. Hebbian learning in the perceptron is thus little more than thermal equilibration in a quadratic potential—the simplest system imaginable!

All this is to say that one should probably not expect dissipative adaptation to be a viable learning mechanism in the perceptron, even though the $latex \alpha$-tron algorithm may be a worthwhile object for further study. But the more interesting takeaway is perhaps that there is a vast, uncharted territory waiting to be explored by taking an adaptation perspective on complex systems that learn, such as deep neural networks and other learning algorithms with complex loss landscapes, as well as biological learning systems.


For the EE 376A outreach event, rather than discussing perceptrons and thermodynamics with elementary school children I wanted to convey the idea of complex systems that respond to external forcing by adopting far-from-equilibrium states where weird things happen. The obvious choice was to demonstrate corn starch and water!

A suspension of corn starch in water in the right proportions (about 5/8 corn starch) forms a strongly shear-thickening fluid, which runs like a liquid when left to itself but acts as a solid as soon as a force is applied. This is cool already, but the real fun happens when subjecting the mixture to strong vibrations. Take a look at this video, recorded at the outreach event:

When the vibrations are at their strongest, the puddle almost seems to morph into a primitive life form. The suspension exhibits strongly nonlinear interactions that respond to the external forcing in a very unpredictable manner, and the net effect is the dancing and crawling we see in the video. Even without a detailed understanding of flocculation at the microscopic level, it is clear that the vibrations push the suspension very far away from equilibrium and into highly dissipative states that must absorb a lot of energy from the speaker to persist. In other words, it might not be unreasonable to interpret this as a simple example of dissipative adaptation: the suspension, being a complex system, responds to the forcing from the speaker by finding a corner in state space where the amount of energy absorbed from the speaker as work and dissipated to the environment as heat is unusually high.

My spiel to the kids was that the fluid is a thrill seeker and wants to get the biggest buzz it can out of the vibrating speaker, so what can it do but learn the rhythm and start dancing? In the same way, our brain wants to get the biggest buzz it can out of the signals it receives through our senses, so it has to pay attention, figure out common patterns, and play along—and that is what we call learning!

To the extent the last statement carries any substance, it is also just speculative conjecture. For anyone wanting to find out more, thermodynamics provides one possible link between the complex dynamics of brains in a noisy environment and the information measures that are the natural currency of a learning system.

How Do Cell Phones Work?

EE376A (Winter 2019)

by Yousef Hindy ’19


Cell phones are all around us. We use them every day to communicate instantly with friends, family, and random internet strangers across the globe. The fact that we are able to send messages so quickly is truly a technological marvel and brings together innovations in physics, electrical engineering, information theory, computer science, and more. This blog post hopes to illuminate how these magical devices work.

There is no way that this post will be able to cover every aspect of wireless communication (as you can devote your entire life to studying the subject), but I hope to distill it into manageable components that allow for further study if desired.

Each section will be broken up into two subsections: one accessible to those with a non-technical background and one for those with more physics and math knowledge. The more general sections will be denoted by a 😎, while the more advanced sections will be denoted by a 🤓. They are as follows:

  1. A Brief History of Communication
  2. The Communication Problem
  3. The Layered Model of Communication
  4. A Primer On Electromagnetic Waves (Radio Waves & Light)
  5. How Information is Encoded in Light (Modulation)
  6. Antennas
  7. The Network of Cell Towers
  8. Putting It All Together

Hopefully, this post will give you a taste of how the theory of information works closely with other fields of science and engineering to produce massively effective and efficient systems.

A Brief History of Communication

Just to give a sense of where we are and where we have come from, here is a by-no-means-complete timeline of the history of communication:

  • ~1.75 Million Years Ago: First humans use spoken language to yell at kids for playing with rocks.
  • ~3000 BCE: Ancient Mesopotamians develop Sumerian, the first written language, to write down their laws, stories, and more.
  • ~540 BCE: Cyrus of Persia establishes one of the first postal services, where messages traveled along a network of roads.
  • ~1200 CE: People begin to use carrier pigeons to transport messages in a one-way fashion.
  • ~1450 CE: Johannes Gutenberg invents the printing press, allowing for mass production of books and other writing.
  • ~1650 CE: British navy pioneers using flags as signals for different messages at sea.
  • May 24, 1844: Samuel Morse sends first telegraph message: “What hath God wrought?”
  • March 10, 1876: Alexander Graham Bell makes first telephone call.
  • Late 1880s: Heinrich Hertz conducts experiments sending and receiving radio waves.
  • Early 1920s: Bell Labs tests car-based telephone systems.
  • 1973: Motorola’s Martin Cooper invents the first handheld cellular phone.
  • 2007: First iPhone
  • 2019: 5G cell phone technology

The Communication Problem

The basic idea of communication is to move information from point A to point B as accurately and efficiently as possible. Let’s take a closer look at what this actually means.


According to Merriam-Webster, information is defined as “knowledge obtained from investigation, study, or instruction.” This is a nice colloquial definition, but not exactly what we are going for in an information theoretic sense.

To get a better sense of what information really is, consider the following situation. Imagine that you live in a town called Bagelville without any cell phones and you and a friend are trying to meet up to grab some of your town’s famous bagels. Assume that Bagelville only has four bagel shops, numbered 1, 2, 3, and 4 respectively. You know that he is waiting for you at one of the four, but you don’t know which one. In your mind, he has an equal chance of being at any of the four. At this point, you have minimal information about your friend. As you walk out the front door, you notice that there is a note with the number 2 written on it signed by your friend. Now, you have reduced your uncertainty about where he is. In some sense, you have gained information about your friend. In addition, your friend has encoded the information about which shop he is in by writing it down for you.

Another way to look at it is that information is anything that can be described. An encoding is, therefore, the description that you choose that uniquely determines what you have in mind. It’s important to note that not all encodings are created equal. Instead of writing the number 2, your friend could have also written: “go to the bagel shop next to the hardware store on the corner of Main Street.” These two descriptions may be about the same location and after reading them you may have the same place in mind, but one of them is more concise than the other.

Looking at information in a general sense, information theorists try to find ways to encode information efficiently (usually in bits, i.e. 0’s and 1’s), since smaller descriptions are easier to send and receive than bigger ones. There are even cool mathematical theorems stating that the most efficient way to encode information is with bits!
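To make this concrete, here is a minimal Python sketch (the function names are my own invention) of encoding one of Bagelville’s four shops in bits: with four equally likely options, exactly two bits are enough.

```python
import math

def encode_shop(shop: int, num_shops: int = 4) -> str:
    """Encode a 1-indexed shop number as a fixed-length bit string."""
    bits_needed = math.ceil(math.log2(num_shops))
    return format(shop - 1, f"0{bits_needed}b")

def decode_shop(bits: str) -> int:
    """Recover the shop number from its bit string."""
    return int(bits, 2) + 1

note = encode_shop(2)      # your friend's note, written as bits
print(note)                # 01
print(decode_shop(note))   # 2
```

A longer description like “go to the bagel shop next to the hardware store” carries the same information but takes far more bits to write down.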

The communication problem can be summarized by the following diagram:

[Diagram: a message sent from point A to point B through a noisy channel]

You have some message (in this case, “hey, want to hang out?”) that you want to send to your friend. You want him to be able to reconstruct the information in your message, so you send it across a channel that has noise. In other words, when you send your message, there is a chance that there is some distortion, e.g. that your message becomes “hey, want to work out?”

Another way to think of noise is to think of you and your other friend in the bagel shop in Bagelville. You’ve been dying to tell her about your newfound obsession with cat memes, so you try to communicate using your words. Since you chose the most popular shop in town, there are tons of people also talking and music playing in the background that makes it harder for your friend to understand you. This process is also an example of noise. You can think of noise as anything that makes it harder to discern the actual information of a message. Information theorists call the actual content of the message the signal, and one of the goals of engineering communication systems is to maximize the ratio between the signal and the noise, called the signal to noise ratio or SNR for short.

In order for your friend to understand you in the crowded shop, you need to speak louder (therefore increasing the SNR). This often has an adverse effect, however, as your conversation contributes noise to others, and they increase their conversation volume to compensate. This increases the amount of noise you have to deal with, and you speak louder as a result. This is a form of a feedback loop.


Information turns out to be something that you can quantitatively measure. Given the link between information and uncertainty, we can define a quantity, called entropy, that describes how much uncertainty we have about a given source of information. To make things more concrete, we need to utilize the mechanics of probability theory.

Consider a discrete random variable $latex U$ that can take values from an alphabet $latex \mathcal{U} = \{ u_1, u_2, \dots, u_M\}$. For each value $latex u_i$, there is a probability $latex p(U=u_i)$ that denotes how likely $latex U$ is to take on the value $latex u_i$. We can now define the “surprise” function: $latex S(u) = \log \frac{1}{p(u)}$ (here $latex \log = \log_2$). To gain some intuition about this object, consider the extremes of the input. If $latex p(u) = 1$, then the surprise function evaluates to zero, and if $latex p(u) \ll 1$, then the surprise function will be very large. We are now ready to define the entropy of the random variable $latex U$, denoted $latex H(U)$.

$latex H(U) = \mathbb{E}[S(U)] = -\sum_u p(u) \log p(u)$

You can think of the entropy as the expected surprise of the random variable. In other words, on average, how surprised are you when you see a manifestation of the random variable?

Entropy is linked to information in the following sense. The more entropy a random variable has, the more information is required to describe it. Think about the extreme situations. If a random variable is essentially deterministic (say always takes the value “1”), then you can just convey that it always takes on “1”. But if a random variable is uniformly distributed, you need to describe that it is sometimes “1”, sometimes “2”, sometimes “3”, etc. This definition of entropy can be expanded to more complex distributions, e.g. joint and conditional ones, by replacing the probability in the log with the desired distributions.
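The definitions above are easy to compute directly. Here is a small Python sketch (function names mine) that checks the two extremes we just discussed:

```python
import math

def surprise(p: float) -> float:
    """S(u) = log2(1/p): the rarer the outcome, the bigger the surprise."""
    return math.log2(1 / p)

def entropy(dist) -> float:
    """H(U) = expected surprise = -sum_u p(u) log2 p(u) (terms with p=0 contribute 0)."""
    return sum(p * surprise(p) for p in dist if p > 0)

# Four equally likely bagel shops: maximal uncertainty, 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A nearly deterministic variable: almost no information needed.
print(entropy([0.99, 0.01]))  # ~0.08
```

The uniform distribution over four shops needs the full 2 bits from the earlier example, while the nearly deterministic one needs almost nothing.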

Now that we have entropy, we can also define a measure of information called mutual information. It essentially describes how easy it is to reconstruct one random variable from another. It is defined as:

$latex I(X;Y) = H(X) - H(X|Y)$

Note that if $latex X$ and $latex Y$ are independent, then the mutual information will be 0.

But how does this all relate to communication? Consider the diagram below…

[Diagram: the message $latex X^n$ passes through a noisy channel $latex P_{Y^n|X^n}$ to produce $latex Y^n$]

Taken from EE376A Lecture 7 Notes 2018

Essentially, we are trying to convey a message $latex X^n$ to our friend through a noisy channel which produces a (potentially) altered message $latex Y^n$ based on the statistics of the channel $latex P_{Y^n|X^n}$. Note that given this probability distribution for the channel (and a little more information about $latex X$), we can calculate the mutual information between $latex X$ and $latex Y$. The higher the mutual information, the better the channel and the easier it is to reconstruct the original message.
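For a concrete feel, here is a hedged Python sketch for the simplest noisy channel, the binary symmetric channel (not discussed above, but a standard example), which flips each bit with some crossover probability. With a uniform input, the mutual information has a closed form:

```python
import math

def h2(p: float) -> float:
    """Binary entropy function, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(crossover: float) -> float:
    """I(X;Y) for a binary symmetric channel with a uniform input:
    H(Y) = 1 bit and H(Y|X) = h2(crossover), so I = 1 - h2(crossover)."""
    return 1.0 - h2(crossover)

print(bsc_mutual_information(0.0))  # 1.0: noiseless channel, perfect reconstruction
print(bsc_mutual_information(0.5))  # 0.0: pure noise, X and Y independent
```

As claimed in the text, the noisier the channel, the lower the mutual information and the harder it is to reconstruct the original message.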

The Layered Model of Communication

Telecommunications is a complex problem to tackle. To make it manageable, people much smarter than me have developed the Open Systems Interconnection (OSI) model. The OSI model has seven layers of communication that incorporate both software and hardware components. The layers are described below.

The general idea is that when a user interfaces with layer 7 (an application, e.g. email) and hits “send”, the information is converted into simpler and simpler forms as it moves down the layers and then gets sent over the physical link in its simplest form. The intended recipient of the message then receives the physical signal and reconstructs it into something legible to a human. Each layer has protocols (read: languages and customs) that everyone agrees on so that data processing can happen seamlessly.

Since I will mainly focus on layer 1 in this post, I will give a quick overview of what the different layers do and examples of their manifestations.

  1. The Physical Layer: This layer is responsible for the barebones transport of bits (0s and 1s) from point A to point B. This can be done through voltages, light, or other media. The protocols specify all the basics about the communication: the rate of data transfer, what a 0 looks like, what a 1 looks like, etc. Examples of the physical layer include Bluetooth, USB, and Ethernet.
  2. The Data Link Layer: The data link layer connects two directly connected devices (e.g. a computer and a router on the same wifi) in a network and helps establish when they can talk to each other. This layer also performs some rudimentary error correction on the physical layer when it messes up. The most well-known of these protocols is the Media Access Control (MAC) which gives permission to devices on a network to talk to each other.
  3. The Network Layer: This layer connects different local networks together. You may have your wifi router in your home in California, but you want to send an email to someone in Florida. The Internet Protocol (IP) provides a way to find efficient routes from one node in a network to another.
  4. The Transport Layer: This layer ensures that all of the data that you are trying to send accurately makes it to the intended recipient. An example of this is the Transmission Control Protocol (TCP).
  5. The Session Layer: When two devices need to talk to each other, a session is created that stays open as long as the devices are communicating. This layer handles the mechanics of setting up, coordinating, and terminating a session.
  6. The Presentation Layer: When a raw stream of bits is received, it is not very useful unless you know what they are for. If you took the bits for a picture and put them in a text editor, you would get something really weird, but when you open them with picture viewing software, e.g. Preview, you can clearly see what the image is. The presentation layer can be thought of like the different file types that people use. Examples include JPEG, GIF, and more.
  7. The Application Layer: This is where you, the user, come in. The application is the thing that presents the data to the end user, whether it be a web browser or email.

Don’t worry if you didn’t understand all of that. For the rest of the post, we will focus mainly on the physical link.

A Primer on Electromagnetic Waves

Electromagnetic (EM) waves are essential to nearly everything we do. They often act as a carrier of information that travels INCREDIBLY fast. Take light as an example. Light is essential for our visual faculties, and photons (little particles of light) that bounce off of the objects around us reach our eyes almost instantly, giving us a real-time view of what is happening. Because of this, it is often said that “light is the carrier of information.” This section aims to give you a basic background on what EM waves are and why they are important to wireless communication.


You may be familiar with electricity and magnetism, but what you may not know is that they are two sides of the same coin. When the strength of something electric changes, magnetism pops up. When the strength of something magnetic changes, electricity pops up. EM waves are bundles of electricity and magnetism that change in such a way that creates a self-reinforcing cycle.


A Diagram of an Electromagnetic Wave (Source)

As you can see in the diagram above, the strength of the electricity and magnetism “waves” from high to low, creating a “waving” pattern in the other force. James Clerk Maxwell, known as the father of electromagnetism, discovered that these waves can propagate in free space at the speed of light ($latex c \approx$ 186,000 mi/s). To put that speed into perspective, if you were traveling at the speed of light, you could travel from New York to London in about 18 milliseconds.

How quickly the strength of the electricity and magnetism changes is called the frequency of the wave. Frequency is measured in a unit called Hertz, which is basically the number of peaks that you see in a second. The more peaks per second, i.e. the higher the frequency, the more energy that wave has. Frequency is closely related to the wavelength of the wave, which is the spatial distance between consecutive peaks. The two are related by the equation $latex v = \lambda f$, where $latex \lambda$ is the wavelength (measured in meters), $latex f$ is the frequency (measured in Hertz $latex = 1/s$), and $latex v$ is the speed of the wave (measured in meters per second). The larger the wavelength at a given speed, the smaller the frequency. Electromagnetic waves span a wide range of frequencies, and the different “bands” are used for different purposes. The spectrum is shown below.
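The relation $latex v = \lambda f$ is easy to play with numerically. A quick sketch (in free space, $latex v = c$; the example frequencies are my own picks):

```python
C = 299_792_458  # speed of light in free space, m/s

def wavelength(frequency_hz: float) -> float:
    """lambda = c / f for an electromagnetic wave in free space."""
    return C / frequency_hz

# FM radio (~100 MHz) has meter-scale waves; 2.4 GHz signals are centimeter-scale.
print(round(wavelength(100e6), 2))        # 3.0 (meters)
print(round(wavelength(2.4e9) * 100, 1))  # 12.5 (centimeters)
```

Note how raising the frequency by a factor shrinks the wavelength by the same factor, exactly as the equation predicts.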


The Electromagnetic Spectrum (Source)

You may be familiar with X-rays and microwaves from your everyday experience in the doctor’s office or the kitchen, but you might not know that visible light, i.e. the light that you see with your eyes, is made of the same stuff! For the purpose of cell phones, we are going to focus on the low-frequency end of the spectrum: radio waves.

Radio waves are great for telecommunications because they travel at the speed of light and are good at traveling long distances without getting disturbed. We will get into how cell phones turn your text messages into radio waves later.


(This section will be a little shorter, as I will assume more knowledge of EM waves.) Electromagnetic waves are a consequence of Maxwell’s Equations, a set of four differential equations unified by James Clerk Maxwell in the 1860s.

maxwell's equations

Maxwell’s Equations (Source)

Here, $latex \mathbf{E}$ is the electric field, $latex \mathbf{H}$ is the magnetic field, and $latex \mathbf{J}$ is the current density, which isn’t really important for waves traveling in free space. If you solve the last two equations for $latex \mathbf{E}$ and $latex \mathbf{H}$, you will find a form of the wave equation:

$latex \nabla^2 \mathbf{E} = \mu_0 \epsilon_0 \frac{\partial^2 \mathbf{E}}{\partial t^2}$

The Wave Equation for the Electric Field (Source)

This gives rise to sinusoidal solutions that look like the ones in the pictures above. In free space, the waves are allowed to have any frequency and propagate at the speed of light $latex c$.

Encoding Information Into Light

The process of encoding information into a wave is called modulation. It’s used all the time in radio communications and in cellular communications. In this section, I’ll try and give you a sense of how it works.


Let’s consider what happens when you type a text message into your phone. First, the letters that you type are turned into a series of bits (0s and 1s). There is a standard for doing this with individual letters called the ASCII standard. Each letter is assigned a number, and that number is converted into a series of bits. You can think of the bits as answering a yes or no question about the letter. The first bit could answer the question: “Is the letter in the first half of the alphabet?” If it is “1”, then we know it is in the first half and we can discard the other letters as a possibility. The second bit can then represent us asking, “Is the letter in the first half of the first half of the alphabet?” (or alternatively, “Is the letter in the first half of the second half of the alphabet?” if the first bit is a “0”). We continue this process until we know precisely what letter we have.
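The letters-to-bits step can be sketched in a few lines of Python (an illustration using 8-bit ASCII codes, not the exact encoding your phone uses):

```python
def text_to_bits(message: str) -> str:
    """Each character becomes its 8-bit ASCII code."""
    return "".join(format(ord(ch), "08b") for ch in message)

def bits_to_text(bits: str) -> str:
    """Group the bits back into bytes and look up each character."""
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

bits = text_to_bits("Hi")
print(bits)                # 0100100001101001
print(bits_to_text(bits))  # Hi
```

Each bit halves the set of remaining possibilities, just like the yes/no questions described above.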

Once we have the series of bits that we are going to send, we can turn the bits into a wave. There are several ways to do this, so I’ll explain the simplest one. The signal “waves” for a little bit to represent a “1” and doesn’t “wave” for a little bit to represent a “0”. The following diagram should make this a little more clear for the sequence 1011:

[Diagram: an on-off waveform encoding the bit sequence 1011]

Courtesy of Prof. Andrea Goldsmith

As you can see from the picture, once we receive the signal we can simply measure the amplitude of the wave for a given period to determine whether the current bit is a 0 or a 1. It’s important to note that the sender and the receiver must agree on how long these 0’s and 1’s last and at what frequency they are being sent, or else the message cannot be recovered. We will explain how these pulses are physically sent out in the next section.
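The on-off scheme just described can be simulated directly. A rough Python sketch (the carrier frequency and samples-per-bit values are arbitrary choices of mine):

```python
import math

def on_off_keying(bits, carrier_hz=5.0, samples_per_bit=100):
    """'Wave' during a 1 and stay flat during a 0, one bit per unit of time."""
    signal = []
    for i, bit in enumerate(bits):
        for n in range(samples_per_bit):
            t = i + n / samples_per_bit
            amp = 1.0 if bit == 1 else 0.0
            signal.append(amp * math.sin(2 * math.pi * carrier_hz * t))
    return signal

wave = on_off_keying([1, 0, 1, 1])
# Receiver: a bit is 1 if its stretch of the signal carries any energy.
spb = 100
recovered = [1 if max(abs(s) for s in wave[i * spb:(i + 1) * spb]) > 0.5 else 0
             for i in range(4)]
print(recovered)  # [1, 0, 1, 1]
```

Both sides here implicitly agree on the bit duration and carrier frequency, which is exactly the agreement the text says is required.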


The goal of modulation is to take a signal $latex x(t)$ (analog or digital) and turn it into an analog signal $latex y(t)$ that has some carrier frequency $latex \omega_c$, which can then be transmitted as an electromagnetic wave. There are two stages: modulation and demodulation.

  1. Modulation: We take the signal $latex x(t)$ and multiply it by the carrier $latex c(t) = \cos (\omega_c t)$ to get $latex y(t)$. We can then use an antenna (explained in the next section) to send this signal out.


The Process of Modulation (Courtesy of Prof. Joseph Kahn)

  2. Demodulation: Once we receive the signal $latex y(t)$, we multiply it by the carrier again to get

$latex v(t) = c(t) y(t)= x(t) \cos^2(\omega_c t) = \frac{1}{2}x(t) + \frac{1}{2} x(t) \cos(2\omega_c t)$

Note that after multiplying again by the carrier, we have a signal that consists of half the original signal plus a copy of it shifted up to twice the carrier frequency. If we then apply a filter that gets rid of the high frequencies, we are left with the original signal (scaled by 1/2). A schematic is shown below.


Demodulation (Courtesy of Prof. Joseph Kahn)
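The algebra above can be checked numerically. The sketch below (all parameters are arbitrary picks) modulates a slow cosine onto a 50 Hz carrier, multiplies by the carrier again, and averages over one carrier period as a crude low-pass filter; the factor of 2 undoes the 1/2 in the identity.

```python
import math

fc = 50.0   # carrier frequency, Hz (arbitrary choice)
fs = 5000   # sampling rate, Hz
N = int(fs * 1.0)
t = [n / fs for n in range(N)]

x = [math.cos(2 * math.pi * 1.0 * ti) for ti in t]  # slow 1 Hz "message"
carrier = [math.cos(2 * math.pi * fc * ti) for ti in t]
y = [xi * ci for xi, ci in zip(x, carrier)]  # modulation: y = x * c
v = [yi * ci for yi, ci in zip(y, carrier)]  # v = x/2 + (x/2)cos(2 w_c t)

# Crude low-pass filter: average over one carrier period, then scale by 2.
w = int(fs / fc)
recovered = [2 * sum(v[i:i + w]) / w for i in range(N - w)]

# Up to the filter's half-window delay, we get the message back.
centered = [math.cos(2 * math.pi * (t[i] + w / (2 * fs))) for i in range(N - w)]
err = max(abs(r - c) for r, c in zip(recovered, centered))
print(err < 0.05)  # True
```

A real receiver would use a proper filter design rather than a moving average, but the principle is the same: kill the $latex 2\omega_c$ term and keep the baseband.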

There is a litany of extra steps that need to be taken (compression, encryption, and error correction) if you are transmitting digital signals, but they are outside the scope of this post. There are plenty of resources on compression, encryption, and error correction if you are interested. The basic idea is that you need to turn your information into bits, make it as small as possible, make it so no one else can read what you are trying to write, and then make sure you haven’t made any mistakes. Once you’ve done all of this, then you can modulate your signal and send it along the airwaves!


So now that we can turn our information into something that can be sent as an electromagnetic wave, how do cell phones actually send signals out that get to other phones? If this description leaves you unsatisfied, check out this website for a comprehensive guide to how antennas work.


An antenna is basically a piece of metal connected to a special kind of battery that is able to send and receive electromagnetic waves. It does so by applying electricity in such a way that creates the particular wave (as described in the EM primer section). The size of the antenna matters, as it needs to be comparable to half of the wavelength that you are trying to send.

Each antenna also has a specific directivity, which is a measure of how concentrated in space the radiation is. Directivity is closely related to antenna size, and the larger the antenna, the more focused you can make your beam in one direction. Since your phone has a relatively small antenna, it generally radiates waves out isotropically, i.e. in all directions. Cell phone towers are basically huge antennas, and they are able to specifically beam radiation in the general direction of your phone rather than everywhere. This allows them to not waste power sending signals in every direction.

One important concept for antennas is the bandwidth. Since a single antenna can radiate multiple frequencies of radiation, we can define bandwidth as the difference between the highest and lowest frequencies it can radiate. This concept will become important when we discuss the cellular grid. Most cell phone systems operate at frequencies from a few hundred MHz up to a few GHz (billions of Hertz), with newer 5G systems pushing into the tens of GHz.


To make an antenna, we need a piece of wire hooked up to an alternating current source that can accelerate the electrons in the wire at the frequency we want to radiate. When the electrons move, they cause a change in the electric field that consequently causes a shift in the magnetic field around it. This process continues, and the result is electromagnetic waves. The antenna’s impedance also needs to match that of the incoming transmission line; otherwise, part of the signal is reflected back and power is lost. This mostly matters at higher frequencies, which is important for cellular communications.

In addition to affecting directivity (colloquially defined above), the size of an antenna dictates how many different frequencies it can radiate. A general rule of thumb is that an antenna of size $latex l$ can radiate wavelengths of length $latex \lambda = 2l$.
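Combining the $latex \lambda = 2l$ rule of thumb with $latex \lambda = c/f$ gives a quick way to estimate antenna sizes. A back-of-the-envelope sketch (the example frequencies are my own):

```python
C = 299_792_458  # speed of light, m/s

def half_wave_antenna_length(frequency_hz: float) -> float:
    """Rule of thumb from the text: an antenna of length l radiates
    wavelength 2l, so l = (c / f) / 2."""
    return (C / frequency_hz) / 2

print(round(half_wave_antenna_length(100e6), 2))      # 1.5  (m, FM radio scale)
print(round(half_wave_antenna_length(2e9) * 100, 1))  # 7.5  (cm, cellular band)
```

This is part of why higher carrier frequencies let antennas shrink enough to fit inside a phone.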

One cool thing you can do with antennas is to put many of them together in an array. This allows you to interfere the waves with each other in such a way that increases your directivity. Arrays are also generally more efficient, as they can radiate more power and increase the signal-to-noise ratio of the incoming messages.

If you haven’t, make sure you read the definition of bandwidth above, as it will be important later.


An Example of an Array of Cellular Antennae on a Cell Phone Tower (Source)

The Network Of Cell Towers

There are nearly 100,000 cell phone towers in the United States alone. In this section, I’ll try to explain how your phone talks to the tower and then how your message gets to its intended recipient. The general setup is depicted below:


A Schematic of Cellular Communication


Cell phones are called “cell” phones because of the cellular nature of the arrangement of cell towers. Each one is given a region that it governs, and when your phone has service, it means that it is in communication with the nearest tower. A toy example of the grid is shown below:

[Diagram: a toy hexagonal grid of cells, one tower per cell]

The towers are arranged in a hexagonal lattice because it provides a nice tradeoff between the coverage quality and the number of towers. If they were arranged in circles, there would be blackout spots without any coverage, and if they were arranged in squares, then there would be a higher variability of signal strength in the cell.

Individual towers are connected by a series of fiber optic cables that use light (typically infrared laser light) to transmit information from tower to tower. In 4G systems, once a signal reaches a cell tower, it travels to the destination tower using the same technology as the internet. The exact nature of how a message gets from point A to point B is outside the scope of this post, but if you are interested, you can read more on the Internet Protocol. For messages to go overseas, e.g. to Europe or Asia, they travel through cables that have been laid down under the sea.


A Map of the Submarine Cables in the World (Source)


This grid of cellular towers would not work if all the towers were on the same frequency range. You can think of a frequency “band” as a channel over which communication can occur. If you and I try to use the same channel at the same time, our messages could interfere with each other and disrupt service. Towers that are next to each other therefore cannot be on the same frequency band. The hexagonal organization of the towers also allows for the same frequency to be reused with some spatial separation.

For each cell tower, there are hundreds or even thousands of phones trying to use cellular services at the same time. So how do cellular communication systems solve this problem? The answer lies in a technique called multiplexing. The basic idea of multiplexing is dividing the channel into different buckets by time or frequency and putting different messages in different buckets. Below is a depiction of what time-domain multiplexing looks like (where the different colors represent different users of the channel). Since cell phones operate at frequencies in the gigahertz range, they are able to fit in many time “buckets” per unit time.


Time Domain Multiplexing (Courtesy of Prof. Joseph Kahn)

Similarly, you can do the same thing in the frequency domain. Below is what frequency-domain multiplexing looks like (where again the different colors represent different users of the channel):


Frequency-Domain Multiplexing (Courtesy of Prof. Joseph Kahn)

You can combine the two schemes to maximize the amount of information per unit time. This is where the concept of bandwidth comes into play. If we have high bandwidth, we can fit many more “buckets” in the channel and therefore transmit information at a higher rate.
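A toy Python sketch of time-division multiplexing (the user names and slot counts are invented for illustration): each user gets one slot per frame, round-robin.

```python
def tdm_schedule(users, num_frames):
    """Assign each user one time slot per frame, in round-robin order."""
    return [(frame, slot, user)
            for frame in range(num_frames)
            for slot, user in enumerate(users)]

# Every frame repeats the same slot assignment, so each user gets a
# guaranteed 1/len(users) share of the channel.
for frame, slot, user in tdm_schedule(["Alice", "Bob", "Carol"], 2):
    print(f"frame {frame}, slot {slot}: {user}")
```

With more bandwidth, you can shorten the slots (or add frequency buckets on top), fitting more users or more data per user into the same stretch of time.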

Putting it all together

In summary, this is what happens when you press send on your message.

  1. Your message gets encoded into a series of bits that represent the information you are trying to convey in a concise and efficient manner.
  2. Your phone’s internal computer figures out how to modulate the signal so that it can be sent out as electromagnetic radiation.
  3. The cell phone’s antenna radiates the message with some meta-information (e.g. who the recipient is, what kind of data it is) to the nearest cell tower.
  4. The cell tower receives the message and decides which node in the network is the best to send the message to.
  5. Step 4 repeats as the message arrives at the cell tower closest to your friend.
  6. The final cell tower radiates the same signal to your friend’s phone.
  7. Your friend’s phone demodulates, decrypts, and decompresses the signal and displays it on the screen.

If you have any questions or want more resources, feel free to email me at yous.hindy@gmail.com, and I’d be happy to send you resources.


I’d like to thank Prof. Tsachy Weissman for advising me on this project and providing me with guidance and enthusiasm at every step of the way. I’d also like to thank Professors Jon Fan, Andrea Goldsmith, and Joseph Kahn for taking the time to meet with me and sharing the resources that made this post possible.

Outreach Event @ Nixon Elementary

On March 17, 2019, as part of the EE376A course, I presented my work to a group of students from Nixon Elementary School in Stanford, CA. They ranged from K-5th grade and found the topic pretty fascinating. I didn’t realize that most of them didn’t own cell phones, but they were all familiar with the ones that their parents use. It certainly was difficult explaining these topics at a 1st-grade level, but it made writing this post a lot easier, as I had to really think deeply about these topics and how they could be simplified. Below is a picture of the poster that was presented. I also had a deconstructed cell phone and was able to show them the various components on the phone’s board like the various antennas, microphones, speakers, etc.

[Photo of the poster presented at the outreach event]





(Fisher) Information and Quantum Mechanics

EE376A (Winter 2019)

By Carlos Gomez Uribe

Recently, I stumbled upon a well-known result in Quantum Mechanics (QM) while taking a class on the topic that seemed to scream information theory (IT). Being new to QM, I was unaware of any connections to IT beyond quantum communications, and what I was looking at clearly wasn’t about that. It turns out there is at least one more obvious and strong connection between these areas, which, surprisingly, is not based on the Shannon-entropy-related concepts that pervade IT. Rather, it is the Fisher information, less celebrated in IT, that plays a prominent role in QM. The fuller results can be found in the pdf below:


Information theory is the theory underpinning all of our digital communications: the storing, sending, and receiving of bits so widespread in our digital society, whether these describe songs, movies, photos, or this post. The starring concepts in information theory are the so-called Shannon entropy and its related quantities, e.g., the Kullback-Leibler (KL) distance between two probability distributions and the mutual information.

In addition to defining the communication technology of our times, connections and cross-pollination between information theory and machine learning and statistics appear ever more regularly. For example, it is now well known that strategies for parameter estimation for broad classes of useful statistical models, ranging from maximum likelihood for simpler models to variational inference for more complex models, turn out to be equivalent to minimization of the KL distance with appropriately chosen arguments. Information theory concepts have also inspired research and ways of thinking in biology, particularly in neuroscience and in signaling pathways, where in some cases it is obvious that organisms (including unicellular ones) survive and thrive only by reliably communicating external information about their environment to their control center (the brain or the nucleus).

It is similarly well known that the entropy of statistical mechanics is related to Shannon’s entropy, but I was unaware of additional connections between physics and information theory until recently, when I stumbled upon a well-known result in QM for the simple harmonic oscillator (SHO). The state of lowest energy in a QM system is called the ground state. Determining the ground state is one of the core problems in QM for general systems, partly because other excited states of the system can be more readily found from the ground state.

For the SHO, the ground state is readily found by minimizing the expected energy of the system, resulting in a Gaussian distribution for the oscillator’s position. But the Gaussian distribution also solves the well-known IT problem of maximizing the entropy with a known variance, and the optimization problem itself looks very similar to the minimization of energy problem in the SHO. Could it be true more generally that finding the ground state of a QM system is equivalent to maximizing the entropy given appropriate system-specific constraints? Unfortunately, it seems like the answer is no. However, the corresponding explorations helped me learn many things I did not know about the Fisher information, including the prominent role it plays in QM.

Quantum Mechanics

Arguably the simplest system for learning QM describes a single spinless particle evolving in a single spatial dimension $latex x&s=1$ under the influence of a position-dependent potential $latex V(x) &s=1$. The state of the system is given by its wave function $latex \psi(x)&s=1$, a complex-valued function of position that induces a probability density on the particle’s position $latex \rho(x) = |\psi(x)|^2 &s=1$.

Interestingly, position and momentum are both functions of the same wave function in QM. The probability of measuring a certain momentum $latex p &s=1$ is $latex \gamma(p) = |\psi(p)|^2, &s=1$ where $latex \psi(p) &s=1$ is the Fourier transform of $latex \psi(x) &s=1$. That the position and momentum wave functions are related by the Fourier and inverse Fourier transform is ultimately the reason why we find a connection between QM and the Fisher information.

Wave functions generally depend on time too, and evolve according to the Schrödinger equation $latex i\hbar\frac{\partial}{\partial t}\psi(x)=H\psi(x)&s=1,$ where $latex H&s=1$ is the Hamiltonian or energy of the system. The Hamiltonian is an infinite-dimensional (Hermitian) matrix given in appropriately rescaled units by $latex H(x,p) = V(x) + p^2 &s=1$. To solve the Schrödinger equation and find the state of the system, one needs to first find the eigenfunctions of the Hamiltonian, i.e., the functions that satisfy $latex H\psi(x) = E \psi(x) &s=1$ where the real-valued energy $latex E&s=1$ corresponds to the eigenvalue.

One often finds that there are an infinite but countable number of eigenfunctions. Any arbitrary wave function can then be expressed as a linear combination of all of the eigenfunctions. The ground state is the eigenfunction with the smallest eigenvalue, i.e., that which achieves the smallest energy with certainty. Equivalently, the ground state $latex \psi_o(x) &s=1$ is the wave function that minimizes the expected energy:

$latex E[H] = E[V(x)] + E[p^2] = E[V(x)] + \sigma_p^2 + E[p]^2 &s=1,$ where $latex E[\cdot] &s=1$ denotes expectation, $latex \sigma_p^2 &s=1$ is the variance in momentum, and $latex \psi_o(x) = \arg \min_\psi E[H]&s=1 $. Because $latex \psi(x) &s=1$ and $latex \psi(p) &s=1$ are related by the Fourier transform, shifting $latex \psi(p) &s=1$ by a constant to obtain $latex \psi(p - p_o) &s=1$ only adds a phase to $latex \psi(x) &s=1,$ leaving $latex \rho(x)&s=1$ unchanged. So the first step in the minimization above is to choose the origin in the momentum coordinates such that $latex E[p] = 0&s=1$. Similarly, we can shift the origin in position so that $latex E[x] = 0&s=1.$ Then, the ground state is:

$latex \psi_o(x) = \arg \min_\psi E[V(x)] + \sigma_p^2,&s=1$ where, ignoring a constant that does not affect the optimization, $latex E[V(x)] &s=1$ is a function of the variance and higher central moments of $latex x &s=1$.

Now for the interesting part, also an implication of the Fourier relationship between $latex \psi(x) &s=1$ and $latex \psi(p) &s=1$: $latex \sigma_p^2 = E[|\nabla \log \psi(x)|^2] = \frac{1}{4}E[|\nabla \log \rho(x)|^2] = J(x)/4, &s=1$ where $latex J(x) &s=1$ is the Fisher information with respect to a hypothetical location parameter. To get the last equality, note that for any probability density with a location parameter $latex \rho(x - \theta) &s=1$, $latex J(x) = E[|\frac{\partial}{\partial \theta}\log \rho(x - \theta)|^2] = E[|\frac{\partial}{\partial x}\log \rho(x - \theta)|^2] = E[|\frac{\partial}{\partial x}\log \rho(x)|^2], &s=1$ where the last step follows from a simple change of variables. This means that the ground state problem can be written as:
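This identity is easy to check numerically. The sketch below (all parameters are my own choices) computes $latex J(x) &s=1$ for a Gaussian density via finite differences of the score, and compares $latex J(x)/4 &s=1$ with the known momentum variance $latex 1/(4\sigma^2) &s=1$ of the corresponding real wave function $latex \psi = \sqrt{\rho} &s=1$.

```python
import math

# For rho = N(0, sigma^2), the location-parameter Fisher information is
# 1/sigma^2, and the momentum variance of psi = sqrt(rho) is 1/(4 sigma^2),
# so sigma_p^2 = J(x)/4 should hold. Grid parameters are arbitrary.
sigma = 1.5
n, L = 2001, 12.0
dx = 2 * L / (n - 1)
xs = [-L + i * dx for i in range(n)]
rho = [math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
       for x in xs]

# J(x) = E[(d/dx log rho)^2], with the score from central differences.
J = 0.0
for i in range(1, n - 1):
    score = (math.log(rho[i + 1]) - math.log(rho[i - 1])) / (2 * dx)
    J += rho[i] * score**2 * dx

print(J, 1 / sigma**2)            # J matches 1/sigma^2
print(J / 4, 1 / (4 * sigma**2))  # sigma_p^2 = J/4
```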

$latex \psi_o(x) = \arg \min_\psi E[V(x)] + J(x)/4,&s=1$ which is entirely in terms of space (and not momentum). In words, the ground state minimizes the sum of the Fisher information in position and the expected potential energy.

Now, think of the space of wave functions as the union of subspaces of wave functions with a specific expected potential energy, say $latex E[V(x)] = v &s=1.$ Then one can imagine first minimizing the Fisher information within each subspace, evaluating the resulting Fisher information and expected energy, and then choosing the subspace that achieves the minimum expected energy across all subspaces. The first step, minimizing the Fisher information given $latex E[V(x)] = v &s=1,$ is reminiscent of the well-studied problem of maximizing the (differential) entropy given the same constraint. The solution of the latter problem is a distribution in the exponential family. Unfortunately, minimizing the Fisher information is in general not as simple as maximizing the entropy given constraints, and has not been thoroughly studied as far as I can tell.

For the SHO, $latex V(x) = x^2 &s=1$, so $latex E[H] = \sigma_x^2 + J(x)/4 &s=1.$ Minimizing the Fisher information subject to a variance constraint is one of the few instances of the constrained Fisher information minimization problem with a known solution, and the minimizer is a Gaussian distribution. The same distribution solves the problem of maximizing the (differential) entropy given a variance constraint. Similarly, without any constraints, the uniform distribution both minimizes the Fisher information and maximizes the entropy among distributions defined on an interval.
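As a quick sanity check, one can restrict the minimization of $latex E[H] = \sigma_x^2 + J(x)/4 &s=1$ to centered Gaussians, for which $latex J = 1/\sigma_x^2 &s=1$: a scan over the variance (the grid here is my own arbitrary choice) recovers the minimizer $latex \sigma_x^2 = 1/2 &s=1$ and the known SHO ground-state energy of 1 in these rescaled units.

```python
# Restricted to centered Gaussians with variance s, the SHO objective is
# E[H] = s + J/4 = s + 1/(4s), since J = 1/s for a Gaussian.
def expected_energy(s):
    return s + 1 / (4 * s)

# Brute-force scan over a grid of variances (grid choice is arbitrary).
best_s = min((k / 1000 for k in range(1, 2001)), key=expected_energy)
print(best_s, expected_energy(best_s))  # 0.5 and 1.0
```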

An appealing conjecture, then, is that maximizing the entropy given constraints is equivalent to minimizing the Fisher information given the same constraints. Unfortunately, the answer seems to be that in general the two problems are not equivalent. E.g., Appendix A of the book Physics from Fisher Information by Frieden (a book full of interesting ideas and observations) claims that these problems coincide for the SHO and not otherwise.

Additional Learnings

Reading and thinking about this was fun, and taught me several things about the Fisher information that I didn’t know or had not thought about, including:

  1. Unlike the Shannon entropy, the Fisher information captures the local behavior of its functional argument. Take your favorite continuous distribution, e.g., a Gaussian, and compute its Shannon entropy. Now break your space into chunks, shuffle them around, and compute the entropy of the resulting shuffled distribution. You’ll find that the entropy remains unchanged, even though your distribution looks nothing like the one you started with. Now try the same experiment using the Fisher information. You will find that it changes significantly, because the Fisher information depends on the derivative of the density, which now has multiple points of discontinuity.
  2. Just like the Shannon entropy leads to derived useful concepts like the KL distance between distributions, the Fisher information also has derived concepts that can be similarly useful. E.g., the relative Fisher information is the analog to the KL distance between two distributions, and is given by $latex D_F(p|q) = \int p(x) \big|\frac{\partial}{\partial x}\log \frac{p(x )}{q(x)} \big|^2dx &s=1$ for any two distributions $latex p(x) &s=1$ and $latex q(x) &s=1$. There is also a Fisher mutual information which is what you probably already expect by now. See “Fisher Information Properties” by Pablo Zegers for examples and more on these quantities, including a Fisher data processing inequality.
  3. Unlike the Shannon entropy, it is easy to work with the Fisher information even when you only have an unnormalized distribution, a very common situation, e.g., when working with any undirected graphical model where the normalization is expensive or impossible to compute. This is because the integrand in the Fisher information $latex \frac{\partial}{\partial x}\log \rho(x ) = \frac{\frac{\partial \rho(x )}{\partial x}}{ \rho(x )} &s=1$ is independent of normalization.
  4. There are several known and interesting relations between the Fisher information and the Shannon entropy. The best known shows that adding an independent zero-mean Gaussian with a small variance $latex \epsilon &s=1$ to a random variable increases the Shannon entropy of the random variable by half its Fisher information times $latex \epsilon &s=1$. See, e.g., Section 16.6 of the Cover and Thomas information theory book for a derivation. By the way, this relation means that minimizing the Fisher information is equivalent to minimizing the entropy increase when a zero-mean Gaussian is added. So it seems almost intuitive that maximizing the initial entropy should achieve the minimum, but again, this seems to not be true in general, and is something I’d like to better understand. Another similar but lesser known relation between Fisher and Shannon, shown in the article “Interpretation and Generalization of Score Matching” by Siwei Lyu, relates the KL and relative Fisher distances: when a zero-mean Gaussian with variance $latex \epsilon &s=1$ is added to both distributions used as arguments in these distances, the KL distance changes by minus one half of the relative Fisher distance times $latex \epsilon &s=1$. The minus sign is very interesting, since it means that minimizing the relative Fisher information is equivalent to maximizing the rate of change of the KL distance, i.e., minimizing the rate at which the KL distance shrinks.
  5. There is much ongoing work in QM to better understand the properties and behaviors of the Fisher information, also known there as the von Weizsäcker energy. Accessible examples include the articles “On the realization of quantum Fisher information” by Aparna Saha from 2017, or “The Fisher information of single-particle systems with a central potential” by Romera et al. from 2005.
  6. The properties above have led to a few promising articles in machine learning that develop learning methods based on the Fisher information, starting with a series of articles including “Estimation of Non-Normalized Statistical Models by Score Matching” by Aapo Hyvarinen. There, one also finds Fisher informations based not on a location parameter but on a scale parameter, resulting in a Fisher information where the integrand is $latex |x\frac{\partial}{\partial x}\log \rho(x )|^2 &s=1$ (note the extra $latex x^2 &s=1$), which is useful for distributions on the positive real line. Generally, I bet that revisiting the learning methods where the KL distance is used as part or all of the underlying optimization objective, and adding the corresponding relative Fisher distance or replacing the KL distance by it, will result in useful new learning algorithms. Similarly, it is possible that some of these algorithms could then help find ground states in QM.
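To illustrate items 2 and 3 together, here is a small numerical sketch (all parameters are my own choices): the relative Fisher information between two same-variance Gaussians, computed entirely from unnormalized log-densities, matches the closed form $latex (\mu_1 - \mu_2)^2/\sigma^4 &s=1$, which follows directly from the definition above.

```python
import math

# Relative Fisher information D_F(p|q) between N(mu1, sigma^2) and
# N(mu2, sigma^2), with the score of p/q computed from *unnormalized*
# log-densities (item 3: the score is normalization-free).
mu1, mu2, sigma = 0.0, 0.7, 1.2
n, L = 4001, 14.0
dx = 2 * L / (n - 1)
xs = [-L + i * dx for i in range(n)]

def log_unnorm(x, mu):
    return -(x - mu)**2 / (2 * sigma**2)  # normalizer omitted on purpose

Z = sigma * math.sqrt(2 * math.pi)  # only needed for the outer expectation
DF = 0.0
for i in range(1, n - 1):
    # d/dx log(p/q) via central differences of the unnormalized log-densities
    dlog = ((log_unnorm(xs[i + 1], mu1) - log_unnorm(xs[i + 1], mu2))
            - (log_unnorm(xs[i - 1], mu1) - log_unnorm(xs[i - 1], mu2))) / (2 * dx)
    p = math.exp(log_unnorm(xs[i], mu1)) / Z
    DF += p * dlog**2 * dx

print(DF, (mu1 - mu2)**2 / sigma**4)  # the two values agree
```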

Overall, I expect the Fisher information still has much to contribute to science, statistics and machine learning. For example, how does the analog of variational inference work when using the relative Fisher information instead of KL? Does it lead to efficient and useful algorithms, or is the resulting optimization problem and plausible relaxations of it too complex?

If you want to read up some more on related topics here are some suggestions:

  1. A more complete and clear summary of my explorations.
  2. Information theory in biology. Chapter 6 in the Biophysics book by William Bialek is all about Shannon entropy in cell biology. I have not yet seen any work based on Fisher information in this context.
  3. Statistical mechanics and information theory. The classic paper is “Information Theory and Statistical Mechanics” by Jaynes.
  4. Several uncertainty relations based on several concepts of entropy, though not including Fisher information: “Entropic Uncertainty Relations and their Applications” by Coles et al.

Teaching About Information With Rocks

On a completely separate note, this class set up an Information Theory night at an elementary school in Palo Alto. Knowing kids’ interest in rocks, I brought a rock collection with crystals, fossils, etc., and used the rocks as an excuse to get the kids engaged. After letting them explore the rock collection, talking about which rocks they liked and why, how diverse the rocks are, etc., we played a version of the Guess Who game. In the first round, I would pick in my mind a rock out of the collection, and a kid got to ask me yes/no questions about it. In the second round, the roles would reverse, and the winner was the person who guessed the rock picked by the other player in the fewest questions. Then we talked about which kinds of questions seemed to eliminate the largest number of remaining rock candidates, and I told them that those questions reveal more information than questions that tend to eliminate only a few of the remaining candidates. I think it worked out pretty well, as explaining abstract topics to kids while keeping their interest can be a challenge.