Daniel Wennberg[efn_note]If anyone has checked in with this post between 2019-03-24 and 2019-04-03, you might have noticed that it has expanded substantially during that time. The post has now reached its final form. Doing it this way ended up being a necessary part of the well-being component of the course.[/efn_note]
Accounting for around 20 % of the body’s energy consumption, the human brain burns calories at 10 times the rate of your average organ when measured per unit mass.[efn_note]Raichle, M. E., & Gusnard, D. A. (2002). Appraising the brain’s energy budget. Proceedings of the National Academy of Sciences of the United States of America, 99(16), 10237–10239. https://doi.org/10.1073/pnas.172399499[/efn_note] However, its drain pales in comparison to that of the hardware used for training and inference in machine learning: a single GPU can consume several hundred watts, more than all the other components of a computer combined, and these days deep neural networks are routinely trained on systems containing more such processors than you can count on one hand.[efn_note]https://aws.amazon.com/ec2/instance-types/p3/[/efn_note]
Anyone who has felt the heat from the bottom of their laptop has experienced the energy cost of information processing within the current paradigms, but systems capable of learning from their inputs seem to have a particularly high appetite. Perhaps this can tell us something about what learning is at the most fundamental level? Here I will describe an attempt to study perhaps the simplest learning system, the perceptron, as a physical system subject to a stochastic driving force, and suggest a novel perspective on the thermodynamic performance of a learning rule. From an efficiency standpoint we may consider the transient energy expenditure during learning as a cost to be minimized. However, learning is also about producing a state of high information content, which in thermodynamic terms means a far-from-equilibrium and thus highly dissipative state. Perhaps a good learning rule is rather one for which minimizing loss and maximizing heat dissipation go hand in hand?
To explore this question, we need to know a couple of things about stochastic thermodynamics. This is a rather specialized field of study in its own right, and I’m only aware of three papers in which it is applied to learning algorithms. Therefore, in order that this text may be enjoyed by as many as possible, I will spend the bulk of it laying out the basics of stochastic thermodynamics, how it connects to information theory, and how it can be applied to study learning in the perceptron. I will conclude by presenting one novel result, but this is merely a taste of what it would actually take to properly answer the question posed above.
Thermodynamics and information
Classical thermodynamics is a self-contained phenomenological theory, in which entropy is defined as a measure of irreversibility. If the change in the total entropy of the universe between the beginning and end of some process is zero, it could just as well have happened the other way around, and we say that the process is reversible. Otherwise, it’s a one-way street: the only allowed direction is the one in which entropy increases, and the larger the entropy increase, the more outrageous the reverse process would be, like the unmixing of paint or the spontaneous reassembly of shards of glass. This is the content of the second law of thermodynamics, that total entropy can never decrease, and this is the reason we cannot solve the energy crisis by simply extracting thermal energy (of which there is plenty) from the ocean and atmosphere and converting it into electricity.
However, thermodynamics only makes reference to macroscopic observables such as temperature, pressure, and volume. What if a hypothetical being had the ability to observe each individual molecule in a reservoir of gas and sort them into two separate reservoirs according to their kinetic energy? That would reduce the entropy in the gas and allow us to extract useful work from it. This being is known as Maxwell’s demon, and the resolution to the apparent violation of the second law builds on connections between thermodynamics and the mechanics of the gas molecules as well as the information processing capabilities of the demon, provided by the framework of statistical mechanics. If we assume a probability distribution (or, in physics parlance, an ensemble) over the fully specified states of the fundamental constituents of the system, the entropy of the system can be defined as the corresponding Shannon entropy, in physics called Gibbs entropy. That is, if $x$ represents a possible microscopic state and $p(x)$ is the corresponding probability mass, the Gibbs entropy is,
$$S = -\sum_x p(x) \ln p(x).$$
Thermodynamic observables correspond to various means under this distribution, and an equilibrium state is represented by the probability distribution that maximizes entropy for a set of constraints on thermodynamic observables.[efn_note]For example, the internal energy $U$ is the mean of the state-dependent energy $E(x)$, and maximizing entropy while keeping this fixed gives the Boltzmann distribution for a system in equilibrium with a constant-temperature reservoir; in the entropy maximization procedure, temperature emerges as the Lagrange multiplier that determines the value of $U$.[/efn_note] Thus, the thermodynamic entropy can be interpreted as the amount of information swept under the rug by describing a system with many degrees of freedom in terms of only a small set of observables.
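To make the maximum-entropy statement concrete, here is a minimal Python sketch. It assumes a toy system with three states, energies chosen arbitrarily for illustration, and units where Boltzmann's constant is 1. The function names `gibbs_entropy` and `boltzmann` are hypothetical helpers, not part of any library. The sketch perturbs the Boltzmann distribution in a direction that preserves both normalization and mean energy, and compares the entropies.

```python
import numpy as np


def gibbs_entropy(p):
    """Gibbs/Shannon entropy -sum p ln p in nats, skipping zero-probability states."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(-np.sum(p[nonzero] * np.log(p[nonzero])))


def boltzmann(energies, temperature):
    """Boltzmann distribution p ~ exp(-E / T), with Boltzmann's constant set to 1."""
    e = np.asarray(energies, dtype=float)
    weights = np.exp(-(e - e.min()) / temperature)
    return weights / weights.sum()


# Illustrative three-state system (energies are an arbitrary choice).
energies = np.array([0.0, 1.0, 2.0])
p_eq = boltzmann(energies, temperature=1.0)

# This perturbation sums to zero and has zero mean energy for these energies,
# so p_other satisfies the same constraints as p_eq.
delta = 0.01 * np.array([1.0, -2.0, 1.0])
p_other = p_eq + delta

print("entropy of Boltzmann distribution:", gibbs_entropy(p_eq))
print("entropy of perturbed distribution:", gibbs_entropy(p_other))
```

Any other distribution with the same mean energy should come out with a lower entropy than the Boltzmann one, which is the sense in which equilibrium discards the most information.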
This is perhaps the fundamental sense in which we can claim that information is physical: through entropy, the constraints on the time evolution of a system depend on which information about the system is kept and which is discarded. A more operational take on the same sentiment is that the amount of useful work we can extract from a system depends on the amount of information we have about it.[efn_note]Claims like these have spurred some controversy, mainly because they seem to render thermodynamics subjective. I believe that the objections are based on a subtle misunderstanding of the claim: As long as a system is left to equilibrate with an environment, the thermodynamically relevant entropy is the amount of information ignored when describing the system in terms of the equilibrating potentials only, and this is not a subjective notion. A reduced information entropy due to knowing details that the potentials don’t reveal only becomes thermodynamically operational if we can also manipulate the system-environment interactions based on this information.[/efn_note]
Fluctuations in heat and entropy production[efn_note]For a comprehensive reference, see Seifert, U. (2012). Stochastic thermodynamics, fluctuation theorems and molecular machines. Reports on Progress in Physics, 75(12), 126001. https://doi.org/10.1088/0034-4885/75/12/126001[/efn_note]
The deterministic nature of classical thermodynamics corresponds to our experience in daily life: water predictably freezes at exactly $0\,^\circ\mathrm{C}$ and boils at exactly $100\,^\circ\mathrm{C}$. However, if we reduce the system size such that the number of degrees of freedom is closer to 1 than to Avogadro’s number, fluctuations become significant, and we find that classical thermodynamics only describes average behavior. Assuming a separation of timescales between the fast dynamics of the thermal environment and the slow intrinsic dynamics of the system, this can be modeled as a Markovian process represented by a stochastic differential equation that physicists refer to as a Langevin equation. In the case of an overdamped system (that is, a system with negligible inertia), the equation reads,
$$d\mathbf{x} = \mu \mathbf{F}(\mathbf{x}, t)\,dt + \sqrt{2D}\,d\mathbf{W}.$$
Here, $\mathbf{x}$ is a vector of length $N$ representing the state of the system, while $\mathbf{F}(\mathbf{x}, t)$ is the driving force at time $t$, and $d\mathbf{W}$ is a vector of uncorrelated stochastic increments with variance $dt$ (Wiener increments), modeling the unpredictable kicks that the system receives courtesy of the thermal energy in the environment. The mobility $\mu$ and diffusion $D$ are related by the Einstein relation $D = \mu T$ (in units where Boltzmann’s constant is 1), where $T$ is the environment temperature.
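As an illustration of how such an overdamped Langevin equation can be integrated numerically, here is a minimal Euler-Maruyama sketch in Python. The harmonic force $-kx$, the parameter values, and the function name `euler_maruyama` are assumptions made for this example only. Many independent copies of a one-dimensional system are evolved in parallel so that the ensemble variance can be compared with the equilibrium value $T/k$ expected from the Boltzmann distribution.

```python
import numpy as np


def euler_maruyama(force, x0, mu, D, dt, n_steps, rng):
    """Integrate dx = mu * F(x, t) dt + sqrt(2 D) dW with the Euler-Maruyama scheme."""
    x = np.array(x0, dtype=float)
    for step in range(n_steps):
        t = step * dt
        # Wiener increments: independent Gaussians with variance dt
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + mu * force(x, t) * dt + np.sqrt(2.0 * D) * dW
    return x


# Illustrative parameters (assumptions, not taken from the text)
k = 1.0            # stiffness of the harmonic potential V = k x^2 / 2
mu, T = 1.0, 1.0   # mobility and temperature
D = mu * T         # Einstein relation with Boltzmann's constant set to 1

rng = np.random.default_rng(0)
x_final = euler_maruyama(lambda x, t: -k * x, np.zeros(5000), mu, D,
                         dt=0.01, n_steps=1000, rng=rng)

print("ensemble variance:", x_final.var())
print("equilibrium value T / k:", T / k)
```

The finite step size introduces a small bias in the stationary variance, so the two printed numbers should agree only approximately.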
The first law of thermodynamics states that energy is always conserved, and is often written $\Delta U = W - Q$, where $\Delta U$ is the change in internal energy of a system, $W$ is the work performed on the system from the outside, and $Q$ is the heat dissipated to the environment. We can apply this law to any infinitesimal time step of any particular trajectory arising from the Langevin equation if we identify heat dissipated with the net force times displacement, $dq = \mathbf{F} \cdot d\mathbf{x}$. The infinitesimal work from the outside must then be $dw = dV + dq$, where $dV$ is the infinitesimal change in any potential energy that contributes a term to the net force (we do not consider overdamped systems to have kinetic energy).[efn_note]It is perhaps a little puzzling that the heat is identified with net force times displacement, rather than the work. This is due to the absence of kinetic energy in overdamped systems and the fact that we are interested in work from the outside, while the net force contains a term internal to the system due to the gradient of the potential energy.[/efn_note]
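The bookkeeping can be checked on a single simulated trajectory. The sketch below is a toy example under stated assumptions: a one-dimensional system in a static harmonic potential (so no work is done from the outside), illustrative parameter values, and the force-times-displacement product evaluated at the midpoint of each step (the Stratonovich convention commonly used in stochastic energetics, assumed here). With no outside work, the first law reduces to the heat dissipated being equal to the drop in potential energy.

```python
import numpy as np

# Illustrative parameters (assumptions for this sketch)
k, mu, D, dt, n_steps = 1.0, 1.0, 1.0, 0.01, 2000
rng = np.random.default_rng(1)


def potential(x):
    return 0.5 * k * x**2


def force(x):
    return -k * x


x_start = 2.0
x = x_start
heat = 0.0  # heat dissipated to the environment along the trajectory
for _ in range(n_steps):
    dx = mu * force(x) * dt + np.sqrt(2.0 * D * dt) * rng.normal()
    x_new = x + dx
    # heat increment: net force at the step midpoint times the displacement
    heat += force(0.5 * (x + x_new)) * dx
    x = x_new

delta_U = potential(x) - potential(x_start)
print("heat dissipated:", heat)
print("change in internal energy:", delta_U)
```

For a quadratic potential the midpoint rule makes each heat increment exactly equal to the decrease in potential energy over the step, so the two printed numbers should be equal and opposite up to rounding.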
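Having defined the heat dissipation $dq$, we can use the standard thermodynamical definition of change in entropy in the environment: $ds_\mathrm{env} = dq / T$, where $T$ is the temperature. However, in order to define the entropy of the system, we also need a probability distribution $p(\mathbf{x}, t)$, provided by the Langevin equation in combination with an ensemble for the initial state. We define the system entropy on a particular trajectory as the information content, or surprise, of that particular realization at each time, $s_\mathrm{sys}(t) = -\ln p(\mathbf{x}(t), t)$, such that the ensemble average becomes the system Gibbs entropy,
$$S_\mathrm{sys}(t) = \langle s_\mathrm{sys}(t) \rangle = -\int p(\mathbf{x}, t) \ln p(\mathbf{x}, t)\, d\mathbf{x}.$$
In order to express other ensemble-averaged quantities, we introduce the probability current,
$$\mathbf{j}(\mathbf{x}, t) = \mu \mathbf{F}(\mathbf{x}, t)\, p(\mathbf{x}, t) - D \nabla p(\mathbf{x}, t),$$
such that we can write the equation of motion for the probability density, known as the Fokker-Planck equation, as,
$$\frac{\partial p(\mathbf{x}, t)}{\partial t} = -\nabla \cdot \mathbf{j}(\mathbf{x}, t).$$
Using this we can show that the entropy produced in the environment is,
$$\dot{S}_\mathrm{env} = \frac{1}{T} \int \mathbf{F}(\mathbf{x}, t) \cdot \mathbf{j}(\mathbf{x}, t)\, d\mathbf{x},$$
and we find the total entropy production to be nonnegative, as required by the second law,
$$\dot{S}_\mathrm{tot} = \dot{S}_\mathrm{sys} + \dot{S}_\mathrm{env} = \int \frac{|\mathbf{j}(\mathbf{x}, t)|^2}{D\, p(\mathbf{x}, t)}\, d\mathbf{x} \geq 0.$$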
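A small numerical check can make the nonnegativity of the total entropy production tangible. The Python sketch below assumes the standard form of the probability current for an overdamped system, $j = \mu F p - D\,\partial_x p$, and evaluates $\int j^2 / (D p)\, dx$ on a one-dimensional grid. The harmonic force, the two Gaussian test distributions, and the helper names are illustrative assumptions.

```python
import numpy as np


def total_entropy_production(p, force, x, mu=1.0, D=1.0):
    """Integrate j^2 / (D p) over a uniform 1D grid, with j = mu F p - D dp/dx."""
    dx = x[1] - x[0]
    current = mu * force * p - D * np.gradient(p, dx)
    return float(np.sum(current**2 / (D * p)) * dx)


def gaussian(x, mean):
    """Unit-variance Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)


x = np.linspace(-6.0, 6.0, 2001)
force = -x  # harmonic force with k = 1; with mu = D = 1 the temperature is T = 1

# Boltzmann distribution for this force: the current vanishes everywhere
rate_equilibrium = total_entropy_production(gaussian(x, 0.0), force, x)
# Same shape displaced from the potential minimum: a nonequilibrium state
rate_displaced = total_entropy_production(gaussian(x, 1.0), force, x)

print("equilibrium:", rate_equilibrium)
print("displaced:", rate_displaced)
```

For the displaced Gaussian with mean $m$ the current works out to $j = -m\,p$, so the integral equals $m^2$; the equilibrium case should come out at zero up to discretization error.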
Thermodynamics and learning
The perceptron is a linear classifier that maps a vector of numbers to a binary label. It is often presented as a simple model of a neuron that either emits a spike or doesn’t, depending on the input from upstream neurons. Mathematically, the model is defined by a vector of weights, and a classification rule , where is an input, is the activation of the neuron, and is the output label. Conventionally, the definition of the perceptron includes a particular learning rule to determine the weights based on a set of training examples; however, there are in principle many possible ways to train this model, some of which will be discussed in the following.
Here we follow Goldt & Seifert (2017),[efn_note]Goldt, S., & Seifert, U. (2017). Thermodynamic efficiency of learning a rule in neural networks. New Journal of Physics, 19(11), 113001. https://doi.org/10.1088/1367-2630/aa89ff[/efn_note] and consider a situation where there is a predefined weight vector , referred to as the teacher, that defines the correct classification rule that we want the model to learn. We also restrict the input vectors to have entries , which can be thought of as representing the presence or absence of a spike from the corresponding upstream neuron. The training is modeled by a Langevin equation for the time evolution of the weights, with an input-dependent force:
Here, is some process that associates a particular input with each point in time, and is the correct label for input , determined by the teacher. The parameter is the learning rate. We set both the mobility and diffusion constant to unity, , since varying these just corresponds to rescaling the weights and time.
We consider forces of the form
To a physicist, the first term represents a conservative force that drives the weights towards the minimum of a potential energy (in the absence of training the system would equilibrate to the Boltzmann distribution with this energy function). This keeps the weight magnitudes from growing without bounds in response to the training, so computationally it serves as a regularization term.
The input-dependent force drives the weight vector in the direction of alignment (when ) or anti-alignment (when ) with the input vector . The overall strength of this drive is set by the learning rate , and it is modulated for each input by the activation-dependent learning rule . The learning rules that will be considered here are:
Here, is the step function: if and otherwise. Hebbian learning weights all training examples equally, the classical perceptron algorithm only updates the weights when the neuron predicted the wrong label, and AdaTron learning is a modification of perceptron learning where the correction is larger the more confident the neuron was in its erroneous prediction. The -tron rule is a novel family of learning rules introduced here, designed as a modification of Hebbian learning such that better predictions lead to increased variance in the learning force, and hence, as we will show below, more dissipation. This is achieved by using a constant weight for incorrect predictions and an activation-dependent weight for correct predictions, where the parameter can be tuned to adjust the strength of the fluctuations.
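The three established rules can be sketched in a few lines of Python. The functional forms below are assumptions read off from the verbal descriptions above: a constant weight for Hebbian learning, a step function of the sign-corrected activation for the perceptron rule, and the same step function scaled by the magnitude of the activation for AdaTron. The function names and the convention that each rule takes the activation and the correct label as arguments are hypothetical choices for this sketch. The -tron family is left out because its functional form is not spelled out in the prose.

```python
import numpy as np


def step(x):
    """Step function: 1 if x > 0 and 0 otherwise."""
    return np.where(np.asarray(x) > 0, 1.0, 0.0)


def hebbian(activation, label):
    """Weights all training examples equally."""
    return np.ones_like(np.asarray(activation, dtype=float))


def perceptron(activation, label):
    """Nonzero only when the sign of the activation disagrees with the label."""
    return step(-np.asarray(label) * np.asarray(activation))


def adatron(activation, label):
    """Like the perceptron rule, but scaled by how confident the wrong prediction was."""
    activation = np.asarray(activation, dtype=float)
    return np.abs(activation) * step(-np.asarray(label) * activation)


# Three example activations with their correct labels:
# the first and third predictions are wrong, the second is right.
activation = np.array([-2.0, 0.5, 1.5])
label = np.array([1.0, 1.0, -1.0])

print("Hebbian:   ", hebbian(activation, label))
print("perceptron:", perceptron(activation, label))
print("AdaTron:   ", adatron(activation, label))
```

Each function returns the factor that modulates the input-dependent force for a given example, so swapping the learning rule in a simulation amounts to swapping one function.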
To fully specify the problem, we must define two sets of inputs: a training set and a test set . We will assume that both sets are sufficiently large and uniform samples as to be, for the analysis that follows, statistically indistinguishable from the uniform distribution over all the possible input vectors.
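As the performance metric for training we will use the generalization error $\epsilon$, that is, the fraction of test examples for which the perceptron yields incorrect predictions. In the thermodynamic limit, $N \to \infty$, it can be shown that the generalization error is proportional to the angle between the weight vector $\mathbf{w}$ and the teacher vector $\mathbf{T}$,
$$\epsilon = \frac{1}{\pi} \arccos\left(\frac{\mathbf{w} \cdot \mathbf{T}}{|\mathbf{w}|\,|\mathbf{T}|}\right).$$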
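The angle formula is easy to compare against a direct count of misclassified inputs. The Python sketch below assumes the standard result that the generalization error equals the angle between the weight and teacher vectors divided by $\pi$, and uses random inputs with entries $\pm 1$. The dimension, sample size, and the way the partially aligned weight vector is constructed are arbitrary illustrative choices.

```python
import numpy as np


def generalization_error(w, teacher):
    """Angle between weight and teacher vectors, divided by pi."""
    cos_angle = np.dot(w, teacher) / (np.linalg.norm(w) * np.linalg.norm(teacher))
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)) / np.pi)


rng = np.random.default_rng(0)
N = 200
teacher = rng.normal(size=N)
w = teacher + rng.normal(size=N)  # a weight vector partially aligned with the teacher

# Empirical error: fraction of random +/-1 inputs on which the two labels differ
inputs = rng.choice([-1.0, 1.0], size=(10000, N))
empirical = float(np.mean(np.sign(inputs @ w) != np.sign(inputs @ teacher)))

print("angle formula:", generalization_error(w, teacher))
print("empirical:    ", empirical)
```

At finite $N$ the two numbers should agree only up to sampling noise and finite-size corrections.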
The generalization error can be related to the mutual information between the predicted and correct label: , where is the binary entropy function. Moreover, this mutual information can be bounded by the entropy produced over the course of learning. These connections close the circle between the information theoretical, thermodynamical, and learning perspectives on this model.
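As a rough illustration of this relation, the sketch below treats the predicted label as the correct label passed through a binary symmetric channel that flips it with probability equal to the generalization error. This is an assumption made here for illustration: it requires both labels to be equally likely and errors to be symmetric, and it measures information in nats. The exact expression in the cited paper may differ in notation or units. The function names are hypothetical.

```python
import numpy as np


def binary_entropy(eps):
    """Binary entropy in nats, with the convention 0 ln 0 = 0."""
    eps = float(eps)
    if eps <= 0.0 or eps >= 1.0:
        return 0.0
    return float(-eps * np.log(eps) - (1.0 - eps) * np.log(1.0 - eps))


def label_mutual_information(eps):
    """Mutual information between predicted and correct label for balanced labels."""
    return float(np.log(2.0) - binary_entropy(eps))


for eps in (0.5, 0.3, 0.1, 0.0):
    print(f"error {eps:.1f} -> mutual information {label_mutual_information(eps):.4f} nats")
```

Under these assumptions, chance-level predictions carry no information about the correct label, while a perfect classifier carries the full $\ln 2$ nats.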
Steady state entropy production of learning
For the learning to converge to a steady state, we will assume a separation of timescales between the process that flips through training examples, and the relaxation time for the weights, governed by the learning rate . The idea is that scans a representative sample of training examples fast enough that in any time interval over which the weights change appreciably, the time evolution is completely determined by the statistical properties of the training set, and the detailed dynamics of are irrelevant.
In steady state, the system entropy is constant and the prevailing entropy production must occur in the environment. Hence, we can write the total steady-state entropy production as,
Here we write for the probability current when the system is in steady state and the training example indexed by is presented, and we take the average over the training set, , to define the effective, -independent entropy production.
The steady-state currents can be written in terms of the force and steady-state probability density as,
and by subtracting the input-averaged version of the same equation we find that,
Plugging this back into the expression for total entropy production, we obtain two distinct terms, and write . The first term is the housekeeping entropy rate,
which is the entropy production required to maintain the steady state. This contribution would remain the same in a batch learning scenario where the force does not fluctuate from scanning through training examples, but instead takes the average value at all times.
The fluctuations are responsible for the second contribution, the excess entropy production,
where we simplified the expression by substituting the definition of and noting that and . By the separation of timescale assumption, the input fluctuations average out quickly enough to not affect the time evolution of the probability distribution on the timescales of interest (this is a condition for defining a steady state in the first place); however, they still add to the heat dissipation.
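The split into a housekeeping and an excess term can be sketched as follows. The notation is an assumption made for this sketch: $\mu$ indexes the training example (not the mobility, which is set to 1 along with the diffusion constant), $\mathbf{F}^{\mu}$ is the force when example $\mu$ is presented, $p_s$ is the steady-state density over the weights $\mathbf{w}$, and $\langle \cdot \rangle$ denotes the average over the training set. The current is taken to have the standard overdamped form.

```latex
\begin{align}
\mathbf{j}^{\mu}_s &= \mathbf{F}^{\mu} p_s - \nabla p_s,
\qquad
\langle \mathbf{j}_s \rangle = \langle \mathbf{F} \rangle p_s - \nabla p_s
\quad \Longrightarrow \quad
\mathbf{j}^{\mu}_s = \langle \mathbf{j}_s \rangle
  + \left( \mathbf{F}^{\mu} - \langle \mathbf{F} \rangle \right) p_s, \\
\left\langle \int \frac{|\mathbf{j}^{\mu}_s|^2}{p_s}\, d\mathbf{w} \right\rangle
&= \int \frac{|\langle \mathbf{j}_s \rangle|^2}{p_s}\, d\mathbf{w}
 + 2 \int \langle \mathbf{j}_s \rangle \cdot
   \left\langle \mathbf{F}^{\mu} - \langle \mathbf{F} \rangle \right\rangle d\mathbf{w}
 + \int p_s \left\langle \left| \mathbf{F}^{\mu} - \langle \mathbf{F} \rangle \right|^2 \right\rangle d\mathbf{w} \\
&= \underbrace{\int \frac{|\langle \mathbf{j}_s \rangle|^2}{p_s}\, d\mathbf{w}}_{\text{housekeeping}}
 + \underbrace{\int p_s \left\langle \left| \mathbf{F}^{\mu} - \langle \mathbf{F} \rangle \right|^2 \right\rangle d\mathbf{w}}_{\text{excess}}.
\end{align}
```

The cross term vanishes because the deviation of the force from its input average has zero mean at every point in weight space, which relies on $p_s$ not depending on the example being presented. The excess term is then the variance of the force over the training set, weighted by the steady-state density.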
One of the main contributions of this work is identifying the latter term, a contribution to the excess entropy production that remains even in the steady state. In Goldt & Seifert (2017), the assumption of time scale separation was taken to imply that there would be no difference between presenting inputs sequentially (online learning) and in batch, and this effect of fluctuations was therefore neglected. A question not pursued here is whether this refinement modifies any of the bounds on mutual information from their paper.
Learning by adaptation?
How do we choose a good learning rule ? We have defined our metric, the generalization error , but since we cannot try every imaginable learning rule, we need some heuristic for designing one. We have derived an expression that relates the steady-state entropy production (or, equivalently, heat dissipation) to . Perhaps we can find a way to apply that?
It has been observed that in many complex systems out of equilibrium, the likelihood of observing a particular outcome grows with the amount of heat dissipated over the history leading up to that outcome. The hypothesis of dissipative adaptation asserts that, as a corollary of this, one should expect many complex systems to evolve towards regions of state space where they absorb a lot of work from their driving forces so they can dissipate a lot of heat to their environments.[efn_note]Perunov, N., Marsland, R. A., & England, J. L. (2016). Statistical Physics of Adaptation. Physical Review X, 6(2), 021036. https://doi.org/10.1103/PhysRevX.6.021036[/efn_note] For example, simulations have demonstrated a tendency for many-species chemical reaction networks to evolve towards statistically exceptional configurations that are uniquely adapted to absorbing work from the external drives they are subjected to.[efn_note]Horowitz, J. M., & England, J. L. (2017). Spontaneous fine-tuning to environment in many-species chemical reaction networks. Proceedings of the National Academy of Sciences of the United States of America, 114(29), 7565–7570. https://doi.org/10.1073/pnas.1700617114[/efn_note] Inspired by this, we want to investigate the performance of learning rules designed such that the force fluctuations increase as the generalization error decreases. The -tron family of learning rules, presented above, is one attempt at designing such rules.
Now that we begin to understand the thermodynamical perspective on learning and the (admittedly quite speculative) rationale behind the -tron, let us put it to the test. For the purpose of this example, we set , ,[efn_note]This is a somewhat special value for the AdaTron for reasons that may or may not be related to the purpose of this study.[/efn_note] and . Sampling a random but fixed teacher and a random initial , and using a fixed step size of , we simulate the Langevin equation for the perceptron in the most straightforward way, that is, using the Euler-Maruyama method. The code is available here, and the result looks like this:
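For readers who want to experiment, here is a minimal Euler-Maruyama sketch of the training dynamics in Python. It is not the simulation code referred to above, and every numerical value in it is a placeholder chosen for this illustration. It assumes Hebbian learning (rule weight 1 for every example), a regularization force equal to minus the weights, mobility and diffusion set to unity, inputs with entries $\pm 1$ sampled uniformly at random, and a learning force averaged over a small batch at each step. The function names are hypothetical.

```python
import numpy as np


def generalization_error(w, teacher):
    """Angle between weight and teacher vectors, divided by pi."""
    cos_angle = np.dot(w, teacher) / (np.linalg.norm(w) * np.linalg.norm(teacher))
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)) / np.pi)


def simulate_hebbian(N=50, nu=35.0, dt=0.01, n_steps=1000, batch=10, seed=0):
    """Euler-Maruyama integration of the weight dynamics under Hebbian learning.

    All default values are illustrative placeholders, not the parameters
    used for the figure in the text.
    """
    rng = np.random.default_rng(seed)
    teacher = rng.normal(size=N)  # random but fixed teacher
    w = rng.normal(size=N)        # random initial weights
    errors = [generalization_error(w, teacher)]
    for _ in range(n_steps):
        xi = rng.choice([-1.0, 1.0], size=(batch, N))  # random +/-1 inputs
        labels = np.sign(xi @ teacher)                 # correct labels from the teacher
        # Hebbian learning force, averaged over the batch
        learning_force = nu * np.mean(labels[:, None] * xi, axis=0)
        drift = -w + learning_force                    # regularization plus learning
        noise = np.sqrt(2.0 * dt) * rng.normal(size=N)  # thermal kicks with D = 1
        w = w + drift * dt + noise
        errors.append(generalization_error(w, teacher))
    return np.array(errors)


errors = simulate_hebbian()
print("initial generalization error:", errors[0])
print("mean over last 200 steps:    ", errors[-200:].mean())
```

With these placeholder values the learning force has to compete with thermal noise of unit strength in each of the $N$ weights, which is why the learning rate is chosen to be several times $\sqrt{N}$. Other learning rules can be tried by multiplying each example's contribution to the learning force by the corresponding rule weight.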
Among the three established algorithms, we see that Hebbian learning is the clear winner. Interestingly, however, the -tron slightly outperforms Hebbian learning for . Was it that simple? Did our adaptation-based approach just revolutionize perceptron theory?
Probably not. First of all, this simulation is a little bit of a bodge. We sample training examples at random at each time step, but a step size of is not nearly small enough to introduce proper separation of time scales. We know this because all algorithms perform measurably better if we reduce the step size. To partially mitigate this, the simulation was actually performed by averaging the force over a small batch of 30 examples at each time step, but this is still not enough to converge to time scale-separated dynamics; on the other hand, it serves to dampen the input-driven fluctuations that were the entire rationale behind this study by a factor of . However, this was the compromise my laptop could handle. The qualitative features of the figure remain the same for different , so perhaps the -tron does indeed perform better than the established algorithms for these particular parameters and , at least when averaging over small batches.
A methodological problem is of course that this is just a single run at a single point in parameter space. More thorough studies are needed before interesting claims can be made.
More importantly, we have not actually looked at the rate of entropy production after convergence, so we cannot say whether the solid -tron performance has anything to do with the fluctuations that we tried to design into the algorithm under inspiration from dissipative adaptation. A notable feature of the -tron is that the magnitude of the learning force scales with the activation when the prediction is correct. Since only the direction of matters for prediction, this gives the algorithm an extra knob with which to balance the strength of the learning force against regularization and thermal noise such that the steadiest possible nonequilibrium state can be maintained (in other words, it can dynamically adjust the learning rate that applies when the prediction is correct). Whether this leads to more or less dissipation than Hebbian learning is by no means obvious; note that both terms in the steady-state entropy production scale with the square of the learning force. However, it does suggest that some interesting form of adaptation is taking place in the -tron.
Finally, we have not said much about an essential aspect of the systems for which dissipative adaptation is relevant, namely that they are complex. Sophisticated machine learning algorithms such as deep convolutional neural networks with nonlinearities in every layer are arguably quite complex, and biological systems that can learn or adapt, such as brains, immune systems, gene regulatory networks, ecological networks, and so on, are some of the most complex systems known. It is for such systems that dissipative adaptation can emerge because strong fluctuations can help the system overcome energy barriers that would otherwise confine it to a small region of its state space. The perceptron, on the other hand, is not a complex system at all; it is a simple tug-of-war between a learning force that fluctuates a little but always points in the same general direction, a deterministic regularization force derived from a quadratic potential, and, in our version, some thermal noise sprinkled on top. Moreover, in certain limiting cases, the mean learning force is a simple function of that can be thought of as an extra term in the potential energy: for example, with Hebbian learning it is constant, and in the limit the effective potential is,
where the sign function applies elementwise. Hebbian learning in the perceptron is thus little more than thermal equilibration in a quadratic potential—the simplest system imaginable!
All this is to say that one should probably not expect dissipative adaptation to be a viable learning mechanism in the perceptron, even though the -tron algorithm may be a worthwhile object for further study. But the more interesting takeaway is perhaps that there is a vast, uncharted territory waiting to be explored by taking an adaptation perspective on complex systems that learn, such as deep neural networks and other learning algorithms with complex loss landscapes, as well as biological learning systems.
For the EE 376A outreach event, rather than discussing perceptrons and thermodynamics with elementary school children, I wanted to convey the idea of complex systems that respond to external forcing by adopting far-from-equilibrium states where weird things happen. The obvious choice was to demonstrate corn starch and water!
A suspension of corn starch in water in the right proportions (about 5/8 corn starch) forms a strongly shear-thickening fluid, which runs like a liquid when left to itself but acts as a solid as soon as a force is applied. This is cool already, but the real fun happens when subjecting the mixture to strong vibrations. Take a look at this video, recorded at the outreach event:
When the vibrations are at their strongest, the puddle almost seems to morph into a primitive life form. The suspension exhibits strongly nonlinear interactions that respond to the external forcing in a very unpredictable manner, and the net effect is the dancing and crawling we see in the video. Even without a detailed understanding of flocculation at the microscopic level, it is clear that the vibrations push the suspension very far away from equilibrium and into highly dissipative states that must absorb a lot of energy from the speaker to persist. In other words, it might not be unreasonable to interpret this as a simple example of dissipative adaptation: the suspension, being a complex system, responds to the forcing from the speaker by finding a corner in state space where the amount of energy absorbed from the speaker as work and dissipated to the environment as heat is unusually high.
My spiel to the kids was that the fluid is a thrill seeker and wants to get the biggest buzz it can out of the vibrating speaker, so what can it do but learn the rhythm and start dancing? In the same way, our brain wants to get the biggest buzz it can out of the signals it receives through our senses, so it has to pay attention, figure out common patterns, and play along—and that is what we call learning!
To the extent the last statement carries any substance, it is also just speculative conjecture. For anyone wanting to find out more, thermodynamics provides one possible link between the complex dynamics of brains in a noisy environment and the information measures that are the natural currency of a learning system.