Hopfield Networks

EE376A (Winter 2019)

Jordan Alexander

(Note: I’d recommend just checking out the link to my personal site: https://jfalexanders.github.io/me/articles/19/hopfield-networks, the version there has a few very useful side notes, images, and equations that I couldn’t include here)

These days there’s a lot of hype around deep learning. We have these things called “deep neural networks” with billions of parameters that are trained on gigabytes of data to classify images, produce paragraphs of text, and even drive cars. But how did we get here? And why are our neural networks built the way they are? In order to answer the latter, I’ll be giving a brief tour of Hopfield networks, their history, how they work, and their relevance to information theory. By studying a path that machine learning could’ve taken, we can better understand why machine learning looks like it does today.

I. A brief history of artificial neural networks

While neural networks sound fancy and modern, they’re actually quite old. Depending on how loosely you define “neural network”, you could probably trace their origins all the way back to Alan Turing’s late work, Leibniz’s logical calculus, or even the vague notions ofGreek automata.

In my eyes, however, the field truly comes into shape with two neuroscientist-logicians: Walter Pitts and Warren McCullough. These two researchers believed that the brain was some kind of universal computing device that used its neurons to carry out logical calculations. Together, these researchers invented the most commonly used mathematical model of a neuron today: the McCulloch–Pitts (MCP) neuron.

These neural networks can then be trained to approximate mathematical functions, and McCullough and Pitts believed this would be sufficient to model the human mind. Now, whether an MCP neuron can truly capture all the intricacies of a human neuron is a hard question, but what’s undeniable are the results that came from applying this model to solve hard problems.

The first major success came from David Rumelhardt’s group in 1986, who applied the backpropagation algorithm to train a neural network for image classification and showed that neural networks can learn internal representations of data. Before we examine the results let’s first unpack the concepts hidden in this sentence:training/learning, backpropagation, and internal representation.

Let’s start with learning. While learning conjures up images of a child sitting in a classroom, in practice, training a neural network just involves a lot of math. At its core, a neural networks is a function approximator, and “training” a neural network simply means feeding it data until it approximates the desired function. Sometimes this function is a map from images to digits between 0-9, and sometimes it’s a map from blocks of text to blocks of text, but the assumption is that there’s always a mathematical structure to be learned.

Training a neural network requires a learning algorithm. Of these, backpropagation is the most widely used. The basic idea of backpropagation is to train a neural network by giving it an input, comparing the output of the neural network with the correct output, and adjusting the weights based on this error.

Backpropagation allows you to quickly calculate the partial derivative of the error with respect to a weight in the neural network. This roughly corresponds to how “significant” this weight was to the final error, and can be used to determine by how much we should adjust the weight of the neural network. If fed enough data, the neural network learns what weights are good approximations of the desired mathematical function.

There’s a tiny detail that we’ve glossed over, though. The original backpropagation algorithm is meant for feed-forward neural networks. That is, in order for the algorithm to successfully train the neural network, connections between neurons shouldn’t form a cycle. We call neural networks that have cycles between neurons recurrent neural networks, and, it at least seems like the human brain should be closer to a recurrent neural network than to a feed-forward neural network, right? Yet, backpropgation still works. While researchers
later generalized backpropagation to work with recurrent neural networks, the success of backpropgation was somewhat puzzling, and it wasn’t always as clear a choice to train neural networks. So what does that mean for our neural network architectures?

In the present, not much. Regardless of the biological impossibility of backprop, our deep neural networks are actually performing quite well with it. But a few years ago, there was an abundance of alternative architectures and training methods that all seemed equally likely to produce massive breakthroughs.

One of these alternative neural networks was the Hopfield network, a recurrent neural network inspired by associative human memory. The hope for the Hopfield human network was that it would be able to build useful internal representations of the data it was given. That is, rather than memorize a bunch of images, a neural network with good
internal representations stores data about the outside world in its own, space-efficient internal language. There are a few interesting concepts related to the storage of information that come into play when generating internal representations, and Hopfield networks illustrate them quite nicely. As for practical uses of Hopfield networks, later in this post we’ll
play around with a Hopfield network to see how effective its own internal representations turned out to be.

II. What is a Hopfield network?

Imagine a neural network that’s designed for storing memories in a way that’s closer to how human brains work, not to how digital hard-drives work. We’d want the network to have
the following properties:

  • Associative: Memories are tied to each other and remembering one thing triggers another memory.
  • Robust: Even when the network is damaged it still retains a good portion of its functionality. For example, if you were to damage your digital hard drive, it would often be a complete loss, but the human brain–even under traumatic injuries–is still able to retain a sigificant portion of its functionality.

To make this a bit more concrete, we’ll treat memories as binary strings with B bits, and each state of the neural network will correspond to a possible memory. This means that there will be a single neuron for every bit we wish to remember, and in this model, “remembering a memory” corresponds to matching a binary string to the most similar binary string in the list of possible memories. Example: Say you have two memories {1, 1, -1, 1}, {-1, -1, 1, -1} and you are presented the input {1, 1, -1, -1}. The desired outcome would be retrieving the memory {1, 1, -1, 1}.

Now, how can we get our desired properties? The first, associativity, we can get by using a novel learning algorithm. Hebbian learning is often distilled into the phrase “neurons that fire together wire together”,

So, for example, if we feed a Hopfield network lots of (images) of tomatoes, the neurons corresponding to the color red and the neurons corresponding to the shape of a circle will activate at the same time and the weight between these neurons will increase. If we later feed the network an image of an apple, then, the neuron group corresponding to a circular shape will also activate, and the we’d say that the network was “reminded” of a tomato.

The second property, robustness, we can get by thinking of memories as stable states of the network:

If a certain amount of neurons were to change (say, by an accident or a data corruption event), then the network would update in such a way that returns the changed neurons back to the stable state. These states correspond to local “energy” minima, which we’ll explain later on.

Hopfield networks can be used to retrieve binary patterns when given a corrupted binary string by repeatedly updating the network until it reaches a stable state. If the weights
of the neural network were trained correctly we would hope for the stable states to correspond to memories. Intuitively, seeing some amount of bits should “remind” the neural network of the other bits in the memory, since our weights were adjusted to satisfy the Hebbian principle “neurons that fire together wire together”

Now that we know how Hopfield networks work, let’s analyze some of their properties.

III. Storing information in Hopfield networks

Hopfield networks might sound cool, but how well do they work? To answer this question we’ll explore the capacity of our network (Highly recommend going to: https://jfalexanders.github.io/me/articles/19/hopfield-networks for LaTeX support). The
idea of capacity is central to the field of information theory because it’s a direct measure of how much information a neural network can store. For example, in the same way a hard-drive with higher capacity can store more images, a Hopfield network with higher capacity can store more memories. To give a concrete definition of capacity, if we assume that the memories of our neural network are randomly chosen, give a certain tolerance for memory-corruption, and choose a satisfactory probability for correctly remembering each pattern in our network, how many memories can we store?

To answer this question we’ll model our neural network as a communication channel. We’re trying to encode N memories into W weights in such a way that prevents:

  • The corruption of individual bits
  • Stable states that do not correspond to any memories in our list

Example: Say you have two memories {1, 1, -1, 1}, {-1, -1, 1, -1} and you are presented the
input {1, 1, 1, -1}. The desired outcome would be retrieving the memory {1, 1, -1, 1}, corresponding to the most similar memory associated to the memories stored in the neural network.

We can use the formula for the approximation of the area under the Gaussian to bound the maximum number of memories that a neural network can retrieve. Using methods from statistical physics, too, we can model what our capacity is if we allow for the corruption of a certain percentage of memories. Finally, if you wanted to go even further, you could get some additional gains by using the Storkey rule for updating weights or by minimizing an objective function that measures how well the networks stores memories.

IV. What’s next?

Well, unfortunately, not much. Despite some interesting theoretical properties, Hopfield networks are far outpaced by their modern counterparts. But that doesn’t mean their developement wasn’t influential! Research into Hopfield networks was part of a larger
program that sought to mimic different components of the human brain, and the idea that networks should be recurrent, rather than feed-forward, lived on in the deep recurrent neural networks used today for natural language processing.


For the outreach portion of the project, I explained the basics of how neural networks stored information through my own blog post and a few articles on distill.pub about machine learning interpretability and feature visualization. The focus of my project was letting the kids play around with neural networks to understand how they generate “internal representations” of the data being fed to them, coupled with a high-level explanation of what this meant.

External Links

For a more detailed blog post, with some visualizations and equations, check out my other blog post on my personal site: https://jfalexanders.github.io/me/articles/19/hopfield-networks

Information Theory Through Puzzles

EE376A (Winter 2019)

Karan Singhal and Neel Yerneni

Most people are first introduced to the ideas of information theory through puzzles and games. We encountered 20 Questions and coin weighing well before entropy and joint source-channel encoding. Given the popularity of puzzles and games backed by information theoretic concepts, studying these puzzles can provide insight into how the general public views information theory and how information theoretic concepts can be communicated to a broader audience.

We discuss seven puzzles involving information theoretic concepts, four of which we created for the outreach event at Nixon Elementary School. We propose a simple taxonomy for these puzzles that reflects the fundamental concepts involved in each. While this taxonomy is not exhaustive, we hope it proves useful in demonstrating that these puzzles often have shared foundations in information theory.

Bits Encode Possibilities

As information theorists, we know that a small number of bits can be used to express one of exponentially many possibilities. This principle is used in games where one player is able to “read the mind” of another player, determining which of many possibilities the other player is thinking of despite receiving a seemingly paltry amount of information.

The 20Q handheld toy. Observe that answers to questions are one of four possibilities: yes, no, sometimes, and unknown.

The most well-known of this category of games is 20 Questions, which was popularized by the 20Q handheld toy owned by many children during the 2000’s. In this game, the user is challenged to think of anything or anyone she wants–an actor, a board game, a fictional character, a national monument, etc. The toy asks a series of 20 yes/no/sometimes/unknown questions about the thing the user is thinking of (e.g., is the thing a person?). At some point when it becomes more confident, the toy begins guessing specific things that the user might be thinking of. If the toy cannot guess the thing after a number of guesses, the user wins. As owners of the original toy know, the toy was able to guess the thing that the user is thinking of–no matter what it was–almost every time. A modified adaptive version of the game, where questions can have more than four possible answers, can be found online.

Armed with the tools of information theory, we can begin to deconstruct the success of the toy. The toy must have a fixed list of possible things (roughly 2^20 = 1,000,000 things) that the user might think of (since it was not connected to the internet), and it must choose its questions so that the number of possibilities is split in roughly half each time to minimize the number of questions it asks. For example, if half the possibilities that the toy is still considering at some stage of the game are humans and the other half are not, the toy would do well to ask: Is it human? This says nothing about how to choose such a question efficiently, especially since people will provide unpredictable answers to ambiguous questions (e.g., is a fictional humanoid alien character a person or not?). Even in cases like this, the toy was expected to guess the thing correctly (and it usually did!). It turns out that the toy, whose algorithm was first developed in 1988 as a test of artificial intelligence, is actually powered by an artificial neural network that figures out which questions to ask and guesses to make. Though most probably did not realize it, for many children growing up in the 2000’s, this toy might have been their first exposure to ideas from both information theory and artificial intelligence! Unfortunately, the original toy was discontinued in 2011 due to licensing issues.

We were inspired by the success of this toy in presenting information theory concepts to the broadest possible audience, so we created our own simple game to evoke the same wonder in today’s elementary school students: 1000 Numbers. We asked students to think of a number from 1 to 1000 (inclusive) and told them we could guess it using ten guesses, as long as they told us whether our guess was correct, too high, or too low. If we could not get it within ten guesses, students won candy. We then performed binary search in real-time, guessing the number within ten guesses. It turns out that only ten guesses are needed, no matter which number a student chooses, even though 1000 < 2^10 = 1024. We show this in a simple Python simulation. This is because each guess actually provides more than a bit of information–we know whether our guess was too high or too low, but we also know whether it is correct or not, which is especially useful when the number of possibilities left to consider is low.

We found that 1000 Numbers was our most popular puzzle on the day of the outreach. Students of all ages were surprised and delighted to find that we could guess their number using few guesses (an average of eight guesses, according to our simulation). Even though we could always guess their number within ten guesses (unless we made an error during the mental math associated with binary search in real-time), we provided candy to all students who participated.

Decisions Create Information

Puzzles often ask people to make informative decisions. In an uncertain situation, making the right decision can reveal crucial information. This is best illustrated by example; consider the popular genre of puzzles involving weighing coins.

The canonical version of this puzzle is as follows. You have nine coins in front of you. All of them are identical except for one counterfeit coin, which is lighter than the other eight. If you have a balance that can compare the weights of any subset the coins with another subset, at most how many weighings would it take to figure out which of the coins is counterfeit?

Observe that each weighing can provide a variable amount of information, depending on the choice of coins to weigh. For example, if two random coins are weighed against each other for the first weighing, this does not seem very informative (in the worst case), since it is likely that we will be left with seven possible counterfeit coins after the weighing. Compare this to weighing two sets of three coins, which is part of the optimal solution and allows us to narrow the possible counterfeit coins to three in both the best and worst case. Thus, making the right decision can create important information.

In the canonical problem presented above, it turns out that just two weighings are needed. In the first weighing, some three coins are compared to some other three coins. Based on the result of this weighing, we can narrow down the possibilities to three. We then weigh any two of these. If one of these is lighter, then we have our desired counterfeit coin. If neither is lighted, then the third unweighed coin must be the counterfeit coin. This solution is illustrated below.

Just two weighings are needed to identify the counterfeit among nine coins. Observe that the problem reduces to determining the number of digits needed to represent nine possibilities in ternary.

Note that with each weighing we can reduce the number of possibilities by a factor of three using the optimal weighing, so the problem reduces to encoding nine possibilities in ternary, which can be done using two digits (corresponding to two weighings). Thus, if 27 coins were present instead, three weighings would be sufficient. Many other variations of this puzzle build off this information theoretic foundation.

Another example of a puzzle in which decisions create information is presented in Quanta’s puzzle column from July 2015. In this puzzle, you seem to be able to win a number-guessing game without any information.

The puzzle is presented as follows:

I write down two different numbers that are completely unknown to you, and hold one in my left hand and one in my right. You have absolutely no idea how I generated these two numbers. Which is larger? You can point to one of my hands, and I will show you the number in it. Then you can decide to either select the number you have seen or switch to the number you have not seen, held in the other hand, as your final choice. Is there a strategy that will give you a greater than 50 percent chance of choosing the larger number, no matter which two numbers I write down?

Pradeep Mutalik, Quanta Puzzle Columnist

When faced with this puzzle, most readers are likely to rely on concrete assumptions over the distribution from which numbers are drawn (e.g., people are likely to choose a number from 1 to 1000). It turns out this kind of assumption is not necessary.

The solution provided by the column is to pick a random number G before pointing to one of the hands. If the revealed number is larger than G, select the revealed number. If it is smaller, switch. The column provides a detailed discussion of this strategy and why it yields a greater than 50% chance of choosing the larger number (even if numbers are chosen over an infinite space), but the intuition is that in most cases, whether you stay or switch, your conditional probability of winning is still 50%. In one case, however, where one of the numbers is less than G and the other is greater, then you are guaranteed to win by playing your strategy. Then as long as you assume that there is a nonzero chance of your random number being in between the two numbers (a mild assumption), this strategy has a greater than 50% chance of choosing the larger number.

In this puzzle, as before, being strategic about how to act in a constrained environment introduces information.

Channels Are Noisy

Channels transmitting information are not always able to transmit exactly what was originally intended–there is often noise. Languages evolved as people needed to be able to communicate given the chance that there might be noise in transmission. The redundancy of modern language has rendered many problems related to noisy channels obsolete in practical situations. This allows people to infer correct meaning even with noise in the transmission.

To conceptualize this as an activity, we wanted to explore scenarios within the English language where original words in a cohesive text could be inferred despite the redaction of certain letters. In the outreach activity, we introduced two games that built on this. The first was Reconstructing Text. We randomly replaced certain letters in a piece of text and had the student reconstruct the text to the best of their ability. With this game, we noticed that students had little trouble reconstructing the text correctly as they were able to infer what the missing words were. We noticed that most of the words that students got incorrect were ones where most of the consonants were replaced by noise rather than the vowels. This made sense as consonants often carry more information than vowels in English words, which we hoped to convey to the students.

The second game we made was the Text Message Game. For this game, students worked in pairs. One student was responsible for masking the letters and the other was to guess the letters that were blacked out. Though this game was less popular due to its complexity, we thought it gave students an even stronger glimpse into the limits and nuances of the redundancies of language as they were encouraged to black out as many letters as they could but so long as their partner could perfectly reconstruct it. This activity provided a much more hands-on exploration as students also had to directly consider the implications behind the transmission of their information.


The encoding of information is another critical concept in the field of information theory and tends to be more complex. In constructing an activity surrounding this concept, we wanted to obviate some of the more involved notions and convey more abstract, generalized concepts in the encoding of information. Specifically, we wanted to explore the idea that smaller symbols could be combined in different ways to communicate more complex concepts.

In our activity at the outreach event, we had students pair up where one student was blindfolded and the other student needed to give the first instructions to complete a maze. However, to do so, the instructing student could only use three pre-outlined words (we used soy, sugar, and glue in our activity), even though there were four possible directions for movement in the maze. It was then up to the students to come up with a plan for how they wanted to convey the proper instructions via these three words beforehand.

We were surprised by the diversity in methods taken by students to use these three words to instruct their partner. While some students worked in the dimension of the words themselves (i.e., using permutations of the words to convey more complex instructions), others made use of other dimensions of variation. One example was a group of students who decided to alter their pitch in saying the word as a mechanism for signaling the direction of movement. We did not expect this solution!


We have proposed a simple taxonomy for seven puzzles involving information theoretic concepts, some of our own creation. Besides furthering our understanding of how these puzzles relate, our analysis may also prove useful in producing curriculum introducing students to information theory. The outreach event is highly relevant: in our station, we presented four puzzles (1000 Numbers, Reconstructing Text, Text Message Game, and Soy-Sugar-Glue) using worksheets that students could take home with them, along with copious prizes (candy) for students who were able to complete the puzzles. We found that students were generally quite excited to complete the puzzles both individually and in pairs, indicating that these and other puzzles could serve as excellent teaching tools for the popularization of information theory.

Our full worksheets that we created for the outreach event: 1000 Numbers, Reconstructing Text, Text Message Game, and Soy-Sugar-Glue.

Comparing information content of language representations under compression

EE376A (Winter 2019)
Cindy Wang and Stephanie Chen

If you’re sitting in a cafe and feel a little hungry, you might say to the waiter, “I would like a croissant, please.” However, if you’re in Paris, you might say, “Je voudrais un croissant, s’il vous plaît.” On a surface level, these sentences are saying the same thing — they are direct translations of each other, after all. Both sentences communicate the first-person subject (I), a verb expressing want or desire (would like), the object (a croissant), and a function word to express politeness (please). Upon closer inspection, you might notice that the French sentence has extra information: the conjugation of the verb voudrais reinforces the first-person subject, the article un denotes that croissant is a masculine noun, and the formal pronoun in s’il vous plaît offers context — the speaker is saying please to someone they do not know.

In a way, language is an encoding for semantic information. We can think of the rules of various languages (e.g. English or French) as different encoding schemes, and the various representations of each language (like writing systems, transcription systems, etc.) as further encodings. Under this paradigm, we might wonder whether the encoding mechanisms of different languages vary in their quality and information capacity, and whether different language representations vary in how they preserve the information transmitted by the language. In this project, we compare how several writing and transcription systems in three different languages perform under the same compression scheme, in order to explore how compressible information varies across languages and how information preservation varies across language representations.



For our project, we used the German-English and Spanish-English parallel corpora from the Europarl dataset, which is a collection of transcriptions, in a variety of languages, of European Parliament proceedings from 1996 to 2011. Because these corpora are parallel translations of proceedings by professional human translators, each set contains approximately the same amount of basic semantic information expressed fluently in each language.

DEFLATE Compression

For compression, we chose the DEFLATE algorithm as it is widely used, fast for experimentation, and lossless. DEFLATE uses a two-stage approach (both of which we discussed in EE376A!):

  1. Matching and replacing duplicate strings with pointers (the LZ77 algorithm)
  2. Replacing symbols based on frequency of use (Huffman coding)

A DEFLATE data stream consists of a series of blocks. Block size is arbitrary, and blocks can be encoded in one of three ways:

  1. As raw literals between 0 and 65,535 bytes in length. This option is used for data that is incompressible.
  2. As a compressed block, using a static Huffman encoding. This is generally used for short messages, when the compression gains achieved do not warrant the cost of generating and storing a Huffman tree.
  3. As a compressed block, using a dynamic Huffman encoding. Most blocks are stored in this way. 

Stage 1: LZ77

The deduplication stage proceeds as the LZ77 algorithm described in class. Duplicate strings in a block are found and replaced with a length (8 bits) and distance from the beginning of the duplicate (15 bits). A minimum match length of 3 and sliding window size of 32KB are used.

Stage 2: Huffman coding

The second stage involves generating (if this is a dynamic Huffman block) and encoding via Huffman trees. This achieves bit reduction by replacing commonly used symbols with shorter representations. (See Huffman coding class notes for more details.)

Each symbol in the output of stage 1 may be a literal, a match length, or a match distance. Stage 2 replaces each of these with a literal/length code and optionally a distance code. Two trees are thus created:

  • Literal/length code (288 symbols)
    • 0-255: The literal bytes/symbols 0–255.
    • 256: End of block.
    • 257-285: Match length range, with extra bits* to encode the actual length.
  • Distance code (32 symbols)
    • Distance range, with extra bits to encode* the actual length.

* For example, symbols 4-5 encode distances 5-8. One extra bit is needed (in addition in the bit already communicated via whether the symbol is 4 or 5) to encode the index of the actual distance within the range 5-8.

DEFLATE thus uses a scheme that places literals, lengths, and an end-of-block symbol together into a single alphabet. Distances are encoded in a separate alphabet, and this can be safely done since a distance can only occur after a length.

The two above codes are themselves encoded compactly as canonical Huffman codes, with run-length encoding for the bit length of the code for each symbol. It is easy for the compressor to choose whether the static or dynamic compressed size is smaller, as the static compressed size can be computed using the same statistics (the number of times each symbol appears) as are used to generate the dynamic trees.

A optimizes implementation of DEFLATE is available via gzip, which we use here and is included in most *nix distributions. (Fun fact: in addition to gzip, DEFLATE is used in PNG image files and the ZIP file format.) The compressed sizes below refer to the size of the GZ file produced by running gzip.

As a baseline, we compress the raw parallel corpora using DEFLATE. The compression rates of the three languages are similar, with German and English both at just under 68% and Spanish slightly better at 69%. Notably, the German and Spanish uncompressed file sizes are both about 14% larger than the English file.


IPA Transcription

Using eSpeak, a text-to-speech synthesizer, we then convert all three corpora to transcriptions in the International Phonetic Alphabet (IPA), the standard phonetic alphabet used in linguistics. The IPA contains 107 letters and 52 diacritical markers, which together capture the wide variety of vowel and consonant sounds, tones, stresses, and other phonetic features present in human languages. IPA transcription provides a standardized way to represent the way in which a language actually sounds, bypassing the idiosyncrasies of writing systems. For example, the English sentence

There has therefore been enough time for the Commission to prepare its programme and for us to become familiar with it and explain it to our citizens.

is transcribed in IPA (using Received Pronunciation, a.k.a. standard British English) as

ðeə hɐz ðˈeəfɔː bˌiːn ɪnˈʌf tˈaɪm fəðə kəmˈɪʃən tə pɹɪpˈeəɹ ɪts pɹˈəʊɡɹam and fɔːɹ ˌʌs tə bɪkˌʌm fəmˈɪliə wɪð ɪt and ɛksplˈeɪn ɪt tʊ aʊə sˈɪtɪzənz

Note how certain sounds in English that require long (and often unintuitive) character combinations to write become collapsed and standardized in IPA. For example, the sound indicated by ugh in enough, which can also be spelled as f, ff, or ph in English, becomes /f/ in /ɪnˈʌf/. Likewise, the sh sound indicated by ssi in Commission (also spelled in a variety of ways) becomes /ʃ/ in /kəmˈɪʃən/. 

Since this transcription format coalesces different spellings of identical sounds, we predicted that the compression rate would be higher for IPA transcriptions than for plaintext (Table 1). As expected, the compression rates were higher across the board. Surprisingly, however, the compression rates of the IPA transcriptions for each of the three languages were identical.

Screen Shot 2019-03-19 at 11.07.16 PM.png

Though IPA is the most recognized standard in phonetic transcription, its heavy use of characters outside the standard Latin alphabet makes it difficult to compare file sizes with our plaintext, since written German, English, and Spanish all mostly use standard Latin characters that can be encoded in 1 byte. For example, our files are UTF-8 encoded, which means that the 1-byte o in Commission is replaced by a 2-byte /ə/ once transcribed into IPA. This artificial augmentation is partially addressed by the increased compression rate of IPA transcriptions, but both the raw and compressed IPA files are much larger than their plaintext versions.

X-SAMPA Transcription

To address this mismatch, we then considered phonetic transcription into X-SAMPA, an alternative transcription scheme that covers the full IPA using only ASCII characters. Though the IPA already makes use of all 26 lowercase Latin letters, X-SAMPA leverages uppercase letters, numbers, and special ASCII symbols like @, so that nearly all (with one or two exceptions) IPA characters used in German, English, and Spanish can be represented with a single byte of ASCII. The example sentence above, in X-SAMPA, becomes

De@ h6z D”e@fO: b%i:n In”Vf t”aIm f@D@ k@m”IS@n t@ pr\Ip”e@r\ It-s pr\”@Ugr\am and fO:r\ %Vs t@ bIk%Vm f@m”Ili@ wID It and Ekspl”eIn It tU aU@ s”ItIz@nz

We can see in Table 3 below that as expected, X-SAMPA transcription results in much smaller raw file sizes than IPA transcription.

Screen Shot 2019-03-23 at 7.50.42 PM

We note that neither of the transcriptions yield smaller compressed sizes than plaintext, even though X-SAMPA uses the ASCII symbol alphabet. The transcription process inherently encodes additional information — specifically, the precise pronunciation, including stress and vowel length, of words, which is often not obvious from their spellings. Therefore, the raw and compressed sizes here are not directly comparable with those from Table 1.


We observe that in every transcription scheme, from plaintext to IPA to X-SAMPA, the compression rates of each of our three languages were within 2% of each other. This suggests that the symbol distributions in German, English, and Spanish are approximately the same. In particular, the compression rates were closest for the IPA transcriptions, in which the mapping from symbol to phonetic sound is closest to one-to-one. This results in an interesting linguistic insight, which is that the nth most common sound in German is about as common as the nth most common sounds in English and Spanish. Examining the raw IPA symbol counts (as a percentage of total text), we see that this holds true, but though the distribution is similar, the actual sounds that are most common in each language are quite different. Note that we include IPA symbols for whitespace (_), stress (‘), and vowel length (ː), as they are relevant characters to the compression algorithm, but they do not represent specific sounds.

Screen Shot 2019-03-24 at 11.31.14 PM

Potential future explorations might include an encoding process that first maps IPA symbols to an 8-bit character value (there are only 163 total IPA symbols), then performs optimal plaintext compression. This could yield smaller raw sizes for IPA transcriptions, but as the symbol distribution would still be the same, this would probably not improve compression rate. Another interesting direction is lossy compression. Since written plaintext is not always recoverable from phonetic transcription, we might want to evaluate the information gain and loss of a transcription + compression pipeline.

Outreach: How much does your sentence say?

Four kids standing in front of a poster board. The poster board has blocks with word segments on them.

For the outreach event, we wanted to communicate a simplified subset of the ideas from our project. Specifically, we wanted to show, through an interactive activity, how information content varies across languages. The goal of our activity was to address two main points:

  1. Different languages take different amounts of space to express the same amount of meaning (e.g. Chinese words are usually 1-2 characters in length, while English words average around 6 letters).
  2. Some languages require speakers to express certain kinds of information that other languages don’t (e.g. the gender of croissant in French).

We came up with a few dozen morphemes in English, Chinese, and French and wrote them on rearrangeable blocks (cardstock with Velcro). We chose the morphemes to allow for different subject pronouns, verb tenses and conjugations, objects, adjective agreement, and even conversion of adjectives into adverbs. We selected English, Chinese, and French as our display languages due to the apparent variation in the written text, as well as grammatical and morphosemantic variation. Students were invited to create sentences in a language of their choice (usually English), and we helped them create parallel sentences in the other two languages.


The same sentence in English, Chinese, and French. Morphemes with the same meaning are connected with an arrow, e.g. {play, 玩, jou-} are approximately equivalent. We can see that even though this is the same sentence, its length varies across languages. For example, French uses many more characters than Chinese. Moreover, some languages express meaning that others omit. Observe the morphemes connected by the dotted line: Chinese has no morphemes to represent that 游戏 (game) is plural, but English and French use -s and -x respectively.

Just by counting the number of blocks in each sentence, the students could observe how languages differ in the number and type of morphemes, or “units of meaning.” The activity was a fun introduction to the intersection of information theory and linguistics. Students and parents alike were also excited to learn how to say the same sentence in two new languages!


An example sentence created by Tsachy!

We hope you found this post interesting and informative! Please direct any questions to Cindy Wang (ciwang at stanford dot edu) and Stephanie Chen (stephchen at stanford dot edu).

Inferring Connectivity in Neural Spike Train Recordings

EE376A (Winter 2019)

By Ethan Shen (ezshen), Eli Pugh (epugh), Vamsi Varanasi (vamsijv), and Mac Klinkachorn (sirapopk)

Importance of Neural Connectivity

Simulated neural data is of vital importance to understanding the function of the brain as well as for the development of potential interventions. Synchronization of different neurons and connectivity are two key parameters that govern how a portion of the brain behaves. In particular, gross behavior results from synchronized neurons firing together over the period of time during which the behavior is observed. Thus, predicting connectivity and synchronization from neural firing patterns, or “spike trains” is key to understanding the underlying structure of the biological neural network. Here, we use information theoretic measures to quantify connectivity of a population of neurons given their spike trains. We aim to use our methods to quantify synchronization between different subsets of the neural population to better understand the dynamics of our system.

Check out our paper and code!


For the outreach activity, our group designed a decoding/encoding game for the students to play. For younger participants(around 6-8 years old), we let them translate the Ascii code on the board into short words or vice versa by providing them with a key and some guidance on how they should partition the code. For older participants (around 8-11 years old), we give out multiple cards containing sequences with different patterns. For example, there are sequences containing: 1111111, 111000111, 10101010101,10011010111 etc.  We then ask the students to observe the pattern in the series and try to come up with a way to concatenate it. The participants were quick to recognized that there are repeated elements and we explained to them the concept of entropy and information in each code.

Throughout the process, we faced several challenges. We realized that the students come from different backgrounds ranging from 5-11 years old. Therefore, we need to come up with the variations of the activities right on the spot to help students learn the best. Also, we improvise a lot during the event to make the activity exciting for the students and keep them engaged. We offered candy to those who completed the challenges in order to bring more kids to our station and give them some gratification upon finishing.

Our most popular game was a binary code on the board with a key that mapped binary strings of length three to eight different letters. The kids had to match each block of three 0’s and 1’s in the “code” until they spelled out “CODEBREAKER”. It took some kids longer than others, but it was really cool to see them figure out by themselves how to match the blocks of binary to each letter. The excitement of finishing and seeing “CODEBREAKER” was very apparent.

The outreach was an extremely fulfilling and intellectually challenging experience. We learned how to convey important material in simple ways, and it turns out that by preparing for the outreach event our team got a better understanding of the material in the class. We really enjoyed interacting with the students and seeing them have fun while developing some information theory intuition.

Information Flow in Simple Neural Networks

EE376A (Winter 2019)

Author: Nathan Kong


In the past decade, high throughput training of deep learning models has bolstered their success in completing difficult tasks. Unfortunately, a theoretical understanding of why these models are so successful is missing. Some work investigating why deep learning models generalize so well utilize concepts from information theory and analyze the information gain between the inputs (and outputs) and the internal representations. A problem that arose in this kind of approach was related to how mutual information was computed in deterministic neural networks. Goldfeld et al. (2018) developed a new method for estimating mutual information by analyzing information theoretic quantities using noisy neural networks and observed that the reduction in mutual information between the internal representation and the inputs (compression) is associated with the clustering of internal representations.

In this work, we reproduce some simple empirical observations in Goldfeld et al. (2018). Furthermore, we conduct some experiments related to modifying the data distribution, as previous work studying information flow in neural networks used a uniform input data distribution. We observe that for a single Gaussian data distribution, using a non-saturating non-linearity in the hidden layer such as LeakyReLU, we do not observe a clustering of the internal representations.

The code can be found here.

The report can be found here.

Outreach Summary

The main goal of my outreach was to introduce elementary school students to (biological) neural networks. In particular, I wanted to communicate with the students about how the brain processes objects in the visual field — starting from early visual cortex to higher visual cortex. To put this a little more concretely, what “parts” of the object are detected in early visual cortex and what “parts” are detected in higher visual cortex?

The activity was originally planned for two teams of four students, but given the circumstances, I had to slightly change it so that it was more amenable for single students. First, I had the students choose an object from the set shown in the image below without telling me what they chose. I would then have the students draw “one part” of the object. It could be a small part of the object such as a door knob or it could be the general shape of the object such as a circle (for the earth, sun, soccer ball, etc.). The goal was to have me guess the object based on the drawing.

If I could not guess the object, I would have the student add to his or her current drawing by adding “another part” of the object. For example, this could be another part of the house object or the seams of the soccer ball. The additions to the drawings would continue until I could successfully guess the chosen object.

The idea behind having the students sequentially draw small parts of object is motivated by the way the visual system extracts features from the visual field. In early visual cortex, features such as edges are extracted from the image. Using this information, it is very difficult to figure out the category of the object. However, as the image is further processed in the visual system, more complex features are extracted such as textures. In higher visual cortex, there are populations of neurons that respond specifically to faces or places. This is analogous to the activity I had the students carry out. As more and more features of the object are added to the drawing, it became easier for me to figure out the object identity.

To read more about this, see here and here.