*Blog post by Natalie Cygan. Contact: cygann@stanford.edu*

Over the course of this quarter, I had the pleasure to mentor David, a high school senior who my Computer Science teacher from high school connected me with. David is a curious student who has a deep interest in HAM radio and plans to study electrical engineering in college. Coming from these interests, he was eager to learn about information theory.

During our first meeting, we had a fantastic conversation about the motivating ideas behind information theory: compression and effective & reliable communication. We spent most of the time discussing the different ways in which both of these ideas are present in information systems, and also ruminated on what can be considered an information system. It was a lot of fun to talk about the information-centric perspectives on biology, genetics, the brain, and language.

Because David had no background in probability, we spent much of the second session laying some probability foundations necessary to learn about entropy. This included some simple distributions, notations, and expectation. Other topics, such as conditional probability and independence, we discussed later when needed. Another concept we reviewed during the first session before getting to the notion of entropy was a review of logarithms and their importance to computing. This included some simple properties (such as the log of a product or quotient and how you can manipulate those), from which we had a good conversation on why they were so useful, starting with the historical example of humans using log tables before they had computers to make multiplication of large numbers easier, and the practical application of log probabilities as a way to get around the limits of floating-point precision. These examples really helped to motivate the use of logarithms beyond the “it just is” explanation that many high school math classes may leave it at.

After reviewing the concepts of logarithms and building our understanding of probability to include expectations, I introduced the notion of entropy as a concept of expected surprise. After plotting our surprise function and discussing a few examples in the context of drawing unlikely events from a probability distribution, David had an intuitive understanding of entropy. From here, we explored a simple compression example with an arbitrary scheme, and then considered what happened as we coded symbols in larger and larger blocks. The discussion on entropy being the lower bound on compression was definitely a 🤯 moment, which was a joy to witness in someone else. Next, we spent a whole session on Huffman Coding.

After this point, I decided to spend one session dedicated to a hands-on coding exercise since David had a strong interest in learning more Python. Together, we made an n-grams text generation program in Python. I chose n-grams as our programming exercise for two reasons: 1) n-grams have interesting information-theoretic properties since they are a form of Markov process (A great transition into discussion on properties of conditional entropy), and 2) the program itself is a joy to play around with— I remember being fascinated by it when I first made an implementation in CS 106X my freshman fall. To implement n-grams, we used the Atom text editor’s Teletype feature, which worked really well since it allowed us to both run the program on our local machines while editing it. This experience was really fun because while writing n-grams, I was able to share a lot of valuable tips for writing code and teach a lot about the Python programming language. Once the program was fully implemented and working, we compared some of the advantages and disadvantages of n-grams as a language model (especially compared to character-level processes), which led to some further exploration on other language model paradigms that I was happy to talk about (such as GPT-3).

Finally, we talked about communication, where I introduced the concept of mutual information, and eventually, the noisy-channel coding theorem. We talked about two primary types of noisy channels: the Binary Erasure Channel, and the Binary Symmetric Channel. There were many a-ha moments during this meeting as David shared many ways in which the framework of noisy communication applies to things he has seen before or worked on (such as real-world examples of the BSE or BEC). At the end, we spoke briefly on polar codes and how they approach the Shannon Limit, which David found to be very exciting.

Overall, the greatest challenges of the Zoom meeting format was just some bumps in getting screen share to work effortlessly from my iPad, which was an indispensable tool for online teaching. Out of all the possible screen sharing options, what worked best was for me to join the Zoom meeting from both my computer and my iPad, and then screen share from the iPad.

All of my meetings with David were incredibly fun— we talked at length about a variety of computing, electrical engineering, and math related topics along the way and explored fun tangents when they came up. While I had initially planned to only hold one hour meetings each week, it was rare to have a meeting that was under 2 hours long. Each week, David brought so much curiosity and enthusiasm that left us with no shortage of things to discuss. It was a pleasure to mentor David, as he asked many questions that kept me on my feet. This whole experience has helped me grow as a teacher and also refine my own understanding of the topics in information theory.