Blog post by Leo Dong. Contact: leozdong@stanford.edu

**1 Background and Motivation**

In this blog post, I will describe my experience of teaching two high school students some cool concepts I learned in my EE 276 Information Theory class.

I reached out to my high school CS teacher, Mr. Pillai, when I wanted to find mentees who might be interested in spending one hour a week for about two months learning some information theory. I told him that the students did not need any prior mathematical background, only mathematical interest, and that I would adjust the content to the specific students.

Two students signed up. Julian is a high school junior and Shreya is a senior. Both have taken AP Calculus, but, reflecting a common gap in high school math education throughout the country, neither had any exposure to probability theory. Nonetheless, I thought they had enough motivation and mathematical maturity that I could challenge them by presenting the material in a more formal and rigorous way, beyond high-level ideas. I had three reasons in mind behind this decision:

- It was Week 2 and I had just learned about the Asymptotic Equipartition Property in information theory, which totally blew my mind and changed the way I see the world. The result is not obvious from first intuition and is even counter-intuitive, so I really wanted to present this topic in a rigorous way.
- One of the coolest things about information theory (which I appreciated in Week 2 and still appreciate today) is that we can truly see the power of quantifying abstract concepts and putting them into a mathematical framework. By quantifying information and formalizing the abstract notion of “communication,” we are able to reach some truly powerful and unexpected results about the achievability and impossibility of many communication problems. I thought that as STEM-oriented students, Julian and Shreya would appreciate this idea as I did.
- Both Julian and Shreya were interested in taking math classes once they got to college, and one of the “culture shocks” I had when I first started taking proof-based math classes was that the level of rigor was way beyond my comfort zone. I thought giving them an earlier taste of it would be beneficial, especially since we went to the same high school and probably have similar backgrounds.

**2 Session Outline**

Below is an outline of how I structured my weekly sessions:

- One introduction session describing the motivation behind information theory and how ubiquitous it is in our everyday lives.
- A few sessions giving a blitz recap of probability theory. Specifically, I focused on: what random variables are, discrete probability distributions, the expectation of a random variable, the expectation of a function of a random variable, and Jensen’s Inequality.
- A few sessions on information measures. I focused on: motivating the “surprise function,” using it to define entropy, proving the upper and lower bounds of entropy, introducing entropy under an alternative distribution, and defining relative entropy and mutual information (without proofs of their properties).
- PUNCHLINE: Rigorously introduce the Asymptotic Equipartition Property.
- A bonus session where I introduced other places where information measures serve as bounds in communication problems. Specifically, I introduced: the expected length of uniquely decodable codes, Shannon’s noisy channel coding theorem, and the rate-distortion function for lossy compression.

**2.1 Aside: Technology I used**

I used Zoom for our weekly meetings. I used an iPad and pencil alongside the app *Notability* as a whiteboard. I tried to be less “sketchy” and more structured with my live notes, so that I could share the Notability notes (via a live link) with them to reference as informal lecture notes whenever they wanted a refresher. I thought about cloud-recording the sessions so they could re-watch the videos, but decided against it because I thought it might make them less comfortable interacting and asking questions.

**2.2 Session Interaction**

Each session is 1 hour long. I usually start with 5–10 minutes of casual conversation with my mentees. Since we went to the same high school, we have a lot of common topics to talk about. I usually ask them how some of my old teachers are doing and then share my experience of studying in college (though as a sophomore with only 6 months of in-person experience on campus and the rest on Zoom due to COVID, I realized that I actually haven’t had a typical university experience either).

Then I would teach for 50 minutes. I encouraged them to make the session as interactive as possible. I found that telling them it is also more efficient for my teaching if they interrupt me, so that I can immediately clear up any confusion on the spot, made them more comfortable interrupting and asking questions. One place I could improve is to have a second device with a camera that captures my face and lets me see my mentees: I used the iPad for both the Zoom video call and sketching notes, but once I start sharing my screen, my camera is disabled.

The 1-hour session officially ends after the 50 minutes of teaching. But usually, if my mentees and I had extra time and felt engaged, we would talk for another 10 to 20 minutes. They were pretty interested to hear what classes I am taking in college and liked to talk about possible majors, career paths, or life in general.

**3 Challenges of Teaching**

Teaching with some rigor to high school students with no probability background is definitely a challenge. The first difficulty I encountered was recapping all of the important probability theory in a matter of a few sessions, when it normally takes a few months at the college level.

**3.1 Teaching Probability Theory**

There are some simplifications I made so that I could introduce the bare minimum of probability they needed. I introduced the concept of random variables restricted to the discrete Bernoulli distribution, and then defined expectation. The material up to this point is intuitive. One teaching technique I found especially useful is to develop a set of descriptive vocabulary and keep *reusing* it. For example, whenever I write an expectation, I keep reiterating that it is nothing but your *weight* times *the value we care about*.

The hardest part was explaining the idea of taking the expectation of a probability mass function, because it would later be used to define entropy (H(U) := E[\log \frac{1}{p(U)}]).

I started by convincing them that a function of a random variable is also a random variable, so it is valid to consider its expectation. Then I demonstrated that one can think of the probability mass function as a function in its own right, though it only takes nonzero values on a finite set. Finally, I showed that it is justified to define a new random variable p(X) from a random variable X, and take its expectation E[p(X)], or that of any composition like E[\log p(X)]. We had to go back and forth during this part because it was a bit difficult for them to wrap their heads around at first. After a few rounds of revisiting the topic, I think they got a good grasp of it. One of the things I learned is that when introducing expectations of functions of random variables, it is very important to clearly distinguish where something is *random* and where something takes a *fixed value*.
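To make the “p(X) is itself a random variable” idea concrete, here is a small numerical sketch (the Ber(0.9) numbers are my own illustration, not from the sessions):

```python
import math

# pmf of a Bernoulli(0.9) random variable X (illustrative values)
p = {0: 0.1, 1: 0.9}

# E[p(X)]: the *weight* p(x) times the *value we care about*, which here
# happens to also be p(x)
e_p = sum(p[x] * p[x] for x in p)

# E[log2(1/p(X))]: an expectation of a composition of X -- this is the
# entropy H(X), the "average surprise"
entropy = sum(p[x] * math.log2(1 / p[x]) for x in p)

print(e_p)      # 0.1*0.1 + 0.9*0.9 = 0.82
print(entropy)  # about 0.469 bits
```

The key point the code makes explicit: inside the sum, each p(x) is a *fixed value*, while which term gets realized is what was *random*.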

Another difficult topic to teach was Jensen’s inequality. I used the “proof by picture” approach and tried to convince them that the inequality holds in the single-variable case. I found that the picture alone is not enough, and I gradually developed some more descriptive vocabulary. I would call the convex combination a “linear mixture,” and show them that the expectation can be thought of as a “linear mixture.” For example, E[f(X)] = P(X=x_1)f(x_1) + (1-P(X=x_1))f(x_2) mixes the two values f(x_1) and f(x_2) linearly, so all possible values this mixture can take lie on the line segment connecting the two values.
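As a quick sanity check of the “linear mixture” picture, one can verify Jensen’s inequality numerically for a simple convex function (the values below are my own illustrative choices):

```python
# Jensen's inequality for a convex f: E[f(X)] >= f(E[X])
f = lambda x: x ** 2          # a convex function
x1, x2 = 1.0, 5.0             # the two values X can take
p1 = 0.3                      # P(X = x_1)

e_x = p1 * x1 + (1 - p1) * x2              # E[X]: a linear mixture of x_1, x_2
e_fx = p1 * f(x1) + (1 - p1) * f(x2)       # E[f(X)]: the same mixture of f-values

# The mixture of f-values lies on the chord, which sits above the curve
print(e_fx, f(e_x))  # 17.8 vs ~14.44
assert e_fx >= f(e_x)
```

Geometrically, e_fx is a point on the chord between (x1, f(x1)) and (x2, f(x2)), while f(e_x) is the curve directly below it, which is exactly the “proof by picture.”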

**3.2 Teaching Information Measures**

I believe it was worth the time to be extra careful and make sure Julian and Shreya understood every single bit of the probability theory I introduced, because the subsequent introduction of information measures went a lot more smoothly. I first introduced the “surprise function” S(u) = \log \frac{1}{p(u)} to motivate the definition of entropy. It was intuitive for them to see why entropy is the “average surprise,” because it is just an expectation. I also added that they can think of entropy as the “amount of information” using the following example: say I have two robots, Robot Alice and Robot Bob. Each day you ask them whether or not they are happy; Robot Alice answers according to a Ber(0.5) distribution, while Robot Bob answers according to a Ber(0.9) distribution. It is intuitive that asking Robot Alice every day gives you more information, because we already know that Robot Bob is a somewhat **happy robot**, so taking the effort to ask him won’t tell you much that is new on average.
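Putting numbers to the robot example (a small sketch using the Ber(0.5) and Ber(0.9) parameters above):

```python
import math

def entropy_bernoulli(p: float) -> float:
    """Entropy H of a Bernoulli(p) source, in bits: the 'average surprise'."""
    if p in (0.0, 1.0):
        return 0.0  # a deterministic answer carries no surprise
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Robot Alice: Ber(0.5) -> maximally unpredictable, 1 bit per answer
# Robot Bob:   Ber(0.9) -> mostly "happy", so each answer carries less information
print(entropy_bernoulli(0.5))  # 1.0
print(entropy_bernoulli(0.9))  # about 0.469
```

Alice’s answers carry more than twice the information of Bob’s on average, matching the intuition that a predictable robot tells you little.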

**3.3 Teaching Asymptotic Equipartition Property**

Finally, close to the end of the quarter, I introduced the punchline of this series of sessions: the Asymptotic Equipartition Property. I was pretty nervous about introducing this topic. Fortunately, our previous groundwork in probability theory and information measures proved well worth it, and the teaching went smoothly. After introducing the \epsilon-typical set and deriving the fact that “the probability that any sequence falls into the \epsilon-typical set goes to 1 in the limit,” I remember Julian asking a really good question. He asked whether this means that the \epsilon-typical set would eventually grow to be as big as the entire space of length-n sequences, \mathcal{U}^n for some alphabet \mathcal{U}. I was glad he asked, because I had the exact same guess when I first learned this. It blew his mind as much as it had blown mine when I told him that not only does the \epsilon-typical set *not* grow to fill the space, but its size relative to the entire space of length-n sequences vanishes to 0. It showed that Julian (and hopefully Shreya as well) really understood the topic and started to share the appreciation I had when I first learned it.

I then drew the same picture of the \epsilon-typical set within the entire set of possible sequences, and formalized a few other properties. I concluded that the \epsilon-typical set really *is* the smallest and *tightest* significant set that we care about.
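One way to see both halves of this numerically is to count, for an i.i.d. Bernoulli source, what fraction of all length-n binary sequences is \epsilon-typical. A small sketch (the parameters p = 0.9 and \epsilon = 0.1 are my own choices):

```python
import math

def typical_fraction(n: int, p: float = 0.9, eps: float = 0.1) -> float:
    """Fraction of all 2^n length-n binary sequences that are eps-typical
    under an i.i.d. Bernoulli(p) source."""
    H = p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))
    count = 0
    for k in range(n + 1):  # k = number of 1s in the sequence
        prob = (p ** k) * ((1 - p) ** (n - k))     # probability of one such sequence
        if abs(-math.log2(prob) / n - H) <= eps:   # eps-typicality condition
            count += math.comb(n, k)               # all C(n, k) such sequences
    return count / 2 ** n

for n in (10, 50, 200):
    print(n, typical_fraction(n))
```

Even though the typical set itself grows (roughly like 2^{nH}), the printed fractions shrink toward 0 as n grows, because the full space grows like 2^n and H < 1 here; meanwhile the typical set’s total *probability* tends to 1.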

Successfully conveying the idea of the AEP to Julian and Shreya was definitely a highlight of this series of sessions. I concluded the series with a bonus session presenting other places where information measures serve as fundamental bounds. I introduced uniquely decodable codes and the relation of expected code length to entropy. I motivated the noisy channel communication problem and stated Shannon’s noisy channel coding theorem, where mutual information turns out to be a very important quantity. Then I gave an overview of lossy compression and how mutual information can again be used to bound the minimum rate of compression. This part was relatively easy because I did not need to do any proofs, and the results can mostly be justified by the intuition behind the relevant information measures.
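As a toy illustration of the code-length bound (my own example, not from the sessions): the expected length of a uniquely decodable code is never below the source entropy, and for a dyadic source a prefix code can meet it exactly.

```python
import math

# A source whose probabilities are powers of 1/2 (dyadic), with a
# hand-built prefix-free code of matching lengths
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

entropy = sum(p * math.log2(1 / p) for p in pmf.values())     # H(U)
avg_len = sum(pmf[s] * len(code[s]) for s in pmf)             # E[length]

print(entropy, avg_len)  # both 1.75 bits: the code is optimal for this pmf
assert avg_len >= entropy
```

For non-dyadic sources the expected length exceeds the entropy strictly, but stays within 1 bit of it for an optimal code.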

**3.4 Value of Revisiting**

I found that spending time at the beginning of each session recapping the material of the previous one was surprisingly helpful, especially when the students were learning something challenging and beyond their comfort zone. I observed with both of my mentees that when concepts were confusing or overwhelming at first, revisiting them after a week cleared up a lot of the confusion almost “automatically,” even though no explicit learning happened in between. This offers an interesting insight into my own learning as well: sometimes material just seems scary at first, especially if it is foreign and unfamiliar, and I should have the patience to keep going back to it instead of expecting to grasp everything immediately.

**4 Concluding Remarks**

It has been a great experience teaching Julian and Shreya. I really appreciated their patience and enthusiasm throughout the sessions, which gave me a lot of encouragement, especially since this was my first time teaching younger students. I discovered that learning material is completely different from teaching it, because I have to digest the content even further and be able to present it in simple language. I also realized that I don’t fully understand some of the material myself, and was sometimes stuck trying to give a convincing and intuitive explanation or to answer a question (Julian and Shreya asked some really good ones). It makes me believe that one truly learns the material only when one can describe it clearly to people without as much technical background. It has been an educational experience for Julian and Shreya as well as for myself, and I look forward to potentially sharing what I learn in my future classes in a similar way.

**5 Appendix**

I am including my live notes here, which give a more detailed idea of the pace and level of detail at which I presented each topic. The entire document is about 20 pages long.