A Wonderful Mentorship Experience with a Chinese High School Student

Blog, Information Theory (Winter 2021)

By Qingxi Meng (qingxi@stanford.edu)

I am very lucky to have had the opportunity to mentor a wonderful Chinese high school senior, Yuting. I got connected to Yuting through my mother’s friend. Yuting and I met regularly on Sundays at 8pm PT (12pm Beijing time). Yuting has taken many high school math and physics classes, but she knows only a little bit of probability and no calculus. Therefore, my goal was to help her develop a high-level understanding of information theory. Both Yuting and I are very interested in science fiction, so I used many examples from science fiction to help her understand information theory.

Since Yuting knew nothing about information theory, I was very nervous before the first session. Before our first meeting, I made a short video that introduced information theory as a story. During our first session, I showed Yuting one real-life application of information theory: data compression. I showed her that many of the large files we use in daily life (e.g., movies) are stored in compressed form. In the first session, I also introduced the basic concepts of probability in a very general way. Originally, I had prepared a 20-page slide deck to introduce the mathematical background for information theory. However, I found that Yuting got lost very quickly, so that part of the first session did not go as smoothly as I expected. In the last thirty minutes of the first session, we had a very interesting discussion about what would happen if there were no compression in our world. We also briefly discussed Shannon. Surprisingly, Yuting had heard of Shannon in her history class: it turned out that his contribution is mentioned briefly in the modern technology section of the Chinese history textbook. Also, as homework, I asked Yuting to estimate the sizes of all the compressed files she had on her computer.

After the first session, I did not use slides to introduce mathematical ideas as much, because I realized that Yuting got lost very easily that way. Instead, I usually played an interesting video related to the topic I wanted to cover. Then I would give only a brief introduction, and Yuting and I would have an interesting conversation about it. After each session, I encouraged Yuting to talk to her parents and cousins about what she had learned. Since her relatives had almost no background in science or technology, she sometimes drew a picture or told a story.

The video I made in Chinese as a quick introduction to information theory for Yuting

In the next couple of sessions, I taught Yuting about entropy. I first played an interesting video about why computers use bits. Yuting was very surprised at first by the fact that, at its core, a computer only works with 0s and 1s. We also had a great philosophical discussion about whether or not the world is made of 0s and 1s. Yuting did not believe so, because she thought it was just a human-made convention. Then I raised the question of what a universal scale for information could be. Yuting first thought this was impossible because language is ambiguous, and its ambiguity could not be measured mathematically. Then I played a game with Yuting: I let her think of an animal in her mind, and I asked her 10 yes-or-no questions to guess what it was. Afterwards, I explained to her that information entropy is a measure of the randomness, or uncertainty, of information. For example, as I asked more and more questions about which animal she had in mind, the remaining entropy dropped because I had more and more certainty.
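For readers who want to see the guessing game in numbers, here is a minimal Python sketch (my own illustration, not something Yuting and I coded together, with a made-up candidate count) showing how the entropy of a uniform guess over a shrinking set of candidate animals drops as yes-or-no questions rule candidates out.

```python
import math

def entropy(num_equally_likely_outcomes: int) -> float:
    """Entropy in bits of a uniform distribution over n outcomes."""
    return math.log2(num_equally_likely_outcomes)

# Start with 16 equally likely candidate animals (an arbitrary toy number).
candidates = 16
print(f"Before any questions: {entropy(candidates):.1f} bits of uncertainty")

# Each well-chosen yes/no question cuts the candidate set roughly in half.
for question in range(1, 5):
    candidates //= 2
    print(f"After question {question}: {entropy(candidates):.1f} bits remain")
```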

Over the next couple of sessions, I introduced the high-level ideas of lossless compression to Yuting. First, I wrote down a DNA sequence and had Yuting translate it into bits. Then I explained very briefly why we can save space by exploiting the different frequencies of the symbols. Yuting was not sure whether this holds in real life, so I encouraged her to count the frequency of each letter in a paragraph of her favorite novel, “The Lord of the Rings.” In one session, we also discussed a very interesting topic, the “Golden Disk,” and how information theory relates to it. Since both of us are passionate about science fiction, we had a great time discussing it!
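For readers who want to try Yuting’s homework themselves, here is a small Python sketch (my own illustration, not part of our sessions) that counts letter frequencies in a piece of text and compares the entropy of that empirical distribution to the log2(26) ≈ 4.7 bits a fixed-length code would spend per letter.

```python
import math
from collections import Counter

# Any English text works; this line is from "The Fellowship of the Ring."
paragraph = "All that is gold does not glitter, not all those who wander are lost."

# Count only alphabetic characters, ignoring case.
letters = [c.lower() for c in paragraph if c.isalpha()]
counts = Counter(letters)
total = len(letters)

# Empirical entropy: the average number of bits an ideal code needs per letter.
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(f"Letters counted: {total}")
print(f"Empirical entropy: {entropy:.2f} bits/letter")
print(f"Fixed-length code over 26 letters: {math.log2(26):.2f} bits/letter")
```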

The slide I made to discuss information theory and the “Golden Disk”

Over the last couple of sessions, Yuting and I went quickly over the concept of channel capacity. Yuting had heard about bandwidth a lot in science fiction movies, but it was her first time hearing about channel capacity, so we spent a lot of time discussing the difference between the two. In my understanding, the channel capacity is the maximum rate at which data can be sent reliably over a specific channel, while the bandwidth is the frequency range of a channel. Although Yuting still could not fully articulate the difference between channel capacity and bandwidth, she learned about the connection between the two.
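One standard formula that connects the two (the Shannon-Hartley theorem for a channel with additive white Gaussian noise, which is more precise than what we covered in our sessions) ties the capacity C in bits per second to the bandwidth B in Hz and the signal-to-noise ratio S/N:

 C = B \log_2\left(1 + \frac{S}{N}\right)

So a wider bandwidth or a better signal-to-noise ratio both raise the capacity, but the two quantities are not the same thing.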

I faced some challenges while mentoring Yuting. One of the biggest was that it was difficult to convey the ideas of information theory in an interesting and straightforward way. Since many concepts in information theory involve a lot of mathematical expressions, I needed to build up intuition for Yuting instead. Also, I found that Yuting easily forgot what she had learned after a week. Therefore, I encouraged her to teach her parents and cousins about information theory as well. In addition, I sometimes left fun homework for her so she could apply what she had learned.

Another big challenge was expressing the ideas of information theory clearly in Chinese. Although Yuting has studied English in her English classes, she preferred to communicate in Chinese. Since I learned information theory in English, it was challenging to express the ideas well in Chinese. I resolved this by translating all the terminology into Chinese and practicing a lot. Before each session, I also explained the ideas to myself out loud in Chinese to get used to it.

Yuting was an awesome mentee! She had many great ideas and perspectives on information theory that I had never thought about. I also learned a lot about how to express information theory in a non-technical and fun way! Yuting mentioned that she wants to take related classes in the future to learn more about information theory. Both of us look forward to meeting again to continue our discussions about information theory!

My Information Theory Mentorship Experience

Blog, Information Theory (Winter 2021)

Blog post by Natalie Cygan. Contact: cygann@stanford.edu

Over the course of this quarter, I had the pleasure of mentoring David, a high school senior whom my high school Computer Science teacher connected me with. David is a curious student with a deep interest in HAM radio who plans to study electrical engineering in college. Coming from these interests, he was eager to learn about information theory.

During our first meeting, we had a fantastic conversation about the motivating ideas behind information theory: compression and effective & reliable communication. We spent most of the time discussing the different ways in which both of these ideas are present in information systems, and also ruminated on what can be considered an information system. It was a lot of fun to talk about the information-centric perspectives on biology, genetics, the brain, and language.

Because David had no background in probability, we spent much of the second session laying the probability foundations necessary to learn about entropy: some simple distributions, notation, and expectation. Other topics, such as conditional probability and independence, we discussed later as needed. Before getting to the notion of entropy, we also reviewed logarithms and their importance to computing. This included some simple properties (such as the log of a product or quotient and how to manipulate them), which led to a good conversation about why logarithms are so useful, starting with the historical example of log tables, which people used to simplify the multiplication of large numbers before computers existed, and the practical use of log probabilities as a way to get around the limits of floating-point precision. These examples really helped motivate logarithms beyond the “it just is” explanation that many high school math classes may leave it at.
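As an illustration of the floating-point point (my own toy example with made-up numbers, not code from our sessions): multiplying many small probabilities underflows to zero, while summing their logs stays perfectly well-behaved.

```python
import math

# Probability of a specific sequence of 1000 independent events, each with p = 0.01.
p = 0.01
n = 1000

direct = 1.0
for _ in range(n):
    direct *= p          # underflows to exactly 0.0 long before the loop ends

log_prob = n * math.log(p)   # stays an ordinary finite number

print(direct)      # 0.0
print(log_prob)    # about -4605.17
```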

After reviewing the concepts of logarithms and building our understanding of probability to include expectations, I introduced the notion of entropy as a concept of expected surprise. After plotting our surprise function and discussing a few examples in the context of drawing unlikely events from a probability distribution, David had an intuitive understanding of entropy. From here, we explored a simple compression example with an arbitrary scheme, and then considered what happened as we coded symbols in larger and larger blocks. The discussion on entropy being the lower bound on compression was definitely a 🤯 moment, which was a joy to witness in someone else. Next, we spent a whole session on Huffman Coding.
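Since we spent a whole session on Huffman coding, here is a compact Python sketch of the standard algorithm (a generic illustration with a toy string, not the exact material from that session), which builds an optimal prefix code from symbol frequencies using a min-heap.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a prefix-free code (symbol -> bitstring) built from symbol frequencies."""
    freqs = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        (_, _, code), = heap
        return {sym: "0" for sym in code}
    while len(heap) > 1:
        f1, _, code1 = heapq.heappop(heap)
        f2, _, code2 = heapq.heappop(heap)
        # Prepend a bit distinguishing the two merged subtrees.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        counter += 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
    return heap[0][2]

code = huffman_code("abracadabra")
encoded = "".join(code[c] for c in "abracadabra")
print(code)
print(f"{len(encoded)} bits vs {8 * len('abracadabra')} bits in ASCII")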

After this point, I decided to spend one session dedicated to a hands-on coding exercise since David had a strong interest in learning more Python. Together, we made an n-grams text generation program in Python. I chose n-grams as our programming exercise for two reasons: 1) n-grams have interesting information-theoretic properties since they are a form of Markov process (A great transition into discussion on properties of conditional entropy), and 2) the program itself is a joy to play around with— I remember being fascinated by it when I first made an implementation in CS 106X my freshman fall. To implement n-grams, we used the Atom text editor’s Teletype feature, which worked really well since it allowed us to both run the program on our local machines while editing it. This experience was really fun because while writing n-grams, I was able to share a lot of valuable tips for writing code and teach a lot about the Python programming language. Once the program was fully implemented and working, we compared some of the advantages and disadvantages of n-grams as a language model (especially compared to character-level processes), which led to some further exploration on other language model paradigms that I was happy to talk about (such as GPT-3).
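Here is a minimal sketch of the kind of n-grams text generator we built (a from-scratch illustration, not our actual Teletype session code; "corpus.txt" is a placeholder for any plain-text file): it records which word follows each (n-1)-word context in a corpus and then samples a random walk through those contexts.

```python
import random
from collections import defaultdict

def build_ngram_model(words, n=3):
    """Map each (n-1)-word context to the list of words observed to follow it."""
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i : i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, n=3, length=30, seed=None):
    """Generate text by repeatedly sampling a continuation of the current context."""
    rng = random.Random(seed)
    context = rng.choice(list(model.keys()))
    output = list(context)
    for _ in range(length):
        followers = model.get(tuple(output[-(n - 1):]))
        if not followers:          # dead end: no observed continuation
            break
        output.append(rng.choice(followers))
    return " ".join(output)

corpus = open("corpus.txt").read().split()   # placeholder: any plain-text file
model = build_ngram_model(corpus, n=3)
print(generate(model, n=3, length=50))
```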

Finally, we talked about communication, where I introduced the concept of mutual information, and eventually, the noisy-channel coding theorem. We talked about two primary types of noisy channels: the Binary Erasure Channel and the Binary Symmetric Channel. There were many a-ha moments during this meeting as David shared many ways in which the framework of noisy communication applies to things he has seen before or worked on (such as real-world examples of the BSC or BEC). At the end, we spoke briefly about polar codes and how they approach the Shannon Limit, which David found to be very exciting.
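For reference, the capacities of these two channels have clean closed forms. For a binary symmetric channel that flips each bit with probability p, and a binary erasure channel that erases each bit with probability \epsilon,

 C_{\mathrm{BSC}} = 1 - H_2(p), \qquad C_{\mathrm{BEC}} = 1 - \epsilon,

where H_2 is the binary entropy function.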

Overall, the greatest challenge of the Zoom meeting format was just some bumps in getting screen sharing to work effortlessly from my iPad, which was an indispensable tool for online teaching. Out of all the possible screen-sharing options, what worked best was to join the Zoom meeting from both my computer and my iPad, and then share the screen from the iPad.

All of my meetings with David were incredibly fun— we talked at length about a variety of computing, electrical engineering, and math related topics along the way and explored fun tangents when they came up. While I had initially planned to hold only one-hour meetings each week, it was rare to have a meeting under two hours long. Each week, David brought so much curiosity and enthusiasm that we had no shortage of things to discuss. It was a pleasure to mentor David, as he asked many questions that kept me on my toes. This whole experience has helped me grow as a teacher and also refine my own understanding of the topics in information theory.

Getting Some Rigor to High School Students : )

Blog, Information Theory (Winter 2021)

Blog post by Leo Dong. Contact: leozdong@stanford.edu

1 Background and Motivation

In this blog post, I will describe my experience of teaching two high school students some cool concepts I learned in my EE 276 Information Theory class.

I reached out to my high school CS teacher, Mr. Pillai, when I wanted to find mentees who were potentially interested in spending 1 hour a week for about 2 months learning some information theory. I told him that the students did not need any prior mathematical background, only mathematical interest, and that I would adjust the content to the specific students.

Two students signed up: Julian, a high school junior, and Shreya, a senior. Both have taken AP Calculus, but, as is common in high school math education throughout the country, neither had any exposure to probability theory. Nonetheless, I thought they had enough motivation and mathematical maturity that I could challenge them by presenting the material in a more formal and rigorous way, beyond high-level ideas. I had three reasons in mind behind this decision:

  1. It was Week 2 and I had just learned about the Asymptotic Equipartition Property, which totally blew my mind and changed the way I see the world. The result is not obvious from first intuition and is even counter-intuitive, so I really wanted to present this topic in a rigorous way.
  2. One of the coolest things about information theory (which I appreciated in Week 2 and still appreciate today) is that we can truly see the power of quantifying abstract concepts and putting them into a mathematical framework. By quantifying information and formalizing the abstract notion of “communication,” we are able to reach some truly powerful and unexpected results about the achievability and impossibility of many communication problems. I thought that as STEM-oriented students, Julian and Shreya would appreciate this idea as I did.
  3. Both Julian and Shreya were interested in taking math classes when they get to college, and one of the “culture shocks” I had when I first started taking proof-based math classes was that the level of rigor was way beyond my comfort zone. I thought giving them an earlier taste of it would be beneficial, especially since we went to the same high school and probably have similar backgrounds.

2 Session Outline

Below is an outline of how I structured my weekly sessions:

  1. One introduction session describing the motivation behind information theory and how ubiquitous it is in our everyday life.
  2. A few sessions on a blitz recap of probability theory. Specifically, I focused on: what random variables are, discrete probability distributions, expectation of a random variable, expectation of a function of a random variable, and Jensen’s Inequality.
  3. A few sessions on information measures. I focused on: motivating the “surprise function,” using the surprise function to define entropy, proving the upper and lower bounds on entropy, introducing entropy under an alternative distribution, and defining relative entropy and mutual information (without proofs of their properties).
  4. PUNCHLINE: Rigorously introduce the Asymptotic Equipartition Property.
  5. Bonus session where I introduced other places where information measures serve as bounds in communication problems. Specifically: the expected length of uniquely decodable codes, Shannon’s noisy channel coding theorem, and the rate-distortion function for lossy compression.

2.1 Aside: Technology I used

I used Zoom for our weekly meetings, with an iPad and Apple Pencil alongside the Notability app as a whiteboard. I tried to be less “sketchy” and more structured with my live notes, so that I could share the Notability notes (via a live link) with them and they could refer back to them weekly as informal lecture notes if they wanted a refresher. I thought about cloud-recording the sessions so they could re-watch the videos, but decided against it because I thought it might make them less comfortable interacting and asking questions.

2.2 Session Interaction

Each session is 1 hour long. I usually start with a 5–10 minute casual conversation with my mentees. Since I went to the same high school, we have a lot of common topics to talk about. I usually ask them how some of my old teachers are doing and then share my experience of studying in college (though as a sophomore with only 6 months of in-person experience on campus and the rest over Zoom due to COVID, I realize that I haven’t had a typical university experience either).

Then I would teach for 50 minutes. I encouraged them to make the session as interactive as possible. I found that if I tell them it is also more efficient for my teaching if they interrupt me, so that I can immediately clear up any confusion on the spot, then they feel more comfortable interrupting and asking questions. One place I could improve during the teaching time is to have a second device with a camera that can capture my face and let me see my mentees. I used the iPad for both the Zoom video call and sketching notes, but the problem is that once I start sharing my screen, my camera is disabled.

The 1-hour session officially ends after the 50-minute teaching. But usually if both my mentees and I have extra time and feel engaged, we would talk for another 10 to 20 minutes. They were pretty interested to hear what classes I am taking in college and liked to talk about possible majors, career paths, or life in general.

3 Challenges of Teaching

Having to teach with some rigor to high school students with no probability background is definitely a challenge. The first challenge I encountered was recapping all the important probability theory in a matter of a few sessions, when it normally takes a few months at the college level.

3.1 Teaching Probability Theory

There are some simplifications I made so that I could introduce the bare minimum of probability they needed. I introduced the concept of random variables, restricted attention to the discrete Bernoulli distribution, and then defined expectation. The material up to this point is intuitive. One teaching technique I found especially useful was to develop a set of descriptive vocabulary and keep reusing it. For example, whenever I wrote out an expectation, I kept reiterating that it is nothing but the values we care about weighted by their probabilities. The hardest part was explaining the idea behind taking the expectation of the probability mass function evaluated at the random variable itself, because it would later be used to define entropy (H(U) := E[\log \frac{1}{p(U)}] ).

I started by convincing them that a function of a random variable is also a random variable, so it is valid to consider its expectation. Then I demonstrated that one can think of the probability mass function as just another function, though it only takes nonzero values on a finite set. Finally, I showed that it is justified to define a new random variable  p(X) from a random variable  X and take its expectation  E[p(X)] , or any composition like  E[\log p(X)] . We had to go back and forth during this part because it was a bit difficult for them to wrap their heads around it the first time. After a few rounds of revisiting the topic, I think they got a good grasp of it. One of the things I learned is that when introducing expectations of functions of random variables, it is very important to clearly distinguish where something is random and where something takes a fixed value.
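A tiny worked example of the kind of computation this builds up to (my own illustration, not copied from the live notes): for  X \sim \mathrm{Ber}(q) with  P(X=1) = q ,

 E\left[\log \frac{1}{p(X)}\right] = q \log \frac{1}{q} + (1-q) \log \frac{1}{1-q} = H(X),

so the “average surprise” is exactly the entropy of the Bernoulli random variable.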

Another difficult topic to teach was Jensen’s inequality. I used the “proof by picture” approach and tried to convince them that the inequality holds in the single-variable case. I found that just having the picture was not enough, so I gradually developed some more descriptive vocabulary. I would call a convex combination a “linear mixture,” and show them that the expectation can be thought of as a linear mixture. For example,  E[f(X)] = P(X=x_1)f(x_1) + (1-P(X=x_1))f(x_2) mixes the two values  f(x_1) and  f(x_2) linearly, so all possible values of this mixture lie on the line segment connecting the two values.

Example sketch of live notes when I tried to illustrate Jensen’s inequality
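For completeness, the statement behind the picture: if  f is convex, then

 f(E[X]) \leq E[f(X)],

with the inequality reversed for concave functions such as the logarithm. This is exactly what makes bounds like  E[\log Y] \leq \log E[Y] available when proving properties of entropy.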

3.2 Teaching Information Measures

I believe it was worth the time to be extra careful and make sure Julian and Shreya understood every single bit of the probability theory I introduced, because the subsequent introduction of information measures went a lot more smoothly. I first introduced the “surprise function” (S(u) = \log \frac{1}{p(u)}) to motivate the definition of entropy. It was intuitive for them to see why entropy is the “average surprise,” because it is just the expectation of the surprise. I also added that they can think of entropy as the “amount of information,” using the following example: say I have 2 robots, Robot Alice and Robot Bob. Each day you ask, Robot Alice answers whether or not she is happy according to a Ber(0.5) distribution, while Robot Bob answers according to a Ber(0.9) distribution. It is intuitive that asking Robot Alice every day gives you more information, because we already know that Robot Bob is a mostly happy robot, so taking the effort to ask him won’t tell you much new on average.
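Plugging the two robots into the definition makes the intuition quantitative:

 H(\mathrm{Ber}(0.5)) = 1 \text{ bit}, \qquad H(\mathrm{Ber}(0.9)) = 0.9\log_2\frac{1}{0.9} + 0.1\log_2\frac{1}{0.1} \approx 0.47 \text{ bits},

so each of Robot Alice’s answers carries roughly twice the information of Robot Bob’s.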

3.3 Teaching Asymptotic Equipartition Property

Finally, close to the end of the quarter, I introduced the punchline of this series of sessions: the Asymptotic Equipartition Property. I was pretty nervous about introducing this topic. Fortunately, it seemed like our previous groundwork in probability theory and information measures was well worth it, and the overall teaching went smoothly. After introducing the \epsilon-typical set and deriving the fact that the probability that a sequence falls into the \epsilon-typical set goes to 1 in the limit, I remember Julian asking a really good question: whether this means that the size of the \epsilon-typical set would eventually grow to be as big as the entire space of length-n sequences, \mathcal{U}^n for some alphabet \mathcal{U}. I was glad that he asked this, because I had the exact same guess when I first learned the material, and it blew his mind as much as it had blown mine when I told him that not only does the \epsilon-typical set not grow to fill the space, but its relative size compared to the entire space of length-n sequences vanishes to 0. It showed that Julian (and hopefully Shreya as well) really understood the topic and started to share the appreciation I felt when I first learned this.
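In symbols, the two facts that make this surprising (stated here in the standard textbook form; the exact phrasing in my live notes may differ slightly): for an i.i.d. source  U_1, \dots, U_n with per-symbol entropy  H ,

 P\big((U_1,\dots,U_n) \in A_\epsilon^{(n)}\big) \to 1, \qquad |A_\epsilon^{(n)}| \leq 2^{n(H+\epsilon)},

so whenever  H < \log|\mathcal{U}| , the fraction  |A_\epsilon^{(n)}| / |\mathcal{U}|^n \leq 2^{-n(\log|\mathcal{U}| - H - \epsilon)} goes to 0 even though the typical set captures essentially all the probability.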

I then drew the same picture of the \epsilon-typical set within the entire set of possible sequences, and formalized a few other properties. I concluded that the \epsilon-typical set really is the smallest and tightest significant set that we care about.

I drew the same intuitive picture highlighting the relation of the typical set to the entire space of possible sequences as I learned in class.

Being able to successfully convey the idea of the AEP to Julian and Shreya was definitely a highlight of this series of sessions. I concluded the series with a bonus session in which I presented other places where information measures serve as fundamental bounds. I introduced uniquely decodable codes and the relation of expected code length to entropy. I motivated the noisy channel communication problem and stated Shannon’s noisy channel coding theorem, in which mutual information turns out to be a very important quantity. Then I gave an overview of lossy compression and how mutual information can again be used to bound the minimum rate of compression. This part was relatively easy because I did not need to do any proofs, and the results can mostly be justified by the intuition behind the relevant information measures.

3.4 Value of Revisiting

I found that spending time at the beginning of each session to recap the material from the last session is surprisingly helpful, especially when students are learning something challenging and beyond their comfort zone. I observed in both of my mentees that when some concept is confusing or overwhelming at first, revisiting it after a week can clear up a lot of the confusion almost “automatically,” even though no explicit studying happened in between. This offers an interesting insight into my own learning as well: sometimes the material just seems scary at first, especially if it is foreign and unfamiliar, and I should have the patience to keep going back to it instead of expecting to grasp everything immediately.

4 Concluding Remarks

It has been a great experience teaching Julian and Shreya. I really appreciated their patience and enthusiasm throughout the sessions, which gave me a lot of encouragement, especially since this was my first time teaching younger students. I discovered that learning the material is completely different from teaching it, because I have to digest the content even further and be able to present it in simple language. I also realized that I don’t fully understand some of the material myself, and I would sometimes get stuck trying to give a convincing and intuitive explanation or to answer a question (Julian and Shreya asked some really good questions). It makes me believe that you truly learn the material only when you can describe it clearly to people without as much technical background. It has been an educational experience for Julian, Shreya, and myself, and I look forward to potentially doing a similar sharing of things I learn in my future classes.

5 Appendix

I am including my live notes here, which give a more detailed idea of the pace and level of detail at which I presented each topic. The entire document is about 20 pages long.

Living Room Chats on Information Theory

Blog, Information Theory (Winter 2021)

By Kao Kitichotkul (rk22@stanford.edu)

I have been having some “chats” with my sister about information theory. My sister is a college freshman with a decent background in high school mathematics who likes baking. During this quarter, we talked a little bit about things I learned in EE 276. We would sit at a table in our living room in the afternoons and sketch figures and calculations on paper. Occasionally, we discussed information theory at the dining table as well. My parents are not engineers or mathematicians, but they can more or less follow the main ideas of the discussions.

The table where my sister and I sit down to chat about information theory.

This blog post is about the chats I have been having with my sister. Each section is an “arc” in our journey through information theory. We were more or less learning about it together!

Basics: Probability & Measure of Information

My sister learned some probability in high school, but the kind of probability we use in information theory is a bit more abstract than what she already knew, and the notation is also a bit different. I explained some definitions and concepts, such as random variables, expected value, and variance, just enough to understand the content of information theory. I also explained the Law of Large Numbers with a focus on intuition, in order to motivate the Asymptotic Equipartition Property later on. I mostly used coin flipping and playing cards as examples to explain these concepts. Having the necessary definitions, we moved on to define entropy  H , relative entropy  D(p \| q) , and mutual information  I(X; Y) , and discussed some of their properties. We also spent some time working through a few examples of Huffman coding.
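For readers who have not seen them, the three definitions we worked with, in their standard discrete form, are

 H(X) = \sum_x p(x)\log\frac{1}{p(x)}, \qquad D(p \| q) = \sum_x p(x)\log\frac{p(x)}{q(x)}, \qquad I(X;Y) = D\big(p_{X,Y} \,\|\, p_X p_Y\big).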

Asymptotic Equipartition Property

We began our discussion of compression with some examples from my homework. I thought the way the Asymptotic Equipartition Property (AEP) implies that entropy is the lower bound on the average code length was really cool, so I tried to show my sister a rigorous proof. This was when the Law of Large Numbers came into play! I defined typical sequences and proceeded to prove some theorems about the probability that a source sequence is typical and about how the typical set is exponentially smaller than the set of all sequences. It turned out that I did not understand the topic very well, so it took me a few attempts to make my proof both rigorous and accessible. For example, I thought that the entropy  H in the proof meant the entropy of the whole source sequence, but it is actually the entropy of a single source symbol. There were quite a few little details that I would have missed if I had not brought the proofs to my sister, who pointed out contradictions until I had figured out all the errors.
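The link to the Law of Large Numbers is essentially one line: for an i.i.d. source  X_1, \dots, X_n ,

 -\frac{1}{n}\log p(X_1,\dots,X_n) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{1}{p(X_i)} \;\to\; H(X) \quad \text{in probability},

which is exactly the statement that long source sequences are, with high probability, “typical.”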

Reliable Communication

I began by sketching the communication scheme and gave two examples of communication channels — the binary symmetric channel and the binary erasure channel. Then, I defined the channel capacity  C as the maximum rate of reliable communication and claimed that  C = \max_{p_X} I(X; Y) , where  X is the input random variable and  Y is the output random variable. I did not prove why the channel capacity can be calculated this way, but we discussed some implications of the claim. For example, the maximum over  p_X in the expression means that the capacity does not depend on specific inputs, but rather on the specification of the channel itself. My sister was more interested in how these concepts relate to practical matters, like how we can tell whether a channel encoding-decoding scheme is good enough by comparing the bit rate of a specific scheme to the channel capacity.
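As a sanity check on the claim  C = \max_{p_X} I(X;Y) (my own illustration, not something from our chats), here is a short Python sketch that numerically maximizes the mutual information over the input distribution of a binary symmetric channel and compares it against the closed form  1 - H_2(p) .

```python
import numpy as np

def mutual_information(p_x1, flip_prob):
    """I(X;Y) in bits for a binary symmetric channel with input P(X=1) = p_x1."""
    p_x = np.array([1 - p_x1, p_x1])
    # Channel transition matrix: rows are inputs, columns are outputs.
    W = np.array([[1 - flip_prob, flip_prob],
                  [flip_prob, 1 - flip_prob]])
    p_xy = p_x[:, None] * W            # joint distribution of (X, Y)
    p_y = p_xy.sum(axis=0)             # output marginal
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]))

flip_prob = 0.1
rates = [mutual_information(q, flip_prob) for q in np.linspace(0.01, 0.99, 99)]
capacity = max(rates)
h2 = -flip_prob * np.log2(flip_prob) - (1 - flip_prob) * np.log2(1 - flip_prob)
print(f"Numerical max of I(X;Y): {capacity:.4f} bits")
print(f"Closed form 1 - H2(p):   {1 - h2:.4f} bits")
```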

Lossy Compression & Joint Source-Channel Coding

We went through the setting of lossy compression, the definition of the rate-distortion function, and how it can be calculated from the mutual information. Then, I concluded with joint source-channel coding, which combines reliable communication and lossy compression. I would say the most interesting part of the chat was the example. I have some experience with computational imaging, so I am familiar with working with images. We talked about JPEG compression. I described how images are represented in computers as numbers, where each number is the intensity of a pixel in one (RGB) channel. Roughly speaking, in the JPEG compression scheme, the image is transformed into the discrete cosine transform domain, where usually only a few coefficients have significant magnitude. Therefore, by storing only those coefficients, we can still recover the original image using the inverse discrete cosine transform while incurring only small distortions.

The JPEG compression scheme.
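To give a flavor of the idea in code (a toy sketch with a synthetic 8x8 block, not the actual JPEG standard, which also includes quantization tables and entropy coding): transform a block with the DCT, keep only the largest coefficients, and invert.

```python
import numpy as np
from scipy.fft import dctn, idctn

# A smooth toy 8x8 "image block" (blocks of natural images are usually smooth like this).
block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 16.0

coeffs = dctn(block, norm="ortho")           # to the DCT domain

# Keep only the 6 largest-magnitude coefficients and zero out the rest.
kept = np.zeros_like(coeffs)
idx = np.unravel_index(np.argsort(np.abs(coeffs), axis=None)[-6:], coeffs.shape)
kept[idx] = coeffs[idx]

reconstruction = idctn(kept, norm="ortho")   # back to pixel space

distortion = np.mean((block - reconstruction) ** 2)
print(f"Stored 6 of 64 coefficients, mean squared distortion: {distortion:.2f}")
```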

Quantum Information Theory

Among the lectures on special topics, I found the quantum information theory lecture very interesting. It was also quite accessible because the lecturer, Meltem Tolunay, went through the preliminaries in a very clear and concise manner, so I decided to bring quantum information theory to the chat. I roughly followed the lecture notes, starting from defining states, operators, and measurements, and then explaining entanglement and super-dense coding. My sister asked an interesting question about super-dense coding that I had not thought about beforehand: what exactly about super-dense coding makes it good, i.e., able to achieve a capacity of 2 bits per transmitted qubit? Since the answer that it simply is the way it is, and that we had worked out that this is the case, was not very satisfying, I took some time and came up with an answer. I think the fact that a quantum state can be a superposition of other states, together with entanglement, is the reason why super-dense coding achieves this capacity. While a classical bit can only be 0 or 1, a quantum state can be in between 0 and 1, and with entanglement, we can obtain 4 different states by acting on only one of the two entangled particles!
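A compact way to see this (the standard textbook argument, sketched in my own words rather than taken from the lecture notes): suppose the two parties share the entangled state

 |\Phi^+\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle + |11\rangle).

By applying one of the four operations  I, X, Z, XZ to her qubit alone, the sender maps the shared state to one of the four orthogonal Bell states  \tfrac{1}{\sqrt{2}}(|00\rangle \pm |11\rangle), \ \tfrac{1}{\sqrt{2}}(|01\rangle \pm |10\rangle) , which the receiver can distinguish with a joint measurement. Four distinguishable outcomes means 2 classical bits carried by sending a single qubit.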

The chats have been an interesting experience. I hope this account will inspire you to take a shot at learning information theory as well!

An Application of Information Theory in Reinforcement Learning

Blog, Information Theory (Winter 2021)

A blog post by Ertem Nusret Tas

In this blog post, I will summarize an application of information theory within reinforcement learning as described in the paper ‘An Information-Theoretic Analysis of Thompson Sampling’ by Daniel Russo and Benjamin Van Roy. I will first describe the setup and lay out the problem statement for reinforcement learning. Then, I will explain how Thompson Sampling works and why it is a good solution for the problem. Finally, I will talk about how information theory helps in a formal treatment of the performance of Thompson Sampling, as well as in incorporating different types of knowledge into the final bound.

This is the paper!

As the setting for our problem, consider a video game where there is a universe  \mathcal{P} of planets with different laws of physics. As a player, we are placed on a planet  p^* ; however, we do not know a priori which planet we have been placed on. Regardless, we are allowed to have a prior distribution over which planet it could be.

At each time step  t = 1, 2, \dots of the game, we perform an action  A_t from the set of available actions  \mathcal{A} and observe an outcome for our action, denoted by  Y_{t,A_t} , from the set  \mathcal{Y} . Note that if we knew which planet we were placed on, we would know its laws and thus, to some extent, the rough outcome of any action. More formally,  p^* is chosen from a given family of distributions  \mathcal{P} such that the vector of outcomes  Y_t = (Y_{t,a})_{a \in \mathcal{A}} has distribution  p^* over  \mathcal{Y}^{|\mathcal{A}|} . We further assume that, given the distribution  p^* , the sequence  \{Y_t\}_{t \in \mathbb{N}} is i.i.d.

Now, there is a reward function  R , which assigns a reward  R(y) to any observed outcome  y . As a player, our goal is to maximize our accumulated reward over time. This would be easy with knowledge of the distribution  p^* : we could simply select, at each time step  t , the action  A^* with the largest expected reward, i.e.,  A^* = \argmax_{a \in \mathcal{A}} E[R(Y_{t,a})] . However, without this knowledge, we are doomed to perform actions that we will eventually regret (like trying to jump on a planet with huge gravity). More formally, by time step  T , we will have incurred a total regret of  \mathrm{Regret}(T) = \sum_{t=1}^T \big(R(Y_{t,A^*})-R(Y_{t,A_t})\big) .

Although the situation looks bleak, not all hope is lost! Bad actions can hurt us, but they still teach us about the laws of physics. (After trying to jump, you would know that diving should be much more fun on a planet with high gravity.) Thus, we can hope to decrease the regret we incur at future time steps by discovering new actions. In fact, even after discovering a good action, an expert player would occasionally diverge from it to explore potentially better actions. In this context, we can come up with policies  \pi that strike a good balance between the exploration of new actions and the exploitation of the good actions already found. Our goal is to find policies with a good balance of exploration and exploitation that minimize the expected total regret, i.e., the Bayesian regret  \mathbb{E}[\mathrm{Regret}(T)] , over the duration  T of the game.

One policy that tries to balance exploration and exploitation is Thompson Sampling (TS). (To be more exact, policies call algorithms to receive actions, and TS is one such algorithm.) To understand TS, let’s define  F_t as the knowledge we have gained from our actions and their outcomes before time  t . More formally,  F_t is the sigma algebra generated by the action-outcome tuples observed before time  t :  F_t = \sigma\big((A_1,Y_{1,A_1}), (A_2,Y_{2,A_2}), \dots, (A_{t-1},Y_{t-1,A_{t-1}})\big) . Having defined  F_t , TS chooses the next action at time  t to be  A_t = a with probability  \mathbb{P}(A^* = a \mid F_t) . In other words, given all our past observations  F_t , if we believe a certain action  a to be optimal for our current estimate of  p^* , then TS urges us to select it for the next time step with high probability. Hence, under TS,  \mathbb{P}(A_t = a \mid F_t) = \mathbb{P}(A^* = a \mid F_t) . This equation gives TS the other name it commonly goes by: probability matching.

Thompson Sampling for a parametric family of distributions with linear reward and a Gaussian prior  N(\mu,\sigma^2) for the parameter  \theta^* . If we assume that the noise  R(Y_{t,a})-\mathbb{E}[R(Y_{t,a})] is also a zero-mean Gaussian, then the posterior distribution for our estimate of  \theta^* is Gaussian as well, which yields the algorithm above. Notice that for parametric families, Thompson Sampling can be implemented with a two-step procedure: first choose a parameter  \hat{\theta}_t via probability matching, and then find the optimal action for the chosen parameter. (This is the example given in the paper, and the details can be found there.)

One could ask why we should sample different actions at all instead of sticking with the action  a^* for which  \mathbb{P}(A^* = a \mid F_t) is maximized. After all,  a^* is the action that is most likely to be optimal given our past observations. However, such a policy would dramatically reduce exploration and make it likely for us to stick with a sub-optimal action indefinitely. As mentioned before, a good policy walks the thin line between exploration and exploitation.
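To make probability matching concrete, here is a minimal Thompson Sampling sketch for a Bernoulli bandit with independent Beta priors. This is a standard toy setting with made-up reward probabilities, not the Gaussian linear-reward example from the paper, but the sample-then-act-greedily structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = [0.3, 0.5, 0.7]          # unknown reward probability of each action
T = 5000
# Beta(1, 1) prior per action; (successes + 1, failures + 1) are the posterior parameters.
successes = np.zeros(len(true_means))
failures = np.zeros(len(true_means))
total_reward = 0.0

for t in range(T):
    # Probability matching: sample one plausible mean per action from its posterior ...
    sampled_means = rng.beta(successes + 1, failures + 1)
    # ... and act greedily with respect to the sampled means.
    action = int(np.argmax(sampled_means))
    reward = float(rng.random() < true_means[action])
    successes[action] += reward
    failures[action] += 1 - reward
    total_reward += reward

print(f"Average reward: {total_reward / T:.3f} (best arm pays 0.7 on average)")
print(f"Pulls per arm: {successes + failures}")
```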

Next, we quantify the performance of TS by analyzing its Bayesian regret. This is exactly where information theory comes to our aid! For this purpose, the paper defines the Information Ratio (IR):

  \Gamma_t = \frac{\big(\mathbb{E}_t[R(Y_{t,A^*})-R(Y_{t,A_t})]\big)^2}{I_t\big(A^*;(A_t,Y_{t,A_t})\big)}

Observe that the IR associated with time step  t is the ratio of the square of the expected regret at that time step to the mutual information between the optimal action  A^* and the action-outcome tuple observed at that time step. The subscript on the expectation and the mutual information signifies that both are conditioned on our past observations  F_t and are calculated using the posterior distribution  \mathbb{P}(\cdot \mid F_t) . Notice that this makes  \Gamma_t a random variable.

How does the IR help us? Proposition 1 of the paper shows that it can be used to bound the Bayesian regret: if  \Gamma_t \leq \overline{\Gamma} for all  t and some  \overline{\Gamma} , then, under TS,

 \mathbb{E}[\mathrm{Regret}(T)] \leq \sqrt{\overline{\Gamma}H(A^*)T}

As argued by the paper, this bound on Bayesian regret carries two types of information:

Soft Knowledge stands for our prior knowledge about  p^* . This information enters the bound through the entropy term  H(A^*) . For instance, suppose there were only two distributions  p_1, p_2 in  \mathcal{P} , each with a single, distinct best action. If both distributions were equally likely,  H(A^*) would be the entropy of a Bernoulli-1/2 random variable, which is 1 bit. However, if our prior knowledge told us that we are more likely to start with distribution  p_1 , then  H(A^*) , and thus the bound on the Bayesian regret, would be lower. This implies that a more informed player is expected to have a smaller regret in the game. Although this is an intuitive observation, many past results on Bayesian regret did not incorporate prior knowledge into the bound. This highlights the strength of information-theoretic methods in analyzing Bayesian regret.

Hard Knowledge reflects our knowledge of the structure of the distribution family  \mathcal{P} . This information enters the bound through the term  \overline{\Gamma} . For instance, in a family where the outcome of an action tells us nothing about the outcomes of other actions,  \overline{\Gamma} is close to the upper bound  |\mathcal{A}|/2 . On the other hand, in a family where we can simultaneously learn the outcomes  Y_{t,a} for all actions  a at any time step, it is possible to show that  \overline{\Gamma} \leq 1/2 . Thus, in the case of full information, we would in expectation incur a smaller regret. Notice how  \overline{\Gamma} depends on the distribution family.

(The fact that the bound above features a  \sqrt{T} term reflects the ability of a player using TS to learn about better actions over time. Note that the difference between the upper bounds for horizons  T and  T + \Delta approaches 0 as  T \to \infty , implying that the regret acquired over a fixed interval decays to zero as time grows.)

Bounding Bayesian regret in terms of IR reduces the problem to bounding IR itself. In this context, the paper presents three main results for bounding IR under TS:

  1. A general upper bound:  \overline{\Gamma} \leq |\mathcal{A}|/2 . This bound becomes order optimal for distribution families where the outcome of an action does not tell anything about the outcomes for other actions.
  2. An upper bound for the full information case:  \overline{\Gamma} \leq 1/2 .
  3. Linear Optimization: Suppose  \mathcal{A} \subseteq \mathbb{R}^d and for each  p \in \mathcal{P} , there exists a parameter  \theta_p \in \mathbb{R}^d such that for all  a \in \mathcal{A} ,  \mathbb{E}[R(Y_{t,a})] = a^T \theta_p . (See the algorithm above which describes TS for such a parametric family of distributions.) Then, the IR is upper bounded by  \overline{\Gamma} \leq d/2 . Observe that this case lies between the no-information and full information cases. Thus, the bound on IR is smaller than the general upper bound, yet larger than the full information case, reflecting the specific structure of this distribution family featuring dimension  d .
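Plugging the general bound into Proposition 1, together with the fact that  H(A^*) \leq \log|\mathcal{A}| for a finite action set, recovers a familiar-looking rate:

 \mathbb{E}[\mathrm{Regret}(T)] \leq \sqrt{\tfrac{1}{2}\,|\mathcal{A}|\log|\mathcal{A}|\,T}.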

Finally, we have seen how information theory can help with the analysis of Bayesian regret and bring knowledge ignored by many past works to bear on the regret bound. We have also learned the basics of the model used in reinforcement learning, as well as certain design principles such as exploration vs. exploitation. In particular, we have observed how one particularly strong algorithm, Thompson Sampling, finds a good balance between exploration and exploitation. I hope you have enjoyed my summary of the paper (see references); for more information, please check out the original work. (It is a great read!)

References:

Daniel Russo and Benjamin Van Roy. 2016. An information-theoretic analysis of Thompson sampling. J. Mach. Learn. Res. 17, 1 (January 2016), 2442–2471.