By Kao Kitichotkul (rk22@stanford.edu)

I have been having some “chats” with my sister about information theory. My sister is college freshman with a decent background in high school mathematics who likes baking. During this quarter, we talked a little bit about things I learned from EE 276. We would sit at a table in our living room in the afternoons and sketch some figures and calculations on papers. Occasionally, we discussed information theory at the dining table as well. My parents are not engineers or mathematicians, but they can more or less follow along the main ideas in the discussions.

This blog is about the chats I have been having with my sister. Each section is an “arc” in our journey through the information theory. We both were more or less learning about it together!

## Basics: Probability & Measure of Information

My sister learned some probability materials in high school, but the kind of probability we would use in information theory is a bit more abstract than what my sister already knew, and the notations are also a bit different. I explained some definitions and concepts, such as random variables, expected value, and variance, just enough to understand the content of information theory. I also explained the Law of Large Numbers with a focus on the intuition in order to motivate the Asymptotic Equipartition Property later on. I mostly used coin flipping and playing cards as examples to explain these concepts. Having the necessary definitions, we moved on to define entropy H , relative entropy D(p || q) , and mutual information I(X; Y) and discuss some of their properties. We also took some time working on a few examples of Huffman coding.

## Asymptotic Equipartition Property

We began our discussion on compression with some examples from my homework. I thought that how the Asymptotic Equipartition Property (AEP) implies that entropy is the lower bound on the average code length is really cool, so I tried to show my sister a rigorous proof. This was when the Law of Large Numbers came into play! I defined the typical sequence and proceeded to prove some theorems about the probability that a source sequence is typical and that the size of the typical set is exponentially smaller than the size of the set of all sequences. It turned out that I did not understand the topic very well, so it took me a few attempts to make my proof both rigorous and accessible. For example, I thought that the entropy H in the proof meant the entropy of the whole source sequence, but actually it is the entropy of a single source symbol. There were quite a few little details which I would have missed if I did not brought the proofs to my sister, who pointed out contradictions until I figured out all the errors.

## Reliable Communication

I began by sketching the communication scheme and gave two examples of communication channels — the binary symmetric channel and the binary erasure channel. Then, I defined channel capacity C as the maximum rate of reliable communication, and claimed that C = \max_{p_X} I(X; Y) , where X is the input random variable and Y is the output random variable. I did not prove why the channel capacity can be calculated in this way, but we discussed some implications of this claim. For example, the maximum over p_X in the expression to calculate the channel capacity means that it does not depend on specific inputs, but rather the specification of the channel itself. My sister was more interested how these concepts are related to practical matters, like how we know whether a channel encoding-decoding scheme is good enough by comparing the bit rate of a specific scheme to the channel capacity.

## Lossy Compression & Joint Source-Channel Coding

We went through the setting of lossy compression, the definition of rate-distortion, and how we can calculate the rate-distortion from the mutual information. Then, I concluded with joint source-channel coding which combines reliable communication and lossy compression. I would say the more interesting part of the chat was the example. I had some experiences with computational imaging, so I am familiar working with images. We talked about JPEG compression. I described how images are represented in computers as numbers where each number is the intensity of each pixel in a (RGB) channel. Roughly speaking, in the JPEG compression scheme, the image would be transformed into the discrete cosine transform domain where usually only a few pixels have significant intensity. Therefore, by storing only these pixels, we can still recover the original image by using inverse discrete cosine transform and incur only small distortions.

## Quantum Information Theory

Among the lectures on special topics, I found the quantum information theory lecture very interesting. It was also quite accessible because the lecturer, Meltem Tolunay, went through the preliminaries in a very clear and concise manner, so I decided to bring quantum information theory to the chat. I roughly followed the lecture note by starting from defining states, operators, and measurements, and then explaining entanglement and super-dense coding. My sister asked an interesting question about super-dense coding which I had not thought about beforehand: what exactly about super-dense coding that makes it good, i.e. achieving the channel capacity of 2. Since the answer that it is the way it is and that we had worked out that it is the case was not very satisfying, I took some time and came up with an answer. I think how a quantum state can be a superposition of other states along with entanglement is the reason why super-dense coding achieves its channel capacity. While a classical bit can only be 0 or 1, a quantum state can be between 0 and 1, and with entanglement, we can obtain 4 different states by acting only on one electron!

The chats have been an interesting experience. I hope this account will inspire you to take a shot at learning information theory as well!