Joe Bosetti, Ben Share, Anav Sood
Polar codes as a topic in information theory are both pedagogically interesting and practically useful. As a ‘cutting-edge’ development, they have been empirically shown to provide order-of-magnitude performance benefits. Yet few people are aware of them, and the math behind their effectiveness can serve as an impediment to many hoping to understand their functioning. In this project, we attempt to create a thorough but extremely accessible guide to the topic. In this post, we include an overview of the concepts involved, emphasizing the problem description and what makes polar codes so impactful as a development. We also include a link to our more technical piece on the subject: a rigorous explanation of polar codes aiming to fully explain their derivation and functioning–but maintaining as much of the accessibility of our overview as possible. We hope that these resources will be a useful tool for those hoping to learn about a fascinating and exceedingly impactful topic.
This is an informal introduction to polar codes. For those with an engineering/mathematics background at an undergraduate level and willingness to dive into some basic graduate probability theory, the following set of notes is a thorough mathematical treatment of the topics covered in this blog:
Polar codes exist as a solution to the general question: If I am communicating with some rate of failure for each piece of information I try to convey, can I somehow devise a system that will allow me to predictably convey the information I wish to, with some fixed cost in communication bandwidth related to my rate of error? We can make the question slightly more concrete with a few (easy!) pieces of terminology:
- A channel is a medium for conveying information. It could be an email server, a phone line, or the air in which you’re sending smoke signals to your friend.
- A channel may have some error rate. That is, for every single piece of information I try to send over the channel, with some probability (usually small but non-negligible), the information doesn’t get there. The rest of the time, it gets there uninhibited. Think of this like static on a phone line: if I say ten words in a sentence and the phone line has a 20% error rate, then on average two of the words I say might be obscured by static, while the remaining eight come across clearly.
- As described, a channel works on individual, discrete ‘pieces’ of information. In the above example, we want to convey an entire sentence to someone. But the phone line works on individual words, and conveys each word correctly or not with its particular error rate. So we succeed at conveying the message as a whole only if every single word in the message is successfully transferred.
- We can use this words-in-a-sentence analogy for every type of message. Just as the set of English words can be used to convey every possible English sentence, we can think of every type of message we might wish to convey as fundamentally boiling down to some number of ‘information units’ that, taken together, form a generic ‘message’. We call these information units ‘bits’–these are the 0s and 1s that computers famously think in–and it’s these that a channel ultimately conveys. This means that when we want to send a message, we create some pattern of bits, then send each bit through the channel. Then, if by looking at the results of the channel for each bit, we can recover the bit translation of the original message, we’ve succeeded in communicating.
Given this setup, our goal is to create a system that overcomes the error rate. We should be able to recover our original message essentially 100% of the time, even though each individual bit (‘piece’) of output is liable to be missing with significant probability. Our approach to this will be to send a longer message than the one we’re trying to convey–essentially, to incorporate redundancy into the system.
Why does sending a longer message help us with our goal? Let’s refer back to our earlier words-in-a-sentence analogy. Imagine we were trying to send a very short message: “Hi”, and our phone line statics up 25% of the time–one out of every four words. We could try sending just a single word: “Hi”. But one out of every four times, that one word gets obscured–and in this case, the receiver has no idea what we wanted to say!
Now, what if instead we sent the message: “Hi”, “Hi”? We’ve sent our message twice, which seems unnecessary. But what happens if our first word gets lost now? The receiver will hear something like “[static], Hi”. They’ll still get our message! If either of our words gets lost, the other one conveys the same information, so the sentence still gets through. The only way the information gets lost now is if both words get obscured–which will only happen (¼)^2 = 1/16 of the time.
There’s a tradeoff going on here: the longer I make my input, the more redundancy I can build into it, and the more likely my original message gets across. But I don’t want to make my inputs infinitely long–that would be cumbersome, and certainly not represent effective communication. So our goal is to find a scheme that gets our message across as securely as possible, without too much extra information added for redundancy.
Polar codes are this scheme. By making the message only a little longer, they get the message across with optimal accuracy. In fact, we’ll see that polar codes make the message longer by the minimum amount we could really hope for–a huge deal, and the first type of scheme to accomplish this. How do they do this? The answer lies in the reason for their name–they ‘polarize’ the input. Remember that our channel has some error rate, and that for every bit of information we send, it loses that piece of information at that rate. What if we could ‘shift’ some of this inaccuracy? What if we could send bits so that the message we care about–which is shorter than the total input we give–always comes through clearly, and the excess information is the part that doesn’t get through?
What would this look like? Suppose we had a way of ‘adding’ and ‘subtracting’ pieces of information. Say we want to convey the message “Hello world” across a channel with error rate e (where e is just some small value). The obvious way to do this is by sending the input [“Hello”, “world”]. As noted, our message gets across if both words get across–a probability of 1 – e (since success happens whenever error doesn’t) for each, so (1 – e)^2 for both to come across successfully. Now a weird hypothetical: what if we cared more about the word “world” than the word “hello”? What if we knew we were saying “hello” to something but weren’t sure what? In this case, we might send something more like [“Hello + world”, “world”].
What does this mean? We’re still using our second channel to send “world”, but our first channel is now sending what we’re going to call the sum of “Hello” and “world”. What would the sum of two words look like? (In reality, this process happens on bits, for which there’s a well-defined operation called xor. If you’re familiar with that, imagine it in place here.) We can think of it as acting just like the sum of two numbers–just for words. If we add two numbers, we always get the same result; same for words. If I receive the sum of two numbers, I have no way of knowing what the original numbers were…unless I know one of them, in which I can just subtract it to get the other. So if I just receive “Hello” + “world”, I have no way of knowing what either word was. But if I receive “Hello” + “world” and I know “Hello” was one of the two, then I can subtract it out to figure out that “world” must be the other. So with this concept, our question is: how much of the time can we figure out “world” now?
Now, it should hopefully be clear that if the second channel goes through correctly, we can figure out “world”–that’s just the second channel’s output! But what’s important here is that–even if the second channel errors–if the first channel succeeds, we can figure out “world”. Since we know what the first word is, and the first channel is just the ‘sum’ of both words, if we subtract “Hello” from the output of the first channel we’ll get “world”–exactly the word we cared about conveying. On the other hand, if we wanted to recover “Hello”, not even the first channel would be enough! After all, “Hello” + “world” is only useful to find “Hello” if we know to subtract “world”–which we wouldn’t know unless we also got the second channel successfully. Otherwise, we would not be able to discern from the entire “Hello” + ”world” message what exactly the first word should be. So while we’re more likely to be able to figure out “world”, “Hello” has gotten even less reliable than in our original setup.
What we’re seeing here is the polarization we hinted at before–the idea that a particular encoding structure allows us to shift the probability around, increasing the chance that we’ll be able to recover our message. This is what polar codes are all about. A full encoding structure uses this idea over and over, building bigger and bigger units of these. By repeating this structure, the inputs corresponding to “world” in the example above become extremely secure–we can almost always decode them–and the others become extremely inaccurate.
There might be one part of this explanation that’s still bugging you, though. When we decoded “world” earlier, we assumed we knew what “Hello” was–but we might not! And besides, don’t we care about “Hello” as well? As we just explained, polar codes result in a lot of our inputs becoming extremely inaccurate–why is this not a problem?
This taps into the idea of redundancy we explored earlier. When we use polar codes, we begin with the message we care about–but our input is longer than that. It’s our message, plus a whole bunch of blank data we don’t care about–but which we know the value of! In bits, this is usually just a bunch of 0s–in our sentence analogy, we can imagine always sending the word “blank”. Doing this answers both of the questions posed above–because this is dummy information, we don’t care about our receiver being able to understand it; and as long as we have our default value agreed upon with the receiver, we’ll be able to subtract it out to decode the information we care about. The magic of polar codes is that by sending a fixed amount of this dummy information, we’re guaranteed near-perfect reception of the original message we wanted to convey.
The leaves us with one final question: for a given message and channel to convey it over, how big an input do we need to send? In other words, how much dummy data is required to get the important parts across? It turns out, if the message is length n and the channel has error rate e (meaning we get 1 – e of our information across), the total message length needs to be n/(1 – e). If e is ½ (meaning every other word is lost–a crazily high rate!)–we can get our message across just by doubling the length. If our rate is something reasonable–say e = 1/10–our message would only need to be 10/9 of the length, or barely 11% longer. (As a small subtlety, this is what is known as an asymptotic result, meaning it holds more accurately the larger the size of our message is. The good news is that in the digital age, the size of most messages is verrry large indeed. Small pieces of data–texts, emails, files–are usually measured in megabytes, which are roughly 8 million bits. For something like the movies we stream on Netflix, we receive about a gigabyte for each hour we watch. How much is that? 8 billion bits. So in practice, polar codes end up being as effective as the theory indicates they should be.)
This is an awesome result. It should feel pretty impressive–we can basically eliminate substantial error rates for a small cost in message length. But it turns out, this rate of exchange that polar codes achieve is fundamentally the best we can hope to do. Think about it like this: if our error rate is ½, then the most amount of information we could possibly hope to communicate after n words is half of that–n/2. After all, that’s all we actually get across successfully! It’s impossible to do better than one word’s worth of information per word (one bit worth of information per bit)–so this is really the best result we could possibly hope for. In fact, this fact–that this is the fundamental limit for what an algorithm could aim for–has been known for years. This means that not only do polar codes not only solve the problem–they do it as well as any solution could hope to.
We were very happy with how our outreach effort turned out. As noted, our priority was to make the often-intimidating topics in info theory as accessible as possible. The concepts discussed above are a little complex for elementary school students, so we decided to focus on fundamentals of probability and analysis by introducing students to a fun/thought-provoking game: The St. Petersburg Paradox. The game is a classic for students learning probability theory: it presents a game with finite rewards, but with decreasing chances for exponentially increasing rewards. It thus presents a paradox to evaluate: its expected return is infinite, but in reality one will certainly be unwilling to pay an infinite amount to play. We challenged students to choose between playing the game and receiving a fixed reward (with payouts for both given in candy). The hope was to develop intuition/appreciation for this type of analysis, as well as to excite students with an intriguing puzzle and provoke further thought and reflection. Overall, we think the students definitely gained some intuition regarding exponential growth, and some of the older students also examined the geometric distribution of winnings and learned that is wasn’t always empirically advantageous to play the game as compared to taking the fixed reward. They did always find it exciting to play though!