The 2019 Stanford Compression Workshop took place on February 15, 2019. It was organized by the laboratory of Tsachy Weissman, a professor of information theory at Stanford university and sponsored by Facebook, Google and non-officially Huawei (with which Stanford has recently severed all contacts). I attended this workshop both as an SCPD student as well as a representative of my software development team from Tencent MediaLab. This is a short digest of the event in which I’m attempting to summarize the key ideas presented prioritized by their relevance to my job and interests, which is research and development of the next generation video compression codecs.
The main three themes of the event were: multimedia compression, the relationship between machine learning and compression, as well as compression of genomics data—the data stored on DNA. Some presentations didn’t necessarily fall within any these three tracks but are still worth mentioning. Besides the individual talks, the event held an open panel between industry and academia on the recent trends in machine learning, compression and their relationship.
Being involved in video codec development, multimedia compression was the topic of the main interest to me so I’m starting with this track and it consists of four presentations.
The presentation from Netflix was devoted to the way they manage their data budget. Evidently, this is a formidable and critical task since, according to some studies, 58% of downstream internet traffic today consists of video streaming, including 15% that comes from Netflix.
A paper presented by a Stanford professor proposed an alternative hybrid design of codec and transport protocol aimed to significantly improve now-latency video delivery capable of readily adapting to changes in network bandwidth.
Google researchers gave a talk about their low bitrate speech encoding. And finally, there was a proof-of concept-level research from Stanford graduate student that experimented with GANs as a new revolutionary way of encoding video.
Tidying up (bits on the internet), Anne Aaron, Netflix
One of the key ideas of the Netflix presentation was technology called VMAF. VMAF is a perception-based metrics of video Netflix uses to evaluate all aspects of video quality in their portfolio that addresses the shortcomings of all direct metrics like PSNR, traditionally used in quality evaluations, that have low correlation with the way people perceive quality in images and video. VMAP uses machine learning to aggregate several numerical statistics collected for a given image into a single score that matches subjective human ratings. It is an open source project, a product of collaboration with other industry players and is used by Netflix throughout their workflow: to compare the different codec technologies, to monitor their data throughput, to optimize their encoding algorithms.
Particularly, they talk about the evolution of the way they encode their video. Originally, they started with very simple destination-dependent encoding, in which the content was encoded according to their targeted destination, for instance, different settings for tv delivery, different setups for mobile phones, etc. Obviously, this didn’t result in producing optimal quality for the wide variety of content—cartoons are very different from high motion videos—so later they switched to title-based encoding, in which the encoding settings were also selected individually for each specific video.
Most recently they added another dimension to the settings’ grid—per-shot encoding. Today they use VMAF rating and dynamic programming to find settings for each movie scene—optimizing either for bandwidth or quality and it works especially well for the content they own since they have the original raw material with all the cuts.
Codec-wise, they use hevc/4k for delivery to tv and avc/720p for mobile. In the future, they plan to use av1.
Salsify: Low-Latency Network Video Through Tighter Integration Between a Video Codec and a Transport Protocol, Sadjad Fouladi, Stanford University
The problem the this paper tries to solve is a suboptimal user experience when the throughput of the network fluctuates. Because the amount of control the transport stream has over the codec is very limited, the codec doesn’t have a chance let alone time to adapt when network bandwidth fluctuates. The fundamental problem here is that in a typical setup there is only one parameter—the bitrate that is given to the codec and even that parameter is only maintained on average. In the event of changing environment, even though the codec tries to achieve the requested bitrate the difference between the sizes of different frame can be quite large. If a huge frame cannot be transmitted within given time slot it doesn’t matter if you compensate for it later on with smaller frames—the user will still experience player’s hiccups, your pipeline stalls, there are breakdowns in image quality—nothing to be particularly enjoyed.
The paper’s solution is a much tighter integration between the transport and the codec. Specifically, the authors propose to discard the bitrate notion to begin with and only thing to worry be the size of the next frame. This would allow the system to be more agile and react much quicker to rapidly changing circumstances.
The other requirement is that the transport should be able to query the state of the encoder and later on reset the latter to the state previously saved. Because the encoder is a state machine and relies on the fact that the decoder knows its state and uses it in decompression process. These assumptions aren’t necessarily maintained unless the transport has a way to control it.
In their open-source implementation, every time the encoder processes a frame it generates two versions—the one with slightly smaller size then the previous frame and the other slightly larger. The transport thus has a choice—it it’s still within the budget of the current throughput then it may pick the larger one to improve the quality. On the other hand, if the the bandwidth drops then it can accommodate it by picking the smaller version or even dropping the frame altogether. The transform informs the encoder of the decision it made, which is especially important in the latter case when the decoder never sees the frame and cannot use it as a reference.
The presentation includes demos with the comparison experiments they had between the Salsify and Facetime or WebRTC when the network throughput changes and, yes, Salsify recovers much faster and the user experience does improve. I believe these ideas are definitely worth considering for any video encoding solution.
Low rate speech coding using deep generative models, Jim Bankoski and Jan Skoglund, Google
The paper from Google deals with the low rate speech encoding, specifically, at less than 8 kbits/second. The goal is to tap into the emerging markets in which very often the data plans are sometimes cheaper than the voice one on the mobile device. Unfortunately, the data connection is slow and unreliable—3G and less. For mid and high bitrate speech channels Google uses open-source Opus codec and it works well for various applications but below 6 kbits/sec it breaks down. The main inspiration behind their attempts to do better was the fact that the estimated speech entropy is about 100 bits/second and it includes everything—timbre, emotion, pitch, anything you expect from a talking person.
There are already standards that encode at 600 bits/second targeted to work in military and government applications. However, the problem with those is they severely degrade speech quality—you can get the message but it sounds like a ham radio or air traffic control—quite distorted and unnaturally sounding. If you compare the input and output spectrally the lower parts of the spectrograms look very similar—that’s the high frequencies where the low bitrate is missing. The authors’ goal was to create a solution to generate these missing high frequencies.
It helps to explain how these low-bitrate speech compression tools work. People familiar with speech processing know that human voice tract can be modelled with reasonable precision using only a 6 or 7 float point numbers. These are essentially the parameters these codecs estimate during encoding and use to synthesize the speech during decoding.
Google’s approach is to build generative neural networks to fill those missing high frequencies. In very simple terms, generative networks work by predict the future given the past. Classical example of such algorithm is Google’s search engine that already knows what you’re searching for once you’ve typed only a few characters. Other network architectures don’t put specific emphasis on previous input but nevertheless attempt to generate data similar to what it saw during the training. Open source WaveNet is capable of generating the speech sound and when you run it without any input it sounds just like a language that makes no sense.
What the authors did was that they conditioned the training on the output of those low-bitrate standard codecs so that the babbler babble the real words. They point out that the data transmitted is still the same output of the old codec but the quality was rated subjectively much better.
Because of its nature the approach has drawbacks. For instance, it’s very optimized to the fact that a single person must be speaking. If it’s given an input of more than one person or if it contains background noise or even music the results can catastrophically deteriorate. They tried to encode some music and it didn’t even sound like somebody singing—it just sounded weird. This is typical for any generative networks as they can only produce something from the same distribution they were trained on and that’s exactly what the Google it going now—training on something that’s not just speech but has noise and other sounds in there.
Neural joint-source channel coding, Kristy Choi, Stanford University
Somewhat revolutionary but hardly practical piece of work from Tsachy’s graduate student that aims to find error-resilient way to compress the images used GANs, another kind of generative network that doesn’t just try to predict the new samples but seeks to find some low-dimensional representation of the sample distribution. The typical example is the network generating images of human faces, which can be described by around 500 floating point numbers and modern networks can generate remarkably convincing high resolution images. This relatively low-dimensional space it continuous that is if you interpolate data points between the two faces in most cases the interpolation results produce just as valid images that morph between the two. The author hopes that if you can learn the codes based on this idea and the transmission channel adds some noise to the latent numbers, the reconstruction results would not be far from the input. She used the MNIST data set so for those who are not familiar with it, it’s a database of handwritten characters, one of the mainstream machine learning datasets that started ML revolution but today it is essentially a toy dataset. So, yes, she can reliably compress images consisting of handwritten images but nothing else. It’s very impractical but nevertheless an example of what people are working on for the next revolutionary image compression technology.
Machine learning and compression
The last two papers already shown that the machine learning has a lot of promise in implementing novel compression schemes. But the ML itself has its own set of compression challenges, particularly, the compressing the neural network weights. Modern neural nets consist of hundreds of layers and hundreds of thousands or million of weights and in uncompressed form take hundreds of thousands of megabytes. If we ever consider using ML for encoders and decoders the need to somehow convey this data between one another very efficiently must be addressed. The first two papers in this track attempt to do it.
The third one, Learning Distributions from their Samples under Communication Constraints from Leighton Barnes, a postdoc at Stanford, which I’m not going to cover, deals with managing the data budget when you have distributed training. When you have a training farm you need to somehow distribute the workload among the nodes and also coordinate their work. The network bandwidth that carries the coordination between the nodes proves to be the bottleneck of this training process. The paper looks into the theoretical limits of such process and touches the depths of information theory that is well beyond the scope of this blog.
Deep Compression and Hardware-Centric AutoML for Efficient Deep Learning Computing, Song Han, MIT
So how do you compress the neural network? The talk was from a Song Han who is currently an MIT professor but he got his PhD from Stanford and I remember him as a guest lecturer at CS231N Convolutional Neural Nets course I was taking two years ago. His lecture was very inspirational, his results were amazing back then and, obviously, he was quite busy since—even though parts of his talk sounded familiar he made even more spectacular progress.
When he first approached the problem of compressing neural nets he began with trivial gzipping the weights, which as expected didn’t work very well. So he tried to reduce the resolution of the weights. The weights usually are trained at 32 bits and he found that in most inference cases (i.e. using already trained network) reducing the number of bits to 4 doesn’t have significant impact on the network performance. Further, often it can be reduced to 1 bit—meaning one or zero and the zero means no weight at all. This way he essentially invented what is now called network pruning. In pruning you can find the weights below certain threshold value and then completely discard them from the network. Initially, the performance drops but if you fine-tune it, i.e. re-train, in most cases the it recovers completely and sometimes it can even improve compared to unpruned original. It’s an open question why it happens but the prevalent theory explains it as another regularization technique.
Having tackled the optimization of inference, Song went on optimizing the training. Because you now have fewer weights—from hundreds of thousands to merely thousands or less—by storing them in a look-up table you can do backprop using fewer calculations as well. His most recent research at MIT focused on how we can automate the pruning process, which before recently was manual, labour intensive process, requiring domain-specific highly trained engineering work. His lab used reinforcement learning to automate the pruning itself and in some cases once again matching or beating the human results. The details of his work are available on his lab’s website. I am a big fan of his work.
Quantization Error in Neural Networks, Elaina Chai, Stanford University
Another interesting find also started as a research in neural net weight quantization. In the process the author found out that the nets that used batch normalization layers quantize better. Batch normalization is relatively new ML technique—invented less than 4 years ago. In very simple terms, it adds additional scaling layers between matrix multiplication layers (e.g., convolutions or fully-connected ones) and activation functions (e.g., relu or sigmoid) in order to move the numerical values into the “sweet spot” areas of activation resulting in healthier distribution of gradients and much quicker training. One of the research outtakes is that the networks using BN also benefited from more uniform data values and as a result were more resilient to the weights’ quantization error.
The method works quite reliably and since its invention become very popular but nobody
knows for sure why it works so well. This makes even more important the other outcome discussed in the presentation—a surprising connection to an old paper from the field of analog circuit design called Normalized LMS. The latter introduced a very similar concept to BN but in analog signal processing hardware and it dates far back as 1950s. The math in the paper looks remarkably similar to BN papers and there’s significant amount of follow-up research in the area. Unfortunately, my expertise is is not at that level to judge but if the claims made are right the prospect of using this knowledge to explain why BN works is very exciting.
Panel: Compression via and for machine learning
Before I go to the final track—genomics— I’d like to summarize the panel since it’s again, is more closely related to my interests and expertise. The panel was comprised of three people from the industry—Google, Samsung and a semi-secret startup WaveOne that works on ways to employ ML methods to image and video compression—as well as two people from Stanford, including the moderator. The key outtakes from the panel are the following.
Despite the progress, ML-based compression is still quite impractical—even with modern powerful GPUs and inference-optimized hardware it’s very resource-intensive and yet barely approaching real time. It’s mostly single-frame based ignoring temporal aspects of video stream and currently the computational costs for the algorithms are heavily dominated by convolutional layers. The panelists agree that its adoption definitely depends on future hardware, especially on the edge device given the fact that mobile GPUs are not even close to the desktop or datacenter-class GPUs.
Regarding image compression, the overall consensus that it has very little chance to ever become competitive with traditional methods like JPEG. The latter is already good enough and is too widely adopted—even JPEG2000 never really took off.
However, video compression and especially other modalities—point cloud was mentioned—do have promise. And the reason for this is not only the hope that it would achieve better bitrates but also that it would be possible to design algorithms in which the compressed data would already contain information that can be used for purposes other than the reconstructing the image itself. The hope that other tasks like image analysis, object detection, segmentation etc., could be done without full decompression.
Likewise, ML-based image compression algorithms have their challenges. One is that image compression task is significantly different from classical ML image tasks, like classification, or even image segmentation, which usually severely reduce the resolution before extracting high level image features. When you do general purpose image decompression, however, you need to maintain the original resolution and do it very faithfully—essentially the same pixels need to be restored, and there’s not a lot of research in this area as far as ML is concerned.
Training is also a challenge because the training data must fully represent the expected input. Like it was mentioned in the previous presentations, if you present the network with something it never saw in training the result is often a complete meltdown.
The final point is that the loss function, the metrics is critical— on one hand, one must select the metrics representative of human quality assessment, on the other, you cheat or if you are not careful and cut corners the ML will find the weakness and will exploit it.
Compressing Genomics Data
The final theme of the workshop was the compressing genomics data fuelled by recently emerging problem of exploding amount of DNA data. The three papers—two from the universities, one from the industry, which I’m not going through individually, have a lot in common so I will instead summarize the key ideas that were repeated in all three.
In the last decade the amount of data related to genes increased by several orders of magnitude. It started with the human genome project that mapped all our genes—essentially reading our DNA, which is roughly 3 billion base pairs, or quaternary digits, resulting in approximately 6 gigabits of information. Ten years ago sequencing a genome costed $100K and took days if not weeks of work. Today you can do the same thing for less than a thousand and a couple of hours. This efficiency breakthrough resulted in deluge of data never seen before.
You might think just 6 gbits—less than a gigabyte, what’s the deal, who cares? Well, it’s not that simple because the DNA is not read as a single strand. The technology is not there yet. The way the sequencing is is done today is whole DNA—the 6 billion base pairs—is broken into short segments, each about 100 base pairs long. Each segment is read independently and then as a jigsaw puzzle, which includes duplicates and overlaps needed in order to compensate for inevitable misreads, is assembled into a single string. Because of the redundancy and because the SAM format file what is currently used is all text the typical file size is about 1.5TB (compared to the 6gbit of actual data). Obviously this can be compressed and it should be compressed or hospitals would need to continue filling their basement with crates of hard drives containing patient’s DNA data.
Another challenge is the unique characteristics of the data itself. Your genome tells a lot about you— not just your gender, race and ethnicity, but your susceptibility to some diseases, your likelihood to live longer life, you socio-economic status—everything that can be exploited by the bad guys. It is also very hard to disassociate from the person it’s obtained from—even if you mask it, even if you aggregate it, the experts are still capable of tracing it back to the owner. Like in cryptography naïve approaches don’t work because villains will find the way to exploit anything that layperson would never suspect possible to compromise. Besides being so hard to anonymize it’s also immutable—if compromised it can’t be changed as, for example, your password—if it’s out in the open there’s no way to go back. It’s not only revealing about yourself it also about your kids, parents, essentially your entire family. It’s a very special kind of data that needs special treatment.
This situation calls for standards and experts who know what they are working with, familiar with all the little issues and pitfalls and hopefully are qualified enough to deal with. Somewhat surprisingly, the effort is lead by MPEG—the committee responsible for most international multimedia compression standards. If you look at proposed standards for file that encode genome they actually look very similar to what we can see in video structures streams. Sometimes they include not just a single individual genome but, for instance, the genome of the family or a test group. And it makes a lot of sense because you don’t need to encode all 3 billion base pairs. Among the people on Earth—all ethnicities and heritage—we’re only different by 0.1% of these characters—so essentially you’re only need to encode this difference. Actually, this fact is heavily used to speed up the sequencing itself. Using the reference genome—an average person DNA—makes it easier compute the specific one just as it’s easier to assemble the puzzle if you know the overall picture even though it may not be exactly what you’re looking for.
When the genome of a family is encoded there’s even less variations. Being able to efficiently index and access pieces of data from such aggregates is another interesting technical problem that is addressed in the Stanford PhD student. The proposed standard also use GABAC, which is final lossless compression stage that is, non-surprisingly, is an arithmetic encoding scheme very sim CABAC used in h264/h265. So this is the kind of optimization they employ in the standards currently under development by industry and academia.
Storing data on the DNA, or using DNA as your hard disk or flash drive was the topic in Advances in High Density Molecular Data Storage from Hanlee Ji, Stanford University.
One obvious advantage of storing general purpose data on a DNA strand is that it is insanely compact. You can record the whole knowledge of human race on a few grams of DNA. Actually, the presentation begins with a claim that if we continue at the current rate of information expansion fueled by youtube and Instagram in a few years we won’t have sufficient amount of chip-grade silicon production to store it.
Another advantage is that DNA is very stable. Data stored on flash and hard drive needs to be refreshed at least once every few years. DNA can be reliably stored for decades in minimally controlled environment—nothing exotic like liquid nitrogen, just a fridge at 15C. With a bit of care one can go much longer—they recently sequenced DNA from a mammoth recovered from a carcass embedded in permafrost for many millennia.
Replication is also extremely quick—terabytes can be copied in seconds using machinery similar to the one constantly at work in every cell of any living thing. The research project that’s mostly driven but the huge progress in sequencing already mentioned. Thanks to these revolutionary advances, reading data off the DNA is already essentially a solved problem—today you can already do it reliably, cheaply, complete with random access using special data markers. It can be done multiple times from the single physical strand with very small rates of degradation.Like I said, reading is essentially solved problem. Writing is much more problematic but people are working on it.
A very cool presentation from a group of high school students—Humans are still the best lossy image compressors— started off a couple of years ago then Tsachy’s lab was given some school interns to entertain. On one hand, nobody wanted to do the baby-sitting but, on the other, they hoped to be able to publish at least something as a result. So the kids were given a task of comparing the existing image compression methods. Instead, they proved to be ambitious enough to go on an invent their own human-based compression method.
The two people—describer and reconstructor have access to the Internet, particularly Google images. The describer gives the instructions to the describer via text chat, while observing the current result of his/her work, very similar to the feedback loop in video compression where encoder relies on previous results coming from of the decoder. They then used the zipped transcript of their chat and compared it to the JPEG2000-compressed version of the same image of the same size as the zipped transcript and subjectively compared the quality of the two. The result was that in all classes of images buy one (faces) the human-compressed images received similar or higher ranks. Obviously, there are many problems with this approach—the consistency is one—but the idea is quite interesting and there is an ongoing research based on this idea.
The conclusion of the event was a beautifully illustrated, entertaining, thought-provoking philosophical and historical analysis—Compression, Self-Sovereign Data and a Decentralized Internet by Jonathan Dotan, HBO, Stanford University— that reflected on relationship between the society, it’s prevalent communication methods, issues of privacy, security, personal, corporate and government responsibility.