Trevor Caldwell, Siddharth Saxena, Robert Pinkerton
Motivation
When thinking about major events in the evolution of audio, the shift from CD players to MP3 players is one of the most prominent of the last several decades. As consumers became more proficient with computers, they wanted to harness the computing and storage power of these new devices; at the same time, they still wanted a way to listen to their audio files on the go. This birthed portable MP3 players, which hit an inflection point once Apple launched the first iPod. Ever since then, MP3 has been the primary audio file format that most consumers interact with (recording artists are a separate story).
One thing people don't often realize is that MP3 is a lossy file format. This means that, somewhere along the production of any MP3 file, information was thrown out so the overall file could be compressed and stored with fewer bits. You may then ask, what would it sound like if it were never compressed? The answer is found by backtracking from MP3 to compact discs (CDs). CDs store two channels of 16-bit pulse code modulation (PCM) data sampled at 44.1 kHz. Given that half this sample rate is above the upper limit of hearing for most humans, converting an analog signal to PCM at 44.1 kHz is a lossless conversion as far as humans are concerned (again, recording artists may have a different take on this 'lossless' characterization). .WAV is a common file format that stores this same PCM data. This explains why many people claim that CD audio sounds better than MP3 audio: it hasn't undergone the lossy conversion that MP3 requires. Unfortunately, if an MP3 player wanted to store lossless audio like CDs, it could only store ~10% as many tracks. Here lies the tradeoff between compression and quality.
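As a quick sanity check on that ~10% figure, the raw data rates can be computed directly; the 128 kbps MP3 rate used below is an assumed typical encoding setting, not a fixed property of the format:

```python
# Rough data-rate comparison between CD-quality PCM and a typical MP3 encoding.
cd_bits_per_second = 44_100 * 16 * 2   # sample rate * bits per sample * channels
mp3_bits_per_second = 128_000          # assumed typical MP3 bit rate

print(f"CD PCM: {cd_bits_per_second / 1000:.1f} kbps")    # 1411.2 kbps
print(f"MP3:    {mp3_bits_per_second / 1000:.1f} kbps")   # 128.0 kbps
print(f"MP3 size relative to CD: {mp3_bits_per_second / cd_bits_per_second:.1%}")  # ~9.1%
```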
How traditional audio compression works
MP3 was previously used as an example because it is one of the most common file formats for digital audio nowadays. With that said, there are many other lossy file formats such as AAC (.mp4, .m4a), OGG (.ogg), and Musepack (.mpc), as well as lossless formats such as WAVE (.wav), FLAC (.flac), and APE (.ape) [1]. Given the prominence of MP3, we’ll continue with that example as we outline perceptual audio coders below.
Time-to-frequency mapping
Given that a large portion of audio is highly tonal, most coders convert time-domain data into its frequency-domain representation. Think of a pure sine wave, for example. While its time-domain representation is infinite, it can be fully represented with only 2 non-zero samples in the frequency domain (and possibly 1 if we take advantage of the symmetry about the y-axis). Many instruments have this harmonic nature as well, and speech is band-limited; both properties allow for a more compact representation in the frequency domain.
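To make the sine-wave example concrete, here is a minimal NumPy sketch; the block length of 64 and bin index 5 are arbitrary choices for illustration:

```python
import numpy as np

N = 64                               # block length (arbitrary for this illustration)
k = 5                                # choose a frequency that lies exactly on DFT bin k
n = np.arange(N)
x = np.sin(2 * np.pi * k * n / N)    # time-domain sine: N non-zero samples

X = np.fft.fft(x)
nonzero_bins = np.where(np.abs(X) > 1e-9)[0]
print(nonzero_bins)                  # [ 5 59] -> only bins k and N-k are non-zero
```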
In mapping from time to frequency, coders use a variety of transforms, such as the Discrete Fourier Transform (DFT), the Modified Discrete Cosine Transform (MDCT), and the Pseudo-Quadrature Mirror Filterbank (PQMF). MP3 uses the PQMF cascaded with the MDCT in the analysis stage, with the inverse set of filters applied for synthesis. One block of data (usually between 512 and 2048 samples) is windowed and then passed through each filter consecutively. This hybrid filter model allows for better frequency resolution, which becomes critical for later steps in the algorithm. However, each of these filters does introduce aliasing in the frequency content (which is only removed in the synthesis stage by an overlap-add mechanism), and for this reason the algorithm also computes the DFT (via the FFT) for use during psychoacoustic modeling.
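For reference, the MDCT of a single windowed block can be written straight from its definition. This is a didactic sketch only; the actual MP3 analysis stage cascades the 32-band PQMF with a short MDCT in each band and uses fast algorithms:

```python
import numpy as np

def mdct(block):
    """Direct (slow) MDCT of one block of 2N samples, producing N coefficients.

    A didactic implementation of the MDCT definition with a sine window;
    real coders use fast FFT-based algorithms and overlap consecutive
    blocks by 50% so that the time-domain aliasing cancels on synthesis.
    """
    two_n = len(block)
    N = two_n // 2
    n = np.arange(two_n)
    window = np.sin(np.pi / two_n * (n + 0.5))          # sine window (Princen-Bradley condition)
    k = np.arange(N).reshape(-1, 1)
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ (window * block)

# 1152 samples is one MP3 frame per channel; used here only to pick a block size.
coeffs = mdct(np.random.randn(1152))
print(coeffs.shape)   # (576,)
```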
Psychoacoustic modeling
The purpose of the psychoacoustic modeling stage is to identify the spectral components that won't be picked up by the human ear. This obscuring process is referred to as "masking," and it occurs in both time and frequency. Temporal masking describes how a loud sound masks quieter sounds that occur shortly before or after it in time, while frequency (simultaneous) masking describes how a frequency component 'masks' nearby frequencies of smaller amplitude, with the effect falling off as the separation in frequency increases. In the MP3 encoding process, the major maskers are identified and combined to determine a "masked threshold" across the entire spectrum; amplitudes below this threshold will not be detected by the human ear. Finally, the threshold in quiet is folded into the masked threshold, since even in complete silence a sound must exceed a certain level before humans can hear it. This masked threshold is then used to inform the bit allocation during the quantization stage.
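To give a flavor of what such a model computes, the sketch below combines Terhardt's standard approximation of the threshold in quiet with a crude single-masker spreading function on the Bark scale. The 15 dB/Bark slope and 10 dB offset are illustrative assumptions, not values from the MP3 psychoacoustic model:

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the absolute hearing threshold (dB SPL)."""
    f = f_hz / 1000.0
    return 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

def bark(f_hz):
    """Approximate Hz -> Bark mapping (Zwicker)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0)**2)

def masked_threshold_db(freqs_hz, masker_freq_hz, masker_level_db,
                        slope_db_per_bark=15.0, offset_db=10.0):
    """Crude single-masker model: a triangular spread around the masker on the
    Bark scale, combined with the threshold in quiet. Slope and offset are
    illustrative assumptions, not values from the MP3 standard."""
    dz = np.abs(bark(freqs_hz) - bark(masker_freq_hz))
    spread = masker_level_db - offset_db - slope_db_per_bark * dz
    return np.maximum(spread, threshold_in_quiet_db(freqs_hz))

freqs = np.linspace(50, 16000, 512)
thr = masked_threshold_db(freqs, masker_freq_hz=1000.0, masker_level_db=70.0)
# Spectral components falling below `thr` can be quantized coarsely or dropped.
```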
Quantization
Quantization is applied to the coefficients at the output of the hybrid filter bank in order to compress the signal. This is typically done using a floating-point quantization technique as opposed to uniformly spaced quantization. Floating-point quantization keeps the signal-to-noise ratio fairly constant across the full range of input values, with the noise in question being the quantization noise. Because the psychoacoustic model determines which components may be thrown out (or at least quantized more coarsely), it is crucial for it to work from an accurate frequency representation, which is why the alias-free FFT mentioned above is used.
In floating-point quantization, each coefficient is assigned a scale and mantissa value, with the mantissa providing the granularity. Thus, signals far above the masked threshold, which will be clearly detected by the ear, are allocated more mantissa bits, while those well below the masked threshold are assigned fewer mantissa bits. This bit allocation is constrained by the overall data rate of the system, where the number of bits per input block is kept constant.
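A minimal sketch of this idea for a single band is shown below, using the rough 6 dB-of-SNR-per-bit rule to pick the mantissa word length from the signal-to-mask ratio; the real MP3 rate-control loop is considerably more involved and also respects the total bit budget:

```python
import numpy as np

def quantize_band(coeffs, signal_to_mask_db, max_mantissa_bits=15):
    """Block floating-point quantization of one band of filterbank coefficients.

    One scale factor is shared by the whole band; each coefficient keeps its
    own mantissa. The mantissa word length is chosen from the signal-to-mask
    ratio using the rough 6 dB-per-bit rule (illustrative only).
    """
    bits = int(np.clip(np.ceil(signal_to_mask_db / 6.02), 0, max_mantissa_bits))
    if bits == 0:
        # Band lies entirely below the masked threshold: drop it.
        return 0.0, np.zeros(len(coeffs), dtype=int), 0

    scale = np.max(np.abs(coeffs)) + 1e-12        # shared scale factor
    half_levels = 2 ** (bits - 1)                 # signed mantissa range
    mantissas = np.clip(np.round(coeffs / scale * half_levels),
                        -half_levels, half_levels - 1).astype(int)
    return scale, mantissas, bits

def dequantize_band(scale, mantissas, bits):
    if bits == 0:
        return np.zeros(len(mantissas))
    return mantissas / 2 ** (bits - 1) * scale

band = np.array([0.8, -0.3, 0.05, 0.6])
scale, mant, bits = quantize_band(band, signal_to_mask_db=30.0)   # -> 5 mantissa bits
print(bits, dequantize_band(scale, mant, bits))
```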
Additional features
The above stages are the most essential building blocks of the MP3 encoding process. That being said, there are a number of additional features included in both MP3 and other audio coders, including multi-channel coding, entropy coding, and techniques like Spectral Band Replication, where the high-frequency content is thrown out and only its envelope is stored and applied to a copy of the low band. Additionally, there are block-switching mechanisms that shorten the analysis block length in order to better encode signals with transients.
With all of these features, perceptual audio coders are able to achieve compression ratios around 7-15, with reconstructed audio that is typically indistinguishable from the original input. Such coded files are considered "perceptually transparent" when the average listener cannot detect any difference. However, transparency is not always required. There are several application areas where a low-fidelity representation of the original audio is sufficient. In these scenarios, there are compression methods that can improve the compression ratios by orders of magnitude, often using a completely different approach than that of the perceptual audio coders described above.
Our approach (How humans would communicate a song to each other)
In this work, we analyze whether a more human-centric way of communicating music would be more efficient than conventional MP3 compression, and how practical or feasible such a scheme would be.
If a music composer were to communicate to another person about how the song is composed, they would list the following attributes of the song:
- The lyrics of a song:
Lyrics typically follow the song's structure: an intro, a few verses, a repeating chorus section, and an ending.
- The instruments used in the song:
Each instrument has a unique characteristic sound associated with it, a quality known as timbre. For example, if a piano and a violin were playing the exact same musical note, humans would still be able to tell which sound is coming from the piano and which from the violin. That's because even when we attempt to play a single musical note on an instrument, it's never really just one frequency, but a fundamental plus a set of overtones at multiples of that frequency. The relative strengths of these overtones (together with how the sound rises and decays) give each instrument its unique character; a small synthesis sketch after this list illustrates the idea.
That's why we humans can listen to a piece of music and figure out that it contains piano, violin, and guitar playing simultaneously: over time, our brains have learnt to distinguish the sound quality of various instruments.
- Sequence of notes played on each instrument:
This information is typically conveyed through a music sheet showing staff notation, but it is effectively a time series, with each time step telling the musician what note to play on an instrument. In an orchestra, each instrumentalist has a music sheet to play from, and when everyone plays their part, the ensemble collectively makes up the whole musical piece.
Apart from the information about musical notes, one would also want to capture the sound volume of each instrument, because over the duration of a song, different parts of the orchestra become dominant and then fade away.
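As mentioned above, a small additive-synthesis sketch can make the timbre point concrete: the same pitch is generated with two made-up overtone mixes (the harmonic weights are illustrative, not measurements of real instruments):

```python
import numpy as np

SR = 22_050  # sample rate in Hz

def synth_note(f0_hz, harmonic_weights, duration_s=1.0, sr=SR):
    """Additive synthesis of one note: a fundamental plus weighted overtones.

    Changing the (made-up) harmonic weights changes the perceived timbre
    even though the pitch (f0) stays the same.
    """
    t = np.arange(int(duration_s * sr)) / sr
    tone = sum(w * np.sin(2 * np.pi * (i + 1) * f0_hz * t)
               for i, w in enumerate(harmonic_weights))
    envelope = np.exp(-3.0 * t)                 # simple decaying amplitude envelope
    return envelope * tone / np.max(np.abs(tone))

a4 = 440.0
bright = synth_note(a4, [1.0, 0.6, 0.4, 0.3, 0.2])   # overtone-rich, "brighter" timbre
mellow = synth_note(a4, [1.0, 0.2, 0.05])            # mostly fundamental, "mellower" timbre
# Both signals have the same pitch (A4) but sound different because of their overtone mix.
```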

A close digital analogue of this tabular representation of a song is a MIDI file, which stores the same information. It is a structured data file, and so it can be compressed using text compression algorithms. In order to play back the tabular data from a MIDI file, one needs a digital synthesizer that can generate the sounds of the instruments listed in the MIDI file.
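As a concrete illustration, the event stream of a MIDI file can be inspected with a few lines of Python; this sketch assumes the third-party mido library and a hypothetical file name song.mid:

```python
import mido  # third-party library: pip install mido

mid = mido.MidiFile("song.mid")          # hypothetical file name

# Walk the merged event stream: each message says which note to start or stop,
# on which channel (instrument), how hard it was struck, and when.
elapsed = 0.0
for msg in mid:
    elapsed += msg.time                  # delta time in seconds when iterating a MidiFile
    if msg.type == "note_on" and msg.velocity > 0:
        print(f"t={elapsed:7.3f}s  channel={msg.channel:2d}  "
              f"note={msg.note:3d}  velocity={msg.velocity:3d}")
```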
Our experiment setup
MIDI files only work for instrumental pieces, so for the scope of this project, we have restricted ourselves to only instrumental music.
We model the process of humans communicating music in the following steps:
- A song composer provides the final composed song (an .mp3 file) and per-instrument track information as a MIDI file. There are huge databases of MIDI files of popular songs online.
The one we used is http://www.bobsoremweb.com/misc_midi.html, which also has the MP3 source files corresponding to its MIDI files, allowing a fair comparison of compression ratios.
Another popular MIDI download site is: http://www.midiworld.com/files/
This is the step where human effort is involved. You need an expert musician to break the song down into its component pieces by listening to it again and again and figuring out, through timbre classification, which unique instruments are being played in the song.
A neural network with a lot of training might be able to do this timbre classification and decompose the song automatically, but current research on the topic is still at a fairly nascent stage.
A CS229 project attempts to give a binary answer to whether a short piece of music comes from a piano or not: http://cs229.stanford.edu/proj2013/Park-MusicalInstrumentExtractionThroughTimbreClassfication.pdf
Solving the timbre classification problem in software is the biggest barrier to making this model of music compression practically feasible.
- Once the MIDI file is available, we can apply text compression algorithms to compress it further and then send it to the receiver (a minimal measurement sketch appears after this list).
- The receiver decompresses the MIDI data file and plays it back on digital synthesizer software that can read and interpret MIDI files. We use GarageBand (the default app on Macs and iPads) for this.
So, the compressed MIDI file essentially represents all the information contained in the song.
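As noted in the list above, here is a minimal sketch of how the comparison can be run, assuming hypothetical file names and using DEFLATE (the compressor behind gzip/zip) via Python's zlib; MIDI is a binary format, but the same general-purpose lossless compressors apply:

```python
import os
import zlib

MIDI_PATH = "song.mid"   # hypothetical paths; substitute files from the sources above
MP3_PATH = "song.mp3"
WAV_PATH = "song.wav"

with open(MIDI_PATH, "rb") as f:
    midi_bytes = f.read()
compressed_midi = zlib.compress(midi_bytes, level=9)   # DEFLATE, as used by gzip/zip

midi_size = len(compressed_midi)
mp3_size = os.path.getsize(MP3_PATH)
wav_size = os.path.getsize(WAV_PATH)

print(f"WAV -> MP3 ratio:             {wav_size / mp3_size:6.1f} : 1")
print(f"MP3 -> compressed MIDI ratio: {mp3_size / midi_size:6.1f} : 1")
print(f"WAV -> compressed MIDI ratio: {wav_size / midi_size:6.1f} : 1")
```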
We ran this setup on the following 4 songs (which are also uploaded on the blog). Note that this compression is on top of the WAV-to-MP3 compression, which already gives approximately a 10:1 compression ratio.

Outreach
For the outreach event at Nixon Elementary School, we introduced music theory, composition, and compression to the students. We showed the students the complete lyrics to the Hokey Pokey, which took up a full page of text. We then explained how one could compress the lyrics by writing out the repeated sections only once and filling in just the parts that change, like a fill-in-the-blank iterative algorithm; this new format for expressing the lyrics of the song consisted of only a few lines of text. We then went on to introduce music composition (intro-verse-chorus-outro, melody, harmony, etc.) and characterization (notes, pitch). The students then listened to a composed track of Happy by Pharrell Williams as we built up the song channel by channel.
More audio results