Comparing information content of language representations under compression

EE376A (Winter 2019)
Cindy Wang and Stephanie Chen

If you’re sitting in a cafe and feel a little hungry, you might say to the waiter, “I would like a croissant, please.” However, if you’re in Paris, you might say, “Je voudrais un croissant, s’il vous plaît.” On a surface level, these sentences are saying the same thing — they are direct translations of each other, after all. Both sentences communicate the first-person subject (I), a verb expressing want or desire (would like), the object (a croissant), and a function word to express politeness (please). Upon closer inspection, you might notice that the French sentence has extra information: the conjugation of the verb voudrais reinforces the first-person subject, the article un denotes that croissant is a masculine noun, and the formal pronoun in s’il vous plaît offers context — the speaker is saying please to someone they do not know.

In a way, language is an encoding for semantic information. We can think of the rules of various languages (e.g. English or French) as different encoding schemes, and the various representations of each language (like writing systems, transcription systems, etc.) as further encodings. Under this paradigm, we might wonder whether the encoding mechanisms of different languages vary in their quality and information capacity, and whether different language representations vary in how they preserve the information transmitted by the language. In this project, we compare how several writing and transcription systems in three different languages perform under the same compression scheme, in order to explore how compressible information varies across languages and how information preservation varies across language representations.


Experiments

Dataset

For our project, we used the German-English and Spanish-English parallel corpora from the Europarl dataset, which is a collection of transcriptions, in a variety of languages, of European Parliament proceedings from 1996 to 2011. Because these corpora are parallel translations of proceedings by professional human translators, each set contains approximately the same amount of basic semantic information expressed fluently in each language.

DEFLATE Compression

For compression, we chose the DEFLATE algorithm as it is widely used, fast for experimentation, and lossless. DEFLATE uses a two-stage approach (both stages of which we discussed in EE376A!):

  1. Matching and replacing duplicate strings with pointers (the LZ77 algorithm)
  2. Replacing symbols based on frequency of use (Huffman coding)

A DEFLATE data stream consists of a series of blocks. Block size is arbitrary, and blocks can be encoded in one of three ways:

  1. As raw literals between 0 and 65,535 bytes in length. This option is used for data that is incompressible.
  2. As a compressed block, using a static Huffman encoding. This is generally used for short messages, when the compression gains achieved do not warrant the cost of generating and storing a Huffman tree.
  3. As a compressed block, using a dynamic Huffman encoding. Most blocks are stored in this way. 
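The behavior described above is easy to observe with Python's zlib module, which wraps the same DEFLATE implementation gzip uses. A quick sketch: repetitive text shrinks dramatically, while random bytes come out essentially unchanged (stored as raw literals plus a few bytes of block overhead).

```python
import os
import zlib

# Repetitive text deduplicates well (LZ77) and has a skewed
# symbol distribution (Huffman), so it compresses heavily.
repetitive = b"I would like a croissant, please. " * 100

# Random bytes are essentially incompressible; DEFLATE falls back
# to (near-)raw storage, adding only a few bytes of overhead.
random_data = os.urandom(4500)

comp_rep = zlib.compress(repetitive, 9)
comp_rand = zlib.compress(random_data, 9)

print(len(repetitive), "->", len(comp_rep))     # large reduction
print(len(random_data), "->", len(comp_rand))   # roughly unchanged

# DEFLATE is lossless: the round trip recovers the input exactly.
assert zlib.decompress(comp_rep) == repetitive
```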

Stage 1: LZ77

The deduplication stage proceeds as the LZ77 algorithm described in class. Duplicate strings in a block are found and replaced with a pointer consisting of a match length (8 bits) and the distance back to the earlier occurrence (15 bits). A minimum match length of 3 bytes and a sliding window of 32KB are used.
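A toy version of this stage can be written directly from the description above. This greedy brute-force sketch is not how gzip actually searches (it uses hash chains for speed), but it produces the same kind of literal/(length, distance) token stream, including overlapping matches:

```python
def lz77_tokens(data, window=32 * 1024, min_match=3, max_match=258):
    """Greedy LZ77 parse: emit literals and (length, distance) pairs."""
    i, tokens = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window for the longest earlier match.
        for j in range(max(0, i - window), i):
            k = 0
            while (k < max_match and i + k < len(data)
                   and data[j + k] == data[i + k]):
                k += 1  # comparing against data[j+k] allows overlap
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            tokens.append(("match", best_len, best_dist))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz77_decode(tokens):
    """Replay the token stream; matches copy from already-decoded output."""
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, length, dist = t
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

data = b"a croissant, please. a croissant, please."
assert lz77_decode(lz77_tokens(data)) == data
```

Note the overlapping-match case: on `b"abcabcabc"` the parse emits three literals followed by a single `("match", 6, 3)`, a copy longer than the distance it points back.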

Stage 2: Huffman coding

The second stage generates Huffman trees (if this is a dynamic Huffman block) and encodes the output of stage 1 with them. This achieves bit reduction by replacing commonly used symbols with shorter representations. (See the Huffman coding class notes for more details.)

Each symbol in the output of stage 1 may be a literal, a match length, or a match distance. Stage 2 replaces each of these with a literal/length code and optionally a distance code. Two trees are thus created:

  • Literal/length code (288 symbols)
    • 0-255: The literal bytes/symbols 0–255.
    • 256: End of block.
    • 257-285: Match length range, with extra bits* to encode the actual length.
  • Distance code (32 symbols)
    • Distance range, with extra bits* to encode the actual distance.

* For example, symbols 4-5 encode distances 5-8. One extra bit is needed (in addition to the bit already communicated by whether the symbol is 4 or 5) to encode the index of the actual distance within the range 5-8.

DEFLATE thus uses a scheme that places literals, lengths, and an end-of-block symbol together into a single alphabet. Distances are encoded in a separate alphabet, and this can be safely done since a distance can only occur after a length.

The two above codes are themselves encoded compactly as canonical Huffman codes, with run-length encoding for the bit length of the code for each symbol. It is easy for the compressor to choose whether the static or dynamic compressed size is smaller, as the static compressed size can be computed using the same statistics (the number of times each symbol appears) as are used to generate the dynamic trees.
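The canonical construction can be sketched in a few lines: build per-symbol code lengths with the usual Huffman merge, then assign codewords in (length, symbol) order, so that the lengths alone fully determine the code. This is a toy sketch of the idea, not DEFLATE's actual run-length-encoded tree format:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Code length (in bits) per symbol via the standard Huffman merge."""
    if len(freqs) == 1:  # degenerate case: a single symbol gets 1 bit
        return {next(iter(freqs)): 1}
    heap = [(f, [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        fa, sa = heapq.heappop(heap)
        fb, sb = heapq.heappop(heap)
        for s in sa + sb:      # every symbol under the merged node
            lengths[s] += 1    # moves one level deeper in the tree
        heapq.heappush(heap, (fa + fb, sa + sb))
    return lengths

def canonical_codes(lengths):
    """Assign codewords in (length, symbol) order; transmitting the
    lengths alone is then enough to reconstruct the whole code."""
    codes, code, prev_len = {}, 0, 0
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (ln - prev_len)
        codes[sym] = format(code, f"0{ln}b")
        code += 1
        prev_len = ln
    return codes

lengths = huffman_code_lengths(Counter("aaaabbc"))
print(lengths)                  # {'a': 1, 'b': 2, 'c': 2}
print(canonical_codes(lengths)) # a prefix-free code: 0, 10, 11
```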

An optimized implementation of DEFLATE is available via gzip, which is included in most *nix distributions and which we use here. (Fun fact: in addition to gzip, DEFLATE is used in PNG image files and the ZIP file format.) The compressed sizes below refer to the size of the GZ file produced by running gzip.

As a baseline, we compress the raw parallel corpora using DEFLATE. The compression rates of the three languages are similar, with German and English both at just under 68% and Spanish slightly better at 69%. Notably, the German and Spanish uncompressed file sizes are both about 14% larger than the English file.
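By "compression rate" we mean the fraction of bytes saved, i.e. 1 − (compressed size / raw size). A minimal sketch of the measurement, using a synthetic stand-in for the corpora (the real Europarl files are far larger, so the exact numbers differ):

```python
import gzip

def compression_rate(raw: bytes) -> float:
    """Fraction of bytes saved by DEFLATE: 1 - compressed/raw."""
    return 1 - len(gzip.compress(raw, compresslevel=9)) / len(raw)

# Hypothetical stand-in text; repetitive parliamentary boilerplate
# is exactly the kind of redundancy LZ77 exploits.
sample = ("Mr President, the Commission must prepare its programme. "
          * 500).encode()
print(f"compression rate: {compression_rate(sample):.1%}")
```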

Table 1: DEFLATE compression of the plaintext corpora (raw and compressed sizes).

IPA Transcription

Using eSpeak, a text-to-speech synthesizer, we then convert all three corpora to transcriptions in the International Phonetic Alphabet (IPA), the standard phonetic alphabet used in linguistics. The IPA contains 107 letters and 52 diacritical markers, which together capture the wide variety of vowel and consonant sounds, tones, stresses, and other phonetic features present in human languages. IPA transcription provides a standardized way to represent the way in which a language actually sounds, bypassing the idiosyncrasies of writing systems. For example, the English sentence

There has therefore been enough time for the Commission to prepare its programme and for us to become familiar with it and explain it to our citizens.

is transcribed in IPA (using Received Pronunciation, a.k.a. standard British English) as

ðeə hɐz ðˈeəfɔː bˌiːn ɪnˈʌf tˈaɪm fəðə kəmˈɪʃən tə pɹɪpˈeəɹ ɪts pɹˈəʊɡɹam and fɔːɹ ˌʌs tə bɪkˌʌm fəmˈɪliə wɪð ɪt and ɛksplˈeɪn ɪt tʊ aʊə sˈɪtɪzənz

Note how certain sounds in English that require long (and often unintuitive) character combinations to write become collapsed and standardized in IPA. For example, the sound indicated by ugh in enough, which can also be spelled as f, ff, or ph in English, becomes /f/ in /ɪnˈʌf/. Likewise, the sh sound indicated by ssi in Commission (also spelled in a variety of ways) becomes /ʃ/ in /kəmˈɪʃən/. 

Since this transcription format coalesces different spellings of identical sounds, we predicted that the compression rate would be higher for IPA transcriptions than for plaintext (Table 1). As expected, the compression rates were higher across the board. Surprisingly, however, the compression rates of the IPA transcriptions for each of the three languages were identical.

Table 2: DEFLATE compression of the IPA transcriptions.

Though IPA is the most recognized standard in phonetic transcription, its heavy use of characters outside the standard Latin alphabet makes it difficult to compare file sizes with our plaintext, since written German, English, and Spanish all mostly use standard Latin characters that can be encoded in 1 byte. For example, our files are UTF-8 encoded, which means that the 1-byte o in Commission is replaced by a 2-byte /ə/ once transcribed into IPA. This artificial inflation is partially offset by the higher compression rate of IPA transcriptions, but both the raw and compressed IPA files are much larger than their plaintext versions.
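The size mismatch is easy to check directly with Python's built-in UTF-8 encoder:

```python
# Latin letters used by the plaintext corpora fit in 1 UTF-8 byte,
# while most IPA symbols need 2 (or more) bytes each.
print(len("o".encode("utf-8")))   # 1 byte
print(len("ə".encode("utf-8")))   # 2 bytes (schwa, U+0259)

ipa = "kəmˈɪʃən"   # "Commission" transcribed into IPA
print(len(ipa), "characters,", len(ipa.encode("utf-8")), "bytes")
```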

X-SAMPA Transcription

To address this mismatch, we then considered phonetic transcription into X-SAMPA, an alternative transcription scheme that covers the full IPA using only ASCII characters. Though the IPA already makes use of all 26 lowercase Latin letters, X-SAMPA leverages uppercase letters, numbers, and special ASCII symbols like @, so that nearly all (with one or two exceptions) IPA characters used in German, English, and Spanish can be represented with a single byte of ASCII. The example sentence above, in X-SAMPA, becomes

De@ h6z D”e@fO: b%i:n In”Vf t”aIm f@D@ k@m”IS@n t@ pr\Ip”e@r\ It-s pr\”@Ugr\am and fO:r\ %Vs t@ bIk%Vm f@m”Ili@ wID It and Ekspl”eIn It tU aU@ s”ItIz@nz
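The conversion is essentially a symbol-for-symbol substitution. The sketch below covers only a tiny illustrative subset of the full IPA-to-X-SAMPA table (note that a few X-SAMPA codes are multi-character, e.g. the alveolar approximant ɹ becomes r\, though each character is still 1 byte of ASCII):

```python
# Illustrative subset of the IPA -> X-SAMPA mapping.
IPA_TO_XSAMPA = {
    "ə": "@", "ʃ": "S", "ɪ": "I", "ʌ": "V",
    "ð": "D", "ɛ": "E", "ʊ": "U", "ɹ": "r\\",
    "ˈ": '"',  # primary stress
    "ˌ": "%",  # secondary stress
    "ː": ":",  # vowel length
}

def to_xsampa(ipa: str) -> str:
    """Map each IPA character to ASCII; pass unknown characters through."""
    return "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in ipa)

print(to_xsampa("kəmˈɪʃən"))  # Commission
print(to_xsampa("ɪnˈʌf"))     # enough
```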

We can see in Table 3 below that as expected, X-SAMPA transcription results in much smaller raw file sizes than IPA transcription.

Table 3: DEFLATE compression of the X-SAMPA transcriptions.

We note that neither transcription yields smaller compressed sizes than plaintext, even though X-SAMPA uses the ASCII symbol alphabet. The transcription process inherently encodes additional information: specifically, the precise pronunciation of words, including stress and vowel length, which is often not obvious from their spellings. Therefore, the raw and compressed sizes here are not directly comparable with those from Table 1.

Insights

We observe that in every representation, from plaintext to IPA to X-SAMPA, the compression rates of our three languages were within 2% of each other. This suggests that the symbol distributions of German, English, and Spanish are approximately the same shape. In particular, the compression rates were closest for the IPA transcriptions, in which the mapping from symbol to phonetic sound is closest to one-to-one. This yields an interesting linguistic insight: the nth most common sound in German is about as common as the nth most common sounds in English and Spanish. Examining the raw IPA symbol counts (as a percentage of total text), we see that this holds; but although the distributions are similar in shape, the actual sounds that are most common in each language are quite different. Note that we include IPA symbols for whitespace (_), stress (ˈ), and vowel length (ː), as they are relevant characters to the compression algorithm, even though they do not represent specific sounds.
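Computing these distributions amounts to a symbol frequency count. A minimal sketch (the snippet below uses a short IPA fragment as a stand-in for a full corpus):

```python
from collections import Counter

def symbol_distribution(text):
    """Relative frequency of each symbol, most common first. Huffman
    code lengths depend only on these frequencies, not on which
    particular symbols carry them, which is why similarly shaped
    distributions compress at similar rates."""
    counts = Counter(text)
    total = sum(counts.values())
    return [(sym, n / total) for sym, n in counts.most_common()]

for sym, p in symbol_distribution("ðeə hɐz ðˈeəfɔː bˌiːn ɪnˈʌf")[:5]:
    print(repr(sym), f"{p:.1%}")
```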

Table 4: Most common IPA symbols in each language, as a percentage of total text.

Potential future explorations might include an encoding process that first maps IPA symbols to an 8-bit character value (there are only 163 total IPA symbols), then performs optimal plaintext compression. This could yield smaller raw sizes for IPA transcriptions, but as the symbol distribution would still be the same, this would probably not improve compression rate. Another interesting direction is lossy compression. Since written plaintext is not always recoverable from phonetic transcription, we might want to evaluate the information gain and loss of a transcription + compression pipeline.
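The remapping step of that first idea can be sketched directly. This is a hypothetical illustration, deriving the symbol table from the text itself; a real codec would fix the table in advance so the decoder can share it:

```python
import zlib

def remap(text: str) -> bytes:
    """One byte per distinct symbol; feasible since the full IPA has
    only ~163 symbols, well under the 256 values a byte can hold."""
    table = {ch: i for i, ch in enumerate(dict.fromkeys(text))}
    assert len(table) <= 256
    return bytes(table[ch] for ch in text)

ipa = "ðeə hɐz ðˈeəfɔː bˌiːn ɪnˈʌf" * 100
raw_utf8 = ipa.encode("utf-8")
packed = remap(ipa)
# Raw size drops from UTF-8 bytes to exactly one byte per character...
print(len(raw_utf8), "->", len(packed))
# ...but the symbol distribution is unchanged, so the compressed
# sizes end up comparable, as conjectured above.
print(len(zlib.compress(raw_utf8, 9)), len(zlib.compress(packed, 9)))
```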


Outreach: How much does your sentence say?

Four kids standing in front of a poster board. The poster board has blocks with word segments on them.

For the outreach event, we wanted to communicate a simplified subset of the ideas from our project. Specifically, we wanted to show, through an interactive activity, how information content varies across languages. The goal of our activity was to address two main points:

  1. Different languages take different amounts of space to express the same amount of meaning (e.g. Chinese words are usually 1-2 characters in length, while English words average around 6 letters).
  2. Some languages require speakers to express certain kinds of information that other languages don’t (e.g. the gender of croissant in French).

We came up with a few dozen morphemes in English, Chinese, and French and wrote them on rearrangeable blocks (cardstock with Velcro). We chose the morphemes to allow for different subject pronouns, verb tenses and conjugations, objects, adjective agreement, and even conversion of adjectives into adverbs. We selected English, Chinese, and French as our display languages due to the apparent variation in the written text, as well as grammatical and morphosemantic variation. Students were invited to create sentences in a language of their choice (usually English), and we helped them create parallel sentences in the other two languages.

 

The same sentence in English, Chinese, and French. Morphemes with the same meaning are connected with an arrow, e.g. {play, 玩, jou-} are approximately equivalent. We can see that even though this is the same sentence, its length varies across languages. For example, French uses many more characters than Chinese. Moreover, some languages express meaning that others omit. Observe the morphemes connected by the dotted line: Chinese has no morphemes to represent that 游戏 (game) is plural, but English and French use -s and -x respectively.

Just by counting the number of blocks in each sentence, the students could observe how languages differ in the number and type of morphemes, or “units of meaning.” The activity was a fun introduction to the intersection of information theory and linguistics. Students and parents alike were also excited to learn how to say the same sentence in two new languages!


An example sentence created by Tsachy!


We hope you found this post interesting and informative! Please direct any questions to Cindy Wang (ciwang at stanford dot edu) and Stephanie Chen (stephchen at stanford dot edu).
