Message in a Message: Deep Steganography for Natural Human Language

Information Theory (Winter 2020)

Josh Payne

Ste·ga·no·graph·y / stegəˈnägrəfi / (noun): the practice of concealing messages or information within other nonsecret text or data.

Steganography has been used for ages to communicate information in a hidden manner. You may think: how is this different from cryptography? Isn't cryptography a very well-studied field in which two individuals aim to share information with each other without an eavesdropper being able to discover this information? Indeed, the two areas are closely related, but steganography has a magical property that takes information sharing to a whole different level: the information is shared without an eavesdropper even knowing that anything secret is being shared at all. What's the use of, say, Shor's algorithm (for breaking RSA encryption in polynomial time on a quantum computer) if you don't even know what to decrypt?

Steganography has long been associated with painting and visual art. Painters often hide signatures, self-portraits, and other secret messages within their works as a sort of “inside joke”. One such example of this is Jackson Pollock’s “Mural”, wherein Pollock hid his entire name in plain sight in the curvatures of the work.

Until recently, however, computational steganography methods for images (such as appending bits at the end of a .jpg file or applying mathematical functions to select RGB pixel values) have been easy to detect and uncover, and hand-crafted methods are difficult to design and do not scale.

In 2017, Shumeet Baluja introduced the idea of using deep learning for image steganography in his paper "Hiding Images in Plain Sight: Deep Steganography". In this paper, a first neural network (the hiding network) takes in two images, a cover and a message. The aim of the hiding network is to create a third image, a container image, that is visually similar to the cover image and that a second neural network (the revealing network) can use to reconstruct the message image as a revealed image (without any knowledge of either the original message or the original cover). The loss is defined by how similar the cover and container images are and how similar the message and revealed images are. The results were astounding: the network was able to create container images that looked very much like the cover yet allowed the revealing network to reconstruct the message very closely.

Figure 1 from “Hiding Images in Plain Sight: Deep Steganography“. This figure outlines the model architecture.
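To make the setup concrete, here is a minimal PyTorch sketch of the two-network scheme and its combined loss (the layer sizes, image sizes, and the weight beta are illustrative choices of ours, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

# Hiding network: sees cover + secret (6 channels), emits a 3-channel container.
hide = nn.Sequential(conv_block(6, 64), conv_block(64, 64), nn.Conv2d(64, 3, 3, padding=1))
# Revealing network: sees only the container, tries to reproduce the secret.
reveal = nn.Sequential(conv_block(3, 64), conv_block(64, 64), nn.Conv2d(64, 3, 3, padding=1))

cover, secret = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)

container = hide(torch.cat([cover, secret], dim=1))
revealed = reveal(container)

beta = 0.75  # illustrative weight on the reconstruction term
loss = F.mse_loss(container, cover) + beta * F.mse_loss(revealed, secret)
loss.backward()  # both networks are trained jointly through this single loss
```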

While this result was very interesting, we felt that the utility of steganography specifically for images is limited.

The question arises: what are the limits for this approach in other information domains?

More specifically, we aimed to apply this approach to the domain of human language and text, which could be a stepping stone to implementing this procedure for general information. Text is harder to perform steganography with: images are dense while text is sparse, and images aren't affected much by small changes in pixel values while text is greatly affected by small changes in token values. While various methods for text-based steganography exist, they face substantial challenges: (1) classical heuristic-based approaches are often easy to decode because they rely on fixed, easily reversible rules, and (2) these algorithms do not exploit the structural properties of the text, resulting in hidden messages that are not semantically correct or coherent to humans.

Recent deep learning approaches rely on using generative models to hide the "secret" text in meaningless groupings of words. Here, we propose a transformer-based approach that addresses both problems at once. We explore using a transformer to combine the desired secret text with some human-readable, coherent cover text in order to generate a new container text that both properly encodes the hidden message and is nearly identical to the cover text, retaining the cover text's original semantic structure and legibility. In addition to the transformer used for encoding, we leverage a second transformer model to decode the container text and recover the hidden message.

Because transformers are big and bulky, we tested our luck with a much simpler 1D-convolution character-based approach.

From “1D versus 2D CNN” by Nils Ackermann, which is licensed under Creative Commons CC BY-ND 4.0

In this character-based approach, the idea is that a model would learn a statistical profile of character choices in a string of text and modify the characters in a way that sends a signal capturing the hidden message through character additions, substitutions, or removals.

Consider a trivial example in which the secret message is two bits. To communicate the secret message, our function is the length of the container message modulo 4. More specifically, let l represent the number of characters in the container message: l \equiv 0 \pmod{4} yields 00, l \equiv 1 \pmod{4} yields 01, l \equiv 2 \pmod{4} yields 10, and l \equiv 3 \pmod{4} yields 11. The model would accordingly add or remove characters from the cover to communicate the secret message. In practice, we would ideally have our model be more robust, encoding much more complex secret messages in the container texts. This approach has given some recognizable results on both steganographic and reconstructive metrics.

Cover: I can not believe my eyes, what I saw in the forest was far beyond the reaches of my imagination.

Secret: meet at river tomorrow at sunset.

Container: I mac now bleiave mye eey, waht ra sa inn tee freost ws fara beymdo tee racheas o fem imaingaiton.

Revealed Secret: wemt a th rivre tomowro tt snseht.
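To make the mod-4 scheme concrete, here is a minimal sketch of the encoding rule itself (a trained model would substitute or delete characters, as in the garbled container above, rather than pad with spaces):

```python
def encode(cover: str, secret_bits: str) -> str:
    """Adjust the cover until len(container) mod 4 encodes the two secret bits."""
    target = int(secret_bits, 2)  # '00' -> 0, '01' -> 1, '10' -> 2, '11' -> 3
    container = cover
    while len(container) % 4 != target:
        container += " "  # stand-in for a learned character edit
    return container

def decode(container: str) -> str:
    return format(len(container) % 4, "02b")

container = encode("I can not believe my eyes.", "10")
print(decode(container))  # -> '10'
```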

However, an interesting (likely unsolved and very difficult) information-theoretic question arises in this area: given an information domain (like images, text, etc.), how much secret information can a model hide in given cover information in the average case? We started to see that with larger secret message input sizes and a static cover message size, the model had an increasingly difficult time hiding the information and reconstructing the hidden message. How well it did at each depended on how we weighted the two tasks in the loss function.

Next, we decided to investigate a heftier model for performing steganography on text. The primary approach we propose for text-based steganography leverages two NMT (Neural Machine Translation) models: one transformer model to encode the hidden message and a second to decode it. We hypothesize that this transformer-based approach can succeed at encoding a secret text within a cover text to produce a container text that closely matches the semantic structure of the cover text. An additional benefit is that no custom dataset is needed: any collection of sentences or phrases, together with randomly generated secret messages, will do.

What does "similarity" between cover and container mean in this case? We no longer have a simple metric like edit distance or the L2 norm between pixel values. In our new scheme, the sentences "Don't eat, my friends!" and "Don't eat my friends!" mean very different things, whereas "Don't eat, my friends!" and "Please abstain from feeding yourselves, comrades of mine!" essentially mean the same thing. For a metric of similarity, we leverage BERT (Bidirectional Encoder Representations from Transformers), a pretrained model that represents sentences as real-valued vectors where the cosine similarity between two vectors is a good indication of how similar the sentences are in meaning.
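As a sketch of how such a similarity score can be computed (using the Hugging Face transformers library, with mean pooling as one simple way to get a sentence vector; the exact pooling used in a given project may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Mean-pooled BERT token embeddings as a simple sentence vector."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, num_tokens, 768)
    return hidden.mean(dim=1).squeeze(0)

a = embed("Don't eat, my friends!")
b = embed("Please abstain from feeding yourselves, comrades of mine!")
print(torch.cosine_similarity(a, b, dim=0).item())  # closer to 1 = closer in meaning
```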

The Message in a Message architecture.

The results presented in Neural Linguistic Steganography, the work most closely related to our own, indicate that state-of-the-art transformer-based language models such as GPT-2 can be leveraged to generate convincing cover texts to hide secret messages. In our implementation, our first NMT transformer model reads in the concatenated secret message (four digits in base 10) and cover text and translates them into a container text. Our second transformer reads in the container text and translates it into a reconstruction of the original secret message. The loss function we use is a linear combination of the similarity between the cover text and the container text (using BERT to produce Loss_Stego) and the edit distance between the reconstructed secret message and the original secret message. The loss function is formulated as

 \mathcal{L}(c, c', s, s') = \alpha\,\big(1 - \mathrm{sim}(\text{BERT}(c), \text{BERT}(c'))\big) + \beta\,\|s - s'\|

where c is the cover instance, c' is the container instance, s is the secret message, and s' is the reconstructed message; \mathrm{sim} is the cosine similarity and \|s - s'\| denotes the edit distance. \alpha and \beta are parameters we can set, or vary as a function of the epoch or of the rate of change of the loss. We define similarity between the stegotext (or container text) and the cover text, with respect to meaning, as the cosine similarity of the embeddings of the two sequences produced by a pretrained BERT base model.
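Here is a sketch of this loss as a plain evaluation metric (the helper name mima_loss and the default weights are ours; edit distance is not differentiable, so during training a token-level surrogate would stand in for the second term):

```python
import torch

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via a one-row dynamic program."""
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[-1]

def mima_loss(cover_vec, container_vec, secret, reconstructed, alpha=1.0, beta=0.1):
    # Stego term: cosine distance between BERT embeddings of cover and container.
    stego = 1 - torch.cosine_similarity(cover_vec, container_vec, dim=0).item()
    # Reconstruction term: edit distance between the secret and its reconstruction.
    return alpha * stego + beta * edit_distance(secret, reconstructed)
```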

We found that the model we used, an LSTM seq2seq model with attention and a hidden size of 512 for the encoder and decoder of both the hiding network and the revealing network, was not powerful enough to generate good containers and was faulty in reconstructing the secret message. The loss started at around 1.1 after about 100 examples and converged to around 0.85 after about 200,000 examples. We additionally hypothesize that a low loss here likely reflects an adversarial attack on BERT: the model finds sentences that are meaningless to humans yet whose embeddings have high cosine similarity with the cover text's embedding as evaluated by BERT. Below is one output example:

Cover: celebrate the opportunity you have to eat this.

Secret: 8 4 1 4.

Container: murdoch cleverness germane obeng blessing pedals ampoule mbi mbi jharkhand ampoule coring substantive substantive tranquil steadfast.

Revealed Secret: 8 4 4 4.

However, we also feel that with a sufficiently powerful model (such as BERT for the encoder and GPT-2 for the decoder) and enough compute power, a model could start to perform useful steganography on textual and general information. This technique could potentially revolutionize communication in the presence of adversarial eavesdroppers, especially in a quantum era of fast decryption of common cryptographic protocols. This implementation is left as future work for folks who have an abundance of compute resources.

Code for the seq2seq model is viewable in this Colab.

Note that while text is very sparse, the same technique could be used on dense audio files for spoken-language transmission and data hiding. The approach would resemble the convolutional approach proposed for images, applied to a 2D spectrogram or via 1D convolution over the waveform (e.g. WaveNet), and would have wide implications for covert communication.

Compression of Quantum Information

Information Theory (Winter 2020)

Quantum computers take advantage of quantum phenomena that do not exist in classical computers to speed up certain tasks. These phenomena include superposition, entanglement, and interference. The most common architecture for quantum computers is based on two-level systems. The information unit is called the qubit, and its two levels are usually denoted by the quantum states \left|0\right> and \left|1\right>. With N qubits, there are 2^N such basis states (e.g. \left|00101011\right> for N=8), and a general state is a vector with 2^N entries. Quantum gates in a quantum computer transform and interfere these states in a way that is hard to simulate on a classical computer.

When kept in an isolated environment, a quantum state stays coherent, and we call the state a pure state. Once the quantum state interacts with its environment, decoherence causes it to decay into other states with some uncertainty. In this case, the quantum state becomes a mixed state. A mixed quantum state cannot be described by a single pure state. Instead, it must be described by an ensemble of pure states \{\left|\psi_i\right>\} with a probability distribution \{p_i\}. This ensemble is usually represented as a density matrix \rho=\sum_i p_i\left|\psi_i\right>\left<\psi_i\right| with 2^N\times2^N entries.

From the perspective of error correction and quantum communication, the process of a quantum state interacting with its environment and becoming a mixed state is treated as a noisy channel. Studies have focused on finding the minimum output von Neumann entropy of noisy channels because this quantity is related to the maximum mutual information between the input and output of the channel. Once the maximum mutual information of a noisy channel is computed, we know how much error correction is needed to send a piece of quantum information without loss.

Nevertheless, a mixed state is not necessarily a pure state plus noise. For example, a thermalized quantum state must be described as a mixed state. We want to track all components of the density matrix instead of correcting it. Moreover, in the Noisy Intermediate-Scale Quantum (NISQ) era, quantum computers are used without error correction. To understand how noise affects quantum computers, it is important to track how a density matrix evolves through noisy channels. This motivates compressing a mixed state for more efficient characterization.

From the compression perspective, one should focus on the maximum output von Neumann entropy of a channel, because it is related to the minimum mutual information, which defines the optimal compression rate. In the following, we give two examples of noisy channels and study the growth of the maximum output entropy as a function of the number of qubits. Several assumptions are made. First, the noise level in the noisy channels is assumed to be small (\simeq0.1\%). Second, n-qubit noisy channels are assumed to be compositions of single-qubit noisy channels; in other words, the noise on each qubit is independent of the other qubits.

In the bit-flip channel defined by \rho\rightarrow(1-p)\rho+p\sigma_x\rho\sigma_x, where p is the noise level and \sigma_x is the Pauli X matrix, we find that the maximum output entropy grows approximately logarithmically. This means that, while it takes \mathcal{O}(2^N\times2^N) complex numbers to track a full density matrix, it is possible to track a compressed representation with only \mathcal{O}(N\times2^N) numbers. The compression rate is approximately \mathcal{O}(\frac{N}{2^N}). In the depolarizing channel defined by \rho\rightarrow(1-p)\rho+\frac p3\sigma_x\rho\sigma_x+\frac p3\sigma_y\rho\sigma_y+\frac p3\sigma_z\rho\sigma_z, where \sigma_y and \sigma_z are the Pauli Y and Z matrices, it is harder to find the exact maximum output entropy, but we find that the entropy is upper bounded by a quantity that grows approximately logarithmically. Thus, the compression rate is also approximately \mathcal{O}(\frac{N}{2^N}).
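As an illustration of how such output entropies can be computed numerically (a brute-force numpy sketch of ours, feasible only for small N; not the exact code behind these results):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])  # Pauli X

def bit_flip_channel(rho, p, n):
    """Apply an independent single-qubit bit-flip channel to each of n qubits."""
    for k in range(n):
        Xk = np.array([[1.0]])
        for j in range(n):
            Xk = np.kron(Xk, X if j == k else np.eye(2))
        rho = (1 - p) * rho + p * Xk @ rho @ Xk
    return rho

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

# Output entropy for a random pure input state at noise level p = 0.1%.
rng = np.random.default_rng(0)
for n in range(1, 7):
    psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
    psi /= np.linalg.norm(psi)
    rho = np.outer(psi, psi.conj())
    print(n, von_neumann_entropy(bit_flip_channel(rho, 1e-3, n)))
```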

In summary, compared to the computational resources used to simulate a full density matrix, only a small fraction is needed when the noise is small. This is because the entropy grows much more slowly than the Hilbert space: while the dimensionality of the Hilbert space grows like 2^N, the effective dimensionality of the density matrix (its effective rank) grows approximately polynomially. This opens the possibility of compressing and efficiently simulating a density matrix with small noise.

A Picture Says a Thousand Words

Information Theory (Winter 2020)

Elizabeth Chen, Jingxiao Liu, Kailas Vodrahalli

Introduction

Let’s talk about pictograms: what are they and how well do they convey information?

Pictograms are symbols that transmit information through visual forms [1]. You might have seen, on the containers of certain chemicals, an icon of a flame. The icon indicates that the chemical is flammable, and it is an example of a pictogram. Several writing systems, such as Egyptian hieroglyphs and Chinese, use pictograms as characters to convey ideas. In this study, we aim to probe how well pictograms transmit information. More specifically, we will be looking at pictograms from oracle bone scripts, which are ancient Chinese writings inscribed on animal shells and bones [2], as well as pictograms from modern Chinese characters, and comparing the two on their ability to convey meaning.

Figure 1. Oracle bone script character for “mountain”.

Consider the oracle bone script character for “mountain” shown in Figure 1, and then the modern Chinese character for “mountain” in Figure 2. How well can an individual with no prior knowledge of Chinese associate these characters to actual mountains? Which represents a mountain better: the oracle bone script or modern character?

Figure 2. Modern Chinese character for "mountain".

Our team tried to measure the “information distance” between the pictographic characters and the photographs of the objects they represent. We wanted to know which form of writing, the oracle bone scripts or modern Chinese, was the more effective medium for transmitting visual information.

Methods

We measured information distance using two methods:

Method 1: We asked people to match either oracle bone script characters or modern Chinese characters to photographs of the objects they represent. We hope that studying how accurately people match characters to images will give us insight into whether oracle bone scripts or modern Chinese characters more effectively represent visual forms.

Method 2: We trained a neural network to classify photographs of objects. In other words, after training, the neural network can take in, for instance, a photograph of a dog and output "dog". We then give the neural network oracle bone script characters and modern Chinese characters and study what it outputs. In this case, the network's performance on the pictographic characters serves as an indicator of how well each type of pictographic character transmits the information found in the visual presentation of the object it represents.

Results

Method 1 Results

We used Amazon MTurk to obtain input from several hundred people. Each person was shown (1) a single grayscale image of 1 of 10 objects (bird, field, fish, moon, mountain, sheep, sun, tree, turtle, or water) along with (2) ten images of the ancient or modern Chinese character corresponding to these 10 objects. The MTurk worker was asked to “select the character image that most closely resembles [the natural image].” We additionally asked a survey question to determine whether the worker understood Chinese and filtered out these responses during analysis.

You can access the matching survey at the following links:

  • Image-to-Ancient Character Survey: bit.ly/info-theory-pictogram1
  • Image-to-Modern Character Survey: bit.ly/info-theory-pictogram2
  • Image-to-Ancient Character Solution: bit.ly/info-theory-pictogram-answers1
  • Image-to-Modern Character Solution: bit.ly/info-theory-pictogram-answers2

Here, we are essentially asking the workers to select the "best encoding" for a given natural image. We could have done the reverse and shown them a character, having them select the "best" natural image match, with one image sampled randomly from each class (this would be something like decoding). We chose our method so that we could get responses per natural image (in the "decoding" method, we would get results relative to 10-tuples of natural images; note that the character images are fixed, so this is not an issue in our "encoding" method).

In total, we obtained 2,000 responses (2 character sets x 10 classes x 10 images per class x 10 responses per image/character set pair). After filtering out responses from workers who understood written Chinese, we had 1,816 responses remaining. All further analysis is done after this filtering step.

Our analysis is shown below. In Figure 3, we plot the observed distribution of selected symbols (solid blue) vs the true distribution (dotted black line) over both ancient and modern character sets. We note that some characters are not selected often (field) while others compensate for this (fish).

Figure 3. Frequency of workers' classifications by class (blue bars) and true number of natural images by class (black dotted lines). Note the true numbers are near but not exactly 0.1; this is due to our post-collection filtering of responses from individuals who can read Chinese characters.

In Figure 4, we plot the marginal distribution of selected symbols over the ancient character set and modern character set. As one might expect, different characters are weighted differently between modern and ancient character sets. For example, we turn to fish in the modern set and water in the ancient. In both of these cases, the observed frequency is higher than the true frequency of the class, suggesting that fish (modern) and water (ancient) are characters “close” to (or, in other words, often mistaken as) other classes of natural images.

Figure 4. Frequency of workers' classifications by class (blue bars), marginal distribution of workers' classifications over the ancient character set (orange), marginal distribution over the modern character set (green), and true number of natural images by class (black dotted lines). Note the blue bars and black lines are the same as in Figure 3.

We now calculate the by-class accuracy of both character sets and plot the results in Figure 5. We first compute the maximum accuracy possible given the distributions from Figure 4 (i.e. 1-\sum_{c\in\text{classes}}|\text{observed}_{c}-\text{expected}_{c}|), and obtain 81.6% for the ancient set and 82.5% for the modern one.

Figure 5. Accuracy by class. Blue bars and orange bars are the by-class accuracies for ancient and modern character sets, respectively. Black dotted line is the maximum accuracy possible (1.0).

The true accuracy is 65.7% for the ancient characters and 18.5% for the modern ones. For every class, the ancient character has higher accuracy than the modern character. For classes such as "bird", the accuracy difference between the two character sets is large, suggesting that the ancient bird character conveys more information in its shape than the modern version does. In contrast, the "field" class has similar accuracy between modern and ancient, suggesting a similar amount of information is conveyed by both characters. Both observations can be confirmed visually in Table 1, where the ancient character for "bird" looks like an actual bird and the modern and ancient characters for "field" are rather similar in appearance.

Method 2 Results

We trained a convolutional neural network (CNN) to learn to classify images of mountains, moons, trees, fields, and suns. By the end of the training, the neural network is able to take in a photo similar to the ones shown in Figure 6 and correctly identify it as a “mountain”, “moon”, “tree”, “field”, or “sun” based on the contents of the input image with more than 90 percent accuracy.

Figure 6. Sample input images of mountains, moons, trees, fields, and suns to give to the CNN.

We treated the trained CNN as a proxy for an individual that knows what the 5 classes (mountain, moon, tree, field, and sun) look like in real life, but has no knowledge of how they are written as Chinese characters. We then showed both the oracle bone script representations and the modern Chinese characters of the 5 classes to the trained neural network. The performance of the CNN is summarized in Tables 2 and 3.
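A minimal sketch of this setup (the layer sizes, the 64×64 grayscale input, and the probing step below are our assumptions, not the exact network used):

```python
import torch
import torch.nn as nn

classes = ["mountain", "moon", "tree", "field", "sun"]

# Small CNN classifier for 64x64 grayscale photographs of the 5 classes.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 16 * 16, len(classes)),
)

# ... after training on photographs with cross-entropy loss, probe the network
# with a rendered character image instead of a photograph:
character_image = torch.rand(1, 1, 64, 64)  # stand-in for a character rendering
with torch.no_grad():
    pred = cnn(character_image).argmax(dim=1).item()
print(classes[pred])
```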

We see that the CNN guessed 2 out of 5 correctly when shown the oracle bone scripts, and likewise 2 out of 5 when shown the modern Chinese characters. This suggests that, when measured by the classification accuracy of a trained CNN, the information distance between oracle bone scripts and actual images and that between modern Chinese characters and images are roughly similar.

Conclusion

In this project, we tried to measure how well Chinese characters from two different time periods conveyed meaning by putting out surveys asking people to match natural images to their corresponding character representations and by studying whether a trained image-classifying CNN is able to correctly identify these characters.

From the survey results, we saw that people had an easier time associating natural images to ancient characters than they did associating natural images to modern characters. The trained image-classifying CNN performed similarly for both the oracle bone scripts and modern Chinese characters.

What are your thoughts? Which one spoke to you more: the oracle bone scripts or the modern Chinese characters?

Compressed Sensing and Neural Network for Quantum State Tomography

Information Theory (Winter 2020)

By: Beicheng Lou and Evan Laksono

Hey, this blog is about how to apply compressed sensing to (quantum) state tomography. We’ll start with an extremely friendly introduction:

Motivating game 1: Blank filling

Consider a blank filling game:

  1. he_lo
  2. d_g
  3. s_pe_cal_fr_g_li_t_cex_i_lid_cio_s

You might fill in the first one (hello) and the last one (supercalifragilisticexpialidocious) with confidence, while the middle one may be a bit ambiguous (dog? dig? dug?).

Before playing this game, one might think that the more blanks we have, the more possibilities there are and, therefore, the less confident we can be. However, this demonstration shows the opposite: the more blanks we have (with the total length correspondingly longer), the more constrained the problem is.

In some sense, this is the essence of compressed sensing: we have high-dimensional data with low-dimensional structure, so a few observations are enough to tell what the data is. In the third example, we have a 34-letter word (high-dimensional data), but not every letter is arbitrary: the word must be one of the very few valid 34-letter words (low-dimensional structure). As a result, knowing a few letters of the word (a few observations) is sufficient to tell what the word is.

Motivating game 2: Pokemon identification

In some sense, humanity has known tomography since antiquity. Haven't we all been intrigued by the shadows cast by sunshine, pondering what they represent? At least, in Plato's allegory of the cave, common men are chained in a cave, watching the shadows of objects projected on a blank wall by the fire behind them.

We'd better not dwell too long on this grim picture of humanity as Plato saw it. Instead, we can take a lighter approach by contemplating Pokemon silhouettes, which I would call "Pokemo-graphy". I hope this sounds cute enough, but I know no better way to pay homage to the cuteness of Pokemon. Can you help me figure out the following Pokemon? I bet you should not call yourself a Pokemon fan if you cannot answer perfectly without the aid of Professor Google.

Pokemon_shadow

Let us pick the easiest example: Pikachu. Did you learn anything from the exercise? Can you decipher the thought process that led you to conclude that it is indeed Pikachu? Firstly, I think telling you in advance that it is a Pokemon helps narrow down your wild imagination as you search for an answer. There are also some unmistakable marks of Pikachu in the silhouette, such as its thunder-like tail or its chubby cheeks. We are pretty sure you started to imagine where all the body parts are as soon as you saw its shadow.

At this point, you have probably gleaned some intuition about what tomography is. We wish to reconstruct what an object looks like by taking its cross-sections. The idea has been widely explored in many areas, most notably medical imaging. Interestingly, a similar procedure is essential in the context of quantum computation. In the strange quantum world, we can only ever see a shadow of a quantum object, so we need to be clever if we want to know what it actually is.

Quantum State Tomography (QST)

We are now getting into more serious stuff. We want to reconstruct a certain quantum state, both accurately and efficiently. Without getting too deep into the physics, the QST problem can be formulated as follows: we have a set of possible measurements that can be applied to a quantum state, and from the measurement results, we must infer what the state is.

There are a number of viable ways to perform this task, but in this work we explore the use of compressed sensing (CS) and neural networks (NN) for QST.

Compressed Sensing-QST (CS-QST)

“Motivating Game 1” gives some flavor of compressed sensing in a handwavy fashion. Here we present it more concretely:

In the classical problem of signal reconstruction, if we know a signal is sparse (i.e. it has very few nonzero entries) in the frequency domain while our measurements are just a few samples from the time domain, we can reconstruct the signal by solving

\min_{x} \|x\|_1 \quad \text{subject to} \quad y = Ax,

where y is the measurement we have in the time domain, x is the signal in the frequency domain, and A is the Fourier transform matrix. The L_1 norm is used in place of the L_0 "norm" (which directly counts nonzero entries) because it still encourages sparsity while keeping the problem convex and numerically tractable.

This procedure is captured in a nice animation copied from Wikipedia:

Orthogonal_Matching_Pursuit

On the left is the signal in e.g. time domain which we want to reconstruct and are able to sample from. The orange line is our reconstruction and the black dots are the sampled data. On the right is the signal’s sparse representation in e.g. frequency domain.

Of course, the trick of compressed sensing is not guaranteed to work for every reconstruction problem with low-dimensional structure: the reconstruction matrix ('A' in our example) has to satisfy the 'Restricted Isometry Property' (RIP). An example beyond signal reconstruction is image reconstruction, where we have the prior knowledge that images are typically sparse in the wavelet domain. Luckily, RIP is satisfied by random matrices with high probability, making compressed sensing a widely applicable technique.
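The animation above uses orthogonal matching pursuit (OMP), a greedy alternative to solving the L_1 program directly. Here is a minimal numpy sketch of OMP on a random Gaussian measurement matrix (our own illustration, not code from this project):

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily recover a k-sparse x with y = A x."""
    residual, support, coef = y.copy(), [], None
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        # Least-squares fit on the chosen support, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

# Recover a 5-sparse signal of length 256 from 60 random measurements.
rng = np.random.default_rng(1)
n, m, k = 256, 60, 5
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)  # random Gaussian A satisfies RIP w.h.p.
x_hat = omp(A, A @ x_true, k)
print(np.linalg.norm(x_hat - x_true))  # ~0 on success
```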

Moving from classical signal reconstruction to quantum state tomography, the recipe is still quite simple, but instead of a vector we want to reconstruct a matrix, and instead of knowing the signal is sparse, we know, e.g., that the matrix is low-rank. Physically, if a quantum state is described by a density matrix of low rank, then it is almost a pure state (which usually means we have a good quantum resource). Mathematically, what we are doing here is effectively matrix completion; "quantum state tomography" just gives it physical meaning and constraints.

Consider m arbitrary measurement operators (yes, we have relaxed the typical constraint in the literature that each measurement operator be a composition of spin projections) A_1, A_2, \ldots, A_m and a quantum state represented by a density matrix \rho of low rank r. From the m measurement outcomes y_i = \mathrm{Tr}(A_i\rho), we can reconstruct the state (target matrix) by a convex optimization analogous to the L_1 program above, with the nuclear norm \|\cdot\|_* standing in for the L_1 norm:

\min_{\rho'} \|\rho'\|_* \quad \text{subject to} \quad \mathrm{Tr}(A_i\rho') = y_i, \; i = 1, \ldots, m.

Using a target matrix of dimension d = 10 (\sim3 qubits), with varying ranks (1\leq r \leq 10), we show in Fig. 1 the minimum error one can achieve with m random measurements. As a sanity check, we first note the obvious fact that the minimum error decreases with the number of measurements. Moreover, when the target matrix is not full-rank, the error can be driven to 0 with fewer than 100 measurements (100 being the total number of degrees of freedom of a 10\times 10 matrix).

trend-error-vs-m-for-r
Fig. 1 Reconstruction error vs number of measurements, for different target  matrix ranks

Interestingly, we also observed that when the number of measurements is below the minimum requirement, a lower-rank matrix does not lead to a better reconstruction. This feature, visible in Fig. 1, was not reported in Ref. [1], and it is intuitive in retrospect: with very few measurements, low rank makes it more likely that the measurement operators act on bases in which the low-rank structure of the density matrix is not manifest, reducing the quality of the information obtained from those few measurements. On the other hand, a matrix of lower rank contains less information, so it takes fewer measurements to recover it perfectly. Together these explain the crossover observed in Fig. 1.

Next, we look at the trend of the number of measurements M required for perfect reconstruction as a function of rank (which should be linear according to the CS literature), as shown in Fig. 2.

trend-r
Fig. 2 Required number of measurements for different target matrix ranks, fitted by different functions

Here we notice that the trend is very well approximated by a saturating exponential A(1-e^{-r/B}), which is indeed linear in r in the low-rank limit and thus consistent with the literature.

We repeat the aforementioned procedure (first plotting accuracy against the number of measurements and then extracting the threshold for perfect reconstruction), but on target matrices with fixed rank (r=2) and various dimensions d. The trend of the number of required measurements with target dimension is shown in Fig. 3.

trend-d
Fig. 3 Required number of measurements for different target matrix dimensions, fitted by different functions

Here we find that M\propto d\log{d}, as predicted for classical CS [2] and different from the bounds for the quantum case [1]. This is the effect of our relaxed constraint on the A_i: more freedom in choosing measurement operators means more random measurement matrices, and hence better scaling of the number of measurements required for accurate reconstruction.

Neural Network-QST (NN-QST)

We also explored how a neural network can be used to perform quantum state tomography. A central question here is how a neural network can encode a given quantum state. We try to reproduce the state-of-the-art method of representing quantum wavefunctions or density matrices with Restricted Boltzmann Machines (RBMs) [3]; our modification is that we apply the method to our own simulated measurements on an N-qubit system. The basic idea is to use two separate RBMs to encode the probability amplitude (C) and the phase (\phi), respectively, in a certain basis set \bm{\sigma} = \{\sigma_{1}, ..., \sigma_{N}\}, where \sigma_{i} can stand for any Pauli basis defined according to the measurement axis of qubit i. With index \kappa \in \{C,\phi\}, the Boltzmann weight for a particular configuration is given by:

p_{\kappa}(\bm{\sigma},\bm{h}) = \exp\bigg[\sum_{ij}W^{\kappa}_{ij}h_{i}\sigma_{j} + \sum_{j} b^{\kappa}_{j}\sigma_{j} +\sum_{i} c^{\kappa}_{i}h_{i}\bigg],

where \bm{h} = \{h_{1}, ..., h_{M}\} denotes the states of the hidden nodes, with h_{i}\in\{-1,1\}. We then obtain the distribution over \bm{\sigma}, i.e. p_{\kappa}(\bm{\sigma}), by marginalizing over \bm{h}. The resulting wavefunction is given by:

\psi(\bm{\sigma}) = C(\bm{\sigma})\exp(i\phi(\bm{\sigma})),
C(\bm{\sigma}) = \sqrt{\frac{p_{C}(\bm{\sigma})}{\sum_{\bm{\sigma}}p_{C}(\bm{\sigma})}}, \hspace{10pt}\phi(\bm{\sigma}) = \frac{1}{2}\log(p_{\phi}(\bm{\sigma})).
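These formulas can be sketched directly in numpy, with the hidden nodes summed out analytically (random untrained weights here; training is discussed below):

```python
import numpy as np

def rbm_marginal(sigma, W, b, c):
    """Unnormalized p(sigma): marginalizing h_i in {-1, 1} analytically gives
    exp(b . sigma) * prod_i 2 cosh(W_i . sigma + c_i)."""
    return np.exp(b @ sigma) * np.prod(2 * np.cosh(W @ sigma + c))

def rbm_wavefunction(sigmas, amp_params, phase_params):
    """psi(sigma) = C(sigma) exp(i phi(sigma)) from two RBMs (amplitude, phase)."""
    p_C = np.array([rbm_marginal(s, *amp_params) for s in sigmas])
    C = np.sqrt(p_C / p_C.sum())  # normalized amplitudes
    phi = np.array([0.5 * np.log(rbm_marginal(s, *phase_params)) for s in sigmas])
    return C * np.exp(1j * phi)

# All 2^N spin configurations for N = 3 qubits, M = 4 hidden nodes per RBM.
N, M = 3, 4
sigmas = np.array([[1 if (i >> j) & 1 else -1 for j in range(N)] for i in range(2**N)])
rng = np.random.default_rng(0)
params = lambda: (0.1 * rng.normal(size=(M, N)), 0.1 * rng.normal(size=N), 0.1 * rng.normal(size=M))
psi = rbm_wavefunction(sigmas, params(), params())
print(np.sum(np.abs(psi)**2))  # Born probabilities sum to 1
```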

In the density matrix formulation, the scheme in Ref. [3] only works for rank-1 matrices, i.e. pure states. In our current implementation, we extend it to capture the diagonal terms of arbitrary density matrices based on simulated measurements on an N-qubit system. The RBM is therefore trained to minimize the Kullback-Leibler (KL) divergence in the measurement basis:

\text{KL}_{C} = \sum_{\bm{\sigma}}P(\bm{\sigma})\log\frac{P(\bm{\sigma})}{\rho(\bm{\sigma},\bm{\sigma})}

where P(\bm{\sigma}) denotes the measurement statistics in the \bm{\sigma} basis, i.e. the probability of obtaining state \bm{\sigma} upon measurement, and \rho(\bm{\sigma},\bm{\sigma}) is the diagonal element of the density matrix in the \bm{\sigma} basis. Our results for a 3-qubit system are shown in Fig. 4.

Prob_Density_3Q

Fig. 4: The original and reconstructed probability density of a 3-qubit system under 10,000 measurements using an RBM.

What’s more

The next step is to extend the scheme to general low-rank matrices, which can in principle be implemented using auxiliary nodes that are later integrated out to produce mixed states [4]. Reconstruction of the whole density matrix can in principle be done by choosing a set of measurement bases and letting the RBM learn the most faithful representation by minimizing the KL divergence over all measurement bases.

You may wonder in what sense this NN-based QST can be compressed. One obvious way is to use regularization. The enormous body of literature on regularization in machine learning allows for a much more general framework for compressed QST, in which the prior knowledge is not limited to sparsity of the parameters. One can imagine a generative model producing the parameters of the neural network, encoding many different types of prior knowledge, of which sparsity is just one. For example, if our lab has some systematic problem that biases the quantum states we can produce, that too can be captured by such a generative model. But that is a different story and will take up another blog post.

Look forward to that!

References

[1] S. T. Flammia, D. Gross, Y.-K. Liu, and J. Eisert. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, Sep 2012.

[2] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, Feb 2006.

[3] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo. Neural-network quantum state tomography. Nature Physics, 14:447–450, May 2018.

[4] G. Torlai and R. G. Melko. Latent space purification via neural density operators. Physical Review Letters, 120:240503, Jun 2018.

Huffman Encoding: Wheel of Fortune Style

Information Theory (Winter 2020)

by Katherine Kowalski, Calvin Lin and Yasmeen Jassim

Introduction & Motivation

From sending text messages to searching for recipes on the internet, millions of devices connect and communicate information every minute. But how is this information communicated? This quarter, we set out to teach Ravenswood Middle School students how information is communicated through encoding and decoding. With our interactive game, "Huffman's Wheel of Encoding", we hoped to convey the basics of encoding and decoding through our own version of Wheel of Fortune.

Huffman’s Wheel of Fortune

Prior to making our game, we thought about the most important concepts we wanted to teach the students. We identified several outcomes that would allow them to engage with encoding/decoding in an effective way and help them understand that communication processes aren’t always smooth. Slots on the wheel each represent different outcomes which manipulate the player’s bitstream.

We created a multi-player program that allows players to input their letter (in the form of a codeword) as well as the number of points their guess is worth. Our program keeps track of each player's total points and declares a winner at the end of the game. We also made an "Encoding Wheel" modeled after Wheel of Fortune and constructed a Huffman encoding for the English alphabet based on the probability of each letter appearing in a word (included in the cheat sheet, figure 1.1).
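For readers who want to see how such a code is built, here is a minimal sketch using a priority queue (the letter probabilities below are toy values, not the cheat-sheet table from figure 1.1):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: probability} map."""
    # Heap entries: (probability, tiebreaker, {symbol: codeword so far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # merge the two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

code = huffman_code({"e": 0.13, "t": 0.09, "a": 0.08, "o": 0.075, "z": 0.001})
print(code)  # more probable letters receive shorter codewords
```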

How To Play

As in the game show, a category for the phrase or word is given. Until a player wants to guess the word, the entire game is communicated in bits via a codebook for the alphabet. Each contestant is given a decoder sheet (figure 1.1), and they communicate which letter they are guessing by looking up its codeword and entering it into our program. Before each contestant makes a guess, they must spin the wheel (figure 1.2). As mentioned, the wheel has various slots that represent positive or negative consequences. Depending on the consequence they receive, they must guess a letter with that consequence in mind.

Figure 1.2: Wheel of Encoding

Possible Outcomes

Negative Consequences manipulate your bitstream (input you give the program):

Bit flip: One bit in a symbol sequence has an X% chance of being flipped; which bit is flipped is chosen at random.

Truncate 2: The last 2 bits of every subsequent letter are truncated, i.e., the last 2 bits are lost. This can hurt especially badly if the codes are prefix codes.

Bankruptcy: Lose all points and don’t get to guess a letter.

Lose Turn: Player loses their turn.

Positive Consequences take the form of bonus points or a chance card:

Bonus Points (+500, +200, etc.): Point values that you receive if you correctly guess the letter.

Increased Bit Rate (bonus card): The player can guess 2 consonants instead of 1 without penalty (increasing the number of bits sent in a row).

Bit Repetition: If a given codeword has suffered a negative consequence, the player can use this card to send the bit stream again with each bit repeated x times (see the sketch after this list).

Error Correcting Code: If the transmitted code is incorrect, the student can tell us which bit to correct.
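As a sketch of why bit repetition helps against bit flips (majority-vote decoding, shown here with x = 3 repeats):

```python
def repeat_encode(bits: str, x: int = 3) -> str:
    """Send each bit x times, e.g. '01' -> '000111' for x = 3."""
    return "".join(b * x for b in bits)

def repeat_decode(stream: str, x: int = 3) -> str:
    """Majority vote over each group of x received bits."""
    groups = [stream[i:i + x] for i in range(0, len(stream), x)]
    return "".join("1" if g.count("1") > x // 2 else "0" for g in groups)

received = "010" + "111"        # '0' then '1' were sent; one bit flipped in transit
print(repeat_decode(received))  # -> '01': the single flip is corrected
```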

Figure 1.3: Playing the game!

Overall, we tried to combine an important topic in Information Theory with a popular game show we all know and love. We hope the students at Ravenswood get a chance to play our game in the future!

Want to play the game yourself? Grab a friend, make your own wheel (if you’re ambitious enough) and check out our program here! Check out our video for a more in depth explanation of key concepts and a demonstration of how to play the game!