Over the course of the quarter, students in EE276 have prepared for an outreach event at Ravenswood Middle School. The class is teaching a wide range of topics related to information theory: some teams are mapping political landscapes, others delve into the theory of code breaking, and there are groups demonstrating applications to military aviation and others exploring information theory through Fortnite.
Given the unfortunate COVID-19 circumstances, the class has undertaken a virtual outreach activity this year. Each group recorded a short, engaging video that teaches an information-related topic in a fun and interactive way! The full list of videos can be watched through the following YouTube playlist.
The EE276 students also contribute a series of amazing projects in information theory. The full list of projects can be found here with individual links below:
Introduction to Ravenswood Outreach (Ryan C Stocking): video
An Order-Optimal Edit-Correcting Code (Daniel Tan): blog
Audio Lossy Compression Game (Timothy Chong): video
Bioinformatics (Christian A Choe and Raphael R Eguchi): video
Brain Computer Interfaces (Iliana Bray and Persiana S Saffari): video
Career Day (Liz Leckie, Jonathan J Mak, and Kevin H Tang): video, blog
Compressed Sensing and Quantum Tomography (Beicheng Lou and Evan Laksono): video, blog
Entropy of quantum circuits: What makes a quantum algorithm difficult to simulate (Grace Johnson): blog
Error Correcting Codes (Alexandre M Bucquet, Colleen Dai, Diwakar Ganesan, and Madison N Hurr): video, blog
Federated Learning (Saelig A Khattar, Andrew Wang): video
Forecasting as a Tool to Effectively Communicate Pressing Information: A Look Into Meat Consumption, Climate Change, and Alternative Food Networks (Rakan Albarghouty and Kyle Walker): blog
How the brain can help us understand information theory (Praveen N Pallegar and Maxim Serebriakov): video
Huffman Encoding (Katherine Kowalski, Calvin Lin, and Yasmeen Jassim): video, blog
Image To Image Translation (Yue Li, Yawen Sun, Yahan Yang): video
Information Manipulation (John Asigbekye and Meghaha V Rao): video, blog
Information of Popular Communication Methods (Alissa S Ling and Arun Seetharaman): video, blog
Information Theoretic Motivation For GANs (Andrew A Miller Smith): video, blog
Information theory and alphabet (Trisha Jani): video
Information Theory In Music (Paavani Dua and Jonas Messner): video
Informational entropy of standardized testing (Ellie R Bowen, Nathan A Fotedar): video
Learned Image Compression (Helena Huang, Boxiao Pan, and Yanpei Tian): video, blog
LED Displays (Natalie K Gable and Sarah R Woodard): video, blog
Lossless and lossy compression of material decomposition images from dual energy CT (Yirong Yang): video, blog
The Evolution Of Music Storage (Casey P Butcher, Tania Dhaliwal, and Michelle V Ly): video, blog
Thanks to all the hardworking students of EE 276 (especially Ryan Stocking for his huge organizational effort), their dedicated project mentors (members of the Weissman group at Stanford), and the patient, enthusiastic students of Ravenswood Middle School for making all of this possible.
We hope everyone is getting excited for a great afternoon!
Proteins are the primary catalytic units of
biology. They are composed of up to twenty different amino acids strung together
into sequences that are usually a few hundred amino acids in length. Each of the
twenty amino acids has different chemical properties, such as different sizes, affinities
for water, and electrical charges, and the interaction of these amino acids with
their environment and with each other gives proteins their distinct
three-dimensional shapes. This structure, in turn, is critically important in
defining the protein’s catalytic properties. While a particular protein might have
slightly different amino acid sequences in different organisms, with
accompanying variations in folding structure, only a small subset of sequences result
in the protein’s proper functioning. By studying a protein’s variations in
amino acid sequences, we can gain valuable insight into how the protein works.
Here, we will borrow fundamental concepts from
information theory and apply them to the study of molecular evolution. In
particular, we will study amino acid sequence variation in the protein
nitrogenase reductase (NifH), which is an ancient protein found in Bacteria and
Archaea and is critically important for sustaining life on Earth. NifH helps convert
nitrogen gas – which most organisms cannot use to grow – into a form of
nitrogen that all life on Earth can use to grow. Without the action of this protein
and the handful of other proteins with which it works, there would not be enough
usable nitrogen to sustain all the living organisms on Earth. The NifH protein
works by splitting apart molecules of ATP (the most common energetic currency for
life on Earth) and using the energy that is released to transfer an electron to
another protein. That electron is then passed to a molecule of nitrogen gas, which,
after the accumulation of six electrons, is converted into ammonia. The
electrons are transferred using a small cluster of iron and sulfur atoms that
is embedded inside the NifH protein. Since the amino acid sequences that are
involved with splitting the ATP and holding the iron-sulfur cluster are absolutely
essential for proper NifH functioning, we would expect little sequence variation
involving these amino acids. We’ll explore this hypothesis using an information
theoretic framework.
Methods
First, I downloaded all the publicly available amino acid sequences of NifH from GenBank (n = 8876 sequences), which is a database organized by the National Center for Biotechnology Information. Then, I constructed an amino acid sequence alignment, which is the placement of the sequences into a matrix, with rows corresponding to each linear sequence and columns corresponding to the equivalent positions in each sequence. The matrix columns capture the amount of amino acid variation that exists at a particular position in the protein. From this alignment, I created a phylogenetic tree to estimate which sequences were the most different based on a model of evolution. I picked the top 750 sequences that were the most spread out on the tree to analyze in this study. Then, I estimated the entropy of each column in the alignment and the mutual information between columns using the Jiao-Venkat-Han-Weissman estimator [1]. I then applied the average product correction described in Dunn et al., 2008 to the mutual information estimates to account for phylogenetic signal within the sequences [2].
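As an illustration of the column-entropy step, here is a minimal Python sketch. It uses the simple plug-in (empirical-frequency) entropy estimate rather than the Jiao-Venkat-Han-Weissman estimator used in the study, and the toy alignment and gap-handling convention are my own assumptions:

```python
import numpy as np
from collections import Counter

def column_entropy(column, base=2):
    """Plug-in entropy of one alignment column.

    `column` holds single-letter amino acid codes; gaps ('-') are
    ignored here, which is one of several reasonable conventions.
    """
    counts = Counter(c for c in column if c != '-')
    n = sum(counts.values())
    probs = np.array([v / n for v in counts.values()])
    return float(-np.sum(probs * np.log(probs)) / np.log(base))

# Toy "alignment" of five sequences, three positions long.
alignment = ["CYS", "CYS", "CHS", "CYT", "CYS"]
matrix = [list(seq) for seq in alignment]        # rows = sequences
for j, col in enumerate(zip(*matrix)):           # columns = positions
    print(f"position {j}: H = {column_entropy(col):.3f} bits")
```

The plug-in estimate is biased downward when the number of sequences is small relative to the alphabet size, which is exactly the regime the minimax estimator [1] is designed to handle.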
Results & Discussion
To study sequence variation, we can calculate the entropy of each column in the sequence alignment. We can then compare the position’s entropy to the entropy of a theoretical position in which any amino acid exists with equal probability. When we do this for NifH, we see little sequence variation in the positions that are involved with ATP splitting (called the “Walker A”, “Switch I”, and “Switch II” sequence motifs) and iron-sulfur cluster binding (Figure 1A). This is exactly what we expected! In Figure 1, the points closer to the dashed line have less sequence variation than points farther below the dashed line. Interestingly, the Walker B motif, which is also involved in ATP-splitting, has much more sequence variation than the other motifs (Figure 1A). Also, we see that the amino acids that compose the alpha helices and beta strands – 3D structures that help give NifH its shape – can have, at times, quite high sequence variation (Figure 1B). We’ll return to this later.
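To make the comparison to a theoretical uniform position concrete, the natural quantity (my reconstruction from the description above, not copied from the original figures) is the relative entropy between the observed column distribution $p$ and the uniform distribution $u$ over the twenty amino acids:

$$D(p \,\|\, u) = \sum_{a} p(a) \log_2 \frac{p(a)}{1/20} = \log_2 20 - H(p) \approx 4.32 - H(p) \ \text{bits}.$$

A perfectly conserved position ($H(p) = 0$) attains the maximum of about 4.32 bits, while a position where every amino acid is equally likely scores 0, consistent with points near the dashed line in Figure 1 being the most conserved.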
When we map the relative entropy estimates onto the 3D protein structure, we get a strong visual understanding of the relationship between sequence variation and protein function. The protein structure shown in Figure 2 is the structure of NifH during ATP splitting, where ADP-AlF4– serves as an analog of ATP [3]. First, we see very little sequence variation around the iron-sulfur cluster (Figure 2A). We also see very little sequence variation around the area that interacts with ATP (dark blue stripes in Figure 2B; ADP-AlF4– positions shown in Figure 2F). In addition, we see that the surface of the protein – that is, the area that is exposed to the chemical environment within the cell – has a relatively high amount of sequence variation.
To investigate these relationships further, we
can plot the relative entropy as a function of distance to the iron-sulfur
cluster and ADP-AlF4–, as well as how much each amino acid
is exposed to the outside. We see that amino acids separated from the iron-sulfur
cluster by <5 angstroms (five one-hundred-millionths of a centimeter; a
distance that allows the formation of hydrogen and ionic bonds) have very
little sequence variation (Figure 3A). This low sequence variation extends to
~10 Å, suggesting that not only is direct interaction with the iron-sulfur cluster
important, but the structure of the protein in the vicinity is very important
as well. We see similar results for distance to ATP (Figure 3B). When we look
at the relationship between sequence variation and exposure to the intracellular
environment (measured here as the percent of an amino acid’s surface area that
is accessible to water), we see decreasing relative entropy with increasing exposure
to the environment (Figure 3C). Interestingly, though, we see that not all
amino acids completely buried inside the structure are completely conserved.
There are 50 amino acids located completely inside the NifH structure, 32 of
which display meaningful variation (defined here as having entropy >0.1,
which roughly corresponds to a particular amino acid accounting for <99% of the
variation at a position).
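As a quick sanity check of that 0.1-bit threshold (my own back-of-the-envelope arithmetic, assuming the residual 1% is spread evenly over the other nineteen amino acids):

$$H = -0.99 \log_2 0.99 - 19 \cdot \frac{0.01}{19} \log_2 \frac{0.01}{19} \approx 0.014 + 0.109 \approx 0.12 \ \text{bits},$$

which sits just above 0.1 bits, so crossing the threshold does roughly correspond to the most common amino acid falling below 99% frequency.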
We can use mutual information, or how much we can reduce our uncertainty about the identity of the amino acid at position X if we know the amino acid at position Y, to further study these amino acids. In particular, variation at one amino acid tucked inside the protein structure might be allowed if variation at another amino acid helps “compensate” for the changes. Positions that show these patterns of compensating amino acid substitutions are coevolving, and we can measure the statistical dependency between positions to detect such behavior. However, to distinguish directly from indirectly coevolving positions (i.e., the amino acid at position X directly influencing the identity of the amino acid at position Y vs. X indirectly influencing Z through Y interacting with Z), we will only consider the pairing with the highest mutual information for each amino acid. This approach is an application of the data processing inequality, which has been applied to the study of gene regulation networks [4]. To evaluate whether the magnitude of mutual information is meaningful or not, we will compare the mutual information estimates to the estimates produced when comparing randomized NifH sequences. In particular, for each mutual information estimate, we will calculate the number of standard deviations that it is above the average mutual information value between positions in the randomized sequences. These values are called “Z-scores”, and we will refer to the pairs involving positions and their strongest coevolving partners as “maxZ pairs”.
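The following Python sketch strings these steps together: plug-in mutual information between alignment columns, the average product correction [2], Z-scores against shuffled alignments, and the selection of each position's strongest partner. It is a simplified illustration under my own assumptions (plug-in MI instead of the JVHW estimator, and independent column permutation as the randomization scheme), not the exact pipeline used in this study:

```python
import numpy as np

def plugin_mi(x, y):
    """Plug-in mutual information (bits) between two alignment columns."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

def mi_matrix(columns):
    """Symmetric matrix of pairwise MI values between all columns."""
    k = len(columns)
    mi = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            mi[i, j] = mi[j, i] = plugin_mi(columns[i], columns[j])
    return mi

def apc_correct(mi):
    """Average product correction: subtract MI_i * MI_j / <MI> (Dunn et al. [2])."""
    k = mi.shape[0]
    row_mean = mi.sum(axis=1) / (k - 1)      # mean MI of each column (diagonal is 0)
    overall_mean = mi.sum() / (k * (k - 1))
    return mi - np.outer(row_mean, row_mean) / overall_mean

def maxz_partners(columns, n_shuffles=100, seed=0):
    """Z-score corrected MI against shuffled alignments; keep each position's
    single strongest partner (a crude use of the data processing inequality
    to favor direct over indirect coupling)."""
    rng = np.random.default_rng(seed)
    observed = apc_correct(mi_matrix(columns))
    null = np.array([
        apc_correct(mi_matrix([list(rng.permutation(list(c))) for c in columns]))
        for _ in range(n_shuffles)
    ])
    z = (observed - null.mean(axis=0)) / (null.std(axis=0) + 1e-12)
    np.fill_diagonal(z, -np.inf)             # ignore self-pairings
    return {i: int(np.argmax(z[i])) for i in range(len(columns))}, z
```

Calling maxz_partners on the list of alignment columns returns, for each position, the index of its strongest partner along with the full Z-score matrix; thresholding the Z-scores then yields candidate maxZ pairs.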
Using this approach, we detect 56 directly coevolving
amino acids. When we focus on the maxZ pairs, we see that the distribution of
the distances between the amino acids in each pair is quite different than the
distribution of the distances between all the amino acids (Figure 4). In
particular, these distances tend to be much shorter than would be expected if
we picked amino acid pairs at random. This is consistent with what we would expect
if the amino acids are directly interacting with each other! Interestingly, we
find that, of the 32 amino acids that are completely buried inside the NifH structure,
23 are found to coevolve with another amino acid (Table 1). Furthermore, of the
14 coevolving pairs (some amino acids were maxZ partners to multiple other
amino acids), 8 of them occurred between amino acids that are both completely buried!
This suggests that coevolution can indeed explain some of the sequence variation
found inside the NifH protein structure – some amino acid substitutions appear to
“work” if another substitution simultaneously occurs nearby. Interestingly, the
top two amino acids that showed the most sequence variation (positions 151 and
196) were found to directly coevolve (Table 1). Although separated by 44 amino
acids in the linear sequence, they are only separated by 3.6 Å in the NifH structure.
Coevolution can also explain some of
the sequence variation we observed earlier in the alpha helices and beta
strands (Figure 1B). In NifH, there are 156 amino acids located on alpha
helices or beta strands, 116 of which showed sequence variation roughly >1%.
From this set of positions, 38 positions were found to coevolve – roughly a
third of the positions on alpha helices and beta strands showing meaningful
variation! These formed 30 coevolving pairs (some amino acids were maxZ
partners with multiple others), including many instances of coevolution between
separate alpha helices, separate beta strands, and between an alpha helix and
beta strand (Table 2). This suggests that, while amino acids forming alpha
helices and beta strands may display noticeable sequence variation at first
glance, there is actually less variation than there appears. It appears, as
revealed through our mutual information analysis, that variation at a position is
permissible if variation at another site simultaneously occurs. However, there are
likely many more coevolving positions, particularly strongly coevolving ones
where the identity of the amino acid at position X is inflexible and requires a particular
amino acid at position Y. The entropy at both positions would then be 0, and
because mutual information can never exceed the entropy of either position
(I(X;Y) ≤ min{H(X), H(Y)}), the mutual information between such positions will
also be 0. Our approach is therefore unable to detect coevolving positions
that show no sequence variation.
In summary, we were able to not only show that the amino acids surrounding the iron-sulfur cluster and ATP molecules in NifH show very little variation, but that many of the positions on the inside of the NifH structure appear to coevolve!
1. Jiao, J., Venkat, K., Han, Y. & Weissman, T. Minimax estimation of discrete distributions. IEEE Int. Symp. Inf. Theory – Proc. 2015-June, 2291–2295 (2015).
2. Dunn, S. D., Wahl, L. M. & Gloor, G. B. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340 (2008).
3. Schindelin, H., Kisker, C., Schlessman, J. L., Howard, J. B. & Rees, D. C. Structure of ADP·AlF4–-stabilized nitrogenase complex and its implications for signal transduction. Nature 387, 370–376 (1997).
4. Margolin, A. A. et al. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7, 1–15 (2006).
Quantum computing, a field that has been around since the days of Feynman [1], has recently gained broad interest, especially with Google's publication of quantum supremacy in late 2019 [2]. Its promise lies in the potential to perform computational tasks that cannot feasibly be achieved on even the largest classical supercomputers—the so-called quantum supremacy which Google claimed. Unlike classical states, quantum states can be entangled. A quantum computer consisting of n qubits could, in theory, represent a system with 2^n states, so problems that scale exponentially on classical machines could be solved in reasonable (polynomial) time on a quantum machine. Current quantum devices, however, are noisy, which limits the entanglement they can faithfully maintain. Given this, it is important to understand the limitations of current hardware by classically simulating quantum devices and algorithms. What makes quantum algorithms difficult to simulate, and what could that tell us about the algorithm itself? Or, from another angle, how does the entropy of a given problem (i.e., the entanglement of the quantum system) affect our ability to solve it?
Theory and Methods
In my research group, we have been developing a classical simulator for a quantum computer that employs a tree representation of a wavefunction to cut down on computational cost [3], [4]. The method this is based on is called multilayer multi-configurational time-dependent Hartree (ML-MCTDH) [5], an approach used in the theoretical chemistry community to study quantum dynamical systems. A wavefunction Ψ approximating a quantum system is represented in a basis recursively defined by a given tree graph, so that Ψ is the root node of the tree (see Figure 1). Qubits are represented as leaves in the tree, labeled x0 through x7 in Figure 1. One can perform a two-qubit operation such as the depicted CNOT gate, which entangles qubits x3 and x4; only the nodes along the path connecting the two qubits in the tree (shown in red) need to be considered in the computation of the gate.
Fig 1. Tree representation of quantum system Ψ with red path between entangled qubits.
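To make the path idea concrete, here is a toy Python sketch (my own illustration, not the group's quTree package) that builds a balanced binary tree over eight qubits, x0 through x7, and reports the nodes a two-qubit gate must touch, namely the two leaf-to-ancestor paths joined at their lowest common ancestor:

```python
def build_tree(leaves):
    """Pair nodes level by level into a balanced binary tree.

    Returns a child -> parent map and the root; internal nodes are
    simply tuples of their children's labels.
    """
    parent, level = {}, list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = tuple(level[i:i + 2])
            for child in pair:
                parent[child] = pair
            nxt.append(pair)
        level = nxt
    return parent, level[0]

def path_to_root(node, parent):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def gate_path(q1, q2, parent):
    """Nodes a two-qubit gate on (q1, q2) touches: both paths up to
    and including the lowest common ancestor."""
    p1, p2 = path_to_root(q1, parent), path_to_root(q2, parent)
    lca = next(n for n in p1 if n in p2)
    return p1[:p1.index(lca)] + p2[:p2.index(lca)] + [lca]

parent, root = build_tree([f"x{i}" for i in range(8)])
print(gate_path("x3", "x4", parent))   # path for a CNOT entangling x3 and x4
```

In this balanced tree the CNOT between x3 and x4 happens to route through the root, as in the red path of Figure 1, while a gate between siblings such as x0 and x1 touches only their shared parent; the rest of the tree is untouched, which is where the computational savings come from.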
The tree representation assumes there is some structure to the wavefunction we want to simulate, so we expect it to work well in low-entropy regimes, such as a simple QFT (quantum Fourier transform) circuit [3]. High-entropy regimes, such as the random circuits studied in the Google paper [2], [6], present more of a challenge. Since we are interested in simulating quantum algorithms, we examine Shor's algorithm for integer factorization [7]—this algorithm is important for cryptography and is a canonical quantum algorithm in that, on an ideal quantum device, it runs in polynomial time, while the best known classical algorithms do not. A simplified quantum circuit for Shor's algorithm is shown in Figure 2 [8].
Fig 2. Summary circuit diagram for Shor's algorithm. n is the number of qubits needed to represent the integer being factorized. A series of Ua operations is performed on a measurement space of 2n qubits. See ref. 8 for more detail.
Entropy Analysis and Discussion
We would like to analyze Shor’s algorithm to see 1) how a canonical quantum algorithm behaves in an entropy context (i.e. which entropy regime does the algorithm belong to), and 2) how well a tree-based approach can simulate a quantum algorithm. We measure the entanglement at each node p in the tree as a circuit proceeds using the von Neumann entropy S_p:

$$S_p = -k_B \, \mathrm{Tr}\left( \rho_p \ln \rho_p \right),$$

where ρ_p is the reduced density operator at node p and k_B is the Boltzmann constant. (Note, in the figures below we report the node perplexity Λ_p = e^{S_p} (taking k_B = 1) instead of the entropy directly, as it can be related to the number of basis functions required at node p to get an accurate description of the wavefunction.)
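For a single bipartition, S_p and the perplexity can be computed from the Schmidt decomposition of the state vector. Below is a minimal NumPy sketch (my own illustration with k_B = 1, not the quTree code):

```python
import numpy as np

def node_entropy(psi, dim_left):
    """Von Neumann entropy (k_B = 1) across one bipartition of state vector psi.

    Reshaping psi into a (dim_left, dim_right) matrix and taking its SVD
    yields the Schmidt coefficients; their squares are the eigenvalues of
    the reduced density operator rho_p on either side of the cut.
    """
    schmidt = np.linalg.svd(psi.reshape(dim_left, -1), compute_uv=False)
    probs = schmidt**2
    probs = probs[probs > 1e-12]          # drop numerical zeros
    return float(-np.sum(probs * np.log(probs)))

# Example: a Bell pair (|00> + |11>)/sqrt(2) split between its two qubits.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
S = node_entropy(bell, dim_left=2)
print(f"S = {S:.4f}, perplexity = {np.exp(S):.4f}")   # S = ln 2, perplexity = 2
```

The Bell pair gives S = ln 2 and perplexity exactly 2: two basis functions are needed on each side of the cut, which is the sense in which perplexity counts the basis functions required at a node.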
Figures 3 and 4 show an entropy analysis simulating Shor’s circuit on 22 qubits using a tree with 22 leaf nodes and a roughly binary structure, similar to that shown in Figure 1. In this case, we are factorizing the integer N=15. Of the 22 qubits, only 5 are used to represent N—the particular algorithm[8] we implemented uses 4n + 2 qubits, where n is the number of qubits used to represent integer N. 2n of these qubits are used for the measurement space, as shown in Figure 2. Figure 3 shows a heatmap of relative node entropy for the nodes at the bottom of the tree, i.e. the 22 nodes representing qubits. Nodes 12-22 represent the measurement space, where you can clearly see the maximal entropy (red) following a stair-like pattern with the ten applications of the Ua gates (from the circuit diagram in Figure 2).
Fig 3. Entanglement heatmap of 22 leaf nodes representing 22 qubits (y-axis) as a Shor’s circuit progresses (x-axis).
The heatmap in Figure 3 shows only entanglement at the bottom of the tree—a better picture takes into account the entropy at higher nodes. Figure 4 shows a video of the 22-qubit tree as the Shor circuit progresses, with each node colored according to its entropy relative to the highest node entropy observed throughout the circuit. The darker the color, the greater the local node entropy at that point in the circuit.
Fig 4. Entanglement map of a 22-qubit tree throughout a Shor’s circuit simulation. Darker color corresponds to greater entropy.
From this visualization, we can see that entropy hot spots occur not in the leaves but in the second and third layers of the tree, and that the total entropy of this system increases as the circuit progresses.
The simulation described above took 22 s on a 2.9 GHz Intel Core i9 (MacBook Pro). Moving to 32 qubits to factor N=83 (with a similar binary tree structure) took 1796 s. This ~80x jump is due partly to more gates (increased circuit depth), but mostly to each gate operation taking significantly more time on a larger tree. We can note here again that the tree representation is an approximation to the wavefunction of the system. Even so, we obtain the correct answer from the circuit. More structured trees (with more layers and edges) are more aggressive approximations that we can exploit, trading a little accuracy for computational speedup. There are some interesting ideas here about finding the optimal tree for a given quantum algorithm, which I plan to explore further. (How can we use entropy hot spots like those in Figure 4 to optimize the tree structure? If we could find the optimal tree for a simulation in polynomial time, would we even need a quantum computer?)
Using these tree structures optimally is all about exploiting structure in the quantum circuit, i.e., knowing where the hot spots are and are not located. In a truly random circuit, there is no structure to exploit, so we would expect aggressive tree approximations to fail and a classical simulation of these circuits to scale exponentially. This is a topic I plan to explore further.
Conclusion and Outlook
So, what makes a quantum algorithm difficult to simulate? After an initial analysis of Shor’s algorithm using a tree representation of the wavefunction, I would phrase the answer as a lack of exploitable structure, i.e. high and uniform node entropy. These tree structures can bring classical simulations down to polynomial scaling, but only if optimal structures are found. This project has brought up some very interesting questions about optimization, but has really just scratched the surface of the subject. Entropy will be a useful lens through which to analyze these tree simulations moving forward, and in general to understand the limits of classical simulation of quantum circuits.
Acknowledgements
Major thanks to Roman Ellerbrock for recent useful discussions, as well as previous work developing the ML-MCTDH simulator.
References
[1] R. P. Feynman, “Simulating physics with computers,” Int. J. Theor. Phys., vol. 21, no. 6–7, pp. 467–488, 1982.
[2] F. Arute et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
[3] R. Ellerbrock and T. J. Martinez, “A multilayer multi-configurational approach to efficiently simulate large-scale circuit-based quantum computers on classical machines,” Phys. Rev. Lett., submitted, 2020.
[4] R. Ellerbrock et al., “quTree – A tensor tree and general linear algebra package in C++,” in preparation, 2020.
[5] H.-D. Meyer, U. Manthe, and L. S. Cederbaum, “The multi-configurational time-dependent Hartree approach,” Chem. Phys. Lett., vol. 165, no. 1, pp. 73–78, 1990.
[6] S. Boixo et al., “Characterizing quantum supremacy in near-term devices,” Nat. Phys., vol. 14, no. 6, pp. 595–600, 2018.
[7] P. W. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, IEEE Comput. Soc. Press, Santa Fe, NM, 1994.
[8] S. Beauregard, “Circuit for Shor’s algorithm using 2n+3 qubits,” https://arxiv.org/abs/quant-ph/0205095, 2003.
Learn all about four different types of music storage over time on our blog/website here and in our YouTube video here! The first is analog recording — think of old vinyl records that are slowly coming back in style! The second is the compact disc (CD), which is still commonly used. The third is the MP3 player, popularized by big players like the iPod. And the fourth is digital streaming, such as Spotify and Apple Music. We also cover what exactly sound is, how we process sound in our brains, and exactly how it is stored in compressed formats.