Investigating coevolving amino acids in the protein NifH using mutual information

Blog, EE276 (Winter 2020)

by Bennett Kapili


Proteins are the primary catalytic units of biology. They are composed of up to twenty different amino acids strung together into sequences that are usually a few hundred amino acids in length. Each of the twenty amino acids has different chemical properties, such as different sizes, affinities for water, and electrical charges, and the interaction of these amino acids with their environment and with each other gives proteins their distinct three-dimensional shapes. This structure, in turn, is critically important in defining the protein’s catalytic properties. While a particular protein might have slightly different amino acid sequences in different organisms, with accompanying variations in folding structure, only a small subset of sequences result in the protein’s proper functioning. By studying a protein’s variations in amino acid sequences, we can gain valuable insight into how the protein works.

Here, we will borrow fundamental concepts from information theory and apply them to the study of molecular evolution. In particular, we will study amino acid sequence variation in the protein nitrogenase reductase (NifH), which is an ancient protein found in Bacteria and Archaea and is critically important for sustaining life on Earth. NifH helps convert nitrogen gas – which most organisms cannot use to grow – into a form of nitrogen that all life on Earth can use to grow. Without the action of this protein and the handful of other proteins with which it works, there would not be enough usable nitrogen to sustain all the living organisms on Earth. The NifH protein works by splitting apart molecules of ATP (the most common energetic currency for life on Earth) and using the energy that is released to transfer an electron to another protein. That electron is then passed to a molecule of nitrogen gas, which, after the accumulation of six electrons, is converted into ammonia. The electrons are transferred using a small cluster of iron and sulfur atoms that is embedded inside the NifH protein. Since the amino acid sequences that are involved with splitting the ATP and holding the iron-sulfur cluster are absolutely essential for proper NifH functioning, we would expect little sequence variation involving these amino acids. We'll explore this hypothesis using an information theoretic framework.


First, I downloaded all the publicly available amino acid sequences of NifH from GenBank (n = 8876 sequences), which is a database organized by the National Center for Biotechnology Information. Then, I constructed an amino acid sequence alignment, which is the placement of the sequences into a matrix, with rows corresponding to each linear sequence and columns corresponding to the equivalent positions in each sequence. The matrix columns capture the amount of amino acid variation that exists at a particular position in the protein. From this alignment, I created a phylogenetic tree to estimate which sequences were the most different based on a model of evolution. I picked the top 750 sequences that were the most spread out on the tree to analyze in this study. Then, I estimated the entropy of each column in the alignment and the mutual information between columns using the Jiao-Venkat-Han-Weissman estimator [1]. I then applied the average product correction described in Dunn et al., 2008 to the mutual information estimates to account for phylogenetic signal within the sequences [2].
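The column entropy and mutual information estimates can be sketched in a few lines. Note that the analysis above uses the minimax JVHW estimator [1]; for brevity, this sketch substitutes the simple plug-in (empirical) estimator, and any alignment passed in is assumed to be a small toy stand-in rather than the real NifH data.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column_entropy(col):
    """Plug-in (empirical) entropy of one alignment column, in bits."""
    n = len(col)
    return -sum((c / n) * log2(c / n) for c in Counter(col).values())

def mutual_information(x, y):
    """Plug-in MI between two columns: H(X) + H(Y) - H(X, Y)."""
    return column_entropy(x) + column_entropy(y) - column_entropy(list(zip(x, y)))

def mi_matrix(alignment):
    """Pairwise MI between the columns of an alignment (a list of
    equal-length sequences), with the average product correction of
    Dunn et al. [2]: MIp(i, j) = MI(i, j) - MI(i, .) * MI(j, .) / mean MI."""
    cols = list(zip(*alignment))
    L = len(cols)
    mi = {}
    for i, j in combinations(range(L), 2):
        mi[i, j] = mi[j, i] = mutual_information(cols[i], cols[j])
    col_mean = [sum(mi[i, j] for j in range(L) if j != i) / (L - 1)
                for i in range(L)]
    overall = sum(col_mean) / L           # assumed nonzero for this sketch
    return {(i, j): mi[i, j] - col_mean[i] * col_mean[j] / overall
            for i, j in combinations(range(L), 2)}
```

On a four-sequence toy alignment where two columns covary perfectly, the corrected MI for that pair stays well above zero while fully conserved columns contribute nothing.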

Results & Discussion

To study sequence variation, we can calculate the entropy of each column in the sequence alignment. We can then compare the position’s entropy to the entropy of a theoretical position in which any amino acid exists with equal probability. When we do this for NifH, we see little sequence variation in the positions that are involved with ATP splitting (called the “Walker A”, “Switch I”, and “Switch II” sequence motifs) and iron-sulfur cluster binding (Figure 1A). This is exactly what we expected! In Figure 1, the points closer to the dashed line have less sequence variation than points farther below the dashed line. Interestingly, the Walker B motif, which is also involved in ATP-splitting, has much more sequence variation than the other motifs (Figure 1A). Also, we see that the amino acids that compose the alpha helices and beta strands – 3D structures that help give NifH its shape – can have, at times, quite high sequence variation (Figure 1B). We’ll return to this later.

When we map the relative entropy estimates onto the 3D protein structure, we get a strong visual understanding of the relationship between sequence variation and protein function. The protein structure shown in Figure 2 is the structure of NifH during ATP splitting, where ADP-AlF4 serves as an analog of ATP [3]. First, we see very little sequence variation around the iron-sulfur cluster (Figure 2A). We also see very little sequence variation around the area that interacts with ATP (dark blue stripes in Figure 2B; ADP-AlF4 positions shown in Figure 2F). In addition, we see that the surface of the protein – that is, the area that is exposed to the chemical environment within the cell – has a relatively high amount of sequence variation.

To investigate these relationships further, we can plot the relative entropy as a function of distance to the iron-sulfur cluster and ADP-AlF4, as well as how much each amino acid is exposed to the outside. We see that amino acids separated from the iron-sulfur cluster by <5 angstroms (five one-hundred-millionths of a centimeter; a distance that allows the formation of hydrogen and ionic bonds), have very little sequence variation (Figure 3A). This low sequence variation extends to ~10 Å, suggesting that not only is direct interaction with the iron-sulfur cluster important, but the structure of the protein in the vicinity is very important as well. We see similar results for distance to ATP (Figure 3B). When we look at the relationship between sequence variation and exposure to the intracellular environment (measured here as the percent of an amino acid’s surface area that is accessible to water), we see decreasing relative entropy with increasing exposure to the environment (Figure 3C). Interestingly, though, we see that not all amino acids completely buried inside the structure are completely conserved. There are 50 amino acids located completely inside the NifH structure, 32 of which display meaningful variation (defined here as having entropy >0.1, which roughly corresponds to a particular amino acid accounting for <99% of the variation at a position).

We can use mutual information, or how much we can reduce our uncertainty about the identity of the amino acid at position X if we know the amino acid at position Y, to further study these amino acids. In particular, variation at one amino acid tucked inside the protein structure might be allowed if variation at another amino acid helps “compensate” for the changes. Positions that show these patterns of compensating amino acid substitutions are coevolving, and we can measure the statistical dependency between positions to detect such behavior. However, to distinguish directly from indirectly coevolving positions (i.e., the amino acid at position X directly influencing the identity of the amino acid at position Y, vs. X indirectly influencing Z only because X influences Y and Y interacts with Z), we will only consider the pairing with the highest mutual information for each amino acid. This approach is an application of the data processing inequality, which has been applied to the study of gene regulation networks [4]. To evaluate whether the magnitude of mutual information is meaningful or not, we will compare the mutual information estimates to the estimates produced when comparing randomized NifH sequences. In particular, for each mutual information estimate, we will calculate the number of standard deviations that it is above the average mutual information value between positions in the randomized sequences. These values are called “Z-scores”, and we will refer to the pairs involving positions and their strongest coevolving partners as “maxZ pairs”.
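The Z-score construction can be sketched as follows. The analysis above randomizes whole NifH sequences (and uses the JVHW estimator); this minimal version instead shuffles a single column with the plug-in estimator, which likewise destroys any real covariation between the two positions, so it should be read as an illustration of the idea rather than the actual pipeline.

```python
import random
from collections import Counter
from math import log2
from statistics import mean, stdev

def entropy(col):
    """Plug-in entropy of a column, in bits."""
    n = len(col)
    return -sum((c / n) * log2(c / n) for c in Counter(col).values())

def mutual_information(x, y):
    """Plug-in MI: H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def mi_zscore(col_x, col_y, n_shuffles=200, seed=0):
    """How many standard deviations the observed MI sits above a null
    distribution built by shuffling one column (decoupling the positions)."""
    rng = random.Random(seed)
    observed = mutual_information(col_x, col_y)
    null = []
    for _ in range(n_shuffles):
        shuffled = list(col_y)
        rng.shuffle(shuffled)
        null.append(mutual_information(col_x, shuffled))
    return (observed - mean(null)) / stdev(null)
```

A perfectly covarying pair of columns scores far above the shuffled null, which is exactly the signature used to call a maxZ pair.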

Using this approach, we detect 56 directly coevolving amino acids. When we focus on the maxZ pairs, we see that the distribution of the distances between the amino acids in each pair is quite different than the distribution of the distances between all the amino acids (Figure 4). In particular, these distances tend to be much shorter than would be expected if we picked amino acid pairs at random. This is consistent with what we would expect if the amino acids are directly interacting with each other! Interestingly, we find that, of the 32 amino acids that are completely buried inside the NifH structure, 23 are found to coevolve with another amino acid (Table 1). Furthermore, of the 14 coevolving pairs (some amino acids were maxZ partners to multiple other amino acids), 8 of them occurred between amino acids that are both completely buried! This suggests that coevolution can indeed explain some of the sequence variation found inside the NifH protein structure – some amino acid substitutions appear to “work” if another substitution simultaneously occurs nearby. Interestingly, the top two amino acids that showed the most sequence variation (positions 151 and 196) were found to directly coevolve (Table 1). Although separated by 44 amino acids in the linear sequence, they are only separated by 3.6 Å in the NifH structure.

Coevolution can also explain some of the sequence variation we observed earlier in the alpha helices and beta strands (Figure 1B). In NifH, there are 156 amino acids located on alpha helices or beta strands, 116 of which showed sequence variation roughly >1%. From this set of positions, 38 positions were found to coevolve – roughly a third of the positions on alpha helices and beta strands showing meaningful variation! These formed 30 coevolving pairs (some amino acids were maxZ partners with multiple others), including many instances of coevolution between separate alpha helices, separate beta strands, and between an alpha helix and beta strand (Table 2). This suggests that, while amino acids forming alpha helices and beta strands may display noticeable sequence variation at first glance, there is actually less variation than there appears. It appears, as revealed through our mutual information analysis, that variation at a position is permissible if variation at another site simultaneously occurs. However, there are likely many more coevolving positions, particularly those that are strongly coevolving where the identity of an amino acid at position X is inflexible and requires a particular amino acid at position Y. The entropy at both positions would therefore be 0, and since there can be no further reduction in uncertainty, the mutual information between the positions will be 0. Our approach is therefore unable to detect coevolving positions that show no sequence variation.

In summary, we were able to not only show that the amino acids surrounding the iron-sulfur cluster and ATP molecules in NifH show very little variation, but that many of the positions on the inside of the NifH structure appear to coevolve!

1.       Jiao, J., Venkat, K., Han, Y. & Weissman, T. Minimax estimation of discrete distributions. IEEE Int. Symp. Inf. Theory – Proc. 2015-June, 2291–2295 (2015).

2.       Dunn, S. D., Wahl, L. M. & Gloor, G. B. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340 (2008).

3.       Schindelin, H., Kisker, C., Schlessman, J. L., Howard, J. B. & Rees, D. C. Structure of ADP·AlF4⁻-stabilized nitrogenase complex and its implications for signal transduction. Nature 387, 370–376 (1997).

4.       Margolin, A. A. et al. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7, 1–15 (2006).

Entropy of quantum circuits: What makes a quantum algorithm difficult to simulate?

Blog, EE276 (Winter 2020)

Author: K. Grace Johnson


Quantum computing, a field that has been around since the days of Feynman[1], has recently gained broad interest, especially with the publication of quantum supremacy by Google in late 2019[2]. Its promise lies in the potential to perform computational tasks that cannot feasibly be achieved on even the largest classical supercomputers—the so-called quantum supremacy which Google claimed. Unlike classical states, quantum states can be entangled. A quantum computer consisting of n qubits could, in theory, simulate a system with 2^n states; problems that scale exponentially on classical machines could be solved in reasonable (polynomial) time on a quantum machine. Current quantum devices, however, are noisy, which limits the entanglement they can maintain. Given this, it is important to understand the limitations of current hardware by classically simulating quantum devices and algorithms. What makes quantum algorithms difficult to simulate, and what could that tell us about the algorithm itself? Or, from another angle, how does the entropy of a given problem (i.e. the entanglement of the quantum system) affect our ability to solve it?

Theory and Methods

In my research group, we have been developing a classical simulator for a quantum computer that employs a tree representation of a wavefunction to cut down on computational cost[3], [4]. The method on which this is based is called multi-layer multi-configurational time dependent Hartree (ML-MCTDH)[5], an approach used in the theoretical chemistry community to study quantum dynamical systems. A wavefunction Ψ approximating a quantum system is represented in a basis recursively defined by a given tree graph, so that Ψ is the root node of the tree (see Figure 1). Qubits are represented as leaves in the tree, labeled x0 through x7 in Figure 1. One can perform a two-qubit operation such as the CNot gate depicted which entangles qubits x3 and x4, and only nodes along the path connecting the qubits in the tree (shown in red) need to be considered in the computation of the gate.
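The "only nodes along the connecting path" bookkeeping can be sketched with a simple child-to-parent map; the node names below are hypothetical stand-ins for the tree in Figure 1, not the simulator's actual data structures.

```python
def path_between(parent, a, b):
    """Nodes on the path from leaf a to leaf b in a tree given as a
    child -> parent map (the root maps to None)."""
    up_a = [a]                            # walk a up to the root
    while parent[up_a[-1]] is not None:
        up_a.append(parent[up_a[-1]])
    seen = set(up_a)
    up_b = [b]                            # walk b up until we hit a's path
    while up_b[-1] not in seen:
        up_b.append(parent[up_b[-1]])
    lca = up_b[-1]                        # lowest common ancestor
    return up_a[:up_a.index(lca) + 1] + up_b[-2::-1]

# A toy tree shaped like Figure 1: "Psi" is the root, x3/x4 are qubit leaves.
parent = {"Psi": None, "n1": "Psi", "n2": "Psi", "x3": "n1", "x4": "n2"}
# Only these nodes participate in a CNot entangling x3 and x4:
# path_between(parent, "x3", "x4") -> ["x3", "n1", "Psi", "n2", "x4"]
```

Every node off this path is untouched by the gate, which is where the computational savings of the tree representation come from.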

Fig 1. Tree representation of quantum system Ψ with red path between entangled qubits.

The tree representation assumes there is some structure to the wavefunction we want to simulate, so we expect it to work well in low-entropy regimes, such as a simple QFT (quantum Fourier transform) circuit[3]. High-entropy regimes, such as the random circuits studied in the Google paper[2], [6], present more of a challenge. Since we are interested in simulating quantum algorithms, we examine Shor's algorithm for integer factorization[7]. This algorithm has major implications for encryption and is a canonical quantum algorithm in that, on an ideal quantum device, it does in polynomial time what on a classical device would be exponential. A simplified quantum circuit for Shor's algorithm is shown in Figure 2[8].

Fig 2. Summary circuit diagram for Shor's algorithm. n is the number of qubits needed to represent the integer being factorized. A series of Ua operations is performed on a measurement space of 2n qubits. See ref. 8 for more detail.

Entropy Analysis and Discussion

We would like to analyze Shor’s algorithm to see 1) how a canonical quantum algorithm behaves in an entropy context (i.e. which entropy regime the algorithm belongs to), and 2) how well a tree-based approach can simulate a quantum algorithm. We measure the entanglement at each node p in the tree as a circuit proceeds using the von Neumann entropy S_p:

S_p = -k_B Tr(ρ_p ln ρ_p)

where ρ_p is the reduced density operator at node p and k_B is the Boltzmann constant. (Note, in the figures below we report node perplexity Λ = e^(S_p) instead of entropy directly, as it can be related to the number of basis functions required at node p to get an accurate description of the wavefunction.)
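As a concrete check of these quantities, here is a small sketch (with k_B set to 1) that computes S_p and the perplexity Λ for one qubit of a Bell state, whose reduced density operator is the maximally mixed state:

```python
import numpy as np

def von_neumann_entropy(rho):
    """S = -Tr(rho ln rho), computed from the eigenvalues of rho (k_B = 1)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # treat 0 * ln 0 as 0
    return float(-np.sum(evals * np.log(evals)))

# One qubit of the Bell state (|00> + |11>)/sqrt(2): tracing out its
# partner leaves the maximally mixed state I/2, i.e. maximal entanglement.
psi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho_full = np.outer(psi, psi)
rho_a = rho_full.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)  # partial trace over qubit B

S = von_neumann_entropy(rho_a)            # ln 2, the maximum for one qubit
perplexity = np.exp(S)                    # Lambda = e^S = 2 basis functions needed
```

The perplexity of 2 says a fully entangled qubit needs both basis functions at that node, which is why high-entropy nodes are the expensive ones in the tree.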

Figures 3 and 4 show an entropy analysis simulating Shor’s circuit on 22 qubits using a tree with 22 leaf nodes and a roughly binary structure, similar to that shown in Figure 1. In this case, we are factorizing the integer N=15. Of the 22 qubits, only 5 are used to represent N—the particular algorithm[8] we implemented uses 4n + 2 qubits, where n is the number of qubits used to represent integer N. 2n of these qubits are used for the measurement space, as shown in Figure 2. Figure 3 shows a heatmap of relative node entropy for the nodes at the bottom of the tree, i.e. the 22 nodes representing qubits. Nodes 12-22 represent the measurement space, where you can clearly see the maximal entropy (red) following a stair-like pattern with the ten applications of the Ua gates (from the circuit diagram in Figure 2).

Fig 3. Entanglement heatmap of 22 leaf nodes representing 22 qubits (y-axis) as a Shor’s circuit progresses (x-axis).

The heatmap in Figure 3 shows only entanglement at the bottom of the tree—a better picture takes into account entropy at higher nodes. Figure 4 shows a video of the 22-qubit tree as the Shor circuit progresses, with each node colored according to its entropy relative to the highest node entropy observed throughout the circuit. The darker the color, the higher the local node entropy at that point in the circuit.

Fig 4. Entanglement map of a 22-qubit tree throughout a Shor’s circuit simulation. Darker color corresponds to greater entropy.

From this visualization, we can see that entropy hot spots occur not in the leaves but in the second and third layers of the tree, and that the total entropy of this system increases as the circuit progresses.

The simulation described above took 22s on a 2.9 GHz Intel Core i9 (MacBook Pro). Moving to 32 qubits to factor N=83 (with a similar binary tree structure) took 1796s. This 80x jump is due partly to more gates (increased circuit depth), but mostly because each gate operation takes significantly more time for a larger tree. We can note here again that the tree representation is an approximation to the wavefunction of the system. Even so, we obtain the correct answer from the circuit. More structured trees (with more layers and edges) are more aggressive approximations which we can exploit, trading a small loss of accuracy for computational speedup. There are some interesting ideas here about finding the optimal tree for a given quantum algorithm, which I plan to explore further. (How can we use entropy hot spots like those in Figure 4 to optimize the tree structure? If we found the optimal tree for a simulation in polynomial time, would we even need a quantum computer?)

Using these tree structures optimally is all about exploiting structure in the quantum circuit, i.e. where the hot spots are not located. In a truly random circuit, there is no structure to exploit, thus we would expect aggressive tree approximations to fail, and a classical simulation of these circuits to scale exponentially. This is a topic I plan to explore further.

Conclusion and Outlook

So, what makes a quantum algorithm difficult to simulate? After an initial analysis of Shor’s algorithm using a tree representation of the wavefunction, I would phrase the answer as a lack of exploitable structure, i.e. high and uniform node entropy. These tree structures can bring classical simulations down to polynomial scaling, but only if optimal structures are found. This project has brought up some very interesting questions about optimization, but has really just scratched the surface of the subject. Entropy will be a useful lens through which to analyze these tree simulations moving forward, and in general to understand the limits of classical simulation of quantum circuits.


Major thanks to Roman Ellerbrock for recent useful discussions, as well as previous work developing the ML-MCTDH simulator.


[1]      R. P. Feynman, “Simulating physics with computers,” Int. J. Theor. Phys., vol. 21, no. 6–7, pp. 467–488, 1982.
[2]      F. Arute et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
[3]      R. Ellerbrock and T. J. Martinez, “A multilayer multi-configurational approach to efficiently simulate large-scale circuit-based quantum computers on classical machines,” Phys. Rev. Lett., submitted, 2020.
[4]      R. Ellerbrock et al., “quTree – A tensor tree and general linear algebra package in C++,” in preparation, 2020.
[5]      H.-D. Meyer, U. Manthe, and L. S. Cederbaum, “The multi-configurational time-dependent Hartree approach,” Chem. Phys. Lett., vol. 165, no. 1, pp. 73–78, 1990.
[6]      S. Boixo et al., “Characterizing quantum supremacy in near-term devices,” Nat. Phys., vol. 14, no. 6, pp. 595–600, 2018.
[7]      P. W. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, IEEE Comput. Soc. Press, Santa Fe, NM, 1994.
[8]      S. Beauregard, “Circuit for Shor’s algorithm using 2n+3 qubits,” 2003.

Forecasting as a Tool to Effectively Communicate Pressing Information: A Look Into Meat Consumption, Climate Change, and Alternative Food Networks

Blog, EE276 (Winter 2020)

by Rakan Albarghouty and Kyle Walker

It is estimated that cattle are responsible for about 65% of the livestock sector’s emissions, a sector that, worldwide, is responsible for between 14.5 percent and 18 percent of total annual human-driven greenhouse gas emissions. [1] At the same time, global meat production is projected to be 16 percent higher in 2025 than in the base period (2013-15). [2] Not only does the current system raise disconcerting ethical questions about animal treatment, resource use, and public health (after all, high meat consumption is linked to heart disease, stroke, diabetes, and several cancers), but it is also highly unsustainable and a leading source of greenhouse gas emissions. [3]

To help fight climate change and improve public health, it is clear that there needs to be a drastic change in our meat consumption behaviour. This change can be manifested globally by cutting down total meat consumption, replacing meat with alternatives like plant-based proteins and alternative meats, and/or transitioning to alternative, sustainable food systems. All three of the aforementioned approaches must happen simultaneously to communicate a clear, action-driven message: the world needs much less traditional meat production. Behaviours that limit the success of the meat industry’s business model today must be practiced by both the public and industry leaders themselves to reduce greenhouse gas emissions. Our project seeks to find ways in which forecasting can be used to effectively disseminate vital information to consumers.

Please find an attached copy of our full report here.

The Evolution of Music Storage

Blog, EE276 (Winter 2020)

Learn all about four different types of music storage over time on our blog/website here and on our YouTube video here! The first is analog recording — think of old vinyl records that are slowly coming back in style! The second is the compact disc (CD), which is still commonly used. The third is the MP3 player, revolutionized by big players like the iPod. And the fourth is digital streaming, such as Spotify and Apple Music. We also cover what exactly sound is, how we process sound in our brain, and exactly how it is stored in compressed systems.


Tania Dhaliwal
Michelle Ly
Casey Butcher

An Order-Optimal Edit-Correcting Code

Blog, EE276 (Winter 2020)

By Daniel Tan


DNA (deoxyribonucleic acid) could be called the “source code” of life on Earth. Whereas machine code relies on a binary alphabet of 0s and 1s, DNA relies on a quaternary alphabet of the nucleotides A, T, C, and G. DNA is easy to read from, write to, and copy, since these functions are already implemented by natural cellular machinery. In addition, it is very stable, and does not require much energy to store. Due to these advantages, it has recently become a medium of interest for long-term information storage. [1]

One flaw of DNA as a storage medium is that information stored in DNA can be corrupted through point mutations. [2] In particular, mutations can involve a nucleotide being removed from the DNA sequence (deletion), a nucleotide being changed to a different nucleotide (substitution), or a new nucleotide being added at an arbitrary location (insertion). If DNA is to be used as information storage, there is a need for error-correcting codes which can reliably handle all three types of error.
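The three error types are easy to simulate. This small sketch (an illustration only, not part of any real coding scheme) applies one random edit to a DNA string:

```python
import random

NUCLEOTIDES = "ATCG"

def random_edit(seq, rng=random):
    """Apply one random point mutation (substitution, insertion, or
    deletion) to a DNA string and return the corrupted string."""
    kind = rng.choice(["substitution", "insertion", "deletion"])
    if kind == "substitution":
        i = rng.randrange(len(seq))
        new_base = rng.choice([b for b in NUCLEOTIDES if b != seq[i]])
        return seq[:i] + new_base + seq[i + 1:]
    if kind == "insertion":
        i = rng.randrange(len(seq) + 1)
        return seq[:i] + rng.choice(NUCLEOTIDES) + seq[i:]
    i = rng.randrange(len(seq))           # deletion
    return seq[:i] + seq[i + 1:]
```

Note that insertions and deletions shift every downstream position, which is what makes them much harder to correct than substitutions.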


In this project, I implement a code presented in a recent paper [3] that corrects a single edit (substitution, insertion, or deletion) with log n + O(log log n) redundancy. In other words, to correct a single edit in a message of n symbols, only log n + O(log log n) additional symbols need to be used. In contrast, previous approaches had a redundancy of c·log n where c ≥ 2.
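To make “correcting a single edit” concrete, here is a brute-force sketch that is unrelated to the syndrome-based construction of [3]: it simply checks which codewords lie within one edit of the received string. Unique decoding works whenever every pair of codewords is at edit distance at least 3.

```python
def single_edits(word, alphabet="ATCG"):
    """All strings reachable from word by at most one substitution,
    insertion, or deletion (including word itself)."""
    out = {word}
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                 # deletion at i
        for a in alphabet:
            out.add(word[:i] + a + word[i + 1:])         # substitution at i
    for i in range(len(word) + 1):
        for a in alphabet:
            out.add(word[:i] + a + word[i:])             # insertion at i
    return out

def brute_force_decode(received, codebook):
    """Return the unique codeword within one edit of `received`, else None."""
    hits = [c for c in codebook if received in single_edits(c)]
    return hits[0] if len(hits) == 1 else None
```

The brute-force decoder costs time proportional to the codebook size; the point of the construction in [3] is to get the same guarantee with only log n + O(log log n) redundant symbols and efficient encoding/decoding.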

The implementation and a high-level explanation of how the code works can be found at this GitHub repository. The code is intended to be open-source and may be freely used, reproduced or modified with suitable accreditation.


[1] Goldman, N., Bertone, P., Chen, S. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).

[2] “Point Mutation”. Biology Dictionary. Retrieved 25 Mar 2020.

[3] Cai, Kui et al. Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage. arXiv:1910.06501 [cs.IT]