Information Theory in Molecular Biology

EE376A (Winter 2019)

By David Lin and Maxmillian Minichetti


The cell relies on biological signaling networks to adjust its physiological state in response to changing environmental factors. These signaling pathways must transmit specific information about extracellular conditions despite the distortion and noise inherent in nature. The stochasticity of molecular interactions can interfere with the delivery of these signals and degrade the transmitted information. How cells perform inter- and intracellular communication in the presence of noise is a fundamental question in biology. Information theory provides a framework for modeling these biochemical signaling networks and other forms of natural data transmission. In this study, we explore information-theoretic approaches to DNA shotgun sequencing, mutual information in local DNA regions, and the ethylene/auxin signaling pathway in plant cells.

DNA Shotgun Sequencing

DNA sequencing is the process of determining the order of nucleotides in DNA. Each nucleotide is identified by its nitrogenous base, which is one of A (adenine), T (thymine), C (cytosine), and G (guanine). With increasingly powerful sequencing technologies, these sequences have found broad-reaching applications in molecular biology, medicine, and forensics. Sequencing typically takes a “shotgun” approach, whereby millions of short reads are taken, each on the order of 100 nucleotides long [1]. These individual reads are then assembled to reconstruct the original sequence (see figure below). The primary advantage of the shotgun approach over a single start-to-end read is computational. Whereas a single read of a few billion nucleotides would be prohibitively time consuming, shotgun sequencing can readily take advantage of advances in parallel computing.

Figure 1: Schematic of Shotgun Sequencing. Figure courtesy of Motahari et al. 2013 [1].

The problem setup can just as easily be viewed through the lens of Shannon’s information theory [1]. In particular, the analogy treats the DNA sequence as the original input, the shotgun reads as the encoded message, and the reconstructed sequence as the decoded message. Recall that channel capacity is the upper bound on the rate of reliable information transmission over a communication channel. In the absence of computational bottlenecks, this treatment yields fundamental results on the minimum number of reads necessary for a reliable sequence reconstruction. These bounds then give insight into how closely current reconstruction algorithms approach optimality.

In particularly idealized conditions, where the reads of length L span the sequence, are noiseless, and contain no repeats, it can be shown that the greedy approach achieves the optimality conditions described by channel capacity. That is to say, starting with all the different reads, we can greedily merge the pair of sequences with the largest overlap score until one contiguous sequence remains. Although this basic algorithm does find application in modern genome assemblers, in the more common and complex scenario of noisy and repetitive reads, practitioners tend to rely on other algorithms. One body of approaches, sequential algorithms, takes one particular sequence and continuously grows it until all reads have been used. Another set of approaches, called K-mer approaches, takes length-K subsequences and uses lexicographical sorting as a guide for the merging process. While both families of algorithms fall short of optimal, they empirically perform better in experimental contexts.
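The greedy merge described above can be sketched as follows. This is a minimal illustration rather than a production assembler: the toy genome, the read length, and the helper names `overlap` and `greedy_assemble` are assumptions made for this example, and the reads are noiseless, repeat-free, and span the sequence, as in the idealized setting.

```python
from itertools import permutations
from typing import Sequence


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: Sequence[str]) -> str:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop duplicate reads and reads fully contained in another read.
    fragments = list(dict.fromkeys(reads))
    fragments = [f for f in fragments
                 if not any(f != g and f in g for g in fragments)]
    while len(fragments) > 1:
        # Pick the ordered pair (a, b) with the largest suffix/prefix overlap.
        best_size, best_a, best_b = max(
            (overlap(a, b), a, b) for a, b in permutations(fragments, 2)
        )
        fragments.remove(best_a)
        fragments.remove(best_b)
        fragments.append(best_a + best_b[best_size:])
    return fragments[0]


genome = "ATGCGTACCTGAAGTC"   # hypothetical toy sequence with no repeated 7-mers
read_length = 8
# One noiseless read starting at every position, so the reads span the genome.
reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]
assembled = greedy_assemble(reads)
```

With repeats longer than the overlaps, the same procedure can merge the wrong pair, which is one reason practical assemblers move beyond it.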

While the results of information theory are clean and interpretable, their utility is still limited by a couple of factors [1]. Perhaps the most notable issue is that of long repeated subsequences in DNA. Any assumptions of independent and identically distributed (iid) data or Markov properties that only account for short-range correlations quickly fall apart in this regime. Further, experimental complications like quality scores and correlated noise add complexity to the problem. In practice, many industry experts also rely on heuristics based on domain-specific observations to improve runtime and accuracy. Unfortunately, these domain-specific algorithms have more ambiguous performance metrics and are difficult to compare against results from information theory.

Mutual Information in Local DNA Regions

Noncoding DNA, or segments of DNA that do not encode proteins, accounts for nearly 99 percent of DNA sequence length [2]. As such, a fundamental task is that of distinguishing coding from noncoding DNA. Experimental techniques have limitations in exhaustively identifying these coding segments, so statistical patterns within the sequences themselves are highly sought after in the research community.

Taking a lesson from information theory, the mutual information metric can be a valuable source of statistical insight [3]. In particular, we can let $I_k$ represent the shared information in bits between two nucleotides separated by k nucleotides. Given the marginal probability distributions $p(x_i)$ and $p(y_j)$ as well as the joint distribution $p_k(x_i, y_j)$, we can express the mutual information as follows [2]:
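In the standard form of mutual information, written with the symbols defined above and a base-2 logarithm so that the result is in bits, the expression reads:

$$
I_k = \sum_{i=1}^{4} \sum_{j=1}^{4} p_k(x_i, y_j) \, \log_2 \frac{p_k(x_i, y_j)}{p(x_i)\, p(y_j)}
$$

When the two positions are statistically independent, $p_k(x_i, y_j) = p(x_i)\,p(y_j)$ for every pair, each logarithm vanishes, and $I_k = 0$.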

The summations run from one through four to account for all combinations of the four possible bases. Below, Grosse et al. plot the mutual information of coding and noncoding DNA as a function of the distance k between nucleotides [3].

Figure 2: Mutual Information vs. Distance in DNA. Figure courtesy of Grosse et al. 2000 [3].

Two key observations can be made. First, the average mutual information is significantly different between these two types of regions. Since there is more local interaction in coding regions, it makes sense that they would show higher mutual information. Interestingly, Grosse et al. report that this pattern holds almost unchanged across species [3]. The second key distinction is that of functional form. The oscillatory behavior in the coding regions, with a period of three, arises because nucleotides are read as codons, triplets that each specify one amino acid. This effect is further amplified by the nonuniformity of codon frequencies. The results of this study can readily be applied to training models for predicting coding vs. noncoding DNA regions.
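As a rough illustration of how such a curve could be estimated from a sequence, the sketch below counts base pairs at positions i and i + k and plugs the empirical frequencies into the standard mutual information formula. The function name `mutual_information`, the choice to measure distance as i to i + k, and the two toy sequences are assumptions for this example and are not taken from Grosse et al.

```python
import math
from collections import Counter


def mutual_information(sequence: str, k: int) -> float:
    """Estimate I_k in bits between bases at positions i and i + k."""
    pairs = [(sequence[i], sequence[i + k]) for i in range(len(sequence) - k)]
    total = len(pairs)
    joint = Counter(pairs)
    # Marginals are taken from the same pairs so the estimate is never negative.
    first = Counter(x for x, _ in pairs)
    second = Counter(y for _, y in pairs)
    info = 0.0
    for (x, y), count in joint.items():
        p_xy = count / total
        info += p_xy * math.log2(p_xy / ((first[x] / total) * (second[y] / total)))
    return info


periodic = "ACGT" * 50   # toy sequence: each base fully determines the next
constant = "A" * 200     # toy sequence: nothing to learn from any position
profile = [mutual_information(periodic, k) for k in range(1, 7)]
```

The fully predictable toy sequence sits near the 2-bit maximum for a four-letter alphabet, while the constant sequence gives zero. Real coding and noncoding regions fall far below that maximum, and short sequences bias this plug-in estimate upward.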

Ethylene/Auxin Signaling in Plants

Phototropism is the growth of an organism in response to light. This phenomenon is commonly observed in plants, where a combination of hormones, including ethylene and auxin, is responsible for steering the elongation of a plant stem in the direction of incident sunlight.

Figure 3: Schematic of Phototropism in Plants [4].

The figure depicts auxin molecules migrating away from the incoming light source. This promotes elongation in cells lining the darker face of the plant body. As one side of the plant extends, simple mechanical forces at play ultimately drive the plant body toward the source of light. Previous work has applied information theory to better understand the ethylene signaling network in plant cells [2]. The complex relationship between auxin and ethylene hormones has yet to be fully understood in general; however, one idea is to leverage the complementary nature of this hormonal interplay when specifically applied to root growth. We introduce the ethylene signaling pathway as a communication channel as follows:

Figure 4: Schematic of the Ethylene Signaling Pathway. Figure courtesy of Díaz et al. 2011 [2].

The phytohormone ethylene is synthesized from methionine during the Yang cycle [5]. This mechanism acts as the source emitter. As expected, the production of ethylene is circadian, peaking around midday [2]. A set of five ethylene gas receptors embedded in the endoplasmic reticulum (ER) membrane (shown in red), ETR1, ETR2, ERS1, ERS2, and EIN4, is collectively modeled as the encoder. Notably, this family of receptors is structurally similar to the histidine kinase receptors of two-component signaling pathways in bacteria [2]. For a total of NT ethylene receptors in the membrane, Na receptors are in their activated state, denoted by a 1, and N0 = NT – Na are in their inactivated state, represented as a 0. Abstractly, the encoder detects the concentration of ethylene by recording the number of inactivated receptors at a given point in time. Thus, we have reduced this system to a binary code of the following format: C = 00000000…1111111.

In a classic communication channel, the encoded message is transmitted through noise, one symbol at a time. In a biological setting, messages are not transmitted via symbols, but rather through biochemical signaling cascades that must accurately convey the same information. In plants, the inactivation of an ethylene receptor triggers the activation of EIN2 and ultimately the transcription factor EIN3, which targets the gene ERF1 [2]. This cascade of events, which occurs largely in the nucleus of a plant cell, translates the binary code C into a fraction of inactivated ethylene receptors, f = N0/NT. This fraction is directly proportional to the number of activated EIN3 transcription factors in the nucleus, which in turn determines the intensity of the ERF1 gene response. We model this process as the transmitter and noisy channel.
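The encoding step can be made concrete with a small sketch. The function names and the receptor counts below are hypothetical and chosen only to illustrate how the code C and the fraction f = N0/NT relate.

```python
def receptor_code(n_total: int, n_active: int) -> str:
    """Binary code C: one 0 per inactivated receptor, then one 1 per activated one."""
    if not 0 <= n_active <= n_total:
        raise ValueError("n_active must lie between 0 and n_total")
    return "0" * (n_total - n_active) + "1" * n_active


def inactivated_fraction(code: str) -> float:
    """Fraction f = N0 / NT of inactivated receptors in the code."""
    return code.count("0") / len(code)


code = receptor_code(n_total=10, n_active=4)   # hypothetical receptor counts
f = inactivated_fraction(code)
```

Only the count of zeros matters for f, so the ordering of symbols in C carries no additional information in this abstraction.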

After the EIN3 transcription factor has been activated, it binds to the promoter site corresponding to the ERF1 gene and initiates transcription. It is important to realize that transcription is a stochastic process, introducing some level of intrinsic noise that we can model as $n_{\mathrm{int}}^2$. Another source of noise originates from fluctuations in the number of DNA localization molecules, namely transcription factors, regulatory proteins, and polymerases [2]. Díaz et al. model this noise contribution as $n_{\mathrm{ext}}^2$. Thus, the total uncertainty (entropy) in the accuracy of the ERF1 gene response, subject to a noise source $n_{\mathrm{tot}}^2 = n_{\mathrm{int}}^2 + n_{\mathrm{ext}}^2$, is given by the following expression:
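Assuming the standard two-state Shannon entropy, which is consistent with the index j used for the ERF1 states and with an entropy that peaks when both states are equally likely, the expression takes the form below. How the noise terms enter the state probabilities in [2] is not reproduced here.

$$
H = -\sum_{j=1}^{2} p_j \log_2 p_j
$$

Here $p_j$ denotes the probability of state j, so that $p_1 + p_2 = 1$.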

Note that j = 1 corresponds to the ERF1 “off” state, and j = 2 corresponds to the ERF1 “on” state. The entropy H and information I are plotted below as a function of pERF1on.

Figure 5: Plots of H and I. Figure courtesy of Díaz et al. 2011 [2].

As expected, the value of H decreases as pERF1on approaches 0 or 1 and is maximized at pERF1on = 0.5. Interestingly, at pERF1on = 0.5 the plant cell is equally receptive to both ethylene and auxin phytohormones. At this point, the root cell is maximally “uncertain” about its growth behavior.
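The shape of this curve can be checked numerically. The sketch below assumes H is the standard Shannon entropy of a two-state (off/on) variable. The function name `binary_entropy` and the evaluation grid are choices made for this example.

```python
import math


def binary_entropy(p_on: float) -> float:
    """Shannon entropy in bits of a two-state (off/on) variable."""
    entropy = 0.0
    for p in (1.0 - p_on, p_on):   # j = 1 (off), j = 2 (on)
        if p > 0.0:                # 0 * log2(0) is taken as 0
            entropy -= p * math.log2(p)
    return entropy


# Evaluate H on a grid of on-probabilities and locate its maximum.
grid = [i / 100 for i in range(101)]
p_max = max(grid, key=binary_entropy)
```

On this grid the maximum falls at 0.5, where H equals one bit, and H drops to zero at both endpoints.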

Outreach Activity

Our outreach project at Nixon Elementary School took a slight detour to introduce the classic “cocktail party problem” to students. We were first exposed to this question in CS229 (Machine Learning) [4], but revisited it with a renewed appreciation for the connections that could be drawn to concepts in other fields like Information Theory (EE376A) and Linear Dynamical Systems (EE263).

In this “cocktail party,” n different individuals are speaking at the same time from different parts of the room. There are n different microphones also located in different parts of the room that pick up noisy linear combinations of these auditory inputs. Naturally, the question then becomes how one might use these microphone recordings to isolate individual voice signals. This problem readily translates into the language of linear algebra: the inputs and outputs can be represented as column vectors, and the mixing of inputs into outputs can be described by a mixing matrix we call A. To recover the individual voices from the noisy microphone recordings, we only need the matrix that unmixes the voices, W = A^{-1}.
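The linear-algebra picture can be illustrated with two speakers and two microphones. The sketch below assumes the mixing matrix A is known and the recordings are noiseless, which is exactly what is not available in practice. The signals, the matrix entries, and the helper names are hypothetical.

```python
import math


def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; the voices cannot be separated")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply(m, v):
    """Matrix-vector product for a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


# Two hypothetical "voices" sampled at 100 time steps: a sine and a square wave.
sources = [[math.sin(0.3 * t), math.copysign(1.0, math.sin(0.11 * t))]
           for t in range(100)]
A = [[0.8, 0.3], [0.4, 0.7]]   # assumed mixing matrix
W = invert_2x2(A)              # unmixing matrix W = A^{-1}

recordings = [apply(A, s) for s in sources]     # what the microphones pick up
recovered = [apply(W, x) for x in recordings]   # voices separated again
```

Since A is unknown in the real problem, W has to be estimated from the recordings alone.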

The ICA algorithm does precisely this, with a few caveats. On one level, the output of ICA cannot resolve the exact scaling of the input, which means that we won’t know exactly how loud each individual speaker was talking. Further, ICA requires that each source be independent and non-Gaussian. With that said, ICA can be derived from maximum likelihood estimation. The output is a gradient ascent update rule, where x represents the output recordings, that can be used to recover the “unmixing matrix” W.
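In the form given in the CS229 lecture notes [4], and assuming the sigmoid $g(s) = 1/(1 + e^{-s})$ is used as the cumulative distribution function of each source, the stochastic gradient ascent update for a single recording $x^{(i)}$ is:

$$
W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{bmatrix} {x^{(i)}}^T + (W^T)^{-1} \right)
$$

Here $\alpha$ is the learning rate and $w_j^T$ denotes the j-th row of W. Once the updates converge, the sources are recovered as $s^{(i)} = W x^{(i)}$.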

Altogether, the average 3rd grader has limited appreciation for linear algebra, so our presentation focused primarily on figures and demonstrations. Our figures focused on the intuition behind ICA and how leveraging matrices gives us a powerful way of interacting with the real world through math. We coded up a demonstration in Matlab that solved the cocktail problem for a variety of auditory inputs pulled from movies.

Having this class outreach event was a very exciting opportunity for us to talk about topics that we’re passionate about in inclusive, educational communities. Oftentimes, the technical community of information theory and the broader community of engineering can feel siloed from the very people they are meant to serve. This event was not only a powerful symbolic way to break down this barrier, but also a practical one that allowed us to present to enthusiastic children who will probably outshine us some day. Moving forward, this class has been a motivating factor to get more involved in STEM educational opportunities like Stanford Splash.


  1. Motahari, A. S., Bresler, G. & Tse, D. N. C. Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory 59, 6273–6289 (2013).
  2. González-García, J. S. & Díaz, J. Information theory and the ethylene genetic network. Plant Signaling & Behavior 6(10), 1483–1498 (2011).
  3. Grosse, I., Herzel, H., Buldyrev, S. V. & Stanley, E. Species independence of mutual information in coding and noncoding DNA. Phys. Rev. E 61, 5624–5629 (2000). PMID: 11031617.
  4. Ng, A. CS229 Lecture Notes Part XII: Independent Components Analysis. Stanford University (2016).
