Analyzing Mutations of SARS-CoV-2 Variants of Concern

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Jenny Lam, Andrew Hong, Riley Kong, HoJoon Lee, Stephanie Greer


The SARS-CoV-2 virus was discovered in late December of 2019, with more than 200 million cases of COVID-19 detected as of August 2021. Because SARS-CoV-2 is an RNA virus, it is highly susceptible to mutations in its genome. Some mutations give the virus a replicative advantage and allow it to pose a greater threat to humans, leading to variants of concern such as the Alpha, Beta, Gamma, and Delta variants. Combatting COVID-19 depends heavily on understanding the mutations of SARS-CoV-2 and how they spread. In this study, we analyzed different aspects of the COVID-19 genomic metadata from GISAID to identify patterns in the variants of concern and in mutations in the spike protein. We focused on mutation frequency over time, the possibility of combinations of variants of concern, and visualizing mutation frequency in the spike protein. Through this analysis, we identified patterns that can be used to further investigate the virus and set a framework for more in-depth analysis of SARS-CoV-2 and other viruses in the future. We found that mutation frequencies over time in the Delta variant displayed a different pattern, with mutations increasing in frequency more slowly, and that variants with combinations of mutations from two variants of concern are most likely not a threat. We also identified positions in the spike protein that could be a potential focus for future treatment.

1. Background

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the RNA virus that causes the global pandemic of coronavirus disease 2019 (COVID-19), has led to over 200 million confirmed cases and 4 million deaths worldwide [1] as of August 13, 2021. As a member of the family of coronaviruses, SARS-CoV-2 contains four primary structural proteins, the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins [2]. The S protein, located on the surface of the virus, plays an important role in viral infection, as it is the part that attaches the virus onto the receptor sites of the host cells. The S protein mediates receptor binding and the fusion of viral and host cell membranes in order for the virus to release its genome into the host cell and replicate itself [3].

As the virus passes on its genome and spreads, errors in replication of its genome result in mutations, giving rise to variants, or different strains of the virus. The majority of mutations are benign, but several, especially amino acid changes that affect the S protein, can affect the virus’s transmissibility and virulence. The World Health Organization (WHO) and Centers for Disease Control and Prevention (CDC) have established criteria to categorize SARS-CoV-2 variants based on threat level. These designations include variants of interest (VOIs), variants of concern (VOCs), and variants of high consequence (VOHCs). This study primarily focuses on VOCs, which are characterized by evidence of increased transmissibility, reduction of neutralization by antibodies from vaccination or previous infection, reduced effectiveness of treatment, or reduced accuracy in COVID-19 detection [4]. These variants are characterized by specific amino acid substitutions in the S protein. As of August 13, 2021, four VOCs have been identified and are known as the Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), and Delta (B.1.617.2) variants.

Genomic sequencing is a process used to identify the order of nucleotides in a genome. It is an essential part of understanding and combating COVID-19, as it allows researchers to identify mutations in the genome and to see when new variants of the virus arise. This allows public health authorities to be better prepared and potentially to limit the spread of new or current variants. It may also enable treatment options in the future by revealing which areas of the SARS-CoV-2 genome are essential so that those areas can be targeted.

2. Methods

2.1 k-mer Mutation Table

We analyzed the SARS-CoV-2 reference genome by creating a mutation table consisting of every possible amino acid substitution, as well as unique k-mers, which are substrings of the genome of length k associated with each possible substitution. Our k-mer mutation table uses the SARS-CoV-2 reference genome (accession number: NC_045512.2) from the National Center for Biotechnology Information (NCBI) [5]. For each position in the reference genome, we applied all possible substitution mutations to generate an altered genomic sequence, which we then used to produce mutant 10- and 20-mers. Correspondingly, we described each mutation with the resulting amino acid change and the relevant gene. We created a Python script in Google Colaboratory, an online collaborative programming environment, to generate a .csv file containing this information. Each row in the resulting file corresponds to one substitution mutation in the genome (Figure 1).
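The table construction described above can be illustrated with a minimal sketch. It is not the authors' script: the function name `substitution_rows`, the row fields, the toy genome, and the choice to take one k-mer window roughly centered on the mutated position are all assumptions made for illustration. It also does not check that each k-mer is unique within the genome, which the real table would need.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def substitution_rows(genome, gene, gene_start, gene_end, k=10):
    """Return one row per single-nucleotide substitution inside a gene.

    gene_start and gene_end are 0-based and end-exclusive (hypothetical
    convention); the gene length is assumed to be a multiple of 3.
    """
    rows = []
    for pos in range(gene_start, gene_end):
        # Locate the codon that contains this nucleotide.
        codon_start = gene_start + 3 * ((pos - gene_start) // 3)
        ref_aa = CODON_TABLE[genome[codon_start:codon_start + 3]]
        for alt in "ACGT":
            if alt == genome[pos]:
                continue
            mutant = genome[:pos] + alt + genome[pos + 1:]
            alt_aa = CODON_TABLE[mutant[codon_start:codon_start + 3]]
            # Assumed window: k bases around the mutation, clipped to the genome.
            start = max(0, min(pos - k // 2, len(genome) - k))
            rows.append({
                "position": pos + 1,  # 1-based genome coordinate
                "ref": genome[pos],
                "alt": alt,
                "gene": gene,
                "aa_change": f"{ref_aa}{(pos - gene_start) // 3 + 1}{alt_aa}",
                "kmer": mutant[start:start + k],
            })
    return rows


# Toy 15-nucleotide "genome" encoding M-D-Y-K-A (not real viral sequence).
toy_genome = "ATGGATTACAAGGCT"
table = substitution_rows(toy_genome, "toy", 0, len(toy_genome), k=10)
```

Each row could then be written to a .csv file with the standard `csv.DictWriter`; running the same function with `k=20` on a long enough sequence would give the 20-mers.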

Figure 1: First Several Lines of the SARS-CoV-2 k-mer Mutation Table

2.2 Metadata Analysis

Our study utilizes COVID-19 genomic metadata (updated August 6, 2021) with over 2.6 million cases from the Global Initiative on Sharing Avian Influenza Data (GISAID), a publicly available reference database, in order to characterize the VOCs and S protein mutations through data such as collection date, collection location, patient age, patient gender, variant, and amino acid substitutions. The metadata also includes additional information for each reported case, which was not used for our research, namely the virus name, type, accession ID, additional location information, sequence, host, clade, Pango lineage, if the genome is a reference, if the genome is complete, if the genome is high coverage or low coverage, N-content, and GC-content.

To parse through the metadata, we used Python with Google Colaboratory, and utilized libraries such as Matplotlib and Pandas to generate graphs to visualize the data (Figure 2).

Figure 2: Code snippet utilizing Matplotlib and Pandas

Our code can be found at

3. Results

3.1 Overview of S Protein Mutations in VOCs Over Time

For each VOC, we created a graph that showed the relative frequency of S protein mutations over several months (Figures 3-6). The data points for each variant were determined by dividing the number of samples that contained the mutation by the total number of samples for each month. The graphs do not display every mutation found in the variants; they only include the most frequent mutations for each variant. Months with a small sample size (< 30 cases for that variant), which were the months near the start of the COVID-19 pandemic, were excluded from the graphs.
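The calculation described above can be sketched as follows. The study used Pandas and Matplotlib; this dependency-free version uses only the standard library, and the record fields (`date`, `variant`, `mutations`) and the function name are hypothetical stand-ins for the corresponding metadata columns.

```python
from collections import Counter, defaultdict


def monthly_relative_frequency(records, variant, min_samples=30):
    """Relative frequency of each S protein mutation per month for one variant.

    records: iterable of dicts with 'date' (YYYY-MM-DD), 'variant', and
    'mutations' (a set of strings such as 'N501Y'). These field names are
    assumptions, not the actual GISAID column names.
    """
    totals = Counter()           # month -> number of samples of this variant
    hits = defaultdict(Counter)  # month -> mutation -> samples carrying it
    for rec in records:
        if rec["variant"] != variant:
            continue
        month = rec["date"][:7]
        totals[month] += 1
        for mutation in rec["mutations"]:
            hits[month][mutation] += 1
    # Skip months below the sample-size cutoff (30 cases in the text above).
    return {
        month: {mut: n / totals[month] for mut, n in hits[month].items()}
        for month in sorted(totals)
        if totals[month] >= min_samples
    }
```

The returned dictionary maps each month to the per-mutation frequencies, which could be passed to a plotting library to draw one line per mutation.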

Nine mutations remain dominant in the Alpha variant: D614G, A570D, P681H, T716I, D1118H, S982A, N501Y, H69del/V70del, and Y144del. Similarly, the Gamma variant has several substitutions, including D614G, H655Y, N501Y, V1176F, T1027I, E484K, P26S, L18F, K417T, T20N, D138Y, and R190S, that were prominent near its rise and have been maintained among the Gamma variant genomes. However, the graphs show greater variation in the relative mutation frequencies of the Beta and Delta variants. Prominent S protein mutations in the Beta variant include D614G, A701V, K417N, D80A, E484K, D215G, L242del/A243del/L244del, and L18F. In contrast to the other mutations listed, D80A, D215G, and L18F were at a low relative frequency in September 2020 but have since become dominant among Beta variant genomes. This increase may suggest that these mutations, along with the mutations that hold a constant high relative frequency, give the virus a replicative advantage and are beneficial to its survival. L242del/A243del/L244del, the deletion of amino acid positions 242-244 of the S protein, appears to decrease in relative frequency among Beta variant genomes. Mutations that decrease in relative frequency over time may no longer give the virus a replicative advantage.

Relative mutation frequency over time for the Delta variant is less consistent, especially compared to the Alpha and Gamma variants. This inconsistency may have made the variant more difficult to track until more cases had been sequenced, allowing it to infect many people before it could be properly detected. It may also reflect a high rate of mutation that changed the virus frequently, potentially making it harder for existing antibodies to recognize. Overall, the relative frequencies of D614G, P681R, T19R, L452R, T478K, D950N, F157del/F158del, and E156G increase until they become dominant mutations for the variant, with G142D and T95I present in some sequences. However, the smaller sample sizes in the earlier months compared to the later months may have affected the relative frequencies.

In general, we also noticed that variants that become dominant in an area and then decline do not tend to reappear in the population. This pattern suggests that mutations are unlikely to resurface once they have fallen from dominance, which can help predict the behavior of mutations found in currently dominant variants. The decline could be due to the mutations no longer providing a beneficial adaptation as a result of environmental changes or improvements in treatment, and is a potential topic of further investigation.

3.2 Identifying Combination of VOCs

To assess whether variants that have mutations from multiple VOCs are a possible threat, we analyzed the prevalence of cases in the COVID-19 metadata that were labeled as a VOC in the metadata, but contained mutations of another VOC that are hypothesized to have the strongest effect on the virus or are of the most concern. The mutations chosen for each variant are as follows: Alpha: P681H, N501Y, H69del/V70del [6], Beta: K417N, N501Y, E484K [7], Gamma: K417T, N501Y, E484K [8], and Delta: P681R [9], L452R [10], T478K [11].
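A minimal sketch of this filtering step is shown here, using only the standard library. The signature sets follow the mutations listed above; the record fields (`variant`, `date`, `country`, `mutations`), the function name, and the choice to store H69del/V70del as two separate strings are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Signature mutations chosen for each VOC, as listed in the text above.
SIGNATURES = {
    "Alpha": {"P681H", "N501Y", "H69del", "V70del"},
    "Beta": {"K417N", "N501Y", "E484K"},
    "Gamma": {"K417T", "N501Y", "E484K"},
    "Delta": {"P681R", "L452R", "T478K"},
}


def find_combinations(records):
    """Count cases labeled as one VOC that carry another VOC's full signature.

    records: iterable of dicts with hypothetical fields 'variant', 'date'
    (YYYY-MM-DD), 'country', and 'mutations' (a set of strings).
    Returns (counts, countries), both keyed by (label, other VOC, month).
    """
    counts = Counter()
    countries = defaultdict(set)
    for rec in records:
        label = rec["variant"]
        if label not in SIGNATURES:
            continue
        for other, signature in SIGNATURES.items():
            # A case qualifies only if every signature mutation is present.
            if other != label and signature <= rec["mutations"]:
                key = (label, other, rec["date"][:7])
                counts[key] += 1
                countries[key].add(rec["country"])
    return counts, countries
```

Requiring the full signature set is the stricter reading of the method; cases carrying only one mutation from another VOC would not be counted by this sketch.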

| Variant + Mutations | Number of Cases (Countries) |
| --- | --- |
| Alpha + K417N, N501Y, E484K (Beta) | 3/21: 5 (Sweden, USA); 4/21: 7 (Croatia); 5/21: 2 (USA, France) |
| Alpha + K417T, N501Y, E484K (Gamma) | 4/21: 2 (USA); 5/21: 11 (USA) |
| Alpha + P681R, L452R, T478K (Delta) | 5/21: 1 (Netherlands); 6/21: 2 (USA, Czechia); 7/21: 7 (Czechia, Spain, South Korea, Sweden, France) |
| Beta + P681H, N501Y, H69del/V70del (Alpha) | 1/21: 1 (Israel); 2/21: 1 (Bosnia and Herzegovina); 3/21: 1 (USA); 5/21: 5 (USA, Belgium) |
| Beta + K417T, N501Y, E484K (Gamma) | 1/21: 1 (Chile); 3/21: 5 (USA, Turkey); 4/21: 13 (USA, Germany, Colombia); 5/21: 11 (USA); 6/21: 1 (Mexico); 7/21: 1 (Ecuador) |
| Beta + P681R, L452R, T478K (Delta) | 6/21: 2 (Botswana); 7/21: 2 (Botswana) |
| Gamma + P681H, N501Y, H69del/V70del (Alpha) | 5/21: 3 (USA) |
| Gamma + K417N, N501Y, E484K (Beta) | 3/21: 1 (Turkey) |
| Gamma + P681R, L452R, T478K (Delta) | None |
| Delta + P681H, N501Y, H69del/V70del (Alpha) | 3/21: 5 (Germany); 4/21: 3 (United Kingdom, USA); 5/21: 5 (Germany, United Kingdom, South Africa); 6/21: 4 (USA, United Kingdom, South Korea); 7/21: 1 (Spain) |
| Delta + K417N, N501Y, E484K (Beta) | 3/21: 1 (Sweden) |
| Delta + K417T, N501Y, E484K (Gamma) | 5/21: 2 (USA, Luxembourg); 6/21: 5 (USA, Canada) |

Figure 7: Table of the number of combinations of VOCs. The first column lists the combination, given by the ‘variant’ attribute of the metadata plus the S protein mutations of another VOC; the second column lists the number of sequences for each month with more than 0 sequences, as well as the countries the sequences were found in.

Overall, there were very few sequences that contained a combination of two VOCs as defined above. These particular combinations do not seem to stay in the population or grow past 15 cases in one month. This suggests that combinations of VOCs may not pose a serious threat to humans, as they have not spread widely, and that these specific combinations of mutations do not give the virus a replicative advantage or aid in its transmissibility. Although there is little evidence of the impact of these variants, more data would give a better understanding of how these strains behave. However, there are several emerging strains of the VOCs that include one additional mutation that is characteristic of another VOC. One such example is the Delta Plus (AY.1) variant found in India, which has acquired the K417N mutation, one that is also found in the Beta variant [12]. No current evidence suggests that Delta Plus will pose a larger threat than Delta, but it is too early to assess the risk of this variant. In addition to Delta Plus, several hundred Delta variant cases in the GISAID metadata have been found with H69/V70 deletions, mutations that are also found in the Alpha variant. Furthermore, several hundred Alpha variant cases contain the E484K mutation characteristic of the Beta and Gamma variants, and several hundred Gamma variant cases contain the P681H mutation, a dominant mutation in the Alpha variant [13]. In general, many newly discovered variants contain many of the same mutations found in older variants, suggesting that most of the mutations that provide the most effective adaptations may have already been discovered.

3.3 Graphing Mutation Frequency by Position in S Protein

We created a log plot of the relative frequency of mutations at each position in the S protein to get a better visual understanding of S protein mutations by position (Figure 8). The figure is color-coded according to the value of each frequency. For each position and month, we divided the number of cases that had a mutation at that position by the total number of cases in that month. The maximum of these monthly frequencies over all months was used as the relative frequency for each position. In addition, we created a similar graph of the total number of mutations at each position (Figure 10).

Figure 8: Relative mutation frequencies (f) at each position in the S protein (red = f>50%, yellow = 5%<f<50%, green = 1%<f<5%, blue = f<1%)
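The per-position calculation and the color bins in Figure 8 can be sketched as follows, using only the standard library. The record fields and function names are hypothetical, and mutation labels are assumed to contain the amino acid position as their only run of digits (for example `N501Y` or `H69del`).

```python
import re
from collections import Counter, defaultdict


def position_max_frequency(records):
    """Maximum monthly mutation frequency at each S protein position.

    records: iterable of dicts with hypothetical fields 'date' (YYYY-MM-DD)
    and 'mutations' (a set of strings such as 'N501Y' or 'H69del').
    """
    totals = Counter()           # month -> total cases
    hits = defaultdict(Counter)  # position -> month -> cases mutated there
    for rec in records:
        month = rec["date"][:7]
        totals[month] += 1
        # Use a set so a case counts at most once per position.
        positions = {int(re.search(r"\d+", m).group()) for m in rec["mutations"]}
        for pos in positions:
            hits[pos][month] += 1
    return {
        pos: max(n / totals[month] for month, n in by_month.items())
        for pos, by_month in hits.items()
    }


def color_bin(f):
    """Color bins from the Figure 8 caption."""
    if f > 0.5:
        return "red"
    if f > 0.05:
        return "yellow"
    if f > 0.01:
        return "green"
    return "blue"
```

Positions that never appear in any record are absent from the returned dictionary; in a full analysis they would be treated as having a frequency of zero.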

| Relative mutation frequency (f) and color from Figure 8 | No. of positions | Amino acid positions in S protein |
| --- | --- | --- |
| f > 50% (red) | 18 | 19, 69, 70, 142, 144, 156, 157, 158, 452, 478, 501, 570, 614, 681, 716, 950, 982, 1118 |
| 5% < f < 50% (yellow) | 18 | 5, 18, 20, 26, 27, 95, 138, 152, 190, 222, 417, 477, 484, 655, 792, 909, 1027, 1176 |
| 1% < f < 5% (green) | 37 | 8, 12, 13, 21, 49, 54, 80, 98, 153, 189, 215, 242, 243, 244, 251, 253, 262, 272, 367, 439, 583, 675, 677, 701, 719, 732, 769, 772, 780, 859, 936, 957, 1163, 1167, 1191, 1263, 1264 |

Figure 9: Table of all positions with a relative mutation frequency of greater than 1% (red, yellow, green)
Figure 10: Log plot of total mutations at each position in the S protein

There are 18 positions in the S protein whose relative mutation frequency has exceeded 0.5; they are labeled in red in Figure 8. The mutations at these positions are found in the Alpha and Delta variants, most likely because the global spread of those variants carried their mutations to high frequency. This suggests that these mutations were linked to the proliferation of the Alpha and Delta variants. There are also 18 positions in the S protein with a relative frequency between 0.05 and 0.5, labeled in yellow in Figure 8. These are mainly found in the Gamma variant, with some from the Beta variant. A further 37 positions, labeled in green, had a relative mutation frequency between 0.01 and 0.05. Some of these positions are not associated with any specific VOC, meaning that their mutations could be either benign or helpful to every type of variant, which could be investigated through further research. These positions could also be sites of key mutations for future VOCs, so they should be investigated carefully in order to better understand their mutation patterns.

The vast majority of positions had a relative mutation frequency of less than 1%, suggesting that many of these positions are important to SARS-CoV-2 function. Importantly, some positions had virtually no mutations: position 492 had a relative frequency of 0.000017, 1105 had 0.000018, 488 had 0.000021, 379 had 0.000023, and 56 had 0.000024. The median relative frequency of all positions is 0.00034. Because these positions had a much lower relative frequency than the median, they may be essential to the virus and could thus be targeted for future treatments, since mutations there would likely disrupt the function of the S protein.

Analysis of general regions of high mutation rates indicates that one of the areas with the highest number of total mutations spans positions 350-500, which lie within the receptor-binding domain (RBD) [14]. This is the area of the spike protein that allows entry into the human cell and causes a person to become infected. Mutations in this region can allow for higher rates of infection, which makes the comparatively large number of mutations there reasonable. Furthermore, the first 250 positions of the S protein also contain a very high number of mutations. This may indicate another area of high importance to the virus that merits investigation, as mutations in this region may similarly allow higher rates of infection.

4. Conclusions

Through analysis of COVID-19 sequencing metadata from GISAID, we were able to visualize the relative mutation frequencies of each VOC over time, investigate whether combinations of mutations from different VOCs pose a possible threat, and depict mutation frequencies by graphing each position in the S protein. We found that the relative frequencies of mutations in the Delta variant were less consistent over time compared to the other variants and that its mutations took longer to become dominant. The variability of these relative frequencies may have posed challenges to detection early on. We also found that combinations of mutations from two variants of concern most likely do not pose a threat to humans due to their low prevalence and inability to stay in the population. Although the mutations in the VOCs may increase the transmissibility and/or infectivity of the virus, a combination of them may not give the virus a stronger evolutionary advantage compared to the VOCs on their own. Additionally, regions of the S protein with a high mutation rate could indicate regions that allow for higher infection rates, while regions with a low mutation rate could indicate regions integral to the function of the virus. Positions in the S protein that have a relative frequency significantly lower than the median relative frequency, such as 492, 1105, and 488, may be essential to the function of the S protein and could be potential targets for treatment and vaccination.

5. Future Directions

For future research, we hope to leverage k-mers to more efficiently and accurately identify mutations in SARS-CoV-2 genomes, as well as identify and analyze insertions and deletions, since using k-mers is more efficient than traditional sequence alignment methods. This would be accomplished by utilizing the k-mer table we have already created as a base for a method of comparison between mutations in the genome.
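One way such a comparison might work is an exact-match lookup of mutant k-mers in a sample sequence, with no alignment step. The sketch below is only an illustration of that idea, not an implemented part of this study: the function name, the index layout (mutant k-mer mapped to a mutation label), and the toy sequences are all hypothetical, and it assumes every indexed k-mer is absent from the reference genome.

```python
def scan_for_mutations(sequence, kmer_index, k):
    """Return the labels of all indexed mutant k-mers found in a sequence.

    kmer_index: dict mapping a mutant k-mer to a mutation label
    (hypothetical layout; all keys must have length k).
    """
    found = set()
    # Slide a window of length k across the sequence and look each one up.
    for i in range(len(sequence) - k + 1):
        label = kmer_index.get(sequence[i:i + k])
        if label is not None:
            found.add(label)
    return found


# Toy example: one mutant 10-mer and two short made-up sequences.
toy_index = {"ATGAATTACA": "D2N"}
mutated_sample = "ATGAATTACAAGGCT"
reference_like_sample = "ATGGATTACAAGGCT"
detected = scan_for_mutations(mutated_sample, toy_index, k=10)
```

Because each window lookup is a dictionary access, the scan is linear in the sequence length. A sequencing error or a second nearby mutation inside the window would cause a miss, which is one limitation an exact-match approach would have to address.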

In addition, we hope to be able to further analyze patterns of SARS-CoV-2 mutations in conjunction with various meteorological factors in order to examine how they affect mutation frequency, spread and other factors of SARS-CoV-2. Investigating the weather during specific months or days in various geographical regions would greatly expand the specificity of our knowledge of the effects that weather patterns have on mutations.

6. Acknowledgements

We would like to thank our mentors HoJoon Lee and Stephanie Greer for their guidance throughout this research project. We would also like to thank Cindy Nguyen, Professor Tsachy Weissman, and everyone who helped coordinate the events that made the STEM to SHTEM program possible and gave us our first steps into the world of academic research.

We also gratefully acknowledge the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based (see attached PDFs:

7. References

  1. WHO Coronavirus (COVID-19) Dashboard. (n.d.). Retrieved August 13, 2021, from
  2. Astuti, I., & Ysrafil. (2020). Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response. Diabetes & Metabolic Syndrome, 14(4), 407–412.
  3. Huang, Y., Yang, C., Xu, X., Xu, W., & Liu, S. (2020). Structural and functional properties of SARS-CoV-2 spike protein: Potential antivirus drug development for COVID-19. Acta Pharmacologica Sinica, 41(9), 1141–1149.
  4. CDC. (2020, February 11). Coronavirus Disease 2019 (COVID-19). Centers for Disease Control and Prevention.
  5. Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome (1798174254; Version 2). (2020). [Data set]. NCBI Nucleotide Database.
  6. B.1.1.7: What We Know About the Novel SARS-CoV-2 Variant. (n.d.). ASM.Org. Retrieved August 13, 2021, from
  7. Corum, J., & Zimmer, C. (2021, January 18). Inside the B.1.1.7 Coronavirus Variant. The New York Times.
  8. Hirotsu, Y., & Omata, M. (2021). Discovery of a SARS-CoV-2 variant from the P.1 lineage harboring K417T/E484K/N501Y mutations in Kofu, Japan. The Journal of Infection, 82(6), 276–316.
  9. Saito, A., Nasser, H., Uriu, K., Kosugi, Y., Irie, T., Shirakawa, K., Sadamasu, K., Kimura, I., Ito, J., Wu, J., Ozono, S., Tokunaga, K., Butlertanaka, E. P., Tanaka, Y. L., Shimizu, R., Shimizu, K., Fukuhara, T., Kawabata, R., Sakaguchi, T., … Sato, K. (2021). SARS-CoV-2 spike P681R mutation enhances and accelerates viral fusion (p. 2021.06.17.448820).
  10. Motozono, C., Toyoda, M., Zahradnik, J., Saito, A., Nasser, H., Tan, T. S., Ngare, I., Kimura, I., Uriu, K., Kosugi, Y., Yue, Y., Shimizu, R., Ito, J., Torii, S., Yonekawa, A., Shimono, N., Nagasaki, Y., Minami, R., Toya, T., … Sato, K. (2021). SARS-CoV-2 spike L452R variant evades cellular immunity and increases infectivity. Cell Host & Microbe, 29(7), 1124-1136.e11.
  11. Giacomo, S. D., Mercatelli, D., Rakhimov, A., & Giorgi, F. M. (2021). Preliminary report on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike mutation T478K. Journal of Medical Virology, 93(9), 5638–5643.
  12. SARS-CoV-2 variants of concern and variants under investigation. (n.d.). 71.
  13. Emergence and spread of SARS-CoV-2 P.1 (Gamma) lineage variants carrying Spike mutations 𝚫141-144, N679K or P681H during persistent viral circulation in Amazonas, Brazil—SARS-CoV-2 coronavirus / nCoV-2019 Genomic Epidemiology. (2021, July 4). Virological.
  14. What is a Receptor-Binding Domain (RBD)? (2020, July 6). News-Medical.Net.
