Identifying and Quantifying Differences Among SARS-CoV-2 Genomes Using K-mer Analysis

Journal for High Schoolers


Aarush Aitha, Connie Hong, Adarsh Hiremath, HoJoon Lee, Stephanie Greer, and Dmitri Pavlichin


K-mers, nucleotide subsequences of length “k” nucleotides, impart valuable insight into the attributes of genomes. By analyzing and comparing k-mers and their genetic makeup, differences among distinct genomes can be uncovered. Although identifying variance among genomes is of much scientific interest, conventional methods utilize outdated alignment technologies, making them inefficient. Our study utilizes an analysis of the k-mers in SARS-CoV-2 genomes across the world to demonstrate the viability of k-mers in pinpointing this genomic variance. 

We constructed a clustering model using genomic distance matrices encapsulating different techniques (L1, L2, Jaccard, and Multiple Sequence Alignment). Utilizing these models, it was found that a large proportion of k-mer mutations encoded for the coronavirus ORF1ab non-structural protein 2 (NSP2). This protein plays a significant role in the contagiousness of SARS-CoV-2.

Out of all the distance metrics, the L1 and Jaccard distance calculations were the most accurate. K-mer based distance calculations were inaccurate when applied to mutations in close proximity with one another compared to traditional alignment softwares. However, the k-mer based approach was more efficient for difference estimates. Clusters using k-mer distance matrices were more closely correlated with NCBI’s GISAID labels than the multiple sequence alignment data produced by MAFFT, a traditional alignment software.

1. Background

SARS-CoV-2, also known as COVID-19, is a contagious and lethal virus. The genetic makeup and structure of the SARS-CoV-2 virus is of paramount interest: studies have revealed SARS-CoV-2 is now suited to bind to the ACE2 human receptor due to mutations. 

A genome contains an organism’s complete set of DNA, while a genome sequence contains an organism’s complete set of nucleotides (A, C, T, and G). In the case of a viral genome, like that of SARS-CoV-2, the thymine nucleotide is replaced with a uracil nucleotide. Sequence reads, a mere fragment of an organism’s base pairs, can provide insight into the characteristics of an organism. This study uses k-mers, nucleotide substrings of length “k”, to analyze the genetic information within the SARS-CoV-2 genome. 

Sequencing DNA from multiple samples allows the creation of a SARS-CoV-2 reference genome (available through GISAID and the CDC). Sample reads of the SARS-CoV-2 genome from different sources were created through reading the complementary DNA strands that were generated from using reverse transcriptase enzymes to “copy” the RNA of the viral cells. Although reference genomes do not accurately represent the set of genes of any organism, they allow comparison between the available SARS-CoV-2 genomes. Comparing SARS-CoV-2 reads from similar provinces, for example, to the reference genome could yield locational information about mutations (substitutions, insertions, or deletions) within the DNA. Alternatively, comparing individual genomes with one another could reveal the average mutation rate of SARS-CoV-2. Ordinarily, comparing the DNA subsequences of different genomes is extremely difficult. Traditional alignment softwares (such as BLAST, MEGA, and MAFFT) require large amounts of time, memory, and computing power to individually compare base pairs.12  JELLYFISH, a tool for memory-efficient counting of k-mers, allows users to output k-mer counts in a binary or text format. 

Structurally, the spike protein of SARS-CoV-2 is vulnerable to mutations, which scientists have already identified in 6 unique amino acids within the receptor-binding domain (RBD) of SARS-CoV-2. Computational analyses have demonstrated the initial interactions between the RBD of SARS-CoV-2 and the ACE2 receptor of human cells are not ideal1. However, newer research has confirmed the highly effective binding method employed by SARS-CoV-2 is the result of natural selection and adaptation. SARS-CoV-2 mutations make the virus difficult to study, increase its affinity for the ACE2 receptor, and cause higher peer-to-peer transmission rates compared to the SARS-CoV virus2

2. Methods and Materials

2.1 Preliminary Data Collection

Collecting the relevant information about SARS-CoV-2 required retrieving a file with all 3970 genomes (from GISAID and the CDC), splitting each of the genomes into individual .fasta files, and eliminating any files improperly formatted or containing genetic information outside the human species (bat and pangolin). Organizing the data set was a prerequisite to creating mutation graphs, determining the L1/L2 distances, using clustering algorithms, and finalizing the Jaccard distance matrix.

[‘EPI_ISL_402131.fa’, ‘EPI_ISL_408514.fa’, ‘EPI_ISL_408515.fa’, ‘EPI_ISL_410721.fa’, ‘EPI_ISL_417024.fa’, ‘EPI_ISL_418206.fa’, ‘EPI_ISL_418207.fa’, ‘EPI_ISL_418208.fa’, ‘EPI_ISL_418209.fa’, ‘EPI_ISL_418796.fa’, ‘EPI_ISL_418810.fa’, ‘EPI_ISL_419529.fa’, ‘EPI_ISL_419530.fa’, ‘EPI_ISL_419531.fa’, ‘EPI_ISL_419532.fa’, ‘EPI_ISL_419533.fa’, ‘EPI_ISL_419534.fa’, ‘EPI_ISL_419535.fa’, ‘EPI_ISL_419536.fa’, ‘EPI_ISL_419537.fa’, ‘EPI_ISL_419538.fa’, ‘EPI_ISL_419539.fa’, ‘EPI_ISL_419540.fa’, ‘EPI_ISL_419561.fa’]

Figure 1: Names of Removed Genomes from Testing Set

The JELLYFISH command line software enabled processing the data for each individual read.8 Individual JELLYFISH commands were performed on each of the 3970 sequence files with assistance from the Python OS library and bash scripts. “K” values of 25, 29, and 31 were used to increase the accuracy of data collection, visualize trends among sequence substrings of different length, and generate pairwise distances between each genome in the dataset.

Performing a JELLYFISH “dump” command on the combined read file (highCov_complete_0409_GISAID.fasta) yielded the set of distinct k-mers occurring in the entire set of 3970 genomes. After parsing through the read file and dump files for each individual genome with Python, the data was stored in a .csv file with the file name of each genome and the k-mer sequence on the horizontal and vertical axes. Each cell in the body of the .csv file contains the count of each k-mer within each individual genome (Figure 2).

Figure 2: .CSV File Storing 25-mer Counts for each Genome

2.2 L1 and L2 Distance

To quantify the differences between genomes, L1 and L2 distances (also known as the Taxicab and Euclidean distances) were calculated for pairs of genomes. Distances measure the lengths of various vector spaces; the L1 norm calculates the sum of the absolute value of the vectors while the L2 norm calculates the square root of the sum of absolute values of the vectors squared. When applied to the SARS-CoV-2 data, the L1 norm produced data about how “unique” each genome was, while the L2 norm enabled comparison between an individual genome and the reference genome.  

To calculate the L1 and L2 distances, each row of Figure 2 was treated as a i * 1 vector, where i was equal to the number distinct k-mers (the k-mer count of each genome). The reference genome from Wuhan was compared with every individual genome, with the absolute difference between them calculated with the distinct k-mers recorded in Figure 2. Thus, every genome has its own row as one vector (Figure 3).  

Figure 3: .CSV of Dimension 3970×3970 Containing L1 Distance Values Between Pairs of Genomes

To calculate the L1 norm, the equation L1=n=0iabs(xn-yn)was implemented on the two genome’s k-mer count vectors. The first genome was represented in the form [x0, x1, … Xi] while the second genome was represented in the form [ y0, y1, … yi] . Each of the values in these arrays corresponded to the number of occurrences of a certain k-mer in that genome. To calculate the L2 norm, slight modifications were made to the L1 norm equation. The formula L2=n=0i(xn-yn)2was applied to determine the L2 norm distance between two genomes. 

Code to calculate the L1 and L2 norm differences was created using Python and the values were recorded in a .csv file of dimensions 3970×3970. In Figure 3, each row and column represents a single genome read, while each cell contains the distance between the two genomes. 

2.3 Jaccard Distance Matrix

The Jaccard Index, the size of the intersection of two sets divided by the union of the two sets, provides another powerful method for quantifying similarities and differences between two genomes. Put in equation form, the Jaccard Index can be denoted as Jaccard Index = |A\cap B| / |A\cup B| for any two sets A and B. The Jaccard Index can be denoted as a decimal between 0 and 1 (0% and 100%); the greater the index, the more similar two sample sizes are.9 

When applied to k-mer analysis, the Jaccard Index can be calculated as the size of the common k-mers between two genomes (A and B) divided by union of the two k-mer sets. Using Python, the Jaccard Index formula was implemented for all 15,570,916 (39462) possible pairs of genomes. The results were stored in a .csv file with the file name of genome A on the vertical axis and the file name of genome B on the horizontal axis. The Jaccard Index between any two genomes is located in the cells of the body of the .csv file; the cell where genome A’s column and genome B’s row meet contains the Jaccard Index value for genome A and B (Figure 4). 

Figure 4: .CSV Containing Jaccard Difference Coefficients for Pairs of Genomes

-NOTE: The diagonal should have differences of 0, not 1. This was remedied before clustering began.

2.4 Traditional Sequence Alignment

Compared to k-mer analysis, traditional sequence alignment is inefficient. To verify this, the SARS-CoV-2 genome was compared with other coronaviruses such as MERS, SARS-CoV, and the bat coronavirus to draw conclusions about different traits. The Needleman-Wunsch algorithm, an alignment algorithm traditionally used to align nucleotides, took much longer to implement than the k-mer approach. 

To measure differences, the number of positions that were not equal to one another were logged. For example, the difference between “ACTG-ATTCA” and “AC-GCA-TCA” would be 3. The original length of each genome was logged in the leftmost vertical column of a chart, while the topmost horizontal column logged the type of coronavirus. Inside the body of the chart, the number inside the parentheses logged the length of the aligned sequences. This made it possible to determine the amount the alignment algorithm increased the length by. 

Bat CoronavirusMERSSARS-CoV-2SARS-CoV
Bat Coronavirus(29388)013638 (34033)13334 (33765)13397 (33691)
MERS(30549)013625 (34502)13585 (34355)
SARS-CoV-2(30330)06136 (30924)

2.5 Distance Matrices and Heat Maps

The distance matrices were visualized in a heat map (Figures 5, 6, and 7) with the help of various Python libraries (seaborn, matplotlib, and scipy).  

Figure 5: MSA Alignment Distance Matrix

The dendrograms on the top and left sides of the heatmap were generated using hierarchical clustering with complete linkage on the sequence alignment distance matrix. The branches and “spikes” on the dendrograms depict dissimilar genomes which do not belong to the same group at a particular node. The genomes continually filtered out until 3946 groups were created with one genome per group. 

The heat map itself illustrates the degree of pairwise differences between the genomes. According to the legend (in the top left corner), the lighter shades of pink mean the two compared genomes possess more differences; this captures the distance matrix data for the most dissimilar genomes. The consistency between the dendrogram branching off into different groups above the lighter columns and to the left of lighter rows proves the clustering algorithm determined the genomes were sufficiently different from one another. Additionally, the diagonal from the top left corner to the bottom right corner of the heat map is completely black, demonstrating the identical genomes which have a difference value of “0”. A similar heat map was generated for the distance matrices containing the Jaccard differences (Figure 6) and the L1 norm differences (Figure 7). In the aforementioned examples, the heat maps are formatted identically with the indexes of the genomes (as opposed to the file names) listed on the vertical axis. 

Figure 6: Jaccard Distance Matrix (k=25)

Figure 7: L1 Norm Distance Matrix (k=25)

2.6 Genomic clustering: K-means, Agglomerative Algorithms, and GISAID Classification

Clustering was performed on the 3946 genome reads with the k-means function within Python 3’s SKlearn Library. The clustering approach was done with four different genome distance values (L1, L2, Jaccard, and Multi-Sequence Alignment) and sought to synthesize the other collected data. 

K-means clustering contains two hyperparameters: the degree of centroid movement and the number of generated clusters. Since k-means randomly chooses centroids, the manipulated variable within the study was the number of generated clusters. K-means groups data points based on which of the points are closest to the centroid; later, the centroids shift to equal the mean value of each cluster. The last two steps repeat until centroid movement reaches a minimum.9 

To apply this to k-mer analysis, specific k-mers to each cluster (10 of them) were chosen if they occurred frequently within the cluster. Unlike other forms of clustering, the hierarchical clustering function takes one point consecutively and clusters it into the next group until a single large group is formed.3 

The accuracy of the distance matrices was also computed with the GISAID labels serving as the “truth”. In this case, the clustering problem was analyzed as a supervised classification problem with Precision and Recall Metrics used to compute the accuracy of the models. The k-means clustering into 2 groups for all distance metrics was checked against the GISAID clusters into groups “A” and “B”. In the following segments, only clustering results on the Jaccard indexes for 25-mers were analyzed, as the Jaccard Index based clustering results were shown to have the highest accuracy in correlation with the GISAID labels (see section 4.2).

2.7 Isolating Significant K-mers

Following k-means clustering, the clusters themselves were analyzed to pinpoint any k-mers associated or correlated with specific clusters. This process was divided into three steps: gathering a list of all k-mers in each cluster, isolating exclusive k-mers, and finding the unique k-mers. 

To accumulate a list of all of the k-mers occurring in each cluster, the required rows for each cluster (representing the counts for each genome) were filtered out from the file with the total counts (Figure 2). Each column (representing the counts for each k-mer) was iterated through; if a count of 1 or greater was found, the k-mer was added to a list. The program then moved on to the next k-mer, repeating the process. 

Figure 8: A snapshot of the filtered k-mer counts file. This was the first cluster of 10 clusters that was generated using sequence alignment differences and k-means clustering.

Next, the exclusive k-mers, or k-mers with an unusually high concentration in one cluster, were isolated. Many k-mers (such as “AAAAAAAAAAAAAAAAAAAAAAAAA”) have a high number of occurrences overall, so their counts were comparatively higher in each cluster. Thus, filtering out k-mers that occured at an equal proportion in each cluster yielded the exclusive k-mers; the removed k-mers were not associated with the specific cluster, but rather the genome itself.

Finally, the exclusive k-mers described in the previous step were filtered once more to find the unique k-mers. To prevent the unique k-mers from merely being mutations/glitches that solely occurred in one genome, the proportion “total number of occurrences in the current cluster / the total number of genomes” was calculated. If the proportion was at least one percent, the k-mer was significantly likely to not be a glitch or mutation. Increasing this threshold would make the unique k-mer selection process extremely selective. 

2.8 Correlating Clusters to Locations

Odds ratios for evaluating the association between a cluster and location were calculated from the clustering results, while the proportion of each locations’ genomes within each cluster was evaluated to determine the correlation between location and cluster. 

In LocationNot In Location
In ClusterAB
Not In ClusterCD

Figure 9: Values used in Calculating ORs.

Given the conditions described in Figure 9, the odds ratio of each location’s genomes within a cluster was calculated as Odds Ratio = A*DB*C after removing the locations that occurred <10% within the respective cluster. To determine the range of the possible OR values within a confidence interval of 95%, the formulas cited in Tenny et al. 20197 were utilized to calculate the lower and upper range of the calculated odds ratios. Locations with odds ratios greater than 1 with a confidence interval of 95% were stored into a list for further analysis. Statistical noise was minimized through assigning the cluster with the greatest OR of each location to solely that location.

2.9 Correlating Significant K-mers of Clusters with Encoding Protein Sequences

The mutations unique to each cluster and their respective protein sequences were determined from clustering the Jaccard indexes and the k-means data. It can be concluded that k-mers absent from the reference genome but present in other SARS-CoV-2 genomes were created due to insertions, deletions, or substitutions. 

For each of these mutated k-mers, an algorithm paired off the mutated k-mer with another k-mer from the reference genome based on the least possible nucleotide differences between the two k-mers. If a mutated k-mer was similar to multiple reference genome k-mers, the mutated k-mer was paired to the k-mer from the reference genome that shared starting and end nucleotides. Then, the location of each mutated k-mer was determined by using the paired reference genome k-mer to map the location onto the SARS-CoV-2 genome. Similarly, the proteins the mutations encoded for were also determined through the SARS-CoV-2 locational genome data. Each mutated k-mer’s location on the genome, associated cluster, province were determined. 

For each cluster, the mutated k-mers with the maximum occurrences and their respective proteins were also recorded (see 4.2). In the case that a mutated k-mer did not occur in an encoding region (cluster 2), the mutated k-mer which appeared inside an encoding region was logged instead.

3. Results

3.1 Sequence Alignment vs K-mer Analysis

To determine the efficiency of k-mer distance calculations (L1 norm) versus traditional Needleman-Wunsch sequence alignment, controlled settings with randomized mutations were simulated. 

Genome 12222
Genome 23224
Difference|2 – 3||2 – 2||2 – 2||2 – 4|

Figure 10: Table of Total Counts for Fake Setting 

In a scenario using 1-mers on a sequence under 15 nucleotides, the L1 norm difference would be calculated using the information in Figure 10. A created algorithm found the column-wise absolute differences, added up the respective differences, and stored the differences as an L1 value in the distance matrix. In this situation, the L1 norm difference between Genome 1 and Genome 2 would be (1 + 0 + 0 + 2) = 3.

Conversely, the process for traditional sequence alignment utilizes the Needleman-Wunsch Algorithm (Figure 11). 

Figure 11: Diagram of Needleman-Wunsch Algorithm 

To compare two nucleotide positions, the possibilities include a match, mismatch, or a gap. Assigned weights can value matches over mismatches, or mismatches over gaps. Using the formula on the right, the algorithm finds the alignment with the greatest incremental weight. To employ the Needleman-Wunsch algorithm, a randomly selected SARS-CoV-2 genome was utilized (EPI_ISL_418942.fasta). The file was from region WA-UW375 in the USA. 

In the DNA sequence, a single mutation was simulated for the SARS-CoV-2 genome to verify the Needleman-Wunsch algorithm was working correctly. The expected L1 calculation was 2k * (number of mutations) because each mutation eliminates k k-mers and adds k new k-mers. However, the results from the k-mer analysis were varying; mutations in close proximity (within a range of k nucleotides) and repeating k-mers likely caused inaccuracies in the L1 norm distance calculations. 

Substitution: Line 161 position 41, substitute G to C

  1. Sequence Alignment distance: 1
  2. 25-mer distance: 50
  3. 29-mer distance: 58
  4. 31-mer distance: 62

Insertion: Line 161 position 41, insert T

  1. Sequence Alignment distance: 1
  2. 25-mer distance: 49
  3. 29-mer distance: 57
  4. 31-mer distance: 61

Deletion: Line 161 position 41, delete G

  1. Sequence Alignment distance: 1
  2. 25-mer distance: 49
  3. 29-mer distance: 57
  4. 31-mer distance: 61

The single mutations yielded the expected results; the substitution mutations resulted in exactly 2k * (number of mutations) for all three values of k. However, the insertion and deletion mutations resulted in exactly one less than the expected value, likely due to a k-mer repeating within the genome.

The next simulation created multiple mutations in close proximity to one another: 

Substitution: Line 50 position 25 (T to A), Line 50 position 45 (T to G), L50P65 (T to C)

  1. Sequence Alignment distance: 3
  2. 25-mer distance: 130
  3. 29-mer distance: 138
  4. 31-mer distance: 142

Insertion: Line 50 position 25 (A), Line 50 position 45 (C), L50P65 (G)

  1. Sequence Alignment distance: 3
  2. 25-mer distance: 127
  3. 29-mer distance: 135
  4. 31-mer distance: 139

Deletion: Line 50 position 25 , Line 50 position 45, Line 50 position 65

  1. Sequence Alignment distance: 3
  2. 25-mer distance: 131
  3. 29-mer distance: 139 
  4. 31-mer distance: 143 

Unlike the previous simulation, the k-mer approach faltered in observing differences between genomes when mutations were in close proximity to one another. For the substitution, deletion, and insertion mutations the L1 distance was significantly lower than the expected distance 2k * (number of mutations). Although the sequence alignment algorithm was accurate, it took over 5 minutes to align two sequences of length ~30,000. Conversely, it took seconds to generate the L1 distance between two genomes. It can be concluded the k-mer approach is more efficient for difference estimates between two genomes, but falters for highly specific mutation calculations (especially when in close proximity to one another). 

3.2 Clustering Accuracy and Distance Calculation Methods

Figure 12 depicts the validation metrics of the k-means clustering method for each distance calculation method (L1, L2, sequence alignment, and Jaccard). These values were generated through comparing the outputs of k-means clustering on two clusters with the classification outputs from GISAID into clusters A and B as the “truth”. The GISAID classification results of the genomes into classes A and B were generated with a combination of tracing the progression and geographical movement of the virus over time, as well as utilizing multiple sequence alignment software. Within the GISAID labels, class A contained more genomes than class B. Therefore, the clustering results were assigned to A and B based on their cluster size prior to calculation of their Precision and Recall values with the GISAID labels. 

Analysis of the distance methods based on their Precision and Recall values showed two varying conclusions; by solely making deductions from the Precision values of the distance methods, the L1 Norms and Jaccard Indexes were both very similar to the GISAID labels provided by NCBI. Conversely, the Recall values display a different conclusion; the L2 Norms and MSA distance methods yielded significantly higher values, indicating that these distance methods may be better suited for the composite picture of all the genomes within a GISAID generated cluster.

Collecting k-mer count data for differences took more time than sequence alignment and calculating the MSA base differences took less time than analyzing various k-mer differences. It can be deduced that finding the differences between pre-aligned sequences is quicker.

Distance MethodClass A: PrecisionClass A:Recall/True PositiveClass A: False PositiveClass B: PrecisionClass B:Recall/True PositiveClass A: False Positive
L1 Norm0.45560.26280.26280.992060.00790.00794
L2 Norm0.24920.45950.45950.88060.11940.1194
Jaccard Index0.45530.26310.26310.99210.00790.0079
Multiple Sequence Alignment0.15600.102510.102520.812660.187340.1873

Figure 12: Validation Metrics of Clustering Results on Different Distance Methods.

The MSA distances performed poorly when compared to the k-mer based distance calculation methods within clustering. While the GISAID labels were generated through MSA techniques, this study did not utilize timestamps of different genome sequences when producing clustering results. Thus, it is likely the k-mer based distance calculation methods were able to produce results inline with the predictions of GISAID.

3.3 Locational Data 

Figures 13, 14, and 15 depict the results of an odds ratio based analysis between different countries and the clusters. The clustering performed on two groups (Figure 13) had significant correlations found; group 0 had origins specific to the US, Australia, and Europe. 

The same group was divided into a US, Belgium, Iceland, and independent Europe cluster when K-Means clustering was further performed with four groups (Figure 14). Additionally, the results indicate clustering done on 4 groups (Figure 14) did not yield a significant location correlated to group 0. Group 0 contained 559 genomes; since the cluster was likely spread out among many different locations, a definite location was not traced. 

Figure 13: Location Based Data for K-Means Clustering for 2 Groups on Jaccard Distances

Figure 14: Location Based Data for K-Means Clustering for 4 Groups on Jaccard Distances

Figure 15: Location Based Data for K-Means    Figure 16: Location Based Data for K-Means 

Clustering for 4 Groups on Jaccard Distances  Clustering for 2 Groups on Jaccard Distances

3.4 Unique K-Mers and Associations Between Mutations and Genomic Location

The mutations encoding particular SARS-CoV-2 proteins were mapped to certain clusters and traced to certain locations on the genome (Figures 17 and 18). 

Figure 17: Frequency of Mutations vs Location of Mutations graph for All 4 Clusters created from K-Means clustering to 4 groups on Jaccard Distances with 25-mers

As depicted in Figure 17, the results indicate clusters 1 and 2 have many low-frequency mutations across all positions on the genome. However, only a minimal number of high-frequency mutations were found. This likely means there are no significant mutations correlated to all the distinct genomes within the clusters. The results differed for cluster 3; there were a higher proportion of high-frequency mutations as shown by the concentration of frequencies at around 350. Cluster 4 had mixed results, with a randomized spread of high and low frequency mutations. 

Cluster NumberAssociated LocationsTotal # of Genomes In ClusterMost Common 25-mer in ClusterStartingLocation of Associated 25-mer on Origin Genome# of Occurrences of Mutated K-mer in ClusterAffected Protein Coding Regions
2Western Europe, Japan, Canada, and Australia1704ATCATCAGCACATCTAGGTTTTGTC
3693′-to-5′ exonuclease
4Belgium and Iceland541ACTCAGAGTAGAATCATTATCTAAA11897-1189844nsp7

Figure 18: Summary of All Mutations of Cluster Outputs from Jaccard Indexes. 

– NOTE: In the case of Cluster 2, the most common k-mer did not occur in an encoding region. This value for the protein is determined separately by looking at the encoding k-mers. The starting location of this protein’s associated k-mer was 3896 and the number of occurrences was 132. TCCATAGCCAATTCTAAGAAATCAA was the associated mutated k-mer to this protein’s encoding region.

The cluster with the highest number of mutated k-mers within the genome was cluster 2, with ~1700 occurrences out of 1704 genomes. Although this k-mer occurred in a non-coding region of the origin genome taken from NCBI, further research must be conducted to determine the functionality of this mutated k-mer. Within the encoding region of the genome, the k-mer with the highest count was associated with the NSP2 protein. This mutation was relatively rare; the k-mer only appeared 132 times out of the 1704 genomes. 

In cluster 1, the mutated k-mer with the most occurrences was located only 104 times. However, the k-mer impacted a vital protein coding region: coronavirus ORF1ab non-structural protein 2 (NSP2). Previous research conducted on the SARS-CoV-2 genome confirms a stabilizing mutation in the NSP2 protein may play a large part in the contagiousness of the virus.2  

In cluster 3, mutated k-mers were associated with the encoding regions of the 3’-5’ exonuclease. This mutation is significant since the protein facilitates the process of DNA replication upon viral recombination with a host cell. Any significant mutations which alter the encoded amino acids and tertiary structure of the protein could have significant effects to the methods in which host cells are infected with the virus cells.5

Within the fourth and final cluster, the mutated k-mer with the highest occurrences was associated with the NSP7 non-structural protein cofactor. Eight copies of NSP7 (when combined with eight copies of the NSP8 protein) create a hexadecameric complex to help the SARS-CoV-2 RNA duplicate. Therefore, even a slight mutation to this region could have significant effects on how the virus cells replicate.10

4. Conclusion

4.1 Further Usage of the Jaccard Indexes and L1 distances as K-Mer Based Distance Functions

Although traditional sequence alignment was more accurate than the k-mer approach for mutations in close proximity, multiple sequence alignment was extremely inefficient when broadly compared to the k-mer approach. Out of the four distance metrics (L1, L2, Jaccard, and Multiple Sequence Alignment), the L1 and Jaccard distances yielded results most consistent with the GISAID labels due to their high Precision values. However, the L2 and MSA distances had higher Recall values. This indicates the L2 and MSA methods obtained a more composite picture of the genomes within a cluster.

The application of distance calculation methods is highly context dependent; additional analysis of the various methods would be required to draw more definite conclusions. Possibilities for future research include the creation of an ROC curve for each distance metric by utilizing the GISAID labels as experimental values or shifting the length of the analyzed k-mers.

4.2 Significant Mutation Locations

The k-means clustering data on the Jaccard distances of 25-mers (four clusters) revealed  mutations with high occurrences at the starting locations of nucleotides 2469, 18039, and 11897. The mutations in cluster 1 and cluster 3 (based in Belgium and Iceland) were associated with non-structural proteins that could impact polymerase activity of SARS-CoV-2 and virus proliferation. Conversely, the US strain had significant mutations to the encoding region of the 3’-5’ exonuclease.

The mutated k-mer with a starting location of nucleotide 221 in the second cluster had the highest percentage of occurance (~1700 out of 1704 genomes). This mutation did not occur in an encoding region of the genome, making it likely insignificant to the functions of SARS-CoV-2.

Although further analysis regarding the variability of k-mer regions is needed, the discovery of mutations in NSP2, NSP7, and the 3’-5’ exonucleases (Cluster 1, Cluster 3, and Cluster 4) could aid the development of a location-based drug for SARS-CoV-2. 

4.3 Geographical Locations of K-Means Generated Clusters

The k-means clustering algorithm on the Jaccard distances (k=25) of four groups allowed the discovery of three main locational strains of SARS-CoV-2: Europe, the US, and Belgium/Iceland.

Within the US cluster, 361 out of 920 genomes were sorted into an individual group. However, when clustered into two groups, 466 of the US genomes were grouped with Europe. This indicates that the genome sequences between the two groups possess minimal diversity and share similar origins as they remain in the same group when clustering is performed on two groups. 

Within the clustering results of two groups, 2391 clusters were sorted into the first US, Europe, Australia, and Japan group, while the rest of the 591 genomes remained in their own cluster. The general diversity in locations from both groups demonstrates that the world is becoming more connected and virus transmission is becoming much harder to track. Since the US is a consistent leader within the two and four group clustering, it is likely the US possesses enough cases and mutations to develop its own strain of SARS-CoV-2.

5. Future Directions

Future research can be conducted to create a phylogenetic tree illustrating SARS-CoV-2 mutations by location and better understand the proteins impacted by the k-mer mutations. Phylogenetic trees illustrate relationships between organisms over time; nodes within the trees depict common ancestors based on genetics or phenotype. For SARS-CoV-2, a phylogenetic tree can track changes in DNA based on location. Additionally, further analysis of the effects of mutations on the encoded proteins would be vital to understanding the changes in amino acid structure and function of impacted proteins. 

Continued developments can be made to the clustering algorithm: taking into account the time stamp of genomes, assigning different weights to k-mer differences in encoding and non-coding regions, and reducing computational requirements are all possibilities. 

6. Acknowledgements

We would like to thank our wonderful mentors HoJoon Lee, Stephanie Greer, and Dmitri Pavlichin for assisting us throughout this project, as well as Professor Hanlee Ji for making this opportunity a reality. Additionally, we would like to thank Cindy Nguyen and Professor Tsachy Weissman for coordinating the STEM to SHTEM program. All of these people, and many more, have been instrumental to our research efforts.

7. References

[1] Anderson, K, Rambaut A., Lipkin W. I., Holmes, E.C., and Garry, R. F. The proximal origins of SARS-COV-2. Nat Med, 26, 450-452(2020). DOI:

[2] Angeletti, S., Benvenuto, D., Bianchi, M., Giovanetti, M., and Pascarella, S. 2020. COVID-2019: The role of the nsp2 and nsp3 in its pathogenesis. Journal of Medical Virology, 92. doi:

[3] Bustamam, A, Ulul, E.D., Hura, H.F.A., and Siswantining, T. Implementation of hierarchical clustering using k-mer sparse matrix to analyze MERS-CoV genetic relationship. AIP Conference Proceedings. 1862(1).

[4] de Leng, W. W., Gadellaa-van Hooijdonk, C. G., Barendregt-Smouter, F. A., Koudijs, M. J., Nijman, I., Hinrichs, J. W., Cuppen, E., van Lieshout, S., Loberg, R. D., de Jonge, M., Voest, E. E., de Weger, R. A., Steeghs, N., Langenberg, M. H., Sleijfer, S., Willems, S. M., & Lolkema, M. P. (2016). Targeted Next Generation Sequencing as a Reliable Diagnostic Assay for the Detection of Somatic Mutations in Tumours Using Minimal DNA Amounts from Formalin Fixed Paraffin Embedded Material. PloS one, 11(2), e0149405.

[5] Gammon, D. B., & Evans, D. H. (2009). The 3′-to-5′ exonuclease activity of vaccinia virus DNA polymerase is essential and plays a role in promoting virus genetic recombination. Journal of virology, 83(9), 4236–4250.

[6] Kirchdoerfer, R., and Ward, B. 2019. Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 cofactors. Nature Communications, 10: 2342. doi:

[7] Tenny, S.O., Bhimji, S., Filippava, I., Wilcox, L., & Toney-Butler, T.J. (2017). Odds Ratio (OR).

[8] Marcais, G, Kingsrod C. 2011. A Fast, Lock-Free Approach for Efficient Parallel Counting of occurrences of k-mers. Bioinformatics 27: 764-770. DOI:

[9] Rajaraman, A., and Ullman J.D. (2012). Mining of Massive Datasets. Cambridge University Press.

[10] Velthuis, A. J.W., van de Worm, S. and Snijder, E. 2012. The SARS-coronavirus nps7+nsp8 complex is a unique multimeric RNA polymerase capable of both de novo initiation and primer extension. Nucleic Acids Research, 40(4):1737-1747. doi:

[11] Wen, J, Zhang Y., and Yau, S.S.T. 2014. k-mer Sparse matrix model for genetic sequence and its applications in sequence comparison. Journal of Theoretical Biology, 363: 145-150.

[12] Zielezinski, A., Vinga S., Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017: 18(1):186. Published 2017 Oct 3. doi: 10.1186/s13059-017-1319-7

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.