Georgina Cortez, Nicole Krukova, Thuan Le, and Suraj Thangellapally
Abstract: Nanopore technology is a modern way of sequencing DNA; in other words, it determines the order of nucleotides in a DNA strand. First, a DNA strand passes through a motor protein that serves as a brake, slowing the strand to improve the accuracy of the readings. Then, the strand runs through a nano-sized hole across which an ionic current flows. As the DNA strand passes through the nanopore, the individual nucleotides disrupt the current, and these disruptions help determine the nucleic acid sequence. Nanopore technology can sequence approximately 250 bases per second. However, the results are “noisy,” or not completely reliable, because of several factors, including the inconsistency of the nucleotides’ dwell times, which are the lengths of time the central bases spend in the nanopore. The wavelength of the ionic current depends on the dwell time; inconsistent dwell times produce different wavelengths that make it difficult to accurately determine the nucleic acid sequence.
We believe the central base’s dwell time depends on the nucleotides present in the motor protein at that moment. The central base is the nucleotide currently in the nanopore whose dwell time we are focusing on. A k-mer (e.g. “AAT” or “CGATC”) is a run of k consecutive nucleotides that we associate with the central base’s dwell time. To characterize the inconsistencies of the dwell times, we analyzed the data to find correlations between the k-mers and their dwell times. These correlations can help future researchers assign specific characteristics to k-mers and describe their impact on dwell times.
We conducted our research with previously acquired sequencing data from the lambda bacteriophage (a virus that infects bacteria), whose genome has approximately 48,000 nucleotides. Using Python, we manipulated the data to determine trends and correlations between the k-mers and their corresponding dwell times. Our main programming goal was to create visual representations of the data in the form of plots. These plots allowed us to discern trends in the data that we otherwise might not have observed. Our initial plots were difficult to analyze because they contained so much data that it was nearly impossible to discern any trends. To resolve this issue, we organized the data into a more digestible format by plotting only the average dwell time for each k-mer and sorting the points from lowest dwell time to highest. This allowed us to more easily spot outliers or trends in the data and compare different k-mers.
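The averaging-and-sorting step described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `mean_dwell_by_kmer`, the `(kmer, dwell_time)` record layout, and the sample values are all assumptions made for the example, and it uses only the standard library instead of pandas.

```python
from collections import defaultdict
from statistics import mean


def mean_dwell_by_kmer(records):
    """Return (kmer, mean dwell time) pairs sorted from lowest to highest mean.

    `records` is an iterable of (kmer, dwell_time) pairs. This layout is an
    assumption for illustration, not the format of the original data set.
    """
    grouped = defaultdict(list)
    for kmer, dwell in records:
        grouped[kmer].append(dwell)
    averages = [(kmer, mean(dwells)) for kmer, dwells in grouped.items()]
    # Sort by the mean so outliers end up at either end of the list.
    return sorted(averages, key=lambda pair: pair[1])


# Made-up dwell times, purely to show the shape of the output.
sample = [("AAT", 8.0), ("CGA", 6.0), ("AAT", 10.0), ("GGG", 12.0), ("CGA", 7.0)]
ranked = mean_dwell_by_kmer(sample)
```

Plotting the second element of each pair in `ranked` against its position would give one point per k-mer, ordered from lowest to highest average dwell time.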
The plot and table showed that the dwell-time data for the four bases were very similar, with barely any differences between their means, standard deviations, and other summary statistics. These results supported our belief that the dwell time does not depend on the base in the pore. We instead hypothesized that the dwell time depends on the following bases that are passing through the motor protein, since the motor protein controls the speed of the DNA strand. To test this idea, we moved on to plotting the dwell times of different k-mers to see if any correlations became more apparent.
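A per-base comparison of this kind could be computed along the following lines. This is only a sketch: the function name `summarize_by_base`, the `(base, dwell_time)` record layout, and the sample numbers are hypothetical and do not reproduce the original table.

```python
from statistics import mean, stdev


def summarize_by_base(records):
    """Map each base to (count, mean, sample standard deviation) of its dwell times.

    `records` is an iterable of (base, dwell_time) pairs (an assumed layout).
    """
    by_base = {}
    for base, dwell in records:
        by_base.setdefault(base, []).append(dwell)
    summary = {}
    for base, dwells in sorted(by_base.items()):
        # stdev needs at least two values; report 0.0 for a single observation.
        spread = stdev(dwells) if len(dwells) > 1 else 0.0
        summary[base] = (len(dwells), mean(dwells), spread)
    return summary


# Made-up values chosen so that every base has the same mean dwell time.
sample = [("A", 7.0), ("A", 9.0), ("C", 8.0), ("C", 8.0),
          ("G", 6.0), ("G", 10.0), ("T", 8.0)]
stats = summarize_by_base(sample)
```

If the means and standard deviations in `stats` are close across A, C, G, and T, the central base alone explains little of the variation in dwell time.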
We began with an unshifted 3-mer as practice to construct our function and begin experimenting with associating k-mers with the central base’s dwell time. As expected, there were no significant trends in our data because the unshifted 3-mer has only one nucleotide following the central base. We expected a longer k-mer to correlate with the dwell time since, as our hypothesis claims, the central base’s dwell time depends on the nucleotides passing through the motor protein at that moment. Following this observation, we created a dataset of shifted 3-mers, in which each 3-mer is shifted one position to the right.
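One way to express the difference between an unshifted and a shifted k-mer is sketched below. The convention is an assumption inferred from the description above: an unshifted k-mer is centered on the central base, and a shift of 1 moves the whole window one position to the right. The function name `kmer_at` and the example sequence are hypothetical.

```python
def kmer_at(sequence, center, k, shift=0):
    """Return the k-mer associated with the base at index `center`.

    With shift=0 the window is centered on the central base (intended for
    odd k). A positive shift moves the window toward the following bases.
    Returns None when the window would run off either end of the sequence.
    """
    start = center - k // 2 + shift
    end = start + k
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


sequence = "GATTACAGC"  # made-up example sequence
center = 4              # index of the central base, here an "A"

unshifted = kmer_at(sequence, center, 3)         # one base before, one after
shifted = kmer_at(sequence, center, 3, shift=1)  # central base plus two after
```

Under this convention the shifted 3-mer starts at the central base, so it covers two following bases instead of one.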
Because the error bars for the 3-mer dwell times are so large, the 3-mer plot is unreliable and should not be used to draw conclusions. This reinforced our focus on the 5-mer, whose data are more reliable.
We observed that the k-mers with outlying dwell times, below 7 units and above 9 units, contained a consecutive repetition of a base. The dwell times in these ranges varied greatly from the “normal” dwell times in the 7 to 9 unit range. The k-mers in this “normal” range had minimal repetition in comparison to those in the outlying ranges. This led us to believe that the consecutive repetition of a nucleotide causes an irregular dwell time.
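Checking a k-mer for consecutive repetition can be automated with a small helper like the one below. This is a sketch under assumptions: the names `longest_run` and `has_consecutive_repeat` are hypothetical, and the `min_run` threshold of two identical bases in a row is an assumed definition of "repetition" that the text does not specify.

```python
from itertools import groupby


def longest_run(kmer):
    """Return (base, length) for the longest run of one repeated base in `kmer`."""
    best_base, best_len = "", 0
    for base, group in groupby(kmer):
        length = sum(1 for _ in group)
        if length > best_len:
            best_base, best_len = base, length
    return best_base, best_len


def has_consecutive_repeat(kmer, min_run=2):
    """True if `kmer` contains at least `min_run` identical bases in a row."""
    return longest_run(kmer)[1] >= min_run


run_in_outlier = longest_run("CGGGA")           # a 5-mer with a run of G's
flag_no_repeat = has_consecutive_repeat("ACGTA")
flag_repeat = has_consecutive_repeat("AACGT")
```

Applying `has_consecutive_repeat` to every k-mer in each dwell-time range would give the fraction of k-mers with repetition per range, which can then be compared between the "normal" and outlying ranges.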
As in the previous sample of 5-mer data, the k-mers with irregular dwell times contained a repetition of a single base. In the 6.5 to 6.7 unit range of the dwell time, there was a repetition of a single base in each k-mer. We also noticed a higher prevalence of C’s at the lower end of the dwell-time range. Moving into the higher 10 to 12 unit range, we observed a consistent repetition of the nucleotide G in each k-mer. G’s are known to be particularly disruptive, so we believe that their repetition caused the change in the dwell time. These data supported the idea that the repetition of a nucleotide causes irregularities in the dwell times.
The first plot shows no difference between the prior k-mers and the 7-mer. This indicates that, to find a trend, we need to increase the length of the k-mer and keep shifting it. There are 16,384 possible 7-mers, so we plotted their average dwell times to make the data easier to visualize. The second plot above shows the average dwell times from the first 10,000 bases of the lambda data, sorted from least to greatest. However, after analyzing the plot, we concluded that we would need a longer k-mer to spot any significant trend. In future work, researchers should analyze the repetition of nucleotides, focus on specific dwell-time ranges, and look for any other trends that could lead to a discovery.
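The count of 16,384 follows from four possible bases at each of seven positions (4^7). The short sketch below checks that arithmetic and, assuming the averages were taken over overlapping windows within the first 10,000 bases, also counts how many 7-mer windows such a stretch can contain. The function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def count_possible_kmers(k):
    """Number of distinct k-mers over the four DNA bases: 4 ** k."""
    return len(BASES) ** k


def count_windows(n_bases, k):
    """Number of overlapping length-k windows in a sequence of n_bases."""
    return max(n_bases - k + 1, 0)


possible_7mers = count_possible_kmers(7)
windows_in_sample = count_windows(10_000, 7)

# Enumerating every k-mer explicitly is practical for small k, e.g. all 3-mers.
all_3mers = ["".join(bases) for bases in product(BASES, repeat=3)]
```

Under that assumption, 10,000 bases contain at most 9,994 overlapping 7-mer windows, which is fewer than the 16,384 possible 7-mers, so many 7-mers would be unobserved or observed only once. This may be one reason longer k-mers call for more data before trends become visible.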
Our project supported the hypothesis that the dwell time is not dependent on the central base but rather on the following bases in the motor protein. A significant finding was that the consecutive repetition of a nucleotide is associated with irregular dwell times. The nucleotides G and C also appear to affect the consistency of the dwell time. With this information, future researchers should continue to create different shifts of k-mers and gather more data about the impact of base repetition on the dwell time. By understanding this relationship, researchers will be able to characterize and predict the effect a k-mer will have on the dwell time. This can lead to more accurate readings of the sequenced DNA, which would allow Nanopore technology to expand.
Advancements in Nanopore technology will expand the applications of DNA sequencing in fields such as healthcare and data compression. Increased reliability will allow the technology to play a more widespread role in personalized medicine, where treatment is tailored to one’s genetics, and in DNA data storage, an incredibly efficient way of storing a substantial amount of data in a microscopic space.
 Python: The Ultimate Beginner’s Guide!. Scotts Valley: CreateSpace Publishing, 2016.
 “Python Numpy Tutorial”, Cs231n.github.io, 2019. [Online]. Available: http://cs231n.github.io/python-numpy-tutorial/#numpy-arrays. [Accessed: 08-Jul-2019].
 “Data Analysis and Visualization with Python for Social Scientists alpha: Extracting row and columns”, Datacarpentry.org, 2017. [Online]. Available: https://datacarpentry.org/python-socialsci/09-extracting-data/index.html. [Accessed: 08-Jul-2019].
 G. Templeton, “How DNA data storage works – ExtremeTech”, ExtremeTech, 2016. [Online]. Available: https://www.extremetech.com/extreme/231343-how-dna-data-storage-works-as-scientists-create-the-first-dna-ram. [Accessed: 08-Jul-2019].
 G. Templeton, “How DNA sequencing works – ExtremeTech”, ExtremeTech, 2015. [Online]. Available: https://www.extremetech.com/extreme/214647-how-does-dna-sequencing-work. [Accessed: 08-Jul-2019].
 “Replacing strings with numbers in Python for Data Analysis – GeeksforGeeks”, GeeksforGeeks, 2018. [Online]. Available: https://www.geeksforgeeks.org/replacing-strings-with-numbers-in-python-for-data-analysis/. [Accessed: 08-Jul-2019].
 N. Jetha, C. Feehan, M. Wiggin, V. Tabard-Cossa and A. Marziali, “Long Dwell-Time Passage of DNA through Nanometer-Scale Pores: Kinetics and Sequence Dependence of Motion”, Biophysical Journal, vol. 100, no. 12, pp. 2974-2980, 2011. Available: 10.1016/j.bpj.2011.05.007.
 T. Petrou, “Selecting Subsets of Data in Pandas: Part 1 – Dunder Data – Medium”, Medium, 2017. [Online]. Available: https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c. [Accessed: 11-Jul-2019].
 “How it works”, Oxford Nanopore Technologies, 2019. [Online]. Available: https://nanoporetech.com/how-it-works. [Accessed: 11-Jul-2019].
 S. Lynn, “Using iloc, loc, & ix to select rows and columns in Pandas DataFrames”, Shane Lynn, 2018. [Online]. Available: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/#loc-selection. [Accessed: 11-Jul-2019].
 “Python Pandas Tutorial 7. Group By (Split Apply Combine)”, YouTube, 2017. [Online]. Available: https://www.youtube.com/watch?v=Wb2Tp35dZ-I. [Accessed: 14-Jul-2019].
 “How do I apply multiple filter criteria to a pandas DataFrame?”, YouTube, 2016. [Online]. Available: https://www.youtube.com/watch?v=YPItfQ87qjM. [Accessed: 14-Jul-2019].
 “Sequencing DNA (or RNA) | Real-time, Ultra Long-Reads, Scalable Technology from Oxford Nanopore”, YouTube, 2017. [Online]. Available: https://www.youtube.com/watch?v=GUb1TZvMWsw. [Accessed: 14-Jul-2019].
 “Introduction to nanopore sequencing”, YouTube, 2019. [Online]. Available: https://www.youtube.com/watch?v=I9BOF8Hla5s. [Accessed: 14-Jul-2019].
 Lambda data accessed from Weissman’s lab.