Comparing the compressibility of different genomes and the correlation between genome length and compression ratio

Journal for High Schoolers

Authors:

Tyler Lafayette, Heather Lee, Naomi Boneh, and Joel Rivera

Abstract:

The compression ratios of various species’ genomes reflect the repetitiveness of these genomes and can lead to a deeper understanding of the repetitiveness of each genome. In addition, we can make connections between gene repetition and compaction and the intended use of the DNA and conditions it must survive in. ​We tested our ideas by running the genomes of many different organisms through several compression algorithms. Our main question was how much variation is in the repetitiveness of genomes. To study this, we used compression since it somewhat measures repetitiveness.

1. Introduction

A genome is an organism’s set of DNA that holds all the information needed to construct and preserve itself. Repetitions of adenine (A), thymine (T), guanine (G), and cytosine (C) make up DNA; we wanted to see how individual species and their base pairs could be compressed. In the constant update in technology seen throughout the world, DNA sequencing has become cheaper and more accessible. The problem, however, is that the human genome is three billion base pairs long so each person’s genome needs lots of data in order to be stored. Currently, the information about DNA compression according to “​How Repetitive Are Genomes​” is that there are large amounts of repetition among diverse genomes. We decided to approach this topic by seeing how far we could compress different genomes and using different compressors to discern which was the most effective. After running genomes through lots of code, we found that the best DNA compressor was H(p).

2. Methods

To start, we started plotting probability in dice rolls. We used the Java coding language to roll a dice 1000 times and give us the side it landed on. In Java, we used an ArrayList and defined the six sides of a die so that, when the program ran, each side had the number of times it was rolled. Since some of us were in the beginning stages of programming proficiency, we copied the data over to Google Sheets in order to create a plot. After doing that, we decided to pivot to the Julia programming language, because this language allowed us to plot the dice rolls in one line of code and due to ​its high-performance numerical analysis and ability to process big data applications.​ We were also learning the basics of compression, in particular, lossless compression at the same time. Data compression is used to encode information using fewer bits than the original model. Lossless compression recognizes and eliminates statistical repetitions so the data size is smaller than the original. It is extremely useful because it not only reduces the resources required to transmit and store data but also enables the data to be decompressed. ​The genomes of several organisms were downloaded from Ensembl (https://ensembl.org).

We organized our goals into a pipeline. The first step was to get the genomes from the websites and download them onto our computers. Once we had the genomes ready to go, we ran them through different compressors like Bzip2, Gzip, Zstd, as well as an entropy function for predicting compressibility in order to see which algorithm would be able to most efficiently compress the genome files. After the compression was run, the new file size was stored in order to make a plot or graph that would demonstrate how different compressors performed, as well as how genome length correlated to the compression ratio.

We decided to implement our pipeline in Julia because it is efficient and simple. As per our pipeline, we created a script that watches a directory containing uncompressed genomes in the FASTA (.fa) file format. When it detects a genome file that has not yet been analyzed, it first runs the file through a “sanitization” program. Genome FASTA files are typically separated into lines using “/n” and/or “/r.” The sanitation program optimizes compression by removing these unnecessary line break symbols. After this process, the script iterates through four different compression algorithms, running each on the source file and recording the output file size and compression ratio in an SQL database, which stores a table of results. Finally, we used another script to data from the database and generates a plot using multiple methods of categorization as shown below.

3. Results and Discussions

Our results suggest that our initial thoughts that the larger genomes would be more compressible than smaller genomes were incorrect. In fact, our data showed that the genomes with the larger sizes tended to be less compressible.

The different colored dot sets in this plot represent the same genome files ran through various compression algorithms. We noticed that, compared to the H(p) entropy algorithms, there was a small set of genomes that the other algorithms were able to compress much more efficiently than the entropy function. It is important to note that, while the rest of the algorithms process any type of text inputted, the entropy algorithm only processes A, C, T, and G characters. This might explain the large gap in some of the compression ratios, as some of the larger genomes had many repetitions of blocks containing only “N,” which signifies an unknown part of a genome. This extra repetition could optimize the algorithm, and cause higher efficiency.

This plot displays the compression ratios of different genomes using the entropy function, grouped by the species’ kingdoms. In general, bacteria have the smallest starting file size and are able to be compressed by a factor of around 4.15, demonstrating how the smallest files are one of the most compressible. On the other hand, the Animalia kingdom had the largest starting file sizes and had around an average compression ratio of 4.05. The two main outliers are a part of the Protozoa kingdom, so they have a lot of repetition within its genome. The higher the compression ratio, the higher the compressibility and the smaller the output file.

This plot displays the relation between the original file size (in bits) and the output file size of genomes after processing using compression algorithms. We found a near-linear relationship between input size and output size.

4. Conclusions

Our project has shown that DNA compression is extremely useful in determining how repetitive a genome is. The more repetitive the genome is, the more compressible it is. This reduces the number of resources needed to store the DNA base pairs. Our progress is useful because we can conclude from our plots that the H(p) compressor has the highest compression ratio, and that larger starting file size does not necessarily mean a higher compression ratio. In the future, people could develop compressors that are DNA specific which would benefit the genome sequencing industry by allowing genomes to be compressed even more than they are with regular data compressors.

5. References

Haubold, Bernhard, and Thomas Wiehe. “How Repetitive Are Genomes?” ​BMC Bioinformatics​, BioMed Central, 22 Dec. 2006, www.ncbi.nlm.nih.gov/pmc/articles/PMC1769404/​.

“Data Compression.” ​Wikipedia​, Wikimedia Foundation, 17 July 2019,
en.wikipedia.org/wiki/Data_compression.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.