An Order-Optimal Edit-Correcting Code

Information Theory (Winter 2020)

By Daniel Tan, dtch1997@stanford.edu

Introduction

DNA (deoxyribonucleic acid), could also be called the “source code” of life on Earth. Whereas machine code relies on a binary alphabet of 0s and 1s, DNA relies on a quaternary code of the ATCG nucleotides. DNA is easy to read from, write to, and copy, since these functions are already implemented by natural cellular machinery. In addition, it is very stable, and does not require much energy to store. Due to these advantages, it has recently become a medium of interest for long-term information storage. [1]

One flaw of DNA as a storage medium is that information stored in DNA can be corrupted through point mutations. [2] In particular, mutations can involve a nucleotide being removed from the DNA sequence (deletion), a nucleotide being changed to a different nucleotide (substitution), or a new nucleotide being added at an arbitrary location (insertion). If DNA is to be used as information storage, there is a need for error-correcting codes which can reliably handle all three types of error.

Results

In this project, I implement a code presented in a recent paper [3] that corrects a single edit (substitution, insertion, or deletion) with \log n + O(\log \log n) redundancy. In other words, to correct a single edit in a message of n symbols, only \log n + O(\log \log n) additional symbols need to be used. In contrast, previous approaches had a redundancy of c \log n where c \geq 2 .

The implementation and a high-level explanation of how the code works can be found at this GitHub repository. The code is intended to be open-source and may be freely used, reproduced or modified with suitable accreditation.

References

[1] Goldman, N., Bertone, P., Chen, S. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013). https://doi.org/10.1038/nature11875

[2] “Point Mutation”. Biology Dictionary. Retrieved 25 Mar 2020.

[3] Cai, Kui et al. Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage. arXiv:1910.06501 [cs.IT]

 

 

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.