Project

# DNA-based storage

Goal: Our aim is to design high-capacity DNA-based storage systems. Focus is on reliability and coding efficiency.


## Project log

In this paper, we first propose coding techniques for DNA-based data storage that account for the maximum homopolymer runlength and the GC-content. In particular, for arbitrary $\ell, \epsilon > 0$, we propose simple and efficient $(\epsilon, \ell)$-constrained encoders that transform binary sequences into DNA base sequences (codewords) satisfying the following properties: • Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, • GC-content constraint: the GC-content of each codeword is within $[0.5 - \epsilon, 0.5 + \epsilon]$. For practical values of $\ell$ and $\epsilon$, our codes achieve higher rates than the existing results in the literature. We further design efficient $(\epsilon, \ell)$-constrained codes with error-correction capability. Specifically, the designed codes satisfy the runlength constraint and the GC-content constraint, and can correct a single edit (i.e. a single deletion, insertion, or substitution) and its variants. To the best of our knowledge, no such codes had been constructed prior to this work.
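The two constraints above are easy to state as a checker. The following is a minimal sketch, not the encoder from the paper: the function names (`max_homopolymer_run`, `gc_content`, `satisfies_constraints`) are hypothetical, and the code only verifies whether a given strand meets the runlength bound $\ell$ and the GC-content window $[0.5 - \epsilon, 0.5 + \epsilon]$.

```python
from itertools import groupby


def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


def gc_content(strand: str) -> float:
    """Fraction of bases in the strand that are G or C."""
    if not strand:
        raise ValueError("strand must be non-empty")
    return sum(base in "GC" for base in strand) / len(strand)


def satisfies_constraints(strand: str, ell: int, eps: float) -> bool:
    """True if every run has length <= ell and GC-content is within eps of 0.5."""
    run_ok = max_homopolymer_run(strand) <= ell
    gc_ok = abs(gc_content(strand) - 0.5) <= eps
    return run_ok and gc_ok
```

For instance, `satisfies_constraints("AAAACGCG", 3, 0.1)` is rejected because of the run `AAAA`, even though the GC-content of that strand is exactly 0.5.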
The process of DNA-based data storage (DNA storage for short) can be mathematically modelled as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced. It generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes, referred to as sequence-subset codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive some upper bounds on the size of the sequence-subset codes, including a tight bound for a special case, a Singleton-like bound and a Plotkin-like bound. We also propose some constructions, including an optimal construction for that special case, which imply lower bounds on the size of such codes.
We propose coding techniques that simultaneously limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given ℓ, ϵ > 0, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy all of the following properties: • Runlength constraint: the maximum homopolymer run in each codeword is at most ℓ, • GC-content constraint: the GC-content of each codeword is within [0.5 - ϵ, 0.5 + ϵ], • Error-correction: each codeword is capable of correcting a single deletion, or single insertion, or single substitution error. While various combinations of these properties have been considered in the literature, this work provides generalizations of code constructions that satisfy all the properties for arbitrary parameters ℓ and ϵ. Furthermore, for practical values of ℓ and ϵ, we show that our encoders achieve higher rates than existing results in the literature and approach capacity. Our methods have low encoding/decoding complexity and limited error propagation.
The world is facing a looming data storage crisis, and Singapore can help to avert it. In 2018, people watched 4.33 million videos on YouTube, sent 159 million e-mails and posted 49,000 photographs on Instagram every minute of the year, among other data uses. At this rate, we will produce 418 zettabytes of data this year, according to the World Economic Forum, and even more in the future. A single zettabyte is a trillion gigabytes.
Subblock energy-constrained codes (SECCs) have recently attracted attention due to various applications in communication systems, such as simultaneous energy and information transfer. In a SECC, each codeword is divided into smaller subblocks, and every subblock is constrained to carry sufficient energy. In this work, we study SECCs under more general constraints, namely bounded SECCs and sliding-window constrained codes (SWCCs), and propose two methods to construct such codes with low redundancy and linear-time complexity, based on Knuth’s balancing technique and the sequence replacement technique. For certain code parameters, our methods incur only one redundant bit.
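As background for the first of the two techniques named above, here is a minimal sketch of the core step of Knuth's balancing technique for binary words of even length: flip a prefix of the word until the number of ones equals half the length. This is not the SECC or SWCC construction from the paper. The function names are hypothetical, and the flip index `t` is returned separately rather than being encoded into a prefix, which is a simplification.

```python
def knuth_balance(bits):
    """Flip the shortest prefix of an even-length binary word so it becomes balanced.

    Returns (balanced_word, t), where t is the number of flipped leading bits.
    """
    n = len(bits)
    if n % 2:
        raise ValueError("word length must be even")
    weight = sum(bits)  # weight of the word after flipping the first t bits (t = 0 here)
    for t in range(n + 1):
        if weight == n // 2:
            return [1 - b for b in bits[:t]] + list(bits[t:]), t
        if t < n:
            # Flipping bit t changes the weight by +1 (if it was 0) or -1 (if it was 1).
            weight += 1 - 2 * bits[t]
    raise AssertionError("unreachable: a balancing index always exists")


def knuth_unbalance(balanced, t):
    """Invert knuth_balance by flipping the first t bits back."""
    return [1 - b for b in balanced[:t]] + list(balanced[t:])
```

A balancing index always exists because the weight moves in steps of one from its initial value $w$ to $n - w$ as the prefix grows, so it must pass through $n/2$.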
We describe properties and constructions of constraint-based codes for DNA-based data storage which account for the maximum repetition length and AT/GC balance. Generating functions and approximations are presented for computing the number of sequences with maximum repetition length and AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of codes that account for both the maximum repetition length and AT/GC balance are derived.
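The counting problem described above can be checked numerically for small parameters. The sketch below uses a plain dynamic program rather than the generating functions of the paper; the function name `count_strands` and its parameters are hypothetical. It counts strands of length `n` over {A, C, G, T} whose runs never exceed `max_run`, optionally restricted to an exact number of G/C symbols.

```python
from collections import defaultdict


def count_strands(n, max_run, gc_count=None):
    """Count length-n strands with all runs <= max_run.

    If gc_count is given, count only strands with exactly gc_count G/C symbols.
    """
    if max_run < 1:
        raise ValueError("max_run must be at least 1")
    if n == 0:
        return 1 if gc_count in (None, 0) else 0
    # State: (last base, current run length, G/C symbols so far) -> number of strands.
    states = {(b, 1, int(b in "GC")): 1 for b in "ACGT"}
    for _ in range(n - 1):
        nxt = defaultdict(int)
        for (last, run, gc), cnt in states.items():
            for b in "ACGT":
                new_run = run + 1 if b == last else 1
                if new_run <= max_run:
                    nxt[(b, new_run, gc + int(b in "GC"))] += cnt
        states = nxt
    return sum(
        cnt for (_, _, gc), cnt in states.items()
        if gc_count is None or gc == gc_count
    )
```

Without the balance restriction, the counts agree with the standard recursion $N(n) = 3\sum_{i=1}^{m} N(n-i)$ for $n > m$, with $N(n) = 4^n$ for $n \le m$; for example, $m = 2$ gives $N(3) = 60$ and $N(4) = 228$.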
We propose coding techniques that limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, \epsilon > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, (ii) GC-content constraint: the GC-content of each codeword is within $[0.5 - \epsilon, 0.5 + \epsilon]$, (iii) Error-correction: each codeword is capable of correcting a single deletion, or single insertion, or single substitution error. For practical values of $\ell$ and $\epsilon$, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation.
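To illustrate what single-deletion correction involves, the sketch below uses the classical binary Varshamov-Tenengolts (VT) syndrome. This is a swapped-in textbook technique, not the quaternary construction of the paper, and the decoder is a brute-force search chosen for clarity rather than the linear-time decoder. All function names are hypothetical.

```python
def vt_syndrome(bits):
    """VT syndrome: sum of i * x_i (positions counted from 1) modulo n + 1."""
    return sum(i * b for i, b in enumerate(bits, start=1)) % (len(bits) + 1)


def correct_single_deletion(received, syndrome):
    """Recover the unique length-(len(received) + 1) word with the given syndrome.

    Tries every single-bit insertion and keeps the candidates whose VT syndrome
    matches; the VT code property guarantees exactly one such candidate.
    """
    received = tuple(received)
    candidates = set()
    for pos in range(len(received) + 1):
        for bit in (0, 1):
            cand = received[:pos] + (bit,) + received[pos:]
            if vt_syndrome(cand) == syndrome:
                candidates.add(cand)
    (word,) = candidates  # fails loudly if the input is not one deletion away
    return word
```

For example, the word `(1, 0, 1, 1, 0, 0)` has syndrome 1, and deleting any single bit followed by `correct_single_deletion(..., 1)` returns the original word.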
The process of DNA-based data storage (DNA storage for short) can be mathematically modelled as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced. It generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes, referred to as sequence-subset codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive some upper bounds on the size of the sequence-subset codes, including a Singleton-like bound and a Plotkin-like bound. We also propose some constructions, which imply lower bounds on the size of such codes.
We analyze codes for DNA-based data storage which account for the maximum homopolymer repetition length and GC-AT balance. We present a new precoding method for translating words with a maximum run of k zeros into words with a maximum homopolymer run m = k + 1, which is attractive for securing GC-AT balance. Generating functions are presented for enumerating the number of n-symbol k-constrained codewords of given GC-AT balance. Various efficient constructions are presented of block codes that satisfy a combined balance and maximum homopolymer run constraint.
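One standard way to turn a bound on runs of zeros into a bound on homopolymer runs is differential precoding modulo 4, sketched below. Whether this matches the precoding method of the paper in detail is an assumption. The function names, the base ordering `"ACGT"`, and the `start` symbol are all hypothetical choices made for illustration. Each output base is the running sum of the input symbols modulo 4, so a zero repeats the previous base and any nonzero symbol changes it.

```python
BASES = "ACGT"  # assumed mapping of 0, 1, 2, 3 to bases


def precode(symbols, start=0):
    """Map symbols in {0,1,2,3} to a strand via y_i = (y_{i-1} + x_i) mod 4."""
    out, prev = [], start
    for x in symbols:
        prev = (prev + x) % 4
        out.append(BASES[prev])
    return "".join(out)


def unprecode(strand, start=0):
    """Invert precode via x_i = (y_i - y_{i-1}) mod 4."""
    out, prev = [], start
    for base in strand:
        cur = BASES.index(base)
        out.append((cur - prev) % 4)
        prev = cur
    return out
```

With this mapping, `precode([1, 0, 0, 2, 0, 3])` gives `"CCCTTG"`: the input has a maximum run of k = 2 zeros and the output has a maximum homopolymer run of m = 3.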