Project

DNA-based storage

Goal: Our aim is to design high-capacity DNA-based storage systems. Focus is on reliability and coding efficiency.


Project log

Kees Schouhamer Immink
added 3 research items
In this paper, we first propose coding techniques for DNA-based data storage that account for the maximum homopolymer runlength and the GC-content. In particular, for arbitrary $\ell, \epsilon > 0$, we propose simple and efficient $(\epsilon, \ell)$-constrained encoders that transform binary sequences into DNA base sequences (codewords) satisfying the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$; (ii) GC-content constraint: the GC-content of each codeword is within $[0.5 - \epsilon, 0.5 + \epsilon]$. For practical values of $\ell$ and $\epsilon$, our codes achieve higher rates than the existing results in the literature. We further design efficient $(\epsilon, \ell)$-constrained codes with error-correction capability. Specifically, the designed codes satisfy the runlength constraint and the GC-content constraint, and can correct a single edit (i.e., a single deletion, insertion, or substitution) and its variants. To the best of our knowledge, no such codes had been constructed prior to this work.
The process of DNA-based data storage (DNA storage for short) can be mathematically modelled as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced. It generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes, referred to as sequence-subset codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive some upper bounds on the size of the sequence-subset codes, including a tight bound for a special case, a Singleton-like bound, and a Plotkin-like bound. We also propose some constructions, including an optimal construction for that special case, which imply lower bounds on the size of such codes.
We propose coding techniques that simultaneously limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given ℓ, ϵ > 0, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy all of the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most ℓ; (ii) GC-content constraint: the GC-content of each codeword is within [0.5 - ϵ, 0.5 + ϵ]; (iii) Error-correction: each codeword is capable of correcting a single deletion, a single insertion, or a single substitution error. While various combinations of these properties have been considered in the literature, this work provides generalizations of code constructions that satisfy all the properties with arbitrary parameters ℓ and ϵ. Furthermore, for practical values of ℓ and ϵ, we show that our encoders achieve higher rates than existing results in the literature and approach capacity. Our methods have low encoding/decoding complexity and limited error propagation.
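The runlength and GC-content constraints stated in these abstracts can be checked mechanically. The following is a minimal sketch of such a checker, not the encoders from the papers; the function names and the choice of Python are assumptions made for illustration only.

```python
from itertools import groupby


def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of identical nucleotides (0 for an empty strand)."""
    return max((sum(1 for _ in group) for _, group in groupby(strand)), default=0)


def gc_content(strand: str) -> float:
    """Fraction of symbols in the strand that are G or C."""
    if not strand:
        raise ValueError("GC-content is undefined for an empty strand")
    return sum(base in "GC" for base in strand) / len(strand)


def satisfies_constraints(strand: str, ell: int, eps: float) -> bool:
    """True if the longest run is at most ell and the GC-content is within eps of 0.5."""
    return max_homopolymer_run(strand) <= ell and abs(gc_content(strand) - 0.5) <= eps
```

For instance, `satisfies_constraints("ACGTACGT", 1, 0.0)` holds because no symbol repeats and exactly half of the symbols are G or C. A checker of this kind only verifies codewords; it says nothing about how to encode or about edit-error correction.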
Kees Schouhamer Immink
added a research item
The world is facing a looming data storage crisis, and Singapore can help to avert it. In 2018, people watched 4.33 million videos on YouTube, sent 159 million e-mails and posted 49,000 photographs on Instagram every minute of the year, among other data uses. At this rate, we will produce 418 zettabytes of data this year, according to the World Economic Forum, and even more in the future. A single zettabyte is a trillion gigabytes.
Kees Schouhamer Immink
added a research item
Subblock energy-constrained codes (SECCs) have recently attracted attention due to various applications in communication systems, such as simultaneous energy and information transfer. In a SECC, each codeword is divided into smaller subblocks, and every subblock is constrained to carry sufficient energy. In this work, we study SECCs under more general constraints, namely bounded SECCs and sliding-window constrained codes (SWCCs), and propose two methods to construct such codes with low redundancy and linear-time complexity, based on Knuth’s balancing technique and the sequence replacement technique. For certain code parameters, our methods incur only one redundant bit.
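As background for the Knuth balancing technique mentioned above, here is a minimal sketch of the classical idea for binary words of even length: flipping a suitable prefix always yields a word with equally many zeros and ones. It is not the SECC or SWCC construction of the paper, and the function names are hypothetical.

```python
def knuth_balance(bits):
    """Return (k, word) where word is bits with its first k bits flipped and is balanced."""
    n = len(bits)
    if n % 2:
        raise ValueError("word length must be even")
    weight = sum(bits)
    for k in range(n + 1):
        if weight == n // 2:
            return k, [1 - b for b in bits[:k]] + list(bits[k:])
        # Flipping bit k changes the weight by +1 (for a 0) or -1 (for a 1).
        weight += 1 - 2 * bits[k]


def knuth_unbalance(k, word):
    """Undo knuth_balance by flipping the first k bits back."""
    return [1 - b for b in word[:k]] + list(word[k:])
```

The weight moves in steps of one from w (no bits flipped) to n - w (all bits flipped), so it must pass through n/2 and the loop always returns. In a complete scheme the index k has to be conveyed to the decoder as well, which is where the redundancy comes from.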
Kees Schouhamer Immink
added a research item
We describe properties and constructions of constraint-based codes for DNA-based data storage which account for the maximum repetition length and AT/GC balance. Generating functions and approximations are presented for computing the number of sequences with maximum repetition length and AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of codes that account for both the maximum repetition length and AT/GC balance are derived.
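The counting problem described above can be illustrated for small lengths with a direct dynamic program. This is a sketch under our own assumptions: the paper works with generating functions and approximations, whereas the code below simply enumerates states, and the name `count_strands` is hypothetical.

```python
from collections import defaultdict


def count_strands(n: int, m: int, gc: int) -> int:
    """Count length-n strands over A, C, G, T with maximum run <= m and exactly gc G/C symbols."""
    if m < 1:
        raise ValueError("maximum run length m must be at least 1")
    if n == 0:
        return 1 if gc == 0 else 0
    # State: (last symbol, current run length, number of G/C symbols so far) -> count.
    states = {(s, 1, int(s in "GC")): 1 for s in "ACGT"}
    for _ in range(n - 1):
        nxt = defaultdict(int)
        for (last, run, g), ways in states.items():
            for s in "ACGT":
                new_run = run + 1 if s == last else 1
                if new_run <= m:  # discard strands whose run would exceed m
                    nxt[(s, new_run, g + int(s in "GC"))] += ways
        states = nxt
    return sum(ways for (_, _, g), ways in states.items() if g == gc)
```

With `gc = n // 2` this counts the exactly balanced strands; summing over a range of `gc` values gives the count for a tolerance window around one half. The base-2 logarithm of such a count, divided by n, bounds the rate of any block code of length n under the combined constraint.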
Kees Schouhamer Immink
added a research item
We propose coding techniques that limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, {\epsilon} > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, (ii) GC-content constraint: the GC-content of each codeword is within $[0.5-{\epsilon}, 0.5+{\epsilon}]$, (iii) Error-correction: each codeword is capable of correcting a single deletion, a single insertion, or a single substitution error. For practical values of $\ell$ and ${\epsilon}$, we show that our encoders achieve much higher rates than existing results in the literature and approach capacity. Our methods have low encoding/decoding complexity and limited error propagation.
Kees Schouhamer Immink
added a research item
The process of DNA-based data storage (DNA storage for short) can be mathematically modelled as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced. It generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes, referred to as sequence-subset codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive some upper bounds on the size of the sequence-subset codes, including a Singleton-like bound and a Plotkin-like bound. We also propose some constructions, which imply lower bounds on the size of such codes.
Kees Schouhamer Immink
added a research item
We analyze codes for DNA-based data storage that account for the maximum homopolymer repetition length and GC-AT balance. We present a new precoding method for translating words with a maximum run of k zeros into words with a maximum homopolymer run m = k + 1, which is attractive for securing GC-AT balance. Generating functions are presented for enumerating the number of n-symbol k-constrained codewords of given GC-AT balance. Various efficient constructions are presented of block codes that satisfy a combined balance and maximum homopolymer run constraint.
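One well-known precoder with the stated property (a maximum run of k zeros becomes a maximum homopolymer run of k + 1) is the running sum modulo 4. The sketch below assumes that form; the abstract does not spell out the paper's precoder, so the mapping, the symbol order in `ALPHABET`, and the function names are all illustrative assumptions.

```python
ALPHABET = "ACGT"  # assumed symbol order; any fixed order works for this sketch


def precode(x, q=4, start=0):
    """Running sum modulo q: y[i] = (y[i-1] + x[i]) mod q, with y[-1] = start."""
    y, prev = [], start
    for symbol in x:
        prev = (prev + symbol) % q
        y.append(prev)
    return y


def unprecode(y, q=4, start=0):
    """Inverse mapping: x[i] = (y[i] - y[i-1]) mod q."""
    x, prev = [], start
    for symbol in y:
        x.append((symbol - prev) % q)
        prev = symbol
    return x


def to_strand(y):
    """Map symbols 0..3 to nucleotides using ALPHABET."""
    return "".join(ALPHABET[s] for s in y)
```

Two consecutive output symbols are equal exactly when the corresponding input symbol is zero, so a run of k zeros in the input produces a run of k + 1 identical nucleotides. For example, the input `[1, 0, 0, 3, 0, 2]` (k = 2) maps to `[1, 1, 1, 0, 0, 2]`, whose longest run is 3. Because each recovered symbol depends on only two neighbouring output symbols, a single substitution in the strand affects at most two recovered symbols.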
Kees Schouhamer Immink
added a research item
We describe properties and constructions of constraint-based codes for DNA-based data storage that account for the maximum repetition length and AT balance. We present algorithms for computing the number of sequences with maximum repetition length and AT balance constraints. We present efficient routines for translating binary runlength-limited and/or balanced strings into DNA strands. We show that the implementation of AT-balanced codes is straightforwardly accomplished with binary balanced codes. We present codes that account for both the maximum repetition length and AT balance.
Kui Cai
added a research item
The process of DNA data storage can be mathematically modelled as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced. It generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes, termed sequence-subset codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive some upper bounds on the size of the sequence-subset codes, including a Singleton-like bound and a Plotkin-like bound. We also propose some constructions, which imply lower bounds on the size of such codes.
Kees Schouhamer Immink
added a research item
This paper presents novel soft-decision decoding (SDD) of error correction codes (ECCs) that substantially improves the reliability of DNA-based data storage systems compared with conventional hard-decision decoding (HDD). We propose a simplified system model for DNA-based data storage according to the major characteristics and different types of errors associated with the prevailing DNA synthesis and sequencing technologies. We compute analytically the error-free probability of each sequenced DNA oligonucleotide (oligo), based on which the soft-decision log-likelihood ratio (LLR) of each oligo can be derived. We apply the proposed SDD algorithms to the DNA Fountain scheme, which achieves the highest information density so far in the literature. Simulation results show that SDD achieves an error rate improvement of two to three orders of magnitude over HDD, thus demonstrating its potential to improve the information density of DNA-based data storage systems.
Kees Schouhamer Immink
added a research item
We consider coding techniques that limit the lengths of homopolymer runs in strands of nucleotides used in DNA-based mass data storage systems. We compute the maximum number of user bits that can be stored per nucleotide when a maximum homopolymer runlength constraint is imposed. We describe simple and efficient implementations of coding techniques that avoid the occurrence of long homopolymers, and the rates of the constructed codes are close to the theoretical maximum. The proposed sequence replacement method for k-constrained q-ary data yields a significant improvement in coding redundancy over the prior art sequence replacement method for k-constrained binary data. Using a simple transformation, standard binary maximum runlength-limited sequences can be transformed into maximum runlength-limited q-ary sequences, which opens the door to applying the vast prior art binary code constructions to DNA-based storage.
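The maximum number of user bits per nucleotide under a homopolymer runlength limit can be computed numerically. The sketch below relies on the standard counting recursion for q-ary sequences with maximum run m, N(n) = (q - 1)(N(n - 1) + ... + N(n - m)), whose growth rate is the largest real root of x^m = (q - 1)(x^(m-1) + ... + 1). The function name and the bisection approach are our own choices, not taken from the paper.

```python
from math import log2


def homopolymer_capacity(m: int, q: int = 4) -> float:
    """Bits per symbol for q-ary sequences whose maximum homopolymer run is m."""
    if m < 1:
        raise ValueError("maximum run length m must be at least 1")

    def f(x: float) -> float:
        # Characteristic polynomial of the counting recursion.
        return x**m - (q - 1) * sum(x**i for i in range(m))

    # f(q - 1) <= 0 and f(q) = 1 > 0, so the root lies in [q - 1, q].
    lo, hi = float(q - 1), float(q)
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return log2(lo)
```

For q = 4 and m = 1 the root is exactly 3, giving log2(3), or about 1.585 bits per nucleotide. For m = 3 the value is about 1.98, already close to the unconstrained 2 bits per nucleotide, which is consistent with the observation that avoiding long homopolymers costs little rate.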
Kees Schouhamer Immink
added an update
We have designed coding techniques that limit the lengths of homopolymer runs in strands of nucleotides used in DNA-based mass data storage systems. We designed simple and efficient implementations of coding techniques that avoid the occurrence of long homopolymers, and the rates of the constructed codes are close to the theoretical maximum.
 
Kees Schouhamer Immink
added a project goal
Our aim is to design high-capacity DNA-based storage systems. Focus is on reliability and coding efficiency.