Article

# Capacity-Approaching Constrained Codes With Error Correction for DNA-Based Data Storage


## Abstract

We propose coding techniques that simultaneously limit the length of homopolymer runs, satisfy the GC-content constraint, and correct a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given ℓ, ϵ > 0, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy all of the following properties:

- Runlength constraint: the maximum homopolymer run in each codeword is at most ℓ.
- GC-content constraint: the GC-content of each codeword is within [0.5 − ϵ, 0.5 + ϵ].
- Error correction: a single deletion, a single insertion, or a single substitution error in any codeword can be corrected.

While various combinations of these properties have been considered in the literature, this work generalizes code constructions so that they satisfy all of the properties for arbitrary parameters ℓ and ϵ. Furthermore, for practical values of ℓ and ϵ, we show that our encoders achieve higher rates than existing results in the literature and approach capacity. Our methods have low encoding/decoding complexity and limited error propagation.
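The two constraints in the abstract can be checked mechanically. The following is a minimal sketch of such a checker, not the authors' encoder; the function names `max_homopolymer_run`, `gc_content` and `is_constrained` are illustrative assumptions.

```python
def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of identical symbols (0 for an empty strand)."""
    longest = current = 0
    previous = None
    for base in strand:
        # Extend the current run or start a new one.
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_content(strand: str) -> float:
    """Fraction of symbols in the strand that are G or C."""
    if not strand:
        raise ValueError("GC-content is undefined for an empty strand")
    return sum(base in "GC" for base in strand) / len(strand)


def is_constrained(strand: str, ell: int, eps: float) -> bool:
    """True if the strand meets both the runlength and GC-content constraints."""
    return (max_homopolymer_run(strand) <= ell
            and abs(gc_content(strand) - 0.5) <= eps)
```

For example, `is_constrained("ACGTACGT", 3, 0.1)` holds, while a strand such as `"AAAACGCG"` fails the runlength constraint for ℓ = 3.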


... Then, the authors of [9] proposed a binary shifted Varshamov-Tenengolts (SVT) code to obtain an improved construction that still corrects a burst of exactly b errors but with lower redundancy than the one in [8]. Motivated by the efficient correction and low redundancy of VT codes, the authors of [10,11] proposed linear-time encoders that implement the binary VT code while satisfying the homopolymer run and guanine-cytosine (GC)-content constraints [12,13], which are among the important properties of a DNA strand. However, the binary VT codes used in these linear-time encoders correct a single nucleotide error in a DNA strand. ...
... However, the binary VT codes used in these linear-time encoders correct a single nucleotide error in a DNA strand. With an approach similar to [10,11], but to correct a burst of exactly b deletions or insertions of DNA symbols, the authors of [14] applied the encoder of the binary modified VT code in [6] and the binary SVT codes in [9]. Then, by interleaving bits of binary VT codewords and binary SVT codewords, the work [9] obtained a binary code construction that can correct a burst error of size exactly 2b, and finally, the codeword of this construction was translated to DNA symbols. ...
... We note that the main concern of this work is the error-correction capability of the quaternary code design, not the constraints in DNA storage. It is assumed that the combined design of the error-correcting code and the constraints of DNA storage has already been handled by other algorithms [11,24]. The main contributions of this paper can be summarized as follows. ...
Article
Due to the properties of DNA data storage, the errors that occur in DNA strands make error correction an important and challenging task. In this paper, a new quaternary code design suitable for DNA storage is proposed to correct at most two consecutive deletion or insertion errors. Decoding algorithms for the proposed codes are presented for the cases of one and two deletion or insertion errors, and it is proved that the proposed code can correct at most two consecutive errors. Moreover, lower and upper bounds on the cardinality of the proposed quaternary codes are evaluated, and the redundancy of the proposed code is shown to be roughly $2\log_4 8n$.
... Some recent work on DNA based storage systems presented end-to-end solutions including molecular implementations (Organick et al. 2018; Erlich and Zielinski 2017; Organick et al. 2020). Others tackled certain components: error correction schemes (Nguyen et al. 2021; Lenz et al. 2020) or read clustering and sequence reconstruction algorithms (Shinkar et al. 2019; Sabary et al. 2020; Chan- ...
[Figure 1: DNA based storage and sequence reconstruction. An information-bearing file is encoded into a set of DNA sequences. ...]
... Overlapping k-mer sequence representation is widely used in computational biology for various sequence related tasks including classical sequence alignment algorithms and modern machine learning tasks (Ng 2017;Ji et al. 2021;Ng 2017;Orenstein, Wang, and Berger 2016). ...
Preprint
As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA based storage a potential solution for the future of data storage. Several studies introduced DNA based storage systems with high information density (petabytes/gram). However, DNA synthesis and sequencing technologies yield erroneous outputs. Algorithmic approaches for correcting these errors depend on reading multiple copies of each sequence and result in excessive reading costs. The unprecedented success of Transformers as a deep learning architecture for language modeling has led to their repurposing for a variety of tasks across various domains. In this work, we propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA based data storage. We address the error correction process as a self-supervised sequence-to-sequence task and use synthetic noise injection to train the model using only the decoded reads. Our approach exploits the inherent redundancy of each decoded file to learn its underlying structure. To demonstrate our proposed approach, we encode text, image and code-script files to DNA, produce errors with a high-fidelity error simulator, and reconstruct the original files from the noisy reads. Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand compared to state-of-the-art algorithms using 2-3 copies. This is the first demonstration of using deep learning models for single-read reconstruction in DNA based storage, which allows for a reduction in the overall cost of the process. We show that this approach is applicable to various domains and can be generalized to new domains as well.
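The synthetic noise injection mentioned above can be pictured with a simple insertion/deletion/substitution channel. The sketch below is an assumption-laden illustration: the independent per-symbol error model, the rates `p_sub`, `p_del`, `p_ins`, and the name `inject_noise` are hypothetical and do not describe the preprint's actual simulator or training pipeline.

```python
import random

BASES = "ACGT"


def inject_noise(strand: str, p_sub: float, p_del: float, p_ins: float,
                 rng: random.Random) -> str:
    """Corrupt a strand with independent per-symbol edit errors (assumed model)."""
    noisy = []
    for base in strand:
        # With probability p_ins, insert a random base before this one.
        if rng.random() < p_ins:
            noisy.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop the base
        if r < p_del + p_sub:
            # substitution: replace with one of the three other bases
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)


# A (noisy, clean) training pair for a sequence-to-sequence model.
rng = random.Random(0)
clean = "ACGTTGCAACGT"
pair = (inject_noise(clean, 0.02, 0.02, 0.02, rng), clean)
```

Pairs of this kind let a model learn the mapping from corrupted reads back to clean strands without any externally labeled data.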
... The authors of [16] proposed an efficient systematic encoding algorithm for nonbinary SIDEC codes whose codewords consist of parity symbols and message symbols. A binary EVT code combined with a GC-balanced constrained code [12] and a binary VT code combined with a GC-balanced and r run-length limited constrained code [17] were applied for DNA storage systems. Additionally, a new binary SIDEC code combined with the maximum runlength r constrained code and efficient systematic encoding algorithm was proposed in [18]. ...
... Additionally, a new binary SIDEC code combined with the maximum runlength r constrained code and efficient systematic encoding algorithm was proposed in [18]. All of these studies [8], [9], [11], [12], [17] focused on binary coding schemes for DNA storage systems. So far, there have been few studies focusing on q-ary coding schemes [19] for DNA storage systems. ...
Article
Due to the advantages of high information densities and longevity, DNA storage systems have begun to attract a lot of attention. However, common obstacles to DNA storage are caused by insertion, deletion, and substitution errors occurring in DNA synthesis and sequencing. In this paper, we first explain a method to convert binary data into general maximum run-length r sequences with specific length construction, which can be used as the message sequence of our proposed code. Then, we propose a new single insertion/deletion nonbinary systematic error correction code and its corresponding encoding algorithm. For the proposed code, we design the fixed maximum run-length r in the parity sequence of the proposed code to be three. Additionally, the last parity symbol and the first message symbol are always different. Hence, the overall maximum run-length r of the output codeword is guaranteed to be three when the maximum run-length of the message sequence is three. Finally, we determine the feasibility of the proposed encoding algorithm, verify successful decoding when a single insertion/deletion error occurs in the codeword, and present the comparison results with relevant works.
... Initially, the problem was how to map directly between genetic alphabets and ASCII characters (Church et al., 2012; Jiménez-Sánchez, 2013). In later phases of development, the problem shifted to addressable memory mapping, error detection and error correction, and GC-content and homopolymer run-length constraints (Blawat et al., 2016; Bornholt et al., 2016; Erlich and Zielinski, 2017; Goldman et al., 2013; Grass et al., 2015; Immink and Cai, 2018; Nguyen et al., 2020; Organick, 2018; Song et al., 2018; Wang, 2019; Yazdi et al., 2015; Yazdi et al., 2017). Although 100% recovery of encoded information is theoretically possible, the existing algorithms are not universally accepted and cannot compensate for naturally occurring defects, such as the loss of individual bases or complete sequences during reading or writing of NAM, that is, during DNA sequencing and synthesis, respectively. ...
Article
Over time, real-world problems have steered society toward being data driven. This in turn has caused exponential growth in global data generation, creating a new challenge: storing and managing such an enormous amount of data. Other research has found that this growth will put immense strain on the availability of silicon and magnetic memories in the near future. Good data compression algorithms consequently became a prime focus of the computing community. Compression slowed the growing scarcity of data storage but could not solve the problem at its root. It therefore became necessary to develop an efficient alternative data storage technology, and Nucleic Acid Memory (NAM) was brought forward as a promising solution. Meanwhile, research on expanding the genetic alphabet beyond the standard nucleotides has recently emerged and drawn significant attention in the biological sciences. This led to the creation of the Extended Nucleic Acid Memory (ENAM). However, the initial proposals did not consider the real-life sequencing constraints, namely the homopolymer runlength and GC-content constraints. Moreover, the literature shows that encoding techniques that account for the sequencing constraints pay a penalty in digital data capacity per nucleotide. In this context, taking inspiration from cryptography, this work proposes a new encoding algorithm, Cipher Constrained Encoding (CCE), which can handle both sequencing constraints without significantly penalizing the data capacity per nucleotide. A few properties of the Vigenère and Vernam ciphers have been adapted and integrated with basic statistical techniques, which proved very efficient in checking for violations of the sequencing constraints. Furthermore, experiments have been carried out, and the results, reported and compared with previous works in the literature, demonstrate promising outcomes.
... We start with an explicit formula for the capacity of $S_k$ (see, e.g., [15], [31]). ...
Preprint
In recent years, DNA has emerged as a potentially viable storage technology. DNA synthesis, which refers to the task of writing the data into DNA, is perhaps the most costly part of existing storage systems. Accordingly, this high cost and low throughput limit the practical use of available DNA synthesis technologies. It has been found that the homopolymer run (i.e., the repetition of the same nucleotide) is a major factor affecting the synthesis and sequencing errors. Quite recently, [26] studied the role of batch optimization in reducing the cost of large scale DNA synthesis, for a given pool $\mathcal{S}$ of random quaternary strings of fixed length. Among other things, it was shown that the asymptotic cost savings of batch optimization are significantly greater when the strings in $\mathcal{S}$ contain repeats of the same character (homopolymer run of length one), as compared to the case where strings are unconstrained. Following the lead of [26], in this paper, we take a step forward towards the theoretical understanding of DNA synthesis, and study the homopolymer run of length $k\geq1$. Specifically, we are given a set of DNA strands $\mathcal{S}$, randomly drawn from a natural Markovian distribution modeling a general homopolymer run length constraint, that we wish to synthesize. For this problem, we prove that for any $k\geq 1$, the optimal reference strand, minimizing the cost of DNA synthesis, is, perhaps surprisingly, the periodic sequence $\overline{\mathsf{ACGT}}$. It turns out that tackling the homopolymer constraint of length $k\geq2$ is a challenging problem; our main technical contribution is the representation of the DNA synthesis process as a certain constrained system, for which string techniques can be applied.
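To make the notion of synthesis cost concrete, the sketch below assumes the common array-synthesis model in which the machine steps through a reference strand one base per cycle, and a strand is extended in a cycle only if its next required base matches. Under that assumption, the cost of a strand is the length of the shortest prefix of the periodic reference that contains it as a subsequence. The model details and the name `synthesis_cycles` are illustrative assumptions, not taken from the preprint.

```python
def synthesis_cycles(strand: str, period: str = "ACGT") -> int:
    """Cycles needed to synthesize `strand` against the periodic reference.

    Assumed model: cycle t offers base period[t % len(period)], and the
    strand grows only when the offered base is the next one it needs.
    """
    position = 0  # number of reference bases consumed so far
    for base in strand:
        if base not in period:
            raise ValueError(f"base {base!r} never appears in the reference")
        # Skip cycles until the reference offers the required base.
        while period[position % len(period)] != base:
            position += 1
        position += 1  # this cycle appends the base
    return position


def pool_cycles(pool, period: str = "ACGT") -> int:
    """All strands are synthesized in parallel, so the slowest one sets the cost."""
    return max((synthesis_cycles(s, period) for s in pool), default=0)
```

Here `synthesis_cycles("ACGT")` is 4, whereas `synthesis_cycles("AA")` is 5 because the second `A` must wait a full period.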
... They provide approaches for probabilistic error correction and symmetric encryption, respectively. ... the impact of these two constraints, some coding algorithms [42,65] approached solutions while maintaining a relatively high code rate from binary messages to DNA strings. ...
Preprint
DNA has been considered a promising medium for storing digital information. Despite the biochemical progress in DNA synthesis and sequencing, novel coding algorithms need to be constructed under the specific constraints of DNA-based storage. Many functional operations and storage carriers were introduced in recent years, bringing in various biochemical constraints including but not confined to long single-nucleotide repeats and abnormal GC content. Existing coding algorithms are not applicable or are unstable under the growing number of local biochemical constraints and their combinations. In this paper, we design a graph-based architecture, named SPIDER-WEB, to generate corresponding graph-based algorithms under arbitrary local biochemical constraints. These generated coding algorithms could be used to encode arbitrary digital data as DNA sequences directly or serve as a benchmark for the follow-up construction of coding algorithms. To further address recovery and security issues in the storage field, it also provides pluggable algorithmic patches based on the generated coding algorithms: path-based correcting and mapping shuffling. They provide approaches for probabilistic error correction and symmetric encryption, respectively.
... For DNA-based storage, we are interested in codewords that are ϵ-balanced and ℓ-runlength limited for sufficiently small ϵ = o(1), ℓ = o(n). Constructions of codes that obey both the ϵ-balanced and ℓ-runlength limited constraints for practical values of ϵ and ℓ have been presented in our recent work [19]. ...
Preprint
With the total amount of worldwide data skyrocketing, the global data storage demand is predicted to grow to 1.75 × 10¹⁴ GB by 2025. Traditional storage methods have difficulties keeping pace, given that current storage media have a maximum density of 10³ GB/mm³. As such, the data production will far exceed the capacity of currently available storage methods. The costs of maintaining and transferring data, as well as the limited lifespans and significant data losses associated with current technologies also demand novel solutions for information storage. Nature offers a powerful alternative, storing the information that defines living organisms in unique orders of four bases (A, T, C, G) located in molecules called deoxyribonucleic acid (DNA). DNA molecules as information carriers have many advantages over traditional storage media. Their high storage density, potentially low maintenance cost, ease of synthesis and chemical modification, make them an ideal alternative for information storage. To this end, rapid progress has been made over the past decade by exploiting user-defined DNA materials to encode information. In this review, we discuss the most recent advances of DNA-based data storage with a major focus on the challenges that remain in this promising field, including the current intrinsic low speed in data writing and reading and the high cost per byte stored. Alternatively, data storage relying on DNA nanostructures (as opposed to DNA sequence) as well as on other combinations of nanomaterials and biomolecules have been proposed, with promising technological and economic advantages. In summarizing the advances that have been made and underlining the challenges that remain, we provide a roadmap for ongoing research in this rapidly growing field, which will enable the development of technological solutions to the global demand for superior storage methodologies.
Conference Paper
In this paper, we first propose coding techniques for DNA-based data storage which account for the maximum homopolymer runlength and the GC-content. In particular, for arbitrary $\ell,\epsilon > 0$, we propose simple and efficient $(\epsilon, \ell)$-constrained encoders that transform binary sequences into DNA base sequences (codewords) that satisfy the following properties: • Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, • GC-content constraint: the GC-content of each codeword is within $[0.5 - \epsilon, 0.5 + \epsilon]$. For practical values of $\ell$ and $\epsilon$, our codes achieve higher rates than the existing results in the literature. We further design efficient $(\epsilon,\ell)$-constrained codes with error-correction capability. Specifically, the designed codes satisfy the runlength constraint, the GC-content constraint, and can correct a single edit (i.e., a single deletion, insertion, or substitution) and its variants. To the best of our knowledge, no such codes were constructed prior to this work.
Preprint
Motivated by the application of fountain codes in DNA-based data storage systems, in this paper, we consider the decoding of fountain codes when the received symbols have a chance of being incorrect. Unlike the conventional scenario where the received symbols are all error-free, maximum likelihood (ML) decoding and maximum a posteriori probability (MAP) decoding are not practical in this situation due to their exponentially high complexity. Instead, we propose an efficient algorithm, referred to as the basis-finding algorithm (BFA), for decoding. We develop a straightforward implementation as well as an efficient implementation of the BFA, both of which have polynomial time complexity. Moreover, to investigate the frame error rate (FER) of the BFA, we derive theoretical bounds and also perform extensive simulations. Both the analytical and simulation results reveal that the BFA can perform very well for decoding fountain codes with erroneous received symbols.
Article
We describe properties and constructions of constraint-based codes for DNA-based data storage which account for the maximum repetition length and AT/GC balance. Generating functions and approximations are presented for computing the number of sequences with maximum repetition length and AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of codes that account for both the maximum repetition length and AT/GC balance are derived.
Article
We propose a coding method to transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following two properties: • Run-length constraint: the maximum run-length of each symbol in each codeword is at most three; • GC-content constraint: the GC-content of each codeword is close to 0.5, say between 0.4 and 0.6. The proposed coding scheme is motivated by the problem of designing codes for DNA-based data storage systems, where the binary digital data is stored in synthetic DNA base sequences. Existing methods either achieve code rates not greater than 1.78 bits per nucleotide or lead to severe error propagation. Our method achieves a rate of 1.9 bits per DNA base with low encoding/decoding complexity and limited error propagation.
Article
Owing to its longevity and enormous information density, DNA, the molecule encoding biological information, has emerged as a promising archival storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules that are stored in an unordered way, and can only be read by sampling from this DNA pool. Moreover, imperfections in writing (synthesis), reading (sequencing), storage, and handling of the DNA, in particular amplification via PCR, lead to a loss of DNA molecules and induce errors within the molecules. In order to design DNA storage systems, a qualitative and quantitative understanding of the errors and the loss of molecules is crucial. In this paper, we characterize those error probabilities by analyzing data from our own experiments as well as from experiments of two different groups. We find that errors within molecules are mainly due to synthesis and sequencing, while imperfections in handling and storage lead to a significant loss of sequences. The aim of our study is to help guide the design of future DNA data storage systems by providing a quantitative and qualitative understanding of the DNA data storage channel.
Article
Synthetic DNA is durable and can encode digital data with high density, making it an attractive medium for data storage. However, recovering stored data on a large-scale currently requires all the DNA in a pool to be sequenced, even if only a subset of the information needs to be extracted. Here, we encode and store 35 distinct files (over 200 MB of data), in more than 13 million DNA oligonucleotides, and show that we can recover each file individually and with no errors, using a random access approach. We design and validate a large library of primers that enable individual recovery of all files stored within the DNA. We also develop an algorithm that greatly reduces the sequencing read coverage required for error-free decoding by maximizing information from all sequence reads. These advances demonstrate a viable, large-scale system for DNA data storage and retrieval.
Article
We consider coding techniques that limit the lengths of homopolymer runs in strands of nucleotides used in DNA-based mass data storage systems. We compute the maximum number of user bits that can be stored per nucleotide when a maximum homopolymer runlength constraint is imposed. We describe simple and efficient implementations of coding techniques that avoid the occurrence of long homopolymers, and the rates of the constructed codes are close to the theoretical maximum. The proposed sequence replacement method for k-constrained q-ary data yields a significant improvement in coding redundancy over the prior art sequence replacement method for k-constrained binary data. Using a simple transformation, standard binary maximum runlength limited sequences can be transformed into maximum runlength limited q-ary sequences, which opens the door to applying the vast prior art binary code constructions to DNA-based storage.
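The number of user bits per nucleotide under a runlength constraint can be explored numerically by counting the admissible sequences. The sketch below is a standard dynamic program over the length of the trailing run, offered as an illustration rather than the article's derivation; the names `count_runlength_limited` and `bits_per_symbol` are assumptions.

```python
import math


def count_runlength_limited(n: int, ell: int, q: int = 4) -> int:
    """Number of q-ary sequences of length n whose runs all have length <= ell."""
    if ell < 1:
        raise ValueError("ell must be at least 1")
    if n == 0:
        return 1
    # counts[r] = number of sequences of the current length ending in a run of length r
    counts = [0] * (ell + 1)
    counts[1] = q
    for _ in range(n - 1):
        total = sum(counts)
        # A different symbol starts a new run; the same symbol extends a run by one.
        counts = [0, (q - 1) * total] + counts[1:ell]
    return sum(counts)


def bits_per_symbol(n: int, ell: int, q: int = 4) -> float:
    """Finite-length rate log2(count) / n, which tends to the capacity as n grows."""
    return math.log2(count_runlength_limited(n, ell, q)) / n
```

For DNA (q = 4) with ℓ = 3, the count for n = 4 is 252, since only the four constant strands of length 4 are excluded; evaluating `bits_per_symbol` for increasing n shows how little rate the constraint costs.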
Article
DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.
Article
This paper studies codes that correct a burst of deletions or insertions. Namely, a code will be called a b-burst-deletion/insertion-correcting code if it can correct a burst of deletions/insertions of any b consecutive bits. While the lower bound on the redundancy of such codes was shown by Levenshtein to be asymptotically log(n) + b − 1, the redundancy of the best code construction by Cheng et al. is b(log(n/b + 1)). In this paper, we close this gap and provide codes with redundancy at most log(n) + (b − 1) log(log(n)) + b − log(b). We first show that the models of insertions and deletions are equivalent and thus it is enough to study codes correcting a burst of deletions. We then derive a non-asymptotic upper bound on the size of b-burst-deletion-correcting codes and extend the burst deletion model to two more cases: 1) a deletion burst of at most b consecutive bits and 2) a deletion burst of size at most b (not necessarily consecutive). We extend our code construction for the first case and study the second case for b = 3, 4.
Article
Background DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. Results We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. Conclusions The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.
Article
Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information or were not amenable to scaling-up, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information of 5.2 × 10⁶ bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
Article
Digital information is accumulating at an astounding rate, straining our ability to store and archive it. DNA is among the most dense and stable information media known. The development of new technologies in both DNA synthesis and sequencing make DNA an increasingly feasible digital storage medium. We developed a strategy to encode arbitrary digital information in DNA, wrote a 5.27-megabit book using DNA microchips, and read the book by using next-generation DNA sequencing.
Article
An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this article, we investigate codes that correct either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length $n$. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with $\lceil \log n \rceil + O(\log \log n)$ redundant bits, while the other corrects a single indel with $\lceil \log n \rceil + 2$ redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits. Over the DNA alphabet, we impose an additional constraint, the $\mathtt{GC}$-balanced constraint, and require that exactly half of the symbols of any DNA codeword be either $\mathtt{C}$ or $\mathtt{G}$. In particular, via a modification of Knuth's balancing technique, we provide a linear-time map that translates binary messages into $\mathtt{GC}$-balanced codewords, and the resulting codebook is able to correct a single indel or a single edit. These are the first known constructions of $\mathtt{GC}$-balanced codes that correct a single indel or a single edit.
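As background for single-indel correction, the sketch below shows the classical binary Varshamov-Tenengolts (VT) code with Levenshtein's single-deletion decoder. It is the textbook binary construction, not the quaternary or GC-balanced encoders of this article, and the names `vt_syndrome` and `vt_correct_deletion` are illustrative assumptions.

```python
def vt_syndrome(bits):
    """VT syndrome: sum of i * x_i over positions i = 1..n, taken modulo n + 1."""
    return sum(i * b for i, b in enumerate(bits, start=1)) % (len(bits) + 1)


def vt_correct_deletion(received, a):
    """Recover the length-n word with VT syndrome `a` from a copy missing one bit."""
    n = len(received) + 1
    weight = sum(received)
    checksum = sum(i * b for i, b in enumerate(received, start=1))
    deficiency = (a - checksum) % (n + 1)
    if deficiency <= weight:
        # A 0 was deleted; reinsert it with exactly `deficiency` ones to its right.
        pos, ones_seen = len(received), 0
        while ones_seen < deficiency:
            pos -= 1
            ones_seen += received[pos]
        return received[:pos] + [0] + received[pos:]
    # A 1 was deleted; reinsert it with exactly `deficiency - weight - 1` zeros to its left.
    zeros_needed = deficiency - weight - 1
    pos, zeros_seen = 0, 0
    while zeros_seen < zeros_needed:
        zeros_seen += 1 - received[pos]
        pos += 1
    return received[:pos] + [1] + received[pos:]
```

For the word `[1, 0, 1, 1, 0]` the syndrome is 2; deleting any one bit and calling `vt_correct_deletion` with `a = 2` restores the original word.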
Article
Motivated by applications in DNA-based storage, we introduce the new problem of code design in the Damerau metric. The Damerau metric is a generalization of the Levenshtein distance which, in addition to deletion, insertion and substitution errors, also accounts for adjacent transposition edits. We first provide constructions for codes that may correct either a single deletion or a single adjacent transposition and then proceed to extend these results to codes that can simultaneously correct a single deletion and multiple adjacent transpositions. We conclude with constructions for joint block deletion and adjacent block transposition error-correcting codes.
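To make the role of adjacent transpositions concrete, here is a small sketch of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance. This variant is swapped in because it is short to write; it can exceed the unrestricted distance on some inputs and is not taken from the cited paper. The function name is an assumption.

```python
def osa_distance(s, t):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent symbols, each at unit cost.
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i symbols of s
    for j in range(n + 1):
        d[0][j] = j  # insert all j symbols of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition: the last two symbols are swapped.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


swapped = osa_distance("ACGT", "AGCT")  # one adjacent transposition (C and G)
```

Under plain Levenshtein distance the same pair would cost two substitutions, which is why a transposition-aware distance is the natural setting for these codes.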
Article
DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10⁶ bytes in DNA oligonucleotides and perfectly retrieved the information from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10¹⁵ retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
Article
In this article, we study properties and algorithms for constructing sets of 'constant weight' codewords with bipolar symbols, where the sum of the symbols is a constant $q$, $q \neq 0$. We show various code constructions that extend Knuth's balancing vector scheme, $q = 0$, to the case where $q > 0$. We compute the redundancy of the new coding methods. Index Terms: Balanced code, channel capacity, constrained code, magnetic recording, optical recording. I. INTRODUCTION Let $q$ be an integer. A set $C$, which is a subset of $\left\{ w = (w_1, w_2, \ldots, w_n) \in \{-1, +1\}^n : \sum_{i=1}^{n} w_i = q \right\}$
Article
The sequence replacement technique converts an input sequence into a constrained sequence in which a prescribed subsequence is forbidden to occur. Several coding algorithms are presented that use this technique for the construction of maximum run-length limited sequences. The proposed algorithms show how all forbidden subsequences can be successively or iteratively removed to obtain a constrained sequence and how special subsequences can be inserted at predefined positions in the constrained sequence to represent the indices of the positions where the forbidden subsequences were removed. Several modifications are presented to reduce the impact of transmission errors on the decoding operation, and schemes to provide error control are discussed as well. The proposed algorithms can be implemented efficiently, and the rates of the constructed codes are close to their theoretical maximum. As such, the proposed algorithms are of interest for storage systems and data networks.
Article
Coding schemes in which each codeword contains equally many zeros and ones are constructed in such a way that they can be efficiently encoded and decoded.
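The core step of Knuth's balancing idea can be sketched as follows: flip a prefix of the word and grow the prefix until the number of ones equals the number of zeros. This is a simplified illustration under the assumption of even length; it returns the prefix length separately and omits the balanced encoding of that index which a complete scheme appends. The function names are hypothetical.

```python
def knuth_balance(bits):
    """Return (k, word) where word is `bits` with its first k bits flipped and is balanced.

    The weight after flipping k bits changes by exactly one per step and goes
    from w at k = 0 to n - w at k = n, so for even n it must hit n / 2.
    """
    n = len(bits)
    if n % 2:
        raise ValueError("length must be even")
    weight = sum(bits)
    for k in range(n + 1):
        if weight == n // 2:
            return k, [1 - b for b in bits[:k]] + list(bits[k:])
        if k < n:
            weight += 1 - 2 * bits[k]  # flipping a 0 adds a one, flipping a 1 removes one
    raise AssertionError("unreachable for even-length input")


def knuth_restore(k, word):
    """Undo the balancing step by flipping the first k bits again."""
    return [1 - b for b in word[:k]] + list(word[k:])


message = [1, 1, 1, 1, 0, 1, 1, 0]
k, balanced = knuth_balance(message)
```

For this message the smallest suitable prefix length is 2, and `knuth_restore(k, balanced)` returns the original word.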
Article
Two factors are mainly responsible for the stability of the DNA double helix: base pairing between complementary strands and stacking between adjacent bases. By studying DNA molecules with solitary nicks and gaps we measure temperature and salt dependence of the stacking free energy of the DNA double helix. For the first time, DNA stacking parameters are obtained directly (without extrapolation) for temperatures from below room temperature to close to melting temperature. We also obtain DNA stacking parameters for different salt concentrations ranging from 15 to 100 mM Na+. From stacking parameters of individual contacts, we calculate base-stacking contribution to the stability of A•T- and G•C-containing DNA polymers. We find that temperature and salt dependences of the stacking term fully determine the temperature and the salt dependence of DNA stability parameters. For all temperatures and salt concentrations employed in present study, base-stacking is the main stabilizing factor in the DNA double helix. A•T pairing is always destabilizing and G•C pairing contributes almost no stabilization. Base-stacking interaction dominates not only in the duplex overall stability but also significantly contributes into the dependence of the duplex stability on its sequence.
Article
We derive a simple algorithm for the ranking of binary sequences of length n and weight w. This algorithm is then used for source encoding a memoryless binary source that generates 0's with probability q and 1's with probability p = 1 - q.
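A standard way to rank fixed-weight binary sequences in lexicographic order uses binomial coefficients. The sketch below shows that approach; it is not necessarily the exact algorithm of the cited paper, the function names are assumptions, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def rank(bits):
    """Lexicographic index of `bits` among all sequences of the same length and weight."""
    n, remaining, index = len(bits), sum(bits), 0
    for i, b in enumerate(bits):
        if b:
            # Skip every sequence that has a 0 here and places all the
            # remaining ones in the n - i - 1 later positions.
            index += comb(n - i - 1, remaining)
            remaining -= 1
    return index


def unrank(index, n, w):
    """Inverse of rank: the sequence of length n and weight w at the given index."""
    bits = []
    for i in range(n):
        skipped = comb(n - i - 1, w)  # sequences with a 0 at position i
        if index >= skipped:
            bits.append(1)
            index -= skipped
            w -= 1
        else:
            bits.append(0)
    return bits


position = rank([0, 1, 1, 0])  # order for n = 4, w = 2: 0011, 0101, 0110, 1001, 1010, 1100
```

Because the index ranges over exactly $\binom{n}{w}$ values, it can be stored in about $\log_2 \binom{n}{w}$ bits, which is what makes ranking useful for source coding.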
Article
A balanced code with $r$ check bits and $k$ information bits is a binary code of length $k+r$ and cardinality $2^k$ such that each codeword is balanced; that is, it has $[(k+r)/2]$ 1's and $[(k+r)/2]$ 0's. This paper contains new methods to construct efficient balanced codes. To design a balanced code, an information word with a low number of 1's or 0's is compressed and then balanced using the saved space. On the other hand, an information word having almost the same number of 1's and 0's is encoded using the single maps defined by Knuth's (1986) complementation method. Three different constructions are presented. Balanced codes with $r$ check bits and $k$ information bits with $k \le 2^{r+1} - 2$, $k \le 3 \times 2^r - 8$, and $k \le 5 \times 2^r - 10r + c(r)$, $c(r) \in \{-15, -10, -5, 0, +5\}$, are given, improving the constructions found in the literature. In some cases, the first two constructions have a parallel coding scheme.
Article
For $n > 0$, $d \ge 0$, $n \equiv d \pmod 2$, let $K(n, d)$ denote the minimal cardinality of a family $V$ of $\pm 1$ vectors of dimension $n$, such that for any $\pm 1$ vector $w$ of dimension $n$ there is a $v \in V$ such that $|v \cdot w| \le d$, where $v \cdot w$ is the usual scalar product of $v$ and $w$. A generalization of a simple construction due to D.E. Knuth (1986) shows that $K(n, d) \le \lceil n/(d+1) \rceil$. A linear algebra proof is given here that this construction is optimal, so that $K(n, d) = \lceil n/(d+1) \rceil$ for all $n \equiv d \pmod 2$. This construction and its extensions have applications to communication theory, especially to the construction of signal sets for optical data links.
D. Dubé, W. Song, and K. Cai, "DNA codes with run-length limitation and Knuth-like balancing of the GC contents," in Proc. Symp. Inf. Theory Appl. (SITA), Kagoshima, Japan, Nov. 2019, pp. 1-3.
T. T. Nguyen, K. Cai, and K. A. S. Immink, "Binary subblock energy-constrained codes: Knuth's balancing and sequence replacement techniques," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2020, pp. 37-41.