Abstract

The process of DNA-based data storage (DNA storage for short) can be mathematically modelled as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced; it generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes, referred to as sequence-subset codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive some upper bounds on the size of sequence-subset codes, including a tight bound for a special case, a Singleton-like bound and a Plotkin-like bound. We also propose some constructions, including an optimal construction for that special case, which imply lower bounds on the size of such codes.
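The abstract does not spell out the metric, so the sketch below rests on an assumed definition: for two sets of length-L sequences, match every sequence of the smaller set to a distinct sequence of the larger set, sum the Hamming distances of the matched pairs, charge L for each unmatched sequence, and take the minimum over all such matchings. The function names are hypothetical, and the brute-force search is only practical for very small sets; an assignment-problem solver would be the natural replacement.

```python
from itertools import permutations


def hamming(u, v):
    """Number of positions in which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(u, v))


def sequence_subset_distance(X1, X2, L):
    """Distance between two sets of length-L sequences (ASSUMED definition).

    Minimum, over all injections from the smaller set into the larger one,
    of the summed Hamming distances, plus L per unmatched sequence.
    """
    small, large = sorted((list(X1), list(X2)), key=len)
    penalty = L * (len(large) - len(small))
    # Brute force: try every ordered choice of len(small) targets in `large`.
    best = min(
        sum(hamming(x, y) for x, y in zip(small, image))
        for image in permutations(large, len(small))
    )
    return best + penalty
```

Under this assumption the distance between two singleton sets reduces to the ordinary Hamming distance, which is consistent with the abstract's description of the metric as a generalization of it.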

... The unordered manner of data storing in DNA storage systems motivates the study of coding problem over sets, following several papers on this topic [3], [6], [9], [10], [14]-[16], [18]. In [10], the authors studied the storage model where the errors are a combination of loss of sequences, as well as symbol errors inside the sequences, such as insertions, deletions, and substitutions. ...
... Several explicit code constructions are also proposed for various error regimes. Later, [14], [16] adapted the model of [10]. In [14], it was assumed that no sequences are lost and a given number of symbol substitutions occur. ...
... Codes which have logarithmic redundancy in both the number of sequences and the length of the sequences have been proposed therein. In [16], a new metric was introduced to establish a uniform framework to combat both sequence loss and symbol substitutions, and Singleton-like and Plotkin-like bounds on the cardinality of optimal codes were derived. A related model was discussed in [6], where unordered multisets are received and errors are counted by sequences, no matter how many symbol errors occur inside the sequences. ...
Preprint
Error-correcting codes over sets, with applications to DNA storage, are studied. The DNA-storage channel receives a set of sequences, and produces a corrupted version of the set, including sequence loss, symbol substitution, symbol insertion/deletion, and limited-magnitude errors in symbols. Various parameter regimes are studied. New bounds on code parameters are provided, which improve upon known bounds. New codes are constructed, at times matching the bounds up to lower-order terms or small constant factors.
... According to the above-given steps, the transform domain JND model of the digital multimedia video image can be established [14]. The digital multimedia video image is divided into several subimages, and the digital multimedia video image is quantized and encoded by the following formula (5). ...
Article
In order to effectively improve the quality of video image transmission, this paper proposes a method of digital multimedia video image coding. The transmission of digital multimedia video image fault-tolerant coding requires sparse decomposition of a digital multimedia video image to obtain the linear form of the image and complete the transmission of video image fault-tolerant coding. The traditional method of fault-tolerant coding is based on human visual characteristics but ignores the linear form of the digital multimedia video image, which leads to the unsatisfactory effect of coding and transmission. In this paper, a fault-tolerant coding method based on wavelet transform and vector quantization is proposed to decompose and reconstruct digital multimedia video images. The smoothness of wavelet transform can remove visual redundancy; the decomposed image is vector quantized. The mean square deviation method and the similar scalar optimal quantization method are used to select and calculate the image vector, construct the overcomplete database of a digital multimedia video image, and normalize it; the digital multimedia video image is sparsely decomposed by asymmetric atoms, and a linear representation of the image is obtained. According to the above-given operations, we can master the distribution range and law of pixels and realize fault-tolerant coding. The experimental results show that when the number of iterations is 15, the CR index is the same, PSNR increases by 8.7%, coding is 23.7% faster and decoding is 15% faster. Conclusion. The proposed method can not only improve the speed of fault-tolerant coding but also improve the quality of video image transmission.
... In addition, Song et al. introduced a new metric, namely the sequence-subset distance, which generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors. It establishes a unified framework for the design of ECCs for the DNA storage channel [87]. These studies show that the error-correcting ability of such codes is entirely determined by their minimum distance. ...
Article
Deoxyribonucleic acid (DNA) is increasingly emerging as a serious medium for long-term archival data storage because of its remarkable high-capacity, high-storage-density characteristics and its lasting ability to store data for thousands of years. Various encoding algorithms are generally required to store digital information in DNA and to maintain data integrity. Indeed, since DNA is the information carrier, its performance under different processing and storage conditions significantly impacts the capabilities of the data storage system. Therefore, the design of a DNA storage system must meet specific design considerations to be less error-prone, robust and reliable. In this work, we summarize the general processes and technologies employed when using synthetic DNA as a storage medium. We also share the design considerations for sustainable engineering to include viability. We expect this work to provide insight into how sustainable design can be used to develop an efficient and robust synthetic DNA-based storage system for long-term archiving.
... In accordance, upper bounds on the cardinality of optimal codes correcting any given number of errors were derived, and were asymptotically evaluated in the regime in which the alphabet size of the molecules is linear with M (this is a slightly different scaling than what is considered for DNA storage channels). In [43] the redundancy required to be added in order to guarantee full protection against substitution errors was upper bounded. In [44] a sequence-subset distance has been proposed as a generalization of the Hamming distance suitable for the analysis of DNA storage channels, and generalizations of Plotkin and Singleton upper bounds on the maximal size of the codes were derived. In [26], Gilbert-Varshamov lower bounds and sphere packing upper bounds on the achievable cardinality of DNA storage codes were derived. ...
Preprint
The DNA storage channel is considered, in which $M$ deoxyribonucleic acid (DNA) molecules comprising each codeword are stored without order, then sampled $N$ times with replacement, and then sequenced over a discrete memoryless channel. For a constant coverage depth $M/N$ and molecule length scaling $\Theta(\log M)$, lower (achievability) and upper (converse) bounds on the capacity of the channel, as well as a lower (achievability) bound on the reliability function of the channel are provided. Both the lower and upper bounds on the capacity generalize a bound which was previously known to hold only for the binary symmetric sequencing channel, and only under certain restrictions on the molecule length scaling and the crossover probability parameters. When specialized to the binary symmetric sequencing channel, these restrictions are completely removed for the lower bound and are significantly relaxed for the upper bound. The lower bound on the reliability function is achieved under a universal decoder, and reveals that the dominant error event is that of outage, that is, the event in which the capacity of the channel induced by the DNA molecule sampling operation does not support the target rate.
... A significant drawback of this model is that it does not account for possible insertion/deletion errors. In addition to the models described above, another recently proposed channel model for DNA storage [84] is one in which a set of M sequences of length L enters the channel, and the output is a set that now contains M  sequences, some of which have been added. Moreover, some of the original sequences may be missing at the channel output, and substitution errors may occur in some of them. ...
Article
Introduction: Currently, we witness an explosive growth in the amount of information produced by humanity. This raises new fundamental problems of its efficient storage and processing. Commonly used magnetic, optical, and semiconductor information storage devices have several drawbacks related to small information density and limited durability. One of the promising novel approaches to solving these problems is DNA-based data storage. Purpose: An overview of modern DNA-based storage systems and related information-theoretic problems. Results: The current state of the art of DNA-based storage systems is reviewed. Types of errors occurring in them as well as corresponding error-correcting codes are analyzed. The disadvantages of these codes are shown, and possible pathways for improvement are mentioned. Proposed information-theoretic models of DNA-based storage systems are analyzed, and their limitations are highlighted. In conclusion, the main obstacles to practical implementation of DNA-based storage systems are formulated, which can potentially be overcome using the information-theoretic methods considered in this overview.
Article
The work aims to study the application of Deoxyribonucleic Acid (DNA) multi-source data storage in Digital Twins (DT). Through the investigation of the research status of DT and DNA computing, the work puts forward the concept of DNA multi-source data storage for DT. Raptor code is improved from the design direction of degree distribution function, and six degree function distribution schemes are proposed in turn in the process of describing the research method. Additionally, a quaternary dynamic Huffman coding method is applied in DNA data storage, combined with the improved concatenated code as the error correction code. Considering the content of cytosine deoxynucleotide (C) and guanine deoxynucleotide (G) and the distribution of homopolymer in DNA storage, the work proposes and verifies an improved concatenated code algorithm Deoxyribonucleic Acid-Improved Concatenated code (DNA-ICC). The results show that while the Signal-to-Noise Ratio (SNR) increases, the Bit Error Rate (BER) decreases gradually and the trend is similar. But the anti-interference ability of the degree distribution function optimized by the probability transfer method is better. The BER of DNA-ICC scheme decreases with the decrease of error probability, which is stronger than other error correction codes. Compared with the original concatenated code, it saves at least 1.65 s, and has a good control effect on homopolymer. When the size of homopolymer exceeds 4 nt, the probability of homopolymer is only 0.44%. The proposed Quaternary dynamic Huffman code and concatenated error correction code have excellent performance.
Conference Paper
Advances in synthesis and sequencing technologies have made DNA macromolecules an attractive medium for digital information storage. Compared with the ex vivo method that stores data in a non-biological environment, there have been considerations and attempts to store data in living organisms, also known as the in vivo method or live DNA due to several magnificent advantages. Data stored in this medium is prone to errors arising from various mutations such as point mutations (when there is a change in a single nucleotide in DNA, i.e. deletion, insertion, or substitution) or chromosomal alterations (that change the structure of a segment of DNA, i.e. tandem duplication, inversion). In this paper, we provide error-correcting codes for errors caused by inversions, that reverse the order of a segment of DNA. In particular, we construct families of codes for correcting single inversion of a fixed length or variable length up to a given constant k with at most log n+Θ(1) redundant bits, where the redundancy matches the optimal value up to only a constant additive term. Moreover, our codes remain order-optimal, i.e. the redundancy is at most log n + o(log n), when k = o(log n). The redundancy can be further reduced when k ≪ 3.
Article
We propose a coding method to transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following two properties:
• Run-length constraint: the maximum run-length of each symbol in each codeword is at most three;
• GC-content constraint: the GC-content of each codeword is close to 0.5, say between 0.4 and 0.6.
The proposed coding scheme is motivated by the problem of designing codes for DNA-based data storage systems, where the binary digital data is stored in synthetic DNA base sequences. Existing schemes in the literature either achieve code rates not greater than 1.78 bits per nucleotide or lead to severe error propagation. Our method achieves a rate of 1.9 bits per DNA base with low encoding/decoding complexity and limited error propagation.
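The two constraints above are easy to state as a check on a candidate codeword. The following is a minimal sketch of such a checker, not the coding method of the cited work; the function names and default thresholds are illustrative assumptions taken from the numbers quoted in the abstract (runs of at most three, GC-content between 0.4 and 0.6).

```python
from itertools import groupby


def max_run_length(seq):
    """Length of the longest run of identical consecutive symbols."""
    return max((sum(1 for _ in group) for _, group in groupby(seq)), default=0)


def gc_content(seq):
    """Fraction of symbols in a non-empty sequence that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq, max_run=3, gc_low=0.4, gc_high=0.6):
    """True if seq is non-empty and meets both constraints."""
    return (
        bool(seq)
        and max_run_length(seq) <= max_run
        and gc_low <= gc_content(seq) <= gc_high
    )
```

For instance, `satisfies_constraints("ACGTACGTAC")` holds, while a codeword containing the run `AAAA` fails the run-length test regardless of its GC-content.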
Article
Synthetic DNA is durable and can encode digital data with high density, making it an attractive medium for data storage. However, recovering stored data on a large-scale currently requires all the DNA in a pool to be sequenced, even if only a subset of the information needs to be extracted. Here, we encode and store 35 distinct files (over 200 MB of data), in more than 13 million DNA oligonucleotides, and show that we can recover each file individually and with no errors, using a random access approach. We design and validate a large library of primers that enable individual recovery of all files stored within the DNA. We also develop an algorithm that greatly reduces the sequencing read coverage required for error-free decoding by maximizing information from all sequence reads. These advances demonstrate a viable, large-scale system for DNA data storage and retrieval.
Article
We consider coding techniques that limit the lengths of homopolymer runs in strands of nucleotides used in DNA-based mass data storage systems. We compute the maximum number of user bits that can be stored per nucleotide when a maximum homopolymer runlength constraint is imposed. We describe simple and efficient implementations of coding techniques that avoid the occurrence of long homopolymers, and the rates of the constructed codes are close to the theoretical maximum. The proposed sequence replacement method for k-constrained q-ary data yields a significant improvement in coding redundancy over the prior art sequence replacement method for k-constrained binary data. Using a simple transformation, standard binary maximum runlength limited sequences can be transformed into maximum runlength limited q-ary sequences, which opens the door to applying the vast prior art binary code constructions to DNA-based storage.
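To make the "user bits per nucleotide" quantity concrete, one can count the q-ary sequences of length n whose runs never exceed a given length and take the base-2 logarithm per symbol. The sketch below does this with a small dynamic program; it is an illustration under that counting assumption, with hypothetical function names, and the finite-n value it returns only approaches the limiting rate as n grows.

```python
from math import log2


def count_run_limited(n, q=4, max_run=3):
    """Count length-n q-ary sequences in which every run has length <= max_run."""
    if n == 0:
        return 1
    # ends[r - 1] = number of sequences of the current length whose final
    # run has length exactly r. At length 1 every symbol is a run of 1.
    ends = [q] + [0] * (max_run - 1)
    for _ in range(n - 1):
        total = sum(ends)
        # Either start a new run with one of the q - 1 other symbols,
        # or extend the final run by one (runs already at max_run are dropped).
        ends = [(q - 1) * total] + ends[:-1]
    return sum(ends)


def bits_per_symbol(n, q=4, max_run=3):
    """Finite-length estimate of the storable bits per symbol (n >= 1)."""
    return log2(count_run_limited(n, q, max_run)) / n
```

With q = 4 and runs of at most three, only the four constant sequences are excluded at length 4, so the count there is 4^4 - 4 = 252.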
Article
Motivated by communication channels in which the transmitted sequences are subject to random permutations, as well as by DNA storage systems, we study the error control problem in settings where the information is stored/transmitted in the form of multisets of symbols from a given finite alphabet. A general channel model is assumed in which the transmitted multisets are potentially impaired by insertions, deletions, substitutions, and erasures of symbols. Several constructions of error-correcting codes for this channel are described, and bounds on the size of optimal codes correcting any given number of errors derived. The construction based on the notion of Sidon sets in finite Abelian groups is shown to be optimal, in the sense of minimal asymptotic code redundancy, for any "error radius" and any alphabet size. It is also shown to be optimal in the sense of maximal code cardinality in various cases.
Article
We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh density archival and rewritable storage applications.
Article
Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information or were not amenable to scaling-up, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information of 5.2 × 10⁶ bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
Article
Digital information is accumulating at an astounding rate, straining our ability to store and archive it. DNA is among the most dense and stable information media known. The development of new technologies in both DNA synthesis and sequencing makes DNA an increasingly feasible digital storage medium. We developed a strategy to encode arbitrary digital information in DNA, wrote a 5.27-megabit book using DNA microchips, and read the book by using next-generation DNA sequencing.
Article
The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal. The surprising result in this paper is that while the information vector is sliced into a set of unordered strings, the amount of redundant bits that are required to correct errors is asymptotically equal to the amount required in the classical error correcting paradigm.
Article
Combinatorial mathematics has been pursued since time immemorial, and at a reasonable scientific level at least since Leonhard Euler (1707-1783). It rendered many services to both pure and applied mathematics. Then along came the prince of computer science with its many mathematical problems and needs - and it was combinatorics that best fitted the glass slipper held out. Moreover, it has been gradually more and more realized that combinatorics has all sorts of deep connections with "mainstream areas" of mathematics, such as algebra, geometry and probability. This is why combinatorics is now a part of the standard mathematics and computer science curriculum. This book is an introduction to extremal combinatorics - a field of combinatorial mathematics which has undergone a period of spectacular growth in recent decades. The word "extremal" comes from the nature of problems this field deals with: if a collection of finite objects (numbers, graphs, vectors, sets, etc.) satisfies certain restrictions, how large or how small can it be? For example, how many people can we invite to a party where among each three people there are two who know each other and two who don't know each other? An easy Ramsey-type argument shows that at most five persons can attend such a party. Or, suppose we are given a finite set of nonzero integers, and are asked to mark an as large as possible subset of them under the restriction that the sum of any two marked integers cannot be marked.
Article
We consider the communication of information in the presence of synchronization errors. Specifically, we consider permutation channels in which a transmitted codeword $x = (x_1, \ldots, x_n)$ is corrupted by a permutation $\pi \in S_n$ to yield the received word $y = (y_1, \ldots, y_n)$, where $y_i = x_{\pi(i)}$. We initiate the study of worst case (or zero-error) communication over permutation channels that distort the information by applying permutations $\pi$, which are limited to displacing any symbol by at most $r$ locations, i.e., permutations $\pi$ with weight at most $r$ in the $\ell_\infty$-metric. We present direct and recursive constructions, as well as bounds on the rate of such channels for binary and general alphabets. Specific attention is given to the case of $r = 1$.
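The channel action and the displacement limit described above can be illustrated in a few lines. This is a small sketch with hypothetical function names, using 0-based positions and a permutation given as a list whose i-th entry is π(i); the weight is taken to be the largest displacement max |π(i) - i|, matching the "at most r locations" description.

```python
def linf_weight(pi):
    """Largest displacement max |pi(i) - i| of a permutation of 0..n-1."""
    if sorted(pi) != list(range(len(pi))):
        raise ValueError("pi must be a permutation of 0..n-1")
    return max((abs(p - i) for i, p in enumerate(pi)), default=0)


def apply_permutation(x, pi):
    """Received word y with y[i] = x[pi[i]]."""
    return [x[p] for p in pi]
```

Swapping only adjacent positions, as in `[1, 0, 3, 2]`, gives weight 1, which is the r = 1 case singled out in the abstract.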
Article
Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and tradeoffs of DNA-based storage systems under a simple model, motivated by current technological constraints on DNA synthesis and sequencing. Our model captures two key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered way and (2) the data is read by randomly sampling from this DNA pool. Under this model, we characterize the storage capacity, and show that a simple index-based coding scheme is optimal.
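The abstract names an index-based scheme without describing it, so the following is only a generic illustration of the idea of indexing under stated assumptions: the data is a bit string, each molecule carries a fixed-width binary index followed by a payload, and no errors or losses occur. The function names and the layout are hypothetical.

```python
def encode_with_index(bits, payload_len):
    """Split a bit string into payloads, each prefixed by a binary index.

    Returns the list of indexed strings and the index width in bits.
    """
    chunks = [bits[i:i + payload_len] for i in range(0, len(bits), payload_len)]
    index_len = max(1, (len(chunks) - 1).bit_length())
    molecules = [format(i, f"0{index_len}b") + chunk for i, chunk in enumerate(chunks)]
    return molecules, index_len


def decode_with_index(molecules, index_len):
    """Recover the bit string from indexed strings received in any order."""
    ordered = sorted(molecules, key=lambda m: int(m[:index_len], 2))
    return "".join(m[index_len:] for m in ordered)
```

Because each string carries its own position, the decoder can restore the original order by sorting, at the cost of `index_len` bits per molecule that carry no user data.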
Article
DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10⁶ bytes in DNA oligonucleotides and perfectly retrieved the information from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10¹⁵ retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
Conference Paper
Demand for data storage is growing exponentially, but the capacity of existing storage media is not keeping up. Using DNA to archive data is an attractive possibility because it is extremely dense, with a raw limit of 1 exabyte/mm³ (10⁹ GB/mm³), and long-lasting, with observed half-life of over 500 years. This paper presents an architecture for a DNA-based archival storage system. It is structured as a key-value store, and leverages common biochemical techniques to provide random access. We also propose a new encoding scheme that offers controllable redundancy, trading off reliability for density. We demonstrate feasibility, random access, and robustness of the proposed encoding with wet lab experiments involving 151 kB of synthesized DNA and a 42 kB random-access subset, and simulation experiments of larger sets calibrated to the wet lab experiments. Finally, we highlight trends in biotechnology that indicate the impending practicality of DNA storage for much larger datasets.
Article
We report on a strong capacity boost in storing digital data in synthetic DNA. In principle, synthetic DNA is an ideal medium to archive digital data for very long times because the achievable data density and longevity outperform today's digital data storage media by far. On the other hand, neither the synthesis, nor the amplification and the sequencing of DNA strands can be performed error-free today and in the foreseeable future. In order to make synthetic DNA available as a digital data storage medium, specifically tailored forward error correction schemes have to be applied. For the purpose of realizing DNA data storage, we have developed an efficient and robust forward-error-correcting scheme adapted to the DNA channel. We based the design of the needed DNA channel model on data from a proof-of-concept conducted in 2012 by a team from the Harvard Medical School [1]. Our forward error correction scheme is able to cope with all error types of today's DNA synthesis, amplification and sequencing processes, e.g. insertion, deletion, and swap errors. In a successful experiment, we were recently able to store and retrieve, error-free, 22 MByte of digital data in synthetic DNA. The residual error probability found is already of the same order as in hard disk drives and can be easily improved further. This proves the feasibility of using synthetic DNA as a long-term digital data storage medium.
Article
Information, such as text printed on paper or images projected onto microfilm, can survive for over 500 years. However, the storage of digital information for time frames exceeding 50 years is challenging. Here we show that digital information can be stored on DNA and recovered without errors for considerably longer time frames. To allow for the perfect recovery of the information, we encapsulate the DNA in an inorganic matrix, and employ error-correcting codes to correct storage-related errors. Specifically, we translated 83 kB of information to 4991 DNA segments, each 158 nucleotides long, which were encapsulated in silica. Accelerated aging experiments were performed to measure DNA decay kinetics, which show that data can be archived on DNA for millennia under a wide range of conditions. The original information could be recovered error free, even after treating the DNA in silica at 70 °C for one week. This is thermally equivalent to storing information on DNA in central Europe for 2000 years.
Article
We consider the problem of storing and retrieving information from synthetic DNA media. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on their collection of substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes.
Article
In this paper we present algorithms for the solution of the general assignment and transportation problems. In Section 1, a statement of the algorithm for the assignment problem appears, along with a proof for the correctness of the algorithm. The remarks which constitute the proof are incorporated parenthetically into the statement of the algorithm. Following this appears a discussion of certain theoretical aspects of the problem. In Section 2, the algorithm is generalized to one for the transportation problem. The algorithm of that section is stated as concisely as possible, with theoretical remarks omitted. 1. THE ASSIGNMENT PROBLEM. The personnel-assignment problem is the problem of choosing an optimal assignment of n men to n jobs, assuming that numerical ratings are given for each man’s performance on each job. An optimal assignment is one which makes the sum of the men’s ratings for their assigned jobs a maximum. There are n! possible assignments (of which several may be optimal), so that it is physically impossible, except
Article
Preface 1. Basic concepts of linear codes 2. Bounds on size of codes 3. Finite fields 4. Cyclic codes 5. BCH and Reed-Solomon codes 6. Duadic codes 7. Weight distributions 8. Designs 9. Self-dual codes 10. Some favourite self-dual codes 11. Covering radius and cosets 12. Codes over Z4 13. Codes from algebraic geometry 14. Convolutional codes 15. Soft decision and iterative decoding Bibliography Index.
Article
By refining Hamming's geometric sphere-packing model a new upper bound for nonsystematic binary error-correcting codes is found. Only combinatorial arguments are used. Whereas Hamming's upper bound estimate for e-error-correcting codes involved a count of all points ≤ e Hamming distance from the set of code points, the model is extended here to include consideration of points which are > e distance away from the code set. The percentage improvement from Hamming's bounds is sometimes quite sizable for cases of two or more errors to be corrected. The new bound improves on Wax's bounds in all but four of the cases he lists.
Article
Two n-digit sequences, called "points," of binary digits are said to be at distance d if exactly d corresponding digits are unlike in the two sequences. The construction of sets of points, called codes, in which some specified minimum distance is maintained between pairs of points is of interest in the design of self-checking systems for computing with or transmitting binary digits, the minimum distance being the minimum number of digital errors required to produce an undetected error in the system output. Previous work in the field had established general upper bounds for the number of n-digit points in codes of minimum distance d with certain properties. This paper gives new results in the field in the form of theorems which permit systematic construction of codes for given n, d; for some n, d, the codes contain the greatest possible numbers of points.
Article
We introduce and solve some new problems of efficient reconstruction of an unknown sequence from its versions distorted by errors of a certain type. These erroneous versions are considered as outputs of repeated transmissions over a channel, either a combinatorial channel defined by the maximum number of permissible errors of a given type, or a discrete memoryless channel. We are interested in the smallest N such that N erroneous versions always suffice to reconstruct a sequence of length n, either exactly or with a preset accuracy and/or with a given probability. We are also interested in simple reconstruction algorithms. Complete solutions for combinatorial channels with some types of errors of interest in coding theory, namely, substitutions, transpositions, deletions, and insertions of symbols are given. For these cases, simple reconstruction algorithms based on majority and threshold principles and their nontrivial combination are found. In general, for combinatorial channels the considered problem is reduced to a new problem of reconstructing a vertex of an arbitrary graph with the help of the minimum number of vertices in its metrical ball of a given radius. A certain sufficient condition for solution of this problem is presented. For a discrete memoryless channel, the asymptotic behavior of the minimum number of repeated transmissions which are sufficient to reconstruct any sequence of length n within Hamming distance d with error probability ε is found when d/n and ε tend to 0 as n→∞. A similar result for the continuous channel with discrete time and additive Gaussian noise is also obtained
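As a rough illustration of the majority principle mentioned above, the sketch below reconstructs a sequence from several equal-length copies corrupted only by substitutions, by taking the most frequent symbol in each position. It is a simplified example with a hypothetical function name, not the algorithms of the cited work, and it does not handle deletions, insertions, or transpositions.

```python
from collections import Counter


def majority_reconstruct(reads):
    """Position-wise majority vote over equal-length erroneous copies."""
    if not reads:
        raise ValueError("at least one read is required")
    if len({len(r) for r in reads}) != 1:
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one column of symbols per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))
```

The vote recovers a position correctly whenever strictly more than half of the copies agree with the original symbol there.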
C. Rashtchian et al., "Clustering billions of reads for DNA data storage," in Proc. NIPS, 2017, pp. 3360-3371.