Soft-Decision Decoding for DNA-Based
Data Storage
Mu Zhang, Kui Cai, Kees A. Schouhamer Immink, and Pingping Chen
Science and Math Cluster, Singapore University of Technology and Design, Singapore 487372
Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands
Abstract—This paper presents novel soft-decision decoding (SDD) of error correction codes (ECCs) that substantially improves the reliability of DNA-based data storage systems compared with conventional hard-decision decoding (HDD). We propose a
simplified system model for DNA-based data storage according
to the major characteristics and different types of errors
associated with the prevailing DNA synthesis and sequencing
technologies. We compute analytically the error-free probability
of each sequenced DNA oligonucleotide (oligo), based on which
the soft-decision log-likelihood ratio (LLR) of each oligo can be
derived. We apply the proposed SDD algorithms to the recently
proposed DNA Fountain scheme. Simulation results show that
SDD achieves an error rate improvement of two to three orders
of magnitude over HDD, thus demonstrating its potential to
improve the information density of DNA-based data storage systems.

I. Introduction

DNA-based data storage has emerged as a promising
candidate for the storage of Big Data. It features extremely high data storage density (for example, 1 exabyte/mm^3), long-lasting stability of hundreds to a thousand years, and ultra-low power consumption for operation and maintenance [1], [2]. Information storage in DNA has been demonstrated by
several research groups [3]-[8]. Early on, due to the limitations of DNA manipulation technologies, only small amounts of data could be stored in DNA molecules. In recent years, DNA synthesis productivity has increased significantly, and storage of megabytes of data has been demonstrated.
The success of DNA-based data storage is largely attributed
to the usage of error correction codes (ECCs). Both the
information writing and reading are prone to errors due to
the specific bio-chemical and bio-physical processes. Fur-
thermore, the reliability of data storage is hampered by
substitution, insertion, and deletion errors occurring simultaneously [3]. ECCs are thus essential for guaranteeing
data storage reliability. In [4] and [5], repetition codes are
used for data protection. In [3] and [6], two-dimensional
interleaved Reed-Solomon (RS) codes with stronger error
correction capability are applied. Recently, an efficient infor-
mation storage architecture, named DNA Fountain [8], has
been proposed. It combines the Luby transform (LT) codes
and RS codes, and achieves a higher information storage
density than the earlier designs.
In prior-art DNA storage systems, ECCs are all decoded
by hard-decision decoding (HDD), which in general requires
large coding redundancy and hence lowers the information
density. Although it is well known that soft-decision decoding (SDD) can provide a significant performance gain over HDD, researchers have so far not been able to apply SDD of ECCs to DNA storage systems. This is mainly because the complicated DNA synthesis and sequencing processes make it difficult to generate the soft-decision log-likelihood ratio (LLR) for each DNA oligonucleotide (or oligo for short) to support SDD.
In this paper, we first characterize the DNA-based data
storage system with the prevailing DNA synthesis and se-
quencing technologies. We then propose a simplified system
model, through which the LLR is derived analytically based
on the number of occurrences of each sequenced oligo of the
system. We apply the proposed SDD to decode LT codes in
DNA Fountain, and demonstrate its error performance over
the conventional HDD.
The rest of this paper is organized as follows. In Section II,
we introduce the DNA-based data storage technology and the
DNA Fountain scheme. In Section III, we present a simplified
DNA storage system model, as well as the calculation of
LLRs for each sequenced oligo. The proposed DNA Fountain
with SDD as well as the simulation results are given in
Section IV. Finally, Section V concludes the paper.
II. DNA-Based Data Storage and DNA Fountain
A. DNA-based Data Storage
A DNA strand, or oligo, is a chain of almost arbitrary
combinations of four base nucleotides, namely Adenine (A),
Cytosine (C), Guanine (G), and Thymine (T). Each base
can represent two bits of information. Modern array-based
synthesis technologies used for DNA storage can synthesize
oligos with length up to about 200 nucleotides [5]. Large
files must be partitioned into small segments and written into
different oligos. DNA synthesis is hampered mainly by two
biochemical constraints. First, the homopolymer run length
of nucleotide is limited. Long homopolymer runs increase the
error probability. In practice, the maximum homopolymer run
length is set to 1-3. Second, the GC-content of each sequence,
i.e., the percentage of the bases G and C in the sequence,
should not be too high or too low. Sequences violating these
constraints will cause more synthesis or sequencing errors [8].
Given the desired sequences, a DNA synthesizer can synthesize nearly 10^5 different oligos in parallel [9], creating up to 1.2×10^7 copies of each DNA string [10], depending on the technology used. All these oligos are mixed together in a pool,
which serves as the storage media for DNA data storage.
Reading information in DNA storage is realized by ran-
domly and independently sequencing the oligos in the pool,
with each sequenced oligo as one read. The number of reads
is usually much smaller than the total number of oligos
in the pool. For instance, only 0.1% of oligos in the
pool are consumed for sequencing in an experiment in [4].
Some synthesized sequences may not be sequenced at all in
the reading process. Moreover, there might be a portion of
sequences lost during the DNA manipulations. Thus, erasure
codes are required to recover the input information from the
limited number of sequenced oligos. In addition, polymerase
chain reaction (PCR) is performed to amplify the oligos in the
pool before sequencing. It increases the oligo concentration
for the ease of sequencing and allows multiple access for the
storage system.
In DNA-based data storage, both synthesis and sequencing
are prone to error in the bio-chemical and bio-physical
processes. Most of the recent works reported in the litera-
ture adopt the array-based synthesis and the next generation
sequencing techniques for DNA storage, leading to similar
error patterns and raw error probabilities. It has been found
that substitution, insertion, and deletion base errors and oligo
missing occur in DNA storage systems. Therefore, effec-
tive error detection and correction schemes are required for
improving both the reliability and the information storage
density of DNA storage systems.
B. DNA Fountain architecture
The DNA Fountain is a DNA-based data storage architecture that realizes error-free data writing and reading with the highest number of bits per nucleotide reported in the literature. It consists
of an RS code for each oligo as the inner code and an LT code
for a set of oligos as the outer code. Because insertion and
deletion errors in the oligos are problematic for efficient error
correction, the RS code in DNA Fountain is only used for
error detection. Oligos with undesired lengths due to insertion
or deletion errors, or those violating the parity-check of the
RS code are discarded and considered missing for the outer
LT code. LT codes are a class of capacity-achieving codes for
erasure channels [12]. Thus, they can tackle the oligo missing (i.e., oligo dropout) due to various errors. For a given set of K input symbols, an LT code can create any desired number of packages, each consisting of the indices and the bitwise sum of d randomly selected input symbols. Here, d follows the Robust Soliton Distribution (RSD) µ_{K,c,δ}(d) [12], given by

µ_{K,c,δ}(d) = (ρ(d) + τ(d)) / Z, (1)

where

ρ(d) = 1/K for d = 1, and ρ(d) = 1/(d(d − 1)) for d = 2, ..., K,

τ(d) = s/(Kd) for d = 1, 2, ..., K/s − 1, τ(d) = s·ln(s/δ)/K for d = K/s, and τ(d) = 0 for d > K/s,

with s = c·ln(K/δ)·√K.
Fig. 1. Block diagram of data writing and reading of DNA Fountain.
Here, Z = Σ_d (ρ(d) + τ(d)) is a normalization coefficient.
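For concreteness, the RSD can be evaluated directly; the following is a minimal Python sketch, assuming Luby's parameterization s = c·ln(K/δ)·√K and rounding K/s to the nearest integer for the spike position (function and variable names are ours, not from the paper):

```python
import math

def robust_soliton(K, c, delta):
    """Robust Soliton Distribution mu_{K,c,delta}(d) for d = 1..K [12].

    Returns a list p with p[d-1] = mu(d).
    """
    s = c * math.log(K / delta) * math.sqrt(K)
    # ideal Soliton component rho(d)
    rho = [1.0 / K] + [1.0 / (d * (d - 1)) for d in range(2, K + 1)]
    # spike component tau(d); the spike at d = K/s is rounded to an integer
    pivot = int(round(K / s))
    tau = [0.0] * K
    for d in range(1, K + 1):
        if d < pivot:
            tau[d - 1] = s / (K * d)
        elif d == pivot:
            tau[d - 1] = s * math.log(s / delta) / K
    # normalization coefficient Z
    Z = sum(r + t for r, t in zip(rho, tau))
    return [(r + t) / Z for r, t in zip(rho, tau)]
```

The distribution is dominated by degree-2 packages, which is what later makes the parity-check matrix of the LT code sparse.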
Due to the randomness of LT codes, the biochemical con-
straints of DNA manipulations can be satisfied by discarding
all the invalid sequences, at the expense of a longer encoding time.
Fig. 1 shows a diagram of DNA Fountain. The binary
source file is first partitioned into non-overlapping segments
of a certain length. Packages of segments are then produced
by selecting a random subset of segments using the RSD
distribution and adding them together bitwise over the binary field. Each package is attached with a unique seed created
by a pseudo random number generator (PRNG). This is
essentially the encoding process of the LT code. The obtained
package with its seed is then encoded by an RS code to
obtain a short message called droplet. After that, the binary
droplet is mapped into a DNA base sequence, and a screening
process is performed where the invalid droplets that violate
the biochemical constraints are rejected. The LT-RS-screening
process is then performed iteratively until a sufficient number
of valid droplets is created and synthesized into an oligo
pool. By sequencing the oligos in the pool, demapping the
obtained DNA base sequences into binary droplets, followed
by decoding of the RS code and LT code, the source file can
be recovered.
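The screening step above can be expressed as a simple validity check; the helper below is a hypothetical sketch (the thresholds follow the values used for DNA Fountain in Section IV: maximum homopolymer run 3, GC-content in [0.45, 0.55]):

```python
def satisfies_constraints(seq, max_run=3, gc_low=0.45, gc_high=0.55):
    """Screen a candidate DNA sequence against the two biochemical
    constraints: bounded homopolymer run length and GC-content range."""
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:          # homopolymer run too long
            return False
    gc = (seq.count('G') + seq.count('C')) / len(seq)
    return gc_low <= gc <= gc_high
```

During encoding, droplets failing this check are simply discarded and the LT encoder draws a new random package.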
III. System Model and LLR Calculation
A. DNA-based data storage system model
In this subsection, we propose a simplified system model
that characterizes the DNA-based data storage following the
analyses of experimental data in the open literature [8]-[11]. Suppose that N unique input sequences, each with n bases as the data payload, are synthesized, resulting in S reads after oligo sequencing. The output sequences are then obtained after inner-code parity checking and merging of identical sequences. By making a few assumptions, we can derive a
simplified DNA storage system model shown in Fig. 2.
Fig. 2. Simplified DNA storage system model.
We model the synthesis and sequencing processes as a
random sampling process such that the output sequences
are randomly sampled from the N input sequences. The
sampling consists of two stages. First, some sequences are
sampled as erasures such that they are missed during the
DNA manipulations. The second stage is to sample reads
from the remaining sequences. At this stage, we assume all
sequences in the population have the same number of copies
created by the synthesizer. We further assume that the PCR
amplification is ideal such that all synthesized sequences
are equally amplified error free. Then, the input sequence
will either be an erasure as shown by the first block of
Fig. 2, or be sampled with a constant probability. Since the
number of reads is much smaller than the number of oligos
in the pool, the second stage can be considered as a uniform
random sampling with replacement. Next, we note that the synthesis and sequencing errors are independent of each other, and they occur consecutively in the DNA storage
system. We thus combine the errors generated by the two
processes, by injecting the combined amount of substitution
errors, insertion errors, and deletion errors respectively into
the sampled input sequence. The fourth stage of the system
model merges all identical reads to obtain the output sequences for information recovery.
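The four stages of the model in Fig. 2 can be sketched as a toy Monte-Carlo channel. This is a simplified illustration only (i.i.d. per-base errors, insertions placed before a base; all names are ours):

```python
import random
from collections import Counter

BASES = "ACGT"

def inject_errors(seq, ps, pi, pd, rng):
    """Inject i.i.d. substitution (ps), insertion (pi), and deletion (pd)
    errors into one read of the sequence."""
    out = []
    for base in seq:
        if rng.random() < pi:                    # insertion before this base
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < pd:                               # deletion: drop the base
            continue
        if r < pd + ps:                          # substitution: wrong base
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def channel(inputs, S, p_loss, ps, pi, pd, seed=0):
    """Stages of Fig. 2: erase sequences, sample S reads with replacement,
    inject errors, and merge identical reads into (sequence, count) pairs."""
    rng = random.Random(seed)
    pool = [x for x in inputs if rng.random() >= p_loss]   # stage 1: erasures
    reads = [inject_errors(rng.choice(pool), ps, pi, pd, rng)
             for _ in range(S)]                            # stages 2 and 3
    return Counter(reads)                                  # stage 4: merge
```

The occurrence counts returned by the merge stage are exactly the quantity r that the soft information in the next subsection is built on.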
B. Log-Likelihood Ratio calculation
All DNA storage architectures proposed in the literature use
HDD for error control. In general, HDD is less reliable and
hence requires more redundancy for achieving a target error
rate than SDD. Specifically, for DNA Fountain, insertion and deletion errors exist simultaneously and may not be detectable by the inner RS code. The traditional HDD of
DNA Fountain, i.e., the inverse LT (ILT) [12], does not have
error correction capability, and a single erroneous oligo that
was accepted as error free may result in a large number of
decoded errors. This motivates us to seek soft information
to enable SDD of ECCs to increase the reliability for DNA
data storage.
Recall that by sequencing the oligo pool, we may obtain
multiple output sequences carrying information of the same
input sequence. Since base errors occur randomly and in-
dependently in different copies of the same input sequence
created by the synthesizer [4], output sequences with more
occurrences are more likely to be error free. Consider that an
output sequence occurs r times, denoted by event D_r, with r = 1, 2, ..., S. We show that the LLR of the sequence can
be derived explicitly as follows.
Since all output sequences have n bases after RS decoding, this ensures that the insertion and deletion errors do not occur, or only occur in pairs. We refer to a pair of insertion and deletion errors as an i-d error and let E_st be the event that a sequence is corrupted by s substitution errors and t i-d errors. Moreover, the validity of each output sequence is checked by the inner code, and we use event C to denote the case where the output sequence is a valid codeword. The LLR of an output sequence occurring r times is thus given by

L_r = log [ P(E_00|C, D_r) / (1 − P(E_00|C, D_r)) ]. (2)
To compute P(E_00|C, D_r), we assume that the inner code has a minimum distance d_min. Let B denote the event that the sequence has greater than or equal to d_min code symbol errors. Applying the law of conditional probability and the law of total probability, we obtain

P(E_00|C, D_r) = P(E_00, C, D_r) / (P(E_00, C, D_r) + P(B, C, D_r)), (3)

where

P(E_00, C, D_r) = P(C|E_00, D_r)·P(E_00)·P(D_r|E_00),

P(B, C, D_r) = P(C|B, D_r)·Σ_{s,t} P(E_st)·P(D_r|B, E_st)·P(B|E_st).

Therefore, we obtain

L_r = log [ P(C|E_00, D_r)·P(E_00)·P(D_r|E_00) / ( P(C|B, D_r)·Σ_{s,t} P(E_st)·P(D_r|B, E_st)·P(B|E_st) ) ]. (4)
Note that the terms associated with event B in (4) depend on the error detection capability of the inner code. In the following, we use the inner code of [8] as an example to derive all the corresponding terms in (4) and obtain L_r. The proposed derivations can be generalized to other inner codes in a straightforward way. In [8], the inner code is an RS code over GF(256) with n_c code symbols and 2 parity-check symbols. For simplicity, we assume that the oligo length n is a multiple of 4 such that n_c = n/4. This code has d_min = 3 and can detect up to two errors over GF(256). Thus, erroneous output sequences with less than three code symbol errors can always be detected by the inner code. Then, we can compute the probabilities in (4) to obtain L_r.
Apparently, P(C|E_00, D_r) = 1 and P(C|B, D_r) = P(C|B). The probability P(C|B) is the undetected error rate of the inner code under the condition that the sequence has greater than two code symbol errors. Due to the existence of the i-d errors, each error pattern can be considered as a random n_c-tuple over GF(256) with weight greater than two. The total number of such n_c-tuples is 256^{n_c} − 1 − 255·n_c − 255^2·n_c!/((n_c − 2)!·2!), and 256^{n_c−2} tuples form the complete set of the inner codewords. For the case of DNA Fountain, we have n_c ≫ 2. Thus, the probability for a random vector with weight greater than two to be a codeword is approximately

P(C|B, D_r) = 256^{n_c−2}/256^{n_c} = 256^{−2}. (5)
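As a quick numerical sanity check of (5) for the (38, 36) inner RS code of [8] (n_c = 38), the fraction of weight-greater-than-two n_c-tuples that are codewords is indistinguishable from 256^−2; the enumeration below follows our reading of the expression above:

```python
nc = 38                                   # (38, 36) RS code over GF(256)
total = 256 ** nc                         # all nc-tuples
# tuples with weight 0, 1, or 2
low_weight = 1 + 255 * nc + 255 ** 2 * nc * (nc - 1) // 2
heavy = total - low_weight                # tuples with weight > 2
codewords = 256 ** (nc - 2)               # size of the inner code
assert abs(codewords / heavy - 256 ** -2) / 256 ** -2 < 1e-6
```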
Then, we can compute P(E_st) for various base error rates, given by

P(E_st) = n!/(s!·t!·t!·(n − s − 2t)!) · p_s^s · p_i^t · p_d^t · (1 − p)^{n−s−2t}, (6)

where p_s, p_i, and p_d are the raw substitution, insertion, and deletion error rates of the system, respectively, with p = p_s + p_i + p_d, and s, t ≥ 0.
Next, we compute P(B|E_st), P(D_r|E_00), and P(D_r|B, E_st) in (4). Note that P(B|E_st) and P(D_r|B, E_st) depend on the error pattern associated with E_st, and the error pattern of the i-d error is related to the input sequence. As an approximation, we assume that all sequenced bases affected by the i-d errors are incorrect. Moreover, since base errors occur rarely, e.g., one error per hundred bases [6] or less [4], the probability of having more than three base errors is negligible. Therefore, we only consider three types of error patterns: E_01 (1 i-d error), E_11 (1 substitution error and 1 i-d error), and E_30 (3 substitution errors). Hence we have P(B|E_st) ≈ 1 for st ∈ {01, 11, 30}.
According to our proposed system model, P(D_r|E_00) and P(D_r|B, E_st) are the probabilities that the corresponding sequences are sampled r times in the random sampling with replacement process. They can be calculated based on the number of samples and the population of the sampling associated with P(D_r|E_00) and P(D_r|B, E_st), denoted by S_st and N_st, respectively. Let P_{r,st} be the unified form of P(D_r|E_00) and P(D_r|B, E_st). We thus have

P_{r,st} = (S_st − 1)!/((S_st − r)!·(r − 1)!) · (1/N_st)^{r−1} · (1 − 1/N_st)^{S_st−r}. (7)
In (7), the number of samples is given by S_st = S·P(E_st). To determine N_st, we first need to compute the number of error patterns for all possible cases, denoted by n_00, n_01, n_11, and n_30, respectively. For E_00, the error pattern is always 0, i.e., n_00 = 1. For the other cases, considering that each substitution error or insertion error has three different error patterns while each deletion error has one, we have

n_01 = 3·n!/(n − 2)!,

n_11 = 3^2·n!/(n − 3)!,

n_30 = 3^3·n!/((n − 3)!·3!).

We can then obtain the population of each sampling as N_st = N·n_st.
At this point, we have derived all the probabilities involved in (4), and thus the soft information L_r of each oligo can be obtained.
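Putting (4)-(7) together, the LLR computation can be sketched numerically. This is an illustrative implementation under our reading of the expressions above and the stated approximations (three dominant error events, P(C|B) ≈ 256^−2, P(B|E_st) ≈ 1, and S_st ≥ r); the function and parameter names are ours, not the authors':

```python
import math
from math import comb, log

def p_est(n, s, t, ps, pi, pd):
    """Eq. (6): probability of s substitutions and t i-d error pairs
    in an n-base oligo (multinomial position count assumed)."""
    p = ps + pi + pd
    positions = (math.factorial(n)
                 // (math.factorial(s) * math.factorial(t) ** 2
                     * math.factorial(n - s - 2 * t)))
    return positions * ps ** s * (pi * pd) ** t * (1 - p) ** (n - s - 2 * t)

def p_r(r, S_st, N_st):
    """Eq. (7): an observed sequence is drawn exactly r times out of
    S_st samples with replacement from a population of N_st (S_st >= r)."""
    return (comb(S_st - 1, r - 1)
            * (1 / N_st) ** (r - 1) * (1 - 1 / N_st) ** (S_st - r))

def llr(r, n, N, S, ps, pi, pd):
    """Eq. (4), keeping the three dominant events E01, E11, E30,
    with P(C|B) ~ 256**-2 from Eq. (5) and P(B|E_st) ~ 1."""
    n_st = {(0, 1): 3 * n * (n - 1),            # one i-d pair
            (1, 1): 9 * n * (n - 1) * (n - 2),  # one substitution + one i-d
            (3, 0): 27 * comb(n, 3)}            # three substitutions
    p00 = p_est(n, 0, 0, ps, pi, pd)
    num = p00 * p_r(r, round(S * p00), N)
    den = 256 ** -2 * sum(
        p_est(n, s, t, ps, pi, pd)
        * p_r(r, round(S * p_est(n, s, t, ps, pi, pd)), N * nst)
        for (s, t), nst in n_st.items())
    return log(num / den)
```

For the DNA Fountain parameters of the next section (n = 152, N = 72000, S = 750000), the LLR grows with the number of occurrences r, reflecting that sequences seen more often are more likely to be error free.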
IV. DNA Fountain with SDD and Simulation Results

In this section, we apply SDD to the DNA-based data storage system and investigate its performance gain over HDD. In principle, all existing DNA storage systems with ECCs can use the proposed LLR calculation to carry out SDD for more reliable data retrieval. As an example, we apply SDD to DNA Fountain [8].
In [8], the source data is stored in n = 152 bases per oligo. Each oligo consists of 32 bytes of data payload, 4 bytes of seed for the PRNG of the LT code, and 2 bytes of parity-check symbols of a (38, 36) shortened RS code over GF(256). During the screening stage, sequences with a homopolymer run length greater than 3 or a GC-content outside the range [0.45, 0.55] are rejected. In the writing process, 67088 segments of the binary message are encoded into 72000 base sequences, and thus multiple copies of 72000 unique oligos are synthesized. In the reading process, different numbers of reads, e.g., from 750000 to 32000000, are performed [8] to evaluate the performance of DNA Fountain. For the ease of simulations, we consider the case with 750000 reads in this paper.
An ILT, the traditional HDD of LT codes, is essentially a simplified Gaussian elimination. It does not have error correction capability. That is, even if the ILT is successful, the recovered messages may be in error. Recall that the LT code is a binary linear block code with a random generator matrix G = [P I], where P is randomly generated with column weight distribution following the RSD, and I is an identity matrix. We can then obtain its parity-check matrix H = [I P^T]. Since the RSD produces a large number of degree-2 nodes, H is a sparse matrix. Therefore, the LT code can be decoded directly by using the belief propagation algorithm (BPA) of low-density parity-check (LDPC) codes [13], with the soft information derived in Section III-B.
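The relation between G = [P I] and H = [I P^T] can be verified numerically over GF(2); the toy sketch below uses a uniformly random binary P (in DNA Fountain the column weights of P would follow the RSD, which we skip here for brevity):

```python
import random

random.seed(1)
k, m = 6, 10                      # k input symbols, m coded packages (toy sizes)
# random binary P of size k x (m - k)
P = [[random.randint(0, 1) for _ in range(m - k)] for _ in range(k)]
# generator matrix G = [P I_k]
G = [P[i] + [1 if j == i else 0 for j in range(k)] for i in range(k)]
# parity-check matrix H = [I_{m-k} P^T]
H = [[1 if j == i else 0 for j in range(m - k)] + [P[r][i] for r in range(k)]
     for i in range(m - k)]
# over GF(2), G H^T = P + P = 0: every row of G is orthogonal to every row of H
for g in G:
    for h in H:
        assert sum(a & b for a, b in zip(g, h)) % 2 == 0
```

Each row of H then defines a parity check over the received oligo bits, exactly the structure the BPA of LDPC codes operates on.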
In our simulations, as the raw error rate of the system is not given in [8], we follow [3] and [7] to set the substitution, insertion, and deletion error rates, respectively. In particular, based on the error analyses in [3] and [7], we can obtain the ranges of the different types of raw error rates, i.e., p_s ∈ [6×10^−4, 4.5×10^−3], p_i ∈ [5.4×10^−4, 1×10^−3], and p_d ∈ [1.5×10^−3, 5×10^−3]. Moreover, it has been observed that the deletion error rate p_d is approximately three to six times the insertion error rate p_i, and the substitution error rate p_s varies, with the total raw error rate being in the range of [2×10^−3, 1×10^−2]. Therefore, we set p_d = 5·p_i and vary the values of p_s in our simulations. In addition, based on the supplementary materials of [8], the erasure rate of the input sequences of the system is set to 5×10^−3.

Fig. 3. FER comparison of DNA Fountain with SDD and HDD.
Moreover, different from LDPC codes, the LT code in DNA Fountain is nonsystematic, i.e., none of the information bits are written into the oligos. Hence the soft information obtained from the DNA storage channel is only associated with the bit positions of P in G, and correspondingly of I in H. In the simulations, the LLRs of the oligos associated with I in H are computed for each set of raw error rates according to (4), based on their numbers of occurrences. The LLRs of all the other bit positions are set to 0.
Fig. 3 illustrates the simulated frame error rate (FER) performance of DNA Fountain with SDD and HDD, respectively. Note that the FERs are evaluated for frames with a sufficient number of sequenced oligos such that the ILT can be successfully carried out. It can be seen from Fig. 3 that SDD outperforms HDD with an FER reduction of two to three orders of magnitude over a wide range of substitution, insertion, and deletion error rates. This demonstrates the potential of the proposed SDD for improving the system's tolerance to various types of errors and increasing the information density of DNA-based data storage systems.
V. Conclusion

In this paper, we have investigated, for the first time, SDD of ECCs for improving the error performance of DNA-based data storage. In particular, we have proposed a simplified system model for the DNA-based data storage system through analyzing the system's major characteristics and different types of errors. We have derived the error-free probability of each sequenced oligo, based on which we obtain its LLR that enables SDD of ECCs. To demonstrate the effectiveness of the proposed SDD, we have applied it to decode the LT code of DNA Fountain. Simulation results have shown that for DNA Fountain, the proposed SDD can effectively improve the system's tolerance to various types of errors, and it achieves an FER reduction of two to three orders of magnitude over HDD.
Acknowledgment

This work is supported by the Singapore Ministry of Education Academic Research Fund Tier 2 MOE2016-T2-2-054, the SUTD-ZJU grant ZJURP1500102, and an SUTD SRG grant.
References

[1] M. E. Allentoft, M. Collins, D. Harker, J. Haile, C. L. Oskam, M. L.
Hale, P. F. Campos, J. A. Samaniego, M. T. P. Gilbert, E. Willerslev,
G. Zhang, R. P. Scofield, R. N. Holdaway, and M. Bunce, “The half-
life of DNA in bone: measuring decay kinetics in 158 dated fossils,”
Proceedings of the Royal Society of London B: Biological Sciences,
[2] C. Bancroft, T. Bowler, B. Bloom, and C. T. Clelland, “Long-term
storage of information in DNA,” Science, vol. 293, no. 5536, pp. 1763–
1765, 2001.
[3] M. Blawat, K. Gaedke, I. Huetter, X.-M. Chen, B. Turczyk, S. Inverso,
B. W. Pruitt, and G. M. Church, “Forward error correction for DNA data
storage,” Procedia Computer Science, vol. 80, pp. 1011–1022, 2016.
[4] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust,
B. Sipos, and E. Birney, “Towards practical, high-capacity, low-
maintenance information storage in synthesized DNA,” Nature, vol.
494, no. 7435, pp. 77–80, 2013.
[5] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and
K. Strauss, “A DNA-based archival storage system,” SIGPLAN Not.,
vol. 51, no. 4, pp. 637–649, Mar. 2016.
[6] R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark,
“Robust chemical preservation of digital information on DNA in silica
with error-correcting codes,” Angewandte Chemie International Edition,
vol. 54, no. 8, pp. 2552–2555, 2015.
[7] L. Organick, S. D. Ang, Y. J. Chen, R. Lopez, S. Yekhanin, K.
Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen, C.
Takahashi, S. Newman, H. Y. Parker, C. Rashtchian, G. G. K. Stewart,
R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss,
“Scaling up DNA data storage and random access retrieval,” bioRxiv,
[8] Y. Erlich and D. Zielinski, “DNA fountain enables a robust and efficient
storage architecture,” Science, vol. 355, no. 6328, pp. 950–954, 2017.
[9] S. Kosuri and G. M. Church, “Large-scale de novo DNA synthesis:
technologies and applications,” Nature methods, vol. 11, no. 5, pp. 499–
507, 2014.
[10] E. M. LeProust, B. J. Peck, K. Spirin, H. B. McCuen, B. Moore,
E. Namsaraev, and M. H. Caruthers, “Synthesis of high-quality libraries
of long (150mer) oligonucleotides by a novel depurination controlled
process,” Nucleic acids research, vol. 38, no. 8, pp. 2522–2540, 2010.
[11] R. Heckel, G. Mikutis, and R. N. Grass, “A Characterization of the DNA Data Storage Channel,” arXiv:1803.03322, 2018.
[12] M. Luby, “LT codes,” in IEEE Symp. Found. of Comp. Science, 2002,
pp. 271–280.
[13] W. E. Ryan and S. Lin, Channel codes: classical and modern. Cam-
bridge University Press, 2009.
... The path metric represents the quantized components in the demodulator [6]. The optimum decision is based on normalized metric of the synchronized space [8]. The scaled regions are designed to achieve the uniform phase [7,8]. ...
... The optimum decision is based on normalized metric of the synchronized space [8]. The scaled regions are designed to achieve the uniform phase [7,8]. The symmetry properties of signal space codes are isomorphic to the signal set in the demodulator [9]. ...
For channel codes in communication systems, an efficient algorithm that controls error is proposed. It is an algorithm for soft decision decoding of block codes. The sufficient conditions to obtain the optimum decoding are deduced so that the efficient method which explores candidate code words can be presented. The information vector of signal space codes has isomorphic coherence. The path metric in the coded demodulator is the selected components of scaled regions. The carrier decision is derived by the normalized metric of synchronized space. An efficient algorithm is proposed based on the method. The algorithm finds out a group of candidate code words, in which the most likely one is chosen as a decoding result. The algorithm reduces the complexity, which is the number of candidate code words. It also increases the probability that the correct code word is included in the candidate code words. It is shown that both the error probability and the complexity are reduced. The positions of the first hard-decision decoded errors and the positions of the unreliable bits are carefully examined. From this examination, the candidate codewords are efficiently searched for. The aim of this paper is to reduce the required number of hard-decision decoding and to lower the block error probability.
... However, to integrate with these coding schemes, one needs to address issues such as the unordered nature [28] and biochemical constraints of [29] DNA strands. Furthermore, even though the decoder proposed in this paper is a hard decision decoder, modifications can be made so that a soft decision to passed to the outer code [30]. However, such issues are beyond the scope of this paper and are deferred to future work. ...
Full-text available
The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. The common setup assumes the codebook to be the entire space and the problem is to determine the minimum number of distinct reads that is required to reconstruct the transmitted codeword. Motivated by modern storage devices, we study a variant of the problem where the number of noisy reads $N$ is fixed. Specifically, we design reconstruction codes that reconstruct a codeword from $N$ distinct noisy reads. We focus on channels that introduce single edit error (i.e. a single substitution, insertion, or deletion) and their variants, and design reconstruction codes for all values of $N$. In particular, for the case of a single edit, we show that as the number of noisy reads increases, the number of redundant bits required can be gracefully reduced from $\log n+O(1)$ to $\log \log n+O(1)$, and then to $O(1)$, where $n$ denotes the length of a codeword. We also show that the redundancy of certain reconstruction codes is within one bit of optimality.
Full-text available
Owing to its longevity and enormous information density, DNA, the molecule encoding biological information, has emerged as a promising archival storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules that are stored in an unordered way, and can only be read by sampling from this DNA pool. Moreover, imperfections in writing (synthesis), reading (sequencing), storage, and handling of the DNA, in particular amplification via PCR, lead to a loss of DNA molecules and induce errors within the molecules. In order to design DNA storage systems, a qualitative and quantitative understanding of the errors and the loss of molecules is crucial. In this paper, we characterize those error probabilities by analyzing data from our own experiments as well as from experiments of two different groups. We find that errors within molecules are mainly due to synthesis and sequencing, while imperfections in handling and storage lead to a significant loss of sequences. The aim of our study is to help guide the design of future DNA data storage systems by providing a quantitative and qualitative understanding of the DNA data storage channel.
Full-text available
Channel coding lies at the heart of digital communication and data storage, and this detailed introduction describes the core theory as well as decoding algorithms, implementation details, and performance analyses. Known for their writing clarity, Professors Ryan and Lin provide the latest information on modern channel codes, including turbo and low-density parity-check (LDPC) codes. They also present detailed coverage of BCH codes, Reed-Solomon codes, convolutional codes, finite geometry codes, and product codes, providing a one-stop resource for both classical and modern coding techniques. Assuming no prior knowledge in the field of channel coding, the opening chapters begin with basic theory to introduce newcomers to the subject. Later chapters then extend to advanced topics such as code ensemble performance analyses and algebraic code design. 250 varied and stimulating end-of-chapter problems are also included to test and enhance learning, making this an essential resource for students and practitioners alike.
Full-text available
Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information or were not amenable to scaling-up, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information of 5.2 × 10(6) bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
Claims of extreme survival of DNA have emphasized the need for reliable models of DNA degradation through time. By analysing mitochondrial DNA (mtDNA) from 158 radiocarbon-dated bones of the extinct New Zealand moa, we confirm empirically a long-hypothesized exponential decay relationship. The average DNA half-life within this geographically constrained fossil assemblage was estimated to be 521 years for a 242 bp mtDNA sequence, corresponding to a per nucleotide fragmentation rate (k) of 5.50 × 10⁻⁶ per year. With an effective burial temperature of 13.1°C, the rate is almost 400 times slower than predicted from published kinetic data of in vitro DNA depurination at pH 5. Although best described by an exponential model (R² = 0.39), considerable sample-to-sample variance in DNA preservation could not be accounted for by geologic age. This variation likely derives from differences in taphonomy and bone diagenesis, which have confounded previous, less spatially constrained attempts to study DNA decay kinetics. Lastly, by calculating DNA fragmentation rates on Illumina HiSeq data, we show that nuclear DNA has degraded at least twice as fast as mtDNA. These results provide a baseline for predicting long-term DNA survival in bone.
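The reported figures are internally consistent: assuming each of the L nucleotide sites fragments independently at rate k, an intact L-bp fragment decays at rate kL, giving a half-life of ln(2)/(kL). A quick check with the quoted values:

```python
import math

k = 5.50e-6   # per-nucleotide fragmentation rate, per year
L = 242       # fragment length in bp

# Independent per-site fragmentation implies an exponential decay
# of intact fragments at rate lambda = k * L, so the half-life is
# t_half = ln(2) / (k * L).
lam = k * L
t_half = math.log(2) / lam
print(round(t_half))  # → 521, matching the reported half-life in years
```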
We have achieved the ability to synthesize thousands of unique, long oligonucleotides (150mers) in fmol amounts using parallel synthesis of DNA on microarrays. The sequence accuracy of the oligonucleotides in such large-scale syntheses has been limited by the yields and side reactions of the DNA synthesis process used. While there has been significant demand for libraries of long oligos (150mer and more), the yields in conventional DNA synthesis and the associated side reactions have previously limited the availability of oligonucleotide pools to lengths <100 nt. Using novel array based depurination assays, we show that the depurination side reaction is the limiting factor for the synthesis of libraries of long oligonucleotides on Agilent Technologies' SurePrint DNA microarray platform. We also demonstrate how depurination can be controlled and reduced by a novel detritylation process to enable the synthesis of high quality, long (150mer) oligonucleotide libraries and we report the characterization of synthesis efficiency for such libraries. Oligonucleotide libraries prepared with this method have changed the economics and availability of several existing applications (e.g. targeted resequencing, preparation of shRNA libraries, site-directed mutagenesis), and have the potential to enable even more novel applications (e.g. high-complexity synthetic biology).
DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10⁶ bytes in DNA oligonucleotides and perfectly retrieved the information from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10¹⁵ retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
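DNA Fountain builds on fountain (Luby-transform) codes: each synthesized oligo carries a "droplet", the XOR of a pseudorandomly chosen subset of input segments identified by a short seed. The sketch below is illustrative only; the published scheme uses a robust soliton degree distribution and additionally screens droplets for biochemical constraints (GC content, homopolymers), which are omitted here. The function name `make_droplet` is an assumption for illustration.

```python
import random

def make_droplet(segments, seed, degree_dist=(1, 2, 3, 4)):
    """Generate one fountain 'droplet': the XOR of a pseudorandomly
    chosen subset of equal-length input segments. The seed fully
    determines which segments were combined, so the decoder can
    reproduce the choice. (Illustrative LT-style sketch.)"""
    rng = random.Random(seed)
    d = rng.choice(degree_dist)                    # droplet degree
    chosen = rng.sample(range(len(segments)), d)   # segment indices
    payload = bytearray(len(segments[0]))
    for i in chosen:
        for j, b in enumerate(segments[i]):
            payload[j] ^= b
    return seed, bytes(payload)

# Example: four 4-byte segments
segs = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40",
        b"\xAA\xBB\xCC\xDD", b"\x0F\x0F\x0F\x0F"]
seed, droplet = make_droplet(segs, seed=7)
```

Because the droplet is a deterministic function of the seed, only the seed and the XOR payload need to be stored in the oligo.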
Demand for data storage is growing exponentially, but the capacity of existing storage media is not keeping up. Using DNA to archive data is an attractive possibility because it is extremely dense, with a raw limit of 1 exabyte/mm³ (109 GB/mm³), and long-lasting, with observed half-life of over 500 years. This paper presents an architecture for a DNA-based archival storage system. It is structured as a key-value store, and leverages common biochemical techniques to provide random access. We also propose a new encoding scheme that offers controllable redundancy, trading off reliability for density. We demonstrate feasibility, random access, and robustness of the proposed encoding with wet lab experiments involving 151 kB of synthesized DNA and a 42 kB random-access subset, and simulation experiments of larger sets calibrated to the wet lab experiments. Finally, we highlight trends in biotechnology that indicate the impending practicality of DNA storage for much larger datasets.
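Encoding schemes in this line of work commonly map base-3 digits to nucleotides by rotation, in the style introduced by Goldman et al.: each trit selects one of the three bases that differ from the previous output base, so the strand never contains a homopolymer run. A minimal sketch of this rotation encoding (the function name is illustrative, not from the cited paper):

```python
BASES = "ACGT"

def trits_to_dna(trits, prev="A"):
    """Map base-3 digits to nucleotides: each trit t in {0, 1, 2}
    picks one of the three bases different from the previous base,
    so no base ever repeats (homopolymer avoidance).
    (Rotation encoding in the style of Goldman et al.; sketch only.)"""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)

print(trits_to_dna([0, 0, 1, 2]))  # → CAGT
```

Avoiding homopolymers matters because runs of identical bases are a dominant error source in both synthesis and sequencing.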
We report on a strong capacity boost in storing digital data in synthetic DNA. In principle, synthetic DNA is an ideal medium to archive digital data for very long times because its achievable data density and longevity far outperform today's digital data storage media. On the other hand, neither the synthesis nor the amplification and sequencing of DNA strands can be performed error-free today or in the foreseeable future. In order to make synthetic DNA available as a digital data storage medium, specifically tailored forward error correction schemes have to be applied. For the purpose of realizing DNA data storage, we have developed an efficient and robust forward-error-correcting scheme adapted to the DNA channel. We based the design of the needed DNA channel model on data from a proof of concept conducted in 2012 by a team from the Harvard Medical School [1]. Our forward error correction scheme is able to cope with all error types of today's DNA synthesis, amplification and sequencing processes, e.g. insertion, deletion, and swap errors. In a successful experiment, we recently stored and retrieved 22 MByte of digital data in synthetic DNA error-free. The residual error probability found is already of the same order as that of hard disk drives and can easily be improved further. This proves the feasibility of using synthetic DNA as a long-term digital data storage medium.
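The error types listed above (insertion, deletion, and substitution/swap) are often modeled, in first approximation, as an i.i.d. channel acting per nucleotide. A toy simulator along those lines is sketched below; the error rates are illustrative placeholders, not values from the cited work:

```python
import random

def dna_channel(strand, p_sub=0.01, p_del=0.005, p_ins=0.005, seed=None):
    """Pass a DNA strand through a toy i.i.d. error channel.
    Before each position a random base may be inserted; the position
    itself may then be deleted or substituted. Rates are illustrative
    placeholders, not measured synthesis/sequencing error rates."""
    rng = random.Random(seed)
    bases = "ACGT"
    out = []
    for b in strand:
        if rng.random() < p_ins:          # insertion of a random base
            out.append(rng.choice(bases))
        r = rng.random()
        if r < p_del:                     # deletion: drop this base
            continue
        if r < p_del + p_sub:             # substitution: flip the base
            out.append(rng.choice([x for x in bases if x != b]))
        else:
            out.append(b)                 # base passes through unchanged
    return "".join(out)

noisy = dna_channel("ACGTACGTACGT", seed=1)
```

Insertions and deletions desynchronize the read from the reference, which is why DNA-tailored codes must handle them explicitly rather than as simple substitutions.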
Information, such as text printed on paper or images projected onto microfilm, can survive for over 500 years. However, the storage of digital information for time frames exceeding 50 years is challenging. Here we show that digital information can be stored on DNA and recovered without errors for considerably longer time frames. To allow for the perfect recovery of the information, we encapsulate the DNA in an inorganic matrix, and employ error-correcting codes to correct storage-related errors. Specifically, we translated 83 kB of information to 4991 DNA segments, each 158 nucleotides long, which were encapsulated in silica. Accelerated aging experiments were performed to measure DNA decay kinetics, which show that data can be archived on DNA for millennia under a wide range of conditions. The original information could be recovered error free, even after treating the DNA in silica at 70 °C for one week. This is thermally equivalent to storing information on DNA in central Europe for 2000 years.
For over 60 years, the synthetic production of new DNA sequences has helped researchers understand and engineer biology. Here we summarize methods and caveats for the de novo synthesis of DNA, with particular emphasis on recent technologies that allow for large-scale and low-cost production. In addition, we discuss emerging applications enabled by large-scale de novo DNA constructs, as well as the challenges and opportunities that lie ahead.