Design of Capacity-Approaching Constrained Codes
for DNA-based Storage Systems
Kees A. Schouhamer Immink, Fellow, IEEE and Kui Cai, Senior Member, IEEE
Abstract—We consider coding techniques that limit the lengths
of homopolymer runs in strands of nucleotides used in DNA-
based mass data storage systems. We compute the maximum
number of user bits that can be stored per nucleotide when
a maximum homopolymer runlength constraint is imposed.
We describe simple and efficient implementations of coding
techniques that avoid the occurrence of long homopolymers, and
the rates of the constructed codes are close to the theoretical
maximum. The proposed sequence replacement method for k-constrained q-ary data yields a significantly lower coding redundancy than the prior art sequence replacement method for k-constrained binary data. Using a simple transformation,
standard binary maximum runlength limited sequences can be
transformed into maximum runlength limited q-ary sequences,
which opens the door to applying the vast prior art binary code
constructions to DNA-based storage.
I. INTRODUCTION
The first large-scale archival DNA-based storage architec-
ture was implemented by Church et al. [1] in 2012. Naturally
occurring DNA consists of four types of nucleotides (nt):
adenine (A), cytosine (C), guanine (G), and thymine (T).
A DNA strand (or string) is a linear sequence of these
nucleotides, and hence is essentially a q-ary sequence with
q= 4. Binary source, or user, data is translated into a strand
of nucleotides, for example, by mapping two binary source bits
into a single nucleotide. Repetitions of the same nucleotide,
a homopolymer run, may significantly increase the chance
of sequencing errors [2], [10]. Fig. 5 of [10] shows that a long homopolymer run (e.g., longer than 4 nt) leads to a significant increase in insertion and deletion errors, so such long runs should be avoided.
In this paper, we focus on constrained coding techniques
that avoid the occurrence of long homopolymer runs. That is,
we will study the generation of sequences of q-ary symbols,
. . . , x_{i-1}, x_i, x_{i+1}, . . . , x_i ∈ Q = {0, 1, . . . , q−1}, where the occurrence of vexatious substrings is disallowed. Note that for the DNA case, q = 4, we prefer the alphabet Q = {0, 1, 2, 3} to the set of four nucleotide types {A, C, G, T}, as it allows the introduction of arithmetic operations on the symbols.
Constrained sequences have been applied in a great number
of mass data storage systems such as optical and magnetic data
recording systems [3]. Constrained codes based on runlength
Kees A. Schouhamer Immink is with Turing Machines Inc, Willem-
skade 15d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-
machines.com.
Kui Cai is with Singapore University of Technology and Design (SUTD),
8 Somapah Rd, 487372, Singapore. E-mail: cai kui@sutd.edu.sg.
This work was supported by the SUTD-MIT International Design Center
(IDC) research grant.
limited (RLL) sequences have found almost universal applica-
tion in recording practice, and most of the codes are binary
with q= 2. The number of repetitions of the same consec-
utive symbol (nucleotide) is usually called runlength [4]. A
maximum runlength constraint is characterized by the integer
(k + 1), k ≥ 0, which stipulates the maximum runlength. We focus on sequences where the 'zero' runlength lies between d and k. Such a sequence is often called a dk-constrained sequence, and in case d = 0, it is called a k-constrained sequence. A k-constrained sequence is converted into an RLL sequence whose maximum runlength equals k + 1 using precoding, a modulo-q integration step [3]. The notation k versus k + 1 for a k-constrained sequence versus a (k + 1) RLL sequence is inconvenient, but the term is generally used in data recording practice, and is a heritage rooted in the 1960s [5]. We use the notation m = k + 1 to denote a maximum homopolymer run of m nt.
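To make the precoding step concrete, the following minimal Python sketch applies modulo-q integration to a short (k = 3)-constrained 4-ary sequence and maps the result onto nucleotides. The symbol-to-nucleotide assignment and the example sequence are illustrative choices of ours, not prescriptions of this paper.

NUCS = "ACGT"   # illustrative assignment of {0,1,2,3} to the four nucleotide types

def precode(x, q=4):
    # Modulo-q integration: y_i = (y_{i-1} + x_i) mod q, with y_0 = 0.
    # A run of j zeros in x leaves y unchanged for j positions, so a
    # k-constrained input yields a maximum homopolymer run of k + 1.
    y, prev = [], 0
    for s in x:
        prev = (prev + s) % q
        y.append(prev)
    return y

x = [1, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2]            # a (k = 3)-constrained 4-ary sequence
strand = "".join(NUCS[s] for s in precode(x))
print(strand)                                    # CCCTTGTTTTC: longest run is 4 = k + 1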
Bornholt et al. [2] presented a coding method that avoids
the occurrence of repetitions of the same nucleotide for DNA-
based storage. They convert binary user data into a (k= 0)-
constrained ternary data stream using a base-change converter,
where the generated ternary data are taken from the alphabet
{1,2,3}. The ternary data so obtained are translated using
modulo-4 integration precoding into a strand of nucleotides in which homopolymers are avoided, m = 1, that is, the substrings '00', '11', '22', and '33' (or in nucleotide language: 'AA', 'CC', 'GG', and 'TT') are not generated. The relative loss of information capacity due to the proposed 3-base code, instead of the full 4-base code, equals 1 − log2(3)/2 ≈ 0.208. The additional loss incurred by the proposed binary-to-ternary conversion, which uses a fixed-to-variable-length Huffman code, is ignored here. The more than 20 percent loss
of information capacity is significant, and therefore alternative
coding methods with less overhead are desirable.
In this paper, we propose alternative, more efficient, coding
techniques that avoid the occurrence of long homopolymer
runs. In particular, in Section II, we compute the information
capacity of q-ary, k-constrained channels, which follows di-
rectly from Shannon’s noiseless input restricted channel [6].
Then, in Section III, we present the main contribution of this
work, algorithms for translating arbitrary binary source data
into k-constrained q-ary data. Among the three code design
methods we describe, the second method removes forbidden
substrings of q-ary sequences by using a recursive, ‘sequence
replacement’, method yielding a significant improvement in
coding redundancy than the prior art binary sequence re-
placement method [9]. In the third method, standard binary
maximum runlength limited sequences are transformed into
maximum runlength limited q-ary sequences using two simple
TABLE I
CAPACITY C_k VERSUS k AND m = k + 1 FOR q = 4.

k   m   C_k (bit/nt)
0   1   1.5850 (= log2 3)
1   2   1.9227
2   3   1.9824
3   4   1.9957
4   5   1.9989
5   6   1.9997
steps of precoding, which opens the door to using the vast
prior art binary code constructions to DNA-based storage.
Section IV concludes our paper.
II. INFORMATION CAPACITY
Strands of nucleotides with (long) repetitions of the same
nucleotide are prone to error, and DNA sequences with more
than m=k+ 1 consecutive nucleotides of the same type
must be avoided. Each k-constrained sequence of symbols starting with a non-zero symbol can be seen to be composed of substrings taken from the set {a_0, a_1 0, a_2 0^2, . . . , a_k 0^k}, where 0^j stands for a string of j consecutive '0's, and the integer a_i ∈ {1, . . . , q−1}. Let N_k(n) denote the number of k-constrained sequences of q-ary symbols starting with a non-zero symbol; then we may write down, following Shannon's approach [6], the recurrent relationship
$$N_k(n) = (q-1)\sum_{i=1}^{k+1} N_k(n-i), \quad n > k. \qquad (1)$$
For large n, the number of sequences N_k(n) grows exponentially, that is,
$$N_k(n) \approx c_1 \lambda_k^{n}, \quad n \gg 1, \qquad (2)$$
where c_1 is a constant, and the growth factor, λ_k, is the largest real root of
$$\lambda_k^{k+2} - q\lambda_k^{k+1} + q - 1 = 0. \qquad (3)$$
The maximum number of user bits that can be stored per nucleotide (nt), called the (information) capacity and denoted by C_k, is defined by
$$C_k = \lim_{n\to\infty} \frac{1}{n}\log_2 N_k(n) = \log_2 \lambda_k \ \ \text{(bit/nt)}. \qquad (4)$$
Table I shows the capacity of the k-constrained channel, Ck,
versus k and the maximum homopolymer run m = k + 1, for the DNA case, q = 4. For asymptotically large k, we obtain
$$\lambda_k \approx q\left(1 - \frac{q-1}{q^{k+2}}\right), \quad k \gg 1, \qquad (5)$$
so that
$$C_k \approx \log_2 q - \frac{1}{\ln 2}\,\frac{q-1}{q^{2} q^{k}}. \qquad (6)$$
It is immediate from Table I that a relaxation of the maximum
homopolymer runlength constraint from m= 1, a value
proposed in [2], to a higher value may significantly increase
the maximum code rate. We are interested in k-constrained code constructions of rate (n−1)/n, where n is the codeword length. Define the integer n_max as the largest n for which a rate (n−1)/n, k-constrained code can be constructed. We simply find that
$$\frac{1}{n_{\max}} \geq 1 - \log_q \lambda_k, \qquad (7)$$
or
$$n_{\max} = \left\lfloor \frac{1}{\log_q (q/\lambda_k)} \right\rfloor. \qquad (8)$$
Results of computations are collected in Table II. We may
notice that it is possible, in theory, to construct a code with
a redundancy of around half a percent, where homopolymer runs have a length of at most m = 4. A maximum homopolymer
run, m= 3, costs less than two percent redundancy. In the next
section, we investigate properties and constructions of practical
codes that translate arbitrary source data into k-constrained q-
ary sequences.
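The entries of Tables I and II can be reproduced numerically. The following Python sketch, added here for convenience, finds the growth factor λ_k by bisection on (3) and then evaluates (4) and (8).

import math

def growth_factor(q, k, tol=1e-12):
    # Largest real root of lambda^(k+2) - q*lambda^(k+1) + q - 1 = 0, Eq. (3).
    # The root lies in the interval (q-1, q).
    def f(lam):
        return lam ** (k + 2) - q * lam ** (k + 1) + q - 1
    lo, hi = q - 1.0, float(q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def capacity(q, k):
    # C_k = log2(lambda_k) bit/nt, Eq. (4).
    return math.log2(growth_factor(q, k))

def n_max(q, k):
    # Largest n for which a rate (n-1)/n, k-constrained code can exist, Eq. (8).
    lam = growth_factor(q, k)
    return math.floor(math.log(q) / (math.log(q) - math.log(lam)))

for k in range(6):
    print(k, k + 1, f"{capacity(4, k):.4f}")     # reproduces Table I
for k in range(1, 5):
    print(k, n_max(4, k))                        # cf. the n_max column of Table II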
III. MAXIMUM RUNLENGTH CONSTRAINED CODES
We detail three methods for generating maximum runlength
limited q-ary sequences. In the second method, forbidden
substrings of q-ary sequences are removed by a recursive,
‘sequence replacement’, method. We assume that the binary
source data have been translated into q-ary data, which is ac-
complished by an efficient base converter. In the third method,
standard binary maximum runlength limited sequences are
transformed into maximum runlength limited q-ary sequences
using a simple transformation.
A. Cascadable block codes, Method A
Let n be the length of a k-constrained q-ary word that ends with at most r 'zero's and starts with at most l 'zero's. In case l + r ≤ k we may cascade the n-words without violating the k constraint at the word boundaries. Blake [7] and Freiman and Wyner [5] showed for the binary case, q = 2, that the number of such constrained words, denoted by N_{k,l,r}(n), is maximized by choosing l = ⌊k/2⌋ and r = k − l. Their arguments can be generalized to q-ary words, and we denote the number of k-constrained q-ary words by N_{k,l_0,r_0}(n), where l_0 = ⌊k/2⌋ and r_0 = k − l_0. Using generating functions and an algebraic computer program, we can compute N_{k,l_0,r_0}(n) as a function of k and n. As we are interested in the construction of codes of maximum rate 1 − 1/n, we computed the maximum n, denoted by n_A, for which a rate 1 − 1/n, k-constrained code using Method A is possible. Results of computations are collected in Table II. For small n it is practically possible to directly
implement RLL codes using look-up tables. For larger codes,
and smaller redundancy, we must resort to alternative coding
methods. Kautz [8], for example, presented an enumeration
algorithm for encoding and decoding k-constrained binary
sequences. His enumeration algorithm can be rewritten for the
q-ary case at hand. The space complexity of his algorithm is O(n^2), which makes it less attractive than the replacement
techniques discussed in the next section.
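The computation of n_A can be made concrete with a small dynamic program that counts, for each n, the k-constrained q-ary words whose leading zero-run is at most l_0 and whose trailing zero-run is at most r_0, and returns the largest n for which this count reaches q^{n-1}. The Python sketch below is an illustrative enumeration of ours; the computations reported in Table II were obtained with generating functions.

def n_A(q, k):
    # Largest n for which the number of cascadable k-constrained words of
    # length n is at least q^(n-1), i.e. a rate (n-1)/n block code exists.
    l, r = k // 2, k - k // 2
    lead = [1] + [0] * l        # lead[j]: all symbols so far are '0', j of them
    mid = [0] * (k + 1)         # mid[s]: a nonzero occurred, trailing zero-run = s
    best, n = 0, 0
    while True:
        n += 1
        new_lead = [0] * (l + 1)
        new_mid = [0] * (k + 1)
        for j, c in enumerate(lead):
            if c:
                if j + 1 <= l:
                    new_lead[j + 1] += c       # extend the leading zero-run
                new_mid[0] += c * (q - 1)      # first nonzero symbol
        for s, c in enumerate(mid):
            if c:
                new_mid[0] += c * (q - 1)      # append a nonzero symbol
                if s + 1 <= k:
                    new_mid[s + 1] += c        # append one more zero
        lead, mid = new_lead, new_mid
        count = sum(lead[: min(l, r) + 1]) + sum(mid[: r + 1])
        if count >= q ** (n - 1):
            best = n
        elif n > best + 8:                     # the bound keeps failing from here on
            return best

for k in (1, 2, 3, 4):
    print(k, n_A(4, k))                        # compare with the n_A column of Table II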
B. Sequence replacement technique, Method B
The three sequence replacement techniques published by
Wijngaarden et al. [9] are recursive methods for removing
forbidden substrings from a binary source word. The encoder
removes the forbidden substrings, and the positions of the
forbidden substrings are encoded as binary pointer words, and
subsequently inserted at predefined positions of the codeword.
The sequence replacement techniques are attractive as the
complexity of encoder and decoder is very low, and the
methods are very efficient in terms of rate-capacity quotient. In
software or hardware, it would require a counter, a comparator,
and a few memory elements. The methods are also suited for
high speed implementation, as several steps in the encoding
and decoding procedure can be performed simultaneously.
We assume that both the source and encoded channel
data are represented in the same q-ary base. Let X = (x_1, . . . , x_{n−1}), x_i ∈ Q, be an (n−1)-symbol source word, which has to be translated into an n-symbol codeword Y = (y_1, . . . , y_n), y_i ∈ Q. Obviously, the rate of the code is (n−1)/n. The task of the encoder is to translate the source word into a k-constrained word.
The encoder simply starts by appending a '1' to the (n−1)-symbol source word, yielding the n-symbol word denoted by X1. The encoder scans (from right to left, i.e., from LSB to MSB) the word X1, and if this word does not contain the forbidden substring 0^{k+1}, the q-ary codeword Y = X1 is transmitted. If, on the other hand, the first occurrence of the substring 0^{k+1} is found, we invoke the following replacement procedure.
Replacement procedure: Let the source word be denoted by X2 0^{k+1} X1 1, where, by assumption, the tail X1 has no forbidden substring. The forbidden substring 0^{k+1} is removed, yielding the (n−k−1)-symbol sequence X2 X1 1. Let the forbidden substring, 0^{k+1}, start at position p_1, 1 ≤ p_1 ≤ n−k−1. The position p_1 is represented by the (k+2)-symbol q-ary pointer word p = v_1 A v_2 0, where v_1, v_2 ∈ Q \ {0} and A is any q-ary word of k−1 symbols. Note that the number of unique combinations of the pointer p equals (q−1)^2 q^{k−1}. Subsequently, the tail symbol, '1', of X2 X1 1 is replaced by the (k+2)-symbol q-ary string p, obtaining the sequence X2 X1 p.
Note that the sequence X2 X1 p is of length n (as is the starting sequence X1). If, after the replacement, the sequence X2 X1 p is free of other occurrences of the forbidden substring 0^{k+1}, then the codeword Y = X2 X1 p is sent. Otherwise, the encoder repeats the above sequence replacement procedure for the string X2 X1 p, etc., until all forbidden substrings have been removed. The decoder can uniquely undo the various replacements and shifts made by the encoder. The space complexity of the encoder and decoder is mainly the look-up table for translating the position p_1, 1 ≤ p_1 ≤ n−k−1, into the (k+2)-symbol q-ary pointer and vice versa, which amounts to O(n).
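The following Python sketch illustrates the mechanics of a single replacement round for q = 4: the construction of the pointer p = v_1 A v_2 0, the removal of the rightmost forbidden run, and the corresponding decoder step. For brevity it performs one round only; the complete encoder repeats the step recursively, as described above and in [9]. Function and variable names are ours.

Q = 4   # DNA alphabet size

def make_pointer(pos, k, q=Q):
    # Encode a run position as the (k+2)-symbol pointer v1 A v2 0,
    # with v1, v2 nonzero and A an arbitrary (k-1)-symbol q-ary word.
    idx = pos - 1
    v1 = 1 + idx % (q - 1); idx //= q - 1
    v2 = 1 + idx % (q - 1); idx //= q - 1
    A = [(idx // q ** i) % q for i in range(k - 1)]
    return [v1] + A + [v2, 0]

def pointer_pos(ptr, k, q=Q):
    # Recover the position from a pointer word (inverse of make_pointer).
    v1, A, v2 = ptr[0], ptr[1:k], ptr[k]
    idx = sum(a * q ** i for i, a in enumerate(A))
    return ((idx * (q - 1) + v2 - 1) * (q - 1) + v1 - 1) + 1

def rightmost_run(word, k):
    # 1-based start position of the rightmost run of k+1 zeros, or None.
    run = 0
    for i in range(len(word) - 1, -1, -1):
        run = run + 1 if word[i] == 0 else 0
        if run == k + 1:
            return i + 1
    return None

def encode_round(source, k):
    # Append the marker '1'; if a forbidden run is present, remove the
    # rightmost one and overwrite the tail symbol with a pointer to its
    # position.  The full encoder repeats this until no forbidden run remains.
    word = list(source) + [1]
    p = rightmost_run(word, k)
    if p is None:
        return word
    del word[p - 1 : p + k]                    # remove the k+1 zeros
    return word[:-1] + make_pointer(p, k)      # tail symbol -> pointer

def decode_round(word, k):
    # Invert one replacement round.
    if word[-1] != 0:                          # tail symbol '1': nothing replaced
        return word[:-1]
    p = pointer_pos(word[-(k + 2):], k)
    word = word[:-(k + 2)] + [1]               # strip the pointer, restore the marker
    return (word[:p - 1] + [0] * (k + 1) + word[p - 1:])[:-1]

k = 3
src = [1, 3, 0, 0, 0, 0, 2, 1, 0, 3, 2]        # contains one forbidden run '0000'
cw = encode_round(src, k)
assert rightmost_run(cw, k) is None and len(cw) == len(src) + 1
assert decode_round(cw, k) == src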
As the pointer p_1 is in the range 1 ≤ p_1 ≤ n−k−1, and the number of distinct combinations of p_1 equals (q−1)^2 q^{k−1}, we conclude that the codeword length n is upperbounded by
$$n \leq (q-1)^2 q^{k-1} + k + 1, \quad k \geq 2. \qquad (9)$$
TABLE II
MAXIMUM LENGTHS n FOR WHICH A RATE (n−1)/n, k-CONSTRAINED 4-ARY CODE CAN BE CONSTRUCTED.

k   m = k + 1   n_max   n_A    n_B
1   2           25      22     11
2   3           113     106    39
3   4           467     445    148
4   5           1885    1848   581
The code uses one redundant q-ary symbol, so that we conclude that the redundancy of the sequence replacement code is approximately
$$\frac{\log_2 q}{n} \approx \frac{q \log_2 q}{(q-1)^2 q^{k}}, \quad k \gg 1. \qquad (10)$$
From (6), we infer that the redundancy of k-constrained q-ary sequences is at least
$$\log_2 q - C_k \approx \frac{1}{\ln 2}\,\frac{q-1}{q^{2} q^{k}}, \quad k \gg 1. \qquad (11)$$
The redundancy of the sequence replacement method is a factor of
$$\left(\frac{q}{q-1}\right)^{3} \ln q$$
larger than optimal for k ≫ 1. For DNA-based storage, q = 4, the factor is around 3.29. The above sequence replacement method is efficient in terms of redundancy and space/time hardware requirements, as no large look-up tables are needed. For example, for a maximum homopolymer run m = 4, we are able to construct a code of length n = 148 that needs only one redundant nucleotide.
Table II shows results of computations for rate (n−1)/n, k-constrained codes for q = 4 and various values of k, where n_B denotes the maximum n possible with the sequence replacement method (Method B). Results of computations of n_max have been collected in Table II.
C. Translating binary k-constrained codes into quaternary
k-constrained sequences, q= 4
Very efficient constructions of binary k-constrained codes
that avoid long repetitions of a ‘zero’ have been published in
the literature, see, for example, the survey in [9]. We show
that after applying a simple coding step to a k-constrained
binary sequence, we obtain a strand of nucleotides, where the length of a homopolymer run is at most m = ⌈k/2⌉.
We start with definitions of two simple operations on symbol sequences and their (unique) inverses. Let x = (x_1, . . . , x_n), x_i ∈ {0, 1}, denote a word of n binary symbols. The first operation is defined as follows. The n-bit sequence x is translated into an n/2-symbol sequence w, where two consecutive binary symbols of x are translated into one quaternary symbol w_i ∈ {0, 1, 2, 3} using
$$w_i = 2x_{2i-1} + x_{2i}, \quad 1 \leq i \leq n/2.$$
The above operation is denoted by the shorthand notation w = Z(x).
The second operation, usually called precoding, is defined as follows. The word w = (w_1, . . . , w_n), w_i ∈ {0, 1}, is obtained by modulo 2 integration of x, that is, by the operation
$$w_i = \bigoplus_{k=1}^{i} x_k = w_{i-1} \oplus x_i, \quad 1 \leq i \leq n, \qquad (12)$$
where the dummy symbol w_0 = 0, and the symbol ⊕ denotes symbol-wise modulo 2 addition. The above operation is denoted by the shorthand notation w = I(x). Note that the original word x can be uniquely restored by a modulo 2 differentiation operation, defined by
$$x_i = w_i \oplus w_{i-1}, \quad 1 \leq i \leq n. \qquad (13)$$
The above differentiation operation is denoted by x = I^{-1}(w). Clearly,
$$I^{-1}(I(x)) = x. \qquad (14)$$
Assume that the binary source data have been converted into a binary k-constrained sequence x = (x_1, . . . , x_n), where x_i ∈ {0, 1}, using a suitable k-constrained code. Then, by definition, substrings in x of more than k consecutive 'zero's are absent. Note that the operation w = Z(x) will not result in a k-constrained sequence w. In order to limit the runlengths of the output word w, we first apply a two-step precoding operation, defined by
$$w = I(I(x)). \qquad (15)$$
For example, we can easily verify the three operations on the sequence x:
x= 011000011111111111001111000111
I(x) = 010000010101010101110101111010
I(I(x)) = 011111100110011001011001010011
Z(I(I(x))) = 133212121121103,
where x is a (k = 4)-constrained binary sequence. After the first precoding step, the sequence I(x) is a regular runlength limited sequence with a maximum 'zero' and 'one' runlength equal to k + 1 (= 5). The second precoding step limits the number of consecutive 'one's and 'zero's in I(I(x)) to k + 2, and it also limits runs of alternating '10' bits to a length of k + 2, thus prohibiting the generation of long homopolymer runs.
In the above example, the 4-ary output sequence Z(I(I(x))) has a maximum homopolymer run of m = 2. In general, it can easily be verified that, in case the binary input sequence x is k-constrained, the 4-ary sequence Z(I(I(x))) has a maximum homopolymer run given by
$$m = \left\lceil k/2 \right\rceil, \quad k > 2. \qquad (16)$$
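The two precoding steps and the bit-pair mapping are straightforward to implement. The following Python sketch, added purely for illustration, reproduces the worked example above.

def I(x):
    # Precoding: modulo-2 integration, w_i = w_{i-1} XOR x_i with w_0 = 0, Eq. (12).
    w, prev = [], 0
    for b in x:
        prev ^= b
        w.append(prev)
    return w

def I_inv(w):
    # Modulo-2 differentiation, x_i = w_i XOR w_{i-1}, Eq. (13).
    x, prev = [], 0
    for b in w:
        x.append(b ^ prev)
        prev = b
    return x

def Z(x):
    # Map bit pairs onto quaternary symbols, w_i = 2*x_{2i-1} + x_{2i}.
    return [2 * x[i] + x[i + 1] for i in range(0, len(x), 2)]

def max_homopolymer(s):
    best = run = 1
    for a, b in zip(s, s[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

x = [int(c) for c in "011000011111111111001111000111"]   # the (k = 4)-constrained example
w = Z(I(I(x)))
print("".join(map(str, w)))            # 133212121121103, as in the example above
print(max_homopolymer(w))              # 2 for this particular sequence
assert I_inv(I_inv(I(I(x)))) == x      # the two precoding steps are invertible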
The above method offers a simple translation of binary k-
constrained sequences into a strand of nucleotides with limited
homopolymer runs, which creates the opportunity to apply the
vast literature on binary runlength limited coding to DNA-
based storage.
IV. CONCLUSIONS
We have presented coding methods for translating source
data into strands of nucleotides with a maximum homopolymer
run. We found that the proposed algorithms can be imple-
mented efficiently, and that the information densities of the
constructed codes are close to the theoretical maximum. We
have proposed a sequence replacement method for k-constrained q-ary data, which yields a significantly lower coding redundancy than the prior art sequence replacement method for k-constrained binary data. We have shown that, using two simple steps of precoding, it is possible to translate a binary k-constrained sequence into a strand of nucleotides with a limited maximum homopolymer run, which creates the opportunity to apply a myriad of prior art binary code constructions to DNA-based storage.
REFERENCES
[1] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6012, pp. 1628-1628, 2012.
[2] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig, "A DNA-based Archival Storage System," ACM SIGOPS Operating Systems Review, vol. 50, pp. 637-649, 2016.
[3] K. A. S. Immink, Codes for Mass Data Storage Systems, Second Edition, ISBN 90-74249-27-2, Shannon Foundation Publishers, Eindhoven, Netherlands, 2004.
[4] K. A. S. Immink, "Runlength-Limited Sequences," Proceedings of the IEEE, vol. 78, no. 11, pp. 1745-1759, Nov. 1990.
[5] C. V. Freiman and A. D. Wyner, "Optimum Block Codes for Noiseless Input Restricted Channels," Information and Control, vol. 7, pp. 398-415, 1964.
[6] C. E. Shannon, "A Mathematical Theory of Communication," Bell Syst. Tech. J., vol. 27, pp. 379-423, July 1948.
[7] I. F. Blake, "The Enumeration of Certain Run Length Sequences," Information and Control, vol. 55, pp. 222-237, 1982.
[8] W. H. Kautz, "Fibonacci Codes for Synchronization Control," IEEE Trans. Inform. Theory, vol. IT-11, pp. 284-292, 1965.
[9] A. J. de Lind van Wijngaarden and K. A. S. Immink, "Construction of Maximum Run-Length Limited Codes Using Sequence Replacement Techniques," IEEE Journal on Selected Areas in Communications, vol. 28, pp. 200-207, 2010.
[10] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, "Characterizing and measuring bias in sequence data," Genome Biol., vol. 14, R51, 2013.
A new family of codes is described for representing serial binary data, subject to constraints on the maximum separation between successive changes in value (0 rightarrow 1, 1 rightarrow , or both), or between successive like digits ( 0 's, 1 's, or both). These codes have application to the recording or transmission of digital data without an accompanying clock. In such cases, the clock must be regenerated during reading (receiving, decoding), and its accuracy controlled directly from the data itself. The codes developed for this type of synchronization are shown to be optimal, and to require a very small amount of redundancy. Their encoders and decoders are not unreasonably complex, and they can be easily extended to include simple error detection or correction for almost the same additional cost as is required for arbitrary data.