
# Capacity-Approaching Constrained Codes with Error Correction for DNA-Based Data Storage


## Abstract

We propose coding techniques that limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, {\epsilon} > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, (ii) GC-content constraint: the GC-content of each codeword is within $[0.5-{\epsilon}, 0.5+{\epsilon}]$, (iii) Error-correction: each codeword is capable of correcting a single deletion, or single insertion, or single substitution error. For practical values of $\ell$ and ${\epsilon}$, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation.
Tuan Thanh Nguyen, Kui Cai, Kees A. Schouhamer Immink, and Han Mao Kiah

## I. Introduction

In a DNA-based storage system, the input user data is translated into a large number of DNA strands (also known as DNA
sequences or oligos), which are synthesized and stored in a DNA pool. To retrieve the original data, the stored DNA strands are
sequenced and translated inversely back to the binary data. Several experiments have been conducted since 2012 (see [1]–[7]),
and it has been found that substitutions, deletions, and insertions are common errors occurring at the stages of synthesis and
sequencing. To improve the reliability of DNA storage, several channel coding techniques, including constrained coding and
error correction coding, have been introduced [8]–[12].
In a DNA strand, two properties that significantly increase the chance of errors for most synthesis and sequencing technologies are long homopolymer runs [6], [7] and high (or low) GC-content. A homopolymer run refers to the repetition of the same nucleotide. Ross et al. [6] reported that a homopolymer run of length more than six would result in a significant increase of substitution and deletion errors (see [6, Fig. 5]), and therefore, such long runs should be avoided. On the other hand, the GC-content of a DNA strand refers to the percentage of nucleotides that are either G or C, and DNA strands whose GC-content is too high or too low are more prone to both synthesis and sequencing errors (see, for example, [6], [13]). Therefore, most experiments used DNA strands whose GC-content is close to 50% (for example, between 40% and 60% [7], or between 45% and 55% [4]).
Designing efficient constrained codes to translate binary data into DNA strands that satisfy the homopolymer runlength constraint (also known as the runlength limited constraint, or RLL constraint for short) and the GC-content constraint has been a challenge. In the literature, several coding techniques have been introduced, mostly focusing on one specific value of the maximum runlength or requiring the GC-content to be exactly 50%, also known as the GC-balanced constraint [8], [9], [11], [12]. To encode GC-balanced codewords, most works used a modification of Knuth's balancing method for binary sequences [14]. Since the constraint is strong, the coding redundancy is large (approximately $\log n$, where $n$ is the length of each codeword). In this work, we investigate the problem of translating binary data to DNA strands whose GC-content is close to 50%, and we refer to this as almost-balanced. Via a simple modification of Knuth's method, we show that the number of redundant bits can be gracefully reduced from $\log n$ to $O(1)$.
Constrained codes can reduce the occurrence of substitution, deletion, and insertion errors in the DNA storage system. However, a constrained code by itself cannot correct errors. Recent works characterize the error probabilities by analyzing data from experiments and then demonstrate the need for error-correction codes. For example, Organick et al. recently stored 200MB of data in 13 million DNA strands and reported substitution, deletion, and insertion rates of $4.5\times 10^{-3}$, $1.5\times 10^{-3}$ and $5.4\times 10^{-4}$, respectively [5]. Since current technologies can only synthesize DNA strands of one to two hundred nucleotides, it is most likely that there is at most one error of each type. Motivated by this error behavior, several works focused on the construction of error-correction codes that are capable of correcting a single edit (i.e. a single substitution, or a single deletion, or a single insertion) and its variants [9], [10]. However, the problem of combining constrained codes for both the homopolymer runlength and GC-content constraints with single-edit-correction codes has not been addressed.
In this work, we propose novel channel coding techniques for DNA storage, where the codebooks satisfy the RLL constraint, the GC-content constraint, and can also correct a single edit and its variants. During the decoding of the proposed constrained codes, a small number of corrupted bits at the channel output might lead to massive error propagation of the decoded bits. Our proposed combination of constrained codes with error-correction codes also helps to minimize the error propagation during decoding.

Tuan Thanh Nguyen and Kui Cai are with the Singapore University of Technology and Design, Singapore 487372 (email: {tuanthanh nguyen, cai kui}@sutd.edu.sg).
Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands (email: immink@turing-machines.com).
Han Mao Kiah is with the School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371 (email: hmkiah@ntu.edu.sg).
arXiv:2001.02839v1 [cs.IT] 9 Jan 2020

| Notation | Description |
|---|---|
| $\Sigma_q$ | alphabet of size $q$ |
| $\Sigma_4$ | quaternary alphabet, i.e. $q = 4$, $\Sigma_4 = \{0,1,2,3\}$ |
| $\mathcal{D}$ | DNA alphabet, $\mathcal{D} = \{A, T, C, G\}$ |
| $xy$ | the concatenation of two sequences |
| $x \Vert y$ | the interleaved sequence |
| $\sigma, U_\sigma, L_\sigma$ | a DNA sequence $\sigma$, the upper sequence of $\sigma$, and the lower sequence of $\sigma$ |
| $\Psi$ | the one-to-one map that converts a DNA sequence to a binary sequence |
| $\mathrm{Syn}(x)$ | the syndrome of a sequence $x$ |
| indel | single insertion or single deletion |
| edit | single insertion, or single deletion, or single substitution |
| $B^{\mathrm{indel}}(x)$ | the set of words that can be obtained from $x$ via at most a single indel |
| $B^{\mathrm{edit}}(x)$ | the set of words that can be obtained from $x$ via at most a single edit |

| Encoder / Decoder | Description | Redundancy | Remark |
|---|---|---|---|
| $\mathrm{ENC}^A_{\mathrm{RLL}}, \mathrm{DEC}^A_{\mathrm{RLL}}$ | RLL encoder and decoder for $\ell$-runlength limited codes using the enumeration technique | $r_A = n - \lfloor \log_4 \lvert C(n,\ell,q)\rvert \rfloor$ (symbols) | Section III-A |
| $\mathrm{ENC}^B_{\mathrm{RLL}}, \mathrm{DEC}^B_{\mathrm{RLL}}$ | RLL encoder and decoder for $\ell$-runlength limited codes using the sequence replacement technique | $r_B = 1$ (symbol) if $n \le (q-1)q^{\ell-1}+\ell-1$, or $\lceil n/((q-2)q^{\ell-1}+\ell) \rceil$ symbols otherwise | Section III-B |
| $\mathrm{ENC}^C_{\mathrm{GC}}, \mathrm{DEC}^C_{\mathrm{GC}}$ | GC encoder and decoder for $\epsilon$-balanced quaternary codes using a binary template | $r_C = \lceil \log_2(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ (bits) | Section IV-C |
| $\mathrm{ENC}^D_{\mathrm{GC}}, \mathrm{DEC}^D_{\mathrm{GC}}$ | GC encoder and decoder for $\epsilon$-balanced quaternary codes using Knuth's technique | $r_D = 2\lceil \log_4(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ (symbols) | Section IV-D |
| $\mathrm{ENC}_{(\epsilon,\ell)}, \mathrm{DEC}_{(\epsilon,\ell)}$ | constrained encoder/decoder for $\epsilon$-balanced and $\ell$-runlength limited codes | $r_A + r_D + 4$ (symbols) or $r_B + r_D + 4$ (symbols) | Section V |
| $\mathrm{ENC}_{(\epsilon,\ell;B^{\mathrm{indel}})}, \mathrm{DEC}_{(\epsilon,\ell;B^{\mathrm{indel}})}$ | error-control encoder/decoder for $\epsilon$-balanced and $\ell$-runlength limited codes that can correct an indel | $r_A + r_D + \log_2 n + \Theta(1)$ (symbols) or $r_B + r_D + \log_2 n + \Theta(1)$ (symbols) | Section VI-B |
| $\mathrm{ENC}_{(\epsilon,\ell;B^{\mathrm{edit}})}, \mathrm{DEC}_{(\epsilon,\ell;B^{\mathrm{edit}})}$ | error-control encoder/decoder for $\epsilon$-balanced and $\ell$-runlength limited codes that can correct an edit | $r_A + r_D + 2\log_2 n + \Theta(1)$ (symbols) or $r_B + r_D + 2\log_2 n + \Theta(1)$ (symbols) | Section VI-C |

TABLE I: Notation and Results Summary. The redundancy is computed for DNA codewords of length $n$, given $\ell, \epsilon > 0$.
The paper is organized as follows. We first introduce the notation in Section II. In Section III, we present two efficient RLL coding methods that limit the maximum homopolymer run in each codeword to be at most $\ell$ for arbitrary $\ell > 0$. Our methods are based on enumeration coding and the sequence replacement technique, respectively. In Section IV, via a simple modification of Knuth's balancing method, we describe linear-time encoders/decoders that translate binary data to DNA strands whose GC-content is within $[0.5-\epsilon, 0.5+\epsilon]$ for arbitrary $\epsilon > 0$. This method yields a significant improvement in coding redundancy with respect to prior works. Then, in Section V, we present an efficient $(\epsilon, \ell)$-constrained coding method where codewords obey both the RLL constraint and the GC-content constraint. In Section VI, we modify the $(\epsilon, \ell)$-constrained coding so that the codewords can correct a single deletion, or single insertion, or single substitution error.

For the convenience of the reader, relevant notation and terminology referred to throughout the paper is summarized in Table I.

## II. Notation

Let $\Sigma_q = \{0, 1, 2, \ldots, q-1\}$ denote an alphabet of size $q \ge 2$. In particular, when $q = 4$, we use the following relation $\Phi$ between the decimal alphabet $\Sigma_4 = \{0,1,2,3\}$ and the nucleotides $\mathcal{D} = \{A, T, C, G\}$: $\Phi: 0 \to A$, $1 \to T$, $2 \to C$, and $3 \to G$.

Given two sequences $x$ and $y$, we let $xy$ denote the concatenation of the two sequences. In the special case where $x, y \in \Sigma_q^n$, we use $x \Vert y$ to denote their interleaved sequence $x_1 y_1 x_2 y_2 \ldots x_n y_n$.

Let $\sigma = \sigma_1 \sigma_2 \ldots \sigma_n \in \Sigma_4^n$ denote a 4-ary strand of $n$ nucleotides. The GC-content or weight of strand $\sigma$, denoted by $\omega(\sigma)$, is defined by $\omega(\sigma) = (1/n)\sum_{i=1}^{n} \varphi(\sigma_i)$, where $\varphi(\sigma_i) = 0$ if $\sigma_i \in \{0,1\}$ and $\varphi(\sigma_i) = 1$ if $\sigma_i \in \{2,3\}$. Given $\epsilon > 0$, we say that $\sigma$ is $\epsilon$-balanced if $|\omega(\sigma) - 0.5| \le \epsilon$, in other words, $\omega(\sigma) \in [0.5-\epsilon, 0.5+\epsilon]$. In particular, when $n$ is even and $\epsilon = 0$, we say $\sigma$ is GC-balanced. Over the binary alphabet, a vector $x \in \{0,1\}^n$ is called balanced if the number of ones in $x$, or the weight $\mathrm{wt}(x)$, is $n/2$.

On the other hand, given $\ell > 0$, we say that $\sigma$ is $\ell$-runlength limited if any run of the same nucleotide has length at most $\ell$. For DNA-based storage, we are interested in codewords that are $\epsilon$-balanced and $\ell$-runlength limited for sufficiently small $\epsilon = o(1)$, $\ell = o(n)$.

Definition 1. A nucleotide encoder $\mathrm{ENC}: \{0,1\}^m \to \Sigma_4^n$ is an $(\epsilon, \ell)$-constrained encoder if $\mathrm{ENC}(x)$ is $\epsilon$-balanced and $\ell$-runlength limited for all $x \in \{0,1\}^m$.
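
The two constraints in Definition 1 are easy to check mechanically. The following is a minimal sketch, assuming strands are given as Python lists over $\{0,1,2,3\}$ and $\epsilon$ is passed as an exact `Fraction`; the function names `gc_content`, `max_run` and `is_constrained` are hypothetical and not part of the paper.

```python
from fractions import Fraction
from itertools import groupby

GC_SYMBOLS = {2, 3}  # under the relation Phi: 2 -> C, 3 -> G


def gc_content(sigma):
    """Weight omega(sigma): the fraction of symbols that are C or G."""
    return Fraction(sum(1 for s in sigma if s in GC_SYMBOLS), len(sigma))


def max_run(sigma):
    """Length of the longest run of identical symbols in sigma."""
    return max(len(list(group)) for _, group in groupby(sigma))


def is_constrained(sigma, eps, ell):
    """True if sigma is eps-balanced and ell-runlength limited."""
    return abs(gc_content(sigma) - Fraction(1, 2)) <= eps and max_run(sigma) <= ell


sigma = [0, 0, 0, 2, 1, 1, 1, 0, 1, 1]  # the strand 0002111011
print(gc_content(sigma), max_run(sigma))
print(is_constrained(sigma, Fraction(1, 10), 3))
```

Exact rational arithmetic is used here only to avoid floating-point issues at the interval boundaries $0.5 \pm \epsilon$.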
Motivated by the error behavior in DNA storage, we investigate constrained codes that also have error-correction capability. Such codes are referred to as error-control codes. We use $B$ to denote the error ball function. For a sequence $x \in \Sigma_4^n$, let $B^D(x)$, $B^I(x)$, and $B^S(x)$ denote the set of all words obtained from $x$ via a single deletion, single insertion, or at most one substitution, respectively, and set

$$B^{\mathrm{indel}}(x) \triangleq B^I(x) \cup B^D(x), \qquad B^{\mathrm{edit}}(x) \triangleq B^S(x) \cup B^I(x) \cup B^D(x).$$

Observe that when $\sigma \in \Sigma_4^n$, both $B^{\mathrm{indel}}(\sigma)$ and $B^{\mathrm{edit}}(\sigma)$ are subsets of $\Sigma_4^{n-1} \cup \Sigma_4^{n} \cup \Sigma_4^{n+1}$. Hence, for convenience, we use $\Sigma_4^{n*}$ to denote the set $\Sigma_4^{n-1} \cup \Sigma_4^{n} \cup \Sigma_4^{n+1}$.

Definition 2. Let $C \subseteq \Sigma_4^n$. Given $\epsilon, \ell > 0$ and the error ball function $B$, we say that $C$ is an $(\epsilon, \ell; B)$-error-control code if
(i) for all $c \in C$, $c$ is $\epsilon$-balanced,
(ii) for all $c \in C$, $c$ is $\ell$-runlength limited, and
(iii) $B(c) \cap B(c') = \emptyset$ for all distinct $c, c' \in C$.
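
Condition (iii) can be checked by brute force for small codes. The sketch below enumerates the single-deletion, single-insertion and single-substitution balls of a word and tests pairwise disjointness. It is an illustration only: words are assumed to be tuples over $\{0,\ldots,q-1\}$, and all function names are hypothetical.

```python
from itertools import combinations


def deletion_ball(x):
    """All words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def insertion_ball(x, q=4):
    """All words obtained from x by inserting exactly one symbol."""
    return {x[:i] + (a,) + x[i:] for i in range(len(x) + 1) for a in range(q)}


def substitution_ball(x, q=4):
    """All words obtained from x by at most one substitution (x included)."""
    return {x[:i] + (a,) + x[i + 1:] for i in range(len(x)) for a in range(q)} | {x}


def edit_ball(x, q=4):
    """B^edit(x): union of the three balls above."""
    return deletion_ball(x) | insertion_ball(x, q) | substitution_ball(x, q)


def corrects_single_edit(code, q=4):
    """True if the edit balls of all distinct codewords are pairwise disjoint."""
    return all(edit_ball(c, q).isdisjoint(edit_ball(d, q))
               for c, d in combinations(code, 2))


print(len(edit_ball((0, 1))))
print(corrects_single_edit([(0, 0, 0, 0), (1, 1, 1, 1)]))
```

This exhaustive check is exponential in nothing but quadratic in the code size, so it is only practical for toy codes; it is meant to make the definition concrete, not to replace the constructions of Section VI.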
For a code $C \subseteq \Sigma_q^n$, the rate of $C$, denoted by $\mathrm{rate}_C$, is defined by $\mathrm{rate}_C \triangleq (1/n)\log_q |C|$. The asymptotic rate of the family of codes $\{C(n, N; q)\}_{n=1}^{\infty}$ is defined by $\lim_{n\to\infty} (1/n)\log_q |C(n, N; q)|$, if the limit exists.

Definition 3. A nucleotide encoder $\mathrm{ENC}: \{0,1\}^m \to \Sigma_4^n$ is an $(\epsilon, \ell; B)$-error-control encoder if $\mathrm{ENC}(x)$ is $\epsilon$-balanced and $\ell$-runlength limited for all $x \in \{0,1\}^m$, and furthermore there exists a decoder map $\mathrm{DEC}: \Sigma_4^{n*} \to \{0,1\}^m$ such that the following hold.
(i) For all $x \in \{0,1\}^m$, we have $\mathrm{DEC} \circ \mathrm{ENC}(x) = x$.
(ii) If $c = \mathrm{ENC}(x)$ and $c' \in B(c)$, then $\mathrm{DEC}(c') = x$.

Hence, the code is $C = \{c : c = \mathrm{ENC}(x), x \in \{0,1\}^m\}$ and $|C| = 2^m$. The redundancy of the encoder is measured by the value $2n - m$ (in bits) or $n - m/2$ (in nucleotide symbols).

## III. Efficient Homopolymer Runlength Limited Codes

We present two methods of constructing maximum runlength limited $q$-ary constrained codes. Method A uses the enumerative coding technique to rank/unrank all codewords. While the technique is standard in the constrained coding and combinatorics literature, our contribution is a detailed analysis of the space and time complexities of the respective algorithm. The encoder achieves the maximum code rate; for example, when $\ell = 3$, $n = 200$, $q = 4$, the rate of the encoder is 1.98 bits/nt. However, the time and space complexity is $O(n^2)$, which makes it less attractive than the sequence replacement technique in Method B.

### A. Method A Based on Enumeration Coding

Let $C(n, \ell, q)$ denote the set of all $q$-ary $\ell$-runlength limited sequences of length $n$. We first obtain a recursive formula for the size of $C(n, \ell, q)$. This recursive formula is useful in the development of the ranking/unranking methods. To this end, we partition $C(n, \ell, q)$ into $\ell$ classes and provide bijections from $q$-ary $\ell$-runlength limited sequences of shorter lengths into them.

For $1 \le i \le \ell$, let $C_i(n, \ell, q)$ denote the set of all $q$-ary $\ell$-runlength limited sequences of length $n$ whose suffix is the repetition of a symbol in $\Sigma_q$ exactly $i$ times. Clearly, we have $C_i(n, \ell, q) \cap C_j(n, \ell, q) = \emptyset$ for $i \ne j$ and

$$C(n, \ell, q) = \bigcup_{i=1}^{\ell} C_i(n, \ell, q).$$

Let $[n]$ denote the set $\{1, 2, \ldots, n\}$. Consider maps $\phi_1, \phi_2, \ldots, \phi_\ell$ where

$$\phi_i : C(n-i, \ell, q) \times [q-1] \to C_i(n, \ell, q), \quad \text{for } 1 \le i \le \ell.$$

If $x = x_1 x_2 \ldots x_{n-i} \in C(n-i, \ell, q)$ and $j \in [q-1]$, set $a$ to be the $j$th element in $\Sigma_q \setminus \{x_{n-i}\}$. Then set $\phi_i(x, j) = x_1 x_2 \ldots x_{n-i} a^i$. Here, $a^i$ denotes the repetition of symbol $a$ for $i$ times.
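Theorem 4. For $1 \le i \le \ell$, the map $\phi_i$ is a bijection. We then have the following recursion. For $1 \le n \le \ell$, $|C(n, \ell, q)| = q^n$, and for $n > \ell$,

$$|C(n, \ell, q)| = \sum_{i=1}^{\ell} (q-1)\,|C(n-i, \ell, q)|.$$

Therefore, the asymptotic rate of $C(n, \ell, q)$ is $\log_q \lambda$, where $\lambda$ is the largest real root of the equation $x^{\ell} - \sum_{i=0}^{\ell-1} (q-1)x^{i} = 0$.

Proof. We prove that $\phi_i$ is a bijection for $1 \le i \le \ell$ by constructing the inverse map $\phi_i^{-1}$. Specifically, we set $\phi_i^{-1} : C_i(n, \ell, q) \to C(n-i, \ell, q) \times [q-1]$ such that for $x = x_1 x_2 \ldots x_n \in C_i(n, \ell, q)$, $\phi_i^{-1}(x) = (x_1 \ldots x_{n-i}, j)$, where $j$ is the index of $x_n$ in $\Sigma_q \setminus \{x_{n-i}\}$. It can be verified that $\phi_i \circ \phi_i^{-1}$ and $\phi_i^{-1} \circ \phi_i$ are identity maps on their respective domains. Since $C(n, \ell, q) = \bigcup_{i=1}^{\ell} C_i(n, \ell, q)$ is a disjoint union, we then have for $n > \ell$

$$|C(n, \ell, q)| = \sum_{i=1}^{\ell} (q-1)\,|C(n-i, \ell, q)|.$$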
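
The recursion of Theorem 4 can be evaluated directly and compared against exhaustive enumeration for small parameters. This is a minimal sketch; the helper names `rll_sizes` and `brute_force_count` are hypothetical.

```python
from itertools import groupby, product


def rll_sizes(n, ell, q):
    """Return {m: |C(m, ell, q)|} for m = 1..n using the recursion of Theorem 4."""
    sizes = {}
    for m in range(1, n + 1):
        if m <= ell:
            sizes[m] = q ** m  # every word of length <= ell is ell-runlength limited
        else:
            sizes[m] = (q - 1) * sum(sizes[m - i] for i in range(1, ell + 1))
    return sizes


def brute_force_count(n, ell, q):
    """Count ell-runlength limited words of length n by enumerating all q^n words."""
    return sum(1 for w in product(range(q), repeat=n)
               if max(len(list(g)) for _, g in groupby(w)) <= ell)


sizes = rll_sizes(5, 3, 4)
print([sizes[m] for m in range(1, 6)])   # [4, 16, 64, 252, 996]
print(brute_force_count(5, 3, 4))        # 996
```

Python integers have arbitrary precision, so the same function can be used for lengths such as $n = 200$ without overflow.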
We now construct the RLL-Encoder A by providing a method of ranking/unranking all codewords in $C(n, \ell, q)$. A ranking function for a finite set $S$ of cardinality $N$ is a bijection $\mathrm{rank} : S \to [N]$. Associated with the function rank is a unique unranking function $\mathrm{unrank} : [N] \to S$, such that $\mathrm{rank}(s) = j$ if and only if $\mathrm{unrank}(j) = s$ for all $s \in S$ and $j \in [N]$.

The basis of our ranking and unranking algorithms is the bijections $\{\phi_i\}_{i=1}^{\ell}$ defined earlier. As implied by the codomains of these maps, for $n > \ell$, we order the words in $C(n, \ell, q)$ such that words in $C_i(n, \ell, q)$ are ordered before words in $C_j(n, \ell, q)$ for $i < j$. For words in $C(n, \ell, q)$ where $n \le \ell$, we simply order them lexicographically. We illustrate the idea behind the unranking algorithm through an example.
Example 5. Let $n = 5$, $q = 4$, $\ell = 3$ (in this example the four symbols are written as 1, 2, 3, 4). We then have $|C(n,3,4)| = 3|C(n-1,3,4)| + 3|C(n-2,3,4)| + 3|C(n-3,3,4)|$ and the values of $|C(m, \ell, q)|$ are as follows.

| $m$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $\lvert C(m,3,4)\rvert$ | 4 | 16 | 64 | 252 | 996 |

Suppose we want to compute the 900th codeword $c \in C(5,3,4)$, in other words, $\mathrm{unrank}(900)$. We have

$$C(5,3,4) = C_1(5,3,4) \cup C_2(5,3,4) \cup C_3(5,3,4) = \phi_1(C(4,3,4) \times [3]) \cup \phi_2(C(3,3,4) \times [3]) \cup \phi_3(C(2,3,4) \times [3]).$$

Since $900 > 3|C(4,3,4)| = 756$ and $900 < 3|C(4,3,4)| + 3|C(3,3,4)| = 948$, the 900th codeword of $C(5,3,4)$, which is the $(900 - 756) = 144$th codeword in $C_2(5,3,4)$, is in the image of the map $\phi_2$. Since $144 = 3 \times 48 + 0$, the construction of $\phi_2$ tells us that the 144th codeword in $C_2(5,3,4)$ is the image of the 48th codeword $x \in C(3,3,4)$ under $\phi_2$. The 48th word of $C(3,3,4)$ is 344. Hence, $c = \phi_2(x, 3)$. This gives

$$\mathrm{unrank}(900) = \phi_2(344, 3) = 34433.$$
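The formal unranking/ranking algorithms are described in Algorithm 1 and Algorithm 2.

Algorithm 1 $\mathrm{unrank}(n, \ell, q, M)$
Input: Integers $n \ge 1$, $\ell \ge 1$, $q \ge 2$, $1 \le M \le |C(n, \ell, q)|$
Output: $c$, where $c$ is the codeword of rank $M$ in $C(n, \ell, q)$

1. If $n \le \ell$, return the $M$th codeword in $C(n, \ell, q)$.
2. Search for the first index $1 \le j \le \ell$ such that $M \le \sum_{i=1}^{j} (q-1)|C(n-i, \ell, q)|$.
3. $M' \leftarrow M - \sum_{i=1}^{j-1} (q-1)|C(n-i, \ell, q)|$.
4. $M'' \leftarrow \lceil M'/(q-1) \rceil$.
5. $k \leftarrow M' \pmod{q-1}$, where a remainder of $0$ is read as $k = q-1$.
6. Return $\phi_j(\mathrm{unrank}(n-j, \ell, q, M''), k)$.

Algorithm 2 $\mathrm{rank}(n, \ell, q, c)$
Input: $n \ge 1$, $\ell \ge 1$, $q \ge 2$ and codeword $c = c_1 c_2 \ldots c_n$
Output: $M$, where $1 \le M \le |C(n, \ell, q)|$, the rank of $c$ in $C(n, \ell, q)$

1. If $n \le \ell$, return $\mathrm{rank}(c)$ in $C(n, \ell, q)$.
2. Let the suffix of $c$ be the repetition of symbol $a$ for exactly $i$ times.
3. $c' \leftarrow c_1 c_2 \ldots c_{n-i}$.
4. $k \leftarrow$ the index of $a$ in $\Sigma_q \setminus \{c_{n-i}\}$.
5. Return $(\mathrm{rank}(n-i, \ell, q, c') - 1)(q-1) + k + \sum_{j=1}^{i-1} (q-1)|C(n-j, \ell, q)|$.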
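
The two algorithms can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the alphabet is taken as $\{0, \ldots, q-1\}$ (so the word 34433 of Example 5, written there with symbols 1 to 4, becomes `[2, 3, 3, 2, 2]`), ranks are 1-based, and the helper `size` is a hypothetical name for $|C(n, \ell, q)|$.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def size(n, ell, q):
    """|C(n, ell, q)| via the recursion of Theorem 4."""
    if n <= ell:
        return q ** n
    return (q - 1) * sum(size(n - i, ell, q) for i in range(1, ell + 1))


def unrank(n, ell, q, M):
    """Codeword of rank M (1-based) in C(n, ell, q), as a list over {0..q-1}."""
    if n <= ell:  # lexicographic order: base-q digits of M - 1
        return [(M - 1) // q ** (n - 1 - i) % q for i in range(n)]
    j, below = 1, 0  # below = number of words in the classes C_1 .. C_{j-1}
    while M > below + (q - 1) * size(n - j, ell, q):
        below += (q - 1) * size(n - j, ell, q)
        j += 1
    m1 = M - below                  # rank of the word inside class C_j
    m2 = -(-m1 // (q - 1))          # ceil(m1 / (q - 1)): rank of the prefix
    k = (m1 - 1) % (q - 1) + 1      # index (1..q-1) of the appended symbol
    prefix = unrank(n - j, ell, q, m2)
    a = [s for s in range(q) if s != prefix[-1]][k - 1]
    return prefix + [a] * j


def rank(n, ell, q, c):
    """Rank (1-based) of the ell-runlength limited word c in C(n, ell, q)."""
    if n <= ell:
        value = 0
        for s in c:
            value = value * q + s
        return value + 1
    i = 1  # length of the final run of c
    while i < n and c[n - 1 - i] == c[n - 1]:
        i += 1
    prefix = c[:n - i]
    k = [s for s in range(q) if s != prefix[-1]].index(c[-1]) + 1
    below = sum((q - 1) * size(n - j, ell, q) for j in range(1, i))
    return (rank(n - i, ell, q, prefix) - 1) * (q - 1) + k + below


word = unrank(5, 3, 4, 900)
print(word, rank(5, 3, 4, word))   # [2, 3, 3, 2, 2] 900
```

The recursion depth is at most $n$, and each level performs a constant number of big-integer operations for fixed $\ell$, in line with the complexity discussion in the text.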
Example 6. Let $n = 5$, $\ell = 3$ and $q = 4$ as before. Suppose we want to compute $\mathrm{rank}(34433)$. Since $34433 \in C_2(5,3,4)$, we have that 34433 is obtained by applying $\phi_2$ to $344 \in C(3,3,4)$. The appended symbol is 3, which is the third element in $\Sigma_4 \setminus \{4\}$. Therefore,

$$\mathrm{rank}(34433) = 3|C(4,3,4)| + 3(\mathrm{rank}(344) - 1) + 3 = 3 \times 252 + 3 \times 47 + 3 = 900.$$

The set of values $\{|C(m, \ell, q)| : m \le n\}$ required in Algorithms 1 and 2 can be precomputed based on the recurrence in Theorem 4. Since the size of $C(n, \ell, q)$ grows exponentially, each value needs $O(n)$ bits, and these $n$ stored values require $O(n^2)$ space.

Next, Algorithms 1 and 2 involve $O(n)$ iterations and each iteration involves a constant number of arithmetic operations. Therefore, Algorithms 1 and 2 involve $O(n)$ arithmetic operations and have time complexity $O(n^2)$. For completeness, we summarize the RLL-Encoder A and RLL-Decoder A as follows.
RLL-Encoder A. Set $m = \lfloor \log_2 |C(n, \ell, q)| \rfloor$.
INPUT: $x \in \{0,1\}^m$
OUTPUT: $c \triangleq \mathrm{ENC}^A_{\mathrm{RLL}}(x) \in C(n, \ell, q)$
(I) Let $M$ be the positive integer whose binary representation of length $m$ is $x$.
(II) Use Algorithm 1, set $c = \mathrm{unrank}(n, \ell, q, M)$.
(III) Output $c$.

RLL-Decoder A. Set $m = \lfloor \log_2 |C(n, \ell, q)| \rfloor$.
INPUT: $c \in C(n, \ell, q)$
OUTPUT: $x \triangleq \mathrm{DEC}^A_{\mathrm{RLL}}(c) \in \{0,1\}^m$
(I) Use Algorithm 2, set $M = \mathrm{rank}(n, \ell, q, c)$.
(II) Let $x$ be the binary representation of length $m$ of $M$.
(III) Output $x$.

### B. Method B Based on Sequence Replacement Technique

The sequence replacement technique has been widely used in the literature [8], [15]–[17]. This is an efficient method for removing forbidden substrings from a source word. In general, the encoder removes the forbidden strings and subsequently inserts their representation (which also includes the position of the substring) at predefined positions in the sequence. For example, Schoeny et al. [17] used only one redundant bit to encode $\ell$-RLL binary sequences with $\ell \ge \lceil \log n \rceil + 3$. However, for DNA data storage, with $n \in [100, 200]$, it is normally required that $\ell \le 6$. Recently, Immink et al. [8] described a simple method for constructing $\ell$-runlength limited $q$-ary codes. However, the required codeword length $n$ is bounded by a function of $\ell$ and $q$. For example, when $\ell = 3$, the method is only applicable for $n \le 39$ (refer to [8, Table II]). In this work, we show that this bound can be improved, and hence, the redundancy can be further reduced. For the DNA storage channel, when $n \le 200$ and $\ell \in \{5, 6\}$, our encoder incurs only one redundant symbol.
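Definition 7. For a sequence $x = x_1 x_2 \ldots x_n \in \Sigma_q^n$, the differential of $x$, denoted by $\mathrm{Diff}(x)$, is a sequence $y = y_1 y_2 \ldots y_n \in \Sigma_q^n$, where $y_1 = x_1$ and $y_i = x_i - x_{i-1} \pmod q$ for $2 \le i \le n$.

It is easy to see that from $y = y_1 y_2 \ldots y_n = \mathrm{Diff}(x)$, we can determine $x$ uniquely as $x_i = \sum_{j=1}^{i} y_j \pmod q$ for $1 \le i \le n$. For convenience, we write $x = \mathrm{Diff}^{-1}(y)$.

Lemma 8. Let $x \in \Sigma_q^n$. If the longest run of zeros in $\mathrm{Diff}(x)$ is at most $\ell - 1$, then $x$ is $\ell$-runlength limited.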
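
A short sketch of the differential map and its inverse, assuming sequences are Python lists over $\{0, \ldots, q-1\}$ (the names `diff` and `diff_inverse` are hypothetical):

```python
def diff(x, q):
    """Differential of x: y_1 = x_1 and y_i = x_i - x_{i-1} (mod q)."""
    return [x[0]] + [(x[i] - x[i - 1]) % q for i in range(1, len(x))]


def diff_inverse(y, q):
    """Inverse map: x_i is the running sum y_1 + ... + y_i (mod q)."""
    x, total = [], 0
    for s in y:
        total = (total + s) % q
        x.append(total)
    return x


x = [1, 1, 1, 3, 0, 0]
y = diff(x, 4)
print(y)                   # [1, 0, 0, 2, 1, 0]
print(diff_inverse(y, 4))  # [1, 1, 1, 3, 0, 0]
```

In this example the longest run of zeros in `y` has length 2 and the longest run in `x` has length 3, which illustrates Lemma 8 with $\ell = 3$: a run of $r$ identical symbols in $x$ produces a run of $r - 1$ zeros in $\mathrm{Diff}(x)$.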
We now present an efficient encoder for $\ell$-runlength limited $q$-ary codes, and refer to it as RLL Encoder B or $\mathrm{ENC}^B_{\mathrm{RLL}}$. For a source word $x \in \Sigma_q^{N-1}$, we encode $y = \mathrm{ENC}(x) \in \Sigma_q^{N}$ such that $y$ contains no $0^{\ell}$ as a substring, and then output $c = \mathrm{Diff}^{-1}(y)$.

Initial Step. The encoder simply appends a '0' to the end of $x$, yielding the $N$-symbol word $x0$. The encoder then checks the word $x0$, and if there is no substring $0^{\ell}$, the output is simply $c = x0$. Otherwise, it proceeds to the replacement step.

Replacement Procedure. Let the current word be $c = y0^{\ell}z$, where, by assumption, the prefix $y$ has no forbidden $0^{\ell}$ and the run $0^{\ell}$ starts at position $p$, where $1 \le p \le N - \ell + 1$. The encoder removes $0^{\ell}$ and updates the current word to be $c = yzRe$, where the pointer $Re$ is used to represent the position $p$, and
(i) $R \in \Sigma_q^{\ell-1}$,
(ii) $e \in \Sigma_q \setminus \{0\}$.

Note that the number of unique combinations of the pointer $Re$ equals $(q-1)q^{\ell-1}$, and that the current word $c = yzRe$ is of length $N$. If, after the replacement, $c$ contains no substring $0^{\ell}$, then the encoder returns $c$ as the codeword. Otherwise, the encoder repeats the replacement procedure for the current word $c$ until all substrings $0^{\ell}$ have been removed. Note that during every step, the length of the codeword is preserved. Since the last symbol in any pointer is nonzero, the concatenation of any two consecutive pointers $R_1 e_1 R_2 e_2$ does not produce any substring $0^{\ell}$, and this procedure is guaranteed to terminate.

As the position $p$ is in the range $1 \le p \le N - \ell + 1$, and the number of combinations of $Re$ equals $(q-1)q^{\ell-1}$, we conclude that $N$ is upper bounded by

$$N \le (q-1)q^{\ell-1} + \ell - 1, \quad \text{for } \ell \ge 2. \tag{1}$$

Decoding Procedure. The decoder checks from right to left. If the last symbol is '0', the decoder simply removes the symbol '0' and identifies the first $N-1$ symbols as the source data. On the other hand, if the last symbol is not '0', the decoder takes the suffix of length $\ell$, identifies it as the pointer, and then adds back the substring $0^{\ell}$ accordingly. It terminates when the first symbol '0' is found.
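
The replacement and decoding procedures above can be sketched as follows. Several details are assumptions made for illustration: positions are 0-based, the position $p$ is written in base $q$ with its most significant digit shifted by one so that the last pointer symbol $e$ is nonzero, and the function names are hypothetical. The sketch produces the word $y$ without $0^{\ell}$; applying $\mathrm{Diff}^{-1}$ to obtain the final codeword is left out.

```python
def find_zero_run(word, ell):
    """0-based start of the first run of ell zeros, or -1 if there is none."""
    run = 0
    for i, s in enumerate(word):
        run = run + 1 if s == 0 else 0
        if run == ell:
            return i - ell + 1
    return -1


def make_pointer(p, ell, q):
    """Pointer Re for position p: R has ell - 1 symbols and e is nonzero."""
    high, low = divmod(p, q ** (ell - 1))
    R = [low // q ** (ell - 2 - i) % q for i in range(ell - 1)]
    return R + [high + 1]


def read_pointer(pointer, q):
    """Recover the position p from a pointer Re."""
    *R, e = pointer
    low = 0
    for s in R:
        low = low * q + s
    return (e - 1) * q ** len(R) + low


def encode_b(x, ell, q):
    """Map x (length N - 1) to a word of length N with no substring 0^ell."""
    assert len(x) + 1 <= (q - 1) * q ** (ell - 1) + ell - 1  # bound (1)
    word = list(x) + [0]
    while (p := find_zero_run(word, ell)) >= 0:
        word = word[:p] + word[p + ell:] + make_pointer(p, ell, q)
    return word


def decode_b(word, ell, q):
    """Undo the replacements from right to left until the marker '0' is last."""
    word = list(word)
    while word[-1] != 0:
        p = read_pointer(word[-ell:], q)
        body = word[:-ell]
        word = body[:p] + [0] * ell + body[p:]
    return word[:-1]


y = encode_b([0, 0, 0, 0, 0], ell=2, q=4)
print(y, decode_b(y, ell=2, q=4))
```

Each replacement removes $\ell$ zeros and adds a pointer containing at most $\ell - 1$ zeros, so the number of zeros strictly decreases and the loop in `encode_b` ends.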
Remark 9. The bound in (1) implies that for $q = 4$, $\ell \in \{4, 5, 6\}$, our encoder uses only one redundant symbol for all $n \le 196$. Table II shows the improvement with respect to the result provided in [8]. In addition, this algorithm can be easily extended to the case of arbitrary length $n \ge N$. The main idea is that we divide the source data into subwords of length $N - 1$, encode each subword separately and concatenate them. The pointer representation needs to be modified so that the concatenation of any two encoded subwords does not contain a substring $0^{\ell}$. To do so, we simply append '1' to the end of the source data instead, and require the pointers to be of the form $Re$ where $R \in \Sigma_q^{\ell-1}$ and $e \notin \{0, 1\}$. The replacement procedure and decoding procedure proceed similarly.

| $\ell$ | $n_{\max}$, bound in (1) | $n_{\max}$, previous work [8] |
|---|---|---|
| 2 | 13 | 11 |
| 3 | 50 | 39 |
| 4 | 195 | 148 |
| 5 | 772 | 581 |

TABLE II: Maximum length $n$ for which an encoder can achieve the rate $(n-1)/n$ for $\ell$-runlength limited quaternary codes.

## IV. Efficient GC-Content Constrained Codes

In this section, we propose linear-time encoders/decoders that translate binary input data to DNA strands whose GC-content is within $[0.5-\epsilon, 0.5+\epsilon]$ for arbitrary $\epsilon > 0$, with a fixed number of redundant bits. This method yields a significant improvement in coding redundancy with respect to prior works. We first review Knuth's balancing technique.

### A. Knuth's Balancing Technique

Knuth's balancing technique is a linear-time algorithm that maps a binary message $x$ to a balanced word $y$ of the same length by flipping the first $t$ bits of $x$ [14]. The crucial observation demonstrated by Knuth is that such an index $t$ always exists, and $t$ is commonly referred to as the balancing index. To represent the balancing index, Knuth appends to $y$ a short balanced suffix of length $\log n$, and so a lookup table of size $\log n$ is required.

Several works in the literature used this technique to encode DNA strands whose GC-content is exactly balanced (for example, [9], [12]), and the coding redundancy is approximately $\log n$. We generalize this technique for binary codes first.
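
### B. Generalization of Knuth's Balancing Technique

Definition 10. Let $n$ be even. For arbitrary $\epsilon > 0$, a binary word $x \in \{0,1\}^n$ is $\epsilon$-balanced if the weight of $x$, $\mathrm{wt}(x)$, satisfies

$$\left| \frac{\mathrm{wt}(x)}{n} - 0.5 \right| \le \epsilon.$$

In other words, we have $0.5n - \epsilon n \le \mathrm{wt}(x) \le 0.5n + \epsilon n$.

Definition 11. Let $n$ be even. For arbitrary $\epsilon > 0$, the index $t$, where $0 \le t \le n$, is called an $\epsilon$-balanced index of $x \in \{0,1\}^n$ if the word $y$ obtained by flipping the first $t$ bits in $x$ is $\epsilon$-balanced.

We now show that such an index $t$ always exists and there is an efficient method to find $t$. For $n$ even, let the $\epsilon$-balanced set $S_{\epsilon,n} \subset \{0, 1, 2, \ldots, n\}$ be the following set of indices:

$$S_{\epsilon,n} = \{0, n\} \cup \{2\lfloor \epsilon n \rfloor, 4\lfloor \epsilon n \rfloor, 6\lfloor \epsilon n \rfloor, \ldots\}. \tag{2}$$

The size of $S_{\epsilon,n}$ is at most $\lfloor 1/2\epsilon \rfloor + 1$.

Theorem 12. Let $n$ be even, $\epsilon > 0$. For an arbitrary binary sequence $x \in \{0,1\}^n$, there exists an index $t$ in the set $S_{\epsilon,n}$ such that $t$ is an $\epsilon$-balanced index of $x$.

Proof. In the trivial case, when $x$ is $\epsilon$-balanced, the index is $t = 0$, which is in the set $S_{\epsilon,n}$. Assume that $x$ is not $\epsilon$-balanced, and without loss of generality, assume that $\mathrm{wt}(x) < 0.5n - \epsilon n$. Let $\mathrm{Flip}_k(x)$ be the word obtained by flipping the first $k$ bits in $x$. Since $\mathrm{wt}(x) < 0.5n - \epsilon n$, we have $\mathrm{wt}(\mathrm{Flip}_n(x)) > 0.5n + \epsilon n$. Now consider the list of indices that we try in order to obtain an $\epsilon$-balanced word, $t_1 = 2\lfloor \epsilon n \rfloor$, $t_2 = 4\lfloor \epsilon n \rfloor$, and so on. Since $\mathrm{Flip}_{t_i}(x)$ and $\mathrm{Flip}_{t_{i+1}}(x)$ differ in at most $2\epsilon n$ positions, and $\mathrm{wt}(x) < 0.5n - \epsilon n$, $\mathrm{wt}(\mathrm{Flip}_n(x)) > 0.5n + \epsilon n$, there must be an index $t$ such that $0.5n - \epsilon n \le \mathrm{wt}(\mathrm{Flip}_t(x)) \le 0.5n + \epsilon n$.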
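
The search for an $\epsilon$-balanced index over $S_{\epsilon,n}$ can be sketched as below and checked exhaustively for a small length. The sketch assumes $\lfloor \epsilon n \rfloor \ge 1$ and passes $\epsilon$ as an exact `Fraction`; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import floor


def balanced_set(n, eps):
    """The index set S_{eps,n} of equation (2); assumes floor(eps * n) >= 1."""
    step = 2 * floor(eps * n)
    return sorted({0, n} | set(range(step, n + 1, step)))


def is_balanced(x, eps):
    """True if |wt(x)/n - 1/2| <= eps."""
    return abs(Fraction(sum(x), len(x)) - Fraction(1, 2)) <= eps


def flip_prefix(x, t):
    """Flip_t(x): complement the first t bits of x."""
    return [1 - b for b in x[:t]] + list(x[t:])


def balancing_index(x, eps):
    """First index t in S_{eps,n} such that Flip_t(x) is eps-balanced."""
    for t in balanced_set(len(x), eps):
        if is_balanced(flip_prefix(x, t), eps):
            return t
    raise ValueError("no eps-balanced index found in S")


eps = Fraction(1, 10)
print(balanced_set(10, eps))           # [0, 2, 4, 6, 8, 10]
print(balancing_index([0] * 10, eps))  # 4

# Exhaustive check of Theorem 12 for n = 10, eps = 0.1.
assert all(balancing_index(list(x), eps) in balanced_set(10, eps)
           for x in product((0, 1), repeat=10))
```

Only $|S_{\epsilon,n}| \le \lfloor 1/2\epsilon \rfloor + 1$ candidate indices are tried, which is the source of the constant redundancy.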
We provide two methods to construct GC-content constrained codes. The first method uses $\epsilon$-balanced binary codes as a template to construct $\epsilon$-balanced quaternary codes with at most $\lceil \log(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ bits of redundancy. On the other hand, the second method proceeds directly over the quaternary alphabet and appends a short balanced suffix to the end of each codeword to indicate the $\epsilon$-balanced index.

### C. Binary Construction of GC-Content Constrained Codes

When $q = 4$, we consider the following one-to-one correspondence between the quaternary alphabet and two-bit sequences:

$$0 \leftrightarrow 00, \quad 1 \leftrightarrow 01, \quad 2 \leftrightarrow 10, \quad 3 \leftrightarrow 11.$$

Therefore, given a DNA sequence $\sigma$ of length $n$, we have a corresponding binary sequence $x \in \{0,1\}^{2n}$ and we write $x = \Psi(\sigma)$ or $\sigma = \Psi^{-1}(x)$. Given $\sigma \in \Sigma_4^n$, let $x = \Psi(\sigma) \in \{0,1\}^{2n}$ and set $U_\sigma = x_1 x_3 \cdots x_{2n-1}$ and $L_\sigma = x_2 x_4 \cdots x_{2n}$. In other words, $\sigma = \Psi^{-1}(U_\sigma \Vert L_\sigma)$. We refer to $U_\sigma$ and $L_\sigma$ as the upper sequence and lower sequence of $\sigma$, respectively. The following result is immediate.

Lemma 13. Let $\sigma \in \Sigma_4^n$. Then $\sigma$ is $\epsilon$-balanced if and only if $U_\sigma$ is $\epsilon$-balanced.
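GC-Encoder C. Given $n$, $\epsilon > 0$, set $k = \lceil \log(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ and $m = 2n - k$. Let $S_{\epsilon,n}$ be the set of indices as constructed in (2), and construct a one-to-one correspondence between the indices in $S_{\epsilon,n}$ and $k$-bit sequences.
INPUT: $x \in \{0,1\}^n$, $y \in \{0,1\}^{n-k}$ and so $xy \in \{0,1\}^m$
OUTPUT: $\sigma = \mathrm{ENC}^C_{\mathrm{GC}}(xy)$
(I) Search for the first $t$ in $S_{\epsilon,n}$ such that $\mathrm{Flip}_t(x)$ is $\epsilon$-balanced.
(II) Set $x' = \mathrm{Flip}_t(x)$.
(III) Let $z$ be the $k$-bit sequence representing index $t$.
(IV) Set $y' = yz$, of length $n$.
(V) Finally, set $\sigma \triangleq \Psi^{-1}(x' \Vert y')$.

Example 14. Let $n = 10$, $\epsilon = 0.1$, $k = \lceil \log(\lfloor 1/2\epsilon \rfloor + 1) \rceil = 3$, i.e. we want the GC-content of each codeword to be within $[0.4, 0.6]$. The set $S_{\epsilon,n} = \{0, 2, 4, 6, 8, 10\}$ is of size six. We construct the one-to-one correspondence between the indices and 3-bit sequences: $0 \leftrightarrow 000$, $2 \leftrightarrow 001$, $4 \leftrightarrow 010$, $6 \leftrightarrow 100$, $8 \leftrightarrow 011$ and $10 \leftrightarrow 111$. Suppose the input sequence is $c = 0^{17}$, i.e. $x = 0^{10}$ and $y = 0^{7}$. We find the index $t = 4$. Following the encoder, we get $x' = 1111000000$ and $y' = 0000000010$. We then obtain $\sigma = \Psi^{-1}(x' \Vert y') = 2222000010$.

GC-Decoder C. Given $n$, $\epsilon > 0$, set $k = \lceil \log(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ and $m = 2n - k$.
INPUT: $\sigma \in \Sigma_4^n$, $\sigma$ is $\epsilon$-balanced
OUTPUT: $xy \in \{0,1\}^m$
(I) Set $x' = U_\sigma \in \{0,1\}^n$ and $y' = L_\sigma \in \{0,1\}^n$.
(II) Let $z$ be the suffix of length $k$ of $y'$ and let $t$ be the index in $S_{\epsilon,n}$ corresponding to $z$.
(III) Set $x = \mathrm{Flip}_t(x')$.
(IV) Set $y$ to be $y'$ with the suffix $z$ removed.
(V) Finally, output $xy$.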
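
GC-Encoder C and GC-Decoder C can be sketched as follows. The correspondence between indices and $k$-bit sequences is left open in the text; as an assumption, the sketch represents $t$ by the binary expansion of its position in the sorted set $S_{\epsilon,n}$, which happens to agree with Example 14 for $t = 4$. It also assumes $\lfloor \epsilon n \rfloor \ge 1$ and $k \ge 1$. All function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def index_set(n, eps):
    """S_{eps,n} from equation (2); assumes floor(eps * n) >= 1."""
    step = 2 * floor(eps * n)
    return sorted({0, n} | set(range(step, n + 1, step)))


def num_index_bits(eps):
    """k = ceil(log2(floor(1/(2 eps)) + 1)), computed with integers only."""
    return floor(1 / (2 * eps)).bit_length()


def flip(x, t):
    """Complement the first t bits of x."""
    return [1 - b for b in x[:t]] + list(x[t:])


def balanced(x, eps):
    return abs(Fraction(sum(x), len(x)) - Fraction(1, 2)) <= eps


def encode_c(bits, n, eps):
    """Map 2n - k input bits to an eps-balanced quaternary word of length n."""
    S, k = index_set(n, eps), num_index_bits(eps)
    x, y = list(bits[:n]), list(bits[n:])
    assert len(y) == n - k
    t = next(t for t in S if balanced(flip(x, t), eps))
    z = [int(b) for b in format(S.index(t), f"0{k}b")]  # assumed index labelling
    upper, lower = flip(x, t), y + z
    return [2 * u + l for u, l in zip(upper, lower)]    # Psi^{-1}(upper || lower)


def decode_c(sigma, eps):
    """Inverse of encode_c."""
    n = len(sigma)
    S, k = index_set(n, eps), num_index_bits(eps)
    upper = [s >> 1 for s in sigma]
    lower = [s & 1 for s in sigma]
    t = S[int("".join(map(str, lower[n - k:])), 2)]
    return flip(upper, t) + lower[:n - k]


eps = Fraction(1, 10)
sigma = encode_c([0] * 17, 10, eps)
print(sigma)                   # [2, 2, 2, 2, 0, 0, 0, 0, 1, 0]
print(decode_c(sigma, eps) == [0] * 17)
```

By Lemma 13 only the upper sequence determines the GC-content, so the index bits can be stored in the lower sequence without affecting the balance.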
Remark 15. For constant $\epsilon > 0$, the complexity of GC-Encoder C is linear and the redundancy is constant. For example, when $n = 200$, $\epsilon = 0.1$, i.e. the GC-content is within $[0.4, 0.6]$, the set $S_{\epsilon,n} = \{0, 40, 80, 120, 160, 200\}$ is of size six. The GC-Encoder C uses only $\lceil \log 6 \rceil = 3$ bits of redundancy to indicate the $\epsilon$-balanced index in the lower sequence, and the rate of the encoder is 1.985 bits/nt. Similarly, when $\epsilon = 0.05$, i.e. the GC-content is within $[0.45, 0.55]$, the GC-Encoder C uses only $\lceil \log 11 \rceil = 4$ bits of redundancy and the rate is 1.98 bits/nt.

### D. Knuth-like Construction of GC-Content Constrained Codes

Consider the quaternary alphabet $\Sigma_4 = \{0, 1, 2, 3\}$. To apply Knuth's method, we define the flipping rule $f : \Sigma_4 \to \Sigma_4$, where $f(0) = 2$, $f(2) = 0$, $f(1) = 3$ and $f(3) = 1$. For a sequence $\sigma \in \Sigma_4^n$ and index $i$ with $0 \le i \le n$, $f_i(\sigma)$ denotes the sequence obtained by flipping the first $i$ symbols of $\sigma$ under $f$.

Definition 16. Let $n$ be even. For arbitrary $\epsilon > 0$, the index $t$, where $0 \le t \le n$, is called an $\epsilon$-balanced index of $\sigma \in \Sigma_4^n$ if the sequence $\sigma' = f_t(\sigma)$ is $\epsilon$-balanced.

Example 17. Consider $n = 10$, $\epsilon = 0.1$. Let $\sigma = 0000000000$. Observe that $f_4(\sigma) = 2222000000$, $f_5(\sigma) = 2222200000$ and $f_6(\sigma) = 2222220000$ are $\epsilon$-balanced. Hence, $t = 4, 5, 6$ are $\epsilon$-balanced indices of $\sigma$. In general, there might be more than one $\epsilon$-balanced index.

The following result follows from Theorem 12.

Corollary 18. Let $n$ be even, $\epsilon > 0$, and let the set $S_{\epsilon,n}$ be defined as in (2). For any sequence $\sigma \in \Sigma_4^n$, there exists an index $t$ in the set $S_{\epsilon,n}$ such that $t$ is an $\epsilon$-balanced index of $\sigma$.

To encode a sequence $\sigma$ into an $\epsilon$-balanced sequence, we first find the smallest $\epsilon$-balanced index $t$ of $\sigma$ in $S_{\epsilon,n}$, and then flip the first $t$ symbols of $\sigma$ according to the rule $f$. To represent the index, we also append a short balanced suffix to the end of the codeword, and so a lookup table of size $|S_{\epsilon,n}|$ is required and the redundancy is $\lceil \log(\lfloor 1/2\epsilon \rfloor + 1) \rceil$. The following result is trivial.
Lemma 19. Let $n, m$ be even. Assume that $\sigma \in \Sigma_4^n$ is $\epsilon$-balanced and $z \in \Sigma_4^m$ is balanced. Then the concatenated sequence $\sigma z$ is also $\epsilon$-balanced.

Example 20. Let $n = 200$, $\epsilon = 0.1$, i.e. we want the GC-content to be within $[0.4, 0.6]$, and the set $S_{\epsilon,n} = \{0, 40, 80, 120, 160, 200\}$ is of size six. We construct the one-to-one correspondence between the indices and short balanced suffixes of length 2 as follows: $0 \leftrightarrow 02$, $40 \leftrightarrow 03$, $80 \leftrightarrow 12$, $120 \leftrightarrow 13$, $160 \leftrightarrow 20$, $200 \leftrightarrow 30$. Assume that $\sigma \in \Sigma_4^{200}$ and the $\epsilon$-balanced index of $\sigma$ is $t = 40$. The encoder flips the first 40 symbols in $\sigma$ to obtain $\sigma'$ that is $\epsilon$-balanced, and then appends 03 to the end of $\sigma'$. The encoder uses only two redundant symbols for $\epsilon = 0.1$.
We now show that the suffix can be encoded and decoded in linear time without the use of a lookup table. In addition, in order to construct an $(\epsilon, \ell)$-constrained code, we encode the suffix in such a way that it is also $\ell$-runlength limited. The details are as follows.

Index Encoder. Let $n$ be even, $\epsilon, \ell > 0$, and let the set $S_{\epsilon,n}$ be defined as in (2). Set $k \triangleq \lceil \log_4(\lfloor 1/2\epsilon \rfloor + 1) \rceil$.
INPUT: $t$, $t \in S_{\epsilon,n}$, $0 \le t \le n-1$
OUTPUT: $p \triangleq \mathrm{INDEX\_ENC}(t)$
(I) Let $\tau_1 \tau_2 \cdots \tau_k$ be the quaternary representation of $t$ in $S_{\epsilon,n}$.
(II) Interleave the representation with the length-$k$ sequence $f(\tau_1) f(\tau_2) \cdots f(\tau_k)$ to obtain $p$ of length $2k$. In other words, set $p = \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_k f(\tau_k)$.

The corresponding GC-content Encoder and Decoder are described as follows.

GC-Encoder D. Given $n$, $\epsilon > 0$, set $k = \lceil \log_4(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ and $m = 2n - 4k$. Let $S_{\epsilon,n-2k}$ be the set of indices as constructed in (2), and construct a one-to-one correspondence between the indices in $S_{\epsilon,n-2k}$ and quaternary sequences of length $k$.
INPUT: $x \in \{0,1\}^m$
OUTPUT: $\sigma = \mathrm{ENC}^D_{\mathrm{GC}}(x)$
(I) Set $\sigma' = \Psi^{-1}(x) \in \Sigma_4^{n-2k}$.
(II) Search for the first $t$ in $S_{\epsilon,n-2k}$ such that $t$ is an $\epsilon$-balanced index of $\sigma'$.
(III) Obtain $\sigma''$ by flipping the first $t$ symbols in $\sigma'$.
(IV) Use the Index Encoder to obtain $p$ of length $2k$ representing the index $t$.
(V) Finally, set $\sigma \triangleq \sigma'' p$.

GC-Decoder D.
INPUT: $\sigma \in \Sigma_4^n$, $\sigma$ is $\epsilon$-balanced
OUTPUT: $x \triangleq \mathrm{DEC}^D_{\mathrm{GC}}(\sigma) \in \{0,1\}^m$
(I) Let $p$ be the suffix of length $2k$ of $\sigma$, and $\sigma'$ be the prefix of length $n - 2k$.
(II) Let $z$ be the sequence of symbols at the odd positions of $p$, which is the length-$k$ sequence representing the index $t$ in the set $S_{\epsilon,n-2k}$.
(III) Flip the first $t$ symbols in $\sigma'$ according to the flipping rule $f$ to obtain $\sigma''$.
(IV) Finally, output $x = \Psi(\sigma'')$.
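
A compact sketch of the Index Encoder together with GC-Encoder D and GC-Decoder D is given below. Two points are assumptions rather than statements from the text: the "quaternary representation of $t$" is taken to be the base-4 expansion of the position of $t$ in the sorted set $S_{\epsilon,n-2k}$, and $\lfloor \epsilon (n-2k) \rfloor \ge 1$ is required. Function names are hypothetical.

```python
from fractions import Fraction
from math import floor

F = {0: 2, 2: 0, 1: 3, 3: 1}  # flipping rule f: swaps a GC symbol with an AT symbol


def index_set(n, eps):
    """S_{eps,n} from equation (2); assumes floor(eps * n) >= 1."""
    step = 2 * floor(eps * n)
    return sorted({0, n} | set(range(step, n + 1, step)))


def suffix_digits(eps):
    """k = ceil(log4(floor(1/(2 eps)) + 1)), computed with integers only."""
    size, k = floor(1 / (2 * eps)) + 1, 0
    while 4 ** k < size:
        k += 1
    return k


def gc_count(sigma):
    return sum(1 for s in sigma if s >= 2)


def flip(sigma, t):
    """f_t(sigma): apply f to the first t symbols."""
    return [F[s] for s in sigma[:t]] + list(sigma[t:])


def index_encode(pos, k):
    """Balanced suffix tau_1 f(tau_1) ... tau_k f(tau_k) for the index position."""
    tau = [pos // 4 ** (k - 1 - i) % 4 for i in range(k)]
    return [symbol for s in tau for symbol in (s, F[s])]


def encode_d(bits, n, eps):
    k = suffix_digits(eps)
    body_len = n - 2 * k
    assert len(bits) == 2 * body_len
    sigma = [2 * bits[2 * i] + bits[2 * i + 1] for i in range(body_len)]  # Psi^{-1}
    S = index_set(body_len, eps)
    t = next(t for t in S
             if abs(Fraction(gc_count(flip(sigma, t)), body_len) - Fraction(1, 2)) <= eps)
    return flip(sigma, t) + index_encode(S.index(t), k)


def decode_d(sigma, eps):
    k = suffix_digits(eps)
    body, p = sigma[:len(sigma) - 2 * k], sigma[len(sigma) - 2 * k:]
    pos = 0
    for s in p[0::2]:              # symbols at the odd positions of p
        pos = pos * 4 + s
    t = index_set(len(body), eps)[pos]
    return [b for s in flip(body, t) for b in (s >> 1, s & 1)]  # f is an involution


eps = Fraction(1, 10)
sigma = encode_d([0] * 20, 14, eps)
print(sigma)   # [2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0]
print(decode_d(sigma, eps) == [0] * 20)
```

Every pair $\tau_i f(\tau_i)$ contains exactly one symbol from $\{2, 3\}$, so the suffix is balanced and Lemma 19 applies to the concatenation.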
Remark 21. The advantage of Encoder C is its low redundancy; however, it is hard to combine with an RLL Encoder to construct an $(\epsilon, \ell)$-constrained encoder. In the next section, we present an efficient $(\epsilon, \ell)$-constrained encoder using the construction of Encoder D and the two RLL Encoders presented in Section III.
V. EFFI CI EN T (, )-CONSTRAINED COD ES
In this section, we present an (, )-constrained encoder that translates binary data to DNA strands that are -runlength limited
and -balanced for arbitrary ,  > 0. Prior to this work, literature results mostly focused on speciﬁc values of and [11],
[12]. For example, Song et al. [11] used concatenation technique to design RLL encoder for = 3, and their simulated results
showed that the GC-content of all codewords is between 0.4 and 0.6, i.e. = 0.1, and for n= 200, the rate of the encoder is
1.9 (bits/nt). In this section, we provide a more efﬁcient coding scheme such that the output codewords are -runlength limited
and -balanced.
Example 22. Consider $n = 10$, $\epsilon = 0.1$, $\ell = 3$. Let $\sigma = 0002111011$. Observe that even though $\sigma$ is $\ell$-runlength limited, it is not $\epsilon$-balanced. We then get $f_3(\sigma) = 2222111011$, which is $\epsilon$-balanced. However, $f_3(\sigma)$ is not $\ell$-runlength limited.
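The numbers in Example 22 can be checked mechanically. The short Python sketch below assumes that symbols 2 and 3 count towards the GC-content and that $f$ swaps $0 \leftrightarrow 2$ and $1 \leftrightarrow 3$; the helper names are illustrative only.

```python
from itertools import groupby


def flip(symbol):
    """Assumed flipping rule f: 0<->2, 1<->3."""
    return (symbol + 2) % 4


def flip_prefix(seq, t):
    """f_t: apply f to the first t symbols only."""
    return [flip(s) for s in seq[:t]] + seq[t:]


def gc_content(seq):
    return sum(s >= 2 for s in seq) / len(seq)


def max_run(seq):
    """Length of the longest run of identical symbols."""
    return max(len(list(group)) for _, group in groupby(seq))


sigma = [int(c) for c in "0002111011"]
flipped = flip_prefix(sigma, 3)

print(gc_content(sigma), max_run(sigma))      # 0.1 3
print(gc_content(flipped), max_run(flipped))  # 0.4 4
```

Flipping the prefix moves the GC-content from 0.1 to 0.4, inside $[0.4, 0.6]$, but the flipped prefix 222 merges with the following 2 into a run of length four.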
The above example also illustrates that the sequence $f_t(\sigma)$ may not be $\ell$-runlength limited even when $\sigma$ is $\ell$-runlength limited. Nevertheless, we observe that the prefix and suffix of $f_t(\sigma)$ remain $\ell$-runlength limited. For brevity, given a sequence $\sigma \in \Sigma_4^n$, we use $P_i(\sigma)$ and $S_i(\sigma)$ to denote the prefix and suffix of $\sigma$ of length $i$, respectively.
Lemma 23. Let $0 \le t \le n$. If a sequence $\sigma$ is $\ell$-runlength limited and $\sigma' = f_t(\sigma)$, then $P_t(\sigma')$ and $S_{n-t}(\sigma')$ are both $\ell$-runlength limited.
To ensure that the obtained sequence remains $\ell$-runlength limited, we simply add one redundant symbol before concatenating $P_t(\sigma')$ and $S_{n-t}(\sigma')$.
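Corollary 24 (Concatenate two $\ell$-runlength limited sequences). Let $\sigma$, $\sigma'$ be $\ell$-runlength limited. Suppose that the last symbol of $\sigma$ is $\alpha$ and the first symbol of $\sigma'$ is $\beta$. Let $\gamma \in \Sigma_4 \setminus \{\alpha, \beta\}$; then $\sigma'' = \sigma \gamma \sigma'$ is $\ell$-runlength limited.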
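A minimal Python sketch of Corollary 24 follows. Choosing the smallest admissible symbol for $\gamma$ is an arbitrary choice made here for determinism; the corollary allows any symbol outside $\{\alpha, \beta\}$. The function names are hypothetical.

```python
from itertools import groupby


def max_run(seq):
    """Length of the longest run of identical symbols."""
    return max(len(list(group)) for _, group in groupby(seq))


def join_rll(left, right):
    """Concatenate two runlength-limited sequences with one separator symbol."""
    forbidden = set()
    if left:
        forbidden.add(left[-1])   # alpha
    if right:
        forbidden.add(right[0])   # beta
    gamma = min(s for s in range(4) if s not in forbidden)
    return left + [gamma] + right


# Prefix and suffix of f_3(0002111011) from Example 22, joined with a separator.
joined = join_rll([2, 2, 2], [2, 1, 1, 1, 0, 1, 1])
print(joined, max_run(joined))  # [2, 2, 2, 0, 2, 1, 1, 1, 0, 1, 1] 3
```

The separator forms a run of length one, so no run of the result is longer than the longest run already present in either part.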
We illustrate the construction of the $(\ell, \epsilon)$-constrained encoder through the following example.
Example 25 (Example 20 continued). Suppose $n = 200$, $\epsilon = 0.1$, and $\ell = 3$. We show that there exists an efficient $(\ell, \epsilon)$-constrained encoder with at most 8 redundant symbols. From the data sequence $\sigma \in \Sigma_4^{192}$, we use RLL Encoder A to obtain $\sigma_1 = \mathrm{ENC}^{A}_{RLL}(\sigma)$. This step requires two redundant symbols and hence, $\sigma_1 \in \Sigma_4^{194}$ is $\ell$-runlength limited. We now search for the $\epsilon$-balanced index $t$ of $\sigma_1$ in the set $S_{0.1,194}$ of size six, i.e. $\sigma_2 = f_t(\sigma_1)$ is $\epsilon$-balanced. Such an index can be represented by a pointer $p$ of size two (similar to Example 20). We follow Corollary 24 to find $\gamma, \gamma'$ such that $\sigma_3 = P_t(\sigma_2)\, \gamma\, S_{194-t}(\sigma_2)\, \gamma'\, p \in \Sigma_4^{198}$ is $\ell$-runlength limited. To ensure that the final output is $\epsilon$-balanced, recall that $P_t(\sigma_2) S_{194-t}(\sigma_2) p$ is $\epsilon$-balanced; we then output $\sigma_4 = \sigma_3 f(\gamma') f(\gamma)$. It is easy to verify that $\sigma_4$ is $\ell$-runlength limited and $\epsilon$-balanced. Thus, the encoder uses 8 redundant symbols to encode codewords of length 200, and hence, the rate is 1.92 (bits/nt).
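The redundancy count and the rate in Example 25 follow from a short calculation, written out here for convenience (each quaternary symbol carries two bits):

$$
\underbrace{2}_{\text{RLL Encoder A}} + \underbrace{2}_{\text{pointer } p} + \underbrace{2}_{\gamma,\ \gamma'} + \underbrace{2}_{f(\gamma'),\ f(\gamma)} = 8,
\qquad
\text{rate} = \frac{2\,(200 - 8)}{200} = \frac{384}{200} = 1.92 \ \text{bits/nt}.
$$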
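We now show that the representation $p$ of the $\epsilon$-balanced index can be encoded/decoded in linear time without using a lookup table. Suppose we want to encode codewords in $\Sigma_4^n$ where $n$ is even. Set $k \triangleq \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil$, and $N = n - 2k - 4$. Let $r_{RLL}$ denote the number of redundant symbols used by the RLL Encoder ($\mathrm{ENC}^{A}_{RLL}$ or $\mathrm{ENC}^{B}_{RLL}$) to encode the $\ell$-runlength limited codewords in $\Sigma_4^N$. We summarize our proposed $(\ell, \epsilon)$-constrained encoder as follows.
$(\ell, \epsilon)$-Constrained Encoder. Given $n$, $\ell$, $\epsilon$, with $n$ even and $\ell \ge 3$. Set $m = 2n - 2(r_{RLL} + 2k + 4)$. Let $S_{\epsilon,N}$ be the set of indices as defined by (2); we construct a one-to-one correspondence between the indices in $S_{\epsilon,N}$ and sequences of $k$ quaternary symbols.
INPUT: $x \in \{0,1\}^m$
OUTPUT: $\sigma \triangleq \mathrm{ENC}_{(\ell,\epsilon)}(x) \in \Sigma_4^n$
(I) Set $\sigma_1 = \Psi^{-1}(x) \in \Sigma_4^{n - r_{RLL} - 2k - 4}$
(II) Use the RLL Encoder to obtain $\sigma_2 = \mathrm{ENC}_{RLL}(\sigma_1)$, where $\sigma_2 \in \Sigma_4^N$ is $\ell$-runlength limited
(III) Search for the first $\epsilon$-balanced index $t$ of $\sigma_2$ in $S_{\epsilon,N}$
(IV) Flip the first $t$ symbols in $\sigma_2$ to obtain $\sigma_3 = f_t(\sigma_2)$
(V) Let $\tau_1 \tau_2 \cdots \tau_k$ be the quaternary representation of $t$ in $S_{\epsilon,N}$. Set $p = \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_k f(\tau_k)$
(VI) Use Corollary 24 to find $\gamma$ and $\gamma'$ such that $\sigma_4 = P_t(\sigma_3)\, \gamma\, S_{N-t}(\sigma_3)\, \gamma'\, p$ is $\ell$-runlength limited
(VII) Output $\sigma = \sigma_4 f(\gamma) f(\gamma')$. Note that $\sigma \in \Sigma_4^n$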
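Steps (III) to (VII) can be sketched in Python as follows, starting from a sequence $\sigma_2$ that is already $\ell$-runlength limited (the RLL Encoder and the map $\Psi$ are defined elsewhere in the paper and are not reproduced). The sketch assumes that $f$ swaps $0 \leftrightarrow 2$ and $1 \leftrightarrow 3$, that symbols 2 and 3 count towards the GC-content, that the index is represented by its rank in the index set, and that the index set is supplied by the caller. The set in the usage lines is the one listed in Example 20; all names are hypothetical.

```python
from itertools import groupby


def flip(symbol):
    """Assumed flipping rule f: 0<->2, 1<->3."""
    return (symbol + 2) % 4


def gc_content(seq):
    return sum(s >= 2 for s in seq) / len(seq)


def max_run(seq):
    return max(len(list(group)) for _, group in groupby(seq))


def separator(left, right):
    """A symbol differing from the last of left and the first of right (Corollary 24)."""
    forbidden = {left[-1]} if left else set()
    if right:
        forbidden.add(right[0])
    return min(s for s in range(4) if s not in forbidden)


def constrained_encode(sigma2, index_set, eps, k):
    # (III)-(IV): first listed index whose prefix flip balances sigma2.
    for rank, t in enumerate(index_set):
        sigma3 = [flip(s) for s in sigma2[:t]] + sigma2[t:]
        if abs(gc_content(sigma3) - 0.5) <= eps + 1e-12:
            break
    else:
        raise ValueError("no balancing index in the supplied set")
    # (V): interleaved pointer tau_1 f(tau_1) ... tau_k f(tau_k).
    digits = [(rank // 4 ** (k - 1 - i)) % 4 for i in range(k)]
    p = [d for tau in digits for d in (tau, flip(tau))]
    # (VI): separators gamma (between prefix and suffix) and gamma' (before p).
    gamma = separator(sigma3[:t], sigma3[t:])
    body = sigma3[:t] + [gamma] + sigma3[t:]
    gamma_prime = separator(body, p)
    sigma4 = body + [gamma_prime] + p
    # (VII): append f(gamma) f(gamma') so the separators do not disturb the balance.
    return sigma4 + [flip(gamma), flip(gamma_prime)]


sigma2 = [0, 0, 0, 1] * 50                # runs of length at most 3, GC-content 0
word = constrained_encode(sigma2, [0, 40, 80, 120, 160, 200], 0.1, 2)
print(len(word), max_run(word), round(gc_content(word), 3))
```

The output has length $N + 2k + 4$, matching the redundancy stated for the encoder once $r_{RLL}$ is added.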
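Theorem 26. The $(\ell, \epsilon)$-Constrained Encoder is correct. In other words, $\mathrm{ENC}_{(\ell,\epsilon)}(x)$ is $\epsilon$-balanced and $\ell$-runlength limited for all $x \in \{0,1\}^m$. The redundancy of the encoder is $r_{RLL} + 2k + 4$.
Proof. Let $\sigma = \mathrm{ENC}_{(\ell,\epsilon)}(x)$. We first show that $\sigma$ is $\ell$-runlength limited. According to Corollary 24, $\sigma_4$ is $\ell$-runlength limited. Since $\tau_i \ne f(\tau_i)$ for every $i$, any run in the concatenation $p f(\gamma) f(\gamma')$ has length at most three, so it is $\ell$-runlength limited for all $\ell \ge 3$. Therefore, $\sigma$ is $\ell$-runlength limited.
We now show that $\sigma$ is $\epsilon$-balanced. Since $\sigma_3$ is $\epsilon$-balanced, and $p$, $\gamma f(\gamma)$, and $\gamma' f(\gamma')$ are balanced, $\sigma$ is $\epsilon$-balanced (according to Lemma 19).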
Remark 27. The construction can be easily extended to $\ell \in \{1, 2\}$. For arbitrary $\epsilon > 0$, $k = \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil = O(1)$ is a constant. Therefore, the rate of this encoder approaches the rate of the RLL Encoder. If we use the RLL Encoder based on enumeration ($\mathrm{ENC}^{A}_{RLL}$), then the rate of the $(\ell, \epsilon)$-constrained encoder approaches the capacity for sufficiently large $n$. However, Encoder A runs in $\Theta(n^2)$ time. For DNA storage with $\ell \in \{4, 5, 6\}$, we can use the linear-time $\mathrm{ENC}^{B}_{RLL}$ to achieve as good a rate as $\mathrm{ENC}^{A}_{RLL}$ (refer to Remark 9).
For completeness, we describe the corresponding $(\ell, \epsilon)$-constrained decoder as follows.
$(\ell, \epsilon)$-Constrained Decoder.
INPUT: $\sigma \in \Sigma_4^n$, $\sigma$ is $\epsilon$-balanced and $\ell$-runlength limited
OUTPUT: $x \triangleq \mathrm{DEC}_{(\ell,\epsilon)}(\sigma) \in \{0,1\}^m$
(I) Let $p$ be the suffix of length $2k + 2$ and $\sigma_1$ be the prefix of length $n - 2k - 3$
(II) Remove the last two symbols in $p$
(III) Let $z$ be the subsequence of $p$ at odd indices, which is the length-$k$ quaternary sequence representing the index $t$ in $S_{\epsilon,N}$
(IV) Flip the first $t$ symbols in $\sigma_1$ according to the flipping rule $f$ to obtain $\sigma_2$
(V) Remove the $(t+1)$th symbol in $\sigma_2$
(VI) Use the RLL Decoder to obtain $\sigma_3 = \mathrm{DEC}_{RLL}(\sigma_2)$
(VII) Output $x = \Psi(\sigma_3)$
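Steps (I) to (V) of the decoder translate almost line by line into Python. The sketch below stops before the RLL Decoder and $\Psi$, which are defined elsewhere in the paper. It assumes that $f$ swaps $0 \leftrightarrow 2$ and $1 \leftrightarrow 3$ and that the pointer stores the rank of $t$ in an index set supplied by the caller; the toy index set and codeword are made up for illustration and the function name is hypothetical.

```python
def flip(symbol):
    """Assumed flipping rule f: 0<->2, 1<->3."""
    return (symbol + 2) % 4


def constrained_decode_prefix(sigma, index_set, k):
    """Recover the RLL-encoded sequence (input of the RLL Decoder) from a codeword."""
    n = len(sigma)
    p = sigma[n - (2 * k + 2):]        # (I) suffix of length 2k + 2
    sigma1 = sigma[:n - 2 * k - 3]     # (I) prefix of length n - 2k - 3 (skips gamma')
    p = p[:-2]                         # (II) drop f(gamma) f(gamma')
    rank = 0
    for tau in p[0::2]:                # (III) digits at odd positions (1-based)
        rank = 4 * rank + tau
    t = index_set[rank]
    sigma2 = [flip(s) for s in sigma1[:t]] + sigma1[t:]   # (IV) undo the prefix flip
    return sigma2[:t] + sigma2[t + 1:]                    # (V) remove the separator gamma


# Toy codeword for k = 1 and a made-up index set [0, 2, 4]:
# flipped data 2 2 | gamma 0 | 1 1 | gamma' 0 | pointer 1 3 | f(gamma) f(gamma') 2 2
codeword = [2, 2, 0, 1, 1, 0, 1, 3, 2, 2]
print(constrained_decode_prefix(codeword, [0, 2, 4], 1))  # [0, 0, 1, 1]
```

The separator $\gamma$ sits at position $t + 1$ and is not among the first $t$ symbols, so the flip in step (IV) leaves it untouched before it is removed in step (V).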
The efficiency of our designed $(\ell, \epsilon)$-constrained encoder is summarized in Table III. As can be seen, when the codeword length increases, the rate of our proposed encoder is only a few percent below capacity.
| Codeword length $n$ | Capacity $C$ | Rate of encoder $r$ | $\eta = r/C$ (%) |
|---|---|---|---|
| 100 | 1.99542 | 1.81000 | 90.707 |
| 200 | 1.99578 | 1.92000 | 96.203 |
| 300 | 1.99577 | 1.94000 | 97.206 |

TABLE III: Rate of the designed constrained encoder for $\epsilon = 0.1$ and $\ell = 4$.
VI. EFFICIENT $(\ell, \epsilon; B)$-ERROR-CONTROL CODES
We now construct $(\ell, \epsilon; B)$-error-control codes to correct the most common errors in DNA data storage, namely a single deletion, insertion, or substitution error. This also helps to reduce the error propagation of the constrained decoders proposed earlier. Crucial to our construction are the binary Varshamov-Tenengolts (VT) codes defined by Levenshtein [22] and the $q$-ary VT codes defined by Tenengolts [23].
A. Codes Correcting a Single Indel/Edit
Definition 28. The binary VT syndrome of a binary sequence $x \in \{0,1\}^n$ is defined to be $\mathrm{Syn}(x) = \sum_{i=1}^{n} i x_i$.
For $a \in \mathbb{Z}_{n+1}$, the Varshamov-Tenengolts code $\mathrm{VT}_a(n)$ is defined as follows.
$$\mathrm{VT}_a(n) = \{x \in \{0,1\}^n : \mathrm{Syn}(x) = a \pmod{n+1}\}. \tag{3}$$
For $a \in \mathbb{Z}_{n+1}$, the code $\mathrm{VT}_a(n)$ can correct a single indel, and Levenshtein later provided a linear-time decoding algorithm [22]. To also correct a substitution, Levenshtein [22] constructed the following code
$$L_a(n) = \{x \in \{0,1\}^n : \mathrm{Syn}(x) = a \pmod{2n}\}, \tag{4}$$
and provided a decoder that corrects a single edit.
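As a sanity check on definition (3), the Python sketch below enumerates $\mathrm{VT}_a(n)$ for a small $n$ and verifies by brute force that the single-deletion balls of distinct codewords are disjoint, which is what single-deletion correction means. This is an exhaustive check for illustration, not Levenshtein's linear-time decoder; the function names are hypothetical.

```python
from itertools import product


def syn(x):
    """Binary VT syndrome: sum of i * x_i with 1-based positions."""
    return sum(i * bit for i, bit in enumerate(x, start=1))


def vt_code(n, a):
    """All binary words of length n with Syn(x) = a (mod n + 1)."""
    return [x for x in product((0, 1), repeat=n) if syn(x) % (n + 1) == a]


def deletion_ball(x):
    """All words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def corrects_single_deletion(code):
    """True if no word is reachable by one deletion from two different codewords."""
    seen = set()
    for codeword in code:
        ball = deletion_ball(codeword)
        if seen & ball:
            return False
        seen |= ball
    return True


n = 6
print(all(corrects_single_deletion(vt_code(n, a)) for a in range(n + 1)))  # True
```

Replacing the modulus $n + 1$ by $2n$ in `vt_code` gives the membership test for $L_a(n)$ in (4).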
Theorem 29 (Levenshtein [22]). Let $L_a(n)$ be as defined in (4). There exists a linear-time decoding algorithm $\mathrm{DEC}^{L}_{a} : \{0,1\}^{n} \to L_a(n)$ such that the following holds. If $c \in L_a(n)$ and $y \in B^{\mathrm{edit}}(c)$, then $\mathrm{DEC}^{L}_{a}(y) = c$.
In 1984, Tenengolts [23] generalized the binary VT codes to nonbinary ones. Tenengolts defined the signature of a $q$-ary vector $x$ of length $n$ to be the binary vector $\pi(x)$ of length $n-1$, where $\pi(x)_i = 1$ if $x_{i+1} \ge x_i$, and $0$ otherwise, for $i \in [n-1]$. For $a \in \mathbb{Z}_n$ and $b \in \mathbb{Z}_q$, set
$$T_{a,b}(n;q) \triangleq \Big\{ x \in \mathbb{Z}_q^n : \pi(x) \in \mathrm{VT}_a(n-1) \ \text{and} \ \sum_{i=1}^{n} x_i = b \pmod{q} \Big\}.$$
Then Tenengolts showed that $T_{a,b}(n;q)$ corrects a single indel and there exist $a$ and $b$ such that the size of $T_{a,b}(n;q)$ is at least $q^n/(qn)$. These codes are known to be asymptotically optimal. In the same paper, Tenengolts also provided a systematic $q$-ary single-indel encoder with redundancy $\log n + C_q$, where $n$ is the length of a codeword and $C_q$ is independent of $n$.
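The two quantities that define $T_{a,b}(n;q)$ are easy to compute. The following Python sketch evaluates the signature $\pi(x)$, its VT syndrome modulo $n$ (the modulus of $\mathrm{VT}_a(n-1)$), and the symbol sum modulo $q$; the function names are illustrative only.

```python
def signature(x):
    """pi(x)_i = 1 if x_{i+1} >= x_i, else 0; the result has length len(x) - 1."""
    return [1 if x[i + 1] >= x[i] else 0 for i in range(len(x) - 1)]


def syn(bits):
    """Binary VT syndrome with 1-based positions."""
    return sum(i * bit for i, bit in enumerate(bits, start=1))


def tenengolts_parameters(x, q=4):
    """Return (a, b) such that x lies in T_{a,b}(n; q)."""
    n = len(x)
    a = syn(signature(x)) % n   # pi(x) has length n - 1, so VT_a(n - 1) works modulo n
    b = sum(x) % q
    return a, b


x = [0, 2, 0, 3, 1, 3]
print(signature(x))              # [1, 0, 1, 0, 1]
print(tenengolts_parameters(x))  # (3, 1)
```

Here the signature has syndrome $1 + 3 + 5 = 9 \equiv 3 \pmod 6$ and the symbols sum to $9 \equiv 1 \pmod 4$.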
Theorem 30 (Tenengolts [23]). There exists a linear-time decoding algorithm $\mathrm{DEC}^{T}_{(a,b)} : \{0,1\}^{n} \to T_{a,b}(n;q)$ such that the following holds. If $c \in T_{a,b}(n;q)$ and $y \in B^{\mathrm{indel}}(c)$, then $\mathrm{DEC}^{T}_{(a,b)}(y) = c$.
Recently, Cai et al. [9] presented linear-time encoders for GC-balanced codewords that are capable of correcting a single edit with $3 \log n + 2$ bits of redundancy. In the following, we use the idea of VT codes to modify the $(\ell, \epsilon)$-constrained code so that the codebook is capable of correcting either a single indel or a single edit.
For $\sigma \in \Sigma_4^n$, recall the definition of $U_\sigma, L_\sigma \in \{0,1\}^n$ and $x = U_\sigma \| L_\sigma = \Psi(\sigma)$ (refer to Section IV-III).
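Proposition 31. Let $\sigma \in \Sigma_4^n$. Then the following are true.
(a) $\sigma' \in B^{\mathrm{indel}}(\sigma)$ implies that $U_{\sigma'} \in B^{\mathrm{indel}}(U_\sigma)$ and $L_{\sigma'} \in B^{\mathrm{indel}}(L_\sigma)$.
(b) $\sigma' \in B^{\mathrm{edit}}(\sigma)$ implies that $U_{\sigma'} \in B^{\mathrm{edit}}(U_\sigma)$ and $L_{\sigma'} \in B^{\mathrm{edit}}(L_\sigma)$.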
Remark 32. The statement in Proposition 31 can be made stronger. Suppose that there is an indel at position $i$ of $\sigma$. Then there is exactly one indel at the same position $i$ in both the upper and lower sequences of $\sigma$. For example, consider $\sigma = 020313$. We have $U_\sigma = 010101$ and $L_\sigma = 000111$. If the third symbol in $\sigma$, which is 0, is deleted, we obtain $\sigma' = 02313$ and hence, $U_{\sigma'} = 01101$ and $L_{\sigma'} = 00111$.
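The position-preserving property described in Remark 32 can be checked with a few lines of Python. The sketch assumes that each quaternary symbol is split into its high bit (upper sequence) and low bit (lower sequence), which agrees with the upper sequence $U_\sigma = 010101$ given for $\sigma = 020313$; the exact map is defined earlier in the paper and the function name here is hypothetical.

```python
def upper_lower(sigma):
    """Split each symbol s in {0,1,2,3} into (high bit, low bit); assumed form of U and L."""
    upper = [s >> 1 for s in sigma]
    lower = [s & 1 for s in sigma]
    return upper, lower


sigma = [0, 2, 0, 3, 1, 3]
upper, lower = upper_lower(sigma)

# Deleting position i in sigma deletes position i in both binary sequences.
for i in range(len(sigma)):
    damaged = sigma[:i] + sigma[i + 1:]
    damaged_upper, damaged_lower = upper_lower(damaged)
    assert damaged_upper == upper[:i] + upper[i + 1:]
    assert damaged_lower == lower[:i] + lower[i + 1:]

print(upper)  # [0, 1, 0, 1, 0, 1]
```

Because the split acts symbol by symbol, an insertion or substitution at position $i$ likewise touches only position $i$ of each binary sequence.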
The following construction is trivial.
Corollary 33. For $n > 0$, $a \in \mathbb{Z}_{2n}$, $b \in \mathbb{Z}_{2n}$, let $C_{(a,b)}(n)$ be the set of all sequences $\sigma \in \Sigma_4^n$ such that $U_\sigma \in L_a(n)$ and $L_\sigma \in L_b(n)$. Then $C_{(a,b)}(n)$ is capable of correcting a single edit error.
B. Construction of $(\ell, \epsilon; B^{\mathrm{indel}})$-Error-Control Codes
We follow Tenengolts's construction to encode DNA sequences that are capable of correcting a single indel. We simply append the information of the syndrome and the sum of symbols to the end of each codeword. In addition, we use the idea of the Index Encoder (refer to Section IV-D) to ensure the redundant part is balanced and $\ell$-runlength limited. The extra redundancy is $\log n + 4$. For simplicity, assume that $k' = \log n$ is an integer and $k'$ is even.
$(\ell, \epsilon; B^{\mathrm{indel}})$-Error-Control Encoder. Let $n$ be even, $\ell, \epsilon > 0$. Set $k \triangleq \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil$. Set $m = 2n - 2(r_{RLL} + 2k + 4)$, and $N = n - 2k - 4$. Let $S_{\epsilon,n-2k-4}$ be the set of indices as defined by (2); we construct a one-to-one correspondence between the indices in $S_{\epsilon,n-2k-4}$ and sequences of $k$ quaternary symbols. Set $k' = \log n$.
INPUT: $x \in \{0,1\}^m$
OUTPUT: $\sigma \triangleq \mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{indel}})}(x) \in C(\ell, \epsilon; B^{\mathrm{indel}}) \subseteq \Sigma_4^{n + \log n + 4}$
(I) Use the $(\ell, \epsilon)$-constrained encoder to obtain $\sigma' = \mathrm{ENC}_{(\ell,\epsilon)}(x) \in \Sigma_4^n$, where $\sigma'$ is $\epsilon$-balanced and $\ell$-runlength limited
(II) Let $\alpha$ be the last symbol of $\sigma'$. Let $\beta$ be an arbitrary symbol in $\Sigma_4 \setminus \{\alpha, f(\alpha)\}$
(III) Let $a = \mathrm{Syn}(\pi(\sigma')) \pmod{n}$ and $b = \sum_{i=1}^{n} \sigma'_i \pmod{4}$
(IV) Let $\tau_1 \tau_2 \cdots \tau_{k'/2}$ be the quaternary representation of $a$
(V) Set $p = \beta f(\beta)\, \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_{k'/2} f(\tau_{k'/2})\, b f(b)$
(VI) Output $\sigma = \sigma' p$
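Steps (II) to (V) can be sketched in Python as below. The sketch takes the already constrained sequence $\sigma'$ as input and returns the suffix $p$ of length $k' + 4$. It assumes that $f$ swaps $0 \leftrightarrow 2$ and $1 \leftrightarrow 3$, and it picks the smallest admissible $\beta$ although any symbol outside $\{\alpha, f(\alpha)\}$ is allowed; the function names and the toy input are illustrative only.

```python
def flip(symbol):
    """Assumed flipping rule f: 0<->2, 1<->3."""
    return (symbol + 2) % 4


def signature(x):
    """Tenengolts signature: 1 where x_{i+1} >= x_i, else 0."""
    return [1 if x[i + 1] >= x[i] else 0 for i in range(len(x) - 1)]


def indel_suffix(sigma, k_prime):
    """Build p = beta f(beta) tau_1 f(tau_1) ... tau_{k'/2} f(tau_{k'/2}) b f(b)."""
    n = len(sigma)
    if n != 2 ** k_prime or k_prime % 2:
        raise ValueError("this sketch needs n = 2^k' with k' even")
    alpha = sigma[-1]
    beta = min(s for s in range(4) if s not in (alpha, flip(alpha)))        # (II)
    a = sum(i * bit for i, bit in enumerate(signature(sigma), start=1)) % n  # (III)
    b = sum(sigma) % 4
    half = k_prime // 2
    digits = [(a // 4 ** (half - 1 - i)) % 4 for i in range(half)]           # (IV)
    p = [beta, flip(beta)]                                                   # (V)
    for tau in digits:
        p += [tau, flip(tau)]
    return p + [b, flip(b)]


sigma = [0, 1, 2, 3] * 3 + [0, 2, 1, 2]   # toy sequence of length n = 16, so k' = 4
print(indel_suffix(sigma, 4))              # [1, 3, 0, 2, 2, 0, 3, 1]
```

For this toy input the signature syndrome is $82 \equiv 2 \pmod{16}$ and the symbol sum is $23 \equiv 3 \pmod 4$, which gives the quaternary digits 0, 2 and the final pair 3, 1.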
Theorem 34. The $(\ell, \epsilon; B^{\mathrm{indel}})$-error-control encoder is correct. In other words, $\mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{indel}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and capable of correcting a single indel for all $x \in \{0,1\}^m$.
Proof. Let $\sigma = \mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{indel}})}(x)$. It is easy to show that $\sigma$ is $\epsilon$-balanced and $\ell$-runlength limited (refer to the proof of Theorem 26). It remains to show that $\sigma$ can correct a single indel. To do so, we provide an efficient decoding algorithm. Suppose that there is a deletion (or insertion) in the received sequence $\sigma'$ (this can be observed based on the length of the received sequence). Without loss of generality, assume that the error is a deletion. The decoder proceeds as follows.
Localizing the deletion. Let $p'$ be the suffix of length $k' + 4$ of $\sigma'$. Assume that $p' = p'_1 p'_2 \cdots p'_{k'+4}$.
• If $p'_2 = f(p'_1)$ then we conclude that there is no deletion in $p$ and therefore, $p' \equiv p$.
• If $p'_2 \ne f(p'_1)$ then we conclude that there is a deletion in $p$.
Recovering $\sigma$.
• If there is no deletion in $p$, i.e. $p' \equiv p$, let $\sigma''$ be the sequence obtained by removing the suffix $p$ from $\sigma'$. Note that the syndrome $a$ and the sum of symbols $b$ of the original prefix are known from $p$. We then set $y = \mathrm{DEC}^{T}_{(a,b)}(\sigma'')$, and use the $(\ell, \epsilon)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\ell,\epsilon)}(y)$.
• If there is a deletion in $p$, we do not need to do error correction here: we remove the suffix of length $k' + 3$ from $\sigma'$ and then use the $(\ell, \epsilon)$-constrained decoder on the remaining prefix of length $n$ to obtain $x$.
In conclusion, $\mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{indel}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and can correct a single indel for all $x \in \{0,1\}^m$.
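The localization step of the proof can be illustrated with a short Python sketch. It only decides whether the deletion fell in the first $n$ symbols or in the appended suffix, and returns the corresponding prefix; the $q$-ary VT decoder $\mathrm{DEC}^{T}_{(a,b)}$ is not implemented here. The flipping rule $0 \leftrightarrow 2$, $1 \leftrightarrow 3$ is an assumption, and the function name and toy codeword are hypothetical.

```python
def flip(symbol):
    """Assumed flipping rule f: 0<->2, 1<->3."""
    return (symbol + 2) % 4


def localize_deletion(received, k_prime):
    """Classify a single deletion in a word of original length n + k' + 4.

    Returns ("prefix", damaged_prefix, intact_suffix) when the suffix p survived,
    or ("suffix", error_free_prefix, None) when the deletion hit p itself.
    """
    suffix = received[-(k_prime + 4):]
    if suffix[1] == flip(suffix[0]):
        # The marker pair beta f(beta) is intact, so p is intact: the first
        # n - 1 symbols carry the deletion and still need VT decoding.
        return "prefix", received[:-(k_prime + 4)], suffix
    # Otherwise the window starts one symbol early (at alpha), so p lost a symbol.
    return "suffix", received[:-(k_prime + 3)], None


# Toy codeword: constrained part of length 16 followed by a suffix p of length k' + 4 = 8.
sigma = [0, 1, 2, 3] * 3 + [0, 2, 1, 2]
p = [1, 3, 0, 2, 2, 0, 3, 1]
codeword = sigma + p

damaged = codeword[:16] + codeword[17:]        # delete beta, the first symbol of p
print(localize_deletion(damaged, 4)[0])        # suffix
damaged = codeword[:5] + codeword[6:]          # delete a symbol inside sigma
print(localize_deletion(damaged, 4)[0])        # prefix
```

The test works because $\beta \notin \{\alpha, f(\alpha)\}$: when the suffix loses a symbol, the window begins with $\alpha$ followed by either $\beta$ or $f(\beta)$, and neither equals $f(\alpha)$.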
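Corollary 35. Let $M = n + \log n + 4$. There exists a linear-time decoding algorithm $\mathrm{DEC}^{\mathrm{indel}} : \Sigma_4^{M} \to C(\ell, \epsilon; B^{\mathrm{indel}}) \subseteq \Sigma_4^{M}$ such that the following holds. If $\sigma = \mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{indel}})}(x)$ and $\sigma' \in B^{\mathrm{indel}}(\sigma)$, then $\mathrm{DEC}^{\mathrm{indel}}(\sigma') = \sigma$.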
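For completeness, we describe the corresponding $(\ell, \epsilon; B^{\mathrm{indel}})$-error-control decoder as follows.
$(\ell, \epsilon; B^{\mathrm{indel}})$-Error-Control Decoder.
INPUT: $\sigma' \in \Sigma_4^{(n + k' + 4)}$
OUTPUT: $x \triangleq \mathrm{DEC}_{(\ell,\epsilon;B^{\mathrm{indel}})}(\sigma') \in \{0,1\}^m$
(I) Let $\sigma = \mathrm{DEC}^{\mathrm{indel}}(\sigma') \in \Sigma_4^{n + k' + 4}$
(II) Use the $(\ell, \epsilon)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\ell,\epsilon)}(\sigma) \in \{0,1\}^m$
(III) Output $x$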
C. Construction of $(\ell, \epsilon; B^{\mathrm{edit}})$-Error-Control Codes
We follow the construction in Corollary 33 to encode DNA sequences that are capable of correcting a single edit. We simply append the information of the syndromes of $U_\sigma$ and $L_\sigma$ to the end of each codeword. In addition, we also use the idea of the Index Encoder (refer to Section IV-D) to ensure the redundant part is balanced and $\ell$-runlength limited. The extra redundancy is $2 \log n + 4$. For simplicity, assume that $k' = \log n$ is an integer and $k'$ is even.
$(\ell, \epsilon; B^{\mathrm{edit}})$-Error-Control Encoder. Let $n$ be even, $\ell, \epsilon > 0$. Set $k \triangleq \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil$. Set $m = 2n - 2(r_{RLL} + 2k + 4)$, and $N = n - 2k - 4$. Let $S_{\epsilon,n-2k-4}$ be the set of indices as defined by (2); we construct a one-to-one correspondence between the indices in $S_{\epsilon,n-2k-4}$ and sequences of $k$ quaternary symbols. Set $k' = \log n$.
INPUT: $x \in \{0,1\}^m$
OUTPUT: $\sigma \triangleq \mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{edit}})}(x) \in C(\ell, \epsilon; B^{\mathrm{edit}}) \subseteq \Sigma_4^{n + 2\log n + 4}$
(I) Use the $(\ell, \epsilon)$-constrained encoder to obtain $\sigma' = \mathrm{ENC}_{(\ell,\epsilon)}(x) \in \Sigma_4^n$, where $\sigma'$ is $\epsilon$-balanced and $\ell$-runlength limited
(II) Let $\alpha$ be the last symbol of $\sigma'$. Let $\beta$ be an arbitrary symbol in $\Sigma_4 \setminus \{\alpha, f(\alpha)\}$
(III) Let $a = \mathrm{Syn}(U_{\sigma'}) \pmod{n+1}$, $b = \mathrm{Syn}(L_{\sigma'}) \pmod{n+1}$, and $c = \sum_{i=1}^{n} \sigma'_i \pmod{4}$
(IV) Let $\tau_1 \tau_2 \cdots \tau_{k'/2}$ be the quaternary representation of $a$, and $\nu_1 \nu_2 \cdots \nu_{k'/2}$ be the quaternary representation of $b$
(V) Set $p = \beta f(\beta)\, \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_{k'/2} f(\tau_{k'/2})\, \nu_1 f(\nu_1) \nu_2 f(\nu_2) \cdots \nu_{k'/2} f(\nu_{k'/2})\, c f(c)$
(VI) Output $\sigma = \sigma' p$
Theorem 36. The $(\ell, \epsilon; B^{\mathrm{edit}})$-error-control encoder is correct. In other words, $\mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{edit}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and capable of correcting a single edit for all $x \in \{0,1\}^m$.
Proof. Let $\sigma = \mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{edit}})}(x)$. It is easy to show that $\sigma$ is $\epsilon$-balanced and $\ell$-runlength limited (refer to the proof of Theorem 26). It remains to show that $\sigma$ can correct a single edit. To do so, we provide an efficient decoding algorithm. Suppose the received sequence is $\sigma'$. The idea is to recover the first $n$ symbols in $\sigma$ and then use the $(\ell, \epsilon)$-constrained decoder to recover the information sequence $x$. First, the decoder decides whether a deletion, insertion or substitution has occurred. Note that this information can be recovered by simply observing the length of the received sequence. The decoding operates as follows.
(i) If the length of $\sigma'$ is exactly $n + 2 \log n + 4$, we conclude that at most a single substitution has occurred.
• Let $p'$ be the suffix of length $2 \log n + 4$ of $\sigma'$, and $p' = p'_1 p'_2 \cdots p'_{2k'+4}$.
• Let $\sigma''$ be the prefix of length $n$ of $\sigma'$. The decoder computes $\mathrm{Syn}(U_{\sigma''})$ and $\mathrm{Syn}(L_{\sigma''}) \pmod{n+1}$.
• Let $a'$ be the integer whose quaternary representation is $p'_3 p'_5 \cdots p'_{k'+1}$, $b'$ be the integer whose quaternary representation is $p'_{k'+3} p'_{k'+5} \cdots p'_{2k'+1}$, and $c' = p'_{2k'+3}$.
• If $c'$ is the sum of symbols in $\sigma''$ (mod 4), then there is no error in $\sigma''$. The decoder proceeds to obtain $x = \mathrm{DEC}_{(\ell,\epsilon)}(\sigma'')$.
• Otherwise, if $a' = \mathrm{Syn}(U_{\sigma''})$ and $b' = \mathrm{Syn}(L_{\sigma''})$, then there is no error in $\sigma''$, and the decoder proceeds to obtain $x = \mathrm{DEC}_{(\ell,\epsilon)}(\sigma'')$. On the other hand, if either equality fails, there is an error in $\sigma''$. The decoder sets $y = \mathrm{DEC}^{L}_{a'}(U_{\sigma''})$ and $z = \mathrm{DEC}^{L}_{b'}(L_{\sigma''})$. Finally, $\sigma = \Psi^{-1}(y \| z)$ and the decoder returns $x = \mathrm{DEC}_{(\ell,\epsilon)}(\sigma)$.
(ii) If the length of $\sigma'$ is exactly $n + 2 \log n + 3$, we conclude that a single deletion has occurred (the case of a single insertion can be handled similarly). The decoder proceeds as follows.
• Let $p'$ be the suffix of length $2 \log n + 4$ of $\sigma'$, and $p' = p'_1 p'_2 \cdots p'_{2k'+4}$.
• If $p'_2 \ne f(p'_1)$, the decoder concludes that there is a deletion in $p$. The decoder removes the suffix of length $2k' + 3$ from $\sigma'$, then uses the $(\ell, \epsilon)$-constrained decoder on the remaining prefix of length $n$ to obtain $x$.
• If $p'_2 = f(p'_1)$, the decoder concludes that there is no deletion in $p$ and therefore, $p' \equiv p$. Let $\sigma''$ be the sequence obtained by removing the suffix $p$ from $\sigma'$. Note that the syndromes $a$ and $b$ of the upper and lower sequences of the original prefix are known from $p$. The decoder sets $y = \mathrm{DEC}^{L}_{a}(U_{\sigma''})$ and $z = \mathrm{DEC}^{L}_{b}(L_{\sigma''})$. Finally, $\sigma = \Psi^{-1}(y \| z)$ and the decoder returns $x = \mathrm{DEC}_{(\ell,\epsilon)}(\sigma)$.
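In conclusion, $\mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{edit}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and can correct a single edit for all $x \in \{0,1\}^m$.
Corollary 37. Let $M = n + 2 \log n + 4$. There exists a linear-time decoding algorithm $\mathrm{DEC}^{\mathrm{edit}} : \Sigma_4^{M} \to C(\ell, \epsilon; B^{\mathrm{edit}}) \subseteq \Sigma_4^{M}$ such that the following holds. If $\sigma = \mathrm{ENC}_{(\ell,\epsilon;B^{\mathrm{edit}})}(x)$ and $\sigma' \in B^{\mathrm{edit}}(\sigma)$, then $\mathrm{DEC}^{\mathrm{edit}}(\sigma') = \sigma$.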
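For completeness, we describe the corresponding $(\ell, \epsilon; B^{\mathrm{edit}})$-error-control decoder as follows.
$(\ell, \epsilon; B^{\mathrm{edit}})$-Error-Control Decoder.
INPUT: $\sigma' \in \Sigma_4^{(n + 2\log n + 4)}$
OUTPUT: $x \triangleq \mathrm{DEC}_{(\ell,\epsilon;B^{\mathrm{edit}})}(\sigma') \in \{0,1\}^m$
(I) Let $\sigma = \mathrm{DEC}^{\mathrm{edit}}(\sigma') \in \Sigma_4^{n + 2\log n + 4}$
(II) Use the $(\ell, \epsilon)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\ell,\epsilon)}(\sigma) \in \{0,1\}^m$
(III) Output $x$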
Remark 38. We use $r_{error}$ to denote the redundancy needed to correct a single indel or edit error. When $B = B^{\mathrm{indel}}$, $r_{error} = \log n + 4$, and when $B = B^{\mathrm{edit}}$, $r_{error} = 2 \log n + 4$. Since $\frac{\log n}{n} \to 0$ and $r_{GC} = O(1)$ is a constant, the rate of this encoder approaches the rate of the RLL Encoder, and if we use RLL Encoder A then the rate of the $(\ell, \epsilon; B)$-error-control encoder approaches the capacity for sufficiently large $n$.
VII. CONCLUSION
We have presented novel and efficient encoders that translate binary data into strands of nucleotides which satisfy the RLL constraint, the GC-content constraint, and are capable of correcting a single edit and its variants. Our proposed codes achieve higher rates than previous results and approach capacity, have low encoding/decoding complexity and limited error propagation.
REFERENCES
[1] S. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage”, Scientific Reports, no. 5011, vol. 7, 2017.
[2] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628-1628, 2012.
[3] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, no. 7435, pp. 77-80, 2013.
[4] Y. Erlich and D. Zielinski, “DNA fountain enables a robust and efficient storage architecture,” Science, vol. 355, no. 6328, pp. 950-954, 2017.
[5] L. Organick, S. Ang, Y. J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman,
H. Y. Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss, “Random access in large-scale
DNA data storage”, Nature Biotechnology, vol. 36, no. 3, 242–248, 2018.
[6] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, “Characterizing and measuring bias in sequence
data”, Genome Biology, vol. 14, 2013.
[7] R. Heckel, G. Mikutis, and R. N. Grass, “A Characterization of the DNA Data Storage Channel”, Scientific Reports, Jul. 2019.
[8] K. A. S. Immink, and K. Cai, “Design of Capacity-Approaching Constrained Codes for DNA-Based Data Storage Systems,” IEEE Communications Letters,
vol. 22, no. 2, pp. 224-227, 2018.
[9] K. Cai, Y. M. Chee, R. Gabrys, H. M. Kiah, and T. T. Nguyen, “Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage”, preprint,
arXiv, arXiv:1910.06501, 2019.
[10] R. Gabrys, E. Yaakobi, and O. Milenkovic, “Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction”, IEEE Trans. Inform.
Theory, Vol. 64, No. 4, 2018.
[11] W. Song, K. Cai, M. Zhang, and C. Yuen, “Codes with Run-Length and GC-Content Constraints for DNA-based Data Storage,” IEEE Communications Letters, vol. 22, no. 10, pp. 2004-2007, Oct. 2018.
[12] D. Dube, W. Song, and K. Cai, “DNA Codes with Run-Length Limitation and Knuth-Like Balancing of the GC Contents”, Symposium on Information
Theory and its Applications (SITA), Japan, Nov. 2019.
[13] P. Yakovchuk, E. Protozanova, and M. D. Frank-Kamenetskii, “Base-stacking and base-pairing contributions into thermal stability of the DNA double
helix”, Nucl. Acids Res., vol. 34, no. 2, pp. 564-574, 2006.
[14] D. E. Knuth, “Efficient Balanced Codes”, IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan 1986.
[15] A. J. de Lind van Wijngaarden and K. A. S. Immink, “Construction of Maximum Run-Length Limited Codes Using Sequence Replacement Techniques,” IEEE Journal on Selected Areas in Communications, vol. 28, pp. 200-207, 2010.
[16] O. Elishco, R. Gabrys, M. Medard, and E. Yaakobi, “Repeated-Free Codes”, Proc. IEEE Int. Symp. Inf. Theory (ISIT), Paris, France, 2019.
[17] C. Schoeny, A. Wachter-Zeh, R. Gabrys, and E. Yaakobi, “Codes correcting a burst of deletions or insertions”, IEEE Trans. Inform. Theory, vol. 63, no. 4, pp. 1971-1985, 2017.
[18] J. P. M. Schalkwijk, “An algorithm for source coding,” IEEE Trans. Inf. Theory, IT-18, pp. 395-399, 1972.
[19] N. Alon, E. E. Bergmann, D. Coppersmith, and A. M. Odlyzko, “Balancing sets of vectors”, IEEE Trans. Inf. Theory, vol. IT-34, no. 1, pp. 128-130, Jan.
1988.
[20] V. Skachek and K. A. S. Immink, “Constant Weight Codes: An Approach Based on Knuth’s Balancing Method”, IEEE Journal on Selected Areas in
Communications, vol. 32, No. 5, May 2014.
[21] L. G. Tallini, R. M. Capocelli, and B. Bose, “Design of some new balanced codes,” IEEE Trans. Inf. Theory, vol. IT-42, pp. 790-802, May 1996.
[22] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals”, Doklady Akademii Nauk SSSR, vol. 163, no. 4, pp. 845-848,
1965.
[23] G. Tenengolts, “Nonbinary codes, correcting single deletion or insertion”, IEEE Trans. Inf. Theory, vol. 30, no. 5, pp. 766-769, 1984.