All content in this area was uploaded by Kees Schouhamer Immink on Aug 16, 2021

Capacity-Approaching Constrained Codes with Error Correction for DNA-Based Data Storage

Tuan Thanh Nguyen, Kui Cai, Kees A. Schouhamer Immink, and Han Mao Kiah

Abstract

We propose coding techniques that limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given ℓ, ε > 0, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties:

• Runlength constraint: the maximum homopolymer run in each codeword is at most ℓ,

• GC-content constraint: the GC-content of each codeword is within [0.5 − ε, 0.5 + ε],

• Error-correction: each codeword is capable of correcting a single deletion, or single insertion, or single substitution error.

For practical values of ℓ and ε, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation.

I. INTRODUCTION

In a DNA-based storage system, the input user data is translated into a large number of DNA strands (also known as DNA

sequences or oligos), which are synthesized and stored in a DNA pool. To retrieve the original data, the stored DNA strands are

sequenced and translated inversely back to the binary data. Several experiments have been conducted since 2012 (see [1]–[7]),

and it has been found that substitutions, deletions, and insertions are common errors occurring at the stages of synthesis and

sequencing. To improve the reliability of DNA storage, several channel coding techniques, including constrained coding and

error correction coding, have been introduced [8]–[12].

In a DNA strand, two properties that significantly increase the chance of errors for most synthesis and sequencing technologies are long homopolymer runs [6], [7] and high (or low) GC-content. A homopolymer run refers to the repetition of the same nucleotide. Ross et al. [6] reported that a homopolymer run of length more than six would result in a significant increase of substitution and deletion errors (see [6, Fig. 5]), and therefore, such long runs should be avoided. On the other hand, the GC-content of a DNA strand refers to the percentage of nucleotides that are either G or C, and DNA strands whose GC-content is too high or too low are more prone to both synthesis and sequencing errors (see, for example, [6], [13]). Therefore, most experiments used DNA strands whose GC-content is close to 50% (for example, between 40% and 60% [7], or between 45% and 55% [4]).

Designing efficient constrained codes to translate binary data into DNA strands that satisfy the homopolymer runlength constraint (also known as the runlength limited constraint, or RLL constraint in short) and the GC-content constraint has been a challenge. In the literature, several coding techniques have been introduced, mostly focusing on one specific value of the maximum runlength or requiring the GC-content to be exactly 50%, also known as the GC-balanced constraint [8], [9], [11], [12]. To encode GC-balanced codewords, most works used a modification of Knuth's balancing method for binary sequences [14]. Since the constraint is strong, the coding redundancy is large (approximately log n, where n is the length of each codeword). In this work, we investigate the problem of translating binary data to DNA strands whose GC-content is close to 50%, and we refer to this as almost-balanced. Via a simple modification of Knuth's method, we show that the number of redundant bits can be gracefully reduced from log n to O(1).

Constrained codes can reduce the occurrence of substitution, deletion, and insertion errors in the DNA storage system. However, the constrained code itself cannot correct errors. There are recent works that characterize the error probabilities by analyzing data from experiments and then demonstrate the need for error-correction codes. For example, Organick et al. recently stored 200MB of data in 13 million DNA strands and reported substitution, deletion, and insertion rates to be 4.5 × 10^{−3}, 1.5 × 10^{−3} and 5.4 × 10^{−4}, respectively [5]. Since current technologies can only synthesize strands of DNA of one to two hundred nucleotides, it is most likely that there is at most one error of each type. Motivated by this error behavior, several works focused on the construction of error-correction codes that are capable of correcting a single edit (i.e. a single substitution, or a single deletion, or a single insertion) and its variants [9], [10]. However, the problem of combining constrained codes that satisfy both the homopolymer runlength and GC-content constraints with single-edit-correction codes has not been addressed.

In this work, we propose novel channel coding techniques for DNA storage, where the codebooks satisfy the RLL constraint, the GC-content constraint, and can also correct a single edit and its variants. During the decoding of the proposed constrained codes, a small number of corrupted bits at the channel output might lead to massive error propagation of the decoded bits. Our proposed combination of constrained codes with error-correction codes also helps to minimize the error propagation during decoding.

Tuan Thanh Nguyen and Kui Cai are with the Singapore University of Technology and Design, Singapore 487372 (email: {tuanthanh nguyen, cai kui}@sutd.edu.sg).

Kees A. Schouhamer Immink is with the Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands (email: immink@turing-machines.com).

Han Mao Kiah is with the School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371 (email: hmkiah@ntu.edu.sg).

arXiv:2001.02839v1 [cs.IT] 9 Jan 2020

| Notation | Description |
|---|---|
| $\Sigma$ | alphabet of size $q$ |
| $\Sigma_4$ | quaternary alphabet, i.e. $q = 4$, $\Sigma_4 = \{0, 1, 2, 3\}$ |
| $D$ | DNA alphabet, $D = \{A, T, C, G\}$ |
| $xy$ | the concatenation of two sequences |
| $x \Vert y$ | the interleaved sequence |
| $\sigma$, $U_\sigma$, $L_\sigma$ | a DNA sequence $\sigma$, the upper sequence of $\sigma$, and the lower sequence of $\sigma$ |
| $\Psi$ | the one-to-one map that converts a DNA sequence to a binary sequence |
| $\mathrm{Syn}(x)$ | the syndrome of a sequence $x$ |
| indel | single insertion or single deletion |
| edit | single insertion, or single deletion, or single substitution |
| $B^{\mathrm{indel}}(x)$ | the set of words that can be obtained from $x$ via at most a single indel |
| $B^{\mathrm{edit}}(x)$ | the set of words that can be obtained from $x$ via at most a single edit |

| Encoder / Decoder | Description | Redundancy | Remark |
|---|---|---|---|
| $\mathrm{ENC}^A_{\mathrm{RLL}}$, $\mathrm{DEC}^A_{\mathrm{RLL}}$ | RLL encoder and decoder for $\ell$-runlength limited codes using enumeration technique | $r_A = n - \lfloor \log_4 \lvert C(n, \ell, q) \rvert \rfloor$ (symbols) | Section III-A |
| $\mathrm{ENC}^B_{\mathrm{RLL}}$, $\mathrm{DEC}^B_{\mathrm{RLL}}$ | RLL encoder and decoder for $\ell$-runlength limited codes using sequence replacement technique | $r_B = 1$ (symbol) if $n \le (q-1)q^{\ell-1} + \ell - 1$, or $\lceil n / ((q-2)q^{\ell-1} + \ell) \rceil$ symbols, otherwise | Section III-B |
| $\mathrm{ENC}^C_{\mathrm{GC}}$, $\mathrm{DEC}^C_{\mathrm{GC}}$ | GC encoder and decoder for $\epsilon$-balanced quaternary codes using binary template | $r_C = \lceil \log_2(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ (bits) | Section IV-C |
| $\mathrm{ENC}^D_{\mathrm{GC}}$, $\mathrm{DEC}^D_{\mathrm{GC}}$ | GC encoder and decoder for $\epsilon$-balanced quaternary codes using Knuth's technique | $r_D = 2\lceil \log_4(\lfloor 1/2\epsilon \rfloor + 1) \rceil$ (symbols) | Section IV-D |
| $\mathrm{ENC}_{(\epsilon,\ell)}$, $\mathrm{DEC}_{(\epsilon,\ell)}$ | constrained encoder/decoder for $\epsilon$-balanced and $\ell$-runlength limited codes | $r_A + r_D + 4$ (symbols) or $r_B + r_D + 4$ (symbols) | Section V |
| $\mathrm{ENC}_{(\epsilon,\ell;B^{\mathrm{indel}})}$, $\mathrm{DEC}_{(\epsilon,\ell;B^{\mathrm{indel}})}$ | error-control encoder/decoder for $\epsilon$-balanced and $\ell$-runlength limited codes that can correct an indel | $r_A + r_D + \log_2 n + \Theta(1)$ (symbols) or $r_B + r_D + \log_2 n + \Theta(1)$ (symbols) | Section VI-B |
| $\mathrm{ENC}_{(\epsilon,\ell;B^{\mathrm{edit}})}$, $\mathrm{DEC}_{(\epsilon,\ell;B^{\mathrm{edit}})}$ | error-control encoder/decoder for $\epsilon$-balanced and $\ell$-runlength limited codes that can correct an edit | $r_A + r_D + 2\log_2 n + \Theta(1)$ (symbols) or $r_B + r_D + 2\log_2 n + \Theta(1)$ (symbols) | Section VI-C |

TABLE I: Notation and Results Summary. The redundancy is computed for DNA codewords of length n, given ℓ, ε > 0.

The paper is organized as follows. We first go through certain notations in Section II. In Section III, we present two efficient RLL coding methods that limit the maximum homopolymer run in each codeword to be at most ℓ for arbitrary ℓ > 0. Our methods are based on enumeration coding and the sequence replacement technique, respectively. In Section IV, via a simple modification of Knuth's balancing method, we describe linear-time encoders/decoders that translate binary data to DNA strands whose GC-content is within [0.5 − ε, 0.5 + ε] for arbitrary ε > 0. This method yields a significant improvement in coding redundancy with respect to prior works. Then, in Section V, we present an efficient (ε, ℓ)-constrained coding method where codewords obey both the RLL constraint and the GC-content constraint. In Section VI, we modify the (ε, ℓ)-constrained coding so that the codewords can correct a single deletion, or single insertion, or single substitution error.

For the convenience of the reader, relevant notation and terminology referred to throughout the paper is summarized in Table I.

II. NOTATION

Let Σ_q = {0, 1, 2, . . . , q − 1} denote an alphabet of size q ≥ 2. Particularly, when q = 4, we use the following relation Φ between the decimal alphabet Σ_4 = {0, 1, 2, 3} and the nucleotides D = {A, T, C, G}: Φ : 0 → A, 1 → T, 2 → C, and 3 → G. Given two sequences x and y, we let xy denote the concatenation of the two sequences. In the special case where x, y ∈ Σ_q^n, we use x||y to denote their interleaved sequence x_1 y_1 x_2 y_2 . . . x_n y_n.

Let σ = σ_1 σ_2 . . . σ_n ∈ Σ_4^n denote a 4-ary strand of n nucleotides. The GC-content or weight of strand σ, denoted by ω(σ), is defined by

$$\omega(\sigma) = \frac{1}{n} \sum_{i=1}^{n} \varphi(\sigma_i),$$

where ϕ(σ_i) = 0 if σ_i ∈ {0, 1} and ϕ(σ_i) = 1 if σ_i ∈ {2, 3}. Given ε > 0, we say that σ is ε-balanced if |ω(σ) − 0.5| ≤ ε, in other words, ω(σ) ∈ [0.5 − ε, 0.5 + ε]. In particular, when n is even and ε = 0, we say σ is GC-balanced. Over the binary alphabet, a vector x ∈ {0, 1}^n is called balanced if the number of ones in x, or the weight wt(x), is n/2.

On the other hand, given ℓ > 0, we say that σ is ℓ-runlength limited if any run of the same nucleotide has length at most ℓ. For DNA-based storage, we are interested in codewords that are ε-balanced and ℓ-runlength limited for sufficiently small ε = o(1), ℓ = o(n).
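The two constraints defined above can be checked mechanically. The following is a minimal Python sketch, assuming strands are given as lists over {0, 1, 2, 3}; the helper names (`gc_content`, `is_eps_balanced`, `max_run`, `is_runlength_limited`) are illustrative and do not come from the paper. Exact rational arithmetic is used so that the boundary case |ω(σ) − 0.5| = ε is not affected by floating-point rounding.

```python
from fractions import Fraction

PHI = "ATCG"  # the relation 0 -> A, 1 -> T, 2 -> C, 3 -> G


def to_dna(sigma):
    """Map a sequence over {0,1,2,3} to its nucleotide string."""
    return "".join(PHI[s] for s in sigma)


def gc_content(sigma):
    """Exact GC-content: fraction of symbols equal to 2 (C) or 3 (G)."""
    return Fraction(sum(1 for s in sigma if s in (2, 3)), len(sigma))


def is_eps_balanced(sigma, eps):
    """True if |gc_content - 1/2| <= eps (eps may be a float, str or Fraction)."""
    return abs(gc_content(sigma) - Fraction(1, 2)) <= Fraction(str(eps))


def max_run(sigma):
    """Length of the longest run of identical symbols (sigma assumed non-empty)."""
    longest = current = 1
    for prev, cur in zip(sigma, sigma[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def is_runlength_limited(sigma, ell):
    """True if every homopolymer run has length at most ell."""
    return max_run(sigma) <= ell
```

For instance, the strand ACCGTTTAGC (that is, 0223111032) has GC-content exactly 1/2 and a longest run of three T's, so it is ε-balanced for every ε ≥ 0 and ℓ-runlength limited exactly when ℓ ≥ 3.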

Definition 1. A nucleotide encoder ENC : {0, 1}^m → Σ_4^n is an (ε, ℓ)-constrained encoder if ENC(x) is ε-balanced and ℓ-runlength limited for all x ∈ {0, 1}^m.

Motivated by the error behavior in DNA storage, we investigate constrained codes that also have error-correction capability. Such codes are referred to as error-control codes. We use B to denote the error ball function. For a sequence x ∈ Σ_4^n, let B^D(x), B^I(x), and B^S(x) denote the set of all words obtained from x via a single deletion, a single insertion, or at most one substitution, respectively, and set

$$B^{\mathrm{indel}}(x) \triangleq B^{I}(x) \cup B^{D}(x), \qquad B^{\mathrm{edit}}(x) \triangleq B^{S}(x) \cup B^{I}(x) \cup B^{D}(x).$$

Observe that when σ ∈ Σ_4^n, both B^indel(σ) and B^edit(σ) are subsets of Σ_4^{n−1} ∪ Σ_4^n ∪ Σ_4^{n+1}. Hence, for convenience, we use Σ_4^{n∗} to denote the set Σ_4^{n−1} ∪ Σ_4^n ∪ Σ_4^{n+1}.

Definition 2. Let C ⊆ Σ_4^n. Given ε, ℓ > 0 and the error ball function B, we say that C is an (ε, ℓ; B)-error-control code if

(i) For all c ∈ C, c is ε-balanced,

(ii) For all c ∈ C, c is ℓ-runlength limited, and

(iii) B(c) ∩ B(c′) = ∅ for all distinct c, c′ ∈ C.

For a code C ⊆ Σ_q^n, the rate of C, denoted by rate_C, is defined by rate_C ≜ (1/n) log_q |C|. The asymptotic rate of the family of codes {C(n, N; q)}_{n=1}^{∞} is defined by lim_{n→∞} (1/n) log_q |C|, if the limit exists.

Definition 3. A nucleotide encoder ENC : {0, 1}^m → Σ_4^n is an (ε, ℓ; B)-error-control encoder if ENC(x) is ε-balanced and ℓ-runlength limited for all x ∈ {0, 1}^m, and furthermore there exists a decoder map DEC : Σ_4^{n∗} → {0, 1}^m such that the following hold.

(i) For all x ∈ {0, 1}^m, we have DEC ◦ ENC(x) = x.

(ii) If c = ENC(x) and c′ ∈ B(c), then DEC(c′) = x.

Hence, the code is C = {c : c = ENC(x), x ∈ {0, 1}^m} and |C| = 2^m. The redundancy of the encoder is measured by the value 2n − m (in bits) or n − m/2 (nucleotide symbols).

III. EFFICIENT HOMOPOLYMER RUNLENGTH LIMITED CODES

We present two methods of constructing maximum runlength limited q-ary constrained codes. Method A uses an enumerative coding technique to rank/unrank all codewords. While the technique is standard in the constrained coding and combinatorics literature, our contribution is a detailed analysis of the space and time complexities of the respective algorithm. The encoder achieves the maximum code rate; for example, when ℓ = 3, n = 200, q = 4, the rate of the encoder is 1.98 bits/nt. However, the time and space complexity is O(n^2), which makes it less attractive than the sequence replacement technique in Method B.

A. Method A Based on Enumeration Coding

Let C(n, ℓ, q) denote the set of all q-ary ℓ-runlength limited sequences of length n. We first obtain a recursive formula for the size of C(n, ℓ, q). This recursive formula is useful in the development of the ranking/unranking methods. To this end, we partition C(n, ℓ, q) into ℓ classes and provide bijections from q-ary ℓ-runlength limited sequences of shorter lengths into them.

For 1 ≤ i ≤ ℓ, let C_i(n, ℓ, q) denote the set of all q-ary ℓ-runlength limited sequences of length n whose suffix is the repetition of a symbol in Σ_q for exactly i times. Clearly, we have C_i(n, ℓ, q) ∩ C_j(n, ℓ, q) = ∅ for i ≠ j and

$$C(n,\ell,q) = \bigcup_{i=1}^{\ell} C_i(n,\ell,q).$$

Let [n] denote the set {1, 2, . . . , n}. Consider ℓ maps φ_1, φ_2, . . . , φ_ℓ where

φ_i : C(n − i, ℓ, q) × [q − 1] → C_i(n, ℓ, q), for 1 ≤ i ≤ ℓ.

If x = x_1 x_2 . . . x_{n−i} ∈ C(n − i, ℓ, q) and j ∈ [q − 1], set a to be the jth element in Σ_q \ {x_{n−i}}. Then set φ_i(x, j) = x_1 x_2 . . . x_{n−i} a^i. Here, a^i denotes the repetition of symbol a for i times.

Theorem 4. For 1 ≤ i ≤ ℓ, the map φ_i is a bijection. We then have the following recursion. For 1 ≤ n ≤ ℓ, |C(n, ℓ, q)| = q^n, and for n > ℓ,

$$|C(n,\ell,q)| = \sum_{i=1}^{\ell} (q-1)\,|C(n-i,\ell,q)|.$$

Therefore, the asymptotic rate of C(n, ℓ, q) is log_q λ, where λ is the largest real root of the equation $x^{\ell} - \sum_{i=0}^{\ell-1} (q-1)\,x^{i} = 0$.

Proof. We can prove that φ_i is a bijection for 1 ≤ i ≤ ℓ by constructing the inverse map φ_i^{−1}. Specifically, we set φ_i^{−1} : C_i(n, ℓ, q) → C(n − i, ℓ, q) × [q − 1] such that for x = x_1 x_2 . . . x_n ∈ C_i(n, ℓ, q), φ_i^{−1}(x) = (x_1 . . . x_{n−i}, j), where j is the index of x_n in Σ_q \ {x_{n−i}}. It can be verified that φ_i ◦ φ_i^{−1} and φ_i^{−1} ◦ φ_i are identity maps on their respective domains. Since C(n, ℓ, q) is the disjoint union of the classes C_i(n, ℓ, q), 1 ≤ i ≤ ℓ, we then have for n > ℓ

$$|C(n,\ell,q)| = \sum_{i=1}^{\ell} (q-1)\,|C(n-i,\ell,q)|.$$
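The recursion lends itself to a short numerical check. The sketch below, in Python, tabulates |C(n, ℓ, q)| and estimates log_q λ from the ratio of two consecutive sizes; the function names `rll_sizes` and `rate_estimate` are illustrative assumptions, and the ratio is only a numerical approximation of the largest root λ.

```python
import math


def rll_sizes(n_max, ell, q):
    """Return a list with sizes[n] = |C(n, ell, q)| for 1 <= n <= n_max."""
    sizes = [1]  # sizes[0] is a placeholder; the recursion never reads it
    for n in range(1, n_max + 1):
        if n <= ell:
            # Every q-ary word of length n <= ell is ell-runlength limited.
            sizes.append(q ** n)
        else:
            # Partition by the length i of the final run (1 <= i <= ell).
            sizes.append((q - 1) * sum(sizes[n - i] for i in range(1, ell + 1)))
    return sizes


def rate_estimate(n, ell, q):
    """Approximate log_q(lambda) by log_q(|C(n)| / |C(n-1)|) for large n."""
    sizes = rll_sizes(n, ell, q)
    return math.log(sizes[n] / sizes[n - 1], q)
```

For q = 4 and ℓ = 3, `rll_sizes(5, 3, 4)[1:]` gives 4, 16, 64, 252, 996, and `rate_estimate(200, 3, 4)` is slightly above 0.99 symbols per symbol, in line with the rate of about 1.98 bits/nt quoted for Method A.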

We now construct the RLL-Encoder A by providing a method of ranking/unranking all codewords in C(n, ℓ, q). A ranking function for a finite set S of cardinality N is a bijection rank : S → [N]. Associated with the function rank is a unique unranking function unrank : [N] → S, such that rank(s) = j if and only if unrank(j) = s for all s ∈ S and j ∈ [N].

The basis of our ranking and unranking algorithms is the bijections {φ_i}_{i=1}^{ℓ} defined earlier. As implied by the codomains of these maps, for n > ℓ, we order the words in C(n, ℓ, q) such that words in C_i(n, ℓ, q) are ordered before words in C_j(n, ℓ, q) for i < j. For words in C(n, ℓ, q) where n ≤ ℓ, we simply order them lexicographically. We illustrate the idea behind the unranking algorithm through an example.

Example 5. Let n = 5, q = 4, ℓ = 3, with the four symbols written as 1, 2, 3, 4 in this example. For n > 3 we have |C(n, 3, 4)| = 3|C(n − 1, 3, 4)| + 3|C(n − 2, 3, 4)| + 3|C(n − 3, 3, 4)|, and the values of |C(m, 3, 4)| are as follows.

| m | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $\lvert C(m, 3, 4) \rvert$ | 4 | 16 | 64 | 252 | 996 |

Suppose we want to compute the 900th codeword c ∈ C(5, 3, 4), in other words, unrank(900). We have

C(5, 3, 4) = C_1(5, 3, 4) ∪ C_2(5, 3, 4) ∪ C_3(5, 3, 4) = φ_1(C(4, 3, 4) × [3]) ∪ φ_2(C(3, 3, 4) × [3]) ∪ φ_3(C(2, 3, 4) × [3]).

Since 900 > 3|C(4, 3, 4)| = 756 and 900 < 3|C(4, 3, 4)| + 3|C(3, 3, 4)| = 948, the 900th codeword of C(5, 3, 4) is the (900 − 756) = 144th codeword in C_2(5, 3, 4), and hence lies in the image of the map φ_2. Since 144 = 3 × 48 + 0, the construction of φ_2 tells us that the 144th codeword in C_2(5, 3, 4) is the image of the 48th codeword x ∈ C(3, 3, 4) under φ_2, paired with the last index j = 3. The 48th word of C(3, 3, 4) is 344. Hence, c = φ_2(x, 3). This gives

unrank(900) = φ_2(344, 3) = 34433.

The formal unranking/ranking algorithms are described in Algorithm 1 and Algorithm 2.

Algorithm 1 unrank(n, ℓ, q, M)

Input: Integers n ≥ 1, ℓ ≥ 1, q ≥ 2, 1 ≤ M ≤ |C(n, ℓ, q)|

Output: c, where c is the codeword of rank M in C(n, ℓ, q)

if n ≤ ℓ then
    return the Mth codeword in C(n, ℓ, q)

Search for the first index 1 ≤ j ≤ ℓ such that M ≤ Σ_{i=1}^{j} (q − 1)|C(n − i, ℓ, q)|

M′ ← M − Σ_{i=1}^{j−1} (q − 1)|C(n − i, ℓ, q)|

M″ ← ⌈M′/(q − 1)⌉

k ← ((M′ − 1) mod (q − 1)) + 1

return φ_j(unrank(n − j, ℓ, q, M″), k)

Algorithm 2 rank(n, ℓ, q, c)

Input: n ≥ 1, ℓ ≥ 1, q ≥ 2 and codeword c = c_1 c_2 . . . c_n

Output: M, where 1 ≤ M ≤ |C(n, ℓ, q)|, the rank of c in C(n, ℓ, q)

if n ≤ ℓ then
    return the rank of c in C(n, ℓ, q)

if the suffix of c is the repetition of symbol a for exactly i times then
    c′ ← c_1 c_2 . . . c_{n−i}
    k ← the index of a in Σ_q \ {c_{n−i}}
    return (rank(n − i, ℓ, q, c′) − 1)(q − 1) + k + Σ_{j=1}^{i−1} (q − 1)|C(n − j, ℓ, q)|

Example 6. Let n = 5, ℓ = 3 and q = 4 as before. Suppose we want to compute rank(34433). Since 34433 ∈ C_2(5, 3, 4), we have that 34433 is obtained from applying φ_2 to 344 ∈ C(3, 3, 4). The appended symbol is 3, which is the third element in Σ_4 \ {4}. Therefore,

rank(34433) = 3|C(4, 3, 4)| + 3(rank(344) − 1) + 3
= 3 × 252 + 3 × 47 + 3
= 900.
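The ranking and unranking steps of Examples 5 and 6 can be reproduced with a small recursive implementation. The following Python sketch is one possible realization, not the authors' code: the function names, the explicit `alphabet` argument (taken here as 1, 2, 3, 4 to match the examples), and the choice k = ((M′ − 1) mod (q − 1)) + 1 for the appended-symbol index are assumptions made so that rank and unrank invert each other.

```python
def rll_sizes(n_max, ell, q):
    """sizes[n] = |C(n, ell, q)| for 1 <= n <= n_max (sizes[0] is unused)."""
    sizes = [1]
    for n in range(1, n_max + 1):
        if n <= ell:
            sizes.append(q ** n)
        else:
            sizes.append((q - 1) * sum(sizes[n - i] for i in range(1, ell + 1)))
    return sizes


def unrank(n, ell, alphabet, M, sizes):
    """Return the word of rank M (1-based) in C(n, ell, q) as a list."""
    q = len(alphabet)
    if n <= ell:
        # Lexicographic order: base-q digits of M - 1.
        digits, r = [], M - 1
        for _ in range(n):
            r, d = divmod(r, q)
            digits.append(alphabet[d])
        return digits[::-1]
    offset = 0
    for j in range(1, ell + 1):          # find the class C_j containing rank M
        block = (q - 1) * sizes[n - j]
        if M <= offset + block:
            break
        offset += block
    m_in_class = M - offset              # position inside C_j, 1-based
    prefix_rank = (m_in_class - 1) // (q - 1) + 1
    k = (m_in_class - 1) % (q - 1) + 1   # which symbol to append, 1..q-1
    prefix = unrank(n - j, ell, alphabet, prefix_rank, sizes)
    others = [s for s in alphabet if s != prefix[-1]]
    return prefix + [others[k - 1]] * j


def rank(c, ell, alphabet, sizes):
    """Return the rank (1-based) of an ell-runlength limited word c."""
    q, n = len(alphabet), len(c)
    if n <= ell:
        r = 0
        for s in c:
            r = r * q + alphabet.index(s)
        return r + 1
    i = 1                                # length of the final run
    while i < n and c[n - 1 - i] == c[-1]:
        i += 1
    prefix = c[:n - i]
    others = [s for s in alphabet if s != prefix[-1]]
    k = others.index(c[-1]) + 1
    offset = sum((q - 1) * sizes[n - j] for j in range(1, i))
    return (rank(prefix, ell, alphabet, sizes) - 1) * (q - 1) + k + offset
```

With `sizes = rll_sizes(5, 3, 4)` and `alphabet = [1, 2, 3, 4]`, `unrank(5, 3, alphabet, 900, sizes)` returns the word 34433 and `rank` maps it back to 900. The table of sizes is passed in explicitly to mirror the precomputation discussed next.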

The set of values {|C(m, ℓ, q)| : m ≤ n} required in Algorithms 1 and 2 can be precomputed based on the recurrence in Theorem 4. Since the size of C(n, ℓ, q) grows exponentially in n, each value requires O(n) bits and these n stored values require O(n^2) space.

Next, Algorithms 1 and 2 involve O(n) iterations and each iteration involves a constant number of arithmetic operations. Therefore, Algorithms 1 and 2 involve O(n) arithmetic operations and have time complexity O(n^2). For completeness, we summarize the RLL-Encoder A and RLL-Decoder A as follows.

RLL-Encoder A. Set m = ⌊log_2 |C(n, ℓ, q)|⌋.

INPUT: x ∈ {0, 1}^m

OUTPUT: c ≜ ENC^A_RLL(x) ∈ C(n, ℓ, q)

(I) Let M be the positive integer whose binary representation of length m is x.

(II) Use Algorithm 1, set c = unrank(n, ℓ, q, M).

(III) Output c.

RLL-Decoder A. Set m = ⌊log_2 |C(n, ℓ, q)|⌋.

INPUT: c ∈ C(n, ℓ, q)

OUTPUT: x ≜ DEC^A_RLL(c) ∈ {0, 1}^m

(I) Use Algorithm 2, set M = rank(n, ℓ, q, c).

(II) Let x be the binary representation of length m of M.

(III) Output x.

B. Method B Based on Sequence Replacement Technique

The sequence replacement technique has been widely used in the literature [8], [15]–[17]. This is an efficient method for removing forbidden substrings from a source word. In general, the encoder removes the forbidden strings and subsequently inserts their representation (which also includes the position of the substring) at predefined positions in the sequence. For example, Schoeny et al. [17] used only one redundant bit to encode RLL binary sequences with ℓ ≥ ⌈log n⌉ + 3. However, for DNA data storage, with n ∈ [100, 200], it is normally required that ℓ ≤ 6. Recently, Immink et al. [8] described a simple method for constructing ℓ-runlength limited q-ary codes. However, the required codeword length n is bounded by a function of ℓ and q. For example, when ℓ = 3, the method is only applicable for n ≤ 39 (refer to [8, Table II]). In this work, we show that such a bound can be improved, and hence, the redundancy can be further reduced. For the DNA storage channel, when n ≤ 200, ℓ ∈ {5, 6}, our encoder incurs only one redundant symbol.

Definition 7. For a sequence x = x_1 x_2 . . . x_n ∈ Σ_q^n, the differential of x, denoted by Diff(x), is a sequence y = y_1 y_2 . . . y_n ∈ Σ_q^n, where y_1 = x_1 and y_i = x_i − x_{i−1} (mod q) for 2 ≤ i ≤ n.

It is easy to see that from y = y_1 y_2 . . . y_n = Diff(x), we can determine x uniquely as x_i = Σ_{j=1}^{i} y_j (mod q) for 1 ≤ i ≤ n. For convenience, we write x = Diff^{−1}(y).

Lemma 8. Let x ∈ Σ_q^n. If the longest run of zeros in Diff(x) is at most ℓ − 1, then x is ℓ-runlength limited.
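To make Definition 7 and Lemma 8 concrete, here is a minimal Python sketch of the differential map and its inverse. The names `diff`, `diff_inv` and `longest_zero_run` are illustrative; sequences are assumed to be non-empty lists over {0, . . . , q − 1}.

```python
def diff(x, q):
    """Differential of x: y_1 = x_1 and y_i = x_i - x_{i-1} (mod q)."""
    return [x[0]] + [(x[i] - x[i - 1]) % q for i in range(1, len(x))]


def diff_inv(y, q):
    """Inverse map: x_i is the running sum of y_1..y_i modulo q."""
    x, acc = [], 0
    for s in y:
        acc = (acc + s) % q
        x.append(acc)
    return x


def longest_zero_run(y):
    """Length of the longest run of zeros in y (0 if there is none)."""
    best = cur = 0
    for s in y:
        cur = cur + 1 if s == 0 else 0
        best = max(best, cur)
    return best
```

For x = 333110 over Σ_4, `diff(x, 4)` is 300203: the two consecutive zeros correspond to the run 333 of length three, which illustrates why a zero run of length at most ℓ − 1 in Diff(x) keeps every run of x at length at most ℓ.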

We now present an efficient encoder for ℓ-runlength limited q-ary codes, and refer to it as RLL Encoder B or ENC^B_RLL. For a source data x ∈ Σ_q^{N−1}, we encode y = ENC(x) ∈ Σ_q^N such that y contains no 0^ℓ as a substring, and then output c = Diff^{−1}(y).

Initial Step. The encoder simply appends a '0' to the end of x, yielding the N-symbol word x0. The encoder then checks the word x0, and if there is no substring 0^ℓ, the output is simply c = x0. Otherwise, it proceeds to the replacement step.

Replacement Procedure. Let the current word be c = y 0^ℓ z, where, by assumption, the prefix y has no forbidden 0^ℓ and the run 0^ℓ starts at position p, where 1 ≤ p ≤ N − ℓ + 1. The encoder removes 0^ℓ and updates the current word to be c = yzRe, where the pointer Re is used to represent the position p, and

(i) R ∈ Σ_q^{ℓ−1},

(ii) e ∈ Σ_q \ {0}.

Note that the number of unique combinations of the pointer Re equals (q − 1)q^{ℓ−1}, and that the current word c = yzRe is of length N. If, after the replacement, c contains no substring 0^ℓ, then the encoder returns c as the codeword. Otherwise, the encoder repeats the replacement procedure for the current word c until all substrings 0^ℓ have been removed. Note that during every step, the length of the codeword is preserved. Since the last symbol of every pointer is nonzero, the concatenation R_1 e_1 R_2 e_2 of any two consecutive pointers does not produce a substring 0^ℓ, and the procedure is guaranteed to terminate. As the position p is in the range 1 ≤ p ≤ N − ℓ + 1, and the number of combinations of Re equals (q − 1)q^{ℓ−1}, we conclude that N is upper bounded by

$$N \le (q-1)\,q^{\ell-1} + \ell - 1, \quad \text{for } \ell \ge 2. \tag{1}$$

Decoding Procedure. The decoder checks from the right to the left. If the last symbol is '0', the decoder simply removes the symbol '0' and identifies the first N − 1 symbols as the source data. On the other hand, if the last symbol is not '0', the decoder takes the suffix of length ℓ, identifies it as the pointer, and then adds back the substring 0^ℓ accordingly. It terminates when the first symbol '0' is found.

Remark 9. The bound in (1) implies that for q = 4, ℓ ∈ {4, 5, 6}, our encoder uses only one redundant symbol for all n ≤ 195. Table II shows the improvement with respect to the result provided in [8]. In addition, this algorithm can be easily extended to the case of arbitrary length n > N. The main idea is that we divide the source data into subwords of length N − 1, encode each subword separately and concatenate them. The representation pointer needs to be modified so that the concatenation of any two encoded subwords does not contain a substring 0^ℓ. To do so, we simply append '1' to the end of the source data instead, and require the pointers to be of the form Re where R ∈ Σ_q^{ℓ−1} and e ∉ {0, 1}. The replacement procedure and decoding procedure proceed similarly.

| ℓ | n_max: Bound in (1) | n_max: Previous work [8] |
|---|---|---|
| 2 | 13 | 11 |
| 3 | 50 | 39 |
| 4 | 195 | 148 |
| 5 | 772 | 581 |

TABLE II: Maximum length n for which an encoder can achieve the rate (n − 1)/n for ℓ-runlength limited quaternary codes.
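The "Bound in (1)" column can be recomputed directly from the formula N ≤ (q − 1)q^{ℓ−1} + ℓ − 1. The helper name `max_one_symbol_length` in the Python sketch below is an illustrative assumption.

```python
def max_one_symbol_length(ell, q=4):
    """Largest codeword length N allowed by bound (1) for one redundant symbol."""
    if ell < 2:
        raise ValueError("bound (1) is stated for ell >= 2")
    # (q - 1) * q**(ell - 1) pointers, one per start position 1..N-ell+1.
    return (q - 1) * q ** (ell - 1) + ell - 1


# Recompute the quaternary column of Table II.
for ell in (2, 3, 4, 5):
    print(ell, max_one_symbol_length(ell))
```

The loop prints 13, 50, 195 and 772 for ℓ = 2, 3, 4, 5, matching the table.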

IV. EFFICIENT GC-CONTENT CONSTRAINED CODES

In this section, we propose linear-time encoders/decoders that translate binary input data to DNA strands whose GC-content is within [0.5 − ε, 0.5 + ε] for arbitrary ε > 0, with a fixed number of redundant bits. This method yields a significant improvement in coding redundancy with respect to prior works. We first review Knuth's balancing technique.

A. Knuth’s Balancing Technique

Knuth’s balancing technique is a linear-time algorithm that maps a binary message xto a balanced word yof the same length

by ﬂipping the ﬁrst tbits of x[14]. The crucial observation demonstrated by Knuth is that such an index talways exists and

tis commonly referred to as the balancing index. To represent the balancing index, Knuth appends ywith a short balanced

sufﬁx of length log nand so, a lookup table of size log nis required.

Several works in the literature used this technique to encode DNA strands whose GC-content is exactly balanced (for example,

[9], [12]), and the coding redundancy is approximately log n. We generalize this technique for binary codes ﬁrst.

B. Generalization of Knuth’s Balancing Technique

Definition 10. Let n be even. For arbitrary ε > 0, a binary word x ∈ {0, 1}^n is ε-balanced if the weight of x, wt(x), satisfies

$$\left| \frac{\mathrm{wt}(x)}{n} - 0.5 \right| \le \epsilon.$$

In other words, we have 0.5n − εn ≤ wt(x) ≤ 0.5n + εn.

Definition 11. Let n be even. For arbitrary ε > 0, the index t, where 0 ≤ t ≤ n, is called the ε-balanced index of x ∈ {0, 1}^n if the word y obtained by flipping the first t bits in x is ε-balanced.

We now show that such an index t always exists and there is an efficient method to find t. For n even, let the ε-balanced set S_{ε,n} ⊂ {0, 1, 2, . . . , n} be the set of the following indices.

$$S_{\epsilon,n} = \{0, n\} \cup \{2\lfloor \epsilon n \rfloor, 4\lfloor \epsilon n \rfloor, 6\lfloor \epsilon n \rfloor, \ldots\}. \tag{2}$$

The size of S_{ε,n} is at most ⌊1/(2ε)⌋ + 1.

Theorem 12. Let n be even, ε > 0. For an arbitrary binary sequence x ∈ {0, 1}^n, there exists an index t in the set S_{ε,n} such that t is the ε-balanced index of x.

Proof. In the trivial case, when x is ε-balanced, the index t = 0, which is in the set S_{ε,n}. Assume that x is not ε-balanced, and without loss of generality, assume that wt(x) < 0.5n − εn. Let Flip_k(x) be the word obtained by flipping the first k bits in x. Since wt(x) < 0.5n − εn, we have wt(Flip_n(x)) > 0.5n + εn. Now consider the list of indices that we try in order to obtain an ε-balanced word, t_1 = 2⌊εn⌋, t_2 = 4⌊εn⌋, and so on. Since Flip_{t_i}(x) and Flip_{t_{i+1}}(x) differ in at most 2εn positions, and wt(x) < 0.5n − εn, wt(Flip_n(x)) > 0.5n + εn, there must be an index t such that 0.5n − εn ≤ wt(Flip_t(x)) ≤ 0.5n + εn.
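The search implied by Theorem 12 is short enough to spell out. The following Python sketch builds the set in (2) and returns the first index in it that balances a given binary word; the names `balanced_index_set`, `flip_prefix` and `find_balanced_index` are illustrative, and the sketch assumes ⌊εn⌋ ≥ 1 so that the step 2⌊εn⌋ is nonzero. Rational arithmetic avoids rounding issues at the boundary.

```python
from fractions import Fraction
from math import floor


def balanced_index_set(n, eps):
    """The set S_{eps,n} of (2): {0, n} plus the multiples of 2*floor(eps*n)."""
    step = 2 * floor(Fraction(str(eps)) * n)
    indices = {0, n}
    if step > 0:
        indices.update(range(step, n, step))
    return sorted(indices)


def flip_prefix(x, t):
    """Flip_t(x): complement the first t bits of x."""
    return [1 - b for b in x[:t]] + list(x[t:])


def find_balanced_index(x, eps):
    """Return the first t in S_{eps,n} such that Flip_t(x) is eps-balanced."""
    n = len(x)
    eps = Fraction(str(eps))
    for t in balanced_index_set(n, eps):
        weight = sum(flip_prefix(x, t))
        if abs(Fraction(weight, n) - Fraction(1, 2)) <= eps:
            return t
    raise ValueError("no eps-balanced index found in S_{eps,n}")
```

For n = 10 and ε = 0.1 the candidate set is {0, 2, 4, 6, 8, 10}, and the all-zero word is balanced at t = 4. Only |S_{ε,n}| candidates are tried, which is what keeps the redundancy independent of n.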

We provide two methods to construct GC-content constrained codes. The first method uses ε-balanced binary codes as a template to construct ε-balanced quaternary codes with at most ⌈log(⌊1/(2ε)⌋ + 1)⌉ bits of redundancy. On the other hand, the second method proceeds directly over the quaternary alphabet and appends a short balanced suffix to the end of each codeword to indicate the ε-balanced index.

C. Binary Construction of GC-Content Constrained Codes

When q = 4, we consider the following one-to-one correspondence between the quaternary alphabet and two-bit sequences:

0 ↔ 00, 1 ↔ 01, 2 ↔ 10, 3 ↔ 11.

Therefore, given a DNA sequence σ of length n, we have a corresponding binary sequence x ∈ {0, 1}^{2n} and we write x = Ψ(σ) or σ = Ψ^{−1}(x). Given σ ∈ Σ_4^n, let x = Ψ(σ) ∈ {0, 1}^{2n} and we set U_σ = x_1 x_3 · · · x_{2n−1} and L_σ = x_2 x_4 · · · x_{2n}. In other words, σ = Ψ^{−1}(U_σ || L_σ). We refer to U_σ and L_σ as the upper sequence and lower sequence of σ, respectively. The following result is immediate.

Lemma 13. Let σ ∈ Σ_4^n. Then σ is ε-balanced if and only if U_σ is ε-balanced.

GC-Encoder C. Given n, ε > 0, set k = ⌈log(⌊1/(2ε)⌋ + 1)⌉ and m = 2n − k. Let S_{ε,n} be the set of indices as constructed in (2), and construct a one-to-one correspondence between the indices in S_{ε,n} and k-bit sequences.

INPUT: x ∈ {0, 1}^n, y ∈ {0, 1}^{n−k} and so, xy ∈ {0, 1}^m

OUTPUT: σ = ENC^C_GC(xy)

(I) Search for the first t in S_{ε,n} such that Flip_t(x) is ε-balanced.

(II) Set x′ = Flip_t(x).

(III) Let z be the k-bit sequence representing index t.

(IV) Set y′ = yz of length n.

(V) Finally, we set σ ≜ Ψ^{−1}(x′ || y′).

Example 14. Let n = 10, ε = 0.1, k = ⌈log(⌊1/(2ε)⌋ + 1)⌉ = 3, i.e. we want the GC-content of each codeword to be within [0.4, 0.6]. The set S_{ε,n} = {0, 2, 4, 6, 8, 10} is of size six. We construct the one-to-one correspondence between the indices and 3-bit sequences: 0 → 000, 2 → 001, 4 → 010, 6 → 100, 8 → 011 and 10 → 111. Suppose the input sequence is c = 0^{17}, i.e. x = 0^{10} and y = 0^{7}. We find the index t = 4. Following the encoder, we get x′ = 1111000000 and y′ = 0000000010. We then obtain σ = Ψ^{−1}(x′ || y′) = 2222000010.
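The encoding in Example 14 can be traced with a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' implementation: the index table is copied from Example 14, the function names are invented for this sketch, and ε is passed as an exact `Fraction`.

```python
from fractions import Fraction

# Index-to-bits correspondence taken from Example 14 (n = 10, eps = 0.1).
INDEX_BITS = {0: "000", 2: "001", 4: "010", 6: "100", 8: "011", 10: "111"}


def flip_prefix(bits, t):
    """Complement the first t bits."""
    return [1 - b for b in bits[:t]] + list(bits[t:])


def is_balanced(bits, eps):
    """True if the fraction of ones is within eps of 1/2."""
    return abs(Fraction(sum(bits), len(bits)) - Fraction(1, 2)) <= eps


def gc_encode_c(x, y, eps, index_bits):
    """Upper sequence: balanced flip of x. Lower sequence: y followed by the index bits."""
    t = next(t for t in sorted(index_bits) if is_balanced(flip_prefix(x, t), eps))
    upper = flip_prefix(x, t)
    lower = list(y) + [int(b) for b in index_bits[t]]
    # Psi^{-1} on the interleaved word: the bit pair (u, l) is the symbol 2u + l.
    return [2 * u + l for u, l in zip(upper, lower)]


def gc_decode_c(sigma, index_bits):
    """Recover (x, y) by reading the index bits from the end of the lower sequence."""
    upper = [s >> 1 for s in sigma]
    lower = [s & 1 for s in sigma]
    k = len(next(iter(index_bits.values())))
    z = "".join(str(b) for b in lower[-k:])
    t = next(i for i, bits in index_bits.items() if bits == z)
    return flip_prefix(upper, t), lower[:-k]
```

Calling `gc_encode_c([0] * 10, [0] * 7, Fraction(1, 10), INDEX_BITS)` yields 2222000010, as in the example, and `gc_decode_c` returns the original pair (x, y). Only the upper bits determine the GC-content, which is why the lower sequence is free to carry y and the index.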

GC-Decoder C. Given n, ε > 0, set k = ⌈log(⌊1/(2ε)⌋ + 1)⌉ and m = 2n − k.

INPUT: σ ∈ Σ_4^n, σ is ε-balanced

OUTPUT: xy ∈ {0, 1}^m

(I) Set x′ = U_σ ∈ {0, 1}^n and y′ = L_σ ∈ {0, 1}^n.

(II) Let z be the suffix of length k in y′ and let t be the index in S_{ε,n} corresponding to z.

(III) Set x = Flip_t(x′).

(IV) Set y to be y′ with the suffix z removed.

(V) Finally, we output xy.

Remark 15. For constant ε > 0, the complexity of GC-Encoder C is linear and the redundancy is constant. For example, when n = 200, ε = 0.1, i.e. the GC-content is within [0.4, 0.6], the set S_{ε,n} = {0, 40, 80, 120, 160, 200} is of size six. The GC-Encoder C uses only ⌈log 6⌉ = 3 bits of redundancy to indicate the ε-balanced index in the lower sequence and the rate of the encoder is 1.985 bits/nt. Similarly, when ε = 0.05, i.e. the GC-content is within [0.45, 0.55], the GC-Encoder C uses only ⌈log 11⌉ = 4 bits of redundancy and the rate is 1.98 bits/nt.

D. Knuth-like Construction of GC-Content Constrained Codes

Consider the quaternary alphabet Σ_4 = {0, 1, 2, 3}. To apply Knuth's method, we define the flipping rule f : Σ_4 → Σ_4, where f(0) = 2, f(2) = 0, f(1) = 3 and f(3) = 1. For a sequence σ ∈ Σ_4^n and index i with 0 ≤ i ≤ n, f^i(σ) denotes the sequence obtained by flipping the first i symbols of σ under f.

Definition 16. Let n be even. For arbitrary ε > 0, the index t, where 0 ≤ t ≤ n, is called the ε-balanced index of σ ∈ Σ_4^n if the sequence σ′ = f^t(σ) is ε-balanced.

Example 17. Consider n = 10, ε = 0.1. Let σ = 0000000000. Observe that f^4(σ) = 2222000000, f^5(σ) = 2222200000 and f^6(σ) = 2222220000 are ε-balanced. Hence, t = 4, 5, 6 are ε-balanced indices of σ. In general, there might be more than one ε-balanced index.
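The flipping rule f and the balanced indices of Example 17 can be checked with the short Python sketch below. The names `F`, `flip_symbols` and `balanced_indices` are illustrative assumptions; the function enumerates every index from 0 to n rather than only those in the set (2).

```python
from fractions import Fraction

F = {0: 2, 2: 0, 1: 3, 3: 1}  # flipping rule f: swaps A<->C and T<->G


def flip_symbols(sigma, t):
    """f^t(sigma): apply f to the first t symbols and keep the rest."""
    return [F[s] for s in sigma[:t]] + list(sigma[t:])


def gc_fraction(sigma):
    """Exact fraction of symbols equal to 2 (C) or 3 (G)."""
    return Fraction(sum(1 for s in sigma if s in (2, 3)), len(sigma))


def balanced_indices(sigma, eps):
    """All t in 0..n for which f^t(sigma) has GC-content within eps of 1/2."""
    eps = Fraction(str(eps))
    return [t for t in range(len(sigma) + 1)
            if abs(gc_fraction(flip_symbols(sigma, t)) - Fraction(1, 2)) <= eps]
```

For the all-zero strand of length 10 and ε = 0.1, `balanced_indices` returns 4, 5 and 6, as in the example. Because f always maps a symbol in {A, T} to one in {C, G} and vice versa, each flipped position changes the GC count by exactly one, which is the property that lets Theorem 12 carry over to the quaternary case.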

The following result follows from Theorem 12.

Corollary 18. Let n be even, ε > 0, and let the set S_{ε,n} be defined as in (2). For any sequence σ ∈ Σ_4^n, there exists an index t in the set S_{ε,n} such that t is the ε-balanced index of σ.

To encode a sequence σ into an ε-balanced codeword, we first find the smallest ε-balanced index t of σ, and then flip the first t symbols of σ according to the rule f. To represent the index, we also append a short balanced suffix to the end of the codeword, and so, a lookup table of size |S_{ε,n}| is required and the redundancy is ⌈log(⌊1/(2ε)⌋ + 1)⌉. The following result is trivial.

Lemma 19. Let n, m be even. Assume that σ ∈ Σ_4^n is ε-balanced and z ∈ Σ_4^m is balanced. Then the concatenated sequence σz is also ε-balanced.

Example 20. Let n = 200, ε = 0.1, i.e. we want the GC-content to be within [0.4, 0.6], and the set S_{ε,n} = {0, 40, 80, 120, 160, 200} is of size six. We construct the one-to-one correspondence between the index and a short balanced suffix of length 2 as follows: 0 → 02, 40 → 03, 80 → 12, 120 → 13, 160 → 20, 200 → 30. Assume that σ ∈ Σ_4^{200} and the ε-balanced index t of σ is t = 40. The encoder flips the first 40 symbols in σ to obtain σ′ that is ε-balanced, and then appends 03 to the end of σ′. The encoder uses only two redundant symbols for ε = 0.1.

We now show that the suffix can be encoded and decoded in linear time without the use of a lookup table. In addition, in order to construct an (ε, ℓ)-constrained code, we encode the suffix in such a way that it is also ℓ-runlength limited. The details are as follows.

Index Encoder. Let n be even, ε, ℓ > 0. The set S_{ε,n} is defined as in (2). Set k ≜ ⌈log_4(⌊1/(2ε)⌋ + 1)⌉.

INPUT: t, t ∈ S_{ε,n}, 0 ≤ t ≤ n − 1

OUTPUT: p ≜ INDEX_ENC(t)

(I) Let τ_1 τ_2 · · · τ_k be the quaternary representation of t in S_{ε,n}.

(II) Interleave the representation with the alternating length-k sequence f(τ_1) f(τ_2) · · · f(τ_k) to obtain p of length 2k. In other words, set p = τ_1 f(τ_1) τ_2 f(τ_2) · · · τ_k f(τ_k).
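A minimal Python sketch of the Index Encoder follows. It assumes that "the quaternary representation of t in S_{ε,n}" means the base-4 expansion of the position of t within the sorted set, which is an interpretation and not stated explicitly in the text; the function names are likewise illustrative.

```python
F = {0: 2, 2: 0, 1: 3, 3: 1}  # flipping rule f


def to_quaternary(value, k):
    """Base-4 digits of value, most significant first, padded to length k."""
    digits = []
    for _ in range(k):
        value, d = divmod(value, 4)
        digits.append(d)
    return digits[::-1]


def index_encode(position, k):
    """Interleave each digit tau_i with f(tau_i): p = tau_1 f(tau_1) ... tau_k f(tau_k)."""
    p = []
    for tau in to_quaternary(position, k):
        p.extend([tau, F[tau]])
    return p


def index_decode(p):
    """Read the digits at the odd positions of p and rebuild the position."""
    value = 0
    for tau in p[0::2]:
        value = 4 * value + tau
    return value
```

For example, with k = 2 the position 1 (the second element of the set) is encoded as 0213. Each pair τ f(τ) contains exactly one symbol from {C, G}, so p is balanced, and since τ ≠ f(τ) no symbol can repeat more than twice in a row inside p.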

The corresponding GC-content Encoder and Decoder are described as follows.

GC-Encoder D. Given n, ε > 0, set k = ⌈log_4(⌊1/(2ε)⌋ + 1)⌉ and m = 2n − 4k. Let S_{ε,n−2k} be the set of indices as constructed in (2), and construct a one-to-one correspondence between the indices in S_{ε,n−2k} and length-k quaternary sequences.

INPUT: x ∈ {0, 1}^m

OUTPUT: σ = ENC^D_GC(x)

(I) Set σ′ = Ψ^{−1}(x) ∈ Σ_4^{n−2k}.

(II) Search for the first t in S_{ε,n−2k} such that t is the ε-balanced index of σ′.

(III) Obtain σ″ by flipping the first t symbols in σ′.

(IV) Use the Index Encoder to obtain p of length 2k representing index t.

(V) Finally, we set σ ≜ σ″p.

GC-Decoder D.

INPUT: σ ∈ Σ_4^n, σ is ε-balanced

OUTPUT: x ≜ DEC^D_GC(σ) ∈ {0, 1}^m

(I) Let p be the suffix of length 2k in σ, and σ′ be the prefix of length n − 2k.

(II) Let z be the sequence of symbols at the odd indices in p, which is the length-k quaternary sequence representing index t in the set S_{ε,n−2k}.

(III) Flip the first t symbols in σ′ according to the flipping rule f to obtain σ″.

(IV) Finally, output x = Ψ(σ″).

Remark 21. The advantage of Encoder C is low redundancy, however, it is hard to combine with an RLL Encoder to construct

an (, `)-constrained encoder. In the next section, we present an efﬁcient (, `)-constrained encoder using the construction of

Encoder D and the two RLL Encoders presented in Section III.

V. EFFI CI EN T (, `)-CONSTRAINED COD ES

In this section, we present an $(\epsilon, \ell)$-constrained encoder that translates binary data to DNA strands that are $\ell$-runlength limited and $\epsilon$-balanced for arbitrary $\epsilon, \ell > 0$. Prior to this work, results in the literature mostly focused on specific values of $\epsilon$ and $\ell$ [11], [12]. For example, Song et al. [11] used a concatenation technique to design an RLL encoder for $\ell = 3$; their simulation results showed that the GC-content of all codewords is between 0.4 and 0.6, i.e. $\epsilon = 0.1$, and for $n = 200$ the rate of the encoder is 1.9 (bits/nt). In this section, we provide a more efficient coding scheme such that the output codewords are $\ell$-runlength limited and $\epsilon$-balanced.

Example 22. Consider $n = 10$, $\epsilon = 0.1$, $\ell = 3$. Let $\sigma = 0002111011$. Observe that even though $\sigma$ is $\ell$-runlength limited, it is not $\epsilon$-balanced. We then get $f_3(\sigma) = 2222111011$, which is $\epsilon$-balanced. However, $f_3(\sigma)$ is not $\ell$-runlength limited.
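Example 22 can be checked numerically. The sketch below assumes that symbols 2 and 3 are the G/C symbols and that $f$ swaps $0 \leftrightarrow 2$ and $1 \leftrightarrow 3$; both conventions are defined outside this section, and the helper names are illustrative.

```python
def flip(symbol):
    return symbol ^ 2  # assumed flipping rule: 0 <-> 2, 1 <-> 3


def flip_prefix(seq, t):
    """f_t: flip the first t symbols and keep the rest."""
    return [flip(s) for s in seq[:t]] + list(seq[t:])


def gc_content(seq):
    """Fraction of symbols in {2, 3} (assumed to be the G/C symbols)."""
    return sum(1 for s in seq if s >= 2) / len(seq)


def max_run(seq):
    """Length of the longest run of identical consecutive symbols."""
    best = run = 0
    previous = None
    for s in seq:
        run = run + 1 if s == previous else 1
        best = max(best, run)
        previous = s
    return best


sigma = [int(c) for c in "0002111011"]
flipped = flip_prefix(sigma, 3)
print(gc_content(sigma), max_run(sigma))      # GC-content and longest run before flipping
print(gc_content(flipped), max_run(flipped))  # and after flipping the first 3 symbols
```

Flipping repairs the GC-content (from 0.1 to 0.4) but merges the flipped prefix with the following symbol 2 into a run of length four, which is exactly the failure the example describes.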

The above example also illustrates that the sequence $f_t(\sigma)$ may not be $\ell$-runlength limited even if $\sigma$ is $\ell$-runlength limited. Nevertheless, we observe that the prefix and suffix of $f_t(\sigma)$ remain $\ell$-runlength limited. For brevity, given a sequence $\sigma \in \Sigma_4^n$, we use $P_i(\sigma)$ and $S_i(\sigma)$ to denote the prefix and suffix of $\sigma$ of length $i$, respectively.

Lemma 23. Let $0 \le t \le n$. If a sequence $\sigma$ is $\ell$-runlength limited and $\sigma' = f_t(\sigma)$, then $P_t(\sigma')$ and $S_{n-t}(\sigma')$ are both $\ell$-runlength limited.

To ensure that the obtained sequence remains $\ell$-runlength limited, we simply add one redundant symbol before concatenating $P_t(\sigma')$ and $S_{n-t}(\sigma')$.

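Corollary 24 (Concatenate two $\ell$-runlength limited sequences). Let $\sigma, \sigma'$ be $\ell$-runlength limited. Suppose that the last symbol of $\sigma$ is $\alpha$ and the first symbol of $\sigma'$ is $\beta$. Let $\gamma \in \Sigma_4 \setminus \{\alpha, \beta\}$; then $\sigma'' = \sigma \gamma \sigma'$ is $\ell$-runlength limited.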
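A minimal sketch of Corollary 24: a separator symbol that differs from both neighbours cannot extend either run, so the joined sequence is as runlength limited as its two parts. The function names are illustrative, and picking the smallest admissible symbol is an arbitrary choice.

```python
def max_run(seq):
    """Length of the longest run of identical consecutive symbols."""
    best = run = 0
    previous = None
    for s in seq:
        run = run + 1 if s == previous else 1
        best = max(best, run)
        previous = s
    return best


def join_with_separator(left, right):
    """Return left + [gamma] + right with gamma outside {last of left, first of right}."""
    alpha, beta = left[-1], right[0]
    gamma = next(s for s in range(4) if s not in (alpha, beta))
    return left + [gamma] + right, gamma


left, right = [2, 2, 2], [2, 2, 2, 1]
joined, gamma = join_with_separator(left, right)
```

Here plain concatenation would create a run of six 2's, whereas the joined sequence keeps the longest run at three.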

We illustrate the construction of the $(\epsilon, \ell)$-constrained encoder through the following example.

Example 25 (Example 20 continued). Suppose $n = 200$, $\epsilon = 0.1$, and $\ell = 3$. We show that there exists an efficient $(\epsilon, \ell)$-constrained encoder with at most 8 redundant symbols. From the data sequence $\sigma \in \Sigma_4^{192}$, we use RLL Encoder A to obtain $\sigma_1 = \mathrm{ENC}^{A}_{\mathrm{RLL}}(\sigma)$. This step requires two redundant symbols and hence $\sigma_1 \in \Sigma_4^{194}$ is $\ell$-runlength limited. We now search for the $\epsilon$-balanced index $t$ of $\sigma_1$ in the set $S_{0.1,194}$ of size six, i.e. $\sigma_2 = f_t(\sigma_1)$ is $\epsilon$-balanced. Such an index can be represented by a pointer $p$ of size two (similar to Example 20). We follow Corollary 24 to find $\gamma, \gamma'$ such that $\sigma_2 = P_t(\sigma_1) \gamma S_{n-t}(\sigma_1) \gamma' p \in \Sigma_4^{198}$ is $\ell$-runlength limited. To ensure that the final output is $\epsilon$-balanced, recall that $P_t(\sigma_1) S_{n-t}(\sigma_1) p$ is $\epsilon$-balanced; we then output $\sigma_3 = \sigma_2 f(\gamma') f(\gamma)$. It is easy to verify that $\sigma_3$ is $\ell$-runlength limited and $\epsilon$-balanced. Thus, the encoder uses 8 redundant symbols to encode codewords of length 200, and hence the rate is $2 \cdot 192 / 200 = 1.92$ (bits/nt).

We now show that the representation $p$ of the $\epsilon$-balanced index can be encoded/decoded in linear time without using a lookup table. Suppose we want to encode codewords in $\Sigma_4^n$ where $n$ is even. Set $k \triangleq \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil$ and $N = n - 2k - 4$. Let $r_{\mathrm{RLL}}$ denote the number of redundant symbols used by the RLL Encoder ($\mathrm{ENC}^{A}_{\mathrm{RLL}}$ or $\mathrm{ENC}^{B}_{\mathrm{RLL}}$) to encode the $\ell$-runlength limited codewords in $\Sigma_4^N$. We summarize our proposed $(\epsilon, \ell)$-constrained encoder as follows.

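$(\epsilon, \ell)$-Constrained Encoder. Given $n, \epsilon, \ell$ with $n$ even and $\ell \ge 3$. Set $m = 2n - 2(r_{\mathrm{RLL}} + 2k + 4)$. Let $S_{\epsilon,N}$ be the set of indices as defined by (2), and construct a one-to-one correspondence between the indices in $S_{\epsilon,N}$ and length-$k$ quaternary sequences.

INPUT: $x \in \{0,1\}^m$

OUTPUT: $\sigma \triangleq \mathrm{ENC}_{(\epsilon,\ell)}(x) \in \Sigma_4^n$

(I) Set $\sigma_1 = \Psi^{-1}(x) \in \Sigma_4^{n - r_{\mathrm{RLL}} - 2k - 4}$

(II) Use the RLL Encoder to obtain $\sigma_2 = \mathrm{ENC}_{\mathrm{RLL}}(\sigma_1)$, where $\sigma_2 \in \Sigma_4^N$ is $\ell$-runlength limited

(III) Search for the first $\epsilon$-balanced index $t$ of $\sigma_2$ in $S_{\epsilon,N}$

(IV) Flip the first $t$ symbols in $\sigma_2$ to obtain $\sigma_3 = f_t(\sigma_2)$

(V) Let $\tau_1 \tau_2 \cdots \tau_k$ be the quaternary representation of $t$ in $S_{\epsilon,N}$. Set $p = \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_k f(\tau_k)$

(VI) Use Corollary 24 to find $\gamma$ and $\gamma'$ such that $\sigma_4 = P_t(\sigma_3) \gamma S_{N-t}(\sigma_3) \gamma' p$ is $\ell$-runlength limited

(VII) Output $\sigma = \sigma_4 f(\gamma) f(\gamma')$. Note that $\sigma \in \Sigma_4^n$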

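Theorem 26. The $(\epsilon, \ell)$-Constrained Encoder is correct. In other words, $\mathrm{ENC}_{(\epsilon,\ell)}(x)$ is $\epsilon$-balanced and $\ell$-runlength limited for all $x \in \{0,1\}^m$. The redundancy of the encoder is $r_{\mathrm{RLL}} + 2k + 4$ symbols.

Proof. Let $\sigma = \mathrm{ENC}_{(\epsilon,\ell)}(x)$. We first show that $\sigma$ is $\ell$-runlength limited. According to Corollary 24, $\sigma_4$ is $\ell$-runlength limited. Since the two symbols of each pair $\tau_i f(\tau_i)$ in $p$ are distinct, the run ending at the last symbol of $p$ has length one, and so the concatenation $p f(\gamma) f(\gamma')$ is $\ell$-runlength limited for all $\ell \ge 3$. Therefore, $\sigma$ is $\ell$-runlength limited.

We now show that $\sigma$ is $\epsilon$-balanced. Since $\sigma_3$ is $\epsilon$-balanced, $p$ is balanced, and $\gamma f(\gamma)$ and $\gamma' f(\gamma')$ are balanced, $\sigma$ is $\epsilon$-balanced (according to Lemma 19).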

Remark 27. The construction can be easily extended to $\ell \in \{1, 2\}$. For arbitrary $\epsilon > 0$, $k = \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil = O(1)$ is a constant. Therefore, the rate of this encoder approaches the rate of the RLL Encoder. If we use the RLL Encoder based on enumeration ($\mathrm{ENC}^{A}_{\mathrm{RLL}}$), then the rate of the $(\epsilon, \ell)$-constrained encoder approaches the capacity for sufficiently large $n$. However, Encoder A runs in $\Theta(n^2)$ time. For DNA storage with $\ell \in \{4, 5, 6\}$, we can use the linear-time $\mathrm{ENC}^{B}_{\mathrm{RLL}}$ to achieve a rate as good as that of $\mathrm{ENC}^{A}_{\mathrm{RLL}}$ (refer to Remark 9).

For completeness, we describe the corresponding $(\epsilon, \ell)$-constrained decoder as follows.

$(\epsilon, \ell)$-Constrained Decoder.

INPUT: $\sigma \in \Sigma_4^n$, $\sigma$ is $\epsilon$-balanced and $\ell$-runlength limited

OUTPUT: $x \triangleq \mathrm{DEC}_{(\epsilon,\ell)}(\sigma) \in \{0,1\}^m$

(I) Let $p$ be the suffix of length $2k + 2$ and $\sigma_1$ be the prefix of length $n - 2k - 3$

(II) Remove the last two symbols in $p$

(III) Let $z$ be the sequence of symbols at the odd indices in $p$, which is the length-$k$ sequence representing the index $t$ in $S_{\epsilon,N}$

(IV) Flip the first $t$ symbols in $\sigma_1$ according to the flipping rule $f$ to obtain $\sigma_2$

(V) Remove the $(t+1)$th symbol in $\sigma_2$

(VI) Use the RLL Decoder to obtain $\sigma_3 = \mathrm{DEC}_{\mathrm{RLL}}(\sigma_2)$

(VII) Output $x = \Psi(\sigma_3)$
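The balancing part of the construction, namely encoder steps (III) to (VII) and decoder steps (I) to (V), can be sketched as follows. The RLL encoder and the map $\Psi$ are treated as external, so the input is assumed to be already $\ell$-runlength limited with $\ell \ge 3$. As before, the flipping rule ($0 \leftrightarrow 2$, $1 \leftrightarrow 3$), the G/C symbols (2 and 3), and the list `index_set` standing in for the set from (2) are assumptions made for illustration.

```python
def flip(symbol):
    return symbol ^ 2  # assumed flipping rule: 0 <-> 2, 1 <-> 3


def flip_prefix(seq, t):
    return [flip(s) for s in seq[:t]] + list(seq[t:])


def is_balanced(seq, eps):
    gc = sum(1 for s in seq if s >= 2)  # assumption: 2 and 3 are the G/C symbols
    return abs(gc - len(seq) / 2) <= eps * len(seq) + 1e-9


def max_run(seq):
    best = run = 0
    previous = None
    for s in seq:
        run = run + 1 if s == previous else 1
        best = max(best, run)
        previous = s
    return best


def pick_separator(forbidden):
    return next(s for s in range(4) if s not in forbidden)


def index_enc(rank, k):
    digits = [(rank // 4 ** (k - 1 - i)) % 4 for i in range(k)]
    return [s for tau in digits for s in (tau, flip(tau))]


def balance_encode(sigma2, index_set, k, eps):
    """Encoder steps (III)-(VII) applied to an RLL sequence sigma2."""
    for rank, t in enumerate(index_set):
        sigma3 = flip_prefix(sigma2, t)
        if is_balanced(sigma3, eps):
            break
    else:
        raise ValueError("no balancing index found in index_set")
    p = index_enc(rank, k)
    prefix, suffix = sigma3[:t], sigma3[t:]
    gamma = pick_separator(prefix[-1:] + suffix[:1])
    body = prefix + [gamma] + suffix
    gamma2 = pick_separator([body[-1], p[0]])
    return body + [gamma2] + p + [flip(gamma), flip(gamma2)]


def balance_decode(sigma, index_set, k):
    """Decoder steps (I)-(V): recover the RLL sequence sigma2."""
    n = len(sigma)
    p = sigma[n - 2 * k - 2:n - 2]      # pointer without f(gamma), f(gamma')
    sigma1 = sigma[:n - 2 * k - 3]      # prefix that still contains gamma
    rank = 0
    for tau in p[0::2]:
        rank = 4 * rank + tau
    t = index_set[rank]
    sigma2 = flip_prefix(sigma1, t)
    del sigma2[t]                       # remove the (t+1)th symbol, gamma
    return sigma2


source = [0, 0, 0, 2, 1, 1, 1, 0, 1, 1]
index_set = [0, 2, 4, 6, 8]             # hypothetical index set
codeword = balance_encode(source, index_set, k=2, eps=0.1)
```

The four separator symbols $\gamma, \gamma', f(\gamma), f(\gamma')$ and the $2k$ pointer symbols account for the $2k + 4$ symbols of redundancy added on top of the RLL encoder.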

The efficiency of our designed $(\epsilon, \ell)$-constrained encoder is summarized in Table III. As can be seen, when the codeword length increases, the rate of our proposed encoder is only a few percent below capacity.

Codeword length $n$    Capacity $C$    Rate of encoder $r$    $\eta = r/C$ (%)

100                    1.99542         1.81000                90.707

200                    1.99578         1.92000                96.203

300                    1.99577         1.94000                97.206

TABLE III: Rate of the designed constrained encoder for $\epsilon = 0.1$ and $\ell = 4$.

VI. EFFICIENT $(\epsilon, \ell; \mathcal{B})$-ERROR-CONTROL CODES

We now construct $(\epsilon, \ell; \mathcal{B})$-error-control codes to correct the most common errors in DNA data storage, such as a single deletion, insertion, or substitution. This also helps to reduce the error propagation of the constrained decoders proposed earlier. Crucial to our construction are the binary Varshamov-Tenengolts (VT) codes defined by Levenshtein [22] and the $q$-ary VT codes defined by Tenengolts [23].

A. Codes Correcting a Single Indel/Edit

Definition 28. The binary VT syndrome of a binary sequence $x \in \{0,1\}^n$ is defined to be $\mathrm{Syn}(x) = \sum_{i=1}^{n} i x_i$.

For $a \in \mathbb{Z}_{n+1}$, the Varshamov-Tenengolts code $\mathrm{VT}_a(n)$ is defined as follows.

$\mathrm{VT}_a(n) = \{x \in \{0,1\}^n : \mathrm{Syn}(x) = a \pmod{n+1}\}$. (3)

For $a \in \mathbb{Z}_{n+1}$, the code $\mathrm{VT}_a(n)$ can correct a single indel, and Levenshtein later provided a linear-time decoding algorithm [22]. To also correct a substitution, Levenshtein [22] constructed the following code

$L_a(n) = \{x \in \{0,1\}^n : \mathrm{Syn}(x) = a \pmod{2n}\}$, (4)

and provided a decoder that corrects a single edit.
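For small lengths, the error-correcting property of the codes in (3) and (4) can be confirmed by exhaustive search: a code corrects a given error type exactly when the error balls around distinct codewords are pairwise disjoint. The brute-force check below is only a sanity test for tiny $n$, not the linear-time decoder of [22], and the helper names are illustrative.

```python
from itertools import product


def syn(x):
    """VT syndrome: sum of i * x_i with positions counted from 1."""
    return sum(i * xi for i, xi in enumerate(x, start=1))


def code_by_syndrome(n, a, modulus):
    """All binary words of length n with Syn(x) = a (mod modulus)."""
    return [x for x in product((0, 1), repeat=n) if syn(x) % modulus == a]


def deletions(x):
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def insertions(x):
    return {x[:i] + (b,) + x[i:] for i in range(len(x) + 1) for b in (0, 1)}


def substitutions(x):
    # at most one substituted position, so the word itself is included
    return {x} | {x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))}


def balls_disjoint(code, ball):
    """True if no received word lies in the balls of two different codewords."""
    seen = set()
    for codeword in code:
        current = ball(codeword)
        if seen & current:
            return False
        seen |= current
    return True


n = 5
vt_ok = all(
    balls_disjoint(code_by_syndrome(n, a, n + 1), ball)
    for a in range(n + 1)
    for ball in (deletions, insertions)
)
edit_ok = all(
    balls_disjoint(code_by_syndrome(n, a, 2 * n), ball)
    for a in range(2 * n)
    for ball in (deletions, insertions, substitutions)
)
```

Deletions, insertions and substitutions produce words of different lengths, so the three ball types can be checked separately.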

Theorem 29 (Levenshtein [22]). Let $L_a(n)$ be as defined in (4). There exists a linear-time decoding algorithm $\mathrm{DEC}^{L}_{a} : \{0,1\}^{n*} \to L_a(n)$ such that the following holds. If $c \in L_a(n)$ and $y \in \mathcal{B}_{\mathrm{edit}}(c)$, then $\mathrm{DEC}^{L}_{a}(y) = c$.

In 1984, Tenengolts [23] generalized the binary VT codes to nonbinary ones. Tenengolts defined the signature of a $q$-ary vector $x$ of length $n$ to be the binary vector $\pi(x)$ of length $n-1$, where $\pi(x)_i = 1$ if $x_{i+1} \ge x_i$, and $0$ otherwise, for $i \in [n-1]$. For $a \in \mathbb{Z}_n$ and $b \in \mathbb{Z}_q$, set

$T_{a,b}(n;q) \triangleq \left\{ x \in \mathbb{Z}_q^n : \pi(x) \in \mathrm{VT}_a(n-1) \text{ and } \sum_{i=1}^{n} x_i = b \pmod{q} \right\}$.

Then Tenengolts showed that $T_{a,b}(n;q)$ corrects a single indel and that there exist $a$ and $b$ such that the size of $T_{a,b}(n;q)$ is at least $q^n/(qn)$. These codes are known to be asymptotically optimal. In the same paper, Tenengolts also provided a systematic $q$-ary single-indel encoder with redundancy $\log n + C_q$, where $n$ is the length of a codeword and $C_q$ is independent of $n$.
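The definition of $T_{a,b}(n;q)$ translates directly into code. The sketch below enumerates the classes for a tiny quaternary example and checks by brute force that the deletion balls inside each class are pairwise disjoint; it is an exhaustive sanity check under the definitions above, not Tenengolts's systematic encoder.

```python
from itertools import product


def syn(bits):
    return sum(i * b for i, b in enumerate(bits, start=1))


def signature(x):
    """pi(x)_i = 1 if x_{i+1} >= x_i, and 0 otherwise."""
    return tuple(1 if x[i + 1] >= x[i] else 0 for i in range(len(x) - 1))


def tenengolts_code(n, q, a, b):
    """T_{a,b}(n; q): signature in VT_a(n-1) and symbol sum equal to b mod q."""
    return [
        x for x in product(range(q), repeat=n)
        if syn(signature(x)) % n == a and sum(x) % q == b
    ]


def deletions(x):
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def corrects_single_deletion(code):
    seen = set()
    for codeword in code:
        ball = deletions(codeword)
        if seen & ball:
            return False
        seen |= ball
    return True


n, q = 4, 4
classes = {(a, b): tenengolts_code(n, q, a, b) for a in range(n) for b in range(q)}
all_ok = all(corrects_single_deletion(code) for code in classes.values())
largest = max(len(code) for code in classes.values())
```

Since the $qn$ classes partition $\mathbb{Z}_q^n$, the largest class has at least $q^n/(qn)$ codewords, which is the size bound quoted above.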

Theorem 30 (Tenengolts [23]). There exists a linear-time decoding algorithm $\mathrm{DEC}^{T}_{(a,b)} : \mathbb{Z}_q^{n*} \to T_{a,b}(n;q)$ such that the following holds. If $c \in T_{a,b}(n;q)$ and $y \in \mathcal{B}_{\mathrm{indel}}(c)$, then $\mathrm{DEC}^{T}_{(a,b)}(y) = c$.

Recently, Chee et al. [9] presented linear-time encoders for GC-balanced codewords that are capable of correcting a single edit with $3 \log n + 2$ bits of redundancy. In the following, we use the idea of VT codes to modify the $(\epsilon, \ell)$-constrained code so that the codebook is capable of correcting either a single indel or a single edit.

For $\sigma \in \Sigma_4^n$, recall the definition of $U_\sigma, L_\sigma \in \{0,1\}^n$ and $x = U_\sigma \| L_\sigma = \Psi(\sigma)$ (refer to Section IV-III).

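Proposition 31. Let $\sigma \in \Sigma_4^n$. Then the following are true.

(a) $\sigma' \in \mathcal{B}_{\mathrm{indel}}(\sigma)$ implies that $U_{\sigma'} \in \mathcal{B}_{\mathrm{indel}}(U_\sigma)$ and $L_{\sigma'} \in \mathcal{B}_{\mathrm{indel}}(L_\sigma)$.

(b) $\sigma' \in \mathcal{B}_{\mathrm{edit}}(\sigma)$ implies that $U_{\sigma'} \in \mathcal{B}_{\mathrm{edit}}(U_\sigma)$ and $L_{\sigma'} \in \mathcal{B}_{\mathrm{edit}}(L_\sigma)$.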

Remark 32. The statement in Proposition 31 can be made stronger. Suppose that there is an indel at position $i$ of $\sigma$. Then there is exactly one indel at the same position $i$ in both the upper and lower sequences of $\sigma$. For example, consider $\sigma = 020313$. We have $U_\sigma = 010101$ and $L_\sigma = 000111$. If the third symbol in $\sigma$, which is 0, is deleted, we obtain $\sigma' = 02313$ and hence $U_{\sigma'} = 01101$ and $L_{\sigma'} = 00111$.
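The observation in Remark 32 can be replayed in code. The sketch assumes the map $0 \to 00$, $1 \to 01$, $2 \to 10$, $3 \to 11$ (upper bit first), which is defined outside this section; under this map the symbol 1 has lower bit 1, so the lower sequence of $020313$ evaluates to $000111$. The helper names are illustrative.

```python
def upper_lower(sigma):
    """Split quaternary symbols into upper and lower bits, assuming s = 2*u + l."""
    upper = [s >> 1 for s in sigma]
    lower = [s & 1 for s in sigma]
    return upper, lower


def delete_at(seq, position):
    """Delete the symbol at a 1-based position."""
    return seq[:position - 1] + seq[position:]


sigma = [0, 2, 0, 3, 1, 3]
upper, lower = upper_lower(sigma)

# Deleting the third symbol of sigma deletes the third bit of both sequences.
upper_after, lower_after = upper_lower(delete_at(sigma, 3))
same_position = (
    upper_after == delete_at(upper, 3) and lower_after == delete_at(lower, 3)
)
```

Because the map acts symbol by symbol, the same argument applies to an insertion or a substitution at position $i$.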

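The following construction is trivial.

Corollary 33. For $n > 0$, $a \in \mathbb{Z}_{2n}$, $b \in \mathbb{Z}_{2n}$, let $\mathcal{C}_{(a,b)}(n)$ be the set of all sequences $\sigma \in \Sigma_4^n$ such that $U_\sigma \in L_a(n)$ and $L_\sigma \in L_b(n)$. Then $\mathcal{C}_{(a,b)}(n)$ is capable of correcting a single edit error.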

B. Construction of $(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}})$-Error-Control Codes

We follow Tenengolts's construction to encode DNA sequences that are capable of correcting a single indel. We simply append the information of the syndrome and the sum of symbols to the end of each codeword. In addition, we use the idea of the Index Encoder (refer to Section IV-D) to ensure that the redundant part is balanced and $\ell$-runlength limited. The extra redundancy is $\log n + 4$. For simplicity, assume that $k' = \log n$ is an integer and $k'$ is even.

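$(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}})$-Error-Control Encoder. Let $n$ be even and $\epsilon, \ell > 0$. Set $k \triangleq \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil$. Set $m = 2n - 2(r_{\mathrm{RLL}} + 2k + 4)$ and $N = n - 2k - 4$. Let $S_{\epsilon,n-2k-4}$ be the set of indices as defined by (2), and construct a one-to-one correspondence between the indices in $S_{\epsilon,n-2k-4}$ and length-$k$ quaternary sequences. Set $k' = \log n$.

INPUT: $x \in \{0,1\}^m$

OUTPUT: $\sigma \triangleq \mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{indel}})}(x) \in \mathcal{C}(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}}) \cap \Sigma_4^{n + \log n + 4}$

(I) Use the $(\epsilon, \ell)$-constrained encoder to obtain $\sigma' = \mathrm{ENC}_{(\epsilon,\ell)}(x) \in \Sigma_4^n$, where $\sigma'$ is $\epsilon$-balanced and $\ell$-runlength limited

(II) Let $\alpha$ be the last symbol of $\sigma'$. Let $\beta$ be an arbitrary symbol in $\Sigma_4 \setminus \{\alpha, f(\alpha)\}$

(III) Let $a = \mathrm{Syn}(\pi(\sigma')) \pmod{n}$ and $b = \sum_{i=1}^{n} \sigma'_i \pmod{4}$

(IV) Let $\tau_1 \tau_2 \cdots \tau_{k'/2}$ be the quaternary representation of $a$

(V) Set $p = \beta f(\beta) \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_{k'/2} f(\tau_{k'/2}) \, b f(b)$

(VI) Output $\sigma = \sigma' p$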
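Steps (II) to (V) can be sketched as a function that builds the suffix $p$ from an already constrained sequence $\sigma'$. The flipping rule ($0 \leftrightarrow 2$, $1 \leftrightarrow 3$) is an assumption, $\beta$ is taken to be the smallest admissible symbol although any admissible symbol works, and the function name `indel_suffix` is illustrative.

```python
def flip(symbol):
    return symbol ^ 2  # assumed flipping rule: 0 <-> 2, 1 <-> 3


def syn(bits):
    return sum(i * b for i, b in enumerate(bits, start=1))


def signature(x):
    return [1 if x[i + 1] >= x[i] else 0 for i in range(len(x) - 1)]


def quaternary(value, width):
    """Base-4 digits of `value`, most significant first, padded to `width`."""
    return [(value // 4 ** (width - 1 - i)) % 4 for i in range(width)]


def indel_suffix(sigma):
    """Return (a, b, p) for a quaternary sequence sigma of length n = 2**k'."""
    n = len(sigma)
    k_prime = n.bit_length() - 1
    if 2 ** k_prime != n or k_prime % 2:
        raise ValueError("sketch assumes n = 2**k' with k' even")
    a = syn(signature(sigma)) % n
    b = sum(sigma) % 4
    alpha = sigma[-1]
    beta = next(s for s in range(4) if s not in (alpha, flip(alpha)))
    symbols = [beta] + quaternary(a, k_prime // 2) + [b]
    p = []
    for s in symbols:
        p.extend([s, flip(s)])  # every symbol is followed by its flipped copy
    return a, b, p


sigma = [0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 1, 3, 1, 0, 2, 3]  # n = 16, k' = 4
a, b, p = indel_suffix(sigma)
```

The suffix has $2(1 + k'/2 + 1) = k' + 4$ symbols, which matches the stated extra redundancy of $\log n + 4$.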

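Theorem 34. The $(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}})$-error-control encoder is correct. In other words, $\mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{indel}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and capable of correcting a single indel for all $x \in \{0,1\}^m$.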

Proof. Let $\sigma = \mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{indel}})}(x)$. It is easy to show that $\sigma$ is $\epsilon$-balanced and $\ell$-runlength limited (refer to the proof of Theorem 26). It remains to show that $\sigma$ can correct a single indel. To do so, we provide an efficient decoding algorithm. Suppose that there is a deletion (or insertion) in the received sequence $\sigma'$ (this can be observed from the length of the received sequence). Without loss of generality, assume that the error is a deletion. The decoder proceeds as follows.

Localizing the deletion. Let $p'$ be the suffix of length $k' + 4$ of $\sigma'$. Assume that $p' = p'_1 p'_2 \cdots p'_{k'+4}$.

• If $p'_2 = f(p'_1)$ then we conclude that there is no deletion in $p$ and therefore $p' \equiv p$.

• If $p'_2 \ne f(p'_1)$ then we conclude that there is a deletion in $p$.

Recovering $\sigma$.

• If there is no deletion in $p$, i.e. $p' \equiv p$, let $\sigma''$ be the sequence obtained by removing the suffix $p$ from $\sigma'$. Note that the syndrome $a$ and the sum of symbols $b$ of the original prefix are known from $p$. We then set $y = \mathrm{DEC}^{T}_{(a,b)}(\sigma'')$, and use the $(\epsilon, \ell)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(y)$.

• If there is a deletion in $p$, we do not need to do error correction here, and we remove the suffix of length $k' + 3$ from $\sigma'$. We then use the $(\epsilon, \ell)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma')$.

In conclusion, $\mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{indel}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and can correct a single indel for all $x \in \{0,1\}^m$.
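The localization step of the proof can be sketched and tested over every deletion position. The test works because $\beta \notin \{\alpha, f(\alpha)\}$: when a symbol of $p$ is deleted, the length-$(k'+4)$ suffix starts with $\alpha$, and its second symbol is either $\beta$ or $f(\beta)$, neither of which equals $f(\alpha)$. The flipping rule is assumed as before, and the toy prefix and suffix below are chosen by hand for illustration.

```python
def flip(symbol):
    return symbol ^ 2  # assumed flipping rule: 0 <-> 2, 1 <-> 3


def deletion_hit_suffix(received, suffix_len):
    """True if the tail of length suffix_len does not start with a pair (s, f(s))."""
    tail = received[-suffix_len:]
    return tail[1] != flip(tail[0])


def strip_damaged_suffix(received, suffix_len):
    """After a deletion inside p, the last suffix_len - 1 symbols are discarded."""
    return received[:-(suffix_len - 1)]


prefix = [0, 1, 2, 3, 0, 1, 2, 3]   # stands for the constrained part; alpha = 3
suffix = [0, 2, 1, 3, 2, 0, 3, 1]   # beta = 0, then pairs (s, f(s))
codeword = prefix + suffix

results = []
for i in range(len(codeword)):
    received = codeword[:i] + codeword[i + 1:]
    hit = deletion_hit_suffix(received, len(suffix))
    results.append((i, hit))
```

When the deletion is located in the suffix, the prefix is intact and can be passed directly to the constrained decoder; otherwise the suffix is intact and supplies the syndrome and symbol sum for the indel decoder.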

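Corollary 35. Let $M = n + \log n + 4$. There exists a linear-time decoding algorithm $\mathrm{DEC}_{\mathrm{indel}} : \Sigma_4^{M*} \to \mathcal{C}(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}}) \cap \Sigma_4^{M}$ such that the following holds. If $\sigma = \mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{indel}})}(x)$ and $\sigma' \in \mathcal{B}_{\mathrm{indel}}(\sigma)$, then $\mathrm{DEC}_{\mathrm{indel}}(\sigma') = \sigma$.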

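For completeness, we describe the corresponding $(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}})$-error-control decoder as follows.

$(\epsilon, \ell; \mathcal{B}_{\mathrm{indel}})$-Error-Control Decoder.

INPUT: $\sigma' \in \Sigma_4^{(n+k'+4)*}$

OUTPUT: $x \triangleq \mathrm{DEC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{indel}})}(\sigma') \in \{0,1\}^m$

(I) Let $\sigma = \mathrm{DEC}_{\mathrm{indel}}(\sigma') \in \Sigma_4^{n+k'+4}$

(II) Use the $(\epsilon, \ell)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma) \in \{0,1\}^m$

(III) Output $x$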

C. Construction of $(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}})$-Error-Control Codes

We follow the construction in Corollary 33 to encode DNA sequences that are capable of correcting a single edit. We simply append the information of the syndromes of $U_\sigma$ and $L_\sigma$ to the end of each codeword. In addition, we also use the idea of the Index Encoder (refer to Section IV-D) to ensure that the redundant part is balanced and $\ell$-runlength limited. The extra redundancy is $2 \log n + 4$. For simplicity, assume that $k' = \log n$ is an integer and $k'$ is even.

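$(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}})$-Error-Control Encoder. Let $n$ be even and $\epsilon, \ell > 0$. Set $k \triangleq \lceil \log_4(\lfloor 1/(2\epsilon) \rfloor + 1) \rceil$. Set $m = 2n - 2(r_{\mathrm{RLL}} + 2k + 4)$ and $N = n - 2k - 4$. Let $S_{\epsilon,n-2k-4}$ be the set of indices as defined by (2), and construct a one-to-one correspondence between the indices in $S_{\epsilon,n-2k-4}$ and length-$k$ quaternary sequences. Set $k' = \log n$.

INPUT: $x \in \{0,1\}^m$

OUTPUT: $\sigma \triangleq \mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{edit}})}(x) \in \mathcal{C}(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}}) \cap \Sigma_4^{n + 2\log n + 4}$

(I) Use the $(\epsilon, \ell)$-constrained encoder to obtain $\sigma' = \mathrm{ENC}_{(\epsilon,\ell)}(x) \in \Sigma_4^n$, where $\sigma'$ is $\epsilon$-balanced and $\ell$-runlength limited

(II) Let $\alpha$ be the last symbol of $\sigma'$. Let $\beta$ be an arbitrary symbol in $\Sigma_4 \setminus \{\alpha, f(\alpha)\}$

(III) Let $a = \mathrm{Syn}(U_{\sigma'}) \pmod{n+1}$, $b = \mathrm{Syn}(L_{\sigma'}) \pmod{n+1}$, and $c = \sum_{i=1}^{n} \sigma'_i \pmod{4}$

(IV) Let $\tau_1 \tau_2 \cdots \tau_{k'/2}$ be the quaternary representation of $a$, and $\nu_1 \nu_2 \cdots \nu_{k'/2}$ be the quaternary representation of $b$

(V) Set $p = \beta f(\beta) \tau_1 f(\tau_1) \tau_2 f(\tau_2) \cdots \tau_{k'/2} f(\tau_{k'/2}) \, \nu_1 f(\nu_1) \nu_2 f(\nu_2) \cdots \nu_{k'/2} f(\nu_{k'/2}) \, c f(c)$

(VI) Output $\sigma = \sigma' p$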

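Theorem 36. The $(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}})$-error-control encoder is correct. In other words, $\mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{edit}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and capable of correcting a single edit for all $x \in \{0,1\}^m$.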

Proof. Let $\sigma = \mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{edit}})}(x)$. It is easy to show that $\sigma$ is $\epsilon$-balanced and $\ell$-runlength limited (refer to the proof of Theorem 26). It remains to show that $\sigma$ can correct a single edit. To do so, we provide an efficient decoding algorithm. Suppose the received sequence is $\sigma'$. The idea is to recover the first $n$ symbols in $\sigma$ and then use the $(\epsilon, \ell)$-constrained decoder to recover the information sequence $x$. First, the decoder decides whether a deletion, insertion or substitution has occurred. Note that this information can be recovered by simply observing the length of the received sequence. The decoding operates as follows.

(i) If the length of $\sigma'$ is exactly $n + 2 \log n + 4$, we conclude that at most a single substitution has occurred.

• Let $p'$ be the suffix of length $2 \log n + 4$ of $\sigma'$, and $p' = p'_1 p'_2 \cdots p'_{2k'+4}$.

• Let $\sigma''$ be the prefix of length $n$ of $\sigma'$. The decoder computes $\mathrm{Syn}(U_{\sigma''})$ and $\mathrm{Syn}(L_{\sigma''}) \pmod{n+1}$.

• Let $a'$ be the integer whose quaternary representation is $p'_3 p'_5 \cdots p'_{k'+1}$, let $b'$ be the integer whose quaternary representation is $p'_{k'+3} p'_{k'+5} \cdots p'_{2k'+1}$, and let $c' = p'_{2k'+3}$.

• If $c'$ is the sum of symbols in $\sigma''$, then there is no error in $\sigma''$. The decoder proceeds to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma'')$. Otherwise, if $a' = \mathrm{Syn}(U_{\sigma''})$ and $b' = \mathrm{Syn}(L_{\sigma''})$, then there is no error in $\sigma''$, and the decoder proceeds to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma'')$. On the other hand, if either one of these statements is false, there is an error in $\sigma''$. The decoder sets $y = \mathrm{DEC}^{L}_{a'}(U_{\sigma''})$ and $z = \mathrm{DEC}^{L}_{b'}(L_{\sigma''})$. Finally, $\sigma = \Psi^{-1}(y \| z)$ and the decoder returns $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma)$.

(ii) If the length of $\sigma'$ is exactly $n + 2 \log n + 3$, we conclude that a single deletion has occurred (the case of a single insertion can be handled similarly). The decoder proceeds as follows.

• Let $p'$ be the suffix of length $2 \log n + 4$ of $\sigma'$, and $p' = p'_1 p'_2 \cdots p'_{2k'+4}$.

• If $p'_2 \ne f(p'_1)$, the decoder concludes that there is a deletion in $p$. The decoder removes the suffix of length $2k' + 3$ from $\sigma'$, then uses the $(\epsilon, \ell)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma')$.

• If $p'_2 = f(p'_1)$, the decoder concludes that there is no deletion in $p$ and therefore $p' \equiv p$. Let $\sigma''$ be the sequence obtained by removing the suffix $p$ from $\sigma'$. Note that the syndromes $a$ and $b$ of the original upper and lower sequences are known from $p$. The decoder sets $y = \mathrm{DEC}^{L}_{a}(U_{\sigma''})$ and $z = \mathrm{DEC}^{L}_{b}(L_{\sigma''})$. Finally, $\sigma = \Psi^{-1}(y \| z)$ and the decoder returns $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma)$.

In conclusion, $\mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{edit}})}(x)$ is $\epsilon$-balanced, $\ell$-runlength limited, and can correct a single edit for all $x \in \{0,1\}^m$.

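Corollary 37. Let $M = n + 2 \log n + 4$. There exists a linear-time decoding algorithm $\mathrm{DEC}_{\mathrm{edit}} : \Sigma_4^{M*} \to \mathcal{C}(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}}) \cap \Sigma_4^{M}$ such that the following holds. If $\sigma = \mathrm{ENC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{edit}})}(x)$ and $\sigma' \in \mathcal{B}_{\mathrm{edit}}(\sigma)$, then $\mathrm{DEC}_{\mathrm{edit}}(\sigma') = \sigma$.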

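For completeness, we describe the corresponding $(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}})$-error-control decoder as follows.

$(\epsilon, \ell; \mathcal{B}_{\mathrm{edit}})$-Error-Control Decoder.

INPUT: $\sigma' \in \Sigma_4^{(n + 2\log n + 4)*}$

OUTPUT: $x \triangleq \mathrm{DEC}_{(\epsilon,\ell;\mathcal{B}_{\mathrm{edit}})}(\sigma') \in \{0,1\}^m$

(I) Let $\sigma = \mathrm{DEC}_{\mathrm{edit}}(\sigma') \in \Sigma_4^{n + 2\log n + 4}$

(II) Use the $(\epsilon, \ell)$-constrained decoder to obtain $x = \mathrm{DEC}_{(\epsilon,\ell)}(\sigma) \in \{0,1\}^m$

(III) Output $x$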

Remark 38. We use $r_{\mathrm{error}}$ to denote the redundancy needed to correct a single indel or edit error. When $\mathcal{B} = \mathcal{B}_{\mathrm{indel}}$, $r_{\mathrm{error}} = \log n + 4$, and when $\mathcal{B} = \mathcal{B}_{\mathrm{edit}}$, $r_{\mathrm{error}} = 2 \log n + 4$. Since $\frac{\log n}{n} \to 0$ and $r_{\mathrm{GC}} = O(1)$ is a constant, the rate of this encoder approaches the rate of the RLL Encoder, and if we use RLL Encoder A then the rate of the $(\epsilon, \ell; \mathcal{B})$-error-control encoder approaches the capacity for sufficiently large $n$.

VII. CONCLUSION

We have presented novel and efficient encoders that translate binary data into strands of nucleotides which satisfy the RLL constraint and the GC-content constraint, and which are capable of correcting a single edit and its variants. Our proposed codes achieve higher rates than previous results and approach capacity, and they have low encoding/decoding complexity and limited error propagation.

REFERENCES

[1] S. Yazdi, R. Gabrys, and O. Milenkovic, "Portable and error-free DNA-based data storage," Scientific Reports, vol. 7, no. 5011, 2017.

[2] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6102, pp. 1628-1628, 2012.

[3] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, "Towards practical, high-capacity, low-maintenance information storage in synthesized DNA," Nature, vol. 494, no. 7435, pp. 77-80, 2013.

[4] Y. Erlich and D. Zielinski, "DNA fountain enables a robust and efficient storage architecture," Science, vol. 355, no. 6328, pp. 950-954, 2017.

[5] L. Organick, S. Ang, Y. J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman, H. Y. Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss, "Random access in large-scale DNA data storage," Nature Biotechnology, vol. 36, no. 3, pp. 242-248, 2018.

[6] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, "Characterizing and measuring bias in sequence data," Genome Biology, vol. 14, 2013.

[7] R. Heckel, G. Mikutis, and R. N. Grass, "A Characterization of the DNA Data Storage Channel," Scientific Reports, Jul. 2019.

[8] K. A. S. Immink and K. Cai, "Design of Capacity-Approaching Constrained Codes for DNA-Based Data Storage Systems," IEEE Communications Letters, vol. 22, no. 2, pp. 224-227, 2018.

[9] K. Cai, Y. M. Chee, R. Gabrys, H. M. Kiah, and T. T. Nguyen, "Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage," preprint, arXiv:1910.06501, 2019.

[10] R. Gabrys, E. Yaakobi, and O. Milenkovic, "Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction," IEEE Trans. Inform. Theory, vol. 64, no. 4, 2018.

[11] W. Song, K. Cai, M. Zhang, and C. Yuen, "Codes with Run-Length and GC-Content Constraints for DNA-based Data Storage," IEEE Communications Letters, vol. 22, no. 10, pp. 2004-2007, Oct. 2018.

[12] D. Dube, W. Song, and K. Cai, "DNA Codes with Run-Length Limitation and Knuth-Like Balancing of the GC Contents," Symposium on Information Theory and its Applications (SITA), Japan, Nov. 2019.

[13] P. Yakovchuk, E. Protozanova, and M. D. Frank-Kamenetskii, "Base-stacking and base-pairing contributions into thermal stability of the DNA double helix," Nucl. Acids Res., vol. 34, no. 2, pp. 564-574, 2006.

[14] D. E. Knuth, "Efficient Balanced Codes," IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan. 1986.

[15] A. J. de Lind van Wijngaarden and K. A. S. Immink, "Construction of Maximum Run-Length Limited Codes Using Sequence Replacement Techniques," IEEE Journal on Selected Areas of Communications, vol. 28, pp. 200-207, 2010.

[16] O. Elishco, R. Gabrys, M. Medard, and E. Yaakobi, "Repeated-Free Codes," Proc. IEEE Int. Symp. Inf. Theory (ISIT), Paris, France, 2019.

[17] C. Schoeny, A. Wachter-Zeh, R. Gabrys, and E. Yaakobi, "Codes correcting a burst of deletions or insertions," IEEE Trans. Inform. Theory, vol. 63, no. 4, pp. 1971-1985, 2017.

[18] J. P. M. Schalkwijk, "An algorithm for source coding," IEEE Trans. Inf. Theory, vol. IT-18, pp. 395-399, 1972.

[19] N. Alon, E. E. Bergmann, D. Coppersmith, and A. M. Odlyzko, "Balancing sets of vectors," IEEE Trans. Inf. Theory, vol. IT-34, no. 1, pp. 128-130, Jan. 1988.

[20] V. Skachek and K. A. S. Immink, "Constant Weight Codes: An Approach Based on Knuth's Balancing Method," IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, May 2014.

[21] L. G. Tallini, R. M. Capocelli, and B. Bose, "Design of some new balanced codes," IEEE Trans. Inf. Theory, vol. IT-42, pp. 790-802, May 1996.

[22] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Doklady Akademii Nauk SSSR, vol. 163, no. 4, pp. 845-848, 1965.

[23] G. Tenengolts, "Nonbinary codes, correcting single deletion or insertion," IEEE Trans. Inf. Theory, vol. 30, no. 5, pp. 766-769, 1984.