All content in this area was uploaded by Kees Schouhamer Immink on Apr 21, 2020


Received February 27, 2020, accepted March 7, 2020, date of publication March 11, 2020, date of current version March 19, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2980036

Properties and Constructions of Constrained Codes for DNA-Based Data Storage

KEES A. SCHOUHAMER IMMINK¹ (Life Fellow, IEEE) AND KUI CAI² (Senior Member, IEEE)

¹Turing Machines Inc., 3016 DK Rotterdam, The Netherlands

²Singapore University of Technology and Design (SUTD), Singapore 487372

Corresponding author: Kees A. Schouhamer Immink (immink@turing-machines.com)

This work was supported by the Singapore Ministry of Education Academic Research Fund Tier 2 under Grant MOE2016-T2-2-054.

ABSTRACT We describe properties and constructions of constraint-based codes for DNA-based data storage which account for the maximum repetition length and AT/GC balance. Generating functions and approximations are presented for computing the number of sequences with maximum repetition length and AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of codes that account for both the maximum repetition length and AT/GC balance are derived.

INDEX TERMS Constrained coding, maximum runlength, balanced words, storage systems, DNA-based storage.

I. INTRODUCTION

The first large-scale archival DNA-based storage architecture was implemented by Church et al. [1] in 2012. Blawat et al. [2] described successful experiments for storing and retrieving data blocks of 22 Mbyte of digital data in synthetic DNA. Erlich and Zielinski [3] further explored the limits of storage capacity of DNA-based storage architectures. Recent examples of experimental work on DNA-based storage can be found in [4]–[6].

Naturally occurring DNA consists of four types of nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). A DNA strand (or oligonucleotide, or oligo for short) is a linear sequence of these four nucleotides that is composed by a DNA synthesizer. Binary source, or user, data are translated into the four types of nucleotides, for example, by mapping two binary source symbols into a single nucleotide (nt for short).

Strings of nucleotides should satisfy a few elementary conditions, called constraints, in order to be less error prone. Repetitions of the same nucleotide, a homopolymer run, significantly increase the chance of sequencing errors [7], [8], so that long runs should be avoided. For example, in [8], experimental studies show that once the homopolymer run is longer than four nt, the sequencing error rate starts increasing significantly. In addition, [8] also reports that oligos with large unbalance between GC and AT content exhibit high dropout rates and are prone to polymerase chain reaction (PCR) errors, and should therefore be avoided.

The associate editor coordinating the review of this manuscript and approving it for publication was Nadeem Iqbal.

Blawat’s format [2] incorporates a constrained code that

uses a look-up table for translating binary source data

into strands of nucleotides with a homopolymer run of

length at most three. Blawat’s format did not incorpo-

rate an AT/GC balance constraint. Strands that do not sat-

isfy both the maximum homopolymer run requirement and

the weak balance constraint are barred in Erlich’s coding

format [3].

In this paper, we describe properties and constructions of quaternary constraint-based codes for DNA-based storage which account for a maximum homopolymer run and maximum unbalance between AT and GC contents. Binary ‘balanced’ and runlength limited sequences have found widespread use in data communication and storage practice [9]. We show that constrained binary sequences can easily be translated into constrained quaternary sequences, which opens the door to a wealth of efficient binary code constructions for application in DNA-based storage [10]–[13]. A further advantage of the binary-to-binary translation instead of a ‘direct’ binary-to-quaternary translation is the lower complexity of encoding and decoding look-up tables.

VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 49523

K. A. S. Immink, K. Cai: Properties and Constructions of Constrained Codes for DNA-Based Data Storage

The disadvantage is, as we show, the loss in information capacity of the binary versus the quaternary approach.

We start in Section II with a description of the limiting properties and code constructions that impose a maximum homopolymer run. We specifically compute and compare the information capacity of binary versus ‘direct’ quaternary coding techniques. In Section III, we enumerate the number of binary and quaternary sequences with combined AT and GC contents and run-length constraints. Section IV concludes the paper.

II. MAXIMUM RUNLENGTH CONSTRAINT

Long repetitions of the same nucleotide (nt), called a homopolymer run or runlength, may significantly increase the chance of sequencing errors [7], [8], and should be avoided. Avoiding long runs of the same nucleotide will result in loss of information capacity, and codes are required for translating arbitrary source data into constrained quaternary strings. Binary runlength limited (RLL) codes have found widespread application in digital communication and storage devices since the 1950s [9], [14]. McLaughlin et al. [15] studied multi-level runlength limited codes for optical recording. An oligo of n nucleotides, that is, a string of n 4-ary symbols, can be seen as two parallel binary strings of length n, where each 4-ary symbol is represented by two binary symbols. Such a system of multiple parallel data streams with joint constraints is reminiscent of ‘two-dimensional’ track systems, which have been studied by Marcellin and Weber [16].

We start in the next subsection with the counting of

q-ary sequences that satisfy a maximum runlength, followed

by subsections where we describe limiting properties and

code constructions that avoid m+1 repetitions of the same

nucleotide.

A. COUNTING q-ARY SEQUENCES, CAPACITY

Let the number of q-ary n-length sequences having a maximum run, $m$, of the same symbol be denoted by $N_q(m,n)$. The number $N_q(m,n)$ is found by using the recursive relation [17, Part 1]:

$$N_q(m,n) = \begin{cases} q^n, & n \le m, \\ (q-1)\sum_{k=1}^{m} N_q(m,n-k), & n > m. \end{cases} \tag{1}$$

For $n \le m$ the above is trivial, as all sequences satisfy the maximum runlength constraint. For $n > m$ we follow Shannon’s approach [17] for the discrete noiseless channel. A run of $k$ symbols $a$ can be seen as a ‘phrase’ $a$ of length $k$. After a phrase $a$ has been emitted, a phrase of symbols $b \neq a$ of length $k$, $1 \le k \le m$, can be emitted without violating the imposed maximum runlength constraint. The total number of allowed sequences, $N_q(m,n)$, is equal to $(q-1)$ times the sum of the numbers of sequences ending with a phrase of length $k = 1, 2, \ldots, m$, which are equal to $N_q(m,n-k)$. Adding these numbers yields (1). Using the above expression, we may easily compute the feasibility of a q-ary m-constrained code for relatively small values of $n$ where a coding look-up table is practicable; see Subsection II-C for more details.
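The recursion (1) translates directly into a few lines of code. The following is a minimal Python sketch; the function name `count_rll` is ours and not part of the paper.

```python
def count_rll(q: int, m: int, n: int) -> int:
    """Number of q-ary sequences of length n with maximum run m, following (1)."""
    counts = []  # counts[i] holds N_q(m, i + 1)
    for length in range(1, n + 1):
        if length <= m:
            # Every sequence of length <= m satisfies the constraint.
            counts.append(q ** length)
        else:
            # (q - 1) times the sum of N_q(m, length - k) for k = 1..m.
            counts.append((q - 1) * sum(counts[length - 1 - k] for k in range(1, m + 1)))
    return counts[n - 1]
```

For instance, `count_rll(4, 3, 5)` gives the value $N_4(3,5) = 996$ used in the example of Subsection II-C.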

1) GENERATING FUNCTIONS

Generating functions are a very useful tool for enumerating constrained sequences [18], and they offer tools for approximating the number of constrained sequences for asymptotically large values of the sequence length $n$. The series of numbers $\{N_q(m,n)\}$, $n = 1, 2, \ldots$, in (1), can be compactly written as the coefficients of a formal power series $H_{q,m}(x) = \sum_i N_q(m,i)\,x^i$, where $x$ is a dummy variable. There is a simple relationship between the generating function, $H_{q,m}(x)$, and the linear homogeneous recurrence relation (1) with constant coefficients that defines the same series [18]. We first define a generating function

$$G(x) = \sum_i g_i x^i. \tag{2}$$

Let the operation $[x^n]G(x)$ denote the extraction of the coefficient of $x^n$ in the formal power series $G(x)$, that is, define

$$[x^n]\sum_i g_i x^i = g_n. \tag{3}$$

Let

$$T(x) = \sum_{i=1}^{m} x^i. \tag{4}$$

The generating function for the number of q-ary sequences with a maximum runlength $m$ is

$$qT(x) + q(q-1)T(x)^2 + q(q-1)^2 T(x)^3 + \cdots.$$

We may rewrite the above as

$$\frac{qT(x)}{1-(q-1)T(x)},$$

so that the number of n-symbol m-constrained q-ary words is

$$N_q(m,n) = [x^n]\,\frac{qT(x)}{1-(q-1)T(x)}. \tag{5}$$
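As a cross-check of (5), the coefficients can be extracted numerically by summing the series $qT(x) + q(q-1)T(x)^2 + \cdots$ with all polynomials truncated after $x^n$. The sketch below is one possible way to do this; the helper names `poly_mul` and `rll_series` are illustrative assumptions.

```python
def poly_mul(a, b, n):
    """Product of two coefficient lists, truncated after the x^n term."""
    out = [0] * (n + 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                if i + j > n:
                    break
                out[i + j] += ai * bj
    return out

def rll_series(q, m, n):
    """Coefficients of x^0..x^n of q*T(x) / (1 - (q-1)*T(x)), T(x) = x + ... + x^m."""
    T = [0] + [1] * min(m, n) + [0] * max(0, n - m)  # T(x) truncated after x^n
    total = [0] * (n + 1)
    term = [q * c for c in T]                        # first term q*T(x)
    for _ in range(n):                               # T(x)^j has no terms below x^j
        total = [s + t for s, t in zip(total, term)]
        term = [(q - 1) * c for c in poly_mul(term, T, n)]
    return total
```

The entry `rll_series(q, m, n)[n]` should agree with the value obtained from the recursion (1).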

2) ASYMPTOTICAL BEHAVIOR

For asymptotically large codeword length $n$, the maximum number of (binary) user bits that can be stored per q-ary symbol, called (information) capacity, denoted by $C_q(m)$, is given by [17]

$$C_q(m) = \lim_{n\to\infty} \frac{1}{n}\log_2 N_q(m,n) = \log_2 \lambda_q(m), \tag{6}$$

where $\lambda_q(m)$ is the largest real root of the characteristic equation [15], [17]

$$x^{m+1} - qx^m + q - 1 = 0. \tag{7}$$

Table 1 shows the information capacities $C_2(m)$ and $C_4(m)$ versus maximum allowed (homopolymer) run $m$. For asymptotically large $n$ we may approximate $N_q(m,n)$ by [18]

$$N_q(m,n) \approx A_q(m)\,\lambda_q^n(m). \tag{8}$$


TABLE 1. Capacity C2(m) and C4(m) versus m.

TABLE 2. Coefficient A2(m) and A4(m) versus m.

The coefficient $A_q(m)$ is found, see [14, pp. 157–158], by rewriting $H_{q,m}(x)$ as a quotient of two polynomials, or $H_{q,m}(x) = r(x)/p(x)$. Then

$$A_q(m) = -\lambda_q(m)\,\frac{r(1/\lambda_q(m))}{p'(1/\lambda_q(m))}. \tag{9}$$

Table 2 shows the coefficients $A_2(m)$ and $A_4(m)$ versus $m$. For $m = 1$, we simply find $N_4(1,n) = 4 \cdot 3^{n-1}$. We found that the approximation (8) is remarkably accurate. For a typical example, $N_4(2,10) = 676836$, while the approximation using (8) yields $N_4(2,10) \approx 676835.9769$. The redundancy of a 4-ary string of length $n$ with a maximum runlength $m$, denoted by $r_4(m,n)$, is, using (8),

$$r_4(m,n) = 2n - \log_2 N_4(m,n) \approx n(2 - C_4(m)) - \log_2 A_4(m). \tag{10}$$
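The quantities in (6) to (9) are easy to evaluate numerically. The sketch below assumes that $\lambda_q(m)$ lies in the interval $[q-1, q]$ and locates it by bisection, and takes $r(x) = qT(x)$ and $p(x) = 1-(q-1)T(x)$ from (5); the function names are ours.

```python
import math

def largest_root(q, m, iters=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0, by bisection on [q-1, q]."""
    lo, hi = float(q - 1), float(q)   # the polynomial is <= 0 at q-1 and > 0 at q
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def capacity(q, m):
    """C_q(m) = log2(lambda_q(m)), see (6)."""
    return math.log2(largest_root(q, m))

def coefficient_a(q, m):
    """A_q(m) from (9) with r(x) = q*T(x) and p(x) = 1 - (q-1)*T(x)."""
    lam = largest_root(q, m)
    x = 1 / lam
    r = q * sum(x ** i for i in range(1, m + 1))
    dp = -(q - 1) * sum(i * x ** (i - 1) for i in range(1, m + 1))  # p'(x)
    return -lam * r / dp

def approx_count(q, m, n):
    """Approximation (8): A_q(m) * lambda_q(m)^n."""
    return coefficient_a(q, m) * largest_root(q, m) ** n
```

With these helpers, `capacity(4, 3)` is close to the value 1.9824 quoted from Table 1, and `approx_count(4, 2, 10)` can be compared with the exact count 676836.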

B. BINARY-BASED RLL CODE CONSTRUCTION, CONSTRUCTION I

Yazdi et al. [19] and Taranalli et al. [20] showed that we may exploit binary maximum runlength limited (RLL) codes for constructing quaternary RLL codes. Their construction, denoted by Construction 1, exemplifies such a technique for $m > 1$. The construction is simple, but we show below that this simplicity has its price in terms of extra redundancy.

Construction 1: Let $u = (u_1, \ldots, u_n)$ be an n-bit RLL string. We merge the RLL n-bit string, $u$, with an n-bit source string $y = (y_1, \ldots, y_n)$, by using the addition $v_i = u_i + 2y_i$, $1 \le i \le n$, where $v = (v_1, \ldots, v_n)$, $v_i \in Q$, is the 4-ary output string. It is easily verified that the 4-ary output string, $v$, has maximum allowed run $m$, the same as the binary string $u$, since a run of equal symbols in $v$ implies a run of equal symbols in $u$.

The number of distinct 4-ary sequences, $v$, of Construction 1 equals $2^n N_2(m,n)$, so that the redundancy, denoted by $r_2(m,n)$, is

$$r_2(m,n) \approx n(1 - C_2(m)) - \log_2 A_2(m). \tag{11}$$
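The merging step of Construction 1 can be illustrated as follows. This is a small sketch under the definitions above; the helper names `max_run`, `merge`, and `split` are hypothetical.

```python
def max_run(seq):
    """Length of the longest run of identical symbols in seq."""
    longest = run = 0
    prev = None
    for s in seq:
        run = run + 1 if s == prev else 1
        longest = max(longest, run)
        prev = s
    return longest

def merge(u, y):
    """Construction 1: v_i = u_i + 2*y_i, with u the RLL string and y the source string."""
    return [ui + 2 * yi for ui, yi in zip(u, y)]

def split(v):
    """Recover the binary strings (u, y) from the 4-ary string v."""
    return [vi % 2 for vi in v], [vi // 2 for vi in v]
```

For example, with `u = [0, 0, 1, 1, 0, 1, 0, 0]` (maximum run 2) and an all-ones source string `y`, the merged string is `[2, 2, 3, 3, 2, 3, 2, 2]`, whose maximum run is also 2.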

TABLE 3. Asymptotic rate efficiency, $\eta(m)$, of binary Construction 1 versus maximum homopolymer run, $m$.

TABLE 4. Rate efficiency, $R_{m,0}/C_4(m)$, of binary Construction 1 versus strand length, $n$, and maximum homopolymer run, $m$.

The rate efficiency with respect to the runlength limited 4-ary channel, denoted by $\eta(m)$, is expressed by

$$\eta(m) = \frac{1 + C_2(m)}{C_4(m)}. \tag{12}$$

Table 3 lists results of computations. We may notice that Construction 1 will suffer a loss of up to 12% for $m = 2$. For larger values of $m$, however, the loss is negligible.
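The efficiency (12) can be evaluated with a short numerical sketch. It assumes, as a simplification, that the largest root of (7) lies in $[q-1, q]$ and is found by bisection; the function names are ours.

```python
import math

def largest_root(q, m, iters=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0 (bisection on [q-1, q])."""
    lo, hi = float(q - 1), float(q)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def efficiency(m):
    """eta(m) = (1 + C_2(m)) / C_4(m), see (12)."""
    c2 = math.log2(largest_root(2, m))
    c4 = math.log2(largest_root(4, m))
    return (1 + c2) / c4
```

For $m = 2$ this gives an efficiency of roughly 0.88, in line with the loss of about 12% mentioned above.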

The above asymptotic efficiency of Construction 1, $\eta(m)$, is valid for very large values of the strand length $n$. It is of practical interest to assess the efficiency for smaller values of the strand length. Construction 1 can be used with any binary RLL code, and there are many binary code constructions for generating maximum runlength constrained sequences, see [14] for an overview. We propose here, for the efficiency assessment, a simple two-mode block code of codeword length $n$. Runlength constrained codewords in the first mode start with a symbol ‘zero’, while codewords in the second mode start with a ‘one’. When the previously sent codeword ends with a ‘one’ we use the codewords from the first mode, and vice versa. The number of binary source words that can be accommodated with Construction 1 equals $2^{n-1}N_2(m,n)$, so that the code rate, denoted by $R_{m,0}$, is

$$R_{m,0} = \frac{1}{n}\left(n - 1 + \lfloor \log_2 N_2(m,n) \rfloor\right), \tag{13}$$

where we truncated the code size to the largest power of two. Table 4 shows selected outcomes of computations of the rate efficiency $R_{m,0}/C_4(m)$ versus $m$ and $n$.

C. ENCODING OF QUATERNARY SEQUENCES WITHOUT BINARY STEP

In this subsection, we investigate two simple constructions of codes that transform binary source words directly (that is, without an intermediate binary coding step) into 4-ary maximum homopolymer constrained codewords. An example of a simple 4-ary block code was presented by Blawat et al. [2]. The code converts 8 source bits into a 4-ary word of 5 nt. The 5-nt words can be cascaded without violating the prescribed $m = 3$ maximum homopolymer run. The rate of Blawat’s construction is $R = 8/5 = 1.6$. As $C_4(3) = 1.9824$, see Table 1, the (rate) efficiency of the construction is $R/C_4(3) = 0.807$. Alternative, and more efficient, constructions are described below.

TABLE 5. Rate efficiency, $R_{m,1}/C_4(m)$, of the two-mode code construction versus strand length, $n$, and maximum homopolymer run, $m$.

In the first construction, denoted by two-mode construction, each source word can be represented by one of two possible codewords, where the codeword sent is chosen to satisfy the runlength constraint at the junction of two cascaded codewords. Decoding is accomplished by observing the n-symbol codeword. In the second, slightly more efficient, construction, denoted by four-mode construction, a source word can be represented by four possible codewords. Decoding is accomplished by observing the n-symbol codeword plus the last symbol of the previous codeword.

1) TWO-MODE CONSTRUCTION

In this format, a source word can be represented by two n-symbol 4-ary m-constrained codewords, where the alternative representations differ at the first position. In case we append a new codeword to the previous codeword, we are always able to choose (at least) one representation whose first symbol differs from the last symbol of the previous codeword. Then, clearly, the cascaded string of 4-ary symbols satisfies the prescribed maximum homopolymer run constraint. The rate of this two-mode construction, denoted by $R_{m,1}$, is

$$R_{m,1} = \frac{1}{n}\left(\lfloor \log_2 N_4(m,n) \rfloor - 1\right), \tag{14}$$

where we truncated the code size to the largest power of two possible. Table 5 shows outcomes of computations of the rate efficiency $R_{m,1}/C_4(m)$ versus $m$ and $n$. We observe that, for $m = 2$, the ‘quaternary’ efficiency $R_{2,1}/C_4(2)$ is slightly better than the ‘binary’ $R_{2,0}/C_4(2)$, see Table 4. For $m > 2$, both approaches have the same efficiency. The conversion of the binary source symbols into the 4-ary n-nt strands and vice versa can be accomplished using two look-up tables of complexity $4^n$.
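The rates (13) and (14) can be compared for small strand lengths with the sketch below. It recomputes the counts from recursion (1); the function names are illustrative, and the floor of the base-2 logarithm is taken exactly via the bit length of the integer count.

```python
def count_rll(q, m, n):
    """N_q(m, n) from recursion (1)."""
    counts = []
    for length in range(1, n + 1):
        if length <= m:
            counts.append(q ** length)
        else:
            counts.append((q - 1) * sum(counts[-m:]))  # the m most recent counts
    return counts[-1]

def floor_log2(value):
    """Exact floor(log2(value)) for a positive integer."""
    return value.bit_length() - 1

def rate_binary(m, n):
    """R_{m,0} of the binary-based two-mode code, see (13)."""
    return (n - 1 + floor_log2(count_rll(2, m, n))) / n

def rate_two_mode(m, n):
    """R_{m,1} of the quaternary two-mode code, see (14)."""
    return (floor_log2(count_rll(4, m, n)) - 1) / n
```

For $n = 10$ and $m = 2$ the sketch yields $R_{2,0} = 1.6$ and $R_{2,1} = 1.8$, while for $m = 3$ both rates equal 1.8.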

2) FOUR-MODE CONSTRUCTION

In the above two-mode construction, the encoded codeword depends on the last symbol of the previous codeword. Decoding, however, is based on the observation of the $n$ symbols of the retrieved codeword. In the second construction, the codeword also depends on the last symbol of the previous codeword. Decoding, however, is accomplished by observing the $n$ symbols of the retrieved codeword plus the last symbol of the previous codeword. To that end, we define four tables of codewords, denoted by $L(i,a)$, where $i$, $1 \le i \le K$, denotes the decimal representation of the source word to be encoded, $K$ denotes the size of the table, and $a$ denotes the last symbol of the previous codeword. The four tables are constructed in such a way that the codewords in each table $L(i,a)$ do not start with the symbol $a$. As a result, the encoder always generates a symbol transition between the tail and nose symbols of consecutive codewords. The maximum size of the four tables equals $K = \frac{3}{4}N_4(m,n)$ (note that $N_4(m,n)$ is a multiple of 4). Table 6 shows a simple example of the encoding tables of a four-mode code for $n = 2$ and $m = 2$. The size of this code equals $K = 12$. Let, for example, the source sequence be ‘0’, ‘1’, ‘3’, ‘6’. Then, using the table, the encoded sequence is ‘10’, ‘11’, ‘03’, ‘22’. We may simply verify that the maximum runlength is $m = 2$. The code size is $K = 12$, while the code size of the two-mode code for $m = n = 2$ described above equals $16/2 = 8$. The table shows that the codeword ‘00’ is assigned to three source words, namely ‘0’, ‘4’, and ‘8’, so that ‘00’ cannot be decoded unambiguously by observing the codeword alone. Observation of the retrieved codeword plus the last symbol of the previous codeword resolves the ambiguity.

TABLE 6. Encoding tables of a four-mode code for $n = 2$ and $m = 2$. The parameter $i$ denotes the (decimal) representation of the source word. The tables $L(i,a)$, $a = 0, 1, 2, 3$, show the corresponding codeword, where $a$ denotes the last symbol of the previous codeword.

The rate of this four-mode construction, denoted by $R_{m,2}$, is

$$R_{m,2} = \frac{1}{n}\log_2\left(\frac{3}{4}N_4(m,n)\right). \tag{15}$$

Table 7 shows the rate efficiency of the four-mode construction. The efficiency improvement with respect to the two-mode construction, see Table 5, is obtained at the cost of four look-up tables instead of two.

Example: Let (as in Blawat’s code [2]) $n = 5$ and $m = 3$. We simply find, using (1), $N_4(3,5) = 996$, so that the code may accommodate $K = \frac{3}{4} \times 996 = 747$ binary source words. Since $K > 512 = 2^9$ we may implement a code of rate 9/5, which is 12% higher than that of Blawat’s code of rate 8/5. As we have the freedom of deleting $747 - 512 = 235$ redundant codewords, we may, for example, bar the words with the highest unbalance.

TABLE 7. Rate efficiency, $R_{m,2}/C_4(m)$, of the four-mode construction versus strand length, $n$, and maximum homopolymer run, $m$.
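A four-mode code of this kind can be sketched by brute-force enumeration for small $n$. The code below is an illustration only: the ordering of codewords within each table is our own (lexicographic) choice and will generally differ from the assignment in Table 6, source indices start at 0, and the initial value of `last` is an assumption.

```python
from itertools import product

def max_run(seq):
    """Length of the longest run of identical symbols in seq."""
    longest = run = 0
    prev = None
    for s in seq:
        run = run + 1 if s == prev else 1
        longest = max(longest, run)
        prev = s
    return longest

def build_tables(n, m):
    """tables[a] lists the m-constrained 4-ary n-tuples that do not start with symbol a."""
    words = [w for w in product(range(4), repeat=n) if max_run(w) <= m]
    return [[w for w in words if w[0] != a] for a in range(4)]

def encode(source, tables, last=0):
    """Map source indices to codewords; `last` is an assumed initial previous symbol."""
    out = []
    for i in source:
        word = tables[last][i]
        out.extend(word)
        last = word[-1]
    return out

def decode(symbols, tables, n, last=0):
    """Invert encode() using each codeword plus the last symbol of the previous one."""
    source = []
    for pos in range(0, len(symbols), n):
        word = tuple(symbols[pos:pos + n])
        source.append(tables[last].index(word))
        last = word[-1]
    return source
```

For $n = 5$ and $m = 3$ each of the four tables has 747 entries, matching $K = \frac{3}{4} \times 996$ in the example above.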

In the next section, we take a look at the combined AT and

GC contents balance and maximum polymer run constrained

codes.

III. COMBINED WEIGHT AND MAXIMUM RUN CONSTRAINED CODES

Oligos with large unbalance between GC and AT content

exhibit high dropout rates and are prone to polymerase chain

reaction (PCR) errors, and should therefore be avoided.

Avoidance of such undesired sequences implies an extra

redundancy. In this section, we compute the redundancy of

binary and quaternary codes with combined RLL and AT/GC

constraints.

A. DEFINITION AT/GC CONTENT, BALANCE, AND WEIGHT

We use the nucleotide alphabet $Q = \{0, 1, 2, 3\}$, where we propose the following relation between the four decimal symbols and the nucleotides: G = 0, C = 1, A = 2, and T = 3. The AT/GC content constraint stipulates that around half of the nucleotides should be either an A or a T nucleotide. In order to study AT-balanced nucleotides, we start with a few definitions. We define the weight or AT-content, denoted by $w_4(x)$, of the n-nucleotide oligo $x = (x_1, \ldots, x_n)$, $x_i \in Q$, as the number of occurrences of A or T, or

$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i), \tag{16}$$

where

$$\varphi(u) = \begin{cases} 0, & u < 2, \\ 1, & u > 1. \end{cases} \tag{17}$$

The weight of a binary word $x = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$, denoted by $w_2(x)$, is defined by

$$w_2(x) = \sum_{i=1}^{n} \varphi(2x_i) = \sum_{i=1}^{n} x_i. \tag{18}$$

If we write the 4-ary word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, as $x = y + 2z$, where $y_i$ and $z_i \in \{0, 1\}$, then

$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i) = \sum_{i=1}^{n} \varphi(2z_i) = w_2(z). \tag{19}$$
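The definitions (16) to (19) can be checked on a concrete oligo. In the sketch below the dictionary `NT_TO_SYMBOL` encodes the mapping G = 0, C = 1, A = 2, T = 3 proposed above; the function names and the example strand are ours.

```python
NT_TO_SYMBOL = {"G": 0, "C": 1, "A": 2, "T": 3}

def phi(u):
    """Indicator (17): 1 for the symbols 2 (A) and 3 (T), else 0."""
    return 0 if u < 2 else 1

def w4(x):
    """AT-content of a 4-ary word, see (16)."""
    return sum(phi(xi) for xi in x)

def w2(x):
    """Weight of a binary word, see (18)."""
    return sum(x)

def to_symbols(oligo):
    """Translate a nucleotide string into symbols of Q = {0, 1, 2, 3}."""
    return [NT_TO_SYMBOL[nt] for nt in oligo]
```

For the strand `"GATTACA"`, `to_symbols` gives `[0, 2, 3, 3, 2, 1, 2]` with AT-content 5, and the high-order bits `z = [xi // 2 for xi in x]` have the same binary weight, as stated by (19).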

Kerpez et al. [21], Braun and Immink [22], and Kurmaev [23] analyzed properties and constructions of binary combined weight and runlength constrained codes. Their results are straightforwardly applied to the quaternary case at hand. In the next subsections, we count binary and quaternary sequences that satisfy combined maximum runlength and weight constraints. We start by counting the number of binary sequences, $x$, of length $n$ that satisfy a maximum runlength constraint $m$ and have weight $w = w_2(x)$. Paluncic and Maharaj [24] enumerated this number for the balanced case $w = w_2(x) = n/2$.

B. COUNTING BINARY RLL SEQUENCES OF GIVEN WEIGHT

Define the bi-variate generating function $H(x,y)$ in the dummy variables $x$ and $y$ by

$$H(x,y) = \sum_{i,j} h_{i,j}\,x^i y^j, \tag{20}$$

and let $[x^{n_1}y^{n_2}]H(x,y)$ denote the extraction of the coefficient of $x^{n_1}y^{n_2}$ in the formal power series $\sum h_{i,j}\,x^i y^j$, or

$$[x^{n_1}y^{n_2}]\sum_{i,j} h_{i,j}\,x^i y^j = h_{n_1,n_2}. \tag{21}$$

Define

$$T_1(x,y) = \sum_{i=1}^{m} x^i y^i. \tag{22}$$

Let the sequence start with a run of zeros; then the generating function for the number of binary sequences with a maximum runlength $m$ is

$$T(x) + T(x)T_1(x,y) + T(x)^2 T_1(x,y) + T(x)^2 T_1(x,y)^2 + \cdots.$$

In case the sequence starts with a run of ones, we obtain for the generating function

$$T_1(x,y) + T(x)T_1(x,y) + T(x)T_1(x,y)^2 + T(x)^2 T_1(x,y)^2 + \cdots.$$

The generating function for the number of binary sequences with a maximum runlength $m$ starting with a run of ones or a run of zeros is the sum of the two above generating functions. Working out the sum yields

$$\frac{T_1(x,y) + T(x) + 2T_1(x,y)T(x)}{1 - T_1(x,y)T(x)},$$

so that the number of n-bit codewords, $x$, with maximum runlength $m$, denoted (with a slight abuse of notational convention by adding an extra parameter) by $N_2(m,w,n)$, that satisfy a given unbalance constraint $w = w_2(x)$ is given by

$$N_2(m,w,n) = [x^n y^w]\,\frac{T_1(x,y) + T(x) + 2T_1(x,y)T(x)}{1 - T_1(x,y)T(x)}. \tag{23}$$

With the above bi-variate generating function, we may exactly compute the number of binary m-constrained words of weight $w$.
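The numbers $N_2(m,w,n)$ can also be obtained without symbolic algebra. The sketch below uses a dynamic program over alternating runs of zeros and ones, which enumerates the same run decompositions as the generating function in (23); it is not the authors' implementation, and the function name is hypothetical.

```python
def count_binary_rll_weight(m, n):
    """Return a list c with c[w] = N_2(m, w, n)."""
    # ways[last][length][w]: sequences of that length and weight whose final run is of symbol `last`
    ways = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(2)]
    for length in range(1, n + 1):
        for last in (0, 1):
            for k in range(1, min(m, length) + 1):   # length of the final run
                dw = k * last                        # a run of ones adds k to the weight
                rest = length - k
                for w in range(dw, n + 1):
                    if rest == 0:
                        ways[last][length][w] += 1 if w == dw else 0
                    else:
                        # the preceding run must be of the other symbol
                        ways[last][length][w] += ways[1 - last][rest][w - dw]
    return [ways[0][n][w] + ways[1][n][w] for w in range(n + 1)]
```

Summing the returned list over all weights recovers $N_2(m,n)$, which gives a simple consistency check against recursion (1).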


More insight is gained by an approximation of $N_2(m,w,n)$. For a given maximum runlength, $m$, and asymptotically large $n$, we are specifically interested in the distribution of $\lim_{n\to\infty} N_2(m,w,n)/N_2(m,n)$ versus the weight $w$. The weight $w$ of a binary sequence of length $n$ is the sum of the runlengths of ones. The runlengths are random variables, so that for asymptotically large $n$, according to the Central Limit Theorem [18], the weight distribution approaches a Gaussian distribution with mean $n/2$ and variance denoted by $\sigma_2^2(m,n)$. Then

$$N_2(m,w,n) \approx G\!\left(w; \frac{n}{2}, \sigma_2^2(m,n)\right) N_2(m,n), \quad n \gg 1, \tag{24}$$

where

$$G(u; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2} \tag{25}$$

denotes the Gaussian distribution. The variance, $\sigma_2^2(m,n)$, of the Gaussian distribution is computed below.

1) COMPUTATION OF THE VARIANCE, $\sigma_2^2(m,n)$

Let $x$ be an infinitely long binary m-constrained sequence, where the probabilities of occurrence of the runlengths of zeros and ones are chosen to maximize the information rate (entropy) of the sequence. The probability of occurrence of a runlength of length $l$, $l \le m$, in a maxentropic sequence equals $\lambda_2^{-l}(m)$, see [14, Chapter 4], where for $q = 2$, see (7), $\sum_{l=1}^{m}\lambda_2^{-l}(m) = 1$. The average runlength, denoted by $\bar{l}$, equals

$$\bar{l} = \sum_{i=1}^{m} i\,\lambda_2^{-i}(m). \tag{26}$$

The runlength variance of an m-constrained sequence, denoted by $\mathrm{Var}(l)$, is

$$\mathrm{Var}(l) = \sum_{i=1}^{m} (i-\bar{l})^2\,\lambda_2^{-i}(m). \tag{27}$$

The weight variance, $\sigma_2^2(m,n)$, of the m-constrained sequence is

$$\sigma_2^2(m,n) = \gamma_2(m)\,\frac{n}{4}, \tag{28}$$

where

$$\gamma_2(m) = \frac{\mathrm{Var}(l)}{\bar{l}}.$$
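The coefficient $\gamma_2(m)$ follows from (26) and (27) once $\lambda_2(m)$ is known. The sketch below finds $\lambda_2(m)$ by bisection on the interval $[1, 2]$, which is an assumption of this illustration, as are the function names.

```python
def largest_root(q, m, iters=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0 (bisection on [q-1, q])."""
    lo, hi = float(q - 1), float(q)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def gamma2(m):
    """gamma_2(m) = Var(l) / mean(l) for maxentropic runlengths, see (26)-(28)."""
    lam = largest_root(2, m)
    probs = [lam ** (-l) for l in range(1, m + 1)]   # P(runlength = l)
    mean = sum(l * p for l, p in enumerate(probs, start=1))
    var = sum((l - mean) ** 2 * p for l, p in enumerate(probs, start=1))
    return var / mean
```

For $m = 2$, where $\lambda_2(2)$ is the golden ratio, this evaluates to about 0.171, and for large $m$ the value tends to 1, the unconstrained case with weight variance $n/4$.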

Table 8 shows results of computations (note that the parameter $\gamma_4(m)$ is explained in Section III-C). In order to verify the accuracy of the Gaussian approximation, we have numerically compared it with the (accurate) outcomes of the generating function. Figure 1 shows a comparison between the accurate and approximate distributions, $N_2(m,w,n)/N_2(m,n)$, for $n = 100$ and $m = 2, 3, 4$. Except for the discrepancy in the tails of the distributions, the accuracy of the Gaussian approximation is quite sufficient for engineering applications. The Gaussian approximation is accurate within a few percent within the two-sigma limits of the distribution.

TABLE 8. Coefficient $\gamma_2(m)$ and $\gamma_4(m)$ versus maximum homopolymer run $m$.

FIGURE 1. Comparison of the weight distribution of $N_2(m,w,n)/N_2(m,n)$, using (a) the Gaussian distribution (24) and (b) generating functions for $n = 100$ and $m = 2, 3, 4$.

C. COUNTING QUATERNARY RLL SEQUENCES OF GIVEN WEIGHT

We count the number of n-tuples $x$ of 4-ary symbols that satisfy a maximum runlength constraint, $m$, and have weight $w = w_4(x)$, denoted (with a slight abuse of notational convention) by $N_4(m,w,n)$.

1) MAXIMUM RUNLENGTH CONSTRAINT

For the special case m=1, Limbachiya et al. [25] presented a

closed-form expression of N4(1,w,n). For other values of the

prescribed maximum runlength, m, we may readily compute

the number of 4-ary sequences, N4(m,w,n), versus weight,

w=w4(x), by applying generating functions.

The 4-ary symbols are generated by a constrained data source that can be modelled as a four-state Moore-type finite-state machine. The machine steps from state to state, where, when state $i \in Q$ is visited, a sequence of $k$, $1 \le k \le m$, symbols ‘i’ is emitted. After visiting state $i$, the data source may not return to state $i$ (which would emit a further sequence of the same symbol ‘i’), but it enters state $j \neq i$, $j \in Q$. When the machine enters state 2 or 3, the word weight, $w$, is incremented by $k$, where $k$, $1 \le k \le m$, denotes the run of symbols ‘2’ or ‘3’. When, on the other hand, state 0 or 1 is entered, the weight increment is nil. The resulting $4 \times 4$ one-step skeleton or state-transition matrix, $D(x,y)$, of the finite-state machine is

$$D(x,y) = \begin{bmatrix} 0 & a_0 & a_0 & a_0 \\ a_0 & 0 & a_0 & a_0 \\ a_1 & a_1 & 0 & a_1 \\ a_1 & a_1 & a_1 & 0 \end{bmatrix}, \tag{29}$$

where $a_0 = T(x)$ and $a_1 = T_1(x,y)$. We are now in the position to write a general expression for $N_4(m,w,n)$. The number of 4-ary sequences of length $n$ with maximum runlength constraint $m$ and weight $w$ equals

$$N_4(m,w,n) = [x^n y^w]\,\frac{1}{3}\sum_{i,j}\sum_{k=1}^{n} d^{[k]}_{i,j}(x,y), \tag{30}$$

where $d^{[k]}_{i,j}(x,y)$ denotes the entries of $D^k(x,y)$. The entries $d^{[k]}_{i,j}(x,y)$ of $D^k(x,y)$ enumerate the sequences (paths) of $k$ runlengths starting in state $i$ and ending in state $j$. Summation over all possible numbers of runlengths $k \le n$ and all matrix entries, and division by three (each sequence of $k$ runs is counted once for each of the three admissible final states $j$), yields the generating function of $N_4(m,w,n)$, which proves (30).

TABLE 9. Number of balanced words, $N_4(m,\frac{n}{2},n)$, versus $m$ and $n$.
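For numerical work, $N_4(m,w,n)$ can be computed by a dynamic program over runs, which walks the same four-state machine as (29) and (30) without forming symbolic matrix powers. This is a sketch under the symbol assignment $Q = \{0,1,2,3\}$ with the symbols 2 and 3 carrying weight; the function name is ours.

```python
def count_quaternary_rll_weight(m, n):
    """Return a list c with c[w] = N_4(m, w, n); symbols 2 (A) and 3 (T) carry weight."""
    # ways[s][length][w]: sequences of that length and weight whose final run is of symbol s
    ways = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(4)]
    for length in range(1, n + 1):
        for s in range(4):
            for k in range(1, min(m, length) + 1):   # length of the final run
                dw = k if s >= 2 else 0
                rest = length - k
                for w in range(dw, n + 1):
                    if rest == 0:
                        ways[s][length][w] += 1 if w == dw else 0
                    else:
                        # the preceding run may use any of the three other symbols
                        ways[s][length][w] += sum(
                            ways[t][rest][w - dw] for t in range(4) if t != s)
    return [sum(ways[s][n][w] for s in range(4)) for w in range(n + 1)]
```

As a check, `count_quaternary_rll_weight(1, 16)[8]` reproduces the exact value 15873240 quoted later in this section for $N_4(1,8,16)$.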

Balanced codewords with $w = n/2$, $n$ even, play an important role. Table 9 shows outcomes of computations of $N_4(m,\frac{n}{2},n)$ using (30), for $m = 1, 2,$ and 3. The case $m = 1$ was earlier presented in [25]. Note that the integer sequence $N_4(1,\frac{n}{2},n)$ versus $n$ is also known as OEIS sequence A085363 (multiplied by 2), for which an alternative generating function is presented in [26].

Generating functions (30) allow us to accurately compute $N_4(m,w,n)$. For some applications, we may sacrifice accuracy for simplicity of the expression. In the next subsection, we derive a simple approximation to $N_4(m,w,n)$ valid for asymptotically large $n$ and small relative weight $w/n$.

2) ESTIMATE OF THE WEIGHT DISTRIBUTION

The weight $w_4(x)$ is the number of nucleotides A and T in the sequence $x$, see (19). Then, as in the binary case above, for asymptotically large $n$, according to the Central Limit Theorem, the weight distribution is approximately Gaussian, that is, we may conveniently approximate $N_4(m,w,n)$ by

$$N_4(m,w,n) \approx G\!\left(w; \frac{n}{2}, \sigma_4^2(m,n)\right) N_4(m,n), \quad n \gg 1, \tag{31}$$

where $\sigma_4^2(m,n)$ denotes the variance of the Gaussian weight distribution. The variance $\sigma_4^2(m,n)$ can be computed as follows.

3) COMPUTATION OF THE VARIANCE $\sigma_4^2(m,n)$

Let $u_i$, $i = 1, 2, \ldots$, $u_i \in Q$, be an infinitely long 4-ary sequence generated by a maxentropic source that satisfies a prescribed maximum runlength $m$. Although the 4-ary sequence $u_i$, $i = 1, 2, \ldots$, satisfies a limited runlength constraint, $m$, the runs of the binary weight sequence $v_i = \varphi(u_i)$, $i = 1, 2, \ldots$, see definition (17), are without any limit. The variance, $\sigma_4^2(m,n)$, of the Gaussian weight distribution is governed by the runlength distribution, $P(k)$, of the binary sequence $v_i$, where $P(k)$, $k > 0$, denotes the probability of occurrence of a runlength $k$. Clearly, $\sum_{k>0} P(k) = 1$. The probability $P(k)$ is proportional to the number of binary m-constrained sequences of length $k$, $N_2(m,k)$, times the probability of such a sequence, $\lambda_4^{-k}$, or

$$P(k) = c\,N_2(m,k)\,\lambda_4^{-k}, \quad k \ge 1, \tag{32}$$

where the normalization constant $c$ is chosen such that $\sum_{k=1}^{\infty} P(k) = 1$. The term $N_2(m,k)$ is the number of AT combinations of length $k$, which may consist of a single A or T run or a plurality of alternating A and T runs. Then we have

$$\sigma_4^2(m,n) = \gamma_4(m)\,\frac{n}{4}, \tag{33}$$

where, see [14, Chapter 4],

$$\gamma_4(m) = \frac{1}{\bar{l}}\sum_{k=1}^{\infty}(k-\bar{l})^2 P(k) \tag{34}$$

and

$$\bar{l} = \sum_{k=1}^{\infty} k\,P(k). \tag{35}$$

Table 8 shows results of computations of $\gamma_4(m)$ versus $m$. We infer from (31) and Table 8 that, for $n$ fixed, the weight distribution becomes wider with increasing maximum runlength $m$, see also Figure 1. Note that the above outcome is not consistent with the results by Erlich and Zielinski [3], as they assume a Gaussian balance distribution whose variance equals $n/4$, independent of $m$.

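An estimate of the number of balanced codewords, $N_4(m,\frac{n}{2},n)$, is

$$N_4\!\left(m,\frac{n}{2},n\right) \approx \frac{\sqrt{2}}{\sqrt{\pi\gamma_4(m)\,n}}\,N_4(m,n), \quad n \text{ even}. \tag{36}$$

For the case $m = 1$ we have (see [26], sequence A085363, for a similar result)

$$N_4\!\left(1,\frac{n}{2},n\right) \approx \frac{8}{\sqrt{\pi n}}\,3^{n-1}, \quad n \text{ even}. \tag{37}$$

Using the above approximation, we obtain, for example, that $N_4(1,8,16) \approx 16191008$, which is 2% higher than its exact value, 15873240, listed in Table 9.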
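The coefficient $\gamma_4(m)$ and the estimate (36) can be evaluated as sketched below. The infinite sums in (32) to (35) are truncated after `kmax` terms, which is an assumption of this illustration (the terms decay geometrically); the function names are ours.

```python
import math

def largest_root(q, m, iters=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0 (bisection on [q-1, q])."""
    lo, hi = float(q - 1), float(q)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def binary_counts(m, kmax):
    """[N_2(m, 1), ..., N_2(m, kmax)] from recursion (1) with q = 2."""
    counts = []
    for k in range(1, kmax + 1):
        counts.append(2 ** k if k <= m else sum(counts[-m:]))
    return counts

def gamma4(m, kmax=200):
    """gamma_4(m) from (32)-(35), with the sums truncated after kmax terms."""
    lam4 = largest_root(4, m)
    weights = [n2 * lam4 ** (-k) for k, n2 in enumerate(binary_counts(m, kmax), start=1)]
    total = sum(weights)                       # 1/c in (32)
    probs = [wk / total for wk in weights]
    mean = sum(k * p for k, p in enumerate(probs, start=1))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs, start=1))
    return var / mean

def balanced_estimate(m, n, n4):
    """Estimate (36) of N_4(m, n/2, n); n4 is the count N_4(m, n)."""
    return math.sqrt(2) / math.sqrt(math.pi * gamma4(m) * n) * n4
```

For $m = 1$ the sketch gives $\gamma_4(1) = 1/2$, and `balanced_estimate(1, 16, 4 * 3**15)` reproduces the approximate value quoted above.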

D. REDUNDANCY OF BINARY AND QUATERNARY CODES WITH COMBINED RLL AND AT/GC BALANCE CONSTRAINTS

For DNA-based storage, we do not require that the strands of the codebook, $S$, are strictly balanced, as a small unbalance, that is $\alpha_S \ll 1$, between the GC and AT content is permitted without affecting the error performance. Such a constraint is called a weak balance constraint. The relative unbalance of a word, $\alpha(x)$, is defined by $\alpha(x) = \left|\frac{w_4(x)}{n} - \frac{1}{2}\right|$. An n-nucleotide oligo is said to be balanced if $\alpha(x) = 0$. Code constructions for combined RLL and weak balanced codes have been published in [3], and for $m = 3$ in [27], [28].

FIGURE 2. Redundancy (bits), $r_4(a,n)$, versus word length, $n$, with the relative unbalance, $a$, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (39).

We first study the balance of sequences without an m-constraint. The number of 4-ary words of length $n$ with balance $w = w_4(x)$, denoted by $N_4(w,n)$, equals

$$N_4(w,n) = \binom{n}{w} 2^n. \tag{38}$$

The number of oligos, denoted by $N_{4,a}(n)$, of length $n$, whose relative unbalance, $\alpha(x) \le a$, is given by

$$N_{4,a}(n) = \sum_{\left|\frac{w}{n}-\frac{1}{2}\right|<a} N_4(w,n) = 2^n \sum_{\left|\frac{w}{n}-\frac{1}{2}\right|<a} \binom{n}{w}. \tag{39}$$

The redundancy of 4-ary nearly balanced strands, denoted by $r_4(a,n)$, equals

$$r_4(a,n) = \log_2\frac{4^n}{N_{4,a}(n)}. \tag{40}$$

Figure 2 shows examples of computations of the redundancy, $r_4(a,n)$, versus $n$ with the relative unbalance, $a$, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (39). The distribution for asymptotically large $n$ of $N_4(w,n)$ versus $w$ is approximately Gaussian shaped, that is

$$N_4(w,n) \approx G\!\left(w;\frac{n}{2},\frac{n}{4}\right)4^n, \quad n \gg 1, \tag{41}$$

so that the redundancy equals

$$r_{4,a}(n) \approx -\log_2\left[1 - 2Q(2a\sqrt{n})\right], \quad n \gg 1, \tag{42}$$

where the Q-function is defined by

$$Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-\frac{u^2}{2}}\,du. \tag{43}$$
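The exact redundancy (40) and its Gaussian approximation (42) can be compared numerically. The sketch below follows the strict inequality of the summation in (39) and takes the unbalance bound as a `Fraction` so that the boundary cases are decided exactly; the Q-function is expressed through the complementary error function. The function names are ours.

```python
import math
from fractions import Fraction

def redundancy_exact(a, n):
    """r_4(a, n) from (39)-(40); pass `a` as a Fraction so the strict bound is exact."""
    allowed = sum(math.comb(n, w) for w in range(n + 1)
                  if abs(Fraction(w, n) - Fraction(1, 2)) < a)
    return n - math.log2(allowed)   # log2(4^n / (2^n * allowed))

def q_function(x):
    """Gaussian tail probability Q(x), see (43)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def redundancy_gauss(a, n):
    """Approximation (42): -log2(1 - 2*Q(2*a*sqrt(n)))."""
    return -math.log2(1 - 2 * q_function(2 * float(a) * math.sqrt(n)))
```

For example, `redundancy_exact(Fraction(1, 10), 100)` and `redundancy_gauss(Fraction(1, 10), 100)` are both below 0.1 bit and differ by a few hundredths of a bit.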

We now study q-ary sequences with both an m-constraint and a given weight $w$. As in Construction 1, let the quaternary word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, be written as $x = y + 2z$, where the constituting elements $y_i$ and $z_i \in \{0, 1\}$. If the binary sequence $z$ is m-constrained and has weight $w = w_2(z)$, then $x$ is m-constrained and it has weight $w_4(x) = w$. Using (11), (24), and (31), we obtain, for $n \gg 1$, that the redundancy of q-ary sequences with combined RLL and balance constraints, denoted by $r_{q,a}(m,n)$, equals

$$r_{q,a}(m,n) \approx r_q(m,n) - \log_2\left[1 - 2Q\!\left(2a\sqrt{\frac{n}{\gamma_q(m)}}\right)\right]. \tag{44}$$

A numerical analysis of the above expression shows that the redundancy difference due to the balance (right hand) term is around 0.5-1 bit for $m = 2$. For larger values of the homopolymer run $m$ the extra redundancy is negligible for $n > 10$. The redundancy difference, $r_2(m,n) - r_4(m,n)$, due to the imposed runlength constraint is much larger for $n > 10$ than the redundancy due to the balance constraint.

IV. CONCLUSION

We have compared two coding approaches for constraint-based coding of DNA strings. In the first approach, an intermediate, ‘binary’, coding step is used, while in the second approach we ‘directly’ translate source data into constrained quaternary sequences. The binary approach is attractive as it yields a lower complexity of encoding and decoding look-up tables. The redundancy of the binary approach is higher than that of the quaternary approach for generating combined weight and run-length constrained sequences. The redundancy difference is small for larger values of the maximum homopolymer run. We have found exact and approximate expressions for the number of binary and quaternary sequences with combined weight and run-length constraints.

REFERENCES

[1] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6102, p. 1628, Sep. 2012.

[2] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso, B. W. Pruitt, and G. M. Church, "Forward error correction for DNA data storage," in Proc. Int. Conf. Comput. Sci. (ICCS), vol. 80, 2016, pp. 1011–1022.

[3] Y. Erlich and D. Zielinski, "DNA fountain enables a robust and efficient storage architecture," Science, vol. 355, no. 6328, pp. 950–954, Mar. 2017.

[4] J. Koch, S. Gantenbein, K. Masania, W. J. Stark, Y. Erlich, and R. N. Grass, "A DNA-of-things storage architecture to create materials with embedded memory," Nature Biotechnol., vol. 38, no. 1, pp. 39–43, Jan. 2020.

[5] Y. Wang, M. Noor-A-Rahim, J. Zhang, E. Gunawan, Y. L. Guan, and C. L. Poh, "High capacity DNA data storage with variable-length oligonucleotides using repeat accumulate code and hybrid mapping," J. Biol. Eng., vol. 13, no. 1, p. 89, Dec. 2019.

[6] L. Ceze, J. Nivala, and K. Strauss, "Molecular digital data storage using DNA," Nature Rev. Genet., vol. 20, no. 8, pp. 456–466, Aug. 2019.

[7] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig, "A DNA-based archival storage system," ACM SIGOPS Oper. Syst. Rev., vol. 50, pp. 637–649, 2016.

[8] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, "Characterizing and measuring bias in sequence data," Genome Biol., vol. 14, no. 5, p. R51, 2013.

[9] K. W. Cattermole, "Principles of digital line coding," Int. J. Electron., vol. 55, pp. 3–33, Jul. 1983.

[10] K. A. Schouhamer Immink and K. Cai, "Design of capacity-approaching constrained codes for DNA-based storage systems," IEEE Commun. Lett., vol. 22, no. 2, pp. 224–227, Feb. 2018.

[11] Y.-S. Kim and S.-H. Kim, "New construction of DNA codes with constant-GC contents from binary sequences with ideal autocorrelation," in Proc. IEEE Int. Symp. Inf. Theory Process., Jul. 2011, pp. 1569–1573.

[12] Y. M. Chee and S. Ling, "Improved lower bounds for constant GC-content DNA codes," IEEE Trans. Inf. Theory, vol. 54, no. 1, pp. 391–394, Jan. 2008.

[13] K. A. Schouhamer Immink and K. Cai, "Efficient balanced and maximum homopolymer-run restricted block codes for DNA-based data storage," IEEE Commun. Lett., vol. 23, no. 10, pp. 1676–1679, Oct. 2019.

[14] K. A. S. Immink, Codes for Mass Data Storage Systems, 2nd ed. Eindhoven, The Netherlands: Shannon Foundation, 2004.

[15] S. W. McLaughlin, J. Luo, and Q. Xie, "On the capacity of M-ary runlength-limited codes," IEEE Trans. Inf. Theory, vol. 41, no. 5, pp. 1508–1511, Sep. 1995.

[16] M. W. Marcellin and H. J. Weber, "Two-dimensional modulation codes," IEEE J. Sel. Areas Commun., vol. 10, no. 1, pp. 254–266, Jan. 1992.

[17] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.

[18] P. Flajolet and R. Sedgewick, Analytic Combinatorics. Cambridge, U.K.: Cambridge Univ. Press, 2009.

[19] S. M. Hossein, T. Yazdi, H. M. Kiah, and O. Milenkovic, "Weakly mutually uncorrelated codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Barcelona, Spain, Jul. 2016, pp. 2649–2653.

[20] V. Taranalli, H. Uchikawa, and P. H. Siegel, "Error analysis and inter-cell interference mitigation in multi-level cell flash memories," in Proc. IEEE Int. Conf. Commun. (ICC), London, U.K., Jun. 2015, pp. 271–276.

[21] K. J. Kerpez, A. Gallopoulos, and C. Heegard, "Maximum entropy charge-constrained run-length codes," IEEE J. Sel. Areas Commun., vol. 10, no. 1, pp. 242–253, Jan. 1992.

[22] V. Braun and K. A. Schouhamer Immink, "An enumerative coding technique for DC-free runlength-limited sequences," IEEE Trans. Commun., vol. 48, no. 12, pp. 2024–2031, Dec. 2000.

[23] O. F. Kurmaev, "Constant-weight and constant-charge binary run-length limited codes," IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4497–4515, Jul. 2011.

[24] F. Paluncic and B. T. J. Maharaj, "Using bivariate generating functions to count the number of balanced runlength-limited words," in Proc. GLOBECOM - IEEE Global Commun. Conf., Singapore, Dec. 2017, pp. 4–8.

[25] D. Limbachiya, M. K. Gupta, and V. Aggarwal, "Family of constrained codes for archival DNA data storage," IEEE Commun. Lett., vol. 22, no. 10, pp. 1972–1975, Oct. 2018.

[26] N. J. A. Sloane. (2019). The On-Line Encyclopedia of Integer Sequences. [Online]. Available: http://oeis.org

[27] Y. Wang, M. Noor-A-Rahim, E. Gunawan, Y. L. Guan, and C. L. Poh, "Construction of bio-constrained code for DNA data storage," IEEE Commun. Lett., vol. 23, no. 6, pp. 963–966, Jun. 2019.

[28] W. Song, K. Cai, M. Zhang, and C. Yuen, "Codes with run-length and GC-content constraints for DNA-based data storage," IEEE Commun. Lett., vol. 22, no. 10, pp. 2004–2007, Oct. 2018.

KEES A. SCHOUHAMER IMMINK (Life Fellow, IEEE) is currently a Founder and the President of Turing Machines Inc., an innovative start-up focused on coding and signal processing for DNA-based storage. He received the 2017 IEEE Medal of Honor for his pioneering contributions to video, audio, and data recording technology, the Knighthood, in 2000, the Personal Emmy Award, in 2004, the 1999 Audio Engineering Society's (AES) Gold Medal, the 2004 SMPTE Progress Medal, the 2014 Eduard Rhein Prize for Technology, and the 2015 IET Faraday Medal. He received an Honorary Doctorate from the University of Johannesburg, in 2014. He was inducted into the Consumer Electronics Hall of Fame, elected into the Royal Netherlands Academy of Arts and Sciences, and the (US) National Academy of Engineering. He has served the profession as a Governor for the IEEE Information Theory and Consumer Electronics Societies and the President for the Audio Engineering Society.

KUI CAI (Senior Member, IEEE) received the B.E. degree in information and control engineering from Shanghai Jiao Tong University, Shanghai, China, and the joint Ph.D. degree in electrical engineering from the Technical University of Eindhoven, The Netherlands, and the National University of Singapore. She is currently an Associate Professor with the Singapore University of Technology and Design (SUTD). Her main research interests are in the areas of coding theory, information theory, signal processing for various data storage systems, and digital communications. She received the 2008 IEEE Communications Society Best Paper Award in Coding and Signal Processing for Data Storage. She has served as the Vice-Chair (Academia) for the IEEE Communications Society and the Data Storage Technical Committee (DSTC), from 2015 to 2016.
