All content in this area was uploaded by Kees Schouhamer Immink on Dec 10, 2018


Properties and constructions of constrained codes for DNA-based data storage

Kees A. Schouhamer Immink and Kui Cai

Abstract—We describe properties and constructions of

constraint-based codes for DNA-based data storage which ac-

count for the maximum repetition length and AT/GC balance.

We present algorithms for computing the number of sequences

with maximum repetition length and AT/GC balance con-

straint. We describe routines for translating binary runlength

limited and/or balanced strings into DNA strands, and compute

the efficiency of such routines. We show that the implementation of AT/GC-balanced codes is straightforwardly accomplished

with binary balanced codes. We present codes that account for

both the maximum repetition length and AT/GC balance. We

compute the redundancy difference between the binary and a

fully ﬂedged quaternary approach.

I. INTRODUCTION

The ﬁrst large-scale archival DNA-based storage architec-

ture was implemented by Church et al. [1] in 2012. Blawat

et al. [2] described successful experiments for storing and

retrieving data blocks of 22 Mbyte of digital data in synthetic DNA. Erlich and Zielinski [3] further explored the

limits of storage capacity of DNA-based storage architec-

tures.

Naturally occurring DNA consists of four types of nu-

cleotides: adenine (A), cytosine (C), guanine (G), and

thymine (T). A DNA strand (or oligonucleotide, or oligo in short) is a linear sequence of these four nucleotides that is composed by a DNA synthesizer. Binary source, or user, data are translated into the four types of nucleotides, for example, by mapping two binary source symbols into a single nucleotide (nt).

Strings of nucleotides should satisfy a few elementary

conditions, called constraints, in order to be less error

prone. Repetitions of the same nucleotide, a homopoly-

mer run, signiﬁcantly increase the chance of sequencing

errors [4], [5], so that such long runs should be avoided.

For example, in [5], experimental studies show that once

the homopolymer run is larger than 4 nt, the sequencing

error rate starts increasing signiﬁcantly. In addition, [5]

also reports that oligos with large unbalance between GC

and AT content exhibit high dropout rates and are prone

to polymerase chain reaction (PCR) errors, and should

therefore be avoided.

Blawat's format [2] incorporates a constrained code that uses a look-up table for translating binary source data into strands of nucleotides with a homopolymer run of length at most three. Blawat's format did not incorporate an AT/GC balance constraint. Strands that do not satisfy the maximum homopolymer run requirement or the weak balance constraint are barred in Erlich's coding format [3].

Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15b-d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-machines.com.

Kui Cai is with Singapore University of Technology and Design (SUTD), 8 Somapah Rd, 487372, Singapore. E-mail: cai kui@sutd.edu.sg.

This work is supported by Singapore Ministry of Education Academic Research Fund Tier 2 MOE2016-T2-2-054.

In this paper, we describe properties and constructions

of quaternary constraint-based codes for DNA-based stor-

age which account for a maximum homopolymer run

and maximum unbalance between AT and GC contents.

Binary ‘balanced’ and runlength limited sequences have

found widespread use in data communication and storage

practice [6]. We show that constrained binary sequences

can easily be translated into constrained quaternary se-

quences, which opens the door to a wealth of efﬁcient

binary code constructions for application in DNA-based

storage [7], [8], [9]. A further advantage of this binary

approach instead of a ‘direct’ 4-ary translation approach

is the lower complexity of encoding and decoding look-

up tables. The disadvantage is, as we show, the loss in

information capacity of the binary versus the quaternary

approach.

We start in Section II with a description of the limit-

ing properties of AT/GC-balanced codes, while Section III

presents code designs for efﬁciently generating AT/GC-

balanced strands. Limiting properties and code constructions

that impose a maximum homopolymer run are discussed

in Section IV. In Section V, we enumerate the number

of binary and quaternary sequences with combined weight

and run-length constraints. We speciﬁcally compute and

compare the information capacity of binary versus ‘direct’

quaternary coding techniques. Section VI concludes the

paper.

II. AT/GC CONTENT BALANCE

We use the nucleotide alphabet Q={0,1,2,3}, where

we propose the following relation between the four decimal

symbols and the nucleotides: G= 0, C = 1, A = 2,

and T= 3. The AT/GC content constraint stipulates that

around half of the nucleotides should be either an A or a

T nucleotide. In order to study AT-balanced nucleotides, we

start with a few definitions. We define the weight or AT-content, denoted by $w_4(x)$, of the n-nucleotide oligo $x = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, as the number of occurrences of A or T, or

$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i), \tag{1}$$

where

$$\varphi(u) = \begin{cases} 0, & u < 2, \\ 1, & u > 1. \end{cases} \tag{2}$$

The relative unbalance of a word, $\alpha(x)$, is defined by $\alpha(x) = \left| \frac{w_4(x)}{n} - \frac{1}{2} \right|$. An n-nucleotide oligo is said to be balanced if $\alpha(x) = 0$. In case we have a set $S$ of n-symbol codewords, we define the worst case relative unbalance of $S$, denoted by $\alpha_S$, by $\alpha_S = \max_{x \in S} \alpha(x)$. Similarly, the weight of a binary word $x = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$, denoted by $w_2(x)$, is defined by

$$w_2(x) = \sum_{i=1}^{n} \varphi(2x_i) = \sum_{i=1}^{n} x_i. \tag{3}$$

If we write the 4-ary word $x = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, as $x = y + 2z$, where both $y_i$ and $z_i \in \{0, 1\}$, then

$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i) = \sum_{i=1}^{n} \varphi(2z_i) = w_2(z). \tag{4}$$
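The definitions (1) to (4) can be illustrated with a minimal sketch. It assumes the mapping G = 0, C = 1, A = 2, T = 3 proposed above; the helper names `w4`, `split_bits`, and `relative_unbalance` are hypothetical and chosen only for this illustration.

```python
# Mapping proposed in the text: G = 0, C = 1, A = 2, T = 3.
NT_TO_SYMBOL = {"G": 0, "C": 1, "A": 2, "T": 3}


def w4(x):
    """AT-content of a 4-ary word: counts the symbols 2 (A) and 3 (T), see (1)-(2)."""
    return sum(1 for u in x if u > 1)


def split_bits(x):
    """Write x = y + 2z with binary words y (least significant bits) and z (most significant bits)."""
    y = [u % 2 for u in x]
    z = [u // 2 for u in x]
    return y, z


def relative_unbalance(x):
    """alpha(x) = |w4(x)/n - 1/2|."""
    return abs(w4(x) / len(x) - 0.5)


oligo = [NT_TO_SYMBOL[c] for c in "GATTACA"]
y, z = split_bits(oligo)
print(w4(oligo), sum(z))  # both values are the AT-content, as stated by (4)
```

The AT-content depends only on the most significant bit z, which is what allows a binary balanced code to control the AT/GC balance of the quaternary word.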

For DNA-based storage, we do not require that the strands of the codebook, $S$, are strictly balanced, as a small unbalance, that is $\alpha_S \ll 1$, between the GC and AT content is permitted without affecting the error performance. Such a constraint is called a weak balance constraint. Let $S_w$ denote the set of 4-ary words of length $n$ with weight $w = w_4(x)$, or

$$S_w = \{x \in \mathcal{Q}^n : w = w_4(x)\}. \tag{5}$$

The cardinality of $S_w$, denoted by $N(w, n)$, equals

$$N(w, n) = |S_w| = \binom{n}{w} 2^n. \tag{6}$$

The number of oligos, $N_a(n)$, of length $n$, whose relative unbalance $\alpha(x) \le a$, is given by

$$N_a(n) = \sum_{|w/n - 1/2| < a} N(w, n) = 2^n \sum_{|w/n - 1/2| < a} \binom{n}{w}. \tag{7}$$

The redundancy of nearly balanced strands, denoted by $r(a, n)$, equals

$$r(a, n) = \log_2 \frac{4^n}{N_a(n)}. \tag{8}$$

Figure 1 shows examples of computations of the redundancy versus $n$ with the relative unbalance, $a$, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (7). For asymptotically large $n$, the distribution of $N(w, n)$ versus $w$ is approximately Gaussian shaped, that is

$$N(w, n) \sim \mathcal{G}\left(w; \frac{n}{2}, \frac{n}{4}\right) 4^n, \quad n \gg 1, \tag{9}$$

where

$$\mathcal{G}(u; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2}, \tag{10}$$

denotes the Gaussian distribution and $\mu$ and $\sigma^2$ denote the mean and variance of the distribution.

Fig. 1. Redundancy (bits) versus word length, n, with the relative unbalance, a, as a parameter (curves for a = 0.125, 0.0625, and 0.03125). The raggedness of the curves is caused by the truncation effects in the summation in (7).

The number of oligos, $N_a(n)$, of length $n$, whose relative unbalance $\alpha(x) \le a$, is given by [3, supplement]

$$N_a(n) \sim 4^n \left(1 - 2Q(2a\sqrt{n})\right), \quad n \gg 1, \tag{11}$$

where the Q-function is defined by

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}} \, du. \tag{12}$$
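The exact count (7), the redundancy (8), and the Gaussian estimate (11) can be compared numerically. The sketch below is an illustration only: the function names are hypothetical, and it assumes the inclusive condition $|w/n - 1/2| \le a$ from the text (the summation index in (7) is written with a strict inequality, which may differ at the boundary).

```python
from math import comb, erfc, log2, sqrt


def count_nearly_balanced(n, a):
    """N_a(n): number of 4-ary words of length n with |w/n - 1/2| <= a (assumed inclusive)."""
    return 2 ** n * sum(comb(n, w) for w in range(n + 1) if abs(w / n - 0.5) <= a)


def redundancy(n, a):
    """r(a, n) of (8), in bits."""
    return 2 * n - log2(count_nearly_balanced(n, a))


def q_function(x):
    """Gaussian tail probability Q(x) of (12), expressed with the complementary error function."""
    return 0.5 * erfc(x / sqrt(2))


def redundancy_gaussian(n, a):
    """Redundancy implied by the asymptotic estimate (11)."""
    return -log2(1 - 2 * q_function(2 * a * sqrt(n)))


for n in (50, 100, 150):
    print(n, round(redundancy(n, 0.0625), 3), round(redundancy_gaussian(n, 0.0625), 3))
```

The exact values jump as $n$ grows because the set of admissible weights changes in integer steps, which is the raggedness mentioned for Fig. 1.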

In the next section, we discuss various embodiments of

codes that balance strands of nucleotides.

III. IMPLEMENTATIONS OF BALANCED GC/AT CONTENT

There is a wealth of prior art binary balanced codes [10],

and application of such prior art codes to the problem at

hand is shown below. Earlier embodiments can be found

in [11], [12].

A. Binary sequences, Construction I

We assume the encoder receives a string of $\ell + n$, $n \ge \ell$, binary symbols, which are translated into a balanced word of $n$ 4-ary symbols. To that end, let $(y_1, \ldots, y_{\ell+n})$, $y_i \in \{0, 1\}$, $n \ge \ell$, be an $(\ell + n)$-bit source string. We translate the first $\ell$ bits of the binary source data, $(y_1, \ldots, y_\ell)$, into a (nearly) balanced binary string $(u_1, \ldots, u_n)$, $u_i \in \{0, 1\}$. We merge the n-bit string, $(u_1, \ldots, u_n)$, and the remaining n-bit segment of the source string, $(y_{\ell+1}, \ldots, y_{\ell+n})$, into the 4-ary vector $v$, $v_i \in \mathcal{Q}$, using the operation $v_i = y_{\ell+i} + 2u_i$, $1 \le i \le n$. The balance of the output string, $v$, is given by, see (4), $w_4(v) = w_2(u)$. The rate of the above 4-ary code construction equals $R = 1 + \frac{\ell}{n}$.
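The merging step of Construction I can be sketched as follows. The balanced binary encoder that produces $u$ is not shown; the sketch assumes $u$ is already available, and the function names are hypothetical.

```python
def merge_construction_one(u, y_tail):
    """Merge a balanced n-bit word u with n free source bits: v_i = y_{l+i} + 2*u_i."""
    if len(u) != len(y_tail):
        raise ValueError("u and y_tail must have the same length")
    return [yb + 2 * ub for ub, yb in zip(u, y_tail)]


def split_construction_one(v):
    """Recover (u, y_tail) from the 4-ary word v."""
    return [s // 2 for s in v], [s % 2 for s in v]


u = [1, 0, 1, 0, 0, 1]        # assumed output of some balanced binary code
y_tail = [1, 1, 1, 1, 0, 0]   # unconstrained source bits
v = merge_construction_one(u, y_tail)
at_content = sum(1 for s in v if s > 1)
print(v, at_content)
```

Since the free bits only select between G and C, or between A and T, the AT-content of $v$ equals the weight of $u$ regardless of the source bits in `y_tail`.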

Implementations of balanced codes can be found in the

literature. For example, the 8B10B is a binary code of rate

8/10 that has found application in both transmission and

data storage systems [13]. The 10-bit codewords may have four, five or six 'one's, and the two-state code guarantees that the unbalance of the encoded sequence is at most ±1. In case we translate $p$ 8-bit words into $p$ 10-bit words, we have $\alpha_S = \frac{1}{10p}$. The (overall) rate is $R = \frac{9}{5}$.

1) Weak Knuth code: Knuth [14] presented an encoding technique for generating binary balanced codewords capable of handling (very) large binary blocks. An n-bit user word, $n$ even, is forwarded to the encoder, which inverts the first $k_0$ bits of the user word, where $k_0$ is chosen in such a way that the modified word has equal numbers of ones and zeros. Knuth showed that such an index $k_0$ can always be found. The index $k_0$ is represented by a (preferably) balanced word, called prefix, of length $p_0$, $p_0 \ge \log_2 n$ bits, so that the redundancy of Knuth's method is approximately $\log_2 n$ (bit). The (balanced) $p_0$-bit prefix and the balanced n-bit user word are both transmitted. The receiver can easily undo the inversion of the first $k_0$ bits of the received word. Modifications of the generic Knuth scheme have been presented by Weber & Immink [15].
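The inversion step of Knuth's method can be sketched as below. This is a simplified illustration under stated assumptions: the prefix that conveys $k_0$ is not encoded here but simply returned as an integer, and the function names are hypothetical.

```python
def knuth_balance(x):
    """Invert the first k0 bits of x so that the result has as many ones as zeros.

    Returns (k0, balanced_word). The prefix encoding of k0 is omitted.
    """
    n = len(x)
    if n % 2:
        raise ValueError("word length must be even")
    for k in range(n + 1):
        word = [1 - b for b in x[:k]] + list(x[k:])
        if 2 * sum(word) == n:
            return k, word
    # The weight changes by one per step and moves from w to n - w,
    # so it must pass n/2 for some k.
    raise AssertionError("no balancing index found")


def knuth_restore(k, word):
    """Undo the inversion of the first k bits."""
    return [1 - b for b in word[:k]] + list(word[k:])


x = [1, 1, 1, 1, 1, 1, 0, 1]
k0, balanced = knuth_balance(x)
print(k0, balanced)
```

The search returns the smallest balancing index; any other balancing index would serve equally well for the decoder.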

DNA-based storage does not require exact strand GC/AT-content balance, and we may attempt to construct less redundant nearly-balanced codes. We modify Knuth's algorithm for generating nearly balanced binary codes. Let $x = (x_1, \ldots, x_n)$ be the word to be balanced. Define the $m_0 = 2^{p_0}$ balancing positions, denoted by $b_i$, $i = 0, \ldots, m_0 - 1$, that are evenly distributed over the $n$ possible positions, say $b_i = 1 + is$, $i = 0, \ldots, m_0 - 1$, where $s = \lceil n/m_0 \rceil$. Mimicking the original Knuth encoder, the encoder successively inverts the symbols of the $i$th segment of $x$, $i = 0, \ldots, m_0 - 1$, thereby successively inverting the symbols $x_1$ till $x_{b_0}$, $x_1$ till $x_{b_1}$, etc., until $x_1$ till $x_{b_{m_0-1}}$. The encoder selects the index, $b_{\hat{i}}$, that enables the least unbalance. In similar vein as in Knuth's method, the index $\hat{i}$ is represented by a redundant (balanced or nearly balanced) p-bit prefix that is appended to the weakly-balanced word. According to Knuth we can choose at least one index $k_0$, $1 \le k_0 \le n$, such that exact balance can be achieved. As an 'exact' balancing index, $k_0$, is at most $\lfloor s/2 \rfloor$ positions away from position $b_{\hat{i}}$, we conclude that the relative unbalance is

$$\alpha_S \sim \frac{1}{2^{p_0+1}}. \tag{13}$$

The redundancy of the above weak Knuth code equals at least $p_0$ bits (note that additional redundancy is needed to encode the prefix into a nearly balanced word). Let, for example, the code redundancy be $p_0 = 3$, then $\alpha_S = 0.0625$. Figure 1 shows that for a relative unbalance $a = 0.0625$ we need, in theory, less than 1.5 bit redundancy for $n > 25$, so that we conclude that the above modification of Knuth's algorithm requires considerably more redundancy than the theoretical minimum.

In the next section, we discuss constructions for generating strings that avoid long repetitions of the same nucleotide.
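The weak Knuth encoder described above can be sketched as follows. The sketch makes two assumptions that are not in the text: balancing positions beyond the word length are clipped to $n$, and the prefix is returned as a plain index rather than encoded. Function names are hypothetical.

```python
from math import ceil


def balancing_position(i, n, p0):
    """b_i = 1 + i*s with s = ceil(n / 2**p0), clipped to n (clipping is an assumption)."""
    s = ceil(n / 2 ** p0)
    return min(1 + i * s, n)


def weak_knuth_balance(x, p0):
    """Try the 2**p0 balancing positions and keep the one with the least unbalance."""
    n = len(x)
    best = None
    for i in range(2 ** p0):
        b = balancing_position(i, n, p0)
        word = [1 - bit for bit in x[:b]] + list(x[b:])
        unbalance = abs(sum(word) / n - 0.5)
        if best is None or unbalance < best[0]:
            best = (unbalance, i, word)
    return best[1], best[2]


def weak_knuth_restore(i, word, p0):
    """Undo the inversion selected by index i."""
    b = balancing_position(i, len(word), p0)
    return [1 - bit for bit in word[:b]] + list(word[b:])


x = [1] * 16
index, word = weak_knuth_balance(x, p0=2)
print(index, sum(word))
```

For the all-ones word of length 16 and $p_0 = 2$, the positions are 1, 5, 9, and 13, and the residual relative unbalance stays within the estimate (13).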

IV. MAXIMUM RUNLENGTH CONSTRAINT

Long repetitions of the same nucleotide (nt), called a

homopolymer run or runlength, may signiﬁcantly increase

the chance of sequencing errors [4], [5], and should be

avoided. Avoiding long runs of the same nucleotide will result in loss of information capacity, and codes are required for translating arbitrary source data into constrained

quaternary strings. Binary runlength limited (RLL) codes

have found widespread application in digital communication

and storage devices since the 1950s [6], [10]. MacLaughlin

et al. [16] studied multi-level runlength limited codes for

optical recording. An n-nucleotide oligo, a string of 4-

ary symbols of length n, can be seen as two parallel

binary strings of length n, namely a string of a least and

a most signiﬁcant bit with which the 4-ary symbol can be

represented. Such a system of multiple parallel data streams

with joint constraints is reminiscent of ‘two-dimensional’

track systems, which have been studied by Marcellin and

Weber [17].

We start in the next subsection with the counting of q-

ary sequences that satisfy a maximum runlength, followed

by subsections where we describe limiting properties and

code constructions that avoid m+ 1 repetitions of the same

nucleotide.

A. Counting q-ary sequences, capacity

Let the number of q-ary sequences of length $n$ in which runs of the same symbol have length at most $m$ be denoted by $N_q(m, n)$. The number $N_q(m, n)$ can be found using the next Theorem, which defines a recursive relation [18], Part 1.

Theorem 1:

$$N_q(m, n) = \begin{cases} q^n, & n \le m, \\ (q-1) \sum_{k=1}^{m} N_q(m, n-k), & n > m. \end{cases} \tag{14}$$

Proof: For $n \le m$ the above is trivial as all sequences satisfy the maximum runlength constraint. For $n > m$ we follow Shannon's approach [18] for the discrete noiseless channel. The runlength of $k$ symbols $a$ can be seen as a 'phrase' $a$ of length $k$. After a phrase $a$ has been emitted, a phrase of symbols $b \ne a$ of length $k$ can be emitted without violating the maximum runlength constraint imposed. The total number of allowed sequences, $N_q(m, n)$, is equal to $(q-1)$ times the sum of the numbers of sequences ending with a phrase of length $k = 1, 2, \ldots, m$, which are equal to $N_q(m, n-k)$. Addition of these numbers yields (14), which proves the Theorem.

Using the above expressions, we may easily compute the

feasibility of a q-ary m-constrained code for relatively small

values of nwhere a coding look-up table is practicable, see

Subsection IV-C for more details.
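The recursion (14) is straightforward to evaluate. The following minimal sketch (the function name `count_rll` is hypothetical) computes $N_q(m, n)$ with exact integer arithmetic.

```python
def count_rll(q, m, n):
    """N_q(m, n): number of q-ary words of length n with runs of at most m, using (14)."""
    counts = [0] * (n + 1)  # counts[k] holds N_q(m, k); index 0 is unused
    for length in range(1, n + 1):
        if length <= m:
            counts[length] = q ** length
        else:
            counts[length] = (q - 1) * sum(counts[length - k] for k in range(1, m + 1))
    return counts[n]


print(count_rll(4, 3, 5))   # value used in Example 1 of the text
print(count_rll(4, 2, 10))  # value quoted in the discussion of (21)
```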

1) Generating functions: Generating functions are a very

useful tool for enumerating constrained sequences [19], and

they offer tools for approximating the number of constrained

sequences for asymptotically large values of the sequence

length $n$. The series of numbers $\{N_q(m, n)\}$, $n = 1, 2, \ldots$, in (14), can be compactly written as the coefficients of a formal power series $H_{q,m}(x) = \sum_i N_q(m, i) x^i$, where $x$ is a dummy variable. There is a simple relationship between the generating function, $H_{q,m}(x)$, and the linear homogeneous recurrence relation (14) with constant coefficients that defines the same series [19].

TABLE I
CAPACITY C2(m) AND C4(m) VERSUS m.

m    C2(m)     C4(m)
1    0.0000    1.5850 (= log2 3)
2    0.6942    1.9227
3    0.8791    1.9824
4    0.9468    1.9957
5    0.9752    1.9989
6    0.9881    1.9997

We first define a generating function

$$G(x) = \sum_i g_i x^i. \tag{15}$$

Let the operation $[x^n]G(x)$ denote the extraction of the coefficient of $x^n$ in the formal power series $G(x)$, that is, define

$$[x^n] \sum_i g_i x^i = g_n. \tag{16}$$

Let

$$T(x) = \sum_{i=1}^{m} x^i. \tag{17}$$

Theorem 2: The number of n-symbol m-constrained q-ary words is

$$N_q(m, n) = [x^n] \frac{qT(x)}{1 - (q-1)T(x)}. \tag{18}$$

Proof: The generating function for the number of q-ary sequences with a maximum runlength $m$ is

$$qT(x) + q(q-1)T(x)^2 + q(q-1)^2 T(x)^3 + \cdots.$$

We may rewrite the above as

$$\frac{qT(x)}{1 - (q-1)T(x)},$$

which proves the Theorem.
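The coefficient extraction in (18) can be sketched without a computer algebra system. Writing $H(x) = qT(x)/(1-(q-1)T(x))$ as $H = qT + (q-1)TH$ and comparing coefficients of $x^n$ gives the coefficients one at a time. The function name below is hypothetical.

```python
def series_coefficients(q, m, n_max):
    """Coefficients h_0 .. h_n_max of H(x) = q T(x) / (1 - (q-1) T(x)), T(x) = x + ... + x^m."""
    t = [1 if 1 <= i <= m else 0 for i in range(n_max + 1)]
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # Coefficient of x^n in T(x) H(x); h[0] = 0, so the sum stops at k = n.
        convolution = sum(t[k] * h[n - k] for k in range(1, n + 1))
        h[n] = q * t[n] + (q - 1) * convolution
    return h


h = series_coefficients(4, 3, 5)
print(h[1:])  # N_4(3, n) for n = 1, ..., 5
```

The values agree with the recursion of Theorem 1, for example $N_4(3, 5) = 996$.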

2) Asymptotical behavior: For asymptotically large codeword length $n$, the maximum number of (binary) user bits that can be stored per q-ary symbol, called (information) capacity, denoted by $C_q(m)$, is given by [18]

$$C_q(m) = \lim_{n \to \infty} \frac{1}{n} \log_2 N_q(m, n) = \log_2 \lambda_q(m), \tag{19}$$

where $\lambda_q(m)$ is the largest real root of the characteristic equation [18], [16]

$$x^{m+1} - q x^m + q - 1 = 0. \tag{20}$$

Table I shows the information capacities $C_2(m)$ and $C_4(m)$ versus maximum allowed (homopolymer) run $m$. For asymptotically large $n$ we may approximate $N_q(m, n)$ by [19]

$$N_q(m, n) \sim A_q(m) \lambda_q^n(m). \tag{21}$$

TABLE II
COEFFICIENT A2(m) AND A4(m) VERSUS m.

m    A2(m)     A4(m)
1    -         1.3333 (= 4/3)
2    1.4477    1.1031
3    1.2368    1.0341
4    1.1327    1.0110
5    1.0759    1.0034
6    1.0435    1.0010

The coefficient $A_q(m)$ is found, see [10, pages 157-158], by rewriting $H_{q,m}(x)$ as a quotient of two polynomials, or $H_{q,m}(x) = \frac{r(x)}{p(x)}$. Then

$$A_q(m) = -\lambda_q(m) \frac{r(1/\lambda_q(m))}{p'(1/\lambda_q(m))}. \tag{22}$$

Table II shows the coefficients $A_2(m)$ and $A_4(m)$ versus $m$. For $m = 1$, we simply find $N_4(1, n) = 4 \cdot 3^{n-1}$. We found that the approximation (21) is remarkably accurate. For a typical example, $N_4(2, 10) = 676836$, while the approximation using (21) yields $N_4(2, 10) \sim 676835.9769$. The redundancy of a 4-ary string of length $n$ with a maximum runlength $m$, denoted by $r_4(m, n)$, is

$$r_4(m, n) = 2n - \log_2 N_4(m, n) \sim n(2 - C_4(m)) - \log_2 A_4(m). \tag{23}$$
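The capacity (19) and the coefficient (22) can be evaluated numerically. The sketch below is an illustration under stated assumptions: it finds $\lambda_q(m)$ by bisection on (20), and it evaluates (22) with $r(x) = qT(x)$ and $p(x) = 1 - (q-1)T(x)$ taken from (18), which simplifies to $A_q(m) = q\lambda/((q-1)^2 T'(1/\lambda))$ because $T(1/\lambda) = 1/(q-1)$. Function names are hypothetical.

```python
from math import log2


def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q x^m + q - 1 = 0, found by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    # f has its minimum at x = q*m/(m+1) and increases from there up to x = q.
    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def capacity(q, m):
    """C_q(m) = log2(lambda_q(m)), see (19)."""
    return log2(largest_root(q, m))


def asymptotic_coefficient(q, m):
    """A_q(m) from (22) with r(x) = q T(x) and p(x) = 1 - (q-1) T(x)."""
    lam = largest_root(q, m)
    x = 1 / lam
    t_prime = sum(i * x ** (i - 1) for i in range(1, m + 1))
    return lam * q / ((q - 1) ** 2 * t_prime)


print(round(capacity(4, 2), 4), round(capacity(2, 3), 4))
print(asymptotic_coefficient(4, 2) * largest_root(4, 2) ** 10)  # estimate (21) of N_4(2, 10)
```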

B. Binary-based RLL code construction, Construction II

In a similar vein as presented in Section III, we may

exploit binary maximum runlength limited (RLL) codes for

generating quaternary RLL sequences. Construction II exemplifies such a technique for $m > 1$. Let $u = (u_1, \ldots, u_n)$ be an n-bit RLL string. We merge the RLL n-bit string, $u$, with an n-bit source string $y = (y_1, \ldots, y_n)$, by using the addition $v_i = u_i + 2y_i$, $1 \le i \le n$, where $v = (v_1, \ldots, v_n)$, $v_i \in \mathcal{Q}$, is the 4-ary output string. It is easily verified that the 4-ary output string, $v$, has maximum allowed run $m$, the same as the binary string $u$. The number of distinct 4-ary sequences, $v$, of Construction II equals $2^n N_2(m, n)$, so that the redundancy, denoted by $r_2(m, n)$, is

$$r_2(m, n) \sim n(1 - C_2(m)) - \log_2 A_2(m). \tag{24}$$

The capacity loss with respect to the runlength limited 4-ary channel is expressed by the asymptotic rate efficiency, denoted by $\eta(m)$,

$$\eta(m) = \frac{1 + C_2(m)}{C_4(m)}. \tag{25}$$

Table III lists results of computations. We may notice that

for small values of m, Construction II will suffer a capacity

loss of up to 12 % for m= 2. For larger values of m,

however, the capacity loss is negligible.
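The claim that Construction II preserves the maximum run can be checked exhaustively for short words. The sketch below is only an illustration; the function names are hypothetical.

```python
from itertools import groupby, product


def max_run(seq):
    """Length of the longest run of identical symbols."""
    return max(len(list(group)) for _, group in groupby(seq))


def merge_construction_two(u, y):
    """v_i = u_i + 2*y_i: the RLL word u supplies the least significant bit."""
    return [ub + 2 * yb for ub, yb in zip(u, y)]


def verify_construction_two(m, n):
    """Check every m-constrained u and every source word y of length n."""
    for u in product((0, 1), repeat=n):
        if max_run(u) > m:
            continue
        for y in product((0, 1), repeat=n):
            if max_run(merge_construction_two(u, y)) > m:
                return False
    return True


print(merge_construction_two([0, 0, 1, 1, 0, 1], [1, 1, 1, 1, 1, 1]))
print(verify_construction_two(2, 6))
```

Two neighbouring quaternary symbols can only be equal if their least significant bits are equal, so a run in $v$ is never longer than the corresponding run in $u$.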

TABLE III
ASYMPTOTIC RATE EFFICIENCY, η(m), OF BINARY CONSTRUCTION II VERSUS MAXIMUM HOMOPOLYMER RUN, m.

m    η(m)
2    0.881
3    0.948
4    0.975
5    0.988
6    0.994
7    0.997

The above asymptotic efficiency of Construction II, $\eta(m)$, is valid for very large values of the strand length $n$, and it is of practical interest to assess the efficiency for smaller values of the strand length. Construction II can be used with any binary RLL code, and there are many binary code constructions for generating maximum runlength constrained sequences, see [10] for an overview. We propose here, for the efficiency assessment, a simple two-mode block code of codeword length $n$. Runlength constrained codewords in the first mode start with a symbol 'zero', while codewords in the second mode start with a 'one'. When the previously sent codeword ends with a 'one' we use the codewords from the first mode and vice versa. The number of binary source words that can be accommodated with Construction II equals $2^{n-1} N_2(m, n)$, so that the code rate, denoted by $R_{m,0}$, is

$$R_{m,0} = \frac{1}{n}\left(n - 1 + \lfloor \log_2 N_2(m, n) \rfloor\right), \tag{26}$$

where we truncated the code size to the largest power of two possible. Table IV shows selected outcomes of computations of the rate efficiency $R_{m,0}/C_4(m)$ versus $m$ and $n$.

TABLE IV
RATE EFFICIENCY, R_{m,0}/C4(m), OF BINARY CONSTRUCTION II VERSUS STRAND LENGTH, n, AND MAXIMUM HOMOPOLYMER RUN, m.

n     m = 2    m = 3    m = 4
5     0.832    0.807    0.802
6     0.780    0.841    0.835
7     0.817    0.865    0.859
8     0.845    0.883    0.877
9     0.809    0.897    0.891
10    0.832    0.908    0.902

C. Encoding of quaternary sequences without binary step

In this subsection, we investigate constructions of codes

that transform binary words directly (that is, without an

intermediate binary coding step) into 4-ary maximum ho-

mopolymer constrained codewords. An example of a simple

4-ary block code was presented by Blawat et al. [2]. The

code converts 8 source bits into a 4-ary word of 5 nt. The

5-nt words can be cascaded without violating the prescribed

m= 3 maximum homopolymer run. The rate of Blawat’s

construction is R= 8/5=1.6. As C4(m= 3) = 1.9824,

see Table I, the (rate) efﬁciency of the construction is

R/C4(m) = 0.807. Alternative, and more efﬁcient, con-

structions are described below.

TABLE V
RATE EFFICIENCY, R_{m,1}/C4(m), OF THE 4-ARY CODE CONSTRUCTION VERSUS STRAND LENGTH, n, AND MAXIMUM HOMOPOLYMER RUN, m.

n     m = 1    m = 2    m = 3    m = 4
5     0.883    0.832    0.807    0.802
6     0.841    0.867    0.841    0.835
7     0.901    0.892    0.865    0.859
8     0.946    0.910    0.883    0.877
9     0.911    0.925    0.897    0.891
10    0.946    0.936    0.908    0.902

1) State-independent decoding: A source word can be

represented by two n-symbol 4-ary m-constrained code-

words. The two representations differ at the ﬁrst position. In

case we cascade a new codeword to the previous codeword,

we are always able to choose (at least) one representation

whose ﬁrst symbol differs from the last symbol of the

previous codeword. Then, clearly, the cascaded string of

4-ary symbols satisﬁes the maximum homopolymer run

constraint. The rate of this two-mode construction, denoted by $R_{m,1}$, is

$$R_{m,1} = \frac{1}{n}\left(\lfloor \log_2 N_4(m, n) \rfloor - 1\right), \tag{27}$$

where we truncated the code size to the largest power of two possible. Table V shows selected outcomes of computations of the rate efficiency $R_{m,1}/C_4(m)$ versus $m$ and $n$. We observe that, for $m = 2$, the 'quaternary' efficiency $R_{2,1}/C_4(2)$ is slightly better than the 'binary' $R_{2,0}/C_4(2)$. For $m > 2$, both approaches have the same efficiency. The conversion of the binary source symbols into the 4-ary n-nt strands and vice versa can be accomplished using look-up tables of complexity $4^n$.
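The two-mode rates (26) and (27) follow directly from the counts of Theorem 1. The sketch below recomputes them; the function names are hypothetical, and the integer `bit_length` method is used so that the floor of the base-2 logarithm is exact.

```python
def count_rll(q, m, n):
    """N_q(m, n) from the recursion (14)."""
    counts = [0] * (n + 1)
    for length in range(1, n + 1):
        if length <= m:
            counts[length] = q ** length
        else:
            counts[length] = (q - 1) * sum(counts[length - k] for k in range(1, m + 1))
    return counts[n]


def floor_log2(value):
    """Exact floor(log2(value)) for a positive integer."""
    return value.bit_length() - 1


def rate_binary_two_mode(m, n):
    """R_{m,0} of (26): binary two-mode RLL code combined with n free bits."""
    return (n - 1 + floor_log2(count_rll(2, m, n))) / n


def rate_quaternary_two_mode(m, n):
    """R_{m,1} of (27): direct 4-ary two-mode code."""
    return (floor_log2(count_rll(4, m, n)) - 1) / n


for n in (5, 6, 7):
    print(n, rate_binary_two_mode(2, n), rate_quaternary_two_mode(2, n))
```

Dividing these rates by the capacities of Table I gives the efficiencies listed in Tables IV and V, for example $1.6/1.9824 \approx 0.807$ for $n = 5$ and $m = 3$.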

2) State-dependent decoding: In the above construction,

the encoded codeword depends on the last symbol of the

previous codeword. Decoding, however, is based on the

observation of the nsymbols of the retrieved codeword.

In this subsection, we discuss a state-dependent decoding

construction, where the codeword chosen depends on the

last symbol of the previous codeword, and decoding is

based on the observation of the nsymbols of the retrieved

codeword plus the last symbol of the previous codeword.

We define four tables of codewords, denoted by $L(i, a)$, where $i$, $1 \le i \le K$, denotes the decimal representation of the source word to be encoded, $K$ denotes the size of the table, and $a$ denotes the encoder state, $a \in \{1, 2, 3, 4\}$. We construct the four tables in such a way that the codewords in each table $L(i, a)$ do not start with the symbol $a$. Then, the maximum size of the tables equals $K = \frac{3}{4} N_4(m, n)$ (note that $N_4(m, n)$ is a multiple of 4). The representation, $L(i, a)$, chosen depends on the last symbol of the previous codeword, $a$. The rate of this four-mode construction, denoted by $R_{m,2}$, is

$$R_{m,2} = \frac{1}{n} \left\lfloor \log_2 \frac{3}{4} N_4(m, n) \right\rfloor, \tag{28}$$

where, as in (27), the code size is truncated to the largest power of two possible.

TABLE VI
RATE EFFICIENCY, R_{m,2}/C4(m), OF THE 4-ARY CODE CONSTRUCTION VERSUS STRAND LENGTH, n, AND MAXIMUM HOMOPOLYMER RUN, m.

n     m = 1    m = 2    m = 3    m = 4
5     0.883    0.936    0.908    0.902
6     0.946    0.954    0.925    0.919
7     0.991    0.966    0.937    0.931
8     0.946    0.975    0.946    0.940
9     0.981    0.982    0.953    0.946
10    0.946    0.936    0.958    0.952

Table VI shows the rate efficiencies that can be reached with this construction. The efficiency improvement with respect to Table V is obtained at the cost of a four times larger look-up table. Decoding of a codeword is uniquely accomplished by observing the n-symbol codeword plus the last symbol of the previous codeword.

Example 1: Let (as in Blawat's code [2]) $n = 5$ and $m = 3$. We simply find, using (14), $N_4(3, 5) = 996$, so that the code may accommodate $K = \frac{3}{4} \times 996 = 747$ binary source words. Since $K > 512 = 2^9$ we may implement a code of rate 9/5, which is 12.5% higher than that of Blawat's code of rate 8/5. As we have the freedom of deleting $747 - 512 = 235$ redundant codewords, we may bar the words with the highest unbalance.
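The four-mode construction of Example 1 can be sketched with explicit look-up tables. Several choices below are assumptions made for illustration: states are labelled 0 to 3 rather than 1 to 4, the encoder starts in state 0, and each table simply keeps the first 512 admissible words in lexicographic order instead of barring the words with the highest unbalance. All function names are hypothetical.

```python
from itertools import groupby, product


def max_run(seq):
    """Length of the longest run of identical symbols."""
    return max(len(list(group)) for _, group in groupby(seq))


def build_tables(n, m, bits):
    """One table per state a, holding 2**bits m-constrained words that do not start with a."""
    words = [w for w in product(range(4), repeat=n) if max_run(w) <= m]
    size = 2 ** bits
    tables = {}
    for a in range(4):
        allowed = [w for w in words if w[0] != a]
        if len(allowed) < size:
            raise ValueError("not enough codewords for the requested rate")
        tables[a] = allowed[:size]
    return tables


def encode(source_words, tables, initial_state=0):
    """The state is the last symbol of the previous codeword."""
    state, strand = initial_state, []
    for i in source_words:
        word = tables[state][i]
        strand.extend(word)
        state = word[-1]
    return strand


def decode(strand, tables, n, initial_state=0):
    """Decoding uses the n-symbol codeword plus the last symbol of the previous one."""
    state, result = initial_state, []
    for pos in range(0, len(strand), n):
        word = tuple(strand[pos:pos + n])
        result.append(tables[state].index(word))
        state = word[-1]
    return result


tables = build_tables(n=5, m=3, bits=9)
data = [0, 511, 123, 256]
strand = encode(data, tables)
print(max_run(strand), decode(strand, tables, 5) == data)
```

Because every codeword starts with a symbol that differs from the last symbol of its predecessor, runs never extend across codeword boundaries.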

In the next section, we take a look at the combination of

balance and maximum polymer run constrained codes.

V. COMBINED WEIGHT AND MAXIMUM RUN CONSTRAINED CODES

Kerpez et al. [20], Braun and Immink [21], and Kur-

maev [22] analyzed properties and constructions of binary

combined weight and runlength constrained codes. Their

results are straightforwardly applied to the quaternary case

at hand. In the next section, we count binary and quaternary

sequences that satisfy combined maximum runlength and

weight constraints. We start by counting the number of

binary sequences, x, of length nthat satisfy a maximum

runlength constraint mand have a weight w=w2(x).

Paluncic and Maharaj [23] enumerated this number for the

balanced case w=w2(x) = 0.

A. Counting binary RLL sequences of given weight

Define the bi-variate generating function $H(x, y)$ in the dummy variables $x$ and $y$ by

$$H(x, y) = \sum_{i,j} h_{i,j} x^i y^j, \tag{29}$$

and let $[x^{n_1} y^{n_2}]H(x, y)$ denote the extraction of the coefficient of $x^{n_1} y^{n_2}$ in the formal power series $\sum_{i,j} h_{i,j} x^i y^j$, or

$$[x^{n_1} y^{n_2}] \sum_{i,j} h_{i,j} x^i y^j = h_{n_1,n_2}. \tag{30}$$

Define

$$T_1(x, y) = \sum_{i=1}^{m} x^i y^i. \tag{31}$$

The number of n-bit codewords, $x$, with maximum runlength $m$, denoted (with a slight abuse of notational convention by adding an extra parameter) by $N_2(m, w, n)$, that satisfy a given weight constraint $w = w_2(x)$ is given by the next Theorem.

Theorem 3:

$$N_2(m, w, n) = [x^n y^w] \frac{T_1(x, y) + T(x) + 2T_1(x, y)T(x)}{1 - T_1(x, y)T(x)}.$$

Proof: Let the sequence start with a run of zeros, then the generating function for the number of binary sequences with a maximum runlength $m$ is

$$T(x) + T(x)T_1(x, y) + T(x)^2 T_1(x, y) + T(x)^2 T_1(x, y)^2 + \cdots.$$

In case the sequence starts with a run of ones, we obtain for the generating function

$$T_1(x, y) + T(x)T_1(x, y) + T(x)T_1(x, y)^2 + T(x)^2 T_1(x, y)^2 + \cdots.$$

The generating function for the number of binary sequences with a maximum runlength $m$ starting with a run of ones or a run of zeros is the sum of the two above generating functions. Working out the sum yields

$$\frac{T_1(x, y) + T(x) + 2T_1(x, y)T(x)}{1 - T_1(x, y)T(x)},$$

which proves the Theorem.
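The coefficient extraction of Theorem 3 can be sketched with truncated bi-variate polynomials stored as dictionaries that map an exponent pair $(i, j)$ to the coefficient of $x^i y^j$. The representation and the function names are choices made for this illustration only.

```python
from collections import defaultdict


def multiply(p, r, n_max):
    """Product of two bi-variate polynomials, dropping terms with x-degree above n_max."""
    out = defaultdict(int)
    for (i1, j1), c1 in p.items():
        for (i2, j2), c2 in r.items():
            if i1 + i2 <= n_max:
                out[(i1 + i2, j1 + j2)] += c1 * c2
    return dict(out)


def add(p, r):
    """Sum of two bi-variate polynomials."""
    out = defaultdict(int, p)
    for key, c in r.items():
        out[key] += c
    return dict(out)


def count_binary_rll_by_weight(m, n):
    """Return {w: N_2(m, w, n)} by expanding the generating function of Theorem 3."""
    t = {(i, 0): 1 for i in range(1, m + 1)}    # T(x): a run of zeros
    t1 = {(i, i): 1 for i in range(1, m + 1)}   # T1(x, y): a run of ones
    tt1 = multiply(t, t1, n)
    numerator = add(add(t, t1), {key: 2 * c for key, c in tt1.items()})
    # 1 / (1 - T T1) = sum over k of (T T1)^k; terms vanish once the x-degree exceeds n.
    total, term = {}, numerator
    while term:
        total = add(total, term)
        term = multiply(term, tt1, n)
    return {w: c for (i, w), c in total.items() if i == n}


print(count_binary_rll_by_weight(2, 4))
```

For $m = 2$ and $n = 4$ the ten admissible words split into two of weight 1, six of weight 2, and two of weight 3.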

With the above bi-variate generating function, we may

exactly compute the number of binary m-constrained words

of weight w. More insight is gained by an approximation

of N2(m, w, n). For a given maximum runlength, m, and

large n, we are speciﬁcally interested in the distribution of

N2(m, w, n)versus the weight w. For asymptotically large

n, according to the central limit theorem, the distribution of

the number of sequences versus weight, w, is approximately

Gaussian [19].

Theorem 4:

$$N_2(m, w, n) \sim \mathcal{G}\left(w; \frac{n}{2}, \frac{\gamma_2(m) n}{4}\right) N_2(m, n), \tag{32}$$

where

$$\gamma_2(m) = \frac{1}{\bar{l}} \sum_{i=1}^{m} (i - \bar{l})^2 \lambda_2^{-i}(m) \tag{33}$$

and

$$\bar{l} = \sum_{i=1}^{m} i \lambda_2^{-i}(m). \tag{34}$$

Proof: The probability of occurrence of a run of length $k$, $k \le m$, is $\lambda_2^{-k}(m)$, see [10], Chapter 4, so that the average number of runs in a sequence of $n$ symbols is $n/\bar{l}$. The weight $w$ is the sum of the runlengths of ones, so that according to the central limit theorem the weight distribution is approximately Gaussian for large $n$ with mean $\frac{n}{2}$ and variance $\frac{\gamma_2(m) n}{4}$.

TABLE VII
COEFFICIENT γ2(m) AND γ4(m) VERSUS MAXIMUM HOMOPOLYMER RUN m.

m     γ2(m)     γ4(m)
1     -         0.5000
2     0.1708    0.7410
3     0.3449    0.8796
4     0.5059    0.9497
5     0.6426    0.9808
10    0.9565    0.9999
∞     1         1

Table VII shows results of computations (the parameter $\gamma_4(m)$ is explained in Section V-B). Perusal of the outcomes clearly demonstrates that for small values of $m$ the unbalance variance, $\gamma_2(m) n / 4$, is smaller than that of unconstrained sequences (that is, $m = \infty$) of the same length $n$. In other words, a maximum runlength 'helps' to reduce the expected unbalance.
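The coefficient $\gamma_2(m)$ of (33) and (34) can be evaluated with a few lines of code. The sketch below finds $\lambda_2(m)$ by bisection on the characteristic equation (20); the function names are hypothetical.

```python
def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q x^m + q - 1 = 0, found by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def gamma2(m):
    """gamma_2(m): variance of the runlength divided by its mean, see (33) and (34)."""
    lam = largest_root(2, m)
    probabilities = {i: lam ** -i for i in range(1, m + 1)}  # runlength distribution
    mean = sum(i * p for i, p in probabilities.items())
    variance = sum((i - mean) ** 2 * p for i, p in probabilities.items())
    return variance / mean


for m in (2, 3, 4, 5, 10):
    print(m, round(gamma2(m), 4))
```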

B. Counting quaternary RLL sequences of given weight

We count the number of n-tuples xof 4-ary symbols

that satisfy a maximum run length constraint, m, and

have weight w=w4(x), denoted (with a slight abuse of

notational convention) by N4(m, w, n).

1) Maximum runlength constraint: For the special case

m= 1, Limbachiya [24] et al. presented a closed expression

of N4(1, w, n). For other values of the prescribed maximum

runlength, m, we may readily compute the number of 4-

ary sequences, N4(m, w, n), versus weight, w=w4(x),

by applying generating functions. The 4-ary symbols are

generated by a constrained data source that can be modelled

as a four-state Moore-type finite-state machine. The machine steps from state to state, where, when state $i \in \mathcal{Q}$ is visited, a sequence of $k$, $1 \le k \le m$, symbols 'i' is emitted. After visiting state $i$, the data source may not return to state $i$ (and thus emit a sequence of the same symbol 'i' again), but it enters state $j \ne i$, $j \in \mathcal{Q}$. When the machine enters state 3 or 4, the word weight, $w$, is incremented by $k$, where $k$, $1 \le k \le m$, denotes the run of symbols '3' or '4'. When, on the other hand, states 1 or 2 are entered, the weight increment is nil. The resulting $4 \times 4$ one-step skeleton or state-transition matrix, $D(x, y)$, of the finite-state machine is

$$D(x, y) = \begin{pmatrix} 0 & a_0 & a_0 & a_0 \\ a_0 & 0 & a_0 & a_0 \\ a_1 & a_1 & 0 & a_1 \\ a_1 & a_1 & a_1 & 0 \end{pmatrix}, \tag{35}$$

where $a_0 = T(x)$ and $a_1 = T_1(x, y)$.

Theorem 5: The number of 4-ary sequences of length $n$ with maximum runlength constraint $m$ and weight $w$ equals

$$N_4(m, w, n) = [x^n y^w] \frac{1}{3} \sum_{i,j} d^{[n]}_{i,j}(x, y), \tag{36}$$

where $d^{[n]}_{i,j}(x, y)$ denotes the entries of $D^n(x, y)$.

Proof: The entries $d^{[n]}_{i,j}(x, y)$ of $D^n(x, y)$ are equal to the number of sequences (paths) of length $n$ starting in state $i$ and ending in state $j$. Summation of the entries and division by 3 yields the generating function of $N_4(m, w, n)$.
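As a cross-check of the numbers $N_4(m, w, n)$, one may also count the sequences directly. The sketch below is not the matrix formulation of Theorem 5; it is a plain dynamic program over (last symbol, current run length, weight), assuming the symbol set {0, 1, 2, 3} with weight contributed by the symbols 2 and 3. The function name is hypothetical.

```python
from collections import defaultdict


def count_quaternary_rll_by_weight(m, n):
    """Return {w: N_4(m, w, n)} for n >= 1 by extending admissible words symbol by symbol."""
    # State: (last symbol, length of the current run, weight so far) -> number of words.
    states = {(s, 1, 1 if s > 1 else 0): 1 for s in range(4)}
    for _ in range(n - 1):
        extended = defaultdict(int)
        for (last, run, weight), count in states.items():
            for s in range(4):
                new_weight = weight + (1 if s > 1 else 0)
                if s != last:
                    extended[(s, 1, new_weight)] += count
                elif run < m:
                    extended[(s, run + 1, new_weight)] += count
        states = extended
    result = defaultdict(int)
    for (_, _, weight), count in states.items():
        result[weight] += count
    return dict(result)


distribution = count_quaternary_rll_by_weight(3, 5)
print(sorted(distribution.items()))
```

Summing the distribution over all weights recovers $N_4(m, n)$ of Theorem 1, and the distribution is symmetric around $n/2$ because exchanging the roles of GC and AT maps admissible words onto admissible words.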

In the next subsection, we derive a simple approximation

to N4(m, w, n)valid for large n.

2) Estimate of the weight distribution: For asymptoti-

cally large n, the weight distribution is approximately Gaus-

sian, that is, we may conveniently approximate N4(m, w, n)

using the next Theorem.

Theorem 6:

$$N_4(m, w, n) \sim \mathcal{G}\left(w; \frac{n}{2}, \sigma_4^2(m, n)\right) N_4(m, n), \quad n \gg 1, \tag{37}$$

where $\sigma_4^2(m, n)$ denotes the variance of the Gaussian weight distribution.

Proof: Let $u_i$, $i = 1, 2, \ldots$, $u_i \in \mathcal{Q}$, be an infinitely long 4-ary sequence generated by a maxentropic source that satisfies a prescribed maximum runlength, $m$. Note that, although the 4-ary sequence $u_i$, $i = 1, 2, \ldots$, satisfies a limited runlength constraint, $m$, the runs of the binary weight sequence $v_i = \varphi(u_i)$, $i = 1, 2, \ldots$, are without limit. The variance, $\sigma_4^2(m, n)$, of the Gaussian weight distribution is governed by the runlength distribution, $P(k)$, of the binary sequence $v_i$, where $P(k)$, $k > 0$, denotes the probability of occurrence of a runlength $k$. Clearly, $\sum_{k>0} P(k) = 1$. The probability $P(k)$ is given by

$$P(k) = c N_2(m, k) \lambda_4^{-k}, \quad k \ge 1, \tag{38}$$

where the normalization constant $c$ is chosen such that $\sum_{k=1}^{\infty} P(k) = 1$. The term $N_2(m, k)$ is the number of AT combinations of length $k$, which may consist of a single A or T run or a plurality of alternating A and T runs. We have $\sigma_4^2(m, n) = \frac{\gamma_4(m) n}{4}$, where, see [10], Chapter 4,

$$\gamma_4(m) = \frac{1}{\bar{l}} \sum_{k=1}^{\infty} (k - \bar{l})^2 P(k) \tag{39}$$

and

$$\bar{l} = \sum_{k=1}^{\infty} k P(k). \tag{40}$$
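The coefficient $\gamma_4(m)$ can be approximated by truncating the infinite sums in (38) to (40). The sketch below assumes that a cut-off at 400 terms is sufficient, which is plausible because $P(k)$ decays geometrically; the function names and the cut-off are choices made for this illustration.

```python
def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q x^m + q - 1 = 0, found by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def binary_rll_counts(m, k_max):
    """List with N_2(m, k) at index k, from the recursion (14) with q = 2."""
    counts = [0] * (k_max + 1)
    for k in range(1, k_max + 1):
        counts[k] = 2 ** k if k <= m else sum(counts[k - j] for j in range(1, m + 1))
    return counts


def gamma4(m, k_max=400):
    """gamma_4(m) of (39), with P(k) from (38) truncated at k_max terms."""
    lam = largest_root(4, m)
    counts = binary_rll_counts(m, k_max)
    weights = [counts[k] * lam ** -k for k in range(1, k_max + 1)]
    norm = sum(weights)  # equals 1/c
    probabilities = [w / norm for w in weights]
    mean = sum(k * p for k, p in enumerate(probabilities, start=1))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probabilities, start=1))
    return variance / mean


print(round(gamma4(1), 4), round(gamma4(2), 4))
```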

Table VII shows results of computations of $\gamma_4(m)$ versus $m$. We may notice that $\gamma_4(m) > \gamma_2(m)$, so that the weights of the quaternary RLL sequences are less concentrated around the mean $n/2$ than those of binary RLL sequences. The above outcome is not consistent with the results by Erlich and Zielinski [3], as they assume that the balance variance equals $n/4$, independent of $m$.

C. Redundancy of binary and quaternary codes with com-

bined constraints

As in Constructions I and II, let the quaternary word $x = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, be written as $x = y + 2z$, where the constituting elements $y_i$ and $z_i \in \{0, 1\}$. If the binary sequence $z$ is m-constrained and has a weight $w = w_2(z)$, then $x$ is m-constrained and it has a weight $w_4(x) = w$. The redundancy of the binary constrained sequences, $z$, denoted (with a slight abuse of convention) by $r_2(m, a, n)$, equals

$$r_2(m, a, n) = n - \log_2 N_2(m, w, n). \tag{41}$$

Using (24) and (32), we obtain for $n \gg 1$, that the redundancy of the binary approach is

$$r_2(m, a, n) \sim r_2(m, n) - \log_2\left(1 - 2Q\left(2a\sqrt{\frac{n}{\gamma_2(m)}}\right)\right). \tag{42}$$

The redundancy of the quaternary approach, denoted by $r_4(m, a, n)$, equals, for $n \gg 1$,

$$r_4(m, a, n) = \log_2 \frac{4^n}{N_4(m, w, n)} \sim r_4(m, n) - \log_2\left(1 - 2Q\left(2a\sqrt{\frac{n}{\gamma_4(m)}}\right)\right). \tag{43}$$

A numerical analysis of the above expressions shows that the redundancy difference due to the balance (right-hand) term is around 0.5-1 bit for $m = 2$. For larger values of the maximum homopolymer run $m$, the extra redundancy is negligible. The redundancy difference, $r_2(m, n) - r_4(m, n)$, due to the imposed runlength constraint is, for $n > 10$, much larger than the redundancy due to the balance constraint. For $m > 6$ the difference between $r_2(m, n)$ and $r_4(m, n)$ is negligible, see Subsection IV-B. Considering the much larger look-up tables needed for quaternary codes, the binary approach using Construction 1 for combined constraints is therefore preferable from a practical point of view.
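The size of the balance term can be explored with a few lines of Python. The sketch assumes that the balance term of (42) and (43) has the form $-\log_2[1 - 2Q(2a\sqrt{n/\gamma})]$, with $Q$ the Gaussian tail function, and it uses the identity $1 - 2Q(x) = \operatorname{erf}(x/\sqrt{2})$. The parameter values in the example are placeholders, not values taken from Table VII.

```python
from math import erf, log2, sqrt


def balance_term(a, n, gamma):
    """Extra redundancy in bits: -log2[1 - 2Q(2a*sqrt(n/gamma))] (assumed form)."""
    x = 2.0 * a * sqrt(n / gamma)
    # 1 - 2Q(x) equals erf(x / sqrt(2)) for the standard Gaussian tail Q.
    return -log2(erf(x / sqrt(2.0)))


def balance_difference(a, n, gamma2, gamma4):
    """Binary balance term minus quaternary balance term."""
    return balance_term(a, n, gamma2) - balance_term(a, n, gamma4)


# Placeholder inputs: a = 0.05, n = 100, and two illustrative gamma values.
print(balance_term(0.05, 100, 1.0))
print(balance_difference(0.05, 100, 1.0, 0.75))
```

A smaller $\gamma$ means a more concentrated weight distribution, so a larger fraction of the constrained words already meets the balance condition and the term shrinks.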

VI. CONCLUSIONS

We have described coding techniques for weakly balancing the GC and AT content and avoiding homopolymer runs longer than $m$ nt in quaternary DNA strings. We have found exact and approximate expressions for the number of binary and quaternary sequences with combined weight and run-length constraints. We have compared two coding approaches for constraint-based coding of DNA strings. In the first approach, an intermediate, ‘binary’, coding step is used, while in the second approach we ‘directly’ translate source data into constrained quaternary sequences. The binary approach is attractive as it yields a lower complexity of encoding and decoding look-up tables. The redundancy of the binary approach is higher than that of the quaternary approach for generating combined weight and run-length constrained sequences. The redundancy difference is small for larger values of the maximum homopolymer run.

REFERENCES

[1] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, p. 1628, 2012.

[2] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso, B. W. Pruitt, and G. M. Church, “Forward Error Correction for DNA Data Storage,” International Conference on Computational Science (ICCS 2016), vol. 80, pp. 1011-1022, 2016.

[3] Y. Erlich and D. Zielinski, “DNA Fountain enables a robust and efficient storage architecture,” Science, vol. 355, pp. 950-954, March 2017.

[4] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig, “A DNA-based Archival Storage System,” ACM SIGOPS Operating Systems Review, vol. 50, pp. 637-649, 2016.

[5] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, “Characterizing and Measuring Bias in Sequence Data,” Genome Biol., vol. 14, R51, 2013.

[6] K. W. Cattermole, “Principles of Digital Line Coding,” Int. Journal of Electronics, vol. 55, pp. 3-33, July 1983.

[7] K. A. S. Immink and K. Cai, “Design of Capacity-Approaching Constrained Codes for DNA-based Storage Systems,” IEEE Commun. Letters, vol. 22, pp. 224-227, Feb. 2018.

[8] Y.-S. Kim and S.-H. Kim, “New Construction of DNA Codes with Constant-GC Contents from Binary Sequences with Ideal Autocorrelation,” IEEE International Symposium on Information Theory (ISIT), pp. 1569-1573, 2011.

[9] Y.-M. Chee and S. Ling, “Improved Lower Bounds for Constant GC-Content DNA Codes,” IEEE Trans. Inform. Theory, vol. IT-54, no. 1, pp. 391-394, Jan. 2008.

[10] K. A. S. Immink, Codes for Mass Data Storage Systems, Second Edition, ISBN 90-74249-27-2, Shannon Foundation Publishers, Eindhoven, Netherlands, 2004.

[11] V. Taranalli, H. Uchikawa, and P. H. Siegel, “Error Analysis and Inter-Cell Interference Mitigation in Multi-Level Cell Flash Memories,” Proceedings IEEE International Conference on Communications (ICC), London, pp. 271-276, June 2015.

[12] S. M. H. T. Yazdi, H. M. Kiah, and O. Milenkovic, “Weakly Mutually Uncorrelated Codes,” IEEE International Symposium on Information Theory (ISIT), pp. 2649-2653, Barcelona, Spain, July 2016.

[13] A. X. Widmer and P. A. Franaszek, “A DC-balanced, Partitioned-Block, 8b/10b Transmission Code,” IBM J. Res. Develop., vol. 27, no. 5, pp. 440-451, Sept. 1983.

[14] D. E. Knuth, “Efficient Balanced Codes,” IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan. 1986.

[15] J. H. Weber and K. A. S. Immink, “Knuth’s Balancing of Codewords Revisited,” IEEE Trans. Inform. Theory, vol. 56, no. 4, pp. 1673-1679, 2010.

[16] S. W. McLaughlin, J. Luo, and Q. Xie, “On the Capacity of M-ary Runlength-Limited Codes,” IEEE Trans. Inform. Theory, vol. IT-41, no. 5, pp. 1508-1511, Sept. 1995.

[17] M. W. Marcellin and H. J. Weber, “Two-dimensional Modulation Codes,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 1, pp. 254-266, Jan. 1992.

[18] C. E. Shannon, “A Mathematical Theory of Communication,” Bell Syst. Tech. J., vol. 27, pp. 379-423, July 1948.

[19] P. Flajolet and R. Sedgewick, Analytic Combinatorics, ISBN 978-0-521-89806-5, Cambridge University Press, 2009.

[20] K. J. Kerpez, A. Gallopoulos, and C. Heegard, “Maximum Entropy Charge-Constrained Run-Length Codes,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 1, pp. 242-253, Jan. 1992.

[21] V. Braun and K. A. S. Immink, “An Enumerative Coding Technique for DC-free Runlength-Limited Sequences,” IEEE Trans. on Communications, vol. 48, no. 12, pp. 2024-2031, Dec. 2000.

[22] O. Kurmaev, “Constant-Weight and Constant-Charge Binary Run-Length Limited Codes,” IEEE Trans. Inform. Theory, vol. IT-57, no. 7, pp. 4497-4515, July 2011.

[23] F. Paluncic and B. T. J. Maharaj, “Using Bivariate Generating Functions to Count the Number of Balanced Runlength-Limited Words,” IEEE Globecom 2017, Singapore, 4-8 Dec. 2017.

[24] D. Limbachiya, M. K. Gupta, and V. Aggarwal, “Family of Constrained Codes for Archival DNA Data Storage,” IEEE Communications Letters, August 2018.