Available via license: CC BY 4.0
Received February 27, 2020, accepted March 7, 2020, date of publication March 11, 2020, date of current version March 19, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2980036
Properties and Constructions of Constrained
Codes for DNA-Based Data Storage
KEES A. SCHOUHAMER IMMINK 1, (Life Fellow, IEEE),
AND KUI CAI 2, (Senior Member, IEEE)
1Turing Machines Inc., 3016 DK Rotterdam, The Netherlands
2Singapore University of Technology and Design (SUTD), Singapore 487372
Corresponding author: Kees A. Schouhamer Immink (immink@turing-machines.com)
This work was supported by the Singapore Ministry of Education Academic Research Fund Tier 2 under Grant MOE2016-T2-2-054.
ABSTRACT We describe properties and constructions of constraint-based codes for DNA-based data
storage which account for the maximum repetition length and AT/GC balance. Generating functions and
approximations are presented for computing the number of sequences with maximum repetition length and
AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced
strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of
codes that account for both the maximum repetition length and AT/GC balance are derived.
INDEX TERMS Constrained coding, maximum runlength, balanced words, storage systems, DNA-based
storage.
I. INTRODUCTION
The first large-scale archival DNA-based storage archi-
tecture was implemented by Church et al. [1] in 2012.
Blawat et al. [2] described successful experiments for storing
and retrieving data blocks of 22 Mbyte of digital data in
synthetic DNA. Erlich and Zielinski [3] further explored the
limits of storage capacity of DNA-based storage architectures. Recent examples of experimental work on DNA-based
storage can be found in [4]–[6].
Naturally occurring DNA consists of four types of
nucleotides: adenine (A), cytosine (C), guanine (G), and
thymine (T). A DNA strand (or oligonucleotide, or oligo for
short) is a linear sequence of these four nucleotides that is
composed by DNA synthesizers. Binary source, or user, data
are translated into the four types of nucleotides, for example,
by mapping two binary source symbols onto a single nucleotide
(nt for short).
Strings of nucleotides should satisfy a few elementary
conditions, called constraints, in order to be less error
prone. Repetitions of the same nucleotide, a homopoly-
mer run, significantly increase the chance of sequencing
errors [7], [8], so that such long runs should be avoided.
For example, in [8], experimental studies show that once the
homopolymer run is larger than four nt, the sequencing error
rate starts increasing significantly. In addition, [8] also reports
that oligos with large unbalance between GC and AT content
exhibit high dropout rates and are prone to polymerase chain
reaction (PCR) errors, and should therefore be avoided.
The associate editor coordinating the review of this manuscript and
approving it for publication was Nadeem Iqbal.
Blawat’s format [2] incorporates a constrained code that
uses a look-up table for translating binary source data
into strands of nucleotides with a homopolymer run of
length at most three. Blawat’s format did not incorpo-
rate an AT/GC balance constraint. Strands that do not sat-
isfy both the maximum homopolymer run requirement and
the weak balance constraint are barred in Erlich’s coding
format [3].
In this paper, we describe properties and constructions
of quaternary constraint-based codes for DNA-based stor-
age which account for a maximum homopolymer run and
maximum unbalance between AT and GC contents. Binary
‘balanced’ and runlength limited sequences have found
widespread use in data communication and storage prac-
tice [9]. We show that constrained binary sequences can easily
be translated into constrained quaternary sequences, which
opens the door to a wealth of efficient binary code con-
structions for application in DNA-based storage [10]–[13].
A further advantage of the binary-to-binary translation
instead of a ‘direct’ binary-to-quaternary translation is the
lower complexity of encoding and decoding look-up tables.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 49523
K. A. S. Immink, K. Cai: Properties and Constructions of Constrained Codes for DNA-Based Data Storage
The disadvantage is, as we show, the loss in information
capacity of the binary versus the quaternary approach.
We start in Section II with a description of the limiting
properties and code constructions that impose a maximum
homopolymer run. We specifically compute and compare
the information capacity of binary versus ‘direct’ quaternary
coding techniques. In Section III, we enumerate the number
of binary and quaternary sequences with combined AT and
GC contents and run-length constraints. Section IV concludes
the paper.
II. MAXIMUM RUNLENGTH CONSTRAINT
Long repetitions of the same nucleotide (nt), called a
homopolymer run or runlength, may significantly increase
the chance of sequencing errors [7], [8], and should be
avoided. Avoiding long runs of the same nucleotide will result
in loss of information capacity, and codes are required for
translating arbitrary source data into constrained quaternary
strings. Binary runlength limited (RLL) codes have found
widespread application in digital communication and storage
devices since the 1950s [9], [14]. McLaughlin et al. [15] studied
multi-level runlength limited codes for optical recording.
An n-nucleotide oligo of 4-ary symbols can be seen
as two parallel binary strings of length n, where each 4-ary
symbol is represented by two binary symbols. Such a system
of multiple parallel data streams with joint constraints is
reminiscent of ‘two-dimensional’ track systems, which have
been studied by Marcellin and Weber [16].
We start in the next subsection with the counting of
q-ary sequences that satisfy a maximum runlength, followed
by subsections where we describe limiting properties and
code constructions that avoid m+1 repetitions of the same
nucleotide.
A. COUNTING q-ARY SEQUENCES, CAPACITY
Let the number of q-ary n-length sequences having a maximum
run, $m$, of the same symbol be denoted by $N_q(m,n)$.
The number $N_q(m,n)$ is found by using the recursive
relation [17, Part 1]:
$$N_q(m,n) = \begin{cases} q^n, & n \le m, \\ (q-1)\sum_{k=1}^{m} N_q(m,n-k), & n > m. \end{cases} \tag{1}$$
For $n \le m$ the above is trivial, as all sequences satisfy
the maximum runlength constraint. For $n > m$ we follow
Shannon’s approach [17] for the discrete noiseless channel.
A run of $k$ symbols $a$ can be seen as a ‘phrase’ $a$ of
length $k$. After a phrase $a$ has been emitted, a phrase of symbols
$b \ne a$ of length $k$ can be emitted without violating the
maximum runlength constraint imposed. The total number of
allowed sequences, $N_q(m,n)$, is equal to $(q-1)$ times the sum
of the numbers of sequences ending with a phrase of length
$k = 1, 2, \ldots, m$, which are equal to $N_q(m,n-k)$. Addition of
these numbers yields (1). Using the above
expression, we may easily compute the feasibility of a q-ary
m-constrained code for relatively small values of $n$ where a
coding look-up table is practicable, see Subsection II-C for
more details.
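As an illustration, recursion (1) can be evaluated directly. The following is a minimal sketch; the function name `count_rll` is ours and not part of the paper.

```python
def count_rll(q, m, n):
    """Return N_q(m, n) for n >= 1 using recursion (1)."""
    counts = [0] * (n + 1)  # counts[k] = N_q(m, k); index 0 is unused
    for length in range(1, n + 1):
        if length <= m:
            # Short sequences cannot violate the run constraint.
            counts[length] = q ** length
        else:
            # Sum over the length k = 1..m of the final run, as in (1).
            counts[length] = (q - 1) * sum(counts[length - k] for k in range(1, m + 1))
    return counts[n]


if __name__ == "__main__":
    print(count_rll(4, 3, 5))   # 996
    print(count_rll(4, 2, 10))  # 676836
```

Both printed values agree with the numbers quoted elsewhere in this paper for $N_4(3,5)$ and $N_4(2,10)$.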
1) GENERATING FUNCTIONS
Generating functions are a very useful tool for enumerating
constrained sequences [18], and they offer tools for approximating
the number of constrained sequences for asymptotically
large values of the sequence length $n$. The series of
numbers $\{N_q(m,n)\}$, $n = 1, 2, \ldots$, in (1), can be compactly
written as the coefficients of a formal power series
$H_{q,m}(x) = \sum_i N_q(m,i)\,x^i$, where $x$ is a dummy variable. There is a simple
relationship between the generating function, $H_{q,m}(x)$, and
the linear homogeneous recurrence relation (1) with constant
coefficients that defines the same series [18]. We first define
a generating function
$$G(x) = \sum_i g_i x^i. \tag{2}$$
Let the operation $[x^n]G(x)$ denote the extraction of the coefficient
of $x^n$ in the formal power series $G(x)$, that is, define
$$[x^n] \sum_i g_i x^i = g_n. \tag{3}$$
Let
$$T(x) = \sum_{i=1}^{m} x^i. \tag{4}$$
The generating function for the number of q-ary sequences
with a maximum runlength $m$ is
$$qT(x) + q(q-1)T(x)^2 + q(q-1)^2 T(x)^3 + \cdots.$$
We may rewrite the above as
$$\frac{qT(x)}{1-(q-1)T(x)},$$
so that the number of n-symbol m-constrained q-ary words is
$$N_q(m,n) = [x^n]\,\frac{qT(x)}{1-(q-1)T(x)}. \tag{5}$$
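The coefficient extraction in (5) can be checked numerically by summing the series $qT + q(q-1)T^2 + \cdots$ with truncated polynomial arithmetic. This is a sketch under our own naming (`poly_mul`, `rll_series`); since $T(x)^j$ has lowest degree $j$, terms with $j > n$ do not affect $[x^n]$ and the sum can stop there.

```python
def poly_mul(a, b, degree):
    """Multiply two coefficient lists, discarding powers above x**degree."""
    out = [0] * (degree + 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if i + j > degree:
                break
            out[i + j] += ai * bj
    return out


def rll_series(q, m, n):
    """Coefficients of x^0..x^n of q*T + q(q-1)*T^2 + q(q-1)^2*T^3 + ..."""
    t = [1 if 1 <= i <= m else 0 for i in range(n + 1)]  # T(x) from (4)
    total = [0] * (n + 1)
    power = t[:]   # holds T(x)^j, starting with j = 1
    scale = q      # holds q * (q-1)^(j-1)
    for _ in range(n):
        total = [s + scale * p for s, p in zip(total, power)]
        power = poly_mul(power, t, n)
        scale *= q - 1
    return total


if __name__ == "__main__":
    print(rll_series(4, 2, 10)[10])  # 676836
```

The result matches the value obtained from recursion (1), as expected from the equivalence of the two descriptions.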
2) ASYMPTOTICAL BEHAVIOR
For asymptotically large codeword length n, the maximum
number of (binary) user bits that can be stored per q-ary
symbol, called (information) capacity, denoted by Cq(m),
is given by [17]
$$C_q(m) = \lim_{n\to\infty} \frac{1}{n}\log_2 N_q(m,n) = \log_2 \lambda_q(m), \tag{6}$$
where $\lambda_q(m)$ is the largest real root of the characteristic
equation [15], [17]
$$x^{m+1} - q x^m + q - 1 = 0. \tag{7}$$
Table 1 shows the information capacities $C_2(m)$ and $C_4(m)$
versus the maximum allowed (homopolymer) run $m$. For asymptotically
large $n$ we may approximate $N_q(m,n)$ by [18]
$$N_q(m,n) \approx A_q(m)\,\lambda_q^n(m). \tag{8}$$
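The capacity (6) can be evaluated numerically. The sketch below is one possible approach, not the authors' procedure: it removes the trivial root $x = 1$ of (7), which leaves $x^m - (q-1)(x^{m-1} + \cdots + 1)$, and locates the remaining positive root by bisection on $[1, q]$. The names `growth_rate` and `capacity` are ours.

```python
import math


def growth_rate(q, m, iterations=200):
    """Largest real root lambda_q(m) of x^(m+1) - q*x^m + q - 1 = 0."""
    def g(x):
        # (7) divided by (x - 1); non-positive at x = 1 and equal to 1 at x = q.
        return x ** m - (q - 1) * sum(x ** k for k in range(m))

    lo, hi = 1.0, float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def capacity(q, m):
    """C_q(m) = log2(lambda_q(m)), see (6)."""
    return math.log2(growth_rate(q, m))


if __name__ == "__main__":
    for m in range(1, 5):
        print(m, round(capacity(2, m), 4), round(capacity(4, m), 4))
```

For $q = 4$, $m = 3$ this gives a value that rounds to 1.9824, the capacity quoted in Section II-C.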
TABLE 1. Capacity C2(m) and C4(m) versus m.
TABLE 2. Coefficient A2(m) and A4(m) versus m.
The coefficient $A_q(m)$ is found, see [14, pp. 157-158],
by rewriting $H_{q,m}(x)$ as a quotient of two polynomials,
or $H_{q,m}(x) = r(x)/p(x)$. Then
$$A_q(m) = -\lambda_q(m)\,\frac{r(1/\lambda_q(m))}{p'(1/\lambda_q(m))}. \tag{9}$$
Table 2 shows the coefficients $A_2(m)$ and $A_4(m)$ versus $m$.
For $m = 1$, we simply find $N_4(1,n) = 4 \cdot 3^{n-1}$. We found that
the approximation (8) is remarkably accurate. For a typical
example, $N_4(2,10) = 676836$, while the approximation
using (8) yields $N_4(2,10) \approx 676835.9769$. The redundancy
of a 4-ary string of length $n$ with a maximum runlength $m$,
denoted by $r_4(m,n)$, is, using (8),
$$r_4(m,n) = 2n - \log_2 N_4(m,n) \approx n\,(2 - C_4(m)) - \log_2 A_4(m). \tag{10}$$
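A numerical sketch of (8)-(10), assuming $r(x) = qT(x)$ and $p(x) = 1 - (q-1)T(x)$ as read off from (5). The helper names are ours, and the root finder is the same bisection idea applied to (7) after removing the trivial root $x = 1$.

```python
import math


def growth_rate(q, m, iterations=200):
    """Largest real root of (7), found by bisection on [1, q]."""
    def g(x):
        return x ** m - (q - 1) * sum(x ** k for k in range(m))

    lo, hi = 1.0, float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def coefficient(q, m):
    """A_q(m) of (9) with r(x) = q*T(x) and p(x) = 1 - (q-1)*T(x)."""
    lam = growth_rate(q, m)
    x0 = 1.0 / lam
    r = q * sum(x0 ** i for i in range(1, m + 1))
    dp = -(q - 1) * sum(i * x0 ** (i - 1) for i in range(1, m + 1))  # p'(x0)
    return -lam * r / dp


def approx_count(q, m, n):
    """Approximation (8): A_q(m) * lambda_q(m)^n."""
    return coefficient(q, m) * growth_rate(q, m) ** n


def redundancy_4ary(m, n):
    """Approximation (10): n*(2 - C_4(m)) - log2(A_4(m))."""
    return n * (2 - math.log2(growth_rate(4, m))) - math.log2(coefficient(4, m))


if __name__ == "__main__":
    print(approx_count(4, 2, 10))   # close to the exact count 676836
    print(redundancy_4ary(2, 10))
```

For $m = 1$ the sketch reproduces $A_4(1) = 4/3$, so that (8) coincides with the exact count $4 \cdot 3^{n-1}$.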
B. BINARY-BASED RLL CODE CONSTRUCTION,
CONSTRUCTION I
Yazdi et al. [19] and Taranalli et al. [20] showed that we
may exploit binary maximum runlength limited (RLL) codes
for constructing quaternary RLL codes. Their construction,
denoted by Construction 1, exemplifies such a technique for
m>1. The construction is simple, but we show below that
this simplicity has its price in terms of extra redundancy.
Construction 1: Let $u = (u_1, \ldots, u_n)$ be an n-bit RLL
string. We merge the RLL n-bit string, $u$, with an n-bit source
string $y = (y_1, \ldots, y_n)$, by using the addition $v_i = u_i + 2y_i$,
$1 \le i \le n$, where $v = (v_1, \ldots, v_n)$, $v_i \in Q$, is the 4-ary output
string. It is easily verified that the 4-ary output string, $v$, has
maximum allowed run $m$, the same as the binary string $u$.
The number of distinct 4-ary sequences, $v$, of
Construction 1 equals $2^n N_2(m,n)$, so that the redundancy,
denoted by $r_2(m,n)$, is
$$r_2(m,n) \approx n\,(1 - C_2(m)) - \log_2 A_2(m). \tag{11}$$
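The merging step of Construction 1 is easily sketched in code. The function names `merge`, `split`, and `max_run` are ours; the example strings are arbitrary.

```python
def max_run(seq):
    """Length of the longest run of identical symbols in a non-empty sequence."""
    longest = current = 1
    for prev, cur in zip(seq, seq[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def merge(u, y):
    """Construction 1: v_i = u_i + 2*y_i, with u an RLL string and y free source bits."""
    if len(u) != len(y):
        raise ValueError("u and y must have the same length")
    return [ui + 2 * yi for ui, yi in zip(u, y)]


def split(v):
    """Inverse of merge: recover (u, y) from the 4-ary string v."""
    return [vi % 2 for vi in v], [vi // 2 for vi in v]


u = [0, 1, 1, 0, 0, 1, 0, 1]  # binary string with maximum run m = 2
y = [1, 1, 1, 1, 1, 1, 1, 1]  # source bits; an all-equal string is the worst case
v = merge(u, y)
print(v, max_run(v))
```

Two consecutive symbols of $v$ can only be equal if the corresponding symbols of $u$ are equal, which is why the run constraint on $u$ carries over to $v$ for any choice of $y$.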
TABLE 3. Asymptotic rate efficiency, η(m), of binary Construction 1 versus
maximum homopolymer run, m.
TABLE 4. Rate efficiency, Rm,0/C4(m), of binary Construction 1 versus
strand length, n, and maximum homopolymer run, m.
The rate efficiency with respect to the runlength limited 4-ary
channel, denoted by $\eta(m)$, is expressed by
$$\eta(m) = \frac{1 + C_2(m)}{C_4(m)}. \tag{12}$$
Table 3 lists results of computations. We may notice that
Construction 1 will suffer a loss of up to 12% for $m = 2$.
For larger values of $m$, however, the loss is negligible.
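The efficiency (12) follows directly from the two capacities. A minimal sketch, with our own helper names and the capacities computed by bisection on (7) after removing the trivial root $x = 1$:

```python
import math


def growth_rate(q, m, iterations=200):
    """Largest real root of (7), found by bisection on [1, q]."""
    def g(x):
        return x ** m - (q - 1) * sum(x ** k for k in range(m))

    lo, hi = 1.0, float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def efficiency(m):
    """eta(m) = (1 + C_2(m)) / C_4(m), see (12)."""
    c2 = math.log2(growth_rate(2, m))
    c4 = math.log2(growth_rate(4, m))
    return (1 + c2) / c4


if __name__ == "__main__":
    for m in range(2, 7):
        print(m, round(efficiency(m), 4))
```

For $m = 2$ the computed efficiency is about 0.88, consistent with the loss of roughly 12% noted above.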
The above asymptotic efficiency of Construction 1, η(m),
is valid for very large values of the strand length n. It is of
practical interest to assess the efficiency for smaller values of
the strand length. Construction 1 can be used with any binary
RLL code, and there are many binary code constructions
for generating maximum runlength constrained sequences,
see [14] for an overview. We propose here, for the efficiency
assessment, a simple two-mode block code of codeword
length n. Runlength constrained codewords in the first mode
start with a symbol ‘zero’, while codewords in the second
mode start with a ‘one’. When the previously sent codeword
ends with a ‘one’ we use the codewords from the first mode
and vice versa. The number of binary source words that can
be accommodated with Construction 1 equals $2^{n-1} N_2(m,n)$,
so that the code rate, denoted by $R_{m,0}$, is
$$R_{m,0} = \frac{1}{n}\left(n - 1 + \lfloor \log_2 N_2(m,n) \rfloor\right), \tag{13}$$
where we truncated the code size to the largest power of two.
Table 4 shows selected outcomes of computations of the rate
efficiency $R_{m,0}/C_4(m)$ versus $m$ and $n$.
C. ENCODING OF QUATERNARY SEQUENCES WITHOUT
BINARY STEP
In this subsection, we investigate two simple constructions
of codes that transform binary source words directly (that
is, without an intermediate binary coding step) into 4-ary
maximum homopolymer constrained codewords. An example
of a simple 4-ary block code was presented by
Blawat et al. [2]. The code converts 8 source bits into a
4-ary word of 5 nt. The 5-nt words can be cascaded without
violating the prescribed $m = 3$ maximum homopolymer
run. The rate of Blawat’s construction is $R = 8/5 = 1.6$.
As $C_4(3) = 1.9824$, see Table 1, the (rate) efficiency of
the construction is $R/C_4(3) = 0.807$. Alternative, and more
efficient, constructions are described below.
TABLE 5. Rate efficiency, $R_{m,1}/C_4(m)$, of the two-mode code construction
versus strand length, $n$, and maximum homopolymer run, $m$.
In the first construction, denoted by two-mode construc-
tion, each source word can be represented by one of two
possible codewords, where the codeword sent is chosen to
satisfy the runlength constraint at the junction of two cas-
caded codewords. Decoding is accomplished by observing
the n-symbol codeword. In the second, slightly more efficient,
construction, denoted by four-mode construction, a source
word can be represented by four possible codewords. Decod-
ing is accomplished by observing the n-symbol codeword
plus the last symbol of the previous codeword.
1) TWO-MODE CONSTRUCTION
In this format, a source word can be represented by two
n-symbol 4-ary m-constrained codewords, where the alter-
native representations differ at the first position. In case we
append a new codeword to the previous codeword, we are
always able to choose (at least) one representation whose first
symbol differs from the last symbol of the previous codeword.
Then, clearly, the cascaded string of 4-ary symbols satisfies
the prescribed maximum homopolymer run constraint. The
rate of this two-mode construction, denoted by $R_{m,1}$, is
$$R_{m,1} = \frac{1}{n}\left(\lfloor \log_2 N_4(m,n) \rfloor - 1\right), \tag{14}$$
where we truncated the code size to the largest power of two
possible. Table 5 shows outcomes of computations of the rate
efficiency $R_{m,1}/C_4(m)$ versus $m$ and $n$. We observe that, for
$m = 2$, the ‘quaternary’ efficiency $R_{2,1}/C_4(2)$ is slightly
better than the ‘binary’ $R_{2,0}/C_4(2)$, see Table 4. For $m > 2$,
both approaches have the same efficiency. The conversion
of the binary source symbols into the 4-ary n-nt strands and
vice versa can be accomplished using two look-up tables of
complexity $4^n$.
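The rates (13) and (14) can be compared with a few lines of code. The sketch below uses exact integer and rational arithmetic; all function names are ours.

```python
from fractions import Fraction


def rll_counts(q, m, n):
    """List c with c[k] = N_q(m, k) for 1 <= k <= n, from recursion (1)."""
    c = [0] * (n + 1)
    for k in range(1, n + 1):
        c[k] = q ** k if k <= m else (q - 1) * sum(c[k - j] for j in range(1, m + 1))
    return c


def floor_log2(value):
    """Exact floor(log2(value)) for a positive integer."""
    return value.bit_length() - 1


def rate_binary(m, n):
    """R_{m,0} of (13): Construction 1 with the two-mode binary block code."""
    return Fraction(n - 1 + floor_log2(rll_counts(2, m, n)[n]), n)


def rate_two_mode(m, n):
    """R_{m,1} of (14): direct two-mode quaternary block code."""
    return Fraction(floor_log2(rll_counts(4, m, n)[n]) - 1, n)


if __name__ == "__main__":
    for m, n in [(2, 10), (3, 5)]:
        print(m, n, rate_binary(m, n), rate_two_mode(m, n))
```

For $m = 2$, $n = 10$ the quaternary rate (9/5) exceeds the binary one (8/5), whereas for $m = 3$, $n = 5$ both equal 8/5.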
2) FOUR-MODE CONSTRUCTION
In the above two-mode construction, the encoded codeword
depends on the last symbol of the previous codeword. Decoding,
however, is based on the observation of the $n$ symbols
of the retrieved codeword. In the second construction,
the codeword also depends on the last symbol of the previous
codeword. Decoding, however, is accomplished by observing
the $n$ symbols of the retrieved codeword plus the last symbol
of the previous codeword. To that end, we define four tables
of codewords, denoted by $L(i,a)$, where $i$, $1 \le i \le K$,
denotes the decimal representation of the source word to be
encoded, $K$ denotes the size of the table, and $a$ denotes the
last symbol of the previous codeword. The four tables are
constructed in such a way that the codewords in each table
$L(i,a)$ do not start with the symbol $a$. As a result, the encoder
always generates a symbol transition between the tail and
nose symbols of consecutive codewords. The maximum size
of the four tables equals $K = \frac{3}{4} N_4(m,n)$ (note that $N_4(m,n)$
is a multiple of 4). Table 6 shows a simple example of the
encoding tables of a four-mode code for $n = 2$ and $m = 2$.
The size of this code equals $K = 12$. Let, for example,
the source sequence be ‘0’, ‘1’, ‘3’, ‘6’. Then, using the
table, the encoded sequence is ‘10’, ‘11’, ‘03’, ‘22’. We may
simply verify that the maximum runlength is $m = 2$. The
code size is $K = 12$, while the code size of the two-mode
code with $m = n = 2$ described above equals $16/2 = 8$. The
table shows that the codeword ‘00’ is assigned to three source
words, namely ‘0’, ‘4’, and ‘8’, so that ‘00’ cannot be decoded
unambiguously by observing the codeword alone. Observation of
the retrieved codeword plus the last symbol of the previous
codeword solves the ambiguity.
TABLE 6. Encoding tables of a four-mode code for $n = 2$ and $m = 2$. The
parameter $i$ denotes the (decimal) representation of the source word. The
tables $L(i,a)$, $a = 0, 1, 2, 3$, show the corresponding codeword, where $a$
denotes the last symbol of the previous codeword.
The rate of this four-mode construction, denoted by $R_{m,2}$,
is
$$R_{m,2} = \frac{1}{n}\log_2\left(\frac{3}{4} N_4(m,n)\right). \tag{15}$$
Table 7 shows the rate efficiency of the four-mode construction.
The efficiency improvement with respect to the
two-mode construction, see Table 5, is obtained at the cost
of four look-up tables instead of two.
TABLE 7. Rate efficiency, $R_{m,2}/C_4(m)$, of the four-mode construction
versus strand length, $n$, and maximum homopolymer run, $m$.
Example: Let (as in Blawat’s code [2]) $n = 5$ and $m = 3$.
We simply find, using (1), $N_4(3,5) = 996$, so that the code
may accommodate $K = \frac{3}{4} \times 996 = 747$ binary source
words. Since $K > 512 = 2^9$ we may implement a code of
rate 9/5, which is 12% higher than that of Blawat’s code of
rate 8/5. As we have the freedom of deleting $747 - 512 = 235$
redundant codewords, we may, for example, bar the words
with the highest unbalance.
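The four-mode principle can be sketched as follows. This is an illustration under our own assumptions, not the authors' table design: codewords are listed in lexicographic order (so the assignment differs from Table 6), source words are indexed from 0, the symbol preceding the first codeword is taken to be 0, and all function names are ours.

```python
from itertools import product


def max_run(seq):
    """Length of the longest run of identical symbols in a non-empty sequence."""
    longest = current = 1
    for prev, cur in zip(seq, seq[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def build_tables(n, m):
    """tables[a] lists the m-constrained n-symbol words that do not start with a."""
    words = [w for w in product(range(4), repeat=n) if max_run(w) <= m]
    return {a: [w for w in words if w[0] != a] for a in range(4)}


def encode(source, tables, last=0):
    """Map source indices to 4-ary symbols; `last` is the assumed initial symbol."""
    out = []
    for i in source:
        word = tables[last][i]
        out.extend(word)
        last = word[-1]
    return out


def decode(symbols, tables, n, last=0):
    """Recover source indices from each codeword plus the preceding symbol."""
    source = []
    for pos in range(0, len(symbols), n):
        word = tuple(symbols[pos:pos + n])
        source.append(tables[last].index(word))
        last = word[-1]
    return source


if __name__ == "__main__":
    tables = build_tables(5, 3)
    print(len(tables[0]))  # 747 codewords per table
    data = [0, 746, 511, 3, 250]
    strand = encode(data, tables)
    print(max_run(strand) <= 3, decode(strand, tables, 5) == data)
```

A practical code of rate 9/5 would use only 512 of the 747 entries of each table; which entries to discard (for example, the most unbalanced ones) is a design choice left open here.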
In the next section, we take a look at the combined AT and
GC contents balance and maximum polymer run constrained
codes.
III. COMBINED WEIGHT AND MAXIMUM RUN
CONSTRAINED CODES
Oligos with large unbalance between GC and AT content
exhibit high dropout rates and are prone to polymerase chain
reaction (PCR) errors, and should therefore be avoided.
Avoidance of such undesired sequences implies an extra
redundancy. In this section, we compute the redundancy of
binary and quaternary codes with combined RLL and AT/GC
constraints.
A. DEFINITION AT/GC CONTENT, BALANCE, AND WEIGHT
We use the nucleotide alphabet $Q = \{0, 1, 2, 3\}$, where
we propose the following relation between the four decimal
symbols and the nucleotides: $G = 0$, $C = 1$, $A = 2$, and
$T = 3$. The AT/GC content constraint stipulates that around
half of the nucleotides should be either an A or a T nucleotide.
In order to study AT-balanced nucleotides, we start with a few
definitions. We define the weight or AT-content, denoted by
$w_4(x)$, of the n-nucleotide oligo $x = (x_1, \ldots, x_n)$, $x_i \in Q$,
as the number of occurrences of A or T, or
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i), \tag{16}$$
where
$$\varphi(u) = \begin{cases} 0, & u < 2, \\ 1, & u > 1. \end{cases} \tag{17}$$
The weight of a binary word $x = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$,
denoted by $w_2(x)$, is defined by
$$w_2(x) = \sum_{i=1}^{n} \varphi(2x_i) = \sum_{i=1}^{n} x_i. \tag{18}$$
If we write the 4-ary word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, as
$x = y + 2z$, where $y_i$ and $z_i \in \{0, 1\}$, then
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i) = \sum_{i=1}^{n} \varphi(2z_i) = w_2(z). \tag{19}$$
Kerpez et al. [21], Braun and Immink [22], and Kurmaev [23]
analyzed properties and constructions of binary combined
weight and runlength constrained codes. Their results are
straightforwardly applied to the quaternary case at hand.
In the next subsections, we count binary and quaternary
sequences that satisfy combined maximum runlength and
weight constraints. We start by counting the number of binary
sequences, $x$, of length $n$ that satisfy a maximum runlength
constraint $m$ and have weight $w = w_2(x)$. Paluncic and
Maharaj [24] enumerated this number for the balanced case
$w = w_2(x) = n/2$.
B. COUNTING BINARY RLL SEQUENCES OF GIVEN
WEIGHT
Define the bi-variate generating function $H(x,y)$ in the
dummy variables $x$ and $y$ by
$$H(x,y) = \sum_{i,j} h_{i,j}\, x^i y^j, \tag{20}$$
and let $[x^{n_1} y^{n_2}]H(x,y)$ denote the extraction of the coefficient
of $x^{n_1} y^{n_2}$ in the formal power series $\sum h_{i,j} x^i y^j$, or
$$[x^{n_1} y^{n_2}] \sum_{i,j} h_{i,j}\, x^i y^j = h_{n_1,n_2}. \tag{21}$$
Define
$$T_1(x,y) = \sum_{i=1}^{m} x^i y^i. \tag{22}$$
Let the sequence start with a run of zeros, then the
generating function for the number of binary sequences with
a maximum runlength $m$ is
$$T(x) + T(x)T_1(x,y) + T(x)^2 T_1(x,y) + T(x)^2 T_1(x,y)^2 + \cdots.$$
In case the sequence starts with a run of ones, we obtain for
the generating function
$$T_1(x,y) + T(x)T_1(x,y) + T(x)T_1(x,y)^2 + T(x)^2 T_1(x,y)^2 + \cdots.$$
The generating function for the number of binary sequences
with a maximum runlength $m$ starting with a run of ones or a run
of zeros is the sum of the two above generating functions.
Working out the sum yields
$$\frac{T_1(x,y) + T(x) + 2T_1(x,y)T(x)}{1 - T_1(x,y)T(x)},$$
so that the number of n-bit codewords, $x$, with maximum
runlength $m$, denoted (with a slight abuse of notational convention
by adding an extra parameter) by $N_2(m,w,n)$, that
have a given weight $w = w_2(x)$ is given
by
$$N_2(m,w,n) = [x^n y^w]\,\frac{T_1(x,y) + T(x) + 2T_1(x,y)T(x)}{1 - T_1(x,y)T(x)}. \tag{23}$$
With the above bi-variate generating function, we may
exactly compute the number of binary m-constrained words
of weight $w$.
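The numbers $N_2(m,w,n)$ can also be obtained without symbolic algebra. The sketch below swaps in a dynamic program over the final run of the sequence, which enumerates the same alternating runs of zeros and ones that (23) describes; it is not an expansion of the generating function itself, and the function name is ours.

```python
def binary_rll_weight_counts(m, n):
    """Return a list c with c[w] = N_2(m, w, n) for w = 0..n."""
    # end[b][length][w]: sequences of this length and weight whose final run is symbol b
    end = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(2)]
    for length in range(1, n + 1):
        for b in (0, 1):
            for k in range(1, min(m, length) + 1):  # k = length of the final run
                dw = k * b                           # a run of ones adds k to the weight
                if k == length:
                    end[b][length][dw] += 1          # the sequence is a single run
                    continue
                for w in range(dw, n + 1):
                    # The preceding part must end in the other symbol.
                    end[b][length][w] += end[1 - b][length - k][w - dw]
    return [end[0][n][w] + end[1][n][w] for w in range(n + 1)]


if __name__ == "__main__":
    print(binary_rll_weight_counts(2, 4))  # [0, 2, 6, 2, 0]
```

The printed distribution sums to $N_2(2,4) = 10$ and is symmetric around $w = n/2$, as expected from the symmetry between zeros and ones.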
More insight is gained by an approximation of $N_2(m,w,n)$.
For a given maximum runlength, $m$, and asymptotically
large $n$, we are specifically interested in the distribution
of $\lim_{n\to\infty} N_2(m,w,n)/N_2(m,n)$ versus the weight $w$. The
weight $w$ of a binary sequence of length $n$ is the sum of
the runlengths of ones. The runlengths are random variables,
so that for asymptotically large $n$, according to the Central
Limit Theorem [18], the weight distribution approaches a
Gaussian distribution with mean $n/2$ and variance denoted
by $\sigma_2^2(m,n)$. Then
$$N_2(m,w,n) \approx G\left(w; \frac{n}{2}, \sigma_2^2(m,n)\right) N_2(m,n), \quad n \gg 1, \tag{24}$$
where
$$G(u; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2}, \tag{25}$$
denotes the Gaussian distribution. The variance, $\sigma_2^2(m,n)$,
of the Gaussian distribution is computed below.
1) COMPUTATION OF THE VARIANCE, $\sigma_2^2(m,n)$
Let $x$ be an infinitely long binary m-constrained sequence,
where the probabilities of occurrence of the runlengths of
zeros and ones are chosen to maximize the information
rate (entropy) of the sequence. The probability of occurrence
of a runlength of length $l$, $l \le m$, in a maxentropic sequence
equals $\lambda_2^{-l}(m)$, see [14, Chapter 4], where for $q = 2$, see (7),
$\sum_{l=1}^{m} \lambda_2^{-l}(m) = 1$. The average runlength, denoted by $\bar{l}$,
equals
$$\bar{l} = \sum_{i=1}^{m} i\,\lambda_2^{-i}(m). \tag{26}$$
The runlength variance of an m-constrained sequence,
denoted by $\mathrm{Var}(l)$, is
$$\mathrm{Var}(l) = \sum_{i=1}^{m} (i - \bar{l})^2\, \lambda_2^{-i}(m). \tag{27}$$
The weight variance, $\sigma_2^2(m,n)$, of the m-constrained sequence
is
$$\sigma_2^2(m,n) = \gamma_2(m)\,\frac{n}{4}, \tag{28}$$
where
$$\gamma_2(m) = \frac{\mathrm{Var}(l)}{\bar{l}}.$$
Table 8 shows results of computations (note that the
parameter $\gamma_4(m)$ is explained in Section III-C). In order
to verify the accuracy of the Gaussian approximation,
we have numerically compared it with the (accurate) outcomes
of the generating function. Figure 1 shows a comparison
between the accurate and approximate distributions,
$N_2(m,w,n)/N_2(m,n)$, for $n = 100$ and $m = 2, 3, 4$.
Except for the discrepancy in the tails of the distributions,
the accuracy of the Gaussian approximation is quite sufficient
for engineering applications. The Gaussian approximation is
accurate within a few percent within the two-sigma limits of
the distribution.
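The coefficient $\gamma_2(m)$ of (26)-(28) is straightforward to evaluate. A minimal sketch with our own helper names, where $\lambda_2(m)$ is obtained by bisection on (7) after removing the trivial root $x = 1$:

```python
def growth_rate(q, m, iterations=200):
    """Largest real root of (7), found by bisection on [1, q]."""
    def g(x):
        return x ** m - (q - 1) * sum(x ** k for k in range(m))

    lo, hi = 1.0, float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def gamma2(m):
    """gamma_2(m) = Var(l) / mean(l) for maxentropic runlengths, see (26)-(28)."""
    lam = growth_rate(2, m)
    lengths = range(1, m + 1)
    probs = [lam ** (-l) for l in lengths]                          # sums to 1
    mean = sum(l * p for l, p in zip(lengths, probs))               # (26)
    var = sum((l - mean) ** 2 * p for l, p in zip(lengths, probs))  # (27)
    return var / mean


def weight_variance(m, n):
    """sigma_2^2(m, n) of (28)."""
    return gamma2(m) * n / 4


if __name__ == "__main__":
    for m in range(1, 7):
        print(m, round(gamma2(m), 4))
```

Two limiting cases serve as a sanity check: for $m = 1$ the sequence alternates and the variance vanishes, while for large $m$ the coefficient approaches 1, the value for unconstrained binary sequences.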
TABLE 8. Coefficient γ2(m) and γ4(m) versus maximum homopolymer
run m.
FIGURE 1. Comparison of the weight distribution of
N2(m,w,n)/N2(m,n), using (a) the Gaussian distribution (24) and
(b) generating functions for n=100 and m=2,3,4.
C. COUNTING QUATERNARY RLL SEQUENCES OF GIVEN
WEIGHT
We count the number of n-tuples $x$ of 4-ary symbols that
satisfy a maximum runlength constraint, $m$, and have weight
$w = w_4(x)$, denoted (with a slight abuse of notational convention)
by $N_4(m,w,n)$.
1) MAXIMUM RUNLENGTH CONSTRAINT
For the special case $m = 1$, Limbachiya et al. [25] presented a
closed-form expression of $N_4(1,w,n)$. For other values of the
prescribed maximum runlength, $m$, we may readily compute
the number of 4-ary sequences, $N_4(m,w,n)$, versus weight,
$w = w_4(x)$, by applying generating functions.
The 4-ary symbols are generated by a constrained data
source that can be modelled as a four-state Moore-type
finite-state machine. The machine steps from state to state,
where, when state $i \in Q$ is visited, a sequence of $k$, $1 \le k \le m$,
symbols ‘i’ is emitted. After visiting state $i$, the data source
may not return to state $i$ (which forbids emitting another
run of the same symbol ‘i’), but it enters state $j \ne i$,
$j \in Q$. When the machine enters state 2 or 3 (nucleotide A or T), the word
weight, $w$, is incremented by $k$, where $k$, $1 \le k \le m$,
denotes the run of symbols ‘2’ or ‘3’. When, on the other
hand, state 0 or 1 is entered, the weight increment is nil. The
resulting $4 \times 4$ one-step skeleton or state-transition matrix,
$D(x,y)$, of the finite-state machine is
$$D(x,y) = \begin{pmatrix} 0 & a_0 & a_0 & a_0 \\ a_0 & 0 & a_0 & a_0 \\ a_1 & a_1 & 0 & a_1 \\ a_1 & a_1 & a_1 & 0 \end{pmatrix}, \tag{29}$$
TABLE 9. Number of balanced words, $N_4(m, n/2, n)$, versus $m$ and $n$.
where $a_0 = T(x)$ and $a_1 = T_1(x,y)$. We are now in
the position to write a general expression for $N_4(m,w,n)$.
The number of 4-ary sequences of length $n$ with maximum
runlength constraint $m$ and weight $w$ equals
$$N_4(m,w,n) = [x^n y^w]\,\frac{1}{3} \sum_{i,j} \sum_{k=1}^{n} d^{[k]}_{i,j}(x,y), \tag{30}$$
where $d^{[k]}_{i,j}(x,y)$ denotes the entries of $D^k(x,y)$. The
entries $d^{[k]}_{i,j}(x,y)$ of $D^k(x,y)$ are equal to the number of
sequences (paths) of $k$ runlengths starting in state $i$ and ending
in state $j$. Summation over all possible numbers of runs $k \le n$ and
all matrix entries, and division by three yields the generating
function of $N_4(m,w,n)$, which proves (30).
Balanced codewords with $w = n/2$, $n$ even, play an
important role. Table 9 shows outcomes of computations
of $N_4(m, n/2, n)$ using (30), for $m = 1, 2,$ and 3. The case
$m = 1$ was earlier presented in [25]. Note that the integer
sequence $N_4(m = 1, n/2, n)$ versus $n$ is also known as OEIS
sequence A085363 (multiplied by 2), for which an alternative
generating function is presented in [26].
Generating functions (30) allow us to accurately compute
$N_4(m,w,n)$. For some applications, we may sacrifice accuracy
for simplicity of the expression. In the next subsection,
we derive a simple approximation to $N_4(m,w,n)$ valid for
asymptotically large $n$ and small relative weight $w/n$.
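For moderate $n$, the numbers $N_4(m,w,n)$ can be computed with a dynamic program that walks the same state machine run by run. The sketch below is our own alternative to expanding (30) symbolically; it uses the mapping $G = 0$, $C = 1$, $A = 2$, $T = 3$, and the function name is hypothetical.

```python
def quaternary_rll_weight_counts(m, n):
    """Return a list c with c[w] = N_4(m, w, n) for w = 0..n."""
    phi = (0, 0, 1, 1)  # weight contribution of G, C, A, T, see (17)
    # end[s][length][w]: sequences of this length and weight whose final run is symbol s
    end = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(4)]
    for length in range(1, n + 1):
        for s in range(4):
            for k in range(1, min(m, length) + 1):  # k = length of the final run
                dw = k * phi[s]
                if k == length:
                    end[s][length][dw] += 1          # the sequence is a single run
                    continue
                for w in range(dw, n + 1):
                    # The preceding part must end in one of the three other symbols.
                    end[s][length][w] += sum(
                        end[t][length - k][w - dw] for t in range(4) if t != s
                    )
    return [sum(end[s][n][w] for s in range(4)) for w in range(n + 1)]


if __name__ == "__main__":
    print(quaternary_rll_weight_counts(1, 16)[8])  # 15873240
```

The printed value agrees with the exact count of balanced words for $m = 1$, $n = 16$ quoted in Section III-C.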
2) ESTIMATE OF THE WEIGHT DISTRIBUTION
The weight $w_4(x)$ is the number of nucleotides A and T in
the sequence $x$, see (19). Then, as in the binary case above,
for asymptotically large $n$, according to the Central Limit
Theorem, the weight distribution is approximately Gaussian,
that is, we may conveniently approximate $N_4(m,w,n)$ by
$$N_4(m,w,n) \approx G\left(w; \frac{n}{2}, \sigma_4^2(m,n)\right) N_4(m,n), \quad n \gg 1, \tag{31}$$
where $\sigma_4^2(m,n)$ denotes the variance of the Gaussian weight
distribution. The variance $\sigma_4^2(m,n)$ can be computed as
follows.
3) COMPUTATION OF THE VARIANCE $\sigma_4^2(m,n)$
Let $u_i$, $i = 1, 2, \ldots$, $u_i \in Q$, be an infinitely long 4-ary
sequence generated by a maxentropic source that satisfies
a prescribed maximum runlength $m$. Although the 4-ary
sequence $u_i$, $i = 1, 2, \ldots$, satisfies a limited runlength constraint,
$m$, the runs of the binary weight sequence $v_i = \varphi(u_i)$,
$i = 1, 2, \ldots$, see definition (17), are without any limit.
The variance, $\sigma_4^2(m,n)$, of the Gaussian weight distribution
is governed by the runlength distribution, $P(k)$, of the binary
sequence $v_i$, where $P(k)$, $k > 0$, denotes the probability
of occurrence of a runlength $k$. Clearly, $\sum_{k>0} P(k) = 1$.
The probability $P(k)$ is proportional to the number of binary
m-constrained sequences of length $k$, $N_2(m,k)$, times the probability of
such a sequence, $\lambda_4^{-k}$, or
$$P(k) = c\,N_2(m,k)\,\lambda_4^{-k}, \quad k \ge 1, \tag{32}$$
where the normalization constant $c$ is chosen such that
$\sum_{k=1}^{\infty} P(k) = 1$. The term $N_2(m,k)$ is the number of AT
combinations of length $k$, which may consist of a single A or T
run or a plurality of alternating A and T runs. Then we have
$$\sigma_4^2(m,n) = \gamma_4(m)\,\frac{n}{4}, \tag{33}$$
where, see [14, Chapter 4],
$$\gamma_4(m) = \frac{1}{\bar{l}} \sum_{k=1}^{\infty} (k - \bar{l})^2 P(k) \tag{34}$$
and
$$\bar{l} = \sum_{k=1}^{\infty} k\,P(k). \tag{35}$$
Table 8 shows results of computations of $\gamma_4(m)$ versus $m$.
We infer from (31) and Table 8 that, for $n$ fixed, the weight
distribution becomes wider with increasing maximum runlength
$m$, see also Figure 1. Note that the above outcome is
not consistent with the results by Erlich and Zielinski [3],
as they assume a Gaussian balance distribution whose variance
equals $n/4$, independent of $m$.
An estimate of the number of balanced codewords,
$N_4(m, n/2, n)$, is
$$N_4\left(m, \frac{n}{2}, n\right) \approx \frac{\sqrt{2}}{\sqrt{\pi \gamma_4(m)\, n}}\, N_4(m,n), \quad n \text{ even}. \tag{36}$$
For the case $m = 1$ we have (see [26], sequence A085363,
for a similar result)
$$N_4\left(1, \frac{n}{2}, n\right) \approx \frac{8}{\sqrt{\pi n}}\, 3^{n-1}, \quad n \text{ even}. \tag{37}$$
Using the above approximation, we obtain, for example, that
$N_4(1,8,16) \approx 16191008$, which is 2% higher than its exact
value, 15873240, listed in Table 9.
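The estimate (36) can be reproduced numerically from (32)-(35). In the sketch below the infinite sums are truncated after a fixed number of terms, which is our own simplification (the terms decay geometrically); all function names are hypothetical.

```python
import math


def rll_counts(q, m, n):
    """List c with c[k] = N_q(m, k) for 1 <= k <= n, from recursion (1)."""
    c = [0] * (n + 1)
    for k in range(1, n + 1):
        c[k] = q ** k if k <= m else (q - 1) * sum(c[k - j] for j in range(1, m + 1))
    return c


def growth_rate(q, m, iterations=200):
    """Largest real root of (7), found by bisection on [1, q]."""
    def g(x):
        return x ** m - (q - 1) * sum(x ** k for k in range(m))

    lo, hi = 1.0, float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def gamma4(m, terms=200):
    """gamma_4(m) of (34), with the sums over k truncated after `terms` terms."""
    lam4 = growth_rate(4, m)
    n2 = rll_counts(2, m, terms)
    raw = [n2[k] * lam4 ** (-k) for k in range(1, terms + 1)]   # (32) without c
    c = 1.0 / sum(raw)
    probs = [c * r for r in raw]
    mean = sum(k * p for k, p in enumerate(probs, start=1))     # (35)
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs, start=1))
    return var / mean


def balanced_estimate(m, n):
    """Right-hand side of (36) for even n."""
    return math.sqrt(2.0 / (math.pi * gamma4(m) * n)) * rll_counts(4, m, n)[n]


if __name__ == "__main__":
    print(gamma4(1))                        # approximately 0.5
    print(round(balanced_estimate(1, 16)))  # 16191008
```

For $m = 1$ the runlength distribution (32) is geometric with ratio 1/3, so $\gamma_4(1) = 1/2$, and (36) reduces to (37).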
D. REDUNDANCY OF BINARY AND QUATERNARY CODES
WITH COMBINED RLL AND AT/GC BALANCE
CONSTRAINTS
For DNA-based storage, we do not require that the strands
of the codebook, $S$, are strictly balanced, as a small unbalance,
that is $\alpha \ll 1$, between the GC and AT content is
permitted without affecting the error performance. Such a
constraint is called a weak balance constraint. The relative
unbalance of a word, $\alpha(x)$, is defined by $\alpha(x) = \left|\frac{w_4(x)}{n} - \frac{1}{2}\right|$.
An n-nucleotide oligo is said to be balanced if $\alpha(x) = 0$. Code
constructions for combined RLL and weak balanced codes
have been published in [3], and for $m = 3$ in [27], [28].
FIGURE 2. Redundancy (bits), $r_4(a,n)$, versus word length, $n$, with the
relative unbalance, $a$, as a parameter. The raggedness of the curves is
caused by the truncation effects in the summation in (39).
We first study the balance of sequences without an
m-constraint. The number of 4-ary words of length $n$ with
weight $w = w_4(x)$, denoted by $N_4(w,n)$, equals
$$N_4(w,n) = \binom{n}{w} 2^n. \tag{38}$$
The number of oligos, denoted by $N_{4,a}(n)$, of length $n$, whose
relative unbalance satisfies $\alpha(x) \le a$, is given by
$$N_{4,a}(n) = \sum_{\left|\frac{w}{n} - \frac{1}{2}\right| < a} N_4(w,n) = 2^n \sum_{\left|\frac{w}{n} - \frac{1}{2}\right| < a} \binom{n}{w}. \tag{39}$$
The redundancy of 4-ary nearly balanced strands, denoted
by $r_4(a,n)$, equals
$$r_4(a,n) = \log_2 \frac{4^n}{N_{4,a}(n)}. \tag{40}$$
Figure 2 shows examples of computations of the redundancy,
$r_4(a,n)$, versus $n$ with the relative unbalance, $a$, as a parameter.
The raggedness of the curves is caused by the truncation
effects in the summation in (39). The distribution for
asymptotically large $n$ of $N_4(w,n)$ versus $w$ is approximately
Gaussian shaped, that is
$$N_4(w,n) \approx G\left(w; \frac{n}{2}, \frac{n}{4}\right) 4^n, \quad n \gg 1, \tag{41}$$
so that the redundancy equals
$$r_{4,a}(n) \approx -\log_2\left[1 - 2Q(2a\sqrt{n})\right], \quad n \gg 1, \tag{42}$$
where the Q-function is defined by
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}}\, du. \tag{43}$$
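The exact redundancy (39)-(40) and its Gaussian approximation (42) can be compared numerically. In this sketch the function names are ours, the strict inequality of the summation in (39) is used, and the bound $a$ is converted to an exact fraction so that boundary weights are not admitted by floating-point rounding.

```python
import math
from fractions import Fraction


def redundancy_exact(a, n):
    """r_4(a, n) of (40), with N_{4,a}(n) taken from the summation in (39)."""
    bound = Fraction(str(a))  # exact decimal value of a, e.g. 0.1 becomes 1/10
    total = sum(math.comb(n, w) for w in range(n + 1)
                if abs(Fraction(w, n) - Fraction(1, 2)) < bound)
    if total == 0:
        raise ValueError("no weight satisfies the unbalance bound")
    # N_{4,a}(n) = 2^n * total, hence log2(4^n / N_{4,a}(n)) = n - log2(total).
    return n - math.log2(total)


def q_function(x):
    """Gaussian tail probability Q(x) of (43)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def redundancy_gauss(a, n):
    """Approximation (42)."""
    return -math.log2(1.0 - 2.0 * q_function(2.0 * a * math.sqrt(n)))


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, round(redundancy_exact(0.1, n), 4), round(redundancy_gauss(0.1, n), 4))
```

The exact values jump whenever a new weight enters or leaves the admissible range as $n$ changes, which is the truncation effect mentioned above, while the Gaussian curve is smooth.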
We now study q-ary sequences with both an m-constraint
and a given weight $w$. As in Construction 1, let the quaternary
word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, be written as $x = y + 2z$,
where the constituting elements $y_i$ and $z_i \in \{0, 1\}$. If the
binary sequence $z$ is m-constrained and has weight $w = w_2(z)$,
then $x$ is m-constrained and it has weight $w_4(x) = w$.
Using (11), (24), and (31), we obtain for $n \gg 1$, that
the redundancy of q-ary sequences with combined RLL and
balance constraints, denoted by $r_{q,a}(m,n)$, equals
$$r_{q,a}(m,n) \approx r_q(m,n) - \log_2\left[1 - 2Q\left(2a\sqrt{\frac{n}{\gamma_q(m)}}\right)\right]. \tag{44}$$
A numerical analysis of the above expression shows that the
redundancy difference due to the balance (right hand) term
is around 0.5-1 bit for $m = 2$. For larger values of the
homopolymer run $m$ the extra redundancy is negligible for
$n > 10$. The redundancy difference, $r_2(m,n) - r_4(m,n)$, due
to the imposed runlength constraint is much larger for $n > 10$
than the redundancy due to the balance constraint.
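The balance term of (44) is easily evaluated once $\gamma_q(m)$ is known. In the sketch below $\gamma$ is passed as a parameter; the numerical inputs are illustrative placeholders of ours and are not taken from Table 8.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) of (43)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def balance_term(a, n, gamma):
    """Second term of (44): -log2(1 - 2*Q(2*a*sqrt(n / gamma)))."""
    return -math.log2(1.0 - 2.0 * q_function(2.0 * a * math.sqrt(n / gamma)))


if __name__ == "__main__":
    # gamma = 1 corresponds to unconstrained sequences and recovers (42).
    for gamma in (0.5, 0.75, 1.0):
        print(gamma, round(balance_term(0.1, 100, gamma), 4))
```

A smaller $\gamma_q(m)$ means a narrower weight distribution, so fewer words are excluded by the balance constraint and the extra redundancy is smaller.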
IV. CONCLUSION
We have compared two coding approaches for constraint-based
coding of DNA strings. In the first approach, an intermediate,
‘binary’, coding step is used, while in the second approach we
‘directly’ translate source data into constrained quaternary
sequences. The binary approach is attractive as it yields a
lower complexity of encoding and decoding look-up tables.
The redundancy of the binary approach is higher than that of
the quaternary approach for generating combined weight and
run-length constrained sequences. The redundancy difference
is small for larger values of the maximum homopolymer run.
We have found exact and approximate expressions for the
number of binary and quaternary sequences with combined
weight and run-length constraints.
REFERENCES
[1] G. M. Church, Y. Gao, and S. Kosuri, ‘‘Next-generation digital information
storage in DNA,’’ Science, vol. 337, no. 6102, p. 1628, Sep. 2012.
[2] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso,
B. W. Pruitt, and G. M. Church, ‘‘Forward error correction for DNA
data storage,’’ in Proc. Int. Conf. Comput. Sci. (ICCS), vol. 80, 2016,
pp. 1011–1022.
[3] Y. Erlich and D. Zielinski, ‘‘DNA fountain enables a robust and efficient
storage architecture,’’ Science, vol. 355, no. 6328, pp. 950–954, Mar. 2017.
[4] J. Koch, S. Gantenbein, K. Masania, W. J. Stark, Y. Erlich, and R. N. Grass,
‘‘A DNA-of-things storage architecture to create materials with embedded
memory,’’ Nature Biotechnol., vol. 38, no. 1, pp. 39–43, Jan. 2020.
[5] Y. Wang, M. Noor-A-Rahim, J. Zhang, E. Gunawan, Y. L. Guan, and
C. L. Poh, ‘‘High capacity DNA data storage with variable-length
oligonucleotides using repeat accumulate code and hybrid mapping,’’ J. Biol. Eng.,
vol. 13, no. 1, p. 89, Dec. 2019.
[6] L. Ceze, J. Nivala, and K. Strauss, ‘‘Molecular digital data storage using
DNA,’’ Nature Rev. Genet., vol. 20, no. 8, pp. 456–466, Aug. 2019.
[7] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig,
‘‘A DNA-based archival storage system,’’ ACM SIGOPS Oper. Syst. Rev.,
vol. 50, pp. 637–649, 2016.
[8] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty,
C. Nusbaum, and D. B. Jaffe, ‘‘Characterizing and measuring bias in
sequence data,’’ Genome Biol., vol. 14, no. 5, p. R51, 2013.
[9] K. W. Cattermole, ‘‘Principles of digital line coding,’’ Int. J. Electron.,
vol. 55, pp. 3–33, Jul. 1983.
[10] K. A. Schouhamer Immink and K. Cai, ‘‘Design of capacity-approaching
constrained codes for DNA-based storage systems,’’ IEEE Commun. Lett.,
vol. 22, no. 2, pp. 224–227, Feb. 2018.
[11] Y.-S. Kim and S.-H. Kim, ‘‘New construction of DNA codes with constant-
GC contents from binary sequences with ideal autocorrelation,’’ in Proc.
IEEE Int. Symp. Inf. Theory Process., Jul. 2011, pp. 1569–1573.
[12] Y. M. Chee and S. Ling, ‘‘Improved lower bounds for constant
GC-content DNA codes,’’ IEEE Trans. Inf. Theory, vol. 54, no. 1,
pp. 391–394, Jan. 2008.
[13] K. A. Schouhamer Immink and K. Cai, ‘‘Efficient balanced and maximum
homopolymer-run restricted block codes for DNA-based data storage,’’
IEEE Commun. Lett., vol. 23, no. 10, pp. 1676–1679, Oct. 2019.
[14] K. A. S. Immink, Codes for Mass Data Storage Systems, 2nd ed.
Eindhoven, The Netherlands: Shannon Foundation, 2004.
[15] S. W. McLaughlin, J. Luo, and Q. Xie, ‘‘On the capacity of M-ary
runlength-limited codes,’’ IEEE Trans. Inf. Theory, vol. 41, no. 5,
pp. 1508–1511, Sep. 1995.
[16] M. W. Marcellin and H. J. Weber, ‘‘Two-dimensional modulation codes,’’
IEEE J. Sel. Areas Commun., vol. 10, no. 1, pp. 254–266, Jan. 1992.
[17] C. E. Shannon, ‘‘A mathematical theory of communication,’’ Bell Syst.
Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
[18] P. Flajolet and R. Sedgewick, Analytic Combinatorics. Cambridge, U.K.:
Cambridge Univ. Press, 2009.
[19] S. M. Hossein, T. Yazdi, H. M. Kiah, and O. Milenkovic, ‘‘Weakly
mutually uncorrelated codes,’’ in Proc. IEEE Int. Symp. Inf. Theory (ISIT),
Barcelona, Spain, Jul. 2016, pp. 2649–2653.
[20] V. Taranalli, H. Uchikawa, and P. H. Siegel, ‘‘Error analysis and inter-cell
interference mitigation in multi-level cell flash memories,’’ in Proc. IEEE
Int. Conf. Commun. (ICC), London, U.K., Jun. 2015, pp. 271–276.
[21] K. J. Kerpez, A. Gallopoulos, and C. Heegard, ‘‘Maximum entropy
charge-constrained run-length codes,’’ IEEE J. Sel. Areas Commun., vol. 10, no. 1,
pp. 242–253, Jan. 1992.
[22] V. Braun and K. A. Schouhamer Immink, ‘‘An enumerative coding
technique for DC-free runlength-limited sequences,’’ IEEE Trans. Commun.,
vol. 48, no. 12, pp. 2024–2031, Dec. 2000.
[23] O. F. Kurmaev, ‘‘Constant-weight and constant-charge binary run-length
limited codes,’’ IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4497–4515,
Jul. 2011.
[24] F. Paluncic and B. T. J. Maharaj, ‘‘Using bivariate generating functions
to count the number of balanced runlength-limited words,’’ in Proc.
GLOBECOM - IEEE Global Commun. Conf., Singapore, Dec. 2017,
pp. 4–8.
[25] D. Limbachiya, M. K. Gupta, and V. Aggarwal, ‘‘Family of constrained
codes for archival DNA data storage,’’ IEEE Commun. Lett., vol. 22, no. 10,
pp. 1972–1975, Oct. 2018.
[26] N. J. A. Sloane. (2019). The On-Line Encyclopedia of Integer Sequences.
[Online]. Available: http://oeis.org
[27] Y. Wang, M. Noor-A-Rahim, E. Gunawan, Y. L. Guan, and C. L. Poh,
‘‘Construction of bio-constrained code for DNA data storage,’’ IEEE
Commun. Lett., vol. 23, no. 6, pp. 963–966, Jun. 2019.
[28] W. Song, K. Cai, M. Zhang, and C. Yuen, ‘‘Codes with run-length and
GC-content constraints for DNA-based data storage,’’ IEEE Commun.
Lett., vol. 22, no. 10, pp. 2004–2007, Oct. 2018.
KEES A. SCHOUHAMER IMMINK (Life Fellow, IEEE) is currently a Founder
and the President of Turing Machines Inc., an innovative start-up
focused on coding and signal processing for DNA-based storage. He
received the 2017 IEEE Medal of Honor for his pioneering contributions
to video, audio, and data recording technology, the Knighthood, in 2000,
the Personal Emmy Award, in 2004, the 1999 Audio Engineering
Society’s (AES) Gold Medal, the 2004 SMPTE Progress Medal, the
2014 Eduard Rhein Prize for Technology, and the 2015 IET Faraday
Medal. He received an Honorary Doctorate from the University of
Johannesburg, in 2014. He was inducted into the Consumer Electronics
Hall of Fame, elected into the Royal Netherlands Academy of Arts and
Sciences, and the (US) National Academy of Engineering. He has served
the profession as a Governor for the IEEE Information Theory and
Consumer Electronics Societies and the President for the Audio
Engineering Society.
KUI CAI (Senior Member, IEEE) received the B.E. degree in information
and control engineering from Shanghai Jiao Tong University, Shanghai,
China, and the joint Ph.D. degree in electrical engineering from the
Technical University of Eindhoven, The Netherlands, and the National
University of Singapore. She is currently an Associate Professor with
the Singapore University of Technology and Design (SUTD). Her main
research interests are in the areas of coding theory, information
theory, signal processing for various data storage systems, and digital
communications. She received the 2008 IEEE Communications Society Best
Paper Award in Coding and Signal Processing for Data Storage. She has
served as the Vice-Chair (Academia) for the IEEE Communications Society
and the Data Storage Technical Committee (DSTC), from 2015 to 2016.