
Properties and constructions of constrained codes for DNA-based data storage


Kees A. Schouhamer Immink and Kui Cai
Abstract—We describe properties and constructions of
constraint-based codes for DNA-based data storage which ac-
count for the maximum repetition length and AT/GC balance.
We present algorithms for computing the number of sequences
with maximum repetition length and AT/GC balance con-
straint. We describe routines for translating binary runlength
limited and/or balanced strings into DNA strands, and compute
the efficiency of such routines. We show that the implementation
of AT/GC-balanced codes is straightforwardly accomplished
with binary balanced codes. We present codes that account for
both the maximum repetition length and AT/GC balance. We
compute the redundancy difference between the binary and a
fully fledged quaternary approach.
I. INTRODUCTION
The first large-scale archival DNA-based storage architecture was implemented by Church et al. [1] in 2012. Blawat et al. [2] described successful experiments for storing and retrieving data blocks of 22 Mbyte of digital data in synthetic DNA. Erlich and Zielinski [3] further explored the limits of storage capacity of DNA-based storage architectures.
Naturally occurring DNA consists of four types of nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). A DNA strand (or oligonucleotide, or oligo for short) is a linear sequence of these four nucleotides that is composed by DNA synthesizers. Binary source, or user, data are translated into the four types of nucleotides, for example, by mapping two binary source bits into a single nucleotide (nt).
Strings of nucleotides should satisfy a few elementary
conditions, called constraints, in order to be less error
prone. Repetitions of the same nucleotide, a homopoly-
mer run, significantly increase the chance of sequencing
errors [4], [5], so that such long runs should be avoided.
For example, in [5], experimental studies show that once
the homopolymer run is larger than 4 nt, the sequencing
error rate starts increasing significantly. In addition, [5]
also reports that oligos with large unbalance between GC
and AT content exhibit high dropout rates and are prone
to polymerase chain reaction (PCR) errors, and should
therefore be avoided.
Blawat’s format [2] incorporates a constrained code that uses a look-up table for translating binary source data into strands of nucleotides with a homopolymer run of length at most three. Blawat’s format did not incorporate an AT/GC balance constraint. Strands that do not satisfy the maximum homopolymer run requirement or the weak balance constraint are barred in Erlich’s coding format [3].

Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15b-d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-machines.com.
Kui Cai is with Singapore University of Technology and Design (SUTD), 8 Somapah Rd, 487372, Singapore. E-mail: cai_kui@sutd.edu.sg.
This work is supported by Singapore Ministry of Education Academic Research Fund Tier 2 MOE2016-T2-2-054.

In this paper, we describe properties and constructions
of quaternary constraint-based codes for DNA-based stor-
age which account for a maximum homopolymer run
and maximum unbalance between AT and GC contents.
Binary ‘balanced’ and runlength limited sequences have
found widespread use in data communication and storage
practice [6]. We show that constrained binary sequences
can easily be translated into constrained quaternary se-
quences, which opens the door to a wealth of efficient
binary code constructions for application in DNA-based
storage [7], [8], [9]. A further advantage of this binary
approach instead of a ‘direct’ 4-ary translation approach
is the lower complexity of encoding and decoding look-
up tables. The disadvantage is, as we show, the loss in
information capacity of the binary versus the quaternary
approach.
We start in Section II with a description of the limit-
ing properties of AT/GC-balanced codes, while Section III
presents code designs for efficiently generating AT/GC-
balanced strands. Limiting properties and code constructions
that impose a maximum homopolymer run are discussed
in Section IV. In Section V, we enumerate the number
of binary and quaternary sequences with combined weight
and run-length constraints. We specifically compute and
compare the information capacity of binary versus ‘direct’
quaternary coding techniques. Section VI concludes the
paper.
II. AT/GC CONTENT BALANCE
We use the nucleotide alphabet $\mathcal{Q} = \{0, 1, 2, 3\}$, where we propose the following relation between the four decimal symbols and the nucleotides: G = 0, C = 1, A = 2, and T = 3. The AT/GC content constraint stipulates that around half of the nucleotides should be either an A or a T nucleotide. In order to study AT-balanced nucleotides, we start with a few definitions. We define the weight or AT-content, denoted by $w_4(x)$, of the $n$-nucleotide oligo $x = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, as the number of occurrences of A or T, or
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i), \tag{1}$$
where
$$\varphi(u) = \begin{cases} 0, & u < 2, \\ 1, & u > 1. \end{cases} \tag{2}$$
The relative unbalance of a word, $\alpha(x)$, is defined by $\alpha(x) = \left| \frac{w_4(x)}{n} - \frac{1}{2} \right|$. An $n$-nucleotide oligo is said to be balanced if $\alpha(x) = 0$. In case we have a set $S$ of $n$-symbol codewords, we define the worst case relative unbalance of $S$, denoted by $\alpha_S$, by $\alpha_S = \max_{x \in S} \alpha(x)$. Similarly the weight of a binary word $x = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$, denoted by $w_2(x)$, is defined by
$$w_2(x) = \sum_{i=1}^{n} \varphi(2x_i) = \sum_{i=1}^{n} x_i. \tag{3}$$
If we write the 4-ary word $x = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, as $x = y + 2z$, where both $y_i$ and $z_i \in \{0, 1\}$, then
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i) = \sum_{i=1}^{n} \varphi(2z_i) = w_2(z). \tag{4}$$
For DNA-based storage, we do not require that the strands of the codebook, $S$, are strictly balanced, as a small unbalance, that is $\alpha_S \ll 1$, between the GC and AT content is permitted without affecting the error performance. Such a constraint is called a weak balance constraint. Let $S_w$ denote the set of 4-ary words of length $n$ with balance $w = w_4(x)$, or
$$S_w = \{x \in \mathcal{Q}^n : w = w_4(x)\}. \tag{5}$$
The cardinality of $S_w$, denoted by $N(w, n)$, equals
$$N(w, n) = |S_w| = \binom{n}{w} 2^n. \tag{6}$$
The number of oligos, $N_a(n)$, of length $n$, whose relative unbalance $\alpha(x) < a$, is given by
$$N_a(n) = \sum_{|w/n - \frac{1}{2}| < a} N(w, n) = 2^n \sum_{|w/n - \frac{1}{2}| < a} \binom{n}{w}. \tag{7}$$
The redundancy of nearly balanced strands, denoted by $r(a, n)$, equals
$$r(a, n) = \log_2 \frac{4^n}{N_a(n)}. \tag{8}$$
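The count (7) and the redundancy (8) can be evaluated by direct summation. The following is a minimal sketch, assuming the strict inequality of (7); the function names `count_nearly_balanced` and `redundancy` are illustrative and do not appear in the paper.

```python
from math import comb, log2

def count_nearly_balanced(n, a):
    """N_a(n) of (7): 4-ary words of length n with |w/n - 1/2| < a."""
    return 2**n * sum(comb(n, w) for w in range(n + 1) if abs(w / n - 0.5) < a)

def redundancy(n, a):
    """r(a, n) of (8) in bits, written as 2n - log2 N_a(n)."""
    return 2 * n - log2(count_nearly_balanced(n, a))

if __name__ == "__main__":
    for n in (16, 50, 100, 150):
        print(n, round(redundancy(n, 0.125), 4))
```

For odd $n$ and very small $a$ no weight may satisfy the constraint, in which case `log2` raises a `ValueError`; the sketch does not handle that case.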
Figure 1 shows examples of computations of the redundancy versus $n$ with the relative unbalance, $a$, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (7). The distribution for asymptotically large $n$ of $N(w, n)$ versus $w$ is approximately Gaussian shaped, that is
$$N(w, n) \approx G\left(w; \frac{n}{2}, \frac{n}{4}\right) 4^n, \quad n \gg 1, \tag{9}$$
where
$$G(u; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{u - \mu}{\sigma}\right)^2} \tag{10}$$
denotes the Gaussian distribution and $\mu$ and $\sigma^2$ denote the mean and variance of the distribution. The number of oligos, $N_a(n)$, of length $n$, whose relative unbalance $\alpha(x) < a$, is given by [[3], supplement]
$$N_a(n) \approx 4^n \left(1 - 2Q(2a\sqrt{n})\right), \quad n \gg 1, \tag{11}$$
where the Q-function is defined by
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}} \, du. \tag{12}$$

Fig. 1. Redundancy (bits) versus word length, $n$, with the relative unbalance, $a$, as a parameter (curves for $a = 0.125$, $a = 0.0625$, and $a = 0.03125$). The raggedness of the curves is caused by the truncation effects in the summation in (7).

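To see how close the Gaussian estimate (11) is to the exact sum (7), both redundancies can be computed side by side. This is a sketch under the assumption that $Q(x)$ is evaluated through the standard identity $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$; all function names are illustrative.

```python
from math import comb, erfc, log2, sqrt

def q_function(x):
    """Gaussian tail probability Q(x) of (12), via the complementary error function."""
    return 0.5 * erfc(x / sqrt(2))

def redundancy_exact(n, a):
    """Redundancy (8) with N_a(n) taken from the exact sum (7)."""
    count = 2**n * sum(comb(n, w) for w in range(n + 1) if abs(w / n - 0.5) < a)
    return 2 * n - log2(count)

def redundancy_gauss(n, a):
    """Redundancy (8) with N_a(n) taken from approximation (11)."""
    return -log2(1 - 2 * q_function(2 * a * sqrt(n)))

if __name__ == "__main__":
    for n in (50, 100, 150):
        print(n, redundancy_exact(n, 0.125), redundancy_gauss(n, 0.125))
```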
In the next section, we discuss various embodiments of
codes that balance strands of nucleotides.
III. IMPLEMENTATIONS OF BALANCED GC/AT CONTENT
There is a wealth of prior art binary balanced codes [10],
and application of such prior art codes to the problem at
hand is shown below. Earlier embodiments can be found
in [11], [12].
A. Binary sequences, Construction I
We assume the encoder receives a string of $\ell + n$, $\ell \le n$, binary symbols, which are translated into a balanced word of $n$ 4-ary symbols. To that end, let $(y_1, \ldots, y_{\ell+n})$, $y_i \in \{0, 1\}$, $\ell \le n$, be an $(\ell + n)$-bit source string. We translate the first $\ell$ bits of the binary source data, $(y_1, \ldots, y_\ell)$, into a (nearly) balanced binary string $(u_1, \ldots, u_n)$, $u_i \in \{0, 1\}$. We merge the $n$-bit string, $(u_1, \ldots, u_n)$, and the remaining $n$-bit segment of the source string, $(y_{\ell+1}, \ldots, y_{\ell+n})$, into the 4-ary vector $v$, $v_i \in \mathcal{Q}$, using the operation $v_i = y_{\ell+i} + 2u_i$, $1 \le i \le n$. The balance of the output string, $v$, is given by, see (4), $w_4(v) = w_2(u)$. The rate of the above 4-ary code construction equals $R = 1 + \frac{\ell}{n}$.
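The merging step of Construction I can be sketched as follows. The balanced string `u` is assumed to come from some binary balanced encoder that is not shown here; the helper names (`merge`, `split`, `at_weight`, `to_strand`) are illustrative only. The symbol mapping G = 0, C = 1, A = 2, T = 3 is the one proposed in Section II.

```python
NUCLEOTIDES = "GCAT"  # G = 0, C = 1, A = 2, T = 3

def merge(u, y_tail):
    """v_i = y_i + 2 u_i: u supplies the most significant bit, y_tail the least."""
    if len(u) != len(y_tail):
        raise ValueError("both strings must have length n")
    return [y + 2 * b for b, y in zip(u, y_tail)]

def split(v):
    """Inverse of merge: recover (u, y_tail) from the 4-ary word v."""
    return [s >> 1 for s in v], [s & 1 for s in v]

def at_weight(v):
    """w_4(v): number of symbols equal to A (2) or T (3)."""
    return sum(1 for s in v if s > 1)

def to_strand(v):
    """Map the 4-ary word to a nucleotide string."""
    return "".join(NUCLEOTIDES[s] for s in v)

if __name__ == "__main__":
    u = [1, 0, 1, 0]          # balanced binary string (two ones out of four)
    y_tail = [0, 1, 1, 0]     # unconstrained source bits
    v = merge(u, y_tail)
    print(v, to_strand(v), at_weight(v))
```

Because only the most significant bit decides whether a symbol is A or T, the AT weight of `v` equals the binary weight of `u`, in line with (4).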
Implementations of balanced codes can be found in the
literature. For example, the 8B10B is a binary code of rate
8/10 that has found application in both transmission and
data storage systems [13]. The 10-bit codewords may have
four, five or six ‘one’s, and the two-state code guarantees that the unbalance of the encoded sequence is at most ±1. In case we translate $p$ 8-bit words into $p$ 10-bit words, we have $\alpha_S = \frac{1}{10p}$. The (overall) rate is $R = \frac{9}{5}$.
1) Weak Knuth code: Knuth [14] presented an encoding technique for generating binary balanced codewords capable of handling (very) large binary blocks. An $n$-bit user word, $n$ even, is forwarded to the encoder, which inverts the first $k_0$ bits of the user word, where $k_0$ is chosen in such a way that the modified word has equal numbers of ones and zeros. Knuth showed that such an index $k_0$ can always be found. The index $k_0$ is represented by a (preferably) balanced word, called prefix, of length $p_0$, $p_0 \approx \log_2 n$ bits, so that the redundancy of Knuth’s method is approximately $\log_2 n$ (bit). The (balanced) $p_0$-bit prefix and the balanced $n$-bit user word are both transmitted. The receiver can easily undo the inversion of the first $k_0$ bits of the received word. Modifications of the generic Knuth scheme have been presented by Weber & Immink [15].
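Knuth's balancing step can be illustrated with a short sketch. It searches the indices $k = 0, 1, \ldots, n$ in order and therefore may return $k_0 = 0$ for a word that is already balanced; the encoding of $k_0$ into a prefix is omitted. The function names are illustrative.

```python
def knuth_balance(x):
    """Return (k0, word): word is x with its first k0 bits inverted and weight n/2."""
    n = len(x)
    if n % 2:
        raise ValueError("n must be even")
    weight = sum(x)                      # weight after inverting the first 0 bits
    for k in range(n + 1):
        if 2 * weight == n:
            return k, [1 - b for b in x[:k]] + list(x[k:])
        if k < n:
            weight += 1 - 2 * x[k]       # inverting one more bit changes the weight by +1 or -1
    raise AssertionError("unreachable: a balancing index always exists for even n")

def knuth_restore(k0, word):
    """Undo the inversion of the first k0 bits."""
    return [1 - b for b in word[:k0]] + list(word[k0:])

if __name__ == "__main__":
    k0, balanced = knuth_balance([1, 1, 1, 1, 0, 1, 1, 0])
    print(k0, balanced)
```

The weight moves in steps of one from $w$ (at $k = 0$) to $n - w$ (at $k = n$), so it must pass through $n/2$, which is why the loop always terminates with a balanced word.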
DNA-based storage does not require exact strand GC/AT-content balance, and we may attempt to construct less redundant nearly-balanced codes. We modify Knuth’s algorithm for generating nearly balanced binary codes. Let $x = (x_1, \ldots, x_n)$ be the word to be balanced. Define the $m_0 = 2^{p_0}$ balancing positions, denoted by $b_i$, $i = 0, \ldots, m_0 - 1$, that are evenly distributed over the $n$ possible positions, say $b_i = 1 + is$, $i = 0, \ldots, m_0 - 1$, where $s = n/m_0$. Mimicking the original Knuth encoder, the encoder successively inverts the symbols of the $i$th segment of $x$, $i = 0, \ldots, m_0 - 1$, thereby successively inverting the symbols $x_1$ till $x_{b_0}$, $x_1$ till $x_{b_1}$, etc., until $x_1$ till $x_{b_{m_0 - 1}}$. The encoder selects the index, $b_{\hat{i}}$, that enables the least unbalance. In similar vein as in Knuth’s method, the index $\hat{i}$ is represented by a redundant (balanced or nearly balanced) $p$-bit prefix that is appended to the weakly-balanced word. According to Knuth we can choose at least one index $k_0$, $1 \le k_0 \le n$, such that exact balance can be achieved. As an ‘exact’ balancing index, $k_0$, is at most $s/2$ positions away from position $b_{\hat{i}}$, we conclude that the relative unbalance is
$$\alpha_S \le \frac{1}{2^{p_0 + 1}}. \tag{13}$$
The redundancy of the above weak Knuth code equals at least $p_0$ bits (note that additional redundancy is needed to encode the prefix into a nearly balanced word). Let, for example, the code redundancy be $p_0 = 3$, then $\alpha_S = 0.0625$. Figure 1 shows that for a relative unbalance $a = 0.0625$ we need, in theory, less than 1.5 bit redundancy for $n > 25$, so that we conclude that the above modification of Knuth’s algorithm falls far short of the minimum redundancy required.
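A minimal sketch of the weak Knuth encoder described above is given below. It assumes that $n$ is even and divisible by $m_0 = 2^{p_0}$, and it leaves out the encoding of the index $\hat{i}$ into a prefix; the function name `weak_knuth` is illustrative.

```python
def weak_knuth(x, p0):
    """Try the m0 = 2**p0 positions b_i = 1 + i*s and keep the least unbalanced result."""
    n, m0 = len(x), 2 ** p0
    if n % 2 or n % m0:
        raise ValueError("this sketch assumes n even and divisible by 2**p0")
    s = n // m0
    best = None
    for i in range(m0):
        b = 1 + i * s                                   # balancing position b_i
        word = [1 - bit for bit in x[:b]] + list(x[b:]) # invert x_1 .. x_b
        unbalance = abs(sum(word) / n - 0.5)
        if best is None or unbalance < best[0]:
            best = (unbalance, i, word)
    return best  # (relative unbalance, index i_hat, weakly balanced word)

if __name__ == "__main__":
    print(weak_knuth([1] * 8, 1))
```

The decoder would recover the source word by inverting the first $b_{\hat{i}}$ bits again, exactly as in Knuth's original scheme.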
In the next section, we discuss constructions for generating
strings that avoid long repetitions of the same nucleotide.
IV. MAXIMUM RUNLENGTH CONSTRAINT
Long repetitions of the same nucleotide (nt), called a homopolymer run or runlength, may significantly increase the chance of sequencing errors [4], [5], and should be avoided. Avoiding long runs of the same nucleotide will result in loss of information capacity, and codes are required for translating arbitrary source data into constrained quaternary strings. Binary runlength limited (RLL) codes have found widespread application in digital communication and storage devices since the 1950s [6], [10]. McLaughlin et al. [16] studied multi-level runlength limited codes for optical recording. An $n$-nucleotide oligo, a string of 4-ary symbols of length $n$, can be seen as two parallel binary strings of length $n$, namely a string of least significant bits and a string of most significant bits with which the 4-ary symbols can be represented. Such a system of multiple parallel data streams with joint constraints is reminiscent of ‘two-dimensional’ track systems, which have been studied by Marcellin and Weber [17].
We start in the next subsection with the counting of q-
ary sequences that satisfy a maximum runlength, followed
by subsections where we describe limiting properties and
code constructions that avoid m+ 1 repetitions of the same
nucleotide.
A. Counting q-ary sequences, capacity
Let the number of $n$-length sequences of $q$-ary symbols that have a maximum run, $m$, of the same symbol be denoted by $N_q(m, n)$. The number $N_q(m, n)$ can be found using the next Theorem, which defines a recursive relation [18], Part 1.
Theorem 1:
$$N_q(m, n) = \begin{cases} q^n, & n \le m, \\ (q - 1) \sum_{k=1}^{m} N_q(m, n - k), & n > m. \end{cases} \tag{14}$$
Proof: For $n \le m$ the above is trivial as all sequences satisfy the maximum runlength constraint. For $n > m$ we follow Shannon’s approach [18] for the discrete noiseless channel. The runlength of $k$ symbols $a$ can be seen as a ‘phrase’ $a$ of length $k$. After a phrase $a$ has been emitted, a phrase of symbols $b \ne a$ of length $k$ can be emitted without violating the maximum runlength constraint imposed. The total number of allowed sequences, $N_q(m, n)$, is equal to $(q - 1)$ times the sum of the numbers of sequences ending with a phrase of length $k = 1, 2, \ldots, m$, which are equal to $N_q(m, n - k)$. Addition of these numbers yields (14), which proves the Theorem.
Using the above expressions, we may easily compute the
feasibility of a q-ary m-constrained code for relatively small
values of nwhere a coding look-up table is practicable, see
Subsection IV-C for more details.
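Recursion (14) translates directly into a memoized function. The sketch below is one possible implementation; the name `count_rll` is illustrative, and the recursive form is only suitable for moderate $n$ because of Python's recursion limit.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_rll(q, m, n):
    """N_q(m, n) of (14): q-ary words of length n whose runs have length at most m."""
    if n <= m:
        return q ** n
    return (q - 1) * sum(count_rll(q, m, n - k) for k in range(1, m + 1))

if __name__ == "__main__":
    print(count_rll(4, 3, 5))    # 996, the value used in Example 1
    print(count_rll(4, 2, 10))   # 676836, the value quoted in Section IV-A
```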
TABLE I
CAPACITY $C_2(m)$ AND $C_4(m)$ VERSUS $m$.
m    C_2(m)    C_4(m)
1    0.0000    1.5850 (= log_2 3)
2    0.6942    1.9227
3    0.8791    1.9824
4    0.9468    1.9957
5    0.9752    1.9989
6    0.9881    1.9997

1) Generating functions: Generating functions are a very useful tool for enumerating constrained sequences [19], and they offer tools for approximating the number of constrained sequences for asymptotically large values of the sequence length $n$. The series of numbers $\{N_q(m, n)\}$, $n = 1, 2, \ldots$, in (14), can be compactly written as the coefficients of a formal power series $H_{q,m}(x) = \sum N_q(m, i) x^i$, where $x$ is a dummy variable. There is a simple relationship between the generating function, $H_{q,m}(x)$, and the linear homogeneous recurrence relation (14) with constant coefficients that defines the same series [19]. We first define a generating function
$$G(x) = \sum g_i x^i. \tag{15}$$
Let the operation $[x^n]G(x)$ denote the extraction of the coefficient of $x^n$ in the formal power series $G(x)$, that is, define
$$[x^n] \sum g_i x^i = g_n. \tag{16}$$
Let
$$T(x) = \sum_{i=1}^{m} x^i. \tag{17}$$
Theorem 2: The number of $n$-symbol $m$-constrained $q$-ary words is
$$N_q(m, n) = [x^n] \frac{qT(x)}{1 - (q - 1)T(x)}. \tag{18}$$
Proof: The generating function for the number of $q$-ary sequences with a maximum runlength $m$ is
$$qT(x) + q(q - 1)T(x)^2 + q(q - 1)^2 T(x)^3 + \cdots.$$
We may rewrite the above as
$$\frac{qT(x)}{1 - (q - 1)T(x)},$$
which proves the Theorem.
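The coefficient extraction in (18) can be carried out with a simple power-series expansion. Writing $H = qT + (q-1)TH$ and comparing powers of $x$ gives a recurrence for the coefficients $h_n$; the sketch below implements it (the function name is illustrative, and this is just one way to expand the series).

```python
def series_coefficients(q, m, n_max):
    """Coefficients h_0..h_n_max of H(x) = q T(x) / (1 - (q-1) T(x)), T(x) = x + ... + x^m.

    From H = q T + (q-1) T H:
    h_n = q*[1 <= n <= m] + (q-1) * sum_{k=1..min(m,n)} h_{n-k}, with h_0 = 0.
    """
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        head = q if n <= m else 0
        h[n] = head + (q - 1) * sum(h[n - k] for k in range(1, min(m, n) + 1))
    return h

if __name__ == "__main__":
    print(series_coefficients(4, 3, 5))   # [0, 4, 16, 64, 252, 996]
```

The coefficients agree with recursion (14), for instance $h_5 = 996$ for $q = 4$, $m = 3$.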
2) Asymptotical behavior: For asymptotically large codeword length $n$, the maximum number of (binary) user bits that can be stored per $q$-ary symbol, called (information) capacity, denoted by $C_q(m)$, is given by [18]
$$C_q(m) = \lim_{n \to \infty} \frac{1}{n} \log_2 N_q(m, n) = \log_2 \lambda_q(m), \tag{19}$$
where $\lambda_q(m)$ is the largest real root of the characteristic equation [18], [16]
$$x^{m+1} - qx^m + q - 1 = 0. \tag{20}$$
Table I shows the information capacities $C_2(m)$ and $C_4(m)$ versus maximum allowed (homopolymer) run $m$. For asymptotically large $n$ we may approximate $N_q(m, n)$ by [19]
$$N_q(m, n) \approx A_q(m) \lambda_q^n(m). \tag{21}$$
The coefficient $A_q(m)$ is found, see [[10], pages 157-158], by rewriting $H_{q,m}(x)$ as a quotient of two polynomials, or $H_{q,m}(x) = \frac{r(x)}{p(x)}$. Then
$$A_q(m) = -\lambda_q(m) \frac{r(1/\lambda_q(m))}{p'(1/\lambda_q(m))}. \tag{22}$$
Table II shows the coefficients $A_2(m)$ and $A_4(m)$ versus $m$. For $m = 1$, we simply find $N_4(1, n) = 4 \cdot 3^{n-1}$. We found that the approximation (21) is remarkably accurate. For a typical example, $N_4(2, 10) = 676836$, while the approximation using (21) yields $N_4(2, 10) \approx 676835.9769$.

TABLE II
COEFFICIENT $A_2(m)$ AND $A_4(m)$ VERSUS $m$.
m    A_2(m)    A_4(m)
1    -         1.3333 (= 4/3)
2    1.4477    1.1031
3    1.2368    1.0341
4    1.1327    1.0110
5    1.0759    1.0034
6    1.0435    1.0010

The redundancy of a 4-ary string of length $n$ with a maximum runlength $m$, denoted by $r_4(m, n)$, is
$$r_4(m, n) = 2n - \log_2 N_4(m, n) \approx n(2 - C_4(m)) - \log_2 A_4(m). \tag{23}$$
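The quantities in (19) to (22) can be evaluated numerically as sketched below. The root of (20) is found by bisection, which is a choice made for this example and not a method from the paper. For (22) the sketch assumes $r(x) = qT(x)$ and $p(x) = 1 - (q-1)T(x)$, taken from Theorem 2, so that $p'(x) = -(q-1)T'(x)$. All function names are illustrative.

```python
from math import log2

def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q x^m + q - 1 = 0, see (20), by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1
    # f attains its minimum at x = q m / (m + 1) and f(q) = q - 1 > 0,
    # so the largest root lies between these two points.
    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def capacity(q, m):
    """C_q(m) of (19)."""
    return log2(largest_root(q, m))

def coefficient(q, m):
    """A_q(m) of (22) with r(x) = q T(x) and p(x) = 1 - (q-1) T(x)."""
    lam = largest_root(q, m)
    z = 1 / lam
    t = sum(z ** i for i in range(1, m + 1))              # T(1/lambda)
    dt = sum(i * z ** (i - 1) for i in range(1, m + 1))   # T'(1/lambda)
    return lam * q * t / ((q - 1) * dt)

if __name__ == "__main__":
    print(capacity(4, 2), coefficient(4, 2))
    print(coefficient(4, 2) * largest_root(4, 2) ** 10)   # estimate of N_4(2, 10)
```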
B. Binary-based RLL code construction, Construction II
In a similar vein as presented in Section III, we may exploit binary maximum runlength limited (RLL) codes for generating quaternary RLL sequences. Construction II exemplifies such a technique for $m > 1$. Let $u = (u_1, \ldots, u_n)$ be an $n$-bit RLL string. We merge the RLL $n$-bit string, $u$, with an $n$-bit source string $y = (y_1, \ldots, y_n)$, by using the addition $v_i = u_i + 2y_i$, $1 \le i \le n$, where $v = (v_1, \ldots, v_n)$, $v_i \in \mathcal{Q}$, is the 4-ary output string. It is easily verified that the 4-ary output string, $v$, has maximum allowed run $m$, the same as the binary string $u$. The number of distinct 4-ary sequences, $v$, of Construction II equals $2^n N_2(m, n)$, so that the redundancy, denoted by $r_2(m, n)$, is
$$r_2(m, n) \approx n(1 - C_2(m)) - \log_2 A_2(m). \tag{24}$$
The rate efficiency with respect to the runlength limited 4-ary channel, denoted by $\eta(m)$, is expressed by
$$\eta(m) = \frac{1 + C_2(m)}{C_4(m)}. \tag{25}$$
Table III lists results of computations. We may notice that for small values of $m$, Construction II will suffer a capacity loss, $1 - \eta(m)$, of up to 12 % for $m = 2$. For larger values of $m$, however, the capacity loss is negligible.
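The efficiency $\eta(m)$ of (25) follows from the two capacities. The sketch below recomputes it; the bisection root finder is an implementation choice for this example, and the function names are illustrative.

```python
from math import log2

def capacity(q, m):
    """C_q(m) = log2 of the largest root of x^(m+1) - q x^m + q - 1 = 0 (bisection)."""
    lo, hi = q * m / (m + 1), float(q)
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return log2((lo + hi) / 2)

def efficiency(m):
    """eta(m) of (25): capacity of Construction II relative to the 4-ary RLL channel."""
    return (1 + capacity(2, m)) / capacity(4, m)

if __name__ == "__main__":
    for m in range(2, 8):
        print(m, round(efficiency(m), 3))
```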
TABLE III
ASYMPTOTIC RATE EFFICIENCY, $\eta(m)$, OF BINARY CONSTRUCTION II VERSUS MAXIMUM HOMOPOLYMER RUN, $m$.
m    η(m)
2    0.881
3    0.948
4    0.975
5    0.988
6    0.994
7    0.997

The above asymptotic efficiency of Construction II, $\eta(m)$, is valid for very large values of the strand length $n$, and it is of practical interest to assess the efficiency for smaller values of the strand length. Construction II can be used with any binary RLL code, and there are many binary code constructions for generating maximum runlength constrained sequences, see [10] for an overview. We propose here, for the efficiency assessment, a simple two-mode block code of codeword length $n$. Runlength constrained codewords in the first mode start with a symbol ‘zero’, while codewords in the second mode start with a ‘one’. When the previously sent codeword ends with a ‘one’ we use the codewords from the first mode and vice versa. The number of binary source words that can be accommodated with Construction II equals $2^{n-1} N_2(m, n)$, so that the code rate, denoted by $R_{m,0}$, is
$$R_{m,0} = \frac{1}{n} \left( n - 1 + \lfloor \log_2 N_2(m, n) \rfloor \right), \tag{26}$$
where we truncated the code size to the largest power of two possible. Table IV shows selected outcomes of computations of the rate efficiency $R_{m,0}/C_4(m)$ versus $m$ and $n$.

TABLE IV
RATE EFFICIENCY, $R_{m,0}/C_4(m)$, OF BINARY CONSTRUCTION II VERSUS STRAND LENGTH, $n$, AND MAXIMUM HOMOPOLYMER RUN, $m$.
n     m = 2    m = 3    m = 4
5     0.832    0.807    0.802
6     0.780    0.841    0.835
7     0.817    0.865    0.859
8     0.845    0.883    0.877
9     0.809    0.897    0.891
10    0.832    0.908    0.902
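The rate (26) can be computed from recursion (14) with integer arithmetic only. In this sketch the floor of the logarithm is obtained with `int.bit_length()`, and the capacity values used for the efficiency are the rounded entries of Table I; function names are illustrative.

```python
def count_rll(q, m, n):
    """N_q(m, n) from recursion (14), computed iteratively."""
    counts = [0]                      # counts[0] is a placeholder; lengths start at 1
    for length in range(1, n + 1):
        if length <= m:
            counts.append(q ** length)
        else:
            counts.append((q - 1) * sum(counts[length - k] for k in range(1, m + 1)))
    return counts[n]

def rate_binary_two_mode(m, n):
    """R_{m,0} of (26); bit_length() - 1 equals floor(log2(.)) for positive integers."""
    return (n - 1 + count_rll(2, m, n).bit_length() - 1) / n

if __name__ == "__main__":
    c4 = {2: 1.9227, 3: 1.9824, 4: 1.9957}   # C_4(m), rounded values from Table I
    for n in range(5, 11):
        print(n, [round(rate_binary_two_mode(m, n) / c4[m], 3) for m in (2, 3, 4)])
```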
C. Encoding of quaternary sequences without binary step
In this subsection, we investigate constructions of codes
that transform binary words directly (that is, without an
intermediate binary coding step) into 4-ary maximum ho-
mopolymer constrained codewords. An example of a simple
4-ary block code was presented by Blawat et al. [2]. The
code converts 8 source bits into a 4-ary word of 5 nt. The
5-nt words can be cascaded without violating the prescribed
m= 3 maximum homopolymer run. The rate of Blawat’s
construction is R= 8/5=1.6. As C4(m= 3) = 1.9824,
see Table I, the (rate) efficiency of the construction is
R/C4(m) = 0.807. Alternative, and more efficient, con-
structions are described below.
TABLE V
RATE EFFICIENCY, $R_{m,1}/C_4(m)$, OF THE 4-ARY CODE CONSTRUCTION VERSUS STRAND LENGTH, $n$, AND MAXIMUM HOMOPOLYMER RUN, $m$.
n     m = 1    m = 2    m = 3    m = 4
5     0.883    0.832    0.807    0.802
6     0.841    0.867    0.841    0.835
7     0.901    0.892    0.865    0.859
8     0.946    0.910    0.883    0.877
9     0.911    0.925    0.897    0.891
10    0.946    0.936    0.908    0.902
1) State-independent decoding: A source word can be represented by two $n$-symbol 4-ary $m$-constrained codewords. The two representations differ at the first position. In case we cascade a new codeword to the previous codeword, we are always able to choose (at least) one representation whose first symbol differs from the last symbol of the previous codeword. Then, clearly, the cascaded string of 4-ary symbols satisfies the maximum homopolymer run constraint. The rate of this two-mode construction, denoted by $R_{m,1}$, is
$$R_{m,1} = \frac{1}{n} \left( \lfloor \log_2 N_4(m, n) \rfloor - 1 \right), \tag{27}$$
where we truncated the code size to the largest power of two possible. Table V shows selected outcomes of computations of the rate efficiency $R_{m,1}/C_4(m)$ versus $m$ and $n$. We observe that, for $m = 2$, the ‘quaternary’ efficiency $R_{2,1}/C_4(2)$ is slightly better than the ‘binary’ $R_{2,0}/C_4(2)$. For $m > 2$, both approaches have the same efficiency. The conversion of the binary source symbols into the 4-ary $n$-nt strands and vice versa can be accomplished using look-up tables of complexity $4^n$.
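The rate (27) of the two-mode quaternary construction can be evaluated in the same way as before. This is a sketch with illustrative function names; it only computes the rate and does not build the codebook.

```python
def count_rll(q, m, n):
    """N_q(m, n) from recursion (14), computed iteratively."""
    counts = [0]
    for length in range(1, n + 1):
        if length <= m:
            counts.append(q ** length)
        else:
            counts.append((q - 1) * sum(counts[length - k] for k in range(1, m + 1)))
    return counts[n]

def rate_quaternary_two_mode(m, n):
    """R_{m,1} of (27): one bit is spent on the two representations per source word."""
    return (count_rll(4, m, n).bit_length() - 1 - 1) / n

if __name__ == "__main__":
    for n in range(5, 11):
        print(n, [rate_quaternary_two_mode(m, n) for m in (1, 2, 3, 4)])
```

Dividing these rates by the capacities of Table I gives the efficiencies of Table V, for example $1.4 / 1.5850 \approx 0.883$ for $n = 5$, $m = 1$.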
2) State-dependent decoding: In the above construction, the encoded codeword depends on the last symbol of the previous codeword. Decoding, however, is based on the observation of the $n$ symbols of the retrieved codeword. In this subsection, we discuss a state-dependent decoding construction, where the codeword chosen depends on the last symbol of the previous codeword, and decoding is based on the observation of the $n$ symbols of the retrieved codeword plus the last symbol of the previous codeword. We define four tables of codewords, denoted by $L(i, a)$, where $i$, $1 \le i \le K$, denotes the decimal representation of the source word to be encoded, $K$ denotes the size of the table, and $a$ denotes the encoder state, $a \in \{1, 2, 3, 4\}$. We construct the four tables in such a way that the codewords in each table $L(i, a)$ do not start with the symbol $a$. Then, the maximum size of the tables equals $K = \frac{3}{4} N_4(m, n)$ (note that $N_4(m, n)$ is a multiple of 4). The representation, $L(i, a)$, chosen depends on the last symbol of the previous codeword, $a$. The rate of this four-mode construction, denoted by $R_{m,2}$, is
$$R_{m,2} = \frac{1}{n} \left\lfloor \log_2 \frac{3}{4} N_4(m, n) \right\rfloor. \tag{28}$$

TABLE VI
RATE EFFICIENCY, $R_{m,2}/C_4(m)$, OF THE 4-ARY CODE CONSTRUCTION VERSUS STRAND LENGTH, $n$, AND MAXIMUM HOMOPOLYMER RUN, $m$.
n     m = 1    m = 2    m = 3    m = 4
5     0.883    0.936    0.908    0.902
6     0.946    0.954    0.925    0.919
7     0.991    0.966    0.937    0.931
8     0.946    0.975    0.946    0.940
9     0.981    0.982    0.953    0.946
10    0.946    0.936    0.958    0.952

Table VI shows the rate efficiencies that can be reached with this construction. The efficiency improvement with respect to Table V is obtained at the cost of a four times larger look-up table. Decoding of a codeword is uniquely accomplished by observing the $n$-symbol codeword plus the last symbol of the previous codeword.
Example 1: Let (as in Blawat’s code [2]) $n = 5$ and $m = 3$. We simply find, using (14), $N_4(3, 5) = 996$, so that the code may accommodate $K = \frac{3}{4} \times 996 = 747$ binary source words. Since $K > 512 = 2^9$ we may implement a code of rate 9/5, which is 12.5% higher than that of Blawat’s code of rate 8/5. As we have the freedom of deleting 747 − 512 = 235 redundant codewords, we may bar the words with the highest unbalance.
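Example 1 can be reproduced with a small brute-force sketch of the four-table construction. Several choices here are assumptions made for illustration: the encoder states are numbered 0 to 3 (the symbols of $\mathcal{Q}$) instead of 1 to 4, the initial state is 0, and each table simply keeps the first $2^9$ admissible words rather than barring the words with the highest unbalance. The function names are hypothetical.

```python
from itertools import product

def constrained_words(n, m, q=4):
    """All q-ary words of length n whose runs have length at most m."""
    def ok(word):
        run = 1
        for prev, cur in zip(word, word[1:]):
            run = run + 1 if cur == prev else 1
            if run > m:
                return False
        return True
    return [w for w in product(range(q), repeat=n) if ok(w)]

def build_tables(n, m, bits):
    """tables[a] lists 2**bits codewords that do not start with symbol a."""
    words = constrained_words(n, m)
    return {a: [w for w in words if w[0] != a][: 2 ** bits] for a in range(4)}

def encode(source_words, tables, state=0):
    strand = []
    for i in source_words:
        word = tables[state][i]
        strand.extend(word)
        state = word[-1]          # next state = last symbol sent
    return strand

def decode(strand, tables, n, state=0):
    decoded = []
    for pos in range(0, len(strand), n):
        word = tuple(strand[pos:pos + n])
        decoded.append(tables[state].index(word))
        state = word[-1]
    return decoded

if __name__ == "__main__":
    tables = build_tables(5, 3, 9)
    strand = encode([0, 511, 17, 300], tables)
    print(strand, decode(strand, tables, 5))
```

Since every codeword starts with a symbol different from the last symbol of its predecessor, runs never extend across codeword boundaries and the cascaded strand keeps the $m = 3$ constraint.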
In the next section, we take a look at the combination of balance and maximum homopolymer run constrained codes.
V. COMBINED WEIGHT AND MAXIMUM RUN CONSTRAINED CODES
Kerpez et al. [20], Braun and Immink [21], and Kur-
maev [22] analyzed properties and constructions of binary
combined weight and runlength constrained codes. Their
results are straightforwardly applied to the quaternary case
at hand. In the following subsections, we count binary and quaternary
sequences that satisfy combined maximum runlength and
weight constraints. We start by counting the number of
binary sequences, x, of length nthat satisfy a maximum
runlength constraint mand have a weight w=w2(x).
Paluncic and Maharaj [23] enumerated this number for the
balanced case $w = w_2(x) = n/2$.
A. Counting binary RLL sequences of given weight
Define the bi-variate generating function $H(x, y)$ in the dummy variables $x$ and $y$ by
$$H(x, y) = \sum_{i,j} h_{i,j} x^i y^j, \tag{29}$$
and let $[x^{n_1} y^{n_2}] H(x, y)$ denote the extraction of the coefficient of $x^{n_1} y^{n_2}$ in the formal power series $\sum h_{i,j} x^i y^j$, or
$$[x^{n_1} y^{n_2}] \sum h_{i,j} x^i y^j = h_{n_1, n_2}. \tag{30}$$
Define
$$T_1(x, y) = \sum_{i=1}^{m} x^i y^i. \tag{31}$$
The number of $n$-bit codewords, $x$, with maximum runlength $m$, denoted (with a slight abuse of notational convention by adding an extra parameter) by $N_2(m, w, n)$, that satisfy a given unbalance constraint $w = w_2(x)$ is given by the next Theorem.
Theorem 3:
$$N_2(m, w, n) = [x^n y^w] \frac{T_1(x, y) + T(x) + 2T_1(x, y)T(x)}{1 - T_1(x, y)T(x)}.$$
Proof: Let the sequence start with a run of zeros, then the generating function for the number of binary sequences with a maximum runlength $m$ is
$$T(x) + T(x)T_1(x, y) + T(x)^2 T_1(x, y) + T(x)^2 T_1(x, y)^2 + \cdots.$$
In case the sequence starts with a run of ones, we obtain for the generating function
$$T_1(x, y) + T(x)T_1(x, y) + T(x)T_1(x, y)^2 + T(x)^2 T_1(x, y)^2 + \cdots.$$
The generating function for the number of binary sequences with a maximum runlength $m$ starting with a run of ones or a run of zeros is the sum of the two above generating functions. Working out the sum yields
$$\frac{T_1(x, y) + T(x) + 2T_1(x, y)T(x)}{1 - T_1(x, y)T(x)},$$
which proves the Theorem.
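For small $n$ the numbers $N_2(m, w, n)$ can also be obtained without generating functions. The sketch below uses a plain dynamic program over (last bit, current run length, weight); it is a swapped-in counting technique for checking purposes, not the bi-variate series expansion of Theorem 3, and the function name is illustrative.

```python
def count_binary_by_weight(m, n):
    """counts[w] = number of n-bit words with runs of length <= m and weight w (n >= 1)."""
    # state: (last bit, length of the current run, weight so far) -> number of words
    states = {(b, 1, b): 1 for b in (0, 1)}
    for _ in range(n - 1):
        nxt = {}
        for (last, run, w), cnt in states.items():
            for b in (0, 1):
                new_run = run + 1 if b == last else 1
                if new_run > m:
                    continue          # would violate the maximum runlength
                key = (b, new_run, w + b)
                nxt[key] = nxt.get(key, 0) + cnt
        states = nxt
    counts = [0] * (n + 1)
    for (_, _, w), cnt in states.items():
        counts[w] += cnt
    return counts

if __name__ == "__main__":
    print(count_binary_by_weight(2, 4))   # [0, 2, 6, 2, 0]
```

Summing the list over all weights gives $N_2(m, n)$ of (14), and the list is symmetric in $w$ and $n - w$ because complementing a word preserves its runlengths.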
With the above bi-variate generating function, we may
exactly compute the number of binary m-constrained words
of weight w. More insight is gained by an approximation
of N2(m, w, n). For a given maximum runlength, m, and
large n, we are specifically interested in the distribution of
N2(m, w, n)versus the weight w. For asymptotically large
n, according to the central limit theorem, the distribution of
the number of sequences versus weight, w, is approximately
Gaussian [19].
Theorem 4:
$$N_2(m, w, n) \approx G\left(w; \frac{n}{2}, \frac{\gamma_2(m) n}{4}\right) N_2(m, n), \tag{32}$$
where
$$\gamma_2(m) = \frac{1}{\bar{l}} \sum_{i=1}^{m} (i - \bar{l})^2 \lambda_2^{-i}(m) \tag{33}$$
and
$$\bar{l} = \sum_{i=1}^{m} i \lambda_2^{-i}(m). \tag{34}$$
Proof: The probability of occurrence of a runlength of length $k$, $k \le m$, is $\lambda_2^{-k}(m)$, see [10], Chapter 4, so that the average number of runlengths in a sequence of $n$ symbols is $n/\bar{l}$. The weight $w$ is the sum of the runlengths of ones, so that according to the central limit theorem the weight distribution is approximately Gaussian for large $n$ with mean $\frac{n}{2}$ and variance $\frac{\gamma_2(m) n}{4}$.

TABLE VII
COEFFICIENT $\gamma_2(m)$ AND $\gamma_4(m)$ VERSUS MAXIMUM HOMOPOLYMER RUN $m$.
m     γ_2(m)    γ_4(m)
1     -         0.5000
2     0.1708    0.7410
3     0.3449    0.8796
4     0.5059    0.9497
5     0.6426    0.9808
10    0.9565    0.9999
∞     1         1
Table VII shows results of computations (the parameter $\gamma_4(m)$ is explained in Section V-B). Perusal of the outcomes clearly demonstrates that for small values of $m$ the unbalance variance, $\gamma_2(m) n / 4$, is smaller than that of unconstrained sequences (that is, $m = \infty$) of the same length $n$. In other words, a maximum runlength ‘helps’ to reduce the expected unbalance.
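The coefficient $\gamma_2(m)$ of (33) and (34) can be evaluated numerically as sketched below, for $m \ge 2$. The bisection root finder for $\lambda_2(m)$ is an implementation choice for this example, and the function names are illustrative.

```python
def largest_root(q, m):
    """Largest real root of x^(m+1) - q x^m + q - 1 = 0 (bisection)."""
    lo, hi = q * m / (m + 1), float(q)
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def gamma2(m):
    """gamma_2(m) of (33), with runlength probabilities lambda_2(m)**(-k), k = 1..m."""
    lam = largest_root(2, m)
    probs = {k: lam ** -k for k in range(1, m + 1)}
    mean = sum(k * p for k, p in probs.items())               # l-bar of (34)
    return sum((k - mean) ** 2 * p for k, p in probs.items()) / mean

if __name__ == "__main__":
    for m in (2, 3, 4, 5, 10):
        print(m, round(gamma2(m), 4))
```

The probabilities sum to one because $\sum_{k=1}^{m} \lambda_2^{-k}(m) = 1$ is equivalent to the characteristic equation (20) for $q = 2$.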
B. Counting quaternary RLL sequences of given weight
We count the number of n-tuples xof 4-ary symbols
that satisfy a maximum run length constraint, m, and
have weight w=w4(x), denoted (with a slight abuse of
notational convention) by N4(m, w, n).
1) Maximum runlength constraint: For the special case $m = 1$, Limbachiya et al. [24] presented a closed expression of $N_4(1, w, n)$. For other values of the prescribed maximum runlength, $m$, we may readily compute the number of 4-ary sequences, $N_4(m, w, n)$, versus weight, $w = w_4(x)$, by applying generating functions. The 4-ary symbols are generated by a constrained data source that can be modelled as a four-state Moore-type finite-state machine. The machine steps from state to state, where, when state $i \in \mathcal{Q}$ is visited, a sequence of $k$, $1 \le k \le m$, symbols ‘$i$’ is emitted. After visiting state $i$, the data source may not return to state $i$ (and thus emit a sequence of the same symbol ‘$i$’ again), but it enters state $j \ne i$, $j \in \mathcal{Q}$. When the machine enters state 3 or 4, the word weight, $w$, is incremented by $k$, where $k$, $1 \le k \le m$, denotes the run of symbols ‘3’ or ‘4’. When, on the other hand, states 1 or 2 are entered, the weight increment is nil. The resulting $4 \times 4$ one-step skeleton or state-transition matrix, $D(x, y)$, of the finite-state machine is
$$D(x, y) = \begin{bmatrix} 0 & a_0 & a_0 & a_0 \\ a_0 & 0 & a_0 & a_0 \\ a_1 & a_1 & 0 & a_1 \\ a_1 & a_1 & a_1 & 0 \end{bmatrix}, \tag{35}$$
where $a_0 = T(x)$ and $a_1 = T_1(x, y)$.
Theorem 5: The number of 4-ary sequences of length $n$ with maximum runlength constraint $m$ and weight $w$ equals
$$N_4(m, w, n) = [x^n y^w] \frac{1}{3} \sum_{i,j} d^{[n]}_{i,j}(x, y), \tag{36}$$
where $d^{[n]}_{i,j}(x, y)$ denotes the entries of $D^n(x, y)$.
Proof: The entries $d^{[n]}_{i,j}(x, y)$ of $D^n(x, y)$ are equal to the number of sequences (paths) of length $n$ starting in state $i$ and ending in state $j$. Summation of the entries and division by 3 yields the generating function of $N_4(m, w, n)$.
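For short strands, $N_4(m, w, n)$ can be cross-checked with a direct dynamic program over the symbols of the word. The sketch below does not use the matrix $D(x, y)$; it is an independent counting method, with symbols 0 to 3 as in Section II (A = 2 and T = 3 carry weight) and an illustrative function name.

```python
def count_quaternary_by_weight(m, n):
    """counts[w] = number of 4-ary words of length n, runs <= m, with w symbols A or T."""
    # state: (last symbol, length of the current run, weight so far) -> number of words
    states = {(s, 1, int(s > 1)): 1 for s in range(4)}
    for _ in range(n - 1):
        nxt = {}
        for (last, run, w), cnt in states.items():
            for s in range(4):
                new_run = run + 1 if s == last else 1
                if new_run > m:
                    continue          # would violate the maximum homopolymer run
                key = (s, new_run, w + int(s > 1))
                nxt[key] = nxt.get(key, 0) + cnt
        states = nxt
    counts = [0] * (n + 1)
    for (_, _, w), cnt in states.items():
        counts[w] += cnt
    return counts

if __name__ == "__main__":
    print(count_quaternary_by_weight(3, 5))
```

The entries sum to $N_4(m, n)$ of (14), for instance 996 for $m = 3$, $n = 5$.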
In the next subsection, we derive a simple approximation
to N4(m, w, n)valid for large n.
2) Estimate of the weight distribution: For asymptotically large $n$, the weight distribution is approximately Gaussian, that is, we may conveniently approximate $N_4(m, w, n)$ using the next Theorem.
Theorem 6:
$$N_4(m, w, n) \approx G\left(w; \frac{n}{2}, \sigma_4^2(m, n)\right) N_4(m, n), \quad n \gg 1, \tag{37}$$
where $\sigma_4^2(m, n)$ denotes the variance of the Gaussian weight distribution.
Proof: Let $u_i$, $i = 1, 2, \ldots$, $u_i \in \mathcal{Q}$, be an infinitely long 4-ary sequence generated by a maxentropic source that satisfies a prescribed maximum runlength, $m$. Note that although the 4-ary sequence $u_i$, $i = 1, 2, \ldots$, satisfies a limited runlength constraint, $m$, the runs of the binary weight sequence $v_i = \varphi(u_i)$, $i = 1, 2, \ldots$, are without limit. The variance, $\sigma_4^2(m, n)$, of the Gaussian weight distribution is governed by the runlength distribution, $P(k)$, of the binary sequence $v_i$, where $P(k)$, $k > 0$, denotes the probability of occurrence of a runlength $k$. Clearly, $\sum_{k > 0} P(k) = 1$. The probability $P(k)$ is given by
$$P(k) = c N_2(m, k) \lambda_4^{-k}, \quad k \ge 1, \tag{38}$$
where the normalization constant $c$ is chosen such that $\sum_{k=1}^{\infty} P(k) = 1$. The term $N_2(m, k)$ is the number of AT combinations of length $k$, which may consist of a single A or T run or a plurality of alternating A and T runs. We have $\sigma_4^2(m, n) = \gamma_4(m) \frac{n}{4}$, where, see [10], Chapter 4,
$$\gamma_4(m) = \frac{1}{\bar{l}} \sum_{k=1}^{\infty} (k - \bar{l})^2 P(k) \tag{39}$$
and
$$\bar{l} = \sum_{k=1}^{\infty} k P(k). \tag{40}$$
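The coefficient $\gamma_4(m)$ of (38) to (40) can be approximated by truncating the infinite sums. In the sketch below the truncation length `terms` is an assumption made for this example (the terms decay geometrically, so a few hundred are ample), the root $\lambda_4(m)$ is found by bisection, and the function names are illustrative.

```python
def largest_root(q, m):
    """Largest real root of x^(m+1) - q x^m + q - 1 = 0 (bisection)."""
    lo, hi = q * m / (m + 1), float(q)
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - q * mid ** m + q - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def binary_counts(m, k_max):
    """List with entry k equal to N_2(m, k) of (14), for k = 0..k_max (entry 0 unused)."""
    counts = [0]
    for k in range(1, k_max + 1):
        counts.append(2 ** k if k <= m else sum(counts[k - j] for j in range(1, m + 1)))
    return counts

def gamma4(m, terms=300):
    """gamma_4(m) of (39) with P(k) proportional to N_2(m, k) * lambda_4**(-k), see (38)."""
    lam4 = largest_root(4, m)
    n2 = binary_counts(m, terms)
    weights = [n2[k] * lam4 ** -k for k in range(1, terms + 1)]
    total = sum(weights)                                  # equals 1/c
    probs = [w / total for w in weights]
    mean = sum(k * p for k, p in enumerate(probs, start=1))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs, start=1)) / mean

if __name__ == "__main__":
    for m in (1, 2, 3, 4, 5):
        print(m, round(gamma4(m), 4))
```

For $m = 1$ the distribution is geometric, $P(k) = 2 \cdot 3^{-k}$, which gives $\gamma_4(1) = 0.5$ exactly.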
Table VII shows results of computations of $\gamma_4(m)$ versus $m$. We may notice that the weights of the quaternary RLL sequences are more concentrated around the mean $n/2$ than those of unconstrained sequences ($\gamma_4(m) < 1$), but less concentrated than those of binary RLL sequences ($\gamma_4(m) > \gamma_2(m)$). The above outcome is not consistent with the results by Erlich and Zielinski [3], as they assume that the balance variance equals $n/4$, independent of $m$.
C. Redundancy of binary and quaternary codes with combined constraints
As in Constructions I and II, let the quaternary word $x = (x_1, \ldots, x_n)$, $x_i \in \mathcal{Q}$, be written as $x = y + 2z$, where the constituting elements $y_i$ and $z_i \in \{0, 1\}$. If the binary sequence $z$ is $m$-constrained and has a weight $w = w_2(z)$, then $x$ is $m$-constrained and it has a weight $w_4(x) = w$. The redundancy of the binary constrained sequences, $z$, denoted (with a slight abuse of convention) by $r_2(m, a, n)$, equals
$$r_2(m, a, n) = n - \log_2 N_2(m, w, n). \tag{41}$$
Using (24) and (32), we obtain for $n \gg 1$, that the redundancy of the binary approach is
$$r_2(m, a, n) \approx r_2(m, n) - \log_2\left(1 - 2Q\left(2a\sqrt{\frac{n}{\gamma_2(m)}}\right)\right). \tag{42}$$
The redundancy of the quaternary approach, denoted by $r_4(m, a, n)$, equals, for $n \gg 1$,
$$r_4(m, a, n) = \log_2 \frac{4^n}{N_4(m, w, n)} \approx r_4(m, n) - \log_2\left(1 - 2Q\left(2a\sqrt{\frac{n}{\gamma_4(m)}}\right)\right). \tag{43}$$
A numerical analysis of the above expressions shows that the redundancy difference due to the balance (right hand) term is around 0.5-1 bit for $m = 2$. For larger values of the homopolymer run $m$ the extra redundancy is negligible. The redundancy difference, $r_2(m, n) - r_4(m, n)$, due to the imposed runlength constraint is much larger for $n > 10$ than the redundancy due to the balance constraint. For $m > 6$ the difference between $r_2(m, n)$ and $r_4(m, n)$ is negligible, see Subsection IV-B, so that considering the much larger look-up tables needed for quaternary codes, the binary approach using Construction I for combined constraints is preferable from a practical point of view.
VI. CONCLUSIONS
We have described coding techniques for weakly balancing the GC and AT content and avoiding homopolymer runs longer than $m$ nt in quaternary DNA strings. We have
found exact and approximate expressions for the number
of binary and quaternary sequences with combined weight
and run-length constraints. We have compared two coding
approaches for constraint-based coding of DNA strings. In
the first approach, an intermediate, ‘binary’, coding step is
used, while in the second approach we ‘directly’ translate
source data into constrained quaternary sequences. The
binary approach is attractive as it yields a lower complexity
of encoding and decoding look-up tables. The redundancy
of the binary approach is higher than that of the quaternary
approach for generating combined weight and run-length
constrained sequences. The redundancy difference is small
for larger values of the maximum homopolymer run.
REFERENCES
[1] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6012, pp. 1628-1628, 2012.
[2] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso, B. W. Pruitt, and G. M. Church, “Forward Error Correction for DNA Data Storage,” International Conference on Computational Science (ICCS 2016), vol. 80, pp. 1011-1022, 2016.
[3] Y. Erlich and D. Zielinski, “DNA Fountain enables a robust and efficient storage architecture,” Science, vol. 355, pp. 950-954, March 2017.
[4] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig, “A DNA-based Archival Storage System,” ACM SIGOPS Operating Systems Review, vol. 50, pp. 637-649, 2016.
[5] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, “Characterizing and Measuring Bias in Sequence Data,” Genome Biol., vol. 14, R51, 2013.
[6] K. W. Cattermole, “Principles of Digital Line Coding,” Int. Journal of Electronics, vol. 55, pp. 3-33, July 1983.
[7] K. A. S. Immink and K. Cai, “Design of Capacity-Approaching Constrained Codes for DNA-based Storage Systems,” IEEE Commun. Letters, vol. 22, pp. 224-227, Feb. 2018.
[8] Y.-S. Kim and S.-H. Kim, “New Construction of DNA Codes with Constant-GC Contents from Binary Sequences with Ideal Autocorrelation,” IEEE International Symposium on Information Theory (ISIT), pp. 1569-1573, 2011.
[9] Y.-M. Chee and S. Ling, “Improved Lower Bounds for Constant GC-Content DNA Codes,” IEEE Trans. Inform. Theory, vol. IT-54, no. 1, pp. 391-394, Jan. 2008.
[10] K. A. S. Immink, Codes for Mass Data Storage Systems, Second Edition, ISBN 90-74249-27-2, Shannon Foundation Publishers, Eindhoven, Netherlands, 2004.
[11] V. Taranalli, H. Uchikawa, and P. H. Siegel, “Error Analysis and Inter-Cell Interference Mitigation in Multi-Level Cell Flash Memories,” Proceedings IEEE International Conference on Communications (ICC), London, pp. 271-276, June 2015.
[12] S. M. H. T. Yazdi, H. M. Kiah, and O. Milenkovic, “Weakly Mutually Uncorrelated Codes,” IEEE International Symposium on Information Theory (ISIT), pp. 2649-2653, Barcelona, Spain, July 2016.
[13] A. X. Widmer and P. A. Franaszek, “A Dc-balanced, Partitioned-Block, 8b/10b Transmission Code,” IBM J. Res. Develop., vol. 27, no. 5, pp. 440-451, Sept. 1983.
[14] D. E. Knuth, “Efficient Balanced Codes,” IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan. 1986.
[15] J. H. Weber and K. A. S. Immink, “Knuth’s Balancing of Codewords Revisited,” IEEE Trans. Inform. Theory, vol. 56, no. 4, pp. 1673-1679, 2010.
[16] S. W. McLaughlin, J. Luo, and Q. Xie, “On the Capacity of M-ary Runlength-Limited Codes,” IEEE Trans. Inform. Theory, vol. IT-41, no. 5, pp. 1508-1511, Sept. 1995.
[17] M. W. Marcellin and H. J. Weber, “Two-dimensional Modulation Codes,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 1, pp. 254-266, Jan. 1992.
[18] C. E. Shannon, “A Mathematical Theory of Communication,” Bell Syst. Tech. J., vol. 27, pp. 379-423, July 1948.
[19] P. Flajolet and R. Sedgewick, Analytic Combinatorics, ISBN 978-0-521-89806-5, Cambridge University Press, 2009.
[20] K. J. Kerpez, A. Gallopoulos, and C. Heegard, “Maximum Entropy Charge-Constrained Run-Length Codes,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 1, pp. 242-253, Jan. 1992.
[21] V. Braun and K. A. S. Immink, “An Enumerative Coding Technique for DC-free Runlength-Limited Sequences,” IEEE Trans. on Communications, vol. 48, no. 12, pp. 2024-2031, Dec. 2000.
[22] O. Kurmaev, “Constant-Weight and Constant-Charge Binary Run-Length Limited Codes,” IEEE Trans. Inform. Theory, vol. IT-57, no. 7, pp. 4497-4515, July 2011.
[23] F. Paluncic and B. T. J. Maharaj, “Using Bivariate Generating Functions to Count the Number of Balanced Runlength-Limited Words,” IEEE Globecom 2017, Singapore, 4-8 Dec. 2017.
[24] D. Limbachiya, M. K. Gupta, and V. Aggarwal, “Family of Constrained Codes for Archival DNA Data Storage,” IEEE Communications Letters, August 2018.