Properties and Constructions of Constrained Codes for DNA-Based Data Storage

Received February 27, 2020, accepted March 7, 2020, date of publication March 11, 2020, date of current version March 19, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2980036
KEES A. SCHOUHAMER IMMINK 1, (Life Fellow, IEEE),
AND KUI CAI 2, (Senior Member, IEEE)
1Turing Machines Inc., 3016 DK Rotterdam, The Netherlands
2Singapore University of Technology and Design (SUTD), Singapore 487372
Corresponding author: Kees A. Schouhamer Immink (immink@turing-machines.com)
This work was supported by the Singapore Ministry of Education Academic Research Fund Tier 2 under Grant MOE2016-T2-2-054.
ABSTRACT We describe properties and constructions of constraint-based codes for DNA-based data
storage which account for the maximum repetition length and AT/GC balance. Generating functions and
approximations are presented for computing the number of sequences with maximum repetition length and
AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced
strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of
codes that account for both the maximum repetition length and AT/GC balance are derived.
INDEX TERMS Constrained coding, maximum runlength, balanced words, storage systems, DNA-based
storage.
I. INTRODUCTION
The first large-scale archival DNA-based storage archi-
tecture was implemented by Church et al. [1] in 2012.
Blawat et al. [2] described successful experiments for storing
and retrieving data blocks of 22 Mbyte of digital data in
synthetic DNA. Erlich and Zielinski [3] further explored the limits of storage capacity of DNA-based storage architectures. Recent examples of experimental work on DNA-based storage can be found in [4]–[6].
Naturally occurring DNA consists of four types of
nucleotides: adenine (A), cytosine (C), guanine (G), and
thymine (T). A DNA strand (or oligonucleotide, or oligo for short) is a linear sequence of these four nucleotides that is composed by a DNA synthesizer. Binary source, or user, data are translated into the four types of nucleotides, for example, by mapping two binary source symbols into a single nucleotide (nt for short).
Strings of nucleotides should satisfy a few elementary conditions, called constraints, in order to be less error prone. Long repetitions of the same nucleotide, called homopolymer runs, significantly increase the chance of sequencing errors [7], [8], so that such long runs should be avoided. For example, in [8], experimental studies show that once the homopolymer run is longer than four nt, the sequencing error rate starts increasing significantly. In addition, [8] also reports that oligos with a large unbalance between GC and AT content exhibit high dropout rates and are prone to polymerase chain reaction (PCR) errors, and should therefore be avoided.
The associate editor coordinating the review of this manuscript and approving it for publication was Nadeem Iqbal.
Blawat’s format [2] incorporates a constrained code that
uses a look-up table for translating binary source data
into strands of nucleotides with a homopolymer run of
length at most three. Blawat’s format did not incorpo-
rate an AT/GC balance constraint. Strands that do not sat-
isfy both the maximum homopolymer run requirement and
the weak balance constraint are barred in Erlich’s coding
format [3].
In this paper, we describe properties and constructions
of quaternary constraint-based codes for DNA-based stor-
age which account for a maximum homopolymer run and
maximum unbalance between AT and GC contents. Binary
‘balanced’ and runlength limited sequences have found
widespread use in data communication and storage prac-
tice [9]. We show that constrained binary sequences can easily
be translated into constrained quaternary sequences, which
opens the door to a wealth of efficient binary code con-
structions for application in DNA-based storage [10]–[13].
A further advantage of the binary-to-binary translation
instead of a ‘direct’ binary-to-quaternary translation is the
lower complexity of encoding and decoding look-up tables.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 49523
K. A. S. Immink, K. Cai: Properties and Constructions of Constrained Codes for DNA-Based Data Storage
The disadvantage is, as we show, the loss in information
capacity of the binary versus the quaternary approach.
We start in Section II with a description of the limiting
properties and code constructions that impose a maximum
homopolymer run. We specifically compute and compare
the information capacity of binary versus ‘direct’ quaternary
coding techniques. In Section III, we enumerate the number
of binary and quaternary sequences with combined AT and
GC contents and run-length constraints. Section IV concludes
the paper.
II. MAXIMUM RUNLENGTH CONSTRAINT
Long repetitions of the same nucleotide (nt), called a
homopolymer run or runlength, may significantly increase
the chance of sequencing errors [7], [8], and should be
avoided. Avoiding long runs of the same nucleotide will result
in loss of information capacity, and codes are required for
translating arbitrary source data into constrained quaternary
strings. Binary runlength limited (RLL) codes have found widespread application in digital communication and storage devices since the 1950s [9], [14]. McLaughlin et al. [15] studied multi-level runlength limited codes for optical recording. An n-nucleotide oligo, a string of n 4-ary symbols, can be seen as two parallel binary strings of length n, where each 4-ary symbol is represented by two binary symbols. Such a system of multiple parallel data streams with joint constraints is reminiscent of ‘two-dimensional’ track systems, which have been studied by Marcellin and Weber [16].
We start in the next subsection with the counting of q-ary sequences that satisfy a maximum runlength, followed by subsections where we describe limiting properties and code constructions that avoid m + 1 repetitions of the same nucleotide.
A. COUNTING q-ARY SEQUENCES, CAPACITY
Let the number of q-ary n-length sequences having a maximum run, m, of the same symbol be denoted by $N_q(m,n)$. The number $N_q(m,n)$ is found by using the recursive relation [17, Part 1]:
$$N_q(m,n) = \begin{cases} q^n, & n \le m, \\ (q-1) \sum_{k=1}^{m} N_q(m,n-k), & n > m. \end{cases} \tag{1}$$
For $n \le m$ the above is trivial, as all sequences satisfy the maximum runlength constraint. For $n > m$ we follow Shannon’s approach [17] for the discrete noiseless channel. A run of $k$ symbols $a$ can be seen as a ‘phrase’ $a$ of length $k$. After a phrase $a$ has been emitted, a phrase of symbols $b \ne a$ of length $k$, $1 \le k \le m$, can be emitted without violating the imposed maximum runlength constraint. A sequence of length $n$ that ends with a phrase of length $k$ is obtained by appending that phrase to one of the $N_q(m,n-k)$ allowed sequences of length $n-k$, where the symbol of the appended phrase can be chosen in $q-1$ ways. The total number of allowed sequences, $N_q(m,n)$, is therefore equal to $(q-1)$ times the sum of the numbers $N_q(m,n-k)$, $k = 1, 2, \ldots, m$, which proves (1). Using the above expression, we may easily compute the feasibility of a q-ary m-constrained code for relatively small values of n where a coding look-up table is practicable, see Subsection II-C for more details.
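As an illustration, recursion (1) can be evaluated directly. The sketch below is not part of the original paper; the function name `count_rll` is a hypothetical helper, and it assumes n ≥ 1.

```python
def count_rll(q: int, m: int, n: int) -> int:
    """Number of q-ary length-n sequences whose longest run is at most m, per recursion (1)."""
    counts = [0] * (n + 1)  # counts[j] holds N_q(m, j); counts[0] is unused
    for j in range(1, n + 1):
        if j <= m:
            counts[j] = q ** j  # every sequence satisfies the constraint
        else:
            counts[j] = (q - 1) * sum(counts[j - k] for k in range(1, m + 1))
    return counts[n]

print(count_rll(4, 3, 5))   # 996, the value used in the example of Subsection II-C
print(count_rll(4, 2, 10))  # 676836, the value quoted below equation (9)
```

Both printed values agree with the numbers quoted in the text.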
1) GENERATING FUNCTIONS
Generating functions are a very useful tool for enumerating constrained sequences [18], and they offer tools for approximating the number of constrained sequences for asymptotically large values of the sequence length n. The series of numbers $\{N_q(m,n)\}$, $n = 1, 2, \ldots$, in (1), can be compactly written as the coefficients of a formal power series $H_{q,m}(x) = \sum_i N_q(m,i) x^i$, where $x$ is a dummy variable. There is a simple relationship between the generating function, $H_{q,m}(x)$, and the linear homogeneous recurrence relation (1) with constant coefficients that defines the same series [18]. We first define a generating function
$$G(x) = \sum_i g_i x^i. \tag{2}$$
Let the operation $[x^n]G(x)$ denote the extraction of the coefficient of $x^n$ in the formal power series $G(x)$, that is, define
$$[x^n] \sum_i g_i x^i = g_n. \tag{3}$$
Let
$$T(x) = \sum_{i=1}^{m} x^i. \tag{4}$$
The generating function for the number of q-ary sequences with a maximum runlength m is
$$qT(x) + q(q-1)T(x)^2 + q(q-1)^2 T(x)^3 + \cdots.$$
We may rewrite the above as
$$\frac{qT(x)}{1-(q-1)T(x)},$$
so that the number of n-symbol m-constrained q-ary words is
$$N_q(m,n) = [x^n] \frac{qT(x)}{1-(q-1)T(x)}. \tag{5}$$
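The coefficient extraction in (5) can be sketched as a power-series division and cross-checked against exhaustive enumeration for short words. This is an illustrative sketch only; the names `gf_coefficients` and `brute_force` are hypothetical, and the brute-force check is feasible only for small n.

```python
from itertools import groupby, product

def gf_coefficients(q, m, n_max):
    """Coefficients h[0..n_max] of H(x) = q*T(x) / (1 - (q-1)*T(x)), T(x) = x + ... + x^m."""
    t = [1 if 1 <= i <= m else 0 for i in range(n_max + 1)]
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # From H(x) * (1 - (q-1) T(x)) = q T(x), compare the coefficients of x^n.
        h[n] = q * t[n] + (q - 1) * sum(t[k] * h[n - k] for k in range(1, n + 1))
    return h

def brute_force(q, m, n):
    """Count m-constrained words by exhaustive enumeration (small n only)."""
    return sum(
        all(len(list(run)) <= m for _, run in groupby(word))
        for word in product(range(q), repeat=n)
    )

h = gf_coefficients(4, 2, 6)
print(h[1:], [brute_force(4, 2, n) for n in range(1, 7)])  # the two lists should coincide
```

The division step is the recurrence (1) in disguise, which is the relationship between the generating function and the recurrence mentioned above.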
2) ASYMPTOTICAL BEHAVIOR
For asymptotically large codeword length n, the maximum number of (binary) user bits that can be stored per q-ary symbol, called (information) capacity, denoted by $C_q(m)$, is given by [17]
$$C_q(m) = \lim_{n \to \infty} \frac{1}{n} \log_2 N_q(m,n) = \log_2 \lambda_q(m), \tag{6}$$
where $\lambda_q(m)$ is the largest real root of the characteristic equation [15], [17]
$$x^{m+1} - q x^m + q - 1 = 0. \tag{7}$$
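A minimal numerical sketch of (6) and (7) is given below. It is an assumption of this sketch, not a method stated in the paper, that the root is found by bisection; the helper names `largest_root` and `capacity` are hypothetical. Dividing (7) by $(x-1)x^m$ gives the equivalent condition $(q-1)\sum_{k=1}^{m} x^{-k} = 1$, whose left-hand side is decreasing in x, so the root is unique on the interval [1, q].

```python
import math

def largest_root(q: int, m: int, tol: float = 1e-13) -> float:
    """lambda_q(m): largest real root of x^(m+1) - q*x^m + q - 1 = 0, by bisection."""
    g = lambda x: (q - 1) * sum(x ** -k for k in range(1, m + 1)) - 1.0
    lo, hi = 1.0, float(q)  # g(1) >= 0 and g(q) < 0 for q >= 2, m >= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def capacity(q: int, m: int) -> float:
    """Capacity C_q(m) = log2(lambda_q(m)) in bits per q-ary symbol, equation (6)."""
    return math.log2(largest_root(q, m))

print(round(capacity(4, 3), 4))  # compare with C_4(3) = 1.9824 quoted in Subsection II-C
```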
Table 1 shows the information capacities $C_2(m)$ and $C_4(m)$ versus maximum allowed (homopolymer) run m. For asymptotically large n we may approximate $N_q(m,n)$ by [18]
$$N_q(m,n) \approx A_q(m) \lambda_q^n(m). \tag{8}$$
TABLE 1. Capacity C2(m) and C4(m) versus m.
TABLE 2. Coefficient A2(m) and A4(m) versus m.
The coefficient $A_q(m)$ is found, see [14, pages 157-158], by rewriting $H_{q,m}(x)$ as a quotient of two polynomials, or $H_{q,m}(x) = r(x)/p(x)$. Then
$$A_q(m) = -\lambda_q(m) \frac{r(1/\lambda_q(m))}{p'(1/\lambda_q(m))}. \tag{9}$$
Table 2 shows the coefficients $A_2(m)$ and $A_4(m)$ versus m. For m = 1, we simply find $N_4(1,n) = 4 \cdot 3^{n-1}$. We found that the approximation (8) is remarkably accurate. For a typical example, $N_4(2,10) = 676836$, while the approximation using (8) yields $N_4(2,10) \approx 676835.9769$. The redundancy of a 4-ary string of length n with a maximum runlength m, denoted by $r_4(m,n)$, is, using (8),
$$r_4(m,n) = 2n - \log_2 N_4(m,n) \approx n(2 - C_4(m)) - \log_2 A_4(m). \tag{10}$$
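The coefficient (9) and the approximations (8) and (10) can be sketched numerically as follows, taking $r(x) = qT(x)$ and $p(x) = 1 - (q-1)T(x)$ from (5). The helper names are hypothetical and the bisection root finder is an assumption of this sketch.

```python
import math

def largest_root(q, m, tol=1e-13):
    """lambda_q(m) from (7), via bisection on (q-1) * sum_{k=1..m} x^-k = 1."""
    g = lambda x: (q - 1) * sum(x ** -k for k in range(1, m + 1)) - 1.0
    lo, hi = 1.0, float(q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def coefficient_A(q, m):
    """A_q(m) per (9), with r(x) = q T(x) and p(x) = 1 - (q-1) T(x)."""
    lam = largest_root(q, m)
    x0 = 1.0 / lam
    r = q * sum(x0 ** k for k in range(1, m + 1))
    dp = -(q - 1) * sum(k * x0 ** (k - 1) for k in range(1, m + 1))  # p'(x0)
    return -lam * r / dp

def redundancy_approx(m, n):
    """Right-hand side of (10) for the 4-ary case."""
    return n * (2 - math.log2(largest_root(4, m))) - math.log2(coefficient_A(4, m))

approx = coefficient_A(4, 2) * largest_root(4, 2) ** 10
print(approx)  # close to the exact count N_4(2,10) = 676836
```

For m = 1 the sketch returns $A_4(1) = 4/3$, in line with $N_4(1,n) = 4 \cdot 3^{n-1}$.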
B. BINARY-BASED RLL CODE CONSTRUCTION,
CONSTRUCTION I
Yazdi et al. [19] and Taranalli et al. [20] showed that we
may exploit binary maximum runlength limited (RLL) codes
for constructing quaternary RLL codes. Their construction,
denoted by Construction 1, exemplifies such a technique for
m>1. The construction is simple, but we show below that
this simplicity has its price in terms of extra redundancy.
Construction 1: Let $u = (u_1, \ldots, u_n)$ be an n-bit RLL string. We merge the RLL n-bit string, $u$, with an n-bit source string $y = (y_1, \ldots, y_n)$, by using the addition $v_i = u_i + 2y_i$, $1 \le i \le n$, where $v = (v_1, \ldots, v_n)$, $v_i \in Q$, is the 4-ary output string. It is easily verified that the 4-ary output string, $v$, has maximum allowed run m, the same as the binary string $u$.
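Construction 1 can be sketched in a few lines. The function names are hypothetical, the input strings are arbitrary illustrative data, and the symbol-to-nucleotide labelling G = 0, C = 1, A = 2, T = 3 is the one proposed later in Section III-A.

```python
from itertools import groupby

NUCLEOTIDES = "GCAT"  # G = 0, C = 1, A = 2, T = 3 (labelling of Section III-A)

def max_run(seq):
    """Length of the longest run of identical symbols."""
    return max(len(list(run)) for _, run in groupby(seq))

def construction1_encode(u, y):
    """Merge an RLL bit string u with a free source bit string y: v_i = u_i + 2*y_i."""
    assert len(u) == len(y)
    return [ui + 2 * yi for ui, yi in zip(u, y)]

def construction1_decode(v):
    """Recover (u, y) from the 4-ary string v."""
    return [vi % 2 for vi in v], [vi // 2 for vi in v]

u = [0, 0, 1, 0, 1, 1, 0, 1]  # binary string with maximum run m = 2
y = [1, 1, 1, 1, 0, 0, 0, 0]  # arbitrary source bits, unconstrained
v = construction1_encode(u, y)
print("".join(NUCLEOTIDES[s] for s in v), max_run(v))  # AATACCGC 2
```

A run in v requires a run of the same length in u, since equal 4-ary symbols have equal low-order bits; this is why v inherits the runlength limit of u even though y is unconstrained.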
The number of distinct 4-ary sequences, $v$, of Construction 1 equals $2^n N_2(m,n)$, so that the redundancy, denoted by $r_2(m,n)$, is
$$r_2(m,n) \approx n(1 - C_2(m)) - \log_2 A_2(m). \tag{11}$$
TABLE 3. Asymptotic rate efficiency, η(m), of binary Construction 1 versus
maximum homopolymer run, m.
TABLE 4. Rate efficiency, Rm,0/C4(m), of binary Construction 1 versus
strand length, n, and maximum homopolymer run, m.
The rate efficiency with respect to the runlength limited 4-ary channel, denoted by $\eta(m)$, is expressed by
$$\eta(m) = \frac{1 + C_2(m)}{C_4(m)}. \tag{12}$$
Table 3 lists results of computations. We may notice that Construction 1 will suffer a loss of up to 12% for m = 2. For larger values of m, however, the loss is negligible.
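The efficiency (12) can be evaluated with the short sketch below. The helper names and the bisection root finder are assumptions of this sketch; the printed numbers are indicative and are not a reproduction of Table 3.

```python
import math

def largest_root(q, m, tol=1e-13):
    """lambda_q(m) from (7), via bisection on (q-1) * sum_{k=1..m} x^-k = 1."""
    g = lambda x: (q - 1) * sum(x ** -k for k in range(1, m + 1)) - 1.0
    lo, hi = 1.0, float(q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def eta(m):
    """Asymptotic rate efficiency of Construction 1, equation (12)."""
    c2 = math.log2(largest_root(2, m))
    c4 = math.log2(largest_root(4, m))
    return (1 + c2) / c4

for m in range(2, 7):
    print(m, round(eta(m), 4))
```

For m = 2 the sketch gives an efficiency of about 0.88, which corresponds to the loss of roughly 12% mentioned in the text.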
The above asymptotic efficiency of Construction 1, η(m),
is valid for very large values of the strand length n. It is of
practical interest to assess the efficiency for smaller values of
the strand length. Construction 1 can be used with any binary
RLL code, and there are many binary code constructions
for generating maximum runlength constrained sequences,
see [14] for an overview. We propose here, for the efficiency
assessment, a simple two-mode block code of codeword
length n. Runlength constrained codewords in the first mode
start with a symbol ‘zero’, while codewords in the second
mode start with a ‘one’. When the previously sent codeword ends with a ‘one’ we use the codewords from the first mode and vice versa. The number of binary source words that can be accommodated with Construction 1 equals $2^{n-1} N_2(m,n)$, so that the code rate, denoted by $R_{m,0}$, is
$$R_{m,0} = \frac{1}{n} \left( n - 1 + \lfloor \log_2 N_2(m,n) \rfloor \right), \tag{13}$$
where we truncated the code size to the largest power of two. Table 4 shows selected outcomes of computations of the rate efficiency $R_{m,0}/C_4(m)$ versus m and n.
C. ENCODING OF QUATERNARY SEQUENCES WITHOUT
BINARY STEP
In this subsection, we investigate two simple constructions of codes that transform binary source words directly (that is, without an intermediate binary coding step) into 4-ary maximum homopolymer constrained codewords. An example of a simple 4-ary block code was presented by Blawat et al. [2]. The code converts 8 source bits into a 4-ary word of 5 nt. The 5-nt words can be cascaded without violating the prescribed m = 3 maximum homopolymer run. The rate of Blawat’s construction is R = 8/5 = 1.6. As $C_4(3) = 1.9824$, see Table 1, the (rate) efficiency of the construction is $R/C_4(3) = 0.807$. Alternative, and more efficient, constructions are described below.
TABLE 5. Rate efficiency, Rm,1/C4(m), of the two-mode code construction versus strand length, n, and maximum homopolymer run, m.
In the first construction, denoted by two-mode construc-
tion, each source word can be represented by one of two
possible codewords, where the codeword sent is chosen to
satisfy the runlength constraint at the junction of two cas-
caded codewords. Decoding is accomplished by observing
the n-symbol codeword. In the second, slightly more efficient,
construction, denoted by four-mode construction, a source
word can be represented by four possible codewords. Decod-
ing is accomplished by observing the n-symbol codeword
plus the last symbol of the previous codeword.
1) TWO-MODE CONSTRUCTION
In this format, a source word can be represented by two
n-symbol 4-ary m-constrained codewords, where the alter-
native representations differ at the first position. In case we
append a new codeword to the previous codeword, we are
always able to choose (at least) one representation whose first
symbol differs from the last symbol of the previous codeword.
Then, clearly, the cascaded string of 4-ary symbols satisfies
the prescribed maximum homopolymer run constraint. The
rate of this two-mode construction, denoted by $R_{m,1}$, is
$$R_{m,1} = \frac{1}{n} \left( \lfloor \log_2 N_4(m,n) \rfloor - 1 \right), \tag{14}$$
where we truncated the code size to the largest power of two possible. Table 5 shows outcomes of computations of the rate efficiency $R_{m,1}/C_4(m)$ versus m and n. We observe that, for m = 2, the ‘quaternary’ efficiency $R_{2,1}/C_4(2)$ is slightly better than the ‘binary’ $R_{2,0}/C_4(2)$, see Table 4. For m > 2, both approaches have the same efficiency. The conversion of the binary source symbols into the 4-ary n-nt strands and vice versa can be accomplished using two look-up tables of complexity $4^n$.
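The rates (13) and (14) can be compared with a short sketch. The helper names are hypothetical, and the two parameter choices printed below are arbitrary examples rather than entries taken from Tables 4 and 5; dividing each rate by $C_4(m)$ gives the corresponding rate efficiency.

```python
def count_rll(q, m, n):
    """N_q(m, n) via recursion (1)."""
    counts = [0] * (n + 1)
    for j in range(1, n + 1):
        counts[j] = q ** j if j <= m else (q - 1) * sum(counts[j - k] for k in range(1, m + 1))
    return counts[n]

def floor_log2(value: int) -> int:
    """Exact floor(log2(value)) for a positive integer."""
    return value.bit_length() - 1

def rate_binary(m, n):
    """R_{m,0} of the binary two-mode code used with Construction 1, equation (13)."""
    return (n - 1 + floor_log2(count_rll(2, m, n))) / n

def rate_two_mode(m, n):
    """R_{m,1} of the quaternary two-mode construction, equation (14)."""
    return (floor_log2(count_rll(4, m, n)) - 1) / n

print(rate_binary(2, 10), rate_two_mode(2, 10))  # 1.6 1.8
print(rate_binary(3, 5), rate_two_mode(3, 5))    # 1.6 1.6
```

In these two cases the quaternary two-mode code is ahead for m = 2 and level with the binary approach for m = 3, which is consistent with the observation above.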
2) FOUR-MODE CONSTRUCTION
In the above two-mode construction, the encoded codeword depends on the last symbol of the previous codeword. Decoding, however, is based on the observation of the n symbols of the retrieved codeword. In the second construction, the codeword also depends on the last symbol of the previous codeword. Decoding, however, is accomplished by observing the n symbols of the retrieved codeword plus the last symbol of the previous codeword. To that end, we define four tables of codewords, denoted by $L(i,a)$, where $i$, $1 \le i \le K$, denotes the decimal representation of the source word to be encoded, $K$ denotes the size of the table, and $a$ denotes the last symbol of the previous codeword. The four tables are constructed in such a way that the codewords in each table $L(i,a)$ do not start with the symbol $a$. As a result, the encoder always generates a symbol transition between the tail and nose symbols of consecutive codewords. The maximum size of the four tables equals $K = \frac{3}{4} N_4(m,n)$ (note that $N_4(m,n)$ is a multiple of 4). Table 6 shows a simple example of the encoding tables of a four-mode code for n = 2 and m = 2. The size of this code equals K = 12. Let, for example, the source sequence be ‘0’, ‘1’, ‘3’, ‘6’. Then, using the table, the encoded sequence is ‘10’, ‘11’, ‘03’, ‘22’. We may simply verify that the maximum runlength is m = 2. The code size is K = 12, while the code size of the two-mode code with m = n = 2 described above equals 16/2 = 8. The table shows that the codeword ‘00’ is assigned to three source words, namely ‘0’, ‘4’, and ‘8’, so that ‘00’ cannot be decoded unambiguously by observing the codeword alone. Observation of the retrieved codeword plus the last symbol of the previous codeword resolves the ambiguity.
TABLE 6. Encoding tables of a four-mode code for n = 2 and m = 2. The parameter i denotes the (decimal) representation of the source word. The tables L(i,a), a = 0, 1, 2, 3, show the corresponding codeword, where a denotes the last symbol of the previous codeword.
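The four-mode principle for n = 2 and m = 2 can be sketched as follows. The entries of Table 6 are not reproduced in this text, so the table layout below is a hypothetical one, chosen only because it is consistent with the worked example and with the remark about the codeword ‘00’. Further assumptions of the sketch: source words are indexed from 0, as in the example, and the symbol preceding the first codeword is taken to be 0.

```python
def encode_word(i: int, a: int) -> str:
    """Hypothetical table L(i, a) for n = 2, m = 2, with 0 <= i < 12."""
    first, second = i // 4 + 1, i % 4  # default codewords start with 1, 2 or 3
    if first == a:                     # clash with the last symbol of the previous word:
        first = 0                      # substitute the otherwise unused first symbol 0
    return f"{first}{second}"

def decode_word(word: str, a: int) -> int:
    """Invert encode_word using the codeword plus the last symbol a of the previous one."""
    first, second = int(word[0]), int(word[1])
    block = a - 1 if first == 0 else first - 1
    return 4 * block + second

def encode(source, a=0):
    out = []
    for i in source:
        out.append(encode_word(i, a))
        a = int(out[-1][-1])           # last symbol of the codeword just sent
    return out

print(encode([0, 1, 3, 6]))  # ['10', '11', '03', '22'], as in the example above
```

In this layout ‘00’ is produced for (i, a) = (0, 1), (4, 2) and (8, 3), so the previous symbol is needed to resolve the source word, which mirrors the ambiguity described in the text.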
The rate of this four-mode construction, denoted by $R_{m,2}$, is
$$R_{m,2} = \frac{1}{n} \log_2 \left( \frac{3}{4} N_4(m,n) \right). \tag{15}$$
Table 7 shows the rate efficiency of the four-mode construction. The efficiency improvement with respect to the two-mode construction, see Table 5, is obtained at the cost of four look-up tables instead of two.
Example: Let (as in Blawat’s code [2]) n = 5 and m = 3. We simply find, using (1), $N_4(3,5) = 996$, so that the code may accommodate $K = \frac{3}{4} \times 996 = 747$ binary source words. Since $K > 512 = 2^9$ we may implement a code of rate 9/5, which is 12% higher than that of Blawat’s code of rate 8/5. As we have the freedom of deleting $747 - 512 = 235$ redundant codewords, we may, for example, bar the words with the highest unbalance.
TABLE 7. Rate efficiency, Rm,2/C4(m), of the four-mode construction versus strand length, n, and maximum homopolymer run, m.
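The numbers in this example can be checked by enumerating the 5-nt words directly. The sketch below uses hypothetical helper names; the selection rule (sort by unbalance, keep the first 512) is one possible way of barring the most unbalanced words, not a rule prescribed by the text, and it uses the labelling A = 2, T = 3 proposed later in Section III-A.

```python
from itertools import groupby, product

def max_run(word):
    return max(len(list(run)) for _, run in groupby(word))

def table_for(a, n=5, m=3):
    """All m-constrained n-symbol words over {0,1,2,3} that do not start with symbol a."""
    return [w for w in product(range(4), repeat=n) if w[0] != a and max_run(w) <= m]

def unbalance(word):
    """Distance of the AT-content (symbols 2 and 3) from n/2."""
    return abs(sum(s >= 2 for s in word) - len(word) / 2)

tables = {a: table_for(a) for a in range(4)}
K = len(tables[0])
bits = K.bit_length() - 1                             # largest b with 2^b <= K
kept = sorted(tables[0], key=unbalance)[: 2 ** bits]  # drop the most unbalanced words
print(K, bits, K - len(kept))                         # 747 9 235
```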
In the next section, we take a look at the combined AT and
GC contents balance and maximum polymer run constrained
codes.
III. COMBINED WEIGHT AND MAXIMUM RUN
CONSTRAINED CODES
Oligos with large unbalance between GC and AT content
exhibit high dropout rates and are prone to polymerase chain
reaction (PCR) errors, and should therefore be avoided.
Avoidance of such undesired sequences implies an extra
redundancy. In this section, we compute the redundancy of
binary and quaternary codes with combined RLL and AT/GC
constraints.
A. DEFINITION AT/GC CONTENT, BALANCE, AND WEIGHT
We use the nucleotide alphabet $Q = \{0, 1, 2, 3\}$, where we propose the following relation between the four decimal symbols and the nucleotides: G = 0, C = 1, A = 2, and T = 3. The AT/GC content constraint stipulates that around half of the nucleotides should be either an A or a T nucleotide. In order to study AT-balanced nucleotides, we start with a few definitions. We define the weight or AT-content, denoted by $w_4(x)$, of the n-nucleotide oligo $x = (x_1, \ldots, x_n)$, $x_i \in Q$, as the number of occurrences of A or T, or
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i), \tag{16}$$
where
$$\varphi(u) = \begin{cases} 0, & u < 2, \\ 1, & u > 1. \end{cases} \tag{17}$$
The weight of a binary word $x = (x_1, \ldots, x_n)$, $x_i \in \{0,1\}$, denoted by $w_2(x)$, is defined by
$$w_2(x) = \sum_{i=1}^{n} \varphi(2x_i) = \sum_{i=1}^{n} x_i. \tag{18}$$
If we write the 4-ary word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, as $x = y + 2z$, where $y_i$ and $z_i \in \{0,1\}$, then
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i) = \sum_{i=1}^{n} \varphi(2z_i) = w_2(z). \tag{19}$$
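Definitions (16) to (19) translate directly into code. The example strand below is arbitrary illustrative data, and the function names are hypothetical.

```python
def phi(u: int) -> int:
    """Indicator of an A or T nucleotide, equation (17)."""
    return 1 if u > 1 else 0

def w4(x):
    """AT-content of a 4-ary word, equation (16)."""
    return sum(phi(s) for s in x)

def w2(x):
    """Weight of a binary word, equation (18)."""
    return sum(x)

x = [2, 2, 3, 2, 1, 1, 0, 1]  # AATACCGC with G = 0, C = 1, A = 2, T = 3
y = [s % 2 for s in x]        # low-order bits
z = [s // 2 for s in x]       # high-order bits, so that x = y + 2z
print(w4(x), w2(z))           # 4 4, as stated by (19)
```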
Kerpez et al. [21], Braun and Immink [22], and Kurmaev [23]
analyzed properties and constructions of binary combined
weight and runlength constrained codes. Their results are
straightforwardly applied to the quaternary case at hand.
In the next subsections, we count binary and quaternary sequences that satisfy combined maximum runlength and weight constraints. We start by counting the number of binary sequences, $x$, of length n that satisfy a maximum runlength constraint m and have weight $w = w_2(x)$. Paluncic and Maharaj [24] enumerated this number for the balanced case $w = w_2(x) = n/2$.
B. COUNTING BINARY RLL SEQUENCES OF GIVEN
WEIGHT
Define the bi-variate generating function $H(x,y)$ in the dummy variables $x$ and $y$ by
$$H(x,y) = \sum_{i,j} h_{i,j} x^i y^j, \tag{20}$$
and let $[x^{n_1} y^{n_2}] H(x,y)$ denote the extraction of the coefficient of $x^{n_1} y^{n_2}$ in the formal power series $\sum_{i,j} h_{i,j} x^i y^j$, or
$$[x^{n_1} y^{n_2}] \sum_{i,j} h_{i,j} x^i y^j = h_{n_1,n_2}. \tag{21}$$
Define
$$T_1(x,y) = \sum_{i=1}^{m} x^i y^i. \tag{22}$$
Let the sequence start with a run of zeros, then the generating function for the number of binary sequences with a maximum runlength m is
$$T(x) + T(x)T_1(x,y) + T(x)^2 T_1(x,y) + T(x)^2 T_1(x,y)^2 + \cdots.$$
In case the sequence starts with a run of ones, we obtain for the generating function
$$T_1(x,y) + T(x)T_1(x,y) + T(x)T_1(x,y)^2 + T(x)^2 T_1(x,y)^2 + \cdots.$$
The generating function for the number of binary sequences with a maximum runlength m starting with a run of ones or zeros is the sum of the two above generating functions. Working out the sum yields
$$\frac{T_1(x,y) + T(x) + 2T_1(x,y)T(x)}{1 - T_1(x,y)T(x)},$$
so that the number of n-bit codewords, $x$, with maximum runlength m, denoted (with a slight abuse of notational convention by adding an extra parameter) by $N_2(m,w,n)$, that satisfy a given unbalance constraint $w = w_2(x)$ is given by
$$N_2(m,w,n) = [x^n y^w] \frac{T_1(x,y) + T(x) + 2T_1(x,y)T(x)}{1 - T_1(x,y)T(x)}. \tag{23}$$
With the above bi-variate generating function, we may
exactly compute the number of binary m-constrained words
of weight w.
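As a cross-check of (23), the numbers $N_2(m,w,n)$ can also be obtained without symbolic algebra. The sketch below swaps in a direct dynamic programme over the final run of the sequence; it is not the generating-function expansion itself, and the function name is hypothetical.

```python
from math import comb

def count_binary_rll_weight(m: int, n: int):
    """Return a list whose entry w is N_2(m, w, n), for w = 0..n."""
    # ways[s][j][w]: sequences of length j and weight w whose last run consists of symbol s
    ways = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(2)]
    for j in range(1, n + 1):
        for s in (0, 1):
            for k in range(1, min(m, j) + 1):  # length of the final run
                dw = k * s                     # a run of ones adds k to the weight
                if k == j:                     # the whole sequence is a single run
                    ways[s][j][dw] += 1
                else:
                    prev = ways[1 - s][j - k]  # the preceding run uses the other symbol
                    for w in range(n + 1 - dw):
                        ways[s][j][w + dw] += prev[w]
    return [ways[0][n][w] + ways[1][n][w] for w in range(n + 1)]

print(count_binary_rll_weight(2, 6))
```

Summing the returned list over w gives $N_2(m,n)$, and for m ≥ n the entries reduce to binomial coefficients, which provides two simple sanity checks.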
More insight is gained by an approximation of $N_2(m,w,n)$. For a given maximum runlength, m, and asymptotically large n, we are specifically interested in the distribution of $\lim_{n \to \infty} N_2(m,w,n)/N_2(m,n)$ versus the weight w. The weight w of a binary sequence of length n is the sum of the runlengths of ones. The runlengths are random variables, so that for asymptotically large n, according to the Central Limit Theorem [18], the weight distribution approaches a Gaussian distribution with mean $n/2$ and variance denoted by $\sigma_2^2(m,n)$. Then
$$N_2(m,w,n) \approx G\left(w; \frac{n}{2}, \sigma_2^2(m,n)\right) N_2(m,n), \quad n \gg 1, \tag{24}$$
where
$$G(u; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2} \tag{25}$$
denotes the Gaussian distribution. The variance, $\sigma_2^2(m,n)$, of the Gaussian distribution is computed below.
1) COMPUTATION OF THE VARIANCE, $\sigma_2^2(m,n)$
Let $x$ be an infinitely long binary m-constrained sequence, where the probabilities of occurrence of the runlengths of zeros and ones are chosen to maximize the information rate (entropy) of the sequence. The probability of occurrence of a runlength of length $l$, $l \le m$, in a maxentropic sequence equals $\lambda_2^{-l}(m)$, see [14, Chapter 4], where for q = 2, see (7), $\sum_{l=1}^{m} \lambda_2^{-l}(m) = 1$. The average runlength, denoted by $\bar{l}$, equals
$$\bar{l} = \sum_{i=1}^{m} i \lambda_2^{-i}(m). \tag{26}$$
The runlength variance of an m-constrained sequence, denoted by $\mathrm{Var}(l)$, is
$$\mathrm{Var}(l) = \sum_{i=1}^{m} (i - \bar{l})^2 \lambda_2^{-i}(m). \tag{27}$$
The weight variance, $\sigma_2^2(m,n)$, of the m-constrained sequence is
$$\sigma_2^2(m,n) = \gamma_2(m) \frac{n}{4}, \tag{28}$$
where
$$\gamma_2(m) = \frac{\mathrm{Var}(l)}{\bar{l}}.$$
Table 8 shows results of computations (note that the parameter $\gamma_4(m)$ is explained in Section III-C). In order to verify the accuracy of the Gaussian approximation, we have numerically compared it with the (accurate) outcomes of the generating function. Figure 1 shows a comparison between the accurate and approximate distributions, $N_2(m,w,n)/N_2(m,n)$, for n = 100 and m = 2, 3, 4. Except for the discrepancy in the tails of the distributions, the accuracy of the Gaussian approximation is quite sufficient for engineering applications. The Gaussian approximation is accurate within a few percent within the two-sigma limits of the distribution.
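The coefficient $\gamma_2(m)$ of (26) to (28) and the Gaussian estimate (24) can be sketched numerically as follows. The helper names and the bisection root finder are assumptions of this sketch, the values it prints are not a reproduction of Table 8, and the Gaussian estimate requires m ≥ 2 because the variance vanishes for m = 1.

```python
import math

def lambda2(m, tol=1e-13):
    """lambda_2(m): root of sum_{l=1..m} x^-l = 1 on [1, 2], equivalent to (7) for q = 2."""
    g = lambda x: sum(x ** -l for l in range(1, m + 1)) - 1.0
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def gamma2(m):
    """gamma_2(m) = Var(l) / mean(l) for maxentropic runlengths, equations (26)-(28)."""
    lam = lambda2(m)
    probs = [lam ** -l for l in range(1, m + 1)]
    mean = sum(l * p for l, p in enumerate(probs, start=1))
    var = sum((l - mean) ** 2 * p for l, p in enumerate(probs, start=1))
    return var / mean

def gaussian_fraction(m, w, n):
    """Estimate of N_2(m,w,n) / N_2(m,n) according to (24) and (25); needs m >= 2."""
    var = gamma2(m) * n / 4
    return math.exp(-0.5 * (w - n / 2) ** 2 / var) / math.sqrt(2 * math.pi * var)

for m in (2, 3, 4):
    print(m, round(gamma2(m), 4), round(gaussian_fraction(m, 50, 100), 4))
```

For very large m the runlengths approach a geometric distribution with mean 2 and variance 2, so $\gamma_2(m)$ tends to 1 and the weight variance tends to the unconstrained value n/4.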
TABLE 8. Coefficient γ2(m) and γ4(m) versus maximum homopolymer
run m.
FIGURE 1. Comparison of the weight distribution of
N2(m,w,n)/N2(m,n), using (a) the Gaussian distribution (24) and
(b) generating functions for n=100 and m=2,3,4.
C. COUNTING QUATERNARY RLL SEQUENCES OF GIVEN
WEIGHT
We count the number of n-tuples $x$ of 4-ary symbols that satisfy a maximum runlength constraint, m, and have weight $w = w_4(x)$, denoted (with a slight abuse of notational convention) by $N_4(m,w,n)$.
1) MAXIMUM RUNLENGTH CONSTRAINT
For the special case m=1, Limbachiya et al. [25] presented a
closed-form expression of N4(1,w,n). For other values of the
prescribed maximum runlength, m, we may readily compute
the number of 4-ary sequences, N4(m,w,n), versus weight,
w=w4(x), by applying generating functions.
The 4-ary symbols are generated by a constrained data source that can be modelled as a four-state Moore-type finite-state machine. The machine steps from state to state, where, when state $i \in Q$ is visited, a sequence of $k$, $1 \le k \le m$, symbols ‘i’ is emitted. After visiting state $i$, the data source may not return to state $i$ (thus forbidding it to emit again a sequence of the same symbol ‘i’), but it enters state $j \ne i$, $j \in Q$. When the machine enters state 3 or 4, the word weight, $w$, is incremented by $k$, where $k$, $1 \le k \le m$, denotes the run of symbols ‘3’ or ‘4’. When, on the other hand, states 1 or 2 are entered, the weight increment is nil. (In this description the states are numbered 1 to 4, in the order of the symbols 0 to 3 of $Q$, so that states 3 and 4 correspond to the nucleotides A and T.) The resulting 4 × 4 one-step skeleton or state-transition matrix, $D(x,y)$, of the finite-state machine is
$$D(x,y) = \begin{bmatrix} 0 & a_0 & a_0 & a_0 \\ a_0 & 0 & a_0 & a_0 \\ a_1 & a_1 & 0 & a_1 \\ a_1 & a_1 & a_1 & 0 \end{bmatrix}, \tag{29}$$
where $a_0 = T(x)$ and $a_1 = T_1(x,y)$. We are now in the position to write a general expression for $N_4(m,w,n)$. The number of 4-ary sequences of length n with maximum runlength constraint m and weight w equals
$$N_4(m,w,n) = [x^n y^w] \frac{1}{3} \sum_{i,j} \sum_{k=1}^{n} d_{i,j}^{[k]}(x,y), \tag{30}$$
where $d_{i,j}^{[k]}(x,y)$ denotes the entries of $D^k(x,y)$. The entries $d_{i,j}^{[k]}(x,y)$ of $D^k(x,y)$ enumerate the sequences (paths) of $k$ runlengths starting in state $i$ and ending in state $j$. Summation over all possible numbers of runlengths $k \le n$ and matrix entries, and division by three, yields the generating function of $N_4(m,w,n)$, which proves (30).
TABLE 9. Number of balanced words, $N_4(m, n/2, n)$, versus m and n.
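The counts $N_4(m,w,n)$ can be cross-checked without expanding the matrix powers in (30). The sketch below swaps in a direct dynamic programme over the last run, with symbols 2 and 3 (A and T) contributing to the weight; the function name is hypothetical.

```python
def count_quaternary_rll_weight(m: int, n: int):
    """Return a list whose entry w is N_4(m, w, n), for w = 0..n."""
    # ways[s][j][w]: sequences of length j and weight w whose last run consists of symbol s
    ways = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(4)]
    for j in range(1, n + 1):
        for s in range(4):
            for k in range(1, min(m, j) + 1):  # length of the final run
                dw = k if s >= 2 else 0        # runs of A (2) or T (3) add to the weight
                if k == j:                     # the whole sequence is a single run
                    ways[s][j][dw] += 1
                    continue
                for t in range(4):             # symbol of the preceding run
                    if t == s:
                        continue
                    prev = ways[t][j - k]
                    for w in range(n + 1 - dw):
                        ways[s][j][w + dw] += prev[w]
    return [sum(ways[s][n][w] for s in range(4)) for w in range(n + 1)]

print(count_quaternary_rll_weight(1, 16)[8])  # 15873240, the exact value quoted in Section III-C
```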
Balanced codewords with $w = n/2$, n even, play an important role. Table 9 shows outcomes of computations of $N_4(m, n/2, n)$ using (30), for m = 1, 2, and 3. The case m = 1 was earlier presented in [25]. Note that the integer sequence $N_4(1, n/2, n)$ versus n is also known as OEIS sequence A085363 (multiplied by 2), for which an alternative generating function is presented in [26].
Generating functions (30) allow us to accurately compute $N_4(m,w,n)$. For some applications, we may sacrifice accuracy for simplicity of the expression. In the next subsection, we derive a simple approximation to $N_4(m,w,n)$ valid for asymptotically large n and small relative weight w/n.
2) ESTIMATE OF THE WEIGHT DISTRIBUTION
The weight $w_4(x)$ is the number of nucleotides A and T in the sequence $x$, see (19). Then, as in the binary case above, for asymptotically large n, according to the Central Limit Theorem, the weight distribution is approximately Gaussian, that is, we may conveniently approximate $N_4(m,w,n)$ by
$$N_4(m,w,n) \approx G\left(w; \frac{n}{2}, \sigma_4^2(m,n)\right) N_4(m,n), \quad n \gg 1, \tag{31}$$
where $\sigma_4^2(m,n)$ denotes the variance of the Gaussian weight distribution. The variance $\sigma_4^2(m,n)$ can be computed as follows.
3) COMPUTATION OF THE VARIANCE $\sigma_4^2(m,n)$
Let $u_i$, $i = 1, 2, \ldots$, $u_i \in Q$, be an infinitely long 4-ary sequence generated by a maxentropic source that satisfies a prescribed maximum runlength m. Although the 4-ary sequence $u_i$, $i = 1, 2, \ldots$, satisfies a limited runlength constraint, m, the runs of the binary weight sequence $v_i = \varphi(u_i)$, $i = 1, 2, \ldots$, see definition (17), are without any limit. The variance, $\sigma_4^2(m,n)$, of the Gaussian weight distribution is governed by the runlength distribution, $P(k)$, of the binary sequence $v_i$, where $P(k)$, $k > 0$, denotes the probability of occurrence of a runlength $k$. Clearly, $\sum_{k>0} P(k) = 1$. The probability $P(k)$ is proportional to the number of binary m-constrained sequences of length $k$, $N_2(m,k)$, times the probability of such a sequence, $\lambda_4^{-k}$, or
$$P(k) = c N_2(m,k) \lambda_4^{-k}, \quad k \ge 1, \tag{32}$$
where the normalization constant $c$ is chosen such that $\sum_{k=1}^{\infty} P(k) = 1$. The term $N_2(m,k)$ is the number of AT combinations of length $k$, which may consist of a single A or T run or a plurality of alternating A and T runs. Then we have
$$\sigma_4^2(m,n) = \gamma_4(m) \frac{n}{4}, \tag{33}$$
where, see [14, Chapter 4],
$$\gamma_4(m) = \frac{1}{\bar{l}} \sum_{k=1}^{\infty} (k - \bar{l})^2 P(k) \tag{34}$$
and
$$\bar{l} = \sum_{k=1}^{\infty} k P(k). \tag{35}$$
Table 8 shows results of computations of $\gamma_4(m)$ versus m. We infer from (31) and Table 8 that, for n fixed, the weight distribution becomes wider with increasing maximum runlength m, see also Figure 1. Note that the above outcome is not consistent with the results by Erlich and Zielinski [3], as they assume a Gaussian balance distribution whose variance equals n/4, independent of m.
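An estimate of the number of balanced codewords, $N_4(m, n/2, n)$, is
$$N_4\left(m, \frac{n}{2}, n\right) \approx \sqrt{\frac{2}{\pi \gamma_4(m) n}} N_4(m,n), \quad n \text{ even}. \tag{36}$$
For the case m = 1 we have (see [26], sequence A085363, for a similar result)
$$N_4\left(1, \frac{n}{2}, n\right) \approx \frac{8}{\sqrt{\pi n}} 3^{n-1}, \quad n \text{ even}. \tag{37}$$
Using the above approximation, we obtain, for example, that $N_4(1,8,16) \approx 16191008$, which is 2% higher than its exact value, 15873240, listed in Table 9.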
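The coefficient $\gamma_4(m)$ of (32) to (35) and the estimate (36) can be sketched as follows. The helper names are hypothetical, and truncating the infinite sums at `k_max` terms is an assumption of this sketch; the neglected terms decay geometrically with ratio $\lambda_2(m)/\lambda_4(m)$.

```python
import math

def count_rll(q, m, n):
    """N_q(m, n) via recursion (1)."""
    counts = [0] * (n + 1)
    for j in range(1, n + 1):
        counts[j] = q ** j if j <= m else (q - 1) * sum(counts[j - k] for k in range(1, m + 1))
    return counts[n]

def largest_root(q, m, tol=1e-13):
    """lambda_q(m) from (7), via bisection on (q-1) * sum_{k=1..m} x^-k = 1."""
    g = lambda x: (q - 1) * sum(x ** -k for k in range(1, m + 1)) - 1.0
    lo, hi = 1.0, float(q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def gamma4(m, k_max=200):
    """gamma_4(m) per (32)-(35), with the sums truncated after k_max terms."""
    lam4 = largest_root(4, m)
    n2 = [0] * (k_max + 1)  # n2[k] = N_2(m, k)
    for k in range(1, k_max + 1):
        n2[k] = 2 ** k if k <= m else sum(n2[k - j] for j in range(1, m + 1))
    p = [n2[k] * lam4 ** -k for k in range(1, k_max + 1)]  # unnormalised P(k)
    c = 1.0 / sum(p)
    mean = c * sum(k * pk for k, pk in enumerate(p, start=1))
    var = c * sum((k - mean) ** 2 * pk for k, pk in enumerate(p, start=1))
    return var / mean

def balanced_estimate(m, n):
    """Estimate (36) of the number of balanced m-constrained 4-ary words, n even."""
    return math.sqrt(2 / (math.pi * gamma4(m) * n)) * count_rll(4, m, n)

print(gamma4(1), round(balanced_estimate(1, 16)))  # approximately 0.5 and 16191008
```

With $\gamma_4(1) = 1/2$ and $N_4(1,n) = 4 \cdot 3^{n-1}$, estimate (36) reduces to (37).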
D. REDUNDANCY OF BINARY AND QUATERNARY CODES
WITH COMBINED RLL AND AT/GC BALANCE
CONSTRAINTS
For DNA-based storage, we do not require that the strands of the codebook, $S$, are strictly balanced, as a small unbalance, that is $\alpha_S \ll 1$, between the GC and AT content is permitted without affecting the error performance. Such a constraint is called a weak balance constraint. The relative unbalance of a word, $\alpha(x)$, is defined by $\alpha(x) = \left| \frac{w_4(x)}{n} - \frac{1}{2} \right|$. An n-nucleotide oligo is said to be balanced if $\alpha(x) = 0$. Code constructions for combined RLL and weak balanced codes have been published in [3], and for m = 3 in [27], [28].
FIGURE 2. Redundancy (bits), $r_4(a,n)$, versus word length, n, with the relative unbalance, a, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (39).
We first study the balance of sequences without an m-constraint. The number of 4-ary words of length n with balance $w = w_4(x)$, denoted by $N_4(w,n)$, equals
$$N_4(w,n) = \binom{n}{w} 2^n. \tag{38}$$
The number of oligos, denoted by $N_{4,a}(n)$, of length n, whose relative unbalance satisfies $\alpha(x) < a$, is given by
$$N_{4,a}(n) = \sum_{\left|\frac{w}{n} - \frac{1}{2}\right| < a} N_4(w,n) = 2^n \sum_{\left|\frac{w}{n} - \frac{1}{2}\right| < a} \binom{n}{w}. \tag{39}$$
The redundancy of 4-ary nearly balanced strands, denoted by $r_4(a,n)$, equals
$$r_4(a,n) = \log_2 \frac{4^n}{N_{4,a}(n)}. \tag{40}$$
Figure 2 shows examples of computations of the redundancy, $r_4(a,n)$, versus n with the relative unbalance, a, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (39). The distribution for asymptotically large n of $N_4(w,n)$ versus w is approximately Gaussian shaped, that is
$$N_4(w,n) \approx G\left(w; \frac{n}{2}, \frac{n}{4}\right) 4^n, \quad n \gg 1, \tag{41}$$
so that the redundancy equals
$$r_4(a,n) \approx -\log_2 \left[ 1 - 2Q(2a\sqrt{n}) \right], \quad n \gg 1, \tag{42}$$
where the Q-function is defined by
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}} du. \tag{43}$$
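The exact redundancy (40) and its Gaussian approximation (42) can be compared with the sketch below. The function names are hypothetical. The exact version uses the strict inequality of the summation in (39) and the identity $\log_2(4^n / N_{4,a}(n)) = n - \log_2 \sum \binom{n}{w}$; it assumes that at least one weight w satisfies the constraint.

```python
import math

def redundancy_exact(a: float, n: int) -> float:
    """r_4(a, n) per (39) and (40), for weights w with |w/n - 1/2| < a."""
    total = sum(math.comb(n, w) for w in range(n + 1) if abs(w / n - 0.5) < a)
    return n - math.log2(total)

def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) of (43), written with the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def redundancy_gauss(a: float, n: int) -> float:
    """Approximation (42)."""
    return -math.log2(1 - 2 * q_function(2 * a * math.sqrt(n)))

for n in (50, 100, 200):
    print(n, round(redundancy_exact(0.05, n), 3), round(redundancy_gauss(0.05, n), 3))
```

Because the set of admissible weights changes in steps as n grows, the exact values vary irregularly with n, which is the raggedness mentioned in the caption of Figure 2.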
We now study q-ary sequences with both an m-constraint and a given weight w. As in Construction 1, let the quaternary word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, be written as $x = y + 2z$, where the constituting elements $y_i$ and $z_i \in \{0,1\}$. If the binary sequence $z$ is m-constrained and has weight $w = w_2(z)$, then $x$ is m-constrained and it has weight $w_4(x) = w$. Using (11), (24), and (31), we obtain for $n \gg 1$ that the redundancy of q-ary sequences with combined RLL and balance constraints, denoted by $r_{q,a}(m,n)$, equals
$$r_{q,a}(m,n) \approx r_q(m,n) - \log_2 \left[ 1 - 2Q\left( 2a\sqrt{\frac{n}{\gamma_q(m)}} \right) \right]. \tag{44}$$
A numerical analysis of the above expression shows that the redundancy difference due to the balance (right hand) term is around 0.5-1 bit for m = 2. For larger values of the homopolymer run m the extra redundancy is negligible for n > 10. The redundancy difference, $r_2(m,n) - r_4(m,n)$, due to the imposed runlength constraint is much larger for n > 10 than the redundancy due to the balance constraint.
IV. CONCLUSION
We have compared two coding approaches for constraint-based
coding of DNA strings. In the first approach, an intermediate,
‘binary’, coding step is used, while in the second approach we
‘directly’ translate source data into constrained quaternary
sequences. The binary approach is attractive as it yields a
lower complexity of encoding and decoding look-up tables.
The redundancy of the binary approach is higher than that of
the quaternary approach for generating combined weight and
run-length constrained sequences. The redundancy difference
is small for larger values of the maximum homopolymer run.
We have found exact and approximate expressions for the
number of binary and quaternary sequences with combined
weight and run-length constraints.
REFERENCES
[1] G. M. Church, Y. Gao, and S. Kosuri, ‘‘Next-generation digital information
storage in DNA,’’ Science, vol. 337, no. 6102, p. 1628, Sep. 2012.
[2] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso,
B. W. Pruitt, and G. M. Church, ‘‘Forward error correction for DNA
data storage,’’ in Proc. Int. Conf. Comput. Sci. (ICCS), vol. 80, 2016,
pp. 1011–1022.
[3] Y. Erlich and D. Zielinski, ‘‘DNA fountain enables a robust and efficient
storage architecture,’’ Science, vol. 355, no. 6328, pp. 950–954, Mar. 2017.
[4] J. Koch, S. Gantenbein, K. Masania, W. J. Stark, Y. Erlich, and R. N. Grass,
‘‘A DNA-of-things storage architecture to create materials with embedded
memory,’’ Nature Biotechnol., vol. 38, no. 1, pp. 39–43, Jan. 2020.
[5] Y. Wang, M. Noor-A-Rahim, J. Zhang, E. Gunawan, Y. L. Guan, and
C. L. Poh, ‘‘High capacity DNA data storage with variable-length oligonucleotides
using repeat accumulate code and hybrid mapping,’’ J. Biol. Eng.,
vol. 13, no. 1, p. 89, Dec. 2019.
[6] L. Ceze, J. Nivala, and K. Strauss, ‘‘Molecular digital data storage using
DNA,’’ Nature Rev. Genet., vol. 20, no. 8, pp. 456–466, Aug. 2019.
[7] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig,
‘‘A DNA-based archival storage system,’’ ACM SIGOPS Oper. Syst. Rev.,
vol. 50, pp. 637–649, 2016.
[8] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty,
C. Nusbaum, and D. B. Jaffe, ‘‘Characterizing and measuring bias in
sequence data,’’ Genome Biol., vol. 14, no. 5, p. R51, 2013.
[9] K. W. Cattermole, ‘‘Principles of digital line coding,’’ Int. J. Electron.,
vol. 55, pp. 3–33, Jul. 1983.
[10] K. A. Schouhamer Immink and K. Cai, ‘‘Design of capacity-approaching
constrained codes for DNA-based storage systems,’’ IEEE Commun. Lett.,
vol. 22, no. 2, pp. 224–227, Feb. 2018.
[11] Y.-S. Kim and S.-H. Kim, ‘‘New construction of DNA codes with constant-
GC contents from binary sequences with ideal autocorrelation,’’ in Proc.
IEEE Int. Symp. Inf. Theory Process., Jul. 2011, pp. 1569–1573.
[12] Y. M. Chee and S. Ling, ‘‘Improved lower bounds for constant
GC-content DNA codes,’’ IEEE Trans. Inf. Theory, vol. 54, no. 1,
pp. 391–394, Jan. 2008.
[13] K. A. Schouhamer Immink and K. Cai, ‘‘Efficient balanced and maximum
homopolymer-run restricted block codes for DNA-based data storage,’’
IEEE Commun. Lett., vol. 23, no. 10, pp. 1676–1679, Oct. 2019.
[14] K. A. S. Immink, Codes for Mass Data Storage Systems, 2nd ed.
Eindhoven, The Netherlands: Shannon Foundation, 2004.
[15] S. W. McLaughlin, J. Luo, and Q. Xie, ‘‘On the capacity of M-ary
runlength-limited codes,’’ IEEE Trans. Inf. Theory, vol. 41, no. 5,
pp. 1508–1511, Sep. 1995.
[16] M. W. Marcellin and H. J. Weber, ‘‘Two-dimensional modulation codes,’’
IEEE J. Sel. Areas Commun., vol. 10, no. 1, pp. 254–266, Jan. 1992.
[17] C. E. Shannon, ‘‘A mathematical theory of communication,’’ Bell Syst.
Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
[18] P. Flajolet and R. Sedgewick, Analytic Combinatorics. Cambridge, U.K.:
Cambridge Univ. Press, 2009.
[19] S. M. H. T. Yazdi, H. M. Kiah, and O. Milenkovic, ‘‘Weakly mutually
uncorrelated codes,’’ in Proc. IEEE Int. Symp. Inf. Theory (ISIT),
Barcelona, Spain, Jul. 2016, pp. 2649–2653.
[20] V. Taranalli, H. Uchikawa, and P. H. Siegel, ‘‘Error analysis and inter-cell
interference mitigation in multi-level cell flash memories,’’ in Proc. IEEE
Int. Conf. Commun. (ICC), London, U.K., Jun. 2015, pp. 271–276.
[21] K. J. Kerpez, A. Gallopoulos, and C. Heegard, ‘‘Maximum entropy charge-constrained
run-length codes,’’ IEEE J. Sel. Areas Commun., vol. 10, no. 1,
pp. 242–253, Jan. 1992.
[22] V. Braun and K. A. Schouhamer Immink, ‘‘An enumerative coding technique
for DC-free runlength-limited sequences,’’ IEEE Trans. Commun.,
vol. 48, no. 12, pp. 2024–2031, Dec. 2000.
[23] O. F. Kurmaev, ‘‘Constant-weight and constant-charge binary run-length
limited codes,’’ IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4497–4515,
Jul. 2011.
[24] F. Paluncic and B. T. J. Maharaj, ‘‘Using bivariate generating functions
to count the number of balanced runlength-limited words,’’ in Proc.
GLOBECOM - IEEE Global Commun. Conf., Singapore, Dec. 2017,
pp. 4–8.
[25] D. Limbachiya, M. K. Gupta, and V. Aggarwal, ‘‘Family of constrained
codes for archival DNA data storage,’’ IEEE Commun. Lett., vol. 22, no. 10,
pp. 1972–1975, Oct. 2018.
[26] N. J. A. Sloane. (2019). The On-Line Encyclopedia of Integer Sequences.
[Online]. Available: http://oeis.org
[27] Y. Wang, M. Noor-A-Rahim, E. Gunawan, Y. L. Guan, and C. L. Poh,
‘‘Construction of bio-constrained code for DNA data storage,’’ IEEE Commun.
Lett., vol. 23, no. 6, pp. 963–966, Jun. 2019.
[28] W. Song, K. Cai, M. Zhang, and C. Yuen, ‘‘Codes with run-length and
GC-content constraints for DNA-based data storage,’’ IEEE Commun.
Lett., vol. 22, no. 10, pp. 2004–2007, Oct. 2018.
KEES A. SCHOUHAMER IMMINK (Life Fellow,
IEEE) is currently a Founder and the President
of Turing Machines Inc., an innovative start-up
focused on coding and signal processing for
DNA-based storage. He received the 2017 IEEE
Medal of Honor for his pioneering contributions
to video, audio, and data recording technology,
the Knighthood, in 2000, the Personal
Emmy Award, in 2004, the 1999 Audio Engineer-
ing Society’s (AES) Gold Medal, the 2004 SMPTE
Progress Medal, the 2014 Eduard Rhein Prize for Technology, and the
2015 IET Faraday Medal. He received an Honorary Doctorate from the
University of Johannesburg, in 2014. He was inducted into the Consumer
Electronics Hall of Fame, elected into the Royal Netherlands Academy of
Arts and Sciences, and the (US) National Academy of Engineering. He has
served the profession as a Governor for the IEEE Information Theory and
Consumer Electronics Societies and the President for the Audio Engineering
Society.
KUI CAI (Senior Member, IEEE) received the
B.E. degree in information and control engineering
from Shanghai Jiao Tong University, Shanghai,
China, and the joint Ph.D. degree in electrical
engineering from the Technical University of
Eindhoven, The Netherlands, and the National
University of Singapore. She is currently an Asso-
ciate Professor with the Singapore University
of Technology and Design (SUTD). Her main
research interests are in the areas of coding the-
ory, information theory, signal processing for various data storage systems,
and digital communications. She received the 2008 IEEE Communications
Society Best Paper Award in Coding and Signal Processing for Data Storage.
She has served as the Vice-Chair (Academia) for the IEEE Communications
Society and the Data Storage Technical Committee (DSTC), from 2015
to 2016.