Received February 27, 2020, accepted March 7, 2020, date of publication March 11, 2020, date of current version March 19, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2980036
Properties and Constructions of Constrained
Codes for DNA-Based Data Storage
KEES A. SCHOUHAMER IMMINK 1, (Life Fellow, IEEE),
AND KUI CAI 2, (Senior Member, IEEE)
1Turing Machines Inc., 3016 DK Rotterdam, The Netherlands
2Singapore University of Technology and Design (SUTD), Singapore 487372
Corresponding author: Kees A. Schouhamer Immink (immink@turing-machines.com)
This work was supported by the Singapore Ministry of Education Academic Research Fund Tier 2 under Grant MOE2016-T2-2-054.
ABSTRACT We describe properties and constructions of constraint-based codes for DNA-based data
storage which account for the maximum repetition length and AT/GC balance. Generating functions and
approximations are presented for computing the number of sequences with maximum repetition length and
AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced
strings into DNA strands, and compute the efficiency of such routines. Expressions for the redundancy of
codes that account for both the maximum repetition length and AT/GC balance are derived.
INDEX TERMS Constrained coding, maximum runlength, balanced words, storage systems, DNA-based
storage.
I. INTRODUCTION
The first large-scale archival DNA-based storage archi-
tecture was implemented by Church et al. [1] in 2012.
Blawat et al. [2] described successful experiments for storing
and retrieving data blocks of 22 Mbyte of digital data in
synthetic DNA. Erlich and Zielinski [3] further explored the
limits of storage capacity of DNA-based storage architectures. Recent examples of experimental work on DNA-based
storage can be found in [4]–[6].
Naturally occurring DNA consists of four types of
nucleotides: adenine (A), cytosine (C), guanine (G), and
thymine (T). A DNA strand (or oligonucleotide, or oligo for
short) is a linear sequence of these four nucleotides, composed
by a DNA synthesizer. Binary source, or user, data
are translated into the four types of nucleotides, for example,
by mapping two binary source symbols into a single nucleotide
(nt for short).
Strings of nucleotides should satisfy a few elementary
conditions, called constraints, in order to be less error
prone. Repetitions of the same nucleotide, a homopoly-
mer run, significantly increase the chance of sequencing
errors [7], [8], so that such long runs should be avoided.
For example, in [8], experimental studies show that once the
homopolymer run is larger than four nt, the sequencing error
rate starts increasing significantly. In addition, [8] also reports
that oligos with large unbalance between GC and AT content
exhibit high dropout rates and are prone to polymerase chain
reaction (PCR) errors, and should therefore be avoided.
The associate editor coordinating the review of this manuscript and
approving it for publication was Nadeem Iqbal.
Blawat’s format [2] incorporates a constrained code that
uses a look-up table for translating binary source data
into strands of nucleotides with a homopolymer run of
length at most three. Blawat’s format did not incorpo-
rate an AT/GC balance constraint. Strands that do not sat-
isfy both the maximum homopolymer run requirement and
the weak balance constraint are barred in Erlich’s coding
format [3].
In this paper, we describe properties and constructions
of quaternary constraint-based codes for DNA-based stor-
age which account for a maximum homopolymer run and
maximum unbalance between AT and GC contents. Binary
‘balanced’ and runlength limited sequences have found
widespread use in data communication and storage prac-
tice [9]. We show that constrained binary sequences can easily
be translated into constrained quaternary sequences, which
opens the door to a wealth of efficient binary code con-
structions for application in DNA-based storage [10]–[13].
A further advantage of the binary-to-binary translation
instead of a ‘direct’ binary-to-quaternary translation is the
lower complexity of encoding and decoding look-up tables.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 49523
K. A. S. Immink, K. Cai: Properties and Constructions of Constrained Codes for DNA-Based Data Storage
The disadvantage is, as we show, the loss in information
capacity of the binary versus the quaternary approach.
We start in Section II with a description of the limiting
properties and code constructions that impose a maximum
homopolymer run. We specifically compute and compare
the information capacity of binary versus ‘direct’ quaternary
coding techniques. In Section III, we enumerate the number
of binary and quaternary sequences with combined AT and
GC contents and run-length constraints. Section IV concludes
the paper.
II. MAXIMUM RUNLENGTH CONSTRAINT
Long repetitions of the same nucleotide (nt), called a
homopolymer run or runlength, may significantly increase
the chance of sequencing errors [7], [8], and should be
avoided. Avoiding long runs of the same nucleotide will result
in loss of information capacity, and codes are required for
translating arbitrary source data into constrained quaternary
strings. Binary runlength limited (RLL) codes have found
widespread application in digital communication and storage
devices since the 1950s [9], [14]. McLaughlin et al. [15] studied
multi-level runlength limited codes for optical recording.
An n-nucleotide oligo of 4-ary symbols can be seen
as two parallel binary strings of length n, where each 4-ary
symbol is represented by two binary symbols. Such a system
of multiple parallel data streams with joint constraints is
reminiscent of ‘two-dimensional’ track systems, which have
been studied by Marcellin and Weber [16].
We start in the next subsection with the counting of
q-ary sequences that satisfy a maximum runlength, followed
by subsections where we describe limiting properties and
code constructions that avoid m+1 repetitions of the same
nucleotide.
A. COUNTING q-ARY SEQUENCES, CAPACITY
Let the number of q-ary n-length sequences having a max-
imum run, m, of the same symbol be denoted by Nq(m,n).
The number Nq(m,n) is found by using the recursive
relation [17, Part 1]:
$$N_q(m,n) = \begin{cases} q^n, & n \le m, \\ (q-1) \sum_{k=1}^{m} N_q(m,n-k), & n > m. \end{cases} \tag{1}$$
For $n \le m$ the above is trivial, as all sequences satisfy
the maximum runlength constraint. For $n > m$ we follow
Shannon's approach [17] for the discrete noiseless channel.
A run of $k$ symbols $a$ can be seen as a 'phrase' of
length $k$. After a phrase of symbols $a$ has been emitted, a phrase of
symbols $b \neq a$ of length $k$, $1 \le k \le m$, can be emitted without violating the
imposed maximum runlength constraint. The number of allowed
sequences of length $n$ that end with a phrase of length $k$ therefore equals
$(q-1) N_q(m,n-k)$, since the final phrase may use any of the $q-1$ symbols
that differ from the last symbol of the preceding sequence of length $n-k$.
Adding these numbers for $k = 1, 2, \ldots, m$ yields (1). Using the above
expression, we may easily assess the feasibility of a q-ary
m-constrained code for relatively small values of $n$ where a
coding look-up table is practicable, see Subsection II-C for
more details.
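The recursion (1) can be evaluated directly. The following is a minimal Python sketch; the function name `count_rll` is an illustrative choice and does not come from the paper.

```python
def count_rll(q, m, n):
    """Number of q-ary words of length n >= 1 whose longest run is at most m, via (1)."""
    counts = [0] * (n + 1)  # counts[j] holds N_q(m, j); index 0 is unused
    for j in range(1, n + 1):
        if j <= m:
            counts[j] = q ** j  # every word of length j <= m is allowed
        else:
            counts[j] = (q - 1) * sum(counts[j - k] for k in range(1, m + 1))
    return counts[n]


print(count_rll(4, 3, 5))   # N_4(3, 5)
print(count_rll(4, 2, 10))  # N_4(2, 10)
```

The two printed values can be compared with the numbers N4(3,5) = 996 and N4(2,10) = 676836 quoted in the text.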
1) GENERATING FUNCTIONS
Generating functions are a very useful tool for enumerating
constrained sequences [18], and they offer tools for approximating
the number of constrained sequences for asymptotically
large values of the sequence length $n$. The series of
numbers $\{N_q(m,n)\}$, $n = 1, 2, \ldots$, in (1), can be compactly
written as the coefficients of a formal power series $H_{q,m}(x) = \sum_i N_q(m,i) x^i$,
where $x$ is a dummy variable. There is a simple
relationship between the generating function, $H_{q,m}(x)$, and
the linear homogenous recurrence relation (1) with constant
coefficients that defines the same series [18]. We first define
a generating function
$$G(x) = \sum_i g_i x^i. \tag{2}$$
Let the operation $[x^n]G(x)$ denote the extraction of the coefficient
of $x^n$ in the formal power series $G(x)$, that is, define
$$[x^n] \sum_i g_i x^i = g_n. \tag{3}$$
Let
$$T(x) = \sum_{i=1}^{m} x^i. \tag{4}$$
The generating function for the number of q-ary sequences
with a maximum runlength $m$ is
$$q T(x) + q(q-1) T(x)^2 + q(q-1)^2 T(x)^3 + \cdots.$$
We may rewrite the above as
$$\frac{q T(x)}{1 - (q-1) T(x)},$$
so that the number of n-symbol m-constrained q-ary words is
$$N_q(m,n) = [x^n] \frac{q T(x)}{1 - (q-1) T(x)}. \tag{5}$$
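The coefficient extraction in (5) can be carried out by formal power-series division. The sketch below is one possible way to do this in Python; `gf_coefficients` is a hypothetical helper name, and the division is done by comparing coefficients in H(x)(1 - (q-1)T(x)) = qT(x).

```python
def gf_coefficients(q, m, n_max):
    """Coefficients h[0..n_max] of q*T(x) / (1 - (q-1)*T(x)) with T(x) = x + ... + x^m."""
    t = [1 if 1 <= i <= m else 0 for i in range(n_max + 1)]  # coefficients of T(x)
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # Coefficient of x^n in H(x) = q*T(x) + (q-1)*T(x)*H(x).
        h[n] = q * t[n] + (q - 1) * sum(t[k] * h[n - k] for k in range(1, n + 1))
    return h


print(gf_coefficients(4, 3, 5))  # h[n] should agree with N_4(3, n) from (1)
```

For n up to m the coefficients come out as q^n, and for larger n they follow the recursion (1), which is a quick consistency check of (5).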
2) ASYMPTOTICAL BEHAVIOR
For asymptotically large codeword length $n$, the maximum
number of (binary) user bits that can be stored per q-ary
symbol, called (information) capacity, denoted by $C_q(m)$,
is given by [17]
$$C_q(m) = \lim_{n \to \infty} \frac{1}{n} \log_2 N_q(m,n) = \log_2 \lambda_q(m), \tag{6}$$
where $\lambda_q(m)$ is the largest real root of the characteristic
equation [15], [17]
$$x^{m+1} - q x^m + q - 1 = 0. \tag{7}$$
Table 1 shows the information capacities $C_2(m)$ and $C_4(m)$
versus maximum allowed (homopolymer) run $m$. For asymptotically
large $n$ we may approximate $N_q(m,n)$ by [18]
$$N_q(m,n) \approx A_q(m) \lambda_q^n(m). \tag{8}$$
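The capacity (6) only requires the largest real root of (7). A minimal sketch, assuming a plain bisection search is acceptable (the paper does not state how the root was computed, and the names `largest_root` and `capacity` are illustrative):

```python
import math


def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0, see (7), by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    # f attains its minimum at q*m/(m+1) and f(q) = q - 1 > 0,
    # so the largest root lies in the interval [q*m/(m+1), q].
    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def capacity(q, m):
    """Information capacity C_q(m) in bits per q-ary symbol, see (6)."""
    return math.log2(largest_root(q, m))


print(round(capacity(4, 3), 4))  # compare with C_4(3) quoted in Subsection II-C
```

For m = 1 and q = 4 the characteristic equation reduces to (x - 1)(x - 3) = 0, so the sketch should return a capacity of log2(3).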
TABLE 1. Capacity C2(m) and C4(m) versus m.
TABLE 2. Coefficient A2(m) and A4(m) versus m.
The coefficient $A_q(m)$ is found, see [14, page 157-158],
by rewriting $H_{q,m}(x)$ as a quotient of two polynomials,
or $H_{q,m}(x) = r(x)/p(x)$. Then
$$A_q(m) = -\lambda_q(m) \frac{r(1/\lambda_q(m))}{p'(1/\lambda_q(m))}. \tag{9}$$
Table 2 shows the coefficients $A_2(m)$ and $A_4(m)$ versus $m$.
For $m = 1$, we simply find $N_4(1,n) = 4 \cdot 3^{n-1}$. We found that
the approximation (8) is remarkably accurate. For a typical
example, $N_4(2,10) = 676836$, while the approximation
using (8) yields $N_4(2,10) \approx 676835.9769$. The redundancy
of a 4-ary string of length $n$ with a maximum runlength $m$,
denoted by $r_4(m,n)$, is, using (8),
$$r_4(m,n) = 2n - \log_2 N_4(m,n) \approx n(2 - C_4(m)) - \log_2 A_4(m). \tag{10}$$
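As a rough illustration of (8) and (9), the sketch below evaluates the coefficient with r(x) = qT(x) and p(x) = 1 - (q-1)T(x), which is the quotient form of (5). The helper names and the bisection root finder are assumptions made for this example only.

```python
def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0, see (7), by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def coefficient_a(q, m):
    """A_q(m) of (9) with r(x) = q*T(x) and p(x) = 1 - (q-1)*T(x)."""
    lam = largest_root(q, m)
    x = 1 / lam
    r = q * sum(x ** i for i in range(1, m + 1))
    p_prime = -(q - 1) * sum(i * x ** (i - 1) for i in range(1, m + 1))
    return -lam * r / p_prime


def approx_count(q, m, n):
    """Approximation (8) of N_q(m, n)."""
    return coefficient_a(q, m) * largest_root(q, m) ** n


print(coefficient_a(4, 1))      # 4/3, so that (8) gives 4 * 3^(n-1)
print(approx_count(4, 2, 10))   # compare with N_4(2, 10) = 676836
```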
B. BINARY-BASED RLL CODE CONSTRUCTION,
CONSTRUCTION I
Yazdi et al. [19] and Taranalli et al. [20] showed that we
may exploit binary maximum runlength limited (RLL) codes
for constructing quaternary RLL codes. Their construction,
denoted by Construction 1, exemplifies such a technique for
m>1. The construction is simple, but we show below that
this simplicity has its price in terms of extra redundancy.
Construction 1: Let $u = (u_1, \ldots, u_n)$ be an n-bit RLL
string. We merge the RLL n-bit string, $u$, with an n-bit source
string $y = (y_1, \ldots, y_n)$, by using the addition $v_i = u_i + 2 y_i$,
$1 \le i \le n$, where $v = (v_1, \ldots, v_n)$, $v_i \in Q$, is the 4-ary output
string. It is easily verified that the 4-ary output string, $v$, has
maximum allowed run $m$, the same as the binary string $u$.
The number of distinct 4-ary sequences, $v$, of
Construction 1 equals $2^n N_2(m,n)$, so that the redundancy,
denoted by $r_2(m,n)$, is
$$r_2(m,n) \approx n(1 - C_2(m)) - \log_2 A_2(m). \tag{11}$$
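The merging step of Construction 1 is small enough to try out directly. The following sketch uses made-up example strings; `merge`, `split`, and `max_run` are illustrative names, not part of the paper.

```python
from itertools import groupby


def max_run(word):
    """Length of the longest run of identical symbols in word."""
    return max(len(list(group)) for _, group in groupby(word))


def merge(u, y):
    """Construction 1: v_i = u_i + 2*y_i."""
    return [ui + 2 * yi for ui, yi in zip(u, y)]


def split(v):
    """Inverse mapping: recover (u, y) from the 4-ary string v."""
    return [vi % 2 for vi in v], [vi // 2 for vi in v]


u = [0, 0, 1, 0, 1, 1, 0, 1]  # binary RLL string with maximum run 2
y = [1, 1, 1, 1, 0, 0, 0, 0]  # arbitrary source bits, runs are unconstrained
v = merge(u, y)
print(v, max_run(v))
```

Two equal neighbouring symbols in v imply two equal neighbouring bits in u, which is why the run bound on u carries over to v even though y contains a run of four.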
TABLE 3. Asymptotic rate efficiency, η(m), of binary Construction 1 versus
maximum homopolymer run, m.
TABLE 4. Rate efficiency, Rm,0/C4(m), of binary Construction 1 versus
strand length, n, and maximum homopolymer run, m.
The rate efficiency with respect to the runlength limited 4-ary
channel, denoted by $\eta(m)$, is expressed by
$$\eta(m) = \frac{1 + C_2(m)}{C_4(m)}. \tag{12}$$
Table 3 lists results of computations. We may notice that
Construction 1 will suffer a loss of up to 12% for $m = 2$.
For larger values of $m$, however, the loss is negligible.
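As a worked check of the figure quoted for m = 2, note that for m = 2 the characteristic equation (7) factors, so both capacities have a closed form. The decimal values below are rounded and were computed for this illustration:

$$x^3 - q x^2 + q - 1 = (x - 1)\left(x^2 - (q-1)x - (q-1)\right) = 0,$$

$$\lambda_2(2) = \frac{1 + \sqrt{5}}{2} \approx 1.6180, \quad C_2(2) \approx 0.6942, \qquad \lambda_4(2) = \frac{3 + \sqrt{21}}{2} \approx 3.7913, \quad C_4(2) \approx 1.9227,$$

$$\eta(2) = \frac{1 + C_2(2)}{C_4(2)} \approx \frac{1.6942}{1.9227} \approx 0.881,$$

which corresponds to a loss of about 12% with respect to the 4-ary channel capacity.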
The above asymptotic efficiency of Construction 1, η(m),
is valid for very large values of the strand length n. It is of
practical interest to assess the efficiency for smaller values of
the strand length. Construction 1 can be used with any binary
RLL code, and there are many binary code constructions
for generating maximum runlength constrained sequences,
see [14] for an overview. We propose here, for the efficiency
assessment, a simple two-mode block code of codeword
length n. Runlength constrained codewords in the first mode
start with a symbol ‘zero’, while codewords in the second
mode start with a ‘one’. When the previously sent codeword
ends with a ‘one’ we use the codewords from the first mode
and vice versa. The number of binary source words that can
be accommodated with Construction 1 equals $2^{n-1} N_2(m,n)$,
so that the code rate, denoted by $R_{m,0}$, is
$$R_{m,0} = \frac{1}{n} \left( n - 1 + \lfloor \log_2 N_2(m,n) \rfloor \right), \tag{13}$$
where we truncated the code size to the largest power of two.
Table 4 shows selected outcomes of computations of the rate
efficiency $R_{m,0}/C_4(m)$ versus $m$ and $n$.
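The rate (13) is straightforward to tabulate once N2(m,n) is known. A small sketch, assuming exact rational arithmetic is convenient here; `count_rll` and `rate_binary` are illustrative names.

```python
from fractions import Fraction


def count_rll(q, m, n):
    """N_q(m, n) from recursion (1)."""
    counts = [0] * (n + 1)
    for j in range(1, n + 1):
        if j <= m:
            counts[j] = q ** j
        else:
            counts[j] = (q - 1) * sum(counts[j - k] for k in range(1, m + 1))
    return counts[n]


def rate_binary(m, n):
    """R_{m,0} of (13); floor(log2(N)) equals N.bit_length() - 1 for N >= 1."""
    n2 = count_rll(2, m, n)
    return Fraction(n - 1 + n2.bit_length() - 1, n)


for n in (5, 10, 20):
    print(n, rate_binary(3, n))
```

Dividing these rates by C4(m) gives the rate efficiency that Table 4 reports.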
C. ENCODING OF QUATERNARY SEQUENCES WITHOUT
BINARY STEP
In this subsection, we investigate two simple constructions
of codes that transform binary source words directly (that
is, without an intermediate binary coding step) into 4-ary
maximum homopolymer constrained codewords. An example
of a simple 4-ary block code was presented by
Blawat et al. [2]. The code converts 8 source bits into a
4-ary word of 5 nt. The 5-nt words can be cascaded without
violating the prescribed $m = 3$ maximum homopolymer
run. The rate of Blawat’s construction is $R = 8/5 = 1.6$.
As $C_4(m = 3) = 1.9824$, see Table 1, the (rate) efficiency of
the construction is $R/C_4(m) = 0.807$. Alternative, and more
efficient, constructions are described below.
TABLE 5. Rate efficiency, $R_{m,1}/C_4(m)$, of the two-mode code construction
versus strand length, $n$, and maximum homopolymer run, $m$.
In the first construction, denoted by two-mode construc-
tion, each source word can be represented by one of two
possible codewords, where the codeword sent is chosen to
satisfy the runlength constraint at the junction of two cas-
caded codewords. Decoding is accomplished by observing
the n-symbol codeword. In the second, slightly more efficient,
construction, denoted by four-mode construction, a source
word can be represented by four possible codewords. Decod-
ing is accomplished by observing the n-symbol codeword
plus the last symbol of the previous codeword.
1) TWO-MODE CONSTRUCTION
In this format, a source word can be represented by two
n-symbol 4-ary m-constrained codewords, where the alter-
native representations differ at the first position. In case we
append a new codeword to the previous codeword, we are
always able to choose (at least) one representation whose first
symbol differs from the last symbol of the previous codeword.
Then, clearly, the cascaded string of 4-ary symbols satisfies
the prescribed maximum homopolymer run constraint. The
rate of this two-mode construction, denoted by $R_{m,1}$, is
$$R_{m,1} = \frac{1}{n} \left( \lfloor \log_2 N_4(m,n) \rfloor - 1 \right), \tag{14}$$
where we truncated the code size to the largest power of two
possible. Table 5 shows outcomes of computations of the rate
efficiency $R_{m,1}/C_4(m)$ versus $m$ and $n$. We observe that, for
$m = 2$, the ‘quaternary’ efficiency $R_{2,1}/C_4(2)$ is slightly
better than the ‘binary’ $R_{2,0}/C_4(2)$, see Table 4. For $m > 2$,
both approaches have the same efficiency. The conversion
of the binary source symbols into the 4-ary n-nt strands and
vice versa can be accomplished using two look-up tables of
complexity $4^n$.
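As a worked instance of (14), take the parameters of Blawat's code, n = 5 and m = 3, for which recursion (1) gives N4(3,5) = 996:

$$R_{3,1} = \frac{1}{5} \left( \lfloor \log_2 996 \rfloor - 1 \right) = \frac{9 - 1}{5} = 1.6, \qquad \frac{R_{3,1}}{C_4(3)} = \frac{1.6}{1.9824} \approx 0.807.$$

For this short word length the two-mode construction thus reaches the same rate as the 8-bit to 5-nt look-up table; any gain has to come from longer words or from the four-mode construction.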
2) FOUR-MODE CONSTRUCTION
In the above two-mode construction, the encoded codeword
depends on the last symbol of the previous codeword. Decoding,
however, is based on the observation of the $n$ symbols
of the retrieved codeword only. In the second construction,
the codeword also depends on the last symbol of the previous
codeword. Decoding, however, is accomplished by observing
the $n$ symbols of the retrieved codeword plus the last symbol
of the previous codeword. To that end, we define four tables
of codewords, denoted by $L(i,a)$, where $i$, $1 \le i \le K$,
denotes the decimal representation of the source word to be
encoded, $K$ denotes the size of the table, and $a$ denotes the
last symbol of the previous codeword. The four tables are
constructed in such a way that the codewords in each table
$L(i,a)$ do not start with the symbol $a$. As a result, the encoder
always generates a symbol transition between the tail and
nose symbols of consecutive codewords. The maximum size
of the four tables equals $K = \frac{3}{4} N_4(m,n)$ (note that $N_4(m,n)$
is a multiple of 4). Table 6 shows a simple example of the
encoding tables of a four-mode code for $n = 2$ and $m = 2$.
The size of this code equals $K = 12$. Let, for example,
the source sequence be ‘0’, ‘1’, ‘3’, ‘6’. Then, using the
table, the encoded sequence is ‘10’, ‘11’, ‘03’, ‘22’. We may
simply verify that the maximum runlength is $m = 2$. The
code size is $K = 12$, while the code size of the two-mode
code for $m = n = 2$ described above equals $16/2 = 8$. The
table shows that the codeword ‘00’ is assigned to three source
words, namely ‘0’, ‘4’, and ‘8’, so that ‘00’ cannot be decoded
unambiguously by observing the codeword alone. Observation of
the retrieved codeword plus the last symbol of the previous
codeword resolves the ambiguity.
TABLE 6. Encoding tables of a four-mode code for $n = 2$ and $m = 2$. The
parameter $i$ denotes the (decimal) representation of the source word. The
tables $L(i,a)$, $a = 0, 1, 2, 3$, show the corresponding codeword, where $a$
denotes the last symbol of the previous codeword.
The rate of this four-mode construction, denoted by $R_{m,2}$,
is
$$R_{m,2} = \frac{1}{n} \log_2 \left( \frac{3}{4} N_4(m,n) \right). \tag{15}$$
Table 7 shows the rate efficiency of the four-mode construction.
The efficiency improvement with respect to the
two-mode construction, see Table 5, is obtained at the cost
of four look-up tables instead of two.
TABLE 7. Rate efficiency, $R_{m,2}/C_4(m)$, of the four-mode construction
versus strand length, $n$, and maximum homopolymer run, $m$.
Example: Let (as in Blawat’s code [2]) $n = 5$ and $m = 3$.
We simply find, using (1), $N_4(3,5) = 996$, so that the code
may accommodate $K = 3/4 \times 996 = 747$ binary source
words. Since $K > 512 = 2^9$ we may implement a code of
rate 9/5, which is 12.5% higher than that of Blawat’s code of
rate 8/5. As we have the freedom of deleting $747 - 512 = 235$
redundant codewords, we may, for example, bar the words
with the highest unbalance.
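The four-mode idea can be prototyped by brute-force enumeration for small n. The sketch below is only an illustration under stated assumptions: the ordering of codewords inside each table is arbitrary (so it does not reproduce Table 6), source words are indexed from 0, the first codeword is encoded as if the previous symbol were 0, and all function names are hypothetical.

```python
from itertools import groupby, product


def max_run(word):
    """Length of the longest run of identical symbols in word."""
    return max(len(list(group)) for _, group in groupby(word))


def build_tables(n, m, q=4):
    """tables[a] lists the m-constrained n-symbol words that do not start with symbol a."""
    words = [w for w in product(range(q), repeat=n) if max_run(w) <= m]
    return {a: [w for w in words if w[0] != a] for a in range(q)}


def encode(source, tables, first_prev=0):
    """Map source indices to codewords, selecting the table by the previous last symbol."""
    prev, out = first_prev, []
    for i in source:
        word = tables[prev][i]
        out.extend(word)
        prev = word[-1]
    return out


def decode(stream, tables, n, first_prev=0):
    """Recover source indices from n-symbol codewords plus the previous last symbol."""
    prev, source = first_prev, []
    for pos in range(0, len(stream), n):
        word = tuple(stream[pos:pos + n])
        source.append(tables[prev].index(word))
        prev = word[-1]
    return source


tables = build_tables(2, 2)
message = [0, 1, 3, 6, 11]
encoded = encode(message, tables)
print(len(tables[0]), encoded, decode(encoded, tables, 2))
print(len(build_tables(5, 3)[0]))  # table size K for n = 5, m = 3
```

Because every codeword starts with a symbol different from the previous last symbol, no run can extend across a codeword boundary, so the run bound inside each codeword also holds for the cascaded strand.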
In the next section, we take a look at the combined AT and
GC contents balance and maximum polymer run constrained
codes.
III. COMBINED WEIGHT AND MAXIMUM RUN
CONSTRAINED CODES
Oligos with large unbalance between GC and AT content
exhibit high dropout rates and are prone to polymerase chain
reaction (PCR) errors, and should therefore be avoided.
Avoidance of such undesired sequences implies an extra
redundancy. In this section, we compute the redundancy of
binary and quaternary codes with combined RLL and AT/GC
constraints.
A. DEFINITION AT/GC CONTENT, BALANCE, AND WEIGHT
We use the nucleotide alphabet $Q = \{0, 1, 2, 3\}$, where
we propose the following relation between the four decimal
symbols and the nucleotides: G = 0, C = 1, A = 2, and
T = 3. The AT/GC content constraint stipulates that around
half of the nucleotides should be either an A or a T nucleotide.
In order to study AT-balanced nucleotides, we start with a few
definitions. We define the weight or AT-content, denoted by
$w_4(x)$, of the n-nucleotide oligo $x = (x_1, \ldots, x_n)$, $x_i \in Q$,
as the number of occurrences of A or T, or
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i), \tag{16}$$
where
$$\varphi(u) = \begin{cases} 0, & u < 2, \\ 1, & u > 1. \end{cases} \tag{17}$$
The weight of a binary word $x = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$,
denoted by $w_2(x)$, is defined by
$$w_2(x) = \sum_{i=1}^{n} \varphi(2 x_i) = \sum_{i=1}^{n} x_i. \tag{18}$$
If we write the 4-ary word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, as
$x = y + 2z$, where $y_i$ and $z_i \in \{0, 1\}$, then
$$w_4(x) = \sum_{i=1}^{n} \varphi(x_i) = \sum_{i=1}^{n} \varphi(2 z_i) = w_2(z). \tag{19}$$
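The definitions (16) to (19) translate into a few lines of Python. The strand `GATTACAG` below is an arbitrary example chosen for this sketch, and the helper names are illustrative.

```python
NUCLEOTIDE_TO_SYMBOL = {"G": 0, "C": 1, "A": 2, "T": 3}  # mapping proposed in the text


def phi(u):
    """Indicator (17): 1 for the symbols 2 and 3 (A and T), 0 otherwise."""
    return 1 if u > 1 else 0


def weight4(x):
    """AT-content w_4(x) of a 4-ary word, see (16)."""
    return sum(phi(xi) for xi in x)


def split_bits(x):
    """Write x = y + 2z with binary y and z."""
    return [xi % 2 for xi in x], [xi // 2 for xi in x]


strand = "GATTACAG"
x = [NUCLEOTIDE_TO_SYMBOL[c] for c in strand]
y, z = split_bits(x)
print(x, weight4(x), sum(z))  # by (19) the last two numbers coincide
```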
Kerpez et al. [21], Braun and Immink [22], and Kurmaev [23]
analyzed properties and constructions of binary combined
weight and runlength constrained codes. Their results are
straightforwardly applied to the quaternary case at hand.
In the next subsections, we count binary and quaternary
sequences that satisfy combined maximum runlength and
weight constraints. We start by counting the number of binary
sequences, $x$, of length $n$ that satisfy a maximum runlength
constraint $m$ and have weight $w = w_2(x)$. Paluncic and
Maharaj [24] enumerated this number for the balanced case
$w = w_2(x) = n/2$.
B. COUNTING BINARY RLL SEQUENCES OF GIVEN
WEIGHT
Define the bi-variate generating function $H(x,y)$ in the
dummy variables $x$ and $y$ by
$$H(x,y) = \sum_{i,j} h_{i,j} x^i y^j, \tag{20}$$
and let $[x^{n_1} y^{n_2}] H(x,y)$ denote the extraction of the coefficient
of $x^{n_1} y^{n_2}$ in the formal power series $\sum h_{i,j} x^i y^j$, or
$$[x^{n_1} y^{n_2}] \sum_{i,j} h_{i,j} x^i y^j = h_{n_1,n_2}. \tag{21}$$
Define
$$T_1(x,y) = \sum_{i=1}^{m} x^i y^i. \tag{22}$$
Let the sequence start with a run of zeros; then the
generating function for the number of binary sequences with
a maximum runlength $m$ is
$$T(x) + T(x) T_1(x,y) + T(x)^2 T_1(x,y) + T(x)^2 T_1(x,y)^2 + \cdots.$$
In case the sequence starts with a run of ones, we obtain for
the generating function
$$T_1(x,y) + T(x) T_1(x,y) + T(x) T_1(x,y)^2 + T(x)^2 T_1(x,y)^2 + \cdots.$$
The generating function for the number of binary sequences
with a maximum runlength $m$ starting with a run of ones or a run of zeros
is the sum of the two above generating functions.
Working out the sum yields
$$\frac{T_1(x,y) + T(x) + 2 T_1(x,y) T(x)}{1 - T_1(x,y) T(x)},$$
so that the number of n-bit codewords, $x$, with maximum
runlength $m$, denoted (with a slight abuse of notational convention
by adding an extra parameter) by $N_2(m,w,n)$, that
satisfy a given weight constraint $w = w_2(x)$ is given
by
$$N_2(m,w,n) = [x^n y^w] \frac{T_1(x,y) + T(x) + 2 T_1(x,y) T(x)}{1 - T_1(x,y) T(x)}. \tag{23}$$
With the above bi-variate generating function, we may
exactly compute the number of binary m-constrained words
of weight w.
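One way to evaluate (23) numerically is to expand the quotient as a geometric series in T(x)T1(x,y) and truncate at degree n in x. The sketch below does this with dictionaries mapping exponent pairs to coefficients, and cross-checks the result by brute-force enumeration; the representation and all names are choices made for this illustration.

```python
from itertools import groupby, product


def poly_mul(a, b, n):
    """Product of two bivariate polynomials {(i, j): coeff}, dropping terms with x-degree > n."""
    out = {}
    for (i1, j1), c1 in a.items():
        for (i2, j2), c2 in b.items():
            if i1 + i2 <= n:
                key = (i1 + i2, j1 + j2)
                out[key] = out.get(key, 0) + c1 * c2
    return out


def poly_add(a, b):
    out = dict(a)
    for key, c in b.items():
        out[key] = out.get(key, 0) + c
    return out


def binary_rll_weight_counts(m, n):
    """Return {w: N_2(m, w, n)} by expanding (23) up to x^n."""
    t0 = {(i, 0): 1 for i in range(1, m + 1)}  # T(x): runs of zeros
    t1 = {(i, i): 1 for i in range(1, m + 1)}  # T_1(x, y): runs of ones
    t01 = poly_mul(t0, t1, n)
    numerator = poly_add(poly_add(t0, t1), {k: 2 * c for k, c in t01.items()})
    total, power = {}, {(0, 0): 1}             # power holds (T * T_1)^k
    while power:                               # becomes empty once its x-degree exceeds n
        total = poly_add(total, poly_mul(numerator, power, n))
        power = poly_mul(power, t01, n)
    return {w: c for (i, w), c in total.items() if i == n}


def brute_force(m, n):
    """Direct enumeration of all binary words, for cross-checking small cases."""
    counts = {}
    for word in product((0, 1), repeat=n):
        if max(len(list(g)) for _, g in groupby(word)) <= m:
            counts[sum(word)] = counts.get(sum(word), 0) + 1
    return counts


print(sorted(binary_rll_weight_counts(2, 10).items()))
```

Summing the returned counts over all weights should give N2(m,n) from recursion (1).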
More insight is gained by an approximation of $N_2(m,w,n)$.
For a given maximum runlength, $m$, and asymptotically
large $n$, we are specifically interested in the distribution
of $\lim_{n \to \infty} N_2(m,w,n)/N_2(m,n)$ versus the weight $w$. The
weight $w$ of a binary sequence of length $n$ is the sum of
the runlengths of ones. The runlengths are random variables,
so that for asymptotically large $n$, according to the Central
Limit Theorem [18], the weight distribution approaches a
Gaussian distribution with mean $n/2$ and variance denoted
by $\sigma_2^2(m,n)$. Then
$$N_2(m,w,n) \approx G\left(w; \frac{n}{2}, \sigma_2^2(m,n)\right) N_2(m,n), \quad n \gg 1, \tag{24}$$
where
$$G(u; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{u - \mu}{\sigma} \right)^2} \tag{25}$$
denotes the Gaussian distribution. The variance, $\sigma_2^2(m,n)$,
of the Gaussian distribution is computed below.
1) COMPUTATION OF THE VARIANCE, $\sigma_2^2(m,n)$
Let $x$ be an infinitely long binary m-constrained sequence,
where the probabilities of occurrence of the runlengths of
zeros and ones are chosen to maximize the information
rate (entropy) of the sequence. The probability of occurrence
of a runlength of length $l$, $l \le m$, in a maxentropic sequence
equals $\lambda_2^{-l}(m)$, see [14, Chapter 4], where for $q = 2$, see (7),
$\sum_{l=1}^{m} \lambda_2^{-l}(m) = 1$. The average runlength, denoted by $\bar{l}$,
equals
$$\bar{l} = \sum_{i=1}^{m} i \lambda_2^{-i}(m). \tag{26}$$
The runlength variance of an m-constrained sequence,
denoted by Var($l$), is
$$\mathrm{Var}(l) = \sum_{i=1}^{m} (i - \bar{l})^2 \lambda_2^{-i}(m). \tag{27}$$
The weight variance, $\sigma_2^2(m,n)$, of the m-constrained sequence
is
$$\sigma_2^2(m,n) = \gamma_2(m) \frac{n}{4}, \tag{28}$$
where
$$\gamma_2(m) = \frac{\mathrm{Var}(l)}{\bar{l}}.$$
Table 8 shows results of computations (note that the
parameter $\gamma_4(m)$ is explained in Section III-C). In order
to verify the accuracy of the Gaussian approximation,
we have numerically compared it with the (accurate) outcomes
of the generating function. Figure 1 shows a comparison
between the accurate and approximate distributions,
$N_2(m,w,n)/N_2(m,n)$, for $n = 100$ and $m = 2, 3, 4$.
Except for the discrepancy in the tails of the distributions,
the accuracy of the Gaussian approximation is quite sufficient
for engineering applications. The Gaussian approximation is
accurate within a few percent within the two-sigma limits of
the distribution.
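The coefficient γ2(m) of (28) follows from (26) and (27) once λ2(m) is known. The sketch below reads the maxentropic runlength probabilities as λ2(m)^(-l), which is the only reading under which they sum to one, and again assumes a bisection root finder; the names are illustrative.

```python
def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0, see (7), by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def gamma2(m):
    """gamma_2(m) = Var(l) / mean(l) for maxentropic binary runlengths, see (26)-(28)."""
    lam = largest_root(2, m)
    lengths = range(1, m + 1)
    probs = [lam ** -l for l in lengths]
    mean = sum(l * p for l, p in zip(lengths, probs))
    var = sum((l - mean) ** 2 * p for l, p in zip(lengths, probs))
    return var / mean


for m in (2, 3, 4, 40):
    print(m, round(gamma2(m), 4))
```

For very large m the runlengths become geometric with mean 2 and variance 2, so γ2 tends to 1 and (28) reduces to the familiar variance n/4 of unconstrained binary words.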
TABLE 8. Coefficient γ2(m) and γ4(m) versus maximum homopolymer
run m.
FIGURE 1. Comparison of the weight distribution of
N2(m,w,n)/N2(m,n), using (a) the Gaussian distribution (24) and
(b) generating functions for n=100 and m=2,3,4.
C. COUNTING QUATERNARY RLL SEQUENCES OF GIVEN
WEIGHT
We count the number of n-tuples $x$ of 4-ary symbols that
satisfy a maximum runlength constraint, $m$, and have weight
$w = w_4(x)$, denoted (with a slight abuse of notational convention)
by $N_4(m,w,n)$.
1) MAXIMUM RUNLENGTH CONSTRAINT
For the special case m=1, Limbachiya et al. [25] presented a
closed-form expression of N4(1,w,n). For other values of the
prescribed maximum runlength, m, we may readily compute
the number of 4-ary sequences, N4(m,w,n), versus weight,
w=w4(x), by applying generating functions.
The 4-ary symbols are generated by a constrained data
source that can be modelled as a four-state Moore-type
finite-state machine. The machine steps from state to state
where when state $i \in Q$ is visited a sequence of $k$, $1 \le k \le m$,
symbols ‘i’ are emitted. After visiting state $i$, the data source
may not return to state $i$ (and so forbidding to again emit a
sequence of the same symbol ‘i’), but it enters state $j \neq i$,
$j \in Q$. When the machine enters state 3 or 4, the word
weight, $w$, is incremented by $k$, where $k$, $1 \le k \le m$,
denotes the run of symbols ‘3’ or ‘4’. When, on the other
hand, states 1 or 2 are entered, the weight increment is nil. The
resulting $4 \times 4$ one-step skeleton or state-transition matrix,
$D(x,y)$, of the finite-state machine is
$$D(x,y) = \begin{bmatrix} 0 & a_0 & a_0 & a_0 \\ a_0 & 0 & a_0 & a_0 \\ a_1 & a_1 & 0 & a_1 \\ a_1 & a_1 & a_1 & 0 \end{bmatrix}, \tag{29}$$
where $a_0 = T(x)$ and $a_1 = T_1(x,y)$. We are now in
the position to write a general expression for $N_4(m,w,n)$.
The number of 4-ary sequences of length $n$ with maximum
runlength constraint $m$ and weight $w$ equals
$$N_4(m,w,n) = [x^n y^w] \frac{1}{3} \sum_{i,j} \sum_{k=1}^{n} d_{i,j}^{[k]}(x,y), \tag{30}$$
where $d_{i,j}^{[k]}(x,y)$ denotes the entries of $D^k(x,y)$. The
entries $d_{i,j}^{[k]}(x,y)$ of $D^k(x,y)$ are equal to the number of
sequences (paths) of $k$ runlengths starting in state $i$ and ending
in state $j$. Summation for all possible numbers of runlengths $k \le n$ and
matrix entries, and division by three yields the generating
function of $N_4(m,w,n)$, which proves (30).
TABLE 9. Number of balanced words, $N_4(m, n/2, n)$, versus $m$ and $n$.
Balanced codewords with $w = n/2$, $n$ even, play an
important role. Table 9 shows outcomes of computations
of $N_4(m, n/2, n)$ using (30), for $m = 1, 2,$ and 3. The case
$m = 1$ was earlier presented in [25]. Note that the integer
sequence $N_4(m = 1, n/2, n)$ versus $n$ is also known as OEIS
sequence A085363 (multiplied by 2), for which an alternative
generating function is presented in [26].
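For moderate n, the numbers N4(m,w,n) can also be obtained without generating functions. The sketch below swaps in a plain dynamic program over the state (last symbol, current run length, weight so far) in place of the matrix expression (30); it is an alternative technique chosen for brevity, and the function name is illustrative.

```python
from collections import defaultdict


def count_rll_weight(q, m, n, heavy):
    """Return {w: count} of length-n q-ary words with maximum run m and weight w.

    heavy is the set of symbols that contribute to the weight,
    for example {2, 3} for A and T under the mapping G=0, C=1, A=2, T=3.
    """
    states = defaultdict(int)  # key: (last symbol, current run length, weight)
    for s in range(q):
        states[(s, 1, 1 if s in heavy else 0)] = 1
    for _ in range(n - 1):
        nxt = defaultdict(int)
        for (last, run, w), cnt in states.items():
            for s in range(q):
                new_run = run + 1 if s == last else 1
                if new_run <= m:
                    nxt[(s, new_run, w + (1 if s in heavy else 0))] += cnt
        states = nxt
    result = defaultdict(int)
    for (_, _, w), cnt in states.items():
        result[w] += cnt
    return dict(result)


print(count_rll_weight(4, 1, 16, {2, 3})[8])  # balanced words for m = 1, n = 16
```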
Generating functions (30) allow us to accurately compute
$N_4(m,w,n)$. For some applications, we may sacrifice accuracy
for simplicity of the expression. In the next subsection,
we derive a simple approximation to $N_4(m,w,n)$ valid for
asymptotically large $n$ and small relative weight $w/n$.
2) ESTIMATE OF THE WEIGHT DISTRIBUTION
The weight $w_4(x)$ is the number of nucleotides A and T in
the sequence $x$, see (19). Then, as in the binary case above,
for asymptotically large $n$, according to the Central Limit
Theorem, the weight distribution is approximately Gaussian,
that is, we may conveniently approximate $N_4(m,w,n)$ by
$$N_4(m,w,n) \approx G\left(w; \frac{n}{2}, \sigma_4^2(m,n)\right) N_4(m,n), \quad n \gg 1, \tag{31}$$
where $\sigma_4^2(m,n)$ denotes the variance of the Gaussian weight
distribution. The variance $\sigma_4^2(m,n)$ can be computed as
follows.
3) COMPUTATION OF THE VARIANCE $\sigma_4^2(m,n)$
Let $u_i$, $i = 1, 2, \ldots$, $u_i \in Q$, be an infinitely long 4-ary
sequence generated by a maxentropic source that satisfies
a prescribed maximum runlength $m$. Although the 4-ary
sequence $u_i$, $i = 1, 2, \ldots$, satisfies a limited runlength constraint,
$m$, the runs of the binary weight sequence $v_i = \varphi(u_i)$,
$i = 1, 2, \ldots$, see definition (17), are without any limit.
The variance, $\sigma_4^2(m,n)$, of the Gaussian weight distribution
is governed by the runlength distribution, $P(k)$, of the binary
sequence $v_i$, where $P(k)$, $k > 0$, denotes the probability
of occurrence of a runlength $k$. Clearly, $\sum_{k>0} P(k) = 1$.
The probability $P(k)$ is proportional to the number of binary
m-constrained sequences of length $k$, $N_2(m,k)$, times the probability of
such a sequence, $\lambda_4^{-k}$, or
$$P(k) = c N_2(m,k) \lambda_4^{-k}, \quad k \ge 1, \tag{32}$$
where the normalization constant $c$ is chosen such that
$\sum_{k=1}^{\infty} P(k) = 1$. The term $N_2(m,k)$ is the number of AT
combinations of length $k$, which may consist of a single A or T
run or a plurality of alternating A and T runs. Then we have
$$\sigma_4^2(m,n) = \gamma_4(m) \frac{n}{4}, \tag{33}$$
where, see [14, Chapter 4],
$$\gamma_4(m) = \frac{1}{\bar{l}} \sum_{k=1}^{\infty} (k - \bar{l})^2 P(k) \tag{34}$$
and
$$\bar{l} = \sum_{k=1}^{\infty} k P(k). \tag{35}$$
Table 8 shows results of computations of $\gamma_4(m)$ versus $m$.
We infer from (31) and Table 8 that, for $n$ fixed, the weight
distribution becomes wider with increasing maximum runlength
$m$, see also Figure 1. Note that the above outcome is
not consistent with the results by Erlich and Zielinski [3],
as they assume a Gaussian balance distribution whose variance
equals $n/4$, independent of $m$.
An estimate of the number of balanced codewords,
$N_4(m, n/2, n)$, is
$$N_4\left(m, \frac{n}{2}, n\right) \approx \sqrt{\frac{2}{\pi \gamma_4(m) n}} \, N_4(m,n), \quad n \text{ even}. \tag{36}$$
For the case $m = 1$ we have (see [26], sequence A085363,
for a similar result)
$$N_4\left(1, \frac{n}{2}, n\right) \approx \frac{8}{\sqrt{\pi n}} \, 3^{n-1}, \quad n \text{ even}. \tag{37}$$
Using the above approximation, we obtain, for example, that
$N_4(1,8,16) \approx 16191008$, which is 2% higher than its exact
value, 15873240, listed in Table 9.
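The estimate (36) can be reproduced numerically from (32) to (35). In the sketch below the infinite sums are truncated after 200 terms, which is an assumption that is harmless here because P(k) decays geometrically; the helper names and the bisection root finder are likewise illustrative.

```python
import math


def largest_root(q, m, iterations=200):
    """Largest real root of x^(m+1) - q*x^m + q - 1 = 0, see (7), by bisection."""
    def f(x):
        return x ** (m + 1) - q * x ** m + q - 1

    lo, hi = q * m / (m + 1), float(q)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def rll_counts(q, m, n_max):
    """List c with c[k] = N_q(m, k) for k = 1..n_max, from recursion (1)."""
    c = [0] * (n_max + 1)
    for k in range(1, n_max + 1):
        c[k] = q ** k if k <= m else (q - 1) * sum(c[k - j] for j in range(1, m + 1))
    return c


def gamma4(m, terms=200):
    """gamma_4(m) of (34), with P(k) proportional to N_2(m,k) * lambda_4^(-k), see (32)."""
    lam4 = largest_root(4, m)
    n2 = rll_counts(2, m, terms)
    raw = [n2[k] / lam4 ** k for k in range(1, terms + 1)]
    total = sum(raw)
    probs = [r / total for r in raw]
    mean = sum(k * p for k, p in enumerate(probs, start=1))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs, start=1))
    return var / mean


def balanced_estimate(m, n):
    """Gaussian estimate (36) of the number of balanced words, n even."""
    n4 = rll_counts(4, m, n)[n]
    return math.sqrt(2 / (math.pi * gamma4(m) * n)) * n4


print(gamma4(1), balanced_estimate(1, 16))
```

For m = 1 the sketch gives γ4(1) = 1/2, which turns (36) into the closed form (37).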
D. REDUNDANCY OF BINARY AND QUATERNARY CODES
WITH COMBINED RLL AND AT/GC BALANCE
CONSTRAINTS
For DNA-based storage, we do not require that the strands
of the codebook, $S$, are strictly balanced, as a small unbalance,
that is $\alpha_S \ll 1$, between the GC and AT content is
permitted without affecting the error performance. Such a
constraint is called a weak balance constraint. The relative
unbalance of a word, $\alpha(x)$, is defined by $\alpha(x) = \left| \frac{w_4(x)}{n} - \frac{1}{2} \right|$.
An n-nucleotide oligo is said to be balanced if $\alpha(x) = 0$. Code
constructions for combined RLL and weak balanced codes
have been published in [3], and for $m = 3$ in [27], [28].
FIGURE 2. Redundancy (bits), $r_4(a,n)$, versus word length, $n$, with the
relative unbalance, $a$, as a parameter. The raggedness of the curves is
caused by the truncation effects in the summation in (39).
We first study the balance of sequences without an
m-constraint. The number of 4-ary words of length $n$ with
weight $w = w_4(x)$, denoted by $N_4(w,n)$, equals
$$N_4(w,n) = \binom{n}{w} 2^n. \tag{38}$$
The number of oligos, denoted by $N_{4,a}(n)$, of length $n$, whose
relative unbalance, $\alpha(x)$, is limited by $a$, is given by
$$N_{4,a}(n) = \sum_{\left| \frac{w}{n} - \frac{1}{2} \right| < a} N_4(w,n) = 2^n \sum_{\left| \frac{w}{n} - \frac{1}{2} \right| < a} \binom{n}{w}. \tag{39}$$
The redundancy of 4-ary nearly balanced strands, denoted
by $r_4(a,n)$, equals
$$r_4(a,n) = \log_2 \frac{4^n}{N_{4,a}(n)}. \tag{40}$$
Figure 2 shows examples of computations of the redundancy,
$r_4(a,n)$, versus $n$ with the relative unbalance, $a$, as a parameter.
The raggedness of the curves is caused by the truncation
effects in the summation in (39). The distribution for
asymptotically large $n$ of $N_4(w,n)$ versus $w$ is approximately
Gaussian shaped, that is
$$N_4(w,n) \approx G\left(w; \frac{n}{2}, \frac{n}{4}\right) 4^n, \quad n \gg 1, \tag{41}$$
so that the redundancy equals
$$r_{4,a}(n) \approx -\log_2 \left[ 1 - 2Q(2a\sqrt{n}) \right], \quad n \gg 1, \tag{42}$$
where the Q-function is defined by
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}} \, du. \tag{43}$$
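The exact redundancy (40) and its Gaussian approximation (42) are easy to compare numerically. The sketch below uses the strict inequality of (39) and expresses the Q-function through the complementary error function; the function names are illustrative.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) of (43), written via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def redundancy_exact(a, n):
    """r_4(a, n) of (40), with the binomial sum of (39) taken over |w/n - 1/2| < a."""
    allowed = sum(math.comb(n, w) for w in range(n + 1) if abs(w / n - 0.5) < a)
    return n - math.log2(allowed)  # log2(4^n / (2^n * allowed))


def redundancy_gauss(a, n):
    """Gaussian approximation (42), intended for large n."""
    return -math.log2(1 - 2 * q_function(2 * a * math.sqrt(n)))


for n in (50, 100, 200):
    print(n, round(redundancy_exact(0.1, n), 4), round(redundancy_gauss(0.1, n), 4))
```

The exact values jump whenever a new weight enters the allowed range, which is the truncation effect mentioned for Figure 2, while the approximation varies smoothly with n.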
We now study q-ary sequences with both an m-constraint
and a given weight $w$. As in Construction 1, let the quaternary
word $x = (x_1, \ldots, x_n)$, $x_i \in Q$, be written as $x = y + 2z$,
where the constituting elements $y_i$ and $z_i \in \{0, 1\}$. If the
binary sequence $z$ is m-constrained and has weight $w = w_2(z)$,
then $x$ is m-constrained and it has weight $w_4(x) = w$.
Using (11), (24), and (31), we obtain for $n \gg 1$, that
the redundancy of q-ary sequences with combined RLL and
balance constraints, denoted by $r_{q,a}(m,n)$, equals
$$r_{q,a}(m,n) \approx r_q(m,n) - \log_2 \left[ 1 - 2Q\left( 2a \sqrt{\frac{n}{\gamma_q(m)}} \right) \right]. \tag{44}$$
A numerical analysis of the above expression shows that the
redundancy difference due to the balance (right hand) term
is around 0.5-1 bit for $m = 2$. For larger values of the
homopolymer run $m$ the extra redundancy is negligible for
$n > 10$. The redundancy difference, $r_2(m,n) - r_4(m,n)$, due
to the imposed runlength constraint is much larger for $n > 10$
than the redundancy due to the balance constraint.
IV. CONCLUSION
We have compared two coding approaches for constraint-based
coding of DNA strings. In the first approach, an intermediate,
‘binary’, coding step is used, while in the second approach we
‘directly’ translate source data into constrained quaternary
sequences. The binary approach is attractive as it yields a
lower complexity of encoding and decoding look-up tables.
The redundancy of the binary approach is higher than that of
the quaternary approach for generating combined weight and
run-length constrained sequences. The redundancy difference
is small for larger values of the maximum homopolymer run.
We have found exact and approximate expressions for the
number of binary and quaternary sequences with combined
weight and run-length constraints.
REFERENCES
[1] G. M. Church, Y. Gao, and S. Kosuri, ‘‘Next-generation digital information
storage in DNA,’’ Science, vol. 337, no. 6102, p. 1628, Sep. 2012.
[2] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso,
B. W. Pruitt, and G. M. Church, ‘‘Forward error correction for DNA
data storage,’’ in Proc. Int. Conf. Comput. Sci. (ICCS), vol. 80, 2016,
pp. 1011–1022.
[3] Y. Erlich and D. Zielinski, ‘‘DNA fountain enables a robust and efficient
storage architecture,’’ Science, vol. 355, no. 6328, pp. 950–954, Mar. 2017.
[4] J. Koch, S. Gantenbein, K. Masania, W. J. Stark, Y. Erlich, and R. N. Grass,
‘‘A DNA-of-things storage architecture to create materials with embedded
memory,’’ Nature Biotechnol., vol. 38, no. 1, pp. 39–43, Jan. 2020.
[5] Y. Wang, M. Noor-A-Rahim, J. Zhang, E. Gunawan, Y. L. Guan, and
C. L. Poh, ‘‘High capacity DNA data storage with variable-length oligonucleotides
using repeat accumulate code and hybrid mapping,’’ J. Biol. Eng.,
vol. 13, no. 1, p. 89, Dec. 2019.
[6] L. Ceze, J. Nivala, and K. Strauss, ‘‘Molecular digital data storage using
DNA,’’ Nature Rev. Genet., vol. 20, no. 8, pp. 456–466, Aug. 2019.
[7] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig,
‘‘A DNA-based archival storage system,’’ ACM SIGOPS Oper. Syst. Rev.,
vol. 50, pp. 637–649, 2016.
[8] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty,
C. Nusbaum, and D. B. Jaffe, ‘‘Characterizing and measuring bias in
sequence data,’’ Genome Biol., vol. 14, no. 5, p. R51, 2013.
[9] K. W. Cattermole, ‘‘Principles of digital line coding,’’ Int. J. Electron.,
vol. 55, pp. 3–33, Jul. 1983.
[10] K. A. Schouhamer Immink and K. Cai, ‘‘Design of capacity-approaching
constrained codes for DNA-based storage systems,’’ IEEE Commun. Lett.,
vol. 22, no. 2, pp. 224–227, Feb. 2018.
[11] Y.-S. Kim and S.-H. Kim, ‘‘New construction of DNA codes with constant-
GC contents from binary sequences with ideal autocorrelation,’’ in Proc.
IEEE Int. Symp. Inf. Theory Process., Jul. 2011, pp. 1569–1573.
[12] Y. M. Chee and S. Ling, ‘‘Improved lower bounds for constant
GC-content DNA codes,’’ IEEE Trans. Inf. Theory, vol. 54, no. 1,
pp. 391–394, Jan. 2008.
[13] K. A. Schouhamer Immink and K. Cai, ‘‘Efficient balanced and maximum
homopolymer-run restricted block codes for DNA-based data storage,’’
IEEE Commun. Lett., vol. 23, no. 10, pp. 1676–1679, Oct. 2019.
[14] K. A. S. Immink, Codes for Mass Data Storage Systems, 2nd ed.
Eindhoven, The Netherlands: Shannon Foundation, 2004.
[15] S. W. McLaughlin, J. Luo, and Q. Xie, ‘‘On the capacity of M-ary
runlength-limited codes,’’ IEEE Trans. Inf. Theory, vol. 41, no. 5,
pp. 1508–1511, Sep. 1995.
[16] M. W. Marcellin and H. J. Weber, ‘‘Two-dimensional modulation codes,’’
IEEE J. Sel. Areas Commun., vol. 10, no. 1, pp. 254–266, Jan. 1992.
[17] C. E. Shannon, ‘‘A mathematical theory of communication,’’ Bell Syst.
Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
[18] P. Flajolet and R. Sedgewick, Analytic Combinatorics. Cambridge, U.K.:
Cambridge Univ. Press, 2009.
[19] S. M. H. T. Yazdi, H. M. Kiah, and O. Milenkovic, ‘‘Weakly mutually
uncorrelated codes,’’ in Proc. IEEE Int. Symp. Inf. Theory (ISIT),
Barcelona, Spain, Jul. 2016, pp. 2649–2653.
[20] V. Taranalli, H. Uchikawa, and P. H. Siegel, ‘‘Error analysis and inter-cell
interference mitigation in multi-level cell flash memories,’’ in Proc. IEEE
Int. Conf. Commun. (ICC), London, U.K., Jun. 2015, pp. 271–276.
[21] K. J. Kerpez, A. Gallopoulos, and C. Heegard, ‘‘Maximum entropy
charge-constrained run-length codes,’’ IEEE J. Sel. Areas Commun., vol. 10, no. 1,
pp. 242–253, Jan. 1992.
[22] V. Braun and K. A. Schouhamer Immink, ‘‘An enumerative coding technique
for DC-free runlength-limited sequences,’’ IEEE Trans. Commun.,
vol. 48, no. 12, pp. 2024–2031, Dec. 2000.
[23] O. F. Kurmaev, ‘‘Constant-weight and constant-charge binary run-length
limited codes,’’ IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4497–4515,
Jul. 2011.
[24] F. Paluncic and B. T. J. Maharaj, ‘‘Using bivariate generating functions
to count the number of balanced runlength-limited words,’’ in Proc.
GLOBECOM - IEEE Global Commun. Conf., Singapore, Dec. 2017,
pp. 4–8.
[25] D. Limbachiya, M. K. Gupta, and V. Aggarwal, ‘‘Family of constrained
codes for archival DNA data storage,’’ IEEE Commun. Lett., vol. 22, no. 10,
pp. 1972–1975, Oct. 2018.
[26] N. J. A. Sloane. (2019). The On-Line Encyclopedia of Integer Sequences.
[Online]. Available: http://oeis.org
[27] Y. Wang, M. Noor-A-Rahim, E. Gunawan, Y. L. Guan, and C. L. Poh,
‘‘Construction of bio-constrained code for DNA data storage,’’ IEEE Commun.
Lett., vol. 23, no. 6, pp. 963–966, Jun. 2019.
[28] W. Song, K. Cai, M. Zhang, and C. Yuen, ‘‘Codes with run-length and
GC-content constraints for DNA-based data storage,’’ IEEE Commun.
Lett., vol. 22, no. 10, pp. 2004–2007, Oct. 2018.
KEES A. SCHOUHAMER IMMINK (Life Fellow,
IEEE) is currently a Founder and the President
of Turing Machines Inc., an innovative start-up
focused on coding and signal processing for
DNA-based storage. He received the 2017 IEEE
Medal of Honor for his pioneering contributions
to video, audio, and data recording technology,
the Knighthood, in 2000, the Personal
Emmy Award, in 2004, the 1999 Audio Engineer-
ing Society’s (AES) Gold Medal, the 2004 SMPTE
Progress Medal, the 2014 Eduard Rhein Prize for Technology, and the
2015 IET Faraday Medal. He received an Honorary Doctorate from the
University of Johannesburg, in 2014. He was inducted into the Consumer
Electronics Hall of Fame, elected into the Royal Netherlands Academy of
Arts and Sciences, and the (US) National Academy of Engineering. He has
served the profession as a Governor for the IEEE Information Theory and
Consumer Electronics Societies and the President for the Audio Engineering
Society.
KUI CAI (Senior Member, IEEE) received the
B.E. degree in information and control engineering
from Shanghai Jiao Tong University, Shanghai,
China, and the joint Ph.D. degree in electrical
engineering from the Technical University of
Eindhoven, The Netherlands, and the National
University of Singapore. She is currently an Asso-
ciate Professor with the Singapore University
of Technology and Design (SUTD). Her main
research interests are in the areas of coding the-
ory, information theory, signal processing for various data storage systems,
and digital communications. She received the 2008 IEEE Communications
Society Best Paper Award in Coding and Signal Processing for Data Storage.
She has served as the Vice-Chair (Academia) for the IEEE Communications
Society and the Data Storage Technical Committee (DSTC), from 2015
to 2016.