Content uploaded by Kees Schouhamer Immink

Author content

All content in this area was uploaded by Kees Schouhamer Immink on Feb 28, 2020

Content may be subject to copyright.


Design of Capacity-Approaching Constrained Codes

for DNA-based Storage Systems

Kees A. Schouhamer Immink, Fellow, IEEE and Kui Cai, Senior Member, IEEE

Abstract—We consider coding techniques that limit the lengths

of homopolymer runs in strands of nucleotides used in DNA-based mass data storage systems. We compute the maximum number of user bits that can be stored per nucleotide when a maximum homopolymer runlength constraint is imposed. We describe simple and efficient implementations of coding techniques that avoid the occurrence of long homopolymers, and the rates of the constructed codes are close to the theoretical maximum. The proposed sequence replacement method for k-constrained q-ary data yields a significant improvement in coding redundancy over the prior art sequence replacement method for k-constrained binary data. Using a simple transformation, standard binary maximum runlength limited sequences can be transformed into maximum runlength limited q-ary sequences, which opens the door to applying the vast prior art binary code constructions to DNA-based storage.

I. INTRODUCTION

The first large-scale archival DNA-based storage architecture was implemented by Church et al. [1] in 2012. Naturally occurring DNA consists of four types of nucleotides (nt): adenine (A), cytosine (C), guanine (G), and thymine (T). A DNA strand (or string) is a linear sequence of these nucleotides, and hence is essentially a q-ary sequence with q = 4. Binary source, or user, data is translated into a strand of nucleotides, for example, by mapping two binary source bits into a single nucleotide. Repetitions of the same nucleotide, a homopolymer run, may significantly increase the chance of sequencing errors [2], [10]. From Fig. 5 of [10], a long homopolymer run (e.g. more than 4 nt) would result in a significant increase of insertion and deletion errors, so that such long runs should be avoided.

In this paper, we focus on constrained coding techniques that avoid the occurrence of long homopolymer runs. That is, we study the generation of sequences of q-ary symbols, $\ldots, x_{i-1}, x_i, x_{i+1}, \ldots$, $x_i \in \mathcal{Q} = \{0, \ldots, q-1\}$, where the occurrence of vexatious substrings is disallowed. Note that for the DNA case, q = 4, we prefer the alphabet $\mathcal{Q} = \{0, 1, 2, 3\}$ to the set of four nucleotide types {A, C, G, T}, as it allows the introduction of arithmetic operations on the symbols.

Constrained sequences have been applied in a great number of mass data storage systems such as optical and magnetic data recording systems [3]. Constrained codes based on runlength limited (RLL) sequences have found almost universal application in recording practice, and most of the codes are binary with q = 2. The number of repetitions of the same consecutive symbol (nucleotide) is usually called runlength [4]. A maximum runlength constraint is characterized by the integer (k + 1), k ≥ 0, which stipulates the maximum runlength. We focus on sequences where the ‘zero’ runlength lies between d and k. Such a sequence is often called a dk-constrained sequence, and in case d = 0, it is called a k-constrained sequence. A k-constrained sequence is converted into an RLL sequence whose maximum runlength equals k + 1, using precoding, a modulo-q integration step [3]. The notation k versus k + 1 for a k-constrained sequence versus a (k + 1) RLL sequence is inconvenient, but the term is generally used in data recording practice, and is a heritage rooted in the 1960s [5]. We use the notation m = k + 1 to denote a maximum homopolymer run of m nt.

Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-machines.com.

Kui Cai is with Singapore University of Technology and Design (SUTD), 8 Somapah Rd, 487372, Singapore. E-mail: cai kui@sutd.edu.sg.

This work was supported by the SUTD-MIT International Design Center (IDC) research grant.

Bornholt et al. [2] presented a coding method that avoids the occurrence of repetitions of the same nucleotide for DNA-based storage. They convert binary user data into a (k = 0)-constrained ternary data stream using a base-change converter, where the generated ternary data are taken from the alphabet {1, 2, 3}. The ternary data so obtained are translated using modulo-4 integration precoding into a strand of nucleotides, where homopolymers are avoided, m = 1, that is, substrings ‘00’, ‘11’, ‘22’, ‘33’ (or in nucleotide language: ‘AA’, ‘CC’, ‘GG’, and ‘TT’) are not generated. The relative loss of information capacity due to the proposed 3-base code, instead of the full 4-base code, equals $1 - \log_2(3)/2 \approx 0.208$. The additional loss of the proposed binary-based source word to ternary-based codeword conversion using the proposed fixed-to-variable-length Huffman code is ignored. The more than 20 percent loss of information capacity is significant, and therefore alternative coding methods with less overhead are desirable.
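The modulo-4 integration precoding mentioned above can be illustrated with a short sketch. It assumes a dummy initial symbol $y_0 = 0$ and the mapping 0, 1, 2, 3 to A, C, G, T; both choices, and the function names, are assumptions made for illustration and are not taken from [2].

```python
def precode_mod_q(symbols, q=4, start=0):
    """Modulo-q integration: y_i = (y_{i-1} + x_i) mod q, with y_0 = start (assumed)."""
    out, prev = [], start
    for s in symbols:
        prev = (prev + s) % q
        out.append(prev)
    return out


def to_strand(symbols, alphabet="ACGT"):
    """Map symbols 0..3 to nucleotides (the mapping itself is an assumption)."""
    return "".join(alphabet[s] for s in symbols)


def max_run(seq):
    """Length of the longest run of identical consecutive symbols."""
    longest, run, prev = 0, 0, None
    for s in seq:
        run = run + 1 if s == prev else 1
        longest, prev = max(longest, run), s
    return longest


# Ternary data from {1, 2, 3} contain no 'zero', so consecutive outputs always differ.
ternary = [1, 1, 3, 2, 2]
strand = to_strand(precode_mod_q(ternary))
```

Because every input symbol is non-zero, each output symbol differs from its predecessor, so the strand has m = 1. Feeding a k-constrained sequence that does contain ‘zero’s through the same function gives runs of at most k + 1 identical symbols.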

In this paper, we propose alternative, more efficient, coding techniques that avoid the occurrence of long homopolymer runs. In particular, in Section II, we compute the information capacity of q-ary, k-constrained channels, which follows directly from Shannon’s noiseless input restricted channel [6]. Then, in Section III, we present the main contribution of this work, algorithms for translating arbitrary binary source data into k-constrained q-ary data. Among the three code design methods we describe, the second method removes forbidden substrings of q-ary sequences by using a recursive ‘sequence replacement’ method, yielding a significant improvement in coding redundancy over the prior art binary sequence replacement method [9]. In the third method, standard binary maximum runlength limited sequences are transformed into maximum runlength limited q-ary sequences using two simple steps of precoding, which opens the door to applying the vast prior art binary code constructions to DNA-based storage. Section IV concludes our paper.

TABLE I
CAPACITY, C_k, VERSUS k AND m = k + 1 FOR q = 4.

k    m    C_k (bit/nt)
0    1    1.5850 (= log2 3)
1    2    1.9227
2    3    1.9824
3    4    1.9957
4    5    1.9989
5    6    1.9997

II. INFORMATION CAPACITY

Strands of nucleotides with (long) repetitions of the same nucleotide are prone to error, and DNA sequences with more than $m = k + 1$ consecutive nucleotides of the same type must be avoided. Each k-constrained sequence of symbols starting with a non-zero symbol can be seen to be composed of substrings taken from the set $\{a_0, a_1 0, a_2 0^2, \ldots, a_k 0^k\}$, where $0^j$ stands for a string of $j$ consecutive ‘0’s, and the integer $a_i \in \{1, \ldots, q-1\}$. Let $N_k(n)$ denote the number of k-constrained sequences of $n$ q-ary symbols starting with a non-zero symbol, then we may write down, following Shannon’s approach [6], the recurrent relationship

$$N_k(n) = (q-1) \sum_{i=1}^{k+1} N_k(n-i), \quad n > k. \tag{1}$$

For large $n$, the number of sequences $N_k(n)$ grows exponentially, that is

$$N_k(n) \sim c \lambda_k^n, \quad n \gg 1, \tag{2}$$

where $c \sim 1$ is a constant, and the growth factor, $\lambda_k$, is the largest real root of

$$\lambda_k^{k+2} - q \lambda_k^{k+1} + q - 1 = 0. \tag{3}$$

The maximum number of user bits that can be stored per nucleotide (nt), called (information) capacity, denoted by $C_k$, is defined by

$$C_k = \lim_{n \to \infty} \frac{1}{n} \log_2 N_k(n) = \log_2 \lambda_k \ \text{(bit/nt)}. \tag{4}$$
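The capacity values follow numerically from (3) and (4). The sketch below is one possible way to evaluate them: it locates the largest real root of (3) by bisection on the interval [q − 1, q], which brackets that root because the left-hand side of (3) is non-positive at q − 1 and positive at q. The function names and the choice of bisection are illustrative assumptions.

```python
import math


def growth_factor(k, q=4, iterations=60):
    """Largest real root of lam^(k+2) - q*lam^(k+1) + q - 1 = 0, by bisection."""
    def f(lam):
        return lam ** (k + 2) - q * lam ** (k + 1) + q - 1

    lo, hi = float(q - 1), float(q)   # f(lo) <= 0 < f(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def capacity(k, q=4):
    """Capacity C_k = log2(lambda_k) in bit per symbol, as in (4)."""
    return math.log2(growth_factor(k, q))


for k in range(6):
    print(k, k + 1, round(capacity(k), 4))
```

For k = 0 the root is exactly q − 1 = 3, which gives the value log2 3 of the (k = 0)-constrained ternary scheme discussed in the introduction.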

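Table I shows the capacity of the k-constrained channel, $C_k$, versus $k$ and the maximum homopolymer run $m = k + 1$, for the DNA case, q = 4. For asymptotically large $k$, we obtain

$$\lambda_k \sim q \left( 1 - \frac{q-1}{q^{k+2}} \right), \quad k \gg 1, \tag{5}$$

so that

$$C_k \sim \log_2 q - \frac{1}{\ln 2} \, \frac{q-1}{q^2} \, q^{-k}. \tag{6}$$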

It is immediate from Table I that a relaxation of the maximum homopolymer runlength constraint from m = 1, a value proposed in [2], to a higher value may significantly increase the maximum code rate. We are interested in k-constrained code constructions of rate $(n-1)/n$, where $n$ is the codeword length. Define the integer $n_{\max}$ as the largest $n$ for which a rate $(n-1)/n$, k-constrained code can be constructed. As the rate cannot exceed $\log_q \lambda_k$, we simply find that

$$\frac{1}{n_{\max}} \ge 1 - \log_q \lambda_k, \tag{7}$$

or

$$n_{\max} = \left\lfloor \frac{1}{\log_q (q / \lambda_k)} \right\rfloor. \tag{8}$$
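As a rough numerical check of (8), the sketch below obtains $\lambda_k$ by iterating the rearranged form of (3), $\lambda = q - (q-1)/\lambda^{k+1}$, starting from $\lambda = q$. The iteration count and the function names are assumptions of this illustration.

```python
import math


def growth_factor(k, q=4, iterations=200):
    """Fixed-point iteration lam <- q - (q - 1) / lam^(k + 1), started at lam = q."""
    lam = float(q)
    for _ in range(iterations):
        lam = q - (q - 1) / lam ** (k + 1)
    return lam


def n_max(k, q=4):
    """Largest n with (n - 1)/n <= log_q(lambda_k), following (8)."""
    lam = growth_factor(k, q)
    return math.floor(1.0 / math.log(q / lam, q))
```

Under these assumptions the sketch returns 25, 113 and 1885 for k = 1, 2 and 4. For k = 3 the unrounded quotient is close to 467 (about 466.9), so the integer obtained there is sensitive to the rounding convention.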

Results of computations are collected in Table II. We may notice that it is possible, in theory, to construct a code with a redundancy of around half a percent, where homopolymer runs have a length of at most m = 4. A maximum homopolymer run, m = 3, costs less than two percent redundancy. In the next section, we investigate properties and constructions of practical codes that translate arbitrary source data into k-constrained q-ary sequences.

III. MAXIMUM RUNLENGTH CONSTRAINED CODES

We detail three methods for generating maximum runlength limited q-ary sequences. In the second method, forbidden substrings of q-ary sequences are removed by a recursive ‘sequence replacement’ method. We assume that the binary source data have been translated into q-ary data, which is accomplished by an efficient base converter. In the third method, standard binary maximum runlength limited sequences are transformed into maximum runlength limited q-ary sequences using a simple transformation.

A. Cascadable block codes, Method A

Let $n$ be the length of a k-constrained q-ary word that ends with at most $r$ ‘zero’s and starts with at most $l$ ‘zero’s. In case $l + r \le k$, we may cascade the n-words without violating the k constraint at the word boundaries. Blake [7] and Freiman and Wyner [5] showed for the binary case, q = 2, that the number of such constrained words, denoted by $N_{klr}(n)$, is maximized by choosing $l = \lfloor k/2 \rfloor$ and $r = k - l$. Their arguments can be generalized to q-ary words, and we denote the number of such k-constrained q-ary words by $N_{k,l_0,r_0}(n)$, where $l_0 = \lfloor k/2 \rfloor$ and $r_0 = k - l_0$. Using generating functions and an algebraic computer program, we can compute $N_{k,l_0,r_0}(n)$ as a function of $k$ and $n$. As we are interested in the construction of codes of rate $1 - 1/n$, we computed the maximum $n$, denoted by $n_A$, for which a rate $1 - 1/n$, k-constrained code using Method A is possible. Results of computations are collected in Table II. For small $n$ it is practically possible to directly implement RLL codes using look-up tables. For larger codes, and smaller redundancy, we must resort to alternative coding methods. Kautz [8], for example, presented an enumeration algorithm for encoding and decoding k-constrained binary sequences. His enumeration algorithm can be rewritten for the q-ary case at hand. The space complexity of his algorithm is $O(n^2)$, which makes it less attractive than the replacement techniques discussed in the next subsection.
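The count $N_{k,l_0,r_0}(n)$ can also be obtained with a small dynamic program instead of generating functions. The sketch below makes that substitution and takes "a rate 1 − 1/n code is possible" to mean that at least $q^{n-1}$ cascadable words of length n exist. Both the counting method and the function names are assumptions of this illustration.

```python
def _step(lead, body, k, q):
    """Extend every counted prefix by one symbol.

    lead[j]: the all-zero prefix of length j (kept only while j <= l)
    body[j]: prefixes containing a non-zero symbol and ending in j zeros (j <= k)
    """
    new_lead = [0] + lead[:-1]                      # append a 'zero' to an all-zero prefix
    new_body = [0] * (k + 1)
    new_body[0] = (q - 1) * (sum(lead) + sum(body)) # append any non-zero symbol
    for j in range(1, k + 1):
        new_body[j] = body[j - 1]                   # append a 'zero' after a non-zero symbol
    return new_lead, new_body


def count_words(n, k, l, r, q=4):
    """Number of q-ary n-words with <= l leading, <= r trailing and <= k consecutive zeros."""
    lead, body = [1] + [0] * l, [0] * (k + 1)
    for _ in range(n):
        lead, body = _step(lead, body, k, q)
    return sum(lead[: r + 1]) + sum(body[: r + 1])


def max_length_method_a(k, q=4, n_limit=2000):
    """Largest n <= n_limit with at least q^(n-1) words, using l0 = k // 2, r0 = k - l0."""
    l0 = k // 2
    r0 = k - l0
    lead, body = [1] + [0] * l0, [0] * (k + 1)
    best = 0
    for n in range(1, n_limit + 1):
        lead, body = _step(lead, body, k, q)
        if sum(lead[: r0 + 1]) + sum(body[: r0 + 1]) >= q ** (n - 1):
            best = n
    return best
```

For q = 4 and k = 1 this sketch gives $n_A = 22$, in agreement with Table II. Python integers are exact, so no rounding is involved in the comparison with $q^{n-1}$.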


B. Sequence replacement technique, Method B

The three sequence replacement techniques published by

Wijngaarden et al. [9] are recursive methods for removing

forbidden substrings from a binary source word. The encoder

removes the forbidden substrings, and the positions of the

forbidden substrings are encoded as binary pointer words, and

subsequently inserted at predeﬁned positions of the codeword.

The sequence replacement techniques are attractive as the

complexity of encoder and decoder is very low, and the

methods are very efﬁcient in terms of rate-capacity quotient. In

software or hardware, it would require a counter, a comparator,

and a few memory elements. The methods are also suited for

high speed implementation, as several steps in the encoding

and decoding procedure can be performed simultaneously.

We assume that both the source and encoded channel data are represented in the same q-ary base. Let $X = (x_1, \ldots, x_{n-1})$, $x_i \in \mathcal{Q}$, be an $(n-1)$-symbol source word, which has to be translated into an n-symbol code word $Y = (y_1, \ldots, y_n)$, $y_i \in \mathcal{Q}$. Obviously, the rate of the code is $(n-1)/n$. The task of the encoder is to translate the source word into a k-constrained word.

The encoder simply starts by appending a ‘1’ to the $(n-1)$-symbol source word, yielding the n-symbol word, denoted by $X1$ (the word $X$ followed by the symbol ‘1’). The encoder scans (from right to left, i.e. from LSB to MSB) the word $X1$, and if this word does not have the forbidden substring $0^{k+1}$, the q-ary codeword $Y = X1$ is transmitted. If, on the other hand, the first occurrence of substring $0^{k+1}$ is found, we invoke the following replacement procedure.

Replacement procedure: Let the source word be denoted by $X_2 0^{k+1} X_1 1$, where, by assumption, the tail $X_1$ has no forbidden substring. The forbidden substring $0^{k+1}$ is removed, yielding the $(n-k-1)$-symbol sequence $X_2 X_1 1$. Let the forbidden substring, $0^{k+1}$, start at position $p_1$, $1 \le p_1 \le n-k-1$. The position $p_1$ is represented by the $(k+2)$-symbol q-ary pointer word, $p = v_1 A v_2 0$, where $v_1, v_2 \in \mathcal{Q} \setminus \{0\}$ and $A$ is any q-ary word of $k-1$ symbols. Note that the number of unique combinations of pointer $p$ equals $(q-1)^2 q^{k-1}$. Subsequently, the tail symbol, ‘1’, of $X_2 X_1 1$ is replaced by the $(k+2)$-symbol q-ary string, $p$, obtaining the sequence $X_2 X_1 p$.

Note that the sequence $X_2 X_1 p$ is of length $n$ (as the starting sequence $X1$). If, after the replacement, the sequence $X_2 X_1 p$ is free of other occurrences of the forbidden substring $0^{k+1}$, then the codeword $Y = X_2 X_1 p$ is sent. Otherwise, the encoder repeats the above sequence replacement procedure for the string $X_2 X_1 p$ etc., until all forbidden substrings have been removed. The decoder can uniquely undo the various replacements and shifts made by the encoder. The space complexity of the encoder and decoder is mainly the look-up table for translating the position $p_1$, $1 \le p_1 \le n-k-1$, into the $(k+2)$-symbol q-ary pointer and vice versa, which amounts to $O(n)$.
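A single replacement step, and its inverse, can be sketched as follows. The text uses a look-up table between positions and pointer words; the sketch replaces it with a mixed-radix mapping, which is an assumption made here for compactness. It covers only the first replacement on a word ending in ‘1’, not the repeated procedure, and all function names are hypothetical.

```python
def position_to_pointer(p1, k, q=4):
    """Map the 1-based position p1 to a pointer word v1 A v2 0 of k + 2 symbols."""
    idx = p1 - 1
    if not 0 <= idx < (q - 1) ** 2 * q ** (k - 1):
        raise ValueError("position cannot be represented by a pointer")
    idx, v1 = divmod(idx, q - 1)
    idx, v2 = divmod(idx, q - 1)
    a = []
    for _ in range(k - 1):              # A: k - 1 unrestricted q-ary symbols
        idx, digit = divmod(idx, q)
        a.append(digit)
    return [v1 + 1] + a + [v2 + 1, 0]   # v1 and v2 are non-zero, the last symbol is 0


def pointer_to_position(p, k, q=4):
    """Inverse of position_to_pointer."""
    v1, a, v2 = p[0] - 1, p[1:k], p[k] - 1
    idx = 0
    for digit in reversed(a):
        idx = idx * q + digit
    return (idx * (q - 1) + v2) * (q - 1) + v1 + 1


def replace_once(word, k, q=4):
    """Remove the rightmost 0^(k+1) from a word ending in '1' and append the pointer."""
    n, zeros = len(word), [0] * (k + 1)
    for start in range(n - k - 2, -1, -1):          # scan from right to left
        if word[start:start + k + 1] == zeros:
            body = word[:start] + word[start + k + 1:-1]
            return body + position_to_pointer(start + 1, k, q)
    return list(word)                                # nothing to replace


def undo_once(code, k, q=4):
    """Undo one replacement: a final symbol 0 signals that a pointer is present."""
    if code[-1] != 0:
        return list(code)
    start = pointer_to_position(code[-(k + 2):], k, q) - 1
    body = code[:-(k + 2)]
    return body[:start] + [0] * (k + 1) + body[start:] + [1]


source = [2, 0, 0, 0, 0, 3, 0, 1]       # (n - 1)-symbol source word, k = 3
word = source + [1]                     # append the symbol '1'
code = replace_once(word, k=3)
```

In this example the forbidden substring starts at position 2, so the tail ‘1’ is replaced by the five-symbol pointer and the word length stays at n = 9.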

As the position $p_1$ is in the range $1 \le p_1 \le n-k-1$, and the number of distinct combinations of the pointer $p$ equals $(q-1)^2 q^{k-1}$, we conclude that the codeword length $n$ is upperbounded by

$$n \le (q-1)^2 q^{k-1} + k + 1, \quad k \ge 2. \tag{9}$$

TABLE II
MAXIMUM LENGTHS n FOR WHICH A RATE (n − 1)/n, k-CONSTRAINED 4-ARY CODE CAN BE CONSTRUCTED.

k    m = k + 1    n_max    n_A     n_B
1    2            25       22      11
2    3            113      106     39
3    4            467      445     148
4    5            1885     1848    581

The code uses one redundant q-ary symbol, so that we conclude that the redundancy of the sequence replacement code, in bits per symbol, is approximately

$$\frac{\log_2 q}{n} \sim \frac{q \log_2 q}{(q-1)^2} \, q^{-k}, \quad k \gg 1. \tag{10}$$

From (6), we infer that the redundancy of k-constrained q-ary sequences is at least

$$\log_2 q - C_k \sim \frac{1}{\ln 2} \, \frac{q-1}{q^2} \, q^{-k}, \quad k \gg 1. \tag{11}$$

The redundancy of the sequence replacement method is a factor of

$$\left( \frac{q}{q-1} \right)^3 \ln q$$

larger than optimal for $k \gg 1$. For DNA-based storage, q = 4, the factor is around 3.29. The above sequence replacement method is efficient in terms of redundancy and space/time hardware requirements as no large look-up tables are needed. For example, for a maximum homopolymer run m = 4, we are able to construct a code of length n = 148 that needs only one redundant nucleotide.
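The numbers quoted above follow directly from the bound (9) and the asymptotic factor. A minimal sketch, with hypothetical function names:

```python
import math


def n_b(k, q=4):
    """Maximum codeword length of the sequence replacement method, from (9)."""
    return (q - 1) ** 2 * q ** (k - 1) + k + 1


def asymptotic_factor(q=4):
    """Ratio of the sequence replacement redundancy to the minimum, for large k."""
    return (q / (q - 1)) ** 3 * math.log(q)


lengths = [n_b(k) for k in range(1, 5)]     # the n_B column of Table II
redundancy = math.log2(4) / n_b(3)          # bits per nucleotide for m = 4
factor = asymptotic_factor()
```

For q = 4 the factor evaluates to about 3.29, and the lengths reproduce the $n_B$ column of Table II.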

Table II shows results of computations for rate $(n-1)/n$, k-constrained codes for q = 4 and various values of $k$, where $n_B$ denotes the maximum $n$ possible with the sequence replacement method (Method B), next to the values of $n_{\max}$ and $n_A$.

C. Translating binary k′-constrained codes into quaternary

k-constrained sequences, q= 4

Very efﬁcient constructions of binary k′-constrained codes

that avoid long repetitions of a ‘zero’ have been published in

the literature, see, for example, the survey in [9]. We show

that after applying a simple coding step to a k′-constrained

binary sequence, we obtain a strand of nucleotides, where the

length of a homopolymer run is at most m=k′

2.

We start with definitions of two simple operations on symbol sequences and their (unique) inverse. Let $x = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$, denote a word of $n$ binary symbols. The first operation is defined as follows. The n-bit sequence, $x$, is translated into an $\frac{n}{2}$-symbol sequence $w$, where two consecutive binary symbols of $x$ are translated into one quaternary symbol $w_i \in \{0, 1, 2, 3\}$, using

$$w_i = 2 x_{2i-1} + x_{2i}, \quad 1 \le i \le \frac{n}{2}.$$

The above operation is denoted by the shorthand notation $w = Z(x)$.


The second operation, usually called precoding, is defined as follows. The word $w = (w_1, \ldots, w_n)$, $w_i \in \{0, 1\}$, is obtained by modulo 2 integration of $x$, that is, by the operation

$$w_i = \bigoplus_{k=1}^{i} x_k = w_{i-1} \oplus x_i, \quad 1 \le i \le n, \tag{12}$$

where the dummy symbol $w_0 = 0$, and the symbol $\oplus$ denotes symbol-wise modulo 2 addition. The above operation is denoted by the shorthand notation $w = I(x)$. Note that the original word $x$ can be uniquely restored by a modulo 2 differentiation operation, defined by

$$x_i = w_i \oplus w_{i-1}, \quad 1 \le i \le n. \tag{13}$$

The above differentiation operation is denoted by $x = I^{-1}(w)$. Clearly,

$$I^{-1}(I(x)) = x. \tag{14}$$

Assume that the binary source data have been converted into a binary k′-constrained sequence, $x = (x_1, \ldots, x_n)$, where $x_i \in \{0, 1\}$, using a suitable k′-constrained code. Then, by definition, substrings in $x$ of more than k′ consecutive ‘zero’s are absent. Note that the operation $w = Z(x)$ will not result in a k-constrained sequence $w$. In order to limit the runlengths of the output word, $w$, we first apply a two-step precoding operation, defined by

$$w = I(I(x)). \tag{15}$$

For example, we can easily verify the three operations on the

sequence x,

x= 011000011111111111001111000111

I(x) = 010000010101010101110101111010

I(I(x)) = 011111100110011001011001010011

Z(I(I(x))) = 133212121121103,

where $x$ is a (k′ = 4)-constrained binary sequence. After the first precoding step, the sequence, $I(x)$, is a regular runlength limited sequence with a maximum ‘zero’ and ‘one’ runlength equal to k′ + 1 (= 5). The second precoding step limits the number of consecutive ‘one’s and ‘zero’s to k′ + 2 in $I(I(x))$, and it also limits the number of consecutive ‘10’ bits to k′ + 2, thus prohibiting the generation of long homopolymer runs.
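The worked example can be reproduced with a few lines of code. The sketch below implements the operations $I$, $I^{-1}$ and $Z$ as defined in the text; the function names are hypothetical, and sequences are handled as lists of integers.

```python
def integrate(bits):
    """Precoding I: w_i = w_{i-1} XOR x_i, with dummy symbol w_0 = 0."""
    out, prev = [], 0
    for b in bits:
        prev ^= b
        out.append(prev)
    return out


def differentiate(bits):
    """Inverse precoding: x_i = w_i XOR w_{i-1}, with w_0 = 0."""
    return [b ^ p for b, p in zip(bits, [0] + bits[:-1])]


def pairs_to_quaternary(bits):
    """Operation Z: w_i = 2 * x_{2i-1} + x_{2i}; the input length must be even."""
    if len(bits) % 2:
        raise ValueError("an even number of bits is required")
    return [2 * bits[i] + bits[i + 1] for i in range(0, len(bits), 2)]


def max_run(seq):
    """Length of the longest run of identical consecutive symbols."""
    longest, run, prev = 0, 0, None
    for s in seq:
        run = run + 1 if s == prev else 1
        longest, prev = max(longest, run), s
    return longest


x = [int(c) for c in "011000011111111111001111000111"]
once = integrate(x)
twice = integrate(once)
w = pairs_to_quaternary(twice)
print("".join(map(str, w)), max_run(w))
```

The run lengths of the intermediate sequences can be inspected with `max_run(once)` and `max_run(twice)`, and `differentiate(integrate(x))` returns `x` as stated in (14).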

In the above example, the 4-ary output sequence

Z(I(I(x))) has a maximum homopolymer run, m= 2.

In general, it can easily be veriﬁed that in case the binary

input sequence, x, is k′-constrained that the 4-ary sequence,

Z(I(I(x))), has a maximum homopolymer run given by

m=k′

2, k′>2.(16)

The above method offers a simple translation of binary k′-

constrained sequences into a strand of nucleotides with limited

homopolymer runs, which creates the opportunity to apply the

vast literature on binary runlength limited coding to DNA-

based storage.

IV. CONCLUSIONS

We have presented coding methods for translating source data into strands of nucleotides with a maximum homopolymer run. We found that the proposed algorithms can be implemented efficiently, and that the information densities of the constructed codes are close to the theoretical maximum. We have proposed a sequence replacement method for k-constrained q-ary data, which yields a significant improvement in coding redundancy over the prior art sequence replacement method for k-constrained binary data. We have shown that, using two simple steps of precoding, it is possible to translate a binary k′-constrained sequence into a strand of nucleotides with a maximum homopolymer run, which creates the opportunity to apply a myriad of prior art binary code constructions to DNA-based storage.

REFERENCES

[1] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6012, pp. 1628-1628, 2012.

[2] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, and G. Seelig, “A DNA-based Archival Storage System,” ACM SIGOPS Operating Systems Review, vol. 50, pp. 637-649, 2016.

[3] K. A. S. Immink, Codes for Mass Data Storage Systems, Second Edition, ISBN 90-74249-27-2, Shannon Foundation Publishers, Eindhoven, Netherlands, 2004.

[4] K. A. S. Immink, “Runlength-Limited Sequences,” Proceedings of the IEEE, vol. 78, no. 11, pp. 1745-1759, Nov. 1990.

[5] C. V. Freiman and A. D. Wyner, “Optimum Block Codes for Noiseless Input Restricted Channels,” Information and Control, vol. 7, pp. 398-415, 1964.

[6] C. E. Shannon, “A Mathematical Theory of Communication,” Bell Syst. Tech. J., vol. 27, pp. 379-423, July 1948.

[7] I. F. Blake, “The Enumeration of Certain Run Length Sequences,” Information and Control, vol. 55, pp. 222-237, 1982.

[8] W. H. Kautz, “Fibonacci Codes for Synchronization Control,” IEEE Trans. Inform. Theory, vol. IT-11, pp. 284-292, 1965.

[9] A. J. de Lind van Wijngaarden and K. A. S. Immink, “Construction of Maximum Run-Length Limited Codes Using Sequence Replacement Techniques,” IEEE Journal on Selected Areas in Communications, vol. 28, pp. 200-207, 2010.

[10] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, “Characterizing and measuring bias in sequence data,” Genome Biol. 14, R51, 2013.