Sequence-Subset Distance and Coding for Error
Control in DNA-based Data Storage
Wentu Song, Kui Cai and Kees A. Schouhamer Immink
Abstract
The process of DNA-based data storage (DNA storage for short) can be mathematically modelled as a communication channel,
termed DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error correcting codes for DNA
storage channel, a new metric, termed the sequence-subset distance, is introduced, which generalizes the Hamming distance to a
distance function defined between any two sets of unordered vectors and helps to establish a uniform framework to design error
correcting codes for DNA storage channel. We further introduce a family of error correcting codes, referred to as sequence-subset
codes, for DNA storage and show that the error-correcting ability of such codes is completely determined by their minimum
distance. We derive some upper bounds on the size of the sequence-subset codes including a tight bound for a special case, a
Singleton-like bound and a Plotkin-like bound. We also propose some constructions, including an optimal construction for that
special case, which imply lower bounds on the size of such codes.
Index Terms
DNA data storage, error-correcting codes, Singleton bound, Plotkin bound.
I. INTRODUCTION
The idea of storing data in synthetic DNA sequences has been around since 1988 [1], and DNA-based data storage has been progressing rapidly in recent years with the development of DNA synthesis and sequencing technology. Compared to traditional magnetic and optical media, DNA storage has competitive advantages, including extremely high density, long durability [7], and low energy consumption [2].
A DNA sequence is mathematically represented by a quaternary sequence, each symbol representing one of the four types of nucleotide bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Basically, in a DNA-based storage system, the original binary data is first encoded into a set of quaternary sequences. Then the corresponding DNA nucleotide sequences (oligos) are synthesized and stored. To retrieve the original data, the stored oligos are sequenced to generate a set of quaternary sequences, which are then decoded to recover the original binary data. The process of DNA synthesis, storage and sequencing can be mathematically modelled as a communication channel, called the DNA storage channel, which is depicted in Fig. 1.
[Figure 1: block diagram — binary data file $F$ → DNA Storage Encoder → DNA nucleotide sequences $x_1, \cdots, x_M$ → DNA Storage Channel → DNA nucleotide sequences $y_1, \cdots, y_{\tilde M}$ → DNA Storage Decoder → binary data file $\hat F$]

Fig. 1. System model of DNA storage: the DNA storage channel is the mathematical model of the process of DNA synthesis, storage and sequencing. A reliable system should guarantee that, with sufficiently high probability, the decoded file $\hat F$ equals the original file $F$.
Unlike conventional magnetic or optical recording systems, the DNA sequences are stored in "pools", where structured addressing is not available. Therefore, the inputs and outputs of the DNA storage channel can only be viewed as sets of unordered DNA sequences.
The output of the DNA storage channel may be distorted by the following five types of errors:
Sequence deletion: One or more of the input sequences are lost. As a result, the number of output sequences is smaller
than the number of input sequences.
Sequence insertion: One or more sequences that do not belong to the set of input sequences are added into the output
sequences. As a result, the number of output sequences is larger than the number of input sequences.
Symbol deletion: One or more symbols in a sequence are removed. As a result, the length of the erroneous sequence is
decreased.
Symbol insertion: One or more symbols are added into a sequence. As a result, the length of the erroneous sequence is
increased.
Wentu Song and Kui Cai are with Singapore University of Technology and Design, Singapore (e-mails: {wentu song, cai kui}@sutd.edu.sg).
Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands (e-mails: immink@turing-machines.com).
Part of this paper was submitted to 2019 IEEE International Symposium on Information Theory, Paris, France, July 2019.
Symbol substitution: One or more symbols in a sequence are replaced by other symbols. In this case, the length of the
erroneous sequence remains unchanged.
Note that sequence deletions and sequence insertions can take place simultaneously. If the number of sequence deletions equals the number of sequence insertions, then the total number of sequences remains unchanged. In this case, the combined effect of sequence deletions and sequence insertions is equivalent to symbol substitutions.
To combat the different types of errors in DNA synthesis and sequencing, various coding techniques are used in DNA storage. Most demonstration studies employ constrained coding combined with classical error-correcting codes (e.g., Reed-Solomon codes) [2]-[10]. In addition, to cope with the lack of ordering of the transmitted sequences, a unique address (index) is added to each sequence.
Codes that can correct $s$ (or fewer) losses of sequences and $e$ (or fewer) substitutions in each of $t$ (or fewer) sequences were studied in [11] by considering the so-called error ball. Codes dealing with insertion/deletion errors were also studied in [11]. Codes that can correct a total of $K$ substitution errors were studied in [12] using sphere-packing arguments, which are essentially the same as the error-ball arguments.
A. Our Contribution
In this paper, we consider error control for the DNA storage channel by introducing a new metric, termed the sequence-subset distance, over the power set of the set of all vectors of a fixed length over a finite alphabet, which is the space of the inputs/outputs of the DNA storage channel. This metric is a generalization of the classical Hamming distance and helps to establish a uniform framework for designing codes for the DNA storage channel that can correct errors of sequence deletion, sequence insertion and symbol substitution.

We study error-correcting codes with respect to the sequence-subset distance, which we refer to as sequence-subset codes, for DNA-based data storage. We show that, similar to codes with respect to the classical Hamming distance, a sequence-subset code $\mathcal C$ can correct any combination of $n_D$ sequence deletions, $n_I$ sequence insertions, and $n_S$ symbol substitutions, provided that $n_S + L\cdot\max\{n_I, n_D\} \le \left\lfloor\frac{d_S(\mathcal C)-1}{2}\right\rfloor$, where $L$ is the length of the sequences and $d_S(\mathcal C)$ is the minimum distance of $\mathcal C$. We derive some upper bounds on the size of sequence-subset codes, including a Singleton-like bound and a Plotkin-like bound.
We give a construction of codes that are optimal with respect to code size for the special case that $L \mid d$ and $M_0^{1/L}$ is an integer, where $M_0 = \frac{d}{L}$. We also give some general constructions, which imply lower bounds on the size of such codes.
B. Other Related Work
In [13], the input and output of the DNA storage channel are both viewed as multi-sets, rather than sets, of DNA molecules, where the numbers of input sequences and output sequences may differ. The fundamental limits of this DNA storage model were investigated under the assumption that each sampled molecule is read in an error-free manner.
A different channel model for DNA storage was studied in [14], where the process of DNA storage is modelled by two successive channels, i.e., the synthesis channel and the sequencing channel, and the output of the sequencing channel is a set of DNA fragments which can be represented by a profile vector. Three types of errors, namely, substitution errors due to synthesis, coverage errors, and $\ell$-gram substitution errors due to sequencing, are considered.
There are other communication channels similar to the DNA storage channel. The permutation channel considered in [15] has vectors over a finite alphabet as input and output, and the transmitted vector is corrupted by a permutation of its coordinates. A permutation channel with impairments was considered in [16], where the input and output are multi-sets, rather than vectors, of symbols from a finite alphabet. Such models are not appropriate for DNA storage because, in these models, sequences are treated at the symbol level and the structural information of the sequences is neglected.
C. Organization
The rest of the paper is organized as follows. In Section II, we introduce the sequence-subset distance and provide the basic properties of codes with respect to the sequence-subset distance. We derive upper bounds on the size of sequence-subset codes in Section III and give some constructions of such codes in Section IV. The paper is concluded in Section V.
D. Notations
The following notations will be used in this paper:
1) For any positive integer $n$, $[n] := \{1, 2, \cdots, n\}$.
2) For any set $A$, $|A|$ denotes the size (i.e., cardinality) of $A$, and $P(A)$ denotes the power set of $A$ (i.e., the collection of all subsets of $A$).
3) For any two sets $X$ and $Y$, $X\backslash Y$ is the set of all elements of $X$ that do not belong to $Y$.
4) For any $n$-tuple $x \in A^n$ and any $i \in [n]$, $x(i)$ denotes the $i$th coordinate of $x$, and hence $x$ is denoted as $x = (x(1), x(2), \cdots, x(n))$.
II. PRELIMINARY
We first introduce the concept of the sequence-subset distance. Then we discuss the error pattern and error correction in the DNA storage channel using codes with respect to the sequence-subset distance.
A. Sequence-Subset Distance
Let $A$ be a fixed finite alphabet. For DNA data storage, typically $A = \{\mathrm{A}, \mathrm{T}, \mathrm{C}, \mathrm{G}\}$, representing the four types of nucleotide bases. In this work, for generality, we assume that $A$ is any fixed finite alphabet of size $q \ge 2$.

Let $L$ be a positive integer. For any $x_1, x_2 \in A^L$, the Hamming distance between $x_1$ and $x_2$, denoted by $d_H(x_1, x_2)$, is defined as the number of coordinates where $x_1$ and $x_2$ differ, that is,
$$d_H(x_1, x_2) := |\{i \in [L];\ x_1(i) \ne x_2(i)\}|.$$
For any two subsets $X_1$ and $X_2$ of $A^L$ such that $|X_1| \le |X_2|$ and any injection $\chi: X_1 \to X_2$, denote
$$d_\chi(X_1, X_2) := \sum_{x \in X_1} d_H(x, \chi(x)) + L(|X_2| - |X_1|). \quad (1)$$
Then a natural way to generalize the Hamming distance to the space of all subsets of $A^L$ is as follows.

Definition 1: For any $X_1, X_2 \subseteq A^L$, without loss of generality assuming $|X_1| \le |X_2|$, the sequence-subset distance between $X_1$ and $X_2$ is defined as
$$d_S(X_1, X_2) = d_S(X_2, X_1) := \min_{\chi \in \mathcal X} d_\chi(X_1, X_2), \quad (2)$$
where $\mathcal X$ is the set of all injections $\chi: X_1 \to X_2$.¹
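For small sets, Definition 1 can be checked directly by enumerating all injections. The following Python sketch is our own illustration (the function names are not from the paper); it computes $d_S$ by brute force, exactly as in (1) and (2):

```python
from itertools import permutations

def hamming(x, y):
    """Hamming distance between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y))

def seq_subset_distance(X1, X2, L):
    """Sequence-subset distance of Definition 1, by brute force:
    minimize d_chi over all injections from the smaller set into the
    larger one, then add the L*(|X2|-|X1|) term of Eq. (1)."""
    S, T = (list(X1), list(X2)) if len(X1) <= len(X2) else (list(X2), list(X1))
    if not S:                      # empty smaller set: only the size-gap term remains
        return L * len(T)
    # permutations(T, len(S)) enumerates all ordered choices of |S| distinct
    # targets in T, i.e., all injections S -> T
    best = min(sum(hamming(x, y) for x, y in zip(S, img))
               for img in permutations(T, len(S)))
    return best + L * (len(T) - len(S))

# toy example over the DNA alphabet with L = 3
X = {("A", "C", "G"), ("T", "T", "T")}
Y = {("A", "C", "T"), ("T", "T", "T"), ("G", "G", "G")}
print(seq_subset_distance(X, Y, 3))   # one substitution + one size gap: 1 + 3 = 4
```

The enumeration is factorial in the set sizes, so this is only a reference implementation for verifying small examples, not a practical decoder.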
We first prove some important properties of the function $d_S(\cdot,\cdot)$; then we will prove that it is indeed a distance function.

First, intuitively, the elements in $X_1 \cap X_2$ should have no effect on the sequence-subset distance between $X_1$ and $X_2$. This is shown to be true by the following lemma and corollary.

Lemma 1: For any $X_1, X_2 \subseteq A^L$ such that $|X_1| \le |X_2|$, there exists a $\chi_0 \in \mathcal X$ such that $d_S(X_1, X_2) = d_{\chi_0}(X_1, X_2)$ and $\chi_0(x) = x$ for all $x \in X_1 \cap X_2$.

Proof: The proof is given in Appendix A.

Corollary 1: For any two subsets $X_1$ and $X_2$ of $A^L$,
$$d_S(X_1, X_2) = d_S(X_1\backslash X_2, X_2\backslash X_1).$$

Proof: This corollary is a direct consequence of Definition 1 and Lemma 1.
Lemma 2: Suppose $X_1, X_2 \subseteq A^L$ such that $|X_1| \le |X_2|$, and suppose $X'_2 \subseteq X_2$ such that $|X_1| \le |X'_2|$. Then
$$d_S(X_1, X'_2) \le d_S(X_1, X_2).$$

Proof: The proof is given in Appendix B.

Now we prove that $d_S(\cdot,\cdot)$ is indeed a distance function (metric) over $P(A^L)$.

Theorem 1: The function $d_S(\cdot,\cdot)$ is a distance function over the power set $P(A^L)$.

Proof: The proof is given in Appendix C.
B. Error Pattern of DNA Storage Channel
In this paper we consider DNA storage channel with sequence deletion/insertion and symbol substitution.2The input of the
channel is a set of unordered sequences
X={x1,x2,··· ,xM} ⊆ AL
and the output is another set of unordered sequences
Y={y1,y2,··· ,y˜
M} ⊆ AL,
where Lis the length of the sequences. Due to the channel noise, Y6=Xis possible. Sequences in the subset XYare correctly
transmitted; Sequences in X\Yare either lost (sequence deletion) or changed to sequences in Y\X(symbol substitution);
Sequences in Y\Xare either excessive (sequence insertion) or obtained from some sequences in X\Y(symbol substitution).
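The channel behaviour described above can be simulated directly. Below is a minimal sketch of our own (names and parameters are illustrative, not from the paper) that applies $n_D$ sequence deletions, $n_S$ symbol substitutions and $n_I$ sequence insertions to an input set. Because the output is a set, a corrupted or inserted sequence that happens to collide with a stored one can shrink the output further; the sketch does not guard against that corner case.

```python
import random

def dna_channel(X, n_D, n_I, n_S, L, alphabet="ACGT", seed=0):
    """Toy DNA storage channel producing error pattern (n_I, n_D, n_S)."""
    rng = random.Random(seed)
    pool = [list(x) for x in X]
    for _ in range(n_D):                          # sequence deletions
        pool.pop(rng.randrange(len(pool)))
    for _ in range(n_S):                          # symbol substitutions
        seq = rng.choice(pool)
        i = rng.randrange(L)
        seq[i] = rng.choice([a for a in alphabet if a != seq[i]])
    for _ in range(n_I):                          # sequence insertions
        pool.append([rng.choice(alphabet) for _ in range(L)])
    return {tuple(s) for s in pool}

X = {("A", "C", "G", "T"), ("G", "G", "G", "G"), ("T", "A", "C", "A")}
# at most 3 - 1 + 1 = 3 output sequences (collisions can shrink the set)
Y = dna_channel(X, n_D=1, n_I=1, n_S=2, L=4)
print(sorted(Y))
```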
¹A more accurate notation for the set $\mathcal X$ is $\mathcal X_{X_1,X_2}$, because it depends on the subsets $X_1$ and $X_2$. However, we can safely omit the subscripts because they are easily specified by the context.
²Symbol insertions/deletions can be handled as sequence insertions or symbol substitutions as follows. For each received sequence $y$ of length $L' < L$, replace $y$ by $y' = (y, z)$, where $z$ is a randomly chosen sequence of length $L - L'$; for each received sequence $y$ of length $L' > L$, replace $y$ by $y'$, where $y'$ is the subsequence of $y$ formed by the first $L$ coordinates of $y$. Then $y'$ is either a sequence insertion or differs from its original by at most $L$ symbol substitutions.
Let $n_I$, $n_D$ and $n_S$ denote the number of sequence insertions, sequence deletions and symbol substitutions, respectively. Then we call the 3-tuple $(n_I, n_D, n_S)$ the error pattern of $Y$, and we have the following lemma.

Lemma 3: Suppose the channel input is $X$ and the output is $Y$ with error pattern $(n_I, n_D, n_S)$. Then
$$d_S(X, Y) \le n_S + L\cdot\max\{n_I, n_D\}.$$
Proof: Note that we can always partition the two subsets $X\backslash Y$ and $Y\backslash X$ as
$$X\backslash Y = X_D \cup X_S \quad\text{and}\quad Y\backslash X = Y_I \cup Y_S,$$
where $X_D$ is the set of lost input sequences, $X_S$ is the set of input sequences that are changed to $Y_S$ by symbol substitutions, and $Y_I$ is the set of sequences inserted into $Y$. Clearly, we have
$$n_I = |Y_I| \quad\text{and}\quad n_D = |X_D|.$$
Moreover, $|X_S| = |Y_S|$ and there exists a bijection $\chi: X_S \to Y_S$ such that for each $x \in X_S$, $\chi(x)$ is the erroneous version of $x$ after symbol substitutions. Hence, we have
$$n_S = \sum_{x \in X_S} d_H(x, \chi(x)).$$
For further discussion, we need to consider the following two cases.

Case 1: $n_I \le n_D$. In this case, $|Y_I| = n_I \le n_D = |X_D|$ and $|Y\backslash X| \le |X\backslash Y|$. So there exists an injection $\chi_0: Y_I \to X_D$, and we can let $\bar\chi: Y\backslash X \to X\backslash Y$ be such that
$$\bar\chi(y) = \begin{cases} \chi^{-1}(y) & \text{if } y \in Y_S; \\ \chi_0(y) & \text{if } y \in Y_I. \end{cases}$$
Since $|X\backslash Y| - |Y\backslash X| = |X_D| - |Y_I| = n_D - n_I$, then by (1),
$$d_{\bar\chi}(Y\backslash X, X\backslash Y) = \sum_{y \in Y\backslash X} d_H(y, \bar\chi(y)) + L\cdot(|X\backslash Y| - |Y\backslash X|)$$
$$= \sum_{y \in Y_S} d_H(y, \chi^{-1}(y)) + \sum_{y \in Y_I} d_H(y, \chi_0(y)) + L\cdot(n_D - n_I)$$
$$\le n_S + L\cdot n_I + L\cdot(n_D - n_I) = n_S + L\cdot n_D = n_S + L\cdot\max\{n_I, n_D\},$$
where the inequality comes from the simple fact that $d_H(z, z') \le L$ for any $z, z' \in A^L$. Hence, by Corollary 1 and Definition 1, we have
$$d_S(X, Y) = d_S(X\backslash Y, Y\backslash X) \le d_{\bar\chi}(Y\backslash X, X\backslash Y) \le n_S + L\cdot\max\{n_I, n_D\}.$$

Case 2: $n_I > n_D$. In this case, there exists an injection $\chi_0: X_D \to Y_I$, and we can let $\bar\chi: X\backslash Y \to Y\backslash X$ be such that
$$\bar\chi(x) = \begin{cases} \chi(x) & \text{if } x \in X_S; \\ \chi_0(x) & \text{if } x \in X_D. \end{cases}$$
Since $|Y\backslash X| - |X\backslash Y| = |Y_I| - |X_D| = n_I - n_D$, then by (1),
$$d_{\bar\chi}(X\backslash Y, Y\backslash X) = \sum_{x \in X\backslash Y} d_H(x, \bar\chi(x)) + L\cdot(|Y\backslash X| - |X\backslash Y|)$$
$$= \sum_{x \in X_S} d_H(x, \chi(x)) + \sum_{x \in X_D} d_H(x, \chi_0(x)) + L\cdot(n_I - n_D)$$
$$\le n_S + L\cdot n_D + L\cdot(n_I - n_D) = n_S + L\cdot n_I = n_S + L\cdot\max\{n_I, n_D\}.$$
Hence, similar to Case 1, we have
$$d_S(X, Y) = d_S(Y\backslash X, X\backslash Y) \le d_{\bar\chi}(X\backslash Y, Y\backslash X) \le n_S + L\cdot\max\{n_I, n_D\}.$$
In both cases, we have $d_S(X, Y) \le n_S + L\cdot\max\{n_I, n_D\}$, which completes the proof.
For the decoder, upon receiving a subset $Y \subseteq A^L$, its task is to find a possible input subset $\hat X \subseteq A^L$ that is most similar to $Y$. By the above discussion and Corollary 1, the sequence-subset distance is clearly a good choice of metric for the similarity between $Y$ and $\hat X$. In the next subsection, we discuss error correction in the DNA storage channel using codes with respect to the sequence-subset distance.
C. Codes with Sequence-Subset Distance
A sequence-subset code over $A^L$ is a subset $\mathcal C$ of the power set $P(A^L)$ of the set $A^L$. We call each element of $A^L$ a sequence and call $L$ the sequence length of $\mathcal C$. The size $|\mathcal C|$ of $\mathcal C$ is called the code size of $\mathcal C$. In contrast, for each codeword $X \in \mathcal C$, the size of $X$ (i.e., the number of sequences contained in $X$) is called the codeword size of $X$. The maximum of the codeword sizes of $\mathcal C$, i.e., $M = \max\{|X|;\ X \in \mathcal C\}$, is called the maximal codeword size of $\mathcal C$. A sequence-subset code $\mathcal C$ is said to have constant codeword size if all codewords of $\mathcal C$ have the same codeword size.

The code rate of $\mathcal C$ is defined as
$$R = \frac{\log_q |\mathcal C|}{\log_q \sum_{m=0}^{M} \binom{q^L}{m}},$$
where $q = |A|$ and $\sum_{m=0}^{M} \binom{q^L}{m}$ is the number of all subsets of $A^L$ of size not greater than $M$.
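As a quick numerical illustration of this definition (our own sketch, with made-up parameters), the rate can be evaluated directly; the denominator counts all subsets of $A^L$ of size at most $M$:

```python
from math import comb, log

def code_rate(q, L, M, code_size):
    """Code rate R = log_q |C| / log_q (sum_{m=0}^{M} C(q^L, m)).
    The denominator counts all subsets of A^L of size <= M."""
    denom = sum(comb(q ** L, m) for m in range(M + 1))
    return log(code_size) / log(denom)   # ratio of natural logs; the base q cancels

# e.g. q = 4, L = 6, M = 3 and a hypothetical code with 10**5 codewords
print(round(code_rate(4, 6, 3, 10 ** 5), 4))
```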
The minimum distance of a sequence-subset code $\mathcal C$, denoted by $d_S(\mathcal C)$, is the minimum of the sequence-subset distance between any two distinct codewords of $\mathcal C$, that is,
$$d_S(\mathcal C) = \min\{d_S(X, X');\ X, X' \in \mathcal C \text{ and } X \ne X'\}.$$
In general, $L$, $M$, $|\mathcal C|$ and $d_S(\mathcal C)$ are the main parameters of $\mathcal C$, and we will call $\mathcal C$ an $(L, M, |\mathcal C|, d_S(\mathcal C))_q$ code, where $q$ is the size of the alphabet $A$.
Let $\mathcal C \subseteq P(A^L)$ be a sequence-subset code. We denote $\bar{\mathcal C} = \{\bar X;\ X \in \mathcal C\}$, where $\bar X = A^L\backslash X$. By Corollary 1, for any $X_1, X_2 \in \mathcal C$, we have $d_S(\bar X_1, \bar X_2) = d_S(\bar X_1\backslash \bar X_2, \bar X_2\backslash \bar X_1) = d_S(X_1, X_2)$. So $\mathcal C$ and $\bar{\mathcal C}$ have the same sequence length $L$, code size $|\mathcal C| = |\bar{\mathcal C}|$ and minimum distance $d_S(\mathcal C) = d_S(\bar{\mathcal C})$. Hence, for a sequence-subset code with constant codeword size $M$, it is reasonable to assume $M \le \frac{|A|^L}{2}$; otherwise, we can consider $\bar{\mathcal C}$, which has constant codeword size $\bar M = |A|^L - M \le \frac{|A|^L}{2}$.
A minimum-distance decoder for $\mathcal C$ is a function $D: P(A^L) \to \mathcal C$ such that for any $Y \in P(A^L)$,
$$D(Y) = \arg\min_{X' \in \mathcal C} d_S(X', Y).$$

Theorem 2: Suppose $\mathcal C$ has minimum distance $d_S(\mathcal C)$ and
$$n_S + L\cdot\max\{n_I, n_D\} \le \left\lfloor\frac{d_S(\mathcal C) - 1}{2}\right\rfloor. \quad (3)$$
Then any error of pattern $(n_I, n_D, n_S)$ can be corrected by the minimum-distance decoder for $\mathcal C$.

Proof: Let $X$ be the set of input sequences and $Y$ be the set of output sequences of the DNA storage channel. By Lemma 3, if $Y$ has error pattern $(n_I, n_D, n_S)$, then
$$d_S(X, Y) \le n_S + L\cdot\max\{n_I, n_D\}.$$
Combining this with (3), we have
$$d_S(X, Y) \le \left\lfloor\frac{d_S(\mathcal C) - 1}{2}\right\rfloor.$$
So $X = \arg\min_{X' \in \mathcal C} d_S(X', Y) = D(Y)$, and hence $X$ can be correctly recovered by the minimum-distance decoder.

In [11] and [12], it was assumed that the number of output sequences is always smaller than the number of input sequences. In this work, we drop this assumption and allow the number of output sequences of the DNA storage channel to be larger than the number of input sequences.
III. BOUNDS ON THE SIZE OF SEQUENCE-SUBSET CODES

Let $S_q(L, M, d)$ denote the maximum number of codewords in a sequence-subset code over a $q$-ary alphabet with sequence length $L$, constant codeword size $M$ and minimum sequence-subset distance at least $d$. A $q$-ary sequence-subset code is said to be optimal (with respect to code size) if it has the largest possible code size of any $q$-ary sequence-subset code with the given parameters $L$, $M$ and $d$. In this section, we always assume that $A$ is an alphabet of size $q$. We will derive some upper bounds on $S_q(L, M, d)$.

Clearly, for any sequence-subset code $\mathcal C \subseteq P(A^L)$ with constant codeword size $M$, its minimum distance satisfies $d_S(\mathcal C) \le LM$, and hence $M \ge \frac{d_S(\mathcal C)}{L}$. For this reason, in the following, we always assume that $d \le LM$, or equivalently, $M \ge \frac{d}{L}$.
A. Upper Bound for the Special Case $L \mid d$

First, consider the special case that $M_0 = \frac{d}{L}$. Since $M_0$ is an integer, we need to further assume that $L \mid d$. Then we have the following upper bound on $S_q(L, M_0, d)$.

Theorem 3: Suppose $L \mid d$ and $M_0 = \frac{d}{L}$. We have
$$S_q(L, M_0, d) \le \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor. \quad (4)$$
Proof: Let $\mathcal C = \{X_1, X_2, \cdots, X_N\} \subseteq P(A^L)$ be an arbitrary sequence-subset code with constant codeword size $M_0$ and minimum distance $d$, where for each $i \in [N]$, $X_i = \{x_{i,1}, x_{i,2}, \cdots, x_{i,M_0}\} \subseteq A^L$. We need to prove $N \le q M_0^{-\frac{1}{L}}$. For each $\ell \in [L]$ and $i \in [N]$, let
$$W_{i,\ell} = \bigcup_{j \in [M_0]} \{x_{i,j}(\ell)\}.$$
Note that the minimum distance of $\mathcal C$ is $d = LM_0$. Then from Definition 1, it is necessary that for any distinct $i_1, i_2 \in [N]$ and any (not necessarily distinct) $j_1, j_2 \in [M_0]$, $d_H(x_{i_1,j_1}, x_{i_2,j_2}) = L$, which implies that for any $\ell \in [L]$ and any $(j_1, j_2, \cdots, j_N) \in [M_0]^N$, the symbols $x_{1,j_1}(\ell), x_{2,j_2}(\ell), \cdots, x_{N,j_N}(\ell)$ are distinct elements of $A$. Hence, for each fixed $\ell \in [L]$, the sets $W_{1,\ell}, W_{2,\ell}, \cdots, W_{N,\ell}$ are mutually disjoint subsets of $A$, which implies that
$$\sum_{i=1}^{N} |W_{i,\ell}| \le |A| = q. \quad (5)$$
By the construction of $W_{i,\ell}$, for each $i \in [N]$ and $j \in [M_0]$, we have $x_{i,j} \in W_{i,1}\times W_{i,2}\times\cdots\times W_{i,L}$, which implies that $X_i = \{x_{i,1}, x_{i,2}, \cdots, x_{i,M_0}\} \subseteq W_{i,1}\times W_{i,2}\times\cdots\times W_{i,L}$, and hence we have
$$|W_{i,1}\times W_{i,2}\times\cdots\times W_{i,L}| = \prod_{\ell=1}^{L} |W_{i,\ell}| \ge |X_i| = M_0. \quad (6)$$
Now, consider (5). By the inequality of arithmetic and geometric means, for each $\ell \in [L]$, we have
$$\frac{q}{N} \ge \frac{1}{N}\sum_{i=1}^{N} |W_{i,\ell}| \ge \left(\prod_{i=1}^{N} |W_{i,\ell}|\right)^{\frac{1}{N}}.$$
Combining this with (6), we have
$$\left(\frac{q}{N}\right)^L \ge \prod_{\ell=1}^{L}\left(\prod_{i=1}^{N} |W_{i,\ell}|\right)^{\frac{1}{N}} = \left(\prod_{i=1}^{N}\prod_{\ell=1}^{L} |W_{i,\ell}|\right)^{\frac{1}{N}} \ge \left(M_0^N\right)^{\frac{1}{N}} = M_0.$$
From this we have $\frac{q}{N} \ge M_0^{\frac{1}{L}}$, which implies $N \le q M_0^{-\frac{1}{L}}$. Hence,
$$S_q(L, M_0, d) \le q M_0^{-\frac{1}{L}}.$$
Since $S_q(L, M_0, d)$ is an integer,
$$S_q(L, M_0, d) \le \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor,$$
which completes the proof.
B. Plotkin-like Bound

We present the Plotkin-like bound for sequence-subset codes as the following theorem.

Theorem 4 (Plotkin-like Bound): Let $\mathcal C$ be an $(L, M, N, d)_q$ code such that $rLM < d$, where $r = 1 - \frac{1}{q}$. Then
$$N \le \frac{d}{d - rLM}.$$

Proof: Our proof of this theorem is similar to the proof of [17, Theorem 2.2.1].

Suppose $\mathcal C = \{X_1, X_2, \cdots, X_N\}$ such that for each $i \in [N]$, $X_i = \{x_{i,1}, x_{i,2}, \cdots, x_{i,M}\} \subseteq A^L$. First, we have the following claim, which we will prove later.

Claim 1: For any distinct $i_1, i_2 \in [N]$, we have
$$d_S(X_{i_1}, X_{i_2}) \le \frac{1}{M}\sum_{j_1, j_2 \in [M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}).$$

Now, let
$$A = \sum_{i_1, i_2 \in [N]}\ \sum_{j_1, j_2 \in [M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}).$$
Since $d$ is the minimum distance of $\mathcal C$, by the averaging principle [18], we have
$$d \le \binom{N}{2}^{-1}\sum_{\{i_1,i_2\}\subseteq[N]} d_S(X_{i_1}, X_{i_2}) = \frac{1}{2}\binom{N}{2}^{-1}\sum_{\substack{i_1,i_2\in[N] \\ i_1\ne i_2}} d_S(X_{i_1}, X_{i_2})$$
$$\le \frac{1}{2}\binom{N}{2}^{-1}\sum_{i_1,i_2\in[N]} d_S(X_{i_1}, X_{i_2}) \le \frac{1}{N(N-1)}\sum_{i_1,i_2\in[N]} \frac{1}{M}\sum_{j_1,j_2\in[M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}) = \frac{1}{N(N-1)}\cdot\frac{1}{M}\cdot A, \quad (7)$$
where the last inequality is obtained by Claim 1.

For each $a \in A$ and $\ell \in [L]$, let $n_{\ell,a}$ be the number of pairs $(i, j) \in [N]\times[M]$ such that $x_{i,j}(\ell) = a$. Then for each fixed $\ell \in [L]$, we have
$$\sum_{a\in A} n_{\ell,a} = NM. \quad (8)$$
Moreover, we have
$$A = \sum_{i_1,i_2\in[N]}\sum_{j_1,j_2\in[M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}) = \sum_{\ell=1}^{L}\sum_{a\in A} n_{\ell,a}(NM - n_{\ell,a}) = L(NM)^2 - \sum_{\ell=1}^{L}\sum_{a\in A} n_{\ell,a}^2. \quad (9)$$
For each $\ell \in [L]$, by the Cauchy-Schwarz inequality,
$$\left(\sum_{a\in A} n_{\ell,a}\right)^2 \le q\sum_{a\in A} n_{\ell,a}^2,$$
where $q = |A|$. Combining this with (9), we obtain
$$A \le L(NM)^2 - \sum_{\ell=1}^{L}\frac{1}{q}\left(\sum_{a\in A} n_{\ell,a}\right)^2 = L(NM)^2 - \sum_{\ell=1}^{L}\frac{1}{q}(NM)^2 = \left(1 - \frac{1}{q}\right)L(NM)^2, \quad (10)$$
where the first equality is obtained from (8). Combining (7) and (10), we obtain
$$d \le \frac{1}{N(N-1)}\cdot\frac{1}{M}\cdot\left(1 - \frac{1}{q}\right)L(NM)^2 = \frac{N}{N-1}\,rLM.$$
Solving for $N$ from the above inequality, we obtain
$$N \le \frac{d}{d - rLM},$$
where $r = 1 - \frac{1}{q}$.

To complete the proof of Theorem 4, we still need to prove Claim 1.

Proof of Claim 1: Let $S_M$ denote the permutation group on $[M]$. Note that for any $j_1, j_2 \in [M]$, not necessarily distinct, there are $(M-1)!$ permutations $\chi \in S_M$ such that $\chi(j_1) = j_2$. So we have
$$\sum_{\chi\in S_M}\sum_{j\in[M]} d_H(x_{i_1,j}, x_{i_2,\chi(j)}) = (M-1)!\sum_{j_1,j_2\in[M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}). \quad (11)$$
Further, by Definition 1 and the averaging principle [18], we have
$$d_S(X_{i_1}, X_{i_2}) \le \frac{1}{M!}\sum_{\chi\in S_M} d_\chi(X_{i_1}, X_{i_2}) = \frac{1}{M!}\sum_{\chi\in S_M}\sum_{j\in[M]} d_H(x_{i_1,j}, x_{i_2,\chi(j)})$$
$$= \frac{(M-1)!}{M!}\sum_{j_1,j_2\in[M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}) = \frac{1}{M}\sum_{j_1,j_2\in[M]} d_H(x_{i_1,j_1}, x_{i_2,j_2}),$$
where the second equality comes from (11).
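Claim 1 is easy to sanity-check numerically. For two codewords of equal size $M$, the injections of Definition 1 are exactly the bijections, so $d_S$ is a minimum over the $M!$ matchings; the following sketch (our own illustration) compares it against the pairwise average on random sets:

```python
from itertools import permutations
import random

rng = random.Random(7)
q, L, M = 4, 5, 3

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def d_S(X1, X2):
    """d_S for equal-size sets: minimum over all bijections (permutations)."""
    return min(sum(hamming(x, y) for x, y in zip(X1, p))
               for p in permutations(X2))

def rand_set():
    S = set()
    while len(S) < M:
        S.add(tuple(rng.randrange(q) for _ in range(L)))
    return list(S)

for _ in range(200):
    X1, X2 = rand_set(), rand_set()
    avg = sum(hamming(x, y) for x in X1 for y in X2) / M
    assert d_S(X1, X2) <= avg      # Claim 1 holds on every trial
print("Claim 1 verified on 200 random pairs")
```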
C. Singleton-like Bound

For each code $\mathcal C = \{X_1, X_2, \cdots, X_N\} \subseteq P(A^L)$, denote
$$V(\mathcal C) = \bigcup_{i=1}^{N} X_i. \quad (12)$$
Further, let $\bar S_q(L, M, K, d)$ denote the maximum number of codewords in a sequence-subset code $\mathcal C$ over a $q$-ary alphabet $A$ with sequence length $L$, constant codeword size $M$, minimum sequence-subset distance at least $d$, and $|V(\mathcal C)| \le K$. Clearly, for any $K \le q^L$,
$$\bar S_q(L, M, K, d) \le \bar S_q(L, M, q^L, d) = S_q(L, M, d). \quad (13)$$
We first prove a recursive bound on $\bar S_q(L, M, K, d)$ as the following theorem.

Theorem 5: Suppose $d \le LM$ and $K \le q^L$. We have
$$\bar S_q(L, M, K, d) \le \frac{K}{M}\,\bar S_q(L, M-1, K-1, d). \quad (14)$$

Proof: Let $\mathcal C = \{X_1, X_2, \cdots, X_N\} \subseteq P(A^L)$ be a sequence-subset code with constant codeword size $M$ and minimum distance at least $d$, such that $|V(\mathcal C)| \le K$ and the code size is $|\mathcal C| = N = \bar S_q(L, M, K, d)$, where $X_i \subseteq A^L$ for each $i \in [N]$.
For each $x \in V(\mathcal C)$, let
$$\mathcal C(x) = \{X \in \mathcal C;\ x \in X\}$$
and
$$\check{\mathcal C}(x) = \{\check X = X\backslash\{x\};\ X \in \mathcal C(x)\}.$$
Then $\check{\mathcal C}(x) \subseteq P(A^L)$ has constant codeword size $M-1$, size $|\check{\mathcal C}(x)| = |\mathcal C(x)|$, and $|V(\check{\mathcal C}(x))| \le K-1$.

Moreover, for any distinct $\check X_{i_1}, \check X_{i_2} \in \check{\mathcal C}(x)$, by the construction of $\check{\mathcal C}(x)$, we have $\check X_{i_1} = X_{i_1}\backslash\{x\}$ and $\check X_{i_2} = X_{i_2}\backslash\{x\}$ for some distinct $X_{i_1}, X_{i_2} \in \mathcal C(x)$. So $\check X_{i_1}\backslash \check X_{i_2} = X_{i_1}\backslash X_{i_2}$ and $\check X_{i_2}\backslash \check X_{i_1} = X_{i_2}\backslash X_{i_1}$, and hence by Corollary 1,
$$d_S(\check X_{i_1}, \check X_{i_2}) = d_S(X_{i_1}, X_{i_2}).$$
So we have $d_S(\check{\mathcal C}(x)) = d_S(\mathcal C(x))$. On the other hand, since $\mathcal C(x) \subseteq \mathcal C$, then $d_S(\mathcal C(x)) \ge d_S(\mathcal C) \ge d$. Hence, $d_S(\check{\mathcal C}(x)) \ge d$.

By the above discussion, for each $x \in V(\mathcal C)$, we have
$$|\check{\mathcal C}(x)| \le \bar S_q(L, M-1, K-1, d). \quad (15)$$
Now, we estimate $|\check{\mathcal C}(x)|$. Since $|\check{\mathcal C}(x)| = |\mathcal C(x)|$, it is sufficient to estimate $|\mathcal C(x)|$. Denote $V(\mathcal C) = \{x_1, x_2, \cdots, x_{\bar K}\}$, where $\bar K = |V(\mathcal C)|$. Consider the $N\times\bar K$ matrix $I = (a_{i,j})$ such that $a_{i,j} = 1$ if $x_j \in X_i$, and $a_{i,j} = 0$ otherwise. Note that the number of ones in row $i$ of $I$ is $|X_i| = M$ and the number of ones in column $j$ of $I$ is $|\mathcal C(x_j)|$. By counting the total number of ones in $I$, we obtain
$$\sum_{x\in V(\mathcal C)} |\mathcal C(x)| = \sum_{X\in\mathcal C} |X| = MN.$$
By the averaging principle [18], there exists an $x_{j_0} \in V(\mathcal C)$ such that
$$|\mathcal C(x_{j_0})| \ge \frac{MN}{|V(\mathcal C)|} \ge \frac{MN}{K}.$$
Hence,
$$N \le \frac{K}{M}|\mathcal C(x_{j_0})| = \frac{K}{M}|\check{\mathcal C}(x_{j_0})|.$$
Note that $|\mathcal C| = \bar S_q(L, M, K, d) = N$. Then we have
$$\bar S_q(L, M, K, d) \le \frac{K}{M}|\check{\mathcal C}(x_{j_0})|.$$
This, combined with (15), implies that
$$\bar S_q(L, M, K, d) \le \frac{K}{M}\,\bar S_q(L, M-1, K-1, d),$$
which completes the proof.
Now, we can prove a Singleton-like bound for sequence-subset codes as follows.

Theorem 6 (Singleton-like Bound): Suppose $rLM_0 < d \le LM_0$, where $r = 1 - \frac{1}{q}$ and $M_0 = \lceil\frac{d}{L}\rceil$. Then
$$S_q(L, M, d) \le \left(\prod_{k=0}^{M-M_0-1}\frac{q^L-k}{M-k}\right)\cdot f(L, M_0, q),$$
where
$$f(L, M_0, q) = \begin{cases} \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor & \text{if } d = LM_0; \\[4pt] \left\lfloor\dfrac{d}{d - rLM_0}\right\rfloor & \text{if } rLM_0 < d < LM_0. \end{cases} \quad (16)$$
Proof: Denote $\Delta = M - M_0$. Repeatedly using Theorem 5, we obtain
$$\bar S_q(L, M, q^L, d) \le \left(\prod_{k=0}^{\Delta-1}\frac{q^L-k}{M-k}\right)\bar S_q(L, M_0, q^L - \Delta, d).$$
Moreover, according to (13), we have
$$S_q(L, M, d) = \bar S_q(L, M, q^L, d)$$
and
$$\bar S_q(L, M_0, q^L - \Delta, d) \le \bar S_q(L, M_0, q^L, d) = S_q(L, M_0, d).$$
Combining the above three relations, we have
$$S_q(L, M, d) \le \left(\prod_{k=0}^{M-M_0-1}\frac{q^L-k}{M-k}\right)\cdot S_q(L, M_0, d). \quad (17)$$
Let $f(L, M_0, q)$ be defined as in (16). By Theorem 3 and Theorem 4, we have
$$S_q(L, M_0, d) \le f(L, M_0, q).$$
Combining this with (17), we have
$$S_q(L, M, d) \le \left(\prod_{k=0}^{M-M_0-1}\frac{q^L-k}{M-k}\right)\cdot f(L, M_0, q),$$
which completes the proof.
Remark 1: It is easy to see that
$$\binom{q^L}{M} = \left(\prod_{k=0}^{M-M_0-1}\frac{q^L-k}{M-k}\right)\binom{q^L-M+M_0}{M_0}.$$
So the bound in Theorem 6 gives a bound on the code rate as
$$\frac{S_q(L, M, d)}{\binom{q^L}{M}} \le \binom{q^L-M+M_0}{M_0}^{-1}\cdot f(L, M_0, q),$$
where $f(L, M_0, q)$ is defined as in (16).
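The bound of Theorem 6 is easy to evaluate numerically. The sketch below is our own code (it assumes $M_0 = \lceil d/L\rceil$ as in the theorem, and the example parameters are made up):

```python
from math import ceil, comb, floor, prod

def singleton_like_bound(q, L, M, d):
    """Evaluate the right-hand side of the Theorem 6 bound on S_q(L, M, d)."""
    M0 = ceil(d / L)
    r = 1 - 1 / q
    if d == L * M0:                      # case handled by Theorem 3
        f = floor(q * M0 ** (-1 / L))
    elif r * L * M0 < d:                 # case handled by the Plotkin-like bound
        f = floor(d / (d - r * L * M0))
    else:
        raise ValueError("Theorem 6 requires r*L*M0 < d <= L*M0")
    lead = prod((q ** L - k) / (M - k) for k in range(M - M0))
    return lead * f

# toy parameters: q = 4, L = 4, M = 3, d = 8, so M0 = 2 and d = L*M0
print(singleton_like_bound(4, 4, 3, 8))
```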
IV. CONSTRUCTIONS OF SEQUENCE-SUBSET CODES

In this section, we give some constructions of sequence-subset codes. As in Section III, we will always assume that $A$ is an alphabet of size $q$.
A. Construction of Optimal Codes

In this subsection, we give a construction of optimal $(L, M_0, d)_q$ codes (with respect to code size) for the special case that $L \mid d$ and $M_0^{\frac{1}{L}}$ is an integer, where $M_0 = \frac{d}{L}$.

Theorem 7: Suppose $L \mid d$ and $M_0^{\frac{1}{L}}$ is an integer, where $M_0 = \frac{d}{L}$. Then there exists an $(L, M_0, d)_q$ sequence-subset code whose code size is $N = \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor$.

Proof: Since $N = \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor$, we have $N \le q M_0^{-\frac{1}{L}}$, and hence
$$q \ge N M_0^{\frac{1}{L}}.$$
So we can partition $A$ into $N$ mutually disjoint subsets $W_1, W_2, \cdots, W_N$ such that for each $i \in [N]$, $|W_i| \ge M_0^{\frac{1}{L}}$. Then the size of the Cartesian product $W_i^L$ of $L$ copies of $W_i$ is at least $M_0$, and hence we can pick a subset $X_i = \{x_{i,1}, x_{i,2}, \cdots, x_{i,M_0}\} \subseteq W_i^L$. Now, let $\mathcal C = \{X_i;\ i \in [N]\}$. Then $\mathcal C \subseteq P(A^L)$ is a sequence-subset code with constant codeword size $M_0$ and $|\mathcal C| = N = \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor$. Moreover, since $W_1, W_2, \cdots, W_N$ are mutually disjoint, it is easy to verify that for any distinct $i_1, i_2 \in [N]$ and any $j_1, j_2 \in [M_0]$,
$$d_H(x_{i_1,j_1}, x_{i_2,j_2}) = L.$$
So for any distinct $i_1, i_2 \in [N]$,
$$d_S(X_{i_1}, X_{i_2}) = LM_0 = d,$$
which implies that $d_S(\mathcal C) = d$.

In summary, $\mathcal C$ is an $(L, M_0, d)_q$ sequence-subset code of size $N = \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor$.
Note that by Theorem 3, if $L \mid d$ and $M_0 = \frac{d}{L}$, then $S_q(L, M_0, d) \le \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor$. So the code $\mathcal C$ constructed in Theorem 7 is optimal with respect to code size, and we have the following corollary.

Corollary 2: Suppose $L \mid d$ and $M_0^{\frac{1}{L}}$ is an integer, where $M_0 = \frac{d}{L}$. We have
$$S_q(L, M_0, d) = \left\lfloor q M_0^{-\frac{1}{L}}\right\rfloor.$$
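The construction behind Theorem 7 is simple enough to instantiate directly. The following sketch is our own, with illustrative parameters $q = 8$, $L = 2$, $d = 8$ (so $M_0 = 4$ and $M_0^{1/L} = 2$); it partitions the alphabet into blocks and checks that all cross-codeword sequence pairs are at Hamming distance $L$, which forces $d_S(X_{i_1}, X_{i_2}) = LM_0 = d$:

```python
from itertools import product

q, L, d = 8, 2, 8                 # illustrative parameters with L | d
M0 = d // L                       # M0 = 4, and M0**(1/L) = 2 is an integer
m = round(M0 ** (1 / L))          # alphabet symbols needed per block
N = q // m                        # floor(q * M0**(-1/L)) = 4 codewords
A = list(range(q))
W = [A[i * m:(i + 1) * m] for i in range(N)]           # disjoint blocks W_1..W_N
code = [list(product(Wi, repeat=L))[:M0] for Wi in W]  # X_i: M0 sequences from W_i^L

# sequences from different codewords differ in every coordinate
ok = all(sum(a != b for a, b in zip(x, y)) == L
         for i, Xi in enumerate(code) for j, Xj in enumerate(code) if i != j
         for x in Xi for y in Xj)
print(len(code), ok)   # 4 codewords, cross-distance property holds: "4 True"
```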
B. Construction Based on Binary Codes
In the rest of this section, to distinguish from sequence-subset codes (i.e., subsets of the power set $P(A^L)$ of the set $A^L$), we will call any subset of $A^L$ a conventional code. An $(L, N, d)_q$ conventional code is a subset of $A^L$ with $N$ codewords and minimum Hamming distance $d$ (recall that $q$ is the size of the alphabet $A$). Our following constructions of sequence-subset codes are based on conventional codes with respect to the Hamming distance.

The construction given in this subsection is a modification of Construction 2 of [11].

Let $\mathcal C_1 = \{x_1, x_2, \cdots, x_K\} \subseteq A^L$ be a conventional code over $A$ and $\mathcal C_2 = \{w_1, w_2, \cdots, w_N\} \subseteq \mathbb F_2^K$ be a conventional binary code. For each $w_i \in \mathcal C_2$, let
$$X_i = \{x_j;\ j \in \mathrm{supp}(w_i)\},$$
where $\mathrm{supp}(w_i) = \{j \in [K];\ w_i(j) \ne 0\}$ is the support of $w_i$. Further, let
$$\mathcal C = \{X_1, X_2, \cdots, X_N\}.$$
Then $\mathcal C \subseteq P(A^L)$ is a sequence-subset code over $A$, and we have the following theorem.
Theorem 8: Suppose $\mathcal C_1$ has minimum (Hamming) distance $d_1$ and $\mathcal C_2$ has minimum (Hamming) distance $d_2$. Then $\mathcal C$ has sequence length $L$, code size $|\mathcal C| = N$, and its minimum sequence-subset distance satisfies
$$d_S(\mathcal C) \ge d_1\cdot\left\lceil\frac{d_2}{2}\right\rceil.$$

Proof: Clearly, $\mathcal C$ has sequence length $L$ and code size $|\mathcal C| = N$. It remains to prove that $d_S(\mathcal C) \ge d_1\lceil\frac{d_2}{2}\rceil$.

Let $X_{i_1}$ and $X_{i_2}$ be any distinct codewords of $\mathcal C$. We need to prove $d_S(X_{i_1}, X_{i_2}) \ge d_1\lceil\frac{d_2}{2}\rceil$. Without loss of generality, assume that $|X_{i_1}| \le |X_{i_2}|$. Then we have $|X_{i_1}\backslash X_{i_2}| \le |X_{i_2}\backslash X_{i_1}|$. To simplify notation, denote
$$\tilde X_{i_1} = X_{i_1}\backslash X_{i_2} \quad\text{and}\quad \tilde X_{i_2} = X_{i_2}\backslash X_{i_1}.$$
For an arbitrary injection $\chi: \tilde X_{i_1} \to \tilde X_{i_2}$, by (1),
$$d_\chi(\tilde X_{i_1}, \tilde X_{i_2}) = \sum_{x\in\tilde X_{i_1}} d_H(x, \chi(x)) + L(|\tilde X_{i_2}| - |\tilde X_{i_1}|). \quad (18)$$
Since $\mathcal C_1$ has minimum (Hamming) distance $d_1$ and, by the construction of $\mathcal C$, $x$ and $\chi(x)$ are distinct codewords of $\mathcal C_1$, we have
$$\sum_{x\in\tilde X_{i_1}} d_H(x, \chi(x)) \ge |\tilde X_{i_1}|\cdot d_1.$$
Moreover, since $\mathcal C_1 \subseteq A^L$, then $L \ge d_1$. Hence, (18) implies that
$$d_\chi(\tilde X_{i_1}, \tilde X_{i_2}) \ge |\tilde X_{i_1}|\cdot d_1 + d_1(|\tilde X_{i_2}| - |\tilde X_{i_1}|) = d_1\cdot|\tilde X_{i_2}| = d_1\cdot|X_{i_2}\backslash X_{i_1}|. \quad (19)$$
By the construction of $\mathcal C$, $X_{i_1} = \{x_j;\ j \in \mathrm{supp}(w_{i_1})\}$ and $X_{i_2} = \{x_j;\ j \in \mathrm{supp}(w_{i_2})\}$ for some distinct $w_{i_1}, w_{i_2} \in \mathcal C_2$. Then we have
$$|X_{i_1}\backslash X_{i_2}| + |X_{i_2}\backslash X_{i_1}| = d_H(w_{i_1}, w_{i_2}) \ge d_2,$$
where $d_2$ is the minimum (Hamming) distance of $\mathcal C_2$. Note that $|X_{i_1}\backslash X_{i_2}| \le |X_{i_2}\backslash X_{i_1}|$. Then by the above equation, we have $|X_{i_2}\backslash X_{i_1}| \ge \frac{d_2}{2}$. Moreover, since $|X_{i_2}\backslash X_{i_1}|$ is an integer,
$$|X_{i_2}\backslash X_{i_1}| \ge \left\lceil\frac{d_2}{2}\right\rceil.$$
Combining this with (19), we have
$$d_\chi(\tilde X_{i_1}, \tilde X_{i_2}) \ge d_1\cdot\left\lceil\frac{d_2}{2}\right\rceil.$$
Note that $\chi: X_{i_1}\backslash X_{i_2} \to X_{i_2}\backslash X_{i_1}$ is an arbitrary injection. So by Definition 1 and Corollary 1, we have
$$d_S(X_{i_1}, X_{i_2}) = d_S(X_{i_1}\backslash X_{i_2}, X_{i_2}\backslash X_{i_1}) \ge d_1\cdot\left\lceil\frac{d_2}{2}\right\rceil,$$
which completes the proof.
Remark 2: The code $\mathcal C$ constructed in this subsection may or may not have constant codeword size, depending on whether $\mathcal C_2$ is a constant-weight binary code. In fact, if $\mathcal C_2$ is a constant-weight code, then $\mathcal C$ has constant codeword size; otherwise, $\mathcal C$ does not.
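A tiny instance of this construction, with toy component codes of our own choosing: $\mathcal C_1$ is a $4$-ary "repetition" code with $d_1 = L = 4$, and $\mathcal C_2$ is the set of odd-weight binary words of length $K = 4$ (any two such words differ in an even number, hence at least two, of positions, so $d_2 = 2$). Theorem 8 then guarantees $d_S(\mathcal C) \ge d_1\lceil d_2/2\rceil = 4$:

```python
from itertools import product

L, K = 4, 4
C1 = [(a,) * L for a in "ACGT"]   # conventional code: distinct repetitions, d1 = L = 4
# C2: odd-weight binary words of length K; any two differ in >= 2 positions, so d2 = 2
C2 = [w for w in product((0, 1), repeat=K) if sum(w) % 2 == 1]
# support construction: codeword X_i collects the C1 codewords indexed by supp(w_i)
code = [frozenset(C1[j] for j in range(K) if w[j]) for w in C2]
print(len(code))   # 2**(K-1) = 8 codewords, of sizes 1 or 3 (C2 is not constant weight)
```

Since the odd-weight words have weights $1$ and $3$, the resulting codeword sizes vary, illustrating Remark 2.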
C. Construction Based on Non-binary Codes

Let $A$ and $B$ be two alphabets of size $q$ and $\tilde q$, respectively. Let $\mathcal C_1$ be an $(L, M\tilde q, d_1)_q$ conventional code over $A$ and $\mathcal C_2$ be an $(M, N, d_2)_{\tilde q}$ conventional code over $B$. The $M\tilde q$ codewords of $\mathcal C_1$ can be indexed as
$$\mathcal C_1 = \{x_{i,j} : i \in [M],\ j \in B\}.$$
Then from each $c = (c_1, c_2, \cdots, c_M) \in \mathcal C_2$, we can obtain a subset
$$X_c = \{x_{1,c_1}, x_{2,c_2}, \cdots, x_{M,c_M}\} \subseteq \mathcal C_1.$$
Let
$$\mathcal C = \{X_c;\ c \in \mathcal C_2\}. \quad (20)$$
Then $\mathcal C$ is a sequence-subset code over $A$, and we have the following theorem.

Theorem 9: The code $\mathcal C$ constructed by (20) has sequence length $L$, constant codeword size $M$, code size $|\mathcal C| = N$, and minimum sequence-subset distance
$$d_S(\mathcal C) \ge d_1 d_2.$$

Proof: From the construction it is easy to see that $\mathcal C$ has sequence length $L$, constant codeword size $M$ and code size $|\mathcal C| = N$. It remains to prove that $d_S(\mathcal C) \ge d_1 d_2$, that is, $d_S(X_c, X_{c'}) \ge d_1 d_2$ for any distinct $X_c$ and $X_{c'}$ in $\mathcal C$, where $c = (c_1, c_2, \cdots, c_M)$ and $c' = (c'_1, c'_2, \cdots, c'_M)$ are any pair of distinct codewords of $\mathcal C_2$.

Let $\mathcal A$ be the set of all $i \in [M]$ such that $c_i \ne c'_i$. Since $\mathcal C_2$ has minimum (Hamming) distance $d_2$, then
$$|\mathcal A| = d_H(c, c') \ge d_2.$$
Denote
$$\tilde X_c = \{x_{i,c_i};\ i \in \mathcal A\} \quad\text{and}\quad \tilde X_{c'} = \{x_{i,c'_i};\ i \in \mathcal A\}.$$
Then by the construction, we have
$$\tilde X_c = X_c\backslash X_{c'} \quad\text{and}\quad \tilde X_{c'} = X_{c'}\backslash X_c.$$
So by Corollary 1, it suffices to prove that $d_S(\tilde X_c, \tilde X_{c'}) \ge d_1 d_2$.

Note that $|\tilde X_c| = |\tilde X_{c'}| = |\mathcal A|$ and $\tilde X_c \cap \tilde X_{c'} = \emptyset$. Then for any injection $\chi: \tilde X_c \to \tilde X_{c'}$, we have
$$d_\chi(\tilde X_c, \tilde X_{c'}) = \sum_{x\in\tilde X_c} d_H(x, \chi(x)) \ge |\mathcal A|\cdot d_1 \ge d_1 d_2,$$
where the equality comes from (1), the first inequality comes from the assumption that $\mathcal C_1$ has minimum (Hamming) distance $d_1$, and the second inequality comes from the fact that $|\mathcal A| \ge d_2$. By Definition 1, $d_S(\tilde X_c, \tilde X_{c'}) \ge d_1 d_2$, and hence by Corollary 1, $d_S(X_c, X_{c'}) \ge d_1 d_2$. Since $X_c$ and $X_{c'}$ are an arbitrary pair of distinct codewords of $\mathcal C$, we have $d_S(\mathcal C) \ge d_1 d_2$, which completes the proof.
The following example is a special case of this construction.
Example 1: Let C1 be an [L, k, d1]q linear code such that the first k symbols of the codewords of C1 are the information symbols. For any given integer r such that 1 ≤ r < k, let q̃ = q^r and M = q^s, where s = k − r. Note that there exists a bijection π : [M] → F_q^s. Moreover, fixing a basis, each element of F_{q^r} can be uniquely represented as a vector in F_q^r, so we can identify each element of F_{q^r} with a vector in F_q^r. Then for each i ∈ [M] and each j ∈ F_{q^r}, we can let xi,j = (x1, x2, ..., xL) be the codeword of C1 such that

(x1, x2, ..., xs) = π(i) and (xs+1, ..., xk) = j.

Now, let C2 be an [M, K, d2]_{q^r} linear code, where K ∈ [M] is another design parameter. Then for each c = (c1, c2, ..., cM) ∈ C2, we obtain

Xc = {x1,c1, x2,c2, ..., xM,cM} ⊆ C1,

that is, for each i ∈ [M], xi,ci = (x1, x2, ..., xL) is the codeword such that

(x1, x2, ..., xs) = π(i) and (xs+1, ..., xk) = ci.

Finally, we have

C = {Xc ; c ∈ C2}.

The construction method of this special case is essentially similar to the method used in [4].
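A minimal executable sketch of Example 1, under our own toy parameters (q = 2, the binary [4, 3, 2] even-weight code as C1, and the binary [4, 1, 4] repetition code as C2); the helper names and the brute-force distance routine are ours, not the paper's.

```python
from itertools import permutations, product

def d_h(x, y):
    return sum(a != b for a, b in zip(x, y))

def d_s(X1, X2, L):
    # brute-force sequence-subset distance (Definition 1), small sets only
    A, B = (sorted(X1), sorted(X2)) if len(X1) <= len(X2) else (sorted(X2), sorted(X1))
    return L * (len(B) - len(A)) + min(
        sum(d_h(a, b) for a, b in zip(A, img)) for img in permutations(B, len(A)))

# C1: the binary [4, 3, 2] even-weight code, systematic in the first k = 3 bits
C1 = {bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=3)}
d1 = 2
# split the k = 3 information bits into s = 2 index bits pi(i) and r = 1 payload
# bit, so M = 2**s = 4; C2 is the binary [4, 1, 4] repetition code with d2 = 4
d2 = 4

def X(c):  # c in C2 supplies one payload bit c[i] per index i
    return {(i >> 1, i & 1, c[i], (i >> 1) ^ (i & 1) ^ c[i]) for i in range(4)}

C = [X((0, 0, 0, 0)), X((1, 1, 1, 1))]
assert all(x in C1 for x in C[0] | C[1])   # every sequence is a C1 codeword
assert d_s(C[0], C[1], 4) >= d1 * d2       # Theorem 9 bound: dS >= d1*d2 = 8
```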
D. Construction Based on Sequence Index
In this subsection, if x = (x(1), x(2), ..., x(L)) ∈ A^L and I = {i1, i2, ..., im} ⊆ [L] with i1 < i2 < ... < im, then we denote x(I) = (x(i1), x(i2), ..., x(im)).
The construction given in this subsection is an improvement of Construction 1 of [11].
Let C1 = {s1, s2, ..., sM} ⊆ A^{L1} be a conventional code over A with block length L1 and minimum (Hamming) distance d1, and let C2 = {u1, u2, ..., uN} ⊆ A^{d1 M} be a conventional code over A with block length d1 M and minimum (Hamming) distance d2. For each j ∈ [M], let

Ij = {ℓ ∈ Z ; (j − 1) d1 < ℓ ≤ j d1},

and for each i ∈ [N], let

Xi = {xi,1, xi,2, ..., xi,M}

such that for each j ∈ [M],

xi,j = (sj, ui(Ij)).

Finally, let

C = {Xi ; i ∈ [N]}.    (21)

Then C is a sequence-subset code over A. In this construction, each codeword sj of C1 serves as an index of the sequence xi,j of the codeword Xi, and ui(Ij) is the information part of xi,j. This is why we say that the construction is based on sequence indices. Moreover, we have the following theorem.
Theorem 10: The code C constructed by (21) has sequence length L = L1 + d1, constant codeword size M, code size |C| = N, and minimum sequence-subset distance

dS(C) ≥ d2.
Proof: Clearly, C has sequence length L = L1 + d1, constant codeword size M and code size |C| = N. It remains to prove that dS(C) ≥ d2.

Let i1, i2 ∈ [N] be any two distinct elements of [N]. We need to prove that dS(Xi1, Xi2) ≥ d2, where Xi1 = {xi1,1, xi1,2, ..., xi1,M} and Xi2 = {xi2,1, xi2,2, ..., xi2,M}. Note that any bijection between Xi1 and Xi2 can be uniquely represented by a permutation on the index set [M], so when applying (1) to the pair {Xi1, Xi2}, we can use permutations on [M] in place of bijections between Xi1 and Xi2. For any permutation χ : [M] → [M], let

N = {j ∈ [M] ; χ(j) = j}

and

N̄ = {j ∈ [M] ; χ(j) ≠ j}.

Then N ∩ N̄ = ∅ and N ∪ N̄ = [M]. Moreover, by (1), we have

dχ(Xi1, Xi2) = Σ_{j=1}^{M} dH(xi1,j, xi2,χ(j))
            = Σ_{j ∈ N} dH(xi1,j, xi2,χ(j)) + Σ_{j ∈ N̄} dH(xi1,j, xi2,χ(j))
            = Σ_{j ∈ N} dH(xi1,j, xi2,j) + Σ_{j ∈ N̄} dH(xi1,j, xi2,χ(j)).    (22)

We estimate the two terms on the right side of (22) separately.

First, by the construction, we have

Σ_{j=1}^{M} dH(xi1,j, xi2,j) = Σ_{j=1}^{M} dH(ui1(Ij), ui2(Ij)) = dH(ui1, ui2) ≥ d2.

Moreover, since ui(Ij) has length d1 for each i ∈ [N] and j ∈ [M], then again by the construction of C, we have

dH(xi1,j, xi2,j) = dH(ui1(Ij), ui2(Ij)) ≤ d1.
Hence, we obtain

Σ_{j ∈ N} dH(xi1,j, xi2,j) = Σ_{j=1}^{M} dH(xi1,j, xi2,j) − Σ_{j ∈ N̄} dH(xi1,j, xi2,j)
                          = Σ_{j=1}^{M} dH(ui1(Ij), ui2(Ij)) − Σ_{j ∈ N̄} dH(ui1(Ij), ui2(Ij))
                          ≥ d2 − |N̄| · d1.

Second, since C1 has minimum (Hamming) distance d1, then by the construction of C, we have

Σ_{j ∈ N̄} dH(xi1,j, xi2,χ(j)) ≥ Σ_{j ∈ N̄} dH(sj, sχ(j)) ≥ |N̄| · d1.

Combining the above two inequalities with (22), we obtain

dχ(Xi1, Xi2) = Σ_{j ∈ N} dH(xi1,j, xi2,j) + Σ_{j ∈ N̄} dH(xi1,j, xi2,χ(j))
            ≥ d2 − |N̄| · d1 + |N̄| · d1
            = d2.

Since χ : [M] → [M] is an arbitrary bijection, by Definition 1 we have

dS(Xi1, Xi2) ≥ d2.

Moreover, since i1 and i2 are any two distinct elements of [N], we have

dS(C) ≥ d2,

which completes the proof.
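The sequence-index construction and the bound of Theorem 10 can likewise be checked on a toy instance (our own choice of C1 and C2, picked only to keep the brute-force distance computation small):

```python
from itertools import permutations

def d_h(x, y):
    return sum(a != b for a, b in zip(x, y))

def d_s(X1, X2, L):
    # brute-force sequence-subset distance (Definition 1), small sets only
    A, B = (sorted(X1), sorted(X2)) if len(X1) <= len(X2) else (sorted(X2), sorted(X1))
    return L * (len(B) - len(A)) + min(
        sum(d_h(a, b) for a, b in zip(A, img)) for img in permutations(B, len(A)))

# toy parameters: index code C1 = {000, 111} (L1 = 3, d1 = 3, M = 2) and
# information code C2 = {000000, 111111} (block length d1*M = 6, d2 = 6, N = 2)
S = [(0, 0, 0), (1, 1, 1)]
U = [(0,) * 6, (1,) * 6]
d1, d2, M = 3, 6, 2

def codeword(u):
    # x_{i,j} = (s_j, u_i(I_j)), with I_j the j-th block of d1 positions of u_i
    return {S[j] + u[j * d1:(j + 1) * d1] for j in range(M)}

C = [codeword(u) for u in U]
assert d_s(C[0], C[1], 6) >= d2   # Theorem 10: dS(C) >= d2, with L = L1 + d1 = 6
```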
Remark 3: Using the product of multiple copies of C2, the construction in this subsection can be further extended as follows.
Let n be a given positive integer. For each n-tuple i = (i1, i2, ..., in) ∈ [N]^n, let

Xi = {xi,1, xi,2, ..., xi,M}

such that for each j ∈ [M],

xi,j = (sj, ui1(Ij), ..., uin(Ij)).

Finally, let

C = {Xi ; i = (i1, i2, ..., in) ∈ [N]^n}.

Then the code C has sequence length L = L1 + n d1, constant codeword size M, code size |C| = N^n, and minimum sequence-subset distance

dS(C) ≥ d2.
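A sketch of the product extension for n = 2, reusing the same toy index and information codes as before (our own choices); the assertion checks the claimed bound dS(C) ≥ d2 over all pairs of the N^n = 4 codewords.

```python
from itertools import permutations, product

def d_h(x, y):
    return sum(a != b for a, b in zip(x, y))

def d_s(X1, X2, L):
    # brute-force sequence-subset distance (Definition 1), small sets only
    A, B = (sorted(X1), sorted(X2)) if len(X1) <= len(X2) else (sorted(X2), sorted(X1))
    return L * (len(B) - len(A)) + min(
        sum(d_h(a, b) for a, b in zip(A, img)) for img in permutations(B, len(A)))

S = [(0, 0, 0), (1, 1, 1)]   # index code C1: L1 = 3, d1 = 3, M = 2
U = [(0,) * 6, (1,) * 6]     # information code C2: d2 = 6, N = 2
d1, d2, M, n = 3, 6, 2, 2

def codeword(idx):
    # idx = (i1, ..., in): x_{idx,j} = (s_j, u_{i1}(I_j), ..., u_{in}(I_j))
    return {S[j] + tuple(b for i in idx for b in U[i][j * d1:(j + 1) * d1])
            for j in range(M)}

C = [codeword(idx) for idx in product(range(2), repeat=n)]
L = 3 + n * d1               # sequence length L1 + n*d1 = 9
assert len(C) == 4           # code size N^n
assert all(d_s(C[a], C[b], L) >= d2
           for a in range(len(C)) for b in range(a + 1, len(C)))
```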
V. CONCLUSIONS

We introduced a new metric over the power set of the set of all vectors over a finite alphabet, which generalizes the classical Hamming distance and establishes a uniform framework for designing error-correcting codes for the DNA storage channel. Some upper bounds on the size of sequence-subset codes were derived and some constructions of such codes were proposed. It remains an open problem to derive tight upper bounds on the size of sequence-subset codes and to design optimal codes for general parameters of sequence length, codeword size and minimum distance.

Another interesting problem is how to design sequence-subset codes for the DNA storage channel that can be efficiently encoded and decoded.

The sequence-subset distance (Definition 1) can be directly applied to multisets of A^L. So studying the properties of codes over the space of all multisets of A^L under the sequence-subset distance is also a possible research direction.
APPENDIX A
PROOF OF LEMMA 1
If X1 ∩ X2 = ∅, the claim is trivially true, so we assume that X1 ∩ X2 ≠ ∅.

First, we claim that for each injection χ such that dS(X1, X2) = dχ(X1, X2) and each y ∈ X1 ∩ X2, there exists an x ∈ X1 such that y = χ(x). This can be proved by contradiction as follows. Suppose there is a y ∈ X1 ∩ X2 such that y ≠ χ(x′) for all x′ ∈ X1. Since y ∈ X1 ∩ X2, we have χ(y) ≠ y, and hence we can let χ′ : X1 → X2 be such that χ′(y) = y and χ′(x′) = χ(x′) for all x′ ∈ X1\{y} (see Fig. 2 for an illustration). Note that dH(y, χ′(y)) = 0 < dH(y, χ(y)) and dH(x′, χ′(x′)) = dH(x′, χ(x′)) for all x′ ∈ X1\{y}. So by (1), we have dχ′(X1, X2) < dχ(X1, X2), which contradicts (2). Hence, for each y ∈ X1 ∩ X2, there exists an x ∈ X1 such that y = χ(x).
Fig. 2. An illustration of the injections in the proof of Lemma 1: for the injection χ, there exists a y ∈ X1 ∩ X2 such that χ(y) ≠ y. Denote χ(y) = y′. We modify χ to a different injection χ′ by letting χ′(y) = y and keeping the images of all other elements of X1 unchanged.
Fig. 3. An illustration of the bijections in the proof of Lemma 1: for the bijection χ, we have χ(x) = y and χ(y) = y′ ≠ y, where y ∈ X1 ∩ X2. We modify χ to a different bijection χ′ by letting χ′(x) = y′ and χ′(y) = y, keeping the images of all other elements of X1 unchanged.
Now, pick an injection χ such that dS(X1, X2) = dχ(X1, X2) and denote

N(χ) = {y′ ∈ X1 ∩ X2 ; χ(y′) ≠ y′}.

If N(χ) = ∅, then by the definition of N(χ), χ(x) = x for all x ∈ X1 ∩ X2 and we can choose χ0 = χ. Otherwise, pick a y ∈ N(χ); then χ(y) = y′ for some y′ ∈ X2\{y}. Moreover, by the previous discussion, there exists an x ∈ X1 such that y = χ(x). Then we can let χ′ : X1 → X2 be such that χ′(x) = y′, χ′(y) = y and χ′(x′) = χ(x′) for all x′ ∈ X1\{x, y} (see Fig. 3 for an illustration). Note that

dH(x, χ′(x)) + dH(y, χ′(y)) = dH(x, y′) + dH(y, y)
                            = dH(x, y′)
                            ≤ dH(x, y) + dH(y, y′)
                            = dH(x, χ(x)) + dH(y, χ(y)),

and by the construction of χ′,

dH(x′, χ′(x′)) = dH(x′, χ(x′)), ∀ x′ ∈ X1\{x, y}.

So by (2), we have

dS(X1, X2) = dχ(X1, X2) = dχ′(X1, X2).

Again by the construction of χ′, we have N(χ′) = N(χ)\{y}, and hence

|N(χ′)| = |N(χ)| − 1,

where

N(χ′) = {y ∈ X1 ∩ X2 ; χ′(y) ≠ y}.
If N(χ′) = ∅, then χ′(x) = x for all x ∈ X1 ∩ X2 and we can choose χ0 = χ′. Otherwise, by the same argument, we can obtain a χ′′ : X1 → X2 such that dS(X1, X2) = dχ′′(X1, X2) and |N(χ′′)| = |N(χ′)| − 1, and so on. Since N(χ) ⊆ X1 ∩ X2 is a finite set, we can always find an injection χ0 such that dS(X1, X2) = dχ0(X1, X2) and

N(χ0) = {y ∈ X1 ∩ X2 ; χ0(y) ≠ y} = ∅.

Hence, we have χ0(x) = x for all x ∈ X1 ∩ X2, which completes the proof.
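Lemma 1 can be spot-checked numerically: enumerate all injections achieving the minimum in Definition 1 and verify that at least one of them fixes every element of X1 ∩ X2. The example sets below are our own.

```python
from itertools import permutations

def d_h(x, y):
    return sum(a != b for a, b in zip(x, y))

def best_injections(X1, X2, L):
    # all injections chi : X1 -> X2 attaining the minimum in Definition 1
    # (assumes |X1| <= |X2|)
    A, B = sorted(X1), sorted(X2)
    scored = [(L * (len(B) - len(A)) + sum(d_h(a, b) for a, b in zip(A, img)), img)
              for img in permutations(B, len(A))]
    m = min(s for s, _ in scored)
    return m, [dict(zip(A, img)) for s, img in scored if s == m]

# small example with X1 and X2 intersecting in (1,1)
X1 = {(0, 0), (1, 1)}
X2 = {(1, 1), (0, 1), (1, 0)}
m, opts = best_injections(X1, X2, L=2)
# Lemma 1: some optimal injection fixes every common element
assert any(all(chi[y] == y for y in X1 & X2) for chi in opts)
```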
APPENDIX B
PROOF OF LEMMA 2
It suffices to prove that if X′2 ⊆ X2 and |X1| ≤ |X′2| = |X2| − 1, then

dS(X1, X′2) ≤ dS(X1, X2).

Without loss of generality, we can assume

X1 = {x1, ..., xn},
X′2 = {y1, ..., yn, yn+1, ..., yn+s−1}

and

X2 = {y1, ..., yn, yn+1, ..., yn+s−1, yn+s},

where s ≥ 1, such that

dS(X1, X′2) = Σ_{i=1}^{n} dH(xi, yi) + L(s − 1).

By Definition 1, we can suppose

dS(X1, X2) = Σ_{i=1}^{n} dH(xi, yℓi) + Ls,

where {ℓi ; i = 1, 2, ..., n} is a subset of {1, 2, ..., n + s}. We have the following two cases.
Case 1: n + s ∉ {ℓ1, ℓ2, ..., ℓn}. In this case, we have

dS(X1, X′2) = Σ_{i=1}^{n} dH(xi, yi) + L(s − 1)
           ≤ Σ_{i=1}^{n} dH(xi, yℓi) + L(s − 1)
           < Σ_{i=1}^{n} dH(xi, yℓi) + Ls
           = dS(X1, X2),

where the first inequality is obtained by (2).
Case 2: There exists a k ∈ {1, 2, ..., n} such that n + s = ℓk. Since s ≥ 1, there exists an m ∈ {1, 2, ..., n + s − 1} such that m ∉ {ℓ1, ℓ2, ..., ℓn}. Denote ℓ′k = m and ℓ′i = ℓi for i ∈ {1, 2, ..., n}\{k}. Then we have

{ℓ′1, ℓ′2, ..., ℓ′n} ⊆ {1, 2, ..., n + s − 1}.    (23)

Moreover, since {xk, ym, yℓk} ⊆ X1 ∪ X2 ⊆ A^L, we have dH(xk, ym) ≤ L and dH(xk, yℓk) ≤ L. So we obtain

dH(xk, ym) − dH(xk, yℓk) ≤ L.    (24)
Further, we have

dS(X1, X′2) = Σ_{i=1}^{n} dH(xi, yi) + L(s − 1)
           ≤ Σ_{i=1}^{n} dH(xi, yℓ′i) + L(s − 1)
           = Σ_{i=1}^{n} dH(xi, yℓi) − dH(xk, yℓk) + dH(xk, ym) + L(s − 1)
           ≤ Σ_{i=1}^{n} dH(xi, yℓi) + L + L(s − 1)
           = dS(X1, X2),

where the first inequality is obtained by (23) and (2), and the second inequality is obtained by (24).

Hence, we always have dS(X1, X′2) ≤ dS(X1, X2), which completes the proof.
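Lemma 2 can be checked by random search over small binary sets (a sanity check, not a proof): for every subset X′2 of X2 with one element removed, dS(X1, X′2) ≤ dS(X1, X2) should hold whenever |X1| ≤ |X′2|.

```python
import itertools
import random

def d_h(x, y):
    return sum(a != b for a, b in zip(x, y))

def d_s(X1, X2, L):
    # brute-force sequence-subset distance (Definition 1), small sets only
    A, B = (sorted(X1), sorted(X2)) if len(X1) <= len(X2) else (sorted(X2), sorted(X1))
    extra = L * (len(B) - len(A))
    if not A:
        return extra
    return extra + min(sum(d_h(a, b) for a, b in zip(A, img))
                       for img in itertools.permutations(B, len(A)))

random.seed(1)
L = 3
space = list(itertools.product((0, 1), repeat=L))
for _ in range(200):
    X2 = set(random.sample(space, 4))
    X1 = set(random.sample(space, random.randint(1, 3)))
    for X2p in itertools.combinations(X2, len(X2) - 1):
        # Lemma 2: removing one sequence from the larger set cannot
        # increase the distance, since |X1| <= |X2'| here
        assert d_s(X1, set(X2p), L) <= d_s(X1, X2, L)
```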
APPENDIX C
PROOF OF THEOREM 1
By Definition 1, it is easy to see that for any two subsets X1 and X2 of A^L, dS(X1, X2) = dS(X2, X1) ≥ 0. Moreover, by Corollary 1, dS(X1, X2) = 0 if and only if X1 = X2. So to prove that dS(·,·) is a distance function, we only need to prove the triangle inequality, that is,

dS(X1, X2) ≤ dS(X1, X3) + dS(X2, X3)

for any three subsets X1, X2 and X3 of A^L. Without loss of generality, we can assume that |X1| ≤ |X2|. Then we have the following three cases.
Case 1. |X1| ≤ |X2| ≤ |X3|. In this case, we can fix a subset X′3 ⊆ X3 of size |X′3| = |X2|. Then by Lemma 2, dS(X1, X′3) ≤ dS(X1, X3) and dS(X2, X′3) ≤ dS(X2, X3). So it suffices to prove that

dS(X1, X2) ≤ dS(X1, X′3) + dS(X2, X′3).

Without loss of generality, we can assume

X1 = {x1, ..., xn},
X2 = {y1, ..., yn, yn+1, ..., yn+s},
X′3 = {z1, ..., zn, zn+1, ..., zn+s}

such that

dS(X1, X′3) = Σ_{i=1}^{n} dH(xi, zi) + Ls,

dS(X2, X′3) = Σ_{i=1}^{n+s} dH(yi, zi)

and

dS(X1, X2) = Σ_{i=1}^{n} dH(xi, yℓi) + Ls,
where s ≥ 0 and {ℓ1, ℓ2, ..., ℓn} ⊆ {1, 2, ..., n + s}. Then we have

dS(X1, X2) = Σ_{i=1}^{n} dH(xi, yℓi) + Ls
           ≤ Σ_{i=1}^{n} dH(xi, yi) + Ls
           ≤ Σ_{i=1}^{n} (dH(xi, zi) + dH(yi, zi)) + Ls
           ≤ (Σ_{i=1}^{n} dH(xi, zi) + Ls) + Σ_{i=1}^{n+s} dH(yi, zi)
           = dS(X1, X′3) + dS(X2, X′3)
           ≤ dS(X1, X3) + dS(X2, X3),

where the first inequality is obtained by (2) and the last inequality is obtained by Lemma 2.
Case 2. |X1| ≤ |X3| ≤ |X2|. In this case, we can assume

X1 = {x1, ..., xn},
X3 = {y1, ..., yn, yn+1, ..., yn+s},
X2 = {z1, ..., zn, zn+1, ..., zn+s, zn+s+1, ..., zn+s+t}

such that

dS(X1, X3) = Σ_{i=1}^{n} dH(xi, yi) + Ls,

dS(X2, X3) = Σ_{i=1}^{n+s} dH(yi, zi) + Lt

and

dS(X1, X2) = Σ_{i=1}^{n} dH(xi, zℓi) + L(s + t),

where s, t ≥ 0 and {ℓ1, ℓ2, ..., ℓn} ⊆ {1, 2, ..., n + s + t}. Then we have

dS(X1, X2) = Σ_{i=1}^{n} dH(xi, zℓi) + L(s + t)
           ≤ Σ_{i=1}^{n} dH(xi, zi) + L(s + t)
           ≤ Σ_{i=1}^{n} (dH(xi, yi) + dH(yi, zi)) + L(s + t)
           ≤ (Σ_{i=1}^{n} dH(xi, yi) + Ls) + (Σ_{i=1}^{n+s} dH(yi, zi) + Lt)
           = dS(X1, X3) + dS(X2, X3),

where the first inequality is obtained by (2).
Case 3. |X3| ≤ |X1| ≤ |X2|. In this case, we can assume

X3 = {x1, ..., xn},
X1 = {y1, ..., yn, yn+1, ..., yn+s},
X2 = {z1, ..., zn, zn+1, ..., zn+s, zn+s+1, ..., zn+s+t}

such that

dS(X1, X3) = Σ_{i=1}^{n} dH(xi, yi) + Ls,

dS(X2, X3) = Σ_{i=1}^{n} dH(xi, zi) + L(s + t)
and

dS(X1, X2) = Σ_{i=1}^{n+s} dH(yi, zℓi) + Lt,

where s, t ≥ 0 and {ℓ1, ℓ2, ..., ℓn+s} ⊆ {1, 2, ..., n + s + t}. Then we have

dS(X1, X2) = Σ_{i=1}^{n+s} dH(yi, zℓi) + Lt
           ≤ Σ_{i=1}^{n+s} dH(yi, zi) + Lt
           ≤ Σ_{i=1}^{n} (dH(xi, yi) + dH(xi, zi)) + Σ_{i=n+1}^{n+s} dH(yi, zi) + Lt
           ≤ (Σ_{i=1}^{n} dH(xi, yi) + Ls) + (Σ_{i=1}^{n} dH(xi, zi) + L(s + t))
           = dS(X1, X3) + dS(X2, X3),

where the first inequality is obtained by (2), the second inequality is the triangle inequality for dH, and the third inequality is obtained from the simple fact that dH(·,·) ≤ L.
In all three cases, we have proved that

dS(X1, X2) ≤ dS(X1, X3) + dS(X2, X3),

so dS(·,·) satisfies the triangle inequality. By the above discussion, dS(·,·) is a distance function over P(A^L).
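The triangle inequality proved above can be sanity-checked by random search over small subsets of {0,1}^L, using a brute-force implementation of Definition 1 (our own code, not from the paper):

```python
import itertools
import random

def d_h(x, y):
    return sum(a != b for a, b in zip(x, y))

def d_s(X1, X2, L):
    # brute-force sequence-subset distance (Definition 1), small sets only
    A, B = (sorted(X1), sorted(X2)) if len(X1) <= len(X2) else (sorted(X2), sorted(X1))
    extra = L * (len(B) - len(A))
    if not A:
        return extra
    return extra + min(sum(d_h(a, b) for a, b in zip(A, img))
                       for img in itertools.permutations(B, len(A)))

random.seed(0)
L = 3
space = list(itertools.product((0, 1), repeat=L))
for _ in range(300):
    X1, X2, X3 = (set(random.sample(space, random.randint(1, 3))) for _ in range(3))
    # Theorem 1: dS satisfies the triangle inequality
    assert d_s(X1, X2, L) <= d_s(X1, X3, L) + d_s(X2, X3, L)
```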
REFERENCES

[1] J. Davis, "Microvenus," Art Journal, vol. 55, p. 70, 1996, doi: 10.2307/777811.
[2] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6102, pp. 1628-1628, 2012.
[3] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, "Towards practical, high-capacity, low-maintenance information storage in synthesized DNA," Nature, vol. 494, no. 7435, pp. 77-80, 2013.
[4] R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark, "Robust chemical preservation of digital information on DNA in silica with error-correcting codes," Angew. Chem. Int. Ed., vol. 54, no. 8, pp. 2552-2555, 2015.
[5] M. Blawat, K. Gaedke, I. Hütter, X.-M. Chen, B. Turczyk, S. Inverso, B. W. Pruitt, and G. M. Church, "Forward error correction for DNA data storage," Procedia Computer Science, vol. 80, pp. 1011-1022, 2016.
[6] S. M. H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, "A rewritable, random-access DNA-based storage system," Scientific Reports, vol. 5, 14138, 2015.
[7] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and K. Strauss, "A DNA-based archival storage system," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, pp. 637-649, 2016.
[8] Y. Erlich and D. Zielinski, "DNA Fountain enables a robust and efficient storage architecture," Science, vol. 355, no. 6328, pp. 950-954, 2017.
[9] W. Song, K. Cai, M. Zhang, and C. Yuen, "Codes with run-length and GC-content constraints for DNA-based data storage," IEEE Communications Letters, 2018, doi: 10.1109/LCOMM.2018.2866566.
[10] K. A. S. Immink and K. Cai, "Design of capacity-approaching constrained codes for DNA-based storage systems," IEEE Communications Letters, vol. 22, no. 2, pp. 224-227, 2018.
[11] A. Lenz, P. H. Siegel, A. Wachter-Zeh, and E. Yaakobi, "Coding over sets for DNA storage," 2018, available: https://arxiv.org/abs/1801.04882.
[12] J. Sima, N. Raviv, and J. Bruck, "On coding over sliced information," 2018, available: https://arxiv.org/abs/1809.02716.
[13] R. Heckel, I. Shomorony, K. Ramchandran, and D. N. C. Tse, "Fundamental limits of DNA storage systems," in Proc. IEEE Int. Symp. Inform. Theory (ISIT), Aachen, Germany, Jun. 2017, pp. 3130-3134.
[14] H. M. Kiah, G. J. Puleo, and O. Milenkovic, "Codes for DNA sequence profiles," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3125-3146, Jun. 2016.
[15] M. Langberg, M. Schwartz, and E. Yaakobi, "Coding for the ℓ-limited permutation channel," IEEE Trans. Inf. Theory, vol. 63, no. 12, pp. 7676-7686, Dec. 2017.
[16] M. Kovačević and V. Y. F. Tan, "Codes in the space of multisets — coding for permutation channels with impairments," IEEE Trans. Inf. Theory, 2018, doi: 10.1109/TIT.2017.2789292.
[17] W. C. Huffman and V. Pless, Fundamentals of Error-Correcting Codes. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[18] S. Jukna, Extremal Combinatorics. New York: Springer-Verlag, 2001.