Survey of Information Encoding Techniques for DNA
THOMAS HEINIS and JAMIE J. ALNASIR, Imperial College, UK
Both authors contributed equally to this research.
Authors’ address: Thomas Heinis, t.heinis@imperial.ac.uk; Jamie J. Alnasir, j.alnasir@imperial.ac.uk, Imperial College, 180 Queen’s Gate, South Kensington, London, England, UK, SW7 2AZ.
The yearly global production of data is growing exponentially, outpacing the capacity of existing storage media, such as tape and
disk, and surpassing our ability to store it. DNA storage — the representation of arbitrary information as sequences of nucleotides —
oers a promising storage medium. DNA is nature’s information storage molecule of choice and has a number of key properties: it
is extremely dense, oering the theoretical possibility of storing 455 EB/g, it is durable with a half-life of approximately 520 years
which can be increased to thousands of years when chilled and desiccated, and is amenable to automated synthesis and sequencing.
Furthermore, the potential exists for employing biochemical processes that act on DNA for performing highly-parallel computations.
Whilst biological information is encoded in DNA as triplet sequences of nucleotides (also referred to as bases or base pairs) (A,T,G,
or C), i.e., base 4 — known as codons — there are many possible encoding schemes that can map data to chemical sequences of
nucleotides for synthesis, storage, retrieval and computation. However, there are several biological constraints, error-correcting factors
and information retrieval considerations that encoding schemes need to address for DNA storage to be viable.
This comprehensive review focuses on comparing existing work done in encoding arbitrary data within DNA, particularly the
encoding schemes used, methods employed to address biological constraints, and measures to provide error-correction. Furthermore,
we compare encoding approaches on the overall information density they achieve, as well as the data retrieval method they employ,
i.e., sequential or random access. We will also discuss the background and evolution of the encoding schemes.
CCS Concepts: • Applied computing; • Hardware → External storage;
Additional Key Words and Phrases: DNA Storage, Encoding, Error Detection & Correction, Retrieval, Information Density
Reference:
Thomas Heinis and Jamie J. Alnasir. 2021. Survey of Information Encoding Techniques for DNA. 1, 1 (July 2021), 28 pages. https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION
DNA storage, whereby DNA is used as a medium to store arbitrary data, has been proposed as a method to address
the storage crisis that is currently arising - the world’s data production is outpacing our ability to store it [11, 17, 18].
Furthermore, as [8] point out, rapid advancements in information storage technologies in our digital age risk that, as a
result of hardware and software obsolescence as well as storage medium decay, data currently stored on magnetic or
optical media will likely be unrecoverable in a century or less. Hence new methods such as immutable, high-latency
storage technologies are required to allow data to be retrieved after longer periods, e.g., centuries or millennia [9].
As a storage medium, DNA offers several important advantages. It is extremely durable — without treating it to
improve durability, DNA has a half-life of approximately 520 years [5]. Inherent in DNA is a base 4 system of nucleotides
that encodes information necessary for life [12] which can also be used to encode arbitrary information at extremely
high densities — theoretically it is possible to store 455 EB/g, i.e., 455 million terabytes per gram [9]. Also, as a result
of DNA being the information storage biomolecule of all life, all of the materials for synthesising DNA are abundantly
available.
DNA storage approaches use an encoding to map arbitrary binary information/data to a DNA sequence made of
four distinct nucleotides (also referred to as bases): Adenine, Thymine, Guanine and Cytosine (designated A, T, G, C
respectively by convention). The resulting mapped sequence is then physically created as a DNA molecule (often
referred to as an oligonucleotide — meaning a short DNA sequence) through enzymatic or chemical synthesis. The
synthesised DNA molecules can be single-stranded (ssDNA) — as typically used in DNA storage — or double-stranded
(dsDNA), i.e., with a complementary set of nucleotides. The synthesised DNA molecules are stored for extended periods
in dry conditions.
To read the information back, DNA sequencing is used. Presently, two distinct types of sequencing technology have
evolved, namely sequencing by synthesis (SBS) and nanopore sequencing. SBS sequencers have a limited read-length of
hundreds to thousands of nucleotides (i.e., basepairs, or bp). Samples prepared for SBS must undergo significant molecular
biology preparatory steps, including fragmentation (to sizes appropriate to the read length tolerated by the sequencer)
and amplification based on a molecular biology operation known as PCR (Polymerase Chain Reaction). PCR copies DNA
sequences hundreds or thousands of times, thus ensuring enough DNA material is available for sequencing. Nanopore
sequencers, on the other hand, do not require fragmentation or amplification as they are capable of sequencing long
DNA molecules. In both cases, molecules of the same identity (i.e., having the same sequence) are sequenced multiple
times (sequencing molecules of the same sequence four times is referred to as a sequencing depth of 4x) in order to get
a consensus for each nucleotide at each position (a process known as basecalling). Following sequencing, the DNA
sequences are decoded, i.e., mapped back to their binary representation, and error correction is applied where possible
to retrieve the stored data with high fidelity.
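To make the consensus step concrete, the short Python sketch below (our illustration, not taken from any of the surveyed works) calls the majority nucleotide at each position across pre-aligned reads of the same molecule; real basecallers additionally weigh per-base quality scores and must first align the reads:

from collections import Counter

def consensus(reads):
    """Majority vote per position across equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

# Sequencing depth of 4x with two isolated read errors.
reads = ["ACGTACGT", "ACGTACGT", "ACCTACGT", "ACGTACGA"]
print(consensus(reads))  # ACGTACGT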
2 COMPARING DNA STORAGE APPROACHES
Key to DNA storage is encoding the information to a sequence of nucleotides before it can be synthesised for storage.
In this survey, we study different approaches to encode information in DNA. Although some clearly have only been
designed for artistic purposes and not for information storage, we still want to discuss and compare all approaches
developed to date. We compare approaches with respect to storage mechanism, information density, error detection &
correction mechanisms, biological constraints considered, access mechanism, data stored and the size of the data stored.
We first introduce the criteria for comparison and then discuss the encoding approaches, roughly in chronological
order, which also tends to reflect their level of sophistication reasonably well.
2.1 Criteria for Comparison
The storage of arbitrary information in DNA involves a number of key factors that determine the viability of a particular
approach. These will form the criteria of our comparison, and are as follows:
Storage medium
: there are several options for storing DNA. It is possible to store information in vivo by
inserting the DNA into the genetic information of a living organism, such as a bacterium (e.g., E. coli, B. subtilis,
D. radiodurans) or yeast (S. cerevisiae). However, errors may be introduced when the cells divide (recombination) and the
quantity of information inserted is limited by the organism’s tolerance to extraneous DNA, i.e., without being
incapacitated. DNA can also be stored in vitro, e.g., by hybridisation to beads, or by storing within wells or
microplates.
Biological constraints
: the encoding or mapping of data to DNA must adhere to multiple design restrictions. Not
all possible sequences of nucleotides can be synthesised. Homopolymers, i.e., sequences of the same nucleotide,
of a length of more than two, for example, cannot be synthesised without potential errors. Similarly, extremes in
GC content of the resulting sequences should also be avoided as such GC-rich or GC-poor sequences pose problems
for both synthesis and sequencing [15].
Error-correction
: both synthesis — writing information to DNA — and sequencing — reading from DNA — are
error prone. Substitutions, deletions and insertions of nucleotides are inherent in using DNA. Strand breakage
may also occur in storing the DNA. An encoding must therefore be designed to be resilient to errors through error
correction codes or replication. Incorporating error detection and correction codes in the encoded information is
crucial to correctly recover all information.
Information-density
: synthesis of DNA using current technologies is expensive, hence an encoding scheme
must map as many bits as possible to one nucleotide, and a crucial metric — the information density — assesses
how much information (bits) can be stored per nucleotide. Several factors affect information density. For example,
if a number of nucleotides in a sequence is used to store error detection and correction information, then there
are fewer nucleotides available for encoding the payload — the actual underlying data to be stored — and hence
the overall information density decreases. Information density therefore strongly depends on the features of
the encoding and it cannot be considered in isolation. Hence we compare encoding approaches by the overall
information density they achieve.
Data-retrieval method
: there are two main methods for recovering stored data in general that are applicable to
DNA storage: i) in sequential access, the entire DNA must be sequenced to reconstruct the information, even to access parts
of it, which can be a slow and costly process; ii) in random access, subsets of the information can be selectively read
without having to sequence all information — this requires placement of primers, that is, nucleotide sequences
that initiate the chemical reactions, such that subsets can selectively be amplified. Amplification is the process
by which small amounts of DNA are replicated in order to have sufficient concentration of sample to sequence.
The primers are placed at the 5’ (head) and 3’ (tail) ends of the DNA oligonucleotides.
2.2 Information Density
In order to appropriately review the encoding schemes, we calculate the overall information density in bits per nucleotide
(b/nt), which includes the payload encoded — i.e., the actual information carried — together with the additional
subsequences like primers needed for reading/amplification etc. We have also calculated the information density for the
payload only. Both values for the projects we have surveyed are shown in Table 1. The calculation used is given in Equation 1 in
the Appendix.
When the payload only information density is calculated, the number of bits or nucleotides used for optional,
additional information such as primers, indexes, and error correction codes is deducted from the terms in Equation 1.
As mentioned, we use the overall information density as the comparison, but in some cases we also mention the
payload only information density, particularly when there is a large difference between the two due to the use of
additional nucleotides for primers, indexes, and error correction codes.
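As a concrete illustration of the two densities (the exact formula is given as Equation 1 in the Appendix; the Python sketch below is ours and uses illustrative values only):

def overall_density(payload_bits, total_nucleotides):
    """Overall density: stored bits divided by all nucleotides synthesised."""
    return payload_bits / total_nucleotides

def payload_density(payload_bits, total_nucleotides, overhead_nucleotides):
    """Payload-only density: overhead such as primers, indexes and error
    correction nucleotides is excluded from the denominator."""
    return payload_bits / (total_nucleotides - overhead_nucleotides)

# Example: 35 bits stored in 28 nt, of which 10 nt are a prepended marker.
print(round(overall_density(35, 28), 2))      # 1.25 b/nt
print(round(payload_density(35, 28, 10), 2))  # 1.94 b/nt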
3 INITIAL IDEAS
The rst use of DNA as a storage device had primarily artistic purposes and was not meant to provide read and write
functionality [
13
]. The idea was to incorporate articial (non-biological) information into DNA, and more specically,
into living bacteria (E. coli). The information encoded is the ancient Germanic rune for life and the female earth,
bit-mapped to a binary representation. This resulted in a 5 × 7 or 7 × 5 figure, totaling 35 bits. To translate the bits to
DNA nucleotides, a phase-change model is used. In this model, each nucleotide indicates how many times a bit (0 or 1) is
repeated before a change to another bit occurs:
X → C    XX → T    XXX → A    XXXX → G
With this code, the string 10101 is encoded as CCCCC, as each binary digit occurs once before changing, and 0111000
becomes CAA. The entire 35-bit message, which in this case starts with a 1, is therefore encoded with a sequence of
18 nucleotides, i.e., CCCCCCAACGCGCGCGCT. An additional sequence CTTAAAGGGG is prepended to indicate the start of
the encoded message, and hints at the nature of the encoding nucleotides which represent repeating binary bits. This
produces a DNA strand with a final length of 28 nucleotides which was then implanted into a living cell, E. coli, using
the recombination molecular biology technique.
The information density of this approach depends on the exact message to be encoded. If, for example, the same bit
is repeated multiple times, this approach only requires very few nucleotides and essentially compresses the information.
The example message, using their encoding, allows for an overall information density of 1.25 b/nt (35 bits are encoded
in 28 nucleotides). If the payload alone, excluding the prepended 10nt sequence, is encoded using the same scheme, the
information density is 1.94 b/nt. The overall information density of the DNA storage method, therefore, depends not only
on the encoding scheme but also on the additional features supported which require incorporation of primers, addresses
and error correction codes which, assuming fixed sequence length, reduce the space to store information/payload.
Henceforth, we will focus mainly on the overall information density, and will include payload only information densities
in the summary in Table 1.
The disadvantages of the encoding approach are the absence of any type of error detection or correction. Nor does it
take into account biological constraints on the resulting DNA sequence such as long homopolymers or adequate
GC content, making stability, synthesis and sequencing challenging. To retrieve the information, all data has to be
sequenced, thus only allowing for sequential access to it. Furthermore, a specific downside of this approach is that DNA
sequences are not uniquely decodable. As the nucleotides encode changes, we can translate, for example, a C as either a
0 or 1, and this generalises to any sequence, e.g., TCA can be decoded as 001000 or 110111.
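The ambiguity can be seen in the following Python sketch of the phase-change encoding (our reconstruction of the scheme described above; runs longer than four bits are not handled):

RUN_TO_NT = {1: "C", 2: "T", 3: "A", 4: "G"}
NT_TO_RUN = {v: k for k, v in RUN_TO_NT.items()}

def encode(bits):
    """Each nucleotide records how many times a bit repeats before it changes."""
    out, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        out.append(RUN_TO_NT[j - i])  # assumes no run longer than four bits
        i = j
    return "".join(out)

def decode(seq, first_bit):
    """Decoding needs the first bit -- the nucleotide sequence alone is ambiguous."""
    bits, current = [], first_bit
    for nt in seq:
        bits.append(current * NT_TO_RUN[nt])
        current = "1" if current == "0" else "0"
    return "".join(bits)

print(encode("10101"), encode("0111000"))      # CCCCC CAA
print(decode("TCA", "0"), decode("TCA", "1"))  # 001000 110111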
Similarly to be understood as primarily art is the Genesis project [19], which has the goal of encoding a sentence
from The Book of Genesis. The text used is Let man have dominion over the fish of the sea, and over the fowl of the air, and
over every living thing that moves upon the earth. The text is translated to Morse code which is then again translated to
a nucleotide sequence. Both the text and the Morse code fit the symbolism of "genesis" (the Morse code being one of
the first technologies of the information age).
The Morse code is translated to nucleotides as follows:
Dash → T    Dot → C    Space (Word) → A    Space (Letter) → G
Furthermore, for this encoding, the length of the DNA sequence depends on the exact message encoded as different
letters have a Morse code representation which varies in length. For the message used, 427 nucleotides are needed
(using ITU Morse code). Assuming 5 bits are sufficient to encode a character (adequate only for letters), this results
in an overall information density of 1.52 b/nt (130 chars. x 5 bits/char. / 427 nt), i.e., somewhat more than the previous
approach.
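A minimal Python sketch of this Morse-based mapping (ours; only a few letters of the ITU table are shown) illustrates the scheme:

MORSE = {"E": ".", "T": "-", "A": ".-", "N": "-.", "M": "--", "L": ".-.."}  # ITU excerpt

def genesis_encode(text):
    """Dash -> T, dot -> C, G separates letters, A separates words."""
    words = []
    for word in text.upper().split():
        letters = ["".join("T" if s == "-" else "C" for s in MORSE[ch]) for ch in word]
        words.append("G".join(letters))
    return "A".join(words)

print(genesis_encode("LET MAN"))  # CTCCGCGTATTGCTGTC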
The resulting encoding addresses one of the issues of the previous encoding, i.e., it encodes information without
ambiguity. Still, the resulting DNA sequences may suffer from problems like homopolymer runs and imbalanced GC
content and can only be read through sequential access. Also, due to mutations in the bacteria, it was not possible to
recover the original message (undesired modifications occurred) as no error detection or correction means were used in
the encoding.
4 FIRST APPROACHES TO DNA STORAGE
The rst approach with the goal of storing and also reading information [
7
] splits the entire data to be stored in chunks
on information DNA (iDNA) sequences while the metadata (e.g., the order of the chunks) is stored on polyprimer key
(PPK) sequences.
More specically, the iDNA contains a data chunk in its information segment which is anked by common forward
(F) and reverse (R) primer (approximately 10-20 nucleotides) needed for amplication with Polymerase Chain Reaction
(PCR). PCR is a widely-used molecular biology technique for amplifying DNA, that is replicating DNA molecules such
that there is sucient concentration of the sample for sequencing. The iDNA further contains a sequencing primer
(similar length), essentially an identier for the iDNA sequence, and a common spacer (3-4 nucleotides) that indicates
the start of the information segment.
The PPK sequence - the approach currently only uses one such sequence - is flanked by the same forward and reverse
primers and, crucially, contains the unique sequencing primers for each iDNA in the order they have to be read. Both
types of sequences are illustrated in Figure 1.
For the retrieval process, the PPK is first amplified using PCR and sequenced to know the order in which the iDNA
sequences have to be read to reconstruct the initial data. All iDNA sequences are then also amplified and sequenced.
Fig. 1. Illustrations of the structure of the iDNA and PPK sequences. The iDNA carries the information whereas the PPK stores the
order of the information segments in the iDNA sequences.
For the experiment, the text It was the best of times it was the worst of times and It was the age of foolishness it was the
epoch of belief from A Tale of Two Cities by Charles Dickens was chosen. The sentences were picked on the basis that
some words were repeated multiple times and thus act as a test for the robustness of the encoding scheme, as repeating
data can lead to undesirable effects, such as, for example, homopolymer runs or extremes in GC content.
The text is encoded only using the three nucleotides A, C and T, while the sequencing primers are created using all 4
nucleotides, with the additional constraint that each fourth position is a G. Given this encoding approach, information
segments cannot be mistaken for sequencing primers as only the latter contain G nucleotides (at least in each fourth
position).
The text is processed alphabetically by a ternary code, that starts with A and alternates C and T in the third, second
and first positions, e.g., the first letters of the alphabet are encoded as:
A → 000 → AAA    B → 001 → AAC    C → 002 → AAT    ...
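The following Python sketch (ours) reproduces the ternary letter-to-triplet step as we read it, with base-3 digits 0, 1, 2 mapped to A, C, T; the published table may order the digits slightly differently:

DIGIT_TO_NT = {0: "A", 1: "C", 2: "T"}

def letter_to_triplet(letter):
    """Write the letter's alphabet index (A=0 ... Z=25) as three base-3 digits."""
    index = ord(letter.upper()) - ord("A")
    d0, rem = index % 3, index // 3
    d1, d2 = rem % 3, rem // 3
    return "".join(DIGIT_TO_NT[d] for d in (d2, d1, d0))

print([letter_to_triplet(c) for c in "ABCD"])  # ['AAA', 'AAC', 'AAT', 'ACA']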
The two sentences were encoded and recovered successfully. As there are only 27 combinations of the three
nucleotides, encoding more than 27 different letters would scale less efficiently as multiple triplets would be required
for each letter — even the triplet code used by nature for encoding amino acids offers 4³, i.e., 64 possible letters, which
redundantly codes for the 20 amino acids. Further, using only one PPK inherently limits the amount of data that can be
stored. However, the authors argue that one PPK should be enough to store the metadata (e.g., the order) for all iDNA
sequences in a microwell of a microplate.
The method of encoding the information does not avoid homopolymers of length three or more. A further issue
is GC content: not using the G nucleotide in the information segment also means that encoded sequences may result in
an extreme in GC content, i.e., low. The avoidance of GC extremes is important for stability and for reducing sequencing
errors [15]. Neither does the approach use error detection or correction codes.
Although not specically designed for it, the approach allows for random access to the data. The PPK can be read
out to retrieve the primers needed to read a specic iDNA sequence.
iDNA sequences do not strictly have a limit in length but commercial synthesis at the time the work was carried out
meant two of their iDNA sequences of 232 and 247 nt were too long to synthesise and were therefore constructed using
overlapping oligonucleotides. The forward and reverse primers require a total of 40 nucleotides, the sequencing primer
requires a further 20 nucleotides and the spacer ideally only requires 4 nucleotides (as opposed to the experiments
where 19 nucleotides were used for the spacer), leaving 136 nucleotides for encoding information which amounts to
about 45 letters given the encoding used. Assuming 5 bits are used to encode a letter, this results in an information
density — for the payload only — of 1.65 b/nt.
For their experiments, two iDNA sequences, one for each sentence encoded, were used with lengths of 232 and 247
nucleotides respectively. For the sake of analysing the information density, we assume iDNA sequences of 232 and 247
nucleotides length. The PPK contains forward and reverse primers of 20 nucleotides each, the two sequencing primers
of each iDNA sequence each of 20 nucleotides length and two spacers with 4 nucleotides each, totaling 88 nucleotides.
Factoring in the length of the PPK, this approach achieves a final, overall information density of 0.94 b/nt.
A further approach [25] studied encoding and storage in extreme conditions. The objective of this project is the
development of a solution that can survive under extreme conditions, such as ultraviolet, partial vacuum, oxidation,
high temperatures and radiation. For this purpose, the information was stored in the DNA of two well-studied bacteria,
E. coli and Deinococcus radiodurans. The latter is particularly well-suited for extreme environments.
A similar encoding to [7] is used, but all four nucleotides are used to form the triplets. With this, 64 symbols can be
encoded, including letters, digits and punctuation. The first mappings between binary representation and nucleotides
appear to be arbitrarily picked, i.e., to cycle through the nucleotides, and are as follows:
0 → AAA    1 → AAC    2 → AAG    3 → AAT    4 → ACA    ...
To avoid interfering with the natural processes of the organism and to find the encoded message in the genome
of the host organism, it is flanked by two sentinels, essentially two DNA sequences (see illustration in Figure 2). These two
sequences must satisfy two constraints. First, the sentinels must not occur naturally in the host organism such that no
sentinel will be mistaken for natural DNA in the host. Second, the sentinels need to contain triplets like TAG or TAA that
act as markers which tell the bacterium to stop translating the sequence (to proteins). Analyzing the entire genomes of
E. coli and Deinococcus radiodurans, the authors found 25 such sequences with a length of 20 nucleotides.
For the experiment, fragments from the song "It’s a Small World" were used, and the authors successfully stored and
retrieved seven artificially synthesized fragments of length between 57 and 99 nucleotides, each flanked by two sentinels
Fig. 2. Illustration of the sentinels flanking the message in the plasmid.
picked from the set of 25 sentinels, in seven individual bacteria. Since the flanking sequences are known, they can be
used as primers and the retrieval is then performed using PCR.
As discussed for the previous encoding, encoding symbols as triplets limits the alphabet and hence the amount of
information that can be encoded — a maximum of only 64 symbols can be used which, in most cases, is too restrictive to
be of practical use unless multiple triplets are used which then decreases the information density that can be achieved.
Another scalability issue is that the genome of the host has to be sequenced and then searched for appropriate locations
to insert messages as well as for sentinels. Even for microorganisms the search space is in the realm of billions, so
the task is computationally expensive. Moreover, storing information in living organisms is subject to mutations and
environmental factors. To alleviate these issues, one of the "toughest" microorganisms known was used (Deinococcus),
but generally the problem of mutations remains unpredictable and would likely not accommodate large amounts of
information. Care must also be taken that the bacterium is not damaged by accidentally translating or transcribing DNA
that is meant for information storage and is not part of its genome. Finally, homopolymers and GC content are not
considered in the encoding.
In their experiments, the authors encode 182 characters which require 826 nucleotides (including the sentinels).
Assuming 7 bits are needed to encode one letter/character (given that the encoding scheme allows for an alphabet of up
to 128 letters), this yields an overall information density of 1.54 b/nt (182 chars. x 7 bits / 826 nt).
A similar encoding approach [10] is used to encode messages and hide them in a microdot. Alphanumeric values
are mapped to sequences of three nucleotides:
A → CGA    B → CCA    C → GTT    D → TTG    E → GGT    ...
The mapping also includes special characters in a total of 36 symbols. A total of 64 symbols could be encoded with
three nucleotides and so 28 sequences are not used. By picking the sequences carefully, one might avoid homopolymers
or also control the GC content. However, in this approach homopolymers are not avoided and there is no control of GC
content.
For the experiment, the message "JUNE6 INVASION:NORMANDY" was encoded, resulting in a sequence of 69
nucleotides. The sequences are flanked by two 20 nucleotide primers, resulting in an overall length of 109 nucleotides.
Given that the same alphabet could be encoded with 6 bits per symbol, this results in an overall information density of 1.27
b/nt.
Subsequent work [24] is the first to use non-trivial encoding and thus offers a considerable improvement in terms of
information density. The authors propose three encoding schemes: Huffman codes, comma codes and alternating codes.
The development of the three schemes is driven by efficiency and robustness: the former is achieved by packing more
information/bits in the same number of nucleotides as previous approaches and the latter by ensuring that insertions
and deletions can be detected (to some degree). A further design goal is that the output of the encoding should be
clearly recognisable as artificial, so that it cannot be confused with DNA that occurs naturally.
The approaches are theoretical, i.e., have not been tested experimentally, and focus primarily on encoding the
information. Although both the comma code and the alternating code possess basic inherent error detection, these
approaches do not consider another important encoding factor — error-correction, nor do they address incorporation of
primers for amplification or the information retrieval method, i.e., sequential access vs random access.
The Human Code:
This approach follows the classic algorithm introduced by Human. The input language is
encoded such that the most frequent character is encoded with the least number of symbols, and similarly the least
frequent input character is encoded with the most symbols (using variable length encodings). For a given alphabet
and language, this is the optimal code. For the approach, the Human code is used to translate the letters of the Latin
alphabet according to English language character frequencies:
eT(12.7% frequency)
nGC (6.7% frequency)
fACG (2.2% frequency)
zCCCTG (0.1% frequency)
The most frequent letter in the English language is e, and thus it is encoded as a single nucleotide,
T
. As the frequency
decreases, the codeword grows in size. The average codeword length is 2.2 nucleotides. The size of the codewords
ranges between one and five, and the size of the code is 26 (for the letters of the alphabet). While this code is uniquely
decodable and optimal in the sense of being the most economical, it also comes with three disadvantages. Firstly, it is
cumbersome and not optimal for encoding arbitrary data, and in the case of encoding text, the frequency of letters
depends heavily on the particular language. Secondly, due to the varying codeword length, it is difficult to observe a
pattern in the encoded data, such that it could result in sequences similar to naturally occurring data. The avoidance
of naturally occurring sequences could be an advantageous feature of an encoding scheme, as artificial data could be
easily differentiated from natural ones. Thirdly, no error detection or correction codes are incorporated.
Additionally, the encoding scheme does not conform to the requirements to prevent homopolymers and suboptimal
GC content, which, when present in encoded data, would present a considerable challenge for sequencing.
Given that only letters are considered, 5 bits suffice to encode each letter; this yields an overall information density
of 2.27 b/nt.
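For illustration, a 4-ary Huffman code over the nucleotide alphabet can be built as in the Python sketch below (ours; the letter frequencies are illustrative and not the exact table used in the surveyed work):

import heapq
from itertools import count

def huffman_dna(freqs):
    """Build a 4-ary Huffman code: frequent letters receive shorter codewords."""
    tie = count()  # tie-breaker so the heap never compares the code dictionaries
    heap = [(f, next(tie), {letter: ""}) for letter, f in freqs.items()]
    heapq.heapify(heap)
    while (len(heap) - 1) % 3:          # pad so a full 4-ary tree can be formed
        heapq.heappush(heap, (0.0, next(tie), {}))
    while len(heap) > 1:
        total, merged = 0.0, {}
        for nt in "ACGT":               # merge the four least frequent groups
            f, _, codes = heapq.heappop(heap)
            total += f
            merged.update({k: nt + v for k, v in codes.items()})
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]

code = huffman_dna({"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "n": 6.7, "z": 0.1})
print(code["e"], code["z"])  # the most frequent letter gets a shortest codeword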
The comma code: This code uses one nucleotide, G, as a comma to separate all other codewords with a length of
five nucleotides but never uses it elsewhere (i.e., in the codewords themselves). The proposed code uses G as the comma
occurring every six nucleotides:
G∗∗∗∗∗G∗∗∗∗∗G∗∗∗∗∗G...
The ve-nucleotide codewords are limited to use the remaining three nucleotides:
A
,
C
, and
T
with the added constraint
that there must be only three
A
or
T
nucleotides and only two
C
s. Balancing
C
’s and
A
or
T
has the benet that it leads to
a more ecient amplication process, as well as balancing the GC content. The general format of the codewords is
CBBBC, where B∈ {A,T}, and the Bs and Cs can occur in any permutation, leading to 80 dierent possible codewords.
The main advantages of the approach are that it provides simple error-detection capabilities. Insertions and deletions
can easily be detected: codewords which are too long or short have clearly been subject to insertion/deletion. Changes,
i.e., flipping of one nucleotide, can be detected in 83% of the cases. Given a codeword GCBBBC, three possible point
mutations can occur in each position, resulting in 18 single point mutations. Of the single point mutations, only 17%
result in valid codons (flipping A to T or vice versa) and the remaining 83% can be detected.
Using one G and two C nucleotides in one codon means that the GC content is exactly 50% and thus well-suited for
sequencing. A final advantage is that the occurrence of a G nucleotide in every sixth position means that the sequence
can easily be identified as synthetic. While this is not crucial in the context of the data storage, it is a design goal of the
authors.
The disadvantages of the encoding are that no mechanism for error correction is provided. Also, DNA sequences
produced with this encoding can contain homopolymers of length three.
The encoding uses 80 different codons, each with a length of 7 nucleotides. To encode about the same number (128)
of symbols with bits, 7 bits are required. The information density therefore is exactly 1 b/nt.
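The codeword space and the error-detection check can be sketched as follows (our illustration of the scheme described above):

from itertools import product

# All valid five-nucleotide codewords: exactly two C's, the rest A or T.
CODEWORDS = {"".join(w) for w in product("ACT", repeat=5) if w.count("C") == 2}
print(len(CODEWORDS))  # 80

def is_valid_block(block):
    """A received six-nucleotide block must be the comma G followed by a codeword."""
    return len(block) == 6 and block[0] == "G" and block[1:] in CODEWORDS

print(is_valid_block("GCATTC"), is_valid_block("GCATGC"))  # True False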
The alternating code: this code is made of 64 codewords of length 6, of the form XYXYXY, where X ∈ {A,G} and
Y ∈ {C,T} (the X and Y are alternating). The alternating structure is arbitrary and it is argued that other formats, e.g.,
YXYXYX or XXXYYY have the same properties.
The advantages of this method are similar to the comma code: it has a simple structure and a ratio of 1:1 of the
nucleotides, suggesting that it is an artificial code. It also shares the error-detection features of the comma code,
although to a lesser extent (67% of the mutations result in codons that are not part of the code and hence can be
identified as errors). Furthermore, given the suggested structure of XYXYXY, homopolymers of length three are avoided.
Disadvantages include potentially a suboptimal GC content, although this could be avoided by using a structure
such as XYXYXY with X ∈ {C,G} and Y ∈ {A,T}, as well as the lack of error correction codes.
The encoding uses 64 different codons. To encode the same number of symbols in a binary representation, 6 bits are
needed and thus the information density again is exactly 1 b/nt.
A subsequent project [26] avoids explicit error correction codes. Instead, encoded messages are inserted into the
genome of the host organisms repeatedly.
The message is first translated to the Keyboard Scan Code Set 2 [1]. This hexadecimal code is then converted to
binary, and a binary encoding to dinucleotides (pairs of nucleotides) is used to convert the bit sequence into a DNA
sequence. The mappings are illustrated in Figure 3. The message encoded in the experiments is "E=mc^2 1905!".
Redundancy is introduced by encoding the message four times. More precisely, the bit sequence is copied four
times, with each copy flanked by different start and end bit sequences of different length. With the help of these bit
sequences, the DNA sequences can be identified within the genome of the host organism. Owing to the different lengths
of the start and end bit sequences, the resulting DNA sequences are distinct, and should a homopolymer occur, this can
be filtered out and the other start-end bit sequences used instead. More specifically, for the first encoding (C1), the
%12 %24 %12 %4E %3A %21 %55 %1E %29 %16 %46 %45 %2E %12 %16
0001 0010 0010 0100 0001 0010 0100 1110 0011 1010 0010 0001 0101 0101 0001 1110 0010 1001 0001 0110 0100 0110 0100 0101 0010 1110 0001 0010 0001 0110
AA = 0000    CA = 0001    GA = 0010    TA = 0011    AC = 0100    CC = 0101    GC = 0110    TC = 0111
AG = 1000    CG = 1001    GG = 1010    TG = 1011    AT = 1100    CT = 1101    GT = 1110    TT = 1111
Fig. 3. The translation to Keyboard Scan Code (left), the translation to binary (middle) and the mapping from binary to dinucleotides (right).
bit sequence is anked with
0000
and
1111
and hence the DNA sequence is anked with
AA
and
TT
. For the second
encoding, the bit sequence is anked with
111
and
0
and the DNA sequence thus starts with
GT
and ends with
AT
. The
bit sequence for the third and fourth copy are anked with
00
,
00
,
1
and
111
respectively. As the bit shift is less than
four bits for the second, third and fourth copy, the messages are encoded completely dierent in DNA. 4illustrates the
dierent encodings.
Fig. 4. Illustration of the encoding of the same bit sequence, with different start and end bit sequences, four times.
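The mapping of Figure 3 and the flanking step can be sketched as follows (ours; the 8-bit payload in the example is illustrative):

NIBBLE_TO_DINUCLEOTIDE = {
    "0000": "AA", "0001": "CA", "0010": "GA", "0011": "TA",
    "0100": "AC", "0101": "CC", "0110": "GC", "0111": "TC",
    "1000": "AG", "1001": "CG", "1010": "GG", "1011": "TG",
    "1100": "AT", "1101": "CT", "1110": "GT", "1111": "TT",
}

def bits_to_dna(bits, start="0000", end="1111"):
    """Flank the bit sequence (copy C1 uses 0000/1111) and map 4-bit groups to dinucleotides."""
    padded = start + bits + end
    assert len(padded) % 4 == 0, "flanked bit string must be a multiple of 4 bits"
    return "".join(NIBBLE_TO_DINUCLEOTIDE[padded[i:i + 4]] for i in range(0, len(padded), 4))

print(bits_to_dna("00010010"))  # AACAGATT -- starts with AA and ends with TT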
These redundant encoded messages are then inserted in different places into the host organism (in this experiment,
a strain of Bacillus subtilis). To decode the information, the entire genome is sequenced and then searched for start
and end sequences. For this procedure to work, the messages should be above a minimum chosen length. Otherwise,
naturally occurring duplicate sequences might be mistaken for the encoded message. For this experiment and particular
organism (B. subtilis) it is argued that a minimum length of 20 provides enough specificity to avoid the messages being
mistaken for host DNA.
As with other techniques that store information in an organism, evolution and mutation pose problems, since the
message might be altered. This problem is mitigated by using redundant copies placed in different regions of the genome,
although this means the whole genome has to be sequenced, rendering the approach potentially inefficient. The
authors predict that a large-scale system is possible, although as explained, the size of the sequences for this method is
limited.
The approach does not control homopolymers or the GC content. Error detection and correction are also not taken
into account although redundancy helps to remedy errors. In the experiment carried out, however, the data was not
completely recovered as only 99% accuracy was achieved.
The encoding uses two nucleotides to encode four bits and thus suggests an information density of 2 b/nt. However,
when taking into account that every message is copied four times, and that the start and end sequences are 5 bits on
average across the copies, this leads to an overall information density of 0.49 b/nt.
Subsequent work [22] encodes information entirely differently with the goal of avoiding sequencing to read the
information back. They accomplish this by taking the binary information and encoding a 0 with a subsequence that has
a different weight than the subsequence for 1. By doing so, the content of the encoded information can be decoded based
on the weight of the subsequences, making it possible to, for example, use gel electrophoresis instead of sequencing.
In more practical terms, the subsequences representing binary 0’s and 1’s are separated by restriction sites which
allow the encoded information to be cut or separated precisely. The entirety of the encoded information is inserted into the
genome of a bacterium, also flanked by restriction sites (defined subsequences of biological relevance) such that it can be
cut out precisely.
To facilitate decoding of the DNA oligonucleotides by electrophoresis, which identifies molecules by their masses,
this approach uses 4 nucleotides to encode a 1 and 8 nucleotides for a 0. Information density thus depends on the exact
message encoded. In the experiment the message ’MEMO’ was encoded with a 3-bit alphabet — the payload of 12 bits
(4 characters × 3 bits) was encoded in 60 nucleotides. The information density of the payload is, therefore, 0.2 b/nt.
However, including the restriction sites, which are essential for the encoding, an additional 50 nucleotides are required,
resulting in a sequence of 110 nucleotides — the overall information density is thus 0.11 b/nt. This is very low and is
mostly due to the need for restriction sites.
No error detection and correction codes are incorporated in the encoding. Although no particular reference is made
to biological constraints, given the encoding uses a plasmid transfected into a host bacterium, care is needed to ensure
that the sequences introduced do not disrupt cellular activity of the host. As only two different codons are needed — one
for 1 and one for 0 — the respective subsequences can be chosen to (a) balance GC content and (b) avoid homopolymers.
Regarding the former, even though the distribution of 1’s and 0’s may be uneven, as long as the GC content within the
two subsequences is balanced, the GC content of the encoded sequence will be balanced as well.
Having each subsequence representing a bit flanked by restriction sites limits the scalability of the approach as
the number of bits that can be stored depends on the number of unique restriction sites available. At the same time,
however, using restriction sites enables, albeit rather coincidentally, random access to the data.
One of the rst works [
4
] to move beyond a basic proof of concept storing simple text messages, archives text, image
and music. The encoding is fundamentally based on an improved Human code, building upon previous methods
such as [
24
]. The les stored are not digital information stored in a format such as
.mp3, .png
, etc. Instead, music is
encoded as a series of notes, with additional rhythm information, while for images, it is only possible to encode a series
of primitives (circle, ellipse, line, rectangle) along with their position (and orientation).
The approach uses two components, the index plasmid as well as the information plasmid. The index plasmid is
responsible for holding metadata such as title, authors, size of library and primer information. Similarly to Bancroft et
al. [7], the index information is flanked by unique primers, and thus is easy to find, amplify and decode. However, as
opposed to Bancroft, the sequencing primers, needed to find the actual data, are not themselves stored in the index
plasmid, but in the information DNA of the library — this is termed the iDNA in the Bancroft et al approach and we
will reuse the term henceforth when the same method is employed. Thus the index is responsible only for information
and structure describing the actual data.
The approach encodes text, music and images. To distinguish between the type of content encoded, it is prepended
by tx*, mu* and im* respectively.
One of the main advances proposed by this approach is a more general encoding based on the Huffman code.
Previous methods have encoded a limited set of symbols — such as, for example, only letters with a lack of numbers and
punctuation, or a very short message with symbols (albeit a small subset of keyboard scan codes) as in the encoding of
"E=mc^2 1905!" by [26]. However, the important factor is the capacity (i.e., number of symbols) an encoding scheme
offers, rather than the number that are actually used in a demonstration encoding. In this approach, for text encoding, a
complete keyboard input set (including shift, space and so on) is encoded by splitting the symbols into three groups of
under 25 characters each. Each group has a header, denoted by one or a few nucleotides. More specifically, Group 1 → G,
Group 2 → TT and Group 3 → TA. The characters in each group are encoded in decreasing order of their frequency, as
is common for Huffman codes. For example, the letter e is encoded as GCT, where G denotes the group header and CT is
the encoding for e, shift is GTC and g is GAAT, and encoding the word Egg thus becomes GTCGCTGAATGAAT.
For music, a single column is used to provide encodings for note values, pitches and meter. For example, a D half-note
is encoded as CGTT, where G denotes the D and the TT indicates a half-note. Encodings are provided for all other notes.
For images, the graphic primitives along with their properties, e.g., location, size and orientation are encoded. An
excerpt of the encoding for graphic primitives is:
; → G    . → TT    0 → TA    1 → AT    2 → CT    ...
S(s; x1; y1; a) → AAG
R(l; w; x1; y1; a) → AAG
L(x1; y1; x2; y2) → CAA
...
A square (S) is dened by the length of its sides (s), the location of its upper right vertex (x1,y1) and the angle of its
base (a). A line (L) is dened by its two endpoints (x1,y1) and (x2,y2).
For example, a square in location (0,0) with a side of 1 and angle of 0 will be encoded as
AAGATTATATA
. Regardless of
how many graphic primitives are used, the rst one has to be prepended with im* or
GCGTAACTACCA
to indicate the
start of graphic primitives.
In theory, an iDNA library of size up to 10,000 nt could be achieved by introducing sequencing primers (of length
20-30 nt) for every 500 nt fragment. This limitation of the library size that can be inserted into the plasmid arises from the amount
of non-organism DNA that the recipient organism is able to tolerate, whilst still allowing for the successful
amplification of that plasmid. The paper provides an example where 409 nt of text, 113 nt of music, and 238 nt of image
data, with an additional 16 nt for punctuation and 68 nt for three primers, for a total of 844 nt, is used. The information
was retrieved with 100% accuracy, although the authors suggest that sequencing with the three primers should be done
in a particular order (so as to minimise the chance of retrieving some of the plasmid’s DNA), and sequencing in both
orientations might be needed (to maximise the chances of full recovery of the data).
This method comes with the common advantages and disadvantages of storing information in organisms capable of
independent replication (plasmids). While this is the first work to store considerable amounts of data in different forms
(text, music and image data), the implementation is only capable of storing simple, structured data (discrete music notes,
graphical primitives) and not actual recordings or photographs.
The encoding itself uses Huffman encoding to encode characters and music or graphic primitives. For encoding text,
given the distribution/frequency of characters/symbols in the English language, the encoding requires on average 3.5
nucleotides for each of the 71 symbols used. The least number of binary bits required to encode the nearest number
of characters (128) is 7 bits, resulting in an information density of 2 b/nt. Assessing the information density of music
and graphics, however, is not possible as the distribution of the different primitives and numbers (coordinates, length
etc.) is not known.
Generally, the encoding does not take into account biological constraints like GC content or homopolymers. Nor does it
provide any mechanism for error correction and detection. On the positive side, the encoding is very economical, using
very few nucleotides per bit, at least for text.
A further project [2] uses a very simple mapping where one letter of the alphabet is mapped to three nucleotides, e.g.,
O → ATA. The project has primarily artistic purposes. The encoding is used to map article 1 of the Universal Declaration
of Human Rights to synthetic DNA. The DNA is then inserted into bacteria sprayed on apples (to represent the
forbidden fruit as well as the tree of knowledge).
With primarily artistic purposes, the project does not take into account error detection/correction or biological
constraints. It requires full sequencing to retrieve the information, which is stored in a bacterial genome that was then
sprayed onto apples. Given that three nucleotides are used to encode a letter, an alphabet of 64 letters can be encoded,
which corresponds to approximately 6 bits, thus leading to an information density of exactly 2 b/nt.
Other recent work has been specifically developed to focus on efficiently encoding imaging data, particularly JPEG
images [14]. They have developed an encoder that considers the characteristics of the input signal being encoded. To
reduce the high DNA synthesis cost it proposes an end-to-end encoding workflow which optimally compresses an
image before it is stored in DNA, allowing control over the encoding rate and thus reducing the synthesis cost. An
additional advantage of the proposed approach is the use of an encoding algorithm which is robust to biological errors
coming from the synthesis and the sequencing processes. Due to the variable compression factor applied to the JPEG
before the encoding, we have assessed the overall information density achieved for the data that was actually encoded,
and found it achieved 1.6 b/nt.
4.1 Advanced Approaches featuring Error Correction and Random Access
The work [9] by Church et al. is an important milestone in DNA storage, as it is the first to store relatively large amounts
of information (5.27 MB). This is possible due to the use of "Next-generation" (at the time) sequencing and synthesis
technologies. The goal is long-term, archival storage. The encoded information is not inserted into living bacteria - but
is simply synthesised DNA that is stored on a microplate. The archived content is composed of a 53,426-word draft of a
book, 11 JPG images and one JavaScript program.
The approach has multiple advantages, starting with the encoding: a nucleotide encodes one bit (A or C for zero,
G or T for one), instead of two. It is thus possible to encode the same message (as a sequence of bits, i.e., 0s and 1s)
in different sequences of nucleotides. A bit sequence of 1111 can, for example, be encoded as GGGT, thus avoiding a
homopolymer of length greater than three. The decision whether a zero is encoded as A or C (or a one as G or T) is
made at random unless a homopolymer of length four needs to be avoided. The randomness automatically ensures a
somewhat balanced GC content (although it could also be enforced).
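A minimal Python sketch of this bit-per-nucleotide rule (ours, assuming a maximum run length of three as described above):

import random

ZERO, ONE = "AC", "GT"

def encode_bits(bits, max_run=3):
    """0 -> A or C, 1 -> G or T, chosen at random unless that would extend a homopolymer."""
    seq = []
    for b in bits:
        options = list(ZERO if b == "0" else ONE)
        random.shuffle(options)
        choice = options[0]
        if len(seq) >= max_run and all(nt == options[0] for nt in seq[-max_run:]):
            choice = options[1]  # switch to the other nucleotide for this bit value
        seq.append(choice)
    return "".join(seq)

random.seed(0)
print(encode_bits("1111111111"))  # a mix of G and T with no run of four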
The information is split into addressable data blocks (which are encoded individually) to avoid long sequences which
are generally difficult to synthesise and which may also form secondary structures. Each block has a 19-bit address code
which, as each bit maps to one nucleotide (with two possible nucleotides per bit), corresponds to a 19-nucleotide
subsequence that identifies its position. Each sequence is flanked by primers of 22 nucleotides length for amplification when sequencing.
Synthesis and sequencing errors have a low probability of being coincident, e.g., errors are unlikely to occur in the
same location when synthesising and when sequencing. To address error correction and detection, each sequence is
consequently synthesised multiple times, enabling errors to be identified easily when comparing multiple sequences, i.e.,
through consensus calling.
As this encoding uses a single nucleotide to encode each bit, the density achieved in this project is higher than
in previous attempts, at 5.5 petabits/mm³, reinforcing the idea that DNA is suitable as a platform for archiving data.
The one shortcoming is the lack of a more sophisticated encoding (parity checks, error-correction). With the simple
encoding, a number of small errors (10, mostly caused by homopolymers) occurred, indicating that a more robust
technique is needed. The reported long access time (hours to days) is on par with other approaches (due to the slow
speed of sequencing) and is acceptable considering this is an approach for archival data which is accessed only very
infrequently.
The theoretical information density of this approach is 1 b/nt. While this is not particularly good — 2.27 b/nt is the
best — this encoding approach is the first that both avoids homopolymers and also allows the GC content to be balanced.
Taking into account the addressing information (19 nucleotides) and the sequencing primers (twice 22 nucleotides),
only 52 nucleotides can be used to store data (given the sequence length of 115 nucleotides).
This results in an information density of 0.46 b/nt. However, Church et al. have used replication as their error-
correction strategy, and their rationale is that "Because errors in synthesis and sequencing are rarely coincident, each
molecular copy corrects errors in the other copies." [9]. Since no error correction codes are considered, each sequence
needs to be replicated multiple times, thereby also adversely affecting the information density.
Work following this encoded substantial amounts of data for the first time. Goldman et al. [17] encoded binary data
(text, PDF, MP3, JPEG) totaling 757,051 bytes.
Each of the bytes of the files is mapped to base 3 numbers. A Huffman encoding is used to map each byte to either
five or six base 3 numbers which in turn are translated to nucleotides. More specifically, each base 3 number is translated
to a nucleotide, picked such that it is different from the last one (to avoid homopolymers).
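The trit-to-nucleotide step can be sketched as below (ours; the published scheme fixes a specific rotation table, the one here is merely illustrative):

# For each previous nucleotide, the three possible successors, indexed by the trit.
NEXT_NT = {prev: [nt for nt in "ACGT" if nt != prev] for prev in "ACGT"}

def trits_to_dna(trits, prev="A"):
    """Each base-3 digit selects a nucleotide different from the previous one."""
    seq = []
    for t in trits:
        prev = NEXT_NT[prev][t]
        seq.append(prev)
    return "".join(seq)

print(trits_to_dna([0, 0, 0, 2, 1]))  # CACTC -- no two adjacent nucleotides repeat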
The resulting (long) sequence is partitioned into segments of 100 nucleotides, each overlapping the previous by 75
nucleotides, thereby introducing fourfold redundancy, as every 25 nucleotide subsequence will be in four sequences, as
is illustrated in Figure 5. To map each sequence to a file, two base 3 numbers are used, allowing for 9 different files to be stored.
Fig. 5. The Goldman encoding uses fourfold redundancy of each segment.
Fig. 6. The codon wheel translates base 47 numbers to DNA sequences of length three.
To order the sequences within a file, 12 base 3 numbers are used, allowing for 531,441 locations per file. One base 3
number is used as a parity check and appended at the end (encoded such that no homopolymer is created).
The fourfold redundancy provides effective error correction: as each nucleotide is encoded in four of the DNA
segments, any systematic or random errors in synthesis or sequencing can be corrected by majority vote.
The approach is specically designed to avoid homopolymers but no specic mechanism is used to balance the GC
content. Error detection is rst implemented using the parity nucleotide. With it, errors can be detected. The four fold
redundancy then also helps to correct errors by majority vote. The approach is based on Human encoding and so the
information density depends on the data encoded as some bytes are mapped to 5 or 6 base 3 numbers. For the data used
in the experiments, taking into account this redundancy, an overall information density of 1.59 b/nt is achieved.
In subsequent work [18], Grass et al. propose a novel approach which incorporates error detection and correction.
With it, information can be reliably extracted from DNA that is treated to simulate a 2000-year storage period in
appropriate conditions. The amount of information stored, 83 kB organised into 4991 DNA segments of length 158 nt
each (with 117 nucleotides used to encode information), is much less than in [9], but the focus is on the novel encoding, on embedding the DNA in silica, and on simulating the ageing of the DNA for long-term storage.
The encoding mechanism is based on Reed-Solomon (RS) codes, a group of well-studied error-correcting codes which have the property of being maximum distance separable (MDS for short: they achieve equality in the Singleton bound). This property is important as an MDS code with parameters (n, k) has the greatest error detection and correction capabilities of any code with the same parameters. The set of alphabet symbols of an RS code is a finite field, and the number of elements must therefore be a power of a prime number.
Typical input data are files, which can be thought of as sequences of bytes, i.e., of numbers in base 2^8 = 256. The first step of the algorithm is to convert two bytes of the input to three numbers in the finite field GF(47) (called a Galois field, hence GF). The number 47 is chosen as it is the closest prime number to 48, the number of 3-nucleotide sequences that can be constructed to satisfy biological constraints such as avoiding homopolymers. Once the conversion to GF(47) is done, the resulting codewords are arranged into a 594 × 30 block. The codewords are shown in Figure 7 in the red box; the dark red box shows one line of codewords, and a triple of three base 47 numbers represents two bytes. An RS code is applied to the codewords (rows) to add redundancy information of size 119, for a total block length of 594 + 119 = 713. This is called the outer RS code (on the left). In the next step, an index of size 3 is added to each column (top), yielding a vector of size 33, followed by the application of a second (inner) RS code (bottom) which adds 6 redundancy symbols, for a total of 39. The index serves to identify each column (one column is stored in one DNA sequence) and, given its length of 3 base 47 numbers (47^3 combinations), allows 103,823 columns to be addressed. The inner RS code is added to correct individual nucleotide errors, whereas the outer RS code helps to correct erroneous sequences or to recover completely lost sequences.
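The first conversion step can be illustrated with a short sketch: two input bytes form a value between 0 and 65,535, which always fits into three base-47 digits because 47^3 = 103,823; the subsequent mapping of each digit to one of the 47 allowed 3-nucleotide codons via the codon wheel (Figure 6) is not reproduced here.

```python
# Sketch of the first step of the Grass et al. encoding [18]: two input bytes
# (a value in 0..65535) are re-expressed as three digits in base 47, which fits
# because 47^3 = 103,823 > 2^16. Each digit would then be mapped to one of the
# 47 allowed 3-nt codons via the codon wheel (not shown here).

def bytes_to_gf47(b0, b1):
    value = b0 * 256 + b1
    return [value // 47 ** 2, (value // 47) % 47, value % 47]

def gf47_to_bytes(d2, d1, d0):
    value = d2 * 47 ** 2 + d1 * 47 + d0
    return value // 256, value % 256

digits = bytes_to_gf47(0x48, 0x65)        # two bytes -> three base-47 digits
assert gf47_to_bytes(*digits) == (0x48, 0x65)
```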
Fig. 7. Approach to add index as well as inner and outer error correction codes to data.
The columns (yellow rectangle) are uniquely mapped to DNA sequences of length 117 by mapping each element/number to three nucleotides, as described above. Non-random primers, referred to as sequencing adapters, are added to flank the resulting sequence.
For the decoding process, the DNA is read using the Illumina MiSeq platform. On the sequences read, the inner code is decoded first, followed by the decoding of the outer code. In the experiments the inner code fixes on average 0.7 nucleotide errors per sequence, while the outer code fixes on average the loss of 0.3% of the total sequences and corrects about 0.4% of them.
The information was completely recovered, without errors. Furthermore, an experiment where the DNA suffers decay (achieved by thermal treatment) equivalent to one and four half-lives (about 500 and 2000 years, respectively) shows that the encoding scheme, aided by half-life-extending storage in silica, can retrieve the information without errors even under these conditions.
The encoding is the first to properly incorporate error detection and correction codes (without resorting to replication). In this encoding, the RS codes are chosen such that they can fix 3 arbitrary errors per sequence, correct 8% of the sequences if they are incorrect, and recover 16% of all sequences if they are missing. The avoidance of homopolymers and the balancing of the GC content are also incorporated through the use of carefully picked subsequences (in the codon wheel of Figure 6).
Only 90 nucleotides of the 158-nucleotide sequences carry actual information (20 bytes, 160 bits), whereas the remainder is used for the index, error correction and primers. The overall information density achieved by this method is 1.01 b/nt.
The major novelty of the work of Bornholt et al. [8] is the concept of random access, a feature missing from previous approaches. While the goal is still archival storage, with read-write times similar to other implementations (hours to days), the addition of random access enables a new level of efficiency, as it is no longer necessary to sequence all DNA sequences.
Previous work stored the DNA in a single pool, which comes with several disadvantages. To distinguish between different types of encoded data, i.e., files, different primers are needed, and this in turn increases the risk of erroneous interactions, e.g., primers initiating reactions at undesired positions of the oligonucleotides, or primers intended for one oligonucleotide initiating the reaction on another. In addition, it is unlikely that a random read will contain all the data. On the other hand, a separate pool for each object represents too big a sacrifice in terms of density. As such, the proposed solution is to use a library of pools of a fixed size, thus balancing the two issues. In theory, the system provides a simple key-value interface with two basic operations: a "store" operation, put(key, value), that associates the key with the value, and a "read" operation, get(key), that retrieves the value assigned to the key. In practice, this is achieved by mapping a key to a pair of PCR primers. When writing, the primers are added to the data strand. At read time, the primers are used for PCR amplification of only the desired data. As DNA molecules do not have a particular spatial organisation, each encoded strand must contain an address that identifies its position in the original data stream.
As before, to encode a large amount of information, the data is split into blocks, which have the following structure: primers at both ends, a payload of information flanked by two sense nucleotides that aid the decoding, and an address (see Figure 8).
Fig. 8. Strand structure containing primer, address and payload.
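The key-value abstraction can be sketched as follows; the primer sequences, the strand layout and the in-memory pool below are illustrative placeholders for what is, in practice, a physical DNA pool and a PCR amplification step.

```python
# Sketch of the key-value abstraction described by Bornholt et al. [8]: a key is
# mapped to a PCR primer pair; put() tags every strand of the value with that
# pair, and get() selects (in the wet lab: amplifies) only strands carrying it.
# The primer table and strand layout are illustrative placeholders.

PRIMERS = {"file_a": ("ACACGT...", "TGTGCA..."), "file_b": ("GGATCC...", "CCTAGG...")}

pool = []  # the single DNA pool, as a list of (fwd, address, payload, rev) strands

def put(key, payload_strands):
    fwd, rev = PRIMERS[key]
    for address, payload in enumerate(payload_strands):
        pool.append((fwd, address, payload, rev))

def get(key):
    fwd, rev = PRIMERS[key]
    hits = [(addr, payload) for (f, addr, payload, r) in pool if (f, r) == (fwd, rev)]
    return [payload for _, payload in sorted(hits)]  # reassemble by address

put("file_a", ["AACG", "GTCA"])
assert get("file_a") == ["AACG", "GTCA"]
```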
The payload represents the actual data to be stored. As mentioned, this is a fragment of the entire information that
needs to be encoded, and its length can be adjusted depending on the chosen strand length. The two sense nucleotides
specify if the strand has been reverse complemented, which is helpful during decoding. The primers are used, as usual, in the PCR amplification process in order to select the desired blocks. The address serves two purposes. First, it associates a key with a block, so that the put and get operations can be implemented. Second, the address also indexes the block within the value associated with the key.
As there is a unique mapping from key to primer sequences, all the strands of a particular object will share a common primer (because an object is associated with a single key). Thus the read operation is simply a PCR amplification using that particular key's primer. The strands retrieved for a specific object can then be rearranged to recover the original information based on their address.
Fig. 9. Mapping of ternary codes to nucleotides.
The encoding from binary information to nucleotides is similar to previous approaches. Bytes are transformed using a ternary Huffman code to 5 or 6 ternary digits. The ternary strings are then converted to nucleotides using a rotating code that avoids homopolymers (the encoding of a 0, 1 or 2 depends on the previous nucleotide, such that sequences of repeating nucleotides are avoided). The principle of the mapping is illustrated in Figure 9. For example, the binary string 01100001 is mapped to the base 3 string 01112, which can be encoded in DNA as CTCTG. A Huffman code is used as an intermediary step to map more common characters to 5 ternary digits, while less frequent ones map to 6 digits.
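The rotating code can be reproduced with a small sketch; the rotation table and the choice of A as the initial "previous" nucleotide are assumptions chosen to be consistent with the example given in the text.

```python
# Sketch of a rotating ternary-to-nucleotide code of the kind used by Goldman
# et al. [17] and Bornholt et al. [8]: the nucleotide chosen for a trit depends
# on the previously emitted nucleotide, so the same nucleotide never repeats.
# The rotation table and the initial "previous" nucleotide (A) are assumptions
# chosen to reproduce the example in the text.

ROTATION = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}

def trits_to_dna(trits, prev="A"):
    out = []
    for t in trits:
        prev = ROTATION[prev][t]   # pick a nucleotide different from the previous one
        out.append(prev)
    return "".join(out)

# The byte 01100001 is Huffman-coded to the trits 0,1,1,1,2 (not shown here),
# which this code maps to the sequence given in the text:
assert trits_to_dna([0, 1, 1, 1, 2]) == "CTCTG"
```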
A novel way to provide error detection and correction through redundancy is introduced as well. The inspiration comes from the encoding used by Goldman et al. [17]. Essentially, the Goldman encoding splits the DNA nucleotides into four overlapping fragments, thus providing fourfold redundancy (see Figure 5). This encoding is taken as a baseline to evaluate the results of the current scheme. Bornholt et al. propose an XOR encoding, in a similar fashion to RAID 5. With this approach, the exclusive-or operation is performed on the payloads A and B of two strands, which produces a new payload A XOR B. The address of the newly created block can be used to tell whether it is an original strand or the result of the exclusive-or operation. Any two of the three strands A, B and A XOR B are sufficient to recover the third. The major advantage of this method is that it offers similar reliability to the Goldman encoding, while being more efficient in terms of information density. The Goldman encoding repeats each nucleotide (up to) four times, while in the Bornholt encoding each nucleotide is repeated 1.5 times on average (see Figure 10).
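The XOR redundancy itself is straightforward to sketch; the payloads are shown here as raw bytes, i.e., before the ternary conversion and nucleotide mapping described above.

```python
# Sketch of the Bornholt et al. XOR redundancy [8], analogous to RAID 5: a third
# strand carries A XOR B, and any two of the three payloads recover the missing
# one. Payloads are shown as raw bytes, before mapping to nucleotides.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

payload_a = b"\x01\x02\x03\x04"
payload_b = b"\xff\x00\xaa\x55"
parity = xor(payload_a, payload_b)             # stored as an extra strand

# Losing payload_b is not a problem: it is recovered from A and the parity strand.
assert xor(payload_a, parity) == payload_b
```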
Overall, this design introduces a remarkable feature, random access, as a proof of concept. It incorporates error detection and correction and avoids homopolymers. The information density of the payload alone using this encoding is 0.88 b/nt. However, the redundancy is provided by repetition with a (tunable) factor of 1.5, meaning that the overall information density is in fact 0.59 b/nt.
Other work that implements this feature was developed concurrently by Yazdi et al. [27]. The approach (or rather the experiments) uses rather long DNA sequences of 1000 nucleotides, divided into an address of 16 nucleotides while the remainder is used for encoded information.
Fig. 10. The XOR of the payloads of two different strands is used in a parity strand to allow for error correction.
The encoding of the information and the design of the addresses follow three goals. First, none of the (distinct) addresses should be part (a subsequence) of any of the encoded information (weak mutual correlation). Second, the addresses should be as distinguishable as possible. The third goal is to balance the GC content and to avoid homopolymers. The method developed generates two sets S1 and S2 of codewords. S1 is used for the addresses, and the Hamming distance is used to make sure they are distinguishable from each other. S2 contains the codewords used to encode information. Properties such as weak mutual correlation, GC content and avoidance of homopolymers are provable with this encoding approach.
In the experiments the authors encode the image of a black-and-white movie poster (Citizen Kane) as well as the colour image of a smiley face. The information is first compressed and then encoded in 17 sequences, each with a length of 1000 nucleotides, by picking an address from S1 and mapping the binary content of the images to codewords from S2.
Errors are detected and corrected during post-processing, after sequencing. Using PCR before sequencing means that each sequence is replicated numerous times. High-quality reads are identified as those whose address contains no errors. Following this, the high-quality reads are aligned and a consensus is computed. The alignment uses the possible codewords as hints. In the experiments, most errors at this stage are deletions in homopolymers of length at least two. The remaining errors are further corrected by taking into account the GC balance, which helps to determine which homopolymers are likely. A constraint is applied to subsequences of 8 nt which ensures that the run lengths of the nucleotides, including any homopolymers, sum to 8, which aids in the detection of deletions. To balance the GC content, the nucleotide subsequence is converted to binary and is selected only if the resulting word is balanced, i.e., it contains half 1s and half 0s.
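The balance rule can be illustrated with a short sketch that maps G and C to 1 and A and T to 0 and accepts an 8-nucleotide window only if the resulting binary word contains as many 1s as 0s; the binary mapping and the window handling are assumptions for illustration and not the published codebook construction.

```python
# Illustrative sketch of the 8 nt balance rule described for Yazdi et al. [27]:
# mapping G/C to 1 and A/T to 0, a window is accepted only if its binary image
# contains exactly as many 1s as 0s (i.e. 50% GC). The mapping and window
# handling are illustrative assumptions, not the published codebook.

def gc_balanced(window: str) -> bool:
    bits = [1 if nt in "GC" else 0 for nt in window]
    return len(window) == 8 and sum(bits) == 4

assert gc_balanced("ACGTACGT")       # 4 of the 8 nucleotides are G/C: accepted
assert not gc_balanced("GGGGCCAA")   # 6 of the 8 are G/C: rejected
```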
To sum up, this encoding approach avoids homopolymers and also balances the GC content. It does not use any error detection or correction mechanism when encoding; errors are detected and corrected in post-processing. In terms of information density, the images have a size of 29,064 bits (compressed) and are encoded with 16,880 nucleotides (16 × 1000 plus one sequence of 880), thus resulting in 1.72 b/nt. As in the work by Church et al. [9], this method uses replication by PCR for error correction. This means that, in the absence of a stated replication factor, even under the conservative assumption that each oligonucleotide is copied 4x (it is likely much more, as the replication method is PCR amplification), the resulting overall information density achieved is 0.43 b/nt.
Subsequent work [6] focused on encoding structured database information and implementing database operations. Two different encodings are discussed: one for storing structured database information, i.e., relational tables, and one for implementing database operations.
The rst encoding exploits the inherent structure in databases, i.e., that each attribute of a record in a table can be
linked to the record using the primary key. Doing so means that attributes of the same record can be distributed across
dierent DNA sequences without the need for addressing. Instead the primary key is used for this purpose, reducing
the space needed for the address. 11 illustrates how one record of a table is stored in multiple sequences.
Fig. 11. Shredding a table into multiple sequences each containing the primary key.
More specically, dictionary encoding is used to compress the information. The dictionary is encoded in DNA as
well. Subsequently, as many attributes as possible are stored in a DNA sequence along with the primary key (to link
together attributes of the same record). A parity nucleotide is added to each DNA sequence for error detection. After
sequencing, the parity nucleotide and length of the DNA sequence are used to discard invalid sequences. The remaining
sequences are aligned to compute a consensus. In the experiments, based on a subset of the database benchmark
TPC-H[3], multiple tables are encoded, synthesized, sequenced and fully recovered.
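A possible reading of this shredding scheme is sketched below; the chunking of attributes, the parity rule and the field layout are illustrative assumptions rather than the published format.

```python
# Illustrative sketch of the record "shredding" in OligoArchive [6]: every DNA
# sequence carries the record's primary key, as many (dictionary-encoded)
# attribute values as fit, and a trailing parity nucleotide for error detection.
# The chunk size, the parity rule (sum of nucleotide indexes modulo 4) and the
# absence of field separators are assumptions made for illustration only.

NT = "ACGT"

def parity_nt(seq: str) -> str:
    return NT[sum(NT.index(nt) for nt in seq) % 4]

def shred(primary_key: str, encoded_attributes, per_strand=2):
    strands = []
    for i in range(0, len(encoded_attributes), per_strand):
        payload = primary_key + "".join(encoded_attributes[i:i + per_strand])
        strands.append(payload + parity_nt(payload))
    return strands

# One record, already dictionary-encoded to nucleotides, split over two strands.
print(shred("ACGA", ["TTGA", "CAGT", "GGAT"]))
```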
The encoding used is based on the previous work by Church et al. [9] discussed above and thus avoids homopolymers and also allows the GC content to be balanced. Error detection is possible through the parity nucleotide, but correction is left to the decoding process (through computing the consensus). The Church approach encodes one bit per nucleotide for the payload alone, and consequently this approach achieves the same payload density.
The second encoding developed in the same work enables processing of the data, such as selective retrieval of records based on their value as well as database joins, i.e., finding pairs of records which agree on a particular attribute.
To enable these operations, each attribute is stored in one DNA sequence. To simplify the design, fixed-size hash values are computed for the variable-length fields. More particularly, the table and attribute names as well as the primary key are hashed to a fixed length and arranged in a DNA sequence together with the value and error correction codes, as shown in Figure 12.
PCR is used to retrieve all DNA sequences encoding a particular value for a specific attribute. Similarly, overlap extension PCR is used to implement a join by annealing matching sequences/attributes together, as shown in Figure 13.
In both cases, the encoding of the value must also serve as a primer and must therefore be designed specially. More specifically, to avoid retrieval of similar DNA sequences or joining of similar sequences, similar values must be encoded substantially differently. To do so, the encoding outlined before is used, but the checksum of the attribute value is computed (using SHA3) and encoded as well. To then make the encodings of similar attributes substantially different, the encoding of an attribute a as well as the checksum c is separately split into subsets (a1, a2, a3, . . . and c1, c2, c3, . . . ) and one is interleaved with the other (resulting in a1, c1, a2, c2, a3, c3, . . . ). Thanks to the avalanche effect of SHA3 (and other
cryptographic hash functions), whereby small differences in the input value lead to considerable differences in the checksum, the resulting sequence will be substantially different even if the attribute values are similar.
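The interleaving can be sketched as follows; the chunk size and the truncation of the SHA3 digest to the length of the attribute encoding are illustrative choices.

```python
# Sketch of the value/checksum interleaving used for the OligoArchive join
# encoding [6]: the attribute encoding a and its SHA3 checksum c are each split
# into chunks (a1, a2, ... and c1, c2, ...) and interleaved as a1 c1 a2 c2 ...,
# so that, through the avalanche effect, similar values yield very different
# strands. Chunk size and the truncated digest length are illustrative choices.

import hashlib

def interleave_with_checksum(value: bytes, chunk: int = 4) -> bytes:
    digest = hashlib.sha3_256(value).digest()[:len(value)]  # c, truncated to |a|
    a_chunks = [value[i:i + chunk] for i in range(0, len(value), chunk)]
    c_chunks = [digest[i:i + chunk] for i in range(0, len(digest), chunk)]
    return b"".join(a + c for a, c in zip(a_chunks, c_chunks))

# Two similar attribute values produce very different interleaved encodings.
print(interleave_with_checksum(b"customer_001").hex())
print(interleave_with_checksum(b"customer_002").hex())
```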
Fig. 12. Structure of the encoding.
Fig. 13. Overlap extension PCR used to find matching tuples/sequences.
Here too, the encoding used is based on the previous work by Church et al. [9] discussed above and thus avoids homopolymers and also allows the GC content to be balanced. Error correction and detection are incorporated using Reed-Solomon codes. The Church approach encodes one bit per nucleotide for the payload alone, and consequently this approach achieves the same payload density.
In other work, by Organick et al. [21], 35 distinct files, totaling over 200 MB, were encoded in a pool of more than 13 million 150-basepair oligonucleotide sequences. The work focused on error-free recovery of the encoded files from Illumina sequencing using a Reed-Solomon encoding, and on doing so using a random-access approach to reading the encoded data, as in earlier work [7], [9]. Using an approach similar to that of Bancroft et al., which used information DNA (iDNA) and polyprimer key (PPK) sequences, they partitioned each file's data into chunks, each of which is assigned to a large number of oligonucleotides, all sharing the same primer (representing the file ID), and ordered using a strand-specific ID [7]. A subset of primers is selected from a larger pool by performing BLAST sequence alignment and discarding sequences with similar identity (i.e., similarity). To avoid homopolymer runs, the input data for the files are XORed with a pseudo-random number sequence. A ternary code is then used to map the bytes to the DNA sequence, and a Reed-Solomon code is applied to the payload for error correction. To retrieve the data from the DNA, their decoder uses a clustering algorithm whereby oligonucleotides are clustered on sequence identity. A variation of the BMA (Bitwise Majority Alignment) algorithm is used to identify insertions, deletions and substitutions and to recover the original sequence via a consensus of each nucleotide position of the aligned reads. The XOR process is repeated with the same pseudo-random number sequence in order to restore the original data, with any repeats. The outer Reed-Solomon code is then decoded, yielding the original data. The authors demonstrated that 200.2 MB of data (35 files) can be recovered without any errors (byte-for-byte equal to the original) from the DNA encoding using Illumina sequencing, with an overall information density of 0.81 b/nt and an average sequencing depth of 5x. In a separate experiment, the authors encoded 0.033 MB of data and sequenced the oligonucleotide pool using an Oxford Nanopore sequencer. In this case, a sequencing depth between 36x and 80x was required, achieving the same information density as with the Illumina data (the encoding and decoding are the same). This reflects the presently noisy nature of the Oxford Nanopore device, which requires mitigation by significantly increasing the sequencing depth.
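The randomisation step can be sketched as follows; the use of Python's random module as the keystream generator (rather than the specific pseudo-random sequence used by the authors) is an illustrative assumption.

```python
# Sketch of the randomisation step described for Organick et al. [21]: the input
# bytes are XORed with a pseudo-random keystream before being mapped to
# nucleotides, which breaks up long runs (homopolymers); XORing again with the
# same seeded keystream restores the original data. Using Python's random module
# as the keystream generator is an illustrative assumption.

import random

def xor_with_keystream(data: bytes, seed: int = 42) -> bytes:
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)

data = b"\x00" * 16                            # would otherwise map to a long homopolymer
scrambled = xor_with_keystream(data)           # looks random, few repeated values
assert xor_with_keystream(scrambled) == data   # the same seed undoes the XOR
```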
4.2 DNA Fountain codes
Another approach to encoding information, proposed by [16], uses fountain codes [20] to encode the data and Reed-Solomon codes to protect the seed. In this work, 2,146,816 bytes (2.14 MB) of data were encoded from a tarball that packaged several files, including a complete graphical operating system of 1.4 MB and a 52-second movie.
The encoder comprises three steps. The first preprocesses the binary data by compressing it into a tarball using gzip and then partitioning the file into 32-byte chunks (although this can be a user-defined length). 32 bytes (256 bits) was chosen because it generates oligonucleotide lengths that fall within their manufacturer's limits. The second step uses a pseudo-random number generator to select 32-byte chunks on which a Luby transform is performed. This produces droplets, i.e., collections of data chunks to be converted into DNA during synthesis. This step employs the bitwise addition of chunks modulo 2, the attachment of a random seed (which is protected with a Reed-Solomon code), and the conversion of the binary droplet into a DNA oligonucleotide whereby the binary values 00, 01, 10 and 11 are mapped to A, C, G and T, respectively. The third step screens the droplet oligonucleotides, filtering out those that do not conform to the biological constraints, i.e., no homopolymer runs and a balanced GC content. The approach achieves a payload-only information density of 1.98 b/nt and an overall information density of 1.57 b/nt.
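The droplet construction can be sketched as follows; the fixed degree of two, the four-byte seed and the screening thresholds are illustrative simplifications, as the published scheme draws the droplet degree from a soliton distribution and protects the seed with a Reed-Solomon code.

```python
# Sketch of DNA Fountain droplet construction [16]: a random seed selects data
# chunks, the selected chunks are XORed together (the Luby transform), the seed
# is prepended, the binary droplet is mapped two bits per nucleotide
# (00/01/10/11 -> A/C/G/T) and droplets violating the biological constraints are
# discarded. Degree, seed size and screening thresholds are illustrative.

import random

NT = "ACGT"

def droplet(chunks, seed, degree=2):
    rng = random.Random(seed)
    picked = rng.sample(range(len(chunks)), degree)
    payload = chunks[picked[0]]
    for idx in picked[1:]:
        payload = bytes(a ^ b for a, b in zip(payload, chunks[idx]))  # addition mod 2
    binary = seed.to_bytes(4, "big") + payload                        # seed travels with the droplet
    return "".join(NT[(byte >> shift) & 0b11] for byte in binary for shift in (6, 4, 2, 0))

def acceptable(dna, max_run=3, gc_low=0.45, gc_high=0.55):
    gc = sum(nt in "GC" for nt in dna) / len(dna)
    no_long_run = all(dna[i:i + max_run + 1] != dna[i] * (max_run + 1)
                      for i in range(len(dna) - max_run))
    return gc_low <= gc <= gc_high and no_long_run

chunks = [bytes([i] * 32) for i in range(100)]      # toy 32-byte data chunks
screened = []
for seed in range(200):
    dna = droplet(chunks, seed)
    if acceptable(dna):                             # keep only constraint-satisfying droplets
        screened.append((seed, dna))
```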
4.3 Recovery from DNA Errors at the Exabyte Scale
The encoding schemes discussed so far only provide error correction that is able to recover from substitution and deletion events, but not from insertions. Recent work by Press et al. [23], which also uses Reed-Solomon codes, recognises that for DNA storage to be viable, encoding schemes must be able to correct for all three types of error event, i.e., substitutions, insertions and deletions, and that the error rate of an encoding scheme must be low enough for it to scale to encoding and successfully retrieving large amounts of data. To this end, they have devised HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search), an error-correcting code that is able to repair all three basic types of DNA errors: insertions, deletions, and substitutions.
HEDGES is used as an inner code in combination with a Reed-Solomon outer code. Data to be stored is first organised in blocks called message packets or DNA packets. An individual packet comprises an ordered set of 255 strands, each of which has the same fixed length, allowing insertions or deletions to be identified, and contains a packet ID and strand ID. The bit stream is encoded with the HEDGES encoding algorithm, and the generated indexes for each strand (packet ID and strand ID) are protected using an encoding salt; the encoding process uses a hash and is depicted in Figure 14. A Reed-Solomon outer code is then applied diagonally across the DNA strands within a message packet. The salt enables errors to be converted to erasures that can later be corrected by the Reed-Solomon outer code.
The authors suggest that the advantages of using HEDGES as an inner code are: i) strands are of a fixed length (when un-corrupted), ii) recovering synchronisation has the highest priority, iii) deletions that are known are less problematic than substitutions that are unknown, as RS can correct twice as many deletions as substitutions, iv) burst errors within a single byte are less problematic than distributed errors, as RS corrects one byte at a time, and v) error-free messages can be recovered despite residual errors, as long as the byte errors and deletions are within the RS code's capacity.
Decoding of the information uses an expanding tree of hypotheses for each bit, on which a greedy search is performed, as depicted in Figure 15.
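A much simplified sketch of the hash-driven inner encoding of Figure 14 is given below; the hash function, the bit widths and the one-bit-per-nucleotide code rate are illustrative assumptions, and the greedy hypothesis-tree decoder of Figure 15 is not shown.

```python
# Much-simplified sketch of a HEDGES-style inner encoding [23] (cf. Figure 14):
# every message bit becomes one nucleotide whose identity is determined by a
# hash of a salt, the low bits of the bit index and the previously encoded bits,
# plus the bit value itself. A decoder can regenerate the expected candidates at
# every position and diagnose mismatches and shifts as substitutions, insertions
# or deletions. The hash, the bit widths and the 1 bit/nt rate are illustrative
# assumptions; the greedy hypothesis-tree decoder is not shown.

import hashlib

NT = "ACGT"

def hedges_encode(bits, salt=b"packet-00", prev_window=8, low_index_bits=8):
    seq, prev_bits = [], 0
    for i, bit in enumerate(bits):
        context = salt + bytes([i & ((1 << low_index_bits) - 1), prev_bits])
        h = hashlib.sha256(context).digest()[0]        # stand-in for the HEDGES hash
        seq.append(NT[(h + bit) % 4])                  # hash value plus current bit selects the base
        prev_bits = ((prev_bits << 1) | bit) & ((1 << prev_window) - 1)
    return "".join(seq)

print(hedges_encode([1, 0, 1, 1, 0, 0, 1, 0]))         # 8 bits -> 8 nucleotides
```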
To test the HEDGES algorithm, the authors used both in-silico simulation and in-vitro synthesis of pools of oligonucleotides. For the in-silico testing, nucleotides were artificially generated across binned ranges of information densities and error probabilities. For each information density r in (0.166, 0.250, 0.333, 0.500, 0.600, 0.750) and each estimated error
probability P_err in (0.01, 0.02, 0.03, 0.05, 0.07, 0.10, 0.15), 10 packets (10^6 nucleotides) were encoded. These were pooled and replicated to equate to a sequencing depth of 5. It was then verified, at the various information densities and error probabilities, that the pooled and duplicated strands could be decoded. This is contingent on the packets being recovered, ordered, and having the Reed-Solomon code applied to them in order to recover the error-free message, up to a maximum tolerable error probability. The rate of bit and byte errors was computed for the HEDGES output, i.e., for each information density as a function of the error probability. This was then used to extrapolate to the Petabyte and Exabyte scales. The authors observed that the rate of byte errors was less than 8x the rate expected for independent bit errors and postulate that this is because decode errors tend to occur in bursts.
For the in-vitro work, 5,865 DNA strands of 300 nt were synthesised, each with a payload of 256 nucleotides flanked by two primers of 23 nucleotides at their 5' (head) and 3' (tail) ends. These were degraded either via error-prone PCR mutagenesis or by incubation at high temperature and then sequenced at a depth of approximately 50. The authors calculated the end-to-end DNA error rates, that is, errors introduced during synthesis, sample handling and storage, preparation, and finally sequencing, for insertions, deletions and substitutions, which were observed to be 17%, 40% and 43%, respectively.
4.4 Summary of Comparison
In Table 1 below we compare all the approaches discussed previously. We focus on comparing them with respect to their storage mechanism (organism or microplate), information density (b/nt), error detection and correction mechanisms, as well as consideration of biological constraints (avoiding homopolymers and balancing GC content). The information density achieved is shown in two columns: the first is calculated over all nucleotides, including primers, sequencing adapters, indexes, and any replication of nucleotides; the second is for the payload only.
Fig. 14. HEDGES encoding scheme. Each bit is encoded using a hash which is a function of s bits of a Salt, the low q bits of the index and p previously encoded bits.
Fig. 15. The HEDGES decoding algorithm. A greedy search is performed on an expanding tree of hypotheses. The hypothesis simultaneously guesses one or more message bits v_i, its bit position index i, and its corresponding DNA character position index k. To limit exponential tree growth, a "greediness parameter" is used (see Supplementary Text of [23]).
5 DISCUSSION
We have systematically reviewed research on various approaches to encoding information in DNA for storage and retrieval. The criteria we have used, which are key to assessing DNA storage approaches, are defined in Section 2.1. To recap, these are: the storage medium used, the biological constraints observed, the error correction implemented, the information density achieved, and the data retrieval method employed.
With respect to the storage medium, early approaches in particular tend to store information in organisms (inserting the DNA into the genetic information of a living organism), whereas later ones store the synthetic DNA on microplates. Storing information in living organisms has the benefit that it can be passed on in highly resistant organisms, e.g., bacteria resistant to adverse conditions (extreme temperatures and similar), over many generations and thus for a very long time. At the same time, however, recombination of the DNA when the cells of the organism divide is an error-prone process, and so errors may be introduced into the stored information. Using living organisms also limits the amount of information that can be stored, because the sequence inserted into the organism cannot be too long without incapacitating the organism.
Given the dependence of operations on DNA on biochemical and molecular processes, we have also looked at whether the approaches we reviewed adhere to biological constraints, e.g., whether long homopolymers (repeats of the same nucleotide longer than 3) and extreme GC content are avoided. Both of these biological constraints are necessary to improve the stability of the sequence and reduce errors during synthesis and sequencing; more than half
Table 1. Comparison of DNA encoding schemes

| Reference | Storage Mechanism | Encoding Alphabet | Overall Information Density (b/nt) | Payload Only Information Density (b/nt) | Error Correction and Detection | Biological Constraints | Access Mechanism |
|---|---|---|---|---|---|---|---|
| Microvenus [13] | Organism | [0,1] (bit-pattern) | 1.25 | 1.94 | - | - | sequential |
| Genesis [19] | Organism | [A-Z] | 1.52 | 1.52 | - | - | sequential |
| [7] | Microplate | [A-Z] | 0.94 | 1.65 | - | - | random |
| [25] | Organism | [A-Z, a-z, 0-9, !] 64 symbols | 1.54 | 1.62 | - | - | sequential |
| Microdot [10] | Microplate | [A-Z, a-z, !] 64 symbols | 1.27 | 2.00 | - | - | sequential |
| Huffman codes [24] | Microplate | [A-Z] | 2.27 | 2.27 | - | - | sequential |
| Comma codes [24] | Microplate | [A-Z, a-z, 0-9, !] 80 symbols | 1 | 1 | detection | balanced GC content | sequential |
| Alternating codes [24] | Microplate | [A-Z, a-z, !] 64 symbols | 1 | 1 | detection | no homopolymers | sequential |
| [26] | Organism | [A-Z, a-z, 0-9, !] entire keyboard | 0.49 | 1.95 | replication (x4) | - | sequential |
| [22] | Organism | [A,E,I,M,O,S,U,Z] | 0.11 | 0.2 | replication (cloning) | - | electrophoresis |
| [4] | Organism | [A-Z, a-z, 0-9, !] entire keyboard | 2.00 | - | - | - | sequential |
| [2] | Organism | [A-Z] up to 64 symbols | 2.00 | 2.00 | - | - | sequential |
| [9] | Microplate | [A-Z, a-z, !] UTF-8 (8 bits) 256 symbols | 0.46 | 1.00 | replication (PCR) | no homopolymers, balanced GC content | random |
| [17] | Microplate | [A-Z, a-z, !] (8 bits) 256 symbols | 0.39 | 1.59 | error detection and correction | no homopolymers | sequential |
| [18] | Microplate | [A-Z, a-z, !] (8 bits) 256 symbols | 1.01 | 1.37 | error detection and correction | no homopolymers, balanced GC content | sequential |
| [8] | Microplate | [A-Z, a-z, !] (8 bits) 256 symbols | 0.59 | 0.88 | replication (x1.5) | no homopolymers | random |
| [14] | - | 2K codewords | 1.6 | - | error detection | no homopolymers, balanced GC content | sequential |
| [27] | Microplate | [A-Z, a-z, !] MIME-64 (64 bit) binary | 0.43 | 1.72 | replication, error detection and correction (post-sequencing) | no homopolymers, balanced GC content | sequential |
| Oligoarchive DB [6] | Microplate | [A-Z, a-z, !] (8 bits) 256 symbols | 1 | - | error detection | no homopolymers | random |
| Oligoarchive Join [6] | Microplate | [A-Z, a-z, !] (8 bits) 256 symbols | 1 | - | error detection and correction | no homopolymers | random |
| Fountain codes [16] | - | - | 1.57 | 1.98 | error detection and correction | no homopolymers, balanced GC content | random |
| [21] | Microplate | - | 0.81 | 1.10 | error detection and correction | no homopolymers, balanced GC content | random |
| [23] | Microplate | - | 1 - 1.2 | 1 - 2 | error detection and correction | no homopolymers, balanced GC content | sequential |
of the approaches reviewed to date do not adhere to both of these constraints; several adhere to either preventing homopolymer runs or avoiding GC extremes (but not both), and only four adhere to both constraints.
Inherent in the use of DNA as an information storage medium are unwanted substitution, deletion and insertion events. These errors can occur during the synthesis and sequencing processes, in vivo when information is stored in living organisms and, notwithstanding the general stability of DNA, naturally, e.g., due to temperature, UV radiation, pH, etc. Whilst living organisms have molecular mechanisms in place to repair these errors to some extent, the use of DNA as a storage medium for arbitrary data requires the application of algorithms and design approaches to mitigate errors. Furthermore, the error rate determines the extent to which DNA storage approaches can scale. Hence, we have also analysed the error detection and correction techniques used in each approach, which span from none to sophisticated Reed-Solomon codes.
The maximum theoretical information density of DNA is 455 EB/g, allowing vast amounts of information to be stored in this medium. However, with current DNA synthesis technologies the process is expensive. Consequently, the information density achieved in the experiments of each approach, as well as the amount of data stored, is important and has been reviewed and compared. Although the improved Huffman code [4] and the artwork DNA storage project [2] have achieved the highest information density (each managed 2 b/nt), neither approach adheres to any biological constraints nor employs any error detection or correction; when these are adhered to, the achievable information density drops. They are, therefore, not scalable and would be impractical for arbitrary data storage.
The work by [23] uses a combination of their HEDGES encoding as an inner code and a Reed-Solomon outer code with a decoding tree algorithm. It is the first approach that both corrects for all types of errors (insertions, deletions and substitutions) and achieves error-free recovery at the petabyte scale whilst tolerating as many as 10% of the total nucleotides being erroneous.
Finally, we have reviewed the data retrieval method, i.e., sequential vs. random access, that each DNA storage approach implements. Accessing only part of the information stored in DNA using sequential access requires sequencing all of the DNA to reconstruct the information, which is slow and costly. This method is used by most of the DNA storage approaches to date, particularly the earlier ones. Conversely, using random access, subsets of the data can be selectively retrieved by amplifying only select regions of the DNA using primers associated with the regions of interest. This is cheaper, more efficient and provides faster access. Several approaches use random access, the first of which was the information DNA (iDNA) and polyprimer key (PPK) approach developed by [7], followed by a similar method by [8] and both the Oligoarchive DB and join projects [6].
6 CONCLUSIONS & OUTLOOK
In this paper we have reviewed the state of the art in encoding data in DNA for the purpose of archival storage. Encodings for DNA storage have evolved drastically over time. Initially used primarily for artistic purposes and understood as proofs of principle, the encodings and their features have evolved considerably, with the latest ones providing a solid basis for viable, long-term DNA data storage. Fundamental biological constraints, e.g., homopolymers and extreme GC content, have generally been considered in encodings early on. Likewise, important features such as error detection and correction have been incorporated into most encodings, particularly the later ones. More advanced features, such as random access through the incorporation of primers (to allow for selective reading of regions of interest, thereby avoiding the need to sequence all of the information), are rather new and are only used where required.
Although a comparison of information density (bits/nucleotide) - a crucial metric given today's synthesis cost - is difficult given the various features different encodings support, it has dropped over the years, mostly because more constraints have been considered and more features incorporated, which adversely affects the information density.
The encodings (and their features) appear to converge, with the vast majority of recent encodings considering biological constraints and incorporating some form of error detection and correction.
Future encodings are likely to consider additional constraints (e.g., on storage, or on subsequences unsuitable for synthesis or sequencing) to improve stability in storage or to lower errors in synthesis and sequencing. Further, advances in synthesis and sequencing technology will change their error characteristics (presumably lowering the error rates), which will again affect the encodings, requiring fewer error correction and detection codes. Advances in encoding are also likely in the application of different or new error correction codes (LDPC, polar codes) which are either more compact and/or protect better against the insertion, deletion and substitution errors of the DNA channel. A significant improvement in information density, however, is unlikely.
REFERENCES
[1] 2014. Keyboard scan codes. https://www.marjorie.de/ps2/scancode-set2.htm. (2014). [Online; accessed 20-June-2019].
[2] 2019. Blighted by Kenning. http://www.netherlandsproteomicscentre.nl/npc/education/blighted-by-kenning.html. (2019). [Online; accessed 20-June-2019].
[3] 2021. TPC-H Decision Support Benchmark. http://www.tpc.org/tpch/. (2021). [Online; accessed 14-April-2021].
[4] Menachem Ailenberg and Ori D. Rotstein. 2009. An Improved Huffman Coding Method for Archiving Text, Images, and Music Characters in DNA. BioTechniques 47, 3 (2009). https://doi.org/10.2144/000113218
[5] Morten E. Allentoft, Matthew Collins, David Harker, James Haile, Charlotte L. Oskam, Marie L. Hale, Paula F. Campos, Jose A. Samaniego, Thomas P. M. Gilbert, Eske Willerslev, Guojie Zhang, R. Paul Scofield, Richard N. Holdaway, and Michael Bunce. 2012. The Half-life of DNA in Bone: Measuring Decay Kinetics in 158 Dated Fossils. Proceedings of the Royal Society B: Biological Sciences (2012). https://doi.org/10.1098/rspb.2012.1745
[6] Raja Appuswamy, Kevin Le Brigand, Pascal Barbry, Marc Antonini, Olivier Madderson, Paul Freemont, James McDonald, and Thomas Heinis. 2019. OligoArchive: Using DNA in the DBMS Storage Hierarchy. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR). http://cidrdb.org/cidr2019/papers/p98-appuswamy-cidr19.pdf
[7] Carter Bancroft, Timothy Bowler, Brian Bloom, and Catherine Taylor Clelland. 2001. Long-Term Storage of Information in DNA. Science 293, 5536 (2001). https://doi.org/10.1126/science.293.5536.1763c
[8] James Bornholt, Luis Ceze, Randolph Lopez, Georg Seelig, Douglas M. Carmean, and Karin Strauss. 2016. A DNA-Based Archival Storage System. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). https://doi.org/10.1145/2872362
[9] George M. Church, Yuan Gao, and Sriram Kosuri. 2012. Next-Generation Digital Information Storage in DNA. Science 337, 6102 (2012). https://doi.org/10.1126/science.1226355
[10] Catherine Taylor Clelland, Viviana Risca, and Carter Bancroft. 1999. Hiding Messages in DNA Microdots. Nature 399, 6736 (1999). https://doi.org/10.1038/21092
[11] Jonathan P. L. Cox. 2001. Long-term data storage in DNA. TRENDS in Biotechnology 19, 7 (2001), 247–250.
[12] Francis H. C. Crick, John Stanley Griffith, and Leslie E. Orgel. 1957. Codes without commas. Proceedings of the National Academy of Sciences of the United States of America 43, 5 (1957), 416.
[13] Joe Davis. 1996. Microvenus. Art Journal 55, 1 (1996). https://doi.org/10.1080/00043249.1996.10791743
[14] Melpomeni Dimopoulou and Marc Antonini. 2020. Image storage in DNA using Vector Quantization. In EUSIPCO 2020.
[15] Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research 36, 16 (2008), e105.
[16] Yaniv Erlich and Dina Zielinski. 2017. DNA Fountain enables a robust and efficient storage architecture. Science 355, 6328 (2017), 950–954.
[17] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. Leproust, Botond Sipos, and Ewan Birney. 2013. Towards Practical, High-capacity, Low-maintenance Information Storage in Synthesized DNA. Nature 494 (2013). https://doi.org/10.1038/nature11875
[18] Robert N. Grass, Reinhard Heckel, Michela Puddu, Daniela Paunescu, and Wendelin J. Stark. 2015. Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes. Angewandte Chemie International Edition 54, 8 (2015). https://doi.org/10.1002/anie.201411378
[19] Eduardo Kac. 1999. GENESIS. http://www.ekac.org/geninfo.html. (1999). [Online; accessed 20-June-2019].
[20] David J. C. MacKay. 2005. Fountain codes. IEE Proceedings-Communications 152, 6 (2005), 1062–1068.
[21] Lee Organick, Siena Dumas Ang, Yuan Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z. Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher N. Takahashi, Sharon Newman, Hsing Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, and Karin Strauss. 2018. Random access in large-scale DNA data storage. Nature Biotechnology (2018). https://doi.org/10.1038/nbt.4079
[22] Nathaniel G. Portney, Yonghui Wu, Lauren K. Quezada, Stefano Lonardi, and Mihrimah Ozkan. 2008. Length-Based Encoding of Binary Data in DNA. Langmuir 24, 5 (2008).
[23] William H. Press, John A. Hawkins, Stephen K. Jones, Jeffrey M. Schaub, and Ilya J. Finkelstein. 2020. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences 117, 31 (2020), 18489–18496.
[24] Geoff C. Smith, Ceridwyn C. Fiddes, Jonathan P. Hawkins, and Jonathan P. L. Cox. 2003. Some Possible Codes for Encrypting Data in DNA. Biotechnology Letters 25, 14 (01 Jul 2003).
[25] Pak Chung Wong, Kwong-kwok Wong, and Harlan Foote. 2003. Organic Data Memory Using the DNA Approach. Commun. ACM 46, 1 (Jan. 2003). https://doi.org/10.1145/602421.602426
[26] Nozomu Yachie, Kazuhide Sekiyama, Junichi Sugahara, Yoshiaki Ohashi, and Masaru Tomita. 2007. Alignment-Based Approach for Durable Data Storage into Living Organisms. Biotechnology Progress 23, 2 (2007). https://doi.org/10.1021/bp060261y
[27] S. M. Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. 2017. Portable and Error-Free DNA-Based Data Storage. Scientific Reports 7, 1 (2017). https://doi.org/10.1038/s41598-017-05188-1
A APPENDIX
A.1 Information Density Calculation
bits_per_nt = (num_symbols × bits_per_symbol) / total_nucleotides    (1)
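A hypothetical worked example of Equation (1), with illustrative numbers that do not correspond to any scheme in Table 1:

```python
# Hypothetical worked example of Equation (1); the numbers are illustrative only.
# Storing 100 eight-bit symbols (800 bits) in sequences totalling 500 nucleotides
# (payload, indexes, primers and any replicated copies included) gives 1.6 b/nt.

num_symbols, bits_per_symbol, total_nucleotides = 100, 8, 500
bits_per_nt = num_symbols * bits_per_symbol / total_nucleotides
print(bits_per_nt)   # 1.6
```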