International Invention Journal of Biochemistry and Bioinformatics Vol. 1(1) pp. 1-4, October, 2013
Available online http://internationalinventjournals.org/journals/IIJBB
Copyright ©2013 International Invention Journals
A proposal for a DNA-based computer code
Department of Genetics. University of Extremadura. 06071-Badajoz. Spain.
Accepted date: 28 October, 2013
The use of DNA has become an attractive method for storing information in future biocomputers
due to its capacity to store a large amount of information while requiring little physical volume. In the
last decade, the order of nucleotides (nt) has been considered as the best method to store a large
amount of data. However, proposals for this method have weaknesses. I present a new coding system
for DNA-based computing that uses 4 nt per symbol. This code is based on the conversion of all 256
computer symbols' ASCII numbers into base-4 numbers and on assigning nucleotides ATCG to 0123
respectively. This encoding has: uniformity, because all symbols are coded with 4 nt; consistency, due to a
biunivocal (one-to-one) relationship between the symbols and tetraplets; homogeneity, because similar symbols
share the same first nt; intuitiveness in locating reading frames; and error resistance, due to shorter
sequences, homogeneity of the first nt, and almost no nt repetitions longer than two. This coding
system will provide a more efficient method to implement DNA-based information storage, which will
thus help to design upcoming biocomputers.
Keywords: DNA code, DNA computer
DNA is a long molecule composed of two antiparallel
polynucleotide strands assembled in a double helix
(Watson and Crick, 1953). These polynucleotides are
made of four nucleotides, each nucleotide consisting of a
monosaccharide (deoxyribose) with a phosphate group
bound to its 5' C, and a nitrogenous base bound to the 1'
C. The four nucleotides differ only in their base type, and
they are designated by the first initial of their names: A,
T, G, and C. These letters are called "life's code"
because all of the information required to make all past,
present and future life is stored in DNA in the order of
these four nt. The genome is the name for the complete
sequence of nucleotides of the DNA contained in the
nucleus of each cell in the body of any organism. In
humans, this sequence is made up of 3,200 million nt
(Int. Hum. Genome Seq. Consort., 2001). DNA is
synthesized by the polymerization of nucleotides using
the energy from the removal of two out of the three
phosphate groups bound at the 5'C of their deoxyribose.
Consequently, all known DNA polymerases synthesize
DNA by binding the 5'C phosphate of a new nucleotide
to the 3'C of the last nucleotide in the strand, that is to
say polynucleotides grow from 5' to 3'. Accordingly, this
same orientation is the accepted convention for writing any DNA
sequence from left to right.
The possibility of using DNA to store information
arose after the seminal work of Adleman (1994) and now
captivates many scientists. The potential use of DNA
has opened the possibility for computers to store a large
amount of information in a minuscule amount of space. A
number of different implementations have been
proposed to facilitate the use of this molecule in
computing. After it was shown that the hybridisation of
complementary DNA strands has other uses (Adleman,
1994), this method has been suggested by several
authors (James et al., 1998; Roweis et al., 1999;
Schmidt et al., 2004). Many other authors developed
potential uses for this molecule in regard to its
recombination functionality (Kari, 2001), self-assembly
driven by restriction and hybridisation
(Benenson et al., 2001), strand
displacement (Phillips and Cardelli, 2009; Qian and
Winfree, 2011), partial restriction digestion (Portney et
al., 2008), and logic gates (Gearheart et al., 2012). The
simplest method to store a vast amount of information
in the same manner that cells do is to organise that
information by the order of nucleotides. To access or use
that information we can allocate a number of nucleotides
and a specific order to each symbol. In the last decade,
several groups have proposed different approaches to
this issue. Some have suggested encoding for only one
group of letters (Ailenberg and Rotstein, 2009; Bancroft
et al., 2001; Smith et al., 2003), but these researchers
did not allocate a code to the entire computer
keyboard. Presently, only two research groups have
encoded all 256 symbols (Church et al., 2012; Goldman
et al., 2013).
Church and colleagues (2012) used the four
nucleotides to make a binary code. Starting from a binary text,
they replaced digit 0 with A or C and 1 with G or T. They
used this redundancy to disallow homopolymer runs of four
or more nt and, consequently, used 8 nt per symbol.
This method makes encoding more expensive and
entails a larger storage volume and a higher rate of error
than any other. Using this encoding method, they
synthesised 5.27 Mnt in short pieces that yielded 10
errors when read. A different method to use DNA for
information storage was developed by Goldman and
colleagues (2013). In their method of encoding, two
adjacent positions with the same nt were avoided by
converting ASCII numbers into base-3 digits. They then
used these digits to select a nt that was different from
the last one. Therefore, they suggested an encoding with
5 or 6 nt per symbol, in which the sequence for a given
character differed each time that character was used.
THEORY AND DISCUSSION
I propose a new method for DNA encoding based on
achieving both the highest coding capability of every nt
and the lowest likelihood of errors. According to
Shannon's information theory (Shannon, 1948), every
single nt will contain -log2 p = 2 bits of information, with
p = 1/4 for four equally likely nt. As this
information is twice the content of a binary digit, a binary
byte of 8 bits is equivalent to a byte made with 4 nt, or a
tyte (tetramer byte). Therefore, I can encode the 256
computer symbols using four nt. To encode this, the 256
symbols' ASCII numbers were converted into the
corresponding base-4 numbers, and nt were assigned
using A for 0, T for 1, C for 2, and G for 3, with digits
from left to right corresponding with nt from 5' to 3'
(Table 1). This correspondence was selected from
the 24 possible allocations of the four letters because it
designates fewer repeating nt, which diminishes the
likelihood of errors by polymerase slippage.
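The conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's software: the function names (byte_to_tyte, tyte_to_byte, encode, decode) are hypothetical, and the sketch assumes that each symbol is a single byte value in the range 0-255 and that the most significant base-4 digit sits at the 5' end.

```python
# Digit-to-nucleotide assignment stated in the text: 0 -> A, 1 -> T, 2 -> C, 3 -> G.
NT = "ATCG"
DIGIT = {nt: digit for digit, nt in enumerate(NT)}


def byte_to_tyte(value: int) -> str:
    """Return the 4-nt tyte for one byte value, most significant digit first."""
    if not 0 <= value <= 255:
        raise ValueError("value must be in the range 0..255")
    # Each pair of bits is one base-4 digit, hence one nucleotide.
    return "".join(NT[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def tyte_to_byte(tyte: str) -> int:
    """Return the byte value encoded by a 4-nt tyte."""
    if len(tyte) != 4:
        raise ValueError("a tyte must contain exactly 4 nt")
    value = 0
    for nt in tyte:
        value = value * 4 + DIGIT[nt]  # KeyError on anything other than A, T, C, G
    return value


def encode(data: bytes) -> str:
    """Encode a byte string as a nucleotide sequence written 5' to 3'."""
    return "".join(byte_to_tyte(b) for b in data)


def decode(sequence: str) -> bytes:
    """Decode a nucleotide sequence whose length is a multiple of 4."""
    if len(sequence) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    return bytes(tyte_to_byte(sequence[i:i + 4]) for i in range(0, len(sequence), 4))
```

For instance, encode(b"DNA") yields TATATAGCTAAT, and decode applied to that sequence returns b"DNA". Because every symbol occupies exactly 4 nt, decoding needs only a fixed-width slice and a table lookup.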
This encoding has several remarkable properties
that will provide clear-cut benefits for feasible
biocomputing. i) It is uniform: all symbols are coded
using 4 nt, which will facilitate the design of the software
required for reading and decoding letter sequences read
from DNA memory. ii) It is consistent because it provides
a biunivocal relationship between the symbols and the
tytes that will enable the use of a fixed dictionary and
contribute to quick and safe decoding. iii) It has a very
homogeneous encoding for the first two nt, which allows
for an easy recognition of text because all uppercase
text characters begin with TA/T (TA or TT), all lower case
text characters begin with TC/G (TC or TG), all numbers begin with
AG, and the most frequent punctuation marks (period,
comma and space) begin with AC. This homogeneity will
allow potent methods for error recognition. iv) This
homogeneity ensures an intuitive location for the reading
frame and also simplifies the detection of frameshift
errors. v) This encoding additionally minimises the error
probability in the DNA writing or reading process
because it uses fewer nt per symbol than any other encoding.
As this encoding method uses 4 nt per tetramer byte
(tyte), Church's encoding of 658,776 bytes (Church
et al., 2012) will require 2.63 million tetramer bits (tets).
Using the same error frequency of 1.90 × 10^-6, this will
reduce the 10 errors found in the aforementioned work
to 5. vi) If one assumes that the code refers to a text
document, the first tet of every letter's tyte is
T; therefore, that tet contains no information. Using this
temporary simplification, one can assume an
encoding made of 3 meaningful tets per tyte, or 1.97 million tets, and, in
the above instance, the 5 errors could ideally be
decreased to 3.75. Furthermore, all lower case letters
contain a C or G as the second nt; therefore, any
erroneous change of this nt to A or T will be
easily detected. The probability of an undetectable
error at this position will thus be a third of the error frequency, and
the 10 errors will be reduced to 2.92. vii) Errors caused
by polymerase slippage are also diminished in the
selected code because it contains a very low number of
repeating nt. In this encoding, only one letter, U,
is coded by a T-homotetramer tyte; three
uppercase letters, T, V, and W, contain a T-homotrimer;
and just one lower case letter, j, has a tyte containing a
C-homotrimer. Furthermore, as the second
tet in the 26 lower case letters is C or G, no combination
of two lower case letters could produce a homopolymer
greater than 3 nt.
Table 1. Correspondence among computer symbols, ASCII numbers, and tytes
The error probability can be further
decreased through error-proofing methods such as
using a DNA polymerase with high strand displacement
efficiency (Canceill et al., 1999), or by synthesising and
reading several sequences that contain the same data.
These measures may be required to verify numbers.
The redundancy of language, however, is more than
sufficient to detect a single letter mistake. To give a text
example, my estimate of 2.92 errors in the 658,000
letters used in Church's work (Church et al., 2012)
would result in only 1.39 errors in Mark Twain's entire
186-page book "The Adventures of Tom Sawyer", which
contains 313,876 letters. Most of the conventional books
have more than this error frequency that usually kept
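The numerical claims in the paragraph above can be rechecked with a short script. The sketch below is an illustration under stated assumptions, not the author's calculation: the helper names (tyte, longest_run) are hypothetical, and the factor 2 + 1/3 used for the undetected-error estimate is one reading of the text (first nt carries no information, second nt counts for a third, third and fourth count in full) that happens to reproduce the quoted 2.92.

```python
import string

NT = "ATCG"  # digit -> nucleotide, as assigned in the text (0=A, 1=T, 2=C, 3=G)


def tyte(ch: str) -> str:
    """Return the 4-nt tyte of a single character with code 0..255."""
    code = ord(ch)
    return "".join(NT[(code >> shift) & 0b11] for shift in (6, 4, 2, 0))


def longest_run(seq: str) -> int:
    """Length of the longest homopolymer run in a nucleotide sequence."""
    best = run = 0
    prev = ""
    for nt in seq:
        run = run + 1 if nt == prev else 1
        best = max(best, run)
        prev = nt
    return best


# First-two-nt prefixes of each character class.
upper_prefixes = {tyte(c)[:2] for c in string.ascii_uppercase}
lower_prefixes = {tyte(c)[:2] for c in string.ascii_lowercase}
digit_prefixes = {tyte(c)[:2] for c in string.digits}
punct_prefixes = {tyte(c)[:2] for c in " .,"}

# Letters whose tyte contains a homopolymer run of 4 or of exactly 3 nt.
homotetramers = [c for c in string.ascii_letters if longest_run(tyte(c)) == 4]
homotrimers = [c for c in string.ascii_letters if longest_run(tyte(c)) == 3]

# Longest run over every ordered pair of lower case letters.
worst_pair = max(
    longest_run(tyte(a) + tyte(b))
    for a in string.ascii_lowercase
    for b in string.ascii_lowercase
)

# Error estimates, scaled from 10 errors in the 8-nt-per-byte encoding.
BYTES = 658_776
rate = 10 / (8 * BYTES)                        # about 1.90e-6 errors per nt
errors_4nt = 4 * BYTES * rate                  # 4 nt per symbol
errors_3nt = 3 * BYTES * rate                  # first nt of letters ignored
errors_undetected = (2 + 1 / 3) * BYTES * rate # assumed reading, see lead-in
tom_sawyer = errors_undetected * 313_876 / BYTES

print(sorted(upper_prefixes), sorted(lower_prefixes), digit_prefixes, punct_prefixes)
print(homotetramers, homotrimers, worst_pair)
print(round(errors_4nt, 2), round(errors_3nt, 2),
      round(errors_undetected, 2), round(tom_sawyer, 2))
```

Under these assumptions the script reports U as the only homotetramer, T, V, W and j as the homotrimers, a longest run of 3 nt for any pair of lower case letters, and error estimates of 5, 3.75, 2.92 and 1.39.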
Beyond keyboard symbols, this coding
system would allow any computer work, such as image
digitisation. Image formats like BMP, PNG, or JPEG
that support 24-bit colour use 24 bits per pixel. A pixel
colour is determined by three colours, red, green, and
blue, or RGB, each colour coded by two hexadecimal
digits or eight bits. The proposed DNA-based computer
code would use pixels coded by 12 tets, 4 tets for each
RGB colour, to support the same 16.7 million colours that
binary or hexadecimal coding does (an Excel page
for conversions among binary, tetranary, decimal,
and hexadecimal systems can be downloaded from
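The 12-tet pixel described above can be illustrated with a small Python sketch. It is only an example under assumptions not stated in the text: the function names (channel_to_tyte, encode_pixel, decode_pixel) are hypothetical, and the channels are assumed to be written in the order red, green, blue from 5' to 3'.

```python
from typing import Tuple

NT = "ATCG"  # 0 -> A, 1 -> T, 2 -> C, 3 -> G, as in the proposed code
DIGIT = {nt: digit for digit, nt in enumerate(NT)}


def channel_to_tyte(value: int) -> str:
    """Encode one 8-bit colour channel (0..255) as 4 nt."""
    if not 0 <= value <= 255:
        raise ValueError("channel value must be in the range 0..255")
    return "".join(NT[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def encode_pixel(red: int, green: int, blue: int) -> str:
    """Encode a 24-bit RGB pixel as 12 nt (assumed order: R, G, B)."""
    return "".join(channel_to_tyte(v) for v in (red, green, blue))


def decode_pixel(sequence: str) -> Tuple[int, int, int]:
    """Decode 12 nt back into an (R, G, B) triple."""
    if len(sequence) != 12:
        raise ValueError("a pixel must contain exactly 12 nt")
    channels = []
    for i in range(0, 12, 4):
        value = 0
        for nt in sequence[i:i + 4]:
            value = value * 4 + DIGIT[nt]
        channels.append(value)
    return channels[0], channels[1], channels[2]


# 12 base-4 digits cover exactly the 24-bit colour space.
assert 4 ** 12 == 2 ** 24 == 16_777_216
```

As a usage example, encode_pixel(255, 0, 128) gives GGGGAAAACAAA, and decode_pixel recovers (255, 0, 128) from that sequence.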
In addition to the mentioned benefits drawn from its
properties, this code will enable the storage of 7.3 × 10^18
B (7.3 quintillion bytes, or 7.3 exabytes) in just 1 mg of
double-stranded DNA (more stable than single-stranded
DNA molecules), which, at the present time, is the smallest
memory volume able to hold this amount of information.
I believe that this encoding system could be widely
accepted due to its homogeneity, intuitiveness, ease of
programming, and resistance to errors. The advantageous
properties of this system make it a forerunner for those
attempting to create DNA-based computers.
Adleman LM (1994). "Molecular computation of solutions to
combinatorial problems", Science, vol. 266, pp.1021-4.
Ailenberg M, Rotstein OD (2009). "An improved Huffman coding
method for archiving text, images, and music characters in DNA",
BioTechniques, vol. 47, pp. 747-754.
Bancroft C, Bowler T, Bloom B, Clelland CT (2001). "Long-term
storage of information in DNA", Science, vol. 293, pp. 1763-5.
Benenson Y, Adar R, Paz-Elizur T, Livneh Z, Shapiro E (2003). "DNA
molecule provides a computing machine with both data and fuel",
Proc Nat Acad Sci USA, vol. 100, pp. 2191-2196.
Benenson Y, Gil B, Ben-Dor U, Adar R, Shapiro E (2004). "An
autonomous molecular computer for logical control of gene
expression", Nature, vol. 429, pp. 423-9.
Benenson Y, Paz-Elizur T, Adar R, Keinan E, Livneh Z, Shapiro E
(2001). "Programmable and autonomous computing machine made
of biomolecules", Nature, vol. 414, pp. 430-4.
Canceill D, Viguera E, Ehrlich SD (1999). "Replication slippage of
different DNA polymerases is inversely related to their strand
displacement efficiency", J Biol Chem, vol. 274, pp. 27481-90.
Church GM, Gao Y, Kosuri S (2012). "Next-generation digital
information storage in DNA", Science, vol. 337, pp. 1628
Gearheart CM, Rouchka EC, Arazi B (2012). "DNA-Based Active Logic
Design and Its Implications", J Emerg Trends Comp Inf Sci, vol. 3,
Goldman N, Bertone P, Chen S, Dessimoz C, LeProust EM, Sipos B,
Birney E (2013). "Towards practical, high-capacity, low-maintenance
information storage in synthesized DNA", Nature,
International Human Genome Sequencing Consortium (2001). "Initial
sequencing and analysis of the human genome", Nature, vol. 409,
James KD, Boles AR, Henckel D, Ellington AD (1998). "The fidelity of
template-directed oligonucleotide ligation and its relevance to DNA
computation", Nuc Ac Res, vol. 26, pp. 5203-11.
Kari L (2001). "DNA computing in vitro and in vivo", Fut Gener Comp
Systems, vol. 17, pp. 823-834.
Phillips A, Cardelli L (2009). "A programming language for composable
DNA circuits", J R Soc Interface, vol. 6, pp. S419-S436.
Portney NG, Wu Y, Quezada LK, Lonardi S, Ozkan M (2008). "Length-
based encoding of binary data in DNA", Langmuir, vol. 24, pp. 1613-
Qian L, Winfree E (2011). "Scaling up digital circuit computation with
DNA strand displacement cascades", Science, vol. 332, pp. 1196-
Roweis S, Winfree E, Burgoyne R, Chelyapov NV, Goodman MF,
Rothemund PWK, Adleman LM (1999). "A sticker based model for
DNA computation", DIMACS Series in Discrete Mathematics and
Theoretical Computer Science, vol. 44, pp. 1-29.
Schmidt KA, Henkel CV, Rozenberg G, Spaink HP (2004). "DNA
computing using single-molecule hybridization detection", Nuc Ac
Res, vol. 32, pp. 4962-8.
Shannon CE (1948). "A mathematical theory of communication", Bell
System Tech J, vol. 27, pp. 379-423.
Smith GC, Fiddes CC, Hawkins JP, Cox JPL (2003). "Some possible
codes for encrypting data in DNA", Biotech Lett, vol. 25, pp. 1125-
Watson JD, Crick FHC (1953). "Molecular structure of nucleic acids. A
structure for Deoxyribose Nucleic Acid", Nature, vol. 171, pp. 737-8.
How to cite this article: Alfonso JS (2013). A proposal for a DNA-based
computer code. Int. Inv. J. Biochem. Bioinform.