
An investigation into inter- and intragenomic variations of graphic genomic signatures



Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. 
Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
Karamichalis et al.
An investigation into inter- and intragenomic
variations of graphic genomic signatures
Rallis Karamichalis1, Lila Kari1*, Stavros Konstantinidis2 and Steffen Kopecki1,2
Background: Motivated by the general need to identify and classify species based on molecular evidence,
genome comparisons have been proposed that are based on measuring Euclidean distances between Chaos
Game Representation (CGR) patterns of genomic DNA sequences.
Results: We provide, on an extensive dataset and using several different distances, confirmation of the
hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA
sequences originating from genomes of different species. This finding lends support to the theory that CGRs of
genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over
five hundred different 150,000 bp genomic sequences originating from the genomes of six organisms, each
belonging to one of the kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi;
chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli
(Bacteria - full genome), and P. furiosus (Archaea - full genome). We also provide preliminary evidence of this
method’s applicability to closely related species by comparing H. sapiens (chromosome 21) sequences and over
one hundred and fifty genomic sequences, also 150,000 bp long, from P. troglodytes (Animalia; chromosome
Y), for a total length of more than 101 million basepairs analyzed. We compute pairwise distances between
CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps that
visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display
their interrelationships.
Conclusion: Our analysis confirms that CGR patterns of DNA sequences from the same genome are in general
quantitatively similar, while being different for DNA sequences from genomes of different species. Our analysis
of the performance of the assessed distances uses three different quality measures and suggests that several
distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In
particular we show that, for this dataset, DSSIM (Structural Dissimilarity Index) and the descriptor distance
(introduced here) are best able to classify genomic sequences.
Keywords: comparative genomics; genomic signature; species classification
1 Department of Computer Science, University of Western Ontario,
London, ON, Canada
Full list of author information is available at the end of the article
Alongside DNA barcoding [1] and Klee diagrams
[2], Chaos Game Representation (CGR) patterns of
genomic segments have been proposed as another
method for the classification and identification of
genomic sequences [3-7]. The concept of genomic
signature was first introduced in [8], as being any specific
quantitative characteristic of a DNA genomic sequence
that is pervasive along the genome of the same
organism, while being dissimilar for DNA sequences
originating from different organisms. Initial studies [3,9]
suggested that short fragments of genomic sequences
retain most of the characteristics of the species they
come from, thus implying that genomic signatures ex-
ist. Moreover, the Chaos Game Representation (CGR)
of a DNA sequence, a graphic representation of its se-
quence composition, was proposed in [3] as having both
the pervasiveness and differentiability properties nec-
essary for it to qualify as a genomic signature. This
hypothesis was quantitatively tested and largely con-
firmed in [4] for 3,176 mitochondrial DNA (mtDNA)
sequences, and Molecular Distance Maps were proposed
therein as visualizations of species relationships
based on measuring the distances between the CGR
images of their mtDNA genomes. Note that CGR patterns
of mtDNA sequences can be different from those
of DNA sequences from the major genome of the same
organism, and that large-scale quantitative analyses of
the hypothesis that CGR can play the role of a genomic
signature for genomic sequences have not, to
our knowledge, been performed. The objective of this
study is to confirm that CGR can play the role of a
genomic signature for genomic DNA sequences, as well
as to assess various distances that can be used to compare
CGRs of genomic sequences.
arXiv:1503.00162v1 [q-bio.GN] 28 Feb 2015
We analyze 508 fragments, 150 kbp (kilo base pairs)
long, taken from complete genomic DNA sequences
of six species, each representing a different kingdom:
chromosome 21 of Homo sapiens, chromosome 4 of
Saccharomyces cerevisiae, chromosome 1 of Arabidop-
sis thaliana, chromosome 14 of Plasmodium falci-
parum, the genome of Escherichia coli, and the genome
of Pyrococcus furiosus, for a total length of 76,200,000
bp analyzed. We analyze the intergenomic and intrage-
nomic variation of CGR genomic signatures of these se-
quences by using six different distances for image com-
parison: Structural Dissimilarity Index (DSSIM) [10],
Euclidean distance, Pearson correlation distance [11],
Manhattan distance [12], approximated information
distance [13], and a distance we propose here, called
descriptor distance. We visualize the results by com-
puting the Molecular Distance Maps of all DNA se-
quences in the database, for each of the six distances.
The resulting Molecular Distance Maps show a good
clustering of the DNA sequences, with those origi-
nating from the same genome being largely grouped
together, and separated from sequences belonging to
genomes of different organisms. We observe that, in
some of the cases where the clustering was suboptimal,
the computation of three-dimensional Molecular Dis-
tance Maps resolves what appeared to be cluster over-
laps in the two-dimensional Molecular Distance Maps.
Lastly, using the “ground-truth” that sequences from
the same genomes should have similar structural char-
acteristics and thus be grouped together, while those
from genomes of different organisms should be sepa-
rated, we assess the six distances by combining three
different quality measures: correlation to an idealized
cluster distance, silhouette accuracy, and histogram
overlap. We conclude that DSSIM and the descriptor
distance perform best according to these measures. We
also provide preliminary evidence of this method’s ap-
plicability to classifying genomic DNA sequences of
closely related species by comparing the H. sapiens
(chromosome 21) sequences with 168 genomic DNA
sequences, 150 kbp long, from Pan troglodytes (chimp,
chromosome Y), for an additional length of 25,200,000
bp analyzed. Further research may lead to improve-
ments of these distances for optimal genomic DNA se-
quence identification and classification results.
Note that other alignment-free methods have been
used for phylogenetic analysis of DNA sequences. The
initial reports on CGRs of genomic sequences [3,14]
contained mostly qualitative assessments of CGR pat-
terns of whole genes. In [7], several datasets of up to
36 genomic DNA sequences were analyzed, and in [9]
some various-length sequences were analyzed based on
computing Euclidean distances between frequencies of
their k-mers, for k = 1, ..., 8. Subsequently, [5] computed
the Euclidean distance between frequencies of
k-mers (k ≤ 5) for the analysis of 125 GenBank DNA
sequences from 20 bird species and the American
alligator. In [15], 27 microbial genomes were analyzed
to find implications of 4-mer frequencies (k = 4) on
their evolutionary relationships. In [13], 20 mammalian
complete mtDNA sequences were analyzed using the
“similarity metric”, for k = 7. Another study, [16],
analyzed 459 bacteriophage genomes and compared them
with their host genomes to infer host-phage relationships,
by computing Euclidean distances between
frequencies of k-mers for k = 4. In [17], 75 complete HIV
genome sequences were compared using the Euclidean
distance between frequencies of 6-mers (k = 6), in
order to group them in subtypes. In [4] a dataset of 3,176
complete mtDNA sequences was analyzed, and several
Molecular Distance Maps were obtained using DSSIM
and a value of k = 9.
The main contributions of this paper are:
• We tested and confirmed, for an extensive dataset
  of a total length of 101,400,000 bp, the hypothesis
  that CGR images of genomic DNA sequences
  can play the role of a (graphic) genomic signature,
  meaning that they have a desirable genome-
  and species-specificity. The dataset comprised
  150 kbp long sequences taken from genomes of
  organisms from each of the six kingdoms of life,
  augmented by a set of same-length genomic sequences
  from P. troglodytes as a test case of this
  method’s applicability to closely related species.
• We assessed the performance of six different distances
  in this context, and this analysis included
  both same-genome and different-genome DNA
  fragment pairs. For several of these distances, the
  intragenomic values were overall smaller than the
  intergenomic values, suggesting that this method
  could separate DNA genomic fragments belonging
  to different genomes, based on their CGRs.
• We showed that several distances outperform the
  Euclidean distance, which has so far been almost
  exclusively used for such studies. In particular,
  we determined that the DSSIM distance
  and the descriptor distance (introduced here), both of
  which essentially compare the k-mer composition
  of DNA sequences (herein k = 9), were best able
  to differentiate sequences originating from different
  genomes in this dataset.
• This study represents, to the best of our knowledge,
  the largest combined dataset size and value
  of k for this type of analysis.
• Based on preliminary data, we suggest the use
  of three-dimensional Molecular Distance Maps for
  improved visualization of the simultaneous
  interrelationships among similar or very distant DNA
  sequences.
In this section we first describe the dataset used for our
analysis, then present an overview of the three main
steps of the method, and conclude with a description
of the six distances that we considered.
The dataset we used includes complete genomic se-
quences from six organisms, each representing one of
the six kingdoms of life, see Table 1. For additional
information about the dataset see Appendix A.
    Organism                                NCBI Acc. Nr.
1   H. sapiens, chrom. 21 (Animalia)        NC_000021.8
2   E. coli (Bacteria)                      NC_000913.3
3   S. cerevisiae, chrom. 4 (Fungi)         NC_001136.10
4   A. thaliana, chrom. 1 (Plantae)         NC_003070.9
5   P. falciparum, chrom. 14 (Protista)     NC_004317.2
6   P. furiosus (Archaea)                   NC_018092.1
Table 1: NCBI accession numbers of the dataset of the
complete genomic DNA sequences considered, in increasing
order of their NCBI accession number.
Organism        Length (bp)    # Letters “N”    # Fragments
H. sapiens      48,129,895     13,023,253       234
E. coli          4,641,652              0        30
S. cerevisiae    1,531,933              0        10
A. thaliana     30,427,671        164,359       201
P. falciparum    3,291,871             37        21
P. furiosus      1,909,827             10        12
Table 2: Organism considered, total length of genomic
sequence, number of ignored letters “N”, and number of DNA
fragments (sequences) obtained by splitting each complete
genomic DNA sequence into consecutive, non-overlapping,
equal-length (150 kbp) contiguous fragments.
In order to have a relatively comparable number of
DNA sequences for each organism, we chose the longest
chromosome for each organism except H. sapiens, for
which the shortest chromosome was chosen.
The DNA sequences in the NCBI database are rep-
resented as strings of letters “A”, “C”, “G”, “T”, and
“N” which represent the four nucleobases Adenine,
Cytosine, Guanine, Thymine, and “unidentified Nu-
cleotide”, respectively. For our analysis we ignored all
letters “N”. In S. cerevisiae and E. coli there were no
ignored letters, and in P. falciparum and P. furiosus
the number of ignored letters is of the order of 0.001%
of the length of the sequence. In H. sapiens this number
is 27%, and in A. thaliana it is 0.54%. In H. sapiens,
in particular, 96.4% of these ignored letters exist in
centromeric and telomeric regions of the chromosome.
The resulting genomic DNA sequences were di-
vided into successive, non-overlapping, contiguous
fragments, each 150 kbp long. When the last sequence
was shorter than 150 kbp, it was not included in the
analysis. This resulted in 234 fragments for H. sapiens,
30 fragments for E. coli, 10 fragments for S. cerevisiae,
201 fragments for A. thaliana, 21 fragments for P. fal-
ciparum, and 12 fragments for P. furiosus, for a total
of 508 DNA fragments, see Table 2.
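The fragmentation step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Mathematica code: the function name `split_into_fragments` and its parameters are hypothetical, and the input is assumed to be a plain string of nucleotide letters.

```python
def split_into_fragments(sequence, size=150_000):
    """Drop every 'N', then cut the rest into consecutive,
    non-overlapping fragments of exactly `size` letters.

    A trailing piece shorter than `size` is discarded, as in the text.
    """
    cleaned = sequence.upper().replace("N", "")
    n_full = len(cleaned) // size
    return [cleaned[i * size:(i + 1) * size] for i in range(n_full)]


# Toy usage with a tiny fragment size (hypothetical input).
toy = split_into_fragments("ACGTNNACGTAC", size=4)
```

Applying the same arithmetic to the lengths in Table 2 reproduces the fragment counts, for example (48,129,895 - 13,023,253) div 150,000 = 234 for H. sapiens.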
The method we used to analyze and classify the 508
sequences of the dataset has three steps: (i) gener-
ate graphical representations (images) of each DNA
sequence using Chaos Game Representation (CGR),
(ii) compute all pairwise distances between these im-
ages, and (iii) visualize the interrelationships implied
by these distances as two- or three-dimensional maps,
using Multi-Dimensional Scaling (MDS).
CGR is a method introduced by Jeffrey [3] in 1990
to visualize the structure of a DNA sequence. A CGR
associates an image to each DNA sequence as follows.
Starting from a unit square with corners labelled A, C,
G, and T, and the center of the square as the starting
point, the image is obtained by successively plotting
each nucleotide as the middle point between the cur-
rent point and the corner labelled by the nucleotide to
be plotted. If the generated square image has a size of
$2^k \times 2^k$ pixels, then every pixel represents a distinct
k-mer: a pixel is black if the k-mer it represents occurs
in the DNA sequence, otherwise it is white. CGR
images of genetic DNA sequences originating from various
species show patterns such as squares, parallel
lines, rectangles, triangles, and also complex fractal
patterns (see Figure 1).
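For step (i), a slight modification of the original CGR
was used, introduced by Deschavanne [7]: a k-th order
FCGR (frequency CGR) is a $2^k \times 2^k$ matrix that
can be constructed by dividing the CGR plot into a
$2^k \times 2^k$ grid, and defining the element $a_{ij}$ as the number
of points that are situated in the corresponding
grid square. A first and second order FCGR are shown
below, where $N_w$ is the number of occurrences of the
oligonucleotide $w$ in the sequence $s$.
\[
FCGR_1(s) = \begin{pmatrix} N_C & N_G \\ N_A & N_T \end{pmatrix}
\]
\[
FCGR_2(s) = \begin{pmatrix}
N_{CC} & N_{GC} & N_{CG} & N_{GG} \\
N_{AC} & N_{TC} & N_{AG} & N_{TG} \\
N_{CA} & N_{GA} & N_{CT} & N_{GT} \\
N_{AA} & N_{TA} & N_{AT} & N_{TT}
\end{pmatrix}
\]
The $(k+1)$-th order $FCGR_{k+1}(s)$ can be obtained
by replacing each element $N_X$ in $FCGR_k(s)$ with the four
elements
\[
\begin{pmatrix} N_{CX} & N_{GX} \\ N_{AX} & N_{TX} \end{pmatrix}
\]
where $X$ is a sequence of length $k$ over the alphabet
$\{A, C, G, T\}$.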
Figure 1: $2^9 \times 2^9$ CGR images of 150 kbp genomic DNA
sequences of (a) H. sapiens, (b) E. coli, (c) S. cerevisiae,
(d) A. thaliana, (e) P. falciparum, and (f) P. furiosus.
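As an illustration of the FCGR construction, the following Python sketch counts k-mers directly instead of plotting CGR points; the two views should agree because the grid square of a CGR point is determined by the last k nucleotides read. The corner layout (C top-left, G top-right, A bottom-left, T bottom-right) is an assumption, chosen to match the worked example later in the text in which CCC, GCC and CGC occupy the first three entries of the top row. The names `fcgr` and `cgr_image` are hypothetical.

```python
import numpy as np

# Assumed corner layout: C top-left, G top-right, A bottom-left, T bottom-right.
COL_BIT = {"A": 0, "C": 0, "G": 1, "T": 1}  # 1 = right half
ROW_BIT = {"A": 1, "C": 0, "G": 0, "T": 1}  # 1 = lower half


def fcgr(sequence, k):
    """Return the k-th order FCGR: a 2^k x 2^k array of k-mer counts."""
    seq = sequence.upper().replace("N", "")
    size = 2 ** k
    counts = np.zeros((size, size), dtype=np.int64)
    for start in range(len(seq) - k + 1):
        row = col = 0
        for pos, base in enumerate(seq[start:start + k]):
            if base not in COL_BIT:
                break  # skip k-mers that contain any other symbol
            # The last letter of the k-mer selects the coarsest quadrant,
            # so later positions contribute the more significant bits.
            row |= ROW_BIT[base] << pos
            col |= COL_BIT[base] << pos
        else:
            counts[row, col] += 1
    return counts


def cgr_image(sequence, k):
    """Binary CGR: True (black) where the k-mer occurs at least once."""
    return fcgr(sequence, k) > 0
```

For example, `fcgr("AAC", 1)` puts the two occurrences of A in the bottom-left entry and the single C in the top-left entry. The pure-Python loop is slow for 150 kbp sequences with k = 9; it is meant to show the indexing, not to be efficient.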
For step (ii), after computing the FCGR matrices for
each of the 150 kbp sequences in our dataset, the goal
was to measure “distances” between every two CGR
images. There are many distances that can be defined
and used for this purpose [18]. One of the goals of
this study was to identify what distance is better able
to differentiate the structural differences of various ge-
nomic DNA sequences and classify them based on the
species they belong to. In this paper we use six differ-
ent distances: Structural Dissimilarity Index (DSSIM),
descriptor distance (defined here), Euclidean distance,
Manhattan distance, Pearson correlation distance, and
approximated information distance.
For step (iii), after computing all possible pairwise
distances we obtained six different distance matrices.
To visualize the inter-relationships between sequences
implied by each of the distance matrices, and to thus
visually assess each of the distances, we used Multi-
Dimensional Scaling (MDS). MDS is an information
visualization technique introduced by Kruskal in [19].
Given as input a distance matrix that contains the
pairwise distances among a set of items[1], the out-
put of MDS is a spatial representation of the items on
a common Euclidean space wherein each item is rep-
resented as a point and the spatial distance between
any two points corresponds to the distance between
the items in the distance matrix: Objects with a small
pairwise distance will result in points that are close to
each other, while objects with a large pairwise distance
will become points that are far apart. For example,
in [4] MDS was used in conjunction with DSSIM and
CGR to produce Molecular Distance Maps that visu-
ally display the simultaneous interrelationships among
a set of full mitochondrial DNA sequences.
The ideal Molecular Distance Map is a placement of
$n$ items as points in an $(n-1)$-dimensional space. The
two-dimensional Molecular Distance Map is simply an
approximation, a flattening of this high-dimensional
space onto the plane, which may sometimes result in
erroneous positioning of some points. Increasing the
dimensionality of the Molecular Distance Map often
results in a more accurate representation of the real
interrelationships between sequences, as embodied in
the original distance matrix.
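To make the embedding step concrete, here is a minimal sketch of classical (Torgerson) MDS in Python. This is a swapped-in, simple variant chosen for illustration: the text cites Kruskal [19] but does not say which MDS variant the authors' Mathematica code uses, so the function `classical_mds` below is an assumption, not their method.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Embed n items in `dims` dimensions from an n x n distance matrix.

    Classical MDS: double-center the squared distances and keep the
    eigenvectors belonging to the largest eigenvalues.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:dims]       # largest first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale               # one row per item
```

Calling `classical_mds(dist, dims=3)` instead of `dims=2` gives the three-dimensional counterpart; for distances that are not exactly Euclidean the extra coordinate can only reduce the distortion of this particular embedding.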
In this section we describe and formally define each of
the six distances used in our analysis: DSSIM, descrip-
tor distance (introduced here), Euclidean, Manhattan,
Pearson, and approximated information distance.
Structural Similarity Index, SSIM, was introduced
in [10] for the purpose of assessing the degree of similarity
between two images. Given two images $X, Y$ as
$n \times n$ matrices having as elements integers ranging in
the interval $[0, L]$, SSIM computes three factors (luminance,
contrast and structure) and combines them to
obtain a similarity value. However, instead of computing
a global similarity between the two images, each
image is divided into $11 \times 11$ sliding square windows
$X^{ij}$ ($Y^{ij}$ respectively) with $i, j = 1, \cdots, n-10$ which
move pixel by pixel to eventually cover the entire image,
and the SSIM similarity of any given pair of images
is computed by comparing their corresponding
windows. In addition, an $11 \times 11$ circular symmetric
Gaussian weighting function $W \in \mathbb{R}^{11 \times 11}$ with a
fixed standard deviation of 1.5, normalized to unit sum
($\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} = 1$), is used. Then, the mean $\mu_{x,i,j}$
($\mu_{y,i,j}$ for $Y$), standard deviation $\sigma_{x,i,j}$ ($\sigma_{y,i,j}$ for $Y$) and
covariance $\sigma_{xy,i,j}$ are computed, as follows:
\[
\mu_{x,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} X^{ij}_{pq}
\]
\[
\sigma_{x,i,j} = \sqrt{\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} \left(X^{ij}_{pq} - \mu_{x,i,j}\right)^2}
\]
\[
\sigma_{xy,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} \left(X^{ij}_{pq} - \mu_{x,i,j}\right)\left(Y^{ij}_{pq} - \mu_{y,i,j}\right)
\]
where $A_{pq}$ denotes the $(p, q)$ element of the matrix $A$.
Based on these values, the luminance $l(X^{ij}, Y^{ij})$, contrast
$c(X^{ij}, Y^{ij})$ and structure $s(X^{ij}, Y^{ij})$ are computed as
\[
l(X^{ij}, Y^{ij}) = \frac{2\mu_{x,i,j}\,\mu_{y,i,j} + C_1}{\mu_{x,i,j}^2 + \mu_{y,i,j}^2 + C_1}
\]
\[
c(X^{ij}, Y^{ij}) = \frac{2\sigma_{x,i,j}\,\sigma_{y,i,j} + C_2}{\sigma_{x,i,j}^2 + \sigma_{y,i,j}^2 + C_2}
\]
\[
s(X^{ij}, Y^{ij}) = \frac{\sigma_{xy,i,j} + C_3}{\sigma_{x,i,j}\,\sigma_{y,i,j} + C_3}
\]
where $C_1 = (0.01)^2$, $C_2 = (0.03)^2$, $C_3 = \frac{C_2}{2}$. Then,
these three factors are combined to get
\[
SSIM(X^{ij}, Y^{ij}) = l(X^{ij}, Y^{ij}) \, c(X^{ij}, Y^{ij}) \, s(X^{ij}, Y^{ij})
\]
and finally, the SSIM index used to evaluate the overall
image similarity is computed as
\[
SSIM(X, Y) = \frac{1}{(n-10)^2} \sum_{i=1}^{n-10} \sum_{j=1}^{n-10} SSIM(X^{ij}, Y^{ij}).
\]
[1] In this paper the items are the 150 kbp DNA sequences
analyzed.
In theory, the values for SSIM range in the interval
$[-1, 1]$ with the similarity being 1 between two identical
images, 0, for example, between a black image and
a white image, and $-1$ if the two images are negatively
correlated; that is, $SSIM(X, Y) = -1$ if and only if $X$
and $Y$ have the same luminance $\mu$ and every pixel $x_i$
of image $X$ has the inverted value of the corresponding
pixel $y_i = 2\mu - x_i$ in $Y$.
To compute the distance rather than the similarity
between two images, we calculate $DSSIM(X, Y) =
1 - SSIM(X, Y)$. Consequently, the range of DSSIM
is the interval $[0, 2]$: two identical images will result
in a DSSIM distance of 0, while two images that are
the negatives of each other would result in a DSSIM
distance of 2.
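The windowed SSIM computation and the resulting DSSIM can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function names are hypothetical, the constants are taken as written in the text, and this section does not say whether the images passed in are binary CGRs or FCGR counts, nor how pixel values are scaled.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_window(size=11, sigma=1.5):
    """Circular symmetric Gaussian weights, normalized to unit sum."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()


def dssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """DSSIM = 1 - mean SSIM over all 11 x 11 sliding windows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 2 or min(x.shape) < 11:
        raise ValueError("need two equally sized 2-D images, at least 11 x 11")
    w = gaussian_window()
    c3 = c2 / 2.0
    xw = sliding_window_view(x, w.shape)   # shape (n-10, n-10, 11, 11)
    yw = sliding_window_view(y, w.shape)
    mu_x = (w * xw).sum(axis=(-2, -1))
    mu_y = (w * yw).sum(axis=(-2, -1))
    dx = xw - mu_x[..., None, None]
    dy = yw - mu_y[..., None, None]
    sig_x = np.sqrt((w * dx ** 2).sum(axis=(-2, -1)))
    sig_y = np.sqrt((w * dy ** 2).sum(axis=(-2, -1)))
    sig_xy = (w * dx * dy).sum(axis=(-2, -1))
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    con = (2 * sig_x * sig_y + c2) / (sig_x ** 2 + sig_y ** 2 + c2)
    struct = (sig_xy + c3) / (sig_x * sig_y + c3)
    return 1.0 - float((lum * con * struct).mean())
```

The sliding-window view trades memory for brevity: on a 512 x 512 image the intermediate arrays hold roughly 30 million values each, so a convolution-based formulation would be preferable for the full dataset.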
The descriptor distance between two FCGRs $X, Y \in
\mathbb{N}^{2^k \times 2^k}$ aims to compare a combination of several different
“descriptors”, that is, a combination of several
different aspects, of the two given FCGRs.
A descriptor is a vector characterized by parameters
$m$ and $r$, as well as $r$ intervals, where $m$ determines the size
($2^m \times 2^m$) of the non-overlapping windows into which the FCGR is
divided (scale of the comparison), and the $r$ intervals
represent the “granularity” of the analysis, in that they
define the intervals of numbers of k-mer occurrences
that are considered significant.
For a given $m \le k$ and $r$, and intervals $[a_0, a_1), [a_1, a_2),
\cdots, [a_{r-1}, a_r)$ such that $\bigcup_{i=0}^{r-1} [a_i, a_{i+1}) = [0, \infty)$ and
$[a_i, a_{i+1}) \cap [a_j, a_{j+1}) = \emptyset$ $\forall i, j$ with $i \neq j$, a descriptor
is constructed as follows.
Starting from the top-left corner, we divide each of
the two FCGR matrices $X$ and $Y$ into non-overlapping
submatrices[2] of size $2^m \times 2^m$. This procedure results
in $4^{k-m}$ submatrices $X_{ij}$ and $Y_{ij}$ with $i, j =
1, \cdots, 2^{k-m}$, which will be pairwise compared.
The choice of the $r$ intervals, called “bins”, points
to the fact that, rather than considering the finest
granularity, we are interested in a coarser comparison.
This means that, instead of a computationally
expensive pairwise comparison of all possible numbers
of occurrences of k-mers, we are interested only in certain
“bins” of such numbers. For example, in our case,
we use $r = 5$ and consider only 5 different bins, that
is, k-mers with number of occurrences: 0 (not occurring),
1 (one occurrence), at least 2 but less than 5,
at least 5 but less than 20, and 20 or more
(most frequent). Formally, we use $r = 5$ and
$[0, \infty) = [0,1) \cup [1,2) \cup [2,5) \cup [5,20) \cup [20, \infty)$ as the
5 bins.
Afterwards, we compute for every $X_{ij}$ a vector
$vec_{X_{ij}} = \frac{1}{2^m \times 2^m}(b_1, b_2, \cdots, b_r)$ where $b_i = |\{x \in
X_{ij} : a_{i-1} \le x < a_i\}|$. In our case, for each $X_{ij}$, we
compute a five-tuple wherein, for example, the 4th element
represents the number of 9-mers whose number
of occurrences is in the 4th bin, that is, at least 5 but
less than 20. The division by $2^m \times 2^m$ is to obtain a
probability distribution for each submatrix. The same
procedure is performed for $Y_{ij}$, resulting in the vector
$vec_{Y_{ij}}$.
We further append all vectors $vec_{X_{ij}}$ and form a new
vector $vec_{X_{m,r}}$ and, using the same order of appending,
we append all vectors $vec_{Y_{ij}}$ forming a new vector
$vec_{Y_{m,r}}$. These two vectors are the “descriptors” of
the FCGR matrices $X$ and $Y$ for the parameters $m$, $r$
and the $r$ chosen bins.
As a last step, we combine descriptors $vec_{X_{m,r}}$ (respectively
$vec_{Y_{m,r}}$) for several values of $m$ and $r$ by
appending them one after another, in the same order,
to obtain the vector $vec_X$ (respectively $vec_Y$).
The descriptor distance between the two FCGRs $X$
and $Y$ is now defined as the Euclidean distance between
the vectors $vec_X$ and $vec_Y$:
\[
d_D(X, Y) = d_E(vec_X, vec_Y).
\]
[2] In general, these windows (submatrices) can be overlapping,
but in this paper we made the choice of using
non-overlapping windows.
In our case we computed descriptors for $m = 4, 5, 6$,
therefore forming vectors $vec_X$ and $vec_Y$ of length
$5\left(\left(\frac{512}{64}\right)^2 + \left(\frac{512}{32}\right)^2 + \left(\frac{512}{16}\right)^2\right) = 6720$. In general,
for a given $r$, the length of the vectors compared
is $r\left((2^{k-m_1})^2 + (2^{k-m_2})^2 + \ldots + (2^{k-m_p})^2\right)$, where
$m_1, m_2, \ldots, m_p$ are the values used for $m$. The choice
of $m$ for this study was made to balance the computational
cost of calculating the vector of descriptors
with the ability to compare the two matrices at various
scales: large ($m = 6$, that is, compare windows of size
$64 \times 64$), medium ($m = 5$, windows of size $32 \times 32$) and
small ($m = 4$, windows of size $16 \times 16$). The parameter
$r = 5$ and the 5 bins were kept constant throughout
our calculations but, in general, these parameters can
also be varied, and the resulting vectors for each value
added to the vector of descriptors, resulting in a larger
vector.
In principle, the descriptor distance between two FCGRs
effectively compares the distribution of frequencies
of k-mers between the corresponding submatrices
$X_{ij}$ and $Y_{ij}$, and does that for several values of $m$,
that is, at several different scales. (Note that, in each
window $X_{ij}$, all k-mers have the same suffix of length
$k - m$.)
ample wherein k= 3, m= 2, r= 3, and the 3 bins are
[0,15)[15,30)[30,). Since k= 3, the FCGR table
will contain the number of occurrences of all 3-mers in
a DNA sequence, as follows:
Take the two FCGRs X, Y N8×8, (k= 3, thus
23×23) corresponding to two genomic 150 kbp se-
quences of our dataset (one human and one bacterial),
respectively. In order to use small numbers throughout
the example, we divide all elements of the obtained ma-
trices by 100 and take the integer part of each element,
42 33 9 33 14 10 15 45
22 30 26 25 9 5 37 37
32 21 33 19 44 35 41 35
17 9 13 21 23 10 22 18
37 26 6 32 34 24 9 23
29 24 31 27 19 27 18 28
21 23 10 9 19 17 21 15
35 15 14 14 19 12 17 30
18 34 40 27 30 36 27 12
27 18 27 32 24 23 15 23
24 17 13 17 36 12 32 18
27 17 28 26 18 8 22 25
32 32 23 16 16 25 23 22
20 29 18 25 16 16 15 17
25 25 7 16 26 27 20 25
32 21 20 21 25 18 27 34
Thus, in the human DNA sequence, the triplet CCC
appears about 4200 times, the triplet GCC appears
about 3300 times, the triplet CGC appears about 900
times, etc.
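Since $m = 2$, we divide each of the matrices $X$ and $Y$
into non-overlapping submatrices of size $4 \times 4$ ($2^2 \times 2^2$).
For $X$ we thus obtain $X_{11}, X_{12}, X_{21}, X_{22}$:
\[
X_{11} = \begin{pmatrix} 42 & 33 & 9 & 33 \\ 22 & 30 & 26 & 25 \\ 32 & 21 & 33 & 19 \\ 17 & 9 & 13 & 21 \end{pmatrix}, \quad
X_{12} = \begin{pmatrix} 14 & 10 & 15 & 45 \\ 9 & 5 & 37 & 37 \\ 44 & 35 & 41 & 35 \\ 23 & 10 & 22 & 18 \end{pmatrix},
\]
\[
X_{21} = \begin{pmatrix} 37 & 26 & 6 & 32 \\ 29 & 24 & 31 & 27 \\ 21 & 23 & 10 & 9 \\ 35 & 15 & 14 & 14 \end{pmatrix}, \quad
X_{22} = \begin{pmatrix} 34 & 24 & 9 & 23 \\ 19 & 27 & 18 & 28 \\ 19 & 17 & 21 & 15 \\ 19 & 12 & 17 & 30 \end{pmatrix}
\]
and similarly for $Y$.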
Since the $r = 3$ bins are $[0,15) \cup [15,30) \cup [30, \infty)$,
we count, for each submatrix, the number of 3-mers
for which the number of occurrences is less than
15, at least 15 but less than 30, and greater than or equal to
30. Thus we obtain $vec_{X_{11}} = \frac{1}{16}(3, 7, 6)$, which has
as elements the number of elements of $X_{11}$ which belong
to each of the intervals selected, divided by the
total number of elements of $X_{11}$. We proceed similarly
for $vec_{X_{12}} = \frac{1}{16}(5, 4, 7)$, $vec_{X_{21}} = \frac{1}{16}(5, 7, 4)$,
$vec_{X_{22}} = \frac{1}{16}(2, 12, 2)$, and we form $vec_X$ by appending
these vectors one after the other, that is
\[
vec_X = \frac{1}{16}(3, 7, 6, 5, 4, 7, 5, 7, 4, 2, 12, 2).
\]
We apply exactly the same procedure for the matrix
$Y$ and we get
\[
vec_Y = \frac{1}{16}(1, 12, 3, 3, 9, 4, 1, 12, 3, 0, 15, 1).
\]
The descriptor distance between these two FCGRs is
computed as the Euclidean distance between $vec_X$ and
$vec_Y$, in this case $d_D(X, Y) \approx 0.718$. Note that, since
we started by dividing the number of 3-mer occurrences
by 100, as well as because of the bin selection,
this is a fictitious example. The real value of the descriptor
distance between the mentioned human and
bacterial sequences is 8.66, and the range of the descriptor
distance for this dataset of DNA sequences is
[0, 13.17]. In general, the descriptor distance has a variable
range that depends on the choice of parameters.
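The worked example can be checked with a short Python sketch of the descriptor distance. The function names and the `cuts` argument (the inner bin boundaries $a_1, \ldots, a_{r-1}$, with $a_0 = 0$ and $a_r = \infty$ implied) are hypothetical choices made for this illustration; windows are visited row by row, which matches the order $X_{11}, X_{12}, X_{21}, X_{22}$ used above.

```python
import numpy as np


def descriptor(fcgr, m, cuts):
    """Per-window bin frequencies for window size 2^m x 2^m."""
    a = np.asarray(fcgr)
    w = 2 ** m
    parts = []
    for i in range(0, a.shape[0], w):
        for j in range(0, a.shape[1], w):
            block = a[i:i + w, j:j + w].ravel()
            bins = np.digitize(block, cuts)  # bin index of every entry
            hist = np.bincount(bins, minlength=len(cuts) + 1)
            parts.append(hist / block.size)
    return np.concatenate(parts)


def descriptor_distance(x, y, ms, cuts):
    """Euclidean distance between descriptors appended over all m in `ms`."""
    vx = np.concatenate([descriptor(x, m, cuts) for m in ms])
    vy = np.concatenate([descriptor(y, m, cuts) for m in ms])
    return float(np.linalg.norm(vx - vy))


# The two 8 x 8 example matrices from the text.
X = np.array([
    [42, 33, 9, 33, 14, 10, 15, 45], [22, 30, 26, 25, 9, 5, 37, 37],
    [32, 21, 33, 19, 44, 35, 41, 35], [17, 9, 13, 21, 23, 10, 22, 18],
    [37, 26, 6, 32, 34, 24, 9, 23], [29, 24, 31, 27, 19, 27, 18, 28],
    [21, 23, 10, 9, 19, 17, 21, 15], [35, 15, 14, 14, 19, 12, 17, 30],
])
Y = np.array([
    [18, 34, 40, 27, 30, 36, 27, 12], [27, 18, 27, 32, 24, 23, 15, 23],
    [24, 17, 13, 17, 36, 12, 32, 18], [27, 17, 28, 26, 18, 8, 22, 25],
    [32, 32, 23, 16, 16, 25, 23, 22], [20, 29, 18, 25, 16, 16, 15, 17],
    [25, 25, 7, 16, 26, 27, 20, 25], [32, 21, 20, 21, 25, 18, 27, 34],
])
example = descriptor_distance(X, Y, ms=[2], cuts=[15, 30])
```

With these inputs the sketch reproduces the vectors and the value of about 0.718 given in the example. The settings reported for the real dataset would correspond to `ms=[4, 5, 6]` and `cuts=[1, 2, 5, 20]` on $2^9 \times 2^9$ FCGRs.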
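To compute the Euclidean, Manhattan and Pearson
distances, we first convert the matrices $X, Y \in \mathbb{N}^{n \times n}$
into $1 \times n^2$ vectors. For two vectors $x, y \in \mathbb{R}^n$, their
Euclidean distance $d_E(x, y)$ and their Manhattan distance
$d_M(x, y)$ are computed as
\[
d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]
\[
d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|
\]
while their Pearson distance $d_P(x, y)$ is defined as
\[
d_P(x, y) = 1 - \frac{\sigma_{xy}}{\sigma_x \sigma_y}
\]
where
\[
\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad
\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2}, \quad
\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y).
\]
In theory, the correlation coefficient $\frac{\sigma_{xy}}{\sigma_x \sigma_y}$ ranges in
the interval $[-1, 1]$, and therefore the Pearson distance
ranges in the interval $[0, 2]$.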
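These three vector distances translate directly into NumPy. The sketch below is illustrative only; the function names are hypothetical, and the Pearson distance is undefined (division by zero) when either input is constant.

```python
import numpy as np


def _flatten_pair(x, y):
    """Flatten two equally shaped matrices into 1-D float vectors."""
    a = np.asarray(x, dtype=float).ravel()
    b = np.asarray(y, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("inputs must have the same number of elements")
    return a, b


def euclidean_distance(x, y):
    a, b = _flatten_pair(x, y)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan_distance(x, y):
    a, b = _flatten_pair(x, y)
    return float(np.sum(np.abs(a - b)))


def pearson_distance(x, y):
    a, b = _flatten_pair(x, y)
    da, db = a - a.mean(), b - b.mean()
    sigma_xy = np.mean(da * db)
    sigma_x = np.sqrt(np.mean(da ** 2))
    sigma_y = np.sqrt(np.mean(db ** 2))
    return float(1.0 - sigma_xy / (sigma_x * sigma_y))
```

Note that the Pearson distance is unchanged when one input is multiplied by a positive constant or shifted, whereas the Euclidean and Manhattan distances are not.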
The last distance we considered is based on the information
distance defined in [13]. The use of this distance
is motivated computationally since it is easily
computed from FCGRs, as it tracks the number of different
k-mers for a sequence instead of the actual set.
In [13], for a given $k$, the information distance for two
strings $x, y$ is defined as
\[
d_{AID}(x, y) = \frac{N_k(x|y) + N_k(y|x)}{N_k(xy)}
\]
where
\[
N_k(x|y) = N_k(xy) - N_k(x)
\]
and $N_k(x)$ is the number of different k-mers (possibly
overlapping) which occur in $x$. We go one step
further and modify this in order to avoid the creation
of “unwanted” k-mers from the concatenation $xy$ of
$x$ and $y$. First, we need to show how we compute
$N_k(x)$ for a sequence $x$. For a sequence $x$, we first
build its $FCGR(x) = X \in \mathbb{N}^{2^k \times 2^k}$, which is a matrix
of size $2^k \times 2^k$ with element values in $\mathbb{N}$. Then we
unitize $X$, that is, every non-zero entry becomes 1,
while zeros remain 0. $N_k(x)$ is now computed as the
sum of the elements of this unitized FCGR, that is,
$N_k(x) = f(X) = SumOfElements(Unitize(X))$. For
two strings $x$ and $y$, with FCGRs $X$ and $Y$ respectively,
we define $N_k(x|y)$ as:
\[
N_k(x|y) = f(X + Y) - N_k(x) \qquad (1)
\]
This slight modification of the information distance
also gives us the desired properties $d(x, x) = 0$ and
$d(x, y) = d(y, x)$, which were not satisfied before. Using
(1), we now define the approximated information
distance (AID) as:
\[
d_{AID}(x, y) = 2 - \frac{f(X) + f(Y)}{f(X + Y)}
\]
where $x, y$ are the strings and $X, Y \in \mathbb{N}^{2^k \times 2^k}$ their
FCGRs, respectively. It also turns out that this distance
is in fact the normalized Hamming distance of
the unitized FCGRs $X$ and $Y$. Note that, for two
sets $\mathcal{X}$ and $\mathcal{Y}$, the normalized Hamming distance is
$\frac{|\mathcal{X} \triangle \mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|} = 2 - \frac{|\mathcal{X}| + |\mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|}$, where $\triangle$ denotes the symmetric
difference.
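A minimal Python sketch of the approximated information distance follows, written directly from the definition of $f$ and the set identity above. The function name `aid_distance` is hypothetical, and returning 0 for two all-zero FCGRs is a convention added here to avoid a division by zero; the text does not address that case.

```python
import numpy as np


def aid_distance(x_fcgr, y_fcgr):
    """Approximated information distance between two FCGRs.

    f(X) counts the non-zero entries of X, and f(X + Y) counts the
    entries that are non-zero in X or in Y (counts are non-negative).
    """
    ux = np.asarray(x_fcgr) > 0          # unitized X
    uy = np.asarray(y_fcgr) > 0          # unitized Y
    f_sum = np.count_nonzero(ux | uy)    # f(X + Y)
    if f_sum == 0:
        return 0.0                       # assumed convention for empty inputs
    f_x = np.count_nonzero(ux)
    f_y = np.count_nonzero(uy)
    return 2.0 - (f_x + f_y) / f_sum
```

For instance, if $X$ has two non-zero entries, $Y$ has two, and they share one, then $f(X+Y) = 3$ and the distance is $2 - 4/3 = 2/3$, which equals the size of the symmetric difference (2) divided by the size of the union (3).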
The generation of CGR images, calculation of distance
matrices and creation of 2D and 3D Molecular
Distance Maps with MDS were done with, and can be
tested using, the code available in [20], written in Wolfram
Mathematica, version 9. The interactive webtool
MoDMap [21] allows in-depth exploration of the 2D
MoD Maps (Molecular Distance Maps) in this paper[3].
[3] When using the interactive webtool MoDMap, clicking
on a distance underneath a dataset will result in
Online Supplemental Material [20] includes all distance matrices and the code used to produce all figures and plots in this paper. More details about the online resources can be found in Appendix B.
Analysis and Results
For our dataset, we use $k = 9$, that is, each DNA sequence was represented as a $2^9 \times 2^9$ FCGR matrix. In practice, this means that the FCGR of a DNA sequence contains the full information regarding its $k$-mer sequence composition, for $k = 1, 2, ..., 9$. The choice of the length 150 kbp and of the value $k = 9$ is justified by the fact that, for a random sequence of length 150 kbp, its CGR at resolution $2^9 \times 2^9$ has around half of the pixels black, and half white.
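The "around half" claim can be checked numerically. The sketch below assumes a uniformly random sequence with independent bases, and the closed-form estimate additionally treats the k-mers at different positions as independent, which is only an approximation because neighbouring k-mers overlap. The function names are hypothetical.

```python
import numpy as np

def expected_occupancy(n: int, k: int) -> float:
    """Approximate expected fraction of FCGR cells hit by a random sequence of length n."""
    cells = 4 ** k
    positions = n - k + 1
    return 1.0 - (1.0 - 1.0 / cells) ** positions

def simulated_occupancy(n: int, k: int, seed: int = 0) -> float:
    """Fraction of distinct k-mers observed in one simulated uniform random sequence."""
    rng = np.random.default_rng(seed)
    bases = rng.integers(0, 4, size=n)
    codes = np.zeros(n - k + 1, dtype=np.int64)
    for offset in range(k):
        # build the base-4 code of the k-mer starting at every position
        codes = 4 * codes + bases[offset:n - k + 1 + offset]
    return np.unique(codes).size / 4 ** k

print(expected_occupancy(150_000, 9))
print(simulated_occupancy(150_000, 9))
```

Under these assumptions both numbers come out at roughly 0.44, that is, a little under half of the $2^9 \times 2^9$ pixels are occupied.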
Figure 2 depicts two-dimensional Molecular Distance Maps for the over five hundred DNA sequences in our dataset, computed using the DSSIM distance, descriptor distance, Euclidean distance, Manhattan distance, Pearson distance and approximated information distance, respectively. Figure 3 depicts the corresponding three-dimensional Molecular Distance Maps for the same dataset. The projection of each three-dimensional map is chosen by hand in order to visually separate clusters of points which appear to be overlapping in the two-dimensional maps, as discussed below.
We note that MDS is not a clustering method, as the clusters are defined beforehand by the coloring scheme used (blue for H. sapiens, green for E. coli, and so on). MDS simply tries to display visually the interrelationships between the given items, based on the pairwise distances in the distance matrix which is its input. Note also that an increase in dimensionality from 2 to 3 can lead to a better cluster visualization. For example, if we compare the two-dimensional and the three-dimensional Molecular Distance Maps obtained using DSSIM, we see that points that appeared to be erroneously mixed with each other in the two-dimensional map, Figure 2(a) (S. cerevisiae and P. falciparum sequences mixed in with A. thaliana sequences), were in fact clearly separated from each other in Figure 3(a), the three-dimensional version of the Molecular Distance Map.
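To make the role of MDS concrete, here is a minimal sketch of classical (Torgerson) MDS in Python. This is an assumption made for illustration: the paper's maps were produced in Mathematica, and the exact MDS variant used there may differ, so the coordinates below are not expected to reproduce the published figures. The function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, dims: int = 2) -> np.ndarray:
    """Embed n items in `dims` dimensions from an n x n matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))  # drop negative eigenvalues
    return eigvecs[:, order] * scale

# Toy input: four points whose pairwise distances are exactly planar.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords_2d = classical_mds(D, dims=2)
coords_3d = classical_mds(D, dims=3)
```

Only the distance matrix enters the computation; cluster labels are used afterwards, for coloring the points. Passing `dims=3` adds one more coordinate, which is what allows clusters that overlap in the plane to separate.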
[3] (continued) Once the MoD Map of the dataset computed with the chosen distance is plotted, clicking on a point will display a window with information about the subsequence represented by that point: its NCBI accession number, the scientific name of the organism it originates from, and its CGR pattern. Clicking on the “From here” and “To here” buttons on two such selected windows will display the distance between the corresponding genomic subsequences in the distance matrix.
Figure 4 displays the histograms of the pairwise intragenomic distances (dark blue and turquoise) and intergenomic distances (grey) of DNA sequences from H. sapiens and A. thaliana, obtained using each of the six distances. As noted, some distances seem to perform better than others. Visually, the poorest performer for these two sets of sequences (from H. sapiens and A. thaliana) seems to be the Euclidean distance, wherein the intragenomic distances are as high as the intergenomic distances, and no separation is visible. In contrast, DSSIM gives, for the same data, intergenomic distances that are overall much higher than intragenomic distances, resulting in a clear classification of DNA sequences into the species they belong to.
Table 3 displays the mean and standard deviation of distances between clusters $C_i$ and $C_j$, $1 \le i, j \le 6$, where a cluster $C_\ell$ is defined as the set of all genomic sequences from the genome of organism $\ell$, as labelled in Table 1. In each subtable, the diagonal entries are the means and standard deviations of intragenomic distances, while the other entries are all intergenomic distances. From this table we see that for DSSIM, Manhattan and approximated information distance, the maximum of all the averages of intragenomic distances in this dataset is strictly smaller than the minimum of all the averages of intergenomic distances. For the descriptor distance and Pearson distance the previous statement does not hold but, for each pair of organisms, the two averages of intragenomic distances (e.g., human-human and plant-plant) are both lower than the average of the intergenomic distances (human-plant). For the Euclidean distance, none of the previous statements holds: for example, the average of the plant-plant intragenomic distances (element 4-4 in the Euclidean distance subtable of Table 3) is 723, which is larger than 672, the average of the yeast-plant intergenomic distances (element 3-4 in the Euclidean distance subtable of Table 3). The complete histograms of all pairwise comparisons $C_i \times C_j$ can be found in Appendix C.
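The per-cluster statistics and the separation check described above can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator was used for Table 3.

```python
import numpy as np

def cluster_distance_stats(D, labels):
    """Mean and standard deviation of the distances in C_i x C_j for every i <= j."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = list(dict.fromkeys(labels.tolist()))  # keep order of first appearance
    stats = {}
    for a, ci in enumerate(clusters):
        rows = np.flatnonzero(labels == ci)
        for cj in clusters[a:]:
            cols = np.flatnonzero(labels == cj)
            block = D[np.ix_(rows, cols)]
            if ci == cj:
                # intragenomic: each unordered pair once, self-distances excluded
                values = block[np.triu_indices(len(rows), k=1)]
            else:
                values = block.ravel()
            stats[(ci, cj)] = (float(values.mean()), float(values.std()))
    return stats

def intra_below_inter(stats) -> bool:
    """True if the largest intragenomic mean is below the smallest intergenomic mean."""
    intra = [mean for (ci, cj), (mean, _) in stats.items() if ci == cj]
    inter = [mean for (ci, cj), (mean, _) in stats.items() if ci != cj]
    return max(intra) < min(inter)

# Toy data: two well-separated clusters on a line.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
stats = cluster_distance_stats(D, ["a", "a", "b", "b"])
separated = intra_below_inter(stats)
```

The sketch assumes every cluster has at least two members; a singleton cluster has no intragenomic distances and would yield an undefined mean.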
Rows are indexed by $i$; entries are mean ± standard deviation for $C_i \times C_j$ with $j \ge i$, in column order.

DSSIM
1: 0.81 ± …; 0.99 ± …; 0.92 ± …; 0.91 ± …; 0.92 ± …; 0.91 ± …
2: 0.85 ± …; 0.97 ± …; 0.99 ± …; 0.99 ± 0.01; 0.99 ± 0.…
3: 0.87 ± …; 0.89 ± 0.02; 0.91 ± 0.…; 0.91 ± …
4: 0.87 ± …; 0.91 ± …
5: 0.74 ± 0.01; 0.94 ± 0.…
6: 0.83 ± …

Descriptors
1, 2: 3.76 ± …; 9.74 ± …; 5.92 ± …; 5.71 ± …; 9.33 ± …; 5.44 ± …; 8.05 ± …; 12.67 ± …; 9.38 ± …
3: 2.12 ± …; 3.42 ± …; 9.48 ± …
4: 2.75 ± …; 8.23 ± …; 4.94 ± …
5: 1.53 ± …; 9.99 ± …
6: 2.4 ± …

Euclidean
1: 756 ± …; 856 ± …; 756 ± …; 818 ± …; 3914 ± …; 812 ± …
2: 558 ± 5; 674 ± …; 802 ± …; 4102 ± 466; 696 ± 18
3: 564 ± …; 672 ± …; 3964 ± 472; 633 ± 20
4: 723 ± …; 3923 ± …; 748 ± …
5: 999 ± …; 4085 ± …
6: 585 ± 24

Manhattan (in thousands)
1: 171 ± 15; 222 ± 5; 189 ± 13; 188 ± 17; 213 ± 20; 191 ± 9
2: 175 ± 2; 209 ± 4; 219 ± 8; 252 ± 4; 218 ± 3
3: 171 ± 2; 177 ± 10; 206 ± 2; 184 ± 2
4: 172 ± 16; 200 ± 11; 188 ± 9
5: 105 ± 3; 224 ± 2
6: 167 ± 3

Pearson
1: 0.97 ± …; 0.69 ± …; 0.64 ± …; 0.65 ± …; 0.81 ± …
2: 0.71 ± …; 0.93 ± …; 0.96 ± …; 0.98 ± …; 0.99 ± …
3: 0.6 ± …; 0.71 ± …; 0.75 ± …
4: 0.53 ± …; 0.63 ± …; 0.76 ± …
5: 0.02 ± …; 0.94 ± …
6: 0.64 ± …

Approx. Information
1: 0.65 ± …; 0.78 ± …; 0.76 ± …; 0.69 ± …
2: 0.67 ± …; 0.75 ± …; 0.77 ± …; 0.85 ± …; 0.77 ± …
3: 0.67 ± …; 0.68 ± 0.02; 0.74 ± 0.…; 0.69 ± 0.…
4: 0.67 ± …; 0.73 ± …; 0.69 ± …
5: 0.64 ± …; 0.76 ± …
6: 0.65 ± …

Table 3 Mean and standard deviation of distances between clusters $C_i \times C_j$ for $i, j = 1, ..., 6$.
Figure 2 Two-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance, respectively. Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turquoise), P. falciparum (magenta), and P. furiosus (orange).
Quality Measures for Distances
In this section we present three quality measures, each of which evaluates the quality of the six distances considered. In the data mining literature a wide range of quality measures for clusterings has been defined; see for example [22,23]. Most of these methods are designed to assess the quality of different automated clustering methods while using the same distance. Our set-up is different, as we use different distances while the clustering is fixed and given by the initial colour-coding of the sequence-representing points. Thus, we have to use other approaches to compare the distances we analyze. In particular, as the six distances have different ranges, we have to use assessment methods which are invariant to the scale of the distance.
Figure 3 Three-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance, respectively. Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turquoise), P. falciparum (magenta), and P. furiosus (orange).
The “ground-truth” that we use as a basis for our distance assessment is the fact that the “ideal” clustering of DNA sequences and the points that represent them is known: sequences from the same organism should be close to one another and far from sequences originating from other organisms. (This assumption is justified, for this dataset, as the six organisms considered are very different from one another, belonging to different kingdoms of life.) Thus, an optimal distance should yield a relatively small distance between two FCGRs generated from DNA sequences originating from the same organism, and relatively high distances between two FCGRs generated from DNA sequences coming from different organisms.
Figure 4 Histograms of pairwise intragenomic and intergenomic distances among the DNA sequences from H. sapiens and A. thaliana, obtained using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance.
In order to assess each of the six distances quantitatively, we computed three quality measures which rate different features of a distance:
• the correlation to an idealized cluster distance,
• the silhouette cluster accuracy,
• the relative overlap between the intragenomic and intergenomic distance histograms.
Let us stress that all three quality measures of the six distances are based on the distance matrices which we computed and not on their MDS plots. We will define the three quality measures such that their expected values range in the interval [0, 1], where higher values correspond to better performance.
Let us first describe the three quality measures informally. An idealized distance is a distance that would be able to differentiate DNA sequences by species, that is, a distance $\delta$ for which $\delta(x, y) = 0$ if $x$ and $y$ are sequences from the same species and $\delta(x, y) = 1$ otherwise. The first quality measure, the correlation to an idealized cluster distance, measures how well a distance is linearly correlated to the idealized distance $\delta$. The second quality measure, silhouette cluster accuracy, is the percentage of points that are best embedded in the cluster they belong to. The third quality measure quantifies the “visual overlap” between the intragenomic and intergenomic distance histograms. Given our dataset, it is reasonable to expect that a good distance gives a low value if applied to FCGRs of genomic sequences of the same organism, and a high value when applied to FCGRs of genomic sequences from two different organisms, thus separating the histograms of intragenomic distances from that of intergenomic distances. This is illustrated by the histograms in Figure 4, where a high overlap between the graphs of intragenomic distances (dark blue and turquoise) and the graph of intergenomic distances (grey) is an indication of a poorly performing distance. In a theoretically optimal situation, there would exist a value $c$ such that all distances that are smaller than $c$ are intragenomic distances and all distances that are larger than $c$ are intergenomic distances. This can usually not be expected from real data, but a low overlap between histograms is nevertheless indicative of a “good” distance.
In order to formally define the three quality measures, we consider a dataset $V$ which is partitioned into $p$ non-overlapping clusters $C_1, \ldots, C_p$ and for which a distance $d_\alpha : V \times V \to \mathbb{R}_{\ge 0}$ exists. The cardinalities of the sets are $|V| = m$ and $|C_i| = m_i$ for $i = 1, \ldots, p$. In our analysis, $p = 6$ and $C_1$ contains all FCGRs generated from genomic DNA sequences from H. sapiens, $C_2$ contains all FCGRs generated from genomic sequences of E. coli, and so on, according to the order in Table 1. The distance $d_\alpha$ is one of the six distances, $\alpha \in \{\mathrm{DSSIM}, D, E, M, P, \mathrm{AID}\}$.
The correlation to an idealized cluster distance is computed as follows. We define the idealized cluster distance as a function (or matrix) $\delta : V \times V \to \{0, 1\}$ such that $\delta(x, y) = 0$ if and only if $x$ and $y$ belong to the same cluster, and $\delta(x, y) = 1$ otherwise. Because we can view $d_\alpha$ and $\delta$ as discrete, symmetric functions which have the same domain, we can compute their correlation coefficient. We define the correlation of $\delta$ to $d_\alpha$ to be the Pearson correlation of $\delta$ and $d_\alpha$. More precisely, the upper triangular part of the matrix corresponding to a distance $d_\alpha$ is interpreted as a vector $(x_1, \ldots, x_n)$ and compared with the corresponding values $(y_1, \ldots, y_n)$ given by $\delta$. We obtain the $\delta$-correlation
$$D_\alpha = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$
where $\bar{x}$ and $\bar{y}$ are the means of the two vectors. The correlation ranges in the interval $[-1, 1]$: a value of 1 means that $d_\alpha$ and $\delta$ are linearly correlated, and a value of 0 means that they are unrelated. In other words, if the value obtained by measuring the correlation of a given distance to the idealized cluster distance is close to 1, this means that the given distance is closer to the idealized cluster distance, and hence performs well. Note that negative values for this measure are not expected, as this would imply that $d_\alpha$ and $\delta$ were negatively related ($d_\alpha$ would perform worse than a matrix containing random entries).
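The $\delta$-correlation can be computed in a few lines. This sketch assumes that the "upper triangular part" excludes the diagonal (self-distances); the function name `delta_correlation` is hypothetical.

```python
import numpy as np

def delta_correlation(D, labels) -> float:
    """Pearson correlation between a distance matrix and the idealized 0/1 cluster distance."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    upper = np.triu_indices(len(labels), k=1)             # pairs (i, j) with i < j
    x = D[upper]                                          # observed distances
    ideal = (labels[:, None] != labels[None, :]).astype(float)
    y = ideal[upper]                                      # 0 within a cluster, 1 across clusters
    return float(np.corrcoef(x, y)[0, 1])

# Toy data: two well-separated clusters on a line.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
labels = np.array([0, 0, 1, 1])
print(delta_correlation(D, labels))
```

Because the Pearson correlation is unchanged by positive rescaling of the distances, the measure can be compared across distances with different ranges.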
The silhouette cluster accuracy is based on the silhouette coefficient, defined in [24] as a measure that determines how well a single point is embedded in the cluster to which it belongs. For a point $x$ from cluster $C_i$ we define $a_x$ as the average distance of this point to all other points in $C_i$, that is,
$$a_x = \frac{1}{m_i - 1} \sum_{y \in C_i \setminus \{x\}} d_\alpha(x, y),$$
and we define $b_x$ as the minimum over the average distances of $x$ to all points of a different cluster,
$$b_x = \min_{j \ne i} \frac{1}{m_j} \sum_{y \in C_j} d_\alpha(x, y).$$
The silhouette coefficient of $x$ is defined as
$$S_\alpha(x) = \frac{b_x - a_x}{\max\{a_x, b_x\}}.$$
If a point $x$ has a silhouette coefficient $S_\alpha(x) \le 0$, then $x$ is at least as close to a cluster to which it does not belong as to its own cluster. The silhouette cluster accuracy $A_\alpha$ denotes the percentage of points with a silhouette coefficient greater than 0, that is, the percentage of points which are well-embedded in their own cluster:
$$A_\alpha = \frac{|\{x \in V \mid S_\alpha(x) > 0\}|}{m}.$$
The silhouette cluster accuracy ranges in [0, 1], with a high accuracy being desirable.
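A direct, unoptimized sketch of the silhouette cluster accuracy follows. The function name is hypothetical, and two conventions are assumptions of this sketch rather than statements from the paper: a point in a singleton cluster is counted as not well-embedded, and a point with $a_x = b_x = 0$ gets silhouette 0.

```python
import numpy as np

def silhouette_accuracy(D, labels) -> float:
    """Fraction of points whose silhouette coefficient is strictly positive."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)        # at least two clusters are required
    well_embedded = 0
    for idx in range(len(labels)):
        own = labels[idx]
        same = labels == own
        same[idx] = False               # exclude the point itself from its own cluster
        if not same.any():
            continue                    # singleton cluster: not counted as well-embedded
        a = D[idx, same].mean()
        b = min(D[idx, labels == c].mean() for c in clusters if c != own)
        largest = max(a, b)
        s = (b - a) / largest if largest > 0 else 0.0
        if s > 0:
            well_embedded += 1
    return well_embedded / len(labels)

points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
good = silhouette_accuracy(D, [0, 0, 1, 1])   # labels follow the geometry
bad = silhouette_accuracy(D, [0, 1, 0, 1])    # labels cut across the geometry
```

With labels that follow the geometry every point is closer to its own cluster; with labels that cut across it, no point is.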
For assessing the relative overlap of the histograms, consider any two clusters $C_i$ and $C_j$ with $i \ne j$ (for example, $C_1$ is the H. sapiens cluster and $C_4$ the A. thaliana cluster). We compare the two sets of intragenomic distances $C_i \times C_i$ and $C_j \times C_j$ with the set of intergenomic distances $C_i \times C_j$. For a distance $d_\alpha$, we divide the range from $\min(d_\alpha)$ to the maximum distance $\max(d_\alpha)$ in this dataset into 100 bins of equal size $r$ and count the distances which fall into each bin: $c_{i,i}[\ell]$ denotes bin $\ell$ containing distances from $C_i \times C_i$ and $c_{i,j}[\ell]$ denotes bin $\ell$ containing distances from $C_i \times C_j$. For $\ell = 1, \ldots, 100$ we let
$$c_{i',j'}[\ell] = |\{\{x, y\} \mid x \in C_{i'},\ y \in C_{j'} \text{ and } x \ne y \text{ and } (\ell - 1) \cdot r < d_\alpha(x, y) \le \ell \cdot r\}|.$$
By $s_{i',j'}$ we denote the sum over all $c_{i',j'}$-bins, $s_{i',j'} = \sum_{\ell=1}^{100} c_{i',j'}[\ell]$. We define the relative overlap $O_\alpha(i, j)$ of $C_i \times C_i$ (intragenomic distances) with $C_i \times C_j$ (intergenomic distances) as
$$O_\alpha(i, j) = \frac{\max\{s_{i,i}, s_{i,j}\}}{\min\{s_{i,i}, s_{i,j}\}} \cdot \frac{\sum_{\ell=1}^{100} \min\{c_{i,i}[\ell], c_{i,j}[\ell]\}}{\sum_{\ell=1}^{100} \max\{c_{i,i}[\ell], c_{i,j}[\ell]\}}.$$
The relative overlap $O_\alpha(j, i)$ of $C_j \times C_j$ with $C_i \times C_j$ is defined analogously; note that $O_\alpha(i, j) \ne O_\alpha(j, i)$ in general. The overlap is normalized to the range [0, 1], where 0 means no overlap of elements of bins between intra- and intergenomic distances, and 1 means that one of the histograms completely “covers” the other. Also note that we are not interested in the overlap of $C_i \times C_i$ with $C_j \times C_j$, as both sets of distances are intragenomic distances.
Since we intend to define a quality measure where a value close to 1 represents a small overlap, we will use $1 - O_\alpha(i, j)$ as relative overlap. Furthermore, we combine these quantities for all possible pairs of clusters $C_i$ and $C_j$, obtaining the relative overlap as:
$$O_\alpha = 1 - \frac{1}{p(p - 1)} \sum_{i \ne j} O_\alpha(i, j).$$
For example, in Figure 4, for each of the considered distances, the dark blue histograms depict the $C_1 \times C_1$ (H. sapiens – H. sapiens) intragenomic distances, the turquoise histograms the $C_4 \times C_4$ (A. thaliana – A. thaliana) intragenomic distances, and the grey histograms the $C_1 \times C_4$ (H. sapiens – A. thaliana) intergenomic distances. As seen from this figure, the descriptor distance appears to visually perform best at separating the two intragenomic distance histograms from the intergenomic histogram, while the Euclidean distance has the weakest performance. The relative overlap attempts to quantify this by computing the overlaps of each of the two pairs of histograms (dark blue with grey and turquoise with grey). Note that small visual histogram overlaps will result in a high numerical relative overlap, which is indicative of a better performing distance.
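The histogram-based measure can be sketched as below. Several details are assumptions of this sketch: the combined score is taken as one minus the plain average of $O_\alpha(i, j)$ over all ordered pairs $i \ne j$, the bin range spans the distances between distinct items, every cluster is assumed to have at least two members, and `numpy.histogram` uses half-open bins $[a, b)$ rather than the $(a, b]$ bins of the text, which only matters for values lying exactly on a bin edge. The function names are hypothetical.

```python
import numpy as np

def pair_distances(D, labels, ci, cj):
    """Distances in C_i x C_j; for ci == cj each unordered pair is taken once."""
    rows = np.flatnonzero(labels == ci)
    cols = np.flatnonzero(labels == cj)
    block = D[np.ix_(rows, cols)]
    if ci == cj:
        return block[np.triu_indices(len(rows), k=1)]
    return block.ravel()

def relative_overlap(D, labels, bins: int = 100) -> float:
    """One minus the average histogram overlap of intra- vs intergenomic distances."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    upper = np.triu_indices(len(labels), k=1)
    edges = np.linspace(D[upper].min(), D[upper].max(), bins + 1)
    total, count = 0.0, 0
    for ci in clusters:
        intra, _ = np.histogram(pair_distances(D, labels, ci, ci), bins=edges)
        for cj in clusters:
            if ci == cj:
                continue
            inter, _ = np.histogram(pair_distances(D, labels, ci, cj), bins=edges)
            s_intra, s_inter = intra.sum(), inter.sum()
            scale = max(s_intra, s_inter) / min(s_intra, s_inter)
            shared = np.minimum(intra, inter).sum()
            covered = np.maximum(intra, inter).sum()
            total += scale * shared / covered
            count += 1
    return 1.0 - total / count

points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
score = relative_overlap(D, np.array([0, 0, 1, 1]))
```

In the toy example the intragenomic and intergenomic distances fall into disjoint bins, so every pairwise overlap is 0 and the score is 1.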
Distance Comparison Results
The results of comparing the six distances we analyzed, using the three quality measures, are listed in Table 4. Recall that all quality measures have an expected range of [0, 1], where larger values imply better performance.

Distance       Dα      Aα      Oα      z-score sum   Rank
DSSIM          0.627   1.000   0.965    1.895        2nd
Descriptors    0.639   0.976   0.988    2.509        1st
Euclidean      0.231   0.325   0.907   -4.831        6th
Manhattan      0.527   1.000   0.951    0.84         3rd
Pearson        0.536   0.980   0.888   -0.875        5th
Approx. Inf.   0.527   1.000   0.937    0.462        4th

Table 4 Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 508 genomic DNA sequences taken from organisms from each kingdom of life. Dα is the correlation to an idealized cluster distance, Aα the silhouette cluster accuracy, and Oα the relative overlap. Higher is better.
To compare each distance relative to all the other distances, we further compute, for each quality measure (each column), the standard scores (z-scores) of each distance $d_\alpha$, where $\alpha \in \{\mathrm{DSSIM}, D, E, M, P, \mathrm{AID}\}$, as
$$z(d_\alpha) = \frac{d_\alpha - \mu}{\sigma},$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of all six $d_\alpha$ for that particular quality measure (column). A positive value of the standard score will mean that a distance performs above average (in this category) and a negative value that it performs below average.
Finally, for each distance we compute the sum of its z-scores over the three quality measures, as seen in Table 4. Note that the total of z-scores for a distance represents the performance of that distance relative to the other distances, and indicates its relative ranking.
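The ranking step can be reproduced from the three quality-measure columns of Table 4. In this sketch the sample standard deviation (`ddof=1`) is used; that choice is an inference, made because it reproduces the z-score sums printed in Table 4 up to rounding, whereas the text only says "deviation".

```python
import numpy as np

names = ["DSSIM", "Descriptors", "Euclidean", "Manhattan", "Pearson", "Approx. Inf."]

# Columns: D_alpha, A_alpha, O_alpha, copied from Table 4.
scores = np.array([
    [0.627, 1.000, 0.965],
    [0.639, 0.976, 0.988],
    [0.231, 0.325, 0.907],
    [0.527, 1.000, 0.951],
    [0.536, 0.980, 0.888],
    [0.527, 1.000, 0.937],
])

def zscore_sums(table: np.ndarray, ddof: int = 1) -> np.ndarray:
    """Sum over the columns of the per-column z-scores, one value per row (distance)."""
    mu = table.mean(axis=0)
    sigma = table.std(axis=0, ddof=ddof)
    return ((table - mu) / sigma).sum(axis=1)

sums = zscore_sums(scores)
ranking = [names[i] for i in np.argsort(-sums)]   # best first
for name, total in zip(names, sums):
    print(f"{name:13s} {total:+.3f}")
```

Since each column of z-scores sums to zero, the six totals also sum to zero, which is a quick consistency check on the table.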
The conclusion of this analysis is that the best performing distances are the descriptor distance and DSSIM. Manhattan, Pearson, and approximated information distance perform well in some categories but not so well in other categories. For this dataset and value of $k$, the Euclidean distance had the weakest overall performance, with the lowest correlation and silhouette cluster accuracy and the second-lowest relative overlap in Table 4, which confirms the visual assessment of the MDS plots obtained by using the Euclidean distance, as seen in Figure 2 and Figure 3.
It is worth noting that the two distances which perform best (DSSIM and descriptor) treat FCGR matrices as two-dimensional maps in which the local arrangement of the cells (matrix entries) influences the computed distance, whereas the other distances treat the FCGR matrices as linear vectors. This suggests that the organization of the $k$-mer tallies (in this paper $k = 9$) of a DNA sequence as an FCGR matrix, rather than a simple vector, reveals structural properties of the DNA sequence that could be utilized in order to identify and classify genomic DNA sequences.
Discussion and Conclusions
In this study we test the hypothesis that CGR-based genomic signatures of genomic DNA sequences are indeed species- and genome-specific. With this goal in mind we analyze over five hundred 150 kbp genomic DNA sequences originating from organisms representing each of the kingdoms of life. Our quantitative comparison of six different distances suggests that several other distances outperform the Euclidean distance, which has been until now almost exclusively used in such studies. Our preliminary results show that two of these distances, DSSIM and descriptor distance (introduced here), when applied to CGR-based genomic signatures, have indeed the ability to differentiate between DNA sequences coming from different species. This indicates that the $k$-mer sequence composition (where $k = 1, 2, ..., 9$) of genomic sequences contains taxonomic information which could potentially aid in the identification, comparison and classification of species based on molecular evidence. The two-dimensional and three-dimensional Molecular Distance Maps we obtain, which visualize the simultaneous intragenomic and intergenomic interrelationships among the sequences in our dataset, show this method’s potential.
Further analysis is needed to explore this method’s potential for the analysis of closely related species. As a preliminary experiment, we applied it to H. sapiens chromosome 21 (NC_000021.8), which yields 234 fragments, and P. troglodytes chromosome Y (NC_006492.3), which yields 168 sequences, also 150 kbp long.
Distance       Dα      Aα      Oα      z-score sum   Rank
DSSIM          0.167   0.915   0.136    3.453        1st
Descriptors    0.015   0.500   0.101   -2.593        5th
Euclidean      0.037   0.58    0.069   -2.899        6th
Manhattan      0.112   0.863   0.108    1.27         3rd
Pearson        0.142   0.714   0.119    1.339        2nd
Approx. Inf.   0.075   0.933   0.062   -0.569        4th

Table 5 Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 402 DNA sequences from H. sapiens, chromosome 21 and P. troglodytes, chromosome Y. Dα is the correlation to an idealized cluster distance, Aα is the silhouette cluster accuracy, and Oα is the relative overlap.
The Molecular Distance Maps in Figure 5 and Figure 6, of 402 DNA sequences, suggest that several of the distances are able to differentiate even between DNA sequences from closely related organisms. As seen in Table 5, the Euclidean distance was again outperformed by other distances, when assessed with the quality measures we described. In this case study, we note a change in the distance rankings: DSSIM, which ranked second previously, now ranks first, while the descriptor distance, which ranked first previously, now ranks second last. This may be an indication that the descriptor distance, which was designed to detect pattern differences, may only perform well for analyses of sequences of distantly related organisms, while DSSIM, which is sensitive to small differences in similar images, may be the preferred option for fine-grained analyses at the genus, family and species level.
Figure 5 Two-dimensional Molecular Distance Maps of 150 kbp genomic DNA sequences from H. sapiens (blue) and P. troglodytes (red), using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance.
Further large-scale computational experiments have to be carried out to confirm these preliminary results and establish their validity. Such experiments could provide additional insights regarding the choice of optimal distance for structural genome comparison in different settings.
Figure 6 Three-dimensional Molecular Distance Maps of 150 kbp genomic DNA sequences from H. sapiens (blue) and P. troglodytes (red), using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RK: data acquisition; data analysis, methodology and result interpretation; manuscript draft; manuscript editing; software design. LK: data analysis, methodology and result interpretation; manuscript draft; manuscript editing. S.Kon: data analysis, methodology and result interpretation; manuscript editing. S.Kop: data analysis, methodology and result interpretation; manuscript editing. All authors read and approved the final manuscript.
Acknowledgements
We thank Yuri Boykov, Lena Gorelick and Olga Veksler for discussions on the definition for the descriptor distance, and Stephen Solis for comments on earlier drafts of the manuscript.
Author details
1 Department of Computer Science, University of Western Ontario, London, ON, Canada. 2 Department of Mathematics and Computing Science, Saint Mary’s University, Halifax, NS, Canada.
References
1. Hebert, P.D., Cywinska, A., Ball, S.L., et al.: Biological identifications
through DNA barcodes. Proceedings of the Royal Society of London.
Series B: Biological Sciences 270(1512), 313–321 (2003)
2. Sirovich, L., Stoeckle, M.Y., Zhang, Y.: Structural analysis of
biodiversity. PLoS One 5(2), 9266 (2010)
3. Jeffrey, H.: Chaos Game Representation of gene structure. Nucleic
Acids Research 18(8), 2163–2170 (1990)
4. Kari, L., Hill, K.A., Sayem, A.S., Karamichalis, R., Bryans, N., Davis,
K., Dattani, N.S.: Mapping the Space of Genomic Signatures
(Submitted). ArXiv e-prints
5. Edwards, S., Fertil, B., Giron, A., Deschavanne, P.: A genomic schism
in birds revealed by phylogenetic analysis of DNA strings. Systematic
Biology 51(4), 599–613 (2002)
6. Pandit, A., Vadlamudi, J., Sinha, S.: Analysis of dinucleotide
signatures in HIV-1 subtype B genomes. Journal of genetics 92(3),
403–412 (2013)
7. Deschavanne, P., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic
signature: characterization and classification of species assessed by
Chaos Game Representation of sequences. Molecular Biology and
Evolution 16(10), 1391–1399 (1999)
8. Gentles, A.J., Karlin, S.: Genome-scale compositional comparisons in
eukaryotes. Genome Research 11(4), 540–546 (2001).
9. Deschavanne, P., Giron, A., Vilain, J., Dufraigne, C., Fertil, B.:
Genomic signature is preserved in short DNA fragments. In:
Proceedings of IEEE International Symposium on Bio-Informatics and
Biomedical Engineering, pp. 161–167 (2000)
10. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality
assessment: From error visibility to structural similarity. IEEE
Transactions on Image Processing 13(4), 600–612 (2004).
11. Iversen, G.R., Gergen, M., Gergen, M.M.: Statistics: The Conceptual
Approach. Springer, Berlin Heidelberg (1997)
12. Krause, E.F.: Taxicab Geometry: An Adventure in non-Euclidean
Geometry. Courier Dover Publications, Mineola, New York (2012)
13. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.: The similarity metric.
IEEE Transactions on Information Theory 50(12), 3250–3264 (2004)
14. Jeffrey, H.: Chaos game visualization of sequences. Comput. Graphics
16(1), 25–33 (1992)
15. Pride, D., Meinersmann, R., Wassenaar, T., Blaser, M.: Evolutionary
implications of microbial genome tetranucleotide frequency biases.
Genome Research 13(2), 145–158 (2003)
16. Deschavanne, P., DuBow, M., Regeard, C.: The use of genomic
signature distance between bacteriophages and their hosts displays
evolutionary relationships and phage growth cycle determination.
Virology Journal 7(1), 163 (2010)
17. Pandit, A., Sinha, S.: Using genomic signatures for HIV-1 subtyping.
BMC Bioinformatics 11(Suppl 1), 26 (2010)
18. Deza, M.M., Deza, E.: Encyclopedia of Distances. Springer, Berlin
Heidelberg (2009)
19. Kruskal, J.: Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964)
20. Supplemental Material.
21. Karamichalis, R.: Molecular Distance Map Interactive Webtool (2014).
22. Pang-Ning, T., Steinbach, M., Kumar, V., et al.: Introduction to data
mining. In: Library of Congress (2006)
23. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of
selected criterion functions for document clustering. Machine Learning
55(3), 311–331 (2004)
24. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. Journal of Computational and Applied
Mathematics 20(0), 53–65 (1987). doi:10.1016/0377-0427(87)90125-7
... Here, these DNA patterns are what we call the organismal signature, which is on the basis of word frequency-free methods. The potential of the pairwise distance between organismal signatures was rapidly recognized and started to be largely applied in the literature [37][38][39]. Furthermore, in a recent publication the mapping of k-word distribution into a single value has been explored as a measure of organismal complexity [40]. ...
... The second step organizes k-word frequencies into an array, where each entry corresponds to the number of times each particular word of length k appears in the given sequence. Finally, the third step computes a metric to quantify the distance between two given word frequencies [39]. Thus, similarity is related to a distance metric, where two identical sequences would correspond to a distance length of zero. ...
... Taxonomical classification [63] and phylogenetic analyses [64][65][66][67] are some problems addressed in this cluster. Other studies include the identification of intra-genomic and inter-genomic variations [37][38][39][68][69][70][71][72][73][74], codon usage biases in bacteria [75], and the classification of novel sequences obtained from metagenomic data [76][77][78][79]. It also collects studies analyzing host-parasite relationships [80][81][82][83][84][85][86][87] and evolutionary origins, such as in the case of SARS-CoV-2 and HIV. ...
Full-text available
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
... Here we present an augmented machine learning model that combines the GT and PN approaches (Babayan, et al., 2018) with Machine Learning with Digital Signal Processing (MLDSP)-based structural patterns (M-SP) of viral sequences (Randhawa, et al., 2019;Randhawa, et al., 2020a). The MLDSP approach applies a one-dimensional, consistent numeric recoding, or the two-dimensional Chaos Game Representation (CGR) (Almeida, et al., 2001;Jeffrey, 1990;Karamichalis, et al., 2015), of the sequences in a signal processing approach and was used to classify early SARS-CoV-2 sequences within a large viral genomic sequence dataset (Randhawa, et al., 2020b). This more structured approach to an alignmentfree method than the GT approach might allow better detection of nonhomologous viruses with the same reservoir host. ...
... The Machine Learning with Digital Signal Processing (MLDSP) strategy (Randhawa, et al., 2019;Randhawa, et al., 2020a), which was developed to make alignment-free comparisons among sequences, was adopted. Sequences were recoded based on a one-dimensional purine/pyrimidine code (a two-dimensional Chaos Game Representation Method with a ktuple size of 7 (Karamichalis, et al., 2015;Randhawa, et al., 2020b) was also examined, see Supplementary information), Fourier transformed (FT) and the Pearson correlation coefficients among the FT sequences were obtained (Randhawa, et al., 2019;Randhawa, et al., 2020a). A distance matrix was computed from the input sequences, and the matrix of a virus was associated with the host categories to gauge the weight of each host group (create a weight vector) for the target virus. ...
Motivation The emergence and subsequent pandemic of the SARS-CoV-2 virus raised urgent questions about its origin and, particularly, its reservoir host. These types of questions are long-standing problems in the management of emerging infectious diseases and are linked to virus discovery programs and the prediction of viruses that are likely to become zoonotic. Conventional means to identify reservoir hosts have relied on surveillance, experimental studies and phylogenetics. More recently, machine learning approaches have been applied to generate tools to swiftly predict reservoir hosts from sequence data. Results Here, we extend a recent work that combined sequence alignment and a mixture of alignment-free approaches using a gradient boosting machines (GBMs) machine learning model, which integrates genomic traits (GT) and phylogenetic neighbourhood (PN) signatures to predict reservoir hosts. We add a more uniform approach by applying Machine Learning with Digital Signal Processing (MLDSP)-based structural patterns (M-SP). The extended model was applied to an existing virus/reservoir host dataset and to the SARS-CoV-2 and related viruses and generated an improvement in prediction accuracy. Availability and implementation The source code used in this work is freely available at Supplementary information Supplementary data are available at Bioinformatics online.
... In turn, a representation of each viral sequence is created in Euclidean space, with distances between sequences that are equivalent to their distances given in the matrix. Therefore, similar viral sequences should be relatively close in this representation, which has been previously shown for DNA sequences [26] [27]. ...
... Besides this visual approach, other methods exist to calculate the distance between two FCGRs, e.g., the Euclidean distance. Karamichalis et al. [38] compared six different distance metrics with respect to the construction of phylogenetic trees based on FCGRs, namely (i) the Structural Similarity Index (SSI), (ii) the descriptor distance, (iii) the Euclidean distance, (iv) the Manhattan distance, (v) the Pearson distance, and (vi) the approximated information distance. They demonstrated that the SSI and the descriptor distance show a better performance compared to the other metrics. ...
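Three of the simpler distances named in this excerpt can be written down directly for two equal-sized FCGR matrices. The sketch below is illustrative only: it assumes the matrices are given as nested lists of counts, and it takes the Pearson distance to be 1 - r, which may differ from the normalization used by Karamichalis et al. The SSI-based, descriptor, and approximated information distances are not shown.

```python
import math


def flatten(matrix):
    """Turn a nested-list matrix into a flat list, row by row."""
    return [value for row in matrix for value in row]


def euclidean(a, b):
    """Euclidean distance between two equal-sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(flatten(a), flatten(b))))


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-sized matrices."""
    return sum(abs(x - y) for x, y in zip(flatten(a), flatten(b)))


def pearson_distance(a, b):
    """1 - r, where r is the Pearson correlation of the flattened matrices.

    Assumes neither matrix is constant (otherwise r is undefined).
    """
    u, v = flatten(a), flatten(b)
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return 1.0 - cov / (su * sv)
```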
Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool regarding alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
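The construction summarized in this abstract can be illustrated with a short sketch. It assumes the commonly used corner layout (A bottom-left, C top-left, G top-right, T bottom-right) and a starting point at the centre of the unit square; other layouts appear in the literature. Skipping ambiguous symbols is a simplification made for this example, and the function names are hypothetical.

```python
# Assumed corner assignment on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR point reached after each valid nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # simplification: ignore ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points in a 2^k x 2^k grid; each cell corresponds to one k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    # Only points reached after at least k steps encode a full k-mer.
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)  # guard against rounding up to 1.0
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Because the grid size depends only on k, sequences of different lengths map to matrices of the same shape, which is what makes the FCGR usable as a fixed-size input for distance computations or machine learning.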
... This creates images with certain fractal properties that capture frequencies of k-mers for modest values of k. A number of different processing methods have then been deployed in order to classify these images; references [1,4,10,11,21,22,23] show several of these and also give some indication of their variety. ...
We apply a recent alignment-free method of genomic comparison to sequences of SARS-CoV-2 as well as other sequences from the Coronaviridae family. We show that this method, while approximate, can enable fast and accurate classification. We illustrate how it might be applied in the search for the possible intermediary host or hosts. We also use this methodology at a finer level, to create a phylogenetic tree from SARS-CoV-2 sequences taken over a period of time and from geographically distinct locations. This can help to determine routes by which the disease has traveled and also help to chart the course of mutations both in time and geography, thus providing useful information in the realms of epidemiology and public health policy. As an important application, we analyze geographical locations in which certain more infectious variants have appeared. By comparing the fraction of variant appearances against the date of collection, we can estimate the rate at which such variants are spreading.
... Analyzing thousands of complete genomes using alignment-based methods is too expensive. To overcome the difficulties of alignment-based methods, alignment-free methods have been introduced [8,9]. Recent studies revealed that machine learning techniques have been applied successfully for virus classification [10,11]. ...
Accurate identification of COVID-19 is now a critical task since it has seriously damaged daily life, public health, and the economy. It is essential to identify infected people to prevent the further spread of the pandemic and to treat infected patients quickly. Machine learning techniques have a significant role in predicting COVID-19. In this study, we performed binary classification (COVID-19 vs. other types of coronavirus) by extracting features from genome sequences. Support vector machines, naive Bayes, K-nearest neighbor, and random forest methods were used for classification. We used viral gene sequences from the 2019 Novel Coronavirus Resource Database. Experimental results presented show that a decision tree method achieved 93% accuracy.
Comparison between different biological sequences is a key step in bioinformatics when analyzing similarities of sequences and phylogenetic relationships. A method of graphically representing biological sequences known as chaos game representation (CGR) has found many applications in bioinformatics. The key issue in the application of CGR is to extract as many useful features as possible from it. Initially, CGR was applied to DNA sequences, but in our case, we apply it to protein sequences. For this report, CGR is used to classify several hundred protein sequences into their respective viral groups through feature extraction using the Python programming language. These features include CGR centroid, amino acid frequency, compounded frequency, Shannon entropy, and Kullback-Leibler discrimination information. The results on datasets indicate that our method is accurate and efficient for classifying proteins and inferring the phylogeny of viruses.
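Two of the features named in this abstract, Shannon entropy and Kullback-Leibler discrimination information, have standard definitions over amino acid frequencies. The sketch below applies those textbook definitions to plain residue frequencies. It is not the report's feature extraction code, and the 20-letter alphabet constant and function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed alphabet: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequencies(seq, alphabet=AMINO_ACIDS):
    """Relative frequency of each alphabet symbol; other symbols are ignored.

    Assumes the sequence contains at least one symbol from the alphabet.
    """
    counts = Counter(seq.upper())
    total = sum(counts[s] for s in alphabet)
    return {s: counts[s] / total for s in alphabet}


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero-frequency symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Returns infinity when q is zero for a symbol that p uses.
    """
    total = 0.0
    for symbol, p_s in p.items():
        if p_s == 0:
            continue
        if q[symbol] == 0:
            return math.inf
        total += p_s * math.log2(p_s / q[symbol])
    return total
```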
Chaos game representation (CGR) is a useful one-to-one visualization tool to represent nucleotide sequences, in which both local and global patterns of nucleotides can be graphically described. Deep learning networks have been shown to achieve outstanding performance on feature extraction and image recognition. In this paper, we use convolutional spiking neural networks (SNNs) with the reward-modulated spike-timing-dependent plasticity (R-STDP) learning rule to learn from the frequency matrix chaos game representation (FCGR) images of essential and non-essential genes of 32 bacteria in the DEG database and make intra-organism and cross-organism essential gene predictions. For intra-organism predictions, our highest accuracy (ACC) score is 0.90 and the average ACC is 0.78, and for cross-organism predictions, our highest ACC is 0.79 and the average ACC is 0.68. Compared with the results of traditional machine learning classifiers trained on FCGR images or on numerical fractal features pre-calculated from CGR representations, our intra-organism prediction results are much better for all the bacteria or most bacteria, respectively, indicating that our spiking neural networks can make better essential gene predictions by extracting the gene features directly from the FCGR images of essential and non-essential genes. Compared with essential gene prediction methods using gene sequence features and topological features, our cross-organism prediction results can achieve performance close to or even better than such methods, while requiring far fewer input features.
The chaos game representation (CGR) is an interesting method to visualize one-dimensional sequences. In this paper, we show how to construct a chaos game representation. The applications mentioned here are biological, in which CGR was able to uncover patterns in DNA or proteins that were previously unknown. We also show how CGR might be introduced in the classroom, either in a modelling course or in a dynamical systems course. Some sequences that are tested are taken from the Online Encyclopedia of Integer Sequences, and others are taken from sequences that arose mainly from a course in experimental mathematics.
In the emerging field of environmental genomics, direct cloning and sequencing of genomic fragments from complex microbial communities has proven to be a valuable source of new enzymes, expanding the knowledge of basic biological processes. The central problem of this so-called metagenome approach is that the cloned fragments often lack suitable phylogenetic marker genes, rendering the identification of clones that are likely to originate from the same genome difficult or impossible. In such cases, the analysis of intrinsic DNA signatures like tetranucleotide frequencies can provide valuable hints on fragment affiliation. With this application in mind, the TETRA web-service and the TETRA stand-alone program have been developed, both of which automate the task of comparative tetranucleotide frequency analysis. Availability: TETRA provides a statistical analysis of tetranucleotide usage patterns in genomic fragments, either via a web-service or a stand-alone program. With respect to discriminatory power, such an analysis outperforms the assignment of genomic fragments based on the (G+C)-content, which is a widely used sequence-based measure for assessing fragment relatedness. While the web-service is restricted to the calculation of correlation coefficients between tetranucleotide usage patterns of submitted DNA sequences, the stand-alone program generates a much more detailed output, comprising all raw data and graphical plots. The stand-alone program is controlled via a graphical user interface and can batch-process a multitude of sequences. Furthermore, it comes with pre-computed tetranucleotide usage patterns for 166 prokaryote chromosomes, providing a useful reference dataset and source for data-mining. Up to now, the analysis of skewed oligonucleotide distributions within DNA sequences has not been a commonly used tool within metagenomics. With the TETRA web-service and stand-alone program, the method is now accessible in an easy-to-use manner for a broad audience. This will hopefully facilitate the interrelation of genomic fragments from metagenome libraries, ultimately leading to new insights into the genetic potentials of yet uncultured microorganisms.
Although much biological research depends upon species diagnoses, taxonomic expertise is collapsing. We are convinced that the sole prospect for a sustainable identification capability lies in the construction of systems that employ DNA sequences as taxon 'barcodes'. We establish that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals. First, we demonstrate that COI profiles, derived from the low-density sampling of higher taxonomic categories, ordinarily assign newly analysed taxa to the appropriate phylum or order. Second, we demonstrate that species-level assignments can be obtained by creating comprehensive COI profiles. A model COI profile, based upon the analysis of a single individual from each of 200 closely allied species of lepidopterans, was 100% successful in correctly identifying subsequent specimens. When fully developed, a COI identification system will provide a reliable, cost-effective and accessible solution to the current problem of species identification. Its assembly will also generate important new insights into the diversification of life and the rules of molecular evolution.
Background: Human Papillomavirus (HPV) genotyping is an important approach to fight cervical cancer because it provides relevant information for risk stratification in diagnosis and a better understanding of the relationship of HPV with carcinogenesis. This paper proposed two new feature extraction techniques, i.e. ChaosCentroid and ChaosFrequency, for predicting HPV genotypes associated with the cancer. A further 12 diverse HPV genotypes, i.e. types 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, and 66, were studied in this paper. In our proposed techniques, a partitioned Chaos Game Representation (CGR) is deployed to represent HPV genomes. ChaosCentroid captures the structure of sequences in terms of the centroid of each sub-region, with Euclidean distances among the centroids and the center of the CGR as the relations of all sub-regions. ChaosFrequency extracts the statistical distribution of mono-, di-, or higher order nucleotides along HPV genomes and forms a matrix of the frequency of dots in each sub-region. For performance evaluation, four different types of classifiers, i.e. Multi-layer Perceptron, Radial Basis Function, K-Nearest Neighbor, and Fuzzy K-Nearest Neighbor techniques, were deployed, and our best results from each classifier were compared with the NCBI genotyping tool. Results: The experimental results obtained by the four different classifiers follow the same trend. ChaosCentroid gave considerably higher performance than ChaosFrequency when the input length is one, but it was moderately lower than ChaosFrequency when the input length is two. Both proposed techniques yielded almost or exactly the best performance when the input length is more than three. However, there is no significant difference between our proposed techniques and the comparative alignment method. Conclusions: Our proposed alignment-free and scale-independent method can successfully transform HPV genomes with 7,000-10,000 base pairs into features of 1-11 dimensions. This signifies that ChaosCentroid and ChaosFrequency can serve as effective feature extraction techniques for predicting HPV genotypes.
We propose a computational process to measure and simultaneously visualize the interrelationships among any number of DNA sequences which allows, for example, the examination of hundreds or thousands of complete mtDNA genomes. The process starts by computing an image distance between graphical representations of DNA sequences' composition and proceeds to visualize the inferred interrelationships as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial distance between any two points reflects the degree of structural similarity between the corresponding sequences. This is a general-purpose method that does not require DNA sequence homology and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same length or of different lengths. The graphical representation of DNA sequences utilized in this process is the Chaos Game Representation (CGR) of DNA sequences, which has been shown to be genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform taxonomic clarifications, species identification, placement of species in existing taxonomic categories, as well as studies of evolutionary history. The image distance employed in this process, the Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to a given k (herein k = 9) in the given DNA sequences. We computed such distances for more than 5 million pairs of complete mitochondrial sequences, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various taxonomic subsets of interest: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia only, and order Primates.
Dinucleotide usage is known to vary in the genomes of organisms. The dinucleotide usage profiles, or genome signatures, are similar for sequence samples taken from the same genome, but are different for taxonomically distant species. This concept of genome signatures has been used to study several organisms, including viruses, to elucidate the signatures of evolutionary processes at the genome level. Genome signatures assume greater importance in the case of host-pathogen interactions, where molecular interactions between the two species take place continuously and can influence their genomic composition. In this study, whole genome sequences of HIV-1 subtype B, a retrovirus that caused the global AIDS pandemic, have been analysed for variation in the genome signatures of the virus from 1983 to 2007. We show statistically significant temporal variations in some dinucleotide patterns, highlighting the selective evolution of the dinucleotide profiles of HIV-1 subtype B, possibly a consequence of host-specific selection.
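A dinucleotide usage profile of the kind described here is often computed as an odds ratio: the observed frequency of each dinucleotide divided by the product of its two mononucleotide frequencies. The sketch below assumes that definition and a single-strand count without reverse-complement symmetrization, which may not match the procedure used in the study. The function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def dinucleotide_signature(seq):
    """Odds ratio f(xy) / (f(x) * f(y)) for all 16 dinucleotides.

    Assumes the sequence contains at least two adjacent valid bases.
    Dinucleotides whose expected frequency is zero are reported as 0.0.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    signature = {}
    for x in BASES:
        for y in BASES:
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            observed = di[x + y] / n_di
            signature[x + y] = observed / expected if expected > 0 else 0.0
    return signature


def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two dinucleotide signatures."""
    sig_a = dinucleotide_signature(seq_a)
    sig_b = dinucleotide_signature(seq_b)
    return sum(abs(sig_a[key] - sig_b[key]) for key in sig_a) / len(sig_a)
```

Under this definition, samples from the same genome would be expected to give a small `signature_distance`, while samples from taxonomically distant species would give a larger one.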
Among alignment-free methods, Iterated Maps (IMs) are at a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing nextgen sequencing results.