All content in this area was uploaded by Stavros Konstantinidis on Mar 30, 2015
Content may be subject to copyright.
Available via license: CC BY 4.0
Karamichalis et al.
RESEARCH
An investigation into inter- and intragenomic
variations of graphic genomic signatures
Rallis Karamichalis1, Lila Kari1*, Stavros Konstantinidis2 and Steffen Kopecki1,2
Abstract
Background: Motivated by the general need to identify and classify species based on molecular evidence,
genome comparisons have been proposed that are based on measuring Euclidean distances between Chaos
Game Representation (CGR) patterns of genomic DNA sequences.
Results: We provide, on an extensive dataset and using several different distances, confirmation of the
hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA
sequences originating from genomes of different species. This finding lends support to the theory that CGRs of
genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over
five hundred different 150,000 bp genomic sequences originating from the genomes of six organisms, each
belonging to one of the kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi;
chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli
(Bacteria - full genome), and P. furiosus (Archaea - full genome). We also provide preliminary evidence of this
method’s applicability to closely related species by comparing H. sapiens (chromosome 21) sequences and over
one hundred and fifty genomic sequences, also 150,000 bp long, from P. troglodytes (Animalia; chromosome
Y), for a total length of more than 101 million basepairs analyzed. We compute pairwise distances between
CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps that
visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display
their interrelationships.
Conclusion: Our analysis confirms that CGR patterns of DNA sequences from the same genome are in general
quantitatively similar, while being different for DNA sequences from genomes of different species. Our analysis
of the performance of the assessed distances uses three different quality measures and suggests that several
distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In
particular we show that, for this dataset, DSSIM (Structural Dissimilarity Index) and the descriptor distance
(introduced here) are best able to classify genomic sequences.
Keywords: comparative genomics; genomic signature; species classification
Introduction
Alongside DNA barcoding [1] and Klee diagrams
[2], Chaos Game Representation (CGR) patterns of
genomic segments have been proposed as another
method for the classification and identification of ge-
nomic sequences [3–7]. The concept of genomic signa-
ture was first introduced in [8], as being any specific
quantitative characteristic of a DNA genomic sequence
that is pervasive along the genome of the same organ-
ism, while being dissimilar for DNA sequences originating
from different organisms. Initial studies [3,9]
*Correspondence: lila.kari@uwo.ca
1Department of Computer Science, University of Western Ontario,
London, ON, Canada
Full list of author information is available at the end of the article
suggested that short fragments of genomic sequences
retain most of the characteristics of the species they
come from, thus implying that genomic signatures ex-
ist. Moreover, the Chaos Game Representation (CGR)
of a DNA sequence, a graphic representation of its se-
quence composition, was proposed in [3] as having both
the pervasiveness and differentiability properties nec-
essary for it to qualify as a genomic signature. This
hypothesis was quantitatively tested and largely con-
firmed in [4] for 3,176 mitochondrial DNA (mtDNA)
sequences, and Molecular Distance Maps were proposed
therein as visualizations of species relationships
based on measuring the distances between the CGR-
images of their mtDNA genomes. Note that CGR pat-
arXiv:1503.00162v1 [q-bio.GN] 28 Feb 2015
terns of mtDNA sequences can be different from those
of DNA sequences from the major genome of the same
organism, and that large scale quantitative analyses of
the hypothesis that CGR can play the role of a ge-
nomic signature for genomic sequences have not, to
our knowledge, been performed. The objective of this
study is to confirm that CGR can play the role of ge-
nomic signature for genomic DNA sequences, as well
as to assess various distances that can be used to com-
pare CGRs of genomic sequences.
We analyze 508 fragments, 150 kbp (kilo base pairs)
long, taken from complete genomic DNA sequences
of six species, each representing a different kingdom:
chromosome 21 of Homo sapiens, chromosome 4 of
Saccharomyces cerevisiae, chromosome 1 of Arabidop-
sis thaliana, chromosome 14 of Plasmodium falci-
parum, the genome of Escherichia coli, and the genome
of Pyrococcus furiosus, for a total length of 76,200,000
bp analyzed. We analyze the intergenomic and intrage-
nomic variation of CGR genomic signatures of these se-
quences by using six different distances for image com-
parison: Structural Dissimilarity Index (DSSIM) [10],
Euclidean distance, Pearson correlation distance [11],
Manhattan distance [12], approximated information
distance [13], and a distance we propose here, called
descriptor distance. We visualize the results by com-
puting the Molecular Distance Maps of all DNA se-
quences in the database, for each of the six distances.
The resulting Molecular Distance Maps show a good
clustering of the DNA sequences, with those origi-
nating from the same genome being largely grouped
together, and separated from sequences belonging to
genomes of different organisms. We observe that, in
some of the cases where the clustering was suboptimal,
the computation of three-dimensional Molecular Dis-
tance Maps resolves what appeared to be cluster over-
laps in the two-dimensional Molecular Distance Maps.
Lastly, using the “ground-truth” that sequences from
the same genomes should have similar structural char-
acteristics and thus be grouped together, while those
from genomes of different organisms should be sepa-
rated, we assess the six distances by combining three
different quality measures: correlation to an idealized
cluster distance, silhouette accuracy, and histogram
overlap. We conclude that DSSIM and the descriptor
distance perform best according to these measures. We
also provide preliminary evidence of this method’s ap-
plicability to classifying genomic DNA sequences of
closely related species by comparing the H. sapiens
(chromosome 21) sequences with 168 genomic DNA
sequences, 150 kbp long, from Pan troglodytes (chimp,
chromosome Y), for an additional length of 25,200,000
bp analyzed. Further research may lead to improve-
ments of these distances for optimal genomic DNA se-
quence identification and classification results.
Note that other alignment-free methods have been
used for phylogenetic analysis of DNA sequences. The
initial reports on CGRs of genomic sequences [3,14]
contained mostly qualitative assessments of CGR pat-
terns of whole genes. In [7], several datasets of up to
36 genomic DNA sequences were analyzed, and in [9]
sequences of various lengths were analyzed by
computing Euclidean distances between frequencies of
their k-mers, for k = 1, ..., 8. Subsequently, [5] computed
the Euclidean distance between frequencies of
k-mers (k≤5) for the analysis of 125 GenBank DNA
sequences from 20 bird species and the American al-
ligator. In [15], 27 microbial genomes were analyzed
to find implications of 4-mer frequencies (k= 4) on
their evolutionary relationships. In [13], 20 mammalian
complete mtDNA sequences were analyzed using the
“similarity metric”, for k= 7. Another study, [16], an-
alyzed 459 bacteriophage genomes and compared them
with their host genomes to infer host-phage relation-
ships, by computing Euclidean distances between fre-
quencies of k-mers for k= 4. In [17], 75 complete HIV
genome sequences were compared using the Euclidean
distance between frequencies of 6-mers (k= 6), in or-
der to group them in subtypes. In [4] a dataset of 3,176
complete mtDNA sequences was analyzed, and several
Molecular Distance Maps were obtained using DSSIM
and a value of k= 9.
The main contributions of this paper are:
• We tested and confirmed, for an extensive dataset
with a total length of 101,400,000 bp, the hypothesis
that CGR images of genomic DNA sequences
can play the role of a (graphic) genomic signature,
meaning that they have a desirable genome-
and species-specificity. The dataset comprised
150 kbp long sequences taken from genomes of
organisms from each of the six kingdoms of life,
augmented by a set of same-length genomic sequences
from P. troglodytes as a test case of this
method’s applicability to closely related species.
• We assessed the performance of six different distances
in this context, and this analysis included
both same-genome and different-genome DNA
fragment pairs. For several of these distances, the
intragenomic values were overall smaller than the
intergenomic values, suggesting that this method
could separate DNA genomic fragments belonging
to different genomes, based on their CGRs.
• We showed that several distances outperform the
Euclidean distance, which has so far been almost
exclusively used for such studies. In particular,
we determined that the DSSIM distance
and the descriptor distance (introduced here), both of
which essentially compare the k-mer composition
of DNA sequences (herein k = 9), were best able
to differentiate sequences originating from different
genomes in this dataset.
• This study represents, to the best of our knowledge,
the largest combined dataset size and value
of k for this type of analysis.
• Based on preliminary data, we suggest the use
of three-dimensional Molecular Distance Maps for
improved visualization of the simultaneous interrelationships
among similar or very distant DNA
sequences.
Methods
In this section we first describe the dataset used for our
analysis, then present an overview of the three main
steps of the method, and conclude with a description
of the six distances that we considered.
Dataset
The dataset we used includes complete genomic se-
quences from six organisms, each representing one of
the six kingdoms of life, see Table 1. For additional
information about the dataset see Appendix A.
     Organism                               NCBI Acc. Nr.
1    H. sapiens, chrom. 21 (Animalia)       NC_000021.8
2    E. coli (Bacteria)                     NC_000913.3
3    S. cerevisiae, chrom. 4 (Fungi)        NC_001136.10
4    A. thaliana, chrom. 1 (Plantae)        NC_003070.9
5    P. falciparum, chrom. 14 (Protista)    NC_004317.2
6    P. furiosus (Archaea)                  NC_018092.1
Table 1 NCBI accession numbers of the dataset of the
complete genomic DNA sequences considered, in increasing
order of their NCBI accession number.
Organism         Length (bp)    # Letters “N”    # Fragments
H. sapiens       48,129,895     13,023,253       234
E. coli           4,641,652              0        30
S. cerevisiae     1,531,933              0        10
A. thaliana      30,427,671        164,359       201
P. falciparum     3,291,871             37        21
P. furiosus       1,909,827             10        12
Table 2 Organism considered, total length of genomic
sequence, number of ignored letters “N”, and number of DNA
fragments (sequences) obtained by splitting each complete
genomic DNA sequence into consecutive, non-overlapping,
equal length (150 kbp) contiguous fragments.
In order to have a relatively comparable number of
DNA sequences for each organism, we chose the longest
chromosomes for all organisms except H. sapiens, for
which the shortest chromosome was chosen.
The DNA sequences in the NCBI database are rep-
resented as strings of letters “A”, “C”, “G”, “T”, and
“N” which represent the four nucleobases Adenine,
Cytosine, Guanine, Thymine, and “unidentified Nu-
cleotide”, respectively. For our analysis we ignored all
letters “N”. In S. cerevisiae and E. coli there were no
ignored letters, and in P. falciparum and P. furiosus
the number of ignored letters is of the order of 0.001%
of the length of the sequence. In H. sapiens this number
is 27%, and in A. thaliana it is 0.54%. In H. sapiens,
in particular, 96.4% of these ignored letters exist in
centromeric and telomeric regions of the chromosome.
The resulting genomic DNA sequences were di-
vided into successive, non-overlapping, contiguous
fragments, each 150 kbp long. When the last fragment
was shorter than 150 kbp, it was not included in the
analysis. This resulted in 234 fragments for H. sapiens,
30 fragments for E. coli, 10 fragments for S. cerevisiae,
201 fragments for A. thaliana, 21 fragments for P. fal-
ciparum, and 12 fragments for P. furiosus, for a total
of 508 DNA fragments, see Table 2.
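The fragmenting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Mathematica code: the function name `split_into_fragments` and its parameters are hypothetical, and the input is assumed to be a plain string that has already been read from a FASTA record.

```python
def split_into_fragments(sequence, fragment_length=150_000):
    """Drop every letter "N", then cut the remaining sequence into
    consecutive, non-overlapping fragments of equal length.
    A trailing piece shorter than fragment_length is discarded."""
    cleaned = sequence.upper().replace("N", "")
    n_full = len(cleaned) // fragment_length
    return [
        cleaned[i * fragment_length:(i + 1) * fragment_length]
        for i in range(n_full)
    ]


# Toy usage with a fragment length of 4 instead of 150,000.
fragments = split_into_fragments("ACGTNNACGTAC", fragment_length=4)
print(fragments)  # ['ACGT', 'ACGT']; the trailing 'AC' is dropped
```

Applied to the lengths in Table 2, the same arithmetic gives, for example, (48,129,895 - 13,023,253) // 150,000 = 234 fragments for H. sapiens.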
Overview
The method we used to analyze and classify the 508
sequences of the dataset has three steps: (i) gener-
ate graphical representations (images) of each DNA
sequence using Chaos Game Representation (CGR),
(ii) compute all pairwise distances between these im-
ages, and (iii) visualize the interrelationships implied
by these distances as two- or three-dimensional maps,
using Multi-Dimensional Scaling (MDS).
CGR is a method introduced by Jeffrey [3] in 1990
to visualize the structure of a DNA sequence. A CGR
associates an image to each DNA sequence as follows.
Starting from a unit square with corners labelled A, C,
G, and T, and the center of the square as the starting
point, the image is obtained by successively plotting
each nucleotide as the middle point between the cur-
rent point and the corner labelled by the nucleotide to
be plotted. If the generated square image has a size of
$2^k \times 2^k$ pixels, then every pixel represents a distinct
k-mer: a pixel is black if the k-mer it represents occurs
in the DNA sequence, otherwise it is white. CGR
images of genetic DNA sequences originating from var-
ious species show patterns such as squares, parallel
lines, rectangles, triangles, and also complex fractal
patterns (Figure 1).
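The plotting rule described above (start at the center, then repeatedly move halfway towards the corner of the next nucleotide) can be sketched as follows. The corner assignment used here (A bottom-left, C top-left, G top-right, T bottom-right, with the y axis pointing up) is an assumption chosen to agree with the FCGR layout given in this section; the name `cgr_points` is hypothetical.

```python
# Assumed corner coordinates of the unit square (y axis pointing up).
CORNERS = {
    "A": (0.0, 0.0),  # bottom-left
    "C": (0.0, 1.0),  # top-left
    "G": (1.0, 1.0),  # top-right
    "T": (1.0, 0.0),  # bottom-right
}


def cgr_points(sequence):
    """Return the list of CGR points for a string over A, C, G, T."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in sequence:
        corner_x, corner_y = CORNERS[base]
        # Midpoint between the current point and the nucleotide's corner.
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


print(cgr_points("AG"))  # [(0.25, 0.25), (0.625, 0.625)]
```

Each point ends up in the quadrant of the most recently plotted nucleotide, which is why a pixel at resolution $2^k \times 2^k$ corresponds to the last k nucleotides read.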
For step (i), a slight modification of the original CGR
was used, introduced by Deschavanne [7]: a k-th order
FCGR (frequency CGR) is a $2^k \times 2^k$ matrix that
can be constructed by dividing the CGR plot into a
$2^k \times 2^k$ grid, and defining the element $a_{ij}$ as the
number of points that are situated in the corresponding
grid square. A first and second order FCGR are shown
below, where $N_w$ is the number of occurrences of the
oligonucleotide $w$ in the sequence $s$.
\[
FCGR_1(s) = \begin{pmatrix} N_C & N_G \\ N_A & N_T \end{pmatrix},
\]
\[
FCGR_2(s) = \begin{pmatrix}
N_{CC} & N_{GC} & N_{CG} & N_{GG} \\
N_{AC} & N_{TC} & N_{AG} & N_{TG} \\
N_{CA} & N_{GA} & N_{CT} & N_{GT} \\
N_{AA} & N_{TA} & N_{AT} & N_{TT}
\end{pmatrix}.
\]
The $(k+1)$-th order $FCGR_{k+1}(s)$ can be obtained
by replacing each element $N_X$ in $FCGR_k(s)$ with four
elements
\[
\begin{pmatrix} N_{CX} & N_{GX} \\ N_{AX} & N_{TX} \end{pmatrix}
\]
where $X$ is a sequence of length $k$ over the alphabet
$\{A, C, G, T\}$.
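A minimal sketch of how a k-th order FCGR could be filled by direct k-mer counting, following the layout of the first and second order FCGR above. It assumes the input contains only the letters A, C, G, T, and it counts k-mers rather than binning CGR points, so it ignores the first k-1 plotted points whose position depends on the starting point. The names `fcgr`, `ROW_BIT` and `COL_BIT` are hypothetical.

```python
# Quadrant of a single nucleotide in the first order FCGR:
#   C = top-left, G = top-right, A = bottom-left, T = bottom-right.
ROW_BIT = {"C": 0, "G": 0, "A": 1, "T": 1}  # 0 = upper half, 1 = lower half
COL_BIT = {"C": 0, "A": 0, "G": 1, "T": 1}  # 0 = left half, 1 = right half


def fcgr(sequence, k):
    """Return the 2^k x 2^k matrix (list of rows) of k-mer counts."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # The last letter selects the coarsest quadrant (highest bit),
        # the first letter the finest subdivision (lowest bit).
        row = sum(ROW_BIT[base] << i for i, base in enumerate(kmer))
        col = sum(COL_BIT[base] << i for i, base in enumerate(kmer))
        matrix[row][col] += 1
    return matrix


print(fcgr("ACGT", 1))  # [[1, 1], [1, 1]]
```

With this indexing, the count of the 2-mer AC lands in row 1, column 0, matching the position of $N_{AC}$ in $FCGR_2(s)$.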
Figure 1 $2^9 \times 2^9$ CGR images of 150 kbp genomic DNA
sequences of (a) H. sapiens, (b) E. coli, (c) S. cerevisiae,
(d) A. thaliana, (e) P. falciparum, and (f) P. furiosus.
For step (ii), after computing the FCGR matrices for
each of the 150 kbp sequences in our dataset, the goal
was to measure “distances” between every two CGR
images. Many distances can be defined
and used for this purpose [18]. One of the goals of
this study was to identify which distance is best able
to capture the structural differences between genomic
DNA sequences and classify them based on the
species they belong to. In this paper we use six different
distances: Structural Dissimilarity Index (DSSIM),
descriptor distance (defined here), Euclidean distance,
Manhattan distance, Pearson correlation distance, and
approximated information distance.
For step (iii), after computing all possible pairwise
distances we obtained six different distance matrices.
To visualize the inter-relationships between sequences
implied by each of the distance matrices, and to thus
visually assess each of the distances, we used Multi-
Dimensional Scaling (MDS). MDS is an information
visualization technique introduced by Kruskal in [19].
Given as input a distance matrix that contains the
pairwise distances among a set of items[1], the out-
put of MDS is a spatial representation of the items on
a common Euclidean space wherein each item is rep-
resented as a point and the spatial distance between
any two points corresponds to the distance between
the items in the distance matrix: Objects with a small
pairwise distance will result in points that are close to
each other, while objects with a large pairwise distance
will become points that are far apart. For example,
in [4] MDS was used in conjunction with DSSIM and
CGR to produce Molecular Distance Maps that visu-
ally display the simultaneous interrelationships among
a set of full mitochondrial DNA sequences.
The ideal Molecular Distance Map is a placement of
$n$ items as points in an $(n-1)$-dimensional space. The
two-dimensional Molecular Distance Map is simply an
approximation, a flattening of this high-dimensional
space onto the plane, which may sometimes result in
erroneous positioning of some points. Increasing the
dimensionality of the Molecular Distance Map often
results in a more accurate representation of the real
interrelationships between sequences, as embodied in
the original distance matrix.
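To make the embedding step concrete, here is a small sketch of classical (Torgerson) MDS using NumPy. This variant is swapped in for brevity; it is not necessarily the MDS procedure of [19] or the one used in the authors' Mathematica code. The function name `classical_mds` is hypothetical.

```python
import numpy as np


def classical_mds(distances, dimensions=2):
    """Embed n items as points in `dimensions`-dimensional space so that
    Euclidean distances between points approximate the input distances."""
    D = np.asarray(distances, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to get a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    # Keep the largest eigenvalues; clip small negatives caused by
    # rounding or by distances that are not exactly Euclidean.
    order = np.argsort(eigenvalues)[::-1][:dimensions]
    scales = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scales


# Four points in the plane are recovered up to rotation and reflection.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
coords = classical_mds(D, dimensions=2)
```

Passing `dimensions=3` yields the coordinates for a three-dimensional map from the same distance matrix.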
Distances
In this section we describe and formally define each of
the six distances used in our analysis: DSSIM, descrip-
tor distance (introduced here), Euclidean, Manhattan,
Pearson, and approximated information distance.
Structural Similarity Index, SSIM, was introduced
in [10] for the purpose of assessing the degree of similarity
between two images. Given two images $X, Y$ as
$n \times n$ matrices having as elements integers ranging in
the interval $[0, L]$, SSIM computes three factors (luminance,
contrast and structure) and combines them to
obtain a similarity value. However, instead of computing
a global similarity between the two images, each
image is divided into $11 \times 11$ sliding square windows
$X^{ij}$ ($Y^{ij}$ respectively) with $i, j = 1, \dots, n-10$, which
move pixel by pixel to eventually cover the entire image,
and the SSIM similarity of any given pair of images
is computed by comparing their corresponding
windows. In addition, an $11 \times 11$ circular symmetric
Gaussian weighting function $W \in \mathbb{R}^{11 \times 11}$ with a
fixed standard deviation of 1.5, normalized to unit sum
($\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} = 1$), is used. Then, the mean $\mu_{x,i,j}$
($\mu_{y,i,j}$ for $Y$), standard deviation $\sigma_{x,i,j}$ ($\sigma_{y,i,j}$ for $Y$)
and covariance $\sigma_{xy,i,j}$ are computed, as follows:
\[
\mu_{x,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} X^{ij}_{pq}
\]
\[
\sigma_{x,i,j} = \sqrt{\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} \left( X^{ij}_{pq} - \mu_{x,i,j} \right)^2}
\]
\[
\sigma_{xy,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} \left( X^{ij}_{pq} - \mu_{x,i,j} \right) \left( Y^{ij}_{pq} - \mu_{y,i,j} \right)
\]
where $A_{pq}$ denotes the $(p, q)$ element of the matrix $A$.
Based on these values, the luminance $l(X^{ij}, Y^{ij})$, contrast
$c(X^{ij}, Y^{ij})$ and structure $s(X^{ij}, Y^{ij})$ are computed as
\[
l(X^{ij}, Y^{ij}) = \frac{2 \mu_{x,i,j} \mu_{y,i,j} + C_1}{\mu_{x,i,j}^2 + \mu_{y,i,j}^2 + C_1}
\]
\[
c(X^{ij}, Y^{ij}) = \frac{2 \sigma_{x,i,j} \sigma_{y,i,j} + C_2}{\sigma_{x,i,j}^2 + \sigma_{y,i,j}^2 + C_2}
\]
\[
s(X^{ij}, Y^{ij}) = \frac{\sigma_{xy,i,j} + C_3}{\sigma_{x,i,j} \sigma_{y,i,j} + C_3}
\]
where $C_1 = (0.01)^2$, $C_2 = (0.03)^2$, $C_3 = C_2 / 2$. Then,
these three factors are combined to get
\[
SSIM(X^{ij}, Y^{ij}) = l(X^{ij}, Y^{ij}) \, c(X^{ij}, Y^{ij}) \, s(X^{ij}, Y^{ij})
\]
and finally, the SSIM index used to evaluate the overall
image similarity is computed as
\[
SSIM(X, Y) = \frac{1}{(n-10)^2} \sum_{i=1}^{n-10} \sum_{j=1}^{n-10} SSIM(X^{ij}, Y^{ij}).
\]
[1] In this paper the items are the 150 kbp DNA sequences
analyzed.
In theory, the values for SSIM range in the interval
$[-1, 1]$, with the similarity being 1 between two identical
images, 0, for example, between a black image and
a white image, and $-1$ if the two images are negatively
correlated; that is, $SSIM(X, Y) = -1$ if and only if $X$
and $Y$ have the same luminance $\mu$ and every pixel $x_i$
of image $X$ has the inverted value of the corresponding
pixel $y_i = 2\mu - x_i$ in $Y$.
To compute the distance rather than the similarity
between two images, we calculate $DSSIM(X, Y) =
1 - SSIM(X, Y)$. Consequently, the range of DSSIM
is the interval $[0, 2]$: two identical images will result
in a DSSIM distance of 0, while two images that are
the negatives of each other would result in a DSSIM
distance of 2.
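The windowed computation above can be sketched with NumPy as follows. This is an illustrative reading of the formulas, not the authors' implementation: the names `gaussian_window` and `dssim` are hypothetical, the constants are taken verbatim from the text, and how the FCGR values are scaled before comparison is not specified here, so any scaling is left to the caller. `sliding_window_view` requires NumPy 1.20 or later, and for $512 \times 512$ inputs it creates large temporary arrays.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_window(size=11, sigma=1.5):
    """Circular-symmetric Gaussian weights W, normalized to unit sum."""
    offsets = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    window = np.outer(g, g)
    return window / window.sum()


def dssim(X, Y, C1=0.01 ** 2, C2=0.03 ** 2):
    """DSSIM(X, Y) = 1 - mean SSIM over all 11 x 11 sliding windows."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    W = gaussian_window()
    C3 = C2 / 2.0
    # Shape (n-10, n-10, 11, 11): one 11 x 11 window per position (i, j).
    wx = sliding_window_view(X, W.shape)
    wy = sliding_window_view(Y, W.shape)
    mu_x = (W * wx).sum(axis=(2, 3))
    mu_y = (W * wy).sum(axis=(2, 3))
    dx = wx - mu_x[..., None, None]
    dy = wy - mu_y[..., None, None]
    var_x = (W * dx * dx).sum(axis=(2, 3))
    var_y = (W * dy * dy).sum(axis=(2, 3))
    cov = (W * dx * dy).sum(axis=(2, 3))
    sd_x, sd_y = np.sqrt(var_x), np.sqrt(var_y)
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast = (2 * sd_x * sd_y + C2) / (var_x + var_y + C2)
    structure = (cov + C3) / (sd_x * sd_y + C3)
    return 1.0 - float((luminance * contrast * structure).mean())
```

For identical inputs all three factors equal 1 in every window, so the distance is 0 up to rounding.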
The descriptor distance between two FCGRs $X, Y \in
\mathbb{N}^{2^k \times 2^k}$ aims to compare a combination of several
different “descriptors”, that is, a combination of several
different aspects, of the two given FCGRs.
A descriptor is a vector characterized by parameters
$m$ and $r$, as well as $r$ intervals, where $m$ determines the
size ($2^m \times 2^m$) of the non-overlapping windows into which
the FCGR is divided (scale of the comparison), and the $r$
intervals represent the “granularity” of the analysis, in that
they define the intervals of numbers of k-mer occurrences
that are considered significant.
For a given $m \le k$ and $r$, and intervals $[a_0, a_1), [a_1, a_2),
\dots, [a_{r-1}, a_r)$ such that $\bigcup_{i=0}^{r-1} [a_i, a_{i+1}) = [0, \infty)$ and
$[a_i, a_{i+1}) \cap [a_j, a_{j+1}) = \emptyset$ for all $i, j$ with $i \ne j$, a descriptor
is constructed as follows.
Starting from the top-left corner, we divide each of
the two FCGR matrices $X$ and $Y$ into non-overlapping
submatrices[2] of size $2^m \times 2^m$. This procedure results
in $4^{k-m}$ submatrices $X_{ij}$ and $Y_{ij}$ with $i, j =
1, \dots, 2^{k-m}$, which will be pairwise compared.
The choice of the $r$ intervals, called “bins”, points
to the fact that, rather than considering the finest
granularity, we are interested in a coarser comparison.
This means that, instead of a computationally
expensive pairwise comparison of all possible numbers
of occurrences of k-mers, we are interested only in certain
“bins” of such numbers. For example, in our case,
we use $r = 5$ and consider only 5 different bins, that
is, only k-mers with number of occurrences: 0 (not occurring),
1 (one occurrence), at least 2 but less than 5,
at least 5 but less than 20, and 20 or more
(most frequent). Formally, we use $r = 5$ and
$[0, \infty) = [0, 1) \cup [1, 2) \cup [2, 5) \cup [5, 20) \cup [20, \infty)$ as the
5 bins.
Afterwards, we compute for every $X_{ij}$ a vector
$vec_{X_{ij}} = \frac{1}{2^m \times 2^m} (b_1, b_2, \dots, b_r)$ where $b_i = |\{x \in
X_{ij} : a_{i-1} \le x < a_i\}|$. In our case, for each $X_{ij}$, we
compute a five-tuple wherein, for example, the 4th element
represents the number of 9-mers whose number
of occurrences is in the 4th bin, that is, at least 5 but
less than 20. The division by $2^m \times 2^m$ is to obtain a
probability distribution for each submatrix. The same
procedure is performed for $Y_{ij}$, resulting in the vector
$vec_{Y_{ij}}$.
We further append all vectors $vec_{X_{ij}}$ and form a new
vector $vec_X^{m,r}$ and, using the same order of appending,
we append all vectors $vec_{Y_{ij}}$ forming a new vector
$vec_Y^{m,r}$. These two vectors are the “descriptors” of
the FCGR matrices $X$ and $Y$ for the parameters $m$, $r$
and the $r$ chosen bins.
As a last step, we combine descriptors $vec_X^{m,r}$ (respectively
$vec_Y^{m,r}$) for several values of $m$ and $r$ by
appending them one after another, in the same order,
to obtain the vector $vec_X$ (respectively $vec_Y$).
[2]In general, these windows (submatrices) can be over-
lapping, but in this paper we made the choice of using
non-overlapping windows.
The descriptor distance between the two FCGRs $X$
and $Y$ is now defined as the Euclidean distance between
the vectors $vec_X$ and $vec_Y$:
\[
d_D(X, Y) = d_E(vec_X, vec_Y).
\]
In our case we computed descriptors for $m = 4, 5, 6$,
therefore forming vectors $vec_X$ and $vec_Y$ of length
$5 \left( \left(\frac{512}{64}\right)^2 + \left(\frac{512}{32}\right)^2 + \left(\frac{512}{16}\right)^2 \right) = 6720$. In general,
for a given $r$, the length of the vectors compared
is $r \left( (2^{k-m_1})^2 + (2^{k-m_2})^2 + \dots + (2^{k-m_p})^2 \right)$, where
$m_1, m_2, \dots, m_p$ are the values used for $m$. The choice
of $m$ for this study was made to balance the computational
cost of calculating the vector of descriptors
with the ability to compare the two matrices at various
scales: large ($m = 6$, that is, compare windows of size
$64 \times 64$), medium ($m = 5$, windows of size $32 \times 32$) and
small ($m = 4$, windows of size $16 \times 16$). The parameter
$r = 5$ and the 5 bins were kept constant throughout
our calculations but, in general, these parameters can
also be varied, and the resulting vectors for each value
added to the vector of descriptors, resulting in a larger
vector.
In principle, the descriptor distance between two FCGRs
effectively compares the distribution of frequencies
of k-mers between the corresponding submatrices
$X_{ij}$ and $Y_{ij}$, and does that for several values of $m$,
that is, at several different scales. (Note that, in each
window $X_{ij}$, all k-mers have the same suffix of length
$k - m$.)
We now illustrate the descriptor distance by an ex-
ample wherein k= 3, m= 2, r= 3, and the 3 bins are
[0,15)∪[15,30)∪[30,∞). Since k= 3, the FCGR table
will contain the number of occurrences of all 3-mers in
a DNA sequence, as follows:
CCC GCC CGC GGC CCG GCG CGG GGG
ACC TCC AGC TGC ACG TCG AGG TGG
CAC GAC CTC GTC CAG GAG CTG GTG
AAC TAC ATC TTC AAG TAG ATG TTG
CCA GCA CGA GGA CCT GCT CGT GGT
ACA TCA AGA TGA ACT TCT AGT TGT
CAA GAA CTA GTA CAT GAT CTT GTT
AAA TAA ATA TTA AAT TAT ATT TTT
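Take the two FCGRs $X, Y \in \mathbb{N}^{8 \times 8}$ ($k = 3$, thus
$2^3 \times 2^3$) corresponding to two genomic 150 kbp sequences
of our dataset (one human and one bacterial),
respectively. In order to use small numbers throughout
the example, we divide all elements of the obtained matrices
by 100 and take the integer part of each element,
obtaining:
\[
X = \begin{pmatrix}
42 & 33 & 9 & 33 & 14 & 10 & 15 & 45 \\
22 & 30 & 26 & 25 & 9 & 5 & 37 & 37 \\
32 & 21 & 33 & 19 & 44 & 35 & 41 & 35 \\
17 & 9 & 13 & 21 & 23 & 10 & 22 & 18 \\
37 & 26 & 6 & 32 & 34 & 24 & 9 & 23 \\
29 & 24 & 31 & 27 & 19 & 27 & 18 & 28 \\
21 & 23 & 10 & 9 & 19 & 17 & 21 & 15 \\
35 & 15 & 14 & 14 & 19 & 12 & 17 & 30
\end{pmatrix},
\]
\[
Y = \begin{pmatrix}
18 & 34 & 40 & 27 & 30 & 36 & 27 & 12 \\
27 & 18 & 27 & 32 & 24 & 23 & 15 & 23 \\
24 & 17 & 13 & 17 & 36 & 12 & 32 & 18 \\
27 & 17 & 28 & 26 & 18 & 8 & 22 & 25 \\
32 & 32 & 23 & 16 & 16 & 25 & 23 & 22 \\
20 & 29 & 18 & 25 & 16 & 16 & 15 & 17 \\
25 & 25 & 7 & 16 & 26 & 27 & 20 & 25 \\
32 & 21 & 20 & 21 & 25 & 18 & 27 & 34
\end{pmatrix}.
\]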
Thus, in the human DNA sequence, the triplet CCC
appears about 4200 times, the triplet GCC appears
about 3300 times, the triplet CGC appears about 900
times, etc.
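Since $m = 2$, we divide each of the matrices $X$ and $Y$
into non-overlapping submatrices of size $4 \times 4$ ($2^2 \times 2^2$).
For $X$ we thus obtain $X_{11}, X_{12}, X_{21}, X_{22}$:
\[
X_{11} = \begin{pmatrix}
42 & 33 & 9 & 33 \\
22 & 30 & 26 & 25 \\
32 & 21 & 33 & 19 \\
17 & 9 & 13 & 21
\end{pmatrix}, \quad
X_{12} = \begin{pmatrix}
14 & 10 & 15 & 45 \\
9 & 5 & 37 & 37 \\
44 & 35 & 41 & 35 \\
23 & 10 & 22 & 18
\end{pmatrix},
\]
\[
X_{21} = \begin{pmatrix}
37 & 26 & 6 & 32 \\
29 & 24 & 31 & 27 \\
21 & 23 & 10 & 9 \\
35 & 15 & 14 & 14
\end{pmatrix}, \quad
X_{22} = \begin{pmatrix}
34 & 24 & 9 & 23 \\
19 & 27 & 18 & 28 \\
19 & 17 & 21 & 15 \\
19 & 12 & 17 & 30
\end{pmatrix},
\]
and similarly for $Y$.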
Since the $r = 3$ bins are $[0, 15) \cup [15, 30) \cup [30, \infty)$,
we will count, for each submatrix, the number of 3-mers
for which the number of occurrences is less than
15, at least 15 but less than 30, and greater than or equal to
30. Thus we obtain $vec_{X_{11}} = \frac{1}{16} (3, 7, 6)$, which has
as elements the number of elements of $X_{11}$ which belong
in each of the intervals selected, divided by the
total number of elements of $X_{11}$. We proceed similarly
for $vec_{X_{12}} = \frac{1}{16} (5, 4, 7)$, $vec_{X_{21}} = \frac{1}{16} (5, 7, 4)$,
$vec_{X_{22}} = \frac{1}{16} (2, 12, 2)$ and we form $vec_X$ by appending
these vectors one after the other, that is
\[
vec_X = \frac{1}{16} (3, 7, 6, 5, 4, 7, 5, 7, 4, 2, 12, 2).
\]
We apply exactly the same procedure for the matrix
$Y$ and we get
\[
vec_Y = \frac{1}{16} (1, 12, 3, 3, 9, 4, 1, 12, 3, 0, 15, 1).
\]
The descriptor distance between these two FCGRs is
computed as the Euclidean distance between $vec_X$ and
$vec_Y$, in this case $d_D(X, Y) \approx 0.718$. Note that, since
we started by dividing the number of 3-mer occurrences
by 100, as well as because of the bin selection,
this is a fictitious example. The real value of the descriptor
distance between the mentioned human and
bacterial sequences is 8.66, and the range of the descriptor
distance for this dataset of DNA sequences is
[0, 13.17]. In general, the descriptor distance has a variable
range that depends on the choice of parameters
used.
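The worked example can be reproduced with a short Python sketch. The function names `descriptor` and `descriptor_distance`, and the convention of passing the bins as a list of left edges, are assumptions made for this illustration; the matrices are the ones from the example above.

```python
import math
from bisect import bisect_right


def descriptor(matrix, m, bin_edges):
    """Concatenated bin histograms of all 2^m x 2^m windows, row by row.
    bin_edges = [a_0, ..., a_{r-1}]; the last bin is [a_{r-1}, infinity)."""
    size, w = len(matrix), 2 ** m
    out = []
    for top in range(0, size, w):
        for left in range(0, size, w):
            counts = [0] * len(bin_edges)
            for i in range(top, top + w):
                for j in range(left, left + w):
                    # Index of the bin [a_b, a_{b+1}) containing the value.
                    counts[bisect_right(bin_edges, matrix[i][j]) - 1] += 1
            out.extend(c / (w * w) for c in counts)
    return out


def descriptor_distance(X, Y, window_exponents, bin_edges):
    """Euclidean distance between the combined descriptors of X and Y."""
    vec_x = [v for m in window_exponents for v in descriptor(X, m, bin_edges)]
    vec_y = [v for m in window_exponents for v in descriptor(Y, m, bin_edges)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_x, vec_y)))


X = [[42, 33, 9, 33, 14, 10, 15, 45], [22, 30, 26, 25, 9, 5, 37, 37],
     [32, 21, 33, 19, 44, 35, 41, 35], [17, 9, 13, 21, 23, 10, 22, 18],
     [37, 26, 6, 32, 34, 24, 9, 23], [29, 24, 31, 27, 19, 27, 18, 28],
     [21, 23, 10, 9, 19, 17, 21, 15], [35, 15, 14, 14, 19, 12, 17, 30]]
Y = [[18, 34, 40, 27, 30, 36, 27, 12], [27, 18, 27, 32, 24, 23, 15, 23],
     [24, 17, 13, 17, 36, 12, 32, 18], [27, 17, 28, 26, 18, 8, 22, 25],
     [32, 32, 23, 16, 16, 25, 23, 22], [20, 29, 18, 25, 16, 16, 15, 17],
     [25, 25, 7, 16, 26, 27, 20, 25], [32, 21, 20, 21, 25, 18, 27, 34]]

print(round(descriptor_distance(X, Y, [2], [0, 15, 30]), 3))  # 0.718
```

The settings described for the full dataset would correspond to `window_exponents=[6, 5, 4]` and `bin_edges=[0, 1, 2, 5, 20]` applied to $2^9 \times 2^9$ FCGRs.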
To compute the Euclidean, Manhattan and Pearson
distances, we first convert the matrices $X, Y \in \mathbb{N}^{n \times n}$
into $1 \times n^2$ vectors. For two vectors $x, y \in \mathbb{R}^n$, their
Euclidean distance $d_E(x, y)$ and their Manhattan distance
$d_M(x, y)$ are computed as
\[
d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\]
\[
d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|,
\]
while their Pearson distance $d_P(x, y)$ is defined as
\[
d_P(x, y) = 1 - \frac{\sigma_{xy}}{\sigma_x \sigma_y},
\]
where
\[
\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
\sigma_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)^2},
\]
\[
\sigma_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y).
\]
In theory, the correlation coefficient $\frac{\sigma_{xy}}{\sigma_x \sigma_y}$ ranges in
the interval $[-1, 1]$, and therefore the Pearson distance
ranges in the interval $[0, 2]$.
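A direct transcription of these three formulas into Python, as a sketch. The function names are hypothetical, and the Pearson distance is left undefined (division by zero) when either vector is constant.

```python
import math


def flatten(matrix):
    """Convert an n x n matrix (list of rows) into a vector of length n^2."""
    return [value for row in matrix for value in row]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """1 minus the sample correlation coefficient of x and y."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / (n - 1))
    return 1.0 - cov / (sd_x * sd_y)


x = flatten([[0, 0], [0, 0]])
y = flatten([[3, 0], [0, 4]])
print(euclidean(x, y), manhattan(x, y))  # 5.0 7
```

Two perfectly anti-correlated vectors, such as (1, 2, 3) and (3, 2, 1), give the maximal Pearson distance of 2.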
The last distance we considered is based on the in-
formation distance defined in [13]. The use of this dis-
tance is motivated computationally since it is easily
computed from FCGRs as it tracks the number of dif-
ferent k-mers for a sequence instead of the actual set.
In [13], for a given $k$, the information distance for two
strings $x, y$ is defined as
\[
d_{AID}(x, y) = \frac{N_k(x|y) + N_k(y|x)}{N_k(xy)}
\]
with
\[
N_k(x|y) = N_k(xy) - N_k(x)
\]
where $N_k(x)$ is the number of different k-mers (possibly
overlapping) which occur in $x$. We go one step
further and modify this in order to avoid the creation
of “unwanted” k-mers from the concatenation $xy$ of
$x$ and $y$. First, we need to show how we compute
$N_k(x)$ for a sequence $x$. For a sequence $x$, we first
build its $FCGR(x) = X \in \mathbb{N}^{2^k \times 2^k}$, which is a matrix
of size $2^k \times 2^k$ with element values in $\mathbb{N}$. Then we
unitize $X$, that is, every non-zero entry becomes 1,
while zeros remain 0. $N_k(x)$ is now computed as the
sum of the elements of this unitized FCGR, that is,
$N_k(x) = f(X) = SumOfElements(Unitize(X))$. For
two strings $x$ and $y$, with FCGRs $X$ and $Y$ respectively,
we define $N_k(x|y)$ as:
\[
N_k(x|y) = f(X + Y) - N_k(x) \tag{1}
\]
This slight modification of the information distance
also gives us the desired properties $d(x, x) = 0$ and
$d(x, y) = d(y, x)$, which were not satisfied before. Using
(1), we now define the approximated information
distance (AID) as:
\[
d_{AID}(x, y) = 2 - \frac{f(X) + f(Y)}{f(X + Y)} \tag{2}
\]
where $x, y$ are the strings and $X, Y \in \mathbb{N}^{2^k \times 2^k}$ their
FCGRs, respectively. It also turns out that this distance
is in fact the normalized Hamming distance of
the unitized FCGRs $X$ and $Y$. Note that, for two
sets $\mathcal{X}$ and $\mathcal{Y}$, the normalized Hamming distance is
$\frac{|\mathcal{X} \triangle \mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|} = 2 - \frac{|\mathcal{X}| + |\mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|}$, where $\triangle$ denotes the symmetric
difference.
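Equation (2) translates almost literally into code. The sketch below assumes FCGRs are given as lists of rows with non-negative integer entries; the names `count_nonzero` and `approximated_information_distance` are hypothetical, and the result is undefined when both matrices are entirely zero.

```python
def count_nonzero(matrix):
    """f(X): sum of the elements of the unitized matrix."""
    return sum(1 for row in matrix for value in row if value != 0)


def approximated_information_distance(X, Y):
    """d_AID = 2 - (f(X) + f(Y)) / f(X + Y)."""
    # f(X + Y): positions where at least one of the two FCGRs is non-zero.
    f_sum = sum(
        1
        for row_x, row_y in zip(X, Y)
        for a, b in zip(row_x, row_y)
        if a + b != 0
    )
    return 2.0 - (count_nonzero(X) + count_nonzero(Y)) / f_sum


X = [[1, 0], [0, 2]]
Y = [[3, 0], [1, 0]]
# Occurring k-mers: 2 in X, 2 in Y, 3 in total, 2 in the symmetric difference.
print(approximated_information_distance(X, Y))  # 2 - 4/3, i.e. about 0.667
```

The value agrees with the set formulation: 2 positions in the symmetric difference divided by 3 positions in the union.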
The generation of CGR images, calculation of dis-
tance matrices and creation of 2D and 3D Molecu-
lar Distance Maps with MDS were done and can be
tested with the code available in [20] written in Wol-
fram Mathematica, version 9. The interactive webtool
MoDMap [21] allows in-depth exploration of the 2D
MoD Maps (Molecular Distance Maps) in this paper[3].
[3]When using the interactive webtool MoDMap, click-
ing on a distance underneath a dataset will result in
Online Supplemental Material [20] includes all dis-
tance matrices and the code used to produce all figures
and plots in this paper. More details about the online
resources can be found in Appendix B.
Analysis and Results
For our dataset, we use $k = 9$, that is, each DNA sequence
was represented as a $2^9 \times 2^9$ FCGR matrix.
In practice, this means that the FCGR of a DNA
sequence contains the full information regarding its
k-mer sequence composition, for $k = 1, 2, \dots, 9$. The
choice of length 150 kbp and value $k = 9$ is justified
by the fact that, for a random sequence of length 150
kbp, its CGR at resolution $2^9 \times 2^9$ has around half of
the pixels black, and half white.
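The "around half" figure can be checked with a back-of-the-envelope calculation. The sketch below assumes a uniformly random sequence and treats the k-mer occurrences as independent draws, which is only an approximation because consecutive k-mers overlap; the function name `expected_black_fraction` is hypothetical.

```python
def expected_black_fraction(sequence_length, k):
    """Approximate expected fraction of pixels hit at least once when
    the k-mers of a uniformly random sequence are treated as
    independent draws from the 4^k possible k-mers."""
    n_kmers = sequence_length - k + 1
    n_pixels = 4 ** k
    return 1.0 - (1.0 - 1.0 / n_pixels) ** n_kmers


print(round(expected_black_fraction(150_000, 9), 2))  # 0.44
```

Under these assumptions a 150 kbp sequence fills a little under half of the $4^9 = 262{,}144$ pixels, whereas k = 8 would leave few pixels white and k = 10 few pixels black.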
Figure 2 depicts two-dimensional Molecular Distance
Maps for the over five hundred DNA sequences in
our dataset, computed using the DSSIM distance,
descriptor distance, Euclidean distance, Manhattan
distance, Pearson distance and approximated information
distance, respectively. Figure 3 depicts the corresponding
three-dimensional Molecular Distance Maps
for the same dataset. The projection of each three-
dimensional map is chosen by hand in order to visually
separate clusters of points which appear to be overlap-
ping in the two-dimensional maps, as discussed below.
We note that MDS is not a clustering method, as the
clusters are defined beforehand by the coloring scheme
used (blue for H. sapiens, green for E. coli, and so on).
MDS simply tries to display visually the interrelation-
ships between the given items, based on the pairwise
distances in the distance matrix which is its input.
Note also that an increase in dimensionality from 2 to
3 can lead to a better cluster visualization. For exam-
ple, if we compare the two-dimensional and the three-
dimensional Molecular Distance Maps obtained using
DSSIM, we see that points that appeared to be erro-
neously mixed with each other in the two-dimensional
map, Figure 2(a), (S. cerevisiae and P. falciparum se-
quences mixed in with A. thaliana sequences) were in
fact clearly separated from each other in Figure 3(a),
the three-dimensional version of the Molecular Dis-
tance Map.
plotting the MoD Map of the dataset computed with
that distance. On any particular MoD Map, clicking on
a point will display a window with information about
the subsequence represented by that point: its NCBI
accession number, scientific name of the organism it
originates from, and its CGR pattern. Clicking on the
“From here” and “To here” buttons on two such se-
lected windows will display the distance between the
corresponding genomic subsequences in the distance
matrix.
Figure 4 displays the histograms of the pairwise intragenomic distances (dark blue and turquoise) and intergenomic distances (grey) of DNA sequences from
H. sapiens and A. thaliana, obtained using each of
the six distances. As noted, some distances seem to
perform better than others. Visually, the poorest per-
former for these two sets of sequences (from H. sapiens
and A. thaliana) seems to be the Euclidean distance
wherein the intragenomic distances are as high as in-
tergenomic distances, and no separation is visible. In
contrast, DSSIM gives – for the same data – interge-
nomic distances that are overall much higher than in-
tragenomic distances, resulting in a clear classification
of DNA sequences into the species they belong to.
Table 3 displays the mean and standard deviation of distances between clusters $C_i$ and $C_j$, $1 \le i, j \le 6$, where a cluster $C_\ell$ is defined as the set of all genomic sequences from the genome of organism $\ell$, as labelled in Table 1. In each subtable, the diagonal entries give the mean and standard deviation of intragenomic distances, while the other entries are all intergenomic distances. From this table we see that for DSSIM, Manhattan and approximated information distance, the maximum of all the averages of intragenomic distances in this dataset is strictly smaller than the minimum of all the averages of intergenomic distances. For the descriptor distance and Pearson distance the previous statement does not hold but, for each pair of organisms, the two averages of intragenomic distances (e.g., human-human and plant-plant) are both lower than the average of the intergenomic distances (human-plant). For the Euclidean distance, neither of the previous statements holds: for example, the average of the plant-plant intragenomic distances (element 4-4 in the Euclidean distance subtable of Table 3) is 723, which is larger than 672, the average of the yeast-plant intergenomic distances (element 3-4 in the Euclidean distance subtable of Table 3). The complete histograms of all pairwise comparisons $C_i$–$C_j$ can be found in Appendix C.
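The two separation statements above can be checked mechanically from the means in Table 3. The following is a minimal Python sketch, not the authors' Mathematica code; the helper names `upper_to_full`, `globally_separated` and `pairwise_separated` are hypothetical, and only the mean values of the DSSIM and Euclidean subtables are used (standard deviations are ignored).

```python
def upper_to_full(rows):
    """Mirror an upper-triangular list of rows into a full symmetric matrix."""
    p = len(rows)
    full = [[0.0] * p for _ in range(p)]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            j = i + offset
            full[i][j] = full[j][i] = value
    return full

def globally_separated(means):
    """True if the largest intragenomic mean is strictly below the smallest intergenomic mean."""
    p = len(means)
    intra = [means[i][i] for i in range(p)]
    inter = [means[i][j] for i in range(p) for j in range(i + 1, p)]
    return max(intra) < min(inter)

def pairwise_separated(means):
    """True if, for every pair i < j, both intragenomic means are below the intergenomic mean."""
    p = len(means)
    return all(
        max(means[i][i], means[j][j]) < means[i][j]
        for i in range(p) for j in range(i + 1, p)
    )

# Mean values copied from the DSSIM and Euclidean subtables of Table 3.
dssim = upper_to_full([
    [0.81, 0.99, 0.92, 0.91, 0.92, 0.91],
    [0.85, 0.97, 0.99, 0.99, 0.99],
    [0.87, 0.89, 0.91, 0.91],
    [0.87, 0.90, 0.91],
    [0.74, 0.94],
    [0.83],
])
euclidean = upper_to_full([
    [756, 856, 756, 818, 3914, 812],
    [558, 674, 802, 4102, 696],
    [564, 672, 3964, 633],
    [723, 3923, 748],
    [999, 4085],
    [585],
])

print("DSSIM:    ", globally_separated(dssim), pairwise_separated(dssim))
print("Euclidean:", globally_separated(euclidean), pairwise_separated(euclidean))
```

With these inputs, both checks succeed for DSSIM and both fail for the Euclidean distance, in line with the discussion above.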
DSSIM
    1             2             3             4             5             6
1   0.81 ± 0.04   0.99 ± 0.01   0.92 ± 0.02   0.91 ± 0.03   0.92 ± 0.03   0.91 ± 0.02
2   -             0.85 ± 0.01   0.97 ± 0.01   0.99 ± 0.01   0.99 ± 0.01   0.99 ± 0.
3   -             -             0.87 ± 0.01   0.89 ± 0.02   0.91 ± 0.     0.91 ± 0.01
4   -             -             -             0.87 ± 0.03   0.9 ± 0.02    0.91 ± 0.01
5   -             -             -             -             0.74 ± 0.01   0.94 ± 0.
6   -             -             -             -             -             0.83 ± 0.01

Descriptors
    1             2             3             4             5             6
1   3.76 ± 1.69   9.74 ± 0.66   5.92 ± 1.14   5.71 ± 1.41   9.33 ± 1.23   5.44 ± 0.92
2   -             2.5 ± 0.28    8.05 ± 0.39   9.1 ± 0.55    12.67 ± 0.19  9.38 ± 0.41
3   -             -             2.12 ± 0.08   3.42 ± 1.05   9.48 ± 0.31   4.6 ± 0.09
4   -             -             -             2.75 ± 1.33   8.23 ± 0.94   4.94 ± 0.76
5   -             -             -             -             1.53 ± 0.14   9.99 ± 0.28
6   -             -             -             -             -             2.4 ± 0.32

Euclidean
    1             2             3             4             5             6
1   756 ± 498     856 ± 349     756 ± 361     818 ± 514     3914 ± 510    812 ± 356
2   -             558 ± 5       674 ± 17      802 ± 366     4102 ± 466    696 ± 18
3   -             -             564 ± 11      672 ± 383     3964 ± 472    633 ± 20
4   -             -             -             723 ± 535     3923 ± 506    748 ± 372
5   -             -             -             -             999 ± 276     4085 ± 468
6   -             -             -             -             -             585 ± 24

Manhattan (in thousands)
    1             2             3             4             5             6
1   171 ± 15      222 ± 5       189 ± 13      188 ± 17      213 ± 20      191 ± 9
2   -             175 ± 2       209 ± 4       219 ± 8       252 ± 4       218 ± 3
3   -             -             171 ± 2       177 ± 10      206 ± 2       184 ± 2
4   -             -             -             172 ± 16      200 ± 11      188 ± 9
5   -             -             -             -             105 ± 3       224 ± 2
6   -             -             -             -             -             167 ± 3

Pearson
    1             2             3             4             5             6
1   0.5 ± 0.12    0.97 ± 0.02   0.69 ± 0.1    0.64 ± 0.12   0.65 ± 0.09   0.81 ± 0.06
2   -             0.71 ± 0.02   0.93 ± 0.02   0.96 ± 0.02   0.98 ± 0.01   0.99 ± 0.02
3   -             -             0.6 ± 0.02    0.6 ± 0.07    0.71 ± 0.03   0.75 ± 0.02
4   -             -             -             0.53 ± 0.11   0.63 ± 0.09   0.76 ± 0.04
5   -             -             -             -             0.02 ± 0.01   0.94 ± 0.01
6   -             -             -             -             -             0.64 ± 0.03

Approx. Information
    1             2             3             4             5             6
1   0.65 ± 0.03   0.78 ± 0.01   0.7 ± 0.03    0.7 ± 0.03    0.76 ± 0.04   0.69 ± 0.02
2   -             0.67 ± 0.     0.75 ± 0.01   0.77 ± 0.02   0.85 ± 0.01   0.77 ± 0.01
3   -             -             0.67 ± 0.01   0.68 ± 0.02   0.74 ± 0.     0.69 ± 0.
4   -             -             -             0.67 ± 0.03   0.73 ± 0.02   0.69 ± 0.02
5   -             -             -             -             0.64 ± 0.01   0.76 ± 0.01
6   -             -             -             -             -             0.65 ± 0.01

Table 3 Mean and standard deviation of distances between clusters $C_i$–$C_j$ for i, j = 1, ..., 6.
Figure 2 Two-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance, respectively. Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turquoise), P. falciparum (magenta), and P. furiosus (orange).
Quality Measures for Distances
In this section we present three quality measures, each of which evaluates the quality of the six distances considered. In the data mining literature a wide range
of quality measures for clusterings has been defined;
see for example [22,23]. Most of these methods are
designed to assess the quality of different automated
clustering methods while using the same distance. Our
set-up is different, as we use different distances while
the clustering is fixed and given by the initial colour-
coding of the sequence-representing points. Thus, we
have to use other approaches to compare the distances
we analyze. In particular, as the six distances have
different ranges, we have to use assessment methods which are invariant to the scale of the distance.
The “ground-truth” that we use as a basis for our distance assessment is the fact that the “ideal” clustering of DNA sequences and the points that represent them is known: sequences from the same organism should be close to one another and far from sequences originating from other organisms. (This assumption is justified for this dataset, as the six organisms considered are very different from one another, belonging to different kingdoms of life.) Thus, an optimal distance should yield a relatively small distance between two FCGRs generated from DNA sequences originating from the same organism, and relatively high distances between two FCGRs generated from DNA sequences coming from different organisms.
Figure 3 Three-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance, respectively. Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turquoise), P. falciparum (magenta), and P. furiosus (orange).
Figure 4 Histograms of pairwise intragenomic and intergenomic distances among the DNA sequences from H. sapiens and A. thaliana, for (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance.
In order to assess each of the six distances quantita-
tively, we computed three quality measures which rate
different features of a distance:
• the correlation to an idealized cluster distance
• the silhouette cluster accuracy
• the relative overlap between the intragenomic and intergenomic distance histograms.
Let us stress that all three quality measures of the six
distances are based on the distance matrices which we
computed and not on their MDS plots. We will define
the three quality measures such that their expected
values range in the interval [0,1] where higher values
correspond to better performance.
Let us first describe the three quality measures informally. An idealized distance is a distance that would be able to differentiate DNA sequences by species, that is, a distance $\delta$ for which $\delta(x, y) = 0$ if $x$ and $y$ are sequences from the same species and $\delta(x, y) = 1$ otherwise. The first quality measure, the correlation to an idealized cluster distance, measures how well a distance is linearly correlated to the idealized distance $\delta$. The second quality measure, silhouette cluster accuracy, is the percentage of points that are best embedded in the cluster they belong to. The third quality measure quantifies the “visual overlap” between the intragenomic and intergenomic distance histograms.
Given our dataset, it is reasonable to expect that a good distance gives a low value if applied to FCGRs of genomic sequences of the same organism, and a high value when applied to FCGRs of genomic sequences from two different organisms, thus separating the histograms of intragenomic distances from those of intergenomic distances. This is illustrated by the histograms in Figure 4, where a high overlap between the graphs of intragenomic distances (dark blue and turquoise) and the graph of intergenomic distances (grey) is an indication of a poorly performing distance. In a theoretically optimal situation, there would exist a value $c$ such that all distances that are smaller than $c$ are intragenomic distances and all distances that are larger than $c$ are intergenomic distances. This can usually not be expected from real data, but a low overlap between histograms is nevertheless indicative of a “good” distance.
In order to formally define the three quality measures, we consider a dataset $V$ which is partitioned into $p$ non-overlapping clusters $C_1, \ldots, C_p$ and on which a distance $d_\alpha : V \times V \to \mathbb{R}_{\ge 0}$ is defined. The cardinalities of the sets are $|V| = m$ and $|C_i| = m_i$ for $i = 1, \ldots, p$. In our analysis, $p = 6$ and $C_1$ contains all FCGRs generated from genomic DNA sequences from H. sapiens, $C_2$ contains all FCGRs generated from genomic sequences of E. coli, and so on, according to the order in Table 1. The distance $d_\alpha$ is one of the six distances, $\alpha \in \{\mathrm{DSSIM}, \mathrm{D}, \mathrm{E}, \mathrm{M}, \mathrm{P}, \mathrm{AID}\}$.
The correlation to an idealized cluster distance is computed as follows. We define the idealized cluster distance as a function (or matrix) $\delta : V \times V \to \{0, 1\}$ such that $\delta(x, y) = 0$ if and only if $x$ and $y$ belong to the same cluster, and $\delta(x, y) = 1$ otherwise. Because we can view $d_\alpha$ and $\delta$ as discrete, symmetric functions which have the same domain, we can compute their correlation coefficient. We define the correlation of $\delta$ to $d_\alpha$ to be the Pearson correlation of $\delta$ and $d_\alpha$. More precisely, the upper triangular part of the matrix corresponding to a distance $d_\alpha$ is interpreted as a vector $(x_1, \ldots, x_n)$ and compared with the corresponding values $(y_1, \ldots, y_n)$ given by $\delta$. We obtain the $\delta$-correlation as
$$D_\alpha = \frac{\sigma_{xy}}{\sigma_x \sigma_y},$$
where $\sigma_{xy}$ is the covariance of the two vectors and $\sigma_x$, $\sigma_y$ are their standard deviations.
The correlation ranges in the interval [−1, 1]: a value of 1 means that $d_\alpha$ and $\delta$ are perfectly (positively) linearly correlated, and a value of 0 means that they are linearly uncorrelated. In other words, if the correlation of a given distance to the idealized cluster distance is close to 1, then the given distance is close to the idealized cluster distance and hence performs well. Note that negative values for this measure are not expected, as this would imply that $d_\alpha$ and $\delta$ were negatively related ($d_\alpha$ would perform worse than a matrix containing random entries).
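As an illustration of the correlation to an idealized cluster distance, the following minimal Python sketch computes the Pearson correlation between the upper triangle of a distance matrix and the corresponding 0/1 values of δ. It is not the authors' Mathematica code: the function name `delta_correlation` and the toy matrices are hypothetical, and taking the strict upper triangle (excluding the diagonal) is an assumption.

```python
from math import sqrt

def delta_correlation(dist, labels):
    """Pearson correlation between a distance matrix and the 0/1 idealized cluster distance."""
    m = len(labels)
    xs, ys = [], []
    for i in range(m):
        for j in range(i + 1, m):  # strict upper triangle (assumption)
            xs.append(float(dist[i][j]))
            ys.append(0.0 if labels[i] == labels[j] else 1.0)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data: two clusters of two points each (hypothetical values).
labels = ["a", "a", "b", "b"]
ideal = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
noisy = [[0, 2, 5, 6], [2, 0, 7, 5], [5, 7, 0, 3], [6, 5, 3, 0]]

print(delta_correlation(ideal, labels))
print(delta_correlation(noisy, labels))
```

The idealized matrix itself yields a correlation of 1, while the noisy matrix, whose intergenomic entries are larger than its intragenomic ones but not constant, yields a value somewhat below 1.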
The silhouette cluster accuracy is based on the silhouette coefficient, defined in [24] as a measure that determines how well a single point is embedded in the cluster to which it belongs. For a point $x$ from cluster $C_i$ we define $a_x$ as the average distance of this point to all other points in $C_i$, that is,
$$a_x = \frac{1}{m_i - 1} \sum_{y \in C_i,\ y \ne x} d_\alpha(x, y),$$
and we define $b_x$ as the minimum, over all other clusters, of the average distance of $x$ to the points of that cluster,
$$b_x = \min_{j = 1, \ldots, p,\ j \ne i} \left( \frac{1}{m_j} \sum_{y \in C_j} d_\alpha(x, y) \right).$$
The silhouette coefficient of $x$ is defined as
$$S_\alpha(x) = \frac{b_x - a_x}{\max\{a_x, b_x\}}.$$
If a point $x$ has a silhouette coefficient $S_\alpha(x) \le 0$, then $x$ is at least as close to a cluster to which it does not belong as to its own cluster. The silhouette cluster accuracy $A_\alpha$ denotes the fraction of points with a silhouette coefficient greater than 0, that is, the fraction of points which are well-embedded in their own cluster,
$$A_\alpha = \frac{|\{x \in V \mid S_\alpha(x) > 0\}|}{m}.$$
The silhouette cluster accuracy ranges in [0, 1], with a high accuracy being desirable.
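The definition of the silhouette cluster accuracy can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the function name `silhouette_accuracy` and the toy data are hypothetical, and the code assumes at least two clusters, at least two points per cluster, and that max{a_x, b_x} is nonzero for every point.

```python
def silhouette_accuracy(dist, labels):
    """Fraction of points whose silhouette coefficient is strictly positive."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    positive = 0
    for x, own in enumerate(labels):
        members = clusters[own]
        # a_x: mean distance from x to the other points of its own cluster.
        a_x = sum(dist[x][y] for y in members if y != x) / (len(members) - 1)
        # b_x: smallest mean distance from x to the points of any other cluster.
        b_x = min(
            sum(dist[x][y] for y in other) / len(other)
            for label, other in clusters.items() if label != own
        )
        s_x = (b_x - a_x) / max(a_x, b_x)
        if s_x > 0:
            positive += 1
    return positive / len(labels)

# Toy data: six points on a line; the third point of cluster "a" lies next to cluster "b".
positions = [0.0, 1.0, 4.5, 5.0, 6.0, 7.0]
labels = ["a", "a", "a", "b", "b", "b"]
dist = [[abs(p - q) for q in positions] for p in positions]

accuracy = silhouette_accuracy(dist, labels)
print(accuracy)
```

In this toy example only the point at position 4.5 has a non-positive silhouette coefficient, so five of the six points count as well-embedded.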
For assessing the relative overlap of the histograms, consider any two clusters $C_i$ and $C_j$ with $i \ne j$ (for example, $C_1$ is the H. sapiens cluster and $C_4$ the A. thaliana cluster). We compare the two sets of intragenomic distances $C_i$–$C_i$ and $C_j$–$C_j$ with the set of intergenomic distances $C_i$–$C_j$. For a distance $d_\alpha$, we divide the range from the minimum distance $\min(d_\alpha)$ to the maximum distance $\max(d_\alpha)$ in this dataset into 100 bins of size
$$r = \frac{\max(d_\alpha) - \min(d_\alpha)}{100}$$
and count the distances which fall into each bin: $c_{i,i}[\ell]$ denotes the number of distances from $C_i$–$C_i$ in bin $\ell$, and $c_{i,j}[\ell]$ denotes the number of distances from $C_i$–$C_j$ in bin $\ell$. For $\ell = 1, \ldots, 100$ we let
$$c_{i',j'}[\ell] = \left|\left\{ \{x, y\} \mid x \in C_{i'},\ y \in C_{j'} \text{ and } x \ne y \text{ and } (\ell - 1) \cdot r < d_\alpha(x, y) \le \ell \cdot r \right\}\right|.$$
By $s_{i',j'}$ we denote the sum over all $c_{i',j'}$-bins, $s_{i',j'} = \sum_{\ell=1}^{100} c_{i',j'}[\ell]$. We define the relative overlap $O_\alpha(i, j)$ of $C_i$–$C_i$ (intragenomic distances) with $C_i$–$C_j$ (intergenomic distances) as
$$O_\alpha(i, j) = \frac{\max\{s_{i,i}, s_{i,j}\}}{\min\{s_{i,i}, s_{i,j}\}} \cdot \frac{\sum_{\ell=1}^{100} \min\{c_{i,i}[\ell], c_{i,j}[\ell]\}}{\sum_{\ell=1}^{100} \max\{c_{i,i}[\ell], c_{i,j}[\ell]\}}.$$
The relative overlap $O_\alpha(j, i)$ of $C_j$–$C_j$ with $C_i$–$C_j$ is defined analogously; note that $O_\alpha(i, j) \ne O_\alpha(j, i)$ in general. The overlap is normalized to the range [0, 1], where 0 means no overlap of elements of bins between intra- and intergenomic distances, and 1 means that one of the histograms completely “covers” the other. Also note that we are not interested in the overlap of $C_i$–$C_i$ with $C_j$–$C_j$, as both sets of distances are intragenomic distances.
Since we intend to define a quality measure where a value close to 1 represents a small overlap, we will use $1 - O_\alpha(i, j)$ as relative overlap. Furthermore, we combine these quantities for all possible pairs of clusters $C_i$ and $C_j$, obtaining the relative overlap as:
$$O_\alpha = 1 - \frac{1}{p(p-1)} \sum_{i=1}^{p} \sum_{\substack{j=1 \\ j \ne i}}^{p} O_\alpha(i, j).$$
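The relative overlap can be sketched as follows in Python. This is a hedged illustration, not the authors' code: the function names `bin_counts`, `pair_overlap` and `relative_overlap` are hypothetical, bin edges are measured from the minimum distance (following the textual description of the binning), and a distance equal to the minimum is placed in the first bin; both choices are assumptions. The sketch also assumes that every cluster has at least two points and that not all distances are equal.

```python
from math import ceil

def bin_counts(values, lo, hi, bins=100):
    """Histogram counts over `bins` equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # 1-based bin l holds lo + (l - 1) * width < v <= lo + l * width.
        index = ceil((v - lo) / width) - 1
        counts[min(max(index, 0), bins - 1)] += 1
    return counts

def pair_overlap(intra, inter, lo, hi, bins=100):
    """O(i, j): overlap of one intragenomic histogram with the intergenomic histogram."""
    c_intra = bin_counts(intra, lo, hi, bins)
    c_inter = bin_counts(inter, lo, hi, bins)
    s_intra, s_inter = sum(c_intra), sum(c_inter)
    shared = sum(min(a, b) for a, b in zip(c_intra, c_inter))
    union = sum(max(a, b) for a, b in zip(c_intra, c_inter))
    return (max(s_intra, s_inter) / min(s_intra, s_inter)) * shared / union

def relative_overlap(dist, labels, bins=100):
    """1 minus the mean of O(i, j) over all ordered pairs of distinct clusters."""
    m = len(labels)
    names = sorted(set(labels))
    pairs = [(x, y) for x in range(m) for y in range(x + 1, m)]
    all_d = [dist[x][y] for x, y in pairs]
    lo, hi = min(all_d), max(all_d)

    def between(a, b):
        return [dist[x][y] for x, y in pairs if {labels[x], labels[y]} == {a, b}]

    total = 0.0
    for a in names:
        for b in names:
            if a != b:
                total += pair_overlap(between(a, a), between(a, b), lo, hi, bins)
    p = len(names)
    return 1.0 - total / (p * (p - 1))

# Toy data: two well-separated clusters on a line (hypothetical values).
positions = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
dist = [[abs(p - q) for q in positions] for p in positions]
print(relative_overlap(dist, labels))
```

For the well-separated toy clusters no bin contains both intragenomic and intergenomic distances, so every O(i, j) is 0 and the combined measure takes its best possible value.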
For example, in Figure 4, for each of the considered distances, the dark blue histograms depict the $C_1$–$C_1$ (H. sapiens – H. sapiens) intragenomic distances, the turquoise histograms the $C_4$–$C_4$ (A. thaliana – A. thaliana) intragenomic distances, and the grey histograms the $C_1$–$C_4$ (H. sapiens – A. thaliana) intergenomic distances. As seen from this figure, the descriptor distance appears to visually perform best at separating the two intragenomic distance histograms from the intergenomic histogram, while the Euclidean distance has the weakest performance. The relative overlap attempts to quantify this by computing the overlaps of each of the two pairs of histograms (dark blue with grey and turquoise with grey). Note that a small visual overlap of the histograms results in a high numerical value of the relative overlap $O_\alpha$, which is indicative of a better performing distance.
Distance Comparison Results
The results of comparing the six distances we ana-
lyzed, using the three quality measures, are listed in
Table 4. Recall that all quality measures have an ex-
pected range of [0,1] where larger values imply better
performance.
Distance       Dα      Aα      Oα      z-score sum   Rank
DSSIM          0.627   1.000   0.965   1.895         2nd
Descriptors    0.639   0.976   0.988   2.509         1st
Euclidean      0.231   0.325   0.907   −4.831        6th
Manhattan      0.527   1.000   0.951   0.84          3rd
Pearson        0.536   0.980   0.888   −0.875        5th
Approx. Inf.   0.527   1.000   0.937   0.462         4th
Table 4 Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 508 genomic DNA sequences taken from organisms from each kingdom of life. Dα is the correlation to an idealized cluster, Aα the silhouette cluster accuracy, and Oα the relative overlap. Higher is better.
To compare each distance relative to all the other distances, we further compute, for each quality measure (each column), the standard scores (z-scores) of each distance $d_\alpha$, where $\alpha \in \{\mathrm{DSSIM}, \mathrm{D}, \mathrm{E}, \mathrm{M}, \mathrm{P}, \mathrm{AID}\}$, as
$$z(d_\alpha) = \frac{d_\alpha - \mu}{\sigma},$$
where $d_\alpha$ here stands for the value of the quality measure obtained for that distance, $\mu$ is the mean and $\sigma$ is the standard deviation of all six values for that particular quality measure (column). A positive value of the standard score means that a distance performs above average (in this category), and a negative value that it performs below average.
Finally, we compute, for each distance, the sum of its z-scores over the three quality measures, as seen in Table 4. Note that the total of z-scores for a distance represents the performance of that distance relative to the other distances, and indicates its relative ranking.
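The z-score ranking can be sketched in Python from the three quality-measure columns of Table 4. This is an illustrative sketch rather than the authors' code; the names `table4` and `z_score_sums` are hypothetical, and the use of the sample standard deviation (divisor n − 1, as computed by `statistics.stdev`) is an assumption that appears to agree with the z-score sums listed in Table 4 up to rounding of the inputs.

```python
from statistics import mean, stdev

# Quality measures (D, A, O) copied from Table 4.
table4 = {
    "DSSIM":        (0.627, 1.000, 0.965),
    "Descriptors":  (0.639, 0.976, 0.988),
    "Euclidean":    (0.231, 0.325, 0.907),
    "Manhattan":    (0.527, 1.000, 0.951),
    "Pearson":      (0.536, 0.980, 0.888),
    "Approx. Inf.": (0.527, 1.000, 0.937),
}

def z_score_sums(table):
    """Sum, for each distance, of its z-scores over all quality-measure columns."""
    names = list(table)
    columns = list(zip(*table.values()))
    sums = dict.fromkeys(names, 0.0)
    for column in columns:
        mu = mean(column)
        sigma = stdev(column)  # sample standard deviation (assumption)
        for name, value in zip(names, column):
            sums[name] += (value - mu) / sigma
    return sums

sums = z_score_sums(table4)
ranking = sorted(sums, key=sums.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, round(sums[name], 3))
```

Sorting the distances by decreasing z-score sum gives the order descriptors, DSSIM, Manhattan, approximated information, Pearson, Euclidean, which matches the Rank column of Table 4.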
The conclusion of this analysis is that the best performing distances are the descriptor distance and DSSIM. Manhattan, Pearson, and approximated information distance perform well in some categories but not so well in other categories. For this dataset and value of k, the Euclidean distance had the weakest overall performance, with the lowest correlation Dα, the lowest silhouette cluster accuracy Aα and the lowest z-score sum in Table 4, which confirms the visual assessment of the MDS plots obtained by using the Euclidean distance, as seen in Figure 2 and Figure 3.
It is worth noting that the two distances which per-
form best (DSSIM and descriptor) treat FCGR ma-
trices as two-dimensional maps in which the local ar-
rangement of the cells (matrix entries) influences the
computed distance, whereas the other distances treat
the FCGR matrices as linear vectors. This suggests
that the organization of the k-mer tallies (in this pa-
per k= 9) of a DNA sequence as an FCGR matrix,
rather than a simple vector, reveals structural prop-
erties of the DNA sequence that could be utilized in
order to identify and classify genomic DNA sequences.
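To make the arrangement of k-mer tallies in an FCGR matrix concrete, the following minimal Python sketch builds a $2^k \times 2^k$ count matrix and, as a sanity check of the earlier remark on sequence length and resolution, measures which fraction of cells is non-empty for a pseudo-random 150 kbp sequence at k = 9. It is not the authors' Mathematica code. The corner layout in `CORNER_BITS` is one common CGR convention and is an assumption here, as are the function names and the uniform random model. Under that model one would expect roughly $1 - e^{-149992/4^9} \approx 0.44$ of the cells to be filled.

```python
import random

# Assumed corner layout (a common CGR convention, not confirmed by this section):
# A bottom-left, C top-left, G top-right, T bottom-right, stored as (x_bit, y_bit).
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(sequence, k):
    """Return a 2^k x 2^k matrix (list of rows, row 0 at the bottom) of k-mer counts."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for start in range(len(sequence) - k + 1):
        x = y = 0
        valid = True
        # The last letter of a k-mer selects the coarsest quadrant,
        # so it supplies the most significant bit of each coordinate.
        for letter in reversed(sequence[start:start + k]):
            bits = CORNER_BITS.get(letter)
            if bits is None:  # skip k-mers containing N or other symbols
                valid = False
                break
            x = (x << 1) | bits[0]
            y = (y << 1) | bits[1]
        if valid:
            matrix[y][x] += 1
    return matrix

def filled_fraction(matrix):
    """Fraction of cells with a nonzero count (the 'black pixels' of the CGR)."""
    cells = [count for row in matrix for count in row]
    return sum(1 for count in cells if count > 0) / len(cells)

rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(150_000))
fraction = filled_fraction(fcgr(random_seq, 9))
print(round(fraction, 3))
```

Because neighbouring cells of the matrix correspond to k-mers that share a suffix, a distance that takes the local arrangement of cells into account sees structure that is lost when the same counts are flattened into a vector.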
Discussion and Conclusions
In this study we test the hypothesis that CGR-based genomic signatures of genomic DNA sequences are indeed species- and genome-specific. With this goal in
mind we analyze over five hundred 150 kbp DNA
genomic sequences originating from organisms repre-
senting each of the kingdoms of life. Our quantita-
tive comparison of six different distances suggests that
several other distances outperform the Euclidean dis-
tance, which has been until now almost exclusively
used in such studies. Our preliminary results show
that two of these distances, DSSIM and descriptor dis-
tance (introduced here) when applied to CGR-based
genomic signatures, have indeed the ability to differ-
entiate between DNA sequences coming from different
species. This indicates that the k-mer sequence compo-
sition (where k= 1,2, ..., 9) of genomic sequences con-
tains taxonomic information which could potentially
aid in the identification, comparison and classifica-
tion of species based on molecular evidence. The two-
dimensional and three-dimensional Molecular Distance
Maps we obtain, which visualize the simultaneous in-
tragenomic and intergenomic interrelationships among
the sequences in our dataset, show this method’s po-
tential.
Further analysis is needed to explore this method’s potential for the analysis of closely related species. As a preliminary experiment, we applied it to H. sapiens chromosome 21 (NC_000021.8), which yields 234 sequences, and P. troglodytes chromosome Y (NC_006492.3), which yields 168 sequences, all 150 kbp long.
Distance       Dα      Aα      Oα      z-score sum   Rank
DSSIM          0.167   0.915   0.136   3.453         1st
Descriptors    0.015   0.500   0.101   −2.593        5th
Euclidean      0.037   0.58    0.069   −2.899        6th
Manhattan      0.112   0.863   0.108   1.27          3rd
Pearson        0.142   0.714   0.119   1.339         2nd
Approx. Inf.   0.075   0.933   0.062   −0.569        4th
Table 5 Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 402 DNA sequences from H. sapiens, chromosome 21 and P. troglodytes, chromosome Y. Dα is the correlation to an idealized cluster, Aα is the silhouette cluster accuracy, and Oα is the relative overlap.
The Molecular Distance Maps in Figure 5 and Figure 6, of 402 DNA sequences, suggest that several of the distances are able to differentiate even between DNA sequences from closely related organisms. As seen in Table 5, the Euclidean distance was again outperformed by other distances, when assessed with the quality measures we described. In this case study, we note a change in the distance rankings: DSSIM, which ranked second previously, now ranks first, while the descriptor distance, which ranked first previously, now ranks second to last. This may be an indication that the descriptor distance, which was designed to detect pattern differences, may only perform well for analyses of sequences of distantly related organisms, while DSSIM, which is sensitive to small differences in similar images, may be the preferred option for fine-grained analyses at the genus, family and species level.
Figure 5 Two-dimensional Molecular Distance Maps of 150 kbp genomic DNA sequences from H. sapiens (blue), P. troglodytes (red) using the six distances: (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance.
Further large-scale computational experiments have
to be carried out to confirm these preliminary results
and establish their validity. Such experiments could
provide additional insights regarding the choice of op-
timal distance for structural genome comparison in
different settings.
Figure 6 Three-dimensional Molecular Distance Maps of 150 kbp genomic DNA sequences from H. sapiens (blue), P. troglodytes (red) using the six distances: (a) DSSIM, (b) descriptor, (c) Euclidean, (d) Manhattan, (e) Pearson and (f) approximated information distance.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RK data acquisition; data analysis, methodology and result interpretation;
manuscript draft; manuscript editing; software design. LK data analysis,
methodology and result interpretation; manuscript draft; manuscript
editing. S.Kon data analysis, methodology and result interpretation;
manuscript editing. S.Kop data analysis, methodology and result
interpretation; manuscript editing. All authors read and approved the final
manuscript.
Acknowledgements
We thank Yuri Boykov, Lena Gorelick and Olga Veksler for discussions on
the definition for the descriptor distance, and Stephen Solis for comments
on earlier drafts of the manuscript.
Author details
1Department of Computer Science, University of Western Ontario,
London, ON, Canada. 2Department of Mathematics and Computing
Science, Saint Mary’s University, Halifax, NS, Canada.
References
1. Hebert, P.D., Cywinska, A., Ball, S.L., et al.: Biological identifications
through DNA barcodes. Proceedings of the Royal Society of London.
Series B: Biological Sciences 270(1512), 313–321 (2003)
2. Sirovich, L., Stoeckle, M.Y., Zhang, Y.: Structural analysis of
biodiversity. PLoS One 5(2), 9266 (2010)
3. Jeffrey, H.: Chaos Game Representation of gene structure. Nucleic
Acids Research 18(8), 2163–2170 (1990)
4. Kari, L., Hill, K.A., Sayem, A.S., Karamichalis, R., Bryans, N., Davis,
K., Dattani, N.S.: Mapping the Space of Genomic Signatures
(Submitted). ArXiv e-prints http://arxiv.org/abs/1307.3755
(2014)
5. Edwards, S., Fertil, B., Giron, A., Deschavanne, P.: A genomic schism
in birds revealed by phylogenetic analysis of DNA strings. Systematic
Biology 51(4), 599–613 (2002)
6. Pandit, A., Vadlamudi, J., Sinha, S.: Analysis of dinucleotide
signatures in HIV-1 subtype B genomes. Journal of genetics 92(3),
403–412 (2013)
7. Deschavanne, P., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic
signature: characterization and classification of species assessed by
Chaos Game Representation of sequences. Molecular Biology and
Evolution 16(10), 1391–1399 (1999)
8. Gentles, A.J., Karlin, S.: Genome-scale compositional comparisons in
eukaryotes. Genome Research 11(4), 540–546 (2001).
doi:10.1101/gr.163101
9. Deschavanne, P., Giron, A., Vilain, J., Dufraigne, C., Fertil, B.:
Genomic signature is preserved in short DNA fragments. In:
Proceedings of IEEE International Symposium on Bio-Informatics and
Biomedical Engineering, pp. 161–167 (2000)
10. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality
assessment: From error visibility to structural similarity. IEEE
Transactions on Image Processing 13(4), 600–612 (2004).
doi:10.1109/TIP.2003.819861
11. Iversen, G.R., Gergen, M., Gergen, M.M.: Statistics: The Conceptual
Approach. Springer, Berlin Heidelberg (1997)
12. Krause, E.F.: Taxicab Geometry: An Adventure in non-Euclidean
Geometry. Courier Dover Publications, Mineola, New York (2012)
13. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.: The similarity metric.
IEEE Transactions on Information Theory 50(12), 3250–3264 (2004)
14. Jeffrey, H.: Chaos game visualization of sequences. Comput. Graphics
16(1), 25–33 (1992)
15. Pride, D., Meinersmann, R., Wassenaar, T., Blaser, M.: Evolutionary
implications of microbial genome tetranucleotide frequency biases.
Genome Research 13(2), 145–158 (2003)
16. Deschavanne, P., DuBow, M., Regeard, C.: The use of genomic
signature distance between bacteriophages and their hosts displays
evolutionary relationships and phage growth cycle determination.
Virology Journal 7(1), 163 (2010)
17. Pandit, A., Sinha, S.: Using genomic signatures for HIV-1 subtyping.
BMC Bioinformatics 11(Suppl 1), 26 (2010)
18. Deza, M.M., Deza, E.: Encyclopedia of Distances. Springer, Berlin
Heidelberg (2009)
19. Kruskal, J.: Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964)
20. Supplemental Material.
https://github.com/rallis/intraSupplemental_Material
21. Karamichalis, R.: Molecular Distance Map Interactive Webtool (2014).
https://github.com/rallis/intraMoDMap
22. Pang-Ning, T., Steinbach, M., Kumar, V., et al.: Introduction to data
mining. In: Library of Congress (2006)
23. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of
selected criterion functions for document clustering. Machine Learning
55(3), 311–331 (2004)
24. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. Journal of Computational and Applied
Mathematics 20(0), 53–65 (1987). doi:10.1016/0377-0427(87)90125-7