Available via license: CC BY 4.0

Karamichalis et al.

RESEARCH

An investigation into inter- and intragenomic variations of graphic genomic signatures

Rallis Karamichalis1, Lila Kari1*, Stavros Konstantinidis2 and Steffen Kopecki1,2

Abstract

Background: Motivated by the general need to identify and classify species based on molecular evidence,

genome comparisons have been proposed that are based on measuring Euclidean distances between Chaos

Game Representation (CGR) patterns of genomic DNA sequences.

Results: We provide, on an extensive dataset and using several diﬀerent distances, conﬁrmation of the

hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are diﬀerent for DNA

sequences originating from genomes of diﬀerent species. This ﬁnding lends support to the theory that CGRs of

genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over

ﬁve hundred diﬀerent 150,000 bp genomic sequences originating from the genomes of six organisms, each

belonging to one of the kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi;

chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli

(Bacteria - full genome), and P. furiosus (Archaea - full genome). We also provide preliminary evidence of this

method’s applicability to closely related species by comparing H. sapiens (chromosome 21) sequences and over

one hundred and ﬁfty genomic sequences, also 150,000 bp long, from P. troglodytes (Animalia; chromosome

Y), for a total length of more than 101 million basepairs analyzed. We compute pairwise distances between

CGRs of these genomic sequences using six diﬀerent distances, and construct Molecular Distance Maps that

visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display

their interrelationships.

Conclusion: Our analysis conﬁrms that CGR patterns of DNA sequences from the same genome are in general

quantitatively similar, while being diﬀerent for DNA sequences from genomes of diﬀerent species. Our analysis

of the performance of the assessed distances uses three diﬀerent quality measures and suggests that several

distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In

particular we show that, for this dataset, DSSIM (Structural Dissimilarity Index) and the descriptor distance

(introduced here) are best able to classify genomic sequences.

Keywords: comparative genomics; genomic signature; species classiﬁcation

Introduction

Alongside DNA barcoding [1] and Klee diagrams [2], Chaos Game Representation (CGR) patterns of genomic segments have been proposed as another method for the classification and identification of genomic sequences [3–7]. The concept of genomic signature was first introduced in [8], as being any specific quantitative characteristic of a DNA genomic sequence that is pervasive along the genome of the same organism, while being dissimilar for DNA sequences originating from different organisms. Initial studies [3,9] suggested that short fragments of genomic sequences retain most of the characteristics of the species they come from, thus implying that genomic signatures exist. Moreover, the Chaos Game Representation (CGR) of a DNA sequence, a graphic representation of its sequence composition, was proposed in [3] as having both the pervasiveness and differentiability properties necessary for it to qualify as a genomic signature. This hypothesis was quantitatively tested and largely confirmed in [4] for 3,176 mitochondrial DNA (mtDNA) sequences, and Molecular Distance Maps were proposed therein as visualizations of species relationships based on measuring the distances between the CGR-images of their mtDNA genomes. Note that CGR patterns of mtDNA sequences can be different from those of DNA sequences from the major genome of the same organism, and that large scale quantitative analyses of the hypothesis that CGR can play the role of a genomic signature for genomic sequences have not, to our knowledge, been performed. The objective of this study is to confirm that CGR can play the role of genomic signature for genomic DNA sequences, as well as to assess various distances that can be used to compare CGRs of genomic sequences.

*Correspondence: lila.kari@uwo.ca
1 Department of Computer Science, University of Western Ontario, London, ON, Canada
Full list of author information is available at the end of the article

arXiv:1503.00162v1 [q-bio.GN] 28 Feb 2015

We analyze 508 fragments, 150 kbp (kilo base pairs)

long, taken from complete genomic DNA sequences

of six species, each representing a diﬀerent kingdom:

chromosome 21 of Homo sapiens, chromosome 4 of

Saccharomyces cerevisiae, chromosome 1 of Arabidop-

sis thaliana, chromosome 14 of Plasmodium falci-

parum, the genome of Escherichia coli, and the genome

of Pyrococcus furiosus, for a total length of 76,200,000

bp analyzed. We analyze the intergenomic and intrage-

nomic variation of CGR genomic signatures of these se-

quences by using six diﬀerent distances for image com-

parison: Structural Dissimilarity Index (DSSIM) [10],

Euclidean distance, Pearson correlation distance [11],

Manhattan distance [12], approximated information

distance [13], and a distance we propose here, called

descriptor distance. We visualize the results by com-

puting the Molecular Distance Maps of all DNA se-

quences in the database, for each of the six distances.

The resulting Molecular Distance Maps show a good

clustering of the DNA sequences, with those origi-

nating from the same genome being largely grouped

together, and separated from sequences belonging to

genomes of diﬀerent organisms. We observe that, in

some of the cases where the clustering was suboptimal,

the computation of three-dimensional Molecular Dis-

tance Maps resolves what appeared to be cluster over-

laps in the two-dimensional Molecular Distance Maps.

Lastly, using the “ground-truth” that sequences from

the same genomes should have similar structural char-

acteristics and thus be grouped together, while those

from genomes of diﬀerent organisms should be sepa-

rated, we assess the six distances by combining three

diﬀerent quality measures: correlation to an idealized

cluster distance, silhouette accuracy, and histogram

overlap. We conclude that DSSIM and the descriptor

distance perform best according to these measures. We

also provide preliminary evidence of this method’s ap-

plicability to classifying genomic DNA sequences of

closely related species by comparing the H. sapiens

(chromosome 21) sequences with 168 genomic DNA

sequences, 150 kbp long, from Pan troglodytes (chimp,

chromosome Y), for an additional length of 25,200,000

bp analyzed. Further research may lead to improve-

ments of these distances for optimal genomic DNA se-

quence identiﬁcation and classiﬁcation results.

Note that other alignment-free methods have been

used for phylogenetic analysis of DNA sequences. The

initial reports on CGRs of genomic sequences [3,14]

contained mostly qualitative assessments of CGR pat-

terns of whole genes. In [7], several datasets of up to

36 genomic DNA sequences were analyzed, and in [9]

some various-length sequences were analyzed based on

computing Euclidean distances between frequencies of

their k-mers, for k= 1, ..., 8. Subsequently, [5] com-

puted the Euclidean distance between frequencies of

k-mers (k≤5) for the analysis of 125 GenBank DNA

sequences from 20 bird species and the American

alligator. In [15], 27 microbial genomes were analyzed

to ﬁnd implications of 4-mer frequencies (k= 4) on

their evolutionary relationships. In [13], 20 mammalian

complete mtDNA sequences were analyzed using the

“similarity metric”, for k= 7. Another study, [16], an-

alyzed 459 bacteriophage genomes and compared them

with their host genomes to infer host-phage relation-

ships, by computing Euclidean distances between fre-

quencies of k-mers for k= 4. In [17], 75 complete HIV

genome sequences were compared using the Euclidean

distance between frequencies of 6-mers (k= 6), in or-

der to group them in subtypes. In [4] a dataset of 3,176

complete mtDNA sequences was analyzed, and several

Molecular Distance Maps were obtained using DSSIM

and a value of k= 9.

The main contributions of this paper are:

• We tested and confirmed for an extensive dataset, of a total length of 101,400,000 bp, the hypothesis that CGR images of genomic DNA sequences can play the role of a (graphic) genomic signature, meaning that they have a desirable genome- and species-specificity. The dataset comprised 150 kbp long sequences taken from genomes of organisms from each of the six kingdoms of life, augmented by a set of same-length genomic sequences from P. troglodytes as a test case of this method's applicability to closely related species.

• We assessed the performance of six different distances in this context, and this analysis included both same-genome and different-genome DNA fragment pairs. For several of these distances, the intragenomic values were overall smaller than intergenomic values, suggesting that this method could separate DNA genomic fragments belonging to different genomes, based on their CGRs.

• We showed that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In particular, we determined that the DSSIM distance and descriptor distance (introduced here), both of which essentially compare the k-mer composition of DNA sequences (herein k = 9), were best able to differentiate sequences originating from different genomes in this dataset.

• This study represents, to the best of our knowledge, the largest combined dataset size and value of k for this type of analysis.

• Based on preliminary data, we suggest the use of three-dimensional Molecular Distance Maps for improved visualization of the simultaneous interrelationships among similar or very distant DNA sequences.

Methods

In this section we ﬁrst describe the dataset used for our

analysis, then present an overview of the three main

steps of the method, and conclude with a description

of the six distances that we considered.

Dataset

The dataset we used includes complete genomic se-

quences from six organisms, each representing one of

the six kingdoms of life, see Table 1. For additional

information about the dataset see Appendix A.

     Organism                               NCBI Acc. Nr.
  1  H. sapiens, chrom. 21 (Animalia)       NC_000021.8
  2  E. coli (Bacteria)                     NC_000913.3
  3  S. cerevisiae, chrom. 4 (Fungi)        NC_001136.10
  4  A. thaliana, chrom. 1 (Plantae)        NC_003070.9
  5  P. falciparum, chrom. 14 (Protista)    NC_004317.2
  6  P. furiosus (Archaea)                  NC_018092.1

Table 1: NCBI accession numbers of the dataset of the complete genomic DNA sequences considered, in increasing order of their NCBI accession number.

  Organism         Length (bp)    # Letters "N"    # Fragments
  H. sapiens       48,129,895     13,023,253       234
  E. coli           4,641,652              0        30
  S. cerevisiae     1,531,933              0        10
  A. thaliana      30,427,671        164,359       201
  P. falciparum     3,291,871             37        21
  P. furiosus       1,909,827             10        12

Table 2: Organism considered, total length of genomic sequence, number of ignored letters "N", and number of DNA fragments (sequences) obtained by splitting each complete genomic DNA sequence into consecutive, non-overlapping, equal length (150 kbp) contiguous fragments.

In order to have a relatively comparable number of

DNA sequences for each organism, we chose the longest

chromosomes for all organisms except H. sapiens, for

which the shortest chromosome was chosen.

The DNA sequences in the NCBI database are rep-

resented as strings of letters “A”, “C”, “G”, “T”, and

“N” which represent the four nucleobases Adenine,

Cytosine, Guanine, Thymine, and “unidentiﬁed Nu-

cleotide”, respectively. For our analysis we ignored all

letters “N”. In S. cerevisiae and E. coli there were no

ignored letters, and in P. falciparum and P. furiosus

the number of ignored letters is of the order of 0.001%

of the length of the sequence. In H. sapiens this number

is 27%, and in A. thaliana it is 0.54%. In H. sapiens,

in particular, 96.4% of these ignored letters exist in

centromeric and telomeric regions of the chromosome.

The resulting genomic DNA sequences were di-

vided into successive, non-overlapping, contiguous

fragments, each 150 kbp long. When the last sequence

was shorter than 150 kbp, it was not included in the

analysis. This resulted in 234 fragments for H. sapiens,

30 fragments for E. coli, 10 fragments for S. cerevisiae,

201 fragments for A. thaliana, 21 fragments for P. fal-

ciparum, and 12 fragments for P. furiosus, for a total

of 508 DNA fragments, see Table 2.
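The preprocessing just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Mathematica code; the function name `split_into_fragments` and its parameters are hypothetical, and it assumes the sequence is already available as a plain string.

```python
def split_into_fragments(sequence, fragment_length=150_000):
    """Drop every letter 'N', then cut the rest into consecutive,
    non-overlapping fragments of equal length.

    A trailing piece shorter than fragment_length is discarded,
    as described in the text.
    """
    cleaned = sequence.upper().replace("N", "")
    n_full = len(cleaned) // fragment_length
    return [cleaned[i * fragment_length:(i + 1) * fragment_length]
            for i in range(n_full)]


# Toy usage with a fragment length of 4 instead of 150,000:
print(split_into_fragments("ACGTNNACGTAC", fragment_length=4))
```

The same integer division reproduces the fragment counts of Table 2, for instance (48,129,895 - 13,023,253) // 150,000 = 234 for H. sapiens.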

Overview

The method we used to analyze and classify the 508

sequences of the dataset has three steps: (i) gener-

ate graphical representations (images) of each DNA

sequence using Chaos Game Representation (CGR),

(ii) compute all pairwise distances between these im-

ages, and (iii) visualize the interrelationships implied

by these distances as two- or three-dimensional maps,

using Multi-Dimensional Scaling (MDS).

CGR is a method introduced by Jeﬀrey [3] in 1990

to visualize the structure of a DNA sequence. A CGR

associates an image to each DNA sequence as follows.

Starting from a unit square with corners labelled A, C,

G, and T, and the center of the square as the starting

point, the image is obtained by successively plotting

each nucleotide as the middle point between the cur-

rent point and the corner labelled by the nucleotide to

be plotted. If the generated square image has a size of

$2^k \times 2^k$ pixels, then every pixel represents a distinct

k-mer: a pixel is black if the k-mer it represents occurs

in the DNA sequence, otherwise it is white. CGR

images of genetic DNA sequences originating from var-

ious species show patterns such as squares, parallel

lines, rectangles, triangles, and also complex fractal

patterns, Figure 1.
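The plotting rule can be sketched as follows. The corner placement (A bottom-left, C top-left, G top-right, T bottom-right) is an assumption inferred from the layout of the FCGR matrices given later in this section, and the name `cgr_points` is hypothetical.

```python
# Assumed corner coordinates of the unit square (x, y), origin at bottom-left.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the CGR point plotted for each nucleotide of the sequence."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence:
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point towards the nucleotide's corner.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


print(cgr_points("AC"))
```

Because each step halves the distance to a corner, the most recently plotted nucleotide decides the quadrant of the point, the one before it the sub-quadrant, and so on.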

For step (i), a slight modification of the original CGR was used, introduced by Deschavanne [7]: a k-th order FCGR (frequency CGR) is a $2^k \times 2^k$ matrix that can be constructed by dividing the CGR plot into a $2^k \times 2^k$ grid, and defining the element $a_{ij}$ as the number of points that are situated in the corresponding grid square. A first and second order FCGR are shown below, where $N_w$ is the number of occurrences of the oligonucleotide $w$ in the sequence $s$.

$$FCGR_1(s) = \begin{pmatrix} N_C & N_G \\ N_A & N_T \end{pmatrix},$$

$$FCGR_2(s) = \begin{pmatrix} N_{CC} & N_{GC} & N_{CG} & N_{GG} \\ N_{AC} & N_{TC} & N_{AG} & N_{TG} \\ N_{CA} & N_{GA} & N_{CT} & N_{GT} \\ N_{AA} & N_{TA} & N_{AT} & N_{TT} \end{pmatrix}.$$

The $(k+1)$-th order $FCGR_{k+1}(s)$ can be obtained by replacing each element $N_X$ in $FCGR_k(s)$ with four elements

$$\begin{pmatrix} N_{CX} & N_{GX} \\ N_{AX} & N_{TX} \end{pmatrix}$$

where $X$ is a sequence of length $k$ over the alphabet $\{A, C, G, T\}$.
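The recursive layout above fixes the cell of every k-mer: the last letter selects the quadrant and each earlier letter a finer subdivision. The sketch below counts k-mers directly into that layout. It is an illustration under the assumption that the sequence contains only A, C, G and T; the names `kmer_cell` and `fcgr` are hypothetical.

```python
# Row and column offsets read off FCGR_1: top row C, G; left column C, A.
ROW_BIT = {"C": 0, "G": 0, "A": 1, "T": 1}
COL_BIT = {"C": 0, "A": 0, "G": 1, "T": 1}


def kmer_cell(kmer):
    """Return the (row, column) of a k-mer in the 2^k x 2^k FCGR matrix."""
    row = col = 0
    for position, base in enumerate(kmer):
        # The first letter is the finest subdivision (weight 1),
        # the last letter the coarsest (weight 2^(k-1)).
        row += ROW_BIT[base] << position
        col += COL_BIT[base] << position
    return row, col


def fcgr(sequence, k):
    """Count every (overlapping) k-mer of the sequence into its FCGR cell."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for start in range(len(sequence) - k + 1):
        row, col = kmer_cell(sequence[start:start + k])
        matrix[row][col] += 1
    return matrix


print(fcgr("ACGT", 1))  # one occurrence each of C, G (top row) and A, T (bottom row)
```

With this indexing, `kmer_cell("AC")` lands in the second row, first column, matching the position of $N_{AC}$ in the second-order matrix.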

Figure 1: $2^9 \times 2^9$ CGR images of 150 kbp genomic DNA sequences of (a) H. sapiens, (b) E. coli, (c) S. cerevisiae, (d) A. thaliana, (e) P. falciparum, and (f) P. furiosus.

For step (ii), after computing the FCGR matrices for

each of the 150 kbp sequences in our dataset, the goal

was to measure “distances” between every two CGR

images. There are many distances that can be deﬁned

and used for this purpose [18]. One of the goals of

this study was to identify which distance is better able

to diﬀerentiate the structural diﬀerences of various ge-

nomic DNA sequences and classify them based on the

species they belong to. In this paper we use six diﬀer-

ent distances: Structural Dissimilarity Index (DSSIM),

descriptor distance (deﬁned here), Euclidean distance,

Manhattan distance, Pearson correlation distance, and

approximated information distance.

For step (iii), after computing all possible pairwise

distances we obtained six diﬀerent distance matrices.

To visualize the inter-relationships between sequences

implied by each of the distance matrices, and to thus

visually assess each of the distances, we used Multi-

Dimensional Scaling (MDS). MDS is an information

visualization technique introduced by Kruskal in [19].

Given as input a distance matrix that contains the

pairwise distances among a set of items[1], the out-

put of MDS is a spatial representation of the items on

a common Euclidean space wherein each item is rep-

resented as a point and the spatial distance between

any two points corresponds to the distance between

the items in the distance matrix: Objects with a small

pairwise distance will result in points that are close to

each other, while objects with a large pairwise distance

will become points that are far apart. For example,

in [4] MDS was used in conjunction with DSSIM and

CGR to produce Molecular Distance Maps that visu-

ally display the simultaneous interrelationships among

a set of full mitochondrial DNA sequences.

The ideal Molecular Distance Map is a placement of

n items as points in an (n−1)-dimensional space. The

two-dimensional Molecular Distance Map is simply an

approximation, a flattening of this high-dimensional

space onto the plane, which may sometimes result in

erroneous positioning of some points. Increasing the

dimensionality of the Molecular Distance Map often

results in a more accurate representation of the real

interrelationships between sequences, as embodied in

the original distance matrix.
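To make the idea of turning a distance matrix into point coordinates concrete, here is a small pure-Python sketch of classical (Torgerson) MDS. This is a swapped-in textbook variant chosen for brevity and is not necessarily the MDS variant or implementation used for the maps in this paper; the function name `classical_mds`, its parameters, and the use of power iteration in place of a library eigensolver are all assumptions of this example.

```python
import math
import random


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration for the dominant eigenvector of the (deflated) matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(value * value for value in w))
            if norm == 0.0:
                break
            v = [value / norm for value in w]
        # Rayleigh quotient gives the eigenvalue belonging to v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        if eigenvalue <= 0.0:
            break  # no further positive axis to extract
        for i in range(n):
            coords[i][axis] = math.sqrt(eigenvalue) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Four corners of a 3 x 1 rectangle: a planar configuration that a
# two-dimensional map can reproduce without distortion.
corners = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
distances = [[math.dist(p, q) for q in corners] for p in corners]
embedding = classical_mds(distances, dims=2)
```

For distance matrices that are not exactly Euclidean, which is the usual case for image distances, a low-dimensional embedding can only approximate the input distances, which is the flattening effect described above.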

Distances

In this section we describe and formally deﬁne each of

the six distances used in our analysis: DSSIM, descrip-

tor distance (introduced here), Euclidean, Manhattan,

Pearson, and approximated information distance.

Structural Similarity Index, SSIM, was introduced in [10] for the purpose of assessing the degree of similarity between two images. Given two images $X, Y$ as $n \times n$ matrices having as elements integers ranging in the interval $[0, L]$, SSIM computes three factors (luminance, contrast and structure) and combines them to obtain a similarity value. However, instead of computing a global similarity between the two images, each image is divided into $11 \times 11$ sliding square windows $X^{ij}$ ($Y^{ij}$ respectively) with $i, j = 1, \dots, n-10$, which move pixel by pixel to eventually cover the entire image, and the SSIM similarity of any given pair of images is computed by comparing their corresponding windows. In addition, an $11 \times 11$ circular symmetric Gaussian weighting function $W \in \mathbb{R}^{11 \times 11}$ with a fixed standard deviation of 1.5, normalized to unit sum ($\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} = 1$), is used. Then, the mean $\mu_{x,i,j}$ ($\mu_{y,i,j}$ for $Y$), standard deviation $\sigma_{x,i,j}$ ($\sigma_{y,i,j}$ for $Y$) and covariance $\sigma_{xy,i,j}$ are computed, as follows:

$$\mu_{x,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} X^{ij}_{pq}$$

[1] In this paper the items are the 150 kbp DNA sequences analyzed.

$$\sigma_{x,i,j} = \sqrt{\sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} \left(X^{ij}_{pq} - \mu_{x,i,j}\right)^2}$$

$$\sigma_{xy,i,j} = \sum_{p=1}^{11} \sum_{q=1}^{11} W_{pq} \left(X^{ij}_{pq} - \mu_{x,i,j}\right)\left(Y^{ij}_{pq} - \mu_{y,i,j}\right)$$

where $A_{pq}$ denotes the $(p, q)$ element of the matrix $A$. Based on these values, the luminance $l(X^{ij}, Y^{ij})$, contrast $c(X^{ij}, Y^{ij})$ and structure $s(X^{ij}, Y^{ij})$ are computed as

$$l(X^{ij}, Y^{ij}) = \frac{2\mu_{x,i,j}\,\mu_{y,i,j} + C_1}{\mu_{x,i,j}^2 + \mu_{y,i,j}^2 + C_1}$$

$$c(X^{ij}, Y^{ij}) = \frac{2\sigma_{x,i,j}\,\sigma_{y,i,j} + C_2}{\sigma_{x,i,j}^2 + \sigma_{y,i,j}^2 + C_2}$$

$$s(X^{ij}, Y^{ij}) = \frac{\sigma_{xy,i,j} + C_3}{\sigma_{x,i,j}\,\sigma_{y,i,j} + C_3}$$

where $C_1 = (0.01)^2$, $C_2 = (0.03)^2$, $C_3 = C_2/2$. Then, these three factors are combined to get

$$SSIM(X^{ij}, Y^{ij}) = l(X^{ij}, Y^{ij})\, c(X^{ij}, Y^{ij})\, s(X^{ij}, Y^{ij})$$

and finally, the SSIM index used to evaluate the overall image similarity is computed as

$$SSIM(X, Y) = \frac{1}{(n-10)^2} \sum_{i=1}^{n-10} \sum_{j=1}^{n-10} SSIM(X^{ij}, Y^{ij}).$$

In theory, the values for SSIM range in the interval $[-1, 1]$, with the similarity being 1 between two identical images, 0, for example, between a black image and a white image, and $-1$ if the two images are negatively correlated; that is, $SSIM(X, Y) = -1$ if and only if $X$ and $Y$ have the same luminance $\mu$ and every pixel $x_i$ of image $X$ has the inverted value of the corresponding pixel $y_i = 2\mu - x_i$ in $Y$.

To compute the distance rather than the similarity between two images, we calculate $DSSIM(X, Y) = 1 - SSIM(X, Y)$. Consequently, the range of DSSIM is the interval $[0, 2]$: two identical images will result in a DSSIM distance of 0, while two images that are the negatives of each other would result in a DSSIM distance of 2.
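The SSIM and DSSIM definitions of this section translate directly into code. The following pure-Python sketch follows the formulas as stated here (11 × 11 Gaussian window with standard deviation 1.5, constants $C_1$, $C_2$, $C_3$); it is an unoptimized illustration, not the authors' implementation, and the function names are hypothetical. How the constants should scale with the pixel range $[0, L]$ is not spelled out in the text, so they are exposed as parameters.

```python
import math

WINDOW = 11   # side length of the sliding window
SIGMA = 1.5   # standard deviation of the Gaussian weights


def gaussian_window(size=WINDOW, sigma=SIGMA):
    """Circular symmetric Gaussian weights, normalised to unit sum."""
    centre = (size - 1) / 2
    raw = [[math.exp(-((p - centre) ** 2 + (q - centre) ** 2) / (2 * sigma ** 2))
            for q in range(size)] for p in range(size)]
    total = sum(map(sum, raw))
    return [[value / total for value in row] for row in raw]


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM over all 11 x 11 windows of two equally sized square images."""
    c3 = c2 / 2
    n = len(x)
    w = gaussian_window()
    cells = [(p, q) for p in range(WINDOW) for q in range(WINDOW)]
    steps = n - WINDOW + 1  # the n - 10 window positions per axis
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            mu_x = sum(w[p][q] * x[i + p][j + q] for p, q in cells)
            mu_y = sum(w[p][q] * y[i + p][j + q] for p, q in cells)
            var_x = sum(w[p][q] * (x[i + p][j + q] - mu_x) ** 2 for p, q in cells)
            var_y = sum(w[p][q] * (y[i + p][j + q] - mu_y) ** 2 for p, q in cells)
            cov = sum(w[p][q] * (x[i + p][j + q] - mu_x) * (y[i + p][j + q] - mu_y)
                      for p, q in cells)
            sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
            luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
            contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
            structure = (cov + c3) / (sd_x * sd_y + c3)
            total += luminance * contrast * structure
    return total / steps ** 2


def dssim(x, y):
    """Structural dissimilarity: 1 - SSIM."""
    return 1.0 - ssim(x, y)


# Toy 12 x 12 images (2 x 2 window positions); real FCGRs here are 512 x 512.
image = [[(i * j) % 3 for j in range(12)] for i in range(12)]
blank = [[0] * 12 for _ in range(12)]
print(dssim(image, image), dssim(image, blank))
```

A pure-Python double loop is far too slow for 512 × 512 matrices and many sequence pairs; in practice a vectorized or library implementation would be used.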

The descriptor distance between two FCGRs $X, Y \in \mathbb{N}^{2^k \times 2^k}$ aims to compare a combination of several different "descriptors", that is, a combination of several different aspects, of the two given FCGRs.

A descriptor is a vector characterized by parameters $m$ and $r$, as well as $r$ intervals, where $m$ determines the size, $2^m \times 2^m$, of the non-overlapping windows into which the FCGR is divided (scale of the comparison), and the $r$ intervals represent the "granularity" of the analysis, in that they define the intervals of numbers of k-mer occurrences that are considered significant.

For a given $m \le k$ and $r$, and intervals $[a_0, a_1), [a_1, a_2), \dots, [a_{r-1}, a_r)$ such that $\bigcup_{i=0}^{r-1} [a_i, a_{i+1}) = [0, \infty)$ and $[a_i, a_{i+1}) \cap [a_j, a_{j+1}) = \emptyset$ for all $i, j$ with $i \ne j$, a descriptor is constructed as follows.

Starting from the top-left corner, we divide each of the two FCGR matrices $X$ and $Y$ into non-overlapping submatrices[2] of size $2^m \times 2^m$. This procedure results in $4^{k-m}$ submatrices $X_{ij}$ and $Y_{ij}$ with $i, j = 1, \dots, 2^{k-m}$, which will be pairwise compared.

The choice of the $r$ intervals, called "bins", points to the fact that, rather than considering the finest granularity, we are interested in a coarser comparison. This means that, instead of a computationally expensive pairwise comparison of all possible numbers of occurrences of k-mers, we are interested only in certain "bins" of such numbers. For example, in our case, we use $r = 5$ and consider only 5 different bins, that is, k-mers with number of occurrences: 0 (not occurring), 1 (one occurrence), at least 2 but less than 5, at least 5 but less than 20, and 20 or more (most frequent). Formally, we use $r = 5$ and $[0, \infty) = [0,1) \cup [1,2) \cup [2,5) \cup [5,20) \cup [20,\infty)$ as the 5 bins.

Afterwards, we compute for every $X_{ij}$ a vector $vec_{X_{ij}} = \frac{1}{2^m \times 2^m}(b_1, b_2, \dots, b_r)$ where $b_i = |\{x \in X_{ij} : a_{i-1} \le x < a_i\}|$. In our case, for each $X_{ij}$, we compute a five-tuple wherein, for example, the 4th element represents the number of 9-mers whose number of occurrences is in the 4th bin, that is, at least 5 but less than 20. The division by $2^m \times 2^m$ is to obtain a probability distribution for each submatrix. The same procedure is performed for $Y_{ij}$, resulting in the vector $vec_{Y_{ij}}$.

We further append all vectors $vec_{X_{ij}}$ and form a new vector $vec_X^{m,r}$ and, using the same order of appending, we append all vectors $vec_{Y_{ij}}$ forming a new vector $vec_Y^{m,r}$. These two vectors are the "descriptors" of the FCGR matrices $X$ and $Y$ for the parameters $m$, $r$ and the $r$ chosen bins.

As a last step, we combine descriptors $vec_X^{m,r}$ (respectively $vec_Y^{m,r}$) for several values of $m$ and $r$ by appending them one after another, in the same order, to obtain the vector $vec_X$ (respectively $vec_Y$).

[2] In general, these windows (submatrices) can be overlapping, but in this paper we made the choice of using non-overlapping windows.

The descriptor distance between the two FCGRs $X$ and $Y$ is now defined as the Euclidean distance between the vectors $vec_X$ and $vec_Y$:

$$d_D(X, Y) = d_E(vec_X, vec_Y).$$

In our case we computed descriptors for $m = 4, 5, 6$, therefore forming vectors $vec_X$ and $vec_Y$ of length $5\left(\left(\frac{512}{64}\right)^2 + \left(\frac{512}{32}\right)^2 + \left(\frac{512}{16}\right)^2\right) = 6720$. In general, for a given $r$, the length of the vectors compared is $r\left((2^{k-m_1})^2 + (2^{k-m_2})^2 + \dots + (2^{k-m_p})^2\right)$, where $m_1, m_2, \dots, m_p$ are the values used for $m$. The choice of $m$ for this study was made to balance the computational cost of calculating the vector of descriptors with the ability to compare the two matrices at various scales: large ($m = 6$, that is, compare windows of size $64 \times 64$), medium ($m = 5$, windows of size $32 \times 32$) and small ($m = 4$, windows of size $16 \times 16$). The parameter $r = 5$ and the 5 bins were kept constant throughout our calculations but, in general, these parameters can also be varied, and the resulting vectors for each value added to the vector of descriptors, resulting in a larger vector.

In principle, the descriptor distance between two FCGRs effectively compares the distribution of frequencies of k-mers between the corresponding submatrices $X_{ij}$ and $Y_{ij}$, and does that for several values of $m$, that is, at several different scales. (Note that, in each window $X_{ij}$, all k-mers have the same suffix of length $k - m$.)

We now illustrate the descriptor distance by an ex-

ample wherein k= 3, m= 2, r= 3, and the 3 bins are

[0,15)∪[15,30)∪[30,∞). Since k= 3, the FCGR table

will contain the number of occurrences of all 3-mers in

a DNA sequence, as follows:

CCC GCC CGC GGC CCG GCG CGG GGG

ACC TCC AGC TGC ACG TCG AGG TGG

CAC GAC CTC GTC CAG GAG CTG GTG

AAC TAC ATC TTC AAG TAG ATG TTG

CCA GCA CGA GGA CCT GCT CGT GGT

ACA TCA AGA TGA ACT TCT AGT TGT

CAA GAA CTA GTA CAT GAT CTT GTT

AAA TAA ATA TTA AAT TAT ATT TTT

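Take the two FCGRs $X, Y \in \mathbb{N}^{8 \times 8}$ ($k = 3$, thus $2^3 \times 2^3$) corresponding to two genomic 150 kbp sequences of our dataset (one human and one bacterial), respectively. In order to use small numbers throughout the example, we divide all elements of the obtained matrices by 100 and take the integer part of each element, obtaining:

$$X = \begin{pmatrix}
42 & 33 & 9 & 33 & 14 & 10 & 15 & 45 \\
22 & 30 & 26 & 25 & 9 & 5 & 37 & 37 \\
32 & 21 & 33 & 19 & 44 & 35 & 41 & 35 \\
17 & 9 & 13 & 21 & 23 & 10 & 22 & 18 \\
37 & 26 & 6 & 32 & 34 & 24 & 9 & 23 \\
29 & 24 & 31 & 27 & 19 & 27 & 18 & 28 \\
21 & 23 & 10 & 9 & 19 & 17 & 21 & 15 \\
35 & 15 & 14 & 14 & 19 & 12 & 17 & 30
\end{pmatrix},$$

$$Y = \begin{pmatrix}
18 & 34 & 40 & 27 & 30 & 36 & 27 & 12 \\
27 & 18 & 27 & 32 & 24 & 23 & 15 & 23 \\
24 & 17 & 13 & 17 & 36 & 12 & 32 & 18 \\
27 & 17 & 28 & 26 & 18 & 8 & 22 & 25 \\
32 & 32 & 23 & 16 & 16 & 25 & 23 & 22 \\
20 & 29 & 18 & 25 & 16 & 16 & 15 & 17 \\
25 & 25 & 7 & 16 & 26 & 27 & 20 & 25 \\
32 & 21 & 20 & 21 & 25 & 18 & 27 & 34
\end{pmatrix}.$$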

Thus, in the human DNA sequence, the triplet CCC

appears about 4200 times, the triplet GCC appears

about 3300 times, the triplet CGC appears about 900

times, etc.

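Since $m = 2$, we divide each of the matrices $X$ and $Y$ into non-overlapping submatrices of size $4 \times 4$ ($2^2 \times 2^2$). For $X$ we thus obtain $X_{11}, X_{12}, X_{21}, X_{22}$:

$$X_{11} = \begin{pmatrix} 42 & 33 & 9 & 33 \\ 22 & 30 & 26 & 25 \\ 32 & 21 & 33 & 19 \\ 17 & 9 & 13 & 21 \end{pmatrix}, \quad
X_{12} = \begin{pmatrix} 14 & 10 & 15 & 45 \\ 9 & 5 & 37 & 37 \\ 44 & 35 & 41 & 35 \\ 23 & 10 & 22 & 18 \end{pmatrix},$$

$$X_{21} = \begin{pmatrix} 37 & 26 & 6 & 32 \\ 29 & 24 & 31 & 27 \\ 21 & 23 & 10 & 9 \\ 35 & 15 & 14 & 14 \end{pmatrix}, \quad
X_{22} = \begin{pmatrix} 34 & 24 & 9 & 23 \\ 19 & 27 & 18 & 28 \\ 19 & 17 & 21 & 15 \\ 19 & 12 & 17 & 30 \end{pmatrix},$$

and similarly for $Y$.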

Since the $r = 3$ bins are $[0,15) \cup [15,30) \cup [30,\infty)$, we will count, for each submatrix, the number of 3-mers for which the number of occurrences is less than 15, at least 15 but less than 30, and greater than or equal to 30. Thus we obtain $vec_{X_{11}} = \frac{1}{16}(3, 7, 6)$, which has as elements the number of elements of $X_{11}$ which belong in each of the intervals selected, divided by the total number of elements of $X_{11}$. We proceed similarly for $vec_{X_{12}} = \frac{1}{16}(5, 4, 7)$, $vec_{X_{21}} = \frac{1}{16}(5, 7, 4)$, $vec_{X_{22}} = \frac{1}{16}(2, 12, 2)$ and we form $vec_X$ by appending these vectors one after the other, that is

$$vec_X = \frac{1}{16}(3, 7, 6, 5, 4, 7, 5, 7, 4, 2, 12, 2).$$

We apply exactly the same procedure for the matrix $Y$ and we get

$$vec_Y = \frac{1}{16}(1, 12, 3, 3, 9, 4, 1, 12, 3, 0, 15, 1).$$

The descriptor distance between these two FCGRs is computed as the Euclidean distance between $vec_X$ and $vec_Y$, in this case $d_D(X, Y) \approx 0.718$. Note that, since we started by dividing the number of 3-mer occurrences by 100, as well as because of the bin selection, this is a fictitious example. The real value of the descriptor distance between the mentioned human and bacterial sequences is 8.66, and the range of the descriptor distance for this dataset of DNA sequences is [0, 13.17]. In general, the descriptor distance has a variable range that depends on the choices of parameters used.
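The worked example can be reproduced with a short script. This is a sketch of the descriptor construction as described in this section, not the authors' code; the function names and the convention of passing the bins as a list of lower edges (the last bin being open-ended) are assumptions of this example.

```python
import bisect
import math


def descriptor(fcgr, m, bin_edges):
    """Normalised bin counts of every non-overlapping 2^m x 2^m window,
    appended window by window in row-major order.

    bin_edges holds the lower edge of each bin, e.g. [0, 15, 30].
    """
    size = 2 ** m
    n = len(fcgr)
    vector = []
    for top in range(0, n, size):
        for left in range(0, n, size):
            counts = [0] * len(bin_edges)
            for i in range(top, top + size):
                for j in range(left, left + size):
                    counts[bisect.bisect_right(bin_edges, fcgr[i][j]) - 1] += 1
            vector.extend(count / (size * size) for count in counts)
    return vector


def descriptor_distance(x, y, ms, bin_edges):
    """Euclidean distance between the descriptors appended over all m in ms."""
    vec_x = [v for m in ms for v in descriptor(x, m, bin_edges)]
    vec_y = [v for m in ms for v in descriptor(y, m, bin_edges)]
    return math.dist(vec_x, vec_y)


X = [[42, 33, 9, 33, 14, 10, 15, 45],
     [22, 30, 26, 25, 9, 5, 37, 37],
     [32, 21, 33, 19, 44, 35, 41, 35],
     [17, 9, 13, 21, 23, 10, 22, 18],
     [37, 26, 6, 32, 34, 24, 9, 23],
     [29, 24, 31, 27, 19, 27, 18, 28],
     [21, 23, 10, 9, 19, 17, 21, 15],
     [35, 15, 14, 14, 19, 12, 17, 30]]
Y = [[18, 34, 40, 27, 30, 36, 27, 12],
     [27, 18, 27, 32, 24, 23, 15, 23],
     [24, 17, 13, 17, 36, 12, 32, 18],
     [27, 17, 28, 26, 18, 8, 22, 25],
     [32, 32, 23, 16, 16, 25, 23, 22],
     [20, 29, 18, 25, 16, 16, 15, 17],
     [25, 25, 7, 16, 26, 27, 20, 25],
     [32, 21, 20, 21, 25, 18, 27, 34]]

print(round(descriptor_distance(X, Y, ms=[2], bin_edges=[0, 15, 30]), 3))  # 0.718
```

For the setting used in the paper one would instead call `descriptor_distance(X, Y, ms=[4, 5, 6], bin_edges=[0, 1, 2, 5, 20])` on 512 × 512 FCGRs, which yields vectors of length 6720.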

To compute the Euclidean, Manhattan and Pearson distances, we first convert the matrices $X, Y \in \mathbb{N}^{n \times n}$ into $1 \times n^2$ vectors. For two vectors $x, y \in \mathbb{R}^n$, their Euclidean distance $d_E(x, y)$ and their Manhattan distance $d_M(x, y)$ are computed as

$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|,$$

while their Pearson distance $d_P(x, y)$ is defined as

$$d_P(x, y) = 1 - \frac{\sigma_{xy}}{\sigma_x \sigma_y},$$

where

$$\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)^2}, \qquad \sigma_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y).$$

In theory, the correlation coefficient $\frac{\sigma_{xy}}{\sigma_x \sigma_y}$ ranges in the interval $[-1, 1]$, and therefore the Pearson distance ranges in the interval $[0, 2]$.
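These three vector distances are simple enough to state in code. The sketch below mirrors the definitions above; the function names are hypothetical, and the Pearson distance is undefined (division by zero) when either vector is constant.

```python
import math


def flatten(matrix):
    """Turn an n x n matrix into a vector of length n^2, row by row."""
    return [value for row in matrix for value in row]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """1 minus the sample correlation coefficient of x and y."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / (n - 1))
    return 1.0 - cov / (sd_x * sd_y)


a = flatten([[1, 2], [3, 4]])
b = flatten([[2, 4], [6, 8]])
print(euclidean(a, b), manhattan(a, b), pearson_distance(a, b))
```

Note that the Pearson distance ignores overall scale: `a` and `b` above are far apart in the Euclidean and Manhattan sense but perfectly correlated.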

The last distance we considered is based on the information distance defined in [13]. The use of this distance is motivated computationally since it is easily computed from FCGRs, as it tracks the number of different k-mers for a sequence instead of the actual set. In [13], for a given $k$, the information distance for two strings $x, y$ is defined as

$$d_{AID}(x, y) = \frac{N_k(x|y) + N_k(y|x)}{N_k(xy)}$$

with

$$N_k(x|y) = N_k(xy) - N_k(x)$$

where $N_k(x)$ is the number of different k-mers (possibly overlapping) which occur in $x$. We go one step further and modify this in order to avoid the creation of "unwanted" k-mers from the concatenation $xy$ of $x$ and $y$. First, we show how we compute $N_k(x)$ for a sequence $x$. We build its $FCGR(x) = X \in \mathbb{N}^{2^k \times 2^k}$, which is a $2^k \times 2^k$ matrix with element values in $\mathbb{N}$. Then we unitize $X$, that is, every non-zero entry becomes 1, while zeros remain 0. $N_k(x)$ is now computed as the sum of the elements of this unitized FCGR, that is, $N_k(x) = f(X) = \mathrm{SumOfElements}(\mathrm{Unitize}(X))$. For two strings $x$ and $y$, with FCGRs $X$ and $Y$ respectively, we define $N_k(x|y)$ as:

$$N_k(x|y) = f(X + Y) - N_k(x) \qquad (1)$$

This slight modification of the information distance gives us also the desired properties of $d(x, x) = 0$ and $d(x, y) = d(y, x)$ which were not satisfied before. Using (1), we now define the approximated information distance (AID) as:

$$d_{AID}(x, y) = 2 - \frac{f(X) + f(Y)}{f(X + Y)} \qquad (2)$$

where $x, y$ are the strings and $X, Y \in \mathbb{N}^{2^k \times 2^k}$ their FCGRs, respectively. It also turns out that this distance is in fact the normalised Hamming distance of the unitized FCGRs $X$ and $Y$. Note that, for two sets $\mathcal{X}$ and $\mathcal{Y}$, the normalized Hamming distance is $\frac{|\mathcal{X} \,\triangle\, \mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|} = 2 - \frac{|\mathcal{X}| + |\mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|}$, where $\triangle$ denotes the symmetric difference.
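Equation (2) and its reading as a normalized Hamming distance can be checked on a toy pair of matrices. The sketch below is illustrative only; `unitized_sum` plays the role of $f$, both names are hypothetical, and the distance is undefined when both FCGRs are entirely zero.

```python
def unitized_sum(matrix):
    """f(X): the number of non-zero entries, i.e. of distinct k-mers present."""
    return sum(1 for row in matrix for value in row if value != 0)


def aid(x, y):
    """Approximated information distance between two FCGR matrices."""
    summed = [[a + b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]
    return 2.0 - (unitized_sum(x) + unitized_sum(y)) / unitized_sum(summed)


# Toy 2 x 2 "FCGRs": X contains two distinct 1-mers, Y two, and they share one.
X = [[1, 0], [2, 0]]
Y = [[0, 0], [5, 3]]
print(aid(X, Y))  # symmetric difference of size 2 over a union of size 3
```

Because only presence or absence of each k-mer matters, the AID discards the frequency information that the other five distances use.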

The generation of CGR images, calculation of dis-

tance matrices and creation of 2D and 3D Molecu-

lar Distance Maps with MDS were done and can be

tested with the code available in [20], written in Wolfram

Mathematica, version 9. The interactive webtool

MoDMap [21] allows in-depth exploration of the 2D

MoD Maps (Molecular Distance Maps) in this paper[3].

[3] When using the interactive webtool MoDMap, clicking on a distance underneath a dataset will result in plotting the MoD Map of the dataset computed with that distance. On any particular MoD Map, clicking on a point will display a window with information about the subsequence represented by that point: its NCBI accession number, scientific name of the organism it originates from, and its CGR pattern. Clicking on the "From here" and "To here" buttons on two such selected windows will display the distance between the corresponding genomic subsequences in the distance matrix.

Online Supplemental Material [20] includes all distance matrices and the code used to produce all figures and plots in this paper. More details about the online resources can be found in Appendix B.

Analysis and Results

For our dataset, we use $k = 9$, that is, each DNA sequence was represented as a $2^9 \times 2^9$ FCGR matrix. In practice, this means that the FCGR of a DNA sequence contains the full information regarding its k-mer sequence composition, for $k = 1, 2, \dots, 9$. The length choice of 150 kbp and value of $k = 9$ is justified by the fact that, for a random sequence of length 150 kbp, its CGR at resolution $2^9 \times 2^9$ has around half of the pixels black, and half white.

Figure 2 depicts two-dimensional Molecular Distance Maps for the over five hundred DNA sequences in our dataset, computed using the DSSIM distance, descriptor distance, Euclidean distance, Manhattan distance, Pearson distance and approximated information distance, respectively. Figure 3 depicts the corresponding three-dimensional Molecular Distance Maps for the same dataset. The projection of each three-dimensional map is chosen by hand in order to visually separate clusters of points which appear to be overlapping in the two-dimensional maps, as discussed below.

We note that MDS is not a clustering method, as the clusters are defined beforehand by the coloring scheme used (blue for H. sapiens, green for E. coli, and so on). MDS simply tries to display visually the interrelationships between the given items, based on the pairwise distances in the distance matrix which is its input. Note also that an increase in dimensionality from 2 to 3 can lead to a better cluster visualization. For example, if we compare the two-dimensional and the three-dimensional Molecular Distance Maps obtained using DSSIM, we see that points that appeared to be erroneously mixed with each other in the two-dimensional map, Figure 2(a) (S. cerevisiae and P. falciparum sequences mixed in with A. thaliana sequences), were in fact clearly separated from each other in Figure 3(a), the three-dimensional version of the Molecular Distance Map.

Figure 4 displays the histograms of the pairwise intragenomic distances (dark blue and turquoise) and intergenomic distances (grey) of DNA sequences from H. sapiens and A. thaliana, obtained using each of the six distances. As noted, some distances seem to perform better than others. Visually, the poorest performer for these two sets of sequences (from H. sapiens and A. thaliana) seems to be the Euclidean distance, for which the intragenomic distances are as high as the intergenomic distances and no separation is visible. In contrast, DSSIM gives, for the same data, intergenomic distances that are overall much higher than intragenomic distances, resulting in a clear classification of DNA sequences into the species they belong to.

Table 3 displays the mean and standard deviation of distances between clusters $C_i$ and $C_j$, $1 \le i, j \le 6$, where a cluster $C_\ell$ is defined as the set of all genomic sequences from the genome of organism $\ell$, as labelled in Table 1. In each subtable, the diagonal entries give the means and standard deviations of intragenomic distances, while the other entries are all intergenomic distances. From this table we see that for DSSIM, Manhattan and approximated information distance, the maximum of all the averages of intragenomic distances in this dataset is strictly smaller than the minimum of all the averages of intergenomic distances. For the descriptor distance and Pearson distance the previous statement does not hold but, for each pair of organisms, the two averages of intragenomic distances (e.g., human-human and plant-plant) are both lower than the average of the intergenomic distances (human-plant). For the Euclidean distance, none of the previous statements holds: for example, the average of the plant-plant intragenomic distances (element 4-4 in the Euclidean distance subtable of Table 3) is 723, which is larger than 672, the average of the yeast-plant intergenomic distances (element 3-4 in the Euclidean distance subtable of Table 3). The complete histograms of all pairwise comparisons $C_i$−$C_j$ can be found in Appendix C.
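The first criterion above (largest intragenomic mean below smallest intergenomic mean) can be sketched in Python as follows. The helper names are hypothetical; the two value lists are the means reported in the DSSIM and Euclidean subtables of Table 3, entered row by row from the diagonal rightwards.

```python
from itertools import combinations
from statistics import mean


def cluster_means(dist, labels):
    """Mean distance for every unordered pair of clusters, keyed by (i, j)."""
    pairs = {}
    for a, b in combinations(range(len(labels)), 2):
        key = tuple(sorted((labels[a], labels[b])))
        pairs.setdefault(key, []).append(dist[a][b])
    return {key: mean(values) for key, values in pairs.items()}


def intra_below_inter(means):
    """True if the largest intragenomic mean is below the smallest intergenomic mean."""
    intra = [m for (i, j), m in means.items() if i == j]
    inter = [m for (i, j), m in means.items() if i != j]
    return max(intra) < min(inter)


def from_rows(rows):
    """Turn upper-triangular rows (row i lists columns i..p) into {(i, j): mean}."""
    return {(i + 1, i + 1 + k): v for i, row in enumerate(rows) for k, v in enumerate(row)}


# Means copied from Table 3 (standard deviations omitted).
DSSIM = from_rows([[0.81, 0.99, 0.92, 0.91, 0.92, 0.91], [0.85, 0.97, 0.99, 0.99, 0.99],
                   [0.87, 0.89, 0.91, 0.91], [0.87, 0.90, 0.91], [0.74, 0.94], [0.83]])
EUCLIDEAN = from_rows([[756, 856, 756, 818, 3914, 812], [558, 674, 802, 4102, 696],
                       [564, 672, 3964, 633], [723, 3923, 748], [999, 4085], [585]])

print(intra_below_inter(DSSIM), intra_below_inter(EUCLIDEAN))
```

`cluster_means` shows how such a table of means could be derived from a full distance matrix and a list of cluster labels; `intra_below_inter` then applies the criterion to any such table.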


DSSIM
-    1            2            3            4            5            6
1    0.81 ± 0.04  0.99 ± 0.01  0.92 ± 0.02  0.91 ± 0.03  0.92 ± 0.03  0.91 ± 0.02
2    -            0.85 ± 0.01  0.97 ± 0.01  0.99 ± 0.01  0.99 ± 0.01  0.99 ± 0.
3    -            -            0.87 ± 0.01  0.89 ± 0.02  0.91 ± 0.    0.91 ± 0.01
4    -            -            -            0.87 ± 0.03  0.9 ± 0.02   0.91 ± 0.01
5    -            -            -            -            0.74 ± 0.01  0.94 ± 0.
6    -            -            -            -            -            0.83 ± 0.01

Descriptors
-    1            2            3            4            5            6
1    3.76 ± 1.69  9.74 ± 0.66  5.92 ± 1.14  5.71 ± 1.41  9.33 ± 1.23  5.44 ± 0.92
2    -            2.5 ± 0.28   8.05 ± 0.39  9.1 ± 0.55   12.67 ± 0.19 9.38 ± 0.41
3    -            -            2.12 ± 0.08  3.42 ± 1.05  9.48 ± 0.31  4.6 ± 0.09
4    -            -            -            2.75 ± 1.33  8.23 ± 0.94  4.94 ± 0.76
5    -            -            -            -            1.53 ± 0.14  9.99 ± 0.28
6    -            -            -            -            -            2.4 ± 0.32

Euclidean
-    1            2            3            4            5            6
1    756 ± 498    856 ± 349    756 ± 361    818 ± 514    3914 ± 510   812 ± 356
2    -            558 ± 5      674 ± 17     802 ± 366    4102 ± 466   696 ± 18
3    -            -            564 ± 11     672 ± 383    3964 ± 472   633 ± 20
4    -            -            -            723 ± 535    3923 ± 506   748 ± 372
5    -            -            -            -            999 ± 276    4085 ± 468
6    -            -            -            -            -            585 ± 24

Manhattan (in thousands)
-    1            2            3            4            5            6
1    171 ± 15     222 ± 5      189 ± 13     188 ± 17     213 ± 20     191 ± 9
2    -            175 ± 2      209 ± 4      219 ± 8      252 ± 4      218 ± 3
3    -            -            171 ± 2      177 ± 10     206 ± 2      184 ± 2
4    -            -            -            172 ± 16     200 ± 11     188 ± 9
5    -            -            -            -            105 ± 3      224 ± 2
6    -            -            -            -            -            167 ± 3

Pearson
-    1            2            3            4            5            6
1    0.5 ± 0.12   0.97 ± 0.02  0.69 ± 0.1   0.64 ± 0.12  0.65 ± 0.09  0.81 ± 0.06
2    -            0.71 ± 0.02  0.93 ± 0.02  0.96 ± 0.02  0.98 ± 0.01  0.99 ± 0.02
3    -            -            0.6 ± 0.02   0.6 ± 0.07   0.71 ± 0.03  0.75 ± 0.02
4    -            -            -            0.53 ± 0.11  0.63 ± 0.09  0.76 ± 0.04
5    -            -            -            -            0.02 ± 0.01  0.94 ± 0.01
6    -            -            -            -            -            0.64 ± 0.03

Approx. Information
-    1            2            3            4            5            6
1    0.65 ± 0.03  0.78 ± 0.01  0.7 ± 0.03   0.7 ± 0.03   0.76 ± 0.04  0.69 ± 0.02
2    -            0.67 ± 0.    0.75 ± 0.01  0.77 ± 0.02  0.85 ± 0.01  0.77 ± 0.01
3    -            -            0.67 ± 0.01  0.68 ± 0.02  0.74 ± 0.    0.69 ± 0.
4    -            -            -            0.67 ± 0.03  0.73 ± 0.02  0.69 ± 0.02
5    -            -            -            -            0.64 ± 0.01  0.76 ± 0.01
6    -            -            -            -            -            0.65 ± 0.01

Table 3 Mean and standard deviation of distances between clusters $C_i$−$C_j$ for $i, j = 1, \ldots, 6$.

(a) DSSIM distance. (b) Descriptors distance. (c) Euclidean distance. (d) Manhattan distance. (e) Pearson distance. (f) Approx. inform. distance.

Figure 2 Two-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using DSSIM, descriptor, Euclidean, Manhattan, Pearson and approximated information distance, respectively. Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turquoise), P. falciparum (magenta), and P. furiosus (orange).

Quality Measures for Distances

In this section we present three quality measures, each of which evaluates the quality of the six distances considered. In the data mining literature a wide range of quality measures for clusterings has been defined; see for example [22,23]. Most of these methods are designed to assess the quality of different automated clustering methods while using the same distance. Our set-up is different, as we use different distances while the clustering is fixed and given by the initial colour-coding of the sequence-representing points. Thus, we have to use other approaches to compare the distances we analyze. In particular, as the six distances have different ranges, we have to use assessment methods which are invariant to the scale of the distance.

(a) DSSIM distance. (b) Descriptors distance. (c) Euclidean distance. (d) Manhattan distance. (e) Pearson distance. (f) Approx. inform. distance.

Figure 3 Three-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using DSSIM, descriptor, Euclidean, Manhattan, Pearson and approximated information distance, respectively. Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turquoise), P. falciparum (magenta), and P. furiosus (orange).

The “ground-truth” that we use as a basis for our distance assessment is the fact that the “ideal” clustering of DNA sequences and the points that represent them is known: sequences from the same organism should be close to one another and far from sequences originating from other organisms. (This assumption is justified for this dataset, as the six organisms considered are very different from one another, belonging to different kingdoms of life.) Thus, an optimal distance should yield a relatively small distance between two FCGRs which were generated from DNA sequences originating from the same organism, and relatively high distances between two FCGRs originating from DNA sequences coming from different organisms.

(a) DSSIM distance. (b) Descriptors distance. (c) Euclidean distance. (d) Manhattan distance. (e) Pearson distance. (f) Approx. inform. distance.

Figure 4 Histograms of pairwise intragenomic and intergenomic distances among the DNA sequences from H. sapiens and A. thaliana.

In order to assess each of the six distances quantita-

tively, we computed three quality measures which rate

diﬀerent features of a distance:

• the correlation to an idealized cluster distance
• the silhouette cluster accuracy
• the relative overlap between the intragenomic and intergenomic distance histograms.

Let us stress that all three quality measures of the six

distances are based on the distance matrices which we

computed and not on their MDS plots. We will deﬁne

the three quality measures such that their expected

values range in the interval [0,1] where higher values

correspond to better performance.

Let us first describe the three quality measures informally. An idealized distance is a distance that would be able to differentiate DNA sequences by species, that is, a distance $\delta$ for which $\delta(x, y) = 0$ if $x$ and $y$ are sequences from the same species and $\delta(x, y) = 1$ otherwise. The first quality measure, the correlation to an idealized cluster distance, measures how well a distance is linearly correlated to the idealized distance $\delta$. The second quality measure, silhouette cluster accuracy, is the percentage of points that are best embedded in the cluster they belong to. The third quality measure quantifies the “visual overlap” between the intragenomic and intergenomic distance histograms.

Given our dataset, it is reasonable to expect that a good distance gives a low value if applied to FCGRs of genomic sequences of the same organism, and a high value when applied to FCGRs of genomic sequences from two different organisms, thus separating the histograms of intragenomic distances from those of intergenomic distances. This is illustrated by the histograms in Figure 4, where a high overlap between the graphs of intragenomic distances (dark blue and turquoise) and the graphs of intergenomic distances (grey) is an indication of a poorly performing distance. In a theoretically optimal situation, there would exist a value $c$ such that all distances that are smaller than $c$ are intragenomic distances and all distances that are larger than $c$ are intergenomic distances. This can usually not be expected from real data, but a low overlap between histograms is nevertheless indicative of a “good” distance.

In order to formally define the three quality measures, we consider a dataset $V$ which is partitioned into $p$ non-overlapping clusters $C_1, \ldots, C_p$ and on which a distance $d_\alpha : V \times V \to \mathbb{R}_{\ge 0}$ is defined. The cardinalities of the sets are $|V| = m$ and $|C_i| = m_i$ for $i = 1, \ldots, p$. In our analysis, $p = 6$ and $C_1$ contains all FCGRs generated from genomic DNA sequences from H. sapiens, $C_2$ contains all FCGRs generated from genomic sequences of E. coli, and so on, according to the order in Table 1. The distance $d_\alpha$ is one of the six distances, $\alpha \in \{\mathrm{DSSIM}, \mathrm{D}, \mathrm{E}, \mathrm{M}, \mathrm{P}, \mathrm{AID}\}$.

The correlation to an idealized cluster distance is computed as follows. We define the idealized cluster distance as a function (or matrix) $\delta : V \times V \to \{0, 1\}$ such that $\delta(x, y) = 0$ if and only if $x$ and $y$ belong to the same cluster, and $\delta(x, y) = 1$ otherwise. Because we can view $d_\alpha$ and $\delta$ as discrete, symmetric functions which have the same domain, we can compute their correlation coefficient. We define the correlation of $\delta$ to $d_\alpha$ to be the Pearson correlation of $\delta$ and $d_\alpha$. More precisely, the upper triangular part of the matrix corresponding to a distance $d_\alpha$ is interpreted as a vector $(x_1, \ldots, x_n)$ and compared with the corresponding values $(y_1, \ldots, y_n)$ given by $\delta$. We obtain the $\delta$-correlation as

$$D_\alpha = \frac{\sigma_{xy}}{\sigma_x \sigma_y},$$

where $\sigma_{xy}$ is the covariance of the two vectors and $\sigma_x$, $\sigma_y$ are their standard deviations.

The correlation ranges in the interval $[-1, 1]$: a value of 1 means that $d_\alpha$ and $\delta$ are perfectly linearly correlated, and a value of 0 means that they are unrelated. In other words, if the value obtained by measuring the correlation of a given distance to the idealized cluster distance is close to 1, this means that the given distance is close to the idealized cluster distance and hence performs well. Note that negative values for this measure are not expected, as this would imply that $d_\alpha$ and $\delta$ were negatively related ($d_\alpha$ would perform worse than a matrix containing random entries).
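As a minimal sketch of this measure, the following Python function computes the Pearson correlation between the off-diagonal upper triangle of a distance matrix and the idealized cluster distance. Excluding the diagonal is an assumption made here; the function name and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def delta_correlation(dist, labels):
    """Pearson correlation between a distance matrix and the idealized 0/1 distance."""
    pairs = list(combinations(range(len(labels)), 2))
    x = [dist[a][b] for a, b in pairs]
    # Idealized distance: 0 within a cluster, 1 across clusters.
    y = [0 if labels[a] == labels[b] else 1 for a, b in pairs]
    mx, my = mean(x), mean(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    # The common 1/n (or 1/(n-1)) factors cancel in the ratio.
    return sxy / sqrt(sxx * syy)


# Toy example: two clusters of two points each (hypothetical distances).
labels = [0, 0, 1, 1]
dist = [[0, 1, 5, 6],
        [1, 0, 7, 8],
        [5, 7, 0, 2],
        [6, 8, 2, 0]]
print(delta_correlation(dist, labels))
```

The function requires at least two clusters and a distance matrix that is not constant off the diagonal, since otherwise one of the standard deviations is zero.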

The silhouette cluster accuracy is based on the silhouette coefficient, defined in [24] as a measure that determines how well a single point is embedded in the cluster to which it belongs. For a point $x$ from cluster $C_i$ we define $a_x$ as the average distance of this point to all other points in $C_i$, that is,

$$a_x = \frac{1}{m_i - 1} \sum_{y \in C_i,\, y \ne x} d_\alpha(x, y),$$

and we define $b_x$ as the minimum over the average distances of $x$ to all points of a different cluster,

$$b_x = \min_{j = 1, \ldots, p,\; j \ne i} \left( \frac{1}{m_j} \sum_{y \in C_j} d_\alpha(x, y) \right).$$

The silhouette coefficient of $x$ is defined as

$$S_\alpha(x) = \frac{b_x - a_x}{\max\{a_x, b_x\}}.$$

If a point $x$ has a silhouette coefficient $S_\alpha(x) \le 0$, then $x$ is at least as close to a cluster to which it does not belong as to its own cluster. The silhouette cluster accuracy $A_\alpha$ denotes the fraction of points with a silhouette coefficient greater than 0, that is, the fraction of points which are well-embedded in their own cluster,

$$A_\alpha = \frac{|\{x \in V \mid S_\alpha(x) > 0\}|}{m}.$$

Obviously, the silhouette cluster accuracy ranges in $[0, 1]$, with a high accuracy being desirable.
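The definitions above translate directly into a short Python sketch. It assumes that every cluster contains at least two points and that not all distances from a point are zero; the function names and toy matrices are hypothetical.

```python
from statistics import mean


def silhouette(dist, labels, x):
    """Silhouette coefficient (b_x - a_x) / max(a_x, b_x) of point x."""
    own = labels[x]
    n = len(labels)
    # a_x: average distance to the other points of x's own cluster.
    a = mean(dist[x][y] for y in range(n) if labels[y] == own and y != x)
    # b_x: smallest average distance to the points of any other cluster.
    b = min(
        mean(dist[x][y] for y in range(n) if labels[y] == other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)


def silhouette_accuracy(dist, labels):
    """Fraction of points whose silhouette coefficient is strictly positive."""
    scores = [silhouette(dist, labels, x) for x in range(len(labels))]
    return sum(s > 0 for s in scores) / len(labels)


labels = [0, 0, 1, 1]
well_separated = [[0, 1, 5, 6], [1, 0, 7, 8], [5, 7, 0, 2], [6, 8, 2, 0]]
mixed = [[0, 9, 5, 6], [9, 0, 1, 1], [5, 1, 0, 2], [6, 1, 2, 0]]
print(silhouette_accuracy(well_separated, labels), silhouette_accuracy(mixed, labels))
```

In the second toy matrix the two points of the first cluster are far from each other and closer to the second cluster, so only half of the points count as well-embedded.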

For assessing the relative overlap of the histograms, consider any two clusters $C_i$ and $C_j$ with $i \ne j$ (for example, $C_1$ is the H. sapiens cluster and $C_4$ the A. thaliana cluster). We compare the two sets of intragenomic distances $C_i$–$C_i$ and $C_j$–$C_j$ with the set of intergenomic distances $C_i$–$C_j$. For a distance $d_\alpha$, we divide the range from the minimum distance $\min(d_\alpha)$ to the maximum distance $\max(d_\alpha)$ in this dataset into 100 bins of size

$$r = \frac{\max(d_\alpha) - \min(d_\alpha)}{100}$$

and count the distances which fall into each bin: $c_{i,i}[\ell]$ denotes bin $\ell$ containing distances from $C_i$–$C_i$ and $c_{i,j}[\ell]$ denotes bin $\ell$ containing distances from $C_i$–$C_j$. For $\ell = 1, \ldots, 100$ we let

$$c_{i',j'}[\ell] = |\{\{x, y\} \mid x \in C_{i'},\ y \in C_{j'} \text{ and } x \ne y \text{ and } (\ell - 1) \cdot r < d_\alpha(x, y) \le \ell \cdot r\}|.$$

By $s_{i',j'}$ we denote the sum over all $c_{i',j'}$-bins, $s_{i',j'} = \sum_{\ell=1}^{100} c_{i',j'}[\ell]$. We define the relative overlap $O_\alpha(i, j)$ of $C_i$–$C_i$ (intragenomic distances) with $C_i$–$C_j$ (intergenomic distances) as

$$O_\alpha(i, j) = \frac{\max\{s_{i,i}, s_{i,j}\}}{\min\{s_{i,i}, s_{i,j}\}} \cdot \frac{\sum_{\ell=1}^{100} \min\{c_{i,i}[\ell], c_{i,j}[\ell]\}}{\sum_{\ell=1}^{100} \max\{c_{i,i}[\ell], c_{i,j}[\ell]\}}.$$

The relative overlap $O_\alpha(j, i)$ of $C_j$–$C_j$ with $C_i$–$C_j$ is defined analogously; note that $O_\alpha(i, j) \ne O_\alpha(j, i)$ in general. The overlap is normalized to the range $[0, 1]$, where 0 means no overlap of elements of bins between intra- and intergenomic distances, and 1 means that one of the histograms completely “covers” the other. Also note that we are not interested in the overlap of $C_i$–$C_i$ with $C_j$–$C_j$, as both sets of distances are intragenomic distances.

Since we intend to define a quality measure where a value close to 1 represents a small overlap, we will use $1 - O_\alpha(i, j)$ as relative overlap. Furthermore, we combine these quantities for all possible pairs of clusters $C_i$ and $C_j$, obtaining the relative overlap as:

$$O_\alpha = 1 - \frac{1}{p(p-1)} \sum_{i=1}^{p} \sum_{j=1,\, j \ne i}^{p} O_\alpha(i, j).$$

For example, in Figure 4, for each of the considered distances, the dark blue histograms depict the $C_1$−$C_1$ (H. sapiens – H. sapiens) intragenomic distances, the turquoise histograms the $C_4$−$C_4$ (A. thaliana – A. thaliana) intragenomic distances, and the grey histograms the $C_1$−$C_4$ (H. sapiens – A. thaliana) intergenomic distances. As seen from this figure, the descriptor distance appears to visually perform best at separating the two intragenomic distance histograms from the intergenomic histogram, while the Euclidean distance has the weakest performance. The relative overlap attempts to quantify this by computing the overlaps of each of the two pairs of histograms (dark blue with grey and turquoise with grey). Note that a small visual histogram overlap results in a high value of the relative overlap measure $O_\alpha$, which is indicative of a better performing distance.
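The relative overlap can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: bins are taken left-closed over the range from the smallest to the largest off-diagonal distance (a simplification of the bin boundaries given above), every cluster is assumed to contain at least two points, and the distances are assumed not to be all equal. All function names are hypothetical.

```python
def histogram(values, lo, hi, bins=100):
    """Counts per bin for equal-width bins on [lo, hi]; hi falls into the last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def pair_overlap(intra, inter, lo, hi, bins=100):
    """Overlap of one intragenomic histogram with one intergenomic histogram."""
    c_intra = histogram(intra, lo, hi, bins)
    c_inter = histogram(inter, lo, hi, bins)
    s_intra, s_inter = sum(c_intra), sum(c_inter)
    shared = sum(min(a, b) for a, b in zip(c_intra, c_inter))
    covered = sum(max(a, b) for a, b in zip(c_intra, c_inter))
    return max(s_intra, s_inter) / min(s_intra, s_inter) * shared / covered


def relative_overlap(dist, labels, bins=100):
    """One minus the pairwise overlap averaged over all ordered cluster pairs."""
    n = len(labels)
    clusters = sorted(set(labels))

    def distances(ci, cj):
        return [dist[a][b] for a in range(n) for b in range(a + 1, n)
                if {labels[a], labels[b]} == {ci, cj}]

    everything = [dist[a][b] for a in range(n) for b in range(a + 1, n)]
    lo, hi = min(everything), max(everything)
    total = sum(pair_overlap(distances(i, i), distances(i, j), lo, hi, bins)
                for i in clusters for j in clusters if i != j)
    p = len(clusters)
    return 1 - total / (p * (p - 1))


labels = [0, 0, 1, 1]
dist = [[0, 1, 5, 6], [1, 0, 7, 8], [5, 7, 0, 2], [6, 8, 2, 0]]
print(relative_overlap(dist, labels))
```

The leading factor in `pair_overlap` compensates for histograms of different sizes, so that a histogram completely covered by a larger one yields an overlap of 1.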

Distance Comparison Results

The results of comparing the six distances we ana-

lyzed, using the three quality measures, are listed in

Table 4. Recall that all quality measures have an ex-

pected range of [0,1] where larger values imply better

performance.

Distance       Dα      Aα      Oα      z-score sum   Rank
DSSIM          0.627   1.000   0.965   1.895         2nd
Descriptors    0.639   0.976   0.988   2.509         1st
Euclidean      0.231   0.325   0.907   −4.831        6th
Manhattan      0.527   1.000   0.951   0.84          3rd
Pearson        0.536   0.980   0.888   −0.875        5th
Approx. Inf.   0.527   1.000   0.937   0.462         4th

Table 4 Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 508 genomic DNA sequences taken from organisms from each kingdom of life. Dα is the correlation to an idealized cluster, Aα the silhouette cluster accuracy, and Oα the relative overlap. Higher is better.

To compare each distance relative to all the other distances, we further compute for each quality measure (each column) the standard scores (z-scores) of each distance $d_\alpha$, where $\alpha \in \{\mathrm{DSSIM}, \mathrm{D}, \mathrm{E}, \mathrm{M}, \mathrm{P}, \mathrm{AID}\}$, as

$$z(d_\alpha) = \frac{d_\alpha - \mu}{\sigma},$$

where $d_\alpha$ here stands for the value of that distance in the column, $\mu$ is the mean and $\sigma$ is the standard deviation of the values of all six distances for that particular quality measure (column). A positive value of the standard score means that a distance performs above average (in this category) and a negative value that it performs below average.

Finally, we compute the sum of the z-scores for each

quality measure as seen in Table 4. Note that the total

of z-scores for a distance represents the performance

of that distance relative to the other distances, and

indicates its relative ranking.
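The z-score sums can be recomputed from the three quality-measure columns of Table 4 with a few lines of Python. The text does not say which variant of the standard deviation was used; the sketch assumes the sample standard deviation (`statistics.stdev`), which reproduces the published sums up to rounding of the tabulated inputs.

```python
from statistics import mean, stdev

# (D_alpha, A_alpha, O_alpha) for each distance, copied from Table 4.
TABLE_4 = {
    "DSSIM": (0.627, 1.000, 0.965),
    "Descriptors": (0.639, 0.976, 0.988),
    "Euclidean": (0.231, 0.325, 0.907),
    "Manhattan": (0.527, 1.000, 0.951),
    "Pearson": (0.536, 0.980, 0.888),
    "Approx. Inf.": (0.527, 1.000, 0.937),
}


def z_score_sums(table):
    """Sum, per distance, of its z-scores over all quality-measure columns."""
    columns = list(zip(*table.values()))
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stats = [(mean(col), stdev(col)) for col in columns]
    return {name: sum((v - mu) / sd for v, (mu, sd) in zip(row, stats))
            for name, row in table.items()}


sums = z_score_sums(TABLE_4)
ranking = sorted(sums, key=sums.get, reverse=True)
for name in ranking:
    print(f"{name:13s} {sums[name]:7.3f}")
```

Sorting by the summed z-scores yields the same ordering as the Rank column of Table 4.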

The conclusion of this analysis is that the best performing distances are the descriptor distance and DSSIM. Manhattan, Pearson, and approximated information distance perform well in some categories but not so well in other categories. For this dataset and value of $k$, the Euclidean distance had the weakest performance in all measured categories, which confirms the visual assessment of the MDS plots obtained by using the Euclidean distance, as seen in Figure 2 and Figure 3.

It is worth noting that the two distances which per-

form best (DSSIM and descriptor) treat FCGR ma-

trices as two-dimensional maps in which the local ar-

rangement of the cells (matrix entries) inﬂuences the

computed distance, whereas the other distances treat

the FCGR matrices as linear vectors. This suggests that the organization of the $k$-mer tallies (in this paper, $k = 9$) of a DNA sequence as an FCGR matrix, rather than a simple vector, reveals structural properties of the DNA sequence that could be utilized in order to identify and classify genomic DNA sequences.


Discussion and Conclusions

In this study we test the hypothesis that CGR-based genomic signatures of genomic DNA sequences are indeed species- and genome-specific. With this goal in mind we analyze over five hundred 150 kbp DNA

genomic sequences originating from organisms repre-

senting each of the kingdoms of life. Our quantita-

tive comparison of six diﬀerent distances suggests that

several other distances outperform the Euclidean dis-

tance, which has been until now almost exclusively

used in such studies. Our preliminary results show

that two of these distances, DSSIM and descriptor dis-

tance (introduced here) when applied to CGR-based

genomic signatures, have indeed the ability to diﬀer-

entiate between DNA sequences coming from diﬀerent

species. This indicates that the k-mer sequence compo-

sition (where k= 1,2, ..., 9) of genomic sequences con-

tains taxonomic information which could potentially

aid in the identiﬁcation, comparison and classiﬁca-

tion of species based on molecular evidence. The two-

dimensional and three-dimensional Molecular Distance

Maps we obtain, which visualize the simultaneous in-

tragenomic and intergenomic interrelationships among

the sequences in our dataset, show this method’s po-

tential.

Further analysis is needed to explore this method’s potential for the analysis of closely related species. As a preliminary experiment, we applied it to H. sapiens chromosome 21 (NC_000021.8), which yields 234 fragments, and P. troglodytes chromosome Y (NC_006492.3), which yields 168 sequences, also 150 kbp long.

Distance       Dα      Aα      Oα      z-score sum   Rank
DSSIM          0.167   0.915   0.136   3.453         1st
Descriptors    0.015   0.500   0.101   −2.593        5th
Euclidean      0.037   0.58    0.069   −2.899        6th
Manhattan      0.112   0.863   0.108   1.27          3rd
Pearson        0.142   0.714   0.119   1.339         2nd
Approx. Inf.   0.075   0.933   0.062   −0.569        4th

Table 5 Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 402 DNA sequences from H. sapiens, chromosome 21 and P. troglodytes, chromosome Y. Dα is the correlation to an idealized cluster, Aα is the silhouette cluster accuracy, and Oα is the relative overlap.

The Molecular Distance Maps in Figure 5 and Figure 6, of 402 DNA sequences, suggest that several of the distances are able to differentiate even between DNA sequences from closely related organisms. As seen in Table 5, the Euclidean distance was again outperformed by other distances when assessed with the quality measures we described. In this case study, we note a change in the distance rankings: DSSIM, which ranked second previously, now ranks first, while the descriptor distance, which ranked first previously, now ranks second to last. This may be an indication that the descriptor distance, which was designed to detect pattern differences, may only perform well for analyses of sequences of distantly related organisms, while DSSIM, which is sensitive to small differences in similar images, may be the preferred option for fine-grained analyses at the genus, family and species level.

(a) DSSIM distance. (b) Descriptors distance. (c) Euclidean distance. (d) Manhattan distance. (e) Pearson distance. (f) Approx. inform. distance.

Figure 5 Two-dimensional Molecular Distance Maps of 150 kbp genomic DNA sequences from H. sapiens (blue) and P. troglodytes (red) using the six distances.

Further large-scale computational experiments have

to be carried out to conﬁrm these preliminary results

and establish their validity. Such experiments could

provide additional insights regarding the choice of op-

timal distance for structural genome comparison in

diﬀerent settings.


(a) DSSIM distance. (b) Descriptors distance.

(c) Euclidean distance. (d) Manhattan distance.

(e) Pearson distance. (f) Approx. inform. distance.

Figure 6 Three-dimensional Molecular Distance Maps of

150 kbp genomic DNA sequences from H. sapiens (blue),

P. troglodytes (red) using the six distances.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RK data acquisition; data analysis, methodology and result interpretation;

manuscript draft; manuscript editing; software design. LK data analysis,

methodology and result interpretation; manuscript draft; manuscript

editing. S.Kon data analysis, methodology and result interpretation;

manuscript editing. S.Kop data analysis, methodology and result

interpretation; manuscript editing. All authors read and approved the ﬁnal

manuscript.

Acknowledgements

We thank Yuri Boykov, Lena Gorelick and Olga Veksler for discussions on

the deﬁnition for the descriptor distance, and Stephen Solis for comments

on earlier drafts of the manuscript.

Author details

1Department of Computer Science, University of Western Ontario,

London, ON, Canada. 2Department of Mathematics and Computing

Science, Saint Mary’s University, Halifax, NS, Canada.

References

1. Hebert, P.D., Cywinska, A., Ball, S.L., et al.: Biological identiﬁcations

through DNA barcodes. Proceedings of the Royal Society of London.

Series B: Biological Sciences 270(1512), 313–321 (2003)

2. Sirovich, L., Stoeckle, M.Y., Zhang, Y.: Structural analysis of

biodiversity. PLoS One 5(2), 9266 (2010)

3. Jeﬀrey, H.: Chaos Game Representation of gene structure. Nucleic

Acids Research 18(8), 2163–2170 (1990)

4. Kari, L., Hill, K.A., Sayem, A.S., Karamichalis, R., Bryans, N., Davis,

K., Dattani, N.S.: Mapping the Space of Genomic Signatures

(Submitted). ArXiv e-prints http://arxiv.org/abs/1307.3755

(2014)

5. Edwards, S., Fertil, B., Giron, A., Deschavanne, P.: A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Systematic Biology 51(4), 599–613 (2002)

6. Pandit, A., Vadlamudi, J., Sinha, S.: Analysis of dinucleotide

signatures in HIV-1 subtype B genomes. Journal of genetics 92(3),

403–412 (2013)

7. Deschavanne, P., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic

signature: characterization and classiﬁcation of species assessed by

Chaos Game Representation of sequences. Molecular Biology and

Evolution 16(10), 1391–1399 (1999)

8. Gentles, A.J., Karlin, S.: Genome-scale compositional comparisons in

eukaryotes. Genome Research 11(4), 540–546 (2001).

doi:10.1101/gr.163101

9. Deschavanne, P., Giron, A., Vilain, J., Dufraigne, C., Fertil, B.:

Genomic signature is preserved in short DNA fragments. In:

Proceedings of IEEE International Symposium on Bio-Informatics and

Biomedical Engineering, pp. 161–167 (2000)

10. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality

assessment: From error visibility to structural similarity. IEEE

Transactions on Image Processing 13(4), 600–612 (2004).

doi:10.1109/TIP.2003.819861

11. Iversen, G.R., Gergen, M., Gergen, M.M.: Statistics: The Conceptual

Approach. Springer, Berlin Heidelberg (1997)

12. Krause, E.F.: Taxicab Geometry: An Adventure in non-Euclidean

Geometry. Courier Dover Publications, Mineola, New York (2012)

13. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004)

14. Jeﬀrey, H.: Chaos game visualization of sequences. Comput. Graphics

16(1), 25–33 (1992)

15. Pride, D., Meinersmann, R., Wassenaar, T., Blaser, M.: Evolutionary

implications of microbial genome tetranucleotide frequency biases.

Genome Research 13(2), 145–158 (2003)

16. Deschavanne, P., DuBow, M., Regeard, C.: The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination. Virology Journal 7(1), 163 (2010)

17. Pandit, A., Sinha, S.: Using genomic signatures for HIV-1 subtyping.

BMC Bioinformatics 11(Suppl 1), 26 (2010)

18. Deza, M.M., Deza, E.: Encyclopedia of Distances. Springer, Berlin

Heidelberg (2009)

19. Kruskal, J.: Multidimensional scaling by optimizing goodness of ﬁt to a

nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964)

20. Supplemental Material.

https://github.com/rallis/intraSupplemental_Material

21. Karamichalis, R.: Molecular Distance Map Interactive Webtool (2014).

https://github.com/rallis/intraMoDMap

22. Pang-Ning, T., Steinbach, M., Kumar, V., et al.: Introduction to data

mining. In: Library of Congress (2006)

23. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of

selected criterion functions for document clustering. Machine Learning

55(3), 311–331 (2004)

24. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and

validation of cluster analysis. Journal of Computational and Applied

Mathematics 20(0), 53–65 (1987). doi:10.1016/0377-0427(87)90125-7