Barcodes for genomes and applications

BioMed Central
BMC Bioinformatics
Open Access
Research article
Fengfeng Zhou, Victor Olman and Ying Xu*
Address: Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, and BioEnergy Science Center (BESC), University of
Georgia, Athens, GA 30602, USA
Email: Fengfeng Zhou - ffzhou@csbl.bmb.uga.edu; Victor Olman - olman@csbl.bmb.uga.edu; Ying Xu* - xyn@bmb.uga.edu
* Corresponding author †Equal contributors
Abstract
Background: Each genome has a stable distribution of the combined frequency of each k-mer and its reverse complement, measured in sequence fragments as short as 1000 bps across the whole genome, for 1 < k < 6. The collection of these k-mer frequency distributions is unique to each genome and is termed the genome's barcode.
Results: We found that, for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and the identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves on the state of the art in terms of both binning accuracy and scope of applicability. Other attractive properties of genome barcodes include (a) the barcodes have different and identifiable characteristics for different classes of genomes, such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.
Conclusion: These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
Background
The challenges faced in sorting out the short genomic fragments generated by metagenome sequencing projects [1] pose a fundamental question: "does each genome have a unique signature imprinted on its short sequence fragments so that fragments from the same genome in a metagenome can be identified accurately?" A positive answer to this question could have significant implications for many important genome and metagenome analysis problems, such as identification of genetic material transferred from other organisms [2] or through virus invasions [3,4], separation of short sequence fragments generated by metagenome sequencing into individual genomes [5], and phylogenetic analyses of genomes [6].
Understanding the intrinsic properties of genome sequences, either general to all or specific to some classes of genomes, has been the focus of many studies in the past two decades. Earlier work includes the discovery of the periodicity property of DNA sequences across both prokaryotic and eukaryotic genomes [7] and the realization that coding sequences follow Markov chain properties [8-10]. Karlin and colleagues have studied various genome properties based on analyses of k-mer frequency distributions, and have observed that the di-nucleotide relative abundance, a di-mer frequency normalized with respect to the mono-mer frequencies, is generally stable across a genome when measured on fragments of 50 K base pairs (bp) [11-13]. They even suggested that such normalized di-mer frequency distributions could serve as signatures of genomes.
Published: 17 December 2008
BMC Bioinformatics 2008, 9:546 doi:10.1186/1471-2105-9-546
Received: 16 June 2008
Accepted: 17 December 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/546
© 2008 Zhou et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this paper, we present a barcoding scheme for all sequenced genomes, and illustrate a number of interesting and useful properties of the barcodes, which we can take advantage of to solve challenging genome analysis problems. We highlight the power of this barcoding scheme by addressing two application problems: metagenome binning and the identification of horizontally transferred genes.
Results
Barcodes and their properties
We have calculated the barcode for each sequenced prokaryotic genome using the following procedure. For each genome, we partition its sequence into a series of non-overlapping, equal-sized fragments of M bps; then, for each k-mer (1 < k < 6 in this study), we calculate the combined frequency of the k-mer and its reverse complement within each fragment. The barcode for each genome is a matrix of N(k) columns and genome_length/M rows, with each element representing the frequency of the corresponding k-mer within the corresponding sequence fragment, where N(k) is the number of unique combined k-mers. Note that $N(k) = 4^k/2$ when k is odd and $N(k) = (4^k + 4^{k/2})/2$ when k is even. For example, N(4) = 136. The portion of the barcode corresponding to a fragment in a genome is called the fragment's barcode. In this paper, barcodes are calculated using M = 1000 and k = 4 unless stated otherwise. A discussion of our choices of the M and k values is given in Additional file 1, which also shows that the above "equal-sized" requirement is not necessary.
For each barcode, we have created a grey-level image, a barcode image, by mapping the k-mer frequencies to grey levels using a procedure given in the METHODS section, so that darker grey levels correspond to lower frequencies. Figure 1 shows the barcode images for five prokaryotic genomes. A key advantage of barcode images is that they provide an intuitive, informative and global view of genomes, from which various genomic features become immediately apparent. This view can be used to guide our rigorous statistical analyses of genomes. We have calculated the barcode images for all 586 sequenced prokaryotic genomes, which are all accessible at [14], along with the barcode images for other classes of genomes.
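As an illustration, the barcode construction described in this subsection can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the function names `combined_kmers` and `barcode` are hypothetical, frequencies are assumed to be counts normalized within each fragment, k-mer windows containing characters other than A, C, G and T are assumed to be skipped, and a trailing fragment shorter than M is assumed to be dropped.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def combined_kmers(k):
    """Sorted list of k-mers, each merged with its reverse complement."""
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return sorted({min(w, reverse_complement(w)) for w in words})


def barcode(sequence, k=4, m=1000):
    """Return (columns, rows): one row of combined k-mer frequencies per
    non-overlapping m-bp fragment (assumption: counts are normalized to
    sum to 1 within each fragment)."""
    columns = combined_kmers(k)
    index = {w: j for j, w in enumerate(columns)}
    rows = []
    for start in range(0, len(sequence) - m + 1, m):
        fragment = sequence[start:start + m]
        counts = [0] * len(columns)
        for i in range(m - k + 1):
            w = fragment[i:i + k]
            canon = min(w, reverse_complement(w))
            if canon in index:  # windows with ambiguous bases are skipped
                counts[index[canon]] += 1
        total = sum(counts)
        rows.append([c / total if total else 0.0 for c in counts])
    return columns, rows
```

The length of `combined_kmers(k)` reproduces N(k) from the text, for instance 136 columns for k = 4, because palindromic k-mers (which exist only for even k) are their own reverse complements and are therefore not paired with another k-mer.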
From these barcodes (e.g., Figure 1), we observed that (a) all chromosomal genomes have remarkably stable 4-mer frequency distributions for essentially all 4-mers, giving rise to the vertical bands with consistent grey levels across each barcode; (b) the small fraction of fragments whose barcodes are clearly different from the rest of the genome (abnormal barcodes, seen as horizontal stripes in the barcodes) typically represent 2–3 special classes of genes (see discussion later); (c) multiple chromosomes of the same organism generally have highly similar barcodes (Figure 2(a)), but each has its own unique pattern of abnormal fragments; and (d) barcode similarities tend to be proportional to the genomes' phylogenetic closeness (Figure 2(b)).
To understand why a genomic sequence has the barcode property, we have examined random nucleotide sequences generated using different models, including Markov chain models of orders from 0 to 6. We observed that barcodes for random nucleotide sequences generated using a third-order Markov chain model are the closest to the barcodes of genomic sequences in terms of their appearance (Additional file 1), and higher-order Markov chain models do not seem to add much to this property. Hence we believe that the barcode property of prokaryotic genomes is mainly due to the third-order Markov chain property of the coding sequences in the genomes, which account for 80–90% of a typical prokaryotic genome. It is worth noting that the barcodes for the coding and non-coding sequences of the same genome are generally different, though they share a weakly similar backbone structure, while each of these two classes of (composite) regions generally has highly similar barcodes within itself (Figure 3).
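The random sequences used in this comparison can be imitated with a small Markov chain sampler. The sketch below is an illustration under stated assumptions, not the procedure from Additional file 1: the names `fit_markov` and `sample_markov` are hypothetical, transition counts are assumed to be estimated from a template sequence, and an unseen context is handled by restarting from a randomly chosen seen context.

```python
import random
from collections import defaultdict


def fit_markov(sequence, order):
    """Count next-nucleotide occurrences for every context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(sequence)):
        counts[sequence[i - order:i]][sequence[i]] += 1
    return counts


def sample_markov(counts, order, length, seed=0):
    """Sample a sequence of `length` nucleotides from the fitted chain."""
    rng = random.Random(seed)
    contexts = list(counts)
    out = list(rng.choice(contexts))  # start from a context seen in the template
    while len(out) < length:
        ctx = "".join(out[len(out) - order:]) if order else ""
        nxt = counts.get(ctx)
        if not nxt:  # assumption: restart from a seen context if stuck
            out.extend(rng.choice(contexts))
            continue
        letters = list(nxt)
        weights = [nxt[c] for c in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out[:length])
```

With `order=0` the sampler only preserves the mono-nucleotide frequencies of the template, which corresponds to the zeroth-order model of Figure 1(e); with `order=3` it preserves the conditional 4-mer statistics that the text identifies as closest to genomic barcodes.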
Extension to other genomes
In addition to prokaryotes, we have also calculated the barcodes for the other classes of sequenced genomes, namely eukaryotic, mitochondrial, plastid and plasmid genomes. For eukaryotes, we studied the barcodes for four key components of eukaryotic genomes, namely the (composite) regions of repetitive sequences, promoter sequences (the 1000-bp region upstream of each translation start), coding regions and introns (Figure 3(b)–(e)). We observed that (i) different regions in a high-level eukaryotic genome (e.g., human) have similar "backbone" structures in their barcodes, and (ii) the barcodes for the four types of regions have increasingly higher complexity, going from repetitive sequences to coding regions to introns and promoter sequences. This is consistent with the belief that introns and promoter sequences are probably the most information-rich among the four because of the possibly large numbers of regulatory elements they encode.
The barcodes of the mitochondrial genomes are generally unique compared to the barcodes of the other genomes, as they have a distinct overall appearance (e.g., Figure 3(h)–(i)). Their similar appearance may be the result of all mitochondria originating from Proteobacteria. The barcodes of all plasmid genomes also tend to have similar characteristics among themselves, possibly because they are under similar selection pressure caused by their frequent transfer among cell cultures. The barcodes of all the plastid genomes are also generally unique compared to the barcodes of the others (e.g., Figure 3(j)–(k)). For example, a majority of them have two dark horizontal bands toward one end of their barcodes along the genome axis, whose corresponding genomic regions consist of RNA genes, such as ribosomal RNAs and tRNAs, plus ribosomal proteins. The fuzzier appearance of the plastid barcodes indicates that their k-mer frequencies along the genome axis are not as stable as in the other genomes. The overall similar appearance of the plastid barcodes may be due to all plastids originating from Cyanobacteria.
One interesting question is "do different classes of genomes have unique characteristics in their barcodes?" Our answer is yes, based on their highly separable distributions in the feature space defined by two particular features, as shown in Figure 4. One feature measures the overall frequency variation for all 4-mers across the genome's barcode, and the other measures the overall similarity level among all the M-bp fragments of the genome, each considered as a vector of 4-mer frequencies. While Figure 2(b) indicates that barcodes generally preserve sequence-level similarities, Figure 4 suggests that barcodes also capture a higher-level similarity, beyond individual genome sequence similarities, through the textures of their images, which are the common and unique characteristics of different classes of genomes. This property indicates that barcodes are not just a simple visualization tool; they capture some fairly basic information about genomes. From an application point of view, we believe that this feature will prove useful in metagenome analyses, as fragments from different classes of genomes, such as eukaryotes, prokaryotes and different organelle genomes, have different characteristics in their barcode images.
Figure 1. Barcodes for five prokaryotic genomes. (a) E. coli K-12; (b) E. coli O157; (c) chromosome 1 of B. pseudomallei K96243; (d) archaean P. furiosus DSM 3638; and (e) a random nucleotide sequence generated using a zeroth-order Markov chain model. The x-axis for each barcode is the list of all 4-mers arranged in alphabetical order, and the y-axis is the genome axis, with each pixel representing a fragment M bps long.
Figure 2. Basic features of barcodes. (a) Barcode distance distribution among chromosomes from the same organisms, across all prokaryotic and eukaryotic chromosomal genomes. The x-axis is the barcode distance and the y-axis is the frequency of chromosome pairs of the same organism having a particular barcode distance. (b) Genome barcode distances versus sequence similarities among the corresponding 16S rRNAs (based on the multiple sequence alignment given in DeSantis TZ et al. [26]). The y-axis represents the barcode distance, and the x-axis is the sequence identity between two 16S rRNAs, grouped into nine bins, where the sequence identity is calculated as the average sequence identity over all 16S pairs in each bin.
Figure 3. Barcodes of some organisms. Barcodes of (a) human chromosome 1 (226.21 Mbps); major components of the human chromosome in a composite form: (b) repetitive sequences, (c) promoter sequences, (d) coding regions and (e) introns; and (f) coding and (g) non-coding regions of E. coli K-12. Only a 639-Kbp region of each sequence in (b)–(g) is displayed so that each pixel represents the same sequence length; 639 Kbps is used since this is the length of the shortest region among them all, i.e., the total non-coding region of E. coli K-12. Mitochondrial genome barcodes of (h) C. elegans (13794 bps) and (i) Drosophila melanogaster (19517 bps). Plastid genome barcodes of (j) the aquatic plant Ceratophyllum demersum (156252 bps) and (k) the land plant Populus trichocarpa (157033 bps).
Identification of abnormal sequence fragments
Our procedure for identifying sequence fragments with
abnormal barcodes in a genome employs a clustering
strategy to divide all the sequence fragments in a genome
into two groups: (a) a large group of fragments with their
barcodes all similar to each other and (b) the rest (see
METHODS section).
Using this procedure, we have identified 30,582 abnormal fragments, covering 30,889 genes, across all the complete prokaryotic genomes. Specifically, 28,460 such fragments are identified in the 542 bacterial genomes, covering 28,562 genes, and 2,122 such fragments are identified in the 46 archaeal genomes, covering 2,327 genes. We found that the percentage of fragments with abnormal barcodes ranges from 9.40% to 32.32% across all the bacterial genomes, with the average being 12.85%. Among the 46 sequenced archaeal genomes, the percentage of fragments with abnormal barcodes ranges from 9.86% to 23.14%, with the average being 13.58%. Further information can be found in Additional file 1. The detailed frequency information for abnormal fragments across different genomes is in Additional file 2 [15].
Figure 4. Barcodes in feature space. The x-axis is the variation of the 4-mer frequencies across a whole genome, averaged over all 4-mers, and the y-axis measures the similarity level among all 1000-bp partitioned fragments of the genome, each represented as a 136-dimensional vector of 4-mer frequencies. Specifically, for each genome, we build a minimum spanning tree [27] based on the 4-mer frequency vectors of its sequence fragments and their distances; the y-axis is the average weight (distance) of all edges in the minimum spanning tree. The green dots represent prokaryotes (586 genomes), the blue ones eukaryotes (83 chromosomes), the red ones plastids (101 genomes with lengths > 20,000 bps), the brown ones plasmids of prokaryotic genomes (237 plasmids > 20,000 bps) and the black ones mitochondria (120 genomes with lengths > 20,000 bps).
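The y-axis feature of Figure 4, the average edge weight of a minimum spanning tree built over the fragments' 4-mer frequency vectors, can be sketched with Prim's algorithm as below. This is an illustrative sketch only: the name `mean_mst_edge_weight` is hypothetical, and the Euclidean distance between frequency vectors is an assumption, since the caption does not state which distance is used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_mst_edge_weight(vectors, dist=euclidean):
    """Average edge weight of a minimum spanning tree over `vectors`,
    computed with an O(n^2) version of Prim's algorithm."""
    n = len(vectors)
    if n < 2:
        return 0.0
    # best[i] is the cheapest known link from the growing tree to node i
    best = [dist(vectors[0], v) for v in vectors]
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        nxt = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        total += best[nxt]
        in_tree[nxt] = True
        for i in range(n):
            if not in_tree[i]:
                d = dist(vectors[nxt], vectors[i])
                if d < best[i]:
                    best[i] = d
    return total / (n - 1)
```

A genome whose fragments all have similar 4-mer frequency vectors yields a small value, while a genome with heterogeneous fragments yields a large one. The quadratic running time is adequate for a sketch but would be slow for genomes with hundreds of thousands of fragments.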
While we found that it is generally more challenging to
study the abnormal fragments in eukaryotes, we did apply
the same procedure to different human chromosomes,
and found that the percentage of abnormal fragments
ranges from 10.08% to 31.32%, with the average being
12.10%.
We have analyzed the abnormal fragments across the prokaryotic genomes and found that ~30% of them can be explained in terms of (a) horizontal gene transfers, (b) phage invasions and (c) highly expressed genes, based on predictions by PHX-PA [16,17] and Prophinder [18]. Among the genes that fall into this 30%, 6.99% are horizontally transferred genes, 4.97% are bacteriophage genes and 18.90% are highly expressed genes, according to these two prediction programs; note that these numbers do not add up to exactly 30% since there are overlaps among the categories. The genes falling into the different categories are given in Additional file 3 [15]. We have carried out an enrichment analysis of such elements in regions with abnormal versus normal barcodes. We found that the highly expressed genes are enriched in the abnormal fragments, with the enrichment ratio > 1 across all the genomes and the average enrichment ratio being 1.90. Similar results hold for the horizontally transferred genes and bacteriophage genes. All the detailed data can be found in Additional file 2.
We noted that our estimate of the percentages of "foreign
fragments" in bacterial genomes (after deducting the
"highly expressed genes") is in general agreement with the
previous estimates though different information and tech-
niques are used to derive the estimates [19].
We do not yet have an explanation for the remaining ~70% of abnormal fragments in prokaryotic genomes, although we suspect that they mostly fall into the same three categories; one possible reason that we cannot explain them now is the limited coverage of the current databases of horizontally transferred genes, bacteriophage genomes and highly expressed genes. We believe that, by using more sophisticated computational procedures, one may be able to derive the level of abnormality of a fragment's barcode in a genome, and possibly link such information to when the fragment was horizontally transferred [20].
Binning metagenome sequence
The ability to sequence a microbial community has led to the sequencing of at least 7.04 giga bps of metagenome sequences, already 2.22 times the total complete genome sequences accumulated in the past two decades [21]. These metagenome sequences have opened many doors to new research possibilities, and have posed some challenging problems. One such problem is determining which fragments in a large pool of metagenomic fragments are from the same organism [22]; such fragments are typically ~1000 bps in length after the initial assembly of reads produced by Sanger sequencing.
We have applied a clustering algorithm (see METHODS
section) for binning sequence fragments together based
on their barcode similarities, and tested the clustering
strategy on three sets of simulated metagenome data cre-
ated by cutting actual bacterial genomes into fragments
and mixing them together. The three test sets consist of all
sequence fragments from three sets of genomes, respec-
tively, extracted from the GenBank. The first set consists of
11 genomes randomly selected from the same genus but
from 11 different species (the genus has only 11
sequenced species) while the last two sets each consist of
30 and 100 genomes randomly selected from 30 and 100
different bacterial genera, respectively. The genome
names are given in Additional file 4[15].
To assess the binning ability of our algorithm as a function of the fragment size, we have considered fragment sizes M = 1000, 2000, 5000 and 10000. To test the limit of our binning algorithm, we have also considered M = 500. For each set of genomes, we partitioned each genome into fragments of size M, and then mixed the fragments of the same length into one pool. We then calculated the barcode for each fragment and did a clustering analysis, assuming that the number of genomes in each pool is known (this information is derivable from the 16S rRNAs). We have carried out two binning predictions, one directly on the generated fragments and one on a reduced set of generated fragments, in which we remove the 10% of the fragments from each genome whose barcodes are most different from the average barcode of the genome. The consideration is that each bacterial genome has ~13% of fragments with abnormal barcodes on average, which are not expected to be binned correctly with the rest of their host genome. This way we can more accurately assess the binning ability of our algorithm. Table 1 gives the binning results on the three sets of synthetic metagenome data, for both the original set and the reduced set.
From the table, we can see that the binning accuracy (into the correct genomes) is high for fragment sizes M = 1000 and above, at both the species and the genus level. We can also see a drop in the binning accuracy when the number of underlying genomes is increased from 30 to 100, which indicates that the complexity of the problem increases with the number of underlying genomes.
We have compared our binning performance with the published results of the best available algorithm, PhyloPythia [5]. Our comparison indicates that our algorithm gives consistently more accurate and more specific binning results across different fragment sizes. For example, at the species level, our algorithm has better than 50% accuracy on our test set when the fragment size is at least 2000 bps, while no binning results at the species level are given in McHardy et al. [5]. At the genus level, the accuracy of PhyloPythia (the average of binning specificity and sensitivity, extracted from Figure 1(a) and 1(b) in McHardy et al. [5]) is 45.5% for fragment size 1000, 56% for 2000, 74% for 5000 and 82.5% for 10000 (no data are provided for 500), all measured in terms of putting fragments into the correct genera, while ours bins fragments into the correct genomes and gives more accurate results. It should be noted that the test set used for PhyloPythia is different from ours, which may affect the performance statistics somewhat, though we suspect the effect will be insignificant considering the sizes of the test sets. Another key difference between the two algorithms is that PhyloPythia is a supervised learning algorithm, which requires a training set, whereas our algorithm does not require a training set and hence is more general.
One thing worth noting is that a prokaryotic genome, on average, has ~13–14% abnormal sequence fragments when the fragment size is M = 1,000, suggesting that the theoretical limit for binning accuracy should be no better than 86–87%. Similarly, we expect that the theoretical limits of binning accuracy for binning based on 2,000-, 5,000- and 10,000-bp fragments should, in general, be no better than 87.36%, 87.58% and 88.4%, respectively.
Discussion and Conclusion
A natural question is "do all nucleotide sequences have the barcode property that genome sequences have?" The answer is no, based on the large number of randomly generated sequences that we have examined. Figure 1(e) shows a typical barcode of a random sequence generated using a zeroth-order Markov chain model. We found that none of the nucleotide sequences generated in this way has the vertical band structures seen in genome barcodes. More generally, barcodes for genomes and for randomly generated nucleotide sequences have different characteristics, as shown in Figure 5.
The barcode analyses in this paper are mainly based on data from prokaryotes. Though we have applied the same barcode model to eukaryotes and made interesting observations, we suspect that the current barcoding scheme is not rich enough to capture all the complexity of eukaryotes. Further studies along this direction are clearly needed.
We believe that for many genome analysis problems, particularly for prokaryotic genomes, the barcodes provide a natural, intuitive, information-rich and unified framework for studying them. Further applications of this capability to numerous genome analysis problems can be envisioned, such as phylogeny studies, particularly for genomes without obvious marker genes such as viruses; more thorough examination of different types of genomic regions in eukaryotes, their structures and organization; further studies of horizontal gene transfers; assisting in genome assembly of higher-order organisms (e.g., Populus, which we are currently working on); and possibly many more. We believe that we have only begun exploring the true power of this new capability for genome studies.
Methods
Mapping frequencies to grey levels
The frequency of each k-mer is mapped to a grey level as follows. We first count the frequency of each k-mer across all prokaryotic genomes, and sort the frequency list S[1:N(k)] in increasing order of the frequencies, with N(k) being the number of k-mers. We then find an integer L, L > 0, and partition S[1:N(k)] into L sub-lists so that the following function is minimized: $\sum_{i=1}^{L} (S_i - \bar{S})^2$, where $S_i$ is the sum of all frequencies in the $i$-th sub-list, $\bar{S}$ is the average of S, and L is a parameter to be determined by the minimization result. For M = 1000 and k = 4, we found that L = 14 gives the best value for the above objective function. The computed partition of S gives a mapping of frequencies to grey levels. Note that this mapping is genome-independent, so each grey level in the barcodes has the same meaning in different genomes.
Table 1: Binning accuracies of our barcode-based clustering algorithm.
                   11 genomes            30 genomes            100 genomes
Fragment size      Original  Filtered    Original  Filtered    Original  Filtered
FS = 500 bps       71.10%    77.30%      51.6%     55.70%      40.50%    41.10%
FS = 1000 bps      79.90%    85.90%      65.30%    70.30%      51.10%    52.60%
FS = 2000 bps      86.30%    91.70%      74.80%    80.60%      61.00%    68.53%
FS = 5000 bps      91.10%    98.10%      86.60%    93.20%      79.40%    81.90%
FS = 10000 bps     95.80%    99.30%      91.90%    97.50%      86.60%    89.18%
The binning accuracy is defined as (prediction specificity + prediction sensitivity)/2, and FS stands for fragment size. Both the specificity and the sensitivity are measured in terms of putting the fragments into the correct bin corresponding to each genome, defined by the majority of the fragments in the bin. The "Original" columns (original genomes) list the binning accuracy of our algorithm on all the non-overlapping fragments in each group of genomes, and the "Filtered" columns (filtered genomes) give the accuracy after removing the 10% of fragments with the most abnormal barcodes from each genome.
Barcode similarity calculation
We define the distance (or dissimilarity) between two barcodes based on their simplified representations. The simplified representation of a barcode is a matrix with the same number of columns as the barcode and with the number of grey levels, L, used in barcode images as its number of rows; each element of the matrix is the frequency of the corresponding grey level in the corresponding column of the barcode. For two such matrices $M_1$ and $M_2$ with K columns and L rows, we define their barcode distance as
$\sqrt{\sum_{i=1}^{L} \sum_{j=1}^{K} \left( M_1(i,j) - M_2(i,j) \right)^2}.$
Clearly this is a generalization of the Euclidean distance between two vectors of the averaged k-mer frequencies across each genome, widely used for genome comparisons as in the work of Karlin and colleagues [13,16,23,24] and many others; the latter is equivalent to the special case of our barcode distance when L = 1. Figure S4 in Additional file 1 provides a comparison between the two distances.
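The barcode distance can be illustrated with a short sketch that first reduces a grey-level barcode to its L-by-K simplified representation and then takes a Euclidean-type distance over all matrix entries. The names `simplified_representation` and `barcode_distance` are hypothetical, the input is assumed to be a barcode already mapped to integer grey levels 0 to L-1, and the "frequency" of a grey level in a column is assumed to be the fraction of fragments at that level, so that genomes of different lengths remain comparable.

```python
import math


def simplified_representation(grey_barcode, num_levels):
    """L x K matrix whose (i, j) entry is the fraction of fragments whose
    j-th k-mer column falls at grey level i (assumed normalization)."""
    n_rows = len(grey_barcode)
    n_cols = len(grey_barcode[0])
    simple = [[0.0] * n_cols for _ in range(num_levels)]
    for row in grey_barcode:
        for j, level in enumerate(row):
            simple[level][j] += 1.0 / n_rows
    return simple


def barcode_distance(m1, m2):
    """Euclidean-type distance over all entries of two L x K matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(m1, m2)
                         for a, b in zip(r1, r2)))
```

For example, two three-fragment barcodes with two k-mer columns and two grey levels, `[[0, 1], [0, 1], [1, 1]]` and `[[0, 1], [0, 1], [0, 1]]`, differ only in one entry of the first column, and their simplified representations differ by 1/3 in two matrix cells.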
Figure 5. Distribution of ratios between barcode variations of all prokaryotic genomes and their corresponding randomly generated nucleotide sequences. For each genome, a corresponding random nucleotide sequence is defined as a random sequence of the same length and with the same mono-nucleotide frequencies as those of the genome, generated using a zeroth-order Markov chain model. The variation of a barcode is defined as the standard deviation of the list of the averaged frequencies of all the k-mers along the genome. The x-axis is the ratio of the barcode variations between a genome and a corresponding random sequence, and the y-axis represents the frequency of cases with a particular variation ratio.
Identification of abnormal fragments in a genome
We have used the following procedure to identify fragments in a genome whose barcodes differ substantially from the average barcode of the genome. The procedure consists of two key steps. First, for each k-mer, we select the fragments in the genome that have the highest or the lowest X% of this k-mer's frequency among all fragments, with X being a parameter. Then we sort all the fragments in increasing order of the number of times they are selected in the first step, termed function F(p), with p being the index of a fragment. Let $p_0$ be the fragment index having the highest second-order derivative of F(p). We consider all fragments p with F(p) > F($p_0$) to be the non-native fragments of the genome, as they have the largest number of k-mers with frequencies that are substantially different from the typical k-mer frequencies throughout the genome. We found that the abnormal fragment prediction is not very sensitive to the exact value of X within the range from 5 to 20, so we have chosen X = 10 as the default value of our program.
The rationale for this procedure is that fragments with higher F(p) values have more "abnormal" k-mer frequencies compared to the average k-mer frequencies in the genome, and hence are more likely to be non-native fragments. By examining the curve of the F(p) function, we found that it is convex with one sharp transition point $p_0$, indicating a transition from the typical fragments to the "abnormal" fragments in the genome (see Additional file 1). Hence we have used this point as the separation point between the normal (or native) fragments and the "abnormal" fragments.
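Because F(p) is defined only at integer fragment indices, the "second-order derivative" has to be approximated in practice. One common choice, assumed here for illustration rather than taken from the paper, is the central second difference, with N denoting the number of fragments:

```latex
\Delta^2 F(p) = F(p+1) - 2F(p) + F(p-1), \qquad
p_0 = \operatorname*{arg\,max}_{1 < p < N} \Delta^2 F(p)
```

As a small worked example, take the sorted counts F = (1, 1, 2, 2, 3, 9, 10) for p = 1, ..., 7. The second differences at p = 2, ..., 6 are 1, -1, 1, 5 and -5, so $p_0 = 5$ and F($p_0$) = 3; the two fragments with counts 9 and 10 satisfy F(p) > F($p_0$) and would be reported as "abnormal".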
Metagenome binning algorithm
Our binning procedure starts with an application of the CLUMP program [25] to a given pool of fragments (not necessarily of the same length), which are clustered based on their barcode similarities. A unique feature of CLUMP is that it is quite accurate in identifying the core elements of each cluster, as we have previously demonstrated [25], though a weakness of the algorithm could be that it does not always handle the boundary elements accurately. Hence we have combined CLUMP with a K-means based clustering approach that we implemented. After identifying the initial clusters formed by CLUMP based on barcode similarities, and assuming that we know the number of clusters to be identified, we pick a seed from each predicted cluster randomly according to the density distribution of the cluster. Then we run the K-means algorithm using the selected seeds. For each pool of fragments, we run this two-step clustering algorithm multiple times, using a different set of seeds for each run. In deciding the number of runs, our rule of thumb, based on our experience working with metagenome data, is to use 500 * (the number of clusters). For each given set of seeds, we run the K-means algorithm for 10000 iterations. Among all the computed clustering results for each pool, we choose the clustering result $C_1, C_2, \ldots, C_K$ that minimizes the following function as the final binning result:

$$\sum_{i=1}^{K} \sum_{X \in C_i} d(X, \bar{X}_i)^2$$

where $C_1, C_2, \ldots, C_K$ is a partition of a given pool of metagenomic fragments with each $C_i$ being a subset of the pool, $\bar{X}_i$ being the average of the barcodes of all fragments in $C_i$, i = 1,..., K, and $d(\cdot,\cdot)$ denoting the distance between two barcodes.
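The restart-and-select part of this procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: CLUMP itself is not reproduced, so `initial_labels` is a hypothetical stand-in for its cluster assignments; seeds are drawn uniformly from each initial cluster rather than according to its density distribution; the barcode distance is assumed to be Euclidean; and the iteration cap is kept small for the example. All function names are illustrative.

```python
import numpy as np

def kmeans_from_seeds(X, seeds, max_iter=100):
    """Lloyd's K-means started from the rows of X indexed by `seeds`."""
    centers = X[seeds].astype(float)
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every fragment to every centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignments are stable
        labels = new_labels
        for i in range(len(centers)):
            members = X[labels == i]
            if len(members):                       # keep old centre if empty
                centers[i] = members.mean(axis=0)
    return labels, centers

def objective(X, labels, centers):
    """Sum over clusters of squared distances to the cluster centre."""
    return float(((X - centers[labels]) ** 2).sum())

def bin_fragments(X, initial_labels, n_runs=None, seed=0):
    """Run seeded K-means many times and keep the lowest-objective result."""
    X = np.asarray(X, dtype=float)
    initial_labels = np.asarray(initial_labels)
    clusters = np.unique(initial_labels)
    if n_runs is None:
        n_runs = 500 * len(clusters)               # rule of thumb from the text
    rng = np.random.default_rng(seed)
    best_score, best_labels = None, None
    for _ in range(n_runs):
        # One seed per initial cluster (uniform choice: a simplification).
        seeds = np.array([rng.choice(np.flatnonzero(initial_labels == c))
                          for c in clusters])
        labels, centers = kmeans_from_seeds(X, seeds)
        score = objective(X, labels, centers)
        if best_score is None or score < best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score
```

Here each row of `X` would be the barcode vector of one fragment. Seeding K-means from an initial clustering, instead of from arbitrary points, is what lets the accurate cluster cores guide the assignment of the boundary elements.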
Authors' contributions
Y.X. conceived the project. F.Z. and V.O. analyzed the data
and performed the experiments. Y.X. supervised this
project as P.I. and wrote the manuscript.
Additional material
Additional file 1
Supplementary material. Supplementary material 1–3.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S1.doc]
Additional file 2
Supplementary Table 1. HX denotes a highly expressed gene, HT a horizontally transferred gene, and PH a phage gene. UnknownGene consists of genes that lie within fragments with abnormal barcodes but do not belong to the above three categories.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S2.xls]
Additional file 3
Supplementary Table 2. Gene classifications of all the prokaryotic genomes.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S3.zip]
Additional file 4
Supplementary Table 3. Genomes used in the binning section.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S4.xls]
Acknowledgements
This work was supported by the National Science Foundation (DBI-0354771, ITR-IIS-0407204, DBI-0542119, CCF0621700), a U.S. Department of Energy "BioEnergy Research Center" grant from the Office of Biological and Environmental Research in the DOE Office of Science, and a Distinguished Scholar grant from the Georgia Cancer Coalition. We would like to thank the two anonymous reviewers for their helpful comments on our work. We would also like to thank Ms Joan Yantko for preparing the manuscript.
References
1. Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI: Host-bacterial mutualism in the human intestine. Science 2005, 307(5717):1915-1920.
2. Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: the complexity hypothesis. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(7):3801-3806.
3. Frey TK: Neurological aspects of rubella virus infection. Intervirology 1997, 40(2–3):167-175.
4. Rybchin VN, Svarchevsky AN: The plasmid prophage N15: a linear DNA with covalently closed ends. Mol Microbiol 1999, 33(5):895-903.
5. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4(1):63-72.
6. Yang E, Bin W, Peng J, Zhang X, Wang J, Yang J, Dong J, Chu Y, Zhang J, Jin Q: Comparative genomics and phylogenetic analysis of S. dysenteriae subgroup. Sci China C Life Sci 2005, 48(4):406-413.
7. Trifonov EN, Sussman JL: The pitch of chromatin DNA is reflected in its nucleotide sequence. Proceedings of the National Academy of Sciences of the United States of America 1980, 77(7):3816-3820.
8. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. I. Oligonucleotide frequencies analysis. Molecular Biology 1986, 20:826-833.
9. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. II. Non-homogeneous Markov models. Molecular Biology 1986, 20:833-840.
10. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. III. Computer recognition of coding regions. Molecular Biology 1986, 20:1145-1150.
11. Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11(7):283-290.
12. Karlin S, Zhu ZY, Karlin KD: The extended environment of mononuclear metal centers in protein structures. Proceedings of the National Academy of Sciences of the United States of America 1997, 94(26):14225-14230.
13. Karlin S, Brocchieri L, Mrazek J, Campbell AM, Spormann AM: A chimeric prokaryotic ancestry of mitochondria and primitive eukaryotes. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(16):9190-9195.
14. Computed_barcodes [http://csbl.bmb.uga.edu/~ffzhou/BoDB/]
15. Supplementary_material [http://csbl.bmb.uga.edu/~ffzhou/BoDB/supp/]
16. Mrazek J, Bhaya D, Grossman AR, Karlin S: Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Res 2001, 29(7):1590-1601.
17. Karlin S, Mrazek J: Predicted highly expressed genes of diverse prokaryotic genomes. J Bacteriol 2000, 182(18):5238-5250.
18. Lima-Mendez G, Helden JV, Toussaint A, Leplae R: Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics 2008.
19. Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405(6784):299-304.
20. Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 1997, 44(4):383-397.
21. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC: The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2008:D475-479.
22. McHardy AC, Rigoutsos I: What's in the mix: phylogenetic classification of metagenome sequence samples. Current opinion in microbiology 2007, 10(5):499-503.
23. Karlin S, Mrazek J, Ma J, Brocchieri L: Predicted highly expressed genes in archaeal genomes. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(20):7303-7308.
24. Mrazek J, Karlin S: Detecting alien genes in bacterial genomes. Ann N Y Acad Sci 1999, 870:314-329.
25. Olman V, Mao F, Wu H, Xu Y: Parallel Clustering Algorithm for Large Data Sets with applications in Bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007 in press.
26. DeSantis TZ Jr, Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL: NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res 2006:W394-399.
27. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms Second edition. Cambridge, MA The MIT Press; 2001.
