
Barcodes for genomes and applications

December 2008
BMC Bioinformatics 9(1):546

DOI:10.1186/1471-2105-9-546

Source
PubMed

License
CC BY 2.0

Authors:

Fengfeng Zhou

Chinese Academy of Sciences

Ying Xu

University of Georgia


BioMed Central

Page 1 of 11

(page number not for citation purposes)

BMC Bioinformatics

Open Access

Research article

Barcodes for genomes and applications

Fengfeng Zhou†, Victor Olman† and Ying Xu*

Address: Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, and BioEnergy Science Center (BESC), University of Georgia, Athens, GA 30602, USA

Email: Fengfeng Zhou - ffzhou@csbl.bmb.uga.edu; Victor Olman - olman@csbl.bmb.uga.edu; Ying Xu* - xyn@bmb.uga.edu

* Corresponding author †Equal contributors

Abstract

Background: Each genome has a stable distribution of the combined frequency of each k-mer and its reverse complement, measured in sequence fragments as short as 1000 bps across the whole genome, for 1 < k < 6. The collection of these k-mer frequency distributions is unique to each genome and is termed the genome's barcode.

Results: We found that, for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and the identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves on the state of the art in terms of both binning accuracy and scope of applicability. Other attractive properties of genome barcodes include (a) the barcodes have different and identifiable characteristics for different classes of genomes, such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.

Conclusion: These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.

Background

The challenges faced in sorting out the short genomic fragments generated by metagenome sequencing projects [1] pose a fundamental question: "does each genome have a unique signature imprinted on its short sequence fragments so that fragments from the same genome in a metagenome can be identified accurately?" A positive answer to this question could have significant implications for many important genome and metagenome analysis problems, such as identification of genetic material transferred from other organisms [2] or through virus invasions [3,4], separation of short sequence fragments generated by metagenome sequencing into individual genomes [5], and phylogenetic analyses of genomes [6].

Understanding the intrinsic properties of genome sequences, either general to all or specific to some classes of genomes, has been the focus of many studies in the past two decades. Earlier work includes the discovery of the periodicity property of DNA sequences across both prokaryotic and eukaryotic genomes [7] and the realization that coding sequences follow Markov chain properties [8-10]. Karlin and colleagues have studied various genome properties based on analyses of k-mer frequency distributions, and have observed that the di-nucleotide relative abundance, a di-mer frequency normalized with respect to the mono-mer frequencies, is generally stable across a genome when measured on 50 K base-pair (bp) fragments [11-13]. They even suggested that such normalized di-mer frequency distributions could serve as signatures of genomes.

Published: 17 December 2008

BMC Bioinformatics 2008, 9:546 doi:10.1186/1471-2105-9-546

Received: 16 June 2008

Accepted: 17 December 2008

This article is available from: http://www.biomedcentral.com/1471-2105/9/546

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper, we present a barcoding scheme for all sequenced genomes and illustrate a number of interesting and useful properties of the barcodes, which we can exploit to solve challenging genome analysis problems. We highlight the power of this barcoding scheme by addressing two application problems: metagenome binning and the identification of horizontally transferred genes.

Results

Barcodes and their properties

We have calculated the barcode for each sequenced prokaryotic genome using the following procedure. For each genome, we partition its sequence into a series of non-overlapping, equal-sized fragments of M bps; then, for each k-mer (1 < k < 6 in this study), we calculate the combined frequency of the k-mer and its reverse complement within each partitioned fragment. The barcode for each genome is a matrix of N(k) columns and genome_length/M rows, with each element representing the frequency of the corresponding k-mer within the corresponding sequence fragment, where N(k) is the number of unique combined k-mers. Note that $N(k) = 4^k/2$ when k is odd and $N(k) = (4^k + 4^{k/2})/2$ when k is even. For example, N(4) = (256 + 16)/2 = 136. The portion of the barcode corresponding to a fragment of a genome is called the fragment's barcode. In this paper, barcodes are calculated using M = 1000 and k = 4 unless stated otherwise. A discussion of our choices of the M and k values is given in Additional file 1, which also shows that the above "equal-sized" requirement is not necessary.
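The procedure above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of the lexicographically smaller of a k-mer and its reverse complement as the column label, the normalization by the number of valid windows, and the dropping of a trailing partial fragment are all assumptions made for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """One representative per {k-mer, reverse complement} pair, sorted alphabetically."""
    words = map("".join, product("ACGT", repeat=k))
    return sorted({min(w, reverse_complement(w)) for w in words})


def barcode(sequence, k=4, fragment_size=1000):
    """Return (column labels, rows); each row holds the combined k-mer
    frequencies of one non-overlapping fragment of `fragment_size` bp."""
    columns = canonical_kmers(k)
    index = {w: j for j, w in enumerate(columns)}
    rows = []
    # Assumption: a trailing fragment shorter than fragment_size is dropped.
    for f in range(len(sequence) // fragment_size):
        frag = sequence[f * fragment_size:(f + 1) * fragment_size].upper()
        counts = [0] * len(columns)
        windows = 0
        for i in range(len(frag) - k + 1):
            word = frag[i:i + k]
            j = index.get(min(word, reverse_complement(word)))
            if j is not None:  # windows with non-ACGT symbols are skipped
                counts[j] += 1
                windows += 1
        # Assumption: frequencies are normalized by the number of valid windows.
        rows.append([c / windows if windows else 0.0 for c in counts])
    return columns, rows
```

With k = 4 the sketch yields 136 columns, in agreement with N(4) = 136, and a genome of length G yields G // 1000 rows for the default fragment size.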

For each barcode, we have created a grey-level image, a barcode image, by mapping the k-mer frequencies to grey levels using the procedure given in the Methods section, so that darker grey levels correspond to lower frequencies. Figure 1 shows the barcode images for five prokaryotic genomes. A key advantage of barcode images is that they provide an intuitive, informative and global view of genomes, from which various genomic features become immediately apparent. This view can be used to guide rigorous statistical analyses of genomes. We have calculated the barcode images for all 586 sequenced prokaryotic genomes, which are all accessible at [14], along with the barcode images for other classes of genomes.

From these barcodes (e.g., Figure 1), we observed that (a) all chromosomal genomes have remarkably stable 4-mer frequency distributions for essentially all 4-mers, giving rise to vertical bands with consistent grey levels across each barcode; (b) the small fraction of fragments whose barcodes are clearly different from the rest of the genome (abnormal barcodes, visible as horizontal stripes) typically represent 2–3 special classes of genes (see discussion later); (c) different chromosomes of the same organism generally have highly similar barcodes (Figure 2(a)), but each has its own unique pattern of abnormal fragments; and (d) barcode similarities tend to be generally proportional to the genomes' phylogenetic closeness (Figure 2(b)).

To understand why a genomic sequence has the barcode property, we have examined random nucleotide sequences generated using different models, including Markov chain models of order 0 to 6. We observed that barcodes of random nucleotide sequences generated using a third-order Markov chain model are the closest in appearance to the barcodes of genomic sequences (Additional file 1), and higher-order Markov chain models do not seem to add much to this property. Hence we believe that the barcode property of prokaryotic genomes is mainly due to the third-order Markov chain property of the coding sequences in the genomes, which account for 80–90% of a typical prokaryotic genome. It is worth noting that barcodes of the coding and non-coding sequences of the same genome are generally different, though they share a weakly similar backbone structure, while fragments within each of these two classes of (composite) regions generally have highly similar barcodes (Figure 3).
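As a rough illustration of how random sequences of a given Markov order can be generated, the sketch below estimates transition counts from a template sequence and then samples from them. The paper does not describe how its models were fitted, so the function names, the count-based estimation and the uniform fallback for unseen contexts are assumptions.

```python
import random
from collections import Counter, defaultdict


def fit_markov(sequence, order):
    """Count which base follows each context of length `order`."""
    table = defaultdict(Counter)
    for i in range(order, len(sequence)):
        table[sequence[i - order:i]][sequence[i]] += 1
    return table


def sample_markov(table, order, length, seed_context, rng):
    """Sample a sequence of `length` bases; `seed_context` must have
    exactly `order` characters (the empty string for order 0)."""
    out = list(seed_context)
    while len(out) < length:
        context = "".join(out[len(out) - order:])
        counts = table.get(context)
        if not counts:
            # Assumption: contexts never seen in training fall back to uniform.
            out.append(rng.choice("ACGT"))
            continue
        bases = list(counts)
        weights = [counts[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)
```

Calling `fit_markov(genome, 3)` followed by `sample_markov(table, 3, len(genome), genome[:3], random.Random(0))` would give a third-order surrogate of a genome string, and order 0 reproduces only the mono-nucleotide frequencies.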

Extension to other genomes

In addition to prokaryotes, we have also calculated the barcodes for the other classes of sequenced genomes, namely eukaryotic, mitochondrial, plastid and plasmid genomes. For eukaryotes, we studied the barcodes of four key components of eukaryotic genomes, namely the (composite) regions of repetitive sequences, promoter sequences (the 1000-bp upstream region of each translation start), coding regions and introns (Figure 3(b)–(e)). We observed that (i) different regions in a high-level eukaryotic genome (e.g., human) have similar "backbone" structures in their barcodes, and (ii) the barcodes of the four types of regions have increasingly higher complexity, going from repetitive sequences to coding regions to introns and promoter sequences. This is consistent with the belief that introns and promoter sequences are probably the most information-rich among the four because of the possibly large numbers of regulatory elements they encode.

The barcodes of the mitochondrial genomes are generally unique compared to the barcodes of the other genomes, as they have a distinct overall appearance (e.g., Figure 3(h)–(i)). Their similar appearance may be the result of all mitochondria originating from Proteobacteria. The barcodes of all plasmid genomes also tend to have similar characteristics among themselves, possibly because they are under similar selection pressure caused by their frequent transfer among cell cultures. The barcodes of all the plastid genomes are also generally unique compared to the barcodes of the others (e.g., Figure 3(j)–(k)). For example, a majority of them each contain two dark horizontal bands toward one end of the barcode along the genome axis, whose corresponding genomic regions consist of RNA genes, such as ribosomal RNAs and tRNAs, plus ribosomal proteins. The fuzzier appearance of the plastid barcodes indicates that their k-mer frequencies along the genome axis are not as stable as in the other genomes. The overall similar appearance of the plastid barcodes may be due to all plastids originating from Cyanobacteria.

One interesting question is "do different classes of genomes have unique characteristics in their barcodes?" Our answer is yes, based on their highly separable distributions in the feature space defined by two particular features, as shown in Figure 4: one measures the overall frequency variation for all 4-mers across the genome's barcode, and the other measures the overall similarity level among all the M-bp fragments of the genome, each considered as a vector of 4-mer frequencies. While Figure 2(b) indicates that barcodes generally preserve sequence-level similarities, Figure 4 suggests that barcodes also capture a higher-level similarity beyond individual genome sequence similarities through the textures of their images, which are the common and unique characteristics of different classes of genomes. This property indicates that barcodes are not just a simple visualization tool; they capture some fairly basic information about genomes. From an application point of view, we believe that this feature will prove useful in metagenome analyses, as fragments from different classes of genomes, such as eukaryotes, prokaryotes or different organelle genomes, have different characteristics in their barcode images.

Figure 1: Barcodes for five prokaryotic genomes. (a) E. coli K-12; (b) E. coli O157; (c) chromosome 1 of B. pseudomallei K96243; (d) archaean P. furiosus DSM 3638; and (e) a random nucleotide sequence generated using a zeroth-order Markov chain model. The x-axis of each barcode is the list of all 4-mers arranged in alphabetical order, and the y-axis is the genome axis, with each pixel representing a fragment M bp long.

Figure 2: Basic features of barcodes. (a) Barcode distance distribution among chromosomes from the same organisms, across all prokaryotic and eukaryotic chromosomal genomes. The x-axis is the barcode distance and the y-axis is the frequency of chromosome pairs of the same organism having a particular barcode distance. (b) Genome barcode distances versus sequence similarities among the corresponding 16S rRNAs (based on the multiple sequence alignment given in DeSantis TZ et al. [26]). The y-axis represents the barcode distance, and the x-axis is the sequence identity between two 16S rRNAs grouped into nine bins, where the sequence identity is calculated as the average sequence identity over all 16S pairs in each bin.

Figure 3: Barcodes of some organisms. Barcodes of (a) human chromosome 1 (226.21 Mbps); major components of human chromosomes in a composite form: (b) repetitive sequences, (c) promoter sequences, (d) coding regions and (e) introns; and (f) coding and (g) non-coding regions of E. coli K-12. Only a 639-Kbp region of each sequence in (b)–(g) is displayed so that each pixel represents the same sequence length. 639 Kbps is used since this is the length of the shortest region among them all, i.e. the total non-coding region of E. coli K-12. Mitochondrial genome barcodes of (h) C. elegans (13794 bps) and (i) Drosophila melanogaster (19517 bps). Plastid genome barcodes of (j) the aquatic plant Ceratophyllum demersum (156252 bps) and (k) the land plant Populus trichocarpa (157033 bps).

Identification of abnormal sequence fragments

Our procedure for identifying sequence fragments with abnormal barcodes in a genome employs a clustering strategy to divide all the sequence fragments of the genome into two groups: (a) a large group of fragments whose barcodes are all similar to each other and (b) the rest (see Methods section).

Using this procedure, we have identified 30,582 abnormal fragments, covering 30,889 genes, across all the complete prokaryotic genomes. Specifically, 28,460 such fragments are identified in the 542 bacterial genomes, covering 28,562 genes, and 2,122 such fragments are identified in the 46 archaeal genomes, covering 2,327 genes. We found that the percentage of fragments with abnormal barcodes ranges from 9.40% to 32.32% across all the bacterial genomes, with an average of 12.85%. Among the 46 sequenced archaeal genomes, the percentage of fragments with abnormal barcodes ranges from 9.86% to 23.14%, with an average of 13.58%. Further information can be found in Additional file 1. The detailed frequency information for abnormal fragments across different genomes is in Additional file 2 [15].

Figure 4: Barcodes in feature space. The x-axis is the variation of the 4-mer frequencies across a whole genome, averaged over all 4-mers, and the y-axis measures the similarity level among all 1000-bp partitioned fragments of the genome, each represented as a 136-dimensional vector of 4-mer frequencies. Specifically, for each genome, we build a minimum spanning tree [27] based on the 4-mer frequency vectors of its sequence fragments and their distances; the y-axis is the average weight (distance) of all edges in the minimum spanning tree. The green dots represent prokaryotes (586 genomes), the blue ones eukaryotes (83 chromosomes), the red ones plastids (101 genomes with lengths > 20,000 bps), the brown ones plasmids of prokaryotic genomes (237 plasmids > 20,000 bps) and the black ones mitochondria (120 genomes with lengths > 20,000 bps).
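The two features plotted in Figure 4 can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: it assumes Euclidean distance between fragment vectors, takes "variation" to mean the population standard deviation of each 4-mer column, and uses a simple O(n^2) version of Prim's algorithm; all function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_mst_edge_weight(vectors):
    """Average edge weight of a minimum spanning tree over the complete
    graph on `vectors`, built with Prim's algorithm."""
    n = len(vectors)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each vertex to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(vectors[u], vectors[v])
                if d < best[v]:
                    best[v] = d
    return total / (n - 1)


def mean_column_variation(rows):
    """Standard deviation of each k-mer column along the genome,
    averaged over all columns (assumed meaning of the x-axis)."""
    n = len(rows)
    deviations = []
    for column in zip(*rows):
        mu = sum(column) / n
        deviations.append(math.sqrt(sum((x - mu) ** 2 for x in column) / n))
    return sum(deviations) / len(deviations)
```

For a genome with many thousands of fragments, the quadratic loop is slow in pure Python; the sketch is meant to show the definition of the features, not to scale.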

While we found that it is generally more challenging to study the abnormal fragments in eukaryotes, we applied the same procedure to different human chromosomes and found that the percentage of abnormal fragments ranges from 10.08% to 31.32%, with an average of 12.10%.

We have analyzed the abnormal fragments across the prokaryotic genomes and found that ~30% of them can be explained in terms of (a) horizontal gene transfers, (b) phage invasions and (c) highly expressed genes, based on predictions by PHX-PA [16,17] and Prophinder [18]. Among the genes that fall into this 30%, 6.99% are horizontally transferred genes, 4.97% are bacteriophage genes and 18.90% are highly expressed genes, based on the above two prediction programs; note that these numbers do not add up to exactly 30% since there are overlaps among them. The genes falling into the different categories are given in Additional file 3 [15]. We have carried out an enrichment analysis of such elements in regions with abnormal versus normal barcodes. We found that the highly expressed genes are enriched in the abnormal fragments, with the enrichment ratio > 1 across all the genomes and an average enrichment ratio of 1.90. Similar results hold for the horizontally transferred genes and bacteriophage genes. All the detailed data can be found in Additional file 2.

We noted that our estimate of the percentage of "foreign fragments" in bacterial genomes (after deducting the "highly expressed genes") is in general agreement with previous estimates, even though different information and techniques were used to derive those estimates [19].

We do not yet have an explanation for the remaining ~70% of abnormal fragments in prokaryotic genomes, although we suspect that they mostly fall into the same three categories; one possible reason that we cannot explain them now is the limited coverage of the current databases of horizontally transferred genes, bacteriophage genomes and highly expressed genes. We believe that, by using more sophisticated computational procedures, one may be able to derive the level of abnormality of a fragment's barcode in a genome, and possibly link such information to when the fragment was horizontally transferred [20].

Binning metagenome sequence

The ability to sequence a microbial community has led to the sequencing of at least 7.04 Giga bps of metagenome sequences, already 2.22 times the total of complete genome sequences accumulated in the past two decades [21]. These metagenome sequences have opened many doors to new research possibilities and have posed some challenging problems. One such problem is determining which fragments in a large pool of metagenomic fragments are from the same organism [22]; such fragments are typically ~1000 bps in length after the initial assembly when Sanger sequencing is used.

We have applied a clustering algorithm (see Methods section) for binning sequence fragments based on their barcode similarities, and tested the clustering strategy on three sets of simulated metagenome data created by cutting actual bacterial genomes into fragments and mixing them together. The three test sets consist of all sequence fragments from three sets of genomes extracted from GenBank. The first set consists of 11 genomes randomly selected from the same genus but from 11 different species (the genus has only 11 sequenced species), while the other two sets consist of 30 and 100 genomes randomly selected from 30 and 100 different bacterial genera, respectively. The genome names are given in Additional file 4 [15].

To assess the binning ability of our algorithm as a function of the fragment size, we have considered fragment sizes M = 1000, 2000, 5000 and 10000. To test the limit of our binning algorithm, we have also considered M = 500. For each set of genomes, we partitioned each genome into fragments of size M, and then mixed the fragments of the same length into one pool. We then calculated the barcode for each fragment and did a clustering analysis, assuming that the number of genomes in each pool is known (this information is derivable from the 16S rRNAs). We have carried out two binning predictions: one directly on the generated fragments and one on a reduced set of generated fragments, in which we remove from each genome the 10% of fragments whose barcodes are most different from the average barcode of the genome. The rationale is that each bacterial genome has, on average, ~13% of fragments with abnormal barcodes, which are not expected to be binned correctly with the rest of their host genome. This way we can more accurately assess the binning ability of our algorithm. Table 1 gives the binning results on the three sets of synthetic metagenome data, for both the original set and the reduced set.
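The construction of the reduced set can be sketched as below. The paper does not specify the distance used to rank fragments against the average barcode, so Euclidean distance is an assumption here, as are the function name and the rounding of the 10% cut-off.

```python
import math


def filter_abnormal(fragment_barcodes, fraction=0.10):
    """Return the indices (in genome order) of the fragments kept after
    dropping the `fraction` of fragments farthest from the genome's
    average barcode."""
    n = len(fragment_barcodes)
    dims = len(fragment_barcodes[0])
    mean = [sum(v[j] for v in fragment_barcodes) / n for j in range(dims)]

    def distance_to_mean(v):
        # Assumption: Euclidean distance to the average fragment barcode.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, mean)))

    n_drop = int(n * fraction)  # assumption: round the cut-off down
    ranked = sorted(range(n), key=lambda i: distance_to_mean(fragment_barcodes[i]))
    return sorted(ranked[:n - n_drop])
```

The function would be applied per genome before the fragments are pooled, so that each genome loses the same share of its fragments.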

From the table, we can see that the binning accuracy (into the correct genomes) is high for fragment sizes of M = 1000 and above, at both the species and the genus level. We also see a drop in binning accuracy when the number of underlying genomes is increased from 30 to 100, which reflects the increased complexity of the problem as a function of the number of underlying genomes.

We have compared our binning performance with the published results of the best available algorithm, PhyloPythia [5]. Our comparison indicates that our algorithm gives consistently more accurate and more specific binning results across different fragment sizes. For example, at the species level, our algorithm has better than 50% accuracy on our test set when the fragment size is at least 2000 bps, while no binning results at the species level are given in McHardy et al. [5]. At the genus level, the accuracy of PhyloPythia (the average of binning specificity and sensitivity, extracted from Figure 1(a) and 1(b) in McHardy et al. [5]) is 45.5% for fragment size 1000, 56% for 2000, 74% for 5000 and 82.5% for 10000 (no data are provided for 500), all measured in terms of putting fragments into the correct genera, whereas our results are measured in terms of putting fragments into the correct genomes and are more accurate. It should be noted that the test set used by PhyloPythia is different from ours, which may affect the performance statistics somewhat, though we suspect the effect will be insignificant considering the sizes of the test sets. Another key difference between the two algorithms is that PhyloPythia is a supervised learning algorithm, which requires a training set, while our algorithm does not require a training set and hence is more general.

One thing worth noting is that a prokaryotic genome has, on average, ~13–14% abnormal sequence fragments when the fragment size is M = 1,000, suggesting that the theoretical limit of binning accuracy should be no better than 86–87%. Similarly, we expect that the theoretical limits of binning accuracy for binning based on 2,000-, 5,000- and 10,000-bp fragments should, in general, be no better than 87.36%, 87.58% and 88.4%, respectively.

Discussion and Conclusion

A natural question is "do all nucleotide sequences have the barcode property that genome sequences have?" The answer is no, based on the large number of randomly generated sequences that we have examined. Figure 1(e) shows a typical barcode of a random sequence generated using a zeroth-order Markov chain model. We found that none of the nucleotide sequences generated this way has the vertical band structures seen in genome barcodes. More generally, barcodes of genomes and of randomly generated nucleotide sequences have different characteristics, as shown in Figure 5.

The barcode analyses in this paper are mainly based on data from prokaryotes. Though we have applied the same barcode model to eukaryotes and made interesting observations, we suspect that the current barcoding scheme is not rich enough to capture all the complexity of eukaryotes. Further studies along this direction are clearly needed.

We believe that for many genome analysis problems, particularly for prokaryotic genomes, barcodes provide a natural, intuitive, information-rich and unified framework for studying them. Further applications of this capability to numerous genome analysis problems can be envisioned, such as phylogeny studies, particularly for genomes without obvious marker genes such as viruses; more thorough examination of different types of genomic regions in eukaryotes, their structures and organization; further studies of horizontal gene transfers; assistance in genome assembly of higher-order organisms (e.g., Populus, which we are currently working on); and possibly many more. We believe that we have only begun exploring the true power of this new capability for genome studies.

Table 1: Binning accuracies of our barcode-based clustering algorithm.

| Fragment size (FS) | 11 genomes, original | 11 genomes, filtered | 30 genomes, original | 30 genomes, filtered | 100 genomes, original | 100 genomes, filtered |
|---|---|---|---|---|---|---|
| 500 bps | 71.10% | 77.30% | 51.6% | 55.70% | 40.50% | 41.10% |
| 1000 bps | 79.90% | 85.90% | 65.30% | 70.30% | 51.10% | 52.60% |
| 2000 bps | 86.30% | 91.70% | 74.80% | 80.60% | 61.00% | 68.53% |
| 5000 bps | 91.10% | 98.10% | 86.60% | 93.20% | 79.40% | 81.90% |
| 10000 bps | 95.80% | 99.30% | 91.90% | 97.50% | 86.60% | 89.18% |

The binning accuracy is defined as (prediction specificity + prediction sensitivity)/2, and FS stands for fragment size. Both the specificity and the sensitivity are measured in terms of putting the fragments into the correct bin corresponding to each genome, defined by the majority of the fragments in the bin. The "original" columns list the binning accuracy of our algorithm on all the non-overlapping fragments in each group of genomes, and the "filtered" columns give the accuracy after removing the 10% of fragments with the most abnormal barcodes from each genome.

Methods

Mapping frequencies to grey levels

The frequency of each k-mer is mapped to a grey level as follows. We first count the frequency of each k-mer across all prokaryotic genomes, and sort the frequency list S[1:N(k)] in increasing order of frequency, with N(k) being the number of k-mers. We then find an integer L, L > 0, and partition S[1:N(k)] into L sub-lists so that the following function is minimized:

$$\sum_{i=1}^{L} \left(S_i - \bar{S}\right)^2,$$

where $S_i$ is the sum of all frequencies in the $i$-th sub-list, $\bar{S}$ is the average of S, and L is a parameter to be determined by the minimization result. For M = 1000 and k = 4, we found that L = 14 gives the best value of the above objective function. The computed partition of S gives a mapping of frequencies to grey levels. Note that this mapping is genome-independent, so each grey level in the barcodes has the same meaning in different genomes.
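As an illustration only, the partition step could be sketched in Python as follows. The function names `grey_level_bounds` and `grey_level` are hypothetical, the number of levels is passed in rather than searched for, the average is assumed to be the mean of the sub-list sums (total divided by the number of levels), and the dynamic program is just one way to find the best contiguous partition; none of this is taken from our program.

```python
from bisect import bisect_left
from itertools import accumulate


def grey_level_bounds(frequencies, num_levels):
    """Partition the sorted frequencies into num_levels contiguous sub-lists.

    Minimizes sum_i (S_i - S_bar)^2, where S_i is the sum of sub-list i and
    S_bar is ASSUMED to be total / num_levels. Returns the largest frequency
    of every sub-list except the last one.
    """
    s = sorted(frequencies)
    n = len(s)
    if not 1 <= num_levels <= n:
        raise ValueError("num_levels must be between 1 and len(frequencies)")
    prefix = [0] + list(accumulate(s))
    target = prefix[-1] / num_levels
    inf = float("inf")
    # cost[j][i]: best objective value for splitting s[:i] into j sub-lists.
    cost = [[inf] * (n + 1) for _ in range(num_levels + 1)]
    back = [[0] * (n + 1) for _ in range(num_levels + 1)]
    cost[0][0] = 0.0
    for j in range(1, num_levels + 1):
        for i in range(j, n + 1):
            for m in range(j - 1, i):  # sub-list j covers s[m:i]
                candidate = cost[j - 1][m] + (prefix[i] - prefix[m] - target) ** 2
                if candidate < cost[j][i]:
                    cost[j][i] = candidate
                    back[j][i] = m
    # Walk the back-pointers to recover where each sub-list starts.
    starts = []
    i = n
    for j in range(num_levels, 0, -1):
        i = back[j][i]
        starts.append(i)
    starts.reverse()
    return [s[start - 1] for start in starts[1:]]


def grey_level(frequency, bounds):
    """Map a frequency to a grey level in 0 .. len(bounds) (hypothetical helper)."""
    return bisect_left(bounds, frequency)


bounds = grey_level_bounds([1, 2, 3, 4, 5, 6, 9], 3)
print(bounds)                                      # [4, 6]
print([grey_level(f, bounds) for f in (1, 5, 9)])  # [0, 1, 2]
```

Because the bounds are computed once from the pooled frequency list, the same `bounds` can be reused for every genome, which mirrors the genome-independence of the mapping.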

Barcode similarity calculation

We define the distance (or dissimilarity) between two barcodes based on their simplified representations. The simplified representation of a barcode is a matrix with the same number of columns as the barcode and with the number of grey levels, L, used in barcode images as its number of rows; each element of the matrix is the frequency of the corresponding grey level in the corresponding column of the barcode. For two such matrices $M_1$ and $M_2$ with K columns and L rows, we define their barcode distance as

$$\sqrt{\sum_{i=1}^{K}\sum_{j=1}^{L}\bigl(M_1(i,j) - M_2(i,j)\bigr)^2},$$

where $M(i,j)$ denotes the entry for column $i$ and grey level $j$. Clearly this is a generalization of the Euclidean distance between two vectors of the averaged k-mer frequencies across each genome, widely used for genome comparisons as in the work of Karlin and colleagues [13,16,23,24] and many others. That Euclidean distance is equivalent to the special case of our barcode distance when L = 1. Figure S4 in Additional file 1 provides a comparison between the two distances.
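A minimal sketch of this calculation is given below, under stated assumptions: a barcode is held as a list of fragments, each fragment a list of integer grey levels (one per k-mer column); "frequency" is read as the fraction of fragments showing a grey level in a column; and the distance is taken to be the square root of the summed squared differences. The function names are hypothetical.

```python
import math


def simplified_representation(barcode, num_levels):
    """Return an L x K matrix: row = grey level, column = barcode column.

    Each entry is the fraction of fragments (ASSUMED meaning of "frequency")
    that show that grey level in that column.
    """
    num_fragments = len(barcode)
    num_columns = len(barcode[0])
    counts = [[0] * num_columns for _ in range(num_levels)]
    for row in barcode:
        for column, level in enumerate(row):
            counts[level][column] += 1
    return [[c / num_fragments for c in level_counts] for level_counts in counts]


def barcode_distance(m1, m2):
    """Euclidean-style distance between two simplified representations."""
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(m1, m2)
                         for a, b in zip(row1, row2)))


# Two toy barcodes with 2 fragments, 2 k-mer columns and 2 grey levels.
rep_a = simplified_representation([[0, 1], [0, 1]], num_levels=2)
rep_b = simplified_representation([[0, 1], [1, 1]], num_levels=2)
print(barcode_distance(rep_a, rep_b))  # sqrt(0.5), about 0.707
```

Using fractions rather than raw counts lets barcodes built from different numbers of fragments be compared on the same scale.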

Figure 5: Distribution of ratios between barcode variations of all prokaryotic genomes and their corresponding randomly generated nucleotide sequences. For each genome, a corresponding random nucleotide sequence is defined as a random sequence of the same length and with the same mono-nucleotide frequencies as those of the genome, generated using a zeroth-order Markov chain model. The variation of a barcode is defined as the standard deviation of the list of the averaged frequencies of all the k-mers along the genome. The x-axis is the ratio of the barcode variations between a genome and a corresponding random sequence, and the y-axis represents the frequency of cases with a particular variation ratio.
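The ratio plotted in Figure 5 can be illustrated with the rough sketch below. It makes several simplifying assumptions that are not from our program: k-mer frequencies are computed over the whole sequence instead of being averaged over fragments, a k-mer is not merged with its reverse complement, only the letters A, C, G and T are counted, and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from statistics import pstdev


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over the alphabet ACGT."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return [counts["".join(kmer)] / total for kmer in product("ACGT", repeat=k)]


def barcode_variation(seq, k=4):
    """Standard deviation of the list of k-mer frequencies."""
    return pstdev(kmer_frequencies(seq, k))


def random_counterpart(seq, rng):
    """Zeroth-order Markov chain: bases are drawn independently, using the
    mono-nucleotide frequencies of seq, to the same length as seq."""
    bases = "ACGT"
    weights = [seq.count(b) for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(seq)))


genome = "ACGT" * 250                      # toy, highly structured "genome"
shuffled = random_counterpart(genome, random.Random(0))
ratio = barcode_variation(genome) / barcode_variation(shuffled)
print(ratio > 1)                           # True
```

A ratio above 1 means the sequence uses its k-mers less evenly than a composition-matched random sequence would.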


Identification of abnormal fragments in a genome

We have used the following procedure to identify fragments in a genome whose barcodes differ substantially from the average barcode of the genome. The procedure consists of two key steps. First, for each k-mer, we select the fragments in the genome that have the highest or the lowest X% of this k-mer's frequency among all fragments, with X being a parameter. Then we sort all the fragments in increasing order of the number of times they are selected in the first step, termed function F(p), with p being the index of a fragment. Let $p_0$ be the fragment index having the highest second-order derivative of F(p). We consider all fragments p with F(p) > F($p_0$) to be the non-native fragments of the genome, as they have the largest number of k-mers with frequencies that are substantially different from the typical k-mer frequencies throughout the genome. We found that the abnormal fragment prediction is not very sensitive to the exact value of X within the range from 5 to 20, so we have chosen X = 10 as the default value in our program.

The rationale for this procedure is that fragments with higher F(p) values have more "abnormal" k-mer frequencies compared to the average k-mer frequencies in the genome, and hence are more likely to be non-native fragments. By examining the curve of the F(p) function, we found that it is convex with one sharp transition point $p_0$, marking the transition from the typical fragments to the "abnormal" fragments in the genome (see Additional file 1). Hence we have used this point as the separation point between the normal (or native) fragments and the "abnormal" fragments.
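The two steps can be sketched as follows. This is only an illustrative reading of the procedure: it assumes X% is taken at each tail separately, uses a discrete second difference in place of the second-order derivative, needs at least three fragments, and the function name `abnormal_fragments` is hypothetical.

```python
def abnormal_fragments(freqs, x_percent=10.0):
    """freqs[f][k] is the frequency of k-mer k in fragment f.

    Returns the indices of fragments whose selection count F exceeds the
    count at the sharpest bend of the sorted F curve.
    """
    n = len(freqs)
    num_kmers = len(freqs[0])
    tail = max(1, int(n * x_percent / 100))  # fragments taken per tail (assumption)

    # Step 1: count how often each fragment falls in a tail of some k-mer.
    counts = [0] * n
    for k in range(num_kmers):
        order = sorted(range(n), key=lambda f: freqs[f][k])
        for f in set(order[:tail]) | set(order[-tail:]):
            counts[f] += 1

    # Step 2: sort the counts to obtain the curve F(p).
    curve = sorted(counts)

    # Discrete second difference F(p+1) - 2 F(p) + F(p-1); its maximum marks p0.
    p0 = max(range(1, n - 1),
             key=lambda p: curve[p + 1] - 2 * curve[p] + curve[p - 1])
    threshold = curve[p0]
    return [f for f in range(n) if counts[f] > threshold]


# Nine similar fragments plus one outlier (index 9) over three k-mers.
toy = [[5.0 + 0.01 * ((f + k) % 9) for k in range(3)] for f in range(9)]
toy.append([9.0, 9.0, 9.0])
print(abnormal_fragments(toy))  # [9]
```

In the toy data the outlier is selected for every k-mer, while each of the other selected fragments is picked only once, so only the outlier lies beyond the bend of the curve.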

Metagenome binning algorithm

Our binning procedure starts with an application of the CLUMP program [25] to a given pool of fragments (not necessarily of the same lengths) to be clustered based on their barcode similarities. A unique feature of CLUMP is that it is quite accurate in identifying the core elements of each cluster, as we have previously demonstrated [25], though a weakness of the algorithm could be that it does not always handle the boundary elements accurately. Hence we have combined CLUMP with a K-means based clustering approach that we implemented. After identifying the initial clusters formed by CLUMP based on barcode similarities, and assuming that we know the number of clusters to be identified, we pick a seed from each predicted cluster randomly according to the density distribution of the cluster. Then we run the K-means algorithm using the selected seeds. For each pool of fragments, we run this two-step clustering algorithm multiple times, using a different set of seeds for each run. In deciding the number of runs, our rule of thumb, based on our experience working on the metagenome data, is to use 500 × (the number of clusters). For each given set of seeds, we run the K-means algorithm for 10,000 iterations. Among all the computed clustering results for each pool, we choose the clustering result $C_1, C_2, \ldots, C_K$ that minimizes the following function as the final binning result:

$$\sum_{i=1}^{K}\sum_{X \in C_i}\left(X - \bar{X}_i\right)^2,$$

where $C_1, C_2, \ldots, C_K$ is a partition of a given pool of metagenomic fragments, with each $C_i$ being a subset of the pool and $\bar{X}_i$ being the average of the barcodes of all fragments in $C_i$, i = 1,..., K.
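The restart-and-select part of this procedure can be sketched as below. This is not our implementation: CLUMP is not reproduced, so seeds are simply drawn at random from the pool as a stand-in for the CLUMP-guided seeding; barcodes are treated as flat numeric vectors; the selection function is read as the within-cluster sum of squared deviations from each cluster's average barcode; and all function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(points, seeds, max_iterations=10000):
    """Plain K-means started from the given seeds; stops once labels settle."""
    centers = [list(seed) for seed in seeds]
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(len(centers)),
                          key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_vector(members)
    return labels


def within_cluster_cost(points, labels, num_clusters):
    """Sum over clusters of squared deviations from the cluster average."""
    cost = 0.0
    for c in range(num_clusters):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            center = mean_vector(members)
            cost += sum(squared_distance(p, center) for p in members)
    return cost


def bin_fragments(points, num_clusters, num_runs, rng):
    """Run K-means num_runs times and keep the lowest-cost clustering."""
    best_cost, best_labels = None, None
    for _ in range(num_runs):
        seeds = rng.sample(points, num_clusters)  # stand-in for CLUMP seeding
        labels = k_means(points, seeds)
        cost = within_cluster_cost(points, labels, num_clusters)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


pool = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
cost, labels = bin_fragments(pool, num_clusters=2, num_runs=20,
                             rng=random.Random(0))
print(labels[:3], labels[3:])  # the two groups receive different labels
```

With the rule of thumb above, `num_runs` would be set to 500 times the number of clusters; the toy call uses 20 runs only to keep the example fast.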

Authors' contributions

Y.X. conceived the project. F.Z. and V.O. analyzed the data

and performed the experiments. Y.X. supervised this

project as P.I. and wrote the manuscript.

Additional material

Additional file 1
Supplementary material. Supplementary material 1–3.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S1.doc]

Additional file 2
Supplementary Table 1. HX denotes a highly expressed gene, HT a horizontally transferred gene, and PH a phage gene. UnknownGene consists of genes that lie within fragments with abnormal barcodes but do not belong to the above three categories.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S2.xls]

Additional file 3
Supplementary Table 2. Gene classifications of all the prokaryotic genomes.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S3.zip]

Additional file 4
Supplementary Table 3. Genomes used in the binning section.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-546-S4.xls]

Acknowledgements

This work was supported by the National Science Foundation (DBI-0354771, ITR-IIS-0407204, DBI-0542119, CCF0621700), a U.S. Department of Energy "BioEnergy Research Center" grant from the Office of Biological and Environmental Research in the DOE Office of Science, and a Distinguished Scholar grant from the Georgia Cancer Coalition. We would like to thank the two anonymous reviewers for their helpful comments on our work. We would also like to thank Ms Joan Yantko for preparing the manuscript.

References

1. Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI: Host-bacterial mutualism in the human intestine. Science 2005, 307(5717):1915-1920.
2. Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: the complexity hypothesis. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(7):3801-3806.
3. Frey TK: Neurological aspects of rubella virus infection. Intervirology 1997, 40(2–3):167-175.
4. Rybchin VN, Svarchevsky AN: The plasmid prophage N15: a linear DNA with covalently closed ends. Mol Microbiol 1999, 33(5):895-903.
5. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4(1):63-72.
6. Yang E, Bin W, Peng J, Zhang X, Wang J, Yang J, Dong J, Chu Y, Zhang J, Jin Q: Comparative genomics and phylogenetic analysis of S. dysenteriae subgroup. Sci China C Life Sci 2005, 48(4):406-413.
7. Trifonov EN, Sussman JL: The pitch of chromatin DNA is reflected in its nucleotide sequence. Proceedings of the National Academy of Sciences of the United States of America 1980, 77(7):3816-3820.
8. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. I. Oligonucleotide frequencies analysis. Molecular Biology 1986, 20:826-833.
9. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. II. Non-homogeneous Markov models. Molecular Biology 1986, 20:833-840.
10. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. III. Computer recognition of coding regions. Molecular Biology 1986, 20:1145-1150.
11. Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11(7):283-290.
12. Karlin S, Zhu ZY, Karlin KD: The extended environment of mononuclear metal centers in protein structures. Proceedings of the National Academy of Sciences of the United States of America 1997, 94(26):14225-14230.
13. Karlin S, Brocchieri L, Mrazek J, Campbell AM, Spormann AM: A chimeric prokaryotic ancestry of mitochondria and primitive eukaryotes. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(16):9190-9195.
14. Computed_barcodes [http://csbl.bmb.uga.edu/~ffzhou/BoDB/]
15. Supplementary_material [http://csbl.bmb.uga.edu/~ffzhou/BoDB/supp/]
16. Mrazek J, Bhaya D, Grossman AR, Karlin S: Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Res 2001, 29(7):1590-1601.
17. Karlin S, Mrazek J: Predicted highly expressed genes of diverse prokaryotic genomes. J Bacteriol 2000, 182(18):5238-5250.
18. Lima-Mendez G, Helden JV, Toussaint A, Leplae R: Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics 2008.
19. Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405(6784):299-304.
20. Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 1997, 44(4):383-397.
21. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC: The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2008:D475-479.
22. McHardy AC, Rigoutsos I: What's in the mix: phylogenetic classification of metagenome sequence samples. Current opinion in microbiology 2007, 10(5):499-503.
23. Karlin S, Mrazek J, Ma J, Brocchieri L: Predicted highly expressed genes in archaeal genomes. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(20):7303-7308.
24. Mrazek J, Karlin S: Detecting alien genes in bacterial genomes. Ann N Y Acad Sci 1999, 870:314-329.
25. Olman V, Mao F, Wu H, Xu Y: Parallel Clustering Algorithm for Large Data Sets with applications in Bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007 in press.
26. DeSantis TZ Jr, Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL: NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res 2006:W394-399.
27. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. Second edition. Cambridge, MA: The MIT Press; 2001.

