Article

Discriminate the Falsely Predicted Protein-Coding Genes in Aeropyrum Pernix K1 Genome Based on Graphical Representation


Abstract

The question of how many protein-coding genes exist in the Aeropyrum pernix K1 genome has puzzled scientists since 1999. In this paper, we attempt to re-identify the protein-coding genes in this genome by proposing a modified method based on the I-TN curve. All of the 727 experimentally validated protein-coding genes and 726 of the 727 corresponding negative samples are correctly predicted, giving a self-test accuracy of 99.93%. In the jackknife test, two positive samples and two negative samples are falsely predicted, giving a cross-validation accuracy of 99.72%. In the testing set, all of the 132 putative genes are correctly predicted as protein-coding and 14 of the 841 hypothetical genes are predicted as non-coding, so the number of protein-coding genes is reduced from 1700 to 1686. Further analysis shows that the performance of the re-annotation algorithm is comparable to that of other prevalent programs, while the present method is much simpler and more efficient. Applying the re-annotation algorithm trained on Aeropyrum pernix K1 to the Chlorobium tepidum TLS genome, 217 hypothetical genes are predicted as non-coding; extensive sequence analysis indicates that most of them are random sequences falsely predicted as protein-coding genes. In addition, we analyze the influence of artificial parameters on graphical representation approaches, which may provide helpful information for related research.
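The self-test and jackknife accuracies quoted above are leave-one-out style evaluations over the labelled coding and non-coding samples. Below is a minimal sketch of how such a jackknife accuracy is typically computed; the descriptor extraction and the nearest-centroid rule are illustrative placeholders and are not the authors' 18-variable I-TN classifier.

```python
# Hedged sketch: leave-one-out (jackknife) accuracy over labelled gene samples.
# The 18-dimensional I-TN descriptors and the classifier used by the authors are
# NOT reproduced here; `extract_descriptor` and the nearest-centroid rule below
# are illustrative placeholders.
import numpy as np

def extract_descriptor(seq: str) -> np.ndarray:
    """Placeholder numerical descriptor: normalized mononucleotide frequencies."""
    counts = np.array([seq.count(b) for b in "ACGT"], dtype=float)
    return counts / max(len(seq), 1)

def jackknife_accuracy(seqs, labels) -> float:
    """Leave-one-out test: each sample is predicted by a model trained on the rest."""
    X = np.array([extract_descriptor(s) for s in seqs])
    y = np.array(labels)
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out sample i
        centroids = {c: X[mask][y[mask] == c].mean(axis=0) for c in np.unique(y[mask])}
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        correct += int(pred == y[i])
    return correct / len(y)
```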


... The next data set Human Gene is the largest among considered benchmarks in terms of the number of examples and labels. Instances are human genes represented in the form of 36 real number descriptors [32]. The goal is to assign the gene expression (degrees of membership) for 68 diseases (classes). ...
... However, please note that all of 15 benchmark sets used in the evaluation process are related to practical problems originated from real-world demands, e.g. movies' rating distribution [5], gene expressions for diseases [32], landscape images classification [33], human faces emotion recognition [34,35] (2 data sets), or phylogenetic profile of yeasts genes [36] (10 data sets). ...
Article
Full-text available
Label Distribution Learning (LDL) is a new learning paradigm with numerous applications in various domains. It is a generalization of both standard multiclass classification and multilabel classification. Instead of a binary value, in LDL, each label is assigned a real number which corresponds to a degree of membership of the object being classified to a given class. In this paper a new neural network approach to Label Distribution Learning (Duo-LDL), which considers pairwise class dependencies, is introduced. The method is extensively tested on 15 well-established benchmark sets, against 6 evaluation measures, proving its competitiveness to state-of-the-art non-neural LDL approaches. Additional experimental results on artificially generated data demonstrate that Duo-LDL is especially effective in the case of most challenging benchmarks, with extensive input feature representations and numerous output classes.
... The 11th data set Human Gene is a large-scale data set collected from the medical experiments on the connections between human genes and diseases. Human Gene contains 17,892 human genes, each of which is represented by a 36-dimensional feature vector extracted by the method proposed in [46]. There are 68 different disease labels in total, and the normalized gene expression level for each disease is considered as the description degree of the corresponding disease label. ...
... Then, how to accurately predict the diverse genomic components has become one of the most important projects in the post-genome era (Kyrpides 2009; Petty 2010; Li et al. 2011; Liao et al. 2012). Even though gene prediction in prokaryotic genomes has been studied for more than 20 years, more and more recent studies indicate that protein-coding gene annotation errors are a universal phenomenon in public databases (Poptsova and Gogarten 2010; Bakke et al. 2009; Pallejà et al. 2008; Kisand and Lettieri 2013; Yu et al. 2014), including the problems of translational start site (TSS) prediction (Gao et al. 2010), protein-coding gene over-annotation (Nagy et al. 2008; Luo et al. 2009; Chen et al. 2008; Yu and Sun 2010; Yu et al. 2012; Wang et al. 2013; Guo et al. 2013) and missing genes (Warren et al. 2010; Qiu et al. 2010). ...
Article
Protein-coding gene annotation errors in prokaryotic genomes are accumulating continually in bioinformatics databases, while the update rate of genome annotation cannot keep up with the explosively growing number of genome sequences in most cases. Hence it is critical to manually rectify genome annotation errors. In this paper, a hybrid strategy combining ab initio gene-predicting programs with programs for re-annotating over-annotated genes is proposed for re-annotation of the protein-coding genes in prokaryotic genomes. Based on this strategy, the protein-coding genes in Geobacter sulfurreducens PCA are comprehensively re-annotated. As a consequence, 16 hypothetical genes are annotated as non-coding sequences and 104 missing genes are retrieved as protein-coding genes. Subsequent function and sequence analyses show that the predictions are reliable and robust. Further application to other genomes shows that this work can provide alternative tools for the later post-processing of prokaryotic genome annotations.
... The third dataset is a large-scale real-world dataset collected from the biological research on the relationship between human genes and diseases. There are in total 30,542 human genes included in this dataset, each of which is represented by the 36 numerical descriptors for a gene sequence proposed in [35]. The labels correspond to 68 different diseases. ...
Article
Although multi-label learning can deal with many problems with label ambiguity, it does not fit some real applications well where the overall distribution of the importance of the labels matters. This paper proposes a novel learning paradigm named label distribution learning (LDL) for such kind of applications. The label distribution covers a certain number of labels, representing the degree to which each label describes the instance. LDL is a more general learning framework which includes both single-label and multi-label learning as its special cases. This paper proposes six working LDL algorithms in three ways: problem transformation, algorithm adaptation, and specialized algorithm design. In order to compare their performance, six evaluation measures are suggested for LDL algorithms, and the first batch of label distribution datasets are collected and made publicly available. Experimental results on one artificial and two real-world datasets show clear advantage of the specialized algorithms, which indicates the importance of special design for the characteristics of the LDL problem.
... One of the typical re-annotation cases among archaea was Aeropyrum pernix K1, in which protein-coding genes were over-annotated by up to 60% by the original sequencing institute [4–6]. Fortunately, this major error has been corrected using proteome approaches and bioinformatics methods [7–10]. Amsacta moorei entomopoxvirus may have the most over-annotated protein-coding genes among sequenced viruses [11]. ...
Article
Full-text available
In this paper, we performed a comprehensive re-annotation of protein-coding genes by a systematic method combining composition- and similarity-based approaches in 10 complete bacterial genomes of the family Neisseriaceae. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, from 20 to 400 hypothetical proteins were assigned functions in each of the 10 strains based on the homology search. Among newly assigned functions, 397 are detailed enough to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found, and their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and more detailed functional information for hypothetical proteins. It will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after adaptation of the detailed methods, for checking annotations of any other bacterial or archaeal genomes.
Article
Label distribution learning (LDL) is a new machine learning paradigm that addresses label ambiguity by emphasizing the relevance of each label to a particular instance. In supervised learning, many LDL algorithms have been proposed, which often require a large amount of well-annotated training data to achieve good performance. However, annotating a label distribution is more complicated and expensive than annotating a single label or multiple labels with logical values of 0 and 1. Thus, we propose a projection graph embedding algorithm for semi-supervised label distribution learning (PGE-SLDL). Specifically, we seek a potential space by orthogonal neighborhood preserving projections, named capture space. This capture space is used to select more valuable features and construct a graph that contains more accurate data structure information. We utilize the sample correlation information contained between graph nodes to recover the unknown label distribution. In addition, compared with fixed graphs in traditional semi-supervised learning, we carry out projection and graph construction simultaneously to obtain a self-updating projection graph, which is more helpful to learn label distribution. The experimental results validate the effectiveness of the proposed algorithm.
Article
Label Distribution Learning (LDL) is a general learning framework that assigns an instance to a distribution over a set of labels rather than to a single label or multiple labels. Current LDL methods have proven their effectiveness in many real-life machine learning applications. However, LDL is a generalization of the classification task and as such it is exposed to the same problems as standard classification algorithms, including class imbalance, noise, overlapping classes or other irregularities. The purpose of this paper is to mitigate these effects by using decomposition strategies. The technique devised, called Decomposition-Fusion for LDL (DF-LDL), is based on one of the most renowned strategies in decomposition: the One-vs-One scheme, which we adapt to be able to deal with LDL datasets. In addition, we propose a competent fusion method that allows us to discard non-competent classifiers when their output is probably not of interest. The effectiveness of the proposed DF-LDL method is verified on several real-world LDL datasets on which we have carried out two types of experiments. First, comparing our proposal with the base learners and, second, comparing our proposal with the state-of-the-art LDL algorithms. DF-LDL shows significant improvements in both experiments.
Article
Label Distribution Learning (LDL) is a general learning framework that assigns an instance to a distribution over a set of labels rather than a single label or multiple labels. Current LDL methods have proven their effectiveness in many machine learning applications. Since the first formulation of the LDL problem, numerous studies have applied the LDL methodology to various real-life problems, while others have focused more specifically on proposing new algorithms. The purpose of this article is to start addressing the LDL problem from the data pre-processing stage. The baseline hypothesis is that, due to the high dimensionality of existing LDL data sets, it is very likely that the data will be incomplete and/or that poor data quality will lead to poor performance once applied to the learning algorithms. In this paper, we propose an oversampling method, which creates a superset of the original dataset by creating new instances from existing ones. Then, we apply already existing algorithms to the pre-processed training set in order to validate the efficacy of our method. The effectiveness of the proposed SSG-LDL is verified on several LDL datasets, showing significant improvements over the state-of-the-art LDL methods.
Article
Multi-label learning deals with training examples each represented by a single instance while associated with multiple class labels, and the task is to train a predictive model which can assign a set of proper labels to an unseen instance. Existing approaches employ the common assumption of equal labeling importance, i.e., all associated labels are regarded as relevant to the training instance while their relative importance in characterizing its semantics is not differentiated. Nonetheless, this common assumption does not reflect the fact that the importance degree of each relevant label is generally different, though the importance information is not directly accessible from the training examples. In this article, we show that it is beneficial to leverage the implicit relative labeling-importance (RLI) information to help induce a multi-label predictive model with strong generalization performance. Specifically, RLI degrees are formalized as a multinomial distribution over the label space, which can be estimated by either a global label propagation procedure or local k-nearest neighbor reconstruction. Correspondingly, the multi-label predictive model is induced by fitting modeling outputs with estimated RLI degrees along with multi-label empirical loss regularization. Extensive experiments clearly validate that leveraging implicit RLI information serves as a favorable strategy to achieve effective multi-label learning.
Article
Although multi-label learning can deal with many problems with label ambiguity, it does not fit some real applications well where the overall distribution of the importance of the labels matters. This paper proposes a novel learning paradigm named label distribution learning (LDL) for such kind of applications. The label distribution covers a certain number of labels, representing the degree to which each label describes the instance. LDL is a more general learning framework which includes both single-label and multi-label learning as its special cases. This paper proposes six working LDL algorithms in three ways: problem transformation, algorithm adaptation, and specialized algorithm design. In order to compare the performance of the LDL algorithms, six representative and diverse evaluation measures are selected via a clustering analysis, and the first batch of label distribution datasets are collected and made publicly available. Experimental results on one artificial and 15 real-world datasets show clear advantages of the specialized algorithms, which indicates the importance of special design for the characteristics of the LDL problem.
Article
Gene annotation plays a key role in subsequent biochemical and molecular biological studies of various organisms. There are some errors in the original annotation of sequenced genomes because of the lack of sufficient data, and these errors may propagate into other genomes. Therefore, genome annotation must be checked from time to time to evaluate newly accumulated data. In this study, we evaluated the gene density of 2606 bacteria or archaea, and identified 2 with extreme values, the minimum value (Chloroflexus aurantiacus strain J-10-fl) and maximum value (Natrinema sp J7-2), to conduct genome re-annotation. In the genome of C. aurantiacus strain J-10-fl, we identified 17 new genes with definite functions and eliminated 34 non-coding open-reading frames; in the genome of Natrinema sp J7-2, we eliminated 118 non-coding open reading frames. Our re-annotation procedure may provide a reference for improving the annotation of other bacterial genomes.
Article
More and more studies indicate that the issue of protein-coding gene finding in microbial genomes is far from thoroughly solved and the annotation quality has been questioned continuously in the past several years. In this paper, we summarize the computational methods for identifying the over-annotated genes and missing genes, and provide perspective for prospective gene finding works.
Article
The article examines the plume-ducting system design of a vertical launcher using a computational fluid dynamics tool. Rocket exhaust and air are considered as two different species with different thermodynamic properties, and their transport equations are solved. A highly under-expanded, supersonic, short-duration jet exhausting from a conical nozzle into a tube with an inside diameter slightly larger than the nozzle exit characterizes the flowfield of tube-launched rockets. The experimental condition of a static test of a rocket motor with cold gas as well as double-base solid rocket propellant inside a launcher tube, carried out by Batson and Bertin, is taken as the test case for validation. The nozzle configurations for the hot and cold cases are slightly different. The effect of the plume-exit area of the plenum on the jet structure is studied, and the pressure and temperature rise in the plenum are compared with test data.
Article
Full-text available
The complete sequence of the genome of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1, which optimally grows at 95°C, has been determined by the whole genome shotgun method with some modifications. The entire length of the genome was 1,669,695 bp. The authenticity of the entire sequence was supported by restriction analysis of long PCR products, which were directly amplified from the genomic DNA. As the potential protein-coding regions, a total of 2,694 open reading frames (ORFs) were assigned. By similarity search against public databases, 633 (23.5%) of the ORFs were related to genes with putative function and 523 (19.4%) to the sequences registered but with unknown function. All the genes in the TCA cycle except for that of alpha-ketoglutarate dehydrogenase were included, and instead of the alpha-ketoglutarate dehydrogenase gene, the genes coding for the two subunits of 2-oxoacid:ferredoxin oxidoreductase were identified. The remaining 1,538 ORFs (57.1%) did not show any significant similarity to the sequences in the databases. Sequence comparison among the assigned ORFs suggested that a considerable number of ORFs were generated by sequence duplication. The RNA genes identified were a single 16S–23S rRNA operon, two 5S rRNA genes and 47 tRNA genes including 14 genes with intron structures. All the assigned ORFs and RNA coding regions occupied 89.12% of the whole genome. The data presented in this paper are available on the internet homepage (http://www.mild.nite.go.jp).
Article
Full-text available
Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation. We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at 3'-end of a gene or point mutation at the stop codon (68), iv) Redundant gene predictions (4), v) 5' & 3'-end extension which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124). Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.
Article
Full-text available
Motivation: Biological sequences are regarded as an important object of study by many biologists, because a sequence contains a large amount of biological information, which is helpful for studies of biological cells, DNA and proteins. Currently, many researchers use methods based on protein sequences for function classification, subcellular location, structure and functional site prediction, including some machine-learning methods. The purpose of this article is to find a new, simpler and more effective way of sequence analysis. Results: According to the nature of the 64 genetic codons, we propose a simple and intuitive 2D graphical expression of protein sequences. Based on this expression, we give a new Euclidean-distance method to compute the distance between different sequences for the analysis of sequence similarity. This approach retains more sequence information. A typical phylogenetic tree constructed with this method demonstrates the effectiveness of our approach. Finally, we use this sequence-similarity method to predict protein subcellular localization on two commonly used datasets. The results show that the method is reasonable.
Article
Full-text available
Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
Article
Full-text available
A genome space is a moduli space of genomes. In this space, each point corresponds to a genome, and the natural distance between two points reflects the biological distance between the two genomes. Currently, there is no method to represent genomes as points in a space without losing biological information. Here, we propose a new graphical representation for DNA sequences. The breakthrough is that we can construct moment vectors from DNA sequences using this new graphical method and prove that the correspondence between moment vectors and DNA sequences is one-to-one. Using these moment vectors, we have constructed a novel genome space as a subspace of R(N). It allows us to show that SARS-CoV is most closely related to a coronavirus from the palm civet, not from a bird as initially suspected, and that the newly discovered human coronavirus HCoV-HKU1 is more closely related to SARS than to any other known member of group 2 coronaviruses. Furthermore, we reconstructed the phylogenetic tree for 34 lentiviruses (including human immunodeficiency virus) based on their whole genome sequences. Our genome space will provide a new powerful tool for analyzing the classification of genomes and their phylogenetic relationships.
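As a rough illustration of the moment-vector idea described above (the published graphical mapping and moment normalization are not reproduced here), one can map each base to a number and summarize the resulting profile by its first few moments; the BASE_VALUE mapping and the moment choice below are arbitrary demonstration assumptions.

```python
# Hedged sketch of a moment-vector summary of a DNA sequence.  The numeric base
# mapping and the moments computed below are illustrative choices, not the
# published genome-space construction.
import numpy as np

BASE_VALUE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}   # arbitrary example mapping

def moment_vector(seq: str, k: int = 4) -> np.ndarray:
    """Mean plus central moments of order 2..k of the numeric profile."""
    x = np.array([BASE_VALUE[b] for b in seq.upper() if b in BASE_VALUE], dtype=float)
    mu = x.mean()
    return np.array([mu] + [((x - mu) ** j).mean() for j in range(2, k + 1)])
```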
Article
Full-text available
As one of the human pathogens, the uropathogenic Escherichia coli strain CFT073 was sequenced and its genome published in 2002, which was significant for pathogenic bacterial genomics research. However, the current RefSeq annotation of this pathogen is now outdated to some degree, due to missing or misannotated essential genes associated with its virulence. We carried out a systematic reannotation by combining automated annotation tools with manual efforts to provide a comprehensive understanding of virulence for the CFT073 genome. The reannotation excluded 608 coding sequences from the RefSeq annotation. Meanwhile, a total of 299 coding sequences were newly added; about one third of them are found in genomic island (GI) regions, while more than one fifth of them are located in virulence-related pathogenicity islands (PAIs). Furthermore, a total of 341 genes had their translation initiation sites (TISs) relocated, which resulted in a high quality of gene start annotation. In addition, 94 pseudogenes annotated in RefSeq were thoroughly inspected and updated. The number of miscellaneous RNA genes (sRNAs) has been updated from 6 in RefSeq to 46 in the reannotation. Based on the adjustments in the reannotation, subsequent analyses were conducted through both general and case studies on new virulence factors or new virulence-associated genes that are crucial during the urinary tract infection (UTI) process, including invasion, colonization, nutrient uptake and population density control. Furthermore, miscellaneous RNAs collected in the reannotation are believed to contribute to the virulence of strain CFT073. The reannotation, including the nucleotide data, the original RefSeq annotation, and all reannotated results, is freely available via http://mech.ctb.pku.edu.cn/CFT073/. As a result, the reannotation presents a more comprehensive picture of the mechanisms of uropathogenicity of UPEC strain CFT073. The new genes change the view of its uropathogenicity in many respects, particularly the new genes in GI regions and new virulence-associated factors. The reannotation thus serves as an important resource by providing new information about genomic structure, organization and gene function. Moreover, we expect that the detailed analysis will facilitate studies exploring novel virulence mechanisms and help guide experimental design.
Article
Full-text available
Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.
Article
Full-text available
A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures — functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure — counting oligomers — is more effective than any of the more sophisticated measures. Different measures contain different information. However there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
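The "counting oligomers" measure highlighted above amounts to scoring a window by how well its k-mer counts fit coding versus non-coding k-mer frequency tables. A hedged sketch using hexamers follows; the frequency tables are assumed to have been estimated beforehand from training sequences and are hypothetical inputs here.

```python
# Hedged sketch of an oligomer-counting coding measure: score a window by the
# log-likelihood ratio of its hexamer counts under coding vs. non-coding hexamer
# frequency tables.  `coding_freq` and `noncoding_freq` are assumed inputs that
# would be estimated from training sequences.
import math
from collections import Counter

def hexamer_score(window: str, coding_freq: dict, noncoding_freq: dict, k: int = 6) -> float:
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    score = 0.0
    for mer, n in counts.items():
        p_cod = coding_freq.get(mer, 1e-6)       # small floor avoids log(0)
        p_non = noncoding_freq.get(mer, 1e-6)
        score += n * math.log(p_cod / p_non)
    return score                                  # positive favours 'coding'
```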
Article
Full-text available
It has often been suggested that differential usage of codons recognized by rare tRNA species, i.e. “rare codons”, represents an evolutionary strategy to modulate gene expression. In particular, regulatory genes are reported to have an extraordinarily high frequency of rare codons. From E.coli we have compiled codon usage data for highly expressed genes, moderately/lowly expressed genes, and regulatory genes. We have identified a clear and general trend in codon usage bias, from the very high bias seen in very highly expressed genes and attributed to selection, to a rather low bias in other genes which seems to be more influenced by mutation than by selection. There is no clear tendency for an increased frequency of rare codons in the regulatory genes, compared to a large group of other moderately/lowly expressed genes with low codon bias. From this, as well as a consideration of evolutionary rates of regulatory genes, and of experimental data on translation rates, we conclude that the pattern of synonymous codon usage in regulatory genes reflects primarily the relaxation of natural selection.
Article
Full-text available
A simple, effective measure of synonymous codon usage bias, the Codon Adaptation Index, is detailed. The index uses a reference set of highly expressed genes from a species to assess the relative merits of each codon, and a score for a gene is calculated from the frequency of use of all codons in that gene. The index assesses the extent to which selection has been effective in moulding the pattern of codon usage. In that respect it is useful for predicting the level of expression of a gene, for assessing the adaptation of viral genes to their hosts, and for making comparisons of codon usage in different organisms. The index may also give an approximate indication of the likely success of heterologous gene expression.
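A minimal sketch of the index as defined above: each codon's relative adaptiveness w is its frequency in the reference set of highly expressed genes divided by the frequency of the most-used synonymous codon, and the CAI of a gene is the geometric mean of w over its codons. The synonym grouping and reference codon counts are assumed inputs.

```python
# Hedged sketch of the Codon Adaptation Index.  `ref_codon_counts` (codon -> count
# in a reference set of highly expressed genes) and `synonym_groups` (lists of
# synonymous codons, e.g. ["TTT", "TTC"] for Phe) are assumed inputs.
import math

def relative_adaptiveness(ref_codon_counts: dict, synonym_groups: list) -> dict:
    w = {}
    for group in synonym_groups:
        best = max(ref_codon_counts.get(c, 0) for c in group)
        for c in group:
            w[c] = ref_codon_counts.get(c, 0) / best if best else 1.0
    return w

def cai(gene: str, w: dict) -> float:
    """Geometric mean of relative adaptiveness over the gene's codons."""
    codons = [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]
    logs = [math.log(max(w.get(c, 1.0), 1e-6)) for c in codons if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```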
Article
Full-text available
A protein is usually classified into one of the following five structural classes: alpha, beta, alpha + beta, alpha/beta, and zeta (irregular). The structural class of a protein is correlated with its amino acid composition. However, given the amino acid composition of a protein, how may one predict its structural class? Various efforts have been made in addressing this problem. This review addresses the progress in this field, with the focus on the state of the art, which is featured by a novel prediction algorithm and a recently developed database. The novel algorithm is characterized by a covariance matrix that takes into account the coupling effect among different amino acid components of a protein. The new database was established based on the requirement that the classes should have (1) as many nonhomologous structures as possible, (2) good quality structure, and (3) typical or distinguishable features for each of the structural classes concerned. The very high success rate for both the training-set proteins and the testing-set proteins, which has been further validated by a simulated analysis and a jackknife analysis, indicates that it is possible to predict the structural class of a protein according to its amino acid composition if an ideal and complete database can be established. It also suggests that the overall fold of a protein is basically determined by its amino acid composition.
Article
Full-text available
Biological processes in any living organism are based on selective interactions between particular biomolecules. In most cases, these interactions involve and are driven by proteins which are the main conductors of any living process within the organism. The physical nature of these interactions is still not well known. This paper represents a whole new view to biomolecular interactions, in particular protein-protein and protein-DNA interactions, based on the assumption that these interactions are electromagnetic in their nature. This new approach is incorporated in the Resonant Recognition Model (RRM), which was developed over the last 10 years. It has been shown initially that certain periodicities within the distribution of energies of delocalized electrons along a protein molecule are critical for protein biological function, i.e., interaction with its target. If protein conductivity was introduced, then a charge moving through protein backbone can produce electromagnetic irradiation or absorption with spectral characteristics corresponding to energy distribution along the protein. The RRM enables these spectral characteristics, which were found to be in the range of infrared and visible light, to be calculated. These theoretically calculated spectra were proved using experimentally obtained frequency characteristics of some light-induced biological processes. Furthermore, completely new peptides with desired spectral characteristics, and consequently corresponding biological activities, were designed.
Article
Full-text available
While most organisms grow at temperatures ranging between 20 and 50 °C, many archaea and a few bacteria, such as Pyrococcus or Aquifex, have been found capable of withstanding temperatures close to 100 °C or beyond. Here we report the results of two independent large-scale unbiased approaches to identify global protein properties correlating with an extreme thermophile lifestyle. First, we performed a comparative proteome analysis using 30 complete genome sequences from the three kingdoms. A large difference between the proportions of charged versus polar (noncharged) amino acids was found to be a signature of all hyperthermophilic organisms. Second, we analyzed the water-accessible surfaces of 189 protein structures belonging to mesophiles or hyperthermophiles. We found that the surfaces of hyperthermophilic proteins exhibited the shift already observed at the genomic level, i.e. the proportion of solvent-accessible charged residues strongly increased at the expense of polar residues. The biophysical requirement for the presence of charged residues at the protein surface, allowing protein stabilization through ion bonds, is therefore clearly imprinted and detectable in all genome sequences available to date.
Article
Full-text available
Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is imbedded in the approach. We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists, regular update and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence in reducing the unfortunate spreading of errors through centralized data libraries).
Article
Full-text available
The goal of the NCBI Reference Sequence (RefSeq) project is to provide the single best non-redundant and comprehensive collection of naturally occurring biological molecules, representing the central dogma. Nucleotide and protein sequences are explicitly linked on a residue-by-residue basis in this collection. Ideally all molecule types will be available for each well-studied organism, but the initial database collection pragmatically includes only those molecules and organisms that are most readily identified. Thus different amounts of information are available for different organisms at any given time. Furthermore, for some organisms additional intermediate records are provided when the genome sequence is not yet finished. The collection is supplied by NCBI through three distinct pipelines in addition to collaborations with community groups. The collection is curated on an ongoing basis. Additional information about the NCBI RefSeq project is available at http://www.ncbi.nih.gov/RefSeq/.
Article
Full-text available
Clusters or runs of purines on the mRNA synonymous strand have been found in many different organisms including orthopoxviruses. The purine bias that is exhibited by these clusters can be observed using a purine skew and, in the case of poxviruses, these skews can be used to help determine the coding strand of a particular segment of the genome. Combined with previous findings that minor ORFs have lower than average aspartate and glutamate composition and higher than average serine composition, purine content can be used to predict the likelihood of a poxvirus ORF being a "real gene". Using purine skews and a "quality" measure designed to incorporate previous findings about minor ORFs, we have found that in our training case (vaccinia virus strain Copenhagen), 59 of 65 minor (small and unlikely to be real genes) ORFs were correctly classified as being minor. Of the 201 major (large and likely to be real genes) vaccinia ORFs, 192 were correctly classified as being major. Performing a similar analysis with the entomopoxvirus Amsacta moorei (AMEV), it was found that 4 major ORFs were incorrectly classified as minor and 9 minor ORFs were incorrectly classified as major. The purine abundance observed for major ORFs in vaccinia virus was found to stem primarily from the first codon position, with both the second and third codon positions containing roughly equal amounts of purines and pyrimidines. Purine skews and a "quality" measure can be used to predict functional ORFs, and purine skews in particular can be used to determine which of two overlapping ORFs is most likely to be the real gene if neither of the two ORFs has orthologs in other poxviruses.
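A purine skew of the kind used above can be sketched as a cumulative walk that rises on purines and falls on pyrimidines; the "quality" measure built from amino acid composition is not reproduced here.

```python
# Hedged sketch of a cumulative purine skew: walk along the sequence, adding +1 for
# a purine (A or G) and -1 for a pyrimidine (C or T).  Rising stretches indicate
# purine-rich regions on the strand being read, which the study above uses as a
# hint for the coding strand.
def purine_skew(seq: str) -> list:
    skew, total = [], 0
    for base in seq.upper():
        if base in "AG":
            total += 1
        elif base in "CT":
            total -= 1
        skew.append(total)
    return skew
```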
Article
Full-text available
We analyzed the proteome of a crenarchaeon, Aeropyrum pernix K1, by using the following four methods: (i) two-dimensional PAGE followed by MALDI-TOF MS, (ii) one-dimensional SDS-PAGE in combination with two-dimensional LC-MS/MS, (iii) multidimensional LC-MS/MS, and (iv) two-dimensional PAGE followed by amino-terminal amino acid sequencing. These methods were found to be complementary to each other, and biases in the data obtained in one method could largely be compensated by the data obtained in the other methods. Consequently a total of 704 proteins were successfully identified, 134 of which were unique to A. pernix K1, and 19 were not described previously in the genomic annotation. We found that the original annotation of the genomic data of this archaeon was not adequate, in particular with respect to proteins of 10-20 kDa in size, many of which were described as hypothetical. Furthermore, the amino-terminal amino acid sequence analysis indicated that, surprisingly, the translation of 52% of their genes starts with TTG in contrast to ATG (28%) and GTG (20%). Thus, A. pernix K1 is the first example of an organism in which TTG is the most predominant translational initiation codon.
Article
Full-text available
In this paper, a revision of the existing method of locating exons by the genomic signal processing technique employing four binary indicator sequences is presented. The existing method relies on the pronounced period-three peaks observed in the Fourier power spectrum of exon regions, which are absent in non-coding regions. The authors have abandoned the four sequences altogether and adopted a single 'EIIP indicator sequence', which is formed by substituting the electron-ion interaction pseudopotentials (EIIP) of the nucleotides A, G, C and T in the DNA sequence, reducing the computational overhead by 75%. The power spectrum of this sequence reveals period-three peaks for exon regions. Also a number of exons have been identified which exhibit period-three peaks when mapped to the EIIP indicator sequence and which do not show the same when the binary indicator sequences are employed. We could get better discrimination between exon areas and non-coding areas of a number of genomes when the sequences are mapped to EIIP indicator sequences and the power spectra are taken in a sliding Kaiser window, compared to the existing method using a rectangular window which utilizes binary indicator sequences.
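A hedged sketch of the EIIP indicator-sequence idea: map each nucleotide to its electron-ion interaction pseudopotential, take the Fourier power spectrum of a window, and read off the power at the period-three frequency bin, which tends to be elevated in coding regions. The EIIP values below are those commonly quoted in this literature, and the sliding Kaiser window of the original method is omitted.

```python
# Hedged sketch of the EIIP indicator-sequence approach: the power at frequency
# N/3 of a window's EIIP profile is used as a coding indicator.  EIIP values are
# those commonly quoted in the literature; treat them as assumptions.
import numpy as np

EIIP = {"A": 0.1260, "G": 0.0806, "C": 0.1340, "T": 0.1335}

def period3_power(window: str) -> float:
    x = np.array([EIIP.get(b, 0.0) for b in window.upper()])
    x = x - x.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.fft(x)) ** 2
    return float(spectrum[len(x) // 3])   # power at the period-3 frequency bin
```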
Article
One important task in the study of genome sequences is to determine densities of specific nucleotides and to understand the implications for exons or coding regions. Mathematical analysis of the large volume of genomic DNA sequence data is one of the challenges for bio-scientists. In this manuscript, we introduce a novel method for visualizing and analyzing DNA sequences; applications to mutation analysis and similarity analysis are presented in detail based on the DC-curve (Delta coding curve).
Article
A new system, ZCURVE 1.0, for finding protein-coding genes in bacterial and archaeal genomes has been proposed. The current algorithm, which is based on the Z curve representation of the DNA sequences, lays stress on the global statistical features of protein-coding genes by taking the frequencies of bases at three codon positions into account. In ZCURVE 1.0, since only 33 parameters are used to characterize the coding sequences, it gives better consideration to both typical and atypical cases, whereas in Markov-model-based methods, e.g. Glimmer 2.02, thousands of parameters are trained, which may result in less adaptability. To compare the performance of the new system with that of Glimmer 2.02, both systems were run, respectively, for 18 genomes not annotated by the Glimmer system. Comparisons were also performed for predicting some function-known genes by both systems. Consequently, the average accuracy of both systems is well matched; however, ZCURVE 1.0 has more accurate gene start prediction, lower additional prediction rate and higher accuracy for the prediction of horizontally transferred genes. It is shown that the joint applications of both systems greatly improve gene-finding results. For a typical genome, e.g. Escherichia coli, the system ZCURVE 1.0 takes ∼2 min on a Pentium III 866 PC without any human intervention. The system ZCURVE 1.0 is freely available at: http://tubic.tju.edu.cn/Zcurve_B/.
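For orientation, the phase-specific Z-curve variables that ZCURVE-style gene finders build on can be sketched as follows; this is only the 9-component core (base frequencies at the three codon positions mapped through the Z transform), not the full 33-parameter model of ZCURVE 1.0.

```python
# Hedged sketch of phase-specific Z-curve variables: base frequencies are tallied
# separately at the three codon positions and mapped through the Z transform
# (purine/pyrimidine, amino/keto, strong/weak hydrogen bonding).
def z_curve_features(orf: str):
    features = []
    for phase in range(3):                              # codon positions 1, 2, 3
        bases = orf.upper()[phase::3]
        n = max(len(bases), 1)
        a, c, g, t = (bases.count(b) / n for b in "ACGT")
        features += [
            (a + g) - (c + t),   # x: purine vs. pyrimidine
            (a + c) - (g + t),   # y: amino vs. keto
            (a + t) - (g + c),   # z: weak vs. strong hydrogen bonding
        ]
    return features                                     # 9 phase-specific components
```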
Article
According to the physicochemical property of the base at the first site, the 16 kinds of dinucleotides are classified into four groups. Based on such a classification, we propose a novel graphical representation of DNA sequences without loss of information due to overlapping and crossing of the curve with itself. This representation allows direct inspection of the composition and distribution of dinucleotides and visual recognition of similarities/dissimilarities among different sequences. A 6D vector is exploited as a quantitative descriptor from this representation, which can display both the global and local features of DNA sequences in a 6D phase space. Applications to similarity/dissimilarity analysis of the complete coding sequences of E globin genes of eleven species illustrate their utility.
Article
According to a partial order constructed on a selected pair of physico-chemical properties of amino acids, we present a novel 2D graphical representation of protein sequences, called an H-L curve. The representation is mathematically proven to contain no circuit (i.e., to be without any degeneracy) and to be associated with protein sequences in a one-to-one manner. In addition, our graphical curves allow a more convenient visual inspection of protein sequence alignments. We illustrate our approach with two examples.
Article
Three distances for assessing genomic similarity based on dinucleotide frequencies in large DNA sequences are introduced. The method requires neither homologous sequences nor prior sequence alignments. The analysis centers on symmetrized dinucleotide frequencies reflecting DNA structures related to dinucleotide stacking energies and constraints on DNA curvature. To show the utility of the method, we use these distances to examine the similarities among exon 1 of the β-globin gene for 11 different species.
Article
In this paper a method for assessing DNA similarity based on dinucleotide frequencies in DNA sequence is introduced. The method does not require prior sequence alignments. The analysis centers on dinucleotide frequencies in DNA sequences and distances between sequences based on Frobenius norm of covariance matrices of dinucleotide frequencies. Analysis shows an overall qualitative agreement among similarities for the beta globin exon 1 sequences of 11 species.
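A hedged sketch in the spirit of the two dinucleotide-frequency distances above: build a 4x4 dinucleotide frequency matrix per sequence and compare sequences by the Frobenius norm of the difference. The published distance is defined on covariance matrices of these frequencies, which is not reproduced here.

```python
# Hedged sketch of an alignment-free dinucleotide-frequency distance.  This only
# illustrates the flavour of the method; the published distance operates on
# covariance matrices of the dinucleotide frequencies.
import numpy as np

BASES = "ACGT"
INDEX = {b: i for i, b in enumerate(BASES)}

def dinucleotide_matrix(seq: str) -> np.ndarray:
    m = np.zeros((4, 4))
    s = [b for b in seq.upper() if b in INDEX]
    for x, y in zip(s, s[1:]):
        m[INDEX[x], INDEX[y]] += 1
    return m / max(m.sum(), 1)

def frobenius_distance(seq1: str, seq2: str) -> float:
    # np.linalg.norm on a matrix defaults to the Frobenius norm
    return float(np.linalg.norm(dinucleotide_matrix(seq1) - dinucleotide_matrix(seq2)))
```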
Article
Stenotrophomonas maltophilia strain R551-3 is a multiple-antibiotic-resistant opportunistic human pathogen involved in nosocomial infections. It has a widely distributed GC-rich (>66%) genome. Analysis of differential expression of the genes of this genome reveals that the majority of genes belonging to the highly expressed category are present on the lagging strand without showing any strand-specific codon usage bias. A relatively small number of lowly expressed genes is equally distributed on the leading and lagging strands, with a difference in codon usage pattern between them. Among the several multidrug resistance genes of S. maltophilia in the lowly expressed category, some are predicted to be horizontally transferred. It can be inferred that horizontally transferred genes may have been imported into this genome to support its pathogenic mode of living. Our study may help to modify the expression level of target genes of this human pathogen in order to control its infection.
Article
In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length distribution of the annotated genes with the length distribution of those matching a known protein reveals that too many short genes are annotated in many genomes. Here we estimate the true number of protein-coding genes for sequenced genomes. Although it is often claimed that Escherichia coli has about 4300 genes, we show that it probably has only ∼3800 genes, and that a similar discrepancy exists for almost all published genomes.
Article
The nucleotide frequencies in the second codon positions of genes are remarkably different for the coding regions that correspond to different secondary structures in the encoded proteins, namely, helix, β-strand and aperiodic structures. Indeed, hydrophobic and hydrophilic amino acids are encoded by codons having U or A, respectively, in their second position. Moreover, the β-strand structure is strongly hydrophobic, while aperiodic structures contain more hydrophilic amino acids. The relationship between nucleotide frequencies and protein secondary structures is associated not only with the physico-chemical properties of these structures but also with the organisation of the genetic code. In fact, this organisation seems to have evolved so as to preserve the secondary structures of proteins by preventing deleterious amino acid substitutions that could modify the physico-chemical properties required for an optimal structure.
Article
Over-annotation of protein-coding genes is a common phenomenon in microbial genomes, and the genome of Amsacta moorei entomopoxvirus (AmEPV) is a typical case, because more than 63% of its annotated ORFs are hypothetical. In this article, we propose an improved graphical representation, the I-TN curve (improved curve based on trinucleotides), which allows direct inspection of the composition and distribution of codons and of asymmetric gene structure. This improved graphical representation can also provide convenient tools for genome analysis. From this representation, 18 variables are extracted as numerical descriptors to represent the specific features of protein-coding genes quantitatively, with which we reannotate the protein-coding genes in several viral genomes. Using the parameters trained on the experimentally validated genes, all of the 30 experimentally validated genes and 63 putative genes in the AmEPV genome are recognized correctly as protein-coding, and the accuracies of the present method for the self-test and cross-validation are both 100%. Twenty-eight annotated hypothetical genes are predicted as non-coding, so the number of reannotated protein-coding genes in AmEPV should be 266 instead of the 294 reported in the original annotation. Extending the present method trained on AmEPV directly to other entomopoxvirus genomes, such as Melanoplus sanguinipes entomopoxvirus (MsEPV), all of the 123 annotated function-known and putative genes are recognized correctly as protein-coding, and 17 hypothetical genes are recognized as non-coding. The present method could also be extended to other genomes, with or without adaptation of the training sets, with high accuracy.
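To make the trinucleotide-based descriptors concrete, the sketch below tallies codon usage separately in the three reading frames of a sequence, which is the kind of frame-asymmetric statistic such curves visualize; the specific 18 I-TN variables are not reproduced.

```python
# Hedged sketch of the raw material behind trinucleotide-curve descriptors: codon
# usage tallied separately in the three reading frames.  The 18 numerical I-TN
# descriptors themselves are NOT reproduced here.
from collections import Counter

def frame_codon_frequencies(seq: str):
    seq = seq.upper()
    freqs = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        counts = Counter(c for c in codons if set(c) <= set("ACGT"))
        total = sum(counts.values()) or 1
        freqs.append({codon: n / total for codon, n in counts.items()})
    return freqs   # list of three codon-frequency dictionaries, one per frame
```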
Article
A two-dimensional (2D) graphical representation of protein sequences based on six physicochemical properties of amino acids is outlined. The numerical characterization of the protein graphs is given as descriptors of protein sequences. It is useful not only for comparative study of proteins but also for encoding innate information about the structure of proteins. The coefficient of determination is proposed as a new similarity/dissimilarity measure. Finally, a simple example is taken to highlight the behavior of the new similarity/dissimilarity measure on protein sequences taken from the ND6 (NADH dehydrogenase subunit 6) proteins of eight different species. The results demonstrate that the approach is convenient, fast, and efficient.
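The coefficient-of-determination similarity proposed above can be sketched as the R^2 of a simple least-squares fit between two descriptor vectors; the descriptor extraction from the physicochemical-property graph is not reproduced and the vectors are assumed inputs.

```python
# Hedged sketch: coefficient of determination (R^2 of a linear fit between two
# descriptor vectors) used as a similarity score.  Descriptor extraction is assumed.
import numpy as np

def coefficient_of_determination(u, v) -> float:
    u, v = np.asarray(u, float), np.asarray(v, float)
    slope, intercept = np.polyfit(u, v, 1)        # least-squares line v ~ slope*u + b
    residuals = v - (slope * u + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((v - v.mean()) ** 2)
    return 1.0 - ss_res / ss_tot if ss_tot else 1.0
```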
Article
In this paper, a novel 3D graphical representation of DNA sequence based on trinucleotides is proposed. This representation allows direct inspection of composition as well as distribution of trinucleotides in DNA sequence for the first time and avoids loss of information, from which one can obtain more information. Based on this novel model, six numerical descriptors of DNA sequence are deduced without complicated calculations, and the applications in similarities/dissimilarities analysis of coding sequences and conserved genes discrimination illustrate their utilities. In addition, two simple methods for similarities/dissimilarities analysis of coding sequences among different species are exploited by using two vectors composed of 64 and six components, respectively, which can provide convenient sequence alignment tools for both computational scientists and molecular biologists.
Article
According to the three classifications of nucleotides, we introduce a binary coding method for RNA secondary structures. On the basis of this representation, we can reduce an RNA secondary structure to three binary digit sequences. We also propose coding rules based on the exclusive-OR operation. Using the proposed coding rules, we can judge mutations between bases or between a base and a base pair, and perform sequence alignment easily.
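A hedged sketch of a binary coding under the three standard nucleotide classifications (purine/pyrimidine, amino/keto, strong/weak), compared channel-wise with exclusive-OR; the paper's specific coding of secondary-structure elements and its alignment rules are not reproduced.

```python
# Hedged sketch: encode a base sequence as three binary channels (assumed here to
# be the standard purine/pyrimidine, amino/keto, strong/weak classifications) and
# count, via XOR, the positions where two sequences differ in each channel.
CHANNELS = {
    "RY": set("AG"),   # purine = 1, pyrimidine = 0
    "MK": set("AC"),   # amino  = 1, keto       = 0
    "SW": set("GC"),   # strong = 1, weak       = 0
}

def binary_code(seq: str, channel: str) -> list:
    members = CHANNELS[channel]
    return [1 if b in members else 0 for b in seq.upper()]

def xor_differences(seq1: str, seq2: str) -> dict:
    return {
        name: sum(a ^ b for a, b in zip(binary_code(seq1, name), binary_code(seq2, name)))
        for name in CHANNELS
    }
```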
Article
The question of how many protein-coding genes there are in the genome of Aeropyrum pernix K1 has puzzled scientists since its sequencing in 1999. In this paper, the protein-coding genes in A. pernix K1 are identified from the original and current NITE annotations by using the Aper_ORFs method. Consequently, 702 of 704 experimentally validated genes are correctly predicted as coding, which means the sensitivity of the method is 702/704, approximately 99.7%. This sensitivity is one percent higher than that of the versatile bacterial gene-finding program ZCURVE 1.0. The number of genes determined in this work is 1699, which is very close to that of the current NITE annotation, 1700. Therefore, the two independent predictions may end the decade-long controversy about the gene number in this genome. Furthermore, the Aper_ORFs method is extended to identify protein-coding genes in the genome of Chlorobium tepidum TLS, and about 98% of the function-known genes are correctly predicted as coding. In addition, 188 hypothetical ORFs are identified as non-coding in that genome. Mapping point analysis shows that these ORFs have a base frequency distribution different from that of function-known genes, suggesting that most of them do not encode proteins. It is hoped that the Aper_ORFs method will become a useful tool for gene annotation in newly sequenced bacterial and archaeal genomes, as long as their G+C content is similar to that of A. pernix.
Article
We evaluate a number of computer programs designed to predict the structure of protein coding genes in genomic DNA sequences. Computational gene identification is set to play an increasingly important role in the development of the genome projects, as emphasis turns from mapping to large-scale sequencing. The evaluation presented here serves both to assess the current status of the problem and to identify the most promising approaches to ensure further progress. The programs analyzed were uniformly tested on a large set of vertebrate sequences with simple gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The results indicated that the predictive accuracy of the programs analyzed was lower than originally reported. The accuracy was even lower when considering only those sequences that had recently been entered and that did not show any similarity to previously entered sequences. This indicates that the programs are overly dependent on the particularities of the examples they learn from. For most of the programs, accuracy in this test set ranged from 0.60 to 0.70 as measured by the Correlation Coefficient (where 1.0 corresponds to a perfect prediction and 0.0 is the value expected for a random prediction), and the average percentage of exons exactly identified was less than 50%. Only those programs including protein sequence database searches showed substantially greater accuracy. The accuracy of the programs was severely affected by relatively high rates of sequence errors. Since the set on which the programs were tested included only relatively short sequences with simple gene structure, the accuracy of the programs is likely to be even lower when used for large uncharacterized genomic sequences with complex structure. While in such cases currently available programs may still be of great use in pinpointing the regions likely to contain exons, they are far from powerful enough to elucidate the genomic structure of such sequences completely.
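The nucleotide-level Correlation Coefficient cited above can be computed from the counts of correctly and incorrectly predicted coding nucleotides; a minimal sketch with placeholder counts:

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """Nucleotide-level correlation coefficient between predicted and true coding labels.

    1.0 = perfect prediction, 0.0 = expectation for a random prediction.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Placeholder counts of coding/non-coding nucleotides in a hypothetical test sequence
print(correlation_coefficient(tp=800, tn=2900, fp=250, fn=220))
```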
Article
The number of completely sequenced bacterial genomes has been growing fast. There are computer methods available for finding genes, but there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. We also used a specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets, including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, high gene-finding accuracy was observed even when Markov models of order zero, one, and two were used. We present an analysis of false positive and false negative predictions, with the caution that these categories are not precisely defined if the public database annotation is used as a control.
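The idea of modeling gene boundaries as transitions between hidden states can be illustrated with a deliberately tiny two-state Viterbi decoder; the emission and transition probabilities below are invented for illustration and bear no relation to GeneMark.hmm's actual high-order models.

```python
import math

STATES = ("coding", "noncoding")

# Illustrative parameters only: a real gene finder such as GeneMark.hmm uses
# high-order Markov models and many more states, not single-base emissions.
TRANS = {
    "coding":    {"coding": 0.98, "noncoding": 0.02},
    "noncoding": {"coding": 0.02, "noncoding": 0.98},
}
EMIT = {
    "coding":    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}
START = {"coding": 0.5, "noncoding": 0.5}

def viterbi(seq):
    """Most probable state path; gene boundaries appear where the label switches."""
    seq = seq.upper()
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = [{}]
    for base in seq[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev, best = max(
                ((p, v[-1][p] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda t: t[1],
            )
            scores[s] = best + math.log(EMIT[s][base])
            pointers[s] = prev
        v.append(scores)
        back.append(pointers)
    path = [max(STATES, key=lambda s: v[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# The GC-rich middle should come out labelled with the "coding" state
labels = viterbi("ATATATAT" + "GCGCGCGCGCGC" + "ATATATAT")
print("".join("C" if s == "coding" else "N" for s in labels))
```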
Article
The complete sequence of the genome of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1, which grows optimally at 95 degrees C, has been determined by the whole genome shotgun method with some modifications. The entire length of the genome is 1,669,695 bp. The authenticity of the entire sequence was supported by restriction analysis of long PCR products, which were directly amplified from the genomic DNA. A total of 2,694 open reading frames (ORFs) were assigned as potential protein-coding regions. By similarity search against public databases, 633 (23.5%) of the ORFs were related to genes with putative function and 523 (19.4%) to registered sequences of unknown function. All the genes in the TCA cycle except that of alpha-ketoglutarate dehydrogenase were included; instead of the alpha-ketoglutarate dehydrogenase gene, the genes coding for the two subunits of 2-oxoacid:ferredoxin oxidoreductase were identified. The remaining 1,538 ORFs (57.1%) did not show any significant similarity to the sequences in the databases. Sequence comparison among the assigned ORFs suggested that a considerable number of ORFs were generated by sequence duplication. The RNA genes identified were a single 16S-23S rRNA operon, two 5S rRNA genes and 47 tRNA genes, including 14 genes with intron structures. All the assigned ORFs and RNA coding regions occupy 89.12% of the whole genome. The data presented in this paper are available on the internet homepage (http://www.mild.nite.go.jp).
Article
The relationship between synonymous codon usage and protein secondary structural elements (alpha helices and beta sheets) was reinvestigated by taking structural information of proteins from the Protein Data Bank (PDB) and their corresponding mRNA sequences from GenBank for four different organisms: E. coli, B. subtilis, S. cerevisiae, and Homo sapiens. It was observed that synonymous codon families have non-random codon usage, but that no species-invariant universal correlation exists between synonymous codon usage and protein secondary structural elements. The secondary structural units of proteins can be distinguished from the occurrences of bases at the second codon position.
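The second-codon-position signal referred to above amounts to a simple base tally per codon; a sketch with a placeholder coding sequence:

```python
from collections import Counter

def second_position_composition(cds):
    """Frequencies of A, C, G, T at the second position of each codon."""
    cds = cds.upper()
    second_bases = [cds[i + 1] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(second_bases)
    total = sum(counts.values()) or 1
    return {b: counts.get(b, 0) / total for b in "ACGT"}

# Placeholder coding sequence (must be read in frame, starting at the first base)
print(second_position_composition("ATGGCTGAAAAACTGCTGGAATAA"))
```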
Article
The Z curve is a three-dimensional space curve constituting the unique representation of a given DNA sequence in the sense that each can be uniquely reconstructed from the other. Based on the Z curve, a new protein coding gene-finding algorithm specific for the yeast genome at better than 95% accuracy has been proposed. Six cross-validation tests were performed to confirm the above accuracy. Using the new algorithm, the number of protein coding genes in the yeast genome is re-estimated. The estimate is based on the assumption that the unknown genes have similar statistical properties to the known genes. It is found that the number of protein coding genes in the 16 yeast chromosomes is ≤5645, significantly smaller than the 5800-6000 which is widely accepted, and much larger than the 4800 estimated by another group recently. The mitochondrial genes were not included into the above estimate. A codingness index called the YZ score (YZ ∈ [0, 1]) is proposed to recognize protein coding genes in the yeast genome. Among the ORFs annotated in the MIPS (Munich Information Centre for Protein Sequences) database, those recognized as non-coding by the present algorithm are listed in this paper in detail. The criterion for a coding or non-coding ORF is simply decided by YZ > 0.5 or YZ < 0.5, respectively. The YZ scores for all the ORFs annotated in the MIPS database have been calculated and are available on request by sending e-mail to the corresponding author.
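The Z curve itself has a compact definition that is easy to reproduce; the sketch below computes the three cumulative components (the trained YZ discriminant is not reproduced here).

```python
def z_curve(seq):
    """Return the Z curve of a DNA sequence as a list of (x, y, z) points.

    With cumulative base counts A_n, C_n, G_n, T_n over the first n bases:
        x_n = (A_n + G_n) - (C_n + T_n)   # purine vs pyrimidine
        y_n = (A_n + C_n) - (G_n + T_n)   # amino  vs keto
        z_n = (A_n + T_n) - (G_n + C_n)   # weak   vs strong H-bonds
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base in counts:
            counts[base] += 1
        x = (counts["A"] + counts["G"]) - (counts["C"] + counts["T"])
        y = (counts["A"] + counts["C"]) - (counts["G"] + counts["T"])
        z = (counts["A"] + counts["T"]) - (counts["G"] + counts["C"])
        points.append((x, y, z))
    return points

print(z_curve("ATGGCGTA")[-1])   # final point summarises overall base composition
```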
Article
The published sequence of the Vibrio cholerae genome indicates that, in addition to the genes that encode proteins of known and unknown function, there are 1577 ORFs identified as conserved hypothetical or hypothetical gene candidates. Because the annotation is not 100% accurate, it is not known which of the 1577 ORFs are true protein-coding genes. In this paper, an algorithm based on the Z curve method, with sensitivity, specificity and accuracy greater than 98%, is used to solve this problem. Twenty-fold cross-validation tests show that the accuracy of the algorithm is 98.8%. A detailed discussion of the mechanism of the algorithm is also presented. It was found that 172 of the 1577 ORFs are unlikely to be protein-coding genes. The number of protein-coding genes in the V. cholerae genome was re-estimated and found to be approximately 3716. This result should be of use in microarray analysis of gene expression in the genome, because the cost of preparing chips may be somewhat decreased. A computer program was written to calculate a coding score called VCZ for gene identification in the genome. Coding/noncoding is simply determined by VCZ > 0/VCZ < 0. The program is freely available on request for academic use.
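The published VCZ program is not reproduced here; the sketch below only illustrates the general form of the decision rule, a linear score over sequence-derived features whose sign separates coding from non-coding, with the weights, threshold and features all placeholders.

```python
import numpy as np

def linear_coding_score(features, weights, threshold):
    """Generic Fisher-style decision: score > 0 -> coding, score < 0 -> non-coding.

    `features`, `weights` and `threshold` are placeholders; the actual VCZ score
    uses Z-curve-derived variables and weights trained on V. cholerae genes.
    """
    return float(np.dot(weights, features) - threshold)

# Hypothetical trained weights/threshold and a hypothetical feature vector
weights = np.array([1.8, -0.6, 0.9])
threshold = 0.25
features = np.array([0.31, 0.12, 0.05])

score = linear_coding_score(features, weights, threshold)
print("coding" if score > 0 else "non-coding", round(score, 3))
```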
Article
The 2694 ORFs originally annotated as potential genes in the genome of Aeropyrum pernix can be categorized into three clusters (A, B, C) according to their nucleotide composition at the three codon positions. Coding potential was found to be responsible for the phenomenon of three clusters in a 9-dimensional space derived from the nucleotide composition of ORFs: ORFs assigned to cluster A are coding, while those assigned to clusters B and C are non-coding. A "codingness" index called the AZ score is defined based on a clustering method used to recognize protein-coding genes in the A. pernix genome. The criterion for a coding or non-coding ORF is based on the AZ score: ORFs with AZ > 0 or AZ < 0 are coding or non-coding, respectively. Consequently, 620 out of 632 ORFs with putative functions based on the original annotation are contained in cluster A, all of which have positive AZ scores. In addition, all 29 ORFs encoding putative or conserved proteins newly added in the RefSeq annotation also have positive AZ scores. Accordingly, the number of re-recognized protein-coding genes in the A. pernix genome is 1610, which is significantly less than the 2694 in the original annotation and also much less than the 1841 in the RefSeq annotation curated by NCBI staff. Annotation information of re-recognized genes and their AZ scores are available at: http://tubic.tju.edu.cn/Aper/.
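The 9-dimensional space arises from nucleotide composition at the three codon positions; one common construction (assumed here, not taken from the paper) evaluates the three Z-curve-style components at each codon position separately.

```python
def phase_specific_z_features(cds):
    """Nine features: Z-curve-style components (x, y, z) at each codon position."""
    cds = cds.upper()
    features = []
    for phase in range(3):                      # codon positions 1, 2, 3
        bases = cds[phase::3]
        n = len(bases) or 1
        a, c = bases.count("A") / n, bases.count("C") / n
        g, t = bases.count("G") / n, bases.count("T") / n
        features += [(a + g) - (c + t),         # purine vs pyrimidine
                     (a + c) - (g + t),         # amino  vs keto
                     (a + t) - (g + c)]         # weak   vs strong
    return features

# Placeholder ORF sequence, read in frame
print(phase_specific_z_features("ATGGCTGAAAAACTGCTGGAATAA"))
```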
Article
Using the Z curve method, the protein-coding genes in the AmEPV genome are re-predicted. On the basis of parameters trained on the experimentally validated genes, all of the 30 experimentally validated genes and 67 putative genes are correctly predicted as coding. The sensitivities of the present method in the self-test and the cross-validation are all 100% on these test sets. Thirty-eight annotated conserved and hypothetical genes are predicted as non-coding ORFs. The number of re-predicted protein-coding genes in AmEPV is 256, significantly less than the 294 reported in the original annotation. After extending the present method, trained on the AmEPV genome, to the other entomopoxvirus genome, it is found that 116 of the 123 known and putative genes are correctly predicted as coding. Six of the seven falsely missed genes are shorter than 300 bp. The present method could be extended to other poxvirus genomes with or without adaptation of the training sets.
Article
Over-annotation of hypothetical ORFs is a common phenomenon in bacterial genomes, which necessitates confirming the coding reliability of hypothetical ORFs and then predicting their functions. The important plant pathogen Erwinia carotovora subsp. atroseptica SCRI1043 (Eca1043) is a typical case because more than a quarter of its annotated ORFs are hypothetical. Our analysis focuses on annotation of Eca1043 hypothetical ORFs and comprises two efforts: (a) based on the Z-curve method, 49 originally annotated hypothetical ORFs are recognized as noncoding, which is further supported by principal components analysis and other evidence; and (b) using sequence-alignment tools and some functional resources, more than half of the hypothetical genes are assigned functions. The potential functions of 427 hypothetical genes are summarized according to the cluster of orthologous groups functional categories. Moreover, 114 and 86 hypothetical genes are recognized as putative 'membrane proteins' and 'exported proteins', respectively. Reannotation of Eca1043 hypothetical ORFs will benefit research into the lifestyle, metabolism and pathogenicity of this important plant pathogen. Also, our study offers a model for the reannotation of hypothetical ORFs in microbial genomes.
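Principal components analysis is cited above as supporting evidence; a minimal sketch of projecting ORF composition vectors onto their first two principal components, with the feature construction and data as placeholders:

```python
import numpy as np

def pca_project(feature_matrix, n_components=2):
    """Project rows of `feature_matrix` onto their first principal components."""
    X = np.asarray(feature_matrix, float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centred data gives the principal directions in the rows of vt
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# Placeholder composition vectors for a handful of hypothetical ORFs
orf_features = [
    [0.31, 0.22, 0.27, 0.20],
    [0.30, 0.24, 0.26, 0.20],
    [0.18, 0.35, 0.33, 0.14],
    [0.17, 0.36, 0.32, 0.15],
]
print(pca_project(orf_features))
```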