Article

On the Total Number of Genes and Their Length Distribution in Complete Microbial Genomes

Abstract

In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length distribution of the annotated genes with the length distribution of those matching a known protein reveals that too many short genes are annotated in many genomes. Here we estimate the true number of protein-coding genes for sequenced genomes. Although it is often claimed that Escherichia coli has about 4300 genes, we show that it probably has only ∼3800 genes, and that a similar discrepancy exists for almost all published genomes.

... Overannotation, an estimate of the proportion of false genes annotated in a genome, can work as a proxy for genome annotation quality (see examples of use in [1][2][3][4][5][6]). In this regard, Skovgaard et al. [7] developed a method to estimate the number of genes that should be annotated in a genome. The method is based on comparing the ...
... Overannotation was calculated using the SwissProt method described by [7]. Briefly, the method estimates the number of genes that should be annotated in a genome by calculating the proportion of genes coding for proteins at least 200 amino-acid residues long (deemed as true genes), matching proteins in the SwissProt database (large SP-matching genes); and the proportion of small annotated genes, those that would code for proteins less than 200 amino-acid residues long, also matching SwissProt proteins (small SP-matching genes). ...
... Gene estimates and overannotation as calculated with BLAST+[7]. ...
Article
Full-text available
Background: As the number of genomes in public databases increases, it becomes more important to be able to quickly choose the best annotated genomes for further analyses in comparative genomics and evolution. A proxy to annotation quality is the estimation of overannotation by comparing annotated coding genes against the SwissProt database. NCBI’s BLAST (BLAST+) is the common software of choice to compare these sequences. Newer programs that run in a fraction of the time taken by BLAST+ might miss matches that BLAST+ would find. However, the results might still be useful to calculate overannotation. We thus decided to compare the overannotation estimates yielded using three such programs, UBLAST, LAST and the Blast-Like Alignment Tool (BLAT), and to test non-redundant versions of the SwissProt database to reduce the number of comparisons necessary. Findings: We found that all three programs, UBLAST, LAST and BLAT, tend to produce overannotation estimates similar to those obtained with BLAST+. As would be expected, results varied the most from those obtained with BLAST+ in genomes with fewer proteins matching sequences in the SwissProt database. UBLAST was the fastest running algorithm, and showed the smallest variation from the results obtained using BLAST+. Reduced SwissProt databases did not seem to affect the results much, but the reduction in time was modest compared to that obtained from UBLAST, LAST, or BLAT. Conclusions: Although faster programs miss sequence matches otherwise found by NCBI’s BLAST, the overannotation estimates are very similar and thus these programs can be used with confidence for this task. Electronic supplementary material: The online version of this article (doi:10.1186/1756-0500-7-651) contains supplementary material, which is available to authorized users.
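The SwissProt-based estimate described in the excerpts above can be illustrated with a minimal Python sketch. The excerpts are truncated before the final formula, so the arithmetic here is an assumption: every annotated gene coding for at least 200 residues is taken as real, and real short genes are assumed to match SwissProt at the same rate as long genes. The function names and the example counts are hypothetical and are not taken from any genome.

```python
def estimate_true_gene_count(n_long, n_long_matching, n_short_matching):
    """Estimate the number of real protein-coding genes in a genome.

    Assumptions (illustrative, not confirmed by the excerpts):
      * every annotated gene of >= 200 residues is a true gene;
      * true short genes match SwissProt at the same rate as long genes.
    """
    if n_long <= 0 or n_long_matching <= 0:
        raise ValueError("need at least one long gene with a SwissProt match")
    # Proportion of long (trusted) genes that match a SwissProt protein.
    match_rate = n_long_matching / n_long
    # Scale the short SwissProt-matching genes up by that same proportion.
    estimated_short = n_short_matching / match_rate
    return n_long + estimated_short


def overannotation(n_annotated, n_estimated):
    """Excess of annotated genes over the estimate, relative to the estimate."""
    return (n_annotated - n_estimated) / n_estimated


# Hypothetical genome: 2000 long genes (1000 with a match),
# 800 annotated short genes (250 with a match).
estimate = estimate_true_gene_count(2000, 1000, 250)
excess = overannotation(2000 + 800, estimate)
print(f"estimated genes: {estimate:.0f}, overannotation: {excess:.0%}")
```

With these made-up counts, half of the long genes match SwissProt, so the 250 matching short genes are scaled up to 500 presumed real short genes. The estimate is sensitive to the match rate, which is consistent with the abstract's remark that results vary most for genomes with few SwissProt matches.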
... It is well-known that the median protein length in Eukaryotes is significantly longer than in Prokaryotes. Among Prokaryotes, Bacteria tend to have longer proteins, on average, than Archaea (Zhang, 2000;Skovgaard et al., 2001;Brocchieri and Karlin, 2005). Concerning the median protein length, the trends presented in Table 1 confirm the results observed by others (Zhang, 2000;Skovgaard et al., 2001;Brocchieri and Karlin, 2005) on a genomic level. ...
... Among Prokaryotes, Bacteria tend to have longer proteins, on average, than Archaea (Zhang, 2000;Skovgaard et al., 2001;Brocchieri and Karlin, 2005). Concerning the median protein length, the trends presented in Table 1 confirm the results observed by others (Zhang, 2000;Skovgaard et al., 2001;Brocchieri and Karlin, 2005) on a genomic level. With only a median protein length of 228 a.a. ...
Article
Full-text available
Disulfide bridges are a fundamental element in the molecular architecture of proteins and peptides that are involved, e.g., in basic biological processes or act as toxins. NMR spectroscopy is one method to characterize the structure of bioactive compounds including cystine-containing molecules. Although the disulfide bridge itself is invisible in NMR, constraints obtained via the neighboring NMR-active nuclei allow the underlying conformation to be defined and thereby the functional background to be resolved. In this mini-review we briefly present the impact of cysteine and disulfide bonds in the proteome from different domains of life and give a condensed overview of recent NMR applications for the characterization of disulfide-bond containing biomolecules, including advantages and limitations of the different approaches.
... Up to now, less than a dozen studies were devoted to protein length distribution. Among those, there were only four relevant publications: [4,5,8,24]. In 2000, using an early version of the COG database, Zhang compared 22 species in three domains of life [4] and found that the average gene length is smallest for Archaea and greatest for eukaryotes. ...
... In 2000, using an early version of the COG database, Zhang compared 22 species in three domains of life [4] and found that the average gene length is smallest for Archaea and greatest for eukaryotes. Similarly, Skovgaard et al. [24] analysed 34 prokaryotic genomes and discovered that, for the vast majority of functional families, Bacterial proteins were longer than Archaeal ones. In 2005, Brocchieri and Karlin [5] confirmed these findings using a larger collection of genomes (16 Archaeal and 67 Bacterial species). ...
Article
Full-text available
Proteins of the same functional family (for example, kinases) may have significantly different lengths. It is an open question whether such variation in length is random or arises as a response to some unknown evolutionary driving factors. The main purpose of this paper is to demonstrate the existence of factors affecting prokaryotic gene lengths. We believe that the ranking of genomes according to the lengths of their genes, followed by the calculation of coefficients of association between genome rank and genome property, is a reasonable approach to revealing such evolutionary driving factors. As we demonstrated earlier, our chosen approach, Bubble Sort, combines stability, accuracy, and computational efficiency as compared to other ranking methods. Application of Bubble Sort to the set of 1390 prokaryotic genomes confirmed that genes of Archaeal species are generally shorter than Bacterial ones. We observed that gene lengths are affected by various factors: within each domain, different phyla have preferences for short or long genes; thermophiles tend to have shorter genes than soil-dwellers; halophiles tend to have longer genes. We also found that species with overrepresentation of cytosines and guanines in the third position of the codon (GC3 content) tend to have longer genes than species with low GC3 content.
... (in the direction of translation, from 5' to 3') to produce long theoretical peptide sequences which are compared to known proteins from other organisms. Nevertheless, Skovgaard et al. (2001) showed that the number of genes in bacteria is generally overpredicted (in A. pernix they estimated 100% gene overprediction, which is by far the most extreme in their analysis). ...
... For example, the proteome of the archaeon Aeropyrum pernix contains the largest fraction of orphan regions. This result may be biased because the gene prediction in Aeropyrum pernix produced many very short questionable ORFs (Skovgaard et al., 2001). ...
... It is also well known that the efficiency of gene finding algorithms is strongly dependent on the composition of the analyzed genome, described for example by its G+C content ([2], [18]). Skovgaard et al. [18] observed that the abundance of long non-coding ORFs is correlated with the G+C level. This clearly results from the nucleotide composition of stop codons (TAA, TAG, and TGA), which are A+T-rich. ...
Article
Full-text available
Methods based on the theory of Markov chains are most commonly used in the recognition of protein coding sequences. However, they require big learning sets to fill up all elements in transition probability matrices describing dependence between nucleotides in the analyzed sequences. Moreover, gene prediction is strongly influenced by the nucleotide bias measured by e.g. G+C content. In this paper we compare two methods: (i) the classical GeneMark algorithm, which uses a three-periodic non-homogeneous Markov chain, and (ii) an algorithm called PMC that considers six independent homogeneous Markov chains to describe transition between nucleotides separately for each of three codon positions in two DNA strands. We have tested the efficiency (in terms of true positive rate) of these two Markov chain methods for the model bacterial genome of Escherichia coli depending on the size of the learning set, uncertainty of ORFs’ function annotation, and model order of these algorithms. We have also applied the methods with different model orders for 163 prokaryotic genomes that covered a wide range of G+C content. The PMC algorithm of different chain orders turns out to be more stable in comparison to the GeneMark algorithm. The PMC also outperforms the GM algorithm giving a higher fraction of coding sequences in the tested set of annotated genes. Moreover, it requires much smaller learning sets than GM to work properly.
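The link between G+C content and long chance ORFs noted in the excerpt above can be made concrete with a small Python sketch. It rests on a deliberately simple assumption that is not taken from the cited papers: bases are drawn independently, with P(G) = P(C) and P(A) = P(T). Under that model the probability that a random codon is a stop (TAA, TAG or TGA) falls as G+C rises, so random ORFs get longer. The function names are illustrative.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA at G+C fraction `gc`.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (a simplification for illustration).
    """
    if not 0.0 <= gc < 1.0:
        raise ValueError("gc must be in [0, 1)")
    at = (1.0 - gc) / 2.0  # probability of A, and of T
    g = gc / 2.0           # probability of G
    # TAA uses three A/T bases; TAG and TGA each use two A/T bases and one G.
    return at**3 + 2.0 * at**2 * g


def expected_random_orf_codons(gc):
    """Mean number of codons up to and including the first stop codon."""
    return 1.0 / stop_codon_probability(gc)


for gc in (0.3, 0.5, 0.7):
    print(f"G+C = {gc:.0%}: about {expected_random_orf_codons(gc):.1f} codons per random ORF")
```

Real genomes deviate from this independence model (codon usage and strand biases matter), so the numbers only indicate the direction and rough size of the effect.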
... However, both these approaches run into difficulties when annotating short or highly diverged genes. Statistical methods such as codon bias have less power in discriminating coding from noncoding DNA for short genes using the "composition-based" approach, and small or highly diverged genes return weak hits to homologs in BLAST searches and therefore cannot be accurately differentiated from ORFs that occur by chance (10). This uncertainty has resulted in overestimation of the number of protein-coding genes in annotated bacterial species (10-12) but also in many bona fide short genes being overlooked (8, 13-16). ...
Article
Full-text available
We report the development of SearchDOGS Bacteria, software to automatically detect missing genes in annotated bacterial genomes by combining BLAST searches with comparative genomics. Having successfully applied the approach to yeast genomes, we redeveloped SearchDOGS to function as a standalone, downloadable package, requiring only a set of GenBank annotation files as input. The software automatically generates a homology structure using reciprocal BLAST and a synteny-based method; this is followed by a scan of the entire genome of each species for unannotated genes. Results are provided in an HTML interface, providing coordinates, BLAST results, syntenic location, omega values (Ka/Ks, where Ks is the number of synonymous substitutions per synonymous site and Ka is the number of nonsynonymous substitutions per nonsynonymous site) for protein conservation estimates, and other information for each candidate gene. Using SearchDOGS Bacteria, we identified 155 gene candidates in the Shigella boydii sb227 genome, including 56 candidates of length < 60 codons. SearchDOGS Bacteria has two major advantages over currently available annotation software. First, it outperforms current methods in terms of sensitivity and is highly effective at identifying small or highly diverged genes. Second, as a freely downloadable package, it can be used with unpublished or confidential data.
... Other scientists have also found that automated annotation methods lead to the selection of the wrong reading frame, over-annotation of protein coding genes, and incorrect start codon positions, which are all common problems in the microbial genomes deposited in GenBank. For example, E. coli has been found to have ∼500 fewer genes than originally reported [22]. It is estimated that overannotation is as high as 20% in many genomes [22,23]. ...
Article
Full-text available
Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing the computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks in third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content greater than 60%.
... However, similar to Markov-based models, these machine learning models are primarily developed for annotating long genes. As a result, their sensitivity for short genes remains low due to a lack of annotated short gene sequences in the RefSeq database (Nielsen & Krogh, 2005; Skovgaard et al., 2001). This lack of information from short genes renders the machine learning methods ill-equipped to identify genes belonging to ... ProtiGeno extracts ORFs from a prokaryotic genome and translates them to amino acid sequences. ...
Preprint
Prokaryotic gene prediction plays an important role in understanding the biology of organisms and their function with applications in medicine and biotechnology. Although the current gene finders are highly sensitive in finding long genes, their sensitivity decreases noticeably in finding shorter genes (<180 nts). The culprit is insufficient annotated gene data to identify distinguishing features in short open reading frames (ORFs). We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and possible limitations by visualizing the three-dimensional structure of the predicted short genes. Data, codes, and models are available at https://github.com/tonytu16/protigeno.
... Sequence conservation weighs in evolutionary selection, indicating that the sequence remains functionally useful throughout phylogenetic trees [3,51]. Sequence similarity, in contrast, denotes similar protein motifs or domains aligned over previously identified protein sequences so as to derive coding potentials and potential protein functionalities [4,17]. [Figure caption: In the mitochondria, sORFs are found in the mitochondrial DNA (mtDNA). In the cytoplasm, sORFs are scattered across different RNA transcripts, i.e., circular RNA (circRNA), long non-coding RNA (lncRNA), and pri-microRNA.] ...
Article
Full-text available
A short open reading frame (sORF) constitutes ≤ 300 bases, encoding a microprotein or sORF-encoded protein (SEP) which comprises ≤ 100 amino acids. Traditionally dismissed by genome annotation pipelines as meaningless noise, sORFs were found to possess coding potential with ribosome profiling (RIBO-Seq), which unveiled sORF-based transcripts at various genome locations. Nonetheless, the existence of corresponding microproteins that are stable and functional was initially little substantiated by experimental evidence. With recent advancements in multi-omics, the identification, validation, and functional characterisation of sORFs and microproteins have become feasible. In this review, we discuss the history and development of the emerging research field of sORFs and microproteins. In particular, we focus on an array of bioinformatics and OMICS approaches used for predicting, sequencing, validating, and characterizing these recently discovered entities. These strategies include RIBO-Seq, which detects sORF transcripts via ribosome footprints, and mass spectrometry (MS)-based proteomics for sequencing the resultant microproteins. Subsequently, our discussion extends to the functional characterisation of microproteins by incorporating CRISPR/Cas9 screens and protein-protein interaction (PPI) studies. Our review discusses not only detection methodologies; we also highlight the challenges and potential solutions in identifying and validating sORFs and their microproteins. The novelty of this review lies in its validation of the functional role of microproteins, which could contribute towards the future landscape of microproteomics.
... It is generally considered that the use of the term TRGs represents a more orderly definition of "orphan" or "novel" genes, since it considers not only their lack of sequence similarity to genes or proteins in other organisms, but also their narrow phylogenetic distribution (Wilson et al. 2005). TRGs have at times been suggested to be non-functional sequences, or annotation artifacts (Skovgaard et al. 2001;Clamp et al. 2007). However, their functionality and biological significance are supported by rapidly accumulating experimental evidence (see below). ...
Chapter
Taxonomically restricted genes, or TRGs, are specific to a particular taxon that can be found only in the genomes of single species or are represented as orthologs in closely related genera. Despite being regarded with a mixture of skepticism and awe by the scientific community, progress has been gradually attained in the understanding of their presumed origin and function in most, if not all, forms of life. Grain amaranth is not an exception, as shown by the numerous unknown function TRGs that were unveiled by a recent transcriptomic analysis undergone under different (a)biotic stress conditions. True to their nature, amaranth TRGs appear to be mostly stress-related genes that may offer a clue to better understand the ability of these remarkable plants to thrive under unfavorable ambient conditions. This chapter will concentrate on the description of what has gradually emerged from the incipient study of TRGs in grain amaranth and will place this knowledge in the context of what is known about these enigmatic genes in other organisms.
... Analyzing the genes annotated to CRISPR elements of the selected strains revealed that Vp. 1513 had the most CRISPR elements, whereas Vp. 1496 did not show any genes annotated to CRISPR elements integrated in its genome. The difference in CRISPR number among the genomes of V. parahaemolyticus environmental isolates might relate to genus specificity and V. parahaemolyticus source (Skovgaard et al., 2001). Based on the phylogenetic results shown in Figure 1B, Vp. 1513 underwent a higher level of genetic variation than Vp. 1496 and went through a longer evolutionary process; the integration of a higher number of CRISPR elements might therefore be due to phage attacks that improved its resistance against viruses. ...
Article
Full-text available
Vibrio parahaemolyticus is a common pathogenic marine bacterium that causes gastrointestinal infections and other health complications, which could be life-threatening to immunocompromised patients. For the past two decades, the pathogenicity of environmental V. parahaemolyticus has increased greatly, and the genomic change behind this phenomenon still needs an in-depth exploration. To investigate the difference in pathogenicity at the genomic level, three strains with different hemolysin expression and biofilm formation capacity were screened out of 69 environmental V. parahaemolyticus strains. Subsequently, 16S rDNA analysis, de novo sequencing, pathogenicity test, and antibiotic resistance assays were performed. Comparative genome-scale interpretation showed that various functional region differences in pathogenicity of the selected V. parahaemolyticus strains were due to dissimilarities in the distribution of key genetic elements and in the secretory system compositions. Furthermore, the genomic analysis-based hypothesis of distinct pathogenic effects was verified by the survival rate of mouse models infected with different V. parahaemolyticus strains. Antibiotic resistance results also presented the multi-directional evolutionary potential in environmental V. parahaemolyticus , in agreement with the phylogenetic analysis results. Our study provides a theoretical basis for better understanding of the increasing pathogenicity of environmental V. parahaemolyticus at the genome level. Further, it has a key referential value for the exploration of pathogenicity and prevention of environmental V. parahaemolyticus in the future.
... After sequencing of the complete genome of Haemophilus influenzae (1.8 million base pairs) in 1995, a total of 279 complete bacterial genomes have been deposited in the public databases [13]. Data processing can be done using computer programmes, and comparison with the genomes of other organisms is carried out by applying bioinformatics tools. ...
Article
Deoxyribonucleic acid (DNA) has been a favourite building block for molecular computations and biological computers. The biological molecule DNA serves to store genetic information and particular DNA base sequence arrangement in a gene determines the order of chain of amino acids. These amino acids are the building block molecules for protein/enzyme synthesis. The arrangement of amino acids, their folds, twists and turns in a protein and the interaction of these proteins with other proteins and molecules are the basis of many biological reactions and processes. A small change in the DNA sequence of a gene (mutation) may substitute one amino acid for another in the protein which may alter the protein folding. Even single nucleotide base change in the protein may produce different antigens in the blood. DNA chips are made for testing genetic variations and to develop genetic profiles of normal and diseased cells. Biological computers allow us to store, organize, search, compare and interpret enormous amount of data about different genes. Gene finding is crucial in understanding the genome of a species. Along with the ongoing revolution in sequencing technology, the number of sequenced genomes has increased drastically. Therefore, the development of reliable automated techniques for predicting genes has become critical. Gene mining will help in identification and characterization of related proteins that perform similar function in the bacterial, plant or animal cell. Predicting gene regulatory function and engineering of the specified regulated behaviours are the main issues in synthetic biology and molecular computing. Future developments in protein/enzyme engineering could enable enzyme manufacturing with pre-designed function.
... Therefore, the threedimensional structural information is somehow encoded in the primary sequence of amino acids, which is in turn derived from the genetic information. Since the average protein is composed of several hundreds of amino acid residues, 6 it is difficult to envisage the process of unassisted protein folding as a completely random search for the global energy minimum. The scientific question yet to be solved is how the protein avoids this combinatorial explosion (so-called Levinthal's paradox). ...
Article
By combining bioinformatics with quantum-chemical calculations, we attempt to address quantitatively some of the physical principles underlying protein folding. The former allowed us to identify tripeptide sequences in existing protein three-dimensional structures with a strong preference for either helical or extended structure. The selected representatives of pro-helical and pro-extended sequences were converted into "isolated" tripeptides, capped at N- and C-termini, and these were subjected to an extensive conformational sampling and geometry optimization (typically thousands to tens of thousands of conformers for each tripeptide). For each conformer, the QM(DFT-D3)/COSMO-RS free-energy value, G_conf(solv), was then calculated. The ΔG_conf(solv) is expected to provide an objective, unbiased, and quantitatively accurate measure of the conformational preference of the particular tripeptide sequence. It has been shown that irrespective of the helical vs extended preferences of the selected tripeptide sequences in the context of the protein, most of the low-energy conformers of isolated tripeptides prefer the α-helical structure. Nevertheless, pro-helical tripeptides show slightly stronger helix preference than their pro-extended counterparts. Furthermore, when the sampling is repeated in the presence of a partner tripeptide to mimic the situation in a β-sheet, pro-extended tripeptides (exemplified by the VIV) show a larger free-energy benefit than pro-helical tripeptides (exemplified by the EAM). This effect is even more pronounced in a hydrophobic solvent, which mimics the less polar parts of a protein. This is in line with our bioinformatic results showing that the majority of pro-extended tripeptides are hydrophobic. The preference for a specific secondary structure by the studied tripeptides is thus governed by the plasticity to adapt to its environment.
In addition, we show that most of the "naturally occurring" conformations of tripeptide sequences, i.e., those found in existing three-dimensional protein structures, are within 10 kcal·mol ⁻¹ from their global minima. In summary, our "ab initio" data suggest that complex protein structures may start to emerge already at the level of their small oligopeptidic units, which is in line with a hierarchical nature of protein folding.
... The human MHC is very dense, coding for one gene per 1316 basepairs (Skovgaard et al. 2001;Kulski et al. 2002). The 1.8-Mb long human MHC (HLA) class I region contains eighteen genes, including six coding and twelve pseudogenes, as well as seven MHC class I-chain related (MIC) genes, two of which are coding and five are pseudogenes (MHC Sequencing Consortium 1999;Shiina et al. 1999a;Shiina et al. 1999b;Kulski et al. 2002). ...
... There are four primary predictions of this model: (1) the localisation of mRNAs to the ER is translation dependent; (2) mRNA localisation to the ER requires a topogenic signal (signal sequence and/or transmembrane domain); (3) steady-state mRNA distributions between the cytosol and ER compartments should reflect the localisation mechanism, with topogenic signal-encoding mRNAs being highly enriched on the ER and mRNAs encoding cytoplasmic/nucleoplasmic proteins being similarly enriched in the cytosol and (4) ribosome dissociation from the ER occurs on the timescale of translation, which for a typical (e.g. 41 kDa) human protein (Skovgaard et al., 2001) would be tens of seconds. With the advent of deep sequencing technologies, predictions (2)-(4) have come under intensive examination. ...
Chapter
Proteome expression is the integrated output of gene transcription, messenger ribonucleic acid (mRNA) stability, mRNA translation and protein stability, and varies between cells, tissues and organs. Proteome expression is dynamic and can be readily modified in response to stress, cell cycle progression, pathogenic infection and malignant transformation. With recent advances in single molecule optical imaging and high‐resolution genomic analyses, increased attention has turned to the spatiotemporal regulation of proteome expression at the subcellular level. With this interest has come a reinvestigation of a fundamental question in cell biology – the subcellular organisation of mRNA transcriptome expression. Although it is widely accepted that eukaryotic cells partition proteome expression, with soluble proteins being synthesised on cytoplasmic ribosomes and secretory/integral membrane proteins on endoplasmic reticulum (ER)‐bound ribosomes, recent data indicate a far broader role for the ER in proteome expression. Key Concepts • Ribosomes, the cellular machines that perform protein synthesis, are present in the two primary protein synthesis compartments of the eukaryotic cell, the cytosol and endoplasmic reticulum (ER). • mRNAs are largely partitioned between the cytosol and the ER on the basis of their encoded gene product, with cytosolic protein‐encoding mRNAs being translated by cytoplasmic ribosomes and secretory/membrane protein‐encoding mRNAs undergoing translation on ER‐bound ribosomes. • The mechanisms governing mRNA and ribosome partitioning between the cytosol and ER compartments are largely unknown. • Recent studies have revealed that ER‐bound ribosomes are broadly engaged in the translation of the mRNA transcriptome. • A role for the ER in the expression of the cellular proteome can now be considered.
... It is often proposed that the origin of life depended on oligomer-mediated replication [1, 2, 4-6], and these oligomers were shorter than modern biopolymers (for example, average bacterial-coded proteins are estimated to be on the order of 200-270 amino acid residues [22,23]). The types of chemistry that could have facilitated early replication and translation-like activities are unknown, but it seems likely the first replicating systems had weaker control over the composition of the polymers they were capable of producing and were more error prone. ...
Article
It is widely believed that the origin of life depended on environmentally driven complexification of abiotically produced organic compounds. Polymerization is one type of such complexification, and it may be important that many diverse polymer sequences be produced for the sake of selection. Not all compound classes are easily polymerized under the environmental conditions present on primitive planets, and it is possible that life’s origin was aided by other monomers besides those used in contemporary biochemistry. Here we show that alpha-hydroxy acids, which are plausibly abundant prebiotic monomers, can be oligomerized to generate vast, likely sequence-complete libraries, which are also stable for significant amounts of time. This occurs over a variety of reaction conditions (temperature, concentration, salinity, and presence of congeners) compatible with geochemical settings on the primitive Earth and other solar system environments. The high-sequence heterogeneity achievable with these compounds may be useful for scaffolding the origin of life.
... However, the mother gene ECs2385 is annotated as hypothetical in EHEC, and if ECs2385 is not a protein-coding gene, ano could not be considered an overlapping gene. In former times, there were some doubts that hypothetical (and especially short) genes are "real" genes (Skovgaard et al., 2001; Ochman, 2002). Nowadays, most researchers believe that hypothetical genes annotated by current genome annotation programs are indeed protein coding (Storz et al., 2014; Baek et al., 2017). ...
Article
Current notion presumes that only one protein is encoded at a given bacterial genetic locus. However, transcription and translation of an overlapping open reading frame (ORF) of 186 bp length were discovered by RNAseq and RIBOseq experiments. This ORF is almost completely embedded in the annotated L,D-transpeptidase gene ECs2385 of Escherichia coli O157:H7 Sakai in the antisense reading frame -3. The ORF is transcribed as part of a bicistronic mRNA, which includes the annotated upstream gene ECs2384, encoding a murein lipoprotein. The transcriptional start site of the operon resides 38 bp upstream of the ECs2384 start codon and is driven by a predicted σ70 promoter, which is constitutively active under different growth conditions. The bicistronic operon contains a ρ-independent terminator just upstream of the novel gene, significantly decreasing its transcription. The novel gene can be stably expressed as an EGFP-fusion protein and a translationally arrested mutant of ano, unable to produce the protein, shows a growth advantage in competitive growth experiments compared to the wild type under anaerobiosis. Therefore, the novel antisense overlapping gene is named ano (anaerobiosis responsive overlapping gene). A phylostratigraphic analysis indicates that ano originated very recently de novo by overprinting after the Escherichia/Shigella clade separated from other enterobacteria. Therefore, ano is one of the very rare cases of overlapping genes known in the genus Escherichia.
... Currently, Genome Reviews addresses the problem that annotations describing DNA sequence features may be out of date or incorrect, but not the problem that the DNA sequence features themselves may be incorrect or absent. However, a substantial number of gene predictions may not encode real genes (for Escherichia coli, for example, it could be shown that it probably has only approximately 3,800 genes, in contrast to the claimed 4,300 genes, and a similar discrepancy seems to exist for almost all published genomes [22]), and other genes have not been described. Therefore, methods are going to be developed to map protein sequences not annotated in the original EMBL genome entries (which may represent corrected versions of originally annotated protein sequences, or novel protein sequences subsequently experimentally determined or predicted by alternative methods) onto their corresponding genomes. ...
Article
Integr8 (http://www.ebi.ac.uk/integr8/) has been developed to provide an integration layer for the exploitation of genomic and proteomic data. High-quality databases from major bioinformatics centres in Europe are included, and some core data and the relationships of biological entities to each other and to entries in other databases are stored. Thus, a framework exists that allows for new kinds of data to be integrated, and an entity-centric view of complete genomes and proteomes is offered. Integr8 is an automatically populated database, providing different entry points to the data, depending on the user’s entity of interest. The Proteome Analysis database for statistical analysis and the Genome Reviews for annotated genome information are the main developments within the Integr8 project. With the BioMart application, an interactive querying tool for performing customisable proteome analysis and data mining is offered. Future developments will especially focus on the Genome Reviews, including mapping not yet annotated protein sequences onto their corresponding genomes, generating new predictions for non-coding RNA genes, and generally extending the scope to lower metazoan organisms.
... The wWb genome contains 961 ORFs, one copy of each of the 5S rRNA, 16S rRNA and 23S rRNA as well as one copy of each of the 34 tRNA genes. Given that GLIMMER is known to inflate the number of small ORFs in a genome, we removed all ORFs <60 aa and all ORFs coding for hypothetical proteins <100 aa with no ortholog in wBm (Skovgaard et al. 2001). Using Sybil (Riley et al. 2012) to visualize and interrogate orthologous proteins between wWb and wBm, the two genomes were found to share 779 orthologous clusters, with wWb having 101 unclustered genes and wBm having 23 unclustered genes. ...
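The two-threshold filter described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Orf` record, its field names, and the helper functions are hypothetical, and the excerpt does not say how the authors actually implemented the step. Only the 60 aa and 100 aa cutoffs and the "no ortholog in wBm" condition come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical record for one predicted ORF (field names are assumptions)."""
    orf_id: str
    length_aa: int
    is_hypothetical: bool
    has_wbm_ortholog: bool


def keep_orf(orf: Orf, min_len: int = 60, hypo_min_len: int = 100) -> bool:
    """Return True if the ORF survives both filters quoted in the excerpt."""
    # Rule 1: drop every ORF shorter than 60 aa.
    if orf.length_aa < min_len:
        return False
    # Rule 2: drop hypothetical proteins shorter than 100 aa
    # that have no ortholog in wBm.
    if orf.is_hypothetical and orf.length_aa < hypo_min_len and not orf.has_wbm_ortholog:
        return False
    return True


def filter_orfs(orfs: list[Orf]) -> list[Orf]:
    """Apply keep_orf to a list of ORFs, preserving input order."""
    return [orf for orf in orfs if keep_orf(orf)]


if __name__ == "__main__":
    # Made-up example ORFs, not data from the wWb genome.
    example = [
        Orf("a", 59, False, True),    # too short, removed by rule 1
        Orf("b", 80, True, False),    # short hypothetical without ortholog, removed by rule 2
        Orf("c", 80, True, True),     # short hypothetical with ortholog, kept
        Orf("d", 80, False, False),   # annotated function, kept
        Orf("e", 150, True, False),   # long hypothetical, kept
    ]
    print([orf.orf_id for orf in filter_orfs(example)])
```

Keeping the two rules as separate checks makes it easy to vary either threshold independently when exploring how sensitive the final gene count is to the cutoffs.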
Article
The draft genome assembly of the Wolbachia endosymbiont of Wuchereria bancrofti (wWb) genome consists of 1,060,850 bp in 100 contigs and contains 961 ORFs, with a single copy of the 5S rRNA, 16S rRNA, and 23S rRNA and each of the 34 tRNA genes. Phylogenetic core genome analyses show wWb to cluster with other strains in supergroup D of the Wolbachia phylogeny, while being most closely related to the Wolbachia endosymbiont of Brugia malayi strain TRS (wBm). The wWb and wBm genomes share 779 orthologous clusters with wWb having 101 unclustered genes and wBm having 23 unclustered genes. The higher number of unclustered genes in the wWb genome likely reflects the fragmentation of the draft genome.
... Iterative rounds of training were used to produce hidden Markov models for the ab initio gene predictors SNAP (Korf, 2004) and Augustus (Stanke, Schöffmann, Morgenstern, & Waack, 2006), while self-training was used for GeneMark-ES (Ter-Hovhannisyan, Chernoff, & Borodovsky, 2008). Because ab initio gene predictors are notorious for overcalling genes (Larsen & Krogh, 2003; Skovgaard, Jensen, Brunak, Ussery, & Krogh, 2001), we set MAKER to call only gene models that had RNA-seq or protein evidence (AED < 1) supporting them (keep_preds = 0). The remaining ab initio gene calls were then scanned for protein family domains (PFAM) using InterProScan5 (IPRscan) (Jones et al., 2014). ...
Article
Bark and ambrosia beetles are highly specialized weevils (Curculionidae) that have established diverse symbioses with fungi, most often from the order Ophiostomatales (Ascomycota, Sordariomycetes). The two types of beetles are distinguished by their feeding habits and intimacy of interactions with their symbiotic fungi. The tree tissue diet of bark beetles is facilitated by fungi, while ambrosia beetles feed solely on fungi that they farm. The farming life history strategy requires domestication of a fungus, which the beetles consume as their sole food source. Ambrosia beetles in the subfamily Platypodinae originated in the mid-Cretaceous (119-88Mya) and are the oldest known group of farming insects. However, attempts to resolve phylogenetic relationships and the timing of domestication events for fungal cultivars have been largely inconclusive. We sequenced the genomes of 12 ambrosia beetle fungal cultivars and bark beetle associates, including the devastating laurel wilt pathogen, Raffaelea lauricola, to estimate a robust phylogeny of the Ophiostomatales. We find evidence for contemporaneous diversification of the beetles and their associated fungi, followed by three independent domestication events of the ambrosia fungi genus Raffaelea. We estimate the first domestication of an Ophiostomatales fungus occurred ~86Mya, 25 million years earlier than prior estimates and in close agreement with the estimated age of farming in the Platypodinae (96Mya). Comparisons of the timing of fungal domestication events with the timing of beetle radiations support the hypothesis that the first large beetle radiations may have spread domesticated ‘ambrosia’ fungi to other fungi-associated beetle groups, perhaps facilitating the evolution of new farming lineages.
... The findings of A-C, taken together, suggest that protein-coding genes that lack any functional annotation may preferentially include genes that do not encode a functional protein and represent cases of over-prediction. This over-prediction of protein-coding genes has been noted previously in both prokaryotes and eukaryotes (Clamp et al. 2007; Skovgaard et al. 2001; Warren et al. 2010). ...
... The smallest size had a mean diameter of 3.4 nm. This diameter fits well to the size range of many proteins which have an average diameter of 2–5 nm (Erickson, 2009; Skovgaard et al., 2001). The second population had an average diameter of 10 nm. ...
Article
The application of small-angle X-ray scattering (SAXS) to whole Escherichia coli cells is challenging owing to the variety of internal constituents. To resolve their contributions, the outer shape was captured by ultra-small-angle X-ray scattering and combined with the internal structure resolved by SAXS. Building on these data, a model for the major structural components of E. coli was developed. It was possible to deduce information on the occupied volume, occurrence and average size of the most important intracellular constituents: ribosomes, DNA and proteins. E. coli was studied after treatment with three different antibiotic agents (chloramphenicol, tetracycline and rifampicin) and the impact on the intracellular constituents was monitored.
... Databases containing large coverage, such as the NCBI nr (nonredundant) database and the EMBL Nucleotide Sequence Database (127), are effective at annotating as many genes as possible but are the least reliable, since their annotation and curation are the responsibility of the submitter and lack any validation. Analysis of the nr database has shown that it likely contains a substantial number of noncoding ORFs misannotated as actual coding sequences (128). Transitive annotation from these databases without additional verification can result in the misannotation of genes and propa- Protein-encoding locations are identified, followed by automated assignment of gene function by comparison to existing databases. ...
Article
SUMMARY The number of large-scale genomics projects is increasing due to the availability of affordable high-throughput sequencing (HTS) technologies. The use of HTS for bacterial infectious disease research is attractive because one whole-genome sequencing (WGS) run can replace multiple assays for bacterial typing, molecular epidemiology investigations, and more in-depth pathogenomic studies. The computational resources and bioinformatics expertise required to accommodate and analyze the large amounts of data pose new challenges for researchers embarking on genomics projects for the first time. Here, we present a comprehensive overview of a bacterial genomics project from beginning to end, with a particular focus on the planning and computational requirements for HTS data, and provide a general understanding of the analytical concepts needed to develop a workflow that will meet the objectives and goals of HTS projects.
... It has sometimes been suggested that TRGs are non-functional sequences, or annotation artefacts (Skovgaard et al., 2001; Clamp et al., 2007), but evidence is increasing for their functionality and biological significance (see reviews Khalturin et al., 2009; Tautz and Domazet-Lošo, 2011). That relatively few TRGs have well-documented functions is likely due to lack of funding and research aimed at their functional characterization, rather than lack of actual function. ...
... In this way, inactivation of one of the DNA repair systems destabilizes the organization of the entire genome. Various types of rearrangements bring about large, abrupt changes in the structure and arrangement of genes [28,35,59,78]. Rearrangements of specific DNA regions may result from the activity of transposable elements [47], site-specific recombination systems [36], and also deletions dependent on, among other things, ...
Article
Recently, the study of bacterial evolution has been based on the comparative analysis of nucleotide sequences within and between species. Analyses of microbial genomes have shown that genomes contain genes that are closely related to those of phylogenetically very distantly related prokaryotes. Evolutionary biologists have usually thought that mutations within individual genes, followed by clonal transfer of genetic information, are the major source of phenotypic variation, leading to adaptation through natural selection and generating diversity among species. In contrast to eukaryotes, which evolve principally through the modification of existing genetic information, bacteria obtain a significant proportion of their genetic diversity through the acquisition of DNA fragments from distantly related organisms. Horizontal gene transfer is the term used to describe the processes that permit the exchange of DNA among organisms of different species. Such transfer produces highly dynamic genomes in which substantial amounts of DNA are introduced into and deleted from the chromosome. Changes in the genome, occurring through gene acquisition and deletion, are the major events underlying the emergence and evolution of new bacterial strains, including pathogens. However, besides ecological isolation and the fitness of new recombinants, there are several molecular barriers to chromosomal gene transfer between bacterial species, which correlate with genomic sequence divergence/homology. First, specific uptake sequences (US) in DNA may be required for an efficient transformation process. The real barriers to successful DNA acquisition by the recipient cell are restriction-modification systems and the lack of Chi-like sequences in the structure of alien DNA, which provide a means of species-specific endo- and exonucleolytic degradation, respectively. The strong barrier to homologous recombination is the methyl-directed mismatch repair (MMR) system. MMR-deficient mutators exhibit a hyperrecombination phenotype and act to reduce sequence homology limitations in processes such as transformation, transduction, and conjugation. Repeated changes in the MMR phenotype (loss and acquisition) are involved in rapid genetic diversification and the quick adaptation of bacterial populations.
... In the past 30 years, gene finding has been one of the most important topics of the life sciences. Even so, the problem of gene annotation errors has been deemed a universal phenomenon in microbial genomes, [20,21] and different reannotation strategies have been outlined. [3] Although most programs claim to achieve high accuracy on their training and test sets, a growing number of recent studies using different gene-finding algorithms have shown that the predicted results differ greatly. ...
Article
Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as "hypothetical" were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58.
... Other studies of many genes from a genome or from whole genomes found similar averages for protein length [102]. Interestingly, most of the proteins in our study set are significantly longer than these median sizes. ... [Figure caption fragment: For some types of proteins, more than one score is shown because the protein is found to moonlight in more than one organism, and the score was determined for each protein in each organism in which it moonlights. The type of enzyme is indicated by number on the x-axis.]
Article
Moonlighting proteins comprise a subset of multifunctional proteins that perform two or more biochemical functions that are not due to gene fusions, multiple splice variants, proteolytic fragments, or promiscuous enzyme activities. The project described herein focuses on a sub-set of moonlighting proteins that have a canonical biochemical function inside the cell and perform a second biochemical function on the cell surface in at least one species. The goal of this project is to consider the biophysical features of these moonlighting proteins to determine whether they have shared characteristics or defining features that might suggest why these particular proteins were adopted for a second function on the cell surface, or if these proteins resemble typical intracellular proteins. The latter might suggest that many other normally intracellular proteins found on the cell surface might also be moonlighting in this fashion. We have identified 30 types of proteins that have different functions inside the cell and on the cell surface. Some of these proteins are found to moonlight on the surface of multiple species, sometimes with different extracellular functions in different species, so there are a total of 98 proteins in the study set. Although a variety of intracellular proteins (enzymes, chaperones, etc.) are observed to be re-used on the cell surface, for the most part, these proteins were found to have physical characteristics typical of intracellular proteins. Many other intracellular proteins have also been found on the surface of bacterial pathogens and other organisms in proteomics experiments. It is quite possible that many of those proteins also have a moonlighting function on the cell surface. The increasing number and variety of known moonlighting proteins suggest that there may be more moonlighting proteins than previously thought, and moonlighting might be a common feature of many more proteins.
... We also analyzed intergenic regions in convergent and divergent gene pairs whenever each member of the pair belonged to a COG. The COGs were employed in order to exclude from the analysis spurious 'genes' that may be falsely predicted in long intergenic spacers due to purely statistical reasons (18). ...
Article
Prokaryotic genomes are considered to be ‘wall‐to‐wall’ genomes, which consist largely of genes for proteins and structural RNAs, with only a small fraction of the genomic DNA allotted to intergenic regions, which are thought to typically contain regulatory signals. The majority of bacterial and archaeal genomes contain 6–14% non‐coding DNA. Significant positive correlations were detected between the fraction of non‐coding DNA and inter‐ and intra‐operonic distances, suggesting that different classes of non‐coding DNA evolve congruently. In contrast, no correlation was found between any of these characteristics of non‐coding sequences and the number of genes or genome size. Thus, the non‐coding regions and the gene sets in prokaryotes seem to evolve in different regimes. The evolution of non‐coding regions appears to be determined primarily by the selective pressure to minimize the amount of non‐functional DNA, while maintaining essential regulatory signals, because of which the content of non‐coding DNA in different genomes is relatively uniform and intra‐ and inter‐operonic non‐coding regions evolve congruently. In contrast, the gene set is optimized for the particular environmental niche of the given microbe, which results in the lack of correlation between the gene number and the characteristics of non‐coding regions.
... It could also be that the genomes in the database do not represent enough information to show the difference with enough emphasis. Finally, gene overannotation, as estimated by the SwissProt method proposed by Skovgaard et al. [25,26], is higher for B. subtilis (16.6%) than for E. coli (5.32%). False genes would necessarily have no orthologs in other genomes, and their MI with other genes would necessarily be zero. ...
Article
Tests for the evolutionary conservation of associations between genes coding for transcription factors (TFs) and other genes have been limited to a few model organisms due to the lack of experimental information of functional associations in other organisms. We aimed at surmounting this limitation by using the most co-occurring gene pairs as proxies for the most conserved functional interactions available for each gene in a genome. We then used genes predicted to code for TFs to compare their most conserved interactions against the most conserved interactions for the rest of the genes within each prokaryotic genome available. We plotted profiles of phylogenetic profiles, p-cubic, to compare the maximally scoring interactions of TFs against those of other genes. In most prokaryotes, genes coding for TFs showed lower co-occurrences when compared to other genes suggesting that transcriptional regulation evolves quickly in most, if not all prokaryotes. We also show that genes coding for TFs tend to have lower Codon Adaptation Indexes compared to other genes further suggesting quick gene exchange and rewiring of transcriptional regulation across prokaryotes. transcription factors; interactome; evolvability; comparative genomics; phylogenetic profiles; regulatory interactions.
... Synthesized proteins from RNA translation have different lengths, usually ranging from 30 to more than 20,000 amino acids. An analysis of the protein length distribution in several microorganisms was previously reported [5]. The mean of the protein size distribution depends on the complexity of the organism. ...
Article
An approach for approximately calculating the number of genes in a genome is presented, which takes into account the average protein length expected for the species. A number of virus, bacterial and eukaryotic genomes are scrutinized. Genome figures are presented, which support the average protein size of a species as a criterion for assessing life complexity. The human gene distribution in the 23 chromosomes is investigated emphasizing the genomic rate, the mean 'exon' length, and the mean 'exons' per gene. It is shown that storing all genes of a single human definitely requires less than 12 MB.
... However, the results of some reports highlighting the uncertainty of current annotation techniques for protein-coding genes should also be mentioned. Some of these studies [42,43] propose that particular annotated genes in genomes are not protein-coding genes; they are, in fact, caused by open reading frames overlapping by chance, which implies a decrease in proteome size. Conversely, other studies [44,45] suggest implementing new annotation methods to add missing protein-coding genes, which in turn implies an increase in proteome size. ...
Article
In this study, to demonstrate the language-like behavior of protein length distribution in proteomes, a quantitative linguistic distribution model, Menzerath–Altmann model, was adopted. A total of 10 proteomes from completely sequenced representative organisms (archaea, bacteria, and eukarya domains) were examined. The results showed that the protein length distribution in the complete set of proteomic proteins, or at least in a wide range for each proteome, can be described reasonably well using the distribution model without considering any complex underlying mechanisms. The deliberation of the model parameters confirmed the evolutionary trend and the model parameters were observed to be related to organismal complexity. © 2014 Wiley Periodicals, Inc. Complexity, 2014
... However, a general drawback of computational screens is the limited information content of a small protein-coding gene, particularly if the gene encodes a hydrophobic small protein. This limitation has led to both the underannotation and the overannotation of small protein genes in completed genome sequences (85). Thus, an essential component of small protein identification is independent data such as those acquired through direct detection or mutational analysis demonstrating that the small protein is synthesized. ...
Article
Small proteins, here defined as proteins of 50 amino acids or fewer in the absence of processing, have traditionally been overlooked due to challenges in their annotation and biochemical detection. In the past several years, however, increasing numbers of small proteins have been identified either through the realization that mutations in intergenic regions are actually within unannotated small protein genes or through the discovery that some small, regulatory RNAs encode small proteins. These insights, together with comparative sequence analysis, indicate that tens if not hundreds of small proteins are synthesized in a given organism. This review summarizes what has been learned about the functions of several of these bacterial small proteins, most of which act at the membrane, illustrating the astonishing range of processes in which these small proteins act and suggesting several general conclusions. Important questions for future studies of these overlooked proteins are also discussed. Expected final online publication date for the Annual Review of Biochemistry Volume 83 is June 02, 2014. Please see http://www.annualreviews.org/catalog/pubdates.aspx for revised estimates.
Article
Microproteins encoded by small open reading frames (sORFs) have emerged as a fascinating frontier in genomics. Although sORFs were traditionally overlooked due to their small size, recent technological advancements such as ribosome profiling, mass spectrometry-based strategies and advanced computational approaches have led to the annotation of more than 7000 sORFs in the human genome. Despite this progress, only a tiny portion of these microproteins have been characterized, and an important challenge in the field lies in identifying functionally relevant microproteins and understanding their role in different cellular contexts. In this review, we explore the recent advancements in sORF research, focusing on the new methodologies and computational approaches that have facilitated their identification and functional characterization. Leveraging these new tools holds great promise for dissecting the diverse cellular roles of microproteins and will ultimately pave the way for understanding their role in the pathogenesis of diseases and identifying new therapeutic targets.
Thesis
This thesis comprises two parts whose common thread is the retrospective analysis of an epidemic of invasive meningococcal disease (IMD) that occurred in Normandy between 2003 and 2012 and was linked to the expansion of a particular hypervirulent clone (B:14:P1.7,16/ST-32). The first study (published in Virulence: Sevestre et al 2018;9:923-929) focused on the virulence determinants of "B14" by comparing 6 isolates, some identified from IMD cases and the others from healthy pharyngeal carriage (the latter either expressing the capsule or not). Apparently identical by conventional typing (immunotyping and genotyping by MLST), these 3 bacterial groups proved distinct after genomic analysis and gene-by-gene comparison (more than 600 genes with a genetic profile that varied between groups), which led to identifying the major role of iron acquisition in virulence, and in particular that of the HmbR system, a hemoglobin receptor. In a mouse model (transgenic mice rendered susceptible to the human infection), the 3 groups of strains also appeared distinct, with a hierarchy of infectivity markers (bacterial titers, cytokine levels). Restoring the HmbR system (encapsulated "Off" carriage strains switched to "On") restored invasiveness in vitro and in the animal. Although iron was already known as a virulence factor for various bacterial species, the originality here lies in having identified the role of phase variation of the hmbR gene within a single epidemic clone, allowing adaptation to carriage, a sine qua non of person-to-person transmission. The second study (published in Vaccine: Sevestre et al 2017;35:4029-4033) addressed the durability and breadth of the vaccine protection conferred by MenBvac®, a vaccine based on outer membrane vesicles (OMV) that was formerly used to control the epidemic. This was made possible by two cohorts of children vaccinated with a 4-dose schedule and sampled, for one cohort, 1 year after the last dose and, for the other, 4 years after. Immunogenicity (assessed as serum bactericidal activity against the targeted clone) proved to be of moderate durability, with 48% of children protected at 1 year and 31% at 4 years, a result consistent with published data on OMV vaccines. A bactericidal effect was observed well beyond "B14", owing to cross-immunity to strains with at least partial homology of the PorA porin (the main antigenic determinant of OMV vaccines), which for MenBvac® amounts to 15% of the virulent B clones currently circulating in France, a more original result because it had been little investigated until now.
Article
The concept of cell signaling in the context of nonenzyme-assisted protein modifications by reactive electrophilic and oxidative species, broadly known as redox signaling, is a uniquely complex topic that has been approached from numerous different and multidisciplinary angles. Our Review reflects on five aspects critical for understanding how nature harnesses these noncanonical post-translational modifications to coordinate distinct cellular activities: (1) specific players and their generation, (2) physicochemical properties, (3) mechanisms of action, (4) methods of interrogation, and (5) functional roles in health and disease. Emphasis is primarily placed on the latest progress in the field, but several aspects of classical work likely forgotten/lost are also recollected. For researchers with interests in getting into the field, our Review is anticipated to function as a primer. For the expert, we aim to stimulate thought and discussion about fundamentals of redox signaling mechanisms and nuances of specificity/selectivity and timing in this sophisticated yet fascinating arena at the crossroads of chemistry and biology.
Chapter
Introduction; Classification of Bacteria and Archaea; Analysis of Prokaryotic Genomes; Comparison of Prokaryotic Genomes; Perspective; Pitfalls; Web Resources; Discussion Questions; Problems/Computer Lab; Self-Test Quiz; Suggested Reading; References
Article
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families that are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
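As a rough illustration of the clustering idea in the abstract above, the sketch below groups proteins by their ordered tuple of domain names. The input format (protein ID, domain start position, domain name) and the function name `domain_sequence_families` are assumptions made for this example and are not the authors' pipeline; in practice the hits would come from a search against Pfam-A.

```python
from collections import defaultdict


def domain_sequence_families(hits):
    """Group proteins by their ordered sequence of domains.

    hits: iterable of (protein_id, domain_start, domain_name) tuples,
    a hypothetical, simplified stand-in for parsed Pfam-A hits.
    Returns a dict mapping each domain sequence (a tuple of names)
    to the sorted list of proteins that share it.
    """
    per_protein = defaultdict(list)
    for protein_id, start, name in hits:
        per_protein[protein_id].append((start, name))

    families = defaultdict(list)
    for protein_id, domains in per_protein.items():
        # Order the domains by their start position along the protein.
        key = tuple(name for _, name in sorted(domains))
        families[key].append(protein_id)
    return {key: sorted(members) for key, members in families.items()}


# Toy input: p1 and p2 share the domain order (A, B); p3 has (B, A).
example_hits = [
    ("p1", 10, "A"), ("p1", 100, "B"),
    ("p2", 120, "B"), ("p2", 5, "A"),
    ("p3", 1, "B"), ("p3", 50, "A"),
]
example_families = domain_sequence_families(example_hits)
```

Because the family key depends only on which domains are found and in what order, a slightly wrong predicted start codon would often leave the key unchanged, which is one plausible reading of the robustness claimed in the abstract.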
Article
Full-text available
Significance: Hsp70 (70-kDa heat shock protein) chaperones bind cognate substrates to prevent their aggregation and guide them toward their correctly folded, functional states. Here we use NMR spectroscopy to understand how this is achieved by studying a complex of Hsp70 with a folding competent substrate. Using an NMR experiment presented here, we show that long-range transient contacts are established in the unfolded, unbound state of the substrate. These contacts are greatly attenuated in the bound form of the substrate that also exists as an unfolded ensemble. Our results establish that Hsp70 binding can significantly bias the folding mechanism of client substrate molecules toward pathways where secondary structure is first generated, followed by the establishment of longer-range interactions in a distance-dependent fashion.
Article
Integr8 is a new web portal for exploring the biology of organisms with completely deciphered genomes. For over 190 species, Integr8 provides access to general information, recent publications, and a detailed statistical overview of the genome and proteome of the organism. The preparation of this analysis is supported through Genome Reviews, a new database of bacterial and archaeal DNA sequences in which annotation has been upgraded (compared to the original submission) through the integration of data from many sources, including the EMBL Nucleotide Sequence Database, the UniProt Knowledgebase, InterPro, CluSTr, GOA and HOGENOM. Integr8 also allows the users to customize their own interactive analysis, and to download both customized and prepared datasets for their own use. Integr8 is available at http://www.ebi.ac.uk/integr8.
Article
A central challenge in the field of metabolic engineering is the efficient identification of a metabolic pathway genotype that maximizes specific productivity over a robust range of process conditions. Here we review current methods for optimizing specific productivity of metabolic pathways in living cells. New tools for library generation, computational analysis of pathway sequence-flux space, and high-throughput screening and selection techniques are discussed.
Chapter
Pseudomonas putida strains are rapidly growing bacteria, frequently isolated from most temperate soils and waters, particularly polluted soils. They are nutritional opportunists par excellence and a paradigm of metabolically versatile microorganisms that recycle organic wastes in aerobic and microaerophilic compartments of the environment and that play a key role in the maintenance of environmental quality. P. putida strain KT2440 [2, 53] is probably the best-characterized saprophytic laboratory pseudomonad that has retained its ability to survive and function in the environment. The bacterium is a plasmid-free derivative of a toluene-degrading bacterium, originally designated Pseudomonas arvilla strain mt-2 [46] and subsequently reclassified as P. putida mt-2 [43, 68]. It is the first Gram-negative soil bacterium to be certified by the Recombinant DNA Advisory Committee (RAC) of the United States National Institutes of Health as the host strain of a host-vector biosafety (HV1) system for gene cloning in Gram-negative soil bacteria [21]. An extensive spectrum of versatile genetic tools, in particular mini-transposons and tools based on these, has been developed for its analysis, manipulation and use as a host for cloned genes from other soil organisms [12, 13, 35, 41]. KT2440 is being exploited in the development of a variety of biotechnological applications, including the design of new catabolic pathways for pollutants [19, 51, 56], the production by biocatalysis of intermediates, including chiral synthons for chemical syntheses [72], and quality improvement of fossil fuels, for example by desulphurization [24]. KT2440 is also able to colonize the rhizosphere of a variety of crop plants, such as corn, wheat, strawberry, sugarcane and spinach [20], and is being used to develop new biopesticides and plant growth promoters that function in the plant rhizosphere.
Article
How many protein-coding genes exist in the Aeropyrum pernix K1 genome has been an open question since 1999. In this paper, we attempt to re-identify the protein-coding genes in this genome with a modified method based on the I-TN curve. All 727 experimentally validated protein-coding genes and 726 of the corresponding negative samples are correctly predicted, giving a self-test accuracy of 99.93%. In the jackknife test, two positive samples and two negative samples are falsely predicted, giving a cross-validation accuracy of 99.72%. In the testing set, all 132 putative genes are correctly predicted as protein-coding and 14 of the 841 hypothetical genes are predicted as non-coding, so the number of protein-coding genes is reduced from 1700 to 1686. Further analysis shows that the performance of the reannotating algorithm is comparable to that of other prevalent programs, and that the present method is much simpler and more efficient. We apply the reannotating algorithm trained on Aeropyrum pernix K1 to the Chlorobium tepidum TLS genome, and 217 hypothetical genes are predicted as non-coding. Sequence analysis indicates that most of them are random sequences falsely predicted as protein-coding genes. In addition, we analyze the influence of artificial parameters on graphical representation approaches, which may provide helpful information for related research.
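The accuracies quoted in the abstract above can be checked with a few lines of arithmetic. The sketch assumes that the negative set has the same size as the positive set (727 samples each); the abstract does not state this outright, but it is consistent with the reported percentages. The variable and function names are illustrative only.

```python
def accuracy_percent(correct, total):
    """Share of correctly classified samples, in percent."""
    return 100.0 * correct / total


positives = 727   # experimentally validated genes (from the abstract)
negatives = 727   # assumption: one negative sample per positive sample
total = positives + negatives

# Self-test: all 727 positives and 726 negatives are predicted correctly.
self_test = accuracy_percent(727 + 726, total)

# Jackknife: two positives and two negatives are predicted falsely.
jackknife = accuracy_percent(total - 2 - 2, total)

# Testing set: 14 hypothetical genes reclassified as non-coding.
remaining_genes = 1700 - 14

print(f"self-test {self_test:.2f}%, jackknife {jackknife:.2f}%, genes {remaining_genes}")
```

Under that assumption the computed values round to the 99.93% and 99.72% reported, and the gene count drops to 1686.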
Article
Results: Malarial parasites show distinctive responses to base-compositional pressures that increase as protein lengths increase. A low-GC% species (Plasmodium falciparum) is likely to have more placeholder amino acids than an intermediate-GC% species (P. vivax), so that homologous proteins are longer. In prokaryotes, GC% is generally greater and AG% is generally less in open reading frames (ORFs) encoding long proteins. The increased GC% in long ORFs increases as species’ GC% increases, and decreases as species’ AG% increases. In low- and intermediate-GC% prokaryotic species, increases in ORF GC% as encoded proteins increase in length are largely accounted for by the base compositions of first and second (amino acid-determining) codon positions. In high-GC% prokaryotic species, first and third (non-amino acid-determining) codon positions play this role.
Article
More and more studies indicate that protein-coding gene finding in microbial genomes is far from solved, and annotation quality has been questioned repeatedly over the past several years. In this paper, we summarize the computational methods for identifying over-annotated genes and missing genes, and offer a perspective on future gene-finding work.
Article
Full-text available
The complete sequence of the genome of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1, which optimally grows at 95°C, has been determined by the whole genome shotgun method with some modifications. The entire length of the genome was 1,669,695 bp. The authenticity of the entire sequence was supported by restriction analysis of long PCR products, which were directly amplified from the genomic DNA. As the potential protein-coding regions, a total of 2,694 open reading frames (ORFs) were assigned. By similarity search against public databases, 633 (23.5%) of the ORFs were related to genes with putative function and 523 (19.4%) to the sequences registered but with unknown function. All the genes in the TCA cycle except for that of alpha-ketoglutarate dehydrogenase were included, and instead of the alpha-ketoglutarate dehydrogenase gene, the genes coding for the two subunits of 2-oxoacid:ferredoxin oxidoreductase were identified. The remaining 1,538 ORFs (57.1%) did not show any significant similarity to the sequences in the databases. Sequence comparison among the assigned ORFs suggested that a considerable number of ORFs were generated by sequence duplication. The RNA genes identified were a single 16S–23S rRNA operon, two 5S rRNA genes and 47 tRNA genes including 14 genes with intron structures. All the assigned ORFs and RNA coding regions occupied 89.12% of the whole genome. The data presented in this paper are available on the internet homepage ( Author Webpage ).
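The three ORF categories in the abstract above partition the 2,694 assigned ORFs, which a short check makes explicit. The category labels used as dictionary keys are shorthand chosen for this example.

```python
total_orfs = 2694

# Counts taken from the abstract; the labels are illustrative shorthand.
orf_counts = {
    "putative function": 633,
    "registered, unknown function": 523,
    "no significant similarity": 1538,
}

# The three categories should account for every assigned ORF.
counts_sum = sum(orf_counts.values())

# Percentage of all ORFs in each category, rounded to one decimal place.
orf_shares = {
    label: round(100.0 * count / total_orfs, 1)
    for label, count in orf_counts.items()
}

for label, share in orf_shares.items():
    print(f"{label}: {share}%")
```

The counts sum to 2,694 and the rounded shares match the 23.5%, 19.4% and 57.1% quoted in the abstract.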
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic, and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is described for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
Article
Full-text available
SWISS-PROT (http://www.expasy.ch/) is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files and improvements to TrEMBL, a computer annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT.
Article
Full-text available
SWISS-2DPAGE (http://www.expasy.ch/ch2d/) is an annotated two-dimensional polyacrylamide gel electrophoresis (2-DE) database established in 1993. The current release contains 24 reference maps from human and mouse biological samples, as well as from Saccharomyces cerevisiae, Escherichia coli and Dictyostelium discoideum. These reference maps now have 2824 identified spots, corresponding to 614 separate protein entries in the database, in addition to virtual entries for each SWISS-PROT sequence or any user-entered amino acid sequence. Last year's improvements to the SWISS-2DPAGE database are as follows: three new maps have been created and several others have been updated; cross-references to newly built federated 2-DE databases have been added; new functions to access the data have been provided through the ExPASy proteomics server.
Article
Full-text available
The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis.
Article
Full-text available
Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. 
Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Article
Full-text available
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include format and content enhancements, cross-references to additional databases, new documentation files and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDSs) in the EMBL Nucleotide Sequence Database, except the CDSs already included in SWISS-PROT. We also describe the Human Proteomics Initiative (HPI), a major project to annotate all known human sequences according to the quality standards of SWISS-PROT. SWISS-PROT is available at: http://www.expasy.ch/sprot/ and http://www.ebi.ac.uk/swissprot/
Article
Full-text available
While most organisms grow at temperatures ranging between 20 and 50 °C, many archaea and a few bacteria, such as Pyrococcus or Aquifex, have been found capable of withstanding temperatures close to 100 °C or beyond. Here we report the results of two independent large-scale unbiased approaches to identify global protein properties correlating with an extreme thermophile lifestyle. First, we performed a comparative proteome analysis using 30 complete genome sequences from the three kingdoms. A large difference between the proportions of charged versus polar (noncharged) amino acids was found to be a signature of all hyperthermophilic organisms. Second, we analyzed the water accessible surfaces of 189 protein structures belonging to mesophiles or hyperthermophiles. We found that the surfaces of hyperthermophilic proteins exhibited the shift already observed at the genomic level, i.e. a proportion of solvent accessible charged residues strongly increased at the expense of polar residues. The biophysical requirement for the presence of charged residues at the protein surface, allowing protein stabilization through ion bonds, is therefore clearly imprinted and detectable in all genome sequences available to date.
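The signature described in the abstract above, a difference between the proportions of charged and polar (noncharged) residues, can be sketched for a single sequence as follows. The abstract does not list the residue sets; the choice of D, E, K, R as charged and N, Q, S, T as polar is an assumption made for this example, and the function name is hypothetical.

```python
# Assumed residue classes (one-letter codes); not specified in the abstract.
CHARGED = set("DEKR")
POLAR = set("NQST")


def charged_minus_polar(sequence):
    """Return the fraction of charged residues minus the fraction of polar
    (noncharged) residues in a protein sequence, a value in [-1, 1]."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    charged = sum(1 for residue in seq if residue in CHARGED)
    polar = sum(1 for residue in seq if residue in POLAR)
    return (charged - polar) / len(seq)


# Toy sequences: balanced, all charged, and mixed with one neutral residue.
balanced = charged_minus_polar("DEKRNQST")
all_charged = charged_minus_polar("DDEE")
mixed = charged_minus_polar("KKNA")
```

Averaging this value over all proteins of a genome would give one number per organism; under the abstract's claim, hyperthermophiles would be expected to sit at the high end of that scale.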
Selection of representative protein datasets
  • U Hobohm
Hobohm, U. et al. (1992) Selection of representative protein datasets. Protein Sci. 1, 409-417