Sequence-based feature prediction and annotation of proteins

Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, DK-2800 Lyngby, Denmark.
Genome Biology (Impact Factor: 10.81). 03/2009; 10(2):206. DOI: 10.1186/gb-2009-10-2-206
Source: PubMed


A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome.



    • "Many tools have been developed to mine several databases of biological information to finally predict a protein function based on sequence similarities. Detailed strategies on genomics and proteomics sequence annotation can be found in previous publications [11] [12] [13] [14] [15] [16] [17]. Nevertheless, once the genome and proteome are annotated, one of the most disseminated strategies of proteomics data functional annotation includes the use of ontologies, which can be understood as an explicit specification of a conceptualization [18]. "
    ABSTRACT: Proteomics experiments often generate a vast amount of data. However, the simple identification and quantification of proteins from a cell proteome or subproteome is not sufficient for the full understanding of complex mechanisms occurring in the biological systems. Therefore, the functional annotation analysis of protein datasets using bioinformatics tools is essential for interpreting the results of high-throughput proteomics. Although large-scale proteomics data have rapidly increased, the biological interpretation of these results remains as a challenging task. Here we reviewed basic concepts and different programs that are commonly used in proteomics data functional annotation, emphasizing the main strategies focused in the use of gene ontology annotations. Furthermore, we explored the characteristics of some tools developed for functional annotation analysis, concerning the ease of use and typical caveats on ontology annotations. The utility and variations between different tools were assessed through the comparison of the resulting outputs generated for an example of proteomics dataset.
    Biochimica et Biophysica Acta (BBA) - Proteins & Proteomics 01/2015; 1854(1):46–54. DOI:10.1016/j.bbapap.2014.10.019 · 2.75 Impact Factor
    • "The annotation of most genes and gene products is incomplete with only a sparse set of annotations to generic high-level categories available (Faria et al., 2012). For those annotations that do exist, the overwhelming majority are automatically generated on the basis of sequence or structural similarity without any curatorial review (du Plessis et al., 2011; Juncker et al., 2009). Such automatically generated annotations have known quality issues relative to manually curated annotations, especially those based on published experimental findings (Bell et al., 2012; Dolan et al., 2005; Faria et al., 2012; Park et al., 2011; Schnoes et al., 2009; Skunca et al., 2012). "
    ABSTRACT: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar data sets. We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled data sets. As shown using simulated gene sets with simulated data and MSigDB collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome, resulting in increased gene set enrichment power and better replication of enrichment results. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Bioinformatics 02/2014; 30(12). DOI:10.1093/bioinformatics/btu110 · 4.98 Impact Factor
    • "A wealth of predictors have been developed in the last thirty years for inferring many diverse types of features, see e.g. Juncker et al.[1] for a review. "
    ABSTRACT: Computational methods for the prediction of protein features from sequence are a long-standing focus of bioinformatics. A key observation is that several protein features are closely inter-related, that is, they are conditioned on each other. Researchers invested a lot of effort into designing predictors that exploit this fact. Most existing methods leverage inter-feature constraints by including known (or predicted) correlated features as inputs to the predictor, thus conditioning the result. By including correlated features as inputs, existing methods only rely on one side of the relation: the output feature is conditioned on the known input features. Here we show how to jointly improve the outputs of multiple correlated predictors by means of a probabilistic-logical consistency layer. The logical layer enforces a set of weighted first-order rules encoding biological constraints between the features, and improves the raw predictions so that they least violate the constraints. In particular, we show how to integrate three stand-alone predictors of correlated features: subcellular localization (Loctree [J Mol Biol 348:85-100, 2005]), disulfide bonding state (Disulfind [Nucleic Acids Res 34:W177-W181, 2006]), and metal bonding state (MetalDetector [Bioinformatics 24:2094-2095, 2008]), in a way that takes into account the respective strengths and weaknesses, and does not require any change to the predictors themselves. We also compare our methodology against two alternative refinement pipelines based on state-of-the-art sequential prediction methods. The proposed framework is able to improve the performance of the underlying predictors by removing rule violations. We show that different predictors offer complementary advantages, and our method is able to integrate them using non-trivial constraints, generating more consistent predictions. In addition, our framework is fully general, and could in principle be applied to a vast array of heterogeneous predictions without requiring any change to the underlying software. On the other hand, the alternative strategies are more specific and tend to favor one task at the expense of the others, as shown by our experimental evaluation. The ultimate goal of our framework is to seamlessly integrate full prediction suites, such as Distill [BMC Bioinformatics 7:402, 2006] and PredictProtein [Nucleic Acids Res 32:W321-W326, 2004].
    BMC Bioinformatics 01/2014; 15(1):16. DOI:10.1186/1471-2105-15-16 · 2.58 Impact Factor