-
ABSTRACT: MOTIVATION: Prediction of protein-protein interaction has become an important part of systems biology in reverse engineering the biological networks for better understanding the molecular biology of the cell. While significant progress has been made in terms of prediction accuracy, most computational methods only predict whether two proteins interact but not their interacting residues, information that can be very valuable for understanding the interaction mechanisms and designing modulation of the interaction. In this work, we developed a computational method to predict the interacting residue pairs, expressed as a contact matrix for interacting protein domains, whose rows and columns correspond to the residues in the two interacting domains respectively and whose values (1 or 0) indicate whether the corresponding residues do or do not interact. RESULTS: Our method is based on supervised learning using support vector machines. For each domain involved in a given domain-domain interaction (DDI), an interaction profile hidden Markov model (ipHMM) is first built for the domain family, and then each residue position for a member domain sequence is represented as a 20-dimensional vector of Fisher scores, characterizing how similar it is to the family profile at that position. Each element of the contact matrix for a sequence pair is then represented by a feature vector formed by concatenating the vectors of the two corresponding residues, and the task is to predict the element value (1 or 0) from the feature vector. A support vector machine is trained for a given DDI, using either a consensus contact matrix or contact matrices for individual sequence pairs, and is tested by leave-one-out cross-validation.
The performance averaged over a set of 115 DDIs collected from the 3DID database shows significant improvement (sensitivity up to 85%, and specificity up to 85%), as compared to a multiple sequence alignment based method (sensitivity 57%, and specificity 78%) previously reported in the literature. AVAILABILITY: CONTACT: lliao@cis.udel.edu.
Bioinformatics 02/2013; · 5.47 Impact Factor
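The feature construction and the evaluation measures described in this abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names and toy inputs are assumptions, short vectors stand in for the 20-dimensional Fisher-score vectors, and the SVM training step is omitted.

```python
def pair_features(domain_a, domain_b):
    """Build one feature vector per residue pair (i, j).

    domain_a and domain_b are lists of per-residue vectors (hypothetical
    stand-ins for Fisher-score vectors). Each pair's feature vector is the
    concatenation of the two residues' vectors.
    """
    return {
        (i, j): list(vec_a) + list(vec_b)
        for i, vec_a in enumerate(domain_a)
        for j, vec_b in enumerate(domain_b)
    }


def sensitivity_specificity(true_matrix, predicted_matrix):
    """Compare two 0/1 contact matrices element by element."""
    tp = fn = tn = fp = 0
    for true_row, pred_row in zip(true_matrix, predicted_matrix):
        for truth, pred in zip(true_row, pred_row):
            if truth == 1 and pred == 1:
                tp += 1
            elif truth == 1 and pred == 0:
                fn += 1
            elif truth == 0 and pred == 0:
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Each key of the returned dictionary corresponds to one element of the contact matrix, so the matching 0/1 label for pair (i, j) would be read from row i, column j of that matrix.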
-
ABSTRACT: As a member of the Open Biomedical Ontologies (OBO) foundry, the Protein Ontology (PRO) provides an ontological representation of protein forms and complexes and their relationships. Annotations in PRO can be assigned to individual protein forms and complexes, each distinguishable down to the level of post-translational modification, thereby allowing for a more precise depiction of protein function than is possible with annotations to the gene as a whole. Moreover, PRO is fully interoperable with other OBO ontologies and integrates knowledge from other protein-centric resources such as UniProt and Reactome. Here we demonstrate the value of the PRO framework in the investigation of the spindle checkpoint, a highly conserved biological process that relies extensively on protein modification and protein complex formation. The spindle checkpoint maintains genomic integrity by monitoring the attachment of chromosomes to spindle microtubules and delaying cell cycle progression until the spindle is fully assembled. Using PRO in conjunction with other bioinformatics tools, we explored the cross-species conservation of spindle checkpoint proteins, including phosphorylated forms and complexes; studied the impact of phosphorylation on spindle checkpoint function; and examined the interactions of spindle checkpoint proteins with the kinetochore, the site of checkpoint activation. Our approach can be generalized to any biological process of interest.
Frontiers in genetics. 01/2013; 4:62.
-
ABSTRACT: Protein phosphorylation is a central regulatory mechanism in signal transduction involved in most biological processes. Phosphorylation of a protein may lead to activation or repression of its activity, alternative subcellular location and interaction with different binding partners. Extracting this type of information from scientific literature is critical for connecting phosphorylated proteins with kinases and interaction partners, along with their functional outcomes, for knowledge discovery from phosphorylation protein networks. We have developed the Extracting Functional Impact of Phosphorylation (eFIP) text mining system, which combines several natural language processing techniques to find relevant abstracts mentioning phosphorylation of a given protein together with indications of protein-protein interactions (PPIs) and potential evidence for the impact of phosphorylation on the PPIs. eFIP integrates our previously developed tools, Extracting Gene Related ABstracts (eGRAB) for document retrieval and name disambiguation and Rule-based LIterature Mining System for Protein Phosphorylation (RLIMS-P) for extraction of phosphorylation information, with a PPI module to detect PPIs involving phosphorylated proteins and an impact module for relation extraction. The text mining system has been integrated into the curation workflow of the Protein Ontology (PRO) to capture knowledge about phosphorylated proteins. The eFIP web interface accepts gene/protein names or identifiers, or PubMed identifiers as input, and displays results as a ranked list of abstracts with sentence evidence and a summary table, which can be exported to a spreadsheet upon result validation.
As a participant in the BioCreative-2012 Interactive Text Mining track, the performance of eFIP was evaluated on document retrieval (F-measures of 78-100%), sentence-level information extraction (F-measures of 70-80%) and document ranking (normalized discounted cumulative gain measures of 93-100% and mean average precision of 0.86). The utility and usability of the eFIP web interface were also evaluated during the BioCreative Workshop. The use of the eFIP interface provided a significant speed-up (∼2.5-fold) for time to completion of the curation task. Additionally, eFIP significantly simplifies the task of finding relevant articles on PPI involving phosphorylated forms of a given protein. Database URL: http://proteininformationresource.org/pirwww/iprolink/eFIP.shtml.
Database The Journal of Biological Databases and Curation 01/2012; 2012:bas044. · 2.07 Impact Factor
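The two ranking measures named in this abstract, normalized discounted cumulative gain and average precision, can be computed as in the sketch below. It is illustrative only: the function names are assumptions, the DCG uses the linear-gain form with a log2 rank discount (other variants exist), average precision is taken over the relevant items present in the ranked list, and the BioCreative evaluation may have used different settings.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def average_precision(relevant_flags):
    """Mean of precision@rank taken at each rank holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Mean average precision is then the mean of `average_precision` over all queries, for example over all input proteins.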
-
Sarah Hunter,
Philip Jones,
Alex Mitchell,
Rolf Apweiler,
Teresa K Attwood,
Alex Bateman,
Thomas Bernard,
David Binns,
Peer Bork,
Sarah Burge, [......],
Amaia Sangrador-Vegas,
Jeremy D Selengut,
Christian J A Sigrist,
Maxim Scheremetjew,
John Tate,
Manjulapramila Thimmajanarthanan,
Paul D Thomas, Cathy H Wu,
Corin Yeats,
Siew-Yit Yong
ABSTRACT: InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Nucleic Acids Research 11/2011; 40(Database issue):D306-12. · 8.03 Impact Factor
-
ABSTRACT: We present a new computational method for predicting ligand binding residues and functional sites in protein sequences. These residues and sites tend to be not only conserved, but also exhibit strong correlation due to the selection pressure during evolution in order to maintain the required structure and/or function. To explore the effect of correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based canonical correlation analysis (kCCA) to identify binding and functional sites in protein sequences as the residues that exhibit strong correlation between the residues’ evolutionary characterization at the sites and the structure-based functional classification of the proteins in the context of a functional family. The results of testing the method on two well-curated data sets show that the prediction accuracy as measured by Receiver Operating Characteristic (ROC) scores improves significantly when multipositional correlations are accounted for.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 10/2011; 9(4):992-1001. · 2.25 Impact Factor
-
ABSTRACT: MOTIVATION: Identifier (ID) mapping establishes links between various biological databases and is an essential first step for molecular data integration and functional annotation. ID mapping allows diverse molecular data on genes and proteins to be combined and mapped to functional pathways and ontologies. We have developed comprehensive protein-centric ID mapping services providing mappings for 90 IDs derived from databases on genes, proteins, pathways, diseases, structures, protein families, protein interaction, literature, ontologies, etc. The services are widely used and have been regularly updated since 2006. AVAILABILITY: www.uniprot.org/mapping and proteininformationresource.org/pirwww/search/idmapping.shtml CONTACT: huang@dbi.udel.edu.
Bioinformatics 04/2011; 27(8):1190-1. · 5.47 Impact Factor
-
IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011, Atlanta, GA, USA, 12-15 November, 2011; 01/2011
-
ABSTRACT: Technologies and experimental strategies have improved dramatically in the field of genomics and proteomics, facilitating analysis of cellular and biochemical processes, as well as of protein networks. Based on numerous such analyses, there has been a significant increase in publications in life sciences and biomedicine. In this respect, knowledge bases are struggling to cope with the literature volume and they may not be able to capture in detail certain aspects of proteins and genes. One important aspect of proteins is their phosphorylated states and their implication in protein function and protein interaction networks. For this reason, we developed eFIP, a web-based tool that helps scientists quickly find abstracts mentioning phosphorylation of a given protein (including site and kinase), coupled with mentions of interactions and functional aspects of the protein. eFIP combines information provided by applications such as eGRAB, RLIMS-P, eGIFT and AIIAGMT to rank abstracts mentioning phosphorylation, and to display the results in a highlighted and tabular format for quick inspection. In this chapter, we present a case study of results returned by eFIP for the protein BAD, which is a key regulator of apoptosis that is posttranslationally modified by phosphorylation.
Methods in molecular biology (Clifton, N.J.) 01/2011; 694:63-75.
-
ABSTRACT: The rapid growth of protein sequence databases has necessitated the development of methods to computationally derive annotation for uncharacterized entries. Most such methods focus on "global" annotation, such as molecular function or biological process. Methods to supply high-accuracy "local" annotation to functional sites based on structural information at the level of individual amino acids are relatively rare. In this chapter we describe a method we have developed for annotation of functional residues within experimentally uncharacterized proteins that relies on position-specific site annotation rules (PIR Site Rules) derived from structural and experimental information. These PIR Site Rules are manually defined to allow for conditional propagation of annotation. Each rule specifies a tripartite set of conditions whereby candidates for annotation must pass a whole-protein classification test (that is, have an end-to-end match to a whole-protein-based HMM), match a site-specific profile HMM and, finally, match functionally and structurally characterized residues of a template. Positive matches trigger the appropriate annotation for active site residues, binding site residues, modified residues, or other functionally important amino acids. The strict criteria used in this process yield high-confidence annotation suitable for UniProtKB/Swiss-Prot features.
Methods in molecular biology (Clifton, N.J.) 01/2011; 694:91-105.
-
ABSTRACT: High-throughput proteomic, microarray, protein interaction and other experimental methods all generate long lists of proteins and/or genes that have been identified or have varied in accumulation under the experimental conditions studied. These lists can be difficult for biologists to sort through and make sense of. Here we describe a next step in data analysis, a bottom-up approach to data integration: starting with protein sequence identifications, mapping them to a common representation of the protein, bringing in a wide variety of structural, functional, genetic, and disease information related to proteins derived from annotated knowledge bases, and then using this information to categorize the lists using Gene Ontology (GO) terms and mappings to biological pathway databases. We illustrate with examples how this can aid in identifying important processes from large complex lists.
Methods in molecular biology (Clifton, N.J.) 01/2011; 694:323-39.
-
ABSTRACT: The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership thresholds (CMTs) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT is also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.
PLoS ONE 01/2011; 6(4):e18910. · 4.09 Impact Factor
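The idea of grouping proteomes by co-membership in sequence clusters can be illustrated with the sketch below. It rests on a stated assumption, not on the paper's definition: co-membership of proteome A with proteome B is taken here to be the percentage of A's clusters that also contain a protein from B. The function names, cluster labels and grouping step are hypothetical, and the published procedure for building RPGs and choosing the representative may differ.

```python
def co_membership(clusters_a, clusters_b):
    """Percentage of A's clusters that B also has a member in (assumed definition)."""
    a, b = set(clusters_a), set(clusters_b)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)


def similar_proteomes(proteomes, reference, cmt):
    """Names of proteomes whose co-membership with `reference` meets the threshold.

    `proteomes` maps a proteome name to the set of cluster IDs it occupies;
    `cmt` is the co-membership threshold in percent (e.g. 55 for RP55).
    """
    ref_clusters = proteomes[reference]
    return sorted(
        name
        for name, clusters in proteomes.items()
        if name != reference and co_membership(clusters, ref_clusters) >= cmt
    )
```

Under this assumed definition, lowering `cmt` admits more distantly related proteomes into a group, which matches the abstract's description of the threshold as a granularity control.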
-
ABSTRACT: In the past decades, a variety of publicly available data repositories and resources have been developed to support protein-related information management, data-driven hypothesis generation and biological knowledge discovery. However, there is also increasing confusion among researchers who are trying to quickly find the appropriate resources to help them solve their problems. In this chapter, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research. We conclude the chapter by discussing the challenges and opportunities for developing new protein bioinformatics databases.
Methods in molecular biology (Clifton, N.J.) 01/2011; 694:3-24.
-
ABSTRACT: Genomic, proteomic, and other omic-based approaches are now broadly used in biomedical research to facilitate the understanding of disease mechanisms and identification of molecular targets and biomarkers for therapeutic and diagnostic development. While the Omics technologies and bioinformatics tools for analyzing Omics data are rapidly advancing, the functional analysis and interpretation of the data remain challenging due to the inherent nature of the generally long workflows of Omics experiments. We adopt a strategy that emphasizes the use of curated knowledge resources coupled with expert-guided examination and interpretation of Omics data for the selection of potential molecular targets. We describe a downstream workflow and procedures for functional analysis that focus on biological pathways, from which molecular targets can be derived and proposed for experimental validation.
Methods in molecular biology (Clifton, N.J.) 01/2011; 719:547-71.
-
Darren A Natale,
Cecilia N Arighi,
Winona C Barker,
Judith A Blake,
Carol J Bult,
Michael Caudy,
Harold J Drabkin,
Peter D'Eustachio,
Alexei V Evsikov,
Hongzhan Huang,
Jules Nchoutmboube,
Natalia V Roberts,
Barry Smith,
Jian Zhang, Cathy H Wu
ABSTRACT: The Protein Ontology (PRO) provides a formal, logically-based classification of specific protein classes including structured representations of protein isoforms, variants and modified forms. Initially focused on proteins found in human, mouse and Escherichia coli, PRO now includes representations of protein complexes. The PRO Consortium works in concert with the developers of other biomedical ontologies and protein knowledge bases to provide the ability to formally organize and integrate representations of precise protein forms so as to enhance accessibility to results of protein research. PRO (http://pir.georgetown.edu/pro) is part of the Open Biomedical Ontology Foundry.
Nucleic Acids Research 10/2010; 39(Database issue):D539-45. · 8.03 Impact Factor
-
ABSTRACT: Attempts to engage the scientific community to annotate biological data (such as protein/gene function) stored in databases have not been overly successful. There are several hypotheses on why this has not been successful but it is not clear which of these hypotheses are correct. In this study we have surveyed 50 biologists (who have recently published a paper characterizing a gene or protein) to better understand what would make them interested in providing input/contributions to biological databases. Based on our survey two things become clear: a) database managers need to proactively contact biologists to solicit contributions; and b) potential contributors need to be provided with an easy-to-use interface and clear instructions on what to annotate. Other factors such as 'reward' and 'employer/funding agency recognition', previously perceived as motivators, were found to be less important. Based on this study we propose that community annotation projects should devote resources to direct solicitation for input and streamlining of the processes or interfaces used to collect this input. REVIEWERS: This article was reviewed by I. King Jordan, Daniel Haft and Yuriy Gusev.
Biology Direct 02/2010; 5:12. · 4.02 Impact Factor
-
Artificial Intelligence in Medicine. 01/2010; 49:155-160.
-
10th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2010, Philadelphia, Pennsylvania, USA, May 31-June 3 2010; 01/2010
-
Adv. Bioinformatics. 01/2010; 2010.
-
ABSTRACT: Members of the Roseobacter clade, which play a key role in the biogeochemical cycles of the ocean, are diverse and abundant, comprising 10-25% of the bacterioplankton in most marine surface waters. The rapid accumulation of whole-genome sequence data for the Roseobacter clade allows us to obtain a clearer picture of its evolution.
In this study about 1,200 likely orthologous protein families were identified from 17 Roseobacter genomes. Functional annotations for these genes are provided by iProClass. Phylogenetic trees were constructed for each gene using maximum likelihood (ML) and neighbor joining (NJ). Putative organismal phylogenetic trees were built with phylogenomic methods. These trees were compared and analyzed using principal coordinates analysis (PCoA) and the approximately unbiased (AU) and Shimodaira-Hasegawa (SH) tests. A core set of 694 genes with vertical descent signal that are resistant to horizontal gene transfer (HGT) is used to reconstruct a robust organismal phylogeny. In addition, we identified the 109 most likely HGT genes. The core set contains genes that encode ribosomal apparatus, ABC transporters and chaperones often found in environmental metagenomic and metatranscriptomic data. The genes in the core set are spread out uniformly among the various functional classes and biological processes.
Here we report a new multigene-derived phylogenetic tree of the Roseobacter clade. Of particular interest is the HGT of eleven genes involved in vitamin B12 synthesis as well as key enzymes for dimethylsulfoniopropionate (DMSP) degradation. These acquired genes are essential for the growth of Roseobacters and their eukaryotic partners.
PLoS ONE 01/2010; 5(7):e11604. · 4.09 Impact Factor
-
ABSTRACT: High-throughput "omics" technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput "omics" data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput "omics" data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied "omics" data from different laboratories to make useful connections that could lead to new biological knowledge.
Advances in Bioinformatics 01/2010;