Bonnie Berger

Massachusetts General Hospital, Boston, MA, USA

Publications (109) · 469.59 Total impact

  • Article: Computational solutions for omics data.
    Bonnie Berger, Jian Peng, Mona Singh
    ABSTRACT: High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.
    Nature Reviews Genetics 05/2013; 14(5):333-46. · 38.08 Impact Factor
  • Article: Structure-based whole genome realignment reveals many novel non-coding RNAs.
    Sebastian Will, Michael Yu, Bonnie Berger
    ABSTRACT: Recent genome-wide computational screens that search for conservation of RNA secondary structure in whole genome alignments (WGAs) have predicted thousands of structural noncoding RNAs (ncRNAs). The sensitivity of such approaches, however, is limited due to their reliance on sequence-based whole-genome aligners, which regularly misalign structural ncRNAs. This suggests that many more structural ncRNAs may remain undetected. Structure-based alignment, which could increase the sensitivity, has been prohibitive for genome-wide screens due to its extreme computational costs. Breaking this barrier, we present the pipeline REAPR (RE-Alignment for Prediction of structural ncRNA), which efficiently realigns whole genomes based on RNA sequence and structure, thus allowing us to boost the performance of de novo ncRNA predictors, such as RNAz. Key to the pipeline's efficiency is the development of a novel banding technique for multiple RNA alignment. REAPR significantly outperforms the widely-used predictors RNAz and EvoFold in genome-wide screens; in direct comparison to the most recent RNAz screen on D. melanogaster, REAPR predicts twice as many high-confidence ncRNA candidates. Moreover, modEncode RNA-Seq experiments confirm a substantial number of its predictions as transcripts. REAPR's advancement of de novo structural characterization of ncRNAs complements the identification of transcripts from rapidly accumulating RNA-Seq data.
    Genome Research 01/2013; · 13.61 Impact Factor
  • Article: Genetic determinants of phosphate response in Drosophila.
    ABSTRACT: Phosphate is required for many important cellular processes and having too little phosphate or too much can cause disease and reduce life span in humans. However, the mechanisms underlying homeostatic control of extracellular phosphate levels and cellular effects of phosphate are poorly understood. Here, we establish Drosophila melanogaster as a model system for the study of phosphate effects. We found that Drosophila larval development depends on the availability of phosphate in the medium. Conversely, life span is reduced when adult flies are cultured on high phosphate medium or when hemolymph phosphate is increased in flies with impaired Malpighian tubules. In addition, RNAi-mediated inhibition of MAPK-signaling by knockdown of Ras85D, phl/D-Raf or Dsor1/MEK affects larval development, adult life span and hemolymph phosphate, suggesting that some in vivo effects involve activation of this signaling pathway by phosphate. To identify novel genetic determinants of phosphate responses, we used Drosophila hemocyte-like cultured cells (S2R+) to perform a genome-wide RNAi screen using MAPK activation as the readout. We identified a number of candidate genes potentially important for the cellular response to phosphate. Evaluation of 51 genes in live flies revealed some that affect larval development, adult life span and hemolymph phosphate levels.
    PLoS ONE 01/2013; 8(3):e56753. · 4.09 Impact Factor
  • Article: Computational analysis of noncoding RNAs.
    ABSTRACT: Noncoding RNAs have emerged as key players in the cell. Understanding their surprisingly diverse range of functions is challenging for experimental and computational biology. Here, we review computational methods to analyze noncoding RNAs. The topics covered include basic and advanced techniques to predict RNA structures, annotation of noncoding RNAs in genomic data, mining RNA-seq data for novel transcripts and prediction of transcript structures, computational aspects of microRNAs, and database resources. WIREs RNA 2012. doi: 10.1002/wrna.1134
    WIREs RNA 09/2012; 3(6):759-78.
  • Article: Coev2Net: a computational framework for boosting confidence in high-throughput protein-protein interaction datasets.
    ABSTRACT: Improving the quality and coverage of the protein interactome is of paramount importance for biomedical research, particularly given the various sources of uncertainty in high-throughput techniques. We introduce a structure-based framework, Coev2Net, for computing a single confidence score that addresses both false positive and false negative rates. Coev2Net is easily applied to thousands of binary protein interactions and has superior predictive performance over existing methods. We experimentally validate selected high-confidence predictions in the human MAPK network and show that predicted interfaces are enriched for cancer-related or damaging SNPs. Coev2Net can be downloaded at http://struct2net.csail.mit.edu/
    Genome biology 08/2012; 13(8):R76. · 6.63 Impact Factor
  • Article: A global sampling approach to designing and reengineering RNA secondary structures.
    ABSTRACT: The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: First a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. Although computationally inexpensive, this approach raises several questions such as (i) the influence of the seed; and (ii) the efficiency of single-path directed searches that may be affected by energy barriers in the mutational landscape. In this article, we present RNA-ensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to single-path directed searches, our approach is more robust, succeeds more often and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNA-ensign is available at http://csb.cs.mcgill.ca/RNAensign.
    Nucleic Acids Research 08/2012; · 8.03 Impact Factor
  • Article: A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers.
    ABSTRACT: BACKGROUND: Understanding the fundamental mechanisms of tumorigenesis remains one of the most pressing problems in modern biology. To this end, stem-like cells with tumor-initiating potential have become a central focus in cancer research. While the cancer stem cell hypothesis presents a compelling model of self-renewal and partial differentiation, the relationship between tumor cells and normal stem cells remains unclear. RESULTS: We identify, in an unbiased fashion, mRNA transcription patterns associated with pluripotent stem cells. Using this profile, we derive a quantitative measure of stem cell-like gene expression activity. We show how this 189 gene signature stratifies a variety of stem cell, malignant and normal tissue samples by their relative plasticity and state of differentiation within Concordia, a diverse gene expression database consisting of 3,209 Affymetrix HGU133+ 2.0 microarray assays. Further, the orthologous murine signature correctly orders a time course of differentiating embryonic mouse stem cells. Finally, we demonstrate how this stem-like signature serves as a proxy for tumor grade in a variety of solid tumors, including brain, breast, lung and colon. CONCLUSIONS: This core stemness gene expression signature represents a quantitative measure of stem cell-associated transcriptional activity. Broadly, the intensity of this signature correlates to the relative level of plasticity and differentiation across all of the human tissues analyzed. The fact that the intensity of this signature is also capable of differentiating histological grade for a variety of human malignancies suggests potential therapeutic and diagnostic implications.
    Genome biology 08/2012; 13(8):R71. · 6.63 Impact Factor
  • Article: Making sense out of massive data by going beyond differential expression.
    ABSTRACT: With the rapid growth of publicly available high-throughput transcriptomic data, there is increasing recognition that large sets of such data can be mined to better understand disease states and mechanisms. Prior gene expression analyses, both large and small, have been dichotomous in nature, in which phenotypes are compared using clearly defined controls. Such approaches may require arbitrary decisions about what are considered "normal" phenotypes, and what each phenotype should be compared to. Instead, we adopt a holistic approach in which we characterize phenotypes in the context of a myriad of tissues and diseases. We introduce scalable methods that associate expression patterns to phenotypes in order both to assign phenotype labels to new expression samples and to select phenotypically meaningful gene signatures. By using a nonparametric statistical approach, we identify signatures that are more precise than those from existing approaches and accurately reveal biological processes that are hidden in case vs. control studies. Employing a comprehensive perspective on expression, we show how metastasized tumor samples localize in the vicinity of the primary site counterparts and are overenriched for those phenotype labels. We find that our approach provides insights into the biological processes that underlie differences between tissues and diseases beyond those identified by traditional differential expression analyses. Finally, we provide an online resource (http://concordia.csail.mit.edu) for mapping users' gene expression samples onto the expression landscape of tissue and disease.
    Proceedings of the National Academy of Sciences 03/2012; 109(15):5594-9. · 9.68 Impact Factor
  • Article: SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone.
    ABSTRACT: MOTIVATION: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile hidden Markov models (HMMs). However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in beta sheets. These dependencies have been partially captured in the HMM setting by simulated evolution in the training phase and can be fully captured by Markov random fields (MRFs). However, the MRFs can be computationally prohibitive when beta strands are interleaved in complex topologies. We introduce SMURFLite, a method that combines both simplified MRFs and simulated evolution to substantially improve remote homology detection for beta structures. Unlike previous MRF-based methods, SMURFLite is computationally feasible on any beta-structural motif. RESULTS: We test SMURFLite on all propeller and barrel folds in the mainly-beta class of the SCOP hierarchy in stringent cross-validation experiments. We show a mean 26% (median 16%) improvement in area under curve (AUC) for beta-structural motif recognition as compared with HMMER (a well-known HMM method) and a mean 33% (median 19%) improvement as compared with RAPTOR (a well-known threading method) and even a mean 18% (median 10%) improvement in AUC over HHpred (a profile-profile HMM method), despite HHpred's use of extensive additional training data. We demonstrate SMURFLite's ability to scale to whole genomes by running a SMURFLite library of 207 beta-structural SCOP superfamilies against the entire genome of Thermotoga maritima, and make over 100 new fold predictions. Availability and implementation: A webserver that runs SMURFLite is available at: http://smurf.cs.tufts.edu/smurflite/
    Bioinformatics 03/2012; 28(9):1216-22. · 5.47 Impact Factor
  • Article: MetaMerge: scaling up genome-scale metabolic reconstructions with application to Mycobacterium tuberculosis.
    ABSTRACT: Reconstructed models of metabolic networks are widely used for studying metabolism in various organisms. Many different reconstructions of the same organism often exist concurrently, forcing researchers to choose one of them at the exclusion of the others. We describe MetaMerge, an algorithm for semi-automatically reconciling a pair of existing metabolic network reconstructions into a single metabolic network model. We use MetaMerge to combine two published metabolic networks for Mycobacterium tuberculosis into a single network, which allows many reactions that could not be active in the individual models to become active, and predicts essential genes with a higher positive predictive value.
    Genome biology 01/2012; 13(1):r6. · 6.63 Impact Factor
  • Article: Efficient traversal of beta-sheet protein folding pathways using ensemble models.
    ABSTRACT: Molecular dynamics (MD) simulations can now predict ms-timescale folding processes of small proteins; however, this presently requires hundreds of thousands of CPU hours and is primarily applicable to short peptides with few long-range interactions. Larger and slower-folding proteins, such as many with extended β-sheet structure, would require orders of magnitude more time and computing resources. Furthermore, when the objective is to determine only which folding events are necessary and limiting, atomistic detail MD simulations can prove unnecessary. Here, we introduce the program tFolder as an efficient method for modelling the folding process of large β-sheet proteins using sequence data alone. To do so, we extend existing ensemble β-sheet prediction techniques, which permitted only a fixed anti-parallel β-barrel shape, with a method that predicts arbitrary β-strand/β-strand orientations and strand-order permutations. By accounting for all partial and final structural states, we can then model the transition from random coil to native state as a Markov process, using a master equation to simulate population dynamics of folding over time. Thus, all putative folding pathways can be energetically scored, including which transitions present the greatest barriers. Since correct folding pathway prediction is likely determined by the accuracy of contact prediction, we demonstrate the accuracy of tFolder to be comparable with state-of-the-art methods designed specifically for the contact prediction problem alone. We validate our method for dynamics prediction by applying it to the folding pathway of the well-studied Protein G. With relatively very little computation time, tFolder is able to reveal critical features of the folding pathways which were only previously observed through time-consuming MD simulations and experimental studies. 
    Such a result greatly expands the number of proteins whose folding pathways can be studied, while the algorithmic integration of ensemble prediction with Markovian dynamics can be applied to many other problems.
    Journal of computational biology: a journal of computational molecular cell biology 09/2011; 18(11):1635-47. · 1.69 Impact Factor
  • Article: STITCHER: Dynamic assembly of likely amyloid and prion β-structures from secondary structure predictions.
    ABSTRACT: The supersecondary structures of amyloids and prions, proteins of intense clinical and biological interest, are difficult to determine by standard experimental or computational means. In addition, significant conformational heterogeneity is known or suspected to exist in many amyloid fibrils. Previous work has demonstrated that probability-based prediction of discrete β-strand pairs can offer insight into these structures. Here, we devise a system of energetic rules that can be used to dynamically assemble these discrete β-strand pairs into complete amyloid β-structures. The STITCHER algorithm progressively 'stitches' strand-pairs into full β-sheets based on a novel free-energy model, incorporating experimentally observed amino-acid side-chain stacking contributions, entropic estimates, and steric restrictions for amyloidal parallel β-sheet construction. A dynamic program computes the top 50 structures and returns both the highest scoring structure and a consensus structure taken by polling this list for common discrete elements. Putative structural heterogeneity can be inferred from sequence regions that compose poorly. Predictions show agreement with experimental models of Alzheimer's amyloid beta peptide and the Podospora anserina Het-s prion. Predictions of the HET-s homolog HET-S also reflect experimental observations of poor amyloid formation. We put forward predicted structures for the yeast prion Sup35, suggesting N-terminal structural stability enabled by tyrosine ladders, and C-terminal heterogeneity. Predictions for the Rnq1 prion and alpha-synuclein are also given, identifying a similar mix of homogenous and heterogeneous secondary structure elements. STITCHER provides novel insight into the energetic basis of amyloid structure, provides accurate structure predictions, and can help guide future experimental studies.
    Proteins Structure Function and Bioinformatics 09/2011; · 3.39 Impact Factor
  • Article: An integrative approach to ortholog prediction for disease-focused and other functional studies.
    ABSTRACT: Mapping of orthologous genes among species serves an important role in functional genomics by allowing researchers to develop hypotheses about gene function in one species based on what is known about the functions of orthologs in other species. Several tools for predicting orthologous gene relationships are available. However, these tools can give different results and identification of predicted orthologs is not always straightforward. We report a simple but effective tool, the Drosophila RNAi Screening Center Integrative Ortholog Prediction Tool (DIOPT; http://www.flyrnai.org/diopt), for rapid identification of orthologs. DIOPT integrates existing approaches, facilitating rapid identification of orthologs among human, mouse, zebrafish, C. elegans, Drosophila, and S. cerevisiae. As compared to individual tools, DIOPT shows increased sensitivity with only a modest decrease in specificity. Moreover, the flexibility built into the DIOPT graphical user interface allows researchers with different goals to appropriately 'cast a wide net' or limit results to highest confidence predictions. DIOPT also displays protein and domain alignments, including percent amino acid identity, for predicted ortholog pairs. This helps users identify the most appropriate matches among multiple possible orthologs. To facilitate using model organisms for functional analysis of human disease-associated genes, we used DIOPT to predict high-confidence orthologs of disease genes in Online Mendelian Inheritance in Man (OMIM) and genes in genome-wide association study (GWAS) data sets. The results are accessible through the DIOPT diseases and traits query tool (DIOPT-DIST; http://www.flyrnai.org/diopt-dist). DIOPT and DIOPT-DIST are useful resources for researchers working with model organisms, especially those who are interested in exploiting model organisms such as Drosophila to study the functions of human disease genes.
    BMC Bioinformatics 08/2011; 12:357. · 2.75 Impact Factor
  • Article: Opposing effects of glutamine and asparagine govern prion formation by intrinsically disordered proteins.
    ABSTRACT: Sequences rich in glutamine (Q) and asparagine (N) residues often fail to fold at the monomer level. This, coupled to their unusual hydrogen-bonding abilities, provides the driving force to switch between disordered monomers and amyloids. Such transitions govern processes as diverse as human protein-folding diseases, bacterial biofilm assembly, and the inheritance of yeast prions (protein-based genetic elements). A systematic survey of prion-forming domains suggested that Q and N residues have distinct effects on amyloid formation. Here, we use cell biological, biochemical, and computational techniques to compare Q/N-rich protein variants, replacing Ns with Qs and Qs with Ns. We find that the two residues have strong and opposing effects: N richness promotes assembly of benign self-templating amyloids; Q richness promotes formation of toxic nonamyloid conformers. Molecular simulations focusing on intrinsic folding differences between Qs and Ns suggest that their different behaviors are due to the enhanced turn-forming propensity of Ns over Qs.
    Molecular cell 07/2011; 43(1):72-84. · 14.61 Impact Factor
  • Article: A method for probing the mutational landscape of amyloid structure.
    ABSTRACT: Proteins of all kinds can self-assemble into highly ordered β-sheet aggregates known as amyloid fibrils, important both biologically and clinically. However, the specific molecular structure of a fibril can vary dramatically depending on sequence and environmental conditions, and mutations can drastically alter amyloid function and pathogenicity. Experimental structure determination has proven extremely difficult with only a handful of NMR-based models proposed, suggesting a need for computational methods. We present AmyloidMutants, a statistical mechanics approach for de novo prediction and analysis of wild-type and mutant amyloid structures. Based on the premise of protein mutational landscapes, AmyloidMutants energetically quantifies the effects of sequence mutation on fibril conformation and stability. Tested on non-mutant, full-length amyloid structures with known chemical shift data, AmyloidMutants offers roughly 2-fold improvement in prediction accuracy over existing tools. Moreover, AmyloidMutants is the only method to predict complete super-secondary structures, enabling accurate discrimination of topologically dissimilar amyloid conformations that correspond to the same sequence locations. Applied to mutant prediction, AmyloidMutants identifies a global conformational switch between Aβ and its highly-toxic 'Iowa' mutant in agreement with a recent experimental model based on partial chemical shift data. Predictions on mutant, yeast-toxic strains of HET-s suggest similar alternate folds. When applied to HET-s and a HET-s mutant with core asparagines replaced by glutamines (both highly amyloidogenic chemically similar residues abundant in many amyloids), AmyloidMutants surprisingly predicts a greatly reduced capacity of the glutamine mutant to form amyloid. We confirm this finding by conducting mutagenesis experiments. Our tool is publicly available on the web at http://amyloid.csail.mit.edu/.
    Bioinformatics 07/2011; 27(13):i34-42. · 5.47 Impact Factor
  • Article: Unusually effective microRNA targeting within repeat-rich coding regions of mammalian mRNAs.
    ABSTRACT: MicroRNAs (miRNAs) regulate numerous biological processes by base-pairing with target messenger RNAs (mRNAs), primarily through sites in 3' untranslated regions (UTRs), to direct the repression of these targets. Although miRNAs have sometimes been observed to target genes through sites in open reading frames (ORFs), large-scale studies have shown such targeting to be generally less effective than 3' UTR targeting. Here, we show that several miRNAs each target significant groups of genes through multiple sites within their coding regions. This ORF targeting, which mediates both predictable and effective repression, arises from highly repeated sequences containing miRNA target sites. We show that such sequence repeats largely arise through evolutionary duplications and occur particularly frequently within families of paralogous C2H2 zinc-finger genes, suggesting the potential for their coordinated regulation. Examples of ORFs targeted by miR-181 include both the well-known tumor suppressor RB1 and RBAK, encoding a C2H2 zinc-finger protein and transcriptional binding partner of RB1. Our results indicate a function for repeat-rich coding sequences in mediating post-transcriptional regulation and reveal circumstances in which miRNA-mediated repression through ORF sites can be reliably predicted.
    Genome Research 06/2011; 21(9):1395-403. · 13.61 Impact Factor
  • Article: Structure-based prediction reveals capping motifs that inhibit β-helix aggregation.
    ABSTRACT: The parallel β-helix is a geometrically regular fold commonly found in the proteomes of bacteria, viruses, fungi, archaea, and some vertebrates. β-helix structure has been observed in monomeric units of some aggregated amyloid fibers. In contrast, soluble β-helices, both right- and left-handed, are usually "capped" on each end by one or more secondary structures. Here, an in-depth classification of the diverse range of β-helix cap structures reveals subtle commonalities in structural components and in interactions with the β-helix core. Based on these uncovered commonalities, a toolkit of automated predictors was developed for the two distinct types of cap structures. In vitro deletion of the toolkit-predicted C-terminal cap from the pertactin β-helix resulted in increased aggregation and the formation of soluble oligomeric species. These results suggest that β-helix cap motifs can prevent specific, β-sheet-mediated oligomeric interactions, similar to those observed in amyloid formation.
    Proceedings of the National Academy of Sciences 06/2011; 108(27):11099-104. · 9.68 Impact Factor
  • Chapter: Efficient Traversal of Beta-Sheet Protein Folding Pathways Using Ensemble Models
    ABSTRACT: Molecular Dynamics (MD) simulations can now predict ms-timescale folding processes of small proteins — however, this presently requires hundreds of thousands of CPU hours and is primarily applicable to short peptides with few long-range interactions. Larger and slower-folding proteins, such as many with extended β-sheet structure, would require orders of magnitude more time and computing resources. Furthermore, when the objective is to determine only which folding events are necessary and limiting, atomistic detail MD simulations can prove unnecessary. Here, we introduce the program tFolder as an efficient method for modelling the folding process of large β-sheet proteins using sequence data alone. To do so, we extend existing ensemble β-sheet prediction techniques, which permitted only a fixed anti-parallel β-barrel shape, with a method that predicts arbitrary β-strand/β-strand orientations and strand-order permutations. By accounting for all partial and final structural states, we can then model the transition from random coil to native state as a Markov process, using a master equation to simulate population dynamics of folding over time. Thus, all putative folding pathways can be energetically scored, including which transitions present the greatest barriers. Since correct folding pathway prediction is likely determined by the accuracy of contact prediction, we demonstrate the accuracy of tFolder to be comparable with state-of-the-art methods designed specifically for the contact prediction problem alone. We validate our method for dynamics prediction by applying it to the folding pathway of the well-studied Protein G. With relatively very little computation time, tFolder is able to reveal critical features of the folding pathways which were only previously observed through time-consuming MD simulations and experimental studies. 
    Such a result greatly expands the number of proteins whose folding pathways can be studied, while the algorithmic integration of ensemble prediction with Markovian dynamics can be applied to many other problems.
    03/2011: pages 408-423;
  • Article: iWRAP: An interface threading approach with application to prediction of cancer-related protein-protein interactions.
    ABSTRACT: Current homology modeling methods for predicting protein-protein interactions (PPIs) have difficulty in the "twilight zone" (<40%) of sequence identities. Threading methods extend coverage further into the twilight zone by aligning primary sequences for a pair of proteins to a best-fit template complex to predict an entire three-dimensional structure. We introduce a threading approach, iWRAP, which focuses only on the protein interface. Our approach combines a novel linear programming formulation for interface alignment with a boosting classifier for interaction prediction. We demonstrate its efficacy on SCOPPI, a classification of PPIs in the Protein Databank, and on the entire yeast genome. iWRAP provides significantly improved prediction of PPIs and their interfaces in stringent cross-validation on SCOPPI. Furthermore, by combining our predictions with a full-complex threader, we achieve a coverage of 13% for the yeast PPIs, which is close to a 50% increase over previous methods at a higher sensitivity. As an application, we effectively combine iWRAP with genomic data to identify novel cancer-related genes involved in chromatin remodeling, nucleosome organization, and ribonuclear complex assembly. iWRAP is available at http://iwrap.csail.mit.edu.
    Journal of Molecular Biology 02/2011; 405(5):1295-310. · 4.00 Impact Factor
  • Article: IsoBase: a database of functionally related proteins across PPI networks.
    ABSTRACT: We describe IsoBase, a database identifying functionally related proteins across five major eukaryotic model organisms: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus and Homo sapiens. Nearly all existing algorithms for orthology detection are based on sequence comparison. Although these have been successful in orthology prediction to some extent, we seek to go beyond these methods by the integration of sequence data and protein-protein interaction (PPI) networks to help in identifying true functionally related proteins. With that motivation, we introduce IsoBase, the first publicly available ortholog database that focuses on functionally related proteins. The groupings were computed using the IsoRankN algorithm that uses spectral methods to combine sequence and PPI data and produce clusters of functionally related proteins. These clusters compare favorably with those from existing approaches: proteins within an IsoBase cluster are more likely to share similar Gene Ontology (GO) annotation. A total of 48,120 proteins were clustered into 12,693 functionally related groups. The IsoBase database may be browsed for functionally related proteins across two or more species and may also be queried by accession numbers, species-specific identifiers, gene name or keyword. The database is freely available for download at http://isobase.csail.mit.edu/.
    Nucleic Acids Research 01/2011; 39(Database issue):D295-300. · 8.03 Impact Factor

Institutions

  • 2013
    • Massachusetts General Hospital
      Boston, MA, USA
  • 1999–2013
    • Massachusetts Institute of Technology
      • Department of Mathematics
      • Computer Science and Artificial Intelligence Laboratory
      • Laboratory for Computer Science
      Cambridge, MA, USA
  • 2011–2012
    • McGill University
      • McGill Centre for Bioinformatics
      Montréal, Quebec, Canada
  • 2002–2010
    • Tufts University
      Boston, MA, USA
  • 2009
    • National Taiwan University
      • Department of Computer Science and Information Engineering
      Taipei, Taipei, Taiwan
  • 2008–2009
    • Boston Children's Hospital
      Boston, MA, USA
  • 2007
    • Toyota Technological Institute at Chicago
      Chicago, IL, USA
    • Distributed Artificial Intelligence Laboratory
      Berlin, Land Berlin, Germany