Article
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Next-generation sequencing technologies, coupled with advances in mass spectrometry-based proteomics, have facilitated system-wide quantitative profiling of expressed mRNA transcripts and proteins. Proteo-transcriptomic analysis compares the relative abundance levels of transcripts and their corresponding proteins, illuminating discordant gene product responses to perturbations. These results reveal potential post-transcriptional regulation, providing researchers with important new insights into underlying biological and pathological disease mechanisms. To carry out proteo-transcriptomic analysis, researchers require software that statistically determines transcript-protein abundance correlation levels and provides results visualization and interpretation functionality, ideally within a flexible, user-friendly platform. As a solution, we have developed the QuanTP software within the Galaxy platform. The software offers a suite of tools and functionalities critical for proteo-transcriptomics, including statistical algorithms for assessing correlation between single transcript-protein pairs as well as across two cohorts, outlier identification and clustering, along with a diverse set of results visualizations. It is compatible with analyses of results from single experiment data or from two-cohort comparison of aggregated replicate experiments. The tool is available in the Galaxy Tool Shed, through a cloud-based instance and a Docker container. In all, QuanTP provides an accessible and effective software resource, which should enable new multi-omic discoveries from quantitative proteo-transcriptomic datasets.
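To make the kind of analysis described in the abstract concrete, the following minimal Python sketch correlates transcript and protein log2 fold changes and flags discordant genes. The abstract does not state which statistical algorithms QuanTP uses, so the Pearson coefficient, the least-squares fit, the standardized-residual cutoff of 2, and all gene names and values below are assumptions chosen for illustration only.

```python
import math
from typing import List, Tuple


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_discordant(genes: List[str], x: List[float], y: List[float],
                    cutoff: float = 2.0) -> List[str]:
    """Return genes whose residual from the fitted line exceeds
    `cutoff` residual standard deviations (an assumed, simple rule)."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return [g for g, r in zip(genes, residuals) if abs(r) / sd > cutoff]


# Hypothetical log2 fold changes (treated vs. control) for ten genes.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E",
         "GENE_F", "GENE_G", "GENE_H", "GENE_I", "GENE_J"]
transcript_lfc = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 1.0]
protein_lfc = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -3.0]

r = pearson_r(transcript_lfc, protein_lfc)
discordant = flag_discordant(genes, transcript_lfc, protein_lfc)
print(f"Pearson r = {r:.3f}; discordant genes: {discordant}")
```

In this toy data set the transcript of GENE_J rises while its protein falls, which is the sort of discordant response that may point to post-transcriptional regulation. A real analysis would use replicate-aware statistics and a more robust outlier measure.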


... Its amenability to integration of disparate software in a unified, user-friendly environment, along with a variety of useful features including complex workflow creation, provenance tracking, and reproducibility, addresses the challenges of proteogenomics. As part of our work developing Galaxy for proteomics (Galaxy-P [16]), we have focused on putting in place a number of tools for the various steps necessary for proteogenomics, from raw data processing and sequence database generation [9,11,12,17] to tools for interpreting the potential impact of identified sequence variants [18] and mechanisms of regulation indicated by RNA-protein response [19]. Others have also contributed to this growing community of proteogenomic researchers using Galaxy to address their data analysis and informatics needs [11][12][13][14][15]. ...
Article
Full-text available
Background Proteogenomics integrates genomics, transcriptomics, and mass spectrometry (MS)-based proteomics data to identify novel protein sequences arising from gene and transcript sequence variants. Proteogenomic data analysis requires integration of disparate ‘omic software tools, as well as customized tools to view and interpret results. The flexible Galaxy platform has proven valuable for proteogenomic data analysis. Here, we describe a novel Multi-omics Visualization Platform (MVP) for organizing, visualizing, and exploring proteogenomic results, adding a critically needed tool for data exploration and interpretation. Findings MVP is built as an HTML Galaxy plug-in, primarily based on JavaScript. Via the Galaxy API, MVP uses SQLite databases as input—a custom data type (mzSQLite) containing MS-based peptide identification information, a variant annotation table, and a coding sequence table. Users can interactively filter identified peptides based on sequence and data quality metrics, view annotated peptide MS data, and visualize protein-level information, along with genomic coordinates. Peptides that pass the user-defined thresholds can be sent back to Galaxy via the API for further analysis; processed data and visualizations can also be saved and shared. MVP leverages the Integrated Genomics Viewer JavaScript framework, enabling interactive visualization of peptides and corresponding transcript and genomic coding information within the MVP interface. Conclusions MVP provides a powerful, extensible platform for automated, interactive visualization of proteogenomic results within the Galaxy environment, adding a unique and critically needed tool for empowering exploration and interpretation of results. The platform is extensible, providing a basis for further development of new functionalities for proteogenomic data visualization.
... Its amenability to integration of disparate software in a unified, user-friendly environment, along with a variety of useful features including complex workflow creation, provenance tracking and reproducibility, addresses the challenges of proteogenomics. As part of our work developing Galaxy for proteomics (Galaxy-P [16]), we have focused on putting in place a number of tools for the various steps necessary for proteogenomics, from raw data processing and sequence database generation [9,11,12,17] to tools for interpreting the potential impact of identified sequence variants [18] and mechanisms of regulation indicated by RNA-protein response [19]. Others have also contributed to this growing community of proteogenomic researchers utilizing Galaxy to address their data analysis and informatics needs [11][12][13][14][15]. ...
Preprint
Background: Proteogenomics integrates genomics, transcriptomics and mass spectrometry (MS)-based proteomics data to identify novel protein sequences arising from gene and transcript sequence variants. Proteogenomic data analysis requires integration of disparate 'omic software tools, as well as customized tools to view and interpret results. The flexible Galaxy platform has proven valuable for proteogenomic data analysis. Here, we describe a novel Multi-omics Visualization Platform (MVP) for organizing, visualizing and exploring proteogenomic results, adding a critically needed tool for data exploration and interpretation. Findings: MVP is built as an HTML Galaxy plugin, primarily based on JavaScript. Via the Galaxy API, MVP uses SQLite databases as input: a custom datatype (mzSQLite) containing MS-based peptide identification information, a variant annotation table, and a coding sequence table. Users can interactively filter identified peptides based on sequence and data quality metrics, view annotated peptide MS data, and visualize protein-level information, along with genomic coordinates. Peptides that pass the user-defined thresholds can be sent back to Galaxy via the API for further analysis; processed data and visualizations can also be saved and shared. MVP leverages the Integrated Genomics Viewer JavaScript (IGVjs) framework, enabling interactive visualization of peptides and corresponding transcript and genomic coding region information within the MVP interface. Conclusions: MVP provides a powerful, extensible platform for automated, interactive visualization of proteogenomics results within the Galaxy environment, adding a unique and critically needed tool for empowering exploration and interpretation of results. The platform is extensible, providing a basis for further development of new functionalities for proteogenomic data visualization.
Article
Introduction: Mass spectrometry-based proteomics reveals dynamic molecular signatures underlying phenotypes reflecting normal and perturbed conditions in living systems. Although valuable on its own, the proteome is only one level of molecular information, with the genome, epigenome, transcriptome, and metabolome all providing complementary information. Multi-omic analysis integrating information from one or more of these other domains with proteomic information provides a more complete picture of molecular contributors to dynamic biological systems. Areas covered: Here, we discuss the improvements to mass spectrometry-based technologies, focused on peptide-based, bottom-up approaches, that have enabled deep, quantitative characterization of complex proteomes. These advances are facilitating integration of proteomics data with other 'omic information, providing a more complete picture of living systems. We also describe the current state of bioinformatics software and approaches for integrating proteomic and other 'omics data, critical for enabling new discoveries driven by multi-omics. Expert commentary: Multi-omics, centered on the integration of proteomics information with other 'omic information, has tremendous promise for biological and biomedical studies. Continued advances in approaches for generating deep, reliable proteomic data and bioinformatics tools aimed at integrating data across 'omic domains will ensure the discoveries offered by these multi-omic studies continue to increase.
Article
There is increasing pressure to develop alternative ecotoxicological risk assessment approaches that do not rely on expensive, time-consuming, and ethically questionable live animal testing. This study aimed to develop a comprehensive early life stage toxicity pathway model for the exposure of fish to estrogenic chemicals that is rooted in mechanistic toxicology. Embryo-larval fathead minnows (FHM; Pimephales promelas) were exposed to graded concentrations of 17α-ethinylestradiol (water control, 0.01% DMSO, 4, 20, and 100 ng/L) for 32 days. Fish were assessed for transcriptomic and proteomic responses at 4 days post-hatch (dph), and for histological and apical end points at 28 dph. Molecular analyses revealed core responses that were indicative of observed apical outcomes, including biological processes resulting in overproduction of vitellogenin and impairment of visual development. Histological observations indicated accumulation of proteinaceous fluid in liver and kidney tissues, energy depletion, and delayed or suppressed gonad development. Additionally, fish in the 100 ng/L treatment group were smaller than controls. Integration of omics data improved the interpretation of perturbations in early life stage FHM, providing evidence of conservation of toxicity pathways across levels of biological organization. Overall, the mechanism-based embryo-larval FHM model showed promise as a replacement for standard adult live animal tests.
Article
metaQuantome is a software suite that enables the quantitative analysis, statistical evaluation, and visualization of mass-spectrometry-based metaproteomics data. In the latest update of this software, we have provided several extensions, including a step-by-step training guide, the ability to perform statistical analysis on samples from multiple conditions, and a comparative analysis of metatranscriptomics data. The training module, accessed via the Galaxy Training Network, will help users to use the suite effectively for both functional and taxonomic analysis. We extend the ability of metaQuantome to now perform multi-data-point quantitative and statistical analyses so that studies with measurements across multiple conditions, such as time-course studies, can be analyzed. With an eye on the multiomics analysis of microbial communities, we have also initiated the use of metaQuantome statistical and visualization tools on outputs from metatranscriptomics data, which complements the metagenomic and metaproteomic analyses already available. For this, we have developed a tool named MT2MQ ("metatranscriptomics to metaQuantome"), which takes in outputs from the ASaiM metatranscriptomics workflow and transforms them so that the data can be used as an input for comparative statistical analysis and visualization via metaQuantome. We believe that these improvements to metaQuantome will facilitate the use of the software for quantitative metaproteomics and metatranscriptomics and will enable multipoint data analysis. These improvements will take us a step toward integrative multiomic microbiome analysis so as to understand dynamic taxonomic and functional responses of these complex systems in a variety of biological contexts. The updated metaQuantome and MT2MQ are open-source software and are available via the Galaxy Toolshed and GitHub.
Article
Full-text available
Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.
Article
Full-text available
Activation of phosphoinositide 3-kinase (PI3K) signaling is frequently observed in triple-negative breast cancer (TNBC), yet PI3K inhibitors have shown limited clinical activity. To investigate intrinsic and adaptive mechanisms of resistance, we analyzed a panel of patient-derived xenograft models of TNBC with varying responsiveness to buparlisib, a pan-PI3K inhibitor. In a subset of patient-derived xenografts, resistance was associated with incomplete inhibition of PI3K signaling and upregulated MAPK/MEK signaling in response to buparlisib. Outlier phosphoproteome and kinome analyses identified novel candidates functionally important to buparlisib resistance, including NEK9 and MAP2K4. Knockdown of NEK9 or MAP2K4 reduced both baseline and feedback MAPK/MEK signaling and showed synthetic lethality with buparlisib in vitro. A complex in/del frameshift in PIK3CA decreased sensitivity to buparlisib via NEK9/MAP2K4-dependent mechanisms. In summary, our study supports a role for NEK9 and MAP2K4 in mediating buparlisib resistance and demonstrates the value of unbiased omic analyses in uncovering resistance mechanisms to targeted therapy.
Article
Full-text available
The impact of microbial communities, also known as the microbiome, on human health and the environment is receiving increased attention. Studying translated gene products (proteins) and comparing metaproteomic profiles may elucidate how microbiomes respond to specific environmental stimuli, and interact with host organisms. Characterizing proteins expressed by a complex microbiome and interpreting their functional signature requires sophisticated informatics tools and workflows tailored to metaproteomics. Additionally, there is a need to disseminate these informatics resources to researchers undertaking metaproteomic studies, who could use them to make new and important discoveries in microbiome research. The Galaxy for proteomics platform (Galaxy-P) offers an open source, web-based bioinformatics platform for disseminating metaproteomics software and workflows. Within this platform, we have developed easily accessible and documented metaproteomic software tools and workflows aimed at training researchers in their operation and disseminating the tools for more widespread use. The modular workflows encompass the core requirements of metaproteomic informatics: (a) database generation; (b) peptide spectral matching; (c) taxonomic analysis and (d) functional analysis. Much of the software available via the Galaxy-P platform was selected, packaged and deployed through an online metaproteomics “Contribution Fest” undertaken by a unique consortium of expert software developers and users from the metaproteomics research community, who have co-authored this manuscript. These resources are documented on GitHub and freely available through the Galaxy Toolshed, as well as a publicly accessible metaproteomics gateway Galaxy instance. These documented workflows are well suited for the training of novice metaproteomics researchers, through online resources such as the Galaxy Training Network, as well as hands-on training workshops.
Here, we describe the metaproteomics tools available within these Galaxy-based resources, as well as the process by which they were selected and implemented in our community-based work. We hope this description will increase access to and utilization of metaproteomics tools, as well as offer a framework for continued community-based development and dissemination of cutting edge metaproteomics software.
Article
Full-text available
Proteogenomics has emerged as a valuable approach in cancer research, which integrates genomic and transcriptomic data with mass spectrometry-based proteomics data to directly identify expressed, variant protein sequences that may have functional roles in cancer. This approach is computationally intensive, requiring integration of disparate software tools into sophisticated workflows, challenging its adoption by nonexpert, bench scientists. To address this need, we have developed an extensible, Galaxy-based resource aimed at providing more researchers access to, and training in, proteogenomic informatics. Our resource brings together software from several leading research groups to address two foundational aspects of proteogenomics: (i) generation of customized, annotated protein sequence databases from RNA-Seq data; and (ii) accurate matching of tandem mass spectrometry data to putative variants, followed by filtering to confirm their novelty. Directions for accessing software tools and workflows, along with instructional documentation, can be found at z.umn.edu/canresgithub. Cancer Res; 77(21); e43-46. ©2017 AACR.
Article
Full-text available
An important issue for molecular biology is to establish whether transcript levels of a given gene can be used as proxies for the corresponding protein levels. Here, we have developed a targeted proteomics approach for a set of human non‐secreted proteins based on parallel reaction monitoring to measure, at steady‐state conditions, absolute protein copy numbers across human tissues and cell lines, and compared these levels with the corresponding mRNA levels using transcriptomics. The study shows that the transcript and protein levels do not correlate well unless a gene‐specific RNA‐to‐protein (RTP) conversion factor independent of the tissue type is introduced, thus significantly enhancing the predictability of protein copy numbers from RNA levels. The results show that the RTP ratio varies widely, from a few hundred protein copies per mRNA molecule for some genes to several hundred thousand protein copies per mRNA molecule for others. In conclusion, our data suggest that transcriptome analysis can be used as a tool to predict the protein copy numbers per cell, thus forming an attractive link between the fields of genomics and proteomics.
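The idea of a gene-specific, tissue-independent RTP conversion factor can be sketched in a few lines of Python. The abstract does not say how the factor is estimated, so taking the median protein-to-mRNA ratio across tissues is an assumption made here for illustration, and the function names and all numbers are hypothetical.

```python
from statistics import median
from typing import List


def estimate_rtp(mrna_levels: List[float], protein_copies: List[float]) -> float:
    """Estimate a gene-specific RTP factor as the median protein/mRNA
    ratio across tissues (assumed estimator, not from the source)."""
    ratios = [p / m for m, p in zip(mrna_levels, protein_copies) if m > 0]
    return median(ratios)


def predict_protein_copies(mrna_level: float, rtp: float) -> float:
    """Predict protein copy number from an mRNA level and the gene's RTP."""
    return mrna_level * rtp


# Hypothetical measurements for one gene in three tissues.
mrna_per_tissue = [10.0, 20.0, 40.0]             # mRNA molecules per cell
protein_per_tissue = [5000.0, 10000.0, 20000.0]  # protein copies per cell

rtp = estimate_rtp(mrna_per_tissue, protein_per_tissue)
predicted = predict_protein_copies(30.0, rtp)    # a new tissue with 30 mRNAs
print(f"RTP = {rtp:.0f} copies per mRNA; predicted copies = {predicted:.0f}")
```

The point of the sketch is that one factor per gene is reused across tissues; a different gene would get its own, possibly very different, factor.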
Article
Full-text available
Discovering the gene expression signature associated with a cellular state is one of the basic quests in the majority of biological studies. For most clinical and cellular manifestations, these molecular differences may be exhibited across multiple layers of gene regulation, such as genomic variations, gene expression, protein translation, and post-translational modifications (PTMs). These system-wide variations are dynamic in nature and their crosstalk is overwhelmingly complex, so analyzing them separately may not be very informative. This necessitates the integrative analysis of such multiple layers of information to understand the interplay of the individual components of the biological system. Recent developments in high-throughput RNA sequencing (RNA-seq) and mass spectrometric (MS) technologies to probe transcripts and proteins have made these the preferred methods for understanding global gene regulation. Subsequently, improvements in “big-data” analysis techniques have enabled novel conclusions to be drawn from integrative transcriptomic-proteomic analysis. The unified analysis of both these data types has been rewarding for several biological objectives, such as improving genome annotation, predicting RNA-protein quantities, deciphering gene regulation, and discovering disease markers and drug targets. There are different ways in which transcriptomics and proteomics data can be integrated, each aiming at different research objectives. Here, we review various studies, approaches and computational tools targeted for integrative analysis of these two high-throughput omics methods.
Article
Full-text available
The genome-wide transcriptome profiling of cancerous and normal tissue samples can provide insights into the molecular mechanisms of cancer initiation and progression. RNA Sequencing (RNA-Seq) is a revolutionary tool that has been used extensively in cancer research. However, no existing RNA-Seq database provides all of the following features: (i) large-scale and comprehensive data archives and analyses, including coding-transcript profiling, long non-coding RNA (lncRNA) profiling and coexpression networks; (ii) phenotype-oriented data organization and searching and (iii) the visualization of expression profiles, differential expression and regulatory networks. We have constructed the first public database that meets these criteria, the Cancer RNA-Seq Nexus (CRN, http://syslab4.nchu.edu.tw/CRN). CRN has a user-friendly web interface designed to facilitate cancer research and personalized medicine. It is an open resource for intuitive data exploration, providing coding-transcript/lncRNA expression profiles to support researchers generating new hypotheses in cancer research and personalized medicine.
Article
Full-text available
Differential mRNA expression studies implicitly assume that changes in mRNA expression have biological meaning, most likely mediated by corresponding changes in protein levels. Yet studies into mRNA-protein correspondence have shown notoriously poor correlation between mRNA and protein expression levels, creating concern for inferences from only mRNA expression data. However, none of these studies have examined in particular differentially expressed mRNA. Here, we examined this question in an ovarian cancer xenograft model. We measured protein and mRNA expression for twenty-nine genes in four drug-treatment conditions and in untreated controls. We identified mRNAs differentially expressed between drug-treated xenografts and controls, then analysed mRNA-protein expression correlation across a five-point time-course within each of the four experimental conditions. We evaluated correlations between mRNAs and their protein products for mRNAs differentially expressed within an experimental condition compared to those that are not. We found that differentially expressed mRNAs correlate significantly better with their protein product than non-differentially expressed mRNAs. This result increases confidence for the use of differential mRNA expression for biological discovery in this system, as well as providing optimism for the usefulness of inferences from mRNA expression in general.
Article
Full-text available
Protein expression is regulated by the production and degradation of messenger RNAs (mRNAs) and proteins, but their specific relationships remain unknown. We combine measurements of protein production and degradation and mRNA dynamics so as to build a quantitative genomic model of the differential regulation of gene expression in lipopolysaccharide-stimulated mouse dendritic cells. Changes in mRNA abundance play a dominant role in determining most dynamic fold changes in protein levels. Conversely, the preexisting proteome of proteins performing basic cellular functions is remodeled primarily through changes in protein production or degradation, accounting for more than half of the absolute change in protein molecules in the cell. Thus, the proteome is regulated by transcriptional induction for newly activated cellular functions and by protein life-cycle changes for remodeling of preexisting functions. Copyright © 2015, American Association for the Advancement of Science.
Article
Full-text available
An increasingly common method for predicting gene activity is genome-wide chromatin immuno-precipitation of 'active' chromatin modifications followed by massively parallel sequencing (ChIP-seq). In order to understand better the relationship between developmentally regulated chromatin landscapes and regulation of early B cell development, we determined how differentially active promoter regions were able to predict relative RNA and protein levels at the pre-pro-B and pro-B stages. Herein, we describe a novel ChIP-seq quantification method (cRPKM) to identify active promoters and a multi-omics approach that compares promoter chromatin status with ongoing active transcription (GRO-seq), steady state mRNA (RNA-seq), inferred mRNA stability, and relative proteome abundance measurements (iTRAQ). We demonstrate that active chromatin modifications at promoters are good indicators of transcription and steady state mRNA levels. Moreover, we found that promoters with active chromatin modifications exclusively in one of these cell states frequently predicted the differential abundance of proteins. However, we found that many genes whose promoters have non-differential but active chromatin modifications also displayed changes in abundance of their cognate proteins. As expected, this large class of developmentally and differentially regulated proteins that was uncoupled from chromatin status used mostly post-transcriptional mechanisms. Strikingly, the most differentially abundant protein in our B-cell development system, 2410004B18Rik, was regulated by a post-transcriptional mechanism, which further analyses indicated was mediated by a micro-RNA. These data highlight how this integrated multi-omics data set can be a useful resource in uncovering regulatory mechanisms. This data can be accessed at: https://usegalaxy.org/u/thereddylab/p/prediction-of-gene-activity-based-on-an-integrative-multi-omics-analysis.
Article
Full-text available
Peptide identification via tandem mass spectrometry sequence database searching is a key method in the array of tools available to the proteomics researcher. The ability to rapidly and sensitively acquire tandem mass spectrometry data and perform peptide and protein identifications has become a commonly used proteomics analysis technique because of advances in both instrumentation and software. Although many different tandem mass spectrometry database search tools are currently available from both academic and commercial sources, these algorithms share similar core elements while maintaining distinctive features. This review revisits the mechanism of sequence database searching and discusses how various parameter settings impact the underlying search.
Article
Full-text available
For a number of years the BioMart data warehousing system has proven to be a valuable resource for scientists seeking a fast and versatile means of accessing the growing volume of genomic data provided by the Ensembl project. The launch of the Ensembl Genomes project in 2009 complemented the Ensembl project by utilizing the same visualization, interactive and programming tools to provide users with a means for accessing genome data from a further five domains: protists, bacteria, metazoa, plants and fungi. The Ensembl and Ensembl Genomes BioMarts provide a point of access to the high-quality gene annotation, variation data, functional and regulatory annotation and evolutionary relationships from genomes spanning the taxonomic space. This article aims to give a comprehensive overview of the Ensembl and Ensembl Genomes BioMarts as well as some useful examples and a description of current data content and future objectives. Database URLs: http://www.ensembl.org/biomart/martview/; http://metazoa.ensembl.org/biomart/martview/; http://plants.ensembl.org/biomart/martview/; http://protists.ensembl.org/biomart/martview/; http://fungi.ensembl.org/biomart/martview/; http://bacteria.ensembl.org/biomart/martview/
Article
Full-text available
Cellular states are determined by differential expression of the cell's proteins. The relationship between protein and mRNA expression levels informs about the combined outcomes of translation and protein degradation which are, in addition to transcription and mRNA stability, essential contributors to gene expression regulation. This review summarizes the state of knowledge about large-scale measurements of absolute protein and mRNA expression levels, and the degree of correlation between the two parameters. We summarize the information that can be derived from comparison of protein and mRNA expression levels and discuss how corresponding sequence characteristics suggest modes of regulation.
Article
Full-text available
Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically. Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed. Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at http://deweylab.biostat.wisc.edu/rsem. Contact: cdewey@biostat.wisc.edu Supplementary information: Supplementary data are available at Bioinformatics on
Article
Full-text available
The correlation between mRNA and protein abundances in the cell has been reported to be notoriously poor. Recent technological advances in the quantitative analysis of mRNA and protein species in complex samples allow the detailed analysis of this pathway at the center of biological systems. We give an overview of available methods for the identification and quantification of free and ribosome-bound mRNA, protein abundances and individual protein turnover rates. We review available literature on the correlation of mRNA and protein abundances and discuss biological and technical parameters influencing the correlation of these central biological molecules.
Article
Full-text available
We have determined the relationship between mRNA and protein expression levels for selected genes expressed in the yeast Saccharomyces cerevisiae growing at mid-log phase. The proteins contained in total yeast cell lysate were separated by high-resolution two-dimensional (2D) gel electrophoresis. Over 150 protein spots were excised and identified by capillary liquid chromatography-tandem mass spectrometry (LC-MS/MS). Protein spots were quantified by metabolic labeling and scintillation counting. Corresponding mRNA levels were calculated from serial analysis of gene expression (SAGE) frequency tables (V. E. Velculescu, L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett, Jr., P. Hieter, B. Vogelstein, and K. W. Kinzler, Cell 88:243–251, 1997). We found that the correlation between mRNA and protein levels was insufficient to predict protein expression levels from quantitative mRNA data. Indeed, for some genes, while the mRNA levels were of the same value the protein levels varied by more than 20-fold. Conversely, invariant steady-state levels of certain proteins were observed with respective mRNA transcript levels that varied by as much as 30-fold. Another interesting observation is that codon bias is not a predictor of either protein or mRNA levels. Our results clearly delineate the technical boundaries of current approaches for quantitative analysis of protein expression and reveal that simple deduction from mRNA transcript analysis is insufficient.
Article
The relationship between gene expression measured at the mRNA level and the corresponding protein level is not well characterized in human cancer. In this study, we compared mRNA and protein expression for a cohort of genes in the same lung adenocarcinomas. The abundance of 165 protein spots representing 98 individual genes was analyzed in 76 lung adenocarcinomas and nine non-neoplastic lung tissues using two-dimensional polyacrylamide gel electrophoresis. Specific polypeptides were identified using matrix-assisted laser desorption/ionization mass spectrometry. For the same 85 samples, mRNA levels were determined using oligonucleotide microarrays, allowing a comparative analysis of mRNA and protein expression among the 165 protein spots. Twenty-eight of the 165 protein spots (17%) or 21 of 98 genes (21.4%) had a statistically significant correlation between protein and mRNA expression (r > 0.2445; p < 0.05); however, among all 165 proteins the correlation coefficient values (r) ranged from -0.467 to 0.442. Correlation coefficient values were not related to protein abundance. Further, no significant correlation between mRNA and protein expression was found (r = -0.025) if the average levels of mRNA or protein among all samples were applied across the 165 protein spots (98 genes). The mRNA/protein correlation coefficient also varied among proteins with multiple isoforms, indicating potentially separate isoform-specific mechanisms for the regulation of protein abundance. Among the 21 genes with a significant correlation between mRNA and protein, five genes differed significantly between stage I and stage III lung adenocarcinomas. Using a quantitative analysis of mRNA and protein expression within the same lung adenocarcinomas, we showed that only a subset of the proteins exhibited a significant correlation with mRNA abundance.
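The per-gene comparison described above can be illustrated with a minimal sketch, assuming each gene has one mRNA value and one protein value per sample. The gene names and numbers below are hypothetical toy data, not values from the study, and the functions are illustrative rather than the authors' actual procedure. The sketch computes a Pearson correlation coefficient per gene and the usual t statistic (n - 2 degrees of freedom) for testing whether the correlation differs from zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two sequences of equal length >= 3")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical toy data: abundance per gene across five samples.
mrna = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0],
}
protein = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0, 10.0],
    "GENE_B": [5.0, 3.0, 4.0, 1.0, 2.0],
}

# One correlation coefficient per gene, computed across samples.
per_gene_r = {g: pearson_r(mrna[g], protein[g]) for g in mrna}

for gene, r in per_gene_r.items():
    print(gene, round(r, 3), t_statistic(r, len(mrna[gene])))
```

With real data, the t statistic would be compared against a t distribution to obtain a p-value; that step is omitted here to keep the sketch free of external dependencies.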
Article
DNA microarrays have been widely applied to cancer transcriptome analysis; however, the majority of such data are not easily accessible or comparable. Furthermore, several important analytic approaches have been applied to microarray analysis; however, their application is often limited. To overcome these limitations, we have developed Oncomine, a bioinformatics initiative aimed at collecting, standardizing, analyzing, and delivering cancer transcriptome data to the biomedical research community. Our analysis has identified the genes, pathways, and networks deregulated across 18,000 cancer gene expression microarrays, spanning the majority of cancer types and subtypes. Here, we provide an update on the initiative, describe the database and analysis modules, and highlight several notable observations. Results from this comprehensive analysis are available at http://www.oncomine.org.
Article
We have mapped and quantified mouse transcriptomes by deeply sequencing them and recording how frequently each gene is represented in the sequence sample (RNA-Seq). This provides a digital measure of the presence and prevalence of transcripts from known and previously unknown genes. We report reference measurements composed of 41-52 million mapped 25-base-pair reads for poly(A)-selected RNA from adult mouse brain, liver and skeletal muscle tissues. We used RNA standards to quantify transcript prevalence and to test the linear range of transcript detection, which spanned five orders of magnitude. Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3' untranscribed regions, as well as new candidate microRNA precursors. RNA splice events, which are not readily measured by standard gene expression microarray or serial analysis of gene expression methods, were detected directly by mapping splice-crossing sequence reads. We observed 1.45 × 10^5 distinct splices, and alternative splices were prominent, with 3,500 different genes expressing one or more alternate internal splices.
Article
The chromosome-centric human proteome project (C-HPP) seeks to comprehensively characterize all protein products coded by the genome, including those expressed sequence variants confirmed via proteogenomics methods. The closely related biology and disease human proteome project (B/D-HPP) seeks to understand the biological and pathological associations of expressed protein products, especially those carrying sequence variants which may be drivers of disease. To achieve these objectives, informatics tools are required that interpret potential functional or disease implications of variant protein sequence detected via proteogenomics. Towards this end, we have developed an automated workflow within the Galaxy for proteomics (Galaxy-P) platform which leverages the Cancer-Related Analysis of Variants Toolkit (CRAVAT) and makes it interoperable with proteogenomic results. Protein sequence variants confirmed by proteogenomics are assessed for potential structure-function effects, as well as associations with cancer using CRAVAT’s rich suite of functionalities, including visualization of results directly within the Galaxy user interface. We demonstrate the effectiveness of this workflow on proteogenomic results generated from an MCF7 breast cancer cell line. Our free and open software should enable improved interpretation of functional and pathological effects of protein sequence variants detected via proteogenomics, acting as a bridge between the C-HPP and B/D-HPP.
Article
Computational proteomics is the data science concerned with the identification and quantification of proteins from high-throughput data and the biological interpretation of their concentration changes, posttranslational modifications, interactions, and subcellular localizations. Today, these data most often originate from mass spectrometry–based shotgun proteomics experiments. In this review, we survey computational methods for the analysis of such proteomics data, focusing on the explanation of the key concepts. Starting with mass spectrometric feature detection, we then cover methods for the identification of peptides. Subsequently, protein inference and the control of false discovery rates are highly important topics covered. We then discuss methods for the quantification of peptides and proteins. A section on downstream data analysis covers exploratory statistics, network analysis, machine learning, and multiomics data integration. Finally, we discuss current developments and provide an outlook on what the near future of computational proteomics might bear. Expected final online publication date for the Annual Review of Biomedical Data Science Volume 1 is July 20, 2018. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Article
Clustered heatmaps are the most frequently used graphics for visualization of molecular profiling data in biology. However, they are generally rendered as static, or only modestly interactive, images. We have now used recent advances in web technologies to produce interactive "next-generation" clustered heatmaps (NG-CHM) that enable extreme zooming and navigation without loss of resolution. NG-CHMs also provide link-outs to additional information sources and include other features that facilitate deep exploration of the biology behind the image. Here, we describe an implementation of the NG-CHM system in the Galaxy bioinformatics platform. We illustrate the algorithm and available computational tool using RNA-seq data from The Cancer Genome Atlas program's Kidney Clear Cell Carcinoma project. Cancer Res; 77(21); e23-26. ©2017 AACR.
Article
MaxQuant is one of the most frequently used platforms for mass-spectrometry (MS)-based proteomics data analysis. Since its first release in 2008, it has grown substantially in functionality and can be used in conjunction with more MS platforms. Here we present an updated protocol covering the most important basic computational workflows, including those designed for quantitative label-free proteomics, MS1-level labeling and isobaric labeling techniques. This protocol presents a complete description of the parameters used in MaxQuant, as well as of the configuration options of its integrated search engine, Andromeda. This protocol update describes an adaptation of an existing protocol that substantially modifies the technique. Important concepts of shotgun proteomics and their implementation in MaxQuant are briefly reviewed, including different quantification strategies and the control of false-discovery rates (FDRs), as well as the analysis of post-translational modifications (PTMs). The MaxQuant output tables, which contain information about quantification of proteins and PTMs, are explained in detail. Furthermore, we provide a short version of the workflow that is applicable to data sets with simple and standard experimental designs. The MaxQuant algorithms are efficiently parallelized on multiple processors and scale well from desktop computers to servers with many cores. The software is written in C# and is freely available at http://www.maxquant.org.
Article
The question of how genomic information is expressed to determine phenotypes is of central importance for basic and translational life science research and has been studied by transcriptomic and proteomic profiling. Here, we review the relationship between protein and mRNA levels under various scenarios, such as steady state, long-term state changes, and short-term adaptation, demonstrating the complexity of gene expression regulation, especially during dynamic transitions. The spatial and temporal variations of mRNAs, as well as the local availability of resources for protein biosynthesis, strongly influence the relationship between protein levels and their coding transcripts. We further discuss the buffering of mRNA fluctuations at the level of protein concentrations. We conclude that transcript levels by themselves are not sufficient to predict protein levels in many scenarios and to thus explain genotype-phenotype relationships and that high-quality data quantifying different levels of gene expression are indispensable for the complete understanding of biological processes.
Article
Summary: Unipept is an open source web application that is designed for metaproteomics analysis with a focus on interactive data visualization. It is underpinned by a fast index built from UniProtKB and the NCBI taxonomy that enables quick retrieval of all UniProt entries in which a given tryptic peptide occurs. Unipept version 2.4 introduced web services that provide programmatic access to the metaproteomics analysis features. This enables integration of Unipept functionality in custom applications and data processing pipelines. Availability and implementation: The web services are freely available at http://api.unipept.ugent.be and are open sourced under the MIT license. Contact: Unipept{at}ugent.be Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Transcriptomic, proteomic, and metabolomic measurements are revolutionizing the way we model and predict cellular behavior, and multi-omic comparisons are being published with increased regularity. Some have expected a trivial and predictable correlation between mRNA and protein; however, the manifest complexity of biological regulation suggests a more nuanced relationship. Indeed, observing this lack of strict correlation provides clues for new research topics, and has the potential for transformative biological insight. Copyright © 2014. Published by Elsevier Ltd.
Article
Proteogenomics combines large-scale genomic and transcriptomic data with mass spectrometry-based proteomic data to discover novel protein sequence variants and improve genome annotation. In contrast to conventional proteomic applications, proteogenomic analysis requires a number of additional data processing steps. Ideally, these required steps would be integrated and automated via a single software platform offering accessibility for wet-bench researchers as well as flexibility for user-specific customization and integration of new software tools as they emerge. Towards this end, we have extended the Galaxy bioinformatics framework to facilitate proteogenomic analysis. Using analysis of whole human saliva as an example, we demonstrate Galaxy's flexibility through the creation of a modular workflow incorporating both established and customized software tools that improve depth and quality of proteogenomic results. Our customized Galaxy-based software includes automated, batch-mode BLASTP searching and a Peptide Sequence Match Evaluator tool, both useful for evaluating the veracity of putative novel peptide identifications. Our complex workflow (approximately 140 steps) can be easily shared using built-in Galaxy functions, enabling their use and customization by others. Our results provide a blueprint for the establishment of the Galaxy framework as an ideal solution for the emerging field of proteogenomics.
Article
Zebrafish is a popular system for studying vertebrate development and disease, and shows high genetic conservation with humans. Molecular level studies at different stages of development are essential for understanding the mechanisms deployed during ontogeny. Here, we performed a comparative analysis of the whole proteome and transcriptome of early-stage (24 hours post fertilization) zebrafish embryos. We identified 8363 proteins with their approximate cellular abundances (the largest number of zebrafish embryo proteins quantified thus far) through a combination of thorough deyolking and extensive fractionation procedures, before resolving the peptides by mass spectrometry. We performed deep sequencing of the transcripts and found that the expressed proteome and transcriptome displayed a moderate correlation for the majority of cellular processes. Integrative functional mapping of the quantified genes demonstrated that embryonic developmental systems differentially exploit transcriptional and post-transcriptional regulatory mechanisms to modulate protein amounts. Using network mapping of the low-abundance proteins, we identified various signal transduction pathways important in embryonic development and also revealed genes that may be regulated at the post-transcriptional level. Our data set represents a deep coverage of the functional proteome and transcriptome of the developing zebrafish, and our findings unveil molecular regulatory mechanisms that underlie embryonic development.
Article
High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.
Article
RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.
Article
The annotation of the human genome indicates the surprisingly low number of approximately 40,000 genes. However, the estimated number of proteins encoded by these genes is two to three orders of magnitude higher. The ability to unambiguously identify the proteins is a prerequisite for their functional investigation. As proteins derived from the same gene can be largely identical, and might differ only in small but functionally relevant details, protein identification tools must not only identify a large number of proteins but also be able to differentiate between close relatives. This information can be generated by mass spectrometry, an approach that identifies proteins by partial analysis of their digestion-derived peptides. Information gleaned from databases fills in the missing sequence information. Because both sequence databases and experimental data are limited, a certain ambiguity often remains concerning which sequence variant(s) and modification(s) are present. As the common denominator of all the isoforms is a gene, in our opinion, it would be more accurate to state that a product of this particular gene rather than a certain protein has been identified by mass spectrometry.
Article
The aim of this study was to assess the behavior of the matrix metalloproteinases (MMPs) 2 and 9 and the tissue inhibitor of metalloproteinases 1 (TIMP-1) in human prostate cancer. mRNA and protein expression patterns of MMP-2, MMP-9, and TIMP-1 were studied in cancerous and noncancerous parts of 17 prostates removed by radical prostatectomy. Competitive RT-PCR, gelatin-substrate zymography, and ELISA techniques were used for quantification. On the mRNA level, MMP-2 expression was decreased, whereas MMP-9, TIMP-1, and the ratios of MMP-2 and MMP-9 to TIMP-1 were unchanged in cancerous tissue compared to the normal counterparts. On the protein level, expression of MMP-9 was significantly higher and TIMP-1 expression was significantly lower, while MMP-2 was unchanged and the ratios of MMP-2 and MMP-9 to TIMP-1 were increased in tumor tissue. The higher concentration of MMP-9 as well as the increased ratios of MMP-2 and MMP-9 to TIMP-1 in malignant tissue prove the proteolytic imbalance in prostate cancer, which does not seem to be associated with the stage and grade of the tumor. Comparison of mRNA and protein expression of MMP-2, MMP-9 and TIMP-1, respectively, did not show any significant relationships, illustrating the necessity to study these components at both molecular levels.
Article
This paper develops a new procedure, called stability analysis, for K-means clustering. Instead of ignoring local optima and only considering the best solution found, this procedure takes advantage of additional information from a K-means cluster analysis. The information from the locally optimal solutions is collected in an object by object co-occurrence matrix. The co-occurrence matrix is clustered and subsequently reordered by a steepest ascent quadratic assignment procedure to aid visual interpretation of the multidimensional cluster structure. Subsequently, measures are developed to determine the overall structure of a data set, the number of clusters and the multidimensional relationships between the clusters.
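The co-occurrence step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a simple one-dimensional Lloyd-style K-means written for this example, the function names `kmeans_labels` and `cooccurrence` are hypothetical, and the subsequent clustering and reordering of the matrix by steepest ascent quadratic assignment is not shown. Each entry of the matrix counts how many of the randomly started runs placed two objects in the same cluster.

```python
import random

def kmeans_labels(points, k, rng, max_iter=100):
    """One K-means run on 1-D points from a random start; returns cluster labels."""
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: (p - centers[c]) ** 2) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments stopped changing: a local optimum
        labels = new_labels
    return labels

def cooccurrence(points, k, n_runs, seed=0):
    """Count, over n_runs random restarts, how often each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    co = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    return co

# Hypothetical toy data: two well-separated groups on a line.
toy_points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
toy_runs = 20
toy_co = cooccurrence(toy_points, k=2, n_runs=toy_runs)

for row in toy_co:
    print(row)
```

Pairs with counts close to the number of runs are grouped together consistently across local optima, while intermediate counts flag objects whose assignment depends on the starting configuration.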
Enhancing the Multi-Omics Visualization Platform (MVP) Plug-in for Galaxy-Based Applications
  • T McGowan