Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function.

Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre, Madrid, Spain.
Molecular Biology and Evolution (Impact Factor: 14.31). 03/2012; 29(9):2265-83. DOI: 10.1093/molbev/mss100
Source: PubMed

ABSTRACT Advances in high-throughput mass spectrometry are making proteomics an increasingly important tool in genome annotation projects. Peptides detected in mass spectrometry experiments can be used to validate gene models and verify the translation of putative coding sequences (CDSs). Here, we have identified peptides that cover 35% of the genes annotated by the GENCODE consortium for the human genome as part of a comprehensive analysis of experimental spectra from two large publicly available mass spectrometry databases. We detected the translation to protein of "novel" and "putative" protein-coding transcripts as well as transcripts annotated as pseudogenes and nonsense-mediated decay targets. We provide a detailed overview of the population of alternatively spliced protein isoforms that are detectable by peptide identification methods. We found that 150 genes expressed multiple alternative protein isoforms. This constitutes the largest set of reliably confirmed alternatively spliced proteins yet discovered. Three groups of genes were highly overrepresented. We detected alternative isoforms for 10 of the 25 possible heterogeneous nuclear ribonucleoproteins, proteins with a key role in the splicing process. Alternative isoforms generated from interchangeable homologous exons and from short indels were also significantly enriched, both in human experiments and in parallel analyses of mouse and Drosophila proteomics experiments. Our results show that a surprisingly high proportion (almost 25%) of the detected alternative isoforms are only subtly different from their constitutive counterparts. Many of the alternative splicing events that give rise to these alternative isoforms are conserved in mouse. It was striking that very few of these conserved splicing events broke Pfam functional domains or would damage globular protein structures. 
This evidence of a strong bias toward subtle differences in CDS and likely conserved cellular function and structure is remarkable and strongly suggests that the translation of alternative transcripts may be subject to selective constraints.

  • ABSTRACT: Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we mapped peptides detected in 7 large-scale proteomics studies to almost 60% of the protein coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for more than 96% of genes that evolved before bilateria. At the opposite end of the scale we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2,001 potentially non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes, and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
    Human Molecular Genetics 06/2014; DOI:10.1093/hmg/ddu309 · 6.68 Impact Factor
  • ABSTRACT: The last decade has seen tremendous effort committed to the annotation of the human genome sequence, most notably perhaps in the form of the ENCODE project. One of the major findings of ENCODE, and other genome analysis projects, is that the human transcriptome is far larger and more complex than previously thought. This complexity manifests, for example, as alternative splicing within protein-coding genes, as well as in the discovery of thousands of long noncoding RNAs. It is also possible that significant numbers of human transcripts have not yet been described by annotation projects, while existing transcript models are frequently incomplete. The question as to what proportion of this complexity is truly functional remains open, however, and this ambiguity presents a serious challenge to genome scientists. In this article, we will discuss the current state of human transcriptome annotation, drawing on our experience gained in generating the GENCODE gene annotation set. We highlight the gaps in our knowledge of transcript functionality that remain, and consider the potential computational and experimental strategies that can be used to help close them. We propose that an understanding of the true overlap between transcriptional complexity and functionality will not be gained in the short term. However, significant steps toward obtaining this knowledge can now be taken by using an integrated strategy, combining all of the experimental resources at our disposal.
    Genome Research 10/2013; DOI:10.1101/gr.161315.113 · 13.85 Impact Factor
  • ABSTRACT: Trypanosoma cruzi, the causative agent of Chagas disease, is extremely resistant to ionizing radiation, enduring up to 1.5 kGy of gamma rays. Ionizing radiation can damage the DNA molecule both directly, resulting in double-strand breaks, and indirectly, as a consequence of reactive oxygen species production. After a dose of 500 Gy of gamma rays, the parasite genome is fragmented, but the chromosomal bands are restored within 48 hours. Under such conditions, cell growth arrests for up to 120 hours and the parasites resume normal growth after this period. To better understand the parasite response to ionizing radiation, we analyzed the proteome of irradiated (4, 24, and 96 hours after irradiation) and non-irradiated T. cruzi using two-dimensional differential gel electrophoresis followed by mass spectrometry for protein identification. A total of 543 spots were found to be differentially expressed, from which 215 were identified. These identified protein spots represent different isoforms of only 53 proteins. We observed a tendency for overexpression of proteins with molecular weights below predicted, indicating that these may be processed, yielding shorter polypeptides. The presence of shorter protein isoforms after irradiation suggests the occurrence of post-translational modifications and/or processing in response to gamma radiation stress. Our results also indicate that active translation is essential for the recovery of parasites from ionizing radiation damage. This study therefore reveals the peculiar response of T. cruzi to ionizing radiation, raising questions about how this organism can change its protein expression to survive such a harmful stress.
    PLoS ONE 05/2014; 9(5):e97526. DOI:10.1371/journal.pone.0097526 · 3.53 Impact Factor

