Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
Nucleic Acids Research (Impact Factor: 9.11). 02/2006; 34(3):1066-80. DOI: 10.1093/nar/gkj494
Source: PubMed


We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that
the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part
of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families
with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely
species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the
number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic
sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences,
we show that the structural genomics initiatives should aim to provide structures for fewer than a thousand structurally uncharacterized
Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures
should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how
function evolves.



Available from: Corin Yeats, Oct 13, 2015
    • "The concept of orphan genes was first described by Fischer and Eisenberg in 1999 from studies of microbial genomes (Fischer and Eisenberg, 1999). Although many have predicted that genes considered species specific would later turn out to be an artifact of sparse genome sequence, this has proved not to be the case (Arendsee et al., 2014; Gollery et al., 2006, 2007; Marsden et al., 2006; Neme and Tautz, 2013; Silveira et al., 2013; Tautz and Domazet-Loso, 2011). Orphan genes appear to be present in all species, and represent a significant fraction (approximately 0.5% to >8%) of analysed eukaryotic and prokaryotic genomes. "
    ABSTRACT: The genome of each species contains as many as 8% of genes that are uniquely present in that species. Little is known about the functional significance of these so-called species specific or orphan genes. The Arabidopsis thaliana gene Qua-Quine Starch (QQS) is species specific. Here, we show that altering QQS expression in Arabidopsis affects carbon partitioning to both starch and protein. We hypothesized QQS may be conserved in a feature other than primary sequence, and as such could function to impact composition in another species. To test the potential of QQS in affecting composition in an ectopic species, we introduced QQS into soybean. Soybean T1 lines expressing QQS have up to 80% decreased leaf starch and up to 60% increased leaf protein; T4 generation seeds from field-grown plants contain up to 13% less oil, while protein is increased by up to 18%. These data broaden the concept of QQS as a modulator of carbon and nitrogen allocation, and demonstrate that this species-specific gene can affect the seed composition of an agronomic species thought to have diverged from Arabidopsis 100 million years ago.
    Plant Biotechnology Journal 08/2014; DOI:10.1111/pbi.12238 · 5.75 Impact Factor
    • "Marsden et al. [13] analyzed 203 complete genomes in the Gene3D resource [14] to provide new insights into protein family space. The number of protein families was found to be continually expanding with time but a significant proportion of the proteomes could be assigned to relatively few large, well-characterized domain families while the vast majority of domain families were relatively rare and often species specific. "
    ABSTRACT: The Midwest Center for Structural Genomics (MCSG) is one of the large-scale centres of the Protein Structure Initiative (PSI). During the first two phases of the PSI the MCSG has solved over a thousand protein structures. A criticism of structural genomics is that target selection strategies mean that some structures are solved without having a known function and thus are of little biomedical significance. Structures of unknown function have stimulated the development of methods for function prediction from structure. We show that the MCSG has met the stated goals of the PSI and use online resources and readily available function prediction methods to provide functional annotations for more than 90% of the MCSG structures. The structure-to-function prediction method ProFunc provides likely functions for many of the MCSG structures that cannot be annotated by sequence-based methods. Although the focus of the PSI was structural coverage, many of the structures solved by the MCSG can also be associated with functional classes and biological roles of possible biomedical value.
    BMC Structural Biology 01/2011; 11:2. DOI:10.1186/1472-6807-11-2 · 1.18 Impact Factor
    • "It is believed that evolution tends to conserve functions primarily on the preservation of the 3D structure rather than primary structure. A 3D alignment between structural relatives, even (or mainly) comprising a small number of residues within a protein active site, can be a powerful method to infer function [33]. "
    ABSTRACT: Trypanosoma cruzi is the etiological agent of Chagas' disease, an endemic infection that causes thousands of deaths every year in Latin America. Therapeutic options remain inefficient, demanding the search for new drugs and/or new molecular targets. Such efforts can focus on proteins that are specific to the parasite, but analogous enzymes and enzymes with a three-dimensional (3D) structure sufficiently different from the corresponding host proteins may represent equally interesting targets. In order to find these targets we used the workflows MHOLline and AnEnΠ obtaining 3D models from homologous, analogous and specific proteins of Trypanosoma cruzi versus Homo sapiens. We applied genome wide comparative modelling techniques to obtain 3D models for 3,286 predicted proteins of T. cruzi. In combination with comparative genome analysis to Homo sapiens, we were able to identify a subset of 397 enzyme sequences, of which 356 are homologous, 3 analogous and 38 specific to the parasite. In this work, we present a set of 397 enzyme models of T. cruzi that can constitute potential structure-based drug targets to be investigated for the development of new strategies to fight Chagas' disease. The strategies presented here support the concept of structural analysis in conjunction with protein functional analysis as an interesting computational methodology to detect potential targets for structure-based rational drug design. For example, 2,4-dienoyl-CoA reductase (EC and triacylglycerol lipase (EC, classified as analogous proteins in relation to H. sapiens enzymes, were identified as new potential molecular targets.
    BMC Genomics 10/2010; 11(1):610. DOI:10.1186/1471-2164-11-610 · 3.99 Impact Factor