Gene inactivation and its implications for annotation in the era of personal genomics

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
Genes & Development (Impact Factor: 12.64). 01/2011; 25(1):1-10. DOI: 10.1101/gad.1968411
Source: PubMed

ABSTRACT The first wave of personal genomes documents how no single individual genome contains the full complement of functional genes. Here, we describe the extent of variation in gene and pseudogene numbers between individuals arising from inactivation events such as premature termination or aberrant splicing due to single-nucleotide polymorphisms. This highlights the inadequacy of the current reference sequence and gene set. We present a proposal to define a reference gene set that will remain stable as more individuals are sequenced. In particular, we recommend that the ancestral allele be used to define the reference sequence from which a core human reference gene annotation set can be derived. In addition, we call for the development of an expanded gene set to include human-specific genes that have arisen recently and are absent from the ancestral set.

    • "Despite this, the consistency of the patterns we find with their use across all functional categories is very unlikely to represent an artifact induced by a drastic misclassification of the functional impact of the variants. Second, although we avoided complete reliance on a human reference genome for studying differences in variant types and rates across populations by determining ancestral and derived alleles, many functional elements and functional prediction algorithms rely on the available reference genome (e.g., defining TFBS and predicting TFBS disrupting variants) and hence may not be adequately evaluating the functionality of specific variants due to structural and sequence context differences that exist between individuals (Balasubramanian et al., 2011). However, as SIFT and Polyphen2 exploit cross-species nucleotide conservation information, we are confident that our analyses of coding variants is not affected as much by this issue. "
    ABSTRACT: There have been a number of recent successes in the use of whole genome sequencing and sophisticated bioinformatics techniques to identify pathogenic DNA sequence variants responsible for individual idiopathic congenital conditions. However, the success of this identification process is heavily influenced by the ancestry or genetic background of a patient with an idiopathic condition. This is so because potential pathogenic variants in a patient's genome must be contrasted with variants in a reference set of genomes made up of other individuals' genomes of the same ancestry as the patient. We explored the effect of ignoring the ancestries of both an individual patient and the individuals used to construct reference genomes. We pursued this exploration in two major steps. We first considered variation in the per-genome number and rates of likely functional derived (i.e., non-ancestral, based on the chimp genome) single nucleotide variants and small indels in 52 individual whole human genomes sampled from 10 different global populations. We took advantage of a suite of computational and bioinformatics techniques to predict the functional effect of over 24 million genomic variants, both coding and non-coding, across these genomes. We found that the typical human genome harbors ∼5.5-6.1 million total derived variants, of which ∼12,000 are likely to have a functional effect (∼5000 coding and ∼7000 non-coding). We also found that the rates of functional genotypes per the total number of genotypes in individual whole genomes differ dramatically between human populations. We then created tables showing how the use of comparator or reference genome panels comprised of genomes from individuals that do not have the same ancestral background as a patient can negatively impact pathogenic variant identification. Our results have important implications for clinical sequencing initiatives.
    Frontiers in Genetics 11/2012; 3:211. DOI:10.3389/fgene.2012.00211
    • "Groups like the Genome Reference Consortium [40] and the 1KGP are developing a more complete and accurate reference. Other groups are proposing the use of specialized family references [41], or an ancestral allele reference [42]. However, new challenges are created with updating research tools and results to reflect the changes. "
    ABSTRACT: Data from the 1000 genomes project (1KGP) and Complete Genomics (CG) have dramatically increased the numbers of known genetic variants and challenge several assumptions about the reference genome and its uses in both clinical and research settings. Specifically, 34% of published array-based GWAS studies for a variety of diseases utilize probes that overlap unanticipated single nucleotide polymorphisms (SNPs), indels, or structural variants. Linkage disequilibrium (LD) block length depends on the numbers of markers used, and the mean LD block size decreases from 16 kb to 7 kb, when HapMap-based calculations are compared to blocks computed from 1KGP data. Additionally, when 1KGP and CG variants are compared, 19% of the single nucleotide variants (SNVs) reported from common genomes are unique to one dataset; likely a result of differences in data collection methodology, alignment of reads to the reference genome, and variant-calling algorithms. Together these observations indicate that current research resources and informatics methods do not adequately account for the high level of variation that already exists in the human population and significant efforts are needed to create resources that can accurately assess personal genomics for health, disease, and predict treatment outcomes.
    PLoS ONE 07/2012; 7(7):e40294. DOI:10.1371/journal.pone.0040294 · 3.23 Impact Factor
  • ABSTRACT: With the advances in molecular genetics, animal models of human diseases are becoming more numerous and more refined every year. Despite this, one must recognize that they generally do not faithfully and comprehensively mimic the homologous human disease. Faced with these imperfections, some geneticists believe that these models are of little value, while for others, on the contrary, they are important tools. We agree with this second statement, and in this review, we examine the reasons that may explain the observed differences and suggest means to circumvent or even exploit them. Our opinion is that animal models should be regarded more as tools capable of answering specific questions rather than mere replicas, at a smaller scale, of a given human disease. Far from disappointing they are probably called for a promising future.
    MGG Molecular & General Genetics 05/2011; 286(1):1-20. DOI:10.1007/s00438-011-0627-y · 2.83 Impact Factor