Large-scale RT-PCR recovery of full-length cDNA clones

Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.
BioTechniques (Impact Factor: 2.95). 05/2004; 36(4):690-6, 698-700.
Source: PubMed


Pseudogenes, alternative transcripts, noncoding RNA, and polymorphisms each add extensive complexity to the mammalian transcriptome and confound estimation of the total number of genes. Despite advanced algorithms for gene prediction and several large-scale efforts to obtain cDNA clones for all human open reading frames (ORFs), no single collection is complete. To enhance this effort, we have developed a high-throughput pipeline for reverse transcription PCR (RT-PCR) gene recovery. Most importantly, we have developed and optimized novel molecular strategies for improving the RT-PCR yield of transcripts that have been difficult to isolate by other means, along with computational strategies for clone sequence validation. This systematic gene recovery pipeline both allows rescue of predicted human and rat genes and provides insight into the complexity of the transcriptome through comparisons with existing data sets.


Available from: David Steffen
    • "The longest isoform generally was assigned for PCR rescue. Full descriptions of the PCR rescue protocols used by each center have been published (Baross et al. 2004; Wu et al. 2004). Both groups designed PCR primers flanking the target CDS, including varying amounts of UTR sequence, and RT-PCR was performed on RNA pooled from multiple tissues."
    ABSTRACT: Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide.
    Full-text · Article · Sep 2009 · Genome Research
    • "In response to declining yields from methods based on random expressed sequence tag (EST) sequencing (Gerhard et al. 2004), several years ago the MGC adopted a more directed strategy, by which candidate genes not in the collection were amplified by RT–PCR, then were cloned and validated by full-length sequencing (Baross et al. 2004; Wu et al. 2004a). A component of this strategy was to use ab initio computational gene prediction to identify candidates missing from catalogs of known genes and poorly supported by ESTs, yet still detectable from subtle signatures in the genome sequence."
    ABSTRACT: A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human-genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they suggest that hundreds, not thousands, of protein-coding genes are completely missing from the current gene catalogs.
    Full-text · Article · Dec 2007 · Genome Research
    ABSTRACT: The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5'-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
    Full-text · Article · Nov 2004 · Genome Research