Pseudofam: The pseudogene families database

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.
Nucleic Acids Research (Impact Factor: 9.11). 11/2008; 37(Database issue):D738-43. DOI: 10.1093/nar/gkn758
Source: PubMed

ABSTRACT Pseudofam is a database of pseudogene families based on the protein families from the Pfam database. It provides resources for analyzing the family structure of pseudogenes, including query tools, statistical summaries and sequence alignments. The current version of Pseudofam contains more than 125,000 pseudogenes identified from 10 eukaryotic genomes and aligned within nearly 3000 families (approximately one-third of the total families in PfamA). Pseudofam uses a large-scale parallelized homology search algorithm (implemented as an extension of the PseudoPipe pipeline) to identify pseudogenes. Each identified pseudogene is assigned to its parent protein family, and the pseudogenes in a family are then aligned to one another by transferring the parent domain alignments from the Pfam family. Pseudogenes are also given additional annotation based on an ontology, reflecting their mode of creation and subsequent history. In particular, our annotation highlights the association of pseudogene families with genomic features, such as segmental duplications. In addition, pseudogene families are associated with key statistics, which identify outlier families with an unusual degree of pseudogenization. The statistics also show how the number of genes and pseudogenes in families correlates across different species. Overall, they highlight the fact that housekeeping families tend to be enriched with a large number of pseudogenes.

  • Source
    • "Additionally, pseudogene transcripts corresponding to CALM2 (calmodulin 2 phosphorylase kinase, delta), TOMM40 (translocase of outer mitochondrial membrane 40), NONO (non-POU domain-containing, octamer-binding), DUSP8 (dual-specificity phosphatase 8), PERP (TP53 apoptosis effector), and YES (v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1), etc. were observed in more than 50 samples each, which were further validated by pseudogene-specific RT-PCR followed by Sanger sequencing (Table S4). Further, because our RNA-Seq compendium comprises 35- to 45-mer short sequence reads that largely generated short sequence clusters not optimal for available pseudogene analysis tools such as Pseudopipe (Zhang et al., 2006) and Pseudofam (Lam et al., 2009) used in generating ENCODE and Yale databases, we carried out a direct query of individual clusters against the human genome (hg18) using the BLAT tool from UCSC, which is ideally suited for short sequence alignment searches (Kent, 2002). Based on this 'custom' analysis, or simply BLAT (Figure S2A), we were able to independently assign 1,888 clusters representing 1,820 unique pseudogenes to unique genomic locations."
    ABSTRACT: Pseudogene transcripts can provide a novel tier of gene regulation through generation of endogenous siRNAs or miRNA-binding sites. Characterization of pseudogene expression, however, has remained confined to anecdotal observations due to analytical challenges posed by the extremely close sequence similarity with their counterpart coding genes. Here, we describe a systematic analysis of pseudogene "transcription" from an RNA-Seq resource of 293 samples, representing 13 cancer and normal tissue types, and observe a surprisingly prevalent, genome-wide expression of pseudogenes that could be categorized as ubiquitously expressed or lineage and/or cancer specific. Further, we explore disease subtype specificity and functions of selected expressed pseudogenes. Taken together, we provide evidence that transcribed pseudogenes are a significant contributor to the transcriptional landscape of cells and are positioned to play significant roles in cellular differentiation and cancer progression, especially in light of the recently described ceRNA networks. Our work provides a transcriptome resource that enables high-throughput analyses of pseudogene expression.
    Cell 06/2012; 149(7):1622-34. DOI:10.1016/j.cell.2012.04.041 · 33.12 Impact Factor
  • Source
    • "Although the function of most ncRNAs is unknown, some have been implicated in the regulation of disease, stress conditions, imprinting, gene silencing and enhancer regulation (Costa, 2007; Orom & Shiekhattar, 2011). Pseudogenes that lack protein-coding potential can be considered to be a type of ncRNA (Lam et al., 2009). Pseudogenes show sequence similarity to some functional parental genes (Muro et al., 2011)."
    ABSTRACT: We identified a predicted compact cysteine-rich sequence in the honey bee genome that we called 'Raalin'. Raalin transcripts are enriched in the brain of adult honey bee workers and drones, with only minimum expression in other tissues or in pre-adult stages. Open-reading frame (ORF) homologues of Raalin were identified in the transcriptomes of fruit flies, mosquitoes and moths. The Raalin-like gene from Drosophila melanogaster encodes for a short secreted protein that is maximally expressed in the adult brain with negligible expression in other tissues or pre-imaginal stages. Raalin-like sequences have also been found in the recently sequenced genomes of six ant species, but not in the jewel wasp Nasonia vitripennis. As in the honey bee, the Raalin-like sequences of ants do not have an ORF. A comparison of the genome region containing Raalin in the genomes of bees, ants and the wasp provides evolutionary support for an extensive genome rearrangement in this sequence. Our analyses identify a new family of ancient cysteine-rich short sequences in insects in which insertions and genome rearrangements may have disrupted this locus in the branch leading to the Hymenoptera. The regulated expression of this transcript suggests that it has a brain-specific function.
    Insect Molecular Biology 03/2012; 21(3):305-18. DOI:10.1111/j.1365-2583.2012.01138.x · 2.98 Impact Factor
  • Source
    ABSTRACT: Sequencing of the human genome has identified numerous chromosome copy number additions and subtractions that include stable partial gene duplications and pseudogenes that, when not properly annotated, can interfere with genetic analysis. As an example of this problem, an evolutionary chromosome event in the primate ancestral chromosome 18 produced a partial duplication and inversion of rho-associated protein kinase 1 (ROCK1 -18q11.1, 33 exons) in the subtelomeric region of the p arm of chromosome 18 detectable only in humans. ROCK1 and the partial gene copy, which the gene databases also currently call ROCK1, include non-unique single nucleotide polymorphisms (SNPs). Here, we characterize this partial gene copy of the human ROCK1, termed Little ROCK, located at 18p11.32. Little ROCK includes five exons, four of which share 99% identity with the terminal four exons of ROCK1 and one of which is unique to Little ROCK. In humans, while ROCK1 is expressed in many organs, Little ROCK expression is restricted to vascular smooth muscle cell (VSMC) lines and organs rich in smooth muscle. The single nucleotide polymorphism database (dbSNP) lists multiple variants contained in the region shared by ROCK1 and Little ROCK. Using gene and cDNA sequence analysis, we showed that two non-synonymous SNPs annotated in the genome are actually fixed differences between the ROCK1 and the Little ROCK gene sequences. Two additional coding SNPs were valid polymorphisms selectively within Little ROCK. Little ROCK-Green Fluorescent fusion proteins were highly unstable and degraded by the ubiquitin-proteasome system in vitro. In this report we have characterized Little ROCK (ROCK1P1), a human expressed pseudogene derived from partial duplication of ROCK1. The large number of pseudogenes in the human genome creates significant genetic diversity. Our findings emphasize the importance of taking into consideration pseudogenes in all candidate gene and genome-wide association studies, as well as the need for complete annotation of the human pseudogenome.
    BMC Genetics 04/2010; 11:22. DOI:10.1186/1471-2156-11-22 · 2.36 Impact Factor