Pseudofam: The pseudogene families database

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.
Nucleic Acids Research (Impact Factor: 9.11). 11/2008; 37(Database issue):D738-43. DOI: 10.1093/nar/gkn758
Source: PubMed


Pseudofam is a database of pseudogene families based on the protein families from the Pfam database. It provides resources for analyzing the family structure of pseudogenes, including query tools, statistical summaries and sequence alignments. The current version of Pseudofam contains more than 125,000 pseudogenes identified from 10 eukaryotic genomes and aligned within nearly 3000 families (approximately one-third of the total families in PfamA). Pseudofam uses a large-scale parallelized homology search algorithm (implemented as an extension of the PseudoPipe pipeline) to identify pseudogenes. Each identified pseudogene is assigned to its parent protein family, and the pseudogenes in a family are then aligned to one another by transferring the parent domain alignments from the Pfam family. Pseudogenes are also given additional annotation based on an ontology, reflecting their mode of creation and subsequent history. In particular, our annotation highlights the association of pseudogene families with genomic features, such as segmental duplications. In addition, pseudogene families are associated with key statistics, which identify outlier families with an unusual degree of pseudogenization. The statistics also show how the number of genes and pseudogenes in families correlates across different species. Overall, they highlight the fact that housekeeping families tend to be enriched with a large number of pseudogenes.
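The alignment-transfer step described in the abstract can be illustrated with a minimal Python sketch. It is not the Pseudofam or PseudoPipe implementation: the function name `transfer_alignment`, the use of `-` as the only gap character, and the choice to drop pseudogene insertions that have no column in the family alignment are all assumptions made for illustration. Given the parent protein's row in a family alignment and a pairwise alignment of a pseudogene to that parent, the sketch projects the pseudogene onto the family's columns.

```python
def transfer_alignment(parent_msa_row, parent_pw, pseudo_pw, gap="-"):
    """Project a pseudogene onto the columns of a family alignment.

    parent_msa_row: the parent's gapped row in the family alignment.
    parent_pw, pseudo_pw: the two gapped rows of a pairwise alignment
        between the parent and the pseudogene (equal length).
    Hypothetical helper for illustration only.
    """
    if len(parent_pw) != len(pseudo_pw):
        raise ValueError("pairwise alignment rows must have equal length")
    if parent_pw.replace(gap, "") != parent_msa_row.replace(gap, ""):
        raise ValueError("parent sequence differs between the two alignments")

    # For every parent residue, record what the pseudogene has opposite it.
    # Columns where the parent is gapped are pseudogene insertions; they have
    # no column in the family alignment, so this sketch drops them.
    opposite_parent = [q for p, q in zip(parent_pw, pseudo_pw) if p != gap]

    # Walk the parent's family row: family gaps stay gaps, and each parent
    # residue is replaced by the pseudogene character aligned to it.
    residues = iter(opposite_parent)
    return "".join(gap if c == gap else next(residues) for c in parent_msa_row)


if __name__ == "__main__":
    # Toy data, not taken from Pfam or Pseudofam.
    print(transfer_alignment("MK-TA--LV", "MKTA-LV", "MK-AGLI"))
```

Because every pseudogene in a family is projected onto the same columns, the resulting rows can be stacked directly into a family-wide pseudogene alignment without any further pairwise comparisons.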



Available from: PubMed Central · License: CC BY-NC
  • Source
    • "Additionally, pseudogene transcripts corresponding to CALM2 (calmodulin 2 phosphorylase kinase, delta), TOMM40 (translocase of outer mitochondrial membrane 40), NONO (non-POU domain-containing, octamer-binding), DUSP8 (dual-specificity phosphatase 8), PERP (TP53 apoptosis effector), and YES (v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1), etc. were observed in more than 50 samples each, which were further validated by pseudogene-specific RT-PCR followed by Sanger sequencing (Table S4). Further, because our RNA-Seq compendium comprises 35- to 45-mer short sequence reads that largely generated short sequence clusters not optimal for available pseudogene analysis tools such as Pseudopipe (Zhang et al., 2006) and Pseudofam (Lam et al., 2009) used in generating ENCODE and Yale databases, we carried out a direct query of individual clusters against the human genome (hg18) using the BLAT tool from UCSC, which is ideally suited for short sequence alignment searches (Kent, 2002). Based on this 'custom' analysis, or simply BLAT (Figure S2A), we were able to independently assign 1,888 clusters representing 1,820 unique pseudogenes to unique genomic locations."
    ABSTRACT: Pseudogene transcripts can provide a novel tier of gene regulation through generation of endogenous siRNAs or miRNA-binding sites. Characterization of pseudogene expression, however, has remained confined to anecdotal observations due to analytical challenges posed by the extremely close sequence similarity with their counterpart coding genes. Here, we describe a systematic analysis of pseudogene "transcription" from an RNA-Seq resource of 293 samples, representing 13 cancer and normal tissue types, and observe a surprisingly prevalent, genome-wide expression of pseudogenes that could be categorized as ubiquitously expressed or lineage and/or cancer specific. Further, we explore disease subtype specificity and functions of selected expressed pseudogenes. Taken together, we provide evidence that transcribed pseudogenes are a significant contributor to the transcriptional landscape of cells and are positioned to play significant roles in cellular differentiation and cancer progression, especially in light of the recently described ceRNA networks. Our work provides a transcriptome resource that enables high-throughput analyses of pseudogene expression.
    Full-text · Article · Jun 2012 · Cell
  • Source
    • "Although the function of most ncRNAs is unknown, some have been implicated in the regulation of disease, stress conditions, imprinting, gene silencing and enhancer regulation (Costa, 2007; Orom & Shiekhattar, 2011). Pseudogenes that lack protein-coding potential can be considered to be a type of ncRNA (Lam et al., 2009). Pseudogenes show sequence similarity to some functional parental genes (Muro et al., 2011)."
    ABSTRACT: We identified a predicted compact cysteine-rich sequence in the honey bee genome that we called 'Raalin'. Raalin transcripts are enriched in the brain of adult honey bee workers and drones, with only minimal expression in other tissues or in pre-adult stages. Open-reading frame (ORF) homologues of Raalin were identified in the transcriptomes of fruit flies, mosquitoes and moths. The Raalin-like gene from Drosophila melanogaster encodes a short secreted protein that is maximally expressed in the adult brain with negligible expression in other tissues or pre-imaginal stages. Raalin-like sequences have also been found in the recently sequenced genomes of six ant species, but not in the jewel wasp Nasonia vitripennis. As in the honey bee, the Raalin-like sequences of ants do not have an ORF. A comparison of the genome region containing Raalin in the genomes of bees, ants and the wasp provides evolutionary support for an extensive genome rearrangement in this sequence. Our analyses identify a new family of ancient cysteine-rich short sequences in insects in which insertions and genome rearrangements may have disrupted this locus in the branch leading to the Hymenoptera. The regulated expression of this transcript suggests that it has a brain-specific function.
    Full-text · Article · Mar 2012 · Insect Molecular Biology
  • Source
    • "This sub-property, has_parent_gene, restricts the range of values to instances of protein_coding_gene and restricts the maximum cardinality to a single instance. We used identifiers from the existing pseudogene ontology (PGO), which was created as part of the Pseudofam project (Lam et al., 2009). We also incorporated information, where available, about the location of particular exons and introns within the pseudogenes, noting of course that these no longer have the same meaning in a non-functional context."
    ABSTRACT: Recent years have seen the development of a wide range of biomedical ontologies. Notable among these is Sequence Ontology (SO) which offers a rich hierarchy of terms and relationships that can be used to annotate genomic data. Well-designed formal ontologies allow data to be reasoned upon in a consistent and logically sound way and can lead to the discovery of new relationships. The Semantic Web Rules Language (SWRL) augments the capabilities of a reasoner by allowing the creation of conditional rules. To date, however, formal reasoning, especially the use of SWRL rules, has not been widely used in biomedicine. We have built a knowledge base of human pseudogenes, extending the existing SO framework to incorporate additional attributes. In particular, we have defined the relationships between pseudogenes and segmental duplications. We then created a series of logical rules using SWRL to answer research questions and to annotate our pseudogenes appropriately. Finally, we were left with a knowledge base which could be queried to discover information about human pseudogene evolution. The fully populated knowledge base described in this document is available for download from A SPARQL endpoint from which to query the dataset is also available at this location.
    Full-text · Article · Jun 2010 · Bioinformatics