MetaProm: a neural network based meta-predictor for alternative human promoter prediction. BMC Genomics 8:374

Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
BMC Genomics (Impact Factor: 3.99). 02/2007; 8(1):374. DOI: 10.1186/1471-2164-8-374
Source: PubMed


De novo eukaryotic promoter prediction is important for discovering novel genes and understanding gene regulation. In spite of the great advances made in the past decade, recent studies revealed that the overall performances of the current promoter prediction programs (PPPs) are still poor, and predictions made by individual PPPs do not overlap each other. Furthermore, most PPPs are trained and tested on the most-upstream promoters; their performances on alternative promoters have not been assessed.
In this paper, we evaluate the performances of current major promoter prediction programs (i.e., PSPA, FirstEF, McPromoter, DragonGSF, DragonPF, and FProm) using 42,536 distinct human gene promoters on a genome-wide scale, and with emphasis on alternative promoters. We describe an artificial neural network (ANN) based meta-predictor program that integrates predictions from the current PPPs and the predicted promoters' relation to CpG islands. Our specific analysis of recently discovered alternative promoters reveals that although only 41% of the 3' most promoters overlap a CpG island, 74% of 5' most promoters overlap a CpG island.
Our assessment of six PPPs on 1.06 x 10^9 bp of human genome sequence reveals the specific strengths and weaknesses of individual PPPs. Our meta-predictor outperforms any individual PPP in sensitivity and specificity. Furthermore, we discovered that the 5' alternative promoters are more likely to be associated with a CpG island.
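The CpG island percentages quoted in the abstract are overlap fractions: the share of promoters in a set that intersect at least one CpG island. A minimal sketch of that calculation follows. It is not the authors' pipeline; the interval representation (half-open coordinates), the function names `overlaps` and `fraction_overlapping`, and the toy coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open [start, end).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(promoters: List[Interval], islands: List[Interval]) -> float:
    """Fraction of promoters that overlap at least one CpG island."""
    if not promoters:
        return 0.0
    hits = sum(1 for p in promoters if any(overlaps(p, c) for c in islands))
    return hits / len(promoters)


# Toy data, invented for illustration only.
promoters = [
    ("chr1", 100, 200),
    ("chr1", 500, 600),
    ("chr2", 100, 200),
    ("chr2", 900, 1000),
]
cpg_islands = [
    ("chr1", 150, 400),
    ("chr2", 50, 120),
    ("chr2", 1000, 1200),  # adjacent to the last promoter, but not overlapping
]

print(fraction_overlapping(promoters, cpg_islands))
```

The nested scan is quadratic in the worst case, which is acceptable for a toy example; a genome-wide comparison would more plausibly sort the intervals or use an interval tree.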

Available from: Junwen Wang
    • "Since CpG islands are in 74% of upstream promoters and 40% of the downstream promoters of mammalian genes [40], we hypothesized that, the promoter and 5′UTR regions, which are important for regulatory roles of the genome, are also under covered by the NGS technology. Indeed, in all three folds we tested, promoter and 5′UTR regions are significantly under covered by next generation sequencing when compared with whole genome background (Supplementary Figure 2) (both P values less than 2.2e-16). "
    ABSTRACT: The rapid development of next generation sequencing (NGS) technology provides a new chance to extend the scale and resolution of genomic research. How to efficiently map millions of short reads to the reference genome and how to make accurate SNP calls are two major challenges in taking full advantage of NGS. In this article, we reviewed the current software tools for mapping and SNP calling, and evaluated their performance on samples from The Cancer Genome Atlas (TCGA) project. We found that BWA and Bowtie are better than the other alignment tools in comprehensive performance for Illumina platform, while NovoalignCS showed the best overall performance for SOLiD. Furthermore, we showed that next-generation sequencing platform has significantly lower coverage and poorer SNP-calling performance in the CpG islands, promoter and 5'-UTR regions of the genome. NGS experiments targeting for these regions should have higher sequencing depth than the normal genomic region.
    Full-text · Article · Aug 2011 · Scientific Reports
    • "The ChIP-X data can be in ‘bed’ or ‘gff’ formats, which shows the position of all identified peak regions or summits, or as peak files produced by the two popular CisGenome (12) and MACS (13) ChIP-X analysis softwares, or the input can simply be a list of genes with associated peaks. ChIP-Array generates a list of potential targets of the TF for the inputted peak files based on the distance between the transcription start site (TSS) of the closest gene (14) and the summit of peaks with cutoff distance specified by the users. Our system can cope with genome coordinates in the peak file of human genome assembly version hg19, mouse version mm9, yeast version sacCer2, fruit fly version dm3 and Arabidopsis version TAIR8. "
    ABSTRACT: Chromatin immunoprecipitation (ChIP) coupled with high-throughput techniques (ChIP-X), such as next generation sequencing (ChIP-Seq) and microarray (ChIP–chip), has been successfully used to map active transcription factor binding sites (TFBS) of a transcription factor (TF). The targeted genes can be activated or suppressed by the TF, or are unresponsive to the TF. Microarray technology has been used to measure the actual expression changes of thousands of genes under the perturbation of a TF, but is unable to determine if the affected genes are direct or indirect targets of the TF. Furthermore, both ChIP-X and microarray methods produce a large number of false positives. Combining microarray expression profiling and ChIP-X data allows more effective TFBS analysis for studying the function of a TF. However, current web servers only provide tools to analyze either ChIP-X or expression data, but not both. Here, we present ChIP-Array, a web server that integrates ChIP-X and expression data from human, mouse, yeast, fruit fly and Arabidopsis. This server will assist biologists to detect direct and indirect target genes regulated by a TF of interest and to aid in the functional characterization of the TF. ChIP-Array is available at, with free access to academic users.
    Full-text · Article · May 2011 · Nucleic Acids Research
    • "MetaProm [11] and EnsemPro [12] are both programs that use ensemble methods for promoter prediction. Although we were unable to obtain predictions for these programs, we could evaluate Profisi Ensemble using the evaluation rules described in the original papers in an attempt to make some comparison with them. "
    ABSTRACT: The computational prediction of transcription start sites is an important unsolved problem. Some recent progress has been made, but many promoters, particularly those not associated with CpG islands, are still difficult to locate using current methods. These methods use different features and training sets, along with a variety of machine learning techniques and result in different prediction sets. We demonstrate the heterogeneity of current prediction sets, and take advantage of this heterogeneity to construct a two-level classifier ('Profisi Ensemble') using predictions from 7 programs, along with 2 other data sources. Support vector machines using 'full' and 'reduced' data sets are combined in an either/or approach. We achieve a 14% increase in performance over the current state-of-the-art, as benchmarked by a third-party tool. Supervised learning methods are a useful way to combine predictions from diverse sources.
    Full-text · Article · Nov 2010 · BMC Genomics