PET-Tool: a software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data

Information and Mathematical Sciences Group, Genome Institute of Singapore, 60 Biopolis Street, Genome #02-01, 138672, Singapore.
BMC Bioinformatics (Impact Factor: 2.67). 02/2006; 7:390. DOI: 10.1186/1471-2105-7-390
Source: PubMed

ABSTRACT We recently developed the Paired-End diTag (PET) strategy for efficient characterization of mammalian transcriptomes and genomes. The paired-end nature of short PET sequences derived from long DNA fragments raised a new set of bioinformatics challenges, including how to extract PETs from raw sequence reads and how to map PETs to reference genome sequences correctly yet efficiently. To accommodate and streamline analysis of the large volume of PET sequences generated by each PET experiment, an automated PET data-processing pipeline is desirable.
We designed an integrated software package, PET-Tool, to automatically process PET sequences and map them to genome sequences. The tool was implemented as a web-based application composed of four modules: the Extractor module for PET extraction; the Examiner module for analytic evaluation of PET sequence quality; the Mapper module for locating PET sequences in the genome sequences; and the Project Manager module for data organization. The performance of PET-Tool was evaluated through the analysis of 2.7 million PET sequences. PET-Tool was shown to be accurate and efficient in extracting PET sequences and removing artifacts from large-volume datasets. Using optimized mapping criteria, over 70% of quality PET sequences were mapped specifically to the genome sequences. On a 2.4 GHz Linux machine, it takes approximately six hours to process one million PETs from extraction to mapping.
Its speed, accuracy, and comprehensiveness demonstrate that PET-Tool is an important and useful component of PET experiments, and that it can be extended to accommodate other related analyses of paired-end sequences. The tool also provides user-friendly functions for data quality checks and a system for multi-layer data management.
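To put the reported throughput in perspective, the short sketch below turns the abstract's figure of roughly six hours per million PETs into a per-second rate and a run-time estimate for the 2.7 million PET evaluation set. It assumes that processing time scales linearly with the number of PETs, which the abstract does not state; the function names are hypothetical and are not part of PET-Tool.

```python
# Back-of-the-envelope run-time estimate based on the abstract's figure of
# "approximately six hours to process one million PETs".
# ASSUMPTION: run time scales linearly with the number of PETs.
# The function names below are illustrative only, not part of PET-Tool.

HOURS_PER_MILLION = 6.0  # approximate figure reported in the abstract


def estimate_hours(num_pets: int, hours_per_million: float = HOURS_PER_MILLION) -> float:
    """Estimated wall-clock hours to process num_pets from extraction to mapping."""
    return num_pets / 1_000_000 * hours_per_million


def pets_per_second(hours_per_million: float = HOURS_PER_MILLION) -> float:
    """Average processing rate implied by the reported figure."""
    return 1_000_000 / (hours_per_million * 3600)


if __name__ == "__main__":
    print(f"Estimated time for 2.7 million PETs: {estimate_hours(2_700_000):.1f} hours")
    print(f"Implied average rate: {pets_per_second():.1f} PETs per second")
```

Under that linearity assumption, the full evaluation set would take about 16 hours on the quoted hardware, at an average of about 46 PETs per second.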

  • ABSTRACT: Genome-wide identification of transcription factor binding sites and gene regulatory elements is an important problem in computational genomics. Advances in high-throughput technologies combined with chromatin immunoprecipitation, such as ChIP-on-chip, ChIP-seq and ChIP-PET (Paired-End diTag), allow us to map transcription factor binding sites (TFBS) and analyze mechanisms of gene regulation at the level of the entire genome. Examples include Oct4, Sox2, Nanog and 10 other transcription factors in mouse, and p53, c-Myc, ER and FoxA1 in human. Clustering of multiple binding sites of different TFs reveals potential enhancer regions in mammalian genomes. We discuss here statistical analysis of data mapping quality, sensitivity and specificity issues of the ChIP-seq TFBS sets, and downstream gene expression analysis.
  • ABSTRACT: Using a chromatin immunoprecipitation-paired end diTag cloning and sequencing strategy, we mapped estrogen receptor α (ERα) binding sites in MCF-7 breast cancer cells. We identified 1,234 high confidence binding clusters of which 94% are projected to be bona fide ERα binding regions. Only 5% of the mapped estrogen receptor binding sites are located within 5 kb upstream of the transcriptional start sites of adjacent genes, regions containing the proximal promoters, whereas the vast majority of the sites are mapped to intronic or distal locations (>5 kb from 5′ and 3′ ends of adjacent transcript), suggesting transcriptional regulatory mechanisms over significant physical distances. Of all the identified sites, 71% harbored putative full estrogen response elements (EREs), 25% bore ERE half sites, and only 4% had no recognizable ERE sequences. Genes in the vicinity of ERα binding sites were enriched for regulation by estradiol in MCF-7 cells, and their expression profiles in patient samples segregate ERα-positive from ERα-negative breast tumors. The expression dynamics of the genes adjacent to ERα binding sites suggest a direct induction of gene expression through binding to ERE-like sequences, whereas transcriptional repression by ERα appears to be through indirect mechanisms. Our analysis also indicates a number of candidate transcription factor binding sites adjacent to occupied EREs at frequencies much greater than by chance, including the previously reported FOXA1 sites, and demonstrates the potential involvement of one such putative adjacent factor, Sp1, in the global regulation of ERα target genes. Unexpectedly, we found that only 22%-24% of the bona fide human ERα binding sites overlapped conserved regions in whole genome vertebrate alignments, which suggests limited conservation of functional binding sites. Taken together, this genome-scale analysis suggests complex but definable rules governing ERα binding and gene regulation.
    PLoS Genetics 01/2005; DOI:10.1371/journal.pgen.0030087.eor · 8.52 Impact Factor
  • ABSTRACT: Advances in genome sequencing have progressed at a rapid pace, with increased throughput accompanied by plunging costs. But these advances go far beyond faster and cheaper. High-throughput sequencing technologies are now routinely being applied to a wide range of important topics in biology and medicine, often allowing researchers to address important biological questions that could not be addressed before. In this review, we discuss these innovative new approaches, including ever finer analyses of transcriptome dynamics, genome structure and genomic variation, and provide an overview of the new insights into complex biological systems catalyzed by these technologies. We also assess the impact of genotyping, genome sequencing and personal omics profiling on medical applications, including diagnosis and disease monitoring. Finally, we review recent developments in single-cell sequencing, and conclude with a discussion of possible future advances and obstacles for sequencing in biology and health.
    Molecular Systems Biology 01/2013; 9:640. DOI:10.1038/msb.2012.61 · 14.10 Impact Factor