PET-Tool: A software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data

Information and Mathematical Sciences Group, Genome Institute of Singapore, 60 Biopolis Street, Genome #02-01, 138672, Singapore.
BMC Bioinformatics (Impact Factor: 2.58). 02/2006; 7(1):390. DOI: 10.1186/1471-2105-7-390
Source: PubMed


We recently developed the Paired-End diTag (PET) strategy for efficient characterization of mammalian transcriptomes and genomes. The paired-end nature of short PET sequences derived from long DNA fragments raised a new set of bioinformatics challenges, including how to extract PETs from raw sequence reads and how to map PETs to reference genome sequences correctly yet efficiently. To accommodate and streamline analysis of the large volume of PET sequences generated by each PET experiment, an automated PET data processing pipeline is desirable.
We designed an integrated software package, PET-Tool, to automatically process PET sequences and map them to genome sequences. The Tool was implemented as a web-based application composed of four modules: the Extractor module for PET extraction; the Examiner module for analytic evaluation of PET sequence quality; the Mapper module for locating PET sequences in the genome sequences; and the Project Manager module for data organization. The performance of PET-Tool was evaluated through the analysis of 2.7 million PET sequences. PET-Tool proved accurate and efficient in extracting PET sequences and removing artifacts from large datasets. Using optimized mapping criteria, over 70% of quality PET sequences were mapped specifically to the genome sequences. On a 2.4 GHz Linux machine, it takes approximately six hours to process one million PETs from extraction to mapping.
Its speed, accuracy, and comprehensiveness make PET-Tool an important and useful component of PET experiments, and it can be extended to accommodate other related analyses of paired-end sequences. The Tool also provides user-friendly functions for data quality checks and a system for multi-layer data management.
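
The abstract does not spell out how artifacts are removed. As a rough illustration only, the sketch below filters reads using the criteria quoted in one of the citing excerpts listed on this page (sequences containing N, homopolymer runs, or PCR primer sequence are discarded). It is not PET-Tool's code: the function names, the run-length threshold of 10, and the example primer are all assumptions made for the sake of the example.

```python
import re

# Hypothetical threshold: a run of this many identical bases is treated
# as a homopolymer artifact (polyA, polyT, polyG, polyC). Not from the paper.
HOMOPOLYMER_RUN = 10


def is_artifact(seq, primers=(), run_length=HOMOPOLYMER_RUN):
    """Return True if the read contains N, a long homopolymer run, or a primer."""
    s = seq.upper()
    # Ambiguous base calls
    if "N" in s:
        return True
    # One base followed by at least (run_length - 1) copies of itself
    if re.search(r"([ACGT])\1{%d,}" % (run_length - 1), s):
        return True
    # Any supplied primer sequence present in the read
    return any(p.upper() in s for p in primers)


def filter_sequences(seqs, primers=()):
    """Keep only the reads that are not flagged as artifacts."""
    return [s for s in seqs if not is_artifact(s, primers)]


if __name__ == "__main__":
    reads = [
        "ACGTACGTAGCTAGCTAGGA",  # clean read
        "ACGTNACGT",             # contains N
        "AAAAAAAAAAAACGT",       # polyA run
        "GGATCCACGTACGT",        # contains the example primer
    ]
    # "GGATCC" is a made-up primer used only for this demonstration.
    print(filter_sequences(reads, primers=["GGATCC"]))
```

A real pipeline would also need to handle reverse-complemented primers and sequencing quality scores, which this sketch deliberately leaves out.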


Available from: Kuo Ping Chiu, Oct 09, 2014
    • "Raw sequences were processed for quality control [8]. Briefly, sequences with N, polyA, polyT, polyG, polyC, and PCR primer sequence were removed. "
    ABSTRACT: Background: Current next-generation sequencing (NGS) platforms adopt two types of sequencing mechanisms: by synthesis or by ligation. The former is employed by the 454 and Solexa systems, the latter by the SOLiD system. Although the pros and cons of each sequencing mechanism have been discussed on a number of occasions, the potential obstacle imposed by palindromic sequences has not yet been addressed. Methods: To test the effect of a palindromic region on sequencing efficacy, we clonally amplified a paired-end ditag sequence composed of a 24-bp palindromic sequence flanked by a pair of tags from the E. coli genome. We used the near-homogeneous fragments produced from MmeI digestion of the amplified clone to generate a sequencing library for the SOLiD 5500xl sequencer. Results: Traditional ABI sequencers, which adopt a sequencing-by-synthesis mechanism, were able to read through the palindromic region. However, the SOLiD 5500xl was unable to do so. Instead, the palindromic region was read as miscellaneous random sequences. Moreover, the readable tag sequence turned obscure ~2 bp prior to the palindromic region. Conclusions: Taken together, we demonstrate that SOLiD machines, which employ a sequencing-by-ligation mechanism, are unable to read through the palindromic region. On the other hand, sequencing-by-synthesis sequencers had no difficulty in doing so.
    Full-text · Article · Dec 2012 · BMC Systems Biology
    • "Paired-End diTags are sequences with a mean length of 35 bp, each containing the 5′ and 3′ signatures of a full-length transcript (6,16) and a collection of 307 056 distinct PETs were obtained (unpublished data). "
    ABSTRACT: We have developed a novel method for estimating the parameters of hidden Markov models for gene finding in newly sequenced species. Our approach does not rely on curated training data sets, but instead uses extrinsic evidence (including paired-end ditags that have not been used in gene finding previously) and iterative training. This new method is particularly suitable for annotation of species with large evolutionary distance to the closest annotated species. We have used our approach to produce an initial annotation of more than 16,000 genes in the newly sequenced Schistosoma japonicum draft genome. We established the high quality of our predictions by comparison to full-length cDNAs (withdrawn from the extrinsic evidence) and to CEGMA core genes. We also evaluated the effectiveness of the new training procedure on Caenorhabditis elegans genome. ExonHunter and the newest parametric files for S. japonicum genome are available for download at
    Full-text · Article · Apr 2009 · Nucleic Acids Research
    • "Data analysis was performed using PET-Tool for PET extraction and genome mapping (13), followed by visualization in the T2G browser, a specially designed visualization system for PETs mapped to genome assemblies (4). Calculations were performed with Microsoft Excel. "
    ABSTRACT: Complex libraries for genomic DNA and cDNA sequencing analyses are typically amplified using bacterial propagation. To reduce biases, large numbers of colonies are plated and scraped from solid-surface agar. This process is time consuming, tedious and limits scaling up. At the same time, multiple displacement amplification (MDA) has been recently developed as a method for in vitro amplification of DNA. However, MDA has no selection function for the removal of ligation multimers. We developed a novel method of briefly introducing ligation reactions into bacteria to select single insert DNA clones followed by MDA to amplify. We applied these methods to a Gene Identification Signatures with Paired-End diTags (GIS-PET) library, which is a complex transcriptome library created by pairing short tags from the 5′ and 3′ ends of cDNA fragments together, and demonstrated that this selection and amplification strategy is unbiased and efficient.
    Full-text · Article · Apr 2008 · Nucleic Acids Research