ParPEST: a pipeline for EST data analysis based on parallel computing

Department of Structural and Functional Biology, University Federico II, 80134 Naples, Italy.
BMC Bioinformatics (Impact Factor: 2.67). 01/2006; 6 Suppl 4(Suppl 4):S9. DOI: 10.1186/1471-2105-6-S4-S9
Source: PubMed

ABSTRACT Expressed Sequence Tags (ESTs) are short and error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and, moreover, represent a reliable source of information for the annotation of genomic sequences. Because of advances in biotechnology, ESTs are produced daily in the form of large datasets. Therefore, suitable and efficient bioinformatic approaches are necessary to organize the data and their related information content for further investigation.
We implemented ParPEST (Parallel Processing of ESTs), a pipeline based on parallel computing for EST analysis. The results are organized in a data warehouse that provides a starting point for mining expressed sequence datasets. The collected information is useful for investigating data quality and information content, and is further enriched by a preliminary functional annotation.
The pipeline presented here has been developed to perform an exhaustive and reliable analysis of EST data and to provide a curated set of information based on a relational database. Moreover, it is designed to reduce the execution time of the specific steps required for a complete analysis by using distributed processes and parallelized software. It is conceived to run on hardware with modest requirements, so that it can meet the growing demand typical of this kind of data and scale at affordable cost.
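As a rough illustration of the idea of distributing per-sequence work across processes, the following minimal Python sketch parses FASTA-formatted ESTs and computes a simple quality summary for each record in parallel. It is not ParPEST's actual code: the function names (`parse_fasta`, `summarize`, `summarize_parallel`), the choice of the fraction of ambiguous bases (`N`) as a quality indicator, and the use of Python's standard `concurrent.futures` module are all assumptions made for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first token of the header as the identifier.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def summarize(record):
    """Return (identifier, length, fraction of ambiguous 'N' bases)."""
    name, seq = record
    n_fraction = seq.count("N") / len(seq) if seq else 0.0
    return name, len(seq), n_fraction


def summarize_parallel(records, workers=2):
    """Apply summarize() to every record using a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, records))


if __name__ == "__main__":
    # Hypothetical toy input; a real run would read a FASTA file instead.
    demo = ">est1 clone A\nACGTNNAC\n>est2\nACGT\nACGT\n"
    for row in summarize_parallel(list(parse_fasta(demo))):
        print(row)
```

Each record is processed independently, so the work splits cleanly across workers. In a setting like the one described above, the resulting rows could then be loaded into a relational database for later querying.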

Available from: M. Chiusano, Feb 09, 2014
  • ABSTRACT: The consortium responsible for the sequencing of the tomato (Solanum lycopersicum) genome initially focused on the sequencing of the euchromatic regions using a BAC-by-BAC strategy. We analyzed the compositional features of the whole collection of publicly available BAC sequences. This analysis highlights specific peculiarities of heterochromatic and euchromatic BACs. In particular, the whole BAC collection shows i) a large variability in repeat and gene content, ii) a positive and significant correlation of LTR retrotransposons of the Gypsy class with the repeat content, and iii) a preferential location of SINEs (short interspersed nuclear elements) in BAC sequences with a low repeat content. Our results point out a typical design of the tomato chromosomes and pave the way for further investigations on the relationship between DNA primary structure and chromatin organization in Solanaceae genomes.
    Gene 02/2012; 499(1):176-81. DOI:10.1016/j.gene.2012.02.044 · 2.08 Impact Factor
  • ABSTRACT: Petunia is an excellent model system, especially for genetic, physiological and molecular studies. Thus far, however, genome-wide expression analysis has rarely been applied because of the lack of sequence information. We applied next-generation sequencing to generate, through de novo read assembly, a large catalogue of transcripts for Petunia axillaris and Petunia inflata. On the basis of both transcriptomes, comprehensive microarray chips for gene expression analysis were established and used for the analysis of global- and organ-specific gene expression in Petunia axillaris and Petunia inflata and to explore the molecular basis of the seed coat defects in a Petunia hybrida mutant, anthocyanin 11 (an11), lacking a WD40-repeat (WDR) transcription regulator. Among the transcripts differentially expressed in an11 seeds compared with wild type, we found many expected targets of AN11 as well as several interesting new candidates that might play a role in morphogenesis of the seed coat. Our results validate the combination of next-generation sequencing with microarray analysis strategies to identify the transcriptome of two petunia species without previous knowledge of their genome, and to develop comprehensive chips as useful tools for the analysis of gene expression in P. axillaris, P. inflata and P. hybrida.
    The Plant Journal 05/2011; 68(1):11-27. DOI:10.1111/j.1365-313X.2011.04661.x · 6.82 Impact Factor
  • ABSTRACT: We present here a flexible computational framework to complete the large-scale computing tasks involved in the automatic annotation of whole genomes. The characteristics of this framework include a two-level job load system and an NFS-based distributed store of replicated data. In addition, the storage structure of annotation results in a relational database system and a web interface for graphical interactive browsing and searching of the data are also described. The framework has been used to identify a core set of human protein coding genes that are consistently annotated and of high quality, which can be accessed by the browser provided at
    Proceedings of the Fifth International Conference on Grid and Cooperative Computing Workshops; 10/2006