ParPEST: A pipeline for EST data analysis based on parallel computing

Department of Structural and Functional Biology, University Federico II, 80134 Naples, Italy.
BMC Bioinformatics (Impact Factor: 2.58). 01/2006; 6(Suppl 4):S9. DOI: 10.1186/1471-2105-6-S4-S9
Source: PubMed


Expressed Sequence Tags (ESTs) are short and error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and, moreover, represent a reliable source of information for the annotation of genomic sequences. Because of advances in biotechnology, large EST datasets are now produced daily. Suitable and efficient bioinformatic approaches are therefore necessary to organize the data and their information content for further investigation.
We implemented ParPEST (Parallel Processing of ESTs), a pipeline based on parallel computing for EST analysis. The results are organized in a suitable data warehouse to provide a starting point to mine expressed sequence datasets. The collected information is useful for investigations on data quality and on data information content, enriched also by a preliminary functional annotation.
The pipeline presented here has been developed to perform an exhaustive and reliable analysis of EST data and to provide a curated set of information based on a relational database. Moreover, it is designed to reduce the execution time of the specific steps required for a complete analysis by using distributed processes and parallelized software. It is conceived to run on hardware with modest requirements, so that it can meet the growing demand typical of this kind of data and scale at affordable cost.



Available from: M. Chiusano, Feb 09, 2014
  • Source
    • "do address to a broad set of users, however, as there is no 'one size fits all' solution, scientists have come with other solutions each tailored for a specific problem. For instance, BioBrew provided an 'over-the-counter' cluster functionality [8] [9] [10]. DNALinux [11] [12] provided a preconfigured virtual machine that runs on top of the free VMWare Player on Windows XP and Vista, meaning that one could use Windows in parallel with running one's bioinformatics application in DNALinux [8] [11]. "
    ABSTRACT: Research in Life Sciences has moved from a purely hypothesis-driven science to a data-hypothesis-driven science. Huge volumes of data require powerful systems, intelligent algorithms and a group of people maintaining and improving the infrastructure associated with the software environments. These software environments need to be constantly maintained, configured and updated to suit the researchers' ever-changing needs and goals. To address these challenges, engineers and computer scientists have proposed multiple solutions built on Linux systems that include all the software needed by the research group. This paper therefore presents a review of the major Life Sciences-driven customized Linux distributions (henceforth referred to as 'Life-Linux distros') used in academia and industry.
  • Source
    • "Some platforms terminate at the assembly level, providing contigs and singletons [16] (referred to as rESTs) while other platforms exclusively run nucleotide-based programs with limited annotation at the protein level [17-20]. Based on the benchmarking results, a robust transcriptome analysis pipeline (TranSeqAnnotator) is constructed with contig generation from ESTs and short reads, updated pathway analysis, non-classically secreted protein identification and extensive annotation with an option to select specific analysis phases by users (detailed below). "
    ABSTRACT: Background: The transcriptome of an organism can be studied through the analysis of expressed sequence tag (EST) data sets, which offers a rapid and cost-effective approach supported by several new and updated bioinformatics tools for assembly and annotation. Together with genome and proteome analysis, these analyses help in understanding an organism. With the advent of large-scale sequencing projects and the generation of sequence data at the protein and cDNA levels, an automated analysis pipeline is necessary to store, organize and annotate ESTs. Results: TranSeqAnnotator is a workflow for large-scale analysis of transcriptomic data with the most appropriate bioinformatics tools for data management and analysis. The pipeline automatically cleans, clusters, assembles and generates consensus sequences, conceptually translates these into possible protein products and assigns putative function based on various DNA and protein similarity searches. Excretory/secretory (ES) proteins inferred from ESTs/short reads are also identified. TranSeqAnnotator accepts raw and quality ESTs in FASTA format, along with protein and short read sequences, and analyses them with user-selected programs. After pre-processing and assembly, the dataset is annotated at the nucleotide, protein and ES protein levels. Conclusion: TranSeqAnnotator has been developed on a Linux cluster to perform an exhaustive and reliable analysis and provide detailed annotation. TranSeqAnnotator outputs gene ontologies and protein functional identifications in terms of mapping to protein domains and metabolic pathways. The pipeline is applied to annotate large EST datasets to identify several novel and known genes, with therapeutic experimental validations, that could serve as potential targets for parasite intervention. TranSeqAnnotator is freely available for the scientific community at
    BMC Bioinformatics 12/2012; 13(17). DOI:10.1186/1471-2105-13-S17-S24 · 2.58 Impact Factor
  • Source
    • "Both EST and TC sequences, the latter automatically generated by the ParPEST pipeline (D'Agostino et al., 2005, 2009; Gremme et al., 2005) and collected in a dedicated database called SolEST (D'Agostino et al., 2009), were aligned along BAC sequences using the GenomeThreader software (Gremme et al., 2005). GenomeThreader is used to generate splice-alignments of each EST along genomic sequences. "
    ABSTRACT: The consortium responsible for the sequencing of the tomato (Solanum lycopersicum) genome initially focused on the sequencing of the euchromatic regions using a BAC-by-BAC strategy. We analyzed the compositional features of the whole collection of publicly available BAC sequences. This analysis highlights specific peculiarities of heterochromatic and euchromatic BACs. In particular, the whole BAC collection shows i) a large variability in repeat and gene content, ii) a positive and significant correlation of LTR retrotransposons of the Gypsy class with the repeat content and iii) a preferential location of SINEs (short interspersed nuclear elements) in BAC sequences with a low repeat content. Our results point out a typical design of the tomato chromosomes and pave the way for further investigations on the relationship between DNA primary structure and chromatin organization in Solanaceae genomes.
    Gene 02/2012; 499(1):176-81. DOI:10.1016/j.gene.2012.02.044 · 2.14 Impact Factor