ParPEST: a pipeline for EST data analysis based on parallel computing.

Department of Structural and Functional Biology, University Federico II, 80134 Naples, Italy.
BMC Bioinformatics (Impact Factor: 3.02). 01/2006; 6 Suppl 4:S9. DOI: 10.1186/1471-2105-6-S4-S9
Source: PubMed

ABSTRACT Expressed Sequence Tags (ESTs) are short and error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and, moreover, represent a reliable information for the annotation of genomic sequences. Because of the advances in biotechnologies, ESTs are daily determined in the form of large datasets. Therefore, suitable and efficient bioinformatic approaches are necessary to organize data related information content for further investigations.
We implemented ParPEST (Parallel Processing of ESTs), a pipeline based on parallel computing for EST analysis. The results are organized in a suitable data warehouse to provide a starting point to mine expressed sequence datasets. The collected information is useful for investigations on data quality and on data information content, enriched also by a preliminary functional annotation.
The pipeline presented here has been developed to perform an exhaustive and reliable analysis on EST data and to provide a curated set of information based on a relational database. Moreover, it is designed to reduce execution time of the specific steps required for a complete analysis using distributed processes and parallelized software. It is conceived to run on low requiring hardware components, to fulfill increasing demand, typical of the data used, and scalability at affordable costs.

  • [Show abstract] [Hide abstract]
    ABSTRACT: Parasitic nematodes infect humans, other animals and plants, and impose a significant public health and economic burden worldwide due to the diseases that they cause. A better understanding of parasite genomes, host-parasite relationships and the molecular biology of parasites themselves will enable the rational development of diagnostic tests and/or safe anti-parasitic compounds, following the functional annotation of parasite genomic sequences. With only a few completely sequenced nematode genomes, expressed sequence tag (EST) datasets provide a low-cost alternative ("poor man's genome") to whole genome sequences and a glimpse of the transcriptome of an organism. EST data require a number of computational methods for their pre-processing, clustering, assembly and annotation to yield biologically relevant information. In this article, we review the steps involved in EST data analysis, the development of new semi-automated bioinformatic pipelines and their application to parasitic nematodes of major socio-economic significance, focused on identifying molecules involved in key biological processes or pathways that might serve as targets for new drugs or vaccines.
    Biotechnology advances 05/2009; 27(4):439-48. · 8.25 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Investigators in the biological sciences continue to exploit laboratory automation methods and have dramatically increased the rates at which they can generate data. In many environments, the methods themselves also evolve in a rapid and fluid manner. These observations point to the importance of robust information management systems in the modern laboratory. Designing and implementing such systems is non-trivial and it appears that in many cases a database project ultimately proves unserviceable. We describe a general modeling framework for laboratory data and its implementation as an information management system. The model utilizes several abstraction techniques, focusing especially on the concepts of inheritance and meta-data. Traditional approaches commingle event-oriented data with regular entity data in ad hoc ways. Instead, we define distinct regular entity and event schemas, but fully integrate these via a standardized interface. The design allows straightforward definition of a "processing pipeline" as a sequence of events, obviating the need for separate workflow management systems. A layer above the event-oriented schema integrates events into a workflow by defining "processing directives", which act as automated project managers of items in the system. Directives can be added or modified in an almost trivial fashion, i.e., without the need for schema modification or re-certification of applications. Association between regular entities and events is managed via simple "many-to-many" relationships. We describe the programming interface, as well as techniques for handling input/output, process control, and state transitions. The implementation described here has served as the Washington University Genome Sequencing Center's primary information system for several years. It handles all transactions underlying a throughput rate of about 9 million sequencing reactions of various kinds per month and has handily weathered a number of major pipeline reconfigurations. The basic data model can be readily adapted to other high-volume processing environments.
    BMC Bioinformatics 02/2007; 8:362. · 3.02 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Background The transcriptome of an organism can be studied with the analysis of expressed sequence tag (EST) data sets that offers a rapid and cost effective approach with several new and updated bioinformatics approaches and tools for assembly and annotation. The comprehensive analyses comprehend an organism along with the genome and proteome analysis. With the advent of large-scale sequencing projects and generation of sequence data at protein and cDNA levels, automated analysis pipeline is necessary to store, organize and annotate ESTs. Results TranSeqAnnotator is a workflow for large-scale analysis of transcriptomic data with the most appropriate bioinformatics tools for data management and analysis. The pipeline automatically cleans, clusters, assembles and generates consensus sequences, conceptually translates these into possible protein products and assigns putative function based on various DNA and protein similarity searches. Excretory/secretory (ES) proteins inferred from ESTs/short reads are also identified. The TranSeqAnnotator accepts FASTA format raw and quality ESTs along with protein and short read sequences and are analysed with user selected programs. After pre-processing and assembly, the dataset is annotated at the nucleotide, protein and ES protein levels. Conclusion TranSeqAnnotator has been developed in a Linux cluster, to perform an exhaustive and reliable analysis and provide detailed annotation. TranSeqAnnotator outputs gene ontologies, protein functional identifications in terms of mapping to protein domains and metabolic pathways. The pipeline is applied to annotate large EST datasets to identify several novel and known genes with therapeutic experimental validations and could serve as potential targets for parasite intervention. TransSeqAnnotator is freely available for the scientific community at
    BMC Bioinformatics 13(17). · 3.02 Impact Factor

Full-text (2 Sources)

Available from
Jun 5, 2014