ParPEST: a pipeline for EST data analysis based on parallel computing
ABSTRACT Expressed Sequence Tags (ESTs) are short, error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and, moreover, represent a reliable source of information for the annotation of genomic sequences. Because of advances in biotechnology, large EST datasets are produced daily. Therefore, suitable and efficient bioinformatic approaches are necessary to organize the data and their information content for further investigation.
We implemented ParPEST (Parallel Processing of ESTs), a pipeline based on parallel computing for EST analysis. The results are organized in a suitable data warehouse to provide a starting point to mine expressed sequence datasets. The collected information is useful for investigations on data quality and on data information content, enriched also by a preliminary functional annotation.
The pipeline presented here has been developed to perform an exhaustive and reliable analysis of EST data and to provide a curated set of information stored in a relational database. Moreover, it is designed to reduce the execution time of the specific steps required for a complete analysis by using distributed processes and parallelized software. It is conceived to run on hardware with modest requirements, so that it can meet the growing demand typical of EST data and scale at affordable cost.
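The idea of distributing an analysis step over independent batches of ESTs can be illustrated with a minimal Python sketch. This is not ParPEST code: the function names (`parse_fasta`, `split_batches`, `analyze_batch`, `run_parallel`) are hypothetical, the per-sequence statistics (length and fraction of ambiguous `N` bases) are only a stand-in for a real analysis step, and a thread pool stands in for the distributed processes that a cluster-based pipeline would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (identifier, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (identifier, sequence) pairs."""
    records: List[Record] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_batches(records: List[Record], batch_size: int) -> List[List[Record]]:
    """Split the dataset into consecutive batches of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def analyze_batch(batch: List[Record]) -> List[Dict[str, object]]:
    """Hypothetical stand-in for one pipeline step, applied to a single batch."""
    results: List[Dict[str, object]] = []
    for name, seq in batch:
        n_fraction = seq.count("N") / len(seq) if seq else 0.0
        results.append({"id": name, "length": len(seq), "n_fraction": n_fraction})
    return results


def run_parallel(records: List[Record], batch_size: int = 2,
                 max_workers: int = 4) -> List[Dict[str, object]]:
    """Analyze batches concurrently and return results in input order."""
    batches = split_batches(records, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        per_batch = list(pool.map(analyze_batch, batches))
    return [row for batch in per_batch for row in batch]
```

Because each batch is processed independently, the same splitting scheme would carry over to separate processes or cluster nodes; only the executor would change. The ordered result list is the kind of record set that could then be loaded into a relational database.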
Available from: M. Chiusano, Feb 09, 2014
ABSTRACT: Public databases contain large datasets of plant expressed sequence tags (ESTs) that can be used for mining microsatellite/simple sequence repeat (SSR) markers. The identification and annotation of these markers take considerable time. Here, we describe an efficient, high-throughput microsatellite mining and analysis pipeline, the standalone EST microsatellite mining and analysis tool (SEMAT). The pipeline bundles sequence trimming, assembly, microsatellite identification, primer selection, and BLAST annotation, for which it consecutively uses SeqClean, CAP3, MISA, Primer3, and BLAST. SEMAT is written as Perl scripts and runs under Ubuntu and Fedora Linux. SEMAT is an efficient and time-saving bioinformatics tool for high-throughput EST-SSR analysis. It is freely available from http://semat.cpcribioinformatics.in/.
Tree Genetics & Genomes 12/2014; 10(6). DOI:10.1007/s11295-014-0785-2 · 2.44 Impact Factor
ABSTRACT: We present here a flexible computational framework to complete the large-scale computing tasks involved in the automatic annotation of whole genomes. The characteristics of this framework include a two-level job load system and an NFS-based distributed store of replicated data. In addition, the storage structure of annotation results in a relational database system and a web interface for graphical interactive browsing and searching of the data are also described. The framework has been used to identify a core set of human protein-coding genes that are consistently annotated and of high quality, which can be accessed by the browser provided at http://bioinfo.hust.edu.cn/en/database/ben.
Proceedings of the Fifth International Conference on Grid and Cooperative Computing Workshops; 10/2006
The 10th International Conference on Advanced Data Mining and Applications, Guilin, China; 12/2014