BMC Bioinformatics 2005, 6:S9
Users can further select data based on the preliminary
functional annotation, specifying a biological function as
well as a GO term or a GO accession (Figure 5b). Moreover,
restrictions on the results of the whole analytical
procedure can be applied to retrieve different sets of
ESTs (Figure 5c). For example, users can retrieve all ESTs
with or without vector contamination, with or without BLAST
matches, and classified as singletons or as members of a cluster.
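The restrictions described above can be pictured as optional filters combined with a logical AND. The following is a minimal sketch under stated assumptions, not the ParPEST implementation: the record type `EstRecord`, its field names, and the function `select_ests` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class EstRecord:
    """Hypothetical summary of one EST after the pipeline has run."""
    est_id: str
    has_vector: bool            # vector contamination was detected
    has_blast_match: bool       # at least one BLAST match was found
    cluster_id: Optional[str]   # None means the EST is a singleton


def select_ests(
    records: Iterable[EstRecord],
    has_vector: Optional[bool] = None,
    has_blast_match: Optional[bool] = None,
    in_cluster: Optional[bool] = None,
) -> List[EstRecord]:
    """Return the records that satisfy every restriction that is not None.

    A restriction left as None is ignored, so calling the function with no
    restrictions returns all records.
    """
    selected = []
    for rec in records:
        if has_vector is not None and rec.has_vector != has_vector:
            continue
        if has_blast_match is not None and rec.has_blast_match != has_blast_match:
            continue
        if in_cluster is not None and (rec.cluster_id is not None) != in_cluster:
            continue
        selected.append(rec)
    return selected
```

For instance, `select_ests(records, has_vector=False, in_cluster=False)` would return the vector-free singletons of a collection, regardless of their BLAST status.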
The Cluster Browser (Figure 6) is dedicated to selecting
clustered sequences by the specific identifier assigned
by the software and by their structural features
(Figure 6a). Information about the functional annotation
of the contig can also be used for retrieval (Figure 6b).
Results of specific queries are shown in a graphical display
that reports, among other information, the contig
sequence, the ESTs that define the cluster, and their
organization as aligned by CAP3 (Figure 7). This is
considered useful to support analyses of transcribed variants
putatively derived from the same gene or from gene fami-
We designed the presented pipeline to perform an exhaustive
analysis of EST datasets. Moreover, we implemented
ParPEST to reduce the execution time of the different steps
required for a complete analysis by means of distributed
processing and parallelized software. Although some
efforts that integrate all the steps of a comprehensive EST
analysis in a pipelined approach are reported in the
literature [11-13], to our knowledge no publicly available
software is based on parallel computing
for the whole data processing. Time efficiency is very
important if we consider that EST data are in continuous
The pipeline is conceived to run on hardware components
with low requirements, so as to meet the increasing demand
typical of the data used and to scale at affordable cost.
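The idea of distributing an EST collection over several workers can be illustrated with a minimal sketch. It is not the ParPEST implementation: the function names, the batch size, and the per-batch step (a GC-content calculation used as a placeholder for a real pipeline step) are assumptions made for illustration, and a local thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Est = Tuple[str, str]  # (identifier, nucleotide sequence)


def split_into_batches(ests: Sequence[Est], batch_size: int) -> Iterator[List[Est]]:
    """Yield consecutive batches of at most batch_size ESTs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ests), batch_size):
        yield list(ests[start:start + batch_size])


def gc_content_batch(batch: List[Est]) -> Dict[str, float]:
    """Placeholder pipeline step: GC fraction of every EST in one batch."""
    result = {}
    for est_id, seq in batch:
        seq = seq.upper()
        gc = sum(1 for base in seq if base in "GC")
        # An empty sequence gets 0.0 instead of raising ZeroDivisionError.
        result[est_id] = gc / len(seq) if seq else 0.0
    return result


def process_in_parallel(
    ests: Sequence[Est],
    step: Callable[[List[Est]], Dict[str, float]],
    batch_size: int = 2,
    workers: int = 4,
) -> Dict[str, float]:
    """Apply step to each batch concurrently and merge the partial results."""
    merged: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in batch order, whatever the completion order.
        for partial in pool.map(step, split_into_batches(ests, batch_size)):
            merged.update(partial)
    return merged
```

Because each batch is independent, adding workers (or, in a real deployment, nodes) shortens the wall-clock time of a step without changing its result, which is the property that makes scaling on inexpensive hardware attractive.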
Our efforts have focused on covering all the
automatic analyses useful to highlight structural features
of the data and to link the resulting data to biological
processes through standardized annotation such as Gene
Ontology and KEGG. This is fundamental to contribute to
the comprehension of transcriptional and post-transcriptional
mechanisms, to derive patterns of expression,
to characterize properties and relationships, and to uncover
still unknown biological functionalities.
Our goal was to set up an integrated computational platform
that exploits efficient computing, includes a comprehensive
information system, and ensures flexible queries
on varied fundamental aspects, also based on suitable
graphical views of the results, to support exhaustive and
faster investigations on challenging biological data collec-
The platform is designed to provide the
pipeline and its results through a user-friendly web interface.
Upon request, users can upload GenBank or Fasta format-
We offer free support for processing sequence collections
to the academic community under specific agreements.
We invite you to find contact details and to visit a
demo version of the web interface at http://
This work is supported by the Agronanotech Project (Ministry of Agricul-
We thank Prof. Luigi Frusciante and Prof. Gerardo Toraldo for all their sup-
port to our work.
We thank Anantharaman Kalyanaraman for his suggestions and updates
about PaCE and Enrico Raimondo for useful discussions.
1. Chou HH, Holmes MH: DNA sequence quality trimming and vector
removal. Bioinformatics 2001, 17:1093-104.
2. SeqClean a software for vector trimming [http://
3. PHRAP software [http://www.phrap.org/]
4. RepeatMasker software [http://www.repeatmasker.org/]
5. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for
aligning DNA sequences. J Comput Biol 2000, 7:203-214.
6. Kalyanaraman A, Aluru S, Kothari S, Brendel V: Efficient clustering
of large EST data sets on parallel computers. Nucleic Acids Res
7. Burke J, Davison D, Hide W: d2_cluster: a validated method for
clustering EST and full-length cDNA sequences. Genome Res
8. Malde K, Coward E, Jonassen I: A graph based algorithm for gen-
erating EST consensus sequences. Bioinformatics 2005,
9. Parkinson J, Guiliano DB, Blaxter M: Making sense of EST
sequences by CLOBBing them. BMC Bioinformatics 2002, 3:31.
10. Pertea G, et al.: TIGR Gene Indices clustering tools (TGICL): a
software system for fast clustering of large EST datasets. Bio-
informatics 2003, 19:651-652.
11. Boguski MS, Lowe TM, Tolstoshev CM: dbEST-database for
"expressed sequence tags". Nat Genet 1993, 4:332-333.
12. EGTDC: EST analysis [http://envgen.nox.ac.uk/est.html]
13. Mao C, Cushman JC, May GD, Weller JW: ESTAP – an automated
system for the analysis of EST data. Bioinformatics 2003,
14. Hotz-Wagenblatt A, Hankeln T, Ernst P, Glatting KH, Schmidt ER,
Suhai S: ESTAnnotator: A tool for high throughput EST anno-
tation. Nucleic Acids Res 2003, 31:3716-3719.
15. Rudd S: openSputnik – a database to ESTablish comparative
plant genomics using unsaturated sequence collections.
Nucleic Acids Res 2005, 33:D622-D627.
16. Kumar CG, LeDuc R, Gong G, Roinishivili L, Lewin HA, Liu L:
ESTIMA, a tool for EST management in a multi-project envi-
ronment. BMC Bioinformatics 2004, 5:176.
17. Xu H, et al.: EST pipeline system: detailed and automated EST
data processing and mining. Genomics Proteomics Bioinformatics
18. Pontius JU, Wagner L, Schuler GD: UniGene: a unified view of the
transcriptome. The NCBI Handbook 2003.