Zerg: A Very Fast BLAST Parser Library.
BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 8 2003, pages 1035–1036
Zerg: a very fast BLAST parser library
Apuã C.M. Paquola, Abimael A. Machado, Eduardo M. Reis,
Aline M. da Silva and Sergio Verjovski-Almeida∗
Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo,
05508-900, São Paulo, SP, Brazil
Received on September 11, 2002; revised on December 3, 2002; accepted on January 16, 2003
Summary: Zerg is a library of sub-routines that parses
the output from all NCBI BLAST programs (Blastn, Blastp,
Blastx, Tblastn and Tblastx) and returns the attributes of
a BLAST report to the user. It is optimized for speed,
being especially useful for large-scale genomic analysis.
Benchmark tests show that Zerg is over two orders of
magnitude faster than some widely used BLAST parsers.
In the genomic era the number of DNA sequence data
searches and comparisons is increasing at an extremely
fast pace. This is accompanied by a huge increase
in the size of public databases such as GenBank, which now
contains over 15 million sequences
(http://www.ncbi.nlm.nih.gov/Genbank/). The most widely
used sequence search tool is BLAST (Altschul et al., 1997),
which generates a user-friendly alignment output that
highlights identities and discrepancies in aligned
sequences. Manual inspection of reports is adequate when
only a limited number of searches and comparisons are
performed. For large sets of data, valuable
information can be gained by processing BLAST reports
automatically, usually with the help of a BLAST parser.
Several BLAST parser libraries and parsing programs
are available, including a recently published one (Xing
and Brendel, 2001). So why write another BLAST
parser? The reason is twofold: speed and flexibility. When
processing large multi-gigabyte output files, parser speed
becomes the limiting factor. When performing multiple
analyses, the availability of a fast parser library that
can be plugged into a program suited to one particular task
is an important factor. In fact, we developed Zerg to
meet the challenge of processing a batch of 700 000 ESTs
obtained in the Human Cancer Genome Project (Camargo
et al., 2001), as described below.
∗To whom correspondence should be addressed.
The design goal when writing Zerg was to provide a
fast BLAST output parser library that could be used by a
wide variety of programs. We opted to use Flex, a lexical
analyzer generator (http://www.gnu.org/software/flex/)
commonly used in compiler construction. The lexical
scanner generated by Flex is a C program with a fast
regular expression matching engine and an efficient
input-buffering scheme. In addition, the Zerg library design
is simple in the sense that its core provides a lexical scanner
with no additional features whose support could slow
down its main function. An optional convenience package
(Zerg-Perl2) is provided that creates a Perl object
representing an entire BLAST report for each of the FASTA
sequences analyzed in a multi-FASTA BLAST output.
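The token-stream design described above can be illustrated with a minimal Python sketch. It is not Zerg's code or API: the token codes (QUERY_NAME, SUBJECT_NAME, EVALUE), the patterns and the function names scan and query_ids are assumptions made for illustration, and Python's re module stands in for the Flex-generated C scanner. The sketch only shows the idea of a core that emits (code, value) pairs while the calling program keeps the fields it needs.

```python
import io
import re
from typing import Iterator, List, TextIO, Tuple

# Hypothetical token codes and patterns, for illustration only;
# they are not Zerg's actual constants or lexical rules.
RULES = [
    ("QUERY_NAME", re.compile(r"^Query=\s*(\S+)")),
    ("SUBJECT_NAME", re.compile(r"^>(\S+)")),
    ("EVALUE", re.compile(r"Expect(?:\(\d+\))?\s*=\s*([0-9.eE+-]+)")),
]


def scan(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (token_code, value) pairs, reading the report line by line."""
    for line in stream:
        for code, pattern in RULES:
            match = pattern.search(line)
            if match:
                yield code, match.group(1)
                break  # at most one token per line in this sketch


def query_ids(stream: TextIO) -> List[str]:
    """Keep only the query identifiers and discard every other token."""
    return [value for code, value in scan(stream) if code == "QUERY_NAME"]


# A made-up fragment in the style of NCBI BLAST text output.
SAMPLE = "\n".join([
    "BLASTN 2.2.1 [Apr-13-2001]",
    "",
    "Query= est_0001 sample read",
    "         (420 letters)",
    "",
    ">gb|AA000001 example subject",
    "          Length = 1500",
    "",
    " Score = 56.0 bits (28), Expect = 2e-05",
    " Identities = 40/44 (90%)",
    "",
    "Query= est_0002",
    "         (300 letters)",
]) + "\n"

if __name__ == "__main__":
    print(query_ids(io.StringIO(SAMPLE)))
```

Because scan is a generator over an open stream, the report is never held in memory as a whole, which loosely mirrors the buffered, single-pass behaviour attributed to the Flex scanner.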
To test whether Zerg was indeed a fast engine, we
benchmarked it against other publicly available parsers
and against MuSeqBox, a BLAST parsing program written
in C++ that generates alignment statistics (Xing and
Brendel, 2001). The time taken by each program to parse
two 100 MB files containing output from either Blastn or
Blastx was measured (Table 1). Zerg-C and Zerg-Perl were
over two orders of magnitude faster than the parsers
provided by either BioPerl (http://www.bioperl.org) or Boulder
(http://stein.cshl.org/software/boulder/). The optional
Zerg-Perl2 convenience package was one order of magnitude
faster than BioPerl. Zerg-C was over five times faster than
MuSeqBox. Zerg-MSB is a program that uses Zerg and
performs the same calculations as MuSeqBox with default
options; it is three times faster than MuSeqBox itself.
Zerg has an additional advantage over MuSeqBox in
that the former is a modular library that can be plugged
into a Perl or C program to generate an output tailored
to contain information only from the fields of interest
extracted from a BLAST report. The latter is a standalone
program having a default output that may contain more
information than required for specific tasks, thus wasting
valuable computer time.
Table 1. Comparative performance of Zerg and three publicly available BLAST parsers in processing two 100 MB BLAST output files, generated either with BLASTN or BLASTX
BLASTN output file
BLASTX output file
Real time (elapsed time) taken by each program was measured in three separate runs; mean and standard deviation (SD) for these three runs are shown. All
tests were performed using a 1 GHz Pentium-III with Linux. Average fold slower is the ratio between each observed mean time and the mean time taken by
Zerg-C, the test program for the Zerg C library. BioPerl (v. 1.0.2), Boulder (v. 1.27) and Zerg (v. 1.0.1) are libraries and were tested with Perl programs (e.g.
Zerg-Perl) that just print query sequence identifiers. Zerg-Perl2 is an optional convenience package that performs additional functions described in the text. In
contrast, MuSeqBox (v. 1.1) is a standalone C++ program in which the parser is an integral part. Besides parsing, MuSeqBox performs additional
calculations such as coverage of the matching sequences and generates its own default output. Zerg-MSB is an optional convenience package written in C
that uses Zerg to perform the additional calculations executed by MuSeqBox with default options and to generate an output.
Bioinformatics 19(8) © Oxford University Press 2003; all rights reserved.
The Zerg parsing engine is based on a single regular
expression that matches an entire BLAST report. This
approach confers the ability to detect corrupt or truncated
report files. A special token code (UNMATCHED) is
provided for text not matching any other lexical rule of
the scanner, permitting a program that uses Zerg to decide
how to handle inconsistencies in BLAST output files. This
is a feature not found in scanners that rely on separate
regular expression matches to determine each token
type and value. Zerg handles single or multiple HSPs
in a BLAST output. One limitation of
the current version of Zerg is that it only recognizes the
NCBI BLAST file format.
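The role of the UNMATCHED token can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Zerg grammar: all rules are combined into one alternation with a catch-all as the last branch, applied line by line, whereas the paper describes a single expression covering an entire report. Apart from UNMATCHED, the group names and patterns, and the function names tokens and find_unmatched, are hypothetical.

```python
import re
from typing import Iterator, List, Tuple

# All rules live in ONE compiled expression; the last alternative is a
# catch-all, so any line no other rule accepts surfaces as UNMATCHED.
# Rule names and patterns are illustrative assumptions, not Zerg's own.
TOKEN_RE = re.compile(
    r"""
    (?P<BLAST_VERSION>^T?BLAST[NPX]\s+\d\S*.*$)
  | (?P<QUERY_NAME>^Query=\s*\S+.*$)
  | (?P<QUERY_LENGTH>^\s+\(\s*[\d,]+\s+letters\)$)
  | (?P<SUBJECT_NAME>^>\S+.*$)
  | (?P<SUBJECT_LENGTH>^\s+Length\s*=\s*[\d,]+$)
  | (?P<SCORE_LINE>^\s*Score\s*=.*Expect.*$)
  | (?P<IDENTITIES>^\s*Identities\s*=\s*\d+/\d+.*$)
  | (?P<BLANK>^\s*$)
  | (?P<UNMATCHED>^.*$)
    """,
    re.VERBOSE,
)


def tokens(text: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, token_code, line) for every line of a report."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = TOKEN_RE.match(line)  # never None: UNMATCHED accepts any line
        yield lineno, match.lastgroup, line


def find_unmatched(text: str) -> List[Tuple[int, str]]:
    """Return the (line_number, line) pairs that no lexical rule accepted."""
    return [(n, line) for n, code, line in tokens(text) if code == "UNMATCHED"]


# A made-up, well-formed fragment and a copy that was cut off mid-line.
GOOD = "\n".join([
    "BLASTN 2.2.1 [Apr-13-2001]",
    "",
    "Query= est_0001 sample read",
    "         (420 letters)",
    "",
    ">gb|AA000001 example subject",
    "          Length = 1500",
    "",
    " Score = 56.0 bits (28), Expect = 2e-05",
    " Identities = 40/44 (90%)",
]) + "\n"
TRUNCATED = GOOD + " Score = 48.1 bits (2\n"

if __name__ == "__main__":
    print(find_unmatched(GOOD))
    print(find_unmatched(TRUNCATED))
```

The caller, not the scanner, decides what an UNMATCHED line means: it may abort, log the line number, or skip the damaged report and continue with the next query.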
Zerg is being used in our laboratory as an intermediate
step in building contigs from 700 000 ESTs (Camargo et
al., 2001). In this case, all sequences are compared to each
other using BLASTN, generating a 60 GB output
file. A C++ program using the Zerg-C library was written
to process this file and partition the sequences into classes
by minimal linkage clustering, a procedure similar to that
used by Burke et al. (1999). Processing this same file with
BioPerl, a widely used parser library, would take over 4
days. MuSeqBox would take over 2 hours. In comparison,
when using Zerg to parse this BLAST output, the task was
completed in approximately 25 min.
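The partitioning step can be sketched in Python with a union-find structure. This assumes that "minimal linkage clustering" means single-linkage clustering, in which two sequences fall in the same class whenever a chain of pairwise hits connects them. The function name cluster and the input format (pairs of sequence identifiers that already passed some similarity threshold) are assumptions; the authors' C++ program is not described in further detail.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group identifiers into classes linked by at least one chain of hits."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Register unseen identifiers as their own class, then walk to the root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walked path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for query, subject in pairs:
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_s] = root_q  # merge the two classes

    classes = defaultdict(set)
    for seq in list(parent):
        classes[find(seq)].add(seq)
    return sorted(sorted(members) for members in classes.values())


if __name__ == "__main__":
    # Hypothetical (query, subject) hits as a parser might emit them.
    hits = [("est1", "est2"), ("est2", "est3"), ("est4", "est5"), ("est6", "est6")]
    print(cluster(hits))
```

Each hit is processed once as it is read and then discarded, so memory use grows with the number of sequences rather than with the size of the BLAST output, which matters for a file of tens of gigabytes.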
Work supported by FAPESP, Fundação de Amparo à
Pesquisa do Estado de São Paulo and CNPq, Conselho
Nacional de Desenvolvimento Científico e Tecnológico.
Altschul,S., Madden,T. et al. (1997) Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
Burke,J., Davison,D. and Hide,W. (1999) d2_cluster: a validated
method for clustering EST and full-length cDNA sequences.
Genome Res., 9, 1135–1142.
Camargo,A., Samaia,H. et al. (2001) The contribution of 700 000
‘ORF sequence tags’ to the definition of the human
transcriptome. Proc. Natl Acad. Sci. USA, 98, 12103–12108.
Xing,L. and Brendel,V. (2001) Multi-query sequence BLAST
output examination with MuSeqBox. Bioinformatics, 17, 744–