BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 8 2003, pages 1035–1036
Zerg: a very fast BLAST parser library
Apuã C.M. Paquola, Abimael A. Machado, Eduardo M. Reis,
Aline M. da Silva and Sergio Verjovski-Almeida∗
Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo
05508-900, São Paulo, SP, Brazil
Received on September 11, 2002; revised on December 3, 2002; accepted on January 16, 2003
Summary: Zerg is a library of sub-routines that parses
the output from all NCBI BLAST programs (Blastn, Blastp,
Blastx, Tblastn and Tblastx) and returns the attributes of
a BLAST report to the user. It is optimized for speed,
being especially useful for large-scale genomic analysis.
Benchmark tests show that Zerg is over two orders of
magnitude faster than some widely used BLAST parsers.
In the genomic era the number of DNA-sequence data
searches and comparisons is increasing at an extremely
fast pace. This is being accompanied by a huge increase
in the size of public databases such as GenBank, which now
contains over 15 million sequences
(http://www.ncbi.nlm.nih.gov/Genbank/). The most widely used
sequence search tool is BLAST (Altschul et al., 1997), which
generates a user-friendly alignment output that highlights
identities and discrepancies in aligned sequences. Manual
inspection of reports is adequate when only a limited number of
searches and comparisons are performed. For large sets of data, valuable
information can be gained by processing BLAST reports
automatically, usually with the help of a BLAST parser.
Several BLAST parser libraries and parsing programs
are available, including a recently published one (Xing
and Brendel, 2001). So why write another BLAST
parser? The reason is twofold: speed and flexibility. When
processing large multi-gigabyte output files, parser speed
becomes the limiting factor. When performing multiple
analyses, the availability of a fast parser library tool that
can be plugged into a program suited to one particular task
is an important factor. In fact, we have developed Zerg to
meet the challenge of processing a batch of 700000 ESTs
obtained in the Human Cancer Genome Project (Camargo
et al., 2001), as described below.
∗To whom correspondence should be addressed.
The design goal when writing Zerg was to provide a
fast BLAST output parser library that could be used by a
wide variety of programs. We opted to use Flex, a lexical
analyzer generator (http://www.gnu.org/software/flex/)
commonly used in compiler construction. The lexical
scanner generated by Flex is a C program with a fast
regular expression matching engine and an efficient
input-buffering scheme. In addition, the design of the Zerg library is
simple in the sense that its core provides a lexical scanner
with no additional features whose support could slow
down its main function. An optional convenience package
(Zerg-Perl2) is provided that creates a Perl object representing
an entire BLAST report for each of the FASTA
sequences analyzed in a multi-FASTA BLAST output.
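The separation described above, a bare token scanner with an optional layer that assembles per-query report objects on top of it, can be illustrated with a minimal Python sketch. This is not Zerg's code or API: the token names, the three line patterns and the sample text are all hypothetical, chosen only to resemble the general shape of an NCBI BLAST text report.

```python
import re

# Hypothetical token rules (NOT Zerg's actual token codes): one pattern per
# line type, tried in order, loosely mirroring how a lexical scanner works.
RULES = [
    ("QUERY_NAME", re.compile(r"^Query=\s*(\S+)")),
    ("SUBJECT_NAME", re.compile(r"^>\s*(\S+)")),
    ("SCORE_BITS", re.compile(r"^\s*Score\s*=\s*([\d.]+)\s+bits")),
]


def scan(lines):
    """Core layer: yield (token_code, value) pairs and nothing else."""
    for line in lines:
        for code, pattern in RULES:
            match = pattern.match(line)
            if match:
                yield code, match.group(1)
                break  # first matching rule wins; unmatched lines are skipped


def group_by_query(tokens):
    """Convenience layer: build one report dict per query from the tokens."""
    reports = []
    for code, value in tokens:
        if code == "QUERY_NAME":
            reports.append({"query": value, "hits": []})
        elif code == "SUBJECT_NAME" and reports:
            reports[-1]["hits"].append({"subject": value, "scores": []})
        elif code == "SCORE_BITS" and reports and reports[-1]["hits"]:
            reports[-1]["hits"][-1]["scores"].append(float(value))
    return reports


# Made-up fragment for illustration only; not real BLAST output.
SAMPLE = """\
BLASTN 2.2.1
Query= est_0001
>gb|AAA111| example subject
 Score = 50.1 bits (25), Expect = 2e-05
 Score = 40.0 bits (20), Expect = 0.001
Query= est_0002
>gb|BBB222| another subject
 Score = 99.6 bits (50), Expect = 1e-20
""".splitlines()

if __name__ == "__main__":
    # A caller interested only in query identifiers uses the core layer alone.
    for code, value in scan(SAMPLE):
        if code == "QUERY_NAME":
            print(value)
```

A program that needs only one field consumes `scan` directly and never pays for object construction, while `group_by_query` plays the role of the optional convenience layer.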
To test whether Zerg was indeed a fast engine,
we benchmarked it against other publicly
available parsers and against MuSeqBox, a BLAST parsing
program written in C++ that generates alignment
statistics (Xing and Brendel, 2001). The time taken by
each program to parse two 100 MB files containing
outputs from either Blastn or Blastx was measured
(Table 1). Zerg-C and Zerg-Perl were over two orders
of magnitude faster than the parsers provided
by either BioPerl (http://www.bioperl.org) or Boulder
(http://stein.cshl.org/software/boulder/). The optional
Zerg-Perl2 convenience package was one order of magnitude
faster than BioPerl. Zerg-C was over five times faster than
MuSeqBox. Zerg-MSB, a program that uses Zerg to perform
the same calculations as MuSeqBox with default
options, was three times faster than MuSeqBox itself.
Zerg has an additional advantage over MuSeqBox in
that the former is a modular library that can be plugged
into a Perl or C program to generate an output tailored
to contain information only from the fields of interest
extracted from a BLAST report. The latter is a standalone
program having a default output that may contain more
information than required for specific tasks, thus wasting
valuable computer time.
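The "average fold slower" figure reported in Table 1 is a simple ratio of mean elapsed times over three runs. The sketch below shows that arithmetic in Python. The timings are invented placeholders, not the paper's measurements, and the use of the sample standard deviation is an assumption, since the text does not say which SD estimator was used.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample SD) of elapsed times for one program."""
    return mean(runs), stdev(runs)


def fold_slower(runs, baseline_runs):
    """Ratio of a program's mean time to the baseline program's mean time."""
    return mean(runs) / mean(baseline_runs)


# Placeholder timings in seconds, three runs each. These values are
# hypothetical and serve only to exercise the arithmetic.
TIMINGS = {
    "baseline": [10.0, 11.0, 12.0],
    "other_parser": [1100.0, 1090.0, 1110.0],
}

if __name__ == "__main__":
    for name, runs in TIMINGS.items():
        avg, sd = summarize(runs)
        ratio = fold_slower(runs, TIMINGS["baseline"])
        print(f"{name}: mean={avg:.1f}s sd={sd:.1f}s fold_slower={ratio:.1f}")
```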
Bioinformatics 19(8) © Oxford University Press 2003; all rights reserved.
Table 1. Comparative performance of Zerg and three publicly available BLAST parsers in processing two 100 MB BLAST output files, generated with either
BLASTN or BLASTX
[Table body not preserved; column groups: BLASTN output file, BLASTX output file]
Real time (elapsed time) taken by each program was measured in three separate runs; mean and standard deviation (SD) for these three runs are shown. All
tests were performed using a 1 GHz Pentium-III with Linux. Average fold slower is the ratio between each observed mean time and the mean time taken by
Zerg-C, the test program for the Zerg C library. BioPerl (v. 1.0.2), Boulder (v. 1.27) and Zerg (v. 1.0.1) are libraries and were tested with Perl programs (e.g.
Zerg-Perl) that just print query sequence identifiers. Zerg-Perl2 is an optional convenience package that performs additional functions described in the text. In
contrast, MuSeqBox (v. 1.1) is a standalone C++ program in which the parser is an integral part. Besides parsing, MuSeqBox performs additional
calculations such as coverage of the matching sequences and generates its own default output. Zerg-MSB is an optional convenience package written in C
that uses Zerg to perform the additional calculations executed by MuSeqBox with default options and to generate an output.
The Zerg parsing engine is based on a single regular
expression that matches an entire BLAST report. This
approach confers the ability to detect corrupt or truncated
report files. A special token code (UNMATCHED) is
provided for text not matching any other lexical rule of
the scanner, permitting a program that uses Zerg to decide
how to handle inconsistencies in BLAST output files. This
is a feature not found in scanners that rely on separate
regular expression matches to determine each token
type and value. Zerg handles both single
and multiple HSPs in a BLAST output. One limitation of
the current version of Zerg is that it recognizes only the NCBI
BLAST file format.
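The role of a catch-all rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Zerg's grammar: it applies one combined alternation to each line rather than a single expression spanning the entire report, and every rule name except UNMATCHED is hypothetical. Because the last alternative accepts any text, every line receives exactly one token code, and text that fits no expected rule surfaces as UNMATCHED instead of being silently skipped.

```python
import re

# One combined expression; alternatives are tried in order and the final
# catch-all guarantees a match. Rule names other than UNMATCHED are
# hypothetical and the patterns are deliberately minimal.
LINE_RE = re.compile(
    r"(?P<QUERY_NAME>Query=\s*\S.*)"
    r"|(?P<SUBJECT_NAME>>\s*\S.*)"
    r"|(?P<SCORE>\s*Score\s*=\s*[\d.]+\s+bits.*)"
    r"|(?P<BLANK>\s*)"
    r"|(?P<UNMATCHED>.*)"
)


def tokenize(text):
    """Yield (line_number, token_code, line) for every line of text."""
    for number, line in enumerate(text.splitlines(), start=1):
        match = LINE_RE.fullmatch(line)  # always succeeds via the catch-all
        yield number, match.lastgroup, line


def find_unmatched(text):
    """Return (line_number, line) for lines that fit no expected rule."""
    return [(n, line) for n, code, line in tokenize(text) if code == "UNMATCHED"]


# Made-up fragments for illustration; the second ends mid-line, as a
# truncated output file might.
GOOD = (
    "Query= est_0001\n"
    "\n"
    ">gb|AAA111| example subject\n"
    " Score = 50.1 bits (25), Expect = 2e-05\n"
)
TRUNCATED = GOOD + " Score = 40."

if __name__ == "__main__":
    for number, line in find_unmatched(TRUNCATED):
        print(f"line {number}: unexpected text {line!r}")
```

The calling program, not the scanner, decides what to do with the flagged lines: abort, log them, or skip the affected report.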
Zerg is being used in our laboratory as an intermediate
step in building contigs from 700000 ESTs (Camargo et
al., 2001). In this case, all sequences are compared to each
other using BLASTN, thus generating a 60 GB output
file. A C++ program using the Zerg-C library was written
to process this file and partition sequences into classes
by minimal linkage clustering, a procedure similar to that
used by Burke et al. (1999). Processing this same file with
BioPerl, a widely used parser library, would take over 4
days. MuSeqBox would take over 2 hours. In comparison,
when using Zerg to parse this BLAST output, the task was
completed in approximately 25 min.
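Partitioning sequences into classes from pairwise hits can be sketched with a union-find structure, shown below in Python. The sketch assumes that "minimal linkage" means single-linkage behaviour, in which any one qualifying hit between two sequences places them in the same class; the text does not give the authors' exact criteria or their C++ implementation, and the sequence names are hypothetical.

```python
def cluster(pairs):
    """Partition sequence ids into classes given (query, subject) hit pairs.

    Any single hit joins two sequences (single-linkage behaviour, assumed).
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk to the root,
        # halving the path along the way to keep later lookups short.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    classes = {}
    for member in parent:
        classes.setdefault(find(member), set()).add(member)
    return sorted(sorted(members) for members in classes.values())


if __name__ == "__main__":
    # Hypothetical hit pairs, as a parser might emit them one at a time.
    hits = [("est1", "est2"), ("est2", "est3"), ("est4", "est5"), ("est6", "est6")]
    for members in cluster(hits):
        print(members)
```

Because pairs are consumed one at a time, the clustering step can read hits as the parser emits them without holding the whole BLAST output in memory; only the id-to-parent map grows with the number of sequences.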
Work supported by FAPESP, Fundação de Amparo à
Pesquisa do Estado de São Paulo and CNPq, Conselho
Nacional de Desenvolvimento Científico e Tecnológico.
Altschul,S., Madden,T. et al. (1997) Gapped BLAST and PSI-
BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
Burke,J., Davison,D. and Hide,W. (1999) d2_cluster: a validated
method for clustering EST and full-length cDNA sequences.
Genome Res., 9, 1135–1142.
Camargo,A., Samaia,H. et al. (2001) The contribution of 700000
‘ORF sequence tags’ to the definition of the human transcrip-
tome. Proc. Natl Acad. Sci. USA, 98, 12103–12108.
Xing,L. and Brendel,V. (2001) Multi-query sequence BLAST
output examination with MuSeqBox. Bioinformatics, 17, 744–