Database indexing for production MegaBLAST searches

Department of Health and Human Services, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics (Impact Factor: 4.98). 08/2008; 24(16):1757-64. DOI: 10.1093/bioinformatics/btn322
Source: PubMed


The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new 'indexed MegaBLAST' is faster than the 'non-indexed' version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: [corrected]
Supplementary data are available at Bioinformatics online.
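
The general idea described in the abstract, preprocessing the database once into a lookup structure and then scanning each query against it to find short exact seeds, can be illustrated with a toy k-mer index. This is only a minimal sketch under stated assumptions: the function names (build_index, find_seeds), the in-memory dictionary layout, the tiny sequences and the seed length k=4 are all hypothetical choices made for illustration, and none of it reflects the actual makembindex file format or the indexed MegaBLAST implementation.

```python
from collections import defaultdict


def build_index(db, k):
    """Map every k-mer in each database sequence to its (seq_id, offset) positions.

    Hypothetical illustration only; not the makembindex data structure.
    """
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def find_seeds(index, query, k):
    """Return (query_offset, seq_id, db_offset) for each exact k-mer match."""
    seeds = []
    for q in range(len(query) - k + 1):
        # .get avoids creating empty entries in the defaultdict for misses
        for seq_id, d in index.get(query[q:q + k], ()):
            seeds.append((q, seq_id, d))
    return seeds


# Toy database and query (made-up sequences)
db = {"chr_demo": "ACGTACGTTAGC", "chr_demo2": "TTAGCCA"}
index = build_index(db, k=4)
seeds = find_seeds(index, "GTTAGC", k=4)
for seed in seeds:
    print(seed)
```

The index is built once per database and reused for every query, which is the trade-off the abstract refers to: more preprocessing and storage in exchange for faster seed lookup at search time. A real search would go on to extend each seed into an alignment, which this sketch omits.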

Available from: Alejandro A Schaffer, Oct 14, 2015
    • "In order to identify the target sequence, a hepatic transcriptome (S. B. Roberts et al., 2012) was annotated using the GenBank database and National Center for Biotechnology Information (NCBI)-Basic Local Alignment Search Tool (BLAST) algorithm (Zhang et al., 2000; McGinnis and Madden, 2004; Morgulis et al., 2008). Use of the BLASTx algorithm identified the sequence as hif-1 and its associated complexes (i.e., PAS domain and hif-1α CTAD). "
    ABSTRACT: Hypoxia [dissolved oxygen (DO) < 2 mg L(-1)] is a major environmental perturbation for many aquatic ecosystems, particularly highly productive estuaries. Most research attention and understanding about the impacts of hypoxia on estuarine species has focused on the benthos, where hypoxia is most common. Although the pelagic zone is also susceptible to the effects of hypoxia, the biological interactions and consequences are not as well understood in marine environments because documenting exposure or avoidance of hypoxia is often difficult. Physiological biomarkers may provide a way to gain more detailed spatiotemporal information regarding species' exposure to hypoxia. Here, we identified and tested a hypoxia-specific responsive gene, hypoxia-inducible factor-1α (hif-1α), to evaluate its potential as a biomarker for hypoxia exposure in Pacific Herring (Clupea pallasii), an abundant and widely distributed pelagic fish species. We conducted controlled laboratory experiments to establish the level of elevated gene expression (>1sd normoxic mean), exposure amplification (≥2hrs), and reduction rate (ca. 24hrs) for hif-1α. These experiments provided some evidence of a lethal hypoxic limit of Pacific herring (ca. 2 mg L(-1), ≥4hrs). We then used these findings to evaluate the spatiotemporal patterns of hif-1α expression of Pacific herring in a seasonally hypoxic estuary, Hood Canal, Washington, U.S.A. Although gene expression did not parallel the local hypoxic conditions in the estuary, herring from the more severe hypoxic year (2013) had a higher probability of having elevated mRNA levels. These patterns indicate that hif-1α mRNA levels may not be directly indicative of local DO levels, but rather provide insight into hypoxia exposure over broader scales. Moreover, this study demonstrates key differences and limitations of hepatic hif-1α as a biomarker for a pelagic, highly mobile species versus more benthic organisms. Copyright © 2015. Published by Elsevier Inc.
    Comparative biochemistry and physiology. Part A, Molecular & integrative physiology 08/2015; 189. DOI:10.1016/j.cbpa.2015.07.016 · 1.97 Impact Factor
    • "Vector is a data structure which is mainly an array of elements, where each position of this array represents an information and inside this position, have another array informing where this information can be found. Some similar searching sequences techniques that uses vectors as inverted indexes are: SSAHA[22], BLAT [15], PatternHunter [17] the miBLAST [16], Megablast [25], MegaBlast [18], and Kalafus [14] which uses hash tables to align whole genomes. Transforming based methods are considered out of the scope of this work, but Jing [13] presents methods that use this technique. "
    ABSTRACT: The search for similar genetic sequences is one of the main bioinformatics tasks. Genetic sequence data banks are growing exponentially, and search techniques that run in linear time can no longer complete searches in the required time. Another problem is that the clock speed of modern processors is no longer growing as it did before; instead, processing capacity grows through the addition of more processing cores, and techniques that do not use parallel computing do not benefit from these extra cores. This work aims to combine data indexing techniques, which reduce the computational cost of searching, with parallelization of the search, which exploits the computational capacity of multi-core processors. To verify the viability of using these two techniques simultaneously, software that uses parallelization techniques with inverted indexes was developed. Experiments were executed to analyze the performance gain when parallelism is used, the search time gain, and the quality of the results compared with other search tools. The results of these experiments were promising: the parallelism gain exceeded the expected speedup, the search was 20 times faster than the parallelized NCBI BLAST, and the search results showed good quality when compared with this tool. The software source code is available at .
    • "The nucleotide sequence was identified as Pseudocercospora opuntiae BSJ1 (GenBank accession number: KF975410). Identity of sequences was determined based on the highest percentage (a minimum of 97%) of total nucleotide match with sequences from nucleotide database in the GenBank (Rosselló-Mora and Amann, 2001; Morgulis et al. 2008), and were corroborated with a phylogenetic analysis (Figure 6). "
    ABSTRACT: Black spot is an important fungal disease widely spread in different cactus pear production systems in Mexico. In Jalisco, the disease was detected in the 1990s; nowadays almost 100% of plantations are damaged by it. The objective of this paper was to study the morphological variability, pathogenicity and virulence of the causal agent in cactus pear production systems, for fruit and vegetable (nopalitos) crops, in Jalisco, Mexico. Pseudocercospora opuntiae was isolated and characterized morphologically and molecularly from cladodes collected in cactus pear production systems of Zapopan and Ojuelos showing advanced symptoms of the disease. Pseudocercospora opuntiae exhibited high growth rates and conidia development in malt extract at 2% in 16/8 h light/darkness at 26°C. Pathogenicity and virulence were tested in healthy cladodes under field and greenhouse conditions, as well as on individual cladodes, in vitro young explants and Phaseolus vulgaris inoculated with the fungus. Pseudocercospora opuntiae was able to infect under all established conditions; the first symptoms appeared 120 days after inoculation. This is the first report of isolation, identification, morphological and molecular characterization, and pathogenicity of the causal agent of cactus pear black spot in Jalisco, Mexico.
    Journal of the Professional Association for Cactus Development 04/2015; 17:1-12. · 0.30 Impact Factor