Database indexing for production MegaBLAST searches

Department of Health and Human Services, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics (Impact Factor: 4.98). 08/2008; 24(16):1757-64. DOI: 10.1093/bioinformatics/btn322
Source: PubMed


The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new 'indexed MegaBLAST' is faster than the 'non-indexed' version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
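As a rough illustration of the idea of finding seeds by searching a database index, the following minimal Python sketch builds a k-mer index over a toy database once and then looks up each k-mer of a query. It is not the data structure used by makembindex or the algorithm in blastn; the function names build_index and find_seeds, the toy sequences, and the word size k=5 are all assumptions made for this example.

```python
from collections import defaultdict


def build_index(database, k):
    """Preprocess the database: map each k-mer to its (sequence id, offset) positions."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(index, query, k):
    """Return (query offset, sequence id, database offset) for every exact k-mer match."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict on a miss
        for seq_id, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, seq_id, dpos))
    return seeds


# Hypothetical toy database and query, for illustration only.
database = {"chr_a": "ACGTACGTTAGC", "chr_b": "TTAGCCGA"}
index = build_index(database, k=5)   # done once, reused for every query
seeds = find_seeds(index, "GGTTAGCC", k=5)
print(seeds)
```

In this sketch the indexing cost is paid once per database, so each query only pays for lookups. A production index would also need a compact encoding and a subsequent extension phase, both of which are omitted here.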
The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: [corrected]
Supplementary data are available at Bioinformatics online.



Available from: Alejandro A Schaffer
    • "The resulting DNA was sent to Macrogen Japan (Setagaya-ku, Tokyo, Japan) for direct sequencing with the PCR primers. A contiguous sequence was assembled manually by ClustalW (Cambridgeshire CB10 1SD, UK), and homology searches were conducted using nucleotide MEGA BLAST (Zhang et al. 2000; Morgulis et al. 2008). "
    ABSTRACT: The advent of molecular technologies allows for identification of organisms that were previously challenging or impossible to identify. Conventional polymerase chain reaction analyses of a segment of the small subunit ribosomal RNA gene from trypanorhynch plerocerci obtained from cultured and wild-caught amberjacks, Seriola dumerili and Seriola rivoliana, of the family Carangidae from Hawai‘i and Japan were found to be 100% identical, indicating that the cestodes from Japan and Hawai‘i are the same species, Protogrillotia zerbiae. The prevalence of the trypanorhynch plerocerci found in the musculature of Hawai‘i wild-caught S. dumerili and S. rivoliana was 86.9 and 72.7%, respectively. In stark contrast, no trypanorhynch plerocerci have been detected in S. rivoliana cultured in Hawai‘i and they are only rarely seen in S. dumerili cultured in Japan. Trypanorhynch plerocerci are part of a complex life cycle that involves transmission through at least two intermediate hosts before finally residing in a host shark species. The results of this study indicate that artificial propagation of the amberjacks using manipulated diets has most likely disrupted the life cycle of this tapeworm, thus reducing the prevalence of this parasite in farmed amberjacks and enhancing the marketability of cultured amberjack.
    Full-text · Article · Feb 2016 · Journal of the World Aquaculture Society
    • "These representative centroids, along with the sequences from groups 4 and 5 and those obtained in this study, were used to construct maximum-likelihood trees for each of the viral genes using the GTR+G+I model (selected using jModelTest 2 [35]) as implemented in MEGA 6.06c, with 500 bootstrap replications. MegaBLAST [36] was used to select the 50 GenBank sequences with highest identity to the CDS-trimmed sequences obtained in this study; this was done for each gene segment, resulting in a list of 400 high-identity sequences. This list was then used to determine high-identity sequence density for each state or province, i.e. the percentage of these 400 high-identity sequences that was recorded at that state or province. "
    ABSTRACT: Migratory aquatic birds play an important role in the maintenance and spread of avian influenza viruses (AIV). Many species of aquatic migratory birds tend to use similar migration routes, also known as flyways, which serve as important circuits for the dissemination of AIV. In recent years there has been extensive surveillance of the virus in aquatic birds in the Northern Hemisphere; in contrast, only a few studies have attempted to detect AIV in wild birds in South America. There are major flyways connecting South America to Central and North America, whereas avian migration routes between South America and the remaining continents are uncommon. As a result, it has been hypothesized that South American AIV strains would be more closely related to the strains from North America than to those from other regions in the world. We characterized the full genome of three AIV subtype H11N9 isolates obtained from ruddy turnstones (Arenaria interpres) on the Amazon coast of Brazil. For all gene segments, all three strains consistently clustered together within evolutionary lineages of AIV that had been previously described from aquatic birds in North America. In particular, the H11N9 isolates were remarkably closely related to AIV strains from shorebirds sampled at the Delaware Bay region, on the Northeastern coast of the USA, more than 5000 km away from where the isolates were retrieved. Additionally, there was evidence of genetic similarity to AIV strains from ducks and teals from interior USA and Canada. These findings corroborate that migratory flyways of aquatic birds play an important role in determining the genetic structure of AIV in the Western hemisphere, with a strong epidemiological connectivity between North and South America.
    Full-text · Article · Dec 2015 · PLoS ONE
    • "In order to identify the target sequence, a hepatic transcriptome (S. B. Roberts et al., 2012) was annotated using the GenBank database and National Center for Biotechnology Information (NCBI)-Basic Local Alignment Search Tool (BLAST) algorithm (Zhang et al., 2000; McGinnis and Madden, 2004; Morgulis et al., 2008). Use of the BLASTx algorithm identified the sequence as hif-1 and its associated complexes (i.e., PAS domain and hif-1α CTAD). "
    ABSTRACT: Hypoxia [dissolved oxygen (DO) < 2mg L(-1)] is a major environmental perturbation for many aquatic ecosystems, particularly highly productive estuaries. Most research attention and understanding about the impacts of hypoxia on estuarine species has focused on the benthos, where hypoxia is most common. Although the pelagic zone is also susceptible to the effects of hypoxia, the biological interactions and consequences are not as well understood in marine environments because documenting exposure or avoidance of hypoxia is often difficult. Physiological biomarkers may provide a way to gain more detailed spatiotemporal information regarding species' exposure to hypoxia. Here, we identified and tested a hypoxia-specific responsive gene, hypoxia-inducible factor-1α (hif-1α), to evaluate its potential as a biomarker for hypoxia exposure in Pacific Herring (Clupea pallasii), an abundant and widely distributed pelagic fish species. We conducted controlled laboratory experiments to establish the level of elevated gene expression (>1sd normoxic mean), exposure amplification (≥2hrs), and reduction rate (ca. 24hrs) for hif-1α. These experiments provided some evidence of a lethal hypoxic limit of Pacific herring (ca. 2 mg L(-1), ≥4hrs). We then used these findings to evaluate the spatiotemporal patterns of hif-1α expression of Pacific herring in a seasonally hypoxic estuary, Hood Canal, Washington, U.S.A. Although gene expression did not parallel the local hypoxic conditions in the estuary, herring from the more severe hypoxic year (2013) had a higher probability of having elevated mRNA levels. These patterns indicate that hif-1α mRNA levels may not be directly indicative of local DO levels, but rather provide insight into hypoxia exposure over broader scales. Moreover, this study demonstrates key differences and limitations of hepatic hif-1α as a biomarker for a pelagic, highly mobile species versus more benthic organisms. Copyright © 2015. Published by Elsevier Inc.
    No preview · Article · Aug 2015 · Comparative biochemistry and physiology. Part A, Molecular & integrative physiology