Article

A Deterministic Finite Automaton for Faster Protein Hit Detection in BLAST


Abstract

BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, taking typically minutes on modern workstations. Therefore, continuing evolution of BLAST--by improving its algorithms and optimizations--is essential to improve search times in the face of exponentially increasing collection sizes. We present an optimization to the first stage of the BLAST algorithm specifically designed for protein search. It produces the same results as NCBI-BLAST but in around 59% of the time on Intel-based platforms; we also present results for other popular architectures. Overall, this is a saving of around 15% of the total typical BLAST search time. Our approach uses a deterministic finite automaton (DFA), inspired by the original scheme used in the 1990 BLAST algorithm. The techniques are optimized for modern hardware, making careful use of cache-conscious approaches to improve speed. Our optimized DFA approach has been integrated into a new version of BLAST that is freely available for download at http://www.fsa-blast.org/.
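The abstract gives no implementation details, so the following is only a minimal sketch of how DFA-based hit detection can work in principle. It assumes a word length of 3 and exact-match words only (real protein BLAST also uses neighbourhood words scored against a substitution matrix), and it ignores the cache-conscious memory layout that the paper credits for its speed. The names `build_dfa` and `find_hits` are hypothetical and are not taken from FSA-BLAST.

```python
from collections import defaultdict


def build_dfa(query, w=3):
    """Build a toy automaton from the query (illustrative assumption only).

    A state is the previous w-1 symbols; each (state, symbol) transition
    stores the query offsets at which the completed word occurs.
    Only exact-match words are indexed here.
    """
    hits = defaultdict(list)
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        hits[(word[:-1], word[-1])].append(i)
    return hits


def find_hits(dfa, subject, w=3):
    """Scan the subject once and return (query_offset, subject_offset) pairs."""
    if len(subject) < w:
        return []
    out = []
    state = subject[:w - 1]
    for j in range(w - 1, len(subject)):
        sym = subject[j]
        # Emit every query offset stored on this transition.
        for q in dfa.get((state, sym), ()):
            out.append((q, j - w + 1))
        # The next state is the last w-1 symbols seen so far.
        state = (state + sym)[1:]
    return out


dfa = build_dfa("MKVLAAG")
print(find_hits(dfa, "AAKVLG"))
```

The point of the automaton form is that each subject symbol triggers one transition, which yields both the hits for the word just completed and the next state, rather than recomputing a table index for every word.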


... CUDA-BLASTP [18] and GPU-BLAST [19] mainly exploit coarse-grained parallelism in which a sequence alignment is mapped to only one thread. CUDA-BLASTP optimizes the Deterministic Finite-state Automaton (DFA), a two-level lookup table proposed in FSA-BLAST [27,28]. In GPU-BLAST, a vector of bits is allocated in shared memory to store the information about each possible sequence word. ...
... In order to reduce the execution time of hit detection, a lookup table is constructed based on the query sequence, which performs a hash-based matching during preprocessing in Stage 1. The lookup table can be a query-index table or a deterministic finite automaton (DFA) table [9,15,16,18,19,27,28]. FSA-BLAST proposed the use of DFA to speed up the process of hit detection [27,28]. ...
... The lookup table can be a query-index table or a deterministic finite automaton (DFA) table [9,15,16,18,19,27,28]. FSA-BLAST proposed the use of DFA to speed up the process of hit detection [27,28]. The original DFA requires an extra pointer for each word, which takes up a lot of storage space: on average, the pointers can account for more than half of the entire DFA size. ...
Article
Full-text available
In the field of computational biology, sequence alignment is a very important methodology. BLAST is a very common tool for performing sequence alignment in bioinformatics provided by National Center for Biotechnology Information (NCBI) in the USA. The BLAST server receives tens of thousands of queries every day on average. Among the procedures of BLAST, the hit detection process whose core architecture is a lookup table is the most time-consuming. In the latest work, a lightweight BLASTP on CUDA GPU with a hybrid query-index table was proposed for servicing the sequence query length shorter than 512, which effectively improved the query efficiency. According to the reported protein sequence length distribution, about 90% of sequences are equal to or smaller than 1024. In this paper, we propose an improved lightweight BLASTP to speed up the hit detection time for longer query sequences. The largest sequence is enlarged from 512 to 1024. As a result, one more bit is required to encode each sequence position. To meet the requirement, an extended hybrid query-index table (EHQIT) is proposed to accommodate three sequence positions in a four-byte table entry, making only one memory access sufficient to retrieve all the position information as long as the number of hits is equal to or smaller than three. Moreover, if there are more than three hits for a possible word, all the position information will be stored in contiguous table entries, which eliminates branch divergence and reduces memory space for pointers to overflow buffer. A square symmetric scoring matrix, Blosum62, is used to determine the relative score made by matching two characters in a sequence alignment. The experimental results show that for queries shorter than 512 our improved lightweight BLASTP outperforms the original lightweight BLASTP with speedups of 1.2 on average. When the number of hit overflows increases, the speedup can be as high as two. 
For queries shorter than 1024, our improved lightweight BLASTP can provide speedups ranging from 1.56 to 3.08 over the CUDA-BLAST. In short, the improved lightweight BLASTP can replace the original one because it can support a longer query sequence and provide better performance.
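The abstract states that three query positions fit in one four-byte entry once each position needs 10 bits, but it does not give the bit layout. The sketch below assumes one plausible layout: three 10-bit position fields in the low 30 bits and a hit count in the top 2 bits. The layout and the names `pack_entry` and `unpack_entry` are assumptions for illustration, not the EHQIT format itself, and the overflow case (more than three hits) is not modelled.

```python
POS_BITS = 10                     # enough for positions 0..1023
POS_MASK = (1 << POS_BITS) - 1    # 0x3FF
COUNT_SHIFT = 3 * POS_BITS        # hypothetical: count kept in the top 2 bits


def pack_entry(positions):
    """Pack up to three query positions into one 32-bit integer."""
    if len(positions) > 3:
        raise ValueError("more than three hits needs overflow storage")
    entry = len(positions) << COUNT_SHIFT
    for k, pos in enumerate(positions):
        if not 0 <= pos <= POS_MASK:
            raise ValueError("position does not fit in 10 bits")
        entry |= pos << (k * POS_BITS)
    return entry


def unpack_entry(entry):
    """Recover the list of query positions from a packed entry."""
    count = entry >> COUNT_SHIFT
    return [(entry >> (k * POS_BITS)) & POS_MASK for k in range(count)]


entry = pack_entry([5, 1023, 512])
print(hex(entry), unpack_entry(entry))
```

Under this assumed layout, a single 32-bit read returns both the number of hits and all of their positions, which matches the abstract's point that one memory access suffices when a word has at most three hits.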
... BLAST performs comparisons between a pair of sequences in order to find regions of local similarity [1]. The popular BLAST derivatives are NCBI-BLAST (web based and standalone versions are available), WU-BLAST, Paracel BLAST and fast search algorithm (FSA)-BLAST [2, 3, 4, 5, 6, 7]. Among them, NCBI-BLAST (standalone) and FSA-BLAST are open source programs and any one can download (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/Latest, ...
... In view of the above reasons, any improvement to the BLAST algorithm that can reduce runtime space and time without affecting sensitivity and selectivity would be very much desirable [10]. Over the years several modifications to the fundamental algorithms and new heuristics in BLAST were proposed to improve speed and minimize runtime space [3][4][5][6][7], [11][12][13][14][15][16][17][18]. Basically, BLAST program was designed to analyze both protein and DNA sequences. ...
... Basically, BLAST program was designed to analyze both protein and DNA sequences. It has mainly four algorithmic steps namely finding hits, performing un-gapped alignments, performing gapped alignments and computing trace back and outputting the results [2, 3, 6, 7]. The main functional differences between NCBI BLAST and FSA BLAST are, one is the structure used for finding hits between a query sequence and database sequence during the hit detection process and the other is using semi-gapped and restricted insertion alignments during alignment stage of the algorithm [6,7]. ...
... In addition, GPU-BLAST uses a presence bit vector to hold information about whether a specific amino acid word is present in the query and allocates the vector to shared memory. CUDA-BLASTP compresses the deterministic finite automaton, used in FSA-BLAST [22,23], by a two-level lookup table. On the other hand, cuBLASTP [15,16] and H-BLAST [6] exploit fine-grained parallelism by assigning many threads for performing each individual phase of sequence search. ...
... In BLAST, to detect hits quickly, the query sequence is preprocessed and converted into a lookup table to perform a hash-based matching in the first stage. A lookup table can be organized as a query-index [6,18] or deterministic finite automaton (DFA)-based table [15,16,19,22,23]. ...
... FSA-BLAST proposed a deterministic finite automaton (DFA) to accelerate the hit detection process, which is a cache-conscious approach [22,23]. The original DFA has the drawback that an additional pointer is required for each code word. ...
Article
Full-text available
The BLAST server in the National Center for Biotechnology Information in the USA receives tens of thousands of queries per day on average. However, the service is always the same for every query even though query lengths vary significantly. In fact, the lengths of a large portion of protein sequences are less than 500. On the other hand, the hit detection process consumes the most of the execution time of BLAST and its core architecture is a lookup table. Following the above reasons, we propose a lightweight BLASTP for servicing not-too-long queries, where a hybrid query-index table is proposed accordingly. Each table entry consists of four bytes that can store up to three query positions. Therefore, a sequence word usually requires only one memory fetch to retrieve its hit information. Furthermore, additional dummy entries are embedded into the table and interleaved with original entries. The entries without any hits and dummy entries both can be used to buffer spilled query positions. The above features result in a much smaller lookup table with a higher utilization rate and a lower cache miss ratio. Experimental results show that the lightweight BLASTP outperforms CUDA-BLASTP with speedups ranging from 1.82 to 3.37 based on the first two critical phases.
... BLAST performs comparisons between a pair of sequences in order to find regions of local similarity [1]. The popular BLAST derivatives are NCBI-BLAST (web based and standalone versions are available) [2], [3], WU-BLAST [4], Paracel BLAST [5] and fast search algorithm (FSA)-BLAST [6], [7]. Among them, NCBI-BLAST (standalone) and FSA-BLAST are open source programs and any one can download (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/Latest, ...
... Because of its widespread usage (used over 120,000 times each day [8]) any improvement to the BLAST algorithm that can reduce runtime space and time without affecting sensitivity and selectivity [9] would be very much desirable. Over the years several modifications to the fundamental algorithms and new heuristics in BLAST were proposed to improve speed and minimize runtime space [3][4][5][6][7], [10][11][12][13][14][15][16]. This paper proposes a modified data structure that reduces the runtime space during the hit detection stage of the FSA protein BLAST algorithm. ...
... Basically, BLAST program was designed to analyze both protein and DNA sequences. It has mainly four algorithmic steps namely finding hits, performing un-gapped alignments, performing gapped alignments and computing trace back and outputting the results [2], [3], [6], [7]. The main functional differences between NCBI BLAST and FSA BLAST are, one is the structure used for finding hits between a query sequence and database sequence during the hit detection stage and the other is using semi-gapped and restricted insertion alignments during alignment stage of the algorithm [6,7]. ...
... Our analysis shows that the two-hit mode provides faster search times than the one-hit mode with comparable search accuracy, in agreement with Altschul et al. [1997]. A preliminary version of the results and discussions presented in this chapter appeared in Cameron et al. [2006c]. ...
... Finally, we provide concluding remarks in Section 5.4. A preliminary version of the results and discussions presented in this chapter appeared in Cameron et al. [2006c]. ...
... It is remarkable that, while quite an intense effort was aimed at the increase of search sensitivity (which led to the invention of many new tools and even concepts [2,[7][8][9][10][11][12][13]), for many years only a small number of studies were dedicated to the improvement of speed of the generic search [14][15][16][17]. In many cases, such works addressed particular problems and were actually not applicable to most generic search tasks [18][19][20]. ...
... (TIF) Figure S2 Performance comparison for quick protein similarity search tools calculated using the tools' own reporting functionality. All measurements were taken at default parameters except for the "PSimScan2" series ('approx': 0.79, 'kthresh': 14). The Streptococcus pneumoniae R6 proteome was used as the query set and the SwissProt/Uniprot database as the subject set. A. Found similarities by E-value (according to the tools' own reporting, here and below). ...
Article
Full-text available
In the era of metagenomics and diagnostics sequencing, the importance of protein comparison methods of boosted performance cannot be overstated. Here we present PSimScan (Protein Similarity Scanner), a flexible open source protein similarity search tool which provides a significant gain in speed compared to BLASTP at the price of controlled sensitivity loss. The PSimScan algorithm introduces a number of novel performance optimization methods that can be further used by the community to improve the speed and lower hardware requirements of bioinformatics software. The optimization starts at the lookup table construction, then the initial lookup table-based hits are passed through a pipeline of filtering and aggregation routines of increasing computational complexity. The first step in this pipeline is a novel algorithm that builds and selects 'similarity zones' aggregated from neighboring matches on small arrays of adjacent diagonals. PSimScan performs 5 to 100 times faster than the standard NCBI BLASTP, depending on chosen parameters, and runs on commodity hardware. Its sensitivity and selectivity at the slowest settings are comparable to the NCBI BLASTP's and decrease with the increase of speed, yet stay at the levels reasonable for many tasks. PSimScan is most advantageous when used on large collections of query sequences. Comparing the entire proteome of Streptococcus pneumoniae (2,042 proteins) to the NCBI's non-redundant protein database of 16,971,855 records takes 6.5 hours on a moderately powerful PC, while the same task with the NCBI BLASTP takes over 66 hours. We describe innovations in the PSimScan algorithm in considerable detail to encourage bioinformaticians to improve on the tool and to use the innovations in their own software development.
... BLAT [13] uses an index stored in memory. Cameron and collaborators designed a "cache-conscious" implementation of the initial word finding module of BLAST [14]. The concerns listed in this section and the start of a new C++ toolkit at the NCBI [15] motivated us to rewrite the BLAST code and release a completely new set of command-line applications. ...
... A baseline blastx application that does not split the query was prepared. Cameron et al. [14] replaced the BLAST lookup table with a DFA (Deterministic Finite Automaton) to improve the cache behavior. They reported a 10-15% reduction in search time for BLASTP (protein-protein) searches. ...
Article
Full-text available
Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.
... For each subsequence in the query, its location is stored in the lookup structure at the corresponding entry. A partial lookup structure for the query gaacacaat is illustrated in Figure 2. The choice of lookup structure requires care and we have recently described what we believe is the best choice for protein search (Cameron, Williams & Cannane 2006). We discuss the lookup structure used for nucleotide search in more detail in Section 3.1. ...
... In particular, only one hit between a query and collection sequence triggers a second stage, ungapped alignment; the published description describes the blastp approach that requires two matches. This suggests that the two-hit mode of operation improves search times for protein searches only (see Cameron et al. (2006) for a more detailed comparison of one-hit and two-hit modes of blastp). In addition, neighbourhood words are not used for nucleotide search, that is, each hit must be an exact match between a query and collection subsequence; this is in contrast to both the 1990 and 1997 descriptions of the approach. ...
Article
Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/.
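The abstract describes comparing byte-packed sequences four bases at a time using lookup tables, without specifying the encoding. As a rough sketch of the idea, the code below assumes a 2-bit code per base (A=0, C=1, G=2, T=3), four bases per byte, and a 256-entry table indexed by the XOR of two packed bytes. The encoding, the table, and the names `pack` and `count_matches` are illustrative assumptions, not the FSA-BLAST format.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding


def pack(seq):
    """Pack a nucleotide string (length a multiple of 4) into bytes."""
    if len(seq) % 4:
        raise ValueError("sketch only handles lengths divisible by 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | CODE[ch]
        out.append(b)
    return bytes(out)


# MATCHES[x] = number of 2-bit fields in x that are zero, i.e. the number of
# identical bases when x is the XOR of two packed bytes.
MATCHES = [sum(((x >> s) & 3) == 0 for s in (0, 2, 4, 6)) for x in range(256)]


def count_matches(a, b):
    """Count identical bases between two packed sequences, one byte at a time."""
    return sum(MATCHES[x ^ y] for x, y in zip(a, b))


print(count_matches(pack("ACGTACGT"), pack("ACGAACGT")))
```

Each table lookup handles four bases without unpacking them, which is the kind of saving the abstract attributes to working directly on the compressed representation.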
... Since we are interested in improving performance with no loss in accuracy we do not consider these non-default settings further. Overall, our clustering approach with default parameters combined with improvements to the gapped alignment (Cameron et al., 2004) and hit detection (Cameron et al., 2006) stages of BLAST allow the speed of FSA-BLAST to be double that of NCBI-BLAST with no significant effect on accuracy. Both versions of BLAST produce ROC scores 0.017 below the optimal Smith-Waterman algorithm. ...
Article
Full-text available
We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
... Hence it seems to be worth pursuing ideas from [44], where a novel method of compression, based on wavelets, was proposed for dictionaries of words. Another possible improvement could be integration of the cache-conscious hashing DFA to improve efficiency of page-swapping, as described in [45]. However, we would like to recall here that our overall goal of investigating the seed-based hot spot search was to reduce the need for large information storage by choosing only those hits that seem important. ...
Article
Full-text available
In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI BLAST tool: SeedBLAST and CTX-PSI Blast. Both may benefit from the knowledge about amino acid composition specific to a given protein family: SeedBLAST uses an advisedly designed seed, while CTX-PSI BLAST extends PSI BLAST with the context-specific substitution model. The seeding technique became central in the theory of sequence alignment. There are several efficient tools applying seeds to DNA homology search, but not to protein homology search. In this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method computes, by an evolutionary algorithm, seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate a superiority of SeedBLAST in terms of efficiency, especially in the case of twilight zone hits. The contextual substitution model has been proven to increase sensitivity of protein alignment. In this paper we perform a next step in the contextual alignment program. We announce a contextual version of the PSI-BLAST algorithm, an iterative version of the NCBI-BLAST tool. The experimental evaluation has been performed demonstrating a significantly higher sensitivity compared to the ordinary PSI-BLAST algorithm.
... To evaluate whether sequences are likely to be protein coding genes we consider sequence conservation, composition, and overlap in the genome. Conservation is determined by a sequence similarity search using FSA-BLAST [19] against a user-provided sequence database (which we call the GRC BLAST database). Composition is evaluated using entropy-density profiles introduced by Zhu et al. [4], and subsequently used in MED 2.0 [20] and Glimmer3 [1]. ...
Article
Full-text available
As sequencing costs have decreased, whole genome sequencing has become a viable and integral part of biological laboratory research. However, the tools with which genes can be found and functionally characterized have not been readily adapted to be part of the everyday biological sciences toolkit. Most annotation pipelines remain as a service provided by large institutions or come as an unwieldy conglomerate of independent components, each requiring their own setup and maintenance. To address this issue we have created the Genome Reverse Compiler, an easy-to-use, open-source, automated annotation tool. The GRC is independent of third party software installs and only requires a Linux operating system. This stands in contrast to most annotation packages, which typically require installation of relational databases, sequence similarity software, and a number of other programming language modules. We provide details on the methodology used by GRC and evaluate its performance on several groups of prokaryotes using GRC's built in comparison module. Traditionally, to perform whole genome annotation a user would either set up a pipeline or take advantage of an online service. With GRC the user need only provide the genome he or she wants to annotate and the function resource files to use. The result is high usability and a very minimal learning curve for the intended audience of life science researchers and bioinformaticians. We believe that the GRC fills a valuable niche in allowing users to perform explorative, whole-genome annotation.
... Histogram on the right side shows results of alignment of proteins which we know to be unrelated to Antigens with Antigens. We can see that SeedBLAST finds fewer non-homology-related alignments than PSI-BLAST does. ... of the cache-conscious hashing DFA to improve efficiency of page-swapping, as described in (Cameron et al., 2006). However, we would like to recall here that our overall goal of investigating the seed-based hot spot search was to reduce the need for large information storage by choosing only those hits that seem important. ...
Conference Paper
Full-text available
The seeding technique became central in the theory of sequence alignment and there are several efficient tools applying seeds to DNA homology search. Recently, a concept of subset seeds has been proposed for similarity search in protein sequences. We experimentally evaluate the applicability of subset seeds to protein homology search. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method computes, by an evolutionary algorithm, seeds that are specifically designed for a given protein family. The representation of seeds by deterministic finite automata (DFAs) is developed and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original NCBI-BLAST on the GPCR protein family. Our results demonstrate a clear superiority of SeedBLAST in terms of efficiency, especially in the case of twilight zone hits. SeedBLAST is an open source software freely available http://bioputer.mimuw.edu.pl/papers/sblast. Supplementary material and user manual are also provided.
... The computational load (the number of sequences in a database) is known in advance and is distributed equally across the SPEs. These two implementations impose a maximum length of 800 characters on the query sequence and achieve results 1.6 times faster than the fastest SSE-instruction-based implementation described in [10]. Zhang et al. [76] propose an implementation of the FSA-BLASTP program [77], a variant of BLASTP, on the CELL. First, the PPE generates a DFA (Deterministic Finite-state Automaton) and transmits it to all the SPEs. ...
Article
Full-text available
Sequence comparison is one of the main bioinformatics tasks. New sequencing technologies are leading to rapid growth in genomic data and strengthen the need for fast and efficient tools to perform this task. In this thesis, a new algorithm for intensive sequence comparison is proposed. It has been specifically designed to exploit all forms of parallelism in today's microprocessors (SIMD instructions, multi-core architecture). This algorithm is also well suited to hardware accelerators such as FPGA or GPU boards. The algorithm has been implemented in the PLAST software (Parallel Local Alignment Search Tool). Different versions are available according to the data to process (protein and/or DNA). An MPI version has also been developed. Depending on the nature of the data and the type of technology, speedups from 3 to 20 have been measured compared with the reference software, BLAST, at the same level of quality.
... Blast is faster, but its results are less precise than those of dynamic programming algorithms. With the development of HPC, much research on parallel Blast has been done, such as NCBI-Blast, FSA-Blast [31], CUDA-Blastp [57], cuBlastp [57] and Hadoop-Blast [58]. NCBI-Blast is the most popular Blast implementation; it runs on multi-core platforms and is supported by NCBI. ...
Article
Full-text available
The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the biological data amount is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. At last we discuss the open issues in big biological data analytics.
... The raw data were filtered to remove low-quality reads using FASTP [4]. The filtered reads were compared to the non-redundant protein sequence database using BLAST [5] to discover virus-related sequences. These sequences were extracted and compared to the online GenBank nucleotide database to determine whether they belonged to a known viral pathogen. ...
Article
Full-text available
A novel negevirus, tentatively named Manglie virus (MaV), was isolated from Culex tritaeniorhynchus from the village of Manglie, Yunnan, China, in August 2011. It was identified by high-throughput sequencing of cell culture supernatants, and the complete genome was sequenced using an Illumina MiSeq sequencer. The complete MaV genome comprised 9,218 nt encoding three hypothetical proteins and had a poly(A) tail. BLASTn analysis showed that the genome had the greatest similarity to Ngewotan virus strain Nepal22, with query coverage of 100% and 79% identity. Genomic and phylogenetic analyses demonstrated that MaV should be considered a novel negevirus.
... Since we are interested in improving performance with no loss in accuracy we do not consider these non-default settings further. Overall, our clustering approach with default parameters combined with improvements to the gapped alignment (Cameron et al., 2004) and hit detection (Cameron et al., 2005) stages of BLAST more than double the speed of FSA-BLAST compared to NCBI-BLAST with no significant effect on accuracy. Both versions of BLAST produce ROC scores 0.017 below the optimal Smith-Waterman algorithm. Figure 2 shows a comparison of clustering times between CD-HIT and our clustering approach for four different releases of the GenBank NR database; details of the collections used are given in Table 2. The results show that the clustering time of our approach is linear with the collection size and the CD-HIT approach is superlinear (Figure 2). ...
Conference Paper
We present a new approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach in BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST, available from http://www.fsa-blast.org/. As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
... Since we are interested in improving performance with no loss in accuracy we do not consider these non-default settings further. Overall, our clustering approach with default parameters combined with improvements to the gapped alignment (Cameron et al., 2004) and hit detection (Cameron et al., 2006c) stages of BLAST allow the speed of FSA-BLAST to double that of NCBI-BLAST with no significant effect on accuracy. Both versions of BLAST produce ROC scores 0.017 below the optimal Smith-Waterman algorithm. ...
... Low-quality reads were removed from the raw data using FASTP [14]. To identify virus-related sequences, we compared the remaining sequences with those in the non-redundant protein sequence database using BLAST [15]. The matched sequences were compared with those in the GenBank nucleotide database. ...
Article
Full-text available
We detected a novel bovine hepacivirus N (HNV) subtype, IME_BovHep_01, in the serum of cattle in Tengchong, Yunnan, China, by high-throughput sequencing. The complete genome of IME_BovHep_01 was sequenced using an Illumina MiSeq sequencer and found to be 8850 nt in length, encoding one hypothetical protein. BLASTn analysis showed that the genome sequence shared similarity with the bovine hepacivirus isolate BovHepV_209/Ger/2014, with 88% query coverage and 70.8% identity. However, the highest similarity was to bovine hepacivirus N strain BRBovHep_RS963, for which only a partial genome sequence is available, with 68% query coverage and 81.5% identity. Sequence comparisons and phylogenetic analysis suggested that IME_BovHep_01 is a novel HNV subtype. Importantly, IME_BovHep_01 is the first member of this new genotype for which the complete genome sequence was determined.
Article
Sequence comparison is one of the fundamental tasks of bioinformatics. New sequencing technologies are accelerating the production of genomic data and increasing the need for fast, efficient tools to perform this task. In this thesis, we propose a new algorithm for intensive sequence comparison, explicitly designed to exploit all forms of parallelism present in the latest generation of microprocessors (SIMD instructions, multicore architectures). The algorithm also adapts to the massive parallelism found on accelerators such as FPGAs or GPUs. It has been implemented in the PLAST software (Parallel Local Alignment Search Tool). Different versions are available depending on the data to be processed (protein and/or DNA). An MPI version has also been developed for deployment on a cluster of PCs. Depending on the nature of the data and the technologies used, speedups of 3 to 20 have been measured relative to the reference tool in the field, BLAST, at an equivalent level of quality.
Article
The enormous growth of biological sequence databases has caused bioinformatics to be rapidly moving towards a data-intensive, computational science. As a result, the computational power needed by bioinformatics applications is growing rapidly as well. The recent emergence of low cost parallel multicore accelerator technologies has made it possible to reduce execution times of many bioinformatics applications. In this paper, we demonstrate how the Cell Broadband Engine can be used as a computational platform to accelerate two approaches for protein sequence database scanning: exhaustive and heuristic. We present efficient parallelization techniques for two representative algorithms: the dynamic programming based Smith-Waterman algorithm and the popular BLASTP heuristic. Their implementation on a Playstation®3 leads to significant runtime savings compared to corresponding sequential implementations.
Article
Full-text available
Scanning protein sequence databases is an often repeated task in computational biology and bioinformatics. However, scanning large protein databases, such as GenBank, with popular tools such as BLASTP requires long runtimes on sequential architectures. Due to the continuing rapid growth of sequence databases, there is a high demand to accelerate this task. In this paper, we demonstrate how GPUs, powered by the Compute Unified Device Architecture (CUDA), can be used as an efficient computational platform to accelerate the BLASTP algorithm. In order to exploit the GPU’s capabilities for accelerating BLASTP, we have used a compressed deterministic finite state automaton for hit detection as well as a hybrid parallelization scheme. Our implementation achieves speedups up to 10.0 on an NVIDIA GeForce GTX 295 GPU compared to the sequential NCBI BLASTP 2.2.22. CUDA-BLASTP source code is available at https://sites.google.com/site/liuweiguohome/software.
Article
Full-text available
Motivation: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters. Availability: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords Contact: richa{at}helix.nih.gov
Article
Full-text available
This paper presents the first ever reported implementation of the Gapped Basic Local Alignment Search Tool (Gapped BLAST) for biological sequence alignment, with the Two-Hit method, on CUDA (compute unified device architecture)-compatible Graphic Processing Units (GPUs). The latter have recently emerged as relatively low cost and easy to program high performance platforms for general purpose computing. Our Gapped BLAST implementation on an NVIDIA Geforce 8800 GTX GPU is up to 2.7x quicker than the most optimized CPU-based implementation, namely NCBI BLAST, running on a Pentium4 3.4 GHz desktop computer with 2GB RAM.
Conference Paper
Protein database search is an important method in the field of computational biology. There are a large number of sequences in an average database, which makes such searches rather time and resource consuming. With the rapid growth in size of these databases in the past years, there came a need to speed up the search and, consequently, any alignments performed on such databases. This paper presents an acceleration of the database search tool sw#DB, which is based on a CUDA implementation of the Smith-Waterman algorithm. We achieved the speedup by reducing database size. The whole database was divided into seeds of a fixed length. The positions of these seeds and the corresponding sequence indexes from the database are then stored in a hash container. This allows for a constant time lookup of all the positions of a seed in every sequence of a database. Potential alignment candidate sequences for a query are filtered using this method, forwarding only those which contain at least one seed from the query to sw#DB. This reduces the number of alignments performed. Overall, it brings a speedup of around three times compared to the basic sw#DB tool, based solely on the Smith-Waterman algorithm, with almost no loss of accuracy. The implementation is written in the CUDA and C programming languages. For large queries, an MPI implementation with multiple CUDA cards is used.
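The seed-based filter described in the abstract above can be sketched roughly as follows. This is a minimal illustration in Python, not the sw#DB implementation (which is written in CUDA and C); the function names `build_seed_index` and `candidate_sequences`, the use of overlapping seeds, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict


def build_seed_index(database, seed_len):
    """Map every fixed-length seed to the (sequence index, position) pairs where it occurs.

    Hypothetical helper: overlapping seeds are an assumption of this sketch.
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - seed_len + 1):
            index[seq[pos:pos + seed_len]].append((seq_id, pos))
    return index


def candidate_sequences(query, index, seed_len):
    """Return the sorted indexes of database sequences sharing at least one seed with the query."""
    hits = set()
    for pos in range(len(query) - seed_len + 1):
        # dict.get avoids inserting empty entries into the defaultdict.
        for seq_id, _ in index.get(query[pos:pos + seed_len], ()):
            hits.add(seq_id)
    return sorted(hits)


database = ["MKTAYIAKQR", "GGGGGGGG", "AYIAK"]
index = build_seed_index(database, seed_len=4)
candidates = candidate_sequences("TAYIA", index, seed_len=4)
```

Only the sequences in `candidates` would then be passed on to the expensive Smith-Waterman alignment; each seed lookup is an average constant-time hash access.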
Conference Paper
The enormous growth of biological sequence databases has caused bioinformatics to be rapidly moving towards a data-intensive, computational science. As a result, the computational power needed by bioinformatics applications is growing rapidly as well. The recent emergence of low cost parallel accelerator technologies has made it possible to reduce execution times of many bioinformatics applications. In this paper, we demonstrate how the PlayStation®3, powered by the Cell Broadband Engine, can be used as an efficient computational platform to accelerate the popular BLASTP algorithm.
Conference Paper
Detection of highly similar sequences within genomic collections has a number of applications, including the assembly of expressed sequence tag data, genome comparison, and clustering sequence collections for improved search speed and accuracy. While several approaches exist for this task, they are becoming infeasible, either in space or in time, as genomic collections continue to grow at a rapid pace. In this paper we present an approach based on document fingerprinting for identifying highly similar sequences. Our approach uses a modest amount of memory and executes in a time roughly proportional to the size of the collection. We demonstrate substantial speed improvements compared to the CD-HIT algorithm, the most successful existing approach for clustering large protein sequence collections.
Article
Several linear-time algorithms for automata-based pattern matching rely on failure transitions for efficient back-tracking. Like epsilon transitions, failure transitions do not consume input symbols, but unlike them, they may only be taken when no other transition is applicable. At a semantic level, this conveniently models catch-all clauses and allows for compact language representation. This work investigates the transition-reduction problem for deterministic finite-state automata (DFA). The input is a DFA A and an integer k. The question is whether k or more transitions can be saved by replacing regular transitions with failure transitions. We show that while the problem is NP-complete, there are approximation techniques and heuristics that mitigate the computational complexity. We conclude by demonstrating the computational difficulty of two related minimisation problems, thereby cancelling the ongoing search for efficient algorithms.
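To make the semantics of failure transitions concrete, here is a minimal Python sketch of an automaton run in which a failure transition is followed, without consuming input, only when the current state has no regular transition for the next symbol. The dictionary encoding, the function names `step` and `run`, and the toy automaton (which ends in state 2 whenever the input ends in "ab") are assumptions for illustration, not constructions from the paper.

```python
def step(state, symbol, transitions, failure):
    """Advance one input symbol, taking failure transitions only when no regular one applies."""
    while (state, symbol) not in transitions:
        if state not in failure:
            raise ValueError(f"no transition for {symbol!r} from state {state}")
        # Failure transition: change state without consuming the symbol.
        state = failure[state]
    return transitions[(state, symbol)]


def run(start, text, transitions, failure):
    """Return the state reached after reading the whole text."""
    state = start
    for symbol in text:
        state = step(state, symbol, transitions, failure)
    return state


# Toy automaton over {a, b, c}: states 1 and 2 keep only the transitions that
# differ from state 0 and delegate everything else to it through failure links.
transitions = {(0, "a"): 1, (0, "b"): 0, (0, "c"): 0, (1, "b"): 2}
failure = {1: 0, 2: 0}

final_state = run(0, "acaab", transitions, failure)
```

A complete DFA for the same language would need nine regular transitions (three states times three symbols); here four regular transitions plus two failure transitions suffice, which is the kind of saving the transition-reduction problem asks about.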
Conference Paper
BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
Conference Paper
Sequence alignment is a basic method for processing information in bioinformatics; it is of great significance for discovering the function, structure, and evolutionary information of nucleic acid and protein sequences. This paper briefly describes the relevant issues of sequence alignment and the most common local sequence alignment algorithm, the Blast algorithm. At present, the Blast algorithm, whether provided by NCBI or run stand-alone, cannot meet the actual demand created by the flood of biological data. This paper implements the Blast-Parallel algorithm as a further improvement of the Hadoop-Blast algorithm. Serial experiments with the stand-alone Blast algorithm and parallel experiments with the Hadoop-Blast and Blast-Parallel algorithms on the Hadoop platform show that the Blast algorithm has significantly higher execution efficiency after parallelization, and that the matching speed of the improved Blast-Parallel algorithm reaches 1 to 1.5 times that of the Hadoop-Blast algorithm.
Conference Paper
A duplication is a basic phenomenon that occurs through molecular evolution on a biological sequence. A duplication on a string copies any substring of the string. We define k-pseudo-duplication of a string w that consists, roughly speaking, of all strings obtained from w by inserting after a substring u another substring obtained from u by at most k edit operations. We consider three variants of duplication operations: duplication, k-pseudo-duplication and reverse-duplication. First, we give the necessary and sufficient number of states that a nondeterministic finite automaton needs to recognize duplications on a string. Then, we show that regular languages and context-free languages are not closed under the duplication, k-pseudo-duplication and reverse-duplication operations. Furthermore, we show that the class of context-sensitive languages is closed under duplication, pseudo-duplication and reverse-duplication.
Article
It is important to identify viruses in animals because most infectious diseases in humans are caused by viruses of zoonotic origin. African green monkey is a widely used non-human primate model in biomedical investigations. In this study, total RNAs were extracted from stool samples of 10 African green monkeys with diarrhea. High-throughput sequencing was used to characterize viromes. PCR and Sanger sequencing were used to determine the full genome sequences. Great viral diversity was observed. The dominant viruses were enteroviruses and picobirnaviruses. Six enterovirus genomes and a picobirnavirus RNA-dependent RNA polymerase sequence were characterized. Five enteroviruses belonged to two putative new genotypes of species Enterovirus J. One enterovirus belonged to EV-A92. The picobirnavirus RNA-dependent RNA polymerase sequence had the highest nucleotide similarity (93.48%) with human picobirnavirus isolate GPBV6C2. The present study helped to identify the potential zoonotic viruses in African green monkeys. Further investigations are required to elucidate their pathogenic roles in animals and humans.
Article
Full-text available
This report presents the implementation of a protein sequence comparison algorithm specifically designed to speed up its time-consuming parts on parallel hardware such as SSE instructions, multicore architectures or graphics boards. Three programs have been developed: PLAST-P, TPLAST-N and PLAST-X. They provide equivalent results compared to the NCBI BLAST family programs (BLAST-P, TBLAST-N and BLAST-X) with a speed-up factor ranging from 5 to 10.
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic, and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is described for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
Article
Full-text available
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Article
Full-text available
Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
Article
Full-text available
Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
Article
Full-text available
Motivation: Many studies have shown that database searches using position-specific score matrices (PSSMs) or profiles as queries are more effective at identifying distant protein relationships than are searches that use simple sequences as queries. One popular program for constructing a PSSM and comparing it with a database of sequences is Position-Specific Iterated BLAST (PSI-BLAST). Results: This paper describes a new software package, IMPALA, designed for the complementary procedure of comparing a single query sequence with a database of PSI-BLAST-generated PSSMs. We illustrate the use of IMPALA to search a database of PSSMs for protein folds, and one for protein domains involved in signal transduction. IMPALA's sensitivity to distant biological relationships is very similar to that of PSI-BLAST. However, IMPALA employs a more refined analysis of statistical significance and, unlike PSI-BLAST, guarantees the output of the optimal local alignment by using the rigorous Smith-Waterman algorithm. Also, it is considerably faster when run with a large database of PSSMs than is BLAST or PSI-BLAST when run against the complete non-redundant protein database.
Article
Full-text available
Genomics and proteomics studies routinely depend on homology searches based on the strategy of finding short seed matches which are then extended. The exploding genomic data growth presents a dilemma for DNA homology search techniques: increasing seed size decreases sensitivity whereas decreasing seed size slows down computation. We present a new homology search algorithm 'PatternHunter' that uses a novel seed model for increased sensitivity and new hit-processing techniques for significantly increased speed. At Blast levels of sensitivity, PatternHunter is able to find homologies between sequences as large as human chromosomes, in mere hours on a desktop. PatternHunter is available at http://www.bioinformaticssolutions.com, as a commercial package. It runs on all platforms that support Java. PatternHunter technology is being patented; commercial use requires a license from BSI, while non-commercial use will be free.
Article
Full-text available
The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results.
Article
Full-text available
One of the most common activities in bioinformatics is the search for similar sequences. These searches are usually carried out with the help of programs from the NCBI BLAST family. As the majority of searches are routinely performed with default parameters, a question that should be addressed is how reliable the results obtained using the default parameter values are, i.e. what fraction of potential matches have been retrieved by these searches. Our primary focus is on the initial hit parameter, also known as the seed or word, used by the NCBI BLASTn, MegaBLAST and other similar programs in searches for similar nucleotide sequences. We show that the use of default values for the initial hit parameter can have a big negative impact on the proportion of potentially similar sequences that are retrieved. We also show how the hit probability of different seeds varies with the minimum length and similarity of sequences desired to be retrieved and describe methods that help in determining appropriate seeds. The experimental results described in this paper illustrate situations in which these methods are most applicable and also show the relationship between the various BLAST parameters.
Article
Full-text available
The ASTRAL Compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54 745 domains, more than three times as many as the initial release 4 years ago. ASTRAL has undergone major transformations in the past 2 years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand‐alone database, as well as integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB‐style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods. ASTRAL may be accessed at http://astral.stanford.edu/.
Article
Full-text available
Electrochemical impedance measurements were used for the detection of single‐strand DNA sequences using a peptide nucleic acid (PNA) probe layer immobilized onto Si/SiO2 chips. An epoxysilane layer is first immobilized onto the Si/SiO2 surface. The immobilization procedure consists of an epoxide/amine coupling reaction between the amino group of the PNA linker and the epoxide group of the silane. A 20‐nucleotide sequence of PNA was used. Impedance measurements allow for the detection of the changes in charge distribution at the oxide/solution interface following modifications to the oxide surface. Due to these modifications, there are significant shifts in the semiconductor’s flat‐band potential after immobilization and hybridization. The results obtained using this direct and rapid approach are supported by fluorescence measurements according to classical methods for the detection of nucleic acid sequences.
Article
Full-text available
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly; that sequences can be accessed independently of the order in which they were stored; and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching.
Article
Extending the single optimized spaced seed of PatternHunter(20) to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search methodology research back to a full circle.
Article
Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
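The idea of an index over nonoverlapping K-mers can be illustrated with a short Python sketch. It is not BLAT's implementation: the function names, the choice to slide over the query at every offset, and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_nonoverlapping_index(genome, k):
    """Index the genome K-mers that start at offsets 0, k, 2k, ... (nonoverlapping tiles)."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def find_hits(query, index, k):
    """Return (query offset, genome offset) pairs for every query K-mer found in the index.

    Assumption of this sketch: the query is scanned at every offset, so a match
    can be found regardless of how the query aligns with the genome tiles.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            hits.append((q, g))
    return hits


genome = "ACGTACGTTTGA"
index = build_nonoverlapping_index(genome, k=4)
hits = find_hits("GACGTT", index, k=4)
```

Indexing only every k-th position keeps the table roughly k times smaller than an index of all overlapping K-mers, which is what allows it to stay in memory and be built once per genome assembly.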
Article
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.
Article
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are: (1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the χ² statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
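As a rough illustration of an entropy-style complexity measure, the Python sketch below computes the Shannon entropy of the residue composition in a sliding window and reports windows that fall at or below a threshold. It is only in the spirit of definition (2) above and is not the SEG algorithm; the window width, the threshold, and the function names are hypothetical choices.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, width, threshold):
    """Return start offsets of windows whose composition entropy is at or below the threshold."""
    return [
        start
        for start in range(len(seq) - width + 1)
        if window_entropy(seq[start:start + width]) <= threshold
    ]


# A homopolymer run followed by a more varied stretch (toy sequence).
starts = low_complexity_windows("AAAAMKTW", width=4, threshold=0.5)
```

A window made of a single residue type has entropy 0, while a window in which every residue differs reaches the maximum of log2(width) bits, so thresholding separates homopolymer-like stretches from compositionally varied ones.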
Article
In a statistical study of neighboring residues in 1465 peptides and proteins comprising 450,431 residues, it was found that the preferences for residues neighboring to glutamine and asparagine residues are consistent with the hypothesis that the rates of deamidation of these residues are of biological significance. Some dipeptide and tripeptide structures have special usefulness and some are especially undesirable. More such structures exist for amide residues than for other residues, and their specific types are those most relevant to the deamidation of amide residues under biological conditions.
Article
With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.
Article
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
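The abstract does not spell out the coding itself, so the following is only a sketch of one common form of direct coding, assuming two bits per base for A, C, G and T, packed four bases to a byte. The names `pack` and `unpack` are hypothetical, and the handling of wildcard characters, which a lossless scheme for real nucleotide data would need, is left out.

```python
# Assumed 2-bit code; the actual cino code assignment is not given in the abstract.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from bytes produced by `pack`."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

Because the sequence length is stored separately, padding bits in the final byte are ignored on decoding, and any base can be read by direct offset without decompressing what precedes it.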
Article
Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
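The index of nonoverlapping K-mers described in this abstract can be sketched as follows. This is a minimal illustration, not BLAT's implementation: the function names are hypothetical, the index is a plain in-memory dictionary, and scanning the query at every offset is an assumption about how such an index is typically probed.

```python
from collections import defaultdict


def build_index(genome, k):
    """Index every nonoverlapping K-mer of `genome` (offsets 0, k, 2k, ...)."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(index, query, k):
    """Look up every overlapping K-mer of `query`; return (query_pos, genome_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, gpos))
    return hits
```

Storing only every k-th K-mer shrinks the index by roughly a factor of k compared with indexing all overlapping K-mers, which is what lets it fit in memory; sliding the query one position at a time compensates for the skipped genome offsets.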
Article
Basic Local Alignment Search Tool (BLAST) is one of the most heavily used sequence analysis tools available in the public domain. There is now a wide choice of BLAST algorithms that can be used to search many different sequence databases via the BLAST web pages (http://www.ncbi.nlm.nih.gov/BLAST/). All the algorithm–database combinations can be executed with default parameters or with customized settings, and the results can be viewed in a variety of ways. A new online resource, the BLAST Program Selection Guide, has been created to assist in the definition of search strategies. This article discusses optimal search strategies and highlights some BLAST features that can make your searches more powerful.
Article
Extending the single optimized spaced seed of PatternHunter to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search technology back to a full circle.
Article
We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen allows four to five times fewer false positive hits, while preserving essentially the same sensitivity as BLASTP.
Article
Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step, semigapped alignment, compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to accurately filter sequences with lower computational cost. In addition, we propose a heuristic, restricted insertion alignment, that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in BLAST. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at http://www.bsg.rmit.edu.au/iga/.
Article
Comprehensive performance assessment is important for improving sequence database search methods. Sensitivity, selectivity and speed are three major yet usually conflicting evaluation criteria. The average precision (AP) measure aims to combine the sensitivity and selectivity features of a search algorithm. It can be easily visualized and extended to analyze results from a set of queries. Finally, the time-AP plot can clearly show the overall performance of different search methods. Experiments are performed based on the SCOP database. Popular sequence comparison algorithms, namely Smith-Waterman (SSEARCH), FASTA, BLAST and PSI-BLAST are evaluated. We find that (1) the low-complexity segment filtration procedure in BLAST actually harms its overall search quality; (2) AP scores of different search methods are approximately proportional to the logarithm of search time; and (3) homologs in protein families with many members tend to be more obscure than those in small families. This measure may be helpful for developing new search algorithms and can guide researchers in selecting the most suitable search methods. Test sets and source code of this evaluation tool are available upon request.
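To make the average precision measure concrete, the sketch below uses the standard information-retrieval definition: the precision at each rank where a true homolog appears, averaged over all true homologs. Whether the paper uses exactly this variant is an assumption, and the function and parameter names are hypothetical.

```python
def average_precision(ranked_is_homolog, total_homologs):
    """Average precision of one ranked result list.

    ranked_is_homolog: booleans in rank order, True where the hit is a true homolog.
    total_homologs: number of true homologs in the database for this query,
        including any the search failed to return.
    """
    if total_homologs == 0:
        return 0.0
    found = 0
    precision_sum = 0.0
    for rank, is_homolog in enumerate(ranked_is_homolog, start=1):
        if is_homolog:
            found += 1
            precision_sum += found / rank  # precision at this rank
    return precision_sum / total_homologs
```

Dividing by the total number of homologs, rather than by the number retrieved, penalizes missed homologs (sensitivity), while false positives ranked above true hits lower the per-rank precision (selectivity).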
McGinnis, S. and Madden, T. (2004). BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research, 32:W20–W25.