Article

Abstract

To improve the accuracy of rapid homology searching it is common practice to filter all queries to mask low complexity regions prior to searching. We show in this paper, through a large-scale study of querying the PIR database, that applying popular filtering techniques unselectively to all queries may reduce retrieval effectiveness. We also show that masking queries with our new technique, cafefilter, which uses the overall distribution of motifs in a database, is at least as effective as current popular query filtering tools in large-scale tests.
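As a rough illustration of what masking low complexity regions in a query means, the sketch below replaces residues in low-entropy windows with a mask character. It is not cafefilter, which the abstract describes as using the overall distribution of motifs in a database, and it is not any of the popular filtering tools either. The function names, the window length of 12, and the entropy threshold of 2.2 bits are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy, in bits per residue, of the residues in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every residue that falls inside at least one low-entropy window.

    window and threshold are illustrative values, not tuned parameters.
    Sequences shorter than the window are returned unchanged.
    """
    masked = list(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

A homopolymer run such as twelve consecutive serines has zero entropy and is masked in full, while a window of twelve distinct residues is left untouched. Because whole windows are masked, a few residues flanking a repeat can be masked as well.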


... Importantly, the CAFE method, which consists of a coarse search and a fine search, is marginally less accurate than BLAST1 and FASTA. In terms of search speed, CAFE is 8 times faster than BLAST2 [29,30]. The PropSearch tool [13] is based on the idea of using properties that are conserved across similar structures in database search. ...
Article
Full-text available
With the rapid development in the field of life sciences and the flooding of genomic information, the need for faster and scalable searching methods has become urgent. One of the approaches that has been investigated is indexing. Indexing methods have been grouped into three categories: length-based index algorithms, transformation-based algorithms and mixed-technique algorithms. In this research, we focused on the transformation-based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied parallel methods to speed up index building and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the N-gram transformation algorithm is an economical solution that saves both time and space. The results show that the index is smaller than the dataset when the N-gram size is 5 or 6. The results of the parallel N-gram transformation algorithm indicate that parallel programming with large datasets is promising and can be improved further.
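The inverted index table mentioned in this abstract can be pictured with a minimal sequential sketch. It maps each N-gram to the sequences and offsets where it occurs. The function names and the toy data are assumptions, and the parallel construction the abstract reports is not attempted here.

```python
from collections import defaultdict


def ngrams(seq, n):
    """Yield every overlapping substring of length n in seq."""
    return (seq[i:i + n] for i in range(len(seq) - n + 1))


def build_inverted_index(sequences, n=5):
    """Map each N-gram to a list of (sequence id, offset) postings.

    sequences is a dict of {sequence id: sequence string}.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos, gram in enumerate(ngrams(seq, n)):
            index[gram].append((seq_id, pos))
    return index


# Toy usage with hypothetical sequence ids.
toy_index = build_inverted_index({"s1": "ACGTACGT", "s2": "CGTAC"}, n=4)
```

Looking up an N-gram of the query in such a table returns candidate sequences directly, without scanning the whole dataset.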
... DASH in contrast, pushes the median scores beyond the twentieth to twenty-fifth ranks. This represents a marked improvement over existing sequence filtering algorithms, such as [13] [17]. Finally, figure 8 reveals that the speedup of DASH over BLAST depends on the query sequence length. ...
Conference Paper
In this paper we present our genomic and proteomic sequence alignment algorithm, DASH, which results in order of magnitude speed improvement when compared to NCBI-BLAST 2.2.6, with superior sensitivity. Dynamic programming (DP) is the predominant contributor to search time for algorithms such as BLAST and FastA/P. Improving the efficiency of DP provides an opportunity to increase sensitivity, or significantly reduce search times and help offset the effects of the continuing exponential growth in database sizes. Specifically, for nucleotide searching we have demonstrated an order of magnitude speed improvement with significantly improved sensitivity, or alternatively moderate speed up with further sensitivity gains, depending on the parameters selected. Smith-Waterman complete DP is used as the sensitivity benchmark. Similar speed and sensitivity results are presented for protein searching. Since our algorithm is highly parallel, we have developed dedicated hardware which we will present in a companion paper, and a distributed version of our software (DDASH), which we expect to provide linear speedup on a cluster.
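For readers unfamiliar with the complete dynamic programming that this abstract uses as its sensitivity benchmark, here is a minimal sketch of Smith-Waterman local alignment scoring. It is not DASH. The match, mismatch and linear gap scores are assumptions made for the example; real searches typically use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b.

    Uses a simple match/mismatch scheme and a linear gap penalty, and keeps
    only two rows of the DP table, so memory is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is filled, so the cost grows with the product of the two sequence lengths. That cost is why the abstract names dynamic programming as the predominant contributor to search time.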
... Transformation based index algorithms are all based on special technique(s), and at the same time, these transformations combine properties of genomic data. 4.1 CAFE. CAFE [34, 35, 36, 37] is a partition based search approach, where a coarse search using an inverted index is used to rank sequences by similarity to a query sequence, and a subsequent fine search is used to locally align only a database subset with the query. In our opinion, this method can be extended to other algorithms. ...
Article
Up to now, there are many homology search algorithms that have been investigated and studied. However, a good classification method and a comprehensive comparison for these algorithms are absent. This is especially true for index based homology search algorithms. The paper briefly introduces main index construction methods. According to index construction methods, index based homology search algorithms are classified into three categories, i.e., length based index ones, transformation based index ones, and their combination. Based on the classification, the characteristics of the currently popular index based homology search algorithms are compared and analyzed. At the same time, several promising and new index techniques are also discussed. As a whole, the paper provides a survey on index based homology search algorithms.
... Sequence analysis of a 4.5 kb EcoRI fragment, obtained by Southern screening of H. polymorpha genomic DNA, revealed an ORF (GenBank Accession No. AF286019), consisting of 131 amino acids and showing 43.2% identity with the S. cerevisiae Sed1p protein (ScSed1p), based on an algorithm of ClustalW (Thompson et al., 1994). The identity of two proteins was increased to 73% in BLAST using a filter of low complexity (Williams, 1999). Putative HpSed1p was much shorter than ScSed1p (384 amino acids) due to the absence of the repeating domains found in ScSed1p (Shimoi et al., 1998). ...
Article
A cell surface display system was developed in yeast Hansenula polymorpha. The four genes HpSED1, HpGAS1, HpTIP1 and HpCWP1, encoding glycosylphosphatidyl-inositol (GPI)-anchored cell surface proteins from H. polymorpha, were cloned, characterized and evaluated for their efficacies as cell surface display motifs of reporter proteins. Sequence analysis of these genes revealed that each encodes a typical GPI-anchored protein that is structurally similar to a counterpart gene in S. cerevisiae. The genes showed a high content of serine-threonine (alanine) and harboured a putative secretion signal in the N-terminus and the GPI-attachment signal in the C-terminus. The surface anchoring efficiency of these putative cell surface proteins was tested by fusion to the C-terminal of carboxymethylcellulase (CMCase) from Bacillus subtilis. In all cases, high CMCase activities were detected in intact cell fraction, indicating anchoring of CMCase to the cell surface. HpCwp1p, HpGas1p and the 40 C-terminal amino acids of HpTip1p from H. polymorpha exhibited a comparatively high CMCase surface anchoring efficiency. When these proteins were used as anchoring motifs for surface display of the glucose oxidase (GOD) from Aspergillus niger, most enzyme activity was detected at the cell surface. Fluorescence activated cell sorter (FACS) analysis of cells displaying GOD on the cell surface demonstrated that GOD was well exposed on the cell surface. HpCwp1p showed the highest anchoring efficiency among others.
Article
Full-text available
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments and that index-based searching is as accurate as existing exhaustive search schemes.
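The idea of selecting sequences through an index and aligning only those can be sketched as a two-stage search. This is a simplified illustration under stated assumptions, not the published method: the coarse stage ranks sequences by how many distinct query intervals they share, and the fine stage uses the longest common run found by Python's `difflib.SequenceMatcher` as a stand-in for a real local alignment. All function names, the interval length of 3, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def build_index(sequences, n=3):
    """Map each length-n interval to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(seq_id)
    return index


def coarse_rank(query, index, n=3):
    """Rank sequences by the number of distinct query intervals they contain."""
    hits = Counter()
    for gram in {query[i:i + n] for i in range(len(query) - n + 1)}:
        for seq_id in index.get(gram, ()):
            hits[seq_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


def longest_common_run(a, b):
    """Stand-in fine score: length of the longest shared contiguous run."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def coarse_fine_search(query, sequences, index, n=3, top_k=2,
                       fine_score=longest_common_run):
    """Apply the expensive fine score only to the top_k coarse candidates."""
    candidates = [seq_id for seq_id, _ in coarse_rank(query, index, n)[:top_k]]
    scored = [(seq_id, fine_score(query, sequences[seq_id]))
              for seq_id in candidates]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))


# Toy usage with hypothetical sequences.
db = {"a": "MKTAYIAKQR", "b": "MKTAYLLLLL", "c": "GGGGGGGGGG", "d": "AKQRWWWWWW"}
results = coarse_fine_search("MKTAYIAKQR", db, build_index(db))
```

The saving comes from `top_k`: sequences that share no intervals with the query, or too few to rank highly, never reach the fine stage.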
Article
The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer and unambiguously detect proteins that shared a common ancestor billions of years in the past. FASTA is both fast and selective because it initially considers only amino acid identities. Its sensitivity is increased not only by using the PAM250 matrix to score and rescore regions with large numbers of identities but also by joining initial regions. The results of searches with FASTA compare favorably with results using NWS-based programs that are 100 times slower. FASTA is slightly less sensitive but considerably more selective. It is not clear that NWS-based programs would be more successful in finding distantly related members of the G-protein-coupled receptor family. The joining step by FASTA to calculate the initn score is especially useful for sequences that share regions of sequence similarity that are separated by variable-length loops.
Article
Although there are several different comparison programs available (e.g., BLASTP, FASTA, SSEARCH, and BLITZ) that can be used with different scoring systems (e.g., PAM120, PAM250, BLOSUM50, BLOSUM62) and different databases (e.g., PIR, SWISS-PROT, GenPept), the following search protocol should identify homologous sequences whenever they can be found.
1. Always compare protein sequences if the genes encode proteins. Protein sequence comparison will typically double the evolutionary lookback time over DNA sequence comparison.
2. Search several sequence databases using a rapid sequence comparison program (e.g., BLASTP or FASTA, ktup = 2). Well-curated databases like PIR or SWISS-PROT tend to have fewer redundant sequences, which improves the statistical significance of a match, but they are less comprehensive and up-to-date than GenPept.
3. If there is good agreement between the distribution of scores and the theoretical distribution, and the alignments do not include "simple sequence" domains, accept sequences with FASTA E() values or BLASTP P() values below 0.02 as homologous.
4. If no library sequences are found with E values below 0.02, perform additional searches with FASTA, ktup = 1, or SSEARCH. If library sequences with E values less than 0.02 are found, the sequences are probably homologous, unless a low-complexity domain is aligned. However, sequences with similarity scores from 0.02 to 10.0 may be homologous as well. To characterize these more distantly related sequences, select "marginal" library sequences and use them to search the databases. Additional family members should have E values less than 0.05.
5. Homologous sequences share a common ancestor, and thus a common protein fold. Depending on the evolutionary distance and divergence path, two or more homologous sequences may have very few absolutely conserved residues. However, if homology has been inferred between A and B, between B and C, and between C and D, A and D must be homologous, even if they share no significant similarity.
6. Sequences with marginal E values should also be tested using the PRSS program. Compare the query and library sequences using at least 200 (and preferably 1000) shuffles. Shuffles using a window (-w) of 10-20 are more stringent than a uniform shuffle. Use the E value after 1000 shuffles to confirm an inference of homology.
7. Homologous sequences are usually similar over an entire sequence or domain, typically sharing 20-25% or greater identity for more than 200 residues. Matches that are more than 50% identical in a 20- to 40-amino acid region occur frequently by chance and do not indicate homology.
By following these steps, one will very rarely assert that two sequences are homologous when in fact they are not. However, these criteria are stringent; distantly related homologous sequences may fail to be detected because their similarity is not statistically significant. These tests are biased toward missing some distantly related sequences to avoid the possibility of misidentifying unrelated ones. In most database searches, the ratio of unrelated to related sequences is more than 4000:1 (e.g., 10 related and 40,000 unrelated sequences). Thus, one is more likely to mistakenly identify two sequences as related than to overlook a genuine relationship, and our conservative evaluation criteria reflect that bias.
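The shuffling test in step 6 can be illustrated with a small sketch that contrasts a uniform shuffle with a windowed one and reports an empirical p-value. This is not the PRSS program and does not reproduce its statistics; it only counts how often a shuffled library sequence scores at least as well as the real one. The function names, the shared 3-mer score, and the fixed random seed are assumptions made for the example.

```python
import random


def uniform_shuffle(seq, rng):
    """Shuffle all residues, preserving only overall composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def window_shuffle(seq, window, rng):
    """Shuffle residues within consecutive windows, preserving local composition."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def shared_kmer_score(a, b, k=3):
    """Toy similarity score: number of distinct k-mers the two sequences share."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def shuffle_p_value(query, subject, score_fn=shared_kmer_score,
                    shuffles=200, window=None, seed=0):
    """Fraction of shuffles scoring at least as well as the unshuffled pair.

    Adding 1 to numerator and denominator keeps the estimate above zero.
    """
    rng = random.Random(seed)
    observed = score_fn(query, subject)
    at_least = 0
    for _ in range(shuffles):
        if window:
            shuffled = window_shuffle(subject, window, rng)
        else:
            shuffled = uniform_shuffle(subject, rng)
        if score_fn(query, shuffled) >= observed:
            at_least += 1
    return (at_least + 1) / (shuffles + 1)
```

A windowed shuffle keeps locally biased composition in place, so a match driven only by a compositionally biased region tends to survive shuffling and receive a larger p-value. This is one way to read the statement that windowed shuffles are more stringent.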