Conference Paper

Indexing Nucleotide Databases for Fast Query Evaluation.


Abstract

A query to a nucleotide database is a DNA sequence. Answers are similar sequences, that is, sequences with a high-quality local alignment. Existing techniques for finding answers use exhaustive search, but it is likely that, with increasing database size, these algorithms will become prohibitively expensive. We have developed a partitioned search approach, in which local alignment string matching techniques are used in tandem with an index. We show that fixed-length substrings, or intervals, are a suitable basis for indexing in conjunction with local alignment on likely answers. By use of suitable compression techniques the index size is held to an acceptable level, and queries can be evaluated several times more quickly than with exhaustive search techniques.
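The partitioned approach described in the abstract can be illustrated with a toy sketch (the names and the interval length below are illustrative, not the actual cafe implementation): an inverted index maps each fixed-length substring to the sequences containing it, and a coarse search ranks sequences by shared intervals before any expensive local alignment is attempted.

```python
from collections import defaultdict

INTERVAL = 3  # fixed-length substrings ("intervals") used as index terms

def build_index(sequences):
    """Map each interval to the set of sequence ids containing it."""
    index = defaultdict(set)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - INTERVAL + 1):
            index[seq[i:i + INTERVAL]].add(sid)
    return index

def coarse_search(index, query, n_seqs):
    """Rank sequences by how many distinct query intervals they share."""
    scores = [0] * n_seqs
    for i in range(len(query) - INTERVAL + 1):
        for sid in index.get(query[i:i + INTERVAL], ()):
            scores[sid] += 1
    return sorted(range(n_seqs), key=lambda s: -scores[s])

db = ["ACGTACGT", "TTTTGGGG", "ACGTTTTT"]
idx = build_index(db)
ranking = coarse_search(idx, "ACGTAC", len(db))
# only the top-ranked candidates would then go to the fine search
# (a local alignment such as Smith-Waterman); here sequence 0 ranks first
```

In a full system, only the leading candidates from the coarse ranking are locally aligned against the query, which is where the reported speedup over exhaustive search comes from.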


... In the next chapter, we focus on methods for searching sequence databanks. We begin with descriptions of the popular heuristic search algorithms such as fasta [Pearson and Lipman, 1988] and blast, as well as index-based approaches such as cafe [Williams and Zobel, 1996], PatternHunter [Ma et al., 2002], and blat [Kent, 2002]. We then describe distributed search schemes and iterative algorithms such as psi-blast [Altschul et al., 1997] and sam. ...
... Unlike the exhaustive schemes, index-based approaches do not rely on the entire collection fitting into main memory for reasonable search performance [Williams, 2003]. Early approaches to index-based search included scan [Orcutt and Barker, 1984], flash [Califano and Rigoutsos, 1993], ramdb [Fondrat and Dessen, 1995], rapid [Miller et al., 1999] and the work by Myers [1994]; however, the most successful approach is the cafe index-based homology search tool [Williams and Zobel, 1996]. ...
... Several index-based methods for homology search have been proposed, where an inverted index of the collection is used to identify hits without exhaustively scanning the entire database. Schemes such as cafe [Williams and Zobel, 1996], scan [Orcutt and Barker, 1984], flash [Califano and Rigoutsos, 1993], ramdb [Fondrat and Dessen, 1995], and rapid [Miller et al., 1999] employ an on-disk index and are suitable for searching large collections such as GenBank. Cafe is the most successful approach to employ an on-disk index, with substantially faster search times and a small reduction in search accuracy when compared to blast [Chen, 2004]. ...
... Sequencing initiatives are contributing exponentially increasing quantities of nucleotide data to databases such as GenBank (Benson et al., 1993). We propose a new direct coding compression scheme for use in homology search applications such as FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990), and CAFE (Williams and Zobel, 1996a). This scheme yields compact storage, is lossless—nucleotide bases and wildcards are represented—and has extremely fast decompression. ...
... We have used the Elias gamma codes to encode each count w and Golomb codes to represent each sequence of offsets. These techniques are a variation on techniques used for inverted file compression, which has been successfully applied to large text databases (Bell et al., 1993) and to genomic databases (Williams and Zobel, 1996a; Williams and Zobel, 1996b). Compression with Golomb codes, given the appropriate choice of a pre-calculated parameter, is better than with Elias coding. ...
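The Elias gamma and Golomb codes mentioned in this excerpt are simple to sketch as bit strings. The snippet below is an illustration only, not the coding used in the cited system, and it restricts the Golomb parameter to powers of two (the Rice special case) to keep the remainder width fixed:

```python
def elias_gamma(x):
    """Elias gamma code of a positive integer x, as a bit string:
    n - 1 zeros followed by the n-bit binary form of x, where
    n = floor(log2 x) + 1."""
    assert x >= 1
    b = bin(x)[2:]                       # binary form without the '0b' prefix
    return "0" * (len(b) - 1) + b

def golomb_rice(x, m):
    """Golomb code of x >= 0 with a power-of-two parameter m (the Rice
    special case): a unary quotient, a zero terminator, then a
    fixed-width binary remainder."""
    q, r = divmod(x, m)
    width = m.bit_length() - 1           # log2(m) for power-of-two m
    return "1" * q + "0" + format(r, "b").zfill(width)

assert elias_gamma(1) == "1"
assert elias_gamma(5) == "00101"         # two zeros, then 101
assert golomb_rice(9, 4) == "11001"      # quotient 2 in unary, remainder 01
```

As the excerpt notes, the gamma code needs no parameter, while Golomb coding compresses better when its parameter is chosen to match the distribution of the gaps being coded.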
... We therefore expect that use of direct coding in a retrieval system would significantly reduce retrieval times overall. To further test this hypothesis we incorporated the scheme into cafe, our genomic database retrieval engine (Williams and Zobel, 1996a ), and found that retrieval times fell by over 20%. In BLAST (Altschul et al., 1990) a simple approach is taken to nucleotide compression. ...
Article
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
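As a rough illustration of what a lossless direct nucleotide coding must handle, the sketch below packs two bits per base and stores wildcard characters in a separate exception list so that the round trip is exact. This is a deliberately simplified stand-in for exposition, not the published cino format:

```python
# Two bits per base with a side list of wildcard exceptions.
BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS2BASE = "ACGT"

def pack(seq):
    """Return (packed bytes, original length, wildcard exceptions)."""
    exceptions = []                      # (position, character) pairs
    codes = []
    for i, ch in enumerate(seq):
        if ch in BASE2BITS:
            codes.append(BASE2BITS[ch])
        else:                            # wildcard such as N: store aside
            exceptions.append((i, ch))
            codes.append(0)              # placeholder bits
    packed = bytearray()
    for i in range(0, len(codes), 4):    # four bases per byte
        group = codes[i:i + 4]
        byte = 0
        for c in group:
            byte = (byte << 2) | c
        byte <<= 2 * (4 - len(group))    # left-pad the final short byte
        packed.append(byte)
    return bytes(packed), len(seq), exceptions

def unpack(packed, length, exceptions):
    out = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            out.append(BITS2BASE[(byte >> shift) & 3])
    out = out[:length]
    for pos, ch in exceptions:           # restore wildcards losslessly
        out[pos] = ch
    return "".join(out)

seq = "ACGTNACG"
assert unpack(*pack(seq)) == seq         # the round trip is lossless
```

Keeping wildcards out of the main bit stream is one way a scheme can stay at two bits per base for the common case while remaining lossless; the actual cino design may differ.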
... Sequencing initiatives are contributing exponentially increasing quantities of nucleotide data to databases such as GenBank (Benson et al., 1993). We propose a new direct coding compression scheme for use in homology search applications such as FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990) and CAFE (Williams and Zobel, 1996a). This scheme yields compact storage, is lossless—nucleotide bases and wildcards are represented—and has extremely fast decompression. ...
... (Elias, 1975) and the Golomb codes (Golomb, 1966). We have used the Elias gamma codes to encode each count w and Golomb codes to represent each sequence of offsets. These techniques are a variation on techniques used for inverted file compression, which has been successfully applied to large text databases (Bell et al., 1993) and to genomic databases (Williams and Zobel, 1996a,b). ...
... To test this hypothesis further, we incorporated the scheme into cafe, our genomic database retrieval engine (Williams and Zobel, 1996a), and found that retrieval times fell by >20%. ...
Article
Full-text available
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly; that sequences can be accessed independently of the order in which they were stored; and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching.
... Variable-byte codes for selected integers in the range 1-30 are shown in Table 1. A typical application is the coding of index term and inverted list file offset pairs for an inverted index [10]. ...
... Elias coding [2] is a non-parameterised method of coding integers that is, for example, used in large text database indexes [8] and specialist applications [10,11]. Elias coding, like the other schemes described in this paper, allows unambiguous coding of integers and does not require separators between each integer of a stored array. ...
... Each postings array contains a sorted list of document identifiers and, in many indexes, interspersed between each document identifier is a sorted array of one or more word positions of that term in the document. We experiment with an uncompressed index postings list of 630.5 Mb (vector) extracted from a much larger carefully compressed 3,596 Mb postings file used in the cafe [10] indexed genomic retrieval system. This postings list contains both document identifiers and interleaved word positions. ...
Article
Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding such a representation is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
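The variable-byte scheme compared in this abstract is easy to state concretely. The sketch below uses one common convention (seven data bits per byte, low-order bits first, with the high bit set on the terminating byte); other implementations flag continuation bytes instead:

```python
def vbyte_encode(n):
    """Variable-byte code of a non-negative integer: seven data bits per
    byte, least-significant group first, high bit marking the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)     # continuation byte: high bit clear
        n >>= 7
    out.append(n | 0x80)         # terminating byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    """Decode one integer; return (value, number of bytes consumed)."""
    n, shift = 0, 0
    for i, b in enumerate(data):
        if b & 0x80:
            return n | ((b & 0x7F) << shift), i + 1
        n |= b << shift
        shift += 7

# integers below 128 fit in a single byte; larger ones spill over
assert vbyte_encode(5) == bytes([0x85])
assert vbyte_decode(vbyte_encode(300))[0] == 300
```

Because every codeword is a whole number of bytes, decoding needs no bit-level operations, which is why variable-byte codes often decode faster than the bit-aligned Elias and Golomb codes even though they compress less tightly.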
... Conventional databases use indexing to provide efficient access to the data. Williams and Zobel [6] use an inverted-list indexing technique to support alignment queries on nucleotide databases. In their approach the processing of queries is partitioned in two phases: ...
... In other words, the variant GNAT is the same as the original, after replacing (5) by (6). The detailed algorithm is as follows. ...
Conference Paper
Full-text available
A genomic database consists of a set of nucleotide sequences, for which an important kind of query is the local sequence alignment. The paper investigates two different indexing techniques, namely variations of GNAT trees and M-trees, to support fast query evaluation for local alignment, by transforming the alignment problem to a variant metric space neighborhood search problem.
... The edit distance family of string matching techniques is suitable for this task [11,12], and have been widely applied in related applications including genomics and phonetic name matching [27,28]. Three kinds of edit distance are applicable. ...
... This approach has the disadvantage, however, that a piece that matches well overall but has intermittent differences will not score highly. A more flexible method, successfully used in string matching and genomic retrieval [27], is to use n-grams: count the number of matching substrings of some fixed length n. The ngram count should be normalised by string length, because long pieces are statistically more likely to have an n-gram match. ...
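A minimal version of the normalised n-gram count described in this excerpt might look as follows. Normalising by the longer string's n-gram count is one common choice; the cited work may normalise differently:

```python
def ngrams(s, n):
    """The set of distinct length-n substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_score(a, b, n=3):
    """Count shared n-grams, normalised by the n-gram count of the
    longer string so that long strings are not favoured unduly."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / max(len(ga), len(gb))

# identical strings score 1.0; disjoint strings score 0.0
assert ngram_score("ACGTACGT", "ACGTACGT") == 1.0
assert ngram_score("AAAAAAA", "CCCCCCC") == 0.0
```

Unlike edit distance, this measure still scores well when a piece matches overall but has intermittent differences, which is the flexibility the excerpt describes.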
Article
With the growth in digital representations of music, and of music stored in these representations, it is increasingly attractive to search collections of music. One mode of search is by similarity, but, for music, similarity search presents several difficulties: in particular, deciding what part of the music is likely to be perceived as the theme by a listener, and deciding whether two pieces of music with different sequences of notes represent the same theme. In this paper we propose a three-stage framework for matching pieces of music. We use the framework to compare a range of techniques for determining whether two pieces of music are similar, by experimentally testing their ability to retrieve different transcriptions of the same piece of music from a large collection of MIDI files. These experiments show that different comparison techniques differ widely in their effectiveness.
... Conventional databases use indexing to provide efficient access to the data. Williams and Zobel [6] use an inverted-list indexing technique to support alignment queries on nucleotide databases. In their approach the processing of queries is partitioned in two phases: ...
... In other words, the variant GNAT is the same as the original, after replacing (5) by (6). The detailed algorithm is as follows. ...
Article
Full-text available
A genomic database consists of a set of nucleotide sequences, for which an important kind of query is the local sequence alignment. This paper studies indexing techniques to support fast query evaluation for local alignment, by transforming the alignment problem to a variant metric space neighborhood search problem. We propose and analyze the corresponding algorithms and structures, and identify several directions for future work. 1 Introduction Sequence databases are among the most important information repositories in molecular biology. Two types of sequences are found in sequence databases: nucleotide sequences with a four-letter alphabet, and amino acid residue sequences with a twenty-letter alphabet. A fundamental access mechanism to sequence databases is sequence alignment. The relevance of sequence alignment derives from evolutionary relationships between sequences. Thus a typical query against a sequence database is to find for a given sequence x all sequence...
... Importantly, the CAFE method, which consists of a coarse and a fine search, is marginally less accurate than BLAST1 and FASTA. From a search point of view, CAFE is 8 times faster and more efficient than BLAST2 [29,30]. The PropSearch tool, proposed in [13], is based on the idea of using conserved properties of similar structures in database search. ...
Article
Full-text available
With the rapid development of the life sciences and the flood of genomic information, the need for faster and more scalable searching methods has become urgent. One of the approaches that has been investigated is indexing. Indexing methods have been categorized into three categories: length-based index algorithms, transformation-based algorithms and mixed techniques-based algorithms. In this research, we focused on the transformation-based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied parallel methods to speed up the index building time and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the use of the N-gram transformation algorithm is an economical solution; it saves both time and space. The results show that the size of the index is smaller than the size of the dataset when the size of the N-gram is 5 or 6. The parallel N-gram transformation algorithm's results indicate that the use of parallel programming with large datasets is promising and can be improved further.
... They avoid database scanning by using inverted index, which has been proved successful in web search engines. FLASH [2], RAMDB [9], MAP [55], PatternHunter [8] and CAFE [19], [21] are index-based search tools which use pre-built indices on subsequences to speed up the lookup process. CAFE claimed that it could run eight times faster than BLAST and 50 times faster than FASTA. ...
... Though current storage technology is developing rapidly, the sizes of protein sequence collections are also growing exponentially, so these are relatively large storage overheads. By introducing techniques such as index compression for nucleotide databases (Williams et al., 1997) and index stopping, which discards high-frequency n-grams from the index (Williams et al., 1996), we expect that the index size of the ProSeS system can be further reduced to an acceptable level. ...
Article
Full-text available
Motivation: Though the sequence databases of proteins and DNAs are increasing in size exponentially, exhaustive sequence search systems are still commonly used in conducting biological research. However, due to the advancement of information technology, many information retrieval algorithms have been developed to search strings in large-scale text databases and have proved successful. We propose that these algorithms could also be applied to biological data. Results: Four n-gram indexing methods (tri-gram, tetra-gram, penta-gram, and hexa-gram) were applied to extract indices from protein sequences of the PIR-NREF database, and their retrieval effectiveness and speed were measured. The penta-gram method showed the best results: its retrieval effectiveness matches that of BLASTP, and its retrieval speed was about 38 times faster than the BLASTP program.
... When index processing is complete, the accumulators can be used to identify regions of high similarity, by identifying sequences of similar fragments. A related approach has been successfully applied by Williams and Zobel to querying of nucleotide databases [15]. We therefore believe that such an approach is practical. ...
Conference Paper
Ranking based on passages addresses some of the shortcomings of whole-document ranking. It provides convenient units of text to return to the user, avoids the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material amongst otherwise irrelevant text. In this paper we explore the potential of passage retrieval, based on an experimental evaluation of the ability of passages to identify relevant documents. We compare our scheme of arbitrary passage retrieval to several other document retrieval and passage retrieval methods; we show experimentally that, compared to these methods, ranking via fixed-length passages is robust and effective. Our experiments also show that, compared to whole-document ranking, ranking via fixed-length arbitrary passages significantly improves retrieval effectiveness, by 8% for TREC disks 2 and 4 and by 18%-37% for the Federal Register collection.
... Index-based search tools look up subsequences and their corresponding posting lists in some well-defined data structures. For example, FLASH [1], RAMDB [4], MAP [27] and CAFE [7][8][9][10] have adopted indexing techniques in their search tools. The advantage of index-based search tools over the exhaustive ones is that the pre-built indices can help to speed up the search process. ...
Conference Paper
Full-text available
Indexing and retrieval techniques for homology searching of genomic databases are increasingly important as search tools face the challenge of rapid growth in sequence collection size. Consequently, indexing and retrieval of possibly gigabytes of sequence data become expensive. In this paper, we present two new approaches for indexing genomic databases that can enhance the speed of indexing and retrieval. We show experimentally that the proposed methods can be more computationally efficient than existing ones.
... Transformation-based index algorithms are all based on special techniques, and at the same time, these transformations exploit properties of genomic data. 4.1 CAFE CAFE [34-37] is a partition-based search approach, where a coarse search using an inverted index is used to rank sequences by similarity to a query sequence, and a subsequent fine search is used to locally align only a database subset with the query. In our opinion, this method can be extended to other algorithms. ...
Article
Up to now, many homology search algorithms have been investigated and studied. However, a good classification method and a comprehensive comparison of these algorithms are absent. This is especially true for index-based homology search algorithms. This paper briefly introduces the main index construction methods. According to index construction method, index-based homology search algorithms are classified into three categories, i.e., length-based index ones, transformation-based index ones, and their combination. Based on this classification, the characteristics of the currently popular index-based homology search algorithms are compared and analyzed. At the same time, several promising new index techniques are also discussed. As a whole, the paper provides a survey of index-based homology search algorithms.
... In practice, fasta is 6.5 times faster than ssearch and almost as accurate (Brenner, Chothia & Hubbard 1998). In 1996, Williams and Zobel proposed the cafe index-based homology search tool (Williams & Zobel 1996, Williams & Zobel 2002). This scheme uses the inverted indexing techniques employed in text retrieval (Witten, Moffat & Bell 1999), which are now best known for their use in web search engines. ...
Article
Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/.
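The core speed trick described here, comparing four bases per numeric comparison by packing two bits per base into a byte, can be sketched as follows. This is a simplified illustration of the idea, not the FSA-BLAST code:

```python
# With two bits per base, one equality test on a byte compares
# four bases at once instead of four character comparisons.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack4(bases):
    """Pack exactly four bases into one integer byte value."""
    assert len(bases) == 4
    byte = 0
    for ch in bases:
        byte = (byte << 2) | CODE[ch]
    return byte

def count_matching_words(query, subject):
    """Count aligned 4-base words that agree, one numeric comparison each."""
    matches = 0
    for i in range(0, min(len(query), len(subject)) - 3, 4):
        if pack4(query[i:i + 4]) == pack4(subject[i:i + 4]):
            matches += 1
    return matches

assert count_matching_words("ACGTACGT", "ACGTTTTT") == 1  # only the first word matches
```

In the real system the collection is stored already packed, so the comparison needs no per-query packing step and, as the abstract notes, even gapped alignment can proceed without decompressing the collection sequences.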
... To allow for queries that start anywhere within a melody, indexing via n-grams could be useful, where an n-gram is a sequence of n consecutive notes and each such sequence is extracted from the music; similar techniques are used for the related problems of string indexing and genomic indexing [22,23]. With melodic data, however, it is not particularly useful to represent the data as an absolute pitch, since the same melody can be played or sung at different pitches. ...
Article
Full-text available
Large volumes of music are available online, represented in performance formats such as MIDI and, increasingly, in abstract notation such as SMDL. Many types of user would find it valuable to search collections of music via queries representing music fragments, but such searching requires a reliable technique for identifying whether a provided fragment occurs within a piece of music. The problem of matching fragments to music is made difficult by the psychology of music perception, because literal matching may have little relation to perceived melodic similarity, and by the interactions between the multiple parts of typical pieces of music. In this paper we analyse the properties of music, music perception, and music database users, and use the analysis to propose alternative techniques for extracting monophonic melodies from polyphonic music; we believe that such melodies can subsequently be used for matching of queries to data. We report on experiments with music listeners, which rank our proposed techniques for extracting melodies.
Article
In DNA-related research, mutations occur very often due to various environmental conditions, where a mutation is defined as a heritable change in the DNA sequence. Therefore, approximate string matching is applied to answer queries that find mutations. The problem of approximate string matching is that, given a user-specified parameter k, we want to find where substrings that differ from the query sequence by at most k errors occur in the database sequences. In this paper, we make use of a new index structure to support the proposed method for approximate string matching. In the proposed index structure, EII, we map each overlapping q-gram of the database sequence into an index key, and record the occurring positions of the q-gram in the corresponding index entry. In the proposed method, EOB, we first generate all possible mutations for each gram in the query sequence. Then, by utilizing the information recorded in the EII structure, we check both the local order (i.e., the order of characters in a gram) and the global order (i.e., the order of grams in an interval) of these mutations. The final answers can be determined directly without applying the dynamic programming used in traditional filter methods for approximate string matching. The experimental results show that our method outperforms the (k + s) q-samples filter, a well-known method for approximate string matching, in terms of processing time under various conditions for short query sequences.
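For contrast with the filter methods this abstract competes against, the classical q-gram counting filter is easy to sketch: by the q-gram lemma, a window matching a length-m query with at most k edit errors must share at least m + 1 - (k + 1)q q-grams with it, so windows below that threshold can be discarded before any dynamic programming. The sketch below is illustrative only, not the paper's EII/EOB method:

```python
def candidate_windows(text, query, k, q=3):
    """Starts of length-len(query) windows of text that survive the
    q-gram filter: a window within edit distance k of the query must
    share at least len(query) + 1 - (k + 1) * q of its q-grams with
    the query (the q-gram counting lemma)."""
    m = len(query)
    threshold = m + 1 - (k + 1) * q
    query_grams = {query[i:i + q] for i in range(m - q + 1)}
    survivors = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        shared = sum(1 for i in range(m - q + 1)
                     if window[i:i + q] in query_grams)
        if shared >= threshold:
            survivors.append(start)
    return survivors

hits = candidate_windows("TTTTTTTTTTACGTACGTTTTTTTTTT", "ACGTACGT", k=1)
assert 10 in hits        # the exact occurrence passes the filter
assert 0 not in hits     # an all-T window is discarded
```

Traditional filters then verify each surviving window with dynamic programming; the paper's contribution is to decide answers directly from the index instead.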
Article
Full-text available
In this paper, a new technique is developed to support the query relaxation in biological databases. Query relaxation is required due to the fact that queries tend not to be expressed exactly by the users, especially in scientific databases such as biological databases, in which complex domain knowledge is heavily involved. To treat this problem, we propose the concept of the so-called fuzzy equivalence classes to capture important kinds of domain knowledge that is used to relax queries. This concept is further integrated with the canonical techniques for pattern searching such as the position tree and automaton theory. As a result, fuzzy queries produced through relaxation can be efficiently evaluated. This method has been successfully utilized in a practical biological database - the GPCRDB.
Conference Paper
Today many applications routinely generate large quantities of data. The data often takes the form of (time) series, or more generally streams, i.e. an ordered sequence of records. Analysis of this data requires stream processing techniques which differ in significant ways from what current database analysis and query techniques have been optimized for. In this paper we present a new operator, called StreamJoin, that can efficiently be used to solve stream-related problems of various applications, such as universal quantification, pattern recognition and data mining. Contrary to other approaches, StreamJoin processing provides rapid response times, a non-blocking execution as well as economical resource utilization. Adaptability to different application scenarios is realized by means of parameters. In addition, the StreamJoin operator can be efficiently embedded into the database engine, thus implicitly using the optimization and parallelization capabilities for the benefit of the application. The paper focuses on the applicability of StreamJoin to integrate application semantics into the DBMS
Conference Paper
Full-text available
We describe the query interfaces of a practical biological database system-GPCRDB. Distinguishing features of the system include: an embedded smart query engine (for query relaxation), smooth integration of navigation with the more conventional SQL-based query mechanisms, and the top-down style of incremental query result presentation combined with flexible navigation capabilities. Query relaxation is important due to the fact that queries tend not to be expressed exactly by the users, particularly when complex domain knowledge is involved. Navigation capability is desired because it can be an ideal supplement to SQL-based query mechanisms when large, complex data sets are concerned, especially in the WWW environment where hyperlinks are heavily used. Top-down incremental presentation is one of the best ways for a user to conduct the data presentation/retrieval process more reasonably and efficiently toward the point of interest of the user without being lost in (unwanted) details
Conference Paper
Full-text available
We present a query system that has been implemented in a practical biological database-GPCRDB. Distinguishing features of this system include: smart query relaxation and smooth integration of navigation with conventional language-based query functions. Query relaxation is required due to the fact that queries are not always effective (in other words, expected results are frequently not achieved), particularly in scientific databases like biological databases, in which complex domain knowledge is heavily used. On the other hand, navigation capability is desired as complex data sets are involved, especially in a WWW based environment where multiple hyperlinks are often employed. For efficient implementation, the “fuzzy equivalence class” concept has been applied that captures an important type of domain knowledge
Article
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments and that index-based searching is as accurate as existing exhaustive search schemes
Article
Full-text available
GenBank, the national repository for nucleotide sequence data, has implemented a new model of scientific data management, which we term electronic data publishing. In traditional publishing, both scientific conclusions and supporting data are communicated via the printed page, and in electronic journal publishing, both types of information are communicated via electronic media. In electronic data publishing, by contrast, conclusions are published in a journal while data are published via a network-accessible, electronic database.
Article
Full-text available
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
Article
Full-text available
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replacability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
Article
Full-text available
Progress toward achieving the first set of goals for the genome project appears to be on schedule or, in some instances, even ahead of schedule. Furthermore, technological improvements that could not have been anticipated in 1990 have in some areas changed the scope of the project and allowed more ambitious approaches. Earlier this year, it was therefore decided to update and extend the initial goals to address the scope of genome research beyond the completion of the original 5-year plan. A major purpose of revising the plan is to inform and provide a new guide to all participants in the genome project about the project's goals. To obtain the advice needed to develop the extended goals, NIH and DOE held a series of meetings with a large number of scientists and other interested scholars and representatives of the public, including many who previously had not been direct participants in the genome project. Reports of all these meetings are available from the Office of Communications of the National Center for Human Genome Research (NCHGR) and the Human Genome Management Information System of DOE. Finally, a group of representative advisors from NIH and DOE drafted a set of new, extended goals for presentation to the National Advisory Council for Human Genome Research of NIH and the Health and Environmental Research Advisory Committee of DOE.
Article
Full-text available
Protein sequence alignments generally are constructed with the aid of a "substitution matrix" that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a "log-odds" matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
Article
Full-text available
In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching. KEY WORDS: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex
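The n-gram indexing idea tested in the paper can be sketched roughly as follows. The padding scheme, the tiny lexicon, and ranking candidates by shared n-gram count are illustrative simplifications, not the paper's exact method.

```python
from collections import defaultdict

def ngrams(word, n=2):
    padded = f"#{word}#"          # pad so word ends contribute n-grams
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(lexicon, n=2):
    """Map each n-gram to the set of lexicon words containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for g in ngrams(word, n):
            index[g].add(word)
    return index

def candidates(query, index, n=2):
    """Rank lexicon words by shared n-gram count (ties alphabetically)."""
    counts = defaultdict(int)
    for g in ngrams(query, n):
        for word in index.get(g, ()):
            counts[word] += 1
    return sorted(counts, key=lambda w: (-counts[w], w))

index = build_index(["receive", "recieve", "receipt", "deceive"])
print(candidates("receeve", index)[0])  # -> "receive"
```

A real system would then apply an edit-distance or similar string measure to the top candidates only, which is the combination the abstract evaluates.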
Article
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
Given a relatively short query string W of length P, a long subject string A of length N, and a threshold D, the approximate keyword search problem is to find all substrings of A that align with W with no more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the "database" A is much larger than that of the query W, e.g., N is on the order of millions or billions and P is a hundred to a thousand. In this paper we present an algorithm that, given a precomputed index of the database A, finds rare matches in time that is sublinear in N, i.e., N^c for some c < 1. The sequence A must be over a finite alphabet Σ. More precisely, our algorithm requires O(D N^pow(ε) log N) expected time, where ε = D/P is the maximum number of differences as a percentage of query length, and pow(ε) is an increasing and concave function that is 0 when ε = 0. Thus the algorithm is superior to current O(DN) algorithms when ε is small enough to guarantee that pow(ε) < 1. As seen in the paper, this is true for a wide range of ε, e.g., ε up to 33% for DNA sequences (|Σ| = 4) and 56% for protein sequences (|Σ| = 20). In preliminary practical experiments, the approach gives a 50- to 500-fold improvement over previous algorithms for problems of interest in molecular biology.
Conference Paper
Full-text retrieval systems often use either a bitmap or an inverted file to identify which documents contain which terms, so that the documents containing any combination of query terms can be quickly located. Bitmaps of term occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. Here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. Our experimental results show that the new methods yield better compression than previous techniques.
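One family of parameterised codes that fits this description is Golomb coding, with the parameter chosen per bitvector from that vector's density. The sketch below uses a simplified fixed-width remainder rather than true truncated binary, and the parameter rule of thumb is the classic 0.69-times-average-gap heuristic; neither is claimed to be the paper's exact scheme.

```python
import math

def golomb_encode(gaps, b):
    """Encode a list of gaps (each >= 1) with Golomb parameter b.
    Simplified variant: unary quotient + fixed-width binary remainder."""
    bits = []
    width = max(1, math.ceil(math.log2(b))) if b > 1 else 0
    for gap in gaps:
        q, r = divmod(gap - 1, b)
        bits.append("1" * q + "0")          # unary quotient
        if width:
            bits.append(format(r, f"0{width}b"))
    return "".join(bits)

def choose_b(n_bits, n_set):
    """Rule of thumb: b is about 0.69 times the average gap."""
    return max(1, round(0.69 * n_bits / n_set))

positions = [3, 7, 8, 20]                    # set bits in a 32-bit vector
gaps = [positions[0] + 1] + [b - a for a, b in zip(positions, positions[1:])]
b = choose_b(32, len(positions))
print(b, golomb_encode(gaps, b))
```

Because the parameter is stored per bitvector, dense vectors (small b) and sparse vectors (large b) each get a code tuned to their own gap distribution, which is the source of the compression gain the abstract reports.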
Article
An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.
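The algorithm described is Knuth-Morris-Pratt. A compact Python rendering, for illustration:

```python
def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text,
    in time proportional to len(text) + len(pattern)."""
    if not pattern:
        return []
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]          # continue for overlapping matches
    return matches

print(kmp_search("abababca", "abab"))   # -> [0, 2]
```

The failure table is what keeps the scan linear: on a mismatch the pattern is shifted by a precomputed amount instead of restarting the comparison, so no text character is examined more than a constant number of times.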
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
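The self-indexing idea can be illustrated with skip pointers sampled from a postings list (shown uncompressed here, for simplicity). The square-root skip interval and the two-list conjunctive intersection below are illustrative choices, not the paper's exact design.

```python
import math

def build_skips(postings):
    """Sample (index, docid) skip pointers every sqrt(n) entries."""
    step = max(1, int(math.sqrt(len(postings))))
    return [(i, postings[i]) for i in range(0, len(postings), step)]

def intersect(short_list, long_list):
    """Conjunctive intersection; both lists sorted ascending.
    Skips let us jump over runs of the long list instead of
    decoding/scanning every posting."""
    skips = build_skips(long_list)
    result, lo = [], 0
    for doc in short_list:
        # jump to the last skip entry at or before doc
        for idx, d in skips:
            if d <= doc and idx >= lo:
                lo = idx
        # linear scan only within the selected block
        while lo < len(long_list) and long_list[lo] < doc:
            lo += 1
        if lo < len(long_list) and long_list[lo] == doc:
            result.append(doc)
    return result

print(intersect([3, 9, 40], [1, 3, 5, 9, 12, 17, 25, 33, 40, 52]))
```

In a compressed list the skipped-over block never has to be decompressed at all, which is where the CPU saving the abstract reports comes from.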
Article
To provide keyword-based access to a large text file it is usually necessary to invert the file and create an inverted index that stores, for each word in the file, the paragraph or sentence numbers in which that word occurs. Inverting a large file using traditional techniques may take as much temporary disk space as is occupied by the file itself, and consume a great deal of cpu time. Here we describe an alternative technique for inverting large text files that requires only a nominal amount of temporary disk storage, instead building the inverted index in compressed form in main memory. A program implementing this approach has created a paragraph level index of a 132 Mbyte collection of legal documents using 13 Mbyte of main memory; 500 Kbyte of temporary disk storage; and approximately 45 cpu-minutes on a Sun SPARCstation 2.
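The single-pass, in-memory flavour of the approach can be caricatured in a few lines; a real implementation keeps the in-memory lists compressed, which this sketch omits.

```python
from collections import defaultdict

def invert(paragraphs):
    """Build a paragraph-level inverted index in one pass over the text,
    entirely in main memory (lists left uncompressed in this sketch)."""
    index = defaultdict(list)
    for num, para in enumerate(paragraphs, start=1):
        for word in set(para.lower().split()):   # one posting per paragraph
            index[word].append(num)
    return dict(index)

docs = ["the cat sat", "the dog ran", "cat and dog"]
idx = invert(docs)
print(idx["cat"], idx["dog"])   # -> [1, 3] [2, 3]
```

Because paragraphs are processed in order, each postings list is built already sorted, which is what makes compressing the lists as they grow practical.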
Article
The Portable Dictionary of the Mouse Genome is a database for personal computers that contains information on approximately 10,000 loci in the mouse, along with data on homologs in several other mammalian species, including human, rat, cat, cow, and pig. Key features of the dictionary are its compact size, its network independence, and the ability to convert the entire dictionary to a wide variety of common application programs. Another significant feature is the integration of DNA sequence accession data. Loci in the dictionary can be rapidly resorted by chromosomal position, by type, by human homology, or by gene effect. The dictionary provides an accessible, easily manipulated set of data that has many uses--from a quick review of loci and gene nomenclature to the design of experiments and analysis of results. The Portable Dictionary is available in several formats suitable for conversion to different programs and computer systems. It can be obtained on disk or from Internet Gopher servers (mickey.utmen.edu or anat4.utmen.edu), an anonymous FTP site (nb.utmem.edu in the directory pub/genedict), and a World Wide Web server (http://mickey.utmem.edu/front.html).
Article
Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average code-word length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the n th code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
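A concrete member of this family of universal prefix codes is the Elias gamma code, in which an integer n >= 1 is written as floor(log2 n) zeros followed by the binary representation of n:

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: (len-1) zeros, then n in binary."""
    binary = bin(n)[2:]                   # e.g. 9 -> "1001"
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a single gamma codeword from the front of a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))                    # -> "0001001"
print(gamma_decode("0001001"))            # -> 9
```

Shorter codewords go to smaller integers, so assigning codewords in order of increasing length to messages in order of decreasing probability gives the bounded-redundancy behaviour the abstract describes.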
Article
There are several advantages to be gained by storing the lexicon of a full-text database in main memory. In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term. Our experiments show that this method provides an effective compromise between speed and space, running orders of magnitude faster than brute-force search, but requiring less memory than other pattern-matching data structures; indeed, in some cases requiring less memory than would be consumed by a single pointer to each string. The pattern search method is based on text indexing techniques and is a successful adaptation of inverted files to main-memory databases.
Issues in searching molecular sequence databases
• S. Altschul
• M. Boguski
• W. Gish
• J. Wootton
GenBank
• D. Benson
• D. J. Lipman
• J. Ostell