Conference Paper

Genomic Information Retrieval.

Abstract

The in-silico revolution has changed how biologists characterise DNA and protein sequences. As a first step to exploring the structure and function of an unknown sequence, biologists search large genomic databases for similar sequences. This process of genomic information retrieval has allowed significant advances in biology and led to advancements in critical areas such as cancer research. In this paper, we present a background to genomic information retrieval by describing the problems, collections, and techniques used by biologists for searching large collections. In particular, we identify the problems inherent in the popular search techniques, and discuss how index-based approaches may be applied to solve these problems. We conclude by offering the challenge that information retrieval specialists must continue to make significant contributions to allow further advances in molecular biology research.
[Figure: GenBank size (millions of bases; axis from 1 to 10,000) plotted against year (1985 to 2000).]
... The four 3-grams "str", "tri", "rin", and "ing" are extracted from "string", and the two 3-grams "dat" and "ata" are extracted from "data". We note that, owing to the 1-sliding technique, the offsets of consecutive 3-grams extracted from the same string differ by 1. [Figure 2: An example of posting lists of the 3-gram index.] ...
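The 1-sliding extraction described in the excerpt above can be illustrated with a minimal Python sketch. The function names `extract_ngrams` and `build_posting_lists` and the (document id, offset) posting layout are assumptions made for illustration; they are not taken from the cited paper.

```python
from collections import defaultdict


def extract_ngrams(text, n=3):
    """Return (offset, n-gram) pairs, sliding the window one character at a time."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def build_posting_lists(documents, n=3):
    """Map each n-gram to a list of (document id, offset) postings.

    The posting layout is an illustrative assumption, not the paper's format.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for offset, gram in extract_ngrams(text, n):
            index[gram].append((doc_id, offset))
    return dict(index)


# Usage: the two example strings from the excerpt.
grams_string = [gram for _, gram in extract_ngrams("string")]
grams_data = [gram for _, gram in extract_ngrams("data")]
index = build_posting_lists({0: "string", 1: "data"})
```

Because the window advances by one character, consecutive n-grams from the same string always have offsets that differ by 1, which is the property the excerpt notes.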
... Compared with earlier methods, CAFE achieves comparable search accuracy but is several times to several tens of times faster. However, it has been pointed out that the CAFE index is significantly larger than the original database [6]. ...
Conference Paper
The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval and in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and query performance tends to be poor. In this paper, we propose the [...] from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by up to 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram inverted index.
... These schemes avoid exhaustive search through the use of a disk-based inverted index to support fast identification of sequences with broad similarity to a query. Unlike the exhaustive schemes, index-based approaches do not rely on the entire collection fitting into main memory for reasonable search performance [Williams, 2003]. Early approaches to index-based search included scan [Orcutt and Barker, 1984], flash [Califano and Rigoutsos, 1993], ramdb [Fondrat and Dessen, 1995], rapid [Miller et al., 1999] and the work by Myers [1994]; however, the most successful approach is the cafe index-based homology search tool [Williams and Zobel, 1996;. ...
... has not found widespread acceptance due to several drawbacks with the cafe approach and index-based schemes in general [Williams, 2003]: ...
... • The set of accumulators used by cafe to perform ranking with the frames metric is large and must reside in main-memory [Williams, 2003]. The data structure increases in size with longer queries and larger collections. ...
... Indexing is a method that reduces the cost and time of genomic searching [Williams and Zobel 2002]. In other words, indexing can ease memory and storage capacity problems by cutting down the memory storage requirement and access time [Williams 2003; Jiang et al. 2007]. In the absence of better methods, indexing is necessary to solve the serious problems arising from the use of exhaustive search methods [Williams and Zobel 2002]. ...
... Some approaches require a large index; RAMdb, for example, requires an index about twice the size of the original flat-file database (storage problem) [Jiang et al. 2007]. In FASTA and BLAST, all sequences must be in main memory when the program executes, so a large amount of space is needed (storage and memory problems) [Williams 2003]. Inverted files have suffered from poor retrieval accuracy and as such are not very useful for small query proteins with few SSEs (poor accuracy) [Gao and Zaki 2008]. ...
Article
The size of biological databases has increased significantly with the growing number of users and the rate of queries, and some databases are now of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. For biologists, the need is for fast, scalable and accurate searching of biological databases. This may seem a simple task, given the speed of currently available processors. However, this is far from the truth, as the amount of data deposited into the databases is ever increasing. Hence, searching the database becomes a difficult and time-consuming task. Here, the computer scientist can help to organize data in a way that allows biologists to quickly search existing information. In this paper, a decision tree indexing model for DNA and protein sequence datasets is proposed. This method of indexing can effectively and rapidly retrieve all the similar proteins from a large database for a given protein query. A theoretical and conceptual framework is derived, based on published works using indexing techniques for different applications. The methodology was then validated by extensive experiments using 10 data sets of varying sizes for DNA and protein. The experimental results show that the proposed method reduced the searching space to an average of 97.9% for DNA and 98% for protein, compared to the Top Down Disk-based suffix tree methods currently in use. Furthermore, the proposed method was about 2.35 times faster for DNA and 29 times faster for protein than the BLAST+ algorithm in terms of query processing time.
... It is well known to biologists that the amino acid arrangements of proteins determine their structures and functions. Therefore, it is possible to predict the functions, roles, structures and categories of newly discovered proteins by searching for the proteins whose amino acid arrangements are similar to those of newly discovered proteins [1][20]. However, proteins, in evolutionary history, rarely conserve their amino acid arrangement while retaining their structure [6][16][17]. ...
... BLAST is essentially based on the sequential scan method, but it makes use of heuristic algorithms to reduce the number of sequences to be aligned against a query. However, BLAST still has two main drawbacks [20]: 1) the entire data set should be loaded into main memory for fast searching, and 2) since it is based on sequential access, its execution time is directly proportional to the number of sequences in the database. Due to these drawbacks, index-based approaches for approximate searching are in demand. ...
Conference Paper
Approximate searching on the primary structure (i.e., amino acid arrangement) of protein sequences is an essential part in predicting the functions and evolutionary histories of proteins. However, because proteins distant in an evolutionary history do not conserve amino acid residue arrangements, approximate searching on the proteins' secondary structure is quite important in finding out distant homology. In this paper, we propose an indexing scheme for efficient approximate searching on the secondary structure of protein sequences. Exploiting the concept of clustering and lookahead, the proposed indexing scheme processes three types of secondary structure queries (i.e., exact match, range match, and wildcard match) very quickly. To evaluate the performance of the proposed method, we conducted extensive experiments using a set of actual protein sequences. According to the experimental results, the proposed method was proved to be 6.3 times faster in exact match, 3.3 times faster in range match, and 1.5 times faster in wildcard match compared to the existing indexing methods.
... It is well known to biologists that the amino acid arrangements of proteins determine their structures and functions. Therefore, it is possible to predict the functions, roles, structures, and categories of newly discovered proteins by searching for the proteins whose amino acid arrangements are similar to those of newly discovered proteins [1, 19]. However, the amino acid arrangement of one protein is rarely preserved in another protein if the two proteins are distant in an evolutionary history [5, 15, 16]. ...
... BLAST is essentially based on the sequential scan method, but it makes use of heuristic algorithms to reduce the number of sequences to be aligned against a query. However, BLAST still has two main drawbacks [19]: (1) the entire data set should be loaded into main memory for fast searching, and (2) since it is based on sequential access, its execution time is directly proportional to the number of sequences in the database. Due to these drawbacks, index-based approaches for approximate searching are in demand. ...
Article
Approximate searching on the primary structure (i.e., amino acid arrangement) of protein sequences is an essential part in predicting the functions and evolutionary histories of proteins. However, because proteins distant in an evolutionary history do not conserve amino acid residue arrangements, approximate searching on proteins' secondary structure is quite important in finding out distant homology. In this paper, we propose an indexing scheme for efficient approximate searching on the secondary structure of protein sequences which can be easily implemented in an RDBMS. Exploiting the concept of clustering and lookahead, the proposed indexing scheme processes three types of secondary structure queries (i.e., exact match, range match, and wildcard match) very quickly. To evaluate the performance of the proposed method, we conducted extensive experiments using a set of actual protein sequences. According to the experimental results, the proposed method was shown to be up to 6.3 times faster in exact match, 3.3 times faster in range match, and 1.5 times faster in wildcard match than the existing indexing methods.
... RAMdb is 800 times faster than the exhaustive approaches. Its only limitation lies in the large size of the inverted index [7]. Sequence Search and Alignment by Hashing Algorithm (SSAHA) [8] splits a search into two steps, the first of which breaks the sequence into consecutive k-tuples, each containing k contiguous bases. ...
... Consequently, the servers suffer heavy loading for index retrieval [25]. ...
Article
The rapid growth of genomic databases and the increase in queries against those databases have led to the need for new and efficient search and comparison techniques. Researchers in bioinformatics have concentrated on exploring different approaches to solve the problem of the cost associated with exhaustive search techniques. One of these is the CAFE indexing algorithm, which is considered to be a fast indexing algorithm in genomic information retrieval. However, there is still room for improvement in the CAFE indexing structure. This research aims to enhance the structure of the CAFE inverted index by using a proper hash function to speed up the retrieval process. The results indicate that retrieval using the enhanced index is faster than retrieval using the original CAFE index. The benefit ratio of the retrieval time using the enhanced CAFE index compared to the original CAFE index is between 62.8 and 74.9 for one query. However, we found that the memory space for storing the indexes is the same for both algorithms. The reason is that although the interval size decreases, each interval now has an increased number of posting lists.
... Furthermore, in the back-end index, Len(Q) has little effect on the query processing time since |Σ|^(m−n) is dominant (for example, when |Σ| = 26, m = 6, and n = 3, |Σ|^(m−n) = 17,576). This is also an excellent property since it has been pointed out that the query performance of the n-gram index for long queries tends to be bad [30]. ...
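The example value in the excerpt above follows directly from the stated parameters. The worked step below assumes that the symbol lost in extraction is the alphabet Σ, so that |Σ| = 26 is the alphabet size:

```latex
\[
|\Sigma|^{m-n} = 26^{\,6-3} = 26^{3} = 17{,}576
\]
```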
Article
As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever more important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval and in approximate string matching due to its two major advantages: it is language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: its size tends to be very large, and query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance by using relational normalization theory. We first identify that, in the (full-text) n-gram index, there exists redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates such redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index, with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1GB show that the size of the n-gram/2L index is reduced by up to 1.9–2.4 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram index. We also compare the n-gram/2L index with Mäkinen's compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching pp. 305–319, 2000) stored on disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15–20), and the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15–20).
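To make the two-level idea in the abstract above concrete, here is a rough Python sketch. It is not the authors' implementation. It assumes that documents are cut into fixed-length subsequences of length `m` that overlap by `n - 1` characters (an assumption chosen so that no n-gram spanning a boundary is lost). It also assumes that the back-end index maps subsequences to document positions while the front-end index maps n-grams to positions inside subsequences. All function names are hypothetical.

```python
from collections import defaultdict


def split_subsequences(text, m, n):
    """Cut text into length-m pieces overlapping by n - 1 characters (assumed scheme)."""
    step = m - (n - 1)  # requires m >= n so that step >= 1
    pieces = []
    for start in range(0, len(text), step):
        pieces.append((start, text[start:start + m]))
        if start + m >= len(text):
            break
    return pieces


def build_two_level_index(documents, m, n):
    """Build (front_end, back_end) indexes for a dict of doc_id -> text."""
    back_end = defaultdict(list)   # subsequence -> [(doc_id, offset in document)]
    front_end = defaultdict(set)   # n-gram -> {(subsequence, offset in subsequence)}
    for doc_id, text in documents.items():
        for start, piece in split_subsequences(text, m, n):
            back_end[piece].append((doc_id, start))
            for i in range(len(piece) - n + 1):
                front_end[piece[i:i + n]].add((piece, i))
    return front_end, back_end


def lookup(front_end, back_end, gram):
    """Recover (doc_id, offset) positions of an n-gram by joining the two levels."""
    hits = set()
    for piece, i in front_end.get(gram, ()):
        for doc_id, start in back_end[piece]:
            hits.add((doc_id, start + i))
    return sorted(hits)


# Usage: a subsequence that recurs is stored once in the front-end index.
docs = {0: "abcdabcd", 1: "xabcd"}
front_end, back_end = build_two_level_index(docs, m=4, n=2)
positions = lookup(front_end, back_end, "ab")
```

In this sketch, the n-gram offsets inside a repeated subsequence are recorded only once in the front-end index, however many times that subsequence occurs in the documents. This is the kind of position redundancy the abstract says the two-level construction removes.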
... Index-based search tools look up subsequences and their corresponding posting lists in some well-defined data structures. For example, FLASH [1], RAMDB [4], MAP [27] and CAFE [7][8][9][10] have adopted indexing techniques in their search tools. The advantage of index-based search tools over the exhaustive ones is that the pre-built indices can help to speed up the search process. ...
Conference Paper
Indexing and retrieval techniques for homology searching of genomic databases are increasingly important as search tools face the challenge of rapid growth in sequence collection size. Consequently, the indexing and retrieval of possibly gigabytes of sequences become expensive. In this paper, we present two new approaches for indexing genomic databases that can enhance the speed of indexing and retrieval. We show experimentally that the proposed methods can be more computationally efficient than the existing ones.
Article
PIR-International is an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. A major objective of PIR-International is to continue the development of the Protein Sequence Database as an essential public resource for protein sequence information. This paper briefly describes the architecture of the Protein Sequence Database and how it and associated data sets are distributed and can be accessed electronically.
Article
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replaceability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
Article
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer and unambiguously detect proteins that shared a common ancestor billions of years in the past. FASTA is both fast and selective because it initially considers only amino acid identities. Its sensitivity is increased not only by using the PAM250 matrix to score and rescore regions with large numbers of identities but also by joining initial regions. The results of searches with FASTA compare favorably with results using NWS-based programs that are 100 times slower. FASTA is slightly less sensitive but considerably more selective. It is not clear that NWS-based programs would be more successful in finding distantly related members of the G-protein-coupled receptor family. The joining step by FASTA to calculate the initn score is especially useful for sequences that share regions of sequence similarity that are separated by variable-length loops.
Article
With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.
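The idea of comparing sequences by matching k-tuples, as described in the abstract above, can be illustrated with a simplified Python sketch. It is not the published algorithm: it only tabulates the k-tuples of one sequence and counts, per diagonal, how many k-tuples the two sequences share. The function names and the diagonal convention (target offset minus query offset) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k):
    """Map each k-tuple in seq to the list of offsets where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k=2):
    """Count shared k-tuples per diagonal (target offset minus query offset)."""
    table = ktuple_positions(target, k)
    diagonals = Counter()
    for q in range(len(query) - k + 1):
        for t in table.get(query[q:q + k], ()):
            diagonals[t - q] += 1
    return diagonals


# Usage: the query occurs in the target shifted by two positions.
hits = diagonal_hits("ACGT", "TTACGT", k=2)
best_diagonal, shared = hits.most_common(1)[0]
```

A diagonal with many shared k-tuples marks a candidate region of similarity. Looking up k-tuples in a table avoids comparing every pair of positions, which is where the time saving over full alignment comes from in this style of method.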