Article

Abstract

Genomic sequence databases are widely used by molecular biologists for homology searching. Amino acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments and that index-based searching is as accurate as existing exhaustive search schemes.
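The coarse-search idea in the abstract can be illustrated with a minimal sketch (our own illustration, not the CAFE implementation; all names and thresholds below are assumptions): index every overlapping k-mer of each database sequence, rank sequences by how many k-mers they share with the query, and pass only high-scoring candidates on to the expensive local alignment.

from collections import defaultdict

def build_kmer_index(sequences, k=3):
    index = defaultdict(set)               # k-mer -> ids of sequences containing it
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def coarse_candidates(index, query, k=3, min_shared=2):
    counts = defaultdict(int)              # sequence id -> shared k-mer count
    for i in range(len(query) - k + 1):
        for seq_id in index.get(query[i:i + k], ()):
            counts[seq_id] += 1
    # Only candidates above the threshold proceed to costly local alignment.
    return [s for s, c in counts.items() if c >= min_shared]

Only the returned candidates would then be aligned, which is where the reported savings in local-alignment work come from.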


... By 2025, genomic data alone will increase at a rate of 1 zettabase per year (1 zettabase = 10^21 bases) [7]. Genomic data is growing faster than storage and transmission bandwidth, putting a lot of pressure on storage and data transmission [8][9][10]. Storing genomic data efficiently and reducing the pressure of storage and data migration is of great significance in genomic research and application [11]. ...
... if pos = -1 then
    t_seq[i] is a mismatched character, recorded to the mismatched information;
else
    while pos != -1 do
        set l = k, p = pos;
        while t_seq[i + l] = r_seq[p + l] do
            l = l + 1;
        end while
        if l_max < l then
            l_max = l, pos_max = p;
        end if
        update pos = L[pos];
    end while
end if
record the mismatched string to the mismatched information;
record the matched string to the matched information.
The position of each match is predicted from the previous matched entity, and the storage space can be reduced by storing the difference between the real position and the predicted position. Finally, all the above-encoded information is encoded and stored using the PPMD encoder. ...
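As a runnable companion to the pseudocode above, here is a small Python sketch of the same greedy longest-match idea (an illustration under our own naming, not the HRCM authors' code): the reference is indexed by k-mer, and a chain array plays the role of L[pos], linking earlier occurrences of the same k-mer.

def greedy_match(r_seq, t_seq, k=11):
    head, chain = {}, [-1] * len(r_seq)    # k-mer -> latest position; chain links older positions
    for p in range(len(r_seq) - k + 1):
        kmer = r_seq[p:p + k]
        chain[p] = head.get(kmer, -1)
        head[kmer] = p
    out, i = [], 0                         # output: (position, length) matches or literal chars
    while i < len(t_seq):
        pos = head.get(t_seq[i:i + k], -1)
        l_max, pos_max = 0, -1
        while pos != -1:                   # walk the chain of candidate positions
            l = 0
            while (i + l < len(t_seq) and pos + l < len(r_seq)
                   and t_seq[i + l] == r_seq[pos + l]):
                l += 1
            if l > l_max:
                l_max, pos_max = l, pos
            pos = chain[pos]
        if l_max >= k:
            out.append((pos_max, l_max))   # record matched information
            i += l_max
        else:
            out.append(t_seq[i])           # record mismatched character
            i += 1
    return out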
...
        update start_position = j + 1;
    end if
end for
if matched_lowercase[i] = 0 then
    for j = start_position - 1 to 1 do
        if t_lowercase[i] = r_lowercase[j] then
            matched_lowercase[i] = j;
            update start_position = j + 1;
        end if
    end for
end if
if matched_lowercase[i] = 0 then
    mismatched_lowercase[index] = t_lowercase[i];
    update index = index + 1;
end if
end for
Output: mismatched lowercase information; matched lowercase information; ...
Article
Full-text available
With the maturity of genome sequencing technology, huge amounts of sequence reads as well as assembled genomes are being generated. With the explosive growth of genomic data, the storage and transmission of genomic data are facing enormous challenges. FASTA, as one of the main storage formats for genome sequences, is widely used in GenBank because it eases sequence analysis and gene research and is easy to read. Many compression methods for FASTA genome sequences have been proposed, but they still have room for improvement: for example, compression ratio and speed are not high or robust enough, and memory consumption is not ideal. Therefore, it is of great significance to improve the efficiency, robustness, and practicability of genomic data compression to further reduce the storage and transmission cost of genomic data and promote the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method able to compress a single sequence as well as large collections of sequences. It is implemented through three stages: sequence information extraction, sequence information matching, and sequence information encoding. A large number of experiments fully evaluated the performance of HRCM. Experimental verification shows that HRCM is superior to the best-known methods in genome batch compression. Moreover, HRCM memory consumption is relatively low, and it can be deployed on standard PCs.
... Linear-scan-based systems include FASTA and BLAST, which are described in Chapter 5. Index-based systems, such as BLAT [79] and CAFE [80,81], perform a query using a pre-built index of the database. Although linear-scan-based systems are faster than index-based systems for smaller databases, as the size of the genomic sequence databases continually increases, index-based systems are more and more appealing. ...
... Candidate regions are generated by searching these overlapping words. Heuristics, such as FRAMES [80], are used to reduce the number of candidate regions passed to the alignment stage. ...
... FASTA is considered the most accurate (sensitive) system, while BLAST is more popular and faster but less sensitive. Index-based systems, such as BLAT [79], CAFE [80,81], and Suffix Sequoia [111,86], perform a query using a pre-built index of the database. Although linear-scan-based systems are faster than index-based systems for smaller databases, as the size of genomic sequence databases continually increases, index-based systems are more and more appealing for their efficiency. ...
... A more recent approach to the problem of efficient local alignment, known as cafe, is described by Williams and Zobel [2002]. A significant cost of string search using fasta or blast is the requirement that the data being searched be fully scanned. ...
... Inverted indexes are used for data that can be segmented into semantically atomic units (terms). In cafe, Williams and Zobel [2002] adapt this technique to string matching by segmenting the sequences into fixed-length, overlapping substrings. The index structure used in cafe stores an identifier for the string in which the substring occurs, along with the offset of the occurrence (in characters) from the beginning of the string. ...
... Timing experiments with Cafe, reported by Williams and Zobel [2002], demonstrate large improvements in execution time compared to blast and fasta. ...
Article
Digital formats are widely used for representation and distribution of video. Such formats have enabled easy widespread distribution of video, but allow the content to be easily copied and transformed. In recent work we described signature representations for video that allow search for copies of the same content, even after addition of noise or changes to frame rate, bit rate, and resolution. However, these methods, although faster than previous approaches, are still prohibitively slow for large volumes of data. In this paper we propose methods for quantizing the signatures, allowing them to be indexed with an inverted file. Our experiments show that our new approach provides reasonable accuracy in search, and would allow thousands of hours of video to be searched in a few seconds on a typical desktop computer.
... The compression algorithm is lossless and has a good compression rate (Rivals et al., 1997). Williams and Zobel (2002) proposed a model for compressing integers for fast file access. Compression consists of a model for data, which is used to determine codes for each symbol. ...
... Compression consists of a model for data, which is used to determine codes for each symbol. A Huffman model is used for compressing text since it allows order-independent decompression (Williams and Zobel, 2002). The data can be modeled using simple tokens; using a token for each integer, for example. ...
... Suffix trees provide a fast method to query large sequences; however, they require a great deal of space and are difficult to load. Williams and Zobel (2002) presented an indexing technique for genomic databases named CAFÉ. The authors discuss the disadvantages of their system, including the time to build an index and the space needed to store the index on disk; however, the advantages greatly outweigh the costs when fast, scalable searching is considered. ...
... Approximate matches are sometimes more important to detect mutation and homology. Special indices [45,62,82,91] are designed according to the characteristics of DNA sequences to address the efficiency and the effectiveness of the results. ...
... With the increasing interest on genomic research, various DNA sequence searching systems [7,16,17,30,41,45,59,79,91] have been developed to support different objectives. Some methods locate similar regions in the sequence database by sequential scan while others index the databases using novel data structures which can speed up homology search processes where homology means the similarity in different DNA sequences. ...
... CAFE was proposed by Williams et al. as a searching algorithm in a research prototype system. CAFE [91] is based on techniques used in text retrieval and in approximate string matching used for databases of names. It contains two components. ...
... The online interface to the tool at the popular National Center for Biotechnology Information (NCBI) website is used to evaluate over 120,000 queries each day [McGinnis and Madden, 2004], and the 1997 paper describing the algorithm [Altschul et al., 1997] has been cited more than 10,000 times. Blast remains the most successful approach to homology search despite a plethora of more recent methods such as index-based approaches [Kent, 2002; Williams and Zobel, 2002] and discontinuous seeds [Ma et al., 2002]. ...
... In Chapter 3, we also discuss a range of alternative approaches to homology search, each of which provides advantages as well as disadvantages over traditional approaches such as blast. Index-based approaches such as cafe [Williams and Zobel, 2002], blat [Kent, 2002], PatternHunter [Ma et al., 2002], and ssaha [Ning et al., 2001] rely on an index structure such as those commonly employed in text retrieval to efficiently search large collections. Discontinuous seeds have received considerable attention of late, and we survey existing literature concerning this approach to the first, hit-detection stage of homology search. ...
... We also propose optimisations that are specific to the iterative process. Next, we consider integrating the efficient comparison and filtering techniques employed by blast into index-based approaches such as cafe [Williams and Zobel, 2002], and propose extending our work on duplicate detection to nucleotide data. Finally, we discuss the benefits of applying our novel fingerprinting algorithm described in Chapter 7 to English text, and consider tighter integration of the individual stages of blast. ...
... Database index searching is one of the approaches adopted to achieve fast, accurate and efficient searching. Indexed genomic searching is a method that reduces the cost and time of searching [Williams and Zobel 2002]. In other words, the index method can efficiently solve such memory storage capacity problems by cutting down the memory storage requirement and access time [Williams 2003; Jiang et al. 2007]. ...
... In other words, the index method can efficiently solve such kinds of memory storage capacity problems by cutting down the memory storage requirement and access time [Williams 2003;Jiang et al. 2007]. In the absence of other better methods, the index method is necessary to solve the serious problems arising from the use of the exhaustive search methods [Williams and Zobel 2002]. The methods used for genomic indexing approaches are classified under three categories which are shown in Figure 1. ...
... From the review of previous methods, it was noticed that some of the popular indexing methods, such as CAFÉ [Williams and Zobel 2002] and Top Down Disk-based (TDD) suffix trees [Tian et al. 2005; Tata et al. 2004], do not support searching for multiple queries, called query multiplexing or packing. To address this problem, the hybrid in the proposed method is equipped with query packing, to handle multiple queries and reduce the overhead of reading the queries repeatedly. ...
Article
Full-text available
Currently, the size of biological databases has increased significantly with the growing number of users and the rate of queries; some databases are of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. For biologists, the need is for fast, scalable and accurate searching of biological databases. This may seem a simple task, given the speed of currently available processors. However, this is far from the truth, as the amount of data deposited into databases is ever increasing, so searching a database becomes a difficult and time-consuming task. Here, the computer scientist can help to organize data in a way that allows biologists to quickly search existing information. In this paper, a decision tree indexing model for DNA and protein sequence datasets is proposed. This method of indexing can effectively and rapidly retrieve all similar proteins from a large database for a given protein query. A theoretical and conceptual framework is derived, based on published works using indexing techniques for different applications. The methodology was then validated by extensive experiments using 10 datasets of varying sizes for DNA and protein. The experimental results show that the proposed method reduced the search space by an average of 97.9% for DNA and 98% for protein, compared to the Top Down Disk-based suffix tree methods currently in use. Furthermore, the proposed method was about 2.35 times faster for DNA and 29 times faster for protein than the BLAST+ algorithm, in terms of query processing time.
... The authors of Cafe [19] worked to find a way to predefine likely sequence alignments to reduce query evaluation costs. They managed to reduce this cost by 40%-90% [18]. We then define a value W, which is the window size. It is shown that CAFE is faster but has slightly lower precision than BLAST when searching for very similar sequences. ...
... If we can get good results with this naive approach, there is a lot of room for improvement in indexing time and size by specializing the n-gram indexer for DNA. We would be very happy with precision and recall of 95% and above and query execution times below 500 ms [19]; this would place us among the best of the current technologies for searching in DNA and show that we have a strong base for exploring the world of searching DNA further. The major focus of our thesis is to find out whether n-grams are useful in DNA search and, if they are, what size of n-gram would be best. ...
... This is the data structure that, for a given word, very quickly gives the list of documents in which it appears. In the context of similarity search over genomic sequences, the CAFE software [57,58,59], which uses this data structure, has been proposed. ...
... For DNA alignment [59], the authors reported that CAFE can be more than eighty times faster than FASTA and eight times faster than BLAST. However, CAFE does not use dynamic programming to optimise the local alignment. ...
Article
Full-text available
The sequence comparison process is one of the main bioinformatics tasks. New sequencing technologies are leading to a rapid increase in genomic data and strengthen the need for fast and efficient tools to perform this task. In this thesis, a new algorithm for intensive sequence comparison is proposed. It has been specifically designed to exploit all forms of parallelism of today's microprocessors (SIMD instructions, multi-core architecture). This algorithm is also well suited to hardware accelerators such as FPGA or GPU boards. The algorithm has been implemented in the PLAST software (Parallel Local Alignment Search Tool). Different versions are available according to the data to process (protein and/or DNA). An MPI version has also been developed. Depending on the nature of the data and the type of technology, speedups from 3 to 20 have been measured compared with the reference software, BLAST, at the same level of quality.
... Like Cafe [2] and Flash [3], Dash uses an index of the database to facilitate fast retrieval. In this paper we detail exploratory work which has yielded an order-of-magnitude speed improvement compared to the common NCBI-Blast nblast program. ...
... The list for each tuple is referred to as a column. (U is the RNA equivalent of T.) The index data is approximately an order of magnitude larger than the database itself. To tackle this, the index data is stored compressed, a concept successfully pioneered by Cafe [2]. To determine an appropriate compression algorithm for Dash, the entropy characteristics were explored when indexing a draft of the Human Genome [5]. ...
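The index-compression idea mentioned in this excerpt can be sketched briefly (a generic scheme for illustration; Dash and Cafe each use their own codes): offsets in a posting list increase monotonically, so one stores successive differences and packs each difference into variable-length bytes.

def varbyte_encode(numbers):
    out, prev = bytearray(), 0
    for n in numbers:                      # assumes sorted, non-negative integers
        delta, prev = n - prev, n
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)   # 7 payload bits, continuation bit set
            delta >>= 7
        out.append(delta)                  # final byte, continuation bit clear
    return bytes(out)

def varbyte_decode(data):
    numbers, n, shift, prev = [], 0, 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += n
            numbers.append(prev)
            n, shift = 0, 0
    return numbers

assert varbyte_decode(varbyte_encode([3, 7, 300, 1000])) == [3, 7, 300, 1000]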
Article
In this paper we introduce several features of our first generation diagonal aggregating search heuristic system, Dash [17], which results in order-of-magnitude speed improvement for nucleotide searches when compared to NCBI-Blast 2.2.6 [1,7]. Heuristic algorithms such as Blast and FastA [8] are indispensable for searching large genomic databases. Not surprisingly, the significant contributor to search time for such algorithms is the dynamic programming evaluations. Indeed, NCBI-Blast typically spends around 76% of its time budget in this area. Improving the efficiency of dynamic programming activities provides an opportunity to significantly reduce search times and help offset the effects of the continuing exponential growth in database sizes. Features which contribute to the speed of Dash include almost complete mitigation of the dominant dynamic programming activities involved in a typical genomic search, efficient sequence comparison and scoring, optimisation of index size through sub-entropy encoding of index data, and prevention of repeated evaluation of dynamic programming search space.
... Amortized structures are designed to be usually not much worse than the average performance of a more rigid structure and, in applications where the access patterns are not uniform, they are potentially more efficient. Indeed, efficient search systems, such as the MG text database [4] and the CAFE indexed genomic search system [5], use splay trees for index construction. ...
... Splaying has found application in areas where amortized performance is important, including in data compression [8,9], sorting [10], and index construction [4,5]. In addition, extended n-ary splay trees and variants have been proposed and compared to static n-ary trees, with similar amortized performance [11,12]. ...
... Indexed genomic searching can be defined as a method used to reduce the cost of searching in genomic databases. This is done by reducing the storage size of the index table while at the same time producing sequences which are highly similar to the queries [3]. ...
... Williams and Zobel proposed an index-based approach, CAFE [3], in order to obtain sequences in the database which have high similarity to the query, and to enhance retrieval speed. The main procedure in this algorithm involves the decomposition of the retrieval function into two stages, i.e. coarse search and fine search. ...
Article
Full-text available
The rapid growth of genomic databases and the increase in queries against those databases have led to the need for new and efficient search and comparison techniques. Researchers in bioinformatics have concentrated on exploring different approaches to solving the problem of the cost associated with exhaustive search techniques. One of these is the CAFE indexing algorithm, which is considered to be a fast indexing algorithm in genomic information retrieval. However, there is still room for improvement in the CAFE indexing structure. This research aims to enhance the structure of the CAFE inverted index by using a proper hash function to speed up the retrieval process. The results of this research indicate that retrieval using the enhanced index is faster than retrieval using the original CAFÉ index. The benefit ratio of retrieval time using the enhanced CAFE index compared to the original CAFE index is between 62.8 and 74.9 for one query. However, we found that the memory space for storing the indexes is the same for both algorithms. The reason is that although the interval size decreases, each interval now has an increased number of posting lists.
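The hashing enhancement described in the abstract can be illustrated with a toy sketch (assumed details, not the paper's exact scheme): for a fixed DNA alphabet, a q-gram can be interpreted as a base-4 number, giving a perfect hash that locates its posting list in O(1) instead of searching the interval vocabulary.

def qgram_code(qgram, alphabet="ACGT"):
    code = 0
    for ch in qgram:                       # interpret the q-gram as a base-4 number
        code = code * 4 + alphabet.index(ch)
    return code

buckets = [[] for _ in range(4 ** 3)]      # one slot per possible 3-gram
buckets[qgram_code("ACA")].append((0, 4))  # posting: (sequence id, offset)
print(buckets[qgram_code("ACA")])          # [(0, 4)]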
... A local alignment identifies sequence pairs with an optimal possible alignment. Local alignment similarity scoring is used to determine high-scoring regions and we ignore the fact that the whole sequences may have large differences [Williams and Zobel, 2002]. ...
... Today, local alignment is used in bioinformatics to find organisms of similar homology, that is, organisms with a similar evolutionary ancestry. To achieve this, the differences between the DNA (Deoxyribonucleic Acid) sequences belonging to various organisms [Williams and Zobel, 2002] are compared. One tool that can be used to achieve this is BLAST (Basic Local Alignment Search Tool) [Altschul et al., 1990]. ...
Conference Paper
Full-text available
The copying of programming assignments is a widespread problem in academic institutions. Manual plagiarism detection is time-consuming, and current popular plagiarism detection systems are not scalable to large code repositories. While there are text-based plagiarism detection systems capable of handling millions of student papers, comparable systems for code-based plagiarism detection are in their infancy. In this thesis, we propose and evaluate new techniques for code plagiarism detection. Using small and large collections of programs, we show that our approach is highly scalable while maintaining similar levels of effectiveness to that of JPlag.
... Because of the Human Genome Initiative, an international research program for the creation of detailed genetic and physical maps of the human genome, enormous quantities of genome data, e.g., DNA and protein sequences, are generated [7]. DNA sequences, holding the code of life of every living organism, could be considered as strings over an alphabet of four characters, {A, C, G, T}, called bases [10] [28]. DNA sequences could be very long. ...
... Although the inverted index [28] is a simple approach which does not need large storage space as compared to the suffix tree, it may lose some information at the end of the target sequence. Therefore, in this subsection, based on a revised version of the inverted index, we present the EII (Encoded Inverted Index) structure for efficiently indexing DNA sequences, which could avoid those missing cases. ...
Article
In DNA-related research, due to various environmental conditions, mutations occur very often, where a mutation is defined as a heritable change in the DNA sequence. Therefore, approximate string matching is applied to answer queries which find mutations. The problem of approximate string matching is that, given a user-specified parameter k, we want to find where the substrings which have at most k errors compared to the query sequence occur in the database sequences. In this paper, we make use of a new index structure to support the proposed method for approximate string matching. In the proposed index structure, EII, we map each overlapping q-gram of the database sequence into an index key, and record occurring positions of the q-gram in the corresponding index entry. In the proposed method, EOB, we first generate all possible mutations for each gram in the query sequence. Then, by utilizing information recorded in the EII structure, we check both the local order (i.e., the order of characters in a gram) and the global order (i.e., the order of grams in an interval) of these mutations. The final answers can be determined directly without applying dynamic programming, which is used in traditional filter methods for approximate string matching. The experimental results show that our method can outperform the (k + s) q-samples filter, a well-known method for approximate string matching, in terms of processing time under various conditions for short query sequences.
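The mutation-generation step described in the abstract can be sketched as follows (illustrative only, with assumed names; the real EOB method additionally checks local and global order): enumerate every q-gram within one substitution of the query gram and probe an EII-style map from q-grams to positions.

def one_substitution_variants(gram, alphabet="ACGT"):
    variants = {gram}
    for i, ch in enumerate(gram):
        for a in alphabet:
            if a != ch:
                variants.add(gram[:i] + a + gram[i + 1:])
    return variants

def probe(index, gram):
    hits = []                              # positions recorded for the gram or its mutations
    for g in one_substitution_variants(gram):
        hits.extend(index.get(g, []))
    return sorted(hits)

index = {"ACA": [4], "AGA": [2], "ACT": [9]}
print(probe(index, "ACA"))                 # [2, 4, 9]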
... Furthermore, although mentioned as a possibility, metric-space indexing was not pursued. A number of efforts have introduced inverted indexes on q-grams [10, 28, 34, 42, 48]. These systems are proving to be very fast and useful when used as either a coarse filtering mechanism or when applied to genomic analysis problems on evolutionarily close sequences [37]. ...
... Also in question is the broader applicability of the language and physical structures. Q-gram approaches are endemic to information retrieval [48]. Q-gram methods first derived for speech recognition are now being extended toward the retrieval of music files by humming [15]. ...
Article
Biologically effective retrieval and analysis of sequences entails much more than finding matching strings. While identification and storage of biological sequences usually comprises long functional units (e.g. genes, proteins and chromosomes), the analysis and retrieval of those sequences is primarily concerned with finding ordered sets of short matching subsequences (q-grams). This characterization applies both to homology search algorithms, i.e., BLAST searches, and to a growing toolkit of algorithms in comparative genomics that are tantamount to executing joins on pairs of whole genome sequences (whole genome joins). To support these two logical views of sequence data, we introduce mSQL, a set of extensions to SQL92. We have implemented mSQL as a component of MoBIoS, the Molecular Biological Information System. We describe the materialization of sets of q-grams as a metric-space index. Such a physical structure provides an access path for indexed nested-loop joins, enabling O(m log n) comparative genomic analysis. We detail the optimization of paged MVP-trees to support a metric for the retrieval of protein sequences. Empirical results demonstrate O(log n) retrieval times for local alignments.
... Global alignment of nucleotide sequences as described can be handled using a dynamic programming matrix exactly as described in Section 3.1, with a suitable update function. There have been some attempts to conduct homology search using inverted indexes. The cafe system (Williams and Zobel, 2002) is one such implementation. However, the accuracy of such systems is generally inferior to the approaches described in this section. ...
... • Improved sequence similarity search algorithms (Park et al., 1997;Karwath & King, 2002;Pawlowski et al., 2000;Williams & Zobel, 2002;Jaakkola et al., 2000) Machine learning, intermediate sequences, better indexing schemes and new understanding of the relationship between sequence similarity and function can all be used to improve homology searches. ...
... Indexing and Retrieval for Genomic Databases uses the CAFÉ indexing scheme [4] and shows that the indexed approach results in significant savings in computationally intensive local alignment, that index-based searching is as accurate as existing exhaustive search schemes, and that it is better than BLAST. ...
Article
This paper provides an optimal storage algorithm for an effective design and implementation of a distributed bioinformatics computing system for the analysis of DNA sequences (OPTSDNA). This system can be used for storing DNA sequences of various sizes in a database. DNA sequences of different lengths were stored by using this algorithm. These sequences varied in size from very small to very large. The performance of this storage system is compared with a sequential approach.
... A number of efforts have applied information retrieval algorithms to biological sequence retrieval. These commonly involve inverted indexes on q-grams [6,21,33,37]. These systems are proving to be very fast and useful when used as either a coarse filtering mechanism or when applied to genomic analysis problems on evolutionarily close sequences [5]. ...
Article
mSQL is an extended SQL query language targeting the expanding area of biological sequence databases and sequence analysis methods. The core aspects include first-class data types for biological sequences, operators based on an extended-relational algebra, an ability to define logical views of sequences as overlapping q-grams and the materialization of those views as metric-space indices. We first describe the current trends in biological analysis that necessitate a more intuitive, flexible, and optimizable approach than current methodologies. We present our solution, mSQL, and describe its formal definition with respect to both physical and logical operators, detailing the cost model of each operator. We describe the necessity of indexing sequences offline to adequately manage this type of data given space and time concerns. We assess a number of metric-space indexing methods and conclude that MVP-trees can be expected to perform the best for sequence data. We ultimately implement two queries in mSQL to show that, not only can biologically valid analyses be expressed in concise mSQL queries, such queries can be optimized in the same ways as those relying on a standard relational algebra.
... Several criteria were applied to satisfy the use of this technique. These indexing and retrieval techniques are embodied in a full-scale prototype retrieval system, CAFÉ, that is based on techniques used in text retrieval and in approximate string matching for databases [13]. The principal features of CAFÉ are the incorporation of data structures for query resolution and the indexing technique used. ...
... For example, >100 million human genomes are expected to be sequenced by 2025 for precision medicine research (Stephens et al., 2015). As an uncompressed haploid human genome needs roughly three gigabytes (GB) of physical memory (Deorowicz et al., 2015; Ochoa et al., 2015), this big data has created computational challenges for storage (Kahn, 2011), retrieval (Williams and Zobel, 2002) and privacy (Huang et al., 2016). Efficient and high-ratio data compression is needed to address the exponential growth of NGS genomes (Numanagić et al., 2016). ...
Article
Full-text available
Motivation: The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand of high compression ratio due to the intrinsic challenging features of DNA sequences such as small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, by which only the differences between two similar genomes are stored, is a promising approach with high compression ratio. Results: We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark data set of eight human genomes. HiRGC takes less than 30 minutes to compress about 21 gigabytes of each set of the seven target genomes into 96 to 260 megabytes, achieving compression ratios of 217 to 82 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust to deal with different reference genomes. In contrast, the competing methods' performance varies widely on different reference genomes. More experiments on 100 human genomes from the 1000 Genome Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. Availability and Implementation: The C++ and Java source codes of our algorithm are freely available for academic and non-commercial use. They can be downloaded from https://github.com/yuansliu/HiRGC.
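The 2-bit encoding that HiRGC builds on is simple to sketch (the packing below is a generic illustration, not the tool's file format): each of A, C, G, T gets a 2-bit code, so four bases fit in one byte.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack_2bit(seq):
    bits = 0
    for ch in seq:                         # assumes an A/C/G/T-only sequence
        bits = (bits << 2) | CODE[ch]
    return bits, len(seq)

def unpack_2bit(bits, n):
    out = []
    for _ in range(n):
        out.append(BASE[bits & 0b11])      # low two bits give the last base
        bits >>= 2
    return "".join(reversed(out))

packed, n = pack_2bit("GATTACA")
assert unpack_2bit(packed, n) == "GATTACA"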
... CAFE (Williams and Zobel 2002b; Williams 1999; Williams and Zobel 1997b) is an example of a later indexed algorithm that utilises much smaller indices than FLASH, and offers similar sensitivity to BLAST 1 or FASTA. ...
... An early system called cafe [27] for aligning sequences to databases did use compression techniques in representing a genomic index. However, that system used an inverted index structure rather than a hash table, so it lacked an exhaustive set of offsets. ...
Article
Full-text available
Background Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. Results We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. Conclusions Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access.
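For contrast with bitpacking, Elias gamma, one of the universal codes benchmarked above, is easy to sketch (a textbook construction, not the paper's implementation): a value n >= 1 is written as len(bin(n)) - 1 zero bits followed by the binary digits of n.

def elias_gamma_encode(n):
    assert n >= 1
    b = bin(n)[2:]                         # binary digits of n, most significant first
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    zeros = 0
    while bits[zeros] == "0":              # count leading zeros to learn the length
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

assert elias_gamma_encode(9) == "0001001"
assert elias_gamma_decode("0001001") == 9

Decoding such codes proceeds bit by bit, which is one reason the vectorized bitpacking above wins on retrieval speed.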
... DNA sequence databases are normally as large as billions of bps (base pairs). Special indices [6,7,8,9] are designed according to the characteristics of DNA sequences to address the efficiency and the effectiveness of the results. Since biological data is becoming tremendous with the growing research interest and the revolution in research approaches, it becomes more and more important and necessary to analyze and understand biological data and the relationships between various data sets using computational approaches. Research in Web mining aims to develop new techniques to effectively extract and mine useful knowledge or information from these databases [10]. ...
Article
Full-text available
Mining biological data is an emerging area of intersection between data mining and bioinformatics. Bioinformaticians have been working on the research and development of computational methodologies and tools for expanding the use of biological, medical, behavioral, or health-related data. Data mining researchers have been making substantial contributions to the development of models and algorithms to meet the challenges posed by bioinformatics research. Mining these databases tends to surface data quality issues such as data anomalies and duplication. For biological data to be corrected, methods and tools must be developed. This paper proposes one such tool, called BIOMINING, that is designed to eliminate anomalies and redundancy in biological web content.
... Further, a new enhanced BM algorithm that uses a threaded search technique is adopted in the proposed method to accelerate the search for the input pattern. Besides, the huge library is divided into four different classification libraries, according to the nature of DNA formulation from the four nucleotide bases {A, C, G, T} (Williams and Zobel, 2002). These improvements enable the proposed Bio-BM method to achieve an optimal solution for the extraction of biological sequences among the other existing biological search methods. ...
Conference Paper
Full-text available
Exponential growth of biological data raises a fundamental problem for meaningful extraction of information from the huge GenBank databases. Bioinformatics computational methods have been efficiently applied for extracting, searching, integrating and analyzing biological data. In this paper we propose an innovative extraction method, called Bio-BM, which uses the Boyer-Moore search algorithm to solve the problem of biological data selection and analysis. An experimental performance evaluation is used to compare the new Bio-BM method to state-of-the-art biological search methods using real benchmark datasets. The effectiveness and high performance of the proposed method show more accurate results over real biological datasets from GenBank.
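The core of the Boyer-Moore search that Bio-BM builds on can be sketched compactly (the bad-character rule only; the full algorithm and Bio-BM's threading are beyond this illustration): on a mismatch, shift the pattern so that its rightmost occurrence of the offending text character lines up.

def boyer_moore_search(text, pattern):
    last = {ch: i for i, ch in enumerate(pattern)}   # rightmost index of each pattern char
    m, hits, s = len(pattern), [], 0
    while s <= len(text) - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            hits.append(s)                           # full match at shift s
            s += 1
        else:
            s += max(1, j - last.get(text[s + j], -1))   # bad-character shift
    return hits

print(boyer_moore_search("ACGTACGTGA", "ACGT"))      # [0, 4]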
... Their main paradigms are the identification, cloning and analysis of specific gene products for given functions, which are responsible for most of what we know in modern biology [6][7][8]. This has an impact on the identification and cloning of new genes. ...
Article
The term computational biology refers to the knowledge derived from computer analysis of biological data, including identification of genes in the DNA sequences of different organisms, prediction of the structural and functional mechanisms of proteins, and feature extraction and classification in genomics and proteomics. Computational biology is a rapidly developing branch of science and is highly interdisciplinary, using techniques and concepts from informatics, mathematics, chemistry, physics, statistics and biochemistry. The field has risen in parallel with the development of automated high-throughput methods of biochemistry and biological discovery that yield a variety of forms of experimental data, such as DNA and RNA sequences, gene expression patterns and chemical structures. The field's rapid growth is spurred by the vast potential for new understanding that can lead to new treatments, new agro-crop cultivation and new pharmaceutical drug discovery. In recent years, most bioengineering disciplines have started adopting an information-technology-oriented curriculum due to its high-performance computing, data interoperability, web-based platform compatibility and good job opportunities. This study discusses the challenges of setting up an interdisciplinary curriculum by merging life sciences and information technology at the university level. It also surveys the career opportunities for different life science disciplines such as drug development, microbial genome applications, biotechnology, forensics and the analysis of microbes.
... IRS is widely used in many applications such as digital libraries, search engines, e-commerce, electronic news, genomic sequence analysis etc. [1], [2]. One of the efficient techniques used to locate the data for fast retrieval in IRS is indexing. ...
Article
Indexing plays an important role in storing and retrieving data in an Information Retrieval System (IRS). The inverted index is the most frequently used indexing structure in IRS. In order to reduce the size of the index and retrieve the data efficiently, compression schemes are used, because the retrieval of compressed data is faster than that of uncompressed data. High-speed compression schemes can improve the performance of IRS. In this paper, we study and analyze various compression techniques for 32-bit integer sequences. Previously proposed compression schemes achieved either better compression rates or fast decoding; hence, their decompression speed (disk access + decoding) might not be better. In this paper, we propose a new compression technique, called Optimal FastPFOR, based on FastPFOR. The proposed method uses a better integer representation and storage structure for compressing the inverted index to improve decompression performance. We have used the TREC data collection in our experiments, and the results show that the proposed code achieves better compression and decompression compared to FastPFOR and other existing related compression techniques.
... 1. CAFE: This technique employs a two-stage process for searching for all similar sequences in genomic databases [131,132]. An initial coarse-grained search is done through the use of a compressed inverted-index built using overlapping substrings of a fixed length. ...
Article
Biodiversity research generates and uses a variety of data spanning diverse domains, including the taxonomic, geo-spatial and genetic domains, which vary greatly in their structural features and complexity, query processing costs and storage volumes. In this thesis, we present BODHI, a database engine that seamlessly integrates these diverse types of data, spanning the range from molecular to organism-level information. BODHI is a native object-oriented database system built around a publicly available micro-kernel and extensible query processor, and offers a functionally comprehensive query interface. The server is partitioned into three service modules: object, spatial and sequence, each handling the associated data domain and providing appropriate storage, modeling interfaces, and evaluation algorithms for predicates over the corresponding data types. To accelerate query response times, a variety of specialized access structures are included for each domain. Our experiments with complex cross-domain queries over a representative
... Now we will discuss some of the important techniques used by transformation-based index algorithms. Among the first is CAFÉ [31], a partition-based approach. In this approach, coarse searching with an inverted index is used to rank similarity to a query sequence. ...
Article
Full-text available
With the rapid development of the life sciences and the flood of genomic information, the need for faster and more scalable searching methods has become urgent. One of the approaches that has been investigated is indexing. Indexing methods have been categorized into three categories: length-based index algorithms, transformation-based algorithms and mixed-technique-based algorithms. In this research, we focused on the transformation-based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied parallel methods to speed up the index building time and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the N-gram transformation algorithm is an economical solution; it saves time and space too. The results show that the size of the index is smaller than the size of the dataset when the size of the N-gram is 5 or 6. The parallel N-gram transformation algorithm's results indicate that the use of parallel programming with large datasets is promising and can be improved further.
... They avoid database scanning by using inverted indexes, which have proved successful in web search engines. FLASH [2], RAMDB [9], MAP [55], PatternHunter [8] and CAFE [19], [21] are index-based search tools which use pre-built indices on subsequences to speed up the lookup process. CAFE claimed that it could run eight times faster than BLAST and 50 times faster than FASTA. ...
Article
The World Wide Web is growing rapidly, so it is necessary to study user web navigation behavior to improve the quality of web services offered to web users. Analysis of user web navigation behavior is achieved by modeling web navigation history. Markov models are widely used to model user web navigation sessions. Lower-order Markov models provide high coverage but low accuracy. Higher-order Markov models give low coverage but high accuracy, with more time complexity. In this paper, a new way of structuring the Markov model, named the Dynamic Nested Markov model, is proposed for modeling user web navigation sessions. The Dynamic Nested Markov model uses a nesting concept: the higher-order Markov model is nested inside the lower-order Markov model. Through this nesting, the second-order Markov model is accommodated inside the first-order Markov model. In the Dynamic Nested Markov model, all the advantages of the lower-order and higher-order models are achieved in one model. The focus of this model is on the time complexity and coverage of the prediction state. Results show that high coverage is achieved and time complexity is reduced.
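A first-order version of the session model described above is easy to sketch (illustrative only; the paper's contribution is the nesting of higher orders inside it): count page-to-page transitions and predict the most frequent successor.

from collections import defaultdict

def train(sessions):
    counts = defaultdict(lambda: defaultdict(int))   # page -> successor -> count
    for session in sessions:
        for a, b in zip(session, session[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, page):
    successors = counts.get(page)
    return max(successors, key=successors.get) if successors else None

model = train([["home", "news", "sport"],
               ["home", "news", "weather"],
               ["home", "sport"]])
print(predict_next(model, "home"))                   # news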
Article
The Human Genome Project and the explosion of high-throughput data have transformed the areas of molecular and personalized medicine, which are producing a wide range of studies and experimental results and providing new insights for developing medical applications. Research in many interdisciplinary fields is resulting in data repositories and computational tools that support a wide diversity of tasks: genome sequencing, genome-wide association studies, analysis of genotype-phenotype interactions, drug toxicity and side effects assessment, prediction of protein interactions and diseases, development of computational models, biomarker discovery, and many others. The authors of the present paper have developed several inventories covering tools, initiatives and studies in different computational fields related to molecular medicine: medical informatics, bioinformatics, clinical informatics and nanoinformatics. With these inventories, created by mining the scientific literature, we have carried out several reviews of these fields, providing researchers with a useful framework to locate, discover, search and integrate resources. In this paper we present an analysis of the state-of-the-art as it relates to computational resources for molecular medicine, based on results compiled in our inventories, as well as results extracted from a systematic review of the literature and other scientific media. The present review is based on the impact of their related publications and the available data and software resources for molecular medicine. It aims to provide information that can be useful to support ongoing research and work to improve diagnostics and therapeutics based on molecular-level insights.
Article
Full-text available
Nowadays, huge amounts of biological data are produced due to advancements in high-throughput sequencing technology. Those enormous volumes of sequences require effective storage, fast transmission and quick access to any record for analysis. Standard general-purpose lossless compression techniques have failed to compress these sequences; rather, they may even increase the size. Researchers are continually trying to develop new algorithms for this purpose. Present algorithms indicate that there is enough room for new algorithms to compress groups of genomes in a more time- and space-effective way. In this review paper, we analyze and present genomic compression algorithms both for single genomes, i.e. non-referential algorithms exploiting intra-sequence similarity, and for sets of related or non-related genomes, i.e. referential compression algorithms exploiting inter-sequence similarity. We also discuss the different data formats on which those algorithms are applied. The main focus of this review paper is the different data structures for huge sequence representation, such as compressed suffix tries, suffix trees and suffix arrays; algorithms such as the dynamic programming approach; and different indexing techniques for searching similar subsequences using pattern recognition methods. We will also discuss Map-Reduce using HDFS, Yarn and Spark for fast searching, and the streaming concept for reference sequence selection.
Article
Homology-related querying on Bio-XML databases poses several problems, as most available exhaustive mining techniques do not incorporate the semantic relationships inherent to these data collections. This paper identifies an index-based approach to mining such data and explores the improvement achieved in the quality of query results by the application of genetic algorithms.
Conference Paper
Deoxyribonucleic acid (DNA) sequences are difficult to analyze for similarity due to their length and complexity. The challenge lies in being able to use digital signal processing (DSP) to solve highly relevant problems in DNA sequences. Here, we transform a one-dimensional (1D) DNA sequence into a two-dimensional (2D) pattern by using the Peano scan algorithm. Four complex values are assigned to the characters "A", "C", "T", and "G", respectively. Then, a Fourier transform is employed to obtain the far-field amplitude distribution of the 2D pattern. Thus, a 1D DNA sequence becomes a 2D image pattern. Features are extracted from the 2D image pattern with the Principal Component Analysis (PCA) method. In this way, the DNA sequence database can be established. Unfortunately, comparing features may take a long time when the database is large, since the features are often multi-dimensional. This problem is solved by building an indexing structure that acts like a filter to remove non-relevant items and select a subset of candidate DNA sequences. Clustering algorithms can organize the multi-dimensional feature data into the indexing structure for effective retrieval. Accordingly, the query sequence need only be compared against candidate sequences rather than all sequences in the database. In effect, our algorithm provides a pre-processing method to accelerate the DNA sequence search process. Finally, experimental results further demonstrate the efficiency of our proposed algorithm for DNA sequence similarity retrieval.
Article
Data compression has been widely used in many information retrieval based applications like web search engines, digital libraries, etc. to make the retrieval of data faster. In these applications, universal codes (Elias codes, EC; Fibonacci code, FC; Rice code, RC; Extended Golomb code, EGC; Fast Extended Golomb code, FEGC; etc.) have been preferred over statistical codes (Huffman codes, arithmetic codes, etc.). Universal codes are easier to construct and decode than statistical codes. In this paper, the authors propose two methods to construct universal codes based on the ideas used in Rice code and Fast Extended Golomb Code. One of the authors' methods, Re-ordered FEGC, is suitable for representing small, middle and large range integers, where Rice code works well only for small and middle range integers. It also competes with FC, EGC and FEGC in representing small, middle and large range integers, but can be faster to decode than FC, EGC and FEGC. The authors' other coder, Block based RFEGC, uses local divisors rather than a global divisor to improve both the compression and decompression performance of RFEGC. To evaluate the performance of their coders, the authors have applied their methods to compress the integer values of the inverted files constructed from TREC, Wikipedia and FIRE collections. Experimental results show that their coders achieve better compression and decompression performance for those files which contain a significant distribution of middle and large range integers.
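Rice coding, the starting point for the coders proposed above, is short enough to sketch (a textbook form with divisor 2**k, not the authors' Re-ordered FEGC): the quotient is written in unary and the remainder in k binary bits.

def rice_encode(n, k):
    q, r = n >> k, n & ((1 << k) - 1)      # quotient and remainder for divisor 2**k
    return "1" * q + "0" + format(r, "0{}b".format(k))

def rice_decode(bits, k):
    q = bits.index("0")                    # unary quotient terminated by a 0 bit
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) | r

assert rice_encode(9, 2) == "11001"
assert rice_decode("11001", 2) == 9

Large values inflate the unary part, which is why Rice code struggles on large integers and motivates the extended Golomb variants.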
Chapter
A variety of biological databases are currently available to researchers in the XML format. Homology-related querying on such databases presents several challenges, as most available exhaustive mining techniques do not incorporate the semantic relationships inherent to these data collections. This chapter identifies an index-based approach to mining such data and explores the improvement achieved in the quality of query results by the application of genetic algorithms. Our experiments confirm the widely accepted advantages of index- and vector-space-based models for biological data and, specifically, show that the application of genetic algorithms optimizes the search and achieves higher levels of precision and accuracy in heterogeneous databases and faster query execution across all data collections.
Chapter
This chapter introduces a new algorithm called the online and accurate search technique for inferring local alignments on sequences (OASIS), which improves upon the performance of the existing state-of-the-art for accurate local sequence alignment. The existing accurate local-alignment algorithm, the Smith-Waterman (S-W) algorithm, is rarely used since it is very computationally expensive. The chapter shows that the OASIS algorithm is often an order of magnitude or more faster than the S-W algorithm when the query is a short sequence. Such short sequences are often used in querying biological sequence data sets, and OASIS is very effective in these cases. OASIS also has the property of returning result tuples in decreasing order of the matching scores. Consequently, OASIS can be used in an online mode, where the scientist may want to abort the query after seeing the top few results. The chapter experimentally evaluates OASIS and demonstrates that for an important class of searches, in which the query sequence lengths are small, OASIS is more than an order of magnitude faster than S-W. In addition, the speed of OASIS is comparable to BLAST.
Chapter
Similarity searches in multidimensional non-ordered discrete data spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases. Existing indexing methods developed for multidimensional (ordered) continuous data spaces (CDS), such as the R-tree, cannot be directly applied to an NDDS. This is because some essential geometric concepts and properties, such as the minimum bounding region and the area of a region in a CDS, are no longer valid in an NDDS. On the other hand, indexing methods based on metric spaces, such as the M-tree, are too general to effectively utilize the data distribution characteristics in an NDDS, so their retrieval performance is not optimized. To support efficient similarity searches in an NDDS, this chapter proposes a new dynamic indexing technique, called the ND-tree. The key idea is to extend the relevant geometric concepts as well as some indexing strategies used in CDSs to NDDSs.
Article
Compression techniques are essential for applications which require many disk accesses. Since uncompressed data needs more disk accesses than compressed data, compression is used to reduce the cost of disk access and make the retrieval of data faster. Various variable-length codes, such as Elias codes, Rice code, and fast extended Golomb code, have been used in many applications to compress data. In particular, these codes have been used in information retrieval based applications to compress integers, which are the basis of the indexes used to resolve queries. This paper proposes a new method to represent non-negative integers based on the ideas used in Rice code and fast extended Golomb code. The variable-length codes produced by the proposed method are suitable for representing small, middle, and large ranges of integers, where Rice code suits well only some of these ranges. This method also gives a better representation for most integers from the small-to-large range than fast extended Golomb code. In this paper, the proposed method is applied to compress the coordinates (integers) used in the R-tree structure, which is used for indexing spatial data. In the experiments, TIGER data collections and synthetic data collections have been used to evaluate the compression performance. The experimental results show that our code achieves a better bit-rate than other existing codes for those spatial data files which contain a significant distribution of small, middle, and large integers.
Article
Approximate string search on data streams was studied to address the gradual transfer of data access from the static disk mode to the dynamic data stream mode with the development of network technologies. Current static-dataset-based approximate searching methods cannot work efficiently on data streams, because data streams are continuous, boundless and unpredictable, and the resources for online computing are limited. To solve this problem, a sliding-window based method for Approximate String Search on data Streams, called AS3, is proposed, based on a filter-and-verification framework. AS3 adopts a basic window mechanism to facilitate real-time index update, and improves the asymmetric signature scheme for the construction of the basic window signature index. Furthermore, it uses the pre-prune filtering (PPF) algorithm, the count-filtering on stream (CFS) algorithm and the coordinate based verification (CV) algorithm. The experimental results show that AS3 can achieve high query performance on data streams while ensuring the accuracy of results, with high real-time response and peak processing capacity.
Article
Approximate string matching finds all occurrences of a query string in a text database, allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (n-gram matching, for short) has been widely used, largely because it is not a main-memory algorithm and therefore scales to large databases. Nevertheless, n-gram matching has drawbacks: query performance tends to be poor, and many false positives occur when a large number of errors is allowed. In this paper, we propose an inverted index structure, which we call the n-gram/2L-Approximation index, that mitigates these drawbacks, together with an approximate string matching algorithm based on it. The n-gram/2L-Approximation is an adaptation of the n-gram/2L index [4], which the authors proposed earlier for exact matching. Inheriting the advantages of the n-gram/2L index, the n-gram/2L-Approximation index reduces index size and improves query performance compared with the n-gram inverted index, and it also reduces false positives when many errors are allowed. We perform extensive experiments using text and protein databases. Results on databases of 1 GByte show that the n-gram/2L-Approximation index reduces the index size by up to 1.8 times and, at the same time, improves query performance by up to 4.2 times compared with the n-gram inverted index.
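The count-filtering principle underlying n-gram matching can be made concrete (a generic sketch of the standard q-gram lemma, not of the n-gram/2L-Approximation structure itself): a string within edit distance k of a query of length m must share at least m + 1 - n(k + 1) of its n-grams, since each edit destroys at most n of the m - n + 1 query n-grams.

    from collections import Counter

    def ngrams(s, n):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    def passes_count_filter(query, candidate, n, k):
        # q-gram lemma: a match within edit distance k shares at least
        # len(query) + 1 - n*(k + 1) n-grams with the query (with multiplicity)
        threshold = len(query) + 1 - n * (k + 1)
        if threshold <= 0:
            return True                    # filter is vacuous; verify directly
        shared = ngrams(query, n) & ngrams(candidate, n)   # multiset minimum
        return sum(shared.values()) >= threshold

Candidates that pass the filter still require verification with an edit-distance computation; the false positives the paper targets are exactly the candidates that pass but fail verification.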
Conference Paper
An inverted index is an important data structure in computer science: it creates a mapping between each word and the set of documents in which that word appears. Commonly, the output of inverted indexing is stored in an unordered lookup table, so fetching the index for a word requires a linear search with time complexity O(n), where n is the number of words indexed. In this paper, a hash-based optimization is proposed for storing the output of an inverted index, reducing the search time complexity to O(1). Since inverted indexes are popular in big-data applications such as search engines, a MapReduce implementation of the proposed technique is also presented, which can easily be implemented in a distributed environment.
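A minimal sketch of the hash-backed variant (a Python dict is a hash table, so per-word lookup is expected O(1); names and data are illustrative):

    from collections import defaultdict

    def build_inverted_index(documents):
        # map each word to the set of document ids containing it;
        # the hash table gives expected O(1) lookup per word
        index = defaultdict(set)
        for doc_id, text in enumerate(documents):
            for word in text.split():
                index[word].add(doc_id)
        return index

    docs = ["the cat sat", "the dog ran", "a cat ran"]
    index = build_inverted_index(docs)
    print(sorted(index["cat"]))   # -> [0, 2]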
Article
Full-text available
Motivation: Growing sequence databases are instigating sequence retrieval systems that construct k-mer (hot-spot) indexes off-line to speed up on-line query execution. For fixed k, the content of each index bucket grows along with the database, diminishing the effectiveness of the index. It is therefore important to establish effective methods beyond simply indexing BLAST hot-spots, where k equals 3 with a concomitant 8,000 index buckets. Results: We investigate an evolutionary criterion to directly retrieve an evolutionary neighborhood of k-mers. The method uses metric-space search and an overlapping k-mer representation of protein databases. We prove a lemma that enables the comparison of two k-mers using weighted-Hamming distance in lieu of global alignment, yielding an O(k) speed-up. We evaluate the trade-offs between scalability, speed and accuracy, and assess several k-nearest-neighbor search algorithms. The results extend k to 6 and over 60 million buckets, achieving better scalability while maintaining search accuracy comparable to BLAST.
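The O(k) comparison enabled by the lemma can be sketched as follows (a toy illustration; the paper derives its weights from an evolutionary model rather than the flat penalties used here):

    def weighted_hamming(a, b, penalty):
        # position-wise substitution penalties between two equal-length k-mers;
        # O(k) versus O(k^2) for a full alignment of the pair
        assert len(a) == len(b)
        return sum(penalty[x][y] for x, y in zip(a, b))

    # toy penalties: 0 for identity, 1 otherwise
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    penalty = {x: {y: 0 if x == y else 1 for y in alphabet} for x in alphabet}
    print(weighted_hamming("ACDEFG", "ACDEFH", penalty))  # -> 1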
Article
Full-text available
The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.
Article
Full-text available
This chapter discusses the study of local alignment statistics, the distribution of optimal gapped subalignment scores, and the evidence that two parameters are sufficient to describe both the form of this distribution and its dependence on sequence length. Using a random protein model, the relevant statistical parameters are calculated for a variety of substitution matrices and gap costs. An analysis of these parameters elucidates the relative effectiveness of affine as opposed to length-proportional gap costs. Sum statistics provide a method for evaluating sequence similarity that treats short and long gaps differently; by example, the chapter shows how this method has the potential to increase search sensitivity. The statistics described can be applied to the results of fast alignment (FASTA) searches or to those from a variation of the basic local alignment search tool (BLAST) programs.
Article
Full-text available
Contents: 1 Introduction; 1.1 Evolutionary time scales; 1.2 Similarity, Ancestry and Structure; 1.3 Modes of Evolution (1.3.1 Conventional divergence from a common ancestor; 1.3.2 … ATPase; 1.3.3 Protein families diverge at different rates; 1.3.4 Mosaic proteins); 1.4 Introns Early/Late; 1.5 DNA vs Protein comparison; 2 Alignment methods (2.1 Algorithms; 2.2 Dynamic Programming Algorithm …)
Article
Full-text available
Scoring matrices for nucleic acid sequence comparison that are based on models appropriate to the analysis of molecular sequencing errors or biological mutation processes are presented. In mammalian genomes, transition mutations occur significantly more frequently than transversions, and the optimal scoring of sequence alignments based on this substitution model differs from that derived assuming a uniform mutation model. The information from sequence alignments potentially available using an optimal scoring system is compared with that obtained using the BLASTN default scoring. A modified BLAST database search tool allows these, or other explicitly specified scoring matrices, to be utilized in computationally efficient queries of nucleic acid databases with nucleic acid query sequences. Results of searches performed using BLASTN's default score matrix are compared with those using scores based on a mutational model in which transitions are more prevalent than transversions.
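For illustration, a transition-biased nucleotide scoring scheme of the kind the paper argues for (the specific values here are illustrative, not the paper's): transitions (A<->G, C<->T) are penalized less than transversions.

    PURINES = {"A", "G"}

    def score(x, y, match=1, transition=-1, transversion=-2):
        # transition-biased nucleotide scoring (illustrative values)
        if x == y:
            return match
        if (x in PURINES) == (y in PURINES):
            return transition      # A<->G or C<->T
        return transversion

    print(score("A", "G"), score("A", "C"))   # -> -1 -2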
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic, and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is described for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
Article
Full-text available
Motivation: Compression algorithms can be used to analyse genetic sequences. A compression algorithm tests a given property on the sequence and uses it to encode the sequence: if the property holds, it reveals some structure of the sequence that can be described briefly, yielding a description shorter than the sequence of nucleotides given in extenso. The more a sequence is compressed by the algorithm, the more significant the property is for that sequence. Results: We present a compression algorithm that tests for the presence of a particular type of dosDNA (defined ordered sequence-DNA): approximate tandem repeats of small motifs (i.e. of lengths <4). The algorithm has been tested on four yeast chromosomes. The presence of approximate tandem repeats appears to be a uniform structural property of yeast chromosomes. Availability: The algorithms in C are available on the World Wide Web (URL: http://www.lifl.fr/~rivals/Doc/RTA/). Contact: rivals@lifl.fr
Article
Full-text available
PIR-International is an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. A major objective of PIR-International is to continue the development of the Protein Sequence Database as an essential public resource for protein sequence information. This paper briefly describes the architecture of the Protein Sequence Database and how it and associated data sets are distributed and can be accessed electronically.
Article
Full-text available
We describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the two sequences and W is the width of the band. The basic algorithm can be used to calculate either local or global alignment scores. Local alignments are produced by finding the beginning and end of a best local alignment in the band, and then applying the global alignment algorithm between those points. This algorithm has been incorporated into the FASTA program package, where it has decreased the amount of memory required to calculate local alignments from O(NW) to O(N) and decreased the time required to calculate optimized scores for every sequence in a protein sequence database by 40%. On computers with limited memory, such as the IBM-PC, this improvement both allows longer sequences to be aligned and allows optimization within wider bands, which can include longer gaps.
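A minimal sketch of the banded idea (global score only, linear gap costs, and illustrative scores; the paper's version also recovers local alignments and their endpoints): cells with |i - j| > w are treated as unreachable, so each row costs O(W) and only one row is kept, giving O(NW) time and O(N) space.

    def banded_global_score(a, b, w, match=1, mismatch=-1, gap=-2):
        n, m = len(a), len(b)
        NEG = float("-inf")
        prev = [NEG] * (m + 1)
        for j in range(0, min(w, m) + 1):      # row 0 inside the band
            prev[j] = j * gap
        for i in range(1, n + 1):
            cur = [NEG] * (m + 1)
            if i <= w:
                cur[0] = i * gap               # column 0 inside the band
            for j in range(max(1, i - w), min(m, i + w) + 1):
                diag = prev[j - 1] + (match if a[i-1] == b[j-1] else mismatch)
                up   = prev[j] + gap           # -inf if outside the band
                left = cur[j - 1] + gap
                cur[j] = max(diag, up, left)
            prev = cur
        return prev[m]    # -inf if (n, m) lies outside the band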
Article
Full-text available
We present here a fast and sensitive method designed to isolate short nucleotide sequences which have non-random statistical properties and may thus be biologically active. It is based on a first order Markov analysis and allows us to detect statistically significant sequence motifs from six to ten nucleotides long which are significantly shared (or avoided) in the sequences under investigation. This method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). Our results demonstrate the accuracy and the efficiency of the method in that the sequence motifs which are known to act as eukaryotic promoters, such as the TATA-box and the CAAT-box, were clearly identified. In addition we have found other statistically significant motifs, the biological roles of which are yet to be clarified.
Article
Full-text available
A 2.8-kb EcoRI-BglII fragment cloned from the wild-type Haemophilus influenzae Rd chromosome is shown to increase the transformability of the Com-101 mutant through trans complementation. Deletion and sequence analyses indicate that the active region of the clone carries a 687-bp open reading frame. A 0.3-kb insertion in the corresponding EcoRI-BglII fragment of the Com-101 chromosome is shown to be a partial (331-bp) duplication of this open reading frame. The wild-type sequence produces a peptide of a size that is consistent with the sequence data when this sequence is expressed in Escherichia coli with a T7 promoter-based transcription vector. RNA hybridization analysis using a DNA probe derived from the open reading frame suggests that the sequence is transiently expressed during competence development. On the basis of these observations, it is proposed that the open reading frame corresponds to the com101A gene.
Article
Full-text available
Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a "substitution score matrix" that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a "log-odds" matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human alpha 1 B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.
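The implicit log-odds relationship can be written explicitly: if q_ij is the target frequency with which residues i and j are aligned in true alignments and p_i are the background residue frequencies, then any substitution score is, up to scale,

    $$ s_{ij} = \log_2 \frac{q_{ij}}{p_i \, p_j} \quad \text{(bits)} $$

so each matrix is best adapted to the evolutionary distance at which its target frequencies actually arise, which is the basis for preferring PAM-120 in database searches and PAM-200 for suspected homologs.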
Article
Full-text available
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
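The word-hit extension at the heart of MSP search can be sketched as follows (simplified, ungapped, with illustrative scores): extend the hit in both directions and stop when the running score drops a fixed amount below the best seen.

    def xdrop_extend(q, s, qi, sj, w, match=1, mismatch=-3, xdrop=10):
        # score the length-w word hit at (qi, sj), then extend right and left
        # until the running score falls more than xdrop below the best seen
        score = sum(match if q[qi+k] == s[sj+k] else mismatch for k in range(w))

        def extend(i, j, step):
            gain = best = 0
            while 0 <= i < len(q) and 0 <= j < len(s):
                gain += match if q[i] == s[j] else mismatch
                if gain > best:
                    best = gain
                elif best - gain > xdrop:
                    break
                i += step
                j += step
            return best

        return score + extend(qi + w, sj + w, 1) + extend(qi - 1, sj - 1, -1)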
Article
Full-text available
An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
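The resulting significance formula, as later standardized in BLAST statistics: for random sequences of lengths m and n, the expected number of distinct segments scoring at least S is

    $$ E = K m n \, e^{-\lambda S}, \qquad P(\text{at least one}) = 1 - e^{-E}, $$

where the parameters λ and K are computed from the scoring system and the residue frequencies.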
Article
Full-text available
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replaceability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
Article
Full-text available
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
Full-text available
Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention had been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important to realize the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases and network access to similarity search services.
Article
Full-text available
Progress toward achieving the first set of goals for the genome project appears to be on schedule or, in some instances, even ahead of schedule. Furthermore, technological improvements that could not have been anticipated in 1990 have in some areas changed the scope of the project and allowed more ambitious approaches. Earlier this year, it was therefore decided to update and extend the initial goals to address the scope of genome research beyond the completion of the original 5-year plan. A major purpose of revising the plan is to inform and provide a new guide to all participants in the genome project about the project's goals. To obtain the advice needed to develop the extended goals, NIH and DOE held a series of meetings with a large number of scientists and other interested scholars and representatives of the public, including many who previously had not been direct participants in the genome project. Reports of all these meetings are available from the Office of Communications of the National Center for Human Genome Research (NCHGR) and the Human Genome Management Information System of DOE. Finally, a group of representative advisors from NIH and DOE drafted a set of new, extended goals for presentation to the National Advisory Council for Human Genome Research of NIH and the Health and Environmental Research Advisory Committee of DOE.
Article
Full-text available
Protein sequence alignments generally are constructed with the aid of a "substitution matrix" that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a "log-odds" matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
Article
Full-text available
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to seven additional databases; a variety of new documentation files; and the creation of TREMBL, an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT.
Article
Full-text available
We present an algorithm to identify potential functional elements like protein binding sites in DNA sequences, solely from nucleotide sequence data. Prerequisites are a set of at least seven not closely related sequences with a common biological function which is correlated to one or more unknown sequence elements present in most but not necessarily all of the sequences. The algorithm is based on a search for n-tuples which occur at least in a minimum percentage of the sequences with no or one mismatch, which may be at any position of the tuple. In contrast to functional tuples, random tuples show no preferred pattern of mismatch locations within the tuple nor is the conservation extended beyond the tuple. Both features of functional tuples are used to eliminate random tuples. Selection is carried out by maximization of the information content first for the n-tuple, then for a region containing the tuple and finally for the complete binding site. Further matches are found in an additional selection step, using the ConsInd method previously described. The algorithm is capable of identifying and delimiting elements (e.g. protein binding sites) represented by single short cores (e.g. TATA box) in sets of unaligned sequences of about 500 nucleotides using no information other than the nucleotide sequences. Furthermore, we show its ability to identify multiple elements in a set of complete LTR sequences (more than 600 nucleotides per sequence).
Article
The Human Genome Project has successfully completed all the major goals in its current 5-year plan, covering the period 1993–98. A new plan, for 1998–2003, is presented, in which human DNA sequencing will be the major emphasis. An ambitious schedule has been set to complete the full sequence by the end of 2003, 2 years ahead of previous projections. In the course of completing the sequence, a “working draft” of the human sequence will be produced by the end of 2001. The plan also includes goals for sequencing technology development; for studying human genome sequence variation; for developing technology for functional genomics; for completing the sequence of Caenorhabditis elegans and Drosophila melanogaster and starting the mouse genome; for studying the ethical, legal, and social implications of genome research; for bioinformatics and computational studies; and for training of genome scientists.
Article
From its origin the Protein Sequence Database has been designed to support research and has focused on comprehensive coverage, quality control and organization of the data in accordance with biological principles. Since 1988 the database has been maintained collaboratively within the framework of PIR-International, an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. The database is widely distributed and is available on the World Wide Web, via ftp and email server, and on CD-ROM and magnetic media; it is also widely redistributed and incorporated into many other protein sequence data compilations, including SWISS-PROT and the Entrez system of the NCBI.
Article
This article surveys the current state of affairs with regard to software for sequence assembly. We try to give some appreciation of the underlying combinatorial problems involved and the nature of the computer codes or algorithms used to solve them. The article further discusses the requirements that should reasonably be met by the user interface that drives the computational components. This discussion is intended to give the reader a perspective on how to evaluate the software systems currently available to them. Finally, the article discusses current research on algorithms for fragment assembly and the nature of the improvements that can be expected in the near future.
Article
…applications to five Ribonucleases, three FAD-binding enzymes and five cro-like DNA binding proteins. The Bacon and Anderson (1986) algorithm shows considerable promise for the location of significant short sequence similarities. However, the method does not provide an overall alignment of the sequences and does not explicitly consider gaps. Johnson and Doolittle [51] reduce the number of segment comparisons that must be performed by progressively evaluating selected segments…
Conference Paper
A query to a nucleotide database is a DNA sequence. Answers are similar sequences, that is, sequences with a high-quality local alignment. Existing techniques for finding answers use exhaustive search, but it is likely that, with increasing database size, these algorithms will become prohibitively expensive. We have developed a partitioned search approach, in which local alignment string matching techniques are used in tandem with an index. We show that fixed-length substrings, or intervals, are a suitable basis for indexing in conjunction with local alignment on likely answers. By use of suitable compression techniques the index size is held to an acceptable level, and queries can be evaluated several times more quickly than with exhaustive search techniques.
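A minimal sketch of the partitioned-search idea (illustrative, not the authors' exact implementation): index fixed-length, non-overlapping intervals of each database sequence, use interval hits to select candidate sequences, and run local alignment only on those candidates.

    from collections import defaultdict

    def build_interval_index(sequences, width=8):
        # map each fixed-length, non-overlapping interval to the ids
        # of the sequences containing it
        index = defaultdict(set)
        for sid, seq in enumerate(sequences):
            for i in range(0, len(seq) - width + 1, width):
                index[seq[i:i + width]].add(sid)
        return index

    def candidate_sequences(query, index, width=8, min_hits=2):
        # slide over the query so a match to any interval phase is found
        hits = defaultdict(int)
        for i in range(len(query) - width + 1):
            for sid in index.get(query[i:i + width], ()):
                hits[sid] += 1
        return [sid for sid, h in hits.items() if h >= min_hits]

Only the returned candidates would then be aligned, for instance with a banded local alignment routine like the one sketched earlier in this listing.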
Article
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are: (1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the χ² statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
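The entropy-style measure (definition (2)) has a particularly compact form; a sketch over sliding windows, with the window length an illustrative choice:

    import math
    from collections import Counter

    def window_entropy(seq, w=12):
        # Shannon entropy (bits) of the residue composition in each
        # length-w window; low values flag low-complexity segments
        out = []
        for i in range(len(seq) - w + 1):
            counts = Counter(seq[i:i + w])
            out.append(-sum((c / w) * math.log2(c / w)
                            for c in counts.values()))
        return out

    print(window_entropy("PPPPPPPPPPPP", 12))   # -> [0.0], minimal complexity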
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
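A minimal sketch of the self-indexing idea under simplified assumptions (variable-byte coding instead of the paper's codes, and a separate array of synchronization points rather than skips embedded in the list itself): the sync points let a conjunctive query decode forward from the right block instead of from the start of the list.

    import bisect

    def vbyte_encode(n):
        # variable-byte code: 7 payload bits per byte, high bit marks the last
        out = bytearray()
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
        return bytes(out)

    def vbyte_decode(buf, pos):
        n, shift = 0, 0
        while True:
            b = buf[pos]
            pos += 1
            if b & 0x80:
                return n | ((b & 0x7F) << shift), pos
            n |= b << shift
            shift += 7

    class SelfIndexedList:
        # compressed postings list with an internal skip index (sync points)
        def __init__(self, docids, block=64):
            self.block, self.n, self.sync = block, len(docids), []
            buf, prev = bytearray(), 0
            for k, d in enumerate(docids):
                if k % block == 0:
                    self.sync.append((d, len(buf)))  # (first docid, offset)
                    buf += vbyte_encode(d)           # block head: absolute
                else:
                    buf += vbyte_encode(d - prev)    # within a block: d-gaps
                prev = d
            self.data = bytes(buf)

        def next_geq(self, target):
            # first docid >= target: binary-search the sync points,
            # then decode forward only within the chosen block(s)
            firsts = [s[0] for s in self.sync]
            i = max(bisect.bisect_right(firsts, target) - 1, 0)
            k, pos = i * self.block, self.sync[i][1]
            cur, pos = vbyte_decode(self.data, pos)
            while cur < target:
                k += 1
                if k >= self.n:
                    return None
                nxt, pos = vbyte_decode(self.data, pos)
                cur = nxt if k % self.block == 0 else cur + nxt
            return cur

The block size trades index overhead against skip granularity, which mirrors the paper's observation that the internal index adds little storage while sharply reducing decoding work.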
Article
Efficient dynamic programming algorithms are available for a broad class of protein and DNA sequence comparison problems. These algorithms require computer time proportional to the product of the lengths of the two sequences being compared [O(N^2)] but require memory space proportional only to the sum of these lengths [O(N)]. Although the requirement for O(N^2) time limits use of the algorithms to the largest computers when searching protein and DNA sequence databases, many other applications of these algorithms, such as calculation of distances for evolutionary trees and comparison of a new sequence to a library of sequence profiles, are well within the capabilities of desktop computers. In particular, the results of library searches with rapid searching programs, such as FASTA or BLAST, should be confirmed by performing a rigorous optimal alignment. Whereas rapid methods do not overlook significant sequence similarities, FASTA limits the number of gaps that can be inserted into an alignment, so that a rigorous alignment may extend the alignment substantially in some cases. BLAST does not allow gaps in the local regions that it reports; a calculation that allows gaps is very likely to extend the alignment substantially. Although a Monte Carlo evaluation of the statistical significance of a similarity score with a rigorous algorithm is much slower than the heuristic approach used by the RDF2 program, the dynamic programming approach should take less than 1 hr on a 386-based PC or desktop Unix workstation. For descriptive purposes, we have limited our discussion to methods for calculating similarity scores and distances that use gap penalties of the form g = rk. Nevertheless, programs for the more general case (g = q+rk) are readily available. Versions of these programs that run either on Unix workstations, IBM-PC class computers, or the Macintosh can be obtained from either of the authors.
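A sketch of the O(N) space score computation described here, for the linear gap cost g = rk (scores are illustrative; recovering the alignment itself, rather than just its score, requires a divide-and-conquer refinement):

    def sw_score(a, b, match=5, mismatch=-4, gap=-10):
        # Smith-Waterman best local score in O(len(b)) space,
        # O(len(a) * len(b)) time: only the previous row is kept
        prev = [0] * (len(b) + 1)
        best = 0
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                s = match if a[i-1] == b[j-1] else mismatch
                cur[j] = max(0, prev[j-1] + s, prev[j] + gap, cur[j-1] + gap)
                best = max(best, cur[j])
            prev = cur
        return best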
Article
The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer and unambiguously detect proteins that shared a common ancestor billions of years in the past. FASTA is both fast and selective because it initially considers only amino acid identities. Its sensitivity is increased not only by using the PAM250 matrix to score and rescore regions with large numbers of identities but also by joining initial regions. The results of searches with FASTA compare favorably with results using NWS-based programs that are 100 times slower. FASTA is slightly less sensitive but considerably more selective. It is not clear that NWS-based programs would be more successful in finding distantly related members of the G-protein-coupled receptor family. The joining step by FASTA to calculate the initn score is especially useful for sequences that share regions of sequence similarity that are separated by variable-length loops.
Article
We have chemically synthesized and expressed in yeast a gene coding for human epidermal growth factor (urogastrone), a 53-amino-acid polypeptide that has been shown to promote epithelial cell proliferation and to inhibit gastric acid secretion. The synthetic gene, consisting of 170 base pairs, was designed with yeast-preferred codons and assembled by enzymatic ligation of synthetic fragments produced by phosphoramidite chemistry. The DNA synthesis protocol used allows for facile synthesis of oligonucleotides larger than 50 bases. Yeast cells were transformed with plasmids containing the synthetic gene under control of a yeast glyceraldehyde-3-phosphate dehydrogenase gene promoter and were shown to synthesize a biologically active human epidermal growth factor.
Article
As the volume of protein sequence data grows, rapid methods for searching the protein sequence database become of primary importance. Rigorous comparison of sequences is obtained with the well-known dynamic programming algorithms. However, these algorithms are not rapid enough to use for routinely searching the entire database. In this paper we discuss some methods that can be used for rapid searches.
Article
With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.
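The k-tuple matching step can be sketched as follows (illustrative and simplified to diagonal hit counting): index every k-tuple of one sequence, then count query hits per diagonal, since hits from the same ungapped region share the offset i - j.

    from collections import defaultdict

    def ktuple_diagonals(query, subject, k=4):
        # index all k-tuples of the subject, then count hits per diagonal
        index = defaultdict(list)
        for j in range(len(subject) - k + 1):
            index[subject[j:j + k]].append(j)
        diag = defaultdict(int)
        for i in range(len(query) - k + 1):
            for j in index.get(query[i:i + k], ()):
                diag[i - j] += 1     # same-diagonal hits share offset i - j
        return diag                  # high counts suggest similar regions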
Article
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performs better than either BLASTP and FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. ln()-scaling performed better than normalization based on other simple functions of library sequence length. ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.
Article
A key issue in managing today's large amounts of genetic data is the availability of efficient, accurate, and selective techniques for detecting homologies (similarities) between newly discovered and already stored sequences. A common characteristic of today's most advanced algorithms, such as FASTA, BLAST, and BLAZE, is the need to scan the contents of the entire database in order to find one or more matches. This design decision results in either excessively long search times or, as is the case for BLAST, in a sharp trade-off between the achieved accuracy and the required amount of computation. The homology detection algorithm presented in this paper, on the other hand, is based on a probabilistic indexing framework. The algorithm requires minimal access to the database in order to determine matches. This minimal requirement is achieved by using the sequences of interest to generate a highly redundant number of very descriptive tuples; these tuples are subsequently used as indices in a table look-up paradigm. In addition to the description of the algorithm, theoretical and experimental results on the sensitivity and accuracy of the suggested approach are provided. The storage and computational requirements are described and the probability of correct matches and false alarms is derived. Sensitivity and accuracy are shown to be close to those of dynamic programming techniques. A prototype system has been implemented using the described ideas. It contains the full Swiss-Prot database rel 25 (10 MR) and the genome of E. coli (2 MR). The system is currently being expanded to include the complete Genbank database…
Article
We present here a codification structure, entirely interfaced with the main packages for biomolecule database management, associated with a new search algorithm to retrieve quickly a sequence in a database. This system is derived from a method previously proposed for homology search in databanks with a preprocessed codification of an entire database in which all the overlapping subsequences of a specific length in a sequence were converted into a code and stored in a hash-coding file. This new algorithm is designed for an improved use of the codification. It is based on the recognition of the rarest strings which characterize the query sequence and the intersection of sorted lists read in the codification structure. The system is applicable to both nucleic acid and protein sequences and is used to find patterns in databanks or large sets of sequences. A few examples of applications are given. In addition, the comparison of our method with existing ones shows that this new approach speeds up the search for query patterns in large data sets.
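A sketch of the rarest-strings idea (names are ours, and the paper's codification stores codes in a hash-coding file rather than Python dicts): pick the query's least frequent fixed-length strings and intersect their sorted position lists with a merge.

    def intersect_sorted(a, b):
        # merge-style intersection of two sorted lists
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def candidates_by_rarest(query, index, k=8, m=3):
        # index: substring -> sorted list of sequence ids containing it
        grams = {query[i:i + k] for i in range(len(query) - k + 1)}
        lists = sorted((index.get(g, []) for g in grams), key=len)[:m]
        if not lists:
            return []
        result = lists[0]
        for lst in lists[1:]:
            result = intersect_sorted(result, lst)
        return result

Starting from the rarest strings keeps the intermediate lists short, which is what makes the intersection fast on large data sets.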
Article
The Portable Dictionary of the Mouse Genome is a database for personal computers that contains information on approximately 10,000 loci in the mouse, along with data on homologs in several other mammalian species, including human, rat, cat, cow, and pig. Key features of the dictionary are its compact size, its network independence, and the ability to convert the entire dictionary to a wide variety of common application programs. Another significant feature is the integration of DNA sequence accession data. Loci in the dictionary can be rapidly resorted by chromosomal position, by type, by human homology, or by gene effect. The dictionary provides an accessible, easily manipulated set of data that has many uses--from a quick review of loci and gene nomenclature to the design of experiments and analysis of results. The Portable Dictionary is available in several formats suitable for conversion to different programs and computer systems. It can be obtained on disk or from Internet Gopher servers (mickey.utmem.edu or anat4.utmem.edu), an anonymous FTP site (nb.utmem.edu in the directory pub/genedict), and a World Wide Web server (http://mickey.utmem.edu/front.html).