Adam Cannane's research while affiliated with RMIT University and other places

Publications (12)

Article
BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, taking typically minutes on modern workstations. Therefore, continuing evolution of BLAST--by improving its algorithms and optimizations--is essential to improve search times in the face of exponentially increasin...
Article
Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST...
Article
Search engines are an essential tool for modern life. We use them to discover new information on diverse topics and to locate a wide range of resources. The search process in all practical search engines is supported by an inverted index structure that stores all search terms and their locations within the searchable document collection. Inverted i...
Article
Fast and accurate techniques for searching large genomic text collections are becoming increasingly important. While Information Retrieval is well-established for general-purpose text retrieval tasks, less is known about retrieval techniques for genomic text data. In this paper, we investigate and propose general-purpose search techniques for genom...
Conference Paper
Automatic categorisation is an important technique for the management of large document collections. Categorisation can be used to store or locate documents that satisfy an information need when the need cannot be expressed as a concise list of query terms. Inverted indexes are used in all query-based retrieval systems to allow efficient query proc...
Article
Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire co...
Article
Compression of databases not only reduces space requirements but can also reduce overall retrieval times. In text databases, compression of documents based on semistatic modeling with words has been shown to be both practical and fast. Similarly, for specific applications—such as databases of integers or scientific databases—specially designed semi...
Conference Paper
Compression of databases not only reduces space requirements but can also reduce overall retrieval times. We have described elsewhere our RAY algorithm for compressing databases containing general-purpose data, such as images, sound, and also text. We describe here an extension to the RAY compression algorithm that permits use on very large databas...
Conference Paper
Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaini...

Citations

... Despite the high numbers of citations of BLAST and FASTA in literature, there is hardly any quantitative comparison of the two tools in terms of speed, precision and accuracy. A comparative analysis on the algorithm runtime complexity and precision of BLAST and FASTA showed that BLAST was over six times faster for searching structural classification of proteins (SCOP) than FASTA (Chattaraj et al., 1999). However, the average precision of FASTA was about 2% higher than that of BLAST (Chattaraj et al., 1999). ...
... Sparse inference Earlier research has applied inverted indices for reducing the classification times for K-nearest Neighbours [Yang, 1994] and Centroid [Shanks et al., 2003]. The same reductions are gained for computing posterior probabilities for linearly interpolated language models in information retrieval [Hiemstra, 1998, Zhai and Lafferty, 2001b]. ...
... Over the past several decades, various IR systems have been built to aid in the development of new retrieval models, to test hypotheses about information seeking, and to validate new evaluation methodologies. An incomplete list includes Lemur/Indri [38,39], Galago [15], Terrier [35,42], ATIRE [51], Ivory [30], JASS [31], MG4J [12], [60,61]. Although some academic systems are widely used across many institutions (for example, Indri and Terrier), many researchers exclusively conduct experiments on their own systems, which contributes to difficulty in interpreting experimental results, since it is unclear if ranking models or implementations are actually being compared. ...
... Although PPM provides the best compression ratio, performing retrieval is difficult, whether directly or through indexing on the compressed file. On the other hand, word-based Huffman coding schemes (Moffat, Sacks-Davis, Wilkinson and Zobel 1993, Witten, Moffat and Bell 1999, Ziviani, Moura, Navarro and Baeza-Yates 2000, Cannane and Williams 2002) provide a better balance between compression ability and performance in indexing and searching. Several researchers like Adjeroh and Mukherjee and Bell and Powell and Zhang 2002, Amir and Benson and Farach 1996, Bunke and Csirik 1993, Bunke and Csirik 1995, Manber 1997, Navarro and Raffinot 1999 have proposed matching algorithms that search patterns directly on the compressed file with or without preprocessing. ...
... All special-purpose encoding methods should meet the following criteria [3]. i) It must permit autonomous use of the encoded sequence and independent decompression of data. ...
... CUDA-BLASTP [18] and GPU-BLAST [19] mainly exploit coarse-grained parallelism in which a sequence alignment is mapped to only one thread. CUDA-BLASTP optimizes the Deterministic Finite-state Automaton (DFA), a two-level lookup table proposed in FSA-BLAST [27,28]. In GPU-BLAST, a vector of bits is allocated in shared memory to store the information about each possible sequence word. ...
... Functional annotation was assigned using the protein (Nr and Swiss-Prot), Clusters of Orthologous Groups (COG) and Gene Ontology (GO) databases. BLASTX was employed to identify related sequences in the protein databases based on E-values of less than 10^-5 [66]. In addition, all transcriptome sequences were reannotated using the NCBI genome databases or the EST database. ...
... Run-length encoding (RLE), where repeats of the same element are expressed as pairs, is an attractive approach for compressing sorted data in a column store [12]. The task of string matching to find the most effective phrase [13] for replacement in the input stream is not a simple one. A real-time database compression algorithm must provide a high compression ratio to store large amounts of data in a real-time database [14]. ...
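The run-length encoding idea mentioned in the snippet above (repeats of the same element expressed as pairs) can be illustrated with a minimal sketch. The function names `rle_encode` and `rle_decode` and the sample column are hypothetical, chosen only for illustration; they are not taken from the cited work.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical sorted column: sorting places equal values next to each other,
# which is why RLE suits sorted data in a column store.
sorted_column = ["AU", "AU", "AU", "NZ", "NZ", "US"]
encoded = rle_encode(sorted_column)
decoded = rle_decode(encoded)
```

Here six values shrink to three pairs; on unsorted data with few adjacent repeats, the same encoding could be larger than the input.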
... Many techniques surveyed in those papers include compression methods that deal with only text data types rather than a diversity of data types. One particular method that does consider the different data types is the RAY algorithm as described by Cannane, Williams and Zobel (Cannane, Williams & Zobel 1999). However, data compression is a large area of research and it is orthogonal to the work outlined here. ...
... Following this, we use Gini to compute the bias of the system on the overall collection using the r(d) scores. We also report the total retrievability (RSUM), which is Σ_d r(d), to provide a measure of how much retrievability is afforded to the collection (a similar access measure is used by Garcia et al. [15]). Figure 1 provides plots of MAP, NDCG@10, Gini coefficient, RSUM, and query time across the p-ratios for each of the pruning algorithms. ...
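The two collection-level measures named in the last snippet can be sketched as follows. This assumes RSUM is the sum of the retrievability scores r(d) over all documents, and it uses the standard rank-based Gini formula; the function names and the sample scores are hypothetical, and the cited paper may compute either measure differently.

```python
def rsum(scores):
    """Total retrievability: the sum of r(d) over all documents (assumed definition)."""
    return sum(scores)


def gini(scores):
    """Gini coefficient of non-negative scores.

    0.0 means every document is equally retrievable; values near 1.0 mean
    retrievability is concentrated in a few documents.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values with 1-based rank i:
    # G = sum((2i - n - 1) * x_i) / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical r(d) scores for two four-document collections.
even_scores = [1, 1, 1, 1]
skewed_scores = [0, 0, 0, 4]
even_gini = gini(even_scores)
skewed_gini = gini(skewed_scores)
skewed_total = rsum(skewed_scores)
```

Both sample collections have the same RSUM, yet their Gini values differ, which is why the two measures are reported together: one captures how much retrievability exists and the other how unevenly it is spread.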