Conference Paper

Cafe: An Indexed Approach to Searching Genomic Databases.

Authors:

Abstract

Cafe is a full-scale prototype retrieval system consisting of novel indexing and retrieval techniques for querying genomic databases. Cafe is based on techniques used in text retrieval and in approximate string matching for databases of names. The principal features of Cafe are the incorporation of novel and efficient data structures for query resolution and the demonstration that indexing can be successfully applied to genomic databases. The search capabilities of Cafe are demonstrated to show that it has the requisite properties of efficiency and accuracy.


... In contrast, other approaches suggest that database indexing can yield much faster speed than query indexing [16,17]. Examples of such tools include BLAT [1], SSAHA [18], MegaBLAST [19], and CAFE [20]. However, these tools cannot provide the same level of sensitivity as the BLAST algorithm [17,21,22], or support nucleotide sequence search. ...
Article
Background: The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing pose different challenges and have different characteristics, the existing techniques for query-indexed search cannot be applied to database-indexed search. Results: muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers hits identical to those returned by NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedup for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. Conclusions: With a newly designed index structure for protein databases and associated optimizations in the BLASTP algorithm, we refactored the BLASTP algorithm for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1302-4) contains supplementary material, which is available to authorized users.
... This is the data structure that, for a given word, very quickly gives us the list of documents in which it appears. In the context of similarity search over genomic sequences, the CAFE software [57,58,59], which uses this data structure, has been proposed. ...
Article
Sequence comparison is one of the main tasks in bioinformatics. New sequencing technologies are leading to a rapid increase in genomic data and strengthen the need for fast and efficient tools to perform this task. In this thesis, a new algorithm for intensive sequence comparison is proposed. It has been specifically designed to exploit all forms of parallelism in today's microprocessors (SIMD instructions, multi-core architecture). This algorithm is also well suited to hardware accelerators such as FPGA or GPU boards. The algorithm has been implemented in the PLAST software (Parallel Local Alignment Search Tool). Different versions are available according to the data to be processed (protein and/or DNA). An MPI version has also been developed. Depending on the nature of the data and the type of technology, speedups of 3 to 20 have been measured compared with the reference software, BLAST, at the same level of quality.
... 1. CAFE: This technique employs a two-stage process for searching for all similar sequences in genomic databases [131,132]. An initial coarse-grained search is done through the use of a compressed inverted-index built using overlapping substrings of a fixed length. ...
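The two-stage idea in the excerpt above can be illustrated with a minimal Python sketch. It is not Cafe's implementation: the index here is an uncompressed dictionary, the ranking by shared substring count is an assumed coarse score, and `difflib.SequenceMatcher` is only a stand-in for whatever fine-grained alignment a real system would run. All function names, the toy sequences, and the choice of `k=4` are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(sequences, k):
    """Map each k-mer to the set of sequence ids containing it (uncompressed)."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def coarse_search(index, query, k, top_n):
    """Rank sequences by how many distinct query k-mers they share."""
    votes = Counter()
    for kmer in set(kmers(query, k)):
        for seq_id in index.get(kmer, ()):
            votes[seq_id] += 1
    return [seq_id for seq_id, _ in votes.most_common(top_n)]


def fine_search(sequences, candidates, query):
    """Re-rank candidates with difflib's ratio, a stand-in for a real aligner."""
    scored = [(SequenceMatcher(None, query, sequences[c]).ratio(), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]


# Hypothetical toy database; real collections hold far longer sequences.
database = {
    "seq1": "ACGTACGTGACC",
    "seq2": "TTTTGGGGCCCC",
    "seq3": "ACGTACGAGACC",
}
index = build_index(database, k=4)
candidates = coarse_search(index, "ACGTACGTGA", k=4, top_n=2)
ranked = fine_search(database, candidates, "ACGTACGTGA")
```

Only the sequences that survive the coarse stage are passed to the more expensive comparison, which is where an indexed approach would save work relative to scanning the whole database.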
Article
Biodiversity research generates and uses a variety of data spanning across diverse domains, including taxonomy, geo-spatial and genetic domains, which vary greatly in their structural features and complexities, query processing costs and storage volumes. In this thesis, we present BODHI, a database engine that seamlessly integrates these diverse types of data, spanning the range from molecular to organism-level information. BODHI is a native object-oriented database system built around a publicly available micro-kernel and extensible query processor, and offers a functionally comprehensive query interface. The server is partitioned into three service modules: object, spatial and sequence, each handling the associated data domain and providing appropriate storage, modeling interfaces, and evaluation algorithms for predicates over the corresponding data types. To accelerate query response times, a variety of specialized access structures are included for each domain. Our experiments with complex cross-domain queries over a representative
... The authors of Cafe [19] worked to find a way to predefine likely sequence alignments to reduce query evaluation costs. They managed to reduce this cost by 40%-90% [18]. We then define a value W which is the window size, this window we ... It is shown that CAFE is faster but has slightly lower precision than BLAST when searching for very similar sequences. It should also be noted that it was very difficult to find papers describing the details of the inner workings of Cafe, as the project seems to be discontinued. ...
... Coarse filtering algorithms have been applied successfully to genomic (Williams, 1998) and speech signal database indexing (Keogh, 2002). Coarse filtering algorithms for genome databases have traditionally drawn inspiration from text (Faloutsos and Oard, 1996) and image retrieval (Smith and Chang, 1996). ...
Article
Motivation: We reformulate the problem of comparing mass spectra by mapping spectra to the vector space model commonly used in document retrieval. It follows that measures of document similarity and document indexing may be adapted for protein identification. In our approach a fast coarse filtering method leveraging a metric space indexing algorithm is used to produce an initial candidate set. We then rank the spectra in this reduced set using ProFound's Bayesian scoring scheme. Ideally, the complexity of the coarse filter search approaches O(log n), as compared to the linear performance provided by most leading tools in the field. Results: We consider three distance measures based on cosine and Hamming distances, modifying them to accommodate the peak shifts intrinsic to mass spectra and investigate their integration with the multivantage-point index structure. Of these, a semi-metric, fuzzy-cosine distance using peptide mass constraints performs the best. We implement an approximate semi-metric search, and show that this improves index pruning power over a standard metric space search. We measure accuracy of results and index performance on a test set of peptide fragmentation spectra from E. coli proteins. We also report sensitivity (recall) and specificity (precision) scores on a more comprehensive benchmark of 1000 Angiotensin-II tandem mass spectra, showing that, in practice, approximate searches in this high dimensional sparse space are acceptable when accompanied by a substantial increase in search efficiency.
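As a rough illustration of the vector space mapping and coarse filtering described in the abstract above, the following Python sketch bins peaks into a sparse vector and keeps only library spectra whose cosine similarity to the query clears a threshold. It is a simplification under stated assumptions: it uses a plain cosine measure rather than the fuzzy-cosine distance, a linear scan rather than a multi-vantage-point index, and the bin width, threshold, function names, and peak lists are all hypothetical.

```python
import math
from collections import Counter


def to_vector(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks into a sparse vector keyed by bin number."""
    vector = Counter()
    for mz, intensity in peaks:
        vector[int(mz // bin_width)] += intensity
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[b] * v[b] for b in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coarse_filter(library, query_peaks, threshold):
    """Linear-scan stand-in for an index: keep spectra similar enough to the query."""
    query = to_vector(query_peaks)
    return [name for name, peaks in library.items()
            if cosine_similarity(query, to_vector(peaks)) >= threshold]


# Hypothetical peak lists, for illustration only.
library = {
    "A": [(100.4, 1.0), (200.1, 2.0)],
    "B": [(300.0, 1.0), (400.0, 1.0)],
}
survivors = coarse_filter(library, [(100.2, 1.0), (200.7, 2.0)], threshold=0.9)
```

The surviving candidates would then be handed to a fine ranking scheme; the sketch stops at the coarse stage because the scoring details are not given in the text.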
... For example, approaches that combine the virtual database approach with complex distance functions similar to Pevzner et al. (2001) may start to become feasible. Coarse filtering algorithms have been applied successfully to genomic (Williams, 1998) and speech signal database indexing (Keogh, 2002). Coarse filtering algorithms for genome databases have traditionally drawn inspiration from text (Faloutsos and Oard, 1996) and image retrieval (Smith and Chang, 1996). ...
Article
Motivation: We reformulate the problem of comparing mass-spectra by mapping spectra to a vector space model. Our search method leverages a metric space indexing algorithm to produce an initial candidate set, which can be followed by any fine ranking scheme. Results: We consider three distance measures integrated into a multi-vantage point index structure. Of these, a semi-metric fuzzy-cosine distance using peptide precursor mass constraints performs the best. The index acts as a coarse, lossless filter with respect to the SEQUEST and ProFound scoring schemes, reducing the number of distance computations and returned candidates for fine filtering to about 0.5% and 0.02% of the database respectively. The fuzzy cosine distance term improves specificity over a peptide precursor mass filter, reducing the number of returned candidates by an order of magnitude. Run time measurements suggest proportional speedups in overall search times. Using an implementation of ProFound's Bayesian score as an example of a fine filter on a test set of Escherichia coli protein fragmentation spectra, the top results of our sample system are consistent with that of SEQUEST.
Article
Sequence comparison is one of the fundamental tasks of bioinformatics. New sequencing technologies are accelerating the production of genomic data and strengthen the need for fast and efficient tools to perform this task. In this thesis, we propose a new algorithm for intensive sequence comparison, explicitly designed to exploit all forms of parallelism present in the latest generation of microprocessors (SIMD instructions, multi-core architecture). This algorithm is also suited to the massive parallelism found on accelerators such as FPGAs or GPUs. The algorithm has been implemented in the PLAST software (Parallel Local Alignment Search Tool). Different versions are available depending on the data to be processed (protein and/or DNA). An MPI version has also been developed for deployment on a cluster of PCs. Depending on the nature of the data and the technologies used, speedups of 3 to 20 have been measured relative to the reference in the field, the BLAST software, at an equivalent level of quality.