Parallelization of FM-Index
ABSTRACT This paper presents a parallel design and implementation of the FM-index, a self-contained, highly compressed indexing algorithm whose performance is crucial in applications. With the popularity of multi-core processors, parallel computing allows the FM-index to run faster by performing multiple computations simultaneously where possible. Our approach splits the input data into overlapping blocks of equal size and runs the blocks through the FM-index algorithm simultaneously on multiple processors. After analyzing and refactoring the sequential version, we organize the data flows of all operations according to a unified parallel framework. The experimental results show that, in general, our approach achieves a significant, sub-linear speedup on widespread symmetric multiprocessing architectures, which greatly reduces the running time of operations on large data sets.
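The block-splitting idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the choice of overlap (pattern length minus one), the naive suffix-array construction, and the use of a thread pool are all assumptions made for illustration. Each block gets its own toy FM-index, the blocks are searched concurrently, and a hit is kept only by the block that owns its starting position, so matches in the overlap are not reported twice.

```python
from concurrent.futures import ThreadPoolExecutor


def build_fm_index(block):
    """Build a toy FM-index (suffix array, C table, Occ table) for one block."""
    s = block + "\0"  # sentinel; assumed not to occur in the text
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    bwt = [s[i - 1] for i in sa]  # Burrows-Wheeler transform of s
    # C[c] = number of characters in s that are strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(s) + 1) for c in C}
    for i, ch in enumerate(bwt):
        for c in C:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def locate(index, pattern):
    """Backward search: return the start positions of pattern in the block."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return [sa[i] for i in range(lo, hi)]


def search_block(text, start, block_size, overlap, pattern):
    """Index one overlapping block and report the hits that start in it."""
    block = text[start:start + block_size + overlap]
    hits = locate(build_fm_index(block), pattern)
    # Hits starting inside the overlap belong to the next block.
    return [start + p for p in hits if p < block_size]


def parallel_locate(text, pattern, block_size, workers=4):
    """Search all blocks concurrently and merge the global positions."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    overlap = len(pattern) - 1  # enough for a match to straddle a boundary
    starts = range(0, len(text), block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda s: search_block(text, s, block_size, overlap, pattern),
            starts,
        )
        return sorted(pos for part in parts for pos in part)


if __name__ == "__main__":
    print(parallel_locate("abracadabra" * 3, "abra", block_size=5))
```

A thread pool keeps the sketch easy to run anywhere, but in CPython it will not by itself yield a multi-core speedup for pure-Python work; swapping in a process pool or a native implementation would be the natural next step on a symmetric multiprocessing machine.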
- 01/1995; Addison-Wesley.
Conference Proceeding: Opportunistic data structures with applications
ABSTRACT: We address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because text T[1,u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the kth order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1,p], the opportunistic data structure supports searching for the occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are incompressible we achieve the best space bound currently known (Grossi and Vitter, 2000); on compressible data our solution improves the succinct suffix array of (Grossi and Vitter, 2000) and the classical suffix tree and suffix array data structures either in space or in query time or both. We also study our opportunistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool (Manber and Wu, 1994). The result is an indexing tool which achieves sublinear space and sublinear query time complexity.
Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on; 02/2000
Conference Proceeding: Practical aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences.
ABSTRACT: Searching patterns in the DNA sequence is an important step in biological research. To speed up the search process, one can index the DNA sequence. However, classical indexing data structures like suffix trees and suffix arrays are not feasible for indexing DNA sequences due to main memory requirement, as DNA sequences can be very long. In this paper, we evaluate the performance of two compressed data structures, Compressed Suffix Array (CSA) and FM-index, in the context of searching and indexing DNA sequences. Our results show that CSA is better than FM-index for searching long patterns. We also investigate other practical aspects of the data structures such as the memory requirement for building the indexes.
Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments and the First Workshop on Analytic Algorithmics and Combinatorics, New Orleans, LA, USA, January 10, 2004; 01/2004