Rahul Shah

Louisiana State University | LSU · Department of Electrical Engineering and Computer Science

PhD

About

120
Publications
7,376
Reads
1,985
Citations
Additional affiliations
August 2007 - present
Louisiana State University
October 2002 - July 2007
Purdue University
Education
August 1997 - September 2002
Rutgers, The State University of New Jersey
Field of study
  • Computer Science

Publications (120)
Article
The skyline of a set of two-dimensional points is the subset of points not dominated by any other point. In this paper, we consider a set of two-dimensional points (in rank space) that are assigned an additional category, or color. The goal is to preprocess these points so that given a three-sided region of the form [a,b]×[τ,∞] we can return the se...
Article
Let P be a collection of d patterns {P1,P2,…,Pd} of total length n characters, which are chosen from an alphabet Σ of size σ. Given a text T (over Σ), the dictionary indexing problem is to create a data structure using which we can report all positions j (called occurrences) where at least one of the patterns Pi∈P is a match with the same-length su...
Article
Let T[1,n] be a text of length n and T[i,n] be the suffix starting at position i. Also, for any two strings X and Y, let LCP(X,Y) denote their longest common prefix. The range-LCP of T w.r.t. a range [α,β], where 1≤α<β≤n, is rlcp(α,β) = max{|LCP(T[i,n],T[j,n])| : i≠j and i,j∈[α,β]}. Amir et al. [ISAAC 2011] introduced the indexing version of this problem,...
Article
Full-text available
Text indexing is a fundamental problem in computer science. The objective is to preprocess a text T, so that, given a pattern P, we can find all starting positions (or simply, occurrences) of P in T efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the non-overlapping indexing problem, and the range n...
Article
Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P,k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P,d). Linear space and opti...
Article
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem o...
Chapter
Let \(\mathsf {T}[1,n]\) be a text of length n and \(\mathsf {T}[i,n]\) be the suffix starting at position i. Also, for any two strings X and Y, let \(\mathsf {LCP}(X, Y)\) denote their longest common prefix. The range-LCP of \(\mathsf {T}\) w.r.t. a range \([\alpha ,\beta ]\), where \(1\le \alpha < \beta \le n\), is \(\mathsf {rlcp}(\alpha ,\beta ) = \max \{|\mathsf {LCP}(\mathsf {T}[i,n], \mathsf {T}[j,n])| \mid i \ne j \text{ and } i,j \in [\alpha ,\beta ]\}\). Amir et...
Article
Let D={T1,T2,…,TD} be a collection of D documents having n characters in total. Given two patterns P and Q, and an integer k>0, we consider the following queries.
  • top-k forbidden pattern query: Among all documents containing P, but not Q, report the k documents most relevant to P.
  • top-k two pattern query: Among all documents that contain both P a...
Article
We consider the problem of indexing a given text T[0...n−1] of n characters over an alphabet set Σ of size σ, in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r and the output is the set of all occ_{ℓ,r} occurrences of P in T[ℓ...r]. In this paper, we pr...
Article
Given a string X[1,n] and a position k∈[1,n], a Shortest Unique Substring of X covering k, denoted by Sk, is a substring X[i,j] of X which satisfies the following conditions: (i) i≤k≤j, (ii) i is the only position where there is an occurrence of X[i,j], and (iii) j−i is minimized. Current best-known algorithms for finding Sk require Θ(n) words of w...
Article
A gap is a sequence of don’t care characters. In this paper, we study two variants of the dictionary matching problem, where gaps may be present in the patterns or in the text. The first variant, called dictionary matching with one gap, considers indexing a collection D of d one-gap-patterns, where the ith pattern is of the form Pi[αi,βi]Qi with Pi...
Article
Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \ldots ,\mathsf {T}_D\}\) be a collection of D string documents of n characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), can return t...
Article
Full-text available
We consider the Parameterized Pattern Matching problem, where a pattern $P$ matches some location in a text $\mathsf{T}$ iff there is a one-to-one correspondence between the alphabet symbols of the pattern to those of the text. More specifically, assume that the text $\mathsf{T}$ contains $n$ characters from a static alphabet $\Sigma_s$ and a...
Article
Strings form a fundamental data type in computer systems. String searching has been extensively studied since the inception of computer science. Increasingly many applications have to deal with imprecise strings or strings with fuzzy information in them. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each pos...
Conference Paper
The restricted shortest path (RSP) problem on directed networks is a well-studied problem, and has a large number of applications such as in Quality of Service routing. The problem is known to be NP-hard. In certain cases, however, the network is not static, i.e., edge parameters vary over time. In light of this, we extend the RSP problem for genera...
Conference Paper
Given a text \(\mathsf {T}\) having \(n\) characters, we consider the non-overlapping indexing problem defined as follows: pre-process \(\mathsf {T}\) into a data-structure, such that whenever a pattern \(P\) comes as input, we can report a maximal set of non-overlapping occurrences of \(P\) in \(\mathsf {T}\). The best known solution for this prob...
Conference Paper
A gap-pattern is a sequence of sub-patterns separated by bounded sequences of don’t care characters (called gaps). A one-gap-pattern is a pattern of the form \(P[\alpha ,\beta ]Q\), where \(P\) and \(Q\) are strings drawn from alphabet \(\varSigma \) and \([\alpha , \beta ]\) are lower and upper bounds on the gap size \(g\). The gap size \(g\) is t...
Conference Paper
Let \(\mathcal{D}=\{\mathsf {T}_1,\mathsf {T}_2,\dots , \mathsf {T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. The forbidden pattern document listing problem asks to report those documents \(\mathcal{D}' \subseteq \mathcal{D}\) which contain the pattern \(P\), but not the pattern \(Q\). The \({\mathsf {top\text{-...
Article
Full-text available
Genome indexing is the basis for many bioinformatics applications. Read mapping (sequence alignment) is one such application where the goal is to align millions of short reads against a reference genome. Several tools are available for read mapping which rely on different indexing techniques to expedite the alignment process. However, many of these co...
Conference Paper
Full-text available
Let $\mathcal{D} = \{\mathsf{T}_1,\mathsf{T}_2, \dots,\mathsf{T}_D\}$ be a collection of $D$ string documents of $n$ characters in total, that are drawn from an alphabet set $\Sigma=[\sigma]$. The top-$k$ document retrieval problem is to preprocess $\mathcal{D}$ into a data structure that, given a query $(P[1..p],k)$, can return the $k$ documents of $\mathcal{D}$ most relevant to pa...
Conference Paper
We consider the problem of indexing a collection \(\cal{D}\) of D strings (documents) of total n characters from an alphabet set of size σ, such that whenever a pattern P (of p characters) and an integer τ ∈ [1, D] come as a query, we can efficiently report all (i) maximal generic words and (ii) minimal discriminating words as defined below: maxim...
Conference Paper
Let \({\cal D}\) be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess \({\cal D}\) into a data structure that, given a query (P,k), can return the k documents of \({\cal D}\) most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking funct...
Article
Full-text available
A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, many applications have to deal with imprecise strings or strings with fuzzy information in t...
Article
Given an array A[1...n] of n distinct elements from the set {1, 2, ..., n}, a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated...
Article
Given a set \(\mathcal{D}\) of patterns of total length n, the dictionary matching problem is to index \(\mathcal{D}\) such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick in Commun. ACM 18(6):333–340,...
Article
The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an a...
Conference Paper
The existing heuristics for top-k join queries, aiming to minimize the scan-depth, rely heavily on scores and correlation of scores. It is known that for uniformly random scores between two relations of length n, scan-depth of √kn is required. Moreover, optimizing multiple criteria of selections that are anti-correlated may require scan-depth up to...
Conference Paper
We present a data structure for the following problem: Given a tree $\mathcal{T}$, with each of its nodes assigned a color in a totally ordered set, preprocess $\mathcal{T}$ to efficiently answer queries for the top $k$ distinct colors on the path between two nodes, reporting the colors sorted in descending order. Our data structure requires linear...
Conference Paper
Range LCP (longest common prefix) is an extension of the classical LCP problem and is defined as follows: Preprocess a string S[1...n] so that max_{a,b ∈ {i...j}} LCP(S_a, S_b) can be computed efficiently for the input i, j ∈ [1, n], where LCP(S_a, S_b) is the length of the longest common prefix of the suffixes of S starting at locations a and b....
Conference Paper
We consider the problem of indexing a given text T[0...n − 1] of n characters over an alphabet set Σ of size σ, in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r and the output is the set of all occ_{ℓ,r} occurrences of P in T[ℓ...r]. In this paper, we...
Conference Paper
Let \({\cal{D}}\) be a given set of (string) documents of total length n. The top-k document retrieval problem is to index \(\cal{D}\) such that when a pattern P of length p, and a parameter k come as a query, the index returns those k documents which are most relevant to P. We present the first non-trivial external memory index supporting top-k do...
Conference Paper
We present O(n)-space data structures to support various range frequency queries on a given array A[0:n − 1] or tree T with n nodes. Given a query consisting of an arbitrary pair of pre-order rank indices (i,j), our data structures return a least frequent element, mode, or α-minority of the multiset of elements in the unique path with endpoints at...
Conference Paper
Hon et al. (2011) recently proposed a variant of the suffix tree, called circular suffix tree, and showed that it can be compressed into succinct space and can be used to solve the circular dictionary matching problem efficiently. Although there are several efficient construction algorithms for the suffix tree in the literature, none of them can be app...
Article
Given a collection of strings, the goal of approximate string matching is to efficiently find the strings in the collection that are similar to a query string. In this paper, we focus on edit distance as the measure to quantify the similarity between two strings. Existing q-gram based methods to address this problem use inverted indexes to index the q-...
Article
Given a set D of d patterns, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. When D contains a total of n characters drawn from an alphabet of size σ, Hon et al. (2008) [12] gave an nH_k(D) + o(n log σ)-bit index which supports a query in O(|T|(log^ε n + log d)+o...
Conference Paper
Let D = {d_1, d_2, ..., d_D} be a given collection of D string documents of total length n. Our task is to index D such that, whenever a pattern P (of length p) and an integer k come as a query, those k documents in which P appears most often can be listed efficiently. In this paper, we propose a compressed index taking 2|CSA| + D log(n/D) +...
Article
We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry. We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show strong equivalenc...
Article
Document retrieval is a special type of pattern matching that is closely related to information retrieval and web searching. In this problem, the data consist of a collection of text documents, and given a query pattern P, we are required to report all the documents (not all the occurrences) in which this pattern occurs. In addition, the notion of...
Article
We consider the problem of succinctly representing a given vertex-weighted tree of n vertices, whose vertices are labeled by integer weights from {1,2,…,σ}, and supporting the following path queries efficiently:
  • Path median query: Given two vertices i, j, return the median weight on the path from i to j.
  • Path selection query: Given two vert...
Article
We study the position restricted substring searching (PRSS) problem, where the task is to index a text T[0⋯n-1] of n characters over an alphabet set Σ of size δ, in order to answer the following: given a query pattern P (of length p) and two indices ℓ and r, report all occ_{ℓ,r} occurrences of P in T[ℓ⋯r]. Known indexes take O(n log n) bits or O(nlog 1...
Article
Let ${\cal{D}}$ = $\{d_1, d_2, d_3, ..., d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\cal{D}$ such that when a pattern $P$ of length $p$, and a parameter $k$ come as a query, the index returns the $k$ most relevant documents to the pattern $P$. Hon et al. \cite{HSV09} gav...
Conference Paper
Let \(\mathcal D\) = {d_1, d_2, ..., d_D} be a given collection of D string documents of total length n. We consider the problem of indexing \(\mathcal D\) such that, whenever two patterns P^+ and P^− come as an online query, we can list all those documents containing P^+ but not P^−. Let t represent the number of such documents. An index proposed b...
Conference Paper
Given a set ${\cal P}$ of d patterns, the circular dictionary matching problem is to index ${\cal P}$ such that for any online query text T, we can efficiently locate the occurrences of any cyclic shift of any pattern of ${\cal P}$ within T. This problem can be applied to practical problems that arise in bioinformatics and computational geo...
Conference Paper
Circular patterns are those patterns whose circular permutations are also valid patterns. These patterns arise naturally in bioinformatics and computational geometry. In this paper, we consider succinct indexing schemes for a set of d circular patterns of total length n, with each character drawn from an alphabet of size σ. Our method is by defin...
Conference Paper
Full-text available
Let $\mathcal{D} = \{d_1,d_2,\dots,d_D\}$ be a given set of $D$ string documents of total length $n$. Our task is to index $\mathcal{D}$ such that the $k$ most relevant documents for an online query pattern $P$ of length $p$ can be retrieved efficiently. We propose an index of size $|CSA|+n\log D(2+o(1))$ bits and $O(t_{s}(p)+k\log\log n+poly\log\log n)$ query time...
Article
Full-text available
Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a...
Conference Paper
Full-text available
Property matching is a biologically motivated problem where the task is to find those occurrences of an online pattern P in a string text T (of size n), such that the matched text part satisfies some conceptual property. The property of a string is a set π of (possibly overlapping) intervals {(s1,f1),(s2,f2),…} corresponding to t...
Conference Paper
Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick, 1975) where occ denotes the number of occurrences. The...
Article
Full-text available
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we exte...
Conference Paper
Full-text available
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index...
Conference Paper
Let T = T_1 φ^{k_1} T_2 φ^{k_2} ... φ^{k_d} T_{d+1} be a text of total length n, where characters of each T_i are chosen from an alphabet Σ of size σ, and φ denotes a wildcard symbol. The text indexing with wildcards problem is to index T such that when we are given a query pattern P, we can locate the occurrences of P in T efficiently. This problem...
Article
In the document retrieval problem (Muthukrishnan, 2002), we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the...
Conference Paper
Full-text available
We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named PSI-RA (parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to tune the space consumption of the index based on the available memory of the machine and the mini...
Conference Paper
Given a collection \(\mathcal D\) of string documents \(\{d_1,d_2,...,d_{|\mathcal D|}\}\) of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns {P_1, P_2, ..., P_m}. To measure the relevance of a document with respect to the query pa...
Conference Paper
Full-text available
Top-k queries allow end-users to focus on the most important (top-k) answers amongst those which satisfy the query. In traditional databases, a user-defined score function assigns a score value to each tuple and a top-k query returns k tuples with the highest score. In an uncertain database, the top-k answer depends not only on the scores but also on the...
Conference Paper
Full-text available
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited present...
Conference Paper
Given a set \({\cal D}\) of d patterns, the dictionary matching problem is to index \({\cal D}\) such that for any query text T, we can locate the occurrences of any pattern within T efficiently. When \({\cal D}\) contains a total of n characters drawn from an alphabet of size σ, Hon et al. (2008) gave an \(nH_k({\cal D}) + o(n \log \sigma)\)-...
Article
Full-text available
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular ex...
Article
Full-text available
The original publication is available at www.springerlink.com. The past few years have witnessed several exciting results on compressed representation of a string T that supports efficient pattern matching, and the space complexity has been reduced to |T|H_k(T)+o(|T| log σ) bits [8, 10], where H_k(T) denotes the kth-order empirical entropy of T, and...
Conference Paper
In this paper we revisit the dynamic dictionary matching problem, which asks for an index for a set of patterns P_1, P_2, ..., P_k that can support the following query and update operations efficiently. Given a query text T, we want to find all the occurrences of these patterns; furthermore, as the set of patterns may change over time, we also w...
Article
Full-text available
(c) 2009 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other w...
Conference Paper
Full-text available
In the document retrieval problem [9], we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the above document re...
Conference Paper
Full-text available
A new trend in the field of pattern matching is to design indexing data structures which take space very close to that required by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter 2005], achieve...
Conference Paper
Full-text available
Applications requiring the handling of uncertain data have led to the development of database management systems extending the scope of relational databases to include uncertain (probabilistic) data as a native data type. New automatic query optimizations having the ability to estimate the cost of execution of a given query plan, as avail...
Conference Paper
Full-text available
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called geometric Burrows-Wheeler transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer many open questions in compressed text indexing: (1) can compressed dat...
Conference Paper
Full-text available
The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivates the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for hand...
Conference Paper
Full-text available
The past few years have witnessed several exciting results on compressed representation of a string T that supports efficient pattern matching, and the space complexity has been reduced to |T| H_k(T) + o(|T| log σ) bits, where H_k(T) denotes the kth-order empirical entropy of T, and σ is the size of the alphabet. In th...
Article
Full-text available
Constantly evolving data arise in various mobile applications such as location-based services and sensor networks. The problem of indexing the data for efficient query processing is of increasing importance. Due to the constantly changing nature of the data, traditional indexes suffer from a high update overhead which leads to poor performance. In th...
Conference Paper
Full-text available
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed...
Conference Paper
Full-text available
Orion is a state-of-the-art uncertain database management system with built-in support for probabilistic data as first-class data types. In contrast to other uncertain databases, Orion supports both attribute and tuple uncertainty with arbitrary correlations. This enables the database engine to handle both discrete and continuous pdfs in a natural an...
Conference Paper
Full-text available
We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size M memory blocks along with a request sequence Σ =(b1,b2,...,bn) where each block bi resides...
Conference Paper
Full-text available