About
120
Publications
7,376
Reads
1,985
Citations
Additional affiliations
August 2007 - present
Louisiana State University
October 2002 - July 2007
Education
August 1997 - September 2002
Publications (120)
The skyline of a set of two-dimensional points is the subset of points not dominated by any other point. In this paper, we consider a set of two-dimensional points (in rank space) that are assigned an additional category, or color. The goal is to preprocess these points so that given a three-sided region of the form [a,b]×[τ,∞] we can return the se...
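As an illustration of the skyline definition only (not of the colored three-sided query structure, whose details are elided above), here is a minimal sketch. It assumes that a point p dominates q when p is at least as large in both coordinates and differs from q; the function name `skyline` is hypothetical.

```python
def skyline(points):
    """Return the points not dominated by any other point.

    Assumption: p dominates q when p.x >= q.x and p.y >= q.y and p != q.
    """
    result = []
    best_y = float("-inf")
    # Scan by decreasing x (ties broken by decreasing y). A point survives
    # only if its y is strictly larger than every y seen so far.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return sorted(result)
```

For example, among (1,5), (2,3), (3,4), (4,1), (2,2), the points (2,3) and (2,2) are dominated by (3,4), so the skyline under this convention is (1,5), (3,4), (4,1).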
Let P be a collection of d patterns {P1,P2,…,Pd} of total length n characters, which are chosen from an alphabet Σ of size σ. Given a text T (over Σ), the dictionary indexing problem is to create a data structure using which we can report all positions j (called occurrences) where at least one of the patterns Pi∈P is a match with the same-length su...
Let T[1,n] be a text of length n and T[i,n] be the suffix starting at position i. Also, for any two strings X and Y, let LCP(X,Y) denote their longest common prefix. The range-LCP of T w.r.t. a range [α,β], where 1≤α<β≤n, is rlcp(α,β) = max{|LCP(T[i,n],T[j,n])| : i≠j and i,j∈[α,β]}. Amir et al. [ISAAC 2011] introduced the indexing version of this problem,...
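The range-LCP definition can be made concrete with a brute-force evaluation. This is only a quadratic-time sketch of the definition, not the index discussed in the paper; the helper names `lcp` and `rlcp` are hypothetical, and positions are 1-based as in the abstract.

```python
def lcp(x, y):
    """Length of the longest common prefix of strings x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def rlcp(text, alpha, beta):
    """Brute-force range-LCP over 1-based positions alpha < beta."""
    best = 0
    for i in range(alpha, beta + 1):
        for j in range(i + 1, beta + 1):
            # Compare the suffixes starting at positions i and j.
            best = max(best, lcp(text[i - 1:], text[j - 1:]))
    return best
```

With the text "banana", the suffixes starting at positions 2 and 4 ("anana" and "ana") share a prefix of length 3, so `rlcp("banana", 2, 4)` evaluates to 3.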
Text indexing is a fundamental problem in computer science. The objective is to preprocess a text T, so that, given a pattern P, we can find all starting positions (or simply, occurrences) of P in T efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the non-overlapping indexing problem, and the range n...
Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P,k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P,d). Linear space and opti...
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem o...
Let \(\mathsf {T}[1,n]\) be a text of length n and \(\mathsf {T}[i,n]\) be the suffix starting at position i. Also, for any two strings X and Y, let \(\mathsf {LCP}(X, Y)\) denote their longest common prefix. The range-LCP of \(\mathsf {T}\) w.r.t. a range \([\alpha ,\beta ]\), where \(1\le \alpha < \beta \le n\) is \(\mathsf {rlcp}(\alpha ,\beta ) = \max \{|\mathsf {LCP}(\mathsf {T}[i,n], \mathsf {T}[j,n])| \mid i \ne j \text{ and } i,j \in [\alpha ,\beta ]\}\). Amir et...
Let D={T1,T2,…,TD} be a collection of D documents having n characters in total. Given two patterns P and Q, and an integer k>0, we consider the following queries.
•top-k forbidden pattern query: Among all documents containing P, but not Q, report the k documents most relevant to P.
•top-k two pattern query: Among all documents that contain both P a...
We consider the problem of indexing a given text T[0...n−1] of n characters over an alphabet set Σ of size σ, in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r and the output is the set of all occℓ,r occurrences of P in T[ℓ...r]. In this paper, we pr...
Given a string X[1,n] and a position k∈[1,n], a Shortest Unique Substring of X covering k, denoted by Sk, is a substring X[i,j] of X which satisfies the following conditions: (i) i≤k≤j, (ii) i is the only position where there is an occurrence of X[i,j], and (iii) j−i is minimized. Current best-known algorithms for finding Sk require Θ(n) words of w...
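Conditions (i) to (iii) can be checked directly by brute force. The sketch below is only meant to illustrate the definition and is far slower than the algorithms the abstract refers to; the function names are hypothetical, positions are 1-based, and when several shortest candidates exist the leftmost one is returned (a tie-breaking choice assumed here, not stated in the source).

```python
def count_occurrences(x, sub):
    """Count (possibly overlapping) occurrences of sub in x."""
    return sum(1 for s in range(len(x) - len(sub) + 1) if x.startswith(sub, s))


def shortest_unique_substring(x, k):
    """Return 1-based (i, j) with i <= k <= j, X[i,j] unique, j - i minimal."""
    n = len(x)
    for length in range(1, n + 1):
        # Every start i whose window of this length covers position k.
        for i in range(max(1, k - length + 1), min(k, n - length + 1) + 1):
            j = i + length - 1
            if count_occurrences(x, x[i - 1:j]) == 1:
                return i, j
    return None  # only reached for an empty string or an invalid k
```

For X = "banana" and k = 3, the substrings "n", "an" and "na" all repeat, while "ban" occurs once, so the call returns (1, 3).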
A gap is a sequence of don’t care characters. In this paper, we study two variants of the dictionary matching problem, where gaps may be present in the patterns or in the text. The first variant, called dictionary matching with one gap, considers indexing a collection D of d one-gap-patterns, where the ith pattern is of the form Pi[αi,βi]Qi with Pi...
Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \ldots ,\mathsf {T}_D\}\) be a collection of D string documents of n characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), can return t...
We consider the $Parameterized$ $Pattern$ $Matching$ problem, where a pattern $P$ matches some location in a text $\mathsf{T}$ iff there is a one-to-one correspondence between the alphabet symbols of the pattern to those of the text. More specifically, assume that the text $\mathsf{T}$ contains $n$ characters from a static alphabet $\Sigma_s$ and a...
Strings form a fundamental data type in computer systems. String searching
has been extensively studied since the inception of computer science.
Increasingly many applications have to deal with imprecise strings or strings
with fuzzy information in them. String matching becomes a probabilistic event
when a string contains uncertainty, i.e. each pos...
The restricted shortest path (RSP) problem on directed networks is a well-studied problem, and has a large number of applications such as in Quality of Service routing. The problem is known to be NP-hard. In certain cases, however, the network is not static i.e., edge parameters vary over time. In light of this, we extend the RSP problem for genera...
Given a text \(\mathsf {T}\) having \(n\) characters, we consider the non-overlapping indexing problem defined as follows: pre-process \(\mathsf {T}\) into a data-structure, such that whenever a pattern \(P\) comes as input, we can report a maximal set of non-overlapping occurrences of \(P\) in \(\mathsf {T}\). The best known solution for this prob...
A gap-pattern is a sequence of sub-patterns separated by bounded sequences of don’t care characters (called gaps). A one-gap-pattern is a pattern of the form \(P[\alpha ,\beta ]Q\), where \(P\) and \(Q\) are strings drawn from alphabet \(\varSigma \) and \([\alpha , \beta ]\) are lower and upper bounds on the gap size \(g\). The gap size \(g\) is t...
Let \(\mathcal{D}=\{\mathsf {T}_1,\mathsf {T}_2,\dots , \mathsf {T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. The forbidden pattern document listing problem asks to report those documents \(\mathcal{D}' \subseteq \mathcal{D}\) which contain the pattern \(P\), but not the pattern \(Q\). The \({\mathsf {top\text{-...
Genome indexing is the basis for many bioinformatics applications. Read mapping (sequence alignment) is one such application where the goal is to align millions of short reads against a reference genome. Several tools are available for read mapping which rely on different indexing techniques to expedite the alignment process. However, many of these co...
Let $\mathcal{D} = \{\T_1,\T_2, \dots,\T_D\}$ be a collection of $D$ string
documents of $n$ characters in total, that are drawn from an alphabet
set $\Sigma=[\sigma]$. The \emph{top-$k$ document retrieval problem}
is to preprocess $\D$ into a data structure that, given a query
$(P[1..p],k)$, can return the $k$ documents of $\D$ most relevant to pa...
We consider the problem of indexing a collection \(\cal{D}\) of D strings (documents) of total n characters from an alphabet set of size σ, such that whenever a pattern P (of p characters) and an integer τ ∈ [1, D] comes as a query, we can efficiently report all (i) maximal generic words and (ii) minimal discriminating words as defined below:
maxim...
Let \({\cal D}\) be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess \({\cal D}\) into a data structure that, given a query (P,k), can return the k documents of \({\cal D}\) most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking funct...
A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, many applications have to deal with imprecise strings or strings with fuzzy information in t...
Given an array A[1...n] of n distinct elements from the set {1, 2, ..., n} a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated...
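For the classical range maximum query that this abstract generalizes, one standard approach is a sparse table. The sketch below covers only plain RMQ (the categorical variant is elided above); the sparse table is a technique chosen here for illustration, not necessarily the one used in the paper, and the function names are hypothetical. Query bounds are 1-based and inclusive, as in A[a...b].

```python
def build_sparse_table(a):
    """table[l][i] = index of the maximum of a[i : i + 2**l]."""
    n = len(a)
    table = [list(range(n))]  # windows of length 1
    length = 2
    while length <= n:
        prev, half = table[-1], length // 2
        row = []
        for i in range(n - length + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] >= a[right] else right)
        table.append(row)
        length *= 2
    return table


def rmq(a, table, lo, hi):
    """Return (max value, 1-based position) in a[lo..hi], 1-based inclusive."""
    lo -= 1                      # 0-based start; hi becomes an exclusive end
    level = (hi - lo).bit_length() - 1
    # Two overlapping power-of-two windows cover the whole range.
    left = table[level][lo]
    right = table[level][hi - (1 << level)]
    best = left if a[left] >= a[right] else right
    return a[best], best + 1
```

Preprocessing takes O(n log n) time and space, after which each query inspects two table entries. For A = [3, 1, 4, 6, 2, 5], the query on positions 2 to 5 yields the value 6 at position 4.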
Given a set \(\mathcal{D}\) of patterns of total length n, the dictionary matching problem is to index \(\mathcal{D}\) such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick in Commun. ACM 18(6):333–340,...
The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an a...
The existing heuristics for top-k join queries, aiming to minimize the scan-depth, rely heavily on scores and correlation of scores. It is known that for uniformly random scores between two relations of length n, scan-depth of √kn is required. Moreover, optimizing multiple criteria of selections that are anti-correlated may require scan-depth up to...
We present a data structure for the following problem: Given a tree $\mathcal{T}$, with each of its nodes assigned a color in a totally ordered set, preprocess $\mathcal{T}$ to efficiently answer queries for the top $k$ distinct colors on the path between two nodes, reporting the colors sorted in descending order. Our data structure requires linear...
Range LCP (longest common prefix) is an extension of the classical LCP problem and is defined as follows: Preprocess a string S[1...n] so that max_{a,b ∈ {i...j}} LCP(S_a, S_b) can be computed efficiently for the input i, j ∈ [1, n], where LCP(S_a, S_b) is the length of the longest common prefix of the suffixes of S starting at locations a and b....
We consider the problem of indexing a given text T[0...n − 1] of n characters over an alphabet set Σ of size σ, in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r and the output is the set of all occ_{ℓ,r} occurrences of P in T[ℓ...r]. In this paper, we...
Let \({\cal{D}}\) be a given set of (string) documents of total length n. The top-k document retrieval problem is to index \(\cal{D}\) such that when a pattern P of length p, and a parameter k come as a query, the index returns those k documents which are most relevant to P. We present the first non-trivial external memory index supporting top-k do...
We present O(n)-space data structures to support various range frequency queries on a given array A[0:n − 1] or tree T with n nodes. Given a query consisting of an arbitrary pair of pre-order rank indices (i,j), our data structures return a least frequent element, mode, or α-minority of the multiset of elements in the unique path with endpoints at...
Hon et al. (2011) recently proposed a variant of suffix tree, called circular suffix tree, and showed that it can be compressed into succinct space and can be used to solve the circular dictionary matching problem efficiently. Although there are several efficient construction algorithms for the suffix tree in the literature, none of them can be app...
Given a collection of strings, the goal of approximate string matching is to efficiently find the strings in the collection that are similar to a query string. In this paper, we focus on edit distance as the measure to quantify the similarity between two strings. Existing q-gram based methods to address this problem use inverted indexes to index the q-...
Given a set D of d patterns, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. When D contains a total of n characters drawn from an alphabet of size σ, Hon et al. (2008) [12] gave an nH_k(D) + o(n log σ)-bit index which supports a query in O(|T|(log^ε n + log d) + o...
Let D = {d1, d2,...dD} be a given collection of D string documents of total length n, our task is to index D, such that whenever a pattern P (of length p) and an integer k come as a query, those k documents in which P appears the most number of times can be listed efficiently. In this paper, we propose a compressed index taking 2|CSA| + D logn/D +...
We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry. We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show strong equivalenc...
Document retrieval is a special type of pattern matching that is closely related to information retrieval and web searching. In this problem, the data consist of a collection of text documents, and given a query pattern P, we are required to report all the documents (not all the occurrences) in which this pattern occurs. In addition, the notion of...
We consider the problem of succinctly representing a given vertex-weighted tree of n vertices, whose vertices are labeled by integer weights from {1,2,…,σ}, and supporting the following path queries efficiently:
•Path median query: Given two vertices i, j, return the median weight on the path from i to j.
•Path selection query: Given two vert...
We study the position restricted substring searching (PRSS) problem, where the task is to index a text T[0⋯n-1] of n characters over an alphabet set Σ of size δ, in order to answer the following: given a query pattern P (of length p) and two indices ℓ and r, report all occ ℓ,r occurrences of P in T[ℓ⋯r]. Known indexes take O(nlogn) bits or O(nlog 1...
Let ${\cal{D}}$ = $\{d_1, d_2, d_3, ..., d_D\}$ be a given set of $D$
(string) documents of total length $n$. The top-$k$ document retrieval problem
is to index $\cal{D}$ such that when a pattern $P$ of length $p$, and a
parameter $k$ come as a query, the index returns the $k$ most relevant
documents to the pattern $P$. Hon et. al. \cite{HSV09} gav...
Let \(\mathcal D\) = {d_1, d_2, ..., d_D} be a given collection of D string documents of total length n. We consider the problem of indexing \(\mathcal D\) such that, whenever two patterns P^+ and P^− come as an online query, we can list all those documents containing P^+ but not P^−. Let t represent the number of such documents. An index proposed b...
Given a set ${\cal P}$ of d patterns, the circular dictionary matching problem is to index ${\cal P}$ such that for any online query text T, we can quickly locate the occurrences of any cyclic shift of any pattern of ${\cal P}$ within T efficiently. This problem can be applied on practical problems that arise in bioinformatics and computational geo...
Circular patterns are those patterns whose circular permutations are also valid patterns. These patterns arise naturally in bioinformatics and computational geometry. In this paper, we consider succinct indexing schemes for a set of d circular patterns of total length n , with each character drawn from an alphabet of size σ . Our method is by defin...
Let $\D = $$ \{d_1,d_2,...d_D\}$ be a given set of $D$ string documents of
total length $n$, our task is to index $\D$, such that the $k$ most relevant
documents for an online query pattern $P$ of length $p$ can be retrieved
efficiently. We propose an index of size $|CSA|+n\log D(2+o(1))$ bits and
$O(t_{s}(p)+k\log\log n+poly\log\log n)$ query time...
Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a...
Property matching is a biologically motivated problem where the task is to find those occurrences of an online pattern P in a string text T (of size n), such that the matched text part satisfies some conceptual property. The property of a string is a set π of (possibly overlapping) intervals {(s1,f1),(s2,f2),…} corresponding to t...
Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick, 1975) where occ denotes the number of occurrences. The...
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P , we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we exte...
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index...
Let T = T_1 φ^{k_1} T_2 φ^{k_2} ... φ^{k_d} T_{d+1} be a text of total length n, where characters of each T_i are chosen from an alphabet Σ of size σ, and φ denotes a wildcard symbol. The text indexing with wildcards problem is to index T such that when we are given a query pattern P, we can locate the occurrences of P in T efficiently. This problem...
In the document retrieval problem (Muthukrishnan, 2002), we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the...
We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named PSI-RA (parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to tune the space consumption of the index based on the available memory of the machine and the mini...
Given a collection \(\mathcal D\) of string documents \(\{d_1,d_2,...,d_{|\mathcal D|}\}\) of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns {P_1, P_2, ..., P_m}. To measure the relevance of a document with respect to the query pa...
Top-k queries allow end-users to focus on the most important (top-k) answers amongst those which satisfy the query. In traditional databases, a user defined score function assigns a score value to each tuple and a top-k query returns k tuples with the highest score. In uncertain database, top-k answer depends not only on the scores but also on the...
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited present...
Given a set \({\cal D}\) of d patterns, the dictionary matching problem is to index \({\cal D}\) such that for any query text T, we can locate the occurrences of any pattern within T efficiently. When \({\cal D}\) contains a total of n characters drawn from an alphabet of size σ, Hon et al. (2008) gave an \(nH_k({\cal D}) + o(n \log \sigma)\)-...
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular ex...
The original publication is available at www.springerlink.com. The past few years have witnessed several exciting results on compressed representation of a string T that supports efficient pattern matching, and the space complexity has been reduced to |T|H_k(T) + o(|T| log σ) bits [8, 10], where H_k(T) denotes the kth-order empirical entropy of T, and...
In this paper we revisit the dynamic dictionary matching problem, which asks for an index for a set of patterns P_1, P_2, ..., P_k that can support the following query and update operations efficiently. Given a query text T, we want to find all the occurrences of these patterns; furthermore, as the set of patterns may change over time, we also w...
(c) 2009 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other w...
In the document retrieval problem [9], we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the above document re...
A new trend in the field of pattern matching is to design indexing data structures which take space very close to that required
by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes,
namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter 2005], achieve...
Applications requiring the handling of uncertain data have led to the development of database management systems extending the scope of relational databases to include uncertain (probabilistic) data as a native data type. New automatic query optimizations having the ability to estimate the cost of execution of a given query plan, as avail...
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called geometric Burrows-Wheeler transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer many open questions in compressed text indexing: (1) can compressed dat...
The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivate the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for hand...
The past few years have witnessed several exciting results on compressed representation of a string T that supports efficient pattern matching, and the space complexity has been reduced to |T| H_k(T) + o(|T| log σ) bits, where H_k(T) denotes the kth-order empirical entropy of T, and σ is the size of the alphabet. In th...
Constantly evolving data arise in various mobile applications such as location-based services and sensor networks. The problem of indexing the data for efficient query processing is of increasing importance. Due to the constant changing nature of the data, traditional indexes suffer from a high update overhead which leads to poor performance. In th...
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed...
Orion is a state-of-the-art uncertain database management system with built-in support for probabilistic data as first class data types. In contrast to other uncertain databases, Orion supports both attribute and tuple uncertainty with arbitrary correlations. This enables the database engine to handle both discrete and continuous pdfs in a natural an...
We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size M memory blocks along with a request sequence Σ =(b1,b2,...,bn) where each block bi resides...