
Sharma V. Thankachan
Georgia Institute of Technology | GT · School of Computational Science & Engineering
PhD
About
109
Publications
5,554
Reads
780
Citations
Introduction
Additional affiliations
August 2014 - present
Publications (109)
The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an a...
Given a set ${\cal D}$ of d patterns, the dictionary matching problem is to index ${\cal D}$ such that for any query text T, we can locate the occurrences of any pattern within T efficiently. When ${\cal D}$ contains a total of n characters drawn from an alphabet of size σ, Hon et al. (2008) gave an $nH_k({\cal D}) + o(n \log \sigma)$-...
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index...
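The inverted index described above can be illustrated with a minimal sketch. It is not the compressed structure studied in the paper; the function names `build_inverted_index` and `conjunctive_query`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: words are lowercase, whitespace-separated tokens.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def conjunctive_query(index, words):
    """Return ids of documents that contain every query word."""
    if not words:
        return []
    postings = [set(index.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*postings))


# Hypothetical sample collection.
docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"]
index = build_inverted_index(docs)
print(index["quick"])
print(conjunctive_query(index, ["the", "dog"]))
```

Intersecting posting lists is the plain, uncompressed way to answer a multi-word query; real indexes typically compress the lists and rank the results.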
Let T1 and T2 be two rooted trees with an equal number of leaves. The leaves are labeled, and the labeling of the leaves in T2 is a permutation of those in T1. Nodes are associated with weight, such that the weight of a node u, denoted by W(u), is more than the weight of its parent. A node x∈T1 and a node y∈T2 are induced, iff their subtrees have a...
In recent years, several compressed indexes based on variants of the Burrows–Wheeler transform have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index (Ferragina and Manzini in J. ACM 52(4):552–581, https://doi.org/10.1145/1082036.1082039, 2005). As such, there...
Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with $|E|$ edges that exactly matches a pattern of length $m$, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies an algorithm significantly faster than $O(|E|m)$ time i...
The skyline of a set of two-dimensional points is the subset of points not dominated by any other point. In this paper, we consider a set of two-dimensional points (in rank space) that are assigned an additional category, or color. The goal is to preprocess these points so that given a three-sided region of the form [a,b]×[τ,∞] we can return the se...
Motivation
Co-linear chaining has proven to be a powerful technique for finding approximately optimal alignments and approximating edit distance. It is used as an intermediate step in numerous mapping tools that follow seed-and-extend strategy. Despite this popularity, subquadratic time algorithms for the case where chains support anchor overlaps a...
The non-overlapping indexing problem is defined as follows: pre-process a given text T[1,n] of length n into a data structure such that whenever a pattern P[1,m] comes as an input, we can efficiently report the largest set of non-overlapping occurrences of P in T. The best-known solution is by Cohen and Porat [ISAAC 2009]. The size of their structu...
Background
Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective...
Let P be a collection of d patterns {P1,P2,…,Pd} of total length n characters, which are chosen from an alphabet Σ of size σ. Given a text T (over Σ), the dictionary indexing problem is to create a data structure using which we can report all positions j (called occurrences) where at least one of the patterns Pi∈P is a match with the same-length su...
Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as...
The CNF formula satisfiability problem (CNF-SAT) has been reduced to many fundamental problems in P to prove tight lower bounds under the Strong Exponential Time Hypothesis (SETH). Recently, the works of Abboud, Hansen, Vassilevska W. and Williams (STOC 16), and later, Abboud and Bringmann (ICALP 18) have proposed basing lower bounds on the hardnes...
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest. Formally, let D be a collec...
Let T[1,n] be a text of length n and T[i,n] be the suffix starting at position i. Also, for any two strings X and Y, let LCP(X,Y) denote their longest common prefix. The range-LCP of T w.r.t. a range [α,β], where 1≤α<β≤n, is rlcp(α,β) = max{|LCP(T[i,n],T[j,n])| : i≠j and i,j∈[α,β]}. Amir et al. [ISAAC 2011] introduced the indexing version of this problem,...
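The range-LCP definition can be checked directly with a brute-force sketch. This evaluates the definition over all pairs of suffixes in the range and is not the indexing solution the abstract refers to; the names `lcp_len` and `rlcp` are hypothetical, and positions are 1-based to match the text.

```python
def lcp_len(x, y):
    """Length of the longest common prefix of strings x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def rlcp(text, alpha, beta):
    """Brute-force range-LCP over 1-based positions alpha..beta (inclusive)."""
    if not (1 <= alpha < beta <= len(text)):
        raise ValueError("require 1 <= alpha < beta <= len(text)")
    best = 0
    for i in range(alpha, beta + 1):
        for j in range(i + 1, beta + 1):
            # text[i-1:] is the suffix T[i,n] in the 1-based notation.
            best = max(best, lcp_len(text[i - 1:], text[j - 1:]))
    return best


print(rlcp("banana", 1, 6))
```

The nested loops compare every pair of suffixes, so this is only useful as a reference for validating faster structures on small inputs.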
This paper revisits the $k$-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the $k$-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(...
Text indexing is a fundamental problem in computer science. The objective is to preprocess a text T, so that, given a pattern P, we can find all starting positions (or simply, occurrences) of P in T efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the non-overlapping indexing problem, and the range n...
We present the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time $2^{o(\sigma)}n$ unless the Exponential Time Hypothesis fails, where $\sigma$ is the size of the alphabet. Moreover, we show that optimizat...
Let \(\mathsf {T}[1,n]\) be a string of length n and \(\mathsf {T}[i,j]\) be the substring of \(\mathsf {T}\) starting at position i and ending at position j. A substring \(\mathsf {T}[i,j]\) of \(\mathsf {T}\) is a repeat if it occurs more than once in \(\mathsf {T}\); otherwise, it is a unique substring of \(\mathsf {T}\). Repeats and unique subs...
Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P,k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P,d). Linear space and opti...
In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an effort to better understand under which conditions su...
In this article, the investigators present a new method using a deep learning approach to diagnose schizophrenia. In the experiment presented, the investigators have used a secondary dataset provided by The National Institute of Health. The aforementioned experimentation involves analyzing this dataset for the existence of schizophrenia using tradi...
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem o...
Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multip...
This paper presents a new method for diagnosing schizophrenia using deep learning. This experiment used a secondary dataset supplied by the National Institute of Health. The experiment analyzes the dataset and identifies schizophrenia using traditional machine learning methods such as logistic regression, support vector machines, and random forest....
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(n\log^k n)$...
Let \(\mathsf {T}[1,n]\) be a text of length n and \(\mathsf {T}[i,n]\) be the suffix starting at position i. Also, for any two strings X and Y, let \(\mathsf {LCP}(X, Y)\) denote their longest common prefix. The range-LCP of \(\mathsf {T}\) w.r.t. a range \([\alpha ,\beta ]\), where \(1\le \alpha < \beta \le n\), is \(\mathsf {rlcp}(\alpha ,\beta ) = \max \{|\mathsf {LCP}(\mathsf {T}[i,n],\mathsf {T}[j,n])| : i\ne j \text { and } i,j\in [\alpha ,\beta ]\}\). Amir et...
Let D={T1,T2,…,TD} be a collection of D documents having n characters in total. Given two patterns P and Q, and an integer k>0, we consider the following queries.
• top-k forbidden pattern query: Among all documents containing P, but not Q, report the k documents most relevant to P.
• top-k two pattern query: Among all documents that contain both P a...
The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS can be computed in O(n) space and time, where n=x+y is the input size. Compressed string matching is the study of string matching problems with the following twist: the input data is in a compressed format and the underlying task...
We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully co...
We consider the problem of indexing a given text T[0...n−1] of n characters over an alphabet set Σ of size σ, in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r, and the output is the set of all occ_{ℓ,r} occurrences of P in T[ℓ...r]. In this paper, we pr...
Given a string X[1,n] and a position k∈[1,n], a Shortest Unique Substring of X covering k, denoted by Sk, is a substring X[i,j] of X which satisfies the following conditions: (i) i≤k≤j, (ii) i is the only position where there is an occurrence of X[i,j], and (iii) j−i is minimized. Current best-known algorithms for finding Sk require Θ(n) words of w...
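Conditions (i) to (iii) above can be turned into a brute-force sketch that tries candidate substrings in order of increasing length. It is only an illustration of the definition, not the space-efficient algorithm the abstract goes on to describe; the function names are hypothetical, positions are 1-based, and ties are broken by the leftmost start (an assumption, since the definition allows several answers).

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def shortest_unique_substring(x, k):
    """Return 1-based (i, j) of a shortest unique substring X[i,j] covering k."""
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= len(x)")
    for length in range(1, n + 1):
        # Start positions i such that i <= k <= i + length - 1 and j <= n.
        lo = max(1, k - length + 1)
        hi = min(k, n - length + 1)
        for i in range(lo, hi + 1):
            j = i + length - 1
            if count_occurrences(x, x[i - 1:j]) == 1:
                return i, j
    return None  # not reached: the whole string always occurs exactly once


print(shortest_unique_substring("abaab", 3))
```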
Background
Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the pr...
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that fits to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words...
Background
Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and tr...
A gap is a sequence of don’t care characters. In this paper, we study two variants of the dictionary matching problem, where gaps may be present in the patterns or in the text. The first variant, called dictionary matching with one gap, considers indexing a collection D of d one-gap-patterns, where the ith pattern is of the form Pi[αi,βi]Qi with Pi...
On a given vector X = x_1, x_2, ..., x_n of integers, the range selection (i, j, k) query is finding the k–th smallest integer in x_i, x_{i+1}, ..., x_j for any (i, j, k) such that 1 ≤ i ≤ j ≤ n and 1 ≤ k ≤ j − i + 1. Previous studies on the problem proposed data structures that occupied additional O(n · log n) bits of space over the X itself that answer the queries in logarithmic time. In this study, we replace X and encode all integers in it vi...
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably effi...
Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \ldots ,\mathsf {T}_D\}\) be a collection of D string documents of n characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), can return t...
In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. W...
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, w...
We consider the Parameterized Pattern Matching problem, where a pattern $P$ matches some location in a text $\mathsf{T}$ iff there is a one-to-one correspondence between the alphabet symbols of the pattern and those of the text. More specifically, assume that the text $\mathsf{T}$ contains $n$ characters from a static alphabet $\Sigma_s$ and a...
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that fits to solve both the exact and approximate $k$-mismatch SUS finding, using the minimum $2n$ memory w...
Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-...
Strings form a fundamental data type in computer systems. String searching has been extensively studied since the inception of computer science. Increasingly many applications have to deal with imprecise strings or strings with fuzzy information in them. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each pos...
Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignm...
The Range LCP problem is to preprocess a string \(S[1\dots n]\), to enable efficient solutions of the following query: given a range [l, r] as the input, report \(\max _{i, j \in \{l,\ldots ,r\}} |\mathsf {LCP}(S_{i}, S_j)|\). Here \(\mathsf {LCP}(S_i, S_j)\) is the longest common prefix of the suffixes of S starting at locations i and j and \(|\ma...
Given a text \(\mathsf {T}\) having \(n\) characters, we consider the non-overlapping indexing problem defined as follows: pre-process \(\mathsf {T}\) into a data-structure, such that whenever a pattern \(P\) comes as input, we can report a maximal set of non-overlapping occurrences of \(P\) in \(\mathsf {T}\). The best known solution for this prob...
A gap-pattern is a sequence of sub-patterns separated by bounded sequences of don’t care characters (called gaps). A one-gap-pattern is a pattern of the form \(P[\alpha ,\beta ]Q\), where \(P\) and \(Q\) are strings drawn from alphabet \(\varSigma \) and \([\alpha , \beta ]\) are lower and upper bounds on the gap size \(g\). The gap size \(g\) is t...
Let \(\mathcal{D}=\{\mathsf {T}_1,\mathsf {T}_2,\dots , \mathsf {T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. The forbidden pattern document listing problem asks to report those documents \(\mathcal{D}' \subseteq \mathcal{D}\) which contain the pattern \(P\), but not the pattern \(Q\). The \({\mathsf {top\text{-...
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among the methods based on substring composition, the Average Common Substring (ACS) measure proposed by Burstein et al. (RECOMB 2005) admits a straightf...
On a given vector X = x_1, x_2, ..., x_n of integers, the range selection (i, j, k) query is finding the k–th smallest integer in x_i, x_{i+1}, ..., x_j for any (i, j, k) such that 1 ≤ i ≤ j ≤ n and 1 ≤ k ≤ j − i + 1. Previous studies on the problem kept X intact and proposed data structures that occupied additional O(n · log n) bits of space ov...
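The range selection query itself can be stated as a short reference sketch. This simply sorts the requested slice for each query and keeps X intact; it is not the encoding proposed in the paper. The name `range_select` is hypothetical and indices are 1-based as in the text.

```python
def range_select(x, i, j, k):
    """Return the k-th smallest value among x_i, ..., x_j (1-based, inclusive)."""
    if not (1 <= i <= j <= len(x)):
        raise ValueError("require 1 <= i <= j <= n")
    if not (1 <= k <= j - i + 1):
        raise ValueError("require 1 <= k <= j - i + 1")
    # Sorting the slice per query is simple but costs O(m log m) for m = j - i + 1.
    return sorted(x[i - 1:j])[k - 1]


x = [5, 1, 4, 2, 3]  # hypothetical sample vector
print(range_select(x, 2, 4, 2))
```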
We face the problem of designing a data structure that can report all
$\tau$-majorities within any range of an array $A[1,n]$, without storing $A$. A
$\tau$-majority in a range $A[i,j]$, for $0<\tau< 1$, is an element that occurs
more than $\tau(j-i+1)$ times in $A[i,j]$. We show that $\Omega(n\log(1/\tau))$
bits are necessary for such a data struc...
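The $\tau$-majority definition can be made concrete with a counting sketch. Unlike the data structure discussed in the abstract, this version reads the array $A$ directly on every query; the name `tau_majorities` and the sample array are assumptions for illustration.

```python
from collections import Counter


def tau_majorities(a, i, j, tau):
    """Return the elements occurring more than tau*(j-i+1) times in A[i,j] (1-based)."""
    if not (1 <= i <= j <= len(a)):
        raise ValueError("require 1 <= i <= j <= n")
    if not (0 < tau < 1):
        raise ValueError("require 0 < tau < 1")
    threshold = tau * (j - i + 1)
    counts = Counter(a[i - 1:j])
    return sorted(value for value, c in counts.items() if c > threshold)


a = [1, 2, 1, 3, 1, 2, 2]  # hypothetical sample array
print(tau_majorities(a, 1, 7, 0.4))
```

Since each reported element occupies more than a $\tau$ fraction of the range, at most $\lfloor 1/\tau \rfloor$ elements can be returned for any query.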
We study the problem of indexing a text T[1,n] such that whenever a pattern P[1,p] and an interval [a,b] comes as a query, we can report all pairs (i, j) of consecutive occurrences of P in T with a <= j-i <= b. We present an O(n log n) space data structure with optimal O(p+k) query time, where k is the output size.
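The query can be illustrated by scanning the text at query time. This sketch does not achieve the O(p+k) query time of the indexed solution stated above; the function names and the sample text are hypothetical, and positions are 1-based.

```python
def occurrences(text, pattern):
    """Return 1-based start positions of pattern in text, overlaps allowed."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = text.find(pattern, start + 1)
    return positions


def consecutive_occurrence_pairs(text, pattern, a, b):
    """Pairs (i, j) of consecutive occurrences of pattern with a <= j - i <= b."""
    occ = occurrences(text, pattern)
    return [(i, j) for i, j in zip(occ, occ[1:]) if a <= j - i <= b]


print(consecutive_occurrence_pairs("abcabxxab", "ab", 3, 3))
```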
Let \({\cal D} =\{d_1,d_2,...,d_D\}\) be a collection of D string documents of n characters in total. The two-pattern matching problems ask to index \({\cal D}\) for answering the following queries efficiently.
report/count the unique documents containing P1 and P2.
report/count the unique documents containing P1, but not P2.
Here P1 and P2 r...
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not eas...
Let $\mathcal{D} = \{\mathsf{T}_1,\mathsf{T}_2, \dots,\mathsf{T}_D\}$ be a collection of $D$ string
documents of $n$ characters in total, that are drawn from an alphabet
set $\Sigma=[\sigma]$. The \emph{top-$k$ document retrieval problem}
is to preprocess $\mathcal{D}$ into a data structure that, given a query
$(P[1..p],k)$, can return the $k$ documents of $\mathcal{D}$ most relevant to pa...
We consider the problem of indexing a collection \(\cal{D}\) of D strings (documents) of total n characters from an alphabet set of size σ, such that whenever a pattern P (of p characters) and an integer τ ∈ [1, D] comes as a query, we can efficiently report all (i) maximal generic words and (ii) minimal discriminating words as defined below:
maxim...
In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. W...
Text indexing is a fundamental problem in computer science, where the task is to index a given text (string) T[1..n], such that whenever a pattern P[1..p] comes as a query, we can efficiently report all those locations where P occurs as a substring of T. In this paper, we consider the case when P contains wildcard characters (which c...
Let $\mathcal{S}$ be a set of $n$ points in an $[n]^d$ grid, such that each point is assigned a color. Given a query range $Q = [a_1, b_1] \times [a_2, b_2] \times \ldots \times [a_d, b_d]$,
the geometric range mode query problem asks to report the most frequent color (i.e., a mode) of the multiset of colors corresponding to points in $\mathcal{S} \cap Q$. When...
Let \({\cal D}\) be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess \({\cal D}\) into a data structure that, given a query (P,k), can return the k documents of \({\cal D}\) most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking funct...
We address the problem of indexing a collection D={T1,T2,...TD} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queri...
Given an array A[1...n] of n distinct elements from the set {1, 2, ..., n} a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated...
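The classical range maximum query that this work generalizes can be written as a linear scan. This is a reference sketch for plain RMQ only, not the categorical variant (CRMQ) introduced in the paper; the name `rmq` and the sample array are assumptions, and positions are 1-based.

```python
def rmq(arr, a, b):
    """Return (value, position) of the maximum of A[a...b], 1-based inclusive."""
    if not (1 <= a <= b <= len(arr)):
        raise ValueError("require 1 <= a <= b <= n")
    # Elements are distinct in the setting above, so the maximum is unique.
    pos = max(range(a, b + 1), key=lambda p: arr[p - 1])
    return arr[pos - 1], pos


arr = [3, 1, 5, 2, 4]  # hypothetical permutation of {1, ..., 5}
print(rmq(arr, 1, 5))
```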
We consider how to preprocess n colored points in the plane such that later, given a multiset of colors, we can quickly find an axis-aligned rectangle containing a subset of the points with exactly those colors, if one exists. We first give an index that uses o(n^4) space and o(n) query time when there are \(\mathcal{O}(1)\) distinct colors. W...
Given a set \(\mathcal{D}\) of patterns of total length n, the dictionary matching problem is to index \(\mathcal{D}\) such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick in Commun. ACM 18(6):333–340,...
Let \(\cal{D}\) = {d_1, d_2, ..., d_D} be a given set of D string documents of total length n. Our task is to index \(\cal{D}\) such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear space data structures of O(n) words for answering such queries in optimal O(p + k) time. In th...
The existing heuristics for top-k join queries, aiming to minimize the scan-depth, rely heavily on scores and correlation of scores. It is known that for uniformly random scores between two relations of length n, scan-depth of √kn is required. Moreover, optimizing multiple criteria of selections that are anti-correlated may require scan-depth up to...