Sharma V. Thankachan

Georgia Institute of Technology | GT · School of Computational Science & Engineering

PhD

About

109
Publications
5,554
Reads
780
Citations
Additional affiliations
August 2014 - present
Georgia Institute of Technology
Position
  • PostDoc Position

Publications (109)
Article
The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an a...
Conference Paper
Given a set $\mathcal{D}$ of d patterns, the dictionary matching problem is to index $\mathcal{D}$ such that for any query text T, we can locate the occurrences of any pattern within T efficiently. When $\mathcal{D}$ contains a total of n characters drawn from an alphabet of size σ, Hon et al. (2008) gave an $nH_k(\mathcal{D}) + o(n \log \sigma)$-...
Conference Paper
Full-text available
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index...
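The structure described in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercase normalization, and the names `build_inverted_index` and `conjunctive_query` are hypothetical helpers introduced here; it is not the compressed index the paper studies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: words are separated by whitespace and compared case-insensitively.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def conjunctive_query(index, words):
    """Return the ids of documents that contain every query word."""
    lists = [set(index.get(w.lower(), ())) for w in words]
    if not lists:
        return []
    return sorted(set.intersection(*lists))
```

For example, indexing `["the quick brown fox", "the lazy dog", "quick dog"]` maps `"quick"` to documents 0 and 2, and a query for both `"quick"` and `"dog"` returns only document 2.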
Article
Full-text available
Let T1 and T2 be two rooted trees with an equal number of leaves. The leaves are labeled, and the labeling of the leaves in T2 is a permutation of those in T1. Nodes are associated with weight, such that the weight of a node u, denoted by W(u), is more than the weight of its parent. A node x∈T1 and a node y∈T2 are induced, iff their subtrees have a...
Article
Full-text available
In recent years, several compressed indexes based on variants of the Burrows–Wheeler transform have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index (Ferragina and Manzini in J. ACM 52(4):552–581, https://doi.org/10.1145/1082036.1082039, 2005). As such, there...
Preprint
Full-text available
Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with $|E|$ edges that exactly matches a pattern of length $m$, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies an algorithm significantly faster than $O(|E|m)$ time i...
Article
The skyline of a set of two-dimensional points is the subset of points not dominated by any other point. In this paper, we consider a set of two-dimensional points (in rank space) that are assigned an additional category, or color. The goal is to preprocess these points so that given a three-sided region of the form [a,b]×[τ,∞] we can return the se...
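As a small illustration of the skyline definition only (not the colored three-sided query structure of the paper), the sketch below assumes that a point dominates another when it is at least as large in both coordinates and differs from it. The function name `skyline` is a hypothetical helper.

```python
def skyline(points):
    """Return the points not dominated by any other point.

    Assumption: larger is better in both coordinates, so (x1, y1) dominates
    (x2, y2) when x1 >= x2, y1 >= y2 and the two points differ.
    """
    result = []
    best_y = float("-inf")
    # Sweep by decreasing x (ties by decreasing y); a point survives only if
    # its y exceeds every y seen so far.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return sorted(result)
```

The sweep takes O(n log n) time for n points because of the sort; the paper's contribution concerns answering such queries on preprocessed, colored points, which this sketch does not attempt.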
Preprint
Full-text available
Motivation: Co-linear chaining has proven to be a powerful technique for finding approximately optimal alignments and approximating edit distance. It is used as an intermediate step in numerous mapping tools that follow a seed-and-extend strategy. Despite this popularity, subquadratic time algorithms for the case where chains support anchor overlaps a...
Article
The non-overlapping indexing problem is defined as follows: pre-process a given text T[1,n] of length n into a data structure such that whenever a pattern P[1,m] comes as an input, we can efficiently report the largest set of non-overlapping occurrences of P in T. The best-known solution is by Cohen and Porat [ISAAC 2009]. The size of their structu...
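The query itself can be sketched without any index: scanning left to right and always taking the earliest occurrence that starts after the previous chosen one ends yields a largest set of non-overlapping occurrences, since all occurrences have the same length m. The function name is hypothetical, positions are 0-based by assumption, and this linear scan is not the data structure of Cohen and Porat.

```python
def non_overlapping_occurrences(text, pattern):
    """Return 0-based starts of a largest set of non-overlapping occurrences."""
    m = len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern has no meaningful occurrences
    chosen = []
    i = text.find(pattern)
    while i != -1:
        chosen.append(i)
        # Skip past the chosen occurrence so the next one cannot overlap it.
        i = text.find(pattern, i + m)
    return chosen
```

On the text `"aaaaa"` with pattern `"aa"` this selects the occurrences starting at 0 and 2.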
Article
Full-text available
Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective...
Article
Let P be a collection of d patterns {P1,P2,…,Pd} of total length n characters, which are chosen from an alphabet Σ of size σ. Given a text T (over Σ), the dictionary indexing problem is to create a data structure using which we can report all positions j (called occurrences) where at least one of the patterns Pi∈P is a match with the same-length su...
Article
Full-text available
Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as...
Preprint
The CNF formula satisfiability problem (CNF-SAT) has been reduced to many fundamental problems in P to prove tight lower bounds under the Strong Exponential Time Hypothesis (SETH). Recently, the works of Abboud, Hansen, Vassilevska W. and Williams (STOC 16), and later, Abboud and Bringmann (ICALP 18) have proposed basing lower bounds on the hardnes...
Article
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest. Formally, let D be a collec...
Article
Let T[1,n] be a text of length n and T[i,n] be the suffix starting at position i. Also, for any two strings X and Y, let LCP(X,Y) denote their longest common prefix. The range-LCP of T w.r.t. a range [α,β], where 1≤α<β≤n, is rlcp(α,β) = max{|LCP(T[i,n],T[j,n])| : i≠j and i,j∈[α,β]}. Amir et al. [ISAAC 2011] introduced the indexing version of this problem,...
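A direct reading of the range-LCP definition gives the brute-force reference below. It assumes 1-based inclusive positions as in the abstract; the names `lcp_length` and `range_lcp` are hypothetical, and the quadratic number of suffix comparisons is exactly what an index for this problem is meant to avoid.

```python
def lcp_length(x, y):
    """Length of the longest common prefix of strings x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def range_lcp(text, alpha, beta):
    """Maximum |LCP| over pairs of distinct suffixes starting in [alpha, beta].

    Positions are 1-based and inclusive, with 1 <= alpha < beta <= len(text).
    """
    best = 0
    for i in range(alpha, beta + 1):
        for j in range(i + 1, beta + 1):
            best = max(best, lcp_length(text[i - 1:], text[j - 1:]))
    return best
```

For `"banana"` and the range [1,6], the suffixes starting at positions 2 and 4 share the prefix `"ana"`, so the answer is 3.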
Article
This paper revisits the $k$-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the $k$-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(...
Article
Full-text available
Text indexing is a fundamental problem in computer science. The objective is to preprocess a text T, so that, given a pattern P, we can find all starting positions (or simply, occurrences) of P in T efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the non-overlapping indexing problem, and the range n...
Preprint
We present the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time $2^{o(\sigma)}n$ unless the Exponential Time Hypothesis fails, where $\sigma$ is the size of the alphabet. Moreover, we show that optimizat...
Chapter
Let \(\mathsf {T}[1,n]\) be a string of length n and \(\mathsf {T}[i,j]\) be the substring of \(\mathsf {T}\) starting at position i and ending at position j. A substring \(\mathsf {T}[i,j]\) of \(\mathsf {T}\) is a repeat if it occurs more than once in \(\mathsf {T}\); otherwise, it is a unique substring of \(\mathsf {T}\). Repeats and unique subs...
Article
Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P,k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P,d). Linear space and opti...
Preprint
In recent years several compressed indexes based on variants of the Burrows-Wheeler transform have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an effort to better understand under which conditions su...
Article
In this article, the investigators present a new method using a deep learning approach to diagnose schizophrenia. In the experiment presented, the investigators have used a secondary dataset provided by The National Institute of Health. The aforementioned experimentation involves analyzing this dataset for the existence of schizophrenia using tradi...
Article
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem o...
Conference Paper
Full-text available
Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multip...
Conference Paper
This paper presents a new method for diagnosing schizophrenia using deep learning. This experiment used a secondary dataset supplied by the National Institute of Health. The experiment analyzes the dataset and identifies schizophrenia using traditional machine learning methods such as logistic regression, support vector machines, and random forest....
Conference Paper
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(n\log^k n)$...
Chapter
Let \(\mathsf {T}[1,n]\) be a text of length n and \(\mathsf {T}[i,n]\) be the suffix starting at position i. Also, for any two strings X and Y, let \(\mathsf {LCP}(X, Y)\) denote their longest common prefix. The range-LCP of \(\mathsf {T}\) w.r.t. a range \([\alpha ,\beta ]\), where \(1\le \alpha < \beta \le n\), is \(\mathsf {rlcp}(\alpha ,\beta ) = \max \{|\mathsf {LCP}(\mathsf {T}[i,n], \mathsf {T}[j,n])| \mid i \ne j \text{ and } i,j \in [\alpha ,\beta ]\}\). Amir et...
Article
Let D={T1,T2,…,TD} be a collection of D documents having n characters in total. Given two patterns P and Q, and an integer k>0, we consider the following queries. • top-k forbidden pattern query: Among all documents containing P, but not Q, report the k documents most relevant to P. • top-k two pattern query: Among all documents that contain both P a...
Preprint
Full-text available
The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS can be computed in O(n) space and time, where n=x+y is the input size. Compressed string matching is the study of string matching problems with the following twist: the input data is in a compressed format and the underlying task...
Chapter
We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully co...
Article
We consider the problem of indexing a given text T[0...n−1] of n characters over an alphabet set Σ of size σ, in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p) and two indices ℓ and r, and the output is the set of all occ_{ℓ,r} occurrences of P in T[ℓ...r]. In this paper, we pr...
Article
Given a string X[1,n] and a position k∈[1,n], a Shortest Unique Substring of X covering k, denoted by Sk, is a substring X[i,j] of X which satisfies the following conditions: (i) i≤k≤j, (ii) i is the only position where there is an occurrence of X[i,j], and (iii) j−i is minimized. Current best-known algorithms for finding Sk require Θ(n) words of w...
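Conditions (i) to (iii) translate directly into the brute-force reference below. It assumes 1-based inclusive positions as in the abstract and returns the leftmost candidate among the shortest ones; the function names are hypothetical, and the running time is far from that of the algorithms the paper discusses.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def shortest_unique_substring(x, k):
    """Return (i, j), 1-based inclusive, of a shortest unique substring covering k."""
    n = len(x)
    for length in range(1, n + 1):
        # Every start i with i <= k <= i + length - 1 that fits inside x.
        for i in range(max(1, k - length + 1), min(k, n - length + 1) + 1):
            j = i + length - 1
            if count_occurrences(x, x[i - 1:j]) == 1:
                return (i, j)
    return None  # reached only when x is empty or k is out of range
```

Because the whole string X occurs exactly once in itself, a valid answer always exists for 1 ≤ k ≤ n.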
Article
Full-text available
Background: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the pr...
Article
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that fits to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words...
Article
Full-text available
Background: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and tr...
Article
A gap is a sequence of don’t care characters. In this paper, we study two variants of the dictionary matching problem, where gaps may be present in the patterns or in the text. The first variant, called dictionary matching with one gap, considers indexing a collection D of d one-gap-patterns, where the ith pattern is of the form Pi[αi,βi]Qi with Pi...
Article
On a given vector X = x_1, x_2, ..., x_n of integers, the range selection (i, j, k) query is finding the k-th smallest integer in x_i, x_{i+1}, ..., x_j for any (i, j, k) such that 1 ≤ i ≤ j ≤ n, and 1 ≤ k ≤ j − i + 1. Previous studies on the problem proposed data structures that occupied additional O(n log n) bits of space over X itself that answer the queries in logarithmic time. In this study, we replace X and encode all integers in it vi...
Conference Paper
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably effi...
Article
Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \ldots ,\mathsf {T}_D\}\) be a collection of D string documents of n characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), can return t...
Article
In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. W...
Article
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, w...
Article
Full-text available
We consider the Parameterized Pattern Matching problem, where a pattern $P$ matches some location in a text $\mathsf{T}$ iff there is a one-to-one correspondence between the alphabet symbols of the pattern and those of the text. More specifically, assume that the text $\mathsf{T}$ contains $n$ characters from a static alphabet $\Sigma_s$ and a...
Conference Paper
Full-text available
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that fits to solve both the exact and approximate $k$-mismatch SUS finding, using the minimum $2n$ memory w...
Article
Full-text available
Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-...
Article
Strings form a fundamental data type in computer systems. String searching has been extensively studied since the inception of computer science. Increasingly many applications have to deal with imprecise strings or strings with fuzzy information in them. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each pos...
Conference Paper
Full-text available
Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignm...
Conference Paper
Full-text available
The Range LCP problem is to preprocess a string \(S[1\dots n]\), to enable efficient solutions of the following query: given a range [l, r] as the input, report \(\max _{i, j \in \{l,\ldots ,r\}} |\mathsf {LCP}(S_{i}, S_j)|\). Here \(\mathsf {LCP}(S_i, S_j)\) is the longest common prefix of the suffixes of S starting at locations i and j and \(|\ma...
Conference Paper
Given a text \(\mathsf {T}\) having \(n\) characters, we consider the non-overlapping indexing problem defined as follows: pre-process \(\mathsf {T}\) into a data-structure, such that whenever a pattern \(P\) comes as input, we can report a maximal set of non-overlapping occurrences of \(P\) in \(\mathsf {T}\). The best known solution for this prob...
Conference Paper
A gap-pattern is a sequence of sub-patterns separated by bounded sequences of don’t care characters (called gaps). A one-gap-pattern is a pattern of the form \(P[\alpha ,\beta ]Q\), where \(P\) and \(Q\) are strings drawn from alphabet \(\varSigma \) and \([\alpha , \beta ]\) are lower and upper bounds on the gap size \(g\). The gap size \(g\) is t...
Conference Paper
Let \(\mathcal{D}=\{\mathsf {T}_1,\mathsf {T}_2,\dots , \mathsf {T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. The forbidden pattern document listing problem asks to report those documents \(\mathcal{D}' \subseteq \mathcal{D}\) which contain the pattern \(P\), but not the pattern \(Q\). The \({\mathsf {top\text{-...
Conference Paper
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among the methods based on substring composition, the Average Common Substring (ACS) measure proposed by Burstein et al. (RECOMB 2005) admits a straightf...
Conference Paper
On a given vector X = x_1, x_2, ..., x_n of integers, the range selection (i, j, k) query is finding the k-th smallest integer in x_i, x_{i+1}, ..., x_j for any (i, j, k) such that 1 ≤ i ≤ j ≤ n, and 1 ≤ k ≤ j − i + 1. Previous studies on the problem kept X intact and proposed data structures that occupied additional O(n · log n) bits of space ov...
Article
Full-text available
We face the problem of designing a data structure that can report all $\tau$-majorities within any range of an array $A[1,n]$, without storing $A$. A $\tau$-majority in a range $A[i,j]$, for $0<\tau< 1$, is an element that occurs more than $\tau(j-i+1)$ times in $A[i,j]$. We show that $\Omega(n\log(1/\tau))$ bits are necessary for such a data struc...
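To make the τ-majority definition concrete, the sketch below answers a query by counting inside the range. Unlike the structure in the abstract, it keeps the array A, so it serves only as a reference for what a query should return; the function name and the 1-based inclusive range convention are assumptions.

```python
from collections import Counter


def tau_majorities(a, i, j, tau):
    """Return the elements occurring more than tau * (j - i + 1) times in a[i..j].

    The range is 1-based and inclusive, with 0 < tau < 1.
    """
    window = a[i - 1:j]
    threshold = tau * len(window)
    counts = Counter(window)
    return sorted(x for x, c in counts.items() if c > threshold)
```

Since each τ-majority needs more than a τ fraction of the range, fewer than 1/τ elements can be returned for any query.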
Conference Paper
We study the problem of indexing a text T[1,n] such that whenever a pattern P[1,p] and an interval [a,b] come as a query, we can report all pairs (i, j) of consecutive occurrences of P in T with a ≤ j−i ≤ b. We present an O(n log n) space data structure with optimal O(p+k) query time, where k is the output size.
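The expected output of such a query can be pinned down with a scan-based reference. It assumes 1-based occurrence positions, allows occurrences to overlap, and uses hypothetical function names; it takes time linear in the text per query and is not the O(n log n)-space structure with O(p+k) query time described above.

```python
def occurrences(text, pattern):
    """Return the 1-based starting positions of pattern in text, in increasing order."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = text.find(pattern, start + 1)
    return positions


def consecutive_occurrences(text, pattern, a, b):
    """Return pairs (i, j) of consecutive occurrences with a <= j - i <= b."""
    occ = occurrences(text, pattern)
    # zip(occ, occ[1:]) pairs each occurrence with the next one in the text.
    return [(i, j) for i, j in zip(occ, occ[1:]) if a <= j - i <= b]
```

For the text `"abcabxxab"` and pattern `"ab"`, the occurrences are at positions 1, 4 and 8, so the interval [3,3] selects only the pair (1, 4).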
Conference Paper
Let \({\cal D} =\{d_1,d_2,...,d_D\}\) be a collection of D string documents of n characters in total. The two-pattern matching problems ask to index \({\cal D}\) for answering the following queries efficiently: (i) report/count the unique documents containing P_1 and P_2; (ii) report/count the unique documents containing P_1, but not P_2. Here P_1 and P_2 r...
Article
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not eas...
Conference Paper
Full-text available
Let $\mathcal{D} = \{\mathsf{T}_1,\mathsf{T}_2, \dots,\mathsf{T}_D\}$ be a collection of $D$ string documents of $n$ characters in total, that are drawn from an alphabet set $\Sigma=[\sigma]$. The \emph{top-$k$ document retrieval problem} is to preprocess $\mathcal{D}$ into a data structure that, given a query $(P[1..p],k)$, can return the $k$ documents of $\mathcal{D}$ most relevant to pa...
Conference Paper
We consider the problem of indexing a collection \(\cal{D}\) of D strings (documents) of total n characters from an alphabet set of size σ, such that whenever a pattern P (of p characters) and an integer τ ∈ [1, D] comes as a query, we can efficiently report all (i) maximal generic words and (ii) minimal discriminating words as defined below: maxim...
Conference Paper
In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. W...
Conference Paper
Text indexing is a fundamental problem in computer science, where the task is to index a given text (string) T[1..n], such that whenever a pattern P[1..p] comes as a query, we can efficiently report all those locations where P occurs as a substring of T. In this paper, we consider the case when P contains wildcard characters (which c...
Conference Paper
Let $\mathcal{S}$ be a set of $n$ points in an $[n]^d$ grid, such that each point is assigned a color. Given a query range $Q = [a_1, b_1] \times [a_2, b_2] \times \ldots \times [a_d, b_d]$, the geometric range mode query problem asks to report the most frequent color (i.e., a mode) of the multiset of colors corresponding to points in $\mathcal{S} \cap Q$. When...
Conference Paper
Let \({\cal D}\) be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess \({\cal D}\) into a data structure that, given a query (P,k), can return the k documents of \({\cal D}\) most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking funct...
Article
We address the problem of indexing a collection D={T1,T2,...TD} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queri...
Article
Given an array A[1...n] of n distinct elements from the set {1, 2, ..., n} a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated...
Conference Paper
Full-text available
We consider how to preprocess n colored points in the plane such that later, given a multiset of colors, we can quickly find an axis-aligned rectangle containing a subset of the points with exactly those colors, if one exists. We first give an index that uses o(n^4) space and o(n) query time when there are \(\mathcal{O}(1)\) distinct colors. W...
Article
Given a set \(\mathcal{D}\) of patterns of total length n, the dictionary matching problem is to index \(\mathcal{D}\) such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick in Commun. ACM 18(6):333–340,...
Conference Paper
Let \(\mathcal{D} = \{d_1, d_2, \ldots, d_D\}\) be a given set of D string documents of total length n. Our task is to index \(\mathcal{D}\) such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear space data structures of O(n) words for answering such queries in optimal O(p + k) time. In th...
Conference Paper
The existing heuristics for top-k join queries, aiming to minimize the scan-depth, rely heavily on scores and correlation of scores. It is known that for uniformly random scores between two relations of length n, scan-depth of √kn is required. Moreover, optimizing multiple criteria of selections that are anti-correlated may require scan-depth up to...