
Source Available from: Avivit Levy
ABSTRACT: The dictionary matching with gaps problem is to preprocess a dictionary D of total size |D| containing d gapped patterns over an alphabet Σ, where each gapped pattern is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text T of length n over Σ, the goal is to output all locations in T in which a pattern P_i ∈ D, 1 ≤ i ≤ d, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap or a few gaps with at least α and at most β don't cares, where α and β are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either O(d log d + |D|) preprocessing time and O(d log^ε d + |D|) space, and query time O(n(β − α) log log d log² min{d, log |D|} + occ), where occ is the number of patterns found, or preprocessing time and space O(d² + |D|), and query time O(n(β − α) + occ), where occ is the number of patterns found. We also show that the dictionary matching with k gaps problem, where , can be solved in preprocessing time: , space: , and query time: , where are constants and occ is the number of patterns found. As far as we know, these are the best solutions for this setting of the problem, where many overlaps may exist in the dictionary. Theoretical Computer Science 04/2015; 589. DOI:10.1016/j.tcs.2015.04.011 · 0.52 Impact Factor

Source Available from: Oren Kapah
ABSTRACT: Efficient handling of sparse data is a key challenge in Computer Science.
Binary convolutions, such as polynomial multiplication or the Walsh Transform,
are a useful tool in many applications and can be computed efficiently.
In the last decade, several problems required efficient solutions of sparse
binary convolutions. Both randomized and deterministic algorithms were
developed for efficiently computing the sparse polynomial multiplication. The
key operation in all these algorithms was length reduction. The sparse data is
mapped into small vectors that preserve the convolution result. The reduction
method used to date was the modulo function, since it preserves location (of the
"1" bits) up to cyclic shift.
To date there is no known efficient algorithm for computing the sparse Walsh
transform. Since the modulo function does not preserve the Walsh transform, a
new method for length reduction is needed. In this paper we present such a new
method: polynomials. This method enables the development of an efficient
algorithm for computing the binary sparse Walsh transform. To our knowledge,
this is the first such algorithm. We also show that this method allows a faster
deterministic computation of sparse polynomial multiplication than currently
known in the literature.
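The length-reduction idea described above can be illustrated with a toy sketch (illustrative only; the names and representation are mine, not the paper's): sparse polynomials are held as exponent-to-coefficient maps, and the modulo function folds the sparse index space into a short vector, preserving the locations of the "1" bits up to cyclic shift whenever no two indices collide.

```python
def sparse_poly_mult(a, b):
    """Multiply two sparse polynomials given as {exponent: coefficient} dicts."""
    result = {}
    for ea, ca in a.items():
        for eb, cb in b.items():
            e = ea + eb
            result[e] = result.get(e, 0) + ca * cb
    return {e: c for e, c in result.items() if c != 0}

def reduce_mod(a, m):
    """Modulo-based length reduction: fold a sparse vector into length m.
    Index i maps to i % m, so locations are preserved up to cyclic shift
    whenever no two nonzero indices collide."""
    small = [0] * m
    for e, c in a.items():
        small[e % m] += c
    return small
```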

Source Available from: Avivit Levy
ABSTRACT: The dictionary matching with gaps problem is to preprocess a dictionary $D$
of $d$ gapped patterns $P_1,\ldots,P_d$ over alphabet $\Sigma$, where each
gapped pattern $P_i$ is a sequence of subpatterns separated by bounded
sequences of don't cares. Then, given a query text $T$ of length $n$ over
alphabet $\Sigma$, the goal is to output all locations in $T$ in which a
pattern $P_i\in D$, $1\leq i\leq d$, ends. There is a renewed current interest
in the gapped matching problem stemming from cyber security. In this paper we
solve the problem where all patterns in the dictionary have one gap with at
least $\alpha$ and at most $\beta$ don't cares, where $\alpha$ and $\beta$ are
given parameters. Specifically, we show that the dictionary matching with a
single gap problem can be solved in either $O(d\log d + |D|)$ time and
$O(d\log^{\varepsilon} d + |D|)$ space, and query time $O(n(\beta - \alpha)
\log\log d \log ^2 \min \{ d, \log |D| \} + occ)$, where $occ$ is the number
of patterns found, or preprocessing time and space $O(d^2 + |D|)$, and query
time $O(n(\beta - \alpha) + occ)$, where $occ$ is the number of patterns found.
As far as we know, this is the best solution for this setting of the problem,
where many overlaps may exist in the dictionary.
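For concreteness, a naive baseline for the single-gap query (a sketch under my own naming; its per-query cost is far above the paper's bounds): report every position where some pattern's first subpattern is followed, after between α and β don't cares, by its second subpattern.

```python
def find_gapped(text, dictionary, alpha, beta):
    """Report (end_position, pattern_index) for every occurrence of a
    gapped pattern (p1, p2) whose subpatterns are separated by at least
    alpha and at most beta don't-care symbols.  Naive scan over all
    starting positions and all gap lengths."""
    out = []
    for idx, (p1, p2) in enumerate(dictionary):
        for i in range(len(text) - len(p1) + 1):
            if text[i:i + len(p1)] != p1:
                continue
            for g in range(alpha, beta + 1):
                j = i + len(p1) + g  # start of the second subpattern
                if j + len(p2) <= len(text) and text[j:j + len(p2)] == p2:
                    out.append((j + len(p2) - 1, idx))  # end position
    return out
```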

Source Available from: Oren Kapah
ABSTRACT: We introduce and examine the {\em Holiday Gathering Problem} which models the
difficulty that couples have when trying to decide with which parents they
should spend the holiday. Our goal is to schedule the family gatherings so
that parents will be {\em happy}, i.e.\ all their children will be home
{\em simultaneously} for the holiday festivities, while minimizing the number
of consecutive holidays in which parents are not happy.
The holiday gathering problem is closely related to several classical
problems in computer science, such as the {\em dining philosophers problem} on
a general graph and periodic scheduling, and has applications in scheduling of
transmissions made by cellular radios. We also show interesting connections
between periodic scheduling, coloring, and universal prefix-free encodings.
The combinatorial definition of the Holiday Gathering Problem is: given a
graph $G$, find an infinite sequence of independent sets of $G$. The objective
function is to minimize, for every node $v$, the maximal gap between two
appearances of $v$. In good solutions this gap depends on local properties of
the node (i.e., its degree) and the solution should be periodic, i.e.\ a
node appears every fixed number of periods. We show a coloring-based
construction where the period of each node colored with color $c$ is at most
$2^{1+\log^*c}\cdot\prod_{i=0}^{\log^*c} \log^{(i)}c$ (where $\log^{(i)}$ means
iterating the $\log$ function $i$ times). This is achieved via a connection
with {\it prefix-free encodings}. We prove that this is the best possible for
coloring-based solutions. We also show a construction with period at most $2d$
for a node of degree $d$.
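A much cruder baseline than the construction above (my own sketch, not the paper's): given a proper coloring, cycling through the color classes schedules an independent set in each round and gives every node period exactly $c$, the total number of colors.

```python
def round_robin_schedule(coloring, rounds):
    """Baseline periodic schedule from a proper coloring: in round t we
    gather the independent set of nodes whose color is t mod c, so every
    node appears with period exactly c (the number of colors used).
    The paper's construction achieves shorter, color-dependent periods."""
    colors = sorted(set(coloring.values()))
    c = len(colors)
    return [[v for v, col in sorted(coloring.items()) if col == colors[t % c]]
            for t in range(rounds)]
```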

Source Available from: Moshe Lewenstein
ABSTRACT: Jumbled indexing is the problem of indexing a text $T$ for queries that ask
whether there is a substring of $T$ matching a pattern represented as a Parikh
vector, i.e., the vector of frequency counts for each character. Jumbled
indexing has garnered a lot of interest in the last four years. There is a
naive algorithm that preprocesses all answers in $O(n^2|\Sigma|)$ time allowing
quick queries afterwards, and there is another naive algorithm that requires no
preprocessing but has $O(n\log|\Sigma|)$ query time. Despite a tremendous
amount of effort there has been little improvement over these running times.
In this paper we provide good reason for this. We show that, under a
3SUM-hardness assumption, jumbled indexing for alphabets of size $\omega(1)$
requires $\Omega(n^{2-\epsilon})$ preprocessing time or $\Omega(n^{1-\delta})$
query time for any $\epsilon,\delta>0$. In fact, under a stronger 3SUM-hardness
assumption, for any constant alphabet size $r\ge 3$ there exist describable
fixed constants $\epsilon_r$ and $\delta_r$ such that jumbled indexing requires
$\Omega(n^{2-\epsilon_r})$ preprocessing time or $\Omega(n^{1-\delta_r})$ query
time.
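The second naive algorithm mentioned above can be sketched as a sliding-window scan (an illustrative sketch with my own naming): maintain the character counts of a window whose length equals the total of the Parikh vector and compare against the query.

```python
from collections import Counter

def jumbled_match(text, parikh):
    """No-preprocessing jumbled query: slide a window whose length is the
    total count in the Parikh vector and compare frequency counts,
    updating the window counter in O(1) per position."""
    m = sum(parikh.values())
    if m == 0 or m > len(text):
        return False
    window = Counter(text[:m])
    target = Counter(parikh)
    if window == target:
        return True
    for i in range(m, len(text)):
        window[text[i]] += 1          # character entering the window
        window[text[i - m]] -= 1      # character leaving the window
        if window[text[i - m]] == 0:
            del window[text[i - m]]
        if window == target:
            return True
    return False
```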

ABSTRACT: Histogram indexing, also known as jumbled pattern indexing and permutation indexing, is one of the important current open problems in pattern matching. It was introduced about 6 years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in time o(|T|²/polylog |T|) and achieve histogram indexing, even over a binary alphabet, in time independent of the text length. The pattern matching version of this problem has a simple linear-time solution. The block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass spectrometry. It is also an example of a pattern matching problem that has an efficient, almost linear-time solution but whose indexing version is daunting. However, for fixed finite alphabets, there has been progress made. In this paper, a strong connection between the histogram indexing problem and the block-mass pattern indexing problem is shown. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than the complexity of the reduction. In addition, we show that for both these problems, even over unbounded alphabets, there are algorithms that preprocess a text T in time o(|T|²/polylog |T|) and enable answering indexing queries in time polynomial in the query length. The contributions of this paper are twofold: (i) we introduce the idea of allowing a tradeoff between the preprocessing time and query time of various indexing problems that have been stumbling blocks in the literature; (ii) we take the first step in introducing a class of indexing problems that, we believe, cannot be preprocessed in time o(|T|²/polylog |T|) and enable linear-time query processing. Philosophical Transactions of The Royal Society A: Mathematical, Physical and Engineering Sciences 04/2014; 372(2016):20130132. DOI:10.1098/rsta.2013.0132 · 2.86 Impact Factor

Source Available from: Avivit Levy
ABSTRACT: Given ϵ∈[0,1), the ϵ-Relative Error Periodic Pattern Problem (REPP) is the following:
INPUT: An n-long sequence S of numbers s_i∈ℕ in increasing order.
OUTPUT: The longest ϵ-relative error periodic pattern, i.e., the longest subsequence s_{i_1}, s_{i_2}, …, s_{i_k} of S, for which there exists a number p such that the absolute difference between any two consecutive numbers in the subsequence is at least p and at most p(1+ϵ).
The best known algorithm for this problem has O(n³) time complexity. This bound is too high for large inputs in practice. In this paper we give a new algorithm for finding the longest ϵ-relative error periodic pattern (the REPP problem). Our method is based on a transformation of the input sequence into a different representation: the ϵ-active maximal intervals list L, defined in this paper. We show that the transformation of S to the list L can be done efficiently (quadratic in n and linear in the size of L) and prove that our algorithm is linear in the size of L. This enables us to prove that our algorithm works in subcubic time on inputs for which the best known algorithm works in O(n³) time. Moreover, though it may happen that our algorithm would still be cubic, it is never worse than the known O(n³) algorithm and in many situations its complexity is O(n²) time. Theoretical Computer Science 03/2014; 525:60–67. DOI:10.1016/j.tcs.2013.05.001 · 0.52 Impact Factor
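The REPP condition itself is easy to check for a given candidate subsequence (a sketch, not the paper's algorithm): a valid p exists exactly when the largest consecutive difference is at most (1+ϵ) times the smallest, since p = min difference is the loosest feasible choice.

```python
def is_repp(subseq, eps):
    """Check the REPP condition: there exists p such that every
    consecutive difference of the (increasing) subsequence lies in
    [p, p*(1+eps)].  Taking p as the smallest difference is optimal."""
    diffs = [b - a for a, b in zip(subseq, subseq[1:])]
    if not diffs:
        return True
    p = min(diffs)
    return p > 0 and all(d <= p * (1 + eps) for d in diffs)
```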

Source Available from: Moshe Lewenstein
ABSTRACT: This paper presents a general technique for optimally transforming any
dynamic data structure that operates on atomic and indivisible keys by
constant-time comparisons, into a data structure that handles unbounded-length
keys whose comparison cost is not a constant. Examples of these keys are
strings, multidimensional points, multiple-precision numbers, multi-key data
(e.g.~records), XML paths, URL addresses, etc. The technique is more general
than what has been done in previous work, as no particular exploitation of the
underlying structure of the keys is required. The only requirement is that the insertion
of a key must identify its predecessor or its successor.
Using the proposed technique, an online suffix tree can be constructed in worst
case time $O(\log n)$ per input symbol (as opposed to amortized $O(\log n)$
time per symbol, achieved by previously known algorithms). To our knowledge,
our algorithm is the first that achieves $O(\log n)$ worst case time per input
symbol. Searching for a pattern of length $m$ in the resulting suffix tree
takes $O(\min(m\log |\Sigma|, m + \log n) + tocc)$ time, where $tocc$ is the
number of occurrences of the pattern. The paper also describes more
applications and shows how to obtain alternative methods for dealing with suffix
sorting, dynamic lowest common ancestors and order maintenance. SIAM Journal on Computing 06/2013; DOI:10.1137/110836377 · 0.76 Impact Factor

ABSTRACT: Finding the consensus of a given set of strings is a hard and challenging problem. The problem is formally defined as follows: given a set of strings S={s_1,…,s_k} and a constant d, find, if it exists, a string s* such that the distance of s* from each of the strings does not exceed d. This problem has many applications. Two examples are: in biology, it may be used to seek a common ancestor to given sections of DNA; in web searching it may be used as a clustering aid. The stringology community researched this problem under the Hamming distance. In that metric the problem is NP-hard. A lot of work has also been done in the Euclidean metric. In this paper we consider the Consensus problem under other string metrics. We show that this problem is NP-hard for the swap metric and APX-hard for the reversal metric. Information Processing Letters 05/2013; 113(10–11):371–374. DOI:10.1016/j.ipl.2013.02.016 · 0.48 Impact Factor

ABSTRACT: The Closest String Problem is defined as follows. Let S be a set of k strings {s_1,…,s_k}, each of length ℓ; find a string $\hat{s}$ such that the maximum Hamming distance of $\hat{s}$ from each of the strings is minimized. We denote this distance with d. The string $\hat{s}$ is called a consensus string. In this paper we present two main algorithms for this problem: the Configuration algorithm, with O(k²ℓᵏ) running time, and the Minority algorithm. The problem was introduced by Lanctot, Li, Ma, Wang and Zhang [13]. They showed that the problem is $\cal{NP}$-hard and provided an IP approximation algorithm. Since then the closest string problem has been studied extensively. This research can be roughly divided into three categories: approximate, exact and practical solutions. This paper falls under the exact solutions category. Despite the great effort to obtain efficient algorithms for this problem, an algorithm with the natural running time of O(ℓᵏ) was not known. In this paper we close this gap. Our result means that algorithms solving the closest string problem in times O(ℓ²), O(ℓ³), O(ℓ⁴) and O(ℓ⁵) exist for the cases of k=2,3,4 and 5, respectively. It is known that, in fact, the cases of k=2,3, and 4 can be solved in linear time. No efficient algorithm is currently known for the case of k=5. We prove the minority lemma, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion. This lemma, with some additional ideas, gives an O(ℓ²) time algorithm for computing a closest string of 5 binary strings. Proceedings of the 19th international conference on String Processing and Information Retrieval; 10/2012
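As a toy reference point (my own brute force, exponential in ℓ and nothing like the algorithms above), the radius and an exact consensus can be computed by exhaustive search over all candidate strings:

```python
from itertools import product

def radius(candidate, strings):
    """Maximum Hamming distance from candidate to any string in the set."""
    return max(sum(a != b for a, b in zip(candidate, s)) for s in strings)

def closest_string(strings, alphabet):
    """Exhaustive closest-string search over all |alphabet|**l candidates
    of length l; exponential in l, so only a toy baseline next to the
    exact O(l**k)-type algorithms discussed in the abstract."""
    l = len(strings[0])
    return min((''.join(c) for c in product(alphabet, repeat=l)),
               key=lambda c: radius(c, strings))
```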

Information and Computation 01/2012; 213. · 0.60 Impact Factor

Source Available from: Avivit Levy
ABSTRACT: Periodicity has been historically well studied and has numerous applications. In nature, however, few cyclic phenomena have an exact period. This paper surveys some recent results in approximate periodicity: concept definition, discovery or recovery, techniques and efficient algorithms. We will also show some interesting connections between error correction codes and periodicity. We will try to pinpoint the issues involved, the context in the literature, and possible future research directions. String Processing and Information Retrieval, 01/2012: pages 1–15;

ABSTRACT: The problem of finding the consensus of a given set of strings is formally defined as follows: given a set of strings S = {s_1,…,s_k} and a constant d, find, if it exists, a string s* such that the Hamming distance of s* from each of the strings does not exceed d.
In this paper we study an LP relaxation for the problem. We prove an additive upper bound, depending only on the number of strings k, and randomized bounds. We show that empirical results are much better. We also compare our program with some algorithms reported in the literature, and it is shown to perform well. 09/2011: pages 168–173;

ABSTRACT: The consensus (string) problem is finding a representative string, called a consensus, of a given set S of strings. In this paper we deal with consensus problems considering both distance sum and radius, where the distance sum is the sum of (Hamming) distances from the strings in S to the consensus and the radius is the longest (Hamming) distance from the strings in S to the consensus. Although there have been results considering either distance sum or radius, there have been no results considering both, to the best of our knowledge. We present the first algorithms for two consensus problems considering both distance sum and radius for three strings: one problem is to find an optimal consensus minimizing both distance sum and radius. The other problem is to find a bounded consensus such that the distance sum is at most s and the radius is at most r for given constants s and r. Our algorithms are based on characterization of the lower bounds of distance sum and radius, and thus they solve the problems efficiently. Both algorithms run in linear time. Theoretical Computer Science 09/2011; 412(39):5239–5246. DOI:10.1016/j.tcs.2011.05.034 · 0.52 Impact Factor

ABSTRACT: Matching a mass spectrum against a text (a key computational task in
proteomics) is slow since the existing text indexing algorithms (with
search time independent of the text size) are not applicable in the
domain of mass spectrometry. As a result, many important applications
(e.g., searches for mutated peptides) are prohibitively time-consuming
and even the standard search for non-mutated peptides is becoming too
slow with recent advances in high-throughput genomics and proteomics
technologies. We introduce a new paradigm, the Blocked Pattern Matching
(BPM) Problem, that models peptide identification. BPM corresponds to
matching a pattern against a text (over the alphabet of integers) under
the assumption that each symbol a in the pattern can match a block of
consecutive symbols in the text with total sum a. Research in Computational Molecular Biology: 15th Annual International Conference, RECOMB 2011, Vancouver, BC, Canada, March 28–31, 2011. Proceedings; 01/2011
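The BPM matching rule can be sketched with naive backtracking (illustrative only; the point of the paradigm is to index such queries efficiently): each pattern symbol must absorb a block of one or more consecutive text symbols summing to it.

```python
def bpm_match_at(text, pattern, i=0, j=0):
    """Does the pattern match the text starting at position i?  Each
    pattern symbol must absorb a block of >= 1 consecutive text symbols
    whose sum equals that symbol (all symbols positive integers).
    Naive backtracking over block boundaries."""
    if j == len(pattern):
        return True
    total, k = 0, i
    while k < len(text) and total < pattern[j]:
        total += text[k]
        k += 1
        if total == pattern[j] and bpm_match_at(text, pattern, k, j + 1):
            return True
    return False
```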

String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17–21, 2011. Proceedings; 01/2011

Source Available from: Avivit Levy
ABSTRACT: The problem of finding the period of a vector V is central to many applications. Let V′ be a periodic vector closest to V under some metric. We seek this V′, or more precisely we seek the smallest period that generates V′. In this paper we consider the problem of finding the closest periodic vector in L_p spaces. The measures of “closeness” that we consider are the metrics in the different L_p spaces. Specifically, we consider the L₁, L₂ and L∞ metrics. In particular, for a given n-dimensional vector V, we develop O(n²) time algorithms (a different algorithm for each metric) that construct the smallest period that defines such a periodic n-dimensional vector V′. We call that vector the closest periodic vector of V under the appropriate metric. We also show (three) Õ(n) time constant approximation algorithms for the period of the approximate closest periodic vector. Algorithms and Computation: 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5–8, 2011. Proceedings; 01/2011
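For a fixed candidate period p, the L₂-closest p-periodic vector has a closed form: each residue class mod p repeats the mean of V's entries in that class. This sketch covers only that one step (my own naming); searching over all candidate periods is where the paper's algorithms do the work.

```python
def closest_p_periodic_l2(v, p):
    """For a fixed period p, the L2-closest p-periodic vector repeats, in
    each residue class mod p, the mean of v's entries in that class
    (the mean minimizes the sum of squared deviations)."""
    n = len(v)
    means = []
    for r in range(p):
        vals = v[r::p]               # entries at positions r, r+p, r+2p, ...
        means.append(sum(vals) / len(vals))
    return [means[i % p] for i in range(n)]
```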

Source Available from: Avivit Levy
ABSTRACT: In this paper, we define the Range LCP problem as follows. Preprocess a string S, of length n, to enable efficient solutions of the following query: Given \([i,j],\ \ 0< i \leq j \leq n\), compute max_{ℓ,k∈{i,…,j}} LCP(S_ℓ, S_k), where LCP(S_ℓ, S_k) is the length of the longest common prefix of the suffixes of S starting at locations ℓ and k. This is a natural generalization of the classical LCP problem. Surprisingly, while it is known how to preprocess a string in linear time to enable LCP computation of two suffixes in constant time, this seems quite difficult in the Range LCP problem. It is trivial to answer such queries in time O((j−i)²) after a linear-time preprocessing, and easy to show an O(1) query algorithm after an O(|S|²) time preprocessing. We provide algorithms that solve the problem with the following complexities: 1. Preprocessing Time: O(|S|), Space: O(|S|), Query Time: O((j−i) log log n). Algorithms and Computation: 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5–8, 2011. Proceedings; 01/2011
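The trivial quadratic-per-query approach mentioned in the Range LCP abstract looks like this (0-based indices and my own naming): compare every pair of suffix starting positions in the range.

```python
def lcp(s, a, b):
    """Length of the longest common prefix of suffixes s[a:] and s[b:]."""
    n, k = len(s), 0
    while a + k < n and b + k < n and s[a + k] == s[b + k]:
        k += 1
    return k

def range_lcp(s, i, j):
    """Naive Range LCP query: maximize LCP over all pairs of suffix
    start positions in [i, j].  O((j-i)^2) suffix comparisons per query,
    with no preprocessing at all."""
    return max((lcp(s, a, b)
                for a in range(i, j + 1) for b in range(a + 1, j + 1)),
               default=0)
```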

ABSTRACT: The Shortest Common Supersequence (SCS) problem is that of seeking a shortest possible sequence that contains each of the input sequences as a subsequence. In this paper we consider applying the problem to Position Weight Matrices (PWM). The Position Weight Matrix was introduced as a tool to handle a set of sequences that are not identical, yet have many local similarities. Such a weighted sequence is a 'statistical image' of this set, where we are given the probability of every symbol's occurrence at every text location. We consider two possible definitions of SCS on PWM. For the first, we give a polynomial time algorithm for two input sequences. For the second, we prove NP-hardness. String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17–21, 2011. Proceedings; 01/2011
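For plain (non-weighted) strings, the two-sequence SCS has a classical dynamic program, which the PWM variants above generalize. A minimal sketch:

```python
def scs_length(a, b):
    """Shortest common supersequence length of two plain strings via the
    classic DP: dp[i][j] is the SCS length of a[:i] and b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0:
                dp[i][j] = j                      # only b remains
            elif j == 0:
                dp[i][j] = i                      # only a remains
            elif a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # shared symbol used once
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]
```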