Amihood Amir

Johns Hopkins University, Baltimore, Maryland, United States

Publications (190) · 72.51 Total Impact Points

  • Amihood Amir · Avivit Levy · Ely Porat · B. Riva Shalom
    ABSTRACT: The dictionary matching with gaps problem is to preprocess a dictionary D, of total size |D|, containing d gapped patterns over an alphabet Σ, where each gapped pattern is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text T of length n over Σ, the goal is to output all locations in T in which a pattern Pi ∈ D, 1 ≤ i ≤ d, ends. There is renewed interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap or a few gaps with at least α and at most β don't cares, where α and β are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either O(d log d + |D|) preprocessing time and O(d log^ε d + |D|) space with query time O(n(β − α) log log d log² min{d, log |D|} + occ), or O(d² + |D|) preprocessing time and space with query time O(n(β − α) + occ), where occ is the number of patterns found. We also give a solution for the dictionary matching with k gaps problem, k > 1. As far as we know, these are the best solutions for this setting of the problem, where many overlaps may exist in the dictionary.
    Full-text · Article · Apr 2015 · Theoretical Computer Science
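    The single-gap variant above can be stated concretely with a naive reference matcher. This is a sketch of the problem definition only, not the paper's indexing structures; the (left, right) pattern representation and the function name are assumptions.

```python
# Naive reference matcher for dictionary matching with a single gap.
# Quadratic scan; the paper's data structures answer such queries far faster.

def match_single_gap(text, patterns, alpha, beta):
    """Report (end_position, pattern_index) pairs at which a gapped pattern
    left . right ends, with a gap of at least alpha and at most beta
    don't cares between the two subpatterns (0-based end positions)."""
    hits = set()
    for idx, (left, right) in enumerate(patterns):
        for i in range(len(text) - len(left) + 1):
            if text[i:i + len(left)] != left:
                continue
            for gap in range(alpha, beta + 1):
                j = i + len(left) + gap        # start of the right subpattern
                end = j + len(right)
                if end <= len(text) and text[j:end] == right:
                    hits.add((end - 1, idx))
    return sorted(hits)

print(match_single_gap("abxcdxxabxxcd", [("ab", "cd")], 1, 2))  # [(4, 0), (12, 0)]
```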
  • ABSTRACT: The Range LCP problem is to preprocess a string S[1…n] to enable efficient solutions of the following query: given a range [ℓ, r] as the input, report max_{i,j∈{ℓ,…,r}} |LCP(Si, Sj)|. Here LCP(Si, Sj) is the longest common prefix of the suffixes of S starting at locations i and j, and |LCP(Si, Sj)| is its length. We study a natural extension of this problem, where the query consists of two ranges. Additionally, we allow a bounded number (say k ≥ 0) of mismatches in the LCP computation. Specifically, our task is to report the following when two ranges [ℓ1, r1] and [ℓ2, r2] come as input: max_{ℓ1≤i≤r1, ℓ2≤j≤r2} |LCPk(Si, Sj)|. Here LCPk(Si, Sj) is the longest prefix of Si and Sj with at most k mismatches allowed. We show that the queries can be answered in O(k) time using an O(n²/w) space data structure, where w is the word size. We also present space-efficient data structures for k = 0 and k = 1. For k = 0, we obtain a linear space data structure with query time O(√n/w log^ϵ n), where w is the word size and ϵ > 0 is an arbitrarily small constant. For the case k = 1 we obtain an O(n log n) space data structure with query time O(√n log n). Finally, we give a reduction from Set Intersection to Range LCP queries, suggesting that it will be very difficult to improve our upper bound by more than a factor of O(log^ϵ n).
    No preview · Article · Jan 2015
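    The LCPk definition above lends itself to a direct brute-force baseline, shown here as a sketch under the abstract's definitions (the paper's data structures answer the same two-range queries far faster; function names are assumptions).

```python
def lcp_k(s, i, j, k):
    """|LCP_k(S_i, S_j)|: length of the longest common prefix of suffixes
    s[i:] and s[j:] allowing at most k mismatches."""
    length = mismatches = 0
    while i + length < len(s) and j + length < len(s):
        if s[i + length] != s[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length

def range_lcp_k(s, r1, r2, k):
    """Brute-force two-range query: max |LCP_k(S_i, S_j)| over i in r1 and
    j in r2 (inclusive 0-based ranges). Quadratic in the range sizes."""
    return max(lcp_k(s, i, j, k)
               for i in range(r1[0], r1[1] + 1)
               for j in range(r2[0], r2[1] + 1))

print(range_lcp_k("ababcab", (0, 0), (2, 2), 1))  # 3
```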
  • Amihood Amir · Oren Kapah · Ely Porat · Amir Rothschild
    ABSTRACT: Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as polynomial multiplication or the Walsh Transform, are a useful tool in many applications and can be computed efficiently. In the last decade, several problems required efficient solutions for sparse binary convolutions, and both randomized and deterministic algorithms were developed for efficiently computing sparse polynomial multiplication. The key operation in all these algorithms was length reduction: the sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date was the modulo function, since it preserves location (of the "1" bits) up to cyclic shift. To date there is no known efficient algorithm for computing the sparse Walsh transform. Since the modulo function does not preserve the Walsh transform, a new method for length reduction is needed. In this paper we present such a new method: polynomials. This method enables the development of an efficient algorithm for computing the binary sparse Walsh transform. To our knowledge, this is the first such algorithm. We also show that this method allows a faster deterministic computation of sparse polynomial multiplication than currently known in the literature.
    Full-text · Article · Oct 2014
  • Amihood Amir · Avivit Levy · Ely Porat · B. Riva Shalom
    ABSTRACT: The dictionary matching with gaps problem is to preprocess a dictionary $D$ of $d$ gapped patterns $P_1,\ldots,P_d$ over alphabet $\Sigma$, where each gapped pattern $P_i$ is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text $T$ of length $n$ over alphabet $\Sigma$, the goal is to output all locations in $T$ in which a pattern $P_i\in D$, $1\leq i\leq d$, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least $\alpha$ and at most $\beta$ don't cares, where $\alpha$ and $\beta$ are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either $O(d\log d + |D|)$ time and $O(d\log^{\varepsilon} d + |D|)$ space, and query time $O(n(\beta -\alpha )\log\log d \log ^2 \min \{ d, \log |D| \} + occ)$, where $occ$ is the number of patterns found, or preprocessing time and space: $O(d^2 + |D|)$, and query time $O(n(\beta -\alpha ) + occ)$, where $occ$ is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary.
    Full-text · Article · Aug 2014
  • Amihood Amir · Oren Kapah · Tsvi Kopelowitz · Moni Naor · Ely Porat
    ABSTRACT: We introduce and examine the {\em Holiday Gathering Problem}, which models the difficulty that couples have when trying to decide with which parents they should spend the holiday. Our goal is to schedule the family gatherings so that parents are {\em happy}, i.e.\ all their children are home {\em simultaneously} for the holiday festivities, while minimizing the number of consecutive holidays in which parents are not happy. The holiday gathering problem is closely related to several classical problems in computer science, such as the {\em dining philosophers problem} on a general graph and periodic scheduling, and has applications in scheduling of transmissions made by cellular radios. We also show interesting connections between periodic scheduling, coloring, and universal prefix-free encodings. The combinatorial definition of the Holiday Gathering Problem is: given a graph $G$, find an infinite sequence of independent sets of $G$. The objective function is to minimize, for every node $v$, the maximal gap between two appearances of $v$. In good solutions this gap depends on local properties of the node (i.e., its degree) and the solution should be periodic, i.e.\ a node appears every fixed number of periods. We show a coloring-based construction where the period of each node colored with the color $c$ is at most $2^{1+\log^*c}\cdot\prod_{i=0}^{\log^*c} \log^{(i)}c$ (where $\log^{(i)}$ means iterating the $\log$ function $i$ times). This is achieved via a connection with {\it prefix-free encodings}. We prove that this is the best possible for coloring-based solutions. We also show a construction with period at most $2d$ for a node of degree $d$.
    Full-text · Article · Aug 2014
  • ABSTRACT: The Consensus String Problem is that of finding a string such that the maximum Hamming distance from it to a given set of strings of the same length is minimized. However, a generalization is necessary for clustering: one needs to consider a partition into a number of sets, each with a distinct centerstring. In this paper we define two natural versions of the consensus problem for c-centerstrings. We analyse the hardness and fixed-parameter tractability of these problems and provide approximation algorithms.
    No preview · Article · Jun 2014
  • Amihood Amir · Benny Porat
    ABSTRACT: Palindrome recognition is a classic problem in computer science. It is an example of a language that cannot be recognized by a deterministic finite automaton, and is often cited as an example of a problem whose decision by a single-tape Turing machine requires quadratic time. In this paper we revisit the palindrome recognition problem. We define a novel fingerprint that allows recognizing palindromes on-line in linear time with high probability. We then use group testing techniques to show that the fingerprint can be adapted to recognizing approximate palindromes on-line, i.e., it can recognize that a string is a palindrome with no more than k mismatches, where k is given. Finally, we show that this fingerprint can be used as a tool for solving other problems on-line. In particular we consider approximate pattern matching by non-overlapping reversals. This is the problem where two strings S and T are given, and the question is whether applying a sequence of non-overlapping reversals to S results in the string T.
    No preview · Article · Jun 2014
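    A Karp-Rabin-style pair of forward and backward rolling hashes gives the flavor of fingerprint-based on-line palindrome recognition. This is an illustrative sketch, not the paper's novel fingerprint; the constants MOD and BASE are arbitrary choices.

```python
MOD, BASE = (1 << 61) - 1, 256  # assumed parameters, not from the paper

def online_palindromes(stream):
    """Yield, after each symbol, whether the prefix read so far is a
    palindrome (with high probability; equal hashes could collide).
    fwd hashes the prefix left-to-right, rev hashes it right-to-left."""
    fwd = rev = 0
    power = 1                              # BASE**i mod MOD
    for ch in stream:
        c = ord(ch)
        fwd = (fwd * BASE + c) % MOD       # append c on the right
        rev = (rev + c * power) % MOD      # append c at the front of the reverse
        power = (power * BASE) % MOD
        yield fwd == rev                   # prefix is a palindrome iff hashes agree

print(list(online_palindromes("abcba")))  # [True, False, False, False, True]
```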
  • Amihood Amir · Yonatan Aumann · Piotr Indyk · Avivit Levy · Ely Porat
    ABSTRACT: Recently, a new pattern matching paradigm was proposed: pattern matching with address errors. In this paradigm approximate string matching problems are studied where the content is unaltered and only the locations of the different entries may change. Specifically, a broad class of problems was defined: the class of rearrangement errors. In this type of error the pattern is transformed through a sequence of rearrangement operations, each with an associated cost. The natural ℓ₁ and ℓ₂ rearrangement systems were considered. The best algorithm presented for general patterns, which may have repeating symbols, runs in O(nm) time. In this paper, we show that the problem can be approximated in linear time for general patterns. Another natural rearrangement system is considered in this paper: the ℓ∞ rearrangement distance. For this new rearrangement system we provide efficient exact solutions for different variants of the problem, as well as a faster approximation.
    No preview · Article · May 2014
  • ABSTRACT: Jumbled indexing is the problem of indexing a text $T$ for queries that ask whether there is a substring of $T$ matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years. There is a naive algorithm that preprocesses all answers in $O(n^2|\Sigma|)$ time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has $O(n\log|\Sigma|)$ query time. Despite a tremendous amount of effort there has been little improvement over these running times. In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size $\omega(1)$ requires $\Omega(n^{2-\epsilon})$ preprocessing time or $\Omega(n^{1-\delta})$ query time for any $\epsilon,\delta>0$. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size $r\ge 3$ there exist describable fixed constant $\epsilon_r$ and $\delta_r$ such that jumbled indexing requires $\Omega(n^{2-\epsilon_r})$ preprocessing time or $\Omega(n^{1-\delta_r})$ query time.
    Full-text · Article · May 2014
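    The "no preprocessing" baseline mentioned above can be sketched as a sliding window over the text, since a Parikh-vector match fixes the substring length. Illustrative only; `jumbled_match` is an assumed name, and this window runs in roughly linear time per query for small alphabets.

```python
from collections import Counter

def jumbled_match(text, parikh):
    """True iff some substring of text has exactly the character frequencies
    given by the Counter `parikh`: slide a window whose length equals the
    vector's total count, updating counts by one symbol per step."""
    m = sum(parikh.values())
    if m == 0 or m > len(text):
        return m == 0
    window = Counter(text[:m])
    if window == parikh:
        return True
    for i in range(m, len(text)):
        window[text[i]] += 1
        out = text[i - m]
        window[out] -= 1
        if window[out] == 0:
            del window[out]        # keep zero counts out of the comparison
        if window == parikh:
            return True
    return False
```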
  • Amihood Amir · Ayelet Butman · Ely Porat
    ABSTRACT: Histogram indexing, also known as jumbled pattern indexing and permutation indexing, is one of the important current open problems in pattern matching. It was introduced about 6 years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in time o(|T|²/polylog |T|) and achieve histogram indexing, even over a binary alphabet, in time independent of the text length. The pattern matching version of this problem has a simple linear-time solution. Block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass-spectrometry. It is also an example of a pattern matching problem that has an efficient, almost linear-time solution but whose indexing version is daunting. However, for fixed finite alphabets, there has been progress made. In this paper, a strong connection between the histogram indexing problem and the block-mass pattern indexing problem is shown. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than the complexity of the reduction. In addition, we show that for both these problems, even over unbounded alphabets, there are algorithms that preprocess a text T in time o(|T|²/polylog |T|) and enable answering indexing queries in time polynomial in the query length. The contributions of this paper are twofold: (i) we introduce the idea of allowing a trade-off between the preprocessing time and query time of various indexing problems that have been stumbling blocks in the literature. (ii) We take the first step in introducing a class of indexing problems that, we believe, cannot be pre-processed in time o(|T|²/polylog |T|) and enable linear-time query processing.
    Preview · Article · Apr 2014 · Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences
  • ABSTRACT: The problem of partitioning an edge-capacitated graph on n vertices into k balanced parts has been amply researched. Motivated by applications such as load balancing in distributed systems and market segmentation in social networks, we propose a new variant of the problem, called Multiply Balanced k Partitioning, where the vertex-partition must be balanced under d vertex-weight functions simultaneously. We design bicriteria approximation algorithms for this problem, i.e., they partition the vertices into up to k parts that are nearly balanced simultaneously for all weight functions, and their approximation factor for the capacity of cut edges matches the bounds known for a single weight function times d. For the case where d = 2, for vertex weights that are integers bounded by a polynomial in n and any fixed ε > 0, we obtain a \((2+\epsilon,\, O(\sqrt{\log n \log k}))\)-bicriteria approximation, namely, we partition the graph into parts whose weight is at most 2 + ε times that of a perfectly balanced part (simultaneously for both weight functions), and whose cut capacity is \(O(\sqrt{\log n \log k})\cdot\) OPT. For unbounded (exponential) vertex weights, we achieve approximation \((3,\ O(\log n))\). Our algorithm generalizes to d weight functions as follows: For vertex weights that are integers bounded by a polynomial in n and any fixed ε > 0, we obtain a \((2d +\epsilon,\, O(d\sqrt{\log n \log k}))\)-bicriteria approximation. For unbounded (exponential) vertex weights, we achieve approximation \((2d+ 1,\ O(d\log n))\).
    No preview · Article · Mar 2014
  • ABSTRACT: Given ϵ ∈ [0,1), the ϵ-Relative Error Periodic Pattern Problem (REPP) is the following: INPUT: An n-long sequence S of numbers si ∈ ℕ in increasing order. OUTPUT: The longest ϵ-relative error periodic pattern, i.e., the longest subsequence si1, si2, …, sik of S for which there exists a number p such that the absolute difference between any two consecutive numbers in the subsequence is at least p and at most p(1+ϵ). The best known algorithm for this problem has O(n³) time complexity. This bound is too high for large inputs in practice. In this paper we give a new algorithm for finding the longest ϵ-relative error periodic pattern (the REPP problem). Our method is based on a transformation of the input sequence into a different representation: the ϵ-active maximal intervals list L, defined in this paper. We show that the transformation of S to the list L can be done efficiently (quadratic in n and linear in the size of L) and prove that our algorithm is linear in the size of L. This enables us to prove that our algorithm works in sub-cubic time on inputs for which the best known algorithm works in O(n³) time. Moreover, though it may happen that our algorithm would still be cubic, it is never worse than the known O(n³) algorithm, and in many situations its complexity is O(n²) time.
    Full-text · Article · Mar 2014 · Theoretical Computer Science
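    The REPP definition above admits a simple brute force, useful for checking faster implementations on small inputs. This sketch is quartic, so slower than even the O(n³) algorithm discussed; it relies on the observation that p can be taken to be the smallest gap actually used, hence some pairwise difference.

```python
def longest_repp(s, eps):
    """Length of the longest eps-relative-error periodic pattern of the
    increasing sequence s: the longest subsequence whose consecutive
    differences all lie in [p, p*(1+eps)] for some p. O(n^4) brute force."""
    n = len(s)
    best = min(n, 1)
    for a in range(n):
        for b in range(a + 1, n):
            lo = s[b] - s[a]               # candidate period p
            hi = lo * (1 + eps)
            chain = [1] * n                # chain[i]: longest valid chain from i
            for i in range(n - 2, -1, -1):
                for j in range(i + 1, n):
                    if lo <= s[j] - s[i] <= hi:
                        chain[i] = max(chain[i], 1 + chain[j])
            best = max(best, max(chain))
    return best
```

    For example, with s = [0, 10, 21, 33, 100] and ϵ = 0.25, the subsequence 0, 10, 21, 33 is valid for p = 10 since all gaps lie in [10, 12.5].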
  • ABSTRACT: This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of such keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g.~records), XML paths, URL addresses, etc. The technique is more general than previous work, as no particular exploitation of the underlying structure of the data structure is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, an online suffix tree can be constructed in worst-case time $O(\log n)$ per input symbol (as opposed to amortized $O(\log n)$ time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves $O(\log n)$ worst-case time per input symbol. Searching for a pattern of length $m$ in the resulting suffix tree takes $O(\min(m\log |\Sigma|, m + \log n) + tocc)$ time, where $tocc$ is the number of occurrences of the pattern. The paper also describes further applications and shows how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors, and order maintenance.
    Full-text · Article · Jun 2013 · SIAM Journal on Computing
  • Amihood Amir · Haim Paryenty · Liam Roditty
    ABSTRACT: Finding the consensus of a given set of strings is a hard and challenging problem. The problem is formally defined as follows: given a set of strings S = {s1,…,sk} and a constant d, find, if it exists, a string s⁎ such that the distance of s⁎ from each of the strings does not exceed d. This problem has many applications. Two examples are: in biology, it may be used to seek a common ancestor of given sections of DNA; in web searching it may be used as a clustering aid. The stringology community has researched this problem under the Hamming distance. In that metric the problem is NP-hard. A lot of work has also been done in the Euclidean metric. In this paper we consider the Consensus problem under other string metrics. We show that this problem is NP-hard for the swap metric and APX-hard for the reversal metric.
    No preview · Article · May 2013 · Information Processing Letters
  • Amihood Amir · Benny Porat
    ABSTRACT: The Sorting by Reversals problem is known to be NP-hard, while a simplification, Sorting by Signed Reversals, is polynomially computable. Motivated by the pattern matching with rearrangements model, we consider Pattern Matching with Reversals. Since this is a generalization of the Sorting by Reversals problem, it is clearly NP-hard. We therefore consider the simplification where reversals cannot overlap. Such a constrained version has been researched in the past for various metrics in the rearrangement model: the swap metric and the interchange metric. We show that the constrained problem can be solved in linear time. We then consider the Approximate Pattern Matching with Non-overlapping Reversals problem, i.e., where mismatch errors are introduced. We show that the problem can be solved in quadratic time and space. Finally, we consider the on-line version of the problem. We introduce a novel signature for palindromes and show that it has a pleasing behavior, similar to the Karp-Rabin signature. It allows solving the Pattern Matching with Non-overlapping Reversals problem on-line in linear time w.h.p.
    No preview · Article · Jan 2013
  • Amihood Amir · Haim Paryenty · Liam Roditty
    ABSTRACT: The Closest String Problem is defined as follows: let S be a set of k strings {s1,…,sk}, each of length ℓ; find a string $\hat{s}$ such that the maximum Hamming distance of $\hat{s}$ from each of the strings is minimized. We denote this distance with d. The string $\hat{s}$ is called a consensus string. In this paper we present two main algorithms for this problem: the Configuration algorithm, with O(k²ℓᵏ) running time, and the Minority algorithm. The problem was introduced by Lanctot, Li, Ma, Wang and Zhang [13]. They showed that the problem is $\cal{NP}$-hard and provided an IP approximation algorithm. Since then the closest string problem has been studied extensively. This research can be roughly divided into three categories: approximate, exact and practical solutions. This paper falls under the exact solutions category. Despite the great effort to obtain efficient algorithms for this problem, an algorithm with the natural running time of O(ℓᵏ) was not known. In this paper we close this gap. Our result means that algorithms solving the closest string problem in times O(ℓ²), O(ℓ³), O(ℓ⁴) and O(ℓ⁵) exist for the cases of k = 2, 3, 4 and 5, respectively. It is known that, in fact, the cases of k = 2, 3, and 4 can be solved in linear time. No efficient algorithm is currently known for the case of k = 5. We prove the minority lemma, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion. This lemma, with some additional ideas, gives an O(ℓ²)-time algorithm for computing a closest string of 5 binary strings.
    No preview · Conference Paper · Oct 2012
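    For intuition, the Closest String Problem can be solved exactly by exhaustive search over all |Σ|^ℓ candidate strings. The sketch below is only a correctness baseline for tiny instances, not one of the paper's algorithms; the function name is an assumption.

```python
from itertools import product

def closest_string(strings, alphabet):
    """Exhaustively find a consensus string minimizing the maximum Hamming
    distance to the inputs. Exponential in the string length ℓ: useful only
    to cross-check faster algorithms on tiny instances."""
    ell = len(strings[0])

    def radius(cand):
        # max Hamming distance from cand to any input string
        return max(sum(a != b for a, b in zip(cand, s)) for s in strings)

    best = min(("".join(t) for t in product(alphabet, repeat=ell)), key=radius)
    return best, radius(best)
```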
  • Amihood Amir · Laxmi Parida

    No preview · Article · Jan 2012 · Information and Computation
  • Amihood Amir · Avivit Levy
    ABSTRACT: Periodicity has been historically well studied and has numerous applications. In nature, however, few cyclic phenomena have an exact period. This paper surveys some recent results in approximate periodicity: concept definition, discovery or recovery, techniques and efficient algorithms. We will also show some interesting connections between error correction codes and periodicity. We will try to pinpoint the issues involved, the context in the literature, and possible future research directions.
    Full-text · Chapter · Jan 2012
  • ABSTRACT: The problem of finding the period of a vector V is central to many applications. Let V′ be a periodic vector closest to V under some metric. We seek this V′, or more precisely we seek the smallest period that generates V′. In this paper we consider the problem of finding the closest periodic vector in Lₚ spaces. The measures of "closeness" that we consider are the metrics in the different Lₚ spaces. Specifically, we consider the L₁, L₂ and L∞ metrics. In particular, for a given n-dimensional vector V, we develop O(n²) time algorithms (a different algorithm for each metric) that construct the smallest period that defines such a periodic n-dimensional vector V′. We call that vector the closest periodic vector of V under the appropriate metric. We also show (three) O(n log n) time constant approximation algorithms for the (appropriate) period of the closest periodic vector.
    Full-text · Conference Paper · Dec 2011
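    Under the L₂ metric, a candidate-period search illustrates how an O(n²) algorithm of this shape can arise: for a fixed period p, the closest p-periodic vector repeats the mean of each residue class. This is a sketch under assumed conventions (restricting to proper periods p ≤ n/2), not necessarily the paper's algorithm.

```python
def closest_periodic_vector_l2(v):
    """For each proper candidate period p <= n/2, build the L2-closest
    p-periodic vector by repeating the mean of every residue class mod p,
    and keep the period with the smallest squared error. O(n^2) overall."""
    n = len(v)
    best = None                        # (period, vector, squared_error)
    for p in range(1, n // 2 + 1):
        w = [0.0] * n
        for r in range(p):
            cls = v[r::p]              # entries at positions congruent to r mod p
            mean = sum(cls) / len(cls)
            for i in range(r, n, p):
                w[i] = mean
        err = sum((a - b) ** 2 for a, b in zip(v, w))
        if best is None or err < best[2]:
            best = (p, w, err)
    return best
```

    For example, [1, 2, 1, 2, 1, 2] is exactly 2-periodic, so the search returns period 2 with zero error.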
  • Amihood Amir · Zvi Gotthilf · B. Riva Shalom
    ABSTRACT: The Shortest Common Supersequence (SCS) problem is that of seeking a shortest possible sequence that contains each of the input sequences as a subsequence. In this paper we consider applying the problem to Position Weight Matrices (PWM). The Position Weight Matrix was introduced as a tool to handle a set of sequences that are not identical, yet have many local similarities. Such a weighted sequence is a 'statistical image' of this set, where we are given the probability of every symbol's occurrence at every text location. We consider two possible definitions of SCS on PWM. For the first, we give a polynomial-time algorithm for two input sequences. For the second, we prove NP-hardness.
    No preview · Conference Paper · Oct 2011
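    For reference, the deterministic special case that the PWM variants generalize, SCS of two plain strings, has a classic dynamic program (standard textbook material, not the paper's contribution):

```python
def scs_length(a, b):
    """Length of the Shortest Common Supersequence of two plain strings.
    dp[i][j] = SCS length of the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j                 # one string empty: copy the other
            elif a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # shared symbol counted once
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(scs_length("abac", "cab"))  # 5 (e.g. "cabac")
```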

Publication Stats

3k Citations
72.51 Total Impact Points

Institutions

  • 2006-2015
    • Johns Hopkins University
      • Department of Computer Science
      Baltimore, Maryland, United States
  • 1970-2012
    • Bar Ilan University
      • Department of Computer Science
      Ramat Gan, Israel
  • 1991-2007
    • Georgia Institute of Technology
      • College of Computing
      Atlanta, Georgia, United States
  • 2001-2002
    • AT&T Labs
      Austin, Texas, United States
  • 1988-1991
    • University of Maryland, College Park
      • Institute for Advanced Computer Studies
      • Department of Computer Science
      Maryland, United States