Amihood Amir

Bar Ilan University, Ramat Gan, Tel Aviv, Israel

Publications (181), 23.52 Total impact

  • ABSTRACT: Jumbled indexing is the problem of indexing a text $T$ for queries that ask whether there is a substring of $T$ matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years. There is a naive algorithm that preprocesses all answers in $O(n^2|\Sigma|)$ time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has $O(n\log|\Sigma|)$ query time. Despite a tremendous amount of effort there has been little improvement over these running times. In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size $\omega(1)$ requires $\Omega(n^{2-\epsilon})$ preprocessing time or $\Omega(n^{1-\delta})$ query time for any $\epsilon,\delta>0$. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size $r\ge 3$ there exist describable fixed constant $\epsilon_r$ and $\delta_r$ such that jumbled indexing requires $\Omega(n^{2-\epsilon_r})$ preprocessing time or $\Omega(n^{1-\delta_r})$ query time.
    05/2014;
  • ABSTRACT: Given ϵ ∈ [0,1), the ϵ-Relative Error Periodic Pattern Problem (REPP) is the following: INPUT: An n-long sequence S of numbers s_i ∈ ℕ in increasing order. OUTPUT: The longest ϵ-relative error periodic pattern, i.e., the longest subsequence s_{i_1}, s_{i_2}, …, s_{i_k} of S, for which there exists a number p such that the absolute difference between any two consecutive numbers in the subsequence is at least p and at most p(1+ϵ). The best known algorithm for this problem has O(n^3) time complexity. This bound is too high for large inputs in practice. In this paper we give a new algorithm for finding the longest ϵ-relative error periodic pattern (the REPP problem). Our method is based on a transformation of the input sequence into a different representation: the ϵ-active maximal intervals list L, defined in this paper. We show that the transformation of S to the list L can be done efficiently (quadratic in n and linear in the size of L) and prove that our algorithm is linear in the size of L. This enables us to prove that our algorithm works in sub-cubic time on inputs for which the best known algorithm works in O(n^3) time. Moreover, though it may happen that our algorithm would still be cubic, it is never worse than the known O(n^3)-algorithm and in many situations its complexity is O(n^2) time.
    Theoretical Computer Science. 01/2014; 525:60–67.
  • Amihood Amir, Ayelet Butman, Ely Porat
    ABSTRACT: Histogram indexing, also known as jumbled pattern indexing and permutation indexing, is one of the important current open problems in pattern matching. It was introduced about 6 years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in time o(|T|^2/polylog|T|) and achieve histogram indexing, even over a binary alphabet, in time independent of the text length. The pattern matching version of this problem has a simple linear-time solution. The block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass spectrometry. It is also an example of a pattern matching problem that has an efficient, almost linear-time solution but whose indexing version is daunting. However, for fixed finite alphabets, there has been progress made. In this paper, a strong connection between the histogram indexing problem and the block-mass pattern indexing problem is shown. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than the complexity of the reduction. In addition, we show that for both these problems, even over unbounded alphabets, there are algorithms that preprocess a text T in time o(|T|^2/polylog|T|) and enable answering indexing queries in time polynomial in the query length. The contributions of this paper are twofold: (i) we introduce the idea of allowing a trade-off between the preprocessing time and query time of various indexing problems that have been stumbling blocks in the literature. (ii) We take the first step in introducing a class of indexing problems that, we believe, cannot be pre-processed in time o(|T|^2/polylog|T|) and enable linear-time query processing.
    Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences 01/2014; 372(2016):20130132. · 2.89 Impact Factor
  • ABSTRACT: This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons, into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g., records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work as no particular exploitation of the underlying structure is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, an online suffix tree can be constructed in worst case time $O(\log n)$ per input symbol (as opposed to amortized $O(\log n)$ time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves $O(\log n)$ worst case time per input symbol. Searching for a pattern of length $m$ in the resulting suffix tree takes $O(\min(m\log |\Sigma|, m + \log n) + tocc)$ time, where $tocc$ is the number of occurrences of the pattern. The paper also describes more applications and shows how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance.
    06/2013;
  • Amihood Amir, Haim Paryenty, Liam Roditty
    ABSTRACT: Finding the consensus of a given set of strings is a hard and challenging problem. The problem is formally defined as follows: given a set of strings S = {s_1,…,s_k} and a constant d, find, if it exists, a string s* such that the distance of s* from each of the strings does not exceed d. This problem has many applications. Two examples are: In biology, it may be used to seek a common ancestor to given sections of DNA. In web searching it may be used as a clustering aid. The stringology community researched this problem under the Hamming distance. In that metric the problem is NP-hard. A lot of work has been also done in the Euclidean metric. In this paper we consider the Consensus problem under other string metrics. We show that this problem is NP-hard for the swap metric and APX-hard for the reversal metric.
    Information Processing Letters 01/2013; 113(10–11):371–374. · 0.49 Impact Factor
  • Amihood Amir, Haim Paryenty, Liam Roditty
    ABSTRACT: The Closest String Problem is defined as follows. Let S be a set of k strings {s_1,…,s_k}, each of length ℓ, find a string $\hat{s}$, such that the maximum Hamming distance of $\hat{s}$ from each of the strings is minimized. We denote this distance with d. The string $\hat{s}$ is called a consensus string. In this paper we present two main algorithms, the Configuration algorithm with O(k^2 ℓ^k) running time for this problem, and the Minority algorithm. The problem was introduced by Lanctot, Li, Ma, Wang and Zhang [13]. They showed that the problem is $\mathcal{NP}$-hard and provided an IP approximation algorithm. Since then the closest string problem has been studied extensively. This research can be roughly divided into three categories: Approximate, exact and practical solutions. This paper falls under the exact solutions category. Despite the great effort to obtain efficient algorithms for this problem an algorithm with the natural running time of O(ℓ^k) was not known. In this paper we close this gap. Our result means that algorithms solving the closest string problem in times O(ℓ^2), O(ℓ^3), O(ℓ^4) and O(ℓ^5) exist for the cases of k=2,3,4 and 5, respectively. It is known that, in fact, the cases of k=2,3, and 4 can be solved in linear time. No efficient algorithm is currently known for the case of k=5. We prove the minority lemma, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion. This lemma with some additional ideas gives an O(ℓ^2) time algorithm for computing a closest string of 5 binary strings.
    Proceedings of the 19th international conference on String Processing and Information Retrieval; 10/2012
  • Amihood Amir, Haim Paryenty, Liam Roditty
    ABSTRACT: The problem of finding the consensus of a given set of strings is formally defined as follows: given a set of strings S = {s_1,…,s_k}, and a constant d, find, if it exists, a string s*, such that the Hamming distance of s* from each of the strings does not exceed d. In this paper we study an LP relaxation for the problem. We prove an additive upper bound, depending only on the number of strings k, and randomized bounds. We show that empirical results are much better. We also compare our program with some algorithms reported in the literature, and it is shown to perform well.
    09/2011: pages 168-173;
  • ABSTRACT: The consensus (string) problem is finding a representative string, called a consensus, of a given set S of strings. In this paper we deal with consensus problems considering both distance sum and radius, where the distance sum is the sum of (Hamming) distances from the strings in S to the consensus and the radius is the longest (Hamming) distance from the strings in S to the consensus. Although there have been results considering either distance sum or radius, there have been no results considering both, to the best of our knowledge. We present the first algorithms for two consensus problems considering both distance sum and radius for three strings: one problem is to find an optimal consensus minimizing both distance sum and radius. The other problem is to find a bounded consensus such that the distance sum is at most s and the radius is at most r for given constants s and r. Our algorithms are based on characterization of the lower bounds of distance sum and radius, and thus they solve the problems efficiently. Both algorithms run in linear time.
    Theor. Comput. Sci. 01/2011; 412:5239-5246.
  • Amihood Amir, Haim Parienty, Liam Roditty
    String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings; 01/2011
  • Theor. Comput. Sci. 01/2011; 412:3537-3544.
  • Julio Ng, Amihood Amir, Pavel A. Pevzner
    ABSTRACT: Matching a mass spectrum against a text (a key computational task in proteomics) is slow since the existing text indexing algorithms (with search time independent of the text size) are not applicable in the domain of mass spectrometry. As a result, many important applications (e.g., searches for mutated peptides) are prohibitively time-consuming and even the standard search for non-mutated peptides is becoming too slow with recent advances in high-throughput genomics and proteomics technologies. We introduce a new paradigm - the Blocked Pattern Matching (BPM) Problem - that models peptide identification. BPM corresponds to matching a pattern against a text (over the alphabet of integers) under the assumption that each symbol a in the pattern can match a block of consecutive symbols in the text with total sum a.
    Research in Computational Molecular Biology - 15th Annual International Conference, RECOMB 2011, Vancouver, BC, Canada, March 28-31, 2011. Proceedings; 01/2011
  • Conference Paper: Range LCP.
    Algorithms and Computation - 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5-8, 2011. Proceedings; 01/2011
  • ABSTRACT: The problem of finding the period of a vector V is central to many applications. Let V′ be a periodic vector closest to V under some metric. We seek this V′, or more precisely we seek the smallest period that generates V′. In this paper we consider the problem of finding the closest periodic vector in L_p spaces. The measures of “closeness” that we consider are the metrics in the different L_p spaces. Specifically, we consider the L_1, L_2 and L_∞ metrics. In particular, for a given n-dimensional vector V, we develop O(n^2) time algorithms (a different algorithm for each metric) that construct the smallest period that defines such a periodic n-dimensional vector V′. We call that vector the closest periodic vector of V under the appropriate metric. We also show (three) Õ(n) time constant approximation algorithms for the period of the approximate closest periodic vector.
    Algorithms and Computation - 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5-8, 2011. Proceedings; 01/2011
  • Amihood Amir, Zvi Gotthilf, B. Riva Shalom
    ABSTRACT: The Shortest Common Supersequence (SCS) is the problem of seeking a shortest possible sequence that contains each of the input sequences as a subsequence. In this paper we consider applying the problem to Position Weight Matrices (PWM). The Position Weight Matrix was introduced as a tool to handle a set of sequences that are not identical, yet, have many local similarities. Such a weighted sequence is a 'statistical image' of this set where we are given the probability of every symbol's occurrence at every text location. We consider two possible definitions of SCS on PWM. For the first, we give a polynomial time algorithm, having two input sequences. For the second, we prove NP-hardness.
    String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings; 01/2011
  • ABSTRACT: The Square Tiling Problem was recently introduced as equivalent to the problem of reconstructing an image from patches and a possible general-purpose indexing tool. Unfortunately, the Square Tiling Problem was shown to be $\mathcal{NP}$-hard. A 1/2-approximation is known. We show that if the tile alphabet is fixed and finite, there is a Polynomial Time Approximation Scheme (PTAS) for the Square Tiling Problem with approximation ratio of $(1-\frac{\epsilon}{2\log n})$ for any given $\epsilon \le 1$.
    10/2010: pages 118-126;
  • Amihood Amir, Avivit Levy
    ABSTRACT: A basic assumption in traditional pattern matching is that the order of the elements in the given input strings is correct, while the description of the content, i.e. the description of the elements, may be erroneous. Motivated by questions that arise in Text Editing, Computational Biology, Bit Torrent and Video on Demand, and Computer Architecture, a new pattern matching paradigm was recently proposed by [2]. In this model, the pattern content remains intact, but the relative positions may change. Several papers followed the initial definition of the new paradigm. Each paper revealed new aspects in the world of string rearrangement metrics. This new unified view has already proven itself by enabling the solution of an open problem of the mathematician Cayley from 1849. It also gave better insight to problems that were already studied in different and limited situations, such as the behavior of different cost functions, and enabled deriving results for cost functions that were not yet sufficiently analyzed by previous research. At this stage, a general understanding of this new model is beginning to coalesce. The aim of this survey is to present an overview of this recent new direction of research, the problems, the methodologies, and the state-of-the-art.
    08/2010: pages 1-33;
  • ABSTRACT: A string $S \in \Sigma^m$ can be viewed as a set of pairs $\{(s_i, i) \mid s_i \in S,\ i \in \{0,\ldots,m-1\}\}$. We follow the recent work on pattern matching with address errors and consider approximate pattern matching problems arising from the setting where errors are introduced to the location component ($i$), rather than the more traditional setting, where errors are introduced to the content itself ($s_i$). Specifically, we continue the work on string matching in the presence of address bit errors. In this paper, we consider the case where bits of $i$ may be stuck, either in a consistent or transient manner. We formally define the corresponding approximate pattern matching problems, and provide efficient algorithms for their resolution.
    05/2010: pages 395-405;
  • ABSTRACT: Assume that a natural cyclic phenomenon has been measured, but the data is corrupted by errors. The type of corruption is application-dependent and may be caused by measurement errors, or natural features of the phenomenon. This paper studies the problem of recovering the correct cycle from data corrupted by various error models, formally defined as the period recovery problem. Specifically, we define a metric property which we call pseudo-locality and study the period recovery problem under pseudo-local metrics. Examples of pseudo-local metrics are the Hamming distance, the swap distance, and the interchange (or Cayley) distance. We show that for pseudo-local metrics, periodicity is a powerful property allowing detecting the original cycle and correcting the data, under suitable conditions. Some surprising features of our algorithm are that we can efficiently identify the period in the corrupted data, up to a number of possibilities logarithmic in the length of the data string, even for metrics whose calculation is $\mathcal{NP}$-hard. For the Hamming metric we can reconstruct the corrupted data in near linear time even for unbounded alphabets. This result is achieved using the property of separation in the self-convolution vector and Reed-Solomon codes. Finally, we employ our techniques beyond the scope of pseudo-local metrics and give a recovery algorithm for the non pseudo-local Levenshtein edit metric.
    Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I; 01/2010
  • Amihood Amir, Laxmi Parida
    ABSTRACT: We present a novel technique, suitable for bit-parallelism, for representing both the nondeterministic automaton and the nondeterministic suffix automaton of a given string in a more compact way. Our approach is based on a particular factorization of ...
    Information and Computation. 01/2010; 213:1.
  • Conference Paper: Approximate Periodicity.
    ABSTRACT: We consider the question of finding an approximate period in a given string S of length n. Let S′ be a periodic string closest to S under some distance metric. We consider this distance the error of the periodic string, and seek the smallest period that generates a string with this distance to S. In this paper we consider the Hamming and swap distance metrics. In particular, if S is the given string, and S′ is the closest periodic string to S under the Hamming distance, and if that distance is k, we develop an O(nk log log n) algorithm that constructs the smallest period that defines such a periodic string S′. We call that string the approximate period of S under the Hamming distance. We further develop an O(n^2) algorithm that constructs the approximate period under the swap distance. Finally, we show an O(n log n) algorithm for finite alphabets, and an O(n log^3 n) algorithm for infinite alphabets, that approximates the number of mismatches in the approximate period of the string.
    Algorithms and Computation - 21st International Symposium, ISAAC 2010, Jeju Island, Korea, December 15-17, 2010, Proceedings, Part I; 01/2010
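
The notion of an approximate period under the Hamming distance, as described in the last abstract above, can be illustrated with a brute-force sketch. This is not the O(nk log log n) algorithm from the paper; it is a quadratic-time illustration only. The function names are hypothetical, and restricting candidate period lengths to at most n/2 is an assumption about what counts as "periodic" here. For each candidate length p, the closest p-periodic string is obtained by taking the most frequent symbol in each residue class modulo p.

```python
from collections import Counter
from typing import Tuple


def hamming_error_for_period(s: str, p: int) -> int:
    """Fewest substitutions needed to make s periodic with period length p."""
    error = 0
    for r in range(p):
        column = s[r::p]  # symbols at positions congruent to r modulo p
        most_common_count = Counter(column).most_common(1)[0][1]
        # Every symbol that disagrees with the majority must be changed.
        error += len(column) - most_common_count
    return error


def approximate_period(s: str) -> Tuple[int, int]:
    """Return (period_length, error) with minimum error.

    Ties are broken toward the smaller period length.
    Assumption: only period lengths 1..len(s)//2 are considered.
    """
    n = len(s)
    if n < 2:
        raise ValueError("need a string of length at least 2")
    best_p = min(
        range(1, n // 2 + 1),
        key=lambda p: (hamming_error_for_period(s, p), p),
    )
    return best_p, hamming_error_for_period(s, best_p)
```

Under these assumptions, approximate_period("abcabcabxabc") gives period length 3 with one mismatch: the single "x" is the only symbol that disagrees with the repeating "abc".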

Publication Stats

2k Citations
23.52 Total Impact Points

Institutions

  • 1970–2014
    • Bar Ilan University
      • Department of Computer Science
      Ramat Gan, Tel Aviv, Israel
  • 2006–2011
    • Johns Hopkins University
      • Department of Computer Science
      Baltimore, Maryland, United States
  • 2003
    • CUNY Graduate Center
      New York City, New York, United States
  • 2001–2002
    • AT&T Labs
      Austin, Texas, United States
  • 1997
    • Georgia Tech Research Institute
      Atlanta, Georgia, United States
  • 1992–1997
    • Georgia Institute of Technology
      • College of Computing
      Atlanta, Georgia, United States
    • Rutgers, The State University of New Jersey
      • Department of Computer Science
      New Brunswick, New Jersey, United States
  • 1994
    • King's College London
      London, England, United Kingdom
  • 1988–1991
    • University of Maryland, College Park
      • Institute for Advanced Computer Studies
      • Department of Computer Science
      Maryland, United States