Diego Arroyuelo

Universidad Técnica Federico Santa María, Valparaíso, Chile


Publications (23)

  • ABSTRACT: Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of on-line queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable on-line search services based on suffix arrays.
    Parallel Computing. 01/2014;
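    As a point of reference for the operation the paper distributes, here is a minimal single-machine sketch (Python; all names illustrative, and the naive sort-based construction is for exposition only). The paper's actual contributions, distributed deployment and multi-query scheduling, are not attempted here.

      # Illustrative sketch, not the paper's distributed algorithm.
      def build_suffix_array(text):
          """Suffix start positions in lexicographic order (naive O(n^2 log n) build)."""
          return sorted(range(len(text)), key=lambda i: text[i:])

      def locate(text, sa, pattern):
          """All occurrence positions of pattern, via two binary searches on sa."""
          m = len(pattern)
          lo, hi = 0, len(sa)
          while lo < hi:                      # leftmost suffix with prefix >= pattern
              mid = (lo + hi) // 2
              if text[sa[mid]:sa[mid] + m] < pattern:
                  lo = mid + 1
              else:
                  hi = mid
          left, hi = lo, len(sa)
          while lo < hi:                      # leftmost suffix with prefix > pattern
              mid = (lo + hi) // 2
              if text[sa[mid]:sa[mid] + m] <= pattern:
                  lo = mid + 1
              else:
                  hi = mid
          return sorted(sa[i] for i in range(left, lo))

      text = "mississippi"
      print(locate(text, build_suffix_array(text), "issi"))   # [1, 4]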
  • ABSTRACT: Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted index. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent investigations also study how an optimized document ordering can be used to assign document identifiers (docIDs) to the document database. This yields important improvements in index compression and query processing time. In this paper we follow this line of research, yet from a different perspective. We propose a docID reassignment method that allows one to focus on a given subset of inverted lists to improve their performance. We then use run-length encoding to compress these lists (as many consecutive 1s are generated). We show that this approach improves not only the performance of the targeted subset of inverted lists, but also that of the whole inverted index. Our experimental results indicate a reduction of about 10% in the space usage of the whole index when docID reassignment is focused on such a subset. Also, decompression speed is up to 1.22 times faster if the runs must be explicitly decompressed, and up to 4.58 times faster if implicit decompression of runs is allowed. Finally, we also improve the Document-at-a-Time query processing time of AND queries (by up to 12%), WAND queries (by up to 23%) and full (non-ranked) OR queries (by up to 86%).
    Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval; 07/2013
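    A toy illustration of the compression idea: once reassignment clusters the docIDs of a list into consecutive runs, the d-gaps become runs of 1s that run-length encoding stores in constant space per run. The (gap, run-length) output format is an assumption for exposition, not the paper's encoding.

      # Hedged sketch: delta-encode a sorted docID list, then
      # run-length encode the gaps. Consecutive docIDs yield runs of gap 1.
      def rle_gaps(postings):
          gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
          runs, i = [], 0
          while i < len(gaps):
              j = i
              while j < len(gaps) and gaps[j] == gaps[i]:
                  j += 1
              runs.append((gaps[i], j - i))   # (gap value, run length)
              i = j
          return runs

      # A clustered list compresses to few runs:
      print(rle_gaps([7, 8, 9, 10, 42, 43, 44]))  # [(7, 1), (1, 3), (32, 1), (1, 2)]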
  • ABSTRACT: Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, based on a combination of self-indexed compressed text and posting-list caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that, within the space of the compressed document collection, one can carry out posting-list generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and the corresponding documents as two separate entities in terms of processors and memory space.
    Information Processing and Management - IPM. 09/2012;
  • ABSTRACT: Positional ranking functions, widely used in Web search engines, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic non-positional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time-space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index-based alternatives for positional data. We aim to answer the question of whether one should index positional data or not. We show that there is a wide range of practical time-space trade-offs. Moreover, we show that both positional and textual data can be stored using about 71% of the space used by traditional positional indexes, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact on query processing time.
    01/2012;
  • ABSTRACT: Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m^3 log σ + (m + occ) log u) worst-case time. Although this index has proven very competitive in practice, the O(m^3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2 + ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m^2 + (m + occ) log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length ℓ in optimal O(ℓ / log_σ u) time. In addition, we show how the space can be squeezed to (1 + ε)uH_k(T) + o(u log σ) to obtain a structure with O(m^2) average search time for m ≥ 2 log_σ u. Alternatively, the search time of LZ-indices can be improved to O((m + occ) log u) with (3 + ε)uH_k(T) + o(u log σ) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.
    Algorithmica 01/2012; 62:54-101. · 0.49 Impact Factor
  • Diego Arroyuelo, Gonzalo Navarro
    ABSTRACT: A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv–Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87–114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T[1..u] over an alphabet of size σ, the LZ-index requires 4|LZ|(1 + o(1)) bits of space, where |LZ| is the size of the LZ78-compression of T. This can be bounded by |LZ| = uH_k(T) + o(u log σ), where H_k(T) is the k-th order empirical entropy of T, for any k = o(log_σ u). The LZ-index is built in O(u log σ) time, yet requiring O(u log u) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0-1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in O(u(log σ + log log u)) time and requiring 4|LZ|(1 + o(1)) bits of main memory, that is, asymptotically the same space as the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times |LZ|(1 + o(1)) bits, and show that these can also be built using asymptotically the same space as the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in O(u(log σ + log log u)) time, and within |LZ|(1 + o(1)) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.
    Inf. Comput. 01/2011; 209:1070-1102.
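    For intuition, a minimal LZ78 parser (the compression underlying the LZ-index). The paper's actual contribution, building the index within asymptotically the space of this parse, is not attempted in this sketch.

      # Illustrative LZ78 parse: each phrase extends a previous
      # phrase by one character; phrase 0 is the empty phrase.
      def lz78_parse(text):
          trie = {}                    # (parent phrase id, char) -> phrase id
          phrases, current, next_id = [], 0, 1
          for c in text:
              if (current, c) in trie:         # keep extending a known phrase
                  current = trie[(current, c)]
              else:                            # emit new phrase = parent + c
                  trie[(current, c)] = next_id
                  phrases.append((current, c))
                  next_id, current = next_id + 1, 0
          if current:                          # flush a trailing partial phrase
              phrases.append((current, ""))
          return phrases

      print(lz78_parse("abababab"))
      # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]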
  • ABSTRACT: We present the first adaptive data structure for two-dimensional orthogonal range search. Our data structure is adaptive in the sense that it gives improved search performance for data that is better than the worst case (Demaine et al., 2000) [8]; in this case, data with more inherent sortedness. Given n points on the plane, the linear space data structure can answer range queries in O(log n + k + m) time, where m is the number of points in the output and k is the minimum number of monotonic chains into which the point set can be decomposed, which is O(√n) in the worst case. Our result matches the worst-case performance of other optimal-time linear space data structures, or surpasses them when k = o(√n). Our data structure can be made implicit, requiring no extra space beyond that of the data points themselves (Munro and Suwanda, 1980) [16], in which case the query time becomes O(k log n + m). We also present a novel algorithm of independent interest to decompose a point set into a minimum number of untangled, similarly directed monotonic chains in O(k^2 n + n log n) time.
    Theoretical Computer Science. 01/2011; 412(32):4200-4211.
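    A simplified illustration of one ingredient, not the paper's algorithm: sweeping points by x and greedily packing y-values into a minimum number of non-decreasing chains, patience-sorting style. The paper's chains must additionally be untangled and may run in any of the four monotone directions.

      # Greedy chain partition: extend the chain with the largest
      # tail <= y, else open a new chain; tails stays sorted throughout.
      from bisect import bisect_right

      def monotonic_chains(points):
          chains, tails = [], []       # chains[i] currently ends at y = tails[i]
          for p in sorted(points):     # sweep by x-coordinate
              j = bisect_right(tails, p[1]) - 1
              if j < 0:                # no chain can take p: open a new one
                  chains.insert(0, [p])
                  tails.insert(0, p[1])
              else:
                  chains[j].append(p)
                  tails[j] = p[1]
          return chains

      pts = [(1, 5), (2, 1), (3, 6), (4, 2), (5, 7)]
      print(monotonic_chains(pts))  # [[(2, 1), (4, 2)], [(1, 5), (3, 6), (5, 7)]]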
  • ABSTRACT: We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in nH_0(T) + o(n)(H_0(T) + 1) bits of space, such that a conjunctive query t_1 ∧ ··· ∧ t_k can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H_0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(kδ log(n_M/δ)) time, where n_m and n_M are the lengths of the shortest and longest occurrence lists, respectively, among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M/δ is ω(log |Σ|).
    String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings; 01/2010
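    For comparison purposes, a sketch of the doubling-search ("galloping") intersection in the spirit of the Barbay-Kenyon algorithm mentioned above; the paper itself answers the query over a compressed sequence representation rather than over explicit inverted lists.

      # Adaptive two-list intersection. Runtime adapts to how far the
      # pointers must jump, not to the full list lengths.
      from bisect import bisect_left

      def gallop(lst, target, lo):
          """Smallest index >= lo with lst[index] >= target."""
          step = 1
          while lo + step < len(lst) and lst[lo + step] < target:
              step *= 2                # probe exponentially growing offsets
          return bisect_left(lst, target, lo, min(lo + step + 1, len(lst)))

      def intersect(a, b):
          out, i, j = [], 0, 0
          while i < len(a) and j < len(b):
              if a[i] == b[j]:
                  out.append(a[i]); i += 1; j += 1
              elif a[i] < b[j]:
                  i = gallop(a, b[j], i)
              else:
                  j = gallop(b, a[i], j)
          return out

      print(intersect([1, 3, 7, 900, 901], [2, 900, 901]))  # [900, 901]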
  • Diego Arroyuelo, Gonzalo Navarro
    ABSTRACT: Given a text T[1..n] over an alphabet of size σ, the full-text search problem consists in locating the occ occurrences of a given pattern P[1..m] in T. Compressed full-text self-indices are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4nH_k(T) + o(n log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T). In practice, the average locating complexity of the LZ-index is O(σ^m log_σ n + occ·σ^{m/2}), where occ is the number of occurrences of P. It can extract text substrings of length ℓ in O(ℓ) time. This index outperforms competing schemes both to locate short patterns and to extract text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use nH_k(T) + o(n log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1 + ε)nH_k(T) + o(n log σ) bits of space, for any 0 < ε < 1. They have an average locating time of O((1/ε)(m log n + occ·σ^{m/2})), while extracting takes O(ℓ) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index by a factor of 2/3, that is, to around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.
    Journal of Experimental Algorithmics 01/2010; 15:1.5.
  • ABSTRACT: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
    Software Practice and Experience 01/2010; · 1.01 Impact Factor
  • ABSTRACT: We implement and compare the major current techniques for representing general trees in succinct form. This is important because a general tree of n nodes is usually represented in pointer form, requiring O(n log n) bits, whereas the succinct representations we study require just 2n + o(n) bits and carry out many sophisticated operations in constant time. Yet, there is no exhaustive study in the literature comparing the practical magnitudes of the o(n)-space and the O(1)-time terms. The techniques can be classified into three broad trends: those based on BP (balanced parentheses in preorder), those based on DFUDS (depth-first unary degree sequence), and those based on LOUDS (level-ordered unary degree sequence). BP and DFUDS require a balanced parentheses representation that supports the core operations findopen, findclose, and enclose, for which we implement and compare three major algorithmic proposals. All the tree representations also require the core operations rank and select on bitmaps, which are already well studied in the literature. We show how to predict the time and space performance of most variants by combining these core operations, and also study some tree operations for which specialized implementations exist. This is especially relevant for a recent proposal (K. Sadakane and G. Navarro, SODA'10) which, although belonging to class BP, deviates from the main techniques in some cases in order to achieve constant time for the widest range of operations. We experiment over various types of real-life trees and traversals, and conclude that the latter technique stands out as an excellent practical combination of space occupancy, time performance, and functionality, whereas others, particularly LOUDS, are still interesting in some limited-functionality niches.
    Proceedings of the Twelfth Workshop on Algorithm Engineering and Experiments, ALENEX 2010, Austin, Texas, USA, January 16, 2010; 01/2010
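    A minimal LOUDS encoder as a taste of one of the representations compared: each node, visited in level order, writes its degree in unary, for 2n + 1 bits overall. The linear scans below stand in for the constant-time rank/select machinery whose practical costs the paper measures.

      # LOUDS sketch: '1' * degree + '0' per node, level by level,
      # preceded by a virtual super-root ("10") that points to the real root.
      from collections import deque

      def louds(tree, root=0):
          bits, q = "10", deque([root])
          while q:
              v = q.popleft()
              kids = tree.get(v, [])
              bits += "1" * len(kids) + "0"
              q.extend(kids)
          return bits

      def degree(bits, i):
          """Degree of the i-th node in level order (i >= 1): the unary run
          after the i-th '0'. O(n) scan here; rank/select makes it O(1)."""
          zeros = pos = 0
          while zeros < i:
              zeros += bits[pos] == "0"
              pos += 1
          d = 0
          while bits[pos + d] == "1":
              d += 1
          return d

      t = {0: [1, 2], 1: [3, 4]}
      b = louds(t)
      print(b, degree(b, 1), degree(b, 3))  # 10110110000 2 0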
  • ABSTRACT: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced; it stores the tree structure of an XML document using a bit array of opening and closing brackets, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
    CoRR. 01/2009; abs/0907.2089.
  • ABSTRACT: We present the first adaptive data structure for two-dimensional orthogonal range search. Our data structure is adaptive in the sense that it gives improved search performance for data with more inherent sortedness. Given n points on the plane, the linear-space data structure can answer range queries in O(log n + k + m) time, where m is the number of points in the output and k is the minimum number of monotonic chains into which the point set can be decomposed, which is O(√n) in the worst case. Our result matches the worst-case performance of other optimal-time linear-space data structures, or surpasses them when k = o(√n). Our data structure can also be made implicit, requiring no extra space beyond that of the data points themselves, in which case the query time becomes O(k log n + m). We present a novel algorithm of independent interest to decompose a point set into a minimum number of untangled, same-direction monotonic chains in O(kn + n log n) time.
    Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings; 01/2009
  • Diego Arroyuelo
    ABSTRACT: k-ary trees are a fundamental data structure in many text-processing algorithms (e.g., text searching). The traditional pointer-based representation of trees is space consuming, and hence only relatively small trees can be kept in main memory. Nowadays, however, many applications need to store a huge amount of information. In this paper we present a succinct representation for dynamic k-ary trees of n nodes, requiring 2n + n log k + o(n log k) bits of space, which is close to the information-theoretic lower bound. Unlike alternative representations where the operations on the tree can usually be computed in O(log n) time, our data structure is able to take advantage of asymptotically smaller values of k, supporting the basic operations parent and child in O(log k + log log n) time, which is o(log n) time whenever log k = o(log n). Insertions and deletions of leaves in the tree are supported in O((log k + log log n)(1 + log k / log(log k + log log n))) amortized time. Our representation also supports more specialized operations (like subtreesize, depth, etc.), and provides a new trade-off when k = O(1), allowing faster updates (in O(log log n) amortized time, versus the amortized time of O((log log n)^{1+ε}), for ε > 0, from Raman and Rao [21]), at the cost of slower basic operations (in O(log log n) time, versus O(1) time of [21]).
    Combinatorial Pattern Matching, 19th Annual Symposium, CPM 2008, Pisa, Italy, June 18-20, 2008, Proceedings; 01/2008
  • Diego Arroyuelo, Gonzalo Navarro
    ABSTRACT: Full-text searching consists in locating the occurrences of a given pattern P(1..m) in a text T(1..u), both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size including the text, which means 39%-65% the size of traditional indexes like String B-trees (Ferragina and Grossi, JACM 1999). In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length.
    Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007, Proceedings; 01/2007
  • Diego Arroyuelo, Gonzalo Navarro
    ABSTRACT: Given a text T(1..u) over an alphabet of size σ = O(polylog(u)) and with k-th order empirical entropy H_k(T), we propose a new compressed full-text self-index based on the Lempel-Ziv (LZ) compression algorithm, which replaces T with a representation requiring about three times the size of the compressed text, i.e., (3 + ε)uH_k(T) + o(u log σ) bits, for any ε > 0 and k = o(log_σ u), and in addition gives indexed access to T: it is able to locate the occ occurrences of a pattern P(1..m) in the text in O((m + occ) log u) time. Our index is smaller than the existing indices that achieve this locating time complexity, and locates the occurrences faster than the smaller indices. Furthermore, our index is able to count the pattern occurrences in O(m) time, and it can extract any text substring of length ℓ in optimal O(ℓ / log_σ u) time. Overall, our index appears as a very attractive alternative for space-efficient indexed text searching.
    01/2007;
  • ABSTRACT: The LZ-index is a compressed full-text self-index able to represent a text T[1..u], over an alphabet of size σ = O(polylog(u)) and with k-th order empirical entropy H_k(T), using 4uH_k(T) + o(u log σ) bits for any k = o(log_σ u). It can report all the occ occurrences of a pattern P[1..m] in T in O(m^3 log σ + (m + occ) log u) worst-case time. Its main drawback is the factor 4 in its space complexity, which makes it larger than other state-of-the-art alternatives. In this paper we present two different approaches to reduce the space requirement of the LZ-index. In both cases we achieve (2 + ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, and we simultaneously improve the search time to O(m^2 log m + (m + occ) log u). Both indexes support displaying any subtext of length ℓ in optimal O(ℓ / log_σ u) time. In addition, we show how the space can be squeezed to (1 + ε)uH_k(T) + o(u log σ) to obtain a structure with O(m^2) average search time for m ≥ 2 log_σ u.
    06/2006: pages 318-329;
  • Diego Arroyuelo, Gonzalo Navarro
    ABSTRACT: A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires 4uH_k(1 + o(1)) bits of space, where u is the text length in characters and H_k is its k-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct the LZ-index, requiring (4 + ε)uH_k + o(u) bits of space, for any constant 0 < ε < 1, and O(σu) time, where σ is the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index.
    01/2005;
  • Diego Arroyuelo, Gonzalo Navarro, Nora Reyes
    ABSTRACT: Hybrid dynamic spatial approximation trees are recently proposed data structures for searching in metric spaces, based on combining the concepts of spatial approximation and pivot-based algorithms. These data structures are hybrid schemes, with the full features of dynamic spatial approximation trees and able to use the available memory to improve the query time. It has been shown that they compare favorably against alternative data structures in spaces of medium difficulty. In this paper we complete and improve hybrid dynamic spatial approximation trees, by presenting a new search alternative, an algorithm to remove objects from the tree, and an improved way of managing the available memory. The result is a fully dynamic and optimized data structure for similarity searching in metric spaces.
    10/2003;
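    To illustrate the pivot-based ingredient being combined (helper names here are hypothetical, not the paper's interface): by the triangle inequality, |d(q, p) - d(x, p)| > r for some pivot p proves that x lies outside the radius-r ball around query q, so d(q, x) never needs to be computed.

      # Pivot filtering sketch: precomputed[x][i] = dist(x, pivots[i]).
      def range_search(objects, pivots, dist, precomputed, q, r):
          dq = [dist(q, p) for p in pivots]    # one distance per pivot
          hits = []
          for x in objects:
              if any(abs(dq[i] - precomputed[x][i]) > r
                     for i in range(len(pivots))):
                  continue                     # ruled out without touching x
              if dist(q, x) <= r:              # survivors are verified directly
                  hits.append(x)
          return hits

      # Toy metric space: numbers with absolute difference as the distance.
      objs, piv = [1, 4, 9, 16, 25], [0]
      pre = {x: [abs(x - p) for p in piv] for x in objs}
      print(range_search(objs, piv, lambda a, b: abs(a - b), pre, 10, 3))  # [9]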

Publication Stats

146 Citations
1.50 Total Impact Points

Institutions

  • 2010–2014
    • Universidad Técnica Federico Santa María
      • Department of Computer Science
Valparaíso, Chile
  • 2005–2010
    • University of Santiago, Chile
Santiago, Chile
  • 2003
    • Universidad Nacional de San Luis
      • Department of Computer Science
      San Luis, San Luis, Argentina