Publications (23) · 1.5 Total Impact
ABSTRACT: Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of online queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable online search services based on suffix arrays. Parallel Computing. 01/2014;
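The sequential suffix-array search that such services build on can be sketched as follows. This is an illustrative minimal version (quadratic-time construction, function names our own), not the paper's distributed strategy:

```python
# Minimal sketch: locating a pattern with a suffix array via two binary
# searches. Real systems use linear-time suffix array construction.
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    # Naive O(n^2 log n) construction, for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, pattern):
    # Suffixes starting with `pattern` form a contiguous range in `sa`,
    # so their length-m prefixes are non-decreasing and bisect applies.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])  # starting positions of all occurrences

text = "abracadabra"
sa = build_suffix_array(text)
print(search(text, sa, "abra"))  # [0, 7]
```

Each query costs two binary searches, which is why per-query work is cheap and the hard part, as the abstract notes, is scheduling many such queries across distributed memory.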
Conference Paper: Document identifier reassignment and run-length-compressed inverted indexes for improved search performance
ABSTRACT: Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted index. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent investigations also study how an optimized document ordering can be used to assign document identifiers (docIDs) to the document database. This yields important improvements in index compression and query processing time. In this paper we follow this line of research, yet from a different perspective. We propose a docID reassignment method that allows one to focus on a given subset of inverted lists to improve their performance. We then use run-length encoding to compress these lists (as many consecutive 1s are generated). We show that by using this approach, not only is the performance of the particular subset of inverted lists improved, but also that of the whole inverted index. Our experimental results indicate a reduction of about 10% in the space usage of the whole index on which the docID reassignment was focused. Also, decompression speed is up to 1.22 times faster if the runs must be explicitly decompressed, and up to 4.58 times faster if implicit decompression of runs is allowed. Finally, we also improve the Document-at-a-Time query processing time of AND queries (by up to 12%), WAND queries (by up to 23%) and full (non-ranked) OR queries (by up to 86%). Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval; 07/2013
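The compression step described above can be illustrated with a small toy sketch (our own code, not the paper's method): once reassignment makes the docIDs of a list consecutive, its d-gaps collapse into runs of 1s that run-length encoding captures cheaply.

```python
# Toy illustration: d-gap encoding of a posting list followed by
# run-length encoding of the gap sequence.

def gaps(postings):
    # d-gap encoding: first docID, then differences between neighbors.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(seq):
    # Collapse repeated values into [value, count] pairs.
    out = []
    for x in seq:
        if out and out[-1][0] == x:
            out[-1][1] += 1
        else:
            out.append([x, 1])
    return out

# After reassignment, docIDs 4..9 of this list are consecutive:
postings = [4, 5, 6, 7, 8, 9, 20]
print(run_length(gaps(postings)))  # [[4, 1], [1, 5], [11, 1]]
```

Seven gaps compress to three pairs here; the longer the runs of consecutive docIDs, the bigger the saving, which is what the reassignment aims to maximize.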
ABSTRACT: Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting list caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out posting list generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space. Information Processing and Management (IPM). 09/2012;
Article: To index or not to index: time-space trade-offs in search engines with positional ranking functions
ABSTRACT: Positional ranking functions, widely used in Web search engines, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic non-positional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time-space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index-based alternatives for positional data. We aim to answer the question of whether one should index positional data or not. We show that there is a wide range of practical time-space trade-offs. Moreover, we show that both positional and textual data can be stored using about 71% of the space used by traditional positional indexes, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact on query processing time. 01/2012;
ABSTRACT: Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m^3 log σ + (m + occ) log u) worst-case time. Although this index has proven very competitive in practice, the O(m^3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m^2 + (m + occ) log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length ℓ in optimal O(ℓ / log_σ u) time. In addition, we show how the space can be squeezed to (1+ε)uH_k(T) + o(u log σ) to obtain a structure with O(m^2) average search time for m ≥ 2 log_σ u.
Alternatively, the search time of LZ-indices can be improved to O((m + occ) log u) with (3+ε)uH_k(T) + o(u log σ) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall, our indices stand out as a very attractive alternative for space-efficient indexed text searching. Algorithmica 01/2012; 62:54-101. · 0.49 Impact Factor
ABSTRACT: A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv–Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87–114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T[1..u] over an alphabet of size σ, the LZ-index requires 4LZ(1+o(1)) bits of space, where LZ is the size of the LZ78 compression of T. This can be bounded by LZ = uH_k(T) + o(u log σ), where H_k(T) is the k-th order empirical entropy of T, for any k = o(log_σ u). The LZ-index is built in O(u log σ) time, yet requiring O(u log u) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0–1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in O(u(log σ + log log u)) time and requiring 4LZ(1+o(1)) bits of main memory, that is, asymptotically the same space as the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times LZ(1+o(1)) bits, and show that these can also be built using asymptotically the same space as the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in O(u(log σ + log log u)) time, and within LZ(1+o(1)) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text.
Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes. Inf. Comput. 01/2011; 209:1070-1102.
ABSTRACT: We present the first adaptive data structure for two-dimensional orthogonal range search. Our data structure is adaptive in the sense that it gives improved search performance for data that is better than the worst case (Demaine et al., 2000) [8]; in this case, data with more inherent sortedness. Given n points on the plane, the linear-space data structure can answer range queries in O(log n + k + m) time, where m is the number of points in the output and k is the minimum number of monotonic chains into which the point set can be decomposed, which is O(√n) in the worst case. Our result matches the worst-case performance of other optimal-time linear-space data structures, or surpasses them when k = o(√n). Our data structure can be made implicit, requiring no extra space beyond that of the data points themselves (Munro and Suwanda, 1980) [16], in which case the query time becomes O(k log n + m). We also present a novel algorithm of independent interest to decompose a point set into a minimum number of untangled, similarly directed monotonic chains in O(k^2 n + n log n) time. Theoretical Computer Science. 01/2011; 412(32):4200-4211.
ABSTRACT: We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in nH_0(T) + o(n)(H_0(T) + 1) bits of space, such that a conjunctive query t_1 ∧ ··· ∧ t_k can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H_0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(k log(n_M/n_m)) time, where n_m and n_M are the lengths of the shortest and longest occurrence lists, respectively, among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indexes when the ratio n_M/n_m is ω(log |Σ|). String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings; 01/2010
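The Barbay-Kenyon style of adaptive intersection referred to above can be sketched roughly as follows. This is an illustrative eliminator-and-galloping version with our own naming, not the paper's succinct encoding:

```python
# Adaptive conjunctive query sketch: gallop (exponential probe + binary
# search) through each sorted posting list for the current candidate, so
# easy instances cost far less than a full k-way merge.
from bisect import bisect_left

def gallop(lst, target, lo):
    step = 1
    while lo + step < len(lst) and lst[lo + step] < target:
        step *= 2
    return bisect_left(lst, target, lo, min(lo + step + 1, len(lst)))

def adaptive_intersect(lists):
    k = len(lists)
    if any(not l for l in lists):
        return []
    if k == 1:
        return list(lists[0])
    pos, result = [0] * k, []
    e, matches, i = lists[0][0], 1, 1   # e is the current "eliminator"
    while True:
        pos[i] = gallop(lists[i], e, pos[i])
        if pos[i] == len(lists[i]):
            return result                # some list is exhausted
        if lists[i][pos[i]] == e:
            matches += 1
            if matches == k:             # e occurs in every list
                result.append(e)
                pos[i] += 1
                if pos[i] == len(lists[i]):
                    return result
                e, matches = lists[i][pos[i]], 1
        else:
            e, matches = lists[i][pos[i]], 1  # new, larger candidate
        i = (i + 1) % k

print(adaptive_intersect([[1, 3, 5, 7], [3, 4, 5, 9], [2, 3, 5, 10]]))  # [3, 5]
```

When the lists interleave little, galloping skips long stretches in logarithmic time, which is the source of the instance-difficulty-dependent bounds.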
Article: Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices
ABSTRACT: Given a text T[1..n] over an alphabet of size σ, the full-text search problem consists in locating the occ occurrences of a given pattern P[1..m] in T. Compressed full-text self-indices are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4nH_k(T) + o(n log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T). In practice, the average locating complexity of the LZ-index is O(σ^m log_σ n + occ σ^{m/2}), where occ is the number of occurrences of P. It can extract text substrings of length ℓ in O(ℓ) time. This index outperforms competing schemes both to locate short patterns and to extract text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use nH_k(T) + o(n log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1+ε)nH_k(T) + o(n log σ) bits of space, for any 0 < ε < 1. They have an average locating time of O((1/ε)(m log n + occ σ^{m/2})), while extracting takes O(ℓ) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index to about 2/3 of its size, that is, around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more. Journal of Experimental Algorithmics 01/2010; 15:1.5.
ABSTRACT: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries. Software: Practice and Experience 01/2010; · 1.01 Impact Factor
Conference Paper: Succinct Trees in Practice.
ABSTRACT: We implement and compare the major current techniques for representing general trees in succinct form. This is important because a general tree of n nodes is usually represented in pointer form, requiring O(n log n) bits, whereas the succinct representations we study require just 2n + o(n) bits and carry out many sophisticated operations in constant time. Yet, there is no exhaustive study in the literature comparing the practical magnitudes of the o(n)-space and the O(1)-time terms. The techniques can be classified into three broad trends: those based on BP (balanced parentheses in preorder), those based on DFUDS (depth-first unary degree sequence), and those based on LOUDS (level-ordered unary degree sequence). BP and DFUDS require a balanced parentheses representation that supports the core operations findopen, findclose, and enclose, for which we implement and compare three major algorithmic proposals. All the tree representations also require core operations rank and select on bitmaps, which are already well studied in the literature. We show how to predict the time and space performance of most variants by combining these core operations, and also study some tree operations for which specialized implementations exist. This is especially relevant for a recent proposal (K. Sadakane and G. Navarro, SODA'10) which, although belonging to class BP, deviates from the main techniques in some cases in order to achieve constant time for the widest range of operations. We experiment over various types of real-life trees and of traversals, and conclude that the latter technique stands out as an excellent practical combination of space occupancy, time performance, and functionality, whereas others, particularly LOUDS, are still interesting in some limited-functionality niches. Proceedings of the Twelfth Workshop on Algorithm Engineering and Experiments, ALENEX 2010, Austin, Texas, USA, January 16, 2010; 01/2010
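As an illustration of the LOUDS trend mentioned above, here is a minimal sketch of the level-ordered unary degree sequence encoding (our simplification: a plain Python string and O(n) traversal instead of the succinct rank/select machinery the paper benchmarks):

```python
# LOUDS sketch: encode each node's degree d, in level order, as "1"*d + "0".
# A tree of n nodes needs 2n+1 bits (including the super-root's "10").
from collections import deque

def louds(children):
    # children: adjacency dict of the tree, node 0 is the root.
    bits = "10"                     # super-root with one child (the root)
    order = []                      # node ids in level order
    q = deque([0])
    while q:
        v = q.popleft()
        order.append(v)
        bits += "1" * len(children[v]) + "0"
        q.extend(children[v])
    return bits, order

#        0
#       / \
#      1   2
#     / \
#    3   4
tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
bits, order = louds(tree)
print(bits)  # 10110110000
```

In a real succinct implementation, parent and i-th-child navigation on this bit string are answered in constant time by rank/select, which is exactly the machinery whose practical o(n) overhead the paper measures.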
ABSTRACT: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can efficiently be implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced; it stores the tree structure of an XML document using a bit array of opening and closing brackets, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries. CoRR. 01/2009; abs/0907.2089.
Conference Paper: Untangled Monotonic Chains and Adaptive Range Search.
ABSTRACT: We present the first adaptive data structure for two-dimensional orthogonal range search. Our data structure is adaptive in the sense that it gives improved search performance for data with more inherent sortedness. Given n points on the plane, the linear-space data structure can answer range queries in O(log n + k + m) time, where m is the number of points in the output and k is the minimum number of monotonic chains into which the point set can be decomposed, which is O(√n) in the worst case. Our result matches the worst-case performance of other optimal-time linear-space data structures, or surpasses them when k = o(√n). Our data structure can also be made implicit, requiring no extra space beyond that of the data points themselves, in which case the query time becomes O(k log n + m). We present a novel algorithm of independent interest to decompose a point set into a minimum number of untangled, same-direction monotonic chains in O(kn + n log n) time. Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings; 01/2009
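The flavor of chain decomposition can be illustrated with a simplified greedy sketch (our own code, covering only non-decreasing chains in y for points already sorted by x; the paper's algorithm additionally guarantees untangled chains):

```python
# Greedy cover of a y-sequence by a minimum number of non-decreasing
# chains: put each value on the chain with the largest tail <= value,
# opening a new chain only when no tail fits.
from bisect import bisect_right

def chain_decomposition(ys):
    tails, chains = [], []               # tails kept sorted ascending
    for y in ys:
        j = bisect_right(tails, y) - 1   # chain with largest tail <= y
        if j < 0:
            tails.insert(0, y)           # no chain fits: open a new one
            chains.insert(0, [y])
        else:
            tails[j] = y                 # extend that chain
            chains[j].append(y)
    return chains

# y-coordinates of points already sorted by x:
print(chain_decomposition([1, 4, 2, 5, 3]))  # [[2, 3], [1, 4, 5]]
```

The number of chains produced equals the length of the longest strictly decreasing subsequence (Dilworth's theorem), which is the parameter k that the query bounds above adapt to.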
Conference Paper: An Improved Succinct Representation for Dynamic k-ary Trees.
ABSTRACT: k-ary trees are a fundamental data structure in many text-processing algorithms (e.g., text searching). The traditional pointer-based representation of trees is space-consuming, and hence only relatively small trees can be kept in main memory. Nowadays, however, many applications need to store a huge amount of information. In this paper we present a succinct representation for dynamic k-ary trees of n nodes, requiring 2n + n log k + o(n log k) bits of space, which is close to the information-theoretic lower bound. Unlike alternative representations where the operations on the tree can usually be computed in O(log n) time, our data structure is able to take advantage of asymptotically smaller values of k, supporting the basic operations parent and child in O(log k + log log n) time, which is o(log n) time whenever log k = o(log n). Insertions and deletions of leaves in the tree are supported in O((log k + log log n)(1 + log k / log(log k + log log n))) amortized time. Our representation also supports more specialized operations (like subtree-size, depth, etc.), and provides a new trade-off when k = O(1), allowing faster updates (in O(log log n) amortized time, versus the amortized time of O((log log n)^{1+ε}), for ε > 0, from Raman and Rao [21]), at the cost of slower basic operations (in O(log log n) time, versus O(1) time of [21]). Combinatorial Pattern Matching, 19th Annual Symposium, CPM 2008, Pisa, Italy, June 18-20, 2008, Proceedings; 01/2008
Conference Paper: A Lempel-Ziv Text Index on Secondary Storage.
ABSTRACT: Full-text searching consists in locating the occurrences of a given pattern P(1..m) in a text T(1..u), both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size including the text, which means 39%-65% the size of traditional indexes like String B-trees (Ferragina and Grossi, JACM 1999). In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length. Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007, Proceedings; 01/2007
ABSTRACT: Given a text T(1..u) over an alphabet of size σ = O(polylog(u)) and with k-th order empirical entropy H_k(T), we propose a new compressed full-text self-index based on the Lempel-Ziv (LZ) compression algorithm, which replaces T with a representation requiring about three times the size of the compressed text, i.e., (3+ε)uH_k(T) + o(u log σ) bits, for any ε > 0 and k = o(log_σ u), and in addition gives indexed access to T: it is able to locate the occ occurrences of a pattern P(1..m) in the text in O((m + occ) log u) time. Our index is smaller than the existing indices that achieve this locating time complexity, and locates the occurrences faster than the smaller indices. Furthermore, our index is able to count the pattern occurrences in O(m) time, and it can extract any text substring of length ℓ in optimal O(ℓ / log_σ u) time. Overall, our indices appear as a very attractive alternative for space-efficient indexed text searching. 01/2007;
ABSTRACT: The LZ-index is a compressed full-text self-index able to represent a text T[1..u], over an alphabet of size σ = O(polylog(u)) and with k-th order empirical entropy H_k(T), using 4uH_k(T) + o(u log σ) bits for any k = o(log_σ u). It can report all the occ occurrences of a pattern P[1..m] in T in O(m^3 log σ + (m + occ) log u) worst-case time. Its main drawback is the factor 4 in its space complexity, which makes it larger than other state-of-the-art alternatives. In this paper we present two different approaches to reduce the space requirement of the LZ-index. In both cases we achieve (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, and we simultaneously improve the search time to O(m^2 log m + (m + occ) log u). Both indexes support displaying any subtext of length ℓ in optimal O(ℓ / log_σ u) time. In addition, we show how the space can be squeezed to (1+ε)uH_k(T) + o(u log σ) to obtain a structure with O(m^2) average search time for m ≥ 2 log_σ u. 06/2006: pages 318-329;
ABSTRACT: A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires 4uH_k(1 + o(1)) bits of space, where u is the text length in characters and H_k is its k-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct the LZ-index, requiring (4+ε)uH_k + o(u) bits of space, for any constant 0 < ε < 1, and O(σu) time, where σ is the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index. 01/2005;
Article: Bases de Datos no Convencionales (Non-Conventional Databases)
01/2004;
ABSTRACT: Hybrid dynamic spatial approximation trees are recently proposed data structures for searching in metric spaces, based on combining the concepts of spatial approximation and pivot-based algorithms. These data structures are hybrid schemes, with the full features of dynamic spatial approximation trees, and able to use the available memory to improve the query time. It has been shown that they compare favorably against alternative data structures in spaces of medium difficulty. In this paper we complete and improve hybrid dynamic spatial approximation trees, by presenting a new search alternative, an algorithm to remove objects from the tree, and an improved way of managing the available memory. The result is a fully dynamic and optimized data structure for similarity searching in metric spaces. 10/2003;
Publication Stats
146 Citations
1.50 Total Impact Points
Top Journals
Institutions

2010–2014

Universidad Técnica Federico Santa María
 Department of Computer Science
Ciudad de Valparaíso, Valparaíso, Chile


2005–2010

University of Santiago, Chile
Santiago, Santiago, Chile


2003

Universidad Nacional de San Luis
 Department of Computer Science
San Luis, San Luis, Argentina
