Top-k Ranked Document Search in General Text Databases

DOI: 10.1007/978-3-642-15781-3_17


Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing data structures and algorithms for ranking to be optimized around a fixed vocabulary. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of a word, and traditional indexing approaches adapt poorly or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions about the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with the new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted-file implementation for English text when word queries are issued.
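
To make the setting concrete, the following Python sketch is a brute-force baseline for exactly this task, not the paper's algorithms: it concatenates the collection, builds a suffix array, locates by binary search the range [sp, ep) of suffixes that begin with the pattern, and ranks documents by how many of those suffixes they own. All names are illustrative, and the quadratic suffix sort is for exposition only.

    from collections import Counter

    def build_index(docs, sep="\x01"):
        """Concatenate the documents; build a suffix array and a document
        array mapping each text position to the document that owns it."""
        text = sep.join(docs) + sep
        doc_of = []
        for d, doc in enumerate(docs):
            doc_of.extend([d] * (len(doc) + 1))   # +1 covers the separator
        sa = sorted(range(len(text)), key=lambda i: text[i:])  # quadratic: sketch only
        return text, sa, doc_of

    def sa_range(text, sa, p):
        """Binary-search the suffix array for [sp, ep), the range of
        suffixes that begin with pattern p."""
        def first(past):                    # first suffix whose length-|p| prefix
            lo, hi = 0, len(sa)             # is >= p (past=False) or > p (past=True)
            while lo < hi:
                mid = (lo + hi) // 2
                prefix = text[sa[mid]:sa[mid] + len(p)]
                if prefix < p or (past and prefix == p):
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        return first(False), first(True)

    def topk(text, sa, doc_of, p, k):
        """Rank documents by term frequency of p; return the top k."""
        sp, ep = sa_range(text, sa, p)
        tf = Counter(doc_of[sa[i]] for i in range(sp, ep))
        return tf.most_common(k)

    text, sa, doc_of = build_index(["abracadabra", "cadenza", "abrasive"])
    print(topk(text, sa, doc_of, "ra", 2))    # [(0, 2), (2, 1)]

The [sp..ep] range located here is the same range the experiments quoted below refer to; the paper's structures compute the ranking far faster and in far less space.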

Excerpts from works citing this paper:
    • "GREEDY is also dependent on size of the [sp..ep] range, requiring two orders of magnitudes more time on ENWIKI-BIG than on ENWIKI-SML, matching the difference in their sizes. Second, for long patterns (>15 characters) SADA is now only one order of magnitude slower than GREEDY, in contrast to two orders reported by Culpepper et al. [2]. This is due to the faster extraction of inverse SA values and the use of a Ψ-based CSA instead of a WT based one. "
ABSTRACT: Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.
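
As a taste of the kind of configurable component such a framework supplies, here is a minimal Python sketch of a bitvector with sampled rank support, the primitive underneath wavelet trees and compressed suffix arrays. The class name and block size are assumptions for illustration; real libraries pack bits into machine words and tune the sampling rates.

    class RankBitVector:
        """Bitvector with rank support via sampled block counts:
        a toy version of the rank dictionaries succinct libraries provide."""
        BLOCK = 64  # illustrative sampling rate

        def __init__(self, bits):
            self.bits = list(bits)
            # samples[j] = number of 1-bits in bits[0 : j*BLOCK]
            self.samples = [0]
            ones = 0
            for i, b in enumerate(self.bits, 1):
                ones += b
                if i % self.BLOCK == 0:
                    self.samples.append(ones)

        def rank1(self, i):
            """Count of 1-bits in bits[0:i]: one sample plus a short scan."""
            j = i // self.BLOCK
            return self.samples[j] + sum(self.bits[j * self.BLOCK : i])

    bv = RankBitVector([1, 0, 1, 1, 0] * 40)
    print(bv.rank1(100))  # 60: three 1-bits per 5-bit period, 20 periods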
    • "We have shown that our structure can use, instead, O(n(log σ + log D)) bits for the tf measure (and slightly more for others), but the constants are still large. There is a whole trend of reduced-space representations for general document retrieval problems with the tf measure [57] [63] [37] [19] [30] [34] [9] [29] [61] [36] [49]. The current situation is as follows [46]: One trend aims at the least space usage. "
ABSTRACT: Let $\mathcal{D}$ be a collection of $D$ documents, which are strings over an alphabet of size $\sigma$, of total length $n$. We describe a data structure that uses linear space and reports the $k$ most relevant documents that contain a query pattern $P$, which is a string of length $p$, in time $O(p/\log_\sigma n+k)$, which is optimal in the RAM model in the general case where $\log D = \Theta(\log n)$, and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures... [clip] When $\log D = o(\log n)$, we show how to reduce the space of the data structure from $O(n\log n)$ to $O(n(\log\sigma+\log D+\log\log n))$ bits... [clip] We also consider the dynamic scenario, where documents can be inserted into and deleted from the collection. We obtain linear space and query time $O(p(\log\log n)^2/\log_\sigma n+\log n + k\log\log k)$, whereas insertions and deletions require $O(\log^{1+\epsilon} n)$ time per symbol, for any constant $\epsilon>0$. Finally, we consider an extended static scenario where an extra parameter $par(P,d)$ is defined, and the query must retrieve only documents $d$ such that $par(P,d)\in [\tau_1,\tau_2]$, where this range is specified at query time. We solve these queries using linear space and $O(p/\log_\sigma n + \log^{1+\epsilon} n + k\log^\epsilon n)$ time, for any constant $\epsilon>0$. Our technique is to translate these top-$k$ problems into multidimensional geometric search problems. As an additional bonus, we describe some improvements to those problems.
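
The reduction named in the final sentence can be pictured with a deliberately naive stand-in: each candidate becomes a point, and top-k retrieval becomes "report the k heaviest points whose x-coordinate falls in a range". The tuple layout and function name below are assumptions for illustration; the linear scan is precisely what the paper's geometric structures avoid.

    def topk_in_xrange(points, sp, ep, k):
        """Naive form of the geometric query: among points (x, doc, weight)
        with sp <= x <= ep, report the k heaviest."""
        hits = [(w, d) for (x, d, w) in points if sp <= x <= ep]
        return sorted(hits, reverse=True)[:k]

    pts = [(0, 'd1', 3), (2, 'd2', 5), (3, 'd1', 1), (7, 'd3', 4)]
    print(topk_in_xrange(pts, 1, 7, 2))  # [(5, 'd2'), (4, 'd3')]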
    • "The second round of experiments compares ours with previous work. The Greedy heuristic [7] is run over different wavelet-tree representations of the document array: a plain one (WT-Plain) [7], a Re-Pair compressed one (WT-RP), and a hybrid that at each wavelet tree level chooses between plain, Re-Pair, or entropy-based compression of the bitmaps (WT-Alpha) [22]. "
ABSTRACT: Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel algorithms and data structures dominate almost all of the space/time tradeoff map.
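
The Greedy heuristic quoted above admits a compact sketch: build a wavelet tree over the document array, then expand the tree nodes covering the query's suffix-array range largest-first, so each leaf popped from the priority queue is the next document in non-increasing term-frequency order. The Python below is a simplified pointer-based version with explicit prefix sums for rank; names are illustrative, and it is a sketch of the idea rather than the tuned implementations compared in these experiments.

    import heapq
    from itertools import count

    class WaveletTree:
        """Pointer-based wavelet tree over a sequence of ints in [lo, hi]."""
        def __init__(self, seq, lo, hi):
            self.lo, self.hi = lo, hi
            if lo == hi:
                return  # leaf: every position holds symbol lo
            mid = (lo + hi) // 2
            # prefix_ones[i] = number of symbols > mid among the first i
            self.prefix_ones = [0]
            for v in seq:
                self.prefix_ones.append(self.prefix_ones[-1] + (v > mid))
            self.left = WaveletTree([v for v in seq if v <= mid], lo, mid)
            self.right = WaveletTree([v for v in seq if v > mid], mid + 1, hi)

    def greedy_topk(wt, sp, ep, k):
        """GREEDY: always expand the wavelet-tree range with the most
        occurrences; a popped leaf is the next-best document, with term
        frequency equal to its range size."""
        tie = count()                       # tie-breaker for equal-sized ranges
        heap = [(-(ep - sp), next(tie), wt, sp, ep)]
        out = []
        while heap and len(out) < k:
            _, _, node, s, e = heapq.heappop(heap)
            if node.lo == node.hi:          # leaf: one document
                out.append((node.lo, e - s))
                continue
            o_s, o_e = node.prefix_ones[s], node.prefix_ones[e]
            ls, le = s - o_s, e - o_e       # zero-bits go to the left child
            rs, re = o_s, o_e               # one-bits go to the right child
            if le > ls:
                heapq.heappush(heap, (-(le - ls), next(tie), node.left, ls, le))
            if re > rs:
                heapq.heappush(heap, (-(re - rs), next(tie), node.right, rs, re))
        return out

    # Document array restricted to a pattern's [sp, ep) range (hypothetical):
    D = [0, 2, 0, 1, 0, 2, 2]
    wt = WaveletTree(D, 0, 2)
    print(greedy_topk(wt, 0, len(D), 2))    # [(2, 3), (0, 3)]: docs 2 and 0, tf 3 each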