Top-k Ranked Document Search in General Text Databases

DOI: 10.1007/978-3-642-15781-3_17 · In: Algorithms – ESA 2010, pp. 194–205


Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can
readily be partitioned into words, allowing data structures and algorithms for ranking to be optimized accordingly. However, in many
new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of a word, and traditional
indexing approaches are not easily adapted, or break down entirely. We present two new algorithms for ranking documents
against a query without making any assumptions about the structure of the underlying text. We build on existing theoretical techniques,
which we have implemented and compared empirically with the new approaches introduced in this paper. Our best approach is significantly
faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for
English text when word queries are issued.
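The core mechanism behind word-free ranked retrieval can be illustrated with a minimal sketch (not the authors' algorithm, and all names here are illustrative): a suffix array over the concatenated collection locates every occurrence of a pattern as one contiguous range [sp..ep], and counting how often each document appears in that range yields a term-frequency top-k ranking. A practical index would use a compressed suffix array and the faster ranking structures the paper develops; this sketch uses naive O(n² log n) construction for clarity.

```python
from collections import Counter

def build_index(docs, sep="\x01"):
    """Concatenate documents (separator assumed absent from them) and
    build a naive suffix array; real systems use linear-time construction."""
    text = sep.join(docs) + sep
    doc_of, d = [], 0          # doc_of[i] = document containing text position i
    for ch in text:
        doc_of.append(d)
        if ch == sep:
            d += 1
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of

def sa_range(text, sa, p):
    """Binary-search the suffix array for the range of suffixes prefixed by p."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:             # first suffix whose m-char prefix is >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    lo, hi = sp, len(sa)
    while lo < hi:             # first suffix whose m-char prefix is > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo              # matches occupy sa[sp:lo]

def top_k(text, sa, doc_of, p, k):
    """Rank documents in [sp..ep) by term frequency of p (the tf measure)."""
    sp, ep = sa_range(text, sa, p)
    freqs = Counter(doc_of[sa[i]] for i in range(sp, ep))
    return freqs.most_common(k)
```

For example, over the collection `["abracadabra", "banana", "cabana"]` the query `"ana"` ranks document 1 ("banana", two occurrences) above document 2 ("cabana", one occurrence).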



Available from: Andrew Turpin
  • Source
    • "However, IR indexes suffer from drawbacks such as large size, poor search performance, and weak security. Indexing plays a very important role in database performance, and the same concept can now be used for developing secure databases [13,14,15]. Thus, there is a need for index structures that support single-keyword queries and phrase queries efficiently in large information retrieval systems, and avoid the drawbacks of other indexing structures."

    Full-text · Dataset · Jan 2016
  • Source
    • "GREEDY is also dependent on the size of the [sp..ep] range, requiring two orders of magnitude more time on ENWIKI-BIG than on ENWIKI-SML, matching the difference in their sizes. Second, for long patterns (>15 characters) SADA is now only one order of magnitude slower than GREEDY, in contrast to the two orders reported by Culpepper et al. [2]. This is due to the faster extraction of inverse SA values and the use of a Ψ-based CSA instead of a WT-based one."
    ABSTRACT: Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.
    Preview · Article · Nov 2013
  • Source
    • "We have shown that our structure can use, instead, O(n(log σ + log D)) bits for the tf measure (and slightly more for others), but the constants are still large. There is a whole trend of reduced-space representations for general document retrieval problems with the tf measure [57] [63] [37] [19] [30] [34] [9] [29] [61] [36] [49]. The current situation is as follows [46]: One trend aims at the least space usage. "
    ABSTRACT: Let $\mathcal{D}$ be a collection of $D$ documents, which are strings over an alphabet of size $\sigma$, of total length $n$. We describe a data structure that uses linear space and reports the $k$ most relevant documents that contain a query pattern $P$, which is a string of length $p$, in time $O(p/\log_\sigma n+k)$, which is optimal in the RAM model in the general case where $\lg D = \Theta(\log n)$, and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures... [clip] When $\lg D = o(\log n)$, we show how to reduce the space of the data structure from $O(n\log n)$ to $O(n(\log\sigma+\log D+\log\log n))$ bits... [clip] We also consider the dynamic scenario, where documents can be inserted into and deleted from the collection. We obtain linear space and query time $O(p(\log\log n)^2/\log_\sigma n+\log n + k\log\log k)$, whereas insertions and deletions require $O(\log^{1+\epsilon} n)$ time per symbol, for any constant $\epsilon>0$. Finally, we consider an extended static scenario where an extra parameter $par(P,d)$ is defined, and the query must retrieve only documents $d$ such that $par(P,d)\in [\tau_1,\tau_2]$, where this range is specified at query time. We solve these queries using linear space and $O(p/\log_\sigma n + \log^{1+\epsilon} n + k\log^\epsilon n)$ time, for any constant $\epsilon>0$. Our technique is to translate these top-$k$ problems into multidimensional geometric search problems. As an additional bonus, we describe some improvements to those problems.
    Full-text · Article · Jul 2013
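The last idea in the abstract above, reducing parameterized retrieval to geometric search, can be loosely illustrated with a one-dimensional simplification (all names here are hypothetical, and the cited work achieves this with succinct geometric structures rather than materialized pairs per pattern): store precomputed $(par(P,d), d)$ pairs sorted by parameter value, and answer a query range $[\tau_1,\tau_2]$ with two binary searches.

```python
from bisect import bisect_left, bisect_right

def build_par_index(pairs):
    """pairs: hypothetical precomputed (par(P, d), d) values for one pattern P."""
    return sorted(pairs)

def docs_in_range(index, t1, t2):
    """Report documents d with par(P, d) in [t1, t2] via two binary searches."""
    lo = bisect_left(index, (t1,))                 # first pair with par >= t1
    hi = bisect_right(index, (t2, float("inf")))   # one past last pair with par <= t2
    return [d for _, d in index[lo:hi]]
```

For instance, with pairs `[(5, 0), (2, 1), (9, 2), (2, 3)]` the query range [2, 5] reports documents 1, 3, and 0.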