Top-k Ranked Document Search in General Text Databases

DOI: 10.1007/978-3-642-15781-3_17

ABSTRACT Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can
readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many
new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional
indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents
against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques,
which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly
faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for
English text when word queries are issued.
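
To make the problem setting concrete, the following is a minimal sketch of top-k document retrieval for an arbitrary substring query, with no notion of words. It is a naive baseline and not one of the algorithms presented in the paper: it builds a suffix array over the concatenated collection by plain sorting, locates the query's suffix range by binary search, and counts occurrences per document. The function names (build_index, top_k), the "\x00" separator, and ranking by raw occurrence count are assumptions made for illustration.

```python
import heapq
from collections import Counter

SEP = "\x00"  # assumed never to occur inside a document or a query


def build_index(docs):
    """Concatenate the documents and build a suffix array and a document array."""
    text = SEP.join(docs) + SEP
    # doc_of[i] is the id of the document that owns text position i
    doc_of = []
    for doc_id, doc in enumerate(docs):
        doc_of.extend([doc_id] * (len(doc) + 1))
    # Naive construction by sorting suffixes; only suitable for small inputs
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def _bound(text, sa, pattern, strict):
    """First suffix-array index whose m-character prefix is >= pattern
    (or > pattern when strict is True)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def top_k(index, pattern, k):
    """Return up to k (doc_id, count) pairs, highest occurrence count first."""
    text, sa, doc_of = index
    start = _bound(text, sa, pattern, strict=False)
    end = _bound(text, sa, pattern, strict=True)
    # Every suffix in sa[start:end] begins with the pattern
    counts = Counter(doc_of[sa[i]] for i in range(start, end))
    # Ties are broken in favor of the smaller document id
    return heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], -kv[0]))


if __name__ == "__main__":
    index = build_index(["abracadabra", "banana", "cabana"])
    print(top_k(index, "ana", 2))
```

Because the query is matched as a raw character sequence, the same sketch applies unchanged to DNA or to text without word boundaries. Its per-query cost grows with the number of occurrences, which is the kind of overhead that more refined index structures aim to avoid.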

  • Source
    ABSTRACT: In this report, we describe our experimental approach for the NTCIR-9 GeoTime task. For our experiments, we use our experimental search engine, Newt. Newt is a ranked self-index capable of supporting multiple languages by deferring linguistic decisions until query time. To our knowledge, this is the first application of ranked self-indexing to a multilingual information retrieval task at NTCIR.
  • Source
    ABSTRACT: As the World Wide Web keeps growing, so does the need to search for information. Search functionality can be applied to any domain in which the amount of data justifies it. In this paper we extend a previously defined algorithm for optimizing document searches. The new algorithm adds category disambiguation and citation counts to increase the precision of the returned results. We evaluate the algorithm using the Mendeley reference manager, comparing the results obtained in both cases.
    01/2014; 16. DOI:10.1016/j.protcy.2014.10.064
  • Source
    ABSTRACT: The wavelet tree (Grossi et al., SODA 2003) is nowadays a popular succinct data structure for text indexes, discrete grids, and many other applications. When it has many nodes, a levelwise representation proposed by Mäkinen and Navarro (LATIN 2006) is preferable. We propose a different arrangement of the levelwise data, so that the bitmaps are shuffled in a different way. The result can no more be called a wavelet tree, and we dub it wavelet matrix. We demonstrate that the wavelet matrix is simpler to build, simpler to query, and faster in practice than the levelwise wavelet tree. This has a direct impact on many applications that use the levelwise wavelet tree for different purposes.
