Conference Paper

Using variable length ngrams for retrieving technical abstracts in Japanese (poster session).

DOI: 10.1145/355214.355250
Conference: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, 2000, Hong Kong, China, September 30 - October 01, 2000
Source: DBLP


Previous studies have reported that bigrams work well for many Asian languages, including Chinese, Korean, and Japanese. Most of these studies have focused on newspaper text. We report an experiment with a very different genre (technical abstracts) and find that performance can be improved by combining both short and long ngrams. Working with ngrams of all lengths is a sound approach, since they carry more information than bigrams alone.
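As a rough illustration of what indexing with ngrams of several lengths can look like, here is a minimal Python sketch. The abstract does not say how the short and long ngrams were combined or weighted, so the function names (`char_ngrams`, `overlap_score`), the length range, and the length-proportional weighting are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=4):
    """Count all overlapping character n-grams with min_n <= n <= max_n.

    Whitespace is dropped first; Japanese text normally has none anyway.
    """
    chars = [c for c in text if not c.isspace()]
    grams = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(chars) - n + 1):
            grams["".join(chars[i:i + n])] += 1
    return grams


def overlap_score(query, document, min_n=1, max_n=4):
    """Toy matching score: shared n-grams, weighted by their length.

    The length weighting is an illustrative assumption, not taken
    from the paper.
    """
    q = char_ngrams(query, min_n, max_n)
    d = char_ngrams(document, min_n, max_n)
    return sum(len(g) * min(count, d[g]) for g, count in q.items())


if __name__ == "__main__":
    query = "情報検索"
    doc = "情報検索システム"
    print(sorted(char_ngrams(query, 1, 2)))
    print(overlap_score(query, doc, 2, 2))
```

With `min_n=max_n=2` the sketch reduces to plain bigram matching; widening the range adds both shorter and longer units to the same index.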

  • ABSTRACT: It is generally believed that words, rather than characters, should be the smallest indexing unit for Chinese text retrieval systems, and that it is essential to have a comprehensive Chinese dictionary or lexicon for Chinese text retrieval systems to do well. Chinese text has no delimiters to mark word boundaries. As a result, any text retrieval system that builds word-based indexes needs to segment text into words. We implemented several statistical and dictionary-based word segmentation methods to study the effect on retrieval effectiveness of different segmentation methods using the TREC-5 Chinese test collection and topics. The results show that, for all three sets of queries, the simple bigram indexing and the purely statistical word segmentation perform better than the popular dictionary-based maximum matching method with a dictionary of 138,955 entries.
    SIGIR '97: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 27-31, 1997, Philadelphia, PA, USA; 12/1997
  • ABSTRACT: A series of Japanese full-text retrieval experiments was conducted using an inference network document retrieval model. The retrieval performance of two major indexing methods, character-based and word-based, was evaluated. Using structured queries, the character-based indexing performed retrieval as well as, or slightly better than, the word-based system. This result has practical significance since the character-based indexing speed is considerably faster than the traditional word-based indexing. All the queries in this experiment were automatically formulated from natural language input.
    Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Pittsburgh, PA, USA, June 27 - July 1, 1993; 01/1993
  • ABSTRACT: Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The retrieval collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, and 28 queries that are long and rich in wordings. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but is as good as short-word indexing in precision, and about 5% better in relevants retrieved. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach. 1. Introduction While information retrieval (IR) in English has over thirty years of history, IR in Chinese is relatively recent. It is well-known that written Chi...
    ACM SIGIR Forum 08/1997; DOI:10.1145/278459.258531
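
The abstracts above contrast overlapping bigram indexing with dictionary-based maximum matching segmentation. The following Python sketch shows one common form of each (greedy forward maximum matching, and overlapping character bigrams) purely for illustration. The toy lexicon, the `max_len` cap, and the function names are hypothetical and are not taken from any of the cited papers, whose exact segmentation variants are not described here.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry (up to max_len
    characters); fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def bigrams(text):
    """Overlapping character bigrams; no dictionary is needed."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    # Hypothetical toy lexicon, for illustration only.
    lexicon = {"信息", "检索", "信息检索", "系统"}
    text = "信息检索系统"
    print(max_match(text, lexicon))
    print(bigrams(text))
```

The word-based output depends entirely on what the lexicon contains, whereas the bigram output is fixed by the text alone, which is one reason bigram indexing needs no dictionary maintenance.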