Conference Paper

Relevant document retrieval using a spoken document

Grad. Sch. of Eng., Tohoku Univ., Sendai, Japan
DOI: 10.1109/ISCIT.2009.5341051
Conference: Communications and Information Technology, 2009. ISCIT 2009. 9th International Symposium on
Source: IEEE Xplore


In this paper, we propose a method of retrieving documents from the World Wide Web using a spoken document as a "key." This method can be viewed as a speech version of ordinary relevant document retrieval, where a text document is used as the retrieval query. The retrieval is based on an automatic transcription of the spoken document produced by a speech recognizer. The difficulty of this task is that the automatic transcription contains many recognition errors, so keywords extracted from it using conventional methods such as tf·idf cannot be trusted. To solve this problem, we developed three methods. The first measures the relevance of a keyword to the spoken document by using Web documents retrieved from a Web search engine with the keyword as the query. The second composes a query from the selected keywords so that words derived from misrecognitions are excluded and similar words are grouped together. The third measures the relevance of a downloaded Web document to the spoken document. The experimental results suggest that the proposed methods are promising for retrieving documents relevant to a spoken document.
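For readers unfamiliar with the conventional baseline named in the abstract, the following is a minimal sketch of tf·idf keyword extraction over a tokenized transcription. It is not the authors' implementation: the function name `tfidf_keywords`, the background collection, and the toy tokens are all assumptions made for illustration. Note that a misrecognized word that is rare in the background collection would receive a high score here, which is the weakness the abstract points out.

```python
import math
from collections import Counter


def tfidf_keywords(transcript_tokens, background_docs, top_k=3):
    """Rank the words of one transcript by tf-idf (hypothetical helper).

    transcript_tokens: list of words from the automatic transcription.
    background_docs:   list of token lists used to estimate document frequency.
    """
    # The collection is the background documents plus the transcript itself,
    # so every transcript word has a document frequency of at least 1.
    collection = background_docs + [transcript_tokens]
    n_docs = len(collection)

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for doc in collection:
        df.update(set(doc))

    # Term frequency within the transcript.
    tf = Counter(transcript_tokens)

    # tf-idf score with the plain log(N / df) form of idf.
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy data, assumed for illustration only.
transcript = ["speech", "recognizer", "speech", "web", "the"]
background = [["the", "web", "search"], ["the", "weather"]]
top_two = tfidf_keywords(transcript, background, top_k=2)
```

In this toy example the word "the" appears in every document and so scores zero, while "speech" ranks first because it is frequent in the transcript and absent from the background collection.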

Available from: Shozo Makino, Jan 13, 2014
  • ABSTRACT: Recently, automatic indexing of spoken documents using a speech recognizer has attracted attention. However, index generation from an automatic transcription has many problems because the transcription contains many recognition errors and out-of-vocabulary (OOV) words. To solve this problem, we propose a document expansion method using Web documents. To obtain important keywords that were included in the spoken document but lost through recognition errors, we acquire Web documents relevant to the spoken document. Then, an index of the spoken document is generated by combining an index generated from the automatic transcription with one generated from the Web documents. We propose a method for retrieving relevant documents, and the experimental result shows that the retrieved Web documents contained many OOV words. Next, we propose a method for combining the recognized index and the Web index. The experimental result shows that the index of the spoken document generated by document expansion was closer to an index from the manual transcription than the index generated by the conventional method. Finally, we conducted a spoken document retrieval experiment, and the document-expansion-based index gave better retrieval precision than the conventional indexing method.
    Full-text · Conference Paper · Sep 2010
  • ABSTRACT: In this paper, a relevant document retrieval method is proposed for document retrieval systems with vector space models (VSM). In recent years, as databases have become extremely large, there has been a high demand for accurate and fast document retrieval algorithms. Based on the maximum similarity criterion, a document retrieval algorithm using the discrete stochastic optimization method is proposed to retrieve the documents relevant to a user query. The proposed algorithm has a self-learning capability, in that most of the computational effort is spent at the globally optimal document, and it converges quickly to the relevant documents in the database. Numerical results demonstrate that the proposed algorithm has good convergence properties and satisfactory document retrieval performance.
    No preview · Conference Paper · Dec 2013