978-1-4244-4522-6/09/$25.00 ©2009 IEEEISCIT 20091483
Relevant Document Retrieval using a Spoken Document
Akinori Ito, Yu Uno, Ryo Masumura, Masashi Ito and Shozo Makino
Graduate School of Engineering
Tohoku University, Sendai, 980-8579 Japan
Tel: +81-22-795-7084, Fax: +81-22-795-7084
Abstract—In this paper, we propose a method of retrieving
documents from the World Wide Web using a spoken document
as a “key.” This method can be viewed as a speech version of an
ordinary relevant document retrieval, where a text document is
used as a query of retrieval. Basically the retrieval is based on
an automatic transcription of a spoken document using a speech
recognizer. The difficulty of this task is that the automatic
transcription contains many recognition errors, so we
cannot trust keywords extracted from the automatic transcription
using a conventional method such as tf·idf. To solve this problem,
we developed three methods. The first one is to measure relevance
of a keyword to the spoken document by using Web documents
retrieved using a Web search engine by specifying the keyword
as a query. The second one is to compose a query from the
selected keywords so that words derived from misrecognitions are
excluded and similar words are gathered. The third one is to
measure relevance of a downloaded Web document to the spoken
document. The experimental results suggest that the proposed
methods are promising for retrieving documents relevant to a
spoken document.
I. INTRODUCTION
The growth of the Internet gives us access to a huge amount
of speech content, including videos. However, speech content
(spoken documents) is difficult to search, summarize, or
browse, which is a big difference from text-based content.
We must listen to the whole of a spoken document to
understand it. This drawback has prevented us from utilizing
spoken documents on the Internet.
To enable easier access to spoken documents, spoken document
retrieval methods have been developed. The
conventional spoken document retrieval methods enable us to
specify a search query based on keywords and search for spoken
documents that match the query. The basic framework of those
methods is to recognize the spoken document using a speech
recognizer and generate an index from the recognition result.
There are two problems in using a speech recognition result
as a source for index generation. First, the index is affected
by recognition errors. A speech recognition result inevitably
contains many recognition errors. Therefore, the generated
index includes words that are not contained in the
original spoken document, while it lacks some words that are
contained in the spoken document. The latter is the more
severe problem because it degrades the recall of the retrieval.
To treat these problems (especially the latter one), methods
have been proposed that utilize multiple recognition results
for one utterance or use a more flexible representation such as
a word lattice. Using these methods, we can improve recall.
The second problem is out-of-vocabulary (OOV) words. As
the vocabulary size of a speech recognizer is limited, there
are many words in a spoken document that are not included
in the vocabulary; these are called OOV words. In principle,
we cannot recognize OOV words. OOV words are low-frequency
words; unfortunately, low-frequency words are often
important keywords of a document that concerns a limited
topic. Therefore, a keyword for searching a spoken document
could be an OOV word. Many methods have been proposed so
far for searching spoken documents using OOV words.
These methods are basically vocabulary-independent searches
that use subword-based pattern matching.
This paper proposes a different framework for utilizing spoken
documents. The conventional spoken document retrieval
methods focus on detecting terms contained in a spoken document.
In contrast to this framework, we propose a method to
retrieve documents from the World Wide Web that have
topics similar to those of the spoken document. This framework
can be regarded as relevant document retrieval using a spoken document.
Relevant (associative) document retrieval is a technique that
retrieves those documents similar to a specified document. This
technique can be viewed as information retrieval using a
document as a retrieval key. One of its major applications is
patent search, which finds patents similar to a specific patent. The
relevant document retrieval is related to relevance feedback,
which is a technique for enhancing a search query by
specifying several documents as “related documents” to the
target document. Both methods measure similarity between
specified documents as retrieval keys and all documents in a
The difference between these works and ours is that our
work uses a spoken document as a key. Methods have also been
proposed that use a spoken sentence as a retrieval key
for database or Web search. For example, Akiba et al. proposed
a method that uses an utterance with fixed expressions as a
retrieval key. This method is an interface for information
retrieval where a user utters a query instead of typing it
on a keyboard. In contrast, our method uses existing spoken
documents, such as recordings of meetings or lectures, as retrieval
keys.
Fig. 1. System overview
We can envisage several applications for the proposed
method, which retrieves documents relevant to a specified spoken
document. One possible application is indexing or summarization
of spoken documents. As it is difficult to transcribe
a spoken document accurately, indexing or summarization
of spoken documents is not easy. These tasks could be
improved by using relevant documents because we can expect
that those relevant documents contain keywords or
sentences similar to those of the spoken document.
The other application is unsupervised language model adaptation
of a speech recognizer. When transcribing a spoken
document, we can improve the accuracy if we know the topic
of the spoken document. This can be realized by gathering
documents relevant to the spoken document. By using the
gathered documents, we can use a language model adaptation
technique for enhancing the language model of the speech
recognizer.
The retrieved documents can be evaluated from various aspects.
In this work, we assume that the relevant document
retrieval is used for indexing of spoken documents or language
model adaptation, and we evaluate the downloaded text by its
coverage of important words and OOV words.
II. EXTRACTION OF KEYWORD CANDIDATES
A. Overview of the method
Fig. 1 shows an overview of the proposed document retrieval
method. A spoken document is first recognized using a speech
recognizer, and an automatic transcription is generated. Then
keyword candidates are extracted from the transcription. Next,
we cluster the keywords to generate queries, send
the queries to a Web search engine, and retrieve URLs. Finally,
we download Web documents from the URLs and filter them to
obtain the final result.
B. Conventional keyword selection
TABLE I
EXAMPLES OF EXTRACTED KEYWORDS FROM AUTOMATIC
TRANSCRIPTION OF A LECTURE “THE HISTORY OF CHARACTER.”
退院 (leaving hospital)
筆 (writing brush)
明朝 (Ming-cho type)
良書 (good book)
To extract keyword candidates, we calculate a score
for each word included in the automatic transcription
and select those words with high scores as keyword candidates.
The score should be high for those words that
are keywords representing the spoken document. The tf·idf
(term frequency - inverse document frequency) is the most
popular score used for this purpose; it is calculated by
multiplying the term frequency by the logarithm of the inverse
document frequency of the term. Let tf(d,t) be the frequency of
term t in a document d, and df(t) be the number of documents
in which the term t appears. Then the tf·idf value is calculated
as
$$\mathrm{tfidf}(d,t) = \mathrm{tf}(d,t) \log \frac{N}{\mathrm{df}(t)} \qquad (1)$$
Here, N denotes the number of documents in a corpus.
By weighting the term frequency tf(d,t) using the inverse
document frequency idf = log(N/df(t)), the tf·idf value of
words that appear in many documents becomes smaller, while
the value of words that appear in a limited number of documents
becomes larger. As shown in Eq. (1), calculation of tf·idf needs
a corpus that contains a large number of documents. A similar
metric that does not use a corpus has also been proposed.
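As an illustration only, the tf·idf scoring described above can be sketched in Python as follows. The function names, the tokenized-list input format, and the choice to skip terms whose document frequency is zero are assumptions made for this sketch; they are not taken from the experimental system.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus_docs):
    """Return {term: tf(d,t) * log(N / df(t))} for one tokenized document.

    corpus_docs is a list of tokenized documents used only to count df(t).
    Terms that never occur in the corpus are skipped (an assumption, since
    the text does not say how df(t) = 0 is handled).
    """
    n_docs = len(corpus_docs)
    df = Counter()
    for doc in corpus_docs:
        df.update(set(doc))          # count each term once per document
    tf = Counter(doc_tokens)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if df[t] > 0}

def top_keywords(doc_tokens, corpus_docs, k=50):
    """Return the k terms with the highest tf-idf values."""
    scores = tf_idf(doc_tokens, corpus_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A term that occurs in every corpus document receives log(N/N) = 0, which is the down-weighting of common words described in the text.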
The conventional scores for selecting keywords, including
the above ones, assume that all of the words in a document are
correct. In contrast, the words included in a transcription
of a spoken document are not necessarily correct. As the output
of a speech recognizer inevitably contains errors, there is a
possibility that a selected keyword derives from mis-recognition.
For example, Table I shows an example of keywords selected
using tf·idf from an automatic transcription of a lecture speech
on “history of characters,” which is a part of the Corpus of
Spontaneous Japanese (CSJ). The italicized words in this
table are mis-recognitions. For example, the 2nd word 所帯
/shotai/ (household) and the 6th word 状態 /jo:tai/ (status)
are mis-recognitions of 書体 /shotai/ (font). Similarly, the 3rd
word 退院 /taiin/ (leaving hospital) is a mis-recognition of 体
/tai/ (typeface), and the 10th word 良書 /ryo:sho/ (good book) is a
mis-recognition of 行書 /gyo:sho/ (semi-cursive style). These
examples show that the conventional keyword selection
methods cannot avoid selecting mis-recognized words, so
the selected keywords do not correctly reflect the spoken
document.
Fig. 2. Keyword selection
To solve this problem, we propose a word scoring method
that uses Web text downloaded by a one-word query using the
C. Keyword selection using cosine similarities of downloaded
Fig. 2 shows an overview of the proposed method. Let w
be a word included in an automatic transcription Ta, T(w) be
the documents downloaded from the WWW by using w as a query,
and v(T) be a document vector calculated from documents T.
Then the score of word w is calculated as follows.
$$R(w,T_a) = \frac{v(T(w)) \cdot v(T_a)}{|v(T(w))|\,|v(T_a)|}$$
Here, R(w,Ta) is the cosine similarity between the two vectors
v(T(w)) and v(Ta). When calculating the document vectors, we
use tf·idf values of words as the elements of the vector v(Ta), and
tf values for v(T(w)).
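The score can be sketched as follows, assuming sparse vectors stored as Python dictionaries. The names `cosine` and `keyword_score` and the input formats are hypothetical; downloading T(w) from a search engine is outside the scope of this sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                       # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def keyword_score(web_tokens, transcript_tfidf):
    """R(w, Ta): similarity between the Web text T(w) and the transcription.

    web_tokens       -- tokens of all documents downloaded for keyword w
    transcript_tfidf -- {term: tf-idf weight} for the transcription Ta
    """
    tf_vector = Counter(web_tokens)      # v(T(w)) uses raw term frequencies
    return cosine(tf_vector, transcript_tfidf)
```

A keyword that derives from a mis-recognition tends to retrieve Web text on an unrelated topic, so its vector shares few terms with the transcription and its score stays low.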
D. Evaluation of the proposed score
In this score calculation, we use the automatic transcription
Ta as a “reference” for relevance measurement. However, as Ta
contains recognition errors, it is not certain that R(w,Ta) really
shows the relevance of w to the real transcription of the spoken
document. Therefore, we investigated whether the score is correlated
with the score calculated using the human transcription of the
document. We denote this transcription To.
Before conducting the score calculation experiment, we
first generated automatic transcriptions Ta from recordings of
lectures. The CSJ was used for training the language
model of the speech recognizer. Transcriptions of 3,124 lectures
(containing about 7 million words) were used as the training
corpus. The vocabulary size of the recognizer was about 57k,
which contained all words that appear in the training text. Two
years' worth of articles from a Japanese newspaper (Mainichi
Shimbun) were added to the general corpus for the calculation
of idf. There were 208,693 articles in the newspaper database,
which contain about 80 million words. In addition, we used
Julius as a decoder, with the 3000-state phonetic tied
mixture HMM as an acoustic model. The average word
accuracy of six spoken documents was 43.4%.
TABLE II
LECTURES FOR EVALUATION
ID in CSJ | Title | # of
History of character
Paintings in America and
Effect of charcoal
The steel industry
Fig. 3. Comparison of the score R(w,Ta) and the ideal score R(w,To)
Six lectures from the CSJ were used for this experiment.
Table II shows the topics of the lectures. These lectures are
not included in the training data of the language model. We
first selected 50 keyword candidates using tf·idf, and then we
downloaded 20 Web documents for each of the selected 50
keyword candidates based on retrieval results from Yahoo!
Japan. We used the Yahoo API for retrieving the search
results. The downloaded Web documents were preprocessed to remove
codes and sentences written in languages other than Japanese. Then
the preprocessed sentences were analyzed using the morphological
analyzer ChaSen to split them into words.
The results are shown in Fig. 3. In this figure, one dot
denotes a value for one keyword in a document. Results for
all keyword candidates of all documents are shown together in
this figure. We can see that the values of R(w,To) (X-axis) and
R(w,Ta) (Y-axis) are correlated. The correlation coefficient
was 0.78. It is observed that when the value of R(w,Ta) is
small (under 0.2) the score for the real transcription R(w,To)
is almost zero, which shows the score is not reliable when its
value is small. However, we can see stronger correlation when
R(w,Ta) is large. Therefore, the proposed score seems to be
useful when selecting keywords with the highest scores.
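The comparison between the two scores can be reproduced with a standard correlation coefficient. The following is a minimal sketch assuming Pearson's product-moment coefficient, which the text does not name explicitly; `pearson` is a hypothetical helper that takes the R(w,To) and R(w,Ta) values of all keyword candidates as two equal-length lists.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0                   # undefined for constant input; report 0
    return cov / math.sqrt(var_x * var_y)
```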
III. GENERATION OF A QUERY USING CLUSTERING
A. Similarity between words and clusters
After selecting keyword candidates, we compose a query
from the candidates. A straightforward way of composing
a query is to use all keywords together. However, there
is a problem: although the keyword candidates selected by
the proposed score seem to be better than those
selected by tf·idf, misrecognized words are still included in
the candidates. Another problem is that the proposed score
does not consider the relationship between keywords. It is desirable
to consider the relationship among the keywords and
to select mutually relevant keywords for composing a query,
since a document can involve more than one subtopic.
To this end, we cluster the selected keyword
candidates based on the similarities between keywords. We
used the cosine similarity between document vectors for
measuring the similarity between two words.
Now we can perform bottom-up hierarchical clustering
on the set of keywords. Initially, each keyword belongs to its
own singleton cluster. Then the two clusters with the highest
similarity are merged into one cluster. The new similarity between
clusters C1 and C2 is defined as Eq. (5). This definition
of similarity corresponds to the furthest-neighbor method of
ordinary clustering based on the distance between two points.
$$\mathrm{sim}(C_1,C_2) = \min_{w_1 \in C_1,\, w_2 \in C_2} \mathrm{sim}(w_1,w_2) \qquad (5)$$
This clustering method produces a dendrogram such as Fig. 5.
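A minimal sketch of this bottom-up clustering is given below. It assumes that a word-to-word similarity function `word_sim` is supplied by the caller (for example, a cosine similarity between document vectors) and that the keyword list is not empty. The nested-tuple dendrogram and the function names are choices made for this sketch only.

```python
from itertools import combinations

def cluster_similarity(c1, c2, word_sim):
    """Furthest-neighbor similarity: the least similar pair decides."""
    return min(word_sim(a, b) for a in c1 for b in c2)

def build_dendrogram(words, word_sim):
    """Merge clusters bottom-up and return the root as nested tuples.

    A leaf is a word (str); an internal node is a (left, right) tuple.
    """
    # Each entry pairs a frozenset of member words with its subtree.
    clusters = [(frozenset([w]), w) for w in words]
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(
                clusters[p[0]][0], clusters[p[1]][0], word_sim
            ),
        )
        merged = (clusters[i][0] | clusters[j][0],
                  (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]
```

Because the minimum over word pairs is used, a single dissimilar word (such as a mis-recognition) keeps the similarity of a merged cluster low, so such words tend to join the tree late.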
B. Clustering algorithm and determination of a query
After selecting keyword candidates, clusters corresponding
to the subtopics are extracted. Let d(n) be the number of web
pages that are relevant to the query composed of the keywords
in a node n. This number can be obtained through the API.
Then, we determine a threshold nθ that is the minimum number
of web pages relevant to a query. Let Q and S be the sets of
“current nodes” and “selected nodes,” respectively. We then
use the algorithm shown in Fig. 4 for determining the clusters.
Fig. 5 shows an example of the clustering result and selected
keywords. In this figure, a circle denotes the “selected node”.
There are six keywords (word “A” to “F”), and three clusters
Finally, we determine the cluster to be used as a query. In principle,
we can use all clusters as multiple queries. However, we decided to
choose only one query in this work; using multiple queries is
left for future work.
To select a cluster for a query, we used the sum of the tf·idf
values of all words in a cluster as the evaluation score of the
cluster. As the score is a simple sum, a larger cluster tends
to be selected.
Q ← ∅, S ← ∅
Add the root node to Q
while Q is not empty
    for all nodes n in Q
        Remove n from Q
        if d(n) > nθ then
            Add n to S
        else if n has no child nodes then
            Add n to S
        else
            Add all child nodes of n to Q
Fig. 4. An algorithm for determination of queries
Fig. 5. An example of hierarchical clustering and selected keywords.
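The node selection of Fig. 4 and the subsequent choice of one cluster can be sketched as follows. The sketch assumes that child nodes are expanded only when a node has too few hits and is not a leaf. The `Node` class, the `hit_count` callback standing in for d(n), and the function names are hypothetical; obtaining hit counts from a search API is not shown.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    words: tuple                      # keywords under this dendrogram node
    children: list = field(default_factory=list)

def select_nodes(root, hit_count, n_theta):
    """Walk the dendrogram top-down and return the selected nodes S."""
    queue = deque([root])             # Q: nodes still to examine
    selected = []                     # S: nodes kept as query clusters
    while queue:
        node = queue.popleft()
        if hit_count(node.words) > n_theta or not node.children:
            selected.append(node)     # enough hits, or a leaf we cannot split
        else:
            queue.extend(node.children)
    return selected

def best_cluster(selected, tfidf):
    """Pick the cluster whose words have the largest summed tf-idf."""
    return max(selected, key=lambda n: sum(tfidf.get(w, 0.0) for w in n.words))
```

For example, with a root covering words A, B, and C that returns no hits, a child (A, B) with 50 hits, and a leaf C with 3 hits, a threshold of 10 selects (A, B) and C.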
C. Evaluation experiment
In this section, we conducted an experiment to select keywords,
compose a query, and download relevant documents.
The experimental conditions are the same as those explained in the previous
section.
First, we evaluated how many misrecognized words are
included in a query. We compared the following three keyword sets:
1) 15 keywords with highest tf·idf score
2) 15 keywords with highest R(w,Ta) score
3) Keywords in the best cluster
The experimental result is shown in Fig. 6. This result shows
that the score proposed in the previous section effectively
excludes the misrecognized words from the keywords. Moreover,
the keyword clustering completely excludes the misrecognized
words.
Fig. 6. Misrecognized words in a query
Next, we evaluated the keywords selected by the conventional
and proposed methods using cover rates of OOV words and
important words.
Let the cover rate of document T(w) with respect to a word set
S be defined as
$$c(w,S) = \frac{|V(T(w)) \cap S|}{|S|}$$
where V(T) denotes the set of all words that appear in text T. We
used the set of important words Simp and the set of OOV words Soov.
We defined Simp as the 50 words with the highest tf·idf values in the
manual transcription of the spoken document. Soov is defined
as the set of words that are contained in To and not included
in the vocabulary of the speech recognizer. If the downloaded
text has a higher cover rate, we can say that the downloaded
text is more relevant to the original text.
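The cover rate can be sketched as follows, assuming that it is normalized by the size of the target word set S so that it lies between 0 and 1. The second helper averages the rate over the texts downloaded for several keywords. Both function names and the dictionary input format are assumptions of this sketch.

```python
def cover_rate(text_tokens, target_words):
    """c(w, S): fraction of the target set S that appears in the text T(w)."""
    target = set(target_words)
    if not target:
        return 0.0
    return len(set(text_tokens) & target) / len(target)

def average_cover_rate(texts_by_keyword, target_words):
    """Mean of c(w, S) over the keywords w whose downloaded texts are given.

    texts_by_keyword maps each keyword w to the token list of T(w).
    """
    rates = [cover_rate(tokens, target_words)
             for tokens in texts_by_keyword.values()]
    return sum(rates) / len(rates) if rates else 0.0
```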
We downloaded 1,000 web pages as T(w). Then we calculated
the average cover rate for a set of keywords SK by
$$\bar{c}(S_K,S) = \frac{1}{|S_K|} \sum_{w \in S_K} c(w,S)$$
The result is shown in Fig. 7. From this result, we can
conclude that the similarity-based keyword selection and
clustering-based query composition improved the coverage of
both OOV words and important words, which means that the
proposed method could download more relevant documents
from the World Wide Web.
Fig. 7. Cover rate by the conventional and proposed methods
D. Evaluation of query composition
The previous evaluation focused on how effective a selected
keyword is. In this section, we evaluated the composition
of a query from selected keywords. The experiment examined
two simple query composition methods: query composition
by the AND operator and by the OR operator. 1,000 Web documents
were used for calculating the cover rates. In this experiment, we
calculated cover rates of the OOV words and important words just as
in the previous experiment; the difference was that in the previous
experiment each query consisted of just one keyword and we had more
than one query for one spoken document, while in this experiment
we had only one query composed of multiple keywords for
one spoken document. The cover rates of all queries were
averaged. Fig. 8 shows the experimental result. This result
suggests that a query composed with the AND operator outperforms
a query composed with the OR operator.
Fig. 8. Cover rate by different query composition methods
IV. SELECTION OF RELEVANT DOCUMENTS
Finally, we investigated a method to select the relevant documents
from all of the downloaded documents. The previous
result showed that we could gather relevant documents when
gathering 1,000 documents. However, in some cases we do
not necessarily need as many as 1,000 documents. When we
need only 100 documents, we can retrieve 100 documents
using the Web search engine. Here we must consider that the
retrieval results obtained from a search engine are ranked using
a score that considers the relationship between Web documents, such
as PageRank. Therefore, the score of a document obtained
from a search engine does not necessarily reflect the relevance of
a downloaded document to the spoken document. When using
a smaller number of documents, it is desirable to score the
downloaded documents so that the “more useful” documents
have higher scores, and to choose a smaller number of documents
according to that score.
In this experiment, we aim at obtaining documents that
contain more OOV words. Gathering documents with more
important words can be another objective, but we focused on
the OOV words in this experiment.
A straightforward way of evaluating the relevance of two documents
is to calculate the cosine similarity of those documents.
When a Web document Ti is given, we can calculate the cosine
similarity as
$$\mathrm{sim}_D(T_i,T_a) = \frac{v(T_i) \cdot v(T_a)}{|v(T_i)|\,|v(T_a)|}$$
However, this similarity does not take into account that Ta contains
recognition errors, as explained before. In fact, the correlation
coefficient between this similarity and the OOV cover rate of a
Web document was only 0.30.
The objective of document selection here is to select those
documents that contain more OOV words of the spoken
document. To this end, we developed a similarity considering
the “possibility of containing OOV words” as
$$\mathrm{sim}_o(T_i,T_a) = \mathrm{sim}_D(T_i,T_a) \left( 1 - \frac{|V(T_i) \cap V(T_a)|}{|V(T_i)|} \right)$$
This value is high when the document has a word distribution
similar to that of the transcription and contains many words that are
not included in the automatic transcription of the spoken
document. The correlation coefficient between this metric and
the OOV cover rate was 0.45, which is fairly high considering
that this value is calculated without knowing the real OOV
words.
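The OOV-oriented similarity can be sketched as follows. Two points are assumptions of this sketch rather than statements of the method: the overlap term is normalized by the vocabulary size of the Web document, and both document vectors use raw term frequencies. The function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def oov_aware_similarity(web_tokens, transcript_tokens):
    """Cosine similarity discounted by vocabulary overlap with Ta."""
    v_web = Counter(web_tokens)           # raw tf vector (assumption)
    v_ta = Counter(transcript_tokens)     # raw tf vector (assumption)
    vocab_web = set(v_web)
    if not vocab_web:
        return 0.0
    # Share of the Web document's vocabulary already present in Ta.
    overlap = len(vocab_web & set(v_ta)) / len(vocab_web)
    return cosine(v_web, v_ta) * (1.0 - overlap)
```

A document whose vocabulary is entirely contained in the transcription scores zero, since it cannot contribute any word that the recognizer failed to output.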
V. CONCLUSION
In this paper, we proposed a method of retrieving documents
relevant to a spoken document. This method is based on
keyword selection from an automatic transcription of the spoken
document, and it takes into account that the automatic transcription
contains many recognition errors.
To retrieve relevant documents, we developed three novel
methods. The first one is to evaluate the relevance of a keyword to
the spoken document, which uses the text downloaded by using
the keyword as a Web query. The second one is to exclude
keywords that derive from misrecognitions and compose a
query, which is based on hierarchical clustering. The third
one is to estimate how useful a document is in terms
of the OOV cover rate.
As future work, we are going to evaluate the proposed
method more directly. In addition, we are applying the
proposed method to indexing of spoken documents and
language model adaptation for speech recognition.
ACKNOWLEDGMENT
This work is partially supported by Microsoft Research
Asia, IJARC Core Project.
REFERENCES
 U. Glavitsch and P. Schäuble, “A system for retrieving speech documents,”
Proc. ACM SIGIR Conf. on Research and Development in
Information Retrieval, pp. 168–176, 1992.
 J. Garofolo, G. Auzanne and E. Voorhees, “The TREC Spoken Document
Retrieval Track: A Success Story,” Proc. TREC 8, pp. 107–130, 2000
 L. L. Mølgaard, K. W. Jørgensen and L. K. Hansen, “Castsearch – Context
Based Spoken Document Retrieval,” Proc. ICASSP, 2007.
 Z.-Y. Zhou, P. Yu, C. Chelba and F. Seide, “Towards Spoken-Document
Retrieval for the Internet: Lattice Indexing for Large-Scale Web-Search
Architectures,” Proc. HLT-NAACL, pp. 415–422, 2006.
 M. Larson and S. Eickeler, “Using Syllable-based Indexing Features and
Language Models to improve German Spoken Document Retrieval,” Proc.
Eurospeech, pp. 1217–1220, 2003.
 S. W. Lee, K. Tanaka and Y. Itoh, “Open-vocabulary spoken document
retrieval based on multilingual subphonetic segment recognition,” Proc.
of 18th International Congress on Acoustics (ICA2004), Vol. II, pp.1723–
 T. Takaki, A. Fujii and T. Ishikawa, “Associative Document Retrieval
by Query Subtopic Analysis and its Application to Invalidity Patent
Search,” Proc. Conf. Information and Knowledge Management (CIKM
2004), pp.399–405, 2004.
 J. J. Rocchio, “Relevance feedback in information retrieval,” The SMART
Retrieval System - Experiments in Automatic Document Processing,
Prentice Hall Inc., pp. 313–323, 1971.
 T. Akiba, K. Itou and A. Fujii, “Language model adaptation for fixed
phrases by amplifying partial n-gram sequences,” Systems and Computers
in Japan, vol. 38, no. 4, pp. 63-73, 2007.
 C. Hori and S. Furui, “A new approach to automatic speech summariza-
tion,” IEEE Trans. Multimedia, vol. 5, no. 3, pp. 368–378, 2003.
 A. Ito, Y. Kajiura, S. Makino and M. Suzuki, “Unsupervised language
model adaptation based on keyword clustering and query availability
estimation,” Proc. Int. Conf. on Audio, Language and Image Processing,
pp. 1412–1418, 2008.
 J. R. Bellegarda, “Statistical language model adaptation: review and
perspectives,” Speech Communication, vol. 42, no. 1, pp. 93–109, 2004.
 K. S. Jones, “A statistical interpretation of term specificity and its
application in retrieval,” J. of Documentation, vol. 28, no. 1, pp. 11-21,
 Y. Matsuo and M. Ishizuka, “Keyword extraction from a single document
using word co-occurrence statistical information,” International Journal on
Artificial Intelligence Tools, vol. 13, no. 1, pp. 157–169, 2004.
 K. Maekawa, H. Koiso, S. Furui and H. Isahara, “Spontaneous speech
corpus of Japanese,” In Proc. Second International Conference on Lan-
guage Resources and Evaluation (LREC), pp. 947-952, 2000.
 A. Lee, T. Kawahara and K. Shikano, “Julius — an open source real-
time large vocabulary recognition engine,” In Proc. Eurospeech, pp. 1691–
 C. H. Lee, L. R. Rabiner, R. Pieraccini, and J. G. Wilpon, “Acoustic
modeling for large vocabulary speech recognition,” Computer Speech and
Language, vol. 4, no. 2, pp. 127–166, 1990.
 Yahoo! Developer Network. http://developer.yahoo.com
 R. Nisimura, K. Komatsu, Y. Kuroda, K. Nagatomo, A. Lee, H.
Saruwatari and K. Shikano, “Automatic n-gram language model creation
from Web resources,” In Proc. Eurospeech, pp. 2127–2130, 2001.
 Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, O. Imaichi, and
T. Imamura, “Japanese morphological analysis system ChaSen manual,”
NAIST Technical Report NAIST-IS-TR97007, Nara Institute of Science
and Technology, 1997.