Intelligent Transcription System
based on Spontaneous Speech Processing
Tatsuya Kawahara
Academic Center for Computing and Media Studies, Kyoto University
Sakyo-ku, Kyoto 606-8501, Japan
kawahara@i.kyoto-u.ac.jp
http://www.ar.media.kyoto-u.ac.jp/
Abstract
With the improvement of speech recognition technology, semi-automatic generation of transcripts or document records of lectures and meetings has become one of its promising applications. For this purpose, we need to take into account post-processing that includes cleaning of verbatim transcripts and segmentation into sentence and paragraph units. This article presents a novel statistical framework for an intelligent transcription system. The recent progress of automatic speech recognition of lectures and meetings is also reported. Then, several approaches to sentence unit detection and disfluency detection are described, as they are significant in the post-processing of transcripts generated by the speech recognizer.
1. Introduction
Speech has been one of the most fundamental means of communication by which human beings have exchanged knowledge and opinions. Even today, with the prevalence of e-mail and the internet, new ideas are discussed and important decisions are made primarily through speech communication at seminars and meetings. Speech communication, however, is "volatile" in nature, and thus must be recorded. Recently, speech media can be stored digitally as is, but records are usually saved in the form of text for easy browsing and search.
Speech-to-text systems, or automatic speech recognition systems, have been investigated extensively for many years. Most of these studies, however, have focused on human-machine interfaces such as dictation systems. On the other hand, automatic transcription of spontaneous speech, such as human-to-human communication, is far more difficult because of the large variation in both acoustic and linguistic characteristics. Moreover, spontaneous speech processing requires the development of a different paradigm, in that a faithful transcription is not necessarily useful because of the existence of disfluencies and the lack of sentence and paragraph markers. Utterances are made while thinking during interactions, and thus disfluency is inevitable. Indeed, cleaning of the transcript is performed by human stenographers when making records of lectures and meetings. This process involves the correction of colloquial expressions into document-style expressions. Speech is a time-dimensional signal, so a transcript is simply a sequence of words, which corresponds to a text without punctuation or line breaks. In spontaneous Japanese in particular, sentences are easily concatenated without explicit endings. Thus, the following issues must be addressed:
- deletion of disfluencies and redundant words
- correction of colloquial expressions and recovery of omitted particles
- segmentation of sentences and paragraphs
In making lecture notes and meeting minutes, it is also necessary to extract important sentences and compress them for a summary. Automatic speech summarization has also become an important research topic. However, this article focuses on the generation of speech transcripts that are both faithful and readable. Specifically, we have been investigating the development of automatic transcription systems for lectures and meetings, which can be used for the generation of records of lectures and meetings [1]. Applications also include the next-generation transcription system for the Japanese Diet (Congress).
The remainder of this article is organized as follows. The proposed intelligent transcription system is described in Section 2. Section 3 summarizes the recent progress of automatic speech recognition for lectures and meetings. Sections 4 and 5 describe approaches to sentence unit detection and disfluency detection, respectively. These issues are also addressed in the Metadata Extraction (MDE) task under the DARPA EARS project [2][3][4]. Finally, future areas for consideration are discussed in Section 6.
2. Intelligent Transcription System
2.1. System Overview
An overview of the proposed intelligent transcription system, which combines the cleaning post-process with conventional automatic speech recognition, is illustrated in Figure 1. We adopt the framework of statistical machine translation (SMT) to transform a verbatim transcript V into a document-style text W. In a formulation similar to that of automatic speech recognition, the proposed framework decomposes the posterior probability p(W|V) into two probabilities by the Bayes rule. The language model probability p(W) can be reliably trained using an enormous number of documents, including newspaper articles and Web texts, whereas the translation probability p(V|W) must be estimated with a parallel corpus of smaller size that aligns verbatim transcripts of utterances V and cleaned texts for documentation W.
We extend this framework to estimate the language model probability p(V) for speech recognition, as below, by considering that the training text size for p(W) is much larger than the size of the verbatim transcripts needed for training p(V).

p(V) = p(W) · p(V|W) / p(W|V)

Here, p(V|W) and p(W|V) are estimated with the same parallel corpus. The probability p(V|W) models the generation process of the spontaneous utterances, whereas the probability p(W|V) is used for the cleaning process.
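To make this noisy-channel formulation concrete, the following is a minimal sketch in Python of how a cleaned text W could be selected for a verbatim transcript V by maximizing log p(W) + log p(V|W). The probabilities and the two scoring functions are toy stand-ins for illustration only, not the actual models of the proposed system.

    import math

    # Toy document-style language model log p(W): a real system would use an
    # N-gram model trained on large document archives.
    def log_p_W(clean_words):
        unigram = {"the": -1.2, "committee": -4.5, "approved": -5.0, "budget": -4.8}
        return sum(unigram.get(w, -9.0) for w in clean_words)

    # Toy translation model log p(V|W): deleted fillers are free, any other
    # mismatch between verbatim and clean text is penalized.
    def log_p_V_given_W(verbatim_words, clean_words):
        fillers = {"uh", "um", "well"}
        extra = [w for w in verbatim_words if w not in clean_words]
        return sum(0.0 if w in fillers else -6.0 for w in extra)

    def clean(verbatim_words, candidate_texts):
        """Pick the candidate cleaned text W maximizing log p(W) + log p(V|W)."""
        return max(candidate_texts,
                   key=lambda W: log_p_W(W) + log_p_V_given_W(verbatim_words, W))

    verbatim = ["uh", "the", "committee", "um", "approved", "the", "budget"]
    candidates = [
        ["the", "committee", "approved", "the", "budget"],   # cleaned hypothesis
        ["uh", "the", "committee", "um", "approved", "the", "budget"],  # verbatim copy
    ]
    print(clean(verbatim, candidates))   # the cleaned hypothesis wins

In a real system the candidate set would be produced by the translation model itself (for example, by hypothesizing deletions of fillers and substitutions of colloquial expressions), rather than enumerated by hand.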
2.2. Analysis using the Diet Corpus
For a parallel corpus that aligns verbatim transcripts and document-style texts, we have compiled the "Diet Corpus" by transcribing the actual utterances made in the Lower House of the Japanese Diet (Congress) and aligning them with the official meeting records.
[Figure 1. Overview of the proposed transcription system: the input speech X is decoded by ASR into a verbatim transcript V (argmax p(V)p(X|V)), which is then converted by SMT into a document record W (argmax p(W)p(V|W)). The document-style model p(W) is trained on huge document archives, the translation models p(V|W) and p(W|V) are trained with a small aligned corpus, and p(V) is derived as p(W)·p(V|W)/p(W|V).]
The current size of the corpus is approximately two million words of text, or 150 hours of speech. For the morphological analysis, we used ChaSen 2.2.3 with IPADIC 2.4.4. We investigated the differences between the transcripts of actual utterances V and the document-style texts W observed in this corpus. The transformation process from the former V to the latter W is classified into deletions, insertions, and substitutions.
A summary of the statistics for the three categories and typical, or the most frequent, examples thereof is shown in Table 1. Differences falling into one of these categories are observed for 11% of the words in total. The majority of these differences concern the deletion of redundant words. These words include not only fillers but also several end-of-sentence expressions, such as "desune" and "to". False starts and portions of self-repairs are also deleted, but their lexical patterns vary widely. On the other hand, most of the insertions into documented texts are functional words and verb suffixes "i", such as "shi-te-(i)-ru" and "ki-te-(i)-ru". The substituted words are related to colloquial expressions.
2.3. Style Conversion of Language Model
A direct approach to statistical modeling of these phenomena involves counting the frequencies of conversion patterns. However, in the cleaning process, redundant words, such as fillers, are always deleted, and colloquial expressions are always substituted, so that p(W|V) for these patterns is set to 1. The other patterns specified in Table 1 are to be estimated using the corpus. Since the translation probability apparently depends on the neighboring words, context-dependent modeling is desirable, but the data sparseness problem is encountered. Therefore, we introduce a Part-Of-Speech-based (POS-based) contextual model as a back-off model [5].

Table 1. Major differences between spontaneous speech and document-style text

                                            p(W|V)     p(V|W)     frequency   examples
  deletions of redundant words              1          estimate   8.5%        ee, desune, ano, maa
  insertions of missing particles           estimate   estimate   1.0%        i, wa, wo
  substitutions of colloquial expressions   1          estimate   1.8%        toiu/teiu, keredomo/kedomo
The size of the official records of the Diet is so large (for example, 71M words for four years' worth of documents) that reliable statistics can be estimated for the document-style model p(W) and then transformed to the verbatim model p(V).
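As an illustration of the context-dependent conversion probability with a POS-based back-off, the following is a minimal sketch; the pattern tables, counts, and cutoff are hypothetical values, whereas in the actual system they are estimated from the parallel Diet Corpus.

    # Word-context conversion patterns observed often enough are used directly;
    # otherwise the model backs off to the POS context of the same pattern.

    word_context_prob = {   # (left word, pattern, right word) -> probability
        ("shi", "+i", "ru"): 0.62,      # insert verb suffix "i" between "shi" and "ru"
    }
    word_context_count = {("shi", "+i", "ru"): 37}
    pos_context_prob = {    # (left POS, pattern, right POS) -> probability
        ("VERB", "+i", "AUX"): 0.41,
    }
    MIN_COUNT = 5           # hypothetical reliability cutoff

    def conversion_prob(left_word, left_pos, pattern, right_word, right_pos):
        key = (left_word, pattern, right_word)
        if word_context_count.get(key, 0) >= MIN_COUNT and key in word_context_prob:
            return word_context_prob[key]                       # reliable word-based estimate
        return pos_context_prob.get((left_pos, pattern, right_pos), 0.0)   # POS back-off

    print(conversion_prob("shi", "VERB", "+i", "ru", "AUX"))    # word-context hit
    print(conversion_prob("ki", "VERB", "+i", "ru", "AUX"))     # falls back to the POS context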
3. Spontaneous Speech Recognition
We have also been intensively studying automatic
speech recognition (ASR) of lectures and meetings.
The recent progress is described in this section.
Here, we used the Corpus of Spontaneous Japanese (CSJ) [6][7] as the primary corpus, in addition to the Diet corpus.
3.1. Acoustic Modeling
Spontaneous speech has greater variation in both spectral and temporal structure than read-style speech. As the acoustic variation is largely dependent on the speaker, feature normalization techniques such as vocal tract length normalization (VTLN) [8][9] are effective. The speaker adaptive training (SAT) scheme [10][11] has also been investigated.
3.2. Pronunciation Variation Modeling
The phonetic variation caused by spontaneous utterance can also be modeled in a pronunciation dictionary, in which a list of possible phone sequences for each word is defined. While the orthodox pronunciation forms are referred to as baseforms, the variants observed in spontaneous speech are called "surface forms". These surface form entries are often derived from speech data by aligning them with phone models [12]. In the CSJ, the actual phonetic (kana) transcription is given manually, so the set of surface forms is easily defined. However, the simple addition of surface form entries results in the side-effect of false matching. Thus, effective but constrained use of these surface forms is necessary.
One approach is statistical modeling, which is similar to language modeling. Namely, a unigram probability of each pronunciation form is assigned and is multiplied by the language model probability in decoding. In this case, the statistical framework of speech recognition is reformulated as:

w = argmax_{w,p} P(x|p) P(p|w) P(w)

Here, P(p|w) is the pronunciation probability of surface form p for word w, while P(x|p) and P(w) represent the conventional acoustic model probability and language model probability for input x, respectively. We compared statistical models using the CSJ and concluded that the cutoff of less frequent surface forms is crucial, and that the unigram model is effective, whereas the trigram model gives only a marginal gain [13].
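The following is a minimal sketch of decoding with this criterion: each surface form is scored by the acoustic model, its pronunciation probability, and the language model. The lexicon entries, probabilities, and the two stand-in scoring functions are toy assumptions for illustration.

    import math

    # Hypothetical lexicon: word -> {surface form (phone string): P(p|w)}.
    # Less frequent surface forms would be cut off before building this table.
    lexicon = {
        "sentakuki": {"s e N t a k u k i": 0.7, "s e N t a k k i": 0.3},
    }

    def acoustic_logscore(x, phones):
        """Stand-in for the acoustic model log P(x|p)."""
        return -0.5 * abs(len(x) - len(phones.split()))

    def lm_logscore(word):
        """Stand-in for the language model log P(w)."""
        return math.log(0.01)

    def best_hypothesis(x, words):
        """Return the best (score, word, surface form) under P(x|p)P(p|w)P(w)."""
        best = None
        for w in words:
            for p, prob in lexicon[w].items():
                score = acoustic_logscore(x, p) + math.log(prob) + lm_logscore(w)
                if best is None or score > best[0]:
                    best = (score, w, p)
        return best

    # Toy "observation" of 8 frames favors the reduced 8-phone surface form.
    print(best_hypothesis(list(range(8)), ["sentakuki"]))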
When the surface forms are derived for word units, they are dependent on the task and corpus, and are not necessarily applicable to different tasks. Phone-based modeling of pronunciation variation is more general and portable to various lexicons. Surface forms are obtained by applying such a model to the phone sequences of baseforms. We proposed a generalized modeling of subword-based mapping between baseforms and surface forms using variable-length phone contexts [14]. The variation patterns of phone sequences are automatically extracted together with their contexts of up to two preceding and following phones, which are decided by their occurrence statistics. A set of rewrite rules is then derived with their probabilities and variable-length phone contexts. The model effectively predicts pronunciation variations depending on the phone context using a back-off scheme. The model was applied and evaluated on two transcription tasks whose domains are different from the training corpus (CSJ). The effects of the predicted surface form entries and their probabilities were evaluated individually, and each was found to have a similar impact on overall performance.
3.3. Language Model
Language model training for spontaneous speech is much more difficult than that for dictation systems, which can make use of huge language resources such as newspaper articles and Web pages. Most of the available language data are written text and are mismatched with the spoken style. For language modeling of spontaneous speech, a great deal of transcription is essential, but it comes at a huge cost.

The most widely used solution to enhance language model training data is to combine or interpolate with other existing text databases, which are not necessarily spontaneous speech corpora but are related to the target task domain. These include proceedings of lectures, minutes of meetings, and closed captions for broadcast programs. Recently, the World Wide Web has become a major language resource, and quite a few Web sites contain spoken-style documents, such as records of lectures and meetings. We also studied a method to filter spoken-style texts from Web pages [15].
As described in Section 2, we propose a novel "translation" approach that estimates the language model statistics (N-gram counts) of spontaneous speech from a large document-style corpus based on the framework of statistical machine translation (SMT) [5]. The translation is designed for modeling characteristic linguistic phenomena in spontaneous speech, such as the insertion of fillers, and for estimating their occurrence probabilities. These contextual patterns and probabilities are derived from a small parallel corpus that aligns faithful transcripts with their documented records. This method was successfully applied to the estimation of the language model for the Diet meetings from the archives of their minutes.
3.4. Adaptation of Acoustic Model
Since the variation of acoustic features is very large in spontaneous speech, speaker adaptation of the acoustic model is effective and almost essential. Although there are many factors affecting the acoustic characteristics of spontaneous speech, such as speaking rate and speaking style, speaker adaptation is a simple solution that handles all of these factors in an implicit manner. The acoustic model adaptation also involves channel adaptation; that is, the characteristics of rooms and microphones are also normalized.

In particular, in lectures and meetings, each speaker makes many utterances in the same session. Thus, a considerably large amount of data is available to conduct unsupervised adaptation in a batch mode, where the initial speech recognition result obtained with the speaker-independent model is used for adaptation of the acoustic model, which is then used for rescoring or re-decoding. Standard adaptation techniques such as maximum likelihood linear regression (MLLR) are used, and filtering the reliable recognition hypotheses with confidence measures can also be incorporated.
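A minimal sketch of this unsupervised batch procedure is given below. All of the recognizer and adaptation functions are hypothetical stubs passed in by the caller, since the concrete implementation depends on the ASR toolkit; the confidence threshold is likewise an assumption.

    def batch_adapt(session_audio, si_model, decode, estimate_mllr, confidence):
        """Two-pass unsupervised adaptation: decode, adapt on reliable
        hypotheses, then re-decode with the adapted model."""
        # First pass with the speaker-independent model
        hyps = [decode(utt, si_model) for utt in session_audio]
        # Keep only hypotheses judged reliable by the confidence measure
        reliable = [(utt, h) for utt, h in zip(session_audio, hyps)
                    if confidence(h) > 0.8]
        # Estimate MLLR transforms from the reliable first-pass output
        adapted_model = estimate_mllr(si_model, reliable)
        # Second pass: re-decode (or rescore) the whole session
        return [decode(utt, adapted_model) for utt in session_audio]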
3.5. Adaptation of Language Model
Adaptation of the language model is also important to deal with the variety of topics and speaking styles. In lectures and meetings, the topic is focused and consistent throughout the entire session. Therefore, language model adaptation is feasible even in an unsupervised or batch mode, as in the acoustic model adaptation, and computationally expensive methods can be allowed in off-line transcription tasks.

The simplest method is to construct an N-gram model from the initial speech recognition result and interpolate it with the baseline model. We also investigated methods to select the most relevant texts from the corpus based on the initial recognition result [13]. As criteria for text selection, we used the tf-idf measure and the perplexity by the N-gram model generated from the initial speech recognition result, and demonstrated that they have comparable and significant effects in improving speech recognition performance.
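The following sketch illustrates the simplest interpolation-based adaptation described above, reduced to unigrams for brevity (the same idea applies to higher-order N-grams); the counts, probabilities, and interpolation weight are toy values for illustration.

    from collections import Counter

    def unigram_model(words):
        """Build a maximum-likelihood unigram model from a word list."""
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def interpolate(p_baseline, p_adapted, lam=0.8):
        """P(w) = lam * P_baseline(w) + (1 - lam) * P_adapted(w)."""
        vocab = set(p_baseline) | set(p_adapted)
        return {w: lam * p_baseline.get(w, 0.0) + (1 - lam) * p_adapted.get(w, 0.0)
                for w in vocab}

    baseline = {"budget": 0.001, "committee": 0.002, "the": 0.05}
    first_pass = ["the", "budget", "committee", "discussed", "the", "budget"]
    adapted = interpolate(baseline, unigram_model(first_pass))
    print(adapted["budget"])   # boosted relative to the baseline estimate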
The cache model [16] and the trigger model [17] increase the probability of words recently used in the utterances or talk, or of words directly related to the previous topic words. We also proposed a trigger-based language model adaptation method oriented to meeting transcription [18]. The initial speech recognition result is used to extract task-dependent trigger pairs and to estimate their statistics. This method achieved a remarkable perplexity reduction of 28%.
Recently, latent semantic analysis (LSA), which maps documents into implicit topic sub-spaces using singular value decomposition (SVD), has been investigated extensively for language modeling [19]. A probabilistic formulation, PLSA [20], is powerful for characterizing topics and documents in a probabilistic space and for predicting word probabilities. We proposed an adaptation method based on two sub-spaces of topics and speaker characteristics [21]. Here, PLSA is performed on the initial speech recognition result to provide unigram probabilities conditioned on the input speech, and the baseline model is adapted by scaling the N-gram probabilities with these unigram probabilities. The method was applied to the automatic transcription of panel discussions and was shown to be effective.
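The unigram-scaling step can be sketched as follows: the baseline N-gram probability is multiplied by the ratio of the adapted (e.g., PLSA-based) unigram to the baseline unigram and then renormalized. The exponent and the stub probabilities are assumptions for illustration, not the values used in [21].

    def scale_ngram(p_base_given_h, p_base_uni, p_adapt_uni, alpha=0.5):
        """Scale P(w|h) by (P_adapt(w)/P_base(w))^alpha and renormalize."""
        scaled = {w: p * (p_adapt_uni.get(w, p_base_uni[w]) / p_base_uni[w]) ** alpha
                  for w, p in p_base_given_h.items()}
        z = sum(scaled.values())
        return {w: p / z for w, p in scaled.items()}

    p_base_given_h = {"budget": 0.01, "weather": 0.02, "the": 0.30}    # baseline P(w|h)
    p_base_uni     = {"budget": 0.001, "weather": 0.002, "the": 0.05}  # baseline unigram
    p_plsa_uni     = {"budget": 0.004, "weather": 0.0005, "the": 0.05} # topic-adapted unigram

    print(scale_ngram(p_base_given_h, p_base_uni, p_plsa_uni))   # "budget" is boosted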
3.6. Speech Recognition Performance
A summary of the current speech recognition performance for the CSJ (lectures) and the Diet (Budget Committee meeting) is given in Table 2. By combining the adaptation of acoustic and language models using the initial speech recognition result, the word accuracy was improved to around 82% for both evaluation tasks.

Table 2. Speech recognition performance for lectures & Diet meetings (word accuracy)

  method                        lecture   Diet
  baseline                      76.6      74.1
  + speaker normalization       78.3      76.7
  + acoustic model adaptation   81.2      80.5
  + language model adaptation   81.8      81.5
4. Sentence Boundary Detection
Detection of the sentence unit is vital for linguistic processing of spontaneous speech, since most conventional natural language processing systems assume that the input is segmented into sentence units. Sentence segmentation is also an essential step for key sentence indexing and summary generation.

In spontaneous speech, especially in Japanese, in which subjects and verbs can be omitted, the unit of the sentence is not so evident. In the CSJ, therefore, the clause unit is first defined based on the morphological information of end-of-sentence or end-of-clause expressions. The sentence unit is then annotated by human judgment considering syntactic and semantic information.

Two approaches to the automatic detection of sentence boundaries are described in the following subsections.
4.1. Statistical Language Model (SLM)
In fluent speech or read speech of well-formed sentences, it is possible to assume that long pauses can be interpreted as punctuation marks, and the insertion of periods (= sentence boundaries) or commas can be decided by the neighboring word contexts.
Thus, the baseline method makes use of an N-gram statistical language model (SLM) that is trained using a text with punctuation symbols, in order to determine whether a pause should be converted to a period. Specifically, for the word sequence around a pause, X = (w-2, w-1, pause, w+1, w+2), a period is inserted at the place of the pause if P(W1) = P(w-2, w-1, period, w+1, w+2) is larger than P(W2) = P(w-2, w-1, w+1, w+2) by some margin. Actually, this decoding is formulated as the maximization of the likelihood log P(W) + β·|W|, where |W| denotes the number of words in W and β is the insertion penalty widely used in speech recognition.
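A minimal sketch of this decision rule is shown below with a stub bigram model and a toy insertion penalty; the actual system uses a full N-gram model trained on punctuated text, and the stub scores here are assumptions for illustration.

    # At a pause, insert a period if the penalized likelihood of the context
    # with a period exceeds that of the context without it.

    BETA = -1.5   # hypothetical word insertion penalty

    def ngram_logprob(words):
        """Stand-in for log P(W) from an N-gram model with punctuation."""
        bigram = {("desu", "."): -0.4, (".", "sore"): -1.0,
                  ("desu", "sore"): -4.0, ("<s>", "desu"): -2.0}
        return sum(bigram.get(bg, -7.0) for bg in zip(["<s>"] + words, words))

    def insert_period(left_context, right_context):
        with_period = left_context + ["."] + right_context
        without = left_context + right_context
        score_with = ngram_logprob(with_period) + BETA * len(with_period)
        score_without = ngram_logprob(without) + BETA * len(without)
        return score_with > score_without

    print(insert_period(["desu"], ["sore"]))   # True: a boundary is hypothesized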
In spontaneous speech, however, approaches that rely heavily on pauses are not successful. Speakers put pauses in places other than the ends of sentences for certain discourse effects, and disfluency causes irregular pauses (interruption points), while consecutive sentences are often uttered continuously without a pause between them.

Therefore, we introduce a more elaborate model that selects possible sentence boundary candidates by considering the contextual words (w-1 and w+1). Specifically, if these words match typical end-of-sentence expressions, a sentence boundary is hypothesized regardless of the existence of a pause, and if they match non-typical end-of-sentence expressions, a sentence boundary is hypothesized only if there is a pause. The hypotheses of sentence boundaries are then verified using the N-gram model. The method is also formulated in a statistical machine translation framework [22] and is referred to as the enhanced SLM.
4.2. Support Vector Machines (SVM)
A simpler but more general approach is to treat the pause duration as one of the features, in addition to the lexical features, and feed them into a machine learning framework. We adopted support vector machines (SVM) because there is a wide variety of cue expressions suggesting sentence endings in Japanese. In this case, sentence boundary detection is regarded as a text chunking problem [23], and we adopt the IE labeling scheme, where I and E denote inside-chunk and end-of-chunk, respectively. For every input word, a feature vector is composed of the preceding and following three words, together with their POS tags and the durations of the subsequent pauses, if any. The pause duration is normalized by the average in a turn or a talk, because it is affected by the speaking rate and differs significantly between speakers. Dynamic features, or the estimated results of preceding input parts, can also be fed into the SVM. The SVM is considered to be powerful for handling a very large number of features and for finding the critical features called "support vectors".
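The following sketch illustrates the feature extraction and IE labeling on a toy token sequence; the resulting feature/label pairs would then be encoded (for example, one-hot) and fed to an SVM trainer. The token format, tags, and durations are hypothetical.

    # token = (word, POS tag, duration of the pause after the word in seconds; 0.0 if none)
    tokens = [("kore", "N", 0.0), ("ga", "P", 0.1), ("kekka", "N", 0.0),
              ("desu", "AUX", 0.8), ("tsugi", "N", 0.0), ("ni", "P", 0.0)]
    labels = ["I", "I", "I", "E", "I", "I"]   # E marks a sentence boundary

    mean_pause = sum(t[2] for t in tokens) / len(tokens) or 1.0

    def features(i):
        """Lexical and POS features of the surrounding three words plus
        the normalized duration of the following pause."""
        feat = {}
        for offset in range(-3, 4):
            j = i + offset
            if 0 <= j < len(tokens):
                feat[f"w{offset}"] = tokens[j][0]
                feat[f"pos{offset}"] = tokens[j][1]
        feat["pause"] = tokens[i][2] / mean_pause   # normalize by the talk average
        return feat

    training_data = [(features(i), labels[i]) for i in range(len(tokens))]
    print(training_data[3])   # the feature/label pair for "desu" (a boundary)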
4.3. Experimental Evaluations
Here, we present the results evaluated on the CSJ [24]. The test-set was that used for the speech recognition evaluation and consists of 30 presentations, or 71K words in total. Both the SLM and the SVM described in the previous subsections were trained with the Core 168 presentations of 424K words, excluding the test-set. In this experiment, we used automatic speech recognition (ASR) results obtained without speaker adaptation, and the word error rate was approximately 30%. The results are summarized in Table 3, where the recall, precision, and F-measure are computed for sentence boundaries. Correct transcripts (text) are used for reference. SVM realizes significantly better performance than SLM, and is even more effective in the speech recognition case, demonstrating robustness against erroneous input. SVM is directly trained to classify boundaries, whereas SLM measures the linguistic likelihood of sentence boundaries. Moreover, the features used in SVM are independent of each other, and classification succeeds as long as a key feature (support vector) is correctly detected, whereas a single error affects the likelihood of the whole word sequence in the N-gram model. It is noteworthy that the performance degradation caused by using speech recognition results is much smaller than the word error rate.

Table 3. Results of sentence unit (boundary) detection in the CSJ

              recall   precision   F-measure
  SLM (text)   79.2     84.6        81.8
  SLM (ASR)    70.2     71.6        70.9
  SVM (text)   83.0     87.9        85.4
  SVM (ASR)    73.9     81.7        77.6
4.4. Further Extension
Another approach for further improvement is to incorporate higher-level linguistic information, such as syntactic dependency and case-frame structures. We studied an interactive framework of parsing and sentence boundary detection, and showed that dependency structure analysis can help sentence boundary detection and vice versa [25]. However, conventional dependency structure analysis does not necessarily work reliably and robustly for spontaneous speech having ill-formed sentences and disfluencies, especially for erroneous transcripts generated by speech recognition systems. Therefore, we propose a more robust method that is based on the local syntactic dependency of adjacent words and phrases [26].

We also investigate the detection of quotations, which is similar to sentence boundary detection but involves the analysis of very complex sentence structures [27].
5. Disfluency Detection
Disfluency is another prominent characteristic of spontaneous speech. Disfluency is inevitable because humans make utterances while thinking about what to say, and the pipeline processing often gets clogged. Thus, the detection of disfluencies may be useful for analyzing the discourse structure or the speaker's mental state. However, disfluencies should be removed to improve readability and to apply conventional natural language processing systems, including machine translation and summarization.

Disfluency is classified into the following two broad categories:
- fillers (such as "um" and "uh"), including discourse markers (such as "well" and "you know"), with which speakers try to fill pauses while thinking or to attract the attention of listeners.
- repairs, including repetitions and re-starts, where speakers try to correct, modify, or abort earlier statements.
Note that fillers usually appear in the middle of repairs.
Lexical filler words are usually obtained as the output of the speech recognition system, and their recognition accuracy is much the same as that of ordinary words. However, there are a number of words that also function as non-fillers, such as "well" in English and "ano" in Japanese. For these distinctions, prosodic features should be useful, since we can recognize fillers even in unfamiliar languages. Previously, we investigated the difference in the prosodic features of these words in Japanese [28].
On the other hand, the detection of self-repairs involves much more complex processes. The most conventional approach is to assume the repair interval model (RIM) [29], which consists of the following three parts in order:

(RPD) ! (DF) (RP)
(ex.) "I'm going {RPD: to Tokyo} ! {DF: no} {RP: to Kyoto}"

RPD (ReParanDum): portion to be repaired
DF (DisFluency): fillers or discourse markers
RP (RePair): portion to correct or replace the RPD
The first step of self-repair analysis based on this model is to detect the DF or interruption point (IP), noted with '!' above, which seems relatively easy to spot. The DF usually consists of filler words, and IP detection can be formulated in much the same manner as sentence boundary detection, using neighboring lexical features together with prosodic features [30].

In the CSJ, or Japanese monologue, however, we observe many cases that do not satisfy this assumption of the RIM. First, the DF or filler words are often absent. Second, the RPD and RP segments often have nothing in common on the surface level, although they may be semantically related, for example, "ana (hole) ! mizo (trench) wa ...". This phenomenon makes it extremely difficult to perform machine learning using lexical features. Actually, using SVM, we obtained a detection accuracy (F-measure) of 77.1% for the cases in which the RPD and RP segments have some words in common, but only 20.0% otherwise. This result suggests that high-level semantic information is necessary for further improvement.
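As a simple illustration of why surface-level overlap matters, the following heuristic sketch (not the SVM model used in the experiment above) scores a candidate interruption point by the number of words shared between the preceding (reparandum) and following (repair) spans; the span width is an arbitrary assumption.

    def overlap_score(words, ip, span=3):
        """Count words shared between the spans before and after a candidate IP."""
        before = set(words[max(0, ip - span):ip])
        after = set(words[ip:ip + span])
        return len(before & after)

    utterance = ["I'm", "going", "to", "Tokyo", "no", "to", "Kyoto"]
    # Candidate IP at index 4, where the filler "no" starts.
    print(overlap_score(utterance, 4))   # the shared word "to" suggests a repair

When the RPD and RP share no surface words (as in the "ana / mizo" example), such lexical evidence vanishes, which is exactly the case where the SVM with lexical features fails.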
6. Conclusions and Future Works
In this article, studies on the automatic transcription of spontaneous human-to-human speech communications were described. Specifically, we focused on sentence segmentation and disfluency detection for improved readability of the transcripts. Since errors are inevitable in automatic speech recognition, manual post-processing including error correction is necessary. Therefore, we are developing an interface for human editors to correct and edit the transcripts generated by the automatic speech recognition system [31]. Appropriate segmentation of utterances is essential for this editing process. We are also investigating the degree of speech recognition accuracy that is needed to realize an efficient transcription system including manual post-processing. The use of multiple candidates and confidence measures output by the speech recognizer should also be explored for this purpose.
One of the next targets will be note-taking of lectures for handicapped people, that is, real-time generation of closed captions. The process should incorporate compaction or a kind of summarization technique to improve readability, and the entire process must be performed with small latency; the transcripts need not be perfect as long as they are comprehensible.

In the future, integration with other media, such as video, should be studied more extensively. We are investigating this direction for lectures in our university classrooms. Furthermore, integration with knowledge processing must be explored, because speech is a medium for exchanging knowledge and is thus a source of knowledge.

Acknowledgments: This work was conducted with the contribution of research associates and graduate students, including Dr. H. Nanjo, Dr. Y. Akita, Mr. K. Takanashi, Mr. M. Mimura, Mr. K. Shitaoka, and Mr. M. Saikou.
References
[1] T.Kawahara. Spoken language processing for audio archives of lectures and panel discussions. In Proc. Int'l Conference on Informatics Research for Development of Knowledge Society Infrastructure (ICKS), pages 23–30, 2004.
[2] Y.Liu, E.Shriberg, A.Stolcke, B.Peskin, J.Ang, D.Hillard, M.Ostendorf, M.Tomalin, P.Woodland, and M.Harper. Structural metadata research in the EARS program. In Proc. IEEE-ICASSP, volume 5, pages 957–960, 2005.
[3] E.Shriberg. Spontaneous speech: How people really talk and why engineers should care. In Proc. INTERSPEECH, pages 1781–1784, 2005.
[4] Y.Liu, E.Shriberg, A.Stolcke, D.Hillard, M.Ostendorf, and M.Harper. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio, Speech & Language Process., 14(5):1526–1540, 2006.
[5] Y.Akita and T.Kawahara. Efficient estimation of language model statistics of spontaneous speech via statistical transformation model. In Proc. IEEE-ICASSP, volume 1, pages 1049–1052, 2006.
[6] S.Furui. Recent advances in spontaneous speech recognition and understanding. In Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, pages 1–6, 2003.
[7] K.Maekawa. Corpus of Spontaneous Japanese: Its design and evaluation. In Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, pages 7–12, 2003.
[8] L.Lee and R.C.Rose. Speaker normalization using efficient frequency warping procedures. In Proc. IEEE-ICASSP, pages 353–356, 1996.
[9] S.Wegmann, D.McAllaster, J.Orloff, and B.Peskin. Speaker normalization on conversational telephone speech. In Proc. IEEE-ICASSP, pages 339–342, 1996.
[10] J.W.McDonough, T.Anastasakos, G.Zavaliagkos, and H.Gish. Speaker-adapted training on the Switchboard corpus. In Proc. IEEE-ICASSP, pages 1059–1062, 1997.
[11] D.Pye and P.C.Woodland. Experiments in speaker normalisation and adaptation for large vocabulary speech recognition. In Proc. IEEE-ICASSP, pages 1047–1050, 1997.
[12] E.Fosler et al. Automatic learning of word pronunciation from data. In Proc. ICSLP, 1996.
[13] H.Nanjo and T.Kawahara. Language model and speaking rate adaptation for spontaneous presentation speech recognition. IEEE Trans. Speech & Audio Process., 12(4):391–400, 2004.
[14] Y.Akita and T.Kawahara. Generalized statistical modeling of pronunciation variations using variable-length phone context. In Proc. IEEE-ICASSP, volume 1, pages 689–692, 2005.
[15] T.Misu and T.Kawahara. A bootstrapping approach for developing language model of new spoken dialogue systems by selecting web texts. In Proc. INTERSPEECH, pages 9–12, 2006.
[16] R.Kuhn and R.De Mori. A cache-based natural language model for speech recognition. IEEE Trans. Pattern Analysis & Machine Intelligence, 12(6):570–583, 1990.
[17] R.Lau, R.Rosenfeld, and S.Roukos. Trigger-based language models: A maximum entropy approach. In Proc. IEEE-ICASSP, volume 2, pages 45–48, 1993.
[18] C.Troncoso and T.Kawahara. Trigger-based language model adaptation for automatic meeting transcription. In Proc. INTERSPEECH, pages 1297–1300, 2005.
[19] J.R.Bellegarda. A multispan language modeling framework for large vocabulary speech recognition. IEEE Trans. Speech & Audio Process., 6(5):468–475, 1998.
[20] T.Hofmann. Probabilistic latent semantic indexing. In Proc. SIGIR, 1999.
[21] Y.Akita and T.Kawahara. Language model adaptation based on PLSA of topics and speakers. In Proc. ICSLP, pages 1045–1048, 2004.
[22] T.Kawahara, M.Hasegawa, K.Shitaoka, T.Kitade, and H.Nanjo. Automatic indexing of lecture presentations using unsupervised learning of presumed discourse markers. IEEE Trans. Speech & Audio Process., 12(4):409–419, 2004.
[23] T.Kudo and Y.Matsumoto. Chunking with support vector machines. In Proc. NAACL, 2001.
[24] Y.Akita, M.Saikou, H.Nanjo, and T.Kawahara. Sentence boundary detection of spontaneous Japanese using statistical language model and support vector machines. In Proc. INTERSPEECH, pages 1033–1036, 2006.
[25] K.Shitaoka, K.Uchimoto, T.Kawahara, and H.Isahara. Dependency structure analysis and sentence boundary detection in spontaneous Japanese. In Proc. COLING, pages 1107–1113, 2004.
[26] T.Kawahara, M.Saikou, and K.Takanashi. Automatic detection of sentence and clause units using local syntactic dependency. In Proc. IEEE-ICASSP, 2007 (accepted for presentation).
[27] R.Hamabe, K.Uchimoto, T.Kawahara, and H.Isahara. Detection of quotations and inserted clauses and its application to dependency structure analysis in spontaneous Japanese. In Proc. COLING-ACL (Poster Sessions), pages 324–330, 2006.
[28] F.M.Quimbo, T.Kawahara, and S.Doshita. Prosodic analysis of fillers and self-repair in Japanese speech. In Proc. ICSLP, pages 3313–3316, 1998.
[29] C.Nakatani and J.Hirschberg. A speech-first model for repair detection and correction. In Proc. ARPA Human Language Technology Workshop, pages 329–334, 1993.
[30] Y.Liu, E.Shriberg, A.Stolcke, and M.Harper. Comparing HMM, maximum entropy and conditional random fields for disfluency detection. In Proc. INTERSPEECH, pages 3033–3036, 2005.
[31] H.Nanjo, Y.Akita, and T.Kawahara. Computer assisted speech transcription system for efficient speech archive. In Proc. Western Pacific Acoustics Conference (WESPAC), 2006.