A GENERAL ALGORITHM FOR WORD GRAPH MATRIX DECOMPOSITION
Dilek Hakkani-Tür and Giuseppe Riccardi
AT&T Labs-Research,
180 Park Avenue, Florham Park, NJ, USA
{dtur,dsp3}@research.att.com
(The authors are listed in alphabetical order.)
ABSTRACT
In automatic speech recognition, word graphs (lattices) are
commonly used as an approximate representation of the com-
plete word search space. Usually these word lattices are
acyclic and have no a priori structure. More recently, a new
class of normalized word lattices has been proposed. These
word lattices (a.k.a. sausages) are very efficient (space) and
they provide a normalization (chunking) of the lattice, by
aligning words from all possible hypotheses. In this paper
we propose a general framework for lattice chunking, the
pivot algorithm. There are four important components of
the pivot algorithm. First, the time information is not neces-
sary but is beneficial for the overall performance. Second,
the algorithm allows the definition of a predefined chunk
structure of the final word lattice. Third, the algorithm op-
erates on both weighted and unweighted lattices. Fourth, the
labels on the graph are generic, and could be words as well
as part of speech tags or parse tags. While the algorithm
has applications to many tasks (e.g. parsing, named entity
extraction) we present results on the performance of confi-
dence scores for different large vocabulary speech recogni-
tion tasks. We compare the results of our algorithms against
off-the-shelf methods and show significant improvements.
1. INTRODUCTION
In large vocabulary continuous speech recognition (LVCSR),
the word search space, which is prohibitively large, is com-
monly approximated by word lattices. Usually these word
lattices are acyclic and have no a priori structure. Their
transitions are weighted by acoustic and language model
probabilities. More recently, a new class of normalized word
lattices has been proposed [1]. These word lattices (a.k.a.
sausages) are more efficient than canonic word lattices and
they provide an alignment for all the strings in the word lat-
tices.
In this paper we propose a general framework for lattice
chunking, the pivot algorithm.

Fig. 1. The state transition matrices for topologically sorted
traditional lattices and pivot alignments. The entry $a_{i,j}$ is $1$ if
there is at least one transition between states $i$ and $j$, and $0$
otherwise.

In terms of the state transition
matrix this corresponds to decomposing the lattice transi-
tion matrix into a block diagonal (chunk) matrix. Figure 1
shows the state transition matrices for topologically sorted
traditional lattices and the new type of lattices we propose,
the pivots. The elements $a_{i,j}$ are binary: $a_{i,j}$ is $1$ if there is at least
one transition between states $i$ and $j$, and $0$ otherwise. In the rest
of the paper we will refer to $a_{i,j}$ as the equivalence class of
state transitions from state $i$ to state $j$. The state transitions
can be weighted or unweighted. In the weighted case, the
cost associated with the transition from state $i$ to state $j$ with
label $w$ is written $c_{i,j}(w)$.
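As a toy illustration of this matrix view (the states, labels, and the dense-matrix representation below are our own choices for exposition, not the paper's data structures), the following Python sketch builds the binary matrix $a_{i,j}$ for a small lattice and for its chunked (pivot) counterpart; after chunking, every transition connects consecutive chunk boundaries.

# Toy sketch: binary state-transition matrices for a small lattice and a
# pivot alignment. Only the definition of a_{i,j} follows the paper; the
# example states and labels are invented.

def transition_matrix(num_states, transitions):
    # a[i][j] = 1 if there is at least one transition from state i to state j
    a = [[0] * num_states for _ in range(num_states)]
    for i, j, _label in transitions:
        a[i][j] = 1
    return a

# A small acyclic word lattice with topologically sorted states 0..3.
lattice = [(0, 1, "hi"), (0, 2, "high"), (1, 3, "there"), (2, 3, "there")]
# The same hypotheses after chunking: states 0..2, one chunk per position.
pivot = [(0, 1, "hi"), (0, 1, "high"), (1, 2, "there")]

for name, n, trans in (("lattice", 4, lattice), ("pivot", 3, pivot)):
    print(name)
    for row in transition_matrix(n, trans):
        print(" ", row)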
There are four important components of the pivot algo-
rithm:
1. The time information computed using the frame num-
bers is not necessary but is beneficial for the overall
performance.
2. The algorithm allows the definition of a predefined
chunk structure for the final lattice.
3. The algorithm operates on both weighted and unweighted
lattices.
4. The labels on the graph are generic and could be words
as well as part of speech tags or parse tags.
We describe the algorithm here in the context of automatic
speech recognition (ASR). Lattice chunking has the clear
advantage of normalizing the search space of word hypothe-
ses. The advantages of these normalized lattices are in terms
of memory and computation:
Memory: The resulting structures (pivots) are much
smaller in size (by orders of magnitude), while preserving
the accuracy of the original search space.

Computation: Normalized matrices have the compositional
property. Suppose that we want to compute the word string
$\hat{W}$ with the lowest weight among all strings in the lattice:

  $\hat{W} = \operatorname{argmin}_{W \in \Sigma_1 \Sigma_2 \cdots \Sigma_N} c(W)
           = \operatorname{argmin}_{w_1 \in \Sigma_1} c(w_1)\;
             \operatorname{argmin}_{w_2 \in \Sigma_2} c(w_2) \cdots
             \operatorname{argmin}_{w_N \in \Sigma_N} c(w_N)$    (1)

where $\Sigma_i$ is the set of word strings
recognized by the $i$-th lattice chunk.
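To make the compositional property concrete, here is a minimal sketch under our own assumptions (a pivot alignment is represented simply as a list of chunks, each mapping a word to a cost; the function name and data layout are not from the paper): since the total cost is additive over chunks, the globally lowest-weight string is obtained by picking the lowest-cost word independently in each chunk.

# Sketch of the compositional property: for a chunked (pivot) lattice, the
# lowest-weight word string is the concatenation of the per-chunk
# lowest-weight words. The chunk contents below are invented.

def best_string(chunks):
    # chunks: list of {word: cost} dictionaries, one per pivot chunk
    return [min(chunk, key=chunk.get) for chunk in chunks]

pivot = [
    {"hi": 0.2, "high": 1.1},
    {"there": 0.1, "their": 1.8},
]
print(best_string(pivot))  # ['hi', 'there']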
There are many applications where these properties have
been very useful. In the case of weighted lattices, the transi-
tion probabilities on the pivots can also be used as word con-
fidence scores. The posterior probabilities on the most prob-
able path of the resulting pivot alignment have been used
as confidence scores for unsupervised learning of language
models [2] and active learning for ASR [3]. The pivot struc-
ture of competing word hypotheses, as well as their confi-
dence scores have been used for improving spoken language
understanding [4], machine translation [5] and named entity
extraction [6]. In [7] the compositional property has been
extended to the case of weighted string costs. In this pa-
per we present the application of the pivot algorithm to the
computation of word confidence scores for all the strings in
a word lattice. We present results on the performance of
confidence scores for a large vocabulary continuous speech
recognition task.
In the next section, we describe the algorithm. In the
third section, we provide experimental results.
2. APPROACH
The sausage algorithm proposed in [1] is designed to re-
duce word error rate and is thus biased towards automatic
speech recognition. The pivot algorithm is general and aims
to normalize the topology of any input graph according to
a canonic form. The parameters of the algorithm can be
used to optimize a specific cost function (e.g., word error
rate). The algorithm is summarized in Figure 2 and a brief
description of the steps is given below:
1. If the lattice is weighted, we first compute the pos-
terior probability of all transitions in the word graph,
by doing a forward and a backward pass through the
graph. At this point, the posterior probability of a
transition could be used as a confidence score by it-
self, but some improvements are possible by taking
into account the competing hypotheses in the same
time slot. In the case of unweighted lattices, we skip
this step.
2. We then sample a sequence of states that lie on a
path¹ in the lattice, to use as the baseline of the pivot
¹A path is a sequence of state transitions from the initial state to a final
state.
1. Compute the posterior probabilities of all transitions
in the word graph.
2. Extract the pivot baseline path.
3. For all transitions $t$, with label $w$, in the topologically
ordered lattice, do:
1. Using $I(t)$, find the most overlapping location
on the pivot baseline (defined by a start state $s$
and an end state $e$).
2. If there is no transition at that location that pre-
cedes $t$ in the lattice,
(a) If a transition with the same label $w$ already
occurs at that location, add the posterior prob-
ability of $t$ to the posterior probability of
that transition.
(b) Otherwise, insert a new transition at that
location with the label $w$ and the posterior prob-
ability of $t$.
3. Otherwise,
(a) Insert a new state $s'$ into the pivot align-
ment.
(b) Assign that state a time.
(c) Change the destination state of all transi-
tions originating from state $s$ to $s'$.
(d) Insert a transition between states $s'$ and
$e$, and assign it the label $w$ and the posterior of $t$.
Fig. 2. The pivot algorithm.
alignment. This path can be the best path or the longest
path of the lattice, or even a random path. The
selection of the path can be optimized towards a spe-
cific cost function (e.g., word error rate). In most of
our experiments, we either used the best or the longest
path. The states on the pivot alignment baseline are
assumed to inherit their time information from the lat-
tice. In our algorithm, the time information is not nec-
essary, but beneficial for the overall performance. We
define the time slot $I(t)$ of a transition $t$ as the speech
interval between the starting and ending time frames
of $t$.
3. In the lattice, each transition overlapping $I(t)$ is a
competitor of $t$, but competitors having the same word
label as $t$ are allies [8]. We sum the posterior prob-
abilities of all the allies of transition $t$ and we ob-
tain what we call the posterior probability of word $w$.
To compute the sum of the posterior probabilities of
all transitions labeled with word $w$ that correspond
to the same instance, we traverse the lattice in topo-
logical order and insert all transitions into the pivot
alignment baseline. When we find the most overlap-
ping location on the baseline, defined by a source and
a destination state, we check whether there is already a
transition at that location that precedes $t$ on a path in
the lattice; inserting $t$ at such a location would violate
the transition ordering defined by the initial lattice. If
there is no such transition, we check whether another
transition with the same label already occurs between
those two states. In the presence of such a transition,
we increment its posterior probability by the posterior
probability of the new transition. In the absence of a
transition with the same label, we create a new transi-
tion from the source to the destination state, with the
label and the posterior probability of the currently
traversed lattice transition. If the insertion of $t$ violates
the transition ordering of the lattice, we create a new
location by inserting a new state between the source
and destination states. We change the destination state
of all the transitions leaving the source state and make
them point to the newly inserted state. We then insert
the current lattice transition between the newly
created state and the destination state. In the current
implementation, we assign the newly inserted state
the mean of the times of the source and destination states
as its state time.
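The insertion step described above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the pivot alignment is kept as a list of chunks, each holding accumulated word posteriors, a representative time (or approximate location), and the topological indices of lattice states already placed there; the ordering test is approximated by comparing topological indices, and all names are ours.

# Illustrative sketch of the pivot insertion step (not the original code).

from dataclasses import dataclass, field

@dataclass
class Chunk:
    time: float                                  # time or approximate location
    words: dict = field(default_factory=dict)    # word label -> accumulated posterior
    placed: list = field(default_factory=list)   # topo indices of destination states placed here

def most_overlapping(chunks, t_mid):
    # Pick the chunk whose time is closest to the transition's midpoint.
    return min(range(len(chunks)), key=lambda k: abs(chunks[k].time - t_mid))

def insert_transition(chunks, label, posterior, src_topo, dst_topo, t_mid):
    k = most_overlapping(chunks, t_mid)
    chunk = chunks[k]
    # Approximate ordering test: does a transition already placed in this
    # chunk precede the new one in the lattice (by topological order)?
    violates = any(d <= src_topo for d in chunk.placed)
    if not violates:
        # Same location: merge with an existing ally or add a new hypothesis.
        chunk.words[label] = chunk.words.get(label, 0.0) + posterior
        chunk.placed.append(dst_topo)
    else:
        # New location: split by inserting a new chunk after k, timed at the
        # mean of the neighbouring chunk times, and place the transition there.
        if k + 1 < len(chunks):
            new_time = (chunk.time + chunks[k + 1].time) / 2.0
        else:
            new_time = chunk.time
        chunks.insert(k + 1, Chunk(time=new_time,
                                   words={label: posterior},
                                   placed=[dst_topo]))

The baseline chunks would be initialized from the states of the chosen baseline path, with their times taken either from frame information or from the approximate state locations described next.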
When the time information is not available, we assign each
state of the lattice its approximate location in the overall lat-
tice: the initial state is assigned location 0, the final states,
which do not have any outgoing transitions, are assigned
location 1, and all the other states in between are
assigned a real number in $(0, 1)$, obtained by dividing the
average length of all paths up to that state by the average
length of all paths that go through that state. These num-
bers can be computed by a forward and a backward pass
through the lattice. We use these approximate state loca-
tions to obtain $I(t)$. The pivot algorithm runs in $O(|T| \times k)$
time, where $|T|$ is the number of state transitions in the lat-
tice, and $k$ is the number of chunks in the resulting structure
plus the average fan-out of the pivot alignment states at the
time of the insertion; $k$ is usually much smaller than $|T|$. For
example, if the best path is used as the pivot baseline, then
$k$ is the length of the best path plus the number of state in-
sertions made and the average fan-out. This complexity is
lower than that of the sausage algorithm of Mangu et al. [1].
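A sketch of the approximate-location computation, under our own reading of the ratio above (the average length of complete paths through a state is taken to be the sum of the forward and backward average lengths; function and variable names are ours):

# Sketch: approximate state locations in [0, 1] when frame times are missing.
# States are assumed topologically sorted; edges is a list of (src, dst) pairs.

def state_locations(num_states, edges, initial=0, finals=None):
    finals = finals if finals is not None else [num_states - 1]
    succ = [[] for _ in range(num_states)]
    pred = [[] for _ in range(num_states)]
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Forward pass: number of paths from the initial state and their total length.
    n_f = [0] * num_states; len_f = [0] * num_states
    n_f[initial] = 1
    for v in range(num_states):
        for u in pred[v]:
            n_f[v] += n_f[u]
            len_f[v] += len_f[u] + n_f[u]      # every path grows by one edge

    # Backward pass: number of paths to a final state and their total length.
    n_b = [0] * num_states; len_b = [0] * num_states
    for f in finals:
        n_b[f] = 1
    for v in reversed(range(num_states)):
        for w in succ[v]:
            n_b[v] += n_b[w]
            len_b[v] += len_b[w] + n_b[w]

    locs = []
    for v in range(num_states):
        avg_f = len_f[v] / n_f[v] if n_f[v] else 0.0
        avg_b = len_b[v] / n_b[v] if n_b[v] else 0.0
        locs.append(avg_f / (avg_f + avg_b) if (avg_f + avg_b) > 0 else 0.0)
    return locs

# Toy example: a diamond-shaped lattice 0 -> {1, 2} -> 3.
print(state_locations(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # [0.0, 0.5, 0.5, 1.0]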
3. EVALUATION
We performed a series of experiments to test the quality of
the pivot alignments and the confidence scores on them.
For these experiments, we used a test set of 2,174 utter-
ances (31,018 words) from the database of the How May
I Help You? (HMIHY) system for customer care [9].
The language models used in all our experiments are tri-
gram models based on Variable Ngram Stochastic Automata
[10]. The acoustic models are subword-unit based, with tri-
phone context modeling and a variable number of Gaussians
(4-24). The word accuracy of our test set when recognized
with these models is 66.2%, and the oracle accuracy of the
output lattices is 85.7%. Oracle accuracy is the word ac-
curacy of the path in a lattice whose labels are closest to
the reference sequence. It is an upper bound on the word
accuracy that can be obtained using these lattices. To as-
sess the quality of confidence scores on pivot alignments,
we plot false rejection versus false acceptance. False re-
jection (FR) is the percentage of words that are correctly
recognized in the ASR output but are rejected because their con-
fidence score is below some threshold. False acceptance
(FA) is the percentage of words that are misrecognized but
are accepted because their confidence score is above that same
threshold. In Figure 3, we have plotted the FR versus FA
curves using different thresholds, for four different types of
confidence scores: the posterior probabilities of the transi-
tions on the best path of the lattice, the most likely path of
the pivot alignments using approximate time information,
consensus hypotheses of sausages, and the most likely path
of the pivot alignments using time information. The curve
that is closest to the origin is the best one, as it has the min-
imum error rate (false rejection and acceptance). Both pivot
alignments and sausages result in better confidence scores
than the naive approach of using the posterior probabilities
on the best path of the lattice. Although the pivot align-
ments using time information were generated in much less
time than sausages, their FR versus FA curve is almost over-
lapping with the one obtained using sausages [1]. When the
time information is not available, the FR versus FA curve
for the pivot alignments is only slightly worse than the one
obtained using time.
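For completeness, a small sketch (our own, with invented example values) of how one FR/FA operating point is computed from per-word confidence scores and correctness flags; sweeping the threshold traces a curve like those in Figure 3:

# Sketch: false rejection / false acceptance at a given threshold.
# `results` pairs each hypothesized word's confidence score with a flag
# indicating whether the word was correctly recognized (values invented).

def fr_fa(results, threshold):
    correct = [c for c, ok in results if ok]
    wrong = [c for c, ok in results if not ok]
    fr = 100.0 * sum(c < threshold for c in correct) / len(correct)
    fa = 100.0 * sum(c >= threshold for c in wrong) / len(wrong)
    return fr, fa

results = [(0.9, True), (0.8, True), (0.3, True), (0.4, False), (0.1, False)]
for th in (0.2, 0.5, 0.85):
    print(th, fr_fa(results, th))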
Fig. 3. FR versus FA curves (both in %) for various confidence scores:
best-path posteriors, pivot alignments without time information,
sausages, and pivot alignments using time information.
Another method for testing the quality of the confidence
scores is checking the percentage of correctly recognized
words for given confidence scores: one may expect a fraction $c$
of the words having a confidence score of $c$ to be
correct. Figure 4 shows our results for confidence scores
extracted from the best path of pivot alignments computed
Fig. 4. Percentage of correctly recognized words in confi-
dence score bins (x-axis: confidence score bin, 0-1; y-axis:
percentage of correctly recognized words, %).
Threshold    Oracle Accuracy
0.4          67.5%
0.2          70.8%
0.1          74.2%
0.05         77.0%
0.01         81.2%
0            86.7%

Table 1. Oracle accuracies when we pruned all transitions
having a posterior probability less than the given threshold.
without using time information. As seen, the percentage
of correctly recognized words in each confidence score bin
increases almost linearly as the confidence score increases.
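The calibration check in Figure 4 can be reproduced along these lines (again a sketch with our own names; `results` pairs confidence scores with correctness flags as in the earlier sketch):

# Sketch: per-bin accuracy of confidence scores.

def bin_accuracy(results, num_bins=10):
    counts = [0] * num_bins
    hits = [0] * num_bins
    for conf, ok in results:
        b = min(int(conf * num_bins), num_bins - 1)
        counts[b] += 1
        hits[b] += int(ok)
    return [100.0 * h / n if n else None for h, n in zip(hits, counts)]

Well-calibrated scores yield per-bin accuracies that grow roughly linearly with the bin center, as observed in Figure 4.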
To assess the quality of the pivot alignments, we have
computed oracle accuracies after pruning the pivot align-
ments with two different criteria. In Table 1, the oracle
accuracies after pruning the pivot alignments by using a
threshold for the posterior probability are presented. Any arc
which has a posterior probability less than the threshold has been pruned
from the pivot alignment; the oracle accuracy has then been
computed on the pruned pivot alignment. In Table 2, the
oracle accuracies after pruning the pivot alignments using
the rank of the transitions are presented. Between any two
states connected by a transition, only the top $n$ transitions
with the highest posterior probabilities have been retained
when computing the oracle accuracy. For example, if we
keep only the two transitions that have the highest posterior
probabilities, we can achieve an oracle accuracy of 75.5%.
These numbers indicate that, by using the top candidates in
the pivot alignments instead of just the ASR 1-best hypothesis,
it is possible to be more robust to ASR errors.
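A minimal sketch (our own; the chunk contents are invented) of the two pruning criteria applied to a single pivot chunk represented as a word-to-posterior mapping:

# Sketch of the two pruning criteria used for Tables 1 and 2, applied to one
# pivot chunk represented as a dict from word label to posterior probability.

def prune_by_threshold(chunk, threshold):
    # Remove transitions whose posterior probability is below the threshold.
    return {w: p for w, p in chunk.items() if p >= threshold}

def prune_by_rank(chunk, n):
    # Keep only the n transitions with the highest posterior probabilities.
    return dict(sorted(chunk.items(), key=lambda kv: kv[1], reverse=True)[:n])

chunk = {"there": 0.55, "their": 0.30, "bear": 0.10, "hair": 0.05}
print(prune_by_threshold(chunk, 0.1))  # drops 'hair'
print(prune_by_rank(chunk, 2))         # keeps 'there' and 'their'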
The sizes of the pivot alignments are much smaller than
the corresponding lattices. In our tests, the size of the pivot
alignments is 7% of the size of the lattices.
n      Oracle Accuracy
1      66.2%
2      75.5%
3      79.0%
4      80.9%
all    86.7%

Table 2. Oracle accuracies when we kept only the $n$ most
probable candidates between any two states.
4. CONCLUSIONS
We have proposed a general algorithm for lattice chunking.
Our algorithm does not require any time information on the
input lattice, and the labels of the lattice can be words as
well as part of speech tags or parse tags. While the algo-
rithm has applications to many tasks, such as parsing and
named entity extraction, we described the algorithm in the
context of ASR. We have presented the application of the al-
gorithm to the computation of word confidence scores. We
have compared the results of our algorithm against off-the-
shelf methods and have shown significant improvements.
5. REFERENCES
[1] L. Mangu, E. Brill, and A. Stolcke, “Finding consensus in
speech recognition: word error minimization and other appli-
cations of confusion networks,” Computer Speech and Lan-
guage, vol. 14, no. 4, pp. 373–400, 2000.
[2] R. Gretter and G. Riccardi, “On-line learning of language
models with word error probability distributions,” in Pro-
ceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2001, pp. 557–560.
[3] D. Hakkani-Tür, G. Riccardi, and A. Gorin, “Active learning
for automatic speech recognition,” in Proceedings of Inter-
national Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2002, pp. 3904–3907.
[4] G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür,
“Improving spoken language understanding using word
confusion networks,” in Proceedings of International Con-
ference on Spoken Language Processing (ICSLP), 2002, pp.
1137–1140.
[5] S. Bangalore, G. Bordel, and G. Riccardi, “Computing
consensus translation from multiple machine translation sys-
tems,” in Proc. of IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), 2001.
[6] F. Bechet, J. Wright, A. Gorin, and D. Hakkani-Tür, “Named
entity extraction from spontaneous speech in How May I
Help You?,” in Proceedings of International Conference
on Spoken Language Processing (ICSLP), 2002, pp. 597–
600.
[7] S. Kumar and W. Byrne, “Risk based lattice cutting for seg-
mental minimum Bayes-risk decoding,” in Proceedings of
International Conference on Spoken Language Processing
(ICSLP), 2002, pp. 373–376.
[8] D. Falavigna, R. Gretter, and G. Riccardi, “Acoustic and
word lattice based algorithms for confidence scores,” in Pro-
ceedings of International Conference on Spoken Language
Processing (ICSLP), 2002, pp. 1621–1624.
[9] A. Gorin, J.H. Wright, G. Riccardi, A. Abella, and T. Alonso,
“Semantic information processing of spoken language,” in
Proc. of ATR Workshop on Multi-Lingual Speech Communi-
cation, 2000, pp. 13–16.
[10] G. Riccardi, R. Pieraccini, and E. Bocchieri, “Stochastic au-
tomata for language modeling,” Computer Speech and Lan-
guage, vol. 10, pp. 265–293, 1996.