This paper compares sub-word based methods for the spoken term detection (STD) task and for phone recognition. Sub-word units are needed to search for out-of-vocabulary words. We compared words, phones and multigrams. The maximal length and pruning of multigrams were investigated first; then two constrained methods of multigram training were proposed. Evaluation was performed on the NIST STD06 dev-set CTS data. The proposed method improves phone accuracy by more than 9% relative and STD accuracy by more than 7% relative.
"Each audio file of the speech corpus is automatically segmented and then passed to a large vocabulary continuous speech recognition (LVCSR) system to produce the corresponding word lattice. These lattices are then indexed using techniques such as the weighted finite state transducer (WFST) framework or N-gram indexing. In the search phase, keywords in textual form are matched against the index to produce a list of putative detections."
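The index-then-search pipeline described above can be sketched with a minimal inverted index over lattice arcs. This is an illustrative simplification, not the WFST or N-gram indexing of the cited systems: all names (`build_index`, `search`) and the lattice representation are assumptions made here.

```python
from collections import defaultdict

def build_index(lattices):
    """Map each word to its putative occurrences:
    (utterance id, start time, end time, lattice posterior)."""
    index = defaultdict(list)
    for utt_id, arcs in lattices.items():
        for word, start, end, posterior in arcs:
            index[word].append((utt_id, start, end, posterior))
    return index

def search(index, keyword):
    """Return all putative detections of a single-word keyword."""
    return index.get(keyword, [])

# toy lattices: utterance -> list of (word, start, end, posterior) arcs
lattices = {"utt1": [("hello", 0.0, 0.4, 0.9), ("world", 0.4, 0.8, 0.7)],
            "utt2": [("hello", 1.2, 1.6, 0.5)]}
idx = build_index(lattices)
print(search(idx, "hello"))  # detections in utt1 and utt2
```

A real system would index phone or multigram sequences as well, so that OOV keywords can be retrieved by their pronunciations rather than by whole-word lattice arcs.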
ABSTRACT: Many keyword search (KWS) systems make “hit/false alarm (FA)” decisions based on the lattice-based posterior probability, which is incomparable across keywords. Therefore, score normalization is essential for a KWS system. In this paper, we investigate the integration of two novel features, ranking-score and relative-to-max, into a discriminative score normalization method. These features are extracted by considering all competing hypotheses of a putative detection. A metric-based normalization method is also applied as a post-processing step to further optimize the term-weighted value (TWV) evaluation metric. We report empirical improvements over standard baselines using the Vietnamese data from IARPA's Babel program in the NIST OpenKWS13 Evaluation setup.
ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
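The TWV metric that the abstract above optimizes averages a weighted miss/false-alarm cost over all query terms. A minimal sketch of its computation, following the NIST STD definition (the function name and input layout here are our own assumptions):

```python
def twv(term_stats, total_speech_sec, beta=999.9):
    """Term-weighted value at a fixed decision threshold.

    term_stats: list of (n_true, n_correct, n_fa) tuples, one per term.
    total_speech_sec: duration of the searched speech in seconds.
    beta: false-alarm weight (999.9 in the NIST STD evaluations).
    """
    vals = []
    for n_true, n_correct, n_fa in term_stats:
        p_miss = 1.0 - n_correct / n_true
        n_nontarget = total_speech_sec - n_true  # non-target trials
        p_fa = n_fa / n_nontarget
        vals.append(1.0 - (p_miss + beta * p_fa))
    return sum(vals) / len(vals)

# perfect detection of every occurrence of a single term -> TWV = 1.0
print(twv([(10, 10, 0)], total_speech_sec=3600.0))  # 1.0
```

Because beta is large, a single false alarm costs far more than a single miss, which is why per-keyword score normalization matters: a threshold that is safe for one keyword's posterior scale can be disastrous for another's.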
"A typical STD system comprises an ASR subsystem for lattice generation and an STD subsystem for term detection, as illustrated in Fig. 1. State-of-the-art STD systems include those reported in ."
ABSTRACT: A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary terms. One challenge that OOV terms bring to STD is pronunciation uncertainty. A commonly used approach to address this problem is a soft matching procedure; another is the stochastic pronunciation modelling (SPM) proposed by the authors. In this paper we compare these two approaches, and combine them using a discriminative decision strategy. Experimental results demonstrated that SPM and soft match are highly complementary, and their combination gives significant performance improvement to OOV term detection.
Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on; 04/2010
"We take this approach in the work reported here. Other types of subword units under investigation include word-fragments, particles, graphones, multigrams, syllables and graphemes. Both in-vocabulary (INV) terms and OOV terms can be retrieved in the same way by a subword-based system, based on their pronunciations."
ABSTRACT: A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary terms. Current approaches to STD do not acknowledge the particular properties of OOV terms, such as pronunciation uncertainty. In this paper, we use a stochastic pronunciation model to deal with the uncertain pronunciations of OOV terms. By considering all possible term pronunciations, predicted by a joint-multigram model, we observe a significant performance improvement. Index Terms: joint-multigram, pronunciation model, spoken term detection, speech recognition
INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
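The core idea of stochastic pronunciation modelling is to score an OOV term by marginalizing over its candidate pronunciations rather than committing to a single one. A minimal sketch, assuming the joint-multigram G2P model has already produced pronunciation posteriors (the function name, the phone strings, and both example dictionaries are hypothetical):

```python
def spm_score(pron_posteriors, detection_scores):
    """Term detection score marginalized over pronunciations:
    sum over p of P(pron=p | term) * P(detected | pron=p)."""
    return sum(pron_posteriors[p] * detection_scores.get(p, 0.0)
               for p in pron_posteriors)

# hypothetical G2P posteriors for an OOV term with two candidate
# pronunciations, and per-pronunciation detection confidences
prons = {"k ae t": 0.7, "k aa t": 0.3}
scores = {"k ae t": 0.8, "k aa t": 0.4}
print(spm_score(prons, scores))  # 0.7*0.8 + 0.3*0.4 = approx. 0.68
```

Summing rather than taking the best pronunciation rewards terms whose evidence is spread across several plausible pronunciations, which is exactly the uncertainty the abstract above targets.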