A Hierarchical Approach for Audio Stream Segmentation and Classification.
ABSTRACT This paper describes a hierarchical approach for fast audio stream segmentation and classification. With this approach, the audio stream is first segmented into audio clips by MBCR (Multiple sub-Bands spectrum Centroid relative Ratio) based histogram modeling. Then an MGM (Modified Gaussian Modeling) based hierarchical classifier assigns the segmented audio clips to six pre-defined categories in terms of discriminative background sounds, namely pure speech, pure music, song, speech with music, speech with noise, and silence. Experiments on real TV program recordings show that this approach achieves higher accuracy and recall for audio classification, at high speed, in noisy environments.
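The abstract does not spell out how MBCR is computed, but it is described as building on sub-band spectral centroids. As a minimal sketch (assuming the feature starts from per-sub-band spectral centroids of a framed signal; band edges and counts here are illustrative, not the paper's):

```python
import numpy as np

def subband_centroids(frame, sr, n_bands=4, n_fft=512):
    """Per-sub-band spectral centroids for one audio frame.

    Sketch only: the paper's MBCR feature is assumed to build on
    sub-band centroids like these before forming relative ratios
    and histograms over a clip.
    """
    spec = np.abs(np.fft.rfft(frame, n_fft))          # magnitude spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)        # bin center frequencies
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spec[lo:hi]
        f = freqs[lo:hi]
        # energy-weighted mean frequency within the band
        centroids.append(float((f * band).sum() / (band.sum() + 1e-12)))
    return centroids
```

A histogram of such per-frame values over a clip would then give the segmentation statistic the abstract alludes to.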
Available from: Lynn Donelle Wilcox
ABSTRACT: Online digital audio is a rapidly growing resource, which can be accessed in rich new ways not previously possible. For example, it is possible to listen to just those portions of a long discussion which involve a given subset of people, or to instantly skip ahead to the next speaker. Providing this capability to users, however, requires generation of the necessary indices, as well as an interface which utilizes these indices to aid navigation. We describe algorithms which generate indices from automatic acoustic segmentation. These algorithms use hidden Markov models to segment audio into segments corresponding to different speakers or acoustic classes (e.g. music). Unsupervised model initialization using agglomerative clustering is described, and shown to work as well in most cases as supervised initialization. We also describe a user interface which displays the segmentation in the form of a timeline, with tracks for the different acoustic classes. The interface can be used for direct navigation through the audio.
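The unsupervised initialization mentioned above can be illustrated with a toy agglomerative procedure: repeatedly merge the two closest clusters of frame features until the desired number of acoustic classes remains, then use the cluster means to seed the models. This is a simplified sketch under assumed Euclidean distance between cluster means, not the paper's exact algorithm:

```python
import numpy as np

def agglomerative_init(frames, k):
    """Merge the nearest pair of clusters (by Euclidean distance between
    cluster means) until k clusters remain; return the cluster means.

    Illustrative stand-in for HMM state initialization via
    agglomerative clustering; the actual distance measure and model
    form in the paper are not specified here.
    """
    clusters = [[f] for f in frames]            # start with one cluster per frame
    while len(clusters) > k:
        means = [np.mean(c, axis=0) for c in clusters]
        best = None
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                d = np.linalg.norm(means[i] - means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge closest pair
        del clusters[j]
    return [np.mean(c, axis=0) for c in clusters]
```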
ABSTRACT: We report on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. We have examined 13 features intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. We provide extensive data on system performance and the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier classifies with 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of sound. Acoustics, Speech, and Signal Processing, 1997 (ICASSP-97), IEEE International Conference on; 05/1997
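Two ideas from the abstract above can be sketched briefly: a per-frame feature (zero-crossing rate, which is one classic speech/music cue, though the paper's 13 features are not enumerated here) and the integration of frame decisions over long segments by majority vote, which is one plausible way the error drops from 5.8% to 1.4%. Both are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of sample pairs whose sign flips; speech typically shows
    higher and more variable ZCR than music. One example feature only."""
    return float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)

def integrate_frames(frame_labels, frames_per_segment):
    """Majority vote over fixed-length windows of per-frame labels,
    mimicking the gain from integrating ~2.4 s of frame decisions."""
    out = []
    for i in range(0, len(frame_labels), frames_per_segment):
        chunk = frame_labels[i:i + frames_per_segment]
        out.append(max(set(chunk), key=chunk.count))
    return out
```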
ABSTRACT: A hybrid connectionist-HMM speech recognizer uses a neural network acoustic classifier. This network estimates the posterior probability that the acoustic feature vectors at the current time step should be labelled as each of around 50 phone classes. We sought to exploit informal observations of the distinctions in this posterior domain between nonspeech audio and speech segments well-modeled by the network. We describe four statistics that successfully capture these differences, and which can be combined to make a reliable speech/nonspeech categorization that is closely related to the likely performance of the speech recognizer. We test these features on a database of speech/music examples, and our results match the previously-reported classification error, based on a variety of special-purpose features, of 1.4% for 2.5 second segments. We also show that recognizing segments ordered according to their resemblance to clean speech can result in an error rate close to the ideal minimum o...
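The four posterior-domain statistics are not listed in the abstract, but a natural candidate for this family is the mean per-frame entropy of the network's phone posteriors: frames well modeled as speech tend to produce confident (low-entropy) posteriors, while nonspeech produces diffuse (high-entropy) ones. This is an assumed example of such a statistic, not necessarily one of the paper's four:

```python
import numpy as np

def mean_posterior_entropy(posteriors):
    """Mean per-frame entropy of a (frames x classes) posterior matrix.

    Low values suggest confident, speech-like frames; high values
    suggest nonspeech. Illustrative instance of a posterior-domain
    statistic; the paper's actual four statistics are not given here.
    """
    p = np.clip(posteriors, 1e-12, 1.0)          # guard against log(0)
    entropy = -(p * np.log(p)).sum(axis=1)       # per-frame entropy (nats)
    return float(entropy.mean())
```

Thresholding (or combining) such statistics per segment would yield the speech/nonspeech decision, and ranking segments by them would order segments by resemblance to clean speech, as described above.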