Detection and tracking of the lip contour is an important issue in speechreading. While there are solutions for lip tracking once a good contour initialization in the first frame is available, the problem of finding such an initialization is not yet solved automatically, but done manually. We have developed a new tracking solution for lip contour detection using only a few landmarks (15 to 25) and applying the well-known Active Shape Models (ASM). The proposed method is a new LMS-like adaptive scheme based on an autoregressive (AR) model fitted to the landmark variations in successive video frames. Moreover, we propose an additional motion compensation model to address more general cases in lip tracking. Computer simulations demonstrate a good match between the true and the estimated pixel positions. Significant improvements over the well-known LMS approach have been obtained, as measured by a Frobenius norm index.
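To make the adaptive scheme concrete, here is a minimal sketch of LMS-adapted AR prediction of landmark positions across frames. The AR order, step size, shared-coefficient choice, and normalized update are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def lms_ar_track(landmarks, order=2, mu=0.01):
    """Predict each frame's landmarks from the previous `order` frames,
    with AR coefficients adapted online by a (normalized) LMS update.

    landmarks: (T, 2*K) array, K (x, y) landmark pairs per frame.
    Returns one-step-ahead predictions of shape (T, 2*K).
    """
    landmarks = np.asarray(landmarks, dtype=float)
    T, D = landmarks.shape
    w = np.zeros(order)              # AR coefficients shared across landmarks (assumption)
    w[0] = 1.0                       # start from a "copy last frame" predictor
    preds = landmarks.copy()
    for t in range(order, T):
        X = landmarks[t - order:t][::-1]   # most recent frame first, shape (order, D)
        y_hat = w @ X                      # AR prediction, shape (D,)
        e = landmarks[t] - y_hat           # prediction error
        # LMS step against the instantaneous squared-error gradient,
        # normalized for step-size robustness
        w += mu * (X @ e) / (np.linalg.norm(X) ** 2 + 1e-8)
        preds[t] = y_hat
    return preds
```

The Frobenius norm of the error matrix `landmarks - preds` would then serve as the kind of aggregate index the abstract mentions.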
Fast and reliable face and facial feature detection are required abilities for any Human-Computer Interaction approach based on Computer Vision. Since the publication of the Viola-Jones object detection framework, and especially its more recent open-source implementation, an increasing number of applications have appeared, particularly in the context of facial processing. In this respect, the OpenCV community shares a collection of public-domain classifiers for this scenario. However, as far as we know, these classifiers have never been evaluated or compared. In this paper we analyze the individual performance of all those public classifiers, identifying the best performer for each target. These results define a baseline for future approaches. Additionally, we propose a simple hierarchical combination of those classifiers that increases the facial feature detection rate while reducing the face false-detection rate.
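A minimal sketch of how such a per-target comparison might be run with OpenCV's real `cv2.CascadeClassifier` API. The cascade file names are standard OpenCV data files, but the ground-truth format (one box per image) and the IoU threshold are assumptions for illustration:

```python
import cv2

# Candidate public cascades for one target (face); swap in whichever
# classifiers are being compared.
candidates = [
    "haarcascade_frontalface_default.xml",
    "haarcascade_frontalface_alt.xml",
    "haarcascade_frontalface_alt2.xml",
]

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a; bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter)

def evaluate(cascade_file, images, truths, iou_thresh=0.5):
    """Count hits and false positives for one cascade over a labelled set."""
    clf = cv2.CascadeClassifier(cascade_file)
    hits = false_pos = 0
    for img, truth in zip(images, truths):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        for det in clf.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3):
            if iou(tuple(det), truth) >= iou_thresh:
                hits += 1
            else:
                false_pos += 1
    return hits, false_pos
```

Running `evaluate` over each candidate and keeping the best hit/false-positive trade-off per target yields the kind of baseline table the paper describes.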
Merging decisions from different modalities is a crucial problem in Audio-Visual Speech Recognition. To solve it, state-synchronous multi-stream HMMs have been proposed, with the important advantage of incorporating stream reliability into the fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume different and time-varying environmental noise, as encountered in realistic applications, for which adaptive methods are best suited. Stream reliability is assessed directly through classifier outputs, since these are not specific to either noise type or level. The influence of constraining the weights to sum to one is also discussed.
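In a state-synchronous multi-stream HMM the fused emission score is a weighted combination of per-stream log-likelihoods, log b(o_t) = lambda_a * log b_a(o_a,t) + lambda_v * log b_v(o_v,t). A minimal sketch of reliability-driven weighting under the sum-to-one constraint; the softmax mapping from confidence scores to weights is an illustrative assumption, not necessarily the paper's estimator:

```python
import numpy as np

def fused_log_likelihood(ll_audio, ll_video, conf_audio, conf_video, alpha=1.0):
    """Combine per-stream log-likelihoods with confidence-driven weights.

    conf_audio / conf_video would come from per-stream classifier outputs
    (e.g. posterior margins); alpha controls how sharply confidence
    differences translate into weight differences.
    """
    # Softmax enforces lambda_a + lambda_v = 1 with both weights positive.
    w = np.exp(alpha * np.array([conf_audio, conf_video]))
    w /= w.sum()
    return w[0] * ll_audio + w[1] * ll_video
```

Recomputing the weights frame by frame gives the time-varying adaptation the abstract argues for; freezing them recovers the usual fixed-weight fusion.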
Viola and Jones [9] introduced a method to accurately and rapidly detect faces within an image. This technique can be adapted to accurately detect facial features. However, the area of the image analyzed for a facial feature needs to be regionalized to the location with the highest probability of containing the feature. By regionalizing the detection area, false positives are eliminated and detection speed is increased, because a smaller area is examined.
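A minimal sketch of this regionalization with OpenCV cascades: detect the face first, then search for eyes only inside the upper part of each face box. The particular region split (upper half) and file paths are illustrative choices:

```python
import cv2

face_clf = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
eye_clf = cv2.CascadeClassifier("haarcascade_eye.xml")

img = cv2.imread("frame.png")                   # hypothetical input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_clf.detectMultiScale(gray, 1.1, 3):
    # Restrict the eye search to the upper half of the face box, where
    # the prior probability of finding eyes is highest.
    roi = gray[y:y + h // 2, x:x + w]
    for (ex, ey, ew, eh) in eye_clf.detectMultiScale(roi, 1.1, 3):
        # Map ROI-relative coordinates back to full-image coordinates.
        cv2.rectangle(img, (x + ex, y + ey),
                      (x + ex + ew, y + ey + eh), (0, 255, 0), 2)
```

Besides cutting false positives, the eye detector here scans roughly half a face box instead of the whole frame, which is where the speed-up comes from.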
Recently, Viola et al. [2001] introduced a rapid object detection scheme based on a boosted cascade of simple feature classifiers. In this paper we introduce a novel set of rotated Haar-like features. These novel features significantly enrich the simple features of Viola et al. and can also be calculated efficiently. With these new rotated features, our sample face detector achieves, on average, a 10% lower false alarm rate at a given hit rate. We also present a novel post-optimization procedure for a given boosted cascade that improves the false alarm rate, on average, by a further 12.5%.
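The efficiency claim rests on a rotated summed-area table, analogous to the upright integral image. Below is a sketch in the spirit of the Lienhart-Maydt construction; the padding scheme and unvectorized loops are illustrative, not an optimized implementation:

```python
import numpy as np

def rotated_integral(img):
    """Rotated summed-area table: R[y, x] accumulates every pixel
    (x', y') with y' <= y and |x - x'| <= y - y', i.e. a 45-degree
    triangle with its bottom corner at (x, y).  Zero-padding the
    columns keeps border triangles inside the table.
    """
    h, w = img.shape
    p = h  # triangles reach at most h-1 columns past the image border
    I = np.zeros((h, w + 2 * p), dtype=np.int64)
    I[:, p:p + w] = img
    R = np.zeros_like(I)
    for y in range(h):
        for x in range(I.shape[1]):
            left  = R[y - 1, x - 1] if y >= 1 and x >= 1 else 0
            right = R[y - 1, x + 1] if y >= 1 and x + 1 < I.shape[1] else 0
            above = R[y - 2, x]     if y >= 2 else 0
            prev  = I[y - 1, x]     if y >= 1 else 0
            # Inclusion-exclusion of the two neighbouring triangles,
            # plus the two pixels they jointly miss.
            R[y, x] = left + right - above + I[y, x] + prev
    return R[:, p:p + w]
```

With this table, the sum over any 45-degree-rotated rectangle reduces to four lookups, which is what keeps the enriched feature set as cheap to evaluate as the upright one.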
Using visual information, such as lip shapes and movements, as a secondary source of speech information has been shown to make speech recognition systems more robust to problems associated with environmental noise, training/testing mismatch, and channel and speech style variations. Research into utilising visual information for speech recognition has been ongoing for 20 years; however, over this period, a study into which visual information is the most useful or pertinent to improving speech recognition systems has yet to be performed. This paper presents a study to determine the confusability of the phonemes, grouped into their viseme classes, over various levels of noise in the audio domain. The rationale behind this approach is that by establishing the interclass confusion for a group of phonemes in their viseme class, a better understanding can be obtained of the complementary nature of the separate audio and visual information sources, which can subsequently be applied in the fusion stage of an audio-visual speech processing (AVSP) system. The experiments performed show high interclass confusion variability at the 0 dB and -6 dB SNR levels. Further analysis found that this was mainly due to a phonetic imbalance in the dataset. Given this result, it is suggested that an AVSP system used for digit recognition applications should heavily weight the visual modality for the phonemes that are most prevalent, such as the phoneme N.
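The core measurement here is folding a phoneme confusion matrix into viseme classes. A minimal sketch; the phoneme-to-viseme map below is a common articulatory grouping used purely for illustration, not the paper's exact class definition:

```python
import numpy as np

# Illustrative phoneme-to-viseme grouping (assumed, not the paper's map).
viseme_of = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
             "f": "labiodental", "v": "labiodental",
             "n": "alveolar", "t": "alveolar", "d": "alveolar"}

def viseme_confusion(phoneme_conf, phonemes):
    """Fold an NxN phoneme confusion matrix into viseme classes by
    summing all entries whose reference/hypothesis phonemes share
    a class.  phoneme_conf rows/columns follow `phonemes` order.
    """
    classes = sorted(set(viseme_of[p] for p in phonemes))
    idx = {c: i for i, c in enumerate(classes)}
    V = np.zeros((len(classes), len(classes)))
    for i, p in enumerate(phonemes):
        for j, q in enumerate(phonemes):
            V[idx[viseme_of[p]], idx[viseme_of[q]]] += phoneme_conf[i, j]
    return classes, V
```

Off-diagonal mass in `V` at a given SNR is exactly the interclass confusion the study tracks across noise levels.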
Human speech is inherently multi-modal, consisting of both audio and visual components. Recently, researchers have shown that incorporating information about the position of the lips into acoustic speech recognisers enables robust recognition of noisy speech. In the case of Hidden Markov Model recognition, we show that this happens because the visual signal stabilises the alignment of states. It is also shown that unadorned lips, both the inner and outer contours, can be robustly tracked in real time on general-purpose workstations. To accomplish this, efficient algorithms are employed which contain three key components: shape models, motion models, and focused colour feature detectors, all of which are learnt from examples.
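Of the three learnt components, the shape model is the most self-contained to sketch: a linear (PCA) model of lip landmark vectors, as also used by the ASM approach above. The variance threshold and data layout are assumptions for illustration; the motion models and colour detectors are omitted:

```python
import numpy as np

def learn_shape_model(shapes, var_keep=0.95):
    """Learn a linear shape model s ~ mean + P @ b from example contours.

    shapes: (N, 2K) array of aligned landmark vectors.  Returns the mean
    shape and the principal modes retaining `var_keep` of the variance.
    """
    mean = shapes.mean(axis=0)
    U, S, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
    var = S ** 2 / (len(shapes) - 1)
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_keep)) + 1
    return mean, Vt[:k].T                      # modes as columns of P

def project_shape(shape, mean, P):
    """Constrain a candidate contour to the learnt shape space."""
    b = P.T @ (shape - mean)
    return mean + P @ b
```

Projecting each tracked contour back into the learnt subspace is what keeps the tracker on plausible lip shapes even when individual feature detectors misfire.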