Conference Paper

A Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition

DOI: 10.2197/ipsjtcva.2.25
Conference: Advances in Image and Video Technology, Third Pacific Rim Symposium, PSIVT 2009, Tokyo, Japan, January 13-16, 2009. Proceedings
Source: DBLP


This paper presents the development of a novel visual speech recognition (VSR) system based on Hidden Markov Models (HMMs) and a new representation that extends the standard viseme concept, referred to in this paper as the Visual Speech Unit (VSU). Visemes have been regarded as the smallest visual speech elements in the visual domain and have been widely applied to model visual speech, but they are problematic when applied to continuous visual speech recognition. To circumvent the problems associated with standard visemes, we propose a new visual speech representation that includes not only the data associated with the articulation of the visemes but also the transitory information between consecutive visemes. To evaluate the appropriateness of the proposed representation, an extensive set of experiments has been conducted to compare the performance of the visual speech units with that of the standard MPEG-4 visemes. The experimental results indicate that the developed VSR application achieved up to 90% correct recognition when applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only in the range 62-72%.
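
As a rough illustration of HMM-based classification in general, the sketch below trains nothing and simply scores an observation sequence against one HMM per class, then picks the class with the highest likelihood. Everything in it is an assumption made for illustration: the two class names, the two-state left-to-right topology, the quantised mouth-shape symbols (0 = closed, 1 = open) and all probabilities are hypothetical, and the abstract does not say which features, topology or observation model the paper uses.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    # Initialise with the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def classify(obs, models):
    """Return the name of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical toy models: (start, transition, emission) per class.
LEFT_TO_RIGHT = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "open_then_close": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
    "close_then_open": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
}

if __name__ == "__main__":
    print(classify([1, 1, 0, 0], MODELS))
    print(classify([0, 0, 1, 1], MODELS))
```

A practical recogniser would estimate the model parameters from training data and would typically use continuous feature vectors rather than two discrete symbols; the maximum-likelihood decision rule is the only part this sketch is meant to show.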



Available from: Dahai Yu, Mar 09, 2014
  • ABSTRACT: This paper addresses the problem of Persian vowel viseme clustering. Clustering audio-visual data has been discussed for a decade or so. However, it remains an open problem due to the shortage of appropriate data and its dependency on the target language. Here, we propose a speaker-independent and robust method for Persian viseme class identification as our main contribution. The overall process of the proposed method consists of three main steps: (I) mouth region segmentation, (II) feature extraction, and (III) hierarchical clustering. After segmenting the mouth region in all frames, the feature vectors are extracted based on a new look at the Hidden Markov Model. This is another contribution of this work, which utilizes the HMM as a probabilistic model-based feature detector. Finally, a hierarchical clustering approach is utilized to cluster Persian vowel visemes. The main advantage of this work over others is that it produces a single clustering output for all subjects, which can simplify the research process in other applications. To demonstrate the efficiency of the proposed method, a set of experiments is conducted on AVAII.
    Article
  • ABSTRACT: Visual speech recognition, or lip reading, is an approach to noise-robust speech recognition that adds the speaker's visual cues to audio information. Visual-only speech recognition is applicable to speaker verification and to multimedia interfaces that support people with speech impairments. The sequential mouth-shape code method is an effective approach to lip reading for specifically uttered Japanese words; it utilizes two kinds of distinctive mouth shapes, known as first and last mouth shapes, that appear intermittently. One advantage of this method is its low computational burden for the learning and word registration processes. This paper proposes a novel word lip recognition system that detects and determines initial mouth-shape codes to recognize uttered consonants. The proposed method is thus able to discriminate between words that consist of the same sequential vowel codes but contain different consonant codes. The conducted experiments demonstrate that the proposed system provides a higher recognition rate than the conventional ones.
    Conference Paper · Nov 2012
  • ABSTRACT: In recent years, automatic lip reading based on ‘visemes’ has been studied by researchers with the aim of realizing human-machine interactive communication systems in many applications. However, many problems remain, such as how to define the number of viseme classes, how to discriminate visemes, and how to recognize speech based on visemes. In this paper, a novel classification of Japanese visemes and a hierarchical weighted discrimination method for speech recognition are proposed to address these problems. We increased the number of viseme classes from 6 (conventional) to 9 to represent words in more detail. In addition, because discrimination becomes more difficult as the number of visemes increases, the hierarchical weighted discrimination method is proposed. To compare with the conventional method, the ATR phonetically balanced word group, which has a large vocabulary and includes various visemes, was used in word recognition experiments. From these results, we confirmed that the proposed method worked well.
    Conference Paper · Jan 2013
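
The hierarchical clustering step named in the Persian viseme abstract above can be illustrated with a small agglomerative sketch. This is an assumption-laden toy, not the method of that paper: the two-dimensional feature vectors (imagined here as lip width and lip height per vowel token), the Euclidean distance and the average-linkage rule are all hypothetical choices, whereas the abstract says its features come from an HMM-based detector and does not state the linkage used.

```python
import math


def average_linkage(a, b, points):
    """Mean Euclidean distance between all cross-cluster pairs of points."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical (lip width, lip height) features for five vowel tokens.
FEATURES = [(1.0, 0.2), (1.1, 0.25), (0.4, 0.9), (0.45, 0.85), (0.42, 0.95)]

if __name__ == "__main__":
    print(sorted(agglomerate(FEATURES, 2)))
```

Tokens that end up in the same cluster would be treated as one viseme class; the number of clusters to stop at is a free parameter in this sketch.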