Salient phonetic features of Indian languages in speech technology

Sadhana (Impact Factor: 0.59). 10/2011; 36(5). DOI: 10.1007/s12046-011-0039-z

ABSTRACT The speech signal is the basic object of study and analysis in speech technology as well as in phonetics. To form meaningful chunks of language, the speech signal must have dynamically varying spectral characteristics, sometimes varying within a stretch of a few milliseconds. Phonetics groups these temporally varying spectral chunks into abstract classes roughly called allophones. Grouping these allophones into higher-level classes called phonemes takes us closer to their function in a language. Phonemes and letters in the scripts of literate languages (languages which use writing) have varying degrees of correspondence. Because such a relationship exists, a major part of speech technology deals with correlating script letters with chunks of time-varying spectral stretches in that language. Indian languages are said to have a more direct correlation between their sounds and letters. Such similarity gives a false impression that the text-to-sound rule sets of these languages are also similar. A given letter which has parallels across various languages may show different degrees of divergence in its phonetic realization in these languages. We illustrate such differences and point out the problem areas where speech scientists need to pay greater attention in building their systems, especially multilingual systems for Indian languages.
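To make the letter-to-sound divergence concrete, here is a deliberately simplified sketch (the rule, symbols and function names are illustrative, not from the paper): Devanagari-style consonant letters carry an inherent vowel, but Hindi typically deletes it word-finally while Sanskrit pronounces it, so the same letter sequence yields different phone sequences in the two languages.

```python
def pronounce(letters, lang):
    """Toy grapheme-to-phoneme pass: every consonant letter carries an
    inherent vowel 'a'; Hindi-style rules delete it word-finally."""
    phones = []
    for ch in letters:
        phones.extend([ch, 'a'])
    if lang == 'hindi' and phones and phones[-1] == 'a':
        phones = phones[:-1]   # word-final inherent-vowel (schwa) deletion
    return phones

# The same two consonant letters (as in a word like "Rama"):
print(pronounce(['r', 'm'], 'sanskrit'))  # → ['r', 'a', 'm', 'a']
print(pronounce(['r', 'm'], 'hindi'))     # → ['r', 'a', 'm']
```

A multilingual system that reused one rule set across both languages would mispronounce one of them, which is the paper's central caution.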

  • ABSTRACT: A phonetic engine is a system that performs speech signal-to-symbol transformation. This work describes some issues in the development of an Assamese Phonetic Engine (PE). The International Phonetic Alphabet (IPA) is used as the phonetic unit to transcribe the speech database, which is collected in three different modes, namely reading, lecture and conversation. Only reading-mode data is used for training, and a hidden Markov model (HMM) is used to model each phonetic unit without imposing any language or contextual constraint. The trained HMMs are used to derive a sequence of phonetic units from a test speech signal. Accuracies of 47.31%, 45.30% and 36.13% are achieved in the reading, lecture and conversation modes, respectively. Confusions among the phonetic units specific to Assamese are discussed, along with issues related to the different recording modes and to language and native-speaker dependencies. Speech data is also collected in Hindi from three different sets of speakers to study speaker, language and nativeness dependencies. Accuracies of 40.5%, 36.10% and 29.61% are achieved in the native speaker-dependent, native speaker-independent and non-native speaker-independent cases, respectively.
    2013 Annual IEEE India Conference (INDICON), Mumbai, India; 12/2013
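The signal-to-symbol step described above rests on decoding the most likely phonetic-unit sequence from trained HMMs. As a minimal sketch (the units, probabilities and quantized "frames" below are invented for illustration; a real engine uses continuous acoustic features and per-unit multi-state HMMs), log-domain Viterbi decoding looks like this:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Log-domain Viterbi: most likely state (phonetic-unit) path for obs."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + log_trans[p][s])
            V[t][s] = V[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t - 1][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):       # trace back-pointers to recover the path
        path.append(bp[path[-1]])
    return list(reversed(path))

# Two toy "phonetic units" and quantized frames ('low'/'high' energy)
# standing in for real feature vectors.
lg = math.log
states = ['a', 'k']
log_start = {'a': lg(0.5), 'k': lg(0.5)}
log_trans = {'a': {'a': lg(0.7), 'k': lg(0.3)},
             'k': {'a': lg(0.3), 'k': lg(0.7)}}
log_emit = {'a': {'low': lg(0.9), 'high': lg(0.1)},
            'k': {'low': lg(0.2), 'high': lg(0.8)}}

print(viterbi(['low', 'low', 'high', 'high'], states,
              log_start, log_trans, log_emit))   # → ['a', 'a', 'k', 'k']
```

Decoding without any language or contextual constraint, as the paper does, corresponds to leaving the transition structure uninformed by a lexicon or phonotactics, which is one reason the reported accuracies stay under 50%.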
  • ABSTRACT: With the advent of social networks, there has been an exponential growth in multimedia data, including speech. This speech data is typically conversational, casual and recorded in real environments. An important characteristic of such speech data is the unavailability of corresponding transcripts (text) or language information. In this work, we discuss technologies for dealing with speech data that lacks transcripts and/or language information. A traditional approach is to adopt acoustic models from existing benchmark databases (of known languages) to obtain a first-level transcription and then perform bootstrapping. We show the inherent limitations of such approaches, and argue that signal processing algorithms based on speech-production knowledge play an important role in dealing with such data. This paper discusses some of the ongoing work at our lab in this direction, which includes building audio search, speech summarization, speech synthesis and voice conversion using untranscribed speech.
    Signal Processing and Communications (SPCOM), 2012 International Conference on; 01/2012
  • ABSTRACT: This paper discusses the implementation of a phoneme-based Manipuri Keyword Spotting System (MKWSS). Manipuri is a scheduled Indian language of Tibeto-Burman origin. Around 5 hours of read speech were collected from 4 male and 6 female speakers for the development of the MKWSS database. The symbols of the International Phonetic Alphabet (IPA, revised in 2005) are used for transcribing the data. A five-state left-to-right hidden Markov model (HMM), with a 32-mixture continuous-density diagonal-covariance Gaussian mixture model (GMM) per state, is used to build a model for each phonetic unit. We have used the HMM Toolkit (HTK), version 3.4, for modelling the system. The system can recognize 29 phonemes and a non-speech event (silence), and detects keywords formed from these phonemes. Continuous speech data were collected from 5 male and 8 female speakers to analyse the performance of the system, which depends on its ability to detect the keywords. An overall performance of 65.24% is obtained from the phoneme-based MKWSS.
    Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2013, IIT, Jodhpur; 12/2013
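Once a phone recognizer emits a symbol sequence, the spotting step reduces to searching for keyword pronunciations within it. The sketch below is illustrative only (the lexicon entries and decoded sequence are invented, and a real spotter scores partial matches against the HMM lattice rather than requiring exact symbol matches):

```python
def spot_keywords(phones, lexicon):
    """Return (keyword, start_index) pairs wherever a keyword's phone
    pronunciation occurs exactly in the decoded phone sequence."""
    hits = []
    for word, pron in lexicon.items():
        n = len(pron)
        for i in range(len(phones) - n + 1):
            if phones[i:i + n] == pron:
                hits.append((word, i))
    return sorted(hits, key=lambda h: h[1])

# Hypothetical pronunciation lexicon and decoder output (IPA-like
# symbols, made up for this example).
lexicon = {'ima': ['i', 'm', 'a'], 'lai': ['l', 'a', 'i']}
decoded = ['s', 'i', 'm', 'a', 'l', 'a', 'i', 'k']
print(spot_keywords(decoded, lexicon))  # → [('ima', 1), ('lai', 4)]
```

Exact matching makes the dependence on recognizer accuracy obvious: a single substituted phone inside a keyword hides the whole occurrence, which is why spotting performance (65.24% here) tracks phone-recognition quality so closely.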