Conference Paper

Development of a Cantonese-English code-mixing speech corpus.

... The population of the former consists of Chinese (74%), Malays (13%), Indians (9%) and others (4%), while the latter is made up of Malays (50%), Chinese (24%), Indians (7%), and others (19%) [2][3]. In other bilingual and multilingual societies, such as the United States, Switzerland, Hong Kong and Taiwan, we can often hear code-switching speech in Spanish-English, French-Italian, Cantonese-English and Mandarin-Taiwanese, respectively [4][5][6]. Code-switching is a common speaking style in daily conversation, as it enables people to maintain a sense of social belonging and provides a convenient way for speakers to express their ideas. ...
... Several code-switching corpora have been built, e.g. Cantonese-English, English-Mandarin and Mandarin-Taiwanese code-switching speech corpora [6][7][8][9]. The transcriptions of such corpora have been designed based on real-world spontaneous code-switching speech extracted from the Internet or TV programs. ...
Article
Full-text available
SEAME (South East Asia Mandarin-English) is a 30-hour spontaneous Mandarin-English code-switching speech corpus recorded from Singaporean and Malaysian speakers. In this paper, we report a series of analyses on the recording, processing time and voice activity rate (VAR) of the speech recording, transcription, validation and language boundary labeling processes. In addition, the duration of the monolingual segments in the code-switching utterances and an analysis of the speakers' language-switching behavior during conversation are also described. The results show that 80% and 72% of the monolingual English and Mandarin segments, respectively, in code-switching utterances are shorter than one second. In over 80% of the cases, speakers switch language directly, without any short pause or discourse particle between the two adjacent languages.
... Several code-switching corpora have been reported in the literature, e.g. Cantonese-English, Mandarin-Taiwanese and Mandarin-English code-switching speech corpus [8][9][10]. Most of the studies are focused on the tasks of language boundary detection (LBD), language identification (LID) and automatic speech recognition (ASR) using bi-phone probabilities or delta-BIC and LSA-based GMMs [10][11]. ...
... This accounts for 50% and 70% of the total sentences from Singapore and Malaysia, respectively. This observation of speaking style in code-switching utterances coincides with what is reported in Hong Kong and Taiwan [8][9]. Example A is typical of this kind. ...
Conference Paper
Full-text available
This paper introduces the South East Asia Mandarin-English corpus, a 63-hour spontaneous Mandarin-English code-switching transcribed speech corpus suitable for LVCSR and language change detection/identification research. The corpus was recorded under unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who spoke a mixture of Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switched utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switched segments, examining the word length of monolingual segments in code-switched utterances and the most common single words and two-word phrases in such segments.
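The segment-duration and language-turn statistics described above can be sketched with a small script. This is an illustrative toy, not the SEAME analysis pipeline: the language labels, timings and utterance are invented.

```python
from itertools import groupby

def segment_stats(words):
    """words: list of (language, start_sec, end_sec) tuples in time order.
    Returns ([(language, duration_sec), ...], number_of_language_turns)."""
    segments = []
    for lang, group in groupby(words, key=lambda w: w[0]):
        group = list(group)
        # A monolingual segment spans from its first word's start to its
        # last word's end.
        segments.append((lang, group[-1][2] - group[0][1]))
    turns = max(len(segments) - 1, 0)  # one turn per boundary between segments
    return segments, turns

# Toy Mandarin-English utterance: zh segment, en insertion, zh segment.
utt = [("zh", 0.0, 0.4), ("zh", 0.4, 0.9),
       ("en", 0.9, 1.3),
       ("zh", 1.3, 2.0)]
segs, turns = segment_stats(utt)
```

Aggregating `segs` over a whole corpus would give the kind of monolingual-segment duration distribution reported in the abstract.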
... Many code-mixing speech corpora, switching between English and Asian or European languages, have been created [5][6][7][8][9][10]. Moreover, code-switching speech between English and the South African language isiZulu is also found in [11]. ...
... Moreover, code-switching speech between English and the South African language isiZulu is also found in [11]. The code-switching speech was acquired either by recording readings of selected code-mixing text [5,6,10] or by recording meetings and interviews in which speakers switched between two languages [7,9]. There are also many research articles on Thai-English code-mixing in a variety of media: television programs [2,12,13], newspapers [14], magazines [15,16] and pop songs [17]. ...
... There have been several efforts towards building mixed language speech corpora. For example, SEAME for English-Mandarin [7], CUMIX for Cantonese-English [11], BilingBank [12] in the TalkBank database [13] and the LIDES (Language Interaction Data Exchange System) database from the LIPPS (Language Interaction in Plurilingual and Plurilectal Speakers) group [14] all provide mixed language speech corpora. Similarly, several mixed language speech processing studies [15,16,17,18] have been tested on locally developed datasets of mixed language speech. ...
... The specifications associated with the SEAME mixed language corpus include the number of speakers, speaker origin and background, age group of speakers, number of utterances, number of hours of speech, speaking rate, average number of language turns per utterance, distribution of the duration of monolingual segments, and distribution of language switching. The CUMIX speech corpus is a Cantonese-English mixed-language read speech corpus [11]. It is based on 3167 distinct, manually designed mixed-language sentences drawn from newspapers and online resources, including newsgroups and online diaries. ...
... Note that there are several other mixlingual databases for different code-switching pairs, e.g. Cantonese-English, English-Mandarin, Mandarin-Taiwanese and Mandarin-English [18], [10], [19], [20], [21]. The OC16-CE80 database is similar to the SEAME database [22], as both are Mandarin-English code-switching, but the speakers of SEAME are Singaporean and Malaysian, whereas the speakers of OC16-CE80 are all from mainland China. ...
Article
Full-text available
We present the OC16-CE80 Chinese-English mixlingual speech database, which was released as the main resource for training, development and testing in the Chinese-English mixlingual speech recognition (MixASR-CHEN) challenge at O-COCOSDA 2016. This database consists of 80 hours of speech recorded from more than 1,400 speakers, where the utterances are in Chinese but each involves one or several English words. Based on this database and two other free data resources (THCHS30 and the CMU dictionary), an automatic speech recognition (ASR) baseline was constructed with a deep neural network-hidden Markov model (DNN-HMM) hybrid system. We then report the baseline results following the MixASR-CHEN evaluation rules and demonstrate that OC16-CE80 is a reasonable data resource for mixlingual research.
... There have been several attempts to create speech corpora for language pairs like Mandarin-English, Cantonese-English, Frisian-Dutch, Swahili-English and so on (Yılmaz et al., 2016; Chan et al., 2005; Lyu et al., 2015; Lyu et al., 2010; van der Westhuizen and Niesler, 2016; Kleynhans et al., 2016). As this research field remains at a nascent stage of investigation, a read speech corpus can provide valuable insight into modeling the acoustic properties of code-mixing. A corpus designed in such a manner offers substantial control over the lexical content, phonetic coverage, choice of speakers and recording environments, and reduces the dependence on post-processing. ...
Article
Full-text available
The paper presents the development of a phonetically balanced read speech corpus of code-mixed Hindi-English. Phonetic balance in the corpus has been created by selecting sentences that contain triphones lower in frequency than a predefined threshold. The assumption behind the compulsory inclusion of such rare units was that the high-frequency triphones would inevitably be included. Using this metric, the Pearson's correlation coefficient of the phonetically balanced corpus with a large code-mixed reference corpus was recorded to be 0.996. The data for corpus creation has been extracted from selected sections of Hindi newspapers. These sections contain frequent English insertions within a Hindi matrix sentence. Statistics on the phone and triphone distributions are presented to graphically display the phonetic likeness between the reference corpus and the corpus sampled through our method.
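The rare-triphone selection criterion described above can be sketched as follows. This is a minimal toy variant (using an "at or below threshold" test over a tiny invented corpus), not the authors' pipeline.

```python
from collections import Counter

def triphones(phones):
    """All consecutive phone triples in one sentence's phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def select_sentences(corpus, threshold):
    """corpus: list of phone sequences (one per sentence).
    Keep every sentence containing at least one rare triphone, i.e. one whose
    corpus-wide frequency is at or below the threshold."""
    freq = Counter(t for sent in corpus for t in triphones(sent))
    return [sent for sent in corpus
            if any(freq[t] <= threshold for t in triphones(sent))]

# Invented toy corpus: two sentences share a frequent triphone, one is rare.
corpus = [["k", "a", "t"], ["k", "a", "t"], ["b", "i", "g"]]
selected = select_sentences(corpus, threshold=1)
```

On real data one would then compare the triphone distribution of `selected` against the reference corpus, e.g. via a Pearson correlation as the paper does.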
... Investigation of code-switching in the context of automatic speech recognition research has become viable with several code-switching databases that have been proposed in the last years (Lyu et al., 2015;Li and Fung, 2012;Dey and Fung, 2014;Chan et al., 2005;Imseng et al., 2012). These databases contain recordings of Mandarin-English, Hindi-English, Cantonese-English and French-German code-switching speech data. ...
... Investigation of code-switching in the context of automatic speech recognition research has become viable in the last years on account of several code-switching databases [24][25][26][27]. These databases contain recordings of Mandarin-English, Hindi-English, Cantonese-English and French-German codeswitching speech data. ...
... To address the former challenge, [3][4] applied class-based language models using POS information. Further studies explored the use of translation- and semantic-based LMs [6] to improve the probability of infrequent and unseen code-switches. The latter problem was tackled in [3][4][5], where speaker adaptation and phone sharing between languages were investigated. ...
Data
Full-text available
This paper presents first steps toward a large vocabulary continuous speech recognition (LVCSR) system for conversational Mandarin-English code-switching (CS) speech. We applied state-of-the-art techniques such as speaker-adaptive and discriminative training to build the first baseline system on the SEAME corpus [1] (South East Asia Mandarin-English). For acoustic modeling, we applied different phone merging approaches based on the International Phonetic Alphabet (IPA) and the Bhattacharyya distance, in combination with discriminative training, to improve accuracy. On the language model level, we investigated statistical machine translation (SMT)-based text generation approaches for building code-switching language models. Furthermore, we integrated the information provided by a language identification (LID) system into the decoding process by using a multi-stream approach. Our best 2-pass system achieves a Mixed Error Rate (MER) of 36.6% on the SEAME development set.
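As an illustration of the distance-based phone merging mentioned above (not the paper's exact recipe), the Bhattacharyya distance between two diagonal-covariance Gaussian phone models can be computed and thresholded; all means, variances and the merge threshold below are invented.

```python
import math

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians,
    given per-dimension means and variances as parallel lists."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                          # averaged variance
        d += (m1 - m2) ** 2 / (8.0 * v)              # mean-separation term
        d += 0.5 * math.log(v / math.sqrt(v1 * v2))  # covariance-mismatch term
    return d

# Two hypothetical cross-language phone models in a 2-D feature space.
d = bhattacharyya_diag([1.0, 0.0], [1.0, 1.0], [1.2, 0.1], [1.1, 0.9])
MERGE_THRESHOLD = 0.5   # invented value; real systems would tune this
merge = d < MERGE_THRESHOLD
```

Pairs of phones whose distance falls below the threshold would share one acoustic model across the two languages.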
... There is comparatively less work in the literature on automated analysis of code-switched speech, partly due to the relative lack of structured corpora (compared to those for text-based work) and potentially because it poses yet another significant challenge in the form of speech recognition for multiple languages. Nonetheless, some researchers have made strong strides in spoken corpus development to support such research in certain language pairs, for instance Mandarin-English [21,22], Cantonese-English [23] and Hindi-English [24], which in turn have led to developments in automatic speech recognition [25,26] and language modeling [27]. However, these are limited; there remains a need for more code-switched speech resources in these and other languages to spur research into the automated processing and analysis of such data. ...
... • The CUMIX Cantonese-English code-switching speech corpus developed by Joyce Y. C. Chan et al. at the Chinese University of Hong Kong [23]. It contains code-switched speech utterances read by the speakers. ...
Preprint
Full-text available
Code-switching refers to the usage of two languages within a sentence or discourse. It is a global phenomenon among multilingual communities and has emerged as an independent area of research. With the increasing demand for code-switching automatic speech recognition (ASR) systems, the development of code-switching speech corpora has become highly desirable. However, very limited code-switched resources are available as yet for training such systems. In this work, we present our first efforts in building a code-switching ASR system in the Indian context. For that purpose, we have created a Hindi-English code-switching speech database. The database not only contains speech utterances with code-switching properties but also covers session and speaker variations such as pronunciation, accent, age and gender. This database can be applied in several speech signal processing applications, such as code-switching ASR, language identification, language modeling and speech synthesis. This paper mainly presents an analysis of the statistics of the collected code-switching speech corpus. Later, performance results for the ASR task are reported for the created database.
... Both approaches have been used for collecting code-switched speech corpora. Chan et al. (2005) collected a Cantonese-English speech corpus through read newspaper content. Another study gathered Mandarin-English speech from four different sources: (1) conversational meetings; ...
Conference Paper
Full-text available
Speech corpora are key components needed both by linguists (for language analysis, research and language teaching) and by Natural Language Processing (NLP) researchers (for training and evaluating NLP tasks such as speech recognition, text-to-speech and speech-to-text synthesis). Despite the great demand, there is still a huge shortage of available corpora, especially in the case of dialectal languages and code-switched speech. In this paper, we present our efforts in collecting and analyzing a speech corpus for conversational Egyptian Arabic. As in other multilingual societies, it is common among Egyptians to use a mix of Arabic and English in daily conversations. The act of switching languages, at sentence boundaries or within the same sentence, is referred to as code-switching. The aim of this work is three-fold: (1) gather conversational Egyptian Arabic spontaneous speech, (2) obtain manual transcriptions and (3) analyze the speech from the code-switching perspective. A subset of the transcriptions was manually annotated for part-of-speech (POS) tags. The POS distribution of the embedded words was analyzed, as well as the POS distribution of the trigger words (Arabic words preceding a code-switching point). The speech corpus can be obtained by contacting the authors.
... As was often emphasized during recent research meetings on CS text/speech technology (the workshop at EMNLP 2016 and the special session at Interspeech 2017), this line of research suffers from the limited availability of (particularly spoken) data resources. Our contribution to expanding the limited amount of CS speech resources (e.g., [53,54,18,55,56,57,58]) is the bilingual FAME! speech corpus [59]. This corpus contains Frisian-Dutch radio broadcasts extracted from the bilingual archive of the regional public broadcaster Omrop Fryslân (Frisian Broadcast Organization). ...
Preprint
Full-text available
In the FAME! project, we aim to develop an automatic speech recognition (ASR) system for Frisian-Dutch code-switching (CS) speech extracted from the archives of a local broadcaster, with the ultimate goal of building a spoken document retrieval system. Unlike Dutch, Frisian is a low-resourced language with a very limited amount of manually annotated speech data. In this paper, we describe several automatic annotation approaches that enable the use of a large amount of raw bilingual broadcast data for acoustic model training in a semi-supervised setting. Previously, it has been shown that the best-performing ASR system is obtained by two-stage multilingual deep neural network (DNN) training using 11 hours of manually annotated CS speech (reference) data together with speech data from other high-resourced languages. We compare the quality of the transcriptions provided by this bilingual ASR system with several other approaches that use a language recognition system to assign language labels to raw speech segments at the front-end and monolingual ASR resources for transcription. We further investigate automatic annotation of the speakers appearing in the raw broadcast data by first labeling them with (pseudo) speaker tags using a speaker diarization system and then linking them to the known speakers appearing in the reference data using a speaker recognition system. These speaker labels are essential for speaker-adaptive training in the proposed setting. We train acoustic models using the manually and automatically annotated data and run recognition experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic annotations. The ASR and CS detection results demonstrate the potential of using automatic language and speaker tagging in semi-supervised bilingual acoustic model training.
... Extracting and transcribing spontaneous code-switching speech is time-consuming and costly. Hence, many reports on code-switching speech in ASR and LID tasks still use read code-switching speech corpora [5][6]. In this paper, we develop an artificial read-style code-switching corpus in which the prompts for recording are extracted and generated from a segment of spontaneous public speech recorded from a TV program. ...
Article
In this paper, a language identification (LID) task on Mandarin/Taiwanese code-switching utterances is described. The proposed word-based lexical model of this LID system integrates acoustic, phonetic and lexical cues. The first two cues are obtained from a large vocabulary continuous speech recognition (LVCSR) system, and the last one is trained for a word-based lexical model. The lexical model identifies languages according to the frequency and context of each word, given a sequence of words recognized by the LVCSR system. Because the switching unit in code-switching speech is the word, the experiments showed that using a word-based lexical model achieved a 16% relative reduction in classification errors compared with LVCSR-based LID systems.
... • The CUMIX Cantonese-English speech corpus [37] contains 17 hours of code-switched speech read by 80 speakers. ...
Preprint
Full-text available
Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world. This survey reviews computational approaches for code-switched Speech and Natural Language Processing. We motivate why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. As code-switching data and resources are scarce, we list what is available in various code-switched language pairs with the language processing tasks they can be used for. We review code-switching research in various Speech and NLP applications, including language processing tools and end-to-end systems. We conclude with future directions and open problems in the field.
... It is a common phenomenon in many bilingual societies [1][2]. In Taiwan, at least two languages (or dialects, as some linguists prefer to call them), Mandarin and Taiwanese, are frequently mixed and spoken in daily conversations [3]. It has also become a type of skilled performance in public speech. ...
Conference Paper
Full-text available
We propose an integrated approach to automatic speech recognition of code-switching utterances, where speakers switch back and forth between at least two languages. This one-pass framework avoids the degradation in accuracy caused by imperfect intermediate decisions of language boundary detection and language identification. It is based on a three-layer recognition scheme, which consists of a mixed-language HMM-based acoustic model, a knowledge-based plus data-driven probabilistic pronunciation model, and a tree-structured searching net. A traditional multi-pass recognizer including language boundary detection, language identification and language-dependent speech recognition is also implemented for comparison. Experimental results show that the proposed approach, with a much simpler recognition scheme, achieves accuracy as high as that of the traditional approach.
... While there is comparatively less work in the literature on automated analysis of code-switched speech and dialog, the number of corpora and studies is steadily growing for several language pairs, for instance Mandarin-English (Li et al., 2012; Lyu et al., 2015), Cantonese-English (Chan et al., 2005) and Hindi-English (Dey and Fung, 2014). ...
... Three speech corpora are involved in this research; they are the monolingual English corpus TIMIT, the monolingual Cantonese corpus CUSENT [10], and the Cantonese-English code-mixing corpus CUMIX [11]. TIMIT contains five hours of read speech from 630 speakers representing eight major dialect divisions of American English. ...
... For the acoustic-modeling approach, [8][9][10] developed Cantonese-English and Spanish-Catalan bilingual speech corpora, respectively. In these corpora, utterances in the embedded language were collected in regions where the matrix language was spoken and were used directly to train the acoustic models of the embedded language. ...
Article
This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real-world music retrieval. Two of the main difficult issues in handling bilingual speech recognition for real-world applications are tackled: one is to balance the performance and the complexity of the bilingual speech recognition system; the other is to effectively deal with matrix-language accents in the embedded language. A unified bilingual acoustic model, derived by a novel Two-pass phone-clustering method based on the Confusion Matrix (TCM), is developed to solve the first problem. To deal with the second problem, several non-native model modification approaches are investigated on the unified acoustic models. Compared to the existing log-likelihood phone-clustering method, the proposed TCM method, with effective incorporation of limited amounts of non-native adaptation data and adaptive modification, relatively reduces the Phrase Error Rate (PER) by 10.9% for non-native English phrases, while the PER on Mandarin phrases decreases favorably; in addition, recognition of bilingual code-mixing phrases achieves an 8.9% relative PER reduction.
... The code-switching speaking style can usually be found in many bilingual or multilingual societies. Examples of these societies are French-German in Switzerland, English-Spanish in the US, Malay-English in Malaysia, Mandarin-English in Malaysia [2], Cantonese-English in Hong Kong, and Mandarin-Taiwanese in Taiwan [4][5][6]. Code-switching speech recognition is a challenging problem for two reasons. Code-switching is not a simple mixing of two languages [7][8]. ...
Conference Paper
Full-text available
In this paper, we propose a novel approach to automatic recognition of code-switching speech. The proposed method consists of two phases: automatic speech recognition and rescoring. The framework uses parallel automatic speech recognizers for speech recognition. The lattices produced are subsequently joined and rescored to estimate the most probable word sequence. Experiments show that the proposed approach achieves a reduction of more than 5% in WER when tested on English/Malay code-switching speech, and the framework has proven to be very robust. We also propose an acoustic model adaptation approach, a hybrid of interpolation and merging, to cross-adapt acoustic models of different languages for recognizing code-switching speech. The adapted acoustic models show a reduction in WER when used for code-switching speech recognition.
Conference Paper
The great success of the Minimum Phone Error (MPE) training criterion in mono-language large vocabulary continuous speech recognition (LVCSR) tasks motivates us to apply it to bilingual LVCSR systems. In this paper, in conjunction with previously established bilingual phoneme inventory construction techniques, we give a comprehensive investigation of the performance of MPE/fMPE on various Mandarin-English bilingual test sets under different test conditions. The evaluation results show that the final fMPE+MPE model achieves significant improvements compared to the baseline models. On the mono-language test sets, the best improvement is a relative error rate reduction of 28.4%, and on the code-mixing test set it achieves a relative error rate reduction of 8.1%. The within- and cross-language substitution error rates introduced in this paper also explicitly show that fMPE/MPE training can effectively improve the model's within- and cross-language discriminability in our bilingual recognition tasks.
Article
This paper addresses the problem of language modeling for LVCSR of Cantonese-English code-mixing utterances spoken in daily communication. In the absence of a sufficient amount of code-mixing text data, translation-based and semantics-based mappings are applied to n-grams to better estimate the probability of low-frequency and unseen mixed-language n-gram events. In the translation-based mapping scheme, a Cantonese-to-English translation dictionary is adopted to transcribe monolingual Cantonese n-grams into mixed-language n-grams. In the semantics-based mapping scheme, n-gram mapping is based on the meaning and syntactic function of the English words in the lexicon. Different semantics-based language models are trained with different mapping schemes. They are evaluated in terms of perplexity and in an LVCSR task. Experimental results confirm that the more mixed-language n-grams observed after mapping, the better the language model perplexity as well as the recognition performance. The proposed language models show significant improvement in recognition performance on embedded English words compared with the baseline 3-gram LM. The best recognition accuracies attained are 63.9% and 74.7% for the English words and Cantonese characters, respectively, in code-mixing utterances.
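The translation-based mapping scheme described above can be sketched as a toy: monolingual Cantonese n-gram counts are mapped to mixed-language n-grams by substituting a dictionary translation for one word at a time, so unseen mixed-language events inherit probability mass. The dictionary entries and counts are invented; a real system would restrict substitutions to the lexicon and smooth the resulting counts.

```python
from collections import Counter

# Invented Cantonese-to-English entries for illustration only.
translation = {"遲到": "late", "開會": "meeting"}

def map_ngrams(mono_counts):
    """mono_counts: Counter mapping monolingual n-gram tuples to counts.
    Returns a Counter of mixed-language n-grams produced by replacing each
    translatable word with its English translation."""
    mixed = Counter()
    for ngram, count in mono_counts.items():
        for i, word in enumerate(ngram):
            if word in translation:
                mapped = ngram[:i] + (translation[word],) + ngram[i + 1:]
                mixed[mapped] += count
    return mixed

mono = Counter({("我", "遲到", "喇"): 5, ("去", "開會"): 3})
mixed = map_ngrams(mono)
```

The mapped counts would then be interpolated with observed code-mixing counts before estimating the final LM.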
Article
Full-text available
We introduce a new English-isiZulu code-switched speech corpus compiled from South African soap opera broadcasts. isiZulu itself is currently under-resourced, and automatic speech recognition is made even more challenging by the high prevalence of code-switching in spontaneous speech. Analysis of the corpus reflects effects common in conversational isiZulu, such as vowel deletion and cross-language prefixes and suffixes. Baseline monolingual and code-switched automatic speech recognition systems are developed, including a new language model configuration that explicitly includes switching transitions. For code-switched speech, a system with language-dependent acoustic models and language-dependent language models linked by switching transitions leads to best performance, although word error rates overall remain very high.
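A rough sketch of the switching-transition idea in the language model configuration above: two language-dependent bigram LMs linked by an explicit switch probability, so the model scores both staying within a language and crossing over. All probabilities, the `<sw>` token and the back-off constant are invented toy values, not the paper's estimates.

```python
import math

P_SWITCH = 0.1   # invented probability of crossing the language boundary
bigram = {       # invented within-language bigram probabilities, keyed by language
    ("en", "the", "dog"): 0.2,
    ("zu", "<sw>", "inja"): 0.3,  # entering isiZulu via a switch transition
}
BACKOFF = 1e-4   # invented floor for unseen bigrams

def score(words_langs):
    """words_langs: list of (word, language). Returns the log-probability,
    charging P_SWITCH at each language change and (1 - P_SWITCH) otherwise."""
    logp = 0.0
    for (w1, l1), (w2, l2) in zip(words_langs, words_langs[1:]):
        if l1 != l2:
            logp += math.log(P_SWITCH)
            logp += math.log(bigram.get((l2, "<sw>", w2), BACKOFF))
        else:
            logp += math.log(1 - P_SWITCH)
            logp += math.log(bigram.get((l1, w1, w2), BACKOFF))
    return logp
```

This makes the cost of a code-switch an explicit, tunable model parameter rather than an artifact of sparse cross-language n-grams.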
Article
Code-switching (CS) is a multilingual phenomenon where a speaker uses different languages within an utterance or between alternating utterances. Developing large-scale datasets for training code-switching acoustic and language models is challenging and extremely expensive. In this paper, we focus on acoustic data augmentation for the Mandarin-English CS speech recognition task. The effectiveness of conventional acoustic data augmentation approaches is examined. More importantly, we propose a CS acoustic event detection system based on a deep neural network to extract real code-switching speech segments automatically. Then, semi-supervised and active learning techniques are investigated to generate transcriptions of these segments. Finally, a code-switching speech synthesis system is introduced to further enhance the acoustic modeling. Experimental results on the OC16-CE80 data, a Mandarin-English mixlingual speech corpus, demonstrate the effectiveness of the proposed methods.
Article
Code-switching refers to the frequent use of non-native language words/phrases by speakers while conversing in their native languages. Traditionally, for training a language model (LM) for code-switching data, one is required to tediously collect a large amount of text in the respective code-switching domain. Alternatively, we recently proposed a more viable approach that adapts an existing native LM to handle code-switching data. In this work, we present our efforts in language modeling of code-switching data following both the traditional and the proposed approaches. The salient contributions of this paper include: (i) the creation of a Hindi-English code-switching text corpus, (ii) an improved parts-of-speech (POS) labeling scheme for accurate tagging of non-native words embedded in code-switching data, and (iii) the proposal of a novel textual feature referred to as the code-switching location (CSL) feature, which allows LMs to predict code-switching instances. The evaluation of the proposed features has been done on two code-switching datasets: Hindi-English and Mandarin-English. On experimental evaluation, a substantial reduction in perplexity is achieved with the use of the improved POS features. It is also observed that the proposed CSL features provide an independent and additive improvement over the POS features in terms of perplexity.
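A code-switching-location style feature, as described above, can be illustrated as a per-token binary flag marking whether a language switch occurs immediately after that token; an LM could then condition on this flag. The token sequence and language tags below are invented, not taken from the paper's corpus.

```python
def csl_features(tokens):
    """tokens: list of (word, language) pairs.
    Returns a 0/1 flag per token: 1 if the next token is in a different
    language (i.e. a code-switch point follows), else 0."""
    flags = []
    for i, (_, lang) in enumerate(tokens):
        nxt = tokens[i + 1][1] if i + 1 < len(tokens) else lang
        flags.append(1 if nxt != lang else 0)
    return flags

# Invented Hindi-English code-mixed sentence with toy language tags.
sent = [("mujhe", "hi"), ("exam", "en"), ("ki", "hi"),
        ("tension", "en"), ("hai", "hi")]
flags = csl_features(sent)
```

In training data these flags are derived from the language annotations; at prediction time the LM treats them as an extra factor alongside word and POS features.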
Article
In recent decades, there has been a great deal of research into bilingual speech recognition: developing a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real-world music retrieval. Two of the main difficulties in building bilingual speech recognition systems for real-world applications are tackled in this paper. One is balancing the performance and the complexity of the bilingual speech recognition system; the other is effectively dealing with matrix-language accents in the embedded language. In order to process intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a compact single set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using two separate monolingual models. In our study, a novel Two-pass phone clustering method based on a Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments show that TCM achieves better performance. Since potential system users' native language is Mandarin, which is regarded as the matrix language in our application, their pronunciations of English, the embedded language, usually contain Mandarin accents. In order to deal with matrix-language accents in the embedded language, different non-native adaptation approaches are investigated. Experiments show that the model retraining method outperforms other common adaptation methods such as Maximum A Posteriori (MAP).
With the effective incorporation of the phone clustering and non-native adaptation approaches, the Phrase Error Rate (PER) of MESRS on English utterances was reduced by 24.47% relative to the baseline monolingual English system, while the PER on Mandarin utterances was comparable to that of the baseline monolingual Mandarin system. On bilingual utterances, a 22.37% relative PER reduction was achieved.
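The two-pass details of TCM are not given in the abstract, but its core idea — merging the phones that recognizers confuse most often — can be sketched as a greedy agglomerative pass over a symmetrized confusion matrix. The toy phones and counts below are hypothetical, not from the paper.

```python
import numpy as np

def cluster_by_confusion(phones, confusion, n_merges):
    """Greedily merge the most mutually confused phone clusters.

    confusion[i][j] counts how often phone i was recognized as phone j;
    the symmetrized cross-cluster confusion is used as the merge score.
    """
    clusters = [{p} for p in range(len(phones))]
    sym = np.asarray(confusion, dtype=float)
    sym = sym + sym.T
    for _ in range(n_merges):
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = sum(sym[i, j] for i in clusters[a] for j in clusters[b])
                if score > best:
                    best, pair = score, (a, b)
        a, b = pair
        clusters[a] |= clusters.pop(b)
    return [{phones[i] for i in c} for c in clusters]

# Toy confusion matrix: the Mandarin and English /s/-like phones (hypothetical
# labels m_s / e_s) confuse each other often, as do the two vowels.
phones = ["m_s", "e_s", "m_a", "e_ae"]
conf = [[50, 20, 1, 0],
        [18, 60, 0, 1],
        [1, 0, 70, 9],
        [0, 1, 8, 55]]
print(cluster_by_confusion(phones, conf, 2))
```

Merged clusters share one acoustic model, shrinking the bilingual phone set and the amount of training data each model needs.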
Article
Full-text available
This paper is a review of the major works in code-switching in Hong Kong to date. Four context-specific motivations commonly found in the Hong Kong Chinese press—euphemism, specificity, bilingual punning, and principle of economy—are adduced to show that English is one of the important linguistic resources used by Chinese Hongkongers to fulfill a variety of well-defined communicative purposes.
Conference Paper
Full-text available
This paper investigates the use of articulatory-acoustic features for the classification of syllables in TIMIT. The main motivation for this study is to circumvent the "beads-on-a-string" problem, i.e. the assumption that words can be described as a simple concatenation of phones. Posterior probabilities for articulatory-acoustic features are obtained from artificial neural nets and are used to classify speech within the scope of syllables instead of phones. This gives the opportunity to account for asynchronous feature changes, exploiting the strengths of the articulatory-acoustic features, instead of losing the potential by reverting to phones.
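A common way to turn per-frame posteriors into a syllable-level decision is to accumulate log posteriors over the syllable span and pick the best class; the sketch below shows that aggregation step only, with made-up class names and posterior values, not the paper's network or feature set.

```python
import numpy as np

def classify_syllable(frame_posteriors, classes):
    """Sum per-frame log posteriors over a syllable span and pick the
    highest-scoring class, instead of forcing a phone label per frame."""
    logp = np.log(np.asarray(frame_posteriors))  # shape: frames x classes
    totals = logp.sum(axis=0)
    return classes[int(np.argmax(totals))]

# Toy posteriors over a single articulatory feature {voiced, voiceless}
# for four frames of one syllable.
classes = ["voiced", "voiceless"]
post = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
print(classify_syllable(post, classes))
```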
Conference Paper
Full-text available
In this paper, we present an effective method to detect the language boundary (LB) in code-switching utterances. The utterances are mainly produced in Cantonese, a commonly used Chinese dialect, whilst occasionally English words are inserted between Cantonese words. Bi-phone probabilities are calculated to measure the confidence that the recognized phones are in Cantonese. Two sets of context-independent mono-phone models are trained by monolingual Cantonese and monolingual English data separately. Both knowledge-based and data-driven model selection approaches are studied in order to retain the language-dependent characteristics and to merge duplicated phone sets between the two languages. The LB detection accuracy is 75.12% for utterances that contain one single code-switching word or phrase.
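The bi-phone confidence idea above can be sketched as scoring adjacent phone pairs under a Cantonese phone-bigram model: spans whose pairs are unlikely (or unseen) under the Cantonese model are candidates for embedded English. The model probabilities, phone labels, and floor value below are hypothetical illustrations, not the paper's trained models.

```python
def biphone_logprob(phones, bigram_logp, floor=-10.0):
    """Average log probability of adjacent phone pairs under a Cantonese
    bi-phone model; low values suggest a non-Cantonese (English) segment."""
    scores = [bigram_logp.get((a, b), floor) for a, b in zip(phones, phones[1:])]
    return sum(scores) / len(scores)

# Hypothetical bi-phone log-probabilities from monolingual Cantonese data.
canto_lm = {("ng", "o"): -1.0, ("o", "h"): -1.5, ("h", "ou"): -1.2}

cantonese_span = ["ng", "o", "h", "ou"]  # pairs seen by the Cantonese model
english_span = ["th", "ih", "s"]         # pairs unseen, so floored
print(biphone_logprob(cantonese_span, canto_lm))
print(biphone_logprob(english_span, canto_lm))
```

Thresholding this score along a recognized phone sequence marks the language boundaries around the embedded English word.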
Article
The TIMIT acoustic-phonetic database was designed jointly by researchers at MIT, TI, and SRI. It was intended to provide a rich collection of acoustic-phonetic and phonological data, to be used for basic research as well as the development and evaluation of speech recognition systems. There are a total of 450 MIT sentences used in the TIMIT database. These were generated by hand in an iterative fashion, with the goal that they should be phonetically rich. To aid in the sentence generation process, Webster's Pocket Dictionary, which contains nearly 20,000 words, was used. Words or word sequences containing particular phone pairs could be accessed from this dictionary automatically, which greatly facilitated the database design process. The database consists of a total of 6,300 sentences from 630 speakers, representing over 5 hours of speech material, and was recorded by researchers at TI. This chapter describes the transcription and alignment of the TIMIT database, which was performed at MIT.
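The automatic lookup of words by phone pair described above amounts to an inverted index from adjacent phone pairs to the words that contain them. A minimal sketch, with a tiny hypothetical lexicon standing in for the 20,000-word dictionary:

```python
from collections import defaultdict

def index_by_phone_pairs(lexicon):
    """Map each adjacent phone pair to the set of words containing it,
    so words covering a wanted diphone can be retrieved automatically."""
    index = defaultdict(set)
    for word, phones in lexicon.items():
        for pair in zip(phones, phones[1:]):
            index[pair].add(word)
    return index

# Tiny stand-in lexicon (ARPAbet-like, hypothetical pronunciations).
lex = {
    "she": ["sh", "iy"],
    "sheet": ["sh", "iy", "t"],
    "tea": ["t", "iy"],
}
idx = index_by_phone_pairs(lex)
print(sorted(idx[("sh", "iy")]))
```

Querying the index for an under-covered phone pair immediately yields candidate words to work into a new sentence.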
Article
Two experiments with Chinese–English bilinguals were conducted to examine the recognition of code-switched words in speech. In Experiment 1, listeners were asked to identify a code-switched word in a sentence on the basis of increasing fragments of the word. In Experiment 2, listeners repeated the code-switched word following a predesignated point upon hearing the sentence. Converging evidence from these experiments shows that the successful recognition of code-switched words depends on the interaction among phonological, structural, and contextual information in the recognition process. The results also indicate that Chinese–English bilinguals can recognize code-switched words with the same amount of information as required by monolingual English listeners. These results are interpreted in terms of parallel activation and interactive processes in spoken word recognition.
Article
Syllable fusion is a Hong Kong Cantonese connected speech process, whereby edges of syllables are obscured by consonant lenition or deletion, and vowel reduction. More extreme fusion can simplify contour tones and merge the qualities of vowels that would be separated by an onset or coda consonant at more normal degrees of disjuncture between words. This paper investigates the influence of speech rate on syllable fusion. An experiment tested the prediction that faster speech rate would give rise to more occurrences of fusion forms and a greater degree of fusion. Subjects repeated word groups in two conditions: at normal rate and at fastest possible speech rates. Results show that speech rate is a reliable predictor for the amount and for the degree of fusion. Implications for incorporating prosody in speech synthesis systems are discussed.
Conference Paper
This paper describes work on developing a large vocabulary speech database for Cantonese. As a major Chinese dialect, Cantonese is spoken by tens of millions of people in Southern China and Hong Kong. It is very different from Mandarin or Putonghua in phonology, phonetics, vocabulary and grammatical structure. A speech database specially designed for Cantonese is urgently needed for the design, implementation and performance evaluation of various speech recognition systems. The proposed database contains a large number of speech utterances which include isolated syllables, polysyllabic words and phonetically rich sentences. It covers most of the intra-syllable and inter-syllable acoustic variations.
Code-mixing in Hong Kong Cantonese-English bilinguals: constraints and processes
  • Brian Hok-Shing Chan
Brian Hok-Shing Chan, Code-mixing in Hong Kong Cantonese-English bilinguals: constraints and processes, M.A. thesis, The Chinese University of Hong Kong, 1992.
A study of lexical borrowing from English in Hong Kong Chinese, Centre of Asian Studies
  • Mimi Chan
  • Helen Kwok
Mimi Chan, Helen Kwok, A study of lexical borrowing from English in Hong Kong Chinese, Centre of Asian Studies, University of Hong Kong, Hong Kong, 1990.
Some observations on code-switching between Cantonese and English in Hong Kong
  • A Tse
A. Tse, " Some observations on code-switching between Cantonese and English in Hong Kong ", Working Papers in Languages and Linguistics, Vol. 4, p.101-108, Department of Chinese, Translation and Linguistics, City Polytechnic of Hong Kong, 1992