Article

Assessing Pronunciation Improvement in Students of English Using a Controlled Computer-Assisted Pronunciation Tool


Abstract

Over the last few years, we have witnessed a growing interest in computer-assisted pronunciation training (CAPT) tools and the commercial success of foreign-language teaching applications that incorporate speech synthesis and automatic speech recognition technologies. However, empirical evidence supporting the pedagogical effectiveness of these systems remains scarce. In this study, a minimal-pair-based CAPT tool that implements exposure-perception-production cycles and provides automatic feedback to learners is tested for effectiveness in training adult native Spanish speakers (English level B1-B2) in the production of a set of difficult English sounds. Working under controlled conditions, a group of users took a pronunciation test before and after using the tool. Test results were compared against those of an in-classroom group who followed similar training within the traditional classroom setting. Results show a significant pronunciation improvement among the learners who used the CAPT tool, as well as a correlation between human raters' assessment of the post-tests and the automatic CAPT assessment of users.
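The evaluation the abstract describes reduces to two standard statistics: a paired pre/post comparison and a human-machine agreement check. The sketch below illustrates both with SciPy; all scores are invented for illustration, and this is not the authors' actual analysis pipeline.

```python
# Pre/post improvement and human-machine agreement, as in the abstract.
# All numbers below are invented for illustration.
import numpy as np
from scipy import stats

# Hypothetical per-learner scores out of 10 for the CAPT group.
pre_test = np.array([5.1, 6.0, 4.8, 5.5, 6.2, 5.0, 4.9, 5.7])
post_test = np.array([6.0, 6.8, 5.9, 6.1, 7.0, 5.8, 5.6, 6.5])

# Did pronunciation improve significantly after training?
t, p = stats.ttest_rel(post_test, pre_test)
print(f"mean improvement = {np.mean(post_test - pre_test):.2f}, p = {p:.4f}")

# Do human raters and the automatic CAPT assessment agree on post-tests?
human_post = np.array([6.1, 6.7, 5.8, 6.2, 7.1, 5.7, 5.5, 6.6])
r, p_r = stats.pearsonr(human_post, post_test)
print(f"human-machine correlation r = {r:.2f} (p = {p_r:.4f})")
```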


... The most important aspect of this model is that the learner accepts this combination for learning and assessing. Tejedor-García et al. [176] developed an application which helps learners to improve their pronunciation and assess their learning at the end of the class. ...
... Tejedor-García et al. [176] (2020): Smart Assessment. Proposed an application for learner pronunciation improvement and analysis (mobile phones, laptops, etc.). Such applications help adult learners. ...
... The results suggested that, beyond the above BYOD and 6 Cs, many other needs must be met to convert institutions into smart institutions. Similarly, Tejedor-García et al. [176] developed an application to enhance language pronunciation improvement and learner engagement. ...
Article
Full-text available
IoT is a fundamental enabling technology for creating smart spaces, which can support effective face-to-face and online education systems. The transition to smart education (integrating IoT and AI into the education system) is appealing and has a concrete impact on learners' engagement, motivation, attendance, and deep learning. Traditional education faces many challenges, including administration, pedagogy, assessment, and classroom supervision. Recent developments in ICT (e.g., IoT, AI, and 5G) have yielded many smart solutions for various aspects of life; however, these solutions are not well integrated into the education system. In particular, the COVID-19 pandemic further emphasized the adoption of new smart solutions in education. This study reviews the related literature and addresses (i) problems in the traditional education system with possible solutions, (ii) the transition towards smart education, and (iii) research challenges in the transition to smart education (i.e., computational and social resistance). Considering these studies, smart solutions (e.g., smart pedagogy, smart assessment, smart classroom, and smart administration) are proposed for the problems of the traditional system. This exploratory study opens new trends for scholars and the market to integrate ICT, IoT, and AI into smart education.
... accessed on 27 June 2021) were presented to the experimental and in-classroom group participants in 7 lessons across three training sessions. The minimal pairs were carefully selected by experts considering the gASR limitations (homophones, word-frequency, very short words, and out-of-context words, in a similar process as in [8]). The lessons were included in the CAPT tool for the experimental group and during the class sessions for the in-classroom group (12 minimal pairs per lesson, 2 lessons per session, except for the last session that included 3 lessons; see more details about the training activities in [19]). ...
... To carry out all the experiments, we built a mobile app, Japañol, starting from a previous prototype app designed for self-directed training of English as an L2 [8]. Figure 2 shows the regular sequence of steps to complete a lesson in Japañol. ...
... The experimental and in-classroom group speakers improved 0.7|1.1 and 0.6|0.9 points out of 10, as assessed by the gASR|kASR systems, respectively, after just three one-hour training sessions. These results agree with previous works that follow a similar methodology [8,30]. Thus, the training protocol and the technology involved, such as the CAPT tool and the ASR systems, provide a very useful didactic instrument that can complement other forms of second language acquisition in larger and more ambitious language learning projects. ...
Article
Full-text available
General-purpose automatic speech recognition (ASR) systems have improved in quality and are being used for pronunciation assessment. However, the assessment of isolated short utterances, such as words in minimal pairs for segmental approaches, remains an important challenge, even more so for non-native speakers. In this work, we compare the performance of our own tailored ASR system (kASR) with the one of Google ASR (gASR) for the assessment of Spanish minimal pair words produced by 33 native Japanese speakers in a computer-assisted pronunciation training (CAPT) scenario. Participants in a pre/post-test training experiment spanning four weeks were split into three groups: experimental, in-classroom, and placebo. The experimental group used the CAPT tool described in the paper, which we specially designed for autonomous pronunciation training. A statistically significant improvement for the experimental and in-classroom groups was revealed, and moderate correlation values between gASR and kASR results were obtained, in addition to strong correlations between the post-test scores of both ASR systems and the CAPT application scores found at the final stages of application use. These results suggest that both ASR alternatives are valid for assessing minimal pairs in CAPT tools, in the current configuration. Discussion on possible ways to improve our system and possibilities for future research are included.
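A recurring operational detail in these studies is how an isolated minimal-pair production is scored: the ASR output is checked against the intended member of the pair. Below is a minimal sketch of that decision rule; `recognize` is a placeholder for any ASR backend (e.g., gASR or kASR), and the 0-10 scaling mirrors the scores quoted in the surrounding contexts.

```python
# Minimal-pair scoring sketch: a production counts as correct when the
# ASR output matches the intended word rather than its contrast word.
# `recognize` is a placeholder for any ASR backend (e.g., gASR or kASR).
from typing import Callable

def score_minimal_pairs(recordings: list[tuple[str, str, str]],
                        recognize: Callable[[str], str]) -> float:
    """recordings: (audio_path, target_word, contrast_word) triples.
    Returns the share of productions recognized as the target, scaled
    to a 0-10 score like the ones quoted above."""
    correct = 0
    for audio_path, target, contrast in recordings:
        hypothesis = recognize(audio_path).strip().lower()
        # Recognition as the contrast word (e.g., "ship" for "sheep")
        # is exactly the segmental error the training targets.
        if hypothesis == target.lower():
            correct += 1
    return 10.0 * correct / len(recordings)
```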
... The computer-assisted language teaching mode can effectively enhance students' independent learning ability, provide excellent information resources for teaching, and improve students' comprehensive quality and teaching quality [2][3]. Computer-aided language teaching integrates audio, video, pictures, text, and other resources to meet the basic needs of college English teaching at the current stage. Through the practical application of computer-aided language teaching, a scientific and standardized teaching plan can be formulated, which can optimize the college English teaching mode, continuously improve the comprehensive quality of students, and cultivate more outstanding talents for socialist modernization [4][5][6]. ...
... The number of matching word strings as an evaluation of the alignment model is shown in Equation (4), and the translation of its original form is presented. ...
Article
Full-text available
The computer-aided writing system first builds an English writing corpus, which mainly involves corpus selection, corpus preprocessing, formula and algorithm design, sentence breaking, and judging whether sentences are aligned. Anchor alignment cannot do the job alone; it must work in conjunction with named entity recognition technology, so the connection and roles between the two are studied. The Jieba segmenter uses the idea of the HMM model, computed with the Viterbi algorithm, to recognize new words outside the dictionary. The computer-assisted writing system is introduced into English language teaching to promote conscious and deep student participation, activate classroom teaching, and improve learning effectiveness. The results showed significant correlations between text length and English writing ability (P=0.003) and between T-units and English writing ability (P=0.001). Statistical significance was reached for all three indicators of text length (P=0.001), T-units (P=0.001), and English writing proficiency (P=0.002) between the high and low subgroups. The vocabulary recommendation function usage records revealed that each student used it 6.21 times on average and expressed liking for it 4.32 times on average. Twenty students agreed that the system was helpful for their writing. The two groups' writing scores differed significantly (P=0.001) from their respective pre-test scores. The computerized writing instruction system was more helpful for the students' performance.
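The abstract's mention of Jieba recognizing out-of-dictionary words refers to decoding a B/M/E/S (begin/middle/end/single) character-state sequence with the Viterbi algorithm. The toy decoder below illustrates that recursion; the probability tables are invented stand-ins for the trained tables Jieba ships.

```python
# Toy Viterbi decoder over the B/M/E/S character states that Jieba's
# HMM uses to find words missing from its dictionary. All values are
# log-probabilities; unseen characters get a -20.0 floor.
STATES = "BMES"  # Begin / Middle / End / Single-character word

def viterbi(chars, start_p, trans_p, emit_p):
    # V[t][s]: best log-prob of any state path ending in state s at t.
    V = [{s: start_p[s] + emit_p[s].get(chars[0], -20.0) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in chars[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            prev, score = max(
                ((p, V[-2][p] + trans_p[p].get(s, -20.0)) for p in STATES),
                key=lambda ps: ps[1])
            V[-1][s] = score + emit_p[s].get(ch, -20.0)
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]  # e.g. ['B', 'E', 'S'] = one 2-char word + a single
```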
... Corrective feedback can be given globally by indicating whether a word or sentence was pronounced correctly (Evers and Chen 2020;Mroz 2018b;Neri et al. 2008) without further details. On the other hand, the feedback can return phonetic details by highlighting phonetic features that show precisely which speech sounds were mispronounced (Castelo 2022;Tejedor-García et al. 2020). Both global and specific phonetic types of corrective feedback effectively enhance learners' pronunciation. ...
... Through the detailed information provided by NOVO, students became aware of what they needed to improve in their pronunciation and could practice pronouncing the speech sounds they found problematic. The positive impact of phonetic feedback has also been found in studies by Castelo (2022) and Tejedor-García et al. (2020). Although the treatment effect indicated a significant difference, ILI, with its global corrective feedback, remained effective for pronunciation learning. ...
... In the realm of customized and adaptive educational experiences, computer-assisted learning stands out as particularly effective. Traditional pedagogical methods often subscribe to a one-size-fits-all approach, which is increasingly recognized as insufficient for diverse learning needs [25], [26], [29]. Computer-assisted platforms can be tailored to individual learning trajectories, an essential feature for accommodating students with learning disabilities. ...
... Tejedor-García et al., for example, designed a minimal-pair-based Computer-Assisted Pronunciation Training tool focusing on exposure-perception-production cycles. This tool provides automatic feedback and specifically targets adult native Spanish speakers for mastering difficult English phonemes [25]. Similarly, Rho et al. developed a virtual reality-based sign language learning system that employs validation-based feedback, thereby implementing an experiential learning model [26]. ...
Article
Full-text available
This study aims to address the significant learning challenges faced by students with learning disabilities, particularly in mastering algebraic concepts. Traditional rule-based instruction often exacerbates these challenges, highlighting the need for more effective teaching methodologies. While existing literature confirms the effectiveness of both prompt-based learning and computer-assisted instruction in math education, very few studies have looked at combining mobile technology with prompt-based learning approaches for algebra instruction. In response to this gap, we propose a Visual Prompt-Based Mobile Learning Strategy (VPML). This innovative method is specifically designed to enhance the understanding of basic algebra, focusing on linear equations. Utilizing a quasi-experimental design, we conducted a comprehensive 12-week study to assess the impact of VPML on learning outcomes among students with learning disabilities. The results indicate significant improvements in the students’ abilities to comprehend and solve linear equations when taught via the VPML strategy. Therefore, this study serves not only as a validation of the VPML approach but also as a meaningful contribution to the ongoing discourse on optimizing educational interventions for students with learning disabilities in algebra.
... This study has two main goals: first, to investigate to what extent the use of the CAPT tool Estoñol improves the pronunciation and perception of Estonian vowels, and second, to test the use of TTS and ASR to improve learners' perception and production of Estonian vowels. It is expected that the tool will have a positive effect on the performance, as has been shown in previous studies using similar methodology [1], [2], [14]. ...
... Likewise, the learners' production improved significantly after the training period. The effectiveness of CAPT tools has also been demonstrated in numerous previous studies involving different languages (e.g., [1], [2], [14]). ...
Conference Paper
Full-text available
This study tests the effectiveness of a CAPT tool tailored for Spanish L1 learners of Estonian to practice the perception and production of Estonian vowels. The tool is designed to train seven vowel contrasts (/i-y/, /u-y/, /ɑ-o/, /ɑ-ae/, /e-ae/, /o-ø/, and /o-ɤ/) that have been shown to be difficult for Spanish L1 learners. When practicing with the CAPT tool, the learners first see a theoretical video followed by four training modes: exposure, discrimination, pronunciation, and mixed. To assess the effectiveness of the tool, an experiment combining training sessions with the tool and pre- and post-testing was conducted. The results show that, on the whole, the learners' perception and production improved significantly after the training period, suggesting that CAPT tools are an effective resource for pronunciation practice.
... Automatic pronunciation assessment (APA) constitutes an emerging field of interest. There are several studies which have addressed the automatic assessment of the pronunciation of words (e.g., Luo, 2016;Tejedor-García, 2020) and sentences (e.g., Crossley & McNamara, 2013;Yarra et al., 2019). Teaching institutions are using automatic systems for the assessment of L2 spoken discourse (Seed & Xu, 2018). ...
... Although ASR-based APA within CAPT seems efficient (Liakin et al., 2015;Luo, 2016;Tejedor-García, 2020), the implicit premise that justifies its use has only been partially tested. Researchers in the field have mostly been concerned with intelligibility (Jenkins, 2002;Levis, 2005) and therefore with the extent to which human and automatic recognition overlap (Coniam, 1999;Derwing et al., 2000;McCrocklin & Edalatishams, 2020). ...
Article
Full-text available
This study addresses the issue of automatic pronunciation assessment (APA) and its contribution to the teaching of second language (L2) pronunciation. Several attempts have been made at designing such systems, and some have proven operationally successful. However, the automatic assessment of the pronunciation of short words in segmental approaches has still remained a significant challenge. Free and off-the-shelf automatic speech recognition (ASR) systems have been used in integration with other tools with the hopes of facilitating improvement in the domain of computer-assisted pronunciation training (CAPT). The use of ASR in APA stands on the premise that a word that is recognized is intelligible and well-pronounced. Our goal was to explore and test the functionality of Google ASR as the core component within a possible automatic British English pronunciation assessment system. After testing the system against standard and non-standard (foreign) pronunciations provided by participating pronunciation experts as well as non-expert native and non-native speakers of English, we found that Google ASR does not and cannot simultaneously meet two necessary conditions (here defined as intrinsic and derived) for performing as an APA system. Our study concludes with a synthetic view on the requirements of a reliable APA system.
... Most computer-aided language learning systems [15,18], e.g., foreign language learning [10] and teaching hearing-impaired patients [14,30], include a computer-assisted pronunciation tool (CAPT) [1,31]. A typical CAPT records a learner's utterance, detects and diagnoses mispronunciations in it, and suggests a way for correcting them [1]. ...
... The main difference between this algorithm and our approach in the previous conference publication [23] is the presence of new steps 18-22 to synthesize the phonetic database adapted to the voice of the user (Subsection 3.2). Instead of simply adding the best sound (31) to the training set [23], the synthesized sounds x * r (28) are stored in a special set X * because they are much closer to the model pronunciation. As a result, it will be easier for the learner to produce sounds that are either close to the model pronunciation or to the user's own best attempts. ...
Article
Full-text available
This paper considers an assessment and evaluation of speech sound pronunciation quality in computer-aided language learning systems. We examine the gain optimization of spectral distortion measures between the speech signals of a native speaker and a learner. During training, a learner has to achieve stable pronunciation of all sounds. This is measured by computing the distances between the sounds produced by the learner and the model speaker. In order to improve pronunciation, it is proposed to adapt the linear prediction coding coefficients of reference sounds by using the gradient descent optimization of the gain-optimized dissimilarity. As a result, we demonstrate the possibility of synthesizing sounds that will be either close to the model pronunciation or achievable by a learner. An experimental study shows that the proposed procedure leads to high efficiency for pronunciation training even in the presence of noise in the observed utterance.
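The key computation described here is a spectral dissimilarity minimized over a scalar gain, followed by gradient descent on the reference LPC coefficients. The sketch below assumes a squared-error spectral distance, for which the optimal gain has the closed form g* = <x, y>/<y, y>, and uses a numerical gradient; the paper's exact measure and step sizes may differ.

```python
# Sketch of the gain-optimized dissimilarity idea: a spectral distance
# minimized over a scalar gain, then gradient descent on the reference
# LPC coefficients. Measure, step size, and n_fft are illustrative.
import numpy as np

def lpc_spectrum(a, n_fft=256):
    # Magnitude spectrum of an all-pole model 1 / A(z),
    # a = [1, a1, ..., ap] are the LPC polynomial coefficients.
    return 1.0 / np.abs(np.fft.rfft(a, n_fft))

def gain_optimized_dist(x, y):
    g = np.dot(x, y) / np.dot(y, y)  # closed-form optimal gain g*
    return np.sum((x - g * y) ** 2)

def adapt_reference(a_ref, x_learner, lr=1e-3, steps=100, eps=1e-6):
    """Pull the reference LPC model toward the learner's spectrum."""
    a = np.asarray(a_ref, dtype=float).copy()
    for _ in range(steps):
        base = gain_optimized_dist(x_learner, lpc_spectrum(a))
        grad = np.zeros_like(a)
        for i in range(len(a)):  # simple numerical gradient
            a_p = a.copy()
            a_p[i] += eps
            grad[i] = (gain_optimized_dist(x_learner, lpc_spectrum(a_p))
                       - base) / eps
        a -= lr * grad
    return a
```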
... The rapid increase in data volume and the explosion of information have become commonplace in the current digital era. Technological advancements provide us access to vast amounts of data from multiple sources, many of which exhibit high-dimensional characteristics such as image data [1]. However, visualizing and understanding high-dimensional data is challenging. ...
Article
Full-text available
The purpose of this research is to investigate the use of information visualisation in deep learning and big data analytics. First, we present a notion that is crucial to the production of visualisations: visualisation repositioning. We can relocate visualisations and extract data from static photos using a pattern recognition-based approach, which frees up more time for artists to create and lessens the workload of designers. Next, we present VisCode, an application that may be used to encode unprocessed data into static images for superior visualisation repositioning. VisCode offers two sorts of repositioning: thematic (which applies several colour themes for the user to select) and representational (which encodes raw data and visualisation types into static visualisation graphics in JSON format). In order to demonstrate VisCode's potential utility and usefulness in the information visualisation space, we conclude with a sample application that involves moving a colour density map visualisation.
... Studies on the importance of training or exposure to foreign language speech production can be found in the work of Saito et al. (2022) and Melnik and Peperkamp (2021), which focuses on high-variability phonetic training, while Arora et al. (2018) stressed the importance of exposing learners to foreign language audio and training them to pronounce words like a native. The importance of modelling was asserted in the work of Lee-Kim (2016) and Bassetti et al. (2018) to help learners imitate speech production better, especially with the support of a controlled computer-assisted pronunciation tool (Tejedor-Garcia et al., 2020). Similar accounts of the importance of speech exposure and training are found in the studies by Cao (2016) and Liakin et al. (2015), where the participants received repeated exposure to L2 speech. ...
Article
Full-text available
Learners of English as a second language (L2) whose first language (L1) is Indonesian tend to struggle to produce aspirated consonants. This study investigates whether the difficulties come from L1 interference in the production of these sounds: [k], [b], [d], and [g] in final position, and [p] and [t] in the stressed syllable. The study involved two cohorts of English department university students with different levels of fluency in L2 speech production. The L2 learners were asked to pronounce 25 words from a textbook previously used to teach them. The learners were exposed to a British English speech model, which became a benchmark for their pronunciation as they were asked to imitate it. Annotation of the data was conducted twice by a second annotator to ensure the objectivity of the scores, which were analysed using a paired-sample t-test. Findings suggest that the sounds with the lowest success rate of production were [p] in the stressed syllable and [k] and [g] in final position. Production was unsuccessful because the L2 learners did not have phonological awareness of how the L2 consonant sounds are produced near-natively and were affected by their L1. This lack of awareness led to the failure to produce the [p], [k], and [g] sounds because these sounds do not exist in their L1, and L1 interference was embedded in their L2 speech production. The formant analysis results using PRAAT indicate an improvement in the participants' pronunciation after exposure to the native speaker's speech. The implications of this research are paramount for L2 learners and lecturers, highlighting the importance of targeted instruction and intervention to address challenges in speech production. Contrasting the phonetic features of L1 and L2 sounds helped the learners resist interference in their L2 speech production. This study encourages continuous assessment of L2 learners to ensure that they maintain consistent, near-native speech production.
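The significance claim above rests on a paired-sample t-test over the two measurement rounds; a minimal sketch with invented pre/post-exposure scores:

```python
# Paired-sample t-test comparing pronunciation scores before and after
# exposure to the native-speaker model. The numbers are invented.
from scipy import stats

before = [62, 55, 71, 48, 66, 59, 70, 52]  # pre-exposure scores
after = [68, 61, 74, 55, 70, 66, 73, 60]   # post-exposure scores

t, p = stats.ttest_rel(after, before)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < 0.05 -> significant improvement
```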
... A cross-disciplinary approach should be adopted to create diverse and personalized musical performances, promoting a holistic education in vocal music. For example, in opera, vocal performance and dance performance are integrated (Tejedor-García et al., 2020). A good operatic performer must execute a strong combination of songs and dances. ...
Article
Full-text available
With the development of internet technology, big data has been used to evaluate the singing and pronunciation quality of vocal students. However, current methods have several problems such as poor information fusion efficiency, low algorithm robustness, and low recognition accuracy under low signal-to-noise ratio. To address these issues, this article proposes a new method for evaluating sound quality based on one-dimensional convolutional neural networks. It uses sound preprocessing, BP neural networks, wavelet neural networks, and one-dimensional CNNs to improve pronunciation quality. The proposed 1D CNN network is more suitable for one-dimensional sound signals and can effectively solve problems such as feature information fusion, pitch period detection, and network construction. It can evaluate singing art sound quality with minimum errors, good robustness, and strong portability. This method can be used for the evaluation and diagnosis of voice diseases, helping to improve students' professional abilities.
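As a rough illustration of the one-dimensional CNN the abstract argues is suited to sound signals, here is a minimal PyTorch classifier over raw audio; the layer sizes are invented and not the paper's architecture.

```python
# Minimal 1D-CNN classifier over raw audio, in the spirit of the
# "one-dimensional CNN for sound quality evaluation" the paper
# proposes. All layer sizes are illustrative.
import torch
import torch.nn as nn

class SoundQualityCNN(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.classifier = nn.Linear(32 * 8, n_classes)

    def forward(self, x):  # x: (batch, 1, n_samples)
        z = self.features(x)
        return self.classifier(z.flatten(1))

model = SoundQualityCNN()
scores = model(torch.randn(2, 1, 16000))  # two 1-second 16 kHz clips
```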
... The computer can teach any subject according to its degree of difficulty, from the simplest to the most difficult. The amount, complexity and degree of detail of the topic can be determined individually according to the level of the students (Sünbül, Gündüz & Yılmaz, 2002;Larkin & Chabay, 2021;Tejedor-García et al., 2020;Tsai, 2019). ...
Article
Full-text available
In this study, the effects of constructivist learning and computer-assisted instruction on the achievement and attitudes of high school 2nd grade students learning a folk literature unit were examined in comparison with traditional teaching methods. The sample consisted of 60 high school 2nd grade students studying at a high school in Arkalyk, Kazakhstan. The folk literature unit was taught using constructivist learning and the computer-assisted instruction method in the experimental group, while traditional teaching methods were used in the control group. Constructivist teaching was organized according to the 5E model and combined with computer-assisted instruction. The experimental applications of the research lasted 7 weeks. A folk literature achievement test and an attitude scale towards folk literature were used to collect the data. As a result, the students who studied with constructivist learning and the computer-assisted instruction method reached higher achievement levels in the folk literature unit than the students taught with traditional methods. The study also found that constructivist learning and computer-assisted instruction practices significantly improved students' attitudes towards folk literature subjects.
... Participants indicated that playing the game using an artificial intelligence-supported voice recognition website boosted their pronunciation skills as well as their word memory capacities, prompting them to think about whatever sections perplexed them. This coincides with statements by academics in pronunciation education (Dillon & Wells, 2021; Tejedor-García et al., 2020; Spring & Tabuchi, 2021) that effective techniques may assist learners by encouraging them to participate actively and think critically. The artificial intelligence-supported speech recognition pronunciation method offers significant potential to reinvent pronunciation teaching with appropriate techniques. ...
Article
Full-text available
Correct pronunciation significantly increases the intelligibility of communication. However, it is uncertain whether acquiring the pronunciation of words enhances word retention. Therefore, the major purpose of this research is to evaluate whether vocabulary acquisition aided by pronouncing with artificial intelligence leads to longer retention. A true experimental design with a pre-test and post-test control group was applied. A total of 56 high school students aged 14-15 were asked to memorize unknown vocabulary with two pronunciation teaching methods. Prior to the experimental process, the pre-test was applied to both groups; then the artificial intelligence-based speech recognition pronunciation teaching process was applied to the experimental group, and the phonetic alphabet pronunciation process to the control group, in the 4th, 8th, and 12th weeks. According to the findings, pronunciation practice via artificial intelligence enabled the words to remain in memory longer. Additionally, the participants' views were gathered at the end of the research. For further research, this study offers a new artificial intelligence-supported pronunciation model with a variety of accessible tools, recording and reacting to learners' pronunciation practice in different languages.
... The findings demonstrated that the ASR-based feedback has an influence on correcting errors in teaching. Tejedor-García et al. (2020) in their study also demonstrated a significant pronunciation improvement among those learners who used the CAPT tool, as well as a correlation between the automatic CAPT assessment of users and human raters' assessment of post-tests. In the same vein, Pourhosein et al. (2020) investigated Iranian teachers' role in using CAPT in teaching pronunciation. ...
... It can be used to reduce the time consumed by reading, via audiobooks generated with TTS. It can also enhance literacy skills, reading comprehension, the ability to recall information, and pronunciation [38,40]. It also makes digital materials more conveniently accessible to anyone searching for them. ...
Article
Full-text available
This paper presents an investigation of speaker adaptation using a continuous vocoder for parametric text-to-speech (TTS) synthesis. For applications that demand low computational complexity, conventional vocoder-based statistical parametric speech synthesis can be preferable. While capable of remarkable naturalness, recent neural vocoders nonetheless fall short of the criteria for real-time synthesis. We investigate our earlier continuous vocoder, in which the excitation is characterized by two one-dimensional parameters: Maximum Voiced Frequency and continuous fundamental frequency (F0). We show that an average voice can be trained for deep neural network-based TTS using data from nine English speakers. We performed speaker adaptation experiments for each target speaker with 400 utterances (approximately 14 minutes). Using recurrent neural network topologies, we obtained an apparent enhancement in the quality and naturalness of synthesized speech compared to our previous work. According to the objective studies (Mel-Cepstral Distortion and F0 correlation), the quality of speaker adaptation using the continuous vocoder-based DNN-TTS is slightly better than the WORLD vocoder-based baseline. The subjective MUSHRA-like test results also showed that our speaker adaptation technique is almost as natural as the WORLD vocoder when using Gated Recurrent Unit and Long Short-Term Memory networks. The proposed vocoder, being capable of real-time synthesis, can be used for applications that need fast synthesis.
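Mel-Cepstral Distortion, the objective measure cited above, has a standard closed form over time-aligned cepstral sequences; a minimal implementation:

```python
# Mel-Cepstral Distortion (MCD) between reference and synthesized
# mel-cepstra, computed frame by frame and averaged.
import numpy as np

def mcd(ref: np.ndarray, syn: np.ndarray) -> float:
    """ref, syn: (n_frames, n_coeffs) mel-cepstra, 0th coefficient
    excluded. Frames are assumed already time-aligned (e.g., by DTW)."""
    diff = ref - syn
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * float(np.mean(per_frame))
```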
... The tool is integrated into a pre/post-test design experiment with native speakers of Spanish to assess the learners' perception and production improvement. It is expected that the tool will have a positive effect on the results, as has been shown in previous studies using similar methodology [1], [2], [10]. The goal of this study is to investigate to what extent the use of the tool Estoñol affects the results of the post-test. ...
... college psychological guidance work to build a system that meets their own work needs and is sounder at a certain level. Since most college counselors are not psychology majors and have little mastery of related professional knowledge and skills, they should be fully aware of their nonprofessional nature when exploring music therapy theories and techniques, and explore them in a nonprofessional form while learning the essential contents of the related professions [29]. In this way, counselors can construct a music therapy system that suits their nonprofessional characteristics and use it effectively in psychological counseling. ...
Article
Full-text available
This paper analyzes the modeling of a computer-aided piano music automatic notation algorithm, considers the influence of music on psychological detachment, and designs the automatic notation algorithm within a psychological detachment model. The paper investigates the constant Q-transform (CQT), a multiresolution time-frequency representation common in music signal analysis, and finds that although the CQT has higher frequency resolution at low frequencies, this comes at the cost of lower temporal resolution. The variable Q-transform is introduced as a tool for multiple-fundamental-frequency estimation from the time-frequency representation of music signals; it offers better temporal resolution than the CQT at the same frequency resolution, along with efficient coefficient calculation. Short-time Fourier transform and constant Q-transform time-frequency analysis methods are implemented, and note onset detection and multi-pitch detection are implemented based on CNN models. The network structure, training method, and postprocessing method of the CNN are optimized. The paper proposes a temporal structure model for maintaining musical coherence, avoiding manual input and ensuring interdependence between tracks in music generation. It also investigates and implements a method for generating discrete music events over multiple channels, including a multitrack correlation model and a discretization process. Overall, the automatic piano music notation algorithm can significantly enhance the practical effect of psychological detachment.
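The CQT/VQT resolution trade-off described above can be reproduced directly with librosa, which implements both transforms; the file path and parameters below are illustrative.

```python
# Constant-Q transform of a piano recording with librosa; low bins get
# higher frequency resolution at the cost of temporal resolution, the
# trade-off the paper addresses with the variable Q-transform.
import librosa
import numpy as np

y, sr = librosa.load("piano.wav", sr=22050)  # path is illustrative
C = librosa.cqt(y, sr=sr, hop_length=512,
                n_bins=84, bins_per_octave=12)  # 7 octaves, semitone bins
log_C = librosa.amplitude_to_db(np.abs(C))

# librosa also implements the variable Q-transform (librosa.vqt), whose
# gamma parameter relaxes Q at low frequencies for better time resolution.
V = librosa.vqt(y, sr=sr, hop_length=512, n_bins=84, gamma=20.0)
```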
... Computer-Assisted Pronunciation Training (CAPT) is a part of CALL responsible for learning pronunciation skills. It has been shown to help people practice and improve their pronunciation skills [6][7][8]. CAPT consists of two components: an automated pronunciation evaluation component [9][10][11] and a feedback component [12]. The automated pronunciation evaluation component is responsible for detecting pronunciation errors in spoken speech, for example, for detecting words pronounced incorrectly by the speaker. ...
Article
Full-text available
The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%–80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S) and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors, but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in AUC metric by 41% from 0.528 to 0.749 compared to the state-of-the-art approach.
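Of the three generation techniques, P2P conversion is the easiest to sketch: perturb a canonical phoneme sequence with plausible confusions to obtain labeled mispronunciations. The confusion table and error rate below are invented placeholders, not the paper's learned conversion model.

```python
# Sketch of the phoneme-to-phoneme (P2P) idea: perturb a correct phone
# sequence with substitutions to label synthetic mispronounced training
# examples. The confusion table and error probability are invented.
import random

CONFUSIONS = {"ih": ["iy"], "ae": ["eh"], "th": ["t", "s"], "v": ["w"]}

def p2p_perturb(phones: list[str], p_err: float = 0.2, seed: int = 0):
    rng = random.Random(seed)
    out, labels = [], []  # labels: 1 = mispronounced phone
    for ph in phones:
        if ph in CONFUSIONS and rng.random() < p_err:
            out.append(rng.choice(CONFUSIONS[ph]))
            labels.append(1)
        else:
            out.append(ph)
            labels.append(0)
    return out, labels

phones, labels = p2p_perturb(["dh", "ih", "s", "ih", "z"])  # "this is"
```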
... Chinese people are usually used to using Chinese pronunciation for English pronunciation. This often leads to inaccurate English pronunciation, making it difficult for the other party to understand. Therefore, using accurate English pronunciation is the key to improving the quality of oral English pronunciation [2]. ...
Article
Full-text available
The accuracy of English pronunciation is a key index for evaluating the quality of English teaching. Correct pronunciation and smooth language flow are what every student expects from English learning. To address the poor accuracy and slow speed of the original SSD (Single Shot MultiBox Detector) algorithm in English teaching pronunciation detection, this paper proposes a clustering-based, improved SSD algorithm for pronunciation detection and recognition. The algorithm improves the concept module to enhance the network's feature extraction ability and detection speed. Meanwhile, it integrates multiscale features to realize multilayer reuse and equalization of features, improving the detection of small-target sounds. The algorithm extracts more features by introducing a channel attention mechanism, which increases detection accuracy while reducing computation. To optimize the network's ability to extract target location information, the K-means clustering method is used to set default-box parameters that better match the characteristics of the target samples. The experimental results show that the proposed algorithm can accurately evaluate the pronunciation quality of reading aloud and thus effectively reflect the reader's oral English level.
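Setting default-box parameters with K-means is conventionally done by clustering ground-truth box shapes under a 1-IoU distance; the sketch below follows that convention (the paper may cluster different quantities), with invented box dimensions.

```python
# K-means over ground-truth box shapes with a 1-IoU distance, the usual
# way default boxes are fit to target samples. Dimensions are invented.
import numpy as np

def iou_wh(wh, centers):
    # IoU of boxes that share a corner, so only width/height matter.
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    areas = wh[:, 0] * wh[:, 1]
    union = areas[:, None] + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh: np.ndarray, k: int = 6, iters: int = 50, seed: int = 0):
    wh = wh.astype(float)
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Max IoU assignment is equivalent to min (1 - IoU) distance.
        assign = np.argmax(iou_wh(wh, centers), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers
```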
... With the development of Internet technology, the current stage of CALL system can meet the interactive L2 learning scenarios of users, for instance, remembering words by looking at pictures and watching videos, and practicing pronunciation by simulating dialogues in scenes. CALL systems that focus on L2 pronunciation learning are called Computer Assisted Pronunciation Training (CAPT) [7,8]. ...
Article
Full-text available
The existing English pronunciation error detection methods are more oriented to the detection of wrong pronunciation, and lack of targeted improvement suggestions for pronunciation errors. With the aim of solving this problem, the paper proposes an English pronunciation error detection system based on improved random forest. Firstly, a speech corpus is constructed along with the evaluation of the acoustic features. Then an improved random forest detection algorithm is designed. The algorithm inputs rare mispronunciation data into a GAN neural network to generate new class samples and improve the uneven distribution of mispronunciation data in the sample set. The distribution rules of the pronunciation data are extracted layer by layer by stacking deep SDAEs, and the coefficient penalties and reconstruction errors of each coding layer are combined to identify the features associated with the wrong pronunciation in the high-dimensional data. Furthermore, a forest decision tree is constructed using the reduced-dimensional feature-based data to improve the pronunciation detection accuracy. Finally, the extracted 39 Mel Frequency Cepstral Coefficient (MFCC) acoustic features are used as the input of the improved random forest classifier to construct a classification error detection model. The experimental results indicate that the designed system achieves a high accuracy of English pronunciation detection.
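The 39 acoustic features named in the abstract are the standard 13 MFCCs plus delta and delta-delta coefficients. The sketch below extracts them with librosa and feeds a plain scikit-learn random forest; the GAN oversampling and SDAE feature-reduction stages of the proposed system are not reproduced here.

```python
# The 39-dimensional layout named in the abstract: 13 MFCCs plus delta
# and delta-delta coefficients, fed to a random forest. This is a plain
# scikit-learn forest, not the paper's GAN/SDAE-augmented variant.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mfcc39(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([m, librosa.feature.delta(m),
                       librosa.feature.delta(m, order=2)])  # (39, frames)
    return feats.mean(axis=1)  # one 39-dim vector per utterance

# Illustrative usage, with hypothetical paths and 0/1 error labels:
# X = np.stack([mfcc39(p) for p in paths]); y = np.array(labels)
# clf = RandomForestClassifier(n_estimators=200).fit(X, y)
```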
... DL is a type of machine learning algorithm that attempts to obtain a high-level abstract representation of data that includes multiple layers of non-linear mapping [10][11][12]. DL is within the scope of representation learning in machine learning. Data can be expressed on many levels. ...
Article
Full-text available
With the advancement of globalization, an increasing number of people are learning and using a common language as a tool for international communication. However, there are clear distinctions between the native language and the target language, especially in pronunciation, and the domestic target-language learning environment is far from ideal, with few competent teachers. The efficient combination of computer technology and language teaching and learning methods, known as computer-assisted language learning (CALL) technology, provides a new solution to this problem. The core of CALL is speech recognition (SR) technology and speech evaluation technology. The development of deep learning (DL) has greatly promoted the development of speech recognition. The pronunciation data collected from Chinese college students, who major in language education or plan to improve their pronunciation, are the research object of this paper. The study applies deep learning to the evaluation of target-language pronunciation and builds a standard evaluation model of pronunciation teaching based on the deep belief network (DBN). On this basis, this work improves the traditional pronunciation quality evaluation method, comprehensively considers intonation, speaking speed, rhythm, and other multi-parameter indicators and their weights, and establishes a reasonable and efficient pronunciation model. The systematic research results show that this article has theoretical and practical value in the field of phonetics education.
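At scoring time, the "multi-parameter indicators and their weights" amount to a weighted sum of sub-scores; a minimal sketch with invented weights and sub-scores:

```python
# Weighted multi-parameter pronunciation score in the spirit described:
# sub-scores for intonation, speed, and rhythm combined with fixed
# weights. Both the weights and sub-scores are invented placeholders.
WEIGHTS = {"intonation": 0.4, "speed": 0.2, "rhythm": 0.4}

def overall_score(subscores: dict[str, float]) -> float:
    """subscores: each in [0, 100]; returns the weighted total."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

print(overall_score({"intonation": 80, "speed": 90, "rhythm": 75}))  # 80.0
```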
... The CAPT system is one of CALL's essential tools designed for automatically evaluating and detecting the learner's pronunciation errors. In the CAPT system, pronunciation evaluation can happen at two levels [20,21,25]: (i) detecting specific pronunciation errors, and (ii) an overall assessment of a speaker's proficiency (i.e., goodness of pronunciation (GoP) [29,5,28]). This study seeks to examine the use of ASR to empower learners to practice and improve their pronunciation on their own. ...
Conference Paper
Full-text available
Pronunciation is one of the fundamentals of language learning, and it is considered a primary factor of spoken language when it comes to understanding and being understood by others. The persistent presence of high error rates in speech recognition domains resulting from mispronunciations motivates us to find alternative techniques for handling mispronunciations. In this study, we develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers, identifies the commonly mispronounced phonemes of Italian learners of English, and presents an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora. To detect mispronunciations, we used a phone-based ASR implemented using Kaldi. We used two labeled corpora of non-native English: (i) a corpus of Italian adults containing 5,867 utterances from 46 speakers, and (ii) a corpus of Italian children consisting of 5,268 utterances from 78 children. Our results show that the selected error model can discriminate correct sounds from incorrect sounds in both native and non-native speech, and can therefore be used to detect pronunciation errors in non-native speech. The phone error rates improve when using the error language model. Furthermore, the ASR system shows better accuracy after applying the error model to our selected corpora.
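The phone error rate reported here is the phone-level Levenshtein distance normalized by the reference length; a self-contained implementation, with an illustrative Italian-L1 substitution for "think":

```python
# Phone error rate (PER): Levenshtein distance between reference and
# hypothesized phone sequences, normalized by the reference length.
def phone_error_rate(ref: list[str], hyp: list[str]) -> float:
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# "think" /th ih ng k/ with a typical Italian-L1 substitution th -> t:
print(phone_error_rate(["th", "ih", "ng", "k"],
                       ["t", "ih", "ng", "k"]))  # 0.25
```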
... Computer-aided pronunciation training (CAPT) systems give students feedback about the quality of their pronunciation [1,2] and have been shown to have a positive impact on their learning and motivation [3]. CAPT methods for phone-level pronunciation scoring can be classified into two groups, depending on whether or not non-native data with pronunciation labels is used for training the system. ...
Preprint
Full-text available
Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators. Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet, such systems face the challenge that datasets labelled for this task are scarce and usually small. In this paper, we present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring. We analyze the effect of several design choices and compare the performance with a state-of-the-art goodness of pronunciation (GOP) system. Our final system is 20% better than the GOP system on EpaDB, a database for pronunciation scoring research, for a cost function that prioritizes low rates of unnecessary corrections.
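The goodness-of-pronunciation (GOP) baseline the system is compared against is commonly computed as the average log posterior of the canonical phone over its aligned frames (formulations vary across papers); a minimal sketch:

```python
# Classic goodness-of-pronunciation (GOP) score: average log posterior
# of the canonical phone over the frames aligned to it.
import numpy as np

def gop(frame_posteriors: np.ndarray, phone_idx: int) -> float:
    """frame_posteriors: (n_frames, n_phones) softmax outputs for the
    frames aligned to one phone; phone_idx: the canonical phone."""
    eps = 1e-10
    return float(np.mean(np.log(frame_posteriors[:, phone_idx] + eps)))

# A very negative GOP flags a likely mispronunciation; a per-phone
# threshold turns the score into a correct/incorrect decision.
```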
... Conflicts among the rules governing English pronunciation errors are very common. Some usages are established by convention and may not conform to the usual rules of English [14]. The conventional method is to write more rules to capture such distinctions, but this may cause more conflicts, expanding the rule set without bound. ...
Article
Full-text available
For a correction system for English pronunciation errors, the level of correction performance and the reliability, practicability, and adaptability of information feedback are the main basis for evaluating its overall quality. In view of the disadvantages of traditional English pronunciation correction systems, such as failure to give timely feedback on and correction of learners' pronunciation errors, slow improvement of learners' English proficiency, and even misleading of learners, it is imperative to design a scientific and efficient automatic correction system for English pronunciation errors. High-sensitivity acoustic wave sensors can identify the English pronunciation error signal and convert the dimension of the collected pronunciation signal according to channel configuration information; acoustic wave sensors can then assist the automatic correction system to filter out interference components in the output signal, analyze the real-time spectrum, and evaluate the sensitivity of the acoustic wave sensor. On the basis of summarizing and analyzing previous research, this paper expounds the current research status and significance of automatic correction systems for English pronunciation errors; elaborates the development background, current status, and future challenges of high-sensitivity acoustic wave sensor technology; introduces the methods and principles of time-domain signal amplitude measurement and pronunciation signal preprocessing; carries out the optimization design of pronunciation recognition sensors and the improvement design of pronunciation recognition processors; proposes the hardware design of the system; analyzes the design of the acquisition program and the parameter extraction of the English pronunciation error signal; discusses the software design of the system; and finally conducts a system test and analyzes its results. The study results show that the automatic correction system for English pronunciation errors assisted by high-sensitivity acoustic wave sensors can realize automatic correction of amplitude linearity, sensitivity, repeatability error, and return error, and has robust functions for automatic real-time data collection, processing, saving, query, and retesting. The system can also minimize external interference and improve the accuracy of the sensors' sensitivity calibration, and it provides functions such as reading and saving English pronunciation error signals and visual operation, which effectively improves the ease of use and completeness of the correction system. These results provide a reference for further research on the design of automatic correction systems for English pronunciation errors assisted by high-sensitivity acoustic wave sensors.
The traditional English pronunciation correction system cannot provide timely feedback and correction for learners’ pronunciation errors and has disadvantages such as misleading learners and slow improvement of learners’ English proficiency [1]. For the automatic correction system for English pronunciation errors, the level of correction performance and the reliability and practicability of information feedback are the main basis for evaluating its comprehensive performance. The quality of the correction algorithm determines the correction performance, and a reasonable error detection method guarantees [2]. After decomposing and optimizing each subtarget in the multitarget, the high-sensitivity acoustic wave sensor will trade off and coordinate them to make each subtarget. This is because the input information and output information required in the automatic correction of English pronunciation errors are related to the open failure system. The automatic correction system for English pronunciation errors can be divided into two parts: system training and pronunciation correction [3]. The training process of the system is similar to the training in the automatic pronunciation recognition system. The known standard pronunciation information features are extracted and recorded as the standard for pronunciation correction. Pronunciation correction is to correct the pronunciation accuracy of the pronunciation to be tested. The basic process is to extract the features of the pronunciation to be tested, compare its standard pronunciation features, and calculate the score based on the similarity [4]. The high-sensitivity acoustic wave sensor can follow the artificial neural network model, use target tracking to design an automatic correction system, and form an abstract logic layer by combining the characteristics of English pronunciation errors. The similarity between the single target tracking algorithm and the traditional neural network is that they both use a hierarchical structure to construct the logical layer, but the difference is that the three-layer construction mode is the most suitable for automatic correction system [5]. Relying on the optimized design of the pronunciation recognition sensor and the improved design of the pronunciation recognition processor, the software design of the system is completed based on the design of the English pronunciation acquisition program and the extraction of English pronunciation error signal parameters. In this process, although the amount of data is large and the calculations are more complicated, the calculation process of each sentence is the same [6]. It is necessary to use analog-digital signal conversion to improve the data sampling efficiency, and the sampling efficiency is not less than a certain value and the single-target tracking algorithm is used to continuously perform repeated iterative calculations [7]. In pronunciation recognition, a multifrequency oscillator is designed to automatically calibrate the pronunciation accuracy, while the calibration of the circuit conversion is the key to realize the conversion of the English printing information mode. By collecting and controlling the original pronunciation information of the circuit, the accuracy of system’s automatic correction data can be improved [8]. 
Based on the summary and analysis of previous research results, this paper expounds the current research status and significance of the design of automatic correction systems for English pronunciation errors; elaborates the development background, current status, and future challenges of high-sensitivity acoustic wave sensor technology; introduces the methods and principles of time-domain signal amplitude measurement and pronunciation signal preprocessing; carries out the optimization design of the pronunciation recognition sensor; performs the improvement design of the pronunciation recognition processor; proposes the hardware design of the automatic correction system based on the assistance of high-sensitivity acoustic wave sensors; analyzes the acquisition program design for English pronunciation errors; implements the parameter extraction of the English pronunciation error signal; discusses the software design of the system; and finally conducts a system test and analyzes its results. The detailed chapters are arranged as follows: Section 2 introduces the methods and principles of time-domain signal amplitude measurement and pronunciation signal preprocessing; Section 3 proposes the hardware design of the automatic correction system based on high-sensitivity acoustic wave sensors; Section 4 discusses the corresponding software design; Section 5 conducts the system test and analyzes its results; Section 6 concludes.

2. Methods and Principles

2.1. Amplitude Measurement of the Time-Domain Signal

From the perspective of the characteristics of the automatic correction system for English pronunciation errors, the assistance of the high-sensitivity acoustic wave sensor is actually a system function used to obtain the required frequency response characteristics, and the same is true for digital filtering. For a linear time-invariant causal analog system, the relationship between its input and output is

$$y(t) = x(t) * h(t) = \int_{0}^{t} x(\tau)\, h(t - \tau)\, d\tau,$$

where $x(t)$ is the input of the system; $y(t)$ is the output response of the system; $t$ is the continuous time variable; $h(t)$ is the transfer (impulse response) function of the system; and $*$ is the convolution operator. For an input English phoneme $q$ of the system, given the observation vector $o_t$ of each frame of the $i$-th segment of pronunciation related to it, its frame-based posterior probability is calculated as

$$p(q \mid o_t) = \frac{p(o_t \mid q)\, p(q)}{\sum_{q'} p(o_t \mid q')\, p(q')},$$

where $p(o_t \mid q)$ is the probability distribution of the observation vector for a given phoneme $q$; $p(q)$ is the prior probability of phoneme $q$; and the denominator sums over all text-independent phonemes. The design of the high-sensitivity acoustic-wave-sensor-assisted automatic correction system for English pronunciation errors uses first-level calibration to measure the sensitivity of a standard acoustic wave sensor, so the final formula for the sensitivity of the sensor under test is

$$M_x = M_0 \cdot \frac{A_x}{A_0},$$

where $M_0$ is the sensitivity of the standard acoustic wave sensor; $M_x$ is the sensitivity of the acoustic wave sensor under test; $A_x$ is the amplitude measured by the sensor under test; and $A_0$ is the amplitude measured by the reference sensor.
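To make the frame-based posterior concrete, here is a small numerical sketch (our own illustration, with toy log-likelihoods standing in for the outputs of a real acoustic model) of applying Bayes' rule per frame and normalizing over the phoneme set, as in the second equation above:

```python
import numpy as np

def frame_posteriors(loglikes: np.ndarray, priors: np.ndarray) -> np.ndarray:
    """Bayes' rule per frame: p(q|o) is proportional to p(o|q) p(q), normalized over phonemes.

    loglikes: (frames x phonemes) log-likelihoods log p(o_t | q)
    priors:   (phonemes,) prior probabilities p(q)
    """
    log_joint = loglikes + np.log(priors)               # log p(o_t|q) + log p(q)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # stabilize the exponentiation
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)     # denominator of Bayes' rule

# Toy usage with 3 frames and 4 phonemes under a uniform prior.
rng = np.random.default_rng(1)
post = frame_posteriors(rng.normal(size=(3, 4)), np.full(4, 0.25))
print(post.sum(axis=1))  # each row sums to 1
```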
The development of the correction system first recognizes the English pronunciation error signal and then performs dimensional conversion on the collected signal according to the channel configuration information. Next, the high-sensitivity acoustic wave sensor is embedded in the correction system. The measurement process first filters the pronunciation signal to remove the interference components in the output signal of the acoustic wave sensor; the system then performs real-time spectrum analysis on the filtered English pronunciation error signal and evaluates the sensitivity of the acoustic wave sensor [9]. The system can therefore minimize external interference and improve the accuracy of sensor sensitivity calibration when there is interference in the on-site environment. In addition, the software provides auxiliary functions such as reading and saving the pronunciation error signal and operating the visualization area, which improve the ease of use and completeness of the system. The system uses a controlled signal source and an oscilloscope to send and collect English pronunciation error signals. Because the number of measurements is limited, a loop control structure is added so that the sensitivity of the sensor under test is measured over a fixed number of cycles, and oscilloscope acquisition calls are added to the program to ensure the integrity of signal reception, finally realizing channel triggering and channel reception.

2.2. Pronunciation Signal Preprocessing

After the high-sensitivity acoustic wave sensor calculates the ratio of the signal output to the input, the system function can be obtained by applying the Laplace transform to this ratio. The acoustic wave sensor is designed by the impulse response method, and the recursion used for pronunciation error correction takes the general Viterbi form

$$\delta_t(i) = \max_{1 \le j \le N} \big[\delta_{t-1}(j)\, a_{ji}\big]\, b_i(o_t), \qquad \psi_t(i) = \arg\max_{1 \le j \le N} \big[\delta_{t-1}(j)\, a_{ji}\big],$$

where $a_{ji}$ is the sensor (transition) coefficient from state $j$ to state $i$; $\delta_t(i)$ is the cumulative output probability of the $i$-th state at time $t$; $\psi_t(i)$ is the previous state number of the $i$-th state at time $t$; and backtracking through $\psi$ yields the optimal state sequence. The logarithm of the posterior probability of phoneme $p$ in the $i$-th segment of pronunciation is taken for each frame of the English pronunciation error signal, and the logarithmic posterior probability score of the phoneme under the $i$-th segment is obtained as

$$\mathrm{GOP}(p) = \frac{1}{d_i} \log p\big(p \mid \mathbf{o}^{(i)}\big),$$

where $d_i$ is the duration of the $i$-th time period corresponding to phone $p$; $1/d_i$ is the normalization factor for that time period; $p(\mathbf{o}^{(i)} \mid p)$ is the likelihood of the segment in the $i$-th time period for phone $p$; and $\mathrm{GOP}(p)$ is the final output probability score. The acoustic-wave-sensor-assisted automatic correction system treats English pronunciation error detection as an ordinary classification problem and uses a classification model to solve it. This model is based on a four-layer feed-forward network that includes a pronunciation vector mapping table; the forward calculation of an input layer vector through the feed-forward network is

$$\mathbf{h}^{(l)} = f\big(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\big),$$

where $W^{(l)}$ is the network weight matrix; $\mathbf{b}^{(l)}$ is the bias value; $f$ is the activation function; $\mathbf{h}^{(l)}$ is the output value of the corresponding layer; $\eta$ is the learning rate used when training the network; $d$ is the dimension of the pronunciation error vector; and $V$ is the size of the pronunciation error vector table. English pronunciation error preprocessing includes sampling of the English pronunciation error signal and antialiasing band-pass filtering to remove individual pronunciation differences and noise effects caused by the equipment and the environment.
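The following minimal sketch (our own toy example; the layer sizes and the tanh/softmax choices are assumptions, not the paper's configuration) shows the kind of forward pass the four-layer feed-forward classifier performs:

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass h = f(W h + b) layer by layer; softmax output over error classes."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)                  # hidden layers with tanh activation f
    logits = weights[-1] @ h + biases[-1]       # final affine layer
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return e / e.sum()

# Toy usage: a 13-dim feature vector classified into 4 pronunciation-error classes.
rng = np.random.default_rng(2)
dims = [13, 32, 32, 4]                          # four layers, as in the text
Ws = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]
print(forward(rng.normal(size=13), Ws, bs))     # probabilities summing to 1
```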
The English pronunciation error signal is an unstable random process, so it must be processed in short segments with the high-sensitivity acoustic wave sensor, which involves primitive selection and endpoint detection for pronunciation recognition [10]. Endpoint detection, an important part of preprocessing, refers to determining the start and end of the pronunciation within the English pronunciation error signal. Pronunciation recognition is a process of digitally processing English pronunciation errors: before processing, the signal must be digitized through analog-to-digital conversion. The analog-to-digital conversion goes through two steps, sampling and quantization, to obtain a signal that is discrete in both time and amplitude; preemphasis is usually performed after antialiasing filtering and before the transformation. After the system obtains the learner's follow-up pronunciation, it extracts its features, calculates the similarity between it and the standard pronunciation in the test question bank, and finally maps the similarity to a grade score that is easier for the learner to understand and accept.

3. Hardware Design of the Automatic Correction System Based on High-Sensitivity Acoustic Wave Sensors

3.1. Optimization Design of the Pronunciation Recognition Sensor

In order to ensure the accuracy, reliability, unity, and adaptability of English pronunciation error correction and to adapt to the development trend of automatic correction, the hardware design must apply effective measurement supervision to the accuracy and reliability of the acoustic wave sensor's measurement value transmission, so as to standardize and perfect the calibration of the sensor. The main components include the raster (grating) data conditioning module, the conditioning module for the output of the sensor to be calibrated, the acquisition device, and the computer system. These modules realize, respectively, the correction of the amplitude linearity, sensitivity, repeatability error, and return error of English pronunciation errors, and provide the functions of automatic real-time data collection, data processing, storage, query, and remeasurement. The grating ruler's output is converted into the corresponding electrical signal by its conditioning circuit, processed by the formant acquisition card, and input into the system as the English pronunciation signal. The sensor to be calibrated outputs the corresponding voltage or current through its conditioning circuit to the data acquisition device; serial communication is established through the interface, the data acquisition device is placed in the reset state, the trigger condition is established, and the control settings are initialized. The model can then enter the working state, open the serial port, and input the output signal into the computer system, which finally responds with the analysis and processing of the collected data [11]. Figure 1 shows the design framework of the automatic correction system for English pronunciation errors assisted by high-sensitivity acoustic wave sensors.
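As a concrete illustration of the preemphasis and endpoint detection steps just described (a minimal sketch under our own assumptions: a simple short-time-energy threshold, not the paper's detector), consider:

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], boosting high frequencies before analysis."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def endpoint_detect(x, sr, frame_ms=25, hop_ms=10, thresh_ratio=0.1):
    """Return (start, end) sample indices of speech via short-time energy."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(1, 1 + (len(x) - frame) // hop)
    energy = np.array([np.sum(x[i * hop : i * hop + frame] ** 2) for i in range(n_frames)])
    active = np.where(energy > thresh_ratio * energy.max())[0]
    if active.size == 0:
        return 0, len(x)
    return active[0] * hop, min(len(x), active[-1] * hop + frame)

# Toy usage: 0.2 s of silence, a 0.5 s tone, then 0.3 s of silence at 16 kHz.
sr = 16000
sig = np.concatenate([np.zeros(3200),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(8000) / sr),
                      np.zeros(4800)])
print(endpoint_detect(preemphasis(sig), sr))  # indices bracketing the tone, about (2880, 11440)
```

A production system would typically combine energy with zero-crossing rate or a trained voice-activity detector, but the thresholding idea is the same.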
... With the ever-increasing surge of learning foreign languages and the continued progress of automatic speech recognition (ASR), computer-assisted pronunciation training (CAPT) has established itself as an integral means to alleviate the lack of qualified teachers and offer an individualized and self-paced environment for second-language (L2) learners to improve their speaking proficiency [1,2]. Essential to the success of a CAPT system is the accuracy of the mispronunciation detection (MD) module, which is concerned with the identification of erroneous pronunciations in the utterance of an L2 learner that differ from the canonical pronunciations of the corresponding text prompt presented to the learner. ...
Preprint
Full-text available
Recently, end-to-end (E2E) models, which take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in the development of mispronunciation detection (MD) systems. However, due to the lack of sufficient labeled speech data from L2 speakers for model estimation, E2E MD models are more prone to overfitting than conventional ones built on DNN-HMM acoustic models. To alleviate this critical issue, we propose in this paper two modeling strategies to enhance the discrimination capability of E2E MD models, which implicitly leverage the phonetic and phonological traits encoded in a pretrained acoustic model and contained within the reference transcripts of the training data, respectively. The first is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model. The second is label augmentation, which manages to capture more phonological patterns from the transcripts of the training data. A series of empirical experiments conducted on the L2-ARCTIC English dataset seem to confirm the efficacy of our E2E MD model when compared to some top-of-the-line E2E MD models and a classic pronunciation-scoring-based method built on a DNN-HMM acoustic model.
... It has been shown that Computer-Assisted Pronunciation Training (CAPT) helps people practice and improve pronunciation skills [1,2]. Despite significant progress over the last two decades, standard methods are still unable to detect mispronunciations with high accuracy. ...
... Emotional states affect human interaction through, for example, facial expression [3,4], body posture [5], communication content [6], and speech mannerisms [7]. Because emotions are influenced by multimodal features, recognizing them is vital for developing automatic speech emotion recognition systems that can understand these interactions, and various speech recognition systems for pronunciation training and learning have been developed [8,9]. Currently, most emotion recognition tasks are accomplished with hand-crafted features, which requires guarantees of data validity and quantity. ...
Article
Full-text available
Speech signals contain abundant information on personal emotions, which plays an important part in the representation of human potential characteristics and expressions. However, the scarcity of emotional speech data hampers the development of speech emotion recognition (SER) and limits improvements in recognition accuracy. Currently, the most effective approach is to use unsupervised feature learning techniques to extract speech features from available speech data and build emotion classifiers on these features. In this paper, we propose implementing autoencoders such as a denoising autoencoder (DAE) and an adversarial autoencoder (AAE) to extract features from LibriSpeech for model pre-training, and then conduct experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset for classification. Considering the imbalance of the data distribution in IEMOCAP, we developed a novel data augmentation approach to optimize the overlap shift between consecutive segments and redesigned the data division. The best classification accuracy reached 78.67% (weighted accuracy, WA) and 76.89% (unweighted accuracy, UA) with the AAE. Compared with the best state-of-the-art results known to us (76.18% WA and 76.36% UA with a supervised learning method), we achieved a slight advantage. This suggests that unsupervised learning benefits the development of SER and provides a new approach to the problem of data scarcity.
... Additionally, speech can be applied without any physical contact, making it an ideal signal for HMI applications. Speech recognition tasks can be utilized in smart input systems [105], [106], automatic transcription systems [107], [108], smart voice assistants [109], computer-assisted speech [110], rehabilitation [111], [112], and language teaching. Similar to the face recognition task, which focuses on recognizing an individual human from the facial information enclosed in a single image, the speaker recognition task tries to achieve the same goal using the vocal tone information of the subject. ...
Preprint
Full-text available
Cyber-physical systems (CPSs) have been going through a transition from individual systems to collectives of systems that collaborate to achieve a highly complex cause, realizing a system-of-systems approach. The automotive domain has been making this transition with the aim of providing a series of emergent functionalities like traffic management, collaborative car fleet management, or large-scale automotive adaptation to the physical environment, thus providing significant environmental benefits (e.g., air pollution reduction) and achieving significant societal impact. Similarly, large infrastructure domains are evolving into global, highly integrated cyber-physical systems of systems covering all parts of the value chain. In practice, there are significant challenges in CPSoS applicability and usability to be addressed; even a small CPSoS such as a car consists of several subsystems. Decentralization of a CPSoS appoints tasks to individual CPSs within the system of systems. CPSoSs are heterogeneous systems: they comprise various autonomous CPSs, each of them having unique performance capabilities, criticality levels, priorities, and pursued goals. All CPSs must also harmonically pursue system-based achievements and collaborate in order to make system-of-systems-based decisions and implement the CPSoS functionality. This survey provides a comprehensive review of current best practices in connected cyber-physical systems. The basis of our investigation is a dual-layer architecture encompassing a perception layer and a behavioral layer. Perception algorithms with respect to scene understanding (object detection and tracking, pose estimation), localization, mapping, and path planning are thoroughly investigated. The behavioral part focuses on decision making and human-in-the-loop control.
... It has been shown that Computer-Assisted Pronunciation Training (CAPT) helps people practice and improve pronunciation skills [1,2]. Despite significant progress over the last two decades, standard methods are still unable to detect mispronunciations with high accuracy. ...
Preprint
Full-text available
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to the limited amount of mispronounced L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech that is easily accessible and can be automatically annotated. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in AUC metric by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
Article
Full-text available
Cyber–physical systems (CPSs) are evolving from individual systems to collectives of systems that collaborate to achieve highly complex goals, realizing a cyber–physical system of systems (CPSoSs) approach. They are heterogeneous systems comprising various autonomous CPSs, each with unique performance capabilities, priorities, and pursued goals. In practice, there are significant challenges in the applicability and usability of CPSoSs that need to be addressed. The decentralization of CPSoSs assigns tasks to individual CPSs within the system of systems. All CPSs should harmonically pursue system-based achievements and collaborate to make system-of-system-based decisions and implement the CPSoS functionality. The automotive domain is transitioning to the system of systems approach, aiming to provide a series of emergent functionalities like traffic management, collaborative car fleet management, or large-scale automotive adaptation to the physical environment, thus providing significant environmental benefits and achieving significant societal impact. Similarly, large infrastructure domains are evolving into global, highly integrated cyber–physical systems of systems, covering all parts of the value chain. This survey provides a comprehensive review of current best practices in connected cyber–physical systems and investigates a dual-layer architecture entailing perception and behavioral components. The presented perception layer entails object detection, cooperative scene analysis, cooperative localization and path planning, and human-centric perception. The behavioral layer focuses on human-in-the-loop (HITL)-centric decision making and control, where the output of the perception layer assists the human operator in making decisions while monitoring the operator’s state. Finally, an extended overview of digital twin (DT) paradigms is provided so as to simulate, realize, and optimize large-scale CPSoS ecosystems.
Article
Full-text available
This systematic review maps the trends of computer-assisted pronunciation training (CAPT) research based on the pedagogy of second language (L2) pronunciation instruction and assessment. The review was limited to empirical studies investigating the effects of CAPT on healthy L2 learners' pronunciation. Thirty peer-reviewed journal articles published between 1999 and 2022 were selected based on specific inclusion and exclusion criteria. Data were collected about the studies' contexts, participants, experimental designs, CAPT systems, pronunciation training scopes and approaches, pronunciation assessment practices, and learning measures. Using a pedagogically informed codebook, the pronunciation training and assessment practices were classified and evaluated based on established L2 pronunciation teaching guidelines. The findings indicated that most of the studies focused on the pronunciation training of adult English learners with an emphasis on the production of segmental features (i.e. vowels and consonants) rather than suprasegmental features (i.e. stress, intonation, and rhythm). Despite the innovation promised by CAPT technology, pronunciation practice in the studies reviewed was characterized by the predominant use of drilling through listen-and-repeat and read-aloud activities. As for assessment, most CAPT studies relied on human listeners to measure the accurate production of discrete pronunciation features (i.e. segmental and suprasegmental accuracy). Meanwhile, few studies employed global pronunciation learning measures such as intelligibility and comprehensibility. Recommendations for future research are provided based on the discussion of these results.
Article
Full-text available
The purpose of this article is to study translation as a human speech act in the context of artificial intelligence. Using the method of analysing the related literature, the article focuses on the impact of technological changes on traditional approaches and explores the links between these concepts and their emergence in linguistics and automatic language processing methods. The results show that the main methods include stochastic, rule-based, and methods based on finite automata or expressions. Studies have shown that stochastic methods are used for text labelling and resolving ambiguities in the definition of word categories, while contextual rules are used as auxiliary methods. It is also necessary to consider the various factors affecting automatic language processing and combine statistical and linguistic methods to achieve better translation results. Conclusions - In order to improve the performance and efficiency of translation systems, it is important to use a comprehensive approach that combines various techniques and machine learning methods. The research confirms the importance of automated language processing in the fields of AI and linguistics, where statistical methods play a significant role in achieving better results. Keywords: technological changes, linguistics, innovations, language technologies, automatic translation
Article
Pronunciation instruction studies have attracted considerable attention in the field of foreign language teaching and research in recent years. In this systematic review, only the intervention studies indexed in SSCI were included. A literature search up to April 2024 was conducted using the Web of Science and relevant meta-analytic studies; 55 interventions met the eligibility criteria based on PRISMA 2020. The aim of this review is twofold: to examine the effects of English L2 pronunciation instruction and to identify the methodological status of these studies in terms of treatment formulation, design, sampling type and size, treatment duration, and outcome measures. Results showed that pronunciation instruction treatments positively affected L2 users' pronunciation performance. Regarding research methodology, the studies mostly employed pre- and post-tests with at least one experimental group, and relatively few used delayed tests. The most common participant group was undergraduate students at pre-intermediate and intermediate levels. In recent years, participants' performance has tended to be measured through technological tools. Suprasegmental features of speech were increasingly addressed compared with segmental features alone. These studies also tended to include native speakers' ratings in the assessment phase of the instruction. The findings of this study provide insights and recommendations for future research.
Article
College students learning a foreign language must know how to translate spoken or written content from that language into English. Existing approaches do not help college students develop the capacity for rational thinking or adequate motivation for English translation. The educational principles are not in line with the qualities of the students in typical English translation classroom teaching, and the teaching methods are outdated. In the traditional process of teaching English translation, many unreliable, vague aspects need to be considered, such as recognizing students' fundamental English knowledge, unique circumstances, language proficiency, cultural differences, and the ambiguity of the source language. The main issue with the current English translation evaluation methodology is that it cannot easily deal with the complex fuzzy indices involved in judging the accuracy of student translations. An algorithm named FCAM-AHP-ANFIS is proposed to provide an effective and accurate method for evaluating and predicting students' English translation outcomes and to overcome these traditional shortcomings. Under the proposed approach, students can learn about passive translation, but they may struggle to actively improve their translation skills; college students can benefit from the decision-making aid provided by the extensive evaluation technique due to its high availability and precision. The fundamental benefit of the fuzzy technique over more traditional forms of assessment is that it accounts for the ambiguity and uncertainty of human-level judgments and provides a coherent framework that integrates the indistinct findings of the several steps in evaluating an English translation. The Fuzzy Comprehensive Assessment Model (FCAM) is a decision-making method that uses fuzzy logic to assess the quality of English translations among college students. The Analytic Hierarchy Process (AHP) is employed to calculate each criterion's relative importance and determine the optimal weighting for each criterion used in the assessment model. The Adaptive Neuro-Fuzzy Inference System (ANFIS) is used to analyze the translated data and generate predictions of the students' translation outcomes. The experimental outcomes show that the accuracy of the English translation assessment scores is 95.6%, with 97% precision, 96% recall, and a 96.5% F1-score, in addition to root mean square error (RMSE) and mean absolute error (MAE) measures.
Preprint
Full-text available
The English intelligent pronunciation training system is a comprehensive system based on multiple functions such as speech recognition, comparison, pronunciation scoring, and correction. In this paper, Fourier analysis of the speech signal is carried out to obtain the spectral characteristics of each frame. At the same time, the paper analyses the speech signal using multi-sensor fusion tracking and recognition technology. The method achieves speech recognition by automatically matching the entropy value of the extracted English speech-related information. Practice has shown that the developed speech recognition system can accurately perform qualitative correction of pronunciation lip shape. The method has good application prospects in English speech recognition.
Article
The ability, interest, and prior accomplishments of students with varying proficiency levels all impact how they learn English. Exact validation is essential for facilitating efficient evaluation and training models. The research's innovative significance resides in incorporating personal attributes, progressive appraisal, and fuzzy-logic-based appraisal in English language learning. The PA2M model, which addresses the shortcomings of existing models, offers a thorough and accurate assessment, enabling personalized recommendations and enhanced teaching tactics for students with varied skill levels. This research proposes the Fuzzy Logic System (FLS)-based Persistent Appraisal Assessment Model (PA2M). Based on the students' evolving performance and accumulated data, this model evaluates the students' English learning capabilities. The model assesses the student's ability using fuzzification approaches to reduce variations in appraisal verification by linking personal attributes with performance. Mamdani FIS offers a clear and thorough evaluation of the student's English learning capacity within the framework of the appraisal methodology. The inputs are updated utilizing performance and accumulated ability data to improve validation consistency and reduce convergence errors. During the fuzzification process, pre-convergence from unavailable appraisal sequences is eliminated. The PA2M approach determines precise improvements and evaluations depending on student ability by merging prior and current data. Several appraisal validations and verifications result in clear, fresh suggestions. According to experimental data, the suggested model improves the recommendation rate by 9.79%, appraisal verification by 8.79%, the convergence factor by 8.25%, the error ratio by 12.56%, and verification time by 8.77% over a range of inputs. The PA2M model provides a fresh and useful way to evaluate English learning potential, filling in some gaps in the body of knowledge and practice.
Article
Full-text available
Acquiring a consistent accent and targeting a native standard like Received Pronunciation (RP) or General American (GA) are prerequisites for French learners who plan to become English teachers in France. Reliable methods to assess learners’ productions are therefore extremely valuable. We recorded a little over 300 students from our English Studies department and performed auditory analysis to investigate their accents and determine how close to native models their productions were. Inter-rater comparisons were carried out; they revealed overall good agreement scores which, however, varied across phonetic cues. Then, automatic speech recognition (ASR) and automatic accent identification (AID) were applied to the data. We provide exploratory interpretations of the ASR outputs, and show to what extent they agree with and complement our auditory ratings. AID turns out to be very consistent with our perception, and both types of measurements show that two thirds of our students favour an American, and the remaining third, a British pronunciation, although most of them have mixed features from the two accents.
Article
Students' English learning ability depends on the knowledge and practice provided during teaching sessions. Besides, language usage improves the self-ability to scale up the learning levels for professional communication. Therefore, appraisal identification and ability estimation are expected to be consistent across different English learning levels. This paper introduces the Performance Data-based Appraisal Identification Model (PDAIM) to support such reference. The proposed model uses fuzzy logic to identify learning level lags; lags in performance and retentions in scaling up are identified using different fuzzification levels. The model gathers high and low degrees of variance in the learning process to give students adaptable learning knowledge, and, based on the student's performance and capacity for knowledge retention, it enables scaling up the learning levels for professional communication. The performance measure in the model is adjusted to accommodate the student's diverse grades within discernible assessment boundaries; this individualized method offers education and advancement focused on students' unique requirements and skills. If a consistent appraisal level from the fuzzification is observed over continuous sessions, then the learning is scaled up to the next level; failing that results in retention. The model uses constant normalization to improve the fuzzification by employing previous lags and retentions, and it can adjust to various performance levels and offer pertinent feedback. The performance of the PDAIM model is validated using several indicators: appraisal rate, lag detection, number of retentions, data analysis rate, and analysis time.
Article
Automatic pronunciation assessment is an indispensable technology in computer-assisted pronunciation training systems. To further evaluate the quality of pronunciation, multi-task learning with simultaneous output at multiple granularities and of multiple aspects has become a mainstream solution. Existing methods either predict scores at all granularity levels simultaneously through a parallel structure or predict individual granularity scores layer by layer through a hierarchical structure. However, these methods do not fully understand and take advantage of the correlation between the three granularity levels of phoneme, word, and utterance. To address this issue, we propose a novel method, the Granularity-decoupled Transformer (Gradformer), which is able to model the relationships between multiple granularity levels. Specifically, we first use a convolution-augmented transformer encoder to encode acoustic features, where the convolution module helps the model better capture local information. The model outputs both phoneme- and word-level granularity scores, which are highly correlated, via the encoder. Then, we use utterance queries to interact with the output of the encoder through the transformer decoder, ultimately obtaining the utterance scores. Through this encoder-decoder architecture, we achieve decoupling at the three granularity levels and handle the relationships between granularities. Experiments on the speechocean762 dataset show that our model has advantages over state-of-the-art methods on various metrics, especially key metrics such as phoneme accuracy, word accuracy, and total score.
Article
Evaluating English teaching quality is vital for improving knowledge-based developments through communication for different aged students. Teaching quality assessment relies on the teachers’ and students’ features for constructive progression. With the development of computational intelligence, optimization and machine learning techniques are widely adapted for teaching quality assessment. In this article, a Quality-centric Assessment Model aided by Fuzzy Optimization (QAM-FO) is designed. This optimization approach validates the student-teacher features for a balanced model assessment. The distinguishable features for improving students’ oral and verbal communication from different teaching levels (basic, intermediate, and proficient) are extracted. The extracted features are the crisp input for the fuzzy optimization such that the recurring fuzzification detains the least fit feature. Such features are replaced by the level-based teaching and performance feature that differs from the previous fuzzy input. This replacement is pursued until a maximum recommendable feature (performance/ learning) is identified. The identified feature is applicable for different teaching levels for improving the quality assessment. Therefore, the proposed optimization approach provides different feasible recommendations for teaching improvements.
Preprint
Full-text available
Despite significant advances in recent years, the existing Computer-Assisted Pronunciation Training (CAPT) methods detect pronunciation errors with a relatively low accuracy (precision of 60% at 40%-80% recall). This Ph.D. work proposes novel deep learning methods for detecting pronunciation errors in non-native (L2) English speech, outperforming the state-of-the-art method in AUC metric (Area under the Curve) by 41%, i.e., from 0.528 to 0.749. One of the problems with existing CAPT methods is the low availability of annotated mispronounced speech needed for reliable training of pronunciation error detection models. Therefore, the detection of pronunciation errors is reformulated to the task of generating synthetic mispronounced speech. Intuitively, if we could mimic mispronounced speech and produce any amount of training data, detecting pronunciation errors would be more effective. Furthermore, to eliminate the need to align canonical and recognized phonemes, a novel end-to-end multi-task technique to directly detect pronunciation errors was proposed. The pronunciation error detection models have been used at Amazon to automatically detect pronunciation errors in synthetic speech to accelerate the research into new speech synthesis methods. It was demonstrated that the proposed deep learning methods are applicable in the tasks of detecting and reconstructing dysarthric speech.
Book
This book contains a selection of the best papers of the 33rd Benelux Conference on Artificial Intelligence, BNAIC/ BENELEARN 2021, held in Esch-sur-Alzette, Luxembourg, in November 2021. The 14 papers presented in this volume were carefully reviewed and selected from 46 regular submissions. They address various aspects of artificial intelligence such as natural language processing, agent technology, game theory, problem solving, machine learning, human-agent interaction, AI and education, and data analysis.
Article
Full-text available
For a tonal language such as Mandarin Chinese, accurate pronunciation of tones is critical to meaning, and research suggests that computer-based programs that allow for visualization of pitch contours are helpful for improving learners' tone production. This article reports on a pilot study using speech analysis software (Praat), which allowed L2 Chinese learners first to hear a native speaker of Mandarin say words and phrases while seeing a visual display of the native speaker's pitch curves, then to record themselves reading the same words and phrases, and later to compare their own pitch contours to those of the native speaker. Students in first-year Chinese were recorded reading words and phrases before and after two computer-based training sessions. Native speakers rated the words and phrases for accuracy of tones. Results indicate a ceiling effect for the pronunciation of mono- and disyllabic word tones, with 83.39% of tones produced correctly in the pre-test. Of the 16.41% of tones that were incorrect in the pre-test, almost 50% were pronounced correctly in the post-test. Students indicated in a post-study survey that seeing the pitch curves of both the native speakers and their own helped them improve their word tones.
Chapter
Full-text available
The use of mobile applications within Japanese high school English classrooms has been understudied. As a result, teachers have received little guidance in how to use these tools to support language learning. This led us to explore the use of an English vocabulary support tool in an advanced English class at a senior high school in Japan. This investigation employed a mixed-methods approach that is consistent with a design-based research methodology. The study showed the mobile application supported students’ vocabulary development. The investigation also revealed the barriers (e.g., low levels of learner autonomy) and instructional norms that limit the adoption of mobile tools in this little studied context. We discuss these barriers and propose strategies to help teachers overcome them. Among these strategies are small incremental increases in learner autonomy, the use of challenging activities that encourage collaboration, and repeated instruction in the mobile tool’s use.
Conference Paper
Full-text available
Swain's (1985) Comprehensible Output Hypothesis considers that input alone may not be enough for second/foreign language (L2) learners to acquire new language forms. The Hypothesis claims that producing an L2 will facilitate L2 learning due to the mental processes related to language production. Thus, learners will more likely notice discrepancies and gaps between linguistic aspects of their native language (L1) and those of their L2 when producing language than when only perceiving language. Taking Swain's Hypothesis into account, in this talk we will present a Computer Assisted Pronunciation Training game designed for non-native speakers of Chinese, English, German, Portuguese (Brazilian and European) and Spanish. The game makes use of automatic speech recognition (ASR) and text-to-speech systems available in Android smartphones and tablets to (i) present learners with the target sounds by means of synthesized stimuli; (ii) test learners' discrimination of specific L2 sounds that are likely to cause intelligibility problems through exercises containing minimal pairs; and (iii) allow learners to record their speech and compare their production to that of the L2. The game provides users with immediate feedback in both perception and production exercises. In the production exercises, when the recognizer is unable to identify an ideal or close-to-ideal response, the user can retry the answer up to five times. The main disadvantage of ASR pronunciation training is erroneous feedback, i.e., the possibility of false alarms and false accepts (Neri et al., 2006). In order to encourage users' engagement and desire to keep playing the game, each correct answer entitles users to collect points so as to reach a given game status. Moreover, different language-dependent leaderboards can be displayed at the end of each round. The advantages of using a gamification design strategy are (i) the increase of learners' engagement, and (ii) the possibility of individualized and comprehensive feedback while keeping users active and comfortable to progress at their own pace in an anxiety-free context.
Conference Paper
Full-text available
We present an L2 pronunciation training serious game based on the minimal-pairs technique, incorporating sequences of exposure, discrimination and production, and using text-to-speech and speech recognition systems. We have measured the quality of users' production over a period of time in order to assess improvement after using the application. Substantial improvement is found among users with poorer initial performance levels. The program's gamification resources manage to engage a high percentage of users. A need is felt to include feedback for users in future versions, with the purpose of increasing their performance and avoiding the performance drop detected after protracted use of the tool.
Conference Paper
Full-text available
We present a foreign language (L2) pronunciation training serious game, TipTopTalk!, based on the minimal-pairs technique. We carried out a three-week test experiment where participants had to overcome several challenges including exposure, discrimination and production , while using Text-To-Speech (TTS) and Automatic Speech Recognition (ASR) systems in a mobile application. The quality of users' production is measured in order to assess their improvement. The application implements gamification resources with the aim of promoting continued practice. Preliminary results show that users with poorer initial performance levels make relatively more progress than the rest. However, it is desirable to include specific and individualized feedback in future versions so as to avoid the performance drop detected after the protracted use of the tool.
Conference Paper
Full-text available
Text-To-Speech (TTS) synthesizers have piqued the interest of researchers for their potential to enhance the L2 acquisition of writing (Kirstein, 2006), vocabulary and reading (Proctor, Dalton, & Grisham, 2007) and pronunciation (Cardoso, Collins, & White, 2012; Soler-Urzua, 2011). Despite their proven effectiveness, there is a need for up-to-date formal evaluations of TTS systems. The present study was an attempt to evaluate the language learning potential of an up-to-date TTS system at two levels: (1) speech quality (comprehensibility, naturalness, accuracy, and intelligibility) and (2) focus on a linguistic form (via a feature identification task). For Task 1, participants listened to and rated human- and TTS-produced stories and sentences on a 6-point scale; for Task 2, they listened to 16 human- and TTS-produced sentences to identify the presence of a target feature (English regular past -ed). Results of paired samples t-tests indicated that for speech quality, the human samples earned higher ratings than the TTS samples. For the second task (past -ed perception), the TTS and human-produced samples were equivalent. The discussion of the findings will highlight how TTS can be used to complement and enhance the teaching of L2 pronunciation and other linguistic skills both inside and outside the classroom. Keywords: computer-assisted language learning, CALL, text-to-speech, technology and language learning.
Article
Full-text available
Given the limited time for instruction in the classroom, pronunciation often ends up as the most neglected aspect of language teaching. However, in cases when the learner’s pronunciation is expected to be good or native-like, as is expected of language teacher trainees, out-of-class self-study options become prominent. This study aimed to investigate the effectiveness of online text-to-speech tools used by EFL teacher trainees when preparing for an oral achievement test. The study was conducted with 43 junior year teacher trainees at a large state university in Turkey. A pre- and post-test experimental design was used. Both qualitative and quantitative data were collected through a questionnaire to explore the trainees’ opinions related to pronunciation and their practices to improve this, a post reflection questionnaire for the effectiveness of the procedure, and a speaking rubric to evaluate the oral presentations of the trainees. The results indicate that the trainees perceived a native-like accent as a measure of being a good language teacher. It was also revealed that text-to-speech websites are effective self-study tools in improving trainees’ pronunciation.
Conference Paper
Full-text available
Computer Assisted Pronunciation Training (CAPT) apps are becoming widespread as aids for learning new languages. However, they are still highly criticized for lacking the irreplaceable direct feedback of a human expert. The combination of the right learning methodology with a gamification design strategy can, nevertheless, increase engagement and provide adequate feedback while keeping users active and comfortable. In this paper, we introduce the second generation of a serious game [1] designed to aid pronunciation training for non-native students of English, Spanish or Chinese. The design of the new version of the game supports a learning methodology based on the combination of three different learning strategies: exposure, discrimination and pronunciation [2]. In exposure mode, players are helped to become familiar with the sounds of sequences of minimal pairs or trios, selected by a native linguist and presented at random. In discrimination mode, users test their ability to discriminate between the phonetics of minimal pairs: they listen to the sound of one of the words in the pair and have to choose the right word on screen. In pronunciation mode, finally, subjects are asked to separately read aloud (and record) both words of each round of minimal-pair lists. The native pronunciation of a word can be played as many times as a user needs. When the test word is correctly uttered by the user, the corresponding icon changes its base colour to green and gets disabled as a positive feedback message appears. Otherwise, a message with the recognized words appears on the graphical interface and a non-positive feedback message is presented; the word changes its base colour to red and gets disabled after five failures. Speech is recorded and played using commercial off-the-shelf ASR and TTS. Our game adapts to the player as a function of right and wrong answers. Users collect points to reach a "phonetic level" and obtain different achievements, in order to encourage their engagement. There are also different language-dependent leaderboards based on points, to increase the desire to play. Sharing results in social networks is another option that is under way. From a pedagogical point of view, the use of minimal pairs [3] favours users' awareness of the potential risk of producing wrong meanings when the correct phonemes are not properly realized. The discrimination of the words that make up a minimal pair is a challenging task for the ASR, since the phonetic distance between each couple of words can be really small, although clearly perceptible to a native speaker. To be efficient, minimal-pair lists are to be selected by expert linguists for each language. Real-use data acquisition and processing is still ongoing, but preliminary results are promising and show that this learning and gaming strategy provides measurable improvement of learners' pronunciation. The app offers an enjoyable opportunity for anywhere, anytime self-learning, and a tool for teachers to design challenging games.
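As a rough, hypothetical sketch of the retry-and-feedback logic this abstract describes (the `recognize` hook and all names here are our own illustration, not the app's actual code), the pronunciation-mode loop might look like:

```python
MAX_TRIES = 5  # the word gets disabled after five failures, as described above

def pronunciation_round(target: str, recognize) -> bool:
    """Prompt the user to utter `target`; return True on a correct utterance."""
    for attempt in range(1, MAX_TRIES + 1):
        recognized = recognize()  # hypothetical hook into an off-the-shelf ASR
        if recognized == target:
            print(f"Correct! '{target}' recognized on attempt {attempt}.")  # positive feedback
            return True
        print(f"Heard '{recognized}', expected '{target}'. Try again.")      # non-positive feedback
    print(f"'{target}' disabled after {MAX_TRIES} failed attempts.")
    return False

# Toy usage with a scripted "recognizer" that succeeds on the third try.
answers = iter(["sheep", "sheep", "ship"])
pronunciation_round("ship", lambda: next(answers))
```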
Conference Paper
Full-text available
This paper introduces the architecture and interface of a serious game intended for pronunciation training and assessment for Spanish students of English as a second language. Users confront a challenge consisting of the pronunciation of a minimal-pair word battery. Android ASR and TTS tools prove useful in discerning three different pronunciation proficiency levels, ranging from basic to native. Results also provide evidence of the weaknesses and limitations of present-day technologies, which must be taken into account when defining game dynamics for pedagogical purposes.
Article
Full-text available
The goal of this study was to determine the overall effects of pronunciation instruction (PI) as well as the sources and extent of variance in observed effects. Toward this end, a comprehensive search for primary studies was conducted, yielding 86 unique reports testing the effects of PI. Each study was then coded on substantive and methodological features as well as study outcomes (Cohen’s d). Aggregated results showed a generally large effect for PI (d = 0.89 and 0.80 for N-weighted within- and between-group contrasts, respectively). In addition, moderator analyses revealed larger effects for (i) longer interventions, (ii) treatments providing feedback, and (iii) more controlled outcome measures. We interpret these and other results with respect to their practical and pedagogical relevance. We also discuss the findings in relation to instructed second language acquisition research generally and in comparison with other reviews of PI (e.g. Saito 2012). Our conclusion points out areas of PI research in need of further empirical attention and methodological refinement.
Article
Full-text available
Illiteracy is often associated with people in developing countries. However, an estimated 50 % of adults in a developed country such as Canada lack the literacy skills required to cope with the challenges of today’s society; for them, tasks such as reading, understanding, basic arithmetic, and using everyday items are a challenge. Many community-based organizations offer resources and support for these adults, yet overall functional literacy rates are not improving. This is due to a wide range of factors, such as poor retention of adult learners in literacy programs, obstacles in transferring the acquired skills from the classroom to the real life, personal attitudes toward learning, and the stigma of functional illiteracy. In our research we examined the opportunities afforded by personal mobile devices in providing learning and functional support to low-literacy adults. We present the findings of an exploratory study aimed at investigating the reception and adoption of a technological solution for adult learners. ALEX© is a mobile application designed for use both in the classroom and in daily life in order to help low-literacy adults become increasingly literate and independent. Such a solution complements literacy programs by increasing users’ motivation and interest in learning, and raising their confidence levels both in their education pursuits and in facing the challenges of their daily lives. We also reflect on the challenges we faced in designing and conducting our research with two user groups (adults enrolled in literacy classes and in an essential skills program) and contrast the educational impact and attitudes toward such technology between these. Our conclusions present the lessons learned from our evaluations and the impact of the studies’ specific challenges on the outcome and uptake of such mobile assistive technologies in providing practical support to low-literacy adults in conjunction with literacy and essential skills training.
Article
Full-text available
Research on the efficacy of second language (L2) pronunciation instruction has produced mixed results, despite reports of significant improvement in many studies. Possible explanations for divergent outcomes include learner individual differences, goals and foci of instruction, type and duration of instructional input, and assessment procedures. After identifying key concepts, we survey 75 L2 pronunciation studies, particularly their methods and results. Despite a move towards emphasizing speech intelligibility and comprehensibility, most research surveyed promoted native-like pronunciation as the target. Although most studies entailed classroom instruction, many featured Computer Assisted Pronunciation Teaching (CAPT). Segmentals were studied more often than suprasegmentals. The amount of instruction required to effect change was related to researchers’ goals; interventions focusing on a single feature were generally shorter than those addressing more issues. Reading-aloud tasks were the most common form of assessment; very few studies measured spontaneous speech. The attribution of improvement as a result of instruction was compromised in some instances by lack of a control group. We summarize our findings, highlight limitations of current research, and offer suggestions for future directions.
Article
Full-text available
In this paper I intend to present theoretical support and guidelines for the design of a Native Cardinality Method for the teaching of L2 vowels and consonants to adult students, especially adapted to the teaching of English in Spanish universities. The method must be designed so that it promotes a flexibilization of the learner's native phonological system by making it ready to accept new phonemic units, and new complementary distributed allophonic variants. Based on Daniel Jones's famous Cardinal Vowel descriptive system, NCM implies the use of L1 sounds as starting points for the progressive acquisition of L2 vowels and consonants.
Article
Full-text available
This paper first provides an overview of factors that constrain ultimate attainment in adult second language (L2) pronunciation, finding that first language influence and the quantity and quality of L2 phonetic input account for much of the variation in the degree of foreign accent found across adult L2 learners. The author then evaluates current approaches to computer assisted pronunciation training (CAPT), concluding that they are not well grounded in a current understanding of L2 accent. Finally, the author reports on a study in which twenty-two Mandarin speakers were trained to better discriminate ten Canadian English vowels. Using a specially designed computer application, learners were randomly presented with recordings of the target vowels in monosyllabic frames, produced by twenty native speakers. The learners responded by clicking on one of ten salient graphical images representing each vowel category and were given both visual and auditory feedback as to the accuracy of their selections. Pre- and post-tests of the learners' English vowel pronunciation indicated that their vowel intelligibility significantly improved as a result of training, not only in the training context, but also in an untrained context. In a third context, vowel intelligibility did not improve.
Article
Full-text available
One aspect of second language teaching via multimedia to have received attention over the past few years is the impact of glossing individual vocabulary words through different modalities. This study examines which of the image modalities--dynamic video or still picture--is more effective in aiding vocabulary acquisition. The participants, 30 ESL students, were introduced to a hypermedia-learning program, designed by the researcher for reading comprehension. The program provides users reading a narrative English text with a variety of glosses or annotations for words in the form of printed text, graphics, video, and sound, all of which are intended to aid in the understanding and learning of unknown words. A within-subject design was used in this study, with 30 participants being measured under three conditions: printed text definition alone, printed text definition coupled with still pictures, and printed text definition coupled with video clips. In order to assess the efficacy of each mode, a vocabulary test was designed and administered to participants after they had read the English narrative. Two types of tests were administered: recognition and production. In addition, a face-to-face interview was conducted, and questionnaires were distributed. Results of both tests were analyzed using analysis of variance procedures. The investigation yielded the conclusion that a video clip is more effective in teaching unknown vocabulary words than a still picture. Among the suggested factors that explain such a result are that video better builds a mental image, better creates curiosity leading to increased concentration, and embodies an advantageous combination of modalities (vivid or dynamic image, sound, and printed text).
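For a within-subject design like this one, with every participant measured under all three gloss conditions, a repeated-measures ANOVA is the standard analysis. A minimal sketch follows, assuming a long-format table; the column names and scores are invented for illustration, not the study's data.

```python
# Sketch of the within-subject analysis such a design calls for: one
# score per participant per glossing condition, analyzed with a
# repeated-measures ANOVA. Column names and scores are made up.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

data = pd.DataFrame({
    "subject":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["text", "picture", "video"] * 4,
    "score":     [5, 6, 8, 4, 6, 7, 6, 7, 9, 5, 5, 8],
})

result = AnovaRM(data, depvar="score", subject="subject",
                 within=["condition"]).fit()
print(result)  # F test for the effect of gloss modality
```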
Article
Full-text available
Pronunciation instruction has been shown to improve learners' L2 accent in some, though certainly not all, cases. A core component of traditional pronunciation instruction is explicit lessons in L2 phonetics. Studies suggest that Spanish FL learners improve their pronunciation after receiving instruction, but the effect of phonetics instruction has not been directly compared with other pedagogical alternatives. This study reports on the pronunciation gains that first-, second-, and third-year learners (n = 95) made after receiving either explicit instruction in Spanish phonetics or a more implicit treatment with similar input, practice, and feedback. The target phones included a variety of consonants that are problematic for English speakers learning Spanish: stop consonants (/p, t, k/), approximants ([β, ð, ɣ]), and rhotics (/ɾ, r/). Learners' production of the target phones was measured in a pretest, posttest, delayed posttest design using a word-list reading task. Learners in both groups improved their pronunciation equally, suggesting that it might be the input, practice, and/or feedback included in pronunciation instruction, rather than the explicit phonetics lessons, that are most facilitative of improvement in pronunciation.
Article
Full-text available
In “Language Learning Styles and Strategies,” the author synthesizes research from various parts of the world on two key variables affecting language learning: styles, i.e., the general approaches to learning a language; and strategies, the specific behaviors or thoughts learners use to enhance their language learning. These factors influence the student’s ability to learn in a particular instructional framework.
Article
Full-text available
In speech recognition, when a system created for one application is used for another or for a different population of users, large amounts of data and engineering effort are needed to adapt it to its new use. Much work has recently centered on reducing that effort. This paper concerns changing from an adult to a child population of users in a system that pinpoints pronunciation errors in English. It first discusses children's speech production. It then describes an adaptation, centered on combining relatively small amounts of data with minimal recognizer changes, that allows the system to pinpoint errors as well for children's speech as it does for adults'. The precision of the adult system was tested on children's speech. Then Open Source SPHINX was tested on children's speech, and tests were run, using a variety of parameters, comparing the precision of automatic error pinpointing to that of a human tutor. The various parameters tested, the test conditions, and results are discussed.
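The core of error pinpointing, independent of the recognizer used, is aligning the phone sequence the recognizer returns against the sequence the target word should contain. A minimal sketch, assuming ARPAbet-style phone labels and leaving the recognizer itself out of scope:

```python
# Illustrative sketch (not the paper's system): once a recognizer
# returns a phone sequence, pronunciation errors can be pinpointed by
# aligning it against the expected sequence. Phone labels are assumed.
from difflib import SequenceMatcher

def pinpoint_errors(expected, recognized):
    """Return (op, expected_phones, recognized_phones) mismatches."""
    sm = SequenceMatcher(a=expected, b=recognized)
    errors = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            errors.append((op, expected[i1:i2], recognized[j1:j2]))
    return errors

# Target word "this" /DH IH S/ spoken by a child as /D IH S/:
print(pinpoint_errors(["DH", "IH", "S"], ["D", "IH", "S"]))
# -> [('replace', ['DH'], ['D'])]  i.e. /DH/ realized as /D/
```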
Article
Full-text available
The current emphasis in second language teaching lies in the achievement of communicative effectiveness. In line with this approach, pronunciation training is nowadays geared towards helping learners avoid serious pronunciation errors, rather than eradicating the finest traces of foreign accent. However, to devise optimal pronunciation training programmes, systematic information on these pronunciation problems is needed, especially in the case of the development of Computer Assisted Pronunciation Training systems. The research reported on in this paper is aimed at obtaining systematic information on segmental pronunciation errors made by learners of Dutch with different mother tongues. In particular, we aimed at identifying errors that are frequent, perceptually salient, persistent, and potentially hampering to communication. To achieve this goal we conducted analyses on different corpora of speech produced by L2 learners under different conditions. This resulted in a robust inventory of pronunciation errors that can be used for designing efficient pronunciation training programmes.
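Once learner speech has been annotated at the phone level, compiling such an inventory amounts to counting and ranking substitution patterns. A minimal sketch with invented annotation pairs (the paper's corpora and labels are not reproduced here):

```python
# Sketch of distilling an error inventory from annotated learner
# speech: count substitutions per (target, realized) pair and rank by
# frequency. The annotation tuples below are invented.
from collections import Counter

# (target phone, realized phone) pairs from hypothetical transcriptions
annotations = [("Y", "u"), ("EY", "E"), ("Y", "u"), ("G", "k"),
               ("Y", "y"), ("EY", "E"), ("Y", "u")]

errors = Counter((t, r) for t, r in annotations if t != r)
for (target, realized), n in errors.most_common():
    print(f"/{target}/ realized as /{realized}/: {n} times")
```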
Article
Full-text available
This study assessed the efficacy of a pronunciation training program that provided fundamental frequency contours as visual feedback to native English speakers acquiring Japanese pitch and durational contrasts. Native English speakers who had previously studied Japanese for 1-5 years in the United States participated in training using Kay Elemetrics' CSL-Pitch Program. The training materials were words, phrases, and sentences that contained Japanese pitch and durational contrasts. During training the subjects practiced matching the fundamental frequency contours of Japanese-native models shown on a computer screen. The subjects' ability to produce and to perceive novel Japanese words was tested in two contexts, that is, words in isolation and words in sentences, before and after training. Their ability was compared to that of a native English control group that had previously studied Japanese for 1-5 years but did not participate in the training. The trained subjects improved significantly for words in sentences as well as for words in isolation. Also, the trained subjects improved significantly in perception as well as in production. These findings suggest that the pronunciation training program developed and used here was effective in improving the ability of native English speakers to produce and perceive Japanese pitch and durational contrasts.
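The mechanics of this kind of visual feedback can be approximated with open tools: extract the F0 contour of a native model and of the learner, time-normalize both, and report their distance. The sketch below uses librosa's pYIN pitch tracker; the file names and the semitone-RMS score are assumptions, not the study's method.

```python
# Sketch of contour-matching feedback: extract F0 from a model
# utterance and a learner utterance, normalize, and score the distance.
# File names are hypothetical; librosa's pyin does the F0 tracking.
import numpy as np
import librosa

def f0_contour(path):
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    return f0[voiced]                      # keep voiced frames only

def contour_distance(model_f0, learner_f0, n=100):
    # Time-normalize both contours to n points and compare in semitones
    # so that differences in overall speaking pitch matter less.
    def norm(f0):
        t = np.linspace(0, 1, len(f0))
        semis = 12 * np.log2(f0 / np.median(f0))
        return np.interp(np.linspace(0, 1, n), t, semis)
    return float(np.sqrt(np.mean((norm(model_f0) - norm(learner_f0)) ** 2)))

model = f0_contour("model_utterance.wav")      # hypothetical files
learner = f0_contour("learner_utterance.wav")
print(f"RMS pitch deviation: {contour_distance(model, learner):.2f} semitones")
```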
Article
Full-text available
Synthesized hVd vowel stimuli and naturally produced CVC minimal pairs by multiple talkers were used to train native Mandarin and Cantonese speakers to identify the English /i/–/ɪ/, /u/–/ʊ/, and /ɛ/–/æ/ contrasts. In the pre- and post-tests, subjects took the identification tests on synthesized and natural stimuli, and were also recorded producing the target vowel contrasts. Results showed that subjects relied on duration cues for the /i/–/ɪ/ contrast more consistently than they did for the other two contrasts. Training effectively shifted their attention from duration to spectral cues. Trainees' perceptual performance on natural tokens improved significantly from pre-test to post-test on all three contrasts. Accuracy in generalization to new words produced by new talkers was comparable to new words by familiar talkers. The effect of perceptual learning was retained 3 months after the training was completed. Improvement in production was observed, but performance differences between pre- and post-test did not reach significance. The findings suggest that increased perceptual accuracy in identifying L2 vowel contrasts through perceptual training may not be sufficient for significant improvement in production accuracy. Future studies may look at a combination of simultaneous production and perceptual training for better results.
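Both cue types discussed here are measurable from a single vowel token: duration directly from the waveform, and spectral cues from midpoint formant values. A minimal sketch using the praat-parselmouth package, with a hypothetical token file:

```python
# Sketch of measuring the two cue types from one vowel token: duration
# from the sound object, spectral cues from midpoint formants. Uses the
# praat-parselmouth package; the file name is an assumption.
import parselmouth

def vowel_cues(path):
    snd = parselmouth.Sound(path)
    dur = snd.xmax - snd.xmin                    # duration cue (s)
    formants = snd.to_formant_burg()
    mid = (snd.xmin + snd.xmax) / 2
    f1 = formants.get_value_at_time(1, mid)      # spectral cues (Hz)
    f2 = formants.get_value_at_time(2, mid)
    return dur, f1, f2

dur, f1, f2 = vowel_cues("token_heed.wav")       # hypothetical token
print(f"duration={dur:.3f}s  F1={f1:.0f}Hz  F2={f2:.0f}Hz")
```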
Article
Full-text available
This study investigates whether a computer assisted pronunciation training (CAPT) system can help young learners improve word-level pronunciation skills in English as a foreign language at a level comparable to that achieved through traditional teacher-led training. The pronunciation improvement of a group of learners of 11 years of age receiving teacher-fronted instruction was compared to that of a group receiving computer assisted pronunciation training by means of a system including an automatic speech recognition component. Results show that 1) pronunciation quality of isolated words improved significantly for both groups of subjects, and 2) both groups significantly improved in pronunciation quality of words that were considered particularly difficult to pronounce and that were likely to have been unknown to them prior to the training. Training with a computer-assisted pronunciation training system with a simple automatic speech recognition component can thus lead to short-term improvements in pronunciation that are comparable to those achieved by means of more traditional, teacher-led pronunciation training.
Article
Full-text available
Communicative competence is the ultimate goal of most learners of a second language, and intelligible pronunciation is a fundamental part of it. Unfortunately, learners often lack the opportunity to explore how intelligible their speech is for different audiences. Our research investigates whether synchronous-voice computer-mediated communication could be an adequate tool both to promote more authentic interactions and to test the intelligibility of students' pronunciation with different audiences. We also study whether the kind of dyad (NNS sharing L1, NNS with different L1, and NS) affects improvement in pronunciation and the amount of phonetically modified output resulting from interactions, and investigate whether teachers' assessments of the seriousness of phonetic errors are confirmed by interlocutors' incomprehension.
Article
Full-text available
Educators and researchers in the acquisition of L2 phonology have called for empirical assessment of the progress students make after using new methods for learning (Chun, 1998; Morley, 1991). The present study investigated whether unlimited access to a speech-recognition-based language-learning program would improve the general standard of pronunciation of a group of middle-aged immigrant professionals studying English in Sweden. Eleven students were given a copy of the program Talk to Me from Auralog as a supplement to a 200-hour course in Technical English, and were encouraged to practise on their home computers. Their development in spoken English was compared with a control group of fifteen students who did not use the program. The program is evaluated in this paper according to Chapelle's (2001) six criteria for CALL assessment. Since objective human ratings of pronunciation are costly and can be unreliable, our students were pre- and post-tested with the automatic PhonePass SET-10 test from Ordinate Corp. Results indicate that practice with the program was beneficial to those students who began the course with a strong foreign accent but was of limited value for students who began the course with better pronunciation. The paper begins with an overview of the state of the art of using speech recognition in L2 applications.
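The pre-/post-test logic underlying this kind of evaluation reduces to a paired comparison of each learner's scores before and after practice. A minimal sketch with invented scores (the actual PhonePass SET-10 ratings are not public):

```python
# Sketch of the pre-/post-test comparison behind such evaluations:
# paired scores per learner before and after practice. The scores
# below are invented for illustration.
from scipy import stats

pre  = [3.1, 2.8, 4.0, 3.5, 2.5, 3.9, 3.0, 2.7, 3.3, 2.9, 3.6]
post = [3.6, 3.4, 4.1, 3.7, 3.2, 4.0, 3.5, 3.3, 3.4, 3.5, 3.8]

t, p = stats.ttest_rel(post, pre)   # paired-samples t-test
print(f"paired t = {t:.2f}, p = {p:.4f}")
```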
Article
Full-text available
Although the success of automatic speech recognition (ASR)-based Computer Assisted Pronunciation Training (CAPT) systems is increasing, little is known about the pedagogical effectiveness of these systems. This is particularly regrettable because ASR technology still suffers from limitations that may result in the provision of erroneous feedback, possibly leading to learning breakdowns. To study the effectiveness of ASR-based feedback for improving pronunciation, we developed and tested a CAPT system providing automatic feedback on Dutch phonemes that are problematic for adult learners of Dutch. Thirty immigrants who were studying Dutch were assigned to three groups using either the ASR-based CAPT system with automatic feedback, a CAPT system without feedback, or no CAPT system. Pronunciation quality was assessed for each participant before and after the training by human experts who evaluated overall segmental quality and the quality of the phonemes addressed in the training. The participants' impressions of the CAPT system used were also studied through anonymous questionnaires. The results on global segmental quality show that the group receiving ASR-based feedback made the largest mean improvement, but the groups' mean improvements did not differ significantly. The group receiving ASR-based feedback showed a significantly larger improvement than the no-feedback group in the segmental quality of the problematic phonemes targeted.
Article
Full-text available
The establishment of new towns in the twentieth century in many parts of the world is a test bed of koineization, the type of language change that takes place when speakers of different, but mutually intelligible language varieties come together, and which may lead to new dialect or koine formation. This article presents the case of Milton Keynes, an English new town designated in 1967. Our study investigated the speech of a sample of 48 working-class children divided into three age groups: four, eight, and twelve years of age, along with one caregiver for each. We hypothesize that the formation of a new dialect is in the gift of older children. We also hypothesize that dialect levelling, which is part of koineization, will be more rapid in a new town than in an old-established town. Detailed quantitative results for four vowels strongly support these hypotheses. At the same time, we investigate the social network types contracted by new town residents. We found many to be socially isolated locally, but that they maintained contacts with their place of origin. We show that migrants violate what the Milroys argue to be the normal inverse relationship between socioeconomic class and social network density: migrants have uniplex networks, while still having a low socioeconomic status. The consequences for dialect change are considered.
Book
This book is a comprehensive guide to the International Phonetic Alphabet, whose aim is to provide a universally agreed system of notation for the sounds of languages, and which has been widely used for over a century. The Handbook presents the basics of phonetic analysis so that the principles underlying the Alphabet can be readily understood, and gives examples of the use of each of the phonetic symbols. The application of the Alphabet is then demonstrated in nearly 30 'Illustrations' - concise analyses of the sound systems of a range of languages, each of them accompanied by a phonetic transcription of a passage of speech. The Handbook also includes the 'Extensions' to the Alphabet, covering speech sounds beyond the sound-systems of languages, and a listing of the internationally agreed computer codings for phonetic symbols. It is an essential reference work for all those involved in the analysis of speech.
Article
This study investigates the acquisition of the L2 French vowel /y/ in a mobile-assisted learning environment, via the use of automatic speech recognition (ASR). Particularly, it addresses the question of whether ASR-based pronunciation instruction using a mobile device can improve the production and perception of French /y/. Forty-two elementary French students participated in an experimental study in which they were assigned to one of three groups: (1) the ASR Group, which used an ASR application on their mobile devices to complete weekly pronunciation activities, with immediate written visual (textual) feedback provided by the software and no human interaction; (2) the Non-ASR Group, which completed the same weekly pronunciation activities in individual weekly sessions but with a teacher who provided immediate oral feedback using recasts and repetitions; and finally, (3) the Control Group, which participated in weekly individual meetings ‘to practice their conversation skills’ with a teacher who provided no pronunciation feedback. The study followed a pretest/posttest design. According to the results of the dependent samples t-tests, only the ASR group improved significantly from pretest to posttest (p < 0.001), and none of the groups improved in perception. The overall success of the ASR group on the production measures suggests that this type of learning environment is propitious for the development of segmental features such as /y/ in L2 French.
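The feedback mechanism described for the ASR Group, transcribing the learner's attempt and returning a written verdict, can be sketched generically with an off-the-shelf recognizer. The snippet below uses the SpeechRecognition package and a hypothetical recording; it illustrates the idea and is not the study's application.

```python
# Generic sketch of ASR-based written feedback (not the study's app):
# the learner's recording is transcribed and compared with the target
# word, and a textual verdict is returned.
import speech_recognition as sr

def pronunciation_feedback(wav_path, target):
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        heard = recognizer.recognize_google(audio, language="fr-FR")
    except sr.UnknownValueError:
        return "Could not understand the recording; try again."
    if heard.strip().lower() == target.lower():
        return f"Recognized '{heard}': target matched."
    return f"Recognized '{heard}' instead of '{target}': keep practicing."

print(pronunciation_feedback("learner_tu.wav", "tu"))  # hypothetical file
```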
Article
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
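The two-stage design is easier to see in code. The sketch below is a heavily reduced caricature of the pipeline: characters are embedded, encoded, and projected to mel-spectrogram frames, which a vocoder would then turn into audio. Attention, stop-token prediction, and the WaveNet vocoder are all omitted, and every size is illustrative.

```python
# Heavily simplified sketch of the two-stage design described above:
# characters -> mel-spectrogram frames, then a separate vocoder turns
# frames into audio. Attention, stop tokens, and WaveNet are omitted;
# all dimensions are illustrative.
import torch
import torch.nn as nn

class TinyTaco(nn.Module):
    def __init__(self, n_chars=40, emb=64, hidden=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)      # character embeddings
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)      # frame predictor

    def forward(self, char_ids):
        x = self.embed(char_ids)                     # (B, T, emb)
        h, _ = self.encoder(x)                       # (B, T, hidden)
        return self.to_mel(h)                        # (B, T, n_mels)

model = TinyTaco()
chars = torch.randint(0, 40, (1, 12))                # a 12-character "text"
mel = model(chars)                                   # predicted mel frames
print(mel.shape)                                     # torch.Size([1, 12, 80])
# A real system would now pass `mel` to a neural vocoder (WaveNet in
# the paper) to synthesize the time-domain waveform.
```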
Article
We examine the impact of the pedagogical use of mobile TTS on the L2 acquisition of French liaison, a process by which a word-final consonant is pronounced at the beginning of the following word if the latter is vowel-initial (e.g., petit ami realized as peti[ta]mi ‘boyfriend’). The study compares three groups of L2 French students learning how to produce liaison over a two-month period, following a pretest-posttests design within a mixed-methods approach to data collection and analysis. Participants were divided into three groups: (1) the TTS Group used a TTS application on their mobile devices to complete weekly pronunciation tasks consisting of noticing, listen-and-categorize, and listen-and-repeat activities; (2) the Non-TTS Group completed the same weekly pronunciation tasks in weekly sessions with a teacher; finally, (3) the Control Group participated in weekly meetings ‘to practice their conversation skills’ with a teacher, who provided no pronunciation feedback. The results indicate that, although all three groups improved in liaison production, when considered separately (within groups), only the two experimental groups improved over time. The discussion of our findings highlights the pedagogical use of mobile TTS technology to complement and enhance the teaching of L2 pronunciation.
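Generating the audio prompts for noticing and listen-and-repeat tasks of this kind can be done with any off-the-shelf TTS engine. The sketch below uses the gTTS package with an illustrative phrase list; it stands in for, and is not, the mobile application used in the study.

```python
# Sketch of generating liaison listening prompts with an off-the-shelf
# TTS engine (gTTS here). The phrase list is illustrative.
from gtts import gTTS

LIAISON_PHRASES = [
    "petit ami",       # /t/ surfaces: peti[ta]mi
    "les amis",        # /z/ surfaces: le[za]mis
    "un grand homme",  # /t/ surfaces: gran[t]homme
]

for i, phrase in enumerate(LIAISON_PHRASES):
    gTTS(text=phrase, lang="fr").save(f"liaison_{i}.mp3")
    print(f"saved liaison_{i}.mp3 for '{phrase}'")
```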