Article

Automated speech scoring for non-native middle school students with multiple task types

Authors: Keith Evanini and Xinhao Wang

Abstract

This study presents the results of applying automated speech scoring technology to English spoken responses provided by non-native children in the context of an English proficiency assessment for middle school students. The assessment contains three diverse task types designed to measure a student's English communication skills, and an automated scoring system was used to extract features and build scoring models for each task. The results show that the automated scores have a correlation of r = 0.70 with human scores for the Read Aloud task, which matches the human-human agreement level. For the two tasks involving spontaneous speech, the automated scores obtain correlations of r = 0.62 and r = 0.63 with human scores, which represents a drop of 0.08-0.09 from the human-human agreement level. When all 5 scores from the assessment for a given student are aggregated, the automated speaker-level scores show a correlation of r = 0.78 with human scores, compared to a human-human correlation of r = 0.90. The challenges of using automated spoken language assessment for children are discussed, and directions for future improvements are proposed.
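As a rough illustration of the speaker-level figure quoted above (r = 0.78 automated vs. human), the sketch below aggregates per-item scores for each speaker and computes the Pearson correlation between automated and human totals. It is a minimal sketch with invented example data; the simple sum over five item scores is an assumption, not necessarily the paper's exact aggregation.

```python
# Minimal sketch: speaker-level aggregation of per-item scores and
# human-machine agreement. Example data and sum aggregation are assumed.
import numpy as np
from scipy.stats import pearsonr

# Five per-item scores (e.g., on a 1-4 scale) per speaker.
human = {"s1": [3, 3, 4, 2, 3], "s2": [2, 2, 3, 2, 2], "s3": [4, 4, 4, 3, 4]}
machine = {"s1": [3, 2, 4, 3, 3], "s2": [2, 2, 2, 2, 3], "s3": [4, 3, 4, 4, 4]}

speakers = sorted(human)
human_totals = np.array([sum(human[s]) for s in speakers])
machine_totals = np.array([sum(machine[s]) for s in speakers])

r, _ = pearsonr(human_totals, machine_totals)
print(f"speaker-level human-machine correlation: r = {r:.2f}")
```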


... In a similar manner, Evanini and Wang (2013) reported the results of using automated proficiency scoring with English oral answers given by non-native-speaking middle school students in an English proficiency evaluation. The computer calculated 10 linguistic measures, which included both suprasegmental and other linguistic skills (e.g., reading accuracy). ...
... Both Zechner et al. (2009) and Evanini and Wang (2013) included articulation rate in their computer models that predicted English proficiency of unconstrained speech. Similarly, Iwashita et al. (2008) found a significant relationship between proficiency level and speech rate (i.e., syllable rate). ...
... In addition, the phonation time ratio finding of Kang et al. (2010) corroborates the computer model's consideration of articulation rate as a measure of silent pauses in predicting English speaking proficiency. In their computer models, both Evanini and Wang (2013) and Zechner et al. (2009) incorporated four silent pause measures: duration of silent pauses per word, average silent pause duration in seconds, average duration of long silent pauses (i.e., greater than or equal to 500 ms), and frequency of long silent pauses divided by the number of words. As discussed above, the computer model found that the number and length of silent pauses were indirectly related to proficiency through articulation rate and syllable rate. ...
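The four silent-pause measures listed in the excerpt above can be computed directly from a time-aligned transcription. The sketch below assumes a list of (word, start, end) tuples from forced alignment and the 500 ms threshold for long pauses mentioned above; the input format and function name are illustrative, not taken from the cited systems.

```python
# Hedged sketch: silent-pause features from a word-level time alignment.
# `alignment` is assumed to be a list of (word, start_sec, end_sec) tuples.
LONG_PAUSE_SEC = 0.5  # long silent pause threshold (>= 500 ms)

def pause_features(alignment):
    words = len(alignment)
    # Silent pauses are the gaps between consecutive aligned words.
    gaps = [alignment[i + 1][1] - alignment[i][2]
            for i in range(len(alignment) - 1)]
    pauses = [g for g in gaps if g > 0]
    long_pauses = [p for p in pauses if p >= LONG_PAUSE_SEC]
    return {
        "pause_dur_per_word": sum(pauses) / words if words else 0.0,
        "mean_pause_dur": sum(pauses) / len(pauses) if pauses else 0.0,
        "mean_long_pause_dur": (sum(long_pauses) / len(long_pauses)
                                if long_pauses else 0.0),
        "long_pauses_per_word": len(long_pauses) / words if words else 0.0,
    }

alignment = [("the", 0.10, 0.25), ("cat", 0.30, 0.62), ("sat", 1.40, 1.75)]
print(pause_features(alignment))
```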
Article
Full-text available
Suprasegmental features have received growing attention in the field of oral assessment. In this article we describe a set of computer algorithms that automatically scores the oral proficiency of non-native speakers using unconstrained English speech. The algorithms employ machine learning and 11 suprasegmental measures divided into four groups (prominence, filled pause, speech rate, and intonation) to calculate the proficiency scores. In test responses from 120 non-native speakers of English monologues from the Cambridge English Language Assessment (CELA), the Pearson’s correlation between the computer’s calculated proficiency levels and the official CELA proficiency levels was 0.718. The current findings provide empirical evidence that prominence and intonation are salient features in the computer model’s prediction of proficiency.
... Trained human examiners established the official CELA ratings. This correlation is higher than SpeechRaterSM (0.55) and a study by Evanini and Wang (2013) of automated scoring of unconstrained spoken responses from an English proficiency assessment of non-native-speaking middle school students (0.62). Both of these studies used features that were derived from words, whereas our features were derived from syllables. ...
... Rogova et al. (2013), in a method similar to Ananthakrishnan (2004), syllabified words by utilizing segmental conditional random fields to combine features based on legality, sonority, and maximal onset with those based on the bigram probabilities of the training corpus. Demberg (2006) utilized a fourth-order HMM as a syllabification module in a larger German text-to-speech system. Schmid et al. (2007) enhanced Demberg's algorithm by using a statistical scheme for separating words into syllables based on a joint n-gram model. ...
... The other three algorithms are all based on the sonority principle (Clements 1990; Selkirk 1984). Syllabification-by-HMM and syllabification-by-k-means are based on an HMM, which others have employed (Bartlett et al. 2009; Demberg 2006; Krenn 1997; Schmid et al. 2007) and which is a typical machine learning technique for time-series data such as phonetic sequences. The final one, syllabification-by-genetic-algorithm, does not appear to have been utilized by other researchers, but is roughly based on the legality principle (Hooper 1972; Kahn 1976; Pulgram 1970; Vennemann 1987) and employs a dictionary of syllabification rules, which is automatically created by a genetic algorithm. ...
Article
Full-text available
Four algorithms for syllabifying phones are compared in automatically scoring English oral proficiency. The first algorithm clusters consonants into groups with the vowel nearer to them temporally, taking into account the maximal onset principle. A Hidden Markov Model (HMM) predicts the syllable boundaries based on their sonority value in the second algorithm. The third one employs three HMMs which are tuned to specific categories of utterances. The final algorithm uses a genetic algorithm to identify a set of rules for syllabifying the phones. They were evaluated by: (1) how well they syllabified utterances from the Boston University Radio News Corpus (BURNC) and (2) how well they worked as part of a process to automatically score English speaking proficiency. A measure of the temporal alignment of the syllables was utilized to judge how satisfactorily they syllabified utterances. Their suitability in the proficiency process was assessed with the Pearson correlation between the computer’s predicted proficiency scores and the scores determined by human examiners. We found that syllabification-by-genetic-algorithm performed the best in syllabifying the BURNC, but that syllabification-by-grouping (i.e., syllables are made by grouping non-syllabic consonant phones with the vowel or syllabic consonant phone nearest to them with respect to time) performed the best in the English oral proficiency rating application.
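A minimal sketch of the "grouping" idea summarized above: each phone is attached to the temporally nearest vowel (or, in a fuller implementation, syllabic consonant) nucleus. The phone inventory, input format, and midpoint-distance criterion are assumptions for illustration; the published algorithm additionally applies the maximal onset principle.

```python
# Hedged sketch of syllabification-by-grouping: attach each phone to the
# vowel nucleus whose temporal midpoint is nearest. Input format is assumed.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def syllabify_by_grouping(phones):
    """phones: list of (label, start_sec, end_sec) tuples."""
    mid = lambda p: (p[1] + p[2]) / 2.0
    nuclei = [i for i, p in enumerate(phones) if p[0] in VOWELS]
    if not nuclei:
        return [[p[0] for p in phones]]  # no vowel: one degenerate syllable
    syllables = [[] for _ in nuclei]
    for p in phones:
        # Assign each phone to the nearest vowel nucleus in time.
        nearest = min(range(len(nuclei)),
                      key=lambda k: abs(mid(p) - mid(phones[nuclei[k]])))
        syllables[nearest].append(p[0])
    return syllables

phones = [("S", 0.0, 0.1), ("T", 0.1, 0.15), ("AA", 0.15, 0.3),
          ("P", 0.3, 0.38), ("IH", 0.38, 0.5), ("T", 0.5, 0.58)]
print(syllabify_by_grouping(phones))  # -> [['S', 'T', 'AA'], ['P', 'IH', 'T']]
```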
... Previous work (Evanini and Wang, 2013) explored automated assessment of the speech component of the spoken responses to the picture narration task, but the linguistic and narrative aspects of the response have not received much attention. In this work, we investigate linguistic and construct-relevant aspects of the test such as (1) relevance and completeness of the content of the responses with respect to the prompt pictures, (2) proper word usage, (3) use of narrative techniques such as detailing to enhance the story, and (4) sequencing strategies to build a coherent story. ...
... Finally, our results are promising: we show that the combination of linguistic and construct-relevant features which we explore in this work outperforms the state-of-the-art baseline system, and that the best performance is obtained when the linguistic and construct-relevant features are combined with the speech features. Evanini et al. (2013) use features extracted mainly from speech for scoring the picture narration task. They employ measures capturing fluency, prosody and pronunciation. ...
... Human expert raters listen to the recorded responses, which are about 60 seconds in duration, and assign a score to each on a scale of 1-4, with score point 4 indicating an excellent response. In this work, we use the automatic speech recognition (ASR) output transcription of the responses (see (Evanini and Wang, 2013) for details). ...
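One simple way to approximate the "relevance and completeness of the content" aspect mentioned in the excerpts above is to compare the ASR transcription against reference descriptions of the prompt pictures using a bag-of-words cosine similarity. The sketch below uses scikit-learn for illustration; the reference texts and the plain count-vector representation are assumptions, not the authors' actual feature definition.

```python
# Hedged sketch: bag-of-words content-relevance score between an ASR
# transcript and reference descriptions of the prompt pictures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

references = [                                     # illustrative reference texts
    "a boy misses the bus and runs to school",
    "he arrives late and the teacher is waiting",
]
asr_transcript = "the boy run to the school because he miss the bus"

vectorizer = CountVectorizer().fit(references + [asr_transcript])
ref_vecs = vectorizer.transform(references)
resp_vec = vectorizer.transform([asr_transcript])

# Use the best match against any reference as a crude content feature.
content_score = cosine_similarity(resp_vec, ref_vecs).max()
print(f"content relevance (cosine): {content_score:.2f}")
```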
... Automated systems for assessing responses to a wider variety of speaking tasks, such as picture narration and source-based open questions, have appeared recently, and these systems can provide a more comprehensive evaluation of the speakers' communicative competence. For example, [5] investigated the performance of an automated speech scoring system applied to the TOEFL Junior Comprehensive assessment, which was designed to evaluate English communication skills of students aged 11 and older. In addition, [6] investigated automated speech scoring for the AZELLA speaking test, which contains a variety of spoken tasks used for assessing the English speaking proficiency of K-12 students. ...
... In this study, we use a corpus that contains non-native children's speech drawn from a pilot version of the TOEFL Junior Comprehensive assessment administered in late-2011 [5]. The TOEFL Junior Comprehensive is a computer-based test containing four sections: Reading Comprehension, Listening Comprehension, Speaking, and Writing. ...
... CELA is an internationally recognized set of exams and qualifications for learners of English. This correlation is higher than SpeechRaterSM and other related computer programs for automatically scoring the proficiency of unconstrained speech (Evanini and Wang 2013), where the automated scores were compared with official test scores. ...
... While not identical to the CELA corpus, the English proficiency of speakers using unconstrained speech was analyzed in a similar manner and scored from one to four in two other papers as discussed earlier (Evanini and Wang 2013;Zechner et al. 2009). The human-computer Pearson's correlations reported in those studies ranged from 0.55 to 0.62, well below the results we attained here. ...
Article
Full-text available
The performance of machine learning classifiers in automatically scoring the English proficiency of unconstrained speech has been explored. Suprasegmental measures were computed by software, which identifies the basic elements of Brazil's model in human discourse. This paper explores machine learning training with multiple corpora to improve two of those algorithms: prominent syllable detection and tone choice classification. The results show that machine learning training with the Boston University Radio News Corpus can improve automatic English proficiency scoring of unconstrained speech from a Pearson's correlation of 0.677 to 0.718. This correlation is higher than that of any other existing computer program for automatically scoring the proficiency of unconstrained speech and is approaching that of human raters in terms of inter-rater reliability.
... Previous work in automated speech scoring (Witt and Young 1997; Ai 2015; Evanini and Wang 2013) has examined phone-level scores derived from GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) based speech recognizer outputs. With the proliferation of deep learning techniques, more recent studies (Ying 2019; Hu et al. 2015; Sudhakara et al. 2019) have used acoustic models trained using a Deep Neural Network to improve mispronunciation detection & diagnosis (MDD). ...
Article
Automatic speech scoring is crucial in language learning, providing targeted feedback to language learners by assessing pronunciation, fluency, and other speech qualities. However, the scarcity of human-labeled data for languages beyond English poses a significant challenge in developing such systems. In this work, we propose a language-independent scoring approach to evaluate speech without relying on labeled data in the target language. We introduce a multilingual speech scoring system that leverages representations from the wav2vec 2.0 XLSR model and a forced-alignment technique based on CTC-Segmentation to construct speech features. These features are used to train a machine learning model to predict pronunciation and fluency scores. We demonstrate the potential of our method by predicting expert ratings on a speech dataset spanning five languages (English, French, Spanish, German, and Portuguese) and comparing its performance against language-specific models trained individually on each language, as well as a jointly trained model on all languages. Results indicate that our approach shows promise as an initial step towards universal, language-independent speech scoring.
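A rough sketch of the feature-extraction step described in the abstract above: mean-pooled wav2vec 2.0 XLSR representations feeding a simple regressor. The model checkpoint name, the mean pooling, and the random-forest regressor are assumptions for illustration; the CTC-Segmentation-based alignment used by the cited system is omitted here.

```python
# Hedged sketch: mean-pooled wav2vec 2.0 XLSR embeddings as speech features
# for a scoring regressor. Checkpoint, pooling, and regressor are assumed.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.ensemble import RandomForestRegressor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")

def utterance_embedding(waveform_16k: np.ndarray) -> np.ndarray:
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()      # mean-pool over time

# Toy training data: one embedding per response plus an expert score.
waves = [np.random.randn(16000).astype(np.float32) for _ in range(4)]
scores = [2.0, 3.5, 1.5, 4.0]
X = np.stack([utterance_embedding(w) for w in waves])
regressor = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, scores)
print(regressor.predict(X[:1]))
```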
... Again, the upper limit for the performance of a CAPT scoring system is set by the human annotators' disagreement. For shorter speech segments, the correlation between human evaluators (the inter-annotator agreement) starts to degrade, for example from a correlation of 0.9 for speaker evaluation based on the complete speech pool to 0.6-0.7 for single items (a single utterance or a collection of utterances constituting a reply to a single prompt) [13,18,19]. For evaluation of samples that are shorter than a minute, let alone for items as short as a single word of one or two phonemes, the inter-annotator agreement is a limiting factor for the performance of a computational grading system that aims for widely accepted objective scoring. ...
... The development of Computer-aided Pronunciation Training (CAPT) systems gives language learners a convenient way to practice their pronunciation [1,2,3], especially those who have little access to professional teachers. ...
Preprint
Full-text available
Many mispronunciation detection and diagnosis (MD&D) research approaches try to exploit both the acoustic and linguistic features as input. Yet the improvement in performance is limited, partially due to the shortage of large amounts of annotated training data at the phoneme level. Phonetic embeddings, extracted from ASR models trained with huge amounts of word-level annotations, can serve as a good representation of the content of input speech, in a noise-robust and speaker-independent manner. These embeddings, when used as implicit phonetic supplementary information, can alleviate the data shortage of explicit phoneme annotations. We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system. Experimental results obtained on the L2-ARCTIC database show the proposed approach outperforms the baseline by 9.93%, 10.13% and 6.17% on detection accuracy, diagnosis error rate and the F-measure, respectively.
... This paper discussed text-based scoring and its challenges, and now we cover speech scoring and common points between text- and speech-based scoring. Evanini and Wang (2013) worked on speech scoring of non-native school students, extracted features with SpeechRater, and trained a linear regression model, concluding that accuracy varies based on voice pitch. Loukina Knill et al. (2018). ...
Article
Full-text available
Assessment in the education system plays a significant role in judging student performance. The present evaluation system relies on human assessment. As the student-to-teacher ratio gradually increases, the manual evaluation process becomes more complicated. The drawbacks of manual evaluation are that it is time-consuming, lacks reliability, and more. In this connection, online examination systems have evolved as an alternative to pen-and-paper methods. Present computer-based evaluation systems work only for multiple-choice questions, but there is no proper evaluation system for grading essays and short answers. Many researchers have been working on automated essay grading and short-answer scoring for the last few decades, but assessing an essay by considering all parameters, such as the relevance of the content to the prompt, development of ideas, cohesion, and coherence, remains a big challenge. Few researchers have focused on content-based evaluation, while many have addressed style-based assessment. This paper provides a systematic literature review of automated essay scoring systems. We studied the Artificial Intelligence and Machine Learning techniques used for automatic essay scoring and analyzed the limitations of current studies and research trends. We observed that essay evaluation is often not based on the relevance of the content and coherence. Supplementary information: The online version contains supplementary material available at 10.1007/s10462-021-10068-2.
... Although the most recent research [1][2][3][4][5] indicates that Automatic Speech Recognition (ASR) systems for adult speech can reach a level close to that of human beings, non-native children's ASR still faces many challenges. A few ASR frameworks were proposed for non-native children in [6][7][8]. However, there are still some difficulties and deficiencies in the application of ASR systems to these specific groups. ...
... [1,2,3,4], and children's, e.g. [5,6,7,8,9], SLA, a few non-native English ASR systems have been successfully implemented. High performance, however, is still a challenge, with limited labelled training data available. ...
... Computer-Assisted Pronunciation Training (CAPT) is an important technology that offers automatic feedback to help users learn new spoken languages [1]. Because of its objectiveness, some standardized examinations also use the CAPT system for automatic speech proficiency evaluation (e.g., TOEFL [2], AZELLA [3]). ...
... Recent advances in ASR have impacted many applications in various fields, such as education, entertainment, home automation, and medical assistance (Vajpai and Bora, 2016). These applications can benefit children in their daily life, in playing games, reading tutors (Mostow, 2012), and learning both native and foreign languages (Evanini and Wang, 2013;Yeung and Alwan, 2019). ...
Conference Paper
Full-text available
In this paper, we propose spectral modification by sharpening formants and by reducing the spectral tilt to recognize children’s speech by automatic speech recognition (ASR) systems developed using adult speech. In this type of mismatched condition, the ASR performance is degraded due to the acoustic and linguistic mismatch in the attributes between children and adult speakers. The proposed method is used to improve the speech intelligibility to enhance the children’s speech recognition using an acoustic model trained on adult speech. In the experiments, WSJCAM0 and PFSTAR are used as databases for adults’ and children’s speech, respectively. The proposed technique gives a significant improvement in the context of the DNN-HMM-based ASR. Furthermore, we validate the robustness of the technique by showing that it performs well also in mismatched noise conditions.
... Sentence level pronunciation assessment is an important task in Computer Assisted Language Learning (CALL), which is commonly required by oral practice and assessment [1,2]. ...
Preprint
Sentence-level pronunciation assessment is important for Computer Assisted Language Learning (CALL). Traditional speech pronunciation assessment, based on the Goodness of Pronunciation (GOP) algorithm, has some weaknesses in assessing a speech utterance: 1) Phoneme GOP scores cannot be easily translated into a sentence score with a simple average for effective assessment; 2) The rank ordering information has not been well exploited in GOP scoring to deliver a robust assessment that correlates well with a human rater's evaluations. In this paper, we propose two new statistical features, average GOP (aGOP) and confusion GOP (cGOP), and use them to train a binary classifier in Ordinal Regression with Anchored Reference Samples (ORARS). When the proposed approach is tested on the Microsoft mTutor ESL Dataset, a relative improvement in Pearson correlation coefficient of 26.9% is obtained over the conventional GOP-based approach. The performance is at a human-parity level or better than human raters.
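For reference, a common formulation of the phone-level GOP score mentioned above, together with the naive sentence-level average that the authors argue is insufficient on its own, is shown below. These are the standard textbook definitions; the paper's aGOP and cGOP features are refinements that are not reproduced here.

```latex
% Standard phone-level Goodness of Pronunciation for phone p over its
% aligned acoustic segment O_p (T_p frames), with Q the phone inventory,
% and a naive sentence-level average over the N phones of an utterance.
\[
\mathrm{GOP}(p) = \frac{1}{T_p}\,
  \log \frac{P(O_p \mid p)\,P(p)}{\max_{q \in Q} P(O_p \mid q)\,P(q)},
\qquad
\mathrm{GOP}_{\mathrm{sent}} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{GOP}(p_i)
\]
```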
... Computer-Assisted Pronunciation Training (CAPT) is an important technology that offers automatic feedback to help users learn new spoken languages [1]. Because of its objectiveness, some standardized examinations also use the CAPT system for automatic speech proficiency evaluation (e.g., TOEFL [2], AZELLA [3]). ...
Preprint
Full-text available
Mispronunciation detection is an essential component of the Computer-Assisted Pronunciation Training (CAPT) systems. State-of-the-art mispronunciation detection models use Deep Neural Networks (DNN) for acoustic modeling, and a Goodness of Pronunciation (GOP) based algorithm for pronunciation scoring. However, GOP based scoring models have two major limitations: i.e., (i) They depend on forced alignment which splits the speech into phonetic segments and independently use them for scoring, which neglects the transitions between phonemes within the segment; (ii) They only focus on phonetic segments, which fails to consider the context effects across phonemes (such as liaison, omission, incomplete plosive sound, etc.). In this work, we propose the Context-aware Goodness of Pronunciation (CaGOP) scoring model. Particularly, two factors namely the transition factor and the duration factor are injected into CaGOP scoring. The transition factor identifies the transitions between phonemes and applies them to weight the frame-wise GOP. Moreover, a self-attention based phonetic duration modeling is proposed to introduce the duration factor into the scoring model. The proposed scoring model significantly outperforms baselines, achieving 20% and 12% relative improvement over the GOP model on the phoneme-level and sentence-level mispronunciation detection respectively.
... Among the factors affecting the accuracy of automatic scoring of speaking tasks, the type of oral question is particularly important. The error between the automatic score of open oral questions and the teacher's score is still high and unstable (Evanini et al., 2013; Loukina, 2017). This study explores a method to score an open-type speech test automatically and improve its accuracy to an acceptable degree. ...
... The AZELLA data set [5], developed by Pearson, includes 1,500 spoken tests, each double-graded by human professionals, from a variety of tasks. The work in [6] describes a latent semantic analysis (LSA) based approach for scoring the proficiency of the AZELLA test set, while [7] describes a system designed to automatically evaluate the communication skills of young English students. Features proposed for evaluation of pronunciation are described for instance in [8]. ...
... Many of these applications can benefit children. For example, interactive reading tutors [4] and automatic reading assessment systems can help school-age and preschool children in learning both native and foreign languages [5,6]. However, challenges in ASR for child speech have hindered its adoption for such applications. ...
... The AZELLA data set [4], developed by Pearson, includes 1,500 spoken tests, each double-graded by human professionals, from a variety of tasks. The work in [5] describes a latent semantic analysis (LSA) based approach for scoring the proficiency of the AZELLA test set, while [6] describes a system designed to automatically evaluate the communication skills of young English students. Features proposed for evaluation of pronunciation are described for instance in [7]. ...
Preprint
Full-text available
This paper describes technology developed to automatically grade Italian students (ages 9-16) on their English and German spoken language proficiency. The students' spoken answers are first transcribed by an automatic speech recognition (ASR) system and then scored using a feedforward neural network (NN) that processes features extracted from the automatic transcriptions. In-domain acoustic models, employing deep neural networks (DNNs), are derived by adapting the parameters of an original out-of-domain DNN.
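A minimal sketch of the scoring stage described above: a small feedforward network regressing a proficiency score from features computed on the automatic transcriptions. The feature names and the scikit-learn MLP are illustrative stand-ins, not the cited system's actual features or network architecture.

```python
# Hedged sketch: feedforward NN scoring on transcription-derived features.
# Feature definitions, data, and network shape are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each row: [num_words, num_unique_words, speech_rate_wps, asr_confidence]
X = np.array([
    [12, 10, 1.1, 0.62],
    [34, 25, 2.0, 0.81],
    [55, 40, 2.4, 0.88],
    [20, 15, 1.4, 0.70],
])
y = np.array([1.5, 3.0, 4.0, 2.0])  # human proficiency scores

scorer = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
scorer.fit(X, y)
print(scorer.predict([[30, 22, 1.8, 0.75]]))
```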
... Various acoustic, prosodic and linguistic features are explored for children's speech processing studies. An automatic scoring system [5] for speech of middle school students was developed by using pronunciation, prosody, lexical and content features. Features obtained from multiple aspects are utilized in [6] on an automatic system to assess the proficiency of non-native children's speech from age 8 and above. ...
... articulation and duration of phonemes, pauses, use of pitch, and mean duration between stressed syllables) and associated weightings of those features in computer algorithms (e.g. Evanini & Wang, 2013; Xi et al., 2012). Such systems could potentially include some of the rhythm measures identified here as a means of improving speech recognition accuracy and enhancing the assessment of prosody. ...
Chapter
Full-text available
This chapter discusses variation in pronunciation as well as the role of expectations and attitudes in perception of pronunciation, with implications for assessment. There is considerable variation in native-speaker pronunciation, even within what may be perceived as a standard variety such as ‘General American’. Such variation may go unnoticed until produced by English learners or users (or even native speakers from places like India or Singapore), which then may be interpreted as ‘errors’. In fact, there is evidence to suggest that the same pronunciation features may be perceived differently depending on who is believed to be using them and what stereotypes exist about the perceived speaker. Such misperception may be exacerbated by issues of systematic bias against (perceived) non-native speech, especially that spoken by non-White speakers. Unfortunately, TESOL specialists may also be subject to such biases in spite of good intentions. These sociolinguistic findings suggest that an accuracy-based measure of pronunciation is likely to be somewhat arbitrary, especially for vowel pronunciation, which varies especially widely among native speakers. In addition, a focus on ‘errors’ may be problematic, since these may be over-perceived when a speaker is believed to be non-native. Thus, it may not be meaningful or reliable to talk about ‘accuracy’ or even accentedness as a measure of pronunciation. Instead, as many have argued, there is a need for emphasis on intelligibility, although assessing intelligibility comes with its own challenges. While intelligibility needs to be assessed with respect to the speaker’s intended interlocutors (e.g. undergraduate students might play a role in assessment of potential non-native teaching assistants), possible biases in such assessments need to be taken into account. Ultimately, it is important to work with English learners’ strategies for dealing with possible bias as well as with the wider public’s awareness of such biases.
... articulation and duration of phonemes, pauses, use of pitch, and mean duration between stressed syllables) and associated weightings of those features in computer algorithms (e.g. Evanini & Wang, 2013; Xi et al., 2012). Such systems could potentially include some of the rhythm measures identified here as a means of improving speech recognition accuracy and enhancing the assessment of prosody. ...
Book
Full-text available
This book is open access under a CC BY licence. It spans the areas of assessment, second language acquisition (SLA) and pronunciation and examines topical issues and challenges that relate to formal and informal assessments of second language (L2) speech in classroom, research and real-world contexts. It showcases insights from assessing other skills (e.g. listening and writing) and highlights perspectives from research in speech sciences, SLA, psycholinguistics and sociolinguistics, including lingua franca communication, with concrete implications for pronunciation assessment. This collection will help to establish commonalities across research areas and facilitate greater consensus about key issues, terminology and best practice in L2 pronunciation research and assessment. Due to its interdisciplinary nature, this book will appeal to a mixed audience of researchers, graduate students, teacher-educators and exam board staff with varying levels of expertise in pronunciation and assessment and wide-ranging interests in applied linguistics. © 2017 Talia Isaacs, Pavel Trofimovich and the authors of individual chapters. All rights reserved.
... Two corpora of non-native spontaneous English drawn from the domain of spoken English proficiency assessment are used in this study. The first corpus contains non-native children's speech drawn from a pilot version of the TOEFL Junior Comprehensive assessment administered in late-2011 [17]. The TOEFL Junior Comprehensive is a computer-based test containing four sections: Reading Comprehension, Listening Comprehension, Speaking, and Writing. ...
... All responses were converted to text using a state-of-the-art automatic speech recognizer (ASR) with a constrained vocabulary (see Evanini and Wang (2013) for further details). To evaluate the effect of the errors that may have been introduced by the ASR system, all responses were ... [footnote 2: see http://www.ets.org/s/toefl] ...
Conference Paper
Full-text available
This paper investigates whether ROUGE, a popular metric for the evaluation of automated written summaries, can be applied to the assessment of spoken summaries produced by non-native speakers of English. We demonstrate that ROUGE, with its emphasis on the recall of information, is particularly suited to the assessment of the summarization quality of non-native speakers' responses. A standard baseline implementation of ROUGE-1 computed over the output of the automated speech recognizer has a Spearman correlation of ρ = 0.55 with experts' scores of speakers' proficiency (ρ = 0.51 for a content-vector baseline). Further increases in agreement with experts' scores can be achieved by using types instead of tokens for the computation of word frequencies for both candidate and reference summaries, as well as by using multiple reference summaries instead of a single one. These modifications increase the correlation with experts' scores to a Spearman correlation of ρ = 0.65. Furthermore, we found that the choice of reference summaries does not have any impact on performance, and that the adjusted metric is also robust to errors introduced by automated speech recognition (ρ = 0.67 for human transcriptions vs. ρ = 0.65 for speech recognition output).
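The type-based, multi-reference ROUGE-1 recall described above can be sketched in a few lines: the word types (rather than tokens) of the candidate summary are matched against the types of each reference summary. Whitespace tokenization and the max-over-references aggregation below are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: type-based ROUGE-1 recall against multiple references.
# Tokenization and max-over-references aggregation are assumed here.
def rouge1_type_recall(candidate: str, references: list) -> float:
    cand_types = set(candidate.lower().split())
    recalls = []
    for ref in references:
        ref_types = set(ref.lower().split())
        if ref_types:
            recalls.append(len(cand_types & ref_types) / len(ref_types))
    return max(recalls) if recalls else 0.0

references = ["the lecture explains how bees communicate through dance",
              "bees share the location of food by dancing"]
candidate = "the speaker says bees dance to tell others where food is"
print(f"ROUGE-1 type recall: {rouge1_type_recall(candidate, references):.2f}")
```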
... In addition, children may have different speech patterns in linguistic areas such as pronunciation, prosody, lexical choice, and syntax. To overcome these problems, several corpora containing only children's speech have been collected (CSLU, 2008; Hagen, Pellom & Cole, 2003; Kazemzadeh et al., 2005; Kantor et al., 2012; LDC, 1997) and have been used to train or adapt ASR systems so that they will perform better on children's speech. The 2011 TOEFL Junior pilot administration collected a large number of responses from children from several different L1 backgrounds, so these data can be used for training or adapting an ASR system specifically for TOEFL Junior. ...
Article
This report describes the initial automated scoring results that were obtained using the constructed responses from the Writing and Speaking sections of the pilot forms of the TOEFL Junior® Comprehensive test administered in late 2011. For all of the items except one (the edit item in the Writing section), existing automated scoring capabilities were used with only minor modifications to obtain a baseline benchmark for automated scoring performance on the TOEFL Junior task types; for the edit item in the Writing section, a new automated scoring capability based on string matching was developed. A generic scoring model from the e-rater® automated essay scoring engine was used to score the email, opinion, and listen-write items in the Writing section, and the form-level results based on the five responses in the Writing section from each test taker showed a human–machine correlation of r = .83 (compared to a human–human correlation of r = .90). For scoring the Speaking section, new automated speech recognition models were first trained, and then item-specific scoring models were built for the read-aloud, picture narration, and listen-speak items using preexisting features from the SpeechRaterSM automated speech scoring engine (with the addition of a new content feature for the listen-speak items). The form-level results based on the five items in the Speaking section from each test taker showed a human–machine correlation of r = .81 (compared to a human–human correlation of r = .89).
... Especially for children it can contribute to a fun, exciting and engaging way to learn [1]. Besides the known problems of recognizing children's speech [2], providing feedback in terms of pronunciation and grammar errors [3], reading fluency [4], speech scoring [5], etc., is becoming a central issue. ...
Conference Paper
Full-text available
By definition spoken dialogue CALL systems should be easy to use and understand. However, interaction in this context is often far from unhindered. In this paper we introduce a formative feedback mechanism in our CALL system, which can monitor interaction, report errors and provide advice and suggestions to users. The distinctive feature of this mechanism is the ability to combine information from different sources and decide on the most pertinent feedback, which can also be adapted in terms of phrasing, style and language. We conducted experiments at three secondary schools in German-speaking Switzerland and the obtained results suggest that our feedback mechanism helps students during interaction and contributes as a motivating factor.
Article
Automatic pronunciation assessment (APA) manages to quantify second language (L2) learners' pronunciation proficiency in a target language by providing fine-grained feedback with multiple aspect scores (e.g., accuracy, fluency, and completeness) at various linguistic levels (i.e., phone, word, and utterance). Most of the existing efforts commonly follow a parallel modeling framework, which takes a sequence of phone-level pronunciation feature embeddings of a learner's utterance as input and then predicts multiple aspect scores across various linguistic levels. However, these approaches neither take the hierarchy of linguistic units into account nor consider the relatedness among the pronunciation aspects in an explicit manner. In light of this, we put forward an effective modeling approach for APA, termed HierGAT, which is grounded on a hierarchical graph attention network. Our approach facilitates hierarchical modeling of the input utterance as a heterogeneous graph that contains linguistic nodes at various levels of granularity. On top of the tactfully designed hierarchical graph message passing mechanism, intricate interdependencies within and across different linguistic levels are encapsulated and the language hierarchy of an utterance is factored in as well. Furthermore, we also design a novel aspect attention module to encode relatedness among aspects. To our knowledge, we are the first to introduce multiple types of linguistic nodes into graph-based neural networks for APA and perform a comprehensive qualitative analysis to investigate their merits. A series of experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.
Article
Full-text available
Automatic speech recognition (ASR) in children is a rapidly evolving field, as children become more accustomed to interacting with virtual assistants such as Amazon Echo, Cortana, and other smart speakers, and it has advanced human–computer interaction in recent generations. Furthermore, non-native children are observed to exhibit a diverse range of reading errors during second language (L2) acquisition, such as lexical disfluency, hesitations, intra-word switching, and word repetitions, which are not yet addressed and leave ASR struggling to recognize non-native children's speech. The main objective of this study is to develop a non-native children's speech recognition system on top of feature-space discriminative models, such as feature-space maximum mutual information (fMMI) and boosted feature-space maximum mutual information (fbMMI). Applying speed perturbation-based data augmentation to the original children's speech corpora yields effective performance. The corpus covers different speaking styles of children, including read speech and spontaneous speech, in order to investigate the impact of non-native children's L2 speaking proficiency on speech recognition systems. The experiments revealed that feature-space MMI models with steadily increasing speed perturbation factors outperform traditional ASR baseline models.
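The speed-perturbation augmentation mentioned above is commonly implemented by resampling each training utterance at a few fixed factors (conventionally 0.9, 1.0, and 1.1). The sketch below uses torchaudio's sox effects for illustration; the factors and tooling are assumptions about a typical recipe, not the paper's exact setup.

```python
# Hedged sketch: speed perturbation for ASR data augmentation via torchaudio.
# The 0.9/1.0/1.1 factors are the conventional choice, assumed here.
import torch
import torchaudio

def speed_perturb(waveform: torch.Tensor, sample_rate: int, factor: float):
    # "speed" changes tempo and pitch; "rate" resamples back to sample_rate.
    effects = [["speed", f"{factor}"], ["rate", f"{sample_rate}"]]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    return out

sr = 16000
wave = torch.randn(1, sr * 2)  # 2 seconds of dummy audio, shape (channels, time)
augmented = [speed_perturb(wave, sr, f) for f in (0.9, 1.0, 1.1)]
print([a.shape[1] for a in augmented])  # slower speech yields more samples
```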
Article
Full-text available
Modern Computer Assisted Language Learning (CALL) systems use speech recognition to give students the opportunity to build up their spoken language skills through interactive practice with a mechanical partner. Besides the obvious benefits that these systems can offer, e.g. flexible and inexpensive learning, user interaction in this context can often be problematic. In this article, the authors introduce a parallel layer of feedback in a CALL application, which can monitor interaction, report errors and provide advice and suggestions to students. This mechanism combines knowledge accumulated from four different inputs in order to decide on appropriate feedback, which can be customized and adapted in terms of phrasing, style and language. The authors report the results from experiments conducted at six lower secondary classrooms in German-speaking Switzerland with and without this mechanism. After analyzing approximately 13,000 spoken interactions it can be reasonably argued that their parallel feedback mechanism in L2 actually does help students during interaction and contributes as a motivation factor.
Article
Full-text available
In this paper, we investigate the task of phone-level pronunciation error detection as a binary classification problem, the performance of which is heavily affected by the imbalanced distribution of the classes in a manually annotated data set of non-native English. In order to address problems caused by this extreme class imbalance, methods for cost-sensitive learning (weighting inversely proportional to class frequencies) and over-sampling of synthetic instances (SMOTE) are investigated in order to improve classification performance. Experiments using classifiers consisting of features based on acoustic phonetics and word identity demonstrate that these machine learning approaches lead to performance improvements over the baseline system based on the extremely imbalanced data. In addition, several different types of classifiers were compared. Finally, the paper analyzes the robustness of classifier performance across different phones.
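A minimal sketch of the two imbalance-handling strategies named above: class weighting inversely proportional to class frequency, and SMOTE over-sampling of the minority (mispronunciation) class. The synthetic data and logistic-regression classifier are illustrative; the cited work's acoustic-phonetic and word-identity features are not reproduced.

```python
# Hedged sketch: cost-sensitive weighting vs. SMOTE over-sampling for
# imbalanced phone-level error detection. Data and features are synthetic.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 95% "correct" phones (label 0), 5% "mispronounced" phones (label 1).
X = np.vstack([rng.normal(0.0, 1.0, (950, 4)), rng.normal(1.0, 1.0, (50, 4))])
y = np.array([0] * 950 + [1] * 50)

# Strategy 1: cost-sensitive learning, with weights inversely proportional
# to class frequencies.
weighted_clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Strategy 2: SMOTE, synthesize minority-class instances, then train as usual.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
smote_clf = LogisticRegression().fit(X_res, y_res)

print(weighted_clf.predict_proba(X[:1]), smote_clf.predict_proba(X[:1]))
```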
Article
Full-text available
Speech technology offers great promise in the field of automated literacy and reading tutors for children. In such applications speech recognition can be used to track the reading position of the child, detect oral reading miscues, assess comprehension of the text being read by estimating whether the prosodic structure of the speech is appropriate to the discourse structure of the story, or engage the child in interactive dialogs to assess and train comprehension. Despite such promises, speech recognition systems exhibit higher error rates for children due to variability in vocal tract length, formant frequency, pronunciation, and grammar. In the context of recognizing speech while children are reading out loud, these problems are compounded by speech production behaviors affected by difficulties in recognizing printed words, which cause pauses, repeated syllables and other phenomena. To overcome these challenges, we present advances in speech recognition that improve accuracy and modeling capability in the context of an interactive literacy tutor for children. Specifically, this paper focuses on a novel set of speech recognition techniques which can be applied to improve oral reading recognition. First, we demonstrate that speech recognition error rates for interactive read aloud can be reduced by more than 50% through a combination of advances in both statistical language and acoustic modeling. Next, we propose extending our baseline system by introducing a novel token-passing search architecture targeting subword unit based speech recognition. The proposed subword unit based speech recognition framework is shown to provide equivalent accuracy to a whole-word based speech recognizer while enabling detection of oral reading events and finer grained speech analysis during recognition. The efficacy of the approach is demonstrated using data collected from children in grades 3–5, namely 34.6% of partial words with reasonable evidence in the speech signal are detected at a low false alarm rate of 0.5%.
Article
Full-text available
We present initial results of FLORA, an accessible computer program that uses speech recognition to provide an accurate measure of children's oral reading ability. FLORA presents grade-level text passages to children, who read the passages out loud, and computes the number of words correct per minute (WCPM), a standard measure of oral reading fluency. We describe the main components of the FLORA program, including the system architecture and the speech recognition subsystems. We compare results of FLORA to human scoring on 783 recordings of grade level text passages read aloud by first through fourth grade students in classroom settings. On average, FLORA WCPM scores were within 3 to 4 words of human scorers across students in different grade levels and schools.
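The WCPM measure reported above is straightforward to compute once the ASR hypothesis has been aligned to the passage text: count the correctly read words and normalize by elapsed reading time. The alignment format in the sketch below (each passage word paired with the recognized word, with an empty string marking an omission) is an assumption for illustration.

```python
# Hedged sketch: words correct per minute (WCPM) from a word-level alignment.
# `alignment` pairs each passage word with the ASR hypothesis ("" = omission).
def wcpm(alignment, reading_time_sec: float) -> float:
    correct = sum(1 for ref, hyp in alignment if ref.lower() == hyp.lower())
    return correct * 60.0 / reading_time_sec

alignment = [("the", "the"), ("quick", "quick"), ("brown", "brown"),
             ("fox", "box"), ("jumps", "jumps"), ("over", "")]
print(f"WCPM: {wcpm(alignment, reading_time_sec=4.0):.1f}")  # -> 60.0
```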