Klaus Zechner
Educational Testing Service | ETS · Division of Research and Development
Ph.D.
105
Publications
20,425
Reads
2,441
Citations
Introduction
My research focuses on automated scoring of non-native speech. The Speech Team at ETS built, and continues to maintain and improve, the SpeechRater™ automated speech scoring engine.
Publications (105)
Children’s speech recognition is a challenging task because of the inherent speech production characteristics of children’s articulatory structure as well as their linguistic usage. In the context of developing automated reading companions, the problem is compounded by lack of training data. Most of the available data is recorded under clean and co...
The COVID-19 pandemic has led to a dramatic increase in the use of face masks worldwide. Face coverings can affect both acoustic properties of the signal as well as speech patterns and have unintended effects if the person wearing the mask attempts to use speech processing technologies. In this paper we explore the impact of wearing face masks on t...
Recent technology advancements have increased the prospects for automated spoken language technology to provide feedback on speaking performance. In this study we examined user perceptions of using an automated feedback system for preparing for the TOEFL iBT® test. Test takers and language teachers evaluated three types of machine-generated feedbac...
In this study, we developed an automated algorithm to provide feedback about the specific content of non-native English speakers’ spoken responses. The responses were spontaneous speech, elicited using integrated tasks where the language learners listened to and/or read passages and integrated the core content in their spoken responses. Our models...
This research report provides an overview of the R&D efforts at Educational Testing Service related to its capability for automated scoring of nonnative spontaneous speech with the SpeechRater℠ automated scoring service since its initial version was deployed in 2006. While most aspects of this R&D work have been published in various venues in rece...
As automated scoring systems for spoken responses are increasingly used in language assessments, testing organizations need to analyze their performance, as compared to human raters, across several dimensions, e.g., on individual items or based on subgroups of test takers. In addition, there is a need in testing organizations to establish rigorous...
As automated scoring systems for spoken responses are increasingly used in language assessments, testing organizations need to analyze their performance, as compared to human raters, across several dimensions, for example, on individual items or based on subgroups of test takers. In addition, there is a need in testing organizations to establish ri...
In this study, we propose an efficient way to combine human and automated scoring to increase the reliability and validity of a system used to assess spoken responses in the context of an international English language assessment. A set of filtering systems are used to automatically identify various classes of spoken responses that are difficult to...
We present work in progress on a multimodal dialog system for English language assessment using a modular cloud-based architecture adhering to open industry standards. Among the modules being developed for the system, multiple modules heavily exploit machine learning techniques, including speech recognition, spoken language proficiency rating, spea...
This report presents an overview of the SpeechRater℠ automated scoring engine model building and evaluation process for several item types with a focus on a low-English-proficiency test-taker population. We discuss each stage of speech scoring, including automatic speech recognition, filtering models for nonscorable responses, and scoring model bu...
Syntactic competence, especially the ability to use a wide range of sophisticated grammatical expressions, represents an important aspect of communicative acumen. This paper explores the question of how to best evaluate the syntactic competence of non-native speakers in an automated way. Using spoken responses of test takers participating in an Eng...
In this paper, we conduct experiments using F0 contour features to assess the nativeness of responses provided by speakers from India and China to a Sentence Repeat task in an assessment of English speaking proficiency for non-native speakers. The results show that the coefficients from polynomial models of the pitch contours help distinguish betwe...
This research report presents a summary of research and development efforts devoted to creating scoring models for automatically scoring spoken item responses of a pilot administration of the Test of English-for-Teaching (TEFT™) within the ELTeach™ framework. The test consists of items for all four language modalities: reading, listening, writing,...
In recent years, the application of learner corpora to the educational domain has developed into a rapidly growing research field. Multiple research venues are now devoted to it, including the International Speech Communication Association (ISCA) Special Interest Group on Speech and Language Technology in Education (SLaTE) and the North American Ch...
Computer-implemented systems and methods are provided for scoring content of a spoken response to a prompt. A scoring model is generated for a prompt, where generating the scoring model includes generating a transcript for each of a plurality of training responses to the prompt, dividing the plurality of training responses into clusters based on th...
This paper investigates whether ROUGE, a popular metric for the evaluation of automated written summaries, can be applied to the assessment of spoken summaries produced by non-native speakers of English. We demonstrate that ROUGE, with its emphasis on the recall of information, is particularly suited to the assessment of the summarization quality o...
This study investigates a variety of rhythm metrics on two corpora of non-native spontaneous speech and compares the non-native distributions to values from a corpus of native speech. Several of the metrics are shown to differentiate well between native and non-native speakers and to also have moderate correlations with English proficiency scores...
The present disclosure presents a useful metric for assessing the relative difficulty which non-native speakers face in pronouncing a given utterance and a method and systems for using such a metric in the evaluation and assessment of the utterances of non-native speakers. In an embodiment, the metric may be based on both known sources of difficult...
This paper compares two alternative scoring methods - multiple regression and classification trees - for an automated speech scoring system used in a practice environment. The two methods were evaluated on two criteria: construct representation and empirical performance in predicting human scores. The empirical performance of the two scoring models...
This paper presents an exploration into automated content scoring of non-native spontaneous speech using ontology-based information to enhance a vector space approach. We use content vector analysis as a baseline and evaluate the correlations between human rater proficiency scores and two cosine-similarity-based features, previously used in the con...
This study presents a method that assesses ESL learners' vocabulary usage to improve an automated scoring system of spontaneous speech responses by non-native English speakers. Focusing on vocabulary sophistication, we estimate the difficulty of each word in the vocabulary based on its frequency in a reference corpus and assess the mean difficulty...
Although the operational scoring of the TOEFL iBT speaking section features the overall judgment of an examinee's speaking ability, the evaluation of specific components of speech such as delivery (pace and clarity of speech) and language use (vocabulary and grammar use) may be a promising approach to providing diagnostic information to learners. T...
We evaluate two types of prosodic features utilizing automatically generated stress and tone labels for non-native read speech in terms of their applicability for automated speech scoring. Both types of features have not been used in the context of automated scoring of non-native read speech to date. In our first experiment, we compute features bas...
Speech rhythm measurements have been used in a limited number of previous studies on automated speech assessment, an approach using speech recognition technology to judge non-native speakers' proficiency levels. However, one of the most problematic issues of these previous studies is a lack of a comparison of these rhythm features with other effect...
We present a method that filters out non-scorable (NS) responses, such as responses with a technical difficulty, in an automated speaking proficiency assessment system. The assessment system described in this study first filters out the non-scorable responses and then predicts a proficiency score using a scoring model for the remaining responses. T...
This paper presents a description and evaluation of SpeechRater℠, a system for automated scoring of non-native speakers’ spoken English proficiency, based on tasks which elicit spontaneous monologues on particular topics. This system builds on much previous work in the automated scoring of test responses, but differs from previous work in that the...
This paper focuses on identifying, extracting and evaluating features related to syntactic complexity of spontaneous spoken responses as part of an effort to expand the current feature set of an automated speech scoring system in order to cover additional aspects considered important in the construct of communicative competence. Our goal is to find...
This article provides an overview and rationale for the development of automated essay scoring (AES) and associated spin-off technologies including electronic portfolios, support for English-language learners, and the automated scoring of spoken responses. AES is the ability of the computer to evaluate written responses. Until recently, this capabi...
We have developed an automated method that predicts the word accuracy of a speech recognition system for non-native speech, in the context of speaking proficiency scoring. A model was trained using features based on speech recognizer scores, function word distributions, prosody, background noise, and speaking fluency. Since the method was implement...
This study investigates the use of Amazon Mechanical Turk for the transcription of non-native speech. Multiple transcriptions were obtained from several distinct MTurk workers and were combined to produce merged transcriptions that had higher levels of agreement with a gold standard transcription than the individual transcriptions. Three differen...
This paper presents the first version of the SpeechRater℠ system for automatically scoring non-native spontaneous high-entropy speech in the context of an online practice test for prospective takers of the Test of English as a Foreign Language® internet-based test (TOEFL® iBT). The system consists of a speech recognizer trained on non-native Englis...
Assessment of reading proficiency is typically done by asking subjects to read a text passage silently and then answer questions related to the text. An alternate approach, measuring reading-aloud proficiency, has been shown to correlate well with the aforementioned common method and is used as a paradigm in this paper. We describe a system that is...
This paper describes research on automatic assessment of the pronunciation quality of spontaneous non-native adult speech. Since the speaking content is not known prior to the assessment, a two-stage method is developed to first recognize the speaking content based on non-native speech acoustic properties and then forced-align the recognition r...
This paper presents an analysis of differences in human transcriptions of non-native spontaneous speech on a word level, collected in the context of an English Proficiency Test. While transcribers of native speech typically agree at a very high level (5% word error rate or less), this study finds substantially higher disagreement rates between tran...
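The word error rate cited in this abstract is conventionally computed as the word-level edit distance between two transcripts, divided by the length of the reference transcript. The following is a minimal sketch of that standard definition and is not taken from the paper; the function name `word_error_rate` is hypothetical, and treating one transcriber's output as the "reference" is an assumption made only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, and
    substitutions, insertions, and deletions all cost 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, agreement between two transcribers can be estimated by scoring one transcript against the other; because the denominator is the reference length, the result is not symmetric in its two arguments.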
This report presents the results of a research and development effort for SpeechRater℠ Version 1.0 (v1.0), an automated scoring system for the spontaneous speech of English language learners used operationally in the Test of English as a Foreign Language™ (TOEFL®) Practice Online assessment (TPO). The report includes a summary of the validity cons...
This paper describes a system aimed at automatically scoring two task types of high and medium-high linguistic entropy from a spoken English test with a total of six widely differing task types. We describe the speech recognizer used for this system and its acoustic model and language model adaptation; the speech features computed based on the reco...
The increasing availability and performance of computer-based testing has prompted more research on the automatic assessment of language and speaking proficiency. In this investigation, we evaluated the feasibility of using an off-the-shelf speech-recognition system for scoring speaking prompts from the LanguEdge field test of 2002. We first establ...
This paper presents an overview of the SpeechRater™ system of Educational Testing Service (ETS), a fully operational automated scoring system for non-native spontaneous speech employed in a practice context. This novel system stands in contrast to most prior speech scoring systems which focus on fairly predictable, low entropy speech such as read...
This paper investigates the feasibility of automated scoring of spoken English proficiency of non-native speakers. Unlike existing automated assessments of spoken English, our data consists of spontaneous spoken responses to complex test items. We perform both a quantitative and a qualitative analysis of these features using two different machine l...
This paper presents a study on optimizing sentence pair alignment scores of a bilingual sentence alignment module. Five candidate scores based on perplexity and sentence length are introduced and tested. Then a linear regression model based on those candidates is proposed and trained to predict sentence pairs' alignment quality scores solicited f...
While the field of Information Retrieval originally had the search for the most relevant documents in mind, it has become increasingly clear that in many instances, what the user wants is a piece of coherent information, derived from a set of relevant documents and possibly other sources. Reducing relevant documents, passages, and sentences to thei...
This paper describes a system for generating text abstracts which relies on a general, purely statistical principle, i.e., on the notion of "relevance", as it is defined in terms of the combination of tf*idf weights of words in a sentence. The system generates abstracts from newspaper articles by selecting the "most relevant" sentences and comb...
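The relevance principle in this abstract, scoring each sentence by combining the tf*idf weights of its words and keeping the highest-scoring sentences, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's system: the names `tokenize` and `rank_sentences`, the smoothed idf formula, the use of a separate background collection for document frequencies, summing as the "combination", and returning the selected sentences in their original order are all choices made here for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rank_sentences(sentences, background_docs, top_k=2):
    """Return the top_k sentences by summed tf*idf, in original order.

    tf is counted over the whole article (all sentences together);
    idf comes from background_docs (an assumed reference collection).
    """
    docs = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(docs)

    def idf(word):
        # Smoothed idf (an assumption) so unseen words get a finite weight.
        df = sum(1 for doc in docs if word in doc)
        return math.log((n_docs + 1) / (df + 1)) + 1.0

    article_tf = Counter(w for s in sentences for w in tokenize(s))

    scored = []
    for index, sentence in enumerate(sentences):
        # Each distinct word contributes its article-level tf times its idf.
        score = sum(article_tf[w] * idf(w) for w in set(tokenize(sentence)))
        scored.append((score, index))

    # Highest score first; ties broken by earlier position.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```

Because scores are plain sums, longer sentences tend to rank higher; a length normalization could be added, but whether the original system used one is not stated in the visible part of the abstract.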
In this paper, we present a summarization system for spontaneous dialogues which consists of a novel multi-stage architecture. It is specifically aimed at addressing issues related to the nature of the texts being spoken vs. written and being dialogical vs. monological. The system is embedded in a graphical user interface and was developed and test...
Automatic summarization of open-domain spoken dialogues is a relatively new research area. This article introduces the task and the challenges involved and motivates and presents an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different genres, without any restriction on domain. We address...
Oral communication is transient, but many important decisions, social contracts and fact findings are first carried out in an oral setup, documented in written form and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. This paper summarizes part of the pr...
In this paper, we present a chunk based partial parsing system for spontaneous, conversational speech in unrestricted domains. We show that the chunk parses produced by this parsing system can be usefully applied to the task of reranking Nbest lists from a speech recognizer, using a combination of chunk-based n-gram model scores and chunk coverag...
We describe and experimentally evaluate an efficient method for automatically determining small clause boundaries in spontaneous speech. Our method applies an artificial neural network to information about part of speech and trigger words.
Automatic generation of text summaries for spoken language faces the problem of containing incorrect words and passages due to speech recognition errors. This paper describes comparative experiments where passages with higher speech recognizer confidence scores are favored in the ranking process. Results show that a relative word error rate reduc...
This paper presents a system which automatically generates shallow semantic frame structures for conversational speech in unrestricted domains.
While the majority of summarization research so far has focused on written documents (mostly news articles or scientific papers), this thesis addresses for the first time the challenge of automatically summarizing spoken dialogues in a variety of genres and without any restriction on domain. To achieve the goal of spoken dialogue summarization, we...
This paper addresses the question of how to increase local coherence in summaries of multiparty conversations. Due to the interactive nature of dialogues, local regions of coherence often stretch across different speakers, as for instance in question-answer pairs. We present an approach to automatically detect those regions of local coherence and e...
Automatic summarization of open domain spoken dialogues is a new research area. This paper introduces the task, the challenges involved, and presents an approach to obtain automatic extract summaries for multi-party dialogues of four different genres, without any restriction on domain. We address the following issues which are intrinsic to spoken d...
Oral communication is transient, but many important decisions, social contracts and fact findings are first carried out in an oral setup, documented in written form and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. The paper summarizes part of the pro...
ing; 6.2 Written Text Summarization; 6.2.1 Systems Using Term Statistics (and Heuristics); 6.2.2 A System Using Text Understanding; 6.2.3 An Abstracting Evaluation Study; 6.3 Spoken Language Summa...
ing by Selecting Relevant Passages. MSc Dissertation in Cognitive Science and Natural Language, submitted by Klaus Zechner. Supervisors: Steve Finch (HCRC, Edinburgh) and Richard Shillcock (CCS, Edinburgh). August 12, 1995. To My Parents. Contents: 0 Abstract; 1 Text Abstracting as a Task within Information Retrieval; 1.1 Some Key Concepts in IR ...
This paper describes the implementation of an interface which allows runtime queries from a PATR-II grammar to a precompiled DATR lexicon during the parse of a sentence. The lexical information requested by the PATR system is then unified to the PATR parse tree. Alternatively, there is the option to create all the possible dictionary entries from a...
The lexicalist model of human sentence processing (MacDonald et al. 1994) provides an account for the interaction of lexical frequency effects with contextual information in the resolution of syntactic ambiguities. In this paper, we present an implementation of a connectionist network which evaluates the predictions of the lexicalist model for NP/S...
This paper addresses the issues that arise when having to devise meaningful annotation and evaluation schemes for summarization of spontaneous speech. It suggests a novel word-based annotation and evaluation scheme for intrinsic evaluations. A corpus of 24 spontaneous dialogues is annotated using this scheme and inter-coder agreement results as wel...
This paper describes a 3-level manual discourse coding scheme that we have devised for manual tagging of the CallHome Spanish (CHS) and CallFriend Spanish (CFS) databases used in the CLARITY project. The goal of CLARITY is to explore the use of discourse structure in understanding conversational speech. The project combines empirical methods for di...
In this paper, we present a chunk based partial parsing system for spontaneous, conversational speech in unrestricted domains. We show that the chunk parses produced by this parsing system can be usefully applied to the task of reranking Nbest lists from a speech recognizer, using a combination of chunk-based n-gram model scores and chunk coverage...
Parsing spontaneous speech has so far mainly been limited to narrow domain applications (e.g., scheduling of meetings, travel planning). In this work, a chunk based parsing approach is used for building a fast, robust, and shallow parsing system for spontaneous, conversational speech in unrestricted domains. The chunk parses produced by this parsin...
This paper presents a system which automatically generates shallow semantic frame structures for conversational speech in unrestricted domains. We argue that such shallow semantic representations can indeed be generated with a minimum amount of linguistic knowledge engineering and without having to explicitly construct a semantic knowledge base. Th...
We describe and experimentally evaluate an efficient method for automatically determining small clause boundaries in spontaneous speech. Our method applies an artificial neural network to information about part of speech and trigger words. We find that with a limited amount of data (less than 2500 words for the training set), a small sliding contex...
The goal of the CLARITY project is to explore the use of discourse structure in the understanding of conversational speech. Within project CLARITY we aim to develop automatic classifiers for three levels of discourse structure in Spanish telephone conversations: speech acts, dialogue games, and discourse segments. This paper presents our first resu...