Article

Trends in assessment scales and criterion-referenced language assessment

Abstract

Two current developments reflecting a common concern in second/foreign language assessment are: (1) scales for describing language proficiency/ability/performance; and (2) criterion-referenced performance assessments. Both developments are motivated by a perceived need to achieve communicatively transparent test results anchored in observable behaviors. Each of these developments is in one way or another an attempt to recognize the complexity of language in use, the complexity of assessing language ability, and the difficulty of interpreting the potential interactions of scale, task, trait, text, and ability. They reflect a current appetite for language assessment anchored in the world of functions and events, but they must also address the degree to which the world of functions and events contains variability that is not skill-specific and discretely hierarchical. As examples of current tests that attempt to use performance criteria, the chapter reviews the Canadian Language Benchmarks, the Common European Framework, and the Assessment of Language Performance projects.

... It specifies the essential characteristics of a level of performance to be observed by assessors when assessing language skills. Additionally, it provides detailed descriptors for each level, based on observations of how well learners use the language skills (Hudson, 2005). Using this type of assessment, learners are evaluated in relation to explicitly stated criteria. ...
... Rubrics or checklists, which are subdomains of criterion-referenced assessment, consist of rules that set out the principles to be applied when scoring performance assessments. They are an effective way of assessing tasks that allow room for personal, open-ended responses rather than selected-response tasks (Hudson, 2005). ...
... Its main aim is to construct an extensive, transparent, and consistent reference framework for the learning, teaching, and evaluation of a variety of languages across Europe. In this way, the different institutions entitled to certify learners' proficiency in different languages can work in harmony, recognizing learners' language capabilities according to the specified CEFR benchmark criteria (Hudson, 2005). ...
Chapter
Alternative assessment procedures have been researched and implemented widely in both mainstream education and second/foreign language (L2) education for decades. In response to the shortcomings of more traditional assessment procedures, L2 professionals have included alternatives in their assessment processes. More specifically, a wide variety of alternative procedures such as portfolios and projects have been employed by English language teaching (ELT) professionals. Based on this framework, the current chapter initially defines and conceptualizes alternative assessment. It then underscores the complementary nature of standardized and alternative assessment and discusses the validity and reliability issues. Following this, the chapter introduces alternative assessment types and discusses them in relation to language learning and teaching. Upon underlining the benefits of and challenges to the use of alternative assessment particularly in the process of English language learning and teaching, the chapter finally provides some implications and future directions.
... In contrast to construct-based tests, task-based (performance) tests use language data produced by learners in real-life tasks in assessing their language use ability (McNamara, 1996). Various types of language performance tests have been used to evaluate listening, speaking, and writing abilities by providing a specific situation in which to communicate with other people (e.g., Brown, Hudson, Norris, & Bonk, 2002;Colpin & Gysen, 2006;Ellis, 2003;Hudson, 2005). However, traditional reading tests sometimes discount communicative purposes, resulting in a less authentic test format that focuses on grammar, word usage, and translation. ...
... Regarding the scoring of task performance, Brown et al. (2002) developed two types of Likert-type rating criteria: task-dependent (T-D) and task-independent (T-I) rating scales. According to Hudson (2005), the T-D rating scale reflects the real-world criterion elements necessary to complete a given task; therefore, it has separate rating scales designed specifically for each task. Whereas the T-D rating scale can describe test-takers' language use ability, the T-I rating scale is designed to reflect raters' common "evaluation as to how the examinee performed across all of the tasks during the course of the test" (Hudson, 2005, p. 220). ...
... We developed T-D and T-I rating scales for the assessment of the test-takers' reading performances (see Appendices C and D). The rating scales included five performance levels (1: Inadequate, 2: Semi-Adequate, 3: Able, 4: Semi-Adept, and 5: Adept), based on Brown et al. (2002), Ellis (2003), and Hudson (2005). Because Brown et al. (2002) designed a prototype framework for performance assessment using real-world tasks, we followed their method to develop rating scales for reading performances. ...
Article
Full-text available
The present study describes the first step in the development and validation of a task-based reading performance test. We designed 6 information transfer task test items in which test-takers were required to transfer what they comprehended from passages (e.g., reading a travel schedule and communicating it to circle members via email). To validate extrapolations on the construct validity of our task-based reading performance test, this study examined the reliability of the test scores by performing a generalizability theory study and a qualitative analysis of rating approaches. In particular, we considered 3 factors (task characteristics, number of raters, and type of rating scale) that affect the reliability of observed scores to obtain an appropriate rating scale and procedure. Over 3 weeks of English classes, 122 Japanese university students completed the 6 different reading tasks. Their reading task outcomes were scored by 6 raters using either a task-dependent or task-independent rating scale. A generalizability study suggested that the 2 types of rating scale could be used alternatively, but qualitative analysis revealed that the 2 rating procedures differed in scoring of local errors associated with detailed information, appropriate reorganization of passage contents, and appropriateness of sociolinguistic elements. Moreover, a decision study demonstrated that the reliability of the observed scores was strongly affected by the number of tasks. To obtain a strictly high-reliability coefficient (.80), 7 tasks by 3-92-raters are desirable using the task-independent rating scale, while 9 tasks by 3 raters are necessary using the task-dependent rating scale. This study suggested the applicability of task-based reading performance tests and points to be noted for the test implementation from the viewpoints of test material development, scoring procedures, and possible washback effects on teaching and learning of English as a foreign language.
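The decision-study logic summarized above, in which score reliability rises with the number of tasks and raters, can be sketched numerically. The sketch below uses a standard formula for the relative generalizability coefficient in a fully crossed person x task x rater design; the variance components are invented placeholders, not the study's estimates.

# Illustrative D-study for a person x task x rater (p x t x r) design.
# Variance components below are made-up placeholders, NOT the study's estimates.

def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative generalizability coefficient for a fully crossed p x t x r design."""
    error = var_pt / n_tasks + var_pr / n_raters + var_ptr_e / (n_tasks * n_raters)
    return var_p / (var_p + error)

# Hypothetical variance components (person, person x task, person x rater, residual)
components = dict(var_p=0.50, var_pt=0.40, var_pr=0.05, var_ptr_e=0.60)

# Which task/rater combinations reach a coefficient of .80?
for n_tasks in range(1, 10):
    for n_raters in (1, 2, 3):
        g = g_coefficient(**components, n_tasks=n_tasks, n_raters=n_raters)
        if g >= 0.80:
            print(f"{n_tasks} tasks x {n_raters} raters -> G = {g:.2f}")

With placeholder components like these, adding tasks shrinks the dominant error terms fastest, which is consistent with the study's observation that reliability was driven mainly by the number of tasks.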
... Empirical scale validity is not a common topic in LTA (Fulcher et al., 2011;Hudson, 2005). It is not an easy-to-grasp concept, especially without a concrete testing context (Wisniewski, 2014). ...
... Second, from an empirical perspective, linguistic phenomena that serve either to calculate proficiency measures for automatic analyses or as a foundation for rating decisions must be reliably observable in learner language, forming clusters clearly distinguishable from one another. However, rating scales are mostly calibrated without analyses of learner language (Hudson, 2005; North, 1994). Rating scales that stem from performance analyses mostly focus on specific real-life tasks and thus have a restricted scope (Hudson, 2005), while there is a lack of empirically based scales that aim at capturing more general, at least partially context-independent aspects of proficiency (e.g., grammatical accuracy). Again, the combination of corpus linguistic approaches, LTA expertise, and possibly NLP could help to analyze and modify existing rating scales or to develop new ones from scratch. ...
Article
The Common European Framework of Reference (CEFR, CoE, 2001) is the most widespread reference tool for linking language tests, curricula, and national educational standards to levels of foreign language proficiency in Europe. In spite of this, little is known about how the CEFR levels (A1-C2) relate to empirical learner language(s). This paper sums up recent trends to meet the need for empirical CEFR level research, in which learner corpus-based analyses play an increasing role. A first focus of the article is on studies that aim at illustrating CEFR levels by analyzing rated learner texts ('criterial features'). Furthermore, research that tries to disentangle the empirical validity of the CEFR scales by operationalizing its descriptors is presented. Before concluding with an outline of the most urgent research needs, the potentials and boundaries of inter-disciplinary work between the fields of language testing & assessment, second language acquisition, and learner corpus research are discussed.
... Furthermore, by focusing on specific, concrete, observable behavior as a means of defining the dimensions to be judged and anchoring the evaluative continuum, BARS are considered to reduce construct-irrelevant variance in performance appraisal ratings (Smith & Kendall, 1963). As reported in related studies, the use of BARS has been hailed as a promising way to improve performance evaluations (Campbell et al., 1970; Dunnette, 1966; Hudson, 2005; Jacobs et al., 1980; Schwab et al., 1975). ...
... A disadvantage of BARS indicated in the related literature (Borman, 1986; Hudson, 2005) is that the specific behavior descriptions serving as anchors may sometimes not match those of the assessees. To address this issue, Borman (1986) introduced the Behavioral Summary Scale (BSS) as an alternative to the BARS scale. ...
Preprint
Full-text available
Background: The present study proposes a type of Behaviorally Anchored Rating Scale (BARS) for the assessment of writing skills in foreign language education. The rating scales commonly used in language education are numeric. However, numbers can only serve measurement and counting purposes, which are objective procedures, whereas assessment necessarily involves subjective judgments (McNamara, 1996: 117). In language assessment, moreover, numbers often express an arbitrary result of language performance, promoting students' classification and diverting their interest from learning to rating. The BARS is a scale that combines numeric, verbal, and descriptive evaluation scales and can provide meaningful information that facilitates students' improvement and leads the teacher to the proper choices for their support (Schwab et al., 1975; Aiken, 2005). The BARS performance descriptions serve as anchors and permit comparison with assessees' performance in order to find the correspondence. BARS are considered to reduce construct-irrelevant variance in performance appraisal ratings (Smith & Kendall, 1963). Materials and Methods: On this assumption, experimental research was carried out at Aristotle University of Thessaloniki to investigate the effectiveness of BARS over traditional numeric scales in writing skills assessment. The 16 participants in the research were divided into two groups, the experimental group (EG) and the control group (CG). The students were required to complete 10 written production assignments as part of a 50-hour B1 CEFR level Italian language course. The students of the first group were assessed with a BARS scale and those of the second with a traditional numeric scale, by two different raters. At the end of the course, both students and raters answered a short questionnaire about the efficiency of the scale they used. A descriptive analysis of the results was conducted with SPSS 24, calculating the central tendency and the dispersion of the performances in each assignment to determine progress and performance level. The questionnaire results were also analysed with the same descriptive methods. Results: The results showed that BARS was highly effective in improving students' overall performance and the qualitative attributes of their writing. BARS' effectiveness was also reported by the students and raters who participated in the follow-up field study. In contrast, the numeric scale did not appear effective in promoting students' improvement. Conclusion: BARS seems to be effective in helping students to improve their writing skills in a short time, while traditional numeric scales fail to do this.
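As a rough illustration of how behavioral anchors differ from bare numbers, a BARS level can be represented as a numeric level paired with a description of observable performance. The level labels and anchor wording below are invented for illustration and are not the scale used in the study.

# Minimal sketch of a Behaviorally Anchored Rating Scale (BARS) for writing.
# Level labels and anchor descriptions are invented placeholders, not the study's scale.

BARS_WRITING = {
    5: "Develops the topic fully; organisation and register consistently appropriate.",
    4: "Develops the topic well; occasional lapses in organisation or register.",
    3: "Covers the task; ideas are connected but development is uneven.",
    2: "Partially covers the task; frequent breakdowns in organisation.",
    1: "Does not address the task; little evidence of organisation.",
}

def rate(observed_behaviour: str, judged_level: int) -> dict:
    """Pair a rater's judged level with the anchor it was matched against."""
    return {"level": judged_level,
            "anchor": BARS_WRITING[judged_level],
            "evidence": observed_behaviour}

print(rate("Clear paragraphing, minor register slips", 4))

The point of the pairing is that the reported result carries the behavioral description along with the number, rather than the number alone.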
... Within the cycle of test development and validation, rubrics are reporting mechanisms that show students how their work is assessed and what skills they need to achieve specific success levels (Ferris & Hedgcock, 2014; Hamp-Lyons, 2014). Describing language learning objectives and accomplishments in qualitative terms, evaluation criteria not only serve the purpose of transparency in assessment (Crusan, 2010, 2015; Hudson, 2005) but also help students become autonomous and responsible for their own learning by means of self-monitoring (Jonsson, 2014; Panadero & Jonsson, 2013). ...
... This concern was accompanied by their approach to the current integrated writing rubric, which did not seem to facilitate their self-monitoring. These findings suggest construct-oriented rubrics need to be tailored for student use to facilitate their interpretation of test results (Alderson & Hamp-Lyons, 1996) and promote transparency in assessment (Crusan, 2015;Hudson, 2005). ...
Article
Full-text available
Despite the breadth of integrated writing assessment research, few studies have examined student perceptions of classroom-based integrated writing tasks or the instructional value of analytic rubrics. Adopting a case study methodology, this exploratory research investigated L2 learners’ conceptualizations of integrated writing assessment and their use of an analytic rubric for self-assessment in an EAP writing course. Data sources included integrated writing samples that were evaluated by the students and their instructor, a writing self-efficacy questionnaire, individual retrospective interviews, and course materials such as syllabi and task instructions. Qualitative analysis revealed themes related to three aspects of classroom-based integrated writing assessment: task requirements, task conditions, and instructor feedback. The themes were discussed in terms of students’ test taking strategies and the use of available support systems in EAP contexts. In addition, findings indicated an overlap between students’ self-assessment and instructor evaluation of their integrated essays, suggesting that students could use the evaluation criteria effectively. Implications for teaching and assessing integrated writing in EAP contexts are discussed.
... most widely adopted model of language ability in language testing (Hudson, 2005). They take into account notions of language competence, strategic competence, sociocultural competence, textual competence, and so on. ...
... The ALP focused on how real-world tasks can function to reveal an examinee's language ability in use for pedagogical goals (Hudson, 2005). ...
... Considering how differences between raters' behavior can be rather extreme when different traits are rated separately, when all traits are put into a single scale (as here), it would be premature to assume the raters are all using that scale in the same way (Hudson 2005, Schaefer 2008). (Footnote 11: It has been shown that parallel structures such as this simplify the task for raters, who latch on to them and heavily weight the key words therein during the scoring process (Hudson 2005).) ...
Thesis
Full-text available
While a substantial body of research has accumulated regarding how intonation is acquired in a second language (L2), the topic has historically received relatively little attention from mainstream models of L2 phonology. As such, a unified theoretical framework suited to address unique acquisitional challenges specific to this domain of L2 knowledge (such as form-function mapping) has been lacking. The theoretical component of the dissertation makes progress on this front by taking up the issue of crosslinguistic transfer in L2 intonation. Using Mennen's (2015) L2 Intonation Learning theory as a point of departure, the available empirical studies are synthesized into a typology of the different possible ways two languages' intonation systems can mismatch as well as the concomitant implications for transfer. Next, the methodological component of the dissertation presents a framework for overcoming challenges in the analysis of L2 learners' intonation production due to the interlanguage mixing of their native and L2 systems. The proposed method involves first creating a stylization of the learner's intonation contour and then running queries to extract phonologically-relevant features of interest for a particular research question. A novel approach to stylization is also introduced that not only allows for transitions between adjacent pitch targets to have a nonlinear shape but also explicitly parametrizes and stores this nonlinearity for analysis. Finally, these two strands are integrated in a third, empirical component to the dissertation. Three kinds of intonation transfer, representing nodes from different branches of the typology, are examined in Japanese learners of English as a Foreign Language (EFL). For each kind of transfer, fourteen sentences were selected from a large L2 speech corpus (English Speech Database Read by Japanese Students), and productions of each sentence by approximately 20-30 learners were analyzed using the proposed method. Results suggest that the three examined kinds of transfer are stratified into a hierarchy of relative frequency, with some phenomena occurring much more pervasively than others. Together as a whole, the present dissertation lays the groundwork for future research on L2 intonation by not only generating empirical predictions to be tested but also providing the analytical tools for doing so. For the full text of the dissertation, see: http://dx.doi.org/10.5967/K8JW8BSC For the linguistic annotations that the analyses are based upon, see: http://dx.doi.org/10.5967/K86Q1V51 For the R code used to conduct these analyses, see: https://github.com/usagi5886/intonation
... Instead, language includes contextual aspects of communication. For instance, Hudson (2005) states that "[l]anguage takes place in a social context as a social act, and this frequently needs to be recognized in language assessment" (p. 205). ...
Article
Full-text available
Making EAP course outcomes congruent with post-secondary demands requires a needs analysis, in which a target situation analysis is imperative (Bocanegra-Valle, 2016; Hyland, 2016; Cabinda, 2013; Rosenfeld, Leung, & Oltman, 2001; Upton, 2012). This article details the theoretical considerations for a needs analysis, and reports the quantitative findings of a target situation analysis completed for a pre-sessional EGAP program at a Canadian college. Fifty-one professors from the college and a university completed questionnaires ranking academic tasks necessary for post-secondary success in all four language skill areas (reading, writing, listening, and speaking). 25 of the 43 language tasks were identified as ‘approaching very important’ and ‘very important’ to academic success at the tertiary level in Canada. The results indicated that major curricular changes were warranted, especially at the two most advanced levels, and examples are explicated.
... Researchers have argued that instructors, as the active users of tests and evaluation scales, can play a key role in the process of assessment development and validation in terms of contextualizing the evaluation criteria (Crusan & Matsuda, 2018). In addition, they provide important insights into isolating key features and refining rubric descriptors that reflect their classroom teaching practices (Hudson, 2005; North, 2000). In terms of pedagogy, instructor involvement in test design and evaluation can inform the next steps in writing instruction (e.g., development of learning materials, providing constructive feedback), thereby supporting student progression. ...
Article
Due to its authenticity as an academic writing task, integrated writing assessment has become widely used for assessing the writing ability of English for academic purposes (EAP) students (Plakans & Gebril, 2017). However, apart from validation studies focusing on a common set of standardized test rubrics (e.g., Chan, Inoue, & Taylor, 2015), little research has explored the construct of integration or how to assess it effectively in EAP contexts. To provide second language (L2) writing researchers and practitioners with an empirically based model of source integration, this study explored EAP instructors' orientations when assessing integrated essays and investigated the relationship between instructor ratings and textual measures of source use. Six experienced EAP instructors first evaluated two sample argumentative essays written by undergraduate English L2 students and provided comments through a stimulated recall interview. Next, they evaluated 48 additional argumentative essays using an analytic rubric that included dimensions of source use and provided comments for each essay. Finally, the essays were analyzed for linguistic and rhetorical features associated with source integration. Triangulation of data sources revealed that the instructors oriented to three aspects of source integration: using source information to support personal claims, incorporating source‐text language in students' own words, and representing source content accurately. Their ratings were correlated with text‐based measures of students' source use. Implications for teaching and assessing integrated writing in EAP settings are discussed.
... An important challenge associated with the assessment of integrated writing tasks is local adoption of language proficiency scales originally created by applied linguists and language testing professionals (Chan et al., 2015;Janssen et al., 2015). Although these rubrics can isolate the key features of integrated writing ability, they may not serve the needs of the local users (Hudson, 2005;North, 2000). For example, writing scales which are usually anchored to specific language tests, such as TOEFL iBT, provide an accurate and valid description of integrated writing ability and distinguish different levels of proficiency. ...
Article
Although researchers have argued for a mixed-method approach to rubric design and validation, such research is sparse in the area of L2 integrated writing. This article reports on the validation of an analytic rubric for assessing a classroom-based integrated writing test. Argumentative integrated essays (N = 48) written by EAP students at an English-medium Canadian university were rated by instructors (N = 10) with prior EAP teaching experience. Employing a mixed methods design, the quality of the rubric was established through many facet Rasch measurement and perceptions from the instructors elicited during semi-structured interviews. To further explore the rubric’s ability to differentiate among students, essays from three performance levels (low, average, high) were compared in terms of fluency, syntactic and lexical complexity, cohesion, and lexical diversity measures. Results have suggested the rubric can capture variation in student performance. Implications are discussed in terms of validation of assessment rubrics in localized assessment contexts.
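Two of the text measures mentioned above can be illustrated with a minimal sketch: fluency approximated as token count and lexical diversity as a simple type-token ratio. Real studies typically use more robust indices (e.g., MTLD or vocd-D); the function names and sample sentence below are only illustrative.

# Rough sketch of two text measures: fluency as token count, lexical diversity as
# a type-token ratio. For illustration only; real analyses use more robust indices.
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-zA-Z']+", text.lower())

def fluency(text: str) -> int:
    """Token count as a crude proxy for fluency in timed writing."""
    return len(tokenize(text))

def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique word types divided by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

essay = "The sources agree that the policy failed, but the second source blames funding."
print(fluency(essay), round(type_token_ratio(essay), 2))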
... The inclusion of such tasks is advantageous for several reasons. It addresses the negative effects of using traditional testing and can potentially induce positive washback on curriculum design and the inclusion of communicatively oriented objectives (Hudson, 2005). It also ensures the validity of the assessment process, since construct validity is an essential property of performance assessment; it is a key element in this approach to find tasks that can prepare language learners for similar tasks in the real world. ...
Article
Full-text available
This article explores how the use of information communication technology (ICT) can assist English language instructors transform traditional assessment to better inform their teaching practice and to gain valuable insight into the actual academic progress of their English learners in a more valid and accurate fashion. Many recent publications have shown the potential ways and benefits of integrating ICT into the English teaching practice in an attempt to personalize learning and transform the learning-teaching experience for both students and teachers. Nevertheless, not as many have been devoted to discussing practical alternative ways of assessment of students’ performance, which is a much-needed missing link in the process of truly transforming the 21st century digital age learning experience. A true digital age learning environment moves away from focusing on preparing language learners for standardized language tests to creating authentic assessments reflecting the learning experiences carried out inside and outside the classroom. This article aims to transform the way English language teachers view and design alternative assessments featuring ICT tools to help students demonstrate their success in English learning in varied and multiple ways.
... The inclusion of such tasks is advantageous for several reasons. It addresses the negative effects of using traditional testing and can potentially induce 'positive washback effects' on the process of curriculum design and the inclusion of communicatively oriented objectives (Hudson, 2005). It also ensures the validity of the assessment process since construct validity is an essential property of performance assessment; it is a key element in this approach to find tasks that can prepare language learners for similar tasks in the real world. ...
Article
Full-text available
This article explores how information and communication technology (ICT) can help English language teachers radically transform the traditional assessment process, with the aim of developing their teaching practices and helping them gain deeper insight into the actual academic progress of their English language learners in a better and more accurate way. Many recent studies have shown the expected methods and benefits of integrating ICT into English language teaching in an attempt to align learning with learners' needs and change the traditional way of learning and teaching for both students and teachers. Nevertheless, practical alternative ways of assessing students' performance have not been discussed, despite the importance of this aspect for the success of transforming the learning experience in the digital age of the twenty-first century. A true digital-age learning environment does not aim merely to prepare learners to pass standardized language tests, but seeks to create authentic assessment methods that genuinely reflect learning experiences inside and outside the classroom. This paper aims to change the way English language teachers think about and design alternative assessment tools, supported by ICT tools, in order to help students display their language skills and demonstrate their actual achievement in English in multiple and varied ways.
Article
Full-text available
This study examines whether self-efficacy beliefs, motivation, and learning strategies correlate with students’ academic performance in English in higher education in Indonesia. One hundred and twenty-five undergraduate students of the English Literature Study Program of the Faculty of Languages and Literature, Universitas Negeri Makassar, and students of the graduate program in English education (TEFL) participated in this study; 89 (71.2%) were female and 36 (28.8%) male. Students were administered a questionnaire covering three key topics: self-efficacy beliefs, motivation, and learning strategies. The students’ age, sex, study program, semester, and GPA were recorded at the beginning of the questionnaire. The results of this study show that there was a significant relationship between self-efficacy beliefs and students’ academic performance, between motivation and students’ academic performance, and between learning strategies and students’ academic performance at the State University of Makassar (Universitas Negeri Makassar/UNM).
... Rasch analysis has been prevalently used in L2 performance assessment (Bonk & Ockey, 2003; Eckes, 2005; Lumley, 1998; Lumley & ...) and is an effective method for examining raters' performance (Fulcher, 1996). Many-facet Rasch analysis has been seen as showing 'a great deal of promise in finding and accounting for the relative effects of contextual features that we identify' (Hudson, 2005). ...
Article
Full-text available
The aim of the study was to investigate how raters come to their decisions when judging spoken vocabulary. Segmental rating was introduced to quantify raters’ decision-making process. It is hoped that this simulated study brings fresh insight to future methodological considerations with spoken data. Twenty trainee raters assessed five Chinese students’ monologic texts on vocabulary in this study. Both segmental rating and overall rating were retrieved from the raters. Rasch analysis suggested variation between raters in their judgment of vocabulary, although consistency was found in general. Besides, there was a mismatch between candidates’ vocabulary scores and their lexical statistics. The raters’ decision-making process was generally cumulative.
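Many-facet Rasch analyses of this kind are commonly based on the following model (a standard formulation, not quoted from the article), in which the log-odds of examinee n receiving category k rather than k-1 from rater j on criterion i decompose into examinee ability, criterion difficulty, rater severity, and a category threshold:

\[ \log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k \]

Fitting such a model yields separate estimates of rater severity and scale-category functioning, which is what allows between-rater variation in vocabulary judgments to be quantified alongside overall consistency.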
... Fulcher (2003) characterised this as the 'armchair' approach, in which experts, such as teachers, applied linguists and language testers, used their own intuitive judgement to isolate key features of writing performance and to hypothesise verbal descriptors of performance quality. This approach has been criticised on the grounds that it lacks empirical support and tends to produce decontextualised descriptors (Hudson, 2005; North, 2000), and thus the 1990s saw increasing calls for a more empirically-based approach to rubric development (Shohamy, 1990; Upshur and Turner, 1995; Fulcher, 1996; McNamara, 1996; North, 2000). Such an approach involves analysing samples of actual language performance, shaped by a specific purpose and context, in order to construct (or reconstruct) the essential assessment criteria and to describe meaningful levels of performance quality. ...
Article
The integrated assessment of language skills, particularly reading-into-writing, is experiencing a renaissance. The use of rating rubrics, with verbal descriptors that describe the quality of L2 writing performance, in large-scale assessment is well established. However, less attention has been directed towards the development of reading-into-writing rubrics. The task of identifying and evaluating the contribution of reading ability to the writing process and product so that it can be reflected in a set of rating criteria is not straightforward. This paper reports on a recent project to define the construct of reading-into-writing ability for designing a suite of integrated tasks at four proficiency levels, ranging from CEFR A2 to C1. The authors discuss how the processes of theoretical construct definition, together with empirical analyses of test taker performance, were used to underpin the development of rating rubrics for the reading-into-writing tests. Methodologies utilised in the project included questionnaire, expert panel judgement, group interview, automated textual analysis and analysis of rater reliability. Based on the results of three pilot studies, the effectiveness of the rating rubrics is discussed. The findings can inform decisions about how best to account for both the reading and writing dimensions of test taker performance in the rubrics descriptors.
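One simple way the rater-reliability strand of such a project can be examined is through agreement and correlation statistics between pairs of raters; the band scores below are invented placeholders, not data from the project.

# Minimal sketch of pairwise rater-reliability checks during rubric piloting.
# Score vectors are invented placeholders. Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

rater_a = [3, 4, 2, 5, 3, 4, 2, 3]
rater_b = [3, 4, 3, 5, 3, 3, 2, 4]

exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
adjacent_agreement = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"Exact agreement:    {exact_agreement:.2f}")
print(f"Adjacent agreement: {adjacent_agreement:.2f}")
print(f"Pearson r:          {correlation(rater_a, rater_b):.2f}")

In operational projects such checks are usually complemented by multi-rater designs (e.g., many-facet Rasch analysis), but pairwise statistics like these are often the first screening step.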
... The inclusion of both a monologue and a group based task was also based on the fact that even though group speaking tests have become prevalent (Cheng, Rogers, & Hu, 2004), some recent research (Bahrani, 2011; Martin-Monje, 2012) has illustrated the possibility of getting positive results without any synchronous interaction in computer-based pilot studies. Furthermore, pair and group tests may not always benefit students' performance (Saito & Miriam, 2004), as aspects like personality (Tsou, 2005), anxiety (Mohammadi, Biria, Koosha, & Shahsavari, 2013), competitiveness, discourse co-construction (Zhang, 2008; Sabet, Tahriri & Pasand, 2013), motivation, learning styles (Tuan, 2011), scales (Hudson, 2005), sex (Azkarai & Mayo, 2012), channel, tester, raters and many others have a powerful effect on the final assessment. Besides, interaction may not be the only way to trigger the candidate's performance and, consequently, both types of tasks seem to be necessary. ...
Article
Full-text available
One of the most significant aspects of the new Spanish educational reform is the Baccalaureate General Test, which is intended to replace the former University Entrance Examination. The new test will include an oral part, which needs to be created, based on current research in the field (Bueno-Alastuey & Luque, 2010), and tested with students from different regions in Spain to confirm its validity. This paper describes the preparation and first results of a pilot study using some proposed tasks. The speaking tasks were based on the ones currently used in the Cambridge Preliminary English Test but conveniently adapted to the Spanish context as suggested by some studies (Amengual-Pizarro & Méndez García, 2010). This paper shows the perceived strengths and weaknesses of those tasks and the test based on current testing literature on construct definition (Bachman & Palmer, 1996) and validation (Weir, 2005; Fulcher, 2010; Ekbatani, 2011). Results showed that the test corresponds better to classroom practice and favors both washback and language development at a lower cost.
Article
The purpose of this paper is to investigate specific parameters that affect the evaluation of written tests of junior and senior students, aged 16 to 18, attending high schools in Greece. To achieve this, we analyzed textual characteristics and scoring of 265 juniors and seniors, graded by 15 different raters. To examine the contribution of linguistic parameters to the assessment, we developed an automated tool to record and evaluate students' vocabulary. The results revealed that the extensive use of adjectives and the utilization of both impersonal and passive syntax, as well as adverbs to a lesser extent, contribute the most to positive grading in language tests. Furthermore, we identified a correlation between language and the other criteria of the evaluation rubric, namely content and organization. Lastly, we observed that the gender of raters affects the assessment of each evaluation criterion. This study offers a dual contribution: firstly, it explores the parameters influencing language assessment in students' written production at school, and secondly, it proposes the initial step toward designing an automated language assessment tool for the Greek language within the aforementioned context.
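The kind of automated vocabulary profiling described above can be sketched with an off-the-shelf NLP pipeline. The study built its own tool for Greek; the spaCy model name below ("el_core_news_sm") and the feature set are assumptions for illustration, and passive detection via morphological features depends on what the chosen model actually provides.

# Sketch of automated profiling of adjectives, adverbs, and passive verbs.
# Assumes the spaCy Greek pipeline "el_core_news_sm" is installed; any UD-trained
# model would work. Not the tool developed in the study.
import spacy
from collections import Counter

nlp = spacy.load("el_core_news_sm")

def profile(text: str) -> dict:
    doc = nlp(text)
    pos_counts = Counter(tok.pos_ for tok in doc)
    n_tokens = max(sum(1 for t in doc if not t.is_punct and not t.is_space), 1)
    # Voice=Pass is only detected if the model exposes morphological features.
    passives = sum("Pass" in tok.morph.get("Voice") for tok in doc)
    return {
        "adjective_ratio": pos_counts["ADJ"] / n_tokens,
        "adverb_ratio": pos_counts["ADV"] / n_tokens,
        "passive_verbs": passives,
    }

print(profile("Το μικρό βιβλίο διαβάστηκε γρήγορα από τους μαθητές."))

Ratios like these could then be correlated with rubric scores for content and organization, which is essentially the relationship the study reports.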
Chapter
The chapter reports on a validation project for a scale of Business English (BE) writing proficiency. The scale was empirically developed to facilitate the teaching, learning and assessment of BE writing in the Chinese tertiary context. To further examine the validity of the scale, semi-structured interviews were conducted to seek experts’ perceptions of the scale. Specifically, ten experts from the pedagogical domain and business domain were carefully selected, whose opinions were elicited on an individual basis concerning the quality and usefulness of the scale. The experts in general perceived the scale favorably, commenting that most of the descriptors in the scale were appropriately categorized, formulated in a lucid manner and ascribed to proper proficiency levels. The experts in particular endorsed the usefulness of the scale, elaborating on how it could be applied to their respective workplace contexts. In the meantime, areas desiring improvement were also identified, shedding important light on the formulation and refinement of English for Specific Purposes (ESP) proficiency descriptors. Grounded in the ESP domain in general and BE research in particular, the study fills an important gap in literature by validating a theoretically informed, data-driven and statistically calibrated BE writing proficiency scale for Chinese EFL learners. Although this study focuses only on the BE writing skill, its findings have significant implications for scale development and validation in other discipline- or occupation-specific domains that feature the interaction between language and content. Keywords: Business English writing; Language proficiency scales; English for specific purposes; Validation
Article
Full-text available
This article addresses different approaches to assessing students' language skills. The increased interest in new forms of assessment stems from the fact that traditional assessment does not provide a full description of students' outcomes, which teachers need in order to monitor learners' progress and plan instruction. A test score mainly shows that a student has succeeded or failed, but it gives the teacher an incomplete picture of student needs and strengths. The concept of criterion-referenced assessment is to assess language as communicative competence. This article gives a clear idea of these assessments and of how useful criterion-referenced assessment is.
Book
This book presents an empirical study to develop and validate a proficiency scale of business English writing in the Chinese tertiary context. Through a mixture of intuitive, quantitative and qualitative methods, the book demonstrates how a pool of descriptors are collectively formulated, statistically calibrated and meticulously validated for the establishment of a proficiency scale of business English writing. The writing scale differs in significant ways from the existing language scales, most of which were constructed in English as L1 or L2 contexts and applied to English for General Purposes (EGP) domains. This book also provides important insights into the construct of business English writing as well as the methods for English for Specific Purposes (ESP) proficiency scale development and validation. It is of particular interest to those who work in the area of ESP teaching and assessment.
Chapter
This chapter reports on how the newly developed BE writing scale was validated by a group of experts from the teaching and professional domains. The experts were asked to examine the scale in terms of its descriptor categorization and level assignment through a questionnaire. Next, one-on-one interviews were conducted to explore in-depth issues such as their overall perceptions of the scale as well as areas that were in need of improvement. The analysis results show that only nine out of the 86 descriptors displayed level change, indicating a satisfactory degree of level consistency. Exploration of the interview data revealed that experts’ perceptions of the descriptor level were influenced by factors such as the examples included in the descriptors. Necessary modifications were subsequently made after carefully weighing the experts’ comments to further enhance the validity of the BE writing proficiency scale.
Chapter
It is well acknowledged that the development of a language scale should have a base in linguistic theory (North, 2000). As this study aims to develop a BE writing proficiency scale, this chapter focuses on the nature of BE writing proficiency, positioned as a branch of ESP. Given the central role that genres play in ESP teaching, learning and assessment, the genre perspective was adopted to elucidate BE writing proficiency in this chapter. Specifically, we adopted Tardy’s (2009) model of Genre Knowledge as the underlying theoretical framework for the development of the BE writing scale. This model consists of four dimensions: Formal knowledge, Process knowledge, Rhetorical knowledge, and Subject-matter knowledge, each of which contributes to the development of expertise in producing disciplinary genres.
Chapter
Literature on language scale development is largely situated within the EGP domain, which is in practice often contrasted with ESP. When developing an ESP scale, we contend that the unique features of ESP should be taken into account in addition to drawing on valuable insights from the EGP domain. This chapter starts with an elaboration of the important issues concerning ESP assessment. The fundamental principles for scale development are then discussed, followed by an analysis of the methods commonly employed for language scale construction. In the final section, the multi-phased research design of the study is presented, outlining the procedures involved in the scale development process.
Article
Task‐based language assessment (TBLA) focuses on measuring examinees' language use rather than an abstract construct of linguistic knowledge. Task‐based assessments elicit relevant performance from test takers using tasks that are recognizable as relevant to educational and professional contexts. This entry provides a basic overview of TBLA, focusing on the intended test interpretations and advantages of using tasks as a basis for assessment. It also discusses current applications of TBLA in and out of classroom contexts as well as theoretical and practical issues in TBLA and suggestions for addressing the challenges.
Article
Full-text available
The global spread of the English language as one of the most far-reaching linguistic phenomena of our time is already an established fact. Evidence of this worldwide phenomenon of language contact, variation and change can be seen through such designations as world Englishes, new Englishes, modern Englishes, West African Englishes, South African Englishes, Indian English, to mention just a few (Ajani, 2007). Nigerian English and Ghanaian English have also become apparent even in recent times. This illustrative paper examines some of the common lexico-semantic features of the distinct varieties of English known as Nigerian and Ghanaian Englishes. These varieties have been traced to the local tastes of the people and the contact of English with their indigenous languages. This paper concludes by making a clarion call for the acceptability and codification of these Englishes, which linguists like Bamgbose (1998), Eka (2000) and Udofot (2003), among others, have earlier demonstrated. Keywords: Englishes, Nigerian English, Ghanaian English, lexico-semantic features, codification
Article
Full-text available
The aim of the present study was to examine the educational potential of the European Language Portfolio in developing self-assessment skills and learner autonomy in an English language classroom in a setting where language teaching is undergoing a shift from centralised prescriptivism to democratisation of teaching and learning processes. Data are drawn from a long-term project of using the European Language Portfolio in Experimental English Classes at the American University of Armenia. The study explored the possibilities of developing learner autonomy and reflective language learning by integrating self-assessment of language competences into English language learning and teaching programmes through the use of the European Language Portfolio. It is our hope that the presented results will motivate language teachers wishing to promote learner autonomy and explore their foreign language pedagogy in their own contexts.
Chapter
Higher Education Admissions Practices - edited by María Elena Oliveri January 2020
Chapter
This chapter focuses on the Writing, Speaking and Classroom Language Assessment modules, the three areas that are assessed by scales and descriptors in the LPATE. The scales and descriptors adopted in each LPATE paper, namely Writing, Speaking and Classroom Language Assessment, are first introduced, followed by a presentation of tasks that were used in different modules. The chapter then shows how these tasks aided the development of the scales assessed in the LPATE, thus helping participants meet the LPR. From a wider perspective, this chapter describes how an enhanced grasp of the Writing, Speaking and Classroom Language Assessment modules may contribute to teacher professional development.
Chapter
In this chapter, high-stakes assessment, educational standards and benchmarks are first discussed. These are subsequently elaborated upon and discussed within their theoretical contexts and, particularly, within the context of language assessment. The ways in which assessment paradigms have changed in recent decades are also discussed to highlight their influences on the benchmarking project.
Chapter
English as an international language (EIL) refers to the use of English as a means by which people from different parts of the world communicate with each other. Geopolitics concerns political power within and across geographic space. Within this broadly defined EIL context, this entry frames the political power of assessment and addresses this power from two dimensions. The first dimension is the geopolitics of international language testing, which deals with the testing of proficiency in English by speakers of other languages whose purpose of taking the test is to pursue academic study in English‐speaking countries, such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS). The second dimension focuses on the testing of proficiency in English by speakers of other languages whose purpose is for schooling in English‐speaking countries, such as the Ontario Secondary School Literacy Test (OSSLT) in Canada and the No Child Left Behind (NCLB) policy in the United States.
Chapter
Although teachers are sometimes portrayed as unreliable raters because of their emotional involvement and proximity to students or test-takers, it can be argued that they have more expertise and experience in rating test-takers’ performances than most test developers. Therefore, it seems only logical to include them in the development of rating scales. This applies to both scenarios in which teachers are only responsible for preparing students for high-stakes exams, and scenarios where teachers are responsible for test preparation as well as the rating of the test performances. Involving teachers in rating scale design can offer test developers access to a wealth of rating experience and thereby increase the validity of the scale. It can also instil an important feeling of ownership in the teachers, which seems indispensable for the promotion of positive attitudes towards high-stakes exams. This chapter will outline the potentials and challenges of involving secondary school teachers in the design of rating instruments for a large-scale national high-stakes exam. Two case studies on teacher involvement in scale development will be presented (writing and speaking). The chapter will compare the two projects, highlighting what was found useful by the involved teachers. It will do so by analyzing teacher statements from retrospective questionnaires (N = 23) about their experience of being involved in one or both of these projects. The chapter will conclude with insights into the importance of teacher involvement in this stage of the test development cycle, and will highlight the usefulness of combining top-down and bottom-up scale development procedures.
Thesis
Full-text available
In higher education, the desire to internationalize has created demands for an internationalized academia to use English increasingly in teaching outside the English native-speaking world. Given this situation, perhaps other criteria for measuring successful communication should be considered than those of the native-speaking minority. With lecturers whose native language is not English increasingly teaching their subjects through English, there is a growing need to develop adequate measures for this purpose and situation, as the current normative standards are no longer tenable. Establishing adequate measures for this purpose and situation is relevant to institutions facing the challenge of providing EMI courses and programs while ensuring credible quality control. In order to determine what criteria might be adequate for assessing spoken professional English in an international context, this study investigates self-assessments of professional language in relation to language ideologies. The study involves English-medium instruction (EMI) in the field of engineering and takes place at a Finnish university. Using a mixed-methods approach, the study employed an explorative strategy that involved a concurrent design. The two methods were used in parallel and the results integrated at the interpretation phase. This approach provides a general picture through micro- and macro-level analyses: the self-perceptions of EMI lecturers (i.e. qualitative) and their students' perceptions of English in lectures (i.e. quantitative). The investigation employs a bottom-up approach and is primarily qualitative. The findings are based on authentic data: video-recorded interviews and lectures, their transcriptions, and a questionnaire. The findings show that EMI lecturers have two basic representations of their English: A) when they compare their English to native-like targets, they find fault with their English, and B) when they think of themselves in their normal work environment, they see their English as working rather well. Certain language ideologies induced type A discourse, including standard language and NS language ideologies, and others induced type B discourse, such as English-as-a-global-language ideologies. The results from the student questionnaire also support interpretation B. Since meaningful testing should reflect the target situation, what my informants say in the type B discourse is relevant to developing assessment criteria. Their views on the Common European Framework of Reference for Languages (CEFR) scales are also extremely useful in pointing the way towards the central elements upon which relevant assessments for professional English in an international environment should be based. The conclusions indicate a comprehensibility goal over native-likeness for assessing spoken professional English in an international context. The study outlines some criteria relevant for assessing spoken English for this purpose and situation.
Article
This article aims to identify the potential uses of learner corpus research in the field of language testing. Learner corpora are digitized collections of structured learner data which in many cases are annotated and ideally are freely accessible. Recently, an increase in the number of learner corpora has led to a diversification of research activities. The article gives an overview of the most important directions of this research with regard to language testing, a focus being the illustration of proficiency levels. It also specifies some central desiderata for learner corpus research.
Chapter
The assessment of language quality in the modern period can be traced directly to the work of George Fisher in the early nineteenth century. The establishment of a scale with benchmark samples and tasks has been replicated through Thorndike (1912) and into the present day. The tension between assessing observable attributes in performance and the underlying constructs that make performance possible is as real today as in the past. The debate impacts upon the way scales and descriptors are produced, and the criteria selected to make judgments about what constitutes a quality performance, whether in speech or writing. The tensions work themselves through the history of practice, and today we find ourselves in a pluralistic philosophical environment in which consensus has largely broken down. We therefore face a challenging environment in which to address the pressing questions of evaluating language quality.
Chapter
A central purpose of any test is to convey information to stakeholders about examinees' performances. Scoring criteria allow test scorers, also called raters, to score a test reliably and consistently with its purpose and uses. Score reports provide a bridge from the test results to the real-world decisions made on the basis of these results. Scoring criteria and score reports must be communicated to both technical and nontechnical audiences in meaningful ways, which often means “translating” jargon and specific testing terms into comprehensible language. This chapter discusses types of scoring scales, approaches to scale development, and considerations for score reporting. In particular, this chapter focuses on how test developers can write scoring criteria and score reports that convey the information to test users in ways that are both accurate and understandable.
Chapter
Task-based assessment (TBA) is defined by Brindley (1994, p. 74) as "the process of evaluating, in relation to a set of explicitly stated criteria, the quality of the communicative performances elicited from learners as part of goal-directed, meaning-focused language use requiring the integration of skills and knowledge."
Article
This study empirically examines the rating criteria used to assess U.S. college students’ CSL (Chinese as a Second Language) oral performance by analyzing teachers’ assessment of these performances at different proficiency levels. The researcher videotaped ten speeches, and three ACTFL-trained raters assessed oral performance in these samples. The researcher then selected three samples (Samples 1, 2, and 3) to represent Novice High, Intermediate High, and Advanced Low levels. The researcher developed 20 rating items through interviewing ten experienced CSL teachers and running an Exploratory Factor Analysis (EFA) on teachers’ assessments of speech samples. After that, 104 CSL teachers used these rating items to assess the aforementioned samples. The EFAs of teachers’ assessments led to three corresponding rating criteria models (Models 1, 2, and 3). Both Models 2 and 3 for Samples 2 and 3, respectively, were five-criterion models, consisting of fluency, conceptual understanding, content richness, communication appropriateness, and communication clarity. Model 1 for Sample 1 was a four-criterion model, in which the items in communication appropriateness and content richness showed high correlations, and therefore were merged into one category; the other three criteria remained the same. Comparisons of the three models demonstrated that the criteria were constant. The ANOVAs showed that the proficiency levels of these oral performances differed significantly across all five rating criteria. This study empirically supports CSL teachers’ use of constant rating criteria to assess different levels of oral performance. It also provides Chinese teachers with rating criteria they can use to assess U.S. college students’ CSL oral performance.
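The exploratory factor analysis step can be sketched as follows, assuming the 104 teachers' ratings are arranged as a teachers-by-items matrix; random placeholder data stands in for the real ratings, and the five-factor solution simply mirrors the five-criterion models reported in the study.

# Sketch of an exploratory factor analysis on a (104 teachers x 20 rating items) matrix.
# Placeholder random data, NOT the study's ratings. Requires scikit-learn >= 0.24 for rotation.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(104, 20)).astype(float)  # placeholder Likert ratings

efa = FactorAnalysis(n_components=5, rotation="varimax")
efa.fit(ratings)

loadings = efa.components_.T  # items x factors loading matrix
for item, row in enumerate(loadings, start=1):
    dominant = int(np.argmax(np.abs(row))) + 1
    print(f"Item {item:2d} loads most strongly on factor {dominant}")

Grouping items by their dominant loadings is what yields interpretable criteria such as fluency or communication clarity; with real ratings, items loading on the same factor would be read together as one rating criterion.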
Article
In spite of the widespread use of CEFR scales, there is an overwhelming lack of evidence regarding their power to describe empirical learner language (Fulcher 2004; Harsch 2005; Hulstijn 2007). This paper presents results of a study (author 2014) that focused on the empirical robustness (i.e. the power of level descriptions to capture what learners actually do in a language test) of the CEFR vocabulary and fluency scales (A2–B2). Data stem from an Italian and German speaking test (other authors and author 2012). Results show that the empirical robustness was flawed: some scale contents were hardly observable or so evenly distributed that they could not distinguish between learners. Contradictory or weak correlations among scale features and heterogeneous cluster solutions suggest that the scales did not consistently capture typical learner behaviour. Often, learner language could not be objectively described by any level description. Also, it was only partially possible to link scale contents to research-based measures of fluency and vocabulary. Given the importance of CEFR levels in many high-stakes contexts, the results suggest the need for a large empirical validation project.
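The checks mentioned above (correlations among scale features, cluster solutions) can be illustrated with a minimal, entirely hypothetical Python sketch: features named in a fluency descriptor are correlated and learners are clustered on them, so that cluster membership can be compared with assigned CEFR levels. All measures, values, and level assignments below are invented.

    # Illustrative only: invented learner measures for robustness checks.
    import numpy as np
    from scipy.stats import pearsonr
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    n_learners = 60

    # Hypothetical operationalisations of features named in a fluency descriptor.
    speech_rate = rng.normal(2.5, 0.7, n_learners)    # syllables per second
    pause_ratio = rng.normal(0.30, 0.10, n_learners)  # pause time / speaking time
    false_starts = rng.poisson(3, n_learners).astype(float)

    # If a level description is empirically robust, the features it bundles
    # together should correlate; weak or contradictory correlations undermine it.
    r, p = pearsonr(speech_rate, pause_ratio)
    print(f"speech rate vs. pause ratio: r = {r:.2f} (p = {p:.3f})")

    # Cluster learners on the measured features and compare cluster membership
    # with the CEFR levels raters assigned (here, random assignments).
    features = np.column_stack([speech_rate, pause_ratio, false_starts])
    clusters = fcluster(linkage(features, method="ward"), t=3, criterion="maxclust")
    assigned_levels = rng.choice(["A2", "B1", "B2"], n_learners)
    for level in ("A2", "B1", "B2"):
        counts = np.bincount(clusters[assigned_levels == level], minlength=4)[1:]
        print(level, "cluster membership:", counts)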
Article
This article identifies the potential of learner corpus research in the field of language testing. Learner corpora are electronic collections of structured, often also annotated, and ideally freely accessible learner data. In recent years, as the number of such learner corpora has grown, the research landscape has also diversified. The article gives an overview of the most important ways of using learner corpora in language testing, for example to make proficiency level descriptions concrete. In addition, central desiderata of learner corpus research are identified.
Chapter
When assessing scholastic knowledge of students for international comparisons, assessment facilitators must be sure that the different language versions of the test assess the same content across all the language groups in the comparison. Keywords: assessment methods in applied linguistics; assessment; evaluation; multilingualism
Article
It is increasingly recognised that attention should be paid to investigating the needs of a new test, especially in contexts where specific purpose language needs might be identified. This article describes the stages involved in establishing the need for a new assessment of English for professional purposes in China. We first investigated stakeholders’ perceptions of the target language use activities and the necessity of the proposed assessment. We then analysed five existing tests and six language frameworks to evaluate their suitability for the need of the proposed assessment. The resulting proposal is for an advanced-level English assessment capable of providing a diagnostic evaluation of the proficiency of potential employees in areas of relevance to multinationals operating in China. The study has demonstrated the value of following a principled procedure to investigate the necessity for and the needs of a new test at the very beginning of the test development.
Article
Full-text available
Given the unquestionable role of technology in language classes, it may be necessary to use computers in assessing language knowledge. This study aimed to examine how computers may be used to assess the language ability of ESP students. Sixty computer-major university students at Abadan University participated in the study; they took an ESP course over a four-month academic term. To measure these participants' ESP knowledge, two types of tests were used: a final achievement test based on course content, and a computer-assisted test based on the TLU domain. The computer-assisted test was used to examine the validity of the final achievement test. The study also investigated ESP students' perceptions of computer-supported assessment and highlighted some obstacles that may hinder e-supported activities in an Iranian context. Based on the findings, the study points to the possibility of using computer-assisted assessment as an alternative to the present mainstream testing system.
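The validation step described above amounts to checking how closely scores from the two test formats agree. A minimal sketch of such a check, with invented scores, might look like this (a real study would also examine score distributions and test content, not just the correlation):

    # Illustrative only: invented scores for the two test formats.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(2)
    achievement_scores = rng.normal(70, 10, 60)                  # course-based final test
    computer_scores = achievement_scores + rng.normal(0, 8, 60)  # TLU-based computer test

    # If the computer-assisted test and the achievement test tap related
    # abilities, the two score sets should correlate substantially.
    r, p = pearsonr(achievement_scores, computer_scores)
    print(f"Correlation between the two tests: r = {r:.2f} (p = {p:.3f})")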
Book
Full-text available
This book presents the theory and application of Many-Facet Rasch Measurement (MFRM) to judged (rated or rank-ordered) performances and describes the estimation of MFRM measures, with particular attention to missing data.
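For orientation, the core many-facet Rasch model in its commonly cited rating-scale form (the book's own notation may differ) expresses the log-odds that examinee n receives category k rather than k-1 from rater j on task i as an additive combination of facet parameters:

    \[
      \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) \;=\; B_n - D_i - C_j - F_k
    \]

where B_n is the ability of examinee n, D_i the difficulty of task i, C_j the severity of rater j, and F_k the difficulty of the step from category k-1 to category k.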
Article
Full-text available
Tasks are the most visible element in an educational assessment. Their purpose, however, is to provide evidence about targets of inference that cannot be directly seen at all: what examinees know and can do, more broadly conceived than can be observed in the context of any particular set of tasks. This paper concerns issues in assessment design that must be addressed for assessment tasks to serve this purpose effectively and efficiently. The first part of the paper describes a conceptual framework for assessment design, which includes models for tasks. Corresponding models appear for other aspects of an assessment, in the form of a student model, evidence models, an assembly model, a simulator/presentation model, and an interface/environment model. Coherent design requires that these models be coordinated to serve the assessment's purpose. The second part of the paper focuses attention on the task model. It discusses the several roles that task model variables play to achieve the needed coordination in the design phase of an assessment, and to structure task creation and inference in the operational phase.
Article
Full-text available
Advances in cognitive psychology both deepen our understanding of how students gain and use knowledge and broaden the range of performances and situations we want to see to acquire evidence about their developing knowledge. At the same time, advances in technology make it possible to capture more complex performances in assessment settings by including, as examples, simulation, interactivity, and extended responses. The challenge is making sense of the complex data that result. This article concerns an evidence-centered approach to the design and analysis of complex assessments. We present a design framework that incorporates integrated structures for modeling knowledge and skills, designing tasks, and extracting and synthesizing evidence. The ideas are illustrated in the context of a project with the Dental Interactive Simulation Corporation (DISC), assessing problem solving in dental hygiene with computer-based simulations. After reviewing the substantive grounding of this effort, we describe the design rationale, statistical and scoring models, and operational structures for the DISC assessment prototype.
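To make the evidence-centered design vocabulary concrete, the toy Python sketch below represents the student, task, and evidence models as simple data structures; all class names, fields, and example values are hypothetical and are not drawn from the DISC project itself.

    # Illustrative only: a toy rendering of evidence-centered design components.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class StudentModel:
        # Latent variables the assessment is meant to inform, each tracked as a
        # current proficiency estimate (e.g., a probability of mastery).
        variables: Dict[str, float] = field(default_factory=dict)

    @dataclass
    class TaskModel:
        # Template for a family of tasks: the stimulus presented to the examinee
        # and the task model variables that can be varied across tasks.
        stimulus: str = ""
        task_variables: Dict[str, str] = field(default_factory=dict)

    @dataclass
    class EvidenceModel:
        # Rules for extracting observables from a work product and for updating
        # student model variables on the basis of those observables.
        observables: List[str] = field(default_factory=list)
        scoring_rules: Dict[str, str] = field(default_factory=dict)

    # Example: a simulation task whose observables inform two skills.
    task = TaskModel(stimulus="patient case simulation",
                     task_variables={"case_complexity": "moderate"})
    evidence = EvidenceModel(observables=["history_taking", "treatment_plan"],
                             scoring_rules={"history_taking": "rubric_0_to_3"})
    student = StudentModel(variables={"assessment_skill": 0.5, "planning_skill": 0.5})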
Article
Full-text available
The primary problems in measuring speaking ability through an oral interview procedure are not those related to efficiency or reliability, but rather those associated with examining the validity of the interview ratings as measures of ability in speaking and of the uses that are made of such ratings. In order to examine all aspects of validity, the abilities measured must be clearly distinguished from the elicitation procedures, in both the design of the interview and in the interpretation of ratings. Research from applied linguistics and language testing is consistent with the position that language proficiency consists of several distinct but related abilities. Research from language testing also indicates that the methods used to measure language ability have an important effect on test performance. Two frameworks—one of communicative language ability and the other of test method facets—are proposed as a basis for distinguishing abilities from elicitation procedures and for informing a program of empirical research and development. The validity of the ACTFL Oral Proficiency Interview (OPI) as it is currently designed and used cannot be adequately examined, much less demonstrated, because it confounds abilities with elicitation procedures in its design, and it provides only a single rating, which has no basis in either theory or research. As test developer, ACTFL has yet to fully discharge its responsibility for providing sufficient evidence of validity to support uses that are made of OPI ratings.
Article
Full-text available
The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.
Article
Full-text available
During the past several years measurement and instructional specialists have distinguished between norm-referenced and criterion-referenced approaches to measurement. The more traditional norm-referenced measure is used to identify an individual's performance in relation to the performance of others on the same measure. A criterion-referenced test is used to identify an individual's status with respect to an established standard of performance. This discussion examines the implications of these two approaches to measurement, particularly criterion-referenced measurement, with respect to variability, item construction, reliability, validity, item analysis, reporting, and interpretation.
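The distinction can be illustrated with a small, invented numeric example: the same raw score is interpreted once against the performance of a norm group and once against a fixed criterion standard (cut score).

    # Illustrative only: one invented score, two interpretations.
    import numpy as np

    norm_group = np.array([12, 15, 18, 20, 22, 25, 27, 30, 33, 36])  # reference scores
    examinee_score = 26
    cut_score = 24  # criterion standard: minimum score counted as mastery

    # Norm-referenced interpretation: standing relative to other examinees.
    percentile = 100.0 * np.mean(norm_group < examinee_score)
    print(f"Norm-referenced: scores above {percentile:.0f}% of the norm group")

    # Criterion-referenced interpretation: status relative to the fixed standard,
    # regardless of how other examinees performed.
    status = "meets standard" if examinee_score >= cut_score else "does not meet standard"
    print("Criterion-referenced:", status)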
Article
Contents: Acknowledgements; 1. Introduction; 2. Second language performance assessment; 3. Modelling performance: opening Pandora's Box; 4. Designing a performance test: the Occupational English Test; 5. Raters and ratings: introduction to multi-faceted measurement; 6. Concepts and procedures in Rasch measurement; 7. Mapping and reporting abilities and skill levels; 8. Using Rasch analysis in research on second language performance assessment; 9. Data, models and dimensions; References; Index.
Article
This article addresses the challenges the authors faced in developing a task-based language assessment instrument for adult immigrants in Canada that is accountable to the needs of diverse groups of stakeholders, including learners, teachers, and administrators. The test, the Canadian Language Benchmarks Assessment (CLBA), is designed to place adult newcomers in language programs appropriate for their level of proficiency in English and to assess progress in these programs. The authors note that while stakeholders wanted the assessment tasks to be authentic and realistic, many were concerned that authentic tasks are culturally biased. These concerns were associated with stakeholders' theories of language, their conceptions of test bias, and their understanding of the purpose and use of the test. The authors suggest that when the low-stakes CLBA is used for its intended purpose, its results can satisfy the imperative for accountability in the context of large-scale, system-wide assessment.
Article
"Measures which assess student achievement in terms of a criterion standard provide information as to the degree of competence attained by a particular student which is independent of reference to the performance of others." Achievement measures may also convey information about the capability of a student compared with the capability of other students. Achievement tests are used (a) to provide information about the characteristics of an individual's present behavior and (b) to provide information about the conditions or instructional treatments which produce that behavior. Test development has been dominated by the particular requirements of predictive, correlation aptitude test theory." Achievement and criterion measurement has attempted frequently to cast itself in this framework; some additional considerations are required. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
In the past two decades, there has been a major shift in language testing towards the development and use of performance tests. The basis for this shift is the expectation that such tests would assess a more valid construct of what it really means to know a language. The purpose of this chapter is to review the topic of performance testing by focusing on its definitions, theory, development, and research. The chapter will begin with a review of the different definitions of performance testing and provide examples of the types of performance tests that have been developed and used. The chapter will then examine the extent to which performance tests have drawn upon the theoretical discussions of competence and performance. The next section will describe the research that has been carried out on performance tests. The chapter will end with an agenda for development and research on the many unanswered questions concerning performance testing.
Article
For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their “dealing with fluctuations” aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.