Developing Web-Based Language Tests
Wojciech Malec
Wydawnictwo KUL
Lublin 2018
Reviewer
PAUL MEARA
Honorary Research Fellow, University of Oxford
Emeritus Professor, Swansea University
Cover design and typesetting
WOJCIECH MALEC
© Copyright by Wydawnictwo KUL, Lublin 2018
All rights reserved. No part of this publication may be reproduced without prior
permission from the copyright owner.
ISBN 978-83-8061-641-7
Wydawnictwo KUL
ul. Konstantynów 1H, 20-708 Lublin, tel. 81 740 93 40
e-mail: wydawnictwo@kul.lublin.pl
Print: volumina.pl Daniel Krzanowski
ul. Ks. Witolda 7-9, 71-063 Szczecin, tel. 91 812 09 08, e-mail: druk@volumina.pl
Table of Contents
Acknowledgements .............................................................................................. 13
List of abbreviations ............................................................................................ 15
Introduction .......................................................................................................... 17
Language testing and technology ........................................................................ 17
Basic terms and concepts ...................................................................................... 18
Organisation of the book ...................................................................................... 23
PART I Principles of Assessment ......................................................... 25
Chapter 1
Test development ................................................................................................. 27
1. Introduction ..................................................................................................... 27
2. Components of test development ................................................................. 28
2.1. Design ...................................................................................................... 30
2.1.1. Test context specifications ........................................................ 31
2.1.2. Test structure specifications ..................................................... 41
2.1.3. Task specifications ..................................................................... 42
2.2. Production .............................................................................................. 49
2.2.1. Task development ...................................................................... 49
2.2.2. Test assembly ............................................................................. 51
2.2.3. Pre-operational testing.............................................................. 52
2.3. Operational use ...................................................................................... 54
2.3.1. Administration ........................................................................... 55
2.3.2. Scoring ........................................................................................ 57
2.3.3. Decisions and reporting ............................................................ 58
2.4. Evaluation ............................................................................................... 59
2.4.1. Collecting evidence.................................................................... 60
2.4.2. Analysis ....................................................................................... 61
2.4.3. Item banking .............................................................................. 62
3. Summary ........................................................................................................... 63
Chapter 2
Test evaluation ...................................................................................................... 65
1. Introduction ..................................................................................................... 65
2. Practicality ........................................................................................................ 67
3. Authenticity ...................................................................................................... 71
4. Reliability .......................................................................................................... 73
4.1. Classical test theory ................................................................................ 74
4.1.1. Standard error of measurement ............................................... 79
4.2. Generalizability theory .......................................................................... 80
4.2.1. One-facet crossed designs ........................................................ 83
4.2.1.1. Relative and absolute error ....................................... 86
4.2.1.2. Coefficients ................................................................. 87
4.2.1.3. Coefficients and test length ...................................... 89
4.2.1.4. Phi lambda .................................................................. 91
4.2.1.5. Standard error of measurement ............................... 94
4.2.1.6. GT-1 calculator .......................................................... 95
4.3. Decision consistency .............................................................................. 101
4.3.1. Threshold loss agreement ......................................................... 102
4.3.2. Squared-error loss agreement .................................................. 104
4.4. Interpreting reliability estimates .......................................................... 105
5. Validity .............................................................................................................. 106
6. Impact ............................................................................................................... 110
7. Validation ......................................................................................................... 114
7.1. Validating classroom-based language tests ........................................ 117
7.2. Conclusion .............................................................................................. 122
8. Summary ........................................................................................................... 123
Chapter 3
Test items ............................................................................................................... 125
1. Introduction ..................................................................................................... 125
2. Items and tasks ................................................................................................. 126
3. Test formats ...................................................................................................... 129
3.1. Selected-response items ......................................................................... 130
3.1.1. Multiple choice .......................................................................... 130
3.1.2. Binary choice .............................................................................. 137
3.1.3. Multiple response ...................................................................... 140
3.1.4. Multiple-choice cloze ................................................................ 141
3.1.5. Matching ..................................................................................... 143
3.1.6. Other types ................................................................................. 144
3.2. Limited-production items ..................................................................... 145
3.2.1. Gap-filling ................................................................................... 148
3.2.2. Cloze and C-test ......................................................................... 149
3.2.3. Gapped sentences ...................................................................... 154
3.2.4. Transformation .......................................................................... 155
3.2.5. Sentence writing ........................................................................ 156
3.2.6. Error correction ......................................................................... 157
3.2.7. Other types ................................................................................. 159
3.3. Extended-production tasks ................................................................... 160
4. Choosing the item format .............................................................................. 162
5. Summary ........................................................................................................... 165
Chapter 4
Item analysis .......................................................................................................... 167
1. Introduction ..................................................................................................... 167
2. Item facility ....................................................................................................... 169
3. Item discrimination ......................................................................................... 172
3.1. Norm-referenced testing ....................................................................... 173
3.2. Criterion-referenced testing ................................................................. 176
3.3. Item discrimination and test reliability ............................................... 181
4. Distractor evaluation ....................................................................................... 184
4.1. Criterion-referenced testing ................................................................. 194
4.2. Distractor evaluation criteria ................................................................ 197
5. Summary ........................................................................................................... 198
PART II Web-Based Testing ................................................................ 201
Chapter 5
Technology in language testing ......................................................................... 203
1. Introduction ..................................................................................................... 203
2. Computers and the internet in language testing ......................................... 204
2.1. Models of administration ...................................................................... 206
2.2. Web-based testing .................................................................................. 207
2.3. Strengths and limitations ...................................................................... 209
3. WebClass .......................................................................................................... 216
3.1. Administration ....................................................................................... 220
3.2. Communication ..................................................................................... 225
3.3. Materials .................................................................................................. 227
3.4. Assessment .............................................................................................. 231
3.5. Teaching and testing with WebClass................................................... 243
4. Summary ........................................................................................................... 247
Chapter 6
Test design and production on WebClass............................................................... 249
1. Introduction ..................................................................................................... 249
2. Design ............................................................................................................... 251
3. Production ........................................................................................................ 253
3.1. Writing items and tasks one by one..................................................... 259
3.1.1. Test items .................................................................................... 259
3.1.1.1. MC variations ............................................................. 261
3.1.1.2. Other selected-response items ................................. 268
3.1.1.3. Limited-production items ........................................ 271
3.1.2. Extended-production tasks ...................................................... 283
3.2. Converting text into test items ............................................................. 285
3.2.1. Selected-response items ............................................................ 289
3.2.2. Limited-production items ........................................................ 295
3.3. Importing previous items ...................................................................... 299
3.4. Automated test assembly ...................................................................... 302
4. Summary ........................................................................................................... 304
Chapter 7
Test use and evaluation on WebClass .............................................................. 305
1. Introduction ..................................................................................................... 305
2. Test use ............................................................................................................. 306
2.1. Administration ....................................................................................... 307
2.2. Scoring and verification ........................................................................ 314
2.2.1. Automated scoring of test items .............................................. 315
2.2.2. Score verification ....................................................................... 317
2.2.3. Scoring extended responses ..................................................... 323
2.3. Reports and feedback ............................................................................. 327
3. Evaluation ......................................................................................................... 332
3.1. Collecting evidence ................................................................................ 333
3.2. Quantitative analysis .............................................................................. 334
3.3. Item banking ........................................................................................... 345
4. Summary ........................................................................................................... 348
Chapter 8
Administration mode effects ............................................................................. 351
1. Introduction ..................................................................................................... 351
2. Score comparability ......................................................................................... 352
3. The study .......................................................................................................... 355
3.1. Method .................................................................................................... 356
3.1.1. Participants ................................................................................. 356
3.1.2. Materials ..................................................................................... 356
3.1.3. Procedures .................................................................................. 358
3.2. Results and discussion ........................................................................... 359
3.2.1. Measurement characteristics.................................................... 359
3.2.2. Comparability of score-based decisions ................................. 368
3.2.3. Individual differences................................................................ 370
3.2.4. Follow-up study ......................................................................... 372
3.2.5. Qualitative analysis of attitudes ............................................... 373
3.3. Limitations .............................................................................................. 375
4. Summary ........................................................................................................... 376
Conclusions ........................................................................................................... 379
References .............................................................................................................. 385
Appendices ............................................................................................................ 423
Appx 1. Scoring a test item (questionnaire results) ........................................ 423
Appx 2. Big Test 1 (PBT) .................................................................................... 425
Appx 3. Big Test 1 (WBT) .................................................................................. 429
Index ....................................................................................................................... 433
List of abbreviations
ANCOVA analysis of covariance
ANOVA analysis of variance
AUA assessment use argument
BNC British National Corpus
CALL computer-assisted language learning
CAT computer-adaptive testing or computer-adaptive test
CBT computer-based testing or computer-based test
CEFR Common European Framework of Reference for Languages
CI confidence interval
CMS content management system
CR constructed-response (items)
CRT criterion-referenced testing or criterion-referenced test
CSS Cascading Style Sheets
CSV comma-separated values (file format)
CTT Classical Test Theory
EFL English as a Foreign Language
ENL English as a Native Language
ER error correction (item format)
ESL English as a Second Language
FG fill-the-gaps (item format)
HTML Hypertext Markup Language
ID item discrimination
IQR interquartile range
IRT Item Response Theory
IUA interpretation/use argument
KSA knowledge, skills, abilities
LCMS learning content management system
LMS learning management system
MC multiple choice (item format)
MR multiple response (item format)
NRT norm-referenced testing or norm-referenced test
PBT paper-based testing or paper-based test
PCA principal component analysis
PHP Hypertext Preprocessor (scripting language)
RW right/wrong (item format)
SD standard deviation
SEM standard error of measurement
SG score group
SQL Structured Query Language
TC transformation complete (item format)
TLU target language use
TOEFL Test of English as a Foreign Language
TOEIC Test of English for International Communication
TTS text-to-speech (technology)
TW transformation word given (item format)
VLE virtual learning environment
WBLT web-based language testing
WBT web-based testing or web-based test
WF word-formation (item format)
WYSIWYG what you see is what you get (editing system)
Introduction
Language testing and technology
Considering the ubiquity and ever-growing influence of digital technology in
practically all spheres of life, it is only natural that various electronic systems
should also permeate education and educational measurement, including
language testing. Despite the fact that the shift from paper-based testing (PBT) to
computer-based testing (CBT) may not be taking place as rapidly as once
thought (Way & Robin, 2016), it is in all probability as inevitable as the
transition from writing on stone to writing on parchment (cf. Chalhoub-Deville,
1999a, p. xv). Computers are now being used more and more frequently to
design tests, create test items and item pools, assemble items into test forms,
administer these to the test takers, score the responses, deliver reports and
provide feedback, as well as analyse the scores. While all of these activities can be
performed on paper, they are more conveniently conducted on computers and
other electronic devices.
The transition from PBT to CBT is further facilitated by the widespread
popularity of web-based testing (WBT) systems. The role of the internet in this
respect is hard to overestimate for at least two reasons. First, thanks to the
simplicity of most present-day online authoring tools, the construction of web-
based tests and quizzes does not call for any programming expertise (although
some knowledge of HTML may sometimes be useful). Second, online tests can
be delivered and taken anywhere and on virtually any device that is connected to
the internet, a standard web browser being the only program that is needed.
Thus, while the use of computers and similar devices represents a major
innovative development in language testing, the internet is making the
implementation of this innovation a feasible reality.
This book is about developing language tests with the aid of web-based
technology. The technology is represented by WebClass (webclass.co), a learning
management system (LMS) that I have been developing and using in blended
environments for the last several years. The WebClass platform started off as a
simple online system for administering language tests consisting mostly of
multiple-choice and gap-filling items. At present, it includes two main modules,
Materials and Tests, which can be used to author, manage, and deliver learning
materials and assessments. Most importantly perhaps, the testing module can be
utilized for the entire process of test development, which includes test and item
analysis. Thus far, WebClass has been used primarily by the author of this book.
However, the source code of the system has recently been further developed by
Pearson Central Europe, Ltd., and it should soon be available, in a much
modified form, to a wider group of users.
Basic terms and concepts
Several key terms and concepts pertaining to language testing, and to
educational assessment in general, are worth clarifying. First, as is often done in
the literature (e.g. Popham, 2003; Carr, 2011), the words test (or testing) and
assessment will be used almost interchangeably in this book. The reason is that a
test is a kind of assessment, and assessment as a process subsumes testing.
Accordingly, in many situations the differences between these terms are not
particularly relevant to the discussion. Another term commonly used to refer to
a test or an assessment is measurement. However, the fact that they are used
interchangeably should not be taken to suggest that all of these words are exactly
synonymous.
Specifically, language assessment can be defined as a process which “involves
obtaining evidence to inform inferences about a person’s language-related
knowledge, skills or abilities” (A. Green, 2014, p. 5). Put simply, language
assessment seeks to find out how well an individual performs tasks which require
the use of language. In addition to making inferences, a related purpose of
assessment is to make decisions about the assessees. These decisions may relate
to, for example, selection, placement, grading, or certification. The word
assessment can also refer to a specific procedure that is employed to gather the
relevant evidence or to an outcome of this procedure. In the latter case, this may
be a test score or a verbal description (Bachman, 2004, p. 7).
When the assessment process or procedure involves assigning numbers, that
is to say, “obtaining a numerical description of the degree to which an individual
possesses a particular characteristic” (M. D. Miller, Linn, & Gronlund, 2009, p.
28), it is known as measurement. In other words, measurement is associated with
numerical quantification of our observations of a person’s performance. A
measurement instrument, as well as the outcome of measurement, can be
referred to as a measure.
Finally, testing is concerned with eliciting samples of performance “by posing
a set of questions in a uniform manner” (M. D. Miller et al., 2009, p. 28). A test
(or an exam) is an instrument, a formal and systematic procedure, as well as an
event which typically has a specific time frame. Testing usually involves
quantification, hence it is a kind of measurement. Since it is conducted with a
view to making inferences and decisions about some attribute(s) of an individual
or a group of individuals, testing is also a type of assessment. However, activities
such as informal questioning by a teacher, observations (i.e. watching students
perform a task), self-assessment (students evaluating their own performance),
peer assessment (students evaluating each other’s performance), as well as
collecting samples of language in the form of portfolios are all different methods
of assessment which do not belong to testing (A. Green, 2014). In addition, tests
are usually distinguished from quizzes because the latter are typically intended to
provide formative feedback to the learners (information about their strengths
and weaknesses), rather than evidence that forms the basis for making inferences
and decisions.
There is another group of words, related to the above terms, which appear in
similar contexts in this book (they are also often used interchangeably in the
literature). These words refer to the people taking a test and include testees,
assessees, examinees, test takers, and candidates. In the context of classroom-
based assessment, they may be used alongside learners and students.
Another notion that is closely related to testing and assessment is evaluation.
This term may be used in several different contexts. First, it may refer to making
judgments and decisions about individuals. Evaluation thus understood is often
based on test scores, and it can take the form of assigning letter or number
grades or providing narrative descriptions (H. D. Brown, 2004). In the present
book, however, evaluation is almost always used with reference to the process of
appraising tests as well as individual items of which tests are composed. In this
sense, evaluation consists in assessing the uses that we make of test scores (see
Chapter 2). Test and item evaluation involves performing various kinds of
qualitative and quantitative analyses.
Somewhat confusingly, evaluation is a term that is also often used to refer to
one of the inferences in argument-based validation (which, as a whole, can be
viewed as being at the heart of broadly defined test evaluation, cf. Geisinger,
2016). More precisely, the evaluation inference is concerned with the quality of
test scores as accurate reflections of the attributes being measured (see Section 7
in Chapter 2). Kane (2013) uses the term scoring inference instead. However, the
former term seems to be more common in language testing (e.g. Chapelle,
Enright, & Jamieson, 2008; Knoch & Elder, 2013; Read, 2015; Chung, 2017).
One of the uses of tests and assessments is to make decisions. Such decisions
can be either relative or absolute. As explained by Bachman (2004), relative
decisions are those which are based on each test taker's relative standing in a
particular group of individuals. University admission decisions are a classic
example of relative decisions because it is usually the case that only a fixed
number of candidates (those with the highest scores) can be admitted. Absolute
decisions, on the other hand, are based on the amount or degree of knowledge or
skill possessed by each examinee. Examples include certification decisions as well
as various classification decisions (e.g. mastery/non-mastery) made on the basis
of classroom achievement tests.
The distinction between relative and absolute decisions corresponds to the
difference between norm-referenced and criterion-referenced interpretations of
test scores. In norm-referenced score interpretations, each student’s
performance is assessed in relation to some comparison group, which may be the
group of all individuals who have taken the test or a norm group representing
the target examinee population. The average performance of the comparison
group defines the standard against which all other performances are judged. In
criterion-referenced score interpretations, by contrast, individual performances
are compared to a predetermined criterion, so that “[t]est scores are linked to
what examinees can do, not viewed in relation to what other examinees can do”
(Hudson, 2014, p. 562). The criterion in question is a well-defined domain of
knowledge or skills that is the target of measurement (although sometimes it is
also taken to mean the cut-point or mastery level that is required to pass a test,
cf. J. D. Brown & Hudson, 2002, p. 49). Though originally related merely to
different score interpretations, these two frames of reference are now associated
with two different approaches to test development (Bachman, 2004), i.e. norm-
referenced testing (NRT) and criterion-referenced testing (CRT). The tests
produced by following the two approaches are referred to as norm-referenced
tests (NRTs) and criterion-referenced tests (CRTs).
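To make the two frames of reference concrete, the following minimal sketch derives both interpretations from the same raw score. It is written in Python purely for illustration; the scores, the 60% cut-point, and the ties-counted-as-half convention are assumptions made for this example, not part of WebClass.

def percentile_rank(score, group_scores):
    """Norm-referenced view: the percentage of the comparison group
    scoring below a given score (ties counted as half, one common
    convention for percentile ranks)."""
    below = sum(1 for s in group_scores if s < score)
    ties = sum(1 for s in group_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(group_scores)

def mastery_decision(score, max_score, cut=0.60):
    """Criterion-referenced view: percent-correct compared with a
    predetermined cut-point (here, a hypothetical 60%)."""
    return "mastery" if score / max_score >= cut else "non-mastery"

group = [12, 15, 15, 18, 20, 22, 25, 27, 28, 30]  # fabricated raw scores, max = 30
print(percentile_rank(25, group))  # 65.0: a relative standing in this group
print(mastery_decision(25, 30))    # mastery: an absolute decision vs. the cut

The same raw score of 25 thus licenses two different statements: a relative one (higher than about 65% of this particular group) and an absolute one (above the assumed mastery threshold), mirroring the NRT/CRT distinction drawn above.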
Some basic differences between NRT and CRT are worthy of mention in this
introduction since they will keep reappearing in various contexts throughout the
book. These differences have been summarised by Hudson (2014) under three
headings, namely test purpose, test content and structure, as well as test
development (for more details, see also, e.g., Popham, 1978; Bachman, 1990; J. D.
Brown, 1996; J. D. Brown & Hudson, 2002; Urbina, 2004; Jamieson, 2011;
Hambleton, Zenisky, & Popham, 2016).
First, as already pointed out, the purpose of NRT is to make relative
decisions, i.e. to see whether or not a given student’s performance is close to
what is typical of the entire population of similar students. Strictly speaking,
such tests do not allow us “to identify specifically what students have or have not
learned” (Genesee & Upshur, 1996, p. 213). Rather, NRTs are aimed at
maximizing differences among the performances of the test takers. This goal can
only be achieved when the scores are spread out as much as possible and follow
a normal distribution. The results of such tests can be conveyed in
terms of percentile rank scores. In the case of CRT, on the other hand, the
purpose of testing is to make absolute decisions, i.e. to find out whether, and
usually also to what extent, the students have attained mastery of a specific
content