ISSN: 2519-1268
Issue 8, Spring 2019
Published on February 16, 2019
Vassilis Vagios (National Taiwan University)
Editorial Board
Lai, Ying Chuan (National Chengchi University)
Blanco, José Miguel (Tamkang University)
Chang, Wen Hui (Chung Yuan Christian University)
Leipelt-Tsai, Monika (National Chengchi University)
Tulli, Antonella (National Taiwan University)
Advisory Board
Takada, Yasunari Professor Emeritus, The University of Tokyo
Chang, Han Liang Professor Emeritus, National Taiwan University
Kim, Soo Hwan Hankuk University of Foreign Studies
Finglass, Patrick University of Bristol
Chaudhuri, Tushar Hong Kong Baptist University
Kim, Hyekyong Inje University
Shi Chen
The Journal is published three times a year (February, June, October) by the Department of
Foreign Languages and Literatures, National Taiwan University.
All correspondence should be addressed to the Department of Foreign Languages and Litera-
tures, National Taiwan University, Roosevelt Rd., Section 4, No. 1, Taipei 106, Taiwan, R.O.C.
Fax: +886-2-23645452
© 2019, Department of Foreign Languages and Literatures, National Taiwan
University. All rights reserved.
Table of Contents
El primer viaje a China en una máquina del tiempo. Luces y sombras en El anacronó-
pete de Enrique Gaspar y Rimbau  1
Gateways, Placements, and Grouping: Automating the C-Test for Language
Proficiency Ranking  23
Alexander the Great in Macedonian folk traditions  69
Gateways, Placements, and Grouping: Automating the
C-Test for Language Proficiency Ranking
Wolfgang Georg Odendahl
National Taiwan University
Foreign language departments with the goal of advanced literacy require optimizing student
learning, especially at the initial stages of the program. Current practices for admission and
placement mainly rely on students’ grades from previous studies, which may be the main
reason why intra-group language proficiency often varies dramatically. One essential step for
creating an environment that enables students to progress according to their skill level is the
development of assessment procedures for admission and placement. Such assessment must
prominently include proficiency in the target language. This article promotes the incorporation
of an automated C-Test into gateway and placement procedures as an instrument that ranks
candidates according to general language proficiency. It starts with a review of the literature
on aspects of validity of the C-Test construct and contains an outline of the functional design
of such an automated C-Test. The article highlights the economic benefits of an automated
C-Test platform and the central role of proficiency-based student placement for the success of
programs aiming to develop advanced literacy in a foreign language. The findings imply
that developing and using the outlined C-Test platform has the potential to increase student
achievement in advanced foreign language instruction significantly.
Keywords: Admission Gateway Testing; Language Proficiency Testing; Homogeneity; Grad-
ing; Differentiated Language Instruction; Ability Grouping
© Wolfgang Georg Odendahl
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International License.
Issue 8 (Spring 2019), pp. 29-67
DOI: 10.6667/interface.8.2019.76
Gateways, Placements, and Grouping: Automating the
C-Test for Language Proficiency Ranking
One of the pervasive challenges of foreign language classrooms is the
universities according to their BA grades, and advancing through the
stages of any program requires a passing grade in the previous level
lation principles vary between institutions and even between teachers
  
unintended heterogeneity in the admission process, such as requiring
  
 1 often fail due to issues with in-
         
       
as, 2012). Furthermore, the question of equivalence of test results and
the comparability between testing facilities is still under debate (Alder-
 
These widely recognized benchmark assessments test the four foundational language skill areas of
reading, writing, listening, and speaking in multi-hour sessions.
whenever language skills are a factor, grouping according to the results
 
language test, which produces a ranked list of all candidates accord-
instrument for informing admission decisions regarding a candidate’s
language skills and allow grouping admitted students accordingly (cf.
        
 
        
       
key study tool for any foreign language-related program, regardless of
the program’s specialization, be it literary studies, teacher training, or
translation studies.
One reason why foreign language departments, especially smaller ones
teaching other than the mainstream languages, shy away from testing
candidates in-house may be rooted in test economy:2 Designing a reli-
able and valid test each year is a very demanding and specialized task.
the gateway problem. Taking its cue from the writing section of the
3 it outlines
  
in this article, refer to section 1.4 below.
3 The ‘Test Deutsch als Fremdsprache’ (TestDaF) is a standardized language test for foreign
                 
                      
                
2018, p. 149). Normally level 4, the second of three levels (with 3 being the lowest and 5 the
           
         
The same institution develops the ‘Online Language Placement Test’ (onSET, formerly onDaF), an
online placement test for a whole range of modern languages at as many university language centres
(Institut, 2017a).
interface
the design of an automated C-Test platform for gateway and placement
testing. Relying on an expandable corpus of texts, the proposed plat-
    
vision and test administration modalities can be adapted according to
the stakes in the outcome. High- and medium-stakes testing, such as
gateway and placement testing, needs some level of supervision
as a precaution against cheating. Individual students may take the test
unsupervised at their leisure as an economical screening test.4
1 Introductory Considerations
1.1 Research Question
In order to solve the problem of test economy outlined above, the ques-
tion this article aims to answer is how to design an automated test that
improves the existing gateway and placement system of foreign lan-
guage departments. It proceeds from the proposition that in a situation
an advanced language program being able to graduate, a test for gen-
as long as the test’s ability to measure what the program requires is
established, testing is a superior alternative to relying on previous grades. This paper’s
base hypothesis is that test economy is the major factor preventing insti-
tutions from using in-house generated test data (Bachman, 2005, p. 24).
Administering and grading tests is work intensive and time consuming
 
admits or places students. Only extensive, long-term qualitative studies
may provide an answer to the question whether institutions and teach-
ers will actually revert to testing if the tests are economical, i.e. easy to
 
results of an independent and objective C-Test.
          
1.2 Method
Based on the hypothesis that institutions avoid using their own tests in
gateway and advancement decisions for reasons of test economy, this
article analyzes the bottlenecks in testing by means of literature review.
The problematic points are then individually resolved by sketching out solutions.
The aim is to outline the design and functionality of an automated C-Test
platform for gateway testing and student placement in institutions for advanced foreign lan-
guage studies. Secondary purposes, such as self-evaluation, are not the
focus of this article.
or formative purposes. An automated version of the C-Test solves prob-
lems of test economy, thereby allowing foreign language institutions to
do their own testing. Being able to rank students according to identical
ments and place them in groups with peers who show similar language
in particular. The fundamental outline of the technical aspects of the
proposed platform is based on general computer and web programming
principles and the author’s personal coding experience.
1.3 Outline
This article consists of three parts. The Literature Review summarizes
      
    -
ducing the historical development from cloze to C-Test in terms of con-
struct principles, this section will make the argument that the C-Test
web-based, fully automated C-Test platform is an excellent solution for
discusses the argument for in-house testing as opposed to relying on
previous grades.
The second part outlines the functional design of an automated online
C-Test platform with usage examples for its practical application. It will
give two examples for using the platform, one for institutional testing
and the other for individual self-administered assessment, followed by
section is to invoke examples of general usage in institutions teaching
their unique testing needs. The third section discusses several technical
details crucial to the functioning of the platform.
1.4 Operational Definitions
This article frequently uses the key terms Ranking, Tracking, and Test Economy.
Ranking: The C-Test construct is designed to measure general lan-
statement (see Figure 3: Results of a Self-administered C-Test in sec. 4)
or a list that ranks candidates hierarchically in relation to their respec-
tive results. The aim of ranking candidates according to their general
didate’s relative performance in order to select and group students with
not part of the teaching process (McNamara, 2011, p. 613; Huhta, 2008,
p. 473). Contrary to the binary logic of a benchmark test such as level
A, B, or C according to CEFR, where the goal is to determine whether
or not the candidate’s skills conform to a predetermined standard, rank-
ing candidates according to their skill-level demands an open-ended
p. 646).5
  -
quirements and the number of admission slots is limited, ranking allows
in the tested skills. For placement purposes, a ranked list allows admin-
istrators to decide on homogeneous or intentional skill-heterogeneous
groups. Therefore, in the context of this article, ranking denotes the
presentation of a test outcome in the form of an ordered list with the
highest scores on top.
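The sense of ranking used here can be made concrete with a short sketch (a hypothetical illustration; the article does not prescribe a data model) that turns raw scores into an ordered list with the highest scores on top. How ties are resolved is likewise an assumption, competition-style shared ranks being one plausible choice:

```python
def ranked_list(results: dict[str, int]) -> list[tuple[int, str, int]]:
    """Return (rank, candidate, score) tuples, highest score first.
    Tied candidates share a rank (competition ranking) -- an assumed
    convention, since the article does not specify tie handling."""
    ordered = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    ranked: list[tuple[int, str, int]] = []
    last_score, last_rank = None, 0
    for pos, (candidate, score) in enumerate(ordered, start=1):
        rank = last_rank if score == last_score else pos
        ranked.append((rank, candidate, score))
        last_score, last_rank = score, rank
    return ranked

table = ranked_list({"A01": 71, "A02": 85, "A03": 85, "A04": 60})
# table: [(1, 'A02', 85), (1, 'A03', 85), (3, 'A01', 71), (4, 'A04', 60)]
```

An administrator would read such a table top-down, cutting off after the number of available slots.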
In contrast to Steenbergen-Hu et al. (2016, p. 850), this article does not
limit the purpose of placement ranking for ability grouping to pro-
       -
         -
tions (cf. Odendahl, 2016) and improve peer-assisted learning results
(cf. Nesmith, 2018; Odendahl, 2017; Smith, 2017; Tempel-Milner, 2018).
 -
adequate selection procedures and is generally undesirable in foreign
language classes. Therefore, regardless of the grouping goal, it is imper-
ative to have reliable evidence to base the grouping on.
Ability grouping and tracking: In placement practice, the terms abil-
ity grouping and tracking are often used synonymously. Some aca-
demics use the term tracking in reference to distributing students into
        
classes (Loveless, 2013, p. 13). The usage adopted by this article con-
        -
ment of students] into streams or tracks from which they never escape”
manent form of distributing students in homogeneous learning groups
 
  
of ability grouping, was widely practiced in U.S. school systems from
the 1960s to the 1990s, when vocal criticism from equity advocates,6
most notably Robert Slavin (1987, 1990, 1993) and Jeannie Oakes (1985,
1986a, 1986b), contributed to its disuse in the public school system.
sition, and classroom assignment” (Steenbergen-Hu et al., 2016, p. 856).
Test economy: The overall ratio between cost spent for testing and its
7 The cost of testing includes
Hertel, 2011, p. 8; Hornke, 2006, p. 434). Furthermore, testing takes not
dates, who take the test and deal with its outcome. These expenditures
            
of utmost importance to determine exactly what purpose a given test
the chances that participants will be able to graduate. While it seems
economical ratio always is to spend as little as necessary in order to gain
as much as possible. With a favorable ratio, the same test can also be
deployed for secondary goals, such as grouping students into skill-ho-
mogeneous classrooms. These requirements build on and conform in
nomical if a) its administration requires little time, b) it consumes little
material, c) it is easy to handle, d) it may be administered as a group
test, and e) its grading is fast and convenient. Hornke (2006, p. 434)
enumerates the stakeholders in economical testing as the candidates,
 
           
1992, p. 5). Relying on Deutsch (1975), Messick (1989, p. 86) discusses the multiple sources of potential
injustice which may be salient in any particular setting.
 
8 Hornke, departing from a standardization standpoint with ISO norms in mind, uses the terms
Therefore, in judging the economic properties of the proposed C-Test
In terms of availability, a test needs to be always accessible, using
as few tools as possible. For example, an online version of a test
will score higher in availability than the paper equivalent, which
has to be physically carried around and distributed. It is more avail-
able than a specialized computer program or app, which are cus-
tom-made for one platform, such as Windows©, Macintosh©, Linux©,
iOS© or Android©. This class of computer programs needs advance
installation on a machine present at the time of testing. In addition,
a mobile-accessible user interface scores better on availability than
one that can only be accessed on larger computer screens.
Reliability includes the specialist term from testing research, i.e. the
likelihood of getting the same result when testing several times under
paper tests would get a lower score.
       
plies to administrative aspects as well as the candidates’ perspective.
Test economy is a major factor when considering in-house testing. In
a medium-stakes situation like admission for a master’s program, the
admitting institution might be willing to spend considerable time and
        
of designers. In his words, the candidates do not want to be unduly strained with testing, the client does
reliable outcome and the researchers need to optimize the test according to their grants (Hornke, 2006,
p. 434).
holds true for gateway and placement testing alike.
Homogeneous versus heterogeneous grouping: The practice of plac-
  
        -
    
a controversy about educational values revolving around equality and
is still unresolved. In an oft-cited meta-study, Slavin (1990) concludes
on student achievement. However, Slavin’s sources all use academic
achievement as the norm of measurement instead of independent test-
ing with compatible standards. Newer studies, still based on academ-
        
grouping on students’ academic achievement (cf. Steenbergen-Hu et al.,
The most-cited risk of homogeneous placement is the phenomenon of
          
placement. On the positive side, as Oakes stated early on, tracking
as educational damage from daily classroom contact and competition
          -
els can achieve at least as well in heterogeneous classrooms” (1986a,
grouping have been shown for low attaining (Francis et al., 2017), socio-
 
dents’ self-esteem, but also leads to varying teacher expectations, thus
perpetuating the initial placement in a vicious cycle (Bernhardt, 2014;
Harris, 2012; Oakes, 1985, p. 8).
With regard to equity and equality in education, increasing placement
dents in frequent intervals means increased mobility between groups
Robinson (2008) found that level-appropriate instruction as the result
  
With homogeneous groups, administration and teachers can custom-
ize their learning environment and progression speed to their students’
        
the student’s needs and allow for focused contents. When aiming for
        
homogeneous ability grouping, the ultimate prerequisite is having valid
data on the current level of students as the deciding placement factor.
In summary, grouping students in classes according to their current
there is ample evidence that low-achieving students’ academic perfor-
provides a chance for planned heterogeneity.
2 Literature Review
2.1 The Evolution of the C-Test
The C-Test is a special form of cloze test developed in the early
1980s by Christine Klein-Braley and Ulrich Raatz (Klein-Braley,
have been the subject of extensive academic discussion and need to be
cy in a foreign language in mind, the following paragraphs will discuss
the validity of the C-Test construct and match it to its practical applica-
tion as an automated platform for gateway and placement testing.
Having language students9-
adapting the grammatical form of that word to its surroundings, stu-
dents can demonstrate their grasp of the subject matter, the extent of
their vocabulary, and their grammatical prowess.
       
10 purposes, but the resulting tests often lack the authenticity of
natural language. Another major drawback, especially when grading
tests with the help of templates, is the occurrence of unplanned ambigu-
ity, i.e. multiple solutions applying to a blank without the designer real-
izing this during construction. In language testing, carefully designed
12 in nature and often includes a whole battery
such as reading, listening, writing, and speaking. In order to render
 
               
subjunctives, or the declension of adjectives.
 
of this approach is severely limited by the size of such a test, a fact that has contributed to developing the
       
knowledge that can be measured in individuals.
12 The purpose of formative testing is to assess a candidate’s mastery of a given program’s
            
             
outside the classroom and the limitations of textbooks (cf. Bachman, 1990, p. 62).
summative language testing more economical, Taylor (1953) introduced
the cloze test as a single test unit to replace the battery of tests involved
purposefully deleting certain morphologically meaningful entities, ev-
ery nth word is automatically replaced by a blank. This kind of systematic
deletion is known as fixed-ratio deletion, as opposed to rational deletion (cf. Bachman, 1985, p. 536).
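Fixed-ratio deletion is mechanical enough to automate directly. The following sketch (an illustration of the every-nth-word rule; the function and variable names are hypothetical, not taken from the cited literature) replaces every nth word with a numbered blank and records the answer key:

```python
import re

def make_cloze(text: str, n: int = 5) -> tuple[str, list[str]]:
    """Fixed-ratio cloze deletion: replace every nth word with a blank."""
    words = text.split()
    answers: list[str] = []
    for i in range(n - 1, len(words), n):
        # Separate trailing punctuation so it survives the deletion.
        m = re.match(r"^(\w+)(\W*)$", words[i])
        if m:
            answers.append(m.group(1))
            words[i] = f"____({len(answers)})" + m.group(2)
    return " ".join(words), answers

cloze, key = make_cloze(
    "Redundancy in natural language helps a reader overcome noise, "
    "because the remaining context still carries most of the meaning."
)
# key: ['helps', 'because', 'carries']
```

Note that the deletion points fall wherever the count lands, which is precisely why unplanned ambiguity can arise: the designer never chooses the blanks.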
Cloze tests may be an indicator of lexical and grammatical competence
(cf. Jonz, 1990; Alderson, 1979a, 1979b) as well as of discourse com-
          
  
competence cloze tests measure, their scores correlate highly with stan-
    -
guage, the fact that language is redundant” (Spolsky, 1968, p. 5). Redun-
dancy in natural language is important in order to convey unequivocal
meaning and to overcome disruptions, such as acoustic interferences
during a conversation or bad print in written communication. These
overlaying meaningful parts of the message and thus causing a reduc-
tion in the original amount of redundancy.13 Spolsky goes on to argue
that the ability to understand a distorted message can be taken as a
sign that the recipient has a thorough understanding of that language
cannot function” with distorted or incomplete messages (Spolsky, 1968,
p. 9). Thus, being able to understand a distorted message is a strong
can produce reliable assessments, they have a considerable number of
13 This lack of redundancy has turned out to be a technical problem for the engineers of early
telephone companies (cf. Shannon, 1948), who battled with severe acoustic interference threatening the
  
motivate further research into the phenomenon of reduced redundancy.
biased; it might occur that a participant gets a bad result because she
does not understand the contents of the text rather than because she
lacks language skills. (3) The blanks in cloze tests are still prone to
unplanned ambiguity, which makes scoring time consuming and some-
times subjective. (4) Cloze tests are not automatically valid tests of lan-
than on the deletion method (Alderson, 1983, p. 213).
The C-Test was designed to overcome the drawbacks of the cloze procedure.
Whereas the cloze test eliminates every
nth word (usually every 5th) from a given text, the C-Test erases the
second half of every second word.
      
cloze test was its use of the word, a more or less linguistic unit,
as the unit to be deleted, and as she showed, this very fact meant
structural features. The new C-Test that Raatz and Klein-Braley
(1982) have proposed overcomes this by deleting not words but
parts of words; it is thus further from being a measure of struc-
tural ability, and so closer to a general measure.”
(Spolsky, 1985, p. 188)
A standard C-Test consists of five short texts of 80-100 words each, each containing approximately
20 blanks, resulting in 100 blanks per test (cf. Klein-Braley, 1997, p. 64).
         
short texts with 20 blanks each, thus reducing problems (1) and (2),
i.e. the economy/bias/validity complex.15 The measure of replacing
             
turned into C-Tests for calibration with native speakers. After an extensive calibration process, only four
texts remain, resulting in a C-Test with 80 blanks.
      
the second half of words with blanks makes use of the redundancy in
natural languages, tests grammatical knowledge simultaneously with
vocabulary, and addresses the ambiguity problem (3) of cloze tests.16
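The C-Test deletion rule is equally easy to automate. The sketch below applies it to a single sentence; two details are assumptions rather than prescriptions from the sources cited here, and real C-Tests conventionally leave the first sentence of each text intact, which this fragment does not model:

```python
import re

def ctest_damage(sentence: str) -> tuple[str, list[str]]:
    """C-Test principle: replace the second half of every second word
    with underscores. Assumed conventions: only alphabetic tokens count
    as words, and for odd-length words the kept first half is rounded up."""
    tokens = sentence.split()
    answers: list[str] = []
    word_no = 0
    for i, tok in enumerate(tokens):
        m = re.match(r"^([A-Za-z]+)(\W*)$", tok)
        if not m:
            continue
        word_no += 1
        if word_no % 2 == 0:              # every second word
            word = m.group(1)
            keep = (len(word) + 1) // 2   # round the kept half up
            answers.append(word[keep:])
            tokens[i] = word[:keep] + "_" * (len(word) - keep) + m.group(2)
    return " ".join(tokens), answers

damaged, key = ctest_damage("Natural language is redundant enough to survive damage.")
# damaged: 'Natural lang____ is redun____ enough t_ survive dam___.'
```

Because the first half of each damaged word remains visible, restoring the text forces the candidate to mobilize vocabulary and grammar at once, exactly the interplay described above.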
C-Tests are a summative form of assessment and instructionally insensitive (cf.
Baghaei, 2015, p. 85). Among other advantages, the C-Test as a highly
2.2 C-Test Validity Studies
The question of validity of the C-Test construct has been the topic of
papers spanning four decades. Today, there is ample evidence that the
     
this claim are the high correlation between C-Tests and other language
Sigott, 2004).
  -
relation between C-Tests and other language tests in both receptive and
productive skills. Pointing to the consistent correlation of the C-Test’s
 
                  
approximately half as long.” (Klein-Braley, 1997, p. 65).
                       
encountered several problems with ambiguity in French and Spanish C-Tests, which forced him to
the validity of C-Test results regarding isolated skills exist (cf. Chapelle,
1994). Roos (1996a) tried with limited success to adapt the C-Test to the
Chinese C-Tests tend to test rather the ability of reading and writing
         
for Japanese kanji characters. Jafarpur (1995) criticizes a lack of face
validity, as his candidates compared the appearance of a C-Test to
a puzzle rather than a language test. Therefore, the C-Test construct is
isolated skills in listening, speaking, or grammar, one should resort to
3 The Case for an Online C-Test Platform
The proposed automated platform can generate, administer, and grade
          
C-Test in the Platform below). The preset testing time for a standard
It is therefore an economical solution for ranking large or small groups
semann and Traxl (2005, p. 277) pointed out, many teachers shy away
from testing because of a lack of time or a perceived lack in compe-
C-Test platform could also help individual teachers with in-class group-
ing,17 assess the overall success of a course, serve individual students
as an indicator of personal learning progress, and help students decide
Making the platform web-based further helps with test economy. It sat-
17 The composition of work groups can be heterogeneous or homogeneous, according to the
pedagogical needs of the task (cf. Odendahl, 2016).
Anyone with an internet connection can access it at all times, access
can be free of charge, and users can take the test on any computer or
mobile device.
3.1 Automating the Test: Web-Based General Language Proficiency
Only an economical test that strikes a positive balance between cost
and reward has the potential to sway foreign language departments and
teachers in favor of testing over traditional gateway and placement pro-
          
skills and then grouping them according to the same principle is a very
ministering, and grading. As demonstrated by TestDaF and onSET, the
C-Test takes a little over half an hour and can be administered and graded
by computer. The following section will introduce the design principles
of the C-Test and the history of its validity debate. It will then proceed to
lay out an online C-Test platform which is able to produce, administer,
and grade a unique C-Test at the press of a button.
3.2 Institutional Gateway Testing
testing, where candidates need to be reasonably supervised to verify
identity and prevent cheating (American Educational Research Associ-
ation, 2014, p. 188).
in which candidates identify themselves, and (d) set an access password
which allows candidates to take this test.
Figure 1: Gateway Testing – Creating a Unique C-Test in the Platform
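The variables set in this step, a test window, a way for candidates to identify themselves, and an access password, can be pictured as a small configuration record. All field names below are hypothetical; the article does not specify the platform's schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class GatewayTest:
    """Settings a teacher fixes when generating a unique C-Test
    (illustrative field names, not a documented API)."""
    test_id: str
    opens_at: datetime        # test accessible only inside this window
    closes_at: datetime
    time_limit_minutes: int   # per-candidate limit, e.g. 40 minutes
    id_field: str             # how candidates identify themselves
    password: str             # handed out in the testing room

exam = GatewayTest(
    test_id="gateway-2019S",
    opens_at=datetime(2019, 2, 16, 9, 0),
    closes_at=datetime(2019, 2, 16, 12, 0),
    time_limit_minutes=40,
    id_field="matriculation number",
    password="s3cret",
)
```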
Based on these variables, the system generates a unique C-Test, which
will only be accessible during the predetermined dates and with the
correct password. On the test date, the candidates will assemble in a
computer classroom, where they will be instructed about the test mo-
dalities. Afterwards, they open a web browser and log in to the test
with the URL and the password provided on the blackboard. Each can-
didate’s time will start individually after they successfully log in.
While taking the test, the candidate’s name is displayed in the upper
part of the screen as long as she is logged in; in a medium-stakes test setting such as described
above, the teacher might use this information to verify the identity of
the candidate taking the test. After the preset amount of time (cf. Figure
1) has elapsed, points are deducted for every minute a candidate delays submitting the test.
Once every candidate in the room has submitted their test, the teacher
may access the ranking list of results.
Figure 2: Ranking List of Test Results
Figure 2 shows the ranked list of results
which the teacher can pull up after the candidates have completed a C-Test
on the web platform. Here, the teacher’s mouse cursor rests at No. 018,
just below the cut-off: the program the candidates
apply for has just 17 slots available, which are awarded to the 17 best
performers. During test creation (cf. Figure 1: Gateway Testing – Creating a
Unique C-Test in the Platform), we asked candidates to identify them-
selves with their matriculation number, which here is shown in the sec-
ond column, followed by the number of
correct answers. The following columns show the same date, similar
IP addresses, and starting times for all candidates. This is owed to the
test setting in a computer classroom. Candidates No. 006 and 014 seem
to have logged in a second time, presumably after experiencing
problems with their original computers. The time limit was set to 40
minutes, so candidates 001, 005, 009, 013, and 017 went overtime and
had points deducted for each minute they delayed submitting their re-
sults. The teacher might conclude the
test with the candidates still in the testing room, announcing something
like: “We will now take a
short break. Candidates 1 through 17, please return after the break for
more information about our program. The others may leave at their lei-
sure. Thank you for participating.”
The only limit to the number of candidates in such a gateway setting
is the number of available computers. If students are allowed to bring
their own devices, there is virtually no limit to the number of testees.18
The teacher’s effort remains the same regardless of the number of candidates.
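The grading logic of this gateway scenario, ranking by score with points deducted per minute of overtime, can be sketched as follows. The function name, the data shape, and the deduction of one point per minute are illustrative assumptions of this sketch, not the platform's documented behavior:

```python
def rank_candidates(results, limit_minutes=40, penalty_per_minute=1):
    """Rank gateway candidates by their overtime-adjusted scores.

    `results` maps a matriculation number to a (raw_points, minutes_taken)
    pair -- an assumed data shape. Candidates who exceed the time limit
    lose `penalty_per_minute` points for every extra minute.
    """
    ranked = []
    for matric, (points, minutes) in results.items():
        overtime = max(0, minutes - limit_minutes)
        ranked.append((points - penalty_per_minute * overtime, matric))
    ranked.sort(reverse=True)  # best adjusted score first
    return [(matric, score) for score, matric in ranked]
```

A teacher admitting the 17 best candidates would then simply cut the returned list after the 17th entry.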
3.3 Individual Self-Administered Assessment
The second usage example covers self-administered language testing
by individual students. In this setting, a student is unsure whether she
should apply for an advanced language program and wants
to know her chances of succeeding. Her teachers might encourage her,
but she needs an independent and objective assessment of her overall
language skills before committing herself.
18 Computing power will go down with increasing numbers of simultaneously submitted tests.
For very large groups, coordination with the
server administrators would be advisable.
Figure 3: Results of a Self-administered C-Test
In order to get such an independent assessment, she pulls out her smart-
phone or sits down in front of her computer, accesses the C-Test web
platform, skips registration, and directly accesses a test by pressing the
start button. After she submits the test, the results are displayed
on her screen, giving the achieved percentage points and an estimate of
the corresponding language level according to the CEFR.19
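The final step, turning a percentage score into a CEFR estimate, amounts to a simple threshold lookup. The cut-off values in the following sketch are invented placeholders; a real deployment would use empirically calibrated thresholds:

```python
def percent_to_cefr(score):
    """Map a C-Test percentage score (0-100) to a rough CEFR level.

    The band boundaries below are illustrative assumptions only.
    """
    bands = [(90, "C2"), (78, "C1"), (63, "B2"), (48, "B1"), (30, "A2")]
    for cutoff, level in bands:
        if score >= cutoff:
            return level
    return "A1"
```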
19 CEFR: Common European Framework of Reference for Languages.
In this example, the student just had to press one button in order to have
the platform generate and administer a unique C-Test. The printable
certificate shows the date and the test results; the mark in the lower right corner indicates
that the test was taken without registration or supervision. Beyond
getting a second opinion on their language skills, students might want
to independently and objectively track their progress by regularly tak-
ing tests in the privacy of their home and at their convenience. They will
thus obtain an objective record of their development over time.
4 Technical Details and Inner Workings of the Platform
4.1 Automated Test Generation
The platform relies on a corpus of edited and calibrated texts indexed
by a database. Whenever a user presses the start button on the test web-
site, the system randomly selects a text. For
each text, it counts the number of words while omitting those marked
as exempt from mutilation,20 randomly determines a starting point be-
tween words 15 and 25, and splits 20 words in half while replacing the sec-
ond half with a blank and recording the eliminated part as the solution.
The large number of texts in the
database in combination with a random starting point for mutilation
ensures that every generated test is unique. Although the platform currently runs on an
uncalibrated21 corpus of 63 texts, the core system is language indepen-
20 See the following section for an extended discussion on how to determine which words should
not be tested.
21 The calibration of texts for use in C-Tests has been the topic of several academic papers (cf.
dent and theoretically works with any alphabet-based language.22 Two
steps are involved in adding a language set, namely, adapting the user
interface and adding a calibrated and edited text corpus.
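The generation step described above, picking a text, choosing a random starting point between words 15 and 25, and blanking the second half of 20 words while honoring exemption markers, can be sketched as follows. The every-second-word rhythm and the rounding rule for odd-length words are assumptions of this sketch where the article leaves details open:

```python
import math
import random

def make_ctest(text, num_blanks=20, seed=None):
    """Generate one C-Test passage (display text plus solutions).

    Words marked with a trailing '*' are exempt from mutilation, which
    shifts the process one word to the right, as described in the article.
    """
    rng = random.Random(seed)
    words = text.split()
    start = rng.randint(15, 25)  # random starting point between words 15 and 25
    display, solutions = [], []
    eligible = 0
    for pos, word in enumerate(words, start=1):
        exempt = word.endswith("*")
        clean = word.rstrip("*")
        if pos < start or len(solutions) >= num_blanks or exempt:
            display.append(clean)
            continue
        eligible += 1
        if eligible % 2 == 0:              # mutilate every second eligible word (assumption)
            display.append(clean)
            continue
        keep = math.ceil(len(clean) / 2)   # keep the first half; the larger half for odd lengths
        solutions.append(clean[keep:])     # record the eliminated part as the solution
        display.append(clean[:keep] + "____")
    return " ".join(display), solutions
```

Because the starting point is drawn afresh for every test, each of the eleven possible start words yields a different passage from the same source text.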
4.2 Choosing Texts for Use in the C-Test Database
Choosing texts for the C-Test database poses compli-
cated problems. What appears to be a rather easy text when read as a
whole may turn out to be quite difficult once mutilated. This section will first discuss the
implications of following the C-Test construction principles for choos-
ing texts, and then explain the pragmatic approach to solving these
problems.
The construction principles, as laid down by Raatz and Klein-Braley
(2002), call for texts of 70-100 words. Mutilation spares the first sen-
tence, which is left complete in order to provide some context; beginning
with the second sentence, the second half of every second word is deleted. Once
the predetermined number of blanks is reached, mutilation stops and
the text comes to a natural end. The texts should be authentic, short,
relevant to the intended user group, and arranged in order of ascending
difficulty.
The problems in following these requirements are:
1. Authentic texts of the required brevity and difficulty are hard to find.
native and non-native speakers. Since the main intended usage for this platform, ranking, can be reliably
achieved with uncalibrated texts, the task of calibrating texts from the database will be postponed until a later stage.
22 There have been experiments with non-alphabetic languages, such as Japanese (Roos, 1996a,
1996b). Constructing C-Tests for these languages requires so many adaptations and deviations from the C-Test principles
that comparability suffers. Furthermore, since the written and oral forms of these languages have only little (Japanese) or no
(Chinese) relation to each other, the results of such tests cannot be accepted as an indication of general
language proficiency.
2. A limited corpus of source texts must yield enough distinct test passages.
3. Texts must match the topical interests and proficiency levels of the intended user groups.
4. Some mutilated words allow more than one correct solution.
The pragmatic answers to these problems are as follows: Problem (1)
is based in Raatz’ and Klein-Braley’s (2002, p. 75) demand that texts
should be non-dialogic and authentic. In terms of content, they should
be relevant to the intended user group. These demands
pose serious obstacles, because authentic texts with a very low read-
ability index23 are rarely found outside of textbooks. Similarly, at high
levels of language competency, authentic texts with only 70-100 words
are hard to come by.24 Cronjäger et al. (2010, p. 75) argue that authentic
texts may altogether be too variable in terms of vocabulary and gram-
matical structures for use with beginning learners. Although textbook
texts would meet the demands for brevity and simplicity, their lack of authenticity bars us
from using them. Therefore, all texts for consideration in the database
originate from authentic sources, but are subject to radical revision, cal-
ibration, and partial re-writing before usage.
The pragmatic solution to the second (2) problem, how to ensure enough distinct test passages,
23 Readability formulas use elements like content, style, structure, and design to determine a text’s reading ease (DuBay, 2004).
of candidates to solve blanks in content words as opposed to structure words (Chapelle, 1994, p. 176).
is not to treat full stops as end-of-sentence markers, but to randomly assign
one of the words between 15 and 25 as the starting point for mutilation, thereby
increasing the number of possible C-Test passages created from each
source text 11-fold.
A further question concerns the difficulty of blanks
created by eliminating the second half of a word. Research indicates
that in C-Tests, blanks resulting from certain word-groups are easier
to solve than others: while restoring structure words mainly draws on grammatical knowledge,
restoring content words requires knowledge of the formal features of
the language as well as the ability to choose a fit-
ting form for a given context (Chapelle, 1994, p. 176). It seems that the
ability to solve mutilated content words in C-Tests is a better measure
of overall language proficiency. Since ranking is the decisive fac-
tor here, both content- and structure words can be part of C-Tests. In order
to further quantify the question of how to adapt C-Tests to candidates,
future studies could rely on a
specialized databank with texts intentionally tweaking the amount of
content and structure words for specific learner groups.
The solution to the third (3) problem follows nat-
urally. The platform relies on a stock da-
tabase of texts, the index of which includes topical keywords and the
readability level of each text. These texts stem from internet blogs, nov-
els, and newspaper articles and are edited for usage in C-Tests.
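The index described here can be queried with a simple filter. The field names (`keywords`, `readability`) and the list-of-dicts shape are assumptions of this sketch of how the stock database might be consulted:

```python
import random

def pick_text(corpus, keywords=None, max_readability=None, rng=random):
    """Pick a random source text whose index entry fits the request.

    `corpus` is a list of dicts with 'text', 'keywords', and 'readability'
    fields -- an assumed shape for the database index described above.
    """
    pool = [
        entry for entry in corpus
        if (max_readability is None or entry["readability"] <= max_readability)
        and (not keywords or set(keywords) & set(entry["keywords"]))
    ]
    return rng.choice(pool) if pool else None
```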
Since the main purpose of the C-Test platform is to produce rating
lists, the unambiguous solvability of every single blank is not the most important criterion; even with the oc-
casional ambivalent blank, the ranking hierarchy of candidates remains stable
(cf. Dresemann & Traxel, 2005, p. 277). Concerning individual assessment, however,
ambivalent blanks may lead to wrong estima-
tions, which poses a problem for individual students who use the plat-
form for self-assessment. For
this user group, the texts need to be calibrated by means of monitoring
test outcomes and running statistical analyses of problematic blanks in order to achieve
more reliable test results. Another source of calibrated texts could be
the project of Dresemann and Traxel (2005; 2010), who have assembled
a databank of calibrated C-Test texts. In addition, native
speakers re-write the texts in order to avoid ambiguities and other pit-
falls; special attention is given to compound nouns, names, and other words
considered problematic when mutilated.25 While rewriting problemat-
ic passages is the preferred way of dealing with such
undesirable words, there are two other ways to mark these words for
exemption from the automatic mutilation process. First, words with an
asterisk at the end are exempt from mutilation, which will make the mu-
tilation process shift one word to the right. The second option is to shift
mutilation by one or two letters to the left or right by adding ±n to the
end of the word, as in “…fahrt+1”, which tells the system to leave one more letter and mutilate the rest.
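The two marker conventions just described, a trailing asterisk for full exemption and a ±n suffix that moves the split point, could be parsed as in the following sketch; the default rounding (larger half kept) is an assumption:

```python
import math
import re

def split_word(marked):
    """Split one marked word into its kept and deleted halves.

    A trailing '*' exempts the word entirely; a trailing +n or -n shifts
    the split point n letters to the right or left of the half-way point.
    """
    if marked.endswith("*"):
        return marked[:-1], ""              # exempt: nothing is deleted
    match = re.fullmatch(r"(\w+)([+-]\d)", marked)
    word, shift = (match.group(1), int(match.group(2))) if match else (marked, 0)
    cut = math.ceil(len(word) / 2) + shift  # default split after the (larger) first half
    return word[:cut], word[cut:]
```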
25 In C-Test research, the process of deleting the second half of words is commonly referred to as
“mutilation”.
26 One of the most basic rules in creating C-Test blanks calls for deleting half of the word. In the case of words with an odd number of letters, the split point has to be rounded, so that one half contains the extra letter.
5 Conclusion
This article has presented an economi-
cal alternative to accepting candidates’ previous grades as the basis for
gateway testing. It further argues that ranking students by general lan-
guage proficiency provides an objective basis for decisions
such as class placement and in-class grouping. The article rebukes the
allegation of tracking by the argument that knowing the skill level of
students allows for homogeneous grouping as well as for grouping according to pat-
terns of planned heterogeneity. Increasing the frequency of placement
testing promotes student mobility according to their current language skills.
In gateway testing, using the same test for all students will set objec-
tive standards for admission. Frequent placement tests and regrouping
ensure that students are taught ac-
cording to their actual and current language skills. The C-Test construct
is an adequate, valid, and reliable means of testing general language
proficiency, and
the platform proposed here is an economical testing tool. It presents the
results of individual tests as a printable diploma, and groups tests in list
form, ranking candidates according to their test results. An automated
C-Test generating internet platform makes testing universally available
with very little preparation, minimum time loss, and considerable ben-
efits.
The data derived from an automated C-Test platform can support re-
search in language testing and beyond. Metadata, like geographical user distribution, frequency of
use, and feedback from users and administrators, provides answers to a wide array of questions
concerning foreign language acquisition.
An interesting area of research will be TestDaF, the admission test for
foreign students at German universities, which has used C-Tests in its
screening procedures. Since wrong answers
“provide us with more insights into text processing strategies than right
answers do” (Klein-Braley, 1996, p. 39), an analysis of a large number of wrong an-
swers from language students might reveal new insights for test validity,
language testing and learning.
Alderson, J. C. (1979a). “The Cloze Procedure and Proficiency in
English as a Foreign Language”. TESOL Quarterly, 13(2), 219–227.
———. (1979b). “The Effect on the Cloze Test of Changes in Dele-
tion Frequency”. Journal of Research in Reading, 2(2).
———. (1983). “The Cloze Procedure and Proficiency in English
as a Foreign Language”, in Oller, J. W. (ed.), Issues in Language
Testing Research (pp. 205–217). Rowley, Mass: Newbury House.
. (2017). “Foreword to the Special Issue “The Common Eu-
ropean Framework of Reference for Languages (CEFR) for En-
glish Language Assessment in China” of Language Testing in
Asia”. Language Testing in Asia, 7(1), 20.
Alderson, J. C., Brunfaut, T., & Harding, L. (2015). “Towards a
Theory of Diagnosis in Second and Foreign Language As-
sessment: Insights from Professional Practice across Diverse
Fields”. Applied Linguistics, 36(2), 236–260.
American Educational Research Association. (2014). Standards for
Educational and Psychological Testing. (American Psychological
Association, National Council on Measurement in Education,
& Joint Committee on Standards for Educational and Psycho-
logical Testing (U.S.), Eds.). Washington, DC: American Edu-
cational Research Association.
Arras, U., Eckes, T., & Grotjahn, R. (2002). “C-Tests im Rahmen
des “Test Deutsch als Fremdsprache” (TestDaF): Erste For-
schungsergebnisse“ in Grotjahn, R. (ed.), Der C-Test. Theore-
tische Grundlagen und praktische Anwendungen (Vol. 4, pp. 175–
209). Bochum: AKS.
Arras, U., & Grotjahn, R. (1994). “Der C-Test im Chinesischen“, in
Grotjahn, R. (ed.), Der C-Test. Theoretische Grundlagen und
praktische Anwendungen (Vol. 2, pp. 1–60). Bochum: Brockmeyer.
Asano, Y. (2014). “Nähere Betrachtung des Konstrukts: Allge-
meine Sprachkompetenz”, in Grotjahn, R. (ed.), Der C-Test: Ak-
tuelle Tendenzen (pp. 41–54). Frankfurt / M.: Lang.
Babaii, E., & Ansary, H. (2001). “The C-Test: A Valid Operation-
alization of Reduced Redundancy Principle?” System, 29(2),
209–219.
Bachman, L. F. (1985). “Performance on Cloze Tests with Fixed-Ra-
tio and Rational Deletions”. TESOL Quarterly, 19(3), 535–556.
———. (1990). Fundamental Considerations in Language Testing. Ox-
ford: Oxford University Press.
———. (2005). “Building and Supporting a Case for Test Use”. Lan-
guage Assessment Quarterly: An International Journal, 2(1), 1–34.
Baghaei, P. (2010). “An Investigation of the Invariance of Rasch
Item and Person Measures in a C-Test”, in Grotjahn, R. (ed.),
Der C-Test: Beiträge aus der aktuellen Forschung (pp. 71–100).
Frankfurt / M.: Lang.
. (2011). “Optimal Number of Gaps in C-Test Passages”.
International Education Studies, 4(1), 166–171.
Baghaei, P., & Grotjahn, R. (2014). “Establishing the Construct
Validity of Conversational C-Tests Using a Multidimensional
Rasch Model”. Psychological Test and Assessment Modeling, 56(1),
Baghaei, P., & Tabatabaee, M. (2015). “The C-Test: An Integrative
Measure of Crystallized Intelligence”. Journal of Intelligence,
3(2), 46–58.
Bernhardt, P. E. (2014). “Making Decisions about Academic Tra-
jectories: A Qualitative Study of Teachers’ Course Recommen-
dation Practices”. American Secondary Education, 42(2), 33–50.
Braddock, J. H., & Slavin, R. E. (1992). “Why Ability Grouping
Must End: Achieving Excellence and Equity in American Edu-
cation”. Presented at the Common Destiny Conference, EDRS.
Brulles, D., Saunders, R., & Cohn, S. J. (2010). “Improving Per-
formance for Gifted Students in a Cluster Grouping Model”.
Journal for the Education of the Gifted, 34(2), 327–350.
Chapelle, C. A. (1994). “Are C-tests Valid Measures for L2 Vocabu-
lary Research?” Second Language Research, 10(2), 157–187.
Chapelle, C. A., & Voss, E. (2017). “Utilizing Technology in Lan-
guage Assessment”, in Shohamy, E. (ed.), Language Testing
and Assessment (3rd ed., pp. 149–162). New York: Springer Sci-
ence+Business Media.
Cohen, E. G., & Lotan, R. A. (2014). Designing Groupwork: Strate-
gies for the Heterogeneous Classroom (Kindle Edition). New York:
Teachers College Press.
Cronjäger, H., Klapheck, K., Kräschmar, M., & Walter, O. (2010).
“Entwicklung eines C-Tests für Lernanfänger der Sek. I mit
Methoden der klassischen und probabilistischen Testtheorie”,
in Grotjahn, R. (ed.), Der C-Test: Beiträge aus der aktuellen For-
schung (pp. 71–100). Frankfurt / M.: Lang.
Daud, N. S. M., Daud, N. M., & Kassim, N. L. A. (2005). “Second
Language Writing Anxiety: Cause or Effect?” Malaysian Jour-
nal of ELT Research, 1(1), 19.
Deutsch, M. (1975). “Equity, Equality, and Need: What Deter-
mines Which Value Will Be Used as the Basis of Distribu-
tive Justice?” Journal of Social Issues, 31(3), 137–149. https://doi.org/10.1111/j.1540-4560.1975.tb01000.x
Deygers, B., Van Gorp, K., & Demeester, T. (2018). “The B2 Level
and the Dream of a Common Standard”. Language Assessment
Quarterly, 15(1), 44–58.
Díez-Bedmar, M. B. (2012). “The Use of the Common European
Framework of Reference for Languages to Evaluate Composi-
tions in the English Exam Section of the University Admission
Examination”. Revista de Educación, 357, 55–79.
Drackert, A. (2016). Validating Language Prociency Assessments in
Second Language Acquisition Research. Frankfurt: Lang. https://
Dresemann, B., & Traxel, O. (2005). “Ermittlung von Sprach-
niveaus mittels kalibrierter C-Tests. Ein Projekt zur Entwick-
lung einer C-Test Datenbank”, in Gebert, D. (ed.), Innovation
aus Tradition: Dokumentation der 23. Arbeitstagung 2004 (pp. 277–
283). Bochum: AKS-Verl.
DuBay, W. H. (2004). The Principles of Readability. Costa Mesa: Im-
pact Information.
Dunlea, J., & Figueras, N. (2012). “Replicating Results from a
CEFR Test Comparison Project across Continents”, in Tsagari,
D. & Csépes, I. (eds.), Collaboration in Language Testing and As-
sessment (pp. 31–47). Frankfurt: Lang.
Dutcher, L. R. (2018). Interaction and Collaboration across Prociency
Levels in the English Language Classroom (Ph. D. Dissertation).
University of Sydney, Sydney.
Eckes, T. (2007). “Konstruktion und Analyse von C-Tests mit Rat-
ingskalen-Rasch-Modellen”. Diagnostica, 53(2), 68–82.
. (2011). “Item banking for C-tests: A polytomous Rasch
modeling approach”. Psychological Test and Assessment Model-
ing, 53(4), 414–439.
Eckes, T., & Baghaei, P. (2015). “Using Testlet Response Theory to
Examine Local Dependence in C-Tests”. Applied Measurement
in Education, 28(2), 85–98.
Eckes, T., & Grotjahn, R. (2006). “A Closer Look at the Construct
Validity of C-Tests”. Language Testing, 23(3), 290–325.
Feldt, L. S., & Brennan, R. L. (1989). “Reliability”, in Linn, R. L.
(ed.), Educational Measurement (pp. 105–146). New York; Lon-
don: American Council on Education and Collier Macmillan.
Francis, B., Archer, L., Hodgen, J., Pepper, D., Taylor, B., & Tra-
vers, M.-C. (2017). “Exploring the Relative Lack of Impact of
Research on ‘Ability Grouping’ in England: A Discourse An-
alytic Account”. Cambridge Journal of Education, 47(1), 1–17.
Gamaroff, R. (2000). “Rater Reliability in Language Assessment:
The Bug of All Bears”. System, 28(1), 31–53.
Gesellschaft für Akademische Studienvorbereitung und Testent-
wicklung e. V., & TestDaF-Institut. (2017a). “About onSET [Cor-
porate]”. Retrieved December 31, 2018, from https://www.on-
———. (2017b). “TestDaF [Corporate]”. Retrieved December 31,
2018, from hp://
Glock, S., & Böhmer, I. (2018). “Teachers’ and preservice teachers’
stereotypes, attitudes, and spontaneous judgments of male
ethnic minority students”. Studies in Educational Evaluation, 59.
Gnambs, T., Batinic, B., & Hertel, G. (2011). “Internetbasierte psy-
chologische Diagnostik [Autorenmanuskript]”, in Hornke, L.
F., Amelang, M., Kersting, M. (eds.), Verfahren zur Leistungs-,
Intelligenz- und Verhaltensdiagnostik (Vol. II/3, pp. 448-498 / 1-62).
Göttingen: Hogrefe. Retrieved December 31, from https://timo.
Grotjahn, R. (1987). “How to Construct and Evaluate a C-Test: A
Discussion of Some Problems and Some Statistical Analyses”,
ιn Klein-Braley, C., Stevenson, D. K., Grotjahn, R. (eds.), Taking
Their Measure: The Validity and Validation of Language Tests (pp.
219–253). Bochum: Brockmeyer.
Haertel, E. H. (2006). “Reliability”, ιn Brennan, R. L. (ed.), Edu-
cational Measurement. Sponsored Jointly by National Council on
Measurement in Education and American Council on Education
(4th ed., pp. 65–110). Michigan: Praeger.
Harmer, J. (2010). How to Teach English (New ed., 6. impr). Harlow:
Pearson Longman.
Harris, D. M. (2012). “Varying Teacher Expectations and Standards:
Curriculum Differentiation in the Age of Standards-Based Re-
form”. Education and Urban Society, 44(2), 128–150.
Henry, L. (2015). “The Effects of Ability Grouping on the Learning
of Children from Low Income Homes: A Systematic Review”.
The STeP Journal, 2(3), 70–87.
Hornke, L. F. (2006). “Testökonomie: Test Economy”, in Petermann
F. , Eid, M. (eds.), Handbuch der Psychologischen Diagnostik (pp.
434–448). Hogrefe Verlag.
Hornstra, L., van der Veen, I., Peetsma, T., & Volman, M. (2014).
“Does Classroom Composition Make a Difference: Effects on
Developments in Motivation, Sense of Classroom Belonging,
and Achievement in Upper Primary School”. School Effective-
ness and School Improvement, 1–28.
Huang, L., Kubelec, S., Keng, N., & Hsu, L. (2018). “Evaluating
CEFR rater performance through the analysis of spoken learn-
er corpora”. Language Testing in Asia, 8(1), 1–17.
Huhta, A. (2008). “Diagnostic and Formative Assessment”, in
Spolsky, B. & Hult, F. M. (eds.), The Handbook of Educational
Linguistics (pp. 469–482).
Jafarpur, A. (1995). “Is C-testing Superior to Cloze?” Language Test-
ing, 12(2), 194–216.
Jones, N., & Saville, N. (2008). “Scales and Frameworks”, in
Spolsky, B. & Hult, F. M., The Handbook of Educational Linguis-
tics (pp. 496–510).
Jonz, J. (1990). “Another Turn in the Conversation: What Does
Cloze Measure?” TESOL Quarterly, 24(1), 61–83.
Khodadady, E. (2014). “Construct Validity of C-tests: A Factori-
al Approach”. Journal of Language Teaching and Research, 5(6),
1353 –1362.
Khoshdel-Niyat, F. (2017). “The C-Test: A Valid Measure to Test
Second Language Prociency?” [Preprint]. HAL Hprints,
01491274, 1–30. hps://
Kim, Y. (2012). “Implementing Ability Grouping in EFL Contexts:
Perceptions of Teachers and Students”. Language Teaching Re-
search, 16(3), 289–315.
Klein-Braley, C. (1983). “A Cloze is a Cloze is a Question”, in Oller,
J. W. (ed.), Issues in Language Testing Research (pp. 218–230).
Rowley, Mass: Newbury House.
. (1996). “Towards a Theory of C-Test Processing”, in Grot-
jahn, R. (ed.), Der C-Test. Theoretische Grundlagen und praktische
Anwendungen (Vol. 3, pp. 23–94). Bochum: Brockmeyer.
. (1997). “C-Tests in the Context of Reduced Redundancy
Testing: An Appraisal”. Language Testing, 14(1), 47–84.
Klein-Braley, C., & Raatz, U. (1984). “A Survey of Research on the
C-Test”. Language Testing, 1(2), 134–146.
Knapp, A. (2011). “Issues in Certification”, in Knapp, K., Seidl-
hofer, B., Widdowson, H. (eds.), Handbook of Foreign Language
Communication and Learning (pp. 629–662). New York: Mouton
de Gruyter.
Lenhard, W., & Lenhard, A. (2011). Berechnung des Lesbarkeitsin-
dex LIX nach Björnson. Bibergau: Psychometrica. https://doi.
Lienert, G. A., & Raatz, U. (1994). Testaufbau und Testanalyse (5th
ed.). Weinheim: Beltz PVU.
Lin, W., Yuan, H., & Feng, H. (2008). “Language Reduced Redun-
dancy Tests: A Reexamination of Cloze Test and C-Test”. Jour-
nal of Pan-Pacific Association of Applied Linguistics, 12(1), 61–79.
Loveless, T. (2013). How Well Are American Students Learning? With
Sections on the Latest International Tests, Tracking and Ability
Grouping, and Advanced Math in 8th Grade (Brown Center Re-
port on American Education No. Vol. 3, No. 2) (p. 36). Wash-
ington, DC: Brookings Institution.
McNamara, T. (2011). “Principles of Testing and Assessment”, in
Knapp, K., Seidlhofer, B., Widdowson, H. (eds.), Handbook of
Foreign Language Communication and Learning (pp. 607–627).
New York: Mouton de Gruyter.
Messick, S. (1989). “Validity”, in Linn, R. L. (ed.), Educational Mea-
surement (pp. 13–103). New York; London: American Council
on Education and Collier Macmillan.
Missett, T. C., Brunner, M. M., Callahan, C. M., Moon, T. R., &
Azano, A. P. (2014). “Exploring Teacher Beliefs and Use of
Acceleration, Ability Grouping, and Formative Assessment”.
Journal for the Education of the Gifted, 37(3), 245–268.
Moosbrugger, H., & Kelava, A. (2011). Testtheorie und Fragebogen-
konstruktion (2nd ed.). Springer-Verlag.
Mozgalina, A., & Ryshina-Pankova, M. (2015). “Meeting the Chal-
lenges of Curriculum Construction and Change: Revision and
Validity Evaluation of a Placement Test”. The Modern Language
Journal, 99(2), 346–370.
Nesmith, B. M. (2018). Deciding on Classroom Composition: Factors
Related to Principals’ Grouping Practices (Doctor of Education).
Georgia Southern University, Statesboro.
Newbold, D. (2012). “Local Institution, Global Examination: Work-
ing Together for a ‘Co-certification’”, in Tsagari, D., Csépes, I.
(eds.), Collaboration in Language Testing and Assessment (pp. 127–
142). Frankfurt: Lang.
Norouzian, R., & Plonsky, L. (2018). “Correlation and Simple
Linear Regression in Applied Linguistics”, in Phakiti, A., De
Costa, P., Plonsky, L., Starfield, S. (eds.), The Palgrave Handbook
of Applied Linguistics Research Methodology (pp. 395–421). Lon-
don: Palgrave Macmillan UK. https://
Norris, J., & Drackert, A. (2018). “Test Review: TestDaF”. Language
Testing, 35(1), 149–157.
North, B. (2000). The Development of a Common Framework Scale of
Language Proficiency. New York: Lang.
Oakes, J. (1985). Keeping Track: How Schools Structure Inequality.
New Haven: Yale University Press.
. (1986a). “Keeping Track, Part 1: The Policy and Practice of
Curriculum Inequality”. The Phi Delta Kappan, 68(1), 12–17.
. (1986b). “Keeping Track, Part 2: Curriculum Inequality
and School Reform”. The Phi Delta Kappan, 68(2), 148–154.
Odendahl, W. (2016). “Promoting Student Engagement through
Skill-Heterogeneous Peer Tutoring”. Interface - Journal of Euro-
pean Languages and Literatures, 1, 119–153. https://
———. (2017). „Individuelle Noten aus kollaborativer Arbeit“.
Deutsch-Taiwanische Hefte, 16(25), 27–57.
Oller, John William. (1979). Language Tests at School: A Pragmatic
Approach. London: Longman.
Oller, John W., & Conrad, C. A. (1971). “The Cloze Technique and
ESL Proficiency”. Language Learning, 21(2), 183–194.
Popham, W. J., Berliner, D. C., Kingston, N. M., Fuhrman, S. H.,
Ladd, S. M., Charbonneau, J., & Chatterji, M. (2014). “Can To-
day’s Standardized Achievement Tests Yield Instructionally
Useful Data? Challenges, Promises and the State of the Art”.
Quality Assurance in Education, 22(4), 2–2.
Raa, U. (1984). The Factorial Validity of C-Tests.
Raa, U., & Klein-Braley, C. (1983). “The C-Test - A Modication
of the Cloze Procedure”, in Stevenson, D. K., Klein-Braley, C.
(eds.), Practice and Problems in Language Testing. Proceedings of
the Fourth International Language Testing Symposium of the Inter-
universitäre Sprachtestgruppe, held at the University of Essex, 14-
17th September, 1981 (Vol. 4, pp. 113–148). Colchester: University of Essex.
———. (1985). “How to Develop a C-Test”. Fremdsprachen und
Hochschule, 13(14), 20–22.
. (2002). “Introduction to Language Testing and to C-Tests”,
in Coleman, J. A., Grotjahn, R., Raa, U. (eds.), University Lan-
guage Testing and the C-test (pp. 75–91). Bochum: AKS.
Reese, S. (2011). “Differentiation in the Language Classroom”. The
Language Educator, 6(4), 40–46.
Robinson, J. P. (2008). “Evidence of a Differential Effect of Abil-
ity Grouping on the Reading Achievement Growth of Lan-
guage-Minority Hispanics”. Educational Evaluation and Policy
Analysis, 30(2), 141–180.
Roos, U. (1996a). “The C-Test in Japanese”, in Grotjahn, R. (ed.),
Der C-Test. Theoretische Grundlagen und praktische Anwendungen
(Vol. 2, pp. 61–118). Bochum: Brockmeyer.
. (1996b). “The Reconstructability of Japanese Characters:
Some New Evidence”, in Grotjahn, R. (ed.), Der C-Test. Theore-
tische Grundlagen und praktische Anwendungen (Vol. 3, pp. 139–
157). Bochum: Brockmeyer.
Rouhani, M. (2008). “Another Look at the C-Test: A Validation
Study With Iranian EFL Learners”. The Asian EFL Journal
Quarterly, 10(1), 154.
Schofield, J. (2010). “International Evidence on Ability Grouping
with Curriculum Dierentiation and the Achievement Gap in
Secondary Schools”. The Teachers College Record, 112(5), 8–9.
Shannon, C. E. (1948). “A Mathematical Theory of Communica-
tion”. The Bell System Technical Journal, 27, 379–423, 623–656.
Shohamy, E. (2017). “Critical Language Testing”, in Shohamy, E.
(ed.), Language Testing and Assessment (3rd ed., pp. 441–454).
New York: Springer Science+Business Media.
Sigott, G. (2004). Towards Identifying the C-Test Construct. Frankfurt: Lang.
Slavin, R. E. (1987). “Ability Grouping and Student Achievement
in Elementary Schools: A Best-Evidence Synthesis”. Review of
Educational Research, 57(3), 293–336.
———. (1990). “Achievement Effects of Ability Grouping in Sec-
ondary Schools: A Best-Evidence Synthesis”. Review of Educa-
tional Research, 60(3), 471–499.
———. (1993). “Ability Grouping in the Middle Grades: Achieve-
ment Effects and Alternatives”. The Elementary School Journal,
93(5), 535–552.
Smith, A. L. (2017). Grouping Structures of Gifted and High Achiev-
ing Middle School Students: Teacher Perceptions and Data Analysis
of the Impact of Grouping (Ph. D. Dissertation). Columbus State
University. Retrieved December 31, 2018, from https://csue-
Spolsky, B. (1968). “What Does It Mean to Know a Language, Or
How Do You Get Someone to Perform His Competence?” Pre-
sented at the Second Conference on Problems in Foreign Lan-
guage Testing, University of Southern California: ERIC Clearinghouse.
. (1985). “What Does It Mean to Know How to Use a Lan-
guage? An Essay on the Theoretical Basis of Language Test-
ing”. Language Testing, 2(2), 180–191.
Steenbergen-Hu, S., Makel, M. C., & Olszewski-Kubilius, P. (2016).
“What One Hundred Years of Research Says about the Ef-
fects of Ability Grouping and Acceleration on K–12 Students’
Academic Achievement: Findings of Two Second-Order Me-
ta-Analyses”. Review of Educational Research, 86(4), 849–899.
Stöger, H., & Ziegler, A. (2013). “Heterogenität und Inklusion im
Unterricht“. Schulpädagogik Heute, 7(4), 1–30.
Sumbling, M., Viladrich, C., Doval, E., & Riera, L. (2014). “C-Test as
an Indicator of General Language Proficiency in the Context
of a CBT (SIMTEST)”, in Grotjahn, R. (ed.), Der C-Test: Aktuelle
Tendenzen – The C-Test: Current Trends (pp. 55–110). Frankfurt:
Lang. https://
Sun, X., Fan, J., & Chin, C.-K. (2017). “Developing a Speaking Diagnostic Tool for Teachers to Differentiate Instruction for Young Learners of Chinese”, in Zhang, D., Lin, C.-H. (eds.), Chinese as a Second Language Assessment (pp. 249–270). Singapore: Springer Singapore.
Tabatabaei, O., & Mirzaei, E. (2014). “Correlational Validation of Cloze Test and C-Test against IELTS”. Journal of Educational and Social Research, 4(1), 345.
Taylor, W. L. (1953). “Cloze Procedure: A New Tool for Measuring
Readability”. Journalism Quarterly, 30, 415–433.
Tempel-Milner, M. E. (2018). Implementing Full-Time Gifted and Talented Programs in Title 1 Schools: Reasons, Benefits, Challenges and Opportunity Costs (Ph.D. Dissertation). University of Maryland, College Park.
Tieso, C. L. (2003). “Ability Grouping Is Not Just Tracking Anymore”. Roeper Review, 26(1), 29–36.
Tomlinson, C. A. (2014). The Differentiated Classroom: Responding to the Needs of All Learners. Alexandria, VA: ASCD.
Tomlinson, C. A., & Imbeau, M. B. (2014). Leading and Managing a Differentiated Classroom. Alexandria, VA: ASCD.
Traxel, O., & Dresemann, B. (2010). “Collect, Calibrate, Compare: A Practical Approach to Estimating the Difficulty of C-Test Items”, in Grotjahn, R. (ed.), Der C-Test: Beiträge aus der aktuellen Forschung (pp. 57–69). Frankfurt: Lang.
Tremblay, A. (2011). “Proficiency Assessment Standards in Second Language Acquisition Research: ‘Clozing’ the Gap”. Studies in Second Language Acquisition, 33(3), 339–372.
Trim, J., North, B., & Coste, D. (2009). Gemeinsamer europäischer Referenzrahmen für Sprachen: lernen, lehren, beurteilen [Niveau A1, A2, B1, B2, C1, C2] (Council for Cultural Co-operation, ed.). Berlin; München; Wien; Zürich; New York, NY: Langenscheidt.
Vogl, K., & Preckel, F. (2014). “Full-Time Ability Grouping of Gifted Students: Impacts on Social Self-Concept and School-Related Attitudes”. Gifted Child Quarterly, 58(1), 51–68.
Wunsch, C. (2009). “Binnendifferenzierung”, in Jung, U. O. H. (ed.), Praktische Handreichung für Fremdsprachenlehrer (5th ed., pp. 41–47). Frankfurt: Lang.
Xie, Q. (2015). “‘I must impress the raters!’ An Investigation of Chinese Test-Takers’ Strategies to Manage Rater Impressions”. Assessing Writing, 25, 22–37.
[received November 23, 2018
accepted January 22, 2019]