The Development, Validation and Use
of a Test of Word Recognition for English
David Coulson and Paul Meara
Abstract Word recognition is a basic aspect of vocabulary skill, and a critical skill
in ﬂuent reading. Native speakers of English can recognize single words in about
one tenth of a second. Learners are somewhat slower, but this difference is difﬁcult
to measure without sensitive equipment. This chapter describes how we developed
a test of word recognition for EFL learners, called Q_Lex. In our approach, words
are hidden in nonsense letter strings and this slows recognition speed to a level that
personal computers can easily measure. Learners are assessed on the basis of native
speakers’ reaction time norms. We describe the development and validation of this
tool and the measurement principles that underlie it. In particular, we emphasize how
we sought to improve its reliability. Finally, we describe an experiment with Q_Lex
to investigate learners at different levels of proﬁciency.
Keywords Vocabulary · Word-recognition assessment · Reaction-time
1 Introduction to Word Recognition in a Second Language
Rapid word recognition skill is essential to reading. Hulstijn commented, “Learning
to apply reading strategies should not take precedence over establishing a core of
automatically accessible lexical items” (2001, p. 266). Yet reliable, practical
assessment of this skill is a major challenge. The relationships that develop
between vocabulary sub-skills, such as word recognition, and performance are
dynamic and unpredictable. If performance in second languages developed in direct
proportion to the effort spent memorizing words, research would be much less
challenging and much less interesting. Instead, much time must be spent developing
and validating new tests. This chapter will argue that the simpler tests are, the less
likely they are to cause trouble. This is because confounding factors cast doubt on
the results of even apparently straightforward tests.
The basic ﬁndings in L2 word recognition date back to some very early work by
Cattell (1886, 1947). In his 1886 paper, Cattell reported a detailed study of two
non-native speakers of German. His key ﬁnding was that the time to recognize
single letters was only slightly shorter than the time needed to recognize whole
words. From this, he inferred that individual letters are not perceived in word
recognition. This became known as the “word superiority” effect. He also reported
that word-recognition in an L2 (German) is slower than word recognition in an L1
(English), and that for single words, the difference is in the region of 10 ms. This
ﬁnding has turned out to be surprisingly robust. More recent research has not
signiﬁcantly improved on this work, despite the advanced technologies that are
available to modern researchers. Cattell’s work relied on an astonishingly inno-
vative use of clockwork and electrical circuitry, which could measure reaction times
accurately to about 2 ms.
Cattell’s second study (posthumously reported in 1947) used only a stopwatch to
measure reaction times. In this study, he recorded reaction times for hundreds of
words. He reasoned that since reaction times are not consistent, the accurate
measurement of a few instances does not necessarily provide reliable information
which can be generalised. With this simpler method, he found that the speed with
which subjects could read words in sentences depended on how well they knew the
language. Cattell’s observations that L2 speakers’ reaction times to words are slower
than L1 speakers’, and that reaction times could be used to track the degree of ability
in the L2, were prescient, and remain highly relevant today.
2 Assessment of Second Language Word Recognition
Although a number of people have written about the need for a practical test of
word recognition ability in an L2, there has not really been much progress in this
area (see, for example, Daller, Milton, & Treffers-Daller, 2007; Milton &
Fitzpatrick, 2014). Daller et al. (2007) accord fluency a central role in lexical
processing, as one of three components of a three-dimensional “lexical space”
defined by learners’ breadth of lexical knowledge, the depth of this knowledge,
and fluency, the ability to access the vocabulary appropriately.
However, they state, “It would probably be true to say that we have no widely used
or generally accepted test of vocabulary fluency. More research in this area is
needed” (p. 9). A standardised test of word recognition ability is an obvious candidate.
The main problem that faces researchers in this area is that, even with modern
technology, it is not easy to measure accurately the very small differences that we
expect to ﬁnd when we compare native speakers and learners on a word recognition
task. The standard approach in the very extensive literature on laboratory studies of
L2 word recognition (e.g. Akamatsu, 2008) is the lexical decision task, where
subjects are presented with a string of letters, and are asked to push a button to
indicate whether the string is a word or not. Sometimes, the stimulus strings are
preceded by a prime, a brieﬂy displayed letter string which can subtly affect the
way the main stimulus word is read. For example, a prime like FRUIT makes it
easier for people to decide that APPLE is a word. Laboratory studies typically
manipulate prime types and stimulus words to show that L2 primes are less
effective than L1 primes, and that L2 stimulus words are more difﬁcult to process
than L1 stimulus words. Work of this type typically relies on very large and tightly
controlled stimulus sets, and this makes it difﬁcult to use the methodology with
low-level learners. The method also relies on specialist computer software, such as
DMDX (Forster & Forster, 2003), or E-Prime (Schneider, Eschman, & Zuccolotto,
2002) but this software is expensive and can be difﬁcult to work with outside the
conﬁnes of a well-equipped laboratory, or well-controlled testing situation.
There have been only a few attempts to develop practical word recognition tests
which might be usable outside the laboratory, and they have not been very suc-
cessful. Below, we will review some of the most important of these.
Laufer and Nation (2001) investigated the relationship between vocabulary size,
word frequency and ﬂuency (of word recognition). Their aim was to measure the
speed with which subjects match target words at various frequency levels with their
meanings. To do this, they made the Vocabulary Recognition Speed Test (VORST),
a computerized version of the standard vocabulary size test, Nation’s Vocabulary
Levels Test (VLT). In the VLT, each item consists of three target words and a set of
six deﬁnitions (which may be single words), and subjects are required to indicate
which of the six definitions best fit the three target words. In VORST, the items are
split up so that only a single target word is presented alongside the six definitions.
[A sample item from VORST (Laufer & Nation, 2001) is reproduced in the original chapter.]
The software records the time from the appearance of an item until a choice is
made. Two more items are subsequently displayed with the same block of six words
as choices. Bizarrely, once the three selections are completed, test takers are given
the chance to amend their answers. If a new choice is made, the new response time
replaces the ﬁrst latency. The mean response time for each block of six words and
the mean response time in each word frequency level are recorded. Laufer and
Nation (2001) claim that subjects with larger vocabulary size generated faster
reaction times. Further, they claimed that many words have to be acquired before some
of this vocabulary becomes available for automatic recognition.
Many aspects of this test were unsatisfactory. The VLT is a complex tool that
was originally created speciﬁcally to assess vocabulary size, so re-deploying it for
assessing lexical ﬂuency is very questionable (e.g. Kamimoto, 2004). In addition,
the length of the six deﬁnitions is not consistent—low-frequency target words
generally require longer deﬁnitions, so the central issue of word recognition speed
is compromised. The decision to allow subjects to amend their answers at the end of
each block completely invalidates the test as a measure of lexical accessibility.
Overall, VORST fails to meet Stanovich’s requirements for a “clean test” (1982,
p. 487). That is, in the business of measuring very speciﬁc psycholinguistic phe-
nomena such as the time needed for word recognition, it is important to minimize
the number and effect of complicating factors.
A more promising approach seems to be that of Harrington (2006). Searching for a
relationship between size of vocabulary and speed of word recognition,
Harrington devised a task which closely resembles the classical lexical decision
task. Target items consisted of real words and pseudo-words, with subjects being
asked to indicate whether or not they knew the meaning of the items. Unlike in the
standardized Yes/No test on which this study is based (Meara, 2005), subjects’ reaction
times were recorded. Harrington found that as the frequency of presented words
decreased, accuracy decreased and reaction times increased. Harrington also cal-
culated each subject’s coefficient of variation of reaction time (CVRT). Segalowitz
and colleagues have shown (e.g., Segalowitz, Segalowitz, & Wood, 1998) that
CVRT gets smaller as the processing of lexical items
becomes more automatic, and this would lead us to expect that more proﬁcient
subjects would exhibit smaller CVRT on the stimulus set, a ﬁnding that is partially
supported by the data. This is a superﬁcially appealing approach to assessing lexical
ﬂuency, but we feel some caution is necessary. CVRT depends on very accurate
time measurements, and it is not clear to us whether it is appropriate to use it with
the Classical Yes/No task. This task is rather more complex for L2 learners than it is
for L1 speakers, and it is not clear to us how CVRT measures will be affected by
this. This is a concern shared by Eyckmans, Van de Velde, van Hout, and Boers
(2007) who rejected the use of a computerized version of the Yes/No test for fear
that a time constraint could lead to biased responses.
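For readers unfamiliar with the measure, CVRT is simply the standard deviation of a subject’s reaction times divided by their mean reaction time. A minimal sketch in Python (our own illustration, not code from Harrington’s study):

from statistics import mean, stdev

def cv_rt(reaction_times_ms):
    # Coefficient of variation of reaction time: SD divided by mean.
    # Smaller values are taken to indicate more automatic processing.
    return stdev(reaction_times_ms) / mean(reaction_times_ms)

# A slower but consistent subject can show a lower CVRT than a faster,
# more erratic one:
print(cv_rt([820, 790, 845, 810]))   # low variability
print(cv_rt([450, 900, 520, 1100]))  # high variability

The attraction of the measure is that it is scale-free: it separates variability from raw speed. But this is exactly why it depends on very accurate timing in the first place.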
An approach which avoids many of the problems mentioned above is to be
found in Prinzmetal and Silvers (1994), who developed a low-tech approach to
word-recognition, which does not rely on advanced technology. In one of their
studies, subjects were presented with a set of three words, a stimulus word and two
other items; one identical to the stimulus word, the other differing from it by a
single letter. The subjects’ task was to read aloud the word they thought matched
the stimulus. The difficulty of this task could be varied by making the stimulus word
difficult to read, for example by showing it in a small font and/or in low contrast.
The subjects were judged on the number of items they correctly identified. An
advantage of words over non-words was found. This shows that useful data about
word recognition skill can be obtained without recourse to sensitive testing
equipment. As a result, these researchers moved away from the measurement of
recognition latencies and became concerned with assessment of word-recognition
performance. As far as we know, this approach has not been tried out with L2
learners, but it seems to us to have some promise as a low-tech assessment tool.
Our own work has also taken a low-tech approach to measuring word recog-
nition. Meara (1986) created a test methodology which could easily be imple-
mented on the rather limited home computers that were available at the time. This
test presented words hidden in a string of twenty letters such as:
[example item not reproduced here: a twenty-letter string concealing the word “simple”]
The test takers’ task was to find the embedded word (here: simple) as quickly as
possible, and the time taken to achieve this was measured. There are some critical
features to this methodology. Firstly, the hidden words are very quickly identiﬁable
by people who know them well, so English native speakers have little trouble
ﬁnding the concealed words, and typically there is little variation in the time taken
to do this. Meara reasoned that this lack of variation in the native speaker data
might form the basis of a standard to assess learners by. Secondly, this approach
generates much slower reaction times than we ﬁnd with standard lexical decision
tasks, and this makes it possible to deliver the task on ordinary computing equip-
ment that can be used in a classroom. Thirdly, the method seems to exaggerate the
differences between native speakers and learners, and this makes it much easier to
identify non-native-like performance in learners. Finally, the available
technology made it possible to monitor the performance of non-standard learners of
an L2. In a signiﬁcant departure for the time, Meara contacted a large number of L2
learners who were following a BBC TV course in Spanish (Dígame), and sent them
some specially designed computer programs on cassette tapes which allowed them
to do the necessary tests at home in their own time. Meara’s results showed that,
generally, recognition of Spanish words hidden in letter strings became faster as the
learners progressed. However, reaction times did not speed up gradually with
exposure: rather words seemed to shift suddenly from a pattern of slow reaction
times to a pattern of faster ones. This shift was not seen for all words, and this
hinted at a dynamic mechanism in which various outcomes are possible, including
delayed progress and even loss of access for some words.
Meara’s initial work was not taken up at the time, probably because the technology
developed so rapidly that his delivery mechanism quickly became obsolete.
Nevertheless, we believe that the general approach still has
much to recommend it. The idea resurfaced in the 1990s, when Meara worked on a
revision of his original work which became known as Q_Lex—one of a series of
tests that Meara developed for the EU’s Lingua programme. The work reported in
the next section is a further investigation of the Q_Lex approach.
Direct measurement of word-recognition speed remains a specialist endeavour.
As we have seen, research tends to favour elaborate designs, although simple
measures may reveal equally rich patterns of lexical skill, as Cattell showed with
his stopwatch and Prinzmetal and Silvers with their large and small words. Complex
design inevitably has an impact on the validity and reliability of lexical tests.
3 Researching a Test of Recognition of Embedded Words
Q_Lex is one of several tests devised by Meara. The principle behind these tests is
to provide quick and easy evaluation of lexical performance. The tests were
designed to be simple enough to deploy in ordinary learning situations, and robust
enough to give a reliable snapshot of learners’ ability. The tests were generally short
enough to be completed in a few minutes, but in spite of their brevity, they usually
tested signiﬁcantly more words than other readily available tests did. Typically,
they involved high frequency words as stimuli, on the grounds that this allowed the
same tests to be used with learners at different levels of proﬁciency. Typically, too,
the tests were designed so that the performance of learners could be meaningfully
compared with the performance of native speakers. The version of Q_Lex described
in this section is Q_Lex v3.0. It was written in Delphi 4, and was designed to run as
a stand-alone program on the Windows platform.
4 Test Design
In Q_Lex, 50 high-frequency 6-letter words are each hidden in a 15-letter string,
such as pajlchanceacdut (in which chance is concealed; see below).
These items are shorter than those used in the Dígame project, mainly because
high frequency English words are shorter than their Spanish counterparts. Items are
displayed on a personal computer. They appear in 20-point Arial bold font. The task
for the test taker is to identify the hidden word in each item as quickly as possible.
A timer starts with each presentation, and learners click a mouse to stop the timer
and record their time when they have identiﬁed the hidden word. Once they do this,
the program displays a set of four additional words: one word is the target word,
while the other three are words which are similar to the target word. Test takers
have to identify which of the four words they had seen hidden in the 15 letter
display. This additional check is used to confirm that they had actually correctly
identified the target word. Test takers’ performance is judged against a
native-reader standard, as described below.
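The flow of a single trial can be summarised in a short sketch. The Python below is our own illustration of the logic only (the helpers wait_for_click and ask_choice stand in for Q_Lex’s actual interface code, which was written in Delphi):

import time
import random

def run_trial(masked_string, target, distractors, wait_for_click, ask_choice):
    # One Q_Lex trial: show the masking string, time the search, then
    # confirm the answer on a four-way multiple-choice screen.
    print(masked_string)             # item displayed (20-point Arial bold)
    start = time.perf_counter()      # timer starts with the presentation
    wait_for_click()                 # the mouse click stops the timer
    latency = time.perf_counter() - start
    options = [target] + list(distractors)
    random.shuffle(options)          # target among three similar words
    chosen = ask_choice(options)     # the additional check described above
    return latency, chosen == target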
5 Masking String Design
A signiﬁcant issue with this methodology is the construction of the longer strings in
which the target words are hidden. These masking strings are not just randomly
selected letters. Rather, a procedure is followed which allows the difﬁculty of items
to be controlled. The masking strings used in Q_Lex 3.0 are “1st-order approximations
to English”, which reflect the frequency of different letters in English. The
methodology for constructing these strings is based on work by Miller (1963) and is
described in more detail in Appendix 1.
Some examples of target words hidden in zero-order approximations to English,
1st-order approximation to English and 2nd-order approximations to English are
provided in Table 1. In general, hidden words become harder to identify as the
masking strings become more English-like. However, some care needs to be taken
with higher-order approximations because these masking strings will often include
short sequences of letters which make up real words that are not the intended target
word. Items of this sort were excluded from the stimulus items used in this study.
6 The Use of Native Speaker Recognition Times to Create Norms
50 6-letter words from the JACET 8000 list (Aizawa, Ishikawa, & Murata, 2005)
were selected. They had a mean rank order of 1484. They were embedded in
1st-order approximation strings. Native English speakers then took the test to
provide the data for the norms; 18 university graduates participated. Each of
the ﬁfty test items were displayed on a personal computer screen. Subjects clicked
on a button to display each item in turn, and clicked the same button again as soon
as they saw the hidden word. This action stopped the timer, and displayed the
multiple-choice screen that allowed subjects to confirm their answers. For
pajlchanceacdut, the multiple-choice options were chance, chalet, change and
chapel. The three distractors were selected for their orthographic similarity to the test
item. The aim was to prevent test takers from selecting an answer based on memory
of a few letters from the string, and to promote searching for the whole word.
12 of 18 subjects recognized all 50 items correctly with the others missing
between one and three. The mean reaction time was 22.8 s with a standard devi-
ation of 7.42, indicating that some subjects varied quite a lot from the mean. There
was no evidence of acceleration during the test. In other words, the test was suf-
ﬁciently simple for initial reaction times to items to be similar to those towards the
end of the test. The subjects took the test again 6 months later, and most recorded
very similar recognition speeds. The correlation coefficient between the two tests
Table 1  A comparison of items presented in strings of different orders of approximation

Target   Zero-OA          1st-OA           2nd-OA
Leave    fiwleavemtsnt    lyleavekicbof    retleaveicter
Night    zqpwnightuemp    slenightrabyg    dirsnightunwi
Large    tsyjhlargegql    heclargenyiti    medbilargefou
For each item, a norm score was created. These norms were calculated as fol-
lows: for each target word, the mean reaction time of the 18 native speakers was
added to twice the standard deviation of these reaction times. 95 % of the native
speakers’ scores are faster than this norm score.
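In code, the norm construction and the norm-referenced scoring amount to very little. The sketch below is a Python illustration under our own assumptions about how the latencies are stored; it is not the Q_Lex source:

from statistics import mean, stdev

def item_norms(native_rts):
    # native_rts maps each item to the 18 native speakers' latencies.
    # Norm = mean native reaction time + 2 standard deviations.
    return {item: mean(rts) + 2 * stdev(rts)
            for item, rts in native_rts.items()}

def qlex_score(learner_results, norms):
    # learner_results maps each item to (latency, identified_correctly).
    # One point per item recognised correctly within the native norm.
    return sum(1 for item, (rt, ok) in learner_results.items()
               if ok and rt <= norms[item])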
7 English Learners’ Scores on Q_Lex
7.1 Preliminary Pilot Studies
50 ﬁrst-year university students majoring in English took part in a pre- and post-test
design, with a gap of 2 weeks. Each test started with ﬁve practice items, and then
the ﬁfty normed items described above were presented. Test takers scored one point
for each item that they recognised correctly within the native-speaker norm, giving
a top score of 50 points.
The purpose of the 2-week gap was to assess if Q_Lex could show consistent
scores in learners who should still have the same level of proﬁciency as in the ﬁrst
administration. The results showed a very marked increase in scores over this short
period. The average score in test 1 was 14.6 points and in test 2 it was 26.4 points.
In addition, the test showed a signiﬁcant practice effect, in that target words at the
end of the test were recognized signiﬁcantly faster than words at the beginning of
the test. We had not anticipated either of these effects.
In a second pilot study, the target words were embedded in 2nd-order approx-
imation strings which makes them slightly more difﬁcult to detect. The prediction
was that better concealed words would be visible to learners who know them better.
Contrary to expectation, this led to a worse outcome in terms of reliability across
the two tests. However, other aspects of the results were noteworthy. We compared
the scores of groups of new ﬁrst-year students embarking on an English degree, and
other students entering other majors, who would not be studying English. In Japan,
all students have to study English until the end of high school, and all of our
students had to pass the university entrance examination which assessed English
ability. At ﬁrst, the scores of these two groups were comparable. A year later, the
average scores of the English major group had increased signiﬁcantly, while those
of the non-English major group had significantly fallen. This suggested that Q_Lex
might be sensitive enough to measure changes in both lexical acquisition and attrition.
Our concern was that while Q_Lex seemed to be good at detecting large shifts in
the performance of groups, it did not appear to be good at detecting smaller shifts in
the performance of individuals. Some improvement in the assessment of individuals
was achieved by conducting a Rasch analysis on the test items, selecting those that
performed best and discarding the strings that discriminated badly. This post hoc
approach was only partially successful, and did not seem to offer a solution to
reliability problems with the current version of the test.
Nevertheless, when Q_Lex was used as a tool for measuring recognition skill in
groups of learners, other tantalizing results emerged. One was the relationship between
word-recognition skill and scores on the reading section of TOEIC (a common
standardized proﬁciency test in Japan). 73 individuals took TOEIC, and a moderate
correlation (0.50) between Q_Lex and reading scores was found in the
lower-scoring half of this group. Conversely, the top half showed an almost
non-existent relationship (0.05) with word-recognition skill. To investigate this
further, a technique called the ‘moving window of correlation’ (Verspoor, de Bot,
& Lowie, 2011) was used. This involves incrementally sampling the scores of
groups of five students through the group. The result showed a clear, but
quite uneven, downward slope from left to right. In other words, whilst the group
showed a decreasing correlation, no individual was certain of showing this feature.
This is consistent with a dynamic view of second language development, where we
do not expect individuals to develop in predictable stages. Rather, development is
messy and unpredictable. Nonetheless, research tells us that poorer readers are less
skilled at automatic word decoding, and they try to make up for this by relying on
global, top-down skills (Grabe, 2009, p. 28). The fact that the correlations with
word recognition decrease among stronger readers reminds us that
word-recognition ability is not a sufﬁcient condition for effective reading (Koda,
2005). These Q_Lex results appear to map this facet of reading skill. So one way
Q_Lex might be used is to identify a deficit in word-recognition proficiency, rather
than as a test of the presence of this skill in learners.
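The windowing itself is straightforward to reconstruct. In the Python sketch below (our reconstruction of the idea, not Verspoor, de Bot, and Lowie’s code), students are ordered by TOEIC reading score, and Q_Lex scores are correlated within each successive window of five:

from statistics import correlation  # available from Python 3.10

def moving_window_r(pairs, window=5):
    # pairs: one (toeic_reading, qlex_score) tuple per student.
    ordered = sorted(pairs)                   # order by reading score
    rs = []
    for i in range(len(ordered) - window + 1):
        xs, ys = zip(*ordered[i:i + window])  # one window of five
        rs.append(correlation(xs, ys))
    return rs

Plotted from the lowest-scoring window to the highest, these values traced the uneven downward slope described above.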
Overall, it seems that the unpredictable variation in scores for any learner
between two tests is not some fatal weakness of the test, but rather a reflection of
a usual feature of learners’ developing skills: they tend to be unstable and erratic.
8 Further Attempts at Improving the Reliability
of the Test
We felt that the content validity of Q_Lex could be improved so that the initial
scores would better reﬂect the ability of learners. We still had one more option on
the strings for item masks: to use zero-order approximation instead of more
English-like strings. Our aim was to reduce the amount of variability in scores due
not to the natural course of development but rather to unreliable facets of the test
design. We therefore introduced three further changes into the test design:
(a) A new set of ﬁve-letter words was selected. These had higher mean frequency
than the words used before. Shorter words are also more likely to be a single
syllable in length, and this helps reduce the occurrence of recognition based on
word parts. (For example, the word “reduce” cannot be used in Q_Lex, since
test takers might recognize “red” and stop the timer on that basis.) These items
were placed in shorter 13-letter strings. We felt the combination of shorter
words in zero-order approximation strings was likely to promote whole-word
(sight) recognition and, as a result, improve the content validity of Q_Lex. An
example of a new item is: jjovxzemptyjh
(b) In the earlier versions of the test, students could keep searching for the item for
as long as they needed. On occasion, this took over 10 s and far exceeded the
recognition time norms. This led to frustration among test takers. In the new
version of Q_Lex, as each item is displayed, the software displays a timer
which counts down from its norm value (typically about 2 s). When the counter
reaches zero, the test screen automatically changes to the answer screen to
display the four choices. The countdown is shown on the screen and decreases
in steps of 100 ms (a sketch of this logic follows this list). This format change
was more acceptable to the test takers,
and considerably reduced the total amount of time needed to ﬁnish the test.
(c) Finally, the method of measuring test takers’ recognition time was improved. In
earlier versions, an on-screen start/stop button was clicked to display each item.
Now, the test was started by clicking this button, but the reaction time was
recorded by pushing the keyboard space bar. This change allowed for much
more accurate measurement of reaction time (see Appendix 2).
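A minimal sketch of the countdown behaviour in (b), in Python (the real program was a Delphi executable; space_bar_pressed is a stand-in for its keyboard handling):

import time

def countdown(norm_seconds, space_bar_pressed=lambda: False):
    # Count down from the item's norm value (typically about 2 s),
    # updating the on-screen display in steps of 100 ms.
    remaining = round(norm_seconds, 1)
    while remaining > 0:
        if space_bar_pressed():      # the word was found in time
            return remaining
        print(f"{remaining:0.1f} s", end="\r")
        time.sleep(0.1)
        remaining = round(remaining - 0.1, 1)
    return 0.0  # timed out: switch to the four-choice answer screen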
To investigate this new format, 66 words were selected. They had a mean rank
order of 833 in the JACET word list. 20 native speakers took the test, so the norm
values necessary to test learners could be calculated. Initially, the time limit on the
new counter function was set to 2 s. They answered 91 % of the items correctly.
Their mean reaction time was 925 ms (SD = 358 ms). This was much faster than
on earlier versions of the test. The test appeared very easy and uncomplicated, and
this reassured us that, for learners, this version would be a more valid test of word
recognition.
106 female ﬁrst-year university students took the version with the new norms.
Their mean score was 36.0 (54.5 %). The test showed good reliability by the KR-21
method (0.92). Non-scoring responses were largely the result of the target word
being timed out, with a relatively small number of incorrect identifications (an
average of 7 items). Thus, it seemed that the use of 5-letter words resulted in
shorter, easier strings that provided more reliable results. We felt we had a
transparent test of learners’ word-recognition skill.
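The KR-21 figure can be checked from the score distribution alone, since the formula needs only the number of items, the mean score, and the score variance. The sketch below implements the textbook Kuder-Richardson formula 21; it is not code from our study:

from statistics import mean, pvariance

def kr21(total_scores, k):
    # total_scores: each test taker's number-correct score;
    # k: number of items (66 in this administration).
    # KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * variance))
    m, var = mean(total_scores), pvariance(total_scores)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))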
9 Rasch Analysis and the Creation of Equivalent Forms
With the new format and items, learners achieved higher scores, but we were still
not sure if these would be consistent over two tests. As explained, this is important
for assessing genuine change in word-recognition ability. In earlier rounds of
investigation, the learners had always taken the same version of the test (albeit,
usually months apart). One way to deal with this issue is to create two parallel
versions of the same test, and use them in a split-half design. This would reveal the
performance of groups of students on different, but equivalent, tests to the one they
had taken before.
The 66 items were examined with Rasch analysis. The range of infit mean-square
values was from 0.73 to 1.19. Based on this, six items were removed. The remaining
60 items were split into two 30-item sets. One set had a mean hit rate of 55.1 per item
(the number of hits by the 106 subjects mentioned above) and a mean infit
mean-square of 1.02. The figures for the other set were 56.7 hits per item and an
infit mean-square of 0.98. Further shuffling of the items resulted in Form A, with an
infit mean-square value of 1.00, and Form B with a value of 0.99. The mean number
of hits per item was 55.8 in each set. We predicted that the two sets should
produce similar scores in initial and second tests.
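The chapter does not record the exact splitting procedure, but the balancing can be approximated in a few lines: order the surviving items by difficulty, deal them alternately into two sets, and compare the sets on mean hit rate and mean infit mean-square. The Python below is only a plausible reconstruction, with invented field names:

from statistics import mean

def split_forms(items):
    # items: one dict per item, with 'hits' (correct answers out of
    # 106 subjects) and 'infit' (Rasch infit mean-square).
    ordered = sorted(items, key=lambda it: it["hits"])  # by difficulty
    form_a, form_b = ordered[0::2], ordered[1::2]       # deal alternately
    for name, form in (("A", form_a), ("B", form_b)):
        print(name, round(mean(it["hits"] for it in form), 1),
              round(mean(it["infit"] for it in form), 2))
    return form_a, form_b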
Two groups of subjects took each set. Group 1 had 47 students and Group 2 had
44 students. Re-tests were conducted 5 weeks after the initial test session. All the test
takers were native speakers of Japanese learning English as a foreign language at
university level. Test 1 mean score was 14.7 points for Form A and 15.6 for
Form B. There was no signiﬁcant difference between them. The mean scores of
groups on both forms in Test 2 were practically identical (Form A, 18.5 points;
Form B, 18.6 points). On inspection, it turned out that for the higher-scoring half of
the group, the change in scores across the tests was 0 %. Form B was judged to be
slightly more reliable. In a further study, learners took Form B twice with a gap of
only 2 weeks. The results showed that there was an increase in scores of about
10 %, a signiﬁcant improvement on earlier versions. We would have preferred to
see a smaller change but we had to satisfy ourselves that some score increase is
inevitable due to test habituation.
Overall, results suggested that the test came as close as possible to providing a
reliable initial score for many learners. However, the number of items in each form
was only 30, compared to 50 in the original version of Q_Lex. This might be
criticized as being less representative than our stated goal. In fact, due to their
higher frequency, the 30 words in each set represent a better coverage of the ﬁrst
thousand words of English. In the ﬁnal part of the chapter, we will report on a
longitudinal investigation of learners’lexical development using Form B of Q_Lex.
10 Assessing Learners’ Skill with Q_Lex
Our aim has been to develop a standard test for the practical assessment of English
learners’ word-recognition skill. The key feature of Q_Lex is that the test is
operationalized as the number of items recognized within native-speaker norms,
rather than actual speed of recognition. As described above, this is a practical
solution to the difﬁculties of exact measurement. We have presented evidence that
this test has good concurrent validity. That is, the scores of learners reﬂect their
ability at that moment, and therefore we can gain insights into how this ability
changes over time. Since the course that the students followed included an intensive
vocabulary learning programme, we also address the relationship between
vocabulary-size growth and accessibility speed of high-frequency vocabulary,
hypothesized by Laufer and Nation, and Harrington, as reviewed above. Further,
we can investigate whether there is any change in items successfully answered on
the ﬁrst test. That is, to what extent the accessibility of learners’common words
shifts over time. This should reveal more about the dynamic nature of vocabulary
knowledge. We will report on a group of first-year university learners at discrete
proficiency levels.
We had three research questions:
(a) How does the word-recognition of learners change over time?
(b) How does word recognition develop in response to vocabulary learning?
(c) How consistent are learners over time in responding to the same items?
42 ﬁrst-year university students took part (34 females, 8 males). They were
studying English for ﬁve, 90-min sessions per week in an academic preparation
course. They were from three proficiency levels: 15 were in the ‘advanced’ class, 14
in the ‘intermediate’, and 13 in the ‘basic’ class. There was also a control group of
students who were studying in the preparation course, but they did not follow the
vocabulary course. Their proficiency was similar to that of the intermediate and
advanced groups.
All participants (experimental and control group) took a test of the ﬁrst ﬁve
thousand words of English, called X_Lex (Meara & Milton, 2003), at the start and
the end of the investigation. Over 9 months, the experimental group students
studied with an online vocabulary learning system. Following a test to estimate their
vocabulary size, the website selected words to match their estimated level. Students
had to spend one hour a week learning vocabulary. All four groups (three exper-
imental, one control) took Q_Lex twice with an intervening period of 30 weeks.
Figure 1 shows the change in Q_Lex scores between the two test events. The
groups scored at fairly similar levels at the outset. The scores of all the experimental
groups increased between test events. An ANOVA showed that at Time 1 there was
no signiﬁcant difference between the groups, and likewise for the results at Time 2.
The control group recorded the same score (15.3 points) at the end as at the start.
Figure 2 shows the X_Lex scores. The results show that the Basic group made
the greatest gain at this 5 k level. The control group showed a slight fall in
vocabulary size. The online vocabulary system reported that the average gain was
1109 words (SD = 778).
Figure 3 shows the change in individual Q_Lex scores. Scores that were initially low
tended to show medium to large gains, whereas initially higher scores led to
post-test scores that fell in a narrower range. The scores did not cluster by
proficiency level.
There was a high level of consistency (68 %, SD = 16.7) in correct responses
between tests. Further, consistently answered items appeared to have better
accessibility. Figure 4 shows the change in reaction times. The left side shows the
reaction times for items that were answered correctly in both tests. The right side
shows those answered correctly in only one test. In both cases, reaction times fell
significantly (at the 0.05 level). However, items answered at both Time 1 and
Time 2 had initially faster latencies compared to those which were missed later
(t = 4.52).
Fig. 1 The pre- and post-scores on Q_Lex
Fig. 2 The pre- and post-scores on X_Lex
The results of our investigation revealed that groups of students engaged in a
full-time English programme with an explicit vocabulary-learning component
improved their Q_Lex scores over one academic
year. Conversely, the group not involved in the word-learning activity (but who
followed the same academic preparation course) showed no gain in Q_Lex scores.
That is, Q_Lex appears able to reﬂect changes in groups, based on proﬁciency and
learning-activity differences. What conclusions can therefore be drawn regarding
Q_Lex and what it reveals about the state of lexical development of learners? We
had three research questions:
Fig. 3 The change in individual scores on Q_Lex
Fig. 4 Consistency in answering times
(1) Concerning the change in word-recognition ability over time, all the experi-
mental groups showed signiﬁcant gains in Q_Lex scores. This test apparently
reﬂects changes in lexical accessibility, which are weakly linked to general
proﬁciency. The Advanced group made the biggest gain (5.0 points) and then
the Intermediate group (4.1 points) and then the Basic group (3.3 points).
(2) The evidence for the effect of vocabulary learning on word-recognition ability
is not clear. Gains in vocabulary size were conﬁrmed by the X_Lex results
(Fig. 2) in which we can see gains at the 5 K level for the Basic and
Intermediate groups. (The lack of progress for the Advanced group probably
reflects the fact that the system was giving them much lower-frequency vocabulary
to learn. In addition, X_Lex tests only the first five thousand words of English,
a level which may not have stretched the more advanced subjects.) A weak
correlation was seen between the reported gains in vocabulary size on the
on-line system over 9 months and gains in Q_Lex scores (r = 0.25).
When we look at these results by proﬁciency, a differentiated pattern appears.
The correlation between the number of words learned and Q_Lex scores was
0.64 for the intermediate group and 0.47 for the Basic group. The Advanced
group showed a negative correlation (−0.28). This might indicate that the
students who learned more frequent vocabulary on the online system (the
Basic and Intermediate groups) extended their Q_Lex scores, whereas for the
Advanced class, who studied much lower-frequency vocabulary, there seems
to have been a negative effect on their Q_Lex scores.
As mentioned, these results do not support the idea that increasingly large
vocabulary size leads to better accessibility on the high-frequency vocabulary
items of Q_Lex. In particular, this result does not match the claims of Laufer
and Nation (2001) that a larger vocabulary leads to greater accessibility.
However, these results do match the ﬁnding by Miralpeix and Meara (2014)
that there is no consistent relationship between vocabulary size and accessi-
bility skill. They also claim that the relationship is not random, and this also
appears to be reflected in the data. The result lends some support to the idea
that accessibility might be an independent dimension of vocabulary knowledge.
(3) Concerning the question of consistency in answering, a clear-cut pattern
emerged. Despite a long intervening period between administrations, all par-
ticipants managed to answer at least half of the items they had correctly
responded to in the ﬁrst test. This suggests that some words in memory are
much easier to access, and this facet of knowledge may not vary much over
time. Further, the data from Q_Lex demonstrated that these reliably recog-
nized items had signiﬁcantly faster response times compared to other items
which were recognized in only one or the other test. This also appears to
depend on the individual. In other words, this result is not due to the facility of
the items, but reﬂects a greater sensitivity to certain words among learners.
This chapter has provided a brief overview of a new testing tool that we think might
have a useful role to play in L2 vocabulary research. The test we have described is a
low-tech tool that assesses the ability of L2 learners to access words when they are
presented out of context. Unusually, the test is standardized against the performance
of native speakers on the same items. We think that this test method has a number
of features to recommend it to vocabulary researchers, and we hope that it might be
used in future to model how individual learners’ ability to process basic vocabulary
changes as their proﬁciency develops.
Appendix 1: Approximations to English
Letters randomly selected from the alphabet are known as zero-order approximation
strings. Words placed in masks made from such a selection of letters are easy to
recognize since the masking string does not resemble English, and the hidden word
stands out against this background. To increase difﬁculty, ﬁrst-order approximation
strings can be used as masking strings. To construct these strings, a letter is chosen
at random from a text, and then every nth subsequent letter is added to the string.
The end result is a masking string that reﬂects the frequency of English letters. (The
letter ‘e’appears more often than ‘z’, for example.) First order approximations have
a closer resemblance to English, so words hidden in this kind of masking string are
better camouﬂaged. Second-order approximation strings reﬂect the distribution of
2-letter pairs in English words—the sequence ‘ab’is much more likely to occur in
these strings than the sequence ‘jj’, for example. As a result, these masking strings
camouﬂage the hidden word more effectively still. The three examples below
illustrate this effect. The zero-order masking string contains only one vowel, so it is
unlike any word spelled in English. Conversely, the ﬁrst and second order masking
strings are increasingly English-like. (Note that in the second-order string, the word
‘vein’ has appeared fortuitously in the masking string. This would need to be
removed for content validity.)
Zero order approximation string: gwdfdqtablevwcu
First order approximation string: lusetablechtacvutno
Second order approximation string: einentablerveinem
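For concreteness, the three kinds of masking string can be generated along the following lines. The Python sketch below is our own illustration: it uses frequency-weighted sampling, which approximates the every-nth-letter procedure described above, and the stand-in corpus should be replaced with a large English text:

import random
from collections import Counter
from string import ascii_lowercase

TEXT = "the quick brown fox jumps over the lazy dog " * 50  # stand-in corpus
LETTERS = [c for c in TEXT if c.isalpha()]

def zero_order(n):
    # Letters drawn uniformly from the alphabet.
    return "".join(random.choices(ascii_lowercase, k=n))

def first_order(n):
    # Letters drawn in proportion to their corpus frequency.
    freq = Counter(LETTERS)
    return "".join(random.choices(list(freq), weights=list(freq.values()), k=n))

def second_order(n):
    # Each letter conditioned on the previous one (bigram statistics).
    if n <= 0:
        return ""
    bigrams = Counter(zip(LETTERS, LETTERS[1:]))
    out = [random.choice(LETTERS)]
    for _ in range(n - 1):
        followers = {b: c for (a, b), c in bigrams.items() if a == out[-1]}
        out.append(random.choices(list(followers),
                                  weights=list(followers.values()))[0])
    return "".join(out)

def embed(target, mask_len, order):
    # Hide the target word inside a masking string of mask_len letters.
    # Finished items must still be screened for accidental real words,
    # like 'vein' in the example above.
    pad = mask_len - len(target)
    left = random.randint(0, pad)
    return order(left) + target + order(pad - left)

print(embed("table", 15, first_order))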
Appendix 2: The Instructions Provided to Subjects
in the Test
[The instructions are reproduced as an image in the original chapter.]

References
Akamatsu, N. (2008). The effects of training on automatization of word recognition in English as a
foreign language. Applied Psycholinguistics, 29, 175–193.
Aizawa, K., Ishikawa, S., & Murata, M. (2005). JACET 8000 eitango. Tokyo: Kirihara Shoten.
Cattell, J. M. (1886). On the time taken up by cerebral operations. Mind, 11, 377–392.
Cattell, J. M. (1947). The time it takes to recognize and name letters, pictures, and colors. In A.
T. Poffenberger (Ed.), James McKeen Cattell: Man of science (pp. 13–25). Lancaster, PA: The
Science Press. (Translated from German: Ueber die Zeit der Erkennung und Benennung von
Schriftzeichen, Bildern und Farben. Philosophische Studien, 2 (1885), 635–650.)
Daller, H., Milton, J., & Treffers-Daller, J. (2007). Modelling and assessing vocabulary
knowledge. Cambridge: Cambridge University Press.
Eyckmans, J., Van de Velde, H., van Hout, R., & Boers, F. (2007). Learners’ response behaviour
in Yes/No vocabulary tests. In H. Daller, J. Milton, & J. Treffers-Daller (Eds.), Modelling and
assessing vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Forster, K. I., & Forster, J. C. (2003). DMDX: A Windows display program with millisecond
accuracy. Behavior Research Methods, Instruments, & Computers, 35(1), 116–124.
Grabe, W. (2009). Reading in a second language. Moving from theory to practice. Cambridge:
Cambridge University Press.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical proficiency.
EUROSLA Yearbook, 6, 147–168. Amsterdam: John Benjamins.
Hulstijn, J. H. (2001). Intentional and incidental second language vocabulary learning: A
reappraisal of elaboration, rehearsal and automaticity. In P. Robinson (Ed.), Cognition and
second language instruction (pp. 258–286). Cambridge: Cambridge University Press.
Kamimoto, T. (2004). The vocabulary levels test: A relationship between target items and
distractors. Paper Presented at the 2004 Eurosla conference, San Sebastian, Spain.
Koda, K. (2005). Insights into second language reading: A cross-linguistic approach. Cambridge:
Cambridge University Press.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning recognition: Are
they related? EUROSLA Yearbook, 1, 7–28. Amsterdam: John Benjamins.
Meara, P. M. (1986). The Dígame project. In V. J. Cook (Ed.), Experimental approaches to second
language learning (pp. 101–110). Oxford: Pergamon Institute of English.
Meara, P. M. (2005). Designing vocabulary tests for English, Spanish and other languages. In C.
Butler, M. A. Gómez González, & S. Doval Suárez (Eds.), The dynamics of language use:
Functional and contrastive perspectives (pp. 271–285). Amsterdam: John Benjamins.
Meara, P. M., & Milton, J. L. (2003). X_Lex v 2.05. Swansea: Lognostics. Retrieved from http://
Miller, G. A. (1963). Language and communication. New York: McGraw-Hill.
Miralpeix, I., & Meara, P. M. (2014). Knowledge of the written word. In J. Milton & T. Fitzpatrick
(Eds.), Dimensions of vocabulary knowledge (pp. 30–44). Basingstoke: Palgrave Macmillan.
Milton, J., & Fitzpatrick, T. (Eds.). (2014). Dimensions of vocabulary knowledge. New York: Palgrave Macmillan.
Prinzmetal, W., & Silvers, B. (1994). The word without the tachistoscope. Perception and
Psychophysics, 55(3), 296–312.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime user’s guide. Pittsburgh:
Psychology Software Tools.
Segalowitz, S. J., Segalowitz, N. S., & Wood, A. G. (1998). Assessing the development of
automaticity in second language word recognition. Applied Psycholinguistics, 19, 53–67.
Stanovich, K. E. (1982). Individual differences in the cognitive processes of reading: I. Word
decoding. Journal of Learning Disabilities, 15, 485–493.
Verspoor, M. H., de Bot, K., & Lowie, W. (2011). A dynamic approach to second language
development: Methods and techniques. Amsterdam: John Benjamins.