Editors’ Introduction
Conventions, Terminology and an Overview of the Book
Over the last 20 years vocabulary research has grown from a Cinderella subject in foreign language teaching and research to achieve a position of some salience.
Vocabulary is now considered integral to just about every aspect of language
knowledge. With this development have come standard and widely used tests, such as
vocabulary size and lexical richness measures, and very commonly accepted
metaphors, such as the mental lexicon as a web of words. Less widely known outside
academic circles, however, is the extensive work on learners’ lexis and the utility,
reliability and validity of the tests we use to measure and investigate vocabulary
knowledge and growth. Vocabulary is a lively and vital area of innovation in
approach and research. The penalty we pay for working in so vital a subject is that
even recent, and excellent, surveys of the field are rapidly overtaken by new ideas,
fresh insights in modelling and testing, a healthy re-evaluation of the principles we
work under, and an increasingly large body of empirical research. The intention of
this volume, therefore, is to place in the hands of the reader some of these new ideas
and insights. It brings together contributions from internationally renowned
researchers in this field to explain much of the background to study in this area, and
reconsider some of the ideas which underpin the tests we use. It introduces to a wider
audience the concerns, the new approaches and developments in the field of
vocabulary research and testing.
To place these ideas in context, and to provide a point of entry for non-specialists in
this field, this introduction will survey the conventions and terminology of vocabulary
study which, if you are not aware of them, can make even simple ideas impenetrably
difficult. The background this provides should allow the chapters which follow to be
placed in context and help to explain why the concerns they address are of importance
to researchers. In the second half of this Introduction, therefore, the chapters are
summarised.
Conventions and Terminology
What is a word?
One of our colleagues used to begin lectures on vocabulary learning by asking his
audience how many words they thought they knew in English. Most people did not
know, of course, and had to guess, and the answers they suggested varied enormously
from 200 words up to many millions. These extremes are unusual but in truth it was a
question without a clear answer, because the answer depends on what you mean by a
word and therefore what your unit of counting is. According to context and need,
researchers can count types, tokens, running words, lemmas, and word families as
words.
In one sense it is obvious what a word is. Words are the black marks you are reading
on this page and you know when one word stops and another one starts because there
are spaces between words. There are occasions when it is appropriate to use a
definition of this kind in making word counts, for example, in counting the number of
words in a student’s essay or the number of words in the huge corpus that a dictionary
compiler will collect so it can use real examples of word use in its definitions. When
counting words this way we often refer to them as tokens so it is quite clear what we
are talking about. Sometimes too we refer to running words with much the same
meaning, for example, if you consult a dictionary corpus you may be presented with
the information that the word maunder occurs on average only once every several
million running words.
In addition to knowing the number of words in a text or a corpus, researchers sometimes want to know the number of different words that occur. The terms tokens
and types are used to distinguish between these two types of count. Tokens refers to
the number of words in a text or corpus while types refers to the number of different
words. In the example below
The cat sat on the mat.
there are six tokens, a total of six words, but the word the occurs twice so there
are only five types.
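The distinction is simple enough to reproduce in a few lines of code. The sketch below is a minimal illustration, assuming a naive tokenisation that lower-cases the text, strips the final full stop and splits on spaces:

```python
# A minimal token and type count for the example sentence. The
# tokenisation rule (lower-case, strip punctuation, split on spaces)
# is a simplifying assumption.
text = "The cat sat on the mat."

tokens = text.lower().replace(".", "").split()  # every running word
types = set(tokens)                             # each different word once

print(len(tokens))  # 6 tokens
print(len(types))   # 5 types, since "the" occurs twice
```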
But even with a catch-all definition of this kind there are problems. How do you count contractions such as don’t, it’s or won’t: are these one word or two? Is the number at
the bottom of this page a word or not? Are the names we have put on the title page of
this book really words? And if you are counting words in speech rather than writing,
how do you count the ums and ers which always occur? Practice can vary according
to the needs of the researcher but generally, numbers, proper nouns and names, and
false starts and mistakes are excluded from word counts.
Once you start counting the number of words a person knows more difficulties raise
their heads. If a student learns the verb to work, for example, this will involve learning
the form works for use with third person singular in the present simple tense, the form
worked for use in the simple past, and working for use with continuous tenses. The
question arises whether the learner has learned one word or four here. These
inflections, these changes to the root form, are highly regular and can be applied to
most verbs in English. Provided a few simple rules of grammar are known the learner
only needs to learn a new root form to have these other forms at his or her disposal
and available for use. It is often convenient, therefore, to think of these words as a
single unit since these forms do not have to be learned separately by the learner;
learning one form means all the others can be deduced and will also be known. This
has the profound advantage in describing vocabulary knowledge of reducing the
numbers we work with to manageable levels; a few thousands or tens of thousands
instead of hundreds of thousands. A collection of words like these, a root form and the
most frequent regular inflections, is known as a lemma. Where a noun has a regular
plural formed by adding –s, as in orange and oranges, for example, these two words
would also form a single lemma. In most word frequency counts and estimates of
learners’ vocabulary sizes, the lemma is used as the basis of counting and work,
works, working and worked would be counted as just one lemma. Rather confusingly,
lemmas are often called words. Unfortunately, we are not always consistent in doing
this. In both Nation’s Vocabulary Levels Test (1983) and Meara and Milton’s X-Lex (2003b) word knowledge is tested in what are called 1000 word frequency bands. In
fact, lemmatised word lists have been used and, more properly, these should be called
1000 lemma frequency bands.
But some estimates of a speaker’s vocabulary size (for example, Goulden et al.’s (1990) estimate of 17,000 words for educated native speakers of English) can use a larger
unit still and are actually estimates of the number of word families a person knows.
The forms of a word which can be included in a lemma are fairly limited. But words
often have lots of other forms which are clearly related to the root form. The lemma
work, for example, includes working, works and worked but does not include worker
although this is obviously a derived form which is very closely related. The lemma
govern would include governs, governing and governed but not governor or
government. Closely related words like this would be called a word family. Clearly
estimates of size based on the lemma and on the word family will be quite different.
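The difference the unit of counting makes can be illustrated with a toy script. The suffix-stripping rules below are invented for this one example and are far cruder than Bauer and Nation's affix groupings; a real lemmatiser would consult a dictionary rather than strip endings blindly:

```python
# Toy illustration of how the unit of counting changes the totals.
# INFLECTIONS stands in for lemma-level affixes, DERIVATIONS for
# family-level affixes; both rule sets are invented for this example.
forms = ["work", "works", "working", "worked", "worker", "workers"]

INFLECTIONS = ("s", "ing", "ed")
DERIVATIONS = ("er", "ers")

def strip_suffix(word, suffixes):
    # Remove the longest matching suffix, leaving a stem of 3+ letters.
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

lemmas = {strip_suffix(w, INFLECTIONS) for w in forms}
families = {strip_suffix(strip_suffix(w, INFLECTIONS), DERIVATIONS)
            for w in forms}

print(len(forms))     # 6 if we count each different form (type)
print(len(lemmas))    # 2: "work" and "worker" remain distinct
print(len(families))  # 1: every form reduces to "work"
```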
At first sight this may appear confusing and quite unnecessarily complex. Certainly,
researchers often contribute to the confusion both by being unclear as to the units they
use, and by adopting idiosyncratic definitions. The divisions between a word, a lemma
and a word family are not entirely arbitrary, however, and are based on Bauer and
Nation’s (1993) frequency-based groupings of affixes in English. Lemmas will
generally be words made by using affixes from the top three groups, and word
families from the top six. Further, this hierarchy of word unit we use is not the product
of whim on the part of researchers but rather a result of the need to reduce the figures
we work with to manageable proportions. In measuring distance we use millimetres, centimetres, metres and kilometres, to name just a few, according to the size of what
is being measured and in measuring vocabulary we are behaving no differently.
What is knowing a word?
If defining a word has presented problems, then deciding when a word is actually
known is no easier. There is quite a list of qualities which might be included in the
definition of knowing and this has been added to over the years. Nation’s list, in Table
1, is the latest and most comprehensive incarnation.
Table 1: What is involved in knowing a word (from Nation, 2001, 27)

Form
   spoken                  R  What does the word sound like?
                           P  How is the word pronounced?
   written                 R  What does the word look like?
                           P  How is the word written and spelled?
   word parts              R  What parts are recognisable in this word?
                           P  What word parts are needed to express the meaning?

Meaning
   form and meaning        R  What meaning does this word form signal?
                           P  What word form can be used to express this meaning?
   concepts and referents  R  What is included in the concept?
                           P  What items can the concept refer to?
   associations            R  What other words does this word make us think of?
                           P  What other words could we use instead of this one?

Use
   grammatical functions   R  In what patterns does the word occur?
                           P  In what patterns must we use this word?
   collocations            R  What words or types of word occur with this one?
                           P  What words or types of words must we use with this one?
   constraints on use      R  Where, when and how often would we meet this word?
                           P  Where, when and how often can we use this word?

(R = receptive knowledge; P = productive knowledge)
It is possible to calculate very different ideas of a learner’s word knowledge
depending on how knowing is defined. Perhaps the most basic, catch-all definition
would be simple passive word recognition; the learner recognises the form of a word
and that it is a word rather than a meaningless jumble of symbols. This aspect of
knowing is clearly identified in Nation’s table. There are several tests (e.g. Meara and
Jones’s EVST, 1990; Meara and Milton’s X-Lex, 2003b) which use this kind of
definition of knowing. In principle, a calculation made using this kind of definition
ought to include every other kind of knowledge since, presumably, a learner could not
reasonably use, attach a meaning to, or find a correct collocation for something they do
not even recognise as a word. Most of the tests we use to make this kind of calculation
are based on written forms of knowledge and these predict a range of reading and
writing-based language abilities well, but the ability to recognise or use the spoken
form of a word is much less well investigated. Interestingly, initial results from
studies using phonologically based vocabulary size tests (Milton, 2005) suggest that
aural word recognition predicts oral proficiency particularly well. This ties in with
Daller and Huijuan Xue’s chapter which addresses the problems of finding a good
measure of lexical knowledge to tie in with oral proficiency.
A second very common definition of knowing a word can be found within the
Meaning section of Nation’s table. This rests on the idea that a word is known if the
learner can attach a meaning, such as an explanation or a translation, to a foreign
language word. Calculations of vocabulary knowledge and size made on this basis
ought to be smaller than those made on the basis of passive word recognition. Every
learner must be familiar with the sensation of encountering a word they know they have seen before but cannot, for the moment, attach a meaning to. It seems this aspect
of knowledge can be surprisingly fragile in the foreign language learner’s vocabulary.
The link between form and meaning can disappear quite suddenly and without
explanation and, just as suddenly, reappear. The chapters by Meara and Wilks and by
Schur which investigate the applicability of various kinds of network theory to
vocabulary, can begin to make this kind of phenomenon explicable but, as their
chapters show, this work is still in its infancy. It is a phenomenon which also
underlies the questions encountered in the chapters by Eyckmans and by Fitzpatrick,
where differences in translation and receptive test scores challenge easy interpretation.
Nation’s table of knowing a word draws attention to a further distinction; that of
receptive and productive or passive and active word knowledge: the column with R
and P in it. The distinction here lies in the difference between the words you can
handle in the context of reading or listening to speech, and those you can call readily
to mind when you need to speak or write in the foreign language. Usually the
additional context information which comes with written or spoken language means
that a learner’s passive or receptive vocabulary appears to exceed the productive or
active vocabulary. The relationship between the two types of knowledge is not clear,
and may vary according to a variety of individual learner characteristics or the type of
test used. But it has been quite extensively researched, going back to Stoddard in 1929.
Estimates vary but the range of studies reviewed in Waring (1997) suggest that
productive vocabulary size is about 50% of receptive vocabulary size, and presumably one is a sub-set of the other. There are, of course, methodological problems inherent in
measuring these two different kinds of vocabulary in a way which is strictly
equivalent and these problems haunt several of the contributors to this volume such as
Richards and Malvern, and van Hout and Vermeer. These methods are considered
later on in this introduction.
Other aspects of word knowledge seem much less well researched and standard tests
are lacking, in some cases we lack even an agreed approach to testing. For example,
Nation includes in his section on Form the idea that word knowledge can include
knowledge at the level of the morpheme. Our concentration on calculating word
knowledge using the lemma or the word family as the basic unit means that our tests
cannot tell us about knowledge at this level of detail. But the testing problems
experienced by Eyckmans, and described in her chapter, may result at least in part
from learners’ abilities to make educated guesses about the meaning of words from
the parts which words are composed of. The concern is that this kind of guesswork
may destabilise some tests of vocabulary knowledge and make the scores they
produce less useful than we may think they are. Again, knowledge of a word’s
collocations, connotation and preferred associations, is an area where we struggle to
find a single, simple way of characterising this knowledge in a way where it can be
usefully quantified and tested. Again, also, our concentration on tests which use the lemma, and the fact that we often investigate infrequent vocabulary, means that the most frequent linking words tend not to be investigated. Such information falls
below the radar of the tests we use. The chapters by Wilks and Meara, and by Schur,
are a direct attempt to suggest models of analysis and testing methods which might
help fill in these gaps in our knowledge.
What is the lexical space?
It is clear from this discussion that vocabulary knowledge is complex and multi-
faceted. The qualities we investigate are not easily described or tested and we tend to
resort to analogy and metaphor to try to illuminate the way words are learned and
stored. One such idea is the idea of lexical space where a learner’s vocabulary
knowledge is described as a three dimensional space, and where each dimension can
represent an aspect of knowing a word. It is usually seen as something like that in
Figure 1.
Figure 1: the lexical space: dimensions of word knowledge and ability (a three-dimensional space with axes for breadth, depth and fluency)
On the horizontal axis is the concept of lexical breadth which is intended, in essence,
to define the number of words a learner knows regardless of how well he or she
knows them. This would include the Form and the form and meaning elements of
Nation’s table. Vocabulary size tests, passive receptive style tests and translation tests
are all tests of lexical breadth, although they may produce varying estimates of size
and knowledge. The chapters by Milton and Eyckmans are directly concerned with
how to make estimates of vocabulary breadth.
On the vertical axis is the concept of lexical depth which is intended to define how
much the learner knows about the words he or she knows. This would include the
elements of concepts and referents, associations, grammatical functions, collocations
and constraints on use from Nation’s list. These elements tend to be tested separately
probably because this is a disparate list of word qualities where we have not yet
succeeded in pinning down a unifying idea or model which can provide the basis of a
comprehensive test of depth. This is not for want of trying, however, and the precise
relationship between the lexicon and grammar has been the subject of considerable
research effort (e.g. Hunston and Francis, 2000). This area might properly be the
subject of an entire, but separate volume. Space in this volume permits only limited
reference to this area based on a further metaphor, that of a web of words, which is
often used to describe this axis and the way the words interact with each other. The
chapters by Wilks and Meara and by Schur, deal with precisely this issue and
investigate the possibility of turning this metaphor into a model of lexical depth which
can be empirically tested with real language users. Meara and Wolter (2004) have
developed a test which allows learners to draw up these webs so that a score can be
assigned to them. At first sight this looks like a promising innovation but it is early days.
The final axis is that of fluency and this is intended to define how readily and
automatically a learner is able to use the words they know and the information they
have on the use of these words. This might involve the speed and accuracy with which
a word can be recognised or called to mind in speech or writing. It would probably be
true to say that we have no widely used or generally accepted test of vocabulary
fluency. Some very promising ideas are emerging (for example, Shiotsu, 2001) but it
is interesting to note that this field is still sufficiently inchoate that no papers were
presented at the Vocabulary Workshop which gave rise to this volume.
These three axes define the lexical space and, in principle, it becomes possible to
locate a learner’s vocabulary knowledge within this space. Some learners may have
large vocabularies but are very limited in the speed and ease with which they can call
these words to mind and put them to use communicatively. These learners ought to be
placed well along the breadth axis but less far along the fluency or depth axes. Other learners may appear to have different characteristics and possess comparatively few vocabulary resources but considerable fluency in calling these to mind and using them in communication. These learners would occupy a different location in the lexical space, less far along the breadth axis but further along the fluency axis. This idea is both attractive and convenient: it makes it easy to define the nature of a test or a learner in a few short words which we can all understand. But the notion of lexical
space is still, at root, a metaphor with all the drawbacks which go with that. The
nature of the lexicon is not really a three dimensional space and attempts to turn the
metaphor into a detailed model which can be tested empirically run into trouble. The
precise nature of the depth axis is a case in point and Read, who uses the term in his review of the field (Read, 2000), questions the nature of this axis in later work (Read, 2004).
What are the conventional ways of measuring knowledge in this lexical space?
While we lack a comprehensive range of tests across the whole field of vocabulary
knowledge, we do have a small number of well-established tests in the area of
vocabulary breadth and, more particularly, passive receptive vocabulary knowledge.
At first sight, testing how much a person knows of the enormous number of words in the English language appears a daunting task. There are tens or even hundreds of thousands of words, depending on how you define a word, potentially available for learners
to acquire and taking a reasonable sample of these words to test for a learner’s
knowledge should be difficult. A learner may only know a few of these words so the
task is like searching for a needle in a haystack. Nonetheless, it does appear possible
to make a very good sample of words and this is because of the way words are used in
language. Words do not occur randomly in speech or writing and some occur very
much more frequently than others. Thus some verbs such as make or do, prepositions
such as in and on and pronouns such as I or you, are used a lot by every speaker while
other words such as anamnestic and mitogenic are very uncommon and might not be
used at all even by native speakers except in the most specialist of situations. Where
learners are exposed to their new language, therefore, they encounter some words
much more often than others, and some words they never encounter at all. Not
surprisingly, learners are more likely to learn the frequent words than the infrequent
words, or words so rare they never even see or hear them. Tests such as Nation’s
Levels Test and Meara and Milton’s X-Lex take advantage of this to sample the most
frequent words to form believable estimates of overall vocabulary knowledge in
learners. A good test is possible because it can be concentrated in areas where
learning is likely to occur, rather than spread thinly in areas where there is no
knowledge to be detected. Although these tests work well, the frequency effect is an
assumption which does not appear to have been empirically tested and the second
chapter in this volume addresses this issue directly and asks not just whether the effect
really exists but also how strong it is and whether all learners are affected equally.
The idea of counting the frequency of words in a language tends to be thought of as a
recent innovation and something we can now do because we have computers which
can handle millions of words. But the idea is, in reality, very old and goes back at
least to the study of the writings of the Prophet Mohammed in the 8th century. The
earliest counts made for pedagogical reasons were made in the 1930s and 1940s and
these still provide useful lists, but modern resources such as the Bank of English and the British National Corpus now make very large corpora available to researchers and other organisations, and these can be broken down so it is possible to investigate, say, only the frequencies of written English or of spoken English. Modern tests tend to be based on corpora and frequency counts of this kind and, for convenience, these counts are usually divided into 1000 word frequency bands.
While the Levels Test and X-Lex estimate knowledge within the same area of the
lexical space and are based on frequency counts of English, they are nonetheless two
very different tests. X-Lex, for example, samples the 5000 most frequent words of
English drawing 20 words from each of the five 1000 word frequency bands within
this list and using this to make an estimate of the number of words known out of these
5000 words. This is probably the most general and all-encompassing kind of estimate
of vocabulary size possible since it is a Yes/No or checklist test which requires
learners merely to say if they recognise a test word as a word. An example from a
paper version is shown in Figure 2 and gives an idea of the format.
Figure 2: example items from a paper version of the Yes/No test format (French)
Please look at these words. Some of these words are real French words and some
are invented but are made to look like real words. Please tick the words that you know
or can use. Here is an example.
chien
Thank you for your help.
de distance abattre absurde achevé manchir
Any objective, or forced-answer type test of this kind is open to guesswork on the part
of the test-taker. The Yes/No style tests attempt to estimate the amount of guesswork
which is going on, and to adjust the scores the test gives, through the inclusion of
pseudowords in the test. Pseudowords are words which look and sound like real
words but do not really exist. In the French example in Figure 2, manchir is such a
word. Usually this system works well and gives reliable results but in some learners
the pseudowords attract very high numbers of Yes answers and it is less clear in
these circumstances that the test is working as it should and is giving results which are
useful. This question is addressed in chapter three of this volume.
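As an illustration of how such an adjustment can work, the sketch below applies the classical linear correction for guessing from signal detection approaches, in which the false alarm rate discounts the hit rate. This is only one of several formulae in circulation, and the adjustments built into actual tests such as X-Lex may differ:

```python
# One classical correction for guessing in a Yes/No test: the false
# alarm rate f (pseudowords claimed as known) discounts the hit rate
# h (real words claimed as known) as (h - f) / (1 - f). This is an
# illustrative formula, not necessarily the one a published test uses.
def corrected_score(hits, n_real, false_alarms, n_pseudo):
    h = hits / n_real
    f = false_alarms / n_pseudo
    if f >= 1.0:
        return 0.0  # every pseudoword accepted: no usable signal
    return max(0.0, (h - f) / (1.0 - f))

# A learner saying Yes to 80 of 100 real words and 4 of 20 pseudowords:
print(corrected_score(80, 100, 4, 20))  # 0.75, down from a raw 0.80
```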
Nation’s Levels Test has many similarities with X-Lex in that it tests vocabulary
breadth and takes advantage of frequency information in its construction. But it appears different since it adopts a complex multiple-choice format. An example of the
question type it uses is shown in Figure 3.
Figure 3: Levels Test example taken from Nation (2001, 416)
This is a vocabulary test. You must choose the right word to go with
each meaning. Write the number of that word next to its meaning.
1 business
2 clock ______ part of a house
3 horse ______ animal with four legs
4 pencil ______ something used for writing
5 shoe
6 wall
There is no adjustment for guesswork but the presence of a range of answers means
that learners who take the test must use the kind of test strategy which Eyckmans
describes in chapter 3 in order to try to maximise their marks. Even if they do not
know the meaning of test words they will be in a position to pick a likely answer,
because that is the nature of a test and choosing an uncertain answer is more likely to
gain marks than not answering at all. Kamimoto (2005) recently reported think-aloud protocols conducted with learners taking this test. The feedback he received suggests
that the learners’ choice of guessing strategy can produce considerable differences in
score and the Levels Test might have much more variation according to guesswork
than any of us imagined. However, there is no explicit way in the test for taking
account of this phenomenon.
Like X-lex the Levels Test samples a range of frequency bands in order to gauge
learners’ overall vocabulary knowledge. There are 18 words tested at each of the second, third, fifth and tenth 1000 word frequency bands. In addition it samples
knowledge of the University Word List (Nation, 1990) or the Academic Word List
(Coxhead, 2000); a list of words which are particularly frequent in academic settings such as school or university.
Both these tests have been able to take advantage of frequency to sample the words in
a language and form a picture of the learners’ vocabulary resources. Because they are
receptive tests the tester has the opportunity to control the language being tested and to choose the words being investigated. We are not only interested in estimating
receptive or passive knowledge, however. We also need to be able to estimate the
vocabulary resources which learners are able to use productively in speaking or writing. Testing here is much more difficult because we are reliant on the language
which the learners themselves produce. As language users we are very adaptable
concerning the words we choose to speak or write. We may vary according to who we
are speaking to, for example. A doctor talking to an adult patient may mention acute
abdominal pain but to a child the same condition might be tummy ache. Or we can
vary according to the task we are called upon to perform. A letter or an essay will
contain lots of highly frequent function words such as make, can and do, but a
shopping list will omit these things and consist entirely of a list of nouns. The
significance of this is that a single small piece of writing or speech, and most learners
can only produce small amounts of language, may be untypical of what a learner
knows and can do in other circumstances. So a single, small piece of writing or speech
may tell us very little about the extent of productive word knowledge a learner
possesses. It is a difficulty which resolves itself into two questions which run through
the whole of section three in this volume. How do you collect a sample of productive
language from a learner which is, in some way, typical or representative of the
learner’s productive vocabulary ability? And how do you analyse the language which
is produced to create a measure which you can use to compare one learner’s
knowledge with another in some meaningful way?
Attempts to arrive at such a measure, and answer the second question, usually revolve
around the concept of lexical richness. The term covers several aspects of vocabulary
use (see Read, 2000: 200 ff.) such as lexical diversity, which is “the variety of active
vocabulary deployed by a speaker or writer” (Malvern and Richards, 2002, 87), and
lexical sophistication (the number of low frequency words) or lexical density (the
ratio of content and function words). These notions are best understood through
examples.
The cat sat on the mat.
We have already established that this sentence has six tokens but only five types, giving a type-token ratio of 5/6 or 0.833. Calculating the TTR in this way provides a simple measure of lexical diversity.
By looking at the numbers of function and content words we can make a calculation
of lexical density for the same sentence. Function words are words like and, of, and is which do not carry much meaning in themselves but are essential to creating grammatical sentences or phrases. Prepositions, conjunctions, pronouns and auxiliary verbs would almost always be function words. Content words tend to be nouns, main verbs and adjectives and are the words which contribute most of the meaning to anything that is said or written. In the example sentence there are three function words, the occurs twice and on once, and three content words, cat, sat and mat. The ratio of content to function words would thus be 3/3, giving a lexical density of 1.
By looking at the number of low frequency words it is possible to make a calculation
of lexical sophistication. There are computerised programs which can do this automatically, such as P_Lex (Meara and Bell, 2001), which identifies low frequency words systematically by defining them as any word not among the 2000 most frequent words of English. In the example, there is only one such word out of the six in the sentence and that is mat. A ratio of 1/6 would give a score for lexical sophistication of 0.167. P_Lex attempts to overcome difficulties created by text length by making this calculation for every successive 10 word block in a piece of writing. It appears (Milton, 2004) that normal, non-academic text produces scores on this scale of about 2.5. About 70-80% of normal text, it would seem, is made up of very frequent vocabulary.
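The three calculations can be reproduced in a few lines of code. In the sketch below the function-word list and the stand-in high frequency list are toy assumptions, just large enough for this one sentence; a real analysis would use a full closed-class list and a 2000-word frequency list:

```python
# Reproducing the worked figures for "The cat sat on the mat."
tokens = "the cat sat on the mat".split()
types = set(tokens)

FUNCTION_WORDS = {"the", "on"}                # toy closed-class list
HIGH_FREQUENCY = {"the", "on", "cat", "sat"}  # stand-in for a 2000-word list

ttr = len(types) / len(tokens)
content = [t for t in tokens if t not in FUNCTION_WORDS]
function = [t for t in tokens if t in FUNCTION_WORDS]
density = len(content) / len(function)
rare = [t for t in tokens if t not in HIGH_FREQUENCY]
sophistication = len(rare) / len(tokens)

print(round(ttr, 3))             # 0.833, lexical diversity
print(density)                   # 1.0, content to function word ratio
print(round(sophistication, 3))  # 0.167, only "mat" is infrequent here
```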
Since we know, or at least we think we do, that frequency and vocabulary learning are closely associated, the significance of the number of low frequency words a learner can use is potentially of great interest. Laufer and Nation’s (1995) Lexical Frequency Profile (LFP) is able to take a text and produce a profile giving the numbers and proportions of vocabulary in the first 1000 word frequency band, the second 1000 word frequency band and the university word list. An example of the information this program produces is shown in Figure 4.
Figure 4: Example of a vocabulary profile produced by Laufer and Nation’s (1995) Lexical Frequency Profile

WORD LIST          TOKENS/%     TYPES/%     FAMILIES
one                3499/81.8    558/59.6    380
two                 165/ 3.9     88/ 9.4     72
three               298/ 7.0    161/17.2    122
not in the lists    314/ 7.3    130/13.9    ?????
Total              4276         937         574
The results show that for the text under analysis there were 4276 running words or tokens and 937 types. Of these, 3499 or 81.8% of the tokens came from the first 1000 word frequency band, 165 or 3.9% of tokens from the second 1000 word frequency band and 298 words or 7% from the university word list. 314 words, 7.3% of tokens, were low frequency and did not occur in any of the other three categories. In any text, a surprisingly large number of words only ever occur once; in the example text above 452 words occur only once, nearly half of all types, and linguists have a name for these, which is hapax legomena or sometimes just hapax.
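Hapaxes are easy to identify once a text is tokenised, as this short count shows:

```python
# Counting hapax legomena: words which occur exactly once in a text.
from collections import Counter

tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)  # ['cat', 'on', 'mat', 'and', 'dog', 'too']
```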
There are, however, many methodological problems in the actual measurement of
lexical richness in written texts or speech. The Type-Token Ratio has been under
discussion for nearly a century (e.g. Thomson and Thompson, 1915; Johnson, 1944)
and it is still used today (for example, by Jarvis et al. 2003, who use this measure with controlled text lengths). However, the TTR has been strongly criticised as
unreliable in contexts where texts with different lengths are compared (e.g. van Hout
and Vermeer, 1988; Broeder, Extra and van Hout, 1993; and Vermeer, 2000). The
problem is that TTR is sensitive to text length. Longer texts have a tendency to
produce lower values for the TTR because the chance of a new word (type) occurring
gets lower as text length increases, since the speaker/writer has a limited number of words at his/her disposal. This means that speakers who produce longer texts get systematically “marked down” by this measure and those who produce shorter texts (often a good indicator of a lower proficiency level) get higher scores. This makes the
TTR less suitable for the measurement of spontaneous speech or writing where texts
with different lengths have to be compared.
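The effect is easy to demonstrate: computing the TTR over successively longer stretches of even a short toy text shows the ratio falling as words begin to repeat:

```python
# The TTR falls as text length grows, because words begin to repeat.
words = ("the cat sat on the mat and then the cat saw the dog "
         "and the dog sat on the mat too").split()

for n in (5, 10, 15, 20):
    prefix = words[:n]
    ttr = len(set(prefix)) / n
    print(n, round(ttr, 2))  # 0.8, 0.7, 0.6, 0.45: steadily falling
```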
The studies in section three of this volume use a variety of the measures described
here, often adapted so they are less sensitive to variations in length. They also use a
variety of ways to collect samples of productive language from which to assess lexical
richness. Most common is the use of written essays and the studies used to validate
both P_Lex and LFP rely on these. The subject matter of essays can affect scores, however, as can the register chosen by the learner, and this variation poses
problems. Other researchers use transcripts of speech which has the same potential
problem. Tidball and Treffers-Daller in chapter 7 use picture cues to prompt story
telling and this seems a promising way of standardising the nature of student output so
scores can be meaningfully compared.
Test validity and reliability
Whatever the tests we use in measuring vocabulary, there is always a concern that the tests should work properly. The concern of almost the whole of this volume is to
establish what the tests tell us so that we can interpret the scores appropriately. There
are two major issues which the users of any test will need to be sure of and these are
test reliability and test validity.
Test reliability is the accuracy with which a test measures what it is supposed to
measure. More usefully, it might be seen as a measure of consistency. If you run a vocabulary test several times on a person whose vocabulary has not changed (several times in one afternoon, for example) then the test should give the same results each time. If it does this then it is said to be reliable. If a test cannot do this then you cannot place
much faith in the scores it gives you. Reliability tells you nothing else about the
qualities of the test, whether it is testing what you think it is testing or whether it is the
right test for the use it is put to; it only says whether it is working consistently and
does not give different scores for people with the same ability. Multiple-choice and
forced answer tests generally seem to have very good reliability and part of the
credibility attached to tests of receptive vocabulary knowledge is that they give very
reliable results. Both Milton and Eyckmans touch on this in chapters 2 and 3. Part of
the difficulties which we experience in tests of productive knowledge, and which are
dealt with in great depth in Section three of this volume, is that our data collection
techniques, using unstandardised pieces of speech and writing, are just not yet
consistent enough to allow reliable results to emerge.
Test validity is the extent to which a test measures what it is supposed to measure. So,
in the case of vocabulary size, can we be sure that a test measures this quality and not
something else? There may be several separate issues involved in this notion, which is
quite a complex one. One such issue would be content validity, which is the degree to
which a test has the necessary content. Tests like the Levels Test and X-Lex can be
argued to have good content validity because they do not waste time and effort testing
words which learners will never know and use frequency information to target testing
in areas where knowledge is likely to be. Of course, if frequency and learning are not as closely related as we assume, and chapter 2 examines this possibility, then the content validity of these tests will be severely compromised. Connected with
content validity is construct validity which is whether the test measures the skill or
construct it is meant to. This is where the whole subject becomes really challenging.
Language knowledge is not a directly accessible quality like shoe size or hat size and
often has to be inferred from activities or actions which may well involve other
knowledge and abilities. This is exactly the issue raised by Eyckmans in chapter 3,
whether a test constructed using pseudowords can work well enough to be valid. It is
also a leitmotiv of all this volume. While van Hout and Vermeer, and Richards and
Malvern consider the construct of productive vocabulary tests from the point of view
of theory, later chapters tackle the problem empirically by investigating what skills
and abilities these tests predict best, for example the chapter by Daller and Huijuan
Xue.
A common way of determining whether a test is well constructed is to test learners
using two tests; a new test, for example, and a well established test of the same
language ability or skill, and compare the results. It is generally expected that if the
new test is well-constructed then its scores will correlate with those of the older test. This is called
concurrent validity and it is frequently used in the consideration of new test forms.
One of the attractions of this process is that it usually allows correlations between the
two tests to be calculated and statistical tests of reliability to be applied. The whole
basis of validity appears to be placed on a sounder empirical footing than would
otherwise be the case. Within reason, this is a sensible and useful thing to do but
Chapter 6, by Fitzpatrick, provides a cautionary tale in the overuse, or misuse, of
concurrent validity testing. Test validity in languages is, in reality, rarely a simple or a
black and white issue where a test is clearly valid or invalid. Rather it raises questions of degrees of confidence and subtleties of interpretation, which this volume endeavours to provide.
The studies in this volume
Nation's introductory chapter raises six validity questions about tests of vocabulary
knowledge. They all centre on the danger that one or more intervening variables can
affect scores on even the most rigorously constructed vocabulary test. The tests
themselves may be fine but the way learners handle the test, or the way users handle the test and its results, may lead to misleading conclusions being drawn.
•The first threat to the validity of a test arises from the testees' attitudes towards the
test, their willingness to participate (which may be reduced by negative experiences with previous tests) and their familiarity with the test format. Learners will vary, therefore, in the way
they approach and handle vocabulary tests and test scores may reflect this kind of
variation in addition to variation in vocabulary knowledge. We have very little
knowledge of how this variation may work.
•The second question is about the appropriateness of frequency data. Many tests are
based on frequency lists. Frequency data gathered in formal L1 contexts are
unlikely to provide a useful basis for tests in a more informal L2 setting.
•The third question focuses on the unit of counting (e.g. word families as opposed
to lemmas) and an inappropriate unit of counting might lead to over- or
underestimations of vocabulary knowledge.
•Fourth, Nation strongly argues for multiple measures in vocabulary assessment
since vocabulary knowledge is multi-dimensional. Nation discusses several studies
where multiple measures show clear advantages over single measures. The aim is
to use a set of complementary measures that tap into different aspects of vocabulary
knowledge and give a more complete picture than a single measure can.
•A fifth threat to validity is the language of instruction used in the test. It is a
common finding in research on bilingualism that testees are disadvantaged if the
test is carried out in their weaker language (for immigrants this is in many cases
the language of the host country). The same holds for second language testing.
Scores are generally higher in bilingual test formats than in monolingual formats.
•Sixth, Nation argues for measures that focus on actual language use. He discusses
measures of vocabulary richness in writing but points out that similar measures
can be developed for speaking and other skills. It should be pointed out that the
ultimate aim of all measures is to give insight into an aspect of linguistic knowledge
that is part of language proficiency and therefore of the ability to use the language
effectively.
The authors of the present book address the questions raised by Paul Nation from
different angles and discuss ways to overcome potential problems with the validity of
the measures.
Chapter 2 addresses the first of the validity questions raised in Nation's opening
chapter; the question about variability between learners. Measures of vocabulary
knowledge that are based on the frequency of the words make the implicit or explicit
assumption that there is a predictable relation between the frequency of a word in the input and the probability that it is learned at a certain stage of L2 acquisition.
These measures divide the vocabulary into frequency bands, the most frequent
thousand words, the next most frequent thousand words and so on. It can be assumed
that more words are known from band 1 than from band 2 and that more words from
band 2 are known than from higher bands.
However, it is far from clear what constitutes the input for many learners. The
frequency of words used in course books is not necessarily the frequency used by
native speakers in a natural setting. Milton reports on a study that found a high
number of infrequent words in textbooks. The chapter lists a number of other factors
that influence vocabulary learning, including different learner styles. All these factors
can help explain why vocabulary knowledge in L2 cannot necessarily be predicted from the frequency of words in frequency lists that are at least partially based on native speaker (L1) usage. This is especially the case with low level learners who have little access to the L2 outside
the classroom. Milton reports the findings of a study of 227 Greek EFL learners. Only
60% of the learners follow the "normal", that is the expected, pattern, in which knowledge of a word can be predicted from an established frequency list. Other
learners show an unusual dip in the percentage of known words that belong to
frequency band 2 (labelled level two deficit) or show other less clear patterns. These
patterns are by no means stable. More than a quarter of a subgroup tested twice show
a change in their patterns. Some learners who showed a level two deficit in the first
test did not show that deficit in the second test and vice versa. There are indications
that the use of guessing strategies is one of the reasons for the changing patterns.
Milton's chapter illustrates that many factors influence lexical learning in L2. All
these factors lead to a considerable variability amongst learners, and this in turn is a
threat to the validity of measures that are based on frequency lists. Milton points out
that high variability can be found mainly in the knowledge of the most frequent 2000 words. The knowledge of less frequent words seems to follow the expected pattern more closely. He argues that frequency based measures are not under threat per se but that they have to take this variability, especially amongst low level learners, into account.
In Chapter 3 Eyckmans addresses the issue of test validity with regard to the Yes/No
format in particular. She reports the findings of studies with, in total, 2000 French
learners of Dutch. The results are that testees generally tend to favour either the Yes or
the No response and do not demonstrate the neutral attitude to test words which the
test supposes. This introduces a variability that does not reflect differences in
vocabulary knowledge but personal preferences or attitudes. These attitudes are
obviously a threat to the validity of this test as they would be to any forced answer
test. To overcome this problem it is common to include pseudo words in Yes/No style
tests. Testees who claim to know the non-existing pseudo word obviously have a
tendency to use guessing strategies. The use of such strategies introduces construct-
irrelevant variance which has to be compensated for. This is done with a correction formula that “marks down” the testee for a high level of guessing. The number of pseudowords
that are claimed to be known, also called the false alarm rate, is the basis for these
correction formulae. However, these formulae vary between different tests and lead to
different scores for testees with a tendency to use guessing strategies. The French
learners of Dutch in this study appear strongly to favour a Yes response and this ought
to unbalance the test in a way that pseudowords cannot compensate for.
Eyckmans investigates whether the use of certain features that are only available in a
computer based Yes/No test can overcome these problems with the validity of the
format. These features include forced response, presenting the items in random order,
imposing a time limit per item and repeating the instruction with every word
presented. In order to decide whether these features give some added value for the test
format she uses two different computer based test formats, one that includes all these
special features (format B) and one that does not and is therefore not very dissimilar
to a pencil and paper test (format A). She then compares the hit rate (real words that
are claimed to be known) and the rejection rate (pseudo words that are not claimed to
be known). For format A the hit rate and the rejection rate show a statistically
significant negative correlation which can only be explained as the result of a
systematic response bias. Those testees who claim that they know the correct words
also claim to know the pseudo words, and those who have a tendency not to claim that
they know the correct words have this tendency also for pseudowords. In other words, candidates have a systematic bias towards a Yes or a No response. For format B
there was no significant correlation between the two response rates. This does not, however, prove a higher validity for format B as both versions had a high false
alarm rate (more than 20% of the pseudo words were claimed to be known). In order
to investigate this further a translation task was used. Testees were asked to translate
the items into their L1 (French). The marking of this translation was relatively lenient; wrong spellings etc. were accepted. Nevertheless almost half of the words that were
claimed to be known were translated incorrectly. The question is whether correction
formulae can compensate for this overestimation of vocabulary knowledge by the testees. To investigate this, the correlations between the scores obtained with different correction formulae and the scores on the translation task were computed. The
correction formulae reduce the number of hit scores (existing words that are claimed
to be known) by “marking down” testees who apply guessing strategies. However,
this does not lead to a dramatic increase in the correlations with the translation task. In
other words, concurrent validity of the correction formulae cannot be established with
this group of learners. Overall, Eyckmans concludes that the validity of the Yes/No
format has to be questioned in that it may not be equally suitable for all types of
learner unless response bias is also known.
In Chapter 4 Richards and Malvern discuss a number of studies with regard to the
validity questions raised by Nation. In line with Milton (chapter 2) the validity of
measures based on L1 frequency data is questioned for lower and intermediate
learners because their access to the L2 outside the classroom may be limited. However, a study with secondary school learners showed that a frequency based test (X-Lex,
Meara and Milton, 2003a) aimed at lower-intermediate learners yielded convincing
results showing that these learners are sensitive to the frequency of words. The test is
based on five frequency bands (1 - 5k) and there is a steady decline in the percentage
of known words from the highest to the lowest frequency band. The authors discuss
several other threats to the validity of vocabulary tests and identify learners’ attitudes, which are related to the face validity of a test, as the most serious threat, since this aspect is to a large extent beyond the control of the researcher.
A specific concern of the authors is that of multiple measures. Richards and
Malvern argue strongly in favour of multiple measures and against the
misinterpretation that the measure developed by the authors (D) was suggested as
single measure. D was originally developed to overcome the problems with the falling
type-token-ratio (TTR) curve. The longer a text is, the fewer new words will be
introduced, simply because the speaker/ writer runs out of new words. This leads to a
systematic decrease in the ratio between new words (types) and all words (tokens) and
makes it difficult to compare texts with different lengths. D is the single parameter of
a mathematical model for this falling TTR-curve and allows speakers/writers to be compared irrespective of the length of the text produced. D has been labelled as
a “unified measure” but the authors stress that this does not mean “single measure”.
The mathematical modelling is unified in the sense that it can be applied to various
other ratios, not only the TTR. Richards and Malvern discuss a number of studies that
use the same underlying mathematical model and apply it to other ratios (e.g. research
in early L1 acquisition with focus on the relation between nouns and verbs).
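For readers who want the curve itself, the model behind D expresses the expected TTR of a sample of N tokens as TTR = (D/N)(sqrt(1 + 2N/D) - 1), with D the single parameter; in the vocd procedure this curve is fitted to TTRs computed from repeated random subsamples of a transcript. The sketch below simply evaluates the curve as given in the vocd literature and shows that a larger D keeps the TTR higher for longer:

```python
# The falling TTR curve behind D: expected TTR for a sample of
# n_tokens given the single parameter d (the formula from the vocd
# literature). A larger d means diversity falls away more slowly.
import math

def expected_ttr(n_tokens, d):
    return (d / n_tokens) * (math.sqrt(1 + 2 * n_tokens / d) - 1)

for d in (20, 60):
    curve = [round(expected_ttr(n, d), 2) for n in (10, 50, 100, 200)]
    print(d, curve)
# 20 [0.83, 0.58, 0.46, 0.36]
# 60 [0.93, 0.76, 0.65, 0.53]
```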
In the remainder of the chapter the authors investigate the third validity question in Nation’s opening chapter, that is, the question about the unit of counting. There can be
no single answer to the best unit of counting since this might depend on the age of the
learners and the aim of the study. Richards and Malvern discuss a study where the
same data base, transcriptions from the Bristol Corpus in L1 acquisition, was analysed
according to different definitions of the unit of counting, ranging from a very broad definition of what would count as a word, on the basis of the unedited transcript, to a very narrow definition at the other end of the spectrum, a fully lemmatised transcription. The authors applied D as a measure of diversity to the
transcripts with the five different definitions of the unit of counting. The results show
high inter-correlations between the D-values obtained. The authors argue, however,
that these correlations do not mean that the five versions of the test measure the same
thing. They bring forward three objections to such an interpretation. They argue that
better edited data are always more reliable, that a high inter-correlation does not mean
that the individuals score similarly (which is also the case in their data with negative
consequences if judgements were to be made on individual children) and thirdly that
the correlations can be misleading from a statistical point of view. Overall, Richards
and Malvern make a strong case for a carefully defined unit of counting.
Chapter 5 is entirely focused on the measurement of lexical richness. Van Hout and
Vermeer discuss the validity problems with the type-token ratio and attempt to
overcome these problems with mathematical transformations. The authors list a number of mathematical transformations that have been proposed in the past but focus in their investigation on the index of Guiraud (a simple mathematical transformation of the TTR) and on D, the mathematical model discussed in chapter 4. The argument is complex but rich in detail.
First they illustrate the problems that arise due to the influence of text lengths with an
empirical study based on the first three chapters of Genesis (King James Bible). With
increasing text length (number of tokens) there is also an increase in the number of new words (types). This increase follows, however, a curvilinear pattern rather than a straightforward linear one. This is in line with the expectations mentioned in chapter 4: writers and speakers simply run out of new words and repeat words already used as the text gets longer. Therefore the number of new words (types) does
not increase in the same way as the total number of words (tokens). In the study of
van Hout and Vermeer this leads to an upwards curvilinear relationship because they
look at the increase of types with increasing text length. This corresponds to a
systematic decrease of the type-token ratio. This decrease of the TTR is steeper at the beginning, for shorter text lengths, before it flattens out, hence the curvilinear pattern.
Van Hout and Vermeer show that this pattern cannot be modelled by a simple
regression. They carry out four regression analyses, on the basis of the raw data and of three mathematical transformations that attempt to straighten out the curvilinear pattern. The regression analysis on the raw data leads to a constant which “makes no sense”, as the authors point out, since the model would predict that a text with a length of 1 token would contain 59 types (see Table 3 of the chapter). The regressions carried out on the mathematical transformations also lead to constants, in two cases with a negative sign. This is not interpretable because a text cannot contain a
negative number of types. It remains to be discussed whether these findings rule out
the use of regression analysis per se to model the relation between types and tokens.
Van Hout and Vermeer go on to consider, alongside the TTR, two other measures of lexical diversity that have been proposed in the literature: the index of Guiraud and the index of Herdan, both mathematical transformations of the TTR. In a
study on Moroccan and Turkish L2 learners of Dutch in primary education all values
for the indices decrease with increasing L2 exposure. The authors make a tentative
interpretation of these data by stating that an increase in highly frequent function words
at a certain point in L2 acquisition might be the reason for these surprising findings.
This raises a validity question similar to those addressed by Nation about the
influence of the unit of counting. Complex L2 acquisition processes might not be
accessible by counting mere types and tokens without looking into the function that
different types have at different stages of the acquisition process. Van Hout and
Vermeer report on a further study with adult L2 learners where there was no increase
in the measures of lexical richness used over time. They argue that during L2 acquisition a complex relationship between the types and tokens used might develop.
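Both indices mentioned above are simple transformations of the two counts already described: with V types and N tokens, Guiraud's index is V divided by the square root of N, and Herdan's index is log V divided by log N. A minimal sketch, using the cat sentence from earlier as input:

```python
# Guiraud's and Herdan's indices, two transformations of the TTR that
# attempt to reduce its dependence on text length.
import math

def guiraud(v_types, n_tokens):
    return v_types / math.sqrt(n_tokens)

def herdan(v_types, n_tokens):
    return math.log(v_types) / math.log(n_tokens)

# "The cat sat on the mat." has 5 types and 6 tokens:
print(round(guiraud(5, 6), 2))  # 2.04
print(round(herdan(5, 6), 2))   # 0.9
```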
With this in mind they consider a theoretical approach to why and how texts differ in
the number of types used, based on an urn model of different vocabularies. Two word categories (function words and content words) are given different probabilities of being drawn from an urn and ending up as “words in a text”. The number of different words in these texts depends on several parameters: the probabilities of the two word categories being chosen; the ratio between these word categories; and the number of items in the urn, the vocabulary size. The main finding is that even with a relatively simple model like this, the TTR discriminates only poorly between the different vocabulary sizes.
However, this simple model can be made a bit more “human” by a more complex set
of probabilities for the items. The function words are divided into two groups, one
with a high and one with a low probability of being chosen from the urn. For the content words the probability of being drawn can be increased if the same item has already been chosen. Three different lexicons can be modelled in this way. All have the
same size but different probabilities. The number of types that are drawn differs
widely and the 95% confidence interval for the mean score of types drawn increases
the more “human” the lexicons are made. The authors argue that these findings raise
serious doubts about the reliability of measures that are based on the occurrences of
types.
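The logic of the urn model can be illustrated with a minimal simulation. The category
sizes and probabilities below are illustrative assumptions, not the parameters used by
van Hout and Vermeer:

    import random

    def draw_text(n_tokens, n_function=200, n_content=4800,
                  p_function=0.5, seed=0):
        """Draw a 'text' from an urn with two word categories.

        Function words (a small, closed set) are drawn with probability
        p_function; content words (a large, open set) otherwise.
        """
        rng = random.Random(seed)
        tokens = []
        for _ in range(n_tokens):
            if rng.random() < p_function:
                tokens.append(("F", rng.randrange(n_function)))
            else:
                tokens.append(("C", rng.randrange(n_content)))
        return tokens

    for p in (0.3, 0.5, 0.7):
        text = draw_text(1000, p_function=p, seed=42)
        print(f"p(function)={p}  types drawn={len(set(text))}")

The higher the share of draws that go to the small set of function words, the fewer
distinct types appear, even though the underlying vocabulary size is fixed, which is
the core of the authors’ argument.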
As a consequence of these doubts, Van Hout and Vermeer suggest a new way of measuring
richness that takes the frequencies of the types into account: the MLR, or Measure of
Lexical Richness. Since they are mainly concerned with the vocabulary size of school
children, they define frequency on the basis of a corpus gathered in a school setting.
This corpus comprises almost 2 million words, with 26,000 lemmas that can be divided
into 9 frequency bands. The size of a child’s productive vocabulary can then be
estimated from the number of words used from each frequency band. Van Hout and
Vermeer report the findings of a study that uses this new measure. Two groups of
children were compared, 16 L1 learners and 16 L2 learners of Dutch. The children
carried out a word recognition task and a word definition task, and in addition
spontaneous speech was recorded. On the basis of these data several measures were
computed, including the Index of Guiraud, two versions of the MLR and D (computed
with the CLAN command Vocd). The Index of Guiraud and the MLR show significant
differences between the groups, whereas no significant differences can be found with
D. Extrapolation of vocabulary size based on the MLR shows a clearly larger vocabulary
for L1 learners than for L2 learners. Van Hout and Vermeer draw the overall conclusion
that frequency data need to be taken into account in research on lexical richness, and
they argue that only large corpora can provide these data. In line with Nation's
second validity question, they gather these data from the environment of the children,
the school setting.
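The band-based logic behind such a measure can be illustrated with a toy calculation.
This is a generic sketch of frequency-band extrapolation with invented numbers, not
Van Hout and Vermeer's actual MLR formula:

    # Sample a fixed number of items per frequency band, score them, and
    # extrapolate the proportion known to the whole band.
    band_sizes   = [1000, 1000, 2000, 4000, 8000, 10000]  # hypothetical lemmas per band
    items_tested = 50                                     # items sampled per band
    items_known  = [49, 46, 40, 28, 15, 6]                # hypothetical scores per band

    estimate = sum(known / items_tested * size
                   for known, size in zip(items_known, band_sizes))
    print(f"estimated productive vocabulary: {estimate:.0f} lemmas")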
In Chapter 6 Fitzpatrick addresses another issue of test validity: the premature use
of test formats that have not been sufficiently scrutinised in pilot studies. Whereas
in many cases researchers look mainly at the concurrent or criterion-related validity
of a test, Fitzpatrick argues that the construct and content validity of a test
deserve equal attention. The test format under investigation in this chapter is Lex30,
a vocabulary association task. Thirty frequent words are used in this test and the
subjects are asked to give three to four word associations for each stimulus. The test
score of Lex30 is the percentage of infrequent word associations given by the
subjects. In order to investigate what these test scores mean, two other tests were
administered to the same subjects: the Controlled Productive version of the Levels
Test (Laufer and Nation, 1999) and a translation task based on 20 randomly selected
words from each of the first three frequency bands of Nation’s word list. It was
assumed that these three tests would yield high correlations, as they seem to test the
same ability: to produce L2 words which represent various frequency bands. All
correlations between the three tests are statistically significant. However, the
magnitude of the correlation between Lex30 and the two other tests is relatively low.
This means that the tests either vary in their degree of accuracy or measure different
aspects of lexical knowledge.
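The scoring logic of Lex30, as described here, can be sketched in a few lines; the
tiny word list and the example responses are placeholders, since the real test uses an
established frequency list and fuller scoring conventions:

    # Toy sketch: the score is the percentage of association responses
    # falling outside the most frequent thousand words.
    FIRST_1000 = {"dog", "house", "eat", "water", "go"}   # stand-in for a real 1k list

    def lex30_score(responses):
        scored = [w for w in responses if w.isalpha()]
        if not scored:
            return 0.0
        infrequent = [w for w in scored if w.lower() not in FIRST_1000]
        return 100 * len(infrequent) / len(scored)

    responses = ["dog", "kennel", "bark", "leash", "water", "irrigation"]
    print(f"Lex30-style score: {lex30_score(responses):.1f}%")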
Fitzpatrick pursues the possibility that the tests measure different things and tries
to establish the construct validity of the tests involved by taking a closer look at
what these tests measure and how they measure it. It emerges that the Productive
Levels Test as used in this study focuses mainly on the most frequent 3,000 words, and
so does the translation task. Lex30, however, awards marks for any word beyond the
first frequency band (the most frequent thousand words) and therefore includes words
beyond the first 3,000. Furthermore, the three tests use different stimuli to elicit
the lexical response: Lex30 has only a semantic stimulus, whereas the other tasks also
have an orthographic stimulus, with an additional one in the Productive Levels Test,
so that the three tests activate knowledge in different ways. An argument that the
three tests measure different types of vocabulary knowledge can be made on the basis
of Nation’s list of “aspects of word knowledge”. This clearly indicates that some
aspects (e.g. meaning associations) are relevant only for Lex30 and others (e.g.
collocations or appropriateness of the word) only for the Productive Levels Test.
Overall, Fitzpatrick concludes that more work is needed to establish the construct
validity of the tests involved before they can be used by teachers and other decision-
makers. This might be regrettable from a practical point of view but seems a necessary
precaution, especially in high-stakes test settings.
The multi-faceted concept of lexical knowledge is an argument for multiple measures,
as outlined by Nation in his introductory chapter. In Chapter 7, therefore, Tidball
and Treffers-Daller apply this concept to three groups of French speakers: university
students at level 1, final-year students, and native speakers. All three groups
carried out a semi-spontaneous productive task based on retelling the stories of two
short comic strips. The transcriptions of these stories were then entered into the
CHAT format and analysed; in addition, a C-test was administered. Two different types
of measures were used to analyse the data. The first type is not based on any
frequency data but on the occurrence of types and tokens (D, the Index of Guiraud and
the TTR). The second type is based on the distinction between basic and advanced words
(Advanced Guiraud and LRD), a distinction made in the present study on the basis of a
word list for spoken French. As Nation points out, the validity of word-list based
measures is threatened if an inappropriate list is used. The list used in this case is
based on oral data but is relatively old (collected in 1959), and in principle, if
oral language use changes rapidly, this might invalidate any conclusions based on such
a list. In practice, however, it appears that the most frequent words in such lists do
not change substantially (Nation, 2004). A further problem with frequency lists is the
unit of counting. Tidball and Treffers-Daller argue strongly for a lemmatization of
the data, especially for French with its rich morphology. The question of how measures
can be adapted to specific languages has rarely been addressed (see also Daller et
al., 2002), but it seems logical that the structural characteristics of a language
influence the way in which the unit of counting is usefully defined.
All measures in this study show differences between the groups and yield statistically
significant results when an ANOVA is carried out. The differences for the TTR are
significant, but the effect size (Eta²) is much lower for the TTR than for the other
measures, which is in line with expectations. Among the vocabulary measures, the
highest value for Eta² is achieved with D. It is worth mentioning that the C-test,
whose validity has been questioned (see Alderson, 2002), yields the highest value for
Eta² overall. Quite astonishing is the very high correlation between D and the Index
of Guiraud (r = .973), which needs to be investigated further. Overall, the authors
conclude that the word-list based measures in this study do not show an advantage over
the other measures, probably because the definition of basic items was not entirely
appropriate to the task.
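For readers unfamiliar with the statistic: Eta² is the standard ANOVA effect size, the
proportion of the total variance in a measure that is accounted for by the group
factor,

    \eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}

A significant p-value says only that the group differences are unlikely to be due to
chance; Eta² indicates how large those differences are relative to the overall
variation.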
In Chapter 8 Daller and Huijuan Xue likewise use multiple measures, in this case to
investigate the oral proficiency of Chinese EFL learners. The focus of this chapter is
a methodological analysis of different measures of lexical richness: which measures
can be used to describe the differences between the groups, and to what extent do
these findings reflect the results of Tidball and Treffers-Daller reported in the
previous chapter? Two groups of Chinese EFL learners, one in the UK and one at a
different level of ability in China, were asked to describe two picture stories orally
(the same stories as in the study of Tidball and Treffers-Daller). These descriptions
were then transcribed into the CHAT format and analysed. Two types of measures were
used for this analysis: measures based on word lists (LFP/beyond 2000, P-Lex and
Advanced Guiraud) and measures that are not based on any frequency data but only on
the occurrence of types and tokens (D, the Index of Guiraud and the TTR). We used
Nation's word list in the present study for the definition of basic and advanced
words.
All measures except the TTR and P-Lex yielded highly significant differences between
the two groups. For the TTR this was predictable, as the text length of the spoken
descriptions varied vastly (from 61 to 602 tokens). The result for P-Lex, a measure
designed not to be a function of text length and which models the occurrence of rare
words according to a Poisson distribution, is harder to interpret, and more research
into the use of this measure would be necessary. The other measures discriminate
between the two groups, with the highest Eta² achieved by the Index of Guiraud,
followed by D and then the word-list based measures. We do not know what the actual
difference in lexical proficiency between the two groups is, and therefore we do not
know what an “appropriate” value for Eta² would be. Guiraud might exaggerate the
actual differences. Nevertheless, it turns out to be the best measure in the sense
that it discriminates between the groups most clearly: it may be a magnifying glass,
but a useful one. This is in line with the quite positive results for this index
obtained by van Hout and Vermeer (see chapter 5). The fact that the word-list based
measures yield a significant p-value but a lower value for Eta² than Guiraud and D is
probably due to the word list used here not being appropriate for the task: it is not
based on spoken language and it is relatively old (roughly similar in this respect to
the word list used in the previous chapter). There are two ways out of this problem:
we could try to find more appropriate word lists, or we could base our decisions on
basic and advanced words on a different criterion, e.g. teacher judgements (see Daller
et al., 2003). Overall, this chapter and the previous one come to the same
conclusions. Firstly, word lists have to be selected very carefully, as Nation points
out in his introductory chapter. Secondly, the Index of Guiraud still seems, more than
50 years after it was proposed, to be a good measure of lexical richness.
In Chapter 9 Wilks and Meara challenge the network metaphor as a suitable concept for
vocabulary research. They argue that the metaphor itself might shape our concept of
lexical organisation rather than explain it, and they ask whether attitudes and actual
behaviour in the test situation shape the construct of a network that does not exist
in this form outside the test situation. In this way a network is a construct which
develops “a life of its own”. This would certainly undermine the validity of these
tests, in a way quite similar to the validity question about learners’ attitudes
raised by Nation. To investigate the concept of lexical networks further, Wilks and
Meara carry out a study on word associations. Two groups of French learners and a
group of French native speakers were asked to identify word associations in 20 sets of
10 words each, randomly selected from a French basic word list. The learner group with
the higher proficiency identified more associations than the lower proficiency group,
and the highest number of associations was found with the native speakers. The number
of associations for each group can quite convincingly be modelled with a Poisson
distribution, where the higher proficiency of the group is reflected in a higher
parameter lambda (a Poisson distribution is defined by the single parameter lambda).
However, when the association profiles of individual subjects are examined, there are
considerable deviations from the model curves and a “surprising degree of individual
variation”. This means that in many cases the number of associations made by
individuals does not follow predictable patterns. A further analysis, including
interviews with the subjects, shows that quite different strategies were applied to
carry out the task and that idiosyncratic association behaviour forms a serious threat
to the validity of the test. The main problem is that it is impossible to identify
what counts as a legitimate association, as this varies from individual to individual
even in the native speaker group. Our access to lexical networks is possible only via
the test-taking task and is therefore indirect. It is even possible that associations
are only created through the task and do not exist in this form permanently outside
it. On the basis of these findings the authors argue for a long-term approach to
research on the relationship between theory and methodology.
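The group-level Poisson fit discussed above can be checked with a few lines of code;
the association counts below are invented for illustration:

    import math
    from collections import Counter

    counts = [2, 3, 1, 4, 2, 3, 5, 2, 3, 4, 1, 3, 2, 6, 3, 2, 4, 3, 2, 3]
    lam = sum(counts) / len(counts)     # maximum-likelihood estimate of lambda

    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    observed = Counter(counts)
    for k in range(max(counts) + 1):
        expected = poisson_pmf(k, lam) * len(counts)
        print(f"k={k}: observed={observed.get(k, 0):2d}  expected={expected:5.2f}")

At group level such a fit can look convincing even when individual profiles deviate
considerably, which is exactly the authors’ point.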
In Chapter 10 Ellen Schur again investigates the organisation of the lexicon as a
network. Her hypothesis is that the organisation of the lexicon follows the pattern of
small world networks, a concept adopted from graph theory. Small world networks have
several characteristics, such as sparseness of the connections and their clustering
around a few nodes, which appear to mimic the way we think words may interconnect in a
lexicon. Another useful characteristic is that the organisation of small world
networks is neither completely regular nor completely random. The nodes in lexical
networks are obviously words, which form the network through their connections with
other words by association or collocation. Nodes in these networks are either highly
connected or have only a limited number of associates.
Schur's research design is based on the associations that different groups make
between a set of 50 verbs selected by the researcher. The groups were bilinguals,
monolinguals and foreign language learners, and they produced clearly different
association networks. This makes it possible to distinguish between different learner
types (L1 versus L2) on the basis of their word associations. Interestingly, there are
also different association networks between EFL learners with different L1
backgrounds. The group of Chinese EFL students in the present study showed clearly
distinct behaviour in their network associations when compared with the Hebrew EFL
learners: the Chinese students produced far more bi-directional associations than the
Hebrew students did. An explanation of this difference is far from easy; however, it
shows that a research design based on the elicitation of associations and the
subsequent construction of networks allows different learner types to be
distinguished, which is an argument for the validity of this specific research method.
All association networks produced by these groups were analysed according to graph-
theoretical criteria. Schur shows that the groups create different association
networks, but that all networks have in common that they belong to the category of
small world networks. This is at least a tentative finding and a strong argument for
further research in this area, which might lead to a new way of understanding lexical
knowledge and its organisation.
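The graph-theoretical criteria involved can be sketched with the networkx library; the
“association network” below is a randomly generated stand-in for Schur's data:

    import networkx as nx

    # A small-world graph as a stand-in association network, and a random
    # graph of the same size and density as a baseline for comparison.
    G = nx.watts_strogatz_graph(n=50, k=4, p=0.1, seed=1)
    R = nx.erdos_renyi_graph(50, 4 / 49, seed=1)

    for name, graph in (("association graph", G), ("random baseline", R)):
        if nx.is_connected(graph):
            print(name,
                  f"clustering={nx.average_clustering(graph):.3f}",
                  f"avg path={nx.average_shortest_path_length(graph):.2f}")

The small-world signature is clustering well above the random baseline combined with
an average path length close to it.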
In Chapter 11 Haquebord and Stellingwerf investigate the relation between
vocabulary knowledge and reading competence in a Dutch school setting. They
discuss the development of an adaptive vocabulary test that can be used as part of a
diagnostic test battery for reading proficiency at secondary school level. This is a very
important area of research because an increasing number of pupils in the Netherlands
in particular, and in Europe in general, are L2 learners. One important research
question is whether the difficulties such learners often experience in reading are due
to a small vocabulary or to a lack of appropriate reading strategies. The answer to
this question obviously has important pedagogical consequences for remedial
programmes. The authors argue that reading proficiency is based on vocabulary
knowledge (e.g. word recognition) and a combination of top-down and bottom-up reading
strategies; good readers combine these two kinds of strategy to extract meaning from
the text. The authors identify three types of weaker readers with different needs for
remedial programmes. The first group of weak readers relies mainly on top-down
strategies (e.g. guessing), which can compensate for a lack of lexical knowledge.
However, when texts become more difficult these readers fail, and they need additional
vocabulary training. The second group of weak readers has sufficient vocabulary
knowledge but lacks successful reading strategies; this group needs additional reading
instruction. A third group has problems in both areas and needs special attention.
Haquebord and Stellingwerf developed an adaptive vocabulary test which, in combination
with a well-established Dutch reading comprehension test, can draw a detailed picture
of the reading profile of secondary pupils. This allows an informed judgement on
remedial teaching for pupils with different reading profiles. The vocabulary test is
adaptive in the sense that it can be adjusted to different reading levels. An
empirical investigation with more than 2,700 pupils in secondary education in the
Netherlands shows that the vocabulary test differentiates sufficiently between
students at different levels, which is an argument for the reliability and validity of
the test. The fact that the vocabulary test, in combination with the established
reading comprehension test, makes it possible to identify the different reader types
mentioned above is an additional argument for its construct validity. More than 50% of
the pupils in the present study are classified as fluent readers. It is, however,
worrying that about 15% of the pupils have severe reading difficulties. It is perhaps
no coincidence that this percentage matches the percentage of L2 first-year students
in secondary education in the Netherlands. This large number of pupils with severe
reading difficulties makes it clear that valid vocabulary tests and research on
vocabulary knowledge are of vital importance in our present school environment.
In Chapter 12 Lorenzo-Dus argues strongly in favour of integrated methodologies in
vocabulary research. She points out that it is essential to combine qualitative and
quantitative approaches to draw a comprehensive picture of foreign language
proficiency. To make the case for qualitative approaches, Lorenzo-Dus reports a study
carried out by Koike (1998) in which two interview techniques are compared: the oral
proficiency interview (OPI) and the simulated oral proficiency interview (SOPI). In
the first technique an examiner is present during the interview; in the latter the
candidate has to carry out a series of oral tasks without an examiner being present.
One would expect the performance of candidates to differ between the two interview
techniques. Quantitative analyses, however, fail to show these differences, whereas a
qualitative approach reveals variation in candidate performance when the two different
interview techniques are used. Lorenzo-Dus takes this as a further indication that
qualitative approaches have to be part of research designs in vocabulary research. For
this reason she combines both approaches in a study of oral interviews in Spanish. A
first result, which is also discussed in the following chapter, is that examiners
evidently focus more on lexical sophistication than on lexical diversity. This can be
shown with purely quantitative methods. Other aspects of the interviews, however, are
accessible only through qualitative approaches. Lorenzo-Dus shows that the ratings of
the examiners are influenced not only by the number of rare words used but also by
their position in discourse: examinees who receive high ratings use these words more
in natural stretches of talk than in prefabricated units.
Another aspect that appears important in predicting ratings is the accommodation
strategies used by examiners or interlocutors: the more the examiner has to
accommodate to the examinee, the lower the rating of the candidate’s performance will
be. However, only a qualitative analysis will reveal the use of some of these
strategies. The findings of Lorenzo-Dus show that the number of confirmation questions
does not differ significantly between interviews with students at different
proficiency levels, so a quantitative approach gives no further insight into the
importance of this question type for the examiners’ ratings. A qualitative analysis of
the function of these confirmation questions reveals, however, that they are mainly
used to check comprehension with poorer candidates and to manage turn-taking with
better students. Overall, the findings of Lorenzo-Dus are a clear indication that a
combination of different approaches is useful, which is a further argument for
multiple measurements in research on vocabulary proficiency.
The aim of the final chapter, Chapter 13, is to investigate which aspects of lexical
richness are more important for teacher judgements on foreign language proficiency.
The answer to this question is important since at least in a classroom setting the
construct of language proficiency is partially determined by the judgements of the
teachers. We assume that teachers focus more on lexical sophistication than on lexical
diversity because judgements based on the use of rare words allow economical
marking strategies. Four experienced teachers were asked to give an overall rating of
31 essays written by EFL learners. They were also asked to judge various other
linguistic aspects of these essays, including the vocabulary range. There are high
inter-correlations between the ratings of the teachers as a group, which is an
indication of reliable judgements overall. However, there are also high correlations
between the ratings that the teachers give on the different linguistic qualities of
individual essays. It can be assumed that there are large halo effects and that
teachers in general focus mainly on an overall, holistic rating. This is in line with
earlier findings on the assessment of oral interviews (Malvern and Richards, 2002). An
obvious explanation is that teachers have economical marking strategies in which
holistic ratings are more efficient than more detailed judgements. The ratings of the
teachers were then correlated with the scores of the essays obtained with different
measures of lexical richness: measures of lexical diversity (the TTR, D and the Index
of Guiraud) and measures of lexical sophistication (Advanced Guiraud, P-Lex and the
total number of advanced types). The teacher ratings correlate highly with all
measures of lexical sophistication but only modestly, and not significantly, with D.
This is an indication that lexical sophistication is indeed more important than
lexical diversity for teacher ratings. A further finding of this study is that the TTR
is not a useful measure where essays of different lengths are analysed, which is in
line with expectations. Quite surprisingly, the Index of Guiraud, which is only a
simple mathematical transformation of the TTR, again proves a useful measure. As
chapters 7 and 8 report similar findings for the Index of Guiraud in other contexts,
we argue that this relatively old index should not be discarded from our repertoire of
lexical measures for the time being and should be included in future research designs.
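The kind of correlational analysis reported in this chapter can be reproduced in a few
lines; the ratings and measure scores below are invented for illustration:

    from scipy.stats import pearsonr

    ratings        = [3, 5, 4, 2, 5, 3, 4, 2, 5, 4]            # hypothetical holistic ratings
    sophistication = [8, 15, 12, 5, 16, 9, 11, 4, 14, 12]      # e.g. advanced types per essay
    diversity      = [41, 48, 45, 52, 44, 39, 47, 43, 40, 50]  # e.g. D scores

    for name, measure in (("sophistication", sophistication),
                          ("diversity", diversity)):
        r, p = pearsonr(ratings, measure)
        print(f"rating vs {name}: r={r:.2f}, p={p:.3f}")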