Jeffrey Stewart

Director of Educational Measurement
Kyushu Sangyo University · Language Education and Research Center

Publications

  • Jeffrey Stewart
    ABSTRACT: Validated under a Rasch framework (Beglar, 2010), the Vocabulary Size Test (VST; Nation & Beglar, 2007) is an increasingly popular measure of decontextualized written receptive vocabulary size in the field of second language acquisition. However, although the validation indicates that the test has high internal reliability, the possibility remains unaddressed that it overestimates learner vocabulary size due to guessing effects inherent in its multiple-choice format, as size estimates are made by multiplying the raw score by a constant (100 or 200). This paper argues that the VST's multiple-choice format results in a test of passive recognition of words that does not approximate the experience of readers of authentic English texts; details drawbacks of the Rasch framework and mean-square fit statistics in detecting the overall contribution of guessing effects to raw test scores, drawbacks that could have allowed such deficiencies to remain undetected during the test's validation; overviews challenges that multiple-choice formats pose for vocabulary tests; and concludes by proposing methods of testing and analysis that can address these concerns.
    Language Assessment Quarterly: An International Journal, 08/2014; 11(3):271-282. · 1.14 Impact Factor
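    The arithmetic behind the concern is simple. A minimal sketch, assuming the 140-item version of the VST (each item representing 100 word families, four answer options) and a hypothetical learner who genuinely knows 70 items and guesses blindly on the rest:

    ```python
    # Sketch: how guessing on a 4-option multiple-choice test inflates a
    # size estimate computed as raw score x constant. Illustrative only;
    # the learner's "true" knowledge below is a hypothetical assumption.

    N_ITEMS = 140          # VST, 14,000-word version: 140 items
    WORDS_PER_ITEM = 100   # each item represents 100 word families
    P_GUESS = 1 / 4        # chance success on a 4-option item

    true_known_items = 70  # hypothetical: learner genuinely knows 70 items
    unknown_items = N_ITEMS - true_known_items

    # Expected raw score if the learner guesses blindly on unknown items
    expected_raw = true_known_items + unknown_items * P_GUESS

    true_size = true_known_items * WORDS_PER_ITEM
    estimated_size = expected_raw * WORDS_PER_ITEM

    print(f"true size:      {true_size} word families")          # 7000
    print(f"estimated size: {estimated_size:.0f} word families")  # 8750
    print(f"inflation:      {estimated_size - true_size:.0f}")    # 1750
    ```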
  • Raymond Stubbe, Jeffrey Stewart
    ABSTRACT: Yes/No tests offer an expedient method of testing learners' vocabulary knowledge, although a drawback of this method is that, because it relies on self-report, actual knowledge cannot be confirmed. "Pseudowords" have been used within such lists to test whether learners are reporting knowledge of words they cannot possibly know, but it is unclear how to use this information to adjust scores. Although a variety of scoring formulas have been proposed in the literature, empirical research (e.g., Mochida & Harrington, 2006) has found little evidence of their efficacy. The authors propose that a standard least squares model (multiple regression), in which the counts of real words reported known and the counts of pseudowords reported known are entered as separate predictor variables, can be used to generate scoring formulas with substantially higher predictive power. This is demonstrated on pilot data, and limitations of the method and goals of future research are discussed.
    Shiken, 11/2012; 16(2):2-7.
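    A minimal sketch of the proposed least squares approach, using synthetic stand-in data rather than the paper's pilot data; the variable names and simulated relationships are assumptions for illustration only:

    ```python
    # Sketch: predict a criterion vocabulary score from (a) real words
    # reported known and (b) pseudowords reported known, via ordinary
    # least squares. Data below are synthetic stand-ins.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200                                   # hypothetical learners
    true_knowledge = rng.uniform(20, 90, n)   # latent vocabulary
    overclaiming = rng.uniform(0, 10, n)      # tendency to over-report

    words_yes = true_knowledge + overclaiming + rng.normal(0, 3, n)
    pseudo_yes = 0.8 * overclaiming + rng.normal(0, 1, n)
    criterion = true_knowledge + rng.normal(0, 2, n)  # e.g., a translation test

    # Design matrix: intercept plus the two self-report counts
    X = np.column_stack([np.ones(n), words_yes, pseudo_yes])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    b0, b1, b2 = beta
    print(f"score = {b0:.2f} + {b1:.2f}*words_yes + {b2:.2f}*pseudo_yes")
    ```

    The pseudoword count enters with a negative weight, so it corrects the score downward for learners who over-report, which is the intuition behind the proposed formulas.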
  • Jeffrey Stewart, Aaron Gibson, Luke Fryer
    ABSTRACT: Unlike classical test theory (CTT), where estimates of reliability are assumed to apply to all members of a population, item response theory provides a theoretical framework under which reliability can vary by test score. However, different IRT models can result in very different interpretations of reliability, as models that account for item quality (slopes) and the probability of a correct guess significantly alter estimates. This is illustrated by fitting a TOEIC Bridge practice test to one-parameter (Rasch) and three-parameter logistic models and comparing results. Under the Bayesian Information Criterion (BIC), the three-parameter model provided superior fit. The implications of this are discussed.
    Shiken Research Bulletin, 11/2012; 16(2).
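    For reference, a short sketch of the two item response functions being compared and the BIC used to compare them; the parameter values below are illustrative assumptions, not estimates from the TOEIC Bridge data:

    ```python
    # Sketch: 3PL and Rasch item response functions, and the BIC used
    # for model comparison. Parameter values are illustrative only.
    import math

    def p_3pl(theta, a, b, c):
        """3PL: P(correct) = c + (1 - c) / (1 + exp(-a(theta - b)))."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    def p_rasch(theta, b):
        """Rasch (1PL) is the special case a = 1, c = 0."""
        return p_3pl(theta, a=1.0, b=b, c=0.0)

    def bic(log_likelihood, n_params, n_obs):
        """BIC = k*ln(N) - 2*lnL; lower is better."""
        return n_params * math.log(n_obs) - 2 * log_likelihood

    # A guessing parameter of c = .25 raises the floor of the curve,
    # which changes what a low score implies about ability:
    print(f"Rasch: P(theta=-3) = {p_rasch(-3, b=0):.3f}")      # ~0.047
    print(f"3PL:   P(theta=-3) = {p_3pl(-3, 1, 0, .25):.3f}")  # ~0.286
    ```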
  • Jeffrey Stewart
    ABSTRACT: Most researchers distinguish between receptive (passive) and productive (active) word knowledge. Most vocabulary tests employed in second language acquisition (SLA), such as the Vocabulary Levels Test (VLT) and Vocabulary Size Test (VST), test receptive knowledge. This is unfortunate, as the multiple-choice format employed on most receptive tests inflates estimates of vocabulary size, and there are clear theoretical advantages to focusing instead on productive knowledge, which is associated with greater strength of knowledge as well as written and oral communication skills. The dominance of receptive formats is in large part due to the logistical problems associated with productive tests, as the full-word answers given must either be entered online or hand-marked. This paper describes a multiple-choice test of active vocabulary knowledge, in which learners confirm their knowledge of an English word by selecting its first letter. As there are 25 possible options, the probability of guessing the correct answer by chance is reduced to 0.04. Findings of the study include that word difficulty estimates and scores are highly correlated with those of conventional, full-word active tests (≈ .90), and that test reliability is higher on the proposed format than on that of a receptive test of the same words.
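    A quick illustration of the format's advantage, assuming a hypothetical learner with 50 unknown items; the figures follow directly from the chance probabilities (1/25 on the first-letter format versus 1/4 on a conventional four-option item):

    ```python
    # Sketch: expected chance-level inflation under the first-letter
    # format (25 options, per the abstract) versus a conventional
    # 4-option format. The 50-unknown-item scenario is hypothetical.
    unknown = 50              # items the learner does not actually know
    p_first_letter = 1 / 25   # 0.04
    p_four_option = 1 / 4     # 0.25

    print(f"expected lucky guesses, first-letter: {unknown * p_first_letter:.1f}")  # 2.0
    print(f"expected lucky guesses, 4-option MC:  {unknown * p_four_option:.1f}")   # 12.5
    ```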
  • Jeffrey Stewart, Aaron O. Batty, Nicholas Bovee
    ABSTRACT: Second language vocabulary acquisition has been modeled both as multidimensional in nature and as a continuum wherein the learner's knowledge of a word develops along a cline from recognition through production. In order to empirically examine and compare these models, the authors assess the degree to which the Vocabulary Knowledge Scale (VKS; Paribakht & Wesche, 1993), which implicitly assumes a cline model of acquisition, conforms to a linear trait model under the Rasch Partial Credit Model, and determine the dimensionality of the individual tasks contained on the scale (self-report, first language [L1] equivalent, and sentence) using DETECT. The authors find that, although the VKS functions adequately overall as a measurement model, Stages 3 (can give an adequate L1 equivalent) and 4 (can use with semantic appropriateness) are psychometrically indistinct, suggesting they should be collapsed into a single category of definitional knowledge. Analysis under DIMTEST and DETECT indicates that other forms of vocabulary knowledge measured by the VKS are weakly multidimensional, which has implications for continuum models of vocabulary acquisition.
    TESOL Quarterly, 06/2012. · 0.97 Impact Factor
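    For reference, a small sketch of the Rasch Partial Credit Model category probabilities used to scale the VKS; the five-category structure mirrors the VKS stages, but the threshold values below are illustrative assumptions:

    ```python
    # Sketch: category probabilities for one item under the Rasch
    # Partial Credit Model. Threshold values are illustrative only.
    import math

    def pcm_probs(theta, deltas):
        """Category probabilities for one polytomous item under the PCM.
        deltas[j] is the step difficulty of moving from category j to j+1."""
        # Numerator for category x: exp(sum_{j<=x}(theta - delta_j));
        # the empty sum for category 0 is zero.
        logits = [0.0]
        for d in deltas:
            logits.append(logits[-1] + (theta - d))
        exps = [math.exp(l) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    # A 5-category item (VKS stages 1-5, so 4 steps), hypothetical thresholds:
    for i, p in enumerate(pcm_probs(theta=0.5, deltas=[-2.0, -0.5, 0.5, 2.0]), 1):
        print(f"P(stage {i}) = {p:.3f}")
    ```

    Collapsing Stages 3 and 4, as the paper suggests, amounts to removing one step difficulty from the model.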
  • Jeffrey Stewart
    ABSTRACT: It has frequently been stated that item response theory produces interval-scale measures where raw scores can only provide ordinal measures, and that researchers should therefore choose IRT measures when selecting variables for common statistical tests, because raw scores may not meet their assumptions (Wright, 1992; Harwell & Gatti, 2001). In this study, this claim is empirically examined by conducting Pearson correlations and ANOVAs on two data sets using raw scores, Rasch person measures, and 2-parameter (2PL) IRT ability estimates, in order to determine whether results differ as a consequence. Raw scores and Rasch person measures were very highly correlated and led to extremely similar results in all cases. For a well-constructed, reliable test the same was true of 2PL ability estimates. However, in cases where the test has middling to poor reliability, 2PL ability estimates appear to produce a somewhat more sensitive measure of a latent trait than raw scores, which can result in meaningful differences in statistical tests.
    Shiken Research Bulletin, 01/2012.
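    A rough sketch of the comparison's logic: when item slopes vary, a discrimination-weighted score (used here as a simple stand-in for full 2PL ability estimation) can order examinees differently than the raw score. The simulated data and parameter ranges are assumptions for illustration:

    ```python
    # Sketch: raw scores versus slope-weighted (2PL-style) scores on
    # simulated data. A discrimination-weighted sum stands in for full
    # 2PL ability estimation; all parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_items = 500, 40
    theta = rng.normal(0, 1, n_people)      # latent ability
    a = rng.uniform(0.3, 2.5, n_items)      # varying item slopes
    b = rng.normal(0, 1, n_items)           # item difficulties

    # Simulate 2PL responses: P = 1 / (1 + exp(-a(theta - b)))
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b[None, :])))
    x = (rng.uniform(size=p.shape) < p).astype(int)

    raw = x.sum(axis=1)              # unweighted raw score
    weighted = (x * a).sum(axis=1)   # discrimination-weighted score

    print(f"r(raw, weighted)   = {np.corrcoef(raw, weighted)[0, 1]:.3f}")
    print(f"r(raw, theta)      = {np.corrcoef(raw, theta)[0, 1]:.3f}")
    print(f"r(weighted, theta) = {np.corrcoef(weighted, theta)[0, 1]:.3f}")
    ```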
  • Jeffrey Stewart, David A. White
    ABSTRACT: Multiple-choice tests such as the Vocabulary Levels Test (Nation, 1990) are often viewed as a preferable estimator of vocabulary knowledge when compared to yes/no checklists, as self-reports introduce the possibility of students over- or under-reporting how many words they know. However, multiple-choice tests have their own unique disadvantage: if a multiple-choice test lists possible answers, guessing effects may inflate estimates of vocabulary size. Estimation of guessing effects on the VLT is complicated by the fact that distractors are chosen from the same frequency level of words as the correct answer, and therefore from the tested domain. This introduces the possibility that increases in scores due to guessing could vary depending on the overall proportion of words in the tested domain known by the test taker. In this study, the precise relationship between the proportion of words a student knows and their expected score was determined using elementary probability theory, and the accuracy of the resulting formula was confirmed using a Monte Carlo simulation. As the proportion of known words rises, so too does the probability of correctly guessing the diminishing number of remaining unknown words. This results in a fairly consistent score increase of approximately 16-17 points on a 99-item VLT until over 60% of words are known, at which point the score increase due to guessing gradually begins to diminish.
    TESOL Quarterly, 01/2011; 45(2):370-380. · 0.97 Impact Factor
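    A Monte Carlo sketch in the spirit of the paper's simulation, under a simplified model of the VLT cluster format (six words and three definitions per cluster, 33 clusters for 99 items); the assumption that learners match known words correctly and guess randomly among the remaining unknown words is this sketch's model, not necessarily the paper's exact derivation:

    ```python
    # Monte Carlo sketch: expected VLT score when a learner knows each
    # word with probability p_known. Per cluster: 6 words, 3 tested
    # definitions; unknown definitions are guessed, without repetition,
    # among the cluster's unknown words (a simplifying assumption).
    import random

    def simulate_vlt(p_known, n_clusters=33, trials=2000):
        total = 0
        for _ in range(trials):
            score = 0
            for _ in range(n_clusters):
                known = [random.random() < p_known for _ in range(6)]
                # definitions 0-2 are tested; word i answers definition i
                pool = [w for w in range(6) if not known[w]]  # unknown words
                random.shuffle(pool)
                for target in range(3):
                    if known[target]:
                        score += 1                       # matched from knowledge
                    elif pool and pool.pop() == target:
                        score += 1                       # lucky guess
            total += score
        return total / trials

    for p in (0.3, 0.5, 0.7, 0.9):
        print(f"p={p:.1f}: expected score {simulate_vlt(p):.1f} "
              f"(vs {99 * p:.1f} items actually known)")
    ```

    Under these assumptions the simulated scores run roughly 16-17 points above the number of words actually known across most of the range, consistent with the figure reported in the abstract.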
