Journal of Educational Measurement (J Educ Meas)

Publisher: National Council on Measurement in Education, Wiley

Journal description

The Journal of Educational Measurement (JEM) is a quarterly journal that publishes original measurement research and reports of applications of measurement in an educational context. Solicited reviews of books, software, published educational and psychological tests, and other important measurement works appear in the Review Section of the journal. In addition, comments on technical and substantive issues addressed in articles and reviews previously published in JEM are encouraged. Comments will be reviewed and the authors of the original article will be given the opportunity to respond.

Current impact factor: 1.00


Additional details

5-year impact: 1.30
Cited half-life: >10.0
Immediacy index: 0.00
Eigenfactor: 0.00
Article influence: 1.16
Website: Journal of Educational Measurement website
Other titles: Journal of educational measurement (Online), JEM
ISSN: 1745-3984
OCLC: 58648984
Material type: Document, Periodical, Internet resource
Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details


  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 2-year embargo
  • Conditions
    • Some journals have separate policies, please check with each journal directly
    • On author's personal website, institutional repositories, arXiv, AgEcon, PhilPapers, PubMed Central, RePEc or Social Science Research Network
    • Author's pre-print may not be updated with Publisher's Version/PDF
    • Author's pre-print must acknowledge acceptance for publication
    • Non-Commercial
    • Publisher's version/PDF cannot be used
    • Publisher source must be acknowledged with citation
    • Must link to publisher version with set statement (see policy)
    • If OnlineOpen is not available, BBSRC, EPSRC, MRC, NERC and STFC authors may self-archive after 6 months
    • If OnlineOpen is not available, AHRC and ESRC authors may self-archive after 12 months
    • This policy is an exception to the default policies of 'Wiley'
Publications in this journal

  • No preview · Article · Nov 2015 · Journal of Educational Measurement
  • ABSTRACT: An odds ratio approach (ORA) under the framework of a nested logit model was proposed for evaluating differential distractor functioning (DDF) in multiple-choice items and was compared with an existing ORA developed under the nominal response model. The performances of the two ORAs for detecting DDF were investigated through an extensive simulation study. The impact of model misfit on the performance of each ORA was also examined. To facilitate practical interpretation of each method, effect size measures were obtained and compared. Finally, data from a college-level mathematics placement test were analyzed using the two approaches.
    No preview · Article · Nov 2015 · Journal of Educational Measurement
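The odds-ratio idea underlying both ORAs can be illustrated with a simple 2×2 comparison of distractor choices. This is a generic sketch, not the nested-logit or nominal-response formulation from the article; the counts and the log-odds effect size shown are hypothetical.

```python
import math

def distractor_odds_ratio(ref_pick, ref_other, foc_pick, foc_other):
    """Odds ratio comparing how often the reference and focal groups
    choose a given distractor; values far from 1 flag potential DDF."""
    or_ = (ref_pick * foc_other) / (ref_other * foc_pick)
    # The log odds ratio is a common effect-size metric; its standard
    # error follows from the usual large-sample formula.
    log_or = math.log(or_)
    se = math.sqrt(1 / ref_pick + 1 / ref_other + 1 / foc_pick + 1 / foc_other)
    return or_, log_or, se

# Hypothetical counts: 40 of 200 reference examinees pick the distractor,
# versus 60 of 150 focal examinees.
or_, log_or, se = distractor_odds_ratio(40, 160, 60, 90)
```

A negative log odds ratio here means the focal group is drawn to the distractor more often than the reference group at the same totals.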
  • ABSTRACT: This study investigated the impact of anonymizing text on predicted scores made by two kinds of automated scoring engines: one that incorporates elements of natural language processing (NLP) and one that does not. Eight data sets (N = 22,029) were used to form both training and test sets in which the scoring engines had access to both text and human rater scores for training, but only the text for the test set. Machine ratings were applied under three conditions: (a) both the training and test were conducted with the original data, (b) the training was modeled on the anonymized data, but the predictions were made on the original data, and (c) both the training and test were conducted on the anonymized text. The first condition served as the baseline for subsequent comparisons on the mean, standard deviation, and quadratic weighted kappa. With one exception, results on scoring scales in the range of 1–6 were not significantly different. The results on scales that were much wider did show significant differences. The conclusion was that anonymizing text for operational use may have a differential impact on machine score predictions for both NLP and non-NLP applications.
    No preview · Article · Nov 2015 · Journal of Educational Measurement
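Quadratic weighted kappa, the agreement statistic used in the comparisons above, is straightforward to compute. A minimal stdlib sketch for integer ratings (the rating range and example vectors are illustrative):

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Quadratic weighted kappa between two integer rating vectors."""
    n_ratings = max_rating - min_rating + 1
    n = len(a)
    obs = Counter(zip(a, b))          # observed joint rating counts
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            # Quadratic disagreement weight, 0 on the diagonal.
            w = ((i - j) ** 2) / ((n_ratings - 1) ** 2)
            num += w * obs.get((i, j), 0) / n
            den += w * (hist_a.get(i, 0) * hist_b.get(j, 0)) / (n * n)
    return 1.0 - num / den

human   = [1, 2, 3, 4, 5, 6]
machine = [1, 2, 3, 4, 5, 6]
kappa = quadratic_weighted_kappa(human, machine, 1, 6)   # perfect agreement -> 1.0
```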
  • ABSTRACT: The credibility of standard-setting cut scores depends in part on two sources of consistency evidence: intrajudge and interjudge consistency. Although intrajudge consistency feedback has often been provided to Angoff judges in practice, more evidence is needed to determine whether it achieves its intended effect. In this randomized experiment with 36 judges, non-numeric item-level intrajudge consistency feedback was provided to treatment-group judges after the first and second rounds of Angoff ratings. Compared to the judges in the control condition, those receiving the feedback significantly improved their intrajudge consistency, with the effect being stronger after the first round than after the second round. To examine whether this feedback has deleterious effects on between-judge consistency, I also examined interjudge consistency at the cut score level and the item level using generalizability theory. The results showed that without the feedback, cut score variability worsened; with the feedback, idiosyncratic item-level variability improved. These results suggest that non-numeric intrajudge consistency feedback achieves its intended effect and potentially improves interjudge consistency. The findings contribute to standard-setting feedback research and provide empirical evidence for practitioners planning Angoff procedures.
    No preview · Article · Nov 2015 · Journal of Educational Measurement
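For readers unfamiliar with the procedure: an Angoff panel's cut score is typically the mean, across judges, of each judge's summed item ratings, and one simple intrajudge-consistency check correlates a judge's ratings with empirical item difficulties. A toy sketch with hypothetical ratings and p-values (not the article's non-numeric feedback mechanism):

```python
# Each judge estimates the probability that a minimally competent
# examinee answers each item correctly (hypothetical values).
ratings = {
    "judge_1": [0.6, 0.7, 0.5, 0.8],
    "judge_2": [0.5, 0.8, 0.4, 0.9],
    "judge_3": [0.7, 0.6, 0.6, 0.7],
}
p_values = [0.65, 0.75, 0.45, 0.85]   # hypothetical empirical item p-values

# A judge's recommended cut is the sum of their item ratings;
# the panel cut score is the mean across judges.
judge_cuts = {j: sum(r) for j, r in ratings.items()}
cut_score = sum(judge_cuts.values()) / len(judge_cuts)

def pearson(x, y):
    """Pearson correlation (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# One crude intrajudge-consistency index: how well a judge's ratings
# track the empirical difficulty ordering of the items.
consistency = {j: pearson(r, p_values) for j, r in ratings.items()}
```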
  • ABSTRACT: Given the importance of large-scale assessments to educational policy conversations, it is critical that subpopulation achievement is estimated reliably and with sufficient precision. Despite this importance, biased subpopulation estimates have been found to occur when variables in the conditioning model side of a latent regression model contain measurement error. As such, this article proposes a method to correct for misclassification in the conditioning model by way of the misclassification simulation extrapolation (MC-SIMEX) method. Although the proposed method is computationally intensive, results from a simulation study show that the MC-SIMEX method improves latent regression coefficients and associated subpopulation achievement estimates. The method is demonstrated with PIRLS 2006 data. The importance of collecting high-priority, policy-relevant contextual data from at least two sources is emphasized and practical applications are discussed.
    No preview · Article · Nov 2015 · Journal of Educational Measurement
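The core SIMEX idea is to deliberately add extra misclassification at increasing levels λ, observe how the naive estimate degrades, and extrapolate the trend back to λ = −1, the "no misclassification" limit. A deterministic sketch of the extrapolation step, with made-up attenuated estimates standing in for the simulation stage:

```python
def simex_extrapolate(points, x0=-1.0):
    """Quadratic (Lagrange) extrapolation through three
    (lambda, naive-estimate) points, evaluated at lambda = x0."""
    (x1, y1), (x2, y2), (x3, y3) = points

    def basis(xi, xj, xk):
        return ((x0 - xj) * (x0 - xk)) / ((xi - xj) * (xi - xk))

    return y1 * basis(x1, x2, x3) + y2 * basis(x2, x1, x3) + y3 * basis(x3, x1, x2)

# Hypothetical naive coefficients observed at added-misclassification
# levels 0, 1, and 2; the true coefficient is 1.0.
corrected = simex_extrapolate([(0, 0.80), (1, 0.64), (2, 0.512)])
```

Extrapolating to λ = −1 yields ≈ 0.99 here, much closer to the true 1.0 than the naive 0.80, which is the behavior the simulation study exploits.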
  • ABSTRACT: Criterion-related profile analysis (CPA) can be used to assess whether subscores of a test or test battery account for more criterion variance than does a single total score. Application of CPA to subscore evaluation is described, compared to alternative procedures, and illustrated using SAT data. Considerations other than validity and reliability are discussed, including broad societal goals (e.g., affirmative action), fairness, and ties in expected criterion predictions. In simulation data, CPA results were sensitive to subscore correlations, sample size, and the proportion of criterion-related variance accounted for by the subscores. CPA can be a useful component in a thorough subscore evaluation encompassing subscore reliability, validity, distinctiveness, fairness, and broader societal goals.
    No preview · Article · Sep 2015 · Journal of Educational Measurement
  • ABSTRACT: The purpose of this study was to investigate whether simulated differential motivation between the stakes for operational tests and anchor items produces an invalid linking result if the Rasch model is used to link the operational tests. This was done for an external anchor design and a variation of a pretest design. The study also investigated whether a constrained mixture Rasch model could identify latent classes in such a way that one latent class represented high-stakes responding while the other represented low-stakes responding. The results indicated that for an external anchor design, the Rasch linking result was only biased when the motivation level differed between the subpopulations to which the anchor items were administered. However, the mixture Rasch model did not identify the classes representing low-stakes and high-stakes responding. When a pretest design was used to link the operational tests by means of a Rasch model, the linking result was found to be biased in each condition. Bias increased as the percentage of students showing low-stakes responding to the anchor items increased. The mixture Rasch model only identified the classes representing low-stakes and high-stakes responding under a limited number of conditions.
    No preview · Article · Sep 2015 · Journal of Educational Measurement
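Under the Rasch model, two forms' scales differ only by an additive constant, so an external-anchor linking reduces to a mean shift in the anchor items' difficulty estimates — which is exactly where differentially motivated anchor responses can inject bias. A minimal mean/mean linking sketch with hypothetical difficulty estimates:

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

b_form_A = [-0.5, 0.0, 0.4, 1.1]   # anchor difficulties from the form-A calibration
b_form_B = [-0.2, 0.3, 0.7, 1.4]   # the same anchors from the form-B calibration

# Mean/mean linking: the scale shift is the average difficulty difference.
shift = sum(a - b for a, b in zip(b_form_A, b_form_B)) / len(b_form_A)

# Place form-B difficulties on the form-A scale.
b_B_on_A = [b + shift for b in b_form_B]
```

If low-stakes responding distorts the form-B anchor estimates, that distortion flows directly into `shift` and hence into every linked parameter, which is the bias mechanism the study simulates.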

  • No preview · Article · Jun 2015 · Journal of Educational Measurement
  • ABSTRACT: SIBTEST is a differential item functioning (DIF) detection method that is accurate and effective with small samples, in the presence of group mean differences, and for assessment of both uniform and nonuniform DIF. The presence of multilevel data with DIF detection has received increased attention. Ignoring such structure can inflate Type I error. This simulation study examines the performance of newly developed multilevel adaptations of SIBTEST in the presence of multilevel data. Data were simulated in a multilevel framework and both uniform and nonuniform DIF were assessed. Study results demonstrated that naïve SIBTEST and Crossing SIBTEST, ignoring the multilevel data structure, yield inflated Type I error rates, while certain multilevel extensions provided better error and accuracy control.
    No preview · Article · Jun 2015 · Journal of Educational Measurement
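The quantity SIBTEST builds on can be sketched in a few lines: match examinees on a valid-subtest score, then take a weighted average of reference-minus-focal means on the studied item across matched groups. This sketch omits SIBTEST's regression-based correction for measurement error in the matching score, so it corresponds to the naive (single-level) version discussed above; the example data are hypothetical.

```python
from collections import defaultdict

def beta_uni(scores_ref, scores_foc):
    """Naive SIBTEST-style index for a studied item.

    Each entry is (matching_subtest_score, studied_item_score in {0, 1}).
    Returns the sample-size-weighted average, over matched score groups,
    of the reference-minus-focal item means."""
    groups = defaultdict(lambda: {"ref": [], "foc": []})
    for s, y in scores_ref:
        groups[s]["ref"].append(y)
    for s, y in scores_foc:
        groups[s]["foc"].append(y)
    num = den = 0.0
    for g in groups.values():
        if g["ref"] and g["foc"]:          # only groups with both populations
            w = len(g["ref"]) + len(g["foc"])
            diff = sum(g["ref"]) / len(g["ref"]) - sum(g["foc"]) / len(g["foc"])
            num += w * diff
            den += w
    return num / den

ref = [(3, 1), (3, 1), (4, 1)]
foc = [(3, 0), (3, 1), (4, 0)]
# Positive values suggest the item favors the reference group at matched ability.
beta = beta_uni(ref, foc)
```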
  • ABSTRACT: Cognitive diagnosis models provide profile information about a set of latent binary attributes, whereas item response models yield a summary report on a latent continuous trait. To utilize the advantages of both models, higher order cognitive diagnosis models were developed in which information about both latent binary attributes and latent continuous traits is available. To facilitate the utility of cognitive diagnosis models, corresponding computerized adaptive testing (CAT) algorithms were developed. Most of them adopt the fixed-length rule to terminate CAT and are limited to ordinary cognitive diagnosis models. In this study, the higher order deterministic-input, noisy-and-gate (DINA) model was used as an example, and three criteria based on the minimum-precision termination rule were implemented: one for the latent class, one for the latent trait, and the other for both. The simulation results demonstrated that all of the termination criteria were successful when items were selected according to the Kullback-Leibler information and the posterior-weighted Kullback-Leibler information, and the minimum-precision rule outperformed the fixed-length rule with a similar test length in recovering the latent attributes and the latent trait.
    No preview · Article · Jun 2015 · Journal of Educational Measurement
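A minimum-precision termination rule is simple to express: keep administering items until the posterior standard deviation of the latent trait falls below a target. A grid-based sketch for Rasch-scored items under a standard-normal prior (the difficulties, responses, and threshold are illustrative; the article's higher order DINA machinery and Kullback-Leibler item selection are not reproduced here):

```python
import math

GRID = [g / 10 for g in range(-40, 41)]   # theta grid from -4 to 4

def posterior_sd(responses, difficulties):
    """Posterior SD of theta given Rasch items, standard-normal prior."""
    post = []
    for t in GRID:
        p = math.exp(-t * t / 2)          # unnormalized prior density
        for y, b in zip(responses, difficulties):
            pr = 1 / (1 + math.exp(-(t - b)))
            p *= pr if y else (1 - pr)
        post.append(p)
    total = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / total
    return var ** 0.5

responses    = [1, 0, 1, 1, 0, 1, 0, 1]
difficulties = [0.0, 0.5, -0.5, 0.2, 0.8, -0.2, 0.4, 0.1]

# Posterior SD after 0, 1, ..., 8 items: precision accumulates, so a
# minimum-precision rule stops at the first k where sds[k] < threshold.
sds = [posterior_sd(responses[:k], difficulties[:k]) for k in range(len(responses) + 1)]
```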
  • ABSTRACT: A mixed-effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed-effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated using a Metropolis-Hastings Robbins-Monro (MH-RM) stochastic imputation algorithm to accommodate for increased dimensionality due to modeling multiple design- and trait-based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four-parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression model, along with their extensions, are presented and discussed; Monte Carlo simulations are conducted to determine the efficiency of parameter recovery of the MH-RM algorithm; and an empirical example using the extended mixed-effects IRT model is presented.
    No preview · Article · Jun 2015 · Journal of Educational Measurement
  • ABSTRACT: The aim of this study is to assess the efficiency of using the multiple-group categorical confirmatory factor analysis (MCCFA) and the robust chi-square difference test in differential item functioning (DIF) detection for polytomous items under the minimum free baseline strategy. While testing for DIF items, despite the strong assumption that all but the examined item are set to be DIF-free, MCCFA with such a constrained baseline approach is commonly used in the literature. The present study relaxes this strong assumption and adopts the minimum free baseline approach where, aside from those parameters constrained for identification purposes, parameters of all but the examined item are allowed to differ among groups. Based on the simulation results, the robust chi-square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF for polytomous items in terms of the empirical power and Type I error rates. To sum up, MCCFA under the minimum free baseline strategy is useful for DIF detection for polytomous items.
    No preview · Article · Jun 2015 · Journal of Educational Measurement
  • ABSTRACT: This inquiry is an investigation of item response theory (IRT) proficiency estimators’ accuracy under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, such as difficulty level and module length. For each panel, we investigated the accuracy of examinees’ proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non-Bayesian (no prior) estimators was of more practical significance than the choice of number-correct versus item-pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non-Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low- and high-performing examinees.
    No preview · Article · Mar 2015 · Journal of Educational Measurement
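The Bayesian-versus-non-Bayesian contrast can be made concrete with a small grid computation under the Rasch model: EAP (posterior mean with a standard-normal prior) versus a prior-free grid MLE. The item difficulties and response pattern are hypothetical; for extreme patterns the prior shrinks EAP toward zero, one source of the score differences the study reports.

```python
import math

GRID = [g / 20 for g in range(-80, 81)]   # theta grid from -4 to 4

def rasch_like(t, responses, bs):
    """Likelihood of a dichotomous response pattern under the Rasch model."""
    like = 1.0
    for y, b in zip(responses, bs):
        p = 1 / (1 + math.exp(-(t - b)))
        like *= p if y else (1 - p)
    return like

def eap(responses, bs):
    """Bayesian estimate: posterior mean under a standard-normal prior."""
    post = [rasch_like(t, responses, bs) * math.exp(-t * t / 2) for t in GRID]
    return sum(t * w for t, w in zip(GRID, post)) / sum(post)

def mle(responses, bs):
    """Non-Bayesian estimate: grid maximizer of the likelihood alone."""
    return max(GRID, key=lambda t: rasch_like(t, responses, bs))

bs   = [-1.0, -0.5, 0.0, 0.5, 1.0]
resp = [1, 1, 1, 1, 0]             # a high-performing response pattern
theta_eap, theta_mle = eap(resp, bs), mle(resp, bs)
# The prior pulls EAP toward 0, so EAP < MLE for this pattern.
```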