Journal of Educational Measurement (J Educ Meas)

Publisher: National Council on Measurement in Education, Wiley

Journal description

The Journal of Educational Measurement (JEM) is a quarterly journal that publishes original measurement research and reports of applications of measurement in an educational context. Solicited reviews of books, software, published educational and psychological tests, and other important measurement works appear in the Review Section of the journal. In addition, comments on technical and substantive issues addressed in articles and reviews previously published in JEM are encouraged. Comments will be reviewed and the authors of the original article will be given the opportunity to respond.

Current impact factor: 1.00

Additional details

5-year impact: 1.30
Cited half-life: >10.0
Immediacy index: 0.00
Eigenfactor: 0.00
Article influence: 1.16
Website: Journal of Educational Measurement website
Other titles: Journal of educational measurement (Online), JEM
ISSN: 1745-3984
OCLC: 58648984
Material type: Document, Periodical, Internet resource
Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details


  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 2-year embargo
  • Conditions
    • Some journals have separate policies; please check with each journal directly
    • On author's personal website, institutional repositories, arXiv, AgEcon, PhilPapers, PubMed Central, RePEc or Social Science Research Network
    • Author's pre-print may not be updated with Publisher's Version/PDF
    • Author's pre-print must acknowledge acceptance for publication
    • Non-Commercial
    • Publisher's version/PDF cannot be used
    • Publisher source must be acknowledged with citation
    • Must link to publisher version with set statement (see policy)
    • If OnlineOpen is not available, BBSRC, EPSRC, MRC, NERC and STFC authors may self-archive after 6 months
    • If OnlineOpen is not available, AHRC and ESRC authors may self-archive after 12 months
    • This policy is an exception to the default policies of Wiley

Publications in this journal

  • ABSTRACT: Criterion-related profile analysis (CPA) can be used to assess whether subscores of a test or test battery account for more criterion variance than does a single total score. Application of CPA to subscore evaluation is described, compared to alternative procedures, and illustrated using SAT data. Considerations other than validity and reliability are discussed, including broad societal goals (e.g., affirmative action), fairness, and ties in expected criterion predictions. In simulation data, CPA results were sensitive to subscore correlations, sample size, and the proportion of criterion-related variance accounted for by the subscores. CPA can be a useful component in a thorough subscore evaluation encompassing subscore reliability, validity, distinctiveness, fairness, and broader societal goals.
    Journal of Educational Measurement 09/2015; 52(3). DOI:10.1111/jedm.12081
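    As a hedged illustration of the idea behind CPA (not the authors' implementation), the sketch below compares the criterion variance explained by a set of hypothetical subscores with the variance explained by the total score alone, using ordinary least squares on simulated data.

      # Minimal sketch, assuming simulated data: do subscores explain more
      # criterion variance than the single total score?
      import numpy as np

      rng = np.random.default_rng(0)
      n = 5000
      subscores = rng.normal(size=(n, 3))                         # three hypothetical subscores
      criterion = subscores @ np.array([0.5, 0.3, 0.1]) + rng.normal(size=n)
      total = subscores.sum(axis=1, keepdims=True)                # single total score

      def r_squared(X, y):
          """R^2 from an ordinary least squares fit with an intercept."""
          X1 = np.column_stack([np.ones(len(y)), X])
          beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
          resid = y - X1 @ beta
          return 1 - resid.var() / y.var()

      print("R^2, total score only :", r_squared(total, criterion))
      print("R^2, three subscores  :", r_squared(subscores, criterion))
      # A meaningfully larger subscore R^2 suggests the profile carries
      # criterion-related information beyond the total score.
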
  • ABSTRACT: The purpose of this study was to investigate whether simulated differential motivation, arising from the different stakes attached to operational tests and anchor items, produces an invalid linking result when the Rasch model is used to link the operational tests. This was done for an external anchor design and a variation of a pretest design. The study also investigated whether a constrained mixture Rasch model could identify latent classes such that one class represented high-stakes responding and the other represented low-stakes responding. The results indicated that for an external anchor design, the Rasch linking result was biased only when the motivation level differed between the subpopulations to which the anchor items were administered. However, the mixture Rasch model did not identify the classes representing low-stakes and high-stakes responding. When a pretest design was used to link the operational tests by means of a Rasch model, the linking result was found to be biased in every condition, and the bias increased as the percentage of students showing low-stakes responding to the anchor items increased. The mixture Rasch model identified the classes representing low-stakes and high-stakes responding only under a limited number of conditions.
    Journal of Educational Measurement 09/2015; 52(3). DOI:10.1111/jedm.12080
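    To make the linking step concrete, here is a hedged sketch (not the authors' simulation code) of Rasch mean-mean linking through common anchor items; any distortion of the anchor responses, such as low-stakes responding, propagates directly into the linking constant.

      # Minimal sketch of mean-mean linking with hypothetical anchor difficulties
      # estimated in two separate Rasch calibrations of the same anchor items.
      import numpy as np

      b_anchor_on_X = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # scale of form X
      b_anchor_on_Y = np.array([-0.9, -0.1, 0.4, 1.1, 1.8])   # scale of form Y

      linking_constant = np.mean(b_anchor_on_Y - b_anchor_on_X)

      def to_scale_Y(b_on_X):
          """Place form-X difficulties on the form-Y scale."""
          return np.asarray(b_on_X) + linking_constant

      print("linking constant:", linking_constant)             # 0.3 in this toy example
      print(to_scale_Y([0.0, 2.0]))
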
  • ABSTRACT: The amount of data available in the context of educational measurement has vastly increased in recent years. Such data are often incomplete, involve tests administered at different time points over the course of many years, and can therefore be quite challenging to model. In addition, intermediate results such as grades or report cards, which are available to pupils, teachers, parents, and policy makers, might influence future performance, adding to the modeling difficulties. We propose the use of simple data filters to obtain a reduced set of relevant data, which allows for simple checks on the relative development of persons, items, or both.
    Journal of Educational Measurement 09/2015; 52(3). DOI:10.1111/jedm.12078

  • Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12074
  • ABSTRACT: SIBTEST is a differential item functioning (DIF) detection method that is accurate and effective with small samples, in the presence of group mean differences, and for the assessment of both uniform and nonuniform DIF. DIF detection with multilevel data has received increased attention. Ignoring such structure can inflate Type I error. This simulation study examines the performance of newly developed multilevel adaptations of SIBTEST in the presence of multilevel data. Data were simulated in a multilevel framework, and both uniform and nonuniform DIF were assessed. The results demonstrated that naïve SIBTEST and Crossing SIBTEST, which ignore the multilevel data structure, yield inflated Type I error rates, whereas certain multilevel extensions provided better control of Type I error and better accuracy.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12071
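    The quantity at the heart of SIBTEST can be illustrated with the simplified sketch below (it omits the regression correction the full procedure applies and is not the multilevel adaptation studied here): a weighted average, across matching-subtest score levels, of the reference-focal difference in mean scores on the studied item.

      # Simplified, uncorrected version of the SIBTEST beta statistic.
      import numpy as np

      def beta_hat(item, match, group):
          """item: 0/1 scores on the studied item; match: matching-subtest totals;
          group: 1 = reference, 0 = focal."""
          item, match, group = map(np.asarray, (item, match, group))
          num, den = 0.0, 0
          for k in np.unique(match):
              ref = item[(match == k) & (group == 1)]
              foc = item[(match == k) & (group == 0)]
              if len(ref) and len(foc):                  # need both groups at level k
                  w = len(ref) + len(foc)
                  num += w * (ref.mean() - foc.mean())
                  den += w
          return num / den

      rng = np.random.default_rng(1)
      match = rng.integers(0, 21, size=2000)
      group = rng.integers(0, 2, size=2000)
      p = 1 / (1 + np.exp(-(match - 10) / 3)) - 0.08 * (group == 0)   # small uniform DIF
      item = rng.binomial(1, np.clip(p, 0, 1))
      print("uncorrected beta-hat:", beta_hat(item, match, group))    # positive in this toy example
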
  • ABSTRACT: Cognitive diagnosis models provide profile information about a set of latent binary attributes, whereas item response models yield a summary report on a latent continuous trait. To utilize the advantages of both models, higher order cognitive diagnosis models were developed, in which information about both latent binary attributes and latent continuous traits is available. To increase the utility of cognitive diagnosis models, corresponding computerized adaptive testing (CAT) algorithms were developed; most of them adopt a fixed-length termination rule and are limited to ordinary cognitive diagnosis models. In this study, the higher order deterministic-input, noisy-and-gate (DINA) model was used as an example, and three criteria based on the minimum-precision termination rule were implemented: one for the latent class, one for the latent trait, and one for both. The simulation results demonstrated that all of the termination criteria were successful when items were selected according to the Kullback-Leibler information and the posterior-weighted Kullback-Leibler information, and that the minimum-precision rule outperformed the fixed-length rule in recovering the latent attributes and the latent trait at a similar test length.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12069
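    A hedged sketch of the minimum-precision termination logic (thresholds and names are illustrative, not the authors' settings): testing stops once the posterior about the examinee is sufficiently precise, or a maximum length is reached.

      # Dual minimum-precision rule: precise latent trait AND confident latent class.
      def should_stop(theta_posterior_sd, max_class_posterior, items_given,
                      sd_cutoff=0.3, class_cutoff=0.9, max_items=30):
          precise_trait = theta_posterior_sd <= sd_cutoff
          confident_class = max_class_posterior >= class_cutoff
          return (precise_trait and confident_class) or items_given >= max_items

      # e.g., after 14 items: SE(theta) = 0.28 and P(modal attribute pattern) = 0.93
      print(should_stop(0.28, 0.93, 14))   # True -> terminate the adaptive test
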
  • ABSTRACT: A mixed-effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed-effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated with a Metropolis-Hastings Robbins-Monro (MH-RM) stochastic imputation algorithm to accommodate the increased dimensionality that results from modeling multiple design- and trait-based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four-parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression models, along with their extensions, are presented and discussed; Monte Carlo simulations are conducted to determine the efficiency of parameter recovery under the MH-RM algorithm; and an empirical example using the extended mixed-effects IRT model is presented.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12072
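    For readers unfamiliar with the framework, explanatory IRT models of this kind can be written generically as a generalized linear mixed model (the design vectors below are generic placeholders, not the article's notation):

      \operatorname{logit} P(Y_{pi} = 1 \mid \mathbf{u}_p)
        = \mathbf{x}_{pi}^{\top}\boldsymbol{\beta} + \mathbf{z}_{pi}^{\top}\mathbf{u}_p,
      \qquad \mathbf{u}_p \sim N(\mathbf{0}, \boldsymbol{\Sigma}),

    where the fixed coefficients collect item and covariate effects and u_p collects the person-specific random effects; the Rasch model is the special case with a single random person intercept and fixed item effects.
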
  • ABSTRACT: The aim of this study is to assess the efficiency of using multiple-group categorical confirmatory factor analysis (MCCFA) and the robust chi-square difference test for differential item functioning (DIF) detection in polytomous items under the minimum free baseline strategy. When testing for DIF, MCCFA with a constrained baseline, which rests on the strong assumption that all items other than the examined item are DIF-free, is commonly used in the literature. The present study relaxes this strong assumption and adopts the minimum free baseline approach, in which, aside from the parameters constrained for identification purposes, the parameters of all items other than the examined item are allowed to differ among groups. Based on the simulation results, the robust chi-square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF in polytomous items in terms of empirical power and Type I error rates. In sum, MCCFA under the minimum free baseline strategy is useful for DIF detection in polytomous items.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12073
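    The comparison itself is a chi-square difference test between a model that constrains the studied item to be equal across groups and the less constrained baseline; the sketch below shows only the unadjusted version of that calculation on hypothetical fit statistics (the study uses a robust, mean-and-variance-adjusted statistic, which is not reproduced here).

      # Naive (unadjusted) chi-square difference test on hypothetical fit statistics.
      from scipy.stats import chi2

      def chisq_difference(chisq_constrained, df_constrained, chisq_free, df_free):
          d_chisq = chisq_constrained - chisq_free
          d_df = df_constrained - df_free
          return d_chisq, d_df, chi2.sf(d_chisq, d_df)   # difference, df, p-value

      print(chisq_difference(chisq_constrained=312.4, df_constrained=142,
                             chisq_free=301.1, df_free=139))
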
  • ABSTRACT: This study investigates the accuracy of item response theory (IRT) proficiency estimators under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two assembly conditions in each module: difficulty level and module length. For each panel, we investigated the accuracy of examinees' proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non-Bayesian (no prior) estimators was of more practical significance than the choice of number-correct versus item-pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non-Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low- and high-performing examinees.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12063
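    The Bayesian versus non-Bayesian contrast can be made concrete with a small sketch (hypothetical 2PL item parameters, not the operational MST scoring): an EAP estimate under a standard normal prior versus a maximum likelihood estimate for the same response pattern.

      # Grid-based ML and EAP proficiency estimates under a 2PL model.
      import numpy as np

      a = np.array([1.0, 1.2, 0.8, 1.5, 0.9])     # discriminations (hypothetical)
      b = np.array([-1.0, -0.3, 0.2, 0.8, 1.4])   # difficulties (hypothetical)
      x = np.array([1, 1, 1, 0, 0])               # one examinee's response pattern

      def loglik(theta):
          p = 1 / (1 + np.exp(-a * (theta - b)))
          return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

      grid = np.linspace(-4, 4, 161)
      ll = np.array([loglik(t) for t in grid])

      theta_ml = grid[np.argmax(ll)]                    # no prior
      post = np.exp(ll) * np.exp(-grid**2 / 2)          # times standard normal prior
      theta_eap = np.sum(grid * post) / np.sum(post)

      print("ML estimate :", round(float(theta_ml), 2))
      print("EAP estimate:", round(float(theta_eap), 2))   # shrunk toward the prior mean
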
  • ABSTRACT: With an increase in the number of online tests, the number of interruptions during testing due to unexpected technical issues seems to be on the rise. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees' scores. Researchers such as Hill and Sinharay et al. have examined the impact of interruptions at an aggregate level. However, there is a lack of research on assessing the impact of interruptions at the individual level; we attempt to fill that void. We suggest four methodological approaches, based primarily on statistical hypothesis testing, linear regression, and item response theory, that can provide evidence on the individual-level impact of interruptions. We perform a realistic simulation study to compare the Type I error rates and power of the suggested approaches. We then apply the approaches to data from the 2013 Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test, which experienced interruptions.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12064
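    One regression-based idea can be sketched as follows (a simplified stand-in, not the authors' exact procedures): predict the post-interruption score from the pre-interruption score using non-interrupted examinees, then examine how far an interrupted examinee falls below that prediction.

      # Standardized-residual check for an individual interrupted examinee.
      import numpy as np
      from scipy.stats import norm

      rng = np.random.default_rng(2)
      pre = rng.normal(50, 10, size=1000)                  # non-interrupted examinees
      post = 0.8 * pre + 10 + rng.normal(0, 5, size=1000)

      slope, intercept = np.polyfit(pre, post, 1)
      resid_sd = np.std(post - (slope * pre + intercept), ddof=2)

      def impact_z(pre_score, post_score):
          """Large negative values suggest a score loss beyond normal variation."""
          return (post_score - (slope * pre_score + intercept)) / resid_sd

      z = impact_z(pre_score=55, post_score=40)            # an interrupted examinee
      print("z =", round(float(z), 2), "  one-sided p =", round(float(norm.cdf(z)), 4))
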
  • ABSTRACT: The assessment of differential item functioning (DIF) is routinely conducted to ensure test fairness and validity. Although many DIF assessment methods have been developed in the context of classical test theory and item response theory, they are not applicable to cognitive diagnosis models (CDMs), as the underlying latent attributes of CDMs are multidimensional and binary. This study proposes a very general DIF assessment method in the CDM framework that is applicable to various CDMs, more than two groups of examinees, and multiple grouping variables that are categorical, continuous, observed, or latent. The parameters can be estimated with Markov chain Monte Carlo algorithms implemented in the freeware WinBUGS. Simulation results demonstrated good parameter recovery and advantages in DIF assessment for the new method over the Wald method.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12061
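    To see what item-level DIF means in this framework, consider a hedged DINA-type sketch (illustrative parameter values only): the same attribute pattern yields different correct-response probabilities across groups when an item's guessing or slipping parameter differs by group.

      # DIF in a DINA item: group-specific slipping parameter.
      import numpy as np

      q = np.array([1, 1, 0])                      # attributes required by the item
      guess = {"reference": 0.15, "focal": 0.15}
      slip  = {"reference": 0.10, "focal": 0.25}   # group difference -> DIF

      def p_correct(alpha, group):
          mastered = bool(np.all(alpha[q == 1] == 1))
          return (1 - slip[group]) if mastered else guess[group]

      alpha = np.array([1, 1, 0])                  # examinee masters both required attributes
      print(p_correct(alpha, "reference"), p_correct(alpha, "focal"))   # 0.90 vs 0.75
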
  • Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12068
  • ABSTRACT: Building on previous work by Lord and Ogasawara on dichotomous items, this article proposes an approach to deriving the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach can be used in place of empirical methods, such as the bootstrap, to obtain standard errors of equated scores. Formulas are introduced for the derivatives needed to compute the asymptotic standard errors. The approach was validated using mean-mean, mean-sigma, random-groups, or concurrent calibration equating of simulated samples, for tests modeled with the generalized partial credit model or the graded response model.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12065
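    Asymptotic standard errors of this kind follow the general delta-method form; a generic, hedged statement (with lambda denoting the vector of estimated item and transformation parameters, not the article's specific derivations) is

      \operatorname{SE}\{\hat{e}_Y(x)\}
        \approx \sqrt{\left(\frac{\partial e_Y(x)}{\partial \boldsymbol{\lambda}}\right)^{\!\top}
          \widehat{\boldsymbol{\Sigma}}_{\hat{\boldsymbol{\lambda}}}\,
          \frac{\partial e_Y(x)}{\partial \boldsymbol{\lambda}}},

    where the middle term is the estimated asymptotic covariance matrix of the parameter estimates and the derivatives are the quantities for which the article introduces formulas.
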
  • ABSTRACT: Research on equating with small samples has shown that methods with stronger assumptions and fewer statistical estimates can lead to decreased error in the estimated equating function. This article introduces a new approach to linear observed-score equating, one which provides flexible control over how form difficulty is assumed versus estimated to change across the score scale. A general linear method is presented as an extension of traditional linear methods. The general method is then compared to other linear and nonlinear methods in terms of accuracy in estimating a criterion equating function. Results from two parametric bootstrapping studies based on real data demonstrate the usefulness of the general linear method.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12062
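    As background, the traditional linear observed-score equating that the general linear method extends simply matches the first two moments of the two score distributions; a minimal sketch with simulated raw scores:

      # Traditional linear equating of form X onto the form-Y scale.
      import numpy as np

      def linear_equate(x_scores, y_scores):
          mu_x, sd_x = np.mean(x_scores), np.std(x_scores, ddof=1)
          mu_y, sd_y = np.mean(y_scores), np.std(y_scores, ddof=1)
          slope = sd_y / sd_x
          intercept = mu_y - slope * mu_x
          return lambda x: slope * np.asarray(x) + intercept

      rng = np.random.default_rng(3)
      form_x = rng.binomial(40, 0.55, size=500)    # hypothetical raw scores
      form_y = rng.binomial(40, 0.60, size=500)
      equate_xy = linear_equate(form_x, form_y)
      print(equate_xy([10, 20, 30]))
      # Mean equating fixes the slope at 1 (a pure difficulty shift) and identity
      # equating fixes slope = 1 and intercept = 0; the general linear method of
      # the abstract controls how much of this relation is assumed versus estimated.
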
  • ABSTRACT: The assumption of conditional independence between the responses and the response times (RTs) for a given person is common in RT modeling. However, when the speed of a test taker is not constant, this assumption will be violated. In this article we propose a conditional joint model for item responses and RTs, which incorporates a covariance structure to explain the local dependency between speed and accuracy. To obtain information about the population of test takers, the new model was embedded in the hierarchical framework proposed by van der Linden (2007). A fully Bayesian approach using a straightforward Markov chain Monte Carlo (MCMC) sampler was developed to estimate all parameters in the model. The deviance information criterion (DIC) and the Bayes factor (BF) were employed to compare the goodness of fit of models with two different parameter structures. A Bayesian residual analysis method was also employed to evaluate the fit of the RT model. Based on the simulations, we conclude that (1) the new model noticeably improves parameter recovery for both the item parameters and the examinees' latent traits when the assumption of conditional independence between the item responses and the RTs is relaxed, and (2) the proposed MCMC sampler adequately estimates the model parameters. The applicability of our approach is illustrated with an empirical example, and the model fit indices indicated a preference for the new model.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12060
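    As a point of reference, the response-time component of van der Linden's hierarchical framework, which the proposed conditional joint model extends, treats log response time as normal with mean equal to the item's time intensity minus the person's speed; a hedged simulation sketch with illustrative parameter values:

      # Lognormal response-time model: ln T_ij ~ N(beta_j - tau_i, 1 / alpha_j^2).
      import numpy as np

      rng = np.random.default_rng(4)
      n_persons, n_items = 1000, 20
      tau = rng.normal(0.0, 0.3, size=n_persons)     # person speed
      beta = rng.normal(4.0, 0.4, size=n_items)      # item time intensity (log seconds)
      alpha = rng.uniform(1.5, 2.5, size=n_items)    # item time discrimination

      log_rt = (beta - tau[:, None]) + rng.normal(size=(n_persons, n_items)) / alpha
      rt_seconds = np.exp(log_rt)
      print("median response time per item (s):",
            np.round(np.median(rt_seconds, axis=0), 1)[:5])
      # The article's conditional joint model adds a covariance structure so that
      # responses and response times need not be conditionally independent.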