Journal of Educational Measurement (J Educ Meas)

Publisher: National Council on Measurement in Education, Wiley

Journal description

The Journal of Educational Measurement (JEM) is a quarterly journal that publishes original measurement research and reports of applications of measurement in an educational context. Solicited reviews of books, software, published educational and psychological tests, and other important measurement works appear in the Review Section of the journal. In addition, comments on technical and substantive issues addressed in articles and reviews previously published in JEM are encouraged. Comments will be reviewed and the authors of the original article will be given the opportunity to respond.

Current impact factor: 1.00

Additional details

5-year impact: 1.30
Cited half-life: 0.00
Immediacy index: 0.00
Eigenfactor: 0.00
Article influence: 1.16
Website: Journal of Educational Measurement website
Other titles: Journal of educational measurement (Online), JEM
ISSN: 1745-3984
OCLC: 58648984
Material type: Document, Periodical, Internet resource
Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details


  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 2-year embargo
  • Conditions
    • Some journals have separate policies, please check with each journal directly
    • On author's personal website, institutional repositories, arXiv, AgEcon, PhilPapers, PubMed Central, RePEc or Social Science Research Network
    • Author's pre-print may not be updated with Publisher's Version/PDF
    • Author's pre-print must acknowledge acceptance for publication
    • On a non-profit server
    • Publisher's version/PDF cannot be used
    • Publisher source must be acknowledged with citation
    • Must link to publisher version with set statement (see policy)
    • If OnlineOpen is not available, BBSRC, EPSRC, MRC, NERC and STFC authors may self-archive after 6 months
    • If OnlineOpen is not available, AHRC and ESRC authors may self-archive after 12 months
    • This policy is an exception to the default policies of 'Wiley'
  • Classification
    • yellow

Publications in this journal

  • ABSTRACT: With an increase in the number of online tests, the number of interruptions during testing due to unexpected technical issues seems to be on the rise. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees' scores. Researchers such as Hill and Sinharay et al. examined the impact of interruptions at an aggregate level. However, there is a lack of research on the assessment of impact of interruptions at an individual level. We attempt to fill that void. We suggest four methodological approaches, primarily based on statistical hypothesis testing, linear regression, and item response theory, which can provide evidence on the individual-level impact of interruptions. We perform a realistic simulation study to compare the Type I error rate and power of the suggested approaches. We then apply the approaches to data from the 2013 Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test that experienced interruptions.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12064
  • ABSTRACT: This inquiry is an investigation of item response theory (IRT) proficiency estimators’ accuracy under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, such as difficulty level and module length. For each panel, we investigated the accuracy of examinees’ proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non-Bayesian (no prior) estimators was of more practical significance than the choice of number-correct versus item-pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non-Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low- and high-performing examinees.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12063
  • Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12068
  • ABSTRACT: The assessment of differential item functioning (DIF) is routinely conducted to ensure test fairness and validity. Although many DIF assessment methods have been developed in the context of classical test theory and item response theory, they are not applicable to cognitive diagnosis models (CDMs), as the underlying latent attributes of CDMs are multidimensional and binary. This study proposes a very general DIF assessment method in the CDM framework which is applicable to various CDMs, more than two groups of examinees, and multiple grouping variables that are categorical, continuous, observed, or latent. The parameters can be estimated with Markov chain Monte Carlo algorithms implemented in the freeware WinBUGS. Simulation results demonstrated good parameter recovery and advantages in DIF assessment for the new method over the Wald method.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12061
  • ABSTRACT: Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean-mean, mean-sigma, random-groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12065
  • ABSTRACT: Research on equating with small samples has shown that methods with stronger assumptions and fewer statistical estimates can lead to decreased error in the estimated equating function. This article introduces a new approach to linear observed-score equating, one which provides flexible control over how form difficulty is assumed versus estimated to change across the score scale. A general linear method is presented as an extension of traditional linear methods. The general method is then compared to other linear and nonlinear methods in terms of accuracy in estimating a criterion equating function. Results from two parametric bootstrapping studies based on real data demonstrate the usefulness of the general linear method.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12062
  • ABSTRACT: The assumption of conditional independence between the responses and the response times (RTs) for a given person is common in RT modeling. However, when the speed of a test taker is not constant, this assumption will be violated. In this article we propose a conditional joint model for item responses and RTs, which incorporates a covariance structure to explain the local dependency between speed and accuracy. To obtain information about the population of test takers, the new model was embedded in the hierarchical framework proposed by van der Linden (2007). A fully Bayesian approach using a straightforward Markov chain Monte Carlo (MCMC) sampler was developed to estimate all parameters in the model. The deviance information criterion (DIC) and the Bayes factor (BF) were employed to compare the goodness of fit between the models with two different parameter structures. The Bayesian residual analysis method was also employed to evaluate the fit of the RT model. Based on the simulations, we conclude that (1) the new model noticeably improves the parameter recovery for both the item parameters and the examinees’ latent traits when the assumptions of conditional independence between the item responses and the RTs are relaxed and (2) the proposed MCMC sampler adequately estimates the model parameters. The applicability of our approach is illustrated with an empirical example, and the model fit indices indicated a preference for the new model.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12060
  • Journal of Educational Measurement 01/2015;
  • ABSTRACT: Students’ performance in assessments is commonly attributed to more or less effective teaching. This implies that students’ responses are significantly affected by instruction. However, the assumption that outcome measures are indeed instructionally sensitive has scarcely been investigated empirically. In the present study, we propose a longitudinal multilevel-differential item functioning (DIF) model to combine two existing yet independent approaches to evaluate items’ instructional sensitivity. The model permits a more informative judgment of instructional sensitivity, allowing the distinction of global and differential sensitivity. As an illustration, the model is applied to two empirical data sets, with classical indices (Pretest–Posttest Difference Index and posttest multilevel-DIF) computed for comparison. Results suggest that the approach works well in the application to empirical data, and may provide important information to test developers.
    Journal of Educational Measurement 12/2014; 51(4):381-399. DOI:10.1111/jedm.12051
  • ABSTRACT: Logistic regression is a popular method for detecting uniform and nonuniform differential item functioning (DIF) effects. Theoretical formulas for the power and sample size calculations are derived for likelihood ratio tests and Wald tests based on the asymptotic distribution of the maximum likelihood estimators for the logistic regression model. The power is related to the item response function (IRF) for the studied item, the latent trait distributions, and the sample sizes for the reference and focal groups. Simulation studies show that the theoretical values calculated from the formulas derived in the article are close to those observed in the simulated data when the assumptions are satisfied. The robustness of the power formulas is studied with simulations when the assumptions are violated.
    Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12058
  • Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12056
  • ABSTRACT: With an increase in the number of online tests, interruptions during testing due to unexpected technical issues seem unavoidable. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees’ scores. There is a lack of research on this topic due to the novelty of the problem. This article is an attempt to fill that void. Several methods, primarily based on propensity score matching, linear regression, and item response theory, were suggested to determine the overall impact of the interruptions on the examinees’ scores. A realistic simulation study shows that the suggested methods have satisfactory Type I error rate and power. Then the methods were applied to data from the Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test that experienced interruptions in 2013. The results indicate that the interruptions did not have a significant overall impact on the student scores for the ISTEP+ test.
    Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12052
  • ABSTRACT: Computerized adaptive testing offers the possibility of gaining information on both the overall ability and cognitive profile in a single assessment administration. Some algorithms aiming for these dual purposes have been proposed, including the shadow test approach, the dual information method (DIM), and the constraint weighted method. The current study proposed two new methods, aggregate ranked information index (ARI) and aggregate standardized information index (ASI), which appropriately addressed the noncompatibility issue inherent in the original DIM method. More flexible weighting schemes that put different emphasis on information about general ability (i.e., in item response theory) and information about cognitive profile (i.e., in cognitive diagnostic modeling) were also explored. Two simulation studies were carried out to investigate the effectiveness of the new methods and weighting schemes. Results showed that the new methods with the flexible weighting schemes could produce more accurate estimation of both overall ability and cognitive profile than the original DIM. Among them, the ASI with both empirical and theoretical weights is recommended, and attribute-level weighting scheme is preferred if some attributes are considered more important from a substantive perspective.
    Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12057
  • Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12055
  • Journal of Educational Measurement 09/2014; 51(3). DOI:10.1111/jedm.12050
  • ABSTRACT: In this article, we introduce a section preequating (SPE) method (linear and nonlinear) under the randomly equivalent groups design. In this equating design, sections of Test X (a future new form) and another existing Test Y (an old form already on scale) are administered. The sections of Test X are equated to Test Y, after adjusting for the imperfect correlation between sections of Test X, to obtain the equated score on the complete form of X. Simulations and a real-data application show that the proposed SPE method is fairly simple and accurate.
    Journal of Educational Measurement 09/2014; 51(3). DOI:10.1111/jedm.12049
  • ABSTRACT: We investigate the current bandwidth selection methods in kernel equating and propose a method based on Silverman's rule of thumb for selecting the bandwidth parameters. In kernel equating, the bandwidth parameters have previously been obtained by minimizing a penalty function. This minimization process has been criticized by practitioners for being too complex and for not offering sufficient smoothing in certain cases. In addition, the bandwidth parameters have been treated as constants in the derivation of the standard error of equating even when they were selected by considering the observed data. Here, the bandwidth selection is simplified, and modified standard errors of equating (SEEs) that reflect the bandwidth selection method are derived. The method is illustrated with real data examples and simulated data.
    Journal of Educational Measurement 09/2014; 51(3). DOI:10.1111/jedm.12044
  • ABSTRACT: Preequating is in demand because it reduces score reporting time. In this article, we evaluated an observed-score preequating method: the empirical item characteristic curve (EICC) method, which makes preequating without item response theory (IRT) possible. EICC preequating results were compared with a criterion equating and with IRT true-score preequating conversions. Results suggested that the EICC preequating method worked well under the conditions considered in this study. The difference between the EICC preequating conversion and the criterion equating was smaller than .5 raw-score points (a practical criterion often used to evaluate equating quality) between the 5th and 95th percentiles of the new form total score distribution. EICC preequating also performed similarly to or slightly better than IRT true-score preequating.
    Journal of Educational Measurement 09/2014; 51(3). DOI:10.1111/jedm.12047