Journal of Educational Measurement (J Educ Meas)

Publisher: National Council on Measurement in Education, Wiley

Journal description

The Journal of Educational Measurement (JEM) is a quarterly journal that publishes original measurement research and reports of applications of measurement in an educational context. Solicited reviews of books, software, published educational and psychological tests, and other important measurement works appear in the Review Section of the journal. In addition, comments on technical and substantive issues addressed in articles and reviews previously published in JEM are encouraged. Comments will be reviewed and the authors of the original article will be given the opportunity to respond.

Current impact factor: 1.00

Impact Factor Rankings

Additional details

5-year impact 1.30
Cited half-life 0.00
Immediacy index 0.00
Eigenfactor 0.00
Article influence 1.16
Website Journal of Educational Measurement website
Other titles Journal of educational measurement (Online), JEM
ISSN 1745-3984
OCLC 58648984
Material type Document, Periodical, Internet resource
Document type Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details


  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 2 years embargo
  • Conditions
    • Some journals have separate policies, please check with each journal directly
    • On author's personal website, institutional repositories, arXiv, AgEcon, PhilPapers, PubMed Central, RePEc or Social Science Research Network
    • Author's pre-print may not be updated with Publisher's Version/PDF
    • Author's pre-print must acknowledge acceptance for publication
    • On a non-profit server
    • Publisher's version/PDF cannot be used
    • Publisher source must be acknowledged with citation
    • Must link to publisher version with set statement (see policy)
    • If OnlineOpen is not available, BBSRC, EPSRC, MRC, NERC and STFC authors, may self-archive after 6 months
    • If OnlineOpen is not available, AHRC and ESRC authors, may self-archive after 12 months
    • This policy is an exception to the default policies of 'Wiley'
  • Classification
    ​ yellow

Publications in this journal

  • [Show abstract] [Hide abstract]
    ABSTRACT: SIBTEST is a differential item functioning (DIF) detection method that is accurate and effective with small samples, in the presence of group mean differences, and for assessment of both uniform and nonuniform DIF. The presence of multilevel data with DIF detection has received increased attention. Ignoring such structure can inflate Type I error. This simulation study examines the performance of newly developed multilevel adaptations of SIBTEST in the presence of multilevel data. Data were simulated in a multilevel framework and both uniform and nonuniform DIF were assessed. Study results demonstrated that naïve SIBTEST and Crossing SIBTEST, ignoring the multilevel data structure, yield inflated Type I error rates, while certain multilevel extensions provided better error and accuracy control.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12071
  • Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12074
  • [Show abstract] [Hide abstract]
    ABSTRACT: Cognitive diagnosis models provide profile information about a set of latent binary attributes, whereas item response models yield a summary report on a latent continuous trait. To utilize the advantages of both models, higher order cognitive diagnosis models were developed in which information about both latent binary attributes and latent continuous traits is available. To facilitate the utility of cognitive diagnosis models, corresponding computerized adaptive testing (CAT) algorithms were developed. Most of them adopt the fixed-length rule to terminate CAT and are limited to ordinary cognitive diagnosis models. In this study, the higher order deterministic-input, noisy-and-gate (DINA) model was used as an example, and three criteria based on the minimum-precision termination rule were implemented: one for the latent class, one for the latent trait, and the other for both. The simulation results demonstrated that all of the termination criteria were successful when items were selected according to the Kullback-Leibler information and the posterior-weighted Kullback-Leibler information, and the minimum-precision rule outperformed the fixed-length rule with a similar test length in recovering the latent attributes and the latent trait.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12069
  • [Show abstract] [Hide abstract]
    ABSTRACT: A mixed-effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed-effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated using a Metropolis-Hastings Robbins-Monro (MH-RM) stochastic imputation algorithm to accommodate for increased dimensionality due to modeling multiple design- and trait-based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four-parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression model, along with their extensions, are presented and discussed, Monte Carlo simulations are conducted to determine the efficiency of parameter recovery of the MH-RM algorithm, and an empirical example using the extended mixed-effects IRT model is presented.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12072
  • [Show abstract] [Hide abstract]
    ABSTRACT: Admission decisions frequently rely on multiple assessments. As a consequence, it is important to explore rational approaches to combine the information from different educational tests. For example, U.S. graduate schools usually receive both TOEFL iBT® scores and GRE® General scores of foreign applicants for admission; however, little guidance has been given to combine information from these two assessments, even though the relationships between such sections as GRE Verbal and TOEFL iBT Reading are obvious. In this study, principles are provided to explore the extent to which different assessments complement one another and are distinguishable. Augmentation approaches developed for individual tests are applied to provide an accurate evaluation of combined assessments. Because augmentation methods require estimates of measurement error and internal reliability data are unavailable, required estimates of measurement error are obtained from repeaters, examinees who took the same test more than once. Because repeaters are not representative of all examinees in typical assessments, minimum discriminant information adjustment techniques are applied to the available sample of repeaters to treat the effect of selection bias. To illustrate methodology, combining information from TOEFL iBT scores and GRE General scores is examined. Analysis suggests that information from the GRE General and TOEFL iBT assessments is complementary but not redundant, indicating that the two tests measure related but somewhat different constructs. The proposed methodology can be readily applied to other situations where multiple assessments are needed.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12075
  • [Show abstract] [Hide abstract]
    ABSTRACT: The aim of this study is to assess the efficiency of using the multiple-group categorical confirmatory factor analysis (MCCFA) and the robust chi-square difference test in differential item functioning (DIF) detection for polytomous items under the minimum free baseline strategy. While testing for DIF items, despite the strong assumption that all but the examined item are set to be DIF-free, MCCFA with such a constrained baseline approach is commonly used in the literature. The present study relaxes this strong assumption and adopts the minimum free baseline approach where, aside from those parameters constrained for identification purpose, parameters of all but the examined item are allowed to differ among groups. Based on the simulation results, the robust chi-square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF for polytomous items in terms of the empirical power and Type I error rates. To sum up, MCCFA under the minimum free baseline strategy is useful for DIF detection for polytomous items.
    Journal of Educational Measurement 06/2015; 52(2). DOI:10.1111/jedm.12073
  • [Show abstract] [Hide abstract]
    ABSTRACT: This inquiry is an investigation of item response theory (IRT) proficiency estimators’ accuracy under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, such as difficulty level and module length. For each panel, we investigated the accuracy of examinees’ proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non-Bayesian (no prior) estimators was of more practical significance than the choice of number-correct versus item-pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non-Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low- and high-performing examinees.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12063
  • [Show abstract] [Hide abstract]
    ABSTRACT: With an increase in the number of online tests, the number of interruptions during testing due to unexpected technical issues seems to be on the rise. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees' scores. Researchers such as Hill and Sinharay et al. examined the impact of interruptions at an aggregate level. However, there is a lack of research on the assessment of impact of interruptions at an individual level. We attempt to fill that void. We suggest four methodological approaches, primarily based on statistical hypothesis testing, linear regression, and item response theory, which can provide evidence on the individual-level impact of interruptions. We perform a realistic simulation study to compare the Type I error rate and power of the suggested approaches. We then apply the approaches to data from the 2013 Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test that experienced interruptions.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12064
  • Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12068
  • [Show abstract] [Hide abstract]
    ABSTRACT: The assessment of differential item functioning (DIF) is routinely conducted to ensure test fairness and validity. Although many DIF assessment methods have been developed in the context of classical test theory and item response theory, they are not applicable for cognitive diagnosis models (CDMs), as the underlying latent attributes of CDMs are multidimensional and binary. This study proposes a very general DIF assessment method in the CDM framework which is applicable for various CDMs, more than two groups of examinees, and multiple grouping variables that are categorical, continuous, observed, or latent. The parameters can be estimated with Markov chain Monte Carlo algorithms implemented in the freeware WinBUGS. Simulation results demonstrated a good parameter recovery and advantages in DIF assessment for the new method over the Wald method.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12061
  • [Show abstract] [Hide abstract]
    ABSTRACT: Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean-mean, mean-sigma, random-groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12065
  • [Show abstract] [Hide abstract]
    ABSTRACT: The assumption of conditional independence between the responses and the response times (RTs) for a given person is common in RT modeling. However, when the speed of a test taker is not constant, this assumption will be violated. In this article we propose a conditional joint model for item responses and RTs, which incorporates a covariance structure to explain the local dependency between speed and accuracy. To obtain information about the population of test takers, the new model was embedded in the hierarchical framework proposed by van der Linden (2007). A fully Bayesian approach using a straightforward Markov chain Monte Carlo (MCMC) sampler was developed to estimate all parameters in the model. The deviance information criterion (DIC) and the Bayes factor (BF) were employed to compare the goodness of fit between the models with two different parameter structures. The Bayesian residual analysis method was also employed to evaluate the fit of the RT model. Based on the simulations, we conclude that (1) the new model noticeably improves the parameter recovery for both the item parameters and the examinees’ latent traits when the assumptions of conditional independence between the item responses and the RTs are relaxed and (2) the proposed MCMC sampler adequately estimates the model parameters. The applicability of our approach is illustrated with an empirical example, and the model fit indices indicated a preference for the new model.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12060
  • [Show abstract] [Hide abstract]
    ABSTRACT: Research on equating with small samples has shown that methods with stronger assumptions and fewer statistical estimates can lead to decreased error in the estimated equating function. This article introduces a new approach to linear observed-score equating, one which provides flexible control over how form difficulty is assumed versus estimated to change across the score scale. A general linear method is presented as an extension of traditional linear methods. The general method is then compared to other linear and nonlinear methods in terms of accuracy in estimating a criterion equating function. Results from two parametric bootstrapping studies based on real data demonstrate the usefulness of the general linear method.
    Journal of Educational Measurement 03/2015; 52(1). DOI:10.1111/jedm.12062
  • [Show abstract] [Hide abstract]
    ABSTRACT: Students’ performance in assessments is commonly attributed to more or less effective teaching. This implies that students’ responses are significantly affected by instruction. However, the assumption that outcome measures indeed are instructionally sensitive is scarcely investigated empirically. In the present study, we propose a longitudinal multilevel-differential item functioning (DIF) model to combine two existing yet independent approaches to evaluate items’ instructional sensitivity. The model permits for a more informative judgment of instructional sensitivity, allowing the distinction of global and differential sensitivity. Exemplarily, the model is applied to two empirical data sets, with classical indices (Pretest–Posttest Difference Index and posttest multilevel-DIF) computed for comparison. Results suggest that the approach works well in the application to empirical data, and may provide important information to test developers.
    Journal of Educational Measurement 12/2014; 51(4):381-399. DOI:10.1111/jedm.12051
  • [Show abstract] [Hide abstract]
    ABSTRACT: Logistic regression is a popular method for detecting uniform and nonuniform differential item functioning (DIF) effects. Theoretical formulas for the power and sample size calculations are derived for likelihood ratio tests and Wald tests based on the asymptotic distribution of the maximum likelihood estimators for the logistic regression model. The power is related to the item response function (IRF) for the studied item, the latent trait distributions, and the sample sizes for the reference and focal groups. Simulation studies show that the theoretical values calculated from the formulas derived in the article are close to what are observed in the simulated data when the assumptions are satisfied. The robustness of the power formulas are studied with simulations when the assumptions are violated.
    Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12058
  • [Show abstract] [Hide abstract]
    ABSTRACT: With an increase in the number of online tests, interruptions during testing due to unexpected technical issues seem unavoidable. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees’ scores. There is a lack of research on this topic due to the novelty of the problem. This article is an attempt to fill that void. Several methods, primarily based on propensity score matching, linear regression, and item response theory, were suggested to determine the overall impact of the interruptions on the examinees’ scores. A realistic simulation study shows that the suggested methods have satisfactory Type I error rate and power. Then the methods were applied to data from the Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test that experienced interruptions in 2013. The results indicate that the interruptions did not have a significant overall impact on the student scores for the ISTEP+ test.
    Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12052
  • Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12056
  • [Show abstract] [Hide abstract]
    ABSTRACT: Computerized adaptive testing offers the possibility of gaining information on both the overall ability and cognitive profile in a single assessment administration. Some algorithms aiming for these dual purposes have been proposed, including the shadow test approach, the dual information method (DIM), and the constraint weighted method. The current study proposed two new methods, aggregate ranked information index (ARI) and aggregate standardized information index (ASI), which appropriately addressed the noncompatibility issue inherent in the original DIM method. More flexible weighting schemes that put different emphasis on information about general ability (i.e., in item response theory) and information about cognitive profile (i.e., in cognitive diagnostic modeling) were also explored. Two simulation studies were carried out to investigate the effectiveness of the new methods and weighting schemes. Results showed that the new methods with the flexible weighting schemes could produce more accurate estimation of both overall ability and cognitive profile than the original DIM. Among them, the ASI with both empirical and theoretical weights is recommended, and attribute-level weighting scheme is preferred if some attributes are considered more important from a substantive perspective.
    Journal of Educational Measurement 12/2014; 51(4). DOI:10.1111/jedm.12057