Journal of Educational Measurement (J Educ Meas)

Publisher: National Council on Measurement in Education, Blackwell Publishing

Journal description

The Journal of Educational Measurement (JEM) is a quarterly journal that publishes original measurement research and reports of applications of measurement in an educational context. Solicited reviews of books, software, published educational and psychological tests, and other important measurement works appear in the Review Section of the journal. In addition, comments on technical and substantive issues addressed in articles and reviews previously published in JEM are encouraged. Comments will be reviewed and the authors of the original article will be given the opportunity to respond.

Current impact factor: 1.00

Additional details

5-year impact: 1.30
Cited half-life: 0.00
Immediacy index: 0.00
Eigenfactor: 0.00
Article influence: 1.16
Website: Journal of Educational Measurement website
Other titles: Journal of educational measurement (Online), JEM
ISSN: 1745-3984
OCLC: 58648984
Material type: Document, Periodical, Internet resource
Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

Blackwell Publishing

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • Some journals impose embargoes, typically of 6 or 12 months and occasionally of 24 months
    • No listing of affected journals is available as yet
  • Conditions
    • See Wiley-Blackwell entry for articles after February 2007
    • Publisher's version/PDF cannot be used
    • On author's server, institutional server or subject-based server
    • Server must be non-commercial
    • Publisher copyright and source must be acknowledged with set statement ("The definitive version is available at www.blackwell-synergy.com")
    • Articles in some journals can be made Open Access on payment of additional charge
    • 'Blackwell Publishing' is an imprint of 'Wiley'
  • Classification
    • Yellow

Publications in this journal

  • Journal of Educational Measurement 01/2015;
  • ABSTRACT: Students’ performance in assessments is commonly attributed to more or less effective teaching. This implies that students’ responses are significantly affected by instruction. However, the assumption that outcome measures are indeed instructionally sensitive has scarcely been investigated empirically. In the present study, we propose a longitudinal multilevel-differential item functioning (DIF) model that combines two existing yet independent approaches to evaluating items’ instructional sensitivity. The model permits a more informative judgment of instructional sensitivity, allowing global and differential sensitivity to be distinguished. As an illustration, the model is applied to two empirical data sets, with classical indices (Pretest–Posttest Difference Index and posttest multilevel-DIF) computed for comparison. Results suggest that the approach works well when applied to empirical data and may provide important information to test developers.
    Journal of Educational Measurement 12/2014; 51(4):381-399.
  • ABSTRACT: Logistic regression is a popular method for detecting uniform and nonuniform differential item functioning (DIF) effects. Theoretical formulas for power and sample size calculations are derived for likelihood ratio tests and Wald tests based on the asymptotic distribution of the maximum likelihood estimators for the logistic regression model. The power is related to the item response function (IRF) for the studied item, the latent trait distributions, and the sample sizes for the reference and focal groups. Simulation studies show that the theoretical values calculated from the formulas derived in the article are close to those observed in the simulated data when the assumptions are satisfied. The robustness of the power formulas is studied with simulations in which the assumptions are violated.
    Journal of Educational Measurement 12/2014; 51(4).
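    A minimal sketch of the general procedure described above (an illustration, not the authors' code): nested logistic regressions of a studied item's responses on a matching variable, group membership, and their interaction are compared with a likelihood ratio test. The simulated data, variable names, and the use of the latent trait as a stand-in for the observed matching score are assumptions.
    ```python
    # Logistic-regression DIF detection with a likelihood ratio test (illustrative).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n = 1000                                          # examinees per group
    theta = np.concatenate([rng.normal(0, 1, n),      # reference group
                            rng.normal(-0.5, 1, n)])  # focal group
    group = np.repeat([0, 1], n)
    b = 0.0 + 0.4 * group                             # studied item is harder for the focal group (uniform DIF)
    resp = rng.binomial(1, 1 / (1 + np.exp(-(theta - b))))
    df = pd.DataFrame({"resp": resp, "total": theta, "group": group})

    # Nested models: no DIF vs. uniform + nonuniform DIF terms.
    m0 = smf.logit("resp ~ total", data=df).fit(disp=0)
    m1 = smf.logit("resp ~ total + group + total:group", data=df).fit(disp=0)

    lr = 2 * (m1.llf - m0.llf)                        # LR statistic, 2 degrees of freedom
    print("LR =", round(lr, 2), "p =", round(chi2.sf(lr, df=2), 4))
    ```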
  • Journal of Educational Measurement 12/2014; 51(4).
  • ABSTRACT: With an increase in the number of online tests, interruptions during testing due to unexpected technical issues seem unavoidable. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees’ scores. There is a lack of research on this topic due to the novelty of the problem; this article is an attempt to fill that void. Several methods, primarily based on propensity score matching, linear regression, and item response theory, are suggested to determine the overall impact of the interruptions on the examinees’ scores. A realistic simulation study shows that the suggested methods have satisfactory Type I error rates and power. The methods were then applied to data from the Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test, which experienced interruptions in 2013. The results indicate that the interruptions did not have a significant overall impact on student scores for the ISTEP+ test.
    Journal of Educational Measurement 12/2014; 51(4).
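    A minimal sketch of one of the ideas described above, propensity score matching (an illustration, not the article's procedure): each interrupted examinee is matched to a non-interrupted examinee with a similar propensity score, and the mean score difference in the matched sample estimates the impact of the interruption. The covariate, sample sizes, and package choices are assumptions.
    ```python
    # Propensity score matching to estimate the impact of test interruptions (illustrative).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 5000
    prior = rng.normal(0, 1, n)                                    # covariate, e.g., prior-year score
    interrupted = rng.binomial(1, 1 / (1 + np.exp(-0.3 * prior)))  # "treatment" indicator
    score = 10 * prior + rng.normal(0, 5, n)                       # current score; no true interruption effect here

    # 1. Estimate propensity scores from the covariates.
    ps = LogisticRegression().fit(prior.reshape(-1, 1), interrupted).predict_proba(prior.reshape(-1, 1))[:, 1]

    # 2. Nearest-neighbor matching of each interrupted examinee to a non-interrupted one.
    treat = np.flatnonzero(interrupted == 1)
    ctrl = np.flatnonzero(interrupted == 0)
    matches = ctrl[np.abs(ps[ctrl][None, :] - ps[treat][:, None]).argmin(axis=1)]

    # 3. Estimated overall impact of the interruptions in the matched sample.
    print("mean matched score difference:", round(float((score[treat] - score[matches]).mean()), 3))
    ```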
  • Journal of Educational Measurement 12/2014; 51(4).
  • ABSTRACT: Computerized adaptive testing offers the possibility of gaining information on both the overall ability and the cognitive profile in a single assessment administration. Some algorithms aiming for these dual purposes have been proposed, including the shadow test approach, the dual information method (DIM), and the constraint weighted method. The current study proposed two new methods, the aggregate ranked information index (ARI) and the aggregate standardized information index (ASI), which appropriately address the noncompatibility issue inherent in the original DIM method. More flexible weighting schemes that put different emphasis on information about general ability (as modeled in item response theory) and information about the cognitive profile (as modeled in cognitive diagnostic modeling) were also explored. Two simulation studies were carried out to investigate the effectiveness of the new methods and weighting schemes. Results showed that the new methods with the flexible weighting schemes could produce more accurate estimation of both overall ability and cognitive profile than the original DIM. Among them, the ASI with both empirical and theoretical weights is recommended, and the attribute-level weighting scheme is preferred if some attributes are considered more important from a substantive perspective.
    Journal of Educational Measurement 12/2014; 51(4).
  • ABSTRACT: Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by treating the multiple ratings given to an item response as a bundle and assigning fixed-effect parameters to describe response patterns in the bundle. Unfortunately, this model becomes difficult to manage when a polytomous item is graded by more than two raters. In this study, by adding random-effect parameters to the facets model, we propose a class of generalized rater models to account for the local dependence among multiple ratings and for intrarater variation in severity. A series of simulations was conducted with the freeware WinBUGS to evaluate parameter recovery of the new models and the consequences of ignoring the local dependence or intrarater variation in severity. The results revealed good parameter recovery when the data-generating models were fit, and poor estimation of parameters and test reliability when the local dependence or intrarater variation in severity was ignored. An empirical example is provided.
    Journal of Educational Measurement 09/2014; 51(3).
  • ABSTRACT: In this article, we introduce a section preequating (SPE) method (linear and nonlinear) under the randomly equivalent groups design. In this equating design, sections of Test X (a future new form) and another existing Test Y (an old form already on scale) are administered. The sections of Test X are equated to Test Y, after adjusting for the imperfect correlation between sections of Test X, to obtain the equated score on the complete form of X. Simulations and a real-data application show that the proposed SPE method is fairly simple and accurate.
    Journal of Educational Measurement 09/2014; 51(3).
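    For orientation, a minimal sketch of ordinary linear observed-score equating under the randomly equivalent groups design, the building block that the SPE method above extends (the section-level adjustment for imperfect correlations is not reproduced). Data and form labels are illustrative assumptions.
    ```python
    # Linear equating of Form X onto the Form Y scale under the random groups design (illustrative).
    import numpy as np

    rng = np.random.default_rng(2)
    x_scores = rng.binomial(60, 0.62, 3000)   # new form X, administered to group 1
    y_scores = rng.binomial(60, 0.65, 3000)   # old form Y, administered to group 2

    def linear_equate(x, x_sample, y_sample):
        """Map a Form X score to the Form Y scale by matching means and standard deviations."""
        mu_x, sd_x = x_sample.mean(), x_sample.std(ddof=1)
        mu_y, sd_y = y_sample.mean(), y_sample.std(ddof=1)
        return mu_y + (sd_y / sd_x) * (x - mu_x)

    print(linear_equate(np.array([30, 40, 50]), x_scores, y_scores).round(2))
    ```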
  • ABSTRACT: Recent guidelines for fair educational testing advise checking the validity of individual test scores through the use of person-fit statistics. On the basis of the existing literature, it is unclear to practitioners which statistic to use. An overview of relatively simple existing nonparametric approaches to identifying atypical response patterns is provided. A simulation study was conducted to compare the different approaches, and on the basis of the literature review and the simulation study, guidelines for the use of person-fit approaches are given.
    Journal of Educational Measurement 09/2014; 51(3).
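    One of the simpler nonparametric person-fit statistics in this line of work is the number of Guttman errors: pairs of items where an examinee misses an easier item yet answers a harder one correctly, with difficulty taken as the sample proportion correct. A minimal sketch follows; the data, flagging rule, and cutoff are illustrative assumptions rather than the article's recommendations.
    ```python
    # Counting Guttman errors as a nonparametric person-fit statistic (illustrative).
    import numpy as np

    rng = np.random.default_rng(3)
    p_item = rng.uniform(0.3, 0.9, 20)               # item proportions correct
    data = rng.binomial(1, p_item, size=(500, 20))   # persons x items

    def guttman_errors(responses, data_matrix):
        """Count (easier wrong, harder right) item pairs for one response vector."""
        order = np.argsort(-data_matrix.mean(axis=0))   # easiest item first
        r = responses[order]
        return sum(int(r[i] == 0 and r[j] == 1)
                   for i in range(len(r)) for j in range(i + 1, len(r)))

    g = np.array([guttman_errors(data[p], data) for p in range(data.shape[0])])
    print("patterns flagged (G above the 95th percentile):", int(np.sum(g > np.quantile(g, 0.95))))
    ```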
  • ABSTRACT: We investigate the current bandwidth selection methods in kernel equating and propose a method based on Silverman's rule of thumb for selecting the bandwidth parameters. In kernel equating, the bandwidth parameters have previously been obtained by minimizing a penalty function. This minimization process has been criticized by practitioners as too complex and as not offering sufficient smoothing in certain cases. In addition, the bandwidth parameters have been treated as constants in the derivation of the standard error of equating even when they were selected by considering the observed data. Here, the bandwidth selection is simplified, and modified standard errors of equating (SEEs) that reflect the bandwidth selection method are derived. The method is illustrated with real data examples and simulated data.
    Journal of Educational Measurement 09/2014; 51(3).
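    A minimal sketch of Silverman's rule of thumb itself, the general idea the proposed bandwidth selection builds on (the article's adaptation to kernel equating and the modified SEEs are not reproduced here):
    ```python
    # Silverman's rule-of-thumb bandwidth for a Gaussian kernel (illustrative).
    import numpy as np

    def silverman_bandwidth(x):
        """h = 0.9 * min(SD, IQR / 1.34) * n^(-1/5)."""
        x = np.asarray(x, dtype=float)
        sd = x.std(ddof=1)
        iqr = np.subtract(*np.percentile(x, [75, 25]))
        return 0.9 * min(sd, iqr / 1.34) * x.size ** (-1 / 5)

    rng = np.random.default_rng(4)
    scores = rng.binomial(40, 0.6, 2000)              # illustrative observed test scores
    print("bandwidth:", round(silverman_bandwidth(scores), 3))
    ```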
  • ABSTRACT: When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimating CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel-smoothed item response functions. The performance of the nonparametric CA and CC indices is tested in simulation studies under various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal.
    Journal of Educational Measurement 09/2014; 51(3).
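    A minimal sketch of a kernel-smoothed item response function in the spirit of Ramsay's approach, the ingredient the nonparametric CA/CC indices rely on: a Nadaraya-Watson regression of item responses on total scores with a Gaussian kernel. The simulated data, bandwidth, and grid are illustrative assumptions, and the article's modifications are not reproduced.
    ```python
    # Kernel-smoothed item response function (Nadaraya-Watson, illustrative).
    import numpy as np

    def kernel_irf(item_resp, total, grid, h=2.0):
        """Estimate P(correct | total score) at each grid point."""
        w = np.exp(-0.5 * ((grid[:, None] - total[None, :]) / h) ** 2)
        return (w * item_resp[None, :]).sum(axis=1) / w.sum(axis=1)

    rng = np.random.default_rng(5)
    theta = rng.normal(0, 1, 2000)
    difficulties = np.linspace(-1, 1, 30)
    items = rng.binomial(1, 1 / (1 + np.exp(-(theta[:, None] - difficulties))))
    total = items.sum(axis=1)
    grid = np.arange(total.min(), total.max() + 1)
    print(np.round(kernel_irf(items[:, 0], total, grid)[::5], 2))   # smoothed IRF for item 1
    ```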
  • ABSTRACT: Preequating is in demand because it reduces score reporting time. In this article, we evaluated an observed-score preequating method, the empirical item characteristic curve (EICC) method, which makes preequating without item response theory (IRT) possible. EICC preequating results were compared with a criterion equating and with IRT true-score preequating conversions. Results suggest that the EICC preequating method worked well under the conditions considered in this study. The difference between the EICC preequating conversion and the criterion equating was smaller than .5 raw-score points (a practical criterion often used to evaluate equating quality) between the 5th and 95th percentiles of the new form total score distribution. EICC preequating also performed similarly to, or slightly better than, IRT true-score preequating.
    Journal of Educational Measurement 09/2014; 51(3).
  • Journal of Educational Measurement 09/2014; 51(3).
  • ABSTRACT: As item response theory has been more widely applied, investigating the fit of a parametric model has become an important part of the measurement process. There is a lack of promising solutions for the detection of model misfit in IRT. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purpose of this study was to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results of the simulation study demonstrated that RISE outperformed G2 and S-X2 in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected a reasonable number of misfitting items compared to G2 and S-X2, and it gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence for classification before and after replacement of the misfitting items detected by the three fit statistics.
    Journal of Educational Measurement 03/2014; 51(1).
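    A minimal sketch of a RISE-type index: the root integrated squared difference between a fitted parametric IRF and a nonparametric (e.g., kernel-smoothed) IRF, weighted over ability quadrature points. The weighting, smoothing, and example curves are illustrative assumptions and may differ from the study's exact computation.
    ```python
    # RISE-style discrepancy between a parametric and a nonparametric IRF (illustrative).
    import numpy as np
    from scipy.stats import norm

    def rise(p_param, p_nonparam, quad):
        """Root integrated squared error between two IRFs on a quadrature grid."""
        w = norm.pdf(quad)
        w = w / w.sum()
        return np.sqrt(np.sum(w * (p_param - p_nonparam) ** 2))

    quad = np.linspace(-4, 4, 41)
    p_2pl = 1 / (1 + np.exp(-1.2 * (quad - 0.3)))     # fitted 2PL IRF (a = 1.2, b = 0.3)
    p_np = p_2pl + 0.05 * np.sin(quad)                # stand-in for a nonparametric estimate
    print("RISE:", round(float(rise(p_2pl, p_np, quad)), 4))
    ```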
  • ABSTRACT: Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true-score, and IRT observed-score equating methods. Using mixed-format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.
    Journal of Educational Measurement 03/2014; 51(1).
  • ABSTRACT: The DINA (deterministic inputs, noisy "and" gate) model has been widely used in cognitive diagnosis tests and in the process of test development. Slip and guess parameters are included in the DINA model function representing the responses to the items. This study aimed to extend the DINA model by using a random-effect approach to allow examinees to have different probabilities of slipping and guessing. Two extensions of the DINA model were developed and tested to represent the random components of slipping and guessing. The first model assumed that a random variable can be incorporated in the slipping parameters to allow examinees to have different levels of caution. The second model assumed that the examinees’ ability may increase the probability of a correct response if they have not mastered all of the required attributes of an item. The results of a series of simulations based on Markov chain Monte Carlo methods showed that the model parameters and attribute-mastery profiles can be recovered relatively accurately from the generating models and that neglect of the random effects produces biases in parameter estimation. Finally, a fraction subtraction test was used as an empirical example to demonstrate the application of the new models.
    Journal of Educational Measurement 03/2014; 51(1).
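    For reference, a minimal sketch of the standard (fixed-effect) DINA item response function that the article extends with random slipping and guessing effects; the Q-matrix and parameter values are illustrative assumptions.
    ```python
    # Standard DINA item response function: P(correct) = (1 - s)^eta * g^(1 - eta) (illustrative).
    import numpy as np

    def dina_prob(alpha, q, slip, guess):
        """eta = 1 if the examinee has mastered every attribute the item requires."""
        eta = np.all(alpha[None, :] >= q, axis=1).astype(int)   # one eta per item
        return (1 - slip) ** eta * guess ** (1 - eta)

    q_matrix = np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 1, 1]])           # items x attributes
    slip = np.array([0.10, 0.15, 0.20])
    guess = np.array([0.20, 0.10, 0.25])
    alpha = np.array([1, 1, 0])                # examinee masters attributes 1 and 2 only
    print(dina_prob(alpha, q_matrix, slip, guess))   # approx. [0.90, 0.85, 0.25]
    ```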
  • ABSTRACT: Three local observed-score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to measures such as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed-score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test design generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard errors of equating than their kernel method counterparts.
    Journal of Educational Measurement 03/2014; 51(1).
  • ABSTRACT: The intent of this research was to find an item selection procedure in the multidimensional computer adaptive testing (CAT) framework that yielded higher precision for both the domain and composite abilities, had higher usage of the item pool, and controlled the exposure rate. Five multidimensional CAT item selection procedures (minimum angle; volume; minimum error variance of the linear combination; minimum error variance of the composite score with optimized weight; and Kullback-Leibler information) were studied and compared with two methods for item exposure control (the Sympson-Hetter procedure and the fixed-rate procedure, the latter simply referring to putting a limit on the item exposure rate) using simulated data. The maximum priority index method was used for the content constraints. Results showed that the Sympson-Hetter procedure yielded better precision than the fixed-rate procedure but had much lower item pool usage and took more time. The five item selection procedures performed similarly under Sympson-Hetter. For the fixed-rate procedure, there was a trade-off between the precision of the ability estimates and item pool usage, and the five procedures showed different patterns. It was found that (1) Kullback-Leibler had better precision but lower item pool usage; (2) minimum angle and volume had balanced precision and item pool usage; and (3) the two methods minimizing the error variance had the best item pool usage and comparable overall score recovery but less precision for certain domains. The priority index for content constraints and item exposure was implemented successfully.
    Journal of Educational Measurement 03/2014; 51(1).
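    A minimal sketch of the Sympson-Hetter administration step used for exposure control in the study above (the multidimensional selection criteria, content constraints, and the iterative calibration of the exposure parameters are not reproduced; all values are illustrative assumptions):
    ```python
    # Sympson-Hetter exposure control step in CAT item selection (illustrative).
    import numpy as np

    def select_item(info, k_params, administered, rng):
        """Administer the most informative eligible item that passes its exposure experiment."""
        for j in np.argsort(-info):                 # best-information first
            if int(j) in administered:
                continue
            if rng.random() < k_params[j]:          # administered with probability k_j
                return int(j)
        return None                                 # no item passed (rare with reasonable k's)

    rng = np.random.default_rng(6)
    info = rng.uniform(0.2, 2.0, 50)                # item information at the current ability estimate
    k_params = np.full(50, 0.8)                     # exposure parameters (normally calibrated iteratively)
    print("administered item:", select_item(info, k_params, administered=set(), rng=rng))
    ```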