Educational and Psychological Measurement (EDUC PSYCHOL MEAS)

Publisher: SAGE Publications (earlier publishers: American College Personnel Association; Science Research Associates)


Educational and Psychological Measurement publishes data-based studies in educational measurement, as well as theoretical papers in the measurement field. The journal focuses on discussions of problems in measurement of individual differences, as well as research on the development and use of tests and measurement in education, psychology, industry and government.

  • Website
    Educational and Psychological Measurement website
  • Other titles
    Educational and psychological measurement, EPM
  • Material type
    Periodical, Internet resource
  • Document type
    Journal / Magazine / Newspaper, Internet Resource

Publisher details

SAGE Publications

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 12 months embargo
  • Conditions
    • On author website, repository and PubMed Central
    • On author's personal web site
    • Publisher copyright and source must be acknowledged
    • Publisher's version/PDF cannot be used
    • Post-print version with changes from referees' comments can be used
    • "As published" final version with layout and copy-editing changes cannot be archived, but can be used on a secure institutional intranet
    • If funding agency rules apply, authors may use SAGE Open to comply
  • Classification
    yellow

Publications in this journal

  • ABSTRACT: To further understand the properties of data-generation algorithms for multivariate, nonnormal data, two Monte Carlo simulation studies comparing the Vale and Maurelli method and the Headrick fifth-order polynomial method were implemented. Combinations of skewness and kurtosis found in four published articles were run and attention was specifically paid to the quality of the sample estimates of univariate skewness and kurtosis. In the first study, it was found that the Vale and Maurelli algorithm yielded downward-biased estimates of skewness and kurtosis (particularly at small samples) that were also highly variable. This method was also prone to generate extreme sample kurtosis values if the population kurtosis was high. The estimates obtained from Headrick’s algorithm were also biased downward, but much less so than the estimates obtained through Vale and Maurelli, and were much less variable. The second study reproduced the first simulation in the Curran, West, and Finch article using both the Vale and Maurelli method and the Headrick method. It was found that the chi-square values and empirical rejection rates changed depending on which data-generation method was used, sometimes sufficiently so that some of the original conclusions of the authors would no longer hold. In closing, recommendations are presented regarding the relative merits of each algorithm.
    Educational and Psychological Measurement 01/2015;
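The univariate core of both algorithms is a polynomial transform of a standard normal deviate; a minimal sketch of the third-order version used in the Vale and Maurelli per-variable step (the coefficients below are illustrative only, not solved from Fleishman's moment equations):

```python
import random

def fleishman_transform(z, a, b, c, d):
    """Third-order polynomial transform: maps a standard normal
    deviate z to a nonnormal score with controlled skew/kurtosis."""
    return a + b * z + c * z ** 2 + d * z ** 3

# Illustrative coefficients; in practice a, b, c, d are obtained by
# solving Fleishman's moment equations for the target skew and kurtosis.
a, b, c, d = -0.1, 0.95, 0.1, 0.02
random.seed(1)
sample = [fleishman_transform(random.gauss(0, 1), a, b, c, d) for _ in range(1000)]
```

Headrick's method extends the same idea to a fifth-order polynomial, which is what allows it to match higher moments more closely.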
  • ABSTRACT: The authors analyze the effectiveness of the R2 and delta log odds ratio effect size measures when using logistic regression analysis to detect differential item functioning (DIF) in dichotomous items. A simulation study was carried out, and the Type I error rate and power estimates under conditions in which only statistical testing was used were compared with the rejection rates obtained when statistical testing was combined with an effect size measure based on recommended cutoff criteria. The manipulated variables were sample size, impact between groups, percentage of DIF items in the test, and amount of DIF. The results showed that false-positive rates were higher when applying only the statistical test than when an effect size decision rule was used in combination with a statistical test. Type I error rates were affected by the number of test items with DIF, as well as by the magnitude of the DIF. With respect to power, when a statistical test was used in conjunction with effect size criteria to determine whether an item exhibited a meaningful magnitude of DIF, the delta log odds ratio effect size measure performed better than R2. Power was affected by the percentage of DIF items in the test and also by sample size. The study highlights the importance of using an effect size measure to avoid false identification.
    Educational and Psychological Measurement 01/2015;
  • Educational and Psychological Measurement 10/2014;
  • Educational and Psychological Measurement 01/2014;
  • ABSTRACT: This article investigates the effect of the rural–urban divide on mean response styles (RSs) and their relationships with the sociodemographic characteristics of the respondents. It uses the Representative Indicator Response Style Means and Covariance Structure (RIRSMACS) method and data from Guyana—a developing country in the Caribbean. The rural–urban divide effects substantial mean RSs differentials, and it moderates both their relationships with and the explanatory power of the respondents’ sociodemographic characteristics. Within-country research is therefore subject to substantial rural–urban RSs bias, and it is hence imperative that researchers control RSs in such studies. Previous research findings should also be reexamined with RSs controlled. In addition, joint modeling of culture, RSs, and their sociodemographic predictors may clarify some of the conflicting results about their effects in the cross-cultural research literature.
    Educational and Psychological Measurement 01/2014; 74(1):97-115.
  • ABSTRACT: This study compares the progressive-restricted standard error (PR-SE) exposure control procedure to three commonly used procedures in computerized adaptive testing, the randomesque, Sympson–Hetter (SH), and no exposure control methods. The performance of these four procedures is evaluated using the three-parameter logistic model under the manipulated conditions of item pool size (small vs. large) and stopping rules (fixed-length vs. variable-length). PR-SE provides the advantage of similar constraints to SH, without the need for a preceding simulation study to execute it. Overall for the large and small item banks, the PR-SE method administered almost all of the items from the item pool, whereas the other procedures administered about 52% or less of the large item bank and 80% or less of the small item bank. The PR-SE yielded the smallest amount of item overlap between tests across conditions and administered fewer items on average than SH. PR-SE obtained these results with similar, and acceptable, measurement precision compared to the other exposure control procedures while vastly improving on item pool usage.
    Educational and Psychological Measurement 10/2013; 73(5):857-874.
  • ABSTRACT: Although a substantial amount of research has been conducted on differential item functioning in testing, studies have focused on detecting differential item functioning rather than on explaining how or why it may occur. Some recent work has explored sources of differential functioning using explanatory and multilevel item response models. This study uses hierarchical generalized linear modeling to examine differential performance due to gender and opportunity to learn, two variables that have been examined in the literature primarily in isolation, or in terms of mean performance as opposed to item performance. The relationships between item difficulty, gender, and opportunity to learn are explored using data for three countries from an international survey of preservice mathematics teachers.
    Educational and Psychological Measurement 10/2013; 73(5):836-856.
  • ABSTRACT: Large-scale experiments that involve nested structures may assign treatment conditions either to subgroups such as classrooms or to individuals such as students within subgroups. Key aspects of the design of such experiments include knowledge of the variance structure in higher levels and the sample sizes necessary to reach sufficient power to detect the treatment effect. This study provides methods for maximizing power within a fixed budget in three-level block randomized balanced designs with two levels of nesting, where, for example, students are nested within classrooms and classrooms are nested within schools, and schools and classrooms are random effects. The power computations take into account the costs of units of different levels, the variance structure at the second (e.g., classroom) and third (e.g., school) levels, and the sample sizes (e.g., number of Level-1, Level-2, and Level-3 units).
    Educational and Psychological Measurement 10/2013; 73(5):784-802.
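The budget trade-off these methods optimize has a well-known closed form in the simpler two-level case, which the three-level results generalize; a minimal sketch of that textbook version (function names and cost figures are our own, not from the article):

```python
import math

def optimal_cluster_size(cost_cluster, cost_subject, icc):
    """Subjects per cluster that maximize power for a fixed budget in a
    two-level cluster-randomized design: sqrt of the cost ratio times
    the variance ratio (1 - ICC) / ICC."""
    return math.sqrt((cost_cluster / cost_subject) * (1 - icc) / icc)

def clusters_within_budget(total_budget, cost_cluster, cost_subject, n_per_cluster):
    """Number of clusters affordable once the per-cluster size is fixed."""
    return total_budget // (cost_cluster + cost_subject * n_per_cluster)

# Example: sampling a cluster costs 9x a subject, ICC = 0.10
n = optimal_cluster_size(90, 10, 0.1)          # -> 9.0 subjects per cluster
J = clusters_within_budget(10000, 90, 10, 9)   # -> 55 clusters
```

The three-level designs in the article add a second cost and variance component (schools above classrooms), but the logic is the same: spend relatively more on the level where the outcome varies most per unit cost.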
  • ABSTRACT: This note is concerned with a latent variable modeling approach for the study of differential item functioning in a multigroup setting. A multiple-testing procedure that can be used to evaluate group differences in response probabilities on individual items is discussed. The method is readily employed when the aim is also to locate possible sources of differential item functioning in homogeneous behavioral measuring instruments across two or more populations under investigation. The approach is readily applicable using the popular software Mplus and R and is illustrated with a numerical example.
    Educational and Psychological Measurement 10/2013; 73(5):898-908.
  • ABSTRACT: Adapting the original latitude of acceptance concept to Likert-type surveys, response latitudes are defined as the range of graded response options a person is willing to endorse. Response latitudes were expected to relate to attitude involvement such that high involvement was linked to narrow latitudes (the result of selective, careful responding) and low involvement was linked to wide latitudes (the result of disinterested, careless responding). In an innovative application of item response theory, parameters from Samejima’s graded response model were used to examine response latitude width. Other item response theory–based tools (e.g., test characteristic curves, information functions) were used to examine the influence of response latitudes on the psychometric functioning of several attitude surveys. A mix of experimental and nonexperimental methods was employed to create groups of high and low involvement surveys. Comparisons of these surveys showed that high involvement was related to significantly narrower response latitudes than low involvement. Furthermore, wide response latitudes were related to unfavorable psychometric properties such as reduced survey discrimination and reduced internal validity relative to narrow latitudes. Comparisons of information functions in high and low involvement conditions, however, were less consistent. Implications of wide response latitudes are quite unfavorable for researchers and suggest that an element of error is present when respondents feel little involvement with an attitude topic.
    Educational and Psychological Measurement 08/2013; 73(4):690-712.
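For context, Samejima's graded response model assigns each ordered category a probability equal to the difference between adjacent cumulative logistic curves; a minimal sketch (parameter values illustrative only, not from the study):

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Samejima graded response model: probabilities of the k+1 ordered
    categories given ability theta, discrimination a, and k ordered
    threshold parameters."""
    def p_star(b):
        # Cumulative probability of responding in this category or higher
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    cum = [1.0] + [p_star(b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# A 4-category item: probabilities across categories sum to 1
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-1.0, 0.5, 1.5])
```

In these terms, a wide response latitude shows up as flat, overlapping category curves, i.e., low discrimination a, which is exactly the unfavorable psychometric pattern the abstract describes.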
  • ABSTRACT: Assessments in response formats with ordered categories are ubiquitous in the social and health sciences. Although the assumption that the ordering of the categories is working as intended is central to any interpretation that arises from such assessments, testing that this assumption is valid is not standard in psychometrics. This is surprising given that it has been known for some 35 years that this assumption can be checked routinely using the psychometric Rasch model for more than two ordered categories. The purpose of this article is twofold. First, to demonstrate three distinct but related legacies of R. A. Fisher that have contributed to the use of the Rasch model to assess the empirical ordering of categories: (a) his construction of sufficient statistics, (b) his recognition that the ordering of categories should be an empirical property of the data, and (c) his integration of the design of empirical studies with statistical analyses of data. Second, to suggest two reasons behind both the indifference, and even the rejection, of both the need and possibility of testing the assumption of the empirical ordering of categories: (a) the lack of recognition of the problem before it was understood that it could be solved using the Rasch model and (b) the legacy of K. Pearson that legitimized the atheoretical modeling of data with parameters that have no substantive meaning.
    Educational and Psychological Measurement 08/2013; 73(4):553-580.
  • ABSTRACT: A multiple testing method for examining factorial invariance for latent constructs evaluated by multiple indicators in distinct populations is outlined. The procedure is based on the false discovery rate concept and multiple individual restriction tests and resolves general limitations of a popular factorial invariance testing approach. The discussed method controls the overall significance level and is associated with higher power than conventional multiple testing procedures. The procedure avoids the necessity to choose a reference variable in applications of latent variable modeling for testing factorial invariance. The outlined method permits location of factorial invariance violations, in addition to its examination for a given set of construct indicators, and is illustrated with a numerical example.
    Educational and Psychological Measurement 08/2013; 73(4):713-727.
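The false discovery rate concept is usually operationalized by the Benjamini-Hochberg step-up rule over the individual restriction tests' p-values; a minimal sketch of that generic rule (the article's specific procedure may differ in detail):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns the indices of
    hypotheses rejected while controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    # Find the largest rank k with p_(k) <= k * q / m; reject the k smallest
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])

rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.5], q=0.1)  # -> [0, 1, 2]
```

Because the threshold grows with the rank, the step-up rule rejects more hypotheses than a Bonferroni correction at the same nominal level, which is the power advantage the abstract refers to.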
  • ABSTRACT: Multilevel data structures are ubiquitous in the assessment of differential item functioning (DIF), particularly in large-scale testing programs. There are a handful of DIF procedures for researchers to select from that appropriately account for multilevel data structures. However, little, if any, work has been completed to extend a popular DIF method to this case. Thus, the primary goal of this study was to introduce and investigate the effectiveness of several new options for DIF assessment in the presence of multilevel data with the Mantel–Haenszel (MH) procedure, a popular, flexible, and effective tool for DIF detection. The performance of these new methods was compared with the standard MH technique through a simulation study, where data were simulated in a multilevel framework, corresponding to examinees nested in schools, for example. The standard MH test for DIF detection was employed, along with several multilevel extensions of MH. Results demonstrated that these multilevel tests proved to be preferable to standard MH in a wide variety of cases where multilevel data were present, particularly when the intraclass correlation was relatively large. Implications of this study for practice and future research are discussed.
    Educational and Psychological Measurement 08/2013; 73(4):648-671.
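The standard MH statistic that these extensions build on is a common odds ratio pooled over matched score strata, often reported on the ETS delta scale; a minimal sketch (the 2x2 cell layout per stratum is an assumption of this illustration, not taken from the article):

```python
import math

def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across score strata. Each table
    is (A, B, C, D): reference correct, reference incorrect,
    focal correct, focal incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def mh_delta(alpha):
    """ETS delta metric: MH D-DIF = -2.35 * ln(alpha_MH); 0 means no DIF,
    negative values indicate DIF against the focal group."""
    return -2.35 * math.log(alpha)

alpha = mh_odds_ratio([(10, 5, 5, 10)])  # single stratum -> 4.0
delta = mh_delta(alpha)
```

The multilevel extensions in the study address the fact that pooling over strata ignores the clustering of examinees within schools, which inflates error rates when the intraclass correlation is large.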
  • ABSTRACT: This article summarizes the key finding of a study that (a) tests the measurement invariance (MI) of the popular Students’ Approaches to Learning instrument (Programme for International Student Assessment [PISA]) across ethnic/cultural groups within a country and (b) discusses implications for research focusing on the role of affective measures in immigrant and minority education. The Students’ Approaches to Learning instrument captures some of the most prominent constructs in educational psychology. Results indicate significant variation in MI across various affective scales and across cultural groups. This study demonstrates that even if MI for specific scales is established across countries, it is still necessary to test MI across cultural groups within a country. We then discuss implications of MI across immigrant groups and highlight the relevance of MI testing for all research studying the affective conditions of educational achievement among immigrant students or the educational motivation of minority students.
    Educational and Psychological Measurement 08/2013; 73(4):601-630.