Journal of Applied Measurement (J Appl Meas)


Journal of Applied Measurement publishes refereed scholarly work from all academic disciplines that relates to measurement theory and its application to developing variables. The construction and interpretation of meaningful and unambiguous variables is a salient feature of measurement. It represents the congruence of measurement theory and substantive research in a wide range of scientific endeavors. The development of variables that map persons and items onto a common metric, operationally defined by the items and invariant across samples of persons and items, is a cornerstone of understanding the phenomena being measured and of constructing and verifying hypotheses based on these phenomena. The journal will also publish invited articles that provide examples of methodological issues that are relevant to constructing useful variables.

  • Website
    Journal of Applied Measurement website
  • Other titles
    Journal of applied measurement
  • Document type
    Journal / Magazine / Newspaper

Publications in this journal

    ABSTRACT: This article reports the results of an application of the Rasch rating scale model to the Teaching Strategies GOLD assessment system in a norm sample of children aged birth to 71 months. The analyses focused on the examination of dimensionality, rating scale effectiveness, the hierarchy of item difficulties, and the relationship of developmental scale scores to child age. Results show that each subscale satisfies the Rasch model for unidimensionality. Ratings were found to be less reliable at the lowest and highest ends of the scale and less distinct at 'In-between' levels. Items appear to form theoretically expected hierarchies, supporting evidence for construct validity for the measures. Moderately high correlations of developmental scale scores with child age suggest that teachers are able to make valid ratings of the developmental progress of children across the intended age range.
    Journal of applied measurement 01/2014; 15(4):405-421.
    ABSTRACT: Testing hypotheses on a respondent's individual fit under the Rasch model requires knowledge of the distributional properties of a person fit statistic. We argue that the Rasch Sampler (Verhelst, 2008), a Markov chain Monte Carlo algorithm for sampling binary data matrices from a uniform distribution, can be applied for simulating the distribution of person fit statistics with the Rasch model in the same way as it is used to test for other forms of misfit. Results from two simulation studies are presented which compare the approach to the original person fit statistics based on normalization formulas. Simulation 1 shows the new approach to hold the expected Type I error rates while the normalized statistics deviate from the nominal alpha-level. In Simulation 2 the power of the new approach was found to be approximately the same or higher than for the normalized statistics under most conditions.
    Journal of applied measurement 01/2014; 15(3):276-291.
    ABSTRACT: The purpose of this study was to examine the extent to which raters' subjectivity impacts measures of teacher dispositions using the Dispositions Assessments Aligned with Teacher Standards (DAATS) battery. This is an important component of the collection of evidence of validity and reliability of inferences made using the scale. It also provides needed support for the use of subjective affective measures in teacher training and other professional preparation programs, since these measures are often feared to be unreliable because of rater effect. It demonstrates the advantages of using the Multi-Faceted Rasch Model as a better alternative to the typical methods used in preparation programs, such as Cohen's Kappa. DAATS instruments require subjective scoring using a six-point rating scale derived from the affective taxonomy as defined by Krathwohl, Bloom, and Masia (1956). Rater effect is a serious challenge and can worsen or drift over time. Errors in rater judgment can impact the accuracy of ratings, and these effects are common, but can be lessened through training of raters and monitoring of their efforts. This effort uses the multifaceted Rasch measurement models (MFRM) to detect and understand the nature of these effects.
    Journal of applied measurement 01/2014; 15(3):240-251.
    ABSTRACT: An aspect of child behavior and temperament, called Negative Emotionality in the literature, is very important to teachers of very young children. The Children's Behavior Questionnaire, initially designed by Rothbart, Ahadi, Hershey and Fisher (2001) for use in western countries, was modified in line with Rasch measurement theory, revised for suitability with Hong Kong preschool children, and conceptually ordered from easy to hard along a continuum of attitude/behavior for negative emotionality, before data collection. Three ordered scoring categories (never or rarely scored 1, on some occasions scored 2, and on many occasions scored 3) were used. Data were collected from preschool teachers for N = 628 preschool children from 32 schools in Hong Kong and analyzed with the 2010 Rasch unidimensional measurement model computer program (RUMM2030). The item-trait interaction probability is 0.05 (χ² = 101.88, df = 80) which indicates that there is reasonable agreement about the different difficulties of the items along the scale for all the children. Results and implications are discussed, and revisions for the scale suggested.
    Journal of applied measurement 01/2014; 15(1):69-81.
    ABSTRACT: Recommended guidelines for the discrimination index of multiple-choice questions are often applied indiscriminately to essay-type questions as well. The optimal discrimination index for essay questions under a normality condition is derived independently. With the passing mark at 50% of the total, the satisfactory region for the discrimination index of essay questions is between 0.12 and 0.31, rather than the 0.40 or more used for multiple-choice questions. The optimal discrimination index for essay questions is shown to increase in proportion to the range of scores. Discrimination efficiency is defined as the ratio of the observed discrimination index to the optimal discrimination index. Recommended guidelines for the discrimination index of essay questions are provided.
    Journal of applied measurement 01/2014; 15(1):94-9.
    ABSTRACT: The assessment of differential item functioning (DIF) remains an area of active research in psychometrics and educational measurement. In recent years, methodological innovations involving mixture Rasch models have provided researchers with an additional set of tools for more deeply understanding the root causes of DIF, while at the same time increased interest in the role of disabilities and accommodations has also made itself felt in the measurement community. The current study furthered work in both areas by using the newly described multilevel mixture Rasch model to investigate the presence of DIF associated with disability and accommodation status at both examinee and school levels for a 3rd grade language assessment. Results of the study found that indeed DIF was present at both levels of analysis, and that it was associated with the presence of disabilities and the receipt of accommodations. Implications of these results for both practitioners and researchers are discussed.
    Journal of applied measurement 01/2014; 15(2):133-51.
    ABSTRACT: This research continues prior work published in this journal (Peoples, O'Dwyer, Shields and Wang, 2013). The first paper described the scale development, psychometric analyses and part-validation of a theoretically-grounded Rasch-based instrument, the Nature of Science Instrument-Elementary (NOSI-E). The NOSI-E was designed to measure elementary students' understanding of the Nature of Science (NOS). In the first paper, evidence was provided for three of the six validity aspects (content, substantive and generalizability) needed to support the construct validity of the NOSI-E. The research described in this paper examines two additional validity aspects (structural and external). The purpose of this study was to determine which of three competing internal models provides reliable, interpretable, and responsive measures of students' understanding of NOS. One postulate is that the NOS construct is unidimensional; alternatively, the NOS construct is composed of five independent unidimensional constructs (the consecutive approach). Lastly, the NOS construct is multidimensional and composed of five inter-related but separate dimensions. The vast body of evidence supported the claim that the NOS construct is multidimensional. Measures from the multidimensional model were positively related to student science achievement and students' perceptions of their classroom environment; this provided supporting evidence for the external validity aspect of the NOS construct. As US science education moves toward students learning science through engaging in authentic scientific practices and building learning progressions (NRC, 2012), it will be important to assess whether this new approach to teaching science is effective, and the NOSI-E may be used as a measure of the impact of this reform.
    Journal of applied measurement 01/2014; 15(4):338-358.
    ABSTRACT: Automatic item generation (AIG) is a broad class of methods that are being developed to address psychometric issues arising from internet and computer-based testing. In general, issues emphasize efficiency, validity, and diagnostic usefulness of large scale mental testing. Rapid prominence of AIG methods and their implicit perspective on mental testing is bringing painful scrutiny to many sacred psychometric assumptions. This report reviews basic AIG ideas, then presents conceptual foundations, image model development, and operational application to artistic judgment aptitude testing.
    Journal of applied measurement 01/2014; 15(1):1-25.
    ABSTRACT: The validity of specification equations used by auto-text processors to estimate theoretical text complexity has increased importance because of the Common Core State Standards. Theoretical estimates of text complexity will inform (a) setting standards for college and career readiness, (b) grade-level standards, (c) matching readers to text, and (d) creating a daily diet of stretch and targeted text designed to grow reading ability and content knowledge. The purpose of this research was to investigate the specification equation used in the Lexile Framework for Reading to measure text complexity. The Lexile Reading Analyzer contains a specification equation that uses proxies for the semantic difficulty and syntactic complexity to estimate the theoretical complexity of professionally-edited text. Differences between theoretical and empirical estimates of text complexity were examined for a set of 446 professionally authored, previously published passages. Students in grades 2-12 read these passages using A Learning Oasis, a web-based technology, to ensure that most of the articles read were well-targeted to student ability (+100L). Each article was response illustrated using an auto-generated semantic cloze item type embedded into passages. Observed student performance on this item type was used to derive an empirical estimate of text complexity for each passage. Theoretical estimates of text complexity accounted for approximately 90 percent of the variance in empirical estimates of text complexity. These findings suggest that the specification equation contains powerful predictors of empirical text complexity; speculation remains on what additional variables might account for the 10 percent of unexplained variation.
    Journal of applied measurement 01/2014; 15(4):359-371.
    ABSTRACT: The aim of this study is to reevaluate the validity of the Turkish version of the ECOS-16 questionnaire by using Rasch analysis in post-menopausal women with osteoporosis. ECOS-16 (Assessment of health related quality of life in osteoporosis) is a quality of life questionnaire, which is convenient for measuring the quality of life of post-menopausal women with osteoporosis. 132 post-menopausal women with osteoporosis who attended Uludag University, Ataturk Rehabilitation and Research Center between January 2010 and March 2011 were included in this study. The subjects filled out the Turkish version of the ECOS-16 questionnaire by themselves. The Rasch model was used for assessing construct validity of ECOS-16 data. Internal consistency was assessed by Cronbach's alpha coefficient. The mean infit and outfit mean square (z std) were found as 1.08 (0.1) and 1.02 (-0.1), respectively. The separation indices for the item and person were found as 7.72 and 3.13; the separation reliabilities were 0.98 and 0.91, respectively. Cronbach's alpha coefficient was found as 0.90. The construct validity of the ECOS-16 questionnaire was assessed by Rasch analysis.
    Journal of applied measurement 01/2014; 15(3):302-312.
    ABSTRACT: Research indicates that the scope of practice for primary care physicians has been shrinking (Tong, Makaroff, Xierali, Parhat, Puffer, Newton, et al., 2012; Xierali, Puffer, Tong, Bazemore, and Green, 2012; and Bazemore, Makaroff, Puffer, Parhat, Phillips, Xierali, et al., 2012) despite research showing that areas with robust primary care services have better population health outcomes at lower costs (Starfield, Shi, and Macinko, 2005). Examining issues related to the scope of practice for primary care physicians has wide-ranging implications for both patient health outcomes and related healthcare costs. This article describes the development and use of a scale intended to measure the breadth of the individual physician's scope of practice using 22 self-reported, dichotomous indicators obtained from a physician survey.
    Journal of applied measurement 01/2014; 15(3):227-239.
    ABSTRACT: Several concepts from Georg Rasch's last papers are discussed. The key one is comparison because Rasch considered the method of comparison fundamental to science. From the role of comparison stems scientific inference made operational by a properly developed frame of reference producing specific objectivity. The exact specifications Rasch outlined for making comparisons are explicated from quotes, and the role of causality derived from making comparisons is also examined. Understanding causality has implications for what can and cannot be produced via Rasch measurement. His simple examples were instructive, but the implications are far reaching upon first establishing the key role of comparison.
    Journal of applied measurement 01/2014; 15(1):26-39.
    ABSTRACT: Plagiarism is a significant area of concern in higher education, given university students' high self-reported rates of plagiarism. However, research remains inconsistent in prevalence estimates and suggested precursors of plagiarism. This may be a function of the unclear psychometric properties of the measurement tools adopted. To investigate this, we modified an existing plagiarism scale (to broaden its scope), established its psychometric properties using traditional (EFA, Cronbach's alpha) and modern (Rasch analysis) survey evaluation approaches, and examined results of well-functioning items. Results indicated that traditional and modern psychometric approaches differed in their recommendations. Further, responses indicated that although most respondents acknowledged the seriousness of plagiarism, these attitudes were neither unanimous nor consistent across the range of issues assessed. This study thus provides rigorous psychometric testing of a plagiarism attitude scale and baseline data from which to begin a discussion of contextual, personal, and external factors that influence students' plagiarism attitudes.
    Journal of applied measurement 01/2014; 15(4):372-393.
    ABSTRACT: The purpose of the present studies was to test the effects of systematic sources of measurement error on the parameter estimates of scales using the Rasch model. Studies 1 and 2 tested the effects of mood and affectivity. Study 3 evaluated the effects of fatigue. Last, studies 4 and 5 tested the effects of motivation on a number of parameters of the Rasch model (e.g., ability estimates). Results indicated that (a) the parameters of interest and the psychometric properties of the scales were substantially distorted in the presence of all systematic sources of error, and, (b) the use of HGLM provides a way of adjusting the parameter estimates in the presence of these sources of error. It is concluded that validity in measurement requires a thorough evaluation of potential sources of error and appropriate adjustments based on each occasion.
    Journal of applied measurement 01/2014; 15(4):314-337.
    ABSTRACT: A variety of methods for evaluating the psychometric quality of rater-mediated assessments have been proposed, including rater effects based on latent trait models (e.g., Engelhard, 2013; Wolfe, 2009). Although information about rater effects contributes to the interpretation and use of rater-assigned scores, it is also important to consider ratings in terms of the structure of the rating scale on which scores are assigned. Further, concern with the validity of rater-assigned scores necessitates investigation of these quality control indices within student subgroups, such as gender, language, and race/ethnicity groups. Using a set of guidelines for evaluating the interpretation and use of rating scales adapted from Linacre (1999, 2004), this study demonstrates methods that can be used to examine rating scale functioning within and across student subgroups with indicators from Rasch measurement theory (Rasch, 1960) and Mokken scale analysis (Mokken, 1971). Specifically, this study illustrates indices of rating scale effectiveness based on Rasch models and models adapted from Mokken scaling, and considers whether the two approaches to evaluating the interpretation and use of rating scales lead to comparable conclusions within the context of a large-scale rater-mediated writing assessment. Major findings suggest that indices of rating scale effectiveness based on a parametric and nonparametric approach provide related, but slightly different, information about the structure of rating scales. Implications for research, theory, and practice are discussed.
    Journal of applied measurement 01/2014; 15(2):100-32.
    ABSTRACT: This article describes a study investigating the effect of intervention on student problem solving and higher order competency development using a series of complex numeracy performance tasks (Airasian and Russell, 2008). The tasks were sequenced to promote and monitor student development towards hypothetico-deductive reasoning. Using Rasch partial credit analysis (Wright and Masters, 1982) to calibrate the tasks and analysis of residual gain scores to examine the effect of class and school membership, the study illustrates how directed intervention can improve students' higher order competency skills. This paper demonstrates how the segmentation defined by Wright and Masters can offer a basis for interpreting the construct underlying a test and how segment definitions can deliver targeted interventions. Implications for teacher intervention and teaching mentor schemes are considered. The article also discusses multilevel regression models that differentiate class and school effects, and describes a process for generating, testing and using value added models.
    Journal of applied measurement 01/2014; 15(1):53-68.
    ABSTRACT: The linear logistic test model (LLTM) by Fischer (1973) has recently seen increasing use. In applications of the LLTM, a likelihood-ratio test comparing the likelihood of the LLTM to the likelihood of the Rasch model is the most often applied model test. The present simulation study evaluates the empirical Type I risk, test power, and approximation to the expected distribution in the context of the LLTM. Furthermore, as possible influence factors on the distribution of the likelihood-ratio test statistic, the misspecification of the superior model, the closeness to singularity of the design matrix, and different sorts of misspecification of the design matrix are implemented. In summary, results of the simulations indicate that the likelihood-ratio test statistic holds the fixed Type I risk under typical conditions. Nevertheless, it is especially important to ensure the fit of the superior model, the Rasch model, and to consider the closeness to singularity of the design matrix.
    Journal of applied measurement 01/2014; 15(3):252-266.
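
Several abstracts in this list report Rasch separation indices alongside separation reliabilities. The ECOS-16 abstract, for example, gives 7.72 and 3.13 with 0.98 and 0.91. The abstracts do not state how the two are related. The sketch below assumes the relation commonly used in Rasch analysis, R = G² / (1 + G²), and checks it against those reported figures. The function name is hypothetical, and this is only an illustration, not the authors' computation.

```python
def separation_to_reliability(g: float) -> float:
    """Convert a separation index G to a separation reliability.

    Assumes the conventional Rasch relation R = G^2 / (1 + G^2);
    this formula is not given in the abstracts themselves.
    """
    if g < 0:
        raise ValueError("separation index must be non-negative")
    return g * g / (1.0 + g * g)


# Separation indices as reported in the ECOS-16 abstract (item, person).
item_reliability = separation_to_reliability(7.72)
person_reliability = separation_to_reliability(3.13)

print(round(item_reliability, 2), round(person_reliability, 2))
```

Under this assumption the rounded results agree with the reliabilities quoted in that abstract (0.98 for items, 0.91 for persons).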