About
165
Publications
42,234
Reads
1,415
Citations
Publications (165)
Molenaar extended Mokken’s original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken’s original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are u...
This study presents a new criterion-referenced approach for exploring rating quality within the framework of latent-class signal detection theory (LC-SDT) that goes beyond commonly used reliability indices, and provides substantively meaningful indicators of rater accuracy that can be used to inform rater training and monitoring at the individual r...
The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start-up, plodding, boredom, or...
The use of assessments that require rater judgment (i.e., rater-mediated assessments) has become increasingly popular in high-stakes language assessments worldwide. Using a systematic literature review, the purpose of this study is to identify and explore the dominant methods for evaluating rating quality within the context of research on large-sca...
Previous research includes frequent admonitions regarding the importance of establishing connectivity in data collection designs prior to the application of Rasch models. However, details regarding the influence of characteristics of the linking sets used to establish connections among facets, such as locations on the latent variable, model–data fi...
Mental rotation is an important aspect of spatial ability. While the importance of measuring mental rotation has been explored, disputes still exist within the literature surrounding sources of item difficulty in mental rotation tests (MRTs). Furthermore, gender differences in MRT performance are often seen but not fully understood. In the current...
This study examined the psychometric properties of the English version of the 10-item Connor–Davidson Resilience Scale using the Rasch Rating Scale model in a sample of 177 international students and scholars at a U.S. university. The Connor–Davidson Resilience Scale was developed to measure individual differences in psychological resilience. Previo...
Objective: This study aimed to validate the Cultural Humility and Enactment Scale - Supervision (CHES-S).
Method: The sample included a total of 201 post-masters counselors who were currently engaged in clinical supervision across 11 U.S. states, who were recruited from the state licensing board email lists. Analyses included confirmatory factor a...
Background: Previous studies examining physical activity (PA) between subgroups of sexual minority women (SMW) have reported inconsistent results and have primarily utilized self-reported PA data. Purpose: To assess potential differences in accelerometer-measured and self-reported PA between subgroups of young adult SMW. Methods: Sexual orientation...
We examined the longitudinal psychometric properties of the Perceived Stress Scale – 4 items version (PSS‐4) using item response theory with a sample of 361 mental health counsellors. Participants completed the PSS‐4 at three timepoints at six‐month intervals in a one‐year period. There were 290 participants (80.3%) who identified as female, 51 (14...
Double-scoring constructed-response items is a common but costly practice in mixed-format assessments. This study explored the impacts of Targeted Double-Scoring (TDS) and random double-scoring procedures on the quality of psychometric outcomes, including student achievement estimates, person fit, and student classifications under various condition...
Assessment literacy's vital role in faculty effectiveness within higher education lacks sufficient tools for measuring faculty attitudes on this matter. Employing a sequential mixed-methods approach, this study utilized the theory of planned behavior to develop the Assessment Literacy Attitude Scale (ALAS) and evaluate its psychometric properties w...
What are foundational competencies in educational measurement? We published a framework for these foundational competencies in this journal (Ackerman et al. 2024) and were grateful to receive eight commentaries raising a number of important questions about the framework and its implications. We identified five cross‐cutting recommendations among th...
For more than two decades, researchers/schools have adopted Self-Determination Theory (SDT)-based interventions to provide valuable insights into improving the education process. The systematic review examined 36 SDT-based intervention studies (N = 11,792 participants) to understand the nature and effects of these interventions in promoting students’...
We explored three approaches to resolving or re-scoring constructed response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found...
Diagnostic classification models (DCMs) are psychometric models designed to classify examinees according to their proficiency or nonproficiency of specified latent characteristics. These models are well suited for providing diagnostic and actionable feedback to support intermediate and formative assessment efforts. Several DCMs have been developed...
This research will evaluate the sensitivity of IRT indicators in identifying careless responders with incomplete data, which can help with the detection of patterns of missingness in realistic data collection settings and address the need for accurate response pattern indicators. Our preliminary findings suggested including person-fit statistics in...
In an era where AI systems are increasingly influential, their ethical use and accurate evaluation have become paramount. This paper presents a new interdisciplinary methodology, intertwining metrologically-oriented psychometrics and ethical persuasion to ensure technical and moral standards are met. Rooted in Wilson's (2023) construct-mapping appr...
This article presents the consensus of a National Council on Measurement in Education Presidential Task Force on Foundational Competencies in Educational Measurement. Foundational competencies are those that support future development of additional professional and disciplinary competencies. The authors develop a framework for foundational compete...
Methods to identify carelessness in survey research can be valuable tools in reducing bias during survey development, validation, and use. Because carelessness may take multiple forms, researchers typically use multiple indices when identifying carelessness. In the current study, we extend the literature on careless response identification by exami...
Diagnostic classification models (DCMs) are psychometric models designed to classify examinees according to their proficiency or non-proficiency of specified latent characteristics. These models are well-suited for providing diagnostic and actionable feedback to support formative assessment efforts. Several DCMs have been developed and applied in d...
Well-designed spatial assessments can incorporate multiple sources of complexity that reflect important aspects of spatial reasoning. When these aspects are systematically included in spatial reasoning items, researchers can use psychometric models to examine the impact of each aspect on item difficulty. These methods can then help the researchers...
Careless responding, where participants do not fully engage with item content, is pervasive in survey research. Left undetected, carelessness can compromise the interpretation and use of survey results, including information about participant locations on the construct, item difficulty, and the psychometric quality of the instrument. We present and...
Sparse rating designs, where each examinee’s performance is scored by a small proportion of raters, are prevalent in practical performance assessments. However, relatively little research has focused on the degree to which different analytic techniques alert researchers to rater effects in such designs. We used a simulation study to compare the inf...
This study examined the acute effects of high-intensity resistance exercise with blood flow restriction (BFR) on performance and fatigue, metabolic stress, and markers of inflammation (interleukin-6 (IL-6)), muscle damage (myoglobin), and angiogenesis (vascular endothelial growth factor (VEGF)). Thirteen resistance-trained participants (four female, 24...
In standalone performance assessments, researchers have explored the influence of different rating designs on the sensitivity of latent trait model indicators to different rater effects as well as the impacts of different rating designs on student achievement estimates. However, the literature provides little guidance on the degree to which differe...
Children’s fears have received scholarly attention for well over 50 years. A considerable amount of literature has focused on such fear variables as fear intensity and fear prevalence scores and age and gender differences. We used meta-analysis to systematically review the findings related to gender differences in children’s fear intensity scores a...
The purpose of this study was to investigate lower limb blood flow responses under varying blood flow restriction (BFR) pressures based on individualized limb occlusion pressures (LOP) using a commonly used occlusion device. Twenty-nine participants (65.5% female, 23.8 ± 4.7 years) volunteered for this study. An 11.5 cm tourniquet was placed around...
Item response theory was used to study the psychometric properties of the Client Meaningful Experiences Scale (CMES). In a sample of 306 adult counseling clients, we examined the dimensional structure of the scale, item-fit, and person-fit statistics. Implications of these findings for counselors, counselors-in-training, and counseling researchers...
Respondents’ problem-solving behaviors comprise behaviors that represent complicated cognitive processes that are frequently systematically tied to one another. Biometric data, such as visual fixation counts (FCs), which are an important eye-tracking indicator, can be combined with other types of variables that reflect different aspects of problem-...
Careless responding is a pervasive issue that impacts the interpretation and use of responses from survey instruments. Researchers have proposed numerous useful methods for detecting carelessness in survey research, including relatively simple summary statistics such as the frequency of adjacent responses in the same category (e.g., “long-string” a...
Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., low...
Supervisor ratings from field experiences are a key part of preparing high-quality teacher candidates. The appropriate use of supervisor ratings for these purposes depends on their reliability, validity, and fairness. We explore the quality of ratings from 10 teacher candidate supervisors (faculty, university supervisors, and classroom teachers) at...
Internalization is an important component of motivation and self-determination. In most of the previous studies of internalization, researchers have focused on the theoretical framework of internalization continuum, but they have not yet empirically evaluated the stages of internalization as a continuum. This article presents a new Internalization...
Purpose
Although researchers have examined empathy among many populations worldwide, investigations of empathy among Farsi-speakers are limited. The purpose of this study is to evaluate the psychometric properties of the Interpersonal Reactivity Index (IRI) for Farsi-speakers (IRI-Farsi).
Methods
After translating, we explored psychometric propert...
Researchers frequently evaluate rater judgments in performance assessments for evidence of differential rater functioning (DRF), which occurs when rater severity is systematically related to construct-irrelevant student characteristics after controlling for student achievement levels. However, researchers have observed that methods for detecting DR...
In many performance assessments, one or two raters from the complete rater pool score each performance, resulting in a sparse rating design, where there are limited observations of each rater relative to the complete sample of students. Although sparse rating designs can be constructed to facilitate estimation of student achievement, the relativel...
Using item response theory, we examined the psychometric properties of scores on the Trauma-Informed Practice Scales – Supervision Version (TIPS-SV) – a unidimensional measure of supervisees’ perceptions of their supervisors’ adherence to trauma-informed supervision – in a sample of 312 supervisees. Implications for research and supervision practic...
An Exploratory Quantitative Text Analysis (EQTA) method was proposed to synthesize large sets of scholarly publications and to examine thematic characteristics in the Journal of Applied Measurement (JAM). After synthesizing 578 articles published in JAM from 2000 to 2020, authors classified each article into five categories to compare the differenc...
There is strong theoretical and empirical support for the use of digital games for learning. Despite their research support, digital games are not widely used in mathematics classrooms. One main contributor to this lack of adoption is the paucity of extant research on the extent to which mathematics teachers are using digital games, how teachers ar...
A variety of resources are available from which researchers can identify measurement instruments, including peer-reviewed journal articles, collections of technical information about published instruments, and electronic databases that are sponsored by universities, testing organizations, and other groups. Although these resources are widespread, m...
Researchers frequently use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, when they have relatively small samples of examinees. Researchers have provided some guidance regarding the minimum sample size for applications of MSA under various conditions. However, these studies have not focused on item-level mea...
Scoring procedures for the constructed-response (CR) items in large-scale mixed-format educational assessments often involve checks for rater agreement or rater reliability. Although these analyses are important, researchers have documented rater effects that persist despite rater training and that are not always detected in rater agreement and rel...
The use of mixed-format tests made up of multiple-choice (MC) items and constructed response (CR) items is popular in large-scale testing programs, including the National Assessment of Educational Progress (NAEP) and many district- and state-level assessments in the United States. Rater effects, or raters’ scoring tendencies that result in performa...
When analysts evaluate performance assessments, they often use modern measurement theory models to identify raters who frequently give ratings that are different from what would be expected, given the quality of the performance. To detect problematic scoring patterns, two rater fit statistics, the infit and outfit mean square error (MSE) statistic...
Many large‐scale performance assessments include score resolution procedures for resolving discrepancies in rater judgments. The goal of score resolution is conceptually similar to person fit analyses: To identify students for whom observed scores may not accurately reflect their achievement. Previously, researchers have observed that rater‐agreeme...
Classroom observation is a common approach to teacher evaluation. Yet, concerns about differences in rater judgment are widespread. Despite this concern, few researchers have examined the practical impact of such differences in rater judgments on teachers’ judged effectiveness. This study fills that gap. Using data from a large-scale teacher evalua...
Practical constraints in rater-mediated assessments limit the availability of complete data. Instead, most scoring procedures include one or two ratings for each performance, with overlapping performances across raters or linking sets of multiple-choice items to facilitate model estimation. These incomplete scoring designs present challenges for de...
Although researchers have studied empathy among many populations, there are few studies in which researchers have focused on empathy among Farsi speakers. We explore the psychometric properties of a Farsi translation of the Empathy Quotient (EQ), and compare the degree to which the items function in a comparable way to the English version of the it...
The original version of this article unfortunately contained a mistake in the author group section. The correct name of the second author is Parvaneh Yaghoubi Jami. The original article has been corrected. © 2018, Springer Science+Business Media, LLC, part of Springer Nature.
Purpose: To examine the impact of an out-of-school swimming program on children and youth from one underserved community. Method: Participants were 200 children and youth who attended the out-of-school swimming program during two consecutive summers. The theoretical framework employed drew from previous research on socialization. A mixed-methods...
Purpose: The aim of this study was to compare the effects of low ([LV]; 4 total sets), moderate ([MV]; 8 total sets), and high set volumes ([HV]; 12 total sets) in acute full-body resistance exercise sessions on post-exercise parasympathetic reactivation measured using RMSSD. Methods: Ten resistance-trained participants (25.8 ± 6.8 yr., 173.4 ± 10....
The trend of high average academic performance of Chinese students on international large-scale assessments has drawn the attention of researchers and practitioners. In this paper, we explored motivation as a fundamental factor that contributes to academic performance in order to explore the degree to which differences in students' learning motivat...
The aim was to examine the validity of heart rate variability (HRV) measurements from photoplethysmography (PPG) via a smartphone application pre- and post-resistance exercise (RE) and to examine the intraday and interday reliability of the smartphone PPG method. Thirty-one adults underwent two simultaneous ultrashort-term electrocardiograph (ECG) a...
We examined the psychometric properties of the Counselor Burnout Inventory (CBI) with 560 early career, post-master’s counselors. We tested the dimensional structure of the CBI, item ordering, and the function of the rating scale using item response theory. Implications of the findings for researchers, counselors, and counselor educators are discus...
Holmes CJ, Winchester LJ, MacDonald HV, Fedewa MV, Wind SA, Esco MR. Changes in Heart Rate Variability and Fatigue Measures Following Moderate Load Resistance Exercise. JEPonline 2020;23(5):24-36. The purpose of this study was to determine the relationship between changes in heart rate variability (HRV), neuromuscular performance, and fatigue biom...
Researchers frequently use Rasch models to analyze survey responses because these models provide accurate parameter estimates for items and examinees when there are missing data. However, researchers have not fully considered how missing data affect the accuracy of dimensionality assessment in Rasch analyses such as principal components analysis (P...
Scoring procedures for many rater-mediated performance assessments include score resolution procedures in which a third rater adjudicates discrepancies between two raters’ ratings of the same performance. There are numerous approaches for calculating resolved scores that involve different combinations of the original and third ratings. Using data f...
Rater fit analyses provide insight into the degree to which rater judgments correspond to expected properties, as defined within a measurement framework. Parametric models such as the Rasch model provide a useful framework for evaluating rating quality; however, these models are not appropriate for all assessment contexts. The purpose of this study...
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test‐takers’ achievement estimates given their response patterns, has...
Presentation given for the NSCA Alabama State Clinic 2020 virtual conference. This research is part of a dissertation project "Monitoring the Effects of Resistance Exercise on Heart Rate Variability"
Purpose
With the continued rise of technology in sports, many innovations have been made with smartphone applications designed to measure physiological fatigue and recovery. However, the proper methodology and potential pitfalls for long-term monitoring have not been fully explored. Therefore, this study aimed to observe compliance rates of self-me...
In the current study, we examined the extent to which supervisees’ perceptions of power dynamics related to gender and race in a sample of 229 trainees. Overall, we did not find systematic differences in supervisees’ perceptions of power in clinical supervision based on their gender and race. However, utilizing differential item functioning (DIF) a...
A major challenge in the widespread application of Mokken scale analysis (MSA) to educational performance assessments is the requirement of complete data, where every rater rates every student. In this study, simulated and real data are used to demonstrate a method by which researchers and practitioners can apply MSA to incomplete rating designs. T...
Purpose: To determine the influence of an elementary methods course and early field experience on eight preservice teachers’ (PTs’) value orientations. Method: The theoretical perspective employed was occupational socialization. Data were collected with the short form of the value orientation inventory and five qualitative techniques (formal and...
In previous studies, researchers have focused on the development and interpretation of measurement tools related to self-efficacy. However, researchers have seldom investigated whether these instruments demonstrate acceptable psychometric properties, including similar item interpretations between subgroups of respondents. The purpose of this study...
Researchers and practitioners have used the Modern Language Aptitude Test (MLAT) to assess language aptitude and identify possible language learning deficiencies in examinees since the 1950s. However, researchers have not assessed its psychometric properties using modern measurement theory methods. We use the dichotomous Rasch model to explore the...
Advanced Manufacturing and Prototyping Integrated to Unlock Potential (AMP-IT-UP) is a National Science Foundation (NSF) funded K-12 Math & Science Partnership (MSP) project with a goal of promoting math, science, and engineering learning through STEM integration-focused curricula. As part of this project, curriculum writers developed one-week modu...
Researchers apply individual person fit analyses as a procedure for checking model-data fit for individual test-takers. When a test-taker misfits, it means that the inferences from their test score regarding what they know and can do may not be accurate. One problem in applying individual person fit procedures in practice is the question of how muc...
Teacher evaluation systems often include classroom observations in which raters use rating scales to evaluate teachers’ effectiveness. Recently, researchers have promoted the use of multifaceted approaches to investigating reliability using Generalizability theory, instead of rater reliability statistics. Generalizability theory allows analysts to...
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the...
Analytic rubrics for writing assessments are intended to provide diagnostic information regarding students’ strengths and weaknesses related to several domains, such as the meaning and mechanics of their composition. Although individual domains refer to unique aspects of student writing, the same rating scales are often applied across different dom...
Meaningful interpretation of teacher evaluation based on classroom observation depends on the degree to which principals’ judgments are free from errors and systematic bias. Previous researchers have identified factors that may influence classroom observation ratings, including characteristics of students in the classrooms being observed. In this s...
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of s...
Previous research on negotiations between physical education teachers and students has been purely qualitative. The purpose of this study was to produce a quantified negotiation profile for one preservice teacher (George) while he taught three sport education (SE) seasons. The specific research questions we attempted to answer were as follows: (a)...
In operational administrations of rater-mediated performance assessments, practical constraints often result in incomplete data collection designs, in which each rater does not rate each performance on each task. Unless the data collection design includes systematic links, such as raters scoring a subset of the same test-takers as other raters, it...
The purpose of the study was to examine the impact of a physical education teacher’s apparent age on middle school students’ learning and perceptions of the teacher. Two hundred and seventy-three middle school students were randomly assigned to view one of two virtually identical films of swimming lessons taught by the same teacher. During the youn...
Differences in rater judgments that are systematically related to construct-irrelevant characteristics threaten the fairness of rater-mediated writing assessments. Accordingly, it is essential that researchers and practitioners examine the degree to which the psychometric quality of rater judgments is comparable across test-taker subgroups. Nonpara...
Differential Item Functioning (DIF) detection procedures provide validity evidence for proposed interpretations of test scores that can help researchers and practitioners ensure that test scores are free from potential bias, and that individual items do not create an advantage for any subgroup of examinees over another. In this study, we use the Ra...
Rater effects, or raters’ tendencies to assign ratings to performances that are different from the ratings that the performances warranted, are well documented in rater-mediated assessments across a variety of disciplines. In many real-data studies of rater effects, researchers have reported that raters exhibit more than one effect, such as a combi...
Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard‐setting panels should have the proper qualifications to make the judgments asked o...
Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater‐mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of three...