Constructing Measures. An Item Response Modeling Approach
Abstract
Constructing Measures introduces a way to understand the advantages and disadvantages of measurement instruments, how to use such instruments, and how to apply these methods to develop new instruments or adapt old ones. The book is organized around the steps taken while constructing an instrument. It opens with a summary of the constructive steps involved. Each step is then expanded on in the next four chapters. These chapters develop the "building blocks" that make up an instrument: the construct map, the design plan for the items, the outcome space, and the statistical measurement model. The next three chapters focus on quality control. They rely heavily on the calibrated construct map and review how to check if scores are operating consistently and how to evaluate the reliability and validity evidence. The book introduces a variety of item formats, including multiple-choice, open-ended, and performance items; projects; portfolios; Likert and Guttman items; behavioral observations; and interview protocols. Each chapter includes an overview of the key concepts, related resources for further investigation, and exercises and activities. Some chapters feature appendices that describe parts of the instrument development process in more detail, numerical manipulations used in the text, and/or data results. A variety of examples from the behavioral and social sciences and education, including achievement and performance testing, attitude measures, health measures, and general sociological scales, demonstrate the application of the material. An accompanying CD features control files, output, and a data set to allow readers to compute the text's exercises and create new analyses and case archives based on the book's examples so the reader can work through the entire development of an instrument.
Constructing Measures is an ideal text or supplement in courses on item, test, or instrument development, measurement, item response theory, or Rasch analysis taught in a variety of departments including education and psychology. The book also appeals to those who develop instruments, including industrial/organizational, educational, and school psychologists, health outcomes researchers, program evaluators, and sociological measurers. Knowledge of basic descriptive statistics and elementary regression is recommended. © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved.
... In the present research, we specifically show how a developmental perspective on assessment, realized through the Four Building Blocks of assessment (Wilson, 2023), allows educators to gain novel insights into student growth, insights that would otherwise not be possible to see without the application of this comprehensive assessment framework. In the subsequent sections, we explain how each of the Four Building Blocks connects to our assessment of students: 1) Construct Map, 2) Items Design, 3) Outcome Space, and 4) Wright Map. ...
... In this study, the Outcome Space includes the distinct Likert categories ranging from "Strongly Disagree" to "Strongly Agree." The items mapped onto the highest levels ("Leader") are the most difficult to agree with on the Likert scale whereas the items mapped onto the lowest levels ("Required") are the easiest to agree with on the Likert scale; D) The Wright Map involves relating the scored survey responses back to the Construct Map by translating the scores into "locations" on the construct's continuum (Wilson, 2023). This process is visualized as item-person Wright maps. ...
... Building Block 1: Construct Map. The first building block, the Construct Map, is a construct definition tool that relies on a developmental perspective to assess student achievement and growth (Wilson, 2023). In a Construct Map, this developmental perspective manifests through qualitatively different levels of performance on the latent construct. ...
Making Undergraduate STEM Education more Inclusive, Interpersonal, and Interdisciplinary through Challenge-Based Learning
The increasing complexity of global challenges demands a STEM-enriched approach to learning for all students, regardless of their future career paths. Challenge-Based Learning (CBL) is a pedagogical method to foster a STEM-enriched education, engaging students in the design of societally impactful, interdisciplinary solutions. To investigate the potential of CBL, specifically in the context of Undergraduate STEM Education (USE), it is crucial to assess students' affective development such as their attitudes, beliefs, and self-perceptions related to STEM. This dissertation explores the impact of CBL on student affect through three interconnected studies centered on a large-enrollment Bioinspired Design course. Chapter 1 explores overall growth in three measures of science connection, namely Science Identity (SciID), Science Self-Efficacy (Eff), and Internalization of Scientific Community Values (Val), using the Tripartite Integration Model of Social Influence (TIMSI) framework. Results demonstrated significant pre/post increases in SciID and Eff across five semesters, with Val remaining stable. Item-level analyses revealed specific impacts of CBL activities on these affective measures, particularly in developing students' confidence in creating novel technologies. Chapter 2 investigates the equity of these affective growth outcomes across seven demographic variables. Results indicated that the observed increases in science connection were largely equitable across diverse student populations, with differences in SciID development based on STEM major status and class status. Chapter 3 introduces and validates a novel affective construct: Innovation Skills self-efficacy.
Developed using the Berkeley Evaluation & Assessment Research (BEAR) Assessment System, this construct provides a more targeted measure of self-efficacy aligned with the Innovation Skills needed for the future STEM-enriched workforce. Results showed approximately one standard deviation of pre/post growth, with a large effect size in the context of educational interventions. Collectively, this dissertation showcases the potential of CBL approaches in USE to foster equitable development of science connection and Innovation Skills self-efficacy across diverse student populations through comprehensive, psychometrically robust assessments of student affect. This research underscores the importance of holistic approaches to STEM education that cultivate not only knowledge and skills, but also the attitudes and beliefs necessary for success in the known and unknown STEM-enriched careers of the future.
... This process yielded fourteen items that were most relevant to our study (see S1 Table). We analyzed data from all respondents who replied to all 14 items, using a generalized Rasch model for polytomous data [57]. We found that the 14-item instrument showed strong internal consistency statistics of 0.79 for the pre-test (n = 547) and 0.84 for the post-test (n = 281). ...
... We found that the 14-item instrument showed strong internal consistency statistics of 0.79 for the pre-test (n = 547) and 0.84 for the post-test (n = 281). In addition to establishing content validity as noted above, we also checked estimates from the model for expected patterns; all weighted mean square fit statistics were within the expected range (0.77-1.33) and the response category thresholds within each item were ordered from low to high as theorized [57]. ...
... Given that this study required both the estimation of student scores and the use of those scores as the dependent variable in linear regression, we used item response theory (hereafter IRT) and specifically a partial-credit model to estimate item difficulties and student scores [57]. We analyzed two types of student scores, Expected A Posteriori (EAP) scores that average multiple estimates, and plausible values (PV) that include a wider range of possible student scores to account for error more accurately. ...
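To make the terms in the excerpt above concrete, here is a minimal Python sketch of partial-credit-model category probabilities and a grid-based EAP score. It is an illustration under stated assumptions, not the cited study's procedure: the function names `pcm_probabilities` and `eap_estimate`, the standard-normal prior, the quadrature grid, and the step difficulties in the tests are all hypothetical choices made for this example.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model.

    theta: person location; step_difficulties: one delta per step, so an
    item with m steps has m + 1 score categories (0..m).
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def eap_estimate(responses, items, grid=None):
    """Expected a posteriori (posterior mean) score on a fixed grid.

    Assumes a standard normal prior (an assumption of this sketch).
    responses: observed category per item; items: step difficulties per item.
    """
    if grid is None:
        grid = [-4.0 + 0.1 * i for i in range(81)]
    posterior = []
    for theta in grid:
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        likelihood = 1.0
        for x, deltas in zip(responses, items):
            likelihood *= pcm_probabilities(theta, deltas)[x]
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return sum(t * p for t, p in zip(grid, posterior)) / total
```

Plausible values, by contrast, would be random draws from this same posterior rather than its mean, which is why they carry more of the measurement error into later regressions.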
Small group work has been shown to improve students’ achievement, learning, engagement, and attitudes toward science. Previous studies of different methods of group formation and their possible impacts mainly focused on measures of students’ academic ability, such as GPA, SAT scores, and previous familiarity with course content. However, little attention has been given to other characteristics such as students’ social demographic identities in research about group formation and students’ experiences. Here, we studied the criteria students use to form lab groups, examined how the degree of demographic isolation varies between student-selected and randomly-formed groups, and tested whether demographic isolation is associated with group work attitudes. We used a pre-post survey research design to examine students’ responses in a large-enrollment biology laboratory course. Descriptive analyses showed that “students sitting next to me” (57%) followed by the combination of “students sitting next to me” and “friends” (22%) were the two most common criteria students reported that they considered when forming research groups. Notably, over 80 percent of students reported forming groups with those who sat nearby. We studied instances where students were isolated by being the only members of a historically marginalized population in their lab groups. The prevalence of demographic isolation in student-selected groups was found to be lower than in the simulated randomly assigned groups. We also used multilevel linear regression to examine whether being an isolated student was associated with attitudes about group work, yielding no consistent statistically significant effects. This study contributes to growing knowledge about the relationship between students’ demographic isolation in groups and group work attitudes.
... In this chapter, we explore two learning progressions (Wilson, 2009, 2023), one in the NGSS DCI of life science (interdependent relationships in ecosystems) and the other in the science and engineering practice of engaging in argument from evidence (scientific argumentation), to understand their own structure and the relationship between them in order to inform the ongoing conversation about learning progressions in science assessment. Specifically, we are interested in investigating the question, what is the relationship between the content (DCI) and practice (SEP) constructs? ...
... The data collected from 1,366 middle school students in a large and diverse urban school district in the United States was used to investigate the validity of the learning progression. The learning progression in the study comprised four waypoints (Wilson, 2023). The least sophisticated waypoint (Waypoint 0) is named "Notions" and indicates that students could only express naive and often inaccurate knowledge about ecosystems. ...
... The study describes the use of Wilson's (2023) construct modeling approach called the BAS. Using this approach, researchers developed assessment material including questions or items within tasks or item bundles (Rosenbaum, 1988) and collected empirical evidence to validate the learning progression. ...
... Assessments are designed to capture a single underlying characteristic or trait at a time, though several may be scaled in relation with one another to capture the additional information provided by shared variance (Wilson, 2023). What this trait is, the construct it represents, is an abstraction, made manifest by the creation, that is, the design and development, of the measuring tool. ...
... In healthcare assessment, where we attempt to measure human experience, we have yet in most cases to realize such rigor (section 2.4.7 in Pendrill, 2024), and our focus must be on developing conceptually sound, valid, and precise tools if the quality of measurement obtained in health care is to match that obtained in the physical sciences. Wilson (2023) describes the construction of measures as involving four steps, but I prefer to think of them as spaces. The actions that occur within and between these spaces involve constructions; as a result, our science reflects out those constructions. ...
... Having collected data from an assessment and having analyzed the data for alignment to probabilistic conjoint measurement theory, researchers ask the inferential question about the extent to which the ordering of items from "hardest" to "easiest" reflects the proposed underlying construct (Wilson, 2023). It is not always the case that the empirical results match the proposed construct; indeed, many practitioners of probabilistic conjoint measurement consider the ability to identify inconsistencies with the underlying construct and variations in its expression as a strength of the approach (Avlund et al., 1993; Confrey et al., 2021; Sul, 2024; Wilson, 1994). ...
... To construct reasonable and meaningful measurements, we propose an iterative process that incorporates both qualitative and quantitative perspectives. Our approach builds on Mark Wilson's (2005) construct modeling as well as other advancements in test theory and measurement science (cf. Pendrill, 2019; Boone, 2016; Hobart & Cano, 2009; Andrich & Marais, 2019; Salzberger, 2019). ...
... This is accomplished by sketching out the definition of the construct to be measured in a construct map, where the focus is to broadly describe the characteristics of being located at different points of the continuum. If the construct of interest is multidimensional, which social sustainability is regarded to be, it is recommended to handle each aspect separately, formulating a construct map for each aspect (Wilson, 2005). ...
... When individuals are under scrutiny, real-world observations are derived from people answering surveys, taking tests, or performing tasks to calibrate the measurement system. As such, it is possible to develop items and measurement tools by letting respondents answer questions and, in that way, gather data (Wilson, 2005; Boone, 2016; Hobart & Cano, 2009; Andrich & Marais, 2019). When municipalities are under scrutiny, real-world observations are harder to obtain; instead, aggregated data from multiple municipalities and several data sources may be required to calibrate the measurement system. ...
... Measurement implies abstracting qualitative observations from the real world (Thurstone, 1959) and exchanging the observed data for inferred meaning. In this way, one assigns numbers to categories where the numbers have certain properties, checks that the assignment was successful and makes use of the measurements for the purpose of summarising responses (Wilson, 2005). In this thesis, I apply the FoNS model to define categories of number sense. ...
... To measure an abstract concept, such as number sense, I needed to operationalise the construct. An item response modelling approach (Wilson, 2005) to instrument development enabled the construction needed to develop measures of children's number sense, identify central components and analyse how the different measures relate. The item response modelling approach involves four building blocks: 1) a construct map, 2) item design (explained further in Chapter 5.1), 3) outcome space and 4) a measurement model. ...
... The outcome space involves categories connected to the scoring of items (Wilson, 2005). The children's answers to ENSA were scored as right or wrong, giving one or zero points to each item. ...
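As a rough illustration of how a right/wrong outcome space feeds a measurement model, the Python sketch below scores answers dichotomously and evaluates a dichotomous Rasch response probability. Both the Rasch model and the names `score_dichotomous` and `rasch_probability` are assumptions made for this example; the excerpt states only that items were scored one or zero, and the answer key shown in the test is invented.

```python
import math


def score_dichotomous(answers, key):
    """Map each answer to 1 (matches the key) or 0 (does not)."""
    return [1 if answer == correct else 0 for answer, correct in zip(answers, key)]


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    theta is the person location and difficulty the item location, both
    expressed in logits on the same continuum.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))
```

When a person's location equals the item's difficulty, the modeled probability of success is one half; this is the property that lets persons and items share one scale.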
Children enter school with several valuable mathematical experiences, but many teachers lack the methods to formatively assess young children’s mathematical knowledge. Consequently, children may not be met with sufficient challenges to facilitate their further mathematical development. Moreover, current assessment tools used for the youngest children are either too time-demanding to carry out, lack focus on specific mathematical domains or do not consider various aspects of assessment validity. In this PhD thesis, I investigate how technology can enhance the formative assessment of children’s early number sense by developing a digital assessment tool—early number sense assessment (ENSA)—focusing specifically on early number sense and assessment validity. The development and further investigation of ENSA in use reveals that technology can supplement the assessment of number sense in a time-efficient way without being dependent on reading and writing skills or the competence of the individual instructor. ENSA also describes previously unknown aspects of Norwegian children’s early mathematical knowledge. Furthermore, the thesis shows how interactive items have the potential to describe more of the process, leading to an answer to mathematical problems while displaying evidence of assessment validity. Additionally, the use of technology brings some specific design elements that can affect children’s results and strategies when finding answers to mathematical problems. Finally, teachers’ interpretations of children’s assessment results situated in a wider social practice are limited and shaped by external factors that ultimately affect the formative value of assessments for teachers.
... We used the Standards for Educational and Psychological Testing (Standards, AERA et al., 2014) to frame our validity argument and used the construct modeling approach (Wilson, 2023) to guide our investigation. Using the construct modeling approach, we collected reliability evidence and validity evidence based on test content, response processes, internal structure, and relations to other variables; and evidence of fairness (e.g., differential item functioning or DIF), just as they are listed (as "sources") in the Standards. ...
... The construct modeling approach integrates four building blocks into a comprehensive instrument-development system, and the methods are derived from a well-established system for developing assessments. The four building blocks are named (1) the construct map, (2) item design, (3) outcome space, and (4) the Wright Map (Wilson, 2023). As shown on Fig. 2, these building blocks represent steps in a cycle of development, which can be repeated several times in order to arrive at a sound measurement tool. ...
... The construct map, items, response options, and respondents were calibrated using the fourth building block as specified by Wilson's (2023) model for validation. Conquest 5.0 (Adams et al., 2020) was used to complete the analysis. ...
The authentic research experience, which provides students with meaningful collaborative research opportunities designed to promote discovery and innovation under the guidance of mentors, is increasing as a way to attract and engage students in STEM fields. However, despite the increase in authentic research experiences offered to students, there has been little research, particularly at the high school level, investigating students’ attitudes about themselves as researchers. To address this need, we developed a theory (or construct) for how high school age students self-identify as researchers and a companion survey to measure their identity. After three iterative development cycles, 823 high school students from diverse backgrounds were administered the 12-item survey, the Researcher Identity Survey—Form G (RISG). The partial credit Rasch model (1960/1980) was used to analyze the survey data. The results indicate that the survey identifies and locates high school age students as researchers validly and reliably along an easy to use and interpret scale. The survey holds promise as an important element for use in programs designed to broaden the entryway for students into the STEM disciplines.
... Thousands of peer-reviewed articles, books, and practical implementations of advanced measurement theory over the last several decades [36][37][38][39][40][41][42][43][44][45][46][47][48] have not, however, led to the creation of new SI units for measurements in education, health care, social services, environmental resource management, etc. It is accordingly also the case, then, that no units have emerged for informing exchanges of literacy, health, community, or natural capital that are as familiar and easily accessed in everyday language as seconds, degrees Celsius, meters, grams, or volts. ...
... Instead of a perpendicular force lifting the burden of initiation and effecting a subjective experience of flow, individuals experiencing a drag are pulled along in a chaotic turbulence running parallel with the mainstream social forces. Although mainstream measurement methods in psychology and the social sciences take it for granted that no alternative to these kinds of scoring models are available, that has not been true for over 50 years [29][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46][47][48]. ...
Mathematical and scientific language is an extension of everyday language that feeds back into it and becomes an essential part of daily life in the measurement of time, temperature, mass, length, etc. Language is said by philosophers to be the vehicle of thought: we think only in signs. Language is complexity, simplified. It semiotically integrates discontinuous levels of meaning without reducing them to a uniformity dictated by one of them alone. Thus, unrealistic conceptual ideals; standardized phonemes, alphabets, vocabularies, and grammars; and unique local circumstances each are brought to bear simultaneously in systems, metasystems, and paradigms. Being born into and educated within systems of pre-existing words and ideas absorbs us into complex semiotic flows of meaning. Language lifts the burden of initiating communication. Because we do not have to invent our own signs and symbols, and neither do we have to translate between personal sign systems, language provides a labor-saving economy of thought. The metaphors of reduced labor, lifted burdens, and vehicular transportation suggest an approach to modelling, estimating, measuring, managing, and enhancing the value of linguistic and metrological standards. Key to this approach is the adoption of a specifically metrological perspective on quantification, one that focuses on the objective reproducibility and repeatability of unit quantities that persist in their properties independent of the samples measured and the particular instrument used to measure. Measurement models of this kind enable the testing of hypotheses concerning the kinds of stable representations needed to achieve language’s labor-saving value and economy as the vehicle of thought. When these hypotheses are not falsified, measurement representations capable of multilevel, complex flows in streamlined communication systems are obtained.
In the absence of these hypothesis tests, or when these hypotheses are falsified, in contrast, the meaning of merely numeric representations remains dependent on strictly local circumstances and is not comparable across them, leaving the vehicle of thought stalled in turbulent traffic. A mechanical conception of how language can provide an economical lift in the vehicle of thought is described; an organic conception is reserved for development at a later time. The current approach focuses on ways in which sociodynamic forces may be harnessed by a sociofoil mechanism providing lift in a manner analogous to the way aerofoils and hydrofoils similarly harness aerodynamic and hydrodynamic forces to reduce drag and improve efficiency.
... This limitation, Prigogine says, makes it necessary to bring in a "supplementary parameter scaling the extension of the fluctuations" to make it apply in unstable systems [1] (p. 117). That parameter is effectively provided in polytomous rating scale models [16, 70-72] expanding on Rasch's original dichotomous model. ...
... Expanding δ into two terms β and δ capturing both the mutation and the context of its fractional change in selective advantage, along with the "supplementary parameter" τ, and rewriting the equation as a probabilistic conjoint measurement model [16, 70-72] gives: ...
The theory of self-organizing complex adaptive structures energized by the dissipation of entropy describes the evolution of system transitions through hierarchies of levels separated by discontinuities. A paradoxical combination of structural invariance and randomness infuses irreducibly complex systems spanning the full range of existence, from the physical to the chemical, biological, psychological, and social. Philosophically and historically speaking, the objectivist worldview of humanity alienated from a clockwork universe is in the process of being replaced with a worldview including humanity as a participant in the universe’s playful unfolding. The reductionist sense of dualistic science and the pessimism of a worldview in which everything degrades and tends toward thermodynamic equilibrium contrasts with a participatory science of immanent unfolding, irreducible complexity, and evolving collective rationality. Possibilities for operationalizing this phenomenon of “deterministic chaos” across the sciences are explored in terms of the stochastic invariance principles of additive conjoint measurement modeling. These principles inform potentials for a new metrological infrastructure and suggest that the coherence of a new class of SI units might be based on the dissipation of entropy in a manner analogous to the way the speed of light contributes to the coherence of the current SI.
... Dr. Leonard Finkelstein addressed the definition of measurement, distinguishing between widely, strongly, and weakly defined measurements [15]. Dr. Mark Wilson constructed measures using item response modeling approaches [16], while Dr. Paul Holland focused on sampling theory foundations for item response theory models [17]. ...
In the industry 4.0 era, there exists a pressing need for intelligent data management solutions to enhance the operations of small businesses. This study introduces a pioneering methodology that harnesses the power of AI-driven analysis of internal voice communications, an often-overlooked source of valuable insights within the small business environment. The research centers on an advanced platform that utilizes the Regularized Bayesian Approach, meticulously tailored for the processing of unstructured and semi-structured data, with a specific focus on internal voice messages. This methodology enables the generation of in-depth insights into employees’ emotional, psychological, and motivational states. Furthermore, the integration of data with a psychometric system enables the production of comprehensive personality evaluations, providing digital portraits for every employee. These portraits offer valuable insights into employee well-being and motivations, particularly beneficial for small businesses with limited HR resources. The potential benefits for small businesses are multifaceted and research-driven, including enhanced employee safety, improved efficiency, advanced risk management, and streamlined HR processes. Additionally, this research underscores the growing relevance and potential of this approach in the Emotion AI market. Through the analysis of voice messages, entities, intent, and relationships between utterances can be discerned, offering a comprehensive view of employee sentiment, loyalty, and satisfaction. This study serves as the foundation for fostering a positive work environment, enhancing productivity, and providing a roadmap for mental health improvement and reduced attrition in small businesses. It contributes to the evolving field of intelligent data management and its applications in enhancing small business operations. Keywords: voice recognition, small business enhancement, emotion AI, artificial intelligence, Bayesian approach
... Linacre (2006) states that Rasch model fit is acceptable when Infit/Outfit MNSQ values lie between 0.5 and 1.5; fit is considered good when they lie between 0.75 and 1.33 (Wilson, 2005). After correcting for characteristic level, cross-gender DIF was calculated by comparing item difficulty. ...
... For most items, the Infit and Outfit MNSQ were within the ideal range (0.75-1.33) (Wilson, 2005), with five exceptions (Items #1, 2, 3, 4, and 14) but within the acceptable range (0.5-1.5) (Linacre, 2006). All of the PTSS item measures were found to be invariant (within the margin of error) between genders since no items exhibited large DIF across the sexes, according to the DIF analysis. ...
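The two fit ranges quoted above can be applied mechanically. The Python sketch below computes infit and outfit mean squares for one dichotomous item from observed responses and model-expected probabilities, then labels a value using the 0.75-1.33 and 0.5-1.5 ranges from the excerpts. The function names `mnsq_fit` and `classify_mnsq`, the labels, and the toy data in the tests are assumptions for illustration; dedicated Rasch software would normally report these statistics.

```python
def mnsq_fit(responses, probabilities):
    """Infit and outfit mean squares for one dichotomous item.

    responses: observed 0/1 scores; probabilities: model-expected
    probabilities of a 1 for the same persons (each strictly between 0 and 1).
    """
    squared_residuals = [(x - p) ** 2 for x, p in zip(responses, probabilities)]
    variances = [p * (1.0 - p) for p in probabilities]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted mean square (residuals over total variance).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


def classify_mnsq(value):
    """Label a mean-square value using the ranges cited in the excerpts."""
    if 0.75 <= value <= 1.33:
        return "ideal"       # range attributed to Wilson (2005)
    if 0.5 <= value <= 1.5:
        return "acceptable"  # range attributed to Linacre (2006)
    return "misfit"
```

Values near 1 indicate that responses vary about as much as the model expects; values well above 1 signal noise and values well below 1 signal overly predictable responses.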
This study developed a brief student-reported Perceived Teacher Support Scale (PTSS) to assess how well students felt supported by their teachers. Using a theory-driven approach, the PTSS with instrumental, emotional, informational, and appraisal support subscales was developed based on the framework of social support. A total of 1,138 middle school students in grades seven to ten from middle schools in mainland China participated in this survey. The psychometric features of the PTSS were studied using factor and Rasch analyses. Exploratory factor analysis revealed a three-factor model while confirmatory factor analysis supported three- and four-factor solutions. The Rasch analysis further demonstrated the psychometric quality of the four subscales: scale dimensionality, rating scale functioning, and item fit. Measurement invariance across gender was confirmed. The final PTSS had 25 items in four subscales evaluating students’ perceived teacher support: instrumental, emotional, informational, and appraisal support. The correlation between the PTSS and student engagement supported concurrent validity. Finally, the study’s limitations and implications are discussed. In general, the PTSS scale is a more effective tool for measuring students’ perceived teacher support. It can be used to understand the situation of teacher support in different dimensions, and can also be used to conduct relevant cross-sectional and longitudinal research experiments.
... The researchers employed construct modeling (Wilson, 2005) to introduce a real-time digital learning platform in which diagnostic tasks were designed for the three mathematics content constituents to measure the degree of mathematical competency of seventh-grade students. Moreover, a design-based research method was adopted from Adams (2005), and the MRCMLM (Adams et al., 1997) was applied to verify the quality of the real-time digital learning platform. ...
... Researchers utilized the construct modeling approach Wilson (2005) proposed to design the mathematical competency measurement model. In addition, design-based research was employed as the research design (Vongvanich, 2020) by adapting four building blocks as the sections of the measurement scheme (Wilson & Sloane, 2000). ...
The research aims to develop and verify a real-time digital learning platform for diagnosing the degree of mathematical competency of seventh-grade students. A total of 1,559 students from four regions of Thailand and six experts participated in this research. The researchers employed a design-based approach and the Multidimensional Random Coefficients Multinomial Logit Model to evaluate the effectiveness of the real-time digital learning platform. The measurement tool consists of two aspects, namely mathematical measures and the construction of knowledge dimensions, with 58 items to simulate students' responses to the three respective content constituents: number and algebra, measurement and geometry, and statistics and probability. The findings showed that the multidimensional model offered a significantly better statistical fit for measurement and geometry and for statistics and probability, while number and algebra fit better in the unidimensional model. There were positive, strong, and significant correlations between the dimensions. The findings indicated that all items met the OUTFIT MNSQ and INFIT MNSQ criteria, with values between 0.75 and 1.33. The findings revealed that the internal structure when it came to evaluating the degrees of students' mathematical competency using the Wright map and the Multidimensional Test Response Model conformed with the quality of the digital learning platform in terms of its usefulness, accuracy, and feasibility. This real-time digital learning platform for diagnosing the degree of mathematical competency can be accessed from anywhere with an internet connection, making education more accessible to a wider audience, including individuals in remote areas or those with physical disabilities.
... Construct validity requires that variation in the construct in question causally leads to proportionate variation in item response patterns. This causal interpretation in turn requires a clear understanding of the nature of the attribute measured and the causal mechanisms underlying its measurement (Bringmann and Eronen 2016; Eronen and Bringmann 2021; Wilson 2023). Divorced from a theory of response behavior, "tables of correlations between test scores and other measures cannot provide more than circumstantial evidence for validity" (Borsboom, Mellenbergh, and van Heerden 2004, 1062). ...
... They represent one's actual theory regarding the target construct. For example, according to the BEAR Assessment System (Wilson 2023; Bhatti et al. 2023), measure construction should follow a cyclical, dynamic, and iterative process that emulates the scientific process and that may lead to bidirectional refinement of the construct theory itself. The process begins with proposing a construct map, which represents a hypothesized (qualitative) continuum of the attribute, ranging from high to low levels. ...
The validity of the standard ideal L2 self scale has increasingly been called into question. This paper reports both quantitative and qualitative investigations into whether the ideal L2 self items tap into the intended construct of an actual–ideal discrepancy. Study 1 involved an experimental approach manipulating the items to explicitly refer to ability beliefs. Data from 1,362 participants across three countries (Austria, China, and Saudi Arabia) showed a lack of discriminant validity between original and manipulated items. Study 2 used cognitive interviewing to examine the thought processes of 24 Japanese university students as they responded to ideal L2 self items. Thematic analysis revealed that responses were dominated by reflections on current ability and expectations about using the language in specific situations, rather than envisioning an idealized future self. The findings of Studies 1 and 2 converge to indicate that the standard ideal L2 self scale does not successfully operationalize the intended theoretical construct of actual–ideal self discrepancies. Instead, responses are mostly driven by beliefs about ability to achieve the states described in each item. The results therefore challenge the validity of this widely-used scale, calling for a reinterpretation of its findings in the L2 Motivational Self System literature.
... The four-pronged idea undergirding all of our work on dispositions assessment, which has held constant over time, is that (1) items should be written to show consistency with professional standards; (2) multiple instruments of different item types are necessary; (3) design and scoring must be tied to a meaningful taxonomy that defines levels in the scale; and (4) advanced measurement practice provides a solid foundation for use of the results in scoring candidates while seeking to improve both their individual performance and the performance of the program. Building on results obtained in research on socio-emotional learning (Pancorbo et al., 2021; Lei et al., 2023), by mapping variation in the measurements in terms of the item content, the process we describe helps us diagnose opportunities for individual and program improvements and avoid construct under-representation and construct-irrelevant variance (Baghaei, 2008; Stenner & Horabin, 1992; Wilson, 2023). ...
... Measurement modeling involves a careful delineation of the expected construct during the instrument design stage (Wilson, 2023). Conceptually, the idea of measurement modeling is simple. ...
... (5.1)). To respond to the most common end-user need, measurement values of a specific latent trait attributed to persons, it is natural to start by defining the latent trait related to the person and after that designing items to be used to assess what it means for persons going up or down the scale (Wilson, 2005). However, historically, some psychologists have tended to view what is being measured as an empirical matter with a conception that views validity as something to be discovered afterward (Borsboom, Mellenbergh, & van Heerden, 2004). ...
[Table entries interleaved in the extracted text: difficulty of a specific task (e.g., task difficulty X for Y, at time Z); measured difficulty of a specific task (e.g., measured task difficulty X for Y, at time Z); ability of a specific person (e.g., ability X of person Y, at time Z); measured ability of a specific person (e.g., measured ability X of person Y, at time Z).]
... Furthermore, a good example of the latter comes from Morel and colleagues (2022), who provided a conceptual model of experiences in the early stage of Parkinson's disease and, in turn, laid the foundation for which latent trait should be measured in order to make decisions based on what matters for that target group. The literature about designing measurements for latent traits, again naturally, starts by defining the latent trait related to the person and, after that, designing items (Wilson, 2005). A subsequent step when intending to test and evaluate the measurement in research is the study design. ...
... Whether and where in the curriculum students should be taught a conceptual understanding of measurement and its role in the sciences is an ongoing question. [13] [14,15,16] Finally, it is important to recognize that traditional document analysis would have plausibly revealed the same findings as reported here. Still, an NLP application produces faster results with fewer resources. ...
Camilli (2024) proposed a methodology using natural language processing (NLP) to map the relationship of a set of content standards to item specifications. This study provided evidence that NLP can be used to improve the mapping process. As part of this investigation, the nominal classifications of standards and item specifications were used to examine construct equivalence. In the current paper, we determine the strength of empirical support for the semantic distinctiveness of these classifications, which are known as "domains" for Common Core standards, and "strands" for National Assessment of Educational Progress (NAEP) item specifications. This is accomplished by separate k-means clustering for standards and specifications of their corresponding embedding vectors. We then briefly illustrate an application of these findings.
https://doi.org/10.48550/arXiv.2412.04482
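The k-means step described in this abstract can be sketched as follows. This is only an illustration of Lloyd's algorithm on made-up two-dimensional vectors: the embedding model, the number of clusters, and the initialization used in the paper are not stated in the excerpt, so the deterministic start (first k vectors as centroids) and all names below are assumptions.

```python
import math

def kmeans(vectors, k, iterations=50):
    """Lloyd's algorithm with a deterministic start.

    vectors: list of equal-length numeric tuples (stand-ins for embeddings)
    k:       number of clusters
    Returns (labels, centroids).
    """
    centroids = [list(v) for v in vectors[:k]]  # assumed initialization
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids

# Toy "embeddings" forming two well-separated groups.
toy_vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(toy_vectors, k=2)
print(labels)
```

Comparing the recovered cluster labels with the nominal classifications (domains or strands) is one way to gauge how semantically distinct those classifications are.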
... Map analysis provides a visual representation of how item difficulty aligns with student ability levels (Wilson, 2005). This analysis is particularly valuable for identifying any mismatches between the test's difficulty and the ability range of the student population, offering insights into the test's overall effectiveness and areas for potential improvement. ...
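A rough text rendering of such a map, assuming person measures and item difficulties are already estimated on a common logit scale, might look like the sketch below. The function names, the one-logit bin width, and the toy values are hypothetical and are not taken from the cited study.

```python
import math
from collections import defaultdict

def wright_map(person_measures, item_difficulties, bin_width=1.0):
    """Return the lines of a text Wright map, highest logit bin first.

    person_measures:   list of person measures in logits
    item_difficulties: dict mapping item name -> difficulty in logits
    """
    person_bins = defaultdict(int)
    item_bins = defaultdict(list)
    for theta in person_measures:
        person_bins[math.floor(theta / bin_width)] += 1
    for name, b in item_difficulties.items():
        item_bins[math.floor(b / bin_width)].append(name)
    occupied = list(person_bins) + list(item_bins)
    lines = []
    for k in range(max(occupied), min(occupied) - 1, -1):
        marks = "X" * person_bins.get(k, 0)      # one X per person in the bin
        names = " ".join(item_bins.get(k, []))   # items located in the bin
        lines.append(f"{k * bin_width:+5.1f} | {marks:<10} | {names}".rstrip())
    return lines

def untargeted_items(person_measures, item_difficulties):
    """Items whose difficulty lies outside the observed person range."""
    lo, hi = min(person_measures), max(person_measures)
    return sorted(n for n, b in item_difficulties.items() if b < lo or b > hi)

persons = [-0.5, 0.2, 0.4, 1.1]
items = {"i1": -2.3, "i2": 0.3, "i3": 2.6}
print("\n".join(wright_map(persons, items)))
print(untargeted_items(persons, items))
```

Items that sit above or below every person on the map are candidates for the kind of difficulty mismatch the excerpt describes.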
This study examines the psychometric properties of the GSP122 test, an Information and Communication Technology (ICT) knowledge assessment administered at a public university in Nigeria. Despite its importance in evaluating students' ICT competencies, no prior attempt has been made to investigate the test's psychometric qualities. The research focuses on three key aspects: item reliability, Differential Item Functioning (DIF), and Wright Map analysis. The study employs Rasch analysis to evaluate these properties. A sample of 600 GSP122 test scripts was randomly selected from undergraduate students across various departments to ensure a representative assessment. Findings reveal that the test possesses strong item reliability, indicating consistency in measuring the intended construct. Furthermore, all items are found to be DIF-free, suggesting fairness across different subgroups of test-takers. The Wright Map analysis, however, indicates that the test does not accurately target the abilities of students at the extreme ends of the proficiency spectrum. Specifically, some items are too difficult and others too easy relative to the students' ability levels. These results provide valuable insights into the GSP122 test's strengths and areas for improvement. While the test demonstrates robustness in reliability and fairness, adjustments in item difficulty could enhance its effectiveness in assessing students across all proficiency levels. This comprehensive analysis contributes to the validation of the GSP122 test and offers a foundation for evidence-based refinements in ICT assessment practices within the Nigerian higher education context.
... Next, Rasch Modeling can be used to develop and include new items designed to assess mathematical visual thinking abilities at various difficulty levels. By analyzing the Item Characteristic Curve (ICC) and Item Information Function (IIF), items can be designed that provide optimal information contributions at various student ability levels [31][32] [33]. ...
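For the dichotomous Rasch model, the Item Characteristic Curve and Item Information Function have simple closed forms: the probability of a correct response is a logistic function of ability minus difficulty, and the information is that probability times its complement. The sketch below uses hypothetical function names and difficulty values to show how information peaks where ability equals item difficulty.

```python
import math

def icc(theta, b):
    """Item Characteristic Curve: P(correct) at ability theta for difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Item Information Function for the Rasch model: p * (1 - p)."""
    p = icc(theta, b)
    return p * (1.0 - p)

def total_information(theta, difficulties):
    """Sum of item information over a set of items at ability theta."""
    return sum(item_information(theta, b) for b in difficulties)

# Hypothetical item pool with easy, medium, and hard items (logits).
pool = [-1.5, 0.0, 1.5]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(total_information(theta, pool), 3))
```

Because each item is most informative at theta equal to b, spreading item difficulties across the ability range is what yields useful information at every ability level.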
... This body of work is useful in envisioning how advances in the measurement and management of human, social and natural capital must entail transformed forms of social organization if their potential for informing sustainable policies and practices is to be realized. The technical complexities of established measurement models and methods [13][14][15][16][17] constitute a paradigm shift in how quantification is conceived; enacting this shift requires new forms of social and economic organization. After spelling out the role of measurement in establishing common product definitions, property rights and lower transaction costs, the primary features of rigorously defined, meaningful and useful quantification will be described. ...
The economics of commercial production are dependent in several key respects on the market institutions that create the context for profitable transactions. Markets are not created as much by trade as by the institutions that structure the standards taken for granted in the background of agreements and contracts. Property rights, scientific rationality, access to capital, and transportation and communications networks are all essential to a functional economy. These key elements are integrated into today’s market institutions only for manufactured capital, however, as they are mistakenly deemed irrelevant or inaccessible to human, social and natural capital. Costs associated with measuring and managing human, social and natural capital are, then, minimized and externalized whenever possible. Contrary to common opinion, however, proven, well-documented and long-standing resources exist for bringing scientific rationality to bear on intangible assets in meaningful and useful ways. The role of measurement standards in creating common product definitions and enforceable property rights, and in lowering transaction costs—and so in creating economically effective market institutions—is well understood, but virtually no attention has been paid to opportunities for extending those definitions, rights and lower costs into the domains of intangible assets. One crucial aspect of the problem is the widespread but mistaken assumption that quantification inherently requires the homogenization and smothering of unique individual differences. On the contrary, standards are the means by which both global harmonization and mass customization become possible, as is eminently apparent in the way universally accessible musical scales and tuned instruments set the stage for creative improvisation. Everyday language provides the model prototype for metrology as well as for the complex interrelations of formal, abstract and concrete meanings that are deployed in measurement. 
Designing and implementing standardized communications media adaptable to local circumstances are inherently complex, but solutions have been available, tested and in use for decades. Multilevel semiotic systems for communicating and managing the development and growth of human, social and natural processes set the stage for new developments in production research.
... Descriptive statistics were then used to summarize and describe the responses to each questionnaire item, providing an overview of central tendencies, such as mean scores, and measures of variability, such as standard deviation. This analysis helped identify general patterns within the data, highlighting areas where users expressed high satisfaction or noted challenges with the system [31]. Mean scores allowed for a straightforward interpretation of the responses, giving insight into the overall user experience and satisfaction across different user groups, while standard deviations highlighted the degree of consensus among respondents for each statement. ...
This study explores developing and implementing a web-based Management Information System (MIS) tailored for SMK Negeri 2 Banda Aceh, a vocational school in Indonesia. To enhance administrative efficiency and address unique challenges in vocational education, the system centralizes tasks such as attendance tracking, academic record management, and internship coordination. Employing the waterfall model, this project proceeded through structured phases, including requirements analysis, system design, development, and usability testing. A sample of 50 users, comprising students, teachers, and school operators, evaluated the system based on usability, interface design, and information clarity through a questionnaire, yielding high satisfaction scores. Reliability testing and correlation analysis revealed strong internal consistency across questionnaire items and identified critical factors influencing user satisfaction, such as interface appeal and effective error resolution. The results indicate that the system meets core user needs and contributes to a streamlined, user-friendly school management process. With implementation planning, user training, and ongoing maintenance, this MIS offers a sustainable solution that can serve as a model for vocational schools across Indonesia, showcasing the potential of digital solutions in advancing educational administration and supporting career readiness in vocational education.
... The measurement instrument was calibrated using a dichotomous Rasch model (for more details, see Rasch, 1960; for recent accounts, see, e.g., Bond et al., 2021; Wilson, 2023). Therefore, the items had to be converted into a dichotomous format. ...
Attitude toward nature and environmental attitude are two distinct propensities that both further learning about the environment. The present study builds upon prior research by investigating the role of attitude toward nature in learning about environmental issues. In a sample of 1,486 university, middle and high school students (M_age = 15.25, SD = 3.2), we first calibrated a pool of items expressing attitude toward nature. We found differences in how adolescents expressed their appreciation for nature at different ages. It is essential to consider these differences to accurately ascertain adolescents’ attitudes toward nature. We then conducted a mediation test. Whereas attitude toward nature determined the levels of knowledge students gained and retained, environmental attitude fully mediated the environmental knowledge subsequently demonstrated by the students. Our research suggests that researchers and educators may benefit from taking an experiential approach to learning about sustainable development by promoting appreciation for nature.
... The importance of using construct maps to explicitly represent the construct has been long established (Wilson, 2004). Still, designers often focus on minimizing context to develop a simplified and pure construct absent of any construct irrelevant variance (Randall, 2021). ...
Scholarship on the science of reading (SoR) has, in some instances, taken up more narrow views of reading in discussions and instantiations of reading assessment that do not center equity and justice, especially in schools. This can lead to less valid and even harmful reading assessment, especially for students from historically marginalized communities with diverse language, cultural, and neurological differences. Here, we draw on critically-minded reading research, as well as on work in equity-oriented educational assessment, to inform a justice-based reading assessment framework that can guide research, theory, policy, and practice. Using an equity-oriented and justice-based lens, the framework outlines three interwoven components: (1) relational and humanizing assessment practices; (2) justice-based products and outcomes; and, (3) a critical construct of reading. The framework compels designers, developers, and users to center the needs of rights-holders, and especially those from historically marginalized communities, throughout the assessment process. To do so, the framework outlines five principles that include orienting to equity and justice; prioritizing humanizing and critical assessment practices; grounding assessment in a complex, dynamic, and critical construct of reading for diverse populations; designing for justice-based social consequences, and engaging in critical debrief throughout. These principles guide eight phases of assessment, which we outline in detail. Finally, we discuss conceptual contributions as well as practical implications.
... This grading system helps students understand in which areas they are strongest and in which areas they require improvement. Rubrics also help to guide teachers in identifying their students' strengths and weaknesses, directing the learning process accordingly, and helping them provide learners with personalized feedback (Wilson, 2023). In this way, rubrics are considered an effective tool for evaluating student performance and providing them with feedback (Panadero & Jonsson, 2013). ...
... The five requirements of invariant measurement are described in Chapter 1. Chapters 2 and 3 provide descriptions of item-invariant person measurement and person-invariant item calibration based on a family of Rasch measurement models (Dichotomous Model, Partial Credit Model, Rating Scale Model, and Many Facet Model). A detailed description of how to construct measures based on Wilson (2005, 2023) is included in Chapter 4. Chapters 5 and 6 provide historical and comparative perspectives on the key ideas undergirding invariant measurement. Chapter 7 covers a variety of estimation methods that researchers and theorists have proposed for estimating the parameters in Rasch models, while Chapter 8 introduces the important idea of model-data fit. ...
... While content validity, which assesses how well the instrument reflects the knowledge being measured, is considered sufficient by many researchers, statistical validity provides an additional level of rigor that ensures the results are consistent and generalizable [15]. Similarly, knowledge tests play a key role in distinguishing between students who have achieved the objectives of the instrument and those who have not, which is essential for a comprehensive learning process [16]. For all these reasons, the general objective of this research was the construction and validation of a questionnaire to assess the knowledge of BLS and AED of primary and secondary school students (6 to 16 years old). ...
Sixty percent of cardiac arrests happen in the out-of-hospital setting. In 2023, the International Liaison Committee on Resuscitation issued a statement entitled “Children save lives”, recommending the teaching of basic life support to children from the age of 12. However, we have not identified validated instruments that assess the level of knowledge of schoolchildren about BLS and AED. Objective: Construction and psychometric validation of a questionnaire to assess knowledge on Basic Life Support (BLS) and Automated External Defibrillator (AED) in primary to secondary school children. Method: Cross-sectional descriptive validation study consisting of several phases: construction of the questionnaire on knowledge on BLS and AED (ConocES-BLS/AED), content validation, pilot test and psychometric validation. Results: The ConocES-BLS/AED questionnaire was constructed, and content validation was carried out by 14 experts. The pilot test, carried out on 105 students, reported good reliability (0.84). Finally, psychometric validation using Item Response Theory in a final sample of 182 participants yielded a questionnaire composed of 12 items. Adequate fit values and acceptable reliability (0.65) were obtained, demonstrating its usefulness to accurately measure the level of knowledge about BLS/AED maneuvers in schoolchildren. Conclusions: The created and validated questionnaire provides educators with a fundamental resource to identify areas of lack of knowledge, and to improve and design effective educational interventions for schoolchildren on BLS/AED maneuvers.
... To further assess the psychometric properties of the SAQ, we employed item response theory (IRT) using graded response model (GRM) analysis [33]. Specifically, a graded multidimensional IRT (MIRT) model was tested. ...
Background
Preoperative anxiety is commonly found in patients who are waiting for surgery and can lead to negative surgical outcomes. Understanding the sources of surgical anxiety allows healthcare providers to identify at-risk patients and implement psychosocial interventions such as counseling, relaxation techniques, and cognitive‒behavioral therapy to minimize anxiety. Few comprehensive psychiatric measures are available to assess preoperative anxiety in Arabic.
Objective
Our study aimed to translate, adapt, and validate the Surgical Anxiety Questionnaire (SAQ) into the modern standard Arabic language, also known as Fusha al-Asr Arabic.
Methods
To translate the questionnaire, the research team used the gold standard process of forward translation by two independent translators along with back translation evaluation by four trained medical doctors. A cross-sectional study was conducted using an online survey completed by 208 Arabic speakers (mean age 38 years, 44% women) from four countries. Psychometric analyses, which included internal consistency, test-retest reliability, convergent validity, confirmatory factor analysis, and item response analysis, were performed. Convergent validity tests were performed against the Generalized Anxiety Disorder 2-item Scale (GAD-2), Patient Health Questionnaire-2 (PHQ-2), Perceived Stress Scale 4 (PSS-4), and Arabic version of the Visual Analog Scale for anxiety (VAS-A).
Results
The mean SAQ score of our sample was 19.38 ± 12.63 (possible range 0–68). The Arabic SAQ translation demonstrated excellent internal consistency, with McDonald’s omega and Cronbach’s alpha both approximately 0.90. The test-retest reliability was also high, with an intraclass correlation coefficient of 0.94. The SAQ showed strong convergent validity against the GAD-2 (r = 0.94, p < 0.01). The SAQ also showed weak-moderate correlations with the PHQ-2 (r = 0.26, p < 0.01), PSS-4 (r = 0.43, p < 0.01), and VAS-A (r = 0.36, p < 0.01) scores. Confirmatory factor analysis supported the three-factor structure reported for the original English language version. The fit indices showed acceptable preliminary results (CFI/TLI approximately 0.90), and deleting some items improved the model fit (CFI/TLI > 0.90, RMSEA < 0.08). We suggest retaining the original factorial solution until further validation studies can be conducted. The item response theory (IRT) results identified no items that were excessively difficult or subject to guessing. The multidimensional IRT provided evidence that the SAQ items form a multidimensional scale assessing surgical anxiety that fits the classical model reasonably well.
Conclusion
The SAQ has demonstrated acceptable reliability and validity; thus, it is a trustworthy and valid tool for evaluating preoperative anxiety in Arabic speakers. Future research could benefit from using the SAQ in both surgical and psychiatric research.
... It is generally considered adequate if the separation index exceeds 2 and the separation reliability coefficient exceeds 0.7 [44]. To visualize item features and individual measures, a Wright map was used to estimate the values measured by the sample respondents and the average location of all items on a common scale (logits) [45]. Additionally, differential item functioning (DIF) analysis was conducted to test measurement invariance across genders; DIF is typically flagged when the DIF contrast exceeds 0.5 logits with a p value < 0.05 [46]. ...
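The separation index G and the separation reliability R are linked by the standard Rasch relation R = G^2 / (1 + G^2), so the two cut-offs quoted here are not interchangeable. The sketch below works through that arithmetic; the function names and toy measures are hypothetical, and the use of the population variance is an assumption, since software packages differ on that detail.

```python
import math
from statistics import pvariance

def reliability_from_separation(g):
    """R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

def separation_from_reliability(r):
    """G = sqrt(R / (1 - R)), the inverse of the relation above."""
    return math.sqrt(r / (1.0 - r))

def separation_stats(measures, standard_errors):
    """Return (separation, reliability) from measures and their standard errors.

    "True" variance is the observed variance minus the mean squared error.
    """
    observed_var = pvariance(measures)  # assumption: population variance
    error_var = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    true_var = max(observed_var - error_var, 0.0)
    separation = math.sqrt(true_var / error_var)
    reliability = true_var / observed_var
    return separation, reliability

print(round(reliability_from_separation(2.0), 3))   # reliability implied by G = 2
print(round(separation_from_reliability(0.7), 3))   # separation implied by R = 0.7
print(separation_stats([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5))
```

A separation of 2 corresponds to a reliability of 0.8, whereas a reliability of 0.7 corresponds to a separation of only about 1.53, so the separation criterion is the stricter of the two.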
Background
There is currently no dedicated measurement for assessing nursing students’ study interest in China. Considering the good reliability, validity, and widespread applicability of the Study Interest Questionnaire-Short Form (SIQ-SF), the objective of this study was to validate its usage among Chinese nursing students.
Methods
The translation and cross-cultural adaptation rigorously followed the modified Brislin model. A cross-sectional survey was conducted using the Chinese version of the SIQ-SF, and convenience sampling was employed to select nursing students. The psychometric evaluation of the Chinese version of the SIQ-SF was conducted based on Classical Test Theory and Item Response Theory.
Results
A total of 1158 participants were included in the analysis. The item-level content validity index (CVI) ranged from 0.9 to 1.0, and the scale-level CVI was 0.98. In the exploratory factor analysis, three factors with eigenvalues above 1 were identified, cumulatively accounting for 62.554% of the variance. In the confirmatory factor analysis, the CMIN/DF was 5.639, the GFI was 0.953, the CFI was 0.902, and the IFI was 0.904. The Cronbach’s α coefficient of the Chinese version of the SIQ-SF was 0.70. Thirty-one participants were invited to complete the scale again after two weeks. The intraclass correlation coefficient was 0.784, and that of items ranged from 0.70 to 0.819. The infit MnSQ values ranged between 0.76 and 1.51, and the outfit MnSQ values ranged between 0.72 and 1.76. The point-measure correlation values ranged between 0.30 and 0.68. The item difficulty measures ranged from −0.66 to 1.44 logits, and the individual learning interest estimates ranged from −4.22 to 4.97 logits. DIF contrast ranged from 0.00 to 0.33 logits, with all p values greater than 0.05.
Conclusions
The Chinese version of the SIQ-SF demonstrated acceptable reliability and validity among Chinese nursing students and could be used to assess nursing students’ study interest in China. With the aid of this scale, teachers can gain a better understanding of nursing students’ study interests, thereby maximizing their learning effects through appropriate content and methods.
... This national assessment highlights the primary goals of schools, namely character development and student competencies. When executed effectively, its results can be used to improve teaching and learning processes in schools (Wilson, 2023). Surveys on learning environments among students are part of this national assessment, conducted in grades 5, 10, and 11, measuring various aspects such as school safety climate, diversity climate, socioeconomic index, quality of learning, and teacher development (Beatty et al., 2021). ...
Objective: This study aims to support the implementation of literacy programs in schools to enhance the school safety climate. The program targets the negative impacts of bullying and violence and helps students develop conflict resolution skills to reduce harmful incidents. Theoretical Framework: Academic literacy is defined as the ability to read and comprehend texts within specific content areas in a school setting. This includes understanding simple texts as well as addressing difficulties with more complex materials often encountered in school. Method: A quantitative approach with path analysis and purposive sampling was used, with the Taro Yamane method applied to select relevant samples. The study examines five dimensions of the school safety climate: Teacher-Student Relationship Quality, Student Character and Behavior, Peer Relationships, School Safety Climate Quality, and Bullying Frequency. Results and Discussion: The findings reveal that Teacher-Student Relationships, Student Behavior, and Peer Relationships all have a direct impact on the quality of the school safety climate. The school safety climate, in turn, significantly influences the overall school environment. Research Implications: The study suggests that integrating literacy programs can significantly improve the school safety climate by enhancing relationships and reducing bullying. Originality/Value: The integration of literacy programs represents a novel approach to fostering a safer and more supportive school environment. These findings offer practical insights for schools aiming to enhance their safety climates and support student development.
... Wright's admonitions and recommendations in this regard are not often followed, but the value of the probabilistic models of measurement he advocated was recognized by many soon after they were introduced (Wilson & Fisher, 2017). These models have been further explicated and increasingly applied in psychology, health care, and the social sciences over the last several decades (Andersen, 1977, 1980; Andrich, 1978, 2010; Andrich & Marais, 2019; Aryadoust et al.; Bezruczko, 2005; Boone & Staver, 2020; Embretson, 1996b, 2010; Engelhard, 2012; Fischer, 1973, 1981; Fischer & Molenaar, 1995; Fisher & Wright, 1994; Hagell, 2014, 2019; Loevinger, 1965; Massof, 2008; Masters & Keeves, 1999; Pendrill, 2019, 2024; Melin et al., 2021; Pesudovs, 2006, 2010; Salzberger, 2009; Smith & Smith, 2004; von Davier & Carstensen, 2007; Wilson, 1992, 2018, 2023). ...
... The path to coherence in measurement might begin from analyses of existing data, an available instrument, or a theory of the construct to be quantified. In any case, an approach taking various forms that can be generally characterized as construct mapping (Daniel & Embretson, 2010; De Boeck & Wilson, 2004; Embretson, 2010; Fischer, 1973; Stenner et al., 2013; Stone et al., 1999; Wilson, 2004, 2023) offers a powerful strategy for both physical and social scientists. It emphasizes the importance of an a priori theory about how physical, chemical, biological, psychological, or social processes might manifest at every level of the instrument's desired utility or use cases. ...
... (Lynch, 2001; Popham, 2000; Thorndike & Thorndike-Christ, 2009) While the assignment of value may take either qualitative or quantitative form, it is important to note that under this definition, measurement is a form of assessment that relies on the quantification of the assigned value. The sense of quantification intended here is metrological, following Mari et al. (2023), Pendrill (2019), and Wilson (2023). Though my purpose here is to spell out the sociohistorical aspects of culturally specific assessment, the measurement ideas and methods put to use in creating those assessments are in tune with efforts addressing the irreducible complexity involved in creating and sharing meaningful comparisons (Confrey et al., 2021; Fisher & Wilson, 2015; Lehrer, 2013; Mallinson, 2024). ...
How can assessment be defined and operationalized to avoid measurement disjuncture, a misalignment wherein elements of an instrument development process from one worldview are applied in ways that negate and override another worldview? Culturally specific assessment is introduced to counter measurement disjuncture and to establish a critically derived form of assessment that is culturally responsive and relevant. The conceptual workspace of culturally specific assessment developers resides within a swirling environment of socio-historical factors that produces both disjunctures and cultural aspirations for new assessment possibilities, moving in uplifting directions. The theoretical framework of the disjuncture-response dialectic illuminates how settler colonialism and intellectual elimination contribute to measurement disjunctures and informs culturally specific assessment alternatives in support of intellectual amplification and indigenous sovereignty.
... Methods are provided here. The example of a practical implementation of measurement modeling we present assumes readers possess a basic understanding of the relevant theory (Andrich & Marais, 2019; Bond et al., 2021; Linacre, 1994; Linacre et al., 1994; Myford & Wolfe, 2009; Smith et al., 1994; Wilson, 2023). The aim is to clearly articulate what can be accomplished in the measurement context of a job analysis, especially as it applies to the weighting of tests. ...
... Various articulations of the logic and methods of measurement demonstrated in the chapters of this book have been available and in use for over 50 years (Bradley & Terry, 1952; Brink, 1972; Fischer, 1968, 1973; Guttman, 1944, 1950; Rasch, 1960, 1961; Loevinger, 1965; Luce, 1959; Luce & Tukey, 1964; Wright, 1965, 1968), and in some respects, for almost 100 years (Thurstone, 1926, 1928; see Andrich, 1978, 1988; Brogden, 1977). Over the course of those years, the inferential advantages and meaningful comparability provided by this logic and these methods (Andrich, 2010; Embretson, 1996; Rasch, 1977; Narens & Luce, 1986; Wilson, 2005, 2013b; Wright, 1977, 1997; Wright & Masters, 1982; Wright & Stone, 1979) led to their adoption across wide ranges of research and practice (Alagumalai et al., 2005) involving everything from educational (Confrey et al., 2021; Connolly et al., 1971/2007; Green, 1986; Masters & Keeves, 1999; Wilson, 2018; Wright, 1977, 1997), psychological (Commons et al., 2014; Dawson, 2004; K. Fischer & Dawson, 2002), and health care outcomes (Allen & Pak, 2023; Belvedere & de Morton, 2010; Bezruczko, 2005; Cano et al., 2009; Christensen et al., 2013; A. ...
... To be clear, these seven principles are guidelines in the construction of items which then define a Rasch scale; only unidimensionality, equal discrimination, and local independence are formal statistical assumptions. As assumptions, they are subject to testing and numerous procedures exist for this purpose (for details see Andrich & Marais, 2019; Engelhard, 2012; Wilson, 2004; Wright & Masters, 1982). ...
... The development of the scoring rubric is related to the construct modeling approach, namely item design and result space (Figure 1). The results space consists of a set of different qualitative categories for identifying, evaluating, and scoring student answers [47]. Multi-representation abilities based on the pattern of interpretation of answers are shown in Table 7. ...
The advantage of the four-tier model test is that it has four levels of questions that require complex reasoning abilities from students to express the reasons for their answers to the problems given. This study aimed to develop a multiple representation test with a four-tier model to measure students’ multiple representation abilities in electricity topics. The development of this test instrument followed a Design-Based Research model consisting of five stages, namely developing an assessment framework, designing items, developing rubrics, conducting trials, and applying Rasch Model analysis with the Item Response Theory (IRT) approach, assisted by the Winsteps program, version 3.68.2. The research method used was a descriptive-exploratory method to describe the results of the development and validation of the four-tier model test. The test consisted of 20 items developed based on four types of representation, namely verbal representations, diagrammatic representations (pictures), mathematical representations, and graphic representations, each with its own indicator. The research subjects were 30 prospective physics students at the pilot test stage and 79 prospective physics students at the field test stage from different universities in Makassar city. The development results show that, overall, the items of the four-tier model test are valid with a high level of reliability (Cronbach's alpha = 0.80). Based on the results of expert judgment validation and testing, it can be concluded that the multiple representation test with a four-tier model on the electricity topic is feasible to use.
This study examines the psychometric properties of a Mathematics Readiness Assessment Tool for elementary school students using Rasch analysis through the Winsteps program. The analysis involved 214 students responding to 15 items. Results show strong psychometric characteristics with an item reliability of 0.95 and person reliability of 0.69. The instrument demonstrates good construct validity with INFIT and OUTFIT MNSQ values falling within the acceptable range (0.5-1.5). Analysis of unidimensionality showed satisfactory results with raw variance explained by measures at 26.8% and unexplained variance in the first contrast at 7.6%. The local independence assumption was met, with residual correlations ranging from -0.22 to 0.20. Item difficulty ranges from -1.33 to 1.56 logits, indicating a good spread of item difficulties. The test information function shows optimal measurement precision for students of average ability levels, though it is less precise for extremely high or low abilities. Differential Item Functioning (DIF) analysis revealed some items requiring revision, particularly items 13-15. The instrument's Cronbach's alpha value of 0.73 indicates good internal consistency. Overall, the assessment tool demonstrates adequate psychometric properties for measuring mathematics readiness in elementary school students, though some improvements are recommended for optimal measurement across all ability levels.
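The INFIT and OUTFIT mean-square statistics cited in several of these abstracts can be sketched for the simplest case. The code below assumes a dichotomous Rasch model and treats the person measures and the item difficulty as already estimated; the function names and the example values are illustrative and are not taken from any of the studies listed here.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct (1) response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit_mnsq, outfit_mnsq) for one item.

    responses: 0/1 responses to the item, one per person
    thetas:    person measures in logits, in the same order (assumed known)
    b:         item difficulty in logits (assumed known)
    """
    sq_resid_sum = 0.0   # sum of squared raw residuals
    variance_sum = 0.0   # sum of model variances p(1 - p)
    z_sq_sum = 0.0       # sum of squared standardized residuals
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        variance = p * (1.0 - p)
        sq_resid = (x - p) ** 2
        sq_resid_sum += sq_resid
        variance_sum += variance
        z_sq_sum += sq_resid / variance
    infit = sq_resid_sum / variance_sum   # information-weighted mean square
    outfit = z_sq_sum / len(responses)    # unweighted mean square
    return infit, outfit


# Hypothetical example: five persons answering one item of difficulty 0.2 logits.
infit, outfit = item_fit([1, 0, 1, 1, 0], [1.5, -0.8, 0.3, 2.1, -1.4], 0.2)
print(infit, outfit)
```

Both statistics have an expected value near 1 when the data fit the model. Outfit weights every person equally and so reacts strongly to unexpected responses from persons far from the item's difficulty, whereas infit weights each residual by its model variance.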
This article is based on a qualitative research project related to PIRLS (Progress in International Reading Literacy Study). We explore how 25 Danish students from two fourth-grade classes interact with PIRLS items as readers and writers. We focus on two specific constructed-response items from a literary text used in PIRLS 2016 by examining and comparing students’ written responses to these items and oral responses expressed during interviews. To some extent, students’ text interactions differ from expectations presented in the international scoring guide. Based on our analysis, which draws on reader-response theory, we find that the scoring guide largely expects students to assume an efferent or discursive approach to reading, whereas many students in this study seem to approach texts and respond to items based on a more personal literary experience. We also find that the written student responses contain less elaboration than their oral responses. In this article, we discuss both written and oral responses as indicators of reading comprehension in PIRLS and the inferences related to student performance, with a focus on aesthetic and efferent reading.
This paper presents the findings of a validation study for an instrument designed to assess the views on the non-epistemic nature of science (NE-NOS) among pre-service physics teachers (PSPTs). Despite the acknowledged significance of the nature of science (NOS), research has predominantly focused on its epistemological aspects, sidelining non-epistemological facets that encompass contextual, social, and psychological dimensions relevant to science and its practitioners. Drawing from a comprehensive literature review within science education research, we developed a construct map describing three underlying components of NE-NOS. This construct map forms the basis for a proposed hypothetical progression outlining the developmental stages of PSPTs’ views on NE-NOS, categorized as naïve, mixed, and informed. The items comprising the NE-NOS assessment adopt an ordered multiple-choice format, where each response option reflects a specific level of views on NE-NOS. Results from a validation study involving 309 PSPTs demonstrate robust reliability and validity through Rasch analysis, corroborated by evidence in internal scale validity, construct validity, concurrent validity, and response process validity. The evaluation of PSPTs using the instrument reveals a prevailing mixed level of views on NE-NOS. The implications of the NE-NOS instrument for enhancing theoretical understanding of NOS and NOS-based teacher training are discussed. In conclusion, the NE-NOS assessment validly measures PSPTs’ NE-NOS views and could serve as a valuable tool for raising awareness of NE-NOS. Researchers and teacher educators can utilize it as a diagnostic instrument to study the effects of NOS education.
Objectives: Dupuytren disease is a common condition that causes progressive finger contractures resulting in impaired hand function and difficulties in performing daily activities. Patient reported outcome measures (PROMs) are commonly used in research and clinical practice to evaluate treatment outcomes. Both general upper extremity PROMs and Dupuytren-specific PROMs are available, typically developed using conventional methodology based on classical test theory. However, most current PROMs have been shown to have low responsiveness and the relevance of included items has been questioned. In this study we aim to develop a new Dupuytren-specific PROM using modern measurement methodology based on item response theory (IRT). Methods: The study will be performed in three phases. In Phase 1 (completed), an expert group developed a questionnaire with a large number of potentially relevant items derived from existing PROMs and patient collaboration. In Phase 2, the questionnaire will be administered to 300 patients with Dupuytren disease, and their responses will be analyzed with IRT methodology to identify the best performing items to be included in the new PROM. In Phase 3, the new PROM will be administered to 300 additional patients for validation. Conclusion: This new Dupuytren-specific patient-reported outcome measure will help advance clinical research on Dupuytren disease.
Half a century ago, Jean Piaget suggested a new approach to the study of moral (and non-moral) behavior which we have called the Dual Aspect Model, and which, as we shall see, has opened up a new, very prosperous field of research into the nature and development of human morality. When I came across Piaget’s suggestion, in the early 1970s, I was thrilled. I imagined that if this notion, which has received little attention in research on moral behavior and development, were fully understood, it could revolutionize psychology and education.
The present research aims to use the Rating Scale Model (RSM), one of the extensions of the one-parameter Rasch model, to develop a cognitive distortion scale for preparatory-stage students. To achieve this aim, the researcher adopted the Inventory of Cognitive Distortions (CDI) of Yurica & DiTomaso (2001), which was adapted to the local environment by Mukhtar and Al-Saadawi (2014). The inventory consists of 69 self-report items with a five-point response scale. A stratified random sample of 300 male and female students was drawn from morning-shift preparatory school students in the General Directorate of Education of Kirkuk Governorate.
The assumptions of logistic item response theory were verified, the most important being unidimensionality, through a factor analysis of the scale using the principal components method, which yielded a single meaningful factor explaining the scale. The factor was retained according to Guttman's lower-bound criterion, under which a factor is considered statistically significant when its interpretable eigenvalue is equal to or greater than 1, together with the Burt-Banks criterion for the loading of the scale items on the general factor. The researcher also used a second method to verify unidimensionality: the Pearson correlation coefficient between each item score and the total scale score, as an indicator of internal consistency among the items and of their tendency to measure a single trait; the correlation coefficients ranged from 0.38 to 0.71. To verify the assumption of local independence, the researcher conducted a factor analysis of the scale by examining the item dependency matrix. To analyze the item data, the researcher relied on the Rating Scale Model, using the computer program ConstructMap-4.6.
The results of the statistical analysis showed that the item difficulty (location) estimates ranged from -2.109 to 2.862, with a mean standard error of estimate of 0.037, after deleting 3 items. The tau (τ) values for the response categories were -2.391, -1.008, 0.671, and 1.945, in order from the lowest category. Person ability estimates ranged from -2.416 to 1.986. The information reliability was 0.91, the reliability of the item estimates was 0.75, and the reliability of the ability estimates was 0.90. These results demonstrated to the researcher the effectiveness of the model in developing the Inventory of Cognitive Distortions (CDI).
The international relevance of measuring students' attitudes toward mathematics is well-established. The purpose of this article is to construct a scale for secondary school students' attitudes toward the importance of mathematics in Oman using the Guttman method and Rasch model. This study describes a new approach that combines these two methods to develop a psychometrically sound instrument. Using a sample of 309 eleventh-grade students, the authors developed and validated a scale, starting with 14 items. The content validity assessment led to the removal of two items, whereas the Rasch analysis resulted in the elimination of one additional item. The final 11-item scale demonstrated good fit with Rasch model assumptions, including unidimensionality, equal discrimination, satisfactory reliability, and separation coefficients. The reproducibility (0.93) and scalability (0.82) coefficients met Guttman's criteria. The scale values ranged from -2.68 to 2.04 logits, confirming the cumulative principle. This paper is novel because it integrates the Guttman and Rasch approaches to develop a robust attitude scale, thus improving single-method approaches. The instrument can be used by researchers and stakeholders to effectively measure students' attitudes toward mathematics, potentially improving educational practices and outcomes.
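A Guttman coefficient of reproducibility of the kind reported above can be sketched as follows. This is only an illustration under stated assumptions: responses are dichotomous (1 = endorse, 0 = not), items are ordered by how often they are endorsed, and errors are counted as deviations from the ideal cumulative pattern implied by each person's total score. The abstract does not say which error-counting rule the authors used, and the function name and data are hypothetical.

```python
def reproducibility(responses):
    """Coefficient of reproducibility: 1 - errors / (persons * items).

    responses: list of per-person lists of 0/1 item responses.
    """
    n_persons = len(responses)
    n_items = len(responses[0])

    # Order items from most to least frequently endorsed (easiest first).
    endorsements = [sum(person[j] for person in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -endorsements[j])

    errors = 0
    for person in responses:
        score = sum(person)
        # Ideal Guttman pattern: the `score` easiest items endorsed, the rest not.
        ideal = [1] * score + [0] * (n_items - score)
        observed = [person[j] for j in order]
        errors += sum(o != i for o, i in zip(observed, ideal))

    return 1.0 - errors / (n_persons * n_items)


# Hypothetical data: the last person endorses a harder item but not the easiest.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
]
print(reproducibility(data))
```

A perfectly cumulative response matrix gives a coefficient of 1.0; the conventional threshold for an acceptable Guttman scale is 0.90.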
The aim of this study is to validate the 36 items related to emotional problems from the Korean Children and Youth Panel Survey (KCYPS) using Rasch models. Specifically, the analysis focused on the 2018 data from high school sophomores in the 5th wave of KCYPS, employing both the Partial Credit Model (PCM) and the Rating Scale Model (RSM). The findings are as follows. First, all 36 items were found to fit both the PCM and RSM, indicating that the 4-point response categories were appropriate for both models. Second, the PCM, which assumes non-equidistant thresholds, demonstrated a statistically better fit than the RSM, which assumes equidistant thresholds. Third, items related to somatization and depression were particularly effective in capturing threshold shifts of emotional problems. Based on these results, the study discusses modeling options for threshold estimation and proposes future research topics, including the application of longitudinal item response theory to KCYPS data.
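The relationship between the two models compared above can be made concrete with a short sketch. It assumes the standard adjacent-category formulation, in which the Partial Credit Model gives each item its own step difficulties, while the Rating Scale Model writes each step as an item location plus a threshold offset shared by all items. The function names and parameter values are illustrative and are not estimates from the KCYPS data.

```python
import math


def pcm_probs(theta, steps):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person measure in logits
    steps: step difficulties; steps[k-1] governs the move from category k-1 to k
    """
    # Cumulative sums of (theta - step); category 0 has a log-numerator of 0.
    log_numerators = [0.0]
    for step in steps:
        log_numerators.append(log_numerators[-1] + (theta - step))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(log_numerators)
    numerators = [math.exp(v - largest) for v in log_numerators]
    total = sum(numerators)
    return [n / total for n in numerators]


def rsm_probs(theta, location, taus):
    """Rating Scale Model as a constrained PCM: step k = location + taus[k-1]."""
    return pcm_probs(theta, [location + tau for tau in taus])


# Hypothetical 4-category items (three steps each).
shared_taus = [-1.0, 0.0, 1.0]
print(rsm_probs(0.5, 0.0, shared_taus))    # RSM item at location 0.0
print(rsm_probs(0.5, 0.8, shared_taus))    # RSM item at 0.8, same threshold offsets
print(pcm_probs(0.5, [-1.4, 0.3, 0.9]))    # PCM item with its own step difficulties
```

Because the RSM is nested within the PCM under this formulation, a likelihood-based comparison of the two models is possible.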
This study investigates the application of Decision Tree analysis to optimize test item selection in the development of mathematical procedures, particularly in the domains of Measurement and Geometry. Employing Machine Learning techniques, the research focuses on assessing the mathematical proficiency of seventh-grade students. A predictive model is constructed and assessed using a dataset comprising responses from 528 students, with analysis conducted using Decision Trees with a depth of three layers. Utilizing Python via Google Colaboratory, the study evaluates three test formats-essay, multiple-choice, and hybrid on Precision, Recall, F-measure, Accuracy, and Error metrics. Findings highlight the superior efficacy of the essay format, achieving an accuracy of 0.77 and an error rate of 0.56, meeting established criteria. Conversely, the multiple-choice format demonstrates comparatively lower effectiveness, with an accuracy of 0.62 and an error rate of 0.74. These results underscore the significance of employing Decision Tree Analysis for informed test item selection, particularly favoring subjective formats in optimizing mathematical proficiency assessments within measurements and geometry.
This study aims to develop a digital tool named “S2QEK TOOL”, a digital platform using Google Forms for developing teachers who teach in the independent study curriculum of international standard schools. The sample consisted of 15 informants and 335 middle school teachers from 37 schools under the Chaiyaphum Secondary Educational Service Area Office. The researchers used a design-based approach, including construct maps, item design for the digital tool, outcome space, and quality evaluation using a Wright map. Data were analyzed using content analysis, the Multidimensional Random Coefficient Logit Model, and the Maximum Likelihood Estimation method. The study aimed to improve teacher competencies in managing independent study courses. The findings from the semi-structured interviews with the 15 informants revealed that the most effective tool for assessing the competency levels of independent study course teachers is a digital assessment tool. The examination results of the 335 independent study course teachers indicated that their competencies can be categorized into four levels (unsatisfactory, basic, proficient, and distinguished) for the three aspects. Based on the examination results, the researchers went on to create three construct maps and their scoring guide. In conclusion, the researchers have successfully created construct maps for each aspect of independent study course competencies (skills of knowledge, teaching methods, and abilities to fit the actual context) by adopting a scoring guide that refers to the guidelines for teaching and learning management in the international standard schools framework.
Item response theory models posit latent variables to account for regularities in students' performances on test items. Wilson's
“Saltus” model extends the ideas of IRT to development that occurs in stages, where expected changes can be discontinuous,
show different patterns for different types of items, or even exhibit reversals in probabilities of success on certain tasks.
Examples include Piagetian stages of psychological development and Siegler's rule-based learning. This paper derives marginal
maximum likelihood (MML) estimation equations for the structural parameters of the Saltus model and suggests a computing approximation
based on the EM algorithm. For individual examinees, empirical Bayes probabilities of learning-stage are given, along with
proficiency parameter estimates conditional on stage membership. The MML solution is illustrated with simulated data and an
example from the domain of mixed number subtraction.
We develop an approach to the measurement of knowledge content, knowledge access and knowledge learning. This approach has two elements: First we describe a theoretical view of cognition, called the Newell-Dennett framework, which we see as being particularly favourable to the development of a measurement approach. Then, we describe a class of measurement models, based on Rasch modeling, which we see as being particularly favourable to the development of cognitive theories. Knowledge content and access are viewed as determining the observable actions selected by an agent in order to achieve desired goals in observable situations. To the degree that models within the theory fit the data at hand, one considers measures of observed behavior to be manifestations of intelligent agents having specific classes of knowledge content and varying degrees of access to that knowledge. Although agents, environment, and knowledge are constitutively defined (in terms of one another), successful applica...
In this article an item response model is introduced for repeated ratings of student work, which we have called the Rater Bundle Model (RBM). Development of this model was motivated by the observation that when repeated ratings occur, the assumption of conditional independence is violated, and hence current state-of-the-art item response models, such as the rater facets model, that ignore this violation, underestimate measurement error, and overestimate reliability. In the rater bundle model these dependencies are explicitly parameterized. The model is applied to both real and simulated data to illustrate the approach.
It is argued that we should distinguish between categories of description and that which is described. By making this distinction, we transcend the commonplace individual differences perspective in two ways. First, differences between individuals in functioning are not axiomatically interpreted as reflecting individual differences in the traditional sense. Secondly, the differences between individuals described are seen as being relative to the one who is doing the describing. In accordance with this, a description is given of the conception of learning underlying the model of description chosen to characterise learning from the learner's point of view. As far as the learner is concerned, the way in which he conceptualises learning, the way in which he experiences the learning situation and the way in which he performs the learning task are seen as three different aspects of his idea of learning and hence they appear to be logically related to each other. The distinction between relations, which are of a logical and empirical character, is seen as the basis from which we can transcend the perspective of the individual differences and thereby make it possible to study individual differences in a more meaningful way.
The technique of scalogram analysis was employed to test Piaget's theory that there is a fixed order in which classificatory concepts are acquired. It was hypothesized that there are 11 steps by which children learn to build upon simple equivalence groupings to attain the concept of class inclusion. When these steps were translated into tasks and administered to 122 Ss between 4 and 9 years of age: (a) there was a significant correlation between S's age and the number of tasks mastered; (b) the order of difficulty of the tasks corresponded to the predicted order; (c) there was no set order of mastery such that passage of a more difficult item invariably implied passage of all the easier items; (d) for each task there were no age differences among Ss who made different kinds of errors.
Why are some students able to learn to use the trial and error method to balance chemical equations while others are not? To test the hypothesis that formal reasoning is required to balance even simple one-step equations, while formal reasoning and a sufficiently large mental capacity are required to balance more complex many-step equations, a sample of science students was tested to determine level of intellectual development, mental capacity, and degree of field dependence/field independence. Students were then given classroom instruction in using trial and error to balance equations. As predicted, a posttest revealed significant correlations between developmental level and equation balancing ability for both simple and complex equations. Also, as predicted, mental capacity correlated significantly with complex equations but not with simple equations. Field dependence/field independence played no significant role in performance. Educational implications are drawn.
Will performance assessments in mathematics have gender DIF? Do male and female examinees provide similar solution strategies?
The author discusses conditions under which Measurement Driven Instruction (MDI) might and might not produce its expected results. He considers how test stakes, performance standards, and the nature of content assessment interrelate to influence the operation of MDI programs. In discussing instructional expectations and realities as they apply to MDI, he argues that under some conditions instruction driven by measurement may corrupt the measurement process.
We empirically test existing theories on the provision of public goods, in particular air quality, using data on sulfur dioxide (SO2) concentrations from the Global Environment Monitoring Projects for 107 cities in 42 countries from 1971 to 1996. The results are as follows: First, we provide additional support for the claim that the degree of democracy has an independent positive effect on air quality. Second, we find that among democracies, presidential systems are more conducive to air quality than parliamentary ones. Third, in testing competing claims about the effect of interest groups on public goods provision in democracies we establish that labor union strength contributes to lower environmental quality, whereas the strength of green parties has the opposite effect.
When the three-parameter logistic model is applied to tests covering a broad range of difficulty, there frequently is an increase in mean item discrimination and a decrease in variance of item difficulties and traits as the tests become more difficult. To examine the hypothesis that this unexpected scale shrinkage effect occurs because the items increase in complexity as they increase in difficulty, an approximate relationship is derived between the unidimensional model used in data analysis and a multidimensional model hypothesized to be generating the item responses. Scale shrinkage is successfully predicted for several sets of simulated data.