Anne Corinne Huggins-Manley

University of Florida | UF · College of Education

PhD

About

77 Publications
25,404 Reads
1,132 Citations
Introduction
I study issues of fairness in educational measurement.

Publications (77)
Article
Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across different subgroups, such as gender and native language. However, relying solely on the item response scores among traditional DIF techniques poses challenges for researchers and practitioners in interpreting DIF. Recently,...
Article
Rapid-guessing behavior in data can compromise our ability to estimate item and person parameters accurately. Consequently, it is crucial to model data with rapid-guessing patterns in a way that can produce unbiased ability estimates. This study proposes and evaluates three alternative modeling approaches that follow the logic of the effort-moderat...
Article
Objectives: Age-related psychometric differences in Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5) opioid use disorder (OUD) diagnostic criteria have been hypothesized but not tested. This study investigated DSM-5 OUD diagnostic criteria for age-related measurement noninvariance among younger adults (YAs) and midd...
Article
Social desirability bias (SDB) is a common threat to the validity of conclusions from responses to a scale or survey. There is a wide range of person-fit statistics in the literature that can be employed to detect SDB. In addition, machine learning classifiers, such as logistic regression and random forest, have the potential to distinguish between...
Article
The purpose of this study is to develop a nonparametric DIF method that (a) compares focal groups directly to the composite group that will be used to develop the reported test score scale, and (b) allows practitioners to explore for DIF related to focal groups stemming from multicategorical variables that constitute a small proportion of the overa...
Article
In classroom assessments, examinees can often answer test items multiple times, resulting in sequential multiple‐attempt data. Sequential diagnostic classification models (DCMs) have been developed for such data. As student learning processes may be aligned with a hierarchy of measured traits, this study aimed to develop a sequential hierarchical D...
Article
This study was conducted to collect validity evidence to support the use of an observation instrument to evaluate the performance of special education teachers (SETs) of students with significant disabilities (SWSD). In the study, a purposive sample of 49 SETs of SWSD, who were appropriately credentialed and experienced, evaluated the content of th...
Article
Behavior rating scales are frequently used assessment tools to measure social skills. Use of norm-referenced assessments such as behavior rating scales requires examiners and test publishers to consider when norms become obsolete and norm-referenced scores can no longer be validly interpreted. A fundamental factor influencing norm obsolescence rega...
Article
The global COVID-19 health pandemic caused major interruptions to educational assessment systems, partially due to shifts to remote learning environments, ushering in a post-COVID educational world that is more open to heterogeneity in instructional and assessment modes for secondary students. In addition, in 2020, educational inequities we...
Article
Social desirability bias (SDB) has been a major concern in educational and psychological assessments when measuring latent variables because it has the potential to introduce measurement error and bias in assessments. Person-fit indices can detect bias in the form of misfitted response vectors. The objective of this study was to compare the perform...
Article
The current study examines both student self-regulated learning (SRL) and teacher orchestration in a virtual learning environment (VLE), with respect to student achievement. The study used SRL indicators derived from the log data on how students used the VLE system, survey data on how teachers made use of the VLE for Algebra instruction, as well as...
Article
The field of educational measurement places validity and fairness as central concepts of assessment quality. Prior research has proposed embedding fairness arguments within argument‐based validity processes, particularly when fairness is conceived as comparability in assessment properties across groups. However, we argue that a more flexible approa...
Article
According to the Standards for Educational and Psychological Testing (2014), one aspect of test fairness concerns examinees having comparable opportunities to learn prior to taking tests. Meanwhile, many researchers are developing platforms enhanced by artificial intelligence (AI) that can personalize curriculum to individual student needs. This le...
Article
Special education teachers’ (SETs) working conditions play a crucial role in shaping the size, quality, and effectiveness of the SET workforce, and thereby shape the quality of instruction provided to students with disabilities. Valid measures of SETs’ working conditions are essential for conducting robust research on how to improve working conditi...
Article
The purpose of this research was to develop a short measure of technology anxiety and provide validity and reliability evidence for its use in a variety of studies in the social sciences. Technology anxiety is an emotion oriented towards a negative affect leading to the avoidance of information and communication technology (Wilson, 2018). We develo...
Article
The unstructured multiple-attempt (MA) item response data in virtual learning environments (VLEs) are often from student-selected assessment data sets, which include missing data, single-attempt responses, multiple-attempt responses, and unknown growth ability across attempts, leading to a complicated scenario for using this kind of dat...
Article
Piecewise latent growth models (PWLGMs) can be used to study changes in the growth trajectory of an outcome due to an event or condition, such as exposure to an intervention. When there are multiple outcomes of interest, a researcher may choose to fit a series of PWLGMs or a single parallel-process PWLGM. A comparison of these models is provide...
Article
There is an increasing trend of learning analytics dashboards (LADs) being used to provide feedback to learners. However, there is little empirical evidence about the influence of their design features on learners' cognitive and affective outcomes, especially in high-anxiety courses such as statistics. To address this gap, this study employed a two...
Article
The Q-matrix is commonly used in diagnostic classification models and has recently been incorporated into multidimensional item response theory (MIRT) models to add information about the relationship between items and dimensions of the latent trait. The reformulation of the MIRT models with a Q-matrix (MIRT-Q) has been presented as a way to improve the precisi...
Article
We describe the development and validation of the Social-Emotional Teaching Practices Questionnaire-Chinese (SETP-C), a self-report instrument designed to gather information about Chinese preschool teachers’ implementation of social-emotional practices. Initially (study 1), 262 items for the SETP-C were generated. Content validation of these items...
Conference Paper
Careless responding and keeping students motivated for different tests have been common problems in many areas, especially in education. This study's objective was to demonstrate a novel approach to detect careless responding using person-fit indices developed within the field of psychometrics combined with a random forest. The data used was obtain...
Article
At the heart of many counseling research interests and questions is a desire to understand causal relationships between variables. However, inferring causation from correlational studies ranges from difficult to impossible, and researchers have found that various literature bases contain large proportions of studies that draw unsupported causal inf...
Article
With the growing use of virtual learning environments (VLE), innovative methods to evaluate their performance are increasingly needed. A key difficulty in evaluating VLE using system logs is the large heterogeneity of usage patterns. The current study demonstrates an approach to classify complex patterns of student-level and classroom-level usage w...
Article
In data collected from virtual learning environments (VLEs), item response theory (IRT) models can be used to guide the ongoing measurement of student ability. However, such applications of IRT rely on unbiased item parameter estimates associated with test items in the VLE. Without formal piloting of the items, one can expect a large amount of noni...
Article
Based on the achievement goal theory, this experimental study explored the influence of predictive and descriptive learning analytics dashboards on graduate students' motivation and statistics anxiety in an online graduate-level statistics course. Participants were randomly assigned into one of three groups: (a) predictive dashboard, (b) descriptiv...
Article
Advances in technology to capture and process unprecedented amounts of educational data have boosted the interest in Learning Analytics Dashboard (LAD) applications as a way to provide meaningful visual information to administrators, parents, teachers and learners. Despite the frequent argument that LADs are useful to support target users and th...
Article
In reading intervention research, implementation fidelity is assumed to be positively related to student outcomes, but the methods used to measure fidelity are often treated as an afterthought. Fidelity has been conceptualized and measured in many different ways, suggesting a lack of construct validity. One aspect of construct validity is the fidel...
Article
Research Findings: Preschool social-emotional education has become an increasingly important area of research and practice in mainland China. The social development domain has been recognized as an independent preschool curricular domain since 2001. Little is known, however, about the specific practices that preschool teachers in China are using to...
Conference Paper
In data collected from virtual learning environments (VLEs), item response theory (IRT) models can be used to guide the ongoing measurement of student ability. However, such applications of IRT rely on unbiased item parameter estimates associated with test items in the VLE. Without formal piloting of the items, one can expect a large amount of non-...
Article
Online learning platforms integrating open educational resources (OERs) are increasingly adopted in secondary education as supplemental resources for teaching and learning. However, students report difficulties sustaining their engagement because of the self-paced nature of OER-supported learning environments. We noted that little attention has bee...
Article
The semi-generalized partial credit model (Semi-GPCM) has been proposed as a unidimensional modeling method for handling not applicable scale responses and neutral scale responses, and it has been suggested that the model may be of use in handling missing data in scale items. The purpose of this study is to evaluate the ability of the unidimensiona...
Article
Virtual learning environments (VLEs) are increasingly used at-scale in educational contexts to facilitate teaching and promote learning, and the data they produce can be used for educational research purposes. Meanwhile, the U.S. Department of Education’s Office of Educational Technology has repeatedly emphasized the importance of using evidence to...
Article
Numerous studies have been undertaken to design, develop, and provide validity evidence for using instruments to measure students’ attitudes toward STEM (Science, Technology, Engineering, and Mathematics). This study presents validity evidence of scores produced from the S-STEM measurement tool and used to evaluate changes in attitudes during an ed...
Article
The purpose of this study is to evaluate whether a recently developed semiordered model can be used to explore the functioning of neutral response options in rating scale data. Huggins-Manley, Algina, and Zhou developed a class of unidimensional models for semiordered data within scale items (i.e., items with both ordered response categories and an...
Article
Although the use of technology in the K12 classroom has been shown to have a positive impact, research on the use of open education resources (OER) is relatively limited, especially research focusing on low‐achieving students. The present study examines the relationship between usage of Algebra Nation, a self‐guided system that provided instruction...
Article
Item response theory (IRT) models provide an important contribution in the analysis of polytomous items, such as Likert scale items in survey data. We propose a bifactor generalized partial credit model (bifac-GPC model) with flexible link functions - probit, logit and complementary log-log - for use in analysis of ordered polytomous item scale dat...
Article
Multidimensional item response theory (MIRT) models use data from individual item responses to estimate multiple latent traits of interest, making them useful in educational and psychological measurement, among other areas. When MIRT models are applied in practice, it is not uncommon to see that some items are designed to measure all latent traits...
Article
Routing examinees to modules based on their ability level is a very important aspect in computerized adaptive multistage testing. However, the presence of missing responses may complicate estimation of examinee ability, which may result in misrouting of individuals. Therefore, missing responses should be handled carefully. This study investigated m...
Article
This study aimed to assess the accuracy of the empirical item characteristic curve (EICC) preequating method given the presence of test speededness. The simulation design of this study considered the proportion of speededness, speededness point, speededness rate, proportion of missing on speeded items, sample size, and test length. After crossing a...
Article
Score equity assessment (SEA) refers to an examination of population invariance of equating across two or more subpopulations of test examinees. Previous SEA studies have shown that score equity may be present for examinees scoring at particular test score ranges but absent for examinees scoring at other score ranges. No studies to date have perfor...
Article
The purpose of this study is to develop and evaluate unidimensional models that can handle semiordered data within scale items (i.e., items with multiple ordered response categories, and one additional nominal response category). We apply the models to scale data with not applicable (NA) responses to compare the model performance to conditions in w...
Article
Polytomous Item Response Theory (IRT) models are used by specialists to score assessments and questionnaires that have items with multiple response categories. In this article we study the performance of five model comparison criteria for comparing fit of the graded response and generalized partial credit models using the same data set when the cho...
Article
We conducted a simulation study to explore the precision of test outcomes across computerized adaptive testing (CAT) and computerized adaptive multistage testing (ca-MST) when the number of different content areas was varied across a variety of test lengths. We compared one CAT and two ca-MST designs (1-3 and 1-3-3 panel designs) across several man...
Article
The purpose of this article is to explore validity evidence and appropriate uses of the revised Technology Uses and Perceptions Survey (TUPS) designed to measure in-service teacher perspectives about technology integration in K–12 schools and classrooms. The revised TUPS measures 10 domains, including Access and Support; Preparation of Technology U...
Article
Developing a diagnostic tool within the diagnostic measurement framework is the optimal approach to obtain multidimensional and classification-based feedback on examinees. However, end users may seek to obtain diagnostic feedback from existing item responses to assessments that have been designed under either the classical test theory or item respo...
Article
The purpose of many tests in educational and psychological measurement is to measure test takers’ latent trait scores from responses given to a set of items. Over the years, this has been done by traditional methods (paper and pencil tests). However, compared to other test administration models (e.g., adaptive testing), traditional methods are e...
Article
Given the relationships of item response theory (IRT) models to confirmatory factor analysis (CFA) models, IRT model misspecifications might be detectable through model fit indexes commonly used in categorical CFA. The purpose of this study is to investigate the sensitivity of weighted least squares with adjusted means and variance (WLSMV)-based ro...
Article
The performance of item selection methods in multidimensional computerized adaptive testing has only been studied using an independent cluster multidimensional structure. The goal of this study is to examine the effect of four different item selection methods on test utilization and measurement accuracy under more complex multidimensional data stru...
Article
There is an increasing demand for assessments that can provide more fine-grained information about examinees. In response to the demand, diagnostic measurement provides students with feedback on their strengths and weaknesses on specific skills by classifying them into mastery or nonmastery attribute categories. These attributes often form a hierar...
Article
This study defines subpopulation item parameter drift (SIPD) as a change in item parameters over time that is dependent on subpopulations of examinees, and hypothesizes that the presence of SIPD in anchor items is associated with bias and/or lack of invariance in three psychometric outcomes. Results show that SIPD in anchor items is associated with...
Article
Cognitive diagnosis models (CDMs) estimate student ability profiles using latent attributes. Model fit to the data needs to be ascertained in order to determine whether inferences from CDMs are valid. This study investigated the usefulness of some popular model fit statistics to detect CDM fit including relative fit indices (AIC, BIC, and CAIC), an...
Article
The TPACK (technological pedagogical content knowledge) framework (Mishra & Koehler, 2006) has gained tremendous momentum from within the educational technology community. Specifically, much discourse has focused on how to measure this multidimensional construct to further define the contours of the framework and potentially make some meaningful pr...
Article
The purpose of this article is to demonstrate constraining the nominal response model in Mplus software to calibrate data under the partial credit model (PCM) and generalized partial credit model (GPCM). Currently, many researchers are uncertain if the PCM and GPCM can be estimated within Mplus. Through model constraint commands in Mplus, we demonst...
Chapter
Diagnostic testing has gained attention for its potential to produce fine-grained information about examinees. The dependency among attributes (i.e., attribute structure) is one of the most important factors affecting diagnostic test design. This article introduces four types of attribute structures and examines the effects of the attribute number,...
Article
This study compares two methods of defining groups for the detection of differential item functioning (DIF): (a) pairwise comparisons and (b) composite group comparisons. We aim to emphasize and empirically support the notion that the choice of pairwise versus composite group definitions in DIF is a reflection of how one defines fairness in DIF stu...
Article
Using data collected from two multiyear teacher professional development projects employing randomized control trials, this study describes the development and validation of a paper-based test of elementary teachers' science content knowledge (SCK). Evidence of construct validity is presented, including evidence on internal structural features usin...
Article
SEAsic (score equity assessment-summary index computation) is an R package for computing and graphing a variety of indices that quantify an important aspect of test fairness, that of reported score equity. Historically, test fairness has been statistically defined as a lack of differential prediction and/or a presence of measurement invariance at...
Article
Invariant relationships in the internal mechanisms of estimating achievement scores on educational tests serve as the basis for concluding that a particular test is fair with respect to statistical bias concerns. Equating invariance and differential item functioning are both concerned with invariant relationships yet are treated separately in the p...
Article
The Remote Associates Test (RAT) is often assumed to be a measure of creativity; however, the RAT has been broadly applied in psychological studies. Originally developed to assess individual differences in associative processing, the RAT has been used to study various constructs, such as creativity, problem solving, insight, and memory. Aside from...
Article
This study explored subprocesses of reading for 157 fifth grade Spanish-speaking English language learners (ELLs) by examining whether morphological awareness made a unique contribution to reading comprehension beyond a strong covariate, phonological decoding. The role of word reading and reading vocabulary as mediators of this relationship was also...
Article
The purpose of this study is to utilize Score Equity Assessment (SEA) to examine measurement comparability and equity in reported scores on a statewide fifth-grade science assessment with respect to groups of students defined by disability status, English Language Learner status and use of test accommodations. Benefits of SEA include a focus on equ...
Article
It is possible that functions used to link tests are sensitive to subpopulations of test takers. The REMSD and RMSD(x) are weighted effect sizes of linking invariance, yet it is often unclear how the weights are most appropriately applied when subpopulation group sizes are heterogeneous. The objective of this research is to apply two different weig...
Article
A goal for any linking or equating of two or more tests is that the linking function be invariant to the population used in conducting the linking or equating. Violations of population invariance in linking and equating jeopardize the fairness and validity of test scores, and pose particular problems for test‐based accountability programs that requ...
Article
As part of a 5-year professional development intervention aimed at improving science and literacy achievement of English language learning (ELL) students in urban elementary schools, this study examined fourth-grade students' science achievement across a 3-year (2005–2008) implementation of our professional development intervention consisting of cu...
Article
This study describes the development and validation of the Extract the Base test (ETB), which assesses derivational morphological awareness. Scores on this test were validated for 580 monolingual students and 373 Spanish-speaking English language learners (ELLs) in third through fifth grade. As part of the validation of the internal structure, whic...
