Randall D. Penfield's research while affiliated with University of North Carolina at Greensboro and other places

Publications (68)

Article
Score equity assessment (SEA) refers to an examination of population invariance of equating across two or more subpopulations of test examinees. Previous SEA studies have shown that score equity may be present for examinees scoring at particular test score ranges but absent for examinees scoring at other score ranges. No studies to date have perfor...
Article
Relationship quality is the most frequently assessed construct in the intimate relationships literature. Dozens of assessment instruments exist, but the vast majority conceptualize relationship quality in terms of satisfaction (or a similar construct), which focuses on the hedonic (pleasure or happiness) dimension of the relationship. Some scholars...
Article
Drawing valid inferences from item response theory (IRT) models is contingent upon a good fit of the data to the model. Violations of model-data fit have numerous consequences, limiting the usefulness and applicability of the model. This instructional module provides an overview of methods used for evaluating the fit of IRT models. Upon completing...
Article
A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytom...
Article
Increased sebum production is a common skin complaint and plays an important role in acne and oily scalp conditions. To choose the correct skin care products, which are mostly marketed for dry, oily or normal skin, consumers must self-assess their skin type. Studies show that individuals incorrectly self-assess their sebum secretion levels. In o...
Article
The Rasch model, a member of a larger group of models within item response theory, is widely used in empirical studies. Detection of uniform differential item functioning (DIF) within the Rasch model typically employs null hypothesis testing with a concomitant consideration of effect size (e.g., signed area [SA]). Parametric equivalence between con...
Article
For nearly four decades, economic analysis has dominated academic discussion of tort law. Courts also have paid increasing attention to the potential deterrent effects of their tort decisions. But at the center of each economic model and projection of cost and benefit lies a widely accepted but grossly undertested assumption that tort liability in...
Article
Measurement invariance is a common consideration in the evaluation of the validity and fairness of test scores when the tested population contains distinct groups of examinees, such as examinees receiving different forms of a translated test. Measurement invariance in polytomous items has traditionally been evaluated at the item-level, correspondin...
Article
The study of measurement invariance in polytomous items that targets individual score levels is known as differential step functioning (DSF). The analysis of DSF requires the creation of a set of dichotomizations of the item response variable. There are two primary approaches for creating the set of dichotomizations to conduct a DSF analysis: the a...
Article
A goal for any linking or equating of two or more tests is that the linking function be invariant to the population used in conducting the linking or equating. Violations of population invariance in linking and equating jeopardize the fairness and validity of test scores, and pose particular problems for test‐based accountability programs that requ...
Article
Background/Context: While different instructional approaches have been proposed to integrate academic content and English proficiency for English language learning (ELL) students, studies examining the magnitude of the relationship are non-existent. This study examined the relationship between the "form" (i.e., conventions, organization, and style/...
Article
This article explores how the magnitude and form of differential item functioning (DIF) effects in multiple-choice items are determined by the underlying differential distractor functioning (DDF) effects, as modeled under the nominal response model. The results of a numerical investigation indicated that (a) the presence of one or more nonzero DDF...
Article
The increase in the squared multiple correlation coefficient (ΔR²) associated with a variable in a regression equation is a commonly used measure of importance in regression analysis. Algina, Keselman, and Penfield found that intervals based on asymptotic principles were typically very inaccurate, even though the sample size was quite large...
Article
This study examined both student and school predictors of science achievement as measured by a high-stakes state test. The study involved 23,854 fifth-grade students from 198 elementary schools in a large urban school district with a diverse student population. Multilevel modeling was conducted to examine both student and school predictors simultan...
Article
Crossing, or intersecting, differential item functioning (DIF) is a form of nonuniform DIF that exists when the sign of the between-group difference in expected item performance changes across the latent trait continuum. The presence of crossing DIF presents a problem for many statistics developed for evaluating DIF because positive and negative co...
Article
This study evaluated the Battelle Developmental Inventory, 2nd Edition, Screening Test (BDI-2 ST) for use in states’ child outcomes accountability systems under the Individuals with Disabilities Education Act. Complete Battelle Developmental Inventory, 2nd Edition (BDI-2), assessment data were obtained for 142 children, ages 2 to 62 months, who had...
Article
In this article, I address two competing conceptions of differential item functioning (DIF) in polytomously scored items. The first conception, referred to as net DIF, concerns between-group differences in the conditional expected value of the polytomous response variable. The second conception, referred to as global DIF, concerns the conditional d...
Article
This study applied the maximum expected information (MEI) and the maximum posterior-weighted information (MPI) approaches of computer adaptive testing item selection to the case of a test using polytomous items following the partial credit model. The MEI and MPI approaches are described. A simulation study compared the efficiency of ability estimat...
Article
In 2008, Penfield showed that measurement invariance across all response options of a multiple-choice item (correct option and the J distractors) can be modeled using a nominal response model that included a differential distractor functioning (DDF) effect for each of the J distractors. This article extends this concept to consider how the differen...
Article
A growing body of research showing that grade retention serves as an educationally low-quality placement has raised increasing concerns about whether the use of standardized tests in making decisions concerning grade retention conforms to current standards for appropriate and nondiscriminatory test use. This article examines the extent to which tes...
Article
The purpose of this article is to discuss curriculum-based measurement (CBM) as it is currently utilized in research and practice and to propose a new approach for developing measures to monitor the academic progress of students longitudinally. To accomplish this, we first describe CBM and provide several exemplars of CBM in reading and mathematics...
Article
Recent test-based accountability policy in the U.S. has involved annually assessing all students in core subjects and holding schools accountable for adequate progress of all students by implementing sanctions when adequate progress is not met. Despite its potential benefits, basing educational policy on assessments developed for a student populati...
Article
As part of our professional development intervention, this study examined third-grade ELL students' writing achievement that included “form” (i.e., conventions, organization, and style/voice) and “content” (i.e., specific knowledge and understanding of science) in expository science writing. The study included six treatment schools from a large urb...
Article
In this study, we investigate the logistic regression (LR), Mantel-Haenszel (MH), and Breslow-Day (BD) procedures for the simultaneous detection of both uniform and nonuniform differential item functioning (DIF). A simulation study was used to assess and compare the Type I error rate and power of a combined decision rule (CDR), which assesses DIF u...
Article
This study examined the effect of fidelity of implementation (FOI) on the science achievement gains of third grade students broadly and students with limited literacy in English specifically. The study was conducted in the context of a professional development intervention with elementary school teachers to promote science achievement of ELL studen...
Article
This study explored beginning special education teacher quality and the role that knowledge and skill for teaching reading plays in defining quality. The authors examined the relationship between beginning teachers' knowledge for teaching reading and their classroom practices during reading instruction and, further, relationships between classroom...
Article
This descriptive study examined urban elementary school teachers’ perceptions of their science content knowledge, science teaching practices, and support for language development of English language learners. Also examined were teachers’ perceptions of organizational supports and barriers associated with teaching science to nonmainstream students....
Article
Traditional methods for examining differential item functioning (DIF) in polytomously scored test items yield a single item-level index of DIF and thus provide no information concerning which score levels are implicated in the DIF effect. To address this limitation of DIF methodology, the framework of differential step functioning (DSF) has recentl...
Article
The assessment of differential item functioning (DIF) in polytomous items addresses between-group differences in measurement properties at the item level, but typically does not inform which score levels may be involved in the DIF effect. The framework of differential step functioning (DSF) addresses this issue by examining between-group difference...
Article
This study examined predictors of the following three science teaching practices with English language learning (ELL) students: (a) reform-oriented practices to promote understanding and inquiry, (b) traditional/conventional practices, and (c) English language development practices. Data were collected from 140 third- through fifth-grade teachers....
Article
The examination of measurement invariance in polytomous items is complicated by the possibility that the magnitude and sign of lack of invariance may vary across the steps underlying the set of polytomous response options, a concept referred to as differential step functioning (DSF). This article describes three classes of nonparametric DSF effect...
Article
Investigations of differential distractor functioning (DDF) can provide valuable information concerning the location and possible causes of measurement invariance within a multiple-choice item. In this article, I propose an odds ratio estimator of the DDF effect as modeled under the nominal response model. In addition, I propose a simultaneous dist...
Article
The squared multiple semipartial correlation coefficient is the increase in the squared multiple correlation coefficient that occurs when two or more predictors are added to a multiple regression model. Coverage probability was investigated for two variations of each of three methods for setting confidence intervals for the population squared multi...
Article
Measurement invariance in the partial credit model (PCM) can be conceptualized in several different but compatible ways. In this article the authors distinguish between three forms of measurement invariance in the PCM: step invariance, item invariance, and threshold invariance. Approaches for modeling these three forms of invariance are proposed, a...
Article
This study is part of a 5-year professional development intervention aimed at improving science and literacy achievement of English language learners (or ELL students) in urban elementary schools within an environment increasingly driven by high-stakes testing and accountability. Specifically, the study examined science achievement at the end of th...
Article
One aspect of construct validity is the extent to which the measurement properties of a rating scale are invariant across the groups being compared. An increasingly used method for assessing between-group differences in the measurement properties of items of a scale is the framework of differential item functioning (DIF). In this paper we introduce...
Article
Many statistics used in the assessment of differential item functioning (DIF) in polytomous items yield a single item-level index of measurement invariance that collapses information across all response options of the polytomous item. Utilizing a single item-level index of DIF can, however, be misleading if the magnitude or direction of the DIF cha...
Article
Purpose: While facing a shortage of faculty members, dental schools need to be innovative in their educational methodologies. One approach to augment student learning would be to mentor dental students as participating faculty in current courses. A study was undertaken to evaluate dental students as instructors in preclinical prosthodontics and occ...
Article
A widely used approach for categorizing the level of differential item functioning (DIF) in dichotomous items is the scheme proposed by Educational Testing Service (ETS) based on a transformation of the Mantel-Haenszel common odds ratio. In this article two classification schemes for DIF in polytomous items (referred to as the P1 and P2 schemes) are...
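As a rough illustration of the transformation mentioned in this abstract, the sketch below computes the Mantel-Haenszel common odds ratio from a set of 2 × 2 tables (one per matched score level) and converts it to the ETS delta metric, -2.35 ln(α_MH). The function names and the table layout are assumptions made for this example. The full ETS A/B/C categorization also relies on significance tests, and the P1 and P2 schemes for polytomous items are not shown.

```python
import math
from typing import Iterable, Tuple

# One 2x2 table per matched score level (hypothetical layout for this sketch):
# (reference correct, reference incorrect, focal correct, focal incorrect)
Table = Tuple[float, float, float, float]


def mh_common_odds_ratio(strata: Iterable[Table]) -> float:
    """Mantel-Haenszel common odds ratio pooled across score-level strata."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("common odds ratio is undefined for these tables")
    return numerator / denominator


def ets_delta(alpha_mh: float) -> float:
    """Transform the common odds ratio to the ETS delta scale."""
    return -2.35 * math.log(alpha_mh)


if __name__ == "__main__":
    tables = [(20, 10, 10, 20), (30, 10, 15, 20)]
    alpha = mh_common_odds_ratio(tables)
    print(alpha, ets_delta(alpha))
```

With this sign convention, a negative delta indicates that the item favors the reference group.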
Article
The standard error of the maximum likelihood ability estimator is commonly estimated by evaluating the test information function at an examinee's current maximum likelihood estimate (a point estimate) of ability. Because the test information function evaluated at the point estimate may differ from the test information function evaluated at an exami...
Article
The increase in the squared multiple correlation coefficient (ΔR²) associated with a variable in a regression equation is a commonly used measure of importance in regression analysis. The coverage probability that an asymptotic and percentile bootstrap confidence interval includes Δρ² was investigated. As expected, coverage probabili...
Chapter
This chapter presents a description of many of the commonly employed methods in the detection of item bias. Because much of the statistical detection of item bias makes use of differential item functioning (DIF) procedures, the majority of this chapter focuses on the description of statistical methods for the analysis of DIF. DIF detection procedur...
Article
One approach to measuring unsigned differential test functioning is to estimate the variance of the differential item functioning (DIF) effect across the items of the test. This article proposes two estimators of the DIF effect variance for tests containing dichotomous and polytomous items. The proposed estimators are direct extensions of the nonit...
Article
Kelley compared three methods for setting a confidence interval (CI) around Cohen's standardized mean difference statistic: the noncentral-t-based, percentile (PERC) bootstrap, and bias-corrected and accelerated (BCA) bootstrap methods under three conditions of nonnormality, eight cases of sample size, and six cases of population effect size (ES)...
Article
This study (a) provided a conceptual introduction to differential item functioning (DIF), (b) introduced the multifaceted Rasch rating scale model (MRSM) and an associated statistical procedure for identifying DIF in rating scale items, and (c) applied this procedure to previously collected data from American coaches who responded to the coaching e...
Article
Liu and Agresti (1996) proposed a Mantel and Haenszel-type (1959) estimator of a common odds ratio for several 2 × J tables, where the J columns are ordinal levels of a response variable. This article applies the Liu-Agresti estimator to the case of assessing differential item functioning (DIF) in items having an ordinal response variable. A simula...
Article
Confidence intervals must be robust in having nominal and actual probability coverage in close agreement. This article examined two ways of computing an effect size in a two-group problem: (a) the classic approach which divides the mean difference by a single standard deviation and (b) a variant of a method which replaces least squares values with...
Article
Classroom teachers need effective, efficient strategies to prevent and/or ameliorate destructive student behaviors and increase socially appropriate ones. During the past two decades, researchers have found that cognitive strategies can decrease student disruption/aggression and strengthen pro-social behavior. Following preliminary pilot work, we c...
Article
How can we best extend DIF research to performance assessment? What are the issues and problems surrounding studies of DIF on complex tasks? What appear to be the best approaches at this time?
Article
The factorial and construct validity of the Exercise Imagery Inventory (EII) were assessed with 3 separate samples of participants. In Phase 1, a 41-item measure was administered to 504 undergraduate students. Exploratory factor analysis supported a 4-factor model that explained 65% of the variance. In Phase 2, a 19-item measure was administered to...
Article
The authors argue that a robust version of Cohen's effect size constructed by replacing population means with 20% trimmed means and the population standard deviation with the square root of a 20% Winsorized variance is a better measure of population separation than is Cohen's effect size. The authors investigated coverage probability for confidence...
Article
Expert review sessions are often conducted to determine the content validity of scale items. The accurate quantification of content validity is usually limited by a relatively small number of experts as well as by a small number of rating categories. These factors, combined with the bounded and discrete nature of rating scale categories, hinder use...
Article
This article applies a weighted maximum likelihood (WML) latent trait estimator to the generalized partial credit model (GPCM). The relevant equations required to obtain the WML estimator using the Newton-Raphson algorithm are presented, and a simulation study is described that compared the properties of the WML estimator to those of the maximum li...
Article
Probability coverage for eight different confidence intervals (CIs) of measures of effect size (ES) in a two-level repeated measures design was investigated. The CIs and measures of ES differed with regard to whether they used least squares or robust estimates of central tendency and variability, whether the end critical points of the interval were...
Article
The Rasch family of models displays several well-documented properties that distinguish them from the general item response theory (IRT) family of measurement models. This paper describes an additional unique property of Rasch models, referred to as the property of item information constancy. This property asserts that the area under the informatio...
Article
Item content-relevance is an important consideration for researchers when developing scales used to measure psychological constructs. Aiken (1980) proposed a statistic, V, that can be used to summarize item content-relevance ratings obtained from a panel of expert judges. This article proposes the application of the Score confidence interval to Aik...
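For readers unfamiliar with the statistic named in this abstract, the following minimal sketch computes Aiken's V for one item as V = S / [n(c - 1)], where S is the sum of each judge's rating minus the lowest scale value, n is the number of judges, and c is the number of rating categories. The function name and argument layout are assumptions for this example, and the Score confidence interval proposed in the article is not reproduced here.

```python
from typing import Sequence


def aiken_v(ratings: Sequence[int], lowest: int, highest: int) -> float:
    """Aiken's V for one item, given expert ratings on a lowest..highest scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if highest <= lowest:
        raise ValueError("the scale needs at least two categories")
    if any(r < lowest or r > highest for r in ratings):
        raise ValueError("rating outside the scale range")
    categories = highest - lowest + 1
    # S: total distance of the ratings above the lowest possible rating
    s = sum(r - lowest for r in ratings)
    return s / (len(ratings) * (categories - 1))


if __name__ == "__main__":
    # Four hypothetical judges rating one item on a 1-5 relevance scale
    print(aiken_v([4, 5, 3, 4], lowest=1, highest=5))
```

V ranges from 0 (every judge chose the lowest category) to 1 (every judge chose the highest).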
Article
Content validity is often assessed using the mean rating of endorsement provided by content experts for each item of the test. The lack of normality of the sample mean rating poses a major obstacle to the estimation of how far the sample mean is expected to lie from the population mean that it estimates. To overcome this obstacle, this article appl...
Article
The partial credit model (PCM) is commonly employed to parameterize items and individuals using responses to a set of polytomous items. Because the PCM does not include a discrimination parameter, it may encounter substantial lack of fit to the data in certain situations. To determine the impact of model misfit on the estimation of person and item...
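To make the role of the discrimination parameter concrete, here is a small sketch of category response probabilities under the generalized partial credit model, which reduces to the PCM when the discrimination a is fixed at 1. The function name, argument names, and example step values are assumptions for this illustration and are not taken from the article.

```python
import math
from typing import List, Sequence


def gpcm_probs(theta: float, steps: Sequence[float], a: float = 1.0) -> List[float]:
    """Category probabilities for scores 0..m given m step parameters.

    With a == 1.0 this is the partial credit model; other values of a give
    the generalized partial credit model.
    """
    # Cumulative sums of a * (theta - step_j); score 0 has an empty sum of 0.
    logits = [0.0]
    for step in steps:
        logits.append(logits[-1] + a * (theta - step))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical three-category item with steps at -1 and 1
    print(gpcm_probs(0.0, [-1.0, 1.0]))         # PCM
    print(gpcm_probs(0.0, [-1.0, 1.0], a=2.0))  # GPCM with a steeper item
```

Comparing the two printed vectors shows how a larger discrimination concentrates probability in fewer categories, which is the kind of pattern the PCM cannot represent.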
Article
This article presents a generalization of the Score method of constructing confidence intervals for the population proportion (E. B. Wilson, 1927) to the case of the population mean of a rating scale item. A simulation study was conducted to assess the properties of the Score confidence interval in relation to the traditional Wald (A. Wald, 1943) c...
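As background for this abstract, the sketch below implements the classic Wilson (1927) Score interval for a population proportion, which is the starting point the article generalizes. The article's extension to the mean of a rating scale item is not reproduced here, and the function name and defaults are assumptions for this example.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson Score confidence interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided standard normal critical value
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    print(wilson_interval(50, 100))
    print(wilson_interval(0, 20))
```

Unlike the Wald interval, the Score interval stays inside [0, 1] and keeps a nonzero width when the observed proportion is 0 or 1.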
Article
It is often the case in performing a differential item functioning (DIF) analysis that comparisons are made between a single reference group and multiple focal groups. Conducting a separate test of DIF for each focal group has several undesirable qualities: (a) the Type I error rate will exceed the intended nominal level if the level of significanc...

Citations

... Moreover, in developing and implementing automated scoring systems, human ratings are considered the "gold standard" for determining the accuracy of the scores these systems produce (Powers et al., 2015; Williamson et al., 2012; Wolfe, 2020). Therefore, it is essential to ensure that the assessments conform to high psychometric quality standards, particularly regarding their reliability, validity, and fairness (Lane & DePascale, 2016; Penfield, 2016). ...
... Zhao and Hambleton (2017) considered the model misfit a source of (Cook and Petersen, 1987) systematic error in the equating process. Similarly, differential item functioning (DIF) can be considered as a source of systematic errors (Huggins-Manley et al., 2018; Penfield & Camilli, 2006). DIF is a group-based comparison that tests the effect of the group composition at the item level. ...
... The Mantel, SMD, and CCLOR tests have been applied in several studies across various tests to measure different types of DIF. For instance, CCLOR was applied by Penfield, Giacobbi, and Myers (2007) to detect gender DIF in the Exercise Imagery Inventory using total score of the relevant subscales as matching variable. Two out of the 19 items in the scale, one with moderate DIF and another with large DIF, were flagged as functioning differently between genders. ...
... Findings from HRS (Health and Retirement Study) also showed lowest risk of all-cause mortality among those with highest levels of purpose in life as well as reduced risk of mortality from heart, circulatory, and blood conditions (Kim et al., 2013a, b). A meta-analysis of ten prospective studies reported significant associations between purpose in life and reduced all-cause mortality and reduced cardiovascular events (Cohen et al., 2016). Relevant for understanding these profiles of morbidity and mortality is evidence showing that those with higher eudaimonic wellbeing are more likely to use preventive healthcare services and practice better health behaviors (diet, exercise) (Chen et al., 2019; Hill & Weston, 2019; Hooker & Masters, 2016; Kim et al., 2014, 2017; Steptoe & Fancourt, 2019). ...
... Since then, this family of distributions has received considerable attention both in theoretical and empirical applications, particularly in Monte Carlo computer studies. These distributions have been used in research investigating the properties of mean differences (e.g., Branco, Oliveira, & Oliveira, 2010; Wilcox, Erceg-Hurn, Clark, & Carlson, 2014), linear regression (e.g., Luh & Guo, 2002), the correlation coefficient (e.g., Wilcox, 2001), analysis of variance (e.g., Fernandez, Vallejo, & Livacic-Rojas, 2010), multivariate analysis of variance (e.g., Tabesh, Ayatollahi, & Towhidi, 2010), normality tests (e.g., Brys, Hubert, & Struyf, 2004; Tabesh, Heidari, & Saki, 2014), multiple imputation (e.g., He & Raghunathan, 2006) and many more (e.g., Algina, Keselman, & Penfield, 2006a, 2006b; Keselman, Kowalchuk, & Lix, 1998a; Keselman, Lix, & Kowalchuk, 1998b; Keselman, Wilcox, Kowalchuk, & Olejnik, 2002; Keselman, Wilcox, Taylor, & Kowalchuk, 2000; Kowalchuk, Keselman, Wilcox, & Algina, 2006; Wilcox, Keselman, & Kowalchuk, 1998). ...
... Confidence interval. With the sample size N and the quantile of the normal distribution $q_{\alpha/2}$, the confidence interval of $SPC^2$ is as follows (Algina, 2008): ...
... For instance, "using a variety of instructional methods" in the research literature on working with multilingual students is illustrated to include sociocultural instructional practices generally (Shaw, Lyon, Stoddart, Mosqueda, & Menon, 2014;Swanson, Bianchini, & Lee, 2014;Teemant & Hausman, 2013) as well as specific aspects of a sociocultural pedagogy like collaboration and dialogue (Brooks & Thurston, 2010;Cole, 2013;Garrett & Hong, 2016;Moore & Schleppegrell, 2014;Turner, Dominguez, Empson, & Maldonado, 2013). The research on quality teaching for multilingual learners also clearly demonstrates the value of inquiry pedagogies (Jackson & Ash, 2012;Johnson, Bolshakova, & Waldron, 2016;Manzo, Cruz, Faltis, & de la Torre, 2011;Santau, Maerten-Rivera, & Huggins, 2011), culturally sustaining pedagogies (Carbone & Orellana, 2010;Huerta, 2011;Macleroy, 2013;Pawan, 2008) and attending to both content and language development (Beal, Adams, & Cohen, 2010;Brown, Ryoo, & Rodriguez, 2010;Bunch, 2013;Carrejo & Reinhartz, 2012;Echevarria, Richards-Tutor, Canges, & Francis, 2011;Echevarria, Richards-Tutor, Chinn, & Ratleff, 2011;Jackson & Ash, 2012;Lara-Alecio et al., 2012;Lee, Penfield, & Buxton, 2011). The research cited above addresses many of the other aspects of teaching as well. ...
... This system creates 16 possible skin phenotypes (Table 1) [21]. It applies to all ethnicities, ages, and genders, and is defined by a validated questionnaire (64-item questionnaire) called The Baumann Skin Type Indicator (BSTI) [22][23][24]. The Baumann Skin Type is calculated by a software that assigns a letter to which parameter and if applicable, a subtype of sensitive skin. ...
... When presenting adequate fit to all the responses, the IRT model assumes that a person's response to an item can be accurately depicted by the person's latent ability (Ames & Penfield, 2015). Therefore, we examined the absolute model fit that measures the degree to which a hypothesized model is apart from a perfect model for each phase by using M2 statistics Tucker & Lewis, 1973) and comparative fit index (CFI; Bentler, 1990). ...
... Content area teachers rarely receive PD focused on improving reading comprehension (Shanahan & Shanahan, 2008;Swanson et al., 2016), but when they do, it is infrequently research based, ongoing, contextualized, or deep (Brownell et al., 2009;Garet et al., 2001;Grant et al., 1996). Thus, PD programs fall short and do not always produce reading gains for students, especially those with reading difficulties (Garet et al., 2008). ...