John Mazzeo’s research while affiliated with Educational Testing Service and other places


Publications (10)


Variance Estimation for Random-Groups Linking in Large-Scale Survey Assessments
  • Chapter

February 2023 · 2 Reads · 1 Citation

Bingchen Liu · [...] · John Mazzeo

The random-groups design is frequently used in equating and linking scores from two tests; the linking functions are derived from the test scores of two samples of the test-taker population. In this paper, we consider estimating variances of test score population statistics for large-scale survey assessments (LSAs), where the random-groups design is used in linking latent variable test scores. Examples of LSAs include the National Assessment of Educational Progress (NAEP), the Trends in International Mathematics and Science Study (TIMSS), and the Programme for International Student Assessment (PISA). In estimating variances of population statistics in LSAs, the common practice takes into account the uncertainties due to sampling and latency. We propose a variance estimation method that extends the existing procedure to also take the random-groups linking into account. We illustrate the method using a NAEP dataset for which a linear linking function is used to link test scores from a computer-based test to those from a paper-and-pencil test. The proposed method extends readily to other assessment contexts in which random-groups equating and linking are applied, with either parametric or non-parametric linking functions.

Keywords: Variance estimation · Statistics · Sampling · Jackknife · Assessment · NAEP · Random-groups linking design · Score linking · Education survey · Assessment · IRT · Weight · Imputation · Resampling · Plausible values · Sampling variance · Latency variance · Linking function · Latent variable · Complex sampling
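The variance decomposition described in the abstract can be sketched in a few lines. The following is a minimal illustration, assuming a jackknife with replicate estimates, a set of plausible values, and some way of re-computing the statistic under re-estimated linking functions; the function name, inputs, and the way the linking component is added are assumptions of this sketch, not the procedure from the chapter.

    import numpy as np

    def total_variance(pv_estimates, replicate_estimates, linking_replicates=None):
        """Combine common LSA variance components with an extra linking term.

        pv_estimates        : (M,) statistic computed from each plausible value
        replicate_estimates : (R,) statistic recomputed under each jackknife
                              replicate weight (sampling uncertainty)
        linking_replicates  : optional (K,) statistic recomputed with the
                              linking function re-estimated on resampled
                              linking samples (hypothetical input)
        """
        pv_estimates = np.asarray(pv_estimates, dtype=float)
        replicate_estimates = np.asarray(replicate_estimates, dtype=float)
        m = len(pv_estimates)
        point = pv_estimates.mean()

        # Sampling variance: squared deviations of the jackknife replicate
        # estimates from the full-sample estimate.
        v_sampling = np.sum((replicate_estimates - point) ** 2)

        # Latency (measurement) variance: between-plausible-value component.
        v_latency = (1.0 + 1.0 / m) * np.var(pv_estimates, ddof=1)

        v_total = v_sampling + v_latency
        if linking_replicates is not None:
            # Additional variability attributable to random-groups linking.
            v_total += np.var(np.asarray(linking_replicates, dtype=float), ddof=1)
        return point, v_total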


Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure

September 2005 · 211 Reads · 172 Citations

Journal of Educational Measurement

Shealy and Stout (1993) proposed a DIF detection procedure called SIBTEST and demonstrated its utility with both simulated and real data sets. Current versions of SIBTEST can be used only for dichotomous items. In this article, an extension to handle polytomous items is developed. Two simulation studies are presented which compare the modified SIBTEST procedure with the Mantel and standardized mean difference (SMD) procedures. The first study compares the procedures under conditions in which the Mantel and SMD procedures have been shown to perform well (Zwick, Donoghue, & Grima, 1993). Results of Study 1 suggest that SIBTEST performed reasonably well, but that the Mantel and SMD procedures performed slightly better. The second study uses data simulated under conditions in which observed-score DIF methods for dichotomous items have not performed well. The results of Study 2 indicate that under these conditions the modified SIBTEST procedure provides better control of impact-induced Type I error inflation than the other procedures.
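Since the Mantel procedure is one of the comparison methods here and in the two 1997 papers below, a compact sketch of a Mantel-type chi-square for a polytomously scored item may help orient readers. The function name and input layout are assumptions, and the modified SIBTEST procedure itself (with its regression correction) is not reproduced here.

    import numpy as np

    def mantel_chi_square(item_scores, match_scores, is_focal):
        """Mantel-type chi-square for DIF in a polytomously scored item,
        stratifying on an observed matching-test score (illustrative sketch;
        assumes at least one stratum contains both groups)."""
        item_scores = np.asarray(item_scores, dtype=float)
        match_scores = np.asarray(match_scores)
        is_focal = np.asarray(is_focal, dtype=bool)

        f_sum = e_sum = v_sum = 0.0
        for k in np.unique(match_scores):
            in_k = match_scores == k
            y = item_scores[in_k]
            focal = is_focal[in_k]
            n = len(y)
            n_f = int(focal.sum())
            n_r = n - n_f
            if n_f == 0 or n_r == 0 or n < 2:
                continue  # stratum is uninformative without both groups
            f_sum += y[focal].sum()            # observed focal-group score sum
            e_sum += n_f * y.sum() / n         # its expectation under no DIF
            # Hypergeometric-type variance of the focal-group score sum.
            v_sum += (n_f * n_r) / (n**2 * (n - 1)) * (n * (y**2).sum() - y.sum()**2)

        return (f_sum - e_sum) ** 2 / v_sum    # refer to chi-square with 1 df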


Descriptive and Inferential Procedures for Assessing Differential Item Functioning in Polytomous Items

October 1997 · 32 Reads · 78 Citations

Applied Measurement in Education

Differential item functioning (DIF) assessment procedures for items with more than 2 ordered score categories were evaluated. Three descriptive statistics—the standardized mean difference (SMD; Dorans & Schmitt, 1991) and 2 procedures based on SIBTEST (Shealy & Stout, 1993)—were considered, along with 5 inferential procedures: 2 based on SMD, 2 based on SIBTEST, and the Mantel (1963) method. A simulation showed that, when the 2 examinee groups had the same distribution, the descriptive index that performed best was the SMD. When the group means differed by 1 SD, a modified form of the SIBTEST DIF effect size measure tended to perform best. The 5 inferential procedures performed almost indistinguishably when the 2 groups had identical distributions. When the groups had different distributions and the studied item was highly discriminating, the SIBTEST procedures showed much better Type I error control than did the SMD and Mantel methods, particularly in short tests. The power ranking of the 5 procedures was inconsistent; it depended on the direction of DIF and other factors. Routine application of these polytomous DIF methods seems feasible when a reliable test is available for matching examinees. The Type I error rates of the Mantel and SMD methods may be a concern under certain conditions. The current version of SIBTEST cannot easily accommodate matching tests that do not use number-right scoring. Additional research in these areas would be useful.
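The descriptive SMD index mentioned above is a focal-minus-reference difference in mean item scores, weighted by the focal group's distribution over matching-score strata. A compact sketch of that idea follows; the function name and input layout are assumptions, and the inferential SMD procedures evaluated in the article add a standard error that is not shown here.

    import numpy as np

    def standardized_mean_difference(item_scores, match_scores, is_focal):
        """Focal-group-weighted standardized mean difference (SMD) for a
        polytomous item, conditioning on a matching-test score (sketch)."""
        item_scores = np.asarray(item_scores, dtype=float)
        match_scores = np.asarray(match_scores)
        is_focal = np.asarray(is_focal, dtype=bool)

        smd = 0.0
        n_focal = int(is_focal.sum())
        for k in np.unique(match_scores):
            in_k = match_scores == k
            f_k = in_k & is_focal
            r_k = in_k & ~is_focal
            if f_k.sum() == 0 or r_k.sum() == 0:
                continue  # a stratum needs both groups to contribute
            w_fk = f_k.sum() / n_focal  # focal-group weight at stratum k
            smd += w_fk * (item_scores[f_k].mean() - item_scores[r_k].mean())
        return smd  # negative: focal group scores below matched reference group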


Describing and Categorizing DIF in Polytomous Items

June 1997 · 11 Reads · 20 Citations

ETS Research Report Series

The purpose of this project was to evaluate statistical procedures for assessing differential item functioning (DIF) in polytomous items (items with more than two score categories). Three descriptive statistics—the Standardized Mean Difference, or SMD (Dorans & Schmitt, 1991), and two procedures based on SIBTEST (Shealy & Stout, 1993)—were considered, along with five inferential procedures: two based on SMD, two based on SIBTEST, and the Mantel (1963) method. The DIF procedures were evaluated through applications to simulated data, as well as data from ETS tests. The simulation included conditions in which the two groups of examinees had the same ability distribution and conditions in which the group means differed by one standard deviation. When the two groups had the same distribution, the descriptive index that performed best was the SMD. When the two groups had different distributions, a modified form of the SIBTEST DIF effect size measure tended to perform best. The five inferential procedures performed almost indistinguishably when the two groups had identical distributions. When the two groups had different distributions and the studied item was highly discriminating, the SIBTEST procedures showed much better Type I error control than did the SMD and Mantel methods, particularly in short tests. The power ranking of the five procedures was inconsistent; it depended on the direction of DIF and other factors. Routine application of these polytomous DIF methods at ETS seems feasible in cases where a reliable test is available for matching examinees. For the Mantel and SMD methods, Type I error control may be a concern under certain conditions. In the case of SIBTEST, the current version cannot easily accommodate matching tests that do not use number‐right scoring. Additional research in these areas is likely to be useful.
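Simulation studies of this kind require generating polytomous item responses for groups whose ability distributions may differ. A minimal sketch of such data generation under a generalized partial credit model is shown below; the parameter values and group sizes are hypothetical, and the matching-test scores and DIF manipulations a full study would need are omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    def gpcm_probs(theta, a, b):
        """Category probabilities under a generalized partial credit model.
        theta: ability; a: discrimination; b: array of step parameters."""
        steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
        ex = np.exp(steps - steps.max())  # subtract max for numerical stability
        return ex / ex.sum()

    def simulate_group(n, mean, sd, a, b):
        """Draw abilities and polytomous item scores for one examinee group."""
        thetas = rng.normal(mean, sd, size=n)
        scores = np.array([rng.choice(len(b) + 1, p=gpcm_probs(t, a, b))
                           for t in thetas])
        return thetas, scores

    # Reference and focal groups whose means differ by one standard deviation,
    # responding to the same studied item (no DIF) -- hypothetical parameters.
    _, ref_scores = simulate_group(2000, 0.0, 1.0, a=1.2, b=[-0.5, 0.0, 0.5])
    _, foc_scores = simulate_group(2000, -1.0, 1.0, a=1.2, b=[-0.5, 0.0, 0.5])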


The unique correspondence of item response function and item category response functions in polytomously scored item response models

February 1994 · 39 Reads · 65 Citations

Psychometrika

The item response function (IRF) for a polytomously scored item is defined as a weighted sum of the item category response functions (ICRF, the probability of getting a particular score for a randomly sampled examinee of ability θ). This paper establishes the correspondence between an IRF and a unique set of ICRFs for two of the most commonly used polytomous IRT models (the partial credit models and the graded response model). Specifically, a proof of the following assertion is provided for these models: If two items have the same IRF, then they must have the same number of categories; moreover, they must consist of the same ICRFs. As a corollary, for the Rasch dichotomous model, if two tests have the same test characteristic function (TCF), then they must have the same number of items. Moreover, for each item in one of the tests, an item in the other test with an identical IRF must exist. Theoretical as well as practical implications of these results are discussed.
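In symbols (the notation here is added for clarity and is not taken from the article): writing the ICRFs of item i as P_ik(θ) for categories k = 0, …, m_i, the IRF is the score-weighted sum of the ICRFs, i.e., the expected item score at each ability level.

    % ICRF: probability of score k given ability \theta
    P_{ik}(\theta) = \Pr(X_i = k \mid \theta), \qquad k = 0, 1, \dots, m_i
    % IRF: score-weighted sum of the ICRFs (the expected item score)
    T_i(\theta) = \sum_{k=0}^{m_i} k \, P_{ik}(\theta) = \mathbb{E}(X_i \mid \theta)
    % Uniqueness result, stated informally: within the partial credit and
    % graded response families, T_i(\theta) = T_j(\theta) for all \theta
    % implies m_i = m_j and P_{ik}(\theta) = P_{jk}(\theta) for every k.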


Sex-Related Performance Differences on Constructed-Response and Multiple-Choice Sections of Advanced Placement Examinations

June 1993 · 22 Reads · 50 Citations

ETS Research Report Series

A number of studies in which scores on multiple‐choice and constructed‐response tests have been analyzed in terms of the sex of the test takers have indicated that the test performance of females relative to that of males was better on constructed‐response tests than on multiple‐choice tests. This report describes three exploratory studies of the performance of males and females on the multiple‐choice and constructed‐response sections of four Advanced Placement (AP) Examinations: United States History, Biology, Chemistry, and English Language and Composition. The studies were intended to evaluate some possible reasons for the apparent relationship between test format and the magnitude of sex‐related differences in performance. For the first study, analyses were carried out to evaluate the extent to which such differences could be attributed to differences in the score reliabilities associated with these two modes of assessment. For the second study, analyses of the multiple‐choice sections and follow‐up descriptive analyses were conducted to assess the extent to which sex‐related differences in multiple‐choice scores could be attributed to the presence of differentially functioning items favoring males. For the third study, a set of exploratory analyses was undertaken to determine whether patterns of sex‐related differences could be observed for different types of constructed‐response questions. The results of the first study provided little support for the "different‐reliabilities" hypothesis. Across all exams and all ethnic groups, there were substantial differences between the scores of males and females even after taking into account differences in the reliabilities of the two sections. The results of the second study indicated that fairly small numbers of items exhibited substantial amounts of sex‐related differential item functioning (DIF), and removing these items resulted in almost no reduction in the magnitude of sex‐related differences on the multiple‐choice sections. The results of the third study identified some consistent patterns across ethnic and racial groups regarding the questions on which females performed best relative to males. However, taken as a whole, the results of the third study suggest that topic variability may have a greater effect than the variability associated with particular question types or broadly defined content areas.
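As background for the "different-reliabilities" hypothesis, the classical-test-theory identity below shows why a section with lower score reliability tends to show a smaller observed standardized group difference. This is a general identity (assuming equal reliabilities and score variances across groups), not the specific adjustment carried out in the report.

    % Observed standardized difference d_obs understates the true-score
    % difference by a factor of the square root of the reliability,
    % because Var(T) = \rho_{XX'} Var(X) while group means are unaffected
    % by measurement error:
    d_{\mathrm{true}}
      = \frac{\mu_1 - \mu_2}{\sigma_T}
      = \frac{\mu_1 - \mu_2}{\sigma_X \sqrt{\rho_{XX'}}}
      = \frac{d_{\mathrm{obs}}}{\sqrt{\rho_{XX'}}}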


Item response theory scale linking in NAEP

June 1992 · 40 Reads · 38 Citations

Journal of Educational and Behavioral Statistics

In educational assessments, it is often necessary to compare the performance of groups of individuals who have been administered different forms of a test. If these groups are to be validly compared, all results need to be expressed on a common scale. When assessment results are to be reported using an item response theory (IRT) proficiency metric, as is done for the National Assessment of Educational Progress (NAEP), establishing a common metric becomes synonymous with expressing IRT item parameter estimates on a common scale. Procedures that accomplish this are referred to here as scale linking procedures. This chapter discusses the need for scale linking in NAEP and illustrates the specific procedures used to carry out the linking in the context of the major analyses conducted for the 1990 NAEP mathematics assessment.
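As a generic illustration of what "expressing IRT item parameter estimates on a common scale" involves, a mean/sigma linear transformation is sketched below. This is a textbook device under a 2PL-style parameterization with made-up numbers, not the common-population linking procedures actually used for the 1990 NAEP mathematics analyses.

    import numpy as np

    def mean_sigma_transform(b_source, b_target):
        """Mean/sigma coefficients that carry the source calibration onto the
        target scale, based on difficulty estimates of common items."""
        b_source = np.asarray(b_source, dtype=float)
        b_target = np.asarray(b_target, dtype=float)
        A = b_target.std(ddof=1) / b_source.std(ddof=1)
        B = b_target.mean() - A * b_source.mean()
        return A, B

    def relink_item_parameters(a, b, A, B):
        """Re-express 2PL-type item parameters on the target scale:
        theta_new = A*theta_old + B  =>  b_new = A*b + B, a_new = a / A."""
        return np.asarray(a, dtype=float) / A, A * np.asarray(b, dtype=float) + B

    # Hypothetical difficulty estimates of the same items from two calibrations.
    A, B = mean_sigma_transform(b_source=[-1.0, -0.2, 0.4, 1.1],
                                b_target=[-0.8, 0.0, 0.7, 1.5])
    a_new, b_new = relink_item_parameters(a=[1.1, 0.9, 1.3, 0.8],
                                          b=[-1.0, -0.2, 0.4, 1.1], A=A, B=B)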


Comparability of Computer and Paper‐and‐Pencil Scores for Two CLEP® General Examinations

June 1992 · 32 Reads · 45 Citations

ETS Research Report Series

ETS Research Report Series

John Mazzeo · Barry Druesne · [...] · Alan Muhlstein

This report describes two studies that investigated the comparability of scores from paper-and-pencil and computer-administered versions of the College-Level Examination Program (CLEP) General Examinations in Mathematics and English Composition. The first study used a prototype computer-administered version of each examination. Based on the results of the first study and feedback from the study participants, several modifications were made to these prototype versions. A second study was then conducted using the modified computer versions. Both studies used a single-group counterbalanced equating design. Data for the Mathematics Examination were collected at Southwest Texas State University, and data for the English Composition Examination were collected at Utah State University. The results of Study 1 suggest that, despite efforts to design computer versions of the CLEP Mathematics and English Composition General Examinations that were administratively similar to the paper-and-pencil examinations (i.e., allowed item review and answer changing and were comparably timed), mode-of-administration effects (i.e., changes in average scores as a function of the mode of test delivery) were found. The results of Study 2 suggest that the modifications made to the computer versions eliminated the mode-of-administration effects for the English Composition Examination but not for the Mathematics Examination. The results of both studies underscore the need to determine empirically (rather than simply assume) the equivalence of computer and paper versions of an examination.
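To make the single-group counterbalanced design concrete, the sketch below shows one simple way a mode-of-administration effect could be estimated from such data. The input layout and function name are assumptions, and the report's actual equating analyses are more involved than this.

    import numpy as np

    def mode_effect_counterbalanced(cbt_first, pp_first):
        """Estimate a mode-of-administration effect from a single-group
        counterbalanced design (illustrative).

        cbt_first : dict with arrays 'cbt' and 'pp' for examinees who took
                    the computer version first
        pp_first  : dict with arrays 'cbt' and 'pp' for examinees who took
                    the paper-and-pencil version first

        Averaging the within-person CBT-minus-P&P differences across the two
        orders cancels a practice effect, provided that effect is the same in
        both orders (the symmetry assumption questioned in the 1988 review
        listed below).
        """
        d_cbt_first = np.mean(np.asarray(cbt_first['cbt'], dtype=float)
                              - np.asarray(cbt_first['pp'], dtype=float))
        d_pp_first = np.mean(np.asarray(pp_first['cbt'], dtype=float)
                             - np.asarray(pp_first['pp'], dtype=float))
        return 0.5 * (d_cbt_first + d_pp_first)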


The Equivalence of Scores from Automated and Conventional Educational and Psychological Tests: A Review of the Literature

June 1988 · 32 Reads · 117 Citations

ETS Research Report Series

A literature review was conducted to determine the current state of knowledge concerning the effects of the computer administration of standardized educational and psychological tests on the psychometric properties of these instruments. Studies were grouped according to a number of factors relevant to the administration of tests by computer. Based on the studies reviewed, we arrived at the following conclusions:
  • The rate at which test‐takers omit items in an automated test may differ from the rate at which they omit items in a conventional presentation.
  • Scores from automated versions of personality inventories such as the Minnesota Multiphasic Personality Inventory are lower than scores obtained in the conventional testing format. These differences may result in part from differing omit rates, as described above, but some of the differences may be caused by other factors.
  • Scores from automated versions of speed tests are not likely to be comparable with scores from paper‐and‐pencil versions.
  • The presentation of graphics in an automated test may have an effect on score equivalence. Such effects were obtained in studies using the Hidden Figures Test; however, in studies with three Armed Services Vocational Aptitude Battery (ASVAB) tests, effects were not found.
  • Tests containing items based on reading passages can become more difficult when presented on a CRT. This was demonstrated in a single study with the ASVAB tests. The possibility of such asymmetric practice effects may make it wise to avoid conducting equating studies based on single‐group counterbalanced designs.


Technical Report for the 2000 Market-Basket Study in Mathematics
  • Article
  • Full-text available

46 Reads · 1 Citation

Table previews: Table 1, actual observed-score percent-correct statistics for the market-basket study samples on forms MB1 and MB2; Table 2, projected true-score percent-correct statistics for the main NAEP sample on form MB2 and actual main NAEP composite scale scores; Table 7, the two sets of projected market-basket results from the main NAEP sample, shown side by side.


Citations (8)


... Additionally, females tend to outperform males on items involving mood, contextual clues, or abstract concept understanding, while males do better on logical inference tasks (O'Neill & McPeek, 1993). Studies also indicate that females excel in oral and constructed-response formats (e.g., essay-type, short answers, and fill-in-the-blanks) due to stronger writing skills, while males often perform better on multiple-choice (MC) items, partly due to a greater willingness to guess (Aryadoust, 2012; Bolger & Kellaghan, 1990; Mazzeo et al., 1993; Pae, 2012; Willingham & Cole, 1997). Cognitive processing differences also contribute, with females showing stronger verbal skills and males relying more on spatial processing, affecting performance across various task types (O'Neill & McPeek, 1993). ...

Reference:

Fitting the Mixed Rasch Model to the Listening Comprehension Section of the IELTS: Identifying Latent Class Differential Item Functioning
Sex-Related Performance Differences on Constructed-Response and Multiple-Choice Sections of Advanced Placement Examinations
  • Citing Article
  • June 1993

ETS Research Report Series

... Some studies showed that test takers' performance differed significantly between the two test modes depending on gender, age, familiarity with the computer, etc. (Parshall & Kromrey, 1993; Gallagher, Bridgeman & Cahalan, 2000; Oduntan, Ojuawo & Oduntan, 2015; Choi & Tinkler, 2002; Wang et al., 2008; Goldberg & Pedulla, 2002; Jeong, 2008). However, other studies found no significant difference in test takers' performance between the two test modes (Mazzeo & Harvey, 1988; Mead & Drasgow, 1993; Anakwe, 2008; Öz & Özturan, 2018). These studies have laid a solid foundation and provided guidance for comparative studies of test takers' performance on CBT and PBT. ...

The Equivalence of Scores from Automated and Conventional Educational and Psychological Tests: A Review of the Literature
  • Citing Article
  • June 1988

ETS Research Report Series

... However, the use of paper is still prevalent in everyday contexts, ranging from restaurant menus to charitable pledge forms. Recognizing the importance of the medium through which an action is performed, prior research has compared the effects of using digital devices to using paper on reading, learning, and test-taking performances (e.g., Mazzeo et al. 1991, DeAngelis 2000, Watson 2001, Clariana and Wallace 2002, and Mangen et al. 2013). The present research broadens these lines of inquiry to the domain of virtuous behavior by asking the following question: How does using a digital device instead of paper influence people's likelihood to engage in virtuous behavior? ...

Comparability of Computer and Paper‐and‐Pencil Scores for Two CLEP® General Examinations
  • Citing Article
  • June 1992

ETS Research Report Series

... A simple way to calculate a standardized effect size is by dividing the mean difference by the pooled standard deviation. This typically is a straightforward process, and it has even been recommended for use with the polytomous version of the SIBTEST procedure (Zwick et al., 1997). However, the use of such an effect size with the dichotomous or polytomous SIBTEST procedures does not account for the selected inclusion of examinees in the calculation. ...

Describing and Categorizing DIF in Polytomous Items
  • Citing Article
  • June 1997

ETS Research Report Series

... Finally, the metric of the latent proficiency is set using linking methods to ensure the comparability of scores across assessments administered in different years or with different administration modes. These linking methods can usually be classified as either common-item or common-population linking (for more detail, see Jewsbury, 2019; Jewsbury, Jia & Xi, in press; Yamamoto & Mazzeo, 1992). ...

Item response theory scale linking in NAEP
  • Citing Article
  • June 1992

Journal of Educational and Behavioral Statistics

... Facet level was not included as the facets consist of only 2 items, and research indicates that DIF items in a small item group can introduce biased DIF effects in the other analyzed items, distorting the overall picture (Andrich & Hagquist, 2015; Penfield & Camilli, 2007). Three different statistics were consulted to flag non-age-neutral items and to determine the size and direction of DIF in the polytomous items: the Mantel chi-square (Mantel) statistic (Zwick et al., 1997), the Liu-Agresti log odds ratio (L-A LOR) (Liu & Agresti, 1996), and Cox's noncentrality parameter estimator (Cox's B) (Camilli & Congdon, 1999). When the Mantel statistic exceeds the critical value 3.84 (Type I error rate of 0.05) (Penfield, 2013), the item in question displays DIF. ...

Descriptive and Inferential Procedures for Assessing Differential Item Functioning in Polytomous Items
  • Citing Article
  • October 1997

Applied Measurement in Education

... Several methods have been developed, first addressing dichotomous items and then generalized to polytomous ones, that also consider non-uniform DIF. These include, in particular, the SIBTEST [42,43], which relies on defining an additional latent trait contributing to the DIF; ordinal logistic regression [44,45], which applies logistic regression directly; and the MIMIC (multiple indicators multiple causes) model, which combines confirmatory factor analysis (CFA) with structural equation modeling (SEM) [46]. ...

Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure
  • Citing Article
  • September 2005

Journal of Educational Measurement

... For respondent r, the expected value on item i is E(X_ri | Θ_r = θ_r) = Σ_{x=1}^{m} P(X_ri ≥ x | Θ_r = θ_r). The expectation of X_ri as a function of Θ_r, E(X_ri | Θ_r), is referred to as the item response function (IRF; Chang & Mazzeo, 1994). Most single-level IRT models are defined by at least these three assumptions: ...

The unique correspondence of item response function and item category response functions in polytomously scored item response models
  • Citing Article
  • February 1994

Psychometrika