## Publications (117)

Examples of the impact of statistical theory on assessment practice are provided from the perspective of a statistician trained in theoretical statistics who began to work on assessments. Goodness of fit of item‐response models is examined in terms of restricted likelihood‐ratio tests and generalized residuals. Minimum discriminant information adju...

Minimum discriminant information adjustment (MDIA), an approach to weighting samples to conform to known population information, provides a generalization of raking and poststratification. In the case of simple random sampling with replacement with uniform sampling weights, large‐sample properties are available for MDIA estimates of population mean...
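
As a point of reference for the raking that MDIA is said to generalize, the following is a minimal sketch of classical raking (iterative proportional fitting) on a two-way table. It is not the MDIA procedure itself; the function name `rake`, the tolerance, and the example counts are illustrative assumptions.

```python
def rake(table, row_targets, col_targets, iterations=100, tol=1e-10):
    """Classical raking: rescale weighted counts until both margins match.

    table       -- list of rows of positive sample weights (hypothetical data)
    row_targets -- known population totals for each row category
    col_targets -- known population totals for each column category
    """
    weights = [row[:] for row in table]  # work on a copy
    for _ in range(iterations):
        # Step 1: scale every row so that its total equals the row target.
        for i, target in enumerate(row_targets):
            total = sum(weights[i])
            if total > 0:
                weights[i] = [w * target / total for w in weights[i]]
        # Step 2: scale every column so that its total equals the column target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in weights)
            if total > 0:
                for row in weights:
                    row[j] *= target / total
        # Stop once the row margins still hold after the column step.
        gap = max(abs(sum(weights[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return weights


# Hypothetical sample counts and population margins.
adjusted = rake([[10.0, 20.0], [30.0, 40.0]], [40.0, 60.0], [50.0, 50.0])
```

Raking changes only the margins: the cross-product ratio of the adjusted table equals that of the original table.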

Best linear prediction (BLP) and penalized best linear prediction (PBLP) are techniques for combining sources of information to produce task scores, section scores, and composite test scores. The report examines issues to consider in operational implementation of BLP and PBLP in testing programs administered by ETS.

For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes; some are expected while others may be unfor...

Distractor analyses are routinely conducted in educational assessments with multiple‐choice items. In this research report, we focus on three item response models for distractors: (a) the traditional nominal response (NR) model, (b) a combination of a two‐parameter logistic model for item scores and a NR model for selections of incorrect distractor...

Cross‐validation is a common statistical procedure applied to problems that are otherwise computationally intractable. It is often employed to assess the effectiveness of prediction procedures. In this report, cross‐validation is discussed in terms of U‐statistics. This approach permits consideration of the statistical properties of cross‐validatio...

In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In su...

Measures of agreement are compared to measures of prediction accuracy within a general context. Differences in appropriate use are emphasized, and approaches are examined for both numerical and nominal variables. General estimation methods are developed, and their large‐sample properties are compared.

Many assessments of writing proficiency that aid in making high‐stakes decisions consist of several essay tasks evaluated by a combination of human holistic scores and computer‐generated scores for essay features such as the rate of grammatical errors per word. Under typical conditions, a summary writing score is provided by a linear combination of...

In best linear prediction (BLP), a true test score is predicted by observed item scores and by ancillary test data. If the use of BLP rather than a more direct estimate of a true score has disparate impact for different demographic groups, then a fairness issue arises. To improve population invariance but to preserve much of the efficiency of BLP,...

In investigations of unusual testing behavior, a common question is whether a specific pattern of responses occurs unusually often within a group of examinees. In many current tests, modern communication techniques can permit quite large numbers of examinees to share keys, or common response patterns, to the entire test. To address this issue, stat...

Latent regression models are used for score-reporting purposes in large-scale educational survey assessments such as the National Assessment of Educational Progress (NAEP) and Trends in International Mathematics and Science Study (TIMSS). One component of these models is based on item response theory. While there exists some research on assessment...

In item-response theory (IRT), item parameters estimated from examinee responses characterize performance of items from a test form used in administration of an educational assessment (Hambleton, Swaminathan, & Rogers, 1991, ch. 1). These parameters are specific to the proficiency distribution of the examinees for that administration. When an asses...

Unmotivated test takers using rapid guessing in item responses can affect validity studies and teacher and institution performance evaluation negatively, making it critical to identify these test takers. The authors propose a new nonparametric method for finding response-time thresholds for flagging item responses that result from rapid-guessing be...

The use of computer-based assessments makes the collection of detailed data that capture examinees’ progress in the tests and time spent on individual actions possible. This article presents a study using process and timing data to aid understanding of an international language assessment and the examinees. Issues regarding test-taking strategies,...

Feinberg and Wainer (2014) provided a simple equation to approximate/predict a subscore's value. The purpose of this note is to point out that their equation is often inaccurate in that it does not always predict a subscore's value correctly. Therefore, the utility of their simple equation is not clear.

Admission decisions frequently rely on multiple assessments. As a consequence, it is important to explore rational approaches to combine the information from different educational tests. For example, U.S. graduate schools usually receive both TOEFL iBT® scores and GRE® General scores of foreign applicants for admission; however, little guidance has...

Adjustment by minimum discriminant information provides an approach to linking test forms in the case of a nonequivalent groups design with no satisfactory common items. This approach employs background information on individual examinees in each administration so that weighted samples of examinees form pseudo-equivalent groups in the sense that th...

In this study, we apply jackknifing to anchor items to evaluate the impact of anchor selection on equating stability. In an ideal world, the choice of anchor items should have little impact on equating results. When this ideal does not correspond to reality, selection of anchor items can strongly influence equating results. This influence does not...

In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of...

Standard 3.9 of the Standards for Educational and Psychological Testing demands evidence of model fit when item response theory (IRT) models are applied to data from tests. Hambleton and Han, and also Sinharay, recommended the assessment of practical significance of misfit of IRT models, but few examples of such assessment can be found in the li...

This commentary addresses the modeling and final analytical path taken, as well as the terminology used, in the paper "Hierarchical diagnostic classification models: a family of models for estimating and testing attribute hierarchies" by Templin and Bradshaw (Psychometrika, doi: 10.1007/s11336-013-9362-0 , 2013). It raises several issues concerning...

Recently there has been an increasing level of interest in subtest scores, or subscores, for their potential diagnostic value. Haberman (2008) suggested a method to determine if...

Generalized residuals are a tool employed in the analysis of contingency tables to examine possible sources of model error. They have typically been applied to log-linear models and to latent-class models. A general approach to generalized residuals is developed for a very general class of models for contingency tables. To illustrate their use, gen...

A general program for item-response analysis is described that uses the stabilized Newton-Raphson algorithm. This program is written to be compliant with Fortran 2003 standards and is sufficiently general to handle independent variables, multidimensional ability parameters, and matrix sampling. The ability variables may be either polytomous or mult...

Monitoring a very frequently administered educational test with a relatively short history of stable operation imposes a number of challenges. Test scores usually vary by season, and the frequency of administration of such educational tests is also seasonal. Although it is important to react to unreasonable changes in the distributions of test scor...

Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may b...

Anchor tests play a key role in test score equating. We attempt to find, through theoretical derivations, an anchor test with optimal item characteristics. The correlation between the scores on a total test and on an anchor test is maximized with respect to the item parameters for data satisfying several item response theory models. Results suggest...

Haberman (2008) suggested a method to determine if subtest scores have added value over the total score. The method is based on classical test theory and considers the estimation of the true subscores. Performance of subgroups, for example, those based on gender or ethnicity, on subtests is often of interest. Researchers such as Stricker (1993) and...

Alternative approaches are discussed for use of e-rater® to score the TOEFL iBT® Writing test. These approaches involve alternate criteria. In the 1st approach, the predicted variable is the expected rater score of the examinee's 2 essays. In the 2nd approach, the predicted variable is the expected rater score of 2 essay responses by the examinee o...

Subscores are reported for several operational assessments. Haberman (2008) suggested a method based on classical test theory to determine if the true subscore is predicted better by the corresponding subscore or the total score. Researchers are often interested in learning how different subgroups perform on subtests. Stricker (1993) and Livingston...

Standard 3.9 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council for Measurement in Education, 1999) demands evidence of model fit when an item response theory (IRT) model is used to make inferences from a data set. We applied two recently sugg...

There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several of the recent publications. In this note, the authors question the argument used in these publications and suggest both inherent limits to the validity argu...

The purpose of this ITEMS module is to provide an introduction to subscores. First, examples of subscores from an operational test are provided. Then, a review of methods that can be used to examine if subscores have adequate psychometric quality is provided. It is demonstrated, using results from operational and simulated data, that subscores have...

Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman (2008b) suggested reporting an augmented subscore that is a linear combination of a subscore and the total score. Sinharay and Haberman (2008) and Sinharay (2010) showed that augmented subscores often lead to more accurate diagnostic...

This study examined the adequacy of a multiple linear regression model for predicting first-year college grade point average (FYGPA) using SAT® scores and high school grade point average (HSGPA). A variety of techniques, both graphical and statistical, were used to examine if it is possible to improve on the linear regression model. The results sug...

Recently, the literature has seen increasing interest in subscores for their potential diagnostic values; for example, one study suggested the report of weighted averages of a subscore and the total score, whereas others showed, for various operational and simulated data sets, that weighted averages, as compared to subscores, lead to more accurate...

Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Carroll, 1993). Scale anchoring (Beaton & Allen, 1992), a technique that describes what students at different points on a score scale know and can do, is a...

Continuous exponential families are applied to linking test forms via an internal anchor. This application combines work on continuous exponential families for single-group designs and work on continuous exponential families for equivalent-group designs. Results are compared to those for kernel and equipercentile equating in the case of chained equ...

For testing programs that administer multiple forms within a year and across years, score equating is used to ensure that scores can be used interchangeably. In an ideal world, sample sizes are large and representative of populations that hardly change over time, and very reliable alternate test forms are built with nearly identical psychometric p...

The synthetic function is a weighted average of the identity (the linking function for forms that are known to be completely parallel) and a traditional equating method. The purpose of the present study was to investigate the benefits of the synthetic function on small-sample equating using various real data sets gathered from different administrat...

Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement. Scale anchoring, a technique which describes what students at different points on a score scale know and can do, is a tool to provide such information. Sca...

In common equipercentile equating methods such as the percentile-rank method or kernel equating (von Davier, Holland, & Thayer, 2004b), sample distributions of test scores are approximated by continuous distributions with positive density functions on intervals that include all possible scores. The use of continuous distributions with positive dens...

Paul Holland’s work over his long and varied career has shown both breadth and depth. He has made major contributions to the analysis of discrete data, to the study of social networks, to equating, to differential item functioning (DIF), to item response theory (IRT), and to causal inference. He has worked on a wide variety of applied problems rang...

Sampling errors limit the accuracy with which forms can be linked. Limitations on accuracy are especially important in testing programs in which a very large number of forms are employed. Standard inequalities in mathematical statistics may be used to establish lower bounds on the achievable linking accuracy. To illustrate results, a variety of equa...

Most automated essay scoring programs use a linear regression model to predict an essay score from several essay features. This article applied a cumulative logit model instead of the linear regression model to automated essay scoring. Comparison of the performances of the linear regression model and the cumulative logit model was performed on a la...

Will subscores provide information beyond what is provided by the total score? Is there a method that can estimate more trustworthy subscores than observed subscores? To answer the first question, this study evaluated whether the true subscore was more accurately predicted by the observed subscore or total score. To answer the second quest...

Recently, there has been increasing interest in reporting diagnostic scores. This paper examines reporting of subscores using multidimensional item response theory (MIRT) models. An MIRT model is fitted using a stabilized Newton-Raphson algorithm (Haberman, 1974, 1988) with adaptive Gauss-Hermite quadrature (Haberman, von Davier, & Lee, 2008). A ne...

Recently, there has been increasing interest in reporting subscores. This paper examines reporting of subscores using multidimensional item response theory (MIRT) models (e.g., Reckase in Appl. Psychol. Meas. 21:25–36, 1997; C.R. Rao and S. Sinharay (Eds), Handbook of Statistics, vol. 26, pp. 607–642, North-Holland, Amsterdam, 2007; Beguin & Glas in P...

Diagnostic scores are of increasing interest in educational testing due to their potential remedial and instructional benefit. Naturally, the number of educational tests that report diagnostic scores is on the rise, as is the number of research publications on such scores. This article provides a critical evaluation of diagnostic score reporting i...

Grouped jackknifing may be used to evaluate the stability of equating procedures with respect to sampling error and with respect to changes in anchor selection. Properties of grouped jackknifing are reviewed for simple-random and stratified sampling, and its use is described for comparisons of anchor sets. Application is made to examples of item re...

A regression procedure is developed to link simultaneously a very large number of item response theory (IRT) parameter estimates obtained from a large number of test forms, where each form has been separately calibrated and where forms can be linked on a pairwise basis by means of common items. An application is made to forms in which a two-paramet...

Generalized residuals are a tool employed in the analysis of contingency tables to examine goodness of fit. They may be applied to item response models with little complication. Their use is illustrated with testing data from operational programs. Models considered include the Rasch model and the two-parameter logistic model.

Diagnostic scores are of increasing interest due to their potential remedial and instructional benefit. Naturally, the number of testing programs that report diagnostic scores is on the rise, as is the number of research works on such scores. This paper starts by showing examples of diagnostic subscores reported by operational testing pro...

In educational testing, subscores may be provided based on a portion of the items from a larger test. One consideration in evaluation of such subscores is their ability to predict a criterion score. Two limitations on prediction exist. The first, which is well known, is that the coefficient of determination for linear prediction of the criterion sc...

Continuous exponential families are applied to linking forms via a single-group design. In this application, a distribution from the continuous bivariate exponential family is used that has selected moments that match those of the bivariate distribution of scores on the forms to be linked. The selected continuous bivariate distribution then yields...

This study uses historical data to explore the consistency of SAT® I: Reasoning Test score conversions and to examine trends in scaled score means. During the period from April 1995 to December 2003, both Verbal (V) and Math (M) means display substantial seasonality, and a slight increasing trend for both is observed. SAT Math means increase more t...

Multidimensional item response models can be based on multivariate normal ability distributions or on multivariate polytomous ability distributions. For the case of simple structure in which each item corresponds to a unique dimension of the ability vector, some applications of the two-parameter logistic model to empirical data are employed to illu...

Will reporting subscores provide any information beyond the total score? Is there a method that can be used to provide more trustworthy subscores than observed subscores? These 2 questions are addressed in this study. To answer the 2nd question, 2 subscore estimation methods (i.e., subscore estimated from the observed total score or subsco...

The reliability of a scaled score can be computed by use of item response theory. Estimated reliability can be obtained even if the item response model selected is not valid.

Outliers in assessments are often treated as a nuisance for data analysis; however, they can also assist in quality assurance. Their frequency can suggest problems with form codes, scanning accuracy, ability of examinees to enter responses as they intend, or exposure of items.

This study addressed the sampling error and linking bias that occur with small samples in a nonequivalent groups anchor test design. We proposed a linking method called the synthetic function, which is a weighted average of the identity function and a traditional equating function (in this case, the chained linear equating function). Specifically,...
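
The synthetic function described in this entry can be sketched as follows, assuming a weight `w` between 0 and 1 is placed on the traditional equating function and `1 - w` on the identity. Which term carries `w`, the function names, and the linear coefficients are illustrative assumptions, not values from the study.

```python
def synthetic_link(x, equate, w):
    """Weighted average of a traditional equating function and the identity.

    x      -- raw score on the new form
    equate -- traditional equating function (for example, chained linear)
    w      -- weight on the equating function; 1 - w goes to the identity
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0 and 1")
    return w * equate(x) + (1.0 - w) * x


def example_linear_equate(x):
    """Hypothetical linear equating function with made-up slope and intercept."""
    return 1.1 * x - 2.0


# w = 0 returns the raw score unchanged; w = 1 returns the fully equated score.
halfway = synthetic_link(50.0, example_linear_equate, 0.5)
```

Values of `w` near 0 rely on the identity, which has no sampling error but may be biased when forms differ; values near 1 rely on the estimated equating function.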

In educational tests, subscores are often generated from a portion of the items in a larger test. Guidelines based on mean squared error are proposed to indicate whether subscores are worth reporting. Alternatives considered are direct reports of subscores, estimates of subscores based on total score, combined estimates based on subscores and total...

Continuous exponential families may be employed to find continuous distributions with the same initial moments as the discrete distributions encountered in typical applications of classical equating. These continuous distributions provide distribution functions and quantile functions that may be employed in equating. To illustrate, an application i...

Sample-size requirements were considered for automated essay scoring in cases in which the automated essay score estimates the score provided by a human rater. Analysis considered both cases in which an essay prompt is examined in isolation and those in which a family of essay prompts is studied. In typical cases in which content analysis...

Techniques are developed for approximation and exact computation of the asymptotic limit of the item parameter estimates obtained by application of joint maximum-likelihood estimation to the Rasch model.

In the case of exponential families, it is a straightforward matter to approximate a density function by use of summary statistics; however, an appropriate approach to such approximation is far less clear when an exponential family is not assumed. In this paper, a maximin argument based on information theory is used to derive a new approach to dens...

The synthetic function, which is a weighted average of the identity (the trivial linking function for forms that are known to be completely parallel) and a traditional equating method, has been proposed as an alternative for performing linking with very small samples (Kim, von Davier, & Haberman, 2006). The purpose of the present study was to inves...

There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory...

Recently, there has been an increasing level of interest in reporting subscores for components of larger assessments. This paper examines the issue of reporting subscores at an aggregate level, especially at the level of institutions to which the examinees belong. A new statistical approach based on classical test theory is proposed to assess when...

In item-response theory, if a latent-structure model has an ability variable, then elementary information theory may be employed to provide a criterion for evaluation of the information the test provides concerning ability. This criterion may be considered even in cases in which the latent-structure model is not valid, although interpretation of th...

Statistical prediction problems often involve both a direct estimate of a true score and covariates of this true score. Given the criterion of mean squared error, this study determines the best linear predictor of the true score given the direct estimate and the covariates. Results yield an extension of Kelley’s formula for estimation of the true s...
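
For orientation, the classical Kelley formula that this entry extends shrinks an observed score toward the group mean in proportion to the reliability. The sketch below covers only that classical case, not the extension with covariates; the function name and the example numbers are illustrative assumptions.

```python
def kelley_estimate(observed, group_mean, reliability):
    """Classical Kelley estimate of a true score.

    observed    -- examinee's observed score (the direct estimate)
    group_mean  -- mean observed score in the examinee's population
    reliability -- reliability coefficient of the observed score, in [0, 1]
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    # A reliability of 1 keeps the observed score; 0 falls back to the mean.
    return reliability * observed + (1.0 - reliability) * group_mean


# Hypothetical values: observed score 70, group mean 60, reliability 0.8.
estimate = kelley_estimate(70.0, 60.0, 0.8)
```

With these hypothetical values the estimate is about 68, which lies between the observed score and the group mean.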

Bounds are established for log odds ratios (log cross-product ratios) involving pairs of items for item response models. First, expressions for bounds on log odds ratios are provided for one-dimensional item response models in general. Then, explicit bounds are obtained for the Rasch model and the two-parameter logistic (2PL) model. Results are als...

The interaction model, a generalization of the Rasch model (RM) for binary responses, retains many of the attractive features of the RM but does not assume local independence. Like the RM, the interaction model has simple sufficient statistics and a relatively straightforward interpretation. Computation of conditional maximum-likelihood estimates i...

Computer scores can be developed so that they predict essay scores provided by human readers, or scores can be produced that correspond to an analytical scale for writing assessment. This chapter discusses the basic approaches such as regression analysis, composite scales, content analysis, the analysis of discrete responses, and Bayesian analysis....

Cognitive diagnostic assessment traditionally has been performed through the use of specialized tests designed for this purpose. These tests are traditionally analyzed by straightforward statistical methods. However, within psychometrics, recently a much different picture of cognitive diagnosis has emerged in which statistical models are employed t...

Adaptive quadrature is applied to marginal maximum likelihood estimation for item response models with normal ability distributions. Even in one dimension, significant gains in speed and accuracy of computation may be achieved.

This study addresses the sample error and linking bias that occur with small and unrepresentative samples in a non-equivalent groups anchor test (NEAT) design. We propose a linking method called the synthetic function, which is a weighted average of the identity function (the trivial equating function for forms that are known to be completely paral...

A simple score test of the normal two-parameter logistic (2PL) model is presented that examines the potential attraction of the normal three-parameter logistic (3PL) model for use with a particular item. Application is made to data from a test from the Praxis™ series. Results from this example raise the question whether the normal 3PL model should...

Bounds are established for log cross-product ratios (log odds ratios) involving pairs of items for item response models. First, expressions for bounds on log cross-product ratios are provided for unidimensional item response models in general. Then, explicit bounds are obtained for the Rasch model and the two-parameter logistic (2PL) model. Results...

Multinomial-response models are available that correspond implicitly to tests in which a total score is computed as the sum of polytomous item scores. For these models, joint and conditional estimation may be considered in much the same way as for the Rasch model for right-scored tests. As in the Rasch model, joint estimation is only attractive if...

Recently, there has been an increasing level of interest in reporting subscores. This paper examines the issue of reporting subscores at an aggregate level, especially at the level of institutions that the examinees belong to. A series of statistical analyses is suggested to determine when subscores at the institutional level have any added value o...

When a simple random sample of size n is employed to establish a classification rule for prediction of a polytomous variable by an independent variable, the best achievable rate of misclassification is higher than the corresponding best achievable rate if the conditional probability distribution is known for the predicted variable given the indepen...

If a parametric model for the ability distribution is not assumed, then the customary two-parameter and three-parameter logistic models for item response analysis present identifiability problems not encountered with the Rasch model. These problems impose substantial restrictions on possible models for ability distributions.

Some probabilistic illustrations of the reliability coefficient are provided to assist in interpretation of this measure. All explanations are derived under the assumption that the joint distribution of examinee scores from two parallel tests is well approximated by a bivariate normal distribution.

Latent-class item response models with small numbers of latent classes are quite competitive in terms of model fit to corresponding item-response models, at least for one- and two-parameter logistic (1PL and 2PL) models. Provided that care is taken in terms of computational procedures and in terms of use of only limited numbers of latent classes, c...

A chi-square statistic suitable for testing a primary hypothesis can be partitioned into components such that each component gives a test for a corresponding secondary hypothesis. Some partitionings are exact and some are approximate. The theory is based on the Fisher–Cochran theorem about decomposing quadratic functions of normal variables. The hi...

Statistical prediction problems often involve both a direct estimate of a true score and covariates of this true score. Given the criterion of mean squared error, this study determines the best linear predictor of the true score given the direct estimate and the covariates. Results yield an extension of Kelley's formula for estimation of the true s...

The usefulness of joint and conditional maximum-likelihood is considered for the Rasch model under realistic testing conditions in which the number of examinees is very large and the number of items is relatively large. Conditions for consistency and asymptotic normality are explored, effects of model error are investigated, measures of prediction...

Statistical and measurement properties are examined for features used in essay assessment to determine the generalizability of the features across populations, prompts, and individuals. Data are employed from TOEFL® and GMAT® examinations and from writing for Criterion℠.

Criteria for prediction of multinomial responses are examined in terms of estimation bias. Logarithmic penalty and least squares are quite similar in behavior but quite different from maximum probability. The differences ultimately reflect deficiencies in the behavior of the criterion of maximum probability.

A categorical profile is a vector of observed values of several categorical variables that share a common context. Statistical analysis of categorical profiles may involve study of the joint distribution of the profiles or study of the relationship of the profiles to explanatory variables. Such analysis entails special difficulties due to the very...

Correspondence models are a special class of statistical models for the association between categorical variables. The specific parametric structure of such models is based, as in the common (descriptive) correspondence analysis, on the canonical form of joint discrete distributions. Bivariate correspondence models are considered first, based on a...

Statistics may be described as the science of description of measurements on natural populations (Kendall and Stuart, 1977, pp. 1–2). This brief description of statistics requires some amplification. In general, a population S is a nonempty set, and a subpopulation V of S is a nonempty subset of S. In statistical practice, a population of interest...

