ABSTRACT: With a few exceptions, the problem of linking item response model parameters from different item calibrations has been conceptualized as an instance of the problem of test equating scores on different test forms. This paper argues, however, that the use of item response models does not require any test score equating. Instead, it involves the necessity of parameter linking due to a fundamental problem inherent in the formal nature of these models: their general lack of identifiability. More specifically, item response model parameters need to be linked to adjust for the different effects of the identifiability restrictions used in separate item calibrations. Our main theorems characterize the formal nature of these linking functions for monotone, continuous response models, derive their specific shapes for different parameterizations of the 3PL model, and show how to identify them from the parameter values of the common items or persons in different linking designs.
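In the linear case, linking 3PL parameters across calibrations takes a familiar form. The sketch below assumes the standard mean/sigma estimator and a linear transformation of the ability scale; it is an illustrative choice, not the paper's general derivation:

```python
import numpy as np

def mean_sigma_link(b_old, b_new):
    """Estimate theta* = A*theta + B from the difficulties of common items
    in two separate calibrations (mean/sigma method; illustrative only)."""
    A = np.std(b_new, ddof=1) / np.std(b_old, ddof=1)
    B = np.mean(b_new) - A * np.mean(b_old)
    return A, B

def link_3pl(a, b, c, A, B):
    # Under theta* = A*theta + B, the 3PL parameters transform as
    # a* = a / A, b* = A*b + B, c* = c (guessing is scale-free).
    return np.asarray(a) / A, A * np.asarray(b) + B, np.asarray(c)
```

The transformation adjusts exactly for the different identifiability restrictions (means and scales of the ability distributions) imposed in the two calibrations.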
ABSTRACT: Posterior odds of cheating on achievement tests are presented as an alternative to p values reported for statistical hypothesis testing for several of the probabilistic models in the literature on the detection of cheating. It is shown how to calculate their combinatorial expressions with the help of a reformulation of the simple recursive algorithm for the calculation of number-correct score distributions used throughout the testing industry. Using the odds avoids the arbitrary choice between statistical tests of answer copying that do and do not condition on the responses the test taker is suspected to have copied and allows the testing agency to account for existing circumstantial evidence of cheating through the specification of prior odds.
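The simple recursive algorithm for number-correct score distributions referred to here can be sketched as follows. This is a minimal version of the standard recursion; the paper's reformulation for cheating detection is not reproduced:

```python
def score_distribution(p):
    """Number-correct score distribution for independent items with
    success probabilities p, built up one item at a time."""
    f = [1.0]                          # after zero items: score 0 with prob 1
    for p_i in p:
        g = [0.0] * (len(f) + 1)
        for s, f_s in enumerate(f):
            g[s] += f_s * (1 - p_i)    # item answered incorrectly
            g[s + 1] += f_s * p_i      # item answered correctly
        f = g
    return f
```

For example, `score_distribution([0.5, 0.5])` returns `[0.25, 0.5, 0.25]`.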
ABSTRACT: Occasionally situations arise in which a measurement does not lend itself to such traditional methods of reliability estimation as the test-retest, parallel-test, or internal consistency methods, for example, because a single item variable or an index based on heterogeneous data is involved. In this paper, it is proposed to base reliability estimation in such situations on estimates of validity coefficients as lower bounds. These lower bounds can be maximized by a deliberate selection of predictor variables, both in the case of single item variables and heterogeneous indices. Two selection procedures are examined and compared, one based on expert judgment and one on backward deletion of predictors with cross validation as a provision against chance capitalization. Some examples presented in the paper suggest that these methods provide satisfactory estimates.
Article · Apr 2014 · The Journal of Experimental Education
ABSTRACT: Three local observed-score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed-score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard errors of equating than their kernel method counterparts.
Article · Mar 2014 · Journal of Educational Measurement
ABSTRACT: An optimal adaptive design for test-item calibration based on Bayesian optimality criteria is presented. The design adapts the choice of field-test items to the examinees taking an operational adaptive test using both the information in the posterior distributions of their ability parameters and the current posterior distributions of the field-test parameters. Different criteria of optimality based on the two types of posterior distributions are possible. The design can be implemented using an MCMC scheme with alternating stages of sampling from the posterior distributions of the test takers' ability parameters and the parameters of the field-test items while reusing samples from earlier posterior distributions of the other parameters. Results from a simulation study demonstrated the feasibility of the proposed MCMC implementation for operational item calibration. A comparison of performances for different optimality criteria showed faster calibration of substantial numbers of items for the criterion of D-optimality relative to A-optimality, a special case of c-optimality, and random assignment of items to the test takers.
ABSTRACT: This article is a response to the commentaries on the position paper on observed-score equating by van der Linden (this issue). The response focuses on the more general issues in these commentaries, such as the nature of the observed scores that are equated, the importance of test-theory assumptions in equating, the necessity to use multiple equating transformations, and the choice of conditioning variables in equating.
Article · Sep 2013 · Journal of Educational Measurement
ABSTRACT: In spite of all of the technical progress in observed-score equating, several of the more conceptual aspects of the process still are not well understood. As a result, the equating literature struggles with rather complex criteria of equating, lack of a test-theoretic foundation, confusing terminology, and ad hoc analyses. A return to Lord's foundational criterion of equity of equating, a derivation of the true equating transformation from it, and mainstream statistical treatment of the problem of estimating the transformation for various data-collection designs are proposed as a solution to these problems.
Article · Sep 2013 · Journal of Educational Measurement
ABSTRACT: Two simple constraints on the item parameters in a response–time model are proposed to control the speededness of an adaptive test. As the constraints are additive, they can easily be included in the constraint set for a shadow-test approach (STA) to adaptive testing. Alternatively, a simple heuristic is presented to control speededness in plain adaptive testing without any constraints. Both types of control are easy to implement and do not require any real-time parameter estimation during the test other than the regular update of the test taker’s ability estimate. Evaluation of the two approaches using simulated adaptive testing showed that the STA was especially effective. It guaranteed testing times that differed less than 10 seconds from a reference test across a variety of conditions.
Article · Aug 2013 · Journal of Educational and Behavioral Statistics
ABSTRACT: Automated test assembly uses the methodology of mixed integer programming to select an optimal set of items from an item bank. Automated test-form generation uses the same methodology to optimally order the items and format the test form. From an optimization point of view, production of fully formatted test forms directly from the item pool using a simultaneous optimization model is more attractive than any of the current, more time-consuming two-stage processes. The goal of this study was to provide such simultaneous models both for computer-delivered and paper forms, as well as explore their performances relative to two-stage optimization. Empirical examples are presented to show that it is possible to automatically produce fully formatted optimal test forms directly from item pools of up to some 2,000 items on a regular PC in realistic times.
ABSTRACT: Optimal test-design methods are applied to rule-based item generation. Three different cases of automated test design are presented: (a) test assembly from a pool of pregenerated, calibrated items; (b) test generation on the fly from a pool of calibrated item families; and (c) test generation on the fly directly from calibrated features defining the item families. The last two cases do not assume any item calibration under a regular response theory model; instead, entire item families or critical features of them are assumed to be calibrated using a hierarchical response model developed for rule-based item generation. The test-design models maximize an expected version of the Fisher information in the test and control critical attributes of the test forms through explicit constraints. Results from a study with simulated response data highlight the effects of both within-family item-parameter variability and the severity of the constraint sets in the test-design models on their optimal solutions.
Article · Mar 2013 · Applied Psychological Measurement
ABSTRACT: The probability of test takers changing answers upon review of their initial choices is modeled. The primary purpose of the model is to check erasures on answer sheets recorded by an optical scanner for numbers and patterns that may be indicative of irregular behavior, such as teachers or school administrators changing answer sheets after their students have finished the test or test takers communicating with each other about their initial responses. A statistical test based on the number of erasures is derived from the model. In addition, it is shown how to analyze the residuals under the model to check for suspicious patterns of erasures. The use of the two procedures is illustrated for an empirical data set from a large-scale assessment. The robustness of the model with respect to less than optimal opportunities for regular test takers to review their responses is investigated.
Article · Feb 2012 · Journal of Educational and Behavioral Statistics
ABSTRACT: The issue of compensation in multidimensional response modeling is addressed. We show that multidimensional response models are compensatory in their ability parameters if and only if they are monotone. In addition, a minimal set of assumptions is presented under which the MLEs of the ability parameters are also compensatory. In a recent series of articles, beginning with Hooker, Finkelman, and Schwartzman (2009) in this journal, the second type of compensation was presented as a paradoxical result for certain multidimensional response models, leading to occasional unfairness in maximum-likelihood test scoring. First, it is indicated that the compensation is not unique and holds generally for any multiparameter likelihood with monotone score functions. Second, we analyze why, in spite of its generality, the compensation may give the impression of a paradox or unfairness.
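The first type of compensation (in the ability parameters) is visible directly in a two-dimensional compensatory model, sketched here with illustrative parameter values: a deficit on one dimension can be offset on the other without changing the success probability.

```python
import math

def compensatory_2pl(theta, a, b):
    # Multidimensional 2PL: the success probability depends on theta only
    # through the linear combination a1*theta1 + a2*theta2.
    z = sum(a_i * t_i for a_i, t_i in zip(a, theta)) - b
    return 1 / (1 + math.exp(-z))

a, b = (1.0, 0.5), 0.0
p1 = compensatory_2pl((0.0, 2.0), a, b)  # low theta1, high theta2
p2 = compensatory_2pl((1.0, 0.0), a, b)  # high theta1, low theta2
# Both ability vectors give a1*t1 + a2*t2 = 1.0, hence the same probability.
```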
ABSTRACT: According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.
Article · Nov 2011 · Journal of Educational Measurement
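A Simpson's-paradox reversal of the kind invoked above is easy to reproduce with hypothetical counts (the numbers below are illustrative, not from the study): within every ability group, keeping the initial answer has the higher success rate, yet the pooled rates point the other way.

```python
# (kept_correct, kept_total, changed_correct, changed_total) per ability group;
# purely illustrative counts chosen to exhibit the reversal.
groups = {
    "high ability": (81, 87, 234, 270),
    "low ability": (192, 263, 55, 80),
}

def rate(correct, total):
    return correct / total

# Within each group, keeping the initial answer beats changing it...
within = all(rate(kc, kt) > rate(cc, ct) for kc, kt, cc, ct in groups.values())

# ...but pooled over groups, changing appears to beat keeping.
kc, kt, cc, ct = (sum(v[i] for v in groups.values()) for i in range(4))
pooled_reversed = rate(kc, kt) < rate(cc, ct)
```

The reversal arises because the groups contribute very different numbers of kept and changed answers, exactly the aggregation effect the abstract describes.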
ABSTRACT: Two methods of local linear observed-score equating for use with anchor-test and single-group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed-score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor-test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single-group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression.
Article · Sep 2011 · Journal of Educational Measurement
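The traditional linear equating that serves as the comparison above can be sketched as follows. This is only the classical transformation that matches means and standard deviations; the local methods proposed in the article additionally condition on anchor or group information:

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Classical linear observed-score equating: map a score x on form X
    to the scale of form Y by matching means and standard deviations."""
    mu_x, sd_x = np.mean(scores_x), np.std(scores_x, ddof=1)
    mu_y, sd_y = np.mean(scores_y), np.std(scores_y, ddof=1)
    return mu_y + (sd_y / sd_x) * (x - mu_x)
```

A score at the mean of form X is always mapped to the mean of form Y, which is the sense in which the method equates the two score distributions up to their first two moments.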
ABSTRACT: This article reviews the use of the software program lp_solve version 5.5 for solving mixed-integer automated test assembly (ATA) problems. The program is freely available under Lesser General Public License 2 (LGPL2). It can be called from the statistical language R using the lpSolveAPI interface. Three empirical problems are presented to demonstrate how to use the program and interface to (a) simultaneously assemble multiple test forms with absolute targets for their test information functions, (b) assemble shadow tests for computerized adaptive testing, and (c) assemble multistage tests using relative targets for their test information functions, all subject to various quantitative and categorical constraints. The results of this study indicate that it is now possible for researchers and testing organizations to implement ATA for small to moderately sized test assembly problems using free software.
Article · Jun 2011 · Applied Psychological Measurement
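The article's examples run lp_solve from R via lpSolveAPI; as an analogous toy sketch, the same kind of ATA problem can be posed with SciPy's mixed-integer solver instead (the item information values and content indicators below are made up for illustration):

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Hypothetical mini-bank: item information at a target ability level and a
# 0-1 indicator of membership in content area A (illustrative values only).
info = np.array([0.9, 0.7, 0.8, 0.4, 0.6, 0.5])
area_a = np.array([1, 1, 0, 0, 1, 0])

constraints = [
    LinearConstraint(np.ones(6), 3, 3),  # test length: exactly 3 items
    LinearConstraint(area_a, 1, 2),      # 1 to 2 items from content area A
]
# Maximize total information = minimize its negative over 0-1 variables.
res = milp(-info, integrality=np.ones(6), bounds=Bounds(0, 1),
           constraints=constraints)
selected = np.flatnonzero(res.x > 0.5)   # indices of the selected items
```

The structure is the same as in the lp_solve formulations: a linear objective on 0-1 decision variables with linear quantitative and categorical constraints.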
ABSTRACT: In automated test assembly (ATA), the methodology of mixed-integer programming is used to select test items from an item bank to meet the specifications for a desired test form and optimize its measurement accuracy. The same methodology can be used to automate the formatting of the set of selected items into the actual test form. Three different cases are discussed: (i) computerized test forms in which the items are presented on a screen one at a time and only their optimal order has to be determined; (ii) paper forms in which the items need to be ordered and paginated and the typical goal is to minimize paper use; and (iii) published test forms with the same requirements but a more sophisticated layout (e.g., double-column print). For each case, a menu of possible test-form specifications is identified, and it is shown how they can be modeled as linear constraints using 0-1 decision variables. The methodology is demonstrated using two empirical examples.
Article · Jun 2011 · Journal of Educational Measurement
ABSTRACT: It is shown how the time limit on a test can be set to control the probability of a test taker running out of time before completing it. The probability is derived from the item parameters in the lognormal model for response times. Examples of curves representing the probability of running out of time on a test with given parameters as a function of the time limit are presented. Unlike the traditional methods of dealing with test speededness, which assess the degree of speededness after the test has been administered, the curves enable us to set a desired degree in advance. The method is demonstrated using an empirical data set.
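The article derives this probability from the item parameters analytically; a hedged Monte Carlo sketch of the same quantity, assuming the usual parameterization of the lognormal response-time model (log T_i normal with mean beta_i - tau and standard deviation 1/alpha_i) and illustrative names, is:

```python
import numpy as np

def prob_out_of_time(alpha, beta, tau, limit, n_sim=100_000, seed=1):
    """Approximate the probability that total time on the test exceeds the
    limit, under the lognormal response-time model:
    log T_i ~ Normal(beta_i - tau, 1/alpha_i**2) for each item i.
    (Monte Carlo sketch; not the article's analytical derivation.)"""
    rng = np.random.default_rng(seed)
    mu = np.asarray(beta) - tau            # time intensities minus speed
    sigma = 1.0 / np.asarray(alpha)        # discriminations -> log-time sds
    log_times = rng.normal(mu, sigma, size=(n_sim, len(mu)))
    total = np.exp(log_times).sum(axis=1)  # total test time per replication
    return float((total > limit).mean())
```

Plotting this probability against `limit` for fixed item parameters reproduces the kind of curves the abstract describes, so a limit can be read off for any desired degree of speededness in advance.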
ABSTRACT: An application of a hierarchical IRT model for items in families generated through the application of different combinations of design rules is discussed. Within the families, the items are assumed to differ only in surface features. The parameters of the model are estimated in a Bayesian framework, using a data-augmented Gibbs sampler. An obvious application of the model is computerized algorithmic item generation. Such algorithms have the potential to increase the cost-effectiveness of item generation as well as the flexibility of item administration. The model is applied to data from a non-verbal intelligence test created using design rules. In addition, results from a simulation study conducted to evaluate parameter recovery are presented.
Keywords: hierarchical modeling · item generation · item response theory · Markov chain Monte Carlo method
ABSTRACT: A critical component of test speededness is the distribution of the test taker’s total time on the test. A simple set of constraints on the item parameters in the lognormal model for response times is derived that can be used to control the distribution when assembling a new test form. As the constraints are linear in the item parameters, they can easily be included in a mixed integer programming model for test assembly. The use of the constraints is demonstrated for the problems of assembling a new test form to be equally speeded as a reference form, test assembly in which the impact of a change in the content specifications on speededness is to be neutralized, and the assembly of test forms with a revised level of speededness.
Article · Feb 2011 · Journal of Educational Measurement
ABSTRACT: One of the highlights in the observed-score equating literature is a theorem by Lord in his 1980 monograph, Applications of Item Response Theory to Practical Testing Problems. The theorem states that observed scores on two different tests cannot be equated unless the scores are perfectly reliable or the forms are strictly parallel (Lord, 1980, Chapter 13, Theorem 13.3.1). Because the first condition is impossible and equating under the second condition is unnecessary, the theorem is rather sobering.