Wim J. van der Linden

CTB/McGraw-Hill LLC., Monterey, California, United States


Publications (122) · 136.61 Total Impact Points

  • Wim J van der Linden, Charles Lewis
    ABSTRACT: Posterior odds of cheating on achievement tests are presented as an alternative to p values reported for statistical hypothesis testing for several of the probabilistic models in the literature on the detection of cheating. It is shown how to calculate their combinatorial expressions with the help of a reformulation of the simple recursive algorithm for the calculation of number-correct score distributions used throughout the testing industry. Using the odds avoids the arbitrary choice between statistical tests of answer copying that do and do not condition on the responses the test taker is suspected to have copied and allows the testing agency to account for existing circumstantial evidence of cheating through the specification of prior odds.
    Psychometrika 06/2014; · 2.21 Impact Factor
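    The recursion referred to in this abstract is the standard one for compounding item-level success probabilities into a number-correct score distribution. The sketch below illustrates only that basic recursion, not the paper's reformulation for posterior odds; the function name and example probabilities are mine.
    ```python
    # Minimal sketch of the standard recursion for the number-correct score
    # distribution, given per-item success probabilities p_i. The paper's
    # reformulation for posterior odds of cheating is not reproduced here.
    def number_correct_distribution(p):
        """Return [P(X = 0), ..., P(X = n)] for n = len(p) items."""
        dist = [1.0]  # with zero items, P(X = 0) = 1
        for p_i in p:
            new = [0.0] * (len(dist) + 1)
            for k, prob in enumerate(dist):
                new[k] += prob * (1.0 - p_i)   # item answered incorrectly
                new[k + 1] += prob * p_i       # item answered correctly
            dist = new
        return dist

    print(number_correct_distribution([0.8, 0.6, 0.7]))  # probabilities sum to 1
    ```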
  • Marie Wiberg, Wim J. van der Linden, Alina A. von Davier
    ABSTRACT: Three local observed-score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias—as defined by Lord's criterion of equity—and percent relative error. The local kernel item response theory observed-score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts.
    Journal of Educational Measurement 03/2014; 51(1). · 1.00 Impact Factor
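    As a reminder of the framework these methods combine (my notation, not the article's): the equipercentile observed-score equating function maps scores through the two forms' distribution functions, and local equating replaces those distributions with conditional ones at the test taker's ability.
    ```latex
    % Generic observed-score equating function (kernel equating works with
    % smoothed versions of the distribution functions F_X and F_Y):
    \varphi(x) = F_Y^{-1}\bigl(F_X(x)\bigr)
    % Local equating replaces the marginal distributions by conditional ones,
    % giving a family of transformations indexed by ability \theta:
    \varphi(x;\theta) = F_{Y\mid\theta}^{-1}\bigl(F_{X\mid\theta}(x)\bigr)
    ```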
  • Wim J van der Linden, Hao Ren
    ABSTRACT: An optimal adaptive design for test-item calibration based on Bayesian optimality criteria is presented. The design adapts the choice of field-test items to the examinees taking an operational adaptive test using both the information in the posterior distributions of their ability parameters and the current posterior distributions of the field-test parameters. Different criteria of optimality based on the two types of posterior distributions are possible. The design can be implemented using an MCMC scheme with alternating stages of sampling from the posterior distributions of the test takers' ability parameters and the parameters of the field-test items while reusing samples from earlier posterior distributions of the other parameters. Results from a simulation study demonstrated the feasibility of the proposed MCMC implementation for operational item calibration. A comparison of performances for different optimality criteria showed faster calibration of substantial numbers of items for the criterion of D-optimality relative to A-optimality, a special case of c-optimality, and random assignment of items to the test takers.
    Psychometrika 01/2014; · 2.21 Impact Factor
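    A heavily simplified sketch of the idea behind D-optimal assignment: the paper works with full posterior distributions and an MCMC scheme, whereas the code below uses fixed point estimates and the 2PL information matrix for a single field-test item, purely to show how a D-optimality criterion compares candidate examinees. All names and numbers are mine.
    ```python
    # Simplified sketch: assign a field-test item with current parameter
    # estimates (a, b) to the examinee whose response adds most to the
    # determinant of the accumulated information matrix for (a, b).
    import numpy as np

    def item_info_2pl(a, b, theta):
        """Fisher information about (a, b) from one response at ability theta."""
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        g = np.array([theta - b, -a])          # gradient of the logit w.r.t. (a, b)
        return p * (1.0 - p) * np.outer(g, g)

    def pick_examinee(a, b, thetas, accumulated):
        """Index of the ability estimate maximizing det(accumulated + new info)."""
        dets = [np.linalg.det(accumulated + item_info_2pl(a, b, th)) for th in thetas]
        return int(np.argmax(dets))

    accumulated = 0.1 * np.eye(2)              # small prior information, avoids singularity
    thetas = np.array([-1.5, -0.5, 0.0, 0.8, 2.0])
    print(pick_examinee(1.2, 0.4, thetas, accumulated))
    ```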
  • Wim J. van der Linden
    ABSTRACT: This article is a response to the commentaries on the position paper on observed-score equating by van der Linden (this issue). The response focuses on the more general issues in these commentaries, such as the nature of the observed scores that are equated, the importance of test-theory assumptions in equating, the necessity to use multiple equating transformations, and the choice of conditioning variables in equating.
    Journal of Educational Measurement 09/2013; 50(3). · 1.00 Impact Factor
  • Wim J. van der Linden
    ABSTRACT: In spite of all of the technical progress in observed-score equating, several of the more conceptual aspects of the process still are not well understood. As a result, the equating literature struggles with rather complex criteria of equating, lack of a test-theoretic foundation, confusing terminology, and ad hoc analyses. A return to Lord's foundational criterion of equity of equating, a derivation of the true equating transformation from it, and mainstream statistical treatment of the problem of estimating the transformation for various data-collection designs are presented as a solution to these problems.
    Journal of Educational Measurement 09/2013; 50(3). · 1.00 Impact Factor
  • Wim J. van der Linden, Xinhui Xiong
    ABSTRACT: Two simple constraints on the item parameters in a response–time model are proposed to control the speededness of an adaptive test. As the constraints are additive, they can easily be included in the constraint set for a shadow-test approach (STA) to adaptive testing. Alternatively, a simple heuristic is presented to control speededness in plain adaptive testing without any constraints. Both types of control are easy to implement and do not require any other real-time parameter estimation during the test than the regular update of the test taker’s ability estimate. Evaluation of the two approaches using simulated adaptive testing showed that the STA was especially effective. It guaranteed testing times that differed less than 10 seconds from a reference test across a variety of conditions.
    Journal of Educational and Behavioral Statistics 08/2013; 38(4):418-438. · 1.07 Impact Factor
  • Qi Diao, Wim J. van der Linden
    ABSTRACT: Automated test assembly uses the methodology of mixed integer programming to select an optimal set of items from an item bank. Automated test-form generation uses the same methodology to optimally order the items and format the test form. From an optimization point of view, production of fully formatted test forms directly from the item pool using a simultaneous optimization model is more attractive than any of the current, more time-consuming two-stage processes. The goal of this study was to provide such simultaneous models both for computer-delivered and paper forms, as well as explore their performances relative to two-stage optimization. Empirical examples are presented to show that it is possible to automatically produce fully formatted optimal test forms directly from item pools up to some 2,000 items on a regular PC in realistic times.
    Applied Psychological Measurement 07/2013; 37(5):361-374. · 1.49 Impact Factor
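    For readers unfamiliar with the methodology, a basic automated test-assembly model in mixed integer programming form is sketched below. It covers item selection only; the simultaneous model in the article adds ordering and formatting decisions. The notation is illustrative, not the article's.
    ```latex
    % Basic automated test-assembly model (item selection only). Decision
    % variable x_i = 1 if item i is selected from a pool of I items.
    \max \sum_{i=1}^{I} I_i(\theta_0)\, x_i
    \quad\text{subject to}\quad
    \sum_{i=1}^{I} x_i = n, \qquad
    \sum_{i \in V_c} x_i \le n_c \quad \text{for each content category } c, \qquad
    x_i \in \{0,1\}.
    ```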
  • Hanneke Geerlings, Wim J. van der Linden, Cees A. W. Glas
    ABSTRACT: Optimal test-design methods are applied to rule-based item generation. Three different cases of automated test design are presented: (a) test assembly from a pool of pregenerated, calibrated items; (b) test generation on the fly from a pool of calibrated item families; and (c) test generation on the fly directly from calibrated features defining the item families. The last two cases do not assume any item calibration under a regular response theory model; instead, entire item families or critical features of them are assumed to be calibrated using a hierarchical response model developed for rule-based item generation. The test-design models maximize an expected version of the Fisher information in the test and control critical attributes of the test forms through explicit constraints. Results from a study with simulated response data highlight both the effects of within-family item-parameter variability and the severity of the constraint sets in the test-design models on their optimal solutions.
    Applied Psychological Measurement 03/2013; 37(2):140-161. · 1.49 Impact Factor
  • Wim J. van der Linden, Minjeong Jeon
    ABSTRACT: The probability of test takers changing answers upon review of their initial choices is modeled. The primary purpose of the model is to check erasures on answer sheets recorded by an optical scanner for numbers and patterns that may be indicative of irregular behavior, such as teachers or school administrators changing answer sheets after their students have finished the test or test takers communicating with each other about their initial responses. A statistical test based on the number of erasures is derived from the model. In addition, it is shown how to analyze the residuals under the model to check for suspicious patterns of erasures. The use of the two procedures is illustrated for an empirical data set from a large-scale assessment. The robustness of the model with respect to less than optimal opportunities for regular test takers to review their responses is investigated.
    Journal of Educational and Behavioral Statistics 01/2012; 37(1):180-199. · 1.07 Impact Factor
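    The statistical test in the paper is derived from its model for answer changes; the sketch below is only a crude stand-in that treats the number of wrong-to-right erasures as binomial under a hypothetical null rate and computes an upper-tail probability. The null rate and counts are made up.
    ```python
    # Crude illustrative check (not the paper's model): upper-tail probability of
    # observing at least k wrong-to-right erasures out of m erasures, under a
    # hypothetical null probability p0 of an erasure being wrong-to-right.
    from math import comb

    def upper_tail_binomial(k, m, p0):
        """P(K >= k) for K ~ Binomial(m, p0)."""
        return sum(comb(m, j) * p0**j * (1 - p0)**(m - j) for j in range(k, m + 1))

    print(upper_tail_binomial(k=12, m=15, p0=0.5))  # a small value flags an unusual pattern
    ```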
  • Wim J. van der Linden
    ABSTRACT: The issue of compensation in multidimensional response modeling is addressed. We show that multidimensional response models are compensatory in their ability parameters if and only if they are monotone. In addition, a minimal set of assumptions is presented under which the MLEs of the ability parameters are also compensatory. In a recent series of articles, beginning with Hooker, Finkelman, and Schwartzman (2009) in this journal, the second type of compensation was presented as a paradoxical result for certain multidimensional response models, leading to occasional unfairness in maximum-likelihood test scoring. First, it is indicated that the compensation is not unique and holds generally for any multiparameter likelihood with monotone score functions. Second, we analyze why, in spite of its generality, the compensation may give the impression of a paradox or unfairness.
    Psychometrika 01/2012; 77(1):21-30. · 2.21 Impact Factor
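    For readers unfamiliar with the terminology: in a compensatory multidimensional model such as the two-dimensional logistic model below (my notation), the success probability depends on the abilities only through a linear combination, so a lower value on one dimension can be offset by a higher value on the other.
    ```latex
    % Two-dimensional compensatory logistic model (illustrative notation):
    P(U_i = 1 \mid \theta_1, \theta_2)
      = \frac{1}{1 + \exp\!\bigl(-(a_{i1}\theta_1 + a_{i2}\theta_2 - b_i)\bigr)}
    % The probability is constant along a_{i1}\theta_1 + a_{i2}\theta_2 = \text{const.},
    % which is what compensation between the ability parameters means.
    ```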
  • Wim J. van der Linden, Minjeong Jeon, Steve Ferrara
    ABSTRACT: According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.
    Journal of Educational Measurement 11/2011; 48(4):380 - 398. · 1.00 Impact Factor
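    A purely hypothetical numeric illustration of how such a reversal can arise: within each ability group, answers that were changed end up correct less often than answers that were retained, yet the pooled comparison points the other way because the two groups change answers at very different rates. The counts below are mine, not the study's data.
    ```python
    # Hypothetical counts illustrating Simpson's paradox: within each ability
    # group, changed answers are correct less often than retained ones, but the
    # pooled rates reverse that conclusion.
    groups = {
        "low ability":  {"retained": (1000, 500), "changed": (50, 20)},   # (n, n_correct)
        "high ability": {"retained": (200, 170),  "changed": (800, 640)},
    }

    totals = {"retained": [0, 0], "changed": [0, 0]}
    for name, g in groups.items():
        for kind in ("retained", "changed"):
            n, c = g[kind]
            totals[kind][0] += n
            totals[kind][1] += c
            print(f"{name:12s} {kind:8s}: {c / n:.1%} correct")

    for kind in ("retained", "changed"):
        n, c = totals[kind]
        print(f"pooled       {kind:8s}: {c / n:.1%} correct")
    ```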
  • Marie Wiberg, Wim J. van der Linden
    ABSTRACT: Two methods of local linear observed-score equating for use with anchor-test and single-group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed-score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor-test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single-group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression.
    Journal of Educational Measurement 09/2011; 48(3):229 - 254. · 1.00 Impact Factor
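    For reference, the traditional linear observed-score equating transformation that the new methods were compared against is shown below (my notation); the local linear version conditions the moments on the anchor score or an ability estimate.
    ```latex
    % Traditional linear equating of a score x on form X to the scale of form Y:
    \varphi(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)
    % Local linear version: condition the moments on the anchor score a (or on
    % an ability estimate):
    \varphi(x; a) = \mu_{Y\mid a} + \frac{\sigma_{Y\mid a}}{\sigma_{X\mid a}}\,(x - \mu_{X\mid a})
    ```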
  • Wim J. van der Linden, Qi Diao
    ABSTRACT: In automated test assembly (ATA), the methodology of mixed-integer programming is used to select test items from an item bank to meet the specifications for a desired test form and optimize its measurement accuracy. The same methodology can be used to automate the formatting of the set of selected items into the actual test form. Three different cases are discussed: (i) computerized test forms in which the items are presented on a screen one at a time and only their optimal order has to be determined; (ii) paper forms in which the items need to be ordered and paginated and the typical goal is to minimize paper use; and (iii) published test forms with the same requirements but a more sophisticated layout (e.g., double-column print). For each case, a menu of possible test-form specifications is identified, and it is shown how they can be modeled as linear constraints using 0–1 decision variables. The methodology is demonstrated using two empirical examples.
    Journal of Educational Measurement 06/2011; 48(2):206-222. · 1.00 Impact Factor
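    One elementary example of the kind of 0–1 constraint set used for ordering (my notation, much simpler than the article's models): with x_{ip} = 1 when item i is placed at position p, each item receives exactly one position and each position holds at most one item.
    ```latex
    % Elementary ordering constraints with 0-1 variables x_{ip} (illustrative only):
    \sum_{p=1}^{P} x_{ip} = 1 \quad\text{for every item } i, \qquad
    \sum_{i=1}^{I} x_{ip} \le 1 \quad\text{for every position } p, \qquad
    x_{ip} \in \{0,1\}.
    ```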
  • Wim J. van der Linden
    ABSTRACT: A critical component of test speededness is the distribution of the test taker’s total time on the test. A simple set of constraints on the item parameters in the lognormal model for response times is derived that can be used to control the distribution when assembling a new test form. As the constraints are linear in the item parameters, they can easily be included in a mixed integer programming model for test assembly. The use of the constraints is demonstrated for the problems of assembling a new test form to be equally speeded as a reference form, test assembly in which the impact of a change in the content specifications on speededness is to be neutralized, and the assembly of test forms with a revised level of speededness.
    Journal of Educational Measurement 02/2011; 48(1):44 - 60. · 1.00 Impact Factor
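    Under the lognormal response-time model, the mean time a test taker of average speed spends on item i is exp(β_i + 1/(2α_i²)). One example of a constraint linear in the decision variables that bounds the expected total time is sketched below; the notation is mine, and the article derives the full set of constraints controlling the total-time distribution.
    ```latex
    % Example of one linear speededness constraint for an assembly model with
    % decision variables x_i (illustrative only):
    \sum_{i=1}^{I} \exp\!\Bigl(\beta_i + \tfrac{1}{2\alpha_i^{2}}\Bigr)\, x_i \le t_{\max},
    \qquad x_i \in \{0,1\},
    % with \beta_i and \alpha_i the time-intensity and discrimination parameters
    % of the lognormal response-time model.
    ```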
  • Wim J. van der Linden
    ABSTRACT: It is shown how the time limit on a test can be set to control the probability of a test taker running out of time before completing it. The probability is derived from the item parameters in the lognormal model for response times. Examples of curves representing the probability of running out of time on a test with given parameters as a function of the time limit are presented. Unlike the traditional methods of dealing with test speededness, which assess the degree of speededness after the test has been administered, the curves enable us to set a desired degree in advance. The method is demonstrated using an empirical data set.
    Applied Psychological Measurement 01/2011; 35(3):183-199. · 1.49 Impact Factor
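    One simple way to approximate such a probability, assuming independent lognormal item times and a normal approximation to their sum (the article's own derivation may differ), is sketched below. The item parameters are hypothetical.
    ```python
    # Approximate P(total time > limit) for a test taker of average speed,
    # assuming independent lognormal item times and a normal approximation to
    # their sum. Item parameters (alpha_i, beta_i) below are hypothetical.
    from math import exp, erf, sqrt

    def prob_running_out(items, limit):
        """items: list of (alpha, beta) per item; limit: time limit in seconds."""
        means = [exp(b + 0.5 / a**2) for a, b in items]
        variances = [(exp(1.0 / a**2) - 1.0) * exp(2 * b + 1.0 / a**2) for a, b in items]
        mu, sigma = sum(means), sqrt(sum(variances))
        z = (limit - mu) / sigma
        return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # P(T_total > limit)

    items = [(2.0, 4.0)] * 30                            # 30 items, mean time about 62 s each
    print(prob_running_out(items, limit=30 * 60))        # 30-minute limit
    ```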
  • Hanneke Geerlings, Cees A. W. Glas, Wim J. van der Linden
    ABSTRACT: An application of a hierarchical IRT model for items in families generated through the application of different combinations of design rules is discussed. Within the families, the items are assumed to differ only in surface features. The parameters of the model are estimated in a Bayesian framework, using a data-augmented Gibbs sampler. An obvious application of the model is computerized algorithmic item generation. Such algorithms have the potential to increase the cost-effectiveness of item generation as well as the flexibility of item administration. The model is applied to data from a non-verbal intelligence test created using design rules. In addition, results from a simulation study conducted to evaluate parameter recovery are presented. Keywords: hierarchical modeling, item generation, item response theory, Markov chain Monte Carlo method
    Psychometrika 01/2011; 76(2):337-359. · 2.21 Impact Factor
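    The core of such a hierarchical family model can be sketched as follows (my notation, not the article's exact specification): responses follow an IRT model given an item's parameter vector, and the parameter vector of an item in family f is drawn from a family-level multivariate normal distribution.
    ```latex
    % Hierarchical structure for item families (illustrative notation):
    U_{ji} \mid \theta_j, \xi_i \sim \text{IRT model}, \qquad
    \xi_i \mid f(i) \sim N\bigl(\mu_{f(i)}, \Sigma_{f(i)}\bigr),
    % so within-family variation of the item parameters \xi_i is captured by
    % \Sigma_{f(i)} and the family itself is characterized by \mu_{f(i)}.
    ```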
  • Qi Diao, Wim J. van der Linden
    ABSTRACT: This article reviews the use of the software program lp_solve version 5.5 for solving mixed-integer automated test assembly (ATA) problems. The program is freely available under Lesser General Public License 2 (LGPL2). It can be called from the statistical language R using the lpSolveAPI interface. Three empirical problems are presented to demonstrate how to use the program and interface to (a) simultaneously assemble multiple test forms with absolute targets for their test information functions, (b) assemble shadow tests for computerized adaptive testing, and (c) assemble multistage tests using relative targets for their test information functions, all subject to various quantitative and categorical constraints. The results of this study indicate that it is now possible for researchers and testing organizations to implement ATA for small to moderately sized test assembly problems using free software.
    Applied Psychological Measurement 01/2011; 35(5):398-409. · 1.49 Impact Factor
  • Cees A. W. Glas, Wim J. van der Linden
    ABSTRACT: Marginal maximum-likelihood procedures for parameter estimation and testing the fit of a hierarchical model for speed and accuracy on test items are presented. The model is a composition of two first-level models for dichotomous responses and response times along with multivariate normal models for their item and person parameters. It is shown how the item parameters can easily be estimated using Fisher's identity. To test the fit of the model, Lagrange multiplier tests of the assumptions of subpopulation invariance of the item parameters (i.e., no differential item functioning), the shape of the response functions, and three different types of conditional independence were derived. Simulation studies were used to show the feasibility of the estimation and testing procedures and to estimate the power and Type I error rate of the latter. In addition, the procedures were applied to an empirical data set from a computerized adaptive test of language comprehension.
    British Journal of Mathematical and Statistical Psychology 11/2010; 63(Pt 3):603-26. · 1.26 Impact Factor
  • Wim J. van der Linden
    ABSTRACT: Although response times on test items are recorded on a natural scale, the scale for some of the parameters in the lognormal response-time model (van der Linden, 2006) is not fixed. As a result, when the model is used to periodically calibrate new items in a testing program, the parameter estimates are not automatically mapped onto a common scale. Several combinations of linking designs and procedures for the lognormal model are examined that do map parameter estimates onto a common scale. For each of the designs, the standard error of linking is derived. The results are illustrated using examples with simulated data.
    Journal of Educational Measurement 02/2010; 47(1):92 - 114. · 1.00 Impact Factor
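    As one simple illustration of the general idea (assuming, as a simplification, that only the origin of the time-intensity scale is undetermined between calibrations): an additive linking constant can be estimated from items common to the two calibrations and applied to the new estimates. The article's linking designs and standard errors of linking are not reproduced here; all numbers below are hypothetical.
    ```python
    # Simplified sketch of additive, mean-based linking of time-intensity (beta)
    # estimates, assuming only the origin of the scale differs between
    # calibrations. Estimates below are hypothetical.
    def additive_linking_constant(beta_old_common, beta_new_common):
        """Shift that maps new-calibration betas onto the old scale."""
        mean = lambda xs: sum(xs) / len(xs)
        return mean(beta_old_common) - mean(beta_new_common)

    beta_old_common = [3.9, 4.2, 4.0, 4.4]   # common items, old calibration
    beta_new_common = [3.6, 3.9, 3.8, 4.1]   # same items, new calibration
    c = additive_linking_constant(beta_old_common, beta_new_common)
    print(4.0 + c)                            # a new item's beta on the old scale
    ```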
  • Wim J. van der Linden
    Measurement: Interdisciplinary Research and Perspectives 01/2010; 8(1):21-26.

Publication Stats

893 Citations
136.61 Total Impact Points

Institutions

  • 2009–2014
    • CTB/McGraw-Hill LLC.
      Monterey, California, United States
  • 1991–2010
    • Universiteit Twente
      • Department of Research Methodology, Measurement and Data Analysis (OMD)
      Enschede, Provincie Overijssel, Netherlands
  • 2008
    • University of Massachusetts Amherst
      Amherst Center, Massachusetts, United States
  • 1982
    • University of Amsterdam
      Amsterdam, North Holland, Netherlands