Applied Psychological Measurement

Published by SAGE Publications
Online ISSN: 0146-6216
Publications
Article
A difficult result to interpret in Computerized Adaptive Tests (CATs) occurs when an ability estimate initially drops and then ascends continuously until the test ends, suggesting that the true ability may be higher than implied by the final estimate. We explain why this asymmetry occurs and show that early mistakes by high ability students can lead to considerable underestimation, even in tests with 45 items. The opposite response pattern, where low ability students start with lucky guesses, leads to much less bias. We show that using Barton and Lord's (1981) four-parameter model and a less informative prior can lower bias and RMSE for high ability students with a poor start, as the CAT algorithm ascends more quickly after initial underperformance. We also show that the 4PM slightly outperforms a CAT in which less discriminating items are initially used. The practical implications and relevance for psychological measurement more generally are discussed.
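The role of the upper asymptote can be illustrated with a minimal sketch, assuming the common parameterization of the four-parameter logistic model, P(θ) = c + (d − c) / (1 + exp(−a(θ − b))). The function names and all parameter values below are hypothetical and are not taken from the article. The sketch compares the log-likelihood penalty that a single early mistake on an easy item imposes on a high ability level when d = 1 (three-parameter case) and when d is slightly below 1.

```python
import math

def p_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability of a correct response under the four-parameter logistic model.

    a: discrimination, b: difficulty, c: lower asymptote, d: upper asymptote.
    Setting d = 1 gives the three-parameter model.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def log_lik_incorrect(theta, a, b, c=0.0, d=1.0):
    """Log-likelihood contribution of one incorrect response at ability theta."""
    return math.log(1.0 - p_4pl(theta, a, b, c, d))

# Hypothetical easy item (b = -1) answered incorrectly; evaluate at theta = 3.
penalty_3pl = log_lik_incorrect(3.0, a=1.5, b=-1.0, c=0.2, d=1.0)
penalty_4pl = log_lik_incorrect(3.0, a=1.5, b=-1.0, c=0.2, d=0.98)

# The 4PL penalty is less negative, so a high theta stays more plausible
# after the mistake and the estimate can recover faster.
print(penalty_3pl, penalty_4pl)
```

Under these illustrative values the mistake is far less damaging to a high ability level when d < 1, which is the mechanism the abstract attributes to the 4PM.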
 
Article
Item selection is a core component in computerized adaptive testing (CAT). Several studies have evaluated new and classical selection methods; however, the few that have applied such methods to the use of polytomous items have reported conflicting results. To clarify these discrepancies and further investigate selection method properties, six different selection methods are compared systematically. The results showed no clear benefit from more sophisticated selection criteria and showed one method previously believed to be superior, the maximum expected posterior weighted information (MEPWI), to be mathematically equivalent to a simpler method, the maximum posterior weighted information (MPWI).
 
Article
The recent surge of interest in cognitive assessment has led to the development of novel statistical models for diagnostic classification. Central to many such models is the well-known Q-matrix, which specifies the item-attribute relationships. This article proposes a data-driven approach to identification of the Q-matrix and estimation of related model parameters. A key ingredient is a flexible T-matrix that relates the Q-matrix to response patterns. The flexibility of the T-matrix allows the construction of a natural criterion function as well as a computationally amenable algorithm. Simulation results are presented to demonstrate the usefulness and applicability of the proposed method. An extension to handling a Q-matrix with partial information is presented. The proposed method also provides a platform on which important statistical issues, such as hypothesis testing and model selection, may be formally addressed.
 
Article
We describe the effective lagrangian approach to color superconductivity. The effective description that arises if one considers only the leading terms in the expansion for very high densities is particularly simple. It is based on a lagrangian whose effective fermion fields are velocity-dependent; moreover, strong interactions do not change quark velocity and the effective lagrangian does not contain spin matrices. All these features render the effective theory similar to the Heavy Quark Effective Theory, which is the limit of Quantum ChromoDynamics for infinite quark masses. For this reason one can refer to the effective lagrangian at high density as the High Density Effective Theory (HDET). In some cases HDET results in analytical, though approximate, relations that are particularly simple to handle. After a pedagogical introduction, several topics are considered. They include the treatment of the Color-Flavor-Locking and the 2SC model, with evaluation of the gap parameters by the Nambu-Gorkov equations, approximate dispersion laws for the gluons, and calculations of the properties of the Nambu-Goldstone bosons. We also discuss the effective lagrangian for the crystalline color superconductive (LOFF) phase and we give a description of the phonon field related to the breaking of the rotational and translational invariance. Finally, a few astrophysical applications of color superconductivity are discussed.
 
Article
Rating scale items are ubiquitous in psychometric practice. Yet, the psychometric properties of the rating scales can often vary by examinee, as well as by item. To address this practical psychometric problem, we introduce a novel, Bayesian nonparametric IRT model for rating scale items. The model is an infinite-mixture of Rasch partial credit models, with rating thresholds being the random parameters that are subject to the mixture, and with (infinitely-many) covariate-dependent stick-breaking weights. Random parameters and the mixture weights are assigned a Dependent Dirichlet process prior (DDP) distribution. Thus, the novel model allows the rating category thresholds to vary flexibly across items and examinees, and allows the distribution of the category thresholds to vary flexibly as a function of covariates. We illustrate the novel model through the analysis of a real rating data set that has been studied extensively in the psychometric modeling literature. The model is shown to have better predictive performance than other IRT rating models of common usage.
 
Article
The Green-function technique termed the irreducible Green functions (IGF) method, which is a reformulation of the equation-of-motion method for double-time temperature-dependent Green functions, is presented. This method was developed to overcome some ambiguities in terminating the hierarchy of the equations of motion of double-time Green functions and to provide a workable, systematic technique for decoupling. The approach provides a practical method for description of the many-body quasi-particle dynamics of correlated systems on a lattice with complex spectra. Moreover, it provides a very compact and self-consistent way of taking into account the damping effects and finite lifetimes of quasi-particles due to inelastic collisions. In addition, it correctly defines the Generalized Mean Fields, which determine elastic scattering renormalizations and, in general, are not functionals of the mean particle densities only. Although some space is devoted to the formal structure of the method, the emphasis is on its utility. Applications to lattice fermion models such as the Hubbard/Anderson models and to the Heisenberg ferro- and antiferromagnet, which manifest the operational ability of the method, are given. It is shown that the IGF method provides a powerful tool for the construction of essentially new dynamical solutions for strongly interacting many-particle systems with complex spectra.
 
Article
Investigated the effect of response format on diagnostic assessment of students' performance on an algebra test. 231 8th and 9th graders (aged 14–25 yrs) in Israel were administered a test consisting of linear algebraic equations with 1 unknown to identify students' "bugs" (incorrect rules used to solve a problem). Two sets of parallel, open-ended (OE) items and a set of multiple-choice (MC) items that were stem-equivalent to 1 of the OE item sets were compared using a bug analysis and a rule-space analysis. Items with identical format (parallel OE items) were more similar than items with different formats (OE vs MC). Thus, OE provides a more valid measure than MC for the purpose of diagnostic assessment. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Reviews the conceptual differences between classical test theory and item response theory (IRT) and introduces 7 papers presented in this area that deal with new models, parameter estimation, and applications. Basic problems with classical test theory are discussed in light of IRT approaches to educational and psychological measurement. (14 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Investigated the accuracy of 3 exact person tests for assessing model-data fit in the Rasch model. Based on a simulation study, empirical Type I error rates and statistical power of the person tests were computed using known difficulty and θ parameters, as well as maximum likelihood estimates. Results indicate that the exact person test conditioned on total score is a promising tool for assessing consistency of response patterns with the Rasch model. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Investigated the degree to which subject matter experts could predict the difficulty and discrimination of items on the Test of Standard Written English. Despite an extended training period, the raters did not approach a high level of accuracy, nor were they able to pinpoint the factors that contribute to item difficulty and discrimination. Further research should attempt to uncover those factors by examining the items from a linguistic and psycholinguistic perspective. (12 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
The need for more realistic and richer forms of assessment in educational tests has led to the inclusion (in many tests) of polytomously scored items, multiple items based on a single stimulus (a "testlet"), and the increased use of a generalized mixture of binary and polytomous item formats. In this paper we extend earlier work (Bradlow, Wainer & Wang, 1999; Wainer, Bradlow & Du, 2000) on the modeling of testlet-based response data to include the situation in which a test is composed, partially or completely, of polytomously scored items and/or testlets. The model we propose, a modified version of commonly employed item response models, is embedded within a fully Bayesian framework, and inferences under the model are obtained using Markov chain Monte Carlo (MCMC) techniques. We demonstrate its use within a designed series of simulations and by analyzing operational data from the North Carolina Test of Computer Skills and the Educational Testing Service's Test of Spoken English...
 
Article
Studied properties of F. M. Lord's (1980) chi-square test of item bias in a computer simulation. Theta parameters were drawn from a standard normal distribution and responses to a 50-item test were generated using Scholastic Aptitude Test—Verbal item parameters estimated by Lord. 100 independent samples were generated under each of the 4 combinations of 2 sample sizes ( N = 1,000 and N = 250) and 2 logistic models (2- and 3-parameter). LOGIST was used to estimate item and person parameters simultaneously. For each of the 50 items, 50 independent chi-square tests of the equality of item parameters were calculated. The overall proportions significant were as high as 11 times the nominal alpha level. When person parameters were held fixed at their true values and only item parameters were estimated, the actual rejection rates were close to the nominal rates. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Conducted a parallel 2-stage estimation procedure, using 6 indices of response bias and the Mantel-Haenszel (MH) statistic, which requires less computing time and cost. Sample size, number of biased items, and magnitude of the bias were varied. The 2nd stage of the procedure did not identify substantial numbers of false positives. Identification of true positives in the 2nd stage was useful only when magnitude of the bias was not small and number of biased items was large (20% or 40% of the test). Weighted indices tended to identify more true and false positives than unweighted item response theory (IRT) indices. The MH statistic identified fewer false positives, but did not identify small bias as well as IRT indices. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
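For readers unfamiliar with the MH statistic, the following is a minimal sketch of the Mantel-Haenszel common odds ratio computed over score strata, together with its transformation to the delta scale (−2.35 times the natural log of the odds ratio). The function names and the counts are hypothetical; this is not the procedure or data used in the study.

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each stratum is a tuple (A, B, C, D):
      A = reference group correct, B = reference group incorrect,
      C = focal group correct,     D = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

def mh_delta(strata):
    """Odds ratio on the delta scale; negative values favor the reference group."""
    return -2.35 * math.log(mh_odds_ratio(strata))

# Hypothetical counts for two score strata.
example = [(40, 10, 30, 20), (30, 20, 20, 30)]
alpha_mh = mh_odds_ratio(example)   # 17 / 7
delta_mh = mh_delta(example)
```

An odds ratio of 1 (delta of 0) indicates no difference between groups after matching on score.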
 
Article
The oblique factor patterns of the Hiskey-Nebraska Test of Learning Ability (H-NTLA) suggest that the organization of subtest abilities is different for deaf and hearing children aged 3–10 yrs. Findings indicate that 2 of the H-NTLA subtests, Memory for Color and Block Patterns, assess different cognitive abilities in younger deaf and hearing examinees. Results are consistent with H. R. Mykelbust's (1964) organismic shift hypothesis which maintains that sensory deprivation alters the equilibrium and integration of perceptual and conceptual abilities. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Influence of Correlated Item Error (γ) on Reliability (ω) and Four Values of Internal Consistency (α1, α2, α3, and α4)
Article
The properties of internal consistency (α), classical reliability (ρ), and congeneric reliability (ω) for a composite test with correlated item error are analytically investigated. Possible sources of correlated item error are contextual effects, item bundles, and item models that ignore additional attributes or higher-order attributes. The relation between reliability and internal consistency is determined by the deviance from true-score equivalence. The influence of correlated item error on α, ρ, and ω is conveyed strictly through the total item error covariance. As the total item error covariance increases, ρ and ω decrease, but α increases. The necessary and sufficient condition for α to be a lower bound to ρ and to ω is that the total item error covariance not exceed the deviance from true-score equivalence. Coefficient α will uniformly exceed ρ or ω in true-score equivalent tests with positively correlated item error. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
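The direction of these effects can be checked numerically with a minimal sketch, assuming a true-score equivalent test (equal loadings on a single factor with unit variance), equal error variances, and one common error covariance for every item pair. The function names and numeric values are illustrative assumptions, not the article's derivation. Under these assumptions ρ and ω coincide, so a single reliability value is computed.

```python
def cov_matrix(k, loading, error_var, error_cov):
    """Model-implied item covariance matrix for k items.

    True-score part: loading**2 in every cell (factor variance fixed at 1).
    Error part: error_var on the diagonal, error_cov off the diagonal.
    """
    return [[loading * loading + (error_var if i == j else error_cov)
             for j in range(k)] for i in range(k)]

def coefficient_alpha(sigma):
    """Coefficient alpha computed from a covariance matrix."""
    k = len(sigma)
    total = sum(sum(row) for row in sigma)
    trace = sum(sigma[i][i] for i in range(k))
    return k / (k - 1) * (1.0 - trace / total)

def composite_reliability(k, loading, sigma):
    """True-score variance of the sum score divided by its total variance."""
    total = sum(sum(row) for row in sigma)
    return (k * loading) ** 2 / total

k, loading = 5, 1.0
no_corr = cov_matrix(k, loading, error_var=1.0, error_cov=0.0)
pos_corr = cov_matrix(k, loading, error_var=1.0, error_cov=0.2)

alpha_0, rel_0 = coefficient_alpha(no_corr), composite_reliability(k, loading, no_corr)
alpha_1, rel_1 = coefficient_alpha(pos_corr), composite_reliability(k, loading, pos_corr)
```

With uncorrelated errors alpha equals reliability in this setup; adding positive error covariance raises alpha while lowering reliability, so alpha is no longer a lower bound.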
 
Article
A computer-interactive multidimensional scaling (MDS) program (INTERSCAL) was used together with free response methods to represent and label dimensions of individual cognitive structure underlying person perception. INTERSCAL reduced by 40% the number of judgments required by each of 27 undergraduate and 4 high school student respondents over traditional complete judgment MDS methods. The dimensional structures derived by INTERSCAL were predictive of semantic differential type judgments, Repertory Grid Test triad judgments, and independent pair-comparison judgments. Typically, 1 or 2 dimensions were recovered and were labeled evaluative and potency dimensions, respectively. These dimensional structures were stable within Ss over a period of 10 wks. This pattern of overall consistency implies that particular characteristics of an individual's structure and changes in the relative location of the stimuli over time may be given serious consideration, and that INTERSCAL is an efficient method for scaling such dimensional structures. (29 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
DEPCORR is a computer program for comparing 2 dependent zero-order correlations that have one variable in common. DEPCORR uses SAS (SAS Institute Inc, 1990) to compute H. Hotelling's t (1940), E. J. Williams' t (1959), I. Olkin's z (1967), and X. L. Meng et al.'s z (1992). In addition, DEPCORR computes O. J. Dunn and V. A. Clark's z test and J. H. Steiger's modification of Dunn and Clark's z. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
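Two of these statistics can be sketched in a few lines. The following is a minimal illustration in Python, not DEPCORR's SAS code, and it assumes the commonly cited forms of Hotelling's t and Williams' t for comparing r12 with r13 when variable 1 is shared; both are referred to a t distribution with n − 3 degrees of freedom. The function names and the example correlations are hypothetical.

```python
import math

def det_r(r12, r13, r23):
    """Determinant of the 3 x 3 correlation matrix of the three variables."""
    return 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23

def hotelling_t(r12, r13, r23, n):
    """Hotelling's t for H0: rho12 = rho13 (df = n - 3)."""
    return (r12 - r13) * math.sqrt(
        (n - 3) * (1.0 + r23) / (2.0 * det_r(r12, r13, r23))
    )

def williams_t(r12, r13, r23, n):
    """Williams' t for H0: rho12 = rho13 (df = n - 3)."""
    r_bar = (r12 + r13) / 2.0
    denom = (2.0 * (n - 1) / (n - 3) * det_r(r12, r13, r23)
             + r_bar**2 * (1.0 - r23) ** 3)
    return (r12 - r13) * math.sqrt((n - 1) * (1.0 + r23) / denom)

# Hypothetical correlations from a sample of n = 100.
t_h = hotelling_t(0.5, 0.3, 0.4, 100)
t_w = williams_t(0.5, 0.3, 0.4, 100)
```

In this example Williams' t is slightly smaller in magnitude than Hotelling's t because of the extra term in its denominator.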
 
Article
The computer program LDIP provides indices of local dependence for polytomous items under item response theory (cf. Chen, 1998). The indices are the Pearson chi-square statistic Χ² (Agresti, 1996), the likelihood ratio chi-square statistic G² (Agresti, 1996), Yen's (1993) index of local dependence Q3, and the Fisher-transformed correlation difference statistic Zd (Press, Flannery, Teukolsky, & Vetterling, 1986). LDIP is used in conjunction with the two popular item response theory computer programs for polytomous items, MULTILOG (Thissen, Chen, & Bock, 2002) and PARSCALE (Muraki & Bock, 2002). LDIP obtains the indices of local dependence for the graded response model (Samejima, 1969) and the generalized partial credit model (Muraki, 1992) using two result files that contain item parameter estimates and ability estimates from either MULTILOG or PARSCALE. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Investigated whether individuals differ in their likelihood of eliciting consistent vs inconsistent responses. The present article represents a summary of an earlier study by the author and R. R. Jones (1969). The California Personality Inventory, the EPPS, and the MMPI were each administered on 2 separate occasions to 95 male and 108 female undergraduates. These instruments provided an opportunity to compare consistency measures from inventories differing in both content and response format. The 2 administrations of the same inventory were separated by 4 wks, during which time the other 2 inventories were given. 15 different item subsets were used to obtain measures of response variability vs consistency. Results suggest that the evidence regarding a trait of response consistency is equivocal. While consistency measures that possess substantial homogeneity and a significant degree of convergent validity can be constructed, such measures inevitably are confounded by variance attributable to the particular set of stimuli (items) used to elicit the responses, a problem shared with all other putative response sets and styles. (47 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Compares the rating scale model and the dual scaling method on common data sets to evaluate similarities and differences of both approaches to the analysis of Likert scale data. 326 Grade 12 females in the Arts, Commerce, and Science faculties of a junior college in Singapore were administered the Students' Liking for Computer-Related Activities scale using a 6-point ordered response Likert scale. The primary result was that conformity of responses to the item response theory model requirements and targeting of scale statements and their ordered response categories to the respondents in dual scaling are vital for a resolution of scaling problems. This study establishes the similarity of the 2 scaling methods. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Develops a multidimensional item response theory (IRT) model for polytomously scored items, based on F. Samejima's (1969, 1972) graded response model and using the normal ogive. The model is expressed both in factor-analytic and IRT parameters. An EM algorithm for estimation of the model parameters is presented, as well as analysis of data from the 1992 National Assessment of Educational Progress (Grade 4 main writing assessment, N = 9,136), performed by a computerized version of the algorithm. Advantages to the described full-information item factor procedure include (1) use of all information contained in the response category patterns before reducing the data to correlation coefficients and (2) being able to handle the case of matrix sampling of the examinee–item matrix. However, required computing resources restrict analyses to the use of 4 dimensions. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
Reviews the past 50 yrs of developments in multidimensional scaling (MDS) and summarizes the 6 papers in the present journal. Some common themes that run through these papers are identified: (1) The difficulty of handling large data sets poses an impediment to applied MDS research. (2) In applied areas, there is a movement away from studies that simply describe stimulus structures toward those that explicitly examine theories of structures. (42 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
 
Article
An item-selection algorithm to neutralize the differential effects of time limits on scores on computerized adaptive tests is proposed. The method is based on a statistical model for the response-time distributions of the examinees on items in the pool that is updated each time a new item has been administered. Predictions from the model are used as constraints in a 0-1 linear programming (LP) model for constrained adaptive testing that maximizes the accuracy of the ability estimator. The method is demonstrated empirically using an item pool from the Armed Services Vocational Aptitude Battery and the responses of 38,357 examinees. The empirical example suggests that the algorithm is able to reduce the speededness of the test for the examinees who otherwise would have suffered from the time limit. Also, the algorithm did not seem to introduce any differential effects on the statistical properties of the theta estimator. (Contains 9 figures and 14 references.) (SLD)
 
Type I error rate for DIMPACK and DIMTEST_DOS with 2PL data. Note: 2PL = two-parameter logistic model; AT = assessment subtest.
Article
DIMPACK Version 1.0 for assessing test dimensionality based on a nonparametric conditional covariance approach is reviewed. This software was originally distributed by Assessment Systems Corporation and now can be freely accessed online. The software consists of Windows-based interfaces of three components: DIMTEST, DETECT, and CCPROX/HAC, which conduct a hypothesis test for unidimensionality, cluster items, and perform hierarchical cluster analysis, respectively. Two simulation studies were conducted to evaluate the software in confirming test unidimensionality (a Type I error study) and detecting multidimensionality (a statistical power study). The results suggested that the data used to select assessment subtest items should always differ from the data used to calculate the DIMTEST statistic; otherwise, the Type I error rate was excessively inflated. The statistical power was found to be low when sample size was small or the dimensions were highly correlated. It is suggested that some major changes be made to the software before it can be used successfully by practitioners.
 
Article
Thirty tests from the 1955 edition of Cattell's Objective-Analytic (O-A) Test Battery, plus Forms A and B of the Sixteen Personality Factor Questionnaire (16PF), were administered to 82 male undergraduates. In addition, each subject was rated by 7 to 11 close associates on each of 20 bipolar rating scales, 4 scales tapping each of 5 peer-rating factors. These peer ratings were used as criterion variables to be predicted by the 16PF scales and by the O-A Battery. The O-A Battery measures were slightly more highly related to one peer-rating factor (Culture); the 16PF scales were slightly more highly related to another (Conscientiousness); and the two sets of test variables were essentially equivalent in predicting the other three factors (two of which showed no significant relationships with either instrument). The lack of any consistent superiority of the objective test scores over the questionnaire scales, coupled with some criticisms of the objective tests on purely logical grounds, should make one cautious in accepting the claims being made for the comparative validity of the O-A Battery. Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/66876/2/10.1177_014662168000400205.pdf
 
Article
This article reviews a new item response theory (IRT) model estimation program, IRTPRO 2.1, for Windows that is capable of unidimensional and multidimensional IRT model estimation for existing and user-specified constrained IRT models for dichotomously and polytomously scored item response data.
 
Article
Guessing behavior is an issue discussed widely with regard to multiple choice tests. Its primary effect is on number-correct scores for examinees at lower levels of proficiency. This is a systematic error or bias, which increases observed test scores. Guessing also can inflate random error variance. Correction or adjustment for guessing formulas has been applied to address some of these issues. The purpose of this research comment is to draw attention to the adjustment for guessing implicit in the three-parameter logistic item response theory model. Potential equity issues also arise with respect to this adjustment.
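The adjustment can be made concrete with a minimal sketch, assuming the standard three-parameter logistic form P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))). The function names and parameter values are hypothetical. Rearranging the model shows that (P − c) / (1 − c) recovers the two-parameter probability, which has the same structure as a classical correction for guessing.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response (no guessing)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability with lower asymptote c."""
    return c + (1.0 - c) * p_2pl(theta, a, b)

def guessing_adjusted(p, c):
    """Remove the guessing floor from a 3PL probability."""
    return (p - c) / (1.0 - c)

# Hypothetical four-option item with c = 0.25, evaluated for a low-ability examinee.
theta, a, b, c = -2.0, 1.2, 0.0, 0.25
p_observed = p_3pl(theta, a, b, c)
p_adjusted = guessing_adjusted(p_observed, c)
```

For low θ the observed probability stays near c while the adjusted probability falls toward zero, which is where the effect of guessing on scores is concentrated.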
 
Article
Using a model followed in earlier research, correlations were computed between undergraduate and graduate grade-point averages as well as between these and standard graduate and professional school tests. Approximately 1200 law school students constituted a professional school sample and another 1200 students in mathematics, physics, and chemistry constituted a graduate school sample. Earlier findings were replicated. In addition, it is shown that both graduate and professional school grades form simplex matrices and that early grades are more highly predictable from aptitude tests than later grades. There is evidence for a single simplex matrix extending through the four undergraduate and three post-graduate years only in the law school sample. There are two separate simplex matrices for the two levels in the graduate school sample. Correlations between test scores and undergraduate grades are biased to very low values in the professional school sample by a compensatory selection system, but both aptitude and achievement tests are clearly more highly correlated with freshman than with senior grades in the graduate school sample. In this sample, however, the advanced test in the discipline is more highly correlated with first year graduate grades than with senior grades. These data suggest that the first year in a new academic learning situation represents a greater intellectual challenge than subsequent years.
 
Article
The problem of factor score indeterminacy implies that the factor and the error scores cannot be completely disentangled in the factor model. It is therefore proposed to compute Harman’s factor score predictor that contains an additive combination of factor and error variance. This additive combination is discussed in the framework of classical test theory. On this basis, a definition of reliability, standard error of measurement, and confidence intervals for the factor score predictor are proposed. It is argued that factor score predictor intervals should be used instead of single score predictors to account for the error term in the factor model. The calculation of reliabilities and factor score predictor intervals is illustrated by means of a small simulation study and an empirical example.
 
Article
Within the framework of item response theory (IRT), there are two recent lines of work on the estimation of classification accuracy (CA) rate. One approach estimates CA when decisions are made based on total sum scores, the other based on latent trait estimates. The former is referred to as the Lee approach, and the latter, the Rudner approach, each after its representative contributor. In this article, the two approaches are delineated in the same framework to highlight their similarities and differences. In addition, a simulation study manipulating IRT model, sample size, test length, and cut score location was conducted. The study investigated the empirical CA that can be achieved using either the total scores or the latent trait estimates. It also evaluated the performances of the two approaches in estimating their respective empirical CAs. Results on the empirical CA suggest that when the model fits, classifications made with the latent trait estimate shall be equally or more accurate than classifications made with total score. The magnitude of difference was governed by divergence from the one-parameter logistic (1PL) model. Both Lee and Rudner approaches provided good estimates of their respective empirical CAs for every condition that was simulated. Practical implications of the simulation results are discussed.
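The latent-trait-based calculation can be sketched as follows. This is a minimal illustration of one common description of the Rudner approach, assuming that each examinee's true θ is normally distributed around the estimate with standard deviation equal to its standard error; the function names, estimates, standard errors, and cut scores are all hypothetical. The expected accuracy is the average probability that the true θ falls in the same category as the estimate.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_accuracy(theta_hats, std_errors, cuts):
    """Average probability that the true theta lies in the assigned category."""
    bounds = [-math.inf] + sorted(cuts) + [math.inf]
    total = 0.0
    for est, se in zip(theta_hats, std_errors):
        for lower, upper in zip(bounds[:-1], bounds[1:]):
            if lower <= est < upper:  # category assigned to this estimate
                total += (normal_cdf((upper - est) / se)
                          - normal_cdf((lower - est) / se))
                break
    return total / len(theta_hats)

# Hypothetical estimates with one cut score at theta = 0.
accuracy = expected_accuracy([-1.2, 0.1, 0.8, 2.0], [0.3, 0.3, 0.35, 0.5], [0.0])
```

Estimates close to a cut score contribute probabilities near 0.5, so accuracy depends heavily on how many examinees sit near the cut and on the size of their standard errors.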
 
Article
This article describes procedures for estimating various indices of classification consistency and accuracy for multiple category classifications using data from a single test administration. The estimates of the classification consistency and accuracy indices are compared under three different psychometric models: the two-parameter beta binomial, four-parameter beta binomial, and three-parameter logistic IRT (item response theory) models. Using real data sets, the estimation procedures are illustrated, and the characteristics of the estimated classification indices are examined. This article also examines the behavior of the estimated classification indices as a function of the latent variable. All three components of the models (i.e., the estimated true score distributions, fitted observed score distributions, and estimated conditional error variances) appear to have considerable influence on the magnitudes of the estimated classification indices. Choosing a model in practice should be based on various considerations including the degree of model fit to the data, suitability of the model assumptions, and the computational feasibility.
 
Article
The purpose of this research was to explore the validity of ACT Assessment test results as indicators of reading skill. This was considered important because of the large and growing number of postsecondary institutions using open admission policies and admitting students with a wide range of reading ability. If ACT scores were found to be predictive of reading performance, ACT data could be used to identify those students who might profit from remedial reading training. Various test score combinations were used to predict students' performance on numerous reading tests at a large number of postsecondary institutions. These studies indicated a high level of predictive accuracy.
 
Article
Some test design problems can be seen as combinatorial optimization problems. Several suggestions are presented, with various possible applications. Results obtained thus far are promising; the methods suggested can also be used with highly structured test specifications.
 
Cumulative percentages of examinees finishing the CAT in different test lengths under the DINA model with Criterion 1.  
Cumulative percentages of examinees finishing the CAT in different test lengths under the fusion model with Criterion 1.  
Classification Accuracy of Latent Class and Other Summary Statistics Under the DINA Model With Test Security Control.
Classification Accuracy of Latent Class and Other Summary Statistics Under the Fusion Model With Test Security Control.
Article
Interest in developing computerized adaptive testing (CAT) under cognitive diagnosis models (CDMs) has increased recently. CAT algorithms that use a fixed-length termination rule frequently lead to different degrees of measurement precision for different examinees. Fixed precision, in which the examinees receive the same degree of measurement precision, is a major advantage of CAT over nonadaptive testing. In addition to the precision issue, test security is another important issue in practical CAT programs. In this study, the authors implemented two termination criteria for the fixed-precision rule and evaluated their performance under two popular CDMs using simulations. The results showed that using the two criteria with the posterior-weighted Kullback–Leibler information procedure for selecting items could achieve the prespecified measurement precision. A control procedure was developed to control item exposure and test overlap simultaneously among examinees. The simulation results indicated that in contrast to no method of controlling exposure, the control procedure developed in this study could maintain item exposure and test overlap at the prespecified level at the expense of only a few more items.
 
Descriptive Statistics for the Item Pool (500 Items)
Article
Most computerized adaptive testing (CAT) programs do not allow test takers to review and change their responses because it could seriously deteriorate the efficiency of measurement and make tests vulnerable to manipulative test-taking strategies. Several modified testing methods have been developed that provide restricted review options while limiting the trade-off in CAT efficiency. The extent to which these methods provided test takers with options to review test items, however, still was quite limited. This study proposes the item pocket (IP) method, a new testing approach that allows test takers greater flexibility in changing their responses by eliminating restrictions that prevent them from moving across test sections to review their answers. A series of simulations were conducted to evaluate the robustness of the IP method against various manipulative test-taking strategies. Findings and implications of the study suggest that the IP method may be an effective solution for many CAT programs when the IP size and test time limit are properly set.
 
Article
In the human sciences, a common assumption is that latent traits have a hierarchical structure. Higher order item response theory models have been developed to account for this hierarchy. In this study, computerized adaptive testing (CAT) algorithms based on these kinds of models were implemented, and their performance under a variety of situations was examined using simulations. The results showed that the CAT algorithms were very effective. The progressive method for item selection, the Sympson and Hetter method with online and freeze procedure for item exposure control, and the multinomial model for content balancing can simultaneously maintain good measurement precision, item exposure control, content balance, test security, and pool usage.
 
Article
This paper discusses four item selection rules to design efficient individualized tests for the random weights linear logistic test model: minimum posterior weighted D-error (DB), minimum expected posterior weighted D-error (EDB), maximum expected Kullback-Leibler divergence between subsequent posteriors (KLP), and maximum mutual information (MUI). The random weights linear logistic test model decomposes test items into a set of subtasks or cognitive features and assumes individual-specific effects of the features on the difficulty of the items. In contrast to a single ability score, the individual effects provide a more detailed profile of a test taker's proficiency, showing strengths and weaknesses with respect to the item features. Simulations show that the design efficiency of the different criteria appears to be equivalent. However, KLP and MUI are preferred over DB and EDB because of their lower complexity, which greatly reduces the computational burden.
 
Article
The random-threshold generalized unfolding model (RTGUM) was developed by treating the thresholds in the generalized unfolding model as random effects rather than fixed effects to account for the subjective nature of the selection of categories in Likert items. The parameters of the new model can be estimated with the JAGS (Just Another Gibbs Sampler) freeware, which adopts a Bayesian approach for estimation. A series of simulations was conducted to evaluate the parameter recovery of the new model and the consequences of ignoring the randomness in thresholds. The results showed that the parameters of RTGUM were recovered fairly well and that ignoring the randomness in thresholds led to biased estimates. Computerized adaptive testing was also implemented on RTGUM, where the Fisher information criterion was used for item selection and the maximum a posteriori method was used for ability estimation. The simulation study showed that the ability estimates were more precise when the test was longer, the randomness in thresholds was smaller, and the items had more categories.
 
Article
This study explored a computerized adaptive test delivery algorithm for latent class identification based on the mixture Rasch model. Four item selection methods based on the Kullback–Leibler (KL) information were proposed and compared with the reversed and the adaptive KL information under simulated testing conditions. When item separation was large, the item selection methods did not differ noticeably in the accuracy of classifying examinees into latent classes or of estimating latent ability. However, when item separation was small, the two methods with class-specific ability estimates performed better than the other two methods, which were based on a single latent ability estimate across all latent classes. The distributions of the three types of KL information were also compared. The KL and the reversed KL information could be the same or different depending on the ability level and the item difficulty difference between latent classes. Although the KL information and the reversed KL information differed at some ability levels and item difficulty difference levels, the use of the KL, the reversed KL, or the adaptive KL information did not affect the results substantially because item difficulty differences between latent classes were symmetrically distributed in the simulated item pools. Item pool usage and classification convergence points were examined as well.
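The remark that the KL and the reversed KL information can coincide or differ depending on ability and on the difficulty difference between classes can be illustrated with a minimal sketch. It assumes a single dichotomous Rasch item whose difficulty differs between two latent classes; the function names and the example difficulties are hypothetical and are not taken from the study.

```python
import math
from typing import Tuple


def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence KL(P || Q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_and_reversed(theta: float, b_class1: float, b_class2: float) -> Tuple[float, float]:
    """Return (KL, reversed KL) for one item with class-specific difficulties.

    Assumption: the same ability value theta is used in both classes.
    """
    p = rasch_prob(theta, b_class1)  # response probability in class 1
    q = rasch_prob(theta, b_class2)  # response probability in class 2
    return bernoulli_kl(p, q), bernoulli_kl(q, p)
```

With hypothetical difficulties of -1 and 1, the two quantities are equal when theta lies midway between them (theta = 0), because the two response probabilities are then complements of each other. They differ at other ability values such as theta = 2.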
 
Article
Variable-length computerized adaptive testing (VL-CAT) allows both items and test length to be “tailored” to examinees, thereby achieving the measurement goal (e.g., scoring precision or classification) with as few items as possible. Several popular test termination rules depend on the standard error of the ability estimate, which in turn depends on the item parameter values. However, items are chosen on the basis of their parameter estimates, and capitalization on chance may occur. In this article, the authors investigated the effects of capitalization on chance on test length and classification accuracy in several VL-CAT simulations. The results confirm that capitalization on chance occurs in VL-CAT and has complex effects on test length, ability estimation, and classification accuracy. These results have important implications for the design and implementation of VL-CATs.
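The dependence of a standard-error termination rule on item parameter values can be sketched as follows. This is a minimal illustration that assumes a two-parameter logistic (2PL) model and a rule that stops once the standard error falls to a target value. The model, the target, and all names are assumptions made for the example rather than details from the article.

```python
import math
from typing import Sequence, Tuple


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def standard_error(theta: float, items: Sequence[Tuple[float, float]]) -> float:
    """Asymptotic standard error: 1 / sqrt(sum of item information).

    `items` holds (a, b) pairs for the items administered so far.
    """
    total = sum(item_information(theta, a, b) for a, b in items)
    return math.inf if total == 0.0 else 1.0 / math.sqrt(total)


def should_stop(theta: float, items: Sequence[Tuple[float, float]],
                se_target: float = 0.3) -> bool:
    """Hypothetical variable-length rule: stop once SE <= se_target."""
    return standard_error(theta, items) <= se_target
```

For 20 items with b = 0 evaluated at theta = 0, discriminations of 1.0 give a standard error of about 0.447, whereas overestimated discriminations of 1.3 give about 0.344. A rule driven by the estimated parameters would therefore stop earlier than the true parameters justify.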
 
Article
Multidimensional computerized adaptive testing (MCAT) is able to provide a vector of ability estimates for each examinee, which could be used to provide a more informative profile of an examinee’s performance. The current literature on MCAT focuses on fixed-length tests, which can generate less accurate results for those examinees whose abilities are quite different from the average difficulty level of the item bank when there are only a limited number of items in the item bank. Therefore, instead of stopping the test with a predetermined fixed test length, the authors use a more informative stopping criterion that is directly related to measurement accuracy. Specifically, this research derives four stopping rules that either quantify the measurement precision of the ability vector (i.e., minimum determinant rule [D-rule], minimum eigenvalue rule [E-rule], and maximum trace rule [T-rule]) or quantify the amount of available information carried by each item (i.e., maximum Kullback–Leibler divergence rule [K-rule]). The simulation results showed that all four stopping rules successfully terminated the test when the mean squared error of ability estimation was within a desired range, regardless of examinees’ true abilities. It was found that when using the D-, E-, or T-rule, examinees with extreme abilities tended to have tests that were twice as long as the tests received by examinees with moderate abilities. However, the test length difference with the K-rule was not very dramatic, indicating that the K-rule may not be very sensitive to measurement precision. In all cases, the cutoff value for each stopping rule needs to be adjusted on a case-by-case basis to find an optimal solution.
 
Article
The Monte Carlo approach, which has previously been implemented in traditional computerized adaptive testing (CAT), is applied here to cognitive diagnostic CAT to test its ability to address multiple content constraints. The performance of the Monte Carlo approach is compared with the performance of the modified maximum global discrimination index (MMGDI) method on simulations in which the only content constraint is on the number of items that measure each attribute. The results of the two simulation experiments show that (a) the Monte Carlo method fulfills all the test requirements and produces satisfactory measurement precision and item exposure results and (b) the Monte Carlo method outperforms the MMGDI method when the Monte Carlo method applies either the posterior-weighted Kullback–Leibler algorithm or the hybrid Kullback–Leibler information as the item selection index. Overall, the recovery rate of the knowledge states, the distribution of the item exposure, and the utilization rate of the item bank are improved when the Monte Carlo method is used.
 
Pool Information Function
Comparison Study Without Guessing or Hesitation
Comparison Study With Guessing
Comparison Study With Hesitation
Article
This article presents a new algorithm for computerized adaptive testing (CAT) when content constraints are present. The algorithm is based on shadow CAT methodology to meet content constraints but applies Monte Carlo methods and provides the following advantages over shadow CAT: (a) lower maximum item exposure rates, (b) higher utilization of the item pool, and (c) more robust ability estimates. Computer simulations with Law School Admission Test items demonstrated that the new algorithm (a) produces similar ability estimates as shadow CAT but with half the maximum item exposure rate and 100% pool utilization and (b) produces more robust estimates when a high- (or low-) ability examinee performs poorly (or well) at the beginning of the test.
 
Article
Computerized adaptive testing is subject to security problems, as the item bank content remains operative over long periods and administration time is flexible for examinees. Spreading the content of a part of the item bank could lead to an overestimation of the examinees' trait level. The most common way of reducing this risk is to impose a maximum exposure rate (r^max) that no item should exceed. Several methods have been proposed with this aim. All of these methods establish a single value of r^max throughout the test. This study presents a new method, the multiple-r^max method, that defines as many values of r^max as there are items presented in the test. In this way, it is possible to impose a high degree of randomness in item selection at the beginning of the test, leaving the administration of items with the best psychometric properties to the moment when the trait level estimation is most accurate. The implementation of the multiple-r^max method is described and is tested in simulated item banks and in an operative bank. Compared with a single maximum exposure method, the new method has a more balanced usage of the item bank and delays the possible distortion of trait estimation due to security problems, with either no or only slight decrements of measurement accuracy. (Contains 1 table and 13 figures.)
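The idea of one maximum exposure rate per item position can be sketched as follows. The abstract does not give the actual values or the eligibility rule, so the linearly increasing schedule, the strict comparison, and every name below are assumptions made purely for illustration.

```python
from typing import Dict, List


def rmax_schedule(test_length: int, r_first: float, r_last: float) -> List[float]:
    """One exposure ceiling per item position.

    Assumption: ceilings rise linearly from r_first (strict, so early
    selection is more random) to r_last (lenient, so the most informative
    items are available late in the test).
    """
    if test_length == 1:
        return [r_last]
    step = (r_last - r_first) / (test_length - 1)
    return [r_first + step * i for i in range(test_length)]


def eligible_items(exposure_rates: Dict[str, float],
                   rmax_at_position: float) -> List[str]:
    """Items whose current exposure rate is below the ceiling for this position."""
    return [item for item, rate in exposure_rates.items()
            if rate < rmax_at_position]
```

Under these assumptions, a heavily used item is excluded at the first positions of the test and becomes available only once the ceiling for the current position exceeds its exposure rate.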
 
Article
The object of this paper is to present Rasch's psychometric model as a special case of additive conjoint measurement. The connection between these two areas has been discussed before, but largely ignored. Because the theory of conjoint measurement has been formulated deterministically, there have been some difficulties in its application. It is pointed out in this paper that the Rasch model, which is a stochastic model, does not suffer from this fault. The exposition centers on the analyses of two data sets, each of which was analyzed using Rasch scaling methods as well as some of the methods of conjoint measurement. The results, using the different procedures, are compared.
 
Article
DETECT is a nonparametric methodology to identify the dimensional structure underlying test data. The associated DETECT index, Dmax, denotes the degree of multidimensionality in data. Conditional covariances (CCOV) are the building blocks of this index. In specifying population CCOVs, the latent test composite θTT is used as the conditional variable. In estimating the CCOVs, the total test score of all items in the test (T) or the rest score of remaining items (S) are generally used as estimates of the latent composite θTT. However, estimated CCOVs are biased when using T or S as a proxy for θTT. Some type of correction is needed to adjust this bias. This study was an investigation of different ways to estimate the DETECT index based on the conditional scores T and S, and additional bias adjustments, resulting in six different estimates, D1 through D6. These six indices were investigated in simulated settings, varying the test length, sample size, and the degree of multidimensionality (108 in all). The results showed that indices D1, D2, and D5 are not acceptable as they displayed highly inflated Dmax values. No statistically significant differences were found between indices D3 and D6. Overall comparison of indices D3 with D4 showed that even though they differed significantly in Dmax values, there was no practically meaningful difference between the performance of these two indices. For these reasons, it is recommended that the current index D3 be retained even though index D4 displayed slightly better results in the current study.
 
Article
Thesis (Ed. D.)--Indiana University, 1974. Vita.
 
Article
The majority of large-scale assessments develop various score scales that are either linear or nonlinear transformations of raw scores for better interpretations and uses of assessment results. The current formula for coefficient alpha (α; the commonly used reliability coefficient) only provides internal consistency reliability estimates of raw scores. This article presents a general form of α and extends its use to estimate internal consistency reliability for nonlinear scale scores (used for relative decisions). The article also examines this estimator of reliability using different score scales with real data sets of both dichotomously scored and polytomously scored items. Different score scales show different estimates of reliability. The effects of transformation functions on reliability of different score scales are also explored.
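For reference, the usual raw-score form of coefficient alpha mentioned above can be computed as in the sketch below. It covers only the standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores), and not the generalized form for nonlinear scale scores that the article proposes. The function name and data layout are assumptions.

```python
from statistics import variance
from typing import Sequence


def coefficient_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Raw-score coefficient alpha.

    scores[i][j] is examinee i's score on item j (assumed layout).
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total (summed) raw score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)
```

Two perfectly parallel items yield an alpha of 1, and two uncorrelated items with equal variances yield an alpha of 0, which gives a quick sanity check on the implementation.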
 
Article
This article describes SIMREL, a software program designed for the simulation of alpha coefficients and the estimation of their confidence intervals. SIMREL can be run in two ways. In the first, SIMREL is run on a single data file and performs descriptive statistics, principal components analysis, and variance analysis of the item scores in that file. The second uses Monte Carlo simulation: it treats the data from the first alternative as population data and obtains estimators of each parameter from samples drawn from the population data at the desired sample size and number of replications.
 