Figure - uploaded by Nils Myszkowski
Item parameters of binary logistic models.


Source publication
Article
Full-text available
Assessing job applicants' general mental ability online poses psychometric challenges due to the necessity of having brief but accurate tests. Recent research (Myszkowski & Storme, 2018) suggests that recovering distractor information through Nested Logit Models (NLM; Suh & Bolt, 2010) increases the reliability of ability estimates in reasoning mat...

Context in source publication

Context 1
... parameter estimates (along with standard errors for the 2PL and 3PL) of the 2PL, 3PL, and 4PL models are presented in Table 2. ...
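The kind of item parameter table referenced here can be reproduced with standard IRT software. Below is a minimal sketch using the R package mirt; the response matrix is a bundled example dataset standing in for the actual test data, and the object names are illustrative, not the authors' code.

```r
library(mirt)

# 0/1 scored response matrix (persons x items); LSAT7 is a bundled example
# dataset standing in for the actual test data.
responses <- expand.table(LSAT7)

fit_2pl <- mirt(responses, 1, itemtype = "2PL", SE = TRUE)
fit_3pl <- mirt(responses, 1, itemtype = "3PL", SE = TRUE)
fit_4pl <- mirt(responses, 1, itemtype = "4PL")

# Item parameters in the traditional IRT parameterization
# (a = discrimination, b = difficulty, g = lower asymptote, u = upper asymptote)
coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)$items
coef(fit_3pl, IRTpars = TRUE, simplify = TRUE)$items
coef(fit_4pl, IRTpars = TRUE, simplify = TRUE)$items

# Standard errors (as reported for the 2PL and 3PL)
coef(fit_2pl, printSE = TRUE)
```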

Citations

... Thus, we expect our approach to be conservative (i.e., confidence intervals are wider as compared to an approach that takes the dependence into account) and any inferences based on the intervals should be treated with caution. However, approaches that explicitly model the potential dependence here such as case resampling combined with model refitting and fully non-parametric bootstrapping (Myszkowski & Storme, 2018;Storme et al., 2019) would be computationally too demanding for the CMPCM which involves approximation of an infinite series. This assertion was tested on a Dell Precision 3551 laptop with the Windows 11 operating system and an x86-64 processor. ...
... Furthermore, we focused on reliability as a main outcome in this work, and for the quantification of uncertainty in reliability estimates we had to rely on a rather pragmatic-yet conservative-bootstrap approach. This approach did not take into account the potential dependence between estimates of SE² and σ̂², because approaches such as non-parametric bootstrap combined with a case resampling procedure (Myszkowski & Storme, 2018; Storme et al., 2019) cannot easily be implemented for the CMPCM. The CMPCM involves computation of an infinite sum, and refitting the models for only 1000 bootstrap samples would be expected to take weeks even on a high-performance computer cluster. ...
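For context, the case-resampling bootstrap that these excerpts describe as computationally prohibitive for the CMPCM might look roughly like the following for a simpler binary IRT model. This is a sketch with a hypothetical response matrix, not the cited authors' code.

```r
library(mirt)

# Non-parametric bootstrap with case (person) resampling for an IRT
# reliability estimate. 'responses' is a hypothetical person x item matrix.
boot_reliability <- function(responses, R = 1000) {
  replicate(R, {
    idx <- sample(nrow(responses), replace = TRUE)    # resample persons
    fit <- mirt(responses[idx, ], 1, itemtype = "2PL", verbose = FALSE)
    sc  <- fscores(fit, full.scores.SE = TRUE)
    empirical_rxx(sc)                                 # reliability of theta estimates
  })
}

# rel_boot <- boot_reliability(responses)
# quantile(rel_boot, c(.025, .975))                   # percentile 95% CI
```

Each replication refits the full model, which is exactly why the same scheme becomes impractical when every fit requires approximating an infinite series, as with the CMPCM.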
Article
Full-text available
Are latent variables of researcher performance capacity merely elaborate proxies of productivity? To investigate this research question, we propose extensions of recently used item-response theory models for the estimation of researcher performance capacity. We argue that productivity should be considered as a potential explanatory variable of reliable individual differences between researchers. Thus, we extend the Conway-Maxwell Poisson counts model and a negative binomial counts model by incorporating productivity as a person-covariate. We estimated six different models: a model without productivity as item or person-covariate, a model with raw productivity as a person-covariate, a model with log-productivity as a person-covariate, a model that treats log-productivity as a known offset, a model with item-specific influences of productivity, and a model with item-specific influences of productivity as well as academic age as a person-covariate. We found that the model with item-specific influences of productivity fitted two samples of social science researchers best. In the first dataset, reliable individual differences decreased substantially from excellent reliability when productivity is not modeled at all to unacceptable levels of reliability when productivity is controlled as a person-covariate, while in the second dataset reliability decreased only negligibly. This all emphasizes the critical role of productivity in researcher performance capacity estimation.
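The covariate structure of the models compared in this abstract can be sketched with an ordinary Poisson mixed model as a rough stand-in for the Conway-Maxwell-Poisson specification the authors use. The long-format data frame and its column names below are hypothetical.

```r
library(lme4)

# 'long' (hypothetical): one row per researcher x indicator with columns
#   person, item (factor), count, productivity

# Model without productivity
m0    <- glmer(count ~ 0 + item + (1 | person), data = long, family = poisson)

# Log-productivity as a person-covariate
m_log <- glmer(count ~ 0 + item + log(productivity) + (1 | person),
               data = long, family = poisson)

# Log-productivity as a known offset
m_off <- glmer(count ~ 0 + item + offset(log(productivity)) + (1 | person),
               data = long, family = poisson)

# Item-specific influences of productivity
m_int <- glmer(count ~ 0 + item + item:log(productivity) + (1 | person),
               data = long, family = poisson)
```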
... For categorical data that extend beyond binary responses, models such as the nominal response model (Bock 1972) and the nested logit model (Suh and Bolt 2010) provide a framework for analyzing responses with multiple categories, capturing the full complexity of item responses in person ability estimation. These methods have been discussed as ways to increase the reliability of person estimates-especially at low ability levels-in the context of progressive matrices (Myszkowski and Storme 2018;Storme et al. 2019). Statistical methods for detecting distractor responses that could have discrimination power have also been discussed in the context of progressive matrices (Forthmann et al. 2020). ...
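As a concrete illustration of the nested logit approach mentioned here, mirt implements Suh and Bolt's models through the "2PLNRM"/"3PLNRM" item types. The sketch below assumes hypothetical objects: choices (raw option choices) and key (the correct option for each item).

```r
library(mirt)

# Nested logit model: a 2PL governs correct/incorrect, and a nominal model
# governs which distractor is chosen given an incorrect response.
# choices: person x item matrix of selected options (integer codes)
# key:     vector giving the correct option for each item
fit_nlm <- mirt(choices, 1, itemtype = "2PLNRM", key = key)

theta <- fscores(fit_nlm, full.scores.SE = TRUE)  # ability estimates + SEs
empirical_rxx(theta)                              # empirical reliability
```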
Article
Full-text available
Measurement models traditionally make the assumption that item responses are independent from one another, conditional upon the common factor. They typically explore for violations of this assumption using various methods, but rarely do they account for the possibility that an item predicts the next. Extending the development of auto-regressive models in the context of personality and judgment tests, we propose to extend binary item response models—using, as an example, the 2-parameter logistic (2PL) model—to include auto-regressive sequential dependencies. We motivate such models and illustrate them in the context of a publicly available progressive matrices dataset. We find an auto-regressive lag-1 2PL model to outperform a traditional 2PL model in fit as well as to provide more conservative discrimination parameters and standard errors. We conclude that sequential effects are likely overlooked in the context of cognitive ability testing in general and progressive matrices tests in particular. We discuss extensions, notably models with multiple lag effects and variable lag effects.
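One rough way to approximate the lag-1 idea described in this abstract (not necessarily the authors' exact specification) is a logistic mixed model on long-format data in which the previous item response enters as a predictor. The column names below are hypothetical.

```r
library(lme4)

# 'long' (hypothetical): one row per person x item with columns
#   person, item (factor), y (0/1 response), y_lag1 (response to the preceding item)
fit_lag1 <- glmer(
  y ~ 0 + item + y_lag1 + (1 | person),  # item effects + lag-1 effect + person ability
  data   = subset(long, !is.na(y_lag1)), # drop each person's first item
  family = binomial("logit")
)
summary(fit_lag1)  # a non-zero y_lag1 coefficient signals sequential dependence
```

Unlike the authors' model, this sketch holds discriminations equal (a Rasch-like 1PL rather than a 2PL), but it conveys how a lagged response can be added to the predictor side.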
... One such technique is Item Response Theory (IRT), a psychological and educational measurement approach. IRT has been used to select bespoke test items for individual learners [2], evaluate the reasoning ability of job applicants [25], conduct trust-based assessment [26], and support requirement-based real-time testing aligned with academic standards [27]. IRT provides a basis for measuring the ability of test participants and the properties of items based on responses. However, analyzing only correct responses is not enough to measure ability, so additional parameters may need to be considered. ...
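For readers unfamiliar with the mechanics, the standard 2PL equations behind such IRT analyses are simple to write down. The parameter values below are purely illustrative.

```r
# 2PL item characteristic curve and item information (standard formulas):
# P(theta) = 1 / (1 + exp(-a * (theta - b))),  I(theta) = a^2 * P * (1 - P)
p_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
info_2pl <- function(theta, a, b) {
  p <- p_2pl(theta, a, b)
  a^2 * p * (1 - p)
}

theta <- seq(-3, 3, by = 0.1)
# Illustrative item: discrimination a = 1.2, difficulty b = 0
plot(theta, info_2pl(theta, a = 1.2, b = 0), type = "l",
     xlab = "theta", ylab = "Item information")
```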
Article
The integration of information and communication technologies in education has opened up numerous opportunities for learning and assessment. State-of-the-art electronic learning and assessment systems are largely confined to delivering instructional content without focusing on learning analytics. Hence, more objective assessment systems are required that can keep track of students' performance, especially for self-learning in the post-COVID-19 digital era. These assessment systems need to be optimized so that students can receive an accurate prescription in a limited time when teachers are unavailable. Therefore, this study proposes an adaptive e-assessment model for learning and assessment. The proposed model comprises a domain model, a student model, and an assessment adaptation engine. Fuzzy logic is used to address uncertainty and analyze student performance through a learner-centric approach. The prototype was verified by deploying it on a computer science course offered in a degree-level program of an open university. The results reveal improved performance of learners using the adaptive e-assessment system. The study also provides a roadmap for researchers to develop generalized and personalized e-learning systems for other courses.
... As responses to items are typically scored simply as "correct versus incorrect", information about the participants' reasoning that led to the selection of a particular distractor is lost (Laurence and Macedo 2022). The importance of analyzing distractors for their informative content has come into focus in recent years (Forthmann et al. 2020; Storme et al. 2019). While previous work focused on analyzing single distractors for ability, our approach focuses on increasing distractor information by using the interplay of distractors and their operations resulting from the facet design (Guttman and Schlesinger 1967). ...
Article
Full-text available
Figural matrices tests are among the most popular and well-investigated tests used to assess inductive reasoning abilities. Solving these tests requires the selection of a target that completes a figural matrix among distractors. Despite their generally good psychometric properties, previous matrices tests have limitations associated with distractor construction that prevent them from realizing their full potential. Most tests allow participants to identify the correct response by eliminating distractors based on superficial features. The goal of this study was to develop a novel figural matrices test which is less prone to the use of response elimination strategies, and to test its psychometric properties. The new test consists of 48 items and was validated with N = 767 participants. Measurement models implied that the test is Rasch scalable, implying a uniform underlying ability. The test showed good to very good reliability (retest correlation: r = 0.88; Cronbach’s alpha: α = 0.93; split-half reliability: r = 0.88) and good construct validity (r = 0.81 with the Raven Progressive Matrices Test, r = 0.73 with global intelligence scores of the Intelligence Structure Test 2000R, and r = 0.58 with the global score of the Berlin Intelligence Structure Test). It even surpassed the Raven Progressive Matrices Test in criterion-related validity (correlation with final-year high school grades: r = −0.49, p < .001). We conclude that this novel test has excellent psychometric properties and can be a valuable tool for researchers interested in reasoning assessment.
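The classical reliability coefficients reported in this abstract can be computed directly from a scored response matrix. The sketch below uses an odd-even split, which may differ from the split the authors used, and a hypothetical matrix X; the retest correlation reported above would simply be the correlation between total scores from the two occasions.

```r
# X: hypothetical person x item matrix of scored responses

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / total-score variance)
cronbach_alpha <- function(X) {
  k <- ncol(X)
  (k / (k - 1)) * (1 - sum(apply(X, 2, var)) / var(rowSums(X)))
}

# Odd-even split-half reliability with Spearman-Brown correction
split_half <- function(X) {
  odd  <- rowSums(X[, seq(1, ncol(X), by = 2), drop = FALSE])
  even <- rowSums(X[, seq(2, ncol(X), by = 2), drop = FALSE])
  r <- cor(odd, even)
  2 * r / (1 + r)
}
```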
... In Uganda, studies show a high prevalence of ESBL among patients with UTIs [8][9][10]. Individuals dwelling in refugee settlements face numerous socio-economic, hygiene, and health care challenges [11]. These predispose them to high-risk behaviors which may increase the risk of acquiring urinary tract infections [12]. In Uganda, the United Nations High Commission for Refugees (UNHCR) health report indicated that 2% of the consultations registered in health centers in refugee settlements are empirically treated for urinary tract infections [6]. According to unpublished health center III records in Nakivale refugee settlement, the prevalence of clinically diagnosed urinary tract infections was 15%. ...
Preprint
Full-text available
Background: The World Health Organization approximates that one in four individuals have had at least one UTI episode requiring treatment with an antimicrobial agent by their teenage years. At Nakivale refugee camp, the overwhelming number of refugees, often associated with poor living conditions such as communal bathrooms and toilets and multiple sex partners, predisposes the refugees to urinary tract infections. Aim: To determine the prevalence of bacterial community-onset urinary tract infections among refugees in Nakivale refugee settlement and determine the antimicrobial susceptibility patterns of the isolated pathogens. Methods: This was a cross-sectional study that included 216 outpatients attending Nakivale Health Centre III between July and September 2020. Results: The prevalence of UTI was 24.1% (52/216). The majority, 86 (39.81%), of the refugees were from DR Congo, followed by those from Somalia, 58 (26.85%). The commonest causative agent was Staphylococcus aureus, 22/52 (42.31%) of total isolates, followed by Escherichia coli, 21/52 (40.38%). Multidrug-resistant isolates accounted for 71.15% (37/52) and mono-resistance for 26.92% (14/52). Out of the 52 bacterial isolates, 30 (58%) were Extended-Spectrum Beta-Lactamase (ESBL) organisms. Twenty-one (70.0%) isolates were ESBL producers while 9 (30%) were non-ESBL producers. Both blaTEM and blaCTX-M were found in 62.5% each, while blaSHV was detected in 37.5%. Conclusions: The prevalence of UTI among refugees in Nakivale settlement is high, with Staphylococcus aureus and Escherichia coli as the commonest causes of UTI. There is a high rate of multidrug resistance to common drugs used to treat UTI. The prevalence of ESBL-producing Enterobacteriaceae is high and the common ESBL genes are blaTEM and blaCTX-M.
... Recent studies have also used this family of procedures to model information contained in incorrect responses to increase the precision of measurement (Smith et al., 2020; Storme et al., 2019), by using models that were explicitly designed to acknowledge a two-stage cognitive response process: nested logit models (NLM; Suh & Bolt, 2010). ...
... Regarding the reliability of the test score, both the 2PNL and mixed-format models outperformed the dichotomous models (1PL and 2PL), the nominal model (NRM), and the CTT-based estimates, as a consequence of providing more information across the latent continuum. These results show that modelling the information present in distractors is advantageous for estimating θ and increasing the reliability of scores, and are consistent with similar research (Storme et al., 2019). A similar inference can be drawn from the correlation between the different models. ...
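The model comparison summarized in this excerpt can be sketched with mirt, assuming hypothetical objects scored (0/1 responses), choices (raw option choices), and key; the nominal and nested logit item types shown are those documented in mirt.

```r
library(mirt)

fit_2pl <- mirt(scored, 1, itemtype = "2PL")                 # dichotomous model
fit_nrm <- mirt(choices, 1, itemtype = "nominal")            # nominal response model
fit_nlm <- mirt(choices, 1, itemtype = "2PLNRM", key = key)  # 2PL nested logit

# Marginal reliability of each scoring model
sapply(list(`2PL` = fit_2pl, NRM = fit_nrm, `2PNL` = fit_nlm), marginal_rxx)

# Correlations between ability estimates from the different models
cor(cbind(fscores(fit_2pl), fscores(fit_nrm), fscores(fit_nlm)))
```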
Preprint
Full-text available
Background: The aims of this study were to assess the construct validity (dimensionality and measurement invariance) and reliability of the previously developed Cognitive module of the Portuguese Physical Literacy Assessment – Questionnaire (PPLA-Q). Secondary aims were to assess whether using distractor information was useful for higher precision, and whether a total sum-score has enough precision for applied PE settings. Methods: Parametric Item Response Theory (IRT) models were estimated using a final sample of 508 Portuguese adolescents (Mage = 16, SD = 1 years) studying in public schools in Lisbon. A retest subsample of 73 students, collected 15 days after baseline, was used to calculate the Intraclass Correlation Coefficient (ICC) and Svenson’s ordinal paired agreement. Results: A mixed 2-parameter nested logit + graded response model provided the best fit to the data, C2(21) = 23.92, p = .21; CFI = .98; RMSEA (C2) = .017 [0, .043], with no misfitting items. Modelling distractor information increased the available information and thus reliability. There was evidence of differential item functioning in one item in favor of male students; however, it did not translate into statistically significant differences at the test level (sDTF = -0.06; sDTF% = -0.14). Average score reliability was low (marginal reliability = .60), while adequate reliability was attained in the -2 to -1 θ range. ICC results suggest poor to moderate test-retest reliability (ICC = .56 [.38, .70]), while Svenson’s method resulted in 6 out of 10 items with acceptable agreement (>.70), with the 4 remaining items revealing small individual variability across time points. We found a high correlation (r = .91 [.90, .93]) between the sum-score and scores derived from the calibrated mixed model. Conclusions: Evidence supports the construct validity of the cognitive module of the PPLA-Q to assess Content Knowledge in the Portuguese PE context for grade 10-12 (15-18 years) adolescents. The test attained acceptable reliability for distinguishing students with transitional knowledge (between Foundation and Mastery), with further revisions needed to target the full spectrum of θ. Its sum-score might be used in applied settings to get a quick overview of students’ knowledge; for more precision, the IRT score is recommended. Further scrutiny of test-retest reliability is warranted in future research, along with the use of 3-parameter logistic models.
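The mixed-format calibration described in this abstract (nested logit items alongside graded response items) can be expressed in mirt by passing one item type per item. The item counts, data object, and key below are hypothetical, and the key convention shown (NA for items that are not multiple-choice) follows my reading of mirt's documentation and should be checked against it.

```r
library(mirt)

# Hypothetical test: 10 multiple-choice items (nested logit) + 4 ordinal items (GRM)
itemtypes <- c(rep("2PLNRM", 10), rep("graded", 4))
# key: correct option for the multiple-choice items, NA for the graded items
key <- c(3, 1, 4, 2, 2, 1, 3, 4, 1, 2, NA, NA, NA, NA)

fit_mixed <- mirt(resp, 1, itemtype = itemtypes, key = key)

marginal_rxx(fit_mixed)          # average score reliability
plot(fit_mixed, type = "info")   # where along theta the test is most precise
```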
... The analysis of educational and psychological tests is an important field in the social sciences. The test items (i.e., tasks presented in these tests) are often modeled using item response theory (IRT; [1][2][3]; for applications, see, e.g., [4][5][6][7][8][9][10][11][12][13][14]) models. In this article, the two-parameter logistic (2PL; [15]) IRT model is investigated to compare two groups on test items. ...
Article
Full-text available
This article investigates the comparison of two groups based on the two-parameter logistic item response model. It is assumed that there is random differential item functioning in item difficulties and item discriminations. The group difference is estimated using separate calibration with subsequent linking, as well as concurrent calibration. The following linking methods are compared: mean-mean linking, log-mean-mean linking, invariance alignment, Haberman linking, asymmetric and symmetric Haebara linking, different recalibration linking methods, anchored item parameters, and concurrent calibration. It is analytically shown that log-mean-mean linking and mean-mean linking provide consistent estimates if random DIF effects have zero means. The performance of the linking methods was evaluated through a simulation study. It turned out that (log-)mean-mean and Haberman linking performed best, followed by symmetric Haebara linking and a newly proposed recalibration linking method. Interestingly, linking methods frequently found in applications (i.e., asymmetric Haebara linking, recalibration linking used in a variant in current large-scale assessment studies, anchored item parameters, concurrent calibration) perform worse in the presence of random differential item functioning. In line with the previous literature, differences between linking methods turned out to be negligible in the absence of random differential item functioning. The different linking methods were also applied in an empirical example that performed a linking of PISA 2006 to PISA 2009 for Austrian students. This application showed that estimated trends in the means and standard deviations depended on the chosen linking method and the employed item response model.
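As an example of the simplest method in this comparison, mean-mean linking can be written in a few lines. The parameter vectors below are hypothetical, and the formulas assume the usual 2PL scale transformation theta* = A * theta + B.

```r
# Mean-mean linking of two separately calibrated 2PL solutions (sketch).
# a_ref, b_ref: item discriminations/difficulties from the reference group
# a_foc, b_foc: the same items' parameters from the focal group
mean_mean_link <- function(a_ref, b_ref, a_foc, b_foc) {
  A <- mean(a_foc) / mean(a_ref)        # slope of the scale transformation
  B <- mean(b_ref) - A * mean(b_foc)    # intercept
  list(A = A, B = B,
       a_foc_linked = a_foc / A,        # focal parameters on the reference scale
       b_foc_linked = A * b_foc + B)
}
# If the focal calibration fixes its ability mean at 0, B estimates the focal
# group's mean ability on the reference scale.
```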
... A more advanced approach involves using item response theory (IRT) on the dichotomized items and then scoring the persons using the resulting model. However, a further refinement is to employ categorical/nominal IRT (Storme et al., 2019;Suh & Bolt, 2010). In this approach, each response is allowed to have its own relationship to the underlying trait. ...
... In the nominal model, some incorrect responses are deemed more incorrect than others, and this information is thus used to estimate ability. This approach should be slightly more effective if a large sample is available for the model training (Storme et al., 2019). ...
... Furthermore, using an IRT scoring approach is slightly superior to both of these simpler approaches (r with age .40 vs. .35/.38; r with education .44 vs. .42/.43). However, we find that using the full categorical data is not better than using the dichotomized data (Storme et al., 2019). It is thus suggested that the website also adopt a dichotomous IRT approach for scoring in conjunction with the sum of correct responses approach, given its ease of understanding. ...
Article
Full-text available
We examined data from the popular free online 45-item “Vocabulary IQ Test” from https://openpsychometrics.org/tests/VIQT/. We used data from native English speakers (n = 9,278). Item response theory (IRT) analysis showed that most items had substantial g-loadings (mean = .59, sd = .22), but that some were problematic (4 items loading below .25). Nevertheless, we find that using the site’s scoring rules (which include a penalty for incorrect answers) gives results that correlate very strongly (r = .92) with IRT-derived scores. This is also true when using nominal IRT. The empirical reliability was estimated to be about .90. Median test completion time was 9 minutes (median absolute deviation = 3.5) and was mostly unrelated to the score obtained (r = -.02). The test scores correlated well with the self-reported criterion variables educational attainment (r = .44) and age (r = .40). To examine the test for measurement bias, we employed both Jensen’s method and differential item functioning (DIF) testing. With Jensen’s method, we see strong associations with education (r = .89) and age (r = .88), and less so for sex (r = .32). With differential item functioning, we only tested the sex difference for bias. We find that some items display moderate biases in favor of one sex (13 items showed evidence of bias at Bonferroni-corrected p < .05). However, the item pool contains roughly even numbers of male-favored and female-favored items, so the test-level bias is negligible (|d| < 0.05). Overall, the test seems mostly well-constructed, and is recommended for use with native English speakers.
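For readers unfamiliar with Jensen's method of correlated vectors mentioned here, the core computation is a correlation between two per-item vectors. The sketch below uses hypothetical inputs and is not the authors' code.

```r
# Jensen's method of correlated vectors (sketch; all inputs hypothetical).
# X:      person x item scored response matrix
# age:    criterion variable for the same persons
# g_load: per-item loadings/discriminations from the IRT or factor model

item_age_r <- apply(X, 2, function(item) cor(item, age, use = "complete.obs"))
cor(g_load, item_age_r)  # a high value suggests the age association tracks g-loadedness
```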
... Still, while Rasch models have undoubtedly revolutionized psychometric research, the fact that Rasch models are historically prominent is not a valid argument, especially if we consider that, for binary items, 1-parameter models are regularly empirically outperformed by models that are more flexible, especially models that account for item differences in discrimination (e.g., Storme, Myszkowski, Baron, & Bernard, 2019). Further, in the case of more traditional (linear) measurement models, the assumption that items do not vary in discrimination/loadings is generally regarded as simply not realistic (Trizano-Hermosilla & Alvarado, 2016), as illustrated by the common use of factor analysis. ...
Article
Fluency tasks are among the most common item formats for the assessment of certain cognitive abilities, such as verbal fluency or divergent thinking. A typical approach to the psychometric modeling of such tasks (e.g., Intelligence, 2016, 57, 25) is the Rasch Poisson Counts Model (RPCM; Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960), in which, similarly to the assumption of (essential) τ-equivalence in Classical Test Theory, tasks have equal discriminations—meaning that, beyond varying in difficulty, they do not vary in how strongly they are related to the latent variable. In this research, we question this assumption in the case of divergent thinking tasks, and propose instead to use a more flexible 2-Parameter Poisson Counts Model (2PPCM), which allows tasks to be characterized by both difficulty and discrimination. We further propose a Bifactor 2PPCM (B2PPCM) to account for local dependencies (i.e., specific/nuisance factors) emerging from tasks sharing similarities (e.g., similar prompts and domains). We reanalyze a divergent thinking dataset (Psychology of Aesthetics, Creativity, and the Arts, 2008, 2, 68) and find the B2PPCM to significantly outperform the 2PPCM, both outperforming the RPCM. Further extensions and applications of these models are discussed.
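To make the contrast between the RPCM and the 2PPCM concrete, the sketch below simulates count responses under one common parameterization, log(lambda) = delta_i + alpha_i * theta_p. The parameter values are illustrative and not taken from the cited dataset.

```r
# Simulating from a 2-Parameter Poisson Counts Model (2PPCM)-style model.
set.seed(1)
n_persons <- 500
theta <- rnorm(n_persons)               # latent ability
alpha <- c(0.6, 0.8, 1.0, 1.2, 1.4)     # item discriminations (free to vary in the 2PPCM)
delta <- c(1.5, 1.0, 0.8, 1.2, 0.5)     # item intercepts on the log scale

log_lambda <- outer(theta, alpha) + matrix(delta, n_persons, length(alpha), byrow = TRUE)
counts <- matrix(rpois(length(log_lambda), exp(log_lambda)),
                 n_persons, length(alpha))
# Under the RPCM, all alpha would be constrained equal; the 2PPCM relaxes this,
# and the B2PPCM would add specific factors for clusters of similar tasks.
```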
... In their paper, Garcia-Garzon et al. (2019) propose an extensive reanalysis of the dimensionality of the SPM-LS, using a large variety of techniques, including bifactor models and exploratory graph analysis. Storme et al. (2019) later find that the reliability-boosting strategy proposed in the original paper-which consisted of using nested logit models (Suh and Bolt 2010) to recover information from distractors-is useful in other contexts, using the example of a logical reasoning test applied in a personnel selection context. Moreover, Bürkner (2020) later presents how to use his R Bayesian multilevel modeling package brms (Bürkner 2017) in order to estimate various binary item response theory models, and compares the results with the frequentist approach used in the original paper with the item response theory package mirt (Chalmers 2012). ...
Article
Full-text available
It is perhaps popular belief-at least among non-psychometricians-that there is a unique or standard way to investigate the psychometric qualities of tests. If anything, the present Special Issue demonstrates that this is not the case. On the contrary, this Special Issue on the "analysis of an intelligence dataset" is, in my opinion, a window onto the present vitality of the field of psychometrics. Much like an invitation to revisit a story with various styles or with various points of view, this Special Issue was opened to contributions that offered extensions or reanalyses of a single-and somewhat simple-dataset, which had been recently published. The dataset was from a recent paper (Myszkowski and Storme 2018), and contained responses from 499 adults to a non-verbal logical reasoning multiple-choice test, the SPM-LS, which consists of the Last Series of Raven's Standard Progressive Matrices (Raven 1941). The SPM-LS is further discussed in the original paper (as well as through the investigations presented in this Special Issue), and most researchers in the field are likely familiar with the Standard Progressive Matrices. The SPM-LS is simply a proposition to use the last series of the test as a standalone test. A minimal description of the SPM-LS would probably characterize it as a theoretically unidimensional measure-in the sense that one ability is tentatively measured-comprised of 12 pass-fail non-verbal items of (tentatively) increasing difficulty. Here, I refer to the pass-fail responses as the binary responses, and the full responses (including which distractor was selected) as the polytomous responses. In the original paper, a number of analyses had been used, including exploratory factor analysis with parallel analysis, confirmatory factor analyses using a structural equation modeling framework, binary logistic item response theory models (1-, 2-, 3-, and 4-parameter models), and polytomous (unordered) item response theory models, including the nominal response model (Bock 1972) and nested logit models (Suh and Bolt 2010). In spite of how extensive the original analysis may have seemed, the contributions of this Special Issue present several extensions to our analyses. I will now briefly introduce the different contributions of the Special Issue, in chronological order of publication. In their paper, Garcia-Garzon et al. (2019) propose an extensive reanalysis of the dimensionality of the SPM-LS, using a large variety of techniques, including bifactor models and exploratory graph analysis. Storme et al. (2019) later find that the reliability-boosting strategy proposed in the original paper-which consisted of using nested logit models (Suh and Bolt 2010) to recover information from distractors-is useful in other contexts, using the example of a logical reasoning test applied in a personnel selection context. Moreover, Bürkner (2020) later presents how to use his R Bayesian multilevel modeling package brms (Bürkner 2017) in order to estimate various binary item response theory models, and compares the results with the frequentist approach used in the original paper with the item response theory package mirt (Chalmers 2012). Furthermore, Forthmann et al. (2020) later proposed a new procedure that can be used to detect (or select) items that could present discriminating distractors (i.e., items for which distractor responses could be used to extract additional information).
In addition, Partchev (2020) discusses issues that relate to the use of distractor information to extract information on ability in multiple-choice tests, in particular in the context of cognitive assessment, and presents how to use the R package dexter (Maris et al. 2020) to study the binary responses and distractors of the SPM-LS.