
Detecting Examinees With Item Preknowledge in Large-Scale Testing Using Extreme Gradient Boosting (XGBoost)

Abstract

Researchers frequently use machine-learning methods in many fields. In the area of detecting fraud in testing, however, relatively few studies have used these methods to identify potential testing fraud. In this study, a technical review of a recently developed state-of-the-art algorithm, Extreme Gradient Boosting (XGBoost), is provided, and the utility of XGBoost in detecting examinees with potential item preknowledge is investigated using a real data set that includes examinees who engaged in fraudulent testing behavior, such as illegally obtaining live test content before the exam. Four different XGBoost models were trained using different sets of input features based on (a) only dichotomous item responses, (b) only nominal item responses, (c) both dichotomous item responses and response times, and (d) both nominal item responses and response times. The predictive performance of each model was evaluated using the area under the receiver operating characteristic curve and several classification measures such as the false-positive rate, true-positive rate, and precision. For comparison purposes, the results from two person-fit statistics on the same data set were also provided. The results indicated that XGBoost successfully classified the honest test takers and fraudulent test takers with item preknowledge. In particular, the classification performance of XGBoost was reasonably good when the response time information and item responses were both taken into account.
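To make the modeling setup concrete, here is a minimal sketch, not the author's exact pipeline, of training one of the four variants, model (c) with dichotomous item responses plus response times, and scoring it with AUC and precision. All data, the feature layout, and the hyperparameters below are illustrative assumptions.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_score

rng = np.random.default_rng(0)
n_examinees, n_items = 1000, 50

# Hypothetical data: 0/1 scored responses, log response times, and labels
# flagging examinees known to have had item preknowledge.
responses = rng.integers(0, 2, size=(n_examinees, n_items))
log_times = rng.normal(4.0, 0.5, size=(n_examinees, n_items))
flagged = rng.integers(0, 2, size=n_examinees)

# Model (c): dichotomous responses and response times as input features.
X = np.hstack([responses, log_times])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, flagged, test_size=0.2, stratify=flagged, random_state=1)

clf = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print("AUC:      ", roc_auc_score(y_te, proba))
print("Precision:", precision_score(y_te, (proba > 0.5).astype(int)))
```

Variants (a), (b), and (d) would differ only in how the response columns are encoded (0/1 scoring versus one-hot encoded option choices) and in whether the time columns are included.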
... Recently, with the prevalence of computer-based testing, more data types can be collected, including process data such as response times and answer-change patterns. Several studies (e.g., Man et al., 2019; Zhou & Jiao, 2022; Zopluoglu, 2019) explored item response and response time data for cheating detection. Further, other studies (e.g., Man & Harring, 2020) included multi-modal data for cheating detection: assessment product data (item responses), process data (response times), and biometric data such as visual fixation counts. ...
... Further, their study found that features augmented from the original data, such as summary statistics of item response time, total test scores, and the number of attempts in taking the test, turned out to be effective in cheating detection. Zopluoglu (2019) incorporated data augmentation as well by converting item responses into strings of nominal response patterns and found better performance for models including the augmented features (i.e., nominal response patterns) in terms of AUC (an increase of 0.016) and precision (an increase of 8.8% at a false-positive rate of 0.01). Zhou and Jiao (2022) found that stacking learning with data augmentation improved cheating detection accuracy in terms of recall (up to 3 times), precision (up to 9 times), and F1 scores (up to 9 times). ...
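A hedged sketch of the nominal-response augmentation this excerpt describes: rather than scoring each item 0/1, the selected option itself is one-hot encoded. The option labels and data below are made up for illustration (requires scikit-learn >= 1.2 for the sparse_output argument).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
options = np.array(list("ABCD"))
nominal = rng.choice(options, size=(1000, 50))   # raw selected options

# One column per (item, option) pair instead of one 0/1 column per item.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
nominal_features = enc.fit_transform(nominal)
print(nominal_features.shape)                    # (1000, 200)
```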
Article
Full-text available
Machine learning methods have been explored for cheating detection in large-scale assessment in recent years. Most of these studies analyzed item responses and response time data. Though a few studies investigated data augmentation in the feature space, data augmentation in machine learning for cheating detection is far beyond thorough investigation. This study explored data augmentation of the feature space for the blending ensemble learning at the meta-model level for cheating detection. Four anomaly detection techniques assigned outlier scores to augment the meta-model's input data in addition to the most informative features from the original dataset identified by four feature selection methods. The performance of the meta-model with data augmentation was compared with that of each base model and the meta-model without data augmentation. Based on the evaluation criteria, the best-performing meta-model with data augmentation was identified. In general, data augmentation in the blending ensemble learning for cheating detection greatly improved the accuracy of cheating detection compared with other alternative approaches.
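As a rough illustration of the feature-space augmentation this abstract describes, outlier scores from anomaly detectors can be appended to the meta-model's input. The two detectors and the meta-model below are assumptions for the sketch, not necessarily the four detectors used in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))        # stand-in for the selected features
y = rng.integers(0, 2, size=500)      # stand-in cheating labels

# Outlier scores from two anomaly detectors augment the meta-model input.
iso_scores = IsolationForest(random_state=0).fit(X).score_samples(X)
lof_scores = LocalOutlierFactor().fit(X).negative_outlier_factor_

X_meta = np.column_stack([X, iso_scores, lof_scores])
meta_model = LogisticRegression(max_iter=1000).fit(X_meta, y)
```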
... Although process data may be analyzed with any statistical procedure, it has become popular to analyze them using classifiers from the field of machine learning; see, for example, Burlak et al. (2006), Chen and Chen (2017), Kim et al. (2017), Man et al. (2019), Ranger et al. (2020), Zhou and Jiao (2022), and Zopluoglu (2019). In machine learning, a detector of cheating is built from a classifier that learns to distinguish regular responders from cheaters. ...
... Furthermore, the credentialing data are representative of the data one would analyze for cheaters. It is a computer-based high-stakes test with a single-choice response format that is employed over a longer testing period; note that the credentialing data have been analyzed before by Boughton et al. (2017), Man et al. (2019), and Zopluoglu (2019), just to mention a few. ...
Article
Recent approaches to the detection of cheaters in tests employ detectors from the field of machine learning. Detectors based on supervised learning algorithms achieve high accuracy but require labeled data sets with identified cheaters for training. Labeled data sets are usually not available at an early stage of the assessment period. In this article, we discuss the approach of adapting a detector that was trained previously with a labeled training data set to a new unlabeled data set. The training and the new data set may contain data from different tests. The adaptation of detectors to new data or tasks is known as transfer learning in the field of machine learning. We first discuss the conditions under which a detector of cheating can be transferred. We then investigate whether the conditions are met in a real data set. We finally evaluate the benefits of transferring a detector of cheating. We find that a transferred detector has higher accuracy than an unsupervised detector of cheating. A naive transfer, consisting of a simple reuse of the detector, increases the accuracy considerably. A transfer via a self-labeling (SETRED) algorithm increases the accuracy slightly more than the naive transfer. The findings suggest that the detection of cheating might be improved by using existing detectors of cheating at an early stage of an assessment period.
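A minimal sketch of the "naive transfer" the abstract discusses: a detector trained on an earlier labeled assessment is simply reused to score a new, unlabeled one. The SETRED self-labeling step is omitted, and all data and the choice of classifier are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_old = rng.normal(size=(800, 30))       # earlier assessment, labeled
y_old = rng.integers(0, 2, size=800)
X_new = rng.normal(size=(400, 30))       # new assessment, unlabeled

# Naive transfer: reuse the trained detector on the new data as-is.
detector = RandomForestClassifier(random_state=0).fit(X_old, y_old)
cheating_scores = detector.predict_proba(X_new)[:, 1]
```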
... In recent years, machine learning algorithms have been explored by different researchers for cheating detection. Zopluoglu (2019) studied Extreme Gradient Boosting to detect item preknowledge. Man et al. (2019) explored both supervised (K-nearest neighbors, random forest, and support vector machine [SVM]) and unsupervised (K-means and self-organizing mapping) machine learning algorithms for fraud detection. ...
... For supervised machine learning, we need to split the data into a training set (for model training) and a test set (for model evaluation). Zopluoglu (2019) split the dataset into 80% for training and 20% for testing. This study applied a split of 75% training versus 25% test with K-fold cross-validation. ...
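For reference, a small sketch of the 75/25 split combined with K-fold cross-validation mentioned in this excerpt; the data and the per-fold model are placeholder assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 25)), rng.integers(0, 2, size=1000)

# 75% training / 25% test, then 5-fold cross-validation on the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i_tr, i_val in cv.split(X_tr, y_tr):
    model = LogisticRegression(max_iter=1000).fit(X_tr[i_tr], y_tr[i_tr])
    auc = roc_auc_score(y_tr[i_val], model.predict_proba(X_tr[i_val])[:, 1])
    print(f"fold AUC: {auc:.3f}")
```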
Article
Cheating detection in large-scale assessment received considerable attention in the extant literature. However, none of the previous studies in this line of research investigated the stacking ensemble machine learning algorithm for cheating detection. Furthermore, no study addressed the issue of class imbalance using resampling. This study explored the application of the stacking ensemble machine learning algorithm to analyze the item response, response time, and augmented data of test-takers to detect cheating behaviors. The performance of the stacking method was compared with that of two other ensemble methods (bagging and boosting) as well as six base non-ensemble machine learning algorithms. Issues related to class imbalance and input features were addressed. The study results indicated that stacking, resampling, and feature sets including augmented summary data generally performed better than their counterparts in cheating detection. Compared with other competing machine learning algorithms investigated in this study, the meta-model from stacking using discriminant analysis based on the top two base models (Gradient Boosting and Random Forest) generally performed the best when item responses and the augmented summary statistics were used as the input features with an under-sampling ratio of 10:1 among all the study conditions.
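A hedged sketch of the best-performing configuration the abstract reports: Gradient Boosting and Random Forest base models stacked under a discriminant-analysis meta-model, after under-sampling the majority class to roughly 10:1. The synthetic data and the manual under-sampling below are illustrative assumptions, not the study's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 40))
y = (rng.random(5000) < 0.02).astype(int)      # rare cheating class

# Under-sample the majority (honest) class to roughly 10:1.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=10 * len(pos), replace=False)
idx = np.concatenate([pos, neg])

# Stacking: base-model predictions feed a discriminant-analysis meta-model.
stack = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LinearDiscriminantAnalysis())
stack.fit(X[idx], y[idx])
```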
... This algorithm integrates weak learners to create a strong learner. However, the weak learners are created through residual fitting in this algorithm [39,40]. The XGBoost model extends the Taylor expansion of the cost function from first order to second order, using the second-order derivative information to make the model converge faster during learning. ...
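The second-order expansion this excerpt refers to can be written out explicitly. In the standard notation of Chen and Guestrin (2016), at boosting round t the loss l is expanded around the previous prediction:

```latex
% Standard XGBoost objective at boosting round t: the loss l is expanded
% to second order around the previous prediction \hat{y}_i^{(t-1)}.
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}
  \Big[ g_i\, f_t(\mathbf{x}_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(\mathbf{x}_i) \Big]
  + \Omega(f_t),
\quad
g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big),
\quad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big),
% with regularization over the tree's T leaves and leaf-weight vector w:
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^{2}.
```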
Article
Full-text available
Oil viscosity plays a prominent role in all areas of petroleum engineering, such as simulating reservoirs, predicting production rate, evaluating oil well performance, and even planning for thermal enhanced oil recovery (EOR) that involves fluid flow calculations. Experimental methods of determining oil viscosity, such as the rotational viscometer, are more accurate than other methods. The compositional method can also properly estimate oil viscosity; however, the composition of oil must be determined experimentally, which is costly and time-consuming. Therefore, the occasional inaccessibility of experimental data may make it inevitable to look for convenient methods for fast and accurate prediction of oil viscosity. Hence, in this study, the error in viscosity prediction has been minimized by taking into account the amount of dissolved gas in oil (solution gas–oil ratio: Rs) as a representative of oil composition, along with other conventional black oil features including temperature, pressure, and API gravity, by employing recently developed machine learning methods based on the gradient boosting decision tree (GBDT): extreme gradient boosting (XGBoost), CatBoost, and GradientBoosting. Moreover, the advantage of the proposed method lies in its independence from input viscosity data in each pressure region/stage. The results were then compared with well-known correlations and machine learning methods employing the black oil approach with least squares support vector machine (LSSVM) and the compositional approach with decision trees (DTs). XGBoost is offered as the best method with its greater precision and lower error. It provides an overall average absolute relative deviation (AARD) of 1.968%, halving the error of the compositional method and reducing that of the black oil method (saturated region) fivefold. This demonstrates proper viscosity prediction and corroborates the applied method's performance.
Article
Pan and Wollack (PW) proposed a machine learning method to detect compromised items. We extend the work of PW to an approach detecting compromised items and examinees with item preknowledge simultaneously and draw on ideas in ensemble learning to relax several limitations in the work of PW. The suggested approach also provides a confidence score, which is based on an autoencoder to represent our confidence that the detection result truly corresponds to item preknowledge. Simulation studies indicate that the proposed approach performs well in the detection of item preknowledge, and the confidence score can provide helpful information for users.
Article
In recent years, machine learning (ML) techniques have received more attention in detecting aberrant test‐taking behaviors due to their advantages over traditional data forensics methods. However, defining “True Test Cheaters” is challenging: unlike other fraud detection tasks such as flagging forged bank checks or credit card fraud, testing organizations often lack physical evidence to identify “True Test Cheaters” for training ML models. This study proposed a statistically defensible method of labeling “True Test Cheaters” in the data, demonstrated the effectiveness of using ML approaches to identify irregular statistical patterns in exam data, and established an analytical framework for evaluating and conducting real‐time ML‐based test data forensics. Classification accuracy and false negative/positive results are evaluated across different supervised‐ML techniques. The reliability and feasibility of operationally using this approach for an IT certification exam are evaluated using real data.
Article
Full-text available
Background Nodular thyroid disease is by far the most common thyroid disease and is closely associated with the development of thyroid cancer. Coal miners with chronic coal dust exposure are at higher risk of developing nodular thyroid disease. There are few studies that use machine learning models to predict the occurrence of nodular thyroid disease in coal miners. The aim of this study was to predict the high risk of nodular thyroid disease in coal miners based on five different machine learning (ML) models. Methods This is a retrospective clinical study in which 1,708 coal miners who were examined at the Huaihe Energy Occupational Disease Control Hospital in Anhui Province in April 2021 were selected and their clinical physical examination data, including general information, laboratory tests, and imaging findings, were collected. A synthetic minority oversampling technique (SMOTE) was used for sample balancing, and the data set was randomly split into a training and test dataset in a ratio of 8:2. Lasso regression and a correlation heat map were used to screen the predictors of the models; five ML models, Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Gaussian Naive Bayes (GNB), Multilayer Perceptron (MLP), and Complement Naive Bayes (CNB), were compared for their predictive efficacy; and the model with the highest AUC was selected as the optimal model for predicting the occurrence of nodular thyroid disease in coal miners. Results Lasso regression analysis identified Age, H-DLC, HCT, MCH, PLT, and GGT as predictor variables for the ML models; in addition, the heat map showed no significant correlation among the six variables. In predicting nodular thyroid disease, the AUC results of the five ML models were XGBoost (0.892), LR (0.577), GNB (0.603), MLP (0.601), and CNB (0.543), with the XGBoost model having the largest AUC; this model can be applied in clinical practice. Conclusion In this research, all five ML models were found to predict the risk of nodular thyroid disease in coal miners, with the XGBoost model having the best overall predictive performance. The model can assist clinicians in quickly and accurately predicting the occurrence of nodular thyroid disease in coal miners and in adopting individualized clinical prevention and treatment strategies.
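As an illustration of the SMOTE balancing and 8:2 split described in the Methods, here is a short sketch using the imbalanced-learn package; the study's exact tooling is an assumption, and the data below are synthetic placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1708, 6))               # the six screened predictors
y = (rng.random(1708) < 0.2).astype(int)     # minority: nodular disease

# SMOTE synthesizes minority-class examples until the classes balance,
# then the balanced data are split 8:2 into training and test sets.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)
```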
Article
Objectives: Recent studies have revealed changes in molecular subtypes of breast cancer (BC) after neoadjuvant therapy (NAT). This study aims to construct a non-invasive model for predicting molecular subtype alteration in breast cancer after NAT. Methods: Eighty-two estrogen receptor (ER)-negative/human epidermal growth factor receptor 2 (HER2)-negative or ER-low-positive/HER2-negative breast cancer patients who underwent NAT and completed baseline MRI were retrospectively recruited between July 2010 and November 2020. Subtype alteration was observed in 21 cases after NAT. A 2D-DenseUNet machine-learning model was built to perform automatic segmentation of breast cancer lesions. In total, 851 radiomic features were extracted from each MRI sequence (T2-weighted imaging, ADC, DCE, and contrast-enhanced T1-weighted imaging), in both the manual and auto-segmentation masks. All samples were divided into a training set (n = 66) and a test set (n = 16). An XGBoost model with 5-fold cross-validation was used to predict molecular subtype alterations in breast cancer patients after NAT. The predictive ability of these models was subsequently evaluated by the AUC of the ROC curve, sensitivity, and specificity. Results: A model consisting of three radiomics features from the manual segmentation of multi-sequence MRI achieved favorable predictive efficacy in identifying molecular subtype alteration in BC after NAT (cross-validation set: AUC = 0.908, independent test set: AUC = 0.864), whereas an automatic segmentation approach for BC lesions on the DCE sequence produced good segmentation results (Dice similarity coefficient = 0.720). Conclusions: A machine learning model based on baseline MRI is proven useful for predicting molecular subtype alterations in breast cancer after NAT. Key points: • Machine learning models using MRI-based radiomics signatures have the ability to predict molecular subtype alterations in breast cancer after neoadjuvant therapy, which subsequently affect treatment protocols. • The application of deep learning in the automatic segmentation of breast cancer lesions from MRI images shows the potential to replace manual segmentation.
Article
The study presents statistical procedures that monitor the functioning of items over time. We propose generalized likelihood ratio tests that surveil multiple item parameters and implement them with various sampling techniques to perform continuous or intermittent monitoring. The procedures examine the stability of item parameters across time and signal compromise as soon as they identify a significant parameter shift. The performance of the monitoring procedures was validated using simulated and real-assessment data. The empirical evaluation suggests that the proposed procedures perform adequately well in identifying parameter drift. They showed satisfactory detection power and gave timely signals while keeping error rates reasonably low. The procedures also showed superior performance when compared with existing methods. The empirical findings suggest that multivariate parametric monitoring can provide an efficient and powerful control tool for maintaining the quality of items. The procedures allow joint monitoring of multiple item parameters and achieve sufficient power using powerful likelihood-ratio tests. Based on the findings from the empirical experimentation, we suggest some practical strategies for performing online item monitoring.
Book
Full-text available
Artificial Intelligence in Highway Safety provides cutting-edge advances in highway safety using AI. The author is a highway safety expert. He pursues highway safety within its contexts, while drawing attention to the predictive power of AI techniques in solving complex problems for safety improvement. This book provides both theoretical and practical aspects of highway safety. Each chapter contains theory and its contexts in plain language with several real-world examples. It is suitable for anyone interested in highway safety and AI, and it provides an illuminating and accessible introduction to this fast-growing research trend. Material supplementing the book can be found at https://github.com/subasish/AI_in_HighwaySafety. It offers a variety of supplemental materials, including data sets and R codes.
Article
Full-text available
Modern web-based technology has greatly popularized computer-administered testing, also known as online testing. When these online tests are administered continuously within a certain “testing window,” many items are likely to be exposed and compromised, posing a test security concern. In addition, if the testing time is limited, another recognized aberrant behavior is rapid guessing, which refers to quickly answering an item without processing its meaning. Both cheating behavior and rapid guessing result in extremely short response times. This article introduces a mixture hierarchical item response theory model, using both response accuracy and response time information, to help differentiate aberrant behavior from normal behavior. The model-based approach is compared to the Bayesian residual-based fit statistic in both a simulation study and two real data examples. Results show that the mixture model approach consistently outperforms the residual method in terms of correct detection rate and false positive error rate, in particular when the proportion of aberrance is high. Moreover, the model-based approach is also able to identify compromised items more accurately than the residual method.
Article
Full-text available
Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute. (Notebooks are available at https://physics.bu.edu/~pankajm/MLnotebooks.html )
Article
Full-text available
Item preknowledge describes a situation in which a group of examinees (called aberrant examinees) have had access to some items (called compromised items) from an administered test prior to the exam. Item preknowledge negatively affects both the corresponding testing program and its users (e.g., universities, companies, government organizations) because scores for aberrant examinees are invalid. In general, item preknowledge is hard to detect due to multiple unknowns: unknown groups of aberrant examinees (at unknown test centers or schools) accessing unknown subsets of items prior to the exam. Recently, multiple statistical methods were developed to detect compromised items. However, the detected subset of items (called the suspicious subset) naturally has an uncertainty due to false positives and false negatives. The uncertainty increases when different groups of aberrant examinees had access to different subsets of items; thus, compromised items for one group are uncompromised for another group and vice versa. The impact of uncertainty on the performance of eight statistics (each relying on the suspicious subset) was studied. The measure of performance was based on the receiver operating characteristic curve. Computer simulations demonstrated how uncertainty combined with various independent variables (e.g., type of test, distribution of aberrant examinees) affected the performance of each statistic.
Article
Response‐time models are of increasing interest in educational and psychological testing. This article focuses on the lognormal model for response times, which is one of the most popular response‐time models, and suggests a simple person‐fit statistic for the model. The distribution of the statistic under the null hypothesis of no misfit is proved to be a χ2 distribution. A simulation study and a real data example demonstrate the usefulness of the suggested statistic.
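One natural form of such a statistic, sketched here as a hedged reconstruction rather than the article's exact definition, sums the squared standardized log-time residuals of van der Linden's lognormal model:

```latex
% Under van der Linden's lognormal model, \ln T_{ij} is normal with mean
% \beta_j - \tau_i (item time intensity minus person speed) and standard
% deviation 1/\alpha_j, so the standardized residuals are standard normal:
z_{ij} = \alpha_j \big( \ln t_{ij} - \beta_j + \tau_i \big),
\qquad
L_i = \sum_{j=1}^{J} z_{ij}^{2} \;\sim\; \chi^{2}_{J}.
% Degrees of freedom shrink when \tau_i is estimated from the same times.
```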
Article
Repeatedly using items in high-stake testing programs provides a chance for test takers to have knowledge of particular items in advance of test administrations. A predictive checking method is proposed to detect whether a person uses preknowledge on repeatedly used items (i.e., possibly compromised items) by using information from secure items that have zero or very low exposure rates. Responses on the secure items are first used to estimate a person’s proficiency distribution, and then the corresponding predictive distribution for the person’s responses on the possibly compromised items is constructed. The use of preknowledge is identified by comparing the observed responses to the predictive distribution. Different estimation methods for obtaining a person’s proficiency distribution and different choices of test statistic in predictive checking are considered. A simulation study was conducted to evaluate the empirical Type I error and power rate of the proposed method. The simulation results suggested that the Type I error of this method is well controlled, and this method is effective in detecting preknowledge when a large proportion of items are compromised even with a short secure section. An empirical example is also presented to demonstrate its practical use.
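The predictive-checking logic can be illustrated with a toy sketch: ability is estimated from the secure items, the predictive distribution of the score on the possibly compromised items is simulated, and an extreme observed score yields a small predictive p-value. The Rasch model, the maximum-likelihood ability estimate, and all data below are illustrative assumptions, not the article's exact estimation choices.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
b_secure = rng.normal(size=20)               # secure-item difficulties
b_comp = rng.normal(size=30)                 # possibly compromised items
u_secure = rng.integers(0, 2, size=20)       # observed secure responses
observed_comp_score = 28                     # suspiciously high score

def p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Point-estimate ability from the secure items (root of the score equation;
# perfect/zero scores are clipped so the estimate stays finite).
score = min(max(int(u_secure.sum()), 1), 19)
theta_hat = brentq(lambda t: p(t, b_secure).sum() - score, -6, 6)

# Simulate the predictive distribution of the compromised-section score
# and locate the observed score in its upper tail.
sim = (rng.random((10000, b_comp.size)) < p(theta_hat, b_comp)).sum(axis=1)
p_value = (sim >= observed_comp_score).mean()
print("predictive p-value:", p_value)
```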
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
Article
An increasing concern of producers of educational assessments is fraudulent behavior during the assessment (van der Linden, 2009). Benefiting from item preknowledge (e.g., Eckerly, 2017; McLeod, Lewis, & Thissen, 2003) is one type of fraudulent behavior. This article suggests two new test statistics for detecting individuals who may have benefited from item preknowledge; the statistics can be used for both nonadaptive and adaptive assessments that may include either or both of dichotomous and polytomous items. Each new statistic has an asymptotic standard normal null distribution. It is demonstrated in detailed simulation studies that the Type I error rates of the new statistics are close to the nominal level and that the power values of the new statistics are larger than those of an existing statistic for addressing the same problem.
Article
This article addresses the issue of how to detect item preknowledge using item response time data in two computer-based large-scale licensure examinations. Item preknowledge is indicated by an unexpectedly short response time together with a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article.