Article

Classification Performance of Answer-Copying Indices Under Different Types of IRT Models


Abstract

Test fraud has recently received increased attention in the field of educational testing, and comprehensive integrity analyses after test administration are recommended for investigating different types of potential test fraud. One type of test fraud involves answer copying between two examinees, and numerous statistical methods have been proposed in the literature to screen for unusual response similarity or irregular response patterns on multiple-choice tests. The current study examined the classification performance of answer-copying indices, measured by the area under the receiver operating characteristic (ROC) curve, under different item response theory (IRT) models (the one- [1PL], two- [2PL], and three-parameter [3PL] models and the nominal response model [NRM]) using both simulated and real response vectors. The results indicated that, although nominal response outcomes produced a slight performance gain in the low-copying (20%) conditions, the indices performed similarly under the dichotomous models in the 40% and 60% copying conditions. The results also indicated that performance with simulated response vectors was reproduced almost identically with real response vectors.
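As a concrete illustration of the evaluation criterion, the AUC can be computed by treating index values from copying pairs as positives and values from independent pairs as negatives, then applying the rank (Mann-Whitney) formulation. The Python sketch below is illustrative only; the index values, effect size, and sample sizes are invented rather than taken from the study.

```python
import numpy as np
from scipy.stats import rankdata

def auc_from_scores(pos, neg):
    """AUC = P(a randomly chosen copying pair gets a higher index value
    than a randomly chosen independent pair), via the Mann-Whitney U."""
    ranks = rankdata(np.concatenate([pos, neg]))   # average ranks for ties
    u = ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2.0
    return u / (len(pos) * len(neg))

# invented index values: 500 copying pairs vs. 10,000 independent pairs
rng = np.random.default_rng(1)
pos = rng.normal(2.0, 1.0, 500)      # indices tend to run higher under copying
neg = rng.normal(0.0, 1.0, 10_000)
print(round(auc_from_scores(pos, neg), 3))   # about 0.92 for these settings
```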


... We designed the levels of the variables considering previous similar studies and real-life conditions. While previous studies defined test difficulty at three levels (easy, medium, and difficult), we fixed this variable at the medium level because it reflects real-life conditions (Sunbul & Yormaz, 2018; Zopluoglu, 2016). The abilities of the copier and the source are another variable that might affect the power of a copying index (Sotaridona & Meijer, 2002; Steinkamp, 2017; Sunbul & Yormaz, 2018). ...
... In previous studies, the test length was commonly defined as 40 or 80 items. Because real-world large-scale tests often include approximately 40 items in a subtest, we fixed the test length at 40 (Sotaridona & Meijer, 2002; Sunbul & Yormaz, 2018; Yormaz & Sunbul, 2017; Wollack, 1997, 2003; Zopluoglu, 2016). Following the related literature, the copier ratio was manipulated as 5% and 15% (Steinkamp, 2017). ...
... Previous studies comparing the power and Type I error of copying indices used both small and large sample sizes (from 50 to 10,000). However, to avoid biased estimation of the item and person parameters, we manipulated the sample size as 1,000 and 3,000 (Hurtz & Weiner, 2018; Sunbul & Yormaz, 2018; van der Linden & Sotaridona, 2006; Yormaz & Sunbul, 2017; Wollack, 2003; Zopluoglu, 2016; Zopluoglu & Davenport, 2012). Based on the relevant literature, we manipulated the copying percentage as low (10%), medium (30%), and high (60%). ...
... The expected value of each response alternative is estimated using nominal response model estimates, conditional on the ability parameters of the probable copier and source. Although ω's original formula was developed using the nominal response model and the option matrix of a multiple-choice test, ω can also be computed on dichotomous data sets using the 1PL, 2PL, or 3PL IRT models [25], [26]. Several studies in the literature compare ω for nominal and dichotomous responses. ...
... Numerous studies have compared answer-copying detection indices across several conditions in terms of Type I error and power [5]-[7], [12], [13], [15], [19], [20], [23], [25], [26], [28], [30]-[33]. The literature indicates that two of these indices, ω [23] and GBT [28], stand out for low Type I error rates and high detection power [19], [20], [23], [25], [26], [30]-[34]. However, few studies have compared the performance of these two indices directly. ...
Article
Full-text available
In this study, the Type I error and power rates of the ω and GBT (generalized binomial test) indices were investigated for several nominal alpha levels and for 40- and 80-item test lengths with a 10,000-examinee sample size under several test-level restrictions. Type I error rates of both indices were found to be below the acceptable nominal alpha levels. The power study showed that average test difficulty strongly affected the power (true detection) rates of the indices, with a clear pattern of increasing power for both ω and GBT as test difficulty increased. Contrary to expectations, average test discrimination was not as influential as average test difficulty. The interaction of item discrimination and difficulty showed that both ω and GBT had weak power for items with b parameters below 0 and weak discrimination, and their power was also very weak for items with b parameters below zero and high discrimination. Power rates of both ω and GBT tended to increase with test length. Finally, ω performed slightly better than, or very close to, GBT at the 80-item test length, and better than GBT in terms of power at the 40-item test length.
... Table 1 lists the main response-similarity and person-fit indexes, which perform better under certain conditions. The indexes recommended by Karabatsos (2003), Haney and Clarke (2007), de la Torre and Deng (2008), Guo and Drasgow (2010), Belov (2011, 2015), Eckerly, Babcock, and Wollack (2015), Doyoung et al. (2017), Maynes (2017), Wollack and Cizek (2017), Zopluoglu (2016, 2017, 2019a, 2019b), and Sanz-Velasco, Luzardo, García, and Abad (2020), all obtained through simulation studies, were selected. ...
Article
Full-text available
The shift from face-to-face teaching to distance learning, as a measure to confront COVID-19, created the need to validate the results of tests taken in electronic format. Students are thought to find it easier to cheat on tests taken remotely. The objective of this article is to present the study of faking as a contribution to the analysis of the psychometric validity of tests. Through a literature review, the concept of faking and its types are analyzed. The main methods for detecting it are presented; these can be used to ensure the validity of results on synchronous, multiple-choice tests. The uses, strengths, and limitations of the presented methods are described. Finally, the main challenges that remain for validating the results of synchronous tests taken remotely are discussed.
Article
A potential negative consequence of high-stakes testing is inappropriate test behaviour involving individuals and/or institutions. Inappropriate test behaviour and test collusion can result in aberrant response patterns and anomalous test scores and invalidate the intended interpretation and use of test results. A variety of statistical techniques have been developed to investigate anomaly in test results at individual test-taker and group levels. This article provides a brief introduction to some of the most widely studied techniques, with a particular focus on their underlying concepts and principles employed to derive indices for identifying anomalous test results. Detailed mathematical derivations of the various statistical models discussed are not given, but relevant references are provided.
Article
Full-text available
This article presents the Variable Match Index (VM-Index), a new statistic for detecting answer copying. The power of the VM-Index relies on two-dimensional conditioning as well as the structure of the test. The asymptotic distribution of the VM-Index is analyzed by reduction to Poisson trials. A computational study comparing the VM-Index with the K-Index demonstrates that with the VM-Index, there is a large decrease in the Type II error rate and a smaller decrease in the Type I error rate.
Article
Full-text available
This article presents a new method to detect copying on a standardized multiple-choice exam. The method combines two statistical approaches in successive stages. The first stage uses Kullback-Leibler divergence to identify examinees, called subjects, who have demonstrated inconsistent performance during an exam. For each subject the second stage uses the K-Index to search for a possible source of the responses. Both stages apply a hypothesis test given a significance level. Monte Carlo methods are applied to approximate empirical distributions and then compute critical values providing a low Type I error rate and a good copying-detection rate. The results with both simulated and empirical data demonstrate the effectiveness of this approach.
Article
Full-text available
The generalized binomial test (GBT) and ω indices are the most recent methods suggested in the literature to detect answer-copying behavior on multiple-choice tests. The ω index is one of the most studied indices, but there has not yet been a systematic simulation study of the GBT index. In addition, the effect of the ability levels of the examinees in answer-copying pairs on the statistical properties of the GBT and ω indices has not been systematically addressed. The current study simulated 500 answer-copying pairs for each of 1,440 conditions (12 source ability levels × 12 cheater ability levels × 10 amounts of copying) to study empirical power, and 10,000 pairs of independent response vectors for each of 144 conditions (12 source ability levels × 12 cheater ability levels) to study the empirical Type I error rates of the GBT and ω indices. Results indicate that neither GBT nor ω inflated the Type I error rates, and both are reliable to use in practice. The difference in statistical power between the two methods was very small, with GBT performing slightly better than ω. The main effect of the amount of copying and the interaction between source ability level and the amount of copying were found to be very strong, while all other main and interaction effects were negligible.
Article
Full-text available
This study examines university students' behaviors, attitudes, and beliefs related to academic dishonesty using data collected in 1984, 1994, and 2004. We are unaware of any other research program that has used the same instrument to monitor academic dishonesty at the same institution over such a long period of time. Several authors have critiqued the academic dishonesty literature, questioning the validity of comparing historical and recent studies (Brown & Emmett, 2001; Graham, Monday, O'Brien, & Steffen, 1994; Whitley, 1998; Whitley, Nelson, & Jones, 1999) since different studies have measured academic dishonesty in many different ways (Vowell and Chen, 2004). Whitley et al. (1999) stated, "Some of this variance [in reported cheating incidence rates], perhaps a substantial degree, could be due to the wide range of measures used to assess both cheating behavior and attitudes…In the case of both attitudes and behavior the studies used too many different operational definitions to allow assessment of the relationship between operational definition and effect size" (pg. 667). Brown and Emmett (2001) have also questioned studies that report high levels of college cheating, suggesting that these studies might simply be defining cheating in broader terms. In the current study, students were defined as "cheaters" if they reported cheating at some time in their college career on quizzes, exams, or assignments, however they defined those terms. All others were defined as "noncheaters." This same rule was also followed in 1984 and 1994. In 1984, we found that 54% of students admitted to cheating and we characterized these cheaters as immature, lacking educational commitment, and likely to use neutralizing attitudes to lessen guilt associated with cheating (Haines, Diekhoff, LaBeff, & Clark, 1986). Cheating increased in 1994 to 61%. This increase was significant and suggested that academic dishonesty was on the rise. Cheaters continued to neutralize more than noncheaters; however, both cheaters and noncheaters evidenced less neutralizing than the 1984 cohort. Even as cheating increased, neutralizing decreased, indicating to us that academic dishonesty had become so normative that it was no longer viewed by students as a deviant behavior that needed to be justified (Diekhoff et al., 1996). The recent literature has reported similarly high rates of overall academic dishonesty, with reports ranging from 52-90% (Genereux & McLeod, 1995; Graham et al., 1994; Lester & Diekhoff, 2002; McCabe & Bowers, 1994; Vowell and Chen, 2004). Academic dishonesty percentages are lower if one looks at behavior within a specific semester. For example, Jordan (2001) found that only 31% of students cheated on an exam or paper during one semester. In addition, 9% of the students in the Jordan study committed 75% of the cheating acts. These studies suggested that most students engage in cheating at some point during their academic career; however, a much smaller percent cheats in any given semester. External factors (e.g., fear of detection and punishment) appear to be more effective in deterring cheating than internal factors (e.g., guilt) (Diekhoff et al., 1996; Genereux & McLeod, 1995; Graham et al., 1994). In 1994, we found that external factors ranked as the top 4 out of 6 deterrents to cheating. First and foremost was the embarrassment of being caught by a faculty member. Being dropped by the instructor ranked second, followed by fear of the university's response, and receiving an 'F.' 
Guilt ranked fifth, and fear of disapproval by one's friends showed the least deterrent effect (Diekhoff et al., 1996). Genereux and McLeod (1995) and Burns, Davis, Hoshino, and Miller (1998) also reported that the threat of punishment, such as fear of expulsion, was a top deterrent to cheating. Additional external deterrents included instructor vigilance and spacing in the exam room (Genereux & McLeod, 1995). Thus, the reduction of academic dishonesty depends primarily on faculty and institutional actions. Unfortunately, the literature is quite clear on how disengaged faculty and university administrators are from student cheating. Diekhoff, LaBeff, Shinohara, and Yasukawa (1999) reported that only 3% of cheaters reported having ever been caught, and Jendrek (1989) and McCabe (1993) found that most faculty members are reluctant to follow official university policies and procedures in handling student cheating. Seventy-one percent of the faculty surveyed in a national sample stated that confronting a student about cheating is one of the most negative...
Article
Full-text available
Academic dishonesty is a fundamental issue for the academic integrity of higher education institutions, and one that has lately been gaining increasing media attention. This study reports on a survey of 1206 students and 190 academic staff across four major Queensland universities in relation to student academic misconduct. The aim of the survey was to determine the prevalence of academic misconduct, and to investigate the extent to which perceptions of dishonesty are shared between students and staff, as preliminary steps toward developing effective strategies to deal with the academic dishonesty/misconduct problem. Results indicate a higher tolerance for academic misconduct by students in comparison to staff, particularly with respect to falsification of research results and plagiarism, as well as considerable underestimation by staff of the prevalence of virtually all forms of student academic misconduct. Overall, the study’s findings confirm the significance of the issue of academic dishonesty within the Australian tertiary sector, indicating considerable divergence between students and staff in terms of perceptions of the seriousness and prevalence of student academic misconduct. We suggest that university administrators need to examine this issue closely in order to develop mechanisms for managing and curtailing the level of academic misconduct, since a failure to do so may lead to a further undermining of the academic integrity of the Australian tertiary sector.
Article
Full-text available
Academic cheating has become a widespread problem among high school and college students. In this study, 490 students (ages 14 to 23) evaluated the acceptability of an act of academic dishonesty under 19 different circumstances where a person's motive for transgressing differed. Students' evaluations were related to self-reports of cheating behavior, sex, school grade, and psychological variables. Results indicated that high school and college students took motives into account when evaluating the acceptability of academic cheating. Cheating behavior was more common among those who evaluated cheating leniently, among male students, and among high schoolers. Also, acceptance of cheating and cheating behavior were negatively related to self-restraint, but positively related to tolerance of deviance. The results are discussed with reference to biological, cultural, and developmental factors.
Article
Full-text available
A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented. It is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject. Moreover, this probability of a correct ranking is the same quantity that is estimated by the already well-studied nonparametric Wilcoxon statistic. These two relationships are exploited to (a) provide rapid closed-form expressions for the approximate magnitude of the sampling variability, i.e., standard error that one uses to accompany the area under a smoothed ROC curve, (b) guide in determining the size of the sample required to provide a sufficiently reliable estimate of this area, and (c) determine how large sample sizes should be to ensure that one can statistically detect differences in the accuracy of diagnostic techniques.
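The closed-form expressions mentioned in (a) are simple enough to state directly. The sketch below implements the Hanley-McNeil standard error of an AUC using their Q1 and Q2 terms; the example inputs (area = .85, 60 subjects per group) are arbitrary.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Closed-form standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2.0 - auc)             # P(two positives both outrank one negative)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)  # P(one positive outranks two negatives)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

print(round(hanley_mcneil_se(0.85, 60, 60), 3))   # about 0.036
```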
Article
Despite a plethora of research on the academic misconduct carried out by U.S. high school and undergraduate university students, little research has been done on the academic misconduct of Canadian students. This paper addresses this shortcoming by presenting the results of a study conducted at 11 Canadian higher education institutions between January 2002 and March 2003. We maintain that academic misconduct does indeed occur in Canada – amongst high school, undergraduate and graduate students. Common self-reported behaviours were as follows: working on an assignment with others when asked for individual work, getting questions and answers from someone who has already taken a test, copying a few sentences of material without footnoting, fabricating or falsifying lab data, and receiving unauthorized help on an assignment. Possible factors associated with these behaviours include student maturity, perceptions of what constitutes academic misconduct, faculty assessment and invigilation practices, low perceived risk, ineffective and poorly understood policies and procedures, and a lack of education on academic misconduct. Canadian educational institutions are encouraged to address these issues, beginning with a recommitment to academic integrity.
Chapter
This chapter discusses the feasibility of the person response curve (PRC) approach for investigating the fit of persons to the three-parameter item response theory (IRT) model. To operationalize the PRC, it subdivides ability test items into separate strata of varying difficulty levels. The limited literature on person variability within a test, thus, seems to have three major trends: (1) the direct analysis of person variability as originally suggested by Mosier, later called the testee's trace line by Weiss, the subject characteristic curve by Vale and Weiss, and the person characteristic curve by Lumsden, (2) the designation of highly variable persons as aberrant by Levine and Rubin, (3) the elimination of aberrant person-item interactions by Wright. A careful analysis of these three approaches indicates that the first approach is the most general of the three, subsuming the other two as special cases: If the entire pattern of a testee's responses is studied as a function of the difficulty levels of the items, the identification of aberrant response patterns or person-item interactions follows directly. In addition, postulating a person characteristic curve in conjunction with IRT provides a means of testing whether the response patterns of single individuals fit the theory, regardless of the number of parameters assumed.
Article
Cheating on multiple-choice examinations is a serious problem not easily overcome by using more test forms, more proctors, or larger testing rooms. A statistical procedure compares answers for pairs of students using those items on which both made errors. If the number of identical wrong answers is sufficiently greater than the number expected by chance and if the students were seated close together, then cheating is likely. Using this analysis with 90 examinations has suggested ways to discourage cheating and demonstrated some limitations of the procedure.
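A minimal sketch of this style of error-similarity check, under the simplifying assumption of a flat chance match rate across items; in the actual procedure the chance rate is estimated from how often each wrong option is chosen, and seating proximity is examined separately. The answer strings below are hypothetical.

```python
import numpy as np
from scipy.stats import binom

def error_similarity_p(resp_a, resp_b, key, p_match=0.25):
    """P(at least the observed number of identical wrong answers among
    jointly missed items, given a chance match rate p_match)."""
    both_wrong = (resp_a != key) & (resp_b != key)
    identical = both_wrong & (resp_a == resp_b)
    n, k = int(both_wrong.sum()), int(identical.sum())
    return binom.sf(k - 1, n, p_match)    # upper tail: P(X >= k)

key   = np.array(list("ABCDABCDAB"))      # hypothetical 10-item key
stu_a = np.array(list("ABDCABCDBB"))      # hypothetical answer strings
stu_b = np.array(list("ABDCABCDBB"))
print(error_similarity_p(stu_a, stu_b, key))   # 3 of 3 shared errors match
```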
Article
Test security and other concerns can lead to an interest in the assessment of how unusual it is for the answers of two different examinees to agree as much as they do. At Educational Testing Service, a measure called the K-index is used to assess ‘unusual agreement’ between the incorrect answers of two examinees on a multiple-choice test. Here, I describe the K-index and give the results of an empirical study of some of the assumptions that underlie it and its use. The results of this study show that the K-index can be expected to give a conservative estimate of the probability of chance agreement in the typical situations for which it is used, and that several important assumptions underlying the K-index are supported by relevant data. In addition, the results presented here suggest a minor modification of the current (as of 1993) application of K-index to part of the SAT to better insure that it is a conservative measure of chance agreement.
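At its core, this kind of screening reduces to a binomial tail probability on matching incorrect answers. The sketch below shows only that final step, with the chance match rate p passed in as an assumed input; the actual K-index derives it from examinees in the same wrong-score group, a step not reproduced here.

```python
from scipy.stats import binom

def k_index_tail(m, wc, p):
    """Binomial tail P(matches >= m): m = incorrect answers matching the
    source, wc = number of items the suspected copier answered wrong,
    p = assumed chance match rate per jointly wrong item."""
    return binom.sf(m - 1, wc, p)

print(k_index_tail(m=12, wc=20, p=0.3))   # ~0.005: unusually high agreement
```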
Article
Comparison data on SAT were collected on pairs of examinees in three samples for later use in detecting instances of willful copying. In each sample the answer sheet of each examinee was compared with the answer sheet of every other examinee to determine the expected degree of similarity among answer sheets for “honest” examinees. Eight detection indices were developed and bivariate distributions were run for possible operational use in making future judgments regarding examinees who were actually suspected of copying. The results of a series of analyses permitted the elimination of six of the eight indices. The remaining two have been used successfully at Educational Testing Service since 1970.
Article
A statistical test for detecting answer copying on multiple-choice items is presented. The test is based on the exact null distribution of the number of random matches between two test takers under the assumption that the response process follows a known response model. The null distribution can easily be generalized to the family of distributions of the number of random matches under the alternative hypothesis of answer copying. It is shown how this information can be used to calculate such features as the maximum, minimum, and expected values of the power function of the test. For the case of the nominal response model, the test is an alternative to the one based on statistic ω. The differences between the two tests are discussed and illustrated using empirical results.
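The exact null distribution of the number of random matches is a generalized binomial (Poisson-binomial) distribution, which can be built by convolving per-item match probabilities. The sketch below assumes those probabilities have already been derived from a fitted response model; the values used are invented.

```python
import numpy as np

def match_tail_prob(p_item, observed):
    """Exact P(total matches >= observed) when item j matches independently
    with probability p_item[j]: a generalized binomial (Poisson-binomial)
    distribution, built up one item at a time by convolution."""
    dist = np.array([1.0])                      # P(0 matches) before any item
    for p in p_item:
        dist = np.convolve(dist, [1.0 - p, p])  # fold in one item
    return dist[observed:].sum()

rng = np.random.default_rng(7)
p_item = rng.uniform(0.2, 0.6, 40)   # invented model-based match probabilities
print(match_tail_prob(p_item, observed=30))   # tiny value => flag the pair
```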
Article
A review was conducted of the results of 107studies of the prevalence and correlates of cheatingamong college students published between 1970 and 1996.The studies found cheating to be more common in the 1969-75 and 1986-96 time periods thanbetween 1976 and 1985. Among the strongest correlates ofcheating were having moderate expectations of success,having cheated in the past, studying under poor conditions, holding positive attitudes towardcheating, perceiving that social norms support cheating,and anticipating a large reward for success. However, animportant limitation on the conclusions drawn from this research is that many variables wereincluded in only one or a few studies. A model of theantecedents of cheating is proposed and the implicationsof this model for the identification of students at risk for cheating and controlling cheatingare discussed.
Article
Previous work on the use of ω for detection of answer copying was based on the assumption that item parameters for the nominal response model were known. Such an assumption limits the usefulness of ω, particularly in the classroom, because most teachers do not have a set of precalibrated items. This study investigated empirical Type I error rates and the power of ω when the item and trait (θ) parameters were unknown and estimated from datasets of 100 and 500 examinees. Type I error rate was unaffected by estimating item parameters from the data. Power was slightly lower in the 100-examinee datasets, but was almost identical in the 500-examinee datasets.
Article
An index of collaboration between testees based upon the incidence of the same errors is reported. "1. The index was found reasonably effective in the identification of collaboration despite the inability of the administrator to detect its existence 2. The index was most effective in the identification of large scale one-way collaboration involving the copying of at least 16% of the answers from a single adjacent test-taker. 3. Two-way active collaboration was identified when only 10% of the answers were shared by two individuals. 4. Identification of collaboration was least effective when an individual copied answers from several other test-takers in a sporadic and unsystematic manner."
Article
This study deals with theoretical and methodological problems relating to the comparison of test results of individuals and groups with differing cultural backgrounds. An analysis of the score patterns on test items can be used to identify individuals and groups for whom the test scores have a deviant meaning. A statistic is developed by which the deviance of score patterns can be quantified and which indicates whether or not a person's test score can be compared with the scores in a specific group. In two simulation studies the discriminative power of deviance scores is investigated by means of computer-generated score patterns. The last part of the study deals with the application of deviance scores to cross-cultural research. It is shown that deviance scores can be used to identify individuals for whom the test scores give an underestimation of their capacities.
Article
This paper reports the development of indices reflecting the probability that the observed correspondence between the multiple-choice test responses of two examinees was due to chance. Applications of the indices are presented both with respect to apprehending persons who cheat by copying answers and with respect to monitoring the prevalence of this form of cheating in order to evaluate methods of preventing it.
Article
For a set of k items having nonintersecting item response functions (IRFs), the H coefficient (Loevinger, 1948; Mokken, 1971) applied to a transposed persons-by-items binary matrix, HT, has a non-negative value. Based on this result, a method is proposed for using HT to investigate whether a set of IRFs intersect. Results from a Monte Carlo study support the proposed use of HT. These results support the use of HT as an extension to Mokken's nonparametric item response theory approach.
Article
This report describes a program implemented to raise awareness of the dramatic increases in reported academic cheating, to influence the attitudes of the targeted students towards cheating, and ultimately to reduce incidents of cheating in the targeted classrooms. The targeted populations consisted of middle school students in growing middle-class urban and suburban communities in northeast Illinois. Problems of increasing academic cheating were documented through data from national reports, educational institutions, and the targeted schools in the project. Analysis of the probable causes in the literature revealed that although there are multiple causes of cheating, four categories are most prevalent: (1) societal value of extrinsic over intrinsic rewards; (2) lack of clarity of the definition of cheating; (3) consequences for cheating that are not severe; and (4) societal acceptance of cheating. A review of the solution strategies suggested by the literature yielded dozens of solutions; the one chosen for the project was selected because an extensive research base supported the approach. The researchers chose an action plan that implemented a character-education program that clearly defined cheating and its consequences and adjusted students' attitudes about cheating. Postintervention data indicated that incidents of observable cheating decreased by more than 50% after the action plan had been completed, and that students' awareness of the consequences of cheating increased. Postintervention data also suggested that students are less likely to cheat when a clear definition is communicated and the consequences for cheating are clearly stated. Four appendixes contain teacher and student surveys, an observation checklist, and parent consent letters for the study.
Article
An index is proposed to detect cheating on multiple-choice examinations, and its use is evaluated through simulations. The proposed index is based on the compound binomial distribution. In total, 360 simulated data sets reflecting 12 different cheating (copying) situations were obtained and used to study the sensitivity of the index in detecting cheating and the conditions that affect its effectiveness. A computer program in C was written to analyze each data set. The simulated data sets were also used to compare an index developed by R. Frary and others (1977) and error-similarity analysis (F. Bellezza and S. Bellezza, 1989). In general, the new index was effective in detecting cheaters as long as enough items were copied. It was sensitive enough to detect cheating when between 25% and 50% of the items were copied on a 50-item test, but was less sensitive when the test was shorter. It was also less sensitive when there were fewer cheaters in a class. Although effectiveness is influenced by test length, it is not influenced by class size. Similarities and differences among the three indexes are discussed.
Article
When examinees copy answers to test questions from other examinees, the validity of the test is compromised. Most available statistical procedures for detecting copying were developed out of classical test theory (CTT); hence, they suffer from sample-dependent score and item statistics, and biased estimates of the expected number of answer matches between a pair of examinees. Item response theory (IRT) based procedures alleviate these problems; however, because they fail to compare the similarity of responses between neighboring examinees, they have relatively poor power for detecting copiers. A new IRT-based test statistic, ω, was compared with the best CTT-based index, g2, under various copying conditions, amounts of copying, test lengths, and sample sizes. ω consistently held the Type I error rate at or below the nominal level; g2 yielded substantially inflated Type I error rates. The power of ω varied as a function of both test length and the percentage of items copied. ω demonstrated good power to detect copiers, provided that at least 20% of the items were copied on an 80-item test and at least 30% were copied on a 40-item test. Based on these results, with regard to both Type I error rate and power, ω appears to be more useful than g2 as a copying index.
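In spirit, ω standardizes the observed number of copier-source matches against the count expected under the fitted response model and refers the result to a normal distribution. A minimal sketch, with the per-item model probabilities supplied as assumed inputs rather than estimated from data:

```python
import numpy as np
from scipy.stats import norm

def omega(matches, p_match):
    """ω-style index: standardize observed copier-source matches against the
    model-expected count; p_match[j] = probability that the copier would give
    the source's response to item j by chance (assumed inputs here)."""
    expected = p_match.sum()
    sd = np.sqrt((p_match * (1.0 - p_match)).sum())
    w = (matches - expected) / sd
    return w, norm.sf(w)               # statistic and one-sided p-value

p = np.full(40, 0.35)                  # invented per-item probabilities
print(omega(matches=25, p_match=p))    # w ≈ 3.65, p ≈ 1e-4
```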
Article
Many of the currently available statistical indexes to detect answer copying lack sufficient power at small α levels or when the amount of copying is relatively small. Furthermore, no one index is uniformly best; depending on the type or amount of copying, certain indexes are better than others. The purpose of this article was to explore the utility of simultaneously using multiple copying indexes to detect different types and amounts of answer copying. This study compared eight copying indexes: S1 and S2 (Sotaridona & Meijer, 2003), K̄2 (Sotaridona & Meijer, 2002), ω (Wollack, 1997), B and H (Angoff, 1974), and the new indexes Runs and MaxStrings, plus all possible pairs and triplets of the 8 indexes using multiple comparison procedures (Dunn, 1961) to adjust the critical α level for each index in a pair or triplet. Empirical Type I error rates and power of all indexes, pairs, and triplets were examined in a real-data simulation (i.e., where actual examinee responses to items, rather than generated item response vectors, were changed to match the actual responses of randomly selected source examinees) for 2 test lengths, 9 sample sizes, 3 types of copying, 4 α levels, and 4 percentages of items copied. This study found that using both ω and H* (i.e., H with empirically derived critical values) can help improve power in the most realistic types of copying situations (strings and mixed copying). The ω-H* paired index improved power most for small percentages of items copied and small amounts of copying, two conditions under which copying indexes tend to be underpowered.
Article
Reports research that extends appropriateness measurement methods to examinees with moderately high nonresponse rates. These methods treat nonresponse as if it were a deliberate option choice and then attempt to measure the appropriateness of the pattern of option choices. Earlier studies used only the dichotomous pattern of "right" and "not right" answers. A general polychotomous model is introduced along with a technique called standardization, which is designed to reduce the observed confounding between measured appropriateness and ability.
Article
Two new indices to detect answer copying on a multiple-choice test—S1 and S2—were proposed. The S1 index is similar to the K index (Holland, 1996) and the K2 index (Sotaridona & Meijer, 2002) but the distribution of the number of matching incorrect answers of the source and the copier is modeled by the Poisson distribution instead of the binomial distribution to improve the detection rate of K and K2. The S2 index was proposed to overcome a limitation of the K and K2 index, namely, their insensitiveness to correct answers copying. The S2 index incorporates the matching correct answers in addition to the matching incorrect answers. A simulation study was conducted to investigate the usefulness of S1 and S2 for 40- and 80-item tests, 100 and 500 sample sizes, and 10%, 20%, 30%, and 40% answer copying. The Type I errors and detection rates of S1 and S2 were compared with those of the K2 and the ω copying index (Wollack, 1997). Results showed that all four indices were able to maintain their Type I errors, with S1 and K2 being slightly conservative compared to S2 and ω. Furthermore, S1 had higher detection rates than K2. The S2 index showed a significant improvement in detection rate compared to K and K2.
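At the point of use, the S1 idea reduces to a Poisson tail probability on the number of matching incorrect answers. In the sketch below the Poisson mean mu is an assumed input; Sotaridona and Meijer estimate it from examinees at comparable score levels, a step not reproduced here.

```python
from scipy.stats import poisson

def s1_tail(m_incorrect, mu):
    """S1-style tail probability: matching incorrect answers modeled as
    Poisson with mean mu (assumed here; estimated from comparable-score
    pairs in the original proposal)."""
    return poisson.sf(m_incorrect - 1, mu)   # P(X >= m_incorrect)

print(s1_tail(m_incorrect=10, mu=3.2))   # small value flags the pair
```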
Article
It has been common practice to simply use a single total score although some attempts have been made to use wrong response patterns diagnostically. This is partially attributable to the fact that there are limited amounts of data available for particular response patterns, and because there is both low reliability of individual items and a tendency to attribute a large fraction of the common variance of the items to a single dimension. Recently, however, several approaches have been suggested which are intended to identify unusual response patterns. There are a wide variety of factors that could lead to an unusual response pattern. While a question may be very easy for most students, unique background experiences may make that same item very difficult for others. For example, the child who has never gone camping may find a reading passage about a camping experience more difficult than children who have had the experience. Other individual differences in motivational disposition, for instance, test anxiety, may make a normally simple item very difficult for some people. Students' exposure to different subject matter, and the way in which that subject matter has been stressed, will influence how they perform on tests. This may result in some measurable variation in scores from class to class. Items which are generally difficult for most students may be relatively easy for students who have been in classes where that particular content was emphasized. Such variation from the norm may lead to the systematic over- or under-estimation of an individual's or group's level of achievement, distorting the measurement results. Indices measuring the degree to which the response pattern for an individual is unusual could be used in a variety of ways. They could identify individuals for whom the standard interpretation of the test score is misleading, or identify groups with atypical instructional and/or experiential histories that alter the relative difficulty ordering of the items. In addition, the items that contribute most to high values on an index for particular subgroups could be identified and judgments made regarding the appropriateness of the item content for those subgroups. There are two major types of indices of the degree to which an individual's pattern of responses is unusual. First, there are the "appropriateness" indices which are based upon
Article
The accurate measurement of examinee test performance is critical to educational decision-making, and inaccurate measurement can lead to negative consequences for examinees. Person-fit statistics are important in a psychometric analysis for detecting examinees with aberrant response patterns that lead to inaccurate measurement. Unfortunately, although a large number of person-fit statistics is available, there is little consensus as to which ones are most useful. The purpose of this study was to compare 36 person-fit indices, under different testing conditions, to obtain a better consensus as to their relative merits. The results of these comparisons, and their implications, are discussed. Sound decisions in educational settings hinge largely on accurate measurement of student characteristics. Such measurements can help identify those individuals who are qualified enough to enter a particular school, or receive a particular educational degree. Also, these measurements can be used to monitor students' learning progress. This may, for example, enable educators to productively tailor their curriculum, or help policy makers decide on important educational issues. In contrast, the inaccurate measurement of test performance can lead to negative consequences. On the one hand, spuriously high test scores can lead to unqualified individuals being enrolled into an educational program (e.g., undergraduate, graduate, or professional), or being awarded an educational degree. On the other hand, qualified individuals with spuriously low test scores may be unfairly excluded from academic programs, or unfairly denied a degree. Furthermore, the inaccurate measurement of test performance undermines the assessment of students' learning progress, and curriculum planning efforts.
Article
We investigated the statistical properties of the K-index (Holland, 1996) that can be used to detect copying behavior on a test. A simulation study was conducted to investigate the applicability of the K-index for small, medium, and large datasets. Furthermore, the Type I error rate and the detection rate of this index were compared with the copying index, ω (Wollack, 1997). Several approximations were used to calculate the K-index. Results showed that all approximations were able to hold the Type I error rates below the nominal level. Results further showed that using ω resulted in higher detection rates than the K-indices for small and medium sample sizes (100 and 500 simulees).
Article
This study investigated the Type I error rate and power of four copying indices, K-index (Holland, 1996), Scrutiny! (Assessment Systems Corporation, 1993), g2 (Frary, Tideman, & Watts, 1977), and ω (Wollack, 1997) using real test data from 20,000 examinees over a 2-year period. The data were divided into three different test lengths (20, 40, and 80 items) and nine different sample sizes (ranging from 50 to 20,000). Four different amounts of answer copying were simulated (10%, 20%, 30%, and 40% of the items) within each condition. The ω index demonstrated the best Type I error control and power in all conditions and at all α levels. Scrutiny! and the K-index were uniformly conservative, and both had poor power to detect true copiers at the small α levels typically used in answer copying detection, whereas g2 was generally too liberal, particularly at small α levels. Some comments on the proper uses of copying indices are provided.
Article
Academic dishonesty has been an important issue; however, little research has been done in Asian countries, especially nationwide studies. A sample of 2,068 college students throughout Taiwan was selected and surveyed on four domains of academic dishonesty: cheating on tests, cheating on assignments, plagiarism, and falsifying documents. The major findings of this study were: (1) the prevalence rate of all types of dishonest behavior among college students in Taiwan was 61.72%; (2) the five most common academically dishonest behaviors in Taiwan were providing a paper or assignment for another student, giving prohibited help to others on their assignments, copying others' assignments, passing answers to other students, and copying from other students; (3) students' attitudes correlated with behaviors in all four domains of academic dishonesty; (4) females found academic dishonesty less acceptable and engaged in fewer dishonest behaviors than males; and (5) freshmen had more dishonest practices than other class ranks.
Article
A new family of indices was introduced earlier as a link between two approaches: One based on item response theory and the other on sample statistics. In this study, the statistical properties of these indices are investigated and then the relationships to Guttman Scales, and to item and person response curves are discussed. Further, these indices are standardized, and an example of their potential usefulness for diagnosing students' misconceptions is shown.
Article
A multivariate logistic latent trait model for items scored in two or more nominal categories is proposed. Statistical methods based on the model provide 1) estimation of two item parameters for each response alternative of each multiple choice item and 2) recovery of information from wrong responses when estimating latent ability. An application to a large sample of data for twenty vocabulary items shows excellent fit of the model according to a chi-square criterion. Item and test information curves are compared for estimation of ability assuming multiple category and dichotomous scoring of these items. Multiple scoring proves substantially more precise for subjects of less than median ability, and about equally precise for subjects above the median.
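The model underlying several of the indices above (notably ω) gives each response category k of an item a linear score a_k·θ + c_k and converts the scores to choice probabilities by normalization, i.e., a softmax. A minimal sketch with invented parameters for a four-option item:

```python
import numpy as np

def nrm_probs(theta, a, c):
    """Bock's nominal response model: P(category k | theta) is proportional
    to exp(a[k] * theta + c[k]), normalized over the item's categories."""
    z = a * theta + c
    ez = np.exp(z - z.max())           # subtract max for numerical stability
    return ez / ez.sum()

a = np.array([0.0, 0.6, 1.2, -0.4])    # invented slopes for a 4-option item
c = np.array([0.0, 0.3, -0.2, 0.1])    # invented intercepts
print(nrm_probs(theta=1.0, a=a, c=c).round(3))   # probabilities sum to 1
```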
U.S. Department of Education, Institute of Education Sciences, & National Center for Education Statistics. (2013). Testing Integrity Symposium: Issues and recommendations for best practice. Retrieved from http://nces.ed.gov/pubs2013/2013454.pdf

Vandehey, M. A., Diekhoff, G. M., & LaBeff, E. E. (2007). College cheating: A twenty-year follow-up and the addition of an honor code. Journal of College Student Development, 48, 468-480.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). Multilog (Version 7) [Computer software]. Lincolnwood, IL: Scientific Software International.
National Council on Measurement in Education. (2012). Testing and data integrity in the administration of statewide student assessment programs. Retrieved from http://ncme.org/default/assets/File/Committee%20Docs/Test%20Score%20Integrity/Test%20Integrity-NCME%20Endorsed%20%282012%20FINAL%29.pdf