Article

Predictability of Discrimination Coefficient and Difficulty Index of Psychiatry Multiple-Choice Questions
Number 3, Summer 2021


Abstract

Background: Multiple-choice questions are among the most common forms of written tests. This study aimed to evaluate faculty members' ability to predict the difficulty level and discrimination coefficient of multiple-choice questions in a psychiatry department. Methods: All faculty members of the Psychiatry Department of Iran University of Medical Sciences participated in this study. The difficulty and discrimination coefficients of all 150 questions of the psychiatric residents' mid-term exam were calculated both with a software program and by hand using the standard formulas. From each group of questions with high, medium, and low difficulty coefficients, 10 questions were then selected (30 questions in total) and given to faculty members, who ranked each question by difficulty and discrimination coefficient. Finally, the correlation between the faculty members' estimates and the computed values was measured with Spearman's correlation. To calculate the discrimination coefficient, the number of examinees who answered a question correctly in the low-score group was subtracted from the number in the high-score group, and the result was divided by the number of examinees in one group. Results: Twenty-five faculty members participated in this study. There was a significant negative correlation between difficulty level and discrimination coefficient in the whole group (r = -0.196, p = 0.045), but not in the upper and lower groups (r = -0.063, p = 0.733). In addition, the correlation between the discrimination coefficient obtained from the formula and the faculty members' average estimate was not significant (r = -0.047, p = 0.803). Conclusion: Faculty members' ability to predict the discrimination coefficient and difficulty level of questions appears to be insufficient.
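For readers unfamiliar with the two statistics, the following is a minimal sketch of the classical formulas described in the Methods: the difficulty index is the proportion of examinees answering correctly, and the discrimination coefficient is the difference in correct answers between the high- and low-score groups divided by the group size. The function names and example data are illustrative, not taken from the study.

```python
# Minimal sketch of the classical item-analysis statistics described above.
# `upper` and `lower` are lists of 0/1 scores (1 = correct) for one item,
# from the high- and low-scoring examinee groups of equal size.

def difficulty_index(responses):
    """Proportion of examinees answering the item correctly (p-value)."""
    return sum(responses) / len(responses)

def discrimination_index(upper, lower):
    """(correct in upper group - correct in lower group) / group size."""
    assert len(upper) == len(lower), "groups are assumed to be of equal size"
    return (sum(upper) - sum(lower)) / len(upper)

# Example: 9 of 10 top scorers and 4 of 10 bottom scorers answer correctly.
upper = [1] * 9 + [0] * 1
lower = [1] * 4 + [0] * 6
print(difficulty_index(upper + lower))     # 0.65
print(discrimination_index(upper, lower))  # 0.5
```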

References
Article
Full-text available
Background and Objective: Precise, standards-based design of residency exams is very important. The purpose of this study was to evaluate quantitative and qualitative indicators of residency exams at Qazvin University of Medical Sciences and to examine the relationship between taxonomic levels and quantitative indices. Materials and Methods: All questions of the residency exams of 7 medical specialties at Qazvin University of Medical Sciences in 2012 and 2013 were studied. The taxonomy of each question was determined at three levels: recall, understanding, and judgment. Discriminating power was classified into four groups (good, fair, poor, and negative), and difficulty into five levels (easy, medium, hard, very difficult, and extremely difficult). SPSS v.19 was used for t-tests and ANOVA. Results: Of 2100 questions in total, 38% were at the recall level, 24.1% at the understanding level, and 37.9% at the judgment level. 24.9% of questions had negative discriminating power, 34.1% poor, 22.5% average, and 18.5% good. For difficulty, 22.1% of questions were easy, 22.6% average, 24.5% difficult, 17.5% very difficult, and 13.2% extremely difficult. A significant association was found between question taxonomy and discrimination index: recall-level questions had a lower discrimination index, while understanding- and application-level questions had a higher one (p < 0.0001). Conclusion: The results showed that raising the taxonomic level of questions can enhance their discriminating power. Since the proportion of questions at taxonomy levels two and three was below the standard (80%), teachers are recommended to design more higher-taxonomy questions for residency exams. Keywords: Taxonomy, Difficulty index, Discrimination index, Residency exam
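As a rough illustration of the four discrimination groups named in this abstract, the sketch below bins an item's discrimination index using cut-points of 0.20 and 0.40. The abstract does not report the study's actual thresholds, so these values are assumptions based on common item-analysis conventions.

```python
# Hedged sketch: binning an item's discrimination index into the four groups
# named in the abstract. The cut-points (0.20 and 0.40) are assumed common
# conventions, not the study's reported values.

def discrimination_group(d):
    if d < 0.0:
        return "negative"
    if d < 0.20:
        return "poor"
    if d < 0.40:
        return "fair"
    return "good"

for d in (-0.10, 0.10, 0.30, 0.55):
    print(d, "->", discrimination_group(d))  # negative, poor, fair, good
```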
Article
Full-text available
Introduction: Multiple-choice questions (MCQs) are a cornerstone of assessment in medical education. Monitoring item properties (difficulty and discrimination) is an important means of investigating examination quality. However, most item property guidelines were developed for use on large cohorts of examinees; little empirical work has investigated the suitability of applying guidelines to item difficulty and discrimination coefficients estimated for small cohorts, such as those in medical education. We investigated the extent to which item properties vary across multiple clerkship cohorts to better understand the appropriateness of using such guidelines with small cohorts. Methods: Exam results for 32 items from an MCQ exam were used. Item discrimination and difficulty coefficients were calculated for 22 cohorts (n = 10-15 students). Discrimination coefficients were categorized according to Ebel and Frisbie (1991). Difficulty coefficients were categorized according to three guidelines by Laveault and Grégoire (2014). Descriptive analyses examined variance in item properties across cohorts. Results: A large amount of variance in item properties was found across cohorts. Discrimination coefficients for items varied greatly across cohorts, with 29/32 (91%) of items occurring in both Ebel and Frisbie's 'poor' and 'excellent' categories and 19/32 (59%) of items occurring in all five categories. For item difficulty coefficients, the application of different guidelines resulted in large variations in examination length (number of items removed ranged from 0 to 22). Discussion: While the psychometric properties of items can provide information on item and exam quality, they vary greatly in small cohorts. The application of guidelines with small exam cohorts should be approached with caution.
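The instability the authors describe is easy to reproduce. The toy simulation below (not the study's data; all parameters are invented) draws 22 cohorts of about a dozen examinees each and shows how widely the discrimination index of a single, fixed item can swing at that sample size.

```python
# Illustrative simulation (not the study's data): how unstable the
# discrimination index can be in cohorts of 10-15 examinees. One fixed item
# is assumed to have a different true probability of a correct answer for
# stronger vs. weaker examinees.
import random

random.seed(0)

def simulate_cohort(n=12, p_upper=0.8, p_lower=0.5):
    half = n // 2
    upper = [1 if random.random() < p_upper else 0 for _ in range(half)]
    lower = [1 if random.random() < p_lower else 0 for _ in range(half)]
    return (sum(upper) - sum(lower)) / half

estimates = [simulate_cohort() for _ in range(22)]  # 22 cohorts, as in the study
print(min(estimates), max(estimates))  # the spread is typically large at this n
```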
Article
Full-text available
Medical education has undergone vast reform following the Islamic Revolution in the last three decades, with remarkable qualitative and quantitative progress having been achieved following the establishment of the Ministry of Health and Medical Education in 1985. There have been rises in the number of medical, dentistry, and pharmacy schools from 7 to 36, 3 to 15, and 3 to 11, respectively, in the number of student admissions in all programmes of medical sciences from 3630 to 6177, and in teaching staff from 1573 to 13108 over the same period. The numbers of students in clinical subspecialty and PhD degrees have increased from zero to 268 and 350, respectively. The quality of medical education has improved with increasing field and ambulatory care training, more emphasis on teaching preventive medicine, and a significant rise in research activities. In conclusion, the Islamic Republic of Iran has been successful in upgrading medical education and research by unifying health services and medical education into one ministry.
Article
Full-text available
Investigated the degree to which subject matter experts could predict the difficulty and discrimination of items on the Test of Standard Written English. Despite an extended training period, the raters did not approach a high level of accuracy, nor were they able to pinpoint the factors that contribute to item difficulty and discrimination. Further research should attempt to uncover those factors by examining the items from a linguistic and psycholinguistic perspective.
Article
Full-text available
To investigate the relationship of items having good difficulty and discrimination indices with their distractor efficiency, and to find how 'ideal questions' can be affected by non-functioning distractors (NF-Ds). The cross-sectional study was conducted at Fatima Jinnah Dental College, Karachi, during Jan-Jun 2009, with 102 first-year dental students (aged 17-20 years). The physiology paper of the first semester, given after 22 weeks of teaching general topics of physiology, was analysed. The paper consisted of 50 one-best MCQs with 5 options each. The MCQs were analysed for difficulty index (p-value), discrimination index (DI), and distractor efficiency (DE). Items with a p-value between 30 and 70 and a DI ≥ 0.25 were considered as having good difficulty and discrimination indices, respectively. Effective distractors were defined as those selected by at least 5% of the students. The mean score was 27.31 ± 5.75 (maximum 50 marks). Mean p-value and DI were 54.14 ± 17.48 and 0.356 ± 0.17, respectively. Seventy-eight percent of items were of average (recommended) difficulty (mean p-value = 51.44 ± 11.11) with DE = 81.41%. Sixty-two percent of items had excellent DI (0.465 ± 0.083) with DE = 83.06%. Combining the two indices, 32 (64%) items could be called 'ideal' (p-value = 30 to 70; DI > 0.24) and had DE = 85.15%. Overall, 42% of items had no non-functioning distractors, while 12% had 3 NF-Ds. Excellent discrimination (DI = 0.427) was achieved by items having one NF-D, while items with 2 NF-Ds and no NF-D had nearly equal but lower DI (0.365 and 0.351, respectively). One-best MCQs of average difficulty and high discrimination with three functioning distractors should be incorporated into future tests to improve test scores and properly discriminate among students. Items with two NF-Ds, though easier, are better discriminators than items with no NF-D.
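The distractor-efficiency computation this abstract describes can be sketched as follows, using the abstract's own rule that a distractor is functional if chosen by at least 5% of examinees. The data and names are illustrative, not from the study.

```python
# Sketch of the distractor analysis described above: distractor efficiency
# (DE) is the percentage of an item's distractors that are "functional",
# i.e. chosen by at least 5% of examinees. `counts` maps each option to the
# number of students choosing it; `key` is the correct answer.

def distractor_efficiency(counts, key):
    total = sum(counts.values())
    distractors = {opt: n for opt, n in counts.items() if opt != key}
    functional = [opt for opt, n in distractors.items() if n / total >= 0.05]
    return 100 * len(functional) / len(distractors)

# 102 students, 5 options, correct answer "C"; "E" attracts only 2 students.
counts = {"A": 10, "B": 15, "C": 60, "D": 15, "E": 2}
print(distractor_efficiency(counts, "C"))  # 75.0 -> one non-functioning distractor
```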
Article
Full-text available
The use of standardized patients in Objective Structured Clinical Examinations (OSCEs) in the assessment of psychiatric residents has increased in recent years. The aim of this study was to investigate the experience of psychiatry residents and examiners with standardized patients in Iran. Final-year psychiatry residents participated in this study. Experienced examiners were asked to complete a questionnaire concerning the ability of standardized patients to realistically portray psychiatric patients. Standardized patients can convincingly portray psychiatric disorders and act according to the requested complex scenarios. Based on these findings, the authors recommend the use of standardized patients in OSCEs for psychiatric board certification exams.
Article
Introduction: Psychiatry's postgraduate training curriculum in Iran has been revised, and one of the core revisions has been the incorporation of a full-time 9-month period of psychotherapy training. However, little is known about psychotherapy training in Iran. Methods: An online anonymous survey was developed by the Early Career Psychiatrists (ECP) Section of the World Psychiatric Association (WPA). The survey included 16 questions about (a) the quality of psychotherapy training (supervision, type of psychotherapy training available, barriers in accessing training); (b) organizational aspects of psychotherapy training (compulsoriness, payment, and assessment); (c) satisfaction with training in psychotherapy; and (d) self-confidence in the use of psychotherapy. The survey was circulated to Iranian early career psychiatrists and psychiatric trainees. Results: 112 early career psychiatrists and psychiatric trainees from across Iran responded to the survey, 98.2% of whom stated that psychotherapy training is included in their psychiatry training; cognitive behavioral therapy and psychodynamic psychotherapy were the most commonly reported modalities integrated into their training. Moreover, 43.3% of the participants reported that they were satisfied or very satisfied with their psychotherapy training during the training years. Discussion: Psychotherapy is integrated into psychiatric training programs in most educational centers in Iran. The modalities offered and trainee satisfaction are similar to those reported in high-income countries on other continents. Supervision and training in modalities such as family therapy could be further implemented and adapted to Iranian culture.
Article
Previous investigations of the ability of content experts and test developers to estimate item difficulty have, for the most part, produced disappointing results. These investigations were based on a noncomparative method of independently rating the difficulty of items. In this article, we argue that, by eliciting comparative judgments of difficulty, judges can more accurately estimate item difficulties. In this study, judges from different backgrounds rank ordered the difficulty of SAT® mathematics items in sets of 7 items. Results showed that judges are reasonably successful in rank ordering several items in terms of difficulty, with little variability across judges and content areas. Simulations of a possible implementation of comparative judgments for difficulty estimation show that it is possible to achieve high correlations between true and estimated difficulties with relatively few comparisons. Implications of these results for the test development process are discussed.
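One plausible way to score such rank-ordering tasks, consistent with the design described above, is Spearman's rank correlation between a judge's ordering and the empirical difficulty ranks of the same 7-item set. The sketch below uses the standard no-ties formula with invented data.

```python
# Sketch: scoring agreement between a judge's difficulty rank ordering and
# the empirical difficulty ranks for one 7-item set, using Spearman's
# rank-correlation formula (no ties assumed). Data are illustrative.

def spearman_rho(rank_a, rank_b):
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

judge_ranks = [1, 2, 3, 4, 5, 6, 7]      # judge's ordering, easiest first
empirical_ranks = [1, 3, 2, 4, 6, 5, 7]  # ranks derived from observed p-values
print(spearman_rho(judge_ranks, empirical_ranks))  # ~0.93: close agreement
```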
Article
The Nedelsky standard setting procedure utilizes an option elimination strategy to estimate the probability that a minimally competent candidate (MCC) will answer a multiple-choice item correctly. The purpose of this study was to investigate the accuracy of predicted item performance from the Nedelsky ratings. The results indicate that test taking behavior of MCCs does not match the underlying Nedelsky assumption that MCCs randomly guess among the options judges believe should be attractive. Further, the accuracy of predicted item performance appears to vary as a function of item difficulty and content domain. However, an analysis of the relationship between judges' rating of distractor difficulty and proportion of examinees selecting item distractors indicated that useful information about examinee item performance is obtainable from Nedelsky-based judgments.
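The underlying Nedelsky assumption tested here has a simple closed form: if the judge believes the MCC can eliminate some distractors and will guess at random among the remaining options, the predicted probability of a correct answer is one over the number of options left. A minimal sketch:

```python
# Sketch of the Nedelsky assumption described above: the judge marks which
# options a minimally competent candidate (MCC) could eliminate, and the
# predicted probability of a correct answer is one over the number of
# remaining (still-attractive) options.

def nedelsky_probability(num_options, num_eliminated):
    remaining = num_options - num_eliminated
    if remaining < 1:
        raise ValueError("at least the keyed answer must remain")
    return 1.0 / remaining

# A 5-option item where the judge believes the MCC can rule out 2 distractors:
print(nedelsky_probability(5, 2))  # 0.333... (random guess among 3 options)
```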
Article
The purposes of this research study were to develop and field test anchor-based judgmental methods for enabling test specialists to estimate item difficulty statistics. The study consisted of three related field tests. In each, researchers worked with six Law School Admission Test (LSAT) test specialists and one or more of the LSAT subtests. The three field tests produced a number of conclusions. A considerable amount was learned about the process of extracting test specialists' estimates of item difficulty. The ratings took considerably longer to obtain than had been expected: training, initial ratings, and discussion all consumed substantial time. Test specialists felt they could be trained to estimate item difficulty accurately and, to some extent, they demonstrated this. Average error in the estimates of item difficulty varied from about 11% to 13%. The discussions were popular with the panelists and almost always resulted in improved item difficulty estimates. By the end of the study, the two frameworks that the developers had expected to provide to test specialists had merged into one. Test specialists seemed to benefit from descriptions of items located at three levels of difficulty and from information about the item statistics of many items. Four appendixes describe the tasks and contain the field test materials.
Article
Minimum standards were established for the National Teacher Examinations (NTE) area examinations in mathematics and in elementary education by independent panels of teacher educators who had been instructed in the use of either the Angoff, Nedelsky, or Jaeger procedures. Of these three procedures, only the Jaeger method requires that normative data be provided to the judges when evaluating the items. However, it was of interest to study the effect such information would have upon the standards obtained using the other two methods. Therefore, the design incorporated three sequential review sessions with the level of normative information different for each. A three-factor ANOVA revealed significant main effects for methods and sessions but not for subject area. None of the interactions was significant. The anticipated failure rates, the psychometric characteristics of the ratings, and other factors suggest that the Angoff procedure, as modified during the second session of this study, yields the most defensible standards for the NTE area examinations.
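For context, the basic (unmodified) Angoff computation works as sketched below: each judge assigns every item the probability that a minimally competent candidate answers it correctly, and the standard is the sum of the judge-averaged item probabilities. The numbers are illustrative, not from this study.

```python
# Sketch of the basic Angoff standard-setting computation referenced above.
# Each row holds one judge's estimated probabilities that a minimally
# competent candidate answers each item correctly. Values are invented.

judges = [
    [0.6, 0.8, 0.5, 0.7],  # judge 1's item probabilities
    [0.5, 0.9, 0.4, 0.6],  # judge 2
    [0.7, 0.7, 0.5, 0.8],  # judge 3
]
n_items = len(judges[0])
item_means = [sum(j[i] for j in judges) / len(judges) for i in range(n_items)]
cut_score = sum(item_means)  # expected raw score of the borderline candidate
print(round(cut_score, 2))   # 2.57 out of 4 items
```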
Article
The purpose of this study was to evaluate whether multiple-choice item difficulty could be predicted either by a subjective judgment by the question author or by applying a learning taxonomy to the items. Eight physiology faculty members teaching an upper-level undergraduate human physiology course consented to participate in the study. The faculty members annotated questions before exams with the descriptors "easy," "moderate," or "hard" and classified them according to whether they tested knowledge, comprehension, or application. Overall analysis showed a statistically significant, but relatively low, correlation between the intended item difficulty and actual student scores (ρ = -0.19, P < 0.01), indicating that, as intended item difficulty increased, the resulting student scores on items tended to decrease. Although this expected inverse relationship was detected, faculty members were correct only 48% of the time when estimating difficulty. There was also significant individual variation among faculty members in the ability to predict item difficulty (χ(2) = 16.84, P = 0.02). With regard to the cognitive level of items, no significant correlation was found between the item cognitive level and either actual student scores (ρ = -0.09, P = 0.14) or item discrimination (ρ = 0.05, P = 0.42). Despite the inability of faculty members to accurately predict item difficulty, the examinations were of high quality, as evidenced by reliability coefficients (Cronbach's α) of 0.70-0.92, the rejection of only 4 of 300 items in the postexamination review, and a mean item discrimination (point biserial) of 0.37. In conclusion, the effort of assigning annotations describing intended difficulty and cognitive levels to multiple-choice items is of doubtful value in terms of controlling examination difficulty. However, we also report that the process of annotating questions may enhance examination validity and can reveal aspects of the hidden curriculum.
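The point-biserial discrimination reported here is the Pearson correlation between a dichotomous item score and the total test score; a self-contained sketch with invented data follows.

```python
# Sketch of the point-biserial discrimination statistic reported above: the
# correlation between a 0/1 item score and examinees' total test scores,
# computed from the standard formula. Data are illustrative.
import statistics

def point_biserial(item, totals):
    n = len(item)
    mean_total = statistics.mean(totals)
    sd_total = statistics.pstdev(totals)          # population SD
    correct = [t for x, t in zip(item, totals) if x == 1]
    p = len(correct) / n                          # proportion answering correctly
    q = 1 - p
    mean_correct = statistics.mean(correct)
    return (mean_correct - mean_total) / sd_total * (p / q) ** 0.5

item =   [1, 0, 1, 1, 0, 1, 0, 1]                 # 0/1 scores on one item
totals = [38, 22, 35, 40, 25, 33, 28, 37]         # total test scores
print(round(point_biserial(item, totals), 2))     # 0.92
```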
Article
Although communication skills have been recognized as a crucial element in the delivery of high-quality medical care, the emphasis given to them within medical education in Iran is severely limited, and the state of such teaching is unknown in many other countries. This exploratory study investigated the views and experiences of medical education course planners in Iran with respect to the current status of communication skills training within Iranian medical schools. The findings are based on in-depth interviews with Iranian medical course planners. They demonstrate a deep concern about the lack of communication skills training within the Iranian medical curriculum: medical students' acquisition and use of communication skills are consistently poor, and medical litigation can result from poor communication skills. Both positive and negative attitudes toward integrating communication skills into the medical curriculum were revealed. There is a real need to integrate communication skills into Iranian medical education, with due attention to ethnic and religious issues. Some recommendations are made, and the limitations of the study are discussed.
Komeili Gh, Rezai Gh. Methods of student assessment used by faculty members of Basic Medical Sciences in Medical University of Zahedan. Iranian J Medical Education 2001;1(4):52-7.
Shakurnia A, Ghafourian M, Khodadadi A, Ghadiri A. Analytical study of quantitative indices of multiple-choice questions of immunology department in Ahvaz Jundishapur University of Medical Sciences. JundiShapur Educational Development 2018;9(2):72-83.
Tinkelman S. Difficulty prediction of test items. Teachers College Contributions to Education. 1947.
HosseiniTeshnizi S, Zare S, Solati S. Quality analysis of multiple choice questions (MCQs) examinations of noncontinuous undergraduate medical records. Hormozgan Med J 2010;14(3):177-83.
Mehta G, Mokhasi V. Item analysis of multiple choice questions-an assessment of the assessment tool. Int J Health Sci Res 2014;4(7):197-202.