Article

Exploring examinee behaviours as validity evidence for multiple-choice question examinations

Authors: Surry, Torre and Durning

Abstract

Context: Clinical-vignette multiple choice question (MCQ) examinations are used widely in medical education. Standardised MCQ examinations are used by licensure and certification bodies to award credentials that are meant to assure stakeholders as to the quality of physicians. Such uses are based on the interpretation of MCQ examination performance as giving meaningful information about the quality of clinical reasoning. There are several assumptions foundational to these interpretations and uses of standardised MCQ examinations. This study explores the implicit assumption that cognitive processes elicited by clinical-vignette MCQ items are like the processes thought to occur with 'real-world' clinical reasoning as theorised by dual-process theory. Methods: Fourteen participants (three medical students, five residents and six staff physicians) completed three sets of five timed MCQ items (total 15) from the Medical Knowledge Self-Assessment Program (MKSAP). Upon answering a set of MCQs, each participant completed a retrospective think aloud (TA) protocol. Using constant comparative analysis (CCA) methods sensitised by dual-process theory, we performed a qualitative thematic analysis. Results: Examinee behaviours fell into three categories: clinical reasoning behaviours, test-taking behaviours and reactions to the MCQ. Consistent with dual-process theory, statements about clinical reasoning behaviours were divided into two sub-categories: analytical reasoning and non-analytical reasoning. Each of these categories included several themes. Conclusions: Our study provides some validity evidence that test-takers' descriptions of their cognitive processes during completion of high-quality clinical-vignette MCQs align with processes expected in real-world clinical reasoning. This supports one of the assumptions important for interpretations of MCQ examination scores as meaningful measures of clinical reasoning. Our observations also suggest that MCQs elicit other cognitive processes, including certain test-taking behaviours, that seem 'inauthentic' to real-world clinical reasoning. Further research is needed to explore if similar themes arise in other contexts (e.g. simulated patient encounters) and how observed behaviours relate to performance on MCQ-based assessments.


... One of the oft-cited criticisms of MCQs is that they may be restricted to the assessment of lower-level cognitive abilities given that examinees are asked to "recognize" an answer from a list of presented options rather than constructing an answer with no potential prompting (Newble et al. 1979; McCoubrie 2004). However, MCQs have been demonstrated to be useful in assessing a variety of skill levels including clinical reasoning (Beullens et al. 2005; Surry et al. 2017) and situational judgment (Lievens et al. 2005). Another critique is that MCQs may provide cueing that differentially impacts students (Schuwirth et al. 1996). ...
... There is significant evidence to demonstrate that examinees use clinical reasoning when answering well-constructed context-dependent items (Coderre et al. 2004; Heist et al. 2014; Surry et al. 2017). In a study that examined the problem-solving strategies used by general practitioners and medical students, evidence suggested that context-dependent items elicit different reasoning processes when compared to items that are designed to test factual knowledge, with the former leading experts to use illness scripts (i.e. ...
... In a recent study, which included a sample of 102 MCQs from a high-stakes examination, 94% of items were found to assess the application of knowledge (in contrast to simple recall), as judged by a panel of experts (Pugh et al., in preparation). Using a "think aloud" protocol, another study explored clinical reasoning in medical students, residents, and staff physicians when answering MCQs from the Medical Knowledge Self-Assessment Program (MKSAP) (Surry et al. 2017). Participants' descriptions of their clinical reasoning processes suggested that they used both System 1 (i.e. ...
Article
Full-text available
Despite the increased emphasis on the use of workplace-based assessment in competency-based education models, there is still an important role for the use of multiple choice questions (MCQs) in the assessment of health professionals. The challenge, however, is to ensure that MCQs are developed in a way to allow educators to derive meaningful information about examinees’ abilities. As educators’ needs for high-quality test items have evolved so has our approach to developing MCQs. This evolution has been reflected in a number of ways including: the use of different stimulus formats; the creation of novel response formats; the development of new approaches to problem conceptualization; and the incorporation of technology. The purpose of this narrative review is to provide the reader with an overview of how our understanding of the use of MCQs in the assessment of health professionals has evolved to better measure clinical reasoning and to improve both efficiency and item quality.
... These hypotheses can be explored in qualitative think-aloud studies. Previous work in this area (Coderre et al. 2004; Heist et al. 2014; Durning et al. 2015; Surry et al. 2017) has focused on the thought processes used for SBAQs. These studies have consistently shown that students and doctors use both analytical and non-analytical reasoning strategies to answer SBAQs; an approach that aligns with the 'processes expected in real-world clinical reasoning' (Surry et al. 2017) (p.1075, our emphasis). ...
... Previous work in this area (Coderre et al. 2004; Heist et al. 2014; Durning et al. 2015; Surry et al. 2017) has focused on the thought processes used for SBAQs. These studies have consistently shown that students and doctors use both analytical and non-analytical reasoning strategies to answer SBAQs; an approach that aligns with the 'processes expected in real-world clinical reasoning' (Surry et al. 2017) (p.1075, our emphasis). However, these studies also reveal the use of 'test-taking' behaviours that reduce the authenticity of this ...
... We used a 'think aloud' study design, whereby participants are asked to voice the thoughts that occur to them as they complete a task (Hevey 2012), in this case answering SBAQs and VSAQs. We subsequently used a content analysis approach and derived our initial content themes from a previous think aloud study undertaken by Surry et al. (2017). ...
Article
Full-text available
Background Single-best answer questions (SBAQs) are common but are susceptible to cueing. Very short answer questions (VSAQs) could be an alternative, and we sought to determine if students’ cognitive processes varied across question types and whether students with different performance levels used different methods for answering questions. Methods We undertook a ‘think aloud’ study, interviewing 21 final year medical students at five UK medical schools. Each student described their thought processes and methods used for eight questions of each type. Responses were coded and quantified to determine the relative frequency with which each method was used, expressed as a proportion of the number of times the method could have been used. Results Students were more likely to use analytical reasoning methods (specifically identifying key features) when answering VSAQs. The use of test-taking behaviours was more common for SBAQs; students frequently used the answer options to help them reach an answer. Students acknowledged uncertainty more frequently when answering VSAQs. Analytical reasoning was more commonly used by high-performing students compared with low-performing students. Conclusions Our results suggest that VSAQs encourage more authentic clinical reasoning strategies. Differences in cognitive approaches used highlight the need for focused approaches to teaching clinical reasoning and dealing with uncertainty.
... Development of higher-order thinking skills is related to student academic success [2]. Therefore, many medical school faculty strive to write MCQs that entail the application of analytic thinking skills in an effort to model skills needed for clinical reasoning [3,4]. ...
... "The selection of item types depends on the intent of their use: for a medium-to high-stakes summative examination, the use of vignettes that require higher-order thinking skills and application of knowledge would be preferable to simple recall items." [13 p. 32] MCQs requiring application of knowledge are utilized because they are thought to be a reliable measure of clinical reasoning [3]. Learners need to have basic knowledge (facts) in order to approach higher-order questions; in other words, they have to walk before they can run. ...
... Our study also demonstrated that if a student gets a question correct, they are less likely to identify the question as higher order. Previous work has found the heavy use of pattern recognition by examinees [3], while other studies suggest that higher performing students utilize both clinical reasoning behavior such as pattern recognition and test-taking strategies to rule out alternatives [15]. These results suggest that although faculty intend to write higher- or lower-order MCQs, students' perception of these questions depends more on their knowledge and performance than on Bloom's taxonomy. ...
Article
Full-text available
Background Analytic thinking skills are important to the development of physicians. Therefore, educators and licensing boards utilize multiple-choice questions (MCQs) to assess these knowledge and skills. MCQs are written under two assumptions: that they can be written as higher or lower order according to Bloom’s taxonomy, and students will perceive questions to be the same taxonomical level as intended. This study seeks to understand the students’ approach to questions by analyzing differences in students’ perception of the Bloom’s level of MCQs in relation to their knowledge and confidence. Methods A total of 137 students responded to practice endocrine MCQs. Participants indicated the answer to the question, their interpretation of it as higher or lower order, and the degree of confidence in their response to the question. Results Although there was no significant association between students’ average performance on the content and their question classification (higher or lower), individual students who were less confident in their answer were more than five times as likely (OR = 5.49) to identify a question as higher order than their more confident peers. Students who responded incorrectly to the MCQ were 4 times as likely to identify a question as higher order than their peers who responded correctly. Conclusions The results suggest that higher performing, more confident students rely on identifying patterns (even if the question was intended to be higher order). In contrast, less confident students engage in higher-order, analytic thinking even if the question is intended to be lower order. Better understanding of the processes through which students interpret MCQs will help us to better understand the development of clinical reasoning skills.
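The odds ratios above come from relating students' confidence and correctness to whether they classified an item as higher order. As a minimal sketch of that kind of analysis, here is a logistic regression on entirely invented data; the variable names, simulated values, and coefficients are illustrative assumptions, not the study's data:

```python
# Hypothetical sketch: logistic regression of item classification
# (1 = rated higher order) on answer confidence and correctness, the
# kind of model behind odds ratios such as OR = 5.49. Data are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 137
df = pd.DataFrame({
    "low_confidence": rng.integers(0, 2, n),  # 1 = not confident in answer
    "incorrect": rng.integers(0, 2, n),       # 1 = answered the MCQ incorrectly
})
# Simulate classifications that depend on both predictors.
logit = -1.0 + 1.7 * df["low_confidence"] + 1.4 * df["incorrect"]
df["rated_higher_order"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = smf.logit("rated_higher_order ~ low_confidence + incorrect",
                  data=df).fit(disp=False)
print(np.exp(model.params))      # odds ratios
print(np.exp(model.conf_int()))  # 95% confidence intervals
```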
... To better fill this gap in our understanding, more robust means of detecting error in clinical practice [5] and novel experimental approaches are necessary. We believe using multiple-choice questions (MCQs), widely applied in standardized exams to assess clinical reasoning and found to elicit real-world reasoning processes in previous research [26][27][28], supplemented with a think aloud (TA) protocol can provide valuable insight into such errors. Furthermore, MCQs hold the advantage of having an a priori distinct correct answer, allowing for a clear, prospective analysis that limits hindsight bias. ...
... In this mixed-methods study, we explore what CDRs, if any, are present when medical students, residents, and attending physicians solve MCQs and how these CDRs may relate to incorrect answer selection (i.e., error). We hypothesized that CDRs detected in think alouds completed during answering high-quality clinical-vignette MCQs, a task previously shown to elicit clinical reasoning processes [26][27][28], would be associated with errors. Such a finding would be consistent with views of dual process theory posited by Croskerry, Kahneman, and Tversky and support the position that system 1 (automatic) reasoning processes like CDRs may contribute to error [8,14,16,20]. ...
... While we did this to avoid altering participants' thinking while completing the MCQs, and we carefully followed recommendations for this use of the TA, it is possible that participants' verbalizations reflect their post hoc explanations rather than their actual reasoning while answering the MCQs. The view that reasoning during clinical-vignette MCQs is similar to "native," or "real-world," clinical reasoning is also controversial and may be viewed as a limitation; however, several studies provide evidence supporting the similarity of reasoning processes across these contexts [26][27][28]. Larger investigations may be helpful in studying the nature of the association of specific CDRs with errors and the interactions of CDRs with contextual factors (e.g., fatigue, time constraints, language barriers, electronic health records, interruptions, multi-tasking, "difficult" patients) [3,[10][11][12][13][14][15]. We performed think alouds following each block of related items (rather than after each item); performing think alouds after each item may have provided a more in-depth understanding of thinking at the item level. ...
Article
Full-text available
Background Cognitive dispositions to respond (i.e., cognitive biases and heuristics) are well-established clinical reasoning phenomena. While thought by many to be error-prone, some scholars contest that these cognitive dispositions to respond are pragmatic solutions for reasoning through clinical complexity that are associated with errors largely due to hindsight bias and flawed experimental design. The purpose of this study was to prospectively identify cognitive dispositions to respond occurring during clinical reasoning to determine whether they are actually associated with increased odds of an incorrect answer (i.e., error). Methods Using the cognitive disposition to respond framework, this mixed-methods study applied a constant comparative qualitative thematic analysis to transcripts of think alouds performed during completion of clinical-vignette multiple-choice questions. The number and type of cognitive dispositions to respond associated with both correct and incorrect answers were identified. Participants included medical students, residents, and attending physicians recruited using maximum variation strategies. Data were analyzed using generalized estimating equations binary logistic model for repeated, within-subjects measures. Results Among 14 participants, there were 3 cognitive disposition to respond categories – Cognitive Bias, Flaws in Conceptual Understanding, and Other Vulnerabilities – with 13 themes identified from the think aloud transcripts. The odds of error increased to a statistically significant degree with a greater per-item number of distinct Cognitive Bias themes (OR = 1.729, 95% CI [1.226, 2.437], p = 0.002) and Other Vulnerabilities themes (OR = 2.014, 95% CI [1.280, 2.941], p < 0.001), but not with Flaws in Conceptual Understanding themes (OR = 1.617, 95% CI [0.961, 2.720], p = 0.070). Conclusion This study supports the theoretical understanding of cognitive dispositions to respond as phenomena associated with errors in a new prospective manner. With further research, these findings may inform teaching, learning, and assessment of clinical reasoning toward a reduction in patient harm due to clinical reasoning errors. Electronic supplementary material The online version of this article (10.1186/s12909-018-1372-2) contains supplementary material, which is available to authorized users.
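The generalized estimating equations analysis described above can be sketched with standard statistical tooling. Below is a minimal, hypothetical illustration in Python's statsmodels of a repeated-measures binary logistic GEE relating per-item theme counts to error; the simulated data frame, column names, and coefficients are assumptions for illustration only, not the study's data:

```python
# Hypothetical sketch of a GEE binary logistic model for repeated,
# within-subject measures, relating per-item counts of cognitive-
# disposition themes to error. All data below are invented.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for pid in range(14):             # 14 participants
    for item in range(15):        # repeated items per participant
        bias = rng.poisson(0.6)   # Cognitive Bias themes on this item
        other = rng.poisson(0.4)  # Other Vulnerabilities themes
        flaws = rng.poisson(0.3)  # Flaws in Conceptual Understanding themes
        p_err = 1 / (1 + np.exp(-(-1.5 + 0.55 * bias + 0.7 * other + 0.3 * flaws)))
        rows.append((pid, bias, other, flaws, int(rng.random() < p_err)))
df = pd.DataFrame(rows, columns=["participant", "bias", "other", "flaws", "error"])

gee = smf.gee("error ~ bias + other + flaws", groups="participant", data=df,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable())
res = gee.fit()
print(np.exp(res.params))      # odds ratios per additional theme
print(np.exp(res.conf_int()))  # 95% confidence intervals
```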
... However, no specific data on clinical reasoning behaviors with SCT are available. One study by Surry et al. examined clinical reasoning behaviors in a 210-item clinical-vignette MCQ test based on dual-process theory [36]. The results showed that both system 1 and system 2 processes were elicited for nearly all test questions (100% and 97.1%, respectively) in a small sample of subjects [36]. ...
... One study by Surry et al. examined clinical reasoning behaviors in a 210-item clinical-vignette MCQ test based on dual-process theory [36]. The results showed that both system 1 and system 2 processes were elicited for nearly all test questions (100% and 97.1%, respectively) in a small sample of subjects [36]. Further studies are needed to explore system 1 and system 2 reasoning use during an SCT to support the assumption that SCT mostly explores system 2. Finally, our findings illustrate some of the difficulties in studying the links between clinical reasoning and burnout. ...
Article
Full-text available
Background Burnout results from excessive demands at work. Caregivers suffering from burnout show a state of emotional exhaustion, leading them to distance themselves from their patients and to become less efficient in their work. While some studies have shown a negative impact of burnout on physicians’ clinical reasoning, others have failed to demonstrate any such impacts. To better understand the link between clinical reasoning and burnout, we carried out a study looking for an association between burnout and clinical reasoning in a population of general practice residents. Methods We conducted a cross-sectional observational study among residents in general practice in 2017 and 2019. Clinical reasoning performance was assessed using a script concordance test (SCT). The Maslach Burnout Inventory for Human Services Survey (MBI-HSS) was used to determine burnout status both according to the original standards of Maslach’s burnout inventory manual (the conventional approach) and by identifying individuals who reported high emotional exhaustion in combination with high depersonalization or low personal accomplishment compared to a norm group (the “emotional exhaustion + 1” approach). Results One hundred ninety-nine residents were included. The participants’ mean SCT score was 76.44% (95% CI: 75.77–77.10). In the conventional approach, 126 residents (63.31%) had no burnout, 37 (18.59%) had mild burnout, 23 (11.56%) had moderate burnout, and 13 (6.53%) had severe burnout. In the “exhaustion + 1” approach, 38 residents (19.10%) had burnout status. We found no significant correlation between burnout status and SCT scores for either the conventional or the “exhaustion + 1” approach. Conclusions Our data seem to indicate that burnout status has no significant impact on clinical reasoning. However, one speculation is that SCT mostly examines the clinical reasoning process’s analytical dimension, whereas emotions are conventionally associated with the intuitive dimension. We think future research might aim to explore the impact of burnout on intuitive clinical reasoning processes.
... Downing [26] has proposed that construct validity is not a subtype of validity but rather validity in its entirety, and that evaluating construct validity requires evidence from multiple sources, including content, response process and intrinsic flaws and errors associated with a method of assessment [26]. For SBA questions, the errors can relate to the quality of item-writing and non-functioning distractors (i.e. the other least plausible options in an SBA question) [27,28]. Overall, validity is defined as the degree to which an assessment method and its content measure what they are expected to evaluate, and at an appropriate level [2,3,5,26]. ...
... Presumably, the optimal type of assessment depends on the type of competency assessed.22 Clinical knowledge might be best evaluated by machines,20 such as the multiple-choice examination,23 yet most other clinical competencies (such as communication or collaboration) are arguably socially determined14 and complex,24 and might therefore be best evaluated by human assessors. When it is difficult to obtain a representative evaluation, a sample of mixed assessments may be useful. ...
Article
Full-text available
Context Ethnicity‐related differences in clinical grades exist. Broad sampling in assessment of clinical competencies involves multiple assessments used by multiple assessors across multiple moments. Broad sampling in assessment potentially reduces irrelevant variances and may therefore mitigate ethnic disparities in clinical grades. Objectives Research question 1 (RQ1): to assess whether the relationship between students’ ethnicity and clinical grades is weaker in a broadly sampled versus a global assessment. Research question 2 (RQ2): to assess whether larger ethnicity‐related differences in grades occur when supervisors are given the opportunity to deviate from the broadly sampled assessment score. Methods Students’ ethnicity was classified as Turkish/Moroccan/African, Surinamese/Antillean, Asian, Western, and native Dutch. RQ1: 1667 students (74.3% native Dutch students) were included, who entered medical school between 2002 and 2004 (global assessment, 818 students) and between 2008 and 2010 (broadly sampled assessment, 849 students). The main outcome measure was whether or not students received a grade of 8 or higher (on a scale from 1 to 10) at least three times across five clerkships. RQ2: 849 students (72.4% native Dutch students) were included, who were assessed by broad sampling. The main outcome measure was the number of grade points by which supervisors had deviated from broadly sampled scores. Both analyses were adjusted for gender, age, (im)migration status and average bachelor grade. Results Research question 1: ethnicity‐related differences in clinical grades were smaller in broadly sampled than in global assessment, and this was also seen after adjustments. More specifically, native Dutch students had reduced probabilities (from 0.87 to 0.65) in broadly sampled as compared with global assessment, whereas Surinamese (from 0.03 to 0.51) and Asian students (from 0.21 to 0.30) had increased probabilities of receiving a grade of 8 or higher at least three times across five clerkships. Research question 2: when supervisors were allowed to deviate from original grades, ethnicity‐related differences in clinical grades were reintroduced. Conclusions Broadly sampled assessment reduces ethnicity‐related differences in grades.
... Candidates may focus on practising exam technique rather than understanding the principles of the subject matter and honing their cognitive reasoning skills, thus adversely impacting learning behaviours.6 7 Because patients do not present with a list of five possible diagnoses, investigations or treatment options,8 SBA questions do not simulate the 'situations they [the candidates] will face when they undertake patient-related clinical tasks' (p66).9 Any alternative method of assessing applied medical knowledge must therefore provide increased content and response process validity, without resulting in significant reductions in other types of validity, reliability, acceptability, educational impact or an unacceptable increase in cost. ...
Article
Full-text available
Objectives The study aimed to compare candidate performance between traditional best-of-five single-best-answer (SBA) questions and very-short-answer (VSA) questions, in which candidates must generate their own answers of between one and five words. The primary objective was to determine if the mean positive cue rate for SBAs exceeded the null hypothesis guessing rate of 20%. Design This was a cross-sectional study undertaken in 2018. Setting 20 medical schools in the UK. Participants 1417 volunteer medical students preparing for their final undergraduate medicine examinations (total eligible population across all UK medical schools approximately 7500). Interventions Students completed a 50-question VSA test, followed immediately by the same test in SBA format, using a novel digital exam delivery platform which also facilitated rapid marking of VSAs. Main outcome measures The main outcome measure was the mean positive cue rate across SBAs: the percentage of students getting the SBA format of the question correct after getting the VSA format incorrect. Internal consistency, item discrimination and the pass rate using Cohen standard setting for VSAs and SBAs were also evaluated, and a cost analysis in terms of marking the VSA was performed. Results The study was completed by 1417 students. Mean student scores were 21 percentage points higher for SBAs. The mean positive cue rate was 42.7% (95% CI 36.8% to 48.6%), one-sample t-test against ≤20%: t=7.53, p<0.001. Internal consistency was higher for VSAs than SBAs and the median item discrimination equivalent. The estimated marking cost was £2655 ($3500), with 24.5 hours of clinician time required (1.25 s per student per question). Conclusions SBA questions can give a false impression of students’ competence. VSAs appear to have greater authenticity and can provide useful information regarding students’ cognitive errors, helping to improve learning as well as assessment. Electronic delivery and marking of VSAs is feasible and cost-effective.
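The headline statistic here, the positive cue rate, is straightforward to compute: for each question, take the students who answered the VSA format incorrectly, find the proportion of them who answered the SBA format correctly, average across questions, and test that mean against the 20% guessing rate. A minimal sketch on simulated data (all response values are invented, not the study's data):

```python
# Hypothetical sketch of the positive-cue-rate analysis: among students
# who answered the VSA form of a question incorrectly, what fraction
# answered the SBA form correctly? All responses below are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_students, n_questions = 1417, 50
vsa_correct = rng.random((n_students, n_questions)) < 0.45
# Simulate cueing: SBA correct whenever VSA was, plus extra lucky answers.
sba_correct = vsa_correct | (rng.random((n_students, n_questions)) < 0.4)

cue_rates = []
for q in range(n_questions):
    vsa_wrong = ~vsa_correct[:, q]
    # Positive cue rate for this question.
    cue_rates.append(sba_correct[vsa_wrong, q].mean())
cue_rates = np.array(cue_rates)

# One-sample t-test of the mean cue rate against the 20% guessing rate.
t, p = stats.ttest_1samp(cue_rates, 0.20, alternative="greater")
print(f"mean positive cue rate = {cue_rates.mean():.3f}, t = {t:.2f}, p = {p:.2g}")
```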
...46 Even when MCQs replicate complex thought processes such as clinical reasoning, learners still engage in strategies of looking for clues in the question and formatting to help them find the answers.49 MCQs can be used for retrieval practice but should be used with care. Other test formats may better accomplish educators' aims. ...
Article
Educational systems are rarely designed for long-term retention of information. Strong evidence has emerged from cognitive psychology and applied education studies that repeated retrieval of information significantly improves retention compared to repeated studying. This effect likely emerges from the processes of memory consolidation and reconsolidation. Consolidation and reconsolidation are the means by which memories are organized into associational networks or schemas that are created and recreated as memories are formed and recalled. As educators implement retrieval practice, they should consider how various test formats lead to different degrees of schema activation. Repeated acts of retrieval provide opportunities for schemas to be updated and strengthened. Spacing of retrieval allows more consolidated schemas to be reactivated. Feedback provides metacognitive monitoring to ensure retrieval accuracy and can lead to shifts from ineffective to effective retrieval strategies. By using the principles of retrieval practice, educators can improve the likelihood that learners will retain information for longer periods of time.
... This supports one of the assumptions important for interpretations of multiple choice question (MCQ) examination scores as meaningful measures of clinical reasoning.8 The tutorials and assessments were placed on a freely accessible public website, available 24/7. Following publication of the proposed curriculum in 2010, the American Board of Internal Medicine and the American Society for Clinical Pathology began collaborating in 2012 on the initiative to develop the Choosing Wisely® campaign, with the goal of creating medical specialty-specific lists of "things physicians and patients should question." ...
Article
Full-text available
Web-based learning applications can support health sciences education, including knowledge acquisition in pathology and laboratory medicine. Websites can be developed to provide learning content, assessments, and products supporting pathology education. In this paper, we review informatics principles, practices, and procedures involved with educational website development in the context of existing websites and published studies of educational website usage outcomes, including that of the authors. We provide an overview with analysis of potential results of usage to inform how such websites may be used, and to guide further development. We discuss the value of educational websites for individual users, educational institutions, and professional organizations. Educational websites may offer assessments that are formative, for learning itself, as practice, preparation, and self-assessment. Open access websites have the advantage of worldwide availability 24/7, particularly aiding persons in low resource settings. Commercial offerings for educational support in formal curricula are beyond the scope of this review. This review is intended to guide those interested in website development to support non-commercial educational purposes for users seeking to improve their knowledge and diagnostic skills supporting careers in pathology.
... They concluded that keywords can communicate entire diagnoses and activate illness scripts independently of any other information. Think aloud studies looking at approaches to answering multiple choice questions have also identified recognition of buzzwords as a test-taking cognitive approach to answering questions (Surry et al. 2017). Sam et al. (2021) identified the response to buzzwords as a test-taking behaviour leading to ...
Preprint
Full-text available
Background Automated Item Generation (AIG) uses computer software to create multiple items from a single question model. Items generated using AIG software have been shown to be of similar quality to those produced using traditional item writing methods. However, there is currently a lack of data looking at whether item variants to a single question result in differences in student performance or human-derived standard setting. The purpose of this study was to use 50 Multiple Choice Questions (MCQs) as models to create four distinct tests which would be standard-set and given to final-year UK medical students, and then to compare the performance and standard setting data for each. Methods Pre-existing questions from the UK Medical Schools Council (MSC) Assessment Alliance item bank, created using traditional item writing techniques, were used to generate four ‘isomorphic’ 50-item MCQ tests using AIG software. All UK medical schools were invited to deliver one of the four papers as an online formative assessment for their final-year students. Each test was standard-set using a modified Angoff method. Thematic analysis was conducted for item variants with high and low levels of variance in facility (for student performance) and average scores (for standard setting). Results 2218 students from 12 UK medical schools sat one of the four papers. The average facility of the four papers ranged from 0.55 to 0.61, and the cut score ranged from 0.58 to 0.61. Twenty item models had a facility difference >0.15 and 10 item models had a difference in standard setting of >0.1. Variation in parameters that could alter clinical reasoning strategies had the greatest impact on item facility. Conclusions Item facility varied to a greater extent than the standard set. This may relate to variants creating greater disruption of clinical reasoning strategies in novice learners as opposed to experts, in addition to the well-documented tendency of standard setters to revert to the mean.
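Item facility in this design is simply the proportion of students answering a variant correctly, and the flagging rule is a spread across a model's variants. A minimal sketch on invented data (the matrix shape and the 0.15 threshold mirror the description above; none of the values reproduce the study's data):

```python
# Hypothetical sketch: item facility (proportion correct) per variant of
# each AIG item model, flagging models whose variants differ by > 0.15.
# The response matrix below is simulated.
import numpy as np

rng = np.random.default_rng(3)
n_models, n_variants, n_students = 50, 4, 550
# responses[m, v] holds simulated 0/1 answers to variant v of model m.
responses = rng.random((n_models, n_variants, n_students)) < rng.uniform(
    0.4, 0.8, size=(n_models, n_variants, 1))

facility = responses.mean(axis=2)                  # per-variant facility
spread = facility.max(axis=1) - facility.min(axis=1)
flagged = np.flatnonzero(spread > 0.15)
print(f"{len(flagged)} of {n_models} item models exceed a 0.15 facility spread")
```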
... Although once thought to be useful only in the assessment of lower-order skills (i.e., recall of facts), wellconstructed MCQs have been shown to be beneficial in assessing clinical reasoning (Coderre, Harasym, Mandin, & Fick, 2004;Heist, Gonzalo, Durning, Torre, & Elnicki, 2014;Skakun, Maguire, & Cook, 1994). In fact, examinees have been shown to use both system I (automatic, non-analytic) and system II (analytic) cognitive processes when answering MCQs, which aligns with the processes that clinicians use in practice (Surry, Torre, & Durning, 2017). However, to date, there are no studies demonstrating that items developed using AIG do in fact target these higher-order skills. ...
Article
Full-text available
Abstract The purpose of this study was to compare the quality of multiple choice questions (MCQs) developed using automated item generation (AIG) versus traditional methods, as judged by a panel of experts. The quality of MCQs developed using two methods (i.e., AIG or traditional) was evaluated by a panel of content experts in a blinded study. Participants rated a total of 102 MCQs using six quality metrics and made a judgment regarding whether or not each item tested recall or application of knowledge. A Wilcoxon two-sample test evaluated differences in each of the six quality metrics rating scales as well as an overall cognitive domain judgment. No significant differences were found in terms of item quality or cognitive domain assessed when comparing the two item development methods. The vast majority of items (> 90%) developed using both methods were deemed to be assessing higher-order skills. When compared to traditionally developed items, MCQs developed using AIG demonstrated comparable quality. Both modalities can produce items that assess higher-order cognitive skills.
... Multiple-choice questions (MCQs) have been favoured as an efficient and pragmatic method of assessment in the health professions (Case & Swanson, 2000;Epstein, 2007). The single-best-answer MCQ style is commonly used to assess how students apply their knowledge to specific clinical scenarios (Surry, Torre & Durning, 2017;Swanson, Holtzman, & Allbee, 2008;Tan, McAleer, & Final, 2008). ...
Article
Background Multiple choice questions (MCQs) are used at all stages of medical training. However, their focus on testing students' recall of facts rather than actively facilitating learning remains an ongoing concern for educators. Having students develop MCQ items is a possible strategy to enhance the learning potential of MCQs. Methods Medical students wrote MCQs as part of a course on the medical care of vulnerable populations. Student perceptions of learning and assessment through MCQ writing were explored via surveys and focus group interviews. Survey responses were analysed using descriptive statistics and transcribed interviews were analysed thematically. Results Students reported that writing MCQs enhanced their learning and exam preparation and reduced their exam-related anxiety. It encouraged students to research what they did not know and benchmark their learning to that of their peers. Students described using deep learning strategies, were motivated to write high quality MCQ items for their peers and prioritised vocational learning in the development of their questions. Conclusion The study suggests student-developed MCQs can enhance the learning value of MCQs as a form of assessment. It also highlighted that students can be capable designers of assessment and that learning processes can be improved if students are provided agency over their learning and assessment.
...8 Additional issues with MCQs may include various 'test-taking' behaviours, such as eliminating wrong answers to arrive at the correct one, guessing from the options available and seeking clues from the language used to deduce the correct answer independently from the knowledge required.9 MCQs end up testing recognition memory, and recall is significantly affected by this cueing effect. Creating a good MCQ with valid and meaningful distractors (incorrect options) can be extremely hard. ...
Article
Full-text available
Many examinations are now delivered online using digital formats, the migration to which has been accelerated by the COVID-19 pandemic. The MRCPsych theory examinations have been delivered in this way since Autumn 2020. The multiple choice question formats currently in use are highly reliable, but other formats enabled by the digital platform, such as very short answer questions (VSAQs), may promote deeper learning. Trainees often ask for a focus on core knowledge, and the absence of cueing with VSAQs could help achieve this. This paper describes the background and evidence base for VSAQs, and how they might be introduced. Any new question formats would be thoroughly piloted before appearing in the examinations and are likely to have a phased introduction alongside existing formats.
... [7] In some studies, only a part of the assessments has been validated using this framework. [10][11][12][13][14][15] However, Wools, Eggen and Béguin have used this model to determine the validity of assessments during social workers' training. [16] In addition, in various studies, not all four inferences of Kane's Framework have been considered equally, and in some articles that have provided recommendations for the use of this framework, the recommendations have not been the same for all inferences. ...
Article
Full-text available
Background: Kane's validity framework examines the validity of the interpretation of a test at the four levels of scoring, generalization, extrapolation, and implications. No model has yet been proposed to use this framework specifically for a system of assessment. This study provided a model for the validation of the internal medicine residents' assessment system, based on Kane's framework. Materials and methods: In a five-stage study, the methods used and the challenges encountered in applying Kane's framework were first extracted from a review of the literature. Then, possible assumptions about the design and implementation of residents' tests, and proposed methods for their validation at each of the four inferences of Kane's framework, were compiled in two tables. Subsequently, the assumptions and proposed validation methods were reviewed in a focus group session. In the fourth stage, seven internal medicine professors were asked for their opinions on the results of the focus group. Finally, the assumptions and the final validation model were prepared. Results: The proposed tables were modified in the focus group. A validation table was developed consisting of the tests used at each level of Miller's pyramid. The results were approved by five professors of internal medicine. The final table has five rows: the levels Knows and Knows How, Shows How, Shows, and Does, with a fifth row for the residents' final scores. The columns of the table set out the measures necessary for validation at the four inferences of Kane's framework. Conclusion: The proposed model ensures the validity of the internal medicine specialty residency assessment system based on Kane's framework, especially at the implication level.
... Decisions around diagnosis and management are often nuanced and indeed even experts do not always agree on a single diagnosis or best course of action. Furthermore, as summarised by Surry et al. (2017), 'patients do not walk into the clinic saying, I have one of these five diagnoses, which do you think is most likely?' (p. 1082). ...
Article
Full-text available
Most undergraduate written examinations use multiple-choice questions, such as single best answer questions (SBAQs) to assess medical knowledge. In recent years, a strong evidence base has emerged for the use of very short answer questions (VSAQs). VSAQs have been shown to be an acceptable, reliable, discriminatory, and cost-effective assessment tool in both formative and summative undergraduate assessments. VSAQs address many of the concerns raised by educators using SBAQs including inauthentic clinical scenarios, cueing and test-taking behaviours by students, as well as the limited feedback SBAQs provide for both students and teachers. The widespread use of VSAQs in medical assessment has yet to be adopted, possibly due to lack of familiarity and experience with this assessment method. The following twelve tips have been constructed using our own practical experience of VSAQs alongside supporting evidence from the literature to help medical educators successfully plan, construct and implement VSAQs within medical curricula.
... More advanced medical students have shown high levels of adherence to and knowledge of procedures, providing more correct answers in a formative multiple choice exam than less advanced medical students [23]. Even in high stakes multiple choice exams it is very difficult to design questions beyond the clinical reasoning aspect of pattern recognition [24], and medical students provide very little analytical reasoning and much test-taking behaviour while answering multiple choice questions [25]. As far as OSCE performance is concerned, final year medical students performed weakest when clinical reasoning skills were assessed, which require high levels of awareness, and strongest in procedural skills, which require high levels of adherence to procedures [26]. ...
Article
Full-text available
Background Important competences of physicians regarding patient safety include communication, leadership, stress resistance, adherence to procedures, awareness, and teamwork. Similarly, prospective flight school applicants are tested for the same set of skills during selection. The aim of our study was to assess these core competences in advanced undergraduate medical students from different medical schools. Methods In 2017, 67 medical students (year 5 and 6) from the universities of Hamburg, Oldenburg, and TU Munich, Germany, participated in the verified Group Assessment Performance (GAP)-Test at the German Aerospace Center (DLR) in Hamburg. All participants were rated by DLR assessment observers with a set of empirically derived behavioural checklists. These checklists consisted of 6-point rating scales (1: very low occurrence to 6: very high occurrence) and included the competences leadership, teamwork, stress resistance, communication, awareness, and adherence to procedures. Medical students’ scores were compared with the results of 117 admitted flight school applicants. Results Medical students showed significantly higher scores than admitted flight school applicants for adherence to procedures (p < .001, d = .63) and communication (p < .01, d = .62). They reached significantly lower ratings for teamwork (p < .001, d = .77), stress resistance (p < .001, d = .70), and awareness (p < .001, d = 1.31). Students in semester 10 showed significantly (p < .02, d = .58) higher scores in the awareness domain than final year students. On average, flight school entrance level was not reached by either group for this domain. Conclusions Advanced medical students’ low results for awareness are alarming, as awareness is essential and integrative for clinical reasoning and patient safety. Further studies should elucidate and discuss whether awareness needs to be included in medical student selection or integrated into the curriculum in training units.
Chapter
From a didactic perspective, examinations fulfil important functions, most notably their influence on learning ("assessment drives learning"). After presenting the requirements placed on examinations from a measurement-theory perspective, as well as examination planning, the individual examination formats (written, oral, practical) are described. Particular attention is paid to multiple-choice questions, which are characteristic of medical school. Tips for formulating good questions and guidance on their reuse are provided. The "Key Feature Problem" and the "Script Concordance Test" are described as further options for written examinations. The section on oral examinations considers factors influencing scoring and suggestions for general improvement. Regarding practical examinations, in addition to the Objective Structured Clinical Examination (OSCE), ways of conducting assessments in the course of clinical work are also presented.
Article
Full-text available
Objective: To construct and validate a questionnaire related to adult cardiopulmonary resuscitation in Basic Life Support, using the Automatic External Defibrillator, in the hospital environment. Methodology: applied study, conducted at the University of São Paulo at Ribeirão Preto College of Nursing, from January 2017 to March 2018. Participants were 16 Urgency and Emergency experts, with Fehring’s criteria used to select them. The rules of the National Council of Medical Examiners manual and guidelines of the American Heart Association were applied. Descriptive statistics and inter-rater agreement analysis through Gwet’s AC1 were used. The questionnaire was validated in relation to organization, objectivity and clarity. Results: a validated questionnaire with 20 multiple choice questions with “almost perfect” inter-rater agreement was produced. Conclusion: the questionnaire was shown to be valid for use as an assessment instrument on the subject addressed.
Article
Full-text available
Introduction: This resource describes the development and implementation of materials that medical educators or researchers can use to develop or analyze the clinical reasoning of physicians from resident through attending level in an outpatient clinic setting. The resource includes two scenario-based simulations (diabetes and angina), implementation support materials, an open-ended post-encounter form, and a think-aloud reflection protocol. Method: We designed two scenarios with potential case ambiguity and contextual factors to add complexity for studying clinical reasoning. They are designed to be used prior to an open-ended written exercise and a think-aloud reflection to elicit reasoning and reflection. We report on their use in a research context but developed them to be used in both educational and research settings. Results: Twelve physicians (5 interns, 3 residents, and 4 attendings) considered between three and six differential diagnoses for the diabetes scenario (m = 4.0) and between three and nine (m = 4.3) for angina. In think-aloud reflections, participants reconsidered their thinking between zero and 14 times (m = 3.5) for diabetes and between zero and 11 times (m = 3.3) for angina. Cognitive load scores ranged from four to eight (out of ten; m = 6.2) for diabetes and five to eight (m = 6.6) for angina. Participants rated scenario authenticity between four and five (out of five). Discussion: The potential case content ambiguity, along with the contextual factors (e.g., the patient suggesting alternative diagnoses), provides a complex environment in which to explore or teach clinical reasoning.
Article
Purpose: Direct assessment of trainee performance across time is a core tenet of competency-based medical education. While variability of psychomotor skills across levels of expertise has been described, the performance variability exhibited by a particular trainee across time remains unexplored. The goal of this study was to document the consistency of individual surgeons' technical skill performance. Method: A secondary analysis of assessment data (collected 2010-2012, originally published 2015) generated by a prospective cohort of participants at Montreal Children's Hospital with differing levels of expertise was conducted in 2017. Trained raters scored blinded recordings of a myringotomy and tube insertion performed 4 times by junior and senior residents and attending surgeons over a 6-month period using a previously reported assessment tool. Descriptive exploratory analyses and univariate comparison of standard deviations (SDs) were conducted to document variability within individuals across time and across training levels. Results: Thirty-six assessments from 9 participants were analyzed. The SD of scores for junior residents was highly variable (5.8 on a scale of 30, compared to 1.8 for both senior residents and attendings [F(2,19) = 5.68, P < 0.05]). For a given individual, the range of scores was twice as large for junior residents as for senior residents and attendings. Conclusions: Surgical residents may display highly variable performances across time, and individual variability appears to decrease with increasing expertise. Operative skill variability could be underrepresented in direct-observation assessment; emphasis on an adequate number of repeated evaluations for junior residents may be needed to support judgments of competence or entrustment.
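The analysis above amounts to computing each surgeon's SD across their four repeated assessments and comparing those SDs across expertise groups. A minimal sketch with invented scores (the group sizes and 30-point scale follow the description; the data and the particular test are illustrative assumptions, not a reproduction of the study's analysis):

```python
# Hypothetical sketch: within-surgeon variability across repeated
# assessments, compared across expertise groups. Scores are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Four blinded assessments per surgeon, scored out of 30.
juniors = rng.normal(16, 6, size=(3, 4))
seniors = rng.normal(22, 2, size=(3, 4))
attendings = rng.normal(26, 2, size=(3, 4))

# SD of each surgeon's four scores.
sd_j, sd_s, sd_a = (np.std(g, axis=1, ddof=1) for g in (juniors, seniors, attendings))
# One-way comparison of within-surgeon SDs across the three groups.
f, p = stats.f_oneway(sd_j, sd_s, sd_a)
print(f"mean SD: juniors {sd_j.mean():.1f}, seniors {sd_s.mean():.1f}, "
      f"attendings {sd_a.mean():.1f}; F = {f:.2f}, p = {p:.3f}")
```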
Article
Full-text available
Background: A solid understanding of the science underpinning treatment is essential for all doctors. Pathology teaching and assessment are fundamental components of the undergraduate medicine curriculum. Assessment drives learning and the choice of assessments influences students' learning behaviours. The use of multiple-choice questions is common but is associated with significant cueing and may promote "rote learning". Essay-type questions and Objective Structured Clinical Examinations (OSCEs) are resource-intensive in terms of delivery and marking and do not allow adequate sampling of the curriculum. To address these limitations, we used a novel online tool to administer Very Short Answer questions (VSAQs) and evaluated the utility of the VSAQs in an undergraduate summative pathology assessment. Methods: A group of 285 medical students took the summative assessment, comprising 50 VSAQs, 50 single best answer questions (SBAQs), and 75 extended matching questions (EMQs). The VSAQs were machine-marked against pre-approved responses and subsequently reviewed by a panel of pathologists, with the software remembering all new marking judgements. Results: The total time taken to mark all 50 VSAQs for all 285 students was 5 hours, compared to 70 hours required to manually mark an equivalent number of questions in a paper-based pathology exam. The median percentage score for the VSAQs test (72%) was significantly lower than that of the SBAQs (80%) and EMQs (84%), p < 0.0001. VSAQs had a higher Cronbach alpha (0.86) than SBAQs (0.76) and EMQs (0.77). VSAQs, SBAQs and EMQs had a mean point-biserial of 0.35, 0.30 and 0.28, respectively. Conclusion: VSAQs are an acceptable, reliable and discriminatory method for assessing pathology, and may enhance students' understanding of how pathology supports clinical decision-making and clinical care by changing learning behaviour.
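Cronbach's alpha and point-biserial discrimination, as reported above, are standard psychometrics that are straightforward to compute from a scored response matrix. A minimal sketch on simulated 0/1 responses (the matrix dimensions echo the study; the data and resulting values are invented):

```python
# Hypothetical sketch of the psychometrics reported above: Cronbach's
# alpha for internal consistency and point-biserial item discrimination.
# The 0/1 response matrix below is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
ability = rng.normal(size=(285, 1))        # one latent ability per student
difficulty = rng.normal(size=(1, 50))      # one difficulty per item
responses = (rng.random((285, 50))
             < 1 / (1 + np.exp(-(ability - difficulty)))).astype(float)

def cronbach_alpha(x: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

totals = responses.sum(axis=1)
# Point-biserial: correlate each item with the rest-of-test score.
pbis = [stats.pointbiserialr(responses[:, i], totals - responses[:, i])[0]
        for i in range(responses.shape[1])]
print(f"alpha = {cronbach_alpha(responses):.2f}, "
      f"mean point-biserial = {np.mean(pbis):.2f}")
```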
Article
University assessment is in the midst of transformation. Assessments are no longer designed solely to determine that students can remember and regurgitate lecture content, nor in order to rank students to aid with some future selection process. Instead assessments are expected to drive, support and enhance learning and to contribute to student self‐assessment and development of skills and attributes for a lifetime of learning. While traditional purposes of certifying achievement and determining readiness to progress remain important, these new expectations for assessment can create tensions in assessment design, selection and deployment. With recognition of these tensions, three contemporary approaches to assessment in medical education are described. These approaches include: careful consideration of the educational impact of assessment – before, during (test- or recall-enhanced learning) and after assessments; development of student (and staff) assessment literacy; and planning of cohesive systems of assessment (with a range of assessment tools) designed to assess the various competencies demanded of future graduates. These approaches purposefully straddle the cross purposes of assessment in modern health professions education. The implications of these models are explored within the context of medical education and then linked with contemporary work in the anatomical sciences in order to highlight current synergies and potential future innovations when using evidence informed strategies to boost the educational impact of assessments.
Article
Full-text available
Background: Student performance in examinations reflects on both teaching and student learning. Very short answer questions require students to provide a self-generated response to a question of between one and five words, which removes the cueing effects of single best answer format examinations while still enabling efficient machine marking. The aim of this study was to pilot a method of analysing student errors in an applied knowledge test consisting of very short answer questions, which would enable identification of common areas that could potentially guide future teaching. Methods: We analysed the incorrect answers given by 1417 students from 20 UK medical schools in a formative very short answer question assessment delivered online. Findings: The analysis identified four predominant types of error: inability to identify the most important abnormal value, over- or unnecessary investigation, lack of specificity of radiology requesting and over-reliance on trigger words. Conclusions: We provide evidence that an additional benefit to the very short answer question format examination is that analysis of errors is possible. Further assessment is required to determine if altering teaching based on the error analysis can lead to improvements in student performance.
Article
Full-text available
This commentary addresses the gap in the literature regarding discussion of the legitimate use of Constant Comparative Analysis Method (CCA) outside of Grounded Theory. The purpose is to show the strength of using CCA to maintain the emic perspective and how theoretical frameworks can maintain the etic perspective throughout the analysis. My naturalistic inquiry model shows how conceptual frameworks and theoretical frameworks can be integrated when using the CCA method.
Article
Full-text available
An ongoing debate exists in the medical education literature regarding the potential benefits of pattern recognition (non-analytic reasoning), actively comparing and contrasting diagnostic options (analytic reasoning) or using a combination approach. Studies have not, however, explicitly explored faculty's thought processes while tackling clinical problems through the lens of dual process theory to inform this debate. Further, these thought processes have not been studied in relation to the difficulty of the task or other potential mediating influences such as personal factors and fatigue, which could also be influenced by personal factors such as sleep deprivation. We therefore sought to determine which reasoning process(es) were used when answering clinically oriented multiple-choice questions (MCQs) and whether these processes differed based on the dual process theory characteristics: accuracy, reading time and answering time, as well as psychometrically determined item difficulty and sleep deprivation. We performed a think-aloud procedure to explore faculty's thought processes while taking these MCQs, coding think-aloud data based on reasoning process (analytic, non-analytic, guessing or a combination of processes) as well as word count, number of stated concepts, reading time, answering time, and accuracy. We also included questions regarding the amount of work in the recent past. We then conducted statistical analyses to examine the associations between these measures, such as correlations between frequencies of reasoning processes and item accuracy and difficulty. We also examined the total frequencies of the different reasoning processes for correctly and incorrectly answered items. Regardless of whether the questions were classified as 'hard' or 'easy', non-analytical reasoning led to the correct answer more often than to an incorrect answer. Significant correlations were found between the self-reported recent number of hours worked and both think-aloud word count and the number of concepts used in reasoning, but not item accuracy. When all MCQs were included, 19% of the variance in correctness could be explained by the frequency of expression of these three think-aloud processes (analytic, non-analytic, or combined). We found evidence to support the notion that the difficulty of an item in a test is not a systematic feature of the item itself but is always a result of the interaction between the item and the candidate. Use of analytic reasoning did not appear to improve accuracy. Our data suggest that individuals do not apply either System 1 or System 2 exclusively but instead fall along a continuum, with some individuals falling at one end of the spectrum.
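The "19% of variance explained" figure corresponds to regressing item correctness on the per-item frequencies of the three coded process types and reading off R². A minimal linear-probability sketch on invented counts (the column names, simulated values, and coefficients are illustrative assumptions only):

```python
# Hypothetical sketch: how much variance in answer correctness is
# explained by the frequencies of analytic, non-analytic, and combined
# reasoning processes (the study reports ~19%). Data are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({
    "analytic": rng.poisson(1.0, n),     # analytic utterances per item
    "nonanalytic": rng.poisson(1.5, n),  # non-analytic utterances per item
    "combined": rng.poisson(0.5, n),     # combined-process utterances
})
score = 0.4 + 0.05 * df["nonanalytic"] - 0.02 * df["analytic"] + rng.normal(0, 0.25, n)
df["correct"] = (score + 0.05 * df["combined"] > 0.5).astype(int)

# Linear-probability regression of correctness on process frequencies.
ols = smf.ols("correct ~ analytic + nonanalytic + combined", data=df).fit()
print(f"R^2 = {ols.rsquared:.2f}")  # share of variance in correctness explained
```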
Article
Full-text available
Context specificity and the impact that contextual factors have on the complex process of clinical reasoning is poorly understood. Using situated cognition as the theoretical framework, our aim was to evaluate the verbalized clinical reasoning processes of resident physicians in order to describe what impact the presence of contextual factors had on their clinical reasoning. Participants viewed three video-recorded clinical encounters portraying straightforward diagnoses in internal medicine, with select patient contextual factors modified. After watching each video recording, participants completed a think-aloud protocol. Transcripts from the think-aloud protocols were analyzed using a constant comparative approach. After iterative coding, utterances were analyzed for emergent themes and grouped into categories, themes and subthemes. Ten residents participated in the study, with saturation reached during analysis. Participants universally acknowledged the presence of contextual factors in the video recordings. Four categories emerged as a consequence of the contextual factors: (1) emotional reactions, (2) behavioral inferences, (3) optimizing the doctor-patient relationship and (4) difficulty with closure of the clinical encounter. The presence of contextual factors may impact clinical reasoning performance in resident physicians. When confronted with the presence of contextual factors in a clinical scenario, residents experienced difficulty with closure of the encounter, exhibited as diagnostic uncertainty. This finding raises important questions about the relationship between contextual factors and clinical reasoning activities and how this relationship might influence the cost-effectiveness of care. This study also provides insight into how the phenomenon of context specificity may be explained using situated cognition theory.
Article
Full-text available
This is an important book. It addresses the question: Are human beings systematically irrational? They would be so if they were "hard-wired" to reason badly on certain types of tasks. Even if they could discover on reflection that the reasoning was bad, the unreflective tendency to reason badly would be a systematic irrationality. According to Stanovich, psychologists have shown that "people assess probabilities incorrectly, they display confirmation bias, they test hypotheses inefficiently, they violate the axioms of utility theory, they do not properly calibrate degrees of belief, they overproject their own opinions onto others, they allow prior knowledge to become implicated in deductive reasoning, they systematically underweight information about nonoccurrence when evaluating covariation, and they display numerous other information-processing biases." (1-2) Such cognitive psychologists as Nisbett and Ross (1980) and Kahneman, Slovic and Tversky (1982) interpret this apparently dismal typical performance as evidence of hard-wired "heuristics and biases" (whose presence can be given an evolutionary explanation) which are sometimes irrational. Critics have proposed four alternative explanations. (1) Are the deficiencies just unsystematic performance errors of basically competent subjects due to such temporary psychological malfunctions as inattention or memory lapses? Stanovich and West (1998a) administered to the same subjects four types of reasoning tests: syllogistic reasoning, selection, statistical reasoning, and argument evaluation. They assumed that, if mistakes were random performance errors, there would be no significant correlation between scores on the different types of tests. In fact, they found modest but statistically very significant correlations (at the .001 level) between all pairs of scores except those on statistical reasoning and argument evaluation. Hence, they concluded, not all mistakes on such reasoning tasks are random performance errors.
Article
Full-text available
Background: Clinical vignette multiple-choice questions (MCQs) are widely used in medical education, but the clinical reasoning (CR) strategies employed when approaching these questions have not been well described. Objectives: (1) To identify CR strategies and test-taking (TT) behaviors of physician trainees while solving clinical vignette MCQs. (2) To examine relationships of these strategies and behaviors with performance on a high-stakes clinical vignette MCQ examination. Methods: Thirteen PGY-1 level trainees completed 6 clinical vignette MCQs using a think-aloud protocol. Thematic analysis employing elements of grounded theory was performed on the data transcriptions to identify CR strategies and TT behaviors. Participants’ CR strategies and TT behaviors were then compared with their USMLE Step 2 CK scores. Results: Twelve CR strategies and TT behaviors were identified. Low performers on Step 2 CK demonstrated more premature closure and faulty knowledge, and less ruling out of alternatives or admission of knowledge deficits. High performers on Step 2 CK demonstrated more ruling out of alternatives and admission of knowledge deficits, and less premature closure, faulty knowledge, or closure prior to reading the alternatives. Conclusions: Patterns of clinical reasoning strategies and behaviors during clinical vignette-style MCQs appear to be associated with clinical vignette MCQ exam performance.
Article
Full-text available
Dual-process and dual-system theories in both cognitive and social psychology have been subjected to a number of recently published criticisms. However, they have been attacked as a category, incorrectly assuming there is a generic version that applies to all. We identify and respond to 5 main lines of argument made by such critics. We agree that some of these arguments have force against some of the theories in the literature but believe them to be overstated. We argue that the dual-processing distinction is supported by much recent evidence in cognitive science. Our preferred theoretical approach is one in which rapid autonomous processes (Type 1) are assumed to yield default responses unless intervened on by distinctive higher-order reasoning processes (Type 2). What defines the difference is that Type 2 processing supports hypothetical thinking and loads heavily on working memory.
Article
Full-text available
Background: Whether the think-aloud protocol is a valid measure of thinking remains uncertain. Therefore, we used functional magnetic resonance imaging (fMRI) to investigate potential functional neuroanatomic differences between thinking (answering multiple-choice questions in real time) and thinking aloud (on review of items). Methods: Board-certified internal medicine physicians underwent formal think-aloud training. Next, they answered validated multiple-choice questions in an fMRI scanner while both answering (thinking) and thinking aloud about the questions, and we compared fMRI images obtained during both periods. Results: Seventeen physicians (15 men and 2 women) participated in the study. Mean physician age was 39.5 ± 7 years (range: 32-51 years). The mean number of correct responses was 18.5/32 questions (range: 15-25). Statistically significant differences were found between answering (thinking) and thinking aloud in the following regions: motor cortex, bilateral prefrontal cortex, bilateral cerebellum, and the basal ganglia (p < 0.01). Discussion: We identified significant differences between answering and thinking aloud within the motor cortex, prefrontal cortex, cerebellum, and basal ganglia. These differences were by degree (more focal activation in these areas with thinking aloud as opposed to answering). Prefrontal cortex and cerebellum activity was attributable to working memory. Basal ganglia activity was attributed to the reward of answering a question. The identified neuroimaging differences between answering and thinking aloud were expected based on existing theory and research in other fields. These findings add evidence to the notion that the think-aloud protocol is a reasonable measure of thinking.
Article
Full-text available
Background: Although the American Board of Internal Medicine (ABIM) certification is valued as a reflection of physicians' experience, education, and expertise, limited methods exist to predict performance in the examination. Purpose: The objective of this study was to develop and validate a predictive tool based on variables common to all residency programs, regarding the probability of an internal medicine graduate passing the ABIM certification examination. Methods: The development cohort was obtained from the files of the Cleveland Clinic internal medicine residents who began training between 2004 and 2008. A multivariable logistic regression model was built to predict the ABIM passing rate. The model was represented as a nomogram, which was internally validated with bootstrap resamples. The external validation was done retrospectively on a cohort of residents who graduated from two other independent internal medicine residency programs between 2007 and 2011. Results: Of the 194 Cleveland Clinic graduates used for the nomogram development, 175 (90.2%) successfully passed the ABIM certification examination. The final nomogram included four predictors: In-Training Examination (ITE) scores in postgraduate year (PGY) 1, 2, and 3, and the number of months of overnight calls in the last 6 months of residency. The nomogram achieved a concordance index (CI) of 0.98 after correcting for over-fitting bias and allowed for the determination of an estimated probability of passing the ABIM exam. Of the 126 graduates from two other residency programs used for external validation, 116 (92.1%) passed the ABIM examination. The nomogram CI in the external validation cohort was 0.94, suggesting outstanding discrimination. Conclusions: A simple user-friendly predictive tool, based on readily available data, was developed to predict the probability of passing the ABIM exam for internal medicine residents. This may guide program directors' decision-making related to program curriculum and advice given to individual residents regarding board preparation.
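For readers unfamiliar with the concordance index: for a binary pass/fail outcome it equals the area under the ROC curve of the model's predicted probabilities. The sketch below fits a simplified model on invented data (ITE scores only; the published nomogram also used overnight-call months), so only the mechanics mirror the paper.

```python
# A minimal sketch, on invented data, of fitting a pass/fail model from ITE
# scores and computing its concordance index (for a binary outcome, the
# c-index equals the area under the ROC curve).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 194  # cohort size echoing the development sample; the data are hypothetical

# Hypothetical predictors: ITE scores in PGY-1, PGY-2 and PGY-3.
X = rng.normal(loc=[55, 62, 68], scale=8, size=(n, 3))
p_pass = 1 / (1 + np.exp(-0.25 * (X[:, 2] - 60)))  # invented score-outcome link
y = rng.binomial(1, p_pass)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(f"c-index = {roc_auc_score(y, model.predict_proba(X)[:, 1]):.2f}")

# Nomogram-style use: estimated pass probability for one resident's scores.
print(f"P(pass) = {model.predict_proba([[50, 58, 63]])[0, 1]:.2f}")
```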
Article
Full-text available
General practitioners (GPs) are often faced with complicated, vague problems in situations of uncertainty that they have to solve at short notice. In such situations, gut feelings seem to play a substantial role in their diagnostic process. Qualitative research distinguished a sense of alarm and a sense of reassurance. However, not every GP trusted their gut feelings, since a scientific explanation is lacking. This paper explains how gut feelings arise and function in GPs' diagnostic reasoning. The paper reviews literature from medical, psychological and neuroscientific perspectives. Gut feelings in general practice are based on the interaction between patient information and a GP's knowledge and experience. This is visualized in a knowledge-based model of GPs' diagnostic reasoning emphasizing that this complex task combines analytical and non-analytical cognitive processes. The model integrates the two well-known diagnostic reasoning tracks of medical decision-making and medical problem-solving, and adds gut feelings as a third track. Analytical and non-analytical diagnostic reasoning interacts continuously, and GPs use elements of all three tracks, depending on the task and the situation. In this dual process theory, gut feelings emerge as a consequence of non-analytical processing of the available information and knowledge, either reassuring GPs or alerting them that something is wrong and action is required. The role of affect as a heuristic within the physician's knowledge network explains how gut feelings may help GPs to navigate in a mostly efficient way in the often complex and uncertain diagnostic situations of general practice. Emotion research and neuroscientific data support the unmistakable role of affect in the process of making decisions and explain the bodily sensation of gut feelings. The implications for health care practice and medical education are discussed.
Article
Full-text available
Both systemic and individual factors contribute to missed or delayed diagnoses. Among the multiple factors that impact clinical performance of the individual, the caliber of cognition is perhaps the most relevant and deserves our attention and understanding. In the last few decades, cognitive psychologists have gained substantial insights into the processes that underlie cognition, and a new, universal model of reasoning and decision making has emerged, Dual Process Theory. The theory has immediate application to medical decision making and provides an overall schema for understanding the variety of theoretical approaches that have been taken in the past. The model has important practical applications for decision making across the multiple domains of healthcare, and may be used as a template for teaching decision theory, as well as a platform for future research. Importantly, specific operating characteristics of the model explain how diagnostic failure occurs.
Article
Full-text available
Clinical judgment is a critical aspect of physician performance in medicine. It is essential in the formulation of a diagnosis and key to the effective and safe management of patients. Yet, the overall diagnostic error rate remains unacceptably high. In more than four decades of research, a variety of approaches have been taken, but a consensus approach toward diagnostic decision making has not emerged. In the last 20 years, important gains have been made in psychological research on human judgment. Dual-process theory has emerged as the predominant approach, positing two systems of decision making, System 1 (heuristic, intuitive) and System 2 (systematic, analytical). The author proposes a schematic model that uses the theory to develop a universal approach toward clinical decision making. Properties of the model explain many of the observed characteristics of physicians' performance. Yet the author cautions that not all medical reasoning and decision making falls neatly into one or the other of the model's systems, even though they provide a basic framework incorporating the recognized diverse approaches. He also emphasizes the complexity of decision making in actual clinical situations and the urgent need for more research to help clinicians gain additional insight and understanding regarding their decision making.
Article
Full-text available
The Internal Medicine In-Training Examination (ITE) is administered during residency training in the United States as a self-assessment and program assessment tool. Performance on this exam correlates with outcome on the American Board of Internal Medicine Certifying Examination. Internal Medicine Program Directors use the United States Medical Licensing Examination (USMLE) to make decisions in recruitment of potential applicants. This study was done to determine the correlation of USMLE Step 1, 2 and 3 results with ITE scores at each level of Internal Medicine training. A retrospective review of all residents graduating from an Internal Medicine program from 1999 to 2006 was performed; subjects were included if they had data for all USMLE Steps and the ITE during all years of training. Thirty-one subjects were included in the study. USMLE Step 1, 2 and 3 scores were correlated with ITE scores (percent correct) in each year of training. Pearson's correlation coefficient (r) was computed for each pairing, and a t test was used to determine the statistical significance of each correlation. Statistical significance was defined as a P value < 0.05. The r values for USMLE Step 1 and ITE percent correct in PGY I, II and III were 0.46, 0.55 and 0.51, respectively. Corresponding r values for USMLE Step 2 and ITE percent correct were 0.79, 0.70 and 0.72; for USMLE Step 3 these values were 0.51, 0.37 and 0.51, respectively, for each training year. USMLE scores are correlated with ITE scores; this correlation was strongest for USMLE Step 2.
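The pairwise analysis described here is simple to reproduce: scipy's pearsonr returns both the correlation coefficient and the t-test-based p-value in a single call. The scores below are invented purely to show the mechanics.

```python
# A minimal sketch, on invented scores, of the pairwise analysis: Pearson's r
# with the p-value from the associated t test (scipy returns both).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 31  # sample size echoing the study; the scores themselves are hypothetical

usmle_step2 = rng.normal(220, 15, n)                 # invented Step 2 scores
ite_pgy1 = 0.1 * usmle_step2 + rng.normal(40, 3, n)  # invented ITE percent correct, correlated by construction

r, p_value = stats.pearsonr(usmle_step2, ite_pgy1)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # significant if p < 0.05, per the study's criterion
```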
Article
Full-text available
The ubiquity of multiple-choice questions (MCQs) results from their efficiency and hence reliability. Cognitive knowledge assessed by MCQ predicts and correlates well with overall competence and performance but examinees and examiners alike frequently perceive MCQ-based testing as 'unfair'. Fairness is akin to defensibility and is an increasingly important concept in testing. It is dependent on psychometric adequacy, diligence of construction, attention to consequential validity and appropriate standard setting. There is a wealth of evidence that extended matching questions are the fairest format but MCQs should always be combined with practical assessments, as written testing emphasizes learning from written sources.
Article
This paper will attempt to illustrate the use of a kaleidoscope metaphor as a template for the organization and analysis of qualitative research data. It will provide a brief overview of the constant comparison method, examining such processes as categorization, comparison, inductive analysis, and refinement of data bits and categories. Graphic representations of our metaphoric kaleidoscope will be strategically interspersed throughout this paper.
Article
The American College of Rheumatology In-Training Examination (ACR ITE) is a feedback tool designed to identify strengths and weaknesses in individual fellow content knowledge and training program curricula. We determined whether scores on the ACR ITE, other major standardized medical examinations and competency-based ratings predict performance on the American Board of Internal Medicine (ABIM) Rheumatology Certification Examination. From 2008 to 2012, 629 second-year fellows took the ACR ITE. Bivariate correlations for assessment scores and multiple linear regression analysis were used to determine if Rheumatology Certification Examination scores were predicted by ACR ITE scores, United States Medical Licensing Examination (USMLE) scores, ABIM Internal Medicine (IM) Certification Examination scores, fellowship director ratings of overall clinical competency, and demographic variables. Logistic regression was used to evaluate if these assessments predicted a passing outcome on the Rheumatology Certification Examination. In the initial linear model, the strongest predictors of the Rheumatology Certification Examination score were second-year ACR ITE scores (β = 0.438) and IM Certification Examination scores (β = 0.273). Using a stepwise model, the strongest predictors of higher scores on the Rheumatology Certification Examination were ACR ITE scores (β = 0.449) and IM Certification Examination scores (β = 0.276). Based on logistic regression results, ACR ITE performance predicts a pass/fail outcome on the Rheumatology Certification Examination (OR = 1.016; 95% CI 1.011-1.021). The predictive value of the ACR ITE score for Rheumatology Certification Examination performance supports use of the ITE as a valid feedback tool during fellowship training.
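Because an odds ratio applies per unit of the predictor, the reported OR of 1.016 per ITE point compounds multiplicatively over larger score differences; for example, a 10-point gap scales the passing odds by 1.016^10 ≈ 1.17. A quick worked illustration (interpretation arithmetic only, no study data):

```python
# Interpretation arithmetic only (no study data): a per-point odds ratio of
# 1.016 compounds multiplicatively over larger score differences.
or_per_point = 1.016
for points in (1, 10, 25):
    print(f"{points:>2} points -> odds scaled by {or_per_point ** points:.3f}")
# 1 point -> 1.016, 10 points -> ~1.172, 25 points -> ~1.488
```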
Article
Context: Assessment is central to medical education and the validation of assessments is vital to their use. Earlier validity frameworks suffer from a multiplicity of types of validity or a failure to prioritise among sources of validity evidence. Kane's framework addresses both concerns by emphasising key inferences as the assessment progresses from a single observation to a final decision. Evidence evaluating these inferences is planned and presented as a validity argument. Objectives: We aim to offer a practical introduction to the key concepts of Kane's framework that educators will find accessible and applicable to a wide range of assessment tools and activities. Results: All assessments are ultimately intended to facilitate a defensible decision about the person being assessed. Validation is the process of collecting and interpreting evidence to support that decision. Rigorous validation involves articulating the claims and assumptions associated with the proposed decision (the interpretation/use argument), empirically testing these assumptions, and organising evidence into a coherent validity argument. Kane identifies four inferences in the validity argument: Scoring (translating an observation into one or more scores); Generalisation (using the score[s] as a reflection of performance in a test setting); Extrapolation (using the score[s] as a reflection of real-world performance); and Implications (applying the score[s] to inform a decision or action). Evidence should be collected to support each of these inferences and should focus on the most questionable assumptions in the chain of inference. Key assumptions (and needed evidence) vary depending on the assessment's intended use or associated decision. Kane's framework applies to quantitative and qualitative assessments, and to individual tests and programmes of assessment. Conclusions: Validation focuses on evaluating the key claims, assumptions and inferences that link assessment scores with their intended interpretations and uses. The Implications and associated decisions are the most important inferences in the validity argument.
Article
A core objective of residency education is to facilitate learning, and programs need more curricula and assessment tools with demonstrated validity evidence. We sought to demonstrate concurrent validity between performance on a widely shared, ambulatory curriculum (the Johns Hopkins Internal Medicine Curriculum), the Internal Medicine In-Training Examination (IM-ITE), and the American Board of Internal Medicine Certifying Examination (ABIM-CE). A cohort study of 443 postgraduate year (PGY)-3 residents at 22 academic and community hospital internal medicine residency programs using the curriculum through the Johns Hopkins Internet Learning Center (ILC). Total and percentile rank scores on ILC didactic modules were compared with total and percentile rank scores on the IM-ITE and total scores on the ABIM-CE. The average score on didactic modules was 80.1%; the percentile rank was 53.8. The average IM-ITE score was 64.1% with a percentile rank of 54.8. The average score on the ABIM-CE was 464. Scores on the didactic modules, IM-ITE, and ABIM-CE correlated with each other (P < .05). Residents completing greater numbers of didactic modules, regardless of scores, had higher IM-ITE total and percentile rank scores (P < .05). Resident performance on modules covering back pain, hypertension, preoperative evaluation, and upper respiratory tract infection was associated with IM-ITE percentile rank. Performance on a widely shared ambulatory curriculum is associated with performance on the IM-ITE and the ABIM-CE.
Article
To determine whether there is an association between several commonly obtained premedical school and medical school measures and board certification performance. We specifically included measures from our institution for which we have predictive validity evidence into the internship year. We hypothesized that board certification would be most likely to be associated with clinical measures of performance during medical school, and with scores on standardized tests, whether taken before or during medical school. Achieving board certification in an American Board of Medical Specialties specialty was used as our outcome measure for a 7-year cohort of graduates (1995-2002). Age at matriculation, Medical College Admission Test (MCAT) score, undergraduate college grade point average (GPA), undergraduate college science GPA, Uniformed Services University (USU) cumulative GPA, USU preclerkship GPA, USU clerkship year GPA, departmental competency committee evaluation, Internal Medicine (IM) clerkship clinical performance rating (points), IM total clerkship points, history of Student Promotion Committee review, United States Medical Licensing Examination (USMLE) Step 1 score and USMLE Step 2 clinical knowledge score were examined for association with this outcome. Ninety-three of 1,155 graduates were not certified, yielding an average board certification rate of 91.9% for the study cohort. Significant small correlations were found between board certification and IM clerkship points (r = 0.117), IM clerkship grade (r = 0.108), clerkship year GPA (r = 0.078), undergraduate college science GPA (r = 0.072), preclerkship GPA and medical school GPA (r = 0.068 for both), USMLE Step 1 (r = 0.066), undergraduate college total GPA (r = 0.062), and age at matriculation (r = -0.061). In comparing the two groups (board certified and not board certified), significant differences were seen for all included variables except MCAT and USMLE Step 2 clinical knowledge scores. Taken together, the variables explained 4.1% of the variance in board certification by logistic regression. This investigation provides some additional validity evidence that the measures collected for student evaluation before and during medical school are warranted.
Article
To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. An argument-based approach to validation suggests that the claims based on the test scores be outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses. Validation then can be thought of as an evaluation of the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions. In outlining the argument-based approach to validation, this paper makes eight general points. First, it is the proposed score interpretations and uses that are validated and not the test or the test scores. Second, the validity of a proposed interpretation or use depends on how well the evidence supports the claims being made. Third, more-ambitious claims require more support than less-ambitious claims. Fourth, more-ambitious claims (e.g., construct interpretations) tend to be more useful than less-ambitious claims, but they are also harder to validate. Fifth, interpretations and uses can change over time in response to new needs and new understandings leading to changes in the evidence needed for validation. Sixth, the evaluation of score uses requires an evaluation of the consequences of the proposed uses; negative consequences can render a score use unacceptable. Seventh, the rejection of a score use does not necessarily invalidate a prior, underlying score interpretation. Eighth, the validation of the score interpretation on which a score use is based does not validate the score use.
Article
Qualitative research in general and the grounded theory approach in particular, have become increasingly prominent in medical education research in recent years. In this Guide, we first provide a historical perspective on the origin and evolution of grounded theory. We then outline the principles underlying the grounded theory approach and the procedures for doing a grounded theory study, illustrating these elements with real examples. Next, we address key critiques of grounded theory, which continue to shape how the method is perceived and used. Finally, pitfalls and controversies in grounded theory research are examined to provide a balanced view of both the potential and the challenges of this approach. This Guide aims to assist researchers new to grounded theory to approach their studies in a disciplined and rigorous fashion, to challenge experienced researchers to reflect on their assumptions, and to arm readers of medical education research with an approach to critically appraising the quality of grounded theory studies.
Article
Prior work has found that a doctor's clinical reasoning performance varies on a case-by-case (situation) basis; this is often referred to as 'context specificity'. To explore the influence of context on diagnostic and therapeutic clinical reasoning, we constructed a series of videotapes to which doctors were asked to respond, modifying different contextual factors (patient, doctor, setting). We explored how these contextual factors, as displayed by videotape encounters, may have influenced the clinical reasoning of board-certified internists (experts). Our purpose was to clarify the influence of context on reasoning, to build upon education theory and to generate implications for education practice. Qualitative data about experts were gathered from two sources: think-aloud protocols reflecting concurrent thought processes that occurred while board-certified internists viewed videotape encounters, and free-text responses to queries that explicitly asked these experts to comment on the influence of selected contextual factors on their clinical reasoning processes. These data sources provided both actual performance data (think-aloud responses) and opinions on reflection (free-text answers) regarding the influence of context on reasoning. Results for each data source were analysed for emergent themes and then combined into a unified theoretical model. Several themes emerged from our data and were broadly classified as components influencing the impact of contextual factors, mechanisms for addressing contextual factors, and consequences of contextual factors for patient care. Themes from both data sources had good overlap, indicating that experts are somewhat cognisant of the potential influences of context on their reasoning processes; notable exceptions concerned the themes of missed key findings, balancing of goals and the influence of encounter setting, which emerged in the think-aloud but not the free-text analysis. Our unified model is consistent with the tenets of cognitive load, situated cognition and ecological psychology theories. A number of potentially modifiable influences on clinical reasoning were identified. Implications for doctor training and practice are discussed.
Article
Physicians who are disciplined by state licensing boards are more likely to have demonstrated unprofessional behavior in medical school. Information is limited on whether similar performance measures taken during residency can predict performance as practicing physicians. To determine whether performance measures during residency predict the likelihood of future disciplinary actions against practicing internists. Retrospective cohort study. State licensing board disciplinary actions against physicians from 1990 to 2006. A total of 66,171 physicians who entered internal medicine residency training in the United States from 1990 to 2000 and became diplomates. Predictor variables included components of the Residents' Annual Evaluation Summary ratings and American Board of Internal Medicine (ABIM) certification examination scores. Two performance measures independently predicted disciplinary action. A low professionalism rating on the Residents' Annual Evaluation Summary predicted increased risk for disciplinary action (hazard ratio, 1.7 [95% CI, 1.3 to 2.2]), and high performance on the ABIM certification examination predicted decreased risk for disciplinary action (hazard ratio, 0.7 [CI, 0.60 to 0.70] for American or Canadian medical school graduates and 0.9 [CI, 0.80 to 1.0] for international medical school graduates). Progressively better professionalism ratings and ABIM certification examination scores were associated with less risk for subsequent disciplinary actions; the risk ranged from 4.0% for the lowest professionalism rating to 0.5% for the highest and from 2.5% for the lowest examination scores to 0.0% for the highest. The study was retrospective. Some diplomates may have practiced outside of the United States. Nondiplomates were excluded. Poor performance on behavioral and cognitive measures during residency is associated with greater risk for state licensing board actions against practicing physicians at every point on a performance continuum. These findings support the Accreditation Council for Graduate Medical Education standards for professionalism and cognitive performance and the development of best practices to remediate these deficiencies.
Article
Context specificity, or the variation in a participant's performance from one case, or situation, to the next, is a recognized problem in medical education. However, studies have not explored the potential reasons for context specificity in experts using the lens of situated cognition theory and cognitive load theory (CLT). Using these theories, we explored the influence of selected contextual factors on clinical reasoning performance in internal medicine experts. We constructed and validated a series of videotapes portraying different chief complaints for three common diagnoses seen in internal medicine. Using the situated cognition framework, we modified selected contextual factors (patient, encounter, and/or physician) in each videotape. Following each videotape, participants completed a post-encounter form (PEF) and a think-aloud protocol. A survey estimating recent exposure from their practice to the correct videotape diagnoses was also completed. The time given to complete the PEF was randomly varied with each videotape. Qualitative utterances from the think-aloud procedure were converted to numeric measures of cognitive load. Survey and cognitive load measures were correlated with PEF performance. Pearson correlations were used to assess relations between the independent variables (cognitive load, survey of experience, contextual factors modified) and PEF performance. To further explore context specificity, analysis of covariance (ANCOVA) was used to assess differences in PEF scores, by diagnosis, after controlling for time. Low correlations between PEF sections, both across diagnoses and within each diagnosis, were observed (r values ranged from -.63 to .60). Limiting the time to complete the PEF impacted PEF performance (r = .2 to .4). Context specificity was further substantiated by demonstrating significant differences on most PEF section scores across diagnoses (ANCOVA). Cognitive load measures were negatively correlated with PEF scores. The presence of selected contextual factors appeared to influence diagnostic more than therapeutic reasoning (r = -.2 to -.38). Contextual factors appear to impact expert physician performance. The impact observed is consistent with the predictions of situated cognition theory and CLT. These findings have potential implications for educational theory and clinical practice.
Article
For 223 residents from eight teaching hospitals, the results of the second-year in-training examination and the first-sitting certifying examination of the American Board of Internal Medicine were highly correlated. The results of the in-training examination can serve residents as an important measure of their preparedness for certification and can be useful in identifying the need for more intensive self-study strategies during the subsequent one and a half years.
Article
Our objective was to determine the ability of the internal medicine In-Training Examination (ITE) to predict pass or fail outcomes on the American Board of Internal Medicine (ABIM) certifying examination and to develop an externally validated predictive model and a simple equation that can be used by residency directors to provide probability feedback for their residency programs. We collected a study sample of 155 internal medicine residents from the three Virginia internal medicine programs and a validation sample of 64 internal medicine residents from a residency program outside Virginia. Scores from both samples were collected across three class cohorts. The Kolmogorov-Smirnov z test indicated no statistically significant difference between the distribution of scores for the two samples (z = 1.284, p = .074). Results of the logistic model yielded a statistically significant prediction of ABIM pass or fail performance from ITE scores (Wald = 35.49, SE = 0.036, df = 1, p < .005) and overall correct classifications for the study sample and validation sample at 79% and 75%, respectively. The ITE is a useful tool in assessing the likelihood of a resident's passing or failing the ABIM certifying examination but is less predictive for residents who received ITE scores between 49 and 66.
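The validation workflow here can be sketched as follows: fit a logistic model on the study sample, then report the percentage of correct pass/fail classifications in the external sample. In the sketch below the sample sizes echo the paper (155 and 64), but the scores, outcomes and fitted model are all hypothetical.

```python
# A minimal sketch, on hypothetical data, of logistic prediction of ABIM
# pass/fail from ITE scores, evaluated by percent correctly classified in
# an external validation sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)

def make_cohort(n):
    ite = rng.normal(60, 10, n)                   # invented ITE percent-correct scores
    p_pass = 1 / (1 + np.exp(-0.2 * (ite - 55)))  # invented score-outcome link
    return ite.reshape(-1, 1), rng.binomial(1, p_pass)

X_study, y_study = make_cohort(155)  # study sample
X_valid, y_valid = make_cohort(64)   # external validation sample

model = LogisticRegression().fit(X_study, y_study)
for name, X, y in [("study", X_study, y_study), ("validation", X_valid, y_valid)]:
    pct = 100 * accuracy_score(y, model.predict(X))
    print(f"{name}: {pct:.0f}% correctly classified")
```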
Article
Although the written component of the Royal College of Physicians and Surgeons of Canada (RCPSC) internal medicine examination is important for obtaining licensure and certification as a specialist, no methods exist to predict a candidate's performance on the examination. We obtained data from 5 Canadian universities from 1988 to 1998 in order to compare raw scores from the American Internal Medicine In-Training Examination (AIMI-TE) with raw scores and outcomes (pass or fail) of the written component of the RCPSC internal medicine examination. Mean scores on the AIMI-TE correlated well with scores on the RCPSC internal medicine written examination for all postgraduate years (r = 0.62, r = 0.55 and r = 0.65 for postgraduate years 1, 2 and 3 respectively). Scores above the 50th percentile on the AIMI-TE were predictive of a low failure rate (< 1.5%) on the RCPSC internal medicine written examination, whereas scores at or below the 10th percentile were associated with a high failure rate (about 24%). Candidates who are eligible to take the written component of the RCPSC certification examination in internal medicine can use the AIMI-TE to predict their performance on the Canadian examination. The AIMI-TE is a useful test for residents in all levels of training, because the examination scores have a strong relation to expected performance on the Canadian examination for each year of postgraduate training.
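The percentile-band failure rates quoted above amount to grouping candidates by in-training-exam percentile and tabulating the certification failure rate within each band. The sketch below does this on invented data; only the band edges come from the abstract.

```python
# A minimal sketch, on invented data, of tabulating certification failure
# rates by in-training-exam percentile band.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 500

percentile = rng.uniform(0, 100, n)       # hypothetical AIMI-TE percentiles
p_fail = 0.25 * np.exp(-percentile / 15)  # invented: failure risk falls with percentile
failed = rng.binomial(1, p_fail)

df = pd.DataFrame({"percentile": percentile, "failed": failed})
bands = pd.cut(df["percentile"], bins=[0, 10, 50, 100],
               labels=["<=10th", "10th-50th", ">50th"])
print(df.groupby(bands, observed=True)["failed"].mean().round(3))
```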
Article
This study investigates the predictive validity of the In-Training Examination (ITE). Although studies have confirmed the predictive validity of ITEs in other medical specialties, no study has been done for general pediatrics. Each year, residents in accredited pediatric training programs take the ITE as a self-assessment instrument. The ITE is similar to the American Board of Pediatrics General Pediatrics Certifying Examination. First-time takers of the certifying examination over a 5-year period who took at least 1 ITE examination were included in the sample. Regression models analyzed the predictive value of the ITE. The predictive power of the ITE in the first training year is minimal. However, the predictive power of the ITE increases each year, providing the greatest power in the third year of training. Even though ITE scores provide information regarding the likelihood of passing the certification examination, the data should be used with caution, particularly in the first training year. Other factors also must be considered when predicting performance on the certification examination. This study continues to support the ITE as an assessment tool for program directors, as well as a means of providing residents with feedback regarding their acquisition of pediatric knowledge.
Teaching Clinical Reasoning. ACP Teaching Medicine Series
  • Ratcliffe T
  • Durning SJ
Dual-process theories of higher cognition: advancing the debate
  • Evans JSBT
  • Stanovich KE