Article

Feedback to support examiners’ understanding of the standard-setting process and the performance of students: AMEE Guide No. 145

Taylor & Francis
Medical Teacher

Abstract

The ratings that judges or examiners use for determining pass marks and students' performance on OSCEs serve a number of essential functions in medical education assessment, and their validity is a pivotal issue. However, certain types of error often occur in ratings and require special efforts to minimise. Rater characteristics (e.g. generosity error, severity error, central tendency error or halo error) may introduce performance-irrelevant variance. Prior literature shows the fundamental problems in student performance measurement that arise from judges' or examiners' errors. It also indicates that controlling such errors supports a robust and credible pass mark and, thus, accurate student marks. Therefore, for a standard-setter who identifies the pass mark and an examiner who rates student performance in OSCEs, appropriate, user-friendly feedback on their standard-setting and ratings is essential for reducing bias. Such feedback provides useful avenues for understanding why performance ratings may be irregular and how to improve the quality of ratings. This AMEE Guide discusses various methods of feedback to support examiners' understanding of the performance of students and the standard-setting process, with the aim of making inferences from assessments fair, valid and reliable.
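As a rough, purely illustrative sketch of the kind of screening that can reveal the rater effects listed above (generosity/leniency, severity and central tendency), the following Python snippet compares each examiner's mean and spread of ratings with the whole panel. The data, thresholds and variable names are invented for the example and are not taken from the Guide; a fully crossed design (every examiner rates every student) is assumed.

```python
# Hypothetical screening of examiner ratings for leniency/severity and
# central tendency; synthetic data on a 0-10 rating scale.
import numpy as np

rng = np.random.default_rng(0)
n_examiners, n_students = 6, 40
base = rng.normal(6, 1.5, (n_examiners, n_students))
base[1] += 1.5                                        # simulate a lenient examiner
base[3] = 6 + 0.3 * rng.standard_normal(n_students)   # simulate central tendency
ratings = np.clip(base, 0, 10)

grand_mean = ratings.mean()
grand_sd = ratings.std(ddof=1)

for e in range(n_examiners):
    mean_e, sd_e = ratings[e].mean(), ratings[e].std(ddof=1)
    flags = []
    if mean_e > grand_mean + 0.5 * grand_sd:
        flags.append("possible leniency")
    elif mean_e < grand_mean - 0.5 * grand_sd:
        flags.append("possible severity")
    if sd_e < 0.5 * grand_sd:
        flags.append("possible central tendency")
    print(f"examiner {e}: mean={mean_e:.2f}, sd={sd_e:.2f}", ", ".join(flags))
```

Examiners flagged in this way would normally be followed up with the targeted feedback the Guide describes, rather than excluded outright.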


... Feedback is one of the most studied educational interventions, and there is much evidence for the role of feedback in enhancing learning within the psychology and education literature (Wisniewski, Zierer, and Hattie 2019). A number of reviews and guides on giving feedback are available (for example, see Veloski et al. 2006; Wulf, Shea, and Lewthwaite 2010; Tavakol et al. 2022); however, there are specific considerations when giving feedback about the development of clinical skills from a cognitive science perspective. ...
Article
Students have to develop a wide variety of clinical skills, from cannulation to advanced life support, prior to entering clinical practice. An important challenge for health professions' educators is the implementation of strategies for effectively supporting students in their acquisition of different types of clinical skills while also minimizing skill decay over time. Cognitive science provides a unified approach that can inform how to maximize clinical skill acquisition and minimize skill decay. The Guide discusses the nature of expertise and mastery development, the key insights from cognitive science for clinical skill development and skill retention, and how these insights can be practically applied and integrated with current approaches used in clinical skills teaching.
Article
Background: Simulation training, a novel learning method, provides medical students with opportunities to practice managing stressful situations as if they were experiencing them in reality. Recently, there has been increased recognition of the value of simulation-based education. This study aimed to evaluate the most effective approach for providing feedback during a simulation program. Methods: In this interventional study, a total of 43 obstetrics and gynecology residents were recruited and stratified into three groups based on their residency stage. These residents participated in a simulation-based program focused on the management of post-partum hemorrhage (PPH). The program involved handling a PPH scenario, during which they received feedback either during the task (in-task; IT) or after completing the task (end-task; ET). Following the simulation, a post-test was administered, and the results were compared between the IT and ET feedback groups. Results: Demographic variables did not differ significantly between the ET and IT groups. Generally, there were no significant differences in secondary knowledge (P=0.232) or secondary performance (P=0.196) following the simulation program between the two groups. However, among second-year residents, the change between primary and secondary performance was not significant in either the ET (P=0.76) or IT (P=0.74) group, while the IT group showed a significant improvement in knowledge (P=0.04). For third-year residents, the point change in primary and secondary knowledge and performance was not statistically significant in either the ET or IT groups. Conclusion: The final knowledge and performance following simulation programs do not significantly differ between the IT and ET groups. However, second-year residents experienced an improvement in knowledge.
Article
Post-assessment psychometric reports are a vital component of the assessment cycle, ensuring that assessments are reliable, valid and fair and that appropriate pass-fail decisions are made. Students' scores can be summarised by examining frequency distributions, central tendency measures and dispersion measures. Item discrimination indices that assess the quality of items, and distractors that differentiate between students achieving or not achieving the learning outcomes, are key. Estimating individual item reliability and item validity indices can maximise test-score reliability and validity. Test accuracy can be evaluated by assessing test reliability, consistency and validity, and the standard error of measurement can be used to quantify the variation. Standard setting, even by experts, may be unreliable, and reality checks such as the Hofstee method, P values and correlation analysis can improve validity. The Rasch model of student ability and item difficulty assists in modifying assessment questions and pinpointing areas for additional instruction. We propose 12 tips to support test developers in interpreting structured psychometric reports, including the analysis and refinement of flawed items and ensuring fair assessments with accurate and defensible marks.
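A minimal sketch (synthetic data, illustrative names only) of the core indices such a psychometric report summarises: item difficulty (p values), corrected item-total discrimination, Cronbach's alpha and the standard error of measurement.

```python
# Classical item and test statistics on a synthetic binary score matrix.
import numpy as np

rng = np.random.default_rng(1)
n_students, n_items = 200, 30
ability = rng.normal(0, 1, n_students)
difficulty = rng.normal(0, 1, n_items)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
scores = (rng.random((n_students, n_items)) < p_correct).astype(float)

total = scores.sum(axis=1)
item_difficulty = scores.mean(axis=0)                 # p value per item
rest = total[:, None] - scores                        # total score excluding the item
discrimination = np.array(
    [np.corrcoef(scores[:, i], rest[:, i])[0, 1] for i in range(n_items)]
)
k = n_items
alpha = (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum()
                         / total.var(ddof=1))         # Cronbach's alpha
sem = total.std(ddof=1) * np.sqrt(1 - alpha)          # standard error of measurement

print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
print("items flagged for review (discrimination < 0.2):",
      np.where(discrimination < 0.2)[0])
```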
Book
Human ratings are subject to various forms of error and bias. Since the early days of performance assessment, this problem has been sizeable and persistent. For example, expert raters evaluating the quality of an essay, an oral communication, or a work sample often come up with different ratings for the same performance. In cases like this, assessment outcomes largely depend upon which raters happen to provide the rating, posing a threat to the validity and fairness of the assessment. This book presents a psychometric approach that establishes a coherent framework for drawing reliable, valid, and fair inferences from rater-mediated assessments, thus answering the problem of inevitably fallible human ratings: many-facet Rasch measurement (MFRM). Throughout the book, sample data from a writing performance assessment illustrate key concepts, theoretical foundations, and analytic procedures, stimulating the readers to adopt the MFRM approach in their current or future professional context.
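For orientation, one common rating-scale formulation of the many-facet Rasch model is sketched below; the notation is generic rather than taken from the book.

```latex
% A common rating-scale form of the many-facet Rasch model, for examinee n,
% item/task i, rater j and rating category k:
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
\]
% \theta_n: examinee proficiency; \delta_i: item/task difficulty;
% \alpha_j: rater severity; \tau_k: difficulty of category k relative to k-1.
```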
Article
Objectives Sources of bias, such as the examiners, domains and stations, can influence the student marks in objective structured clinical examination (OSCE). This study describes the extent to which the facets modelled in an OSCE can contribute to scoring variance and how they fit into a Many-Facet Rasch Model (MFRM) of OSCE performance. A further objective is to identify the functioning of the rating scale used. Design A non-experimental cross-sectional design. Participants and settings An MFRM was used to identify sources of error (e.g. examiner, domain and station), which may influence the student outcome. A 16-station OSCE was conducted for 329 final year medical students. Domain-based marking was applied, each station using a sample from eight defined domains across the whole OSCE. The domains were defined as follows: communication skills, professionalism, information gathering, information giving, clinical interpretation, procedure, diagnosis and management. The domains in each station were weighted to ensure proper attention to the construct of the individual station. Four facets were assessed: students, examiners, domains and stations. Results The results suggest that the OSCE data fit the model, confirming that an MFRM approach was appropriate to use. The variable map allows a comparison with and between the facets of students, examiners, domains and stations and the 5-point score for each domain with each station as they are calibrated to the same scale. Fit statistics showed that the domains map well to the performance of the examiners. No statistically significant difference between examiner sensitivity (3.85 logits) was found. However, the results did suggest examiners were lenient and that some behaved inconsistently. The results also suggest that the functioning of response categories on the 5-point rating scale needs further examination and optimisation. Conclusions The results of the study have important implications for examiner monitoring and training activities, to aid assessment improvement.
Article
The General Medical Council (GMC) in the UK has emphasized the importance of internal consistency for students' assessment scores in medical education. Typically, Cronbach's alpha is reported by medical educators as an index of internal consistency. Medical educators mark assessment questions and then estimate statistics that quantify the consistency (and, if possible, the accuracy and appropriateness) of the assessment scores in order to improve subsequent assessments. The basic reason for doing so is the recognition that student marks are affected by various types of measurement error, which always exist in student marks and which reduce the accuracy of measurement. The magnitude of measurement error is incorporated in the concept of reliability of test scores, where reliability itself quantifies the consistency of scores over replications of a measurement procedure. Therefore, medical educators need to identify and estimate sources of measurement error in order to improve students' assessment. Under the Classical Test Theory (CTT) model, …
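The standard classical test theory relationships referred to here can be written as follows (notation is generic):

```latex
% Classical test theory: observed score, reliability, coefficient alpha and
% the standard error of measurement.
\[
  X = T + E, \qquad
  \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}, \qquad
  \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right), \qquad
  \mathrm{SEM} = \sigma_X\sqrt{1-\rho_{XX'}}
\]
% X: observed score; T: true score; E: error; k: number of items;
% \sigma_i^2: variance of item i; \sigma_X^2: total-score variance.
```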
Article
Typical validation studies on standard setting models, most notably the Angoff and modified Angoff models, have ignored construct development, a critical aspect associated with all conceptualizations of measurement processes. Stone compared the Angoff and objective standard setting (OSS) models and found that Angoff failed to define a legitimate and stable construct. The present study replicates and expands this work by presenting results from a 5-year investigation of both models, using two different approaches (equating and annual standard setting) within two testing settings (health care and education). The results support the original conclusion that although the OSS model demonstrates effective construct development, the Angoff approach appears random and lacking in clarity. Implications for creating meaningful and valid standards are discussed.
Article
This article reviews the empirical literature on 9 topics about the modified Angoff standard-setting method that have been studied repeatedly in the literature, while taking into consideration the methodological warrant for the findings on the topics. It concludes that we can be reasonably confident about selecting the appropriate number of judges and about the extent to which judges' modified Angoff item estimates are ranked similarly to item difficulty. Item estimates probably deviate inconsistently from difficulty values too frequently, although this deficiency in the method might be remedied somewhat by the effects of judge activities between standard-setting rounds. More studies need to be done about the appropriate level of judge expertise and about the process of describing the performance level at which the cutscores are to be set. The warrants for the findings of much of the empirical modified Angoff literature are often insufficient for making firm conclusions, and many uncertainties about the method remain.
Article
Classical Test Theory has traditionally been used to carry out post-examination analysis of objective test data. It uses descriptive methods and aggregated data to help identify sources of measurement error and unreliability in a test, in order to minimise them. Item Response Theory (IRT), and in particular Rasch analysis, uses more complex methods to produce outputs that not only identify sources of measurement error and unreliability, but also identify the way item difficulty interacts with student ability. In this Guide, a knowledge-based test is analysed by the Rasch method to demonstrate the variety of useful outputs that can be provided. IRT provides a much deeper analysis giving a range of information on the behaviour of individual test items and individual students as well as the underlying constructs being examined. Graphical displays can be used to evaluate the ease or difficulty of items across the student ability range as well as providing a visual method for judging how well the difficulty of items on a test match student ability. By displaying data in this way, problem test items are more easily identified and modified allowing medical educators to iteratively move towards the 'perfect' test in which the distribution of item difficulty is mirrored by the distribution of student ability.
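For reference, the dichotomous Rasch model on which such an analysis rests can be written as below; the notation is generic.

```latex
% Dichotomous Rasch model: probability that student n answers item i correctly.
\[
  P(X_{ni}=1 \mid \theta_n, b_i) = \frac{e^{\,\theta_n - b_i}}{1 + e^{\,\theta_n - b_i}}
\]
% \theta_n: ability of student n; b_i: difficulty of item i. Items located far
% from the bulk of the ability distribution contribute little information,
% which is what the item-person ("Wright") map makes visible.
```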
Article
Numerous studies have compared the Angoff standard-setting procedure to other standard-setting methods, but relatively few studies have evaluated the procedure based on internal criteria. This study uses a generalizability theory framework to evaluate the stability of the estimated cut score. To provide a measure of internal consistency, this study also compares the estimated proportion correct scores resulting from the Angoff exercise to empirical conditional proportion correct scores. In this research, judges made independent estimates of the proportion of minimally proficient candidates that would be expected to answer each item correctly; they then discussed discrepancies and revised their estimates. Discussion of discrepancies decreased the variance components associated with the judge and judge-by-item effects, indicating increased agreement between judges, but it did not improve the correspondence between the judgments and the empirical proportion correct estimates. The judges then were given examinee performance information for a subset of the items. Subsequent ratings showed a substantial increase in correspondence with the empirical conditional proportion correct estimates. Particular attention is given to examining the discrepancy between the judgments and empirical proportion correct estimates as a function of item difficulty.
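The following sketch (synthetic data, not the study's own analysis) illustrates the kind of internal check described: deriving the Angoff cut score from judges' item estimates and comparing those estimates with empirical conditional proportion-correct values for a borderline group.

```python
# Hypothetical comparison of Angoff judge estimates with empirical
# borderline-group p values, plus the resulting cut score.
import numpy as np

rng = np.random.default_rng(2)
n_judges, n_items = 8, 40
# empirical proportion correct among examinees near the cut ("borderline" group)
borderline_p = np.clip(rng.normal(0.6, 0.15, n_items), 0.05, 0.95)
# judges' estimates: noisy, slightly optimistic versions of the empirical values
estimates = np.clip(borderline_p + rng.normal(0.05, 0.10, (n_judges, n_items)), 0, 1)

item_means = estimates.mean(axis=0)      # mean judge estimate per item
cut_score = item_means.sum()             # Angoff cut score in raw marks

r = np.corrcoef(item_means, borderline_p)[0, 1]
bias = (item_means - borderline_p).mean()
print(f"cut score = {cut_score:.1f} / {n_items}")
print(f"correlation with empirical p values = {r:.2f}, mean overestimate = {bias:+.3f}")
```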
Article
The most ubiquitous method of performance appraisal is rating. Ratings, however, have been shown to be prone to various types of systematic and random error. Studies relating to performance rating are reviewed under the following headings: roles, context, vehicle, process, and results. In general, cognitive characteristics of raters seem to hold the most promise for increased understanding of the rating process. A process model of performance rating is derived from the literature. Research in the areas of implicit personality theory and variance partitioning is combined with the process model to suggest a unified approach to understanding performance judgments in applied settings.
Article
This paper was prepared under a grant from the Carnegie Corporation to the National Assessment of Educational Progress. Fritz Mosher initiated the project, and through his involvement and conversations, saw that it was taken seriously. The Analysis Advisory Committee of NAEP, under Fred Mosteller's chairmanship, proved to be a rigorous testing ground for the paper. Mary Lee Smith contributed at every stage of its preparation: reading the literature, developing the ideas, outlining and editing the final product. Nancy W. Burton helped in developing the topic. Laura A. Driscoll and Gregory C. Camilli deserve thanks for assistance in literature search and data analysis. Conversation among Glass, Smith, Burton, and Robert Glaser on a day in June 1976 got things started; we are much indebted to Glaser for helping set things off in a direction that now pleases us.
Article
The purpose of this Guide is to provide both logical and empirical evidence for medical teachers to improve their objective tests by appropriate interpretation of post-examination analysis. This requires a description and explanation of some basic statistical and psychometric concepts derived from both Classical Test Theory (CTT) and Item Response Theory (IRT) such as: descriptive statistics, explanatory and confirmatory factor analysis, Generalisability Theory and Rasch modelling. CTT is concerned with the overall reliability of a test whereas IRT can be used to identify the behaviour of individual test items and how they interact with individual student abilities. We have provided the reader with practical examples clarifying the use of these frameworks in test development and for research purposes.
Article
One of the key goals of assessment in medical education is the minimisation of all errors influencing a test in order to produce an observed score which approaches a learner's 'true' score, as reliably and validly as possible. In order to achieve this, assessors need to be aware of the potential biases that can influence all components of the assessment cycle from question creation to the interpretation of exam scores. This Guide describes and explains the processes whereby objective examination results can be analysed to improve the validity and reliability of assessments in medical education. We cover the interpretation of measures of central tendency, measures of variability and standard scores. We describe how to calculate the item-difficulty index and item-discrimination index in examination tests using different statistical procedures. This is followed by an overview of reliability estimates. The post-examination analytical methods described in this guide enable medical educators to construct reliable and valid achievement tests. They also enable medical educators to develop question banks using the collection of appropriate questions from existing examination tests in order to use computerised adaptive testing.
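As a small illustration (synthetic data; the 27% upper-lower split is an assumed convention) of two of the quantities mentioned, the snippet below computes standard (z and T) scores and the upper-lower item-discrimination index D.

```python
# Standard scores and the upper-lower discrimination index on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(0, 1, 150)
p = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(0, 1, 20))))
scores = (rng.random((150, 20)) < p).astype(float)
total = scores.sum(axis=1)

z = (total - total.mean()) / total.std(ddof=1)   # z scores
t_scores = 50 + 10 * z                           # T scores

n_group = int(round(0.27 * len(total)))          # top and bottom 27% groups
order = np.argsort(total)
lower, upper = order[:n_group], order[-n_group:]
d_index = scores[upper].mean(axis=0) - scores[lower].mean(axis=0)

print("first five T scores:", np.round(t_scores[:5], 1))
print("items with discrimination index D < 0.2:", np.where(d_index < 0.2)[0])
```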
Article
As a medical educator, you may be directly or indirectly involved in the quality of assessments. Measurement has a substantial role in developing the quality of assessment questions and student learning. The information provided by psychometric data can improve pedagogical issues in medical education. Through measurement we are able to assess the learning experiences of students. Standard setting plays an important role in assessing the performance quality of students as doctors in the future. Presentation of performance data for standard setters may contribute towards developing a credible and defensible pass mark. Validity and reliability of test scores are the most important factors for developing quality assessment questions. Analysis of the answers to individual questions provides useful feedback for assessment leads to improve the quality of each question, and hence make students’ marks fair in terms of diversity and ethnicity. Item Characteristic Curves (ICC) can send signals to assessment leads to improve the quality of individual questions.
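A brief, hypothetical example of the kind of signal an Item Characteristic Curve can give: the empirical proportion correct on one item across ability (rest-score) quintiles, where a flat or non-monotonic pattern would prompt the assessment lead to review the question.

```python
# Empirical ICC: proportion correct on one item by rest-score quintile.
import numpy as np

rng = np.random.default_rng(4)
ability = rng.normal(0, 1, 300)
scores = (rng.random((300, 25)) <
          1 / (1 + np.exp(-(ability[:, None] - rng.normal(0, 1, 25))))).astype(float)

item = 0
rest = scores.sum(axis=1) - scores[:, item]          # total excluding the item
quintile = np.digitize(rest, np.quantile(rest, [0.2, 0.4, 0.6, 0.8]))
empirical_icc = [scores[quintile == q, item].mean() for q in range(5)]
print("proportion correct by ability quintile:", np.round(empirical_icc, 2))
```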
Article
Context: There is a growing body of research investigating assessor judgments in complex performance environments such as OSCE examinations. Post hoc analysis can be employed to identify some elements of "unwanted" assessor variance. However, the impact of individual, apparently "extreme" assessors on OSCE quality, assessment outcomes and pass/fail decisions has not been previously explored. This paper uses a range of "case studies" as examples to illustrate the impact that "extreme" examiners can have in OSCEs, and gives pragmatic suggestions for successfully alleviating problems. Method and results: We used real OSCE assessment data from a number of examinations where at station level, a single examiner assesses student performance using a global grade and a key features checklist. Three exemplar case studies where initial post hoc analysis has indicated problematic individual assessor behavior are considered and discussed in detail, highlighting both the impact of individual examiner behavior and station design on subsequent judgments. Conclusions: In complex assessment environments, institutions have a duty to maximize the defensibility, quality and validity of the assessment process. A key element of this involves critical analysis, through a range of approaches, of assessor judgments. However, care must be taken when assuming that apparent aberrant examiner behavior is automatically just that.
Article
Background: When measuring assessment quality, increasing focus is placed on the value of station-level metrics in the detection and remediation of problems in the assessment. Aims: This article investigates how disparity between checklist scores and global grades in an Objective Structured Clinical Examination (OSCE) can provide powerful new insights at the station level whenever such disparities occur and develops metrics to indicate when this is a problem. Method: This retrospective study uses OSCE data from multiple examinations to investigate the extent to which these new measurements of disparity complement existing station-level metrics. Results: In stations where existing metrics are poor, the new metrics provide greater understanding of the underlying sources of error. Equally importantly, stations of apparently satisfactory “quality” based on traditional metrics are shown to sometimes have problems of their own – with a tendency for checklist score “performance” to be judged stronger than would be expected from the global grades awarded. Conclusions: There is an ongoing tension in OSCE assessment between global holistic judgements and the necessarily more reductionist, but arguably more objective, checklist scores. This article develops methods to quantify the disparity between these judgements and illustrates how such analyses can inform ongoing improvement in station quality.
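One simple, hypothetical way to look at station-level disparity of this kind (not the authors' own metric) is to compare mean checklist scores across global grade bands and their correlation; non-monotonic grade means or a weak correlation would prompt review of the station's checklist content and examiner guidance.

```python
# Station-level comparison of checklist scores and global grades (synthetic).
import numpy as np

rng = np.random.default_rng(5)
grades = rng.integers(1, 6, 200)                 # 1 = clear fail ... 5 = excellent
checklist = np.clip(55 + 6 * grades + rng.normal(0, 8, 200), 0, 100)

by_grade = {g: checklist[grades == g].mean() for g in range(1, 6)}
r = np.corrcoef(grades, checklist)[0, 1]
print("mean checklist score by grade:", {g: round(m, 1) for g, m in by_grade.items()})
print(f"grade-checklist correlation r = {r:.2f}")
```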
Article
Laboratory studies have shown that performance assessment judgments can be biased by "contrast effects." Assessors' scores become more positive, for example, when the assessed performance is preceded by relatively weak candidates. The authors queried whether this effect occurs in real, high-stakes performance assessments despite increased formality and behavioral descriptors. Data were obtained for the 2011 United Kingdom Foundation Programme clinical assessment and the 2008 University of Alberta Multiple Mini Interview. Candidate scores were compared with scores for immediately preceding candidates and progressively distant candidates. In addition, average scores for the preceding three candidates were calculated. Relationships between these variables were examined using linear regression. Negative relationships were observed between index scores and both immediately preceding and recent scores for all exam formats. Relationships were greater between index scores and the average of the three preceding scores. These effects persisted even when examiners had judged several performances, explaining up to 11% of observed variance on some occasions. These findings suggest that contrast effects do influence examiner judgments in high-stakes performance-based assessments. Although the observed effect was smaller than observed in experimentally controlled laboratory studies, this is to be expected given that real-world data lessen the strength of the intervention by virtue of less distinct differences between candidates. Although it is possible that the format of circuital exams reduces examiners' susceptibility to these influences, the finding of a persistent effect after examiners had judged several candidates suggests that the potential influence on candidate scores should not be ignored.
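A sketch of the analysis described, on simulated data: each candidate's score is regressed on the mean of the three immediately preceding scores, with a negative slope consistent with a contrast effect. All numbers are invented.

```python
# Contrast-effect check: regress each score on the mean of the preceding three.
import numpy as np

rng = np.random.default_rng(6)
true_scores = rng.normal(70, 8, 500)
observed = true_scores.copy()
for t in range(3, len(observed)):
    prev_mean = observed[t - 3:t].mean()
    # simulate a small contrast effect plus rating noise
    observed[t] = true_scores[t] - 0.15 * (prev_mean - 70) + rng.normal(0, 2)

y = observed[3:]
x = np.column_stack([np.ones(len(y)),
                     [observed[t - 3:t].mean() for t in range(3, len(observed))]])
intercept, slope = np.linalg.lstsq(x, y, rcond=None)[0]
print(f"slope on mean of preceding 3 scores = {slope:.3f}")  # negative => contrast effect
```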
Article
The present study updates Woehr and Huffcutt's (1994) rater training meta-analysis and demonstrates that frame-of-reference (FOR) training is an effective method of improving rating accuracy. The current meta-analysis includes over four times as many studies as included in the Woehr and Huffcutt meta-analysis and also provides a snapshot of current rater training studies. The present meta-analysis also extends the previous meta-analysis by showing that not all operationalizations of accuracy are equally improved by FOR training; Borman's differential accuracy appears to be the most improved by FOR training, along with behavioural accuracy, which provides a snapshot into the cognitive processes of the raters. We also investigate the extent to which FOR training protocols differ, the implications of protocol differences, and if the criteria of interest to FOR researchers have changed over time.
Article
Evidence of stable standard setting results over panels or occasions is an important part of the validity argument for an established cut score. Unfortunately, due to the high cost of convening multiple panels of content experts, standards often are based on the recommendation from a single panel of judges. This approach implicitly assumes that the variability across panels will be modest, but little evidence is available to support this assertion. This article examines the stability of Angoff standard setting results across panels. Data were collected for six independent standard setting exercises, with three panels participating in each exercise. The results show that although in some cases the panel effect is negligible, for four of the six data sets the panel facet represented a large portion of the overall error variance. Ignoring the often hidden panel/occasion facet can result in artificially optimistic estimates of the cut score stability. Results based on a single panel should not be viewed as a reasonable estimate of the results that would be found over multiple panels. Instead, the variability seen in a single panel can best be viewed as a lower bound of the expected variability when the exercise is replicated.
Article
This AMEE Guide offers an overview of methods used in determining passing scores for performance-based assessments. A consideration of various assessment purposes will provide context for discussion of standard setting methods, followed by a description of different types of standards that are typically set in health professions education. A step-by-step guide to the standard setting process will be presented. The Guide includes detailed explanations and examples of standard setting methods, and each section presents examples of research done using the method with performance-based assessments in health professions education. It is intended for use by those who are responsible for determining passing scores on tests and need a resource explaining methods for setting passing scores. The Guide contains a discussion of reasons for assessment, defines standards, and presents standard setting methods that have been researched with performance-based tests. The first section of the Guide addresses types of standards that are set. The next section provides guidance on preparing for a standard setting study. The following sections include conducting the meeting, selecting a method, implementing the passing score, and maintaining the standard. The Guide will support efforts to determine passing scores that are based on research, matched to the assessment purpose, and reproducible.
Article
This review identifies 38 methods for either setting standards or adjusting them based on an analysis of classification error rates. A trilevel classification scheme is used to categorize the methods, and 10 criteria of technical adequacy and practicability are proposed to evaluate them. The salient characteristics of 23 continuum standard-setting methods are described and evaluated in the form of a “consumer’s guide.” Specific recommendations are offered for classroom teachers, educational certification test specialists, licensing and certification boards, and test publishers and independent test contractors.
Article
A critical aspect of the Angoff, and related, methods of standard setting is the conceptualization of the minimally competent or borderline examinees. The purpose of this study was to investigate the relations between the Angoff ratings (minimum passing levels; MPLs) and the actual p values for a group of borderline examinees. The correlation between the Angoff MPLs and the actual p values for the borderline group was .55, about the same size as the correlation between predicted and actual p values for the total group of examinees (.51). The judges tended to overestimate values for the total group: the average difference between predicted and actual p values was .12, and 61% of the differences were categorized as overestimates. In contrast, 61% of the differences between MPLs and actual p values for the borderline group were considered to be accurate.
Article
Nedelsky (1954) and Angoff (1971) have suggested procedures for establishing a cutting score based on raters' judgments about the likely performance of minimally competent examinees on each item in a test. In this paper generalizability theory is used to characterize and quantify expected variance in cutting scores resulting from each procedure. Experimental test data are used to illustrate this approach and to compare the two procedures. Consideration is also given to the impact of rater disagreement on some issues of measurement reliability or dependability. Results suggest that the differences between the Nedelsky and Angoff procedures may be of greater consequence than their apparent similarities. In particular, the restricted nature of the Nedelsky (inferred) probability scale may constitute a basis for seriously questioning the applicability of this procedure in certain contexts.
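In outline, and with generic notation rather than the paper's own, the error variance of a cut score taken as the mean rating over judges and the items of a fixed test can be expressed as follows.

```latex
% Approximate error variance of an Angoff- or Nedelsky-type cut score based on
% n_j judges rating the n_i items of a fixed test (generic G-theory notation):
\[
  \hat{\sigma}^2_{\text{cut}}
  \approx \frac{\sigma^2_{j}}{n_j} + \frac{\sigma^2_{ji,e}}{n_j\, n_i}
\]
% \sigma^2_{j}: judge (severity/leniency) variance component;
% \sigma^2_{ji,e}: judge-by-item interaction confounded with residual error.
% Adding judges shrinks both terms; adding items shrinks only the second.
```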
Article
Purpose Earlier studies of absolute standard setting procedures for objective structured clinical examinations (OSCEs) show inconsistent results. This study compared a rational and an empirical standard setting procedure. Reliability and credibility were examined first. The impact of a reality check was then established. Methods The OSCE included 16 stations and was taken by trainees in their final year of postgraduate training in general practice and experienced general practitioners. A modified Angoff (independent judgements, no group discussion) with and without a reality check was used as a rational procedure. A method related to the borderline group procedure, the borderline regression (BR) method, was used as an empirical procedure. Reliability was assessed using generalisability theory. Credibility was assessed by comparing pass rates and by relating the passing scores to test difficulty. Results The passing scores were 73.4% for the Angoff procedure without reality check (Angoff I), 66.0% for the Angoff procedure with reality check (Angoff II) and 57.6% for the BR method. The reliabilities (expressed as root mean square errors) were 2.1% for Angoffs I and II, and 0.6% for the BR method. The pass rates of the trainees and GPs were 19% and 9% for Angoff I, 66% and 46% for Angoff II, and 95% and 80% for the BR method, respectively. The correlation between test difficulty and passing score was 0.69 for Angoff I, 0.88 for Angoff II and 0.86 for the BR method. Conclusion The BR method provides a more credible and reliable standard for an OSCE than a modified Angoff procedure. A reality check improves the credibility of the Angoff procedure but does not improve its reliability.
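A minimal sketch of the borderline regression idea (synthetic data; the grade coding is an assumption): regress station checklist scores on global grades and read off the fitted score at the borderline grade.

```python
# Borderline regression (BR) sketch for one OSCE station, synthetic data.
import numpy as np

rng = np.random.default_rng(7)
grades = rng.integers(1, 6, 180)                 # 1 = fail ... 5 = excellent; 2 = borderline (assumed)
checklist = np.clip(40 + 9 * grades + rng.normal(0, 7, 180), 0, 100)

slope, intercept = np.polyfit(grades, checklist, 1)   # linear fit: checklist ~ grade
borderline_grade = 2
pass_score = intercept + slope * borderline_grade
print(f"station passing score (BR method) = {pass_score:.1f}")
# The exam-level pass mark is then typically the sum (or mean) of the
# station-level passing scores.
```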
Article
Licensure, credentialling and academic institutions are seeking new innovative approaches to the assessment of professional competence. Central to these recent initiatives is the need to determine standards of performance, which separate the competent from the non-competent candidate. Setting standards for performance assessment is a relatively new area of study. Consequently, there is no one recommended approach to setting standards. The goal of this guide is to familiarize the reader with the framework, principles, key concepts and practical considerations of standard setting approaches and to enable the reader to make 'educated' choices in selecting the most appropriate standard setting approach for their testing needs.
Article
Presents corrected tables for calculation errors that appeared in Tables 1 and 2 of a previous article by the present author (see record 1983-22403-001). It is noted that the errors were unsystematic and small and did not affect the conclusions drawn from the empirical example in the article.
Article
Issues that appear to arise exclusively when the tools of educational measurement are used to certify students' competence: definitions of competency testing; applications of competency testing in the United States; setting standards of competence; opportunity to learn; and competency testing and the law.
Article
Although the Angoff procedure is among the most widely used standard setting procedures for tests comprising multiple-choice items, research has shown that subject matter experts have considerable difficulty accurately making the required judgments in the absence of examinee performance data. Some authors have viewed the need to provide performance data as a fatal flaw for the procedure; others have considered it appropriate for experts to integrate performance data into their judgments but have been concerned that experts may rely too heavily on the data. There have, however, been relatively few studies examining how experts use the data. This article reports on two studies that examine how experts modify their judgments after reviewing data. In both studies, data for some items were accurate and data for other items had been manipulated. Judges in both studies substantially modified their judgments whether the data were accurate or not.
Article
Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values.
Article
Is training judges beyond initial orientation required? How can we help judges apply their conceptualization of minimal competence to individual items?
Article
Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.
Article
The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the “borderline group,” but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed.
Article
There are many threats to validity in high-stakes achievement testing. One major threat is construct-irrelevant variance (CIV). This article defines CIV in the context of the contemporary, unitary view of validity and presents logical arguments, hypotheses, and documentation for a variety of CIV sources that commonly threaten interpretations of test scores. A more thorough study of CIV is recommended.
Article
As great emphasis is rightly placed upon the importance of assessment to judge the quality of our future healthcare professionals, it is appropriate not only to choose the most appropriate assessment method but also to continually monitor the quality of the tests themselves, in the hope that we may continually improve the process. This article stresses the importance of quality control mechanisms in the exam cycle and briefly outlines some of the key psychometric concepts including reliability measures, factor analysis, generalisability theory and item response theory. The importance of such analyses for the standard setting procedures is emphasised. This article also accompanies two new AMEE Guides in Medical Education (Tavakol M, Dennick R. Post-examination Analysis of Objective Tests: AMEE Guide No. 54 and Tavakol M, Dennick R. 2012. Post examination analysis of objective test data: Monitoring and improving the quality of high stakes examinations: AMEE Guide No. 66) which provide the reader with practical examples of analysis and interpretation, in order to help develop valid and reliable tests.
Article
1965 edition published under the title "Measuring educational achievement". Includes bibliography and index.
Article
On 26 October 1974, 3356 diplomates of the American Board of Internal Medicine (ABIM) took a 1-day written examination for recertification consisting of multiple-choice, matching, and true-false questions derived from the American College of Physicians' Medical Knowledge Self-Assessment Program III and the ABIM Certifying Examination pool. The passing score was set by using a normative standard applied to a reference group of internists practicing general internal medicine who had had 2 or more years of residency training completed between the years 1949 and 1958. The passing score represented approximately 63% correct answers. The failure rate for the total number of examinees was 4.3%. Mean score of examinees showed an inverse relation with age but relatively slight differences when analyzed according to the degree of subspecialization, practice setting, hospital affiliation, or size of patient community.
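As a hypothetical illustration of a norm-referenced standard of this kind (all numbers invented and the percentile chosen arbitrarily, not the ABIM's actual rule), a pass mark can be fixed from the score distribution of a reference group and then applied to the full examinee cohort.

```python
# Norm-referenced pass mark derived from a reference group (synthetic data).
import numpy as np

rng = np.random.default_rng(8)
reference_scores = rng.normal(63, 8, 400)       # percent-correct scores of the reference group
pass_mark = np.percentile(reference_scores, 5)  # e.g. 5th percentile of the reference group (assumed)
examinees = rng.normal(65, 9, 3356)
fail_rate = (examinees < pass_mark).mean()
print(f"pass mark = {pass_mark:.1f}%  fail rate = {fail_rate:.1%}")
```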
Article
Passing scores for licensure and certification tests are justified by showing that decisions based on the passing score achieve the purposes of the credentialing program while avoiding any serious negative consequences. The standard should be high enough to provide adequate protection for the public, and not so high as to unnecessarily restrict the supply of qualified practitioners or to exclude competent candidates from practicing. This paper begins by examining the intended outcomes of licensure and certification programs and by outlining the interpretive argument that is typically used for written credentialing examinations. Criteria are then developed for evaluating standard-setting methods in terms of how well they serve the goals of protecting the public, maintaining an adequate supply of practitioners, and protecting the rights of candidates. Finally, the criteria are used to evaluate the Angoff method. This analysis identifies two potential sources of bias in the Angoff method, and suggests ways to control these weaknesses.
Article
Construct-irrelevant variance (CIV) - the erroneous inflation or deflation of test scores due to certain types of uncontrolled or systematic measurement error - and construct underrepresentation (CUR) - the under-sampling of the achievement domain - are discussed as threats to the meaningful interpretation of scores from objective tests developed for local medical education use. Several sources of CIV and CUR are discussed and remedies are suggested. Test score inflation or deflation, due to the systematic measurement error introduced by CIV, may result from poorly crafted test questions, insecure test questions and other types of test irregularities, testwiseness, guessing, and test item bias. Using indefensible passing standards can interact with test scores to produce CIV. Sources of content underrepresentation are associated with tests that are too short to support legitimate inferences to the domain and which are composed of trivial questions written at low-levels of the cognitive domain. "Teaching to the test" is another frequent contributor to CUR in examinations used in medical education. Most sources of CIV and CUR can be controlled or eliminated from the tests used at all levels of medical education, given proper training and support of the faculty who create these important examinations.