ABSTRACT: To mitigate security concerns and unfair score gains, credentialing programs routinely administer new test material to examinees retesting after an initial failing attempt. Counterintuitively, a small but growing body of recent research suggests that repeating the identical form does not create an unfair advantage. This study builds upon and extends this research by investigating changes in responses to specific items encountered on both the first and repeat attempts. Results indicate that score gains for repeat examinees who were assigned an identical form did not differ from those of repeat examinees who received a different, but parallel, form. Analyses of responses to individual items answered incorrectly on the initial attempt found that examinees selected the same incorrect option 68% of the time on their second attempt, suggesting repeaters are misinformed rather than uninformed. Implications for feedback, remediation, and retesting policies are discussed.
Article · Dec 2014 · Educational Measurement Issues and Practice
ABSTRACT: This paper illustrates the utility of practice analysis for informing curriculum and assessment design in professions education. The paper accomplishes three objectives: (1) Introduces four healthcare utilization surveys administered by the National Center for Health Statistics (NCHS); (2) Summarizes selected results for one of these surveys, the National Hospital Ambulatory Medical Care Survey – Emergency Department (NHAMCS-ED); and (3) Illustrates how the data can inform decisions regarding the design of curricula and assessments in professions education. The survey tracks over 129 million patient visits to various healthcare facilities, documenting the health problems prompting those visits, the diagnostic studies performed, and the types of services provided. While the specific examples are relevant to nursing, medicine, and other healthcare fields, the general principles apply to other professions.
ABSTRACT: Purpose:
Previous studies on standardized patient (SP) exams reported score gains both across attempts when examinees failed and retook the exam and over multiple SP encounters within a single exam session. The authors analyzed the within-session score gains of examinees who repeated the United States Medical Licensing Examination Step 2 Clinical Skills to answer two questions: How much do scores increase within a session? Can the pattern of increasing first-attempt scores account for across-session score gains?
Data included encounter-level scores for 2,165 U.S. and Canadian medical students and graduates who took Step 2 Clinical Skills twice between April 1, 2005 and December 31, 2010. The authors modeled examinees' score patterns using smoothing and regression techniques and applied statistical tests to determine whether the patterns were the same or different across attempts. In addition, they tested whether any across-session score gains could be explained by the first-attempt within-session score trajectory.
For the first and second attempts, the authors attributed examinees' within-session score gains to a pattern of score increases over the first three to six SP encounters followed by a leveling off. Model predictions revealed that the authors could not attribute the across-session score gains to the first-attempt within-session score gains.
The within-session score gains over the first three to six SP encounters of both attempts indicate that there is a temporary "warm-up" effect on performance that "resets" between attempts. Across-session gains are not due to this warm-up effect and likely reflect true improvement in performance.
Article · Mar 2013 · Academic medicine: journal of the Association of American Medical Colleges
ABSTRACT: Although a few studies report sizable score gains for examinees who repeat performance-based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single-take examinees and 4,030 repeat examinees who completed a 6-hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication-interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single-take and multiple-take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low-scoring examinees regardless of retest status. In addition, multiple-take examinees exhibited less score consistency across the skill domains on their first attempt, but their scores became more consistent on their second attempt. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple-take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single-take examinees. The findings support the validity of inferences based on scores from the second attempt.
Article · Dec 2012 · Journal of Educational Measurement
ABSTRACT: Item-level information, such as difficulty and discrimination, is invaluable to test assembly, equating, and scoring practices. Estimating these parameters within the context of large-scale performance assessments is often hindered by the use of unbalanced designs for assigning examinees to tasks and raters because such designs result in very sparse data matrices. This article addresses some of these issues using a multistage confirmatory factor analytic approach. The approach is illustrated using data from a performance test in medicine for which examinees encounter multiple patients with medical problems (tasks), with each problem portrayed by a different trained patient (rater). A series of models was fit to rating data (1) to obtain alternative task difficulty and discrimination parameters and (2) to evaluate the observed improvement in the goodness of model fit due to accounting for rater and test site effects. The results suggest that the availability of alternative task parameter estimates can be useful in practice for making decisions related to task banking, rater training, and test assembly.
Article · Jan 2012 · Applied Measurement in Education
ABSTRACT: Examinees who initially fail and later repeat an SP-based clinical skills exam typically exhibit large score gains on their second attempt, suggesting the possibility that examinees were not well measured on one of those attempts. This study evaluates score precision for examinees who repeated an SP-based clinical skills test administered as part of the US Medical Licensing Examination sequence. Generalizability theory was used as the basis for computing conditional standard errors of measurement (SEM) for individual examinees. Conditional SEMs were computed for approximately 60,000 single-take examinees and 5,000 repeat examinees who completed the Step 2 Clinical Skills Examination® between 2007 and 2009. The study focused exclusively on ratings of communication and interpersonal skills. Conditional SEMs for single-take and repeat examinees were nearly indistinguishable across most of the score scale. US graduates and IMGs were measured with equal levels of precision at all score levels, as were examinees with differing levels of skill speaking English. There was no evidence that examinees with the largest score changes were measured poorly on either their first or second attempt. The large score increases for repeat examinees on this SP-based exam probably cannot be attributed to unexpectedly large errors of measurement.
Article · Oct 2011 · Advances in Health Sciences Education
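As a rough illustration of what a conditional SEM for an individual examinee looks like, the sketch below computes the standard error of each examinee's mean rating across encounters. It assumes the simplest fully crossed examinee-by-encounter design, which is not necessarily the design used in the study above; the function name `conditional_sem` and the ratings are hypothetical.

```python
import numpy as np

def conditional_sem(scores):
    """Per-examinee conditional (absolute) SEM for a crossed design.

    scores: 2-D array, one row per examinee, one column per encounter.
    Returns the standard error of each examinee's mean score, i.e. the
    within-examinee standard deviation divided by sqrt(number of encounters).
    """
    scores = np.asarray(scores, dtype=float)
    n_encounters = scores.shape[1]
    # Unbiased within-examinee standard deviation across encounters
    within_sd = scores.std(axis=1, ddof=1)
    return within_sd / np.sqrt(n_encounters)

# Hypothetical ratings: two examinees, three encounters each
ratings = np.array([[1.0, 2.0, 3.0],
                    [2.0, 2.0, 2.0]])
sem = conditional_sem(ratings)
```

An examinee whose ratings vary widely across encounters gets a larger conditional SEM than one whose ratings are stable, which is why precision can differ by score level.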
ABSTRACT: Studies completed over the past decade suggest the presence of a gap between what students learn during medical school and their clinical responsibilities as first-year residents. The purpose of this survey was to verify on a large scale the responsibilities of residents during their initial months of training.
Practice analysis surveys were mailed in September 2009 to 1,104 residency programs for distribution to an estimated 8,793 first-year residents. Surveys were returned by 3,003 residents from 672 programs; 2,523 surveys met inclusion criteria and were analyzed.
New residents performed a wide range of activities, from routine but important communications (obtain informed consent) to complex procedures (thoracentesis), often without the attending physician present or otherwise involved.
Medical school curricula and the content of competence assessments prior to residency should consider more thorough coverage of the complex knowledge and skills required early in residency.
Article · Oct 2011 · Academic medicine: journal of the Association of American Medical Colleges
ABSTRACT: Prior studies report large score gains for examinees who fail and later repeat standardized patient (SP) assessments. Although research indicates that score gains on SP exams cannot be attributed to memorizing previous cases, no studies have investigated the empirical validity of scores for repeat examinees. This report compares single-take and repeat examinees in terms of both internal (construct) validity and external (criterion-related) validity.
Data consisted of test scores for examinees who took the United States Medical Licensing Examination Step 2 Clinical Skills (CS) exam between July 16, 2007, and September 12, 2009. The sample included 12,090 examinees who completed Step 2 CS on one occasion and another 4,030 examinees who completed the exam on two occasions. The internal measures included four separately scored performance domains of the Step 2 CS examination, whereas the external measures consisted of scores on three written assessments of medical knowledge (Step 1, Step 2 clinical knowledge, and Step 3). The authors subjected the four Step 2 CS domains to confirmatory factor analysis and evaluated correlations between Step 2 CS scores and the three written assessments for single-take and repeat examinees.
The factor structure for repeat examinees on their first attempt was markedly different from the factor structure for single-take examinees, but it became more similar to that for single-take examinees by their second attempt. Scores on the second attempt correlated more highly with all three external measures.
The findings support the validity of scores for repeat examinees on their second attempt.
Article · Aug 2011 · Academic medicine: journal of the Association of American Medical Colleges
ABSTRACT: Prior research indicates that the overall reliability of performance ratings can be improved by using ordinary least squares (OLS) regression to adjust for rater effects. The present investigation extends previous work by evaluating the impact of OLS adjustment on standard errors of measurement (SEM) at specific score levels. In addition, a cross-validation (i.e., resampling) design was used to determine the extent to which any improvements in measurement precision would be realized for new samples of examinees. Conditional SEMs were largest for scores toward the low end of the score distribution and smallest for scores at the high end. Conditional SEMs for adjusted scores were consistently less than conditional SEMs for observed scores, although the reduction in error was not uniform throughout the distribution. The improvements in measurement precision held up for new samples of examinees at all score levels.
ABSTRACT: The use of standardized patients to assess communication skills is now an essential part of assessing a physician's readiness for practice. To improve the reliability of communication scores, it has become increasingly common in recent years to use statistical models to adjust ratings provided by standardized patients. This study employed ordinary least squares regression to adjust ratings, and then used generalizability theory to evaluate the impact of these adjustments on score reliability and the overall standard error of measurement. In addition, conditional standard errors of measurement were computed for both observed and adjusted scores to determine whether the improvements in measurement precision were uniform across the score distribution. Results indicated that measurement was generally less precise for communication ratings toward the lower end of the score distribution; and the improvement in measurement precision afforded by statistical modeling varied slightly across the score distribution such that the most improvement occurred in the upper-middle range of the score scale. Possible reasons for these patterns in measurement precision are discussed, as are the limitations of the statistical models used for adjusting performance ratings.
Article · Oct 2010 · Advances in Health Sciences Education
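The idea of using OLS regression to adjust ratings for rater effects can be sketched as follows. This is a minimal simulation under assumed conditions, not the model used in the two studies above: ratings are generated as examinee ability plus rater severity plus noise, the regression uses dummy variables for examinees and raters, and all names (`ols_rater_adjust`, the sample sizes, the variance values) are hypothetical.

```python
import numpy as np

def ols_rater_adjust(examinee, rater, rating):
    """Remove estimated rater severity from each rating via OLS.

    examinee, rater: integer index arrays (one entry per rating).
    rating: observed ratings.
    """
    n_e, n_r = examinee.max() + 1, rater.max() + 1
    rows = np.arange(len(rating))
    # Design matrix with one dummy column per examinee and per rater
    X = np.zeros((len(rating), n_e + n_r))
    X[rows, examinee] = 1.0
    X[rows, n_e + rater] = 1.0
    # lstsq handles the rank-deficient dummy coding (minimum-norm solution)
    coef, *_ = np.linalg.lstsq(X, rating, rcond=None)
    # Rater effects are identified only up to a constant, so center them
    rater_effect = coef[n_e:] - coef[n_e:].mean()
    return rating - rater_effect[rater]

# Simulated data (all values are assumptions for illustration)
rng = np.random.default_rng(0)
n_examinees, n_raters, n_per = 200, 20, 4
ability = rng.normal(0.0, 1.0, n_examinees)
severity = rng.normal(0.0, 1.0, n_raters)
examinee = np.repeat(np.arange(n_examinees), n_per)
rater = np.concatenate(
    [rng.choice(n_raters, size=n_per, replace=False) for _ in range(n_examinees)]
)
rating = ability[examinee] + severity[rater] + rng.normal(0.0, 0.5, len(examinee))

adjusted = ols_rater_adjust(examinee, rater, rating)

def examinee_means(values):
    # Average the ratings belonging to each examinee
    return np.bincount(examinee, weights=values) / np.bincount(examinee)

corr_observed = np.corrcoef(examinee_means(rating), ability)[0, 1]
corr_adjusted = np.corrcoef(examinee_means(adjusted), ability)[0, 1]
```

In this simulated setting the adjusted examinee means track true ability more closely than the observed means because the luck of drawing lenient or severe raters has been removed. Real rating data would not offer a known "true ability" to compare against, which is why the studies evaluate the adjustment through reliability and SEM instead.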
ABSTRACT: Years of research with high-stakes written tests indicate that although repeat examinees typically experience score gains between their first and subsequent attempts, their pass rates remain considerably lower than pass rates for first-time examinees. This outcome is consistent with expectations. Comparable studies of the performance of repeat examinees on oral examinations are lacking. The current research evaluated pass rates for more than 50,000 examinees on written and oral exams administered by six medical specialty boards for several recent years. Pass rates for first-time examinees were similar for both written and oral exams, averaging about 84% across all boards. Pass rates for repeat examinees on written exams were expectedly lower, ranging from 22% to 51%, with an average of 36%. However, pass rates for repeat examinees on oral exams were markedly higher than for written exams, ranging from 53% to 77%, with an average of 65%. Four explanations for the elevated repeat pass rates on oral exams are proposed, including an increase in examinee proficiency, construct-irrelevant variance, measurement error (score unreliability), and memorization of test content. Simulated data are used to demonstrate that roughly one third of the score increase can be explained by measurement error alone. The authors suggest that a substantial portion of the score increase can also likely be attributed to construct-irrelevant variance. Results are discussed in terms of their implications for making pass-fail decisions when retesting is allowed. The article concludes by identifying areas for future research.
Article · Sep 2010 · Evaluation & the Health Professions
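How measurement error alone can produce score gains for repeat examinees can be illustrated with a small simulation. This is a sketch under assumed conditions and not the authors' simulation: true proficiency does not change between attempts, the reliability of 0.80 is a hypothetical value, and only the 84% first-time pass rate is taken from the abstract above.

```python
import numpy as np

def simulate_retest(n=100_000, reliability=0.80, first_pass_rate=0.84, seed=0):
    """Simulate retest gains that arise purely from measurement error."""
    rng = np.random.default_rng(seed)
    true_score = rng.normal(0.0, 1.0, n)
    # Error SD chosen so that var(true) / var(observed) equals the reliability
    error_sd = np.sqrt((1.0 - reliability) / reliability)
    first = true_score + rng.normal(0.0, error_sd, n)
    # Cut score set so that the first-attempt pass rate matches the target
    cut = np.quantile(first, 1.0 - first_pass_rate)
    failed = first < cut
    # Repeaters keep the same true score but draw a fresh, independent error
    second = true_score[failed] + rng.normal(0.0, error_sd, failed.sum())
    return {
        "first_pass_rate": float((first >= cut).mean()),
        "repeat_pass_rate": float((second >= cut).mean()),
        # Mean gain for repeaters, in first-attempt SD units
        "mean_gain_sd": float((second - first[failed]).mean() / first.std()),
    }

result = simulate_retest()
```

Because examinees who fail are, on average, those whose first-attempt error was negative, their second scores regress upward even though nothing about their proficiency has changed. Lowering the assumed reliability makes this artifact larger.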
ABSTRACT: Previous research has shown that ratings of English proficiency on the United States Medical Licensing Examination Clinical Skills Examination are highly reliable. However, the score distributions for native and nonnative speakers of English are sufficiently different to suggest that reliability should be investigated separately for each group.
Generalizability theory was used to obtain reliability indices separately for native and nonnative speakers of English (N = 29,084). Conditional standard errors of measurement were also obtained for both groups to evaluate measurement precision for each group at specific score levels.
Overall indices of reliability (phi) exceeded 0.90 for both native and nonnative speakers, and both groups were measured with nearly equal precision throughout the score distribution. However, measurement precision decreased at lower levels of proficiency for all examinees.
The results of this and future studies may be helpful in understanding and minimizing sources of measurement error at particular regions of the score distribution.
Article · Oct 2009 · Academic medicine: journal of the Association of American Medical Colleges
ABSTRACT: Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.
Article · May 2009 · Educational Measurement Issues and Practice
ABSTRACT: Examinees who take credentialing tests and other types of high-stakes assessments are usually provided an opportunity to repeat the test if they are unsuccessful on initial attempts. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign an alternate form to repeat examinees. Given that the use of multiple forms presents both practical and psychometric challenges, it is important to determine if unwarranted score gains occur. Most research indicates that repeat examinees realize score gains when taking the same form twice; however, the research is far from conclusive, particularly within the context of credentialing. For the present investigations, two samples of repeat examinees were randomly assigned to receive either the same test form or a different, but parallel, form on the second occasion. Study 1 found score gains of about 0.79 SD units for 71 examinees who repeated a certification examination in computed tomography. Study 2 found gains of 0.48 SD units for 765 examinees who repeated a radiography certification examination. In both studies score gains for examinees receiving the parallel test were nearly indistinguishable from score gains for those who received the same test. Factors are identified that may influence the generalizability of these findings to other assessment contexts.
Article · May 2007 · Personnel Psychology
ABSTRACT: The purpose of a credentialing examination is to assure the public that individuals who work in an occupation or profession have met certain standards. To be consistent with this purpose, credentialing examinations must be job related, and this requirement is typically met by developing test plans based on an empirical job or practice analysis. The purpose of this module is to describe procedures for developing practice analysis surveys, with emphasis on task inventory questionnaires. Editorial guidelines for writing task statements are presented, followed by a discussion of issues related to the development of scales for rating tasks and job responsibilities. The module also offers guidelines for designing and formatting both mail-out and Internet-based questionnaires. It concludes with a brief overview of the types of data analyses useful for practice analysis questionnaires.
Article · Jun 2005 · Educational Measurement Issues and Practice
ABSTRACT: As the practice of cardiovascular interventional technology (CVIT) has evolved over the last 50 years, so has the role of radiographers employed in this specialty. In 1991, the American Registry of Radiologic Technologists (ARRT) initiated a certification program to recognize radiologic technologists practicing in CVIT. The certification program consisted of a single examination that covered all aspects of CVIT (e.g., neurologic, cardiac, genitourinary). In 2000, the ARRT conducted a study to investigate further the nature of subspecialization occurring within CVIT. A comprehensive job analysis questionnaire was developed that consisted of 137 clinical activities organized into 19 general domains of practice. The questionnaire was completed by a national sample of 848 radiologic technologists working in CVIT, who indicated the frequency with which they performed each of the 137 activities. Responses were subjected to cluster analysis to classify technologists into homogeneous groups corresponding to different CVIT subspecialties. Results indicated that CVIT consists of two major subspecialties: one corresponding to cardiac procedures and one corresponding to procedures involving organ systems other than the heart. Other smaller subspecialties also emerged from the cluster analysis. A multidimensional scaling of the profiles suggested that CVIT subspecialization can be explained by two dimensions: (1) whether the procedures are diagnostic or interventional and (2) the type of organ system involved. The findings are discussed in terms of their implications for education, certification, and performance evaluation.
Article · Feb 2004 · Journal of allied health
ABSTRACT: To determine whether radiation therapy department administrators prefer to hire graduates with certain types of educational preparation. The study was undertaken by the American Registry of Radiologic Technologists as part of a larger project to determine educational requirements for radiation therapists.
Forty-one department administrators evaluated applications from a pool of 984 hypothetical applicants for the position of radiation therapist. Applications were created by systematically varying eight characteristics such as years of experience, quality of educational program, and ratings from prior references. Type of educational program (baccalaureate degree, associate's degree, or hospital certificate) was of particular interest in this study. Each administrator evaluated 24 applications and assigned a rating ranging from 1 to 5 to indicate the extent to which he or she desired to interview each applicant. All ratings and applicant characteristics were coded and subjected to regression-type analyses to determine the relative importance of each applicant characteristic to administrators' decision-making policies.
Information obtained from applicant references had the greatest impact on administrators' evaluations of applicant quality. Specifically, reference ratings of cooperation and technical skills were the two most important characteristics, followed closely by reference ratings of interpersonal skills and dependability. Quality of educational program had some influence, as did years of experience. Type of educational program had virtually no impact on interview decisions for a vast majority of the administrators.
When making hiring decisions about hypothetical applicants, department administrators place most emphasis on evidence relating to past performance and give almost no weight to type of educational preparation. The extent to which these results generalize to actual applicants is addressed in the article.
Article · Sep 2003 · International Journal of Radiation Oncology • Biology • Physics
ABSTRACT: Practice analysis (i.e., job analysis) serves as the cornerstone for the development of credentialing examinations and is generally used as the primary source of evidence when validating scores on such exams. Numerous methodological questions arise when planning and conducting a practice analysis, but there is little consensus in the measurement community regarding the answers to these questions. This article offers recommendations concerning the following issues: selecting a method of practice analysis; developing rating scales to describe practice; determining the content of test plans; using multivariate procedures for structuring test plans; and determining topic weights for test plans. The article closes by suggesting several references for further reading.
Article · Aug 2002 · Educational Measurement Issues and Practice
ABSTRACT: To determine if graduates of different types of educational programs obtain similar scores on the Examination in Radiation Therapy administered by the American Registry of Radiologic Technologists. The results will help inform discussions regarding educational requirements for radiation therapists.
Test scores were obtained for 531 candidates who had taken the examination for the first time in 1997, 1998, or 1999. Candidates were divided into the following three categories, based on the type of educational program attended: hospital-based certificate, associate's degree, or bachelor's degree. To determine if test scores were related to the type of educational preparation, analyses of variance were conducted separately to test for differences in total scores and section scores, and scores on test questions intended to measure critical thinking skills.
Candidates with an associate's degree scored slightly lower than candidates with a bachelor's degree on the total test (p < 0.10) and lower than candidates with either a certificate or bachelor's degree on Section B of the examination (Treatment Planning and Delivery, p < 0.10). Baccalaureate candidates did not obtain higher scores than those prepared in certificate programs. On critical thinking questions, candidates with certificates scored higher than those with associate's degrees (p < 0.10). Some evidence suggested that candidates with a certificate scored higher on critical thinking than those with a bachelor's degree (p < 0.10), and that candidates with a bachelor's degree scored higher than candidates with an associate's degree (p < 0.10).
Although some of the differences in the mean test scores among the three educational groups were statistically significant, all differences were small and do not support one type of educational preparation over another.
Article · Jul 2002 · International Journal of Radiation Oncology • Biology • Physics