Clinical Performance Evaluations of Third-Year Medical Students and Association With Student and Evaluator Gender

Abstract

Purpose: Clinical performance evaluations are major components of medical school clerkship grades. But are they sufficiently objective? This study aimed to determine whether student and evaluator gender is associated with assessment of overall clinical performance. Method: This was a retrospective analysis of 4,272 core clerkship clinical performance evaluations by 829 evaluators of 155 third-year students, within the Alpert Medical School grading database for the 2013-2014 academic year. Overall clinical performance, assessed on a three-point scale (meets expectations, above expectations, exceptional), was extracted from each evaluation, as well as evaluator gender, age, training level, department, student gender and age, and length of observation time. Hierarchical ordinal regression modeling was conducted to account for clustering of evaluations. Results: Female students were more likely to receive a better grade than males (adjusted odds ratio [AOR] 1.30, 95% confidence interval [CI] 1.13-1.50), and female evaluators awarded lower grades than males (AOR 0.72, 95% CI 0.55-0.93), adjusting for department, observation time, and student and evaluator age. The interaction between student and evaluator gender was significant (P = .03), with female evaluators assigning higher grades to female students, while male evaluators' grading did not differ by student gender. Students who spent a short time with evaluators were also more likely to get a lower grade. Conclusions: A one-year examination of all third-year clerkship clinical performance evaluations at a single institution revealed that male and female evaluators rated male and female students differently, even when accounting for other measured variables.
Copyright © by the Association of American Medical Colleges. Unauthorized reproduction of this article is prohibited.
Academic Medicine, Vol. 92, No. 6 / June 2017 835
Research Report
Selection of graduating medical
students into residency programs is
driven by multiple factors. However,
according to program directors, the most
important selection criteria are students’
grades on required core clerkships.1
Clinical performance evaluations (CPEs)
are used in most core clinical clerkships as
assessment and grading tools for medical
students. Clinicians who work with
medical students are asked to complete
formal evaluations of each student’s
basic clinical skills, such as history taking
and case presentation, as well as fund of
knowledge and professionalism. In most
clerkships, these evaluations, along with
standardized written examinations and
objective structured clinical examinations
(OSCEs), provide the data from which
students’ final clerkship grades are
determined. Studies show that these
CPEs are weighted more heavily than the
other evaluation methods, accounting for
50% to 70% of the final grade across all
clerkships.2,3 Despite the importance of
core clerkship clinical evaluations, there
is a paucity of literature examining the
degree of objectivity of this measure.4
The numerous evaluations that
occur over the course of attaining
entrance to medical school and
during the preclinical years are largely
standardized and unlikely to exhibit
grader-dependent bias. In contrast,
medical students are evaluated in a
more subjective manner when being
assessed on their clinical performance.
For that reason, the association of
grading with gender and the gender
pairing of trainer and trainee is
important, yet these factors are not well
understood in the medical setting in
areas where subjectivity of grading is
high. Literature from the education field
has shown that student gender often
plays a role in how students are treated
and graded.5,6 In primary schools, girls
are awarded better grades than boys,
despite similar test scores, which some
researchers attribute to “noncognitive
skills”—specifically, “a more
developed attitude towards learning.”6
Additionally, teachers’ gender can affect
their expectations and perceptions
of educational competence and
performance.7,8 Furthermore, studies9–11
suggest that gender pairing can
enhance, through a “role-model effect,”
student engagement and behavior, or,
conversely, gender noncongruence may
induce “stereotype threat,” in which
anxiety that one will confirm a negative
stereotype can lead to a decrement in
performance.
A few small studies12–14 have suggested
an interaction between student and
evaluator gender in the grading of
medical students’ simulated clinical
performance on OSCEs by standardized
patients (SPs). One small study of OSCE
grading13 found that male and female
medical students fared similarly overall;
however, when graded solely by female
SPs, women scored significantly higher,
yet male and female students were rated
the same by male SPs. These findings
were replicated in a more recent study
of OSCE grading,14 which specifically
Acad Med. 2017;92:835–840.
First published online January 17, 2017
doi: 10.1097/ACM.0000000000001565
Please see the end of this article for information
about the authors.
Correspondence should be addressed to Alison
Riese, Department of Pediatrics, Alpert Medical
School, Hasbro Children’s Hospital, 593 Eddy St.,
Providence, RI 02903; telephone: (401) 444-8531;
e-mail: ariese@lifespan.org.
Alison Riese, MD, MPH, Leah Rappaport, MD, Brian Alverson, MD,
Sangshin Park, DVM, MPH, PhD, and Randal M. Rockney, MD
examined the gender interaction during
a “gender-sensitive” patient situation, the
examination of the chest.
Similar disparities in grading regarding
student and evaluator gender have
been found in a few small studies of
nonsimulated clinical settings.15,16 A
small study of students completing a
required one-month ambulatory care
medicine clerkship at the Medical
College of Wisconsin16 showed that the
highest mean grade was given by male
preceptors to female students, and the
lowest mean grade was given by female
preceptors to male students. In a study
of evaluations of internal medicine
residents, male residents received higher
grades from male attendings than from
female attendings.17 Conversely, a study
of medical student grading in obstetrics–
gynecology18 found that female students
performed better on written exams
and OSCEs; however, they were graded
similarly to male students by their faculty
evaluators.
The influence of gender on grading
in the clinical setting is important to
understand, considering the highly
subjective nature of clinical evaluations
compared with multiple-choice tests,
where gender has no bearing on
grade assignment, as well as the more
structured setting of OSCEs, where
graders are generally well trained and
have more uniform interactions with
the students being assessed. CPEs are
completed by evaluators of all training
levels, who interact with students in
various types of settings and over varying
durations, yet their assessments are
weighted heavily in clerkship grading.
As a first step in any effort to increase
objectivity in clinical grade assignment,
it is necessary to fully understand what
issues influence evaluators’ grading of
student clinical performance. There has
been no study examining third-year
core clerkships as a whole to see how the
gender of the evaluator and the gender
of the student may be associated with
differences in the clinical evaluation of
the student. We carried out this study
to determine whether student and
evaluator gender is associated with the
grades assigned. Secondarily, we sought
to explore other student and evaluator
factors that may be associated with
variance in grading.
Method
This was a retrospective study conducted
at the Alpert Medical School (AMS).
All 4,462 CPEs recorded in the medical
school’s grading database (OASIS) from
third-year core clerkships during the
2013–2014 academic year were initially
included. At AMS, the core clerkships
and their duration during the study
period consisted of internal medicine
(12 weeks) and surgery, obstetrics–
gynecology, family medicine, pediatrics,
and psychiatry (each 6 weeks). The
medical school’s administrative offices
compiled deidentified demographic
information about the student and
evaluator for each CPE and assigned
an ID number for each student and
each evaluator who was involved in
the CPEs being studied. The evaluator
IDs were used to account for nesting of
evaluations among evaluators—that is,
cases where evaluators assessed more
than one student. As the indicator of the
student’s global clinical performance, the
“overall clinical performance” grade on
the CPE, which from now on we refer to
as the student grade, was extracted from
each CPE. The possible grades that could
be selected by each evaluator completing
a CPE were “exceptional,” “above
expectations,” “meets expectations,” and
“below expectations.” An evaluation was
excluded if it was noted to be a duplicate
entry or if data were incomplete for the
primary outcome or predictor variables.
Additionally, CPEs with a grade of
“below expectations” were excluded
because of the rare occurrence (< 1%) of
this grade.
Because we were provided deidentified
data, we were not able to match those
data with any objective nonclinical
evaluations. However, we did compare
the United States Medical Licensing
Examination (USMLE) Step 1 scores
for men versus women in the class of
2015. The medical school administrative
offices provided the means and standard
deviations (SDs) of the USMLE Step 1
scores for the male and female students in
that class, since these students’ CPE data
were in our study. The means and SDs for
these two groups were compared using
a Student t test. This study was declared
exempt by the Lifespan institutional
review board. (Lifespan Corporation,
Rhode Island’s largest health system,
is affiliated with the AMS of Brown
University.)
For each CPE, the dataset contained
demographic information about the
clerkship context, the student, and the
evaluator. Clerkship characteristics for
each CPE consisted of the clerkship
department and the length of observation
time for the student/evaluator (either
< 2 half-days or ≥ 2 half-days). Student
demographic information included
student gender and age (grouped as
25–27 years old and ≥ 28 years old).
Evaluator variables were evaluator gender,
Table 1
Demographic Information Regarding the Third-Year Medical Students and Their Evaluators at Alpert Medical School, 2013 to 2014 [a]

Characteristic                  No. (%)
Students (n = 155)
  Gender
    Male                        76 (49.0)
    Female                      79 (51.0)
  Age quartile
    25–26 years                 49 (33.3)
    27 years                    36 (24.5)
    28 years                    27 (18.4)
    > 28 years                  35 (23.8)
Evaluators (n = 829)
  Gender
    Male                        399 (48.1)
    Female                      430 (51.9)
  Age quartile
    25–30 years                 290 (37.8)
    31–40 years                 210 (27.4)
    41–50 years                 137 (17.9)
    > 50 years                  130 (17.0)
  Training level
    Resident (PGY 1)            168 (22.3)
    Resident (PGY 2)            89 (11.8)
    Resident (PGY 3–5)          155 (20.6)
    Attending                   342 (45.4)
  Department
    Family medicine             126 (15.2)
    Psychiatry                  50 (6.0)
    Internal medicine           336 (40.5)
    Obstetrics–gynecology       47 (5.7)
    Pediatrics                  155 (18.7)
    Surgery                     115 (13.9)

Abbreviation: PGY indicates postgraduate year.
[a] The authors carried out a retrospective analysis of 4,272 third-year clerkship clinical performance evaluations, involving the evaluators and the students whose data are in this table, to determine whether student and evaluator gender are associated with assessment of overall clinical performance.
age (in quartiles), and training level
(residency year or attending).
All statistical analyses were performed
using SAS 9.4 (SAS Institute, Cary,
North Carolina). A P value < .05 was
considered to be statistically significant.
This study examined the associations of
final grade with gender and covariates
using chi-square tests. Hierarchical
ordinal regression modeling was
conducted to examine the effects of
student and evaluator characteristics on
a student’s grade (“exceptional,” “above
expectations,” or “meets expectations”),
adjusting for nonindependence, or
“clustering,” of evaluators who rated
more than one student. Gender and
covariates with a P value < .05 in the
univariable model were incorporated
into a multivariable regression model,
which was built by the stepwise
selection procedure. Variables that
significantly reduced residual variance
were retained in the final model. To
avoid collinearity, phi coefficients were
estimated for each pair of independent variables.
If high collinearity between variables was
observed (phi coefficient > 0.6), we selected the
variable most relevant to the student’s grade
for multivariable modeling. Because
of the small number of evaluations in
family medicine and psychiatry, data
from these specialties were combined
for the multivariable modeling. After the
main effects model was built, interaction
terms were explored for significance.
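As an illustration of the chi-square step described above, the following minimal sketch recomputes the test for student gender from the grade counts published in Table 2. It is not the authors' SAS code: the function name `pearson_chi_square` and the dictionary layout are hypothetical, and the closed-form P value applies only because this particular table has 2 degrees of freedom.

```python
import math

# Grade counts by student gender, copied from Table 2.
# Columns: (meets expectations, above expectations, exceptional).
observed = {
    "male": [381, 857, 798],
    "female": [340, 969, 927],
}


def pearson_chi_square(rows):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(rows):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


chi2, dof = pearson_chi_square(list(observed.values()))

# With exactly 2 degrees of freedom, the chi-square upper tail has the
# closed form exp(-x / 2), so no statistics library is needed here.
assert dof == 2
p_value = math.exp(-chi2 / 2)
print(f"chi-square = {chi2:.2f}, df = {dof}, P = {p_value:.3f}")
```

Under these assumptions the sketch gives a P value of roughly .009, in line with the value listed for student gender in Table 2.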
Results
Of the 4,462 CPEs initially included in
this study, 190 (4.3%) were excluded.
Thirty-eight were excluded because they
were duplicates, and 136 were excluded
because of missing values in predictors
of interest (student or evaluator gender;
no. = 18) or in the outcome of interest
(grade; no. = 118). In addition, 16 CPEs
were excluded because of a “below
expectations” grade. Thus, the final
study dataset comprised 4,272 CPEs,
which were completed by 829 evaluators
regarding the performance of 155
students. The mean (SD) USMLE Step
1 score for the AMS class of 2015 was
221 (18.70) for women and 231 (18.98)
for men (P = .0083). The median age
of students was 27 years (interquartile
range [IQR] 26–28 years); the median
age of evaluators was 33 years (IQR
29–45 years). (See Table 1 for student
and evaluator demographics.) While the
number of students rotating through
each clerkship was consistent, the number
of CPEs for each student varied by
clerkship. The internal medicine clerkship
evaluators completed 1,267 CPEs
(30% of all CPEs), and the pediatrics
clerkship evaluators completed 1,154
(27%), which means that these two
clerkships contributed a large percentage
of CPEs compared with the percentages
contributed by the other four clerkships.
There was variability in the number
of CPEs per student (median 27, IQR
6–39) and CPEs per evaluator (median
3, IQR 1–7). Each clerkship, student, and
evaluator characteristic examined was
associated with a statistically significant
difference in the distribution of grades
received. (See Table 2.)
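The exclusion counts reported at the start of this section can be checked with simple arithmetic. The sketch below only restates the numbers given in the text; the variable names are hypothetical.

```python
# Counts taken from the Results text.
initial_cpes = 4462
excluded = {
    "duplicate entry": 38,
    "missing student or evaluator gender": 18,
    "missing grade": 118,
    "below-expectations grade": 16,
}

total_excluded = sum(excluded.values())
final_cpes = initial_cpes - total_excluded
# Exclusions as a percentage of the CPEs initially included.
percent_excluded = 100 * total_excluded / initial_cpes

print(total_excluded, final_cpes, round(percent_excluded, 1))
```

The totals agree with the 190 excluded evaluations and the final dataset of 4,272 CPEs.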
In univariable models, all predictors
were associated with the grade. Because
of high correlation between faculty age
Table 2
Associations of Third-Year Clinical Performance Grades With Clerkship, Student, and Evaluator Characteristics, Alpert Medical School, 2013 to 2014 [a]

                                No. (%) of all     No. (%) of grades by category
                                clinical           Meets           Above           Exceptional
                                evaluations        expectations    expectations
Characteristic                  (n = 4,272)        (n = 721)       (n = 1,826)     (n = 1,725)     P value
Clerkships
  Department
    Obstetrics–gynecology       602 (14.1)         136 (22.6)      271 (45.0)      195 (32.4)      < .001
    Pediatrics                  1,154 (27.0)       236 (20.5)      483 (41.9)      435 (37.7)
    Psychiatry                  300 (7.0)          45 (15.0)       136 (45.3)      119 (39.7)
    Family medicine             369 (8.6)          52 (14.1)       158 (42.8)      159 (43.1)
    Surgery                     580 (13.6)         68 (11.7)       268 (46.2)      244 (42.1)
    Internal medicine           1,267 (29.7)       184 (14.5)      510 (40.3)      573 (45.2)
  Observation time
    ≤ 2 half-days               504 (11.8)         143 (28.4)      261 (51.8)      100 (19.8)      < .001
    > 2 half-days               3,768 (88.2)       578 (15.3)      1,565 (41.5)    1,625 (43.1)
Students
  Gender
    Male                        2,036 (47.7)       381 (18.7)      857 (42.1)      798 (39.2)      .009
    Female                      2,236 (52.3)       340 (15.2)      969 (43.3)      927 (41.5)
  Age
    25–27 years                 2,542 (63.7)       482 (19.0)      1,091 (42.9)    969 (38.1)      < .001
    ≥ 28 years                  1,448 (36.3)       211 (14.6)      617 (42.6)      620 (42.8)
Evaluators
  Gender
    Male                        2,081 (48.7)       263 (12.6)      893 (42.9)      925 (44.5)      < .001
    Female                      2,191 (51.3)       458 (20.9)      933 (42.6)      800 (36.5)
  Age quartile
    25–30 years                 1,671 (41.2)       300 (18.0)      659 (39.4)      712 (42.6)      < .001
    31–40 years                 1,111 (27.4)       193 (17.4)      450 (40.5)      468 (42.1)
    41–50 years                 701 (17.3)         88 (12.6)       325 (46.4)      288 (41.1)
    ≥ 51 years                  573 (14.1)         93 (16.2)       276 (48.2)      204 (35.6)
  Training level
    Resident (PGY 1)            763 (18.8)         115 (15.1)      281 (36.8)      367 (48.1)      < .001
    Resident (PGY 2)            647 (15.9)         94 (14.5)       278 (43.0)      275 (42.5)
    Resident (PGY 3–5)          883 (21.7)         179 (20.3)      350 (39.6)      354 (40.1)
    Attending                   1,776 (43.7)       287 (16.2)      826 (46.5)      663 (37.3)

Abbreviation: PGY indicates postgraduate year.
[a] The authors carried out a retrospective analysis of 4,272 third-year clerkship clinical performance evaluations by 829 evaluators of 155 students to determine whether student and evaluator gender are associated with assessment of overall clinical performance.
and training level (phi coefficient 0.84),
only evaluator age was considered for the
multivariable model. A total of 32.9% of the
variability in the grades was accounted for
by within-evaluator nesting of grades in the
multivariable model (intraclass correlation
coefficient = 0.329; P < .001). All significant
differences in the univariable models were
retained in the multivariable model. In the
multivariable model, female student gender
was associated with higher grades (adjusted
odds ratio [AOR], 1.30; 95% CI, 1.13–1.50).
Female faculty gender was associated
with lower grades (AOR, 0.72; 95% CI,
0.55–0.93). Longer observation time, older
student age, and younger evaluator age were
all associated with higher grades. Evaluators
in internal medicine had the highest odds
of giving a better grade, while those in
obstetrics–gynecology had the lowest odds.
(See Table 3.)
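To make the odds ratios for an ordinal grade more concrete, the following sketch computes crude cumulative odds ratios for female versus male students from the counts in Table 2, one for each cut-point between adjacent grade levels. It is an illustration only: it ignores clustering by evaluator and all covariates, so it does not reproduce the hierarchical model estimates, and the function names are hypothetical.

```python
# Grade counts by student gender from Table 2, ordered from lowest to highest:
# (meets expectations, above expectations, exceptional).
male = [381, 857, 798]
female = [340, 969, 927]


def cumulative_odds(counts, cut):
    """Odds of a grade in category `cut` or higher versus any lower category."""
    higher = sum(counts[cut:])
    lower = sum(counts[:cut])
    return higher / lower


def crude_cumulative_odds_ratios(group, reference):
    """One unadjusted odds ratio per cut-point between adjacent grade levels."""
    return [
        cumulative_odds(group, cut) / cumulative_odds(reference, cut)
        for cut in range(1, len(group))
    ]


or_above_or_better, or_exceptional = crude_cumulative_odds_ratios(female, male)
print(f"above expectations or better: OR = {or_above_or_better:.2f}")
print(f"exceptional: OR = {or_exceptional:.2f}")
```

The two crude ratios (about 1.28 and 1.10) fall on either side of the univariable odds ratio of 1.17 in Table 3. An ordinal model of the proportional-odds type summarizes the cut-points with a single ratio.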
The interaction between student and faculty
gender, adjusted for all other main effects,
was also significant (P = .03; see Figure 1).
Male evaluators did not significantly differ
in their grading of male and female students
(P = .29); however, female evaluators
gave lower grades to male students
compared with female students (P < .001).
Additionally, a significant interaction
between faculty age and faculty gender was
found (P = .047), with older male evaluators
giving significantly lower grades than
younger men (P = .001), while there was
no significant difference in grading for the
female age groups (P = .71). (See Figure 2.)
There was no interaction between student
gender and student age (P = .63).
Discussion
In one year at a large U.S. medical school,
there were over 4,000 CPEs of students
in core clerkships, and data revealed
that in clerkship grading, overall, male
students received lower grades than female
students on their CPEs. This finding is
in accordance with literature examining
gender differences in clinical performance.
In general, male and female medical
students perform similarly on the MCAT
exam and have similar preclinical GPAs
and USMLE test scores,15,19,20 albeit with
factors including content area and student
and school characteristics playing a role
in performance. The class of students
represented in our dataset actually differed
in their performance on USMLE Step 1,
with men performing significantly better.
In contrast, other studies15,21 suggest
that female medical students do tend
to perform better on OSCEs, including
those that are part of the USMLE Step
2 Clinical Skills (CS) exam, and receive
better evaluations on their actual clinical
performance. There was no interaction
between evaluator gender and student
gender found in the study of Step 2 CS
scoring.21 However, our findings show that
the discrepancy in clinical performance
grades between male and female medical
students was driven primarily by female
evaluators.
The discrepancy between male and
female evaluators’ assessment of medical
Table 3
Odds Ratios of Receiving a Higher Grade, by Clerkship, Student, and Evaluator Characteristics, on 4,272 Third-Year Clerkship Clinical Performance Evaluations, Alpert Medical School, 2013 to 2014 [a,b]

                                        Univariable model             Multivariable model
Characteristic                          OR (95% CI)         P value   AOR (95% CI)        P value
Clerkships
  Department
    Obstetrics–gynecology               1.00 [c]            < .001    1.00 [c]            .002
    Pediatrics                          1.11 (0.67–1.83)              0.83 (0.49–1.40)
    Family medicine and psychiatry [d]  1.52 (0.91–2.53)              1.24 (0.71–2.17)
    Surgery                             1.88 (1.10–3.21)              1.31 (0.74–2.33)
    Internal medicine                   2.09 (1.30–3.36)              1.61 (0.97–2.69)
  Observation time
    ≤ 2 half-days                       1.00 [c]            < .001    1.00 [c]            < .001
    > 2 half-days                       2.72 (2.20–3.38)              2.74 (2.17–3.46)
Students
  Gender
    Male                                1.00 [c]            .016      1.00 [c]            .026
    Female                              1.17 (1.03–1.33)              1.30 (1.13–1.50)
  Age group
    25–27 years                         1.00 [c]            < .001    1.00 [c]            < .001
    ≥ 28 years                          1.39 (1.21–1.60)              1.48 (1.27–1.71)
Evaluators
  Gender
    Male                                1.00 [c]            < .001    1.00 [c]            .012
    Female                              0.64 (0.51–0.81)              0.72 (0.55–0.93)
  Age group
    25–30 years                         1.00 [c]            .034      1.00 [c]            .028
    31–40 years                         0.78 (0.57–1.06)              0.76 (0.56–1.04)
    41–50 years                         0.77 (0.54–1.09)              0.72 (0.50–1.03)
    ≥ 51 years                          0.59 (0.41–0.85)              0.57 (0.39–0.84)
  Training level [e]
    Attending                           1.00 [c]            < .001
    Resident (PGY 1)                    1.90 (1.40–2.59)
    Resident (PGY 2)                    1.68 (1.21–2.35)
    Resident (PGY 3–5)                  1.29 (0.95–1.76)

Abbreviations: PGY indicates postgraduate year; OR, odds ratio; AOR, adjusted odds ratio.
[a] The authors carried out a retrospective analysis of 4,272 third-year clerkship clinical performance evaluations by 829 evaluators of 155 students to determine whether student and evaluator gender are associated with assessment of overall clinical performance.
[b] A total of 32.9% of the variability in the grades was accounted for by “within-evaluator” nesting of grades in the multivariable model (intraclass correlation coefficient = 0.33; P < .001).
[c] Reference.
[d] Because of the small number of evaluations, family medicine and psychiatry were combined for the multivariable modeling.
[e] Training level was not retained in the multivariable model because of high phi coefficient (0.84) between faculty age quartile and training levels.
students’ clinical performance is most
perplexing. Medical students’ clinical
performance is influenced by attributes
outside of medical knowledge and
clinical acumen. Indeed, two studies22,23
reported that medical students who
showed empathy received better clinical
evaluations, and women scored higher
on empathy scales than men did.
Additionally, some studies22,23 found
that female students’ interpersonal skills
surpassed those of men. In primary care,
a study24 showed that female physicians’
communication skills surpassed those
of their male counterparts. If future
studies confirm this result, it is
an important finding because doctor–patient
communication has been linked
to improved health outcomes.25 If the
body of literature showing that women
outperform men in the clinical setting is
applied, our findings suggest that female
evaluators accurately detected superior
performance in their female students,
while male evaluators either were unable
to detect these differences or were biased
in their grading methods.
However, it is likely that this finding
highlights an even more complicated
interplay between gender and academic
performance and assessment. As in
the primary education world, female
students’ “learning attitude” may also
play a role, as well as the possible role
modeling of same-gender evaluators and
the stereotype threat of opposite-gender
graders, which may influence students
to perform differently depending on
the gender of their evaluators. Another
potential complicating matter is that
patients may interact differently with
medical students depending on the
student’s gender, which could also affect
the assessment of their performance.
This has been demonstrated in a study26
examining physician–patient interaction,
where patients were found to speak
differently and make more psychosocial
disclosures to female physicians.
Whatever the cause, it is concerning that
our study findings suggest that male
and female students experience different
gradings of their clinical performances,
and that the gender of the evaluator is an
independent driver of this difference.
Our data also found a significant
interaction between evaluator age and
gender, with younger male evaluators
awarding higher grades than older male
evaluators and than female evaluators in
all age groups. While younger evaluators
have been found to be more lenient
graders in other studies,27,28 to our
knowledge the age–gender interaction
has not been examined elsewhere,
and this finding warrants additional
investigation. Again, it is concerning that
intrinsic evaluator characteristics have
led to differential grading of students.
Either improved training of graders
is needed, or the characteristics of the
evaluators must be taken into account
when considering their ability to give
fair clerkship grades.
Our data also demonstrate substantial
differences in the way clerkship students
are graded by department at our school,
a finding that we suspect applies to
many schools. This variability should
be examined to provide a consistent
approach to CPEs. Differences in the
structure and duration of the different
core clerkships, as well as the time
students spend with evaluators, must be
taken into consideration when looking
at CPEs. In some cases, the structure of
the clerkship and number of evaluators
providing CPEs may result in fewer
Figure 1 Two-way interaction effect of student and evaluator gender on predicted probabilities
of evaluators’ grading of male and female students on 4,272 clinical performance evaluations of
medical students completing third-year core clerkships at Alpert Medical School, 2013 to 2014.
Male evaluators did not significantly differ in their grading of male and female students (P = .29);
however, female evaluators gave lower grades to male students than to female students
(P < .001).
Figure 2 Two-way interaction effect of evaluator gender and age quartile on predicted probabilities
of their grading of male and female students on 4,272 clinical performance evaluations of medical
students completing third-year core clerkships at Alpert Medical School, 2013 to 2014. A significant
interaction between faculty age and faculty gender was found (P = .047), with older male evaluators
giving significantly lower grades than younger men (P = .001), while there was no significant
difference in grading for the female age groups (P = .71).
grading events per student, which may
exaggerate the influence of gender and
age on a student’s final clerkship grade.
Our study has some limitations. We
evaluated only one year of grading events
at one medical school in the United States.
A multicenter study would be needed
to see if these data are generalizable to
other institutions. The grading system
used is an ordinal one, and these data
may not be reflective of data produced by
other grading systems at other medical
schools. We were not able to adjust for
or compare clinical performance grades
with standardized test scores, since the
individual-level data were not available
in our dataset. Further, we recognize
that gender representation, and thus
gender interactions at a medical school in
2013–2014, might be very different from
what was obtained in previous years, when
gender relationships and generational
differences would perhaps skew data in
other ways.
Further study is needed to learn whether
the trends of gender-pairing influence on
grading at our medical school are found
at other medical schools. Additionally,
the cause of the grading differences by
evaluator and student gender is still
unknown. Next steps may include a
qualitative approach to discover reasons
for the discrepancy in how medical
students’ performance is perceived
and assessed by evaluators of different
genders.
Acknowledgments: The authors would like to
acknowledge the assistance of the Alpert Medical
School administrative office
for the compilation of the dataset used for this
study. They would also like to thank Jennifer F.
Friedman, MD, MPH, PhD, for her mentorship
and guidance, and Kelvin Moore for his efforts
assisting with the literature review.
Funding/Support: None reported.
Other disclosures: None reported.
Ethical approval: This study was reviewed by the
Lifespan institutional review board and deemed
exempt. (Lifespan Corporation, Rhode Island’s
largest health system, is affiliated with the Alpert
Medical School of Brown University.)
Previous presentations: Pediatric Hospital
Medicine Conference, Chicago, Illinois, July 29,
2016; and Lifespan Annual Research Celebration,
Providence, Rhode Island, October 20, 2016.
A. Riese is assistant professor, Department of
Pediatrics and Medical Science, Section of Medical
Education, Alpert Medical School of Brown
University, Providence, Rhode Island.
L. Rappaport is a first-year pediatrics resident,
University of Michigan Medical School, Ann Arbor,
Michigan.
B. Alverson is associate professor, Department
of Pediatrics and Medical Science, Section of
Medical Education, Alpert Medical School of Brown
University, Providence, Rhode Island.
S. Park is postdoctoral research associate, Alpert
Medical School of Brown University and Center
for International Health Research at Rhode Island
Hospital, Providence, Rhode Island.
R.M. Rockney is professor, Department of
Pediatrics, Family Medicine, and Medical Science,
Section of Medical Education, Alpert Medical School
of Brown University, Providence, Rhode Island.
... 108 We also encountered a large variation in the terms used to describe the same group(s 70 and vague terms, such as Asian. 31,68 Age was reported in variable ways, with some studies choosing age ranges (e.g., comparing those aged > 29 years with those aged < 29 years), 40 distinct ages (e.g., 25, 26, and 27 years), 72 and mean age of learners. 73 These discrepancies make it impossible to compare studies or draw conclusions regarding bias related to age. ...
... Some studies chose to include examiner-examinee interactions based on gender as well as race and ethnicity. 10,12,42,46,51,57,72,85 Dyad analysis may be used to isolate where differential scoring is occurring within the sample size, thereby creating an opportunity for directed intervention and improvement. 109 Although our review did not explore this, assessors' demographics are necessary to gauge a comprehensive understanding of the diversity of assessors and how this may be contributing to the assessment scores of learners. ...
... 13,43,44,49,56,68,83–85,87,88 Other studies analyzed the impact of multiple SIDCs independently but did not consider exploring intersectionality. 8,9,40,42,64,67,68,70,72,73 The study of a single SIDC is not representative of the cumulative effects that medical learners who fall into multiple categories experience. 117,118 Therefore, the conclusions derived from these incomprehensive studies cannot gauge the extent of effects of these mutually reinforcing SIDCs. ...
Purpose Observed assessments are integral to medical education but may be biased against structurally marginalized communities. Current understanding of assessment bias is limited because studies have focused on single specialties, levels of training, or social identity characteristics (SIDCs). This scoping review maps studies investigating bias in observed assessments in medical education arising from trainees’ observable SIDCs at different medical training levels, with consideration of medical specialties, assessment environments, and assessment tools. Method MEDLINE, Embase, ERIC, PsycINFO, Scopus, Web of Science Core Collection, and Cochrane Library were searched for articles published between January 1, 2008, and March 15, 2023, on assessment bias related to 6 observable SIDCs: gender (binary), gender nonconformance, race and ethnicity, religious expression, visible disability, and age. Two authors reviewed the articles, with conflicts resolved by consensus or a third reviewer. Results were interpreted through group review and informed by consultation with experts and stakeholders. Results Sixty-six of 2,920 articles (2.3%) were included. These studies most frequently investigated graduate medical education (44 [66.7%]), used quantitative methods (52 [78.8%]), and explored gender bias (63 [95.5%]). No studies investigated gender nonconformance, religious expression, or visible disability. One evaluated intersectionality, with SIDCs described inconsistently. General surgery (16 [24.2%]) and internal medicine (12 [18.2%]) were the most studied specialties. Simulated environments (37 [56.0%]) were studied more frequently than clinical environments (29 [43.9%]). Bias favoring men was found more in assessments of intraoperative autonomy (5 of 9 [55.6%]), whereas clinical examination bias often favored women (15 of 19 [78.9%]). When race and ethnicity bias was identified, it consistently favored White students. 
Conclusions This review mapped studies of gender, race, and ethnicity bias in the medical education assessment literature, finding limited studies on other SIDCs and intersectionality. These findings will guide future research by highlighting the importance of consistent terminology, unexplored SIDCs, and intersectionality.
... In addition, multiple linear regression models have identified factors contributing to EI that include family support, university socialisation, and learning facilities. Riese et al. (2017) examined the association of student and evaluator gender with overall clinical performance, which was assessed on a three-point scale (meets expectations, above expectations, exceptional). Hierarchical ordinal regression modeling was used to account for clustering of the evaluations. ...
The Fourth Industrial Revolution (IR 4.0) has significantly impacted the provision of a creative and critical thinking workforce. Education 4.0 is designed to meet the ever-changing needs of the industry. Universities should be ready to produce more competitive graduates who are prepared for IR 4.0. Identifying students at risk is an important step in ensuring that graduates reach the required level and in reducing the risk of failure. This step should be an essential part of the university's academic procedure. Predictive models that are in line with the current requirements are a priority for researchers today and are important for prediction accuracy. This study has successfully developed a predictive model of student final exam performance based on the current needs. The model has been applied to modern education theory that emphasises the ability of students and the difficulty of a question. The proposed MIRT model considers individual abilities and is incorporated into one of the ordinal regression models. Based on model fit and the Lipsitz statistic, the model, known as COM-MIRT, outperformed the existing model. To ensure that the data used meet the IRT assumptions, principal component analysis was performed to determine the appropriate dimensions for the main assessments of the course. In addition, validity and reliability tests were performed to assess the accuracy and consistency of each item's score on an instrument. Meanwhile, the amount of separation for items and persons was derived from the separation index (SI). Rasch measurement analysis provided values for MNSQ, PMC, Cronbach's alpha, SR, and SI through the WINSTEPS software. Finally, the cumulative logit equation generated in the study can help educators and universities in formulating appropriate plans so that the final exam performance of students in the mathematical statistics course is enhanced beyond the expectation of a predictive model.
... Previous studies have shown that students' performance improves incrementally, both on multiple-choice examinations of relevant knowledge [7] for clerkships, and on self-assessments of competency [8] after their clerkship rotation [7,9–11]. However, medical schools vary widely in the delivery and sequencing of their disciplinary clerkship rotations and the evaluation of students during their third-year rotations [12,13], making it challenging to determine which clerkships provide the most significant positive impact on the development of future clinicians. Additionally, grading rubrics across different clerkships even within an individual school are not necessarily comparable [14], further complicating the evaluation of any given clerkship's effectiveness. ...
Purpose This study quantified the impact of clinical clerkships on medical students’ disciplinary knowledge using the Comprehensive Clinical Science Examination (CCSE) as a formative assessment tool. Methods This study involved 155 third-year medical students in the College of Human Medicine at Michigan State University who matriculated in 2016. Disciplinary scores on their individual Comprehensive Clinical Science Examination reports were extracted by digitizing the bar charts using image processing techniques. Segmented regression analysis was used to quantify the differences in disciplinary knowledge before, during, and after clerkships in five disciplines: surgery, internal medicine, psychiatry, pediatrics, and obstetrics and gynecology (ob/gyn). Results A comparison of the regression intercepts before and during their clerkships revealed that, on average, the participants improved the most in ob/gyn (β=11.193, p<.0001), followed by psychiatry (β=10.005, p<.001), pediatrics (β=6.238, p<.0001), internal medicine (β=1.638, p=.30), and improved the least in surgery (β=−2.332, p=.10). The regression intercepts of knowledge during their clerkships and after them, on the other hand, suggested that students’ average scores improved the most in psychiatry (β=7.649, p=.008), followed by ob/gyn (β=4.175, p=.06), surgery (β=4.106, p=.007), and pediatrics (β=1.732, p=.32). Conclusions These findings highlight how clerkships influence the acquisition of disciplinary knowledge, offering valuable insights for curriculum design and assessment. This approach can be adapted to evaluate the effectiveness of other curricular activities, such as tutoring or intersessions. The results have significant implications for educators revising clerkship content and for students preparing for the United States Medical Licensing Examination Step 2.
... Of the various admission strategies, high school performance appears to be a strong predictor of academic standing in medical education programs (24). Therefore, many medical schools place a lot of weight on students' high school scores during the admission process, particularly for those who are accepted straight out of high school. ...
Introduction: Medical schools face substantial challenges in objectively selecting the best applicants, and the admission process can impact medical students’ academic performance. This study aimed to estimate the students’ academic success in the preclinical stage of undergraduate medical education using admission tests. Methods: This cross-sectional study was conducted on 1,193 students’ records from the 2014 to 2019 cohorts. The students’ admission data comprised the cohort, sex, admission track, psychological test, and academic tests. The academic success was based on the student’s end-year academic evaluation. Data were analyzed using contingency and Kendall’s tau b tests with IBM® SPSS® Statistics version 16.0 for Windows. Results: Most of the 1,193 preclinical medical students’ records included in the study were females (68.1%), from the regular admission track (78.5%), from the considered psychology test category (52.8%), and had an academic admission test of less than or equal to the median (51.6%). Most students (89.7%) met all the academic requirements to pass the end-year evaluation. The bivariate analyses showed significant correlations between academic success and cohort (P<0.001), psychology test (P=0.005), and academic test (P<0.001). The analyses showed no significant correlation between academic success and sex (P=0.324) or admission track (P=0.128). Conclusions: This study indicated that cohort and psychology tests could estimate the student’s academic success at the preclinical stage of undergraduate medical education. The admission criteria related to the academic tests during the admission process should be re-evaluated, so that the academic tests could select the best students among the applicants. Keywords: Academic success, College admission test, Medical education
... The challenge of grading reliability stems, in part, from inter-institutional and inter-clerkship variability in grading practices, as well as interrater differences in subjective judgement of student performance [7–10]. Furthermore, increasing evidence suggests gender and racial bias contribute to grading discrepancies, including at our own institution, Washington University School of Medicine in St. Louis (WUSM) [11–17]. ...
Background Collective decision-making by grading committees has been proposed as a strategy to improve the fairness and consistency of grading and summative assessment compared to individual evaluations. In the 2020–2021 academic year, Washington University School of Medicine in St. Louis (WUSM) instituted grading committees in the assessment of third-year medical students on core clerkships, including the Internal Medicine clerkship. We explored how frontline assessors perceive the role of grading committees in the Internal Medicine core clerkship at WUSM and sought to identify challenges that could be addressed in assessor development initiatives. Methods We conducted four semi-structured focus group interviews with resident (n = 6) and faculty (n = 17) volunteers from inpatient and outpatient Internal Medicine clerkship rotations. Transcripts were analyzed using thematic analysis. Results Participants felt that the transition to a grading committee had benefits and drawbacks for both assessors and students. Grading committees were thought to improve grading fairness and reduce pressure on assessors. However, some participants perceived a loss of responsibility in students’ grading. Furthermore, assessors recognized persistent challenges in communicating students’ performance via assessment forms and misunderstandings about the new grading process. Interviewees identified a need for more training in formal assessment; however, there was no universally preferred training modality. Conclusions Frontline assessors view the switch from individual graders to a grading committee as beneficial due to a perceived reduction of bias and improvement in grading fairness; however, they report ongoing challenges in the utilization of assessment tools and incomplete understanding of the grading and assessment process.
Medical students require assessment and actionable feedback to develop clinical competency. However, feedback is of inconsistent quality and is frequently ineffective. To further our understanding about faculty and students’ perceptions about challenges to feedback and improve the feedback experience on the surgery clerkship, we aimed to use Group Concept Mapping (GCM), a participatory research methodology, to 1) identify barriers to exchanging feedback on the surgery clerkship and 2) examine how an institutional quality initiative to improve feedback (the Flash Feedback tool) might address the identified barriers. We prospectively enrolled study participants from 10/2022 to 03/2023. Third-year medical students completing their surgery clerkship during the 2022–2023 academic year and department of surgery faculty at a single institution were eligible for inclusion. GCM participants utilized an asynchronous web-based platform to brainstorm barriers to feedback. Participants then individually sorted the brainstormed ideas into categories based on perceived relatedness. Sorted items were analyzed to generate a two-dimensional graphical representation of associations between ideas and concepts. GCM was also used to evaluate faculty and student perceptions about the effectiveness of the Flash Feedback tool implemented during the same academic year. 20 participants identified 44 unique barriers to providing/receiving feedback. Hierarchical cluster analysis resulted in a four-cluster solution composed of the following domains: 1) lack of longitudinal exposure, 2) time constraints, 3) perceived interpersonal challenges, and 4) lack of objectivity/standardization. Both students and faculty rated the Flash Feedback tool favorably in addressing these barriers. The barriers identified by the students and faculty in our GCM study represent a cohesive knowledge structure with which to conceptualize challenges to exchanging feedback. 
Standardized, immediate post-encounter web-based applications such as our Flash Feedback tool may be especially helpful to address issues with subjectivity, non-specific feedback, and perceived interpersonal challenges that impede trust between educator and learner.
Introduction: As the number of medical students who identify as underrepresented in medicine (URiM) increases, the disparities related to gender and URiM status persist. This study examines the current initiatives within family medicine clerkships to reduce bias in evaluations. Methods: Our 10-item survey was included as a module in the 2022 Council of Academic Family Medicine Educational Research Alliance national survey of family medicine clerkship directors. Our survey questions asked about whether programs had strategies to reduce bias in student evaluations, antiracism initiatives, perceptions on effectiveness of the initiatives, and type and cadence of faculty development on evaluations for preceptors. Results: The overall response rate for the survey was 59.12% (94/159); all respondents completed our module. Seventy percent said they had implemented strategies to reduce bias in evaluations, 60% felt these were effective, and 80% felt that reducing bias in evaluations was a priority. The majority, 89/91(95%), indicated that their medical schools had a current social justice, diversity, or antiracism initiative. We identified a positive association between specific antibias medical school initiatives and clerkship directors undertaking practices to reduce bias in evaluations (P=.005). Conclusions: Most programs had implemented strategies to reduce bias and felt that doing so was a priority. Community-based preceptors were less likely to have faculty development around reducing bias compared to those in academics. Further improvements may need to prioritize including community preceptors in educational efforts to reduce bias.
Background: Reflection on both student and teacher perspectives is crucial for effective communication and professional relationships during education. Objectives: This observational cohort study aimed to compare students' self-assessment with teacher assessments, as well as with estimated self-assessment and estimated teacher-assessment, using the pictorial representation of illness and self-measure (PRISM) during an objective structured clinical examination (OSCE). Additionally, it sought to compare self-assessment and teacher-assessment with OSCE scores. Methods: Fourth-year dental students (n = 44) were included at the beginning of their clinical course. Three tasks were selected for the OSCE exams: Oral examination on a model (task 1), matrix placement (task 2), and endodontic radiograph evaluation (task 3). Objective structured clinical examination scores were rated by an independent rater. Students and one of three calibrated teachers used PRISM to evaluate their respective assessments independently and blinded from each other. The relationships between the different assessments were determined using the Pearson correlation coefficient. Results: For task 1, a moderate correlation was found between students' self-assessment and estimated self-assessment (r = 0.44, P < 0.01). For task 2, moderate correlations were observed between self-assessment and teacher-assessment, estimated teacher-assessment and teacher-assessment, as well as between self-assessment and estimated self-assessment (P ≤ 0.01). For task 3, moderate correlations were found between self-assessment and teacher-assessment, and between self-assessment and estimated self-assessment (P < 0.01). A moderate negative correlation between self-assessment and the OSCE score was observed only for task 2 (r = -0.41, P = 0.01). Moderate negative correlations between teacher-assessment in PRISM and the OSCE score were found for all three tasks (P < 0.01). 
Conclusions: Self-assessment and teacher-assessment using PRISM exhibited task-dependent correlations, while results for estimated assessments varied. PRISM may serve as a promising tool for feedback and discussion in the future, as it seems capable of highlighting different views and expectations in the teaching context. Further studies are needed to confirm these findings.
Purpose The authors describe use of the workplace-based assessment (WBA) coactivity scale according to entrustable professional activities (EPAs) and assessor type to examine how diverse assessors rate medical students using WBAs. Method A WBA data collection system was launched at Oregon Health and Science University to visualize learner competency in various clinical settings to foster EPA assessment. WBA data from January 14 to June 18, 2021, for medical students (all years) were analyzed. The outcome variable was level of supervisor involvement in each EPA, and the independent variable was assessor type. Results A total of 7,809 WBAs were included. Most fourth-, third-, and second-year students were assessed by residents or fellows (755 [49.5%], 1,686 [48.5%], and 918 [49.9%], respectively) and first-year students by attending physicians (803 [83.0%]; P < .001). Attendings were least likely to use the highest rating of 4 (1 was available just in case; 2,148 [56.7%] vs 2,368 [67.7%] for residents; P < .001). Learners more commonly sought WBAs from attendings for EPA 2 (prioritize differential diagnosis), EPA 5 (document clinical encounter), EPA 6 (provide oral presentation), EPA 7 (form clinical questions and retrieve evidence-based medicine), and EPA 12 (perform general procedures of a physician). Residents and fellows were more likely to assess students on EPA 3 (recommend and interpret diagnostic and screening tests), EPA 4 (enter and discuss orders and prescriptions), EPA 8 (give and receive patient handover for transitions in care), EPA 9 (collaborate as member of interprofessional team), EPA 10 (recognize and manage patient in need of urgent care), and EPA 11 (obtain informed consent). Conclusions Learners preferentially sought resident vs attending supervisors for different EPA assessments. 
Future research should investigate why learners seek different assessors more frequently for various EPAs and if assessor type variability in WBA levels holds true across institutions.
A prominent class of explanations for the gender gaps in student outcomes focuses on the interactions between students and teachers. In this study, I examine whether assignment to a same-gender teacher influences student achievement, teacher perceptions of student performance, and student engagement. This study's identification strategy exploits a unique matched-pairs feature of a major longitudinal study, which provides contemporaneous data on student outcomes in two different subjects. Within-student comparisons indicate that assignment to a same-gender teacher significantly improves the achievement of both girls and boys as well as teacher perceptions of student performance and student engagement with the teacher's subject. © 2007 by the Board of Regents of the University of Wisconsin System.
Objectives: To explore the sources of variability in evaluator ratings among third year medical students in the Internal Medicine clinical rotation. Also, to examine systematic effects and variability introduced by differences in the various student, evaluator, and evaluation settings. Methods: A multilevel model was used to estimate the amount of between-student, between-rater and rater-student interaction variability present in the students' clinical evaluations in a third year internal medicine clinical rotation. Within this model, linear regression analysis was used to estimate the effect of variables on the students' numerical evaluation scores and the reliability of those scores. Results: A total of 2,747 evaluation surveys were collected from 389 evaluators on 373 students over 4.5 years. All surveys used a nine-point grading scale, and therefore all results are reported on this scale. The calculated between-rater, between-student and rater-student interaction variance components were 0.50, 0.27 and 0.62, respectively. African American/Black students had lower scores than Caucasian students by 0.58 points (t=-3.28; P=0.001). No gender effects were noted. Conclusions: These between-rater and between-student variance components imply that the evaluator plays a larger role in the students' scores than the students themselves. The residual rater-student interaction variance was larger and did not change by accounting for the measured demographic variables. This implies there is significant variability in each rater-student interaction that remains unexplained. This could contribute to unreliability in the system, requiring that students receive between 8 and 17 clinical evaluations to achieve 80% reliability.
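The link between the variance components above and the number of evaluations needed can be illustrated with a short calculation. This is only a sketch under stated assumptions: the abstract does not say how the range of 8 to 17 was derived, so the formula below (reliability of a mean of n evaluations, student variance divided by student variance plus error variance over n) and the choice of which components count as error are assumptions, and the function name `evaluations_needed` is hypothetical.

```python
import math

def evaluations_needed(student_var, error_var, target=0.80):
    """Smallest n such that student_var / (student_var + error_var / n) >= target.

    Assumed formula (not stated in the abstract): reliability of the mean
    of n independent evaluations of the same student.
    """
    n = target * error_var / ((1.0 - target) * student_var)
    return math.ceil(n)

# Variance components reported in the abstract (nine-point scale).
between_rater = 0.50
between_student = 0.27
interaction = 0.62

# Assumption 1: only between-rater variance is treated as error.
low = evaluations_needed(between_student, between_rater)

# Assumption 2: between-rater plus rater-student interaction variance is error.
high = evaluations_needed(between_student, between_rater + interaction)

print(low, high)
```

Under these two assumed definitions of error the sketch yields a lower and an upper count, which shows how sensitive the required number of evaluations is to what is treated as noise.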
There is ample evidence today in the stereotype threat literature that women and girls are influenced by gender-stereotyped expectations on standardized math tests. Despite its high relevance to education, this phenomenon has not received much attention in school settings. The present studies offer the first evidence to date indicating that middle school girls exhibit a performance deficit in quasi-ordinary classroom circumstances when they are simply led to believe that the task at hand measures mathematical skills. This deficit occurred in girls working alone or in mixed-gender groups (i.e., presence of regular classmates) but not in same-gender groups (i.e., presence of only same-gender classmates). Compared with the mixed-gender groups, the same-gender groups were also associated for girls in the stereotype threat condition with greater accessibility of positive role models (i.e., female classmates who excel in math), at the expense of both stereotypic in-group and out-group members (i.e., low-math-achievement girls and high-math-achievement boys). Finally, the greater accessibility of positive role models mediated the impact of the activated stereotype on girls' performance, exactly as one would expect from C. M. Steele's (1997) stereotype threat theory. Taken together, these findings clearly show that reducing stereotype threat in the classroom is a crucial challenge for both scientists and teachers.
We extend the analysis of early-emerging gender differences in academic achievement to include both (objective) test scores and (subjective) teacher assessments. Using data from the 1998-99 ECLS-K cohort, we show that the grades awarded by teachers are not aligned with test scores, with the disparities in grading exceeding those in testing outcomes and uniformly favoring girls, and that the misalignment of grades and test scores can be linked to gender differences in non-cognitive development. Girls in every racial category outperform boys on reading tests and the differences are statistically significant in every case except for black fifth-graders. Boys score at least as well on math and science tests as girls, with the strongest evidence of a gender gap appearing among whites. However, boys in all racial categories across all subject areas are not represented in grade distributions where their test scores would predict. Even those boys who perform equally as well as girls on reading, math and science tests are nevertheless graded less favorably by their teachers, but this less favorable treatment essentially vanishes when non-cognitive skills are taken into account. White boys who perform on par with white girls on these subject-area tests and exhibit the same non-cognitive skill level are graded similarly. For some specifications there is evidence of a grade "bonus" for white boys with test scores and behavior like their girl counterparts. While the evidence is a little weaker for blacks and Hispanics, the message is essentially the same.
Article
We estimate the effect of primary school teachers' gender biases on boys' and girls' academic achievements during middle and high school and on the choice of advanced level courses in math and sciences during high school in Tel-Aviv, Israel. We measure bias using class-gender differences in scores between school exams graded by teachers and national exams graded blindly by external examiners. For identification, we rely on the random assignment of teachers and students to classes in primary schools. Our results suggest that assignment to a teacher with a greater bias in favor of girls (boys) has positive effects on girls' (boys') achievements. Such gender biases also have a positive impact on girls' (boys') enrollment in advanced level math courses in high school. These results suggest that teachers' biased behavior at early stages of schooling has long-run implications for occupational choices and earnings in adulthood, because enrollment in advanced courses in math and science in high school is a prerequisite for post-secondary schooling in engineering, computer science and so on.
Article
Girls presently outperform boys in overall academic success. Corresponding gender stereotypes portray male students as lazy and troublesome and female students as diligent and compliant. The present study investigated whether these stereotypes impact teachers’ perceptions of students and whether students’ visible enactment of their gender at school (behaving in a very masculine or feminine way) increases the impact of these stereotypes on teachers’ perceptions of students. We hypothesized that teachers would ascribe more behavior that impedes learning and less behavior that fosters learning to male students who enact masculinity as compared with male students who show gender-neutral behavior and female students. Three pilot studies (N = 104; N = 82; N = 86) yielded pretested material for a randomized vignette study of N = 104 teachers. The teachers read one randomly assigned vignette describing a male (or female) student enacting his (or her) gender (or not) and rated how likely this student would be to display behaviors that impede or foster learning in a 2 (between: target students’ gender) × 2 (between: gender enactment [yes/no]) × 2 (between: teachers’ gender) × 2 (within: ascribed behavior) factorial design. As expected, male students enacting masculinity were rated as showing the lowest amount of academic engagement. Results are discussed with regard to the current debate on the causes of boys’ lower academic success.
Article
Studies of the evaluation of medical students' clinical performance frequently do not differentiate between ratings by house officer and attending staff evaluators. This practice is not appropriate, since research investigations have shown that house officers rate medical students' clinical performance higher and have higher interrater agreement than do attending staff. This investigation studies one aspect of the validity of medical students' clinical performance ratings and demonstrates that there are higher correlations between house officer ratings of student knowledge and student cognitive ability scores than there are between attending staff evaluations and student ability scores.
Article
A student's temperament plays a significant role in the teacher's perception of the student's learning style, educational competence (EC), and teachability. Hence, temperament contributes to the student's academic achievement and the teacher's subjective ratings of school grades. However, little is known about the effect of gender and teacher's age on this association. We examined the effect of teacher's and student's gender and teacher's age on teacher-perceived temperament, EC, and teachability, and whether there is a significant same-gender or different-gender association between teachers and students in this relationship. The participants were a population-based sample of 3,212 Finnish adolescents (M = 15.1 years) and 221 subject teachers. Temperament was assessed with the Temperament Assessment Battery for Children - Revised and Revised Dimensions of Temperament Survey batteries, and EC with three subscales covering Cognitive ability, Motivation, and Maturity. Data were analyzed with multi-level modelling. Teachers perceived boys' temperament and EC more negatively than girls'. However, the differences between boys and girls were not as large when perceived by male teachers as they were when perceived by female teachers. Males perceived boys more positively and more capable in EC and teachability than females. They were also stricter regarding their perceptions of girls' traits. With increasing age, males perceived boys' inhibition as higher and mood lower. Generally, the older the teacher, the more mature he/she perceived the student. Teachers' ratings varied systematically by their gender and age, and by students' gender. This bias may have an effect on school grades and needs to be taken into consideration in teacher education.
Article
A group of 156 first year medical students completed measures of emotional intelligence (EI) and physician empathy, and a scale assessing their feelings about a communications skills course component. Females scored significantly higher than males on EI. Exam performance in the autumn term on a course component (Health and Society) covering general issues in medicine was positively and significantly related to EI score but there was no association between EI and exam performance later in the year. High EI students reported more positive feelings about the communication skills exercise. Females scored higher than males on the Health and Society component in autumn, spring and summer exams. Structural equation modelling showed direct effects of gender and EI on autumn term exam performance, but no direct effects other than previous exam performance on spring and summer term performance. EI also partially mediated the effect of gender on autumn term exam performance. These findings provide limited evidence for a link between EI and academic performance for this student group. More extensive work on associations between EI, academic success and adjustment throughout medical training would clearly be of interest.
Article
Multiple studies examining the relationship between physician gender and performance on examinations have found consistent significant gender differences, but relatively little information is available related to any gender effect on interviewing and written communication skills. The United States Medical Licensing Examination (USMLE®) Step 2 Clinical Skills® (CS®) examination is a multi-station examination where examinees (physicians in training) interact with, and are rated by, standardized patients (SPs) portraying cases in an ambulatory setting. Data from a recent complete year (2009) were analyzed via a series of hierarchical linear models to examine the impact of examinee gender on performance on the data gathering (DG) and patient note (PN) components of this examination. Results from both components show that not only do women have higher scores on average, but women continue to perform significantly better than men when other examinee and case variables are taken into account. Generally, the effect sizes are moderate, reflecting an approximately 2% score advantage by encounter. The advantage for female examinees increased for encounters that did not require a physical examination (for the DG component only) and for encounters that involved a Women's Health issue (for both components). The gender of the SP did not have an impact on the examinee gender effect for DG, indicating a desirable lack of interaction between examinee and SP gender. The implications of the findings, especially with respect to the validity of the use of the examination outcomes, are discussed.