Assessing professional competence: From methods to programs

Department of Educational Development and Research, University of Maastricht, Maastricht, The Netherlands.
Medical Education (Impact Factor: 3.2). 04/2005; 39(3):309-17.
INTRODUCTION: We use a utility model to illustrate that, firstly, selecting an assessment method involves context-dependent compromises, and secondly, that assessment is not a measurement problem but an instructional design problem, comprising educational, implementation and resource aspects. In the model, assessment characteristics are differently weighted depending on the purpose and context of the assessment. EMPIRICAL AND THEORETICAL DEVELOPMENTS: Of the characteristics in the model, we focus on reliability, validity and educational impact and argue that they are not inherent qualities of any instrument. Reliability depends not on structuring or standardisation but on sampling. Key issues concerning validity are authenticity and integration of competencies. Assessment in medical education addresses complex competencies and thus requires quantitative and qualitative information from different sources as well as professional judgement. Adequate sampling across judges, instruments and contexts can ensure both validity and reliability. Despite recognition that assessment drives learning, this relationship has been little researched, possibly because of its strong context dependence. ASSESSMENT AS INSTRUCTIONAL DESIGN: When assessment should stimulate learning and requires adequate sampling, in authentic contexts, of the performance of complex competencies that cannot be broken down into simple parts, we need to make a shift from individual methods to an integral programme, intertwined with the education programme. Therefore, we need an instructional design perspective. IMPLICATIONS FOR DEVELOPMENT AND RESEARCH: Programmatic instructional design hinges on a careful description and motivation of choices, whose effectiveness should be measured against the intended outcomes. We should not evaluate individual methods, but provide evidence of the utility of the assessment programme as a whole.


Available from: Cees Van der Vleuten, Dec 09, 2015
Medical Education 2005; 39: 309–317
negatively affected by many sources of error or bias,
and research has provided conclusive evidence that,
if we want to increase reliability, we will have to
ensure that our sampling takes account of all these
unwanted sources of variance. A good understanding
of the issues involved in sampling may offer us many
more degrees of freedom in test development.
The predominant condition affecting the reliability
of assessment is domain- or content-specificity,
because competence is highly dependent on context
or content. This means that we will only be able to
achieve reliable scores if we use a large sample across
the content of the subject to be tested.
If the
assessment involves other conditions with a potential
effect on reliability, such as examiners and patients,
careful sampling across those conditions is equally
essential. With intelligent test designs, which sample
efficiently across conditions (such as using different
examiners for each station in an OSCE), reliable
scores will generally be obtained within a reasonable
testing time.
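The gain from using a different examiner at each station can be made concrete with generalisability theory. The sketch below is an illustration only: the variance components are hypothetical numbers chosen for the example, not estimates from any study, and the function names are made up here. It compares the relative error variance when one examiner scores every station with the case where each station has its own examiner.

```python
# Illustrative generalisability-theory sketch; all numbers are hypothetical.

def g_coefficient(var_person, relative_error):
    """Person variance as a share of person-plus-error variance."""
    return var_person / (var_person + relative_error)


def error_same_examiner(var_ps, var_pr, var_psr_e, n_stations):
    """One examiner scores all stations (persons x stations x examiners).

    The person-by-examiner component is not reduced by adding stations.
    """
    return var_ps / n_stations + var_pr + var_psr_e / n_stations


def error_examiner_per_station(var_ps, var_pr, var_psr_e, n_stations):
    """A different examiner at each station (examiners nested in stations).

    Examiner-related error is sampled anew at every station, so it is
    averaged over stations together with the station-related error.
    """
    return (var_ps + var_pr + var_psr_e) / n_stations


# Hypothetical variance components (assumptions, not estimates from data).
VAR_PERSON = 1.0            # stable differences between candidates
VAR_PERSON_STATION = 2.0    # content specificity
VAR_PERSON_EXAMINER = 0.5   # examiner severity towards particular candidates
VAR_RESIDUAL = 1.5          # three-way interaction and unexplained error
N_STATIONS = 10

components = (VAR_PERSON_STATION, VAR_PERSON_EXAMINER, VAR_RESIDUAL, N_STATIONS)
g_same = g_coefficient(VAR_PERSON, error_same_examiner(*components))
g_rotating = g_coefficient(VAR_PERSON, error_examiner_per_station(*components))

print(f"same examiner throughout:       {g_same:.2f}")
print(f"different examiner per station: {g_rotating:.2f}")
```

With these made-up components the coefficient rises from about 0.54 to about 0.71 for the same number of stations, purely because examiner effects are now sampled as widely as content.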
So far, this is nothing new. What is new, however, is
the recent insight that reliability is not conditional on
objectivity and standardisation. The fact that objectivity
and reliability are often confused was addressed
theoretically some time ago,
but the empirical
evidence is becoming convincingly clear now and
may point towards new directions in assessment. To
illustrate our point, let us look at the OSCE. The
OSCE was developed as an alternative to the then
prevailing subjective and unreliable clinical assess-
ment methods, such as vivas and clinical ratings. The
main perceived advantage of the OSCE was objectiv-
ity and standardisation, which were regarded as the
main underpinnings of its reliability. However, an
abundance of study evidence has since shown that
the reliability of an OSCE is contingent on careful
sampling, particularly across clinical content, and an
appropriate number of stations, which generally
means that several hours of testing time are needed.
What actually occurred was that the brevity of
clinical samples (leading to a larger sample overall
than in previous methods) and the fact that students
rotated through the stations (optimal sampling
across patients and examiners) led to more adequate
sampling, which in turn had a far greater impact on
reliability than any amount of standardisation could
have had. This finding is not unique to the OSCE. In
recent years many studies have demonstrated that
reliability can also be achieved with less standardised
assessment situations and more subjective evaluations,
provided the sampling is appropriate. Table 1
illustrates this by presenting reliability estimates for
several instruments with differing degrees of
standardisation. For comparative purposes, the reliability
estimates are expressed as a function of the
testing time needed.

Table 1 Reliability estimates of different assessment instruments as a function of testing time

                                                                                   Reliability for different testing times
Instrument                           Description                                   1 hour    2 hours   4 hours   8 hours
Multiple choice*                     Short stem and short menu of options          0.62      0.76      0.93      0.93
Patient management problem*          Simulation of patient, full scenarios         0.36      0.53      0.69      0.82
Key feature case (write-in)*         Short patient case vignette followed by       0.32      0.49      0.66      0.79
                                     write-in answer
Oral examination                     Oral examination based on patient cases       0.50      0.69      0.82      0.90
Long case examination                Oral examination based on previously          0.60      0.75      0.86      0.90
                                     unobserved real patient
OSCE                                 Simulated realistic encounters in round       0.54      0.69      0.82      0.90
                                     robin format
Mini-clinical exercise (mini-CEX)‡   Short follow-up oral examination based        0.73      0.84      0.92      0.96
                                     on previously observed real patient
Practice video assessment            Selected patient-doctor encounters            0.62      0.76      0.93      0.93
                                     from video recordings in actual practice
Incognito standardised patients‡     Real consultations scored by undetected       0.61      0.76      0.82      0.86
                                     simulated patients

* One-facet all random design with items crossed with persons (p × i).
† Two-facet all random design with judges (examiners) nested within items within persons (j:i:p).
‡ One-facet all random design with items nested within persons (i:p).
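One standard way to relate reliability to testing time is the Spearman-Brown formula, which projects the reliability of a test lengthened by a factor k from the reliability of the original test, assuming the added material is comparable to the original. The source does not state how the entries of Table 1 were derived, so the sketch below is an assumption and not a reconstruction of the authors' method. It reproduces the mini-CEX and patient management problem rows to two decimals, but not every row (the multiple choice row, for example, does not follow it).

```python
def spearman_brown(reliability, factor):
    """Projected reliability when test length is multiplied by `factor`.

    Assumes the added testing time samples material comparable to the
    original (the classical parallel-parts assumption).
    """
    return factor * reliability / (1 + (factor - 1) * reliability)


# 1-hour reliabilities taken from Table 1; longer times are projections.
one_hour = {"mini-CEX": 0.73, "patient management problem": 0.36}

for name, r1 in one_hour.items():
    projected = [round(spearman_brown(r1, hours), 2) for hours in (1, 2, 4, 8)]
    print(name, projected)
```

The shape of the curve matters more than the exact figures: each doubling of testing time yields a smaller gain, so broad sampling costs hours whichever instrument is used.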
The comparative data should not be interpreted too
strictly as only a single study was included for each
type of method and reliability estimations were based
on different designs across studies. For our discussion
it is irrelevant to know the exact magnitude of the
reliability or which method can be hailed as the
‘winner’. The important point is to illustrate that all
methods require substantial sampling and that
methods which are less structured or standardised,
such as the oral examination, the long case examination,
the mini-clinical evaluation exercise (mini-CEX)
and the incognito standardised patient method,
can be entirely or almost as reliable as other more
structured and objective measures. In a recent review,
a similar conclusion was presented for global clinical
performance assessments.
They are not included in
Table 1 as the unit of testing time is unavailable, but
a sufficiently reliable global estimate of competence
requires somewhere between 7 and 11 ratings,
probably not requiring more than a few hours of
testing time. All these reliability studies show that
sampling remains the pivotal factor in achieving
reliable scores with any instrument and that there is
no direct connection between reliability and the level
of structuring or standardisation.
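The figure of 7 to 11 ratings can also be read the other way round: given the reliability of a single rating, how many ratings are needed to reach a target? The sketch below inverts the Spearman-Brown formula. The target of 0.80 and the single-rating reliabilities are assumptions chosen for illustration; under that target, 7 to 11 ratings would correspond to single-rating reliabilities of roughly 0.36 down to 0.27.

```python
import math


def ratings_needed(single_rating_reliability, target):
    """Smallest number of ratings whose average reaches `target` reliability.

    Inverts the Spearman-Brown formula; both arguments must lie strictly
    between 0 and 1.
    """
    r, t = single_rating_reliability, target
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    exact = t * (1 - r) / (r * (1 - t))
    # Small tolerance so floating-point noise does not add a spurious rating.
    return math.ceil(exact - 1e-9)


# Hypothetical single-rating reliabilities, assumed target of 0.80.
for r in (0.25, 0.30, 0.40):
    print(r, ratings_needed(r, 0.80))
```

The less reliable a single judgement is, the more judgements are needed, but the number stays modest as long as the judgements are sampled across occasions and assessors.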
This insight has far-reaching consequences for the
practice of assessment. Basically, the message is that
no method is inherently unreliable and any method
can be sufficiently reliable, provided sampling is
appropriate across conditions of measurement. An
important consequence of this shift in the perspec-
tive on reliability is that there is no need for us to
banish from our assessment toolbox instruments that
are rather more subjective or not perfectly standard-
ised, provided that we use those instruments sensibly
and expertly. Conversely, we should not be deluded
into thinking that as long as we see to it that our
assessment toolbox exclusively contains structured
and standardised instruments, the reliability of our
measurements will automatically be guaranteed.
Validity refers to whether an instrument actually does
measure what it is purported to. Newer developments
concerning assessment methods in relation to validity
have typically been associated with the desire to attain
a more direct assessment of clinical competence by
increasing the authenticity of the measurement. This
started in the 1960s with the assessment of ‘clinical
reasoning’ by patient management problems and
continued with the introduction of the OSCE in the
1970s. Authenticity was achieved by offering candidates
simulated real world challenges, either on
paper, in computerised forms or in a laboratory
setting. Such assessment methods have passed
through major developments and refinements of
The assessment of higher cognitive
abilities has progressed from the use of realistic
simulations to short and focused vignettes which tap
into key decisions and the application of knowledge,
in which the response format (e.g. menu, write-in,
open, matching) is of minor importance. The OSCE
has similarly led to a wealth of research, from which
an extensive assessment technology has emerged.
However, on top of the rapid progress in those areas,
we see a number of interrelated developments, which
may have a marked impact on the validity of our
measurements in the future.
Firstly, we are likely to witness the continued progress
of the authenticity movement towards assessment in
the setting of day-to-day practice.
Whereas the
success of the OSCE was basically predicated on
moving assessment away from the workplace to a
laboratory-controlled environment by providing
authentic tasks in a standardised and objectified way,
today, insights into the relationship between sampling
and reliability appear to have put us in a position
where we can move assessment back to the real world
of the workplace as a result of the development of less
standardised, but nevertheless reliable, methods of
practice-based assessment. Methods are presently
emerging that allow assessment of performance in
practice by enabling adequate sampling across different
contexts and assessors. Methods of performance
assessment include the mini-CEX, work sampling,
video assessment and the use of incognito simulated
patients. Such methods are also helpful in the final
step of Miller’s competency pyramid. In this pyramid,
assessment moves from the ‘knows’ stage via ‘knows how’
(paper and computer simulations) and ‘shows how’
(performance simulations such as the OSCE) to the final
‘does’ level of habitual performance in day-to-day practice.
A second development concerns the movement
towards the integration of competencies.
Essentially, this movement follows insights from modern
educational theory, which postulates that learning is
facilitated when tasks are integrated.
programmes that are restricted to the ‘stacking’ of
components or subskills of competencies are less
effective in delivering competent professionals than
methods in which different task components are
presented and practised in an integrated fashion,
which creates conditions that are conducive to
transfer. This ‘whole-task’ approach is reflected in the
current competency movement. A competency is the
ability to handle a complex professional task by
integrating the relevant cognitive, psychomotor and
affective skills. In educational practice we now see
curricula being built around such competencies or
However, in assessment we tend to persist in our
inclination to break down the competency that we
wish to assess into smaller units, which we then assess
separately in the conviction that mastery of the parts
will automatically lead to competent performance of
the integrated whole. Reductionism in assess-
ment has also emerged from oversimplified skills-by-
method thinking,
in which the fundamental idea
was that for each skill a single (and only a single)
instrument could be developed and used. We con-
tinue to think in this way despite the fact that
experience has taught us the errors of our simplistic
thinking. For example, in the original OSCE, short,
isolated skills were assessed within a short time span.
Previous validity research has sounded clear warnings
of the drawbacks of such an approach. For example,
the classic patient management problem, which
consisted of breaking down the problem-solving
process into isolated steps, has been found to be a
not very sensitive method for detecting differences in
Another example can be derived from
OSCE research that has shown that more global
ratings provide a more faithful reflection of expertise
than detailed checklists.
Atomisation may lead to
trivialisation and may threaten validity and, therefore,
should be avoided. Recent research that shows the
validity of global and holistic judgement thus helps us
to avoid trivialisation. The competency movement is a
plea for an integrated approach to competence,
which respects the (holistic or tacit) nature of
expertise. Coles argues that the learning and asses-
sing of professional judgement is the essence of what
medical competence is about.
This means that,
rather than being a quality that augments with each
rising level of Miller’s pyramid, authenticity is present
at all levels of the pyramid and in all good assessment
methods. A good illustration of this is the way test
items of certifying examinations in the USA are
currently being written (
Compared with a few decades ago, today’s items are
contextual, vignette-based or problem-oriented and
require reasoning skills rather than straightforward
recall of facts. This contextualisation is considered an
important quality or validity indicator.
The validity
of any method of assessment could be improved
substantially if assessment designers respected
the characteristic of authenticity. We can also reverse
the authenticity argument: when authenticity is not a
matter of simply climbing the pyramid but something
that should be realised at all levels of the pyramid, we
can also say that similar authentic information may
come from various sources within the pyramid. It is,
therefore, wise to use these multiple sources of
information from various methods to construct an
overall judgement by triangulating information
across these sources, a fact that supports the argu-
ment that we need multiple methods in order to
make a good job of assessment.
A final trend is also related to the competency
movement. The importance of general professional
competencies, which are not unique to the medical
profession, is acknowledged. These competencies
include the ability to work in a team, metacognitive
skills, professional behaviour, the ability to reflect
and to carry out self-appraisal, etc. Although neither
the concepts themselves nor the search for ways to
assess them are new, there is currently a marked
tendency to place more and more emphasis on such
general competencies in education and, therefore, in
assessment. New methods are gaining popularity,
such as self-assessment, peer assessment,
multi-source feedback or 360-degree feedback
and portfolios.
We see the growing prominence of general
competencies as a significant development, because it
will require a different assessment orientation with
potential implications for other areas of assessment.
Information gathering for the assessment of such
general competencies will increasingly be based on
qualitative, descriptive and narrative information
rather than on, or in addition to, quantitative,
numerical data. Such qualitative information cannot
be judged against a simple, pre-set standard. That is
why some form of professional evaluation will be
indispensable to ensure its appropriate use for
assessment purposes. This is a challenge to which
assessment developers will have to rise in the near
future. In parallel to what we have said about the
dangers of reductionism, the implications of the use
of qualitative information point to a similar respect
for holistic professional judgement on the part of the
assessor. As we move further towards the assessment
of complex competencies, we will have to rely more
on other, and probably more qualitative, sources of
information than we have been accustomed to and
we will come to rely more on professional judgement
as a basis for decision making about the quality and
the implications of that information. The challenge
will be to make this decision making as rigorous as
possible without trivialising the content for
‘objectivity’ reasons. There is much to be done in this
Impact on learning
The impact of assessment on learning has also been
termed ‘consequential validity’,
which is incorporated
in the formal definition of validity by the
American Educational Research Association.
We prefer to use it as a separate criterion, simply because
of its importance in any balanced utility appraisal.
This brings us to 2 somewhat paradoxical observations.
The first observation is that the notion of the impact
of assessment on learning is gaining more and more
general acceptance. Many publications have acknowledged
the powerful relationship between assessment
and learning. Recognition of the concept that
assessment is the driving force behind learning is
increasingly regarded as one of the principles of good
practice in assessment.
Unfortunately, this does not
mean that changes are easy to achieve in practice or
that changes in assessment will no longer be the last
item on the agenda of curriculum renewal.
The second observation is that there is a paucity of
publications that shed light on the relationship
between assessment and learning.
From our daily
experience in educational practice we are familiar
with some of the crucial issues in this respect: how to
achieve congruence between educational objectives
and assessment; how to provide and increase feed-
back from assessment; how to sustain formative
feedback; how to combine and balance formative and
summative assessment; how much assessment is
enough; how to spread assessment over time, etc.
Unfortunately, published information that can fur-
ther our thinking and progress in this area is hard to
come by.
An explanation of this scarcity may be that it is almost
impossible to study the impact of assessment on
learning without knowing about the context of the
assessment. For example, a recent paper showed that
students’ performance on an OSCE station had a
much stronger relationship with the students’
momentary context (the rotation they were in) than
with their past experience with the subject.
The concept that a characteristic of an assessment method
is not inherent in the method but depends on how
and in what context assessment takes place is even
more applicable in the case of its impact on learning
than for any of the other characteristics in the utility
equation. Similar methods may lead to widely
differing educational effects, depending on their use
and place in the overall assessment programme. This
means that we are badly in need of more studies to
address the issues mentioned above, research that
will inevitably require more specification of the
assessment context.
The paragraph on reliability indicated that there are
no inherently inferior assessment methods and that
reliability depends on sampling. However, it is also
fair to admit that most of the assessment methods
commonly used within regular medical training
programmes are used unreliably. The section on
validity showed that we cannot expect a single
method to be able to cover all aspects of competen-
cies of the layers of Miller’s pyramid, but that we need
a blend of methods, some of which will be different
in nature, which may mean less numerical with less
standardised test taking conditions. Professional
judgement is important, both for the assessment tasks
that we design and for the assessor who
appraises task performance. As for the impact of
assessment on learning, we have made it clear that
any method of assessment can have any sort of
influence on learning (positive or negative),
depending on how it is used and in what context. It is
our view that the preceding discussion constitutes a
strong plea for a shift of focus regarding assessment,
that is, a shift away from individual assessment
methods for separate parts of competencies towards
assessment as a component that is inextricably woven
together with all the other aspects of a training
programme. From this point of view, the instruc-
tional design perspective, the conceptual utility
model should be applied at the level of the integral
assessment programme. Assessment then changes
from a psychometric problem to be solved for a single
assessment method to an educational design problem
that encompasses the entire curriculum. Keeping in
mind what is acceptable in a given context (i.e. level
of expertise of staff, past experience in assessment,
student and staff beliefs) and the available resources,
the challenge then becomes how to design an
assessment programme that fulfils all the assessment
criteria. This approach offers considerably more
degrees of freedom in the use of a variety of methods.
One could try and cover the entire competency
pyramid, deliberately incorporating ‘hard’ measures
in some instances on reliability grounds and ‘softer’
ones in other instances in order to deliberately steer
learning in a certain direction. One can diversify the
selection of methods, using some that elicit verbal
expression and others that call for writing skills.
Instead of designing course-related assessments only,
one could think of longitudinal, course-independent
measures targeted at individual students’ growth or
personal development. The issue then is not whether
one uses ‘old-fashioned’ or ‘modern’ methods of
assessment, but much more why and how we should
select this or that method from our toolbox in a given
situation. Trustworthy and credible answers will
ultimately determine the utility of the assessment.
A programmatic, instructional design approach to
assessment surpasses the autonomy of the individual
course developer or teacher. It requires central
planning and co-ordination and needs a well written
master plan. Essentially, this notion follows that of
modern curriculum design. No curriculum renewal
will be successful without careful orchestration and
The same holds for an assessment
programme. Another likeness to curriculum design is
the need for periodic re-evaluation and re-design.
The effect of assessment on learning can be quite
unpredictable and may change over time. For exam-
ple, the way regulations are set or changed may result
in dramatic strategic effects for learners. This means
that ongoing evaluation and adjustment of the
assessment programme will be imperative.
We know that many psychometric issues are involved
in collating assessment information and combining
scores from different sources. We cannot say that the
use of multiple measures will automatically increase
reliability and validity. With every decision that is
made within an assessment programme, reliability is
at stake and decision errors are made and may
accumulate. When we combine information from
totally different sources, we may seem to be adding
apples to oranges in a way that will inevitably
complicate the evaluation of the validity. Yet making
pass or fail decisions is something that again
should be evaluated at the level of the programme.
We think that this too will require professional
judgement. We should move away from the
1-competence–1-method approach to assessment.
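The accumulation of decision errors can be illustrated with simple arithmetic. Suppose, purely as an assumption, that a programme consists of several separate pass/fail hurdles which must all be passed, that a competent student has the same small probability e of being wrongly failed at each hurdle, and that these errors are independent. The probability of at least one wrong fail across n hurdles is then 1 - (1 - e)^n. The function name and the error rate below are hypothetical.

```python
def prob_any_wrong_fail(error_rate, n_decisions):
    """Chance that a competent student is wrongly failed at least once.

    Assumes independent decisions with an identical false-fail rate and a
    rule under which every hurdle must be passed.
    """
    return 1 - (1 - error_rate) ** n_decisions


# Hypothetical 5% false-fail rate per decision.
for n in (1, 5, 10):
    print(n, round(prob_any_wrong_fail(0.05, n), 2))
```

Under these assumptions a per-decision error rate that looks acceptable in isolation grows to roughly 40% over ten decisions, which is one reason to judge pass or fail decisions at the level of the programme and not method by method.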
A good assessment programme will incorporate
several competency elements and multiple sources of
information to evaluate those competencies on mul-
tiple occasions using credible standards. The infor-
mation obtained will have to be aggregated into a
final (promotion) decision. When all sources point in
the same direction, the information is consistent and
the decision is relatively straightforward. With conflicting
information, decision making is more problematic
and a defensible judgement will require
further steps: obtaining more information,
adding more decision makers, making a conditional
promotion decision or postponing the
decision. Such a decision-making procedure bears far
greater resemblance to a qualitative approach that
continues to accumulate information until saturation
is reached and a decision becomes trustworthy and
A good assessment programme should
have a feedback mechanism in place to ensure that
any final decision does not come as a total surprise to
the learner. If the latter is the case, it is indicative of a
failure of the feedback mechanism somewhere along
the way, which should be remedied.
In a programmatic, instructional design approach
to assessment, ‘simple’ psychometric evaluation will
not suffice. We should probably start with more
and proper descriptions of such assessment pro-
grammes. To our knowledge, there are only a few
good examples of these,
and we definitely
need many more. These programme specifications
should motivate the choices that are made and
provide sufficient contextual information. From
these descriptions, commonalities could then be
inferred. What constitutes a good programme, what
factors contribute to it and what are the pitfalls?
How is information combined and how are deci-
sions made? Further empirical research could
investigate whether the intended programme does
or does not work in practice. Are the intended
learning effects actually being achieved? How do
the stakeholders perceive the programme? This
kind of research will be less psychometrically
oriented (although there are some good examples)
and will probably bear more resemblance
to curriculum research.
It is our opinion that the assessment literature is
overly oriented towards the individual assessment
method and too preoccupied with exclusively psy-
chometric issues. We advocate the perspective that
any method can have utility, depending on its usage
and the programmatic context. There are no inher-
ently bad or good assessment methods. They are all
relative. What really matters is that the assessment
programme should be an integrated part of the
curriculum and this should be the main focus of our
attention and efforts. The crucial question concerns
the utility of the assessment programme as a whole.
Contributors: both authors contributed to the views
expressed in this article. CvdV wrote the paper, assisted by
suggestions from LS.
Acknowledgements: none.
Funding: none.
Conflicts of interest: none.
Ethical approval: not applicable.
