Using automatic item generation to create multiple-choice test items
Mark J Gierl,1,2 Hollis Lai2 & Simon R Turner1,2
CONTEXT Many tests of medical knowledge,
from the undergraduate level to the level of
certification and licensure, contain multiple-
choice items. Although these are efficient in
measuring examinees’ knowledge and skills
across diverse content areas, multiple-choice
items are time-consuming and expensive to
create. Changes in student assessment brought
about by new forms of computer-based testing
have created the demand for large numbers of
multiple-choice items. Our current approaches
to item development cannot meet this demand.
METHODS We present a methodology for
developing multiple-choice items based on
automatic item generation (AIG) concepts and
procedures. We describe a three-stage approach
to AIG and we illustrate this approach by gen-
erating multiple-choice items for a medical
licensure test in the content area of surgery.
RESULTS To generate multiple-choice items,
our method requires a three-stage process.
Firstly, a cognitive model is created by content
specialists. Secondly, item models are devel-
oped using the content from the cognitive
model. Thirdly, items are generated from the
item models using computer software. Using
this methodology, we generated 1248 multiple-
choice items from one item model.
CONCLUSIONS Automatic item generation is
a process that involves using models to generate
items using computer technology. With our
method, content specialists identify and struc-
ture the content for the test items, and com-
puter technology systematically combines the
content to generate new test items. By com-
bining these outcomes, items can be generated
automatically.
assessment
Medical Education 2012: 46: 757–765
doi:10.1111/j.1365-2923.2012.04289.x
Discuss ideas arising from this article at
www.mededuc.com ‘discuss’
1 Department of Surgery, University of Alberta, Edmonton, Alberta, Canada
2 Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton, Alberta, Canada
Correspondence: Dr Mark J Gierl, Professor of Educational
Psychology, Canada Research Chair in Educational Measurement,
Centre for Research in Applied Measurement and Evaluation,
6-110 Education North, University of Alberta, Edmonton, Alberta
T6G 2G5, Canada. Tel: 00 1 780 492 2396; Fax: 00 1 780 492 1318;
E-mail: mark.gierl@ualberta.ca
INTRODUCTION
Multiple-choice items provide the foundation for
many assessments used in medical education. From
small-scale classroom assessments to large-scale
licensure examinations, the multiple-choice item
format is used extensively to measure examinees’
knowledge and skills across diverse medical content
areas and domains of specialisation. Clear guidelines
exist for creating multiple-choice items.1–3 Content
specialists draw on these guidelines to develop items
because they represent well-established standards in
practice (e.g. they include a central idea in the stem;
they have one correct answer). Content specialists
must also draw on their experience, expertise and
judgement to create each new multiple-choice item.1
However, despite the availability of item-writing
guidelines and the willingness of content specialists,
creating multiple-choice items for medical tests is still
a challenging task. One aspect of the challenge stems
from the need to develop items across diverse and
often specialised content areas. Another aspect is
rooted in the complexity of the development task
itself, in which cognitive problem-solving skill and
content knowledge must be expressed by the content
specialist in the multiple-choice item format, and
then replicated over and over again to produce new
items.
Addressing the challenges posed by multiple-choice
item development is feasible for small-scale applica-
tions, like classroom assessments, in which only a
small number of items are required (e.g. a 50-item,
end-of-unit test). However, this task becomes
prohibitive in the context of a large-scale examina-
tion in which hundreds or even thousands of items
are needed. For instance, many medical licensure
examinations are now administered using computer-
ised adaptive testing systems. A computerised adap-
tive test (CAT) is a paperless examination that
implements a prescribed method of selecting and
administering items, scoring examinees’ responses,
and estimating examinees’ ability. The adaptive
process of selecting new items based on examinees’
responses to previously administered items is contin-
ued until a stopping rule is satisfied. The benefits of
the CAT are well documented.4,5 For instance, a CAT
permits testing on demand, thereby allowing exam-
inations to be administered continually throughout
the year. A CAT can be administered via the Internet
and therefore at multiple sites simultaneously. A CAT
also decreases overall testing time by administering
items appropriate to each examinee’s ability level,
thereby shortening the examination without losing
measurement precision compared with a
paper-and-pencil test. However, this form of testing
has one important cost: CATs require large banks
containing thousands of multiple-choice items. The
use of large banks permits continuous testing while
minimising item exposure so that test security can be
maintained. Breithaupt et al.6 estimated, for example,
that the number of items required for a high-stakes
licensure examination using a relatively small 40-item
CAT with two administrations per year was, at
minimum, 2000.
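Because the selection-and-update cycle is the core of any CAT, a minimal sketch may help readers unfamiliar with the mechanics. The Python fragment below is purely illustrative and is not the algorithm used by any specific licensure examination: it assumes a Rasch-type bank of known item difficulties, a fixed test length as the stopping rule and a deliberately crude ability update, all of which are simplifications made for the sketch.

```python
import math
import random

def simulate_cat(bank, true_theta, test_length=40):
    """Minimal CAT loop: pick the unused item whose difficulty is closest to the
    current ability estimate, score a simulated response, and update the estimate.
    'bank' is a list of item difficulties on a logit scale (an assumption of this sketch)."""
    theta = 0.0                                  # provisional ability estimate
    unused = list(range(len(bank)))
    administered = []
    for step in range(1, test_length + 1):
        # Adaptive selection: for a Rasch item, information peaks where difficulty = theta.
        item = min(unused, key=lambda i: abs(bank[i] - theta))
        unused.remove(item)
        administered.append(item)
        # Simulate the examinee's response under the Rasch model.
        p_correct = 1.0 / (1.0 + math.exp(-(true_theta - bank[item])))
        correct = random.random() < p_correct
        # Crude shrinking-step update of the ability estimate (illustrative only).
        theta += (1.0 if correct else -1.0) / step
    return theta, administered

if __name__ == "__main__":
    random.seed(1)
    item_bank = [random.gauss(0.0, 1.0) for _ in range(2000)]  # a bank of ~2000 items, as in the text
    estimate, used = simulate_cat(item_bank, true_theta=0.5)
    print(f"Ability estimate after {len(used)} items: {estimate:.2f}")
```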
To create the items needed for a medical licensure
examination, extensive multiple-choice item devel-
opment is required. Item development is a time-
consuming process because each item is individually
written, reviewed and revised by a content specialist.
It is also an expensive process. Rudner7 claimed that
the cost of developing a single item for a high-stakes
licensure examination ranged from US$1500 to
US$2000. Given this estimate, it is easy to see how the
costs of item development quickly escalate. If we
combine the Breithaupt et al.6 item bank size estimate with
the Rudner7 cost per item estimate, we find that the
bank for a 40-item CAT licensure examination
requires an expenditure of US$3 000 000–4 000 000.
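As a worked check, this figure follows directly from multiplying the two cited estimates:

$$2000 \times \mathrm{US}\$1500 = \mathrm{US}\$3\,000\,000, \qquad 2000 \times \mathrm{US}\$2000 = \mathrm{US}\$4\,000\,000.$$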
In short, medical testing agencies that adopt new and
innovative computer-based testing methods, of which
CAT is just one example, are faced with the formi-
dable task of creating thousands of new and expen-
sive multiple-choice items. One way to develop these
items is to hire more content specialists. Although
this approach is feasible, it adds to the cost of an
already expensive item development process. An
alternative approach to developing multiple-choice
items for medical examinations is automatic item
generation (AIG).8–11
Automatic item generation is a process of using
models to generate items using computer technology.
Unlike the current approach to item development, in
which content specialists create each item individu-
ally, AIG promotes a generative process where an
item model, which is a prototypical representation of
a multiple-choice test item or question, is used.12–14
With this model, the item is decomposed into
elements that are varied during the generation
process to produce new items. Because the elements
in the item model are the only variables manipulated,
content specialists have the role of defining those
elements that yield feasible items, and computer
technology has the role of systematically combining
the elements to produce new items. Much like the
current approach to multiple-choice item develop-
ment, AIG items must adhere to the highest stan-
dards of quality through the use of rigorous
guidelines and item development practices. Items
developed using AIG are also created to measure
examinees’ knowledge and problem-solving skills in
specific and diverse medical content areas. However,
AIG has the added benefit of scalability, whereby large
numbers of multiple-choice items can be produced
very quickly compared with current item develop-
ment practices. The purpose of this study is to
present a methodology for AIG unique to medical
education testing that can be used to generate large
numbers of multiple-choice items. Although our
illustration is in the context of a medical licensure
test for the content area of surgery, it should be noted
that our method can be applied to any medical
testing situation in which large numbers of items are
required or desired.
METHODS AND RESULTS
To generate multiple-choice items for medical tests, a
three-stage process is required. In the first stage, a
cognitive model structure is created. In the second
stage, the item models are developed using the
content from the cognitive model. In the third stage,
items are generated from the item models using
computer software. We will describe and illustrate
each stage.
Stage 1: creating a cognitive model structure
The content included in a medical examination is
identified by specialists who balance the demands of
the medical profession with the knowledge and skills
students should acquire during their medical school
training.2,3
A wealth of literature now exists on how
medical knowledge can be conceptualised. Norman et al.15
summarised these conceptual frameworks as
causal (basic mechanisms of medicine), analytic
(relationships of specific symptoms and features with
specific conditions) and experiential (prior case
experiences). The application of this knowledge is an
important area of study. Decision trees,16 clinical
problem analysis,17,18 and clinical reasoning education19
are examples of methods that can be used to
study how knowledge is applied in practice to make
medical diagnoses.
Just as frameworks are needed to study the structure
and application of medical knowledge, frameworks
are required to generate medical test items. In the
current study, we present a framework for structuring
the knowledge, skills and content required to make a
medical diagnosis. The knowledge, skills and content
are used, in turn, to generate items. We call our
framework a cognitive model structure for AIG because it
highlights and helps organise the knowledge, skills
and content required to make a medical diagnosis.
The model also organises this cognitive- and content-
specific information into a coherent whole, thereby
presenting a succinct yet structured organisation of
the content relationships and sources of information
used in formulating medical diagnoses (Fig. 1).
Two content specialists, both of whom were experi-
enced medical examination item writers and practis-
ing surgeons, were asked to describe the knowledge
and clinical reasoning skills required to solve items
on a medical licensure examination in surgery. The
knowledge and skills were identified in an inductive
manner using a verbal reporting method, meaning
that the content specialists were given an existing
multiple-choice item and asked to identify and
describe the key information that would be used to
solve the item. To produce this information, they
solved the item together by verbally articulating the
knowledge, skills and content they would draw upon
to generate the correct solution. The content spe-
cialists began by identifying the problem (i.e. post-
operative fever) specific to the existing test item. They
also identified different scenarios related to that
problem (i.e. urinary tract infection, atelectasis,
wound infection, pneumonia, deep vein thrombosis,
deep space infection). These outcomes are presented
in the top panel in Fig. 1.
Using a set of related scenarios that hold the
underlying problem in common, the content
specialists then identified sources of information
required to diagnose the scenarios. Again, this
information was identified by verbally describing
the type of information needed to solve the item.
Sources of information were deemed by the content
specialists to be either generic (i.e. patient demo-
graphics, physical examination) or case-specific
(i.e. timing of fever, type of surgery). These outcomes
are presented in the middle panel in Fig. 1.
Finally, the content specialists identified features
within each information source as they verbalised their
solution to the item (e.g. guarding and rebound, fever,
calf tenderness, abdominal examination, and exami-
nation of the wound are features of physical examina-
tion). Each feature, in turn, contains two nested
components. The first nested component for a feature
is the element. Elements contain content specific to
each feature that can be manipulated for item gener-
ation. For the Guarding and Rebound feature of the
Physical Examination source of information, the values
Yes and No serve as the element. The
second nested component for a feature is the con-
straint. Each element is constrained by the scenarios
specific to this problem. For instance, Deep Space
Infection (DSI) is associated with Guarding and
Rebound (i.e. DSI: Present, for the Guarding and
Rebound feature). These outcomes are presented in
the bottom panel in Fig. 1.
In its entirety, Fig. 1 serves as a cognitive model
structure for AIG. It provides a structure articulated
by content specialists using a verbal reporting
method for identifying the contextual relationships
between domain-specific content and how associated
information is assembled to make a medical diagno-
sis for complications related to postoperative fever.
The modelling approach in our example can be
generalised to other medical diagnoses for the
purpose of generating items. A generic cognitive
model structure for AIG is presented in Fig. 2.
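Although the authors do not prescribe a storage format, the three-panel structure in Figs 1 and 2 (problem and scenarios; sources of information; features with their elements and constraints) maps naturally onto nested data. The Python fragment below is a partial, illustrative encoding of the Postoperative Fever model; the field names and the calf tenderness constraint are assumptions made for the sketch, based on examples given in the text, not a transcription of Fig. 1.

```python
# A fragment of the Postoperative Fever cognitive model expressed as plain data.
# Top panel: the problem and its related scenarios.
# Middle panel: generic and case-specific sources of information.
# Bottom panel: features, each with elements (values that can be manipulated)
# and constraints (which scenario a given element value is associated with).
cognitive_model = {
    "problem": "postoperative fever",
    "scenarios": [
        "urinary tract infection", "atelectasis", "wound infection",
        "pneumonia", "deep vein thrombosis", "deep space infection",
    ],
    "sources_of_information": {
        "Patient Demographic": {                       # generic
            "Age": {"elements": list(range(40, 66)), "constraints": {}},
            "Gender": {"elements": ["Male", "Female"], "constraints": {}},
        },
        "Timing of Fever": {                           # case-specific
            "Postoperative day": {"elements": list(range(1, 7)), "constraints": {}},
        },
        "Physical Examination": {                      # generic
            "Guarding and Rebound": {
                "elements": ["Yes", "No"],
                # Stated in the text: deep space infection is associated with
                # guarding and rebound being present.
                "constraints": {"deep space infection": "Yes"},
            },
            "Calf Tenderness": {
                "elements": ["Yes", "No"],
                # Illustrative assumption only, not taken from Fig. 1.
                "constraints": {"deep vein thrombosis": "Yes"},
            },
        },
    },
}
```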
Stage 2: creating item models
In the next stage of our method, an item model is
created so that the content in the cognitive model
structure can be cast as a multiple-choice item. Recall
that an item model is a prototypical representation of a
test item that guides the generation process. The item
model includes all of the information necessary to
generate a set of unique items. A sample item model
is presented in Fig. 3. The item model contains the
stem, which specifies the context, content, item
and/or question the examinee is required to answer.
The stem also highlights the sources of information
that will be manipulated for item generation, as well
as the location of those sources in the model itself.
For example, in Fig. 3, the sources of information
include Patient Demographic, Timing of Fever, Type
of Surgery and Physical Examination. These four
sources serve as the variables in the item model.
Figure 1 The cognitive model structure used to generate Postoperative Fever test items
Next, the elements in the item model are specified.
These are the variables that will be manipulated to
generate items in the Postoperative Fever example.
The elements are extracted directly from the cogni-
tive model structure described in Stage 1. In our
example, these elements are the sources of informa-
tion that direct which features (i.e. elements, subject
to the constraints) will be inserted into the item
model to create new test items. For example, the
elements Very Likely, Less Likely and Unlikely for the
features Gastrectomy, Left Hemicolectomy,
Laparoscopic Cholecystectomy, Right Hemicolecto-
my and Appendectomy will be inserted in the Type of
Surgery source of information to create new items.
The item model also contains the options. The
options are the components of the model that specify
the correct alternative and one or more incorrect
alternatives or distracters. For the Postoperative Fever
example, the possible treatment options are Antibi-
otics, Mobilise, Reopen Wound, Anticoagulation, and
Percutaneous Drainage.
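Before turning to generation, it may help to see how such an item model can be written down in machine-readable form. The sketch below expresses the stem of Fig. 3 as a template whose bracketed sources of information become substitution variables; the helper function and the particular element values listed are illustrative choices and cover only a subset of the model.

```python
from string import Template

# The stem from the item model (Fig. 3), with the four sources of information as variables.
STEM = Template(
    "A $patient_demographic was readmitted to hospital for pain in the abdominal area. "
    "The patient was on postoperative day $timing_of_fever after recovering from a "
    "$type_of_surgery. The patient has a temperature of 38.5 \u00b0C. Physical examination "
    "reveals $physical_examination. Which one of the following is the best next step "
    "for this patient?"
)

# Elements: permissible values for each variable (subset shown for brevity).
ELEMENTS = {
    "patient_demographic": ["46-year-old man", "54-year-old woman", "62-year-old man"],
    "timing_of_fever": ["1", "3", "4", "6"],
    "type_of_surgery": ["appendectomy", "laparoscopic cholecystectomy", "gastrectomy"],
    "physical_examination": [
        "tenderness in the abdominal region with guarding and rebound",
        "a red and tender wound and calf tenderness",
        "no other findings",
    ],
}

# Options: the pool from which the correct answer and distracters are drawn.
OPTIONS = ["Antibiotics", "Mobilise", "Reopen wound", "Anticoagulation", "Percutaneous drainage"]

def render_item(values: dict) -> str:
    """Substitute one element value per variable into the stem."""
    return STEM.substitute(values)

if __name__ == "__main__":
    print(render_item({
        "patient_demographic": "54-year-old woman",
        "timing_of_fever": "3",
        "type_of_surgery": "laparoscopic cholecystectomy",
        "physical_examination": "a red and tender wound and calf tenderness",
    }))
```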
Figure 2 A general cognitive model structure for automatic item generation
Stage 3: item generation
Finally, to generate items from the Stage 1 and 2 models, software is used to assemble
all permissible combinations of elements, subject to their constraints, to create new
items. One piece of software designed for this purpose is called Item GeneratOR or
IGOR.20 IGOR is a Java-based program that will run on any desktop computer. IGOR is
research software available from the first author, by request. IGOR is an ‘iterator’,
meaning that it contains algorithms for combining elements, subject to the logical
constraints in the item model, to generate items. Given item model input, IGOR first
generates a set of all possible items. Then, IGOR checks the constraints for each item
and removes all illogical element combinations. The outcome is a set of generated
items and their corresponding answer keys. The generated items are saved in HTML or
in a Word document format in a database. Figure 4 provides a summary of the outcomes
of Stages 1–3 that are required for item generation.
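IGOR itself is distributed separately, so the sketch below only illustrates the ‘iterator’ idea in the description above: enumerate every combination of element values, discard combinations that violate a constraint, and attach an answer key. The constraint rule and the keying rules are simplified, hypothetical stand-ins, not the logic that produced the 1248 items reported next.

```python
from itertools import product

# Abbreviated element lists (see the item model sketch above).
ELEMENTS = {
    "timing_of_fever": [1, 2, 3, 4, 5, 6],
    "type_of_surgery": ["appendectomy", "laparoscopic cholecystectomy", "gastrectomy"],
    "physical_examination": [
        "guarding and rebound",
        "red and tender wound and calf tenderness",
        "no other findings",
    ],
}

def violates_constraints(item: dict) -> bool:
    """Illustrative constraint check: drop clinically illogical combinations.
    Example rule (an assumption): wound findings are implausible on postoperative day 1."""
    return item["timing_of_fever"] == 1 and "wound" in item["physical_examination"]

def answer_key(item: dict) -> str:
    """Simplified keying rules standing in for the cognitive model's scenario logic."""
    if "guarding and rebound" in item["physical_examination"]:
        return "Percutaneous drainage"   # deep space infection pathway (illustrative)
    if "calf tenderness" in item["physical_examination"]:
        return "Anticoagulation"         # deep vein thrombosis pathway (illustrative)
    return "Mobilise"                    # atelectasis pathway (illustrative)

def generate_items():
    """Enumerate the full cross-product of elements, then filter by the constraints."""
    names = list(ELEMENTS)
    for values in product(*(ELEMENTS[name] for name in names)):
        item = dict(zip(names, values))
        if violates_constraints(item):
            continue
        yield item, answer_key(item)

if __name__ == "__main__":
    generated = list(generate_items())
    print(f"{len(generated)} items generated after constraint filtering")
```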
Using the AIG process presented in Fig. 4, IGOR generated 1248 different
multiple-choice items for diagnosing complications related to postoperative fever.
Table 1 contains a random sample of four generated
items, in which the correct answer is marked with an
asterisk. The 1248 items generated in this example
required 6 hours to produce across our three-stage
method: Stage 1 (3 hours), Stage 2 (2 hours) and
Stage 3 (1 hour).
DISCUSSION
The number of assessment points in the typical
experience of a medical student is large. Often, these
assessments include multiple-choice items. This item
format is used extensively in medical education to
evaluate examinees’ knowledge and skills across
diverse content areas in an efficient and economical
manner. However, multiple-choice items are chal-
lenging to develop. They require content specialists
to implicitly structure the knowledge, skills and
content required to solve a medical problem and to
express it in the form of a test item. They must also
use their interpretation of professional guidelines, as
well as their experience, expertise and judgement, to
maintain the quality of that item. Because interpre-
tation, experience and judgement play such impor-
tant roles, multiple-choice item development has
sometimes been characterised as an ‘art’.21,22
Stem
A [Patient Demographic] was readmitted to hospital for pain in the abdominal area. He was on postoperative day [Timing of Fever] after recovering from a [Type of Surgery]. The patient has a temperature of 38.5 °C. Physical examination reveals [Physical Examination]. Which one of the following is the best next step for this patient?
Elements
Patient Demographic: Age 40 to 65 years; Gender: Male or Female
Timing of Fever: 1 to 6 days
Type of Surgery: Gastrectomy, Right Hemicolectomy, Left Hemicolectomy, Appendectomy, Laparoscopic Cholecystectomy
Physical Examination: Red and Tender Wound, Guarding and Rebound, Abdominal Tenderness, Calf Tenderness
Options
Antibiotics, Mobilise, Reopen Wound, Anticoagulation, Percutaneous Drainage
Figure 3 An item model for generating Postoperative Fever test items

Figure 4 An illustration of the three-stage item generation process: create cognitive model structure → (extract elements and constraints) → create item model → (program elements and constraints) → generate items

The purpose of the current study is to offer a new methodology for developing content
for multiple-choice examinations using AIG concepts and procedures. Automatic item
generation is the process of creating models to generate items using computer technology.
technology. We described and illustrated a method for
AIG that yields large numbers of items for medical
licensure testing using the example Postoperative
Fever. We presented a three-stage process, in which the
content needed to generate items is first identified and
structured, the item models are then created and,
finally, items are generated using computer software.
From a single item model, 1248 unique items were
generated for our Postoperative Fever example.
Automatic item generation could, therefore, be
characterised as representing a shift away from the
‘art’ of item development, in which multiple-choice
items are created solely from expertise, experience
and judgement, toward a new ‘science’ of item
development. This new science is characterised by
combining the knowledge and skill of content
specialists with the combinatoric and algorithmic
power of the computer. However, it is important to
note that the characterisation of AIG as an item
development ‘science’ does not diminish the role of
content specialists. Instead, it helps to focus their
role on the important task of identifying, organising
and evaluating the content needed to develop the
stem and the options; the role of the content
specialist in AIG is critical to the creative task in a
manner that we typically associate with the art of test
development in that the content specialist identifies
the knowledge and skills required to solve medical
problems, casts the knowledge, skills and content
into a cognitive model, designs meaningful item
models, and organises the content required to
produce the stem as well as the plausible options.
This creative task relies on the judgement, expertise
and experience of content specialists. The role of
computer technology in AIG is central to the generative task we associate with the
science of modern computing: it systematically merges large numbers of elements in each item
model. By combining the outcomes of the content-
based creative task and the technology-based generative task, a new composite science
of AIG becomes possible.

Table 1 A set of multiple-choice items generated for measuring diagnoses of complications related to postoperative fever
1 A 34-year-old woman has an appendectomy. On postoperative day 6 she has a temperature of 38.5 °C. Physical examination reveals tenderness in the abdominal region with guarding and rebound. Which one of the following is the best next step?
(a) Mobilise
(b) Antibiotics
(c) Reopen the wound
(d) Percutaneous drainage*
2 A 46-year-old man is admitted to hospital for an appendectomy. On postoperative day 4 he has a temperature of 38.5 °C. Physical examination reveals tenderness in the abdominal region with guarding and rebound. Which one of the following is the best next step?
(a) Mobilise
(b) Anticoagulation
(c) Reopen the wound
(d) Percutaneous drainage*
3 A 54-year-old woman has a laparoscopic cholecystectomy. On postoperative day 3 she has a temperature of 38.5 °C. Physical examination reveals a red and tender wound and calf tenderness. Which one of the following is the best next step?
(a) Mobilise
(b) Antibiotics
(c) Anticoagulation*
(d) Reopen the wound
4 A 62-year-old man is admitted to hospital for a laparoscopic cholecystectomy. On postoperative day 1 he has a temperature of 38.5 °C. Physical examination reveals no other findings. Which one of the following is the best next step?
(a) Mobilise*
(b) Antibiotics
(c) Reopen the wound
(d) Percutaneous drainage
* Correct option
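To connect the generated combinations back to the records in Table 1, a final assembly step pairs each rendered stem with the keyed option and a sample of distracters. The sketch below is self-contained and illustrative; the distracter-sampling rule is an assumption of the sketch, and the example stem is transcribed from item 2 of Table 1.

```python
import random

def assemble_mcq(stem: str, options: list, key: str, n_distracters: int = 3) -> dict:
    """Pair an already-rendered stem with the keyed option and randomly sampled
    distracters, producing a record in the style of the items in Table 1."""
    distracters = random.sample([o for o in options if o != key], n_distracters)
    presented = distracters + [key]
    random.shuffle(presented)
    return {"stem": stem, "options": presented, "key": key}

if __name__ == "__main__":
    random.seed(0)
    option_pool = ["Antibiotics", "Mobilise", "Reopen the wound",
                   "Anticoagulation", "Percutaneous drainage"]
    stem = ("A 46-year-old man is admitted to hospital for an appendectomy. "
            "On postoperative day 4 he has a temperature of 38.5 \u00b0C. "
            "Physical examination reveals tenderness in the abdominal region with "
            "guarding and rebound. Which one of the following is the best next step?")
    item = assemble_mcq(stem, option_pool, key="Percutaneous drainage")
    print(item["stem"])
    for letter, option in zip("abcd", item["options"]):
        marker = "*" if option == item["key"] else ""
        print(f"({letter}) {option}{marker}")
```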
Limitations and directions for future research
Evidence to support the reliability and validity of test
items can be garnered in many ways. For example, the
outcomes from an item analysis (e.g. item difficulty,
item discrimination, inter-item correlations) provide
evidence about the psychometric characteristics of
items that can support their reliable and valid use. In
the current study, the development process itself was
the focal point for producing reliable and valid test
items. That is, content specialists combine their
expertise, experience and judgement with test devel-
opment guidelines and standards of practice to create
items that yield a reliable and valid probe of students’
medical knowledge and skills.2,3 Our AIG methodology
uses a process similar to the more traditional
approach, except that the knowledge, skills and
content used by specialists are now specified overtly
in the form of a cognitive model. This model can, and
should, be scrutinised carefully. In our study, two
content specialists verbalised their strategies for
solving a medical test item. The knowledge, skills and
content they identified were then organised, struc-
tured and used to create a cognitive model for AIG
(Fig. 1). This model highlights and coordinates the
cognitive- and content-specific information used in
formulating medical diagnoses.
However, the cognitive model must still be evaluated to
ensure that it provides an accurate account of the
diagnostic process required to identify a particular
problem (i.e. postoperative fever) in a specific content
area (i.e. surgery). The reliability and validity of the
generated items are therefore limited to the cognitive
model produced by our two content specialists because
our study focused only on the AIG methodology. No
attempt was made to validate their model or to deter-
mine its generalisability. However, before these items
are used on an operational test, a validation step, in
which empirical evidence is collected to ensure that the
model is accurate and generalisable, is essential. Two
methods that could be used to collect additional
evidence on the veracity and generalisability of the
cognitive model include protocol and verbal analysis
drawing on a larger sample of content specialists who
would solve a more diverse sample of test items.23
General theories of cognition and medical expertise,15
as they pertain to the knowledge structures and cogni-
tive processes used in medical diagnoses, could also be
used to evaluate our cognitive model, thereby providing
even stronger evidence to support the reliability and
validity of the generated items.
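As a concrete illustration of the item-analysis evidence mentioned at the start of this section, classical difficulty and discrimination indices can be computed directly from a scored response matrix once generated items have been administered. The sketch below is generic and uses a small hypothetical data set; it is not an analysis of the Postoperative Fever items, for which no response data were collected in this study.

```python
def item_analysis(responses):
    """Classical item statistics from a 0/1 scored response matrix
    (rows = examinees, columns = items).
    Difficulty: proportion of examinees answering the item correctly.
    Discrimination: correlation between the item score and the rest-of-test score."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        item_scores = [row[j] for row in responses]
        rest_scores = [sum(row) - row[j] for row in responses]   # total excluding item j
        difficulty = sum(item_scores) / n_examinees
        discrimination = _pearson(item_scores, rest_scores)
        stats.append({"item": j + 1, "difficulty": difficulty,
                      "discrimination": discrimination})
    return stats

def _pearson(x, y):
    """Pearson correlation (point-biserial when x is dichotomous)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

if __name__ == "__main__":
    # Hypothetical scored responses for five examinees on four items.
    scored = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
    ]
    for row in item_analysis(scored):
        print(row)
```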
To summarise, AIG serves as a new, technology-
enhanced method for item development that may
help to address one of the most pressing and
challenging assessment issues facing many medical
educators: the rapid and efficient production of large
numbers of high-quality, content-specific, medical
multiple-choice items.
Contributors: MJG and HL made substantial contributions
to the study conception and design, and to the acquisition,
analysis and interpretation of data, and drafted the article.
MJG, HL and SRT made substantial contributions to the
acquisition, analysis and interpretation of data, and to the
critical revision of the article. All authors approved the final
manuscript for publication.
Acknowledgements: the authors would like to thank the
Medical Council of Canada (MCC) for its support of this
research. However, the authors are wholly responsible for
the methods, procedures and interpretations expressed in
this paper, which do not necessarily reflect the views of the
MCC.
Funding: funding for this research was provided to the first
author by the Social Sciences and Humanities Research
Council of Canada, Ottawa, Ontario, Canada (430-2011-
0011).
Conflicts of interest: none.
Ethical approval: this study was approved by the Research
Ethics Board of the University of Alberta.
REFERENCES
1 Haladyna TM. Developing and Validating Multiple-Choice
Test Items, 3rd edn. Mahwah, NJ: Lawrence Erlbaum
2004.
2 Medical Council of Canada. Guidelines for the Develop-
ment of Multiple-Choice Questions. Ottawa, ON: MCC
2010.
3 Case S, Swanson D. Constructing Written Test Questions for
the Basic and Clinical Sciences. Philadelphia, PA: National
Board of Medical Examiners 2001.
4 Wainer H, ed. Computerized Adaptive Testing: A Primer,
2nd edn. Mahwah, NJ: Lawrence Erlbaum 2000.
5 van der Linden WJ, Glas CAW, eds. Elements of Adaptive
Testing. New York, NY: Springer 2010.
6 Breithaupt K, Ariel A, Hare D. Assembling an inventory
of multistage adaptive testing systems. In: van der
Linden WJ, Glas CAW, eds. Elements of Adaptive Testing.
New York, NY: Springer 2010;247–66.
7 Rudner L. Implementing the graduate management
admission test computerised adaptive test. In: van der
Linden WJ, Glas CAW, eds. Elements of Adaptive Testing.
New York, NY: Springer 2010;151–65.
8 Drasgow F, Luecht RM, Bennett R. Technology and
testing. In: Brennan RL, ed. Educational Measurement,
4th edn. Washington, DC: American Council on
Education 2006;471–516.
9 Irvine S, Kyllonen P. Item Generation for Test Development.
Mahwah, NJ: Lawrence Erlbaum 2002;1–411.
10 Embretson SE, Yang X. Automatic item generation and
cognitive psychology. In: Rao CR, Sinharay S, eds.
Handbook of Statistics: Psychometrics, Vol 26. Radarweg:
Elsevier 2007;747–68.
11 Gierl MJ, Haladyna TM. Automatic Item Generation:
Theory and Practice. New York, NY: Routledge 2012;
1–459.
12 Bejar II. Generative testing: from conception to
implementation. In: Irvine SH, Kyllonen PC, eds. Item
Generation for Test Development. Mahwah, NJ: Lawrence
Erlbaum 2002;199–217.
13 LaDuca A, Staples WI, Templeton B, Holzman GB.
Item modelling procedures for constructing content-
equivalent multiple-choice questions. Med Educ
1986;20:53–6.
14 Gierl MJ, Lai H. Using weak and strong theory to create
item models for automatic item generation: some
practical guidelines with examples. In: Gierl MJ,
Haladyna TM, eds. Automatic Item Generation: Theory and
Practice. New York, NY: Routledge 2012;47–63.
15 Norman G, Eva K, Brooks L, Hamstra S. Expertise in
medicine and surgery. In: Ericsson KA, Charness N,
Feltovich PJ, Hoffman RR, eds. The Cambridge Handbook
of Expertise and Expert Performance. Cambridge: Cam-
bridge University Press 2006;339–53.
16 Coderre S, Mandin H, Harasym P, Fick G. Diagnostic
reasoning strategies and diagnostic success. Med Educ
2003;37:695–703.
17 Custers E, Stuyt P, De Vries P. Clinical problem analy-
sis: a systematic approach to teaching complex medical
problem solving. Acad Med 2000;75:291–7.
18 Norman G, Brooks L, Colle C, Hatala R. The benefit of
diagnostic hypotheses in clinical reasoning: experi-
mental study of an instructional intervention for for-
ward and backward reasoning. Cognit Instruct
2000;17:433–48.
19 ten Cate T, Schade E. Combining system-based and
problem-based approaches in medical education. In:
de Graaf E, Bouhuijs PAJ, eds. Implementation of Problem-
Based Learning in Higher Education. Amsterdam: Kluwer
1993;145–62.
20 Gierl MJ, Zhou J, Alves C. Developing a taxonomy of
item model types to promote assessment engineering.
J Technol Learn Assess 2008;7:1–51.
21 Schmeiser CB, Welch CJ. Test development. In: Bren-
nan RL, ed. Educational Measurement, 4th edn. Westport,
CT: National Council on Measurement in Education,
American Council on Education 2006;307–53.
22 Downing SM, Haladyna TM. Handbook of Test Develop-
ment. Mahwah, NJ: Lawrence Erlbaum 2006;1–778.
23 Leighton JP, Gierl MJ. Verbal reports as data for cog-
nitive diagnostic assessment. In: Leighton JP, Gierl MJ,
eds. Cognitive Diagnostic Assessment for Education: Theory
and Applications. Cambridge: Cambridge University
Press 2007;146–72.
Received 8 December 2011; editorial comments to authors
7 February 2012; accepted for publication 11 March 2012