The Role of Item Models in Automatic Item Generation
Mark J. Gierl
Centre for Research in Applied Measurement and Evaluation
University of Alberta
Paper Presented at the Symposium
“Item Modeling and Item Generation for the Measurement of
Quantitative Skills: Recent Advances and Prospects”
Annual Meeting of the National Council on Measurement in Education
New Orleans, LA
Item Models 2
Randy Bennett (2001) claimed, a decade ago, that no topic would become more central to
innovation and future practice in educational assessment than computers and the internet. His
prediction has proven to be accurate. Educational assessment and computer technology have evolved
at a staggering pace since 2001. As a result, many educational assessments, which were once given in a
paper-and-pencil format, are now administered by computer using the internet. Education Week’s
2009 Technology Counts, for example, reported that 27 US states now administer internet-based
computerized educational assessments. Many popular and well-known exams in North America such
as the Graduate Management Admission Test (GMAT), the Graduate Record Examination (GRE), the Test of
English as a Foreign Language (TOEFL iBT), and the American Institute of Certified Public Accountants
Uniform CPA examination (CBT-e), to cite but a few examples, are administered by computer over the
internet. Canadian testing agencies are also implementing internet-based computerized assessments.
For example, the Medical Council of Canada Qualifying Exam Part I (MCCQE I), which is written by all
medical students seeking entry into supervised clinical practice, is administered by computer.
Provincial testing agencies in Canada are also making the transition to internet-based assessment.
Alberta Education, for instance, will introduce a computer-based assessment for elementary school
students in 2011 as part of its Diagnostic Mathematics Program.
Internet-based computerized assessment offers many advantages to students and educators
compared to more traditional paper-based assessments. For instance, computers support the
development of innovative item types and alternative item formats (Sireci & Zenisky, 2006; Zenisky &
Sireci, 2002); items on computer-based tests can be scored immediately thereby providing students
with instant feedback (Drasgow & Mattern, 2006); computers permit continuous testing and testing on-
demand for students (van der Linden & Glas, 2010). But possibly the most important advantage of
computer-based assessment is that it allows educators to measure more complex performances by
integrating test items and digital media to substantially increase the types of knowledge, skills, and
competencies that can be measured (Bartram, 2006; Zenisky & Sireci, 2002).
The advent of computer-based testing has also raised new challenges, particularly in the area of
item development (Downing & Haladyna, 2006; Schmeiser & Welch, 2006). Large numbers of items are
needed to develop the banks necessary for computerized testing because items are continuously
administered and, therefore, exposed. As a result, these banks must be frequently replenished to
minimize item exposure and maintain test security. Because testing agencies are now faced with the
daunting task of creating thousands of new items for computer-based assessments, alternative
methods of item development are desperately needed. One method that may be used to address this
challenge is automatic item generation (Drasgow, Luecht, & Bennett, 2006; Embretson & Yang,
2007; Irvine & Kyllonen, 2002). Automatic item generation represents a relatively new but rapidly
evolving research area where cognitive and psychometric theories are used to produce tests that
include items generated using computer technology. Automatic item generation requires two steps.
First, test development specialists develop item models, which are comparable to templates or
prototypes, that highlight the features or elements in the assessment task that must be manipulated.
Second, these item model elements are manipulated to generate new items with the aid of computer-
based algorithms. With this two-step process, hundreds or even thousands of new items can be created
from a single item model.
The purpose of our paper is to describe seven different but related topics that are central to the
development and use of item models for automatic item generation. We start by defining item model
and highlighting some related concepts; we describe how item models are developed; we present an
item model taxonomy; we illustrate how item models can be used for automatic item generation; we
outline some benefits of using item models; we introduce the idea of an item model bank; and finally,
we demonstrate how statistical procedures can be used to estimate the parameters of the generated
items without the need for extensive field or pilot testing. We begin by describing two general factors
that, we feel, will directly affect educational measurement—including emerging methods such as
automatic item generation—in the 21st century.
TWO FACTORS THAT WILL SHAPE EDUCATIONAL MEASUREMENT IN THE 21ST CENTURY
We assert that the first factor that will shape educational measurement in the 21st century is the growing
view that the science of educational assessment will prevail in guiding the design, development,
administration, scoring, and reporting practices in educational testing. In their seminal chapter on
“Technology and Testing” in the 4th Edition of the handbook Educational Measurement, Drasgow,
Luecht, and Bennett (2006, p. 471) begin with this bold claim:
This chapter describes our vision of a 21st-century testing program that capitalizes on modern
technology and takes advantage of recent innovations in testing. Using an analogy from
engineering, we envision a modern testing program as an integrated system of systems. Thus,
there is an item generation system, an item pretesting system, an examinee registration
system, and so forth. This chapter discusses each system and illustrates how technology can
enhance and facilitate the core processes of each system.
Drasgow et al. present a view of educational measurement where integrated technology-enhanced
systems govern and direct all testing processes. Ric Luecht has coined this technology-based approach
to educational measurement “assessment engineering” (Luecht, 2006a, 2006b, 2007, 2011).
Assessment engineering is an innovative approach to measurement practice where engineering-based
principles and technology-enhanced processes are used to direct the design and development of
assessments as well as the analysis, scoring, and reporting of assessment results. With this approach,
the measurement specialist begins by defining the construct of interest using specific, empirically-
derived cognitive models of task performance. Next, item models are created to produce replicable
assessment tasks. Finally, statistical models are applied to the examinee response data collected using
the item models to produce scores that are both replicable and interpretable.
The second factor that will likely shape educational measurement in the 21st century stems from the
fact that the boundaries for our discipline are becoming more porous. As a result, developments from
other disciplines such as cognitive science, mathematical statistics, medical education, educational
psychology, operations research, educational technology, and computing science will permeate and
influence educational testing. These interdisciplinary contributions will also create opportunities for
both theoretical and practical change. That is, educational measurement specialists will begin to draw
on interdisciplinary developments to enhance their own research and practice. At the same time,
students across a host of other disciplines will begin to study educational measurement1. These
interdisciplinary forces that promote new ideas and innovations will begin to evolve, perhaps slowly at
first, but then at a much faster pace leading to even more changes in our discipline. It may also mean
that other disciplines will begin to adopt our theories and practices more readily as students with
educational measurement training move back to their own content domains and areas of specialization.
ITEM MODELING: DEFINITION AND RELATED CONCEPTS
An item model (Bejar, 1996, 2002; Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2003;
LaDuca, Staples, Templeton, & Holzman, 1986)—which has also been described as a schema (Singley &
Bennett, 2002), blueprint (Embretson, 2002), template (Mislevy & Riconscente, 2006), form (Hively,
Patterson, & Page, 1968), clone (Glas & van der Linden, 2003), and shell (Haladyna & Shindoll, 1989)—
serves as an explicit representation of the variables in an assessment task, which includes the stem, the
options, and oftentimes auxiliary information (Gierl, Zhou, & Alves, 2008). The stem is the part of an
1 We have already noticed this change in our own program. We currently have 14 students in the Measurement,
Evaluation, and Cognition (MEC) graduate program at the University of Alberta. These students represent a
diverse disciplinary base, which includes education, cognitive psychology, engineering, computing science,
medicine (one of our students is a surgery resident), occupational therapy, nursing, forensic psychology, statistics,
item which formulates context, content, and/or the question the examinee is required to answer. The
options contain the alternative answers with one correct option and one or more incorrect options or
distracters. When dealing with a multiple-choice item model, both stem and options are required. With
an open-ended or constructed-response item model, only the stem is created. Auxiliary information
includes any additional material, in either the stem or option, required to generate an item, including
digital media such as text, images, tables, diagrams, sound, and/or video. The stem and options can be
divided further into elements. These elements are denoted as strings (S), which are non-numeric
values, and integers (I), which are numeric values. By systematically manipulating the elements, measurement
specialists can generate large numbers of items from one item model. If the generated items or
instances of the item model are intended to measure content at similar difficulty levels, then the
generated items are isomorphic. When the goal of item generation is to create isomorphic instances,
the measurement specialist manipulates the incidental elements, which are the surface features of an
item that do not alter item difficulty. Conversely, if the instances are intended to measure content at
different difficulty levels, then the generated items are variants. When the goal of item generation is to
create variant instances, the measurement specialist can manipulate the incidental elements, but must
manipulate one or more radical elements in the item model. The radicals are the deep features that
alter item difficulty, and may even affect test characteristics such as dimensionality.
To illustrate some of these concepts, an example from Grade 6 mathematics is presented in Figure 1.
The item model is represented as the stem and options variables with no auxiliary information. The
stem contains two integers (I1, I2). The I1 element is Ann's payment, which ranges from $1525 to
$1675 in increments of $75. The I2 element is the cost of the lawn, either $30/m2 or $45/m2. The
four alternatives, labelled A to D, are generated using algorithms based on the integer values I1
and I2 (including the correct option, which is A).
Figure 1. Simple item model in Grade 6 mathematics with two integer elements.
Ann has paid $1525 for planting her lawn. The cost of lawn is $45/m2. Given the shape of her lawn is
square, what is the side length of Ann’s lawn?
ITEM MODEL VARIABLES
Ann has paid $I1 for planting her lawn. The cost of lawn is $I2/m2. Given the shape of her lawn is
square, what is the side length of Ann’s lawn?
I1 Value Range: 1525-1675 by 75
I2 Value Range: 30 or 45
DEVELOPING ITEM MODELS
Test development specialists have the critical role of designing and developing the item models
used for automatic item generation. The principles, standards, and practices that guide traditional item
development (cf. Case & Swanson, 2002; Downing & Haladyna, 2006; Schmeiser & Welch, 2006) have
been recommended for use in item model development. Although a growing number of item model
examples are available in the literature (e.g., Bejar et al., 2003; Case & Swanson, 2002; Gierl et al.,
2008), there are currently no published studies describing either the principles or standards required to
develop these models. Drasgow et al. (2006) advise test development specialists to engage in the
creative task of developing item models by using design principles and guidelines discerned from a
combination of experience, theory, and research. Initially, these principles and guidelines are used to
identify a parent item model. One way to identify a parent item model is by using a cognitive theory of
task performance. Within this theory, cognitive models, as described by Luecht in his assessment
engineering framework, may be identified or discerned. With this type of “strong theory” approach,
cognitive features are identified in such detail that item features that predict test performance can be
not only specified but also controlled. The benefit of using strong theory to create item models is that
item difficulty for the generated items is predictable and, as a result, the generated items may be
calibrated without the need for extensive field or pilot testing because the factors that govern the item
difficulty level can be specified and, therefore, explicitly modeled and controlled. Unfortunately, few
cognitive theories currently exist to guide our item development practices (Leighton & Gierl, in press).
As a result, the use of strong theory for automatic item generation has, thus far, been limited to narrow
content domains, such as mental rotation (Bejar, 1990) and spatial ability (Embretson, 2002).
In the absence of strong theory, parent item models can be identified using “weak theory” by
reviewing items from previously administered exams or by drawing on an inventory of existing test
items in an attempt to identify an underlying structure. This structure, if identified, provides a point-of-
reference for creating alternative item models, where features in the alternative models can be
manipulated to generate new items. Test development specialists can also create their own unique
item models. The weak theory approach to developing parent models using previously administered
items, drawing on an inventory of existing items, or creating new models is well-suited to broad
content domains where few theoretical descriptions exist on the cognitive skills required to solve test
items (Drasgow et al., 2006). The main drawback of using weak theory to create item models is that
item difficulty for the generated items is unpredictable and, therefore, field or pilot testing may be required.
ITEM MODEL TAXONOMY
Gierl et al. (2008) described a taxonomy of item model types, as a way of offering guidelines for
creating item models. The taxonomy pertains to generating multiple-choice items and classifies models
based on the different types of elements used in the stems and options. The stem is the section of the
model used to formulate context, content, and/or questions. The elements in the stem can function in
four different ways. Independent indicates that the n_i element(s) (n_i ≥ 1) in the stem are unrelated to
one another. That is, a change in one element will have no effect on the other stem elements in the
item model. Dependent indicates that all n_d elements (n_d ≥ 2) in the stem are directly related to one
another. Mixed Independent/Dependent includes both independent (n_i ≥ 1) and dependent (n_d ≥ 1)
elements in the stem, where at least one pair of stem elements is directly related. Fixed represents a
constant stem format with no variation or change.
The options contain the alternatives for the item model. The elements in the options can function in
three different ways. Randomly-selected options mean that the distracters are drawn at random from
their corresponding content pools. Constrained
options mean that the keyed option and the distracters are generated according to specific constraints,
such as formulas, calculation, and/or context. Fixed options occur when both the keyed option and
distracters are invariant or unchanged in the item model.
By crossing the stem and options, a matrix of item model types can be produced (see Table 1). This
taxonomy is useful for creating item models because it provides the guiding principles necessary for
designing diverse models by outlining their structure, function, similarities, and differences. It can also
be used to help ensure that test development specialists do not design item models with exactly the
same elements. Ten functional combinations are designated with a checkmark, “√”. The two
remaining combinations are labelled not applicable, “NA”, because a model with a fixed stem and
constrained options is an infeasible item type and a model with a fixed stem and fixed options produces
a single multiple-choice item type (i.e., a traditional multiple-choice item). Gierl et al. also presented 20
examples (i.e., two examples for each of the 10 cells in the item model taxonomy) to illustrate each
unique combination. Their examples were drawn from diverse content areas, including science, social
studies, mathematics, language arts, and architecture.
Table 1. Plausible Stem-by-Option Combinations in the Gierl et al. (2008) Item Model Taxonomy
                    Independent   Dependent   Mixed   Fixed
Randomly Selected        √            √         √       √
Constrained              √            √         √      NA
Fixed                    √            √         √      NA
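The feasible cells in Table 1 can be expressed as a small cross-product check. The sketch below is our own illustration, not part of the Gierl et al. (2008) taxonomy; the category names are shortened labels.

```python
from itertools import product

# Stem and option types from the Gierl et al. (2008) taxonomy.
STEM_TYPES = ["independent", "dependent", "mixed", "fixed"]
OPTION_TYPES = ["randomly_selected", "constrained", "fixed"]

# The two infeasible cells: a fixed stem with constrained options,
# and a fully fixed model (which is just a single traditional item).
NOT_APPLICABLE = {("fixed", "constrained"), ("fixed", "fixed")}

def feasible_model_types():
    """Return the stem-by-option combinations that yield functional item models."""
    return [(stem, option) for stem, option in product(STEM_TYPES, OPTION_TYPES)
            if (stem, option) not in NOT_APPLICABLE]

print(len(feasible_model_types()))  # 10, matching the checkmarks in Table 1
```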
USING ITEM MODELS TO AUTOMATICALLY GENERATE ITEMS
Once the item models are developed by the test development specialists, automatic item
generation can begin. Automatic item generation is the process of using item models to generate test
items with the aid of computer technology. The role of the test development specialist is critical for the
creative task of designing and developing meaningful item models. The role of computer technology is
critical for the generative task of systematically combining large numbers of elements in each model to
produce items. By combining content expertise and computer technology, item modeling can be used
to generate items. If we return to the simple math example in Figure 1, the generative process can be
illustrated. Recall, the stem in this example contains two integers (I1, I2). The generative task for this
example involves generating six items with the following I1, I2 combinations: I1=$1525 and I2=30/m2;
I1=$1600 and I2=30/m2; I1=$1675 and I2=30/m2; I1=$1525 and I2=45/m2; I1=$1600 and I2=45/m2;
I1=$1675 and I2=45/m2.
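The enumeration above is simply a cross-product of the two element value ranges. A minimal sketch, with the template text taken from Figure 1 and the function name our own:

```python
from itertools import product

# Stem template from Figure 1; I1 and I2 are the integer elements.
STEM = ("Ann has paid ${I1} for planting her lawn. The cost of lawn is "
        "${I2}/m2. Given the shape of her lawn is square, what is the "
        "side length of Ann's lawn?")

I1_VALUES = list(range(1525, 1676, 75))  # 1525, 1600, 1675
I2_VALUES = [30, 45]

def generate_items():
    """Instantiate the stem once for every I1-by-I2 combination."""
    return [STEM.format(I1=i1, I2=i2) for i1, i2 in product(I1_VALUES, I2_VALUES)]

print(len(generate_items()))  # 6 generated items
```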
Gierl et al. (2008, pp. 25-31) also created a software tool that automatically creates, saves, and
stores items. The software is called IGOR (which stands for Item GeneratOR). It was written in Sun
Microsystems Java SE 6.0. The purpose of IGOR is to generate multiple items from a single item
model. The user interface for IGOR is structured using the same sections as the example in Figure 1
(i.e., stem, elements, options). The Item Model Editor window is used to enter and structure each item
model (see Figure 2a). The editor has three components. The stem panel is the starting point for item
generation where the item prompt is specified. Next, the elements panel is used to identify the string
and integer variables as well as specify the constraints required among the elements for successful item
generation. The options panel is used to specify possible answers to a given test item. The options are
classified as either a key or distracter. The Elements and Options panels also contain three editing
buttons: the first adds a new element or option to its panel, the second opens a window to edit the
currently selected element or option, and the third removes the currently selected element or option
from the model. To generate items from a model, the Test Item
Generator dialogue box is presented where the user specifies the item model file, the item bank output
file, and the answer key file. If the option ‘Create answer key’ is not selected, then the resulting test
bank will always display the correct answer as the last option (or alternative). If the option ‘Create
answer key’ is selected, then the resulting test bank will randomly order the options. Once the files
have been specified in the Test Item Generator dialogue box, the program can be executed by selecting
the ‘Generate’ button (see Figure 2b).
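The two output modes for the answer key can be sketched as follows; this is our illustration of the behaviour described above, not IGOR's actual code, and the function name is hypothetical.

```python
import random

def assemble_options(key, distracters, create_answer_key=False, rng=None):
    """Order an item's options as described for IGOR's test bank output.

    Without an answer key, the correct option is always placed last; with
    one, the options are randomly ordered and the key's letter is recorded
    for the answer key file.
    """
    rng = rng or random.Random()
    if not create_answer_key:
        return distracters + [key], None
    options = distracters + [key]
    rng.shuffle(options)
    return options, "ABCD"[options.index(key)]

options, letter = assemble_options("6 m", ["4 m", "5 m", "7 m"])
print(options[-1], letter)  # the key appears last and no letter is recorded
```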
Figure 2. IGOR interface illustrating the (a.) input panels and editing functions as well as the (b.) Test Item Generator dialogue box.
Preliminary research has been conducted with IGOR. Gierl et al., working with two mathematics
test development specialists, developed 10 mathematics item models. IGOR generated 331,371 unique
items from the 10 item models. That is, each model produced, on average, 33,137 items, thereby
providing an initial demonstration of the practicality and feasibility of item generation using IGOR.
BENEFITS OF ITEM MODELING
Item modeling can enhance educational assessment in many ways. The purpose of item modeling is
to create a single model that yields many test items. Multiple models can then be developed which will
yield hundreds or thousands of new test items. These items, in turn, are used to generate item banks.
Computerized assessments or automatic test assembly algorithms then draw on a sample of the items
from the bank to create a new test. With this approach, item exposure through test administration is
minimized, even with continuous testing, because a large bank of operational items is available. Item
modeling can also lead to more cost-effective item development because the model is continually
re-used to yield many test items compared with developing each item for a test from scratch. Moreover,
costly, yet common, errors in item development—including omissions or additions of words, phrases,
or expressions as well as spelling, punctuation, capitalization, item structure, typeface, and formatting
problems—can be avoided because only specific elements in the stem and options are manipulated
across large numbers of items (Schmeiser & Welch, 2006). In other words, the item model serves as a
template or prototype where test development specialists manipulate only specific, well-defined,
elements. The remaining components in the template or prototype are not altered. The view of an
item model as a template or prototype with both fixed and variable elements contrasts with the more
conventional view of a single item where every element is unique, both within and across items.
Drasgow et al. (2006) explain:
The demand for large numbers of items is challenging to satisfy because the traditional
approach to test development uses the item as the fundamental unit of currency. That is, each
item is individually hand-crafted—written, reviewed, revised, edited, entered into a computer,
and calibrated—as if no other like it had ever been created before.
But possibly the most important benefit of item modeling stems from the logic of this approach to
test development. With item modeling, the model is treated as the fundamental unit of analysis where
a single model is used to generate many items compared with a more traditional approach where the
item is treated as the unit of analysis (Drasgow et al. 2006). Hence, with item modeling, the cost per
item is lower because the unit of analysis is multiple instances per model rather than single instances
per test development specialist. As a result, large numbers of items can be generated from a single item
model rather than relying on each test development specialist to develop a large number of unique
items. The item models can also be re-used, particularly when only a small number of the generated
items are used on a particular test form.
ITEM MODEL BANK
Current practices in test development and analysis are grounded in the test item. That is, each item is
individually written, reviewed, revised, edited, banked, and calibrated. If, for instance, a developer
intends to have 1236 operational test items in her bank, then she has 1236 unique items that must be
created, edited, reviewed, field tested, and, possibly, revised. An item bank serves as an electronic
repository for maintaining and managing information on each item. The maintenance task focuses on
item-level information. For example, the format of the item must be coded. Item formats and item
types can include multiple choice, numeric response, written response, linked items, passage-based
items, and items containing multimedia. The content for the item must be coded. Content fields
include general learning outcomes, blueprint categories, item identification number, item response
format, type of directions required, links, field test number, date, source of item, item sets, and
copyright. The developer attributes must be coded. These attributes include year the item was written,
item writer name, item writer demographics, editor information, development status, and review
status. The statistical characteristics for the item must also be coded. Statistical characteristics often
include word count, readability, classical item analyses, item response theory parameters, distracter
functioning, item history, field test item analyses, item drift, differential item functioning flags, and
history of item use.
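The four coding tasks described above suggest a simple record structure for each banked item. The sketch below is one possible schema: the field groupings follow the paragraph, but the names and types are our own assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ItemRecord:
    """One item-bank entry covering format, content, developer, and statistics."""
    item_id: str
    item_format: str                                # e.g., "multiple choice"
    content: dict = field(default_factory=dict)     # blueprint category, outcomes, ...
    developer: dict = field(default_factory=dict)   # year written, writer, status, ...
    statistics: dict = field(default_factory=dict)  # classical analyses, IRT parameters, ...

record = ItemRecord(item_id="MATH-06-0001", item_format="multiple choice",
                    content={"blueprint_category": "measurement"},
                    developer={"year_written": 2009},
                    statistics={"p_value": 0.62})
print(record.item_format)
```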
The management task focuses on person-level information and process. That is, item bank
management requires explicit processes that guide the use of the item bank. Many different people
within a testing organization are often involved in the development process including the test
development specialists, subject matter experts (who often reside in both internal and external
committees), psychometricians, editors, graphic artists, word processors, and document production
specialists. Many testing programs field test their items and then review committees evaluate the items
prior to final test production. Hence, field tested items are often the item bank entry point. Rules must
be established for who has access to the bank and when items can be added, modified, or removed
during field testing. The same rules must also apply to the preparation of the final form of the test
because field testing can, and often does, occur in a different unit of a testing organization or at a
different stage in the development process and, therefore, may involve different people.
Item models, rather than single test items, serve as the unit of analysis in an item model bank. With
an item model bank, the test development specialist creates an electronic repository of item models for
maintaining and managing information on each model. However, a single item model which is
individually written, reviewed, revised, edited, and banked will also allow the developer to generate
many test items. If, for instance, a developer intends to have 331,371 items, then she may only require
10 item models (as was illustrated in our previous section on “Using Item Models to Automatically
Generate Items”). Alternatively, if a particularly ambitious developer aspired to have a very large
inventory of 10,980,640,827 items, then she would require 331,371 item models [i.e., if each item model
generated, on average, 33,137 mathematics items, as was illustrated in our previous section, then
331,371 item models could be used to generate 10,980,640,827 (33,137 × 331,371) items].
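The arithmetic in the bracketed example can be verified directly:

```python
# Average yield per model and number of models from the IGOR demonstration.
items_per_model = 33137
n_models = 331371

total_items = items_per_model * n_models
print(total_items)  # 10980640827
```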
An item model bank serves as an electronic repository for maintaining and managing information on
each item model. Because an item model serves as the unit of analysis, the banks contain a complex
assortment of information on every model, but not necessarily on every item. The maintenance task
focuses on model-level information. For example, the format of the item model must be coded. Content
fields must be coded. The developer attributes must be coded. Some statistical characteristics of the
model must also be coded, including word count, readability, and item model history. The item model
bank may also contain coded information on the item model ID, item model name, expected grade
levels for use, item model stem type, item model option type, number of constraints for the model, the
number of elements (e.g., integers and strings) in the model, and the number of generated items.
The management task focuses on person-level information and process. That is, item model bank
management requires explicit processes that guide the use of the item model bank. As with a more
traditional approach to item development, many different people within a testing organization are
involved in the process including the test development specialists, subject matter experts,
psychometricians, editors, graphic artists, and word processors. Because of the generative process
required for item model banking, an additional type of specialist may also be involved: the item model
programmer. This specialist is skilled in test development, but also in computer programming and
database management. In other words, this is a 21st-century career! Their role is, first, to bridge the
gap between the test development specialist who creates the item model and the programming tasks
required to format and generate items using IGOR. That is, the item model programmer helps
the test development specialist identify and manipulate the fixed and variable elements in each item
model (which is where test development experience will be helpful), enter the newly created item
models into IGOR, and then execute the program to generate items (the latter two steps require
computer programming skills, at least at this stage in the development of automatic item generation2).
Second, the item model programmer is responsible for entering the models into the item model bank,
maintaining the contents of the bank, and managing the use of the item model bank (which requires
2 In 2009, we worked with 12 test development specialists at the Learner Assessment Branch at Alberta Education
to create item models for achievement tests in Grade 3 Language Arts and Mathematics as well as Grade 6 and 9
Language Arts, Mathematics, Science, and Social Studies. The project yielded 284 unique item models at all three
grade levels and in four different content areas. The test development specialists in this project had the most
difficulty specifying the fixed and variable elements in their model and, despite repeated training, were unable to
code their models and run IGOR consistently.
database management skills). The responsibilities of the item model programmer are presented in
Figure 3.

Figure 3. Basic overview of workflow using traditional item banking and item model banking.
[Figure: the traditional item banking process flows from Item Writing through the Item Bank to Form
Assembly; the item model banking process flows from Item Model Writing to Form Assembly.]
ESTIMATING STATISTICAL CHARACTERISTICS OF GENERATED ITEMS
Drasgow et al. (2006, p. 473) claim that:
Ideally, automatic item generation has two requirements. The first requirement is that an item
class can be described sufficiently for a computer to create instances of that class automatically
or at least semi-automatically. The second requirement is that the determinants of item
difficulty be understood well enough so that each of the generated instances need not be
In the previous six sections of this paper, we described and highlighted the issues related to Drasgow et
al.’s first requirement—describing an item class and automatically generating items—with the use of
item models. In this section, we address the challenges related to Drasgow et al.’s second requirement
by illustrating how generated items could be calibrated automatically. To be useful in test assembly,
items must have statistical characteristics. These characteristics can be obtained by administering the
items on field tests to collect preliminary information from a small sample of examinees. Item statistics
can also be obtained by embedding pilot items within a form as part of an operational test
administration, but not using the pilot items for examinee scoring. An alternative approach is to
account for the variation among the generated items in an item model and, using this information, to
estimate item difficulty with a statistical procedure, thereby making field and pilot testing for the
generated items unnecessary (or, at least, dramatically reducing their scope). A number of statistical procedures
have been developed to accomplish this task, including the linear logistic test model (LLTM; Fischer,
1973; see also Embretson & Daniel, 2008), the 2PL-constrained model (Embretson, 1999), the
hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical model (Sinharay,
Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response function approach
(Mislevy, Wingersky, & Sheehan, 1994).
Janssen (2010; see also Janssen, Schepers, & Peres, 2004) also described a promising approach for
modeling item design features using an extension of the LLTM called the random-effects LLTM (LLTM-R).
The probability that person $p$ successfully answers item $i$ is given by the LLTM as follows:

$$P(Y_{pi} = 1 \mid \theta_p) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}.$$

In this formula, the item difficulty parameter $\beta_i$ found in the Rasch model is replaced with an item
difficulty model specified as $\beta_i = \sum_k q_{ik}\eta_k$, where item difficulty is specified by a linear combination
of item predictors, including a parameter $q_{ik}$, which is the score of item $i$ on item design feature $k$, and a
parameter $\eta_k$, which is the difficulty weight associated with item design feature $k$. Building on this LLTM
formulation, the LLTM-R adds a random error term $\varepsilon_i$ to $\beta_i$ to estimate that component of item
difficulty that may not be accounted for in the item difficulty model:

$$\beta_i = \sum_k q_{ik}\eta_k + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma_\varepsilon^2).$$

By adding $\varepsilon_i$ to the model, random variation can be used to account for design principles that yield
similar items but not necessarily the same item difficulty values across these items.
Janssen (2010) also described the logic that underlies the LLTM-R, as it applies to automatic item
generation. The LLTM-R consists of two parts. The first part of the model specifies the person
parameters associated with $\theta_p$, which include $\mu_\theta$ and $\sigma_\theta^2$, and the second part specifies the item
parameters associated with $\beta_i$, which include $\eta_k$ and $\sigma_\varepsilon^2$. The parameter $\sigma_\varepsilon^2$ accounts for the random
variation of all items created within the same item design principles, leading to similar, but not
necessarily the same, item difficulty levels. Taken together, the LLTM-R can be used to describe three
meaningful components: persons (i.e., $\mu_\theta$, $\sigma_\theta^2$), items ($\eta_k$), and item populations ($\sigma_\varepsilon^2$). For modeling
outcomes in an automatic item generation context, our focus is on the items and item populations
(where the items are nested within the item population).
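The two parts of the LLTM-R can be sketched in a few lines of code. This is our own illustrative implementation, not Janssen's: the function and variable names are invented, and the numerical values are the estimates reported later in our working example.

```python
import math
import random

def lltm_r_difficulty(q, eta, sigma_eps, rng):
    """Item difficulty under the LLTM-R: the fixed part sum_k(q_ik * eta_k)
    plus an item-level random error epsilon_i ~ N(0, sigma_eps^2)."""
    fixed = sum(q_ik * eta_k for q_ik, eta_k in zip(q, eta))
    return fixed + rng.gauss(0.0, sigma_eps)

def p_correct(theta, beta):
    """Rasch/LLTM probability that a person of ability theta answers an
    item of difficulty beta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

rng = random.Random(0)                 # fixed seed for reproducibility
eta = [-2.06, 0.94, 0.86, 1.03]        # difficulty weights from the working example
q = [1, 1, 0, 0]                       # item scores on the four design features
beta = lltm_r_difficulty(q, eta, sigma_eps=0.33, rng=rng)
prob = p_correct(theta=0.0, beta=beta)
```

Because the error term is a random draw, two items generated under identical design principles receive similar but not identical difficulties, which is exactly the behavior the LLTM-R is meant to capture.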
Next, we develop a working example using the logic for automatic item generation presented in
Janssen (2010). Our example is developed using operational data from a diagnostic mathematics
program (see Gierl, Alves, & Taylor-Majeau, 2010). The purpose of the Gierl et al. (2010) study was to
apply the attribute hierarchy method in an operational diagnostic mathematics program at the
elementary school levels to promote cognitive inferences about students’ problem-solving skills. The
attribute hierarchy method is a statistical procedure for classifying examinees’ test item responses into a
set of structured attribute patterns associated with a cognitive model. Principled test design procedures
were used to design the exam and evaluate the student response data. To begin, cognitive models were
created by test development specialists who outlined the knowledge and skills required to solve
mathematical tasks in Grades 3 and 6. Then, items were written specifically to measure the skills in the
cognitive models. Finally, confirmatory statistical analyses were used to evaluate the student response
data by estimating model-data fit, attribute probabilities for diagnostic score reporting, and attribute
reliabilities. The cognitive model and item development steps from the diagnostic math program were
used in the current example to create item models.
Cognitive models for CDA have four defining characteristics (Gierl, Alves, Roberts, & Gotzmann,
2009). First, the model contains skills that are specified at a fine grain size because these skills must
magnify the cognitive processes underlying test performance. Second, the skills must be measurable.
That is, each skill must be described in a way that would allow a test developer to create an item to
measure that skill. Third, the skills must be instructionally relevant to a broad group of educational
stakeholders, including students, parents, and teachers. Fourth, a cognitive model will often reflect a
hierarchy of ordered skills within a domain because cognitive processes share dependencies and
function within a much larger network of inter-related processes, competencies, and skills. Figure 4
provides one example taken from a small section of a larger cognitive model developed to yield
diagnostic inferences in SAT algebra (cf. Gierl, Wang, & Zhou, 2008). As a prerequisite skill, cognitive
attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction,
multiplication, and division of numbers. In attribute A2, the examinee needs to have the basic
arithmetic skills (i.e., attribute A1) as well as knowledge about the property of factors. In attribute A3,
the examinee not only requires basic arithmetic skills (i.e., attribute A1) and knowledge of factoring (i.e.,
attribute A2), but also the skills required for the application of factoring. The attributes are specified at
a fine grain size; each attribute is measurable; each attribute, and its associated item, is intended to be
instructionally relevant and meaningful; and attributes are ordered from simple to more complex as we
move from A1 to A3.
Figure 4. Three sample items designed to measure three ordered skills in a linear cognitive model.
Item 1: If 6(m+n)-3=15, then m+n=?
Item 2: If (x+2)/(m-1)=0 and m≠1, what is the value of x?
Item 3: If 4a+4b = 3c-3d, then (2a+2b)/(5c-5d)=?
The same test design principles were used to develop four item models in our working example. We
selected four parent items that had been field tested with 100 students from the diagnostic
mathematics project. These parent items, in turn, were used to create item models. The item models
were then used for item generation. The four item models are presented in Appendix A. The item
models in Appendix A are ordered from least to most complex according to their cognitive features,
meaning that item model 1 measures number sequencing skills; item model 2 measures number
sequencing skills and numerical comparison skills; item model 3 measures number sequencing skills,
numerical comparison skills, and addition skills; item model 4 measures number sequencing skills,
numerical comparison skills, addition skills, and the ability to solve fractions (please note that the ordering
of the item models in this example has not been validated; rather, the models are used to illustrate how
the LLTM-R could be used for item generation).
The LLTM-R was implemented in two steps. In step 1, parameters were estimated for the persons,
items, and item population with the LLTM. Using a field test consisting of 20 items specifically written to
measure the cognitive features of number sequencing, numerical comparison, addition, and fractions
(i.e., five items per cognitive feature), the person and item parameters were estimated using the
dichotomously-scored response vectors for 100 students who solved these items. The item feature
parameter estimates were specified as fixed effects in the LLTM and the person and item population
estimates were specified as random effects. The estimated item fixed-effect parameter weights and
their associated standard errors are presented in Table 2.
Table 2. Estimated Weights and Standard Errors Using the Cognitive Features Associated with the
Four Diagnostic Test Items
Cognitive Feature Estimate (Standard Error)
Number Sequencing (Least Complex) -2.06 (0.22)
Numerical Comparisons 0.94 (0.27)
Addition 0.86 (0.25)
Fractions (Most Complex) 1.03 (0.25)
The estimated weights in Table 2 were then used to create a cognitive feature effect for each parent
item. The cognitive feature effect is calculated by taking the sum of the products for the pre-requisite
cognitive features as measured by each parent item. For example, a parent item that measures the
cognitive feature numerical comparisons would have a skill pattern of 1,1,0,0 because the features are
ordered in a hierarchy from least to most complex. This pattern would be multiplied and summed
across the estimated weights in Table 2 to produce the cognitive feature effect for each of the four
parent items in our example. The cognitive feature effect for the parent item measuring numerical
comparisons, for instance, would be (-2.06 × 1) + (0.94 × 1) + (0.86 × 0) + (1.03 × 0) = -1.12. The random
effects estimated for the person and item population, as reported in standard deviation units, are 0.99
and 0.33, respectively.
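The sum-of-products step can be sketched as follows. This is our own illustrative code; the skill patterns follow the linear hierarchy described above, and the rounded weights in Table 2 give -1.12 for the numerical comparisons parent item.

```python
# Difficulty weights from Table 2, ordered from least to most complex:
# number sequencing, numerical comparisons, addition, fractions.
weights = [-2.06, 0.94, 0.86, 1.03]

# Each parent item requires its own cognitive feature plus all
# prerequisite features, so the skill patterns are cumulative.
skill_patterns = {
    "number_sequencing":     [1, 0, 0, 0],
    "numerical_comparisons": [1, 1, 0, 0],
    "addition":              [1, 1, 1, 0],
    "fractions":             [1, 1, 1, 1],
}

def feature_effect(pattern, weights):
    """Cognitive feature effect: sum of products of skill pattern and weights."""
    return sum(s * w for s, w in zip(pattern, weights))

effects = {name: round(feature_effect(p, weights), 2)
           for name, p in skill_patterns.items()}
```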
In step 2, the four parent items were selected from the field test and used to create item models
(Appendix A), the item models were used to generate items, and the difficulty parameters for the
generated items were estimated. Number sequencing is the first, and hence, most basic cognitive
feature. This model generated 105 items. The second cognitive feature, numerical comparison,
resulted in a model that generated 90 items. The third cognitive feature was addition. The addition item
model generated 30 items. Fractions is the fourth, and therefore, most complex cognitive feature. The
fraction item model generated 18 items. In total, the four item models yielded 243 generated items.
For our illustrative example, each of the four item models is also differentiated by three key item features,
and each generated item had a different combination of these features. The features were coded for
each item and factored into our estimation process because they were expected to affect item difficulty.
The item features and their codes (reported in parentheses) include: all start patterns contain 0 (0), or
not (1); no use of odd numbers (0) or use of odd numbers (1); sum of the last digits is less than 10 (0) or
greater than 10 (1); some parts are 1/8 (0) or no parts are 1/8 (1); pattern by 10s (0), pattern by 20s and
5s (1), pattern by 15s and 25s (2); 1 group (0), 2 groups (1), 3 groups (2); no odd numbers (0), one odd
number (1), two odd numbers (2); lowest common denominator less than 8 (0) or lowest common
denominator greater than 8 (1); first number ends with 0 (0), or not (1); group size of 10 (0) or group
size of 5 (1); use of numbers in multiples of 10 (0) or no numbers in multiples of 10 (1). These item
features, when crossed with the four cognitive features (i.e., four parent items), are shown in
Appendix B. They serve as our best guess as to the variables that could affect item difficulty for the
generated items in each of the four item models, and they would need to be validated prior to use in a
real item generation study.
To compute the difficulty parameter estimate for each of the generated items, four sources of
information must be combined. These sources include the cognitive feature effect (estimated in step 1),
the item feature coding weight, the item population standard deviation (from step 1), and random
error3. These sources are combined as follows: Difficulty Level for the Generated Item = Cognitive
Feature Effect + [(Item Feature Effect) × (Item Population Standard Deviation) × (Random Error)].

3 The random error component allowed us to introduce error into our analysis, which is how we modeled the
LLTM-R using the LLTM estimates from step 1 for our example.
Returning to our previous example from step 1, the difficulty level for a generated item with the
numerical comparisons cognitive feature and an item feature effect of 0,1,1 (i.e., no use of odd
numbers; two groups; a group size of 5) would be -1.20 [-1.12 + (-0.5) × (0.33) × (0.48)]. The item
feature effect code of 0,1,1 is represented as -0.5 to standardize the item feature results in our
calculation, given that different cognitive features have different numbers of item features (see
Appendix B). This method is then applied to all 243 generated items to yield their item difficulty
estimates.
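The combination rule reduces to one line of arithmetic. In this illustrative sketch (our own code, not part of the operational system), -1.12 is the cognitive feature effect obtained from the rounded Table 2 weights, and -0.5, 0.33, and 0.48 are the standardized item feature effect, item population standard deviation, and random error draw used in the worked example.

```python
def generated_item_difficulty(cog_effect, item_feature_effect,
                              item_pop_sd, random_error):
    """Difficulty Level for the Generated Item = Cognitive Feature Effect
    + [(Item Feature Effect) x (Item Population SD) x (Random Error)]."""
    return cog_effect + item_feature_effect * item_pop_sd * random_error

# Worked example: numerical comparisons parent item.
beta = generated_item_difficulty(cog_effect=-1.12,
                                 item_feature_effect=-0.5,
                                 item_pop_sd=0.33,
                                 random_error=0.48)
```

Applying the same function to each of the 243 generated items, with each item's own feature effect and error draw, yields the full set of predicted difficulties.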
SUMMARY AND FUTURE DIRECTIONS
Internet-based computerized assessment is proliferating. Assessments are now routinely
administered over the internet where students respond to test items containing text, images, tables,
diagrams, sound, and video. But the growth of internet-based computerized testing has also focused
our attention on the need for new testing procedures and practices because this form of assessment
requires a continual supply of new test items. Automatic item generation is the process of using item
models to generate test items with the aid of computer technology. Automatic item generation can be
used to initially develop item banks and then replenish the banks needed for computer-based testing.
The purpose of our paper was to describe seven topics that are central to the development and use of
item models for automatic item generation. We defined the term item model and highlighted related concepts;
we described how item models are developed; we presented an item model taxonomy; we illustrated
how item models can be used for automatic item generation; we outlined some benefits of using item
models; we introduced the idea of an item model bank; and we demonstrated how statistical
procedures could be used to calibrate the item parameter estimates for generated items without the
need for extensive field or pilot testing. We also attempted to contextualize the growing interest in
automatic item generation by highlighting the fact that the science of educational assessment is
beginning to influence educational measurement theory and practice and by claiming that
interdisciplinary forces and factors are beginning to exert a stronger effect on how we solve problems
in the discipline of educational assessment.
Research on item models is warranted in at least two different areas. The first area is item model
development. To our knowledge, there has been no focused research on item model development.
Currently, the principles, standards, and practices that guide traditional item development are also
recommended for use with item model development. These practices have been used to design and
develop item model examples that are cited in the literature (e.g., Bejar et al., 2003; Case & Swanson,
2002; Gierl et al., 2008). But much more research is required on designing, developing, and, most
importantly, evaluating the items produced by these models. By working more closely with test
development specialists in diverse content areas, researchers can begin to better understand how to
design and develop item models by carefully documenting the process. Research must also be
conducted to evaluate these item models by focusing on their generative capacity (i.e., the number of
items that can be generated from a single item model) as well as their generative veracity (i.e., the
usefulness of the generated items, particularly from the view of test development specialists).
The second area is the calibration of generated items using an item modelling approach. As noted by
Drasgow et al. (2006), automatic item generation can minimize, if not eliminate, the need for item field
or pilot testing because items generated from a parent model can be pre-calibrated, meaning that the
statistical characteristics from the parent item model can be applied to the generated items. We
illustrated how the LLTM-R could be used to estimate the difficulty parameter for 243 generated items
in a diagnostic mathematics program. But a host of other statistical procedures are also available for
estimating the statistical characteristics of generated items, including the 2PL-constrained model
(Embretson, 1999), the hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical
model (Sinharay, Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response
function approach (Mislevy, Wingersky, & Sheehan, 1994). These different statistical procedures could
be used with the same item models to permit parameter estimate comparisons across generated items,
without the use of sample data. This type of study would allow researchers to assess the comparability
of the predicted item statistics across the procedures. These statistical procedures could also be used
with the same item models to permit parameter estimate comparisons across generated items relative
to parameter estimates computed from a sample of examinees who actually wrote the generated items.
This type of study would allow researchers to assess the predictive utility of the statistical procedures
(i.e., the agreement between the predicted item characteristics on the generated items using a
statistical procedure compared to the actual item characteristics on the generated items using examinee
response data), which, we expect, will emerge as the “gold standard” for evaluating the feasibility and,
ultimately, the success of automatic item generation.
REFERENCES
Bartram, D. (2006). Testing on the internet: Issues, challenges, and opportunities in the field of
occupational assessment. In D. Bartram & R. Hambleton (Eds.), Computer-based testing and the
internet (pp. 13-37). Hoboken, NJ: Wiley.
Bejar, I. I. (1990). A generative analysis of a three-dimensional spatial task. Applied Psychological
Measurement, 14, 237-245.
Bejar, I. I. (1996). Generative response modeling: Leveraging the computer as a test delivery medium
(ETS Research Report 96-13). Princeton, NJ: Educational Testing Service.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C.
Kyllonen (Eds.), Item generation for test development (pp.199-217). Hillsdale, NJ: Erlbaum.
Bejar, I. I., Lawless, R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility
study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and
Assessment, 2(3). Available from http://www.jtla.org.
Bennett, R. (2001). How the internet will help large-scale assessment reinvent itself. Educational Policy
Analysis Archives, 9, 1-23.
Case, S. M., & Swanson, D. B. (2002). Constructing written test questions for the basic and clinical
sciences (3rd edition). Philadelphia, PA: National Board of Medical Examiners.
Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Erlbaum.
Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.),
Educational measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.
Drasgow, F., & Mattern, K. (2006). New tests and new items: Opportunities and issues. In D. Bartram &
R. Hambleton (Eds.), Computer-based testing and the internet (pp. 59-76). Hoboken, NJ: Wiley.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models.
Psychometrika, 64, 407-433.
Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P.
C. Kyllonen (Eds.), Item generation for test development (pp. 219-250). Mahwah, NJ: Erlbaum.
Embretson, S. E., & Daniel, R. C. (2008). Understanding and quantifying cognitive complexity level in
mathematical problem solving items. Psychological Science Quarterly, 50, 328-344.
Embretson, S. E., & Yang, X. (2007). Automatic item generation and cognitive psychology. In C. R. Rao &
S. Sinharay (Eds.), Handbook of statistics: Psychometrics, Vol. 26 (pp. 747-768). Amsterdam: North Holland.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta
Psychologica, 37, 359-374.
Gierl, M. J., Wang, C., & Zhou, J. (2008). Using the Attribute Hierarchy Method to make diagnostic
inferences about examinees’ cognitive skills in algebra on the SAT©. Journal of Technology,
Learning, and Assessment, 6 (6). Retrieved [date] from http://www.jtla.org.
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote
assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved [date]
from http://www.jtla.org.
Gierl, M. J., Alves, C., Roberts, M., & Gotzmann, A. (2009, April). Using judgments from content
specialists to develop cognitive models for diagnostic assessments. In J. Gorin (Chair), How to Build a
Cognitive Model for Educational Assessments. Paper presented in symposium conducted at the
annual meeting of the National Council on Measurement in Education, San Diego, CA.
Gierl, M. J., Alves, C., & Taylor-Majeau, R. (2010). Using the Attribute Hierarchy Method to make
diagnostic inferences about examinees’ skills in mathematics: An operational implementation of
cognitive diagnostic assessment. International Journal of Testing, 10, 318-341.
Glas, C. A. W., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied
Psychological Measurement, 27, 247-261.
Haladyna, T., & Shindoll, R. (1989). Item shells: A method for writing effective multiple-choice test
items. Evaluation and the Health Professions, 12, 97-106.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A “universe-defined” system of arithmetic
achievement tests. Journal of Educational Measurement, 5, 275-290.
Irvine, S. H., & Kyllonen, P. C. (2002). Item generation for test development. Hillsdale, NJ: Erlbaum.
Janssen, R. (2010). Modeling the effect of item designs within the Rasch model. In S. E. Embretson (Ed.),
Measuring psychological constructs: Advances in model-based approaches (pp. 227-245).
Washington DC: American Psychological Association.
Janssen, R., Schepers, J., & Peres, D. (2004). Models with item and group predictors. In P. De Boeck & M.
Wilson (Eds.), Explanatory item response models: A generalized linear and non-linear approach (pp.
189-212). New York: Springer.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modeling procedures for
constructing content-equivalent multiple-choice questions. Medical Education, 20, 53-56.
Leighton, J. P., & Gierl, M. J. (in press). The learning sciences in educational assessment: The role of
cognitive models. Cambridge, UK: Cambridge University Press.
Luecht, R. M. (2006a, May). Engineering the test: From principled item design to automated test
assembly. Paper presented at the annual meeting of the Society for Industrial and Organizational
Psychology, Dallas, TX.
Luecht, R. M. (2006b, September). Assessment engineering: An emerging discipline. Paper presented at
the Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton, AB.
Luecht, R. M. (2007, April). Assessment engineering in language testing: From data models and
templates to psychometrics. Invited paper presented at the annual meeting of the National Council
on Measurement in Education, Chicago, IL.
Luecht, R. M. (2011, February). Assessment design and development, version 2.0: From art to
engineering. Invited paper presented at the annual meeting of the Association of Test Publishers.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing & T.
Haladyna (Eds.), Handbook of test development (pp. 61-90). Mahwah, NJ: Erlbaum.
Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item
parameters: Expected response functions (ETS Research Report 94-28-ONR). Princeton, NJ:
Educational Testing Service.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 307-353). Westport, CT: National Council on Measurement in Education
and American Council on Education.
Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to
mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development
(pp. 361-384). Mahwah, NJ: Erlbaum.
Sinharay, S., Johnson, M. S., & Williamson, D. M. (2003). Calibrating item families and summarizing the
results using family expected response functions. Journal of Educational and Behavioral Statistics,
Sinharay, S., & Johnson, M. (2005). Analysis of data from an admissions test with item models (ETS
Research Report 05-06). Princeton, NJ: Educational Testing Service.
Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of
improved construct representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test
development (pp.329-348). Mahwah, NJ: Erlbaum.
van der Linden, W., & Glas, C. A. W. (2010). Elements of adaptive testing. New York: Springer.
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied
Measurement in Education, 15, 337-362.
APPENDIX A

Item model #1 in mathematics used to generate isomorphic instances of numerical sequences.
If the pattern continues, then the next three numbers should be
700 695 690 685 _____ _____ _____
A. 680, 675, 670
B. 700, 695, 690
C. 680, 677, 675
D. 685, 680, 675
ITEM MODEL VARIABLES
If the pattern continues, then the next three numbers should be I1 I1-I2 I1-(2*I2) I1-(3*I2) _____ _____ _____
I1 Value Range: 700-800 by 5
I2 Value Range: 5-25 by 5
A= I1 - ( 4 * I2 ), I1 - ( 5 * I2 ), I1 - ( 6 * I2 )
B= I1 - ( 3 * I2 ), I1 - ( 4 * I2 ), I1 - ( 5 * I2 )
C= I1 - ( 4 * I2 ), I1 - round( 4.5 * I2 ), I1 - ( 5 * I2 )
D= I1, I1 - ( 1 * I2 ) , I1 - ( 2 * I2 )
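A generator for item model #1 can be sketched as follows. This is our own illustrative code, not IGOR: we assume the value ranges are inclusive ("700-800 by 5" yields 700, 705, ..., 800), so the 21 × 5 = 105 combinations match the count of generated items reported for the number sequencing model.

```python
def irange(start, stop, step):
    """Inclusive integer range, matching the model's 'start-stop by step' notation."""
    return list(range(start, stop + 1, step))

items = []
for I1 in irange(700, 800, 5):          # I1 Value Range: 700-800 by 5
    for I2 in irange(5, 25, 5):         # I2 Value Range: 5-25 by 5
        stem = ("If the pattern continues, then the next three numbers should be "
                f"{I1} {I1 - I2} {I1 - 2 * I2} {I1 - 3 * I2} ___ ___ ___")
        options = {
            "A": [I1 - 4 * I2, I1 - 5 * I2, I1 - 6 * I2],    # key
            "B": [I1 - 3 * I2, I1 - 4 * I2, I1 - 5 * I2],
            "C": [I1 - 4 * I2, I1 - round(4.5 * I2), I1 - 5 * I2],
            "D": [I1, I1 - 1 * I2, I1 - 2 * I2],
        }
        items.append((stem, options))
```

With I1 = 700 and I2 = 5 the generator reproduces the parent item shown above, including its key of 680, 675, 670.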
Item model #2 in mathematics used to generate isomorphic instances of numerical comparisons.
The number that is 1 group of 5 fewer than 201 is ...
ITEM MODEL VARIABLES
The number that is I1 group of I2 fewer than I3 is ...
I1 Value Range: 1-3 by 1
I2 Value Range: 5-10 by 5
I3 Value Range: 201-245 by 3
A= I3 - ( I2 * I1 )
B= I3 - ( I2 * ( I1 + 1 ) )
C= I3 - ( I2 * I1 ) + 1
D= I3 - ( I2 * ( I1 + 1 ) ) - 1
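The same enumeration logic applies to item model #2 (again our own sketch, assuming inclusive ranges): 3 × 2 × 15 = 90 combinations, matching the reported count for the numerical comparison model. Distractors B through D are omitted for brevity.

```python
generated = []
for I1 in range(1, 4):                  # I1 Value Range: 1-3 by 1
    for I2 in range(5, 11, 5):          # I2 Value Range: 5-10 by 5 -> 5, 10
        for I3 in range(201, 246, 3):   # I3 Value Range: 201-245 by 3 -> 201, ..., 243
            stem = (f"The number that is {I1} group{'s' if I1 > 1 else ''} "
                    f"of {I2} fewer than {I3} is ...")
            key = I3 - (I2 * I1)        # A = I3 - ( I2 * I1 )
            generated.append((stem, key))
```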
Item model #3 in mathematics used to generate isomorphic instances for addition.
What is 15 + 18 ?
ITEM MODEL VARIABLES
What is I1 + I2 ?
I1 Value Range: 15-30 by 3
I2 Value Range: 15-30 by 3
A= I1 + I2
B= I1 + I2 + 1
C= I1 + I1 + I2 - 1
D= I1 + I1 + I2
Item model #4 in mathematics used to generate isomorphic instances for fractions.
What fraction of the measuring cup has oil in it?
ITEM MODEL VARIABLES
What fraction of the measuring cup has oil in it?
Diagram: I1 of Water and I2 of oil in one cup.
I1 Value Range: 0.125-1.0 by 0.125
I2 Value Range: 0.125-1.0 by 0.125
A= ( I2 * 8 ) / 8
B= ( I2 * 8 ) + ( ( I1 * 8 ) / 8 )
C= ( I2 * 8 ) + ( ( I1 * 8 ) / 10 )
D= ( I2 * 8) / ( ( I2 * 8 ) + 1 )
APPENDIX B

The cognitive feature codes were used to develop the four parent items for our example. The item feature codes serve as variables that could
affect the difficulty level for the generated items.
Item Feature Codes by Item Model

Item model 1 (number sequencing):
  All start patterns are 0 (0); all start patterns not 0 (1)
  Pattern by 10s (0); pattern by 20s and 5s (1); pattern by 15s and 25s (2)
  First number ends with 0 (0); first number does not end with 0 (1)

Item model 2 (numerical comparisons):
  No use of odd numbers (0); use of odd numbers (1)
  1 group less (0); 2 groups less (1); 3 groups less (2)
  Group size of 10 (0); group size of 5 (1)

Item model 3 (addition):
  Sum of last digits < 10 (0); sum of last digits > 10 (1)
  No use of odd numbers (0); one use of odd numbers (1); two uses of odd numbers (2)
  Use of numbers in multiples of 10 (0); no numbers in multiples of 10 (1)

Item model 4 (fractions):
  Some parts are 1/8 (0); no parts are 1/8 (1)
  Lowest common denominator < 8 (0); lowest common denominator = 8 (1)