The Role of Item Models in Automatic Item Generation



Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates or prototypes, that highlight the features or elements in the assessment task that must be manipulated. Second, these item model elements are manipulated to generate new items with the aid of computer-based algorithms. With this two-step process, hundreds or even thousands of new items can be created from a single item model. The purpose of our article is to describe seven different but related topics that are central to the development and use of item models for automatic item generation. We start by defining item model and highlighting some related concepts; we describe how item models are developed; we present an item model taxonomy; we illustrate how item models can be used for automatic item generation; we outline some benefits of using item models; we introduce the idea of an item model bank; and finally, we demonstrate how statistical procedures can be used to estimate the parameters of the generated items without the need for extensive field or pilot testing.
The Role of Item Models in Automatic Item Generation
Mark J. Gierl
Hollis Lai
Centre for Research in Applied Measurement and Evaluation
University of Alberta
Paper Presented at the Symposium
Item Modeling and Item Generation for the Measurement of
Quantitative Skills: Recent Advances and Prospects
Annual Meeting of the National Council on Measurement in Education
New Orleans, LA
April, 2011
Item Models 2
Randy Bennett (2001) claimed, a decade ago, that no topic would become more central to innovation
and future practice in educational assessment than computers and the internet. His prediction has
proven to be accurate. Educational assessment and computer technology have evolved at a staggering
pace since 2001. As a result, many educational assessments, which were once given in a
paper-and-pencil format, are now administered by computer using the internet. Education Week’s
2009 Technology Counts, for example, reported that 27 US states now administer internet-based
computerized educational assessments. Many popular and well-known exams in North America such
as the Graduate Management Admission Test (GMAT), the Graduate Record Exam (GRE), the Test of
English as a Foreign Language (TOEFL iBT), and the American Institute of Certified Public Accountants
Uniform CPA examination (CBT-e), to cite but a few examples, are administered by computer over the
internet. Canadian testing agencies are also implementing internet-based computerized assessments.
For example, the Medical Council of Canada Qualifying Exam Part I (MCCQE I), which is written by all
medical students seeking entry into supervised clinical practice, is administered by computer.
Provincial testing agencies in Canada are also making the transition to internet-based assessment.
Alberta Education, for instance, will introduce a computer-based assessment for elementary school
students in 2011, as part of their Diagnostic Mathematics Program.
Internet-based computerized assessment offers many advantages to students and educators
compared to more traditional paper-based assessments. For instance, computers support the
development of innovative item types and alternative item formats (Sireci & Zenisky, 2006; Zenisky &
Sireci, 2002); items on computer-based tests can be scored immediately thereby providing students
with instant feedback (Drasgow & Mattern, 2006); computers permit continuous testing and testing on-
demand for students (van der Linden & Glas, 2010). But possibly the most important advantage of
computer-based assessment is that it allows educators to measure more complex performances by
integrating test items and digital media to substantially increase the types of knowledge, skills, and
competencies that can be measured (Bartram, 2006; Zenisky & Sireci, 2002).
The advent of computer-based testing has also raised new challenges, particularly in the area of
item development (Downing & Haladyna, 2006; Schmeiser & Welch, 2006). Large numbers of items are
needed to develop the banks necessary for computerized testing because items are continuously
administered and, therefore, exposed. As a result, these banks must be frequently replenished to
minimize item exposure and maintain test security. Because testing agencies are now faced with the
daunting task of creating thousands of new items for computer-based assessments, alternative
methods of item development are desperately needed. One method that may be used to address this
challenge is automatic item generation (Drasgow, Luecht, & Bennett, 2006; Embretson & Yang,
2007; Irvine & Kyllonen, 2002). Automatic item generation represents a relatively new but rapidly
evolving research area where cognitive and psychometric theories are used to produce tests that
include items generated using computer technology. Automatic item generation requires two steps.
First, test development specialists develop item models, which are comparable to templates or
prototypes, that highlight the features or elements in the assessment task that must be manipulated.
Second, these item model elements are manipulated to generate new items with the aid of computer-
based algorithms. With this two-step process, hundreds or even thousands of new items can be created
from a single item model.
The purpose of our paper is to describe seven different but related topics that are central to the
development and use of item models for automatic item generation. We start by defining item model
and highlighting some related concepts; we describe how item models are developed; we present an
item model taxonomy; we illustrate how item models can be used for automatic item generation; we
outline some benefits of using item models; we introduce the idea of an item model bank; and finally,
we demonstrate how statistical procedures can be used to estimate the parameters of the generated
items without the need for extensive field or pilot testing. We begin by describing two general factors
that, we feel, will directly affect educational measurement, including emerging methods such as
automatic item generation, in the 21st century.
We assert that the first factor that will shape educational measurement in the 21st century is the growing
view that the science of educational assessment will prevail in guiding the design, development,
administration, scoring, and reporting practices in educational testing. In their seminal chapter on
“Technology and Testing” in the 4th Edition of the handbook Educational Measurement, Drasgow,
Luecht, and Bennett (2006, p. 471) begin with this bold claim:
This chapter describes our vision of a 21st-century testing program that capitalizes on modern
technology and takes advantage of recent innovations in testing. Using an analogy from
engineering, we envision a modern testing program as an integrated system of systems. Thus,
there is an item generation system, an item pretesting system, an examinee registration
system, and so forth. This chapter discusses each system and illustrates how technology can
enhance and facilitate the core processes of each system.
Drasgow et al. present a view of educational measurement where integrated technology-enhanced
systems govern and direct all testing processes. Ric Luecht has coined this technology-based approach
to educational measurement “assessment engineering” (Luecht, 2006a, 2006b, 2007, 2011).
Assessment engineering is an innovative approach to measurement practice where engineering-based
principles and technology-enhanced processes are used to direct the design and development of
assessments as well as the analysis, scoring, and reporting of assessment results. With this approach,
the measurement specialist begins by defining the construct of interest using specific, empirically-
derived cognitive models of task performance. Next, item models are created to produce replicable
assessment tasks. Finally, statistical models are applied to the examinee response data collected using
the item models to produce scores that are both replicable and interpretable.
The second factor that will likely shape educational measurement in the 21st century stems from the
fact that the boundaries for our discipline are becoming more porous. As a result, developments from
other disciplines such as cognitive science, mathematical statistics, medical education, educational
psychology, operations research, educational technology, and computing science will permeate and
influence educational testing. These interdisciplinary contributions will also create opportunities for
both theoretical and practical change. That is, educational measurement specialists will begin to draw
on interdisciplinary developments to enhance their own research and practice. At the same time,
students across a host of other disciplines will begin to study educational measurement.¹ These
interdisciplinary forces that promote new ideas and innovations will begin to evolve, perhaps slowly at
first, but then at a much faster pace leading to even more changes in our discipline. It may also mean
that other disciplines will begin to adopt our theories and practices more readily as students with
educational measurement training move back to their own content domains and areas of specialization.
¹ We have already noticed this change in our own program. We currently have 14 students in the Measurement,
Evaluation, and Cognition (MEC) graduate program at the University of Alberta. These students represent a
diverse disciplinary base, which includes education, cognitive psychology, engineering, computing science,
medicine (one of our students is a surgery resident), occupational therapy, nursing, forensic psychology, statistics,
and linguistics.

An item model (Bejar, 1996, 2002; Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2003;
LaDuca, Staples, Templeton, & Holzman, 1986), which has also been described as a schema (Singley &
Bennett, 2002), blueprint (Embretson, 2002), template (Mislevy & Riconscente, 2006), form (Hively,
Patterson, & Page, 1968), clone (Glas & van der Linden, 2003), and shell (Haladyna & Shindoll, 1989),
serves as an explicit representation of the variables in an assessment task, which includes the stem, the
options, and oftentimes auxiliary information (Gierl, Zhou, & Alves, 2008). The stem is the part of an
item which formulates context, content, and/or the question the examinee is required to answer. The
options contain the alternative answers with one correct option and one or more incorrect options or
distracters. When dealing with a multiple-choice item model, both stem and options are required. With
an open-ended or constructed-response item model, only the stem is created. Auxiliary information
includes any additional material, in either the stem or option, required to generate an item, including
digital media such as text, images, tables, diagrams, sound, and/or video. The stem and options can be
divided further into elements. These elements are denoted as strings (S), which are non-numeric values,
and integers (I), which are numeric values. By systematically manipulating the elements, measurement
specialists can generate large numbers of items from one item model. If the generated items or
instances of the item model are intended to measure content at similar difficulty levels, then the
generated items are isomorphic. When the goal of item generation is to create isomorphic instances,
the measurement specialist manipulates the incidental elements, which are the surface features of an
item that do not alter item difficulty. Conversely, if the instances are intended to measure content at
different difficulty levels, then the generated items are variants. When the goal of item generation is to
create variant instances, the measurement specialist can manipulate the incidental elements, but must
manipulate one or more radical elements in the item model. The radicals are the deep features that
alter item difficulty, and may even affect test characteristics such as dimensionality.
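These distinctions can be made concrete in a small sketch. The code below is our own illustration (the class and field names are assumptions, not taken from IGOR or any published system): elements are tagged as incidental or radical, and a model generates isomorphic instances only when no radical element is varied.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Element:
    name: str                      # e.g., "I1" for the first integer element
    values: List[Union[int, str]]  # pool of admissible values for this element
    radical: bool = False          # True if varying it alters item difficulty

@dataclass
class ItemModel:
    stem: str                      # stem text containing element placeholders
    elements: List[Element]
    options: List[str] = field(default_factory=list)

    def generates_isomorphs(self) -> bool:
        # Instances are isomorphic when only incidental elements vary;
        # varying any radical element produces variants instead.
        return not any(e.radical for e in self.elements)

model = ItemModel(
    stem="Ann has paid $I1 for planting her lawn ...",
    elements=[Element("I1", [1525, 1600, 1675]),
              Element("I2", [30, 45])],
)
print(model.generates_isomorphs())  # True: both elements are incidental
```

Marking even one element as radical (for example, changing the shape of the lawn from a square to a circle) would flip this check and signal that the generated instances are variants rather than isomorphs.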
To illustrate some of these concepts, an example from Grade 6 mathematics is presented in Figure 1.
The item model is represented as the stem and options variables with no auxiliary information. The
stem contains two integers (I1, I2). The I1 element is Ann's payment, which ranges from $1525 to
$1675 in increments of $75. The I2 element is the cost of the lawn, either $30/m2 or $45/m2. The
four alternatives, labelled A to D, are generated using algorithms applied to the integer values I1
and I2 (including the correct option, which is A).
Figure 1. Simple item model in Grade 6 mathematics with two integer elements.
Ann has paid $1525 for planting her lawn. The cost of the lawn is $45/m2. Given the shape of her lawn is
square, what is the side length of Ann's lawn?
A. 5.8
B. 6.8
C. 4.8
D. 7.3
Ann has paid $I1 for planting her lawn. The cost of the lawn is $I2/m2. Given the shape of her lawn is
square, what is the side length of Ann's lawn?
I1 Value Range: 1525-1675 by 75
I2 Value Range: 30 or 45
Distracter algorithms (applied to the correct option value): +1, −1, +1.5
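Working from Figure 1, the options can be computed directly: the lawn covers I1/I2 square metres, so the side length of the square lawn is √(I1/I2), and the distracters offset the key by +1, −1, and +1.5 (rules recoverable from the option values). A minimal Python sketch of this logic (our illustration; it is not part of IGOR):

```python
import math

def generate_options(i1: int, i2: int):
    """Compute the key and distracters for the Figure 1 item model."""
    key = round(math.sqrt(i1 / i2), 1)   # side length of the square lawn
    distracters = [round(key + 1, 1),    # option B in Figure 1
                   round(key - 1, 1),    # option C
                   round(key + 1.5, 1)]  # option D
    return key, distracters

key, distracters = generate_options(1525, 45)
print(key, distracters)  # 5.8 [6.8, 4.8, 7.3] -- matches options A to D in Figure 1
```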
Test development specialists have the critical role of designing and developing the item models
used for automatic item generation. The principles, standards, and practices that guide traditional item
development (cf. Case & Swanson, 2002; Downing & Haladyna, 2006; Schmeiser & Welch, 2006) have
been recommended for use in item model development. Although a growing number of item model
examples are available in the literature (e.g., Bejar et al., 2003; Case & Swanson, 2002; Gierl et al.,
2008), there are currently no published studies describing either the principles or standards required to
develop these models. Drasgow et al. (2006) advise test development specialists to engage in the
creative task of developing item models by using design principles and guidelines discerned from a
combination of experience, theory, and research. Initially, these principles and guidelines are used to
identify a parent item model. One way to identify a parent item model is by using a cognitive theory of
task performance. Within this theory, cognitive models, as described by Luecht in his assessment
engineering framework, may be identified or discerned. With this type of “strong theory” approach,
cognitive features are identified in such detail that item features that predict test performance can be
not only specified but also controlled. The benefit of using strong theory to create item models is that
item difficulty for the generated items is predictable and, as a result, the generated items may be
calibrated without the need for extensive field or pilot testing because the factors that govern the item
difficulty level can be specified and, therefore, explicitly modeled and controlled. Unfortunately, few
cognitive theories currently exist to guide our item development practices (Leighton & Gierl, in press).
As a result, the use of strong theory for automatic item generation has, thus far, been limited to narrow
content domains, such as mental rotation (Bejar, 1990) and spatial ability (Embretson, 2002).
In the absence of strong theory, parent item models can be identified using weak theory, by
reviewing items from previously administered exams or by drawing on an inventory of existing test
items in an attempt to identify an underlying structure. This structure, if identified, provides a point-of-
reference for creating alternative item models, where features in the alternative models can be
manipulated to generate new items. Test development specialists can also create their own unique
item models. The weak theory approach to developing parent models using previously administered
items, drawing on an inventory of existing items, or creating new models is well-suited to broad
content domains where few theoretical descriptions exist on the cognitive skills required to solve test
items (Drasgow et al., 2006). The main drawback of using weak theory to create item models is that
item difficulty for the generated items is unpredictable and, therefore, field or pilot testing may be
required.

Gierl et al. (2008) described a taxonomy of item model types as a way of offering guidelines for
creating item models. The taxonomy pertains to generating multiple-choice items and classifies models
based on the different types of elements used in the stems and options. The stem is the section of the
model used to formulate context, content, and/or questions. The elements in the stem can function in
four different ways. Independent indicates that the ni element(s) (ni ≥ 1) in the stem are unrelated to
one another. That is, a change in one element will have no effect on the other stem elements in the
item model. Dependent indicates that all nd element(s) (nd ≥ 2) in the stem are directly related to one
another. Mixed Independent/Dependent includes both independent (ni ≥ 1) and dependent (nd ≥ 2)
elements in the stem, where at least one pair of stem elements is directly related. Fixed represents a constant
stem format with no variation or change.
The options contain the alternatives for the item model. The elements in the options can function in
three different ways. Randomly-selected options mean that the distracters are selected at random from
their corresponding content pools. Constrained
options mean that the keyed option and the distracters are generated according to specific constraints,
such as formulas, calculation, and/or context. Fixed options occur when both the keyed option and
distracters are invariant or unchanged in the item model.
By crossing the stem and options, a matrix of item model types can be produced (see Table 1). This
taxonomy is useful for creating item models because it provides the guiding principles necessary for
designing diverse models by outlining their structure, function, similarities, and differences. It can also
be used to help ensure that test development specialists do not design item models with exactly the
same elements. Ten functional combinations are designated with a checkmark, “√”. The two
remaining combinations are labelled not applicable, “NA”, because a model with a fixed stem and
constrained options is an infeasible item type and a model with a fixed stem and options produces a
single multiple-choice item type (i.e., a traditional multiple-choice item). Gierl et al. also presented 20
examples (i.e., two examples for each of the 10 cells in the item model taxonomy) to illustrate each
unique combination. Their examples were drawn from diverse content areas, including science, social
studies, mathematics, language arts, and architecture.
Table 1. Plausible Stem-by-Option Combinations in the Gierl et al. (2008) Item Model Taxonomy

Options              Independent   Dependent   Mixed   Fixed
Randomly Selected         √             √        √       √
Constrained               √             √        √      NA
Fixed                     √             √        √      NA
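The logic of Table 1 can also be stated programmatically. The sketch below (our illustration, not drawn from the paper) enumerates the stem-by-option matrix and flags the two infeasible fixed-stem cells:

```python
from itertools import product

STEM_TYPES = ["independent", "dependent", "mixed", "fixed"]
OPTION_TYPES = ["randomly selected", "constrained", "fixed"]

def is_functional(stem: str, options: str) -> bool:
    # A fixed stem with constrained options is infeasible, and a fixed stem
    # with fixed options reduces to a single traditional multiple-choice item.
    return not (stem == "fixed" and options in ("constrained", "fixed"))

cells = [(s, o) for s, o in product(STEM_TYPES, OPTION_TYPES) if is_functional(s, o)]
print(len(cells))  # 10 functional combinations out of 12
```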
Once the item models are developed by the test development specialists, automatic item
generation can begin. Automatic item generation is the process of using item models to generate test
items with the aid of computer technology. The role of the test development specialist is critical for the
creative task of designing and developing meaningful item models. The role of computer technology is
critical for the generative task of systematically combining large numbers of elements in each model to
produce items. By combining content expertise and computer technology, item modeling can be used
to generate items. If we return to the simple math example in Figure 1, the generative process can be
illustrated. Recall that the stem in this example contains two integers (I1, I2). The generative task for this
example involves generating six items with the following I1, I2 combinations: I1=$1525 and I2=$30/m2;
I1=$1600 and I2=$30/m2; I1=$1675 and I2=$30/m2; I1=$1525 and I2=$45/m2; I1=$1600 and I2=$45/m2;
I1=$1675 and I2=$45/m2.
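This crossing of element values is simply a Cartesian product over the value ranges. A hypothetical sketch of the generative step (IGOR's internals are not published in this form):

```python
from itertools import product

STEM = ("Ann has paid ${i1} for planting her lawn. The cost of the lawn is "
        "${i2}/m2. Given the shape of her lawn is square, what is the side "
        "length of Ann's lawn?")

I1_VALUES = range(1525, 1676, 75)  # 1525, 1600, 1675
I2_VALUES = (30, 45)

items = [STEM.format(i1=i1, i2=i2) for i1, i2 in product(I1_VALUES, I2_VALUES)]
print(len(items))  # 6 generated stems
```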
Gierl et al. (2008, pp. 25-31) also created a software tool that automatically creates, saves, and
stores items. The software is called IGOR (which stands for Item GeneratOR). It was written in Sun
Microsystems Java SE 6.0. The purpose of IGOR is to generate multiple items from a single item
model. The user interface for IGOR is structured using the same sections as the example in Figure 1
(i.e., stem, elements, options). The Item Model Editor window is used to enter and structure each item
model (see Figure 2a). The editor has three components. The stem panel is the starting point for item
generation where the item prompt is specified. Next, the elements panel is used to identify the string
and integer variables as well as specify the constraints required among the elements for successful item
generation. The options panel is used to specify possible answers to a given test item. The options are
classified as either a key or distracter. The Elements and Options panels also contain three editing
buttons: one adds a new element or option to its panel, one opens a window to edit the currently
selected element or option, and one removes the currently selected element or option from the
model. To generate items from a model, the Test Item
Generator dialogue box is presented where the user specifies the item model file, the item bank output
file, and the answer key file. If the option ‘Create answer key’ is not selected, then the resulting test
bank will always display the correct answer as the last option (or alternative). If the option ‘Create
answer key’ is selected, then the resulting test bank will randomly order the options. Once the files
have been specified in the Test Item Generator dialogue box, the program can be executed by selecting
the ‘Generate’ button (see Figure 2b).
Figure 2. IGOR interface illustrating the (a.) input panels and editing functions as well as the (b.)
generating functions.
Preliminary research has been conducted with IGOR. Gierl et al., working with two mathematics
test development specialists, developed 10 mathematics item models. IGOR generated 331,371 unique
items from the 10 item models. That is, each model produced, on average, 33,137 items, thereby
providing an initial demonstration of the practicality and feasibility of item generation using IGOR.
Item modeling can enhance educational assessment in many ways. The purpose of item modeling is
to create a single model that yields many test items. Multiple models can then be developed which will
yield hundreds or thousands of new test items. These items, in turn, are used to generate item banks.
Computerized assessments or automatic test assembly algorithms then draw on a sample of the items
from the bank to create a new test. With this approach, item exposure through test administration is
minimized, even with continuous testing, because a large bank of operational items is available. Item
modeling can also lead to more cost-effective item development because the model is continually
re-used to yield many test items, compared with developing each item for a test from scratch. Moreover,
costly, yet common, errors in item development (including omissions or additions of words, phrases,
or expressions as well as spelling, punctuation, capitalization, item structure, typeface, and formatting
problems) can be avoided because only specific elements in the stem and options are manipulated
across large numbers of items (Schmeiser & Welch, 2006). In other words, the item model serves as a
template or prototype where test development specialists manipulate only specific, well-defined,
elements. The remaining components in the template or prototype are not altered. The view of an
item model as a template or prototype with both fixed and variable elements contrasts with the more
conventional view of a single item where every element is unique, both within and across items.
Drasgow et al. (2006) explain:
The demand for large numbers of items is challenging to satisfy because the traditional
approach to test development uses the item as the fundamental unit of currency. That is, each
item is individually hand-crafted (written, reviewed, revised, edited, entered into a computer,
and calibrated) as if no other like it had ever been created before.
But possibly the most important benefit of item modeling stems from the logic of this approach to
test development. With item modeling, the model is treated as the fundamental unit of analysis where
a single model is used to generate many items compared with a more traditional approach where the
item is treated as the unit of analysis (Drasgow et al., 2006). Hence, with item modeling, the cost per
item is lower because the unit of analysis is multiple instances per model rather than single instances
per test development specialist. As a result, large numbers of items can be generated from a single item
model rather than relying on each test development specialist to develop a large number of unique
items. The item models can also be re-used, particularly when only a small number of the generated
items are used on a particular test form.
Current practices in test development and analysis are grounded in the test item. That is, each item is
individually written, reviewed, revised, edited, banked, and calibrated. If, for instance, a developer
intends to have 1236 operational test items in her bank, then she has 1236 unique items that must be
created, edited, reviewed, field tested, and, possibly, revised. An item bank serves as an electronic
repository for maintaining and managing information on each item. The maintenance task focuses on
item-level information. For example, the format of the item must be coded. Item formats and item
types can include multiple choice, numeric response, written response, linked items, passage-based
items, and items containing multimedia. The content for the item must be coded. Content fields
include general learning outcomes, blueprint categories, item identification number, item response
format, type of directions required, links, field test number, date, source of item, item sets, and
copyright. The developer attributes must be coded. These attributes include year the item was written,
item writer name, item writer demographics, editor information, development status, and review
status. The statistical characteristics for the item must also be coded. Statistical characteristics often
include word count, readability, classical item analyses, item response theory parameters, distracter
functioning, item history, field test item analyses, item drift, differential item functioning flags, and
history of item use.
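The item-level fields listed above can be pictured as a single bank record. A hypothetical schema (the field names and groupings are ours, not drawn from any particular banking system):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ItemRecord:
    item_id: str
    item_format: str                                            # e.g., "multiple choice"
    content: Dict[str, str] = field(default_factory=dict)       # outcomes, blueprint category, ...
    developer: Dict[str, str] = field(default_factory=dict)     # writer, editor, review status, ...
    statistics: Dict[str, float] = field(default_factory=dict)  # p-value, IRT parameters, ...

record = ItemRecord(
    item_id="MATH-06-0001",
    item_format="multiple choice",
    content={"blueprint_category": "measurement", "grade": "6"},
    developer={"writer": "A. Smith", "development_status": "field tested"},
    statistics={"p_value": 0.62, "irt_b": -0.35},
)
print(record.item_format)  # multiple choice
```

Grouping the coded fields this way mirrors the maintenance categories in the text: format, content, developer attributes, and statistical characteristics.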
The management task focuses on person-level information and process. That is, item bank
management requires explicit processes that guide the use of the item bank. Many different people
within a testing organization are often involved in the development process including the test
development specialists, subject matter experts (who often reside in both internal and external
committees), psychometricians, editors, graphic artists, word processors, and document production
specialists. Many testing programs field test their items and then review committees evaluate the items
prior to final test production. Hence, field tested items are often the item bank entry point. Rules must
be established for who has access to the bank and when items can be added, modified, or removed
during field testing. The same rules must also apply to the preparation of the final form of the test
because field testing can, and often does, occur in a different unit of a testing organization or at a
different stage in the development process and, therefore, may involve different people.
Item models, rather than single test items, serve as the unit of analysis in an item model bank. With
an item model bank, the test development specialist creates an electronic repository of item models for
maintaining and managing information on each model. However, a single item model which is
individually written, reviewed, revised, edited, and banked will also allow the developer to generate
many test items. If, for instance, a developer intends to have 331,371 items, then she may only require
10 item models (as was illustrated in our previous section on “Using Item Models to Automatically
Generate Items”). Alternatively, if a particularly ambitious developer aspired to have a very large
inventory of 10,980,640,827 items, then she would require 331,371 item models [i.e., if each item model
generated, on average, 33,137 mathematics items, as was illustrated in our previous section on “Using
Item Models to Automatically Generate Items”, then 331,371 item models could be used to generate
10,980,640,827 (33,137 × 331,371) items].
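The projection above is just a product of average yield and model count, which can be checked directly:

```python
items_from_ten_models = 331371           # items IGOR generated from 10 models
avg_yield = items_from_ten_models // 10  # about 33,137 items per model

projected_inventory = avg_yield * 331371  # 331,371 models at the same average yield
print(projected_inventory)  # 10980640827
```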
An item model bank serves as an electronic repository for maintaining and managing information on
each item model. Because an item model serves as the unit of analysis, the banks contain a complex
assortment of information on every model, but not necessarily on every item. The maintenance task
focuses on model-level information. For example, the format of the item model must be coded.
Content fields must be coded. The developer attributes must be coded. Some statistical characteristics
of the model must also be coded, including word count, readability, and item model history. The item
model bank may also contain coded information on the item model ID, item model name, expected
grade levels for use, item model stem type, item model option type, number of constraints for the
model, the number of elements (e.g., integers and strings) in the model, and the number of generated
items.
The management task focuses on person-level information and process. That is, item model bank
management requires explicit processes that guide the use of the item model bank. As with a more
traditional approach to item development, many different people within a testing organization are
involved in the process including the test development specialists, subject matter experts,
psychometricians, editors, graphic artists, and word processors. Because of the generative process
required for item model banking, an additional type of specialist may also be involved: the item model
programmer. This specialist is skilled in test development, but also in computer programming and
database management. In other words, this is a 21st-century career! Their role is, first, to bridge the gap
between the test development specialist who creates the item model and the programming tasks
necessary to format and generate items using IGOR. That is, the item model programmer helps
the test development specialist identify and manipulate the fixed and variable elements in each item
model (which is where test development experience will be helpful), enter the newly created item
models into IGOR, and then execute the program to generate items (the latter two steps require
computer programming skills, at least at this stage in the development of automatic item generation²).
Second, the item model programmer is responsible for entering the models into the item model bank,
maintaining the contents of the bank, and managing the use of the item model bank (which requires
database management skills). The responsibilities of the item model programmer are presented in
Figure 3.

² In 2009, we worked with 12 test development specialists at the Learner Assessment Branch at Alberta Education
to create item models for achievement tests in Grade 3 Language Arts and Mathematics as well as Grade 6 and 9
Language Arts, Mathematics, Science, and Social Studies. The project yielded 284 unique item models at all three
grade levels and in four different content areas. The test development specialists in this project had the most
difficulty specifying the fixed and variable elements in their model and, despite repeated training, were unable to
code their models and run IGOR consistently.
Figure 3. Basic overview of workflow using traditional item banking and item model banking.
[Figure: the traditional item banking process flows from Item Writing to the Item Bank to Form
Assembly; the item model banking process flows from Item Model Writing to the Item Model Bank to
Item Generation to Form Assembly.]
Drasgow et al. (2006, p. 473) claim that:
Ideally, automatic item generation has two requirements. The first requirement is that an item
class can be described sufficiently for a computer to create instances of that class automatically
or at least semi-automatically. The second requirement is that the determinants of item
difficulty be understood well enough so that each of the generated instances need not be
calibrated individually.
In the previous six sections of this paper, we described and highlighted the issues related to Drasgow et
al.'s first requirement, describing an item class and automatically generating items, with the use of
item models. In this section, we address the challenges related to Drasgow et al.'s second requirement
by illustrating how generated items could be calibrated automatically. To be useful in test assembly,
items must have statistical characteristics. These characteristics can be obtained by administering the
items on field tests to collect preliminary information from a small sample of examinees. Item statistics
can also be obtained by embedding pilot items within a form as part of an operational test
administration, but not using the pilot items for examinee scoring. An alternative approach is to
account for the variation among the generated items in an item model and, using this information, to
estimate item difficulty with a statistical procedure thereby making field and pilot testing for the
generated items unnecessary (or, at least, dramatically reduced). A number of statistical procedures
have been developed to accomplish this task, including the linear logistic test model (LLTM; Fischer,
1973; see also Embretson & Daniel, 2008), the 2PL-constrained model (Embretson, 1999), the
hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical model (Sinharay,
Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response function approach
(Mislevy, Wingersky, & Sheehan, 1994).
Janssen (2010; see also Janssen, Schepers, & Peres, 2004) also described a promising approach for
modeling item design features using an extension of the LLTM called the random-effects LLTM (LLTM-R).
The probability that person p successfully answers item i is given by the LLTM as follows:

P(Y_pi = 1 | θ_p, β_i) = exp(θ_p − β_i) / [1 + exp(θ_p − β_i)].

In this formula, the item difficulty parameter β_i found in the Rasch model is replaced with an item
difficulty model specified as β_i = Σ_k η_k q_ik, where item difficulty is specified by a linear combination
of item predictors, including a parameter for the item design feature, q_ik, which is the score of item i on
item design feature k, and a parameter η_k, which is the difficulty weight associated with item design
feature k. Building on this LLTM formulation, the LLTM-R adds a random error term ε_i to β_i to estimate
the component of item difficulty that may not be accounted for in the item difficulty model:

β_i = Σ_k η_k q_ik + ε_i, where ε_i ~ N(0, σ_ε²).

By adding ε_i to the model, random variation can be used to account for design principles that yield
similar items but not necessarily the same item difficulty values across these items.
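The LLTM and LLTM-R quantities above can be sketched directly in code. This is a minimal illustration under our own naming, not Janssen's implementation:

```python
import math
import random

def lltm_difficulty(q, eta):
    """LLTM item difficulty: beta_i = sum_k eta_k * q_ik."""
    return sum(e * s for e, s in zip(eta, q))

def lltm_r_difficulty(q, eta, sigma_eps, rng):
    """LLTM-R difficulty: the LLTM prediction plus a random item
    deviation eps_i drawn from N(0, sigma_eps^2)."""
    return lltm_difficulty(q, eta) + rng.gauss(0.0, sigma_eps)

def p_correct(theta, beta):
    """Rasch probability that a person with ability theta answers an
    item of difficulty beta correctly."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))
```

For a person of average ability facing an item of average difficulty, p_correct(0.0, 0.0) returns 0.5, and two items generated from the same item model differ in lltm_r_difficulty only through their random draws of ε_i.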
Janssen (2010) also described the logic that underlies the LLTM-R as it applies to automatic item
generation. The LLTM-R consists of two parts. The first part of the model specifies the person
parameters associated with θ_p, which include μ_θ and σ_θ², and the second part specifies the item
parameters associated with β_i, which include η_k and σ_ε². The σ_ε² parameter accounts for the random
variation of all items created within the same item design principles, leading to similar, but not
necessarily the same, item difficulty levels. Taken together, the LLTM-R can be used to describe three
meaningful components: persons (i.e., θ_p, σ_θ²), items (β_i), and item populations (σ_ε²). For modeling
outcomes in an automatic item generation context, our focus is on the items and item populations
(where the items are nested within the item population).
Next, we develop a working example using the logic for automatic item generation presented in
Janssen (2010). Our example is developed using operational data from a diagnostic mathematics
program (see Gierl, Alves, & Taylor-Majeau, 2010). The purpose of the Gierl et al. (2010) study was to
apply the attribute hierarchy method in an operational diagnostic mathematics program at the
elementary school levels to promote cognitive inferences about students’ problem-solving skills. The
attribute hierarchy method is a statistical procedure for classifying examinees’ test item responses into a
set of structured attribute patterns associated with a cognitive model. Principled test design procedures
were used to design the exam and evaluate the student response data. To begin, cognitive models were
created by test development specialists who outlined the knowledge and skills required to solve
mathematical tasks in Grades 3 and 6. Then, items were written specifically to measure the skills in the
cognitive models. Finally, confirmatory statistical analyses were used to evaluate the student response
data by estimating model-data fit, attribute probabilities for diagnostic score reporting, and attribute
reliabilities. The cognitive model and item development steps from the diagnostic math program were
used in the current example to create item models.
Cognitive models for cognitive diagnostic assessment (CDA) have four defining characteristics (Gierl,
Alves, Roberts, & Gotzmann, 2009). First, the model contains skills that are specified at a fine grain size
because these skills must magnify the cognitive processes underlying test performance. Second, the
skills must be measurable. That is, each skill must be described in a way that would allow a test
developer to create an item to measure that skill. Third, the skills must be instructionally relevant to a
broad group of educational
measure that skill. Third, the skills must be instructionally relevant to a broad group of educational
stakeholders, including students, parents, and teachers. Fourth, a cognitive model will often reflect a
hierarchy of ordered skills within a domain because cognitive processes share dependencies and
function within a much larger network of inter-related processes, competencies, and skills. Figure 4
provides one example taken from a small section of a larger cognitive model developed to yield
diagnostic inferences in SAT algebra (cf. Gierl, Wang, & Zhou, 2008). As a prerequisite skill, cognitive
attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction,
multiplication, and division of numbers. In attribute A2, the examinee needs to have the basic
arithmetic skills (i.e., attribute A1) as well as knowledge about the property of factors. In attribute A3,
the examinee not only requires basic arithmetic skills (i.e., attribute A1) and knowledge of factoring (i.e.,
attribute A2), but also the skills required for the application of factoring. The attributes are specified at
a fine grain size; each attribute is measurable; each attribute, and its associated item, is intended to be
instructionally relevant and meaningful; and attributes are ordered from simple to more complex as we
move from A1 to A3.
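The prerequisite structure just described can be represented directly. The sketch below is our own hypothetical encoding of the linear A1 to A3 hierarchy, used to check whether an attribute pattern respects the ordering:

```python
# Hypothetical encoding of the linear hierarchy from Figure 4:
# each attribute lists the attributes that must be mastered first.
PREREQS = {"A1": [], "A2": ["A1"], "A3": ["A1", "A2"]}

def is_consistent(pattern):
    """True when every mastered attribute (value 1) also has all of
    its prerequisite attributes mastered."""
    return all(
        pattern[p] == 1
        for attr, mastered in pattern.items() if mastered == 1
        for p in PREREQS[attr]
    )
```

A pattern such as {"A1": 1, "A2": 1, "A3": 0} is consistent with the hierarchy, whereas {"A1": 0, "A2": 1, "A3": 0} is not, because A2 presupposes A1.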
Figure 4. Three sample items designed to measure three ordered skills in a linear cognitive model.
Item 1: If 6(m+n)-3=15, then m+n=?
A. 2
B. 3
C. 4
D. 5
E. 6
Item 2: If (x+2)/(m-1)=0 and m≠1, what is the value of x?
A. 2
B. -1
C. 0
D. 1
E. -2
Item 3: If 4a+4b = 3c-3d, then (2a+2b)/(5c-5d)=?
A. 2/5
B. 4/3
C. 3/4
D. 8/15
E. 3/10
[Figure: a linear hierarchy of three attributes, each paired with the sample test item above that
measures it: A1 (basic arithmetic operations), A2 (property of factors), and A3 (application of
factoring). Column headings: Hierarchy Level, Sample Test Items.]
The same test design principles were used to develop four item models in our working example. We
selected four parent items that had been field tested with 100 students from the diagnostic
mathematics project. These parent items, in turn, were used to create item models. The item models
were then used for item generation. The four item models are presented in Appendix A. The item
models in Appendix A are ordered from least to most complex according to their cognitive features:
item model 1 measures number sequencing skills; item model 2 measures number sequencing and
numerical comparison skills; item model 3 measures number sequencing, numerical comparison, and
addition skills; and item model 4 measures number sequencing, numerical comparison, and addition
skills as well as the ability to solve fractions (please note that the ordering of the item models in this
example has not been validated; rather, the models are used to illustrate how the LLTM-R could be
used for item generation).
The LLTM-R was implemented in two steps. In step 1, parameters were estimated for the persons,
items, and item population with the LLTM. Using a field test consisting of 20 items specifically written to
measure the cognitive features of number sequencing, numerical comparison, addition, and fractions
(i.e., five items per cognitive feature), the person and item parameters were estimated using the
dichotomously-scored response vectors for 100 students who solved these items. The item feature
parameter estimates were specified as fixed effects in the LLTM and the person and item population
estimates were specified as random effects. The estimated item fixed-effect parameter weights and
their associated standard errors are presented in Table 2.
Table 2. Estimated Weights and Standard Errors Using the Cognitive Features Associated with the
Four Diagnostic Test Items
Cognitive Feature Estimate (Standard Error)
Number Sequencing (Least Complex) -2.06 (0.22)
Numerical Comparisons 0.94 (0.27)
Addition 0.86 (0.25)
Fractions (Most Complex) 1.03 (0.25)
The estimated weights in Table 2 were then used to create a cognitive feature effect for each parent
item. The cognitive feature effect is calculated by taking the sum of the products for the pre-requisite
cognitive features as measured by each parent item. For example, a parent item that measures the
cognitive feature numerical comparisons would have a skill pattern of 1,1,0,0 because the features are
ordered in a hierarchy from least to most complex. This pattern would be multiplied and summed
across the estimated weights in Table 2 to produce the cognitive feature effect for each of the four
parent items in our example. The cognitive feature effect for the parent item measuring numerical
comparisons, for instance, would be (-2.06 × 1) + (0.94 × 1) + (0.86 × 0) + (1.03 × 0) = -1.13. The random
effects estimated for the person and item population, as reported in standard deviation units, are 0.99
and 0.33, respectively.
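The cognitive feature effect computation can be sketched as follows. This is our own illustration using the rounded weights from Table 2; with these rounded values the numerical-comparisons effect works out to -1.12, slightly off the -1.13 reported in the text, which presumably reflects unrounded estimates:

```python
# Rounded fixed-effect weights from Table 2, ordered least to most complex.
WEIGHTS = [-2.06, 0.94, 0.86, 1.03]

# Hierarchical skill patterns: each parent item requires all
# prerequisite features below it in the hierarchy.
PATTERNS = {
    "number sequencing":     [1, 0, 0, 0],
    "numerical comparisons": [1, 1, 0, 0],
    "addition":              [1, 1, 1, 0],
    "fractions":             [1, 1, 1, 1],
}

def cognitive_feature_effect(pattern, weights=WEIGHTS):
    """Sum of products of the skill pattern and the estimated weights."""
    return round(sum(p * w for p, w in zip(pattern, weights)), 2)

effects = {name: cognitive_feature_effect(p) for name, p in PATTERNS.items()}
```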
In step 2, the four parent items were selected from the field test and used to create item models
(Appendix A), the item models were used to generate items, and the difficulty parameters for the
generated items were estimated. Number sequencing is the first, and hence, most basic cognitive
feature. This model generated 105 items. The second cognitive feature, numerical comparison,
resulted in a model that generated 90 items. The third cognitive feature was addition. The addition item
model generated 30 items. Fractions is the fourth, and therefore, most complex cognitive feature. The
fraction item model generated 18 items. In total, the four item models yielded 243 generated items.
For our illustrative example, the four item models are also differentiated by key item features. Each
generated item had a different combination of these item features. The features were coded for each
item and factored into our estimation process because they were expected to affect item difficulty.
The item features and their codes (reported in parentheses) include: all start patterns are 0 (0), or not
(1); no use of odd numbers (0) or use of odd numbers (1); sum of the last digits is less than 10 (0) or
greater than 10 (1); some parts are 1/8 (0) or no parts are 1/8 (1); pattern by 10s (0), pattern by 20s
and 5s (1), or pattern by 15s and 25s (2); 1 group (0), 2 groups (1), or 3 groups (2); no odd numbers (0),
one odd number (1), or two odd numbers (2); lowest common denominator less than 8 (0) or greater
than 8 (1); first number ends with 0 (0), or not (1); group size of 5 (0) or group size of 10 (1); and use of
numbers in multiples of 10 (0) or no numbers in multiples of 10 (1). These item features, when crossed
with the four cognitive features (i.e., the four parent items), are shown in Appendix B. The item features
serve as our best guess as to the variables that could affect item difficulty for the generated items in
each of the four item models, and they would need to be validated prior to use in a real item generation
study.
To compute the difficulty parameter estimate for each of the generated items, four sources of
information must be combined. These sources include the cognitive feature effect (estimated in step 1),
the item feature coding weight, the item population standard deviation (from step 1), and random
error3. These sources are combined as follows: Difficulty Level for the Generated Item = Cognitive
Feature Effect + [(Item Feature Effect) × (Item Population Standard Deviation) × (Random Error)].
3 The random error component allowed us to introduce error into our analysis, which is how we modeled the
LLTM-R using the LLTM estimates from step 1 for our example.
Returning to our previous example from step 1, the difficulty level for a generated item with the
numerical comparisons cognitive feature and an item feature effect of 0,1,1 (i.e., use of odd number;
use of two groups; use of a group size of 5) would be -1.21 [-1.13 + (-0.5) × (0.33) × (0.48)]. The item
feature effect code of 0,1,1 is represented as -0.5 to standardize the item feature results in our
calculation, given that different cognitive features have different numbers of item features (see
Appendix B). This method is then applied to all 243 generated items to yield their item difficulty
estimates.
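The combination rule above can be sketched directly. This is our own illustration; the error term is passed in explicitly to reproduce the worked example, whereas in practice it would be drawn from a standard normal:

```python
def generated_item_difficulty(cog_effect, feature_effect, pop_sd, error):
    """Difficulty of a generated item = cognitive feature effect
    + (item feature effect * item population SD * random error)."""
    return cog_effect + feature_effect * pop_sd * error

# Worked example from the text: the numerical comparisons parent item.
b = generated_item_difficulty(-1.13, -0.5, 0.33, 0.48)
```

Rounding b to two decimals reproduces the -1.21 reported above.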
Internet-based computerized assessment is proliferating. Assessments are now routinely
administered over the internet where students respond to test items containing text, images, tables,
diagrams, sound, and video. But the growth of internet-based computerized testing has also focused
our attention on the need for new testing procedures and practices because this form of assessment
requires a continual supply of new test items. Automatic item generation is the process of using item
models to generate test items with the aid of computer technology. Automatic item generation can be
used to initially develop item banks and then replenish the banks needed for computer-based testing.
The purpose of our paper was to describe seven topics that are central to the development and use of
item models for automatic item generation. We defined item model and highlighted related concepts;
we described how item models are developed; we presented an item model taxonomy; we illustrated
how item models can be used for automatic item generation; we outlined some benefits of using item
models; we introduced the idea of an item model bank; and we demonstrated how statistical
procedures could be used to calibrate the item parameter estimates for generated items without the
need for extensive field or pilot testing. We also attempted to contextualize the growing interest in
automatic item generation by highlighting the fact that the science of educational assessment is
beginning to influence educational measurement theory and practice and by claiming that
interdisciplinary forces and factors are beginning to exert a stronger effect on how we solve problems
in the discipline of educational assessment.
Research on item models is warranted in at least two different areas. The first area is item model
development. To our knowledge, there has been no focused research on item model development.
Currently, the principles, standards, and practices that guide traditional item development are also
recommended for use with item model development. These practices have been used to design and
develop item model examples that are cited in the literature (e.g., Bejar et al., 2003; Case & Swanson,
2002; Gierl et al., 2008). But much more research is required on designing, developing, and, most
importantly, evaluating the items produced by these models. By working more closely with test
development specialists in diverse content areas, researchers can begin to better understand how to
design and develop item models by carefully documenting the process. Research must also be
conducted to evaluate these item models by focusing on their generative capacity (i.e., the number of
items that can be generated from a single item model) as well as their generative veracity (i.e., the
usefulness of the generated items, particularly from the view of test development specialists and
content experts).
The second area is the calibration of generated items using an item modeling approach. As noted by
Drasgow et al. (2006), automatic item generation can minimize, if not eliminate, the need for item field
or pilot testing because items generated from a parent model can be pre-calibrated, meaning that the
statistical characteristics from the parent item model can be applied to the generated items. We
illustrated how the LLTM-R could be used to estimate the difficulty parameter for 243 generated items
in a diagnostic mathematics program. But a host of other statistical procedures are also available for
estimating the statistical characteristics of generated items, including the 2PL-constrained model
(Embretson, 1999), the hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical
model (Sinharay, Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response
function approach (Mislevy, Wingersky, & Sheehan, 1994). These different statistical procedures could
be used with the same item models to permit parameter estimate comparisons across generated items,
without the use of sample data. This type of study would allow researchers to assess the comparability
of the predicted item statistics across the procedures. These statistical procedures could also be used
with the same item models to permit parameter estimate comparisons across generated items relative
to parameter estimates computed from a sample of examinees who actually wrote the generated items.
This type of study would allow researchers to assess the predictive utility of the statistical procedures
(i.e., the agreement between the predicted item characteristics on the generated items using a
statistical procedure compared to the actual item characteristics on the generated items using examinee
response data), which, we expect, will emerge as the “gold standard” for evaluating the feasibility and,
ultimately, the success of automatic item generation.
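Such agreement studies need only simple summary statistics comparing predicted and empirical difficulty estimates for the same generated items. A minimal sketch under our own naming:

```python
import math

def rmse(predicted, observed):
    """Root mean squared difference between predicted and empirical
    difficulty estimates for the same generated items."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def pearson_r(x, y):
    """Correlation between predicted and empirical difficulties."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)
```

A high correlation paired with a small RMSE between the two sets of difficulty estimates would support the predictive utility of the calibration procedure.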
References
Bartram, D. (2006). Testing on the internet: Issues, challenges, and opportunities in the field of
occupational assessment. In D. Bartram & R. Hambleton (Eds.), Computer-based testing and the
internet (pp. 13-37). Hoboken, NJ: Wiley.
Bejar, I. I. (1990). A generative analysis of a three-dimensional spatial task. Applied Psychological
Measurement, 14, 237-245.
Bejar, I. I. (1996). Generative response modeling: Leveraging the computer as a test delivery medium
(ETS Research Report 96-13). Princeton, NJ: Educational Testing Service.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C.
Kyllonen (Eds.), Item generation for test development (pp.199-217). Hillsdale, NJ: Erlbaum.
Bejar, I. I., Lawless, R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility
study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and
Assessment, 2(3). Available from
Bennett, R. (2001). How the internet will help large-scale assessment reinvent itself. Educational Policy
Analysis Archives, 9, 1-23.
Case, S. M., & Swanson, D. B. (2002). Constructing written test questions for the basic and clinical
sciences (3rd ed.). Philadelphia, PA: National Board of Medical Examiners.
Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Erlbaum.
Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.),
Educational measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.
Drasgow, F., & Mattern, K. (2006). New tests and new items: Opportunities and issues. In D. Bartram &
R. Hambleton (Eds.), Computer-based testing and the internet (pp. 59-76). Hoboken, NJ: Wiley.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models.
Psychometrika, 64, 407-433.
Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P.
C. Kyllonen (Eds.), Item generation for test development (pp. 219-250). Mahwah, NJ: Erlbaum.
Embretson, S. E., & Daniel, R. C. (2008). Understanding and quantifying cognitive complexity level in
mathematical problem solving items. Psychological Science Quarterly, 50, 328-344.
Embretson, S. E., & Yang, X. (2007). Automatic item generation and cognitive psychology. In C. R. Rao &
S. Sinharay (Eds.), Handbook of statistics: Psychometrics, Volume 26 (pp. 747-768). Amsterdam:
Elsevier.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta
Psychologica, 37, 359-374.
Gierl, M. J., Wang, C., & Zhou, J. (2008). Using the Attribute Hierarchy Method to make diagnostic
inferences about examinees’ cognitive skills in algebra on the SAT©. Journal of Technology,
Learning, and Assessment, 6 (6). Retrieved [date] from
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote
assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved [date]
Gierl, M. J., Alves, C., Roberts, M., & Gotzmann, A. (2009, April). Using judgments from content
specialists to develop cognitive models for diagnostic assessments. In J. Gorin (Chair), How to Build a
Cognitive Model for Educational Assessments. Paper presented in symposium conducted at the
annual meeting of the National Council on Measurement in Education, San Diego, CA.
Gierl, M. J., Alves, C., & Taylor-Majeau, R. (2010). Using the Attribute Hierarchy Method to make
diagnostic inferences about examinees’ skills in mathematics: An operational implementation of
cognitive diagnostic assessment. International Journal of Testing, 10, 318-341.
Glas, C. A. W., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied
Psychological Measurement, 27, 247-261.
Haladyna, T., & Shindoll, R. (1989). Item shells: A method for writing effective multiple-choice test
items. Evaluation and the Health Professions, 12, 97-106.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A “universe-defined” system of arithmetic
achievement tests. Journal of Educational Measurement, 5, 275-290.
Irvine, S. H., & Kyllonen, P. C. (2002). Item generation for test development. Hillsdale, NJ: Erlbaum.
Janssen, R. (2010). Modeling the effect of item designs within the Rasch model. In S. E. Embretson (Ed.),
Measuring psychological constructs: Advances in model-based approaches (pp. 227-245).
Washington DC: American Psychological Association.
Janssen, R., Schepers, J., & Peres, D. (2004). Models with item and group predictors. In P. De Boeck & M.
Wilson (Eds.), Explanatory item response models: A generalized linear and non-linear approach (pp.
189-212). New York: Springer.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modeling procedures for
constructing content-equivalent multiple-choice questions. Medical Education, 20, 53-56.
Leighton, J. P., & Gierl, M. J. (in press). The learning sciences in educational assessment: The role of
cognitive models. Cambridge, UK: Cambridge University Press.
Luecht, R. M. (2006a, May). Engineering the test: From principled item design to automated test
assembly. Paper presented at the annual meeting of the Society for Industrial and Organizational
Psychology, Dallas, TX.
Luecht, R. M. (2006b, September). Assessment engineering: An emerging discipline. Paper presented in
the Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton,
AB, Canada.
Luecht, R. M. (2007, April). Assessment engineering in language testing: From data models and
templates to psychometrics. Invited paper presented at the annual meeting of the National Council
on Measurement in Education, Chicago, IL.
Luecht, R. M. (2011, February). Assessment design and development, version 2.0: From art to
engineering. Invited paper presented at the annual meeting of the Association of Test Publishers,
Phoenix, AZ.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing & T.
Haladyna (Eds.), Handbook of test development (pp. 61-90). Mahwah, NJ: Erlbaum.
Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item
parameters: Expected response functions (ETS Research Report 94-28-ONR). Princeton, NJ:
Educational Testing Service.
Schmeiser, C.B., & Welch, C.J. (2006). Test development. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 307-353). Westport, CT: National Council on Measurement in Education
and American Council on Education.
Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to
mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development
(pp. 361-384). Mahwah, NJ: Erlbaum.
Sinharay, S., Johnson, M. S., & Williamson, D. M. (2003). Calibrating item families and summarizing the
results using family expected response functions. Journal of Educational and Behavioral Statistics,
28, 295-313.
Sinharay, S., & Johnson, M. (2005). Analysis of data from an admissions test with item models. (ETS
Research Report 05-06). Princeton, NJ: Educational Testing Service.
Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of
improved construct representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test
development (pp.329-348). Mahwah, NJ: Erlbaum.
van der Linden, W., & Glas, C. A. W. (2010). Elements of adaptive testing. New York: Springer.
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied
Measurement in Education, 15, 337-362.
Appendix A
Item model #1 in mathematics used to generate isomorphic instances of numerical sequences.
If the pattern continues, then the next three numbers should be
700 695 690 685 _____ _____ _____
A. 680, 675, 670
B. 700, 695, 690
C. 680, 677, 675
D. 685, 680, 675
If the pattern continues, then the next three numbers should be I1 I1-I2 I1-(2*I2) I1-(3*I2) _____ _____ _____
I1 Value Range: 700-800 by 5
I2 Value Range: 5-25 by 5
A= I1 - ( 4 * I2 ), I1 - ( 5 * I2 ), I1 - ( 6 * I2 )
B= I1 - ( 3 * I2 ), I1 - ( 4 * I2 ), I1 - ( 5 * I2 )
C= I1 - ( 4 * I2 ), I1 - round( 4.5 * I2 ), I1 - ( 5 * I2 )
D= I1, I1 - ( 1 * I2 ) , I1 - ( 2 * I2 )
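A generator such as IGOR would cross the value ranges of I1 and I2 and instantiate the stem and options for each combination. The sketch below is our own minimal Python rendering of item model #1, not IGOR itself; the half-up rounding used for option C is an assumption:

```python
def generate_model_1():
    """Generate all items from item model #1 by crossing the ranges
    of the two variable elements I1 and I2."""
    items = []
    for i1 in range(700, 801, 5):      # I1 Value Range: 700-800 by 5
        for i2 in range(5, 26, 5):     # I2 Value Range: 5-25 by 5
            stem = ("If the pattern continues, then the next three numbers "
                    f"should be {i1} {i1 - i2} {i1 - 2 * i2} {i1 - 3 * i2} "
                    "_____ _____ _____")
            options = {
                "A": (i1 - 4 * i2, i1 - 5 * i2, i1 - 6 * i2),  # key
                "B": (i1 - 3 * i2, i1 - 4 * i2, i1 - 5 * i2),
                # half-up rounding assumed for the 4.5 * I2 distractor
                "C": (i1 - 4 * i2, i1 - int(4.5 * i2 + 0.5), i1 - 5 * i2),
                "D": (i1, i1 - i2, i1 - 2 * i2),
            }
            items.append((stem, options))
    return items
```

Crossing 21 values of I1 with 5 values of I2 yields the 105 generated items reported in the text, and with I1 = 700 and I2 = 5 the generated options reproduce the parent item shown above.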
Item model #2 in mathematics used to generate isomorphic instances of numerical comparisons.
The number that is 1 group of 5 fewer than 201 is ...
A. 196
B. 190
C. 197
D. 191
The number that is I1 group of I2 fewer than I3 is ...
I1 Value Range: 1-3 by 1
I2 Value Range: 5-10 by 5
I3 Value Range: 201-245 by 3
A= I3 - ( I2 * I1 )
B= I3 - ( I2 * ( I1 + 1 ) )
C= I3 - ( I2 * I1 ) + 1
D= I3 - ( I2 * ( I1 + 1 ) ) - 1
Item model #3 in mathematics used to generate isomorphic instances for addition.
What is 15 + 18 ?
A. 33
B. 48
C. 32
D. 34
What is I1 + I2 ?
I1 Value Range: 15-30 by 3
I2 Value Range: 15-30 by 3
A= I1 + I2
B= I1 + I2 + 1
C= I1 + I1 + I2 - 1
D= I1 + I1 + I2
Item model #4 in mathematics used to generate isomorphic instances for fractions.
What fraction of the measuring cup has oil in it?
A. 2/8
B. 2/3
C. 3/10
D. 3/8
What fraction of the measuring cup has oil in it?
Diagram: I1 of Water and I2 of oil in one cup.
I1 Value Range: 0.125-1.0 by 0.125
I2 Value Range: 0.125-1.0 by 0.125
A= ( I2 * 8 ) / 8
B= ( I2 * 8 ) + ( ( I1 * 8 ) / 8 )
C= ( I2 * 8 ) + ( ( I1 * 8 ) / 10 )
D= ( I2 * 8) / ( ( I2 * 8 ) + 1 )
Appendix B
The cognitive feature codes were used to develop the four parent items for our example. The item feature codes serve as variables that could
affect the difficulty level for the generated items.
Item feature codes by item model:

Item model 1 (number sequencing):
All start patterns are 0 (0); all start patterns not 0 (1)
Pattern by 10s (0); pattern by 20s and 5s (1); pattern by 15s and 25s (2)
First number ends with 0 (0); first number does not end with 0 (1)

Item model 2 (numerical comparison):
No use of odd number (0); use of odd number (1)
1 group less (0); 2 groups less (1); 3 groups less (2)
Group size of 10 (0); group size of 5 (1)

Item model 3 (addition):
Sum of last digit < 10 (0); sum of last digit > 10 (1)
No use of odd numbers (0); one use of odd numbers (1); two use of odd numbers (2)
Use of number in multiples of 10 (0); no number with multiples of 10 (1)

Item model 4 (fractions):
Some parts are 1/8 (0); no parts are 1/8 (1)
Lowest common denominator < 8 (0); lowest common denominator > 8 (1)
... In the third and final stage of AIG, the model content created in the first step using computer technology is placed in the item model developed in the second stage, and item generation is carried out (Gierl et al., 2021). Different software are developed for item generation in the literature: Math Test Creation Assistant (Singley & Bennett, 2002), ModelCreator (Higgins et al., 2005), Item Distiller (Higgins, 2007), IGOR (Gierl & Lai, 2012), EAQC (Gutl et al., 2011), MARTEN ( In the present study, items were generated using scripts written in Phyton. ...
... It means that the items generated by the n-layer model are less similar to each other and they are not clones. This result is consistent with other research results (Gierl & Lai, 2012). It is expected that the CSI values of the items generated with the n-layered model are low, and it is recommended to use the n-layered model for AIG (Gierl & Lai, 2013). ...
Developments in the field of education have significantly affected test development processes, and computer-based test applications have been started in many institutions. In our country, research on the application of measurement and evaluation tools in the computer environment for use with distance education is gaining momentum. A large pool of items is required for computer-based testing applications that provide significant advantages to practitioners and test takers. Preparing a large pool of items also requires more effort in terms of time, effort, and cost. To overcome this problem, automatic item generation has been widely used by bringing together item development subject matter experts and computer technology. In the present research, the steps for implementing automatic item generation are explained through an example. In the research, which was based on the fundamental research method, first a total of 2560 items were generated using computer technology and SMEs in field of Turkish literature. In the second stage, 60 randomly selected items were examined. As a result of the research, it was determined that a large item pool could be created to be used in online measurement and evaluation applications using automatic item generation.
... Earlier techniques for (i) were structured as expert template development followed by editing performed by computer algorithms [10, 11]. Recent techniques have relied on machine learning and AI [4, 5] for these phases. ...
... Topics used in generation of the sample AI test bank. Figure 2 points to the gaps in the proprietary BenchmarkAI* method, in which ~85% of generated questions are of poor quality. While the state of the art for NQG has improved over BenchmarkAI* methods [10], we conservatively conclude that even with a 10X improvement, at least ~9% of questions are of poor quality. ...
Conference Paper
Automatic Item Generation (AIG) is increasingly used to process large amounts of information and scale the demand for computerized testing. Recent work in Artificial Intelligence for AIG (also known as Natural Question Generation, NQG) states that even newer AIG techniques fall short in syntactic, semantic, and contextual relevance when evaluated qualitatively on small datasets. We confirm this deficiency quantitatively over large datasets. Additionally, we find that human evaluation by Subject Matter Experts (SMEs) conservatively rejects at least ~9% of AI test questions in our experiment over a large, diverse set of dataset topics. Here we present an analytical study of these differences, which motivates our two-phased post-processing AI daisy-chain machine learning (ML) architecture for selecting and editing AI-generated questions produced by current techniques. Finally, we identify and propose the first selection step in the daisy chain using ML with 97+% accuracy, and provide analytical guidance for development of the second editing step, with a measured lower bound of 2.4+% improvement in BLEU score.
... AI technologies can be utilised to automate and enhance various aspects of assessment design, delivery, and grading. For instance, AI can automate the generation of diverse, complex questions that assess higher-order cognitive skills, thereby reducing the manual workload for educators (Bridgeman et al., 2023;Gierl & Lai, 2013). Also, AI can be used to personalise assessments based on individual students' needs and progress, thus facilitating differentiated instruction and personalised learning (Vandewaetere et al., 2011;Stahl, 2023). ...
As artificial intelligence (AI) and chatbot technologies like ChatGPT continue to evolve, educators grapple with the risks and benefits these advances bring to online assessment. The democratisation of AI-based technologies, while offering personalised learning experiences, threatens online assessment legitimacy and academic integrity. This paper critically examines the intersection of AI chatbots and online assessments, in the context of their impact on the design of authentic online assessments. The widespread use of AI chatbots has caused serious problems for the validity of online tests because of the possibility of student misuse. This underlines the need for 'authentic assessments' that concentrate on higher-order cognitive skills, problem-solving, creative thinking, and collaborative skills, and calls for a reevaluation of conventional assessment methods. These types of assessments not only align with the evolving pedagogical needs of the 21st century but also present tasks that are significantly challenging for AI chatbots to replicate, thereby preserving their integrity. Conversely, the paper also explores how AI can facilitate the assessment process by automating certain tasks, providing personalised learning experiences, and supporting collaborative assessments. The era of AI chatbots presents an opportunity to rethink and enhance online assessments, making them more authentic, meaningful, and resistant to AI-assisted malpractice.
... It can be used to generate context-relevant and varied items for surveys, tests, or scales, allowing researchers to assess a wider range of constructs or attributes more effectively (Gierl & Lai, 2012). By producing well-crafted and diversified items, ChatGPT can improve the validity and reliability of survey instruments. ...
The integration of artificial intelligence tools into social science research presents both opportunities and challenges. ChatGPT, a large-scale generative language model, has become a promising tool for social scientists, demonstrating strong capabilities in producing human-like text and understanding complex linguistic patterns. This theoretical article explores ChatGPT's potential to support research in the social sciences, focusing on its theoretical foundations, potential applications, ethical and societal considerations, and future research directions. It examines the theoretical underpinnings of ChatGPT and discusses their relevance to social science research. It then explores a range of potential applications, including qualitative data analysis, survey and interview design, hypothesis generation, and public opinion modeling. Next, it addresses the ethical and societal implications of using ChatGPT in social science research, emphasizing the need for responsible development and deployment of AI tools. In light of these opportunities and challenges, a research agenda is proposed that aims to address limitations, improve model performance, incorporate ethical principles, and encourage interdisciplinary collaboration. The study argues that continued research and dialogue around AI tools such as ChatGPT are crucial to ensuring their responsible and effective use in social science research. This article contributes to a theoretical understanding of ChatGPT's potential in social science research and offers a roadmap for future work, ultimately promoting a deeper understanding of social phenomena and informing evidence-based policies and interventions that enhance societal well-being.
... First, automation can produce items in more cost-effective ways than traditional methods (Kosh, Simpson, Bickel, Kellogg and Sanford-Moore, 2019). Automation can also produce items in sufficient volume to address potential security concerns (Gierl & Haladyna, 2012; Gierl & Lai, 2012; Luecht, 2012). In this context, automated test assembly (ATA) is an area which has also made important progress in recent years (van der Linden & Diao, 2011). ...
Technical Report
The purpose of this report is to provide a user-friendly, relatively jargon-free, overview of recent advances in the academic field of Educational Measurement for the Learning Progressions and Online Formative Assessment National Initiative.
... Automatic Item Generation (AIG) has been suggested as a systematic and cost-effective item development approach compared to the traditional method of manually writing individual items. AIG requires item writers or modelers (i.e., people who know how to build an AIG template) to program item models (i.e., item templates), which are used to systematically generate a large number of items with the assistance of a specialized computer program (Gierl & Lai, 2012a; Kosh et al., 2019). In comparison, the traditional approach produces only one item at a time, making it less efficient than AIG. ...
This case study applied the weak theory of Automatic Item Generation (AIG) to generate isomorphic item instances (i.e., unique but psychometrically equivalent items) for a large-scale assessment. Three representative instances were selected from each item template (i.e., model) and pilot-tested. In addition, a new analytical framework, differential child item functioning (DCIF) analysis, based on the existing differential item functioning statistics, was applied to evaluate the psychometric equivalency of item instances within each template. The results showed that, out of 23 templates, nine successfully generated isomorphic instances, five required minor revisions to make them isomorphic, and the remaining templates required major modifications. The results and insights obtained from the AIG template development procedure may help item writers and psychometricians effectively develop and manage the templates that generate isomorphic instances.
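The within-template equivalency question in the abstract above can be illustrated, in spirit, by comparing classical difficulty (proportion correct) across instances of each template and flagging templates whose instances diverge. This is only an invented sketch of the intuition, not the study's DCIF analysis; the response data and the 0.2 flagging threshold are hypothetical:

```python
# Hypothetical scored responses (1 = correct, 0 = incorrect) for two
# instances generated from each of two item templates.
responses = {
    "template_A": {"inst_1": [1, 1, 0, 1, 1], "inst_2": [1, 0, 1, 1, 1]},
    "template_B": {"inst_1": [1, 1, 1, 1, 1], "inst_2": [0, 0, 1, 0, 0]},
}

def difficulty_spread(instances):
    """Range of proportion-correct values across a template's instances."""
    p = [sum(r) / len(r) for r in instances.values()]
    return max(p) - min(p)

for template, insts in responses.items():
    flag = "review" if difficulty_spread(insts) > 0.2 else "plausibly isomorphic"
    print(template, round(difficulty_spread(insts), 2), flag)
```

Template A's instances behave alike, while template B's do not, mirroring the abstract's finding that some templates generate isomorphic instances and others require revision.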
Progress tests (PT) are a popular type of longitudinal assessment used for evaluating clinical knowledge retention and lifelong learning in health professions education. Most PTs consist of multiple-choice questions (MCQs) whose development is costly and time-consuming. Automatic Item Generation (AIG) generates test items through algorithms, promising to ease this burden. However, it remains unclear how AIG items behave in formative assessment (FA) modalities such as PTs compared to manually written items. The purpose of this study was to compare the quality and validity of AIG items versus manually written items. Responses to 126 (23 automatically generated) dichotomously scored, single-best-answer, five-option MCQs retrieved from the 2021 University of Minho PT of medicine were analyzed. Procedures based on item response theory (IRT), dimensionality testing, item fit, reliability, differential item functioning (DIF), and distractor analysis were used. Qualitative assessment was conducted through expert review. Validity evidence for AIG items was assessed using hierarchical linear modeling (HLM). The PT proved to be a viable tool for assessing medical students' cognitive competencies. AIG items were parallel to manually written items, presenting similar indices of difficulty and information. The proportion of functional distractors was similar for AIG and manually written items. Evidence of validity was found for AIG items, which also showed higher levels of item quality. AIG items functioned as intended and were appropriate for evaluating medical students at various levels of the knowledge spectrum.
Automatic Item Generation (AIG) refers to the process of using cognitive models to generate test items with computer modules. It is a new but rapidly evolving research area in which cognitive and psychometric theory are combined into a digital framework. However, the item quality, usability, and validity of AIG relative to traditional item development methods have not been clearly established. This paper takes a top-down, strong-theory approach to evaluating AIG in medical education. Two studies were conducted. In Study I, participants with different levels of clinical knowledge and item-writing experience developed medical test items both manually and through AIG, and the two item types were compared in terms of quality and usability (efficiency and learnability). In Study II, automatically generated items were included in a summative exam in the content area of surgery, and a psychometric analysis based on Item Response Theory inspected the validity and quality of the AIG items. Items generated by AIG showed good quality and evidence of validity and were adequate for testing students' knowledge. The time spent developing the content for item generation (cognitive models) and the number of items generated did not vary with the participants' item-writing experience or clinical knowledge. AIG produces numerous high-quality items in a fast, economical, and easy-to-learn process, even for item writers who are inexperienced or lack clinical training. Medical schools may benefit from a substantial improvement in the cost-efficiency of developing test items by using AIG. Item-writing flaws can be significantly reduced through the application of AIG models, generating test items capable of accurately gauging students' knowledge.
The present paper consists of a theoretical and an empirical part. First, Rasch's test model for items with two answer categories is considered under the assumption of linear constraints on the item parameters (the 'linear logistic model'). It is shown that this model is appropriate for the analysis of subject areas in instructional research if the subject area comprises tasks or items which the pupil solves by combining a certain number of cognitive operations or rules. An empirical investigation was made which showed that the psychological complexity of problems in elementary differential calculus, as taught in secondary school mathematics, can be approximately explained through the assumption of seven psychologically meaningful operations. The psychological contribution of this analysis does not lie in a mere statistical description of item difficulties, but rather in the testing of hypotheses as to which steps (operations) in solving a problem are to be viewed as psychological units. It was seen, for instance, that differentiation of a polynomial is to be considered a single operation psychologically, which is mastered and correctly combined with the other operations or not, and that the complexity of a task is primarily determined by the combination of different operations and is not increased significantly when the same operation occurs repeatedly within the problem.
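The linear logistic model's core idea in the abstract above, that an item's difficulty is a linear combination of the difficulty contributions of the cognitive operations it requires, can be sketched as follows. The operation names, the Q-matrix, and the eta weights below are invented for illustration, not Fischer's estimated values:

```python
# Hypothetical operation weights (eta): each operation's contribution
# to item difficulty under the linear logistic model.
eta = {"differentiate_polynomial": 0.8, "chain_rule": 1.5, "simplify": 0.4}

# Illustrative Q-matrix: the set of operations each calculus item requires.
# Following the abstract, an operation counts once even if it recurs.
Q = {
    "item_1": ["differentiate_polynomial"],
    "item_2": ["differentiate_polynomial", "simplify"],
    "item_3": ["differentiate_polynomial", "chain_rule", "simplify"],
}

def predicted_difficulty(ops, eta):
    """LLTM prediction: beta_i = sum over operations k of q_ik * eta_k."""
    return sum(eta[op] for op in ops)

for item, ops in Q.items():
    print(item, round(predicted_difficulty(ops, eta), 2))
```

In practice the eta weights are estimated from response data rather than set by hand; the sketch only shows how the decomposition turns a handful of operation parameters into difficulty predictions for arbitrarily many items.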
Design patterns are tools to support task authoring under an evidence-centered approach to assessment design (ECD). This chapter reviews the basic concepts of ECD, focusing on evidentiary arguments. It defines the attributes of design patterns, and shows the roles they play in creating tasks around valid assessment arguments.
There is mounting hope in the United States that federal legislation in the form of No Child Left Behind will improve educational outcomes. As titanic as the challenge appears to be, however, the solution could be at our fingertips. This volume identifies visual types of cognitive models in reading, science and mathematics for researchers, test developers, school administrators, policy makers and teachers. In the process of identifying these cognitive models, the book also explores methodological or translation issues to consider as decisions are made about how to generate psychologically informative and psychometrically viable large-scale assessments based on the learning sciences. Initiatives to overhaul educational systems in disrepair may begin with national policies, but the success of these policies will hinge on how well stakeholders begin to rethink what is possible with a keystone of the educational system: large-scale assessment.
In the present chapter, the focus is on extending item response models on the item side. Item and item group predictors are included as external factors and the item parameters β_i are considered as random effects. When the items are modeled to come from one common distribution, the models are descriptive on the item side. When item predictors of the property type are included, the models are explanatory on the item side. Item groups are a special case of item properties. They refer to binary, non-overlapping properties indicating group membership. The resulting models with item properties can all be described as linear logistic test models (LLTM; Fischer, 1995) with an error term in the prediction of item difficulty. When this random item variation is combined with random person variation, models with crossed random effects are obtained. All models in this chapter are of that kind.
Item models (LaDuca, Staples, Templeton, & Holzman, 1986) are classes from which it is possible to generate items that are equivalent, or isomorphic, to other items from the same model (e.g., Bejar, 1996; Bejar, 2002). They have the potential to produce large numbers of high-quality items at reduced cost. This paper introduces data from the first known application of items automatically generated from item models in a large-scale assessment and deals with several research questions associated with the data. We begin by reviewing calibration techniques for the analysis of data involving item models; one method assumes that the items are isomorphic, while the other treats items generated from the same item model as distinct but related. A major question for this type of data is whether the items are isomorphic, that is, whether they behave the same psychometrically. This paper describes a number of rough diagnostic measures and a rigorous statistical diagnostic for assessing the extent of isomorphicity in the items generated from an item model. Finally, the paper discusses the issue of scoring, an area that needs more research with data involving item models.
Despite the fact that test development is a growth industry that cuts across all levels of education and all the professions, there has never been a comprehensive, research-oriented handbook to which everyone (developers and consumers) can turn for guidance. That is the mission of this book. The Handbook of Test Development brings together well-known scholars and test-development practitioners to present chapters on all aspects of test development. Each chapter contributor is not only a recognized expert with an academic and research background in their designated topic; each one has also had hands-on experience in various aspects of test development. This thirty-two-chapter volume is organized into six sections: foundations, content, item development, test design, test production and administration, and post-test activities. The Handbook provides extensive treatment of such important but unrecognized topics as contracting for testing services, item banking, designing tests for small testing programs, and writing technical reports. The Handbook is based on the Standards for Educational and Psychological Testing, which serve as the foundation for sound test development practice. The chapters also suggest best test development practices and highlight methods to improve test validity evidence. This book is appropriate for graduate courses and seminars that deal with test development and usage, professional testing services and credentialing agencies, state and local boards of education, and academic libraries serving these groups.