The Role of Item Models in Automatic Item Generation

Abstract

Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates or prototypes, that highlight the features or elements in the assessment task that must be manipulated. Second, these item model elements are manipulated to generate new items with the aid of computer-based algorithms. With this two-step process, hundreds or even thousands of new items can be created from a single item model. The purpose of our article is to describe seven different but related topics that are central to the development and use of item models for automatic item generation. We start by defining item model and highlighting some related concepts; we describe how item models are developed; we present an item model taxonomy; we illustrate how item models can be used for automatic item generation; we outline some benefits of using item models; we introduce the idea of an item model bank; and finally, we demonstrate how statistical procedures can be used to estimate the parameters of the generated items without the need for extensive field or pilot testing.
The Role of Item Models in Automatic Item Generation
Mark J. Gierl
Hollis Lai
Centre for Research in Applied Measurement and Evaluation
University of Alberta
Paper Presented at the Symposium
Item Modeling and Item Generation for the Measurement of
Quantitative Skills: Recent Advances and Prospects
Annual Meeting of the National Council on Measurement in Education
New Orleans, LA
April, 2011
INTRODUCTION
Randy Bennett (2001) claimed, a decade ago, that no topic would become more central to the
innovation and future practice in educational assessment than computers and the internet. His
prediction has proven to be accurate. Educational assessment and computer technology have evolved
at a staggering pace since 2001. As a result, many educational assessments, which were once given in a paper-and-pencil format, are now administered by computer using the internet. Education Week’s 2009 Technology Counts, for example, reported that 27 US states now administer internet-based computerized educational assessments. Many popular and well-known exams in North America, such as the Graduate Management Admission Test (GMAT), the Graduate Record Exam (GRE), the Test of
English as a Foreign Language (TOEFL iBT), and the American Institute of Certified Public Accountants
Uniform CPA examination (CBT-e), to cite but a few examples, are administered by computer over the
internet. Canadian testing agencies are also implementing internet-based computerized assessments.
For example, the Medical Council of Canada Qualifying Exam Part I (MCCQE I), which is written by all
medical students seeking entry into supervised clinical practice, is administered by computer.
Provincial testing agencies in Canada are also making the transition to internet-based assessment.
Alberta Education, for instance, will introduce a computer-based assessment for elementary school
students in 2011, as part of their Diagnostic Mathematics Program.
Internet-based computerized assessment offers many advantages to students and educators
compared to more traditional paper-based assessments. For instance, computers support the
development of innovative item types and alternative item formats (Sireci & Zenisky, 2006; Zenisky &
Sireci, 2002); items on computer-based tests can be scored immediately thereby providing students
with instant feedback (Drasgow & Mattern, 2006); computers permit continuous testing and testing on-
demand for students (van der Linden & Glas, 2010). But possibly the most important advantage of
computer-based assessment is that it allows educators to measure more complex performances by
integrating test items and digital media to substantially increase the types of knowledge, skills, and
competencies that can be measured (Bartram, 2006; Zenisky & Sireci, 2002).
The advent of computer-based testing has also raised new challenges, particularly in the area of
item development (Downing & Haladyna, 2006; Schmeiser & Welch, 2006). Large numbers of items are
needed to develop the banks necessary for computerized testing because items are continuously
administered and, therefore, exposed. As a result, these banks must be frequently replenished to
minimize item exposure and maintain test security. Because testing agencies are now faced with the
daunting task of creating thousands of new items for computer-based assessments, alternative
methods of item development are desperately needed. One method that may be used to address this challenge is automatic item generation (Drasgow, Luecht, & Bennett, 2006; Embretson & Yang,
2007; Irvine & Kyllonen, 2002). Automatic item generation represents a relatively new but rapidly
evolving research area where cognitive and psychometric theories are used to produce tests that
include items generated using computer technology. Automatic item generation requires two steps.
First, test development specialists develop item models, which are comparable to templates or
prototypes, that highlight the features or elements in the assessment task that must be manipulated.
Second, these item model elements are manipulated to generate new items with the aid of computer-
based algorithms. With this two-step process, hundreds or even thousands of new items can be created
from a single item model.
The purpose of our paper is to describe seven different but related topics that are central to the
development and use of item models for automatic item generation. We start by defining item model
and highlighting some related concepts; we describe how item models are developed; we present an
item model taxonomy; we illustrate how item models can be used for automatic item generation; we
outline some benefits of using item models; we introduce the idea of an item model bank; and finally,
we demonstrate how statistical procedures can be used to estimate the parameters of the generated
items without the need for extensive field or pilot testing. We begin by describing two general factors
that, we feel, will directly affect educational measurement, including emerging methods such as automatic item generation, in the 21st century.
TWO FACTORS THAT WILL SHAPE EDUCATIONAL MEASUREMENT IN THE 21ST CENTURY
We assert the first factor that will shape educational measurement in the 21st century is the growing
view that the science of educational assessment will prevail in guiding the design, development,
administration, scoring, and reporting practices in educational testing. In their seminal chapter on
“Technology and Testing” in the 4th Edition of the handbook Educational Measurement, Drasgow,
Luecht, and Bennett (2006, p. 471) begin with this bold claim:
This chapter describes our vision of a 21st-century testing program that capitalizes on modern
technology and takes advantage of recent innovations in testing. Using an analogy from
engineering, we envision a modern testing program as an integrated system of systems. Thus,
there is an item generation system, an item pretesting system, an examinee registration
system, and so forth. This chapter discusses each system and illustrates how technology can
enhance and facilitate the core processes of each system.
Drasgow et al. present a view of educational measurement where integrated technology-enhanced
systems govern and direct all testing processes. Ric Luecht has coined this technology-based approach
to educational measurement “assessment engineering” (Luecht, 2006a, 2006b, 2007, 2011).
Assessment engineering is an innovative approach to measurement practice where engineering-based
principles and technology-enhanced processes are used to direct the design and development of
assessments as well as the analysis, scoring, and reporting of assessment results. With this approach,
the measurement specialist begins by defining the construct of interest using specific, empirically-
derived cognitive models of task performance. Next, item models are created to produce replicable
assessment tasks. Finally, statistical models are applied to the examinee response data collected using
the item models to produce scores that are both replicable and interpretable.
The second factor that will likely shape educational measurement in the 21st century stems from the
fact that the boundaries for our discipline are becoming more porous. As a result, developments from
other disciplines such as cognitive science, mathematical statistics, medical education, educational
psychology, operations research, educational technology, and computing science will permeate and
influence educational testing. These interdisciplinary contributions will also create opportunities for
both theoretical and practical change. That is, educational measurement specialists will begin to draw
on interdisciplinary developments to enhance their own research and practice. At the same time,
students across a host of other disciplines will begin to study educational measurement1. These
interdisciplinary forces that promote new ideas and innovations will begin to evolve, perhaps slowly at
first, but then at a much faster pace leading to even more changes in our discipline. It may also mean
that other disciplines will begin to adopt our theories and practices more readily as students with
educational measurement training move back to their own content domains and areas of specialization.
1 We have already noticed this change in our own program. We currently have 14 students in the Measurement, Evaluation, and Cognition (MEC) graduate program at the University of Alberta. These students represent a diverse disciplinary base, which includes education, cognitive psychology, engineering, computing science, medicine (one of our students is a surgery resident), occupational therapy, nursing, forensic psychology, statistics, and linguistics.
ITEM MODELING: DEFINITION AND RELATED CONCEPTS
An item model (Bejar, 1996, 2002; Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2003; LaDuca, Staples, Templeton, & Holzman, 1986), which has also been described as a schema (Singley & Bennett, 2002), blueprint (Embretson, 2002), template (Mislevy & Riconscente, 2006), form (Hively, Patterson, & Page, 1968), clone (Glas & van der Linden, 2003), and shell (Haladyna & Shindoll, 1989), serves as an explicit representation of the variables in an assessment task, which includes the stem, the options, and oftentimes auxiliary information (Gierl, Zhou, & Alves, 2008). The stem is the part of an item which formulates context, content, and/or the question the examinee is required to answer. The
options contain the alternative answers with one correct option and one or more incorrect options or
distracters. When dealing with a multiple-choice item model, both stem and options are required. With
an open-ended or constructed-response item model, only the stem is created. Auxiliary information
includes any additional material, in either the stem or option, required to generate an item, including
digital media such as text, images, tables, diagrams, sound, and/or video. The stem and options can be
divided further into elements. These elements are denoted as strings, S, which are non-numeric values
and integers, I, which are numeric values. By systematically manipulating the elements, measurement
specialists can generate large numbers of items from one item model. If the generated items or
instances of the item model are intended to measure content at similar difficulty levels, then the
generated items are isomorphic. When the goal of item generation is to create isomorphic instances,
the measurement specialist manipulates the incidental elements, which are the surface features of an
item that do not alter item difficulty. Conversely, if the instances are intended to measure content at
different difficulty levels, then the generated items are variants. When the goal of item generation is to
create variant instances, the measurement specialist can manipulate the incidental elements, but must
manipulate one or more radical elements in the item model. The radicals are the deep features that
alter item difficulty, and may even affect test characteristics such as dimensionality.
To illustrate some of these concepts, an example from Grade 6 mathematics is presented in Figure 1.
The item model is represented as the stem and options variables with no auxiliary information. The
stem contains two integers (I1, I2). The I1 element is Ann’s payment, which ranges from $1525 to $1675 in increments of $75. The I2 element is the cost of the lawn, either $30/m2 or $45/m2. The four alternatives, labelled A to D, including the correct option (A), are generated from formulas applied to the integer values I1 and I2.
Figure 1. Simple item model in Grade 6 mathematics with two integer elements.
Ann has paid $1525 for planting her lawn. The cost of lawn is $45/m2. Given the shape of her lawn is
square, what is the side length of Ann’s lawn?
A. 5.8
B. 6.8
C. 4.8
D. 7.3
ITEM MODEL VARIABLES
Stem
Ann has paid $I1 for planting her lawn. The cost of lawn is $I2/m2. Given the shape of her lawn is
square, what is the side length of Ann’s lawn?
Elements
I1 Value Range: 1525-1675 by 75
I2 Value Range: 30 or 45
Options
A = sqrt( I1 / I2 )
B = sqrt( I1 / I2 ) + 1
C = sqrt( I1 / I2 ) - 1
D = sqrt( I1 / I2 ) + 1.5
Key
A
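To make the structure of this item model concrete, the sketch below encodes the stem, elements, and options from Figure 1 as a small Python data structure. The field names and the option formulas, which we reconstruct as the square root of I1/I2 plus or minus a constant, are illustrative assumptions rather than the IGOR file format.

```python
import math

# A hypothetical encoding of the Figure 1 item model. The field names and the
# option formulas (sqrt(I1/I2) plus or minus a constant) are illustrative
# assumptions, not the actual IGOR representation.
item_model = {
    "stem": ("Ann has paid ${I1} for planting her lawn. The cost of lawn is "
             "${I2}/m2. Given the shape of her lawn is square, what is the "
             "side length of Ann's lawn?"),
    "elements": {
        "I1": range(1525, 1676, 75),   # 1525, 1600, 1675
        "I2": (30, 45),
    },
    "options": {
        "A": lambda I1, I2: round(math.sqrt(I1 / I2), 1),        # keyed option
        "B": lambda I1, I2: round(math.sqrt(I1 / I2) + 1, 1),
        "C": lambda I1, I2: round(math.sqrt(I1 / I2) - 1, 1),
        "D": lambda I1, I2: round(math.sqrt(I1 / I2) + 1.5, 1),
    },
    "key": "A",
}
```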
DEVELOPING ITEM MODELS
Test development specialists have the critical role of designing and developing the item models
used for automatic item generation. The principles, standards, and practices that guide traditional item
development (cf. Case & Swanson, 2002; Downing & Haladyna, 2006; Schmeiser & Welch, 2006) have
been recommended for use in item model development. Although a growing number of item model
examples are available in the literature (e.g., Bejar et al., 2003; Case & Swanson, 2002; Gierl et al.,
2008), there are currently no published studies describing either the principles or standards required to
develop these models. Drasgow et al. (2006) advise test development specialists to engage in the
creative task of developing item models by using design principles and guidelines discerned from a
combination of experience, theory, and research. Initially, these principles and guidelines are used to
identify a parent item model. One way to identify a parent item model is by using a cognitive theory of
task performance. Within this theory, cognitive models, as described by Luecht in his assessment
engineering framework, may be identified or discerned. With this type of “strong theory” approach,
cognitive features are identified in such detail that item features that predict test performance can be
not only specified but also controlled. The benefit of using strong theory to create item models is that
item difficulty for the generated items is predictable and, as a result, the generated items may be
calibrated without the need for extensive field or pilot testing because the factors that govern the item
difficulty level can be specified and, therefore, explicitly modeled and controlled. Unfortunately, few
cognitive theories currently exist to guide our item development practices (Leighton & Gierl, in press).
As a result, the use of strong theory for automatic item generation has, thus far, been limited to narrow
content domains, such as mental rotation (Bejar, 1990) and spatial ability (Embretson, 2002).
In the absence of strong theory, parent item models can be identified using weak theoryby
reviewing items from previously administered exams or by drawing on an inventory of existing test
items in an attempt to identify an underlying structure. This structure, if identified, provides a point-of-
reference for creating alternative item models, where features in the alternative models can be
manipulated to generate new items. Test development specialists can also create their own unique
item models. The weak theory approach to developing parent models using previously administered
items, drawing on an inventory of existing items, or creating new models is well-suited to broad
content domains where few theoretical descriptions exist on the cognitive skills required to solve test
items (Drasgow et al., 2006). The main drawback of using weak theory to create item models is that
item difficulty for the generated items is unpredictable and, therefore, field or pilot testing may be
required.
ITEM MODEL TAXONOMY
Gierl et al. (2008) described a taxonomy of item model types, as a way of offering guidelines for
creating item models. The taxonomy pertains to generating multiple-choice items and classifies models
based on the different types of elements used in the stems and options. The stem is the section of the
model used to formulate context, content, and/or questions. The elements in the stem can function in
four different ways. Independent indicates that the ni element(s) (ni ≥ 1) in the stem are unrelated to one another. That is, a change in one element will have no effect on the other stem elements in the item model. Dependent indicates that all nd element(s) (nd ≥ 2) in the stem are directly related to one another. Mixed Independent/Dependent includes both independent (ni ≥ 1) and dependent (nd ≥ 1) elements in the stem, where at least one pair of stem elements is directly related. Fixed represents a constant
stem format with no variation or change.
The options contain the alternatives for the item model. The elements in the options can function in
three different ways. Randomly-selected options refer to the manner in which the distracters are
selected from their corresponding content pools. The distracters are selected randomly. Constrained
options mean that the keyed option and the distracters are generated according to specific constraints,
such as formulas, calculation, and/or context. Fixed options occur when both the keyed option and
distracters are invariant or unchanged in the item model.
By crossing the stem and options, a matrix of item model types can be produced (see Table 1). This
taxonomy is useful for creating item models because it provides the guiding principles necessary for
designing diverse models by outlining their structure, function, similarities, and differences. It can also
be used to help ensure that test development specialists do not design item models with exactly the
same elements. Ten functional combinations are designated with a checkmark, “√”. The two
remaining combinations are labelled not applicable, “NA”, because a model with a fixed stem and
constrained options is an infeasible item type and a model with a fixed stem and options produces a
single multiple-choice item type (i.e., a traditional multiple-choice item). Gierl et al. also presented 20
examples (i.e., two examples for each of the 10 cells in the item model taxonomy) to illustrate each
unique combination. Their examples were drawn from diverse content areas, including science, social
studies, mathematics, language arts, and architecture.
Table 1. Plausible Stem-by-Option Combinations in the Gierl et al. (2008) Item Model Taxonomy

                               Stem
Options                Independent   Dependent   Mixed   Fixed
Randomly Selected           √             √        √       √
Constrained                 √             √        √       NA
Fixed                       √             √        √       NA
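The taxonomy can also be expressed compactly in code. The sketch below, which uses our own labels rather than any published encoding, enumerates the stem-by-option grid and excludes the two combinations marked NA in Table 1.

```python
from itertools import product

STEM_TYPES = ["independent", "dependent", "mixed", "fixed"]
OPTION_TYPES = ["randomly selected", "constrained", "fixed"]

# The two combinations Table 1 marks as not applicable: a fixed stem with
# constrained options, and a fixed stem with fixed options.
NOT_APPLICABLE = {("fixed", "constrained"), ("fixed", "fixed")}

feasible = [(s, o) for s, o in product(STEM_TYPES, OPTION_TYPES)
            if (s, o) not in NOT_APPLICABLE]
print(len(feasible))  # 10 functional stem-by-option combinations
```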
USING ITEM MODELS TO AUTOMATICALLY GENERATE ITEMS
Once the item models are developed by the test development specialists, automatic item
generation can begin. Automatic item generation is the process of using item models to generate test
items with the aid of computer technology. The role of the test development specialist is critical for the
creative task of designing and developing meaningful item models. The role of computer technology is
critical for the generative task of systematically combining large numbers of elements in each model to
produce items. By combining content expertise and computer technology, item modeling can be used
to generate items. If we return to the simple math example in Figure 1, the generative process can be
illustrated. Recall, the stem in this example contains two integers (I1, I2). The generative task for this
example involves generating six items with the following I1, I2 combinations: I1=$1525 and I2=$30/m2; I1=$1600 and I2=$30/m2; I1=$1675 and I2=$30/m2; I1=$1525 and I2=$45/m2; I1=$1600 and I2=$45/m2; I1=$1675 and I2=$45/m2.
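These combinations can be enumerated mechanically. Continuing the hypothetical item_model sketch shown after Figure 1, the short function below illustrates the generative step; it is a simplified stand-in, not IGOR itself.

```python
from itertools import product

def generate_items(model):
    """Enumerate every element combination and render the stem and options."""
    items = []
    for I1, I2 in product(model["elements"]["I1"], model["elements"]["I2"]):
        stem = model["stem"].format(I1=I1, I2=I2)
        options = {label: f(I1, I2) for label, f in model["options"].items()}
        items.append({"stem": stem, "options": options, "key": model["key"]})
    return items

generated = generate_items(item_model)
print(len(generated))           # 6: three values of I1 crossed with two values of I2
print(generated[0]["options"])  # {'A': 7.1, 'B': 8.1, 'C': 6.1, 'D': 8.6} for I1=1525, I2=30
```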
Gierl et al. (2008, pp. 25-31) also created a software tool that automatically creates, saves, and
stores items. The software is called IGOR (which stands for Item GeneratOR). It was written in Sun Microsystems Java SE 6.0. The purpose of IGOR is to generate multiple items from a single item
model. The user interface for IGOR is structured using the same sections as the example in Figure 1
(i.e., stem, elements, options). The Item Model Editor window is used to enter and structure each item
model (see Figure 2a). The editor has three components. The stem panel is the starting point for item
generation where the item prompt is specified. Next, the elements panel is used to identify the string
and integer variables as well as specify the constraints required among the elements for successful item
generation. The options panel is used to specify possible answers to a given test item. The options are
classified as either a key or distracter. The Elements and Options panels also contain three editing buttons: the first adds a new element or option to its panel, the second opens a window to edit the currently selected element or option, and the third removes the currently selected element or option from the model. To generate items from a model, the Test Item
Generator dialogue box is presented where the user specifies the item model file, the item bank output
file, and the answer key file. If the option ‘Create answer key’ is not selected, then the resulting test
bank will always display the correct answer as the last option (or alternative). If the option ‘Create
answer key’ is selected, then the resulting test bank will randomly order the options. Once the files
have been specified in the Test Item Generator dialogue box, the program can be executed by selecting
the ‘Generate’ button (see Figure 2b).
Figure 2. IGOR interface illustrating the (a.) input panels and editing functions as well as the (b.)
generating functions.
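The answer-key behaviour described above is straightforward to sketch. The function below is a hypothetical stand-in for that part of IGOR: it either leaves the correct answer as the last option or shuffles the options and records the key position.

```python
import random

def arrange_options(key_value, distractors, create_answer_key, rng=random):
    """Order the options for one generated item.

    If create_answer_key is False, the correct answer is appended as the last
    option; if True, the options are randomly ordered and the key position is
    recorded, mirroring the 'Create answer key' setting described in the text.
    """
    options = list(distractors) + [key_value]
    if create_answer_key:
        rng.shuffle(options)
    return options, options.index(key_value) + 1  # 1-based position of the key

opts, key_pos = arrange_options(5.8, [6.8, 4.8, 7.3], create_answer_key=True)
print(opts, "key at position", key_pos)
```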
Preliminary research has been conducted with IGOR. Gierl et al., working with two mathematics
test development specialists, developed 10 mathematics item models. IGOR generated 331,371 unique items from the 10 item models. That is, each model produced, on average, 33,137 items, thereby providing an initial demonstration of the practicality and feasibility of item generation using IGOR.
BENEFITS OF ITEM MODELING
Item modeling can enhance educational assessment in many ways. The purpose of item modeling is
to create a single model that yields many test items. Multiple models can then be developed which will
yield hundreds or thousands of new test items. These items, in turn, are used to generate item banks.
Computerized assessments or automatic test assembly algorithms then draw on a sample of the items
from the bank to create a new test. With this approach, item exposure through test administration is
minimized, even with continuous testing, because a large bank of operational items is available. Item
modeling can also lead to more cost-effective item development because the model is continually re-
used to yield many test items compared with developing each item for a test from scratch. Moreover,
costly, yet common, errors in item development, including omissions or additions of words, phrases, or expressions as well as spelling, punctuation, capitalization, item structure, typeface, and formatting problems, can be avoided because only specific elements in the stem and options are manipulated
across large numbers of items (Schmeiser & Welch, 2006). In other words, the item model serves as a
template or prototype where test development specialists manipulate only specific, well-defined,
elements. The remaining components in the template or prototype are not altered. The view of an
item model as a template or prototype with both fixed and variable elements contrasts with the more
conventional view of a single item where every element is unique, both within and across items.
Drasgow et al. (2006) explain:
The demand for large numbers of items is challenging to satisfy because the traditional
approach to test development uses the item as the fundamental unit of currency. That is, each
item is individually hand-crafted (written, reviewed, revised, edited, entered into a computer, and calibrated) as if no other like it had ever been created before.
But possibly the most important benefit of item modeling stems from the logic of this approach to
test development. With item modeling, the model is treated as the fundamental unit of analysis where
a single model is used to generate many items compared with a more traditional approach where the
item is treated as the unit of analysis (Drasgow et al., 2006). Hence, with item modeling, the cost per item is lower because the unit of analysis is multiple instances per model rather than single instances per test development specialist. As a result, large numbers of items can be generated from a single item
model rather than relying on each test development specialist to develop a large number of unique
items. The item models can also be re-used, particularly when only a small number of the generated
items are used on a particular test form.
ITEM MODEL BANK
Current practices in test development and analysis are grounded in the test item. That is, each item is
individually written, reviewed, revised, edited, banked, and calibrated. If, for instance, a developer
intends to have 1236 operational test items in her bank, then she has 1236 unique items that must be
created, edited, reviewed, field tested, and, possibly, revised. An item bank serves as an electronic
repository for maintaining and managing information on each item. The maintenance task focuses on
item-level information. For example, the format of the item must be coded. Item formats and item
types can include multiple choice, numeric response, written response, linked items, passage-based
items, and items containing multimedia. The content for the item must be coded. Content fields
include general learning outcomes, blueprint categories, item identification number, item response
format, type of directions required, links, field test number, date, source of item, item sets, and
copyright. The developer attributes must be coded. These attributes include year the item was written,
item writer name, item writer demographics, editor information, development status, and review
status. The statistical characteristics for the item must also be coded. Statistical characteristics often
include word count, readability, classical item analyses, item response theory parameters, distracter
functioning, item history, field test item analyses, item drift, differential item functioning flags, and
history of item use.
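To picture how these item-level codes might sit together, the sketch below lays out a hypothetical bank record. The field names are ours and do not correspond to any particular item-banking system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ItemRecord:
    """Hypothetical record for a traditional item bank entry."""
    item_id: str
    item_format: str                   # e.g., multiple choice, numeric response
    blueprint_category: str
    learning_outcome: str
    year_written: int
    writer_name: str
    development_status: str            # e.g., field-tested, operational
    word_count: Optional[int] = None
    classical_p_value: Optional[float] = None
    irt_parameters: Optional[dict] = None
    dif_flags: List[str] = field(default_factory=list)
    usage_history: List[str] = field(default_factory=list)
```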
The management task focuses on person-level information and process. That is, item bank
management requires explicit processes that guide the use of the item bank. Many different people
within a testing organization are often involved in the development process including the test
development specialists, subject matter experts (who often reside in both internal and external
committees), psychometricians, editors, graphic artists, word processors, and document production
specialists. Many testing programs field test their items and then review committees evaluate the items
prior to final test production. Hence, field tested items are often the item bank entry point. Rules must
be established for who has access to the bank and when items can be added, modified, or removed
during field testing. The same rules must also apply to the preparation of the final form of the test
because field testing can, and often does, occur in a different unit of a testing organization or at a different stage in the development process and, therefore, may involve different people.
Item models, rather than single test items, serve as the unit of analysis in an item model bank. With
an item model bank, the test development specialist creates an electronic repository of item models for
maintaining and managing information on each model. However, a single item model which is
individually written, reviewed, revised, edited, and banked will also allow the developer to generate
many test items. If, for instance, a developer intends to have 331,371 items, then she may only require
10 item models (as was illustrated in our previous section on “Using Item Models to Automatically
Generate Items”). Alternatively, if a particularly ambitious developer aspired to have a very large inventory of 10,980,640,827 items, then she would require 331,371 item models [i.e., if each item model generated, on average, 33,137 mathematics items, as was illustrated in our previous section on “Using Item Models to Automatically Generate Items”, then 331,371 item models could be used to generate 10,980,640,827 (33,137 x 331,371) items].
An item model bank serves as an electronic repository for maintaining and managing information on
each item model. Because an item model serves as the unit of analysis, the banks contain a complex
assortment of information on every model, but not necessarily on every item. The maintenance task
focuses on model-level information. For example, the format of the item model must be coded.
Content fields must be coded. The developer attributes must be coded. Some statistical characteristics
of the model must also be coded, including word count, readability, and item model history. The item
model bank may also contain coded information on the item model ID, item model name, expected
grade levels for use, item model stem type, item model option type, number of constraints for the
model, the number of elements (e.g., integers and strings) in the model, and the number of generated
items.
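A record in an item model bank stores model-level rather than item-level information. Again, this is only a hypothetical schema with invented field names, shown for contrast with the item record sketched earlier.

```python
from dataclasses import dataclass

@dataclass
class ItemModelRecord:
    """Hypothetical record for an item model bank entry."""
    model_id: str
    model_name: str
    expected_grade_levels: tuple       # e.g., (3, 6)
    stem_type: str                     # independent, dependent, mixed, or fixed
    option_type: str                   # randomly selected, constrained, or fixed
    n_constraints: int
    n_integer_elements: int
    n_string_elements: int
    n_generated_items: int
```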
The management task focuses on person-level information and process. That is, item model bank
management requires explicit processes that guide the use of the item model bank. As with a more
traditional approach to item development, many different people within a testing organization are
involved in the process including the test development specialists, subject matter experts,
psychometricians, editors, graphic artists, and word processors. Because of the generative process
required for item model banking, an additional type of specialist may also be involved: the item model
programmer. This specialist is skilled in test development, but also in computer programming and
database management. In other words, this is a 21st-century career! Their role is, first, to bridge the gap between the test development specialist who creates the item model and the programming tasks required to format and generate items using IGOR. In other words, the item model programmer helps
the test development specialist identify and manipulate the fixed and variable elements in each item
model (which is where test development experience will be helpful), enter the newly created item
models into IGOR, and then execute the program to generate items (the latter two steps require
computer programming skills, at least at this stage in the development of automatic item generation2).
Second, the item model programmer is responsible for entering the models into the item model bank,
maintaining the contents of the bank, and managing the use of the item model bank (which requires database management skills). The responsibilities of the item model programmer are presented in Figure 3.
2 In 2009, we worked with 12 test development specialists at the Learner Assessment Branch at Alberta Education to create item models for achievement tests in Grade 3 Language Arts and Mathematics as well as Grade 6 and 9 Language Arts, Mathematics, Science, and Social Studies. The project yielded 284 unique item models at all three grade levels and in four different content areas. The test development specialists in this project had the most difficulty specifying the fixed and variable elements in their model and, despite repeated training, were unable to code their models and run IGOR consistently.
Figure 3. Basic overview of workflow using traditional item banking and item model banking. [Figure: the traditional item banking process flows from item writing to the item bank to form assembly; the item model banking process flows from item model writing, through the item model programmer and the item model database, to item generation and form assembly.]
ESTIMATING STATISTICAL CHARACTERISTICS OF GENERATED ITEMS
Drasgow et al. (2006, p. 473) claim that:
Ideally, automatic item generation has two requirements. The first requirement is that an item
class can be described sufficiently for a computer to create instances of that class automatically
or at least semi-automatically. The second requirement is that the determinants of item
difficulty be understood well enough so that each of the generated instances need not be
calibrated individually.
In the previous six sections of this paper, we described and highlighted the issues related to Drasgow et al.’s first requirement, describing an item class and automatically generating items, with the use of item models. In this section, we address the challenges related to Drasgow et al.’s second requirement by illustrating how generated items could be calibrated automatically. To be useful in test assembly,
items must have statistical characteristics. These characteristics can be obtained by administering the
items on field tests to collect preliminary information from a small sample of examinees. Item statistics
can also be obtained by embedding pilot items within a form as part of an operational test
administration, but not using the pilot items for examinee scoring. An alternative approach is to
account for the variation among the generated items in an item model and, using this information, to
estimate item difficulty with a statistical procedure thereby making field and pilot testing for the
generated items unnecessary (or, at least, dramatically reduced). A number of statistical procedures
have been developed to accomplish this task, including the linear logistic test model (LLTM; Fischer,
1973; see also Embretson & Daniel, 2008), the 2PL-constrained model (Embretson, 1999), the
hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical model (Sinharay,
Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response function approach
(Mislevy, Wingersky, & Sheehan, 1994).
Janssen (2010; see also Janssen, Schepers, & Peres, 2004) also described a promising approach for
modeling item design features using an extension of the LLTM called the random-effects LLTM (LLTM-R).
The probability that person p successfully answers item i is given by the LLTM as follows:

$$P(Y_{pi} = 1 \mid \theta_p) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}.$$

In this formula, the item difficulty parameter $\beta_i$ found in the Rasch model is replaced with an item difficulty model specified as $\beta_i = \sum_{k} \beta_k X_{ik}$, where item difficulty is specified by a linear combination of item predictors, including a parameter $X_{ik}$, which is the score of item i on item design feature k, and a parameter $\beta_k$, which is the difficulty weight associated with item design feature k. Building on this LLTM formulation, the LLTM-R adds a random error term to $\beta_i$ to estimate that component of item difficulty that may not be accounted for in the item difficulty model:

$$\beta_i = \sum_{k} \beta_k X_{ik} + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0, \sigma_{\epsilon}^{2}) \text{ and } \theta_p \sim N(0, \sigma_{\theta}^{2}).$$

By adding $\sigma_{\epsilon}^{2}$ to the model, random variation can be used to account for design principles that yield the same items but not necessarily the same item difficulty values across these items.
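To make the notation concrete, here is a minimal sketch of how an LLTM-R item difficulty and the corresponding response probability could be computed under the reconstructed formulas above. It is a toy illustration, not Janssen’s (2010) estimation procedure; the feature weights and the 0.33 error standard deviation are borrowed from the worked example reported later in this paper and are used purely for demonstration.

```python
import math
import random

def lltm_r_difficulty(feature_scores, feature_weights, sigma_eps, rng):
    """Item difficulty: weighted sum of design-feature scores plus a random error term."""
    fixed_part = sum(x * w for x, w in zip(feature_scores, feature_weights))
    return fixed_part + rng.gauss(0.0, sigma_eps)

def prob_correct(theta, beta):
    """Rasch/LLTM probability of a correct response for ability theta and difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

rng = random.Random(0)
weights = [-2.06, 0.94, 0.86, 1.03]   # illustrative feature weights (cf. Table 2 below)
scores = [1, 1, 0, 0]                 # an item measuring the first two features
beta = lltm_r_difficulty(scores, weights, sigma_eps=0.33, rng=rng)
print(round(beta, 2), round(prob_correct(theta=0.0, beta=beta), 2))
```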
Janssen (2010) also described the logic that underlies the LLTM-R, as it applies to automatic item generation. The LLTM-R consists of two parts. The first part of the model specifies the person parameters, which include $\theta_p$ and $\sigma_{\theta}^{2}$, and the second part specifies the item parameters associated with $\beta_i$, which include $\beta_k$ and $\sigma_{\epsilon}^{2}$. The parameter $\sigma_{\epsilon}^{2}$ accounts for the random variation of all items created within the same item design principles, leading to similar, but not necessarily the same, item difficulty levels. Taken together, the LLTM-R can be used to describe three meaningful components: persons (i.e., $\theta_p$, $\sigma_{\theta}^{2}$), items ($\beta_k$), and item populations ($\sigma_{\epsilon}^{2}$). For modeling outcomes in an automatic item generation context, our focus is on the items and item populations (where the items are nested within the item population).
Next, we develop a working example using the logic for automatic item generation presented in
Janssen (2010). Our example is developed using operational data from a diagnostic mathematics
program (see Gierl, Alves, & Taylor-Majeau, 2010). The purpose of the Gierl et al. (2010) study was to
apply the attribute hierarchy method in an operational diagnostic mathematics program at the
elementary school levels to promote cognitive inferences about students’ problem-solving skills. The
attribute hierarchy method is a statistical procedure for classifying examinees’ test item responses into a
set of structured attribute patterns associated with a cognitive model. Principled test design procedures
were used to design the exam and evaluate the student response data. To begin, cognitive models were
created by test development specialists who outlined the knowledge and skills required to solve
mathematical tasks in Grades 3 and 6. Then, items were written specifically to measure the skills in the
cognitive models. Finally, confirmatory statistical analyses were used to evaluate the student response
data by estimating model-data fit, attribute probabilities for diagnostic score reporting, and attribute
reliabilities. The cognitive model and item development steps from the diagnostic math program were
used in the current example to create item models.
Cognitive models for cognitive diagnostic assessment (CDA) have four defining characteristics (Gierl, Alves, Roberts, & Gotzmann,
2009). First, the model contains skills that are specified at a fine grain size because these skills must
magnify the cognitive processes underlying test performance. Second, the skills must be measurable.
That is, each skill must be described in a way that would allow a test developer to create an item to
measure that skill. Third, the skills must be instructionally relevant to a broad group of educational
stakeholders, including students, parents, and teachers. Fourth, a cognitive model will often reflect a
hierarchy of ordered skills within a domain because cognitive processes share dependencies and
function within a much larger network of inter-related processes, competencies, and skills. Figure 4
provides one example taken from a small section of a larger cognitive model developed to yield
diagnostic inferences in SAT algebra (cf. Gierl, Wang, & Zhou, 2008). As a prerequisite skill, cognitive
attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction,
multiplication, and division of numbers. In attribute A2, the examinee needs to have the basic
arithmetic skills (i.e., attribute A1) as well as knowledge about the property of factors. In attribute A3,
the examinee not only requires basic arithmetic skills (i.e., attribute A1) and knowledge of factoring (i.e.,
attribute A2), but also the skills required for the application of factoring. The attributes are specified at
a fine grain size; each attribute is measurable; each attribute, and its associated item, is intended to be
instructionally relevant and meaningful; and attributes are ordered from simple to more complex as we
move from A1 to A3.
Figure 4. Three sample items designed to measure three ordered skills in a linear cognitive model.
Cognitive model hierarchy: A1 (Arithmetic operations) → A2 (Properties of Factors) → A3 (Application of Factoring).
Sample test items:
Item 1 (measures A1): If 6(m+n)-3=15, then m+n=?
A. 2   B. 3   C. 4   D. 5   E. 6
Item 2 (measures A1 and A2): If (x+2)/(m-1)=0 and m≠1, what is the value of x?
A. 2   B. -1   C. 0   D. 1   E. -2
Item 3 (measures A1, A2, and A3): If 4a+4b = 3c-3d, then (2a+2b)/(5c-5d)=?
A. 2/5   B. 4/3   C. 3/4   D. 8/15   E. 3/10
The same test design principles were used to develop four item models in our working example. We
selected four parent items that had been field tested with 100 students from the diagnostic
mathematics project. These parent items, in turn, were used to create item models. The item models
were then used for item generation. The four item models are presented in Appendix A. The item
models in Appendix A are ordered from least to most complex according to their cognitive features,
meaning that item model 1 measures number sequencing skills; item model 2 measures number
sequencing skills and numerical comparison skills; item model 3 measures number sequencing skills,
numerical comparison skills, and addition skills; item model 4 measures number sequencing skills,
numerical comparison skills, addition skills, and the ability to solve fractions (please note that the ordering of the item models in this example has not been validated; rather, the models are used to illustrate how the LLTM-R could be used for item generation).
The LLTM-R was implemented in two steps. In step 1, parameters were estimated for the persons,
items, and item population with the LLTM. Using a field test consisting of 20 items specifically written
measure the cognitive features of number sequencing, numerical comparison, addition, and fractions
(i.e., five items per cognitive feature), the person and item parameters were estimated using the
dichotomously-scored response vectors for 100 students who solved these items. The item feature
parameter estimates were specified as fixed effects in the LLTM and the person and item population
estimates were specified as random effects. The estimated item fixed-effect parameter weights and
their associated standard errors are presented in Table 2.
Table 2. Estimated Weights and Standard Errors Using the Cognitive Features Associated with the
Four Diagnostic Test Items
Cognitive Feature Estimate (Standard Error)
Number Sequencing (Least Complex) -2.06 (0.22)
Numerical Comparisons 0.94 (0.27)
Addition 0.86 (0.25)
Fractions (Most Complex) 1.03 (0.25)
The estimated weights in Table 2 were then used to create a cognitive feature effect for each parent
item. The cognitive feature effect is calculated by taking the sum of the products for the pre-requisite
cognitive features as measured by each parent item. For example, a parent item that measures the
cognitive feature numerical comparisons would have a skill pattern of 1,1,0,0 because the features are
ordered in a hierarchy from least to most complex. This pattern would be multiplied and summed
across the estimated weights in Table 2 to produce the cognitive feature effect for each of the four
parent items in our example. The cognitive feature effect for the parent item measuring numerical
comparisons, for instance, would be (-2.06 X 1) + (0.94 X 1) + (0.86 X 0) + (1.03 X 0) = -1.13. The random
effects estimated for the person and item population, as reported in standard deviation units, are 0.99
and 0.33, respectively.
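The cognitive feature effect is simply a weighted sum and can be checked in a few lines of code. The sketch below uses the rounded weights from Table 2; it returns -1.12 rather than the -1.13 reported above, a small difference we attribute to rounding of the published estimates.

```python
# Rounded feature weights from Table 2: number sequencing, numerical
# comparisons, addition, fractions.
WEIGHTS = [-2.06, 0.94, 0.86, 1.03]

def cognitive_feature_effect(skill_pattern, weights=WEIGHTS):
    """Sum of the weights for the prerequisite features measured by a parent item."""
    return sum(s * w for s, w in zip(skill_pattern, weights))

# Parent item measuring number sequencing and numerical comparisons (pattern 1,1,0,0).
print(round(cognitive_feature_effect([1, 1, 0, 0]), 2))  # -1.12 with the rounded weights
```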
In step 2, the four parent items were selected from the field test and used to create item models
(Appendix A), the item models were used to generate items, and the difficulty parameters for the
generated items were estimated. Number sequencing is the first, and hence, most basic cognitive
feature. This model generated 105 items. The second cognitive feature, numerical comparison,
resulted in a model that generated 90 items. The third cognitive feature was addition. The addition item
model generated 30 items. Fractions is the fourth, and therefore, most complex cognitive feature. The
fraction item model generated 18 items. In total, the four item models yielded 243 generated items.
For our illustrative example, the four item models are also differentiated by three key item features.
Each generated item had a different combination of these three item features. These features were
coded for each item and factored into our estimation process because they were expected to affect
item difficulty. The item features and their codes (reported in parentheses) include: all patterns with 0 (0), or not (1); no use of odd numbers (0) or use of odd numbers (1); sum of last digit is less than 10 (0) or sum is greater than 10 (1); some parts are 1/8 (0) or no parts are 1/8 (1); pattern by 10s (0), pattern by 20s and 5s (1), patterns by 15 and 25 (2); 1 group (0), 2 groups (1), 3 groups (2); no odd number (0), one odd number (1), two odd numbers (2); lowest common denominator less than 8 (0) or lowest common denominator greater than 8 (1); first number ends with 0 (0), or not (1); group size of 5 (0) or group size of 10 (1); use of a number in multiples of 10 (0) or no number in multiples of 10 (1). These item features, when crossed with the four cognitive features (i.e., four parent items), are shown in Appendix B. They serve as our best guess as to the variables that could affect item difficulty for the generated items in each of the four item models and would need to be validated prior to use in a real item generation study.
To compute the difficulty parameter estimate for each of the generated items, four sources of
information must be combined. These sources include the cognitive feature effect (estimated in step 1),
the item feature coding weight, the item population standard deviation (from step 1), and random
error3. These sources are combined as follows: Difficulty Level for the Generated Item = Cognitive Feature Effect + [(Item Feature Effect) x (Item Population Standard Deviation) x (Random Error)].
3 The random error component allowed us to introduce error into our analysis, which is how we modeled the LLTM-R using the LLTM estimates from step 1 for our example.
Returning to our previous example from step 1, the difficulty level for a generated item with the
numerical comparisons cognitive feature and an item feature effect of 0,1,1 (i.e., use of odd number;
use of two groups; use a group size of 5) would be -1.21 [-1.13 + (-0.5) x (0.33) x (0.48)]. The item
feature effect code of 0,1,1 is represented as -0.5 to standardize the item feature results in our
calculation, given that different cognitive features have different numbers of item features (see
Appendix B). This method is then applied to all 243 generated items to yield their item difficulty
estimates.
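The step 2 combination rule can also be expressed directly in code. The sketch below reproduces the worked example for the numerical comparisons model, using the -0.5 item feature effect, the 0.33 item population standard deviation, and the 0.48 random error draw given in the text.

```python
def generated_item_difficulty(cog_effect, item_feature_effect, pop_sd, random_error):
    """Difficulty of a generated item: cognitive feature effect plus the product of
    the item feature effect, the item population SD, and the random error draw."""
    return cog_effect + item_feature_effect * pop_sd * random_error

# Worked example from the text for the numerical comparisons parent item.
beta = generated_item_difficulty(cog_effect=-1.13, item_feature_effect=-0.5,
                                 pop_sd=0.33, random_error=0.48)
print(round(beta, 2))  # -1.21
```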
SUMMARY AND FUTURE DIRECTIONS
Internet-based computerized assessment is proliferating. Assessments are now routinely
administered over the internet where students respond to test items containing text, images, tables,
diagrams, sound, and video. But the growth of internet-based computerized testing has also focused
our attention on the need for new testing procedures and practices because this form of assessment
requires a continual supply of new test items. Automatic item generation is the process of using item
models to generate test items with the aid of computer technology. Automatic item generation can be
used to initially develop item banks and then replenish the banks needed for computer-based testing.
The purpose of our paper was to describe seven topics that are central to the development and use of
item models for automatic item generation. We defined item model and highlighted related concepts;
we described how item models are developed; we presented an item model taxonomy; we illustrated
how item models can be used for automatic item generation; we outlined some benefits of using item
models; we introduced the idea of an item model bank; and we demonstrated how statistical
procedures could be used to calibrate the item parameter estimates for generated items without the
need for extensive field or pilot testing. We also attempted to contextualize the growing interest in
automatic item generation by highlighting the fact that the science of educational assessment is
beginning to influence educational measurement theory and practice and by claiming that
interdisciplinary forces and factors are beginning to exert a stronger effect on how we solve problems
in the discipline of educational assessment.
Research on item models is warranted in at least two different areas. The first area is item model
development. To our knowledge, there has been no focused research on item model development.
Currently, the principles, standards, and practices that guide traditional item development are also
recommended for use with item model development. These practices have been used to design and
develop item model examples that are cited in the literature (e.g., Bejar et al., 2003; Case & Swanson,
2002; Gierl et al., 2008). But much more research is required on designing, developing, and, most
importantly, evaluating the items produced by these models. By working more closely with test
development specialists in diverse content areas, researchers can begin to better understand how to
design and develop item models by carefully documenting the process. Research must also be
conducted to evaluate these item models by focusing on their generative capacity (i.e., the number of
items that can be generated from a single item model) as well as their generative veracity (i.e., the
usefulness of the generated items, particularly from the view of test development specialists and
content experts).
The second area is the calibration of generated items using an item modeling approach. As noted by
Drasgow et al. (2006), automatic item generation can minimize, if not eliminate, the need for item field
or pilot testing because items generated from a parent model can be pre-calibrated, meaning that the
statistical characteristics from the parent item model can be applied to the generated items. We
illustrated how the LLTM-R could be used to estimate the difficulty parameter for 243 generated items
in a diagnostic mathematics program. But a host of other statistical procedures are also available for
estimating the statistical characteristics of generated items, including the 2PL-constrained model
(Embretson, 1999), the hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical
model (Sinharay, Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response
function approach (Mislevy, Wingersky, & Sheehan, 1994). These different statistical procedures could
be used with the same item models to permit parameter estimate comparisons across generated items,
without the use of sample data. This type of study would allow researchers to assess the comparability
of the predicted item statistics across the procedures. These statistical procedures could also be used
with the same item models to permit parameter estimate comparisons across generated items relative
to parameter estimates computed from a sample of examinees who actually wrote the generated items.
This type of study would allow researchers to assess the predictive utility of the statistical procedures
(i.e., the agreement between the predicted item characteristics on the generated items using a
statistical procedure compared to the actual item characteristics on the generated items using examinee
response data), which, we expect, will emerge as the “gold standard” for evaluating the feasibility and,
ultimately, the success of automatic item generation.
REFERENCES
Bartram, D. (2006). Testing on the internet: Issues, challenges, and opportunities in the field of
occupational assessment. In D. Bartram & R. Hambleton (Eds.), Computer-based testing and the
internet (pp. 13-37). Hoboken, NJ: Wiley.
Bejar, I. I. (1990). A generative analysis of a three-dimensional spatial task. Applied Psychological
Measurement, 14, 237-245.
Bejar, I. I. (1996). Generative response modeling: Leveraging the computer as a test delivery medium
(ETS Research Report 96-13). Princeton, NJ: Educational Testing Service.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C.
Kyllonen (Eds.), Item generation for test development (pp.199-217). Hillsdale, NJ: Erlbaum.
Bejar, I. I., Lawless, R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility
study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and
Assessment, 2(3). Available from http://www.jtla.org.
Bennett, R. (2001). How the internet will help large-scale assessment reinvent itself. Educational Policy
Analysis Archives, 9, 1-23.
Case, S. M., & Swanson, D. B. (2002). Constructing written test questions for the basic and clinical
sciences (3rd edition). Philadelphia, PA: National Board of Medical Examiners.
Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Erlbaum.
Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.),
Educational measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.
Drasgow, F., & Mattern, K. (2006). New tests and new items: Opportunities and issues. In D. Bartram &
R. Hambleton (Eds.), Computer-based testing and the internet (pp. 59-76). Hoboken, NJ: Wiley.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models.
Psychometrika, 64, 407-433.
Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P.
C. Kyllonen (Eds.), Item generation for test development (pp. 219-250). Mahwah, NJ: Erlbaum.
Embretson, S. E., & Daniel, R. C. (2008). Understanding and quantifying cognitive complexity level in
mathematical problem solving items. Psychological Science Quarterly, 50, 328-344.
Embretson, S. E., & Yang, X. (2007). Automatic item generation and cognitive psychology. In C. R. Rao &
S. Sinharay (Eds.), Handbook of Statistics: Psychometrics, Volume 26 (pp. 747-768). Amsterdam, The Netherlands: Elsevier.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta
Psychologica, 37, 359-374.
Gierl, M. J., Wang, C., & Zhou, J. (2008). Using the Attribute Hierarchy Method to make diagnostic
inferences about examinees’ cognitive skills in algebra on the SAT©. Journal of Technology,
Learning, and Assessment, 6 (6). Retrieved [date] from http://www.jtla.org.
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote
assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved [date]
from http://www.jtla.org.
Gierl, M. J., Alves, C., Roberts, M., & Gotzmann, A. (2009, April). Using judgments from content
specialists to develop cognitive models for diagnostic assessments. In J. Gorin (Chair), How to Build a
Cognitive Model for Educational Assessments. Paper presented in symposium conducted at the
annual meeting of the National Council on Measurement in Education, San Diego, CA.
Gierl, M. J., Alves, C., & Taylor-Majeau, R. (2010). Using the Attribute Hierarchy Method to make
diagnostic inferences about examinees’ skills in mathematics: An operational implementation of
cognitive diagnostic assessment. International Journal of Testing, 10, 318-341.
Glas, C. A. W., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied
Psychological Measurement, 27, 247-261.
Haladyna, T., & Shindoll, R. (1989). Item shells: A method for writing effective multiple-choice test
items. Evaluation and the Health Professions, 12, 97-106.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A “universe-defined” system of arithmetic
achievement tests. Journal of Educational Measurement, 5, 275-290.
Irvine, S. H., & Kyllonen, P. C. (2002). Item generation for test development. Hillsdale, NJ: Erlbaum.
Janssen, R. (2010). Modeling the effect of item designs within the Rasch model. In S. E. Embretson (Ed.),
Measuring psychological constructs: Advances in model-based approaches (pp. 227-245).
Washington DC: American Psychological Association.
Janssen, R., Schepers, J., & Peres, D. (2004). Models with item and group predictors. In P. De Boeck & M.
Wilson (Eds.), Explanatory item response models: A generalized linear and non-linear approach (pp.
189-212). New York: Springer.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modeling procedures for
constructing content-equivalent multiple-choice questions. Medical Education, 20, 53-56.
Leighton, J. P., & Gierl, M. J. (in press). The learning sciences in educational assessment: The role of
cognitive models. Cambridge, UK: Cambridge University Press.
Luecht, R. M. (2006a, May). Engineering the test: From principled item design to automated test
assembly. Paper presented at the annual meeting of the Society for Industrial and Organizational
Psychology, Dallas, TX.
Luecht, R. M. (2006b, September). Assessment engineering: An emerging discipline. Paper presented at
the Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton,
AB, Canada.
Luecht, R. M. (2007, April). Assessment engineering in language testing: From data models and
templates to psychometrics. Invited paper presented at the annual meeting of the National Council
on Measurement in Education, Chicago, IL.
Luecht, R. M. (2011, February). Assessment design and development, version 2.0: From art to
engineering. Invited paper presented at the annual meeting of the Association of Test Publishers,
Phoenix, AZ.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing &
T. M. Haladyna (Eds.), Handbook of test development (pp. 61-90). Mahwah, NJ: Erlbaum.
Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item
parameters: Expected response functions (ETS Research Report 94-28-ONR). Princeton, NJ:
Educational Testing Service.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 307-353). Westport, CT: American Council on Education/Praeger.
Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to
mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development
(pp. 361-384). Mahwah, NJ: Erlbaum.
Sinharay, S., Johnson, M. S., & Williamson, D. M. (2003). Calibrating item families and summarizing the
results using family expected response functions. Journal of Educational and Behavioral Statistics,
28, 295-313.
Sinharay, S., & Johnson, M. S. (2005). Analysis of data from an admissions test with item models (ETS
Research Report 05-06). Princeton, NJ: Educational Testing Service.
Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of
improved construct representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test
development (pp. 329-348). Mahwah, NJ: Erlbaum.
van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive testing. New York: Springer.
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied
Measurement in Education, 15, 337-362.
Appendix A
Item model #1 in mathematics used to generate isomorphic instances of numerical sequences.
If the pattern continues, then the next three numbers should be
700 695 690 685 _____ _____ _____
A. 680, 675, 670
B. 700, 695, 690
C. 680, 677, 675
D. 685, 680, 675
ITEM MODEL VARIABLES
Stem
If the pattern continues, then the next three numbers should be I1 I1-I2 I1-(2*I2) I1-(3*I2) _____ _____ _____
Elements
I1 Value Range: 700-800 by 5
I2 Value Range: 5-25 by 5
Options
A= I1 - ( 4 * I2 ), I1 - ( 5 * I2 ), I1 - ( 6 * I2 )
B= I1 - ( 3 * I2 ), I1 - ( 4 * I2 ), I1 - ( 5 * I2 )
C= I1 - ( 4 * I2 ), I1 - round( 4.5 * I2 ), I1 - ( 5 * I2 )
D= I1, I1 - ( 1 * I2 ) , I1 - ( 2 * I2 )
Key
A
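To make the generative logic of this item model concrete, the following minimal Python sketch enumerates the element ranges and assembles the stem, options, and key for every combination of I1 and I2. The sketch is not part of the original specification; the function name generate_model_1 and the dictionary output format are illustrative assumptions.

def generate_model_1():
    """Generate isomorphic items from item model #1 (numerical sequences)."""
    items = []
    for i1 in range(700, 801, 5):      # I1 value range: 700-800 by 5
        for i2 in range(5, 26, 5):     # I2 value range: 5-25 by 5
            stem = ("If the pattern continues, then the next three numbers should be "
                    f"{i1} {i1 - i2} {i1 - 2 * i2} {i1 - 3 * i2} _____ _____ _____")
            options = {
                "A": (i1 - 4 * i2, i1 - 5 * i2, i1 - 6 * i2),           # key
                "B": (i1 - 3 * i2, i1 - 4 * i2, i1 - 5 * i2),
                "C": (i1 - 4 * i2, i1 - round(4.5 * i2), i1 - 5 * i2),
                "D": (i1, i1 - 1 * i2, i1 - 2 * i2),
            }
            items.append({"stem": stem, "options": options, "key": "A"})
    return items

print(len(generate_model_1()))   # 21 values of I1 x 5 values of I2 = 105 instances

With 21 values of I1 and five values of I2, this single model yields 105 isomorphic instances.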
Item model #2 in mathematics used to generate isomorphic instances of numerical comparisons.
The number that is 1 group of 5 fewer than 201 is ...
A. 196
B. 190
C. 197
D. 191
ITEM MODEL VARIABLES
Stem
The number that is I1 group of I2 fewer than I3 is ...
Elements
I1 Value Range: 1-3 by 1
I2 Value Range: 5-10 by 5
I3 Value Range: 201-245 by 3
Options
A= I3 - ( I2 * I1 )
B= I3 - ( I2 * ( I1 + 1 ) )
C= I3 - ( I2 * I1 ) + 1
D= I3 - ( I2 * ( I1 + 1 ) ) - 1
Key
A
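A similar sketch, again illustrative rather than part of the original model, shows how one instance of this model is rendered by substituting element values directly into the stem. The parent item above corresponds to I1 = 1, I2 = 5, and I3 = 201; the printed positions of the distractors may be shuffled.

def render_model_2(i1, i2, i3):
    """Render one instance of item model #2 (numerical comparisons)."""
    stem = f"The number that is {i1} group of {i2} fewer than {i3} is ..."
    options = {
        "A": i3 - (i2 * i1),              # key
        "B": i3 - (i2 * (i1 + 1)),
        "C": i3 - (i2 * i1) + 1,
        "D": i3 - (i2 * (i1 + 1)) - 1,
    }
    return {"stem": stem, "options": options, "key": "A"}

# Element values for the parent item shown above (distractor order in print may differ)
print(render_model_2(1, 5, 201))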
Item model #3 in mathematics used to generate isomorphic instances for addition.
What is 15 + 18 ?
A. 33
B. 48
C. 32
D. 34
ITEM MODEL VARIABLES
Stem
What is I1 + I2 ?
Elements
I1 Value Range: 15-30 by 3
I2 Value Range: 15-30 by 3
Options
A= I1 + I2
B= I1 + I2 + 1
C= I1 + I1 + I2 - 1
D= I1 + I1 + I2
Key
A
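Because option formulas within a model can produce numerically close values, an operational generator would likely verify that every generated instance has four distinct options. The sketch below is an illustrative check, not part of the original specification; it enumerates all element combinations for this model and flags any instance with duplicate options.

def check_model_3():
    """Enumerate item model #3 (addition) and flag instances with duplicate options."""
    flagged = []
    for i1 in range(15, 31, 3):          # I1 value range: 15-30 by 3
        for i2 in range(15, 31, 3):      # I2 value range: 15-30 by 3
            options = [i1 + i2,          # A (key)
                       i1 + i2 + 1,      # B
                       i1 + i1 + i2 - 1, # C
                       i1 + i1 + i2]     # D
            if len(set(options)) < 4:
                flagged.append((i1, i2, options))
    return flagged

print(check_model_3())   # prints [] because no element combination yields duplicate options

For this model no combination is flagged, since the four option formulas differ by at least one for every allowable value of I1 and I2.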
Item model #4 in mathematics used to generate isomorphic instances for fractions.
What fraction of the measuring cup has oil in it?
A. 2/8
B. 2/3
C. 3/10
D. 3/8
ITEM MODEL VARIABLES
Stem
What fraction of the measuring cup has oil in it?
Diagram: I1 of Water and I2 of oil in one cup.
Elements
I1 Value Range: 0.125-1.0 by 0.125
I2 Value Range: 0.125-1.0 by 0.125
Options
A= ( I2 * 8 ) / 8
B= ( ( I2 * 8 ) + ( I1 * 8 ) ) / 8
C= ( ( I2 * 8 ) + ( I1 * 8 ) ) / 10
D= ( I2 * 8 ) / ( ( I2 * 8 ) + 1 )
Key
A
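For this item model, a constraint on the elements is implied but not stated: the water (I1) and oil (I2) fractions must together fit within one cup. The sketch below assumes that constraint (I1 + I2 ≤ 1) and renders each option as a numerator/denominator string following the option formulas listed above; the function name and output format are illustrative, not part of the original model.

def generate_model_4():
    """Generate instances of item model #4 (fractions), rendering options as n/d strings."""
    items = []
    eighths = [i / 8 for i in range(1, 9)]        # value range: 0.125-1.0 by 0.125
    for i1 in eighths:                            # I1: fraction of the cup that is water
        for i2 in eighths:                        # I2: fraction of the cup that is oil
            if i1 + i2 > 1:                       # assumed constraint: water and oil fit in one cup
                continue
            n1, n2 = int(i1 * 8), int(i2 * 8)     # water and oil expressed as counts of eighths
            options = {
                "A": f"{n2}/8",                   # key: oil as eighths of the cup
                "B": f"{n2 + n1}/8",
                "C": f"{n2 + n1}/10",
                "D": f"{n2}/{n2 + 1}",
            }
            items.append({"water": i1, "oil": i2, "options": options, "key": "A"})
    return items

print(len(generate_model_4()))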
Appendix B
The cognitive feature codes were used to develop the four parent items in our example. The item feature codes serve as variables that could
affect the difficulty level of the generated items. An illustrative encoding of these codes follows the table.
Cognitive Feature Code 1 (Number Sequencing)
  Item Feature 1: 0 = All start patterns are 0; 1 = All start patterns not 0
  Item Feature 2: 0 = Pattern by 10s; 1 = Pattern by 20s and 5s; 2 = Pattern by 15s and 25s
  Item Feature 3: 0 = First number ends with 0; 1 = First number does not end with 0
Cognitive Feature Code 2 (Numerical Comparisons)
  Item Feature 1: 0 = No use of odd number; 1 = Use of odd number
  Item Feature 2: 0 = 1 group less; 1 = 2 groups less; 2 = 3 groups less
  Item Feature 3: 0 = Group size of 10; 1 = Group size of 5
Cognitive Feature Code 3 (Addition)
  Item Feature 1: 0 = Sum of last digit < 10; 1 = Sum of last digit > 10
  Item Feature 2: 0 = No use of odd numbers; 1 = One use of odd numbers; 2 = Two uses of odd numbers
  Item Feature 3: 0 = Use of number in multiples of 10; 1 = No number with multiples of 10
Cognitive Feature Code 4 (Fractions)
  Item Feature 1: 0 = Some parts are 1/8; 1 = No parts are 1/8
  Item Feature 2: 0 = Lowest common denominator < 8; 1 = Lowest common denominator = 8
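To illustrate how these codes might travel with the generated items, the following sketch encodes the table as a nested Python dictionary keyed by cognitive feature code and item feature code, and tags the addition parent item with its feature values. The data structure and names are illustrative assumptions, not part of the original study.

FEATURE_CODES = {
    1: {  # Number Sequencing
        1: {0: "All start patterns are 0", 1: "All start patterns not 0"},
        2: {0: "Pattern by 10s", 1: "Pattern by 20s and 5s", 2: "Pattern by 15s and 25s"},
        3: {0: "First number ends with 0", 1: "First number does not end with 0"},
    },
    2: {  # Numerical Comparisons
        1: {0: "No use of odd number", 1: "Use of odd number"},
        2: {0: "1 group less", 1: "2 groups less", 2: "3 groups less"},
        3: {0: "Group size of 10", 1: "Group size of 5"},
    },
    3: {  # Addition
        1: {0: "Sum of last digit < 10", 1: "Sum of last digit > 10"},
        2: {0: "No use of odd numbers", 1: "One use of odd numbers", 2: "Two uses of odd numbers"},
        3: {0: "Use of number in multiples of 10", 1: "No number with multiples of 10"},
    },
    4: {  # Fractions
        1: {0: "Some parts are 1/8", 1: "No parts are 1/8"},
        2: {0: "Lowest common denominator < 8", 1: "Lowest common denominator = 8"},
    },
}

# Example: the addition parent item "What is 15 + 18?" has a last-digit sum of 13 (> 10),
# one odd addend (15), and no addend that is a multiple of 10.
item_tags = {1: 1, 2: 1, 3: 1}
described = {feature: FEATURE_CODES[3][feature][value] for feature, value in item_tags.items()}
print(described)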
Despite the fact that test development is a growth industry that cuts across all levels of education and all the professions, there has never been a comprehensive, research-oriented Handbook to which everyone (developers and consumers) can turn for guidance. That is the mission of this book. The Handbook of Test Development brings together well-known scholars and test-development practitioners to present chapters on all aspects of test development. Each chapter contributor is not only a recognized expert with an academic and research background in their designated topic, each one has also had hands-on experience in various aspects of test development. This thirty two-chapter volume is organized into six sections: foundations, content, item development, test design, test production and administration, and post-test activities. The Handbook provides extensive treatment of such important but unrecognized topics as contracting for testing services, item banking, designing tests for small testing program, and writing technical reports. The Handbook is based on the Standards for Educational and Psychological Testing, which serve as the foundation for sound test development practice. These chapters also suggest best test development practices and highlight methods to improve test validity evidence. This book is appropriate for graduate courses and seminars that deal with test development and usage, professional testing services and credentialing agencies, state and local boards of education, and academic libraries serving these groups.