The Role of Item Models in Automatic Item Generation

Abstract

Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates or prototypes, that highlight the features or elements in the assessment task that must be manipulated. Second, these item model elements are manipulated to generate new items with the aid of computer-based algorithms. With this two-step process, hundreds or even thousands of new items can be created from a single item model. The purpose of our article is to describe seven different but related topics that are central to the development and use of item models for automatic item generation. We start by defining item model and highlighting some related concepts; we describe how item models are developed; we present an item model taxonomy; we illustrate how item models can be used for automatic item generation; we outline some benefits of using item models; we introduce the idea of an item model bank; and finally, we demonstrate how statistical procedures can be used to estimate the parameters of the generated items without the need for extensive field or pilot testing.
The Role of Item Models in Automatic Item Generation
Mark J. Gierl
Hollis Lai
Centre for Research in Applied Measurement and Evaluation
University of Alberta
Paper Presented at the Symposium
Item Modeling and Item Generation for the Measurement of
Quantitative Skills: Recent Advances and Prospects
Annual Meeting of the National Council on Measurement in Education
New Orleans, LA
April, 2011
Item Models 2
INTRODUCTION
Randy Bennett (2001) claimed, a decade ago, that no topic would become more central to the
innovation and future practice in educational assessment than computers and the internet. His
prediction has proven to be accurate. Educational assessment and computer technology have evolved
at a staggering pace since 2001. As a result, many educational assessments, which were once given in a
paper-and-pencil format, are now administered by computer using the internet. Education Week’s
2009 Technology Counts, for example, reported that 27 US states now administer internet-based
computerized educational assessments. Many popular and well-known exams in North America such
as the Graduate Management Admission Test (GMAT), the Graduate Record Exam (GRE), the Test of
English as a Foreign Language (TOEFL iBT), and the American Institute of Certified Public Accountants
Uniform CPA examination (CBT-e), to cite but a few examples, are administered by computer over the
internet. Canadian testing agencies are also implementing internet-based computerized assessments.
For example, the Medical Council of Canada Qualifying Exam Part I (MCCQE I), which is written by all
medical students seeking entry into supervised clinical practice, is administered by computer.
Provincial testing agencies in Canada are also making the transition to internet-based assessment.
Alberta Education, for instance, will introduce a computer-based assessment for elementary school
students in 2011, as part of their Diagnostic Mathematics Program.
Internet-based computerized assessment offers many advantages to students and educators
compared to more traditional paper-based assessments. For instance, computers support the
development of innovative item types and alternative item formats (Sireci & Zenisky, 2006; Zenisky &
Sireci, 2002); items on computer-based tests can be scored immediately thereby providing students
with instant feedback (Drasgow & Mattern, 2006); computers permit continuous testing and testing on-
demand for students (van der Linden & Glas, 2010). But possibly the most important advantage of
computer-based assessment is that it allows educators to measure more complex performances by
integrating test items and digital media to substantially increase the types of knowledge, skills, and
competencies that can be measured (Bartram, 2006; Zenisky & Sireci, 2002).
The advent of computer-based testing has also raised new challenges, particularly in the area of
item development (Downing & Haladyna, 2006; Schmeiser & Welch, 2006). Large numbers of items are
needed to develop the banks necessary for computerized testing because items are continuously
administered and, therefore, exposed. As a result, these banks must be frequently replenished to
minimize item exposure and maintain test security. Because testing agencies are now faced with the
daunting task of creating thousands of new items for computer-based assessments, alternative
methods of item development are desperately needed. One method that may be used to address this
challenge is through automatic item generation (Drasgow, Luecht, & Bennett, 2006; Embretson & Yang,
2007; Irvine & Kyllonen, 2002). Automatic item generation represents a relatively new but rapidly
evolving research area where cognitive and psychometric theories are used to produce tests that
include items generated using computer technology. Automatic item generation requires two steps.
First, test development specialists develop item models, which are comparable to templates or
prototypes, that highlight the features or elements in the assessment task that must be manipulated.
Second, these item model elements are manipulated to generate new items with the aid of computer-
based algorithms. With this two-step process, hundreds or even thousands of new items can be created
from a single item model.
The purpose of our paper is to describe seven different but related topics that are central to the
development and use of item models for automatic item generation. We start by defining item model
and highlighting some related concepts; we describe how item models are developed; we present an
item model taxonomy; we illustrate how item models can be used for automatic item generation; we
outline some benefits of using item models; we introduce the idea of an item model bank; and finally,
we demonstrate how statistical procedures can be used to estimate the parameters of the generated
items without the need for extensive field or pilot testing. We begin by describing two general factors
that, we feel, will directly affect educational measurement, including emerging methods such as
automatic item generation, in the 21st century.
TWO FACTORS THAT WILL SHAPE EDUCATIONAL MEASUREMENT IN THE 21ST CENTURY
We assert the first factor that will shape educational measurement in the 21st century is the growing
view that the science of educational assessment will prevail in guiding the design, development,
administration, scoring, and reporting practices in educational testing. In their seminal chapter on
“Technology and Testing” in the 4th Edition of the handbook Educational Measurement, Drasgow,
Luecht, and Bennett (2006, p. 471) begin with this bold claim:
This chapter describes our vision of a 21st-century testing program that capitalizes on modern
technology and takes advantage of recent innovations in testing. Using an analogy from
engineering, we envision a modern testing program as an integrated system of systems. Thus,
there is an item generation system, an item pretesting system, an examinee registration
system, and so forth. This chapter discusses each system and illustrates how technology can
enhance and facilitate the core processes of each system.
Drasgow et al. present a view of educational measurement where integrated technology-enhanced
systems govern and direct all testing processes. Ric Luecht has coined this technology-based approach
to educational measurement “assessment engineering” (Luecht, 2006a, 2006b, 2007, 2011).
Assessment engineering is an innovative approach to measurement practice where engineering-based
principles and technology-enhanced processes are used to direct the design and development of
assessments as well as the analysis, scoring, and reporting of assessment results. With this approach,
the measurement specialist begins by defining the construct of interest using specific, empirically-
derived cognitive models of task performance. Next, item models are created to produce replicable
assessment tasks. Finally, statistical models are applied to the examinee response data collected using
the item models to produce scores that are both replicable and interpretable.
The second factor that will likely shape educational measurement in the 21st century stems from the
fact that the boundaries for our discipline are becoming more porous. As a result, developments from
other disciplines such as cognitive science, mathematical statistics, medical education, educational
psychology, operations research, educational technology, and computing science will permeate and
influence educational testing. These interdisciplinary contributions will also create opportunities for
both theoretical and practical change. That is, educational measurement specialists will begin to draw
on interdisciplinary developments to enhance their own research and practice. At the same time,
students across a host of other disciplines will begin to study educational measurement [1]. These
interdisciplinary forces that promote new ideas and innovations will begin to evolve, perhaps slowly at
first, but then at a much faster pace leading to even more changes in our discipline. It may also mean
that other disciplines will begin to adopt our theories and practices more readily as students with
educational measurement training move back to their own content domains and areas of specialization.
[1] We have already noticed this change in our own program. We currently have 14 students in the Measurement,
Evaluation, and Cognition (MEC) graduate program at the University of Alberta. These students represent a
diverse disciplinary base, which includes education, cognitive psychology, engineering, computing science,
medicine (one of our students is a surgery resident), occupational therapy, nursing, forensic psychology, statistics,
and linguistics.
ITEM MODELING: DEFINITION AND RELATED CONCEPTS
An item model (Bejar, 1996, 2002; Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2003;
LaDuca, Staples, Templeton, & Holzman, 1986), which has also been described as a schema (Singley &
Bennett, 2002), blueprint (Embretson, 2002), template (Mislevy & Riconscente, 2006), form (Hively,
Patterson, & Page, 1968), clone (Glas & van der Linden, 2003), and shell (Haladyna & Shindoll, 1989),
serves as an explicit representation of the variables in an assessment task, which includes the stem, the
options, and oftentimes auxiliary information (Gierl, Zhou, & Alves, 2008). The stem is the part of an
item which formulates context, content, and/or the question the examinee is required to answer. The
options contain the alternative answers with one correct option and one or more incorrect options or
distracters. When dealing with a multiple-choice item model, both stem and options are required. With
an open-ended or constructed-response item model, only the stem is created. Auxiliary information
includes any additional material, in either the stem or option, required to generate an item, including
digital media such as text, images, tables, diagrams, sound, and/or video. The stem and options can be
divided further into elements. These elements are denoted as strings, S, which are non-numeric values,
and integers, I, which are numeric values. By systematically manipulating the elements, measurement
specialists can generate large numbers of items from one item model. If the generated items or
instances of the item model are intended to measure content at similar difficulty levels, then the
generated items are isomorphic. When the goal of item generation is to create isomorphic instances,
the measurement specialist manipulates the incidental elements, which are the surface features of an
item that do not alter item difficulty. Conversely, if the instances are intended to measure content at
different difficulty levels, then the generated items are variants. When the goal of item generation is to
create variant instances, the measurement specialist can manipulate the incidental elements, but must
manipulate one or more radical elements in the item model. The radicals are the deep features that
alter item difficulty, and may even affect test characteristics such as dimensionality.
To illustrate some of these concepts, an example from Grade 6 mathematics is presented in Figure 1.
The item model is represented as the stem and options variables with no auxiliary information. The
stem contains two integers (I1, I2). The I1 element is Ann’s payment. It ranges from $1525 to
$1675 in increments of $75. The I2 element is the cost of the lawn per square metre, either $30/m² or
$45/m². The four alternatives, labelled A to D, are generated using algorithms produced from the integer
values I1 and I2 (including the correct option, which is A).
Figure 1. Simple item model in Grade 6 mathematics with two integer elements.
Ann has paid $1525 for planting her lawn. The cost of lawn is $45/m². Given the shape of her lawn is
square, what is the side length of Ann’s lawn?
A. 5.8
B. 6.8
C. 4.8
D. 7.3
ITEM MODEL VARIABLES
Stem
Ann has paid $I1 for planting her lawn. The cost of lawn is $I2/m². Given the shape of her lawn is
square, what is the side length of Ann’s lawn?
Elements
I1 Value Range: 1525-1675 by 75
I2 Value Range: 30 or 45
Options
A = \sqrt{I1 / I2}
B = \sqrt{I1 / I2} + 1
C = \sqrt{I1 / I2} - 1
D = \sqrt{I1 / I2} + 1.5
Key
A
DEVELOPING ITEM MODELS
Test development specialists have the critical role of designing and developing the item models
used for automatic item generation. The principles, standards, and practices that guide traditional item
development (cf. Case & Swanson, 2002; Downing & Haladyna, 2006; Schmeiser & Welch, 2006) have
been recommended for use in item model development. Although a growing number of item model
examples are available in the literature (e.g., Bejar et al., 2003; Case & Swanson, 2002; Gierl et al.,
2008), there are currently no published studies describing either the principles or standards required to
develop these models. Drasgow et al. (2006) advise test development specialists to engage in the
creative task of developing item models by using design principles and guidelines discerned from a
combination of experience, theory, and research. Initially, these principles and guidelines are used to
identify a parent item model. One way to identify a parent item model is by using a cognitive theory of
task performance. Within this theory, cognitive models, as described by Luecht in his assessment
engineering framework, may be identified or discerned. With this type of “strong theory” approach,
cognitive features are identified in such detail that item features that predict test performance can be
not only specified but also controlled. The benefit of using strong theory to create item models is that
item difficulty for the generated items is predictable and, as a result, the generated items may be
calibrated without the need for extensive field or pilot testing because the factors that govern the item
difficulty level can be specified and, therefore, explicitly modeled and controlled. Unfortunately, few
cognitive theories currently exist to guide our item development practices (Leighton & Gierl, in press).
As a result, the use of strong theory for automatic item generation has, thus far, been limited to narrow
content domains, such as mental rotation (Bejar, 1990) and spatial ability (Embretson, 2002).
In the absence of strong theory, parent item models can be identified using weak theory, by
reviewing items from previously administered exams or by drawing on an inventory of existing test
items in an attempt to identify an underlying structure. This structure, if identified, provides a point-of-
reference for creating alternative item models, where features in the alternative models can be
manipulated to generate new items. Test development specialists can also create their own unique
item models. The weak theory approach to developing parent models using previously administered
items, drawing on an inventory of existing items, or creating new models is well-suited to broad
content domains where few theoretical descriptions exist on the cognitive skills required to solve test
items (Drasgow et al., 2006). The main drawback of using weak theory to create item models is that
item difficulty for the generated items is unpredictable and, therefore, field or pilot testing may be
required.
ITEM MODEL TAXONOMY
Gierl et al. (2008) described a taxonomy of item model types, as a way of offering guidelines for
creating item models. The taxonomy pertains to generating multiple-choice items and classifies models
based on the different types of elements used in the stems and options. The stem is the section of the
model used to formulate context, content, and/or questions. The elements in the stem can function in
four different ways. Independent indicates that the ni element(s) (ni ≥ 1) in the stem are unrelated to
one another. That is, a change in one element will have no effect on the other stem elements in the
item model. Dependent indicates that all nd element(s) (nd ≥ 2) in the stem are directly related to one another.
Mixed Independent/Dependent includes both independent (ni ≥ 1) and dependent (nd ≥ 2) elements in
the stem, where at least one pair of stem elements is directly related. Fixed represents a constant
stem format with no variation or change.
The options contain the alternatives for the item model. The elements in the options can function in
three different ways. Randomly-selected options refer to the manner in which the distracters are
selected from their corresponding content pools. The distracters are selected randomly. Constrained
options mean that the keyed option and the distracters are generated according to specific constraints,
such as formulas, calculation, and/or context. Fixed options occur when both the keyed option and
distracters are invariant or unchanged in the item model.
By crossing the stem and options, a matrix of item model types can be produced (see Table 1). This
taxonomy is useful for creating item models because it provides the guiding principles necessary for
designing diverse models by outlining their structure, function, similarities, and differences. It can also
be used to help ensure that test development specialists do not design item models with exactly the
same elements. Ten functional combinations are designated with a checkmark, “√”. The two
remaining combinations are labelled not applicable, “NA”, because a model with a fixed stem and
constrained options is an infeasible item type and a model with a fixed stem and options produces a
single multiple-choice item type (i.e., a traditional multiple-choice item). Gierl et al. also presented 20
examples (i.e., two examples for each of the 10 cells in the item model taxonomy) to illustrate each
unique combination. Their examples were drawn from diverse content areas, including science, social
studies, mathematics, language arts, and architecture.
Table 1. Plausible Stem-by-Option Combinations in the Gierl et al. (2008) Item Model Taxonomy
                     Stem
Options              Independent   Dependent   Mixed   Fixed
Randomly Selected    √             √           √       √
Constrained          √             √           √       NA
Fixed                √             √           √       NA
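The crossing of stem and option types described above can be sketched in a few lines of Python. The type names and the two excluded cells are taken from the text; the function and variable names are illustrative assumptions and are not part of the Gierl et al. (2008) taxonomy.

```python
from itertools import product

STEM_TYPES = ["Independent", "Dependent", "Mixed", "Fixed"]
OPTION_TYPES = ["Randomly Selected", "Constrained", "Fixed"]

# Cells the taxonomy labels "NA": a fixed stem with constrained options is
# infeasible, and a fixed stem with fixed options is a single traditional item.
NOT_APPLICABLE = {("Fixed", "Constrained"), ("Fixed", "Fixed")}


def taxonomy_matrix():
    """Map every (stem type, option type) cell to True if it is functional."""
    return {
        (stem, option): (stem, option) not in NOT_APPLICABLE
        for stem, option in product(STEM_TYPES, OPTION_TYPES)
    }


def functional_combinations():
    """List the stem-by-option pairs that can serve as item model types."""
    return [cell for cell, usable in taxonomy_matrix().items() if usable]
```

Calling `functional_combinations()` yields the 10 usable cells out of the 12 produced by the crossing.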
USING ITEM MODELS TO AUTOMATICALLY GENERATE ITEMS
Once the item models are developed by the test development specialists, automatic item
generation can begin. Automatic item generation is the process of using item models to generate test
items with the aid of computer technology. The role of the test development specialist is critical for the
creative task of designing and developing meaningful item models. The role of computer technology is
critical for the generative task of systematically combining large numbers of elements in each model to
produce items. By combining content expertise and computer technology, item modeling can be used
to generate items. If we return to the simple math example in Figure 1, the generative process can be
illustrated. Recall, the stem in this example contains two integers (I1, I2). The generative task for this
example involves generating six items with the following I1, I2 combinations: I1=$1525 and I2=$30/m²;
I1=$1600 and I2=$30/m²; I1=$1675 and I2=$30/m²; I1=$1525 and I2=$45/m²; I1=$1600 and I2=$45/m²;
I1=$1675 and I2=$45/m².
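The generative step for this example can be illustrated with a minimal Python sketch. It is not the IGOR implementation. The option formulas (the key is the square root of I1/I2, with distracters offset by +1, -1, and +1.5) and the rounding to one decimal place are assumptions inferred from the worked values in Figure 1 (5.8, 6.8, 4.8, 7.3); all names are hypothetical.

```python
from itertools import product
from math import sqrt

# Stem from Figure 1 with placeholders for the two integer elements.
STEM = (
    "Ann has paid ${i1} for planting her lawn. The cost of lawn is "
    "${i2}/m². Given the shape of her lawn is square, what is the side "
    "length of Ann's lawn?"
)

I1_VALUES = range(1525, 1676, 75)  # 1525, 1600, 1675
I2_VALUES = (30, 45)


def generate_items():
    """Generate one item per (I1, I2) combination, six items in total."""
    items = []
    for i2, i1 in product(I2_VALUES, I1_VALUES):
        side = sqrt(i1 / i2)  # assumed key: side length of a square lawn
        options = {
            "A": round(side, 1),        # keyed option
            "B": round(side + 1, 1),    # assumed distracter offsets
            "C": round(side - 1, 1),
            "D": round(side + 1.5, 1),
        }
        items.append(
            {"stem": STEM.format(i1=i1, i2=i2), "options": options, "key": "A"}
        )
    return items
```

With these assumptions, the item generated for I1=$1525 and I2=$45/m² reproduces the four option values shown in Figure 1.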
Gierl et al. (2008, pp. 25-31) also created a software tool that automatically creates, saves, and
stores items. The software is called IGOR (which stands for Item GeneratOR). It was written in Sun
Microsystems Java SE 6.0. The purpose of IGOR is to generate multiple items from a single item
model. The user interface for IGOR is structured using the same sections as the example in Figure 1
(i.e., stem, elements, options). The Item Model Editor window is used to enter and structure each item
model (see Figure 2a). The editor has three components. The stem panel is the starting point for item
generation where the item prompt is specified. Next, the elements panel is used to identify the string
and integer variables as well as specify the constraints required among the elements for successful item
generation. The options panel is used to specify possible answers to a given test item. The options are
classified as either a key or distracter. The Elements and Options panels also contain three editing
buttons. The first of these adds a new element or option to its panel. The second opens a
window to edit the currently selected element or option. The third removes the currently
selected element or option from the model. To generate items from a model, the Test Item
Generator dialogue box is presented where the user specifies the item model file, the item bank output
file, and the answer key file. If the option ‘Create answer key’ is not selected, then the resulting test
bank will always display the correct answer as the last option (or alternative). If the option ‘Create
answer key’ is selected, then the resulting test bank will randomly order the options. Once the files
have been specified in the Test Item Generator dialogue box, the program can be executed by selecting
the ‘Generate’ button (see Figure 2b).
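The effect of the ‘Create answer key’ option on option order can be sketched as follows. This is an illustration of the behaviour described above, not IGOR's code; the function name, its parameters, and the assumption that the key differs from every distracter are hypothetical.

```python
import random


def order_options(key, distracters, create_answer_key, rng=None):
    """Return (ordered options, index of the key or None).

    Assumes the key is distinct from every distracter.
    """
    rng = rng or random.Random()
    options = list(distracters) + [key]
    if not create_answer_key:
        # No answer key requested: the correct answer is always listed last.
        return options, None
    # Answer key requested: options are randomly ordered, and the position
    # of the key is recorded so it can be written to the answer key file.
    rng.shuffle(options)
    return options, options.index(key)
```

For example, `order_options("5.8", ["6.8", "4.8", "7.3"], False)` leaves "5.8" in the last position, whereas passing `True` shuffles the four options and returns the position of the key.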
Figure 2. IGOR interface illustrating the (a) input panels and editing functions as well as the (b)
generating functions.
Preliminary research has been conducted with IGOR. Gierl et al., working with two mathematics
test development specialists, developed 10 mathematics item models. IGOR generated 331371 unique
items from the 10 item models. That is, each model produced, on average, 33137 items thereby
providing an initial demonstration of the practicality and feasibility of item generation using IGOR.
BENEFITS OF ITEM MODELING
Item modeling can enhance educational assessment in many ways. The purpose of item modeling is
to create a single model that yields many test items. Multiple models can then be developed which will
yield hundreds or thousands of new test items. These items, in turn, are used to generate item banks.
Computerized assessments or automatic test assembly algorithms then draw on a sample of the items
from the bank to create a new test. With this approach, item exposure through test administration is
minimized, even with continuous testing, because a large bank of operational items is available. Item
modeling can also lead to more cost-effective item development because the model is continually
re-used to yield many test items compared with developing each item for a test from scratch. Moreover,
costly, yet common, errors in item development (including omissions or additions of words, phrases,
or expressions as well as spelling, punctuation, capitalization, item structure, typeface, and formatting
problems) can be avoided because only specific elements in the stem and options are manipulated
across large numbers of items (Schmeiser & Welch, 2006). In other words, the item model serves as a
template or prototype where test development specialists manipulate only specific, well-defined
elements. The remaining components in the template or prototype are not altered. The view of an
item model as a template or prototype with both fixed and variable elements contrasts with the more
conventional view of a single item where every element is unique, both within and across items.
Drasgow et al. (2006) explain:
The demand for large numbers of items is challenging to satisfy because the traditional
approach to test development uses the item as the fundamental unit of currency. That is, each
item is individually hand-crafted (written, reviewed, revised, edited, entered into a computer,
and calibrated) as if no other like it had ever been created before.
But possibly the most important benefit of item modeling stems from the logic of this approach to
test development. With item modeling, the model is treated as the fundamental unit of analysis where
a single model is used to generate many items compared with a more traditional approach where the
item is treated as the unit of analysis (Drasgow et al., 2006). Hence, with item modeling, the cost per
item is lower because the unit of analysis is multiple instances per model rather than single instances
per test development specialist. As a result, large numbers of items can be generated from a single item
model rather than relying on each test development specialist to develop a large number of unique
items. The item models can also be re-used, particularly when only a small number of the generated
items are used on a particular test form.
ITEM MODEL BANK
Current practices in test development and analysis are grounded in the test item. That is, each item is
individually written, reviewed, revised, edited, banked, and calibrated. If, for instance, a developer
intends to have 1236 operational test items in her bank, then she has 1236 unique items that must be
created, edited, reviewed, field tested, and, possibly, revised. An item bank serves as an electronic
repository for maintaining and managing information on each item. The maintenance task focuses on
item-level information. For example, the format of the item must be coded. Item formats and item
types can include multiple choice, numeric response, written response, linked items, passage-based
items, and items containing multimedia. The content for the item must be coded. Content fields
include general learning outcomes, blueprint categories, item identification number, item response
format, type of directions required, links, field test number, date, source of item, item sets, and
copyright. The developer attributes must be coded. These attributes include year the item was written,
item writer name, item writer demographics, editor information, development status, and review
status. The statistical characteristics for the item must also be coded. Statistical characteristics often
include word count, readability, classical item analyses, item response theory parameters, distracter
functioning, item history, field test item analyses, item drift, differential item functioning flags, and
history of item use.
The management task focuses on person-level information and process. That is, item bank
management requires explicit processes that guide the use of the item bank. Many different people
within a testing organization are often involved in the development process including the test
development specialists, subject matter experts (who often reside in both internal and external
committees), psychometricians, editors, graphic artists, word processors, and document production
specialists. Many testing programs field test their items and then review committees evaluate the items
prior to final test production. Hence, field tested items are often the item bank entry point. Rules must
be established for who has access to the bank and when items can be added, modified, or removed
during field testing. The same rules must also apply to the preparation of the final form of the test
because field testing can, and often does, occur in a different unit of a testing organization or at a
different stage in the development process and, therefore, may involve different people.
Item models, rather than single test items, serve as the unit of analysis in an item model bank. With
an item model bank, the test development specialist creates an electronic repository of item models for
maintaining and managing information on each model. However, a single item model which is
individually written, reviewed, revised, edited, and banked will also allow the developer to generate
many test items. If, for instance, a developer intends to have 331371 items, then she may only require
10 item models (as was illustrated in our previous section on “Using Item Models to Automatically
Generate Items”). Alternatively, if a particularly ambitious developer aspired to have a very large
inventory of 10980640827 items, then she would require 331371 item models [i.e., if each item model
generated, on average, 33137 mathematics items as was illustrated in our previous section on “Using
Item Models to Automatically Generate Items”, then 331371 item models could be used to generate
10980640827 (33137*331371) items].
An item model bank serves as an electronic repository for maintaining and managing information on
each item model. Because an item model serves as the unit of analysis, the banks contain a complex
assortment of information on every model, but not necessarily on every item. The maintenance task
focuses on model-level information. For example, the format of the item model must be coded.
Content fields must be coded. The developer attributes must be coded. Some statistical characteristics
of the model must also be coded, including word count, readability, and item model history. The item
model bank may also contain coded information on the item model ID, item model name, expected
grade levels for use, item model stem type, item model option type, number of constraints for the
model, the number of elements (e.g., integers and strings) in the model, and the number of generated
items.
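One way to picture the model-level information listed above is as a record in a small data structure. The sketch below is only an illustration: the field names paraphrase the attributes mentioned in the text, and the classes `ItemModelRecord` and `ItemModelBank` are hypothetical, not part of IGOR or any existing banking system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ItemModelRecord:
    """Model-level information coded for one item model (assumed fields)."""
    model_id: str
    name: str
    grade_levels: List[int]
    stem_type: str            # e.g., "Independent", "Dependent", "Mixed", "Fixed"
    option_type: str          # e.g., "Randomly Selected", "Constrained", "Fixed"
    n_constraints: int
    n_integer_elements: int
    n_string_elements: int
    n_generated_items: int
    word_count: Optional[int] = None
    readability: Optional[float] = None
    history: List[str] = field(default_factory=list)


@dataclass
class ItemModelBank:
    """Repository keyed by item model ID; items themselves are not stored."""
    models: Dict[str, ItemModelRecord] = field(default_factory=dict)

    def add(self, record: ItemModelRecord) -> None:
        self.models[record.model_id] = record

    def total_generated_items(self) -> int:
        return sum(m.n_generated_items for m in self.models.values())
```

The design choice reflected here is that the bank stores information on every model but not necessarily on every generated item, so the number of generated items is kept as a count on the model record.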
The management task focuses on person-level information and process. That is, item model bank
management requires explicit processes that guide the use of the item model bank. As with a more
traditional approach to item development, many different people within a testing organization are
involved in the process including the test development specialists, subject matter experts,
psychometricians, editors, graphic artists, and word processors. Because of the generative process
required for item model banking, an additional type of specialist may also be involved: the item model
programmer. This specialist is skilled in test development, but also in computer programming and
database management. In other words, this is a 21st century career! Their role is, first, to bridge the gap
between the test development specialist who creates the item model and the programming tasks
necessary to format and generate items using IGOR. In other words, the item model programmer helps
the test development specialist identify and manipulate the fixed and variable elements in each item
model (which is where test development experience will be helpful), enter the newly created item
models into IGOR, and then execute the program to generate items (the latter two steps require
computer programming skills, at least at this stage in the development of automatic item generation [2]).
Second, the item model programmer is responsible for entering the models into the item model bank,
maintaining the contents of the bank, and managing the use of the item model bank (which requires
database management skills). The responsibilities of the item model programmer are presented in
Figure 3.
[2] In 2009, we worked with 12 test development specialists at the Learner Assessment Branch at Alberta Education
to create item models for achievement tests in Grade 3 Language Arts and Mathematics as well as Grade 6 and 9
Language Arts, Mathematics, Science, and Social Studies. The project yielded 284 unique item models at all three
grade levels and in four different content areas. The test development specialists in this project had the most
difficulty specifying the fixed and variable elements in their model and, despite repeated training, were unable to
code their models and run IGOR consistently.
Figure 3. Basic overview of workflow using traditional item banking and item model banking.
[Figure: Traditional Item Banking Process (Item Writing, Item Bank, Form Assembly); Item Model Banking Process (Item Model Writing, Item Model Programmer, Item Model Database, Item Generation, Form Assembly).]
ESTIMATING STATISTICAL CHARACTERISTICS OF GENERATED ITEMS
Drasgow et al. (2006, p. 473) claim that:
Ideally, automatic item generation has two requirements. The first requirement is that an item
class can be described sufficiently for a computer to create instances of that class automatically
or at least semi-automatically. The second requirement is that the determinants of item
difficulty be understood well enough so that each of the generated instances need not be
calibrated individually.
In the previous six sections of this paper, we described and highlighted the issues related to Drasgow et
al.’s first requirement (describing an item class and automatically generating items) with the use of
item models. In this section, we address the challenges related to Drasgow et al.’s second requirement
by illustrating how generated items could be calibrated automatically. To be useful in test assembly,
items must have statistical characteristics. These characteristics can be obtained by administering the
items on field tests to collect preliminary information from a small sample of examinees. Item statistics
can also be obtained by embedding pilot items within a form as part of an operational test
administration, but not using the pilot items for examinee scoring. An alternative approach is to
account for the variation among the generated items in an item model and, using this information, to
estimate item difficulty with a statistical procedure thereby making field and pilot testing for the
generated items unnecessary (or, at least, dramatically reduced). A number of statistical procedures
have been developed to accomplish this task, including the linear logistic test model (LLTM; Fischer,
1973; see also Embretson & Daniel, 2008), the 2PL-constrained model (Embretson, 1999), the
hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical model (Sinharay,
Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response function approach
(Mislevy, Wingersky, & Sheehan, 1994).
Janssen (2010; see also Janssen, Schepers, & Peres, 2004) also described a promising approach for
modeling item design features using an extension of the LLTM called the random-effects LLTM (LLTM-R).
The probability that person $p$ successfully answers item $i$ is given by the LLTM as follows:
$$P(Y_{pi} = 1 \mid \theta_p, X_{ik}, \beta_k) = \frac{\exp\left(\theta_p - \sum_{k} \beta_k X_{ik}\right)}{1 + \exp\left(\theta_p - \sum_{k} \beta_k X_{ik}\right)}.$$
In this formula, the item difficulty parameter $\beta_i$ found in the Rasch model is replaced with an item
difficulty model specified as $\beta_i = \sum_{k} \beta_k X_{ik}$, where item difficulty is specified by a linear combination
of item predictors, including a parameter for the item design feature $X_{ik}$, which is the score of item $i$ on
item design feature $k$, and a parameter $\beta_k$, which is the difficulty weight associated with item design
feature $k$. Building on this LLTM formulation, the LLTM-R adds a random error term $\varepsilon_i$ to $\beta_i$ to estimate
that component of item difficulty that may not be accounted for in the item difficulty model:
$$\beta_i = \sum_{k} \beta_k X_{ik} + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2_{\varepsilon}).$$
By adding $\sigma^2_{\varepsilon}$ to the model, random variation can be used to account for design principles that yield the
same items but not necessarily the same item difficulty values across these items.
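The LLTM and LLTM-R formulation described above can be sketched in a few lines of code. This is a minimal illustration rather than an estimation routine: the function names are hypothetical, the weights in the usage lines are made-up values chosen so the arithmetic is easy to follow, and the random error term is passed in directly instead of being estimated from data.

```python
import math
from typing import Sequence


def item_difficulty(weights: Sequence[float], scores: Sequence[float],
                    epsilon: float = 0.0) -> float:
    """Linear item difficulty model: the sum over design features of
    (difficulty weight * item score), plus an optional random error
    term epsilon (the LLTM-R extension; epsilon = 0 gives the LLTM)."""
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    return sum(b * x for b, x in zip(weights, scores)) + epsilon


def success_probability(theta: float, beta: float) -> float:
    """Rasch-type probability that a person with ability theta answers
    an item with difficulty beta correctly:
    exp(theta - beta) / (1 + exp(theta - beta))."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# Hypothetical weights for an item that scores 1 on both design features.
beta = item_difficulty([0.5, -0.25], [1, 1])      # 0.25
p = success_probability(theta=0.25, beta=beta)    # 0.5, since theta equals beta
```

Separating the difficulty model from the response function mirrors the structure of the model: only `item_difficulty` changes when the random error term is added.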
Janssen (2010) also described the logic that underlies the LLTM-R, as it applies to automatic item
generation. The LLTM-R consists of two parts. The first part of the model specifies the person
parameters associated with $\theta_p$, which include $\mu_{\theta}$ and $\sigma^2_{\theta}$, and the second part specifies the item
parameters associated with $\beta_i$, which include $\beta_k$ and $\sigma^2_{\varepsilon}$. The parameter $\sigma^2_{\varepsilon}$ accounts for the random
variation of all items created within the same item design principles leading to similar, but not
necessarily the same, item difficulty levels. Taken together, the LLTM-R can be used to describe three
meaningful components: persons (i.e., $\mu_{\theta}$, $\sigma^2_{\theta}$), items ($\beta_k$), and item populations ($\sigma^2_{\varepsilon}$). For modeling
outcomes in an automatic item generation context, our focus is on the items and item populations
(where the items are nested within the item population).
Next, we develop a working example using the logic for automatic item generation presented in
Janssen (2010). Our example is developed using operational data from a diagnostic mathematics
program (see Gierl, Alves, & Taylor-Majeau, 2010). The purpose of the Gierl et al. (2010) study was to
apply the attribute hierarchy method in an operational diagnostic mathematics program at the
elementary school levels to promote cognitive inferences about students’ problem-solving skills. The
attribute hierarchy method is a statistical procedure for classifying examinees’ test item responses into a
set of structured attribute patterns associated with a cognitive model. Principled test design procedures
were used to design the exam and evaluate the student response data. To begin, cognitive models were
created by test development specialists who outlined the knowledge and skills required to solve
mathematical tasks in Grades 3 and 6. Then, items were written specifically to measure the skills in the
cognitive models. Finally, confirmatory statistical analyses were used to evaluate the student response
data by estimating model-data fit, attribute probabilities for diagnostic score reporting, and attribute
reliabilities. The cognitive model and item development steps from the diagnostic math program were
used in the current example to create item models.
Cognitive models for CDA have four defining characteristics (Gierl, Alves, Roberts, & Gotzmann,
2009). First, the model contains skills that are specified at a fine grain size because these skills must
magnify the cognitive processes underlying test performance. Second, the skills must be measurable.
That is, each skill must be described in a way that would allow a test developer to create an item to
measure that skill. Third, the skills must be instructionally relevant to a broad group of educational
stakeholders, including students, parents, and teachers. Fourth, a cognitive model will often reflect a
hierarchy of ordered skills within a domain because cognitive processes share dependencies and
function within a much larger network of inter-related processes, competencies, and skills. Figure 4
provides one example taken from a small section of a larger cognitive model developed to yield
diagnostic inferences in SAT algebra (cf. Gierl, Wang, & Zhou, 2008). As a prerequisite skill, cognitive
attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction,
multiplication, and division of numbers. In attribute A2, the examinee needs to have the basic
arithmetic skills (i.e., attribute A1) as well as knowledge about the property of factors. In attribute A3,
the examinee not only requires basic arithmetic skills (i.e., attribute A1) and knowledge of factoring (i.e.,
attribute A2), but also the skills required for the application of factoring. The attributes are specified at
a fine grain size; each attribute is measurable; each attribute, and its associated item, is intended to be
instructionally relevant and meaningful; and attributes are ordered from simple to more complex as we
move from A1 to A3.
Figure 4. Three sample items designed to measure three ordered skills in a linear cognitive model.

Cognitive model hierarchy (simplest to most complex): A1 (Arithmetic operations), A2 (Properties of factors), A3 (Application of factoring)

Hierarchy level A1 (requires A1: Arithmetic operations)
Item 1: If 6(m+n)-3=15, then m+n=?
A. 2
B. 3
C. 4
D. 5
E. 6

Hierarchy level A2 (requires A1: Arithmetic operations; A2: Properties of factors)
Item 2: If (x+2)/(m-1)=0 and m≠1, what is the value of x?
A. 2
B. -1
C. 0
D. 1
E. -2

Hierarchy level A3 (requires A1: Arithmetic operations; A2: Properties of factors; A3: Application of factoring)
Item 3: If 4a+4b = 3c-3d, then (2a+2b)/(5c-5d)=?
A. 2/5
B. 4/3
C. 3/4
D. 8/15
E. 3/10
The same test design principles were used to develop four item models in our working example. We
selected four parent items that had been field tested with 100 students from the diagnostic
mathematics project. These parent items, in turn, were used to create item models. The item models
were then used for item generation. The four item models are presented in Appendix A. The item
models in Appendix A are ordered from least to most complex according to their cognitive features,
meaning that item model 1 measures number sequencing skills; item model 2 measures number
sequencing skills and numerical comparison skills; item model 3 measures number sequencing skills,
numerical comparison skills, and addition skills; item model 4 measures number sequencing skills,
numerical comparison skills, addition skills, and ability to solve fractions (please note that the ordering
of the item models in this example has not been validated, rather the models are used to illustrate how
the LLTM-R could be used for item generation).
The LLTM-R was implemented in two steps. In step 1, parameters were estimated for the persons,
items, and item population with the LLTM. Using a field test consisting of 20 items specifically written to
measure the cognitive features of number sequencing, numerical comparison, addition, and fractions
(i.e., five items per cognitive feature), the person and item parameters were estimated using the
dichotomously-scored response vectors for 100 students who solved these items. The item feature
parameter estimates were specified as fixed effects in the LLTM and the person and item population
estimates were specified as random effects. The estimated item fixed-effect parameter weights and
their associated standard errors are presented in Table 2.
Table 2. Estimated Weights and Standard Errors Using the Cognitive Features Associated with the
Four Diagnostic Test Items

Cognitive Feature                     Estimate (Standard Error)
Number Sequencing (Least Complex)     -2.06 (0.22)
Numerical Comparisons                  0.94 (0.27)
Addition                               0.86 (0.25)
Fractions (Most Complex)               1.03 (0.25)
The estimated weights in Table 2 were then used to create a cognitive feature effect for each parent
item. The cognitive feature effect is calculated by taking the sum of the products for the pre-requisite
cognitive features as measured by each parent item. For example, a parent item that measures the
cognitive feature numerical comparisons would have a skill pattern of 1,1,0,0 because the features are
ordered in a hierarchy from least to most complex. This pattern would be multiplied and summed
across the estimated weights in Table 2 to produce the cognitive feature effect for each of the four
parent items in our example. The cognitive feature effect for the parent item measuring numerical
comparisons, for instance, would be (-2.06 × 1) + (0.94 × 1) + (0.86 × 0) + (1.03 × 0) = -1.13. The random
effects estimated for the person and item population, as reported in standard deviation units, are 0.99
and 0.33, respectively.
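The sum-of-products calculation for the cognitive feature effect can be sketched as follows. The function and variable names are hypothetical, and the weights are the rounded values printed in Table 2. With those rounded weights the numerical comparisons effect comes out as -1.12 rather than the reported -1.13; the small gap may reflect rounding of the published weights, which is an assumption on our part.

```python
# Estimated weights from Table 2, ordered from least to most complex feature.
WEIGHTS = {
    "number sequencing": -2.06,
    "numerical comparisons": 0.94,
    "addition": 0.86,
    "fractions": 1.03,
}


def skill_pattern(level: int, n_features: int = 4) -> list:
    """Hierarchical pattern: a parent item at `level` (1-based) requires every
    feature up to and including that level, e.g. level 2 -> [1, 1, 0, 0]."""
    return [1 if k < level else 0 for k in range(n_features)]


def cognitive_feature_effect(level: int) -> float:
    """Sum of products of the skill pattern and the Table 2 weights."""
    pattern = skill_pattern(level, len(WEIGHTS))
    return sum(x * w for x, w in zip(pattern, WEIGHTS.values()))


# Effect for each of the four parent items, rounded to two decimals.
effects = {name: round(cognitive_feature_effect(level), 2)
           for level, name in enumerate(WEIGHTS, start=1)}
# {'number sequencing': -2.06, 'numerical comparisons': -1.12,
#  'addition': -0.26, 'fractions': 0.77}
```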
In step 2, the four parent items were selected from the field test and used to create item models
(Appendix A), the item models were used to generate items, and the difficulty parameters for the
generated items were estimated. Number sequencing is the first, and hence, most basic cognitive
feature. This model generated 105 items. The second cognitive feature, numerical comparison,
resulted in a model that generated 90 items. The third cognitive feature was addition. The addition item
model generated 30 items. Fractions is the fourth, and therefore, most complex cognitive feature. The
fraction item model generated 18 items. In total, the four item models yielded 243 generated items.
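The generated item counts for the first two item models can be checked by enumerating the element ranges listed in Appendix A. The sketch below is an illustration with hypothetical helper names, not the IGOR implementation. It applies no constraints, which is why item models 3 and 4 are left out: their unconstrained counts would be 36 and 64, larger than the reported 30 and 18, so those models presumably use constraints that are not listed in Appendix A.

```python
from itertools import product


def value_range(start: int, stop: int, step: int) -> list:
    """Inclusive range, matching the 'start-stop by step' notation in Appendix A."""
    return list(range(start, stop + 1, step))


# Element ranges copied from Appendix A.
model_1 = {"I1": value_range(700, 800, 5), "I2": value_range(5, 25, 5)}
model_2 = {"I1": value_range(1, 3, 1), "I2": value_range(5, 10, 5),
           "I3": value_range(201, 245, 3)}


def enumerate_instances(elements: dict) -> list:
    """Every combination of element values, with no constraints applied."""
    names = list(elements)
    return [dict(zip(names, combo)) for combo in product(*elements.values())]


print(len(enumerate_instances(model_1)))  # 105
print(len(enumerate_instances(model_2)))  # 90
```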
For our illustrative example, the four item models are also differentiated by three key item features.
Each generated item had a different combination of these three item features. These features were
coded for each item and factored into our estimation process because they were expected to affect
item difficulty. The 11 item features and their codes (reported in parentheses) include all patterns with
0 (0), or not (1); no use of odd numbers (0) or use of odd numbers (1); sum of last digit is less than 10 (0)
or sum is greater than 10 (1); some parts are 1/8 (0) or no parts are 1/8 (1); pattern by 10s (0), pattern
by 20s and 5s (1), patterns by 15s and 25s (2); 1 group (0), 2 groups (1), 3 groups (2); no odd number (0),
one odd number (1), two odd numbers (2); lowest common denominator less than 8 (0) or lowest
common denominator greater than 8 (1); first number ends with 0 (0), or not (1); group size of 10 (0) or
group size of 5 (1); use of number in multiples of 10 (0) or no number with multiples of 10 (1). These
three item features, when crossed with the four cognitive features (i.e., four parent items), are shown in
Appendix B. These 11 item features serve as our best guess as to the variables that could affect item
difficulty for the generated items in each of the four item models. These item features would need to
be validated prior to use in a real item generation study.
To compute the difficulty parameter estimate for each of the generated items, four sources of
information must be combined. These sources include the cognitive feature effect (estimated in step 1),
the item feature coding weight, the item population standard deviation (from step 1), and random
error³. These sources are combined as follows: Difficulty Level for the Generated Item = Cognitive
Feature Effect + [(Item Feature Effect) × (Item Population Standard Deviation) × (Random Error)].
Returning to our previous example from step 1, the difficulty level for a generated item with the
numerical comparisons cognitive feature and an item feature effect of 0,1,1 (i.e., no use of odd numbers;
use of two groups; use of a group size of 5) would be -1.21 [-1.13 + (-0.5) × (0.33) × (0.48)]. The item
feature effect code of 0,1,1 is represented as -0.5 to standardize the item feature results in our
calculation, given that different cognitive features have different numbers of item features (see
Appendix B). This method is then applied to all 243 generated items to yield their item difficulty
estimates.
³ The random error component allowed us to introduce error into our analysis, which is how we modeled the
LLTM-R using the LLTM estimates from step 1 for our example.
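The combination rule above can be written as a one-line function. This is a minimal sketch with a hypothetical function name; the inputs in the usage lines are the values from the worked example, and the way the item feature code is standardized (for example, to -0.5) and the way the random error value is drawn are not reproduced here because the text does not specify them.

```python
def generated_item_difficulty(cognitive_feature_effect: float,
                              item_feature_effect: float,
                              item_population_sd: float,
                              random_error: float) -> float:
    """Difficulty = cognitive feature effect
    + (item feature effect * item population SD * random error)."""
    return cognitive_feature_effect + (
        item_feature_effect * item_population_sd * random_error)


# Worked example from the text: numerical comparisons parent item,
# standardized item feature effect of -0.5, SD of 0.33, random error of 0.48.
difficulty = generated_item_difficulty(-1.13, -0.5, 0.33, 0.48)
print(round(difficulty, 2))  # -1.21
```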
SUMMARY AND FUTURE DIRECTIONS
Internet-based computerized assessment is proliferating. Assessments are now routinely
administered over the internet where students respond to test items containing text, images, tables,
diagrams, sound, and video. But the growth of internet-based computerized testing has also focused
our attention on the need for new testing procedures and practices because this form of assessment
requires a continual supply of new test items. Automatic item generation is the process of using item
models to generate test items with the aid of computer technology. Automatic item generation can be
used to initially develop item banks and then replenish the banks needed for computer-based testing.
The purpose of our paper was to describe seven topics that are central to the development and use of
item models for automatic item generation. We defined item model and highlighted related concepts;
we described how item models are developed; we presented an item model taxonomy; we illustrated
how item models can be used for automatic item generation; we outlined some benefits of using item
models; we introduced the idea of an item model bank; and we demonstrated how statistical
procedures could be used to calibrate the item parameter estimates for generated items without the
need for extensive field or pilot testing. We also attempted to contextualize the growing interest in
automatic item generation by highlighting the fact that the science of educational assessment is
beginning to influence educational measurement theory and practice and by claiming that
interdisciplinary forces and factors are beginning to exert a stronger effect on how we solve problems
in the discipline of educational assessment.
Research on item models is warranted in at least two different areas. The first area is item model
development. To our knowledge, there has been no focused research on item model development.
Currently, the principles, standards, and practices that guide traditional item development are also
recommended for use with item model development. These practices have been used to design and
develop item model examples that are cited in the literature (e.g., Bejar et al., 2003; Case & Swanson,
2002; Gierl et al., 2008). But much more research is required on designing, developing, and, most
importantly, evaluating the items produced by these models. By working more closely with test
development specialists in diverse content areas, researchers can begin to better understand how to
design and develop item models by carefully documenting the process. Research must also be
conducted to evaluate these item models by focusing on their generative capacity (i.e., the number of
items that can be generated from a single item model) as well as their generative veracity (i.e., the
usefulness of the generated items, particularly from the view of test development specialists and
content experts).
The second area is the calibration of generated items using an item modelling approach. As noted by
Drasgow et al. (2006), automatic item generation can minimize, if not eliminate, the need for item field
or pilot testing because items generated from a parent model can be pre-calibrated, meaning that the
statistical characteristics from the parent item model can be applied to the generated items. We
illustrated how the LLTM-R could be used to estimate the difficulty parameter for 243 generated items
in a diagnostic mathematics program. But a host of other statistical procedures are also available for
estimating the statistical characteristics of generated items, including the 2PL-constrained model
(Embretson, 1999), the hierarchical IRT model (Glas & van der Linden, 2003), the Bayesian hierarchical
model (Sinharay, Johnson, & Williamson, 2003; Sinharay & Johnson, 2005), and the expected response
function approach (Mislevy, Wingersky, & Sheehan, 1994). These different statistical procedures could
be used with the same item models to permit parameter estimate comparisons across generated items,
without the use of sample data. This type of study would allow researchers to assess the comparability
of the predicted item statistics across the procedures. These statistical procedures could also be used
with the same item models to permit parameter estimate comparisons across generated items relative
to parameter estimates computed from a sample of examinees who actually wrote the generated items.
This type of study would allow researchers to assess the predictive utility of the statistical procedures
(i.e., the agreement between the predicted item characteristics on the generated items using a
statistical procedure compared to the actual item characteristics on the generated items using examinee
response data), which, we expect, will emerge as the “gold standard” for evaluating the feasibility and,
ultimately, the success of automatic item generation.
REFERENCES
Bartram, D. (2006). Testing on the internet: Issues, challenges, and opportunities in the field of
occupational assessment. In D. Bartram & R. Hambleton (Eds.), Computer-based testing and the
internet (pp. 13-37). Hoboken, NJ: Wiley.
Bejar, I. I. (1990). A generative analysis of a three-dimensional spatial task. Applied Psychological
Measurement, 14, 237-245.
Bejar, I. I. (1996). Generative response modeling: Leveraging the computer as a test delivery medium
(ETS Research Report 96-13). Princeton, NJ: Educational Testing Service.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C.
Kyllonen (Eds.), Item generation for test development (pp.199-217). Hillsdale, NJ: Erlbaum.
Bejar, I. I., Lawless, R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility
study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and
Assessment, 2(3). Available from http://www.jtla.org.
Bennett, R. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy
Analysis Archives, 9, 1-23.
Case, S. M., & Swanson, D. B. (2002). Constructing written test questions for the basic and clinical
sciences (3rd edition). Philadelphia, PA: National Board of Medical Examiners.
Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Erlbaum.
Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.),
Educational measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.
Drasgow, F., & Mattern, K. (2006). New tests and new items: Opportunities and issues. In D. Bartram &
R. Hambleton (Eds.), Computer-based testing and the internet (pp. 59-76). Hoboken, NJ: Wiley.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models.
Psychometrika, 64, 407-433.
Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P.
C. Kyllonen (Eds.), Item generation for test development (pp. 219-250). Mahwah, NJ: Erlbaum.
Embretson, S. E., & Daniel, R. C. (2008). Understanding and quantifying cognitive complexity level in
mathematical problem solving items. Psychology Science Quarterly, 50, 328-344.
Embretson, S. E., & Yang, X. (2007). Automatic item generation and cognitive psychology. In C. R. Rao &
S. Sinharay (Eds.) Handbook of Statistics: Psychometrics, Volume 26 (pp. 747-768). North Holland,
UK: Elsevier.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta
Psychologica, 37, 359-374.
Gierl, M. J., Wang, C., & Zhou, J. (2008). Using the Attribute Hierarchy Method to make diagnostic
inferences about examinees’ cognitive skills in algebra on the SAT©. Journal of Technology,
Learning, and Assessment, 6 (6). Retrieved [date] from http://www.jtla.org.
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote
assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved [date]
from http://www.jtla.org.
Gierl, M. J., Alves, C., Roberts, M., & Gotzmann, A. (2009, April). Using judgments from content
specialists to develop cognitive models for diagnostic assessments. In J. Gorin (Chair), How to Build a
Cognitive Model for Educational Assessments. Paper presented in symposium conducted at the
annual meeting of the National Council on Measurement in Education, San Diego, CA.
Gierl, M. J., Alves, C., & Taylor-Majeau, R. (2010). Using the Attribute Hierarchy Method to make
diagnostic inferences about examinees’ skills in mathematics: An operational implementation of
cognitive diagnostic assessment. International Journal of Testing, 10, 318-341.
Glas, C. A. W., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied
Psychological Measurement, 27, 247-261.
Haladyna, T., & Shindoll, R. (1989). Item shells: A method for writing effective multiple-choice test
items. Evaluation and the Health Professions, 12, 97-106.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A “universe-defined” system of arithmetic
achievement tests. Journal of Educational Measurement, 5, 275-290.
Irvine, S. H., & Kyllonen, P. C. (2002). Item generation for test development. Hillsdale, NJ: Erlbaum.
Janssen, R. (2010). Modeling the effect of item designs within the Rasch model. In S. E. Embretson (Ed.),
Measuring psychological constructs: Advances in model-based approaches (pp. 227-245).
Washington DC: American Psychological Association.
Janssen, R., Schepers, J., & Peres, D. (2004). Models with item and group predictors. In P. DeBoeck & M.
Wilson (Eds.), Explanatory item response models: A generalized linear and non-linear approach (pp.
189-212). New York: Springer.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modeling procedures for
constructing content-equivalent multiple-choice questions. Medical Education, 20, 53-56.
Leighton, J. P., & Gierl, M. J. (in press). The learning sciences in educational assessment: The role of
cognitive models. Cambridge, UK: Cambridge University Press.
Luecht, R. M. (2006a, May). Engineering the test: From principled item design to automated test
assembly. Paper presented at the annual meeting of the Society for Industrial and Organizational
Psychology, Dallas, TX.
Luecht, R. M. (2006b, September). Assessment engineering: An emerging discipline. Paper presented in
the Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton,
AB, Canada.
Luecht, R. M. (2007, April). Assessment Engineering in Language Testing: From Data Models and
Templates to Psychometrics. Invited paper presented at the annual meeting of the National Council
on Measurement in Education, Chicago, IL.
Luecht, R. M. (2011, February). Assessment design and development, version 2.0: From art to
engineering. Invited paper presented at the annual meeting of the Association of Test Publishers,
Phoenix, AZ.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing & T.
Haladyna (Eds.), Handbook of test development (pp. 61-90). Mahwah, NJ: Erlbaum.
Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item
parameters: Expected response functions (ETS Research Report 94-28-ONR). Princeton, NJ:
Educational Testing Service.
Schmeiser, C.B., & Welch, C.J. (2006). Test development. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 307-353). Westport, CT: National Council on Measurement in Education
and American Council on Education.
Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to
mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development
(pp. 361-384). Mahwah, NJ: Erlbaum.
Sinharay, S., Johnson, M. S., & Williamson, D. M. (2003). Calibrating item families and summarizing the
results using family expected response functions. Journal of Educational and Behavioral Statistics,
28, 295-313.
Sinharay, S., & Johnson, M. (2005). Analysis of data from an admissions test with item models. (ETS
Research Report 05-06). Princeton, NJ: Educational Testing Service.
Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of
improved construct representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test
development (pp.329-348). Mahwah, NJ: Erlbaum.
van der Linden, W., & Glas, C. A. W. (2010). Elements of adaptive testing. New York: Springer.
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied
Measurement in Education, 15, 337-362.
Appendix A
Item model #1 in mathematics used to generate isomorphic instances of numerical sequences.
If the pattern continues, then the next three numbers should be
700 695 690 685 _____ _____ _____
A. 680, 675, 670
B. 700, 695, 690
C. 680, 677, 675
D. 685, 680, 675
ITEM MODEL VARIABLES
Stem
If the pattern continues, then the next three numbers should be I1 I1-I2 I1-(2*I2) I1-(3*I2) _____ _____ _____
Elements
I1 Value Range: 700-800 by 5
I2 Value Range: 5-25 by 5
Options
A= I1 - ( 4 * I2 ), I1 - ( 5 * I2 ), I1 - ( 6 * I2 )
B= I1 - ( 3 * I2 ), I1 - ( 4 * I2 ), I1 - ( 5 * I2 )
C= I1 - ( 4 * I2 ), I1 - ( round( 4.5 * I2 ) ), I1 - ( 5 * I2 )
D= I1, I1 - ( 1 * I2 ) , I1 - ( 2 * I2 )
Key
A
Item model #2 in mathematics used to generate isomorphic instances of numerical comparisons.
The number that is 1 group of 5 fewer than 201 is ...
A. 196
B. 190
C. 197
D. 191
ITEM MODEL VARIABLES
Stem
The number that is I1 group of I2 fewer than I3 is ...
Elements
I1 Value Range: 1-3 by 1
I2 Value Range: 5-10 by 5
I3 Value Range: 201-245 by 3
Options
A= I3 - ( I2 * I1 )
B= I3 - ( I2 * ( I1 + 1 ) )
C= I3 - ( I2 * I1 ) + 1
D= I3 - ( I2 * ( I1 + 1 ) ) - 1
Key
A
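To make the relationship between the item model variables and a generated item concrete, the sketch below instantiates item model #2 for one set of element values. The function name and the returned dictionary layout are hypothetical choices for illustration and do not reflect how IGOR represents items. With the element values from the sample item (I1 = 1, I2 = 5, I3 = 201), the option formulas reproduce the four option values shown above, although the values for options B and D appear in the opposite order in the sample item.

```python
def generate_item_model_2(i1: int, i2: int, i3: int) -> dict:
    """Instantiate item model #2 (numerical comparisons) for one set of
    element values, using the stem and option formulas listed above."""
    stem = f"The number that is {i1} group of {i2} fewer than {i3} is ..."
    options = {
        "A": i3 - (i2 * i1),              # keyed (correct) response
        "B": i3 - (i2 * (i1 + 1)),        # distractor: one group too many
        "C": i3 - (i2 * i1) + 1,          # distractor: key plus 1
        "D": i3 - (i2 * (i1 + 1)) - 1,    # distractor: option B minus 1
    }
    return {"stem": stem, "options": options, "key": "A"}


item = generate_item_model_2(i1=1, i2=5, i3=201)
print(item["stem"])     # The number that is 1 group of 5 fewer than 201 is ...
print(item["options"])  # {'A': 196, 'B': 191, 'C': 197, 'D': 190}
```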
Item model #3 in mathematics used to generate isomorphic instances for addition.
What is 15 + 18 ?
A. 33
B. 48
C. 32
D. 34
ITEM MODEL VARIABLES
Stem
What is I1 + I2 ?
Elements
I1 Value Range: 15-30 by 3
I2 Value Range: 15-30 by 3
Options
A= I1 + I2
B= I1 + I2 + 1
C= I1 + I1 + I2 - 1
D= I1 + I1 + I2
Key
A
Item model #4 in mathematics used to generate isomorphic instances for fractions.
What fraction of the measuring cup has oil in it?
A. 2/8
B. 2/3
C. 3/10
D. 3/8
ITEM MODEL VARIABLES
Stem
What fraction of the measuring cup has oil in it?
Diagram: I1 of Water and I2 of oil in one cup.
Elements
I1 Value Range: 0.125-1.0 by 0.125
I2 Value Range: 0.125-1.0 by 0.125
Options
A= ( I2 * 8 ) / 8
B= ( I2 * 8 ) + ( ( I1 * 8 ) / 8 )
C= ( I2 * 8 ) + ( ( I1 * 8 ) / 10 )
D= ( I2 * 8) / ( ( I2 * 8 ) + 1 )
Key
A
Appendix B
The cognitive feature codes were used to develop the four parent items for our example. The item feature codes serve as variables that could
affect the difficulty level for the generated items.

Cognitive Feature Code         Item Feature Code   Value   Feature
1 (Number Sequencing)          1                   0       All start patterns are 0
                                                   1       All start patterns not 0
                               2                   0       Pattern by 10s
                                                   1       Pattern by 20s and 5s
                                                   2       Pattern by 15s and 25s
                               3                   0       First number ends with 0
                                                   1       First number does not end with 0
2 (Numerical Comparisons)      1                   0       No use of odd number
                                                   1       Use of odd number
                               2                   0       1 Group less
                                                   1       2 Groups less
                                                   2       3 Groups less
                               3                   0       Group size of 10
                                                   1       Group size of 5
3 (Addition)                   1                   0       Sum of Last Digit <10
                                                   1       Sum of Last Digit >10
                               2                   0       No use of odd numbers
                                                   1       One use of odd numbers
                                                   2       Two use of odd numbers
                               3                   0       Use of number in multiples of 10
                                                   1       No number with multiples of 10
4 (Fractions)                  1                   0       Some parts are 1/8
                                                   1       No parts are 1/8
                               2                   0       Lowest common denominator < 8
                                                   1       Lowest common denominator = 8