Proceedings of IAC 2018 in Vienna
Teaching, Learning and E-learning (IAC-TLEl 2018)
Management, Economics and Marketing (IAC-MEM 2018)
Engineering, Transport, IT and Artificial Intelligence (IAC-ETITAI 2018)
Friday - Saturday, July 6 - 7, 2018
Czech Technical University in Prague
How to build a Computerized Adaptive Test with free
software and pedagogical relevance?
Rodrigo TRAVITZKI*, Douglas De Rizzo MENEGHETTI,
Ocimar Munhoz ALAVARSE, Érica Maria de Toledo CATALANI
University São Francisco, Psychology Department, R. Waldemar César da Silveira, 105, Campinas, São Paulo, Brazil
FEI University Center, 3972-B Humberto de Alencar Castelo Branco Ave, São Bernardo do Campo,
São Paulo, Brazil, email@example.com
University of São Paulo, Faculty of Education, 308 Av. da Universidade, São Paulo, Brazil,
University of São Paulo, Faculty of Education, 308 Av. da Universidade, São Paulo, Brazil,
This paper describes a pilot project carried out in the city of São Paulo (Brazil), focusing on the algorithm building process,
with the main goal being to create a Computerized Adaptive Test (CAT) based on a national standardized test for second
grade students (about eight years old at the end of the school year). The CAT version was to be smaller and more accurate
than the non-adaptive one. The theoretical basis is Item Response Theory, and the programming language is R. Here we
describe the fundamentals of the algorithm and the simulations used to build and analyze it, comparing software packages
and methods with regard to accuracy and speed. The simulations were also useful for adjusting some parameters according to our
goals and the item bank. Finally, we present results of the actual application of the CAT to 1,160 students in 15 municipal schools,
corroborating the quality of the test in several aspects. In this paper, we also propose a CAT stopping criterion based on evaluation,
rather than just measurement. It can be useful when the proficiency scale is discretized into levels, each one with a different
pedagogical interpretation.
Keywords: computerized adaptive tests, item response theory, simulations, free software
1. INTRODUCTION
External assessments have become increasingly important around the world. There are several challenges to be
faced so that evaluations, or at least assessments, can contribute to improving the quality of education. Some
of them concern the interpretation of the results and their use in the management of the educational system and
in the daily life of the classroom, mainly when literacy teaching is involved. Others refer to logistical
difficulties related to the security and management of large amounts of paper. A third type of challenge concerns
the technical quality of the test as a measuring instrument. A Computerized Adaptive Test (CAT) can be
seen as more than an assessment device: it is an evaluation methodology that significantly contributes to
overcoming the last two types of challenge, while still contributing mildly to the first.
The process of literacy is especially important in the initial years of elementary education, when teachers
dedicate much time to the development of children's reading and writing skills. National indicators related to
the development of these competences have been unsatisfactory considering the entire Brazilian school-age
population. So, we decided to construct a CAT based on a test which, in its printed version, is applied nationally
to students of the second grade of the Brazilian nine-year primary education system. This test is called Provinha
Brasil; it was created and is managed by the National Institute of Studies and Research in Education "Anísio
Teixeira" (Inep). It focuses on the diagnosis of the students' reading and mathematical abilities. In the current
work, we focus our efforts on building a CAT for the reading section of Provinha Brasil, which will be called
from now on Provinha Brasil CAT – Reading.
* Corresponding author: Rodrigo Travitzki.
The construction of this CAT was part of a project developed by researchers from the Group of Studies and
Research in Educational Assessment (Gepave), linked to the Faculty of Education of the University of São
Paulo (Feusp), in partnership with Inep, a body linked to the Ministry of Education (MEC), and with the São
Paulo Municipal Department of Education (SMESP). The project involved central and regional managers (school
supervisors), school directors and coordinators, teachers and students. More information about the project, in
Portuguese, can be found in .
Provinha Brasil – Reading is an instrument built with a formative perspective. Applied since 2008, Provinha
is composed of two tests that are annually made available to the teachers: the first in March (at the beginning of
the school year) and the second in October (end of the school year), both containing 20 items. Items are
prepared according to a content specifications table for reading, pre-tested and calibrated according to statistical
standards by Inep specialists, and made available in test books to the country's teachers, together with guidance
for correction and interpretation of the students' scores. In the last editions, the items aimed towards assessing
writing skills were removed from the test for methodological reasons and new items began to focus on the
reading skills. The results (students' scores) in the tests are compared to a reading proficiency scale, which
additionally presents a pedagogical interpretation and intervention suggestions for students' progress.
The teachers themselves apply the test, collect students' answers and interpret the results. Items are made
available to teachers and managers after each application. This availability, according to teachers' opinions,
allows a better understanding of students' results.
In this project, we sought to create an adaptive test that has, on average, fewer items and more precision than
a conventional test. This characteristic can be observed in adaptive tests because, when selecting specific items
for each student, each new measure (based on the response to the item) adds some information to the previous
set of measures. And when adding information to the set, the estimation uncertainty (error) is reduced. To
understand this, one can imagine the opposite: a traditional test in which the student answered correctly all six
easy items that have been presented to him. In this situation, presenting a new item of low difficulty should not
change the estimate of proficiency, nor reduce the standard error of this estimate, which means the item is not
adding any new information. This is one of the great advantages of adaptive testing: optimizing the collection
of information about the student while avoiding unnecessary measurements.
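The "no new information" intuition above can be made concrete with the item information function of the two-parameter logistic (2PL) model used later in this paper: an item whose difficulty is far below the examinee's proficiency contributes almost nothing. Below is a minimal Python sketch (the project itself was implemented in R); the parameter values are illustrative only, assuming discriminations expressed per point of the 500/100 scale.

```python
import math

def fisher_info(theta, a, b):
    """Fisher information of a 2PL item (discrimination a, difficulty b)
    at proficiency theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# An easy item tells us almost nothing about a strong examinee,
# while an item matched to the examinee's level is maximally informative:
easy = fisher_info(theta=600, a=0.02, b=300)
matched = fisher_info(theta=600, a=0.02, b=600)
```

Information peaks when item difficulty equals the current proficiency estimate, which is exactly what the adaptive selection rules described later exploit.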
Barrada identifies four general objectives for a CAT and, depending on what one wishes, greater
importance can be given to one or another of them. These are: 1) reliability of the proficiency estimate;
2) item bank security; 3) content restrictions; 4) item bank maintenance. Some of these objectives are to some
extent mutually exclusive, such as 1) and 2). Objective 3), on the other hand, has a smaller effect on the other objectives,
depending mainly on the existence of a balanced item bank, that is, one that reflects the content constraints.
Considering an assessment periodically applied to students, it is important to consider objectives 2 and 4, using
techniques for controlling item exposure rates.
This paper explains the CAT algorithm and its creation process. We also propose a simple stopping criterion
focused on evaluation, rather than just measurement. All the simulations and the item selection algorithm used
in the main application were implemented with free software, using the R language. Some packages and functions are
compared through simulations in order to optimize measurement precision and processing speed. Some parameters are
adjusted according to the project goals and the available item bank. The final section shows some results of the actual
application of the CAT, corroborating the technologies used and the choices made.
2. OPERATION OF THE CAT
To be called both computerized and adaptive, a test needs a user-friendly computer interface and a module
capable of processing IRT-related data. The interface will be referred to herein as "computerized platform",
while the statistical processing of responses will be referred to as "algorithm". The computerized platform is
responsible for displaying items to the examinees and capturing responses. It works on-line and has been
accessed via intranet for all participating schools.
Item presentation was carried out using tablets connected to the Wi-Fi network of the schools. The audio
commands of the items were made available individually to the students using headphones. Figure 1 presents, in
a simplified way, the components of Provinha Brasil CAT - Reading and their interrelationships.
Figure 1: Overview of Provinha Brasil CAT - Reading
In this work, we detail the operation of the algorithm and its creation process. Item Response Theory is
the theoretical basis, and the algorithm's goal is to provide adaptive dynamics to the computerized testing platform. More
specifically, we seek to maximize the accuracy of the proficiency estimation of examinees and minimize the
number of items that are administered to these examinees, avoiding losses in instrument validity while speeding
up the testing procedure.
Proficiency estimation is performed based on the expected a posteriori (EAP) distribution with 21
quadrature points. The criteria for selecting items included in the algorithm were:
1. Maximum Fisher Information (MFI);
2. balanced selection of items among the descriptors of the content matrix.
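As an illustration of the estimation step, EAP with 21 quadrature points can be sketched in a few lines. This is a generic Python rendering, not the project's actual R code (which relied on the packages compared later); the three-item mini bank is hypothetical, with discriminations expressed per point of the 500/100 scale.

```python
import numpy as np

def eap_estimate(u, a, b, n_quad=21, mu=500.0, sigma=100.0):
    """Expected a posteriori (EAP) proficiency estimate and posterior SD
    for 0/1 responses u to 2PL items with parameters a, b."""
    theta = np.linspace(mu - 4 * sigma, mu + 4 * sigma, n_quad)  # quadrature points
    prior = np.exp(-0.5 * ((theta - mu) / sigma) ** 2)           # normal prior weights
    z = theta[:, None] * a[None, :] - (a * b)[None, :]           # a * (theta - b)
    p = 1.0 / (1.0 + np.exp(-z))                                 # 2PL probabilities
    like = np.prod(np.where(np.asarray(u)[None, :] == 1, p, 1.0 - p), axis=1)
    post = prior * like                                          # unnormalized posterior
    est = float(np.sum(theta * post) / np.sum(post))             # posterior mean
    sd = float(np.sqrt(np.sum((theta - est) ** 2 * post) / np.sum(post)))
    return est, sd

# Hypothetical mini-bank: three items of increasing difficulty
a = np.array([0.020, 0.025, 0.018])
b = np.array([450.0, 500.0, 550.0])
est, sd = eap_estimate([1, 1, 0], a, b)
```

With few items the posterior is pulled toward the N(500, 100) prior, which is also why Bayesian estimators behave worse when the population mean differs from the prior, as discussed in the results.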
Barrada identifies three types of stopping criteria for a CAT: 1) reaching a predetermined number of items;
2) achieving a minimum level of uncertainty in the proficiency estimate; 3) reaching a minimum threshold for the information that a
new item would add to the proficiency estimate. To determine the end of the test, the algorithm uses a mixed
criterion, combining:
1. number of test items (minimum of 8 and maximum of 20 items);
2. permitted limit of uncertainty (Standard error less than 35 points in the scale);
3. degree of confidence in determining the level of proficiency (one of five levels).
The first two criteria are widely used in adaptive tests. The third criterion was developed for this project,
and no references were found about it in the literature. It seems to be a significant, though technically simple,
contribution of this project to the state of the art in CAT.
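Combined, the three finalization criteria amount to a single predicate, sketched here in Python (the original algorithm was written in R; the `level_decided` flag stands for the confidence-in-level check detailed in section 3.2.1, and the numeric limits are those listed above):

```python
def should_stop(n_items, se, level_decided, n_min=8, n_max=20, se_max=35.0):
    """Mixed stopping rule: stop at the item-count ceiling, or, once the
    minimum length is satisfied, when either the standard error is small
    enough (criterion 2) or the proficiency level is already decided with
    the desired confidence (criterion 3)."""
    if n_items >= n_max:
        return True          # criterion 1: maximum test length reached
    if n_items < n_min:
        return False         # criterion 1: minimum test length not yet reached
    return se <= se_max or level_decided
```

Ordering matters: the length limits are checked first, so neither the error threshold nor the level check can end the test before 8 items or extend it past 20.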
3. METHODOLOGY
This section presents the technologies used in the study, as well as the simulation procedure employed. In this
pilot phase, stopping criterion (2) was not used due to the reduced number of items in the bank.
3.1. Software and hardware
Both the algorithm and the simulation procedures were written in the R programming language, which is
free, open source and specialized in statistics. The following packages were tested: catR 3.13, PP 0.6.1,
and irtoys 0.2.0. The simulations were first carried out in 2016 and repeated in February 2018 with new
versions of the packages. All the work was done on a laptop with a 4-core i7 processor at 2.50 GHz and 8 GB
of memory. The operating system was Linux Mint. No parallel processing or GPU acceleration was used.
For the development of the algorithm, five item selection methods were tested through simulation (besides
random selection), as well as seven methods for proficiency estimation. The methods were tested for accuracy
and speed. In addition, the simulations allowed the adjustment of two parameters in the algorithm: the
maximum standard error and the critical value of the confidence interval.
The simulations are based on IRT, more specifically on the probability that an individual with a given
proficiency correctly answers an item with known parameters, as described by the two-parameter logistic
function. Although different item banks have been tested, the results described here refer to the item bank
provided by Inep, composed of 40 items with two parameters each (item difficulty and discrimination).
For each situation, 1,000 simulations were made, each with 1,000 participants (with a normal proficiency
distribution, mean 500 and standard deviation 100), responding to 20 items out of the 40 in the original Inep
bank. The descriptors of each item were not included, only the two parameters of the logistic model.
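The response data for such simulations can be drawn directly from the 2PL model. The Python sketch below (the study used R) generates one cohort of 1,000 examinees from N(500, 100) answering a hypothetical 40-item bank; the item parameters are invented for illustration and are not the Inep values.

```python
import numpy as np

rng = np.random.default_rng(2016)
n_examinees, n_items = 1000, 40

theta = rng.normal(500.0, 100.0, n_examinees)   # "true" proficiencies
a = rng.uniform(0.010, 0.030, n_items)          # hypothetical discriminations (per scale point)
b = rng.normal(500.0, 100.0, n_items)           # hypothetical difficulties

# 2PL probability of a correct answer for every examinee-item pair
z = a[None, :] * (theta[:, None] - b[None, :])
p = 1.0 / (1.0 + np.exp(-z))
responses = (rng.random((n_examinees, n_items)) < p).astype(int)
```

Because the "true" proficiencies are known, each estimation method can then be scored by the absolute difference between the true and the estimated values, as in the figures below.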
For proficiency estimation, four methods were compared, one of them tested in several packages, totaling
seven implementations. Two methods are based on the likelihood principle: maximum likelihood and weighted
likelihood. The other two use Bayesian statistics: the expected a posteriori (EAP) method
and the modal Bayesian estimator.
The seven implementations compared across these three packages were:
• ML: maximum likelihood (catR package);
• WL: weighted likelihood (catR package);
• BM: modal Bayesian estimator (catR package);
• EAP: EAP method (thetaEst function, catR package);
• eapC: EAP method (eapEst function, catR package);
• eapI: EAP method (irtoys package);
• eapP: EAP method (PP package).
The item selection methods, all included in the 'catR' package, were as follows:
• random: random selection of items from the bank;
• MFI (Maximum Fisher Information): selects the item that returns the most information for the current
estimated proficiency;
• bOpt (Urry rule): selects the item with the difficulty level closest to the proficiency estimated so far;
• thOpt (Maximum Information with stratification): adaptation of the MFI method, used to increase the
security of the item bank by avoiding unnecessary overexposure of items;
• Progressive method : items are selected according to two elements, one relative to MFI and
another random. Throughout the test, the random element becomes less important. This promotes
greater security of the item bank;
• Proportional method : the item is selected according to probabilities related to MFI, also with the
aim of promoting greater security of the item bank.
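For the two simplest rules above, a schematic Python version helps show what is being optimized (the project used the corresponding routines of the catR package; the three-item bank here is hypothetical):

```python
import numpy as np

def fisher_info(theta, a, b):
    """2PL item information a^2 * p * (1 - p) at the current estimate theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def next_item_mfi(theta, a, b, administered):
    """MFI: pick the unseen item with the largest Fisher information."""
    info = fisher_info(theta, a, b).astype(float)
    info[list(administered)] = -np.inf       # never repeat an item
    return int(np.argmax(info))

def next_item_bopt(theta, a, b, administered):
    """bOpt (Urry rule): pick the unseen item with difficulty closest to theta."""
    dist = np.abs(b - theta).astype(float)
    dist[list(administered)] = np.inf
    return int(np.argmin(dist))

# Hypothetical three-item bank
a = np.array([0.02, 0.02, 0.02])
b = np.array([400.0, 500.0, 600.0])
```

When all discriminations are equal, MFI and bOpt coincide; they diverge on real banks, where a highly discriminating item can be more informative than a perfectly difficulty-matched one.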
3.2. Test finalization criteria
The main goal in developing the finalization criterion was to provide a test that has fewer items and more
accuracy than a similar but non-adaptive test. Therefore, three criteria were considered simultaneously. First, a
maximum limit of 20 items and a minimum of 8 were set beforehand, to ensure that at least 4 items from each of the
two axes are applied.
The second criterion consists of a maximum error defined in the proficiency estimation step, measured by
the Standard Error. This error starts high and decreases throughout the test as more items are answered. The
finalization criterion is to determine a maximum allowed limit for the Standard Error, according to the available
items, the target population and the objectives. Through simulations using the 40-item bank, we found that a
CAT of 12 or 13 items should have accuracy similar to that of a linear test of 20 items. According to our goals, we
defined an ideal length of 16 items, with a corresponding standard error of 35 (Figure 6).
The third criterion, not found in the literature, is detailed in the next section.
3.2.1. Reliability of evaluation
The criterion of reliability was conceived and developed especially for the purposes of Provinha Brasil CAT,
since the main objective of developing the CAT was the contribution to the evaluation process, considering not
only the accuracy of the proficiency measurement, but mainly to what extent the proficiency is contained in one
of the five levels of the scale. Each level categorizes the reading competence of the respondent, as well as
suggests differentiated pedagogical intervention. In this objective lies the difference between simply performing
a measurement of the students' proficiencies and actually making an evaluation of their performance. A test ends
in the moment that the proficiency and its confidence interval are fully contained inside a single proficiency
level, among the five levels defined for Provinha Brasil (Figure 2). It is worth mentioning that the points that
divide the scale of Provinha Brasil into five levels, also called cut-off points, resulted from a psychometric
anchoring process associated with a pedagogical analysis of the items performed by specialists.
Figure 2: Cut-off points and proficiency levels in the Provinha Brasil - Reading scale.
Thus, the criterion that usually defines the end of the test by the lowest error was modified to also consider
the confidence of the psychometrical evaluation in the five interpretable levels. After all, Provinha Brasil is not
a high-stakes test, one involving candidate selection or certification, in which the precision of proficiency
estimation is of utmost relevance. For example, in Figure 2, it is less relevant to differentiate the proficiencies of
examinees A (342) and B (418) than it is to know to which of the levels each examinee belongs.
As an example, in Figure 2, it can be observed that the proficiency measured for examinee B is closer to the
proficiency measured for examinee C than to that measured for examinee A. However, considering the cut-off
points defined by the pedagogical interpretation of the scale, examinees A and B are closer to each other than to
examinee C, who belongs to another level of the scale. In pedagogical terms, this means that students A and B
demonstrated mastery of skills that require similar interventions, while student C demonstrated mastery that requires other
interventions.
It must be taken into account, however, that there is a degree of uncertainty in the estimation of proficiency,
regardless of the method used. When a student answers a few items and their estimated proficiency is 418, for
example, we must ask ourselves how reliable this number is.
Certainly this depends on the quantity of items answered. Assuming the normality of the proficiency
distribution, the uncertainty of the estimate can be calculated from the Standard Error. The confidence interval,
in turn, is obtained by multiplying the Standard Error by the Critical Value, depending on the level of
confidence that is desired – for example 95% –, which corresponds to a Critical Value of 1.96. The
confidence interval is defined by a minimum and a maximum number, meaning that, in our case,
there is a 95% chance that the examinee's proficiency lies between these two numbers.
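The resulting check — does the confidence interval cross any cut-off point? — can be expressed in a few lines of Python (the project itself used R). The cut-off values below are placeholders, not the actual Provinha Brasil cut-offs, which come from Inep's anchoring process.

```python
def level_decided(theta, se, cutoffs, z=1.96):
    """True when the confidence interval theta +/- z * SE lies entirely
    within a single proficiency level, i.e. crosses no cut-off point."""
    lo, hi = theta - z * se, theta + z * se
    return not any(lo < c < hi for c in sorted(cutoffs))

# Hypothetical cut-off points dividing the scale into five levels
cutoffs = [350, 450, 550, 650]
```

A student near the middle of a level can therefore be "decided" with a relatively large standard error, while a student near a cut-off needs a much tighter interval — which is exactly the behavior contrasted in Figure 3.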
Figure 3 illustrates how reliability as a test stopping criterion can be used as a complement to the Maximum
Standard Error criterion. Student B's proficiency estimation has greater error than student A's, but B's
proficiency level has already been safely estimated within one of the five levels of the scale (level 5) after
answering correctly five items in an adaptive test. Student A, on the other hand, has already answered 11
items in this adaptive test, but has not yet been reliably evaluated, and could belong to either level 2 or level 3.
After all, the estimation error depends not only on the number of items presented, but also on the answers of
each student and on the parameters of the items. Moreover, the quantity and positioning of the cut-off points
strongly affect this criterion. In fact, student B benefited from the addition of this third criterion:
finishing faster without loss of test accuracy, accomplishing the main practical purpose of Provinha Brasil –
Reading: to provide a reliable evaluation of the child within one of the five levels of proficiency.
Figure 3: Representation of the confidence intervals of two estimated proficiencies.
4. RESULTS OF THE SIMULATIONS
This section presents the results obtained through simulation in the construction stage of the algorithm and
adjustment of the basic parameters.
4.1. Item selection and proficiency estimation methods
Item selection and proficiency estimation methods were tested for accuracy and processing speed. Figure 4
shows that random selection of items (non-adaptive) produced greater errors. Considering speed, all methods
were fast enough.
Figure 4: Measurement accuracy and processing speed, according to six item selection criteria. "Real" error =
absolute value of the difference between the "real" and the estimated proficiency.
The t-test revealed that all criteria presented less error than the absence of a criterion, represented by the
random selection method, but that the differences among the criteria are not significant (p < 0.05) in most
simulations. Thus, considering the specialized literature, we chose to include the Maximum Fisher Information
method in the algorithm. However, if Provinha Brasil CAT – or any other CAT – becomes consolidated as a
public policy, it may be necessary to review this technical choice, since it does not consider the safety and
sustainability of the item bank. Progressive or proportional methods might be more appropriate in this case.
Regarding proficiency estimation methods, results on measurement accuracy and speed are summarized in
Figure 5. There was no significant difference (p < 0.05) between the accuracy of the methods that presented the
least error (BM, EAP, eapC, eapI, WL) for a 20-item test with a mean proficiency of 500. However, it is worth
noting that, in populations with a mean proficiency different from expected (600 or 400, for example), the WL
method was more accurate (unpublished results) than the others, due to its greater adaptability and less
dependence on an a priori estimate of the population.
Figure 5: Measurement accuracy and processing speed (in seconds), according to seven proficiency estimation
methods. "Real" error = absolute value of the difference between the real and the estimated proficiency.
It must also be noted that the EAP method of the 'PP' package (eapP) presented a much greater error than the
others. This illustrates another important role of the simulations in the development of algorithms, which is to
prevent the use of packages of dubious quality, with inconsistent results. This precaution is especially important
when working with free software, but it is also necessary with proprietary software.
In short, considering both precision and speed simultaneously, the method chosen for proficiency estimation
was the EAP method of the 'irtoys' package.
4.2. Maximum standard error
In order to determine the maximum standard error to be accepted by the CAT, we considered the goal of
producing a more accurate and smaller test than a conventional, non-adaptive one. Simulating the test using the
parameters of the 40-item bank, the standard error of the conventional 20-item test corresponds to that of an
adaptive test of around 12 or 13 items (Figure 6). Therefore, to provide a smaller and more precise test, the
maximum limit for the Standard Error was defined as 35 points on the "Provinha Brasil" scale, which would
correspond to an adaptive test with 16 items.
Figure 6: Errors in tests with different sizes. Two types of errors: (A) standard error of proficiency
estimation; (B) difference between the "real" proficiency of the simulation and the estimated proficiency. Blue
refers to the CAT and red to the non-adaptive test.
5. APPLICATION RESULTS
The test was applied to students from 15 municipal schools in São Paulo. Here we describe results for the
1,160 second-grade students (about eight years old), who are the target of "Provinha Brasil", and for 823
first-grade students included for comparison. The proficiency average in the second grade was 495 points, while in the
first grade it was 417, confirming the quality of the item bank and the methods implemented, since the expected (a
priori) average for the target population was 500. The standard deviation in the second grade was 79 points, slightly
lower than expected for the Brazilian population (100), which also confirms the psychometric quality of
"Provinha Brasil" CAT - Reading, since a smaller variance would indeed be expected in this specific set of schools.
The size of the tests (Figure 7) also confirms the expected results due to the adjustment of the algorithm
parameters. The maximum standard error of 35 points was defined as corresponding to a test of 16 items, on
average. Finally, another result that corroborates the adaptive dynamics of the test is the item exposure rate,
when comparing first and second grade students (Figure 8).
Figure 7: Size of the tests in Provinha Brasil CAT - Reading, applied to the second grade of São Paulo city.
Figure 8: Relation between the item exposure rate and the IRT item difficulty parameter.
6. CONCLUSIONS
The actual application results indicate that "Provinha Brasil" CAT works well for its purposes, especially
because, as expected from the theory and the simulations: 1) most tests ended with 16 items, except those which
reached the maximum limit; 2) the time spent in the CAT was 12 minutes on average, against 15 minutes in the
non-adaptive test; 3) the proficiency average in the target population (second grade) was very close to the a priori
mean, while in the first grade it was lower; 4) the proficiency standard deviation was smaller than expected for the
national population, as it should be in this specific set of 15 schools; 5) the item exposure rate differed between
first- and second-grade students.
The simulations were useful to avoid packages with poor results (such as the PP package in R) and to choose functions
with good accuracy and speed. We also showed how to adjust some parameters, such as the maximum standard error,
through the use of simulation studies.
One possible limitation in the algorithm is the EAP estimation method, which produces less accurate results
when the population does not have an average proficiency near the expected a priori. Taking into account that
CAT can be used as a comparison tool for different grades, as confirmed by the results, it may be better to use a
non-Bayesian estimation method, such as Weighted Likelihood. Another limitation was the small item bank,
which caused the exposure rate of some items to be quite high. To avoid this, we recommend developing
improvements in the algorithm to control item bank security.
ACKNOWLEDGEMENTS
This work was conducted in partnership with the Department of Education of São Paulo City and financially
supported by Unesco.
REFERENCES
[1] M. Soares, Alfabetização: a questão dos métodos. São Paulo: Contexto, 2016.
[2] O. M. Alavarse, É. M. de T. Catalani, D. R. Meneghetti, and R. Travitzki, "Teste Adaptativo Informatizado como Recurso
Tecnológico para Avaliação da Alfabetização Inicial," Revista Iberoamericana de Sistemas, Cibernética e Informática, vol. 15,
no. 1, pp. 1–11, 2018.
[3] J. R. Barrada, "Tests adaptativos informatizados: una perspectiva general," Anales de Psicología, vol. 28, no. 1, pp. 289–302, 2012.
[4] J. R. Barrada, J. Olea, V. Ponsoda, and F. J. Abad, "A Method for the Comparison of Item Selection Rules in Computerized
Adaptive Testing," Applied Psychological Measurement, vol. 34, no. 6, pp. 438–452, Sep. 2010.
[5] F. B. Baker, The Basics of Item Response Theory. College Park, Md.: ERIC Clearinghouse on Assessment and Evaluation, 2001.
[6] R. D. Bock and R. J. Mislevy, "Adaptive EAP Estimation of Ability in a Microcomputer Environment," Applied Psychological
Measurement, vol. 6, no. 4, pp. 431–444, 1982.
[7] D. Magis and G. Raîche, "Random Generation of Response Patterns under Computerized Adaptive Testing with the R Package
catR," Journal of Statistical Software, vol. 48, no. 8, pp. 1–31, 2012.
[8] M. Reif, PP: Estimation of person parameters for the 1,2,3,4-PL model and the GPCM. 2014.
[9] I. Partchev, irtoys: A Collection of Functions Related to Item Response Theory (IRT). 2016.
[10] D. F. de Andrade, H. R. Tavares, and R. da C. Valle, Teoria da Resposta ao Item: Conceitos e Aplicações. São Paulo: AVALIA.
[11] F. M. Lord, Applications of item response theory to practical testing problems. Hillsdale, N.J.: L. Erlbaum Associates, 1980.
[12] T. A. Warm, "Weighted likelihood estimation of ability in item response theory," Psychometrika, vol. 54, no. 3, pp. 427–450, 1989.
[13] A. Birnbaum, "Statistical theory for logistic mental test models with a prior distribution of ability," Journal of Mathematical
Psychology, vol. 6, pp. 258–276, 1969.
[14] J. Revuelta and V. Ponsoda, "A comparison of item exposure control methods in computerized adaptive testing," Journal of
Educational Measurement, vol. 35, no. 4, pp. 311–327, 1998.
[15] D. F. Ferreira, Estatística Básica. UFLA, 2005.