
Proceedings of IAC 2018 in Vienna

Teaching, Learning and E-learning (IAC-TLEl 2018)

and

Management, Economics and Marketing (IAC-MEM 2018)

and

Engineering, Transport, IT and Artificial Intelligence (IAC-ETITAI 2018)

Vienna, Austria

Friday - Saturday, July 6 - 7, 2018

ISBN 978-80-88203-06-3

Czech Technical University in Prague


How to build a Computerized Adaptive Test with free software and pedagogical relevance?

Rodrigo TRAVITZKI a,1, Douglas De Rizzo MENEGHETTI b, Ocimar Munhoz ALAVARSE c, Érica Maria de Toledo CATALANI d

a University São Francisco, Psychology Department, R. Waldemar César da Silveira, 105, Campinas, Brazil, r.travitzki@gmail.com
b FEI University Center, 3972-B Humberto de Alencar Castelo Branco Ave, São Bernardo do Campo, São Paulo, Brazil, douglasrizzo@fei.edu.br
c University of São Paulo, Faculty of Education, 308 Av. da Universidade, São Paulo, Brazil, ocimar@usp.br
d University of São Paulo, Faculty of Education, 308 Av. da Universidade, São Paulo, Brazil, ericamtc@usp.br

Abstract

This paper describes a pilot project carried out in the city of São Paulo (Brazil), focusing on the algorithm building process. The main goal was to create a Computerized Adaptive Test (CAT) based on a national standardized test for second grade students (about eight years old at the end of the school year). The CAT version was to be smaller and more accurate than the non-adaptive one. The theoretical basis is Item Response Theory, and the programming language is R. We describe the fundamentals of the algorithm and the simulations used to build and analyze it, comparing software packages and methods with regard to accuracy and speed. The simulations were also useful for adjusting some parameters according to our goals and the item bank. Finally, we present results of the CAT's actual application to 1,160 students from 15 municipal schools, corroborating the test quality in several aspects. We also propose a CAT stopping criterion based on evaluation, rather than just measurement, which can be useful when the proficiency scale is discretized into levels, each with a different pedagogical interpretation.

Keywords:

computerized adaptive tests, item response theory, simulations, free software

1. INTRODUCTION

External assessments have become increasingly important around the world. Several challenges must be faced so that such assessments can contribute to improving the quality of education. Some concern the interpretation of the results and their use in the management of the educational system and in the daily life of the classroom, especially when literacy teaching is involved. Others refer to logistical difficulties related to security and to managing large amounts of paper. A third type of challenge concerns the technical quality of the test as a measuring instrument. A Computerized Adaptive Test (CAT) is more than an assessment device: it is an evaluation methodology that significantly contributes to overcoming the last two types of challenge, while still contributing mildly to the first.

The process of literacy is especially important in the initial years of elementary education, when teachers

dedicate much time to the development of children's reading and writing skills [1]. National indicators related to

the development of these competences have been unsatisfactory considering the entire Brazilian school-age

population. We therefore decided to construct a CAT based on a test which, in its printed version, is applied nationally to students of the second grade of the Brazilian nine-year primary education system. This test, called Provinha Brasil, was created and is managed by the National Institute of Studies and Research in Education "Anísio Teixeira" (Inep). It focuses on diagnosing the students' reading and mathematical abilities. In the current work, we focus our efforts on building a CAT for the reading component of Provinha Brasil, referred to from now on as Provinha Brasil CAT – Reading.

1 Corresponding author: Rodrigo Travitzki.

The construction of this CAT was part of a project developed by researchers from the Group of Studies and Research in Educational Assessment (Gepave), linked to the Faculty of Education of the University of São Paulo (Feusp), in partnership with Inep, a body linked to the Ministry of Education (MEC), and with the São Paulo Municipal Department of Education (SMESP). The project involved central and regional managers (school supervisors), school directors and coordinators, teachers and students. More information about the project, in Portuguese, can be found in [2].

Provinha Brasil – Reading is an instrument built with a formative perspective. Applied since 2008, Provinha is composed of two tests made available to teachers each year: the first in March (at the beginning of the school year) and the second in October (at the end of the school year), both containing 20 items. Items are prepared according to a content specifications table for reading, pre-tested and calibrated according to statistical standards by Inep specialists, and made available in test books to the country's teachers, together with guidance for correcting and interpreting the students' scores. In recent editions, items aimed at assessing writing skills were removed from the test for methodological reasons, and new items began to focus on reading skills. The students' scores are compared to a reading proficiency scale, which additionally presents a pedagogical interpretation and intervention suggestions for students' progress.

The teachers themselves apply the test, collect the students' answers and interpret the results. Items are made available to teachers and managers after each application. According to the teachers' opinions, this availability allows a better understanding of students' results.

In this project, we sought to create an adaptive test that has, on average, fewer items and more precision than a conventional test. Adaptive tests can achieve this because, by selecting specific items for each student, each new measure (based on the response to the item) adds some information to the previous set of measures. And when information is added to the set, the estimation uncertainty (error) is reduced. To understand this, one can imagine the opposite: a traditional test in which the student has answered correctly all six easy items presented so far. In this situation, presenting a new item of low difficulty should neither change the estimate of proficiency nor reduce the standard error of that estimate, which means the item is not adding any new information. This is one of the great advantages of adaptive testing: it optimizes the collection of information about the student while avoiding unnecessary measurements.
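This reasoning has a direct expression in IRT: under the two-parameter logistic (2PL) model, the Fisher information an item carries about an examinee is I(θ) = a²P(θ)(1 − P(θ)), which vanishes when the item is far too easy or too hard for the current ability estimate. The project's code is in R; the following Python sketch, with hypothetical item parameters, illustrates the point:

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# An easy item (b = -2) carries almost no information about a high-ability
# student (theta = +2), but is maximally informative near its own difficulty.
easy_item = dict(a=1.2, b=-2.0)
print(item_information(2.0, **easy_item))   # near zero
print(item_information(-2.0, **easy_item))  # maximal for this item
```

Information peaks where P(θ) = 0.5, i.e. at θ = b for the 2PL model, which is why administering another easy item to a strong student barely shrinks the standard error.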

Barrada identifies four general objectives for a CAT; depending on what one wishes, greater importance can be given to one or another of them. These are: 1) reliability of the proficiency estimate; 2) item bank security; 3) content restrictions; 4) item bank maintenance. Some of these objectives are to some extent mutually exclusive, such as 1) and 2). Objective 3), on the other hand, has a smaller effect on the others, depending mainly on the existence of a balanced item bank, that is, one that reflects the content constraints [3]. For an assessment applied periodically to students, it is important to consider objectives 2 and 4, using techniques for controlling item exposure rates [4].

This paper explains the CAT algorithm and its creation process. We also propose a simple stopping criterion focused on evaluation, rather than just measurement. All the simulations and the item selection algorithm used in the main application were implemented with free software, using the R language. Some packages and functions are compared through simulations in order to optimize measurement precision and processing speed, and some parameters are adjusted according to the project goals and the available item bank. The final section shows results of the CAT's actual application, corroborating the technologies used and the choices made.

2. OPERATION OF THE CAT

To be called both computerized and adaptive, a test needs a user-friendly computer interface and a module capable of processing IRT-related data. The interface will be referred to herein as the "computerized platform", while the statistical processing of responses will be referred to as the "algorithm". The computerized platform is responsible for displaying items to the examinees and capturing their responses. It works online and was accessed via intranet in all participating schools.

Items were presented on tablets connected to the schools' Wi-Fi networks. The audio commands of the items were delivered individually to the students through headphones. Figure 1 presents, in a simplified way, the components of Provinha Brasil CAT - Reading and their interrelationships.


Figure 1: Overview of Provinha Brasil CAT - Reading

In this work, we detail the operation of the algorithm and its creation process. Item Response Theory [5] is the theoretical basis, and the algorithm's goal is to provide adaptive dynamics to the computerized testing platform. More specifically, we seek to maximize the accuracy of the examinees' proficiency estimates while minimizing the number of items administered to them, avoiding losses in instrument validity while speeding up the testing procedure.

Proficiency estimation is performed with the expected a posteriori (EAP) method [6] with 21 quadrature points. The criteria for selecting items included in the algorithm were:

1. Maximum Fisher Information (MFI) [4];
2. balanced selection of items among the descriptors of the content matrix.²
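The EAP estimator mentioned above averages the points of a fixed ability grid, weighting each point by a normal prior and by the likelihood of the observed response pattern. The paper used R packages (catR, irtoys) for this; below is an illustrative Python sketch on the Provinha Brasil scale (mean 500, SD 100), with hypothetical item parameters expressed per scale point:

```python
import math

def eap_estimate(responses, items, prior_mean=500.0, prior_sd=100.0, n_quad=21):
    """Expected a posteriori (EAP) ability estimate on a fixed quadrature grid.

    responses: list of 0/1 item scores.
    items: list of (a, b) 2PL parameters on the same scale as the prior;
    the values used below are hypothetical, not Inep's calibrated parameters.
    """
    # 21 evenly spaced quadrature points covering prior_mean +/- 4 SDs.
    grid = [prior_mean + prior_sd * (-4.0 + 8.0 * k / (n_quad - 1))
            for k in range(n_quad)]
    num = den = 0.0
    for q in grid:
        # Unnormalized normal prior weight at this grid point...
        w = math.exp(-0.5 * ((q - prior_mean) / prior_sd) ** 2)
        # ...times the likelihood of the observed response pattern.
        like = 1.0
        for u, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (q - b)))
            like *= p if u == 1 else 1.0 - p
        num += q * w * like
        den += w * like
    return num / den

# Three correct answers on items harder than the prior mean pull the
# estimate above 500; three wrong answers would pull it below.
items = [(0.02, 520.0), (0.02, 540.0), (0.02, 560.0)]
print(eap_estimate([1, 1, 1], items))
```

A discrimination of 0.02 per scale point corresponds to a = 2 on a standardized (mean 0, SD 1) ability scale with SD 100 rescaling.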

Barrada [3] identifies three types of stopping criteria for a CAT: 1) reaching a predetermined number of items; 2) reaching a minimum level of uncertainty in the proficiency estimate; 3) a minimum threshold on the information that a new item would add to the proficiency estimate. To determine the end of the test, the algorithm uses a mixed criterion, considering:

1. the number of test items (minimum of 8 and maximum of 20 items);
2. a permitted limit of uncertainty (standard error below 35 points on the scale);
3. the degree of confidence in determining the level of proficiency (one of five levels).

The first two criteria are widely used in adaptive tests [3]. The third criterion was developed for this project, and we found no references to it in the literature. It seems to be a significant, though technically simple, contribution of this project to the state of the art in CAT.

3. METHODS

This section presents the technologies used in the study, as well as the simulation procedure employed.

2 In this pilot phase, criterion (2) was not used due to the reduced number of items in the bank.


3.1. Software and hardware

Both the algorithm and the simulation procedures were written in the R programming language, which is free, open source and specialized in statistics. The following packages were tested: catR 3.13 [7], PP 0.6.1 [8] and irtoys 0.2.0 [9]. The simulations were first carried out in 2016 and repeated in February 2018 with new versions of the packages. All the work was done on a laptop with a 4-core, 2.50 GHz i7 processor and 8 GB of memory, running Linux Mint. No parallel processing or GPU acceleration was used.

3.2. Simulations

For the development of the algorithm, five item selection methods were tested through simulation (besides

random selection), as well as seven methods for proficiency estimation. The methods were tested for accuracy

and speed. In addition, the simulations allowed the adjustment of two parameters in the algorithm: the

maximum standard error and the critical value of the confidence interval.

The simulations are based on IRT, more specifically on the probability of an individual with defined

proficiency to correctly answer an item with known parameters, described by the two-parameter logistic

function [10]. Although different item banks have been tested, the results described here refer to the item bank

provided by Inep, composed of 40 items with two parameters (item difficulty and discrimination).

For each situation 1,000 simulations were made, each with 1,000 participants (with normal proficiency

distribution, average 500 and standard deviation 100), responding to 20 items out of 40 of the original Inep

database. The descriptors of each item were not included, only the two parameters of the logistic model.
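A simulation run of this kind can be sketched as follows. The study was done in R; this is a minimal Python illustration of drawing abilities from N(500, 100) and scoring items under the 2PL model, with hypothetical item parameters:

```python
import math
import random

def simulate_responses(n_examinees, items, mean=500.0, sd=100.0, seed=42):
    """Draw abilities from N(mean, sd) and score each item with a Bernoulli
    draw at the 2PL probability. Item parameters here are hypothetical."""
    rng = random.Random(seed)
    thetas = [rng.gauss(mean, sd) for _ in range(n_examinees)]
    data = []
    for theta in thetas:
        row = []
        for a, b in items:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return thetas, data

# One small run: easier items (lower b) should be answered correctly
# more often than harder ones.
items = [(0.02, 400.0), (0.02, 500.0), (0.02, 600.0)]
thetas, data = simulate_responses(1000, items)
rates = [sum(row[j] for row in data) / len(data) for j in range(len(items))]
print(rates)
```

Because the "real" abilities are known in such runs, the estimation error of each method can be measured directly against them, as done in Figures 4 and 5.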

For proficiency estimation, four methods were compared, one of them tested in several packages, totaling seven implementations. Two methods are based on the likelihood principle: maximum likelihood [11] and weighted likelihood [12]. The other two use Bayesian statistics: the expected a posteriori (EAP) method [6] and the modal Bayesian estimator [13].

The seven methods compared from these three packages were:

• ML: maximum likelihood (catR package);

• WL: weighted likelihood (catR package);

• BM: modal Bayesian estimator (catR package);

• EAP: EAP method (thetaEst function, catR package);

• eapC: EAP method (eapEst function, catR package);

• eapI: EAP method (irtoys package);

• eapP: EAP method (PP package).

The item selection methods, all included in the 'catR' package [7], were as follows:

• random: random selection of items from the bank;
• MFI (Maximum Fisher Information): selects the item that provides the most information at the current estimated proficiency [10];
• bOpt (Urry rule): selects the item whose difficulty level is closest to the proficiency estimated so far;
• thOpt (maximum information with stratification): an adaptation of the MFI method, used to increase the security of the item bank by avoiding unnecessary overexposure of items;
• progressive method [14]: items are selected according to two elements, one related to MFI and another random; throughout the test, the random element becomes less important. This promotes greater security of the item bank;
• proportional method [4]: the item is selected according to probabilities related to MFI, also with the aim of promoting greater security of the item bank.
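The contrast between MFI and the Urry rule can be illustrated with a small sketch (in Python, for illustration only; the project used the catR implementations of these rules). The bank below is hypothetical:

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_mfi(theta_hat, bank, administered):
    """MFI: pick the unused item that is most informative at the
    current ability estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *bank[i]))

def select_bopt(theta_hat, bank, administered):
    """Urry rule (bOpt): pick the unused item whose difficulty is
    closest to the current ability estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return min(candidates, key=lambda i: abs(bank[i][1] - theta_hat))

# With equal discriminations, both rules favor the item whose difficulty is
# nearest the ability estimate; they diverge when discriminations vary.
bank = [(0.02, 400.0), (0.02, 490.0), (0.02, 530.0), (0.02, 600.0)]
print(select_mfi(500.0, bank, set()))
print(select_bopt(500.0, bank, {1}))
```

Deterministic rules like these maximize precision but tend to overexpose the most discriminating items, which is why the progressive and proportional methods inject randomness.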

3.3. Test finalization criteria

The main goal in developing the finalization criteria was to provide a test with fewer items and more accuracy than a similar but non-adaptive test. Therefore, three criteria were considered simultaneously. First, a maximum limit of 20 items and a minimum of 8 were set beforehand, the minimum ensuring that at least 4 items from each of the two axes are applied [3].

The second criterion consists of a maximum error allowed in the proficiency estimation step, measured by the standard error. This error starts high and decreases throughout the test as more items are answered. The criterion sets a maximum allowed limit for the standard error, according to the available items, the target population and the objectives. Through simulations using the 40-item bank, we found that a CAT of 12 or 13 items should have accuracy similar to that of a linear test of 20 items. According to our goals, we defined an ideal length of 16 items, with a corresponding standard error of 35 (Figure 6).


The third criterion, not found in the literature, is detailed in the next section.

3.3.1. Reliability of evaluation

The reliability criterion was conceived and developed especially for the purposes of Provinha Brasil CAT, since the main objective of developing the CAT was to contribute to the evaluation process, considering not only the accuracy of the proficiency measurement, but mainly the extent to which the proficiency is contained within one of the five levels of the scale. Each level categorizes the reading competence of the respondent and suggests a differentiated pedagogical intervention. Herein lies the difference between simply measuring the students' proficiencies and actually evaluating their performance. A test ends the moment that the proficiency estimate and its confidence interval are fully contained within a single proficiency level, among the five levels defined for Provinha Brasil (Figure 2). It is worth mentioning that the points dividing the Provinha Brasil scale into five levels, also called cut-off points, resulted from a psychometric process (anchoring) combined with a pedagogical analysis of the items performed by specialists and educators.

Figure 2: Cut-off points and proficiency levels in the Provinha Brasil - Reading scale.

Thus, the criterion that usually defines the end of the test by the lowest error was modified to also consider the confidence of the psychometric evaluation across the five interpretable levels. After all, Provinha Brasil is not a high-stakes test involving candidate selection or certification, in which the precision of the proficiency estimate is of utmost relevance. For example, in Figure 2, it is less relevant to differentiate the proficiencies of examinees A (342) and B (418) than it is to know to which level each examinee belongs.

As Figure 2 shows, the proficiency measured for examinee B is closer to that of examinee C than to that of examinee A. However, considering the cut-off points defined by the pedagogical interpretation of the scale, examinees A and B are closer to each other than to examinee C, who belongs to another level of the scale. In pedagogical terms, this means that students A and B demonstrated mastery of skills that call for similar interventions, while student C demonstrated mastery that calls for different interventions.

It must be taken into account, however, that there is a degree of uncertainty in the estimation of proficiency, regardless of the method used. When a student has answered only a few items and their estimated proficiency is, for example, 418, we must ask ourselves how reliable this number is.

Certainly this depends on the number of items answered. Assuming normality of the proficiency distribution, the uncertainty of the estimate can be calculated from the standard error. The confidence interval, in turn, is obtained by multiplying the standard error by the critical value corresponding to the desired confidence level (for example, 95%, which corresponds to a critical value of 1.96 [15]). The confidence interval is defined by a minimum and a maximum number, and it means that, in our case, there is a 95% chance that the estimated proficiency lies between these two numbers.

Figure 3 illustrates how reliability can be used as a test stopping criterion complementing the maximum standard error criterion. Student B's proficiency estimate has a greater error than student A's, but B's proficiency has already been safely placed within one of the five levels of the scale (level 5) after five correct answers in an adaptive test. Student A, on the other hand, has already answered 11 items in this adaptive test, but has not yet been reliably evaluated, possibly belonging to either level 2 or level 3. After all, the estimation error depends not only on the number of items presented, but also on the answers of each student and the parameters of the items. Moreover, the number and positioning of the cut-off points strongly affect this criterion. In fact, student B benefited from the addition of this third criterion: finishing faster without loss of test accuracy, accomplishing the main practical purpose of Provinha Brasil – Reading, which is to provide a reliable evaluation of the child within one of the five proficiency levels.

Figure 3: Representation of the confidence intervals of two estimated proficiencies.
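The mixed stopping rule described in this section (item-count bounds, maximum standard error, and confidence-interval containment within a single level) can be sketched as follows. This is an illustrative Python version; the cut-off points shown are hypothetical, not the actual Provinha Brasil values:

```python
def should_stop(n_items, se, theta_hat, cut_points,
                min_items=8, max_items=20, max_se=35.0, z=1.96):
    """Mixed stopping rule: stop at the item-count bounds, when the standard
    error is small enough, or when the confidence interval falls entirely
    inside one proficiency level (cut_points are the level boundaries)."""
    if n_items >= max_items:
        return True
    if n_items < min_items:
        return False
    if se <= max_se:
        return True
    lo, hi = theta_hat - z * se, theta_hat + z * se
    # The interval sits inside a single level iff no cut point falls in it.
    return not any(lo < c < hi for c in cut_points)

cuts = [409.0, 509.0, 609.0, 709.0]  # hypothetical cut-off points
# SE still above 35, but the whole 95% CI lies inside the top level: stop.
print(should_stop(9, 36.0, 850.0, cuts))
# Same SE, but the CI straddles a cut point between two levels: continue.
print(should_stop(11, 36.0, 480.0, cuts))
```

As in the Figure 3 example, an examinee near a level's center can stop early despite a large standard error, while one near a cut-off point keeps answering items.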


4. RESULTS OF THE SIMULATIONS

This section presents the results obtained through simulation during the construction of the algorithm and the adjustment of its basic parameters.

4.1. Item selection and proficiency estimation methods

Item selection and proficiency estimation methods were tested for accuracy and processing speed. Figure 4 shows that random selection of items (non-adaptive) produced the greatest errors. Regarding speed, all methods were fast enough.

Figure 4: Measurement accuracy and processing speed, according to six item selection criteria. "Real" error = absolute value of the difference between the "real" and estimated proficiencies.

The t-test revealed that all criteria presented less error than the absence of a criterion, represented by the random selection method, but that the differences between the criteria were not significant (p < 0.05) in most simulations. Thus, considering the specialized literature, we chose to include the Maximum Fisher Information method in the algorithm. However, if Provinha Brasil CAT – or any other CAT – becomes consolidated as a public policy, it may be necessary to revisit this technical choice, since it does not consider the security and sustainability of the item bank. The progressive or proportional methods might be more appropriate in that case.

Regarding proficiency estimation methods, results on measurement accuracy and speed are summarized in Figure 5. There was no significant difference (p < 0.05) between the accuracies of the methods that presented the least error (BM, EAP, eapC, eapI, WL) for a 20-item test with a mean proficiency of 500. However, it is worth noting that, in populations with a mean proficiency different from the expected one (600 or 400, for example), the WL method was more accurate than the others (unpublished results), due to its greater adaptability and lower dependence on an a priori estimate of the population mean.


Figure 5: Measurement accuracy and processing speed (in seconds), according to seven proficiency estimation methods. "Real" error = absolute value of the difference between the real and estimated proficiencies.

It must also be noted that the EAP method of the 'PP' package (eapP) presented a much greater error than the others. This illustrates another important role of simulations in the development of algorithms: preventing the use of packages of dubious quality, with inconsistent results. This precaution is especially important when working with free software, but it is also necessary with proprietary software.

In short, considering precision and speed simultaneously, the method chosen for proficiency estimation was the EAP method of the 'irtoys' package.

To determine the maximum standard error to be accepted by the CAT, we considered the goal of producing a smaller and more accurate test than a conventional, non-adaptive one. Simulating the test using the parameters of the 40-item bank, the standard error of the conventional 20-item test corresponds to an adaptive test of around 12 or 13 items (Figure 6). Therefore, to provide a smaller and more precise test, the maximum limit for the standard error was defined as 35 points on the Provinha Brasil scale, which corresponds to an adaptive test of about 16 items.


Figure 6: Errors in tests of different sizes. Two types of errors: (A) standard error of the proficiency estimate; (B) difference between the "real" proficiency of the simulation and the estimated proficiency. Blue refers to the CAT, red to the non-adaptive test.

5. APPLICATION RESULTS

The test was applied to students from 15 municipal schools in São Paulo. Here we describe results for the 1,160 second grade students (about eight years old), who are the target of Provinha Brasil, and for 823 first grade students included for comparison. The average proficiency in the second grade was 495 points, while in the first grade it was 417, confirming the quality of the item bank and of the implemented methods, since the expected (a priori) average for the target population was 500. The standard deviation in the second grade was 79 points, slightly lower than that expected for the Brazilian population (100), which also confirms the psychometric quality of Provinha Brasil CAT - Reading, since a smaller variance would be expected in this specific set of schools.

The sizes of the tests (Figure 7) also confirm the expected results of the adjustment of the algorithm parameters: the maximum standard error of 35 points was defined as corresponding to a test of 16 items, on average. Finally, another result that corroborates the adaptive dynamics of the test is the item exposure rate when comparing first and second grade students (Figure 8).

Figure 7: Sizes of the tests in Provinha Brasil CAT - Reading, applied to second grade students in the city of São Paulo.


Figure 8: Relation between item exposure rate and item difficulty parameter in IRT

6. CONCLUSION

The actual application results indicate that Provinha Brasil CAT works well for its purposes, especially because, as expected from the theory and the simulations: 1) most tests ended with 16 items, except those that reached the maximum limit; 2) the average time spent on the CAT was 12 minutes, versus 15 minutes on the non-adaptive test; 3) the average proficiency in the target population (second grade) was very close to the a priori mean, while in the first grade it was lower; 4) the proficiency standard deviation was smaller than expected for the national population, as it should be in this specific set of 15 schools; 5) the item exposure rates differed between first and second grade students.

The simulations were useful to avoid packages with poor results (such as the PP package in R) and to choose functions with good accuracy and speed. We also showed how to adjust some parameters, such as the maximum standard error, through simulation studies.

One possible limitation of the algorithm is the EAP estimation method, which produces less accurate results when the population's average proficiency is far from the expected a priori mean. Given that the CAT can be used as a comparison tool across grades, as the results confirm, it may be better to use a non-Bayesian estimation method, such as weighted likelihood. Another limitation was the small item bank, which led to quite high exposure rates for some items. To avoid this, we recommend improving the algorithm to control item bank security.

Acknowledgements

This work was conducted in partnership with the Department of Education of the City of São Paulo and financially supported by Unesco.

References

[1] M. Soares, Alfabetização: a questão dos métodos. São Paulo: Contexto, 2016.

[2] O. M. Alavarse, É. M. de T. Catalani, D. R. Meneghetti, and R. Travitzki, "Teste Adaptativo Informatizado como Recurso Tecnológico para Avaliação da Alfabetização Inicial," Revista Iberoamericana de Sistemas, Cibernética e Informática, vol. 15, no. 1, pp. 1–11, 2018.

[3] J. R. Barrada, “Tests adaptativos informatizados: una perspectiva general,” anales de psicología, vol. 28, no. 1, pp. 289–302,

2012.


[4] J. R. Barrada, J. Olea, V. Ponsoda, and F. J. Abad, “A Method for the Comparison of Item Selection Rules in Computerized

Adaptive Testing,” Applied Psychological Measurement, vol. 34, no. 6, pp. 438–452, Sep. 2010.

[5] F. B. Baker, The Basics of Item Response Theory. College Park, Md.: ERIC Clearinghouse on Assessment and Evaluation, 2001.

[6] R. D. Bock and R. J. Mislevy, “Adaptive EAP Estimation of Ability in a Microcomputer Environment,” Applied Psychological

Measurement, vol. 6, no. 4, pp. 431–444, 1982.

[7] D. Magis and G. Raîche, “Random Generation of Response Patterns under Computerized Adaptive Testing with the R Package

catR,” Journal of Statistical Software, vol. 48, no. 8, pp. 1–31, 2012.

[8] M. Reif, PP: Estimation of person parameters for the 1,2,3,4-PL model and the GPCM. 2014.

[9] I. Partchev, irtoys: A Collection of Functions Related to Item Response Theory (IRT). 2016.

[10] D. F. de Andrade, H. R. Tavares, and R. da C. Valle, Teoria da Resposta ao Item: Conceitos e Aplicações. São Paulo: AVALIA

Educacional, 2000.

[11] F. M. Lord, Applications of item response theory to practical testing problems. Hillsdale, N.J: L. Erlbaum Associates, 1980.

[12] T. A. Warm, “Weighted likelihood estimation of ability in item response theory,” Psychometrika, vol. 54, no. 3, pp. 427–450,

Sep. 1989.

[13] A. Birnbaum, “Statistical theory for logistic mental test models with a prior distribution of ability,” Journal of Mathematical

Psychology, vol. 6, pp. 258–276, 1969.

[14] J. Revuelta and V. Ponsoda, “A comparison of item exposure control methods in computerized adaptive testing,” Journal of

Educational Measurement, vol. 35, no. 4, pp. 311–327, 1998.

[15] D. F. Ferreira, Estatística Básica. UFLA, 2005.