Proceedings of IAC 2018 in Vienna
Teaching, Learning and E-learning (IAC-TLEl 2018)
and
Management, Economics and Marketing (IAC-MEM 2018)
and
Engineering, Transport, IT and Artificial Intelligence (IAC-ETITAI 2018)
Vienna, Austria
Friday - Saturday, July 6 - 7, 2018
ISBN 978-80-88203-06-3
Czech Technical University in Prague
How to build a Computerized Adaptive Test with free
software and pedagogical relevance?
Rodrigo TRAVITZKI a,1, Douglas De Rizzo MENEGHETTI b, Ocimar Munhoz ALAVARSE c, Érica Maria de Toledo CATALANI d

a University São Francisco, Psychology Department, R. Waldemar César da Silveira, 105, Campinas, Brazil, r.travitzki@gmail.com
b FEI University Center, 3972-B Humberto de Alencar Castelo Branco Ave, São Bernardo do Campo, São Paulo, Brazil, douglasrizzo@fei.edu.br
c University of São Paulo, Faculty of Education, 308 Av. da Universidade, São Paulo, Brazil, ocimar@usp.br
d University of São Paulo, Faculty of Education, 308 Av. da Universidade, São Paulo, Brazil, ericamtc@usp.br
1 Corresponding author.
Abstract
This paper describes a pilot project carried out in the city of São Paulo (Brazil), focusing on the algorithm building process, with the main goal of creating a Computerized Adaptive Test (CAT) based on a national standardized test for second-grade students (about eight years old at the end of the school year). The CAT version was to be smaller and more accurate than the non-adaptive one. The theoretical basis is Item Response Theory, and the programming language is R. Here we describe the fundamentals of the algorithm and the simulations used to build and analyze it, comparing software packages and methods with regard to accuracy and speed. The simulations were also useful to adjust some parameters according to our goals and to the item bank. Finally, we present results of the actual CAT application to 1,160 students from 15 municipal schools, corroborating the test quality in several respects. We also propose a CAT stopping criterion based on evaluation, rather than just measurement, which can be useful when the proficiency scale is discretized into levels, each with a different pedagogical interpretation.
Keywords:
computerized adaptive tests, item response theory, simulations, free software
1. INTRODUCTION
External assessments have become increasingly important around the world. Several challenges must be faced so that evaluations, or at least assessments, can contribute to improving the quality of education. Some of them concern the interpretation of the results and their use in the management of the educational system and in the daily life of the classroom, mainly when literacy teaching is involved. Others refer to logistical difficulties related to security and to the management of large amounts of paper. A third type of challenge concerns the technical quality of the test as a measuring instrument. A Computerized Adaptive Test (CAT) can be understood as more than an assessment device: it is an evaluation methodology that contributes significantly to overcoming the last two types of challenge, while still contributing mildly to the first.
The literacy process is especially important in the initial years of elementary education, when teachers dedicate much time to the development of children's reading and writing skills [1]. National indicators related to the development of these competences have been unsatisfactory when the entire Brazilian school-age population is considered. We therefore decided to construct a CAT based on a test which, in its printed version, is applied nationally to students of the second grade of the Brazilian nine-year primary education system. This test, called Provinha Brasil, was created and is managed by the National Institute of Studies and Research in Education "Anísio Teixeira" (Inep).
It focuses on the diagnosis of students' reading and mathematical abilities. In the current work, we focus our efforts on building a CAT for the reading component of Provinha Brasil, hereafter called Provinha Brasil CAT – Reading.
The construction of this CAT was part of a project developed by researchers from the Group of Studies and Research in Educational Assessment (Gepave), linked to the Faculty of Education of the University of São Paulo (Feusp), in partnership with Inep, a body linked to the Ministry of Education (MEC), and with the São Paulo Municipal Department of Education (SMESP). The project involved central and regional managers (school supervisors), school directors and coordinators, teachers and students. More information about the project, in Portuguese, can be found in [2].
Provinha Brasil – Reading is an instrument built with a formative perspective. Applied since 2008, Provinha is composed of two tests that are made available to teachers every year: the first in March (at the beginning of the school year) and the second in October (at the end of the school year), both containing 20 items. Items are prepared according to a content specifications table for reading, pre-tested and calibrated according to statistical standards by Inep specialists, and made available in test books to the country's teachers, together with guidance for correcting and interpreting the students' scores. In the latest editions, the items aimed at assessing writing skills were removed from the test for methodological reasons, and new items began to focus on reading skills. The students' scores are compared to a reading proficiency scale, which additionally presents a pedagogical interpretation and intervention suggestions for the students' progress.
The teachers themselves apply the test, collect the students' answers and interpret the results. Items are made available to teachers and managers after each application. According to the teachers' opinions, this availability allows a better understanding of the students' results.
In this project, we sought to create an adaptive test that has, on average, fewer items and more precision than a conventional test. Adaptive tests can achieve this because, by selecting specific items for each student, each new measure (based on the response to the item) adds some information to the previous set of measures. And when information is added to the set, the estimation uncertainty (error) is reduced. To understand this, one can imagine the opposite: a traditional test in which the student answered correctly all six easy items that had been presented to him. In this situation, presenting a new item of low difficulty should not change the proficiency estimate, nor reduce its standard error, which means the item is not adding any new information. This is one of the great advantages of adaptive testing: optimizing the collection of information about the student while avoiding unnecessary measurements.
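To make this intuition concrete, the sketch below (a minimal illustration in R, with hypothetical item parameters rather than items from the Inep bank, on the usual logit scale) computes the Fisher information of an easy two-parameter logistic item: it is substantial near the item's difficulty and close to zero for a clearly proficient examinee, which is why such an item would add almost nothing to the estimate.

# Minimal sketch: Fisher information of a 2PL item (logit scale).
# The item parameters are hypothetical, not taken from the Inep item bank.
item_info_2pl <- function(theta, a, b) {
  p <- 1 / (1 + exp(-a * (theta - b)))  # probability of a correct response
  a^2 * p * (1 - p)                     # item information at proficiency theta
}

easy_item <- c(a = 1.2, b = -1.5)       # low-difficulty item
item_info_2pl(-1.5, easy_item["a"], easy_item["b"])  # informative near its difficulty
item_info_2pl( 2.0, easy_item["a"], easy_item["b"])  # nearly zero for a proficient student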
Barrada identifies four general objectives for a CAT and, depending on what one wishes, greater importance can be given to one or another of them. These are: 1) reliability of the proficiency estimate; 2) item bank security; 3) content restrictions; 4) item bank maintenance. Some of these objectives are to some extent mutually exclusive, such as 1) and 2). Objective 3), on the other hand, has a smaller effect on the others, depending mainly on the existence of a balanced item bank, that is, one that reflects the content constraints [3]. For an assessment periodically applied to students, it is important to consider objectives 2 and 4, using techniques for controlling item exposure rates [4].
This paper explains the CAT algorithm and its creation process. We also propose a simple stopping criterion
focused on evaluation, rather than just measurement. All the simulations and the item selection algorithm used in the main application were implemented with free software, using the R language. Some packages and functions are compared through simulations in order to optimize measurement precision and processing speed. Some parameters are adjusted according to the project goals and to the available item bank. The final section shows some results of the actual CAT application, corroborating the technologies used and the choices made.
2. OPERATION OF THE CAT
To be called both computerized and adaptive, a test needs a user-friendly computer interface and a module
capable of processing IRT-related data. The interface will be referred to herein as "computerized platform",
while the statistical processing of responses will be referred to as "algorithm". The computerized platform is
responsible for displaying items to the examinees and capturing their responses. It works online and was accessed via intranet at all participating schools.
Item presentation was carried out using tablets connected to the Wi-Fi network of the schools. The audio
commands of the items were made available individually to the students through headphones. Figure 1 presents, in a simplified way, the components of Provinha Brasil CAT - Reading and their interrelationships.
Figure 1: Overview of Provinha Brasil CAT - Reading
In this work, we detail the operation of the algorithm and its creation process. Item Response Theory [5] is the theoretical basis of the algorithm, whose goal is to provide adaptive dynamics to the computerized testing platform. More specifically, we seek to maximize the accuracy of the examinees' proficiency estimates and to minimize the number of items administered to them, avoiding losses in instrument validity while speeding up the testing procedure.
Proficiency estimation is performed based on the expected a posteriori (EAP) distribution [6] with 21 quadrature points. The criteria for selecting items included in the algorithm were:
1. Maximum Fisher Information (MFI) [4];
2. the balanced selection of items among the descriptors of the content matrix (in this pilot phase, this criterion was not used due to the reduced number of items in the bank).
Barrada [3] identifies three types of stopping criteria for a CAT: 1) reaching a predetermined number of items; 2) achieving a minimum level of uncertainty in the proficiency estimate; 3) reaching a minimum threshold for the information a new item would add to the proficiency estimate. To determine the end of the test, the algorithm uses a mixed criterion, considering:
1. number of test items (minimum of 8 and maximum of 20 items);
2. permitted limit of uncertainty (Standard Error below 35 points on the scale);
3. degree of confidence in determining the proficiency level (one of five levels).
The first two criteria are widely used in adaptive tests [3]. The third criterion was developed for this project and no references to it were found in the literature. It seems to be a significant, though technically simple, contribution of this project to the state of the art in CAT.
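A minimal sketch of this mixed stopping rule is given below, in R. The function is ours, written only to make the logic explicit: the names and the cut-off points are placeholders, not the values or the code actually used in the project.

# Sketch of the mixed stopping rule (names and cut-off points are placeholders).
should_stop <- function(n_items, theta_hat, se,
                        cuts = c(400, 450, 500, 550),  # placeholder cut-off points
                        min_items = 8, max_items = 20,
                        max_se = 35, crit = 1.96) {
  if (n_items < min_items) return(FALSE)       # criterion 1: minimum test length
  if (n_items >= max_items) return(TRUE)       # criterion 1: maximum test length
  if (se <= max_se) return(TRUE)               # criterion 2: precision reached
  ci <- theta_hat + c(-1, 1) * crit * se       # criterion 3: reliability of evaluation
  level <- findInterval(ci, cuts)              # level of each interval endpoint
  level[1] == level[2]                         # TRUE if the CI fits in a single level
}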
3. METHODS
This section presents the technologies used in the study, as well as the simulation procedure employed.
3.1. Software and hardware
Both the algorithm and the simulation procedures were written in the R programming language, which is
free, open source and specialized in statistics. The following packages were tested: catR 3.13 [7], PP 0.6.1 [8]
and irtoys 0.2.0 [9]. The simulations were carried out in 2016 and repeated in February 2018 with new versions of the packages. All the work was done on a laptop with a 4-core i7 processor (2.50 GHz) and 8 GB of memory. The operating system was Linux Mint. No parallel processing or GPU acceleration was used.
3.2. Simulations
For the development of the algorithm, five item selection methods were tested through simulation (besides random selection), as well as seven methods for proficiency estimation. The methods were tested for accuracy and speed. In addition, the simulations allowed the adjustment of two parameters of the algorithm: the maximum standard error and the critical value of the confidence interval.
The simulations are based on IRT, more specifically on the probability that an individual with a given proficiency correctly answers an item with known parameters, described by the two-parameter logistic function [10]. Although different item banks were tested, the results described here refer to the item bank provided by Inep, composed of 40 items with two parameters (item difficulty and discrimination).
For each situation, 1,000 simulations were run, each with 1,000 participants (proficiencies drawn from a normal distribution with mean 500 and standard deviation 100), responding to 20 items out of the 40 in the original Inep bank. The descriptors of each item were not included, only the two parameters of the logistic model.
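The core of such a simulation can be sketched as follows. This is an illustration only: the item parameters are randomly generated rather than taken from the Inep bank, and the 1/100 scaling constant that maps the 0-1000 proficiency metric onto the logistic function is an assumption of ours.

# Sketch of the simulation core: responses of simulated examinees to 2PL items.
# Item parameters are randomly generated here, NOT the Inep bank; the 1/100
# scaling of the logistic function to the 500/100 metric is an assumption.
set.seed(2016)
n_items <- 40
bank <- data.frame(a = runif(n_items, 0.8, 2.0),   # discrimination
                   b = rnorm(n_items, 500, 100))   # difficulty on the 0-1000 metric

p_correct <- function(theta, a, b) 1 / (1 + exp(-(a / 100) * (theta - b)))

theta_real <- rnorm(1000, mean = 500, sd = 100)    # "real" proficiencies
resp <- sapply(seq_len(n_items), function(j)       # 1000 x 40 matrix of 0/1 responses
  rbinom(length(theta_real), 1, p_correct(theta_real, bank$a[j], bank$b[j])))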
For proficiency estimation, four methods were compared, one of them tested in several packages, totaling seven method-package combinations. Two methods are based on the likelihood principle: maximum likelihood [11] and weighted likelihood [12]. The other two use Bayesian statistics: the expected a posteriori (EAP) method [6] and the Bayesian modal estimator [13]. A didactic sketch of the EAP computation is given after the list below.
The seven methods compared from these three packages were:
ML: maximum likelihood (catR package);
WL: weighted likelihood (catR package);
BM: modal Bayesian estimator (catR package);
EAP: EAP method (thetaEst function, catR package);
eapC: EAP method (eapEst function, catR package);
eapI: EAP method (irtoys package);
eapP: EAP method (PP package).
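As a complement to the list above, the EAP computation itself can be sketched in a few lines of base R. This is only a didactic version, not the implementation of any of the packages listed, and it reuses the hypothetical 1/100 scaling from the simulation sketch.

# Didactic EAP estimator on 21 quadrature points; resp is the 0/1 response
# vector of one examinee, a and b the parameters of the administered items.
eap_estimate <- function(resp, a, b, n_quad = 21,
                         prior_mean = 500, prior_sd = 100) {
  q <- seq(prior_mean - 4 * prior_sd, prior_mean + 4 * prior_sd, length.out = n_quad)
  prior <- dnorm(q, prior_mean, prior_sd)
  lik <- sapply(q, function(theta) {                 # likelihood of the response pattern
    p <- 1 / (1 + exp(-(a / 100) * (theta - b)))
    prod(p^resp * (1 - p)^(1 - resp))
  })
  post <- prior * lik
  theta_hat <- sum(q * post) / sum(post)             # posterior mean (EAP)
  se <- sqrt(sum((q - theta_hat)^2 * post) / sum(post))  # posterior SD as Standard Error
  c(theta = theta_hat, se = se)
}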
The methods for selecting items, all included in the 'catR' package [7], were as follows:
random: random selection of items from the bank;
MFI (Maximum Fisher Information): selects the item that provides the most information at the current estimated proficiency [10] (see the sketch after this list);
bOpt (Urry rule): selects the item with the level of difficulty closest to the estimated proficiency so far;
thOpt (Maximum Information with stratification): adaptation of the MFI method, used to increase the
security of the item bank by avoiding unnecessary overexposure of items;
Progressive method [14]: items are selected according to two elements, one relative to MFI and
another random. Throughout the test, the random element becomes less important. This promotes
greater security of the item bank;
Proportional method [4]: the item is selected according to probabilities related to MFI, also with the aim of promoting greater item bank security.
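A sketch of the MFI rule, reusing the item information helper defined earlier, is shown below. It is our own illustrative code, not the internals of catR, and it keeps the hypothetical 1/100 scaling used in the previous sketches.

# Sketch of Maximum Fisher Information selection among not-yet-administered items.
select_mfi <- function(theta_hat, bank, administered) {
  available <- setdiff(seq_len(nrow(bank)), administered)
  info <- item_info_2pl(theta_hat, bank$a[available] / 100, bank$b[available])
  available[which.max(info)]   # index of the most informative remaining item
}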
3.3. Test finalization criteria
The main goal in developing the finalization criterion was to provide a test with fewer items and greater accuracy than a similar, non-adaptive test. Therefore, three criteria were considered simultaneously. First, a maximum of 20 items and a minimum of 8 were set beforehand, to ensure that at least 4 items of each of the two axes are administered [3].
The second criterion consists of a maximum error allowed in the proficiency estimation step, measured by the Standard Error. This error starts high and decreases throughout the test as more items are answered. The finalization criterion consists in determining a maximum allowed limit for the Standard Error, according to the available items, the target population and the objectives. Through simulations using the 40-item bank, we found that a CAT of 12 or 13 items should have accuracy similar to that of a linear test of 20 items. According to our goals, we defined an ideal length of 16 items, with a corresponding standard error of 35 (Figure 6).
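The reasoning behind this adjustment is the usual IRT relation between accumulated information and error: under the model, the Standard Error of the proficiency estimate is approximately the inverse square root of the sum of the information of the administered items, so the error shrinks as informative items are added. A hedged reminder follows, reusing the helper defined earlier; this is a standard IRT approximation, not a formula taken from the project's own code.

# Standard error of the estimate as the inverse square root of test information
# (standard IRT approximation, on the same hypothetical scaling as above).
test_se <- function(theta, a, b) 1 / sqrt(sum(item_info_2pl(theta, a / 100, b)))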
The third criterion, not found in the literature, is detailed in the next section.
3.3.1. Reliability of evaluation
The reliability criterion was conceived and developed especially for the purposes of Provinha Brasil CAT, since the main objective of developing the CAT was to contribute to the evaluation process, considering not only the accuracy of the proficiency measurement but, above all, the extent to which the proficiency is contained within one of the five levels of the scale. Each level categorizes the reading competence of the respondent and suggests a differentiated pedagogical intervention. In this objective lies the difference between simply measuring the students' proficiencies and actually evaluating their performance. A test ends at the moment the proficiency estimate and its confidence interval are fully contained within a single proficiency level, among the five levels defined for Provinha Brasil (Figure 2). It is worth mentioning that the points that divide the Provinha Brasil scale into five levels, also called cut-off points, resulted from a psychometric process (anchoring) associated with the pedagogical analysis of the items performed by specialists and educators.
Figure 2: Cut-off points and proficiency levels in the Provinha Brasil - Reading scale.
Thus, the criterion that usually defines the end of the test by the lowest error was modified to also consider the confidence of the psychometric evaluation within the five interpretable levels. After all, Provinha Brasil is not a high-stakes test involving candidate selection or certification, in which the precision of the proficiency estimate is of utmost relevance. For example, in Figure 2, it is less relevant to differentiate the proficiencies of examinees A (342) and B (418) than to know the level to which each examinee belongs.
As an example, in Figure 2, it can be observed that the proficiency measured for examinee B is closer to the proficiency measured for examinee C than to that measured for examinee A. However, considering the cut-off points defined by the pedagogical interpretation of the scale, examinees A and B are closer to each other than either is to examinee C, who belongs to another level of the scale. In pedagogical terms, this means that students A and B demonstrated mastery of skills that require similar interventions, while student C demonstrated mastery that requires other interventions.
It must be taken into account, however, that there is a degree of uncertainty in the estimation of proficiency, regardless of the method used. When a student answers a few items and their estimated proficiency is 418, for example, we must ask how reliable this number is.
Certainly this depends on the number of items answered. Assuming normality of the proficiency distribution, the uncertainty of the estimate can be calculated from the Standard Error. The confidence interval, in turn, is obtained by multiplying the Standard Error by the critical value corresponding to the desired confidence level (for example, 95%, which corresponds to a critical value of 1.96 [15]). The confidence interval is defined by a minimum and a maximum value, meaning that, in our case, there is a 95% chance that the estimated proficiency lies between these two numbers.
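For instance, the interval for the example above could be computed as follows. Only the 418 estimate and the 1.96 critical value come from the text; the Standard Error value is a hypothetical placeholder.

theta_hat <- 418                      # estimated proficiency (example from the text)
se        <- 40                       # hypothetical Standard Error after a few items
ci <- theta_hat + c(-1, 1) * 1.96 * se
ci                                    # 95% confidence interval: 339.6 to 496.4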
Figure 3 illustrates how reliability can be used as a test stopping criterion complementary to the maximum Standard Error criterion. Student B's proficiency estimate has a greater error than student A's, but B's proficiency has already been safely placed within one of the five levels of the scale (level 5) after five correct answers in an adaptive test. Student A, on the other hand, has already answered 11 items in this adaptive test but has not yet been reliably evaluated, and could belong to either level 2 or level 3. After all, the estimation error does not depend only on the number of items presented, but also on the answers of each student and on the parameters of the items. Moreover, the quantity and positioning of the cut-off points strongly affect this criterion. In fact, student B benefited from the addition of this third criterion: finishing faster without loss of test accuracy, thus fulfilling the main practical purpose of Provinha Brasil CAT - Reading, which is to provide a reliable evaluation of the child within one of the five proficiency levels.
Figure 3: Representation of the confidence intervals of two estimated proficiencies.
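The containment check behind Figure 3 can be written as below. The proficiencies, errors and cut-off points are hypothetical placeholders chosen only to reproduce the kind of situation described in the text; they are not values read off the figure or used in the project.

# Hypothetical illustration of the reliability criterion (placeholder values).
cuts <- c(400, 450, 500, 550)                 # placeholder cut-off points (five levels)
level_of <- function(x) findInterval(x, cuts) + 1

ci_A <- 470 + c(-1, 1) * 1.96 * 20            # student A: smaller error, interval crosses cut-offs
ci_B <- 680 + c(-1, 1) * 1.96 * 45            # student B: larger error, but a single level
level_of(ci_A[1]) == level_of(ci_A[2])        # FALSE -> keep testing student A
level_of(ci_B[1]) == level_of(ci_B[2])        # TRUE  -> student B's evaluation is reliable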
4. RESULTS OF THE SIMULATIONS
This section presents the results obtained through simulation in the construction stage of the algorithm and
adjustment of the basic parameters.
4.1. Item selection and proficiency estimation methods
Item selection and proficiency estimation methods were tested for accuracy and processing speed. Figure 4
shows that the random selection of items (non-adaptive) produced greater errors. Regarding speed, all methods were fast enough.
Figure 4: Measurement accuracy and processing speed, according to six item selection criteria. "Real" error = absolute value of the difference between the "real" and the estimated proficiency.
The t-test revealed that all criteria presented less error than the absence of a criterion, represented by the random selection method, but that the differences between the criteria were not significant (p < 0.05) in most simulations. Thus, considering the specialized literature, we chose to include the Maximum Fisher Information method in the algorithm. However, if Provinha Brasil CAT, or any other CAT, becomes consolidated as a public policy, it may be necessary to review this technical choice, since it does not consider the security and sustainability of the item bank. The progressive or proportional methods might be more appropriate in that case.
Regarding proficiency estimation methods, results on measurement accuracy and speed are summarized in
Figure 5. There was no significant difference (p < 0.05) between the accuracies of the methods that presented the least error (BM, EAP, eapC, eapI, WL) for a 20-item test with a mean proficiency of 500. However, it is worth noting that, in populations with a mean proficiency different from the expected one (600 or 400, for example), the WL method was more accurate than the others (unpublished results), due to its greater adaptability and lower dependence on an a priori estimate of the population.
Figure 5: Measurement accuracy and processing speed (in seconds), according to seven proficiency estimation methods. "Real" error = absolute value of the difference between the "real" and the estimated proficiency.
It must also be noted that the EAP method of the 'PP' package (eapP) presented a much greater error than the
others. This illustrates another important role of the simulations in the development of algorithms, which is to
prevent the use of packages of dubious quality, with inconsistent results. This precaution is especially important
when working with free software, but it is also necessary with proprietary software.
In short, considering both precision and speed simultaneously, the method chosen for proficiency estimation
was the EAP method of the 'irtoys' package.
In order to determine the maximum standard error to be accepted by the CAT, we considered the goal of producing a smaller and more accurate test than a conventional, non-adaptive one. Simulating the test using the parameters of the 40-item bank, the standard error of a conventional 20-item test corresponds to that of an adaptive test of around 12 or 13 items (Figure 6). Therefore, to provide a smaller and more precise test, the maximum limit for the Standard Error was defined as 35 points on the Provinha Brasil scale, which corresponds to an adaptive test with 16 items.
Figure 6: Errors in tests of different sizes. Two types of error: (A) standard error of the proficiency estimate; (B) difference between the "real" proficiency of the simulation and the estimated proficiency. Blue refers to the CAT, red to the non-adaptive test.
5. APPLICATION RESULTS
The test was applied to students from 15 municipal schools in São Paulo. Here we describe results for the 1,160 second-grade students (about eight years old), who are the target of Provinha Brasil, and for 823 first-grade students, included for comparison. The average proficiency in the second grade was 495 points, while in the first grade it was 417, confirming the quality of the item bank and of the methods implemented, since the expected (a priori) average for the target population was 500. The standard deviation in the second grade was 79 points, slightly lower than that expected for the Brazilian population (100), which also corroborates the psychometric quality of Provinha Brasil CAT - Reading, since a smaller variance would be expected in this specific set of schools.
The size of the tests (Figure 7) also confirms the results expected from the adjustment of the algorithm parameters: the maximum standard error of 35 points was defined as corresponding to a test of 16 items, on average. Finally, another result that corroborates the adaptive dynamics of the test is the item exposure rate when comparing first- and second-grade students (Figure 8).
Figure 7: Size of the tests in Provinha Brasil CAT - Reading, applied to the second grade of São Paulo city.
Figure 8: Relation between item exposure rate and item difficulty parameter in IRT
6. CONCLUSION
The results of the actual application indicate that Provinha Brasil CAT works well for its purposes, especially because, as expected from the theory and the simulations: 1) most tests ended with 16 items, except those that reached the maximum limit; 2) the time spent on the CAT was 12 minutes on average, while on the non-adaptive test it was 15 minutes; 3) the average proficiency in the target population (second grade) was very close to the a priori mean, while in the first grade it was lower; 4) the standard deviation of proficiency was smaller than expected for the national population, as it should be in this specific set of 15 schools; 5) the item exposure rate differed between first- and second-grade students.
The simulations were useful to avoid packages with poor results (such as the PP package in R) and to choose functions with good accuracy and speed. We also showed how to adjust some parameters, such as the maximum standard error, through simulation studies.
One possible limitation of the algorithm is the EAP estimation method, which produces less accurate results when the population does not have an average proficiency close to the expected a priori value. Taking into account that the CAT can be used as a comparison tool across different grades, as confirmed by the results, it may be better to use a non-Bayesian estimation method, such as Weighted Likelihood. Another limitation was the small item bank, which caused the exposure rates of some items to be quite high. To avoid this, it is recommended to improve the algorithm with item bank security controls.
Acknowledgements
This work was conducted in partnership with the Department of Education of São Paulo City and financially supported by Unesco.
References
[1] M. Soares, Alfabetização: a questão dos métodos. São Paulo: Contexto, 2016.
[2] O. M. Alavarse, É. M. de T. Catalani, D. R. Meneghetti, and R. Travitzki, "Teste Adaptativo Informatizado como Recurso Tecnológico para Avaliação da Alfabetização Inicial," Revista Iberoamericana de Sistemas, Cibernética e Informática, vol. 15, no. 1, pp. 1–11, 2018.
[3] J. R. Barrada, “Tests adaptativos informatizados: una perspectiva general,” anales de psicología, vol. 28, no. 1, pp. 289–302,
2012.
[4] J. R. Barrada, J. Olea, V. Ponsoda, and F. J. Abad, “A Method for the Comparison of Item Selection Rules in Computerized
Adaptive Testing,” Applied Psychological Measurement, vol. 34, no. 6, pp. 438–452, Sep. 2010.
[5] F. B. Baker, The Basics of Item Response Theory. College Park, Md.: ERIC Clearinghouse on Assessment and Evaluation, 2001.
[6] R. D. Bock and R. J. Mislevy, “Adaptive EAP Estimation of Ability in a Microcomputer Environment,” Applied Psychological
Measurement, vol. 6, no. 4, pp. 431–444, 1982.
[7] D. Magis and G. Raîche, “Random Generation of Response Patterns under Computerized Adaptive Testing with the R Package
catR,” Journal of Statistical Software, vol. 48, no. 8, pp. 1–31, 2012.
[8] M. Reif, PP: Estimation of person parameters for the 1,2,3,4-PL model and the GPCM. 2014.
[9] I. Partchev, irtoys: A Collection of Functions Related to Item Response Theory (IRT). 2016.
[10] D. F. de Andrade, H. R. Tavares, and R. da C. Valle, Teoria da Resposta ao Item: Conceitos e Aplicações. São Paulo: AVALIA
Educacional, 2000.
[11] F. M. Lord, Applications of item response theory to practical testing problems. Hillsdale, N.J: L. Erlbaum Associates, 1980.
[12] T. A. Warm, “Weighted likelihood estimation of ability in item response theory,” Psychometrika, vol. 54, no. 3, pp. 427–450,
Sep. 1989.
[13] A. Birnbaum, “Statistical theory for logistic mental test models with a prior distribution of ability,” Journal of Mathematical
Psychology, vol. 6, pp. 258–276, 1969.
[14] J. Revuelta and V. Ponsoda, “A comparison of item exposure control methods in computerized adaptive testing,” Journal of
Educational Measurement, vol. 35, no. 4, pp. 311–327, 1998.
[15] D. F. Ferreira, Estatística Básica. UFLA, 2005.