ESTIMATING THE DIFFICULTY OF A’LEVEL EXAMINATION
SUBJECTS IN UGANDA
Connie V. Nshemereirwe
University of Twente, The Netherlands
c.v.nshemereirwe@utwente.nl
To gain access to institutions of higher learning in Uganda, including universities, all students sit a national examination at the end of A’Level, and their scores determine selection into the various institutions. For most university degree programmes, entry is determined by A’Level scores irrespective of subject, which essentially implies that the same scores in different subjects are comparable. To investigate this comparability, a generalised partial credit item response model was fitted to A’Level examination results for the years 2009 and 2010. Science and non-science subjects were hypothesised to load on two separate dimensions of the latent ability scale, and subject difficulty and discrimination parameters were estimated. Science subjects were found to be relatively more difficult than humanities and language subjects, and they also provided the largest amount of information, although mainly at the higher end of the ability scale. Other subjects, such as Art and Kiswahili, were not only relatively easier but also provided very little information on the ability scale underlying the other subjects. These findings call into question the comparability of scores in the different subjects at A’Level, and raise the question of whether student ability, as inferred from examination performance, could be better represented by integrating information on subject difficulty.
Key words: Subject Difficulty; A’Level Examinations; Subject Comparability; University Selection; Generalised Partial Credit Model; Multidimensional Item Response Theory.
INTRODUCTION
The Ugandan pre-university education system follows a 7-4-2 system: seven years of primary school, four
years of lower secondary or “Ordinary Level (O’Level)” and two years of upper secondary or “Advanced
Level (A’Level)”. To advance from one level to the next, students must sit and pass a national
examination which is centrally developed and administered by the Uganda National Examination Board
(UNEB). At the end of primary school, pupils are examined in four subjects, all of which are
compulsory: English, Mathematics, Social Studies and Basic Science and Health Education. At the end of
O’Level, students sit ten to twelve subjects that they may choose from thirty-six in total; of these, five are
compulsory: English Language, Mathematics, Physics, Biology and Chemistry. At A’Level, students may
choose three subjects out of a possible twenty-six, and an additional compulsory subject called General
Paper. These twenty-six subjects are categorised as shown in Table 1, and the UNEB gives the following
guidelines on subject choice:
Candidates are advised to avoid selecting more than one subject from groups that are
normally timetabled together. […] candidates are particularly advised to avoid
combining Science subjects with Arts subjects, e.g. Sciences with Languages,
Physics with Geography, etc. (UNEB, 2010)
Table 1: A’Level Subject Categories
I. GENERAL PAPER (COMPULSORY)
II. HUMANITIES
P210 History
P220 Economics
P230 Entrepreneurship Education
P235 Islamic Religious Education
P245 Christian Religious Education
P250 Geography
III. LANGUAGES
P310 Literature in English
P320 Kiswahili
P330 French
P340 German
P350 Latin
P360 Luganda
P370 Arabic
IV. MATHEMATICAL SUBJECTS
P425 [Pure] Mathematics
S475 [Subsidiary] Mathematics
V. SCIENCE SUBJECTS
P510 Physics
P515 Agriculture: Principles and Practice
P525 Chemistry
P530 Biology
VI. CULTURAL SUBJECTS AND OTHERS
P615 Art
P620 Music
P630 Clothing and Textiles
P640 Foods and Nutrition
VII. TECHNICAL SUBJECTS
P710 Geometrical and Mechanical Drawing
P720 Geometrical and Building Drawing
P730 Woodwork
P740 Engineering Metalwork
(Source: UNEB, 2013)
UNIVERSITY SELECTION
The minimum requirement for university entry in Uganda is two principal passes, that is, a grade between A and E in at least two A’Level subjects taken at principal level. Additional entry requirements differ between academic programmes as well as between universities. Some academic programmes, such as engineering and medicine, have more restrictive entry requirements, while many others have open requirements. The majority of university students in Uganda are enrolled at public universities, where a system of weighting is used at selection. A’Level subjects categorised as essential for a given university academic programme receive a weight of three, relevant subjects receive a weight of two, and any other subject not categorised as essential or relevant receives a weight of one or a half. In programmes with more open subject requirements, however, these weights are simply applied to the subjects in which students score the highest grades. Table 2 shows the entry requirements for four popular academic programmes at public universities: Bachelor of Information Technology, Bachelor of Business Administration, Bachelor of Development Studies and Bachelor of Laws. It can be observed that subject requirements become progressively more open, to the point that students applying for the Bachelor of Development Studies or the Bachelor of Laws can be admitted with any A’Level subject combination. The question this raises, however, is whether all subject scores are interchangeable and can be taken to represent similar academic ability.
Table 2: Entry Requirements for Four Academic Programmes at Public Universities

B. Information Technology (BIT)
  “Essential” (weight of 3): Two best done of Maths, Economics, Physics, Biology, Chemistry, Literature, Geography, Entrepreneurship, Technical Drawing, Fine Arts
  “Relevant” (weight of 2): One better done of the remaining A’Level subjects

B. Business Administration (BBA)
  “Essential” (weight of 3): Economics and one better done of the remaining A’Level subjects
  “Relevant” (weight of 2): Next better done of the remaining A’Level subjects

B. Development Studies (BDS)
  “Essential” (weight of 3): Two best done of all A’Level subjects
  “Relevant” (weight of 2): Third best done of all A’Level subjects

Bachelor of Laws
  “Essential” (weight of 3): Two best done of all A’Level subjects
  “Relevant” (weight of 2): Third best done of all A’Level subjects

(Source: Joint Admissions Board, 2012/2013)
THE CONCEPT OF “SUBJECT DIFFICULTY”
Subject difficulty as a concept is rather controversial. On one hand, the observation that certain subjects
generally have higher pass rates than other subjects appears to indicate that some subjects are relatively
more difficult than others; on the other hand, it can be argued that pass rates may be a result of other
factors intrinsic to the education system such as less qualified teachers in some of the subjects, or intrinsic
to students themselves, such as varying levels of motivation (i.e. more motivated students tend to choose
certain subjects), rather than a characteristic of the subject itself. Additionally, there is a possibility that
grading practices in some subjects are simply more stringent than in others. Finally, it can also be argued
that scores in different subjects may indicate different dimensions of ability in the first place, rather than a
uniform dimension that underlies all subjects, and that therefore no sensible comparison can be made
between them.
Aside from comparison of subjects to one another at the same sitting, another issue of contention is
comparability of examination scores across time. Public confidence in the school system is often shaped
by whether performance is improving or not, judging from pass rates. Unfortunately, this sets up a
situation where an increase in the proportion of students passing raises concerns that examination
standards are falling (examinations are easier or have been compromised), and when pass rates drop, this
raises concerns that standards in schools are falling. William (1996, in Coe, 2010) has described the
dilemma that school systems and examination boards face in this case as a “heads I win, tails you lose”
situation.
Current Views of Subject Comparability
In considering subject comparability, it may be useful to start by reviewing the process of grade allocation itself. In examination systems such as Uganda’s, an A’Level grade scale, such as A-F, is applied across all subjects, and the grade boundaries are agreed upon by a panel of subject matter experts. Care is taken to set these grade boundaries in such a way as to maintain some kind of comparability between the letter grades from year to year. According to Newton (2005), such panels may also make use of statistical information on candidate performance in previous years, as well as technical information regarding mark distributions for the particular sitting, so as to arrive at “comparable” grade boundaries. This process of judgemental grade boundary allocation, or “linking”, is
meant to enable fair decision-making, such as university selection, for students sitting the same subjects
from year to year.
The purpose of national examinations, however, is not only for selection for the next level, but also to
provide data to enable the monitoring of schools and education systems. In this case, it is also necessary
to be able to determine the actual achievement levels of students from year to year; that is to say, the
knowledge and skill levels in each subject so as to judge progress. In Uganda, the UNEB uses a
combination of criterion and norm referencing to arrive at grade boundaries, and these two methods of
viewing performance reflect the two main views on “comparability” as well, namely performance
comparability and statistical comparability.
Performance comparability of any two subjects concerns judging difficulty based on the degree of challenge each subject presents to students. This challenge may be in terms of the complexity, skill level or knowledge required to score the same grade in each subject. The main difficulty with this conceptualisation is the fact that complexity and skill levels cannot be directly observed and must therefore be inferred, which makes this comparability method problematic (Coe, 2010). Further, different knowledge and skill sets may be necessary for the different subjects, in which case how can a judgement be made as to which is the more “difficult”?
Statistical comparability circumvents this problem by defining a standard solely in terms of the relative chances of success that candidates have in different subjects. Coe (2010) puts it as follows: “Two subjects are of comparable standard if the same grades are equally likely to be achieved by comparable candidates in each” (p. 275). A statistical conceptualisation of comparability, however, takes no account of the quality or content of the examinations, which, depending on the use to which the comparability is put, may be problematic as well.
Nevertheless, for the purposes of the current study, a statistical comparability view is appropriate because the focus is on the use of a simple average of A’Level subject scores for selection into university. That is to say, scores in
the A’Level examinations are used as a basis to qualify students by ranking them, rather than as an
indication of specific skill and knowledge levels. In that case, it is more useful to apply statistical
comparability, and a more detailed description of the methods involved in this is presented in the next
section.
Statistical Comparability of Subject Scores
Coe et al. (2007) give a summary of the statistical methods employed in the comparison of subject scores.
These include:
Common Examinee Linear Methods – the best known of these is Kelly’s method (1976), which estimates the difficulty of a subject based on all candidates who have taken that particular subject along with any other. Kelly’s method involves the solution of simultaneous equations, which allows the average performance in each subject to be used in the computation of the subject difficulties of all the other subjects, in an iterative process that repeatedly corrects for the “difficulty” of each subject until the differences between corrected subject scores approach zero; a sketch of this iterative correction is given below.
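As an illustration only, the sketch below implements this kind of iterative correction under simplifying assumptions: grades have already been converted to numeric points, the candidates-by-subjects array `scores` uses NaN for subjects a candidate did not take, and the update rule and stopping criterion are a plausible reading of the procedure rather than the exact formulation in Kelly (1976).

```python
import numpy as np

def kelly_style_corrections(scores, tol=1e-6, max_iter=1000):
    """Iteratively estimate a per-subject correction ("difficulty") so that,
    after correction, each subject's scores line up with the average corrected
    performance of the candidates who took it. `scores` is a candidates x
    subjects array of grade points, with np.nan where a subject was not taken."""
    corrections = np.zeros(scores.shape[1])
    for _ in range(max_iter):
        corrected = scores + corrections                  # apply current corrections
        cand_means = np.nanmean(corrected, axis=1, keepdims=True)
        # For each subject: average gap between a candidate's overall corrected
        # mean and their raw score in that subject.
        new_corrections = np.nanmean(cand_means - scores, axis=0)
        if np.max(np.abs(new_corrections - corrections)) < tol:
            return new_corrections
        corrections = new_corrections
    return corrections
```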
Latent Trait Methods – these are methods that rely on Item Response Theory (IRT), which takes the view that not all items in a test give the same amount of information about the ability of a student. Some items are more difficult, and even if two students get the same number of items correct, a student’s ability depends on which questions s/he answered correctly. The idea is that the probability of a person answering a given item correctly is a mathematical function of the difference between the ability of the person and the difficulty of that item. Given the responses of a number of persons to a set of items, the “difficulty” of the items can therefore be estimated simultaneously with person “ability” using an iterative maximum likelihood procedure, which assigns to each person the ability that best matches their response pattern given the difficulty of the items. The difficulty of items and the ability of persons can then be represented on the same “latent trait” scale, with persons higher up on the latent scale having a higher probability of answering more difficult items correctly. In estimating subject difficulty, latent trait models take the individual subjects to be items, and the subject scores to represent the response pattern of each student on these items (subjects).
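In its simplest form, for a dichotomously scored item, this relationship is usually written as a logistic function of the difference between person ability and item difficulty; the expression below is the standard Rasch form found in introductory treatments (e.g. Baker, 2001), shown here only to make the idea concrete:

\[ P(X_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \]

where \(\theta_i\) is the ability of person i and \(b_j\) is the difficulty of item j.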
Latent trait models have an advantage over Common Examinee Linear Methods like Kelly’s in that they
allow for the interval between subject scores in terms of difficulty to vary. In other words, the difference
between a score of A and B need not be equal to the difference between a score of B and C; similarly, the
distances between scores in different subjects need not be the same, so that the distance between a score
of A and B in History can differ from the distance between a score of A and B in Chemistry. Another
advantage of latent trait models is that, depending on the particular model employed, it is possible to determine the extent to which subjects can be represented by a single underlying dimension or by more than one dimension, and to test which explanation best fits the observed data. In this way, it is possible to
examine the extent to which subjects can indeed be compared to one another.
Statistical Comparability – some criticisms
Coe (2008) outlines some criticisms of statistical approaches, such as the basic incomparability of
subjects in general, and the fact that performance is affected by many other factors besides “difficulty”.
Further, the analysis of subject “difficulties” for different subgroups, such as males and females, may result in different estimates, and even the choice of statistical method matters, as different methods tend to give different results. Coe (2008) maintains, however, that statistical differences are still
interpretable within the context of a linking construct as long as all inferences are confined to that linking
construct. The important consideration, then, is the identification of a plausible linking construct.
Construct Comparability: An Integrated View of Subject Comparability
Given the shortcomings of both performance and statistical views of comparability, Newton (2005)
proposes a third, integrated view, which he terms construct comparability. This view of comparability
takes the position that it is inadvisable to infer any sense of equivalence based on a statistical comparison
of scores on a combination of subjects; rather, “comparison” can only translate the scores in these
different subjects to another scale which expresses the extent to which the scores measure the same
construct. Inferences about the scores so-linked can therefore only be made with reference to this
construct. It should be noted that this construct is not identical to any of the constructs being measured by
individual tests, and that no such inference should be made (Newton, 2005). Coe (2008) goes further to
say that in comparing subject scores, it can only be said that a given score in a subject indicates a lower
level of the linking construct than the same score in another subject. Take for instance comparing scores
in Mathematics and English: while these two subjects clearly represent different abilities, it is still
reasonable to say that a high score on both may be indicative of a more general academic ability. In
placing the scores in these two subjects on a scale of academic ability (the linking construct), it can then
be said that a high score in one subject represents a higher level on the linking construct than the same
score in the other subject. That being said, careful thought and consideration must go into defining this linking construct, which must then be “made explicit for all users and stakeholders” (Newton, 2005, p. 111, emphasis in original) so as to avoid invalid inferences.
Subject Comparability of A’Level Examinations in Uganda: A Linking Construct
Depending on the purpose to which the scores in national examinations are put, therefore, a linking
construct can be proposed. For instance, A’Level examination results in Uganda are the basis for
university entry; as such, a linking construct such as university “potential” can be proposed so that the
scores in different subjects can be compared based on such a scale. This is especially applicable for those
university degree programmes that do not impose any limitations on the A’Level subjects required for
admission, but even for those that do, such as Engineering and Medicine, a construct such as “scientific
ability” can also be used to place scores in subjects such as Mathematics, Physics, Chemistry and Biology
on the same scale. In other words, subjects that are strikingly different can be scaled separately and the scores aggregated thereafter, so that students who choose “difficult” subjects are not disadvantaged.
ESTIMATING A’LEVEL SUBJECT DIFFICULTY IN UGANDA: A METHODOLOGY
Item Response Theory
Item Response Theory (IRT) is a general statistical theory which attempts to relate the performance of an
individual on an item to the ability measured by that item (Hambleton & Jones, 1993). In contrast to
traditional testing where a person’s ability is inferred from a total score, IRT uses the information on the
individual’s responses to every item. IRT rests on three assumptions: a) items measure a uniform
underlying trait (unidimensionality); b) a response on one item is not dependent on the response to
another item on the same test (local independence); and c) that the relationship between a person’s
response and their ability can be mathematically modelled by a logistic function (Hambleton & Jones,
1993).
In general, IRT modelling proceeds by analysing the responses of a large number of individuals to a given
number of items with the aim of estimating the ability level associated with a given response pattern. In
this process, two parameters are commonly estimated: item difficulty, b, and item discrimination, a. Item
difficulty, b, represents the ability level at which there is a 50-50 chance of scoring in a given category,
and in this way can locate the item difficulty on the same scale as person ability, ϴ (theta). Once items
have been calibrated, a particular response will indicate the same ϴ value no matter who attempts the
question, which is a distinct advantage of IRT because item parameters are not tied to a particular
population - this property of IRT is known as invariance. The ϴ scale itself runs from negative infinity to
positive infinity, and is often scaled by fixing the zero point at the population mean, with each unit change
in the value of theta being equal to a change in ability represented by one standard deviation in the
population.
Secondly, it is usually also possible to model how well a given item discriminates between individuals
with a different latent trait ability. An item has high discrimination if it can detect a small difference in
the level of ability between persons based on their response; in other words, if the probability of a given
response were plotted against ability levels, a highly discriminating item would have a steeper slope, since the probability of that response at low levels of ability would be quite different from the probability for individuals with a higher level of the latent trait. A flatter slope would signify that the probability of a
given response does not change much between persons of low and high ability (Baker, 2001). The idea of
discrimination is parallel to that of factor loadings in factor analysis; an item which has a high
discrimination can be thought of as loading heavily on the underlying latent trait, and can measure the
ability levels of different individuals more precisely. It should be noted that an item may have high
discrimination only in a small part of the ability dimension; for instance, an item may be very well suited
to differentiate individuals at the upper end of the ability scale but have little discriminatory power at the
lower end since most of the individuals would score in the lowest category on that item. This gives IRT an
advantage in testing because it is sometimes desirable to discriminate between individuals of a similar
level, an advantage that is put to full use in computer adaptive testing. Once item difficulties and
discriminations have been computed, an individual’s response pattern places him/her on an ability scale,
which is on the same scale as item difficulties.
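As a purely numerical illustration of this idea (the parameter values are invented, not taken from the study data), the sketch below evaluates a two-parameter logistic item response function for a high-discrimination and a low-discrimination item with the same difficulty: the response probability changes much more sharply over a small ability interval for the high-a item.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items with the same difficulty but different discrimination.
high_a, low_a, b = 2.0, 0.5, 0.0
for theta in (-0.5, 0.0, 0.5):
    print(f"theta={theta:+.1f}  high-a item: {p_correct(theta, high_a, b):.2f}  "
          f"low-a item: {p_correct(theta, low_a, b):.2f}")
# The high-a item's probability rises steeply around theta = b,
# so it separates persons near that ability level more sharply.
```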
Within the IRT framework, various models have been developed to deal with different test formats and to
meet different assumptions. Students in Uganda may obtain a grade of A, B, C, D, E, O or F in the
national A’Level examinations, with A being the highest grade and F being the lowest. Each student takes examinations in three or four subjects; in order to model student performance using IRT, each
subject can be thought of as an item with seven score categories to represent the seven possible grades.
Since there are more than two possible scores categories for each subject (or item), then modelling the
relationship between student responses and subject difficulty requires a model developed for polytomous
items.
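A minimal sketch of how such data could be laid out for a polytomous analysis is shown below; the grade-to-category coding (F = 0 up to A = 6), the example records and the use of None for subjects a student did not take are illustrative choices, not the input format of any particular IRT program.

```python
# Map A'Level grades to ordered score categories (an illustrative coding).
GRADE_TO_CATEGORY = {"F": 0, "O": 1, "E": 2, "D": 3, "C": 4, "B": 5, "A": 6}

# Hypothetical records: each student has grades only in the subjects taken.
records = [
    {"History": "A", "Economics": "B", "Geography": "C"},
    {"Physics": "C", "Chemistry": "D", "Mathematics": "B"},
]
subjects = sorted({s for r in records for s in r})

# Build a students x subjects response matrix; None marks a subject not taken.
response_matrix = [
    [GRADE_TO_CATEGORY[r[s]] if s in r else None for s in subjects]
    for r in records
]
print(subjects)
print(response_matrix)
```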
IRT Models for Polytomous Items
These models are divided into two major categories – those for items where the response categories are
ordered (ordinal), and those where the response categories are in no particular order (nominal). In the
present case, if the ordering is certain, i.e. A>B>C>D>E>O>F, then one of the ordinal models would be suitable; however, if this ordering cannot be assumed in advance and one wants to test the hypothesis that A>B>C>D>E>O>F, then a nominal response model is more appropriate. In the present study, the
ordering was not assumed in advance; further, it was of interest to not only estimate the difficulty of items
but also their discrimination, and the most suitable model for this was found to be the Generalised Partial
Credit Model (GPCM; Muraki, 1992). The GPCM is also particularly suitable for modelling A’Level subject difficulty because it allows items to have different numbers of score categories; at A’Level there are some subjects in which no one scores A, or in which no one scores F, so that the corresponding items end up with different numbers of score categories.
Generalised Partial Credit Model (GPCM)
Difficulty in the GPCM is conceptualised as the threshold where the probability of scoring in the adjacent
category is more likely; as such, threshold values are estimated for all adjacent categories so that more
than one difficulty, or threshold, parameter is estimated for every item. It can be imagined that as ability
increases, the probability of scoring in a lower category decreases as the probability of scoring in the
adjacent category increases. Put another way, the probability of scoring in the lowest category, for
instance, is always dropping with increasing ability since the probability of scoring in any other category
is also rising at the same time and the total of probabilities always equals one. At some point, the
probability of scoring in an adjacent category becomes higher than that of scoring in the lowest category,
and the point at which these two curves cross marks the threshold ability or difficulty where the chances
of scoring in either category are equal. Figure 1 represents the category response curves for an item with
five response categories, k = 1 to k = 5. For this particular item, the ability level needed to “cross” the threshold between category one and category two, that is, the point at which the probability of scoring in the adjacent category becomes higher than that of scoring in the lowest category, is around ϴ = -1.5; the next threshold occurs at approximately ϴ = -1.2, and the one after that, where responding in category k = 3 becomes less likely than scoring in category k = 4, occurs at an ability level of around ϴ = 1.4, with the last threshold located closer to ϴ = 1.8.
Figure 1: Probability of scoring in different categories for a polytomously scored item. Adapted from
"Confirmatory Factor Analysis and Item Response Theory: Two approaches for exploring measurement
invariance” by S. P. Reise, K. F. Widaman, & R. H. Pugh, 1993. Psychological Bulletin, 114(3), 552-566.
Copyright 1993 by the American Psychological Association. Adapted with permission.
In other words, the threshold parameter is that value of ϴ at which scoring in the adjacent category becomes the more likely.
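Formally, following Muraki (1992), the GPCM gives the probability that a person with ability \(\theta\) scores in category k (k = 0, 1, ..., \(m_j\)) of item j in terms of the discrimination \(a_j\) and the threshold parameters \(b_{jv}\), with the empty sum for k = 0 taken to be zero:

\[ P_{jk}(\theta) = \frac{\exp\Big(\sum_{v=1}^{k} a_j(\theta - b_{jv})\Big)}{\sum_{c=0}^{m_j} \exp\Big(\sum_{v=1}^{c} a_j(\theta - b_{jv})\Big)} \]

At \(\theta = b_{jv}\) the probabilities of categories v-1 and v are equal, which is exactly the crossing point described above.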
Item Information.
In IRT, the item difficulty parameter locates an item on the ϴ scale, and the discrimination parameter describes
its loading or steepness on the underlying latent scale; however, in order to interpret these parameters in
any meaningful way, it is necessary to inspect the information functions for each of the items (Muraki,
1993). Item information is essentially an expression of how precisely a given item estimates the ability
parameters of individuals responding to it; this precision is indicated by the variance of those estimates,
and item information is equal to the reciprocal of that variance. If responses on an item lead to quite a
precise estimate of the ability parameters, then the variance of those estimates will be low and
information will be high; if, on the other hand, the estimates have a high variance, such an item provides
little information on the latent trait (Baker, 2001).
For polytomous items, item information functions may be unimodal or multimodal, depending on the distance between the threshold difficulty parameters of adjacent categories; if this distance is large, then the information will drop in the ϴ range between them (Muraki, 1993). Figures 2a and 2b show the item
category response functions of two different items together with their item information functions. The
first item has the following item parameters: a = 1.304, b = (-1.289, 0.292, 0.381, 1.252, 2.026, 2.735),
and the second has the following item parameters: a = 0.647, b = (-3.751, -1.794, -1.600, 0.476, 1.894,
3.301). Both items can be scored from 0-6, and total item information is shown by the thick dashed line.
The information curves show that item 1 provides more information about the underlying trait than item 2. Further, the peak of the item information function indicates where along the latent trait the item provides the most information: for item 1 this is around ϴ = 0.7, and for item 2 it is around ϴ = -1.5. This is consistent with the a and b-values of the two items, since item 1 has a higher discrimination and a higher average difficulty than item 2, leading to the expectation that item 1 will provide more information, and at a higher value of ϴ, than item 2.
a = 1.304, b = (-1.289, 0.292, 0.381, 1.252, 2.026, 2.735)
Figure 2(a): Item category response functions for a polytomous item with high information
a = 0.647, b = (-3.751, -1.794, -1.600, 0.476, 1.894, 3.301)
Figure 2(b): Item category response functions for a polytomous item with low information.
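As a rough numerical sketch of how such information curves can be reproduced, the code below computes GPCM category probabilities and item information (taken here as a^2 times the variance of the category score at a given ϴ, the usual form for the GPCM) for the two items above, and reports where each information curve peaks; the function names and the evaluation grid are illustrative, and this is not the implementation used by the MIRT program.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities under the GPCM for a single item.
    `b` holds the threshold parameters (one fewer than the number of categories)."""
    # Cumulative sums of a*(theta - b_v); the empty sum for category 0 is zero.
    cum = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    expcum = np.exp(cum - cum.max())   # subtract the max for numerical stability
    return expcum / expcum.sum()

def gpcm_information(theta, a, b):
    """Item information: a^2 times the variance of the category score at theta."""
    p = gpcm_probs(theta, a, b)
    k = np.arange(len(p))
    return a ** 2 * (np.sum(k ** 2 * p) - np.sum(k * p) ** 2)

thetas = np.linspace(-4, 4, 401)
item1 = dict(a=1.304, b=[-1.289, 0.292, 0.381, 1.252, 2.026, 2.735])
item2 = dict(a=0.647, b=[-3.751, -1.794, -1.600, 0.476, 1.894, 3.301])
info1 = np.array([gpcm_information(t, **item1) for t in thetas])
info2 = np.array([gpcm_information(t, **item2) for t in thetas])
print("Item 1 information peaks near theta =", round(float(thetas[info1.argmax()]), 2))
print("Item 2 information peaks near theta =", round(float(thetas[info2.argmax()]), 2))
```

The reported peak locations can be compared with the values read off Figures 2a and 2b, up to the resolution of the grid.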
Multidimensional Item Response theory (MIRT).
One of the basic assumptions of IRT is that the test items whose parameters are being estimated all
measure a single underlying trait; however, it is easy to imagine a situation where the probability of
success depends on more than one ability, such as word math problems where success depends on the
respondent’s ability to comprehend the language, as well as know the applicable mathematical principles
to solve the problem. In modelling such items, assuming unidimensionality of the underlying trait would
lead to inaccurate parameter estimates, in which case multidimensional IRT (MIRT) is a more suitable analytical framework (Ackerman, Gierl, & Walker, 2003). In the present study, multidimensionality was suspected based on the large differences in performance between science subjects and the humanities at A’Level in Uganda; as such, a two-dimensional latent ability structure was explored.
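For reference, a standard compensatory two-dimensional model for a dichotomous item (see, e.g., Ackerman, Gierl, & Walker, 2003) makes the response probability depend on a weighted combination of the two abilities; the form below is the generic textbook version rather than the exact multidimensional GPCM fitted in this study:

\[ P(X_{ij} = 1 \mid \theta_{i1}, \theta_{i2}) = \frac{\exp(a_{j1}\theta_{i1} + a_{j2}\theta_{i2} + d_j)}{1 + \exp(a_{j1}\theta_{i1} + a_{j2}\theta_{i2} + d_j)} \]

In the between-item structure adopted here, each subject is assigned to a single dimension (science or non-science), which corresponds to one of the two loadings being fixed at zero for every subject.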
ESTIMATING SUBJECT DIFFICULTY IN THE UGANDA NATIONAL A’LEVEL
EXAMINATIONS
Findings
Using the MIRT computer program (Glas, 2010), the GPCM was fitted to the data as a unidimensional
model and a 2-dimensional model (sciences versus all the rest). The science dimension was made up of physics, mathematics, chemistry, biology and agriculture. The 2-dimensional model turned out to fit the data best, and the two dimensions were shown to be fairly distinct, with a correlation of only 0.66.
Discrimination parameters
The data analysed were from the 2009 and 2010 A’Level examination sittings, and Figure 3 shows a plot of the a-parameters (discrimination parameters) for each of the 16 subjects analysed, for the 1- and 2-dimensional models fitted to the 2010 data. The a-parameters obtained from each analysis indicate how
well scores on a given subject discriminate between students with a different ability on the given
dimension; as such, subjects with high values of discrimination provide more information on the ability of
students than subjects with low discriminations. In this case, most of the science subjects were on the high
end of the scale, while some of the languages and fine art were at the lower end. This means that scores in these subjects do not discriminate between students with regard to ability as well as those at the high end. The rest of the subjects lie somewhere in the middle range.
Figure 3: Subject loadings (as indicated by subject discrimination or a-parameters) for the 1- and 2-dimensional models, 2010.
Difficulty Parameters
The subject difficulty was estimated by treating a student’s score on each subject as though it were the score on an item with categories ranging from 0 to 6. The GPCM estimates threshold difficulties, b, which represent the values of ϴ at which the probability of a student scoring in the adjacent category, say D, becomes higher than that of scoring in the present category, say E. However, since threshold
difficulties differ so much between and within subjects, a comparison of subject difficulty based upon
them is rather difficult; as such, an average of the threshold difficulties for each subject was computed,
and in Figure 4 a plot of the relative subject difficulties for the two years is shown. The general trend
shows that the local languages Kiswahili and Luganda have the lowest relative difficulty, while the four
science subjects mathematics, physics, chemistry and biology have the highest relative difficulty.
Figure 4: Relative subject difficulty, A’Level national examinations, 2009 and 2010.
Item information
In practical terms, estimates of item difficulty and discrimination parameters are most useful if we know
the ability levels about which they give the most information. For a start, one may expect that high
relative a and b-values mean that such items discriminate best at the higher end of the ability spectrum, while those with high a but low b-values discriminate best at the lower end. However, for polytomous items such as the ones analysed in this study this is not straightforward, because the threshold categories behave differently for different items. In this case, plots of the item information
functions (IIF) are more informative, and these are shown for three different items in Figure 5. In order to
show the relative amount of information provided by different items, all the IIFs are plotted to a value of
information equal to 5.0, except for chemistry which goes to 7.0; information is indicated by the thick
broken line.
The top panel shows a subject, Art, that gives almost no information about the underlying trait measured by the other subjects, while the bottom two panels show items which give more information about the underlying trait. History provides the most information at slightly below the average performance of students in 2009. Economics, mathematics and geography also gave a moderate amount of information. Finally, chemistry gives the most information at about 1.5 standard deviations above the average. Physics and biology gave similar amounts of information, but chemistry also turned out to be bi-modal, with a smaller peak at just below ϴ = 0, meaning it discriminates well enough at that ability level to provide some information there as well.
Figure 5: Item Information functions for three items in the A’Level national examinations, 2009
DISCUSSION AND CONCLUSION
Most degree programmes offered at universities in Uganda admit students almost solely on the basis of their scores in the national examinations at the end of the upper or advanced level of secondary school (A’Level). A weighting methodology is employed based on the subjects deemed most relevant for a given degree programme; for the most part this is done for courses like engineering and medicine, but in many other cases the highest weight is simply applied to the best-performed subjects instead. It turns out that the
majority of students take humanities and language subjects at A’Level, and these also have the highest
pass rates. On the other hand, degree courses like development studies and business administration have
very broad admission criteria, and so the majority of enrolled students took humanities and language
subjects at A’Level. Assuming that universities want to enrol the students of highest ability, however, it
may not be valid to assume that all subject scores are interchangeable, or that they represent a similar
general ability.
The study described in this paper was aimed at estimating the subject difficulty of the sixteen most
commonly chosen subjects at A’Level in Uganda so as to give a more accurate picture of the extent to
which scores in these subjects can be compared. Using data from two A’Level examination years, 2009
and 2010, a modelling method based on item response theory (IRT) was utilised, and it was assumed that
the sixteen subjects were different to such an extent that scores on them represented two separate ability
dimensions: a science and a non-science dimension. The science dimension was represented by the
subjects of biology, chemistry, physics, agriculture and mathematics, and the non-science dimension was
represented by some humanities subjects like economics and geography, and some language subjects like
Kiswahili and Luganda. The generalised partial credit IRT model was fitted to the data using the program MIRT (Glas, 2010). The two-dimensional model turned out to have the best overall fit to the data, with the correlation between the two dimensions found to be about 0.65. This was low enough to support the likelihood that scores in science subjects represent a separate ability dimension.
Modelling the subject difficulty in this study revealed that the science subjects, on the whole, have the
highest relative difficulty (averaged over score categories), and that they also generally have the highest
discrimination values. Aggregating the information provided by the difficulty and discrimination
parameters, it was found that scores in subjects like Fine Art gave very little information about the 2-
dimensional ability trait measured by the rest of the subjects, and that the little information they gave was
at the lower end of the ability scale. Subjects like history, economics, mathematics and geography, on the
other hand, gave a moderate amount of information around the middle ranges of the ability scale, while
the sciences (biology, chemistry and physics) gave the highest amount of information, but more within
the higher ability range of the scale. The exception was chemistry, whose information curve was bi-
modal, such that it also gave a moderate amount of information within the middle ability range.
The findings of this study are in line with what is generally thought about the difficulty of science as compared to non-science subjects, but they also provide a way to compare non-science subjects to one another. The subjects with some of the highest pass rates, like Art and the local languages, appear to measure something different from what is measured by the other subjects, and yet they are assumed to be
comparable to them. In the absence of a mechanism to compare subjects, universities understandably
have to rely only on raw scores in different subjects; however, the findings of this study provide an
alternative way of regarding the A’Level grades of applicants at selection, so as to improve the quality of
students enrolled and make the selection process more fair.
REFERENCES
Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37–51.

Baker, F. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.

Coe, R. (2008). Comparability of GCSE examinations in different subjects: An application of the Rasch model. Oxford Review of Education, 34(5), 609–636. DOI: 10.1080/03054980801970312

Coe, R. (2010). Understanding comparability of examination standards. Research Papers in Education, 25(3), 271–284.

Coe, R., Searle, J., Barmby, P., Jones, K., & Higgins, S. (2007). Relative difficulty of examinations in different subjects. Report for SCORE (Science Community Partnership Supporting Education). Durham: Curriculum, Evaluation and Management Centre, Durham University.

Glas, C. A. W. (2010). Preliminary manual of the software program Multidimensional Item Response Theory (MIRT). Retrieved from University of Twente website: http://www.utwente.nl/gw/omd/Medewerkers/temp_test/mirt-manual.pdf

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38–47. DOI: 10.1111/j.1745-3992.1993.tb00543.x

Kelly, A. (1976). A study of the comparability of external examinations in different subjects. Research in Education, 16, 37–63.

Muraki, E. (1993). Information functions of the generalised partial credit model. Applied Psychological Measurement, 17, 351–363.

Newton, P. E. (2005). Examination standards and the limits of linking. Assessment in Education, 12(2), 105–123. DOI: 10.1080/09695940500143795

UNEB (2010). Registration of Candidates for 2010 UCE and UACE Examinations. Retrieved from http://www.uneb.ac.ug/index.php?link=Guidelines&&Key=Secondary (4.10.2013)

UNEB (2013). Uganda Advanced Certificate of Education (UACE) Examinable Subjects. Retrieved from http://www.uneb.ac.ug/index.php?link=Syllabus&&Key=A (4.10.2013)