RESEARCH ARTICLE  Open Access

Using comparative judgement and online technologies in the assessment and measurement of creative performance and capability

Pina Tarricone and C. Paul Newhouse*

* Correspondence: p.newhouse@ecu.edu.au
Centre for Schooling and Learning Technologies (CSaLT), School of Education, Edith Cowan University, 2 Bradford St, Mount Lawley, 6050, Western Australia
Abstract
In this paper we argue that comparative judgement delivered by online technologies is
a viable, valid and highly reliable alternative to traditional analytical marking. In
the past, comparative judgement has been underused in educational assessment
and measurement, particularly in large-scale testing, mainly due to the lack of
supporting technologies to facilitate the large number of judgements and judges.
We describe the foundations of comparative judgement and dispel many of the
long-standing concerns about the time, cost and training it requires for large-scale
assessment. Studies in the use of comparative judgement and online technologies for
the assessment and measurement of practical performance conducted by Edith Cowan
University provide a context for further promoting its use in educational testing.
Keywords: Comparative judgement, Paired comparison, Pairwise comparison, Online
assessment, Creative performance, Relative judgements, Absolute judgements
Background
Assessment is influenced by numerous factors, including the teacher's expertise in the
design and application of a measurement instrument, such as a test, and their judge-
ment of student performance. Inferences and judgements about a student's learning
development are made on the basis of the evidence provided by that student's performance,
and judgement making compares that evidence with a standard or with another performance
(e.g. by another student). Performance evidence, its quality, and the judgements made are
crucial to reliable assessment and have framed the problem addressed by our research. In this
paper we argue for the increased use of comparative judgement in the assessment of
performance, particularly where creative expression is an integral component of that
performance. Initially the concept of comparative judgement is defined and explained,
then a range of literature is used to provide a rationale for its application to assessment,
the important role of digital technologies is explained, and finally our own research
journey over the past decade in the Centre for Schooling and Learning Technologies
(CSaLT) is used to illustrate the potential.
Comparative judgement
The fundamental principle of measurement involves comparison, whether between two
objects or performances or between a performance and a theoretical standard (e.g.
marking criteria). Comparison provides the basis upon which measurement instruments
are developed. Measurement instruments, such as assessment tasks, are deliberately designed and
developed to measure ability and to classify performance relative to another performance.
Essential to measurement is the construction of the task and its categories, which are used
to compare, classify and identify ability through performance in relation to a well-defined
outcome space (Humphry & Heldsinger, 2007; Wilson, 2005). Therefore, the basis of
measurement is comparison, based on a specific outcome space and developed criteria
that identify what performance outcomes are being measured and the categories of the
performance (Andrich, 1988; Rasch, 1961; Wilson, 2005).
Comparative judgement, comparative pairs, pairwise comparison or paired compari-
son are terms that are used to describe a measurement method, which involves making
inferences based on specific task criteria about one student's performance compared
with another's, in contrast to comparison with a theoretical standard as in analytical
methods. This means that through comparison, items, objects or performances are
measured with reference to another item, object or performance (Andrich, 1988; Rasch,
1961; Thurstone, 1927). Arguably comparative judgement began when Fechner (1860)
first introduced the psychophysical method which involved ranking objects of physical
magnitude; this method was later investigated and extended by Thurstone (1927, 1954).
Comparative judgement was originally used for the pairing of psychophysical stimuli,
such as weight, height and for measuring values and attitudes (Andrich, 1988; Bramley et al.
1998; Fechner, 1860; Jones & Alcock, 2014; Jones et al. 2015; Thurstone, 1927, 1954). The
original conception of comparative judgement was that repeated comparisons of pairs of
items, objects or performances relied on immediate perception with minimal cognitive
processing. A comparison for quality takes more time than a comparison for magnitude. An
example of an early study applying comparison for quality was by Bradley and Terry (1952),
who investigated the feasibility of using comparative judgement to determine the quality of
pork meat. Inferences of quality are made on the basis of the criteria, and these inform the
estimation of scale locations. The process and outcome of the comparison involve estimating
or inferring the scale locations of the performances or instances on a scale whose origin is
not an absolute zero (Bond & Fox, 2001; Humphry & Heldsinger, 2007).
Measurement of ability (i.e. latent trait) from performance relies upon comparison to
infer thresholds and form an interval scale (Andrich, 1988; Andrich & Luo, 2003).
Thresholds, classifications or intervals on the latent-trait continuum represent a range
of abilities. But differences in ability are not necessarily equal, and therefore the thresholds
or intervals on the continuum cannot be proportioned equally. The location of a student's
ability on the continuum is inferred from the judgement of their performance on a task or
tasks in a specific knowledge domain. A foundation of the latent-trait continuum is that
learning is cumulative and prior knowledge becomes the basis for the development of
new knowledge and understandings (Humphry & Heldsinger, 2007). Inference and
subsequent probabilistic mapping of ability and growth on a continuum, or underlying
growth continua, is fundamental to the Rasch (1980) latent-trait theory (Andrich, 1988).
Andrich (1988) argued that ability is a latent trait, whereas performances are the observable
manifestations of ability: "in making observations that reflect properties, the actual
properties are abstractions based on the patterns of observations" (p. 14). Based on
Andrich's conception, the latent nature of ability means that comparisons cannot be
deterministic, but are probabilistic.
Deterministic classifications involve comparisons that are determined with certainty
(Andrich, 1988). However, comparisons cannot be deterministic if they are measuring ability
because it is difficult to determine with certainty the classification of abilities as they are not
necessarily fully observable. Therefore, the comparisons are probabilistic; meaning that the
outcomes of the comparisons are based on frequency, or some likelihood and qualitative
judgment. In addition, invariance of classification is essential to measurement in a two-way
frame of reference, such as in comparative judgement (Andrich, 1988). The Guttman struc-
ture provides a two-way observational frame of reference where comparisons are made be-
tween two items, objects or performances in a process of classification (Andrich, 1988;
Bond & Fox, 2001; Humphry & Heldsinger, 2007). The outcomes of such comparisons are
labelled dichotomous as they involve two outcomes represented by the integers of 0 or 1.
Student work samples are compared to each other in this two-way frame of reference to
identify which has achieved the criteria of the assessment at a higher or lower level. There-
fore, comparison is used to determine whether each work sample is better (1) or worse (0)
than another work sample. Hence, the ordering or comparison process is not deterministic
but is probabilistic (Humphry & Heldsinger, 2007).
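To make the probabilistic framing concrete, the following is a minimal sketch, in Python, of the Bradley-Terry/Rasch style model that underlies pairwise comparative judgement: the probability that work A is judged better than work B depends only on the difference between their latent scale locations, and the locations can be estimated from the observed dichotomous (0/1) outcomes. The function names and the simple gradient-ascent estimation routine are illustrative assumptions, not the procedure used in the studies cited here.

```python
import math
from collections import defaultdict

def win_probability(theta_a, theta_b):
    """Bradley-Terry / Rasch form: probability that A is judged better than B."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

def estimate_locations(judgements, n_items, iterations=200, learning_rate=0.05):
    """Estimate latent scale locations from dichotomous pairwise outcomes.

    judgements: list of (winner_index, loser_index) tuples, one per comparison.
    Returns location estimates (in logits), centred on zero to fix the origin,
    reflecting that the scale has no absolute zero.
    """
    theta = [0.0] * n_items
    for _ in range(iterations):
        gradient = defaultdict(float)
        for winner, loser in judgements:
            p = win_probability(theta[winner], theta[loser])
            # Gradient of the log-likelihood of the observed outcome.
            gradient[winner] += 1.0 - p
            gradient[loser] -= 1.0 - p
        theta = [theta[i] + learning_rate * gradient[i] for i in range(n_items)]
        mean = sum(theta) / n_items
        theta = [t - mean for t in theta]  # centre the scale each iteration
    return theta

# Example: three scripts; script 0 beats 1 and 2, and script 1 beats 2.
print(estimate_locations([(0, 1), (0, 2), (1, 2)], n_items=3))
# The estimates order script 0 above script 1, and script 1 above script 2.
```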
Referring to Thurstone's work on comparative judgements, Pollitt (2012a), who we
understand initially collaborated with Andrich in the 1970s, argued that comparisons
can be made with any kind of object, including educational performances. Essentially,
assessors' holistic judgements provide a shared contribution to the development of
a measurement interval scale. The scale represents the outcomes of the assessors'
cumulative comparative judgements of the quality of the psychometric latent trait being
measured (Bramley, 2007; Pollitt, 2012a).
Relative and absolute judgements
"There is no absolute judgment. All judgments are comparisons of one thing with
another" (Laming, 2004, p. 9). Absolute and relative judgements are terms that have
been used to describe analytical and comparative judgement assessment methods
respectively. Absolute judgements, or analytical traditional marking/scoring, refer to
the allocation of marks/score points or a grade (Gill & Bramley, 2013; Pollitt, 2012a). It
has been contended by researchers that relative, comparative judgements are highly
reliable, more so than absolute, analytical judgements (see Jones & Alcock, 2012,
2014; Laming, 2004; Pollitt, 2012b). Jones and Alcock (2012) argued that "people are
far more reliable when comparing one thing with another than when making absolute
judgements" (p. 64). Pollitt has also consistently supported the argument that comparative
judgements, specifically adaptive comparative judgements, can generate "extremely reli-
able scores, far higher than traditional marking" with little training (Pollitt, 2012a, p. 168).
For example, a study investigating absolute and relative judgements and standard-setting
by Gill and Bramley (2013) considered the accuracy of holistic comparative judgements in
relation to the original analytical marks of examination scripts that were "close together in
overall mark" in a standard-setting exercise. The assessors made absolute (mark/grade of
each script), relative (script quality compared to another script) and confidence judge-
ments (how confident they were of their judgements). Gill and Bramley found that in the
standard setting/awarding process, the assessors made more accurate relative judgements
than absolute judgements, and that the judgements that were described by the judges as
"very confident" had higher accuracy than other judgements.
Heldsinger and Humphry (2013) identified some advantages of analytical marking,
including that the use of criterion marking rubrics provides specific information for
assessors, reducing subjectivity. Their review of the literature also suggested that there
have been arguments highlighting concerns regarding the assumption that using rubrics
increases inter-rater reliability and validity, and the overall accuracy and quality of assessment
(p. 221). Finally, they concluded that assessors may mark leniently, harshly, or to a different
standard from others, and may value quality differently, leading to a need for substantial
time to be allocated to training and moderation. Pollitt (2012a) argued that a
marker "may award the same average mark, yet discriminate more finely amongst the
objects, in effect being more generous to the better ones and more severe to the
poorer or vice versa" (p. 168). Therefore, one major downside of absolute, analytical
marking is that ensuring reliability is challenging and costly to monitor. In contrast,
relative judgements can be constantly monitored as they are made, and decisions can
be made if additional judges or judgements are needed to increase reliability. Overall,
there is agreement amongst researchers in the field with Jones and Alcock's (2012)
argument that "people are far more reliable when comparing one thing with another
than when making absolute judgements" (p. 64). Comparative judgement ensures that
individual judge severity is overcome because the judgement process is focused on
the "relative merit" of each performance and not the individual judge's standard
(Bramley, 2007, p. 265; Pollitt, 2012a). Also, if judges make different quality
discrimination decisions, the process of comparative judgement, after multiple rounds
of judgements, can accurately rank performance quality (Pollitt, 2012a).
One of the main reasons for the high reliability of relative comparative judgements
compared to absolute analytical judgements is that relative judgements involve larger
numbers of judges. Absolute analytical marking uses just one or two markers, whereas
comparative judgement requires more than two judges, e.g. ten judges (Pollitt, 2012a). For
example, Pollitt (2012b) refers to his work with Richard Kimbell on the e-scape study of
the assessment of e-portfolios as an example of the high reliability coefficients that can be
achieved with digital assessment and the comparative judgement method. Pollitt posited
that in Kimbell et al.'s (2009) study the high reliability coefficient of 0.96, generated by 28
judges assessing 352 e-portfolios with 3067 judgements, was higher than any analytical
marking system could achieve. Studies by Heldsinger and Humphry (2010, 2013) found
that highly reliable comparative judgements of writing scripts by a large number of judges
were made using calibrated exemplars as referents, as suggested by Thurstone (1928).
They argued that comparative judgement, using calibrated exemplars, was successfully
used as a highly reliable method to validate results from large-scale testing programs
without the extensive assessor training and moderation processes that were required
for absolute analytical judgements. In their 2010 study they found a high level of internal
consistency of the teacher judgements and that the comparative judgements were highly
correlated with the results of analytical marking from an Australian large-scale test for the
same students, providing evidence of concurrent validity. These researchers were moti-
vated by concerns about the accuracy of teacher judgements and about the quality of
information provided by standardized testing, which includes very little diagnostic
information. They found
that complex rubrics used for analytical marking required extensive training, whereas,
comparative judgement required little training. They also found that the assessors' judge-
ments of writing performance were highly internally consistent and highly reliable, consid-
ering that initially the teachers were concerned about making judgements across ability
ranges with which they were not familiar. In a follow-up study Heldsinger and Humphry
(2013) used calibrated writing exemplars as the referents for comparative judgements of
scripts to assess student writing performance. The analysis showed a very high reliability
and internal consistency.
Several researchers have found that judgements based on quality, using specified
holistic criteria, resulted in a more valid assessment of performance (Bramley,
2007; Heldsinger & Humphry, 2010; Pollitt, 2012a). According to Pollitt (2012a),
the high reliability of comparative judgement is "a consequence of the constant
focus on validity" (p. 168) and this "demands and checks that a sufficient consen-
sus exists amongst the pool of judges involved" (p. 167). Comparative judgements
are independent and additional judgements increase reliability (Bramley et al.,
1998; Pollitt, 2012a). Thus, judge bias and misfitting scripts can be easily identified
during the judgement process, and reliability can be improved by increasing the
number of judges and/or the number of judgements (Bramley, 2007). Additional
judgements of work at grade cut-offs can help to improve and consolidate the
ranking of works and improve reliability (Pollitt, 2012a). However, Bramley (2007)
has argued that it is "implausible" that each comparison is independent, as the
judges would remember particular scripts. We believe this may be the case when
comparative judgements are conducted with a small number of hard-copy scripts
or performances, but it is unlikely with the large numbers of scripts and judgements
made using online software systems, because judges may not encounter individual
scripts again in subsequent comparisons.
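The link between the number of judgements and reliability can be illustrated with the Rasch-style separation reliability, which compares the spread of the estimated scale locations with the measurement error in those estimates; as independent judgements accumulate, the standard errors shrink and the coefficient rises. The sketch below is illustrative only and assumes that location estimates and their standard errors are already available from a pairwise analysis; it is not taken from the cited studies.

```python
def separation_reliability(locations, standard_errors):
    """Rasch-style separation reliability.

    The observed variance of the location estimates includes measurement error;
    the 'true' variance is the observed variance minus the mean error variance,
    and reliability is the ratio of true variance to observed variance.
    """
    n = len(locations)
    mean = sum(locations) / n
    observed_variance = sum((x - mean) ** 2 for x in locations) / (n - 1)
    error_variance = sum(se ** 2 for se in standard_errors) / n
    true_variance = max(observed_variance - error_variance, 0.0)
    return true_variance / observed_variance if observed_variance > 0 else 0.0

# With the same spread of locations, smaller standard errors (i.e. more
# judgements per work) give a higher reliability coefficient.
print(separation_reliability([1.2, 0.1, -1.3], [0.6, 0.6, 0.6]))  # about 0.77
print(separation_reliability([1.2, 0.1, -1.3], [0.2, 0.2, 0.2]))  # about 0.97
```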
Time, cost, training and large-scale testing
Researchers have identified several limitations of comparative judgement. One is the
time that it takes to make the large number of judgements and the number of judges
required to reach the high levels of reliability that can be achieved with comparative
judgement (Bramley et al., 1998; Heldsinger & Humphry, 2013; Jones & Alcock, 2014).
Heldsinger and Humphry (2013) explained that although comparative judgement was
an efficient assessment method, complex scripts required additional time to complete the
number of judgements needed for reliability. Bramley (2007) used the term psychological
validity to describe the additional cognitive demands of making valid and reliable judge-
ments of complex performances, such as scripts, which increase the time taken to
make the judgement. Bramley's psychological validity may have some connection with
what Thurstone (1927, 1954) described as the "discriminal process" (p. 48). According to
Thurstone, during comparative judgement the discriminal process encompasses the differ-
ent reactions to each work sample, item or task as it is being compared to the next work
sample, item or task. Thus the discriminal process leads to the comparative judgement
and affects the time taken.
The issue of time in making comparative judgements was highlighted in a study by
McMahon and Jones (2014), who investigated the use of comparative judgement in the
assessment of a secondary school chemistry experiment. The researchers found that
the comparative judgement was highly reliable, consistent across judges and the judge-
ments correlated with the analytical marking testing data. The study involved 154 test
scripts and five teachers from a secondary school in Ireland. Altogether they made
1550 judgements using the "No More Marking" online system. All of the scripts were
also analytically marked by two of the five teachers. The researchers found that the
comparative judgements totalled approximately 14 h (3 h per judge) and the analytical
marking totalled approximately 3 h to mark all 154 tests. The two teachers who analyt-
ically marked the tests reported that they each took one and a half hours to mark the
tests without providing feedback. The estimated reported marking time seems short for
the number of scripts that were marked. Although the comparative judgements were
more time consuming than the analytical marking, McMahon and Jones acknowledged
that other studies, including our own, have shown otherwise (see Jones et al., 2015;
Kimbell, 2012; Newhouse, 2011). Our review of the test instrument revealed that a possible
reason for this time discrepancy, not identified by the authors, was that it was not conducive
to making holistic judgements. Most of the questions required students to list responses
rather than provide a descriptive response. These questions required the judge to assess
each individual response rather than to make a holistic judgement of the script.
Another time and cost issue is assessor training. Several researchers have found that
assessors needed little training in the use of comparative judgement systems and the
process, including those in which judges had not had specific assessor training (see
Heldsinger & Humphry, 2013; McMahon & Jones, 2014; Pollitt, 2012a). Training gener-
ally involves familiarising assessors with the holistic performance criteria and with the
quality of the evidence presented in a range of work samples
(Kimbell et al., 2009). For example, in one study Heldsinger and Humphry (2010) found
that training for comparative judgements took approximately half an hour, whereas,
analytical marking rubric training needed a full day. In terms of judgement making,
Heldsinger and Humphry (2013) found that with little training, assessors using cali-
brated exemplars were able to efficiently judge 29 writing performances, one every
three minutes, in one and a half hours. Although they did not specifically refer to the
lower cost of comparative judgement, they did consistently argue for the efficiency of the
method, in terms of training and judgement making, in contrast to traditional, analyt-
ical marking. Pollitt (2012b) has consistently argued that comparative judgment can be
conducted at a comparable or lower cost to analytical, traditional marking and that
comparative judgement provides consistently high reliability. Jones and Alcock (2014)
asserted that the larger number of judges in comparative judgement can be more costly
than analytical marking, but this is offset by the greater reliability in the comparative
judgment process.
Use of technology and comparative judgement
It has been noted by Heldsinger and Humphry (2013) that until recently there has been
little use of comparative judgement methods for educational purposes, probably because it
has been considered time consuming, tedious and laborious (see Bramley et al., 1998). The
recent increase in the use of comparative judgment may be largely due to the availability
and use of online software systems, using digitised student works, by small groups of
researchers in the UK and in Western Australia. These online systems, such as the Adaptive
Comparative Judgement System (ACJS) (Pollitt, 2012b) and the Pair-Wise Web Software
(Humphry et al. 2013-2015), are based on Thurstone's (1927) law of comparative judge-
ment and have dealt with many of the issues that in the past inhibited the use of compara-
tive judgement in assessment (Jones & Alcock, 2012; Pollitt, 2012a). These online systems
provide a platform for making comparative judgements of digitised works such as scripts,
recorded performances and portfolios.
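The cited papers do not document the internals of these systems, but their core workflow can be sketched: a scheduler presents each judge with a pair of digitised works, records which was judged better, and, in adaptive variants, prefers pairs whose current estimates are close, since near-equal pairs carry the most statistical information. The routine below is a hypothetical illustration in Python, not the ACJS or Pair-Wise Web Software implementation.

```python
import itertools
import random

def next_pair(items, estimates, judged_pairs, adaptive=True):
    """Choose the next pair of digitised works to present to a judge.

    items: list of work identifiers.
    estimates: dict mapping identifier -> current scale location estimate.
    judged_pairs: set of frozensets of identifiers already compared.
    """
    candidates = [
        (a, b) for a, b in itertools.combinations(items, 2)
        if frozenset((a, b)) not in judged_pairs
    ]
    if not candidates:
        return None
    if adaptive:
        # Adaptive variant: pair works whose current estimates are closest,
        # because such comparisons are the most informative.
        return min(candidates,
                   key=lambda pair: abs(estimates[pair[0]] - estimates[pair[1]]))
    return random.choice(candidates)  # non-adaptive variant: random pairing

works = ["portfolio_A", "portfolio_B", "portfolio_C"]
current_estimates = {"portfolio_A": 0.8, "portfolio_B": 0.1, "portfolio_C": -0.9}
print(next_pair(works, current_estimates, judged_pairs=set()))
# -> ('portfolio_A', 'portfolio_B'), the closest pair on the current scale
```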
The recent use of these digital comparative judgement systems has been for standard
setting (Bramley, 2007; Bramley et al., 1998; Gill & Bramley, 2013), moderation and
marking of scripts and portfolios in areas such as writing (Heldsinger & Humphry,
2010, 2013; Pollitt, 2012b), mathematical problem solving (Jones et al., 2015), science
(McMahon & Jones, 2014), design and visual arts (Newhouse & Tarricone, 2014), design
and technology e-portfolio assessment (Kimbell et al., 2009), scientific and technology en-
quiry e-portfolio assessment (Davies et al. 2012), oral language assessment (Newhouse,
2011; Pollitt & Murray, 1996), and peer assessment (Jones & Alcock, 2012, 2014;
Seery et al. 2012). These studies have provided evidence of the efficacy in using digital
assessment systems for relative comparative judgement. Many researchers have also
investigated the effectiveness of, and correlation between, absolute analytical marking
and relative comparative judgements. For example, research by Jones, Swan and Pollitt
(2015) investigated the use of comparative judgement, as an alternative to analytical criter-
ion marking methods, to assess constructs such as mathematical problem solving, which
can be somewhat difficult to capture and describe in analytical marking criteria. Their
research involved two studies and two groups of mathematics education experts in
the UK (Group 1 N=12, Group 2 N=11). The first study investigated the suitability
of comparative judgements to assess the General Certificate of Secondary Education
(GCSE) final examinations in the UK to determine whether comparative judgement
could replicate the analytical marking using specific criteria. In the first study, Group 1
completed 151 judgements in 105 min and Group 2 completed 150 judgements in
100 min, and the results showed that comparative judgement could be used to assess the GCSE.
The second study used comparative judgement to assess mathematical problem solving
skills, using 18 scripts from three tasks. The judgements were correlated with script
grades to determine the validity and reliability of comparative judgements as a measure to
assess summative examinations and problem solving in mathematics. The groups of asses-
sors, in both studies, produced ranked measurement scales with high reliability.
The CSaLT journey to comparative judgements
The Centre for Schooling and Learning Technologies (CSaLT) was formed in 2003 at
Edith Cowan University, Perth, Western Australia, with a mission to conduct research
into, and promote the use of, digital technologies to improve schooling. The premise was
that technologies could be used for teaching, learning and assessment (Newhouse, 2010).
Increasingly, research in the Centre has found that the technologies were readily available
in educational institutions but seriously underused especially in high-stakes assessment.
In response to this situation, researchers in CSaLT decided to partner with the Western
Australian curriculum and examination authority in testing the use of digital technologies
in high stakes assessments for tertiary entrance. This decision resulted in three major
studies from 2006 to 2014. Detailed descriptions of the methodologies and results have
been previously reported (e.g., Newhouse, 2011; Newhouse & Tarricone, 2014).
The first study, from August 2006 to December 2007, was funded by a University col-
laborative research grant, and involved collaboration between CSaLT and the School
Curriculum and Standards Authority (Newhouse, 2010). The researchers sought to in-
vestigate using digital technologies to support alternative forms of external assessment
of student performance in tertiary entrance courses that included a major practical
component (Newhouse, 2008). The focus of this study was the use of digital technolo-
gies to "capture" and score performance on practical tasks in an engineering and an
applied IT course. The study involved 13 small teacher-class case studies. During this
study Professor Richard Kimbell from London University was a visiting scholar at the
centre; he introduced the e-scape project, assessing complex practical performance
using evidence collected in digital forms, and the use of comparative judgements for
generating scores (Kimbell, 2012). As a result one of the outcomes of this study was
the successful application of a comparative judgements method of scoring to the
assessment of digital posters created by students under exam conditions. To facilitate
this an online database tool was custom-designed using Filemaker Pro to allow judges
to view pairs of student work and enter their judgements. On completion of all judge-
ments, the researchers exported and analysed the data using RUMMcc software.
The first study was then used as a pilot for a more extensive three-year study investi-
gating the feasibility of using digital representations of work for authentic and reliable
performance assessment in foreign language, engineering, information technology, and
physical education tertiary entrance courses (Newhouse, 2011). The study was funded
by an Australian Research Council Linkage grant, with the local curriculum authority
as the partner again, and also involved collaboration with researchers in the British
e-scape project (Kimbell et al. 2007). For each course, digital forms of assessment
were devised to improve on the standard form of assessment used at the time, with
the evidence (e.g. student work, recording of performance) stored in digital files
within online systems. Assessors accessed these files using online scoring tools to
facilitate analytical marking and comparative judgement. By the second year of the
study a commercial system had been developed by TAG Learning for the e-scape
project (eventually known as ACJS), so this system was used for comparative judgements.
It not only displayed the assessment evidence and captured assessor judgements, but
also included the Rasch mathematical calculations to progressively provide scores and
reliability statistics. As a result, assessors no longer had to make a set number of
judgements before completing the exercise; completion was determined by reaching a
required level of reliability.
Rasch measurement was used to analyse all scoring data to determine reliability coef-
ficients and related statistics. The comparative judgement method, for all four courses,
generated highly reliable sets of scores, with Cronbach's Alpha and Separation Index
coefficients exceeding 0.9. For all four courses the comparative judgement method
showed higher levels of reliability than analytical marking, but the comparative judge-
ment scores and rankings were typically moderately to highly correlated with the scores
from traditional analytical marking (Newhouse, 2011). Issues related to the time taken
to complete comparative judgements, when compared with analytical marking, were
mainly related to file size, download time and the amount of text and pages of the
digital evidence, rather than the time it took to make the comparative judgement. Over-
all, the study demonstrated the feasibility of using Rasch measurement with analytical
marking and comparative judgements for the digital assessment of performance as a
reliable and valid alternative to paper-based high-stakes examinations. Ultimately the
researchers found that the comparative judgment method of scoring was more reliable
than analytical marking.
In the third study, which began in 2011 and was completed in 2014 (Newhouse &
Tarricone, 2014), the researchers sought to investigate the efficacy of digitisation for
online analytical marking and comparative judgement methods by building on Dillon
and Brown's (2006) use of e-portfolios. This study involved three phases: the first
two phases investigated the efficacy of digitisation of student practical work and the
effectiveness of comparative judgement of portfolios for the purpose of summative
assessment in the Visual Arts and Design tertiary entrance courses. Once again the
ACJS was used to facilitate the comparative judgements. In the first phase the
researchers explored the potential of representing practical work in the two courses,
whereas the second phase concerned whether it was feasible for students to create
and submit such representations. The third phase focused solely on the online scoring
of Visual Arts digital portfolios for the purpose of moderation to ensure consistent
standards. The researchers believed that this could be accomplished efficiently and
effectively using the approach to the online scoring of digital portfolios developed in
this study.
In all three phases of the study the researchers demonstrated that the digitised stu-
dent work could be efficiently and effectively assessed using online marking systems.
For both courses, there was low consistency between the assessors for analytical
marking, although the combined scores using a Rasch polytomous model exhibited
good reliability. For both courses there was a high level of internal consistency in the results
of the comparative judgements. For all three phases, and for both Visual Arts and
Design, the comparative judgement method generated highly reliable sets of scores,
with Cronbach's Alpha and Separation Index coefficients for the first two phases
exceeding 0.9. However, in the third phase a reliability coefficient of only 0.88 was
achieved, with the most likely explanation being a lack of a consensus understanding
of the criteria among some of the Visual Arts teachers who were inexperienced with
marking these submissions.
Conclusion
In this paper we have used a range of literature and CSaLT's eight-year research journey
to argue that the authenticity, validity and reliability of high-stakes summative assess-
ment may be enhanced through the use of the comparative judgement method of
scoring using digital representation of evidence and online tools to facilitate scoring
and feedback. The changing requirements of tertiary courses necessitate that students
be proficient in using a variety of digital technologies to support their learning. The
potential for these technologies to be used for high-stakes assessment is fast becoming a
reality, with the added capacity to use comparative judgement for scoring and Rasch
modelling to analyse the data. Now is the time to embrace digital technologies for
high-stakes assessment and to consider comparative judgement for scoring as a viable
and reliable alternative to analytical marking, particularly for creative performances
that rely on highly subjective judgements.
Authors' information
Paul is an Associate Professor at Edith Cowan University in Western Australia where he is the director of the Centre for
Schooling and Learning Technologies (CSaLT) in the School of Education. His aim is to improve the opportunities for
children to develop their potential through engaging and relevant schooling.
Pina's PhD in educational/cognitive psychology won the 2007 Edith Cowan University Research Medal. Psychology
Press published her thesis research as a book titled The Taxonomy of Metacognition. She has an M Ed in Interactive
Multimedia. In 2014 she completed an M Ed in Educational Measurement with Honours from the University of Western
Australia. Her interests include educational psychology constructs, psychometrics and the use of technologies for
assessment.
Acknowledgements
The study discussed in this paper was the work of a research team led by Paul Newhouse and included researchers
Jeremy Pagram, Lisa Paris, Mark Hackling, Martin Cooper, Pina Tarricone, Alistair Campbell, Alun Price and many
research assistants. The work of everyone in this team, particularly Martin Cooper and Pina Tarricone, and the teachers
and students involved, contributed to the research outcomes presented in this paper.
Received: 17 July 2015 Accepted: 16 November 2015
References
Andrich D (1988) Rasch models for measurement. Sage Publications, Newbury Park
Andrich D, Luo G (2003) Conditional pairwise estimation in the Rasch model for ordered response categories using principal components. J Appl Meas 4(3):205-221
Bond TG, Fox CM (2001) Applying the Rasch model: Fundamental measurement in the human sciences. L. Erlbaum, Mahwah, NJ
Bradley RA, Terry ME (1952) Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4):324-345
Bramley T (2007) Paired comparison methods. In: Newton P, Baird J-A, Goldstein H, Patrick H, Tymms P (eds) Techniques for monitoring the comparability of examination standards. QCA, London, pp 264-294
Bramley T, Bell JF, Pollitt A (1998) Assessing changes in standards over time using Thurstone paired comparisons. Education Res Perspectives 25(2):1-24
Davies D, Collier C, Howe A (2012) Assessing scientific and technological enquiry skills at age 11 using the e-scape system. Int J Technology Design Education 22(2):247-263
Dillon SC, Brown AR (2006) The art of ePortfolios: insights from the creative arts experience. In: Ali J, Kaufman C (eds) Handbook of Research on ePortfolios. Idea Group Reference, Hershey, pp 420-433
Fechner GT (1860) Elements of psychophysics (trans: Adler HE). Holt, Rinehart & Winston
Gill T, Bramley T (2013) How accurate are examiners' holistic judgements of script quality? Assessment in Education: Principles, Policy & Practice 20(3):308-324
Heldsinger S, Humphry SM (2010) Using the method of pairwise comparison to obtain reliable teacher assessments. Australian Educational Res 37(2):1-19
Heldsinger S, Humphry SM (2013) Using calibrated exemplars in the teacher-assessment of writing: an empirical study. Educational Res 55(3):219-235
Humphry SM, Heldsinger S (2007) Using measurement principles in developing assessment processes. In: Educational Assessment and Measurement Series. University of Western Australia, Perth
Humphry SM, Wray WH, Wray FW (2013-2015) Pair-Wise Web Software. The University of Western Australia, Perth
Jones I, Alcock L (2012) Summative peer assessment of undergraduate calculus using adaptive comparative judgement. In: Iannone P, Simpson A (eds) Mapping university mathematics assessment practices. University of East Anglia, East Anglia, pp 63-74
Jones I, Alcock L (2014) Peer assessment without assessment criteria. Studies in Higher Education 39(10):1774-1787
Jones I, Swan M, Pollitt A (2015) Assessing mathematical problem solving using comparative judgement. Int J Science Mathematics Education 13:151-177
Kimbell R (2012) Evolving project e-scape for national assessment. Int J Technology Design Education 22:135-155
Kimbell R, Wheeler T, Miller A, Pollitt A (2007) e-scape: e-solutions for creative assessment in portfolio environments. Technology Education Research Unit, Goldsmiths College, London
Kimbell R, Wheeler T, Stables K, Sheppard T, Martin F, Davies D, Whitehouse G (2009) e-scape portfolio assessment: Phase 3 report. Technology Education Research Unit, Goldsmiths College, London
Laming D (2004) Human judgment: The eye of the beholder. Thomson Learning, London
McMahon S, Jones I (2014) A comparative judgement approach to teacher assessment. Assessment in Education: Principles, Policy & Practice. doi:10.1080/0969594X.2014.978839
Newhouse CP (2008) Digital forms of performance assessment. Paper presented at the Australian Computers in Education Conference, Canberra, ACT
Newhouse CP (2010) Aligning assessment with curriculum and pedagogy in applied information technology. Australian Educational Computing 24(2):4-11
Newhouse CP (2011) Comparative pairs marking supports authentic assessment of practical performance within constructivist learning environments. In: Cavanagh RF, Waugh RF (eds) Applications of Rasch Measurement in Learning Environments Research. Sense Publishers, Rotterdam, pp 141-180
Newhouse CP, Tarricone P (2014) Digitizing practical production work for high-stakes assessments. Canadian J Learning Technology 40(2):1-17
Pollitt A (2012a) Comparative judgement for assessment. Int J Technology Design Education 22:157-170
Pollitt A (2012b) The method of Adaptive Comparative Judgement. Assessment in Education: Principles, Policy & Practice 19(3):281-300
Pollitt A, Murray NL (1996) What raters really pay attention to. In: Milanovic M, Saville N (eds) Studies in language testing 3: Performance testing, cognition and assessment. Cambridge University Press, Cambridge, pp 74-89
Rasch G (1961) On general laws and the meaning of measurement in psychology. In: Neyman J (ed) Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Vol. IV: Contributions to biology and problems of medicine. University of California Press, Berkeley, pp 321-333
Rasch G (1980) Some probabilistic models for the measurement of attainment and intelligence. MESA Press, Chicago
Seery N, Canty D, Phelan P (2012) The validity and value of peer assessment using adaptive comparative judgement in design driven practical education. Int J Technology Design Education 22:205-226
Thurstone LL (1927) A law of comparative judgement. Psychol Rev 34:278-286
Thurstone LL (1928) Attitudes can be measured. American J Sociology 33(4):529-554
Thurstone LL (1954) The measurement of values. Psychol Rev 61(1):47-58
Wilson M (2005) Constructing measures: An item response modelling approach. Lawrence Erlbaum Associates, Mahwah