RESEARCH ARTICLE
A Systematic Review of Tools that Support Peer Assessment
Andrew Luxton-Reilly
(Received 6th July 2009; final version received 20th September 2009)
Computer Science Department, The University of Auckland, Private Bag 92019, Auckland, New Zealand
Peer assessment is a powerful educational technique that provides significant benefits to both staff and students.
Traditionally, peer assessment has been conducted using pen-and-paper in small classes. More recently, online
tools have been developed to enable peer assessment to be applied in large classes. In this paper, the tools that
support peer assessment are reviewed and analysed, revealing the common features and significant differences.
Future directions for research on peer assessment tools are suggested.
1 Introduction
The massification of tertiary education has impacted on the quality and quantity of interactions
between instructors and students (Ballantyne, Hughes & Mylonas, 2002). Although the opportunities
for instructors to provide detailed feedback on students’ work have decreased, a high
degree of individualized feedback for students can be maintained by engaging them in tasks that
promote learning by interacting with each other. Hamer et al. (2008) report growing interest in
the use of contributing student pedagogies among Computer Science educators. They define a
contributing student pedagogy (CSP) as:
A pedagogy that encourages students to contribute to the learning of others and to value the
contributions of others (p. 195).
Contributing student pedagogies characteristically involve the use of new (web-based) technologies.
They encompass a wide range of activities, including that of peer assessment which has
been defined as:
. . . an arrangement in which individuals consider the amount, level, value, worth, quality or success
of the products or outcomes of learning of peers of similar status (Topping, 1998).
Peer assessment is not a new assessment strategy. Peer assessment has been used in many
institutions for more than 50 years (Sluijsmans, Brand-Gruwel & van Merriënboer, 2002), in a
wide range of higher education contexts such as academic writing, science, engineering, business
and medicine (Falchikov, 1995; Freeman & McKenzie, 2002). Peer review has been used as a
learning process to improve the quality of computer programs for at least 30 years (Anderson &
Shneiderman, 1977).
In a review of the peer assessment literature, Topping (1998) concludes that peer assessment
has been used in a wide variety of contexts and that it can result in gains in the cognitive, social,
affective, transferable skill and systemic domains. The majority of the studies reviewed showed
an acceptably high level of validity and reliability. A subsequent review of peer assessment by
Dochy, Segers and Sluijsmans (1999) showed that peer assessment can be valuable as a formative
assessment method, and that students find the process sufficiently fair and accurate.
Email: andrew@cs.auckland.ac.nz
Ballantyne et al. (2002) report significant benefits of peer assessment, but at the cost of
significant administrative overheads. Online tools can alleviate this overhead by providing appropriate
administrative support for the management of the peer review process. The use of
these tools enables peer assessment to be used in contexts such as large classes where it would be
infeasible without such support. Furthermore, a number of features such as anonymous online
discussion or automated weighting of reviews cannot be provided by traditional pen-and-paper
or face-to-face peer reviews.
The experience of using an online tool for reviewing assignments is qualitatively different
to both face-to-face reviewing and using pen-and-paper to mark. An early study by Price and
Petre (1997) of instructors using electronic marking reported numerous benefits over paper-based
marking, including: improved legibility, easy reuse of comments, faster turn-around time, lower
administrative overheads and fewer administrative errors.
Plimmer and Apperley (2007) note that the act of marking paper-based assignments often
involves scanning the assignments, physically reordering the scripts and making annotations on
the scripts as a reminder of the critical points, summarising grades and as feedback to students.
The location of annotations on paper provides an easily identifiable reference point which is
clumsy to replicate with an online system. The authors advocate the use of digital ink as a
means to retain the traditional advantages of pen-and-paper marking while using electronic
systems to relieve the administrative burden imposed by paper.
Murphy and Wolff (2005) compared “Minute Papers” created electronically with those created
using pen-and-paper. They found that the response rate using pen-and-paper was higher, but
that the student responses were considerably longer in the electronic version.
McLuckie and Topping (2004) note that although both face-to-face and online peer assessment
activities involve many similar skills, there are important differences. Face-to-face activities involve
socio-affective elements which are difficult to develop in online interactions. Other skills,
such as interactive process management, that are essential for online environments are less critical
for face-to-face interaction. Figl, Bauer and Mangler (2006) noted qualitative differences
between reviews conducted between teams using traditional pen-and-paper, an online tool, and
face-to-face. Students reported that communication was easier face-to-face, and that it was significantly
easier to give hints and helpful feedback using pen-and-paper compared to filling in
an online form.
The significant qualitative differences observed when different media are used to conduct
peer review highlight the importance of reviewing the research on tools that support peer assessment.
This is of particular interest to Computer Science educators since the majority of the
tools described in this review have been developed by Computer Science instructors for use in
Computer Science classrooms.
Webster and Watson (2002) claim that the paucity of review articles published in the information
systems field impedes research progress. Although the literature on peer review has
previously been reviewed, the literature on the tools that support peer assessment has not. In
this paper we review the currently available tools, compare and contrast the features provided
by the tools and analyse these features with respect to the findings from the literature. The
following research questions are addressed.
(1) What are the common features and important differences between online tools?
(2) How does the implementation of features relate to the main findings reported by reviews
of the peer assessment literature?
(3) What directions for future research are indicated?
2 Method
A systematic literature review is a process that seeks to aggregate empirical data using a formal
protocol. Kitchenham describes the process as:
“a means of evaluating and interpreting all available research relevant to a particular research
question, topic area or phenomenon of interest” (Kitchenham, 2004).
The procedures for practising evidence-based literature reviews have been well documented in
the medical domain (Sackett, Richardson, Rosenberg & Haynes, 1997). More recently, the steps
for conducting evidence-based reviews in software engineering have been identified and documented
(Kitchenham, 2004). Brereton, Kitchenham, Budgen, Turner and Khalil (2007) report
that systematic literature reviews help researchers rigorously and systematically aggregate outcomes
from relevant empirical research.
2.1 Data sources and study selection
Primary studies were identified by searching the IEEE Xplore, ACM Digital Library, Google
Scholar, Citeseer, ScienceDirect and SpringerLink electronic databases. The Journal of Computer
Assisted Learning, Computer Science Education and Computers and Education were also
searched. The title, abstract and keywords were searched for the phrases (“peer assessment” OR
“peer review” OR “peer evaluation”). As not all of the databases supported boolean phrases in
the same way, the search was adapted as required to obtain equivalent results.
The title and abstracts of the search results were assessed for relevance. Studies that mentioned
the use of peer assessment in large classes were scanned to determine if technology was used
to assist the assessment process. Studies that mentioned the use of software to support peer
assessment were scanned to determine the nature of the software and how it was used.
In order to be included in this review, the software must have been designed specifically for
the purpose of supporting peer assessment activities. This excluded a number of studies that
discussed the use of standard communication software such as email (Downing & Brown, 1997),
forums (Mann, 2005) and wikis (Xiao & Lucking, 2008; Lutteroth & Luxton-Reilly, 2008) for
peer assessment. Tools such as TeCTra (Raban & Litchfield, 2007) and SPARK (Freeman &
McKenzie, 2002) that are designed to be used in the context of group projects to support the
assessment of an individual’s contribution within a team are explicitly excluded from this study.
Although these kinds of peer reviews have elements of commonality with the peer review of
artefacts, the review process is qualitatively different. In the review of teammates, students are
typically assessing in a competitive way (since a higher grade for a teammate normally results in
a lower personal score), and they are often required to evaluate personal qualities and impressions
built up over time, rather than assessing a distinctive artefact at a specific time.
Studies that describe a proposal for a tool that has not yet been built, such as RRAS (Trivedi,
Kar & Patterson-McNeil, 2003), or that describe a prototype such as PeerPigeon (Millard,
Sinclair & Newman, 2008), which has not been used in a classroom at the time of publication,
are excluded from this review.
In summary, studies that describe a software tool designed for peer review in an educational
setting and used in at least one course are included in this review. The following software tools
are considered outside the scope of this review:
• tools used for the peer review of an individual contribution to a team;
• standard technologies designed for another purpose and used for peer review (e.g. word processor change tracking, email, forums and wikis);
• tools designed for peer review that have not been implemented and used in the classroom; and
• conference management tools and other software designed to manage peer review in a professional rather than educational setting.
The reference lists of all primary research reports were searched for other candidate reports.
Table 1. Generic peer assessment tools

Name       | Year | Rubric Design | Rubric Criteria | Discuss           | Backward Feedback | Flexible Workflow | Evaluation
PeerGrader | 2000 | ?             | b,d,n,t         | shared page       | student           | no                | student survey
Web-SPA    | 2001 | flexible      | d,n,t           | public comments   | none              | fixed             | validity, performance improvement
OPAS       | 2004 | flexible      | b,d,n,t         | debrief           | none              | script            | student survey
CeLS       | 2005 | flexible      | b,d,n,t         | peers, instructor | ?                 | script            | validity
PRAISE     | 2005 | flexible      | b,t             | none              | none              | fixed             | student survey, usage statistics
Aropä      | 2007 | flexible      | b,d,n,t         | none              | student           | limited           | student survey, staff interview
3 Results
The results of the review are organised into three subsections based on the kind of tools that
were identified during the review process. The first subsection summarizes generic tools that
have been designed to be flexible and support peer assessment in a variety of different disciplines
and contexts. The second subsection summarizes tools that have been designed to support peer
assessment in a specific domain, such as the review of a particular kind of artefact (e.g. written
reports or computer programs). The final subsection summarizes tools that have been purpose-built
for a specific course, or which require manual modification to the software to adapt it for
use in other contexts.
Each subsection contains a table that summarizes the relevant tools. It lists the name of the
tool and the year of the first published report about it. Rubric designs are described as “flexible”
if the administrator has the ability to modify the rubric for a given assessment, and “fixed” if
the rubric cannot be modified. The rubric criteria are coded as “b” if the tool supports boolean
criteria (e.g. check boxes), “d” if the tool supports discrete choices (such as a drop-down list or
a forced choice between a finite number of specified criteria), “n” if the tool supports numeric
scales (e.g. rating a solution on a 1–10 scale), and “t” if the tool supports open-ended textual
comments (e.g. suggestions to improve the solution). If the quality of the reviews is assessed,
then the source of the feedback is noted (i.e. either a student or an instructor evaluates the
quality of the reviews). The opportunity for dialogue to occur between the reviewers and the
authors is coded. The way in which the tool allows workflow to be specified by the instructor is
listed. Finally, a summary of the kinds of evaluation performed with the tool is included.
3.1 Generic systems
A number of the systems reported in the literature are designed to be highly configurable and
support peer review activities in a wide range of disciplines and contexts. Although some systems
have only been used in a limited context at the time of publication, the design of those systems
indicates that they could be used in a variety of disciplines and contexts. This section describes
these very flexible systems.
Table 1 summarizes the generic peer assessment tools.
3.1.1 PeerGrader
The PeerGrader (PG) system reported by Gehringer (2000) allows students to submit an
arbitrary number of web pages for review, allowing students to include multimedia resources.
Reviewers and authors are able to communicate anonymously via a shared web page. After the
initial feedback phase, authors are given an opportunity to revise their work. At the end of the
revision period, the reviewers are required to allocate a grade. When the reviews are completed,
the students are required to grade the reviews on the basis of how helpful and careful the review
was.
An initial evaluation of PeerGrader in a standard data structures and algorithms course exposed
some problems with review allocations. Since assignments were allocated to reviewers on
the basis of student enrollments, students who didn’t submit assignments or reviews (due to
dropping the course, or simply choosing not to participate) caused other students to receive
too few assignments to review, or to receive too few reviews on the assignments they submitted.
Since reviewing can only begin after an assignment is submitted, assignments that were
submitted late left little time for the reviewers to complete their reviews. Gehringer notes that
dynamically allocating assignments to reviewers may go some way towards alleviating these
problems.
3.1.2 Web-SPA
Web-SPA (Sung, Chang, Chiou & Hou, 2005) is designed to guide students through self and
peer assessment activities. Instructors have some flexibility to configure the type of activity
by configuring parameters such as setting a group or individual assignment, and defining the
method of scoring used by the rubric (discrete scale, percentage or no scoring). An instructor
can define criteria which are scored according to the method chosen in the initial configuration.
The Web-SPA system uses a fixed workflow to progressively engage students in the peer
assessment activity. Initially, students assess themselves. Having completed an evaluation, they
compare their own evaluation with others in their group. The groups select the best and worst
examples. The system will randomly present each individual with exemplars of the best and
worst cases chosen by other groups to review. Once the reviews have been conducted, the system
presents the best and worst examples from the entire class. The act of re-reviewing exemplars
is designed to help students identify what is good and bad in a given assignment.
The authors conducted a study with 76 high school students in a Computer and Information
Science course. The study found considerable consistency between instructor and peer marks. It
also found that the quality of work improved after the peer review activities.
3.1.3 Online Peer Assessment System — OPAS
The OPAS system (Trahasch, 2004) has been designed to support a wide range of peer assessment
activities with flexible submission and marking criteria. Collaboration scripts are used to
formalise the structure and workflow of the peer assessment process. An artefact submitted for
review can be a single document or a zip file. Submissions can come from individual authors,
groups, or the instructor. Reviews can be assigned randomly, manually, or using a combination
of random and manual. The system supports the allocation of reviews within groups. The review
rubrics are flexible and contain criteria that can be assessed using radio buttons, list boxes,
numeric scales or with open-ended feedback. Multiple review cycles are supported. An overview
of the rankings and criteria is displayed to students at the completion of the review and the best
example of each is displayed. A forum supports discussion after the completion of the reviews.
The system was evaluated with a class of 76 students enrolled in an Algorithms and Data
Structures course in Computer Science. A student satisfaction survey was completed in which
students were generally positive.
3.1.4 Collaborative e-Learning Structures — CeLS
CeLS (Ronen, Kohen-Vacs & Raz-Fogel, 2006) is a system designed to support collaborative
learning activities, including peer review with flexible work processes using collaboration scripts.
An instructor can create new activities or use the structure of an existing activity. The assessment
activities can include all the standard elements of a web form, but may additionally include
activities that involve ranking or sorting a set of artefacts.
A prototype of the CeLS system was piloted in 2003–2004 in Israel by 9 universities, 5 schools
and 4 in-service teacher courses. In total, 1600 students used CeLS in 48 different courses,
although the nature of the collaborative activities was not reported.
Kali and Ronen (2005) report on the use of CeLS for peer review in three successive semesters
of an undergraduate Educational Philosophy course. Students were asked to use the system
to evaluate a group presentation on a scale of 1–7 and write feedback in text fields for three
grading criteria. After an initial evaluation of the system, a fourth criterion was introduced to
allow students to write their own opinion, which was not considered to be a grading criterion.
This was intended to explicitly distinguish between objective and subjective viewpoints. A third
design iteration introduced the idea of evaluating students as reviewers. Instead of assigning
grades according to the results of the peer review, the reviews themselves were evaluated by an
instructor and 15% of the students’ grades were calculated based on the quality of the reviews.
3.1.5 PRAISE
PRAISE (de Raadt, Toleman & Watson, 2005) supports the peer review of documents according
to a rubric defined by an instructor. The rubric consists of objective binary criteria,
and a holistic open-ended comment. The system waits until a specified number of reviews have
been received (e.g. 4 or 5), and thereafter immediately allocates an assignment to review when
a student submits. Assignment reviews can be flagged for moderation by the author if they feel
that the review is unfair.
PRAISE has been used in at least 5 different courses, across the subjects of Computing,
Accounting and Nursing. Student surveys, usage statistics, time management and moderation
required have all been analysed. Student attitudes and practices of novice programmers were
found to differ from those of non-programmers (de Raadt, Lai & Watson, 2007).
3.1.6 Aropä
Aropä (Hamer, Kell & Spence, 2007) is a generic web-based system that supports the administration
and management of peer assessment activities in a variety of contexts. Authors submit
files directly to the Aropä system. Reviewers download the files for off-line viewing. Reviews are
conducted online by filling in a web form, which is customized by an instructor for the review
activity. After students receive their reviews, they may be required to provide feedback on the
quality of the reviews (according to a rubric defined by the instructor).
The allocation of authors to reviewers can be automatic or manual. If automatic, the instructor
can define a subset of authors, a subset of reviewers and the number of reviews to allocate to each
reviewer. This system of allocation can accommodate a wide range of peer assessment activities,
including intra- or inter-group reviews.
The authors report that Aropä has been used in over 20 disciplines with a diverse range of
classes, ranging in size from 12 to 850. It has been used for formative feedback on drafts, critical
reflection after an assignment and for summative assessment. Each of these varieties of peer
assessment differs in the timing, style of the rubric and degree of compulsion and awarding of
marks.
3.2 Domain-specific systems
Many of the systems are designed to support peer review activities in a specific domain, such as
reading and writing essays, or reviewing Java programming code. Systems that are designed for
use in a specific domain are described in this section, and summarized in table 2.
3.2.1 Calibrated Peer Review™ — CPR
CPR (Chapman & Fiore, 2000) has been designed to help students develop writing skills
through peer assessment. Instructors use CPR to create assignments that include specifications,
guiding questions to focus the development of a good solution, and examples of solutions with
corresponding reviews. Students write and submit short essays on the specified topic. The review
process requires students to engage in a training phase where they must evaluate the quality of
three sample essays. Their reviews are compared to the samples provided by the instructor, and
feedback is given to the students about their review performance.
Table 2. Domain-specific peer assessment tools

Name            | Year | Domain   | Rubric Design | Rubric Criteria | Discuss                 | Backward Feedback | Flexible Workflow | Evaluation
CPR             | 1998 | essays   | flexible      | b,n             | none                    | auto              | none              | validity, student survey, writing performance
C.A.P.          | 2000 | essays   | fixed         | n,d,t           | private author/reviewer | auto              | none              | student surveys, higher-order skills, comment frequency, use of review features, compare with self-assessment
Praktomat       | 2000 | programs | fixed         | d,t             | none                    | none              | none              | student survey, usage correlation
Sitthiworachart | 2003 | programs | fixed         | d,n,t           | reviewers               | student           | none              | student survey, validity
SWoRD           | 2007 | essays   | fixed         | n,t             | none                    | student           | limited           | validity
PeerWise        | 2008 | MCQ      | fixed         | n,t             | public feedback         | student           | none              | usage, effect on exam performance, quality of questions, validity
peerScholar     | 2008 | essays   | fixed         | n,t             | none                    | student           | none              | validity
Students are not permitted to participate in real reviews until they can perform adequately on
the samples. Although widely used, few evaluation studies have been published. A recent report
suggests that the use of CPR did not improve writing skills or scientific understanding (Walvoord,
Hoefnagels, Gaffin, Chumchal & Long, 2008).
3.2.2 C.A.P.
The C.A.P. system (Davies, 2000) (originally, Computerized Assessment including Plagiarism,
and later Computerized Assessment by Peers) is designed for peer assessment of written
documents such as research reports and essays. It has evolved substantially from its initial implementation,
and continues to be actively studied and improved.
C.A.P. includes a predefined list of comments. Each student can configure the list by adding
their own comments. In addition, comments can be assigned a rating to specify how important
they are to the reviewer. The review process requires students to summatively assess essays
by allocating a numeric value for four fixed criteria (such as “Readability”), and to provide
formative feedback by choosing comments from the configurable list. Students also provide a
holistic comment in an open-text area.
After an initial review period, students are given an opportunity to see the comments that other
reviewers selected, and may choose to modify their own review as a result (Davies, 2008). Once
the review stage is fully complete, the marks are used to calculate a compensated average peer
mark from the ratings submitted by the reviewers. The choice of comments and the ratings are
evaluated and used to generate an automated mark for the quality of reviewing (Davies, 2004).
Students are given the opportunity to anonymously discuss their marks with the reviewer. The
reviewer may choose to modify the marks on the basis of the discussion (Davies, 2003).
The C.A.P. system has been used in a number of studies of peer assessment, particularly
around the assessment of the review quality. Studies show that marks for reviewing are positively
correlated with both essay marks and marks in a multiple choice test (Davies, 2004). The upper
two quartiles of students are more critical with comments than the lower two quartiles (Davies,
2006).
3.2.3 Praktomat
The Praktomat system (Zeller, 2000) is designed to provide feedback on programming code
to students. Authors submit a program to Praktomat which automatically runs regression tests
to evaluate the correctness of the code. Authors who have submitted code to the system can
request a program to anonymously review. The reviews use a fixed rubric that focuses on a number
of specific style considerations that the instructor uses for final grading purposes. The code is
displayed in a text area that can be edited to allow annotations to be entered directly into the
code. The review feedback is purely formative and plays no part in the final grade, nor is it
required. Students can review as many programs as they wish.
Students reported that they found the system useful, both in terms of automated testing,
reviewing programs and having programs reviewed by others. The grades obtained by students
for program readability increased both with the number of sent reviews and the number of
received reviews, although no formal statistical analysis was performed.
3.2.4 Sitthiworachart
The system developed by Sitthiworachart and Joy (2004) was based on the OASYS system.
It was designed for the peer review of programming assignments. A fixed rubric is used to
assess program style and correctness using Likert-scale ratings. An asynchronous communication
tool is provided to allow reviewers to anonymously discuss the assignments they are reviewing
throughout the process. An evaluation study showed that the peer ratings correlate significantly
with instructor ratings, and that students are better able to make accurate objective judgements
than subjective ones.
3.2.5 SWoRD — Scaffolded Writing and Rewriting in the Discipline
SWoRD (Cho & Schunn, 2007) is a tool designed specifically to support writing practice. At
the time of publication, it had been used in 20 courses in four different universities between 2002
and 2004.
An instructor using SWoRD defines a pool of topics, from which students select those they
want to write about and those they want to review. SWoRD balances the allocation of topics,
so some students may have a reduced set of choices. Students submit drafts and a self-assessed
estimate of grade.
The review structure is fixed, and uses pseudonyms to ensure that the identity of the authors
remains confidential. Reviewers evaluate the writing according to three dimensions: flow; logic;
and insight. For each dimension, reviewers rate the work on a scale of 1–7 and provide a written
comment about the quality of the writing.
SWoRD assumes that the average grade given by a group of student reviewers is an accurate
assessment. It calculates the accuracy of an individual reviewer using three different metrics:
systematic difference, consistency and spread. These three metrics are calculated for each
of the three dimensions of flow, logic and insight, giving nine measures of accuracy. The nine
measures of accuracy obtained for each reviewer are normalized and combined to calculate
a weighted average grade for a submitted piece of writing. All the drafts are published with
their pseudonyms, ratings and associated comments. Authors revise the drafts and submit final
papers, along with feedback about the usefulness of the reviews they received. The review cycle
is repeated with the revised papers.
An evaluation of SWoRD was conducted with 28 students in a research methods course. A
controlled experiment comparing a single expert reviewer, single peer reviewer and multiple peer
reviewers showed the greatest improvement between draft and final paper occurred when the
author received multiple peer reviews, and the least improvement occurred with a single expert
reviewer.
3.2.6 PeerWise
PeerWise (Denny, Luxton-Reilly & Hamer, 2008a) supports the development of an online
multiple-choice question (MCQ) database by students. The MCQs submitted using PeerWise
become available for other students to use for revision purposes. When students answer a question,
they are required to review the question and enter a holistic rating (0–5) of the quality.
They are also encouraged to write a holistic comment in a text area. The author of a question
has the right to reply to any given comment, although there is no facility for a continuing
discussion.
Table 3. Context-specific peer assessment tools

Name    | Year | Context                                        | Rubric Design | Rubric Criteria | Discuss | Backward Feedback | Flexible Workflow | Evaluation
Peers   | 1995 | Comp. Sci.                                     | flexible      | n               | none    | none              | none              | student survey, validity
NetPeas | 1999 | Comp. Sci.; Science teachers                   | fixed         | n,t             | none    | none              | none              | student survey, rubric comparison, thinking styles
OASYS   | 2001 | Comp. Sci.                                     | fixed         | d,t             | none    | none              | none              | student survey, admin costs
Wolfe   | 2004 | Comp. Sci.; Mathematics; Marketing; Psychology | fixed         | n,t             | none    | none              | none              | usage
PEARS   | 2005 | Comp. Sci.                                     | fixed         | n,t             | none    | none              | none              | rubric comparison
Although the comments are visible to all users, the individual ratings are averaged and
only the aggregate rating is displayed. All interaction between students is anonymous.
Numerous studies have evaluated aspects of PeerWise, including the usage (Denny, Luxton-Reilly
& Hamer, 2008b), effect on exam performance (Denny, Hamer, Luxton-Reilly & Purchase,
2008) and the quality of the questions (Denny, Luxton-Reilly & Simon, 2009). A study of the
validity of the peer assessed ratings (Denny et al., 2009) found that the correlations between
ratings of students and the ratings of instructors who taught the course were good (0.5 and
0.58). The authors conclude that students are reasonably effective at determining the quality of
the multiple choice questions created by their peers.
3.2.7 peerScholar
The peerScholar (Paré & Joordens, 2008) system was designed to improve the writing and
critical thinking skills of students in a large undergraduate Psychology class. In the first phase,
students are required to write two abstracts and two essays. The second phase requires students
to anonymously assess five abstracts and five essays by assigning a numeric grade (1–10) and
writing a positive constructive comment for each piece of work. Finally, in the third phase,
students receive the marks and comments as feedback. An accountability feature allows students
to submit a mark (1–3) for each of the reviews they received.
A study was conducted to compare expert marks with the marks generated through the peer
review process. The authors found that the correlation between expert and peer marks was good,
and that it improved when the accountability feature was applied to students.
3.3 Context-specific systems
Some of the systems reported in the literature have been written for use in a specific course and
would have to be rewritten to accommodate other contexts. Although most of these systems
have the potential to be developed further in the future, at the time of publication they were
bound to the specific context in which they were developed. Table 3 summarizes the tools in
this category.
3.3.1 Peers
The Peers (Ngu & Shepherd, 1995) system was implemented in Ingres, a commercial database
management system. Students were able to anonymously suggest assessment criteria and alter
weightings on existing criteria before the submission of assignments. Assignments were allocated
to students who were able to anonymously review them and provide marks for the criteria
that were cooperatively developed. A short evaluation study found a good correlation between
instructor and student marks. However, the student survey that was conducted found that all the
students preferred to have instructor assessment in addition to the peer evaluation, suggesting
that students did not trust the outcomes of peer assessment.
3.3.2 NetPeas
NetPeas (Lin, Liu & Yuan, 2001), initially known as Web-based Peer Review or WPR (Liu,
Lin, Chiu & Yuan, 2001), requires students to submit documents in HTML format. Initially,
the system only supported a single holistic rating and an open-ended comment, but was later
revised to support numerous specific criteria involving both a rating (1–10 Likert scale) and an
open-ended comment for each criterion. The system supports the modification of assignments
by students which allows drafts to be revised after an initial review period.
Evaluation studies have looked at correlations between review ability and examination scores,
different thinking styles, specific and holistic feedback and student attitude. The authors con-
clude that being a successful author, or a successful reviewer alone may not be sufficient for
success in a peer review environment.
3.3.3 OASYS
OASYS (Bhalerao & Ward, 2001) is designed to support self-assessment and provide timely
formative feedback to students in large classes without increasing academic workload. It is a
hybrid system used to assess students using a combination of multiple choice questions and
free-response questions. The system automatically marks the MCQ questions and uses peer review
to provide summative feedback to students about their answers to the free-response questions.
Although this system has been designed and used in the context of a programming course,
the authors note that it could easily be adapted for more widespread use in other disciplines.
An evaluation which compared the time taken to mark paper tests with the time required to
mark using the OASYS system was performed. Students using the system received feedback
more rapidly with less staff time required than paper-based tests.
3.3.4 Wolfe
Wolfe (2004) developed a system in which students posted their assignments on their own web
site and submitted the URL to the peer review system. Reviewers remained anonymous, but they
knew who they were reviewing. Reviewers were presented with the list of all the assignments
submitted and were expected to submit a score (1–10) and a holistic comment about each
assignment. Students were required to submit a minimum number of reviews, but no maximum
was set. The web site listed the number of reviews that had already been submitted for each
assignment and students were asked to ensure the numbers were roughly even, but the request
was not enforced.
The system was used in Computer Science, Mathematics, Marketing and Psychology courses,
but required manual recoding to adapt it to each new context. Wolfe notes that roughly 70% of
the reviews were superficial. He reports on the use of the system in a small software engineering
course (34 students). Students were required to submit a minimum of 10 reviews, but could
conduct additional reviews if desired. The majority of students received more than the minimum
10 reviews, and the majority of those reviews were submitted by students ranked in the top third
of the class.
3.3.5 PEARS
PEARS (Chalk & Adeboye, 2005) is designed to support the learning of programming skills.
Students submit Java files directly to the system, conduct peer reviews, respond to feedback
and may resubmit reviewed work. In the published study, students used two different rubrics
to review Java code. The first rubric contained sixteen specific binary criteria (yes/no, and not
applicable), while the second rubric used a text area to submit open-ended holistic feedback
about the strengths and weaknesses of the reviewed work and a single overall score out of 10.
The authors report that over two-thirds of the students preferred to write reviews using holistic
feedback, that they preferred receiving holistic feedback, and that the holistic feedback written
by students had a significant positive correlation with the marks allocated by a tutor.
4 Discussion
In this section, the common elements of the systems are discussed and unique approaches are
identified.
4.1 Anonymity
Ballantyne et al. (2002) suggest that students should remain anonymous to alleviate student
concerns over bias and unfair marking. The majority of systems use a double-blind peer review
process, ensuring that students remain anonymous throughout the entire process. Bhalerao and
Ward (2001) report that anonymity is a statutory requirement in their institution. Developers
of peer review software would be well advised to consider their own institutional regulations
regarding the privacy of student grades.
In some cases, student presentations are being assessed (Kali & Ronen, 2005), or students are
working in teams on different projects, in which case students performing a review would be
aware of the identity of the person they were reviewing. In such cases, there is no need to ensure
a double-blind review occurs. Flexible systems such as OPAS and Aropä may be configured to
have different levels of anonymity for a given activity (e.g. double-blind, single blind, pseudonym,
or open reviewing).
Notably, the system developed by Wolfe (2004) ensured the anonymity of the reviews, but the
identity of the authors was known to the reviewers.
4.2 Allocation and distribution
A variety of methods are employed to distribute artefacts produced by an author to a reviewer
(or most commonly, multiple reviewers). The simplest approach is to allocate the reviews
randomly, as sketched below. A spreadsheet specifying the allocation of assignments from author to reviewer is accommodated
by Aropä, PeerGrader and OPAS. Although some systems (such as Aropä) support
the allocation of assignments by groups (to allow inter- or intra-group reviews), many do not.
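As a concrete illustration of random allocation, the following minimal sketch assigns each student a fixed number of other students' submissions to review using a shuffled round-robin, which guarantees that nobody reviews their own work and that every submission receives the same number of reviews. It is written for this review only; the function and parameter names are invented and do not correspond to the allocation code of any particular tool.

```python
import random

def allocate_reviews(students, reviews_per_student=3, seed=None):
    """Randomly allocate peer reviews using a shuffled round-robin.

    Each student reviews `reviews_per_student` other submissions, each
    submission receives the same number of reviews, and nobody is ever
    allocated their own work.
    """
    if reviews_per_student >= len(students):
        raise ValueError("need more students than reviews per student")
    order = list(students)
    random.Random(seed).shuffle(order)
    n = len(order)
    allocation = {}  # reviewer -> list of authors whose work they review
    for i, reviewer in enumerate(order):
        allocation[reviewer] = [order[(i + offset) % n]
                                for offset in range(1, reviews_per_student + 1)]
    return allocation

# Example: 6 students, 2 reviews each.
print(allocate_reviews(["ana", "ben", "cui", "dev", "eve", "fay"], 2, seed=1))
```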
The PRAISE system waits until a minimum number of submissions are received before it begins to
allocate assignments to reviewers. After the threshold has been reached, an author that submits
an assignment is immediately allocated assignments to review. The major benefit of this approach
is a reduction in time between submission and review. However, no analysis of the consequences
of this strategy has yet been conducted. It is possible that better students (who complete the
assignment and submit early) will end up reviewing each other while weaker students who submit
later will be allocated weaker assignments to review. Further investigation may be warranted to
explore the implications of this allocation strategy.
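The sketch below illustrates the general shape of this threshold-then-immediate strategy, assuming submissions are buffered until a minimum pool exists and are then allocated on each subsequent submission, preferring the least-reviewed work. The class, its parameters and the least-reviewed preference are assumptions made for illustration; PRAISE's actual implementation is not published in this form.

```python
import random
from collections import defaultdict

class ThresholdAllocator:
    """Hold submissions until a minimum pool exists, then hand out reviews
    immediately as each subsequent student submits."""

    def __init__(self, threshold=5, reviews_per_submitter=3, seed=None):
        self.threshold = threshold
        self.k = reviews_per_submitter
        self.pool = []                       # authors who have submitted
        self.review_counts = defaultdict(int)
        self.rng = random.Random(seed)

    def submit(self, author):
        """Record a submission and return the authors this student should review."""
        self.pool.append(author)
        if len(self.pool) <= self.threshold:
            # Early submitters are held back; a real system would allocate
            # their reviews once the pool fills up (not modelled here).
            return []
        # Prefer the least-reviewed submissions, excluding the submitter's own.
        candidates = [a for a in self.pool if a != author]
        self.rng.shuffle(candidates)                 # random tie-breaking
        candidates.sort(key=lambda a: self.review_counts[a])
        chosen = candidates[: self.k]
        for a in chosen:
            self.review_counts[a] += 1
        return chosen

allocator = ThresholdAllocator(threshold=3, reviews_per_submitter=2, seed=1)
for student in ["ana", "ben", "cui", "dev", "eve"]:
    print(student, "->", allocator.submit(student))
```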
The use of exemplars can help students to identify what is good or bad in a given assignment.
These exemplars can act as a ‘yard-stick’ by which students can measure their own performance
and that of others. In order to ensure that students see a diversity of assignments, OASYS
uses the marks for an MCQ test in the distribution algorithm to ensure that each reviewer
receives one script from authors in each of the good, intermediate and poor MCQ categories.
Web-SPA uses multiple review cycles to ensure that students are exposed to examples of the
best and worst assignments. SWoRD makes all the drafts, reviews and ratings publicly available
for students to peruse, providing students with the opportunity to compare the best and worst
submissions. At the completion of the review phase, OPAS displays a summary of the rankings for
each criterion assessed and the top-ranked assignment for each criterion is available for students to
view. Although Aropä does not systematically provide students with the best and worst reviews,
during the allocation phase, it has been seeded with a sample solution provided by the instructor
to ensure all students see a good solution.
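A banded distribution of the kind OASYS uses could look roughly like the sketch below, which splits authors into three bands by MCQ mark and gives each reviewer one script from every band. The equal-tercile banding, the fallback when a band offers no eligible author, and all names are assumptions made for this illustration rather than details of OASYS itself.

```python
import random

def banded_allocation(mcq_scores, seed=None):
    """Give every reviewer one script from a good, an intermediate and a poor
    author, where bands are formed from MCQ marks (highest first)."""
    rng = random.Random(seed)
    authors = sorted(mcq_scores, key=mcq_scores.get, reverse=True)
    n = len(authors)
    bands = [authors[: n // 3], authors[n // 3 : 2 * n // 3], authors[2 * n // 3 :]]
    allocation = {}
    for reviewer in authors:
        picks = []
        for band in bands:
            eligible = [a for a in band if a != reviewer and a not in picks]
            if not eligible:  # tiny classes: fall back to any other author
                eligible = [a for a in authors if a != reviewer and a not in picks]
            picks.append(rng.choice(eligible))
        allocation[reviewer] = picks
    return allocation

scores = {"ana": 9, "ben": 8, "cui": 6, "dev": 5, "eve": 3, "fay": 2}
print(banded_allocation(scores, seed=1))
```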
4.2.1 Unrestricted reviewing
The PeerWise system has no system of allocation. Instead, students can choose to answer
as many MCQ questions as they wish. Each time a question is answered a review is required.
Students tend to choose the questions with the highest rating, therefore the better questions are
reviewed more frequently. Poor questions are infrequently reviewed.
Wolfe (2004) allowed students to choose who they reviewed (and the identities of the authors
were known). The number of reviews that each artefact had received was displayed and reviewers
were asked to ensure that they were approximately even, but this requirement was not enforced
by the system.
Since reviewing is optional in the Praktomat system, the process of review allocation uses a
non-random strategy to encourage students to participate and contribute high-quality reviews.
Praktomat uses a set of rules to determine which artefacts are reviewed next. The artefact that
has had the minimum number of reviews is selected. Programs whose authors have composed
a greater number of reviews are selected by preference. Praktomat tries to allocate reviews
mutually, so a pair of authors review each other’s programs.
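These selection rules can be read as a cascade of tie-breakers. The sketch below expresses that cascade, assuming each author has a single program; the data structures and function name are invented for this illustration and are not Praktomat's actual implementation.

```python
def choose_program_to_review(reviewer, authors, reviews_received,
                             reviews_written, currently_reviewing):
    """Pick the next author whose program `reviewer` should review.

    reviews_received[a]    - reviews author a's program has received so far
    reviews_written[a]     - reviews author a has composed so far
    currently_reviewing[a] - set of authors whose programs a is reviewing
    """
    candidates = [a for a in authors if a != reviewer]
    # 1. Prefer programs that have received the fewest reviews.
    fewest = min(reviews_received[a] for a in candidates)
    candidates = [a for a in candidates if reviews_received[a] == fewest]
    # 2. Prefer authors who have themselves composed more reviews.
    most = max(reviews_written[a] for a in candidates)
    candidates = [a for a in candidates if reviews_written[a] == most]
    # 3. Prefer a mutual ("tit for tat") pairing: an author who is already
    #    reviewing this reviewer's own program.
    for a in candidates:
        if reviewer in currently_reviewing.get(a, set()):
            return a
    return candidates[0]

authors = ["ana", "ben", "cui"]
print(choose_program_to_review(
    "ana", authors,
    reviews_received={"ana": 1, "ben": 0, "cui": 0},
    reviews_written={"ana": 2, "ben": 1, "cui": 3},
    currently_reviewing={"cui": {"ana"}}))   # -> "cui"
```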
4.3 Marking criteria
A variety of different approaches to designing marking criteria are apparent. Students are rarely
invited to participate in the design of the marking criteria, although numerous authors report
that criteria are discussed with students prior to the review process. Some systems use very
specific criteria while others use a more holistic general rating.
Systems that are designed to be used in a wide variety of conditions (i.e. those classified
as “generic” systems) support instructor-designed marking forms. These forms are typically
constructed from the components that make up standard web forms, and support check boxes,
discrete lists, numeric scales and open responses in text areas. CeLS has a very flexible design
that can accommodate a range of assessment activities including selection, assigning a numeric
value and free-text comments, but also more complex assessments such as ranking and sorting.
Systems that are designed to operate in a more restricted domain frequently use a fixed
structure for the assessment process and may provide few options for the configuration of the
marking schema.
Falchikov and Goldfinch (2000) conducted a meta-analysis that investigated the validity of
peer assigned marks by comparing peer marks with teacher marks. They recommend that it
is better to use an overall global mark rather than expecting students to rate many individual
dimensions. However, Miller (2003) found that more specific, detailed rubrics provided better
differentiation of performance at the cost of qualitative feedback. Rubrics that provided more
opportunities to comment elicited a greater number of qualitative responses and a larger number
of comments.
An evaluation study comparing holistic with specific feedback using the PEARS system found
that the majority of students preferred both writing and receiving the holistic feedback (Chalk &
Adeboye, 2005). They also found that there was no correlation between the students’ scores and
the tutors’ scores when using the rubric with specific criteria, but a significant positive correlation
was found between students and tutors when the holistic rubric was used.
PRAISE uses objective binary criteria to ensure consistency between reviewers. A holistic
comment is also supported.
Kali and Ronen (2005) report that an explicit distinction between objective and subjective criteria
improves the quality of the review. Students like having the option to express their personal,
subjective opinion (which does not contribute to the grading process), and distinguishing their
subjective view from the objective grading criteria improves the correlation between student and
instructor marks.
CAP requires students to use numeric scales to summatively assess an essay, but they are
also expected to provide formative feedback by selecting comments from a defined list. The
importance of each comment in the list is weighted by the reviewer, allowing the CAP system to
automatically compare the comments applied by different reviewers in an attempt to estimate
the effectiveness of a given reviewer.
Open-ended feedback requires students to write prose that states their opinion in a critical, yet
constructive way. It is certainly possible that the formative feedback provided by this approach
is more useful to students than that obtained through check boxes or a simple numeric scale.
However, further research is required to identify the conditions under which specific feedback
is more valuable than holistic feedback for both the reviewers and the authors who receive the
review.
4.4 Calculating the peer mark
Many of the systems use a simple mean value, although a variety of other methods of calculating
the peer mark are employed.
peerScholar has a fixed workflow design in which each artefact is reviewed by five different
students. An average of the middle three values is used to calculate the final mark. This reduces
the impact of a single rogue reviewer on the calculation.
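A trimmed mean of this kind is easy to state precisely; the sketch below is a minimal illustration, and the function name is invented for this example.

```python
def middle_three_average(marks):
    """Average the middle three of five peer marks: the highest and lowest
    marks are discarded so a single rogue reviewer cannot drag the final
    mark up or down."""
    if len(marks) != 5:
        raise ValueError("expected exactly five peer marks")
    return sum(sorted(marks)[1:4]) / 3

# The rogue mark of 2 and the top mark of 9 are both discarded -> 7.67
print(middle_three_average([7, 8, 8, 9, 2]))
```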
Aropä uses an iterative weighting algorithm (Hamer, Ma & Kwong, 2005) to calculate the
grade. This algorithm is designed to eliminate the effects of rogue reviewers. The more that a
reviewer deviates from the weighted average, the less their review contributes to the average in
the next iteration. When the weighted averages have settled, the algorithm halts and the values
are assigned as grades.
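The fixed-point structure of such an algorithm is sketched below: alternately compute weighted consensus grades and re-weight reviewers by how far they sit from that consensus. The specific weighting function (inverse mean squared deviation) and the fixed iteration count are assumptions made for this illustration; the published algorithm of Hamer, Ma and Kwong (2005) differs in its details.

```python
def iterative_weighting(reviews, iterations=20):
    """Simplified sketch of an iterative reviewer-weighting scheme.

    `reviews` maps (reviewer, artefact) -> mark.
    Returns the consensus grades and the final reviewer weights.
    """
    reviewers = {r for r, _ in reviews}
    artefacts = {a for _, a in reviews}
    weights = {r: 1.0 for r in reviewers}

    for _ in range(iterations):
        # Weighted consensus grade for each artefact.
        grades = {}
        for a in artefacts:
            pairs = [(r, m) for (r, aa), m in reviews.items() if aa == a]
            total_w = sum(weights[r] for r, _ in pairs)
            grades[a] = sum(weights[r] * m for r, m in pairs) / total_w
        # Down-weight reviewers who deviate from the consensus.
        for r in reviewers:
            devs = [(m - grades[a]) ** 2
                    for (rr, a), m in reviews.items() if rr == r]
            weights[r] = 1.0 / (1.0 + sum(devs) / len(devs))
    return grades, weights

marks = {("r1", "essayA"): 8, ("r2", "essayA"): 7, ("r3", "essayA"): 2,
         ("r1", "essayB"): 6, ("r2", "essayB"): 6, ("r3", "essayB"): 9}
print(iterative_weighting(marks))
```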
CPR requires students to go through a training stage where the grades assigned by students
are compared with the expected grades. Students receive feedback on their grading performance
and must be able to accurately apply the criteria before they are permitted to begin reviewing
work submitted by their peers. The degree to which a reviewer agrees with the “ideal” review
set by the instructor determines a “reviewer competency index” which is later used to weight
the reviews when a weighted average is calculated.
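In outline, such a calculation is simply a competency-weighted mean. The sketch below assumes a competency index in the range 0–1, which is an assumption made for illustration rather than CPR's actual scale.

```python
def competency_weighted_mark(reviews):
    """Weighted average of peer marks, where each mark is weighted by the
    reviewer's competency index earned during the calibration phase.
    `reviews` is a list of (mark, competency_index) pairs."""
    total_weight = sum(c for _, c in reviews)
    return sum(m * c for m, c in reviews) / total_weight

# A mark from a reviewer who scored poorly on calibration counts for less.
print(competency_weighted_mark([(9, 0.9), (8, 0.8), (3, 0.2)]))
```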
SWoRD calculates a weighted grade based on three accuracy measures: systematic difference,
consistency and spread. The system assumes that the average of all the reviewers of a given
artefact is an accurate measure. The “systematic” metric determines the degree to which a
given reviewer is overly generous or overly harsh (a variation of a t-test between the reviewer and
the average marks across all the reviews). The “consistency” metric determines the correlation
between the reviewer marks and the average marks (i.e. can the reviewer distinguish between
good and poor papers). Finally, the “spread” metric determines the degree to which the reviewer
allocates marks too narrowly or too widely. These metrics are combined to form an accuracy
measure which is factored into the weighting for reviewer marks.
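A rough rendering of these three ideas for a single reviewer on a single dimension is sketched below. The particular formulas (absolute mean difference, Pearson correlation, difference in standard deviations) and the way they are combined into one weight are simplifications assumed for this illustration; SWoRD's published metrics are more involved.

```python
from statistics import mean, pstdev

def reviewer_accuracy(reviewer_marks, consensus_marks):
    """Combine systematic difference, consistency and spread into a single
    accuracy weight for one reviewer on one dimension (e.g. 'flow').
    Both arguments list marks for the same artefacts; the consensus marks
    are the class averages assumed to be accurate."""
    n = len(reviewer_marks)
    # Systematic difference: is the reviewer consistently harsh or generous?
    systematic = abs(mean(reviewer_marks) - mean(consensus_marks))
    # Consistency: does the reviewer rank good and poor work the same way
    # as the consensus does? (Pearson correlation.)
    rx, ry = mean(reviewer_marks), mean(consensus_marks)
    cov = sum((a - rx) * (b - ry)
              for a, b in zip(reviewer_marks, consensus_marks)) / n
    sx, sy = pstdev(reviewer_marks), pstdev(consensus_marks)
    consistency = cov / (sx * sy) if sx and sy else 0.0
    # Spread: does the reviewer use a similar range of marks to the consensus?
    spread = abs(sx - sy)
    # Higher consistency and smaller systematic/spread mismatches give the
    # reviewer more influence in the weighted grade.
    return max(consistency, 0.0) / (1.0 + systematic + spread)

print(reviewer_accuracy([6, 7, 9, 4], [5.5, 6.8, 8.9, 4.2]))
```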
CAP initially used a median value (Davies, 2000) to eliminate the effect of “off the wall”
reviewers, but was subsequently modified to calculate a compensated peer mark (Davies, 2004).
The compensated peer mark is a weighted mark that takes into account whether a given reviewer
typically over-estimates or under-estimates the grade (compared to the average given
by peers). Although the overall effects of the compensation are minor, students feel more comfortable
knowing that they will not be disadvantaged by a “tough” marker.
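The essence of such a compensation is to subtract each reviewer's habitual bias before averaging. The sketch below assumes each reviewer's bias has already been estimated as their mean deviation from the peer average on earlier work; that estimate and the names used are illustrative assumptions rather than the formula described by Davies (2004).

```python
from statistics import mean

def compensated_peer_mark(marks_for_essay, reviewer_bias):
    """Correct each reviewer's mark by that reviewer's typical bias (how far
    above or below the peer average they usually mark), then average."""
    corrected = [mark - reviewer_bias[reviewer]
                 for reviewer, mark in marks_for_essay.items()]
    return mean(corrected)

# A habitually tough marker (bias -1.0) has their mark adjusted upwards.
biases = {"r1": +0.5, "r2": 0.0, "r3": -1.0}
print(compensated_peer_mark({"r1": 7, "r2": 6, "r3": 5}, biases))
```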
4.5 Quality of reviews
The quality of the reviews created by students is of significant concern to both instructors and
students. A number of systems offer the opportunity to provide feedback to the reviewer about
the quality of their reviews. However, there are few studies that have investigated the quality of
the reviews, the value of the feedback to the students, or how the rubric format or quality
assurance methods have affected the quality of the feedback.
4.5.1 Validity of reviews
One aspect of quality is the ability of peers to mark fairly and consistently. The metric most
commonly used to determine if students can mark effectively is the correlation with the marks
assigned by an instructor.
Falchikov and Goldfinch (2000) conducted a meta-analysis comparing peer marks with teacher
assigned marks. They found a mean correlation of 0.69 between teacher and peer marks over
all the studies they considered. Paré and Joordens (2008) found a small but significant difference
between expert and peer marks in psychology courses using the peerScholar system. The
correlation between the expert and peer marks was low, but increased after they introduced the
facility for students to grade the reviews they received. They conclude that the averaged peer
marks are similar to the averaged expert marks in terms of level and ranking of assignments.
Sitthiworachart and Joy (2008) conducted a study that compared tutors’ and peers’ marks for
a number of detailed marking criteria for assignments in a first-year programming course. They
found high correlations between tutors’ and students’ marks for objective criteria, but lower
correlations between tutor and student marks for subjective criteria.
Wolfe (2004) reports that sufficiently large numbers of reviews result in reliable averages,
although this was an anecdotal observation by the author rather than the result of a formal
study. It is worth noting that the system used by Wolfe resulted in a larger number of reviews
being contributed by the better students than the poorer students.
4.5.2 Quality of formative feedback
There are few studies that have investigated the nature of formative feedback provided by
students in holistic comments, and compared the value of those comments with those provided
by instructors. A study using SWoRD revealed that formative feedback from multiple peer
reviews was more useful for improving a draft than feedback from a single expert.
4.5.3 Backwards feedback
The term backwards feedback is used to describe the feedback that an author provides to
a reviewer about the quality of the review. This feedback can be formative, in the form of a
comment, or can be summative in the form of a numeric value.
Ballantyne et al. (2002) suggest that teachers award marks for the feedback provided by peers
in order to boost student engagement and commitment to the task. The system created by Wolfe
(2004) did not contain any assessment of review quality, and he estimates that approximately
70% of the reviews were superficial. Many of the more recently developed tools require students
to assess the quality of the reviews they have received, either summatively or with formative
feedback.
The “tit for tat” approach used in Praktomat allocates reviews on a paired basis where possible,
so a reviewer knows that they are reviewing the work of the person that will be reviewing them
in turn. This encourages students to produce high quality reviews in the hope that the recipient
will be doing the same. Although this is a feasible strategy for formative assessment, it is not
appropriate for summative assessment where it would be likely to encourage grade inflation.
Kali and Ronen (2005) decided not to grade assignments on the basis of the peer assessments,
but instead to grade the quality of the reviews. They report that grading students on the quality
of the reviews rather than the peer assessed marks for their assignments reduced tensions and
produced higher correlations between the marks assigned by students and instructors. This
grading was performed by instructors.
PEARS allows authors to respond to their reviewers, giving feedback on the usefulness of the
reviews they received. However, this feedback is purely formative and is not used in assessment
criteria.
SWoRD requires authors to provide feedback to the reviewers about the quality and usefulness
of the review. This feedback is purely formative and plays no part in the final grade.
Aropä can be configured to require students to formally review a number of reviews using
an instructor-defined rubric. The instructor can specify that students assess the reviews they
have received, or the reviews can be considered to be artefacts in their own right and allocated
anonymously and randomly to be reviewed by a student that has no vested interest.
4.6 Dialogue
The systems considered here vary substantially when it comes to supporting discussion within the
peer assessment framework. PeerGrader allows authors and reviewers to access and contribute to
a shared web page where discussions can occur. The instructor can configure the system to make
the comments posted to the shared web page visible to the other students allocated to review
the same author’s work. This allows either private discussions between authors and reviewers,
or a group discussion between the author and all the reviewers of their work. Web-SPA uses a
similar approach in which students can post short messages to a public page.
OPAS includes a discussion forum which is available for students to post to after the completion
of the review. This encourages reflection on the criteria and quality of the work produced. The
highest degree of discussion is provided by the Sitthiworachart system, which provides reviewers
with the capacity to communicate with both the author and all the other reviewers assigned to
review the given assignment. A chat system allows them to communicate in real time, or leave
messages for each other if they are not available.
4.7 Workflow
SWoRD is designed for students to progressively improve essay drafts using formative peer
feedback. It uses a fixed process for a given cycle, but the instructor can define the number of
review cycles that occur before the final submission.
PeerGrader allows the author to revise their work at any time through the reviewing process.
When the author submits a revised version, an email message is sent to all the reviewers. The
older version is archived and a new discussion page is created for the revised version. The collaboration
scripts used by OPAS support multiple review cycles where students can progressively
improve drafts on the basis of feedback.
Miao and Koper (2007) show how collaboration scripts can be used to describe the structure
of interactions that occur in the process of peer review. Using a script to describe the peer
assessment process, a tool can automatically generate documents adhering to the IMS Learning
Design specification (IMS LD) and IMS Question and Test Interoperability specification (IMS
QTI) that can be viewed using an appropriate player. However, the authoring tools used to
create the scripts are complex and require a significant degree of technical expertise.
CeLS is extremely flexible and allows instructors to create a wide range of peer assessment
activities with varying workflow. The authors report that the flexibility resulted in a large number of
variants of basic structures which could be confusing. The authors suggest that further work is
required to categorize the structures to ensure that the variety of options is not overwhelming.
There appears to be a significant trade-off between flexibility and ease-of-use. Systems that
have more flexible workflow have used collaboration scripts or domain specific languages to
express the complex processes, but this flexibility makes them too difficult to use for a non-technical
person.
5 Conclusion
This review makes a significant contribution by summarizing the available tools that support
online peer assessment. These tools have been classified as generic, domain specific and context
specific. The major features have been compared and discussed. Although a variety of different
tools have been reported in the literature, few of them have been thoroughly evaluated. There
is a clear need for more usability studies and further evaluation studies that investigate the
differences between the approaches taken.
Aropä, SWoRD and C.A.P. have the most sophisticated processes for identifying “good” re-
viewers and weighting student-assigned grades accordingly. A variety of different algorithms
are applied to weight the peer marks in an attempt to establish a more accurate measure of
the “true” quality of an assignment. Comparative studies that investigate the benefits of these
different approaches are required.
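As a concrete, deliberately simplified example of grade calibration, the sketch below weights each peer mark by its closeness to the median of the marks awarded to the same submission. This is a hypothetical scheme for illustration only and is not the algorithm used by Aropä, SWoRD or C.A.P.

from statistics import median

def weighted_peer_mark(marks, epsilon=0.5):
    """Combine peer marks for one submission, down-weighting outliers.

    marks: list of (reviewer_id, mark) pairs; epsilon avoids division by zero.
    """
    mid = median(mark for _, mark in marks)
    # Reviewers whose mark is close to the median receive a larger weight.
    weights = {reviewer: 1.0 / (epsilon + abs(mark - mid))
               for reviewer, mark in marks}
    total = sum(weights.values())
    return sum(weights[reviewer] * mark for reviewer, mark in marks) / total

# Example: the third reviewer's outlying mark has little effect on the result.
print(weighted_peer_mark([("r1", 78), ("r2", 74), ("r3", 40)]))  # approx. 74

The reference point (here the median) and the weighting function are exactly the kind of design decisions on which the reviewed systems differ.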
Since the peer assessment process uses the output from one student as the input to another
student, online tools need to provide a mechanism to deal with late or missing submissions. Many
of the systems support both manual and automatic allocation of reviews, but PRAISE is the only
system that dynamically allocates reviews during the submission process. Some systems, such
as PeerWise and that of Wolfe, do not limit the number of reviews that a student can perform.
In such systems, students with higher grades tend to contribute more than weaker students,
resulting in a greater amount of higher quality feedback being produced. This approach looks
promising, and future tools should support unlimited reviewing where possible, although further
research is required to investigate this approach more carefully.
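The idea of allocating reviews dynamically as work arrives can be sketched as follows. This is an assumed, minimal strategy (assign each new submitter to the existing submissions with the fewest reviewers) and is not a description of PRAISE’s actual allocation algorithm.

import heapq

class DynamicAllocator:
    """Assign reviews at submission time; a minimal illustrative strategy."""

    def __init__(self, reviews_per_student=3):
        self.k = reviews_per_student
        self.pool = []  # heap of (review_count, submission_id)

    def submit(self, submission_id):
        """Record a new submission and return the submissions its author reviews."""
        assigned, updated = [], []
        while self.pool and len(assigned) < self.k:
            count, sub = heapq.heappop(self.pool)
            assigned.append(sub)
            updated.append((count + 1, sub))
        for item in updated:
            heapq.heappush(self.pool, item)            # back in, with new counts
        heapq.heappush(self.pool, (0, submission_id))  # new work joins the pool
        return assigned

allocator = DynamicAllocator(reviews_per_student=2)
for sub in ["s1", "s2", "s3", "s4"]:
    print(sub, "author reviews:", allocator.submit(sub))

In this sketch the earliest submitters have little or nothing to review at the moment they submit, which illustrates why late and missing submissions complicate any allocation scheme.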
All of the systems considered in this study are web-based and use standard web forms for the
entry of the review. Only one of the systems (Praktomat) supports direct annotation on the
product being reviewed, something that has always been possible in paper-based reviews. None
of the tools currently support the use of digital ink to provide annotations during the peer
review process.
Although some tools supported instructor-designed marking criteria, others specified a fixed
schedule. The marking criteria varied among binary criteria, holistic overall ratings and
open-ended text. There is no clear indication of the impact of each approach. Future work is
required to evaluate the effectiveness of different forms of rubrics for both the reviewer and
the recipient of the review. Although numerous studies have considered the correlation between
instructor-assigned grades and student-assigned grades, no studies have thoroughly investigated
the quality of the formative feedback (comments) provided by students.
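For illustration, the different forms of marking criteria mentioned above could be represented along the following lines; this is an assumed structure, not taken from any of the reviewed tools.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Criterion:
    prompt: str
    kind: str                        # "binary", "scale" (holistic rating) or "open"
    scale_max: Optional[int] = None  # only meaningful when kind == "scale"

rubric = [
    Criterion("Does the program compile without errors?", kind="binary"),
    Criterion("Overall quality of the solution", kind="scale", scale_max=10),
    Criterion("Suggest one concrete improvement", kind="open"),
]

for criterion in rubric:
    print(f"[{criterion.kind}] {criterion.prompt}")

Each kind maps naturally to a different web-form control, consistent with the observation above that all of the reviewed systems collect reviews through standard web forms.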
Many of the tools support some form of feedback between reviewer and author, but few support
full discussion. The impact of discussion at different stages of the peer assessment process has
not been investigated. The support of discussion between reviewers and between reviewers and
authors warrants further study.
Instructors in Computer Science have the expertise to develop online tools that support peer
assessment, and the opportunity to evaluate those tools in the classroom. The majority of online
tools described in this paper (13 of 18) have been used in Computer Science courses, but most
are unavailable for use outside the context in which they were developed, and none of them have
been widely adopted. It is likely that peer assessment tools in the immediate future will continue
to be developed by Computer Science educators for use in their own classrooms, informed by
reports of the current tools. However, it would contribute significantly to the Computer Science
community if future peer assessment tools were designed for use in multiple institutions.
References
Anderson, N., & Shneiderman, B. (1977). Use of peer ratings in evaluating computer program
quality. In Proceedings of the fifteenth annual SIGCPR conference, Arlington, Virginia,
United States (pp. 218–226). New York, NY, USA: ACM.
Ballantyne, R., Hughes, K., & Mylonas, A. (2002). Developing Procedures for Implementing Peer
Assessment in Large Classes Using an Action Research Process. Assessment & Evaluation
in Higher Education, 27(5), 427–441.
Bhalerao, A., & Ward, A. (2001). Towards electronically assisted peer assessment: a case study.
Association for Learning Technology Journal, 9(1), 26–37.
Brereton, P., Kitchenham, B., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying
the systematic literature review process within the software engineering domain. The
Journal of Systems and Software, 80, 571–583.
Chalk, B., & Adeboye, K. (2005). Peer Assessment Of Program Code: a comparison of two
feedback instruments. In 6th HEA-ICS Annual Conference, University of York, UK (pp.
106–110).
Chapman, O., & Fiore, M. (2000). Calibrated Peer Review™. Journal of Interactive Instruction
Development, 12(3), 11–15.
Cho, K., & Schunn, C.D. (2007). Scaffolded writing and rewriting in the discipline: A web-based
reciprocal peer review system. Computers & Education, 48(3), 409–426.
Davies, P. (2000). Computerized Peer Assessment. Innovations In Education & Training Inter-
national, 37(4), 346–355.
Davies, P. (2003). Closing the communications loop on the computerized peer-assessment of
essays. ALT-J, 11(1), 41–54.
Davies, P. (2004). Don’t write, just mark: the validity of assessing student ability via their
computerized peer-marking of an essay rather than their creation of an essay. ALT-J, 12(3),
261–277.
Davies, P. (2006). Peer assessment: judging the quality of the students’ work by comments rather
than marks . Innovations In Education & Training International, 43(1), 69–82.
Davies, P. (2008). Review and reward within the computerised peer-assessment of essays.
Assessment & Evaluation in Higher Education, (pp. 1–12).
de Raadt, M., Lai, D., & Watson, R. (2007). An evaluation of electronic individual peer assessment
in an introductory programming course. In R. Lister & Simon (Eds.), Seventh Baltic
Sea Conference on Computing Education Research (Koli Calling 2007), Vol. 88 of CRPIT
(pp. 53–64). Koli National Park, Finland: ACS.
de Raadt, M., Toleman, M., & Watson, R. (2005). Electronic peer review: A large cohort teaching
themselves? In Proceedings of the 22nd Annual Conference of the Australasian Society for
Computers in Learning in Tertiary Education (ASCILITE’05), Brisbane, Australia.
Denny, P., Hamer, J., Luxton-Reilly, A., & Purchase, H. (2008). PeerWise: students sharing their
multiple choice questions. In ICER ’08: Proceedings of the fourth international workshop on
Computing education research, Sydney, Australia (pp. 51–58). New York, NY, USA: ACM.
Denny, P., Luxton-Reilly, A., & Hamer, J. (2008a). The PeerWise system of student contributed
assessment questions. In Simon & M. Hamilton (Eds.), Tenth Australasian Computing
Education Conference (ACE 2008), Vol. 78 of CRPIT (pp. 69–74). Wollongong, NSW,
Australia: ACS.
Denny, P., Luxton-Reilly, A., & Hamer, J. (2008b). Student use of the PeerWise system. In
ITiCSE ’08: Proceedings of the 13th annual SIGCSE conference on Innovation and tech-
nology in computer science education (pp. 73–77). Madrid, Spain: ACM.
Denny, P., Luxton-Reilly, A., & Simon, B. (2009). Quality of student contributed questions using
PeerWise. In M. Hamilton & T. Clear (Eds.), Eleventh Australasian Computing Education
Conference (ACE 2009), Vol. 95 of CRPIT, Wellington, New Zealand, January (pp. 55–64).
Wellington, New Zealand: Australian Computer Society.
Dochy, F., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer and co-assessment in higher
education: A review. Studies in Higher Education, 24(3), 331–350.
Downing, T., & Brown, I. (1997). Learning by cooperative publishing on the World-Wide Web.
Active Learning, 7, 14–16.
Falchikov, N. (1995). Peer Feedback Marking: Developing Peer Assessment. Innovations in
Education and Teaching International, 32(2), 175–187.
Falchikov, N., & Goldfinch, J. (2000). Student Peer Assessment in Higher Education: A Meta-
Analysis Comparing Peer and Teacher Marks. Review of Educational Research, 70(3), 287–
322.
Figl, K., Bauer, C., & Mangler, J. (2006). Online versus Face-to-Face Peer Team Reviews. In
36th ASEE/IEEE Frontiers in Education Conference, Oct. (pp. 7–12).
Freeman, M., & McKenzie, J. (2002). SPARK, a confidential web-based template for self and
peer assessment of student teamwork: benefits of evaluating across different subjects. British
Journal of Educational Technology, 33(5), 551–569.
Gehringer, E. (2000). Strategies and mechanisms for electronic peer review. Frontiers in Educa-
tion Conference, 2000. FIE 2000. 30th Annual, 1, F1B/2–F1B/7 vol.1.
Hamer, J., Cutts, Q., Jackova, J., Luxton-Reilly, A., McCartney, R., Purchase, H., et al. (2008).
Contributing student pedagogy. SIGCSE Bull., 40(4), 194–212.
Hamer, J., Kell, C., & Spence, F. (2007). Peer assessment using Aropä. In ACE ’07: Proceedings
of the ninth Australasian conference on Computing education, Ballarat, Victoria, Australia
(pp. 43–54). Darlinghurst, Australia: Australian Computer Society, Inc.
Hamer, J., Ma, K.T.K., & Kwong, H.H.F. (2005). A method of automatic grade calibration in
peer assessment. In ACE ’05: Proceedings of the 7th Australasian conference on Computing
education, Newcastle, New South Wales, Australia (pp. 67–72). Darlinghurst, Australia:
Australian Computer Society, Inc.
Kali, Y., & Ronen, M. (2005). Design principles for online peer-evaluation: Fostering objectivity.
In T. Koschmann, D.D. Suthers & Chan (Eds.), Computer support for collaborative learning:
The Next 10 Years! Proceedings of CSCL 2005 (Taipei, Taiwan). Mahwah, NJ: Lawrence
Erlbaum Associates.
Kitchenham, B. (2004). Procedures for Performing Systematic Reviews. Technical Report
TR/SE-0401, Keele University.
Lin, S., Liu, E., & Yuan, S. (2001). Web-based peer assessment: feedback for students with
various thinking-styles. Journal of Computer Assisted Learning, 17(4), 420–432.
Liu, E.Z.F., Lin, S., Chiu, C.H., & Yuan, S.M. (2001). Web-based peer review: the learner as
both adapter and reviewer. IEEE Transactions on Education, 44(3), 246–251.
Lutteroth, C., & Luxton-Reilly, A. (2008). Flexible learning in CS2: A case study. In Proceedings
of the 21st Annual Conference of the National Advisory Committee on Computing
Qualifications, Auckland, New Zealand.
Mann, B. (2005). The Post and Vote Model of Web-Based Peer Assessment. In P. Kommers &
G. Richards (Eds.), Proceedings of World Conference on Educational Multimedia, Hyper-
media and Telecommunications 2005 (pp. 2067–2074). Chesapeake, VA: AACE.
McLuckie, J., & Topping, K.J. (2004). Transferable skills for online peer learning. Assessment
& Evaluation in Higher Education, 29(5), 563–584.
Miao, Y., & Koper, R. (2007). An Efficient and Flexible Technical Approach to Develop and
Deliver Online Peer Assessment. In C.A. Chinn, G. Erkens & S. Puntambekar (Eds.),
Proceedings of the 7th Computer Supported Collaborative Learning (CSCL 2007) conference
’Mice, Minds, and Society’, July (pp. 502–511). New Jersey, USA.
Millard, D., Sinclair, P., & Newman, D. (2008). PeerPigeon: A Web Application to Support
Generalised Peer Review. In E-Learn 2008 - World Conference on E-Learning in Corporate,
Government, Healthcare, and Higher Education, November.
Miller, P.J. (2003). The Effect of Scoring Criteria Specificity on Peer and Self-assessment.
Assessment & Evaluation in Higher Education, 28(4), 383–394.
Murphy, L., & Wolff, D. (2005). Take a minute to complete the loop: using electronic Classroom
Assessment Techniques in computer science labs. J. Comput. Small Coll., 21(1), 150–159.
Ngu, A.H.H., & Shepherd, J. (1995). Engineering the ‘Peers’ system: the development of a
computer-assisted approach to peer assessment. Research and Development in Higher
Education, 18, 582–587.
Paré, D., & Joordens, S. (2008). Peering into large lectures: examining peer and expert mark
agreement using peerScholar, an online peer assessment tool. Journal of Computer Assisted
Learning, 24(6), 526–540.
Plimmer, B., & Apperley, M. (2007). Making paperless work. In CHINZ ’07: Proceedings of the
7th ACM SIGCHI New Zealand chapter’s international conference on Computer-human
interaction, Hamilton, New Zealand (pp. 1–8). New York, NY, USA: ACM.
Price, B., & Petre, M. (1997). Teaching programming through paperless assignments: an empirical
evaluation of instructor feedback. In ITiCSE ’97: Proceedings of the 2nd conference on
Integrating technology into computer science education, Uppsala, Sweden (pp. 94–99). New
York, NY, USA: ACM.
Raban, R., & Litchfield, A. (2007). Supporting peer assessment of individual contributions in
groupwork. Australasian Journal of Educational Technology, 23(1), 34–47.
Ronen, M., Kohen-Vacs, D., & Raz-Fogel, N. (2006). Adopt & adapt: structuring, sharing and
reusing asynchronous collaborative pedagogy. In ICLS ’06: Proceedings of the 7th interna-
tional conference on Learning sciences, Bloomington, Indiana (pp. 599–605). International
Society of the Learning Sciences.
Sackett, D.L., Richardson, W.S., Rosenberg, W., & Haynes, R.B. (1997). Evidence-based
medicine: how to practice and teach EBM. London (UK): Churchill Livingstone.
Sitthiworachart, J., & Joy, M. (2008). Computer support of effective peer assessment in an
undergraduate programming class. Journal of Computer Assisted Learning, 24, 217–231.
Sitthiworachart, J., & Joy, M. (2004). Effective peer assessment for learning computer program-
ming. In ITiCSE ’04: Proceedings of the 9th annual SIGCSE conference on Innovation and
technology in computer science education, Leeds, United Kingdom (pp. 122–126). New York,
NY, USA: ACM.
Sluijsmans, D.M.A., Brand-Gruwel, S., & van Merrinboer, J.J.G. (2002). Peer Assessment
Training in Teacher Education: effects on performance and perceptions. Assessment &
Evaluation in Higher Education, 27(5), 443–454.
Sung, Y.T., Chang, K.E., Chiou, S.K., & Hou, H.T. (2005). The design and application of a
web-based self- and peer-assessment system. Computers & Education, 45(2), 187–202.
Topping, K. (1998). Peer Assessment Between Students in Colleges and Universities. Review of
Educational Research, 68(3), 249–276.
Trahasch, S. (2004). From peer assessment towards collaborative learning. In Frontiers in
Education Conference, 2004 (FIE 2004), 34th Annual, Vol. 2 (pp. F3F-16–20).
Trivedi, A., Kar, D.C., & Patterson-McNeill, H. (2003). Automatic assignment management and
peer evaluation. J. Comput. Small Coll., 18(4), 30–37.
Walvoord, M.E., Hoefnagels, M.H., Gaffin, D.D., Chumchal, M.M., & Long, D.A. (2008). An
analysis of Calibrated Peer Review (CPR) in a science lecture classroom. Journal of College
Science Teaching, 37(4), 66–73.
Webster, J., & Watson, R.T. (2002). Analyzing the Past to Prepare for the Future: Writing a
Literature Review. MIS Quarterly, 26(2), xii–xxiii.
Wolfe, W.J. (2004). Online student peer reviews. In CITC5 ’04: Proceedings of the 5th conference
on Information technology education, Salt Lake City, UT, USA (pp. 33–37). New York, NY,
USA: ACM.
Xiao, Y., & Lucking, R. (2008). The impact of two types of peer assessment on students’ per-
formance and satisfaction within a Wiki environment. The Internet and Higher Education,
11(3-4), 186–193. Special Section of the AERA Education and World Wide Web Special
Interest Group (EdWeb/SIG).
Zeller, A. (2000). Making students read and review code. In ITiCSE ’00: Proceedings of the 5th
annual SIGCSE/SIGCUE ITiCSE conference on Innovation and technology in computer
science education, Helsinki, Finland (pp. 89–92). New York, NY, USA: ACM.