Cover Page Footnote
This work was supported by the Spanish Ministry of Economy and Competitiveness under Grant PSI2014-52263-C2-1-P; the Junta de Andalucía under Grant HUM642; and Eusko Jaurlaritza (Government of the Basque Country) under Grant IT1341-19.
Practical Assessment, Research & Evaluation
A peer-reviewed electronic journal.
Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission
is granted to distribute this article for nonprofit, educational purposes if it is copied in its entirety and the journal is credited. PARE has the
right to authorize third party reproduction of this article in print, electronic and database forms.
Volume 26 Number 16, June 2021 ISSN 1531-7714
Variability In The Accuracy Of Self-Assessments Among Low,
Moderate, And High Performing Students In University
Education
Samuel Parra León, University of Jaén, Spain
Antonio Pantoja Vallejo, University of Jaén, Spain
James Byron Nelson, University of the Basque Country, Spain
The present work empirically examines the validity of Student Self-Assessment (SSA) as an educational assessment in higher education. We briefly review the principal methodological factors that could affect SSA validity, as well as the main findings identified in the literature. One empirical study is presented that compares students' self-evaluations on a test with the evaluation made by the course instructor while controlling for students' experience with SSA, for the criteria, rubric, and scales used by the student and teacher, and for teacher blinding. Results show a strong overall correlation between the SSA and the instructor's evaluation, and show that lower-performing students tend to over-estimate their performance while higher-performing students under-estimate theirs. The results support that SSA is valid for the average student, but less so, in the absence of measurements of potentially mediating variables, for those who deviate above and below average. The need to consider metacognitive factors in SSA is proposed.
Introduction
Evaluation has always been an essential part of the
instructional processes, as it measures and assigns value
to achievement in the teaching and learning process (e.g.,
Ainscow, 1988; Ysseldyke & Matson, 1988). Of existing
types of evaluation, Student Self-Assessment (SSA,
hereafter) has aroused substantial interest in the research
community. Panadero et al. (2016, p. 2) describe SSA as
“…mechanisms and techniques through which students
describe (i.e., assess) and possibly assign merit or worth
to (i.e., evaluate) the qualities of their own learning
processes and products”.
SSA has been an important topic in the analysis of
teaching and learning processes for decades. The first
important SSA review (Boud & Falchikov, 1989)
provided answers to many doubts that the SSA practice
raised at that time. However, the review raised a series
of unresolved questions about factors that influence
SSA. Since 1989, the popularity of self-assessment has
increased considerably, becoming common practice for
many teachers (e.g., Berry, 2011; Black & Wiliam, 1998).
Panadero et al. (2014) showed that 90% of the
professors in their survey had used SSA in their courses
and that 90% indicated a positive experience.
The increasing use of this evaluation technique has
led to heterogeneity in how it is defined and understood.
Currently up to 20 different categories of SSA can be
found in the literature (see Panadero et al., 2016 for
review). These different categories are organized from
practices that seek the student's summative and general
SSA without any established criteria (e.g., Stanton, 1978),
to those that view SSA as part of the student's self-
regulation process (e.g., Panadero & Alonso-Tapia,
2014). Roberts et al. (2019, p. 79) describe the latter as
“… the ability to set, monitor, and reflect on goals and
then set new goals to monitor and reflect upon. This
cycle of learning is needed to be an efficient and effective
active participant in one’s own learning.” In that regard,
self-regulation would represent the highest level of
acquisition achieved by the student within his or her own
learning process and, consequently, of self-evaluation. It
is necessary that teachers stimulate reflection in the
classroom and provide tools and strategies for self-
regulated learning (Torres & Tackett, 2016), which also
includes self-assessment.
Numerous studies (e.g., Dochy et al., 1999) suggest
a relationship between the SSA process and other factors
regarding learning, such as a) improvement in both the
effectiveness and quality of learning (e.g., Brown &
Harris, 2013; Ramdass & Zimmerman, 2008; Topping,
2003); b) the use of self-regulation strategies for learning
(Kostons et al., 2012; Panadero et al., 2016) and c) self-
efficacy (Olina & Sullivan, 2004; Ramdass &
Zimmerman, 2008; c.f., Andrade et al., 2009). SSA is
also deemed important in that it produces active learning
by involving students in the evaluation process (Black &
Wiliam, 1998; Nicol & McFarlane-Dick, 2006; Tan,
2012; Taras, 2010).
The role that SSA has taken in educational research,
combined with the heterogeneity in its conception
regarding its purpose, its execution, and even its
interpretation, has produced several methodological
approaches to its study. Consequently, different ideas are
associated with the validity of SSA results (e.g., Boud &
Falchikov, 1989; Falchikov & Boud, 1989; Gordon,
1991; Panadero & Alonso-Tapia, 2013; Ross, 2006; Ward et al., 2002). The range of ways to implement or develop experiences with SSA requires a theoretical framework
that facilitates its effective implementation.
Ross (2006) has reviewed the reliability, validity, and
usefulness of SSA in education. He highlights some of
the characteristics of SSA that are relevant to effectively
implementing it in the classrooms, such as a) its
reliability as a technique, b) its validity as evidence of the
student's performance, c) its formative nature, and d) its
usefulness as an evaluation technique. Similarly,
Panadero et al., (2016) proposed different factors that
could impact the validity of SSA. These are: a) the
medium of SSA, b) the delay between SSA and
instruction, c) expectations of students, and d) whether
or not criteria for evaluation are provided.
When considering the variables proposed to affect
the validity of SSA, it is necessary to define what is meant
with regard to validity. In general terms, validity refers
to the evidence provided to support or refute the
meaning or explanation given to the evaluation data or
results; “To validate a proposed interpretation or use of
test scores is to evaluate the claims being based on the
test scores. The specific mix of evidence needed for
validation depends on the inferences being drawn and
the assumptions being made.” (Kane, 2006, p. 131). SSA
Validity has been defined as the degree of agreement,
rapprochement, or consistency that exists between the
student’s evaluations and those of the teachers (e.g.,
Andrade, 2019; Gordon, 1991; Ross, 2006).
In the case of SSA, the evaluation made by the
teacher is considered as an expert evaluation against
which to compare the scores that students provide
during SSA. In accord with Kane (2006), the validity
of SSA should be greater the more consistent student
achievement evaluations are with teacher evaluations.
Data in the literature on SSA validity appear
inconclusive. Authors such as Boud and Falchikov
(1989) indicated that, although there was a moderate
consensus between SSA and the judgments expressed by
professors, most of the literature suffered from a number of methodological, conceptual, and interpretative errors, making a general picture of the findings difficult to assemble. Some of the main limitations are summarized
below:
- The evaluation scales used were not specific (e.g., Boud & Tyree, 1979).
- Students and teachers used different evaluation criteria (Doleys & Renzaglia, 1963; Gaier, 1961; Keefer, 1971; Mueller, 1970; Murstein, 1965; Sumner, 1932).
- The dependent variables measured more than knowledge gained. For example, Davis and Rand (1980) asked students to report their overall course performance, without making a clear distinction between performance and effort expended. Thus, the dependent variable sometimes reflected the effort invested by the student, providing a high degree of subjectivity
and variability (e.g., Filene, 1969; Davis & Rand,
1980).
- Over-reliance on reporting correlations between SSA and teacher evaluation (e.g., Gaier, 1961; Doleys & Renzaglia, 1963; Morton & Macbeth, 1977). The correlation coefficient is sensitive to atypical scores and is not the best technique when assessing a group that may not be homogeneous with respect to variables that could affect the correlation. Interpretations that compare SSA and teacher evaluations assume that individuals within the group share the same SSA ability (e.g., Ward et al., 2002).
There is a consensus that a moderate correlation
between SSA and teacher evaluations exists, but there
are also cases of a lack of correspondence that could call
into question the validity of SSA. Some studies show
trends towards overvaluation or undervaluation
(precision errors). The variables that conclusively predict
these trends are not known (Boud & Falchikov, 1989).
The most common precision error reported is
overestimation. It is generally assumed, with some exceptions, that students' SSA scores are higher than those of the teachers. There are also results indicating that
cognitive ability relates to precision errors.
Overestimation occurs in young children, for example,
and that has been attributed to the absence of the ability
to assess achievement based on a criterion (Butler, 1990).
Students who usually achieve better marks tend to
be more precise/realistic (e.g., Cochran & Spears, 1980;
Doleys & Renzaglia, 1963; Keefer, 1971; Murstein,
1965), or even underestimate (e.g., Sumner, 1932) their
performance, while students who usually achieve worse
marks tend to overvalue (e.g., Daines, 1978; Moreland et
al., 1981). When analyzing the relationship between
students' academic achievement and SSA accuracy,
students tend to be grouped into two groups, high and
low academic achievement. This clustering may be overshadowing part of the SSA/teacher-evaluation relationship. Analyzing the relationship by grouping
students into finer-grained levels of academic
achievement may provide greater sensitivity to changes
in SSA accuracy.
A variable that consistently affects accuracy with
regard to the scores given by teachers is the experience
or ability that the student has with the subject to be
evaluated. Results show better accuracy in SSA with an
increase of experience or mastery of the subject (e.g.,
Ross et al., 1999; Sung et al., 2005; Longhurst & Norton,
1997; Ross, 1998).
Knowledge of the evaluation criteria is also an
important variable. When students do not know the
evaluation criteria, the SSA is less accurate (Panadero &
Romero, 2014). This relationship implies that the
students' experience with SSA and their previous
knowledge about the evaluation criteria, as well as the
instruments used for SSA, will help to achieve more
accurate SSA relative to the teacher (expert) evaluation.
Also, accuracy improves when the criteria are simple
(Pakaslahti & Keltikangas-Järvinen, 2000), or when the
students have participated in the development of
evaluation criteria (Brown & Harris, 2013). However,
the effect of this latter variable is inconsistent, and
contradictory results have been found (Andrade et al.,
2010; Orsmond et al., 2000).
Overall levels of education do not necessarily have
much impact on SSA accuracy. Accuracy in primary
school students (Brown & Harris, 2013; Finn &
Metcalfe, 2014; Ross, 2006) and higher education (Boud
& Falchikov, 1989; Falchikov & Boud, 1989) is very
similar (.30 < r < .50). SSA can be affected by
motivational factors, such as whether or not the SSA affects the final grade (e.g., Boekaerts, 2011; Dunning et al.,
2004; Tejeiro et al., 2012). Tejeiro et al. (2012) found that
when the SSA does not influence the grade, the SSA is
very similar to that of the teacher. When the SSA
influences the grade, the discrepancy increases notably
with overestimates becoming more likely (Boud &
Falchikov, 1998). Moreover, different motivational
components can influence SSA. In a meta-analysis,
Sitzmann et al., (2010) concluded that affective factors
(e.g., satisfaction with the evaluation outcomes) have a
greater impact on SSA than those associated with
cognitive learning, which they determined to be only
moderately related to SSA accuracy.
Although the studies discussed so far have shown a
moderate level of precision of SSA, the literature shows
varied results on the validity of this test (e.g., Brown &
Harris, 2014; Dunning et al., 2004; Eva & Regehr, 2005; Lew et al., 2010). These works show that the agreement between SSA and other measures (test grade, expert judgments, etc.) is moderate only in the best of cases
(e.g., Brown & Harris, 2013). Though correlations vary
from .2 to .8, there are few studies that report
correlations higher than 0.6 (e.g., Brown & Harris, 2013;
Falchikov & Boud, 1989). Nevertheless, Panadero et al. (2016) propose that, even though SSA is an imprecise measure with an unknown amount of error, its use continues to be beneficial for teaching practice (Andrade, 2010; McMillan & Hearn, 2008).
As we have discussed, SSA precision / validity can
be affected by numerous methodological factors. Some
of those variables described above have been considered
in the analysis of SSA validity. However, we have identified no studies that explicitly controlled, as a whole, for the variables already identified in the literature as "threatening" when assessing the accuracy of SSA.
The main objective of the study we present is to assess the validity of SSA as an evaluation method. The study was conducted with a natural university sample while controlling factors that threaten SSA's validity. We
controlled the evaluation criteria (teacher and students
were familiar with the evaluation criteria); the scale and
rubric used (teacher and students used the same scale
and rubric for the evaluation); all students had previous
SSA experience; the SSA was conducted at the end of the course, when students' knowledge should be at its
highest; and the examiner was blind to the identity of the
student being evaluated. Finally, we analyzed how the
precision of SSA can vary according to the level of
achievement acquired by the students beyond a simple
pass-fail categorization (cf., Ćukušić, et al., 2014). To
achieve this, we not only focus on students who pass or
fail, but we also set different levels of academic
achievement to ensure sensitivity in the accuracy of self-
assessments made by students, with respect to the
evaluation of the teacher (expert).
Method
Participants
Sixty-four students from the third year of the
Degree in Social Education of the University of Jaén
(Spain) participated in the experiment. The sample
constitutes a natural group. The age of the students was
between 20.54 and 39.62 years (Median 22.48 years). The
sample was 82.54% female and 17.46% male. These
percentages are proportional to the distribution of males
and females in the total population of students in Spain
(Spanish National Institute of Statistics, 2015). One male
student chose not to participate, so the final number of
participants was 63.
Instruments
The test that was used as a basis to evaluate the validity of the SSA was the final exam of the theoretical component of one of the subjects of the third year of the degree in Social Education. This test consisted of six open-ended questions, four of them with a short answer (e.g., requiring 1 to 2 paragraphs to answer correctly) and the other two with a long answer (requiring 1 to 2 sheets to respond correctly). The questions were drawn from
the most prevalent theoretical contents of the subject.
An example of a short question is “What are the principles
of guidance? Explain briefly what they are.” An example of a long question is “Explain the phases that should be followed in the development of an Orientation Program (as we saw in topic 3).”
Table 1. Example of evaluation criteria by rubric.

| Degree of achievement | Description of the achievement |
| Level 1 (Excellent), 100% | The student includes the required content and demonstrates mastery of it. The information is relevant, accurate, and written in a coherent manner. |
| Level 2 (Good), 50-75% | The student includes three parts of the required content and demonstrates mastery of it. The information is relevant, accurate, and written in a coherent manner. |
| Level 3 (Medium), 25-50% | The student includes half of the required content and demonstrates partial mastery of the content. The information is partially relevant and accurate. |
| Level 4 (Poor), 0-25% | The student includes summary information of the required content and does not demonstrate mastery of it. The information is not relevant or accurate and is not written in a coherent manner. |
The rubric for the evaluation of the questions is
presented in Table 1. To evaluate the validity of the
rubric, 5 expert teachers evaluated 5 exams with the
rubric designed. The intraclass correlation coefficient
was .962.
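The paper does not state which ICC variant was computed; a minimal base-R sketch of one common choice, the Shrout and Fleiss two-way ICC(2,1) for absolute agreement by a single rater, is shown below. The 5 x 5 ratings matrix is made-up illustration data (rows = exams, columns = expert raters), not the experts' actual scores.

```r
# Made-up ratings for 5 exams scored by 5 raters on the 0-10 scale.
ratings <- matrix(c(7.0, 7.5, 7.0, 6.5, 7.0,
                    4.0, 4.5, 4.0, 4.0, 3.5,
                    9.0, 9.0, 8.5, 9.0, 9.5,
                    6.0, 5.5, 6.0, 6.5, 6.0,
                    8.0, 8.5, 8.0, 8.0, 7.5),
                  nrow = 5, byrow = TRUE)

# Shrout & Fleiss ICC(2,1): two-way random effects, absolute agreement, single rater.
icc_2_1 <- function(x) {
  n <- nrow(x); k <- ncol(x)
  grand <- mean(x)
  ss_rows  <- k * sum((rowMeans(x) - grand)^2)   # between-exams sum of squares
  ss_cols  <- n * sum((colMeans(x) - grand)^2)   # between-raters sum of squares
  ss_error <- sum((x - grand)^2) - ss_rows - ss_cols
  msr <- ss_rows / (n - 1); msc <- ss_cols / (k - 1)
  mse <- ss_error / ((n - 1) * (k - 1))
  (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
}
icc_2_1(ratings)
```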
At the beginning of the written test students were
given instructions to complete the test and provided the
point distribution for the various question types (e.g., up
to one point for each short question answered correctly
and up to 3 points for each long question answered). The
maximum score that could be obtained in the test was
ten. Students were told that once the test was completed,
they should calculate the grade that they expect to obtain
on the test, based on their performance, and write it in a
box titled "Self-Assessment that appeared in the upper
left of the sheet. In the study, the teacher's score was
assumed to be the accurate one and both evaluations
(expert-teacher and student) followed the same criteria.
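As a quick consistency check on the scoring scheme just described (four short questions worth up to 1 point each and two long questions worth up to 3 points each):

\[ 4 \times 1 + 2 \times 3 = 10 \ \text{points (the maximum test score)} \]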
Procedure
At the beginning of the course, students were
informed about the evaluation criteria and were told in
the course's practice sessions that SSA would be used to
evaluate the work they gave to the professor. Thus, and
in accordance with the recommendations of the
literature (Doleys & Renzaglia, 1963, Gaier, 1961,
Keefer, 1971, Mueller, 1970, Murstein, 1965, Sumner,
1932), we ensured that students were aware of the
evaluation criteria and practiced in their use. In order to
control the bias caused by lack of knowledge about the
use of SSA, students received instruction on how to
evaluate works using criteria established by a rubric, in
the same way as the teacher would do when evaluating
the students’ exams. Thus, during SSA, students had practice evaluating works using rubric-based criteria and used the same criteria as the teacher, reducing the
influence of factors not related to the knowledge
assessed in the exam (e.g., Filene, 1969; Davis & Rand,
1980). The classes on how to perform SSA were part of
the content of the subject regarding how achievement
can be quantified.
On the day of the written test, the students were
seated, the exams were distributed, and the previously
mentioned instructions were read. Participants were
informed of the purpose of the investigation and that
participation in it was entirely voluntary. Those who did
not want to participate simply left the self-assessment
box blank. Students were informed that any student who
had participated in the study and wanted to receive
information about the findings could email the teacher
to receive the information. The degree of participation
was 98.4%.
Before beginning the evaluation of the exams by the
teacher, the first page of each exam containing the
student's identifying data, the exam questions, and the
SSA response was removed. Thus, the teacher was blind
to the identity of the student and his/her self-
assessment.
Statistical Analysis
The scores resulting from the evaluation by the
teacher (hereinafter called Exam) and the SSA responses
issued by the students (hereinafter called SSA) were
recorded for each student. The results were evaluated
initially using the Pearson correlation coefficient (r) and
linear regression was used to characterize the
relationship between the variables. Participants were further grouped into Fail (<5), Pass (≥5 and <7), Very Good (≥7 and <9), and Outstanding (≥9) categories based on the Exam. These groupings were not arbitrary but followed common evaluation practices in university education (e.g., as defined by Real Decreto 1125/2003, de 5 de septiembre). Student's t-tests were used to compare each group's Exam score to its SSA. Percent agreement
between the SSA and Exam groupings as well as Cohen’s
Kappa were calculated.
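A minimal R sketch of the pipeline just described, assuming a data frame d with columns exam (the teacher's score) and ssa (the student's self-assessment). The generated data are illustrative placeholders, not the study's data, and the code is not the authors' original analysis script.

```r
# Hypothetical illustration data; the real study used the 63 students' scores.
set.seed(1)
d <- data.frame(exam = round(runif(63, 3.5, 10), 1))
d$ssa <- pmin(10, pmax(0, round(d$exam + rnorm(63, 0, 1), 1)))

# Overall association and the regression of Exam on SSA.
cor.test(d$ssa, d$exam)               # Pearson r
summary(lm(exam ~ ssa, data = d))     # line of best fit predicting Exam from SSA

# Achievement groupings following the Spanish grading bands (Real Decreto 1125/2003).
grade_cut <- function(x) cut(x, breaks = c(-Inf, 5, 7, 9, Inf), right = FALSE,
                             labels = c("Fail", "Pass", "Very Good", "Outstanding"))
d$exam_grp <- grade_cut(d$exam)
d$ssa_grp  <- grade_cut(d$ssa)

# Paired t-test of SSA against Exam within each achievement group.
by(d, d$exam_grp, function(g) t.test(g$ssa, g$exam, paired = TRUE))

# Percent agreement and Cohen's kappa between the two groupings.
tab <- table(d$exam_grp, d$ssa_grp)
p_o <- sum(diag(tab)) / sum(tab)                       # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # chance agreement
c(percent_agreement = 100 * p_o, kappa = (p_o - p_e) / (1 - p_e))
```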
Effect sizes were computed using Hedges' g, as it is useful with small sample sizes (Hedges, 1981).
Bootstrapping was used to determine the confidence
intervals around the effect sizes using methods described
by Efron (1992). Five-thousand random samples (with
replacement) were drawn from each Exam grouping.
The bootstrapping, and the associated statistical
estimates, were made with the dabestr R package (Ho et
al., 2018).
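The estimates themselves were obtained with the dabestr R package; the base-R sketch below illustrates the same idea under one common formulation of a paired Hedges' g (mean SSA-minus-Exam difference divided by the pooled standard deviation, with the small-sample correction) and a percentile bootstrap confidence interval. The formulation and helper names are illustrative assumptions and need not match dabestr's internal computation exactly.

```r
# Paired Hedges' g (SSA minus Exam): mean difference / pooled SD, with the
# small-sample correction J = 1 - 3 / (4 * df - 1).
hedges_g_paired <- function(ssa, exam) {
  n <- length(ssa)
  sd_pooled <- sqrt((var(ssa) + var(exam)) / 2)
  j <- 1 - 3 / (4 * (n - 1) - 1)
  j * mean(ssa - exam) / sd_pooled
}

# Percentile bootstrap: resample students with replacement and recompute g.
boot_g <- function(ssa, exam, reps = 5000) {
  n <- length(ssa)
  g_star <- replicate(reps, {
    i <- sample.int(n, replace = TRUE)
    hedges_g_paired(ssa[i], exam[i])
  })
  c(g = hedges_g_paired(ssa, exam), quantile(g_star, c(0.025, 0.975)))
}

# Applied within each achievement group of the illustrative data frame d.
by(d, d$exam_grp, function(g) boot_g(g$ssa, g$exam))
```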
Results
Table 2 shows the descriptive statistics for the
Exam and Self-assessment (SSA) variables. As can be
seen in these data, the means of both variables are very
close (6.62 and 6.75). Overall, there was no difference
between SSA and Exam scores, F < 1.
The inter-rater agreement between the SSA and the
teacher’s evaluation with respect to the grouping (Fail,
Pass, Very Good, Outstanding) was good (63.5%, K =
.61). Figure 1 shows the overall correlation between SSA
(X) and Exam performance (Y) on the left along with
the regression line of best fit predicting Exam grade with
SSA. The points at right show the distribution of the
scores, where it is evident that there were no outliers or
extreme scores in either the SSA or the Exam. Within
Figure 1, at left, the different symbols represent the
scores grouped by exam achievement. Overall, exam scores were predictable given SSA scores, r = .66, r² = .44, p < .0001. The SSA scores of failing students were shifted to the right of the regression line, indicating that their SSA overestimated their exam score (1.99 points overestimation, t(7) = 5.39, p = .001). Those who passed were more accurate in their SSA, with scores spanning the regression line (.292 overestimation, t(24) = 1.21, p =
.24). Those doing “very good” under-estimated their
final grade, with scores being shifted to the left of the
regression line (-.40, t(23) = 5.21, p < .0001), as did
those doing “Outstanding”, who underestimated their
grade, on average, by -.81, t(5) = 2.74, p = .04.
The regression lines within each grouping of exam
performance (dotted lines in Figure 1) suggest little
relationship between SSA and Exam performance,
beyond the achievement (Fail, Pass, Very Good, and
Outstanding) groupings. Examining the relationship
between Exam Grade and SSA within each achievement
grouping would yield weak conclusions due to the small
sample sizes within each group. To remove the influence of the exam grouping, we subtracted each grouping's mean exam score from each exam score, so that each group had the same mean (zero).
Table 2. Descriptive statistics for the Exam and Self-Assessment (SSA) variables by level of achievement

Exam
| Group | Range | Mean | SD | Variance | Skew (SE) | Kurtosis (SE) |
| Total | 3.50-10.0 | 6.62 | 1.65 | 2.74 | 0.06 (0.30) | -0.75 (0.59) |
| Fail | 3.50-4.50 | 4.04 | 0.41 | 0.17 | -0.27 (0.75) | -1.93 (1.48) |
| Pass | 5.00-6.90 | 5.72 | 0.62 | 0.38 | 0.52 (0.46) | -0.89 (0.90) |
| Very Good | 7.00-8.90 | 7.71 | 0.55 | 0.30 | 0.84 (0.47) | -0.37 (0.92) |
| Outstanding | 9.00-10.0 | 9.52 | 0.35 | 0.13 | -0.08 (0.85) | -0.29 (1.74) |

SSA
| Group | Range | Mean | SD | Variance | Skew (SE) | Kurtosis (SE) |
| Total | 4.00-9.70 | 6.75 | 1.30 | 1.69 | 0.04 (0.30) | -0.74 (0.59) |
| Fail | 4.00-7.50 | 6.03 | 1.23 | 1.52 | -0.18 (0.75) | -0.77 (1.48) |
| Pass | 4.50-8.50 | 6.01 | 1.07 | 1.14 | 0.85 (0.46) | 0.53 (0.90) |
| Very Good | 5.50-8.60 | 7.30 | 0.84 | 0.70 | -0.45 (0.47) | -0.29 (0.92) |
| Outstanding | 7.50-9.70 | 8.70 | 0.73 | 0.54 | -0.52 (0.85) | 1.17 (1.74) |

Note. Exam refers to the score determined by the teacher on the test; SSA refers to the student's self-assessment. Total shows the scores for all students in the class. Fail, Pass, Very Good, and Outstanding are classifications according to the score given by the teacher on the test.
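A short base-R sketch of how per-group descriptives like those in Table 2 could be computed, continuing with the illustrative data frame d (columns exam, ssa, exam_grp) assumed in the earlier sketch. The moment-based skewness and kurtosis used here may differ slightly from whatever software produced the table.

```r
# Descriptive statistics by achievement group for a numeric vector.
describe_by_group <- function(x, g) {
  skew <- function(v) mean((v - mean(v))^3) / sd(v)^3       # moment-based skewness
  kurt <- function(v) mean((v - mean(v))^4) / sd(v)^4 - 3   # excess kurtosis
  do.call(rbind, lapply(split(x, g), function(v)
    data.frame(min = min(v), max = max(v), mean = mean(v), sd = sd(v),
               variance = var(v), skew = skew(v), kurtosis = kurt(v))))
}

describe_by_group(d$exam, d$exam_grp)   # teacher's Exam scores
describe_by_group(d$ssa,  d$exam_grp)   # students' self-assessments (SSA)
```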
Figure 1. Exam Grade by Self-Assessment Grade. Note: The vertical axis represents the scores obtained by the students on the exam (Exam Grade); the horizontal axis shows the Self-Assessment (SSA) scores. The left part of the figure shows the linear regression between SSA and Exam Grade; the regression equation is shown at bottom right and the overall line of best fit is shown in black. Scores are shown by group (Fail = open circles, Pass = black triangles, Very Good = open squares, Outstanding = black circles), with the lines of best fit within groups shown as dashed lines (blue for Fail, red for Pass, green for Very Good, and purple for Outstanding). The distributions of the scores for the Exam (E) and Self-Assessment (SSA) variables are shown at the right of the figure.
Thus, the
only source of variation remaining in the exam scores
was that within each achievement group. There was no
relationship within achievement categories between the
Exam and the SSA, r = .19, p = .13. The lack of
relationship is, perhaps, not surprising given that there is
a restricted range of exam scores within each exam
achievement category.
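A minimal sketch of this centering step, again using the illustrative data frame d from the earlier sketches: each achievement group's mean exam score is subtracted so that only within-group variation remains, and the centered scores are then correlated with SSA.

```r
# Center exam scores on their achievement-group means (each group mean becomes zero).
d$exam_centered <- d$exam - ave(d$exam, d$exam_grp, FUN = mean)

# Any remaining SSA-Exam association now reflects only within-group variation.
cor.test(d$ssa, d$exam_centered)
```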
Given that the sizes of the achievement groupings were
different, a bootstrap analysis (5000 samples) was
performed to estimate the population effect size of each
comparison. Figure 2 shows the results of the
bootstrapping (F: Fail, P: Pass, V: Very Good, and O:
Outstanding) for difference between the SSA
assessment (subscript s) and the exam score (subscript
e). Positive effect sizes show an overestimation of
performance by SSA and negative effect sizes show an
under-estimation. Effects near zero indicate similar SSA
and Exam scores. The following effect sizes were obtained: Fail, g = 2.04, 95% CI [0.74, 3.18]; Pass, g = .33, 95% CI [-0.197, 0.88]; Very Good, g = -.57, 95% CI [-1.12, -0.05]; and Outstanding, g = -1.31, 95% CI [-2.45, -0.10]. As
can be seen in the figure, the Fail group shows a robust
overestimation, the Pass and Very Good groups show
an accurate and slight underestimation, respectively, and
the Outstanding group shows a robust underestimation.
Conclusions
The main objective of this work was to analyze the
relationship between academic achievement and SSA
accuracy in university education while ensuring that
teachers and students used the same evaluation criteria
(cf., Admiraal et al., 2015), used the same rubric and scales (cf., Alameddine et al., 2018), that students were experienced in SSA (cf., Bolívar-Cruz & Verano-Tacoronte, 2018), and that the teacher was blind to the authors of the works being evaluated (cf., Aryadoust, 2015).
Figure 2. Paired Hedges' g for the four groups (F: Fail, P: Pass, V: Very Good, and O: Outstanding). Note: Comparisons are shown in the Cumming estimation plot. The subscript e indicates the Exam score; the subscript s indicates the SSA score. Each paired mean difference is plotted as a bootstrap sampling distribution. Mean differences are depicted as dots; 95% confidence intervals are indicated by the ends of the vertical error bars.
The results showed that students' estimates of
their achievements did not differ, overall, from the
evaluations made by the teacher. That correspondence
demonstrates a degree of SSA validity since, overall,
the students evaluated their performance similarly as
did the teacher. Further, we analyzed SSA precision
with respect to overall academic achievement based on
standard classifications of Fail, Pass, Very Good,
and Outstanding. Those standard classifications
revealed biases in SSA, where lower achievers tended
to over-estimate their performance while higher
achievers underestimated their performance. Defining
validity as the agreement between the evaluations made
by the students and the teacher, our results show that
SSA is a valid assessment of achievement. In Andrade's (2019, p. 5) terms, there is consistency between the SSA and the teacher's evaluations, as the data show "the degree of alignment between students' and expert raters' evaluations, avoiding the purer, more rigorous term accuracy unless it is fitting."
In the research presented here, all factors
classified by the SSA-accuracy literature as threatening
have been considered: knowledge that students have about the evaluation criteria, the experience that students have with SSA, familiarity with the scale, and standardization of the scale used (e.g., Panadero et
al., 2013; Panadero et al., 2016). Following these
recommendations, our participants had prior
experience and training in evaluating work using
evaluation criteria similar to that used on the SSA.
In the present work the evaluation criteria were
established previously and were provided to the
students before SSA. Additionally, we explained to the
students what SSA is and spent several sessions on its
use to familiarize them with it. To help ensure
maximum objectivity regarding the teacher's
evaluation, a rubric was developed prior to the
evaluation and validated by five experts not involved
in the evaluation (e.g., Brown & Harris, 2013; Panadero
et al., 2012). In addition, the teacher was blind during
the evaluation process to avoid any bias regarding the
score assigned to each participant. The same, familiar,
measurement scale was used both for teacher
evaluation and for SSA.
A common feature in SSA studies is to analyse the
results considering the class as a homogeneous group
(e.g., Bolívar-Cruz, & Verano-Tacoronte, 2018).
Numerous factors and skills are unevenly distributed
among students in a class (e.g., capacity, performance, motivation, etc.), and these factors could affect students' ability to assess their own achievements. We provided
additional analysis of the results by grouping students
depending on their test results. The results have shown
that this factor (the level of student achievement within
the course) is related to the ability of students to self-
assess.
The results were analyzed with respect to different
levels of student achievement to enhance sensitivity
beyond the results of a binary pass/fail classification
(c.f., Bolívar-Cruz, & Verano-Tacoronte, 2018). This
decision is consistent with the specialized literature
(e.g., Brown & Harris, 2014; Panadero et al., 2016),
which argues for the importance of considering such
factors in the construction and design of evaluation
instruments as their absence could be detrimental. For
instance, Brown and Harris (2014, p. 23) affirmed “…
awarding grades or basing educational interventions or
changes based on unrealistic or construct-irrelevant
self-assessments is untenable. If self-assessment
processes lead students to conclude wrongly that they
are good or weak in some domain and they base
personal decisions on such false interpretations, harm
could be done, even in classroom settings (e.g., task
avoidance, not enrolling in future subjects)”.
Our findings are consistent with those of Sumner (1932), showing that better students tend to
underestimate their performance, while those of lesser
skill tend to overestimate (e.g., Daines, 1978; Moreland
et al., 1981). Such an effect is obfuscated when
analysing the class as a whole or using a simple
pass/fail distinction. Despite the over- and under-estimation, accuracy was good. Overall, 66.7% of the students' SSA scores were within 1 point of the teacher's score, with that portion being 12.5% among
those Failing, 72% among those Passing, 79% for
those classified as Very Good, and 66.7% for those in
the outstanding category.
The proportion of students in each performance
category (Fail: 12.60%, Pass: 39.68%, Very Good:
38.10%, Outstanding: 9.52%) is representative of the
distribution of those proportions in the population of
university students in Spain (e.g., Aranda, et al., 2013).
Due to their different sample sizes, we used the
bootstrap analysis to calculate the effect sizes in each
group, which reaffirmed the previous analysis
regarding the over and underestimation of scores by
SSA.
Use of the student's achievement level regarding
the subject matter as a moderating factor when
analysing accuracy in SSA has been proposed in the
literature. However, in contrast with our research, studies have tended to focus on high and low competence rankings (e.g., Brown & Harris, 2013;
Boud & Falchikov, 1989). Although results have
varied, most of these studies relate higher levels of
student achievement with better accuracy in SSA, and
lower levels of student achievement with
overestimation in SSA. The measure of the student’s
capacity or competence is based on a post-hoc measure (i.e., the exam), which is also the instrument being
compared to the SSA. Additional measures of
competency, based perhaps on pre-test measures of
classroom performance, could further refine the
relationship of competence and accuracy in SSA.
A possible explanation of these results could be found in the Dunning-Kruger effect (Kruger & Dunning, 1999). According to the authors, the effect is a
cognitive bias in which people erroneously assess their
cognitive ability as greater than it is. Kruger and
Dunning (1999) propose that students not only lack knowledge of the content but also lack the metacognitive skills to recognize that they do not possess that
knowledge. The cognitive bias, in “less capable
students”, is attributed to an internal illusion about
their own cognitive abilities, leading to overestimations
of performance. In the case of “more capable
students” the cognitive bias may arise from an
erroneous perception about the external assessment of
their competences being more rigorous than it is,
which leads them to underestimate their performance
on the assessment. Our findings suggest that a-priori
assessment of metacognitive skills (e.g., Kallio et al.,
2018) would help to quantify the relationship of SSA
and exam performance.
The role that SSA has in the processes of Self-Regulated Learning (SRL) and in self-efficacy has been the focus of many investigations (see Panadero et al., 2017, for a review). However, as far as we know, there are no studies in the literature that, in addition to evaluating the formative role of SSA, assess the relationship between students' SSA accuracy and its formative benefits. Knowing what
factors positively or negatively affect the SSA accuracy
should have an impact on the formative role in the SRL
process.
In summary, the results found in this study show that the validity of SSA as an evaluative test in university education is somewhat relative to the skills of the student. When the validity of the test was assessed across
the entire group, the data show a general
correspondence between the SSA and the expert
evaluation. However, when students are grouped
according to their achievement in the evaluation by the
teacher, the measurement of validity is compromised
by the student’s expertise in the subject. There is an
overvaluation in the less competent students, while the
more competent students show undervaluation. These
findings suggest that SSA may be a valid test for
evaluating group performance, but less so for the
individual in the absence of further knowledge about
that individual. Assessing metacognitive abilities
related to illusion of control may be one way to better
adjust individual SSA evaluations.
Strengths, Limitations, and Further Research
The present study provides an accurate validation of SSA with a university sample while controlling factors that the literature indicates influence the accuracy of SSA. The analysis considered both the whole group and different academic-achievement subgroups, beyond pass-fail. Limitations include the use of a natural class resulting in a small sample, requiring the use of a bootstrap analysis to infer a population effect. Further research should consider larger and more diverse samples (e.g., at other educational levels) and the
measurement of other indicators of achievement and
metacognitive factors that might mediate the accuracy
of SSA.
References
Admiraal, W., Huisman, B., & Pilli, O. (2015). Assessment in massive open online courses. Electronic Journal of E-Learning, 13(4), 207-216. Retrieved from https://academic-publishing.org/index.php/ejel/article/view/1728/1691

Ainscow, M. (1988). Beyond the eyes of the monster: an analysis of recent trends in assessment and recording. Support for Learning, 3(3), 149-153. https://doi.org/10.1111/j.1467-9604.1988.tb00088.x

Alameddine, M. B., Englesbe, M. J., & Waits, S. A. (2018). A video-based coaching intervention to improve surgical skill in fourth-year medical students. Journal of Surgical Education, 75(6), 1475-1479. https://doi.org/10.1016/j.jsurg.2018.04.003

Andrade, H. L. (2019). A critical review of research on student self-assessment. Frontiers in Education, 4, 87. https://doi.org/10.3389/feduc.2019.00087

Andrade, H. L. (2010). Students as the definitive source of formative assessment: academic self-assessment and the self-regulation of learning. In H. L. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (pp. 90-105). New York: Routledge.

Andrade, H., Du, Y., & Mycek, K. (2010). Rubric-referenced self-assessment and middle school students' writing. Assessment in Education: Principles, Policy & Practice, 17(2), 199-214. https://doi.org/10.1080/09695941003696172

Andrade, H., Wang, X. L., Du, Y., & Akawi, R. L. (2009). Rubric-referenced self-assessment and self-efficacy for writing. Journal of Educational Research, 102(4), 287-301. https://doi.org/10.3200/JOER.102.4.287-302

Aranda, A. F., Pastor, V. M. L., Oliva, F. J. C., & Romero, R. (2013). La evaluación formativa en docencia universitaria y el rendimiento académico del alumnado [Formative assessment in university teaching and students' academic achievement]. Aula Abierta, 41(2), 23-34. Retrieved from https://dialnet.unirioja.es/descarga/articulo/4239063.pdf

Aryadoust, V. (2015). Self- and peer assessments of oral presentations by first-year university students. Educational Assessment, 20(3), 199-225. https://doi.org/10.1080/10627197.2015.1061989

Berry, R. (2011). Assessment reforms around the world. In Assessment Reform in Education (pp. 89-102). https://doi.org/10.1007/978-94-007-0729-0_7

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7-74. https://doi.org/10.1080/0969595980050102
Boekaerts, M. (2011). Emotions, emotion regulation, and self-regulation of learning. In B. J. Zimmerman & D. H. Schunk (Eds.), Handbook of self-regulation of learning and performance (pp. 408-425). New York: Routledge.

Bolívar-Cruz, A., & Verano-Tacoronte, D. (2018). Self-assessment of the oral presentation competence: Effects of gender and student's performance. Studies in Educational Evaluation, 59, 94-101. https://doi.org/10.1016/j.stueduc.2018.04.001

Boud, D. J., & Tyree, A. L. (1979). Self and peer assessment in professional education: a preliminary study in law. Journal of the Society of Public Teachers of Law, 15(1), 65-74. Retrieved from https://heinonline.org/HOL/LandingPage?handle=hein.journals/sptlns15&div=1&src=home

Boud, D., & Falchikov, N. (1989). Quantitative studies of student self-assessment in higher education: A critical analysis of findings. Higher Education, 18(5), 529-549. https://doi.org/10.1007/bf00138746

Brown, G. T. L., & Harris, L. R. (2013). Student self-assessment. In J. H. McMillan (Ed.), The SAGE handbook of research on classroom assessment (pp. 367-393). Thousand Oaks: Sage.

Brown, G. T. L., & Harris, L. R. (2014). The future of self-assessment in classroom practice: reframing self-assessment as a core competency. Frontline Learning Research, 3, 22-30. https://doi.org/10.14786/flr.v2i1.24

Butler, R. (1990). The effects of mastery and competitive conditions on self-assessment at different ages. Child Development, 61, 201-210. https://doi.org/10.2307/1131059

Cain, K. M., Wilkowski, B. M., Barlett, C. P., Boyle, C. D., & Meier, B. P. (2018). Do we see eye to eye? Moderators of correspondence between student and faculty evaluations of day-to-day teaching. Teaching of Psychology, 45(2), 107-114. https://doi.org/10.1177/0098628318762862

Cochran, S. B., & Spears, M. C. (1980). Student self-assessment and instructors' ratings: a comparison. Journal of the American Dietetic Association, 76, 253-257. Retrieved from http://agris.fao.org/agris-search/search.do?recordID=US8045887

Ćukušić, M., Garača, Ž., & Jadrić, M. (2014). Online self-assessment and students' success in higher education institutions. Computers & Education, 72, 100-109. https://doi.org/10.1016/j.compedu.2013.10.018

Daines, J. M. (1978). Self evaluation of academic performance in a continuously assessed course of study. Research Intelligence, 4(1), 24-26. https://doi.org/10.1080/0141192780040106

Davis, J. K., & Rand, D. C. (1980). Self-grading versus instructor grading. Journal of Educational Research, 73(4), 207-211. https://doi.org/10.1080/00220671.1980.10885237

Dochy, F. J. R. C., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer and co-assessment in higher education: A review. Studies in Higher Education, 24(3), 331-350. https://doi.org/10.1080/03075079912331379935

Doleys, E. J., & Renzaglia, G. A. (1963). Accuracy of student prediction of college grades. Personnel & Guidance Journal, 41(6), 528-530. https://doi.org/10.1002/j.2164-4918.1963.tb02337.x

Dunning, D., Heath, C., & Suls, J. M. (2004). Flawed self-assessment: implications for health, education, and the workplace. Psychological Science in the Public Interest, 5(3), 69-106. https://doi.org/10.1111/j.1529-1006.2004.00018.x

Efron, B. (1992). Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics (pp. 569-593). Springer, New York, NY.

Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis. Review of Educational Research, 59(4), 395-430. https://doi.org/10.2307/1170205

Filene, P. O. (1969). Self-grading: an experiment in learning. Journal of Higher Education, 40, 451-458. https://doi.org/10.2307/1979820

Finn, B., & Metcalfe, J. (2014). Overconfidence in children's multi-trial judgments of learning. Learning & Instruction, 32, 1-9.
https://doi.org/10.1016/j.learninstruc.2014.01.001

Gaier, E. L. (1961). Student self estimates of final course grades. The Journal of Genetic Psychology, 98(1), 63-67. https://doi.org/10.1080/00221325.1961.10534353

Gordon, M. J. (1991). A review of the validity and accuracy of self-assessments in health professions training. Academic Medicine: Journal of the Association of American Medical Colleges, 66(12), 762-769. https://doi.org/10.1097/00001888-199112000-00012

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107-128. https://doi.org/10.3102/10769986006002107

Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2018). Moving beyond P values: Everyday data analysis with estimation plots. Nature Methods, 16, 565-566. https://doi.org/10.1038/s41592-019-0470-3

Instituto Nacional de Estadística (2015). Mujeres en el profesorado por enseñanza que imparten [Females in the teaching body according to the grade level they teach]. Available at: http://www.ine.es/ss/Satellite?L=es_ES&c=INESeccion_C&cid=1259925481851&p=1254735110672&pagename=ProductosYServicios%2FPYSLayout&param3=1259924822888

Kallio, H., Virta, K., & Kallio, M. (2018). Modelling the components of metacognitive awareness. International Journal of Educational Psychology, 7(2), 94-122. https://doi.org/10.17583/ijep.2018.2789

Kane, M. (2006). Content-related validity evidence in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 131-153). Mahwah, NJ: Lawrence Erlbaum Associates.

Keefer, K. E. (1971). Characteristics of students who make accurate and inaccurate self-predictions of college achievement. The Journal of Educational Research, 64(9), 401-404. https://doi.org/10.1080/00220671.1971.10884203

Kostons, D., van Gog, T., & Paas, F. (2012). Training self-assessment and task-selection skills: a cognitive approach to improving self-regulated learning. Learning and Instruction, 22(2), 121-132. https://doi.org/10.1016/j.learninstruc.2011.08.004

Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: how difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121-1134. https://doi.org/10.1037/0022-3514.77.6.1121

Longhurst, N., & Norton, L. S. (1997). Self-assessment in coursework essays. Studies in Educational Evaluation, 23(4), 319-330. https://doi.org/10.1016/s0191-491x(97)86213-x

McMillan, J. H., & Hearn, J. (2008). Student self-assessment: the key to stronger student motivation and higher achievement. Educational Horizons, 87, 40-49. Retrieved from https://pdfs.semanticscholar.org/bb37/9a8107153678f41b65ac68d78726ba065f26.pdf?_ga=2.39662499.1836272783.1575369325-792685113.1573028967

Moreland, R., Miller, J., & Laucka, F. (1981). Academic achievement and self-evaluation of academic performance. Journal of Educational Psychology, 73(3), 335-344. https://doi.org/10.1037/0022-0663.73.3.335

Morton, J. B., & Macbeth, W. A. A. G. (1977). Correlations between staff, peer, and self-assessments of fourth-year students in surgery. Medical Education, 11(3), 167-170. https://doi.org/10.1111/j.1365-2923.1977.tb00586.x

Mueller, R. H. (1970). Is self-grading the answer? The Journal of Higher Education, 41(3), 221-224. https://doi.org/10.2307/1977312

Murstein, B. I. (1965). The relationship of grade expectations and grades believed to be deserved to actual grades received. The Journal of Experimental Education, 33(4), 357-362. https://doi.org/10.1080/00220973.1965.11010894

Nicol, D., & McFarlane-Dick, D. (2006). Formative assessment and self-regulated learning: a model
and seven principles of good feedback practice. Studies in Higher Education, 31(2), 199-218. https://doi.org/10.1080/03075070600572090

Noonan, B., & Duncan, C. R. (2005). Peer and self-assessment in high schools. Practical Assessment, Research and Evaluation, 10(17), 1-8. Retrieved from https://pareonline.net/pdf/v10n17.pdf

Olina, Z., & Sullivan, H. J. (2004). Student self-evaluation, teacher evaluation, and learner performance. Educational Technology Research and Development, 52(3), 5-22. https://doi.org/10.1007/BF02504672

Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived marking criteria in peer and self-assessment. Assessment & Evaluation in Higher Education, 25(1), 23-38. https://doi.org/10.1080/02602930050025006

Pakaslahti, L., & Keltikangas-Järvinen, L. (2000). Comparison of peer, teacher and self-assessments on adolescent direct and indirect aggression. Educational Psychology, 20(2), 177-190. https://doi.org/10.1080/713663710

Panadero, E., & Alonso-Tapia, J. (2013). Self-assessment: Theoretical and practical connotations. When it happens, how is it acquired and what to do to develop it in our students. Electronic Journal of Research in Educational Psychology, 11(2), 551-576. https://doi.org/10.1016/j.stueduc.2013.04.001

Panadero, E., & Alonso-Tapia, J. (2014). How do students self-regulate? Review of Zimmerman's cyclical model of self-regulated learning. Anales de Psicología, 30(2), 450-462. https://doi.org/10.6018/analesps.30.2.167221

Panadero, E., & Romero, M. (2014). To rubric or not to rubric? The effects of self-assessment on self-regulation, performance and self-efficacy. Assessment in Education: Principles, Policy & Practice, 21(2), 133-148. https://doi.org/10.1080/0969594X.2013.877872

Panadero, E., Alonso-Tapia, J., & Huertas, J. A. (2012). Rubrics and self-assessment scripts effects on self-regulation, learning and self-efficacy in secondary education. Learning and Individual Differences, 22(6), 806-813. https://doi.org/10.1016/j.lindif.2012.04.007

Panadero, E., Brown, G. T., & Strijbos, J. W. (2016). The future of student self-assessment: A review of known unknowns and potential directions. Educational Psychology Review, 28(4), 803-830. https://doi.org/10.1007/s10648-015-9350-2

Panadero, E., Brown, G., & Courtney, M. (2014). Teachers' reasons for using self-assessment: A survey self-report of Spanish teachers. Assessment in Education: Principles, Policy & Practice, 21(4), 365-383. https://doi.org/10.1080/0969594x.2014.919247

Panadero, E., Jonsson, A., & Botella, J. (2017). Effects of self-assessment on self-regulated learning and self-efficacy: Four meta-analyses. Educational Research Review, 22, 74-98. https://doi.org/10.1016/j.edurev.2017.08.004

Ramdass, D., & Zimmerman, B. J. (2008). Effects of self-correction strategy training on middle school students' self-efficacy, self-evaluation, and mathematics division learning. Journal of Advanced Academics, 20(1), 18-41. https://doi.org/10.4219/jaa-2008-869

Real Decreto 1125/2003, de 5 de septiembre, por el que se establece el sistema europeo de créditos y el sistema de calificaciones en las titulaciones universitarias de carácter oficial y validez en todo el territorio nacional [Royal Decree 1125/2003, of 5 September, establishing the European credit system and the grading system for official university degrees valid throughout the national territory]. Boletín Oficial del Estado, 224, 18 de septiembre de 2003, 34,355-34,356. Retrieved from https://www.boe.es/eli/es/rd/2003/09/05/1125

Ross, J. A. (2006). The reliability, validity, and utility of self-assessment. Practical Assessment, Research & Evaluation, 11(10). https://doi.org/10.7275/9wph-vv65

Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1998). Skills training versus action research in-service: Impact on student attitudes to self-evaluation. Teaching and Teacher Education, 14(5), 463-477. https://doi.org/10.1016/s0742-051x(97)00054-1

Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15(1), 1-20. https://doi.org/10.1177/026553229801500101
Sitzmann, T., Ely, K., Brown, K. G., & Bauer, K. N.
(2010). Self-assessment of knowledge: A cognitive
learning or affective measure? Academy of
Management Learning & Education, 9(2), 169-191.
https://doi.org/10.5465/amle.9.2.zqr169
Stanton, H.E. (1978) Self-grading as an assessment
method. Improving College and University Teaching,
26(4), 236-238.
https://doi.org/10.1080/00193089.1978.992758
9
Sumner, F.C. (1932). Marks as estimated by students,
Education, 32, 429.
Sung, Y. T., Chang, K. E., Chiou, S. K., & Hou, H. T.
(2005). The design and application of a web-based
self-and peer-assessment system. Computers &
Education, 45(2), 187-202.
https://doi.org/10.1016/j.compedu.2004.07.002
Tan, K. H. K. (2012). Student self-assessment. Assessment, learning and empowerment. Singapore: Research Publishing.
Taras, M. (2010). Student self-assessment: Processes and consequences. Teaching in Higher Education, 15(2), 199–209. https://doi.org/10.1080/13562511003620027
Tejeiro, R. A., Gómez-Vallecillo, J. L., Romero, A. F., Pelegrina, M., Wallace, A., & Emberley, E. (2012). Summative self-assessment in higher education: Implications of its counting towards the final mark. Electronic Journal of Research in Educational Psychology, 10(2), 789–812. Retrieved from http://investigacion-psicopedagogica.org/revista/articulos/27/english/Art_27_707.pdf
Topping, K. J. (2003). Self and peer assessment in school and university: Reliability, validity and utility. In M. Segers, F. Dochy, & E. Cascallar (Eds.), Optimising new modes of assessment: In search of qualities and standards (Vol. 1, pp. 55–87). Netherlands: Springer.
Torres, K., & Tackett, S. (2016). Pre-service teachers’ beliefs about teaching ESOL students in mainstream classrooms. International Online Journal of Education and Teaching, 3(3), 186–200. Retrieved from http://iojet.org/index.php/IOJET/article/view/130/135
Ward, M., Gruppen, L., & Regehr, G. (2002). Measuring self-assessment: Current state of the art. Advances in Health Sciences Education, 7(1), 63–80. https://doi.org/10.1023/a:1014585522084
Ysseldyke, J. E., & Matson, D. (1988). Issues in the psychological evaluation of children. In V. Van Hasselt, P. S. Strain, & M. Hersen (Eds.), Handbook of developmental and physical disabilities. New York: Pergamon.
Citation:
León, S. P., Vallejo, A. P., & Nelson, J. B. (2021). Variability in the accuracy of self-assessments among low, moderate, and high performing students in university education. Practical Assessment, Research & Evaluation, 26(16). Available online: https://scholarworks.umass.edu/pare/vol26/iss1/16/
Corresponding Author
Samuel Parra León
University of Jaén, Spain
email: sparra [at] ujaen.es