
Predictivity of Standards-Based Report Card Models for Standardized Test Scores: A Taxonomic Mixed Methods Study



Research indicates that traditional letter and number grades are inaccurate and harmful to children, while standards-based grading is both more accurate and better for all stakeholders. Despite years of study, though, standards-based report cards (SBRCs) come in many forms, and the best number and arrangement of performance level descriptors (PLDs) remains undetermined. This sequential, exploratory, transformative mixed methods study, completed in five stages, was designed to quantitatively analyze the relationships between SBRC models, PLDs, and the Commonwealth of Virginia’s Standards of Learning (SOL) tests. After conducting a systematic literature review, the author created a new taxonomy to classify SBRCs, which was qualitatively applied both to models found in existing research and to ones reported by respondents in this study. Subsequent quantitative analysis found no practical difference between SBRC models regarding their efficacy as predictors of SOL-test outcomes. While this indicates that various SBRCs may be effectively similar in predicting the outcome of standardized tests, psychological research demonstrating the harmful aspects of grading practices may indicate a way to differentiate between these models.
A Dissertation
Presented to
The Faculty of the College of Graduate Studies
Lamar University
In Partial Fulfillment
of the Requirements for the Degree of
Doctor of Education in Educational Leadership
Keith David Reeves
December 2021
© 2021 by Keith David Reeves
No part of this work may be reproduced without permission except as indicated by the
“Fair Use” clause of the copyright law. Passages, images, or ideas taken from this work
must be properly credited in any written or published materials.
It looks like enough people heeded the warning signs on my coffee supplies not to
steal them so I could stay caffeinated enough to finish this thing. Doing my EdD and my
CSML simultaneously during a global pandemic wasn’t the best idea I ever had, but
hopefully there are enough good ones in the following pages to make up for it.
I dedicate this dissertation to Rachel Savage, with boundless love and gratitude.
I thank with love my mother Luana and my father David for supporting this
undertaking: It helped. A lot. My love to my brothers Jeffery and Justin, and their
families. Additional thanks to Roderick O’Savio, David Carter, Gloria Mora, Melissa
Robison, Diana Murphy McColgan, Lynda Jesukiewicz, Stacia Reeves Joyce, Megan
Link, Angelique Coulouris, Sandy Bolivar, Catherina Hurlburt, Lisa Varga, Melissa
Kuersteiner, Patricia Catchouny, Heather Heverly, Sandi Parker, Tanya Parrott, Jeremy
Kumin, Kathleen Huddle, Ruth Moceri Whitmer, and Jennifer Altamirano.
I remember here my long-time mentor Dawn L. Moulen, one of the finest
educators ever to practice the craft, who would have liked that I did this.
Thanks to my beloved Discovery Explorers, including Erin Russo and Judy
Seeber, for giving me the freedom and support to spread my wings these last six years,
implementing the theories and praxis that led to this work. Special cheers to the mule
crew, who helped this Disco Wizard immeasurably during the mad Summer of ‘20. You
have made me feel so valued, and I am grateful to be your colleague and friend.
A toast to the inimitable HTB Crew, the finest group of knucklehead neighbors
one could ask for. You welcomed me to The Little City with open arms and hearts, and I
have felt much at home for your camaraderie and simpatico.
I extend my gratitude to Bob Weaver, Hung Do, and Lucas Cherry. My gratitude
to Ron Fisher for his consultation, and to my outstanding editor Demian Pedone. Cheers
to my colleagues in Cohort 13, including the PK12D quant and qual crews, the New
Zealand Kaiako communications team, and my fellow Nixians.
Special thanks to my dissertation committee members Kelly Brown and Hunter
Keeney. I have appreciated your enthusiasm and insight, and believe this project is all the
better for your expertise.
Lastly, to kindred spirit and fellow thinker, Vince Nix, the very finest dissertation
chair any scholar could have: Your mentorship, organization, expertise, humor, and
genuine human decency are unparalleled. I’ll always be grateful to have been on your
team, and deeply proud that my name is associated with yours in our work together.
My fourth-grade teacher told me I was a no-good rotten little boy who would
never grow up to be anything in life. Ha-ha: I win.
Table of Contents
List of Figures
List of Tables
Glossary
Chapter
I Introduction to the Study
Background
Statement of Problem
Theoretical Framework
Statement of Purpose
Rationale for and Significance of the Study
Research Questions
Assumptions, Limitations, and Delimitations
Assumptions
Limitations
Delimitations
Organization of the Study
II Review of the Literature
Sources
Introduction
Purpose of Grading
History of Grading in the United States
Norm-Referenced vs Criterion-Referenced Grades
Traditional Grading
Letter Grades
Numerical scores
Effects of Traditional Grading
Standards-Based Practices
Standards-Based Report Cards (SBRCs)
Effects of Standards-Based Practices
Student Experiences
Teacher Experiences
Performance Level Descriptors
Quantity of Performance Level Descriptors
Justification for This Research
Summary
III Methodology
Research Design
Methodological Rigor
Research Questions
Research Stages
Stage 1: Systematic Review of the Literature
Stage 2: Development of Taxonomy
Stage 3: Application of Taxonomy
Stage 4: Quantitative Analyses
Avoiding the p-Value Pitfall
Multiple Nested Predictor Variables
Data Analysis
Avoiding the “Pass-Fail” Pitfall
Stage 5: Recommendations for Praxis
Philosophical Rationale
Population and Sample
Data-Collection Procedures
Data Security and Storage
Data Cleaning
Summary
IV Findings and Analysis
Stage 2: Reeves Taxonomy of SBRC Models
Development of the Taxonomy
Classification by T-Value
Subclassification by L-Values
Line of Demarcation
Performance Levels
Nomenclature
Stage 3: Application of Taxonomy
Observations in Practice
Observations in This Study
Stage 4: Quantitative Findings and Analysis
Description of Sample
Analysis 1: SOL-Test Scores, PLD Codes, and Grades
Analysis 2: SOL-Test Scores, SBRC Classes, and Grades
Summary
V Summary, Implications for Praxis, and Conclusion
Summary of the Study
Implications for Practice
Implications for Teaching and Learning
Implications for Assessment
Implications for School-Community Communications
Recommendations for Future Research
Concluding Remarks
References
Appendices
Appendix A: IRB Approval
Appendix B: CITI Certificate
Appendix C: Approval to Use Graphics Adaptations
Biographical Note
List of Figures
Figure 1. Theoretical Framework for This Study
Figure 2. Factors Impacting Availability of Data
Figure 3. Marzano's Power Law
Figure 4. Derksen's Critique of Marzano's Power Law
Figure 5. Sequential Exploratory Transformative Mixed Methods Design
Figure 6. Goss-Sampson: Tests for Predicting Outcomes Based on Variables
Figure 7. Two SBRC Models With T = 3
Figure 8. Taxonomy: Levels L
Figure 9. Taxonomy: Line of Demarcation
Figure 10. Taxonomy: PLD Labels (PL values)
Figure 11. Taxonomy: T-Values for Each SBRC Class in the Reeves Taxonomy
Figure 12. Taxonomy: T-Designated Models
Figure 13. Taxonomy: Maximum and Minimum PL Values for Each Class
Figure 14. Taxonomy: m-Designated Models
Figure 15. Taxonomy: e-Designated Models
Figure 16. Taxonomy: x-Designated Models
Figure 17. Taxonomy: y-Designated Model
Figure 18. Reeves Taxonomy of SBRC Models
Figure 19. SBRC Models Observed in Practice as Classified by the Reeves Taxonomy
Figure 20. SBRC Models in This Study as Classified by the Reeves Taxonomy
Figure 21. Frequency Distribution of SOL-Test Scores
Figure 22. Comparison: SOL-Test Scores for Mathematics by PLD Code and Grade
Figure 23. Comparison: SOL-Test Scores for Reading by PLD Code and Grade
Figure 24. Comparison: SOL-Test Scores for Science by PLD Code and Grade
Figure 25. Comparison: SOL-Test Scores for Mathematics by SBRC Class and Grade
Figure 26. Comparison: SOL-Test Scores for Reading by SBRC Class and Grade
Figure 27. Comparison: SOL-Test Scores for Science by SBRC Class and Grade
List of Tables
Table 1. Identical Initial and Final Mastery With Traditional Averages
Table 2. Identical Averages Arising From Different Mastery Journeys
Table 3. Variables
Table 4. Raw Data Headers
Table 5. Included SBRC Models With T Between 6 and 2
Table 6. M Variables and Formulaic Representations of Each SBRC Class
Table 7. Observed SBRC Classes
Table 8. Distribution: Mathematics Cases Between SBRC Classes and Grades
Table 9. Distribution: Reading Cases Between SBRC Classes and Grades
Table 10. Distribution: Science Cases Between SBRC Classes
Table 11. Mixed-Model Analysis to Compare Mathematics Scores by SBRC Class and Grade
Table 12. Post-Hoc Comparison: Marginal Mean Mathematics Scores between SBRC Classes
Table 13. Mixed-Model to Compare Reading Scores by SBRC Class and Grade
Table 14. Comparison: Marginal Mean Reading Scores between SBRC Classes
Table 15. Mixed-Model to Compare Science Scores by SBRC Class and Grade
Table 16. Comparison: Marginal Mean Science Scores between SBRC Classes
Glossary
PLD: performance level descriptor
PLD class: a classification describing a set of performance level descriptors that are
all collectively on one or the other side of the line of demarcation, above or below
PLD label: a label describing a performance level descriptor’s relative position
SBA: standards-based assessment
SBG: standards-based grading
SBRC: standards-based report card
SBRC model: the established system of performance level descriptors available to
teachers at a given school, including a specific quantity of discrete PLDs arranged
into an ordinal hierarchy
School division: School districts in the Commonwealth of Virginia, coterminous with
county-level political jurisdictions, are called “school divisions.”
SOL: Standards of Learning
Reeves 1
Chapter I
Introduction to the Study
For nearly thirty years, standards-based grading (SBG) has been firmly
established by research and practice as more effective and better for students than
traditional grading; however, those studies have mostly examined the comparative
efficacy of standards-based report card (SBRC) models and traditional ones, or how
schools can successfully transition to SBG. Few—if any—studies have compared SBRC
models to each other. This research was intended to fill that gap through a five-stage
mixed methods study designed to determine how effectively different models of SBRCs with
various permutations of performance level descriptors (PLDs) predicted student
performance on standardized end-of-course tests, specifically on the Standards of
Learning (SOL) subject tests used in Virginia.
This chapter discusses the background of standards-based practices (which are
more robustly detailed in Chapter II) before outlining the rationale behind this
dissertation, the problem it addresses, its theoretical framework, and its ultimate purpose.
After discussing the rationale for and significance of the study, Chapter I enumerates the
research questions and variables, identifying the pertinent assumptions, limitations, and
delimitations of the research project, before concluding with a summary describing the
organization of the rest of the dissertation.
Standards-based assessment (SBA) practices came to national attention in the
United States in the early 1990s, during the Clinton administration’s reauthorization of
the Elementary and Secondary Education Act. This interest was sustained into the Bush
administration, with much ink spilled discussing the merits and flaws of the No Child
Left Behind Act of 2001 (Hamilton et al., 2008). Within standards-based practices, the
word standard refers specifically to a content standard, a description of what a student
should know and be able to do at a given time in a given subject. However, during the
advent of the standards-based movement, it was not uncommon to confuse content
standards with performance standards, such as those used to set passing scores on tests
(Shepard et al., 2009). Hamilton et al. (2008) found that unclear and confusing
legislation, which was originally intended to promulgate standards-based reform and set
common benchmarks for all learners, instead fostered the expansion of standardized
testing. Rather than focusing on educational content and children’s needs, some
reformers emphasized scores and assessments, culminating in the test-fixated era of No
Child Left Behind. The predominance of high-stakes testing effectively sabotaged
standards-based assessment practices, prioritizing scores, numerical rankings, and
categories over real shifts toward authentically pro-child pedagogy (Hamilton et al.,
2008).
Since the first efforts to implement standards-based reform and assessment in
American schools, significant research has been conducted on the topic. Studies have
investigated everything from perceptions about SBA (Burkhardt, 2020; Fairman et al.,
2018; Frederickson, 2017; Guskey, 2007; Mild, 2018; Swan et al., 2014; Wheeler, 2017;
Whitesell, 2015; Winton, 2015; Youngman, 2017; Zoeckler, 2007) to how to maximize
the benefit to learners while implementing standards-based grading (Freytas, 2017;
Manley, 2019; Marzano, 1998; Peters & Buckmiller, 2015; Redmond, 2020; D. Reeves,
2004; Shakman et al., 2018; Ulrich, 2012; Vogel, 2012; Weaver, 2018; Wheeler, 2017).
One topic not thoroughly investigated, however, is how to best design standards-based
report cards (SBRCs).
Statement of Problem
While the efficacy of standards-based assessment practices in general is well
established in the literature (a topic detailed at length in Chapter II), the effectiveness of
various specific standards-based models, relative to each other, has not been robustly
studied. Without this research, designers of SBRCs are at best taking educated guesses,
and at worst making potentially harmful mistakes. In the absence of rigorous, evidence-
based research to determine what works and what does not, the design of standards-based
report cards is largely arbitrary. Concretely answering some key questions about
successful SBRC design will aid in crafting and applying SBG models, thereby
substantively and positively affecting students.
Theoretical Framework
Maslow (1943, 1954) posited that the psychological needs of humans are
progressive, ordered in a hierarchy wherein some needs are more foundational and
fundamental than others. A critical understanding found in his work is that the more an
individual’s essential needs are tended to, the easier it is to then fulfill—or have
fulfilled—needs above that basic level. Educational ethicists (Colnerud, 2006; Sherpa,
2018) generally agree that teachers must care for their students and meet their needs, and
thus teachers are ethically responsible for ensuring their students’ healthy development,
including their ability to progress up Maslow’s hierarchy. This progression is vital to a
student’s growth and learning, as expounded in constructivism, the concept in child
psychology which holds that development and learning are deeply intertwined and that
individuals create meaning from their experiences (Elliott et al., 2000). Hardegree (2012)
identified the theoretical tension within constructivism—between Vygotsky (1978), who
posited that learning was antecedent to development, and Piaget (1953), who believed the
reverse—as part of a larger general understanding that learning and development are
interrelated. Informed by Freire’s (1970) “radical pedagogy” (later rebranded “critical
pedagogy” by his intellectual scions Giroux, 1997; and McLaren, 1998), Kincheloe
(2001, 2005, 2008) expanded constructivist thinking significantly, specifically in the
realms of education and the social sciences. The fusion of constructivism and critical
pedagogy led Kincheloe to put forth a model joining the two in critical constructivism
(2001), which held that scientific inquiry and measurement cannot exist outside of a
social context, and are as much ideological constructions as they are empirical
Returning to the intersection of psychology and teaching practices, Bloom (1956,
1964, 1968, 1971) pioneered the conceptual model of mastery learning, which held that
the fulfillment of critical needs—the most foundational within Maslow’s hierarchy—is
key to facilitating learning and that providing “tailored correctives” (which we better
understand today as formative assessment and feedback) is essential to achieving skill
mastery. Bloom argued that praxis ought to satisfy the conditions necessary in
Vygotsky’s and Piaget’s versions of constructivist thought; better teaching and learning is
the natural product of better attention to the psychological needs of children. The
intersection of psychology and pedagogy underscores the importance of considering both
domains when studying education. Because teaching, learning, and assessment are so
intertwined, successful investigation must be contextualized with other relevant and
compelling psychological research.
And integrating psychology into pedagogy is crucial. Kohn (2011, 2015) and A.
Miller (1990) both found that labeling and ranking children are harmful practices. Vidal
(as cited in Diaz-Loza, 2015) coined the term ghettoization to refer to forcibly moving
a person out of their authentic state and into a predefined label or role. Alice Miller, a
Swiss psychologist who lost her father to Nazi pogroms, understood that just as children
are harmed by forced removal from their physical inhabited space, so too are they harmed
when an instructor, by ranking students through an arbitrary grading system, forces them
out of a psychologically authentic space into an imposed category. Because grading
symbols are reductive categorizations instead of meaningful descriptions of children’s
authentic selves, they are suspect; Kohn’s research demonstrates this, showing that
traditional grades harm both learning and a student’s perception of self, undermining
their Maslowian needs for safety and security. There is a substantial body of literature on
assessment that shows how eliminating the ranking and structure of traditional grading
benefits children. Scholars such as Guskey (1994, 1996, 2000, 2001, 2004, 2005, 2007,
2011, 2013a, 2013b, 2015, 2020) have consistently demonstrated that standards-based
grading practices are far better for children—in terms of psychoemotional health as well
as learning outcomes—than traditional grading, and that body of work is a central pillar
of the literature reviewed in Chapter II.
This dissertation hypothesizes that improved SBRC design will, given the
intersection of constructivist psychology and mastery-learning-informed praxis, benefit
learners; it will also limit harm to students, considering research that shows the academic
and emotional benefits of using less-hierarchical assessment methods. The relationships
between the concepts and theories enumerated in this framework are illustrated in Figure 1.
Figure 1
Theoretical Framework for This Study
Statement of Purpose
The purpose of this study was to examine the relationship between standards-
based report cards (SBRCs) with different quantities of performance level descriptors
(PLDs) and the scores received on standardized end-of-course tests by primary-grade
students in Virginia.
Rationale for and Significance of the Study
To date, quantitative analyses have not established how many performance level
descriptors are needed to accomplish the purpose of grading (namely, to assess mastery
and provide feedback), nor have they determined the appropriate form and wording of
those PLDs, especially vis-à-vis the psychoemotional imperative to rank and label
children as little as possible (Kohn, 2011, 2015; A. Miller, 1990). As we will explore in
Chapter II, many authors involved in researching and designing practices for standards-
based grading cited little evidence or (apparently) conducted no studies providing a
cogent rationale for the number and arrangement of PLDs they included in their designs
(Guskey, 1994, 1996, 2000, 2001, 2004, 2005, 2007, 2011, 2013a, 2013b, 2015, 2020;
Marzano, 1998, 2000, 2009; Marzano & Heflebower, 2011). Given the critical questions
and considerations inherent to rubric development, this deficit in the literature must be
addressed, as various models may have significantly different impacts on student learning
and health.
Establishing this relative efficacy is necessary: If one standards-based report card
is found to be notably more effective in practice at describing student outcomes than
another, then educators should employ that model. If no model is demonstrably better
than others, educators must then consider the research that shows how ranking and
labeling harms students; obviously, this would suggest that the model with the fewest
labels is preferable.
Research Questions
This research was developed to answer the following questions:
1. Is there a relationship between qualitative SBRC models and students’
quantitative standardized-test scores?
2. Are different SBRC models more or less predictive of student performance on
standardized tests?
3. Are certain SBRC models preferable?
The objectives of this study are:
• to qualitatively classify different SBRCs into models based on the number and nature of their performance level descriptors,
• to quantitatively describe differences in the predictivity of different SBRC-model classes, and
• to make recommendations for actual practices in schools.
Assumptions, Limitations, and Delimitations
This dissertation assumes that data provided by participating school districts are
accurate and complete. It also assumes that all the Standards of Learning (SOL) data
reported by schools and the state are free of errors.
Limitations are elements of a study's design or methodology that have an impact
on the interpretation of research findings, constraining the ability to generalize a study or
apply it to practice (Price & Murnan, 2004). While the investigation herein has few
significant limitations, there are two worth mentioning that arise from the actual
conditions in which the study was conducted. Due to submission and graduation
deadlines, I had to request data from school divisions during their busiest time of the
year: August and September. Moreover, the 2019–20 and 2020–21 school years were
hugely impacted by the COVID-19 pandemic, which further complicated data collection
(Onyema et al., 2020). The pandemic-related limitations on this research are illustrated in
Figure 2.
Figure 2
Factors Impacting Availability of Data
[Figure 2 image not reproduced: a grid tracking cohorts from the Class of ’26 through the Class of ’35 across Grades 3–5; surviving legend text reads “No Impact” and “SOL Data.”]
The other significant limitation involves demographic data. Tannenbaum et al.
(2016) found that including sex and gender when designing human-subject studies that
involve implementation—including this one—can improve discoveries and strengthen the
science. According to Andrews et al. (2019), researchers should be careful not to make
assumptions from normative positions, but should include elements that promote
equitable outcomes. Consequently, if there are significant differences between sexes or
races/ethnicities in the application of standards-based assessment practices, effective
research must investigate and discuss those differences, both to promote equity and to improve
scientific rigor. Unfortunately, while such data were requested from all schools as part of
the research-application process for this dissertation, some did not provide that
information. This unexpected limitation likely warrants a follow-up companion study,
as discussed in Chapter V.
While the findings of this project may be generalizable to other states’ primary
schools, they may not generalize to secondary schools, given the differences in pedagogy
between elementary and secondary levels, including an increased emphasis on testing often
found in higher grades. Of the fifteen school divisions the researcher solicited for data,
eight did not reply at all, and three flatly stated that they were not accepting or approving
any applications for external research during the pandemic. While follow-up studies may
be conducted with these school divisions at another time, the originally intended scope of
research was unavoidably restricted to the data provided, an unfortunate effect of
COVID-19. As a result of these limitations, follow-up studies are warranted at all levels.
While limitations are the result of factors beyond the researcher’s control,
delimitations are restrictions intentionally placed by study authors to establish the
boundaries of their work and ensure the study can be effectively completed (Theofanidis
& Fountouki, 2018). In this case, I chose to delimit the study to models of standards-
based report cards (SBRCs) for which sufficient data were obtained. As described in the
literature review in Chapter II, traditional grading is fraught with bias, inaccuracy, and
psychoemotional harm to children, and is incompatible with standards-based assessment.
Despite this, some schools using standards-based practices still shoehorn letter grades
into PLDs, typically in the form of end-of-quarter aggregate grades. They often do this as
a politically expedient way to placate stakeholders who do not entirely subscribe to the
principles of SBG. Even when grafted onto standards-based systems, these letter grades
are not performance level descriptors and cannot be equitably or appropriately compared
to such. One school division from which extensive data were acquired did not provide
their standards-based PLDs, but instead exclusively provided the letter-grade conversion
which was derived from those descriptors. This inappropriate conversion, in my
academic opinion, invalidated comparison. Consequently, the entire set of data from that
school division was excluded, both on scientific grounds (due to the confounding variable
of forcible letter-grade conversion; Cameron, 2020) and on ethical grounds: The
established literature indicates that letter grades are inappropriate for standards-based
assessments in practice.
Organization of the Study
Because understanding the organization and stages of a study is a prerequisite to
understanding and implementing its results, this introductory chapter now elucidates the
structure of this dissertation, which has the traditional front matter, followed by five
chapters, references, and appendices. Chapter I describes the issue under investigation,
puts it into an historical context, and establishes the research questions. The next chapter
is a systematic review of existing work on assessment, grading, standards-based
practices, and SBG reporting. Following that summation, Chapter III describes the
research methodology and enumerates the research stages, after which Chapter IV
outlines the data analyses and presents findings. The final chapter summarizes the
research and describes its implications, with specific recommendations for implementing
best practices in schools, as well as suggestions for future research.
Chapter II
Review of the Literature
This literature review examines both seminal works from the 20th century and
the significant body of evidence developed in the 21st century. Primary searches were
performed through the Mary and John Gray Library online search system at Lamar
University. Additional sources include academic libraries and databases such as ERIC,
JSTOR, and Taylor & Francis Online. Google Scholar and ResearchGate were also
useful, given the need to explore emergent research that can be otherwise difficult to
locate. In all cases, authors and publications were vetted and verified with additional
An extensive review of existing work provides an outline of the foundations and
history of American educational assessment and grading practices. It is worth noting that
because this dissertation examines both grading and testing—often conflated in the
layman’s understanding—it examines those constructs as a gestalt. Many authors have
described the elements necessary to understand standards-based grading and contrast it
with traditional systems, while measuring the effects of various forms of assessment.
Nevertheless, there appears to be a gap in the literature, with a dearth of studies
investigating how SBG models compare to one another in practice. To establish the
differences between SBG frameworks, the literature review delved deeply into the
different ways standards-based practices describe student performance. Finally, the
chapter concludes by highlighting the aforementioned gap in the literature and outlines
areas for further research in the vein of this study.
Purpose of Grading
Any discussion of grading practices must begin by defining grades themselves:
Simply put, grades are recognizable symbols with explicit meaning intended to provide
feedback and guidance to learners by describing their progress and achievement
(Airasian, 1994). They are used in two distinct, yet related, ways. They are given by
teachers to evaluate the quality of individual assignments, and they are assigned as
composite representations of cumulative student performance, such as grades found in
report cards (Brookhart et al., 2016; Guskey et al., 2011). Despite Americans’ familiarity
and personal experience with ubiquitous systems like letter grades and number scores
(Kalin, 2017), experts examining more than 100 years of research have repeatedly found
no evidence supporting these traditional grading practices (Brookhart et al., 2016; Cox,
2011; Guskey & Bailey, 2001; Marzano, 2000; Townsley & Buckmiller, 2016; Zoeckler,
2007).
Some forms of assessment are vital, though. Meaningful information about
student skill mastery is important at all age levels; evaluations of primary-age student
skills are important predictors of future academic performance (Balfanz et al., 2007).
However, significant problems arise when teachers reduce descriptions of performance to
single letters or numbers; even when individual assessments within that aggregate are
meaningful and research-based, overgeneralization or corruption can occur. Whether
numbers or letters, traditional grades dissatisfy students and their families because of the
lack of specific and meaningful feedback and information (Bushaw & Gallup, 2008;
Webber & Wilson, 2012); while those grades can theoretically give a broad idea of
academic performance, they do nothing to direct effort toward actual gaps in learning.
Traditional grading and scoring methods rank and compare children, but do not guide
learning accurately or consistently (Meaghan & Casas, 2004). Indeed, such ranking has
led to shallower thinking and thus negatively impacts learning (Kohn, 2015).
Wiliam et al. (2010) wrote that if teachers could predict what students would
learn from being taught, there would be no need for assessment in the first place;
instructors would merely present their syllabi and record the predestined results. Those
authors note that student outcomes are uncertain, however, so assessment may well be the
most central process in all of education. Grades exist to communicate a student's
current level of progress toward mastery (Allmain, 2013), and they are used by educators
to plan future instruction and by institutions to determine placement and promotion
(Airasian, 1994). Yale University (2013) concisely defined grading as a way to signal
strengths and weaknesses to students. Brookhart (2013) explained how teachers use
grades to make decisions, most often about instruction, and the team of Lund and
Shanklin (2011, as cited in James, 2018) regard grading as a method for holding students
accountable for accomplishing assigned lesson activities. While there is a difference
between formative (during learning) and summative (after learning) assessment, the latter
ideally serves to verify desired learning outcomes (Black & Wiliam, 2018; Schimmer et
al., 2018). Traditional assessment methods are generally summative, while research-
based practices tend to focus on the formative (Iamarino, 2014), and recent studies
suggest that teachers of all experience ranges prefer and value formative over summative
assessment (Coombs et al., 2018; DeLuca et al., 2018); both forms of assessment, though,
do have value when used appropriately (Lau, 2016).
Essentially, this signifies that grades should have meaning (Scriffiny, 2008), and
meaningful grading must provide an authentic and accurate measurement of achievement
(J. D. Allen, 2005; Kunnath, 2016; Manley, 2019). Ideally, grades describe the extent to
which a student achieved goals in learning, with as much specificity as possible (Muñoz
& Guskey, 2015). In practice, however, Sun and Cheng (2013) found that the meaning of
a grade was closely related to the concept of judgment, and often reflected values-based
determinations. Those judgments involved perceptions about the effort put forth in
completing a task and the quality of the outcome, and often incorporated non-academic
elements as well. Guskey and Bailey (2010) grouped educators’ understanding of the
purpose of grading into six categories: communicating achievement to families;
providing feedback for students for self-evaluation; identifying groups needing
educational interventions; providing students incentive to learn; evaluating instructional
efficacy; and justifying perceptions of students’ irresponsibility or lack of effort. While
teachers’ beliefs fell into these well-defined groups, the respondents were not able to
consistently rank the importance of these categories, and there was significant variation
between their responses (Guskey, 2013a).
Accurate description of the current state of learning is essential to orienting both
students and their parents to the present level of mastery, and ideally gives direction on
how best to grow in the immediate and long term (Townsley, 2018). The more clearly
the criteria for a given performance level or grade are communicated, the fairer and
more equitable the grading process becomes (Muñoz & Guskey, 2015). Close (2009) proposed that
“grading should be impartial and consistent” (p. 361) and that grading should be “based
on the student's competence in the academic context of the course” (2014, p. 189). He
further described normal distribution—the famous bell curve—as a flawed basis for
grading. There is no cogent rationale for using that curve, as it implies that mastery is
somehow limited in quantity. Moreover, the idea that students’ academic achievement
should resemble a normal curve assumes that intelligence and performance follow the
same curve, a flawed statistical assumption (Royal & Guskey, 2015), and one which
effectively defines half of a population as substandard. In Close’s (2009, 2014)
view, these flaws violate the principles of fairness, corrupting the value of grades for all
stakeholders, including students, parents, and teachers. Similarly, including variables
unrelated to content standards, such as “compliance with institutional values” or
“cheerfulness,” (2009, p. 394) violates teacher professional ethics and responsibilities.
The intrusion of non-academic factors lessens the utility of reported grades to the point of
meaninglessness (O’Connor & Wormeli, 2011).
Finkelstein (1913) explicitly stated that grading in the early 20th century involved
a teacher’s understanding of a “pupil’s native ability” and encouraged the use of grades to
reflect a “pupil’s real knowledge” (p. 10); this was explicit encouragement for teachers to
change grades based on their perceptions. Like Finkelstein, Guskey and Jung (2012)
emphasized the importance of teachers’ professional judgment; percentage systems are
often intrinsically incapable of appropriately reflecting student skill mastery due to
insurmountable mathematical and methodological failures, an empirical understanding
consistently reaffirmed by Wormeli (2018). Brookhart (1993) and Marzano (1998)
posited that teachers know their students well, and consequently have a thorough
understanding of any given student’s complex nature and progress. This familiarity was
found to help create highly accurate descriptions of attainment and achievement, a core
feature of standards-based grading.
Even when assessments are accurate, though, the question remains: What,
precisely, is being assessed? Some argue that practice (e.g., homework) and formative-
assessment tasks (e.g., in-process mastery checks) should not be included in summative
reporting (Canady et al., 2017; Fleenor et al., 2011; Scarlett, 2018). D. Reeves et al.
(2017) likened it to reviewing musicians based on rehearsals instead of performances, or
athletes being scored based on practice instead of actual competition. To learn
effectively, students must have multiple opportunities to practice skills (Canady et al.,
2017; Fleenor et al., 2011; Townsley, 2018); using those efforts to assign a final grade is
inappropriate (Wormeli, 2011). One must remember that grading children is not the same
as a high-stakes demonstration of surgical skill or piloting ability; just as a surgeon-to-be
does not have to perform a life-or-death operation on the first day of medical school, so
should a child be allowed to fully learn a concept before being tested on it.
A school year of arbitrary length, divided into roughly equal quarters, is an
unscientific way to schedule learning, especially when used to segregate students by age,
regardless of all other factors. Nevertheless, this is the predominant structure used in
America. While acknowledging the need for a framework around which schools can
structure their curricula, Wormeli (2011) declared it nonsensical to blindly adhere to such
schedules, as doing so runs contrary to well-understood research on child development
and is thus pedagogically inappropriate. Furthermore, by letting time constraints deny
children opportunities for relearning material and retaking assessments, these systems are
unnecessarily biased against younger students. Wormeli points out that even the highest-
stakes standardized tests in the world, such as the LSAT, MCAT, Praxis, and bar exams,
can all be retaken ad infinitum, without penalty and for full credit; there is obviously a
tacit understanding that penalizing students for retaking a test is not compatible with the
fact that mastery is achieved over time, often accompanied by stumbles along the way.
Brookhart (1991) defined the hodgepodge grade as an arbitrary distillation of a
broad constellation of data into an indistinct and imprecise—and consequently largely
meaningless—label, which is affixed to a student under the guise of evaluating
competency across a broad swath of learning (Guskey, 2000). Sun and Cheng (2013)
found that most teachers base hodgepodge grades on three types of evidence, defined by
Jung and Guskey (2011) as product, process, and progress. Product grades stem from
student artifacts, which are direct evidence of skill mastery. Process grades are
behavioral, including what Close (2014) classified as deportment grades; widely
recognized and often-used examples in America are grades given for behavior,
participation, attendance, tidiness, and timeliness (Close, 2014; Dueck, 2014; Guskey &
Link, 2017; Jung & Guskey, 2011; Randall & Engelhard, 2010). Progress grades
summarize student growth, sometimes explicitly through measurements made before or
after tests, and sometimes subjectively based on a teacher’s impression of growth. Jung
and Guskey (2007) found teachers often varied the criteria they used in grading,
generating inconsistency across the grading scale (Rosser, 2011). Individual teacher
beliefs—particularly ones pertaining to whether and how student effort should be
factored into grading—played a profound role in shaping their assessment practices (Cox,
2011). Guskey’s (2015) recent research confirms that the subjectivity of grading seen by
Starch and Elliott in 1913 continues today—even in such seemingly concrete subjects as
mathematics. Grade inflation is well documented, and the fact that it occurs more frequently
in affluent schools than in poor ones (Gershenson, 2018) casts further doubt upon the
accuracy of traditional grades (Rojstaczer & Healy, 2012).
In essence, hodgepodge grades utterly fail as assessments; they cannot be
interpreted accurately and do not effectively reflect mastery (Cross & Frary, 1999;
Erickson, 2011a; Freytas, 2017; Schimmer, 2014, 2016; Townsley, 2018). This is, in
large part, due to the inclusion of non-cognitive factors, the characteristics, habits, and
skills that are related to and supportive of learning but are not themselves defined by or
included in content standards (Kafka, 2016). Up to 20% of a given hodgepodge grade
may include imprecise metrics unrelated to student skill mastery and academic
performance, such as a teacher’s impression of how hard a student is trying (Brookhart,
2011; Guskey, 2000; Guskey & Link, 2019).
These extraneous non-academic factors varied significantly in interpretation and
application, even among experienced teachers—though there was some similarity within
grade-level teams (Guskey, 2000; Guskey & Link, 2019). Despite the demonstrably
subjective nature of the system, hodgepodge grading has become so normalized in
American schools that all stakeholders—students, parents, and teachers—have accepted
it as effective and qualitatively sound, despite the empirical evidence (Cross & Frary,
1999). While reporting on behavior can be appropriate when done as an accompanying
narrative or in an otherwise separate fashion, aggregating behaviors into an evaluation of
skill mastery is inappropriate and confounding, and cripples the effectiveness of any
given assessment (D. Reeves et al., 2017). For example, the common practice of
penalizing late work or incomplete assignments does not improve learning; furthermore,
rather than accurately reflecting the student’s skill mastery, it merely indicates a capacity
for organization or management (Dueck, 2014). These deportment-grading practices
adulterate assessment, punishing students who would otherwise meet the benchmark for
skill mastery (Erickson, 2011b). In addition, when teachers perceive a score as
inaccurately reflecting a student’s progress, they sometimes offer extra credit, a
pedagogically inappropriate and conceptually incoherent practice that directly contributes
to grade inflation (Pynes, 2014).
In summary, the purpose of giving grades is often derailed by actual practice.
When extra credit is assigned, or when teacher perceptions or behavioral elements are
taken into consideration when awarding a traditional grade, any hope of using those
assessments to accurately reflect mastery is dashed (Gordon & Fay, 2010; Guskey, 2011).
Extra credit is particularly pedagogically pernicious, being wholly inconsistent with
established research on effective grading practices (Pynes, 2014), and other subjective or
non-rigorous factors, like grading on a curve or incorporating deportment into reporting,
further diminish accuracy (Erickson, 2011a). Research even shows that a mainstay of
many classrooms—the due date—is antithetical to accurate assessment and may be
considered unethical (Close, 2014). Different students learn at different paces, and
Scarlett (2018) encourages basing evaluations upon current evidence, rather than upon an
average of performance over time.
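The arithmetic behind this recommendation can be illustrated with a brief sketch. The scores, the function names, and the two-score "recent evidence" window below are hypothetical assumptions chosen for illustration; they are not drawn from Scarlett (2018) or any other cited study.

```python
from statistics import mean


def averaged_grade(scores):
    """Traditional approach: the mean of every score recorded during the term."""
    return mean(scores)


def current_evidence_grade(scores, window=2):
    """Mean of only the most recent `window` scores (an assumed, illustrative rule)."""
    return mean(scores[-window:])


# Hypothetical student who struggles early and demonstrates mastery by the end of the term.
scores = [40, 55, 70, 90, 95]

print(averaged_grade(scores))          # 70
print(current_evidence_grade(scores))  # 92.5
```

Under these assumptions, the averaged grade reports a student who is barely passing, while the most recent evidence describes one who has largely mastered the material. The early scores, earned while the student was still learning, permanently depress the average.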
History of Grading in the United States
From the founding of the first American public primary school in 1635 until the
middle of the 19th century, teachers discussed student achievement and skill mastery
orally and did not generally codify or rank student performance (Cremin, 1970; Guskey,
2013b). The earliest form of formal assessment in American education was likely
Harvard University’s end-of-degree exam in 1646 (Smallwood, 1935). Both Harvard and
Yale identified valedictorians and salutatorians in the early 1700s, suggesting a ranking
method for student performance, but the earliest unequivocal record of an official grading
system Smallwood found was a four-point scale used at Yale in 1785, whose four
Latin-language descriptors (translated to English) ranged from “worst” to “best.”
By 1813, Yale records referenced a four-point numerical system, likely an evolution of
the Latin-language system.
Researchers have speculated that this four-point system was the basis for the 4.0
grade-point average (GPA) system used today in post-primary education (Brookhart,
2011; Durm, 1993; Manley, 2019). By 1837, at least some Harvard faculty were using a
scale grading from zero to 100, and experimentation with different systems was common
throughout the middle part of the 19th century (Smallwood, 1935). While this
experimentation mainly unfolded in tertiary education, few (if any) such systems
appeared in primary and secondary schools. Guskey (2013a) found that grading and
report cards were virtually nonexistent in American schools before 1850, when the
earliest evidence of letter grades began to appear (Gay, 2017).
In the latter part of the 19th century, as the industrial age dawned and
compulsory-education laws were passed (Guskey, 2013b, 2015), students began to be
grouped by age (Edwards & Richey, 1947). As the number of public high schools
exploded from 500 in 1870 to 10,000 by 1910 (Gutek, 1986), discrete subject areas began
to emerge in secondary education, along with grading on a percentile scale
(Kirschenbaum et al., 1971). Norm-referenced numerical scores began to be widely used
in American primary and secondary schools around the same time (O’Connor, 2007) but
were controversial even then. Several studies of teacher-assigned grades were undertaken
in the 1920s, and found significant variation in assessment methods, distribution curves,
and criteria (or lack thereof); other studies, though, claimed the contrary, further clouding
the issue (Brookhart et al., 2016).
Through his research in the early 20th century, Crooks (1933) found sufficient
evidence to condemn grades as distracting and superficial extrinsic motivators, and saw
them as barriers to critical thinking and deeper understanding that ultimately led to
factory-like assembly-line teaching. Over the next nine decades, studies consistently
reinforced the understanding among academics that grades are intrinsically flawed. Some
dissenters like Feldmesser (1971) made the case that grades were useful as general
indicators of performance, and posited that, even if flawed, grades should be valued by
students because the institution did. Nevertheless, subsequent substantive research
demonstrated the inefficacy, bias, and detrimental effects of traditional grading.
Despite the early misgivings of Crooks and others, the period after World War II
saw letter grades become increasingly common in both primary and secondary education
(O’Connor, 2007); and by the end of the 1940s, more than 80% of American primary and
secondary schools were using the A–F grading scale (Brookhart et al., 2016). In 1956,
Benjamin Bloom authored his now-ubiquitous taxonomy of learning objectives,
becoming a seminal figure in the outcome-based reform movement (as cited in
Frederickson, 2017). This was the first meaningful attempt to counter the growing
influence of letter grading in primary and secondary education. Frederickson makes a case
that by the middle of the 20th century, tension already existed between what we now
consider traditional grading and more progressive, research-based methodologies.
The second half of the 20th century saw rapid changes, at the state and national
levels. The federal Elementary and Secondary Education Act (ESEA) of 1965
significantly changed expectations for primary and secondary education, generating a
huge impact with the creation of Title I (Greer, 2018). During this time, Bloom (1964)
found that many non-academic factors influenced student learning, including
socioeconomic conditions. He averred that evaluations should be more than mere reports,
but should serve as useful aids for teaching and guides for learning. Bloom (1968, 1971)
further proposed mastery learning and mastery grading, which eschewed traditional
normal-curve distributions in favor of what was effectively pass-fail reporting, with
accompanying narrative feedback giving specific suggestions for improvement
(as cited in Guskey, 1996). Subsequent meta-analyses (Guskey & Pigott, 1988; Kulik et
al., 1990) found Bloom’s models had consistently positive effects on achievement,
retention, involvement, and student attitudes and values, especially for struggling
Between 1960 and 1989, paradigms at every level remained fluid even in
traditional quarters; Yale University used at least four different grading systems during
this period (2013). Despite experiments with non-traditional grading, about 67% of
public primary and secondary schools in America were still using letter grades in 1971
(National Education Association, 1971). Marzano (2000) authored a pioneering work at
the close of the 20th century, which found that the previous 100 years of traditional
grading—and the habits that model fostered in teachers—had no real evidence to support
their efficacy or utility. D. Reeves (2004) found significant variations in grading practices
between classrooms, which differed not only from teacher to teacher but even between
classes taught by the same instructor.
Against this backdrop, growing dissatisfaction with the educational system took
hold. A Nation at Risk, a report authored during the Reagan administration in 1983,
detailed new federal objectives for primary and secondary education (National
Commission on Excellence in Education, 1983). Amid widespread calls to action, what
eventually became known as standards-based reform evolved, espousing changes to both
pedagogical and assessment practices. These reform movements continued to gain
momentum throughout the following two decades (Frederickson, 2017).
By the dawn of the 21st century, though, the pendulum began to swing back.
Under the No Child Left Behind (NCLB) legislation of 2001, the number of content
standards was decreased in many jurisdictions, squeezed out by the increasing burdens of
NCLB-related standardized-testing expansion (Saatcioglu et al., 2021). The Common
Core State Standards (CCSS, more commonly “Common Core”) Initiative of 2010
reversed this trend, as jurisdictions that adopted CCSS once again expanded the use of
content standards. This reversal resulted in increased transparency and clearer
expectations, compared to former curricular frameworks (National Governors
Association Center for Best Practices & Council of Chief State School Officers, 2010, as
cited in Townsley & Buckmiller, 2016). That expansion, however, did not come without
a cost: Emphasis on testing as a proxy for genuine standards led to greatly expanded data
collection and aggregation. This expansion led to a boom in vendors providing such
services, including inBloom, a controversial non-profit funded by the Gates Foundation.
Founded in February 2013 with the mission of aggregating previously private student
data, inBloom collapsed barely a year later, in April 2014; revelations that its data was
not held securely led to massive backlash from 36 states and parental privacy-advocacy
organizations, dooming the ill-conceived project to failure (Bulger et al., 2017;
Schaffer, 2017).
That swing in public opinion led to further changes, including 2015’s Every
Student Succeeds Act (ESSA), which was an overhaul of NCLB as part of the
reauthorization of ESEA. ESSA was a significant course-correction (Egalite et al., 2017),
designed with the intent of correcting historic inequities created by 50 years spent
prioritizing standardized testing. This decreased emphasis on testing paved the way for
further expansion of standards-based practices. Chief contributors to the body of
literature that guides research in the field today—Susan Brookhart, Thomas Guskey, Lee
Ann Jung, Bob Marzano, Ken O’Connor, and Douglas Reeves among them—formed a
consensus that current standards-based practices were better aligned to pedagogical
reality and were thus better for students and their learning (Townsley & Buckmiller,
Still, despite a half-century of research supporting SBG, traditional grades persist
to this day, efforts to reform the system notwithstanding. Indeed, some still describe
standards-based grading as “very much in its infancy”; Brookhart (2013, as cited in
Schimmer et al., 2018) described inconsistencies in how grading systems were
implemented as part of the research challenge. Additionally, despite the central role of
grading in primary and secondary education, most educators working in the field at the
time of this study have had no formal training in research-based assessment and
reporting—perhaps unsurprising, as most undergraduate teacher-education curricula have
not addressed the subject (Guskey, 2015). The idiosyncratic nature of teachers’ personal
experiences and career skill-sets, as well as the differing rules and requirements of their
districts or schools, indicates that robust, current, and ongoing professional development
is required to successfully implement effective assessment practices (Livingston &
Hutchinson, 2017).
Norm-Referenced vs Criterion-Referenced Grades
In the broadest terms, there are two distinct domains within traditional-grading
methodology: norm-referenced and criterion-referenced (Banditwattanawong &
Masdisornchote, 2021b; Bond, 1996; Lok et al., 2015). Norm-referenced grading is
commonly referred to by its often-used (but not-always-accurate) nickname, grading on
the curve (Royal & Guskey, 2015). A norm-referenced system orders and ranks a
student’s achievement relative to their classmates, often with a numerical score
(Banditwattanawong & Masdisornchote, 2021a). These categorizations are usually
translated to letter-grade labels corresponding to predetermined arbitrary ranges, such as
designating any raw score below 60 or 65 as an F (Erickson, 2011b).
Norm-referenced grading is used to compare subjects to one another, to “classify”
students, usually using a normal-curve distribution to rank subjects from highest score to
lowest (Bond, 1996, p. 1). These systems actively seek to create a hierarchical, ordered
list of students, categorizing scores into ranges accompanied by summary descriptors
(Stiggins, 1994). Traditionally, each standard deviation within that normal distribution is
assigned a label—and in most American public schools during the 20th century, that label
was a letter grade. In this model, A represented the highest scores, B the second-highest,
C the average or median scores within one standard deviation, and so on, with F
indicating the lowest scores, beginning at a predetermined mathematical cutoff,
traditionally 60 or 65. (Note that this means most possible numerical scores (0–65) are
considered “failing.”) In any case, the calculations used to arrive at a summative final
score, as well as those used to convert that final score into a letter grade, were often quite
complicated (Banditwattanawong & Masdisornchote, 2021b).
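A minimal sketch can make the mechanics of norm-referenced grading concrete. It assumes, purely for illustration, that each letter corresponds to a band one standard deviation wide centered on the cohort mean; the band boundaries, cohort scores, and function name are hypothetical and do not reproduce any specific system described in the cited literature.

```python
from statistics import mean, pstdev


def norm_referenced_letter(score, cohort):
    """Assign a letter from a score's distance from the cohort mean, in standard deviations.

    The one-standard-deviation bands are an illustrative assumption.
    """
    z = (score - mean(cohort)) / pstdev(cohort)
    if z >= 1.5:
        return "A"
    if z >= 0.5:
        return "B"
    if z >= -0.5:
        return "C"
    if z >= -1.5:
        return "D"
    return "F"


# Two hypothetical classes; a student in each earns the same raw score of 85.
strong_cohort = [70, 80, 85, 90, 95, 100]
weak_cohort = [50, 55, 60, 65, 70, 85]

print(norm_referenced_letter(85, strong_cohort))  # C
print(norm_referenced_letter(85, weak_cohort))    # A
```

In this sketch, an identical raw score earns a C in one class and an A in the other. The letter therefore describes the student's position among classmates rather than what the student has learned.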
It should be clear at this point that norm-referenced assessment methods are
generally not useful to anyone, being imprecise, opaque, and unable to provide quality
feedback (Guskey, 2007; Kifer, 2001). Nevertheless, this kind of assessment has a long
history and is firmly established in American schools, especially at the secondary and
tertiary levels. Attempts to reform assessment practices have languished; as Guskey said,
the United States is “more bound by tradition than any other developed country in the
world” (as cited in Anderson, 2017, para. 9).
Students subject to these evaluations are by definition being constantly ranked
against each other. Hypercompetition among students is dangerous for children,
engendering the falsely dichotomous concept that for one to succeed, another must fail—
a worldview which can promote risk- and challenge-avoidance behaviors (Anderson,
2017). Adding insult to psychosocial injury, the instruments most used to calculate norm-
referenced grades are often highly unreliable when used to measure differences in student
achievement, thus invalidating the whole exercise before it begins (Wiliam, 2017).
In contrast, criterion-referenced grading assesses learning against a predetermined
scale that describes levels of mastery, often in the form of a rubric, and does not compare
a student’s achievement to others’ (Banditwattanawong & Masdisornchote, 2021a).
Teachers use a great variety of observations and gathered data to identify the unique
needs of each student and to determine where their achievement falls upon that
descriptive scale. While patterns may emerge that group students according to current
aptitude or needs, those groupings never translate to broad, reductive grades. By
removing the single overarching normal distribution, any student's skill mastery can be
assessed at any level at any time, eliminating the unfounded assumption that performance
in any given group of students must conform to the invidious bell curve (Melrose, 2017).
While arguments have been made that norm-referenced and criterion-referenced
methodologies are not mutually exclusive, Lok et al. (2015) argued that practical
experience suggests that the latter will ultimately become predominant by virtue of its
utility and simplicity.
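By way of contrast, a criterion-referenced determination can be sketched as a comparison of each student's evidence against a fixed descriptive scale. The four level labels, the thresholds, and the student data below are hypothetical and serve only to illustrate the idea; they are not taken from any rubric in the cited studies.

```python
# Hypothetical rubric: minimum number of criteria (out of 4) demonstrated for each level.
RUBRIC = [
    (4, "Advanced"),
    (3, "Proficient"),
    (2, "Developing"),
    (0, "Beginning"),
]


def criterion_referenced_level(criteria_demonstrated):
    """Return the highest level whose threshold the student's evidence satisfies."""
    for threshold, label in RUBRIC:
        if criteria_demonstrated >= threshold:
            return label
    raise ValueError("criteria_demonstrated must be non-negative")


# Each student is compared with the rubric, never with classmates.
classroom = {"Student A": 4, "Student B": 4, "Student C": 3, "Student D": 1}
levels = {name: criterion_referenced_level(n) for name, n in classroom.items()}

print(levels)
```

Because the scale is fixed in advance, any number of students can reach the highest level, and one student's result is unaffected by how classmates perform.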
Traditional Grading
Traditional grading has generally been norm-referenced, taking two primary
forms throughout American educational history. Early on, these grades were numerical
scores, first on a four-point scale, with an alternative 0–100 system emerging around
1900 (Brookhart et al., 2016). The alternative form was the letter grade, which appeared
around 1920, with each letter approximately representing one standard deviation on a
normal curve (Brookhart, 2017; Starch & Elliott, 1913). Even as these forms were taking
shape, though, researchers were already beginning to document the inherent inaccuracies
and unfairness of norm-based grading (Manley, 2019; Starch & Elliott, 1912, 1913). In
their seminal research, the team of Starch and Elliott found letter grades were assigned in
arbitrary and inconsistent fashion, even in subjects regarded as concrete and objective,
such as English grammar and geometry. In one of their studies, 147 English teachers
examined two identical papers, to which they gave marks ranging from 50 to 90; one
could ask for no better evidence of the subjectivity of teacher-assigned grades.
About 25% of a traditional grade is based on academic knowledge, such as
the elements evaluated by standardized testing; the other 75% comprises what Guskey
called “something else,” including deportment elements such as social-emotional
learning, perceived intelligence, participation, and attendance (Brookhart et al., 2016).
This high proportion of subjective criteria makes it inevitable that traditional grades will
vary significantly (Guskey & Jung, 2012), as teachers are inevitably inconsistent in the
idiosyncratic criteria that they use to concoct the grades they use to report performance
(Guskey & Link, 2017).
Not only do these grading schemes fail to provide effective assessment (Schinske
& Tanner, 2014), but Brookhart (2017) found that even when constructive guidance is
provided, the very presence of a letter grade tended to draw students’ focus, to the degree
that the meaningful feedback was often ignored. Butler and Nisan (1986) found that
grades did not promote learning, but rather encouraged avoidance behaviors to avert
“bad” grades, preying upon students’ fear of shame while promoting interpersonal
competition instead of intrapersonal achievement (Pulfrey et al., 2011). Shame is
extremely detrimental to children and their development, and can be experienced as
actual trauma (A. Miller, 1990). Wormeli (2018) summarized the problems with
traditional grading articulately: “It’s dangerous to emphasize something in our schools
that has no positive purpose for learning or living” (p. 155).
Letter Grades
Letter grades—the well-known continuum of A, B, C, D, and F—constitute a
gradient of rankings ranging from high performance to low or “failing” performance.
Letter grades are perceived by students as summative judgments (Cameron, 2020),
though, as discussed above, these symbols are subjective, inconsistently applied (Guskey,
2013b), and lack validity (Brookhart et al., 2016). Students do not perceive a C grade—
ostensibly representing the median of this scale—as average, but rather as substandard.
Primary-age students perceived an A grade as good, and considered any other grade to
be, by extension, bad (Greig, 2006). Despite this distorted focus on grades, students also
tend to regard rigorous systems of accountability with suspicion, which negatively
impacts learning (Whitesell, 2015). With all these drawbacks, it is unsurprising that
teachers regarded traditional grades as among the least-reliable indicators of student skill
mastery, according to one study (Guskey, 2007).
By all accounts, they are indeed unreliable. When schools in the American South
grade more harshly than schools in the North, and when private schools give many more
A and B grades than public schools with similar demographics (Rojstaczer & Healy,
2012), it stretches credibility to claim that these norm-referenced models reflect any sort
of objectivity. Rojstaczer and Healy have shown that since 1960, the share of A grades
given has increased by 28 percentage points, and as of 2012, they make up fully 43% of all grades
awarded. The alphabetic-symbol paradigm is so entrenched, though, that rather than
shifting to a different (and verifiably better) model, schools persist in attempting to graft
progressive, research-based methods onto the regressive traditional paradigm, succeeding
only in creating a bastardized chimera which still bears the failures of letter grading
(Cameron, 2020).
Some scholars do assert that letter grading can still be valid when the framework
is informed by rigorous research-based practices (Canfield et al., 2015). However, even
research that does not completely contraindicate the practice indicates a need for
intensive improvement in methodology, given the significant problems enumerated in
this literature review (J. D. Allen, 2005).
Numerical Scores
Erickson (2011a) considered numerical scores to be the most flawed practice in
assessment; non-educators tend to regard these ratings as objectively quantifying reality
despite the fact these numbers rarely represent what they purport to. Indeed, the common
thread through research about nearly every numerical-score-based system was the ever-
present, virtually unavoidable inequity and unreliability inherent to these models. The use
of averaging, a pillar of most such systems, was found to be both inaccurate and unethical
(O’Connor & Wormeli, 2011), as is assigning a grade of zero (J. D. Allen, 2005;
O’Connor & Wormeli, 2011). In fact, points-driven grading systems are found to have a
number of fatal flaws that are harmful to learning (Henry, 2018; Klapp, 2015), as these
methods are biased both racially and against girls (Cornwell et al., 2013).
The unending procession of such studies shows that these norm-referenced
assessments are almost congenitally incapable of serving the need they were ostensibly
designed to fulfill.
For example, numerical scores often take the form of percentiles. When derived
from assignments, they are often percentages of correct responses, and these item-by-
item grades are typically aggregated as averages to ascribe a “final grade” for the
marking period. Unsurprisingly, researchers repeatedly describe numerical scores as
inaccurate and insufficient to communicate a student’s level of mastery (Brookhart &
Nitko, 2008; O’Connor, 2007; D. Reeves, 2011). Few instruments used in numerical
models appear to examine individual skills as discrete entities; therefore, if a problem
posed requires ten skills to craft a “correct” response, a student may well be labeled as
deficient because they lack proficiency in one area, even if they have superior mastery of
the other nine. The New Teacher Project’s report The Opportunity Myth (2018) found
that although 71% of students did their assigned work and earned high grades, less than
20% of that work met the standard being assessed. This finding—which clearly indicates
a disconnect between performance on assignments and the final (theoretically aggregate)
grade—was reinforced by an independently conducted study from the Thomas B.
Fordham Institute (Gershenson, 2018). Scores of zero disproportionately negatively
impacted mathematical calculations, leading to unrecoverable low-scoring conditions
(O’Connor & Wormeli, 2011).
One primary school in Virginia, Discovery Elementary School (2021), has, since
its inception in 2015, used standards-based assessment. It demonstrates the
inconsistencies of traditional grading through the example shown in Table 1, in which
three different students, each with identical levels of mastery both at the beginning and
end of an assessment period, still received three entirely different summative averages.
These data give the lie to the conceit that traditional grading methods are effective and
representative of reality: These three hypothetical students ultimately achieved the same
level of mastery, but were assigned significantly different grades merely because of their
different learning trajectories. The disparity in those final grades can only be relevant if
one believes that a student who learns “faster” or at a “higher rate” should be rewarded
for that characteristic. A student’s pace of learning is a highly complex phenomenon,
affected by uncontrollable variables such as social factors and biology, and therefore
should not be considered when drafting a summative evaluation of skill mastery (Sturgis,
2017). Furthermore, as previously cited, converting percentage grades to letter ones
adulterates the assessment (Gay, 2017).
Table 1
Identical Initial and Final Mastery With Traditional Averages
Columns: Week 1, Week 2, Week 3, Week 4, Week 5, traditional average
Student A: traditional average 67 / D
Student B: traditional average 71 / C
Student C: traditional average 80 / B
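The effect described for Table 1 can be illustrated with a minimal Python sketch. The weekly scores below are hypothetical and are not the values from Table 1; they are chosen only so that every student starts and ends at the same level while learning at a different pace.

```python
from statistics import mean

# Hypothetical weekly scores (NOT the data from Table 1). Every student
# starts at 50 and finishes at 100; only the pace of learning differs.
trajectories = {
    "fast":   [50, 100, 100, 100, 100],
    "steady": [50, 60, 75, 90, 100],
    "late":   [50, 50, 50, 50, 100],
}

# A traditional flat average weights every week equally.
averages = {name: mean(scores) for name, scores in trajectories.items()}

# The final demonstrated level is the last score in each list.
final_levels = {name: scores[-1] for name, scores in trajectories.items()}

for name in trajectories:
    print(f"{name}: final level = {final_levels[name]}, flat average = {averages[name]}")
```

Under these assumed numbers the three final levels are identical, yet the flat averages differ, which is the pattern the table is described as showing.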
Numerical average grades are inauthentic and lack substance, have no intrinsic
real-world application, and are, in the final judgment, predicated upon false assumptions
about their own meaning. Number scores are not even effective predictors of student
performance on standardized tests (Greene, 2015). They should be avoided at all costs,
and researchers have advocated that educators should proactively seek to eliminate their
use (D. Reeves et al., 2017). Despite these findings, weighted-average grading (WAG)
has been the most-used traditional numerical system for calculating a grade in America
(Elsinger & Lewis, 2020).
Hooper and Cowell (2014) provided another example, shown in Table 2, of two
students’ work over ten assignments. Student A was given two zeroes and eight scores of
80, and Student B received ten scores of 64. In both cases, using a traditional flat average
yielded a failing grade of 64, despite one student consistently demonstrating what would
otherwise be considered a mastery level of 80. Numerous researchers have shown this
type of grading lacks fairness, validity, and accuracy (Brookhart & Nitko, 2008; Guskey
& Bailey, 2001; O’Connor, 2007; D. Reeves, 2011).
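Table 2
Identical Averages Arising from Different Mastery Journeys
Student A (two scores of 0, eight scores of 80): flat average 64 / F
Student B (ten scores of 64): flat average 64 / F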
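The arithmetic behind Hooper and Cowell's example can be checked with a short Python sketch. The score lists follow the description in the text; the order of the two zeroes is an assumption (it does not affect the average), and the mode is shown only as a point of contrast, not as a recommended grading method.

```python
from statistics import mode

# Scores as described in the text: two zeroes and eight 80s versus ten 64s.
student_a = [0, 0] + [80] * 8
student_b = [64] * 10

def flat_average(scores):
    """Traditional unweighted average of all recorded scores."""
    return sum(scores) / len(scores)

avg_a = flat_average(student_a)
avg_b = flat_average(student_b)

# The most frequent score hints at what each student typically demonstrated.
mode_a = mode(student_a)
mode_b = mode(student_b)

print(f"Student A: average = {avg_a}, most frequent score = {mode_a}")
print(f"Student B: average = {avg_b}, most frequent score = {mode_b}")
```

Both averages come out to 64, even though Student A's most frequent score is 80; the two zeroes alone account for the entire difference.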
Capturing scores as snapshots in time does not promote continuous learning, and
low scores may derail the learning process regardless of a student’s actual progress or
mastery (Iamarino, 2014).
Classical test theory holds that all assessments are inherently imprecise, in that an
observation is actually a combination of reality and observational error. This is
summarized as O_i = T_i + E_i, where O is the observed score, T is the true score, and E is
the error, all for any given individual examinee i (Gregory, 2011; Himelfarb, 2019). E
will inevitably contain variables that are not directly related to student skill mastery, be
they deportment factors or temporal variables, such as blood-sugar level or testing
irregularities caused by technology failure (Close, 2014; Marzano, 2009). E is
exacerbated by endogenous variability, those effects arising from the model of testing or
analysis itself. Over the years, a wide variety of mathematical gymnastics have been
proposed in an effort to lessen the impact of E and create fairer calculations of grades; a
prime example is Marzano’s (2000, 2006) often-cited power law, sometimes referred to
as the “method of mounting evidence” (2006, p. 100) which attempted to draw a line of
best fit and predict the current state of mastery by more heavily weighting a student’s
most-recent scores. Examining Marzano’s power law helps illustrate the innate problems
of attempting to mathematically generate fair norm-referenced grading. The calculation
itself was profoundly complicated (Allmain, 2013; Great Schools, 2016), as represented
by the formula in Figure 3 (Powley, 2019).
Figure 3
Marzano’s Power Law
Values: x = ordinal number of the score, s = score, N = number of scores in date order.
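As a rough illustration of how a power-law trend score behaves, the sketch below fits s = a * x^b to a sequence of scores by ordinary least squares on log-transformed data and evaluates the fitted curve at the most recent score. This is one common way to fit a power-law trend and is an assumption here; it is not claimed to reproduce the exact formula in Figure 3. The function names and the sample scores on a 4-point scale are hypothetical.

```python
import math

def fit_power_trend(scores):
    """Least-squares fit of s = a * x**b on log-transformed data.

    x is the 1-based ordinal position of each score. Every score must be
    greater than zero and at least two scores are required.
    """
    n = len(scores)
    log_x = [math.log(x) for x in range(1, n + 1)]
    log_s = [math.log(s) for s in scores]
    mean_x = sum(log_x) / n
    mean_s = sum(log_s) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxs = sum((lx - mean_x) * (ls - mean_s) for lx, ls in zip(log_x, log_s))
    b = sxs / sxx
    a = math.exp(mean_s - b * mean_x)
    return a, b

def trend_score(scores):
    """Value of the fitted curve at the most recent score's position."""
    a, b = fit_power_trend(scores)
    return a * len(scores) ** b

# Hypothetical scores: mostly strong work followed by one recent low score.
scores = [3, 3, 4, 4, 2]
flat_average = sum(scores) / len(scores)

print(f"flat average = {flat_average:.2f}")
print(f"trend score  = {trend_score(scores):.2f}")
```

With these assumed scores the trend estimate falls below the flat average, because the fitted curve is pulled down by the single recent low score. The sketch also shows two practical limits of such a fit: it cannot handle a score of zero, and it needs at least two scores.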
In the end, Marzano (2000) inadvertently undermined the point of standards-
based assessment by reintroducing the deleterious influences of numerical grading and
ranking; the ways these adulterating elements creep back into assessment will be detailed
in forthcoming sections. Derksen (2014) found that while Marzano’s power law was
intended to accurately describe a student’s skill mastery based on the totality of the
evidence, it suffered from the same flaws as any other mathematical model—namely, that
it could not account or adjust for certain types of qualitative evidence. In the illustration
provided by Derksen (Figure 4), a student with a recent low-performance score was
damaged mathematically, despite numerous higher scores. This handily illustrates the
perils of applying strictly mathematical methods of determining a “score” or “grade.”
Figure 4
Derksen’s Critique of Marzano’s Power Law
Values: = score, Mo = mode, x̄ = average, PL = Marzano’s Power Law.
Derksen (2014) further identified the power law’s inability to assess accurately
when a standard was tested too often, or not often enough. The algorithm requires a
certain amount of data, and when that is not provided, it fails to function as intended, thus
creating endogenous invalidity. Marzano (2009) posited that the power law rewarded
growth over time, but Derksen (2014) demonstrated that this promise was unfulfilled in
actual practice. The formula was ultimately so problematic that it was rewritten because
of these failures (Allmain, 2013). To move away from this endogenous error, Hooper and
Cowell (2014) proposed instead a nonparametric inferential calculus that made fewer
assumptions, thereby eliminating this overreliance upon a single data point, calling their
new model the history-adjusted true score (HAT). The authors admitted their
nonparametric methodology made assumptions about patterns of data, and that additional
robust qualitative and quantitative input would be necessary to comprehensively reflect a
student's performance.
HAT was an improvement over Marzano’s power law in that it made fewer
assumptions; however, it was still reductive in that it aggregated various data points into
a single reported classification, distilling them to a summary symbol rather than
considering each one distinctly (Hooper & Cowell, 2014). According to the authors, HAT
operated under the rationale that the two most recent scores served as the best evidence of
current proficiency. That premise made significant assumptions about the validity of
those scores as accurate reflections of student skill mastery; given the plethora of
variables that influence scores, those assumptions could easily render the entire method
Effects of Traditional Grading
Webber and Wilson (2012) showed that traditional grades were not correlated
with deeper student understanding of content, nor did they improve performance or
increase intellectual risk-taking. On the contrary, grades significantly lessened children’s
interest in and desire to learn the material at hand (Kohn, 2011, 2015). These results show
a negative correlation between students’ concern about their assessment scores and their
concern about how much they are learning. Kohn’s research cites extensive empirical
evidence that reinforces the understanding that prioritizing assessment undermines
academic achievement and excellence.
Still, parents commonly claim traditional grades are effective motivators, but
relying on extrinsic motivation—including symbols of reward and punishment—often
reduces intrinsic motivation and is destructive to students’ mental health (Crocker, 2002;
Kohn, 2011). Kohn found grading seriously harmful to children, citing the work of
authors from Crooks (1933) to Kirschenbaum et al. (1971), all of whom found such
systems had profound deleterious effects on student interest, rigor, and thinking. Pulfrey
et al. (2011) reported that even high-achieving students experience a fear of low grades,
an emotion which hampers learning. Not only is there evidence showing a negative
correlation between grades and student motivation, but Chamberlain et al. (2018) found a
reinforcing positive relationship between narrative, non-letter-grade feedback and higher
intrinsic motivation, compared with students who received traditional report cards.
Researchers from that study posited that grades thwart fulfillment of basic psychological
needs, referencing the foundational work done by Maslow (1943, 1954) and reiterating
the endogenous problems with traditional grading.
Students also tend to exhibit challenge avoidance as they seek to maximize their
chances of receiving “good” grades (Long, 2015). Challenge avoidance is the desire to
avoid being or feeling wrong, a psychological phenomenon closely related to the fear of
failure and easily engendered in students; once learned, changing these behaviors is quite
challenging for learners of all ages (Bartholomew et al., 2018; Henry et al., 2018).
Indeed, grades appear to intrinsically instill avoidance behaviors in learners; for instance,
task-avoidance is a common strategy adopted by students seeking to get the highest grade
for the least effort (Bartholomew et al., 2018; Pulfrey et al., 2011). Even when
accompanied by a formative comment, the grade itself appeared to be the source of
performance avoidance (Pulfrey et al., 2011). Especially when interest was absent, a
compounding negative effect on learning manifested: Students engaged in whatever
behaviors they could to avoid working any harder than necessary when they perceived
subject material to be irrelevant (Bartholomew et al., 2018; Tovani, 2014).
Assessment and measurement, though, are two different enterprises, and Kohn
(2011) argued that quantification of student achievement was perhaps the least valuable
aspect of teaching. While understanding what students know and can do is a useful
diagnostic enterprise, Kohn’s position was that the more value and emphasis placed on
ranking and categorizing, the more detrimental that process becomes (Kohn, 2011, 2015).
Assessment exists on a continuum between evaluative and descriptive poles, according to
Tunstall and Gipps (1996). Evaluative assessments are binary, positive-versus-negative
judgments made according to norms that are linked to rewards and punishments,
respectively. Descriptive assessments are often narrative, specific to the student’s
attainment of mastery, and provide explicit information on how to move toward future
achievement goals (Pulfrey et al., 2011). While narrative comments are among the most
important and effective elements in assessment (Brookhart, 2011), parents often want
summative aggregate grades, despite their many failures (Webber & Wilson, 2012). This,
in part, explains why American schools widely continue to use traditional grading
methods, despite the preponderance of evidence against them. Simon and Bellanca
(1976) found grades were primarily a mechanism used politically by adults, designed to
sort students into categories of compliant and noncompliant relative to scholiocentric
norms of behavior rather than serving as effective indicators of skill mastery.
While practicing skills and receiving specific constructive guidance from
professional educators is of prima facie learning value, giving grades during skill practice
has not been shown to be useful (Fisher et al., 2011). Instead, Fisher et al. found that
doing so caused students to be far more concerned with the grade than with the work and
learning in which they engaged. Formative feedback need not be translated into a
consequential judgment in the form of a grade; its utility lies in the ability to guide a
learner toward mastery (Black & Wiliam, 2018; Canady et al., 2017). Students who
received appropriate formative assessment outperformed their peers on a wide variety of
standardized achievement measurements (Hanover Research, 2014).
Providing ample explanation about evaluation methods and explicitly describing
the purpose and intentions behind the course material improved learning outcomes
(Hanover Research, 2014). Giving students access to clear, transparent rubrics is very
effective at conveying expectations for learning, and pupils often use them to support
self-assessment and improvement (Andrade & Brown, 2016; Jonsson, 2014; Mathena,
Unclear criteria used in grading, however, are quite detrimental to student
perceptions about assessment. Whether grading is perceived as fair or unfair is
significantly influenced by pedagogical elements; teachers who strove to help students
and gave them the most support were also the ones whose assessment practices were
most likely to be seen as fair and equitable (Gordon & Fay, 2010).
As noted in the earlier discussion of hodgepodge grading, many teachers who also
perceive the inherent unfairness of grading systems try to balance inequities with
corrective measures such as extra credit. However, doing so not only serves as a tacit
admission that grading is a broken reward-punishment system but also introduces biases
of which some teachers may not be aware. While extra-credit assignments are often
intended to help pupils improve their grades, the students most likely to actually complete
the extra work are ones who already have higher grades, and girls were more likely to
complete extra-credit work than boys (M. A. Harrison et al., 2011). Eliminating the
anxiety and disillusionment induced by grading translates to improved outcomes for
students. Psychological safety, which is the ability to be one’s true self without fear of
negative consequences (Kahn, 1990), improved the performance of stereotyped groups,
including non-Asian ethnic minorities and women (Edmondson, 1999; Edmondson &
Lei, 2014; Walton & Spencer, 2009).
Biases both implicit and explicit are endemic to many traditional forms of
grading, due in large part to the inclusion of subjective—and often punitive—deportment
measurements. These biases affect nearly every student, with the most prevalent being
against certain races/ethnicities, socioeconomic classes, and girls (Feldman, 2019). As a
result, Feldman notes, the most vulnerable children are often most harmed by these
practices, especially when traditional grading is used as a mechanism to exercise control,
by manipulating grades to reward or punish students in an effort to make them comply
with explicit rules or implied social norms. While the evidence strongly indicated
academics should abandon traditional grading, public perception and pressure to maintain
these familiar systems were pervasive among adult stakeholders, who often prevailed in
efforts to maintain the status quo (Shippy et al., 2013).
Standards-Based Practices
Standards are statements that describe “what and/or how well students are
expected to understand and perform” (O’Connor, 2018, p. 246). These statements should
be named logically and clearly, and not referred to by codes or worded using arcane
language that obfuscates their meaning (Beatty, 2013). When discussing an ideal
standard, O’Connor (2018) suggests that it be presented in a rubric that pairs shorthand
descriptors with a detailed explanation of what constitutes varying levels of performance.
Together, the descriptor and explanation constitute a performance level descriptor (PLD).
Providing PLDs for standards is clearer and more helpful for parents and learners than
using obscure symbols or complex texts (Youngman, 2017).
Standards-based assessment eliminates the focus on normative behavior, instead
concentrating on actual academic achievement (Hanover Research, 2014), providing
rubrics that not only inform the student as to the expectations for their study and work but
also guide the teacher as to the appropriate PLDs to use when assessing (Shippy et al.,
2013). While those unfamiliar with standards-based grading (SBG) do need extra
instruction to understand the model, given its significant differences from traditional
evaluative measures, Peters et al. (2016) found that providing such instruction was likely
to overcome student concerns about the system.
Once the initial adjustment has passed, stakeholders are often quite comfortable
with the new standards; Burkhardt (2020) found that more than half of a sampled parent
population reported that standards-based grading practices gave them a better
understanding of their students’ performance than norm-referenced grading did.
Successful SBG measures students’ proficiency against well-defined course objectives
(Tomlinson & McTighe, 2006) and presents assessments of mastery and progress in the
honest and meaningful ways that parents want (Lehman et al., 2018), while explaining
the types of learning and the levels of achievement in ways that educators want (Rosser,
2011). Unlike the traditional-grading mess illustrated in Table 1, SBG does not penalize
or reward a student for the amount of time they take to achieve mastery (Scriffiny, 2008);
it simply describes proficiency with the appropriate PLD. Muñoz and Guskey (2015)
found that frameworks which specifically identify skill mastery and standards are critical
for effective assessment design, and another study (Buckmiller et al., 2017) indicated that
SBG was better at facilitating student learning-engagement and was more defensible than
traditional grading.
The method used to report SBG is usually the unsurprisingly named standards-
based report card (SBRC). Jung and Guskey (2007) found that SBG allowed teachers to
report information on individual elements of learning (Rosser, 2011) consistently and
accurately; schools that replaced traditional report cards with SBRCs found parents
overwhelmingly preferred the standards-based model (Long, 2015).
K. D. Reeves’s (2015) principle of omnimodality held that any student can and
should be free to demonstrate any skill in any manner using any means that works.
Omnimodal assessment is a model wherein teachers develop rubrics to describe skill
mastery without prescribing the manner in which their pupils will demonstrate that
mastery. Children are afforded free rein to choose how they will exhibit their knowledge
and skills. Traditional grading is not well-suited to this principle, preferring more
delineated methods, but standards-based practices are ideally suited to omnimodal praxis.
Effective PLDs consider the many ways in which a child’s thinking may work (Benziger,
2004) and the many ways in which children might choose to show their knowledge and
As omnimodality addresses ways to demonstrate that mastery has been achieved,
reassessment addresses the processes teachers and students may use to re-evaluate skill
proficiencies. Research on retakes is consistent, showing that providing students ample
time and opportunity to develop and demonstrate mastery is an essential characteristic of
effective standards-based assessment systems (Elsinger & Lewis, 2020). Indeed,
reassessment is firmly established as one of the most critical elements of SBA, which
eschews single-point assessment activities and artifacts in favor of ongoing learning
(Beatty, 2013). While there is broad agreement that reassessments are conducive to
learning, there is also some debate on whether or not students should have to meet certain
conditions before being permitted a retake opportunity.
As noted earlier, Fisher et al. (2011) held that including any form of practice,
recitation, iteration, or failure in an evaluation of skill mastery is destructive to the
integrity of the system. Formative assessments should not be included in grade calculations;
rather, they should inform teacher practices in supporting student learning. Doing so ensures
that standards-based PLDs are based upon authentic demonstrations of student skill
mastery, rather than serving as records of pre-mastery practices.
Standards-Based Report Cards (SBRCs)
As the purpose of SBG is to aid in learning, it follows that standards-based report
cards should not merely communicate information, but must also translate to action
supporting the learner; this often involves parent partners, particularly at the primary
level (Bokas, 2018). It is critical to note that the purpose of a report card is to summarize
student performance efficiently rather than comprehensively (Wiggins, 1994). A full
understanding of the constellation of diagnostic and performance data from the variety of
sources teachers have—everything from formalized, targeted diagnostic instruments to
qualitative examination of student-generated artifacts—is the purview of the professional
educator, not the receiving parent. When discussing SBRCs, this dissertation recognizes
that there are many signifiers and data points absent from the report card that are
nonetheless used by teachers to robustly understand their students’ performance.
SBRCs usually take one of three major forms: fully aggregated, aggregated-
disaggregated, and disaggregated. Aggregate grades are often described as rolling up the
PLDs used to assess students’ mastery of individual components within a given subject
area into a single composite grade, often through application of a mathematical formula.
For example, a fully aggregated SBRC would take PLDs assigned to a student’s work in
spelling, reading comprehension, and creative writing, and then combine them into a
composite overall English grade. These types of SBRCs report only the rolled-up grades
for each subject area, with the actual SBRC looking superficially very much like a
traditional report card. Aggregated-disaggregated SBRCs report both the rolled-up grades
for each subject area and the individual PLDs for each reported standard; disaggregated
SBRCs report only the individual PLDs without giving an overall grade for the subject.
Regardless of type, SBRCs do not assess every standard in every grading period. Many
schools that practice SBG select a core set of reporting standards to include on the report
card, consisting of the most important or salient standards that communicate key
elements of what has been taught and learned during the reporting period (Guskey et al.,
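The three SBRC forms described above can be sketched as three views over the same set of PLDs. Everything in this Python example is hypothetical: the four-level scale and its labels, the sample standards, and especially the roll-up rule (a rounded mean), which stands in for whatever formula a given school might apply.

```python
from statistics import mean

# Hypothetical four-level PLD scale; real report cards define their own labels.
PLD_SCALE = {1: "Beginning", 2: "Developing", 3: "Proficient", 4: "Advanced"}

# Hypothetical PLDs for one student, keyed by subject and reported standard.
plds = {
    "English": {
        "Spelling": 4,
        "Reading comprehension": 3,
        "Creative writing": 2,
    },
}

def disaggregated(plds):
    """Report only the individual PLD for each standard."""
    return {
        subject: {standard: PLD_SCALE[level] for standard, level in standards.items()}
        for subject, standards in plds.items()
    }

def fully_aggregated(plds):
    """Report one composite grade per subject (assumed rule: rounded mean)."""
    return {
        subject: PLD_SCALE[round(mean(list(standards.values())))]
        for subject, standards in plds.items()
    }

def aggregated_disaggregated(plds):
    """Report both the composite grade and the individual PLDs."""
    composite = fully_aggregated(plds)
    detail = disaggregated(plds)
    return {
        subject: {"overall": composite[subject], "standards": detail[subject]}
        for subject in plds
    }

print(fully_aggregated(plds))
print(disaggregated(plds))
print(aggregated_disaggregated(plds))
```

In this assumed case the composite reads "Proficient", which hides both the "Advanced" spelling result and the "Developing" creative-writing result that the disaggregated view keeps visible.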
While traditional letter grades tend to communicate more about tasks completed
or not completed than about actual learning, SBRCs strive to convey progress toward
development and mastery of specific skills (Ketch, 2019). Traditional grading is so
ingrained in the minds of parents that any change away from it frequently engenders
pushback; oftentimes, schools bow to expediency and use letter grades on report cards
even when implementing SBG (Peters & Buckmiller, 2015). O'Connor (2017) stated that
some SBRCs appeared to use traditional letter-grade language, but that the method for
determining that grade was not traditional, and therefore could still be considered
standards-based; however, this runs counter to findings from Peters and Buckmiller
(2015) and Ketch (2019), among others.
Ultimately, the format itself poses the largest challenge for some parents, as it is
so different from traditional report cards, but successful implementations of SBA can
overcome these challenges, especially in light of Youngman’s (2017) finding that parents
believe that narrative comments provide the most insight.
Whatever their form, meaningful grades report discretely and precisely on the
elements learned during a given course of study (Guskey, 2020), and should be solely a measure
of a student’s knowledge and capability vis-à-vis the standards established by the school's
governing authority (Schimmer, 2014). While research into which SBRC model best
accomplishes this goal is sparse, Feldman (2018) found that aggregate grades were
untrustworthy and inaccurate, and therefore irresponsible, given that they did not
accurately and specifically communicate information to learners about their strengths and
weaknesses. They were erroneous because they were, in effect, hodgepodges; limiting
variables while designing SBRCs is a best practice (Brookhart, 1991; Close, 2014;
Guskey, 2000; Guskey & Brookhart, 2019; Guskey & Link, 2017). Guskey (2020)
provided further evidence that fully aggregated SBRCs are undesirable, stating that
effective report cards must provide performance levels separately for each learning
standard or competency. That author notes that aggregate hodgepodge grades are innately
reductive; an array of assessment elements is necessary to develop a comprehensive and
meaningful analysis of any given area of achievement for any given student. For
example, teachers assessing younger learners rely less upon written work, because those
students have limited writing ability; instead, instructors include formative assessments,
work exhibitions, and classroom observations in their amalgamated grades (Guskey,
2000; Guskey & Link, 2019).
Ideally, SBRCs should ensure their students have clear objectives, identifying and
assessing those pupils' progress toward mastering skills. Communicating that assessment
effectively to students and families is crucial, and can be hampered by complex or
unclear SBRCs—a situation frustrating for all stakeholders (Spencer, 2012).
Effects of Standards-Based Practices
Standards-based grading is demonstrably less discriminatory than traditional
grading (Feldman, 2018). SBG is criterion-referenced and therefore avoids the ranking
inherent to norm-referenced systems, being based on more-objective PLDs instead of
highly subjective traditional practices. Feldman’s paper shows that economically
disadvantaged students were far more likely to have their skill masteries misrepresented
with inaccurately low scores under a traditional model than they were with standards-
based grading practices. That study found that standards-based grading practices
decreased failure rates for non-white students by 37%; they also decreased the number of A
grades awarded to white students by 19% and by 3% for non-white students.
SBG is also better correlated with standardized testing measures. While fill-in-
the-blank and multiple-choice assessments represent outdated and outmoded pedagogy
(Ketch, 2019), they yet maintain a ubiquitous and sometimes mandatory presence in
education, and so must be addressed. Studies show that standards-based assessments
were effective predictors of performance on standardized tests (Hardegree, 2012), were
more correlational to test scores than traditional grading (Feldman, 2018), and aligned
closely with external measures of mastery such as the ACT and SAT (Buckmiller &
Peters, 2018). While a study by Greene (2015) found that SBA practices and traditional
grades were roughly equivalent in their ability to predict test scores, this study appeared
to be an outlier in the literature. In fact, the positive correlation of SBAs with external
standardized metrics is borne out in studies of middle-school students, in which PLDs are
demonstrated to be reliable predictors of student skill mastery (d’Erizans, 2020; Lehman
et al., 2018).
Student Experiences
As detailed previously, many parents have cited the extrinsic motivation of grades
as a compelling reason to retain traditional practices (Kohn, 2011). While Cameron
(2020) found no significant difference in student motivation between classes using SBG
or traditional practices, Frechette’s (2017) results showed that students learning in SBA
models had more motivation to reach mastery and were more aware of their strengths and
weaknesses; related research demonstrated that college students responded positively
to—and indeed preferred—standards-based assessment practices (Beatty, 2013). Those
respondents perceived those systems as fairer, and as a result, took more ownership of
their learning (Buckmiller et al., 2017).
More evidence comes from Odell’s (2018) research involving middle-school
students in Georgia, who experienced a moderately significant positive effect on
achievement when standards-based assessments were implemented. This
effect carries through to secondary school, where Knight and Cooper (2019)
demonstrated that standards-based assessment facilitated more-effective teaching, which
improved learning outcomes, met students’ needs better, and fostered their growth
mindset. Research is replete with studies that show SBA practices positively contribute to
a culture of growth and empowerment (Sheeley, 2017), with students reporting they felt
more centered in their learning. Although learners’ perceptions of SBG tend to be more
favorable than their teachers’ (Winton, 2015), instructors nevertheless agree with their
students that outcomes are better under that model (Frechette, 2017).
The perceptions of these teachers and students are borne out by the data.
Secondary students receiving standards-based assessment and grading scored higher on
end-of-course tests and had greater learning growth compared to peers receiving regular
letter grades (Poll, 2019). Another study found students earned A or B grades and passed
the end-of-course standardized test at approximately twice the previous rate after
standards-based assessment was adopted (Pollio & Hochbein, 2015). SBRCs identified
specific strengths and weaknesses for learners (Youngman, 2017), which Wheeler (2017)
found may provide better alerts for students struggling with mental illness or trauma by
revealing specific deficit patterns in particular standards or skill areas.
There are some negative and neutral SBG effects in the literature. A study by
Townsley and Varga (2018) indicated that SBA practices have no impact on high-school
students’ grade-point averages, and in a different study, conducted in Missouri at the
secondary-school level, both parent and student responses indicated a low regard for the
grading method, with respondents voicing their mistrust of a system that appeared overly
forgiving (Smith, 2018).
Teacher Experiences
Guskey (2001) found that a major obstacle to enacting standards-based reforms
was the difficulty educators have making the conceptual shift from norm-referenced to
criterion-referenced methods. Green et al. (2007) found the lack of consensus among
instructors as to best practices for evaluations was a significant problem. Since
assessment conditions and criteria should be as standardized as possible, especially for
grading purposes (Bonner & Chen, 2021), standards-based methods make the most sense.
Guskey (2001) posited that SBG was arguably the most effective method in terms of
teaching and learning; again, research stressed the system’s strengths in providing
valuable diagnostic information for educators and learners alike (Rosser, 2011). Manley
(2019) held that the two essential requirements for implementing standards-based grading
were establishing the standards themselves and developing an unambiguous rubric.
Carter (2016) identified eight critical leadership capacities that smoothed the transition
from traditional grading to SBA, most of which centered on communication, consensus-
building, and developing a clear and consistent vision of how to implement future
Some research indicates that teachers favored—and parents “overwhelmingly”
preferred—standards-based assessment (Swan et al., 2014, p. 289). SBRCs provided
authentic accountability for all stakeholders and had a direct, positive impact on teachers’
perceptions of their own pedagogy (Ketch, 2019). An added advantage of SBA processes,
when implemented correctly, is that they involve far less paperwork than amalgamated
hodgepodge grades, which require keeping track of and reporting a wide variety of
unstandardized elements (Guskey, 2020).
Guskey (2011) also saw the persistent—and incorrect—assumption that grades
should rank students and the accompanying belief that grading should follow a normal
distribution curve as the chief obstacles to grading reform. Issues with initial
implementation of standards-based assessment often involve infidelity to the program (as
seen in Smith’s (2018) Missouri study) or challenges to adoption and practice; the latter
were often overcome with experience, consistent application, and professional
development (Adrian, 2012; Knight & Cooper, 2019). Indeed, appropriate continuing
education in SBA, SBG, and SBRCs was important to the successful deployment of
standards-based practices (Michael et al., 2016); this requires significant focus on
purpose and communication (Peters & Buckmiller, 2015), which is necessary to disabuse
stakeholders of the attitudes and assumptions shackling them to an outdated grading
Even when the initial implementation of standards-based assessment is not ideal,
intervention programs can improve teachers’ skills, albeit with a significant investment of
time and energy. Major attitudinal shifts require additional work after initial intervention
(Adrian, 2012; Sugarman, 2015), and it has been shown that successfully implementing
and transitioning to standards-based assessment requires administrative support including
psychological safety (Edmondson & Lei, 2014), which teachers have been shown to
value (Ulrich, 2012).
Student skill mastery within a clearly defined objective standard—one which
expressly avoids comparison to other students—should drive the PLD used in a given
case (Brookhart, 2011; Marzano, 2000; Wiggins, 1994). Scarlett (2018) identified several
major arguments in favor of SBA frameworks, among which are the assertions that
traditional grading relies on unreliable percentages and numerical scores, which are
used inconsistently and serve more to perpetuate the experiences teachers had with
being graded as students than to comport with quality research-supported practices
(Guskey & Bailey, 2010; Guskey et al., 2011). PLDs are more effective ways of
describing student achievement, defined by O’Connor (2011) as “performance measured
against accepted published standards and learning outcomes” (p. 7). The potential
damage to students inherent in traditional assessment methodology, coupled with
findings establishing the efficacy of appropriately implemented SBA methodology, has
driven, and continues to drive, progressive schools to move toward standards-based
frameworks. There is no consistent research, however, on the right number of or optimum
phrasing for performance level descriptors in order to maximize benefit and minimize
Performance Level Descriptors
Rasch (1980) developed an important probabilistic psychometric model for
analyzing the validity of test data, which estimated the likelihood of an individual
correctly answering a test question as a function of the person’s level of ability and the
question’s difficulty. Rasch models are applied extensively in the social sciences, and a
recent revolution within Rasch-model analysis is the Wright map (Boone & Staver,
2020). A graphical representation of individuals and responses, Wright maps have been
used to help identify pass points (Bond et al., 2020) that delineate bands of performance,
and have also been applied to the development of performance level descriptors (Arikan
et al., 2019). While this method of statistical analysis can be applied to standardized-test
results, it is challenging to project how Rasch analysis might be used in omnimodal
assessment. Perhaps future work can be developed into a tool analogous to Rasch models
or Wright maps, which can then be used to refine the methods by which PLDs are
determined; effective use of analytic tools, demonstrating statistically significant positive
benefits, would greatly strengthen arguments for the switch to SBA.
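As a brief illustration of the probabilistic relationship described above, the dichotomous Rasch model is conventionally written as shown below. The symbols are assumptions introduced here for illustration and do not appear in the original text: \theta_n denotes the ability of person n, b_i the difficulty of item i, and X_{ni} the scored response (1 for a correct answer).

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
% Worked values:
%   \theta_n = b_i       \Rightarrow  P = 1/2
%   \theta_n - b_i = 1   \Rightarrow  P = e / (1 + e) \approx 0.73
```

When a person's ability equals an item's difficulty, the probability of a correct response is one half, and it rises toward 1 as ability exceeds difficulty. Wright maps place persons and items on this shared scale, which is what allows pass points to be read from them.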
At the time of this research, performance level descriptors are created by selecting
categories of student performance, and phrasing PLDs for each. This ideally creates a
robust definition criterion for describing mastery and achievement at each level for each
standard. While the terms used in each PLD will vary from state to state and from school
to school, there are likely to be some commonalities in the language of the descriptors
Guskey (2004) reported significant confusion arising from this inconsistent and
idiosyncratic deployment of performance level descriptors. Examining various standards-
based models, he identified 16 different labels used to demarcate levels of understanding,
16 phrases used to connote mastery, eight terms describing how frequently mastery was
displayed, six degrees of effectiveness, and four phrases for the quantity of evidence.
Guskey also found parents would inevitably mentally convert performance levels into
supposedly equivalent letter grades. If, for example, a scale of performance level
descriptors ranged from advanced to proficient, developing, and basic, parents tended to
think of these as A, B, C, and D grades, and Guskey found that this false analogy was
based on their own prior experiences with grading.
Some performance level descriptors incorporate an additional code such as
insufficient evidence to reflect an inability to accurately report a student’s skill mastery.
These statements are not evaluations of proficiency and are not true PLDs, but merely
note an inability to evaluate performance. As such, in the following paragraphs, systems
that include three performance level descriptors and one or more “insufficient-data
footnote” devices will be considered to have three levels, not four.
Quantity of Performance Level Descriptors
Four-PLD systems have been the most common in SBG. In 1785, Yale University used four
ordinal descriptors for end-of-course reporting: optimi, second optimi, inferiores, and
pejores (Schinske & Tanner, 2014), which effectively described two tiers of satisfactory
achievement, one of inferior achievement, and one level of non-achievement (this scale is
hereafter referred to as “Yale 1785”). Four-point PLD structures have been common ever
since, although no evidence-based research suggests that such systems are naturally
superior to other models. Marzano (1998), one of the most-cited standards-based
assessment authors, developed his own four-point scale: advanced, proficient, basic, and
novice; Arter and Busick (2001) used four slightly different descriptors: exceeds
standard, meets standard, does not meet standard but is making progress, and does not
meet standard. Guskey, among the most prolific scholars in the field, co-developed a
four-point system for the Commonwealth of Kentucky using numerical symbols, with 4
representing the highest level of mastery and 1 representing the lowest (Guskey, 2001;
Spencer, 2012). Additional examples include terms like consistently, usually, sometimes,
and seldom (Wormeli, 2018), and advanced, proficient, basic, and below basic
(O’Connor, 2018, p. 81).
Marzano (2000, 2009), who often referred to four-point systems, is cited in many
frameworks that do not use three-tier scales. For example, the Federal Way Public
Schools system in the State of Washington has employed standards-based assessment
since the early 2000s (Allmain, 2013), and used several iterations with four performance
level descriptors. Many other authors, including Reeves et al. (2017), also recommended
a four-point system; despite their frequent use, though, Wormeli (2018) cautioned against
a four-point scale precisely because it mirrors the A, B, C, D elements of the traditional
A–F scale and closely resembles the 4.0 GPA scale, concluding that parents will assume
equivalency even if none is intended. He instead recommended rubrics with anywhere
from three to six levels, which would forestall attempts to conflate traditional and
standards-based systems.
Three-tiered models are also common, eliminating a level of either mastery or
non-mastery. Colby (1999) used proficient, progressing, and not proficient—similar to
Ulrich’s (2012) proficient, developing, and beginning—while McNulty (as cited in
Elsinger & Lewis, 2020) referred to master, journeyman, and apprentice, and Guskey
(2000) used A, B, and C. Elsinger and Lewis (2020) noted that exceeds is often absent
from computational work, as mastery is often described as consistently arriving at the
correct answer, and exceeding such an expectation may be difficult, or impossible to
Five-level models are not common in SBG. The PARCC assessment (the
Partnership for Assessment of Readiness for College and Careers methodology developed
to measure Common Core outcomes) uses descriptors with two levels of meets standard
and three levels of does not meet standard (Ketch, 2019). Other examples included
sophisticated, mature, good, adequate, and naïve (McTighe & Wiggins, 2005, p. 72);
Wormeli (2018) suggested exemplary, competent, satisfactory, inadequate, and unable to
begin effectively.
Even less frequently, systems using only two descriptors are found in
SBG, and are commonly referred to as pass/fail models. These binary systems have been
used for quite some time; Brookhart et al. (2016) described the University of Michigan
using an explicit pass/fail PLD model as early as 1851 (hereafter referred to as “Michigan
1851”). Two-PLD systems continue to be developed today, such as Elsinger and Lewis’
(2020) experimental mastered and not yet mastered assessment. At least one American
primary school with significant standards-based assessment experience used only meets
standard and approaching mastery to evaluate students (Discovery Elementary School,
The literature also provides one example—apparently entirely theoretical—cited
by Wormeli (2018), which used six tiers; those ratings were exceptional, strong, capable,
developing, beginning, and emergent. Regardless of the specific language used, any given
PLD should be consistent both in application and appellation between students and
classes, given the same standard (Schimmer, 2014). Schimmer et al. (2018) described
differences between performance level descriptors as “slight divergence at the most
granular levels” (p. 126).
Justification for This Research
As amply described, there is an overwhelming body of research suggesting that
assessments ranking and categorizing students can be harmful and are often inaccurate.
Evidence instead indicates that standards-based assessments using well-crafted PLDs
offer a better way for students, parents, and teachers to give and receive feedback. Yet,
while a strong case can unquestionably be made for the adoption of SBG, there is no
clear consensus on how many PLDs to use, or how to phrase them. Some practitioners
(Schimmer et al., 2018) use existing frameworks for determining a quantity of
performance level descriptors to employ, such as Webb’s (1997) Depth of Knowledge
model (as cited in Francis, 2017) or Bloom’s (1956) taxonomy.
What little research has been done about the optimal composition of PLDs has
spotty applicability and validity (Guskey, 2000; Guskey & Link, 2019); even after
reviewing three decades of published research, I found no academic, empirical
justification for any specific number or form of descriptors. The dearth of research about
the most effective way to classify different levels of student performance led directly to
this study.
Further quantitative studies in standards-based assessment are needed (Townsley
& Varga, 2018), and Moloi and Kanjee (2018) specifically called for research into how to
best select performance level descriptors; they found that frameworks lacking a solid
empirical basis lead to confusion among teachers and students, who are unsure what
those descriptors actually signify.
Another gap in the literature is the absence of any substantive research
investigating the differences between models using distinct numbers of PLDs; Marzano
(1998) proposed a four-point scale over twenty years ago, but stated, “the actual number
of points is not so critical as the fact that all teachers are utilizing the same scale” (p. 40).
Many rubrics created over the past two decades were predicated upon Marzano’s (2000)
early four-point scale, but that author never found a concrete justification for using that
specific number, either at the time of writing or in subsequent work. Given that so many
schools followed his lead, empirical research about the efficacy of his four-point models
is warranted.
According to Craig (2011), “it would be beneficial to the educational environment
to [study] the difference in growth or performance” for students evaluated by different
grading systems that used similar descriptors “to see if educators are needlessly
expending effort on revising a system that needs only minimal adjustment” (p. 114).
Youngman (2017) specifically indicated the need to examine the way SBRCs are being
adopted: “An increasing number of school districts are modifying their report cards to
align with curriculum [sic], instruction, and assessments that have common standards, so
research should examine SBRCs as they continue to be developed” (p. 132), and
Cameron (2020) agreed that “more research is needed into the impact of different SBRC
forms” (p. 229). Broekkamp and van Hout-Wolters (2007) acknowledge that educational
research rarely provides unambiguous and actionable results supported with clear
empirical evidence that can then be directly applied as actual changes in practice.
In the broadest terms, the history of grading and assessment in the United States
can be divided into the preindustrial era (from 1635 to around 1850) and the industrial-
neoliberal era which followed. Before 1850, students often attended local, blended-age
schools which rarely issued report cards, and rather relied on narrative comment and
discussion to assess and communicate mastery.
After the industrial revolution, however, students began to be grouped into grade
levels determined by age, and assessment increasingly included formal testing and the
assignment of number or letter grades, with report cards ranking and categorizing pupils.
In the 1960s, as neoliberalism began to influence policy, assessments would often include
deportment grades, incorporating non-academic factors into purportedly academic
evaluations. Beginning with the 1965 Elementary and Secondary Education Act (ESEA),
and carrying through its subsequent reauthorizations in the 2001 No Child Left Behind
Act (NCLB) and 2015 Every Student Succeeds Act (ESSA), grading, ranking, and testing
became more pervasively insinuated into education and practice, partially at the behest of
private interests including potential employers and the commercial standardized-testing
complex (K. D. Reeves, 2015).
More than a century of published research provides empirical evidence
definitively demonstrating that traditional grading, be it in the form of numbers or letters,
is wholly inadequate and ethically suspect. As Aitken (2016) wrote, “[e]xtensive research
indicates that grades can be harmful to students' disposition toward learning, and can
damage social and emotional development.” O'Connor (2007) found—and proposed
solutions for—numerous failures of traditional grading, including distortions of student
achievement, reliance upon low-quality or poorly organized evidence, the use of
inappropriate or inaccurate grade calculations, and a general failure to support learning.
There is no evidentiary or pedagogical justification for the continued use of traditional
grading, and even legal scholars have raised concerns.
On the other hand, continuing evidence-based research and analyses which
consistently show significant benefits to using SBA models can be compelling for
teachers who have received appropriate professional development. The improved
feedback and elimination of shame and unhealthy competition demonstrably improve
outcomes—and not just for students, but for all stakeholders. Even so, more research is
needed to establish the best practices to use for standards-based report cards; logically,
one can conclude that the number of PLDs used in assessment will have a significant
impact on how that system is used and its effect on learners. Knowing that ranking and
ordering children’s performance is harmful would suggest that the less we categorize
students, the better (Klapp, 2019); consequently, researchers must investigate the efficacy
of various SBRC models to help ensure optimal outcomes for children.
With the literature review concluded, we now move to an examination of this
study itself, with the next chapter devoted to a description of the methodology used in
this research. Chapter IV presents the findings, which are then summarized in Chapter V,
wherein I examine the implications of the research and make specific recommendations
for practices in schools.
Chapter III
According to Creswell (2012), research is, in the broadest sense, the process of
posing a question, collecting data to answer it, and presenting the result. To explore how
something occurs, one employs a qualitative research methodology; causality,
generalizability, and effect size are studied through a quantitative methodology (Fetters et
al., 2013). Complex problems, though, are rarely solved through simple processes, and
consequently, a third modality of research has emerged called mixed methods design,
which harnesses the power of both qualitative and quantitative forms of research
(Creswell & Plano Clark, 2018; Teddlie & Tashakkori, 2009).
The purpose of this study was to examine the relationship between standards-
based report cards (SBRCs) with different quantities of performance level descriptors
(PLDs) and the scores received on standardized end-of-course tests by primary-grade
students in Virginia, with the ultimate objective of providing recommendations for the
design of efficacious predictive standards-based report cards that will ensure maximum
benefit for students. Chapter III describes the design, structure, and procedures employed
in this study.
Research Design
This is a sequential, exploratory, transformative mixed methods study (Creswell,
2003), using a Type VI mixed-model design (Tashakkori & Teddlie, 1998; Teddlie &
Tashakkori, 2009), which Creswell (2012) described as more complex than other
methodologies. Each element of this design was chosen specifically to address facets of
the research problem.
The study is sequential because it first performed a qualitative analysis by
evaluating alternative SBRCs, and then analyzed their effects
quantitatively. Teddlie and Tashakkori (2009) refer to the sequential approach used in
this study as conversion design, in which qualitative categories are determined and then
quantitized for analysis.
Exploratory designs use themes or categories developed qualitatively to drive the
quantitative analysis that advances and deepens research (Creswell & Plano Clark, 2018;
Onwuegbuzie and Slate et al., 2009; Teddlie & Tashakkori, 2009). Putnam and Mumby
(2013) wrote that if the phenomenon being researched “is relatively unexplored … then
an exploratory design should be considered” (p. 311), and this type of study is often
conducted in emerging areas where little (or no) prior research has been conducted, and
may produce findings which provide a map for further surveys (Brown, 2006; Singh,
2007). In this case, the method was appropriate because of the uncertainty about optimal
PLD configurations in standards-based report cards; furthermore, it was necessary to
sequentially classify SBRC models qualitatively before quantitative analysis could be
applied in a post hoc correlation investigation (Creswell, 2012). Because a quantitative
analysis followed qualitative analysis, this study is considered a Type VI mixed methods
design (Teddlie & Tashakkori, 2009). While the generalizability of exploratory studies is
circumscribed by their nature, such inquiries often provide pioneering insight into new or
understudied subjects—just as this dissertation aspires to do (Creswell & Plano Clark,
Finally, the study design is transformative and mixed methods because it converted
qualitative data into quantitative data, thus using a mixture of techniques to address the
problem at hand. Mixed methods design moves past the binary of qualitative versus
quantitative, and instead combines the power of both, which Berman (2017) indicated can
advance the scholarly conversation.
Onwuegbuzie and Slate et al. (2009) created a three-dimensional model to
visualize mixed method orientation. The first continuum, which describes variable-
oriented analysis, is arranged between particularistic and universalistic poles and
indicates the degree to which the study’s inferences can be generalized. The second
continuum plots case-oriented analyses, arranged from intrinsic to instrumental, and
elucidates the purpose for selecting cases. The third and final continuum charts
process/experience-oriented analyses, ranging from cross-sectional to longitudinal,
illustrating the temporal nature of the study.
This study, which developed a classification system designed to be applicable to
any SBRC model, was built upon a comprehensive review of extensive literature and
draws examples from actual practice. Therefore, its inferences are generalizable to a
significant degree, being highly universalistic as measured on the variable-oriented
analysis continuum. Regarding case-oriented analysis, Garson (2016) indicated that
studies were instrumental if they generalized from a specific case exemplary of a larger
set. Because the research herein examined the specific case of Virginia primary-school
SBRCs, with generalizability to other locations and grade levels, it could reasonably be
considered generally instrumental. Finally, in terms of process- and experience-oriented
analysis, this study is predominantly cross-sectional, as longitudinal analysis of the
correlations between PLDs and SOL-test scores was not feasible due to the disruption
caused by the pandemic of 2020–21. Many SOL exams were not administered at all in
Virginia during the spring of 2020, causing a significant gap in the data.
Methodological Rigor
R. L. Harrison et al. (2020) prescribed six areas of methodological rigor for high-
quality mixed methods designs (a category to which this dissertation aspires). Aims and
purpose prescribes articulating a research question that provides a reason for using mixed
methods. Chapter I includes such a question, consistent with Creswell and Creswell
(2017) in that it includes both quantitative and qualitative research elements, and a
Philosophical Rationale is provided in this chapter. Rigor in data collection entails
reporting specific procedures for both qualitative and quantitative sections, and rigor in
data analysis involves the use of both basic and sophisticated methods, and those criteria
are met via Chapters I through IV. Harrison et al. also encourage careful selection of a
study’s design type, which in this case meant purposefully selecting a mixed methods
model that allowed for meaningful data integration, linking both the qualitative and
quantitative data. Both the design type and point of data integration are illustrated in
Figure 5. Methodological rigor also requires an explicit description of the structure of this
research, including a diagram of this dissertation’s Tashakkori and Teddlie Type VI
design (a function also fulfilled by Figure 5). Finally, elements of writing prescribes
including references to mixed methods literature and the discrete identification of the
design in the title, abstract, or paper itself, which this dissertation patently does. Having
met all the criteria, this can therefore be considered a highly rigorous mixed methods
design (R. L. Harrison et al., 2020).
Research Questions
The following questions guided the research:
1. Is there a relationship between qualitative SBRC models and students’
quantitative standardized test scores?
2. Are different SBRC models more or less predictive of student performance on
standardized tests?
3. Are certain SBRC models preferable?
The objectives of this study are
• to qualitatively classify different SBRCs into models based on the number and
nature of their performance level descriptors,
• to quantitatively describe differences in the predictivity of different SBRC-
model classes, and
• to make recommendations for actual practices in schools.
Research Stages
As illustrated in Figure 5, there were five distinct stages to this study. Stage 1 was
the systematic literature review summarized in Chapter II, while Stages 2–4 were the
specific phases of research, which are presented in Chapter IV. Stage 5 synthesized the
study’s findings into the recommendations and conclusions described in Chapter V.
Figure 5
Sequential Exploratory Transformative Mixed Methods Design
Stage 1: Systematic Review of the Literature
This qualitative exploratory stage involved systematically reviewing the existing
literature about standards-based assessment throughout the history of American
education. This stage was conducted before empirical research, as is ideal for this type of
design (Xiao & Watson, 2017). While primarily focused on recent work, the review also
included seminal publications that illustrate the development of ideas central to
standards-based assessment practices. Following Xiao and Watson’s (2017) framework
for effectively and systematically conducting a literature review, the first stage organized
material logically not only to provide a history of grading systems but also to contrast
traditional and standards-based practices, and included an examination of various models
of standards-based report cards (SBRCs) used in primary, secondary, and tertiary
institutions. This last was crucial, enabling the identification of many different
permutations of SBRCs—a critical element, as Stage 2’s taxonomy was predicated upon
Stage 2: Development of Taxonomy
Stage 2 was the creation of a taxonomy of SBRC models from the information
gathered during the literature review. This interpretivist, descriptive framework classified
the highly disparate varieties of SBRCs in current and former use in the U.S., using
membership categorization analysis (MCA). MCA is a method rooted in the
categorization work of Harvey Sacks (1972, 1992), used for explicating practically
oriented, commonsense empirical material (Day, 2011). Conforming to the MCA method
described by Schegloff (2007), Stage 2 classified types of SBRCs using specific criteria
called membership categorization devices, or MCDs (Given, 2008).
Employing MCA, the “Reeves Taxonomy of Standards-Based Report Card
Models” was created. The taxonomy developed for this research conformed to design
standards established by Collier et al. (2012), including articulating the overarching
concept, establishing clear row and column variables, and creating a matrix to organize
the classifications. Because the determining criteria are used to describe existing SBRCs
based on their attributes, a conceptual typology exists. Note that Borgès Da Silva (2013)
established that taxonomies and typologies may overlap, which is further detailed in
Chapter IV.
“Quantitizing coded qualitative data is usually achieved by considering the codes
as variables,” and the study thus required the classification of SBRC models (Fofana et
al., 2020, p. 2). The taxonomy normalized the diverse vocabulary used in performance
level descriptors (PLDs) to yield usable variables for analysis (Fofana et al., 2020; van
Grootel et al., 2020). As an analogy, consider that while English has words for many
different shades of red—crimson, scarlet, cardinal, cerise, fire-engine—and many of
those terms may not have direct analogues in other tongues, the precise color can still be
communicated to a foreign-language speaker by using well-defined and internationally
understood frameworks such as RGB values or Pantone codes (Tammet, 2013). By
establishing rigorous, clear criteria, the taxonomy developed for this dissertation assigned
a qualitative, ordinal set of classified levels for PLDs, facilitating analysis in Stage 4
(Collier et al., 2012).
Stage 3: Application of Taxonomy
Collier et al. (2012) found it methodologically sound to apply classification
systems in the social sciences, especially when done with great rigor and care. The
literature review in Stage 1 included identifying as many existing SBRC variations as
possible, so as to gather ample models to sort. Stage 3 then applied the Reeves Taxonomy
to existing cases as an extension of the analysis in Stage 2, transformatively quantitizing
and ordering the data through typological analysis (Given, 2008). The resulting classes
served as variables attached to each set of data.
To effectively and thoroughly investigate how well each model served as a
predictor, the levels within each SBRC needed to be standardized. Sandelowski et al.
(2009) described quantitizing as “numerical translation, transformation, or conversion of
qualitative data” (p. 208) and said that this process can “answer research questions or test
hypotheses” that require a numerical format for statistical analysis (p. 211). This was
precisely the reason for quantitizing the qualitative data in this study: to transform PLDs
that use different words, but describe the same thing, into new thematic groupings that
allow them to be analyzed (Martin, 2004). That which was originally described
qualitatively can now be counted and measured in a meaningful way through
categorization, and as Quine (1969) said, “nothing is more basic to thought and language
than our sense of similarity; our sorting of things into kinds” (p. 116).
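As a minimal illustration of the quantitizing step described above, the following Python sketch maps differently worded descriptors onto shared codes. The wording-to-label assignments are hypothetical examples chosen for this sketch, not the study's actual coding table; only the standardized labels (PL-2 through PL+2) and their numeric codes (-2 through +2) come from this chapter.

```python
# Hypothetical lookup: these wording-to-label assignments are illustrative
# assumptions, not the coding table used in the study.
PLD_LABELS = {
    "exceeds standard": "PL+2",
    "meets standard": "PL+1",
    "proficient": "PL+1",
    "satisfactory": "PL+1",
    "sufficient": "PL+1",
    "approaching standard": "PL-1",
}

# Numeric codes for the standardized labels (PL-2 ... PL+2 coded -2 ... +2).
PLD_CODES = {"PL-2": -2, "PL-1": -1, "PL+1": 1, "PL+2": 2}


def quantitize(descriptor: str) -> int:
    """Return the numeric code for a report-card descriptor."""
    label = PLD_LABELS[descriptor.strip().lower()]
    return PLD_CODES[label]


# Different words that describe the same level receive the same code.
print(quantitize("Proficient"), quantitize("Satisfactory"))
```

Keeping the wording lookup separate from the numeric codes means a new report card can be added by extending only the first dictionary.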
Stage 4: Quantitative Analyses
To prepare for analyses, Stage 4 employed the new Reeves Taxonomy to make
inferences. At this point, several quantitative variables were available to be analyzed
(Table 3).
Table 3
Variable | DV or IV | Possible Values
SOL-Test Score | DV |
SBRC Class | IV | Classes from Reeves Taxonomy
Grade Level | IV | 3, 4, 5
Subject Area | IV | Mathematics, reading, science
PLD Code | IV | PLD labels (e.g., PL+1), and corresponding codes (e.g., +1)
Stage 4 analyzed relationships among the variables, using inferential statistics to
assess the impact of those ties. These tests are described later under the subheading “Data
Analysis.” While some variables were innately quantitative, such as “grade level” and
“standardized-test score,” PLDs were not quantitizable until Stage 3 was complete.
Qualitative classes defined in Stage 2 and applied in Stage 3 became variables in
Stage 4. The purpose of these classifications was to provide a nomenclature so that
standards-based report cards (SBRCs) that conform qualitatively to a common set of
factors could be compared, forming a framework for later correlative analysis.
As described earlier, a main purpose for this study was to examine what (if any)
differences exist between various SBRC models, and test their validity as predictors of
SOL-test outcomes. The dependent variable (DV) was the score attained on the Virginia
Standards of Learning test. Independent variables (IVs) for each case included the SBRC
class from the Reeves Taxonomy; grade level (Grade 3, 4, or 5); subject area, including
mathematics, reading, and science; and the performance level descriptor (PLD) assigned
by the teacher for the subject area, all coded for consistency throughout. A review of
recent methodological literature led to a consideration of the following critical
Avoiding the p-Value Pitfall. Null-hypothesis significance tests (NHSTs) in the
social sciences are often outmoded and should be considered of dubious utility by serious
statistical analysts (Wasserstein et al., 2019). Century-old classical statistics methodology
is inappropriate for contemporary educational research, with the American Statistical
Association (ASA) stating that “scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a specific threshold” (Wasserstein
& Lazar, 2016, p. 129). Those authors go on to hold that as p does not communicate the
size of a statistical effect or the importance of a given result, it therefore “does not
provide a good measure of evidence” (p. 130) in terms of a given model or hypothesis.
The new ASA guidelines, published shortly before the pandemic began (Wasserstein et
al., 2019), make it quite clear that p is no longer considered the gold standard of social-
sciences research, given its reductive nature. Wasserstein et al. instruct us not to base our
conclusions “solely on whether an association or effect was found to be ‘statistically
significant,’” such as by passing an arbitrary threshold (p. 1). Through Wasserstein et al.,
the ASA cautions us against false conclusions; in any given test, the mere existence of
statistical significance does not guarantee an association or effect, nor does the reverse
hold true. Indeed, the ASA states categorically: “Don’t say ‘statistically significant’ …
Regardless of whether it was ever useful, a declaration of statistical significance has
today become meaningless” (p. 1). In light of this guidance, over 800 scientists signed a
commentary concluding that “it’s time for statistical significance to go” (Amrhein et al.,
2019, p. 307).
A “fundamental paradigm shift” (Carlin, 2016) is needed, including the
elimination of null-hypothesis significance tests (NHSTs) altogether from academic
dissertations and journals: “There is now wide agreement among many statisticians who
have studied the issue that … it is illogical and inappropriate to dichotomize the p-scale
and describe results as ‘significant’ and ‘nonsignificant.’ Authors are strongly
discouraged from continuing this [practice]” (Hurlbert et al., 2019).
There are specific injunctions against the use of NHST and p for this dissertation
in particular, as “it is pointless to estimate the p-value for non-random samples [or] when
dealing with population data” (Filho et al., 2013, p. 31); both conditions apply to the
study of SBRC models employed by teachers of Virginian students in grades 3–5. Szűcs
and Ioannidis (2017) declared NHST as “unsuitable as the cornerstone of scientific
inquiry in most fields” (p. 13), and suggested that instead of p-values and all-or-nothing
hypothesis testing, contemporary high-quality dissertations “should focus on reporting
data in original units of measurement as well as providing derived effect sizes” (p. 15).
When p-values are included, they should be reported regardless of size, along with a
measure of the effect size and its corresponding confidence interval (Hayat et al., 2019).
Specific to the field of education, Fraenkel and Wallen (2018) note that “focusing
attention on a hypothesis may prevent researchers from noticing other phenomena that
might be important to study [which] seems to be a good argument for all research not
being directed toward hypothesis testing” (p. 46). Those authors continue to caution
“[s]tating a hypothesis may lead to a bias … on the part of the researcher [who may be]
tempted to arrange the procedures or manipulate the data in such a way as to bring about
a desired outcome” (p. 46). To achieve accurate, valid statistical analysis, and observe
ethically mandatory equity considerations (Andrews et al., 2019), the quantitative stage
of this dissertation will employ best practices, as currently understood, in statistical
Multiple Nested Predictor Variables. Because the matrix of data for this
dissertation includes multiple levels of nested predictor variables, simple univariate and
bivariate statistics based on random sampling assumptions are invalid options, and
advanced multivariate methods are required to account for correlation or dependency
between these nested variables. In short, there are too many questions to use simple
methods. Consequently, hierarchical mixed-modeling (sometimes called hierarchical
modeling or multilevel regression modeling) is indicated (Gelman & Hill, 2006;
Williams, 1978), with the SOL-test score as the dependent variable. Specifically, this
dissertation used the protocols prescribed by Laura O’Dwyer and Caroline Parker (2014)
to analyze the data via the Advanced Statistics module of IBM SPSS Statistics (Version
Figure 6
Goss-Sampson: Tests for Predicting Outcomes Based on Variables
Note: Adapted from Statistical analysis in JASP: A guide for students (p. 166), by Goss-
Sampson, M. A., 2020, Jasp Team. Copyright 2020 by M. A. Goss-Sampson. Adapted
with permission.
Data Analysis. Considering ASA guidelines (Wasserstein et al., 2019), the results
of inferential statistics tests conducted in this study were guided by the following
recommendations (Hayat et al., 2019):
1. When a p-value is reported, state it, regardless of how small or large.
2. Avoid using 0.05 or any other cut-off for a p-value as the basis for a decision
about whether an effect is meaningful or important.
3. In reporting a p-value, a measure of the effect size should be included, along
with a corresponding confidence interval.
The confidence intervals are advantageous in that they reflect the differences
between group means at the same level of measurement as the empirical data, and also
indicate the strength and direction of an effect (Pandis, 2013). This helps shift focus away
from the fickle and unreliable dichotomous outcomes of NHSTs toward an examination
of real effects that are compatible with empirical data (Kock, 2015). “Taken together,
confidence intervals in addition to replications, graphic illustrations[,] and meta-analyses
seem to represent a methodically superior alternative to significance tests” (Brandstätter
& Kepler, 1999, p. 46).
Quantitative analysis answers two of the research questions for this study:
1. Is there a relationship between qualitative standards-based report card models
and students’ quantitative standardized test scores?
2. Are different SBRC models more or less predictive of student performance on
standardized tests?
To test these specific open-ended research questions through the interpretation of
confidence intervals and effect sizes, two specific relationships were examined through
parametric analysis of means. For the first question, the researcher tested to what extent
SOL-test scores for mathematics, reading, and science were related to the standardized
PLD codes (PL-2, PL-1, PL+1, and PL+2 coded as -2, -1, +1, +2, respectively) and the
students’ grade levels (3, 4, and 5). For Research Question 2, the investigation explored
to what extent different varieties of SBRC could predict whether a student would pass the
SOL tests.
Ascertaining the answer to these two research questions first involved visual
interpretation of error-bar charts (Plapinger, 2017), drawn with Minitab 17 Statistical
Software. In each chart, the mean SOL-test scores were symbolized by a bullet, ●. The
upper and lower limits of the 95% confidence intervals (CI) of each mean score
(representing the range expected to capture the true mean score in approximately 95
out of 100 repeated samples) were symbolized by a whisker with boundary lines, ꟾ. If the
confidence intervals of two mean scores did not overlap, then it was assumed that, in at
least 95 out of 100 samples, the difference between the mean scores was not zero
(Cumming & Fidler, 2009; Fidler & Loftus, 2009; Hoekstra et al., 2012).
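The interval comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the study: the charts were drawn in Minitab, whose exact interval method is not shown here, so the sketch uses a normal approximation, and the scores for the two groups are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Normal-approximation confidence interval for a mean.

    Assumption: a t-based interval would be wider for samples this small.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return center - half_width, center + half_width


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented scores for two hypothetical groups, for illustration only.
group_a = [410, 425, 433, 440, 452, 461, 470, 478]
group_b = [350, 362, 371, 380, 389, 395, 402, 411]

ci_a, ci_b = mean_ci(group_a), mean_ci(group_b)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

With these invented values the two intervals do not overlap, which is the situation the text treats as evidence of a nonzero difference between means.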
Research Question 2 was also addressed using hierarchical mixed-model analysis
in IBM SPSS Statistics (Version 24). The continuous level dependent variables were the
SOL-test scores for mathematics, reading, and science, and the within-subject effects—
related to the variance in the multiple test scores (two to five) awarded to each student—
were assumed to be random. The applicable types of SBRC were nested within the three
grade levels, and both were assumed to be fixed effects nested within the random ones.
A post-hoc pairwise comparison of mean test scores between the three SBRC
classes was conducted using Fisher’s LSD test. The values of Cohen’s d—representing
the standardized difference between two mean scores—were estimated to reflect the
effect sizes while searching for practically significant effects—outcomes that are
meaningful in real-world situations and can be used to make decisions. The interpretation
of Cohen’s d followed the criteria defined by Ferguson (2016) where d = 0.41 is the
“recommended minimum effect size representing a practically significant effect” (p.
305); d = 1.15 represents a moderate effect, and d = 2.70 represents a strong effect.
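For readers who want to see the arithmetic, the sketch below computes Cohen's d for two groups and labels it with the thresholds quoted above (0.41, 1.15, and 2.70). It assumes the pooled-standard-deviation form of d; the exact computation performed in SPSS is not specified in this chapter, and the scores are invented.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = (
        (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    ) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_var)


def interpret_d(d):
    """Label |d| using the thresholds quoted in the text."""
    size = abs(d)
    if size >= 2.70:
        return "strong"
    if size >= 1.15:
        return "moderate"
    if size >= 0.41:
        return "minimum practically significant"
    return "below practical significance"


# Invented scores for two hypothetical SBRC classes.
class_a = [420, 435, 450, 465, 480]
class_b = [410, 425, 440, 455, 470]
d = cohens_d(class_a, class_b)
print(round(d, 2), interpret_d(d))
```

Because d is expressed in standard-deviation units, the same 10-point difference in means would yield a smaller d if the scores within each class were more spread out.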
Avoiding the “Pass-Fail” Pitfall. It is worth noting that the Virginia SOL tests
include a pass-fail threshold; for the 2020–21 school year, the cut-off was 400 out of a
possible perfect score of 600. This practice, which turns a continuous variable into a
simple dichotomy, is not only reductive but predicates failures—potentially catastrophic
ones—in statistical analyses leading to incorrect conclusions. While one might be
tempted to code each SOL-test score as a simple “0 = Fail, 1 = Pass” binary for the
dependent variable (DV), this is statistically indefensible as the score itself was measured
at a continuous, interval level. Over the past three decades, statisticians have
demonstrated the failed logic behind dichotomizing continuous-level DVs as part of data
cleaning and preparation for analysis, as the practice leads to error and specious
conclusions (Altman & Royston, 2006; Dawson & Weiss, 2012; Delaney et al., 1993;
Fernandez et al., 2019; Kuss, 2013; MacCallum et al., 2002; Naggara et al., 2011;
Ragland, 1992; Royston et al., 2006; Taylor & Yu, 2002; Walraven & Van Hart, 2008).
Essentially, when a continuous variable is converted into two ordinal levels,
biases and errors emerge and corrupt the results of the analysis. Meaningful differences
between groups within the DV—which for this study might include different SBRC
models, the very core of the research question—are hidden, as no analysis of variance is
possible once the nuances and complexities of raw continuous level data are obscured by
the reduction resulting from dichotomous coding. This is logical not only statistically, but
also practically: Educators know that the actual difference between an SOL-test score of
399 and 401 is minimal at best, and utterly meaningless at worst, irrespective of the
reductive coding used by the Commonwealth of Virginia. A student who scores a 399 is
almost certainly bound for immediate retesting, given the insignificant difference of a
single point, while a student scoring a 299 is quite a way off from demonstrating mastery
on that assessment. By reducing—and thereby corrupting—the continuous interval data
of the SOL-test score into a binary pass-fail value, certain statistical tests become
unavailable. By reporting the actual scores, more complex and reliable statistical analyses
can be performed. In order to obtain accurate statistics that support useful conclusions, it
is imperative that we avoid the inevitable distortion that comes with dichotomization,
which invariably obscures the relationships between predictor variables and the
dependent variable. Where the health and wellness of children are concerned, we must do
our due diligence and take the higher road, no matter the challenges.
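The information lost through dichotomization can be made concrete with a short sketch that uses the scores mentioned above. The cut-off of 400 comes from the text; treating a score of exactly 400 as a pass is an assumption of this sketch.

```python
PASS_CUTOFF = 400  # pass-fail threshold cited in the text for 2020-21


def pass_fail(score: int) -> int:
    """Dichotomous coding: 1 = pass, 0 = fail (assumes 400 itself passes)."""
    return 1 if score >= PASS_CUTOFF else 0


for score in (299, 399, 401):
    print(score, pass_fail(score))

# After coding, 299 and 399 are indistinguishable even though they differ by
# 100 points, while 399 and 401 fall into different groups even though they
# differ by only two points.
```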
Stage 5: Recommendations for Praxis
In Stage 5, the results of analyses from earlier stages were developed into a clear
set of recommendations that should have a direct impact upon SBRC design and, more
broadly, on standards-based grading practices. Creswell (2012) advised that concrete
suggestions for actual practice are a hallmark of an effective study, and a significant
portion of Chapter V is given over to discussing such recommendations.
Philosophical Rationale
Tashakkori and Teddlie (2010) indicated that a study employing mixed methods
should be able to articulate a rationale for using such a design. Increasingly used in social
and behavioral sciences, mixed methods research designs use both quantitative and
qualitative components to thoroughly investigate complex issues, and can yield robust
theoretical understanding and enhance practice (Niaz, 2008; Ponce & Pagán-Maldonado,
2014). Niaz (2008) found that understanding quantitative analysis in a practical,
applicable human context has significant value for enacting change. By examining
multiple dimensions of a complex problem, mixed methods research can triangulate
potential solutions (Greene et al., 1989), a process that involves using multiple research
approaches to develop richer and fuller data (Wilson, 2014).
With the advantages inherent to mixed methods design, Timans et al. (2019)
concluded that such research is important and well-supported within the social sciences.
In their review, they highlight three areas of potential concern: employing a standardized
methodological framework rather than searching for a more-effective design; failing to
understand the epistemological truth that this is a standalone methodology, rather than an
assemblage of other traditional approaches; and a lack of knowledge about the social-
historical context of mixed method research. That team concluded that such designs are
constructions “that encapsulate different assumptions on causality, rely on different
conceptual relations and categorizations, allow for different degrees of emergence, and
employ different theories of the data that they internalize as objects of analysis” (pp.
212–213). This dissertation uses a mixed methods approach because it asks complex
questions requiring complex analyses. Standards-based assessment has been in practice
for some three decades, as Chapter II discussed in detail, but critical questions about SBG
practices remain unanswered.
Aside from the functional advantages of mixed methods design, there can be
philosophical reasons to use it as well. Mertens (2003) argued that concern with human
welfare was one of the cornerstone paradigms of mixed methods research design, and
wrote that as research in social and behavioral sciences is by its nature contextualized
within society, it is thus inextricably involved in the full gamut of human conditions.
Consequently, research that acknowledges extant injustices can be philosophically
contextualized within what Mertens termed a transformative-emancipatory paradigm. In
regard to this study, such injustice takes the form of what Young-Bruehl (2012) described
as childism: the prejudice against children which disregards and disenfranchises them as
members of a suspect class. To wit, it is specifically because children are children that
they often suffer oppression by adults. Consequently, what is best for children is often
ignored or undermined in favor of adult preferences, be they institutional or personal.
This injustice dynamic is unquestionably relevant to educational praxis. The
transformative-emancipatory paradigm is a justifiable philosophical context for this research design, given that
regardless of its findings, there will be clear implications for the integration and
interpretation of other research and evidence, as well as for praxis in the realm of SBRC
design—which will, by its nature, directly impact children.
Population and Sample
The study population included a subset of the students in the Commonwealth of
Virginia who were in grades 3–5 during the 2020–21 school year, took the Virginia
Standards of Learning tests, and were assessed with standards-based report cards. To
identify a sample population, I investigated publicly available resources and
corresponded with the school divisions (Virginia’s equivalent of school districts) to
confirm their participation in standards-based grading, a process composed of non-
probability, voluntary response sampling. The schools in the sample represent a
socioeconomically diverse population.
Each case, drawn from data for a unique student, consisted of a performance level
descriptor (assigned by the teacher) for a specific academic subject, along with the
student’s SOL-test score for that subject. In total, complete data for 82,135 cases were
acquired from 45 schools, representing 31,091 students in grades 3, 4, and 5. Within the
sample, three SBRC classes (as defined by the Reeves Taxonomy established in Chapter
IV of this dissertation) were represented.
As described in Chapter I, sex and race/ethnicity data were requested but not
received for all students from all schools, producing an unexpected limitation that
unfortunately prevents an equity-related analysis (discussed further in Chapter V).
However, using publicly available data from the schools included in this study, an
estimate of the composition of the sample population can be extrapolated, using the
Virginia Department of Education (n.d.) race/ethnicity codes:
Hispanic or Latino: 21.66%
American Indian or Alaskan Native: 0.28%
Asian: 8.00%
Black or African American, not Hispanic or Latino: 16.04%
Native Hawaiian or Other Pacific Islander: 0.13%
White, not Hispanic or Latino: 47.91%
Two or more ethnicities reported: 5.97%
Data-Collection Procedures
Existing data generated in the normal course of teaching and learning were used
for this study. School divisions in Virginia typically require researchers to conduct
their studies in the summer and fall, after obtaining permission to do so in the first half of
the year. As shown in Figure 2 in Chapter I, though, the COVID-19 pandemic upended
normal school functioning, limiting the availability of data to that which could be
collected after June 2021. With the artificially compressed timeline, some school
divisions’ internal procedures simply could not be followed due to pandemic-induced
operational disruptions; as a result, those schools could not be included in the study, a
limitation which will be addressed in the recommendations in Chapter V.
Data Security and Storage
O’Toole et al. (2018) suggested researchers set a priori restrictions on themselves,
requesting no more data than are necessary to conduct research, and noted that best
practices require personally identifiable information be separated from the rest as soon as
possible. In this case, requested data arrived already scrubbed of such information, as
stipulated by the Family Educational Rights and Privacy Act of 1974. Data were received
and stored securely in an encrypted, password-protected environment with two-factor
authentication, as prescribed by that Act and following the practices set forth by Grassi et
al. (2017) and O’Toole et al. (2018).
Data Cleaning
Data cleaning is the process by which data quality is improved, often by
correcting formatting errors in a data set; this is particularly important when dealing with
data from multiple sources (M. Allen, 2017). The researcher first ensured that
spreadsheet formats and columnar organization fit a standard that could be analyzed.
Unique identifiers were assigned to each case during this process, and cases with
uncorrectable errors were removed from the cleaned data set. Any data not pertaining to
the study were also removed; for example, when querying a student information system
for all the PLDs assigned to a particular learner in each marking period, data for subjects
not tested as part of the Standards of Learning exams (e.g., art or PE) might be retrieved,
and needed to be cleaned from the data set. The raw data had the following headers,
shown in Table 4.
Table 4
Raw Data Headers
Variable Type
SBRC Model Class
Categorical, Nominal
SBRC Model Class Code
Categorical, Nominal
Unique Student Identifier
Categorical, Nominal
Categorical, Nominal
Categorical, Nominal
Grade Level
PLD Label (PL)
SOL-Test Score
Continuous, Interval
Student data arrived in a variety of forms, requiring extensive cleaning in two
stages. First, the data on standards-based PLDs needed to be cleaned, followed by a
similar process for the SOL-test results. Data provided for grades other than 3–5 were
deleted, as were data from courses without SOL tests. Next, any cases that included a
non-PLD code representing insufficient evidence or no evidence were deleted, as were
SOL-data cases marked with an irregularity, a “did not attempt” flag, or a “no score”
code. Because SOL tests are not administered for science in grades 3 and 4, those cases
were also deleted for a lack of comparative data. Finally, any cases with test scores that
were blank or that had a score of zero (only possible in the case of an irregular test)
were removed. This left only cases that included both actual PLDs and true SOL-test
scores for students in the target grades. As described in “Delimitations” in Chapter I, one
entire school division’s data had to be omitted from the study, as they had converted their
standards-based PLDs into letter grades. As discussed in that section, the existing
literature indicates that retrofitting traditional number- or letter-score schema into
standards-based assessments is both inappropriate and inefficient; consequently, those
data were excluded as inapplicable to a study of this nature.
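The case-level cleaning rules described above can be summarized as a single filter. The sketch below is illustrative only: the dictionary keys and the sample records are hypothetical, and it does not reproduce the spreadsheet procedure actually used.

```python
TARGET_GRADES = {3, 4, 5}
TESTED_SUBJECTS = {"mathematics", "reading", "science"}
VALID_PLD_CODES = {-2, -1, 1, 2}


def keep_case(case: dict) -> bool:
    """Return True when a case survives the cleaning rules in the text.

    The dictionary keys are hypothetical field names chosen for this sketch.
    """
    if case["grade"] not in TARGET_GRADES:
        return False
    if case["subject"] not in TESTED_SUBJECTS:
        return False
    # No science SOL test is administered in grades 3 and 4.
    if case["subject"] == "science" and case["grade"] in (3, 4):
        return False
    # "Insufficient evidence" or "no evidence" marks carry no PLD code.
    if case["pld_code"] not in VALID_PLD_CODES:
        return False
    # Irregularity, "did not attempt", or "no score" flags on the SOL record.
    if case["sol_flag"] is not None:
        return False
    # Blank or zero test scores.
    if not case["sol_score"]:
        return False
    return True


# Invented records: only the first one meets every rule.
cases = [
    {"grade": 5, "subject": "science", "pld_code": 1, "sol_flag": None, "sol_score": 432},
    {"grade": 3, "subject": "science", "pld_code": 1, "sol_flag": None, "sol_score": 410},
    {"grade": 4, "subject": "reading", "pld_code": None, "sol_flag": None, "sol_score": 388},
    {"grade": 4, "subject": "mathematics", "pld_code": -1, "sol_flag": "did not attempt", "sol_score": None},
]
cleaned = [case for case in cases if keep_case(case)]
print(len(cleaned))
```

Expressing each rule as its own early return keeps the filter easy to audit against the prose description.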
This study was developed as a sequential, exploratory, transformative mixed
methods effort to categorize standards-based report card (SBRC) designs and compare
their performance level descriptors (PLDs) and models in order to investigate their ability
to predict student scores on the Commonwealth of Virginia’s Standards of Learning tests.
Executed in five stages, the research began with a systematic literature review, which
informed the creation of a taxonomy of SBRC models. Applying the taxonomy to
Virginia public schools’ SBG systems used in grades three through five, quantitative
analysis tested the various standards-based systems for their ability to predict Standards
of Learning (SOL) test scores; distilling those findings into recommendations for praxis
concluded the final stage of the study. With this overview of the study’s methodology
complete, we now move to Chapter IV, which presents the study’s findings.
Chapter IV
Findings and Analysis
Once Stage 1 (the literature review) was complete, it was time to begin the next
phase. As shown in Figure 5, on page 67 of Chapter III, Stage 2 involved creating the
taxonomy used to categorize the various models of standards-based report cards. Once
this system was complete, data were gathered from 45 schools in Virginia, including the
model of SBRC the schools employed, the standards-based grades their pupils received,
and the SOL-test scores of those students. Applying the classifications devised by the
researcher allowed meaningful analyses to be performed, clarifying the relationships
between different assessment models, as well as assessing their predictivity for student
achievement on the Standards of Learning tests.
Stage 2: Reeves Taxonomy of SBRC Models
Bailey (1994), a scholar of classification in the social sciences, stated that
“[w]ithout classification, there could be no advanced conceptualization, reasoning,
language, data analysis or, for that matter, social science [sic] research” (p. 1).
Furthermore, he states that the purpose of classification is to create categories whose
members exhibit as much intra-group homogeneity and inter-group heterogeneity as
possible; thus, it is critical that the classes emerging from this process should be
exhaustive and mutually exclusive, in that there must be an appropriate class for every
known member in the population (exhaustivity) and no member can belong to more than
one class (exclusivity). Bailey went on to identify two types of class. Monothetic classes
contain members that are strictly homogeneous within the measured variables, identical
to each other with no exceptions; polythetic classes group members which are similar, in
that they are predominantly homogeneous, but may exhibit heterogeneity with respect to
at least one measured variable.
When creating classification systems, researchers use one of two methods:
typologies and taxonomies. Typologies are generally multidimensional and conceptual,
while taxonomies are generally empirical, and both can be used in qualitative research
(Bailey, 1994; Borgès Da Silva, 2013; Bradley et al., 2007). While the terms are not
technically synonymous, one framework may well be both when some dimension of the
taxonomy conceptually categorizes members a priori; the potential overlap between
typologies and taxonomies has been well established (Bailey, 1994; Borgès Da Silva,
In this research, it was clear that some form of classification was indispensable:
As the specific symbols and language of any given performance level descriptor (PLD)
within any given standards-based report card (SBRC) may be unique, both PLDs and the
report-card models themselves must be classified for analysis. However, existing research
did not include any systems for classifying SBRCs at the time of this study, necessitating
the creation of a new one. Ultimately, as the research involved developing monothetic
classes from the empirical measurement of specific variables, the term “taxonomy” best
describes this novel classification system: the Reeves Taxonomy of Standards-Based
Report-Card Models.
Development of the Taxonomy
While Hedden (2016) describes two stages of taxonomic development (concepts
and relationships), Bailey (1994) indicates that an examination of empirical evidence is
an essential precursor, leading to classical cluster analysis, which groups members of the
studied population into homogeneous classes, based on the similarity of measured
characteristics. Bailey terms these measured characteristics M variables. As discussed in
Chapters I and II, SBRCs have specific performance level descriptors (PLDs) that are
distinct, non-overlapping, and arranged in a hierarchy, all varying from model to model.
In order to classify SBRCs based on PLDs, one must measure their characteristics; the
primary differences are the number and nature of performance levels, as described in this
chapter. Therefore, as used in this study, M variables for any given SBRC are:
• The number of performance levels in the model,
• The number of performance level descriptors (PLDs) in the model that describe a level of student performance that meets the standard,
• The number of PLDs that describe a level of student performance that does not meet the standard, and
• The highest possible positive performance level descriptor within the
Note that the actual language of the PLD is not an M variable. Whether the
descriptor is satisfactory or sufficient or proficient, all are qualitatively and functionally
the same, denoting a performance which “meets the standard.” Therefore, while the M
variables of the taxonomy are quantitative, based on direct empirical observation, M
variables also include qualitative characteristics. SBRC models by their nature are ordinal
and hierarchical, and qualitative examination yields a key component of the taxonomy:
the line between “meets standard” and “does not meet standard.” While the distinction is
almost always obvious, it is worth noting that some SBRC models are ambiguous and
warrant scrutiny. Making that qualitative distinction is the first step in applying the
taxonomy, and is described in the next section.
Having examined the empirical nature of the population as Bailey (1994) recommends,
we can follow Hedden (2016) to identify the taxonomic concepts:
• An SBRC model is a distinct set of performance level descriptors (PLDs).
• Each PLD represents a discrete level of student performance, ranked from lowest to highest, or vice-versa.
• SBRC systems may have varying numbers of PLDs, and the language used to describe them may differ between models.
• Each PLD has a corresponding level; that is, each one occupies a distinct ordinal position on the low-to-high continuum, relative to the other PLDs in the model.
That said, we can create specific M variables based on empirical observations and
incorporate the concepts enumerated above.
Classification by T-Value
The primary characteristic of an SBRC, and the measure that establishes the first
level of homogeneity, is the number of performance levels in the model. This variable is
always a whole integer that forms the primary class and is termed T. For example, if an
SBRC included the PLDs superior, proficient, and needs work, that model would have T
= 3. In this case, the important thing is the total number of PLDs (i.e., 3), regardless of
the semantics in the descriptor. An SBRC using exceeds standard, meets standard, and
approaching standard would also have T = 3.
Subclassification by L-Values
There are many permutations of T = 3, and the T-value alone is not sufficient to
completely classify all SBRC models. The arrangement of the PLDs is critical, as can be
easily demonstrated with extreme theoretical examples. No SBRC, by nature, could
include a set of PLDs all of which described “meeting the standard;” if a report card used
only outstanding, excellent, and satisfactory, there would be no way to report a level of
proficiency that fell below that standard. Similarly, if a report card used the three PLDs
almost there, not there yet, and not even close, there would be no way to report student
skill mastery that did achieve the standard. Consequently, all SBRC models must include
at least one PLD that describes “meeting the standard” and at least one that describes not
“meeting the standard.” Thus, there are two possible versions of an SBRC with T = 3: a
model in which a single PLD stands for “meets standard” and two report “does not meet
standard,” and the reverse, with two “meets standard” and one “does not meet standard”
PLDs. This is illustrated in Figure 7.
Figure 7
Two SBRC Models With T = 3
Both exceeds standard and meets standard describe a metric of student
performance that satisfies the given expectations and goals. While they each have their
own distinct ordinal value, and are in a hierarchical relationship, both describe
performance that, broadly considered, has reached a desired level. Similarly, both
approaching mastery and beginning mastery describe a state of progress that has not yet
met the standard, and while they too are unique, they both indicate performance that is
still unsatisfactory. These two levels of PLD—“meets standard” and “does not meet
standard”—are defined as Lm and Ld respectively, where L reflects that the descriptors
apply to a homogenous level, and m and d distinguish the two different subtaxa—the
subclasses of PLD arrangement—based on achievement relative to a rubric: subscript m
for “meets standard” and subscript d for “does not meet standard.” This is illustrated in
Figure 8.
Figure 8
Taxonomy: Levels L
To populate an SBRC’s PLDs into these two classes, we must qualitatively
examine them, a process well illustrated by examining extant SBRC models identified in
the Stage 1 literature review. Yale 1785 (Schinske & Tanner, 2014), named in Chapter II,
included four PLDs: optimi, second optimi, inferiores, and pejores. Optimi means “best”
in Latin, and thus optimi and second optimi can reasonably be inferred, given their
placement at the top of the Yale 1785 scale, to equate to “best” and “second best.”
Inferiores means “lower” and pejores means “worse,” which are reasonably inferred to be
connotatively negative descriptors. One could logically pair the “upper” two PLDs as
belonging to Lm while the “lower” ones are in Ld. Within this model, each level L
contains two PLDs, represented formulaically as Lm = 2 and Ld = 2, respectively. For any
given model, the sum of the values of Lm and Ld will always equal T. As an example, for
Yale 1785, (Lm = 2) + (Ld = 2) = T = 4.
Line of Demarcation
The line of demarcation in this taxonomy provides an easy visual reference that
helps classify a given SBRC model. Compare Yale 1785 (Schinske & Tanner, 2014) with
the model used by Marzano in 1998 (hereafter referred to as “Marzano 1998”), which
uses four PLDs: advanced, proficient, basic, and novice (Marzano, 1998). In this system,
we again find two PLDs of “meets standard”—in this case, advanced and proficient—and
two PLDs of “does not meet standard”—basic and novice. Because both Yale 1785 and
Marzano 1998 use PLDs arranged in Lm = 2 and Ld = 2, they are effectively the same.
Regardless of the language used in the descriptors, there are two performance levels
above and two below a conceptual “line” separating the Lm and Ld classes. This division
between Lm and Ld is called the line of demarcation in the Taxonomy, and is illustrated in
Figure 9.
Figure 9
Taxonomy: Line of Demarcation
Performance Levels
Recall that performance level descriptors (PLDs) do exactly what they say:
describe the level of performance. Exceeds standard is situated above meets standard and
can be considered to describe a higher performance level. This arrangement is not only
conceptual and visualizable but is also mathematically ordinal in that a numerically
greater value can be affixed to exceeds standard than to meets standard, given their
hierarchical relationship. Each specific PLD within each level L is therefore defined as
PL, labeled with a subscript defining its proximity to the line of demarcation. For
example, the first performance level (PL) above the line of demarcation is PL+1, as it
describes the first PL above the line. Regardless of how PL+1 is phrased—meets standard
or satisfactory or proficient—the performance level is assigned a constant value. If no
PLD exists above PL+1 then there is only one performance level above the line of
demarcation, represented as Lm = 1. A model like Marzano 1998, which has two PLDs
above the line of demarcation (advanced and proficient), would place proficient at PL+1,
after which advanced would receive a PL label with the next highest whole integer value:
PL+2. This way, the relative value of the PL subscript mirrors the hierarchical relationship
of higher versus lower levels of performance.
The same method applies to Ld. Beginning with the first performance level below
the line of demarcation, each subsequently lower PL receives an increasingly negative
subscript. Using the Marzano 1998 model, where basic and novice represent successively
lesser performance, each would be defined as PL-1 and PL-2 respectively, and the model
would have a value of Ld = 2. In this way, every SBRC model can be classified by
identifying the line of demarcation between meets and does not meet, sequentially
counting and labeling each PLD with the appropriate positive and negative PL subscripts,
and then counting the number of PLs within Lm and Ld. Figure 10 shows how, regardless
of the different permutations of PLDs in any given SBRC model, each distinct level in
the hierarchy—and therefore in the taxonomy—is labeled with an identical PL value,
which facilitates comparison. For discussion and formulaic expression, the subscript q
can be used as a placeholder for an undefined PL. For example, one could state, “there is
no PLq where q = 0.” As a final note, it is important to reiterate that non-PLD codes like
insufficient evidence or no attempt are not actual PLDs, and should not be counted in any
given PL.
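The labeling procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name label_performance_levels and its arguments are hypothetical, and the number of PLDs above the line of demarcation (lm) must still be supplied from the researcher's qualitative judgment. The example uses the Marzano 1998 PLDs discussed in the text.

```python
def label_performance_levels(plds_high_to_low, lm):
    """Map each PLD to its signed PL subscript q.

    plds_high_to_low: PLDs ordered from highest to lowest performance,
        with non-PLD codes (e.g., "insufficient evidence") already removed.
    lm: number of PLDs above the line of demarcation (a qualitative
        judgment; this code cannot infer it from the wording).
    """
    t = len(plds_high_to_low)
    ld = t - lm
    if lm < 1 or ld < 1:
        raise ValueError("a model needs at least one PLD on each side of the line")
    labels = {}
    for index, pld in enumerate(plds_high_to_low):
        if index < lm:
            q = lm - index           # +lm down to +1
        else:
            q = -(index - lm + 1)    # -1 down to -ld
        labels[pld] = q
    return labels


marzano_1998 = label_performance_levels(
    ["advanced", "proficient", "basic", "novice"], lm=2
)
print(marzano_1998)
```

Because the subscripts skip from +1 directly to -1, no PLD is ever assigned q = 0, which matches the statement that there is no PLq where q = 0.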
Figure 10
Taxonomy: PLD Labels (PL Values)
As noted in Chapter II, a systematic review of the literature did not reveal any
SBRCs of T > 6, and as discussed earlier in this chapter, neither can there be one with T
< 2. (This is reflected throughout the preceding figures, and in Figure 13.) Bailey (1994)
described the empirical identification of observed subjects as the basis for establishing
the outer bounds of a taxonomy. Because the maximum T observed in theory or practice
was 6, this established the maximal outer bound of the taxonomy. While unobserved
permutations of T = 5 are theorizable, no such unobserved T = 6 variations are expected.
Consequently and as a matter of practicality, this new classification method is
intentionally designed with the single maximum taxon of T = 6, and permutations other
than the observed one are excluded for clarity’s sake. We can therefore tabulate all
permutations of SBRCs within those bounds as a basis for the taxonomy, shown in Table
5, meeting Bailey’s (1994) first criterion for taxonomic development, exhaustivity.
Table 5
Included SBRC Models With T Between 6 and 2
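Model   PL+4  PL+3  PL+2  PL+1  |  PL-1  PL-2  PL-3  PL-4
T = 6          X     X     X    |   X     X     X
T = 5    X     X     X     X    |   X
T = 5          X     X     X    |   X     X
T = 5                X     X    |   X     X     X
T = 5                      X    |   X     X     X     X
T = 4          X     X     X    |   X
T = 4                X     X    |   X     X
T = 4                      X    |   X     X     X
T = 3                X     X    |   X
T = 3                      X    |   X     X
T = 2                      X    |   X
| = Line of Demarcation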
Note. In this table, the presence of a PLD at a given PLq is shown by an X.
This yields 11 monothetic classes of SBRC model, each with a unique formula of
(Lm)(Ld) = T, meeting Bailey’s (1994) second criterion for taxonomic development,
mutual exclusivity. Those 11 unique formulae are:
(Lm = 3)(Ld = 3) = T = 6
(Lm = 4)(Ld = 1) = T = 5
(Lm = 3)(Ld = 2) = T = 5
(Lm = 2)(Ld = 3) = T = 5
(Lm = 1)(Ld = 4) = T = 5
(Lm = 3)(Ld = 1) = T = 4
(Lm = 2)(Ld = 2) = T = 4
(Lm = 1)(Ld = 3) = T = 4
(Lm = 2)(Ld = 1) = T = 3
(Lm = 1)(Ld = 2) = T = 3
(Lm = 1)(Ld = 1) = T = 2
This is represented visually in Figure 11.
Figure 11
Taxonomy: T-Values for Each SBRC Class in the Reeves Taxonomy
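The enumeration of the eleven formulae can be reproduced mechanically. The following is a minimal sketch, assuming only the bounds stated in this chapter: every split of T into Lm ≥ 1 and Ld ≥ 1 for 2 ≤ T ≤ 5, plus the single observed T = 6 model. The function name enumerate_models and its parameters are hypothetical.

```python
def enumerate_models(min_t=2, max_full_t=5, extra=((3, 3),)):
    """List (Lm, Ld) pairs for every model included in the taxonomy.

    All splits with Lm >= 1 and Ld >= 1 are generated for
    min_t <= T <= max_full_t; 'extra' adds the one observed T = 6 model.
    """
    models = []
    for t in range(min_t, max_full_t + 1):
        for lm in range(1, t):
            models.append((lm, t - lm))
    models.extend(extra)
    # Sort by T descending, then Lm descending, mirroring the listing order.
    return sorted(models, key=lambda pair: (-(pair[0] + pair[1]), -pair[0]))


MODELS = enumerate_models()
for lm, ld in MODELS:
    print(f"(Lm = {lm})(Ld = {ld}) = T = {lm + ld}")
print(len(MODELS), "classes")
```

Each pair is unique, which is one way to see the mutual exclusivity of the classes.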
To facilitate discussion of SBRC classes, semantic nomenclature had to be
developed, as the (Lm)(Ld) = T formulaic representation is cumbersome for praxis.
Examining Figure 11, two unique T values are visible at either end of the spectrum of the
included SBRC models: There is only one T = 6 and there is only one T = 2. For these
models, which have a unique quantity of PLDs, no further distinction is warranted, and
they can simply be named by that T value. To avoid confusion between the final class
nomenclature and the various M variables, this taxonomy employs Roman numerals, and
therefore, the SBRC model with T = 2 is classified as a Class II model, and the one with
T = 6 is Class VI, as shown in Figure 12.
Figure 12
Taxonomy: T-Designated Models
For the remaining models (two versions of T = 3, three of T = 4, and four of T =
5), we must distinguish between the different permutations. The chief distinction between
the two models with T = 3 is that one has (Lm = 1)(Ld = 2), and the other has the opposite,
(Lm = 2)(Ld = 1). The first model has its highest-level PLD at PL+1, equivalent to meets
standard, while the other model has its highest level in the exceeds standard position of PL+2.
As discussed earlier in this chapter, the fourth M variable for classifying SBRCs is
the highest possible positive value of the performance level descriptors within the model.
By examining that value of PL, we find that no two models within a given T value have
the same highest PL value. This is illustrated in Figure 13.
Figure 13
Taxonomy: Maximum and Minimum PL Values for Each Class
The positive PL values within the taxonomy are as follows, in order of frequency:
n (PL+1) = 11
n (PL+2) = 7
n (PL+3) = 4
n (PL+4) = 1
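These frequencies follow directly from the eleven (Lm, Ld) pairs in the taxonomy, since a model with Lm levels above the line of demarcation contains PL+1 through PL+Lm. A short sketch of that count follows; the variable names are hypothetical.

```python
from collections import Counter

# The eleven (Lm, Ld) pairs of the taxonomy, as listed in this chapter.
MODELS = [
    (3, 3),
    (4, 1), (3, 2), (2, 3), (1, 4),
    (3, 1), (2, 2), (1, 3),
    (2, 1), (1, 2),
    (1, 1),
]

positive_pl_counts = Counter()
for lm, _ld in MODELS:
    for q in range(1, lm + 1):   # the model contains PL+1 ... PL+Lm
        positive_pl_counts[q] += 1

for q in sorted(positive_pl_counts):
    print(f"n (PL+{q}) = {positive_pl_counts[q]}")
```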
Obviously, every model within the taxonomy must include PL+1, as SBRCs must
include a PLD representing “meets the standard.” Unsurprisingly, the PL+1 descriptor is
often just that: meets standard. In mathematics, the least upper bound of a set (for a finite set, simply its largest element) is
referred to as the supremum, labeled sup (Rudin, 1976). There are three distinct classes
within the taxonomy which can be represented by the following:
T = 5, sup(PL) = PL+1
T = 4, sup(PL) = PL+1
T = 3, sup(PL) = PL+1
In order to distinguish these three models in the taxonomy, we turn to their
distinguishing characteristic: a highest PL value of PL+1. Taking the first letter of the
common descriptor meets standard, the designation m is attached to the Roman numeral
derived from each model’s T value.
Thus, we can classify these three unique models as Class Vm, Class IVm, and
Class IIIm, shown highlighted in Figure 14.
Figure 14
Taxonomy: m-Designated Models
The same logic can be applied to those models whose highest level is one above
meets standard. PLDs in this position are typically named in the same vein as ones at
PL+1; at PL+2, the descriptor is quite often some form of exceeds standard. Just as the
designation m was derived from meets standard, the designation e is drawn from exceeds
standard and appended to the T value of any model with sup(PL) = PL+2. This yields
three more distinct classes:
T = 5, sup(PL) = PL+2
T = 4, sup(PL) = PL+2
T = 3, sup(PL) = PL+2
We can classify the three unique versions of these SBRC models as Class Ve,
Class IVe, and Class IIIe, illuminated in Figure 15.
Figure 15
Taxonomy: e-Designated Models
The last remaining model with T = 4 has a very high level of PL+3, which
effectively “exceeds exceeding.” There is another model with a unique T in which
sup(PL) = PL+3, namely, T = 5. Because the designation e has already been used, we turn
to the next letter in the word “exceeds,” arriving at the designation x—which has the
added advantage of highlighting the unusual nature of these models by using a letter that
has traditionally been equated with things experimental or unusual. Therefore, these two
SBRC models can be respectively labeled as Class Vx and Class IVx, as shown in Figure 16.
Figure 16
Taxonomy: x-Designated Models
The only class yet unaccounted for is the last remaining model with T = 5. Its
unique formulaic representation is T = 5, sup(PL) = PL+4. Rather than drawing again
from the word “exceeds,” this class will be designated with the letter following x;
historically, x and y designations have been used for prototype aircraft (Moritz, 1997).
Therefore, this final model is classified as Class Vy, as shown in Figure 17.
Figure 17
Taxonomy: y-Designated Model
This unique T = 5 version has only appeared once in the literature, proposed by
McTighe and Wiggins in 2005 (a model hereafter referred to as “McTighe & Wiggins
2005”) who discussed a model using the PLDs sophisticated, mature, good, adequate,
and naïve. This is an excellent example of the praxis-based, experience-informed
qualitative nature of determining the line of demarcation. Adequate must be reasonably
considered to “meet the standard,” as adequate has words like sufficient and satisfactory
as synonyms, which appear in SBRCs using more-common vocabulary. Therefore, while
one might be tempted to ask that students attain a good level of mastery, that would be
falling into the traditionalist trap of ranking discussed at length in Chapter II. Therefore,
the McTighe and Wiggins 2005 model, subjected to the taxonomic methodology
described here, must be considered to have Lm = 4 and Ld = 1, placing the line of
demarcation between adequate and naïve, making the model Class Vy.
If we modify Figure 10, which illustrates each model’s PL value permutations,
and add our now-exhaustive class nomenclature, as well as the T and both L values (Lm
and Ld) of each model class, we generate Figure 18, a complete rendition of the
taxonomy: eleven mutually exclusive classes, each with a unique designation based on
M-variable permutations.
Figure 18
Reeves Taxonomy of SBRC Models
Each model class can be represented based on the M variables shown in Table 6.
Recall that T is the total number of PLDs; Lm is the number of positive-value PLs and Ld
is the number of negative value PLs, indicating PLDs above and below the line of
demarcation respectively; and sup(PL) is the highest PL value in the class.
Table 6
M Variables and Formulaic Representations of Each SBRC class
T        Lm        Ld        Formulaic Representation
T = 6    Lm = 3    Ld = 3    (Lm = 3)(Ld = 3) = T = 6
T = 5    Lm = 4    Ld = 1    (Lm = 4)(Ld = 1) = T = 5
T = 5    Lm = 3    Ld = 2    (Lm = 3)(Ld = 2) = T = 5
T = 5    Lm = 2    Ld = 3    (Lm = 2)(Ld = 3) = T = 5
T = 5    Lm = 1    Ld = 4    (Lm = 1)(Ld = 4) = T = 5
T = 4    Lm = 3    Ld = 1    (Lm = 3)(Ld = 1) = T = 4
T = 4    Lm = 2    Ld = 2    (Lm = 2)(Ld = 2) = T = 4
T = 4    Lm = 1    Ld = 3    (Lm = 1)(Ld = 3) = T = 4
T = 3    Lm = 2    Ld = 1    (Lm = 2)(Ld = 1) = T = 3
T = 3    Lm = 1    Ld = 2    (Lm = 1)(Ld = 2) = T = 3
T = 2    Lm = 1    Ld = 1    (Lm = 1)(Ld = 1) = T = 2
Stage 3: Application of Taxonomy
To apply the Reeves Taxonomy to any given SBRC, begin by identifying the
PLDs in the model and count them to identify T. Next, find the line of demarcation by
examining the PLDs to determine which describe “meets standard” (and are therefore
part of Lm), and which describe “does not meet standard” (and thus belong to Ld). If the
semantics of the PLD are ambiguous or idiosyncratic, investigation of associated rubric
language, or even interviews with practitioners, may be warranted; in almost all observed
SBRC models, though, the language of the PLDs themselves was sufficient to establish
the line of demarcation. (The only challenging example observed in the literature is
described and classified later in this section.) Once the line of demarcation is established,
count each set of PLDs, above and below the line of demarcation, to identify Lm and Ld.
Locate the appropriate permutation of T, Lm, and Ld in the taxonomy, confirm the
sup(PL), and the model has been classified.
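The classification steps above can be expressed as a small lookup. This is a minimal sketch rather than a definitive implementation: the function name classify is hypothetical, the line of demarcation (and therefore Lm and Ld) must already have been established qualitatively, and the suffix is chosen from sup(PL), which equals PL+Lm.

```python
ROMAN = {2: "II", 3: "III", 4: "IV", 5: "V", 6: "VI"}
SUFFIX = {1: "m", 2: "e", 3: "x", 4: "y"}  # keyed by sup(PL), i.e., by Lm


def classify(lm, ld):
    """Return the class name for a model with lm 'meets' and ld 'does not meet' PLDs."""
    if lm < 1 or ld < 1:
        raise ValueError("need at least one PLD on each side of the line of demarcation")
    t = lm + ld
    if t == 2:
        return "Class II"            # unique T, so no suffix is needed
    if t == 6:
        if (lm, ld) != (3, 3):
            raise ValueError("only the observed (Lm = 3)(Ld = 3) model is included at T = 6")
        return "Class VI"
    if t > 6:
        raise ValueError("T > 6 lies outside the taxonomy")
    return f"Class {ROMAN[t]}{SUFFIX[lm]}"


print(classify(1, 2))  # e.g., proficient / progressing / not proficient
print(classify(4, 1))  # e.g., McTighe & Wiggins 2005
```

Raising an error for unlisted T = 6 permutations reflects the decision to exclude them from the current taxonomy; a future extension would replace that branch.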
Applying the Reeves Taxonomy to existing SBRCs demonstrates the validity of
the methodology. Once again, recall from Chapter II that non-PLD codes are
placeholders for a lack of performance level description, and so do not count toward L
values or receive a PL label. The MRN model (Elsinger & Lewis, 2020) includes two PLDs, meets
and revision needed, with one non-PLD code of not assessable. The Discovery
Elementary School (2021) model also includes two PLDs, meets standard and
approaching mastery; both models are T = 2 with levels of (Lm = 1)(Ld = 1) and are
therefore clearly Class II. One could describe Michigan 1851 in the same way—it uses
pass and fail as its PLDs—if the assessment methodology were truly standards-based and
the reporting method an authentic SBRC; pedagogically, though, we can question the
wisdom of using such loaded traditional grading language for PLDs. Colby (1999) used
proficient, progressing, and not proficient, a model with T = 3 and (Lm = 1)(Ld = 2),
which is therefore Class IIIm. EMRN (Stutzman & Race, 2004) uses three PLDs of
exceeds, meets, and revision needed as well as one non-PLD code of not assessable. This
model is T = 3 with levels of (Lm = 2)(Ld = 1), making EMRN a Class IIIe model. It is
thus demonstrated that the Reeves Taxonomy can be easily and consistently applied to
any given model of SBRC that might be encountered.
It is worth reiterating here that many possible permutations of Class VI models
have been excluded from the current taxonomy, as no examples were found, either in
theory or in practice. Following the procedure outlined in the above sections, though,
would facilitate the addition of further classes as necessary. For example, in the case
of a novel Class VI with PLq ≠ 3, the existing designations (m, e, x, y) would be
appended to all T = 6 varieties as appropriate. Were someone to design a model with a
stratospheric PL+5, the designation z would be used, following the same rationale as
progressing from x to y.
As mentioned earlier, there was one SBRC system that posed a particular
challenge for classification: McNulty (Elsinger & Lewis, 2020) utilized master,
journeyman, and apprentice (MJA) as PLDs. One cannot be clear, at face value, if a
journeyman level of skill proficiency “meets standard,” or if master-level performance is
required to do so. McNulty’s use of MJA is, fortunately, defined in the
literature, and the SBRC can therefore be properly classified: According to Elsinger and
Lewis, “only the ‘not yet mastered’ category is subdivided into two marks” (p. 889).
While T = 3 is clear, only through thorough investigation and scrutiny of practice can one
identify that the levels are (Lm = 1)(Ld = 2), making MJA a Class IIIm model.
Observations in Practice
As established earlier, the Reeves Taxonomy limits T based on the exhaustive
literature review performed in Stage 1, as well as observed practice, accounting for all
models with 2 ≤ T ≤ 5 and the observed permutation of T = 6. These models are
enumerated in Table 7, which provides examples of SBRC models in use—including the
language for each PLD—and their corresponding Reeves Taxonomy class.
Table 7
Observed SBRC Classes
Source                                 PLD Language                         T
Elementary School                      meets standard, approaching          T = 2
Elsinger and Lewis                     mastered, not yet mastered           T = 2
Arlington Public Schools (n.d.)        meets standard, approaching          T = 3
                                       mastery, developing mastery
Athens City Schools (n.d.)             meets the standards, working         T = 3
                                       toward the standards, experiencing
Guskey (2000)                          A, B, C                              T = 3
Ulrich (2012)                          proficient, developing, beginning    T = 3
Georgia CRCT (Hardegree, 2012)         exceeds, meets, does not meet        T = 3
Harrisburg School District (2016)      meets standard, progressing,         T = 4
                                       emerging, standard not met
Arter and Busick                       exceeds, meets, does not meet but    T = 4
                                       progressing, does not meet
Bellingham Public Schools 2013 (n.d.)  exceeding grade level standard,      T = 4
                                       meeting standard, approaching
                                       standard, well below standard
Guskey 2001 (Spencer, 2012)            4, 3, 2, 1                           T = 4
O’Connor (2007)                        advanced, proficient, basic, below   T = 4
Wormeli (2018)                         exemplary, competent, satisfactory,  T = 5
                                       inadequate, unable to begin
McTighe and Wiggins 2005               sophisticated, mature, good,         T = 5
                                       adequate, naïve
Wormeli (2018)                         exceptional, strong, capable,        T = 6
                                       developing, beginning, emergent
While it is impractical to enumerate every standards-based assessment system
ever used, the Reeves Taxonomy effectively classifies all of the SBRC models actually
found in Stage 1. Figure 19 displays the nine observed classes, graying out the two
unobserved theoretical classes included in the Taxonomy.
Figure 19
SBRC Models Observed in Practice as Classified by the Reeves Taxonomy
Observations in This Study
As discussed in the “Population and Sample” section of Chapter III, 82,135 cases
from 45 schools were acquired for analysis. For each school’s data, the Reeves
Taxonomy was applied using the Stage 3 methodology described in this chapter. While
the actual PLD language is not published here to safeguard the anonymity of participating
school divisions, schools, and students, the sample population included three of the nine
observed SBRC classes:
Class II, 3,082 cases
Class IIIm, 12,782 cases
Class IVe, 66,271 cases
Unobserved classes are grayed out in Figure 20.
Figure 20
SBRC Models in This Study as Classified by the Reeves Taxonomy
Stage 4: Quantitative Findings and Analysis
With the qualitative analysis of Stages 2 and 3 complete, the Reeves Taxonomy of
SBRC Models could be employed to make and analyze quantitative comparisons. As
discussed in Chapter III, two distinct analyses were indicated in order to answer the
research questions: “Is there a relationship between qualitative standards-based report
card models and students’ quantitative standardized test scores?” and “Are different
SBRC models more or less predictive of student performance on standardized tests?” For
the first question, the researcher needed to test to what extent SOL-test scores for
mathematics, reading, and science were related to the standardized PLD codes (PL-2,
PL-1, PL+1, and PL+2, coded respectively as -2, -1, +1, and +2) and the grade levels of the
students (3, 4, and 5). To answer the second question, the researcher needed to test to
what extent Class II, IIIm, and IVe SBRCs predicted whether a student would pass the
math, reading, and science SOL tests.
Description of Sample
Scores for Standards of Learning tests in mathematics, reading, and science were
awarded to a population of N = 31,091 students in grades 3, 4, and 5. (The total number
of test scores in the SPSS data editor was N = 82,135, greater than the number of students
because each received more than one SOL-test score.) Each score ranged from 0 to 600,
with 400 being the cut-off score between pass and fail. As discussed in “Data Cleaning,”
cases including numerical scores of actual zero were scrubbed as invalid, and thus the
lowest valid score received for any SOL test was 181. Over half of the students (n =
17,573, 56.5%) achieved passing scores of ≥ 400, whilst the remainder (n = 13,518,
43.5%) fell below that cut-off. The frequency distribution histogram in Figure 21 reflects
that SOL-test scores were normally distributed across the student population. Therefore,
parametric statistics were applicable, including analysis of mean scores at ± 95%
confidence intervals.
Figure 21
Frequency Distribution of SOL-Test Scores
In addition to the continuous level SOL-test scores, teachers assign students
categorical performance level descriptors, or PLDs. Per the Reeves Taxonomy of SBRC
Models, PLDs are labeled PLq, where q is the distance from the line of demarcation. This
q-value can be used as a hierarchical and ordinal code for quantitative analysis; for
example, a PLD labeled PL+1 can be coded “+1.”
Students were then placed into three categories, based on the Reeves Taxonomy
types of SBRC used by their schools. Coding the PLDs used by the represented SBRC
classes as PLq-values, Class II uses PLq of -1 and +1; Class IIIm uses -2, -1, and +1; and
Class IVe uses -2, -1, +1, and +2. Of the 82,135 cases in the study, 3,082 fell into Class
II, 12,782 into Class IIIm, and 66,271 into Class IVe.
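The coding scheme for the three classes in the sample can be derived from each class's Lm and Ld values, and the reported case counts can be checked against the stated total. The sketch below assumes nothing beyond the figures given in this section; the names pl_codes and STUDY_CLASSES are hypothetical.

```python
def pl_codes(lm, ld):
    """Return the ordinal PLq codes for a model, lowest to highest (q = 0 never occurs)."""
    return [q for q in range(-ld, lm + 1) if q != 0]


# Lm, Ld, and case counts for the three classes observed in this study.
STUDY_CLASSES = {
    "Class II": {"lm": 1, "ld": 1, "cases": 3_082},
    "Class IIIm": {"lm": 1, "ld": 2, "cases": 12_782},
    "Class IVe": {"lm": 2, "ld": 2, "cases": 66_271},
}

for name, info in STUDY_CLASSES.items():
    print(name, pl_codes(info["lm"], info["ld"]), info["cases"])

total_cases = sum(info["cases"] for info in STUDY_CLASSES.values())
print("Total cases:", total_cases)
```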
Due to the constraints described in “Limitations, Delimitations, and
Assumptions,” the frequency distribution of SOL-test scores was not equal between
subject areas, SBRC classes, or grade levels, meaning some SBRC classes had more
cases for analysis than others. Tables 8 and 9 show that most of the cases for mathematics
and reading scores in grades 3, 4, and 5 came from schools using SBRC Class IVe, while
the fewest cases were classified in Class II. (It is worth noting that these two
classes of SBRC were not necessarily the most- or least-used by all schools in Virginia,
but rather by the respondents.) Table 10 shows that most students receiving SOL-test
scores for science were graded with Class IVe SBRCs; all of that data came from
students in grade 5, as third- and fourth-graders do not take that test.
Table 8
Distribution: Mathematics Cases Between SBRC Classes and Grades
SBRC Class
Class II
Class IIIm
Class IVe
Table 9
Distribution: Reading Cases Between SBRC Classes and Grades
SBRC Class
Class II
Class IIIm
Class IVe
Table 10
Distribution: Science Cases Between SBRC Classes
SBRC Class
Class II
Class IIIm
Class IVe
Analysis 1: SOL-Test Scores, PLD Codes, and Grades
To what extent are the SOL-test scores for mathematics, reading, and science
related to standardized PLD codes (-2, -1, +1, and +2) and student grade level (3, 4, 5)?
The mean scores between groups of students were compared visually using error-bar
charts. In Figures 22, 23, and 24, the mean SOL-test scores at ± 95% confidence intervals
for mathematics, reading, and science were plotted against the four standardized PLD
codes (-2, -1, +1, and +2) and the three grade levels (3, 4, and 5). Figure 22 shows that
the mean SOL-test score for mathematics increased linearly across the four standardized
PLD codes in grades 3 and 4; however, in grade 5 the mean score did not increase
between PLD codes +1 and +2. Figure 23 shows that the mean test score for reading
increased linearly across the four standardized PLD codes in grades 3, 4, and 5, and
Figure 24 shows that the mean score for science also increased linearly across the four
standardized PLD codes in grade 5. (As previously mentioned, the SOL test for science is
not administered to the other grades.)
Visual examination of the error-bar charts in Figures 22, 23, and 24 reveals a clear
division between pass (≥ 400) and fail (< 400), across grades 3, 4, and 5. For
mathematics, reading, and science, the mean test scores associated with PLD codes -2
and -1 were consistently < 400 (fail), whereas the mean test scores associated with PLD
codes +1 and +2 were consistently > 400 (pass).
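For readers who wish to reproduce this kind of comparison, the sketch below computes a group mean with a 95% confidence interval and compares the mean to the 400-point cut-off. It is an illustration only: the study's figures were produced in SPSS, the normal-approximation interval used here is an assumption, the function name mean_with_ci is hypothetical, and the scores are invented solely to exercise the code.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

CUT_OFF = 400  # SOL pass/fail cut-off score stated in the text


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(scores)
    standard_error = stdev(scores) / sqrt(len(scores))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m, m - z * standard_error, m + z * standard_error


# HYPOTHETICAL scores keyed by PLD code, invented only to exercise the function.
example_scores = {
    -2: [352, 371, 365, 380],
    -1: [384, 395, 390, 399],
    +1: [412, 428, 431, 419],
    +2: [455, 470, 462, 481],
}

for code, scores in sorted(example_scores.items()):
    m, low, high = mean_with_ci(scores)
    verdict = "pass" if m >= CUT_OFF else "fail"
    print(f"PLD code {code:+d}: mean {m:.1f} [{low:.1f}, {high:.1f}] -> {verdict}")
```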
Figure 22
Comparison: SOL-Test Scores for Mathematics by PLD Code and Grade
Figure 23
Comparison: SOL-Test Scores for Reading by PLD Code and Grade
Figure 24
Comparison: SOL-Test Scores for Science by PLD Code and Grade
Analysis 2: SOL-Test Scores, SBRC Classes, and Grades
To what extent do SBRC Classes II, IIIm, and IVe predict whether a student will
pass the SOL tests for mathematics, reading, and science? In Figures 25, 26, and 27 the
mean SOL-test scores at ± 95% confidence intervals for mathematics, reading, and
science are plotted against the three SBRC Classes (II, IIIm, and IVe) and the three
grades (3, 4, and 5). (Again, a Class II SBRC uses codes -1 and +1; Class IIIm uses codes
-2, -1, and +1; and Class IVe uses codes -2, -1, +1 and +2.) Figure 25 shows that, in
grades 3 and 4, the mean scores for mathematics were above the cut-off for all types (>
400), but consistently higher in SBRC Class II than in Classes IIIm and IVe. In grade 5,
the mean scores for mathematics were higher in SBRC Class II than in Class IIIm, with
both of those groups > 400, while Class IVe fell below the cut-off. Figure 26 shows that
the mean SOL-test scores for reading in SBRC Classes II, IIIm, and IVe were
consistently above the cut-off (> 400), and that in all investigated grades, the mean scores
were consistently higher in Class II than in Classes IIIm and IVe. Figure 27 shows that, in
grade 5, the mean SOL-test score for science was above the cut-off in SBRC Classes II
and IVe, but below it in Class IIIm.
Figure 25
Comparison: SOL-Test Scores for Mathematics by SBRC Class and Grade
Figure 26
Comparison: SOL-Test Scores for Reading by SBRC Class and Grade
Figure 27
Comparison: SOL-Test Scores for Science by SBRC Class and Grade
Tables 11 to 16 present the results of a hierarchical mixed-model analysis with
both random and fixed effects. As detailed in “Multiple Nested Predictor Variables,” this
sort of analysis allows researchers to study models that vary at more than one level. A
helpful analogy may be to imagine each independent predictor variable as a slider that
can be manipulated up and down; the hierarchical mixed-model analysis performs a
mathematical regression that adjusts one slider up and down, while keeping the others
fixed, to see if the outcome changes. Then the model tests the results of manipulating the
next slider until every possible combination of positions is tested. If the researcher sees
no significant differences after adjusting a particular slider, one can say that the
represented independent predictor variable does not have a large impact on the dependent
outcome variable; on the other hand, if moving that slider notably changes the result,
there is something statistically significant going on. It is important to note that regression
analyses like the one performed in this section can reveal relationships, but do not
necessarily imply causality. Attributing cause requires careful justification (Freedman,
2009), and is not warranted here, but we can still establish an associative relationship—
namely, whether different types of SBRC are more or less predictive of student
performance in standardized testing.
So, to what extent do SBRC Classes II, IIIm, and IVe predict whether a student
will pass the SOL tests for mathematics? The low p-values (p < .001) derived from the
mixed-model analysis of the scores (shown in Tables 11 and 12) indicate that the observed
data would be improbable if there were no differences between groups, and thus reflect
actual differences in the test scores between the three SBRC classes. The effect sizes in
Table 12 indicate how strong the differences are between the marginal mean mathematics
scores across those three classes.
Table 11
Mixed-Model Analysis to Compare Mathematics Scores by SBRC Class and Grade
[Table body not recovered. Surviving column headers: Numerator df, Denominator df. Surviving row labels: SBRC Class, Grade (SBRC Class).]
Table 12
Post-Hoc Comparison: Marginal Mean Mathematics Scores between SBRC Classes
[Table body not recovered. Surviving column headers: Effect Size (Cohen’s d), 95% CI.]
Note: The cut-off score between pass and fail for Mx = 400
Based on Ferguson’s (2016) criteria, Cohen’s d = 0.60 [0.58, 0.61] showed a small
positive difference in the mean mathematics scores (M1 - M2 = 37.74) between Class II
(pass) and Class IIIm (pass). Cohen’s d = 0.74 [0.72, 0.75] also reflected a small positive
difference in the mean mathematics scores (M1 - M2 = 43.73) between Class II (pass) and
Class IVe (pass). Cohen’s d = 0.09 [0.08, 0.11] indicated that the mean difference in
mathematics scores (M1 - M2 = 6.00) between Class IIIm (pass) and Class IVe (pass) had
little or no practical significance. Ultimately, the conclusion is that SBRC Classes II,
IIIm, and IVe were equally effective predictors of whether a student would pass the SOL
tests for mathematics.
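For readers who wish to see how an effect size of this kind is obtained, the sketch below computes Cohen's d from two group summaries using a pooled standard deviation, together with a large-sample approximate 95% confidence interval. The group means, standard deviations, and sample sizes are hypothetical. The pooled-SD formula and the normal-approximation interval are common textbook choices assumed here; the values reported in this study came from the mixed model's marginal means and may have been computed differently.

```python
import math


def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference (M1 - M2) using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate CI for d from its large-sample standard error (assumed method)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


if __name__ == "__main__":
    # Hypothetical summaries for two report-card classes; not study data.
    d = cohens_d(440.0, 400.0, 60.0, 60.0, 1000, 1000)
    low, high = d_confidence_interval(d, 1000, 1000)
    print(f"d = {d:.2f} [{low:.2f}, {high:.2f}]")
```

Because d divides the raw difference by a standard deviation, a gap of several scaled-score points can still correspond to a small standardized effect when scores vary widely within each class.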
Next, the data were examined to determine to what extent SBRC Classes II,
IIIm, and IVe predict whether a student will pass the SOL tests for reading. The low
p-values (p < .001) derived from the mixed-model analysis of the reading scores in Tables
13 and 14 indicate that the data are incompatible with a model assuming no difference
between the classes, and instead reflect differences in the test scores between SBRC
Classes II, IIIm, and
IVe. As in the previous table, the effect sizes in Table 14 indicate how strong the
differences are between the marginal mean reading scores across the three SBRC Classes.
Table 13
Mixed Model to Compare Reading Scores by SBRC Class and Grade
[Table body not recovered. Surviving column headers: Numerator df, Denominator df. Surviving row labels: SBRC Class, Grade (SBRC Class).]
Table 14
Comparison: Marginal Mean Reading Scores between SBRC Classes
[Table body not recovered. Surviving column headers: Effect Size (Cohen’s d), 95% CI.]
Note: The cut-off score between pass and fail for Mx = 400
Based on Ferguson’s (2016) criteria, Cohen’s d = 0.54 [0.49, 0.58] reflected a
small positive difference in the mean reading scores (M1 - M2 = 34.25) between Class II
(pass) and Class IIIm (pass), and similarly indicated a small positive difference (Cohen’s
d = 0.48 [0.44, 0.52]) in the mean reading scores (M1 - M2 = 28.88) between Class II
(pass) and Class IVe (pass). The mean difference in reading scores (M1 - M2 = 6.00)
between Class IIIm (pass) and Class IVe (pass) had little or no practical significance
(Cohen’s d = 0.08 [0.02, 0.12]). The conclusion is that none of these SBRC classes
performed consistently better than the others at predicting whether a student would pass
the SOL tests for reading.
Lastly, to what extent are SBRC Classes II, IIIm, and IVe predictive of whether a
student will pass the SOL tests for science? Again, the low p-value (p < .001) derived
from the mixed-model analysis of the science scores in Table 15 indicates that the data
are incompatible with a model assuming no difference between the classes. The effect
sizes in Table 16 indicate how strong the differences are between the marginal mean
science scores across Classes II, IIIm, and IVe.
Table 15
Mixed Model to Compare Science Scores by SBRC Class and Grade
[Table body not recovered. Surviving column headers: Numerator df, Denominator df. Surviving row label: SBRC Class.]
Table 16
Comparison: Marginal Mean Science Scores Between SBRC Classes
[Table body not recovered. Surviving column headers: Effect Size (Cohen’s d), 95% CI.]
Note: The cut-off score between pass and fail for Mx = 400
Based on Ferguson’s (2016) criteria, Cohen’s d = 0.69 [0.66, 0.72] reflected a
small positive difference in the mean science scores (M1 - M2 = 66.05) between Class II
(pass) and Class IIIm (fail), and a practically significant negative difference (d = 0.39
[0.36, 0.42]) in the mean science scores (M1 - M2 = -29.09) between Class IIIm (fail) and
Class IVe (pass). In contrast, Cohen’s d = 0.24 [0.22, 0.27] indicated that the mean
difference in science scores (M1 - M2 = 16.29) between Class II (pass) and Class IVe
(pass) had little or no practical significance. The conclusion is that none of the three
SBRC types (Classes II, IIIm, and IVe) consistently outperformed the others in predicting
whether a student would pass the SOL tests for science.
Analysis 1 asked: To what extent are the SOL-test scores for mathematics,
reading, and science related to the standardized PLD codes (-2, -1, +1, +2) and the grades of
the students (3, 4, 5)? The conclusion is that the four standardized codes indicated a clear
division between pass and fail (above and below the cut-off score of 400); as one would
expect, the mean test scores associated with PLD codes -2 and -1 were consistently < 400
(fail), whereas the mean scores associated with PLD codes +1 and +2 were consistently >
400 (pass).
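The check described above can be sketched in a few lines of Python: group scores by standardized PLD code, take the mean of each group, and compare it with the cut-off of 400. The records below are hypothetical and serve only to show the mechanics, and the function names are illustrative rather than taken from the study.

```python
from collections import defaultdict
from statistics import mean

CUT_SCORE = 400  # pass/fail cut-off for SOL scores, as stated in the text

# Hypothetical (PLD code, SOL score) pairs; invented for illustration only.
records = [
    (-2, 340), (-2, 365), (-1, 380), (-1, 395), (-1, 372),
    (1, 420), (1, 445), (1, 410), (2, 480), (2, 505),
]


def mean_score_by_code(pairs):
    """Group scores by PLD code and return the mean score for each code."""
    grouped = defaultdict(list)
    for code, score in pairs:
        grouped[code].append(score)
    return {code: mean(vals) for code, vals in sorted(grouped.items())}


def codes_align_with_cut(means, cut=CUT_SCORE):
    """True when negative codes average below the cut and positive codes above it."""
    return all((m < cut) if code < 0 else (m > cut) for code, m in means.items())


if __name__ == "__main__":
    means = mean_score_by_code(records)
    print(means)
    print("Codes align with cut-off:", codes_align_with_cut(means))
```

The comparison is made on group means, so individual students with a negative code could still score above 400 without changing the result.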
According to this first analysis, regardless of grade level or subject area, and no
matter which of the three classes of SBRC (Classes II, IIIm, or IVe) were used in their
schools, children receiving PLDs within the Lm domain of “meets standard” consistently
passed the SOL tests, while students whose PLDs fell within the range of Ld—“does not
meet standard”—consistently failed the exams.
The second analysis asked: To what extent do SBRC Classes II, IIIm, and IVe
predict whether a student will pass the SOL tests? The data fit a nested or hierarchical
mixed model with both random and fixed effects. Across grades 3, 4, and 5, the mean
differences in SOL-test scores between the three SBRC classes did not reflect a clear
division between pass and fail. The effect sizes were close to or less than the
recommended minimum values indicating practical significance, leading to the
conclusion that combining the PLD code data into the SBRC classes did not consistently
predict whether a student would pass the SOL tests for mathematics, reading, and
science.
According to this second analysis, changing the standards-based report card
model used to assess a given student would have no impact on the likelihood they would
pass or fail the SOL tests, regardless of grade level or subject area. Consistent with ASA
guidelines (Wasserstein & Lazar, 2016; Wasserstein et al., 2019), the p-value is not
determinative when it comes to practical ramifications, and the effect size seen here,
describing the differences between the predictivity of the II, IIIm, and IVe groups, is not
of practical significance. From an educator's perspective, this signifies that there is no
real advantage or disadvantage, in terms of predicting SOL-test results, in choosing one
SBRC class over another.
As shown in Chapter II, over thirty years of research have consistently
demonstrated that traditional grades are generally ineffective predictors of student
performance on standardized tests (Greene, 2015); in contrast, the results of the first
quantitative analysis show that PLDs are much more predictive of test outcomes, and the
second one shows that PLDs remain effective predictors regardless of the type of SBRC
employed. In terms of practical effect, no given SBRC model in this study was shown to
be more predictive than another. Consequently, we must now turn to the rest of the
research about the impacts of grading on children to determine what models may be
better for learners than others—a topic we will explore in Chapter V.
Chapter V
Summary, Implications for Praxis, and Conclusion
A century of literature makes it clear that traditional grading “does not meet
standard.” While it is unquestionable that we need a new framework to replace the old
model, the efficacy of various standards-based report cards (SBRC) has not been robustly
researched. Consequently, educators have been forced to guess at the best designs for
SBRCs, and may have been making mistakes—potentially harmful ones—in the process.
The central purpose of this study was to examine the relationship between standards-
based report cards (SBRCs) with different quantities of performance level descriptors
(PLDs) and the scores received on standardized end-of-course tests by primary-grade
students in Virginia. This new research should provide some concrete answers to help
educators design and employ effective standards-based report cards.
Summary of the Study
This sequential, exploratory, transformative mixed methods study was divided
into five stages. Stage 1, a systematic review of the literature about grading, showed that
traditional grading practices are biased and ineffectual, while standards-based grading is
more effective and accurate. The second stage of the research saw the development of a
classification system for standards-based report cards, the Reeves Taxonomy of SBRC
Models. An exhaustive system of mutually-exclusive categories, this taxonomy allows
SBRCs and PLDs to be sorted and compared. In Stage 3, the taxonomy was applied to
extant and theoretical SBRCs, including both ones found during the literature review and
those used in the classrooms of the study population (elementary-age students in Virginia
in grades 3, 4, and 5). Stage 4 involved quantitative analyses (specifically analysis-of-
means testing and mixed-model analysis with multilevel regression) in order to test the
effects of various independent predictor variables—including SBRC class, PLD, subject
area, and grade level—on the dependent outcome variable, the Standards of Learning test
score. The fifth and final stage developed the findings into recommendations for praxis
for teachers and school leaders, and is the basis for Chapter V.
The following questions guided the research:
1. Is there a relationship between qualitative standards-based report card models
and students’ quantitative standardized test scores?
2. Are different SBRC models more or less predictive of student performance on
standardized tests?
3. Are certain SBRC models preferable?
The objectives of this study were:
to qualitatively classify different SBRCs into models based on the number and
nature of their performance level descriptors,
to quantitatively describe differences in the predictivity of different SBRC-
model classes, and
to make recommendations for actual practices in schools.
The first research question asked if there were a relationship between the
particular qualitative SBRC model used to grade a given student and that student’s
quantitative standardized test scores. To reach an answer, an analysis of means was
employed to determine the extent to which SOL-test scores for mathematics, reading, and
science were related to performance level descriptors and grade levels. This analysis
concluded that PLDs were consistently predictive of SOL-test scores, regardless of
subject, grade level, or SBRC class. Students whose PLD in a particular subject was in
the “meets standard” domain (Lm) consistently passed the associated SOL test, while
students whose PLD in a given area fell into the “does not meet standard” domain (Ld)
consistently failed the test.
The second research question asked if different SBRC models had varying
reliability as predictors. A mixed model was used, employing multilevel regression, to
determine to what extent SBRC class predicted whether a student would pass the SOL
test. The analysis concluded that there was no practical difference between SBRC
classes; while these report cards can predict a student’s performance, no given model
appeared to significantly influence SOL-test outcomes.
Together, these two analyses answer the first two research questions:
1. No, there is no relationship between SBRC models and students’ test scores.
While there is an association between a student’s PLDs and their SOL-test
performance, the report card’s Reeves classification itself is not a practically
significant predictor of SOL-test scores.
2. No, there is no practically significant difference in how predictive the SBRC
models investigated in this study are. Again, while PLD is predictive of SOL-
test score, SBRC class is not.
The results of those two analyses can now be used to answer the third research
question, which asked if certain models were preferable to the others. From a strictly
statistics-driven perspective, the initial answer might seem to be no, because none of the
models were more predictive than others in a practically significant way. Differences
between SBRC models were small: Students who received PLDs within a Class II SBRC
did better in mathematics and reading across grade levels, while grade 5 students who
received PLDs within a Class IIIm model did worse in science. However, these outcomes
cannot be deemed causative, as they can be accounted for by a variety of factors not
considered in this study (Freedman, 2009).
Given the lack of practically significant differences between models in terms of
predictivity, we must consider the totality of the evidence, including the literature cited in
Chapter II. That body of work goes into great detail describing the negative impacts of
grading generally, as well as the particular negative impacts inherent to ranking and
ordering children. With this context, established throughout this dissertation, the third
research question is answered:
3. Yes, SBRC models with fewer ranks and divisions are preferable to those with
more. As there is no practical difference in predictivity between SBRC
models, but reducing the ranking and classification of children is healthier for
them, it can be argued that best practices call for using as few PLDs as is
Implications for Practice
Iterative cycles are refining cycles: This generation of revolutionary praxis begets
the next generation of revolutionary theory. With each radical application of the best
science and thinking about children and learning, another wave of new analysis and
exploration emerges. To transform grading practices, theory must be translated into
action. Consequently, this dissertation concludes by qualitizing the quantitative, turning
the results of analysis into specific prescriptions for what teachers, educational leaders,
and policymakers can do differently and better. Thus can they enact positive change,
ensure better outcomes, and spur the next cycle of great thinking about student skill
mastery and assessment. This section covers three key areas of praxis: teaching and
learning, school-division leadership, and school–community communications.
Implications for Teaching and Learning
For nearly a century, child-centered educators have decried traditional grading
practices because they diminish student interest in learning, create challenge-avoidance,
and reduce the quality of student thinking (Kohn, 2011, 2015). The psychological
underpinnings of this are well established in A. Miller (1990), who described the
“poisonous pedagogy” (p. 80) of deliberately treating children harshly in order to toughen
them up. This includes the suppression of children’s natural tendencies toward
spontaneity, exploration, and creativity, and introduces tremendously harmful humiliation
and shaming, which the child experiences as real trauma (A. Miller, 1997). Olson (2009)
describes in detail the direct results of such deleterious policies and practices, including
the wound-like feelings of stigmatization, mortification, and worthlessness caused by
methods that rely on ranking, classifying, and punishing children—all of which are too-
often endemic to traditional grading systems. Whether one prefers Vygotsky (1978), who
believed learning was antecedent to development, or Piaget (1953), who believed the
converse, constructivist psychology indicates clearly that learning and development are
deeply interrelated, and psychological injury, oppression, and trauma are damaging to
both. Children who are labeled, ranked, classified, and rejected by peers and teachers
alike experience harm that can lead to significant psychosocial difficulties (García-Bacete
et al., 2018; Goleman, 2006).
Because traditional grading is inaccurate, and because ranking and shaming
children are so strongly correlated with negative outcomes, standards-based systems
should seek to maximize accuracy and minimize harm. Teachers employing SBG
practices should resolutely avoid hierarchical categorization of students as much as
possible. By inference, this suggests that grading frameworks should use the fewest
feasible levels of assessment while still allowing an instructor to accurately communicate
the state of a pupil’s progress toward skill mastery—which is, after all, the core purpose
of a report card (Allmain, 2013). While grades have been used by educators to plan future
instruction and by institutions to determine placement and promotion (Airasian, 1994),
these two functions are better accomplished by considering the constellation of data
incorporated in the summative grading process, rather than the reductive grade itself.
Consequently, appropriate standards-based grading can and should include a deep and
diverse set of meaningful data points, and standards-based report cards should use as few
descriptors as possible, to avoid attracting focus to the PLDs themselves. In fact, a well-
designed SBA system does not need a plethora of PLDs; rather than putting energy into
creating multiple levels of assessment, instructors—and their students—are better served
by thoughtfully developing meaningful content standards. If the rubric itself is made well,
and it is clear what students need to know and do in order to meet those standards, the
only distinction necessary is whether a student has attained mastery, or if they have as-yet
unfulfilled learning needs to address before meeting that content standard. Since this
study shows that there is no practically significant difference between SBRC models in
terms of their ability to predict scores, the author recommends using a Class-II Reeves
model, as using frameworks with more performance levels has no benefit—but does
bring the potential for confusion or harm.
Pedagogically, teachers should feel confident in their assessments and
observations, and be comfortable using the gathered data to plan and execute learning
strategies for their pupils. It should be kept in mind, however, that while instructors
employ various diagnostic instruments and skill checks to inform their teaching practices,
the results of those investigations do not necessarily mean anything to non-educators,
including students and parents. Thus, teachers should translate their data into meaningful
narrative feedback (Chamberlain et al., 2018; Guskey, 1996; D. Reeves et al., 2017),
which describes what the child needs to know in order to master a particular skill, and
should also prescribe steps or methods to approach that mastery. This sort of clear,
descriptive guidance has more practical utility than reductive symbols like letters and
numbers; teachers should avoid using labels and shorthand out of personal convenience,
and instead provide detailed assessments, which can be more labor-intensive, but provide
significantly more impactful benefits. While it can be tempting to attach a plus-sign to
“meets standard” to indicate performance above the required level, or replace a PLD with
a letter grade or integer, such practices reintroduce the problematic phenomena of
traditional grading, and must be avoided.
Implications for Assessment
Chapter IV explained that, in the broadest terms, PLDs fall into two categories:
those that describe a student’s performance as meeting the standard, and those that denote
a failure to do so. We can epistemologically improve upon the descriptors used by
applying a growth mindset (Dweck, 2007), instead saying that performance either
“meets” or “does not yet meet” standards. In this arrangement, the wording emphasizes
that a given student does indeed have the ability to achieve mastery, but also indicates
that they still bear responsibility for improving their performance.
While this may seem reasonable, one of the drivers of standards-based grading in
praxis is a shift toward moving PLDs away from an implied assessment of children
themselves, or even their performance. Teaching is a craft in which the instructor takes
full responsibility for meeting every one of each learner’s academic needs, and improved
assessment practices can and must be a central part of the solution to the problem of
schools that do not teach (K. D. Reeves, 2015). Collectively, teachers must remain
oriented toward what we can do for children, not what they can do for us. Consequently,
Lm and Ld PLDs should not be regarded as representing that a child “meets” or “does not
yet meet” the standard. Instead, the two levels of L should indicate that the teachers, as a
team, have either “done everything necessary to help the child reach mastery” or “have
not yet done so.” While it may seem, on its face, that this mindset obviates the
responsibilities of the learner and places them all on the teacher, this should not be seen
as an onus, but rather an acknowledgement of what we actually expect instructors to do.
With proper support, teachers can create environments and conditions that meet the
learning needs of every child.
As educators, when we conceive of ourselves as agents of student success and
understand teaching as service, rather than gatekeeping, we can allow ourselves to be
guided by a genuine desire to help children succeed, using every single resource at our
disposal to help every pupil thrive. When teachers do this, we can shift the concept of
“meets standard” and “does not meet standard” away from judgmental ranking and
classification, and the psychosocial and psychoemotional harms associated with those
practices. The shift from “not there yet” to “we have not yet helped them get there” does
require some modifications to PLDs, but there is support for such adjustments. Guskey
(2014), for example, indicts “exceeds standard” as inappropriate, citing evidence that it
confuses parents, frustrates teachers, alters both the denotative and connotative
interpretations of the standard, and ultimately corrupts the accuracy it seeks to engender
in assessment. By understanding the line of demarcation not as the border between
success and failure, but rather as indicating whether or not a child’s needs have been met,
moving from Ld to Lm becomes more than a simple academic goal—it is an ethical
imperative. Those considerations should give teachers great pause before
subdividing students whose learning needs have been met into hierarchical categories of
“needs met” and “needs really, really met.” Schools would be well served by eliminating
any form of “exceeds standard” and instead deploying Class II SBRCs with a sup(PL) of
Were a school willing to undertake a radical pedagogical transformation, the
elimination of standardized tests altogether should be considered. Because teacher-
assigned PLDs are reliable predictors of performance on a subsequent standardized test,
the converse also holds true: performance on tomorrow’s exam can be predicted by
today’s assessment. When one considers the problems endemic to high-stakes multiple-
choice standardized tests, the question naturally arises: “Does such testing serve a
pedagogical purpose?” If a standard is properly constructed, and the performance level
descriptor is accurately applied, further testing is redundant.
Unfortunately, standardized testing is deeply embedded, culturally and
pedagogically. After World War II, as the practice took hold, standardized tests and their
administrators became, as Sacks (1999) wrote, “gatekeepers of America’s meritocracy,”
labeling people as academically “worthy … or not worthy” (p. 5). As Starr (2017) stated,
“standardized testing is the system” (p. 72).
That system is rife with “ethical and moral dilemmas” (Counsell, 2007 as cited in
Himelfarb, 2019, p. 152), as practices undertaken under the guise of accountability yield
data of limited utility. While standardized achievement tests purport to provide valuable
information about students’ knowledge and ability, teachers rely not on standardized-test
scores, but rather the whole corpus of data they collect in the classroom. On the local
level, educators perceive state-mandated exams as yielding the least useful data of any
assessment tool (a belief that persists at both the primary and secondary levels);
moreover, the larger the school, the less the data were regarded as useful (Fairman et al.,
2018).
Students, too, fail to benefit from standardized testing; such tests are often
explicitly biased against girls (Kızılca, 2013), poor students, and students of color
(DeMatthews, 2021). Moreover, they generally devalue classical liberal arts fields like
art, music, history, and other such areas (American University, 2020). Unsurprisingly,
some students thus find these exams unfair (Himelfarb, 2019). And their perception does
not merely reflect those students’ self-interest; evidence has shown cases where these
tests are so significantly biased in so many ways as to be borderline illegal (Connor &
Vargyas, 1992). These biases reflect and perpetuate a wide range of inequities, including
those based on sex and gender (Faggen-Steckler et al., 1974; Kızılca, 2013),
socioeconomic status and economic class (Feldman, 2018), disability (The New Teacher
Project, 2018), race/ethnicity (Gillborn & Mirza, 2000), stereotyping (Walton & Spencer,
2009), positive self-image (Marsh et al., 2005), and even obesity (MacCann & Roberts,
Teachers are also negatively impacted by standardized testing. Expectations for
student performance, set by administrators and legislatures, put pressure on instructors,
who feel pushed into teaching explicitly to the test regardless of the needs of learners; the
desire to avoid punitive measures aimed at teachers or institutions whose students fail to
meet standards drives risk-avoidant behavior—much akin to that exhibited by students
seeking to avoid failing marks (American University, 2020; Butler & Nisan, 1986; Kohn,
2011, 2015; Long, 2015; Pulfrey et al., 2011). In Fairman et al. (2018), schools indicated
that the information gained from standardized tests was not useful—and even when the
data were of value, results were often received too late to be used for making
instructional decisions. As a result, little credence was given to those test results; in
Maine, for example, over 99% of schools did not rely on state-mandated standardized
testing to provide needed data, but instead used an alternative assessment mechanism.
However, states like Virginia—where this study was conducted—have a strong central
authority characterized by a top-down system of accountability (Ruff, 2019), which
requires and enforces standardized testing. Significant change would be needed, at the
state level, before high-stakes end-of-year exams could finally be replaced by useful,
pedagogically sound assessments such as truly effective standards-based report cards.
It is unrealistic, though, to expect that schools will end standardized testing in the
near term. While there are sound ethical, philosophical, and pedagogical reasons to move
away from those examinations, political and cultural considerations ensure that they will
be with us for some time. Thus, it is even more imperative that effective, research-based
assessments are used in the classroom, both to better guide student learning and to inform
instructors as they make decisions.
The question for educators and administrators then becomes: “Which model is
best?” Findings showing that different SBRC models are equally predictive of
standardized test scores (and therefore practically the same in terms of accuracy) must be
further refined by examining them in light of established psychosocial research, which
shows that the more we grade and rank children, the more harm accrues to them. While
future research should be conducted to evaluate the validity of these findings in practice,
one potential implication of this research may be that standardized tests are not requisite
for schools to accurately report student skill mastery.
Implications for School–Community Communications
When referring to the CDC’s early recommendations about mask-wearing during
the pandemic, Saad Omer, Director of the Yale Institute of Global Health, said “They got
a good grade, probably an A, in terms of biological science when they came out with
some of those recommendations, but they got an incomplete at best on behavioral
science” (McCammon & Inskeep, 2019, 5:35). Research is important, but without
successfully communicating the findings to decision-makers, the results are of limited
utility. Unfortunately, this discourse can be hampered by the fact that researchers and
policy-makers often have competing interests. Simply publishing studies like this one and
speaking from a position as an expert and academic will not sway a skeptic who is
invested in a contrary position, or whose experiences and perceptions run counter to what
the study suggests. Communication must precede understanding and agreement, but there
are hurdles which must be overcome.
There are two competing models which seek to explain why people can struggle
to comprehend complex phenomena: the deficit model, which presumes that information
deficiency underlies ignorance, and the contextual model, which holds that understanding
exists in a social context, which includes personal and shared beliefs (Gross, 1994;
McDivitt, 2016). In the past decade, researchers have increasingly found that the deficit
model is irredeemably flawed (Simis et al., 2016); it assumes that, given enough high-
quality information, the scientifically ignorant public can eventually be persuaded
(McDivitt, 2016). Gross (1994) cleverly observed that this assumes “public deficiency,
but scientific sufficiency” (p. 116), and indicted the model for failing to speak to people
authentically. Rather than attempting to persuade, a speaker following the deficit model
tries to convince; this works when all parties broadly agree on priorities and values, but
fails miserably when there is emotional or social opposition. Simply bombarding the
public with a flood of jargon-filled data often fails to effectively communicate, and
indeed is often counterproductive, as the relentless—and one-way—flow of information
can be overwhelming, and seem haughty; an expert’s certainty can come across as smug
When truly trying to help the public understand and accept complex phenomena
and ideas—like a revolution in assessment and pedagogy—contemporary scholars prefer
the contextual model, which allows beliefs, attitudes, and perceptions a place in the
conversation. While these subjective elements should not be considered valid in the same
sense as scientific truth, acknowledging them allows all participants to engage in a
discussion authentically, without interpreting difference or opposition as antagonism or
attack. Applying this model to serious and complex things that impact people’s lives,
such as understanding the education of their children, means that ensuring the audience has
genuine understanding is paramount (Moser & Dilling, 2004). S. Miller (2001) declared
it critical to consider what he called “social context and lay knowledge” (p. 115) when
trying to engage the public, and speakers must actively avoid the perception that they are
“talking down” to their audience.
Educators seeking to convince stakeholders, be they policy-makers or community
members, should consider the work of Habermas (1983, 1991), a philosopher, ethicist,
and expert in discourse who deeply understood the importance of human autonomy
(Young, 1990). His idea of communicative rationality held that when everyone speaks
truthfully to each other, authentic argument—and agreement—ensue. This requires using
what he called normative language, speaking and writing as unambiguously as possible
(1983). Imprecise or inauthentic language inevitably leads to misunderstanding and
mistrust, and undermines the recipient’s ability—indeed, their human right—to perceive
reality. Habermas understood that communication is not merely an exchange of words,
but of ideas, and that a high degree of accountability must be included in discourse in the
public sphere; we should apply this high standard to ourselves as educators when
speaking to the community we serve. Rather than attempting to dictate change, it is wise
to keep in mind what Habermas (1991) himself said: the “unforced force of the better
argument prevails” (p. 159). An expert cannot force their argument down the proverbial
throat of the audience and expect a positive, prosocial result that includes behavioral change.
And given how innately damaging traditional grading is to children, we cannot
but desire that change in every aspect of public education. Consequently, we as educators
must invest ourselves deeply in communicating with—not at—our constituent
communities, engaging with them specifically and locally to address their concerns and
answer their questions about the radical transformation involved in eliminating traditional
grading altogether. This must be done authentically, keeping the receiving audience
squarely in mind; the cultural worldviews and internal landscapes of the people we are
trying to persuade are inextricably woven into their perceptions of what we say—and
their view of how we say it. Simply put, effecting change requires the engagement and
support of our community stakeholders. As Moser and Dilling (2004)
wrote, they “must be made to feel a part of a larger collective that can successfully tackle
the problem” (as cited in McDivitt, 2016).
In order to successfully win that community support, the concerns of both parents
and teachers must be addressed. Regarding parental reluctance, an article aptly titled
“Vocal and Vehement” enumerates five major themes underpinning parents’ resistance to
standards-based grading: traditional grading is a known and trusted system; there is
strong aversion to and anxiety about switching to an unknown methodology; poor past
communication has led to distrust of and disappointment with educators; there is
confusing and unclear information about SBG; and there is a feeling of dismay about a
new system which opponents believe results in lesser outcomes (Franklin et al., 2016).
Even when these fears are unfounded, it is important to treat them seriously and
respectfully, while considering the root causes of those concerns. For example, it can be
detrimental to articulate the harms associated with traditional grading; this implies that
trusted and familiar systems were somehow damaging, that parents were “taught wrong”
when they were children. While this might indeed be true, broaching the topic in that
fashion carries a risk of creating an adversarial relationship between the school and the
parent. Instead, focusing on the benefits of standards-based assessment and citing the
many examples of its positive impact on both students and teachers is likely a better
approach. Maintaining a consistent message that communicates the advantages is key,
and the message to families should be crystal-clear: standards-based grading is better for
you and your student. Explaining why, in a local context, is essential to ensuring parents
understand and accept the paradigm shift.
Teachers, too, sometimes need persuading. Despite the evidence against
hodgepodge systems, some instructors remain loyal adherents of traditional grading
practices (Barnes & Buring, 2012; Walker, 2016). Educators are not immune to the
powerful gravity of tradition; however, it is possible to shift their attitudes if they are
properly approached. Teachers' support need not be secured before changing to SBG: Knight
and Cooper (2019) found ample evidence that effectively implementing standards-based
instruction leads directly to better outcomes, and that teacher beliefs about grading can be
moved in a progressive direction once they observe those benefits in praxis (Bonner &
Chen, 2021; Chen & Bonner, 2017; Kunnath, 2017).
Recommendations for Future Research
Researchers who desire to meaningfully ask and accurately answer research
questions should be cautioned against the use of outdated and reductive techniques;
similarly, instructors who teach quantitative research methods at the university level
should be wary of relying upon simple null-hypothesis significance testing and p-values
when helping students investigate complex topics such as learning. Additionally, the
psychosocial and psychoemotional impacts that certain policies and practices have on
primary-age children should be given far more consideration in teacher-preparation
courses and tertiary research programs. Too often, pedagogy and child development fall
by the wayside in favor of tried-and-true curricula, even when contemporary research—
such as that cited in this dissertation—demonstrates that perhaps those curricula are more
tired than true. What teachers do in the classroom may not be consistent with best
practices established by research.
Regarding future research, the application of the Reeves Taxonomy can and
should lead to further investigation of SBRC models, both inside and outside of the
Commonwealth of Virginia, including confirmatory studies with larger data sets as well
as further investigations to determine if there are other, as-yet-unexplored differences
between the models.
From an equity perspective, identifying sex- and race-based biases in the
application of performance level descriptors is well worthwhile, both within individual
models and between them. As mentioned in Chapter I, the writings of Tannenbaum et al.
(2016) and Andrews et al. (2019) strongly advise future researchers to collect—and
school divisions to provide—sex and race/ethnicity data to facilitate meaningful
explorations of impacts on at-risk communities. While information obtained for this study
was limited due to the pandemic, future unhindered researchers should actively investigate
questions of equity as they relate to SBRCs.
Concluding Remarks
The idea of schools without letter grades is almost inconceivable, given that they
have been mainstays of public education for so long. However, as a review of the
literature demonstrates, so-called traditional grading is actually a relatively recent
invention, and it is replete with problems for children, teachers, and parents alike.
Standards-based assessment seeks to eliminate those deficiencies by following research-
supported practices which bring greater accuracy and less discrimination, helping
children achieve desired outcomes in a way informed by an understanding of their
psychosocial and emotional needs. Over the past three decades, several authors have
significantly advanced the cause of standards-based assessment (Guskey, 1994; Marzano,
1998), but few of the assumptions underlying SBRC designs have been rigorously
researched. Chief among those assumptions, as detailed in Chapter II, was that there was
no measurable advantage attributable to using any specific number of PLDs in a given
SBRC. Since this dissertation establishes that for the classes tested, there was no practical
difference between SBRC models in terms of their ability to predict students’ grades on
end-of-course standardized tests, educators need not prefer systems with greater numbers
of PLDs, even those frequently referenced in literature and practice. To the contrary,
teachers who are familiar with research showing the harmful nature of systems that rank
and classify children should prefer to use the fewest possible number of PLDs. Additional
research which further tests the models classified in the taxonomy is warranted to expand
upon this finding.
Educators have an ethical imperative to be accurate in their assessment of
children, but they must also demonstrate love for their pupils by guiding, rather than
harming them (K. D. Reeves, 2015), while actively seeking to meet their needs (Maslow,
1954). Because psychological research, as reviewed in Chapter II, has shown that
traditional grading methods can significantly undermine students’ motivation, learning,
and quality of thinking, effectively ghettoizing children into worthy and unworthy castes
(Diaz-Loza, 2015), educators must avoid such classification whenever possible.
Consequently, it is reasonable to expect educators and policymakers to use the fewest
possible performance level descriptors to accomplish their goals. While a mindset shift
for many stakeholders may be necessary, and a great deal of work is required to
reconceive and redesign assessment and grading for students, the benefits outweigh the
References
Adrian, C. A. (2012). Implementing standards-based grading: Elementary teachers'
beliefs, practices and concerns [Doctoral dissertation]. (ProQuest No. 3517370).
Washington State University.
Airasian, P. (1994). Classroom assessment (2nd ed.). McGraw Hill.
Aitken, E. N. (2016). Grading and reporting student learning. In S. Scott, D. Scott, & C.
Webber (Eds.), Assessment in Education. The Enabling Power of Assessment,
Vol. 2 (pp. 231–260). Springer, Cham.
Allen, J. D. (2005). Grades as valid measures of academic achievement of classroom
learning. The Clearing House: A Journal of Educational Strategies, Issues and
Ideas, 78(5), 218–223.
Allen, M. (Ed.). (2017). The SAGE encyclopedia of communication research methods.
SAGE Publications.
Allmain, G. (2013, October 30). School district defends and explains standards-based
grading. Federal Way Mirror.
Altman, D. G., & Royston, P. (2006). The cost of dichotomising continuous variables.
British Medical Journal, 332(7549), 1080.
American University. (2020, July 2). Effects of standardized testing on students &
teachers: Key benefits & challenges. American University School of Education
Online Programs.
Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical
significance. Nature, 567(7748), 305–307.
Anderson, J. (2017, August 17). The case for eliminating letter grades, according to a
school with too many straight-A students. Quartz.
Andrade, H. L., & Brown, G. T. L. (2016). Student self-assessment in the classroom. In
G. T. L. Brown & L. R. Harris (Eds.), Handbook of human and social conditions
in assessment (pp. 319–334). Routledge.
Andrews, K., Parekh, J., & Peckoo, S. (2019). How to embed a racial and ethnic equity
perspective in research: Practical guidance for the research process (Working
paper). Child Trends.
Arikan, S., Kilman, S., Abi, M., & Üstünel, E. (2019). An example of empirical and
model based methods for performance descriptors: English proficiency test.
Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi, 10(3), 219–234.
Arlington Public Schools. (n.d.). Standards-based assessment.
Arter, J., & Busick, K. (2001). Practice with student-involved classroom assessment.
Assessment Training Institute.
Athens City Schools. (n.d.). Standards based reporting in ACS. https://www.acs-
Bailey, K. D. (1994). Typologies and taxonomies: An introduction to classification
techniques. SAGE Publications.
Balfanz, R., Herzog, L., & MacIver, D. J. (2007). Preventing student disengagement and
keeping students on the graduation path in urban middle-grades schools: Early
identification and effective interventions. Educational Psychologist, 42(4), 223–
Banditwattanawong, T., & Masdisornchote, M. (2021a). Norm-referenced achievement
grading: Methods & comparisons. In E. Hassanien, A. Slowik, V. Snášel, H. El-
Deeb, & F. M. Tolba (Eds.), Proceedings of the International Conference on
Advanced Intelligent Systems and Informatics 2020 (pp. 159–170).
Banditwattanawong, T., & Masdisornchote, M. (2021b). On characterization of norm-
referenced achievement grading schemes toward explainability and selectability.
Applied Computational Intelligence and Soft Computing, 2021.
Barnes, K. D., & Buring, S. M. (2012). The effect of various grading scales on student
grade point averages. American Journal of Pharmaceutical Education, 76(3), 41.
Bartholomew, K. J., Ntoumanis, N., Mouratidis, A., Katartzi, E. S., Thøgersen-Ntoumani,
C., & Vlachopoulos, S. (2018). Beware of your teaching style: A school-year long
investigation of controlling teaching and student motivational experiences.
Learning and Instruction, 53, 50–63.
Beatty, I. D. (2013). Standards-based grading in introductory university physics. Journal
of the Scholarship of Teaching and Learning, 13(2), 1–22.
Bellingham Public Schools. (n.d.). Frequently asked questions—Standards based
Benziger, K. (2004). Thriving in mind: The art and science of using your whole brain.
Berman, E. A. (2017). An exploratory sequential mixed methods approach to
understanding researchers' data management practices at UVM: Integrated
findings to develop research services. Journal of eScience Librarianship, 6(1).
Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in
Education: Principles, Policy & Practice, 25(6), 551–575.
Bloom, B. S. (1956). Taxonomy of educational objectives: The classification of
educational goals. Longmans, Green.
Bloom, B. S. (1964). Stability and change in human characteristics. John Wiley & Sons.
Bloom, B. S. (1968). Learning for mastery. Evaluation Comment, 1(2), 1–12. Regional
Educational Laboratory for the Carolinas and Virginia.
Bloom, B. S. (1971). Mastery learning. In J. H. Block (Ed.), Mastery learning: Theory
and practice. Holt, Rinehart & Winston.
Bokas, A. (2018, July 17). Communicating student progress: What works and what
doesn’t in standards-based report cards. ASCD InService.
Bond, L. A. (1996). Norm- and criterion-referenced testing. Practical Assessment,
Research, and Evaluation, 5(5), 1–3.
Bond, T., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental
measurement in the human sciences. Taylor & Francis.
Bonner, S. M., & Chen, P. P. (2021). Development and validation of the survey of
unorthodox grading beliefs for teachers and teacher candidates. Journal of
Psychoeducational Assessment, 39(6), 746–760.
Boone, W. J., & Staver, J. R. (2020). Advances in Rasch analyses in the human sciences.
Borgès Da Silva, R. (2013). Taxonomie et typologie: est-ce vraiment des synonymes?
Santé publique, 25(5), 633–637.
Bradley, E. H., Curry, L. A., & Devers, K. J. (2007). Qualitative data analysis for health
services research: Developing taxonomy, themes, and theory. Health Services
Research, 42(4), 1758–1772.
Brandstätter, E., & Kepler, J. (1999). Confidence intervals as an alternative to
significance testing. Methods of Psychological Research Online, 4(2), 33–46.
Broekkamp, H., & van Hout-Wolters, B. (2007). The gap between educational research
and practice: A literature review, symposium, and questionnaire. Educational
Research and Evaluation, 13(3), 203–220.
Brookhart, S. M. (1991). Grading practices and validity. Educational Measurement:
Issues and Practice, 10(1), 35–36.
Brookhart, S. M. (1993). Teachers' grading practices: Meaning and values. Journal of
Educational Measurement, 30(2), 123–142.
Brookhart, S. M. (2011). Starting the conversation about grading. Educational
Leadership, 69(3), 10–14.
Brookhart, S. M. (2013). Grading. In J. H. McMillan (Ed.), SAGE handbook of research
on classroom assessment (pp. 257–271). SAGE Publications.
Brookhart, S. M. (2017). How to give effective feedback to your students (2nd ed.).
Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L.
F., Stevens, M. T., & Welsh, M. E. (2016). A century of grading research:
Meaning and value in the most common educational measure. Review of
Educational Research, 86(4), 803–848.
Brookhart, S. M., & Nitko, A. (2008). Assessment and grading in classrooms. Pearson.
Brown, R. B. (2006). Doing your dissertation in business and management: The reality
of research and writing. SAGE Publications.
Buckmiller, T. M., & Peters, R. E. (2018). Getting a fair shot? School Administrator,
75(2), 22–25.
Buckmiller, T. M., Peters, R. E., & Kruse, J. (2017). Questioning points and percentages:
Standards-based grading (SBG) in higher education. College Teaching, 64(4),
Bulger, M., McCormick, P., & Pitcan, M. (2017). The legacy of inBloom (Working
Paper). Data & Society.
Burkhardt, A. L. (2020). Parents' perception of standards-based grading practices versus
norm-referenced grading practices (Publication no. 27836846) [Doctoral
dissertation, Wilmington University]. ProQuest Dissertations Publishing.
Bushaw, W. J., & Gallup, A. M. (2008). Americans speak out—are educators and policy
makers listening: The 40th annual Phi Delta Kappa/Gallup poll of the public’s
attitude toward the public schools. Phi Delta Kappan, 90(1), 8–20.
Butler, R., & Nisan, M. (1986). Effects of no feedback, task-related comments, and
grades on intrinsic motivation and performance. Journal of Educational
Psychology, 78(3), 210–216.
Cameron, L. J. (2020). The formative potential of standards-based grades and report
cards (Publication No. 13699) [Doctoral dissertation, Durham University].
Canady, R. L., Canady, C. E., & Meek, A. (2017). Beyond the grade: Refining practices
that boost student achievement. Solution Tree.
Canfield, M. L., Kivisalu, T. M., Van Der Karr, C., King, C., & Phillips, C. E. (2015).
The use of course grades in the assessment of student learning outcomes for
general education. SAGE Open, 5(4).
Carlin, J. B. (2016). Is reform possible without a paradigm shift? The American
Statistician, 901(10 Suppl).
Carter, A. B. (2016). Best practices for leading a transition to standards-based grading
in secondary schools (Publication No. ED568036) [Doctoral dissertation, Walden
Chamberlain, K., Yasué, M., & Chiang, I.-C. A. (2018). The impact of grades on student
motivation. Active Learning in Higher Education, 1-16.
Chen, P. P., & Bonner, S. M. (2017). Teachers' beliefs about grading practices and a
constructivist approach to teaching. Educational Assessment, 22(1), 18–34.
Close, D. (2009). Fair grades. Teaching Philosophy, 32(4), 361–398.
Close, D. (2014). Reflections on fair grades. In E. Esch, K. Hermberg, & R. E. Kraft, Jr.
(Eds.), Philosophy through teaching (pp. 189–197). American Association of
Philosophy Teachers.
Colby, S. A. (1999). Grading in a standards-based system. Educational Leadership,
56(6), 52–55.
Collier, D., LaPorte, J., & Seawright, J. (2012). Putting typologies to work: Concept
formation, measurement, and analytic rigor. Political Research Quarterly, 65(1),
Colnerud, G. (2006). Teacher ethics as a research problem: Syntheses achieved and new
issues. Teachers and Teaching, 12(3), 365–385.
Connor, K., & Vargyas, E. J. (1992). The legal implications of gender bias in
standardized testing. Berkeley Journal of Gender, Law & Justice, 7(1), 13–89.
Coombs, A., DeLuca, C., & LaPointe-McEwan, D. (2018). Changing approaches to
classroom assessment: An empirical study across teacher career stages. Teaching
and Teacher Education, 71, 134–144.
Cornwell, C., Mustard, D. B., & Parys, J. V. (2013). Noncognitive skills and the gender
disparities in test scores and teacher assessments: Evidence from primary school.
Journal of Human Resources, 48(1), 236–264.
Counsell, S. (2007). What happens when veteran and beginner teachers’ life histories
intersect with high-stakes testing and what does it mean for learners and teaching
practice: The making of a culture of fear [Doctoral dissertation, University of
Northern Iowa].
Cox, K. B. (2011). Putting classroom grading on the table: A reform in progress.
American Secondary Education, 40(1), 67–87.
Craig, T. A. (2011). Effects of standards-based report cards on student learning [Doctoral
dissertation, Northeastern University].
Cremin, L. A. (1970). American education: The colonial experience, 1607–1783. Harper
& Row.
Creswell, J. W. (2003). Research design: Qualitative, quantitative, and mixed methods
approaches (2nd ed.). SAGE Publications.
Creswell, J. W. (2012). Educational research: Planning, conducting and evaluating
quantitative and qualitative research (4th ed.). Pearson.
Creswell, J. W., & Creswell, J. D. (2017). Research design: Qualitative, quantitative, and
mixed methods approaches. SAGE Publications.
Creswell, J. W., & Plano Clark, V. L. (2018). Designing and conducting mixed methods
research (3rd ed.). SAGE Publications.
Crocker, J. (2002). The costs of seeking self-esteem. Journal of Social Issues, 58(3),
Crooks, A. D. (1933). Marks and marking systems: A digest. Journal of Educational
Research, 27(4), 259–272.
Cross, L. H., & Frary, R. B. (1999). Hodgepodge grading: Endorsed by students and
teachers alike. Applied Measurement in Education, 12(1), 53–72.
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better
questions. Zeitschrift für Psychologie, 217, 15–26.
Dawson, N. V., & Weiss, R. (2012). Dichotomizing continuous variables in statistical
analysis. Medical Decision Making, 32(2), 225–226.
Day, D. (2011). Membership categorization analysis. In C. A. Chapelle (Ed.), The
Encyclopedia of Applied Linguistics. Wiley-Blackwell.
Delaney, H., Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median splits and
spurious statistical significance. Psychological Bulletin, 113(1), 181–190.
DeLuca, C., Valiquette, A., Coombs, A., LaPointe-McEwan, D., & Luhanga, U. (2018).
Teachers’ approaches to classroom assessment: A large-scale survey. Assessment
in Education: Principles, Policy & Practice, 25(4), 355–375.
DeMatthews, D. (2021, March 15). Standardized testing amid pandemic does kids more
harm than good. University of Texas at Austin.
d'Erizans, R. (2020). A correlational study of standards-based grades, traditional grades,
and achievement [Doctoral dissertation, Northeastern University].
Derksen, J. (2014, May 15). Power failure. Mr. Math Man.
Diaz-Loza, A. M. (2015, May 7). Humans of New York, Nadia Lopez & Marcus Foster.
Urban Schools in Historical Perspective.
Discovery Elementary School. (2021). Standards-based assessment.
Dueck, M. (2014). The problem with penalties. Educational Leadership, 71(6), 44–48.
Durm, M. W. (1993). An A is not an A is not an A: A history of grading. The
Educational Forum, 57(3), 294–297.
Dweck, C. S. (2007). Mindset: The new psychology of success. Ballantine.
Dwyer, L. M., & Parker, C. E. (2014). A primer for analyzing nested data: multilevel
modeling in SPSS using an example from a REL study (REL 2015–046). U.S.
Department of Education, Institute of Education Sciences, National Center for
Education Evaluation and Regional Assistance, Regional Educational Laboratory
Northeast & Islands.
Edmondson, A. C. (1999). Psychological safety and learning behavior in work teams.
Administrative Science Quarterly, 44(2), 350–383.
Edmondson, A. C., & Lei, Z. (2014). Psychological safety: The history, renaissance, and
future of an interpersonal construct. Annual Review of Organizational Psychology
and Organizational Behavior, 1, 23–43.
Edwards, N., & Richey, H. G. (1947). The school in the American social order.
Houghton Mifflin.
Egalite, A. J., Fusarelli, L. D., & Fusarelli, B. C. (2017). Will decentralization affect
educational inequity? The Every Student Succeeds Act. Educational
Administration Quarterly, 53(5), 757–781.
Elliott, S. N., Kratochwill, T. R., Littlefield Cook, J., & Travers, J. (2000). Educational
psychology: Effective teaching, effective learning (3rd ed.). McGraw-Hill.
Elsinger, J., & Lewis, D. (2020). Applying a standards-based grading framework across
lower level mathematics courses. PRIMUS, 30(8), 885–907.
Erickson, J. A. (2011a). Grading practices: The third rail. Principal Leadership, 10(7),
Erickson, J. A. (2011b). How grading reform changed our school. Educational
Leadership, 69(3), 66–70.
Faggen-Steckler, J., McCarthy, K. A., & Tittle, C. K. (1974). A quantitative method for
measuring sex "bias" in standardized tests. Journal of Educational Measurement,
11(3), 151–161.
Fairman, J., Johnson, A., Mette, I., Wickerd, G., & LaBrie, S. (2018, April). A review of
standardized testing practices and perceptions in Maine. Maine Education Policy
Research Institute.
Family Educational Rights and Privacy Act, 20 U.S.C. § 1232 et seq. (1974).
Feldman, J. (2018). School grading policies are failing children: A call to action for
equitable grading. Crescendo Education Group.
Feldman, J. (2019, January 23). What traditional classroom grading gets wrong.
Education Week.
Feldmesser, R. A. (1971, February). The positive functions of grades [Paper
presentation]. American Educational Research Association Conference, New
York, New York.
Ferguson, C. J. (2016). An effect size primer: A guide for clinicians and researchers. In
A. E. Kazdin (Ed.), Methodological Issues and Strategies in Clinical Research
(pp. 301–310). American Psychological Association.
Fernandez, A., Malaquias, C., Figueiredo, D., da Rocha, E., & Lins, R. (2019). Why
quantitative variables should not be recoded as categorical. Journal of Applied
Mathematics and Physics, 7(7), 1519–1530.
Fetters, M. D., Curry, L. A., & Creswell, J. W. (2013). Achieving integration in mixed
methods designs - Principles and practices. Health Services Research, 48(6–2),
Fidler, F., & Loftus, G. R. (2009). Why figures with error bars should replace p-values:
Some conceptual arguments and empirical demonstrations. Journal of
Psychology, 217(1), 27–37.
Filho, D. B. F., Paranhos, R., da Rocha, E. C., Batista, M., da Silva, J. A., Santos, M. L.,
& Marino, J. G. (2013). When is statistical significance not significant? Brazilian
Political Science Review, 7(1), 31–35.
Finkelstein, I. (1913). The marking system in theory and practice. Warwick & York.
Fisher, D., Frey, N., & Pumpian, I. (2011). No penalties for practice. Educational
Leadership, 69(3), 46–51.
Fleenor, A., Lamb, S., Anton, J., Stinson, T., & Donen, T. (2011). The grades game.
Principal Leadership, 11(6), 48–52.
Fofana, F., Bazeley, P., & Regnault, A. (2020). Applying a mixed methods design to test
saturation for qualitative data in health outcomes research. PLOS One, 15(6),
Article e0234898.
Fraenkel, J. R., & Wallen, N. E. (2018). How to design and evaluate research in
education (10th ed.). McGraw-Hill.
Francis, E. (2017, May 9). What is depth of knowledge? ASCD InService.
Franklin, A., Buckmiller, T., & Kruse, J. (2016). Vocal and vehement: Understanding
parents' aversion to standards-based grading. International Journal of Social
Science Studies, 4(11), 19–29.
Frechette, A. (2017). The impact of a standards-based approach on student motivation
(Publication No. 141). [Doctoral dissertation, Northeastern University].
Frederickson, L. G. (2017). Parents/guardians' perceptions of elementary standards-
based report cards [Doctoral dissertation, North Carolina State University].
Freedman, D. A. (2009). Statistical models: Theory and practice. Cambridge University
Freire, P. (1970). Pedagogy of the oppressed. Continuum.
Freytas, H. (2017). A collaborative & systematic approach to implementing an effective
standards-based grading and reporting system: A change leadership plan
(Publication No. 281) [Doctoral dissertation, Northeastern University].
García-Bacete, F. J., Marande-Perrin, G., Schneider, B. H., & Cillessen, A. H. N. (2018).
Children's awareness of peer rejection and teacher reports of aggressive behavior.
Psychosocial Intervention, 28(1), 37–47.
Garson, G. D. (2016). Case study analysis & QCA. Statistical Associates Publishers.
Gay, L. S. (2017). What can one learn by measuring assessments two ways: Traditional
grading scale versus standards-based grading method? (Publication No. 4356)
[Master’s thesis, Hamline University].
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical
models. Cambridge.
Gershenson, S. (2018). Grade inflation in high schools (2005–2016). Thomas B.
Fordham Institute.
Gillborn, D., & Mirza, H. S. (2000). Educational inequality: Mapping race, class and
gender. A synthesis of research evidence. Office for Standards in Education.
Giroux, H. A. (1997). Pedagogy and the politics of hope: Theory, culture and schooling.
Westview Press.
Given, L. M. (2008). The SAGE encyclopedia of qualitative research methods (Vol. 2).
SAGE Publications.
Goleman, D. (2006). Social intelligence: Beyond IQ, beyond emotional intelligence.
Bantam Books.
Gordon, M., & Fay, C. (2010). The effects of grading and teaching practices on students’
perceptions of fairness. College Teaching, 58, 93–98.
Goss-Sampson, M. A. (2020). Statistical analysis in JASP: A guide for students. JASP
Grassi, P. A., Fenton, J. L., Newton, E. M., Perlner, R. A., Regenscheid, A. R., Burr, W.
E., Richer, J. P., Lefkovitz, N. B., Danker, J. M., Choong, Y.-Y., Greene, K. K.,
& Theofanos, M. F. (2017). Digital identity guidelines: Authentication and
lifecycle management (NIST Special Publication 800-63B).
Great Schools. (2016). Verifying proficiency performance indicators. Great Schools.
Green, S. K., Johnson, R. L., Kim, D., & Pope, N. S. (2007). Ethics in classroom
assessment practices: Issues and attitudes. Teaching and Teacher Education,
23(7), 999–1011.
Greene, G. L. (2015). An analysis of the comparison between classroom grades earned
with a standards-based grading system and grade-level assessment scores as
measured by the Missouri Assessment Program (ProQuest No. 3737088)
[Doctoral dissertation, Lindenwood University]. ProQuest Dissertations
Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework
for mixed-method evaluation designs. Educational Evaluation and Policy
Analysis, 11, 255–274.
Greer, W. (2018). The 50 year history of the common core. The Journal of Educational
Foundations, 31(3–4), 100–117.
Gregory, R. J. (2011). Psychological testing: History, principles, and applications (6th
ed.). Allyn & Bacon.
Greig, C. (2006). A is the best one [Master's thesis, MacAlester University].
Gross, A. G. (1994). The roles of rhetoric in the public understanding of science. Public
Understanding of Science, 3(1), 3–23.
Guskey, T. R. (1994). Making the grade: What benefits students. Educational
Leadership, 52(2), 14–20.
Guskey, T. R. (1996). Reporting on student learning: Lessons from the past –
Prescriptions for the future. In T. R. Guskey (Ed.), Communicating student
learning: 1996 ASCD yearbook. (pp. 13–24). ASCD.
Guskey, T. R. (2000). Fixing grading policies that undermine standards. National
Association of Secondary Schools Principals Bulletin, 84(1), 20–29.
Guskey, T. R. (2001). Helping standards make the grade. Educational Leadership, 59(1),
Guskey, T. R. (2004). The communication challenge of standards-based reporting. Phi
Delta Kappan, 86(4), 326–328.
Guskey, T. R. (2005, April). Formative classroom assessment and Benjamin S. Bloom:
Theory, research, and implications [Paper presentation]. American Educational
Research Association, Montreal, Canada.
Guskey, T. R. (2007). Multiple sources of evidence: An analysis of stakeholders'
perceptions of various indicators of student learning. Educational Measurement:
Issues and Practice, 26(1), 19–27.
Guskey, T. R. (2011). Five obstacles to grading reform. Educational Leadership, 69(3),
Guskey, T. R. (2013a). Bound by tradition: Teachers' views of crucial grading and
reporting issues. Journal of Educational Research and Policy Studies, 13(1), 32–
Guskey, T. R. (2013b). The case against percentage grades. Educational Leadership,
71(1), 68–72.
Guskey, T. R. (2014, October 17). Why the label 'exceeds standard' doesn't work.
Education Week.
Guskey, T. R. (2015). On your mark: Challenging the conventions of grading and
reporting. Solution Tree.
Guskey, T. R. (2020). Breaking up the grade. Educational Leadership, 78(1), 40–46.
Guskey, T. R., & Bailey, J. M. (2001). Developing grading and reporting systems for
student learning. Corwin.
Guskey, T. R., & Bailey, J. M. (2010). Developing standards-based report cards.
Guskey, T. R., & Brookhart, S. M. (Eds.). (2019). What we know about grading: What
works, what doesn't, and what's next. ASCD.
Guskey, T. R., & Jung, L. A. (2012). Four steps in grading reform. Principal Leadership,
13(4), 23–28.
Guskey, T. R., & Link, L. J. (2017, April 27). Grades represent achievement and
"something else": Analysis of the non-achievement factors teachers consider in
determining students' grades [Paper presentation]. American Educational
Research Association Conference, San Antonio, TX.
Guskey, T. R., & Link, L. J. (2019). Exploring the factors teachers consider in
determining students' grades. Assessment in Education: Principles, Policy &
Practice, 26(3), 303–320.
Guskey, T. R., & Pigott, T. D. (1988). Research on group-based mastery learning
programs: A meta-analysis. Journal of Educational Research, 81(4), 197–216.
Guskey, T. R., Swan, G. M., & Jung, L. A. (2011). Grades that mean something:
Kentucky develops standards-based report cards. Phi Delta Kappan, 93(2), 52–
Gutek, G. L. (1986). Education in the United States: An historical perspective. Prentice
Habermas, J. (1983). Discourse ethics: Notes on a program of philosophical
justification. MIT Press.
Habermas, J. (1991). Moral consciousness and communicative action. MIT Press.
Hamilton, L. S., Stecher, B. M., & Yuan, K. (2008). Standards-based reform in the
United States: History, research, and future directions [White paper]. RAND
Hanover Research. (2014). The impact of formative assessment and learning intentions
on student achievement.
Hardegree, A. C. (2012). Standards-based assessment and high stakes testing: Accuracy
of standards-based grading (Publication No. 594) [Doctoral dissertation, Liberty
Harrisburg School District. (2016). Harrisburg Middle Schools standards-based grading.
Harrison, M. A., Meister, D. G., & LeFevre, A. J. (2011). Which students complete extra
credit work? College Student Journal, 45(3), 550–555.
Harrison, R. L., Reilly, T. M., & Creswell, J. W. (2020). Methodological rigor in mixed
methods: An application in management studies. Journal of Mixed Methods
Research, 14(4), 473–495.
Hayat, M. J., Staggs, V. S., Schwartz, T. A., Higgins, M., Azuero, A., Budhathoki, C.,
Chandrasekhar, R., Cook, P., Cramer, E., Dietrich, M. S., Garnier-Villarreal, M.,
Hanlon, A., He, J., Hu, J., Kim, M., Mueller, M., Nolan, J. R., Perkhounkova, Y.,
Rothers, J., … Ye, S. (2019). Moving nursing beyond p < .05. International
Journal of Nursing Studies, 42(4), 244–245.
Hedden, H. (2016). The accidental taxonomist (2nd ed.). Information Today.
Henry, D. T. (2018). Standards-based grading: The effect of common grading criteria on
academic growth (Publication No. 10871501) [Doctoral dissertation, Bowling
Green State University]. ProQuest Dissertations Publishing.
Henry, M. A., Shorter, S., Charkoudian, L., Heemstra, J. M., & Corwin, L. A. (2018).
FAIL is not a four-letter word: A theoretical framework for exploring
undergraduate students' approaches to academic challenge and responses to
failure in STEM learning environments. CBE: Life Sciences Education, 18(11),
Himelfarb, I. (2019). A primer on standardized testing: History, measurement, classical
test theory, item response theory, and equating. Journal of Chiropractic
Education, 33(2), 151–165.
Hoekstra, R., Johnson, A., & Kiers, H. A. L. (2012). Confidence intervals make a
difference: Effects of showing confidence intervals on inferential reasoning.
Educational and Psychological Measurement, 72(6), 1039–1052.
Hooper, J., & Cowell, R. (2014). Standards-based grading: History adjusted true score.
Educational Assessment, 19(1), 58–76.
Hurlbert, S. H., Levine, R. A., & Utts, J. (2019). Coup de grâce for a tough old bull:
Statistical significance expires. The American Statistician, 73(Supp. 1), 352–357.
Iamarino, D. (2014). The benefits of standards-based grading: A critical evaluation of
modern grading practices. Current Issues in Education, 17(2), 1–10.
James, A. R. (2018). Grading in physical education. Journal of Physical Education,
Recreation & Dance, 89(5), 5–7.
Jonsson, A. (2014). Rubrics as a way of providing transparency in assessment.
Assessment & Evaluation in Higher Education, 39(7), 840–852.
Jung, L. A., & Guskey, T. R. (2007). Standards-based grading and reporting: A model for
special education. Teaching Exceptional Children, 40(2), 48–53.
Jung, L. A., & Guskey, T. R. (2011). Fair & accurate grading for exceptional learners.
Principal Leadership, 12(3), 32–37.
Kafka, T. (2016, January). A list of non-cognitive assessment instruments. Columbia
University Teachers College Community College Research Center.
Kahn, W. A. (1990). Psychological conditions of personal engagement and
disengagement at work. Academy of Management Journal, 33(4), 692–724.
Kalin, M. (2017). Letter grades deserve an 'F': A recommendation for updating school
report cards. Cognoscenti.
Ketch, P. (2019). Standards-based report cards and effects on student achievement
[Doctoral dissertation, Centenary University].
Kifer, E. (2001). Large-scale assessment: Dimensions, dilemmas, and policy. Corwin.
Kincheloe, J. L. (2001). Getting beyond the facts: Teaching social studies/social science
in the twenty-first century. Peter Lang.
Kincheloe, J. L. (2005). Critical constructivism primer. Peter Lang.
Kincheloe, J. L. (2008). Knowledge and critical pedagogy. Springer.
Kirschenbaum, H., Simon, S. B., & Napier, R. W. (1971). Wad-ja-get?: The grading
game in American education. Hart.
Kızılca, F. K. (2013). Standardized test-based student selection and gender differences in
academic achievement. iŞGÜÇ: The Journal of Industrial Relations and Human
Resources, 15(4), 102–115.
Klapp, A. (2015). Does grading affect educational attainment? A longitudinal study.
Assessment in Education: Principles, Policy & Practice, 22(3), 302–323.
Klapp, A. (2019). Differences in educational achievement in norm- and criterion-
referenced grading systems for children and youth placed in out-of-home care in
Sweden. Children and Youth Services Review, 99, 408–417.
Knight, M., & Cooper, R. (2019). Taking on a new grading system: The interconnected
effects of standards-based grading on teaching, learning, assessment, and student
behavior. National Association of Secondary School Principals Bulletin, 103(1),
Kock, N. (2015). Hypothesis testing with confidence intervals and P-values (Report).
ScriptWarp Systems.
Kohn, A. (2011). The case against grades. Educational Leadership, 69(3), 28–33.
Kohn, A. (2015). The myth of the spoiled child: Coddled kids, helicopter parents, and
other phony crises. Beacon Press.
Kulik, C. L., Kulik, J. A., & Bangert-Drowns, R. L. (1990). Effectiveness of mastery
learning programs: A meta-analysis. Review of Educational Research, 60(1), 265–
Kunnath, J. P. (2016). A critical pedagogy perspective of the impact of school poverty
level on the teacher grading decision-making process [Doctoral dissertation,
California State University, Fresno].
Kunnath, J. P. (2017). Teacher grading decisions: Influences, rationale, and practices.
American Secondary Education, 45(3), 68–88.
Kuss, O. (2013). The danger of dichotomizing continuous variables: A visualization.
Teaching Statistics, 35(2), 78–79.
Lau, A. M. S. (2016). ‘Formative good, summative bad?’ – A review of the dichotomy in
assessment literature. Journal of Further and Higher Education, 40(4), 509–525.
Lehman, E., De Jong, D., & Baron, M. (2018). Investigating the relationship of
standards-based grades vs. traditional-based grades to results of the Scholastic
Math Inventory at the middle school level. Education Leadership Review of
Doctoral Research, 6, 1–16.
Livingston, K., & Hutchinson, C. (2017). Developing teachers’ capacities in assessment
through career-long professional learning. Assessment in Education: Principles,
Policy & Practice, 24(2), 290–307.
Lok, B., McNaught, C., & Young, K. (2015). Criterion-referenced and norm-referenced
assessments: Compatibility and complementarity. Assessment & Evaluation in
Higher Education, 41(3), 450–465.
Long, C. (2015, August 19). Are letter grades failing our students? NEA Today.
Lund, J., & Shanklin, J. (2011). The impact of accountability on student performance in a
secondary physical education badminton unit. Physical Educator, 68(4), 210–220.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of
dichotomization of quantitative variables. Psychological Methods, 7(1), 19–40.
MacCann, C., & Roberts, R. D. (2005). Just as smart but not as successful: Obese
students obtain lower school grades but equivalent test scores to nonobese
students. International Journal of Obesity, 37(1), 40–46.
Manley, A. L. (2019). Implementation of standards-based grading at the middle school
level [Doctoral dissertation, Arkansas Tech University].
Marsh, H. W., Trautwein, U., Lüdtke, O., Köller, O., & Baumert, J. (2005). Academic
self-concept, interest, grades, and standardized test scores: Reciprocal effects
models of causal ordering. Child Development, 76(2), 397–416. https://www-
Martin, A. (2004). Can’t any body count? Counting as an epistemic theme in the history
of human chromosomes. Social Studies of Science, 34, 923–948.
Marzano, R. J. (1998). Models of standards implementation: Implications for the
classroom. Mid-continent Regional Educational Laboratory.
Marzano, R. J. (2000). Transforming classroom grading. ASCD.
Marzano, R. J. (2006). Classroom assessment & grading that work. ASCD.
Marzano, R. J. (2009). Formative assessment & standards-based grading. Solution Tree.
Marzano, R. J., & Heflebower, T. (2011). Grades that show what students know.
Educational Leadership, 69(3), 34–39.
Maslow, A. H. (1943). A theory of human motivation. Psychological Review, 50(4), 370–
Maslow, A. H. (1954). Motivation and personality. Harper and Row.
Mathena, A. A. (2017). Understanding the impact of online grading and standards-based
report cards: A phenomenological study on teacher instruction at the elementary
level [Doctoral dissertation, Liberty University].
McCammon, S., & Inskeep, S. (2019, July 29). News brief: Infrastructure deal, push for
vaccinations, extreme weather. NPR.
McDivitt, P. (2016). The information deficit model is dead. Now what? Evaluating new
strategies for communicating anthropogenic climate change in the context of
contemporary American politics, economy, and culture [Master's thesis,
University of Colorado].
McLaren, P. (1998). Life in schools: An introduction to critical pedagogy in the
foundations of education (3rd ed.). Longman.
McTighe, J., & Wiggins, G. (2005). Understanding by design (2nd ed.). ASCD.
Meaghan, D. E., & Casas, F. R. (2004). Bias in standardized testing and the misuse of test
scores: Exposing the Achilles heel of educational reform. In M. Moll (Ed.),
Passing the test: The false promises of standardized testing (pp. 35–50). Canadian
Centre for Policy Alternatives.
Melrose, S. (2017). Pass/fail and discretionary grading: A snapshot of their influences on
learning. Open Journal of Nursing, 7(2), 185–192.
Mertens, D. M. (2003). Mixed methods and the politics of human research: The
transformative-emancipatory perspective. In A. Tashakkori & C. Teddlie (Eds.),
Handbook of mixed methods in social and behavioral research (pp. 135–164).
SAGE Publications.
Michael, R. D., Webster, C., Patterson, D., Laguna, P., & Sherman, C. (2016). Standards-
based assessment, grading, and professional development of California middle
school physical education teachers. Journal of Teaching in Physical Education,
35(3), 277–283.
Mild, T. L. (2018). A study of elementary educators' perceptions and experiences related
to the implementation process of the Responsive Classroom approach [Doctoral
dissertation, Youngstown State University].
Miller, A. (1990). For your own good: Hidden cruelty in child-rearing and the roots of
violence (3rd ed.). Farrar, Straus, and Giroux.
Miller, A. (1997). The drama of the gifted child: The search for the true self (2nd ed.).
Basic Books.
Miller, S. (2001). Public understanding of science at the crossroads. Public
Understanding of Science, 10(1), 115–120.
Moloi, M., & Kanjee, A. (2018). Beyond test scores: A framework for reporting
mathematics assessment results to enhance teaching and learning. Pythagoras:
Journal of the Association for Mathematics Education of South Africa, 39(1),