A META-ANALYSIS OF TESTING ACCOMMODATIONS FOR STUDENTS WITH
DISABILITIES: IMPLICATIONS FOR HIGH-STAKES TESTING
__________
A Dissertation
Presented to
the Faculty of the Morgridge College of Education
University of Denver
__________
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
__________
by
Michelle Vanchu-Orosco
November 2012
Advisor: Dr. Kathy Green
©Copyright by Michelle Vanchu-Orosco 2012
All Rights Reserved
Author: Michelle Vanchu-Orosco
Title: A META-ANALYSIS OF TESTING ACCOMMODATIONS FOR STUDENTS WITH
DISABILITIES: IMPLICATIONS FOR HIGH-STAKES TESTING
Advisor: Dr. Kathy Green
Degree Date: November 2012
Abstract
Test accommodations are designed to ensure the comparability of test scores between students with disabilities and their typically developing counterparts by eliminating as much construct-irrelevant variance and construct-irrelevant difficulty as possible. Although those involved in test creation endeavor to create tests with suitable accommodations for students with disabilities, there is a lack of consensus regarding accommodation efficacy. Using meta-analysis and meta-regression to summarize previous research, this study examined whether test accommodations differentially boost test scores of students with disabilities, and whether accommodated conditions provided a more effective and valid assessment of students with disabilities. Results from the meta-analysis of 34 studies (119 effect sizes) lend support to the differential boost hypotheses, whereby students with disabilities (ES = 0.30, k = 62, p < 0.001) are positively impacted by test accommodations while their typically developing peers (ES = 0.17, k = 57, p < 0.001) gain little from test accommodations.

Presentation assessment accommodations (ES = 0.22, k = 41, p < 0.001) had a small statistically significant impact on the performance of students with disabilities, while use of timing/scheduling accommodations (ES = 0.47, k = 17, p < 0.001) had a small, bordering on medium, statistically significant impact on these students. The effect for presentation accommodations intensified when narrowing the focus to students with learning disabilities (ES = 0.36, k = 23, p < 0.001) but not for timing/scheduling accommodations (ES = 0.48, k = 13, p < 0.001). Overall results for setting (k = 1) and response (k = 3) accommodations were not available as there were too few studies for an overall comparison.

The results of meta-regression analyses examining the effects of assessment accommodations on test scores for students with disabilities showed that 42% of the heterogeneity in test score could be explained by an overall model examining population description, test characteristic, results dissemination, and researcher-manipulated (test accommodation effect size for students with disabilities) variables. Population description and test characteristic variable sets explained the greatest amounts of variability for mean increase in test score, R² = 0.22 and R² = 0.35, respectively; researcher-manipulated variable (test accommodation) and research dissemination explained little variance, R² = 0.07 and R² = 0.01, respectively.
Acknowledgments
I wish to thank my dissertation committee for their support and guidance through
the process of conducting this research. I appreciate the kind assistance of Dr. Kathy
Green, my dissertation advisor and mentor, Dr. Monica Martinussen, who was always
available to provide guidance as I struggled through the many stages that make up a
meta-analysis and meta-regression, and Dr. Karen Riley, who guided me through the
ever-changing field of special education. I owe thanks to Dr. Nick Cutforth, Antonio
Olmos-Gallo, and Dr. Don Bacon who provided their support for this endeavor. I am
grateful to Amelia Dinache for her hard work on the inter-rater reliability portion of this
dissertation. I am indebted to Bridget Julian for her eyes and editing skills, helping at
very short-notice. Finally, I wish to thank my wonderful family, friends, and my husband,
Ty, for their love and support during my years as a graduate student.
Table of Contents
Abstract .............................................................................................................................. ii
Acknowledgments ............................................................................................................ iv
List of Tables .................................................................................................................... ix
List of Figures .................................................................................................................. xii
Chapter One ...................................................................................................................... 1
Rationale ............................................................................................................................. 1
Problem Statement .............................................................................................................. 5
Purpose of the Study ........................................................................................................... 7
Research Hypotheses .......................................................................................................... 9
Null hypotheses. ............................................................................................................ 10
Review of the Literature ................................................................................................... 11
Students with disabilities. ............................................................................................. 11
Educational legislation and students with disabilities. ................................................. 12
Assessment inclusion for students with disabilities. ................................................. 16
Calls for inclusion in assessments......................................................................... 16
Impact of exclusion from assessment programs. .................................................. 21
A brief history of inclusion in high-stakes assessment programs. ........................ 25
Accommodations for students with disabilities. ........................................................... 34
Types of accommodations. ....................................................................................... 37
Primary studies of the effectiveness of accommodations. ........................................ 40
Syntheses of the literature on the effectiveness of accommodations. ....................... 44
Synthesis studies of the effectiveness of accommodations....................................... 56
Test accommodation interaction hypothesis and differential boost. ......................... 60
Gaps in the literature. .................................................................................................... 67
Meta-regression............................................................................................................. 71
Delimitations ..................................................................................................................... 73
Definitions......................................................................................................................... 75
Summary ........................................................................................................................... 82
Chapter Two .................................................................................................................... 84
Method .............................................................................................................................. 84
Purpose of the current study. ........................................................................................ 84
Research Hypotheses ........................................................................................................ 84
Meta-analysis .................................................................................................................... 85
Criteria for selection of studies. .................................................................................... 86
Substantive inclusion criteria. ................................................................................... 87
Methodological inclusion criteria. ............................................................................ 88
Categorization of test accommodation research. ...................................................... 89
Exclusion criteria. ..................................................................................................... 91
Overview of the selection process. ........................................................................... 93
Search strategy. ............................................................................................................. 95
Computerized database searches............................................................................... 96
Overview and results of the search process. ............................................................. 99
Coding and classifying study variables....................................................................... 102
Dependent and non-independent effect sizes. ......................................................... 106
Coding characteristics of operational definitions. .................................................. 107
Issues of reliability throughout the coding process................................................. 109
Statistical methods of analysis. ................................................................................... 111
Methods for calculating independent effect sizes. .................................................. 112
Accounting for variance in the distribution of effect sizes. .................................... 115
Outlier analysis. .................................................................................................. 115
Analysis of the homogeneity of variance and the distribution of effect size. ..... 119
Sensitivity analysis.............................................................................................. 124
Publication bias analysis. .................................................................................... 127
Meta-regression............................................................................................................... 130
Rationale for meta-regression. .................................................................................... 130
Statistical methods of analysis. ................................................................................... 131
Meta-regression limitations. ....................................................................................... 136
Meta-regression methodological issues. ..................................................................... 140
Chapter Three ............................................................................................................... 143
Results ............................................................................................................................. 143
Demographics for studies as the unit of analysis. ....................................................... 143
Results for the Meta-analyses ......................................................................................... 156
Meta-analysis research hypotheses. ............................................................................ 156
Study as the unit of analysis: Description of effect size. .................................... 157
Substudy as the unit of analysis: Description of effect size. ............................... 159
Research hypothesis 1. ............................................................................................ 161
Study as the unit of analysis: Research hypothesis 1 results. ............................. 162
Students with disabilities. ................................................................................... 163
Typically developing students. ........................................................................... 166
Substudy as the unit of analysis: Research hypothesis 1 results. ........................ 171
Students with disabilities. ................................................................................... 171
Typically developing students. ........................................................................... 174
Comparison of results between students with disabilities and typically developing
students. .............................................................................................................. 178
Study and substudy as the unit of analysis: A comparison. ................................ 179
Research hypothesis 2 results. ................................................................................ 184
Accommodation category: Research hypothesis 2 results. ................................. 184
Presentation accommodations. ............................................................................ 185
Timing/scheduling accommodations. ................................................................. 187
Comparison of accommodation categories. ........................................................ 191
Specific accommodation category: Research hypothesis 2 results. .................... 192
Read-aloud accommodation................................................................................ 193
Extended-time accommodation. ......................................................................... 194
Ancillary analysis: Students with learning disabilities versus students requiring
special education services. .................................................................................. 195
Students with learning disabilities. ..................................................................... 196
Students receiving special education services. ................................................... 200
Results for the Meta-regression Analyses....................................................................... 204
Meta-regression research hypothesis. ......................................................................... 207
Research hypothesis 3 results. ................................................................................ 207
Effect of test accommodation on test scores for students with disabilities. ........ 208
Effect of timing and presentation accommodations on test scores for students with
disabilities. .......................................................................................................... 210
Effect of timing and presentation accommodations on test scores for students with
learning disabilities. ............................................................................................ 212
Test accommodations, construct irrelevance, and effect size. ............................ 214
Chapter Four ................................................................................................................. 215
Discussion ....................................................................................................................... 215
Summary of findings................................................................................................... 216
Meta-analysis. ......................................................................................................... 216
Differential boost. ............................................................................................... 217
Presentation test accommodations. ..................................................................... 219
Timing/scheduling test accommodations. ........................................................... 221
Meta-regression....................................................................................................... 222
Effect of test accommodation on test scores for students with disabilities. ........ 222
Effect of timing and presentation accommodations on test scores for students with
disabilities. .......................................................................................................... 223
Effect of timing and presentation accommodations on test scores for students with
learning disabilities. ............................................................................................ 224
Relation of results of this study to research in the field. ............................................. 226
Issues in Meta-analysis ................................................................................................... 229
Issues in Meta-regression ................................................................................................ 234
Limitations ...................................................................................................................... 236
Conclusion ...................................................................................................................... 238
Suggestions for Future Research .................................................................................... 244
Policy implications...................................................................................................... 250
References ...................................................................................................................... 255
Appendices ..................................................................................................................... 268
Appendix A ..................................................................................................................... 268
Appendix B ..................................................................................................................... 270
Appendix C ..................................................................................................................... 271
Appendix D ..................................................................................................................... 273
Appendix E ..................................................................................................................... 312
Appendix F...................................................................................................................... 328
Appendix G ..................................................................................................................... 332
Appendix H ..................................................................................................................... 334
Appendix I ...................................................................................................................... 342
Appendix J ...................................................................................................................... 343
Appendix K ..................................................................................................................... 345
Appendix L ..................................................................................................................... 351
Appendix M .................................................................................................................... 352
Appendix N ..................................................................................................................... 353
Appendix O ..................................................................................................................... 355
Appendix P...................................................................................................................... 357
Appendix Q ..................................................................................................................... 359
Appendix R ..................................................................................................................... 360
Appendix S...................................................................................................................... 361
List of Tables
Table 1: Types of Assessment Accommodations ............................................................... 39
Table 2: Outlier Analysis for Effect Size Estimates - Study as the Unit of Analysis ....... 117
Table 3: Outlier Analysis for Effect Size Estimates - Substudy as the Unit of Analysis . 118
Table 4: Sensitivity Analysis for Research Hypothesis 1 - ES Estimates, Confidence Intervals, & Significance ................................................................................................ 125
Table 5: Sensitivity Analysis for Research Hypothesis 2 - ES Estimates, Confidence Intervals, & Significance ................................................................................................ 126
Table 6: Study Demographics - Publication and Research Information ........................ 144
Table 7: Study Demographics - Participant Information ............................................... 146
Table 8: Study Demographics - Accommodation Type x Grade Level and Disability
Classification .................................................................................................................. 149
Table 9: Study Demographics - Assessment Information ............................................... 151
Table 10: Study Demographics - Individuals w/ Learning Disabilities & Individuals
Receiving Special Education........................................................................................... 154
Table 11: Number of Effect Sizes by Research Approach & Design (Unit of Analysis =
Study) .............................................................................................................................. 158
Table 12: Substudy Sample Size Based on Total Number of Effect Sizesᵃ ...................... 159
Table 13: Number of Effect Size Estimates by Research Approach & Design (Unit of Analysis = Substudy) ...................................................................................................... 160
Table 14: Substudy Sample Size Based on Total Number of Effect Sizesᵃ ...................... 161
Table 15: Comparison Between Students With and Without Disabilities - ES Estimates, Confidence Intervals, & Q-statisticsᵃ ............................................................................. 163
Table 16: Comparison Between Students With and Without Disabilities - ES Estimates, Confidence Intervals, & Q-statisticsᵃ ............................................................................. 171
Table 17: Comparison of Effect Size Estimates with Study and Substudy as Unit of
Analysis - Students with Disabilities ............................................................................... 181
Table 18: Comparison of Effect Size Estimates with Study and Substudy as Unit of
Analysis - Typically Developing Students ....................................................................... 182
Table 19: Comparison Between Accommodations for Students with Disabilities - ES, Confidence Intervals, & Q-statistics ............................................................................... 184
Table 20: Comparison between Specific Accommodations - ES, Confidence Intervals, & Q-statistics ...................................................................................................................... 193
Table 21: Comparison between Students with Learning Disabilities & Receiving Special Education Services - ES, Confidence Intervals, & Q-statistics ...................................... 195
Table 22: Comparison between Accommodations (Students with Learning Disabilities) - ES Estimates, Confidence Intervals, & Q-statistics ........................................................ 196
Table 23: Comparison Between Accommodations (Students Receiving Special Education Services) - ES, Confidence Intervals, & Q-statistics ....................................................... 201
Table 24: Random-effects Model for Students with Disabilities - All Data .................... 208
Table 25: Random-effects Model for Students with Disabilities - All Data .................... 210
Table 26: Random-effects Model for Students with Disabilities - Timing & Presentation
Accommodation Data Only ............................................................................................. 210
Table 27: Random-effects Model for Students with Disabilities - Timing & Presentation
Accommodation Data Only ............................................................................................. 212
Table 28: Random-effects Model for Students with Learning Disabilities - Timing & Presentation Accommodation Data Onlyᵃ ...................................................................... 213
Table 29: Random-effects Model for Students with Learning Disabilities - Timing & Presentation Accommodation Data Onlyᵃ ...................................................................... 214
Table 30: Random Effects Model for Students with Disabilities - Statistically Significant
Variables Only ................................................................................................................ 359
Table 31: Random Effects Model for Students with Disabilities - Overall Model without
Test Accommodation ....................................................................................................... 359
Table 32: Random Effects Model for Students with Disabilities - Statistically Significant
Variables Only ................................................................................................................ 360
Table 33: Random Effects Model for Students with Disabilities - Overall Model without
Test Accommodation ....................................................................................................... 360
Table 34: Random Effects Model for Students with Disabilities - Statistically Significant
Variables Only ................................................................................................................ 361
Table 35: Random Effects Model for Students with Learning Disabilities - Timing & Presentation Accommodation Data Onlyᵃ ...................................................................... 361
List of Figures
Figure 1: Prevalence rates of students with disabilities, by disability type, 1977 – 2006. . 2
Figure 2: Publication Bias for the Random-Effects Model ............................................ 129
Figure 3: Forest Plot of Effect Size Estimates for Students with Disabilities – Study as
the Unit of Analysis ........................................................................................................ 165
Figure 4: Forest Plot of Effect Size Estimates for Typically Developing Students – Study
as the Unit of Analysis .................................................................................................... 167
Figure 5: Graph of Hedges' g Effect Size Estimates for Students with Disabilities
Compared to Typically Developing Students - Study as Unit of Analysis..................... 169
Figure 6: Forest Plot of Effect Size Estimates for Students with Disabilities – Substudy
as the Unit of Analysis .................................................................................................... 173
Figure 7: Forest Plot of Effect Size Estimates for Typically Developing Students –
Substudy as the Unit of Analysis .................................................................................... 175
Figure 8: Graph of Hedges' g Effect Size Estimates for Students with Disabilities
Compared to Typically Developing Students - Substudy as Unit of Analysis ............... 177
Figure 9: Forest Plot of Effect Size Estimates for Presentation Accommodations ........ 186
Figure 10: Forest Plot of Effect Size Estimates for Timing/Scheduling Accommodations
......................................................................................................................................... 188
Figure 11: Graph of Hedges' g Effect Size Estimates for Presentation Accommodations
Compared to Timing/Scheduling Accommodations ....................................................... 190
Figure 12: Forest Plot of Effect Size Estimates for Read-Aloud Accommodations for
Students with Learning Disabilities ................................................................................ 198
Figure 13: Forest Plot of Effect Size Estimates for Extended-Time Accommodations for
Students with Learning Disabilities ................................................................................ 199
Figure 14: Forest Plot of Effect Size Estimates for Read-Aloud Accommodations for
Students Receiving Special Education Services ............................................................. 202
Figure 15: Histograms for Study as the Unit of Analysis .............................................. 343
Figure 16: Histograms for Substudy as the Unit of Analysis ......................................... 344
Figure 17: Effect Size Estimates by Weights - Study Level (all data) ........................... 345
Figure 18: Effect Size Estimates by Weights - Study Level (students with disabilities) 346
Figure 19: Effect Size Estimates by Weights - Study Level (typically developing
students) .......................................................................................................................... 347
Figure 20: Effect Size Estimates by Weights - Substudy Level (all data) ..................... 348
Figure 21: Effect Size Estimates by Weights - Substudy Level (students with disabilities)
......................................................................................................................................... 349
Figure 22: Effect Size Estimates by Weights - Substudy Level (typically developing
students) .......................................................................................................................... 350
Chapter One
Rationale
The No Child Left Behind Act of 2001 (Public Law 107-110), generally referred
to as NCLB, was enacted to ensure that all students learn. Consequently, in an effort to
understand what students have learned, there has been an increase in the measurement of
student achievement, coupled with an increased emphasis on the assessment of all
students. States wishing to receive federal funding for their schools have been required to
create assessments of basic skills and to test all of their students at certain, predetermined
grades. The assessments provide one component for the Adequate Yearly Progress (AYP)
reports necessary to ensure funding for schools. Thus, “…the goal [of high-stakes testing]
has changed from differentiated standards for a small elite and the larger masses to one of
high standards for all students” (Linn, 2001, p. 31, emphasis added). This change in
direction has led to standardized, high-stakes testing of increasingly larger numbers of
special education students.
Concurrently, with the increased emphasis on the assessment of all students, the
number of students identified as requiring special education services has increased. In
1977, just over 8% of the total student population was receiving special education
services. By 2006 this figure rose to nearly 14% (Dillon, 2007), with approximately
13.5% in K–12 schools receiving special education services (Figure 1; Dillon, 2007).
Students with learning disabilities comprise the largest group of students with disabilities, at approximately 6% of the total student population, and represent a diverse population with a wide range of skill strengths and deficits (Fuchs, Fuchs, & Capizzi, 2005). This trend appears to be continuing with recent increases in the identification of
2005). This trend appears to be continuing with recent increases in the identification of
children with disabilities, such as autism, receiving national coverage in the popular
news; e.g., The New York Times article on ‘autism guru’ Andrew Wakefield (Dominus,
2011).
Note: Data are for selected years: 1976-77, 1990-91, and 1995 through 2006 (Dillon, 2007)
Figure 1: Prevalence rates of students with disabilities, by disability type, 1977 – 2006.
Students requiring special education services are often referred to as students with
special needs, students with disabilities, disabled students, or differently-abled students.
Students with disabilities include students who are visually impaired (including
blindness), hearing impaired (including deafness), cognitively impaired (including mental
retardation), physically/orthopedically impaired (e.g., cerebral palsy, spina bifida),
speech or language impaired, seriously emotionally disturbed (e.g., attention deficit
disorder (ADD)), autistic, traumatically brain injured, have other health impairments, or
are specifically learning disabled. Such students, once found eligible for special
education services under federal and state eligibility/disability standards, receive an
Individualized Education Plan (IEP). Laws concerning the identification, funding, and
provision of services of such students include the Individuals with Disabilities Education
Act (IDEA 2004, Public Law 108-446 reauthorized in 2004), Section 504 of the
Rehabilitation Act of 1973, and the Americans with Disabilities Act (ADA).
To include students with disabilities in testing efforts, suitable testing accommodations have been developed and implemented. These accommodations allow such students to perform at optimal levels and to be appropriately assessed. Test accommodations refer to
a “… change to testing materials, setting, or procedures that does not alter what is being
measured” (Thurlow, 2007, p. 2) and are used to promote fairness in testing (Sireci, Li, &
Scarpati, 2003). Additionally, the use of accommodations for students with disabilities is
thought to allow for the elimination of construct-irrelevant variance (Fuchs, Fuchs, Eaton,
Hamlett, & Karns, 2000a) which, in turn, “… level[s] the playing field so that the format
of the test or the test administration conditions do not unduly prevent such students from
demonstrating their ‘true’ knowledge, skills, and abilities” (Sireci et al., 2003, p. 3).
There is a “… great diversity in the way accommodations are created and
implemented…” (Sireci et al., 2003, p. 62) with the most common types of testing
accommodations for students with disabilities including, but not limited to:
Presentation – oral test administration,
Presentation – changes in test content (e.g., simplified language),
Presentation – changes in test format (e.g., Braille, large print),
Response – students write directly in test booklet,
Response – students dictate response (e.g., use scribe),
Setting – separate room for testing,
Setting – individual administration,
Timing/Scheduling – extended/unlimited administration time, and
Timing/Scheduling – break up test administration into separate sessions.
As high-stakes decisions are made using assessment results, the effectiveness of
accommodations designed to allow access to assessments and increase the accuracy of
student results has been examined. In an effort to provide the most efficacious and
appropriate testing accommodations for students requiring special education services,
educational researchers have examined differences between these students and their
typically developing counterparts for the various types of accommodations (see Bolt &
Thurlow, 2006; Helwig & Tindal, 2003; Kosciolek & Ysseldyke, 2000). While these
studies provide much-needed research in this area, they are limited to an examination of
one or two accommodations for a relatively small sample of students requiring special
education services and their typically developing peers. To address this and other
shortcoming(s), several summaries of the research literature have been carried out. In
particular, the National Center on Educational Outcomes (NCEO) produces a new
technical report, summarizing the research literature, approximately every three years.
For the most part, these reviews have not provided any firm conclusions regarding the
effectiveness of the testing accommodations examined, with most reviews yielding mixed
results. As Sireci et al. summarized, “[o]ne thing that is clear from our review is that
there are no unequivocal conclusions that can be drawn regarding the effects, in general,
of accommodations on students’ test performance” (2003, p. 48).
Prior to NCLB, in an effort to synthesize information on the effects of test
accommodations, Chiu and Pearson (1999) conducted a meta-analysis of research
looking into the effects of test accommodations for both students requiring special
education services and students with limited English proficiency. Their findings did not
support the use of testing accommodations for either population of students.
While original research, reviews of the research literature, and meta-analyses have
added to our knowledge of testing accommodations for students requiring special
education services, they have not provided a definitive understanding of the types of
accommodations that are the most useful for these students.
Problem Statement
Students with disabilities are often excluded from the high-stakes tests needed to
fulfill adequate yearly progress (AYP) obligations for state and federal funding. High-
stakes tests, taken without accommodations, generally do not represent these students’
true abilities. Such tests introduce construct-irrelevant variance as a type of systematic
error (Messick, 1989, 1990, 1995) when students with disabilities are faced with modes
of testing (e.g., paper and pencil) with which they are not facile. Construct-irrelevant
variance is considered one of two primary threats to construct validity as a “contaminant
with respect to score interpretation” (Messick, 1989, p. 34). In addition, construct-
irrelevant difficulty, where “aspects of the task that are extraneous to the focal construct
make the test irrelevantly more difficult for some individuals or groups” and “… [lead] to
construct scores that are invalidly low for those individuals adversely affected” (p. 34)
affects test scores for students with disabilities.
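One way to make this threat concrete, as an illustrative decomposition rather than Messick's own formulation, is to partition the variance of observed scores under a given testing format into variance attributable to the intended construct, variance attributable to construct-irrelevant factors (such as the paper-and-pencil format itself), and random error:

\sigma^2_{\text{observed}} = \sigma^2_{\text{construct}} + \sigma^2_{\text{construct-irrelevant}} + \sigma^2_{\text{error}}

Under this framing, an accommodation is effective when it reduces the construct-irrelevant component (and the associated construct-irrelevant difficulty) for students with disabilities without changing the construct the test is intended to measure.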
Test accommodations are designed to ensure the comparability of test scores
between students with disabilities and their typically developing counterparts by
eliminating as much construct-irrelevant variance and construct-irrelevant difficulty as
possible. While researchers, measurement specialists, and test designers have endeavored
to create tests with appropriate accommodations, there is no consensus as to whether or
not test accommodations for students with disabilities are indeed effective.
The present study is important because it attempts to synthesize previous research in a manner (i.e., a meta-analysis of the aggregate research on test accommodations for students with disabilities) that has only been attempted once in the past (see Chiu & Pearson, 1999), presenting what could be more objective results when
compared to narrative syntheses of the research literature. As standardized test scores are
used to assess AYP and provide school districts and schools with much needed
educational funding as well as assessing individual growth and achievement, they must
be both accurate and adequate measures of student knowledge for all students. When
such tests are inadequate, inaccurate, or invalid measures of student knowledge, the
inherent repercussions are manifold. Such repercussions include inadequate or inaccurate
placement of students, loss of funding, teacher job losses, and potential school closures.
With extant research limited by the number of accommodations that are addressed
and the size of the samples drawn in a single study, it is difficult to draw generalized
conclusions about the efficacy of test accommodations. With the introduction of NCLB,
numerous studies have been completed. Some of this research points to an interaction
between student characteristics and the type of accommodation.
The interaction hypothesis states that (a) when test accommodations are given
to the [students with disabilities] who need them, their test scores will improve,
relative to the scores they would attain when taking the test under standard
conditions; and (b) students without disabilities will not exhibit higher scores
when taking the test with those accommodations (Sireci, Scarpati, & Li, 2005,
p. 458).
Most research in this area is restricted by small sample sizes, as classification of students
as “students with disabilities” occurs for less than 14% of the general student population.
As well, most research and synthesis reports in this area generally aggregate students
with disabilities with English language learners (ELL). Currently available research only
allows for general accommodation decision-making and implementation guidelines, thus
“more empirical study is warranted to further investigate the effects of testing
accommodations for students with disabilities” (Bolt & Thurlow, 2004, p. 151).
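To illustrate the pattern the interaction hypothesis predicts, the sketch below uses purely hypothetical score means and a hypothetical pooled standard deviation (none of these numbers come from the studies reviewed here) to show how a differential boost looks when expressed as standardized mean differences:

```python
# Hypothetical group means and pooled SD, for illustration only
sd_pooled = 10.0
swd_standard, swd_accommodated = 40.0, 45.0   # students with disabilities
td_standard, td_accommodated = 60.0, 61.0     # typically developing students

# Boost = accommodated mean minus standard-condition mean, in SD units
boost_swd = (swd_accommodated - swd_standard) / sd_pooled   # 0.50
boost_td = (td_accommodated - td_standard) / sd_pooled      # 0.10

print(f"Boost for students with disabilities: {boost_swd:.2f} SD")
print(f"Boost for typically developing students: {boost_td:.2f} SD")
```

A differential boost is supported when the first quantity is substantially larger than the second; comparable gains for both groups would instead suggest that the accommodation changes the test for everyone rather than removing a disability-specific barrier.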
Purpose of the Study
The purpose of the study was to: (a) determine whether there is empirical support
to suggest provision of testing accommodations produces more effective assessment of
students with disabilities, (b) provide an estimate of the strength of this effect, and (c)
contribute to the understanding of the effects of test accommodations for this population
of students.
Lack of consensus in the research literature regarding the efficacy of test
accommodations for students with disabilities has prompted this researcher to investigate
the issue of effective test accommodation for students with disabilities using meta-
analysis. With the introduction of NCLB, numerous studies have been completed and
serve as data points for the present research. Meta-analysis of research on testing
accommodation practices allows us to understand which accommodations are being used,
in which situations, and for what types of students. This technique also allows us to
aggregate data across studies thus providing more power to detect effects that may not be
apparent in an individual study, possibly because of the small sample sizes that plague
studies focusing on students with disabilities.
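As a minimal sketch of this aggregation step, with entirely hypothetical effect sizes and variances (the actual procedures used in this study are described in Chapter Two), the fragment below pools study-level Hedges' g estimates using inverse-variance weights and a DerSimonian-Laird estimate of between-study variance:

```python
import numpy as np

# Hypothetical Hedges' g estimates and their sampling variances from five studies
g = np.array([0.42, 0.18, 0.55, 0.10, 0.33])
v = np.array([0.020, 0.015, 0.040, 0.010, 0.025])

# Fixed-effect (inverse-variance) pooled estimate
w = 1.0 / v
g_fixed = np.sum(w * g) / np.sum(w)

# DerSimonian-Laird estimate of between-study variance (tau^2)
Q = np.sum(w * (g - g_fixed) ** 2)              # homogeneity statistic
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(g) - 1)) / C)

# Random-effects pooling adds tau^2 to each study's sampling variance
w_re = 1.0 / (v + tau2)
g_random = np.sum(w_re * g) / np.sum(w_re)
se_random = np.sqrt(1.0 / np.sum(w_re))

print(f"Q = {Q:.2f}, tau^2 = {tau2:.3f}")
print(f"Random-effects pooled ES = {g_random:.2f} (SE = {se_random:.2f})")
```

Because every study contributes information weighted by its precision, the pooled estimate has a smaller standard error than any single small-sample study could achieve on its own.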
In an effort to understand the ramifications of testing accommodations for
students with disabilities, this research focused on studies, or portions of studies,
examining students with disabilities or students with disabilities and their typically
developing peers. Variables reflecting presentation, response, setting, and
timing/scheduling test accommodations for paper and pencil tests were included. This
study examined studies, or portions of studies, focusing on paper and pencil tests only.
Computer-based testing (CBT) and other non-paper and pencil tests were considered
inherently different from paper and pencil tests and were not included. Additionally,
testing accommodations that are most effective for paper and pencil tests may not be
effective for these other types of tests. Studies between 1999 and 2011 were selected for
the meta-analysis to further, and not overlap, Chiu and Pearson’s (1999) meta-analytic
research. This research adds to the existing body of research and research syntheses and
extends the original work of Chiu and Pearson (1999) by narrowing the focus from
English Language Learners and students with disabilities populations on a variety of
different assessments to students with disabilities on standardized, paper and pencil
assessments only. Further, meta-regression analyses and graphic representations, not
available to Chiu and Pearson in 1999, provide a unique contribution to research in this
area.
Sireci et al.'s (2005) notion of an interaction hypothesis has been incorporated
within the framework of the present meta-analysis. As well, several summaries of the
research have provided additional direction regarding research findings on types of
accommodations being used, and information on studies in this area. To further our
understanding of test accommodations for students with disabilities, salient variables
were entered into a meta-regression analysis. Meta-regression was incorporated into this
study in order to integrate the effects of multiple, potentially related predictors in an
effort to yield a summary of overall prediction of the most effective testing
accommodations, as well as examining residual variance and assessing the
generalizability of the effects of these accommodations on students with disabilities and
typically developing students.
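The fragment below is a simplified, purely illustrative sketch of a meta-regression of this kind; the data are hypothetical, and the two made-up moderators (an accommodation-type indicator and a learning-disability indicator) stand in for the larger variable sets actually examined in Chapter Three:

```python
import numpy as np

# Hypothetical study-level effect sizes, sampling variances, and moderators
g = np.array([0.42, 0.18, 0.55, 0.10, 0.33, 0.27])
v = np.array([0.020, 0.015, 0.040, 0.010, 0.025, 0.018])
timing = np.array([1, 0, 1, 0, 1, 0])   # 1 = timing/scheduling accommodation
ld = np.array([1, 1, 0, 0, 1, 0])       # 1 = learning-disability sample

tau2 = 0.03                              # assumed between-study variance estimate
X = np.column_stack([np.ones_like(g), timing, ld])
W = np.diag(1.0 / (v + tau2))            # random-effects weights

# Weighted least squares: beta = (X'WX)^(-1) X'Wg
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ g)
print(dict(zip(["intercept", "timing", "ld"], np.round(beta, 3))))
```

The proportion of heterogeneity explained, analogous to the R² values reported in the abstract, can then be gauged by comparing the residual between-study variance from such a model against the between-study variance from a model containing no moderators.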
Research Hypotheses
In the current study, the following hypotheses are addressed for the meta-analytic
portion of the research:
Research Hypothesis 1: Is there empirical support for effects of test
accommodations for the target group, students with disabilities, as opposed to their
typically developing peers?
Research Hypothesis 2: As measured by effect size, does each of the following
constitute an effective accommodation for students with disabilities?
o Presentation test accommodations?
o Response test accommodations?
o Setting test accommodations?
o Timing/Scheduling test accommodations?
The following research hypothesis is addressed through the meta-regression
portion of the current research:
Research Hypothesis 3: Which type of accommodation(s)–Presentation, Response,
Setting, or Timing/Scheduling–most effectively removes construct-irrelevant
variance from target students’ test scores?
Null hypotheses.
The following null hypotheses are addressed in the meta-analytic portion of the
research:
Research Hypothesis 1: There is no empirical support for effects of test
accommodations for the target group, students with disabilities, as opposed to their
typically developing peers.
Research Hypothesis 2: Test accommodations are not effective.
o Presentation test accommodations do not increase access to test items for
target students
o Response test accommodations do not increase access to test items for
target students
o Setting test accommodations do not increase access to test items for target
students
o Timing/Scheduling test accommodations do not increase access to test
items for target students
The following null hypothesis was addressed in the meta-regression portion of the
current research:
Research Hypothesis 3: No test accommodations effectively remove construct-
irrelevant variance from target students’ test scores.
Review of the Literature
Students with disabilities.
There are 13 special education categories listed in federal special education law
(Individuals with Disabilities Education Act reauthorization of 2004, PUBLIC LAW 108–446,
2004). The disabilities cited in the legislation include
mental retardation, hearing impairments (including deafness), speech or
language impairments, visual impairments (including blindness), serious
emotional disturbance (referred to in this title as ‘emotional disturbance’),
orthopedic impairments, autism, traumatic brain injury, other health
impairments, or specific learning disabilities (Part A (SEC. 602) (3) (A) (i),
118 STAT.2652, 2004, see Appendix A for the statute in its entirety).
In a separate definitional section of this law, specific learning disabilities are
further spelled out as
… a disorder in 1 or more of the basic psychological processes involved in
understanding or in using language, spoken or written, which disorder may
manifest itself in the imperfect ability to listen, think, speak, read, write,
spell, or do mathematical calculations
and “… includes such conditions as perceptual disabilities, brain injury, minimal brain
dysfunction, dyslexia, and developmental aphasia” but not “… learning problem[s] that
[are] primarily the result of visual, hearing, or motor disabilities, of mental retardation, of
emotional disturbance, or of environmental, cultural, or economic disadvantage” (IDEA,
Part A – (30) (A), (B), and (C) (118 STAT.2657 – 118 STAT.2658)). The No Child Left
Behind Act of 2001 relies on the definition “under section 602(3) of the Individuals with
Disabilities Education Act" (TITLE I A: (1111) (b)(2) (C) (v) (II) (cc), 115 STAT. 1451,
2001) when referring to children, or students, with disabilities. As well, the Council for
Exceptional Children (CEC), one of the major organizations worldwide for those
involved in the field of Special Education, refers to the same legislation when discussing
students with disabilities.
It should be noted that the identification of certain disabilities, such as specific
learning disability and emotional disturbance, is often thought to be more subjective
(National Association of Special Education Teachers) than disabilities with obvious
associated medical or physical conditions such as deafness, blindness, and orthopedic
impairments. As well, some of these designations (for example, specific learning disability and emotional disturbance) can be more dynamic and temporary. Students with
specific learning disabilities or emotional disturbances may move out of or back into
these conditions. Based on the preceding definition, it appears that students with
disabilities are indeed a very diverse group.
While other definitions for students with disabilities exist (for example, in countries other than the United States), they were not applied within the scope of this
research. Additionally, studies using definitions for students with disabilities found in the
research under meta-analysis that could not be aligned with the definition previously
cited were removed from the analysis.
Educational legislation and students with disabilities.
The Individuals with Disabilities Education Act reauthorization of 2004 (PUBLIC LAW 108–446, 2004), or IDEA, and No Child Left Behind (PUBLIC LAW 107-110, 2002), or NCLB, two relatively recent major laws affecting education in the United States, have heavily impacted services for, and the assessment of, students with disabilities.
NCLB (2001) requires that educators be accountable for making sure all students,
including students with disabilities, meet high expectations. Under TITLE I A (1111)
(b)(2) (C) (v) (II) (cc), NCLB breaks out separate measurable annual objectives for
students with disabilities as part of state, district, and school accountability for the
adequate yearly progress of all students (see Appendix B for the statute in its entirety).
Adequate yearly progress (AYP) includes the same high academic standards for all public
school students with the expectation of continuous and substantial academic progress,
and requires each student to become proficient in mathematics, reading/language skills,
and science, with the exception of low-achieving students. According to the Council for
Exception Children (CEC, 2002), low-achievers has not been defined in NCLB. Whether
low-achieving students refer to all students with disabilities, a subset of students with
disabilities, or some other groups of students is not made clear in the legislation. CEC
(2002) believes the definitions in this section of the legislation
…appear to have the same meaning as child with a disability under Sec. 602
of the IDEA …[b]ut judging by the nature of all further stipulations respecting
students with disabilities, IDEA eligible and served children constitute the target
population being cited (p. 8).
IDEA (2004) focuses on providing a free and appropriate public education
(FAPE) to children with diagnosed disorders that impact their ability to learn in a regular
classroom setting. As part of FAPE, IDEA Part D (2004) outlines activities to be used to
improve the education of children with disabilities. A three-pronged approach for an
effective educational system for students with disabilities should:
(A) maintain high academic achievement standards and clear performance goals
for children with disabilities, consistent with the standards and expectations for all
students in the educational system, and provide for appropriate and effective
strategies and methods to ensure that all children with disabilities have the
opportunity to achieve those standards and goals;
(B) clearly define, in objective, measurable terms, the school and post-school
results that children with disabilities are expected to achieve; and
(C) promote transition services and coordinate State and local education, social,
health, mental health, and other services, in addressing the full range of student
needs, particularly the needs of children with disabilities who need significant
levels of support to participate and learn in school and the community ((SEC.
650) (4) (A), (B), and (C), 118 STAT. 2763, 2004), (see Appendix A for the
statute in its entirety).
IDEA (2004) provides funding, at the state level, for assessment activities
including appropriate accommodations or alternative assessments used to “assess[…] the
performance of children with disabilities, in accordance with sections 1111(b) and 6111
of the Elementary and Secondary Education Act of 1965” (Part B (SEC. 611) (e) (2) (C)
(x), 118 STAT.2667– 118 STAT.2668, 2004). This is also covered in NCLB (2001) as
measurable objectives for all students in statewide assessment programs, including
students with disabilities, with provisions for funding assessment accommodations for
limited English proficiency (LEP) students and students with disabilities.
Both NCLB and IDEA provide information on assessment of students with
disabilities, albeit each with a different focus. As part of AYP, NCLB proposes assessed,
measurable objectives of academic standards for accountability include a
single minimum percentage of students who are required to meet or exceed the
proficient level on the academic assessments that applies separately to each
group of students described in subparagraph (C) (v) (NCLB, TITLE I A (111)
(b)(2)(G)(iii), 115 STAT. 1448),
of which students with disabilities constitute one group. Annual assessment participation cannot be less than 95% for each of the (C) (v) groups. While there is frequent mention of assessment as it pertains to statewide testing and the Elementary and Secondary Education Act of 1965, or its current reauthorization, NCLB (2001), much of the legislation is concerned with assessment information necessary to develop Individualized
Education Programs (IEPs) for students with disabilities; i.e., use of developmental and
other assessments. While developmental and other assessments can be considered high-
stakes tests for the student with disabilities, for purposes of the current study high-stakes
tests refer to assessments of achievement used for decisions at the school, school district,
state, or federal level.
At the federal level, NCLB (2001) and IDEA (2004) have pushed an agenda of
assessing improved student achievement through a series of accountability structures.
This generally plays out at the state level, as high-stakes tests comprise state assessment
programs.
Notwithstanding NCLB's lack of definitional clarity regarding low-achieving students, for which NCLB relies on the clarification of this population in IDEA (1997), these two pieces of legislation mean that full inclusion of students with disabilities is no longer the same type of choice it had been prior to the enactment of IDEA’s predecessor, PL 94-142 of 1975 (the Education of All Handicapped Children Act) (Crawford & Tindal, 2006; Thurlow, Lazarus, Thompson, & Blount Morse, 2005). Schools, districts, and states are no longer
able to exclude students with disabilities, as a group, from assessment requirements; this,
in turn, ensures equitable access to assessment and instruction (Baker, 2008). While
school districts may decide to exclude some students with disabilities from state-
mandated assessments, and states may decide to exclude some students with disabilities
from federally mandated assessments, this is becoming more difficult to justify,
especially when state, district, and school grant money is tied to AYP as defined in
NCLB.
Assessment inclusion for students with disabilities.
Inclusion of students with disabilities in school, district, state, and federal
assessment programs, discussed in the following sections, covers the calls for inclusion,
the impact of exclusion, and a brief history of inclusion in high-stakes assessment
programs for these students.
Calls for inclusion in assessments.
While recognition of the importance of providing services for students with
disabilities in the general educational system had been a hotly debated topic for a number
of years in the United States, steps toward including students with disabilities in that
educational system reached fruition with passage of PL 94-142, the Education of All
Handicapped Children Act of 1975. This legislation provided students with disabilities
access to the regular educational system. Provisions within this act included a free and appropriate public education (FAPE) in the least restrictive environment (LRE) for students with disabilities, and introduced the individualized education program (IEP). Students with
disabilities now had access to the educational system but were not included in the
ongoing district, state, and federal assessment programs.
In the early 1990s, prior to President Bill Clinton’s signing IDEA (1997) into law,
opinions about including students with disabilities in district, state, and national level
assessments differed; in some instances, radically. In 1992, Allington and McGill-Franzen
were among the first to document issues with statewide assessment programs, citing lack
of inclusion of students with disabilities as potential corruption of assessment results.
Other early calls for students with disabilities’ inclusion in assessment programs by
researchers such as Algozzine (1993), McGrew, Thurlow, Shriner, and Spiegel (1992),
Reschly (1993), and Reynolds (1993) were prefaced by the belief that no student,
including students with disabilities, should be excluded from testing. Algozzine (1993)
argued that excluding students “… violates the spirit and practice of full inclusion” (p. 8)
and suggested accommodations or modifications offered to a student be offered to all
students. Reynolds (1993) felt universal assessment practices, which allowed for full
inclusion, should be used for imperative domains such as language, mathematics, social
skills, and self-dependence. McGrew et al. (1992), in their examination of the inclusion of students with disabilities in federal and state assessment databases, held that it was imperative that all students with disabilities able to participate in national and state assessments must
participate, as “[t]here is … concern that we … only value who we can measure” (p. 3),
emphasizing a need to value students with disabilities. Reschly (1993), in an exploration
of advantages and disadvantages of full exclusion, full inclusion, and allowing two
percent of students to be excluded, argued that “implementation of liberal
accommodations policies would probably increase the perception of fairness and the
assessment programs’ credibility” (p. 9). As well, the National Center on Educational
Outcomes (NCEO) proposed a complex model of six educational outcomes, the
assessment of which was considered useful in guiding state and federal agencies' educational resource and program policy decisions and reflected a commitment to the
inclusion of students with disabilities in the assessment of these outcomes to the
maximum extent possible (Gilman, Thurlow, & Ysseldyke, 1993; Ysseldyke & Thurlow,
1993).
Perhaps one of the strongest advocates for inclusive models of assessment for
students with disabilities, Algozzine (1993) stated “… difference[s] in performance
across comparison groups [would be] due to naturally-occurring differences in
characteristics of comparison groups” (p. 12) if all students were included in assessment
programs. He noted that differences in inclusion practices for students with disabilities in
assessment programs between states made state comparisons on standardized assessments
virtually meaningless. As an advocate for the full inclusion perspective, Algozzine
stressed that permitting IEP data to stand in for state and national assessments taken by
general education students and establishing different performance standards for students
with disabilities are “… discriminatory, selective practices that … violate the sentiments
of full inclusion” (p. 13).
Reschly (1993) proposed a partial inclusion assessment model he felt might
counter issues found with total exclusion, or barring students with disabilities’ access to
standardized state and national assessments. Within this model, students with severe
disabilities, constituting approximately two percent of the student population, would be
excluded. All other students with disabilities would be included, but would be given the
lowest score possible if they did not participate. With such a model, students who would
not benefit from participation in the assessment process would not be forced to complete
the assessment. Reschly believed such a practice might be considered more equitable and
be seen to foster more accurate comparisons of educational units, such as districts and
states, when reporting standardized assessments results.
In opposition to full inclusion, based primarily on technical considerations,
Merwin (1993) stated that excluding students with disabilities from testing could be
justified as “… students in special education comprise such a small number of students
that their exclusion [would] not affect state and national comparisons” (p. 8) and that
excluding students with disabilities would “… affect group averages less than excluding
other subgroups, such as children from low socioeconomic status groups” (p. 8).
In counterpoint, McGrew et al. (1992) declared that it was time to “…address the
numerous political and technical hurdles that must be overcome in order for these
students to participate more fully in our national and state data collection programs” (p.
8) given the enormity of state and federal support for educational programs for students
with disabilities with “… over 4.5 million school-age youngsters receive[ing] some form
of special education services, services that are provided at significant expense to our
educational system” (p. 10). Thus, an examination of student performance was not only
warranted, it was necessary. Algozzine (1993), echoing this sentiment, argued that while considering the inclusion of students with disabilities in federal and state assessments of educational outcomes may not be easy, full inclusion of these students should not be viewed simply as a technical question. Federal and state assessment programs should not dismiss the use of assessment accommodations on the grounds that they present technical issues that cannot be addressed by psychometric practice. Rather, “… all tests and testing procedures
lack perfect technical adequacy” (Algozzine, 1993, p. 13) so we should “simply take a
step in some direction” (p. 14). The direction Algozzine (1993) pointed to was to “…
avoid any practices that produce, encourage, foster, or facilitate separation among
students” (p. 14). To that end, he suggested all students take all tests with any assessment
accommodation allowed on one test being allowed on all tests for all students. In more
recent research on design patterns for improving accessibility for test takers with
disabilities Hansen and Mislevy state that “… there is a moral imperative to ensure that
all students, including individuals with disabilities, have access to assessment products
and services” (2008, p. 1).
When the 1997 amendments to IDEA, the Individuals with Disabilities Education Act, were signed into law, the notion of “improving results” was added to the lexicon of access for students with disabilities. The amendments
reflect[ed] a concern about the standards to which [students with disabilities]
[were] held, and about the extent to which they participate[d] in state and district
assessments, the primary means that education [uses] to demonstrate educational
results (Ysseldyke, Thurlow, Kozleski, & Reschly, 1998, p. 14)
and required states to report on the performance of students with disabilities. Such
participation and reporting not only allows for monitoring performance of students with
disabilities through the demonstration of improving or declining results; it allows districts
and states the ability to provide concrete evidence when justifying the costs of education
for students with disabilities. With such legislation and the growing recognition of “… the value of large scale federally funded studies to assess student progress” (McGrew et al., 1992, p. 2) as part of the effort to measure the overall quality of the educational system in the United States, students with disabilities’ access to district, state, and federal assessment programs has been an issue for over a decade.
It should be noted that the extent to which students with disabilities are included
in assessment programs continues to be complicated by domains being assessed,
unresolved issues regarding the purpose(s) of assessment and inferences that will be
made based on assessment, the type and severity of student’s disability, and the
measurement procedures used. All of these considerations need to be accounted for when
assessing students, as it is the competency under consideration that should be assessed,
not the student’s disability.
While it was beyond the scope of this research to determine which content areas
should be assessed in district, state, and federal assessment programs, research in the
areas of language and mathematics was examined as these are considered to be necessary
skills in the information and digital ages. As skills in these areas are considered basic to
everyday life, understanding the progress of all students and program efficacy in teaching
these skills cannot be overlooked.
Impact of exclusion from assessment programs.
Prior to the implementation of NCLB (2001), research consistently showed that students with disabilities were not included in district and state assessments and that, if these students were included in the assessment process, their test scores were not always reported (Elliott, Erickson, Thurlow, & Shriner, 2000). Educational researchers and
policy analysts have forwarded several reasons for excluding students with disabilities
from district, state, and national assessment programs, particularly large-scale, high-
stakes assessment programs. Tindal and Fuchs (2000) stated that
… for many [students with disabilities], the outcomes assessed within general
education accountability systems have been viewed as irrelevant to setting and
skills required for successful post-school adjustments (p. 9),
further arguing that this notion is reinforced by PL 94-142 (1975), in which students with disabilities’ IEPs become an individually referenced, separate apparatus for describing progress for the student with disabilities, with this system of assessment being removed
from any existing general assessment systems. Additionally, many schools and school
districts have excluded students with disabilities from their general assessment programs
in an effort to ensure they do not report poor school progress (McGrew et al., 1992;
Reschly, 1993; Tindal & Fuchs, 2000). Alternatively, schools which have included
students with disabilities in their assessment program and have reported poor progress
have been known to blame the victim, placing failure on the student with disabilities then
isolating or removing the student from the school’s educational mainstream (Reynolds,
1993).
Exclusion of students with disabilities from assessment programs has often been
unwarranted (Reschly, 1993), with two related negative outcomes. One of these outcomes, arising from the emphasis placed on producing positive school- and district-level assessment results in high-stakes decision-making processes, has been possible discrimination against some students due to existing background characteristics, specifically disabilities, whereby “… conditions [are] ripe for … unwarranted exclusion of students with disabilities or low achievement” (Reschly, 1993, p. 45). Such unwarranted exclusion has been carried out in an attempt to raise average levels of performance on assessments, as students with disabilities generally perform at much lower levels than their same-grade/age peers.
Unwarranted exclusion is exemplified when students with disabilities with IEP reading
goals are excluded from standardized literacy assessments. Methods to exclude students
with disabilities from assessments may be a straightforward directive while other
exclusion methods may be much more subtle. Anecdotal information provided to Reschly
(1993) indicated methods to exclude students with disabilities from assessment efforts
took the form of (i) encouraging the student to stay at home on “test day”, (ii) marking
the student absent on “test day” although they were present, or (iii) having test booklets
for students with disabilities invalidated as their answer sheet was not appropriately
completed. While the previous examples of exclusionary practices are discriminatory,
some types of exclusionary practices are perfectly acceptable; e.g., deciding against
assessing the literacy performance of middle school students with extremely low
cognitive functioning who do not have literacy goals as their skill levels are below the
average skill levels of kindergarten-aged students. Excluding such students from the
literacy assessment, perhaps providing them with access to an alternative assessment, is
generally considered a more appropriate course of action as including such students
would not provide useful information about these students nor their program.
Consequences of exclusion run the gamut from issues with district, state, and
national estimates of student performance to the myth of difference between students
with disabilities and their typically developing counterparts. To start, many researchers
question the accuracy of assessment when not all students participate in the assessment
program (Crawford & Tindal, 2006; Elliott et al., 2000; McGrew et al., 1992). McGrew
et al. (1992) pointed out that, by treating students with disabilities as outliers in the data, assessment programs “make it difficult to produce accurate national and state statistical
estimates for this population [and] it also raises questions about bias being present in
most national and state education statistical estimates that are reported” (p. 29). As Elliott
et al. (2000) point out, “[w]ithout the inclusion of all students in accountability systems,
incomplete data are reported” (p. 40). Inferences made from assessment results from
programs that exclude students with disabilities are questionable. Additionally, exclusion
practices are not uniform across districts or states, further complicating any comparisons
or generalizations that could be made from the assessment data collected. Policy makers
cannot make knowledgeable decisions about students with disabilities and programs for
students with disabilities and curriculum based on incomplete information.
This issue is further complicated by the fact that students with disabilities are
often excluded from norming samples for standardized tests. As well, most standardized
tests are normed without including accommodations. Thus, when students with
disabilities are measured using these assessments, they are generally outside the range
assessed by the test. As this subgroup is generally not adequately represented,
intervention information is suspect.
Perhaps the primary reason for concern about the exclusion of students
with disabilities from state and district assessments [has been] the lack of
accountability for the results of education for these students. Intentional
exclusion of students, either from testing or from reporting, [means] that
there [is] no data available on the results of education for students with
disabilities (Ysseldyke et al., 1998, p. 15).
Without such data, judgments about student performance or the adequacy of programs for
students with disabilities cannot be made. Students with disabilities must be allowed
access to assessment programs if we are required, and desire, to see and interpret the
results of these assessments to provide systematic information about individual
performance for a student with disabilities, aggregate performance for students with
disabilities, and the performance of educational programs and curriculum aimed at
students with disabilities.
Other documented consequences of exclusion of students with disabilities from
assessment programs include increases in retention at grade level, rates of referral to
special education, and spurious comparisons among school districts (Thurlow, McGrew,
Tindal, Thompson, Ysseldyke, & Elliot, 2000; Ysseldyke et al., 1998). Exclusion from
the assessment process often results in exclusion from curriculum or reform initiatives
designed to improve students’ performance (Elliott et al., 2000; Ysseldyke et al., 1998).
Further, McGrew et al. (1992), hold that it is imperative all students with disabilities, who
are able, participate in national and state assessments as “[t]here is … concern that we …
only value who we can measure” (p. 3) with those not being measured becoming non-
students, and possibly non-people.
While there has been progress in the area of inclusion, and more states expressly
prohibit exclusion of students, exclusionary practices still exist. Christensen, Lazarus,
Crone, and Thurlow (2008) found that almost one-third of all states in 2007 provided
some reasons students may be excluded from statewide assessment accountability
programs. Further, they noted this was an increase from the previous examination of state
policies on participation of students with disabilities in 2005.
A brief history of inclusion in high-stakes assessment programs.
With Section 504 of the Rehabilitation Act of 1973 and Title I of the Elementary and Secondary Education Act, educational accountability came to the forefront. With the
increased emphasis on educational accountability “… appropriate testing and reporting of
assessment results … increased in importance to educators and policymakers across the
nation” (Bolt & Thurlow, 2004, p. 141). With the significant expansion of assessment
activities and increasing use of state-level assessments for accountability purposes in the
1990s (Elliott et al., 2000) calls for inclusion of students with disabilities in state
accountability systems intensified, leading to inclusion of more students with disabilities
in state assessment programs. However, there was little or no documentation on the
actual participation rates of students with disabilities, or progress on goals and standards
set for all learners, on these assessments (Elliott et al., 2000). Additionally, prior to 1996,
of the total number of state-level assessments carried out, students with disabilities' participation rates could be provided for less than 40% of these assessments (Elliott et al.,
2000).
In an effort to better understand inclusion and participation rates of students with
disabilities, in January of 1998, 44 people from various educational stakeholder groups
met in Washington D.C. to, among other things, “… identify key issues and make
recommendations related to assessment practices, research and development” (Ysseldyke
et al., 1998, p. 9) and other areas impacted by IDEA 1997. The meeting was convened by
the National Center on Educational Outcomes (NCEO) with the Council of Chief State
School Officers (CCSSO) and the National Association of Directors of Special Education
(NASDSE) also participating. The report generated by this meeting was in response to
concerns “about the standards to which [students with disabilities] are held, … the extent
to which they participate in state and district assessments, [and] the primary means that
education has used to demonstrate educational results” (Ysseldyke et al., p. 14). New
requirements generated in IDEA 1997 necessitated that students with disabilities be
included in state and district-wide assessments with provision of appropriate
accommodations where necessary (Thurlow et al., 2000; Ysseldyke et al., 1998). With
the passage of this legislation, the general trend in state-wide assessment programs for
those states with assessment programs, general and alternate, was toward “inclusiveness
of [students with disabilities] in assessments, rather than toward delineating limitations
on either who participates or the accommodations that they can use” (Thurlow et al., p.
162). IEPs began taking on a more pivotal role and were required to include statements
about individual modifications to state or district-wide assessments for individual
students with disabilities; or, if warranted, participation of a student with disabilities in an
alternate assessment instead of the general state/district-wide assessments (Thurlow et al.;
Ysseldyke et al.). Federal funding for states and districts now hinged on participation of
students with disabilities in statewide assessment programs (IDEA, 1997 Part B funding;
Thurlow et al., 2000). As some states reported on participation rates for students with
disabilities and performance of students with disabilities for statewide assessments
separately, concerns about the accuracy of results reported were raised. For example, a
district could report there were 200 students with disabilities and then post the assessment
results of students with disabilities based on a fraction (e.g., one-half) of those students
taking the statewide assessment (Elliott et al., 2000). Thus, districts and states were called
upon to report participation rates for students with disabilities as well as student
performance using standardized reporting procedures.
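To make the reporting concern concrete with the hypothetical figures above (the numbers are illustrative only, not drawn from any cited source), the two quantities rest on different denominators:

\[
\text{participation rate} = \frac{\text{students with disabilities tested}}{\text{students with disabilities enrolled}} = \frac{100}{200} = 50\%, \qquad
\text{reported performance} = \frac{\text{proficient students among those tested}}{\text{students tested}}
\]

A performance figure computed only over the 100 tested students can therefore look very different from one computed over all 200 enrolled students, which is why standardized reporting of both participation and performance was requested.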
Federal legislation changed in the late 1990s (IDEA, 1997) through early 2000
(NCLB, 2001 and IDEA, 2004) partially based on the premise that all students can learn,
the notion of providing outcomes-based information for students with disabilities
education in public accountability systems (Tindal & Fuchs, 2000), and calls for
inclusion and participation of students with disabilities in district-wide, state-wide and
federal assessment programs. Inclusion of students with disabilities in these mandates
focused on accountability systems dealing with improvement of student achievement.
The legislation clearly stated that students with disabilities had to be included in
state/district-wide assessment programs, with states/districts having to report on (i)
participation rates for state/district-wide assessments and (ii) student performance on
state/district-wide assessments. Once IDEA (1997) was signed into law, educators had to
find ways to include, or in some cases legally exclude, students with disabilities in
assessment programs. It was no longer possible to exempt students with disabilities from
participating in district and statewide assessments without appropriate documentation or
some indication of how their learning would be assessed (Elliott et al., 2000). Now that
total exemption was no longer an option, states began looking at how to make decisions
about partial participation, out-of-level testing, and alternate assessments (Thurlow et al.,
2000).
School accountability for improving education outcomes for all students has
almost exclusively been addressed through state-wide assessment programs (Thurlow et
al., 2005), with inclusion of students with disabilities in these assessment programs as a
way for schools to monitor improvement of programs designed for this particular
population of students. Inclusion of students with disabilities in statewide assessment
programs was “considered essential to improving education opportunities for [students
with disabilities] and to providing meaningful and valuable information about student
performance to schools and communities” (Thurlow et al., p. 233). With the interplay of
statewide assessment programs and school accountability, as well as federal legislation
mandating assessment participation decisions for students with disabilities be made by
local IEP teams, state policymakers were placed in charge of defining what participation
for students with disabilities would look like. State guidelines for inclusion and
participation of students with disabilities usually included rules about which assessment
accommodations could and could not be used, as well as which students could be
excluded from testing (Crawford & Tindal, 2006). Bolt and Thurlow (2004) “…
anticipated that nearly all students with disabilities can participate in statewide
assessments with appropriate accommodations, with only about 10% of these students
requiring the use of an alternate assessment” (p. 142).
Beginning in 1993, NCEO began tracking and analyzing state policies
encompassing assessment and accommodation policies for students with disabilities,
providing information on the kind and amount of access students with disabilities had to
statewide and federal general assessment programs. Each time the NCEO reported on
state policies there were significant changes resulting from the report, as statewide
accountability efforts began to include statewide assessments in efforts to improve
educational programs for all students (Thurlow et al., 2005).
Between 1995 and 1997 there were 34 new or revised policies about participation
of students with disabilities in statewide assessment programs (Thurlow et al., 2000).
Early NCEO reports showed that 40 of 50 states had active policies on the participation
of students with disabilities in state assessment programs. Of the ten states that did not
have assessment programs, five were developing or had suspended assessment programs
while three were revising participation policies. As well, 36 of 40 states relied on the IEP
team’s decision, looked at additional criteria (e.g., meaningfulness of testing for students
with disabilities, certification of a medical condition, examination of the motivation for a
student with disabilities to be like his/her peers, adverse effects of testing on students
with disabilities, availability of appropriate accommodations), and/or examined course
content or curricular validity when determining inclusion of students with disabilities in
their assessment programs (Thurlow et al.). By 2002, research indicated that students
with disabilities were being included in statewide assessment programs; however, it was
not clear if test scores for students with disabilities were part of state accountability
calculations (Bolt, Krentz, & Thurlow, 2002).
By 2001, assessment systems were evolving and all 50 states had state-level
participation policies for students with disabilities in place for state or district testing
(Thurlow et al., 2005). Additionally, English language learners and students with 504
plans were included in state policies and, thus in the research conducted by NCEO.
Policies for participation, as well as accommodations, were becoming more specific for
each of these groups. More assessment options were added to state repertoires including
general assessment without accommodations, general assessment with accommodations,
alternative assessment (available, albeit not always used, in all states), and two
procedures not used in state-wide assessment before: (i) out-of-level testing and (ii)
partial participation. As well, there were still two states that indicated that they might use
the performance of students with disabilities to decide which assessment option was most
appropriate. Some of the most notable changes in state participation policies included the rise in the number of state policies that prohibited use of the nature or category of disability in assessment participation decisions (from 11 to 22 states), that considered whether or not students with disabilities were being instructed in the content being assessed (from 15 to 28 states), and that included parental involvement in the assessment decision (from 9 to 25 states).
In 2006, Crawford and Tindal examined student assessment inclusion and
participation rates in Oregon. They found that the assessment participation rate was part
of the accountability structure and, as such, was designed to improve student achievement
with the expectation that all students, including students with disabilities, participate in
state assessment. To this end, the state was trying to extend the state assessment scale so
all students would be assessed on a common set of academic standards across several
forms of the state assessment. Students could take the state assessment with, or without,
accommodations or with modifications (i.e., non-standard or unapproved test
accommodations). Students with disabilities could also participate in (i) extended
reading/writing/mathematics assessments if they had academic goals in these areas and
‘significant’ disabilities or (ii) extended career and life role assessment. Student
assessment scores were aggregated for students who participated, with or without
accommodations, in the Oregon general state assessment. However, the scores for
students participating in the other assessments were not included as part of the
aggregation.
The most recent analysis of inclusion, participation, and accommodations
available was conducted by Christensen et al. (2008) and sponsored by NCEO.
Christensen et al. (2008) examined 2007 data and found that state policies were still
evolving – becoming more detailed and specific at this point in their development. Some states, as well as Washington, D.C., now had policies posted on their websites. Again,
though not to the same extent as in previous analyses, participation policies extended
testing options for students with disabilities, as well as English language learners and
students with 504 plans. Testing options found included:
• state testing without accommodations
• state testing with accommodations
• alternate assessments
• selective participation
• combination participation
• out-of-level assessment
• locally selected assessment
• state testing with modifications or non-standard accommodations, and
• testing with unique aggregated accommodations.
Christensen et al. (2008) found that there were 27 states, down from 30 states
from the previous analyses, providing some type of testing option for every student as
well as prohibiting the exclusion of students from their state assessment programs.
However, it should be noted that only two of these 27 states explicitly declared
“exclusion prohibited.” Eight states permitted exclusion and provided waivers based on
exemptions such as parental exemption, emotional distress experienced by the student,
student medical condition or illness, student refusal, student absence, or other. The
“other” category encompassed a wide variety of reasons. For example, in Colorado, other
could mean incarceration or the student was a foreign exchange student, and in Alaska,
other could mean the student arrived late in school system or the student had a sudden
and traumatic experience close to testing time.
Christensen et al. (2008) noted that inclusion of students with disabilities and
participation decisions were determined by students’ IEPs in all 50 states. Additionally,
consideration was given to instructional relevance and instructional goals for the student,
the student’s current performance and level of functioning, and the student’s level of
independence when deciding whether the student with disabilities would be included and
participate in the statewide assessment program. They noted that, for this group of
students, there were many policy changes between 2005 and 2007, with many states
citing level of independence, nature or category of disability, and instructional
relevance/instructional goals when deciding whether or not to include students with
disabilities in their state-wide assessment program. As well, they found fewer states
cited consideration of student needs and characteristics, content/nature/purpose of
assessment, and “other” when deciding whether or not to include students with
disabilities in their assessment program. Christensen et al. also explored frequently
cited participation decision-making criteria that were not allowed. These criteria,
relatively unchanged since NCEO’s 2005 data analysis, included presence or category of
disability, cultural/social/linguistic/environmental factors, excessive absences, and low
expectations/anticipated low scores (with the latter cited by 28% of states).
Guidelines for inclusion in statewide assessment programs have changed very
little since McGrew et al. (1992) and Ysseldyke, Thurlow, McGrew, and Shriner (1994)
looked into issues of inclusion and exclusion of students with disabilities. By 2000,
Elliott et al. found some states implementing some of the previously mentioned
guidelines and piloting inclusive testing programs. By 2008, all states had adopted more
sophisticated policies, with defined criteria regarding the inclusion and/or exclusion of
students with disabilities in their state testing programs (Christensen et al., 2008).
However, ideological differences still abound when it comes to inclusion of students with
disabilities in assessment programs. Debate still, more often than not, centers on
…. whether it is more psychometrically sound to base decision making on
smaller numbers of students (e.g., general education students) who participate
fully in a nonaccommodated test or to base decisions on all students, some of
whom have had some changes to the test (Thurlow et al., 2000, p. 163).
With the focus now on inclusion for students with disabilities, and with many
researchers, educators, and policy-makers looking at participation rates and aggregated
data for students with disabilities, there has been a search for new or refined assessment
protocols that are more inclusive and attentive to an individual’s accessibility needs and
preferences (Hansen & Mislevy, 2008). Such protocols have been associated with the
universal design of assessments that, from inception, have been designed to be both
accessible and valid for the widest range of students possible, including students with
disabilities and English language learners. Universal design principles often include
formatting changes such as adding bullets or adding white space (Baker, 2008), with “…
universal design … mak[ing] … assessment[s] more amenable to accommodations a
student may need in order to access the content of the items in the assessment” (p. 20).
Though not intimately tied to the universal design of assessments, it is hoped that this research, with its focus on analyses aiming to identify some of the most effective accommodations for allowing students with disabilities to demonstrate content knowledge rather than disability in federal, statewide, and district-wide assessment programs, will aid the efforts of those exploring universal test design. To this end, focus is now turned to
the types of assessment accommodations provided for students with disabilities in
district-wide, statewide, and federal assessment programs.
Accommodations for students with disabilities.
“An assessment accommodation is an alteration in the way a test is administered”
(Elliott, Thurlow, Ysseldyke, and Erickson, 1997, p. 1) with the accommodation provided
based on student need. Accommodations should not provide a student with an advantage
on the content, or construct, being measured. Typically, there are two parts to the
definition of assessment accommodation. Accommodations change the way tests are
administered, given or taken, under standardized conditions (Bolt & Thurlow, 2004;
Fuchs et al., 2000a) and are intended to facilitate the measurement goals of the
assessment (Bolt & Thurlow, 2004). Tindal and Fuchs (2000) reaffirm this definition and
add that the construct being measured is not altered and changes are referenced to
individual need and differential benefit, not overall improvement.
Assessment accommodations allow students with disabilities to participate in the
assessment process in a meaningful way, providing a way to accommodate for a student’s
disability. Accommodations have been part of the effort to curtail unwanted exclusion of
students with disabilities in assessment programs. With assessment accommodations, it is
expected that students with disabilities be tested on the content they are expected to have
competency in based on their educational experiences, usually noted in their IEPs. While
not the only way to ensure all students have access to assessments, accommodations are
one of the most frequently used methods of ensuring students, particularly students with
disabilities, have access to assessment programs. Additionally, federal laws such as
NCLB (2001) and IDEA (2004) require reasonable and valid accommodations to
measure the academic achievement of students with disabilities. Even the popular media,
in their quest to edify the general public on educational issues, have added to the lexicon
of assessment accommodations. For example, Lewin (2002) in the New York Times
looked at the question of “how far to accommodate students with learning disabilities on
college entrance tests like the SAT” in terms of the “clash between disability rights and
educational standards” noting that “requests for special accommodations proliferate,
especially from affluent white families.”
Variously considered as a way to level the playing field (Tindal & Fuchs, 2000), a
corrective lens to decrease distortion (Chiu & Pearson, 1999), or tools to help in the
assessment process (Enriquez, 2008), assessment accommodations attempt to remove
construct-irrelevant variance due to the disabilities of students with disabilities. As such,
accommodations may remove barriers to assessment access, increasing the probability
that the construct, or content, is accurately measured (Baker, 2008).
[W]ith appropriate accommodations, a student disability…, if unrelated to
the constructs being measured, will no longer be a source hindering the true
demonstration of their competence. Without accommodations, [students with
disabilities] may score lower than they should (Chiu and Pearson, 1999, p. 4).
Thus, when a student with disabilities is not provided with appropriate accommodation[s]
they cannot access the test content and are not able to demonstrate their knowledge,
making it difficult to accurately measure the student with disabilities’ understanding of
the content under consideration on the assessment.
In his discussions of test validity, interpretation, and use, Messick (1990, 1995)
defines construct-irrelevant variance as a type of systematic error that is introduced into
the assessment process. Such error reduces the likelihood that test scores on the
assessment adequately reflect the knowledge, or true achievement level, of the test-taker.
Of particular interest, construct-irrelevant difficulty (Messick, 1995) is some aspect of the
task, extraneous to the construct being assessed, that makes the task unduly difficult for
some individuals or groups. Construct-irrelevant variance is considered a major source of bias in test scoring and test interpretation, and of unfairness in test use.
[L]ow scores should not occur because the assessment is missing something
relevant to the focal construct that, if present, would have permitted the
affected persons to display their competence, ... [nor should they occur]
because the measurement contains something irrelevant that interferes with
the affected persons' demonstration of competence (Messick, 1995, p. 746).
Low scores, as presented by Messick (1990, 1995), constitute an inaccurate representation and a systematic underestimate of the abilities of students with disabilities. It should be
noted that assessment accommodations are not considered assessment, or test,
modifications as assessment accommodations do not change the construct being assessed.
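One way to make this idea concrete, framed here in classical test theory terms rather than in Messick's own notation (an illustrative sketch, not a formula from the sources cited above), is to treat observed score variance as partitioned into a construct-relevant component, a systematic construct-irrelevant component (for example, decoding demands on a mathematics test taken by a student with a reading disability), and random error:

\[
\sigma^{2}_{\text{observed}} = \sigma^{2}_{\text{construct}} + \sigma^{2}_{\text{construct-irrelevant}} + \sigma^{2}_{\text{error}}
\]

Under this framing, an appropriate accommodation aims to reduce the construct-irrelevant component attributable to the disability while leaving the construct-relevant component, and hence the meaning of the score, unchanged; a modification, by contrast, alters the construct component itself.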
Additionally, assessment accommodations have been viewed as a method to
increase participation in national, state, and/or district assessment programs.
Accommodations enhance the perceptions of fairness and credibility for these assessment
programs when the same assessment accommodations are used in the same way
(Reschly, 1993).
Specific legislation related to assessment accommodations is provided in both
NCLB (2001) and IDEA (2004). IDEA (2004) requires participation of students with
disabilities in state and district-wide assessments “with appropriate accommodations
where necessary” ((SEC. 612) (a) (16) (A)) based on the IEP team and IEP information
of the student with disabilities (see (SEC. 614) (d) (1) (A) (V) and (VI)). NCLB (2001)
complements IDEA (1997, 2004) with its emphasis on stronger accountability for results.
As such, NCLB (2001) requires the participation of all students on state accountability
assessments, with provisions for reasonable adaptations or accommodations allowing
students with disabilities access to assessment content as defined under section
612(a)(17)(A) of IDEA (2004) (see NCLB (2001): TITLE I A(1111) (b)(2)(I)(ii)).
Types of accommodations.
Assessment accommodations have typically been categorized into four (Thurlow, Seyfarth, Scott, & Ysseldyke, 1997; Tindal & Fuchs, 2000; Ysseldyke et al., 1994), five
(Christensen et al., 2008; Clapper, Morse, Lazarus, Thompson, & Thurlow, 2003;
Lazarus, Thurlow, Lail, Eisenbraun, & Kato, 2005; Thurlow et al., 2005), or six different
categories (Elliott, 1997; Thurlow et al., 2000). Typical categories used to classify
assessment accommodations are setting, presentation, timing, response, scheduling, and
other. The ‘other’ category is generally used as a catchall for accommodations that do not
fit neatly into the other classification areas. The most frequently used categorization schemas place scheduling and timing in the same category and include a new category, equipment and materials accommodations, not found in earlier documentation on
classification categories (Christensen et al.; Clapper et al.; Lazarus et al.; Thurlow et al.,
2005). The number and types of assessment accommodations cited in the literature have
varied little over the years research on assessment accommodations for students with
disabilities has been conducted.
Examples of typical assessment accommodations falling under the various categories follow (Table 1).
Table 1: Types of Assessment Accommodations
Setting
• Administer the test to a small group in a separate location
• Administer the test individually in a separate location
• Provide special lighting
• Provide adaptive or special furniture
• Provide special acoustics
• Administer the test in a location with minimal distractions
• Administer the test in a small group, study carrel, or individually

Presentation
• Provide on audio tape
• Increase spacing between items or reduce items per page or line
• Increase size of answer bubbles
• Provide reading passages with one complete sentence per line
• Highlight key words or phrases in directions
• Provide cues (e.g., arrows and stop signs) on answer form
• Secure papers to work area with tape/magnets

Timing
• Allow a flexible schedule
• Extend the time allotted to complete the test
• Allow frequent breaks during the test
• Provide frequent breaks on one subtest but not another

Response
• Allow marking of answers in booklet
• Tape record responses for later verbatim translation
• Allow use of scribe
• Provide copying assistance between drafts

Scheduling
• Administer the test in several sessions, specifying the duration of each session
• Administer the test over several days, specifying the duration of each day's session
• Allow subtests to be taken in a different order
• Administer the test in the afternoon rather than in the morning, or vice versa

Other
• Special test preparation
• On-task/focusing prompts
• Any accommodation that a student needs that does not fit under the existing categories

Elliott et al., 1997, p. 2
It is generally recommended that
[a]ccommodations… be provided for the assessment when they are routinely
provided during classroom instruction. In other words, when classroom
accommodations are made so that learning is not impeded by a student's
disability, such accommodations generally should be provided during assessment
(Elliott et al., 1997, p. 3).
Research, such as that conducted by NCEO, shows that state lists of approved standard accommodations, which are considered not to be a threat to the validity of the assessment or the comparability of test items, vary from state to state, and that there is limited consensus
regarding acceptable, allowable accommodations for students with disabilities (Bolt &
Thurlow, 2004). Perhaps as a result of legislative requirements for students with disabilities' participation in state and district-wide assessment programs, practices in allowing assessment accommodations are quite variable, with differences in the availability of
state guidelines and, when provided, differences in the content of state guidelines on test
accommodations. Additionally, “[s]tate accommodation policies are continually changing
reflecting uncertainty of educational agencies” (Bolt & Thurlow, p. 142). Thurlow et al.
(2000) noted that this lack of agreement across states poses problems, particularly for
students with disabilities moving from one state to another.
One of the most frequently allowed accommodations is “[p]roviding extended
time or unlimited time to [students with disabilities]” (Chiu & Pearson, 1999, p. 2). More
recent research (Bolt & Thurlow, 2004) indicated the five most frequently allowed
accommodations for statewide assessment programs are dictated response, large print,
Braille, extended-time, and sign language interpreter.
Primary studies of the effectiveness of accommodations.
Many primary studies examining the effectiveness of testing accommodations for
students with disabilities can be found in the literature. Primary research in this area
usually falls under one of three research designs: experimental, where the test administration condition was manipulated and there was random assignment to conditions; quasi-experimental, where the test administration condition was manipulated but students were not randomly assigned to conditions; and non-experimental, often using an ex post facto comparison of students taking a standard version and an accommodated version of the same test. An example of primary research using each of these designs follows.
Calhoon, Fuchs, and Hamlett (2000) provide an example of a primary study on
the effectiveness of testing accommodations using an experimental design. Calhoon et al. compared the effects of computer-based test accommodation, non-computer-based test accommodation (i.e., teacher oral presentation), and no-accommodation conditions on a constructed-response mathematics performance assessment. Four different testing conditions were examined: (i) standard administration, (ii) teacher-read administration, (iii) computer-read administration, and (iv) computer-read administration accompanied
by video. Over the course of four weeks 81 ninth- through twelfth-grade students with
disabilities who were receiving mathematics and reading instruction in special education
resource rooms, based on IEPs, were assessed under each of the different,
counterbalanced testing conditions. The researchers found that students with disabilities
performed better when the assessment was read aloud than when a standard paper and
pencil administration was used, with the effect sizes ranging from approximately one-
quarter to one-third of a standard deviation. There were no significant differences
between the oral presentation, teacher versus computer, conditions. However, a survey of
the students with disabilities indicated that they preferred the computer oral presentation
as it afforded them anonymity when taking the test. A major limitation of this research is that only students with disabilities were included. The authors suggested that future research include both students with disabilities and typically developing students in the analyses.
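As an illustration of the effect size metric referred to above, the following sketch computes a standardized mean difference between accommodated and standard administrations by dividing the mean score difference by a pooled standard deviation. This is not Calhoon et al.'s analysis code; the function name and score values are hypothetical, and a within-subjects design such as theirs would ordinarily use a repeated-measures variant of this statistic.

import statistics

def standardized_mean_difference(accommodated, standard):
    """Cohen's d-style effect size: mean difference divided by the pooled standard deviation."""
    m_acc, m_std = statistics.mean(accommodated), statistics.mean(standard)
    s_acc, s_std = statistics.stdev(accommodated), statistics.stdev(standard)
    n_acc, n_std = len(accommodated), len(standard)
    pooled_sd = (((n_acc - 1) * s_acc ** 2 + (n_std - 1) * s_std ** 2)
                 / (n_acc + n_std - 2)) ** 0.5
    return (m_acc - m_std) / pooled_sd

# Hypothetical scores for the same eight students under the two administrations.
read_aloud_scores = [14, 11, 16, 13, 12, 15, 10, 17]
standard_scores = [12, 10, 14, 12, 11, 13, 9, 15]
print(standardized_mean_difference(read_aloud_scores, standard_scores))  # roughly 0.67 for these made-up data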
Helwig and Tindal (2003) provide an example of a primary study on the
effectiveness of testing accommodations using a quasi-experimental design. Helwig and
Tindal investigated the accuracy with which special education teachers were able to
recommend oral accommodations for students. Using a 5-point Likert scale, teachers
were asked to judge a student’s proficiency in reading and mathematics and then rate how
important an oral accommodation would be to the student’s success on one of two forms
(A and B) of a thirty-item, multiple-choice mathematics assessment. Students with
disabilities (n = 245) and typically developing students (n = 973) in fourth through eighth
grades in eight states then took an accommodated, items read aloud via a video
presentation, and a non-accommodated form of the mathematics test. The results ran counter to other research in the area: in most of the comparisons, both students with disabilities and typically developing students performed better in the non-accommodated condition than in the accommodated condition. It was even more surprising that students considered to be “low readers” followed this trend. There was no
connection between performance on reading and basic math skills tests and the need for
oral administration accommodations. As well, teachers were not able to predict which
students would benefit from the oral administration accommodation as teacher ratings of
student need for assessment accommodations only coincided with actual student
performance approximately one-half of the time. The authors recognized that one of the
major limitations of their study was the elimination of students who did not experience at
least one-half a standard deviation change in assessment score between the assessment
conditions. This effectively reduced, by one-half, the total number of students accounted
for in the analyses of the assessment accommodation condition. It also reduced the
number of teacher ratings by one-half, potentially eliminating many correct
recommendations. Helwig and Tindal also noted that it might have been beneficial for the
students participating in the study to have practice in using the accommodation prior to
the testing situation.
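A minimal sketch of the kind of agreement calculation described above follows (hypothetical data and function name; this is not Helwig and Tindal's code). It filters out students whose scores changed by less than half a standard deviation between conditions and then checks how often the teacher's recommendation matched the direction of the observed change.

def recommendation_agreement(records, test_sd, threshold=0.5):
    """records holds (teacher_recommended_oral, accommodated_score, standard_score) tuples.
    Keep only students whose score changed by at least `threshold` standard deviations,
    then return the proportion for whom the recommendation matched the change direction."""
    kept = [(recommended, accommodated - standard)
            for recommended, accommodated, standard in records
            if abs(accommodated - standard) >= threshold * test_sd]
    if not kept:
        return None
    matches = sum(1 for recommended, change in kept if recommended == (change > 0))
    return matches / len(kept)

# Hypothetical records: (teacher recommended oral accommodation?, accommodated score, standard score).
sample = [(True, 18, 12), (False, 11, 15), (True, 13, 13), (False, 16, 10)]
print(recommendation_agreement(sample, test_sd=6.0))  # 2 of the 3 retained students match

The filtering step also illustrates the limitation the authors acknowledged: students whose scores change little contribute nothing to the agreement estimate, here dropping one of the four hypothetical students.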
Zurcher and Bryant (2001) provide an example of a primary study on the
effectiveness of testing accommodations using a non-experimental design, albeit not an
ex post facto design. Zurcher and Bryant examined the comparability and criterion
validity of test scores for college-aged students with disabilities, specifically learning
disabilities, and typically developing college students serving as the control group, under
accommodated and non-accommodated conditions. Thirty undergraduate volunteers from
three different colleges in southwestern Texas, 15 students with disabilities and 15
students with typical development, were selected to participate in the study. Students with
disabilities selected to participate had to be eligible to take, but had not yet taken, the
Miller Analogies Test under accommodated conditions: extended-time or oral
administration using an audiocassette, reader and/or scribe. Using a counter-balanced
design, the test was split into two halves and each student, a student with disabilities
matched with a typically developing student, took one-half of the assessment using a
student-specific accommodation and the other half of the assessment without any
accommodation. Although typically developing students did not display a significant test
score gain under accommodated conditions, results did not support the test interaction
hypothesis (Sireci et al., 2003; Sireci et al., 2005) as their matched counterparts, students
with disabilities, also did not display a significant gain under accommodated conditions.
The authors noted several methodological limitations including small sample size,
relatively short half-tests that may not have captured the potency of the accommodation
effect, and lack of random assignment and matching which made across group
comparisons difficult. For example, the GPA for students with disabilities was 2.72,
while the GPA for their typically developing peers was 3.27.
Syntheses of the literature on the effectiveness of accommodations.
Several syntheses of the literature on the effectiveness of test accommodations for
students with disabilities exist, most looking at testing accommodations after the
implementation of NCLB (2001). Starting in 2002, NCEO began a review of primary
studies in this area, generally providing three-year snapshots, starting with 1999 to 2000,
of research on the effects of test accommodations.
Tindal and Fuchs (2000) conducted one of the first syntheses of the research literature
on the effectiveness of testing accommodations. They were seeking to provide personnel
in school districts and state departments of education with a “comprehensive synthesis of
the research literature on the effects of test accommodations on students with disabilities”
(p. 16). In an effort to summarize research on changes to test administration over the
preceding decade they identified 114 studies on more than 20 different accommodations,
including research on test accommodations, test modifications, and the use of alternate
assessments. Tindal and Fuchs categorized the research they reviewed into three approaches: descriptive, comparative, and experimental. Additionally, the research
studies were synthesized and organized according to types of test changes, generally
assessment accommodations, based on a taxonomy proposed by NCEO. The research
reviewed was grouped according to changes in schedule, presentation, test directions, use
of assistive devices/supports, and test setting.
While the authors concluded that research on assessment accommodations was in
its infancy, as most research at that point was usually not generalizable and needed to be
interpreted with caution, there were consistent significant effects for moderately to
significantly disabled preschoolers taking tests in the presence of familiar examiners. As
well, “…making changes in the way tests are presented had a positive impact on student
performance although the results have not always been differential for students with
disabilities versus those without disabilities” with the “most clear and positive finding …
to be in the use of large print or Braille and in the use of read aloud of math problems
both of which appear differentially effective” (Tindal & Fuchs, 2000, p. 58). Tindal and
Fuchs further suggested research on assessment accommodations (i) use experimental
rather than descriptive or comparative designs and (ii) be studied in the context of
validity and not necessarily in the context of population, such as students with disabilities
or English language learners.
Thompson, Blount, and Thurlow (2002), in an NCEO technical report, extended
the work of Tindal and Fuchs (2000), reviewing 46 empirical studies published from
1999 through 2001, to provide evidence regarding whether the use of certain assessment
accommodations (i) threatened test validity or score comparability and (ii) were useful
for individual students as “[t]he enactment of the No Child Left Behind Act of 2001
[brought] urgency” (p. 5) to research questions focusing on assessment accommodations.
The authors believe that “[o]ne of the most viable ways to increase the participation of
[students with disabilities] in assessments is through the use of accommodations”
(Thompson et al., p. 8), participation that was mandated in NCLB 2001. Components of
research summarized in the technical report included
type of assessment, content area assessed, number of research participants, types
of disabilities included in the sample, grade-level of the participants, research
design, research findings, limitations of the study, and recommendations for
future research (p. 9).
Thompson et al. (2002) noted a dramatic increase in the number of research
studies on test accommodations, with 58 published in the nine-year span from 1990
through 1998, as compared to 46 published from 1999 through 2001. The two most
common purposes for studying assessment accommodations were the investigation of
differential boost, or test interaction hypothesis, where students with disabilities had
greater test score gains than their typically developing peers and the investigation of
assessment accommodations on test score validity. Criterion-referenced tests used for
state accountability were the most common types of tests examined, in 21 studies, with
norm-referenced or other standardized tests following closely behind, in 17 studies.
Almost one-half of all tests under investigation were mathematics tests, while
approximately one-third were reading or language arts tests. The number of participants
in the studies under investigation ranged from three to almost 21,000, with the majority
of studies looking at elementary school students. Twenty-seven of the studies
documented participants’ disabilities, with the two most common types of disabilities
being learning and cognitive disabilities. Researchers in 21 of the 46 research studies
reviewed identified limitations for their studies with the three most common limitations
cited being “unknown variations among students included in the study, sample sizes too
small to provide adequate statistical support, and nonstandard administration of the
accommodations across proctors and schools” (Thompson et al., p. 6).
With respect to assessment accommodations, Thompson et al. (2002) noted that
three accommodations showed a positive effect on student test scores...: computer
administration [four of seven studies], oral presentation [six of seven studies], and
extended time [four of seven studies]. However, additional studies on each of
these accommodations also found no significant effect on scores or alterations in
item comparability (p. 23).
Thompson et al. (2002) suggested that research on assessment accommodations
lacked clarity in the (i) definitions of the constructs tested and (ii) accommodations
needed by individual students. They also suggested that researchers explore students'
perceptions of the desirability and usefulness of the accommodations provided, as they are
the primary consumers of assessment accommodations. Further, they believed "[m]ore
rigorous research, using designs comparing scores and interactions between the presence
and absence of a disability are needed in the future” (p. 23).
Bolt and Thurlow (2004) identified and reviewed 36 studies on five of the most
frequently mentioned accommodations for research conducted between 1990 and 2002.
They selected studies on dictated response (k = 16), large print (k = 4), Braille (k = 2),
extended-time (k = 22), and use of a sign language interpreter (k = 2) based on the 1999
NCEO report on state accommodation policies. Studies were selected based on the
following four criteria:
1. The study was conducted or published after 1990.
2. The study focused on the effects of accommodations for students with disabilities
in kindergarten through 12th grade.
3. The study examined the effects of accommodations on achievement or college
entrance tests.
4. The study design allowed for the analysis of the effects of single accommodations,
as opposed to the effects of accommodation packages.
Of all the studies investigated, 17 used traditional experimental methodologies, 4 of
which involved individualized assignment of students to accommodation packages.
Comparative methodologies were used in 13 studies; 5 studies were descriptive, …
and the remaining study was a meta-analysis (Bolt & Thurlow, 2004, p. 145).
The authors also examined the different approaches used to examine assessment
accommodations in the research studies; differential boost studies (interaction of the
disability status and accommodation condition), boost studies (accommodation increased
test scores), studies of measurement comparability of the test (examination of factor
structure and/or DIF in accommodated and unaccommodated conditions), and
comparative studies (comparison of students with disabilities’ “accommodated”
assessment scores to “non-accommodated” assessment scores of students with or without
disabilities).
Bolt and Thurlow (2004) found mixed results for three of the five
accommodations under review. Studies looking at dictated response, large print, and
extended time produced supportive and non-supportive results for each of these
assessment accommodations. It should be noted that much of the research indicated that
“dictated response” is an effective accommodation and boosts the test scores of students
with disabilities, findings similar to those of Chiu and Pearson (1999). However, some researchers
pointed out that this may result in implausibly high scores for this population. As very little
research was found for Braille and use of an interpreter for instructions, little could be
concluded about the use of these assessment accommodations. The authors discussed
several issues with the studies they reviewed, including: providing test accommodations
for students who have a clear need for a specific accommodation; poor student selection
(e.g., selecting students with disabilities who do not need accommodations); more than
adequate time for extended time studies such that the research condition is not mimicking
the less-than-adequate time provided in the actual testing situation; examining alternative
types of extended time such as more frequent breaks; and ensuring students with
disabilities and typically developing peers participating in the research condition are
comfortable with and have used the assessment accommodation under investigation.
Tindal and Ketterlin-Geller (2004) reviewed research examining the effects of
assessment accommodations on large-scale tests of mathematics, expressly mathematics
tests with specific relevance for the National Assessment of Educational Progress
(NAEP). Specific accommodations reviewed included assessment in small group settings,
extended-time, use of calculators, read-aloud, and multiple accommodations (also called
administration accommodation packages). The authors noted that NAEP did not allow for
the use of assessment accommodations until 2002, thus prior results did not include a
representative sample of students with disabilities.
Tindal and Ketterlin-Geller (2004) identified all published literature on large-
scale mathematics assessments, finding a total of 28 studies published prior to 2000 and
14 studies published between 2000 and 2002. Unlike other authors of syntheses in this
area, they were not specifically interested in the different study approaches of boost,
differential boost, measurement comparability, or comparison of accommodated and non-
accommodated test scores. They found results of the research they reviewed, generally
based on the different approaches, to be tentative with conflicting overall test results.
They alleged that the “one consistent finding … beginning to emerge … is the interaction
of the item with specific skills of individuals” (p. 13), leading them to state that
“[c]onstruct-irrelevant variance (unintended influence of skills and knowledge that are
not part of the construct being measured) is item specific" (p. 8), such that studies on
assessment accommodations should consider (i) using universal design in item development, (ii)
organizing tests into sections in an effort to quarantine construct-irrelevant variance by
allowing accommodations on sections where it does not interfere with the measurement
of the construct under consideration, and (iii) using computer adaptive testing, as the
presentation of items is based on item characteristic curves, distribution on an ability
scale, and the “item’s target construct relative to an access skill” (p. 13). The authors
noted that the latter is still under development and was not available for general use in
2003.
Johnstone, Altman, Thurlow, and Thompson (2006), in a continuation of the work
of Tindal and Fuchs (2000) and Thompson et al. (2002), reviewed recent research on the
effects of assessment accommodations for students with disabilities on large-scale
assessments. Such research and research syntheses are needed
[a]s states and school districts strive to meet the goals for adequate yearly
progress required by NCLB, [given that] the use of individual accommodations
continues to be scrutinized for effectiveness, threats to test validity, and score
comparability (Johnstone et al., 2006, p. iii).
Johnstone et al. (2006) summarized information and findings from 49 empirical
studies conducted between 2002 and 2004. The research examined involved 1 – 100
participants, 100 – 1,000 participants, or over 1,000 participants from multiple age
categories, generally tested on norm-referenced or criterion-referenced
mathematics or reading/language arts large-scale assessments. Subjects targeted for the
research under review fell under the learning disability category more often than any
other disability category. As with the Thompson et al. (2002) synthesis, the components
of research summarized included the type of assessment, content area assessed, number
of research participants, types of disabilities included in the sample, the participant grade-
level, research findings, limitations of the study, and recommendations for future
research. The authors extended the components summarized to include research purpose,
type of accommodation, and percentage of sample that were students with disabilities.
There were two primary purposes for the studies reviewed: examination of
the effects of assessment accommodations on test scores (k = 23) and the effects of
assessment accommodations on test score validity (k = 13). Researchers used a variety of
research methods, with the two most common methods being experimental or quasi-
experimental in nature (k = 21) and reviews of/research using extant data (k = 17). Two
studies conducted during this timeframe were considered to be meta-analyses; however,
upon further examination these studies would not be considered “formal” meta-analyses.
Fifteen different types of accommodations found were grouped according to presentation
(k = 21), timing/scheduling (k = 8), response (k = 2), technological aids (k = 2), and
multiple accommodations (k = 11). When viewing the 49 studies the authors did not find
any common themes. They cited this lack of consistency in research results as an
indicator of the need for further research in this area.
Johnstone et al. (2006) found the limitations most frequently mentioned by the
researchers were noting that studies were too narrow in scope, involved a small sample
size, or had confounding factors. Echoing the research limitations found by Thompson et
al. (2002), the authors pointed to the need for clearer definitions of the constructs tested
and examination of student perception of the desirability and usefulness of the
accommodations they were provided. Additionally, the authors pointed to the need to
study the institutional factors affecting accommodation judgments: how schools, districts,
and states decide which assessment accommodations are allowable and which are not.
Zenisky and Sireci (2007) provided a further secondary analysis of the research,
reviewing 32 published studies on assessment accommodation research conducted
between 2005 and 2006 with all but five of the studies published in refereed journals.
Research conducted with the most frequency during this timeframe focused on (i) the
empirical evaluation of test score comparability for tests administered with and without
accommodations and (ii) descriptive studies of current accommodations practices for
students with disabilities and their typically developing peers. As well, the research
examined generally looked at academic measures, criterion-referenced tests,
miscellaneous cognitive and intelligence measures, and instruments developed for
research purposes for content in mathematics and reading, with state criterion-referenced
assessment often used for NCLB purposes as the most commonly used data collection
instruments. Participants in these studies ranged from nine to 107,000 with most studies
collecting data on 100 to 300 participants. As well, participants were drawn from
various grade levels, K – 12, and included college/university students. One study used
participants in an adult education setting. As with other synthesis studies in this area,
there was a wide range of disabilities included in the research; learning disabilities being
the most commonly represented disability. However, it should be acknowledged that ten
studies did not provide information on specific disability for participants. While most
studies examined assessment accommodations that fell under presentation and
timing/scheduling categories, a few studies looked at accommodations falling under
setting categories. This narrowing of assessment accommodations to two primary
categories is in contrast to the four categories reported in the summaries of
accommodations by Johnstone et al. (2006) and Thompson et al. (2002). It should be
noted that timing/scheduling accommodations, specifically extended time, was, again,
one of the most-studied accommodations. Other frequently studied accommodations
included oral accommodations and computerized administration. Most of the studies
conducted used non-experimental (k = 14), followed by quasi-experimental (k = 11), and
experimental (k = 7) research designs. Of the empirical research, over 50% used primary
data collection rather than existing data sets for their analyses. Some of the research
studies focused on assessing the need for accommodations as well as the selection and
implementation of accommodations, frequently using surveys to collect this information.
Zenisky and Sireci (2007) noted that empirically tested oral presentation, timing
(extended time), and accommodations for computerized assessment were often found to
have positive effects on test scores, with some studies reporting no effects for assessment
accommodations. By and large, timing accommodations yielded positive effects on test
scores. No studies reported negative effects on test scores for testing accommodations.
Limitations most frequently noted by the investigators represented in this
summary of research were small sample size, lack of diversity in the sample, and issues
with operationalization and implementation of the assessment accommodations. As well,
some researchers cited test or testing context (for example, the number of items on the
measure used) and unexpected results as study limitations.
Zenisky and Sireci (2007) cited a number of promising avenues for future
research including “varying or improving on research methods with respect to testing for
the effects of specific accommodations and improving test development practices to
reduce the need for accommodations” (p. iv). Specific directions for future studies on
assessment accommodations were “(1) further study of extended time, (2) computers and
assistive technology as accommodations, (3) the role of teachers, and (4) the interaction
hypothesis” (p. 15). The authors note that directions such as these are needed to further
refine research in the area of assessment accommodations and expand our knowledge of
how best to obtain valid measures of student performance since “variations across
operational definitions, tests, populations, settings, and contexts still curb all but the most
general policy implications” (p. 17). With the high-stakes consequences of decisions
made based on test score interpretation, particularly in light of NCLB (2001), general
policy implications are no longer adequate.
Thurlow (2007), in a paper presented at the American Education Research
Association conference, summarized the findings of syntheses on the effectiveness of
assessment accommodations by Tindal and Fuchs (1999), Thompson et al. (2002),
Johnstone et al. (2006), and Zenisky and Sireci (2007, in press at the time of her
presentation). Thurlow noted the increase in the amount of research conducted, beginning
in 1990, in this area. Aggregating across the syntheses, Thurlow saw a significant amount
of research conducted using oral administration and extended-time accommodations. The
author found that the results from studies on oral administration to be
complicated by the inclusion of different groups of students, the study of
different content areas, the use of different media for presenting the
accommodation (person vs. video vs. audio tape), and by other refinements
(such as the length of the passage to be read) (p. 6),
with results showing positive effects for students with disabilities, positive effects for
students with disabilities and typically developing peers, or no effects. Research focusing
on extended time accommodations was more consistent, generally showing positive
effects for students with disabilities. Thurlow found that the most commonly allowed
assessment accommodations in assessment programs were not necessarily the most
frequently studied accommodations; the most commonly allowed were large print,
individualized administration, small group administration, magnification, Braille,
use of a separate room, writing directly in the test
booklet, and extended time (time beneficial to the test taker).
Thurlow (2007) observed an expansion in the number of states providing
assessment accommodation policies and guidelines, an increase in the complexity of the
accommodations, and increased length in the documentation regarding accommodations.
As well, Thurlow found that states were becoming concerned with the "[c]larity
about the effects of … test changes on the validity of test results” (p. 10). States were also
trying to increase the validity of accommodations that include a human component,
referred to as "access assistants" by NCEO, such as oral administration, scribe, and
sign language interpretation, by providing written guidelines for most, albeit not all,
access assistants.
Thurlow (2007) recommended aligning research with existing state policies on
accommodations allowed without restrictions and accommodations allowed with
restrictions, specifically those allowed with restrictions (oral administration, use of a
calculator, use of a scribe, and extended time), as they are the most controversial of the
testing accommodations. With a growing number of states implementing assessment
accommodation policies and guidelines, Thurlow indicated that this type of alignment
was especially relevant when considering how best to affect policy on testing
accommodations, noting that most states do not have the resources to conduct research on
assessment accommodations that has an impact on specific state accommodation
policies.
Cormier, Altman, Shyyan, and Thurlow (2010) summarized the results of 40
empirical studies conducted between 2007 and 2008. Most of the studies focused on
either (i) the effects of accommodations on test scores of students with disabilities (k =
13) or (ii) a comparison of test scores for unaccommodated versus accommodated
assessment conditions (k = 11); i.e., boost or differential boost studies. Most studies
conducted during this time examined math or reading content, and research participants
were enrolled in the K – 12 educational system. A majority of studies had large sample
sizes (more than 300 participants). As with previous syntheses of the research in this
area (e.g., research examining the effects of read-aloud or extended-time conditions),
results from the aggregate research were mixed.
Cormier et al. (2010) found that research on extended time accommodations was
declining, while research investigating accommodation packages was increasing. They
noted that “[a]lthough this accommodation was studied frequently in the past, it has lost
its place as an accommodation in many states because of a move to untimed tests” (p.
18). While investigation of accommodation packages is valuable, others have expressed
concern that empirically effective accommodation packages may include extraneous
accommodations that do not add to the efficacy of the package (Elliott, Kratochwill, &
McKevitt, 2001).
Synthesis studies of the effectiveness of accommodations.
The most frequently cited large-scale secondary analysis of the effectiveness of
assessment accommodations was conducted by Chiu and Pearson in 1999. Using meta-
analytic techniques, Chui and Pearson examined 30 research studies searching for
empirical evidence to support the hypothesis that test accommodations would increase
the test scores of students with disabilities and English language learners relative to a
situation where no accommodations were provided and relative to typically developing
peers. Additionally,
… to determine if the accommodations under investigation ‘matched’ the needs of
the target students, [they] checked to ensure that the included research studies had
explicitly described the nature of the target students and had provided narrative
descriptions for the accommodations used (p. 6).
For the studies they examined, Chiu and Pearson found the most frequently studied
accommodation was timing of the test, or extended time (47%), with test setting (2%) and
response format (2%) being the least frequently studied. Students with learning
disabilities (61%) were the most commonly studied subgroup, with timing of the test
being the most frequently studied accommodation for this subgroup.
Chiu and Pearson (1999) noted that
… the significant Q test for homogeneity of variance revealed that the variations
among the accommodation effects were large, implying that using the mean
effect alone could be misleading because it would fail to portray the diversity of
accommodation effects (p. 15).
To counter this issue, Chiu and Pearson only used effect sizes where both the target
groups, students with disabilities and English language learners, and general education
populations were included; i.e., equivalent groups or test-retest designs. The recomputed
mean effect size was 0.11 using Hedges and Olkin’s (1985) procedure to “examine the
relationship between the characteristics of the studies and outcome measures” (p. 15).
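To make the weighting scheme concrete, the following minimal sketch (not code from Chiu and Pearson) computes an inverse-variance weighted mean effect size and Cochran's Q homogeneity statistic of the kind used in the Hedges and Olkin (1985) framework; the effect sizes and sampling variances are hypothetical values chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical study-level effect sizes and their sampling variances
d = np.array([0.05, 0.22, 0.31, -0.10, 0.18])
v = np.array([0.02, 0.04, 0.03, 0.05, 0.02])

w = 1.0 / v                          # inverse-variance weights
d_bar = np.sum(w * d) / np.sum(w)    # weighted mean effect size
Q = np.sum(w * (d - d_bar) ** 2)     # Cochran's Q homogeneity statistic
df = len(d) - 1
p = stats.chi2.sf(Q, df)             # Q follows a chi-square with k - 1 df under homogeneity

print(f"weighted mean d = {d_bar:.3f}, Q = {Q:.2f}, df = {df}, p = {p:.3f}")
```

A significant Q, as Chiu and Pearson report, signals that the studies do not share a single underlying effect, which is why they cautioned against relying on the mean effect alone.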
They found test accommodations have a small, positive effect on the target
students under analysis. Evidence pointed to an overall weighted mean effect of 0.16 for
students with disabilities and English language learners, providing them with a slight
advantage over their typically developing “peers,” with an overall weighted mean effect
of 0.06 (Chiu & Pearson, 1999). They noted that, for the types of accommodations
examined, presentation format was the only accommodation with a homogeneous mean
relative effect, while all other accommodations exhibited heterogeneous effects.
However, they suggested that their results be interpreted with caution, as there were a
variety of accommodations, statuses for students, and implementations of
accommodations. Further, some confidence intervals for effect sizes were extremely wide
and could envelop the mean effect and the relative mean effect for the type of
accommodation, thus leading them to state that there was no difference in the efficacy of
the accommodation for the target population relative to the general education population.
Chiu and Pearson concluded that students with disabilities and English language learners
could increase their test scores on standardized tests with appropriate test
accommodations.
Specific issues with this meta-analysis are related to combining English language
learners and students with disabilities populations to study accommodation effects. While
many studies provide information on the use of test accommodations with these groups,
recent considerations in the field indicate that effective accommodations for students with
disabilities, for the most part, are different from those found to be efficacious for English
language learners (Enriquez, 2008). As well, this meta-analysis is over ten years old and
was conducted prior to NCLB, which mandated testing for AYP and school
accountability. There has been rapid growth in the testing industry, with much more
research into testing accommodations, since Chiu and Pearson (1999) conducted their
meta-analysis, the only meta-analysis to date, on this particular topic.
It must be noted that two further meta-analyses examining the effects of
assessment accommodations on students with disabilities were conducted within the past
five years, but were limited in their scope. Elbaum (2007), as part of a larger study on the
efficacy of oral test accommodations for students with disabilities on math assessments,
used meta-analysis to examine existing research on read-aloud accommodations for
students with disabilities. Gregg and Nelson (2012) used meta-analysis to examine the
use of extra time for students with learning disabilities transitioning from high school to
college.
Elbaum (2007) focused on studies using read-aloud accommodations on math
assessments that may, or may not, have been considered high-stakes assessments. Elbaum
calculated separate mean effect size differences, d, for studies examining (i) elementary
school students and (ii) secondary school students. Findings indicated that there was a
small effect for elementary school students, d = 0.20, and a very small effect, d = 0.12, for
secondary school students. Elbaum concluded that there was “… a statistically significant
association of students’ school level with the difference in effect sizes for students with
and without [learning disabilities]” (p. 225). Further, Elbaum found
… the accommodation boost for elementary students is clearly of greater
magnitude for students with [learning disabilities] than it is for students
without [learning disabilities], the impact on secondary students shows
greater benefits for students without disabilities (p. 227).
Gregg and Nelson (2012) examined the use of extra time for students with
learning disabilities, specifically those students transitioning from high school to college.
Using the results from nine studies, their meta-analyses focused on three comparisons:
scores of students with learning disabilities in accommodated conditions to typically
developing peers in non-accommodated conditions, scores of students with learning
disabilities to typically developing peers in accommodated conditions, and scores of
students with learning disabilities to typically developing peers in non-accommodated
conditions. Using Comprehensive Meta-Analysis V.2, they estimated Cohen’s d effect
sizes. They found that typically achieving students in unaccommodated conditions
outperformed students with disabilities using an extended time accommodation (d = -0.41).
They were unable to provide similar information for their other two comparisons as
“[t]he results … underscore the lack of research available to make conclusions about the
comparability of scores for transitioning students with [learning disabilities] taking tests
with extended time to their normally achieving peers” (p. 136).
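For readers unfamiliar with the metric, the sketch below illustrates how a Cohen's d standardized mean difference can be computed from group summary statistics. The function and all numbers are hypothetical illustrations; they are not drawn from Gregg and Nelson's nine studies.

```python
import math

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference (group 1 minus group 2) using a pooled SD."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)

# Hypothetical summary statistics: students with learning disabilities tested with
# extended time (group 1) vs. typically achieving peers tested without
# accommodations (group 2); a negative d favors the comparison group
print(round(cohens_d(45.0, 11.0, 35, 52.0, 10.0, 50), 2))
```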
Test accommodation interaction hypothesis and differential boost.
Considered a well-controlled research approach, the test interaction hypothesis
involves testing the interaction between testing condition (accommodated and
unaccommodated conditions) and disability status (students with and without
disabilities). The test interaction hypothesis postulates that appropriate accommodations
will boost the scores of students with disabilities more than their typically developing
peers (Bolt & Thurlow, 2004; Sireci et al., 2003; Sireci et al., 2005). This “[d]ifferential
impact on students with and without disabilities provides evidence that the
accommodation removes a barrier based on disability” (Macarthur & Cavalier, 2004, p.
55) and effectively removes construct-irrelevant variability (Messick, 1995). “Boost
studies;” employing a within-subjects or a random-independent-groups (across subjects)
design and having a control group that does not receive accommodations to determine
whether or not students with disabilities score significantly higher under accommodated
conditions (Bolt & Thurlow, 2004); do not test the significance of an interaction between
disability status and testing condition as is found with research work using the test
accommodation interaction hypothesis. Research studies exploring how test scores for
accommodated students with disabilities compare to test scores of other students with
disabilities or those of typically developing students, called “comparative studies”, also
do not test the significance of an interaction between disability and testing condition
(Bolt & Thurlow, 2004).
The interaction hypothesis, also referred to as the "maximum potential thesis,"
posited by Zuriff (2000), states that “students without disabilities would not benefit from
extra examination time because they are already operating at their maximum potential
under timed conditions” (p. 101). A similar theory, differential boost (Fuchs & Fuchs,
1999), posits that both students with disabilities and their typically developing peers will
benefit from testing accommodations. However, students with disabilities are expected to
benefit differentially more than their typically developing peers. The test accommodation
interaction hypothesis, maximum potential thesis, and differential boost theory are used
to justify the use of test accommodations for students with disabilities as (i) test scores of
students with disabilities are improved relative to the score they would receive under
standard administrative conditions, (ii) typically developing students’ test scores will not
improve if they take the test using the same test accommodations, and (iii) the student
factor (students with disabilities versus typically developing peers) interacts with the
administration condition (standard or accommodated administration).
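As an illustration of how this interaction is commonly tested, the minimal sketch below fits a two-way model with a disability-status by testing-condition interaction term. The data frame, group labels, and scores are entirely hypothetical, and the statsmodels-based approach shown is only one of several ways such an analysis could be run; it is not the analysis used by any study cited here.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical scores for a 2 x 2 design: disability status (SWD vs. TD)
# crossed with testing condition (standard vs. accommodated)
df = pd.DataFrame({
    "score": [52, 55, 49, 61, 70, 68, 72, 74,      # students with disabilities
              80, 82, 81, 84, 83, 85, 86, 88],     # typically developing peers
    "group": ["SWD"] * 8 + ["TD"] * 8,
    "condition": (["standard"] * 4 + ["accommodated"] * 4) * 2,
})

model = smf.ols("score ~ C(group) * C(condition)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)  # the C(group):C(condition) row tests the interaction
```

In this invented data set the accommodated-minus-standard gain is larger for the students with disabilities than for their typically developing peers, which is the pattern differential boost predicts; a statistically significant interaction term would be the corresponding formal test.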
In 2000, Zuriff examined five studies that utilized the maximum potential thesis
in their design, testing the interaction between assessment condition and disability status.
These studies investigated the use of extra examination time for college students with
learning disabilities versus their typically developing peers. All studies cited used a
common measure, the Nelson-Denny Reading Test, considered reliable, related to
scholastic achievement, and normed through the fourth year of college. The author found
support, albeit very weak empirical support, for the maximum potential thesis.
Contradictory evidence for the maximum potential thesis came from typically developing
students seeing test score gains, albeit not as large as students with disabilities, in
untimed assessment conditions. Zuriff recommended examining individual differences
under timed and untimed conditions for all students participating in research studies
looking at the maximum potential thesis, as this would allow for a better understanding of
patterns in the data that is not afforded when only using group means.
Sireci et al. (2003) reviewed 150 studies concerned with the effects of test
accommodations, critiquing all studies in light of the “interaction hypothesis [such] that
test accommodations should improve the test scores for targeted groups, but should not
improve the scores of examinees for whom the accommodations are not intended” (p. 2).
Of the 150 research studies, 46 examined the effects of test accommodations for students
with disabilities and English language learners. Of the 46 studies, only 38 studies
empirically looked at data from accommodated tests with 21 using an experimental
design: 12 for students with disabilities and 8 for English language learners. Less than
one-half of the research studies examined were found in peer-reviewed journals. The
authors’ critique was structured using three primary criteria: (i) group that was to be
helped by the assessment accommodation, that is students with disabilities or English
language learners, (ii) type of accommodation examined; for example, presentation
accommodations, timing/scheduling accommodations, and response accommodations,
and (iii) type of research design, that is literature reviews, experimental studies, and non-
experimental studies. The 38 studies reviewed spanned several subject areas and multiple
grades. At the time of publication, 26 studies relating to assessment accommodations had
been critically reviewed.
Sireci et al. (2003) concluded that the vast majority of studies showed
improvements for all students taking accommodated tests, with the “accommodation of
extended time improv[ing] the performance of students with disabilities more than it
improved the performance of students without disabilities” (p. 2). They noted that “there
are no unequivocal conclusions that can be drawn regarding the effects, in general, of
accommodations on students’ test performance” (p. 48). Sireci et al. felt that the
interaction hypothesis as typically stated was on “shaky ground” (p. 48) and proposed a
revision to the hypothesis, namely differential boost (Fuchs & Fuchs, 1999). Differential
boost allows that typically developing students may benefit from assessment
accommodations, though not to the same extent as their peers with disabilities. With
respect to extended time, Sireci et al. (2003) found “gains for students without
disabilities, although the gains for students with disabilities were significantly greater” (p.
63). Research exploring the use of oral presentation accommodations was unclear, with
half of the studies finding positive effects, while the remaining studies saw either no
effects or similar effects for students with disabilities and their typically developing
peers.
Issues with the studies reviewed included the heterogeneous nature of both the
students (large within-group diversity) and the assessment accommodations, and diversity
in the creation and implementation of accommodations. Although students with
disabilities were heterogeneous with respect to type of disability, they were generally
ethnically homogeneous groups of students, thus results from the studies under
consideration cannot be generalized to minority students. As well, much of the research
was undertaken in Los Angeles, California, making generalizability to other locales
contentious. Additionally, virtually all of the research was conducted on elementary
school students, making generalization to other levels impossible. Further, effect sizes
were not reported in most studies. While effect sizes could be estimated for some of the
studies, this was not possible for all studies under review.
Sireci et al. (2005), in a later secondary study of the test accommodation
interaction hypothesis were, again, seeking empirical support for the interaction
hypothesis, whereby “…test accommodations lead to improved test scores for students
with disabilities relative to their non-disabled peers” (p. 459). The authors reviewed
several recent empirical studies that focused on the effects of accommodations on test
performance, particularly the test performance of students with disabilities. Of the studies
they reviewed, they selected 28 and categorized them based on the type of test
accommodation (extended time, oral [read-aloud] presentation, or multiple
accommodations) and research design (experimental, quasi-experimental, and non-
experimental using an ex post facto comparison of students taking a standard version of
the test and students taking an accommodated version of the same test).
Of the studies they reviewed, Sireci et al. (2005) found that the most common
accommodations examined were oral administration, at 39%, and extra time, at 24%.
Studies investigating oral administration were often accompanied by extra time as a
second accommodation, thus making it difficult, if not impossible, to decouple the effects
of the accommodations. As well, a variety of different accommodations was analyzed
within a single study for some of the studies being reviewed. Most of the studies focused
on students in third through eighth grades taking tests in mathematics, reading, and
science.
For research relating to extended time, Sireci et al. (2005) found that five of eight
studies provided qualified support for the interaction hypothesis. For the most part, the
results indicated that students with disabilities exhibit greater score gains than typically
developing peers. However, results from two of the eight studies did not display any
gains. Five of the ten studies concentrating on oral accommodations provided partial
support for the interaction hypothesis. The research literature substantiated findings that a
more valid interpretation of mathematics achievement was possible when students with
disabilities received oral (e.g., read-aloud) accommodations. This could not be said for
other subject areas. For studies relating to multiple accommodations, all seven of the
studies reviewed provided support, at some level, for the interaction hypothesis. Four of
the seven studies using experimental designs also demonstrated results that were
consistent with the interaction hypothesis.
While two fairly consistent findings were discussed, those of extended time
tending to improve the performance of all students, albeit students with disabilities
showing the greatest gains, supporting a differential boost interpretation, and oral
accommodations on mathematics tests improving performance for some students with
disabilities, consistent conclusions could not be drawn across the studies. With the wide
variety of accommodations, the differences between accommodation implementation, and
the heterogeneity of students receiving accommodations, heterogeneity being found even
within the students with disabilities groups, it was not surprising that there was a lack of
consistent inferences.
Sireci et al. (2005) concluded that the vast majority of research explored showed
that all student groups had test score gains under accommodated conditions, with students
with disabilities displaying the largest test score gains. As with the Sireci et al. (2003)
research review, the authors felt that qualification of the interaction hypothesis, with
greater gains experienced by students with disabilities implying that the standardized
testing conditions are too stringent for all students and not that the test accommodations
are unfair, better explained their findings, particularly their findings regarding the use of
extended time. Additionally, their findings were consistent with the concept of
differential boost put forth by Fuchs and Fuchs (1999), whereby "an accommodation …
increases the performance of students with disabilities more than it increases the scores of
students without disabilities” (p. 24). Further, Sireci et al. (2003) concluded (i) most
educational tests are speeded, (ii) oral accommodations on math tests produce gains for
students with disabilities, however, the same cannot be said for tests in other content
areas, and (iii) students with disabilities need extra time to demonstrate their true
knowledge, skills, and abilities.
Sireci et al. (2005) noted several issues with the studies they reviewed. These
issues included the use of small, ethnically homogeneous groups of students with
disabilities, whose results could not be generalized to minority students with disabilities,
and the fact that almost all the studies focused on elementary grades. They noted that only one of the
experimental studies looked at test accommodations for secondary school students. They
believed this was a tremendous issue, as there are a growing number of states
implementing high school graduation examinations. The growing number of graduation
examinations, coupled with a dearth of information on the potential usefulness of
assessment accommodations and/or the interaction effect of accommodations on such
examinations for this group, was seen as a major limitation.
Issues with this review that could not be controlled for were the great diversity (i)
within the students with disabilities group, (ii) in the way the test accommodations were
created, and (iii) in the way the test accommodations were implemented. Such diversity
makes it very difficult to make unequivocal statements about the research findings.
Gaps in the literature.
Concerns that students with disabilities are tested fairly when examinations are
used for promotion and high-stakes decisions abound and are discussed in non-academic
and academic circles alike, with discussion on this topic commonly found in mainstream
newspapers such as the New York Times.
[Q]uestion[s] of how far to accommodate students with learning disabilities on
college entrance tests like the SAT has become a familiar one [in mainstream
society], as requests for special accommodations proliferate, especially from
affluent white families (Lewin, 2002).
Information that had been the sole purview of educational policymakers and researchers
is becoming part of the mainstream ethos. Delineation of educational legal issues,
particularly those relating to issues of equity and access, has become commonplace in
the news. Articles with information such as the following have become part of the
mainstream lexicon:
Judge Charles R. Breyer of Federal District Court [of California] ruled that
students with learning disabilities had the right to special treatment, through
different assessment methods or accommodations like the use of a calculator
or the chance to have test questions read aloud (Lewin, 2002).
With such judgments coming to the fore, it is imperative we become better able to make
sound decisions based on strong evidence.
With existing educational legislation regarding students with disabilities and
assessment accommodations, states are tasked with creating and implementing
assessment accommodations. However, there is an “… amazing lack of agreement across
states in how to go about making participation and accommodation decisions, and which
accommodations are acceptable” (Thurlow et al., 2000, p. 162). Many researchers have
noted that states continue to make changes to their assessment accommodations policies
despite the lack of a solid research base on accommodations (Bolt & Thurlow, 2004;
Sireci et al., 2003; Sireci et al., 2005; Thompson et al., 2002; Thurlow & Bolt, 2001).
Secondary studies point to a lack of definitive findings, providing suggestions on how
this might be remedied (Bolt & Thurlow, 2004; Johnstone et al., 2006; Thompson et al.,
2002; Tindal & Fuchs, 2000; Tindal & Ketterlin-Geller, 2004; Zenisky & Sireci, 2007).
Educators and policy-makers need more information regarding the effectiveness of
testing accommodations for students with disabilities and whether they remove or reduce
presentation, response, setting, and timing/scheduling barriers in assessment. It has also
been noted that much of the research does not directly address the use of
accommodations that are frequently allowed under state policy (Bolt & Thurlow, 2004;
Tindal & Fuchs, 1999).
There appears to be a lack of experimental research and empirical evidence when
it comes to understanding which assessment accommodations are efficacious.
Researchers and those examining the existing literature have noticed that very few studies
examining assessment accommodations use experimental designs (Bolt & Thurlow,
2004; Tindal & Fuchs, 1999). Ysseldyke et al. (1998) noted
… research on accommodations needs to be experimental in nature, and designed
to address the perception that the use of accommodations may invalidate a test.
Experimental research goes beyond simply examining the performance of
students who use accommodations and comparing it to the performance of
students who do not use accommodations by providing appropriate controls (p.
31).
Additionally, several researchers indicated that the empirical research base regarding the
effects of specific testing accommodations is very limited (Bolt & Thurlow, 2004; Fuchs
et al, 2000a). Such research helps us answer questions about which accommodations
would be beneficial for specific groups of students with disabilities, and for which
situations these accommodations would be the most beneficial, thus providing more
accurate assessments of students with disabilities. As Ysseldyke et al. (1998) noted
[s]pecific issues arise for each disability type, or combination of disabilities, and
for each specific accommodation [with] considerably more rhetoric and opinion
than sound empirical evidence about the validity of specific accommodations. The
knowledge base about the effects of accommodations is not adequate to address
many practical, everyday questions, nor is it in a form that is readily accessible to
or easily understood by personnel in states and districts (p. 21).
The existing research on assessment accommodations is spotty, with some types
of accommodations being glossed over and some groups of students with disabilities
being skipped over. Chiu and Pearson (1999) noted a dearth of research in the areas of
accommodations such as “assistive devices, combinations of accommodations,
presentation formats, response formats, setting of tests, and radical accommodations” (p.
33), with students with learning disabilities receiving the most attention in the research literature.
While this has slowly been changing, with studies looking at a larger variety of
accommodations and students with disabilities, syntheses of the literature in this area
have only considered three- to four-year slices of research work. As educational research
can be very cyclical in nature, with different studies occurring in the same time frame
overlapping in areas examined, trends for the different types of assessment
accommodations and for different groupings of students with disabilities may be hidden.
The existing research in the area of assessment accommodations for students with
disabilities is far from conclusive. Much of the research in this area, at best, remains
equivocal and open for debate. There is very little agreement on which accommodations,
or combinations of accommodations, allow students with disabilities to demonstrate what
they know without providing an unfair advantage for these students. As has long been
recognized in research syntheses and secondary studies, research on assessment
accommodations provides ambiguous information, as these syntheses highlight the
contradictory findings for the research reviewed (Johnstone et al., 2006; Sireci et al., 2003). As well,
“variations across operational definitions, tests, populations, settings, and contexts still
curb all but the most general policy implications” (Zenisky & Sireci, 2007), such that “…
more empirical study is warranted to further investigate the effects of testing
accommodations for students with disabilities” (Bolt & Thurlow, 2004, p. 151).
As noted in 1999 by Chiu and Pearson, there has been enough research in the field
of assessment accommodations and students with disabilities to make meta-analysis
useful. Although much primary and secondary research on students with disabilities and
testing accommodations has been conducted, there have been no meta-analyses of
students with disabilities across all categories of assessment accommodations conducted
since Chiu and Pearson’s research in 1999. In the intervening years, well over 100
primary studies have been conducted. With the capacity to examine the convergence
across studies objectively and systematically, and the use of a common metric, meta-
analysis has the potential to fill in the gaps in the assessment accommodation literature,
providing more definitive empirical answers to the hypotheses posed by research in this
area. Zenisky and Sireci (2007) found that
[g]reat diversity exists both with respect to the individuals requiring assessment
accommodations and the range of accommodations available [and that] such
diversity does not easily lend itself to consensus on policy for valid testing
practice. The completion of more well-constructed meta-analyses of specific
accommodations is one strategy that researchers should consider, in addition to
further empirical study of specific accommodations with different—both
heterogeneous and homogeneous—student populations (p. 17).
As well, Sireci and Pitoniak (2007) believe that meta-analysis, potentially based on state
practices, would be useful at this point in time. While not overcoming all of the pitfalls of
existing primary research in this area, using meta-analysis to aggregate and quantitatively
analyze existing research will provide a more rigorous examination of the data collected
to date. With the addition of meta-regression, providing a statistical means to delve
deeper into possible explanations for variance, together with effect size findings provided
through a meta-analysis of existing research studies, it is hoped that this research will fill
some of the gaps discussed by those in the field.
Meta-regression.
Meta-regression extends regression analyses by examining multiple studies to
model, estimate, and explain the variation among reported empirical results (Stanley,
2001). Meta-regression is used when heterogeneity in effect sizes is found or is believed
to exist and “… aims to relate the size of the effect to one or more characteristics of the
study involved” (Thompson & Higgins, 2002, p. 1559). Increasingly, “[m]eta-regression
has become a commonly used tool for investigating whether study characteristics may
explain heterogeneity of results among studies in a systematic review” (Higgins &
Thompson, 2004, p. 1663).
There are a variety of meta-regression approaches. The regression model used
may be linear or logistic with a single study as the observation or unit of analysis. In a
simulation study comparing and contrasting meta-regression approaches which model
heterogeneity, Morton, Adams, Suttorp, and Shekelle (2004) identified four meta-
regression approaches: fixed-effects utilizing logistic regression, random-effects meta-
regression, control rate meta-regression, and Bayesian hierarchical modeling. Further,
Morton et al. identified and evaluated five meta-regression methods: fixed-effects with
and without moderators; random-effects with and without moderators; and control rate
meta-regression. They used the results of their simulation to provide meta-regression
practitioners with a set of guidelines. Specifically, Morton et al. noted that results can be
biased if important moderators are not incorporated at the person or study level; that
moderators that are aggregates of person-level rather than study-level characteristics can
produce biased results; that the control rate (in health and medical studies) needs to be
incorporated if it affects treatment; and that bias can be reduced, with proper modeling, by
using a larger number of studies and a larger number of subjects.
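As a concrete, simplified illustration of one of these approaches, the sketch below fits a random-effects meta-regression "by hand": a method-of-moments (DerSimonian-Laird-style) estimate of the between-study variance is added to each study's sampling variance, and the effect sizes are then regressed on a single moderator via weighted least squares. The effect sizes, variances, and moderator coding are hypothetical, and dedicated meta-analysis software would normally be preferred in practice.

```python
import numpy as np

# Hypothetical study-level data: effect sizes, within-study variances, and a
# binary moderator (e.g., 1 = extended-time accommodation, 0 = other)
d   = np.array([0.10, 0.25, 0.40, 0.05, 0.55, 0.30])
v   = np.array([0.02, 0.03, 0.04, 0.02, 0.05, 0.03])
mod = np.array([0, 0, 1, 0, 1, 1])

# Method-of-moments estimate of between-study variance (tau^2)
# from the intercept-only fixed-effect model
w = 1.0 / v
d_bar = np.sum(w * d) / np.sum(w)
Q = np.sum(w * (d - d_bar) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(d) - 1)) / c)

# Random-effects meta-regression: weighted least squares with weights 1 / (v_i + tau^2)
X = np.column_stack([np.ones_like(d), mod])
W = np.diag(1.0 / (v + tau2))
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)
se = np.sqrt(np.diag(np.linalg.inv(X.T @ W @ X)))

print(f"tau^2 = {tau2:.3f}")
print(f"intercept = {beta[0]:.3f} (SE {se[0]:.3f}); moderator slope = {beta[1]:.3f} (SE {se[1]:.3f})")
```

The moderator slope in such a model describes how much the mean effect size differs for studies with the coded characteristic, which is the sense in which meta-regression relates effect size to study characteristics.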
There are several statistical issues with meta-regression. These include, but are
not limited to, a small number of degrees of freedom in research that reviews a small
number of studies and the use of highly collinear moderators. While there are several
issues with this technique and many researchers call for more study of meta-regression
(Higgins & Thompson, 2004; Stanley, 2001; Thompson & Higgins, 2002) it has the
potential to explain differences between studies and can aid in understanding the causes
of heterogeneity, a truly handy instrument in the meta-analyst’s tool box.
Delimitations
Delimitations for this study relate to both the unit of analysis and the analytic
techniques proposed.
In standardized, and other, assessments we need an accurate and adequate
measure of student knowledge. This means we must endeavor to minimize construct-
irrelevant variance, as well as provide methods to increase access to these assessments for
students with disabilities. One of the goals for standardized assessment is to ensure, in
part by providing empirical evidence, that test scores for all students are valid and
comparable, regardless of population subgroup. As such, this study will be limited by the
adequacy of the assessments used in the primary research studies under examination.
Sireci et al. (2003) have noted several limitations of the extant research.
Limitations included a focus on "relatively small, and ethnically homogenous groups of
students” (p. 65), with “… most of the studies focused on elementary school grades…”
(p. 66), and “… virtually no experimental studies involved secondary students…” (p. 66).
It is hoped that expanding the bandwidth of the studies to include primary studies for a
longer time period, mid-1999 through mid-2011, will help circumvent these particular
limitations.
Research design limitations for primary research in this area include poor and
inconsistent classification of students with disabilities and their typically developing
peers, absent or poor control groups, insufficient time for accommodations that require
additional materials, and validity concerns due to a poor match between test content and
curriculum.
Another potential limitation for this research relates to one of the subgroups of
students with disabilities; students with learning disabilities. Students with learning
disabilities comprise almost one-half of the population of students with disabilities
(Tindal & Fuchs, 2000) and are a heterogeneous group. This makes logical analysis of
assessment accommodations difficult. As well, it is difficult to conduct studies in the area
of test accommodations as it is difficult to
find... and recruit… sufficient numbers of students with disabilities and students
without disabilities to participate in studies involving taking tests, particularly if
the design requires them to take a test twice; under standard and accommodated
conditions. The small numbers of students with disabilities in specific disability
categories make it particularly hard to find sufficient numbers of different types
of students with disabilities who are prepared to take a test in a specific subject
area in a specific grade level (Scarpati, 2003 cited in Sireci et al., 2005, p. 487).
Several primary studies examining extended time used speeded tests; thus all
students would be expected to show test score gains when given extra time. This makes
results from these studies equivocal and a potential limitation for the present research.
A major limitation that cannot be overcome concerns the lack of reporting of
appropriate statistics (i.e., at a minimum, means, standard deviations, and numbers of
participants); thus, studies that do not contain usable statistics cannot be included in the
analyses. As well,
[m]ost of the studies that focused on multiple accommodations were ex
post facto studies that analyzed data from a large-scale assessment and
broke out accommodated test administrations from non-accommodated
administrations… [which] … typically do not use an experimental design…
(Sireci et al., 2005, p. 475).
Additionally, incomplete reporting resulted in low statistical power and questionable
findings for some of the primary studies being considered for the meta-analysis. Due to
the preceding issues, there is the potential to lose a great number of research studies
during the coding phase of this research.
Research is only useful insofar as we can generalize the findings from research on
assessment accommodations to students in classrooms (Tindal & Fuchs, 2000). When
coding the primary studies, sampling procedures must be examined to ensure students
were sampled appropriately. Primary studies that do not conform to appropriate sampling
procedures will not be included in the meta-analysis.
It must be noted that using meta-analysis does not allow us to examine
measurement comparability; i.e., to see if internal characteristics are the same for
accommodated and unaccommodated tests. This limitation cannot be avoided with meta-
analytic techniques.
Definitions
A number of definitions specific to this study apply. Terms relating to students
with disabilities and legislation regarding students with disabilities, assessments and
accommodations, assessment of students with disabilities, organizations involved with
students with disabilities and research regarding students with disabilities, as well as
meta-analytic techniques are defined in the following section.
Terms specific to students with disabilities include the definition of student with
disabilities, Individualized Education Plan, Least Restrictive Environment, and Free and
Appropriate Public Education.
The thirteen legislative special education categories used to identify students with
disabilities, delineated in IDEA (2004), are
mental retardation, hearing impairments (including deafness), speech or
language impairments, visual impairments (including blindness), serious
emotional disturbance (referred to in this title as ‘emotional disturbance’),
orthopedic impairments, autism, traumatic brain injury, other health
impairments, or specific learning disabilities (Part A (SEC. 602) (3) (A) (i),
118 STAT.2652, 2004).
Individualized Education Plans (IEPs) are used to define an appropriate
education, guide delivery of educational services and frame methods for evaluating
outcomes for students with disabilities. IEPs “… must include a statement of the
student’s current levels of educational performance and a statement of measureable
annual goals, including short-term objectives or benchmarks” (Tindal & Fuchs, 2000, p.
10).
The least restrictive environment (LRE) allows that,
[t]o the maximum extent appropriate, children with disabilities... be educated
with children who are not disabled, and... special classes, separate schooling,
or other removal of children with disabilities from the regular educational
environment should occur only when the nature or severity of the disability
is such that education in regular classes with the use of supplementary aids
and services cannot be achieved satisfactorily (Federal Register, 1999, (20
U.S.C. 1412(a)(5)(B))).
Section 504 of the Rehabilitation Act of 1973 defines a free and appropriate
public education (FAPE) as school district provision of a “‘free appropriate public
education’ … to each qualified person with a disability who is in the school district’s
jurisdiction, regardless of the nature or severity of the person’s disability” (U.S.
Department of Education, 2007, p. 1).
Terms specific to assessment and accommodation include the definition of
test/assessment accommodation, high-stakes assessments, statewide assessment
programs, partial participation, out-of-level testing, combination participation,
assessment modification, and alternate assessments.
Test accommodation, or assessment accommodation, refers to an accommodation
providing support for students with disabilities through adjustments to the assessment
presentation, setting, timing or scheduling, or response; the accommodation chosen is
generally dependent on the disability involved. Accommodations should not provide any
advantage to individuals taking the test in question.
High-stakes assessments generally refer to assessment results tied to important
decisions which may significantly impact the lives of students and educational
professionals (Reschly, 1993). Statewide assessment programs, as part of the
accountability structure for states since NCLB (2001), are considered to be high-stakes
assessments.
Partial participation in assessment programs occurs when students take certain
parts of the assessment, but are not required to take the entire assessment.
Out-of-level testing occurs when students take assessments designated for
students in lower grades.
Combination participation occurs when students take different parts of different
assessments from an entire assessment program. For example, students might take certain
parts of state reading, writing, mathematics, and science assessments.
Test modification, assessment modification, or non-standard accommodations
involve student use of modifications or accommodations that change the construct being
measured, thus test scores for these students are considered invalid and student
participation is not included in aggregated results for the assessment.
Alternate assessments are normally designed for a specific subgroup of students.
These assessments are most frequently used to assess students having significant
cognitive disabilities who would otherwise not be able to access the assessment, even
with accommodations.
Terms specific to assessment of students with disabilities include access to
assessment programs, inclusion in education, participation in assessment programs, and
unwarranted exclusion.
Access to assessment programs (for example, state assessment programs) refers to
the ability of all students to have an equal opportunity, or the right, to participate in the
assessment program in order to demonstrate their abilities in the area(s) being measured
and to receive the benefits provided by that demonstration (e.g., graduation from high
school). It is expected that all students have access to assessment programs regardless of
their social class, ethnicity, background, or physical disabilities. Access to assessment
programs for students with disabilities often requires bridging technologies such as
accommodations, modifications, or alternate assessments; such access “deals specifically
with removing barriers for student” (Baker, 2008, p. 24) and allows students with
disabilities a way to demonstrate their skills and abilities.
Inclusion in education refers to the education of students with disabilities in the
regular classroom for all, or nearly all, of the school day. Inclusion models do not allow
for the education of students with disabilities in a separate school or classroom. Inclusion
in assessment programs (for example, state assessment programs) refers to including
students with disabilities in the assessment experience. Unlike access, where students
have the right to participate and are provided with the tools to participate, inclusion
simply refers to being included in the process or program, including assessment
programs.
Participation in assessment programs, such as statewide assessment programs,
refers to students with disabilities taking part in the assessment process and having their
results included in any reports generated from the assessment efforts; i.e., district
accountability reports used as part of the AYP requirements for the federal government.
Participation differs from access, as it is not mandated by law. Participation differs from
inclusion in that, although students with disabilities may be included in programs, they
may not be able to participate in the program and/or their results may not be included in
the reports generated from the assessment program.
Unwarranted exclusion refers to the
… directed or arranged non-participation in state or national assessment
programs involving students for whom the assessment is appropriate to
curriculum goals pursued in their educational programs and the receptive
or expressive language demands of the assessment tasks are within the
student’s behavioral repertoire (Reschly, 1993, p. 46).
Organizations involved with students with disabilities, in legislative and/or
research capacities include the National Center on Educational Outcomes, Council of
Chief State School Officers, Council for Exceptional Children, and National Association
of Directors of Special Education.
The National Center on Educational Outcomes (NCEO), founded in 1990, is
tasked with working with federal and state agencies to assess educational results for
students with disabilities (Elliot et al., 2000). This mandate includes investigation of
access to, inclusion in, and participation in state and federal assessment programs for
students with disabilities, as well as their participation in accountability systems. NCEO
has been tracking and analyzing state policies on assessment participation and
accommodations since 1992.
The Council of Chief State School Officers (CCSSO) is a nonpartisan,
nationwide, nonprofit organization. This council consists of heads of departments of
elementary and secondary education in the states, the District of Columbia, the
Department of Defense Education Activity, and five U.S. extra-state jurisdictions.
CCSSO’s mandate is to provide “leadership, advocacy, and technical assistance on major
educational issues” (http://www.ccsso.org, retrieved May 23, 2009). The Council
provides information on major educational issues to civic and professional organizations,
federal agencies, Congress, and the general public.
A major organization,
[t]he Council for Exceptional Children (CEC) is the largest international
professional organization dedicated to improving the educational success of
individuals with disabilities and/or gifts and talents. CEC advocates for
appropriate governmental policies, sets professional standards, provides
professional development, advocates for individuals with exceptionalities, and
helps professionals obtain conditions and resources necessary for effective
professional practice (http://www.cec.sped.org, retrieved May 23, 2009).
The National Association of Directors of Special Education (NASDSE), founded
in the late 1930s, provides services to state agencies assisting in their efforts to improve
educational outcomes for students with disabilities. NASDSE provides leadership
throughout the United States, the federal territories and the Freely Associated States of
Palau, Micronesia and the Marshall Islands. The association believes
[a]ligning policies and practices to improve educational outcomes for [students
with disabilities] is critical [to] ensure full participation [of students with disabilities]
in their education and transition to post-school employment
(http://www.nasdse.org/AboutNASDSE/LetterFromOurPresident/tabid/404/Defau
lt.aspx, retrieved May 23, 2009).
Terms specific to meta-analytic techniques include mean effect, mean relative
effect, Q-statistic, fixed-effects, random-effects, sensitivity analysis, and publication bias.
The mean effect, computed by weighting each effect size by the inverse of its
variance (i.e., each effect size is multiplied by its weight before averaging), is used to find
the central tendency for the aggregate of the effect sizes computed in the meta-analysis.
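For illustration only, the following minimal sketch (written in Python, using hypothetical values that are not drawn from any of the studies analyzed here) shows how an inverse-variance weighted mean effect and its standard error might be computed:

# Minimal sketch: inverse-variance weighted mean effect size (hypothetical values).
effect_sizes = [0.30, 0.45, 0.12, 0.25]      # g_i, one effect size per study
variances    = [0.020, 0.035, 0.015, 0.050]  # v_i, the variance of each effect size

weights = [1.0 / v for v in variances]       # w_i = 1 / v_i
mean_effect = sum(w * g for w, g in zip(weights, effect_sizes)) / sum(weights)
se_mean = (1.0 / sum(weights)) ** 0.5        # standard error of the weighted mean
print(round(mean_effect, 3), round(se_mean, 3))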
The mean relative effect, as it applies to research on students with disabilities and
general education populations, is (i) the difference between the mean effect on students
with disabilities (target population) and the mean effect on the general education
population or (ii) the difference between the mean effect on students with disabilities in a
non-accommodated assessment condition and the mean effect on students with
disabilities in an accommodated assessment condition.
The Q-statistic is “…a measure of weighted squared deviations…” (Borenstein,
Hedges, Higgins, & Rothstein, 2009, p. 105) and is used to assess heterogeneity in effect
size estimates; i.e., the variability in true effect sizes. The Q-statistic helps determine
whether the effect size is consistent across studies. If it is consistent, we are able to focus
on the summary effect size statistic; if not, we must focus on the dispersion of effect sizes.
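As a hedged illustration of how the Q-statistic operates (again in Python, with hypothetical values rather than data from the present synthesis), the weighted squared deviations around the weighted mean effect are summed and referred to a chi-square distribution with k - 1 degrees of freedom; the related I-squared index expresses the proportion of observed variation attributable to heterogeneity:

# Minimal sketch: Q-statistic and I-squared for heterogeneity (hypothetical values).
effect_sizes = [0.30, 0.45, 0.12, 0.25]
variances    = [0.020, 0.035, 0.015, 0.050]

weights = [1.0 / v for v in variances]
mean_effect = sum(w * g for w, g in zip(weights, effect_sizes)) / sum(weights)

q = sum(w * (g - mean_effect) ** 2 for w, g in zip(weights, effect_sizes))
df = len(effect_sizes) - 1                  # Q is referred to a chi-square with k - 1 df
i_squared = max(0.0, (q - df) / q) * 100    # percentage of variation due to heterogeneity
print(round(q, 2), df, round(i_squared, 1))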
The fixed-effects model is one of the two statistical models used in meta-analyses.
Under the fixed-effects model, one true effect size is assumed to underlie all studies in
the meta-analysis.
The random-effects model, the second of the two statistical models used in meta-
analyses, allows for the possibility of different effect sizes underlying the studies
included in the meta-analysis. That is, if we were able to select a random sample of
primary studies from the infinite number of studies available, the true effect sizes would
be distributed about a mean.
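To make the distinction concrete, the following sketch (Python, hypothetical values, and assuming the commonly used DerSimonian-Laird estimator rather than any particular procedure employed in this study) estimates the between-study variance, tau-squared, and re-weights each effect size to obtain a random-effects mean:

# Minimal sketch: DerSimonian-Laird tau^2 and a random-effects mean (hypothetical values).
effect_sizes = [0.30, 0.45, 0.12, 0.25]
variances    = [0.020, 0.035, 0.015, 0.050]

w = [1.0 / v for v in variances]
fixed_mean = sum(wi * g for wi, g in zip(w, effect_sizes)) / sum(w)
q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effect_sizes))
df = len(effect_sizes) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau_sq = max(0.0, (q - df) / c)                    # estimated between-study variance

w_star = [1.0 / (v + tau_sq) for v in variances]   # random-effects weights
random_mean = sum(wi * g for wi, g in zip(w_star, effect_sizes)) / sum(w_star)
print(round(tau_sq, 4), round(random_mean, 3))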
In meta-analytic studies, a sensitivity analysis focuses on “the extent to which the
results are (or are not) robust to assumptions and decisions that were made when carrying
out the synthesis” (Borenstein et al., 2009, p. 368).
Publication bias refers to the likelihood that certain types of research, specifically
research that did not find statistically significant results, are not included in a meta-analysis.
When meta-analyses do not include unpublished research, an upward bias in effect
size summary statistics is likely. Methods to examine publication bias include
funnel plots, Rosenthal’s Fail-safe N, Orwin’s Fail-safe N, and Duval and Tweedie’s
Trim and Fill.
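As one illustration of these methods, Orwin’s Fail-safe N can be sketched in a few lines (Python, hypothetical values; the criterion effect size is an analyst’s choice, not a value taken from this study). It estimates how many unlocated studies averaging a null effect would be needed to pull the observed mean effect down to a level judged trivial:

# Minimal sketch: Orwin's Fail-safe N (hypothetical values).
k = 34              # number of located studies
mean_effect = 0.30  # observed mean effect size
criterion = 0.10    # effect size considered trivially small (analyst's choice)

fail_safe_n = k * (mean_effect - criterion) / criterion  # assumes missing studies average zero effect
print(round(fail_safe_n, 1))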
Summary
The purpose of the study was to: (a) determine whether there is empirical support
for effects of testing accommodations, (b) provide an estimate of the mean effect size,
and (c) contribute to the understanding of effective test accommodations for students with
disabilities.
This study aims to add to the existing body of research and research syntheses on
testing accommodations for students with disabilities by extending the original work of
Chiu and Pearson (1999). This research narrowed the focus, from English language
learners and students with disabilities on a variety of different assessments, to students
with disabilities on high-stakes and/or large-scale, paper and pencil assessments only,
focusing on participation on federal, state, and district tests with accommodations for
students with disabilities. Further, meta-regression analyses and graphic representations,
not available to Chiu and Pearson in 1999, provide a unique contribution to research in
this area.
Sireci et al. (2005) stated that our “… challenge is to implement …
accommodations appropriately and identify which accommodations are best for specific
students” (p. 486). This cannot be accomplished solely through the use of primary and
secondary analyses. Synthesis of research, that is meta-analysis, must be employed to
provide more definitive answers to research questions posed in the area of assessment
accommodations and students with disabilities. To that end, this study provides a
quantitative, rather than a qualitative, view of the aggregate research on all researched
testing accommodations for students with disabilities, something that has not been done
since Chiu and Pearson (1999).
Chapter Two
Method
The present research used two different statistical methods, meta-analysis and
meta-regression, to examine research on the efficacy of assessment accommodations for
students with disabilities. These meta-methods allowed the existing research literature to
be scrutinized for overall trends using quantitative methodologies, in order to better
understand findings across the breadth of the research literature in this area.
Purpose of the current study.
The purpose of the current study was threefold. The current study sought to
establish if assessment accommodations provide a more effective assessment of students
with disabilities than no accommodations; estimate the strength of this effect; and add to
the knowledge base pertaining to effective assessment accommodations for students with
disabilities. As such, results from this study were used to summarize previous research,
estimate population parameters, and generalize findings from prior research.
Research Hypotheses
The current study addressed the following hypotheses for the meta-analytic
portion of the research:
Research Hypothesis 1: Is there empirical support for effects of test
accommodations for the target group, students with disabilities, as opposed to their
typically developing peers?
Research Hypothesis 2: As measured by effect size, does each of the following
constitute an effective accommodation for students with disabilities?
o Presentation test accommodations?
o Response test accommodations?
o Setting test accommodations?
o Timing/Scheduling test accommodations?
The current study addressed the following hypothesis for the meta-regression
portion of the current research:
Research Hypothesis 3: Which type of accommodation(s)–Presentation, Response,
Setting, or Timing/Scheduling–more effectively remove construct-irrelevant
variance from target students’ test scores?
Meta-analysis
Meta-analysis, one type of research synthesis, was selected as a method to
integrate research findings from multiple research studies, vis-à-vis assessment
accommodations for students with disabilities. “Research syntheses attempt to integrate
empirical research for the purpose of creating generalizations” (Cooper & Hedges, 1994,
p. 5). Meta-analysis provides a statistical method to integrate information from primary
studies on assessment accommodations for students with disabilities selected for further
scrutiny and analysis, something which could not be accomplished using narrative
syntheses of the research literature; i.e., integrative narrative reviews.
The research design for the present study was based on Cooper and Hedges’
(1994) stages of research synthesis found in their “definitive vade mecum” (p. 7). These
stages include: (i) problem formulation, (ii) data collection/literature search methods, (iii)
data evaluation/coding and evaluating research reports, (iv) analysis and
interpretation/meta-analytic calculations of effect size(s), and (v) public
presentation/meaningful interpretation and effective presentation of the synthesis results.
The problem formulation was addressed via the purpose for this study and the research
hypotheses posed. The purpose and research hypotheses form the basis for the selection
of studies for the meta-analysis. Reports selected for the present meta-analysis were
based upon the following selection and exclusion criteria.
Criteria for selection of studies.
Studies selected had to meet several criteria in order to be considered for the
meta-analysis. Explicit inclusion and exclusion criteria aid in the selection of relevant
studies, as well as limiting researcher bias (Lipsey & Wilson, 2001). General categories
guiding selection criteria were “(a) the distinguishing features of a qualifying study, (b)
the research respondents, (c) key variables, (d) research design, (e) cultural and linguistic
range, (f) time frame, and (g) publication type” (pp. 16 - 17). Although an exhaustive
search of the literature is not required when defining inclusion criteria (White, 1994), it is
recommended that researchers do not use criteria that are too strict as useful reports may
be overlooked (Lam & Kennedy, 2005).
Inclusion criteria were separated into two non-overlapping groups: (i) substantive
domain of inquiry and (ii) methodological characteristics. This allowed for a more
granular look at existing research prior to creating a meaningful common metric across
the studies under consideration.
Studies that did not fully meet both substantive and methodological inclusion
criteria were included in some cases. The rationale for including these studies is provided
in the analyses section. Further, coding was created to explicate inclusion of these
studies.
Substantive inclusion criteria.
Initial substantive inclusion criteria focused on four different areas: (i) types of
students included in the analyses, (ii) type of assessment accommodation used, (iii) type
of assessment under investigation, and (iv) year of publication.
Substantive inclusion criteria were as follows:
(i) Experimental or quasi-experimental studies that quantitatively examined the
effects of assessment accommodations for students with disabilities in the regular
educational system from kindergarten through college. Definition of students with
disabilities followed categories of disability outlined in IDEA (2004) legislation.
(ii) Studies examining assessment accommodations falling under the categories of
presentation, response, setting, and timing/scheduling as defined by Sireci et al. (2003).
(iii) Studies examining large-scale, high-stakes, or commonly-used published
assessments of achievement or college entrance.
(iv) Studies conducted and/or published on or after 1999 through June, 2011. This
was purposefully done in order to ensure that studies included did not overlap with the
previous meta-analysis conducted by Chiu and Pearson (1999).
Substantive characteristics were coded and accounted for in the statistical
analyses conducted.
Demographic variables were also recorded as such variables were seen as a
potential source of covariate and/or mediator information.
Methodological inclusion criteria.
Initial methodological inclusion criteria also guided the selection of studies for the
meta-analysis. Methodological inclusion criteria focused on four different areas: (i)
available data, (ii) examination of single assessment accommodation, (iii) assessment
accommodation validity, and (iv) research examining boost, differential boost, and/or the
interaction hypothesis.
Methodological inclusion criteria were as follows:
(i) Experimental and quasi-experimental studies with statistical data, such as
means and standard deviations or significance test results, necessary to calculate an
estimated effect size of the impact of the testing accommodation under study (an
illustrative sketch of such a calculation follows this list).
(ii) Study designs focusing on the effects of single accommodations as opposed to
effects of accommodation packages, that is, multiple accommodations for individual
students. Note that more than one assessment accommodation may be analyzed in a
single study with results for each accommodation reported separately. However, analysis
needed to focus on one accommodation at a time for inclusion in the meta-analysis.
(iii) Assessment accommodation which did not alter the construct being assessed;
i.e., studies examining assessment accommodations and not assessment modifications
were included in the meta-analysis.
(iv) Research examining boost, differential boost (Fuchs & Fuchs, 1999), and/or
the interaction hypotheses (Sireci et al., 2005) for students with disabilities and/or
typically developing students.
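For illustration only, the following sketch (Python, with hypothetical summary statistics rather than values from any study reviewed here, and assuming an independent-groups design; repeated-measures designs require different variance formulas) shows how such reported statistics can yield a standardized mean difference and its variance:

import math

# Minimal sketch: standardized mean difference (Hedges' g) from reported summary
# statistics (hypothetical values; independent-groups design assumed).
m_acc, sd_acc, n_acc = 52.0, 10.0, 40   # accommodated condition: mean, SD, n
m_std, sd_std, n_std = 47.0, 11.0, 38   # standard condition: mean, SD, n

sd_pooled = math.sqrt(((n_acc - 1) * sd_acc ** 2 + (n_std - 1) * sd_std ** 2)
                      / (n_acc + n_std - 2))
d = (m_acc - m_std) / sd_pooled                  # Cohen's d
j = 1 - 3 / (4 * (n_acc + n_std) - 9)            # small-sample correction factor
g = j * d                                        # Hedges' g
var_g = (n_acc + n_std) / (n_acc * n_std) + g ** 2 / (2 * (n_acc + n_std))
print(round(g, 3), round(var_g, 4))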
Study quality was not explicitly coded. Research by Ahn and Becker (2011)
showed that the use of quality weights in meta-analysis does not add to the analysis nor
does it significantly change results found, thus they recommend against the use of quality
weights. However, for the present meta-analysis, type of publication (i.e., article,
dissertation, report, or conference proceeding) was noted in lieu of study quality.
Methodological characteristics were coded and accounted for in the statistical
analyses conducted.
Categorization of test accommodation research.
Methodological inclusion criteria are intimately linked with the type of
methodological approach used by researchers in this field. Tindal (1998, cited in Bolt &
Thurlow, 2004) categorized primary research on assessment accommodations into three
approaches. A fourth approach, or category, was added by Fuchs et al. in 2000a. The four
approaches are descriptive, comparative, experimental, and individual diagnosis.
The descriptive approach provides a logical analysis of difficulties associated with
disability, conducted to determine which accommodations are considered to be helpful
and allow students with disabilities to demonstrate their knowledge and skills on an
assessment (e.g., surveys of perceived integrity and effectiveness of accommodations).
Such research is generally relevant to policy presentations, policy interpretations, or
implementation analysis.
The comparative approach examines test scores, generally existing test scores, to
see how accommodations affect scores of different groups of students. Research
employing this type of approach helps articulate how accommodations function in an
applied setting. Such research has issues with confounding factors, such as decisions to
provide accommodations and how accommodations are administered, limiting any
conclusions reached. Post hoc comparisons are primary examples of studies employing a
comparative approach.
The experimental approach isolates effects of accommodations by manipulation
of presence and/or absence of accommodations among different groups. This is generally
the preferred approach for research in this area. Examples of research employing the
experimental approach are group experiments and single subject experiments.
The individual diagnostic approach examines the set of procedures used to
determine which accommodations an individual student with disabilities should receive.
“Because accommodated students frequently receive multiple accommodations that are
based on their individual needs, the individual approach seems to exemplify how
accommodations are used in real testing situations” (Bolt & Thurlow, 2004, p. 143), thus,
are more likely to provide information on real-world assessment conditions.
While Bolt and Thurlow (2004) suggest that accommodations should only be
considered valid if they are supported by each one of these four approaches, this meta-analysis
endeavored to provide information based on research guided by experimental
approaches, focusing on research that examined boost, differential boost, or the
interaction hypothesis.
Exclusion criteria.
Studies which were not included in the meta-analyses of testing accommodations
for students with disabilities were excluded based on the following criteria:
(i) Studies did not report means and standard deviations and/or significance test
results. Such research did not provide enough information to create an aggregate metric
for an effect size.
(ii) Studies did not use large-scale assessments, high-stakes assessments,
commonly used/published achievement or college entrance assessments, or proxies for
these types of assessments (e.g., researcher-developed assessments using items from state
assessment item banks). Aggregating multiple types of tests was thought to provide an
apples-to-oranges rather than an apples-to-apples type of comparison.
(iii) Studies looked at assessment accommodation packages. Unless information
from such studies could be disentangled, these studies were excluded from the meta-
analyses.
(iv) Studies examined assessment modifications. Including such studies was
beyond the scope of the present analyses. Further, these studies were thought to cloud
interpretations which could be made as assessment validity would be altered in such a
way that results from the assessment would no longer be comparable to results from a
more standardized type of testing condition.
(v) Studies did not report primary research findings for students; i.e., secondary
studies.
(vi) Studies published before 1999.
(vii) Studies found in multiple sources, such as dissertations, papers, and
publications. For studies located in multiple sources, the study with the most information
which could be coded and/or was thought to be easier to retrieve was selected.
(viii) Qualitative studies.
(ix) Research, not reported in English, or for which English translations were not
available.
Of the 81 studies located, 47 studies were excluded from the meta-analyses. These
studies were excluded from the meta-analyses as the purpose for the research conducted
did not match that of the current study, data did not include information that could be
used to calculate an effect size, some of the data necessary to calculate an effect size were
missing, or the study was eliminated after performing an outlier analysis. Citations and
reasons for the studies’ exclusion may be found in Appendix H. A further eight studies
could not be located (see Appendix I).
Selection criteria were tested and refined by applying these criteria to five
randomly selected studies. One of the studies, Burch (2004), was rejected as the students
used computers to answer test questions. This was not apparent when reviewing the title,
abstract, and research questions for the article. The four articles which were coded were:
(i) Abedi, J., Kao, J. C., Leon, S., Mastergeorge, A. M., Sullivan, L., Herman, J.,
& Pope, R. (2010)
(ii) Helwig, R., Rozek-Tedesco, M. A., & Tindal, G. (2002)
(iii) Kosciolek, S., & Ysseldyke, J. E. (2000)
(iv) Ofiesh, N., Mather, N., & Russell, A. (2005)
Final selection criteria, both substantive and methodological, were integrated into
the Coding Manual (Appendix D), providing a method of labeling all studies reviewed.
This was done to assist in potential future analyses, whereby excluded studies, solely and
in combination with studies selected for the present research, could be analyzed using
similar methods.
Overview of the selection process.
The selection process started with a review of citations found in secondary
studies, located on the NCEO website, that summarized the research on the effects of
test accommodations. Secondary studies included both narratives and
syntheses of the research literature. As well, titles and keywords found through a
comprehensive database search were screened. Additionally, bibliographies from located
studies were examined for research work that might potentially be included. Studies
thought to be of interest were marked for retrieval. Inclusion and exclusion criteria
guided the identification of studies thought to be relevant to the population of studies to
be used in this meta-analysis, with exclusion of studies that did not meet the substantive
and methodological inclusion criteria. While this was a guiding principle, exceptions
were made in certain cases where studies found met some, but not all, of the inclusion
criteria. The rationale for including these studies was provided in the coding database
accompanying each study. Note that a coding form was developed (see Appendix F).
This form was used to structure the coding database, as well as for training an additional
coder for the inter-rater reliability study.
Unpublished reports were also considered for retrieval during the selection
process. It was thought these studies were necessary to provide a methodologically sound
meta-analysis. Glass et al. (1981) noted a reporting bias, whereby research with
significant results, or results with a high surprise factor, is more likely to be published,
while research with non-significant findings, or findings contrary to mainstream theory,
is less likely to be published. As well, for journals where blind review is not conducted
there may be issues of editorial bias (reputation of the author, affiliation of the author, or
novelty of the research affecting editorial selection) and/or reviewer bias (author prestige
or author nationality affecting reviewer judgments).
Several reports and conference proceedings were located. Of these, five reports were
included in the final analyses. Although it was expected that not all of the journal articles
located would be peer-reviewed, this was not the case: all 19 of the journal articles
included in the final analyses were peer-reviewed.
Of concern when identifying potentially relevant studies was the differentiation
between test accommodation and test modification. Studies where test modifications
were used, or where it was not clear whether a test modification or a test accommodation
was used, were removed from the pool of studies used in the meta-analysis.
The screening of potentially relevant research was an iterative process, whereby
selection criteria and guidelines for selection were further refined and clarified. The
initial screening of these articles included examination of the title of the study, study
abstract, and research purpose/questions for the study. As this process was not wholly
reliant on identification of studies through citations provided by the electronic databases
searched, it was expected that fewer studies were missed due to insufficient or misleading
information found in these citations. Moreover, the general rule for inclusion of studies
identified through electronic database citations was to err on the side of over-inclusion
rather than exclusion of prospectively applicable studies. Studies not meeting inclusion
criteria were winnowed from the meta-analysis and were not included in final counts of
studies found.
Search strategy.
The search strategy employed for the meta-analysis was guided by the selection
criteria as well as an extensive search strategy designed to be congruent with the meta-
analytic research hypotheses posed. Hedges (1994) stated that “[t]he sampling procedure
must be designed so as to yield studies that are representative of the intended universe of
studies” (p. 35). While the notion of exhaustive sampling is meant to garner a
representative and sufficient sample of studies of assessment accommodations and
students with disabilities, it must be noted that representativeness of the variability of
studies in the potential universe of studies in the field may not be achieved due to issues
of publication bias, including both editorial and reviewer bias. A combination of Lipsey
and Wilson’s (2001) and White’s (1994) suggestions for finding research reports were
used to identify relevant research. Lipsey and Wilson’s approach utilizes the following
sources:
(a) review articles, (b) references in studies, (c) computerized bibliographic
databases, (d) bibliographic references volumes, (e) relevant journals, (f)
conference programs and proceedings, (g) authors or experts in the areas of
interest, and (h) government agencies (2001, p. 25).
White’s approach includes “(a) footnote chasing or review of bibliographies of selected
articles, (b) consultation, (c) searches in subject indexes, such as electronic database
searches, (d) browsing, and (e) citation searches of electronic databases” (1995, p. 46). It
should be noted that there is overlap between these approaches. It must also be noted that
electronic database searching is more prevalent with the introduction of personal
computing, greater personal computing power, and the push to store as much information
online as possible. As well, many online databases now include search and retrieval
functionality.
Computerized database searches.
Computerized database searches were conducted to find potentially eligible
studies for the meta-analysis. Articles, reports, papers, or dissertations will be referred to
as research studies in this section. As most current online searches yield both the
bibliographic reference and the research study in question it was not generally necessary
to locate the research study once the bibliographic reference was located. Location of
some of the research studies did require a two-step process, whereby the bibliographic
reference was found using one database but the study itself was located in a different
database. For example, a citation for a research study would be found using ERIC but the
copy of the article was available through the PsycINFO database.
Computerized database searches were conducted using natural language and
controlled vocabulary keyword searches (White, 1994). Natural language refers to terms
that “emerge naturally from the vocabularies of authors” (p. 49) while controlled
vocabulary keywords refer to the “terms … added to the bibliographic record by the
employees of A&I services or large research libraries” (p. 50). Generally, controlled
vocabulary keywords are found in a thesaurus produced specifically for the database
being used. Keywords are typically associated with the title, abstract and/or standardized
descriptors for the study in question.
Lipsey and Wilson (2001) recommend using keywords that broadly cover the
domain of interest by
(a) identifying all those standardized descriptors in a given database that may
be associated with the studies of interest and (b) identifying the range of terms
that different researchers might include in their study titles or abstracts that give
a clue that the study might deal with the topic of interest (p. 26).
They further recommend using appropriate Boolean connectors; for example, and, or,
not, to limit or expand the search as necessary. Further, they recommend caution when
trying to narrow the size of the search as many eligible research studies may be missed.
As there is often a fine line between a search which is too expansive and one which is too
restrictive, there was much trial and error in finding the appropriate search terms and
Boolean connectors. Some of the trial and error in creating appropriate search phrases
was reduced through examining the titles and abstracts of research studies which were
identified during the review of the literature.
Based on the recommendations of Lipsey and Wilson (2001) and White (1994) a
list of search criteria, keywords, and connectors was developed. Search criteria included,
but were not limited to, combinations of the following terms: accommodation, test,
standardized assessment, large-scale assessment, high-stakes assessment, and disability.
A complete list of search criteria used for searching databases, databases searched, and
number of eligible studies found is located in Appendix G. It should be noted that once
studies were located, they were reviewed for eligibility as not all studies located were
considered relevant for the purposes of the present meta-analysis.
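For illustration only, a search phrase combining such terms and Boolean connectors might take a form like the following; the actual strings used, and the databases to which they were applied, are those listed in Appendix G:

("test accommodation*" OR "testing accommodation*" OR "assessment accommodation*") AND (disabilit* OR "special education") AND ("large-scale" OR "high-stakes" OR standardized OR statewide)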
While the current meta-analysis does not involve multiple disciplines, it does
involve many different facets of educational research; for example, research on state
assessment programs, validity of assessment accommodations, and policies developed for
effective use of assessment accommodations. As such, multiple divergent databases were
used to locate eligible studies. These databases were Academic Search Complete,
Applied Social Sciences Index and Abstracts (ASSIA), British Periodicals, Dissertations
& Theses @ University of Denver, ERIC, Google Scholar, JSTOR, ProQuest
Dissertations & Theses (PQDT), ProQuest Education Journals, PsycINFO,
PsycARTICLES, and Sociological Abstracts.
An effort to retrieve unpublished studies was made by searching Dissertations &
Theses @ University of Denver and ProQuest Dissertations & Theses (PQDT). As it was
suspected that the number of unpublished studies found was not representative of the
number of unpublished studies in this area, publication bias was explored using
Comprehensive Meta-Analysis V.2.2.050. This program provides funnel plots and
calculations for several types of ‘fail-safe’ numbers, which estimate how many unlocated
studies with a negligible, or zero, effect size would be needed to alter the conclusions.
This was deemed necessary to examine the overall effect of publication bias.
Overview and results of the search process.
A comprehensive search strategy, based on a number of different approaches, was
used to locate eligible research studies for the current meta-analysis. Reference lists
found in syntheses, searches of electronic databases, conference proceedings, web sites,
and hand searches of journals such as the American Educational Research Journal,
Educational Measurement: Issues and Practice, and Educational Researcher were used to
identify likely studies for the meta-analysis. As there is generally a lag between
publication and listing in electronic databases, hand searches of nine journals, focusing
on large-scale assessment, assessment/test accommodations and special education, were
also conducted. As well, in an effort to ensure the most recent studies were included,
papers presented at conferences sponsored by the American Educational Research
Association, the Council for Chief State School Officers Large-Scale Assessment
Conference, National Council on Measurement in Education, and National Association of
School Psychologists in 2010 and 2011 were examined. Further, web sites for
organizations such as NCEO (with a searchable database), Wisconsin Center for
Education Research, Center for Research on Evaluation, Standards, and Student Testing,
College Board, and Behavioral Research and Teaching at the University of Oregon were
explored for prospective research studies. Additionally, secondary studies identified as a
part of the review of the literature provided summaries of the research on testing
accommodations for students with disabilities, supplying useful search terms for types of
accommodations being used, as well as additional direction regarding research findings
vis-à-vis accommodation use. Moreover, research studies needed to be published or
conducted between January 1999 and July 2011.
The initial search was broadened in an effort to locate studies on the interaction
hypothesis (Sireci et al., 2005) and included the terms differential boost (Fuchs & Fuchs,
1999), boost studies, and comparative studies.
Database searches were conducted for substantive and methodological terms. In
addition, using database indices, citations, and abstracts several subject headings which
were of potential interest were identified. A combined search of pertinent substantive and
methodological terms yielded a single meta-analysis by Chiu and Pearson (1999). For
purposes of this meta-analysis, the Chiu and Pearson study was used to frame the
timeline for study eligibility.
Titles, keywords, abstracts, and research questions/hypotheses/purposes for each
research study found were reviewed for inclusion in the meta-analysis. All studies were
reviewed by the primary researcher and were selected for inclusion or exclusion. As well,
eligible research studies were reviewed for prospective keywords for additional database
searches. Furthermore, reference lists for these studies were used to identify additional
studies. Research studies considered ineligible, based on exclusion criteria, were cited
(see Appendix H).
Several attempts were made to locate studies which appeared to meet the
substantive and methodological criteria. Efforts to collect as many unpublished studies as
possible were also made. Several online databases were searched for the missing studies.
When the researcher was unable to find the research studies, online library resources
were used. A total of seven studies were not retrievable. See Appendix I for a complete
listing of citations for irretrievable studies.
Comprehensive searches of online databases yielded 226 studies, not including
duplicates. Eighty studies, comprising 33 research articles, 11 research reports, 34
published dissertations, and 2 papers (i.e., unpublished research studies), were initially
identified as eligible research studies. After reviewing these studies, all 80 research
studies were found to focus on the effects of test accommodations for students with
disabilities and were empirical. These 80 eligible studies were then reviewed (i) for
serious methodological flaws, such as designs that posed threats to external validity or did
not use random assignment when possible (Bangert-Drowns, 1993), (ii) to determine if
there was sufficient statistical information to calculate effect sizes, and (iii) to determine
if they matched the substantive research hypotheses posed by this study. While none of
the studies were considered to have serious methodological flaws, 27 were eliminated as
they did not match the substantive research hypothesis; e.g., the primary study examined
multiple accommodations for individual research participants or did not disaggregate
students with disabilities from English language learners. Results indicated that 44 of the
remaining 53 studies appeared to contain the information necessary to calculate effect
sizes. However, 5 studies did not contain the information necessary (e.g., means and
standard deviations) to calculate effect sizes, and a further 3 research studies were
eliminated as they used a comparative research design. Not included in this total were 20
duplicates, and of these, 10 were duplicates of rejected studies. The work that was easiest
to locate, generally a journal article, was coded, while the duplicate, generally a report or
dissertation, was used to locate and code information that was not included in the primary
work. Based on these analyses, 36 studies were retained for inclusion in the meta-
analysis. It should be noted that when selecting a research study for inclusion, it was
thought that journal articles and dissertations were the most accessible sources of
information, thus they were more likely to be included in analyses than reports or
conference proceedings.
The 36 eligible studies were further evaluated to ensure that explicit information
regarding the nature of the disabilities of the target group and, where necessary,
comparison groups was provided. As well, the research studies were reviewed looking for
unambiguous descriptions of assessment accommodations used in the research and details
regarding implementation of the accommodations.
Coding and classifying study variables.
As part of the meta-analysis, variables identified in the research studies were
coded according to a codebook (Appendix D) used to collect data for the present meta-
analysis. Coding forms were developed based on the codebook. Both the codebook and
coding forms developed were adapted from Lipsey and Wilson (2001), Stock (1994), and
Van Horn, Green, and Martinussen (2009), with coding formulated to allow for statistical
analysis of the eligible research studies. Due to the complexity encountered during initial
coding, a coding manual was also developed. The coding manual contains instructions on
how to enter information on the coding form, study inclusion and rejection rules, and
glossaries for useful keywords (see Appendix G).
Coding was based upon both substantive and methodological concerns (Glass,
McGraw, & Smith, 1981; Stock, 1994). As well, coding information was based on “two
rather different parts” (Lipsey & Wilson, 2001, p. 73): information regarding (i) research
study characteristics and (ii) empirical findings. While some variables used in the
codebook were decided upon a priori; for example, publication type and research study
type, many of the variables were established at a later stage, thus capitalizing on the
iterative nature of the coding process.
Development of a codebook was an iterative process, progressing through the data
collection phase of the study as this researcher became more knowledgeable about the
domain of inquiry and the statistical demands and biases which needed to be addressed in
the meta-analysis. Steps in coding and classifying study variables included the following:
(i) creating the codebook with an initial set of codes (Lipsey & Wilson, 2001; Van Horn
al., 2009); (ii) reading five articles with the initial codebook and revising as new
information came to light; (iii) coding one article during the coder training session and
revising with the aid of the second coder; (iv) coding three more articles with the revised
codebook and revising again; (v) creating coding forms (Appendix F) and a coding manual
(Appendix D) to accompany the codebook (Appendix E); (vi) coding all remaining
studies; (vii) using a second coder to code 15% of the studies using the coding manual,
codebook and coding forms; (viii) calculating inter-rater reliability for completed coding
for a 15% random sample of eligible studies.
The codebook consisted of the following broad categories: report identification,
study retrieval information, study citation, research participant information, assessment
citation and demographic information, research methodology, research design, research
results, and a proxy for quality of study. Each category was defined in terms of the
variables it contained with different levels or options associated with each variable
described in the codebook. For example, report identification contained data regarding
the year of publication, type of publication (dissertation, article, report, paper), and name
of publication. Assessment citation and demographic information contained data related
to the kind of scales used; names of tests or diagnostic systems, reliability, test item
format, and construct or content assessed (see Appendix D). Test item format was
included as
… Koretz and Hamilton (2000) found differences between the performance
of students with disabilities' performance on multiple choice and constructed
response items, [thus] future research should further evaluate potential differential
impact of accommodations on these different item formats (Zenisky & Sireci,
2007, p. 17).
It should be noted that students with ADHD were classified as ‘other health impairment’
in one study as
after the passage of IDEA in 1990 and a subsequent 1991 memorandum, that
the U.S. Department of Education and its Office of Special Education chose to
reinterpret these regulations, thereby allowing children with ADHD to receive
special educational services for ADHD per se under the ‘Other Health Impaired’
category of IDEA (Barkley, 2006, p. 16-17).
The coding form reflected each of these broad categories with the different levels or
options provided.
A proxy for study quality was used as there is much disagreement in the field
regarding classification of study quality. Ahn and Becker (2011) found that using
“quality weighting adds uncertainty to average effect sizes but does not eliminate serious
bias related to study quality… [and] adds bias in many cases” (pp. 579-580). Therefore, a
pseudo-measure of quality, grouping primary studies by (i) published journal articles and
conference proceedings that are peer reviewed, (ii) published reports which may or may
not undergo a peer review process, and (iii) unpublished dissertations which are reviewed
by dissertation committee members, was used. The ‘quality’ for journals, conference
papers, and dissertations was, arguably, considered ‘equivalent,’ while research reports
were viewed as being of ‘lesser quality.’
It must be pointed out that some variables found in the research studies were very
difficult to classify (for example, participant disability classification); thus room was left
for qualitative descriptions. These descriptions were later analyzed, identifying
commonalities and differences that were then coded so they could potentially be included
in the statistical analysis (Lipsey & Wilson, 2001). Per Lipsey and Wilson’s
recommendation, such qualitative descriptions were “only used for critical issues and
when absolutely necessary” (p. 74). As well, there were instances where variables could
not be coded based on the data included in the study being analyzed. In these instances,
an explicit option to indicate that it was not possible to “tell what the status of the study
[was] on that item” (p. 88) was provided in the codebook and the accompanying coding
form via a missing option, and coded as not reported. It was also necessary to distinguish
between missing and not applicable (Lipsey & Wilson, 2001), thus a not applicable
category was also provided.
As several different research designs are found in research studies involving
assessment accommodations, coding for research design was implemented. This allowed
for the inclusion of studies with diverse research designs, whereby different effect sizes
were calculated to reflect the differences in research design (Lipsey & Wilson, 2001). It
should be noted that this was not a factor in the Chiu and Pearson (1999) meta-analysis as
most studies conducted prior to 2000 used boost research designs.
Dependent and non-independent effect sizes.
It is recommended that the same data set should only be used once in an analysis
(Lipsey & Wilson, 2001). For example, the results of a research study may be presented
at a conference and then later reported in a journal. In such instances, for the present
study, the unit of analysis was the research study containing the most information that
could be readily coded.
Some eligible research studies provided dependent and non-independent effect
sizes; that is, there were multiple samples with multiple results reported within a single
research study. When this occurred, it was necessary to distinguish between the types of
effect sizes as only effect sizes that are independent are suitable for the calculation of the
overall mean effect size in a meta-analysis (Lipsey & Wilson, 2001). Issues calculating
mean effect when multiple effect sizes are present include problems estimating the
variance across the studies, issues when conducting significance testing, problems
looking for moderators, providing inaccurate sample size(s), and giving too much weight
to a few studies. When using the Hunter and Schmidt (1990) method to calculate effect
sizes, multiple effect sizes in a single study appear to be less of an issue, with some data
indicating that these estimates may in fact be better (Martinussen & Bjørnstad, 1999).
Suggestions to resolve this issue include (i) picking one of the results randomly, (ii) using
the most common effect size, and (iii) computing the mean effect size and the mean
sample size, which is the mean of the subjects per effect size and not the mean of all the
subjects involved (M. Martinussen, personal communication, May 2007). Martinussen
recommended using the third method and, in the cases where the samples in the research
study were dependent, the third method was employed.
study were independent, the information was captured twice; once to analyze the data
while accounting for the independent samples, using the substudy as the unit of analysis,
and once when not accounting for the independent samples; i.e., to examine the
aggregate, using the study as the unit of analysis. It should be noted that in the instances
where substudy was the unit of analysis, and there were dependencies, there was a
reduction in the effect size estimation.
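A minimal sketch of this third method (Python, hypothetical values; the effect sizes shown are illustrative and not taken from any study coded here) averages the dependent effect sizes within a study and uses the mean of the per-effect sample sizes rather than the total number of participants:

# Minimal sketch: handling dependent effect sizes within one study (hypothetical values).
study_effects = [0.40, 0.25, 0.35]   # dependent effect sizes reported for one study
study_ns      = [28, 28, 30]         # participants contributing to each effect size

mean_effect = sum(study_effects) / len(study_effects)
mean_n = sum(study_ns) / len(study_ns)   # mean of subjects per effect size, not their sum
print(round(mean_effect, 3), round(mean_n, 1))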
Coding characteristics of operational definitions.
Certain constructs central to the meta-analysis needed to be taken into
account. Specific operational and conceptual criteria for assessment, assessment
accommodation, and student with disabilities were used to guide coding information for
their associated variables; for example, type of assessment, category of accommodation,
and sampling method.
A range of large-scale assessments was used in the research studies collected. To
account for the variety of assessments, each assessment was coded in relation to the
assessment category measured (achievement, aptitude, performance, placement,
selection, screening, diagnosis, other), construct and/or content measured (mathematics,
reading/language arts, science, writing, social studies, physical education, multiple
content areas, other), method of standardization (norm-referenced, criterion-referenced,
domain-referenced, standards-based), and assessment format (multiple-choice, fill-in-the-
blanks, short answer questions, open-ended questions). Assessment citation information
was entered as qualitative information.
To account for the diversity of assessment accommodations included in the
analysis, accommodation operational definitions were coded in relation to predetermined
categories based on the NCEO criteria of presentation, response, setting, and timing and
scheduling. These categories were further broken down into specific accommodations;
e.g., oral administration as a sub-category of presentation. Every effort was made to
determine the mode students used to answer the assessment questions; i.e., paper and
pencil or computer. If students used a computer to read or hear assessment directions,
questions, response options, etc. and used a paper and pencil form to answer the
questions on the assessment, then the assessment was included in the meta-analysis. It
was rejected if the students used a computer to answer the assessment questions.
To accurately report on the students with disabilities category, each research study
was coded according to explicitly stated information on type of disability. Disabilities
were coded according to the 13 special education categories listed in federal special
education law (Individuals with Disabilities Education Act reauthorization of 2004, PUBLIC LAW
108–446, 2004):
mental retardation, hearing impairments (including deafness), speech or language
impairments, visual impairments (including blindness), serious emotional
disturbance (referred to in this title as ‘emotional disturbance’), orthopedic
impairments, autism, traumatic brain injury, other health impairments, or specific
learning disabilities (Part A (SEC. 602) (3) (A) (i), 118 STAT.2652, 2004).
With the iterative nature of coding, some adjustments were made to the coding
process. It was originally hoped that there would be a viable number of studies using
original versions of high-stakes, large-scale, or standardized tests. However, the majority
of studies used researcher-developed assessments, drawing from large-scale and/or high-
stakes assessment item banks; using such data was believed to be appropriate.
Additionally, both achievement and ability measures were included in the meta-analysis
as achievement and ability are highly correlated (Tindal & Fuchs, 2000). Comparative
research designs (i.e., post hoc analyses) were dropped from the meta-analysis as they lacked random assignment or counterbalancing and thus did not appear to adequately address either meta-analytic research hypothesis posed. It was felt that empirical research (i.e., experimental or quasi-experimental research) was a better match to the
research purpose for this study as it is a way of gaining knowledge through direct
observation or experience. As well, although type of assessment (i.e., norm-referenced, criterion-referenced, domain-referenced, standards-based, or curriculum-based) was coded, it was not included in any of the analyses because there were too many missing data.
Issues of reliability throughout the coding process.
Another area of consideration during the coding process was the avoidance of
errors and biases introduced when coding the data. By providing explicit, unambiguous
descriptions of each coded variable in the codebook, “coding errors” associated with
judgments were, for the most part, avoided. Additionally, electronic coding forms, with data entered directly on a computer, were used to avoid commonplace coding errors associated with data entry, eliminating the need to reenter or copy data from one database to
another. Although these preventive measures were implemented, a statistical analysis of
coding errors and bias was conducted, as the introduction of coding error cannot be
entirely avoided. After the coding manual (Appendix D), codebook (Appendix E), and
coding form (Appendix F) were developed, two different coders reviewed and coded
15% of eligible studies. A measure of inter-rater reliability, percentage agreement, for a
random sample of 15% of all studies was calculated. In the event there was disagreement
between the two raters, the rationale for the difference was discussed and eventual
consensus on coding was reached; and, when needed, the coding form reflected changes.
The inter-rater reliability by category, the categories being study citation, participant
information, assessment information, accommodation information, statistical analysis,
and results (i.e., means and standard deviations), and ‘additional’ results (i.e.,
significance tests and correlation coefficients between the non-accommodated and
accommodated conditions), ranged from 77% to 100%, and was 92% overall. The
percentage agreement for continuous participant and results data, used to calculate effect
sizes for the primary studies, was 98.9%. Additionally, the reliability coefficient calculated for these data reached 1.00 and was statistically significant. The inter-rater
reliability was considered adequate for purposes of this study. While final coding was
consensual, calculation of reliability did not include coding which changed; i.e., it was
computed before the original codes were changed.
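For transparency, the percentage-agreement calculation can be sketched in a few lines (a minimal sketch; the item codes shown are hypothetical and do not reproduce the actual coding forms):

```python
# Percentage agreement between two coders over the same set of coded fields.
def percentage_agreement(coder_a, coder_b):
    """coder_a, coder_b: equal-length lists of coded values for the same items."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Coders must code the same items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

# Hypothetical codes for one category (e.g., accommodation information)
a = ["oral", "oral", "extended time", "computer", "oral"]
b = ["oral", "oral", "extended time", "oral", "oral"]
print(percentage_agreement(a, b))  # 80.0
```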
To minimize other possible issues of reliability, joint training sessions for the
coders were conducted. During the training sessions the coding manual, codebook, and
coding form were reviewed, followed by a discussion regarding code entry using an
Excel spreadsheet. Once the review was completed, the two coders examined and coded a
previously coded study together. Additionally, all coding decisions were recorded,
together with the rationale for these decisions, and the information was saved to an Excel
spreadsheet. Further, the same ID number was used for the same research study even
when the study was found in multiple sources such as papers, research reports, and
journal articles. An alpha character, beginning with A, was appended to the ID number
when multiple instances of the same study were found. For studies with multiple samples
and multiple results, a lower case roman numeral following the ID number and the alpha
character, beginning with i, was appended to the ID number.
If both multiple independent sections and a summative section with information
to estimate an effect size were present in a study, the information from the summative
section was not included in the meta-analysis.
In an effort to ensure comparisons made were apples to apples and not apples to
oranges, eligible studies had to focus on (i) students with disabilities and groups
compared to students with disabilities; not English language learner or other group
comparisons, (ii) testing accommodations which could be categorized under presentation,
response, setting, and/or timing/scheduling, (iii) studies examining a single
accommodation, and (iv) large-scale, high-stakes, published assessments, or researcher-
developed assessments using items banks from large-scale and/or high-stakes
assessments. It was expected that these assessments would present fewer issues with
reliability and validity.
Statistical methods of analysis.
Following the coding of eligible studies, a suitable effect size statistic and
appropriate statistical methods to combine effect sizes across studies were selected. Meta-
analytic experts have devised statistical procedures for calculating a variety of effect
sizes, weighting the mean effect sizes, estimating the effect of other potential moderators,
correcting effect sizes for attenuation, and combining effect sizes from studies employing
different designs. In texts authored by Borenstein et al. (2009), Hunter and Schmidt
(1990), and Lipsey and Wilson (2001), information on meta-analytic statistical
procedures is presented. These texts, together with coursework in meta-analysis taken at
the University of Denver, provide primary references for the statistical methods used in
the present meta-analysis.
Comprehensive Meta-Analysis V.2.2.050 (Borenstein, Hedges, Higgins, &
Rothstein, 2009) (http://www.metaanalysis.com/index.html) was used to compute the
necessary meta-analytic statistics.
Methods for calculating independent effect sizes.
“A critical step in meta-analysis is to encode or ‘measure’ selected research
findings on a numeric scale, such that the resulting values can be meaningfully compared
to each other and analyzed much like any other set of values on a variable” (Lipsey &
Wilson, 2001, p. 34). Effect size statistics, referred to previously, provide the “index used
to represent study findings in a meta-analysis” (Lipsey & Wilson, 2001, p. 34). In order
to meaningfully aggregate findings from primary studies it is generally necessary to
determine a standardized scale appropriate to the types of research designs seen in the
eligible research studies. As the unit of analysis (i.e., the research report, research article, conference paper, or dissertation) consistently examined differences between means for
(i) students with disabilities, (ii) students with disabilities compared to other students
with disabilities, or (iii) students with disabilities compared to typically developing peers,
effect sizes based on the standardized difference between means formed the basis of the
analysis.
For primary studies Hedges' g, an unbiased estimator of δ, the standardized mean difference, based on Cohen's d, was used to calculate the effect size for differences between means.

\delta = \frac{\mu_1 - \mu_2}{\sigma}    (2.1)

d = \frac{\bar{Y}_e - \bar{Y}_c}{s_p}    (2.2)

where \bar{Y}_e is the mean of the experimental group, in this case students with disabilities, \bar{Y}_c is the mean of the control group, in this case typically developing students, and s_p is the pooled sample standard deviation.

g = \left(1 - \frac{3}{4m - 1}\right) d = \left(1 - \frac{3}{4(n_e + n_c) - 9}\right) d    (2.3)

where m = n_e + n_c - 2 is the degrees of freedom for the pooled standard deviation.
For these calculations, means and standard deviations needed to be available for
each unit of analysis. In some cases means and standard deviations were not available, so
effect sizes were calculated from reported test statistics, such as t-tests or tests of
significance, when these data were available. Note that use of the pooled standard
deviation for the groups under study is generally recommended. However, if the standard
deviations for the groups under study are very different it is recommended that the
standard deviation for the control group be used instead (Lipsey & Wilson, 2001).
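As an illustration of Equations 2.2 and 2.3, and of the fallback to a reported test statistic, the computations can be sketched as follows (a minimal sketch with hypothetical summary statistics; the conversion from an independent-groups t statistic shown here, d = t * sqrt(1/n_e + 1/n_c), is one common approach and is not necessarily the only conversion used in the primary studies):

```python
import math

def cohens_d(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Standardized mean difference using the pooled standard deviation (Eq. 2.2)."""
    sp = math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2))
    return (mean_e - mean_c) / sp

def hedges_g(d, n_e, n_c):
    """Small-sample corrected estimate of delta (Eq. 2.3)."""
    return d * (1 - 3 / (4 * (n_e + n_c) - 9))

def d_from_t(t, n_e, n_c):
    """Fallback when only an independent-groups t statistic is reported."""
    return t * math.sqrt(1 / n_e + 1 / n_c)

# Hypothetical accommodated vs. non-accommodated group summary statistics
d = cohens_d(52.0, 48.5, 10.2, 9.8, 30, 32)
print(round(hedges_g(d, 30, 32), 3))  # ~0.35
```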
While the control group standard deviation is the recommended standard deviation for the groups under study, the standard deviation used was pooled within groups; for example, pooled within the students with disabilities subgroup separately from the typically developing students subgroup. Pooling within groups does not assume the study-to-study variance (τ²) is the same for all subgroups. As it was “anticipate[d]
that the true between-studies dispersion [was] actually different from one subgroup to the
next … tau-squared [was estimated] for each subgroup” (Borenstein et al., 2009, p. 163).
With several studies within each subgroup, these estimates were not considered imprecise
(Borenstein et al., 2009). In an effort to ensure these assumptions were appropriate, a
sensitivity analysis was performed comparing pooled within-group standard deviation
and pooled across-group standard deviation results.
The random-effects model was employed, as there was variation beyond sampling
error from differences among studies’ effect sizes. The random-effects model does not produce the substantial Type I bias in significance tests for mean effects and moderator variables (i.e., interactions) that is seen with fixed-effects models. As well, confidence intervals
generated using the random-effects model do not overstate the degree of precision for the
meta-analytic findings (Hunter & Schmidt, 2000). Statistical significance of effect sizes was determined using 95% confidence intervals. Effect sizes with confidence intervals
that did not include zero were considered statistically significant.
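For illustration, a random-effects mean effect size and its 95% confidence interval can be obtained along the following lines (a minimal sketch using the DerSimonian-Laird estimate of tau-squared; this is one standard approach and is not a description of the internal computations of Comprehensive Meta-Analysis):

```python
import math

def random_effects_summary(es, var):
    """DerSimonian-Laird random-effects mean, standard error, 95% CI, and tau-squared.
    es: list of effect sizes (e.g., Hedges' g); var: their within-study sampling variances."""
    w = [1.0 / v for v in var]                        # fixed-effect (inverse-variance) weights
    mean_fixed = sum(wi * ei for wi, ei in zip(w, es)) / sum(w)
    q = sum(wi * (ei - mean_fixed) ** 2 for wi, ei in zip(w, es))
    df = len(es) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                     # between-study variance estimate
    w_re = [1.0 / (v + tau2) for v in var]            # random-effects weights
    mean_re = sum(wi * ei for wi, ei in zip(w_re, es)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    ci = (mean_re - 1.96 * se, mean_re + 1.96 * se)   # significant if the CI excludes zero
    return mean_re, se, ci, tau2

# Hypothetical effect sizes and variances from four primary studies
print(random_effects_summary([0.05, 0.60, 0.30, 0.45], [0.02, 0.03, 0.05, 0.04]))
```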
While one effect size was provided per independent study, or independent section of a research study (i.e., substudy), a correction to the observed standard deviation was
used to account for sampling error. Additionally, before combining the effect size data
for the difference between the means into a mean effect size, Lipsey and Wilson (2001)
recommend assessing the effect of outliers and adjusting individual effect sizes based on
the consideration of common sources of error. All corrections were performed prior to
running the final analyses.
The steps followed for calculating independent effect sizes included estimating
the mean effect size, tests of significance for the test statistics and the size of the effect,
and estimating and testing the variation between the units of analysis.
All effect sizes were interpreted using Cohen’s (1992) labels for “mean” effect
sizes where 0.8 is considered a large effect size, 0.5 is considered a medium effect size
and 0.2 is considered a small effect size. At present, in the testing accommodation
literature for students with disabilities, there are no clearly defined demarcations between
small, medium, and large effects. Therefore, the values cited by Cohen were used as
lower-bound estimates for calculated mean effect sizes as using this more conservative
estimate was considered to be the more prudent course of action rather than possibly
providing an overestimate with respect to the efficacy of testing accommodations.
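The conservative labeling rule can be summarized as follows (a minimal sketch; the "negligible" label for values below 0.2 is mine, not Cohen's):

```python
def cohen_label(es):
    """Interpret a mean effect size with Cohen's (1992) benchmarks as lower bounds."""
    es = abs(es)
    if es >= 0.8:
        return "large"
    if es >= 0.5:
        return "medium"
    if es >= 0.2:
        return "small"
    return "negligible"  # below the lower bound for "small"

print(cohen_label(0.47))  # "small", bordering on medium but labeled conservatively
```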
Accounting for variance in the distribution of effect sizes.
After calculating independent variance estimates, variance in the distribution of
effect sizes was accounted for. Mean effect size is difficult to interpret without examining
the variance in the distribution of effect sizes and ensuring that parametric statistical test
assumptions have been addressed.
Outlier analysis.
As the “purpose of meta-analysis is to arrive at a reasonable summary of the
quantitative findings of a body of research studies” (Lipsey & Wilson, 2001, p. 107), the
presence of extreme values for effects may be unrepresentative of the research area of
interest. Such outliers may produce spurious results; disproportionately affecting means,
variances, and other statistics used in the meta-analysis; hence the need for outlier
analysis. The distribution of effect sizes was analyzed and outliers were identified
(Hedges & Olkin, 1985 cited in Lipsey & Wilson, 2001). Once the degree of dispersion
for existing outliers was determined and their effect on the summary statistics assessed, appropriate procedures for handling the outliers were determined on a case-by-case basis. In
general, the outlier was removed from further analysis. Potential reasons for the existence
of outliers in a meta-analysis include methodological error and poor validity of
operational definitions (Lipsey & Wilson, 2001).
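A screening of this kind can be sketched as follows (a minimal sketch assuming SciPy is available; the z-score cutoff and the use of the Shapiro-Wilk test are illustrative and are not the exact decision rules applied in this study):

```python
from scipy import stats

def screen_outliers(es, z_cut=2.0, alpha=0.05):
    """Flag effect sizes with extreme standardized values and test normality."""
    z = stats.zscore(es)
    flagged = [e for e, zi in zip(es, z) if abs(zi) > z_cut]  # illustrative cutoff
    skew = stats.skew(es)
    kurt = stats.kurtosis(es)
    w_stat, p_normal = stats.shapiro(es)  # p < alpha suggests non-normality
    return flagged, skew, kurt, p_normal < alpha

# Hypothetical Hedges' g values, one extreme (cf. the 11.63 value in Table 2)
g_values = [0.1, 0.2, 0.3, 0.15, 0.4, 0.25, 11.63]
print(screen_outliers(g_values))
```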
Outlier analyses, examining standardized effect sizes, were conducted prior to the
meta-analysis and meta-regression analyses. To start, science and social studies results
were removed from analyses. As assessments in some studies were run across multiple
years and multiple subjects, it was felt that keeping results for a single subject—math—
across multiple years was a more appropriate match to the present research purpose.
Once the remaining studies were deemed an appropriate match to research
purpose for the current study, incremental outlier analyses using study as the unit of analysis were conducted, followed by the same analyses using substudy as the unit of
analysis.
Table 2 provides results of the incremental outlier analysis with study as the unit
of analysis. For accompanying histograms, see Appendix J.
Results from the Bouck and Yadav (2008) study had extreme values for both
students with disabilities and their typically developing peers. Tests of normality were statistically significant (p < 0.001), indicating non-normality, so these values were removed from the data and the analysis was repeated. Results from the second iteration showed extreme values for students with disabilities and typically developing students for the Lewandowski and Lovett (2008) study. Again, tests of normality were statistically significant (p < 0.001); thus the data from this study were removed and the analysis was repeated. A final test showed an extreme value for students with disabilities in the Lesaux, Pearson, and Siegel (2006) study, with a statistically significant test of normality (p = 0.005). As a check, results from this study for both students with disabilities and students with typical development were removed and a final analysis was completed. Because the subsequent tests for normality were not significant (p > 0.005), the assumption of normality was not rejected, and only students with disabilities, not their typically developing peers, displayed an extreme value, it was felt that it was not necessary to remove this study.
Table 2: Outlier Analysis for Effect Size Estimates - Study as the Unit of Analysis

Study                               Group                           ES(a)   Issues                            Result
Analysis 1
  Bouck & Yadav (2008)              students w/o disabilities(b)    11.63   skewness, kurtosis, & normality   removed
  Bouck & Yadav (2008)              students w/ disabilities         3.30   skewness, kurtosis, & normality   removed
  Lewandowski & Lovett (2008)       students w/o disabilities(b)     1.87
  Lesaux, Pearson, & Siegel (2006)  students w/ disabilities         1.43
Analysis 2
  Lewandowski & Lovett (2008)       students w/o disabilities(b)     1.87   skewness, kurtosis, & normality   removed
  Lesaux, Pearson, & Siegel (2006)  students w/ disabilities         1.43
  Lewandowski & Lovett (2008)       students w/ disabilities         1.30   skewness, kurtosis, & normality   removed
Analysis 3
  Lesaux, Pearson, & Siegel (2006)  students w/ disabilities         1.43   skewness, kurtosis, & normality   retained

(a) ES is Hedges' g effect size estimate.
(b) Students w/o disabilities refers to typically developing students.
The incremental outlier analysis, with substudy as the unit of analysis, is provided
in Table 3. For accompanying histograms, see Appendix J.
While it was expected that, given the addition of substudy, there would be a
different set of outliers, this was not the case. The same iterative analyses were run, with
the same results.
Studies with extreme values (Bouck & Yadav, 2008; Lewandowski & Lovett,
2008); i.e., those not in line with information from other primary studies listed in Table 2
and Table 3, were removed from further analyses.
Table 3: Outlier Analysis for Effect Size Estimates - Substudy as the Unit of Analysis

Study                               Group                           ES(a)   Issues                            Result
Analysis 1
  Bouck & Yadav (2008)              students w/o disabilities(b)    11.63   skewness, kurtosis, & normality   removed
  Bouck & Yadav (2008)              students w/ disabilities         3.30   skewness, kurtosis, & normality   removed
Analysis 2
  Lewandowski & Lovett (2008)       students w/o disabilities(b)     1.87   skewness, kurtosis, & normality   removed
  Lesaux, Pearson, & Siegel (2006)  students w/ disabilities         1.43
  Lewandowski & Lovett (2008)       students w/ disabilities         1.30   skewness, kurtosis, & normality   removed
Analysis 3
  Lesaux, Pearson, & Siegel (2006)  students w/ disabilities         1.43   skewness, kurtosis, & normality   retained
  Meloy, Deville, & Frisbie (2002)  students w/ disabilities         1.20

(a) ES is Hedges' g effect size estimate.
(b) Students w/o disabilities refers to typically developing students.
While it might be argued that the larger effect sizes seen for the outlier studies
were the result of a good match between the study participants and the accommodation
under investigation, this did not appear to be the case as there were no discernible
differences between these studies and those that were included in the meta-analyses. It
was felt that removing these specific studies, particularly as no relevant differences
between ‘outlier’ and ‘included’ studies were seen, provided a more conservative
estimate of the mean effect for testing accommodations. Thus, in the event statistically
significant mean effects were found, the use of a more conservative estimate was thought
to provide a better approximation of the mean effects than potentially overestimating
these effects.
Analysis of the homogeneity of variance and the distribution of effect size.
Examination of the homogeneity of the effect size distribution (i.e., the distribution of primary effect sizes around the mean effect size) is one of the next steps in
meta-analytic research. With a homogenous distribution, the amount by which the effect
size distribution differs from that of the population is equal to that expected by sampling
error. Rejection of homogeneity of variance suggests the variability of the effect sizes is
larger than sampling error and, therefore, “each effect size does not estimate a common
population mean” (Lipsey & Wilson, 2001, p. 115). The Q statistic was employed to test
the homogeneity of the distribution of primary effect sizes.
The Q statistic is distributed as a chi-square with k - 1 degrees of freedom, where k is equal to the number of effect sizes used in the meta-analysis, ES_i is the individual effect size for i = 1 through k effect sizes, and \overline{ES} is the weighted mean effect size over the k effects.

Q = \sum_{i=1}^{k} \omega_i \left( ES_i - \overline{ES} \right)^2    (2.4)

where \omega_i is the individual weight for ES_i.
From a statistical perspective, the Q statistic examines the assumption of a fixed-
effects model, with a significant Q indicating a heterogeneous distribution, challenging
the fixed-effects model. Conversely, a non-significant Q may not be indicative of a fixed-
effects model. For example, if there is a small number of primary studies and each
examines a small number of subjects, there may not be enough statistical power to be
able to reject the homogeneity of variance assumption (Lipsey & Wilson, 2001; Morton
et al., 2004).
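For illustration, Equation 2.4 and the associated chi-square test can be computed as follows (a minimal sketch with hypothetical effect sizes and inverse-variance weights):

```python
from scipy import stats

def q_statistic(es, var):
    """Homogeneity test: Q = sum of w_i * (ES_i - ES_bar)^2, chi-square with k - 1 df."""
    w = [1.0 / v for v in var]                                  # inverse-variance weights
    es_bar = sum(wi * ei for wi, ei in zip(w, es)) / sum(w)     # weighted mean effect size
    q = sum(wi * (ei - es_bar) ** 2 for wi, ei in zip(w, es))
    df = len(es) - 1
    p = stats.chi2.sf(q, df)   # small p -> heterogeneity beyond sampling error
    return q, df, p

# Hypothetical effect sizes and their sampling variances
print(q_statistic([0.05, 0.60, 0.30, 0.45], [0.02, 0.03, 0.05, 0.04]))
```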
Sources of variance associated with the distribution of the primary study effect
sizes were expected to be randomly distributed. This led to the adoption of the random-
effects, or unconditional, model. Under the random-effects model, studies differ along two dimensions: study characteristics and the effect size parameter. That is, effect size
variation is explained by a random component as well as by subject-level sampling error.
Hedges (1994) explained that “studies in the study sample … differ from those in the
universe as a consequence of the sampling of people into the groups of the study” (p. 31)
with “the study sample (and their effect size parameters) differ[ing] from those in the
universe by as much as might be expected as a consequence of drawing a sample from a
population” (p. 31) such that there is “variation of observed effect sizes about their
respective effect size parameters” (p. 31), referred to as study-level and subject-level
random variability by Lipsey and Wilson (2001).
The assumptions of the fixed-effects model, whereby random error found in the
primary studies was due to subject-level sampling error alone and effect sizes were
presumed to estimate the consequent population effect, were considered untenable on
theoretical grounds. The primary analyses forming the basis of the present meta-analysis
were considered to be part of a larger universe of primary analyses that do not have a
common effect size for the population of potential eligible studies. That is, the observed
effect sizes were expected to have both study-level and subject-level sampling error
variability. As well, the assumptions necessary for the fixed-effects model were difficult
to meet.
While potentially tenable, the mixed-effects model, which assumes that variance
not explainable by sampling error can be attributed to both random and systematic
sources of variance, was not employed. It was believed that regardless of how much
attention was devoted to the design of the coding tools, allowing for the quantification of
potential moderator variables, the coding conducted would not be able to capture the
information in enough detail to meet the assumptions necessary to conclude differences
were truly systematic sources of variance. Additionally, the mixed-effects model allows
for the use of a random-effects model to combine the studies within each subgroup; i.e.,
students with disabilities and typically developing students, and a fixed-effects model to
combine the subgroups to yield the overall mean effect size. As the research purpose was
to compare subgroups, and not aggregate these two groups, use of the mixed-effects
model was not warranted.
Due to the nature of the design of the present meta-analysis, effect sizes found for
the primary studies examined were derived from a non-uniform set of sample
characteristics; i.e., assessment accommodations for students with disabilities. Therefore,
homogeneity of variance of the primary effect sizes was not expected due to the degree of
differences between both assessment accommodations and students with disabilities. This
led to the use of the random-effects model in the final analysis examining the efficacy of
assessment accommodations and their delivery to students with disabilities as opposed to
their typically developing peers.
When coding the data, several studies using a repeated measures design did not report the test score correlation between the non-accommodated and accommodated conditions that is necessary for effect size estimation. For studies missing these correlations, the
correlations were estimated using information from test websites, searching the online
version of the Mental Measurements Yearbook, and other research studies with similar
tests (i.e., for the same age group assessing the same test content), frequently using test-
retest reliability as an approximation of this value for the measures in question. Both
Borenstein et al. (2009) and Lipsey and Wilson (2001) have mentioned this issue noting
that using estimates, particularly test-retest reliability scores, “affects the confidence
interval around the mean effect size thus caution should be used in interpreting the
confidence interval” (Lipsey & Wilson , 2001, p. 43). Sensitivity analyses were
performed, see ‘Sensitivity analyses,’ examining differences between studies using a
repeated measures design and those using an independent groups design to ensure that the
using these estimates were not drastically different.
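To show how the correlation enters the calculation, the effect size and variance for a repeated measures (paired) design can be sketched as follows (a minimal sketch following the paired-design formulas in Borenstein et al., 2009; the numeric values, including the test-retest reliability used as the correlation, are hypothetical):

```python
import math

def paired_effect_size(mean_diff, sd_diff, n, r):
    """Standardized mean difference and its variance for a repeated measures design.
    r is the correlation between accommodated and non-accommodated scores; when a
    primary study omitted it, a test-retest reliability was used as an approximation."""
    sd_within = sd_diff / math.sqrt(2 * (1 - r))      # put d on the raw-score metric
    d = mean_diff / sd_within
    var_d = (1 / n + d**2 / (2 * n)) * 2 * (1 - r)    # variance of d for a paired design
    return d, var_d

# Hypothetical: the same 40 students tested with and without extended time,
# with r approximated by a published test-retest reliability of 0.80
print(paired_effect_size(mean_diff=3.0, sd_diff=6.0, n=40, r=0.80))
```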
Some studies using counterbalancing provided different results for test and/or
order of condition results. In these cases, all data provided in the study were included in
the analyses. While it was expected that there may be issues with some of the study
variables; particularly as tests used in counterbalanced designs might not be parallel or
the order of administration of the condition might affect the results; the data were
included in the meta-analysis as they were still thought to provide legitimate evidence
with respect to the research hypotheses posed.
Both boost and differential boost/interaction study data were combined in the
analyses used to answer the hypotheses posed by the current research. Borenstein et al.
(2009) point to issues with combining data from studies using different designs, as there may be substantive differences as well; this was not suspected to be an issue for the
present study. There were several instances in the primary research (see Abedi et al.,
2010; Johnson, 2000; Kosciolek & Ysseldyke, 2000; Schnirman, 2005; and Walz, Albus,
Thompson, & Thurlow, 2000) where the same data set was used to answer questions
regarding the efficacy of accommodations for students with disabilities and whether or
not these accommodations were differentially effective for students with disabilities as
compared to their typically developing peers. Similarly, meta-analyses conducted by
Elbaum (2006) and Gregg and Nelson (2012) included results from primary research for
both boost and differential boost/interaction research approaches.
Data from primary studies using repeated measures and independent group
designs were combined in the analyses conducted. While this is not an issue “from a
statistical perspective [as] the effect size … has the same meaning regardless of the study
design” (Borenstein et al., 2009, p. 25), there may be issues regarding the focus of the
studies and the effect sizes. Morris and DeShon (2002) note that the
…IG [independent groups] focus of research [is] on differences across alternative
treatments using raw score metric while RM [repeated measures] focus of
research [is] on individual change using change score metric (p. 110)
and that “[t]he use of change score metric will often produce larger effect sizes than raw
score metric” (p. 110). Still Borenstein et al. (2009) point out that “we need to assume
that the studies are functionally similar in all other important respects” (p. 361).
With respect to the current research work, it was felt that the benefits of combining the different designs on substantive grounds, together with the use of Comprehensive Meta-Analysis V.2.2.050 to calculate and appropriately weight the different studies included, provided information that would not be fully captured by examining the results separately for the two different research designs. Sensitivity analyses examining the
differences between the results for the aggregate versus the disaggregated studies
provided useful information to make certain that there were not drastic differences
between estimates for the repeated measures, independent groups, and aggregated
analyses (see ‘Sensitivity analysis’).
Sensitivity analysis.
Table 4 provides a comparison of the mean effect size estimates for the random-
effects model for the two different research designs, repeated measures and independent
groups, to the mean effect size estimates when combining both research designs.
The mean effect size estimates comparing students with disabilities to their
typically developing peers for primary studies, using a repeated measures design (ES = 0.31 for students with disabilities; ES = 0.17 for typically developing students) or an independent groups design (ES = 0.26 for students with disabilities; ES = 0.15 for typically developing students), as compared to the combination of both repeated measures and independent groups primary studies (ES = 0.30 for students with disabilities; ES = 0.17 for typically developing students), are extremely similar. Further, standard errors and confidence intervals were not considered very different. However, there was a non-significant mean effect size estimate for typically developing students for the independent groups research design. This is most likely to be expected given the smaller number of primary studies constituting the mean effect size estimate.
This sensitivity analysis provided evidence for combining primary study
information for both repeated measures and independent groups research designs when
answering the first research hypothesis posed by the current study.
Table 4: Sensitivity Analysis for Research Hypothesis 1 - ES Estimates, Confidence Intervals, & Significance

Mean effect size & 95% confidence interval for Hedges' g

Comparison group                 k     ES(a)   Std Err(a)   LL(a)   UL(a)   p(ES)
Combined Studies (random-effects model)
  students w/ disabilities       62    0.30    0.04          0.21   0.38    < 0.001
  students w/o disabilities(b)   57    0.17    0.03          0.11   0.22    < 0.001
Repeated Measures Designs (random-effects model)
  students w/ disabilities       48    0.31    0.05          0.22   0.41    < 0.001
  students w/o disabilities(b)   46    0.17    0.03          0.11   0.23    < 0.001
Independent Groups Designs (random-effects model)
  students w/ disabilities       14    0.26    0.12          0.02   0.50    0.033
  students w/o disabilities(b)   11    0.15    0.12         -0.08   0.38    0.193

(a) ES is Hedges' g mean effect size estimate, Std Err is standard error, LL is lower limit, and UL is upper limit.
(b) Students w/o disabilities refers to typically developing students.
Sensitivity analyses for research hypothesis 2 are displayed in Table 5. The mean
effect size estimates for the random-effects model, when combining both research
designs, are compared to the mean effect size estimates for the two different research designs: repeated measures and independent groups.
As can be seen, the mean effect size estimates comparing the four different
categories of accommodations (presentation, response, setting, and timing/scheduling) are similar for presentation and timing/scheduling accommodations for the repeated measures research design (ES = 0.19 for presentation; ES = 0.47 for timing/scheduling) as compared with the combination of repeated measures and independent groups research designs (ES = 0.22 for presentation; ES = 0.47 for timing/scheduling). The same cannot be said for the independent groups research design, as the mean effect size for presentation, ES = 0.39, is larger, albeit still within the small range (Cohen, 1992), and for timing/scheduling, ES = -0.04, is smaller. It must be noted that there is only one
timing/scheduling study for the independent groups research design, rendering sensitivity
analyses for this comparison moot. As there are so few primary studies for either
response or setting accommodation categories, sensitivity analysis was not considered
relevant. Additionally, these two accommodation categories were not subject to intensive
meta-analytic scrutiny or closely examined in the meta-regression analyses.
Again, evidence for combining primary study information to answer the second
research hypothesis under investigation, for both repeated measures and independent
groups research designs, albeit only for presentation and timing/scheduling assessment
accommodations, is supported by the sensitivity analysis.
Table 5: Sensitivity Analysis for Research Hypothesis 2 - ES Estimates, Confidence Intervals, & Significance

Mean effect size & 95% confidence interval for Hedges' g

Type of Accommodation      k     ES(a)   Std Err(a)   LL(a)   UL(a)   p(ES)
Combined Studies (random-effects model)
  Presentation             41    0.22    0.06          0.12   0.33    < 0.001
  Response                  3    0.24    0.38         -0.50   0.98    0.525
  Setting                   1    0.32    0.17         -0.02   0.66    0.061
  Timing-Scheduling        17    0.47    0.09          0.30   0.64    < 0.001
Repeated Measures Designs (random-effects model)
  Presentation             30    0.19    0.06          0.07   0.31    0.002
  Response                  1    1.14    0.17          0.80   1.48    < 0.001
  Setting                   1    0.32    0.17         -0.02   0.66    0.061
  Timing-Scheduling        16    0.48    0.09          0.31