Approaches for development of criterion-referenced standards in health-related youth fitness tests.
Weimo Zhu, PhD, Matthew T. Mahar, EdD, Gregory J. Welk, PhD,
Scott B. Going, PhD, Kirk J. Cureton, PhD
Introduction
Youth fitness testing in the U.S. has a rich history of over 50 years.¹–⁴ Key developments and changes include the development of the American Alliance for Health, Physical Education and Recreation (AAHPER) Youth Fitness Test, the birth of the health-related fitness construct, and changes in evaluation and awards.¹ The transitions from performance-related fitness to health-related fitness and from norm-referenced standards to criterion-referenced (CR) standards are noteworthy since they influenced how fitness is assessed and interpreted. The current paper reviews historical trends in fitness testing and explains the advantages of a CR framework. Methods used for establishing CR standards are described, providing a background for the subsequent articles in this supplement to the American Journal of Preventive Medicine.
Historical Background on Youth Fitness
Testing and Standards
Early interest in youth fitness testing in the U.S. has been attributed to Kraus and Hirschland’s comparative study in the 1950s,⁵,⁶ in which they found that American youth were far less fit than their European counterparts. President Dwight D. Eisenhower, former Allied Commander in the European Theater of WWII, learned of the study and worried about the impact of fitness levels on the readiness of American youth for military service. Under his leadership, the President’s Council on Youth Fitness was established in 1956, and the first AAHPER Youth Fitness Test was published in 1958. Interest in the possible link between fitness and preparedness for military service continued into the 1960s.¹ In his then well-known article “The Soft American” in Sports Illustrated, President-Elect John F. Kennedy stated:
We face in the Soviet Union a powerful and implacable adversary determined to show the world that only the Communist system possesses the vigor and determination necessary to satisfy awakening aspirations for progress and the elimination of poverty and want. To meet the challenge of this enemy will require determination and will and effort on the part of all Americans. Only if our citizens are physically fit will they be fully capable of such an effort.⁷
Consistent with this vision, fitness testing protocols evolved to focus on the importance of performance. The original AAHPER Youth Fitness Test was the only national test for many years, until several states, such as California, Illinois, Indiana, New York, Oregon, South Carolina, Texas, Vermont, and Washington, started developing their own state tests during the 1950s and through the 1970s. Performance-related fitness was also consistent with the growing emphasis on sports, both in school and in society. Together, the drive for military preparedness and society’s interest in sport led to performance-related fitness being the predominant paradigm during that time.
The concept and practice of health-related fitness emerged in the 1970s.⁸–¹⁰ Many factors are believed to have contributed to this change: the impending end of the Cold War, better understanding of the relationship between physical fitness and health, the publication of Aerobics by Dr. Kenneth H. Cooper in 1968¹¹ and its subsequent popularity, and the development and maturation of exercise physiology, physical activity epidemiology, and measurement,⁸ to name just a few of the important influences. Health-related physical fitness was defined in 1980 as “. . . a multifaceted continuum extending from birth to death. Affected by physical activity, it ranges from optimal abilities in all aspects of life through high and low levels of different fitness, to severely limiting diseases and dysfunction.”¹² Four key traditional components of health-related physical fitness are cardiorespiratory function, body composition, muscular strength and endurance, and flexibility. The latter two are now sometimes integrated into the component defined as musculoskeletal function, reducing the number of components to three.¹³ The scientific validity and measurement milestones of these key components are well described in the literature.⁸
From the Department of Kinesiology and Community Health, University of Illinois at Urbana-Champaign (Zhu), Urbana, Illinois; the Department of Exercise and Sport Science, East Carolina University (Mahar), Greenville, North Carolina; the Department of Kinesiology, Iowa State University (Welk), Ames, Iowa; the Department of Nutritional Sciences, University of Arizona (Going), Tucson, Arizona; and the Department of Kinesiology, University of Georgia (Cureton), Athens, Georgia
Address correspondence to: Weimo Zhu, PhD, Professor, Department of Kinesiology and Community Health, University of Illinois at Urbana-Champaign, 205 Freer Hall, MC-052, Urbana IL 61801. E-mail: weimozhu@illinois.edu.
0749-3797/$17.00
doi: 10.1016/j.amepre.2011.07.001
Am J Prev Med 2011;41(4S2):S68–S76 © 2011 American Journal of Preventive Medicine • Published by Elsevier Inc.
The second noticeable change in fitness testing, contributing to the shift from a norm-referenced to a CR perspective, is directly related to the evolving definition and operationalization of fitness. When the interest was on performance, the focus in testing reflected the view that “the more (e.g., number of pull-ups a student can do) or less (e.g., how fast a student can finish a 1-mile run/walk test), the better,” depending on the fitness measure. The norm-referenced evaluation framework, in which a student’s performance is compared with that of his/her peers, is appropriate in this case since the emphasis is on peak performance or high-level achievement. The Presidential Physical Fitness Award Program (PCPFS) is a good example of norm-referenced evaluation, in which students must score at or above the 85th percentile on all five test items to qualify for the award.¹⁴ Many similar examples in fitness, sports performance, and health can be found in a recent collection of norms.¹⁵
Technically, constructing a norm-referenced test is relatively easy as long as a nationally representative sample can be obtained and regularly updated. With such a sample, norms (e.g., percentiles and percentile ranks) can be computed and derived. There are, however, three major limitations associated with the norm-referenced evaluation framework. First, it is difficult to update norms regularly due to cost, time, and manpower constraints. As an example, the PCPFS’s norms were based on the 1985 National School Population Fitness Survey,¹⁶ and there have been no major national fitness studies in the U.S. since the 1980s (note: the other major national fitness studies in the 1980s included National Children and Youth Fitness Study I [NCYFS I], 1985; and NCYFS II, 1987).¹⁷,¹⁸ As a result, these outdated values likely do not reflect current norms (e.g., an 85th percentile from the 1980s may now be equivalent to the 95th percentile); rather, they show how current scores compare with the previous norms, which makes them inaccurate within their original evaluation framework.
The second related limitation of the norm-referenced evaluation framework is that the interpretation depends on the fitness of the reference population. The designations of average and above average have limited meaning if the majority of a population is unfit or unhealthy. The CDC obesity-evaluation criterion is a good example of this limitation. According to CDC’s current standard, a child is defined as overweight with a BMI at or above the age- and gender-specific 85th percentile, and obese if the child’s BMI is at or above the 95th percentile of their peers. The percentile is defined as the score value for a specific percentage of cases in a distribution of scores. If the CDC norm is current and true, it would define 15% of American children as overweight and 5% as obese. Clearly, this is not reflective of the childhood obesity epidemic that we hear about almost daily wherein one third (33%) of children and adolescents are identified as overweight or obese.¹⁹ The difference in prevalence estimates is explained by the fact that the CDC’s norms were derived from 1970s and 1980s data when American children were relatively healthy.²⁰ If the 85th/95th percentile standards based on today’s norms were used, a large percentage of overweight and obese children would be misclassified as having normal weight.
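As a rough illustration of this reference-population problem, the following sketch compares an 85th-percentile cut-off taken from an older reference sample with one recomputed on a heavier current sample. All BMI values are simulated and the distribution parameters are hypothetical (they are not CDC data); the helper names are illustrative only.

```python
import random

def percentile_cutoff(reference_scores, pct):
    """Score at the given percentile, interpolating linearly between ranks."""
    ordered = sorted(reference_scores)
    position = (len(ordered) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def share_at_or_above(scores, cutoff):
    """Fraction of scores at or above a cut-off."""
    return sum(1 for s in scores if s >= cutoff) / len(scores)

rng = random.Random(0)
# Hypothetical samples: an older reference population and a heavier current one.
reference_bmi = [rng.gauss(18.0, 2.5) for _ in range(10_000)]
current_bmi = [rng.gauss(19.5, 3.0) for _ in range(10_000)]

old_p85 = percentile_cutoff(reference_bmi, 85)
new_p85 = percentile_cutoff(current_bmi, 85)

# Judged against its own norms, a population always has about 15% at or
# above its 85th percentile, however heavy it has become.
share_vs_own_norms = share_at_or_above(current_bmi, new_p85)
# Judged against the older fixed norms, the heavier population shows a larger share.
share_vs_old_norms = share_at_or_above(current_bmi, old_p85)
```

Under these assumptions, updating the norms raises the cut-off and hides the shift in the population, which is the misclassification risk described above.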
The third limitation of the norm-referenced evaluation framework is that it tends to reward children and youth who are already fit while potentially discouraging those who are not fit. If rewards are based on achieving the 85th percentile (as with the Presidential Fitness Award in the President’s Challenge program), only highly fit youth may be motivated to try to achieve it. Less-fit youth may be less motivated because they know their chances of achieving the standard are low. If unfit students are less motivated during physical fitness testing, they may come to perceive physical education classes as a punishment/ordeal, rather than an enjoyable experience. Although other award systems are available in the President’s Challenge program for students with lower levels of fitness, these limitations can be better overcome by employing the CR evaluation framework.
The concept of CR evaluation and testing was introduced in the field of education in the 1960s by Glaser.²¹ However, real development and applications of CR assessment did not occur until the late 1970s and early 1980s.²²,²³ The field of physical education and fitness testing embraced the new concept²⁴ and started to apply it in assessment practice from the late 1980s.²⁵–²⁸ In contrast to the norm-referenced framework, in which a test-taker’s competency is judged relative to the performance of other students, the CR evaluation compares the test-taker’s performance with an absolute criterion. In educational assessment, the “absolute criterion behavior” could be whether a student has mastered the information taught in a specific subject or grade; in the context of youth health-related fitness, the interest could be whether a child meets a minimal needed physical fitness level based on a criterion. Thus, the norm-referenced evaluation can be considered a relative evaluation, whereas the CR evaluation is an absolute one.
Because the criterion behavior is defined independently from the behavior of others, it is not affected by changes in a population. Therefore, the limitation of population dependence in the norm-referenced evaluation will likely have no impact on the CR-based evaluation. Similarly, although there are always some students classified as below average, average, and above average in a norm-referenced evaluation framework, there is a possibility that all students will be classified as fit or not fit based on a criterion (i.e., it is possible for everyone to either meet or not meet the CR standards) in a CR evaluation framework. As a result, the limitation of needing a fit population in order for the evaluation to be useful in the norm-referenced evaluation is eliminated in the CR evaluation framework.
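The contrast between relative and absolute evaluation can be sketched in a few lines. The lap counts and the 20-lap criterion below are hypothetical values chosen only for illustration, and the function names are not taken from any fitness test battery.

```python
import statistics

def classify_norm_referenced(scores, pct=85):
    """Pass if a score is at or above the group's own percentile (higher is better)."""
    # quantiles(..., n=100) returns the 99 percentile cut points; index pct - 1 is the pct-th.
    cutoff = statistics.quantiles(scores, n=100, method="inclusive")[pct - 1]
    return [s >= cutoff for s in scores]

def classify_criterion_referenced(scores, criterion):
    """Pass if a score is at or above a fixed criterion, regardless of the group."""
    return [s >= criterion for s in scores]

# Hypothetical PACER lap counts for a fit class; the 20-lap criterion is illustrative.
laps = [22, 25, 31, 28, 40, 35, 23, 27, 30, 26]
norm_pass = classify_norm_referenced(laps)
cr_pass = classify_criterion_referenced(laps, criterion=20)
```

In this sketch every student meets the fixed criterion, whereas the percentile rule passes only the top few students no matter how fit the class is as a whole.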
Finally, since the focus is on the minimal needed fitness for a child, the evaluation standard established is often attainable by any child as long as an effort is made. Thus, the limitation of discouraging unfit participants associated with the norm-referenced approach is minimized in the CR evaluation approach. However, CR evaluation is not without its own challenges. Setting an appropriate standard, known as the cut-off score, is one of the most important challenges.
Methods Used in Setting
Criterion-Referenced Standards
The fundamental interest in setting a CR standard is to determine whether a test-taker is “good enough” on the construct being measured, which could be the test-taker’s reading comprehension, math problem-solving skill, or language proficiency. For health-related fitness testing, the key interest is in whether a test-taker is fit enough to be free of potential health risks. For children’s fitness testing, the interest could be further extended to represent whether a child is fit enough for the future (i.e., fit enough to likely grow up to be a healthy adult). Because the key interest and outcome of the CR test/evaluation is the classification (e.g., pass versus fail, fit versus not fit, or at-risk versus needs improvement versus in the healthy fitness zone [HFZ]), the accuracy of the classification is key.
Many methods have been developed to set performance standards or simply determine CR standards. In general, these methods can be classified as either test centered or examinee centered. In the test-centered methods, a panel of experts is asked to examine each item on a competency test and set the cut-off score accordingly. In the Angoff method,²⁹ for example, the panel is asked to examine each item and estimate the probability that the “minimally acceptable” person would answer each item correctly. The sum of these probabilities would then represent the minimally acceptable score.
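The arithmetic of the Angoff method is simple enough to sketch. The ratings below are hypothetical, and averaging the per-judge sums across the panel is an assumption made here for illustration; the text above specifies only that the item probabilities are summed.

```python
def angoff_cutoff(ratings):
    """Estimate a minimally acceptable score from panel ratings.

    ratings[j][i] is judge j's estimated probability that a minimally
    acceptable examinee answers item i correctly.
    """
    if not ratings:
        raise ValueError("at least one judge is required")
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Sum each judge's item probabilities, then average over judges (assumption).
    per_judge_sums = [sum(judge) for judge in ratings]
    return sum(per_judge_sums) / len(per_judge_sums)

# Hypothetical panel of three judges rating a four-item test.
panel = [
    [0.9, 0.6, 0.5, 0.8],
    [0.8, 0.7, 0.4, 0.7],
    [1.0, 0.5, 0.6, 0.9],
]
cutoff = angoff_cutoff(panel)
```

With these made-up ratings the per-judge sums are 2.8, 2.6, and 3.0, so the panel cut-off is about 2.8 items correct out of 4.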
In the examinee-centered methods, the focus is on identifying examinees with/without defined minimum competency, from which the cut-off score is established. Two procedures in this category are the borderline-group and the contrasting-groups procedures,³⁰ and the latter has been applied to setting CR standards for a number of motor-skill tests. The contrasting-groups method is based on evaluating the relative distributions of a trained and an untrained group on a specific test. Standards are set to try to minimize the number of false positives (passing the standard if untrained) while also minimizing the number of false negatives (not achieving the standard if trained). Meanwhile, the health outcome–centered method has been the predominant approach in setting CR standards for health-related fitness tests.
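One simple way to operationalize the contrasting-groups idea is to scan candidate cut-offs and keep the one with the fewest total misclassifications. The sketch below assumes that higher scores are better and that false positives and false negatives are weighted equally; both are assumptions made for illustration, and the scores are hypothetical.

```python
def contrasting_groups_cutoff(trained, untrained):
    """Return (cutoff, false_positives, false_negatives) minimizing FP + FN.

    A score at or above the cut-off counts as passing. When several
    cut-offs tie, the lowest one is returned.
    """
    best = None
    for candidate in sorted(set(trained) | set(untrained)):
        fp = sum(1 for s in untrained if s >= candidate)  # untrained but passing
        fn = sum(1 for s in trained if s < candidate)     # trained but failing
        if best is None or fp + fn < best[1] + best[2]:
            best = (candidate, fp, fn)
    return best

# Hypothetical skill-test scores for the two groups.
trained_scores = [12, 14, 15, 16, 18, 20]
untrained_scores = [6, 8, 9, 11, 13, 10]
cutoff, false_pos, false_neg = contrasting_groups_cutoff(trained_scores, untrained_scores)
```

Here a cut-off of 12 misclassifies a single untrained examinee. If one error type were judged more costly, the sum in the comparison could be replaced by a weighted sum.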
The key steps of the health outcome–centered method include:
● determine the components of health-related fitness, which often include cardiorespiratory fitness or aerobic capacity, body composition, and muscular fitness (i.e., muscular strength, endurance, and flexibility);
● select a criterion measure, as well as field tests, of the fitness component (e.g., VO2max as the criterion measure and 1-mile run/walk and Progressive Aerobic Cardiovascular Endurance Run [PACER] as the field tests for cardiorespiratory fitness);
● determine the relationships between the criterion measure/field tests and health-outcome measures, which could be mortality, an individual factor (e.g., if a person’s blood pressure is high), or a group of health-risk measures (e.g., if a person has metabolic syndrome);
● set the standards or cut-off scores according to the relationship determined (i.e., determine the point or level at which a fitness parameter is associated with an increased risk of a disease outcome or risk factors of the disease);
● validate or cross-validate using additional measures and samples.
The procedures used to set up the original CR standards for body composition in FITNESSGRAM® provide a good example of these steps. The original cut-off scores for body composition were based on the relationship between body fatness and cardiorespiratory disease risk factors, including blood pressure, total cholesterol, and serum lipoprotein ratios in children and adolescents³¹ (Going et al.³² in this supplement provide a detailed review of this procedure). The original cut-off scores for aerobic capacity were developed in a slightly different way by Cureton³³ in 1994. Based on an extensive literature review, morbidity and mortality in adults were chosen as the health outcomes. Because morbidity (caused mainly by unwanted pregnancy, substance abuse, physical/sexual abuse, and stress) and mortality (caused mainly by accidents, suicide, and homicide) in children and youth are not directly related to physical fitness, cut-off scores cannot be directly related to children’s morbidity and mortality data. Instead, Cureton²⁵,³³ derived the cut-off scores based on information about both adult morbidity and mortality and age-/growth-related changes in VO2max. The assumptions and decisions used in setting these standards have been supported by subsequent studies based on related health-risk factors in other children (Welk et al.³⁴ in this supplement provide additional discussion). However, as described in the preface to this supplement, several unresolved issues with the standards necessitated a re-evaluation.
Critical Issues and Challenges in Setting
Criterion-Referenced Standards
Although CR evaluation is able to address the shortcomings of the norm-referenced evaluation and fits the needs of health-related fitness assessment very well, it has its own issues and challenges, including the selection of health outcome measures, equivalence of cut-off scores across field tests, consequence of misclassification, and cross-group and cultural differences.
Selecting a Health-Outcome Measure
Although the theoretic relationships among physical activity, fitness, and health³⁵ and health-related fitness and health⁸ have been well described in the literature, limited information is available on which health outcome should be employed when validating health-related fitness assessments. Like fitness, health is a construct. In the past, it was simply defined as “freedom from physical disease or pain.” A more accepted definition of health now is the definition set by the WHO in 1948: “Health is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity.”³⁶
In theory, there are endless ways to measure health. A natural question then is: Which health measure/outcome should be used in validating health-related fitness? There is no absolutely correct answer to this question, and “select the most appropriate one” (i.e., select the most appropriate measure/outcome(s) based on the existing theoretic and empirical knowledge base and evidence) may be the best answer. As described in the previous text, the health outcomes in determining body composition standards included total cholesterol, serum lipoprotein ratios, and blood pressure,³¹ whereas morbidity and mortality were the measures when setting aerobic-capacity standards.³³
Another related question in selecting health outcome measures is: How many outcome measures should be selected? Again, there is no absolutely correct answer to this question, but the recommendation of the authors of this paper is to consider and examine all available outcome measures, although there is no need to use all of them when making the final decision. As described in this supplement, metabolic syndrome was selected as the most appropriate outcome measure for establishing new standards for both body fatness and aerobic capacity. Finally, another related selection question is which age group should be the focus: children, youth, adults, or older adults. As illustrated in both body composition and aerobic-capacity standard setting, the decision depends on the assessment of interest (i.e., to determine the current fitness status, to predict future fitness status, or both), along with the availability of other information. The authors’ recommendation, once again, is to try to use all available information and make a decision accordingly.
Equivalence of Cut-Off Scores
As with the health outcome measures, a number of field tests are often used simultaneously to measure the same construct. For example, the 1-mile run/walk, PACER, and 1-mile walk tests are used to measure aerobic capacity in FITNESSGRAM. Usually, when a new field test is developed, the cut-off scores are set based on a new, small-sample study or simply derived from the normative data or the existing literature by an expert panel.³⁷ Because of sample variations and other factors, the standard equivalencies among field tests are often not consistent. For example, Mahar et al.³⁸ reported that 34% of 4th- and 5th-grade girls who achieved PACER standards failed to pass the 1-mile run/walk standards (see also Beets and Pitetti³⁹). Although it is expected that there will be a difference in achievement levels among tests, such a large difference is not acceptable.
As another example, several field tests are frequently used to measure upper-body muscular strength: pull-ups, flexed arm hang, push-ups, modified pull-ups, and modified push-ups. The scoring formats range from the number of repetitions to time in seconds performing a test. According to a validity study of five such field tests,⁴⁰ only moderate correlations (r ranged from 0.50 to 0.70) were found among these tests. Therefore, classification systems developed for these tests will likely be inconsistent.
A simple solution for this inconsistency problem is to adopt a standardized single-test approach (i.e., use a single test for a fitness component). Although theoretically sound, this single-test approach is unlikely to be adopted in reality due to many historical (e.g., one country/area has already used a specific test for many years) and practical (e.g., limitations in space and facilities) reasons. Fortunately, this problem can be addressed by employing a new “primary test centered equating method,”⁴¹ described briefly in the following text (and in Boiarskaia et al.⁴² in this supplement).
Consequences of Misclassification
There will be misclassification when an assessment serves a classification role no matter how well the related cut-off score is set up. There are usually two kinds of misclassifications: false-positive classification (e.g., an unfit test-taker misclassified as fit in the context of fitness testing) and false-negative classification (a fit test-taker misclassified as unfit). As well described by Cureton and Warren,²⁵ the false-positive classification may be a more serious error in this case since the misclassified test-takers may get the wrong impression that they are fit enough already, and therefore not exercise at a desirable level and consequently fail to reduce, or even increase, their risk of disease. Although a call was made 20 years ago by Cureton and Warren²⁵ for more research to understand the consequences of these misclassifications, little progress has been made in this area.
Cross-Group and Culture Differences
Finally, whether a cut-off score should be set up differently for various subpopulations must be empirically examined and determined. Although age and gender have often been taken into consideration in setting cut-off scores, many other factors, such as ethnicity and disability, have not been considered. It is noted that to address cross-cultural differences, WHO developed and published an international BMI standard in 2006.⁴³ The WHO’s standard is norm-referenced, as is the CDC’s standard, and is therefore subject to the reference-population limitation discussed earlier. This is an area that needs more research.
New Measurement and Statistical
Methods and Applications
Some new measurement and statistical methods have been developed to facilitate establishment of standards. In particular, the use of test-equating procedures and approaches based on receiver operating characteristic (ROC) curves offers considerable potential for addressing some of the CR evaluation–related issues and challenges noted in the previous text.
Test Equating
Equating is a set of statistical procedures that puts two or more tests that measure the same construct in different ways onto the same scale so they can be directly compared.⁴⁴,⁴⁵ To address the issue of inconsistency in setting a standard for cross-test classification when measuring aerobic capacity, Zhu et al.⁴¹ proposed the primary test centered equating method. The primary field test refers to a field test whose validity related to the criterion test has been well documented (e.g., 1-mile run/walk for estimating VO2max and skinfold measurements for predicting body fat percentage). The key steps in the method for setting a standard for a new field test, whose validity has been confirmed by other studies, are as follows:
● select a validated field test (e.g., validity and reliability coefficients ≥0.80) as the primary field test;
● administer both the primary field test and new field test to a large sample (say n ≥ 200) from the targeted population using a counterbalanced order; make sure there is adequate rest time between tests to avoid carryover effect;
● set the new field test onto the scale of the primary field test using an equating procedure;
● use the cut-off scores already set for the primary test or set them based on the equivalent relationship developed.
Using aerobic assessment as an example, the primary field test is the 1-mile run/walk, and the “new” field test is the PACER. After the PACER is equated to the scale of the 1-mile run/walk, the equivalent 1-mile run/walk score can be used to estimate VO2max and determine HFZ classification using the cut-off score set for the 1-mile run/walk or VO2max. The concept of this new cut-off score setting method is illustrated in Figure 1. The method’s validity has been confirmed by Zhu et al.⁴¹ and further cross-validated in the study by Boiarskaia et al.⁴² reported in this supplement.
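The text does not specify which equating procedure is used, so the sketch below uses linear equating (matching standardized scores on the two tests) as one common choice; it should not be read as the procedure of Zhu et al. The paired scores and the 9.5-minute cut-off are hypothetical.

```python
import statistics

def make_linear_equating(new_scores, primary_scores, reverse=False):
    """Build a function that maps a new-test score onto the primary-test scale.

    Linear equating matches standardized (z) scores on the two tests.
    Set reverse=True when a high score on one test corresponds to a low
    score on the other (e.g., laps completed versus run time).
    """
    mean_new = statistics.mean(new_scores)
    sd_new = statistics.stdev(new_scores)
    mean_primary = statistics.mean(primary_scores)
    sd_primary = statistics.stdev(primary_scores)
    slope = sd_primary / sd_new
    if reverse:
        slope = -slope

    def to_primary_scale(score):
        return mean_primary + slope * (score - mean_new)

    return to_primary_scale

# Hypothetical data from one sample: PACER laps and 1-mile run/walk minutes.
pacer_laps = [10, 20, 30, 40, 50]
mile_minutes = [12.0, 11.0, 10.0, 9.0, 8.0]
pacer_to_mile = make_linear_equating(pacer_laps, mile_minutes, reverse=True)

# Hypothetical primary-test cut-off: finishing the mile in 9.5 minutes or less.
MILE_CUTOFF_MINUTES = 9.5
equated_time = pacer_to_mile(40)
meets_standard = equated_time <= MILE_CUTOFF_MINUTES
```

In this made-up sample, 40 PACER laps corresponds to a 9-minute mile on the primary scale, so the existing 1-mile run/walk cut-off can be applied to a student who only took the PACER.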
Receiver Operating Characteristic Curves
Many statistical procedures have been developed to evaluate accuracy and consistency of classifications. Percentage agreement and kappa statistics are among the most popular.⁴⁶ A contingency table can best illustrate these statistics (Figure 2). When determining the classification accuracy of a field test, the focus is on the agreement between the criterion measure, which is used to represent true classification status, and the field test. Cases classified positively by both the field test and the criterion measure are categorized as true positives (TP), whereas cases classified negatively by both tests are categorized as true negatives (TN). A false-negative (FN) error occurs when a field test erroneously indicates that a person does not achieve the standard on the criterion. Alternately, a false-positive (FP) error occurs when a field test incorrectly identifies a person as achieving the standard on the criterion.
Note that in the context of setting health-related fitness standards, one, or a set of, health measure(s)/outcome(s) is used as the criterion measure, and fitness tests as the field test. By analogy, the health measure/outcome can be classified as healthy (H) and unhealthy (U), and the fitness measure can be classified as fit (i.e., health-risk free [F]) and not fit (i.e., having some health risks [N]). Accordingly, HF (being classified both as healthy and fit) = TP, UN (unhealthy/not fit) = TN, HN (healthy/not fit) = FN, and UF (unhealthy/fit) = FP (Figure 3).
Two commonly used statistical indexes for classification accuracy are the Proportion of Agreement [P = (TP + TN)/(TP + TN + FP + FN)] and the kappa statistic, which removes the chance factor from P.⁴⁶ To determine the optimal cut-off score, one can vary the cut-off scores of the field test and calculate the corresponding agreement statistics, as well as FP and FN rates. The optimal cut-off score is the one with the highest agreement and fewest classification errors.
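Both indexes can be computed directly from the four cells of the contingency table. The sketch below assumes that the kappa statistic referred to is Cohen's kappa, in which chance agreement is estimated from the marginal totals; the cell counts are hypothetical.

```python
def agreement_stats(tp, tn, fp, fn):
    """Return (P, kappa) for a 2 x 2 field-test versus criterion table."""
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("the table is empty")
    p_observed = (tp + tn) / n
    # Chance agreement from the marginals: both positive plus both negative.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_chance == 1.0:
        return p_observed, 0.0  # kappa is undefined here; 0.0 is a placeholder
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return p_observed, kappa

# Hypothetical counts for one candidate cut-off score.
p, kappa = agreement_stats(tp=40, tn=35, fp=10, fn=15)
```

With these counts P is 0.75, but half of that agreement would be expected by chance, so kappa is 0.50. Repeating the calculation for each candidate cut-off gives the comparison described above.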
The development of ROC curves provides a graphical procedure that enables errors to be systematically evaluated across all possible scores.⁴⁷ The ROC curve displays the sensitivity (probability of correctly detecting TP results) and specificity (probability of correctly detecting TN results) of a particular field test for a range of cut-off points or thresholds. Ideally, a diagnostic cut-off point value should result in low FP and low FN rates across a reasonable range of cut-off values. The primary indicators of ROC analyses can be calculated from the contingency table in Figure 2:
● accuracy (i.e., P) = (TP + TN)/(TP + TN + FP + FN);
● sensitivity = TP/(TP + FN);
● specificity = TN/(FP + TN).
The unique value of ROC curves is that cut-off points
can be selected based on the relative importance of sensitivity
or specificity (i.e., the ROC approach makes it possible
to weigh the relative costs of one type of error over
another). Although ROC has been widely used in clinical
medicine and was introduced to kinesiology a few years
ago,48,49 it has not been widely employed in setting cut-off
scores in health-related fitness testing. Studies reported
by Laurson et al.50 and Welk et al.34 in this supplement
represent the first wave of ROC applications in this
area.
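One simple way to express the relative importance of the two error types is to attach a weight to each and choose the cut-off with the lowest weighted error rate. The sketch below illustrates that idea only; the weighting rule, the function name, and the candidate values are assumptions made for the example and are not the procedure used in the studies cited above.

```python
def pick_cutoff(points, fn_weight=1.0, fp_weight=1.0):
    """Return the cut-off with the lowest weighted error rate.

    points: iterable of (cutoff, sensitivity, specificity).
    fn_weight and fp_weight are hypothetical relative costs of
    false negatives and false positives.
    """
    def cost(point):
        _, sensitivity, specificity = point
        # 1 - sensitivity is the FN rate; 1 - specificity is the FP rate.
        return fn_weight * (1 - sensitivity) + fp_weight * (1 - specificity)

    return min(points, key=cost)[0]


# Hypothetical (cutoff, sensitivity, specificity) candidates.
candidates = [(30, 1.0, 0.75), (34, 0.9, 0.9), (38, 0.75, 1.0)]

print(pick_cutoff(candidates))                 # equal weights
print(pick_cutoff(candidates, fn_weight=3.0))  # missing a true case is costlier
print(pick_cutoff(candidates, fp_weight=3.0))  # a false alarm is costlier
```

With these invented values, the selected cut-off moves toward higher sensitivity when false negatives are weighted more heavily and toward higher specificity when false positives are.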
Remaining Issues and Future Research
Needs
There are still a number of unresolved issues in setting
cut-off scores in health-related fitness measurement and
evaluation, namely standards for muscular fitness
(strength, endurance, and flexibility), understanding CR-based
fitness growth assessment and evaluation, and related
matters of motivation.
Figure 1. Conceptual illustration of the primary field test
centered equating method for cut-off score setting
Note: Using aerobic assessment as an example, the criterion
measure is VO2max, and the primary field test is 1-mile run. Field
Tests A and B are PACER and 1-mile walk, respectively. After Tests A
and B are equated to the scale of the primary field test, the raw
testing scores of S1, who performed the PACER test, and S2, who
performed the 1-mile walk test, can be transferred onto the scale of
1-mile run and used to estimate their VO2max. Now their perfor-
mance can be evaluated and compared on the same scale.
Reprinted with permission from © American Alliance for Health,
Physical Education, Recreation and Dance.41
PACER, Progressive Aerobic Cardiovascular Endurance Run; S1,
Subject 1; S2, Subject 2
Figure 2. Contingency tables for classification accuracy and
errors in the context of criterion-referenced fitness testing
Figure 3. Contingency tables for classification accuracy
and errors in the context of health-related fitness testing
Zhu et al / Am J Prev Med 2011;41(4S2):S68–S76
October 2011
Standards for Muscular Fitness
The cut-off scores of aerobic capacity and body composition
have been well studied and established, as illustrated
in this supplement. The well-described relationship
between health measures and these two variables, as
well as available rich data and information, are perhaps
the reasons.3,8 In contrast, although the validity and reliability
of commonly used tests of muscular strength, endurance,
and flexibility are generally well supported,51
the relationships between these tests and health have not
been well established.
For instance, sit-up and sit-and-reach tests were included
in health-related fitness testing because they were
believed to be good indicators of lower-back health.12,52
Others, however, showed that there is little, if any, relationship
between physical fitness and lower-back pain, a
symptom of bad lower-back health.37,53,54 Plowman,37
based on a comprehensive review, stated almost 20 years
ago: “While items of trunk strength/endurance and
lower-back and hamstring flexibility can be marginally
accepted as predictor tests, what absolute values on these
tests might prove to be protective is a total unknown due
to the wide overlap of scores between those who eventually
had lower-back problems and those who did not,”
which is still true today. This is clearly an area requiring
more research.
Criterion Referenced–Based Fitness Growth
The focus on health-related fitness and CR evaluation has
been concurrent with the relationship between fitness
status and health, and little effort has been made to understand
criterion-related fitness growth in children.
When studying CR fitness growth, the focus shifts to
whether a test-taker is on track to being fit, known also as
growth to standard. There are several reasons for this
understudied research area. An assumption in youth fitness
testing is that fit children grow up to become fit
adults, but evidence to support this fit child = fit adult
hypothesis is limited. It is likely that fitness needs may
change along with normal growth and maturation
changes, and this needs confirmation. The application of
LMS (L = skewness, M = median, and S = coefficient of
variation) growth curves provides a way to model
growth-related changes over time, and new curves reported
in this supplement were developed specifically for
this purpose.50,55
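To show how the three LMS parameters are used in practice, the sketch below applies the standard LMS transformation, in which L acts as a Box-Cox power that adjusts for skewness, to convert a measurement into a z-score and back. The parameter values are hypothetical and are not taken from the curves reported in this supplement.

```python
import math


def lms_z_score(value, L, M, S):
    """Convert a measurement to a z-score given age-specific L, M, S."""
    if L == 0:
        # Limiting case of the Box-Cox power as L approaches 0.
        return math.log(value / M) / S
    return ((value / M) ** L - 1) / (L * S)


def lms_value_at_z(z, L, M, S):
    """Inverse transformation: the measurement at a given z-score.

    Requires 1 + L * S * z > 0 when L is not 0.
    """
    if L == 0:
        return M * math.exp(S * z)
    return M * (1 + L * S * z) ** (1 / L)


# Hypothetical parameters for one age-and-sex group.
L, M, S = -1.5, 20.0, 0.1

print(lms_z_score(M, L, M, S))       # a value at the median has z = 0
print(lms_value_at_z(1.0, L, M, S))  # measurement one SD above the median
```

Because L, M, and S vary smoothly with age, the same two functions can place a child's score on the curve at each testing occasion, which is what makes the approach suitable for tracking growth over time.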
Another consideration related to growth is that due to
many factors (e.g., parent’s education and SES, and local
preschool sport program availability), children enter
school at different fitness levels. Children’s improvement
over time (relative to their initial status) should be the
basis of education so these data can be used for the evaluation
of the effectiveness of a school, program, and
teacher. When linked with a predetermined evaluation
standard, this type of evaluation is referred to as criterion-related
growth, a critical part of standard-based assessments
and evaluations. The concepts of criterion-related
growth, value-added assessment, and modeling
are being introduced and used in educational research
and standard-based assessments.56–58 Physical education
and fitness researchers and practitioners need to catch up
with the progress already being made in these areas.
Standards and Students’ Motivation
It is generally believed that a norm-referenced evaluation
will discourage students whose fitness levels might be moderate
or low since only a small percentage of students will be
able to meet the standards under such an evaluation framework.
For example, fewer than 5% of students could actually
qualify for the President’s Challenge Award (i.e., scored at
the 85th percentile or higher for all five tests).59 In contrast,
it is believed that in a CR-evaluation framework, such as
FITNESSGRAM, children are encouraged to focus on their
own health status rather than their level compared with
others.59 As a result, students are able to enhance their motivation
and self-confidence.
A recent study provided some support for such beliefs:60
A majority of students studied (86%) believed fitness tests
enhanced their knowledge of the importance of being
healthy, and motivated them to be more physically active.
Meanwhile, according to the report from the latest Texas
Youth Fitness Study,61 many teachers still reported negative
experiences when using FITNESSGRAM, such as apathy/
unwillingness, self-consciousness, frustration, and teasing.
More studies are needed to understand the impacts, especially
long-term ones, of norm- and criterion-referenced
fitness testing on evaluations, and on subsequent behavior of
the youth evaluated.
Conclusion
In summary, two of the most notable changes in youth
fitness testing are the shift from performance-centered
assessment to health-related fitness testing, and from
norm-referenced evaluation to CR evaluation. Setting the
standards, or cut-off scores, is one of the most important
issues in the design of a CR test. Many methods have been
developed to set cut-off scores in CR tests, and the health
outcome–centered method is the most popular and effective
one for setting standards for health-related fitness
tests.
Critical issues related to this method include selecting
appropriate health outcomes, equivalence of cut-off
scores, consequences of misclassification, and cross-group
and cultural differences. Recent developments and
applications in statistical techniques, such as test equating
and ROC, have proven to be helpful in addressing some of
these issues. Several of these techniques were specifically
employed in the development of the new body composition
and aerobic-capacity standards for FITNESSGRAM. Although
progress has been made in these areas, many issues
remain, including the need for setting standards for muscular
components and determining CR-based fitness growth.
Publication of this article was supported by The Cooper Insti-
tute through a philanthropic gift from Lyda Hill.
No financial disclosures were reported by the authors of this
paper.
References
1. Morrow JR Jr., Zhu W, Franks BD, Meredith MD, Spain C. 1958–2008:
50 years of youth fitness tests in the U.S. Res Q Exerc Sport
2009;80(1):1–11.
2. Plowman SA, Sterling CL, Corbin CB, Meredith MD, Welk GJ, Mor-
row JR Jr. The history of FITNESSGRAM. J Phys Act Hlth
2006;3(S2):S5–20.
3. Safrit MJ. The validity and reliability of fitness tests for children: a
review. Pediatr Exerc Sci 1990;2(1):9–28.
4. Seefeldt V, Vogel P. Physical fitness testing of children: a 30-year
history of misguided efforts? Pediatr Exerc Sci 1989;1:295–302.
5. Kraus H, Hirschland RP. Muscular fitness and health. JOPERD
1953;24(10):17–9.
6. Kraus H, Hirschland RP. Minimum muscular fitness tests in school
children. Res Q 1954;25:178–88.
7. Kennedy JF. The soft American. Sports Illustrated 1960;Dec;13(26):
14–7.
8. Jackson AS. The evolution and validity of health-related fitness. Quest
2006;58(1):160–75.
9. Pate RR. The evolving definition of physical fitness. Quest 1988;40(3):
174–9.
10. Pate RR. A new definition of youth fitness. Phys Sports Med 1983;
11(4):77–83.
11. Cooper KH. Aerobics. New York NY: Bantam Books, 1968.
12. American Alliance for Health, Physical Education, Recreation and
Dance. Health related fitness test. Reston VA: American Alliance for
Health, Physical Education, Recreation and Dance, 1980.
13. Corbin CB, Pangrazi RP. FITNESSGRAM and ACTIVITYGRAM:
an introduction. In: Welk GJ, Meredith MD, eds. FITNESSGRAM/
ACTIVITYGRAM reference guide. Dallas TX: The Cooper Institute,
2008.
14. The President’s Council on Fitness, Sports & Nutrition. The President’s
Challenge. Choose a Challenge. Physical Fitness Test. Award Benchmarks.
www.presidentschallenge.org/challenge/physical/benchmarks.shtml.
15. Hoffman J. Norms for fitness, performance, and health. Champaign IL:
Human Kinetics, 2006.
16. Reiff G, Dixon W, Jacoby D, Ye GX, Spain C, Hunsicker P. The
President’s Council on Physical Fitness and Sports 1985: national
school population fitness survey. Washington DC: U.S. Government
Printing Office, 1986.
17. Ross JG, Gilbert GG. The National Children and Youth Fitness Study:
a summary of findings. JOPERD 1985;56(1):45–50.
18. Ross J, Pate R, Relpy L, Gold R, Svilar M. The National Children and
Youth Fitness Study II: new health-related fitness norms. JOPERD
1987;58(9):66–70.
19. Ogden CL, Carroll MD, Curtin LR, Lamb MM, Flegal KM. Prevalence
of high body mass index in U.S. children and adolescents, 2007–2008.
JAMA 2010;303(3):242–9.
20. Kuczmarski RJ, Ogden CL, Grummer-Strawn LM, et al. CDC growth
charts: U.S. Advance data from vital and health statistics, no. 314.
Hyattsville MD: National Center for Health Statistics, 2000.
21. Glaser R. Instructional technology and the measurement of learning
outcomes: some questions. Am Psychol 1963;18:519–21.
22. Popham WJ. Criterion referenced measurement. Englewood Cliffs NJ:
Prentice Hall, 1978.
23. Berk RA, ed. Criterion-referenced measurement: the state of the art.
Baltimore MD: Johns Hopkins University Press, 1980.
24. Safrit MJ, Baumgartner TA, Jackson AS, Stamm CL. Issues in setting
motor performance standards. Quest 1980;32(2):152–62.
25. Cureton KJ, Warren GL. Criterion-referenced standards for youth
health-related fitness tests: a tutorial. Res Q Exerc Sport 1990;61(1):
7–19.
26. Kalohn JC, Wagoner K, Gao LG, Safrit MJ, Getchell N. A comparison
of two criterion-referenced standard setting procedures for sports skills
testing. Res Q Exerc Sport 1992;63(1):1–10.
27. Safrit MJ. Criterion-referenced measurement: validity. In: Safrit MJ,
Wood TM, eds. Measurement concepts in physical education and
exercise science. 1st ed. Champaign IL: Human Kinetics, 1989.
28. Looney MA. Criterion-referenced measurement: reliability. In: Safrit
MJ, Wood TM, eds. Measurement concepts in physical education and
exercise science. 1st ed. Champaign IL: Human Kinetics, 1989.
29. Angoff WH. Scales, norms and equivalent scores. In: Thorndike RL, ed.
Educational measurement. 2nd ed. Washington DC: American Council
on Education, 1971.
30. Zieky MJ, Livingston SA. Manual for setting standards on the basic
skills assessment tests. Princeton NJ: Educational Testing Service,
1977.
31. Williams DP, Going SB, Lohman TG, et al. Body fatness and risk for
elevated blood pressure, total cholesterol and serum lipoprotein ratios
in children and adolescents. Am J Public Health 1992;82(3):358–63.
32. Going SB, Lohman TG, Cussler EC, Williams DP, Morrison JA, Horn
PS. Percent body fat and chronic disease risk factors in U.S. children
and youth. Am J Prev Med 2011;41(4S2):S77–86.
33. Cureton KJ. Aerobic capacity. In: Morrow JR Jr., Falls HB, Kohl HW
III, eds. The Prudential FITNESSGRAM technical reference manual.
Dallas TX: Cooper Institute for Aerobics Research, 1994.
34. Welk GJ, Laurson KR, Eisenmann JC, Cureton KJ. Development of
youth aerobic-capacity standards using receiver operating characteristic
curves. Am J Prev Med 2011;41(4S2):S111–6.
35. Bouchard C, Shephard RJ. Physical activity, fitness, and health: the
model and key concepts. In: Bouchard C, Shephard RJ, Stephens T, eds.
Physical activity, fitness, and health: international proceedings and
consensus statement. Champaign IL: Human Kinetics, 1994.
36. World Health Organization. Preamble to the Constitution of the WHO
as adopted by the International Health Conference, New York, 19–22
June 1946, and entered into force on 7 April 1948.
37. Plowman SA. Criterion referenced standards for neuromuscular physical
fitness tests: an analysis. Pediatr Exerc Sci 1992;4(1):10–9.
38. Mahar MT, Rowe DA, Parker CR, Mahar FJ, Dawson DM, Holt JE.
Criterion-referenced and norm-referenced agreement between the mile
run/walk and PACER. Meas Phys Educ Exerc Sci 1997;1(4):245–58.
39. Beets MW, Pitetti KH. Criterion-referenced reliability and equivalency
between the PACER and 1-mile run/walk for high school students. J
Phys Act Health 2006;3(S May):S17–29.
40. Pate RR, Burgess ML, Woods JA, Ross JG, Baumgartner T. Validity of
field tests of upper body muscular strength. Res Q Exerc Sport
1993;64(1):17–24.
41. Zhu W, Plowman SA, Park Y. A primer-test centered equating method
for cut-off score setting. Res Q Exerc Sport 2010;81(4):400–9.
42. Boiarskaia EA, Boscolo MS, Zhu W, Mahar MT. Cross-validation of an
equating method linking aerobic FITNESSGRAM® field tests. Am J
Prev Med 2011;41(4S2):S124–30.
43. WHO Multicentre Growth Reference Study Group. WHO Child
Growth Standards: length/height-for-age, weight-for-age, weight-for-
length, weight-for-height and body mass index-for-age: methods and
development. Geneva, Switzerland: World Health Organization, 2006.
44. Zhu W. Test equating: what, why, how? Res Q Exerc Sport 1998;
69(1):11–23.
45. Zhu W. Scales, norms, and score comparability. In: Wood T, Zhu W,
eds. Measurement theory and practice in kinesiology. Champaign IL:
Human Kinetics, 2006.
46. Safrit MJ, Wood TM. Introduction to measurement in physical educa-
tion and exercise science. St. Louis MO: Mosby, 1995.
47. Zweig MH, Campbell G. Receiver-operating characteristic (ROC)
plots: a fundamental evaluation tool in clinical medicine. Clin Chem
1993;39(4):561–77.
48. Looney MA. Measurement issues in the clinical setting. In: Wood T,
Zhu W, eds. Measurement theory and practice in kinesiology. Cham-
paign IL: Human Kinetics, 2006.
49. Jackson AS. Preemployment physical testing. In: Wood T, Zhu W, eds.
Measurement theory and practice in kinesiology. Champaign IL:
Human Kinetics, 2006.
50. Laurson KR, Eisenmann JC, Welk GJ. Body fat percentile curves
for U.S. children and adolescents. Am J Prev Med 2011;41(4S2):
S87–92.
51. Plowman SA. Muscular strength, endurance and flexibility
assessments. In: Welk GJ, Meredith MD, eds. FITNESSGRAM/
ACTIVITYGRAM reference guide. Dallas TX: The Cooper Insti-
tute, 2008.
52. Payne N, Gledhill N, Katzmarzyk PT, Jamnik V. Health-related fitness,
physical activity, and history of back pain. Can J Appl Physiol
2000;25(4):236–49.
53. Jackson AW, Morrow JR Jr., Brill PA, Kohl HW, Gordon NF, Blair SN.
Relations of sit-up and sit-and-reach tests to low back pain in adults.
J Orthop Sports Phys Ther 1998;27(1):22–6.
54. Nachemson AL. Exercise, fitness, and back pain. In: Bouchard C,
Shephard RJ, Stephens T, Sutton JR, McPherson BD, eds. Exercise,
fitness, and health: a consensus of current knowledge. Champaign IL:
Human Kinetics, 1990.
55. Eisenmann JC, Laurson KR, Welk GJ. Aerobic fitness percentiles for
U.S. adolescents. Am J Prev Med 2011;41(4S2):S106–10.
56. Amrein-Beardsley A. Methodological concerns about the education
value-added assessment system. Educ Res 2008;37(2):65–75.
57. Braun HI. Using student progress to evaluate teachers: a primer on
value-added models. Princeton NJ: Educational Testing Service,
2005.
58. Braun H, Chudowsky N, Koenig J, eds. Getting value out of value-
added: report of a workshop. Washington DC: National Academy of
Science, 2010.
59. Koebel CI, Swank AM, Shelburne L. Fitness testing in children: a
comparison between PCPFS and AAHPERD standards. J Strength
Cond Res 1992;6(2):107–14.
60. Sampson BB. Children’s perceptions of the FITNESSGRAM fitness
test [Master’s thesis]. Salt Lake City UT: Brigham Young University,
2008.
61. Zhu W, Welk G, Meredith M, Boiarskaia E. A survey of Texas schools’
physical education programs and policies. Res Q Exerc Sport 2010;
81(S3):S42–52.
www.ajpmonline.org